
I want to merge a couple of CSV files. I'm using the following command to do that:

paste -d , *.csv > final.txt

This has worked for me in the past, but this time it doesn't: it places the data side by side instead of one file below the other. For instance, two files that contain records in the following format

CreatedAt   ID
Mon Jul 07 20:43:47 +0000 2014  4.86249E+17
Mon Jul 07 19:58:29 +0000 2014  4.86238E+17
Mon Jul 07 19:42:33 +0000 2014  4.86234E+17

When merged give

CreatedAt   ID CreatedAt    ID
Mon Jul 07 20:43:47 +0000 2014  4.86249E+17 Mon Jul 07 18:25:53 +0000 2014  4.86215E+17
Mon Jul 07 19:58:29 +0000 2014  4.86238E+17 Mon Jul 07 17:19:18 +0000 2014  4.86198E+17
Mon Jul 07 19:42:33 +0000 2014  4.86234E+17 Mon Jul 07 15:45:13 +0000 2014  4.86174E+17
                                            Mon Jul 07 15:34:13 +0000 2014  4.86176E+17

Does anyone know the reason for this, or what I can do to force the records to be merged below each other?

  • It seems that one of your .csv files has more lines than the other. I'm not sure where the spaces come from; the paste command uses "," to separate the entries.
    – AKS
    Commented Jul 8, 2014 at 21:51
  • Do you mean that you ran cat file*.csv > final.csv? That would give you records "below each other". Good luck.
    – shellter
    Commented Jul 8, 2014 at 22:07
  • What is the purpose of -d ,?
    – Cyrus
    Commented Jul 8, 2014 at 22:17
  • What should the result look like? Do you mean join?
    – Cyrus
    Commented Jul 8, 2014 at 22:32
  • @ArunSangal: Yes, but the line count shouldn't matter for a join, should it? @Cyrus: Yes, I mean join. The purpose of -d , was to separate the fields with a comma. Also, the answer below worked. Commented Jul 9, 2014 at 9:38

5 Answers


Assuming that all the CSV files have the same format and all start with the same header, you can write a small script like the following to append all the files into one while keeping the header only once.

#!/bin/bash
OutFileName="./X.csv"                     # Output name; the "./" prefix matches the glob below
i=0                                       # Counts the input files processed so far
for filename in ./*.csv; do
  if [ "$filename" != "$OutFileName" ]; then      # Skip the output file itself
    if [ "$i" -eq 0 ]; then
      head -n 1 "$filename" > "$OutFileName"      # Copy the header from the first file only
    fi
    tail -n +2 "$filename" >> "$OutFileName"      # Append everything from line 2 onward
    i=$(( i + 1 ))                                # Increase the counter
  fi
done

Notes:

  • The head -1 or head -n 1 command prints the first line of a file (the head).
  • The tail -n +2 command prints the tail of a file starting from line number 2 (+2).
  • The test [ ... ] is used to exclude the output file from the input list.
  • The output file is overwritten each time the script runs.
  • The command cat a.csv b.csv > X.csv can simply be used to append a.csv and b.csv into a single file (but the header is then copied twice).
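
To make the head/tail split from the notes above concrete, here is a minimal sketch. The file name sample.csv and its contents are made up for illustration only; they are not from the question.

```shell
#!/bin/bash
# Hypothetical sample file, created in a throwaway directory.
tmpdir=$(mktemp -d)
printf 'CreatedAt,ID\nrow1,1\nrow2,2\n' > "$tmpdir/sample.csv"

header=$(head -n 1 "$tmpdir/sample.csv")   # first line only
body=$(tail -n +2 "$tmpdir/sample.csv")    # everything from line 2 onward

printf 'header: %s\n' "$header"
printf 'body:\n%s\n' "$body"

rm -rf "$tmpdir"                           # clean up the throwaway directory
```

The header variable holds only the column names, and body holds the two data rows, which is exactly the split the script relies on.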

The paste command joins files side by side: line N of each input file becomes part of line N of the output. If one of the files has extra lines (even blank or whitespace-only ones), you can get the output that you reported above.
The -d , option tells paste to separate the pasted columns with a comma, but the files you showed above do not appear to be comma-separated.

The cat command, by contrast, concatenates files and prints them to standard output, which means it writes one file after the other.
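
A small side-by-side comparison may help here. This is only a sketch with two made-up files (a.csv with two data rows, b.csv with one); the names and contents are assumptions for illustration.

```shell
#!/bin/bash
# Two hypothetical input files of different length.
tmpdir=$(mktemp -d)
printf 'CreatedAt,ID\nMon,1\nTue,2\n' > "$tmpdir/a.csv"
printf 'CreatedAt,ID\nWed,3\n'        > "$tmpdir/b.csv"

# paste: line N of a.csv and line N of b.csv share one output line.
side_by_side=$(paste -d , "$tmpdir/a.csv" "$tmpdir/b.csv")

# cat: all of a.csv, followed by all of b.csv (header repeated).
stacked=$(cat "$tmpdir/a.csv" "$tmpdir/b.csv")

printf 'paste:\n%s\n\ncat:\n%s\n' "$side_by_side" "$stacked"

rm -rf "$tmpdir"
```

With these inputs paste yields three lines, the last one ending in an empty field because b.csv ran out of lines, while cat yields five lines including the second header.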

Refer to man head or man tail for the exact option syntax (some versions accept head -1, others require head -n 1).

  • I see now what he meant. Btw, you can put the increment of the "i" variable inside the if statement rather than directly in the loop body.
    – AKS
    Commented Jul 9, 2014 at 19:57
  • @ArunSangal You're right. My mistake, I copied an old version. If the increment is outside the if block and the output file is the first in the list, you will never get the header in the output file.
    – Hastur
    Commented Jul 10, 2014 at 0:10
  • Noticed a small corner case issue: it breaks if filenames contain spaces. Can be fixed with adding some quotes: "$filename".
    – Jonik
    Commented Jan 17, 2019 at 10:13
  • @Jonik Right and proper, thanks; fixed. It is devious to peek around a corner ... As you do, you risk to spot another one: better to put " even to $OutFileName ;-)
    – Hastur
    Commented Jan 21, 2019 at 21:11

A simpler alternative: save this as combine_csv.sh:

#!/bin/bash
{ head -n 1 "$1" && tail -q -n +2 "$@"; }

can be used like this:

pattern="my*filenames*.csv"
combine_csv.sh ${pattern} > result.csv
  • Nice, & should be && though
    – user239558
    Commented Sep 4, 2021 at 6:40

Thank you so much @wahwahwah. I used your script to make a nautilus-action, but it works correctly only with these changes:

#!/bin/bash

for last; do true; done

OutFileName="$last/RESULT_$(date +"%d-%m-%Y").csv"                    # Fix the output name

i=0                                       # Reset a counter
for filename in "$last/"*".csv"; do

 if [ "$filename" != "$OutFileName" ] ;      # Avoid recursion 
 then 
   if [[ $i -eq 0 ]] ; then 
      head -1  "$filename" > "$OutFileName" # Copy header if it is the first file
   fi
   tail -n +2  "$filename" >> "$OutFileName" # Append from the 2nd line each file
   i=$(( $i + 1 ))                        # Increase the counter
 fi
done
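
The for last; do true; done line at the top of the script above may look odd: a for loop without an "in" list iterates over the positional parameters, so after the loop the variable holds the last argument. A minimal sketch of the idiom, wrapped in a helper whose name (last_arg) is made up for illustration:

```shell
#!/bin/bash
# Hypothetical helper: prints the last argument it receives.
last_arg() {
  local last
  for last; do :; done        # "for NAME" without "in" loops over "$@"
  printf '%s\n' "$last"
}

last_arg one two "three four"  # prints: three four
```

In the script above, this is how the directory passed as the final argument ends up in $last.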

Here's how I concatenate CSV files that have the same columns:

(head -qn 1 *.csv | head -n 1; tail -qn +2 *.csv) >combined.csv

Save time by calling head on any of the files specifically:

(head -n 1 first.csv; tail -qn +2 *.csv) >combined.csv

No scripting or funky awk necessary!
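
One thing to watch with the one-liners above: an output file that lives in the same directory and matches *.csv ends up in its own input list. A hedged sketch of one way around that is to skip the output path and build the result in a temporary file first. The function name merge_csv and its argument order are my own assumptions, not part of any tool.

```shell
#!/bin/bash
# Hypothetical helper: merge_csv OUTPUT INPUT...
# Keeps the header of the first input only and skips OUTPUT if it
# shows up among the inputs (e.g. because *.csv matched it).
merge_csv() {
  local out=$1 tmp f first=1 status
  shift
  tmp=$(mktemp) || return 1
  for f in "$@"; do
    [ "$f" = "$out" ] && continue              # never read the output file
    if [ "$first" -eq 1 ]; then
      head -n 1 "$f"                           # header, once
      first=0
    fi
    tail -n +2 "$f"                            # data rows of every input
  done > "$tmp" && cat "$tmp" > "$out"         # write the result in one go
  status=$?
  rm -f "$tmp"
  return "$status"
}
```

It could be called as merge_csv combined.csv *.csv. Note that with no usable inputs this sketch simply produces an empty output file.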


Give joinem a try, available via PyPI: python3 -m pip install joinem.

joinem provides a CLI for fast, flexible concatenation of tabular data using polars. I/O is lazily streamed in order to give good performance when working with numerous, large files.

Example Usage

Pass input files via stdin and output file as an argument.

ls -1 path/to/*.csv | python3 -m joinem out.parquet

You can add the --progress flag to show progress.

Further Information

joinem is also compatible with parquet, JSON, and feather file types. See the project's README for more usage examples and a full command-line interface API listing.

disclosure: I am the library author of joinem.
