How to join two CSV files?

Question

Suppose you have one CSV file with 2 fields: ID and email. You have another file with 2 fields: email and name. How can you produce a file with all three fields joined on email?

A little more detail on the join (i.e., inner, outer, left). Is the email list on the 1st CSV identical to the second list? Or does one contain more? — hyperslug, Commented Aug 20, 2009 at 23:29
Examples of the CSV files would be handy too, along with the OS you are using. — Troggy, Commented Aug 20, 2009 at 23:35
I think the 1st and 2nd lists are identical. I am using Linux. Please help!!! Thanks!! :) — crst53, Commented Aug 21, 2009 at 0:00

hyperslug · Accepted Answer · 2009-08-21 01:01:23Z

36

Revision 3:

You must sort both lists on email alphabetically, then join. Given that the email field is the 2nd field of file1 and the 1st field of file2:

sort -t , -k 2,2 file1.csv > sort1.csv
sort -t , -k 1,1 file2.csv > sort2.csv
join -t , -1 2 -2 1 sort1.csv sort2.csv > sort3.csv

parameter meaning

-t ,   : ',' is the field separator
-k 2,2 : character sort on 2nd field
-k 1,1 : character sort on 1st field
-1 2   : file 1, 2nd field
-2 1   : file 2, 1st field
>      : output to file

produces

email,ID,name
email,ID,name
...

sorted by email alphabetically.

Note that if any email is missing from either file it will be omitted from the results.
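
To make the note above concrete, here is a minimal sketch that runs the same three commands on made-up data. The file names (demo1.csv, demo2.csv, and so on) and the rows are assumptions for illustration only, and the files have no header row. LC_ALL=C is set so that sort and join agree on the ordering, something the commands above leave to the current locale.

```shell
# Hypothetical sample data (not from the question): ID,email and email,name.
export LC_ALL=C   # make sort and join use the same byte-wise ordering

printf '%s\n' '1,carol@example.com' '2,alice@example.com' '3,bob@example.com' > demo1.csv
printf '%s\n' 'bob@example.com,Bob' 'alice@example.com,Alice' 'dave@example.com,Dave' > demo2.csv

# Sort each file on its email field, then join on that field.
sort -t , -k 2,2 demo1.csv > demo_sort1.csv
sort -t , -k 1,1 demo2.csv > demo_sort2.csv
join -t , -1 2 -2 1 demo_sort1.csv demo_sort2.csv > demo_joined.csv

cat demo_joined.csv
```

With this data, demo_joined.csv holds only the alice and bob rows. carol appears only in the first file and dave only in the second, so both are dropped.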

edited Aug 21, 2009 at 1:01

answered Aug 21, 2009 at 0:11



CSV is more complicated than this. The field separator can be escaped for example.
– pguardiario
Commented Dec 11, 2016 at 0:45
@hyperslug can i do full outer join?
– Abu Shoeb
Commented Mar 15, 2018 at 6:29
This won't work if the CSV is mixed quoted/unquoted, or if the ID contains a comma. Use this solution only for one-time processing where you check the result. I recommend not using it in a production-level script.
– Ondra Žižka
Commented Nov 10, 2018 at 12:51


Tgr · Accepted Answer · 2014-12-12 00:29:59Z

42

Use csvkit:

csvjoin -c email id_email.csv email_name.csv

or

csvjoin -c 2,1 id_email.csv email_name.csv

answered Dec 12, 2014 at 0:29


Why isn't this the top answer?
– alexg
Commented Oct 28, 2015 at 9:14
Awesome tool. It even recognized that one of my files uses a delimiter other than ",".
– D_K
Commented Nov 22, 2018 at 13:32
Thanks for letting me know this exists!
– podperson
Commented Nov 27, 2020 at 9:09


Peter Mortensen · Accepted Answer · 2009-12-13 04:51:31Z

6

Perhaps it is overkill, but you could import both files into a database (e.g. OpenOffice Base) as two tables and define a report that produces the desired output.

If the CSV import is a problem, then a spreadsheet program (e.g. OpenOffice Calc) can do the import. The result can then easily be transferred to the database.

edited Dec 13, 2009 at 4:51

user1931

answered Aug 20, 2009 at 23:36


jim in austin · Accepted Answer · 2009-08-21 15:02:46Z

3

For future reference, you might want to start playing around with AWK. It's a very simple little scripting language that exists in some form on every *nix system, and its sole mission in life is the manipulation of standard delimited textual databases. With a few lines of throwaway script you can do some very useful things. The language is small and elegant and has a better utility/complexity ratio than anything else I am aware of.
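
As an illustration of such a throwaway script, the sketch below joins the two files from the question on email with a single awk command. The file names and sample rows are made up. It assumes plain comma-separated data with no header row, no quoted fields, and at most one name per email in the second file.

```shell
# Hypothetical sample data: ID,email and email,name, no header rows.
printf '%s\n' '1,carol@example.com' '2,alice@example.com' '3,bob@example.com' > awk_ids.csv
printf '%s\n' 'bob@example.com,Bob' 'alice@example.com,Alice' > awk_names.csv

# While reading the first file (NR == FNR), remember each name by email.
# While reading the second file, print email,ID,name for emails already seen.
awk -F, '
    NR == FNR    { name[$1] = $2; next }
    ($2 in name) { print $2 "," $1 "," name[$2] }
' awk_names.csv awk_ids.csv > awk_joined.csv

cat awk_joined.csv
```

Because the lookup table is held in memory, neither file needs to be sorted first. Rows whose email is missing from awk_names.csv (carol here) are skipped, so this is an inner join.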

answered Aug 21, 2009 at 15:02


Perl is in many ways a successor of awk.
– reinierpost
Commented Sep 14, 2010 at 11:33
awk doesn't handle quoting and escaping (e.g. dealing with ,s in a ,-separated CSV file) as far as I know. If you need that, using a dedicated CSV handling library is easier; they exist for many languages.
– reinierpost
Commented Sep 14, 2010 at 11:34


chrislusf · Accepted Answer · 2016-10-22 17:46:34Z

Use Go: https://github.com/chrislusf/gleam

package main

import (
    "flag"
    "os"

    "github.com/chrislusf/gleam"
    "github.com/chrislusf/gleam/source/csv"
)

var (
    aFile = flag.String("a", "a.csv", "first csv file with 2 fields, the first one being the key")
    bFile = flag.String("b", "b.csv", "second csv file with 2 fields, the first one being the key")
)

func main() {

    flag.Parse()

    f := gleam.New()
    a := f.Input(csv.New(*aFile))
    b := f.Input(csv.New(*bFile))

    a.Join(b).Fprintf(os.Stdout, "%s,%s,%s\n").Run()

}

Ondra Žižka · Accepted Answer · 2018-11-10 12:47:42Z

Try CSV Cruncher.

It takes CSV files as SQL tables and then allows SQL queries, resulting in another CSV or JSON file.

For your case, you would just call:

crunch -in tableA.csv tableB.csv -out output.csv \
   "SELECT tableA.id, tableA.email, tableB.name 
    FROM tableA LEFT JOIN tableB USING (email)"

The tool needs Java 8 or later.

Some of the advantages:

You really get CSV support, not just "let's assume the data is correct".
You can join on multiple keys.
Easier to use and understand than join-based solutions.
You can combine more than 2 CSV files.
You can join by SQL expressions - the values don't have to be the same.

Disclaimer: I wrote that tool. It used to be in disarray after Google Code was closed, but I revived it and added new features as I use it.

Ondra Žižka · Accepted Answer · 2018-11-10 14:03:36Z

0

You could open the CSV files in a spreadsheet program like LibreOffice and use the VLOOKUP() function to look up each name in the second file.

edited Nov 10, 2018 at 14:03

Ondra Žižka


answered Dec 6, 2010 at 12:45

Janek


File extension xlsx implies Microsoft Excel and I think VLOOKUP does as well. This question is tagged with Linux. Is Microsoft Excel available for Linux?
– Peter Mortensen
Commented Mar 11, 2011 at 22:26
Now LibreOffice has VLOOKUP too.
– Cristian Ciupitu
Commented Jun 3, 2014 at 19:20


baltakatei · Accepted Answer · 2020-02-21 01:56:14Z

In Bash 5.0.3 with GNU Coreutils 8.30 and building off of hyperslug's answer:

If you have unsorted CSV files with duplicate lines and don't want to omit data due to a missing field in a line of either file1.csv or file2.csv, then you can do the following:

Sort file 1 by field 2 and sort file 2 by field 1:

( head -n1 file1.csv && tail -n+2 file1.csv | sort -t, -k2,2 ) > sort1.csv
( head -n1 file2.csv && tail -n+2 file2.csv | sort -t, -k1,1 ) > sort2.csv

Expanding on hyperslug's parameters:

-k 2,2     : character sort starting and stopping on 2nd field
-k 1,1     : character sort starting and stopping on 1st field
head -n1   : read first line
tail -n+2  : read all but first line
(  )       : subshell
>          : output to file

I had to do head and tail within the subshell ( ) in order to preserve the first header line of the CSV file when sorting by a given field.

Then,

join -t , -a1 -a2 -1 2 -2 1 -o auto sort1.csv sort2.csv > sort3.csv

Expanding on hyperslug's parameters:

-t ,    : ',' is the field separator
-a1     : Do not omit lines from file 1 if no match in file 2 found
-a2     : Do not omit lines from file 2 if no match in file 1 found.
-1 2    : file 1, 2nd field
-2 1    : file 2, 1st field
-o auto : Auto format: includes extra commas indicating unmatched fields
>       : output to file

Here is an example file1.csv, file2.csv, and the resulting sort3.csv:

file1.csv:

ID,email
02,[email protected]
03,[email protected]
05,[email protected]
07,[email protected]
11,[email protected]

file2.csv:

email,name
[email protected],Timothy Brown
[email protected],Robert Green
[email protected],Raul Vasquez
[email protected],Carol Lindsey

sort3.csv:

email,ID,name
[email protected],02,Robert Green
[email protected],,Carol Lindsey
[email protected],03,
[email protected],07,Raul Vasquez
[email protected],05,
[email protected],,Timothy Brown
[email protected],11,

You can see Timothy Brown and Carol Lindsey lack IDs but are still included in the joined CSV file (with their names and emails in the correct fields).
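
GNU join also has a --header option, which passes the first line of each file through without trying to pair it, so the result does not depend on how the header text compares with the sorted data. The following is only a sketch: it assumes GNU coreutils, and the file names and rows are made up rather than taken from the example above.

```shell
# Hypothetical sample data with header rows.
export LC_ALL=C   # make sort and join use the same byte-wise ordering

printf '%s\n' 'ID,email' '1,carol@example.com' '2,alice@example.com' > hdr1.csv
printf '%s\n' 'email,name' 'dave@example.com,Dave' 'alice@example.com,Alice' > hdr2.csv

# Keep the header line on top and sort only the data lines.
( head -n1 hdr1.csv && tail -n+2 hdr1.csv | sort -t, -k2,2 ) > hdr_sort1.csv
( head -n1 hdr2.csv && tail -n+2 hdr2.csv | sort -t, -k1,1 ) > hdr_sort2.csv

# Full outer join; --header treats line 1 of each file as column names.
join --header -t , -a1 -a2 -1 2 -2 1 -o auto hdr_sort1.csv hdr_sort2.csv > hdr_outer.csv

cat hdr_outer.csv
```

Here carol has no name and dave has no ID, so their rows come out as `carol@example.com,1,` and `dave@example.com,,Dave`, with the empty field marking the missing value.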

liket · Accepted Answer · 2017-11-22 20:40:01Z

-1

You could also use a tool specifically designed for joining csv files, such as the one found on https://filerefinery.com

The operations we currently support are: Joining csv files. It is possible to perform the SQL equivalent of outer, inner, left and right join operations on two csv files. Which column will be used as a join key in each of the files is configurable.

edited Nov 22, 2017 at 20:40

answered Nov 20, 2017 at 20:04


Please quote the essential parts of the answer from the reference link(s), as the answer can become invalid if the linked page(s) change.
– DavidPostill ♦
Commented Nov 20, 2017 at 20:25
No longer exists.
– Ondra Žižka
Commented Nov 10, 2018 at 12:34

