35

Suppose you have one CSV file with 2 fields: ID and email. You have another file with 2 fields: email and name. How can you produce a file with all three fields joined on email?

4
  • 5
    A little more detail on the join (i.e., inner, outer, left). Is the email list on the 1st CSV identical to the second list? Or does one contain more?
    – hyperslug
    Commented Aug 20, 2009 at 23:29
  • Examples of the csv files would be handy to, along with the OS you are using?
    – Troggy
    Commented Aug 20, 2009 at 23:35
  • i think 1st and 2nd list are identical. I am using Linux. Please help!!! thanks!! :)
    – crst53
    Commented Aug 21, 2009 at 0:00
  • 1
    how large is the data?
    – Joshua
    Commented Aug 21, 2009 at 0:13

9 Answers 9

36

Revision3:

You must sort both lists on email alphabetically, then join. Given that the email field the 2nd field of file1 and the 1st field of file2:

sort -t , -k 2,2 file1.csv > sort1.csv
sort -t , -k 1,1 file2.csv > sort2.csv
join -t , -1 2 -2 1 sort1.csv sort2.csv > sort3.csv

parameter meaning

-t ,   : ',' is the field separator
-k 2,2 : character sort on 2nd field
-k 1,1 : character sort on 1st field
-1 2   : file 1, 2nd field
-2 1   : file 2, 1st field
>      : output to file

produces

email,ID,name
email,ID,name
...

sorted by email alphabetically.

Note that if any email is missing from either file it will be omitted from the results.

3
  • 3
    CSV is more complicated than this. The field separator can be escaped for example. Commented Dec 11, 2016 at 0:45
  • @hyperslug can i do full outer join?
    – Abu Shoeb
    Commented Mar 15, 2018 at 6:29
  • This won't work if the CSV is mixed quoted/unqoted, if the ID contains a comma. Use this solution only for one-time processing where you check the result. But I recommend not using it for a production-level script. Commented Nov 10, 2018 at 12:51
42

Use csvkit:

csvjoin -c email id_email.csv email_name.csv

or

csvjoin -c 2,1 id_email.csv email_name.csv
3
  • 11
    Why isn't this the top answer?
    – alexg
    Commented Oct 28, 2015 at 9:14
  • 1
    awesome tool. Even recognized, that one of my files has a different than "," delimiter.
    – D_K
    Commented Nov 22, 2018 at 13:32
  • 1
    Thanks for letting me know this exists!
    – podperson
    Commented Nov 27, 2020 at 9:09
6

Perhaps it is overkill, but you could import into a database (e.g. OpenOffice Base) as two kinds of tables and define a report that is the desired output.

If the CSV import is a problem, then a spreadsheet program (e.g. OpenOffice Calc) can do the import. The result can then easily be transferred to the database.

3

As a future reference you might want to start playing around with AWK. It's a very simple little scripting language that exists in some form on every *nix system and its sole mission is life is the manipulation of standard delimited textual databases. With a few lines of throwaway script you can do some very useful things. The language is small and elegant and has a better utility/complexity ratio than anything else I am aware of.

2
  • Perl is in many ways a successor of awk. Commented Sep 14, 2010 at 11:33
  • awk doesn't handle quoting and escaping (e.g. dealing with ,s in a ,-separated CSV file) as far as I know. If you need that, using a dedicated CSV handling library is easier; they exist for many languages. Commented Sep 14, 2010 at 11:34
0

Use Go: https://github.com/chrislusf/gleam

package main

import (
    "flag"
    "os"

    "github.com/chrislusf/gleam"
    "github.com/chrislusf/gleam/source/csv"
)

var (
    aFile = flag.String("a", "a.csv", "first csv file with 2 fields, the first one being the key")
    bFile = flag.String("b", "b.csv", "second csv file with 2 fields, the first one being the key")
)

func main() {

    flag.Parse()

    f := gleam.New()
    a := f.Input(csv.New(*aFile))
    b := f.Input(csv.New(*bFile))

    a.Join(b).Fprintf(os.Stdout, "%s,%s,%s\n").Run()

}
0

Try CSV Cruncher.

It takes CSV files as SQL tables and then allows SQL queries, resulting in another CSV or JSON file.

For your case, you would just call:

crunch -in tableA.csv tableB.csv -out output.csv \
   "SELECT tableA.id, tableA.email, tableB.name 
    FROM tableA LEFT JOIN tableB USING (email)"

The tool needs Java 8 or later.

Some of the advantages:

  • You really get CSV support, not just "let's assume the data is correct".
  • You can join on multiple keys.
  • Easier to use and understand than join-based solutions.
  • You can combine more than 2 CSV files.
  • You can join by SQL expressions - the values don't have to be the same.

Disclaimer: I wrote that tool. It used to be in disarray after Google Code was closed, but I revived it and added new features as I use it.

0

You could read the CSV file with a spreadsheet program like LibreOffice and use VLOOKUP() macro to search for the name in second file.

2
  • 7
    File extension xlsx implies Microsoft Excel and I think VLOOKUP does as well. This question is tagged with Linux. Is Microsoft Excel available for Linux? Commented Mar 11, 2011 at 22:26
  • Now LibreOffice has VLOOKUP too. Commented Jun 3, 2014 at 19:20
0

In Bash 5.0.3 with GNU Coreutils 8.30 and building off of hyperslug's answer:

If you have unsorted CSV files with duplicate lines and don't want to omit data due to a missing field in a line of either file1.csv or file2.csv, then you can do the following:

Sort file 1 by field 2 and sort file 2 by field 1:

( head -n1 file1.csv && tail -n+2 file1.csv | sort -t, -k2,2 ) > sort1.csv
( head -n1 file2.csv && tail -n+2 file2.csv | sort -t, -k1,1 ) > sort2.csv

Expanding on hyperslug's parameters:

-k 2,2     : character sort starting and stopping on 2nd field
-k 1,1     : character sort starting and stopping on 1st field
head -n1   : read first line
tail -n+1: : read all but first line
(  )       : subshell
>          : output to file

I had to do head and tail within the subshell ( ) in order to preserve the first header line of the CSV file when sorting by a given field.

Then,

join -t , -a1 -a2 -1 2 -2 1 -o auto sort1.csv sort2.csv > sort3.csv

Expanding on hyperslug's parameters:

-t ,    : ',' is the field separator
-a1     : Do not omit lines from file 1 if no match in file 2 found
-a2     : Do not omit lines from file 2 if no match in file 1 found.
-1 2    : file 1, 2nd field
-2 1    : file 2, 1st field
-o auto : Auto format: includes extra commas indicating unmatched fields
>       : output to file

Here is an example file1.csv, file2.csv, and the resulting sort3.csv:

file1.csv:

ID,email
02,[email protected]
03,[email protected]
05,[email protected]
07,[email protected]
11,[email protected]

file2.csv:

email,name
[email protected],Timothy Brown
[email protected],Robert Green
[email protected],Raul Vasquez
[email protected],Carol Lindsey

sort3.csv:

email,ID,name
[email protected],02,Robert Green
[email protected],,Carol Lindsey
[email protected],03,
[email protected],07,Raul Vasquez
[email protected],05,
[email protected],,Timothy Brown
[email protected],11,

You can see Timothy Brown and Carol Lindsey lack IDs but are still included in the joined CSV file (with their names and emails in the correct fields).

-1

You could also use a tool specifically designed for joining csv files, such as the one found on https://filerefinery.com

The operations we currently support are: Joining csv files. It is possible to perform the SQL equivalent of outer, inner, left and right join operations on two csv files. Which column will be used as a join key in each of the files is configurable.

2
  • Please quote the essential parts of the answer from the reference link(s), as the answer can become invalid if the linked page(s) change.
    – DavidPostill
    Commented Nov 20, 2017 at 20:25
  • No longer exists. Commented Nov 10, 2018 at 12:34

You must log in to answer this question.

Not the answer you're looking for? Browse other questions tagged .