Suppose you have one CSV file with 2 fields: ID and email. You have another file with 2 fields: email and name. How can you produce a file with all three fields joined on email?
-
5A little more detail on the join (i.e., inner, outer, left). Is the email list on the 1st CSV identical to the second list? Or does one contain more?– hyperslugCommented Aug 20, 2009 at 23:29
-
Examples of the csv files would be handy to, along with the OS you are using?– TroggyCommented Aug 20, 2009 at 23:35
-
i think 1st and 2nd list are identical. I am using Linux. Please help!!! thanks!! :)– crst53Commented Aug 21, 2009 at 0:00
-
1how large is the data?– JoshuaCommented Aug 21, 2009 at 0:13
9 Answers
Revision3:
You must sort both lists on email alphabetically, then join. Given that the email field the 2nd field of file1 and the 1st field of file2:
sort -t , -k 2,2 file1.csv > sort1.csv
sort -t , -k 1,1 file2.csv > sort2.csv
join -t , -1 2 -2 1 sort1.csv sort2.csv > sort3.csv
parameter meaning
-t , : ',' is the field separator -k 2,2 : character sort on 2nd field -k 1,1 : character sort on 1st field -1 2 : file 1, 2nd field -2 1 : file 2, 1st field > : output to file
produces
email,ID,name email,ID,name ...
sorted by email alphabetically.
Note that if any email is missing from either file it will be omitted from the results.
-
3CSV is more complicated than this. The field separator can be escaped for example. Commented Dec 11, 2016 at 0:45
-
-
This won't work if the CSV is mixed quoted/unqoted, if the ID contains a comma. Use this solution only for one-time processing where you check the result. But I recommend not using it for a production-level script. Commented Nov 10, 2018 at 12:51
Perhaps it is overkill, but you could import into a database (e.g. OpenOffice Base) as two kinds of tables and define a report that is the desired output.
If the CSV import is a problem, then a spreadsheet program (e.g. OpenOffice Calc) can do the import. The result can then easily be transferred to the database.
As a future reference you might want to start playing around with AWK. It's a very simple little scripting language that exists in some form on every *nix system and its sole mission is life is the manipulation of standard delimited textual databases. With a few lines of throwaway script you can do some very useful things. The language is small and elegant and has a better utility/complexity ratio than anything else I am aware of.
-
-
awk doesn't handle quoting and escaping (e.g. dealing with ,s in a ,-separated CSV file) as far as I know. If you need that, using a dedicated CSV handling library is easier; they exist for many languages. Commented Sep 14, 2010 at 11:34
Use Go: https://github.com/chrislusf/gleam
package main
import (
"flag"
"os"
"github.com/chrislusf/gleam"
"github.com/chrislusf/gleam/source/csv"
)
var (
aFile = flag.String("a", "a.csv", "first csv file with 2 fields, the first one being the key")
bFile = flag.String("b", "b.csv", "second csv file with 2 fields, the first one being the key")
)
func main() {
flag.Parse()
f := gleam.New()
a := f.Input(csv.New(*aFile))
b := f.Input(csv.New(*bFile))
a.Join(b).Fprintf(os.Stdout, "%s,%s,%s\n").Run()
}
Try CSV Cruncher.
It takes CSV files as SQL tables and then allows SQL queries, resulting in another CSV or JSON file.
For your case, you would just call:
crunch -in tableA.csv tableB.csv -out output.csv \
"SELECT tableA.id, tableA.email, tableB.name
FROM tableA LEFT JOIN tableB USING (email)"
The tool needs Java 8 or later.
Some of the advantages:
- You really get CSV support, not just "let's assume the data is correct".
- You can join on multiple keys.
- Easier to use and understand than
join
-based solutions. - You can combine more than 2 CSV files.
- You can join by SQL expressions - the values don't have to be the same.
Disclaimer: I wrote that tool. It used to be in disarray after Google Code was closed, but I revived it and added new features as I use it.
You could read the CSV file with a spreadsheet program like LibreOffice and use VLOOKUP()
macro to search for the name in second file.
-
7File extension xlsx implies Microsoft Excel and I think VLOOKUP does as well. This question is tagged with Linux. Is Microsoft Excel available for Linux? Commented Mar 11, 2011 at 22:26
-
In Bash 5.0.3 with GNU Coreutils 8.30 and building off of hyperslug's answer:
If you have unsorted CSV files with duplicate lines and don't want to omit data due to a missing field in a line of either file1.csv
or file2.csv
, then you can do the following:
Sort file 1 by field 2 and sort file 2 by field 1:
( head -n1 file1.csv && tail -n+2 file1.csv | sort -t, -k2,2 ) > sort1.csv
( head -n1 file2.csv && tail -n+2 file2.csv | sort -t, -k1,1 ) > sort2.csv
Expanding on hyperslug's parameters:
-k 2,2 : character sort starting and stopping on 2nd field
-k 1,1 : character sort starting and stopping on 1st field
head -n1 : read first line
tail -n+1: : read all but first line
( ) : subshell
> : output to file
I had to do head
and tail
within the subshell ( )
in order to preserve the first header line of the CSV file when sorting by a given field.
Then,
join -t , -a1 -a2 -1 2 -2 1 -o auto sort1.csv sort2.csv > sort3.csv
Expanding on hyperslug's parameters:
-t , : ',' is the field separator
-a1 : Do not omit lines from file 1 if no match in file 2 found
-a2 : Do not omit lines from file 2 if no match in file 1 found.
-1 2 : file 1, 2nd field
-2 1 : file 2, 1st field
-o auto : Auto format: includes extra commas indicating unmatched fields
> : output to file
Here is an example file1.csv
, file2.csv
, and the resulting sort3.csv
:
file1.csv
:
ID,email
02,[email protected]
03,[email protected]
05,[email protected]
07,[email protected]
11,[email protected]
file2.csv
:
email,name
[email protected],Timothy Brown
[email protected],Robert Green
[email protected],Raul Vasquez
[email protected],Carol Lindsey
sort3.csv
:
email,ID,name
[email protected],02,Robert Green
[email protected],,Carol Lindsey
[email protected],03,
[email protected],07,Raul Vasquez
[email protected],05,
[email protected],,Timothy Brown
[email protected],11,
You can see Timothy Brown and Carol Lindsey lack IDs but are still included in the joined CSV file (with their names and emails in the correct fields).
You could also use a tool specifically designed for joining csv files, such as the one found on https://filerefinery.com
The operations we currently support are: Joining csv files. It is possible to perform the SQL equivalent of outer, inner, left and right join operations on two csv files. Which column will be used as a join key in each of the files is configurable.
-
Please quote the essential parts of the answer from the reference link(s), as the answer can become invalid if the linked page(s) change.– DavidPostill ♦Commented Nov 20, 2017 at 20:25
-