I'm using Unix sort to sort a comma-delimited file with multiple columns. Thus far, this has worked perfectly for sorting the data either numerically or alphabetically:

Example file before any sorting:

C,United States,WA,Tacoma,f,1
A,United States,MA,Boston,f,0
B,United States,NY,New York,f,5
A,Canada,QC,Montreal,f,2
A,Bahamas,Bahamas,Nassau,f,2
A,United States,NY,New York,f,1

Sort the file:

$ sort -t ',' -k 2,2 -k 3,3 -k 4,4 -k 5,5r -k 6,6nr tmp.csv

Sorted result:

A,Bahamas,Bahamas,Nassau,f,2
A,Canada,QC,Montreal,f,2
A,United States,MA,Boston,f,0
B,United States,NY,New York,f,5
A,United States,NY,New York,f,1
C,United States,WA,Tacoma,f,1

Here is the issue: I want to sort column 2 in a custom order, meaning United States first, then Canada, then Bahamas:

Desired sort:

A,United States,MA,Boston,f,0
B,United States,NY,New York,f,5
A,United States,NY,New York,f,1
C,United States,WA,Tacoma,f,1
A,Canada,QC,Montreal,f,2
A,Bahamas,Bahamas,Nassau,f,2

Is there some way to pass Unix sort a custom sort order that it can then apply? Something like:

$ sort -t ',' -k 2,2:'United States, Canada, Bahamas' -k 3,3 -k 4,4 -k 5,5r -k 6,6nr tmp.csv

Thanks!

  • For these three values, you want reverse alphabetic order. For the general case, you'll need to map the names to a sort order number, and then do the sorting using the sort order number. Or go for a scripting language... One possibility is the join command, but you could end up with a lot of sorting: the input files for join must be sorted in one order, and then you'd be using sort again to put the data into a different order (and losing the sort order column as a post-sort step). Commented Oct 17, 2012 at 17:37
  • In your example input, shouldn't there be t instead of f in the last line? Commented Oct 17, 2012 at 17:56
  • Lev: yes, good catch. My bad; too much cutting and pasting (my actual data set is much larger and I accidentally grabbed the wrong rows).
    – jewelia
    Commented Oct 17, 2012 at 19:03
  • I updated the answer to match your data. Commented Oct 17, 2012 at 20:19

3 Answers

The other answer and the comment cover the general approach; here is what an implementation can look like:

$ cat order
Bahamas,3
Canada,2
United States,1

$ cat data
C,United States,WA,Tacoma,f,1
A,United States,MA,Boston,f,0
B,United States,NY,New York,f,5
A,Canada,QC,Montreal,f,2
A,Bahamas,Bahamas,Nassau,f,2
A,United States,NY,New York,f,1

$ sort -t, -k2,2 data | join -t, -1 1 -2 2 -o 1.2,2.1,2.2,2.3,2.4,2.5,2.6 order - | sort -t, -k1,1n -k4,4 -k5,5 -k6,6r -k7,7nr | cut -d, -f2-
A,United States,MA,Boston,f,0
B,United States,NY,New York,f,5
A,United States,NY,New York,f,1
C,United States,WA,Tacoma,f,1
A,Canada,QC,Montreal,f,2
A,Bahamas,Bahamas,Nassau,f,2
  • Awesome, thanks for your help. This worked perfectly!
    – jewelia
    Commented Oct 17, 2012 at 20:52
  • @jewelia Improved once more, sed was not really needed here. Commented Oct 17, 2012 at 21:01

You can't do that with sort. At this point, you really should be reaching for awk/perl/your-language-of-choice. You can fudge it, though. You could, for example, use sed to change "United States" to 0, "Canada" to 1 and "Bahamas" to 2, then do a numeric sort against that column, then sed it back. Or change "United States" to "United States,0" etc, sort against the extra column and then discard it.
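
As a rough sketch of the second variant (add a rank column, sort on it, then discard it), something like the following should work for the sample data. The ranks 0, 1 and 2 and the output file name sorted.csv are assumptions chosen for illustration, and the sed patterns assume that the country is always the second field and that the first field never contains one of the country names. Rows with any other country get no rank and would need their own rule.

```shell
# Recreate the question's sample file so the sketch is self-contained.
cat > tmp.csv <<'EOF'
C,United States,WA,Tacoma,f,1
A,United States,MA,Boston,f,0
B,United States,NY,New York,f,5
A,Canada,QC,Montreal,f,2
A,Bahamas,Bahamas,Nassau,f,2
A,United States,NY,New York,f,1
EOF

# Decorate: prepend a rank (0, 1, 2 = the custom order) based on field 2.
# Sort: rank first, then the original keys, each shifted right by one field.
# Undecorate: drop the rank column again with cut.
sed -e '/^[^,]*,United States,/s/^/0,/' \
    -e '/^[^,]*,Canada,/s/^/1,/' \
    -e '/^[^,]*,Bahamas,/s/^/2,/' tmp.csv |
  sort -t, -k1,1n -k4,4 -k5,5 -k6,6r -k7,7nr |
  cut -d, -f2- > sorted.csv

cat sorted.csv
```

Because the rank is added as a new first field rather than substituted for the country name, there is no need to "sed it back" afterwards; cut -d, -f2- removes it.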

I just wrote a helper called csort to make it easy to do this. It prefixes each line with a value of your choosing based on substring or regular expression matches within the line:

$ csort -t, '2=United States' X 2=Canada Y 2=Bahamas Z < tmp.csv | \
sort -t, -k1,1 -k3,3 -k4,4 -k5,5 -k6,6r -k7,7nr
X,A,United States,MA,Boston,f,0
X,B,United States,NY,New York,f,5
X,A,United States,NY,New York,f,1
X,C,United States,WA,Tacoma,f,1
Y,A,Canada,QC,Montreal,f,2
Z,A,Bahamas,Bahamas,Nassau,f,2

The 2=STR notation means "match if the second field equals STR".

You can then optionally pipe the output through cut -c3- to remove the prefix.
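
If csort is not available, the same prefix-then-sort idea can be approximated with awk. This is only a sketch of the behaviour described above, not csort's actual implementation: prefix_by_field2 is a hypothetical helper name, the X/Y/Z prefixes are hard-coded, and lines whose second field matches none of the three values are silently dropped.

```shell
# Recreate the question's sample file so the sketch is self-contained.
cat > tmp.csv <<'EOF'
C,United States,WA,Tacoma,f,1
A,United States,MA,Boston,f,0
B,United States,NY,New York,f,5
A,Canada,QC,Montreal,f,2
A,Bahamas,Bahamas,Nassau,f,2
A,United States,NY,New York,f,1
EOF

# prefix_by_field2 is a hypothetical stand-in, not part of csort:
# it prepends X, Y or Z depending on the exact value of field 2.
prefix_by_field2() {
  awk -F, -v OFS=, '
    $2 == "United States" { print "X", $0; next }
    $2 == "Canada"        { print "Y", $0; next }
    $2 == "Bahamas"       { print "Z", $0; next }
  '
}

# Sort on the prefix first, then on the shifted original keys,
# and strip the two-character prefix ("X," etc.) with cut -c3-.
prefix_by_field2 < tmp.csv |
  sort -t, -k1,1 -k4,4 -k5,5 -k6,6r -k7,7nr |
  cut -c3- > sorted.csv

cat sorted.csv
```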
