7

For a few years I have often needed to combine lines of (sorted) text that share a matching first field, and I have never found an elegant (i.e. one-liner unix command line) way to do it. What I want is similar to what's possible with the unix join command, but join expects 2 files, with each key appearing at most once. I want to start with a single file, in which a key might appear multiple times.

I have both a ruby and a perl script that do this, but there's no way to shorten my algorithm into a one-liner. After years of unix use, I'm still learning new tricks with comm, paste, uniq, etc., and I suspect there's a smart way to do this.

There are some related questions, like "join all lines that have the same first column to the same line"; "Command line to match lines with matching first field (sed, awk, etc.)"; and "Combine lines with matching keys" -- but none of them gives a clean and reliable solution.

Here's sample input:

apple:A fruit
apple:Type of: pie
banana:tropical fruit
cherry:small burgundy fruit
cherry:1 for me to eat
cherry:bright red

Here's sample output:

apple:A fruit;Type of: pie
banana:tropical fruit
cherry:small burgundy fruit;1 for me to eat;bright red

Here's my ideal syntax:

merge --inputDelimiter=":" --outputDelimiter=";" --matchfield=1 infile.txt

The "matchfield" is really optional. It could always be the first field. Subsequent appearances of the delimiter should be treated like plain text.

I don't mind a perl, ruby, or awk one-liner, if you can think of a short and elegant algorithm. It should be able to handle millions of lines of input. Any ideas?
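
To make the behaviour concrete, here is a rough sketch of the logic in awk (purely illustrative, not one of my existing scripts; IN and OUT are my stand-ins for the --inputDelimiter and --outputDelimiter options, and the match field is fixed to the first one):

awk -v IN=':' -v OUT=';' '
  { i = index($0, IN); key = substr($0, 1, i - 1); val = substr($0, i + 1) }  # split on the first delimiter only
  key != prev { if (NR > 1) print buf; buf = key IN val; prev = key; next }   # new key: flush the previous group
  { buf = buf OUT val }                                                       # same key: append with the output delimiter
  END { if (NR) print buf }
' infile.txt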

6 Answers

8

Using an awk one-liner

awk -F: -v ORS="" 'a!=$1{a=$1; $0=RS $0} a==$1{ sub($1":",";") } 1' file

Output:

apple:A fruit;Type of: pie
banana:tropical fruit
cherry:small burgundy fruit;1 for me to eat;bright red

setting ORS="" ; By default it is \n.
The reason why we have set ORS="" (Output Record Separator) is because we don't want awk to include newlines in the output at the end of each record. We want to handle it in our own way, through our own logic. We are actually including newlines at the start of every record which has the first field different from the previous one.
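
A quick way to see the effect of ORS="" on its own (a small aside of mine, not part of the original answer):

printf 'x\ny\n' | awk -v ORS="" '1'    # prints "xy": records are emitted back to back, with nothing between them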

a!=$1 : When variable a (initially null) doesn't match the first field $1 (e.g. apple on the first line), set a=$1 and $0=RS $0, i.e. the whole record becomes "\n"$0 (a newline is added at the beginning of the record). a!=$1 holds whenever the first field ($1) differs from the previous line's $1, and is thus the criterion for segregating records by their first field.

a==$1: If it matches, we are still on a record belonging to the previous group. In this case substitute the first occurrence of $1":" (note the :), e.g. apple:, with ;. $1":" could also be written as $1 FS, where FS is :.

If you have millions of lines in your file, this approach should be very fast, because it involves no pre-processing and does not store your keys or records in any extra data structure such as an array.
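
One caveat worth noting (my addition, not part of the answer): sub($1":", ";") treats $1 as a regular expression, so a key containing characters such as . or * could match more than intended. If that is a concern, a variant that strips the prefix by position rather than by pattern should behave the same on input like the sample:

awk -F: -v ORS="" 'a!=$1{a=$1; $0=RS $0} a==$1{ $0=";" substr($0, length($1)+2) } 1' file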

2
  • Thanks for the nice explanation. Commented Oct 16, 2017 at 6:05
  • @MichaelD: Welcome Michael. Commented Oct 16, 2017 at 6:30
3

@RahulVerma's awk solution is simple and awesome! However, I've been looking for a sed solution, and I wondered whether sed is capable of solving this sort of problem. Finally, I found my own solution! The core idea is to treat the hold space as a variable: put what we want to search for into the hold space, then search for it in the current input line. Credit goes to @chaos, because the core idea comes from his answer how-to-search-for-the-word-stored-in-the-hold-space-with-sed#239049.


Using a sed one-liner

sed -E '1{:OK; h; $b; d}; x; G; s/^([^:]+)(.*)\n\1:(.*)/\1\2;\3/; tOK; s/\n.*//' file

or

sed -E 'x; G; s/^([^:]+)(.*)\n\1:(.*)/\1\2;\3/; tOK; 1d; s/\n.*//; b; :OK; h; $!d' file

Output:

apple:A fruit;Type of: pie
banana:tropical fruit
cherry:small burgundy fruit;1 for me to eat;bright red

Explanation (of the 2nd sed):

  • x exchange the hold space and the pattern space
    • The hold space would become the current input line.
    • The pattern space would become the content generated by s/// in the last cycle (see below), or the last input line if s/// failed to replace anything. It is empty at the very beginning, i.e., when the current input line is line 1.
  • G append the hold space (the current input line) to the pattern space (the content generated in the last cycle). Thus, the pattern space would be two lines (The hold space is not changed). For example,
    • If the current input line is line 1, the pattern space would be
                                   # the empty hold space
      apple:A fruit                # the current input line
      
    • If it is line 2, the pattern space would be
      apple:A fruit                # the last input line, because `s///` fails to replace anything in the last cycle
      apple:Type of: pie           # the current input line
      
    • If it is line 3, the pattern space would be
      apple:A fruit;Type of: pie   # the resulting content generated by `s///` in the last cycle
      banana:tropical fruit        # the current input line
      
  • s/// use the first field ^([^:]+) (: is the field separator) of the first line of the pattern space to search the second line; the \n\1: part is what does this. If it is found in the second line, concatenate the first line \1\2, a ;, and the remaining fields of the second line \3. Thus, the result is \1\2;\3.
  • tOK if the s/// succeeded (i.e. it replaced something), jump to the label OK; otherwise continue
    • In case of success
      • :OK specify the location of the label OK
      • h copy the pattern space, which is the \1\2;\3 generated by s///, into the hold space. (The first line of the pattern space in the next cycle is made of this very content.)
      • $!d delete it (don't print it) if we haven't reached the last line
    • In case of failure (s/// hasn't changed anything, so the pattern space still holds two lines)
      • 1d delete the pattern space (don't print it) if the current input line is line 1. At this moment there is nothing ready to output; without this, an empty line would be printed.
      • s/\n.*// delete the second line of the pattern space, which is the current input line.
      • b branch without a label, which ends the script for this cycle; the remaining content of the pattern space is printed out before the next cycle starts.
        • This is the very place where we finish combining the lines starting with the "current" same field and print out the combined line.
        • The hold space is still holding the current input line for this cycle, and it will serve as "the last input line" in the next cycle (see above). What it holds will start a brand-new round of combining the lines that begin with the new first field.
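
If it helps readability, the second command can also be spread over multiple lines with comments (GNU sed; this is just the one-liner above reformatted, not a new solution):

sed -E '
  # swap: pattern space <- group built so far, hold space <- current input line
  x
  # append the current line; the pattern space now holds two lines
  G
  # same first field on both lines? join them with ";"
  s/^([^:]+)(.*)\n\1:(.*)/\1\2;\3/
  tOK
  # the join failed: the keys differ, or this is line 1
  1d
  # drop the current-line copy and fall through to print the finished group
  s/\n.*//
  b
  :OK
  # remember the joined line; suppress it until its group is complete
  h
  $!d
' file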

In the second sed command line, the 1d; is not strictly necessary; without it, however, the output differs subtly: an innocuous empty line is printed at the top, like below,


apple:A fruit;Type of: pie
banana:tropical fruit
cherry:small burgundy fruit;1 for me to eat;bright red

Thoughts at the end

The time I spent figuring out the sed solution, and how the final solution looks, both indicate that awk is better suited and easier for problems as complicated as this. sed is not good at searching/replacing a file according to the file's own content. Searching for and replacing constant text is easy, but here what to search for comes from the file itself and may vary from line to line. For that kind of problem, sed is not as good as awk, which has full programming facilities: variables, functions, if-else, for/while/do. (See further in what is the difference between sed and awk.)

1
  • Thanks for making and sharing this, including your breakdown. Commented Mar 16, 2022 at 14:05
2
for F in `cut -f1 -d ':' infile.txt | sort | uniq`; do echo "$F:$(grep "^$F:" infile.txt | cut -f2- -d ':' | paste -s -d ';' - )"; done

Not sure it qualifies as 'elegant', but it works, though certainly not quickly for millions of lines: as the number of grep calls increases, it would slow significantly. What % of the matching fields do you expect to be unique?
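
If the input were not already grouped by key, one way to keep this pipeline style while avoiding one grep pass per key would be to sort once and let a single-pass awk (for example the awk one-liner from the top answer) do the merging. A rough sketch, noting that it changes the output order to key-sorted and that -s (GNU/BSD sort) keeps the original order of lines within each key:

sort -t: -k1,1 -s infile.txt | awk -F: -v ORS="" 'a!=$1{a=$1; $0=RS $0} a==$1{ sub($1":",";") } 1'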

3
  • Thanks for the unix string. I expect around 1-5 repeats of a key/matching-field, so in a million lines, there might be 300k keys. Commented Oct 16, 2017 at 6:03
  • Ah, 300k grep calls would be unreasonable. Thanks for the feedback
    – jgrundstad
    Commented Oct 16, 2017 at 15:37
  • However, this is still a good example of unix philosophy.
    – Bruce
    Commented Jan 15, 2022 at 10:07
2

Using the awk language:

awk -F':' '{ v=substr($0, index($0,":")+1); a[$1]=($1 in a? a[$1]";" : "")v }
           END{ for(i in a) print i,a[i] }' OFS=':' infile.txt

The output:

apple:A fruit;Type of: pie
banana:tropical fruit
cherry:small burgundy fruit;1 for me to eat;bright red
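
One thing to keep in mind (my note, not from the original answer): for (i in a) visits the keys in an unspecified order, so the merged lines may not come out sorted as in the sample. With GNU awk, sorted traversal can be requested, for example:

awk -F':' '{ v=substr($0, index($0,":")+1); a[$1]=($1 in a? a[$1]";" : "")v }
           END{ PROCINFO["sorted_in"]="@ind_str_asc"; for(i in a) print i,a[i] }' OFS=':' infile.txt

Alternatively, simply pipe the output through sort -t: -k1,1.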
1
  • Thanks @RomanPerekhrest, that works. Better than some other awk solutions I've tried in the past which would break on complex lines. That said, I'd still love a shorter command with a simpler syntax, but I am pleased to have a one-liner. Commented Oct 13, 2017 at 17:40
2

I think this one does the job

 awk -F':' '$1!=a{if(b)print b;b=""}a=$1{$1="";if(!b)b=a;b=b$0}END{print b}' infile
1
  • 3
    Can you explain it?
    – ghoti
    Commented Oct 13, 2017 at 21:04
2

As an exercise - or even for smallish datasets - the same done with nothing but native bash built-ins:

declare -A fruit=()
while IFS=: read -r k v; do fruit[$k]="${fruit[$k]:+${fruit[$k]};}$v"; done < test.in
for k in "${!fruit[@]}"; do echo "$k:${fruit[$k]}"; done

Output:

apple:A fruit;Type of: pie
banana:tropical fruit
cherry:small burgundy fruit;1 for me to eat;bright red

Any required sorting order might need to be explicitly enforced in more complex datasets.
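
For instance, to emit the groups in key order rather than the associative array's arbitrary order, the final loop can simply be piped through sort (assuming the keys contain no newlines):

for k in "${!fruit[@]}"; do echo "$k:${fruit[$k]}"; done | sort -t: -k1,1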
