I have a two-column file that you can create as follows
cat > twocol << EOF
007 03
001 03
003 01
137 12
001 11
002 01
002 02
002 03
001 02
002 04
137 94
010 21
001 01
EOF
The resultant file, twocol, has only the rows of digits.
Desired Result
I want to run some kind of command on twocol and get the following result. (I think seeing it is much better than trying to restate my somewhat-confusing question title - "sort by first column then second; output unique 1st column once but all 2nd column".)
001 01
    02
    03
    11
002 01
    02
    03
    04
003 01
007 03
010 21
137 12
    94
That's different from what a simple sort will give me, i.e. different from
001 01
001 02
001 03
001 11
002 01
002 02
002 03
002 04
003 01
007 03
010 21
137 12
137 94
My Work
The first solution I came up with (before I got a decent awk script going), which matches the Desired Result above (in bold), uses several instances of awk, a bunch of bash, and some help from 1.
col_1_max_len=$(awk '
BEGIN{max1=0;}
{curr=length($1);max1=max1>curr?max1:curr;}
END{print max1}' \
twocol);
len1=$col_1_max_len;
len2=$(awk '
BEGIN{max2=0;}
{curr=length($2);max2=max2>curr?max2:curr;}
END{print max2}' \
twocol);
current_col_1_val="nothing";
while read -r line; do {
  current_row="${line}";
  col_1_val=$(awk '{print $1}' <<< "${current_row}");
  col_2_val=$(awk '{print $2}' <<< "${current_row}");
  if [ ! "${col_1_val}" == "${current_col_1_val}" ]; then
    # 10# forces base 10; otherwise bash printf reads a leading 0 (e.g. 010) as octal
    printf "%0${len1}d %0${len2}d\n" "$((10#${col_1_val}))" "$((10#${col_2_val}))";
    current_col_1_val="${col_1_val}";  # remember the key so repeats print blank
  else
    printf "%${len1}s %0${len2}d\n" " " "$((10#${col_2_val}))";
  fi;
}; done < <(sort twocol)
I feel like I should be able to use one pass with awk, something like the answers that follow: 2, 3, 4, 5, ...
I can't seem to get it hammered together without what feel like extra, clunky, memory-eating arrays. The format is also giving me a problem - the numbers in the first and second columns can go to more digits, and it would be preferable for things to look nice.
Can anyone show me how to get this result with some nice awk code, preferably something that can be used pretty easily in the terminal? Perl answers are welcome, too.
Oh, my system
$ uname -a && bash --version | head -1 && awk --version | head -1
CYGWIN_NT-10.0 MY-MACHINE 3.2.0(0.340/5/3) 2021-03-29 08:42 x86_64 Cygwin
GNU bash, version 4.4.12(3)-release (x86_64-unknown-cygwin)
GNU Awk 5.1.0, API: 3.0 (GNU MPFR 4.2.0-p9, GNU MP 6.2.1)
(I get exactly the same behavior on my Fedora and Ubuntu machines.)
Edit
I came up with an awk solution. It looks all nice and short, but I still feel there are problems.
awk '{if (!vals[$1]++) print($0); else print("   ",$2);}' <(sort twocol)
I think I'm using a bunch of memory with the vals array - as of now, my file only has ~10k lines, but I hope to scale it up. I hard-coded in the format, but I don't like it because I could have strings of varying lengths.
I can fix that (the formatting) if I make three passes with awk and pass in variables.
length1=$(awk '
BEGIN{max1=0;}
{curr=length($1);max1=max1>curr?max1:curr;}
END{print max1}' \
twocol);
length2=$(awk '
BEGIN{max2=0;}
{curr=length($2);max2=max2>curr?max2:curr;}
END{print max2}' \
twocol);
awk -vlen1=$length1 -vlen2=$length2 '
{
if (!vals[$1]++)
printf("%0*d %0*d\n",len1,$1,len2,$2);
else
printf("%*s %0*d\n",len1," ",len2,$2);
}' <(sort twocol)
Result matches the Desired Result exactly (see the part in bold, above), but I hope there's a way to do it all with one pass of awk.
Can anyone share something that matches the characteristics I've mentioned? Any comments about the time performance and/or the memory performance of the different methods would also be appreciated.
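For comparison, here is a minimal sketch that folds the two width-measuring passes and the printing pass into a single awk invocation. It is not a true single pass: awk reads the sorted file twice, once to measure and once to print, so the sorted data has to sit in a file rather than a pipe. The file names demo2col and demo2col.sorted and the variable names w1, w2, full, blank and prev are made up for this illustration, and the data is a cut-down copy of twocol. It keeps only the previous key in memory instead of a vals array.

```shell
# demo2col and demo2col.sorted are hypothetical scratch files for this sketch
printf '%s\n' '007 03' '001 03' '137 12' '001 11' '010 21' '137 94' '001 01' > demo2col
sort demo2col > demo2col.sorted

result=$(awk '
    # First read of the file: only measure the widest value in each column.
    NR == FNR {
        if (length($1) > w1) w1 = length($1)
        if (length($2) > w2) w2 = length($2)
        next
    }
    # Second read begins: build both printf formats once from the widths.
    FNR == 1 {
        full  = "%0" w1 "d %0" w2 "d\n"
        blank = "%"  w1 "s %0" w2 "d\n"
    }
    # First row or a new key: print both columns and remember the key as text.
    FNR == 1 || ($1 "") != prev { printf full, $1, $2; prev = $1 ""; next }
    # Repeated key: blanks in place of column 1.
    { printf blank, "", $2 }
' demo2col.sorted demo2col.sorted)
printf '%s\n' "$result"
```

Building the format strings by concatenation avoids the `%*d` form, which not every awk implementation accepts. The `%0Nd` formats assume both columns are numeric, as in the sample data.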
I think it might also be possible to do the sorting in awk; I'd like to know, especially if it could be more efficient. Edit: It can be done, as @steeldriver and @markp-fuso show below.
awk '{if (!vals[$1]++) print($0); else print("   ",$2);}' <(sort twocol) works quite well :-); if you actually find yourself with memory issues you can easily replace the array reference with a 'previous' variable (eg, my 2nd awk script)
Solutions that sort inside awk (eg, my 1st awk script, steeldriver's awk script) are going to require storing the file in memory; you can get away from the memory-usage question by using sort to feed a sorted stream to awk, and depending on your sort version there may be some options (memory size, # of cpus) to improve on sort's performance
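A rough sketch of the 'previous' variable idea from the comment above, not the commenter's actual script: sort feeds a stream to awk, and awk remembers only the last key it printed, so memory use in awk stays constant however long the file is. The file name demo2col and the variable name prev are invented for this illustration, the data is a cut-down copy of twocol, and the blank padding assumes every first-column value has the same width (true for the zero-padded sample).

```shell
# demo2col is a hypothetical scratch file holding a few rows of the sample data
printf '%s\n' '007 03' '001 03' '137 12' '001 11' '010 21' '137 94' '001 01' > demo2col

result=$(sort demo2col | awk '
    # New key (or very first row): print the row and remember the key as text.
    NR == 1 || ($1 "") != prev { print $1, $2; prev = $1 ""; next }
    # Repeated key: pad with blanks as wide as the key, then the 2nd column.
    { printf("%" length(prev) "s %s\n", "", $2) }
')
printf '%s\n' "$result"
```

Appending "" to $1 forces a string comparison, so keys such as 01 and 1 are treated as different, matching the order a plain text sort produces. On the sort side, GNU sort has --buffer-size (-S) and --parallel options that correspond to the memory-size and CPU-count settings mentioned above; whether they help depends on the data and the machine.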