Before running the speed tests further down, I would have used either of these approaches, both using GNU awk for FIELDWIDTHS, \s, and gensub():
Print a modified version of each field as you go:
awk -v FIELDWIDTHS='3 10 3' -v OFS=',' '
{
for (i=1; i<=NF; i++) {
printf "%s%s", gensub(/\s+$/,"",1,$i), (i<NF ? OFS : ORS)
}
}
' myfile.txt
999,a bcd efgh,555
8,z,7
1,xx xx xx,48
or save the modified fields in a string then print that:
awk -v FIELDWIDTHS='3 10 3' -v OFS=',' '
{
out = ""
for (i=1; i<=NF; i++) {
out = (i>1 ? out OFS : "") gensub(/\s+$/,"",1,$i)
}
print out
}
' myfile.txt
999,a bcd efgh,555
8,z,7
1,xx xx xx,48
I expected those two to be about the same in terms of execution speed: I/O is relatively slow, but constantly appending to a variable (and so forcing awk to relocate it in memory at times) isn't free either, so neither has an obvious advantage.
I (apparently incorrectly, see below) expected both of them to be faster than modifying all of the fields (as happens with gsub(/\s+$/,"",$i) or $i=gensub(/\s+$/,"",1,$i)). Also, neither of them changes $0, so it's still available as-is for further processing if you like (though with the field-modifying solutions you can trivially save $0 to a temp variable before the loop and restore it after the loop, at the cost of just one more field-splitting action).
I decided to test execution speeds. Here's what I found from the 3rd-run timing of the following 4 scripts, each run against a 3,000,000-line input file produced by awk '{for (i=1; i<=1000000;i++) print}' myfile.txt > file:
$ time awk -v FIELDWIDTHS='3 10 3' -v OFS=',' '
{
for (i=1; i<=NF; i++) {
printf "%s%s", gensub(/\s+$/,"",1,$i), (i<NF ? OFS : ORS)
}
}
' file > /dev/null
real 0m11.407s
user 0m4.656s
sys 0m0.000s
$ time awk -v FIELDWIDTHS='3 10 3' -v OFS=',' '
{
out = ""
for (i=1; i<=NF; i++) {
out = (i>1 ? out OFS : "") gensub(/\s+$/,"",1,$i)
}
print out
}
' file > /dev/null
real 0m11.319s
user 0m7.921s
sys 0m0.031s
$ time awk -v FIELDWIDTHS='3 10 3' -v OFS=',' '
{
for (i=1; i<=NF; i++) {
$i = gensub(/\s+$/,"",1,$i)
}
print
}
' file > /dev/null
real 0m8.933s
user 0m6.296s
sys 0m0.000s
$ time awk -v FIELDWIDTHS='3 10 3' -v OFS=',' '
{
for (i=1; i<=NF; i++) {
sub(/\s+$/,"",$i)
}
print
}
' file > /dev/null
real 0m9.446s
user 0m4.953s
sys 0m0.000s
So apparently, for such a small number of fields per line, modifying each field (and so rebuilding $0 once per field) is faster than printing the modified values as you go or saving them in a string to print once at the end, which makes sense.
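The $0 rebuild mentioned above is easy to observe in isolation: even a no-op assignment like $1 = $1 causes awk to reconstruct the record from its fields joined with OFS (a minimal sketch, using default whitespace splitting):

```shell
# A no-op field assignment still forces awk to rebuild $0 with OFS
printf 'a b c\n' | awk -v OFS=',' '{ $1 = $1; print }'
# prints: a,b,c
```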
Now here's the timing with a different input file that's 300,000 lines long but has 30 fields per line instead of 3 (so, fewer lines but more fields per line than the previous tests above), created by awk '{for (i=1; i<=100000;i++) {for (j=1;j<=10;j++) printf "%s", $0; print ""}}' myfile.txt > file:
$ time awk -v FIELDWIDTHS='3 10 3 3 10 3 3 10 3 3 10 3 3 10 3 3 10 3 3 10 3 3 10 3 3 10 3 3 10 3' -v OFS=',' '
{
for (i=1; i<=NF; i++) {
printf "%s%s", gensub(/\s+$/,"",1,$i), (i<NF ? OFS : ORS)
}
}
' file > /dev/null
real 0m12.199s
user 0m3.109s
sys 0m0.031s
$ time awk -v FIELDWIDTHS='3 10 3 3 10 3 3 10 3 3 10 3 3 10 3 3 10 3 3 10 3 3 10 3 3 10 3 3 10 3' -v OFS=',' '
{
out = ""
for (i=1; i<=NF; i++) {
out = (i>1 ? out OFS : "") gensub(/\s+$/,"",1,$i)
}
print out
}
' file > /dev/null
real 0m10.930s
user 0m6.015s
sys 0m0.046s
$ time awk -v FIELDWIDTHS='3 10 3 3 10 3 3 10 3 3 10 3 3 10 3 3 10 3 3 10 3 3 10 3 3 10 3 3 10 3' -v OFS=',' '
{
for (i=1; i<=NF; i++) {
$i = gensub(/\s+$/,"",1,$i)
}
print
}
' file > /dev/null
real 0m7.688s
user 0m4.312s
sys 0m0.031s
$ time awk -v FIELDWIDTHS='3 10 3 3 10 3 3 10 3 3 10 3 3 10 3 3 10 3 3 10 3 3 10 3 3 10 3 3 10 3' -v OFS=',' '
{
for (i=1; i<=NF; i++) {
sub(/\s+$/,"",$i)
}
print
}
' file > /dev/null
real 0m7.512s
user 0m4.578s
sys 0m0.031s
And again modifying the fields was faster, which I did not expect. You live and learn!
The Text::CSV module attempts to be as compliant as possible with RFC 4180. You can search for the word "escape" and view the defaults for the Raku module here: Text::CSV