Before running the speed tests further down, I would have used either of these approaches, both using GNU awk for FIELDWIDTHS, \s, and gensub():
Print a modified version of each field as you go:
awk -v FIELDWIDTHS='3 10 3' -v OFS=',' '
{
for (i=1; i<=NF; i++) {
printf "%s%s", gensub(/\s+$/,"",1,$i), (i<NF ? OFS : ORS)
}
}
' myfile.txt
999,a bcd efgh,555
8,z,7
1,xx xx xx,48
or save the modified fields in a string then print that:
awk -v FIELDWIDTHS='3 10 3' -v OFS=',' '
{
out = ""
for (i=1; i<=NF; i++) {
out = (i>1 ? out OFS : "") gensub(/\s+$/,"",1,$i)
}
print out
}
' myfile.txt
999,a bcd efgh,555
8,z,7
1,xx xx xx,48
I expected those two to be about the same in terms of execution speed: I/O is relatively slow, but constantly appending to a variable (and so forcing awk to relocate it in memory at times) isn't free either, so neither has an obvious advantage.
I (apparently incorrectly, see below) expected both of them to be faster than modifying all of the fields (as happens with gsub(/\s+$/,"",$i) or $i=gensub(/\s+$/,"",1,$i)). Also, neither of them changes $0, so it's still available as-is for further processing if you like (though with the field-modifying solutions you can trivially save $0 to a temp variable before the loop and restore it after the loop, at the cost of just one more field-splitting action).
I decided to test execution speeds. Here's what I found from the 3rd-run timing of the following 4 scripts, each run against a 3,000,000-line input file produced by awk '{for (i=1; i<=1000000;i++) print}' myfile.txt > file:
$ time awk -v FIELDWIDTHS='3 10 3' -v OFS=',' '
{
for (i=1; i<=NF; i++) {
printf "%s%s", gensub(/\s+$/,"",1,$i), (i<NF ? OFS : ORS)
}
}
' file > /dev/null
real 0m11.407s
user 0m4.656s
sys 0m0.000s
$ time awk -v FIELDWIDTHS='3 10 3' -v OFS=',' '
{
out = ""
for (i=1; i<=NF; i++) {
out = (i>1 ? out OFS : "") gensub(/\s+$/,"",1,$i)
}
print out
}
' file > /dev/null
real 0m11.319s
user 0m7.921s
sys 0m0.031s
$ time awk -v FIELDWIDTHS='3 10 3' -v OFS=',' '
{
for (i=1; i<=NF; i++) {
$i = gensub(/\s+$/,"",1,$i)
}
print
}
' file > /dev/null
real 0m8.933s
user 0m6.296s
sys 0m0.000s
$ time awk -v FIELDWIDTHS='3 10 3' -v OFS=',' '
{
for (i=1; i<=NF; i++) {
sub(/\s+$/,"",$i)
}
print
}
' file > /dev/null
real 0m9.446s
user 0m4.953s
sys 0m0.000s
So apparently, for such a small number of fields per line, modifying each field (and so rebuilding $0 once per field) is faster than printing the modified values as you go or saving them in a string to print once at the end, which makes sense.
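The $0 rebuild mentioned above is easy to observe in isolation: even a no-op assignment like $1 = $1 causes awk to reconstruct the record from its fields joined with OFS (a minimal sketch, using default whitespace splitting):

```shell
# A no-op field assignment still forces awk to rebuild $0 with OFS
printf 'a b c\n' | awk -v OFS=',' '{ $1 = $1; print }'
# prints: a,b,c
```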
Now here's the timing with a different input file that's 300,000 lines long but has 30 fields per line instead of 3 (so, fewer lines but more fields per line than the previous tests above), created by awk '{for (i=1; i<=100000;i++) {for (j=1;j<=10;j++) printf "%s", $0; print ""}}' myfile.txt > file:
$ time awk -v FIELDWIDTHS='3 10 3 3 10 3 3 10 3 3 10 3 3 10 3 3 10 3 3 10 3 3 10 3 3 10 3 3 10 3' -v OFS=',' '
{
for (i=1; i<=NF; i++) {
printf "%s%s", gensub(/\s+$/,"",1,$i), (i<NF ? OFS : ORS)
}
}
' file > /dev/null
real 0m12.199s
user 0m3.109s
sys 0m0.031s
$ time awk -v FIELDWIDTHS='3 10 3 3 10 3 3 10 3 3 10 3 3 10 3 3 10 3 3 10 3 3 10 3 3 10 3 3 10 3' -v OFS=',' '
{
out = ""
for (i=1; i<=NF; i++) {
out = (i>1 ? out OFS : "") gensub(/\s+$/,"",1,$i)
}
print out
}
' file > /dev/null
real 0m10.930s
user 0m6.015s
sys 0m0.046s
$ time awk -v FIELDWIDTHS='3 10 3 3 10 3 3 10 3 3 10 3 3 10 3 3 10 3 3 10 3 3 10 3 3 10 3 3 10 3' -v OFS=',' '
{
for (i=1; i<=NF; i++) {
$i = gensub(/\s+$/,"",1,$i)
}
print
}
' file > /dev/null
real 0m7.688s
user 0m4.312s
sys 0m0.031s
$ time awk -v FIELDWIDTHS='3 10 3 3 10 3 3 10 3 3 10 3 3 10 3 3 10 3 3 10 3 3 10 3 3 10 3 3 10 3' -v OFS=',' '
{
for (i=1; i<=NF; i++) {
sub(/\s+$/,"",$i)
}
print
}
' file > /dev/null
real 0m7.512s
user 0m4.578s
sys 0m0.031s
And again modifying the fields was faster, which I did not expect. You live and learn!
The Text::CSV module attempts to be as compliant as possible with RFC 4180. You can search for the word "escape" and view the defaults for the Raku module here: Text::CSV