1

I have a CSV with over 2 million records in it with the following format.

path;name;extension;size;date;user    
/foo/;difacs;cgi;3,795;18-07-2011;Unix User\pads
/foo/;difacs.cgi;bak;2,622;03-12-2009;Unix User\pads
/foo/test/kzt/netcdfSample/testing/;zzz;;401;27-07-2006;Unix User\kzt
/foo/test/kzt/netcdfSample/vic_netcdf_popup/;a;txt;1,832;17-02-2006;Unix User\kzt

I need to join the path, name and extension into one correctly formatted field.

path;size;date;user    
/foo/difacs.cgi;3,795;18-07-2011;Unix User\pads
/foo/difacs.cgi;bak;2,622;03-12-2009;Unix User\pads
/foo/test/kzt/netcdfSample/testing/zzz/;401;27-07-2006;Unix User\kzt
/foo/test/kzt/netcdfSample/vic_netcdf_popup/a.txt;1,832;17-02-2006;Unix User\kzt

Thanks in advance!

4
  • 1
    Which operating system are you on? Which tools to you normally use? What have you tried?
    – slhck
    Commented Jun 4, 2012 at 5:37
  • Windows 7 using TextPad's regex search and replace function but I can just as easily switch over to a linux command line. I'm new to regex so not really understanding it a whole lot.
    – StefWill
    Commented Jun 4, 2012 at 5:39
  • It might be a lot easier to just load this into a spreadsheet and merge the fields there
    – Journeyman Geek
    Commented Jun 4, 2012 at 5:42
  • I would but excel can't deal with 2 million rows and I'd rather not have to split the files.
    – StefWill
    Commented Jun 4, 2012 at 5:47

1 Answer 1

1

This is a variation on slhck's answer which deals properly with an empty extension field (and avoids falsely replacing a dot that might have existed in the second or third field intentionally):

sed 's/^\([^;]*\);\([^;]*\)/\1\2/;ta;:a;s/^[^;]\+;;/&/;t;s/;/./' inputfile

It's not necessary to use a third capture group. That answer works without it. It's not necessary to escape the dot on the right hand side of the substitute command.

Here's an explanation of my script:

  • capture the first two fields, excluding the semicolons that delimit them.
  • ta;:a - if a successful replacement was made, then branch to label :a which immediately follows - this effectively clears the "success" flag
  • s/^[^;]\+;;/&/ - replace a sequence of non-semicolons followed by two semicolons (the concatenated first and second fields followed by an empty third field) with itself - it's a no-op, but it sets the "success" flag.
  • t - if the last replacement was successful (the third field is empty), skip to the end of processing of the current line (since no label was specified)
  • s/;/./ - if we've gotten to this point (the third field was not empty), replace the semicolon with a dot.
4
  • Can you just let me know if you downvoted my answer or not? Trying to retrace a serial downvote incident.
    – slhck
    Commented Jun 4, 2012 at 20:28
  • @slhck: No. I didn't. As you can see from my profile page I've only cast two downvotes in my almost 3 yrs of activity. I know you're only asking for the purpose of process of elimination. Commented Jun 4, 2012 at 23:42
  • That works perfectly. Now I just need to de-construct what you've done and how it works. But you have pulled my bacon out of the fire on this one ;)
    – StefWill
    Commented Jun 4, 2012 at 23:44
  • @StefWill: Feel free to ask for clarification if my explanation is lacking. Commented Jun 4, 2012 at 23:53

You must log in to answer this question.

Not the answer you're looking for? Browse other questions tagged .