Regex to join fields in a CSV

Question

I have a CSV with over 2 million records in it with the following format.

path;name;extension;size;date;user    
/foo/;difacs;cgi;3,795;18-07-2011;Unix User\pads
/foo/;difacs.cgi;bak;2,622;03-12-2009;Unix User\pads
/foo/test/kzt/netcdfSample/testing/;zzz;;401;27-07-2006;Unix User\kzt
/foo/test/kzt/netcdfSample/vic_netcdf_popup/;a;txt;1,832;17-02-2006;Unix User\kzt

I need to join the path, name and extension into one correctly formatted field.

path;size;date;user    
/foo/difacs.cgi;3,795;18-07-2011;Unix User\pads
/foo/difacs.cgi;bak;2,622;03-12-2009;Unix User\pads
/foo/test/kzt/netcdfSample/testing/zzz/;401;27-07-2006;Unix User\kzt
/foo/test/kzt/netcdfSample/vic_netcdf_popup/a.txt;1,832;17-02-2006;Unix User\kzt

Thanks in advance!

Which operating system are you on? Which tools to you normally use? What have you tried? — slhck, Commented Jun 4, 2012 at 5:37
Windows 7 using TextPad's regex search and replace function but I can just as easily switch over to a linux command line. I'm new to regex so not really understanding it a whole lot. — StefWill, Commented Jun 4, 2012 at 5:39
It might be a lot easier to just load this into a spreadsheet and merge the fields there — Journeyman Geek, Commented Jun 4, 2012 at 5:42
I would but excel can't deal with 2 million rows and I'd rather not have to split the files. — StefWill, Commented Jun 4, 2012 at 5:47

Dennis Williamson · Accepted Answer · 2012-06-04 13:17:37Z

1

This is a variation on slhck's answer which deals properly with an empty extension field (and avoids falsely replacing a dot that might have existed in the second or third field intentionally):

sed 's/^\([^;]*\);\([^;]*\)/\1\2/;ta;:a;s/^[^;]\+;;/&/;t;s/;/./' inputfile

It's not necessary to use a third capture group. That answer works without it. It's not necessary to escape the dot on the right hand side of the substitute command.

Here's an explanation of my script:

capture the first two fields, excluding the semicolons that delimit them.
ta;:a - if a successful replacement was made, then branch to label :a which immediately follows - this effectively clears the "success" flag
s/^[^;]\+;;/&/ - replace a sequence of non-semicolons followed by two semicolons (the concatenated first and second fields followed by an empty third field) with itself - it's a no-op, but it sets the "success" flag.
t - if the last replacement was successful (the third field is empty), skip to the end of processing of the current line (since no label was specified)
s/;/./ - if we've gotten to this point (the third field was not empty), replace the semicolon with a dot.

answered Jun 4, 2012 at 13:17

Dennis Williamson

109k19 gold badges172 silver badges190 bronze badges

Can you just let me know if you downvoted my answer or not? Trying to retrace a serial downvote incident.
– slhck
Commented Jun 4, 2012 at 20:28
@slhck: No. I didn't. As you can see from my profile page I've only cast two downvotes in my almost 3 yrs of activity. I know you're only asking for the purpose of process of elimination.
– Dennis Williamson
Commented Jun 4, 2012 at 23:42
That works perfectly. Now I just need to de-construct what you've done and how it works. But you have pulled my bacon out of the fire on this one ;)
– StefWill
Commented Jun 4, 2012 at 23:44
@StefWill: Feel free to ask for clarification if my explanation is lacking.
– Dennis Williamson
Commented Jun 4, 2012 at 23:53

Add a comment |

Stack Exchange Network

Regex to join fields in a CSV

1 Answer 1

You must log in to answer this question.

Not the answer you're looking for? Browse other questions tagged
windows
linux
regex
csv
.

Hot Network Questions

Regex to join fields in a CSV

1 Answer 1

You must log in to answer this question.

Not the answer you're looking for? Browse other questions tagged windowslinuxregexcsv.

Related

Hot Network Questions

Not the answer you're looking for? Browse other questions tagged
windows
linux
regex
csv
.