I have a .txt file with publications in which columns are separated by a single space. However, titles also contain spaces, so in order to separate the columns correctly I need to have all titles in quotes. At the moment my data (example.txt) looks something like this:
1y4w 0 'my title no. 1' journal 344 471 480 2004 CODE UK 0022-2836 0070 ? 15522299 16.8768/urlspub714
1y4w 1 'my title no. 2' 3620131
1y44 0 'my title, no. 3.' journal 433 657 661 2005 CODE UK 0028-0836 0006 ? 15654328 10.1038/papukaj03284
2y42 1 ;my title no. 4. ' 'journal' 66 738 ? 2010 ? DK 1744-3091 ? ? 20516614 10.1107/S174430911001626X
1y4p 0 'my title no.5. ; journal 44 6101 6121 2005 CODE US 0006-2960 0033 ? 15835899 10.1021/bi047813a
1y4p 0 my title no.6. ; journal 44 6101 6121 2005 CODE US 0006-2960 0033 ? ? ?
So my idea was to:
- put a single quote in front of the first word of the title and then
- replace the semicolons throughout.
I only have problems with the first point. Since I didn't know how to do this column-wise, I thought of a second way: every row (if correctly recorded) begins with a 7-character string, after which I want to put the quote if it is not already there. This string is: 4 characters (lowercase letters or digits), then a space, then a digit [0-9] and another space.
My best attempt is:
sed -r "s/([a-z0-9]\{4\}\s[0-9]\s)'?;?/\1'/g" example.txt >> example_corr.txt
That doesn't change anything, though.
Also, removing the -r outputs the error:
sed: -e expression #1, char 51: invalid reference \1 on `s' command's RHS
I am still very new to UNIX and regular expressions, so I would appreciate any help or explanation.
P.S. I am using Windows 10 with the built-in ssh to connect to a Linux device.
UPDATE: solved (though not by treating the file as a table with columns) with this:
sed -E "s/([a-z0-9]{4}\s[0-9]\s)'?;?/\1'/g" example.txt >> example_corr.txt
So now my output is what I wanted from my first point above:
1y4w 0 'my title no. 1' journal 344 471 480 2004 CODE UK 0022-2836 0070 ? 15522299 16.8768/urlspub714
1y4w 1 'my title no. 2' 3620131
1y44 0 'my title, no. 3.' journal 433 657 661 2005 CODE UK 0028-0836 0006 ? 15654328 10.1038/papukaj03284
2y42 1 'my title no. 4. ' 'journal' 66 738 ? 2010 ? DK 1744-3091 ? ? 20516614 10.1107/S174430911001626X
1y4p 0 'my title no.5. ; journal 44 6101 6121 2005 CODE US 0006-2960 0033 ? 15835899 10.1021/bi047813a
1y4p 0 'my title no.6. ; journal 44 6101 6121 2005 CODE US 0006-2960 0033 ? ? ?
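For the second point, a minimal sketch that applies both steps in one pass. It assumes GNU sed (for \s) and assumes that semicolons only ever occur as stray title delimiters, which holds for the sample above but may not hold for real data. The helper name fix_titles is hypothetical, and the ^ anchor is an addition that restricts the first substitution to the start of each row.

```shell
#!/bin/sh
# Hypothetical helper: reads rows on standard input and
#   1) puts a single quote after the 7-character row prefix,
#      swallowing an existing leading quote or semicolon, then
#   2) turns every remaining semicolon into a single quote.
fix_titles() {
    sed -E -e "s/^([a-z0-9]{4}\s[0-9]\s)'?;?/\1'/" -e "s/;/'/g"
}

printf '%s\n' "1y4p 0 my title no.6. ; journal 44" | fix_titles
```

On real data it is worth checking first whether any field other than the title can contain a semicolon, for example with grep -c ';' example.txt.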
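With extended regular expressions (sed -r ...), use ([a-z0-9]{4}\s[0-9]\s), i.e. {...} instead of \{...\}. Your attempt combined -r with \{4\}, which in extended syntax matches literal braces, so nothing matched.
With basic regular expressions (sed ... without -r), use \([a-z0-9]\{4\}\s[0-9]\s\), i.e. \(...\) instead of (...). Unescaped parentheses are literal characters there, so no group is defined and \1 is an invalid reference.
-E is more portable than -r for enabling extended regular expressions, and is the preferred option to use nowadays.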
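As a minimal sketch of the two syntaxes side by side, assuming GNU sed (both \s and the BRE form \? are GNU extensions), the following runs each variant on a sample row from the question. The function names quote_ere and quote_bre are only illustrative.

```shell
#!/bin/sh
# Extended regular expressions (-E): groups and intervals are unescaped.
quote_ere() {
    sed -E "s/([a-z0-9]{4}\s[0-9]\s)'?;?/\1'/"
}

# Basic regular expressions (no -E): groups, intervals and ? are escaped.
quote_bre() {
    sed "s/\([a-z0-9]\{4\}\s[0-9]\s\)'\?;\?/\1'/"
}

line="2y42 1 ;my title no. 4. ' 'journal' 66"
printf '%s\n' "$line" | quote_ere
printf '%s\n' "$line" | quote_bre
```

Both calls should print the same row, with the leading semicolon replaced by a single quote.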