I have a .txt file of publications in which columns are separated by a single space. However, titles also contain spaces, so in order to separate the columns correctly I need all titles to be in quotes. At the moment my data (example.txt) looks something like this:

1y4w 0 'my title no. 1' journal 344 471 480 2004 CODE UK 0022-2836 0070 ? 15522299 16.8768/urlspub714
1y4w 1 'my title no. 2' 3620131 
1y44 0 'my title, no. 3.' journal 433 657 661 2005 CODE UK 0028-0836 0006 ? 15654328 10.1038/papukaj03284 
2y42 1 ;my title no. 4. ' 'journal' 66 738 ? 2010 ? DK 1744-3091 ? ? 20516614 10.1107/S174430911001626X 
1y4p 0 'my title no.5. ; journal 44 6101 6121 2005 CODE US 0006-2960 0033 ? 15835899 10.1021/bi047813a 
1y4p 0 my title no.6. ; journal 44 6101 6121 2005 CODE US 0006-2960 0033 ? ? ? 

So my idea was to:

  1. put a single quote in front of the title first word and then
  2. replace the semicolons throughout.

I only have problems with the first point. Since I didn't know how to do this column-wise, I thought of a second way: every row (if correctly recorded) begins with a 7-character string, after which I want to put the quote if it is not already there. This string is: 4 characters (lowercase letters or digits), then a space, then a digit [0-9], and another space.

My best attempt is:

sed -r "s/([a-z0-9]\{4\}\s[0-9]\s)'?;?/\1'/g" example.txt >> example_corr.txt

That doesn't change anything, though. Also, removing the -r outputs the error:

sed: -e expression #1, char 51: invalid reference \1 on `s' command's RHS

I am still very new to UNIX and regular expressions, so I would appreciate any help or explanation.

P.S. I am using Windows 10 with the built-in ssh to connect to a Linux device.

UPDATE: solved (though not as a table with columns) with this:

sed -E "s/([a-z0-9]{4}\s[0-9]\s)'?;?/\1'/g" example.txt >> example_corr.txt

So now my output is what I wanted from the first point above:

1y4w 0 'my title no. 1' journal 344 471 480 2004 CODE UK 0022-2836 0070 ? 15522299 16.8768/urlspub714
1y4w 1 'my title no. 2' 3620131
1y44 0 'my title, no. 3.' journal 433 657 661 2005 CODE UK 0028-0836 0006 ? 15654328 10.1038/papukaj03284
2y42 1 'my title no. 4. ' 'journal' 66 738 ? 2010 ? DK 1744-3091 ? ? 20516614 10.1107/S174430911001626X
1y4p 0 'my title no.5. ; journal 44 6101 6121 2005 CODE US 0006-2960 0033 ? 15835899 10.1021/bi047813a
1y4p 0 'my title no.6. ; journal 44 6101 6121 2005 CODE US 0006-2960 0033 ? ? ?
  • Welcome to the site. Do you have multiple spaces on your lines, and if so, does it matter that they remain unchanged during the edit? Is there always a space between the single digit in the second "field" of all lines and the beginning of the title (be it the first word or the opening single quote)?
    – AdminBee
    Commented Jan 26, 2021 at 14:49
  • For extended regular expressions (sed -r ...), use ([a-z0-9]{4}\s[0-9]\s), i.e. {...} instead of \{...\}. For basic regular expressions (sed ... without -r), use \([a-z0-9]\{4\}\s[0-9]\s\), i.e. \(...\) instead of (...).
    – Bodo
    Commented Jan 26, 2021 at 14:49
  • You should edit your question and show the expected output matching your example input.
    – Bodo
    Commented Jan 26, 2021 at 14:51
  • Also, please note that -E is more portable than -r for enabling extended regular expressions, and is the preferred option to use nowadays.
    – AdminBee
    Commented Jan 26, 2021 at 14:53
  • Thank you @Bodo and @AdminBee! Using -E and removing the escape from the {} worked!
    – Magi
    Commented Jan 26, 2021 at 15:00
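
To spell out what the comments above are getting at: with -r or -E, \{4\} means literal braces, so the original pattern never matched; without it, ( and ) are literal characters, so no group exists and \1 is an invalid reference. Below is a minimal sketch of the same substitution written in both dialects. It makes a few assumptions that are not in the question: the two sample rows are shortened, a literal space stands in for \s (which is a GNU extension), the pattern is anchored with ^, and [';]* replaces '?;? because ? is not part of POSIX basic regular expressions.

```shell
# Two shortened sample rows based on the question's example.txt
input="2y42 1 ;my title no. 4. ' 'journal' 66
1y4p 0 my title no.6. ; journal 44"

# Extended syntax (-E): groups are ( ), intervals are { }, no backslashes
ere=$(printf '%s\n' "$input" | sed -E "s/^([a-z0-9]{4} [0-9] )[';]*/\1'/")

# Basic syntax (no -E): groups are \( \), intervals are \{ \}
bre=$(printf '%s\n' "$input" | sed "s/^\([a-z0-9]\{4\} [0-9] \)[';]*/\1'/")

# Both variables should hold the same two corrected rows
printf '%s\n' "$ere"
```

Either form puts exactly one quote after the 7-character prefix, whether the title originally started with a quote, a semicolon, or nothing.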

1 Answer

You can try this without the need of group capturing:

sed -e "s/ [ ';]*/ '/2" -e "s/ ; /' /" file

Output:

1y4w 0 'my title no. 1' journal 344 471 480 2004 CODE UK 0022-2836 0070 ? 15522299 16.8768/urlspub714
1y4w 1 'my title no. 2' 3620131 
1y44 0 'my title, no. 3.' journal 433 657 661 2005 CODE UK 0028-0836 0006 ? 15654328 10.1038/papukaj03284 
2y42 1 'my title no. 4. ' 'journal' 66 738 ? 2010 ? DK 1744-3091 ? ? 20516614 10.1107/S174430911001626X 
1y4p 0 'my title no.5.' journal 44 6101 6121 2005 CODE US 0006-2960 0033 ? 15835899 10.1021/bi047813a 
1y4p 0 'my title no.6.' journal 44 6101 6121 2005 CODE US 0006-2960 0033 ? ? ? 
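
In case it helps to see why this works, here is a sketch that runs the two expressions one after the other on a single row. The row is a shortened copy of the last example line, and the variable names step1 and step2 are only there for illustration. The trailing 2 on the first expression tells sed to replace only the second match on each line, which is the separator after the second field together with any stray quote or semicolon that follows it.

```shell
# Shortened copy of the last row from example.txt
line="1y4p 0 my title no.6. ; journal 44 6101 6121 2005"

# Expression 1: replace only the 2nd match of " [ ';]*" with " '",
# which places an opening quote in front of the title
step1=$(printf '%s\n' "$line" | sed -e "s/ [ ';]*/ '/2")

# Expression 2: turn the first " ; " into a closing quote plus a space
step2=$(printf '%s\n' "$step1" | sed -e "s/ ; /' /")

printf '%s\n%s\n' "$step1" "$step2"
```

Note that this relies on the title being the third space-separated field and on " ; " appearing only where a closing quote is missing; a title that contains " ; " itself would be cut short.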
