Trying to replace a portion of a sed matching pattern with an alternate string

Question

I am trying to convert my library of personal science fiction book PDFs into epubs. The middle step where I am collecting all lines into a contiguous paragraph is collating correctly, but I end up with some string sequences which are incorrect, because they are output that way from a conversion tool or random cases from my collating process.

The offending string sequences are as identified in the following test script:

{
cat <<-EnDoFiNpUt
"I will do it," he said.
He said,"Be there ... or be square!"
He said",Be there ... or be square!"
They yelled:"Why now?"
Tell them "No"!{end-of-lineORspace}
EnDoFiNpUt
} |
sed 's+[a-zA-Z0-9]",[A-Z]+{somethingThree}+g' |      # (line 3)
sed 's+[a-zA-Z0-9],"[A-Z]+{somethingTwo}+g' |      # (line 2)
sed 's+[a-zA-Z0-9]:"[A-Z]+{somethingFour}+g' |      # (line 4)
sed 's+[a-zA-Z0-9]"[!]$+{somethingFive}+g'  |      # (line 5)
sed 's+[a-zA-Z0-9],"\ [A-Z]+{somethingOne}+g'      # (line 1)

My desired output should look like this (without the lines with the '^'s):

"I will do it", he said.
             ^^^
He said, "Be there ... or be square!"
       ^^^
He said, "Be there ... or be square!"     (same result for 2nd scenario)
       ^^^
They yelled: "Why now?"
           ^^^
Tell them "No!"
             ^^^

The issue is that I am specifying existence of a character (alphanum) before and after the pattern, but I want to replace only the strings that are being swapped, keeping the characters that were matched before or after those instances.

With my limited understanding of sed, the only other posting that seemed to come close to my own problem was this, but I couldn't decipher that, let alone try my own hand at using the technique for my problem.

I would prefer to do a post-collation sed operation, and not rework my already complex AWK logic for the paragraph recognition and collating.

How can I do that for those scenarios?

Peter Mortensen · Accepted Answer · 2024-05-12 23:49:34Z


Using sed

sed -E 's/(")(,)?([^"]*)((,)(") )?/\2\1\3\6\5/;
s/([[:punct:]])(")/\1 \2/;
s/("[^"]*)(")([[:punct:]])($| )/\1\3\2\4/;
s/",/& /' input_file

Output:

"I will do it", he said.
He said, "Be there ... or be square!"
He said, "Be there ... or be square!"
They yelled: "Why now?"
Tell them "No!"
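
The four substitutions in the script above can be read one at a time. The following is a commented sketch of the same commands, wrapped in a function named tidy_quotes (a hypothetical name, not part of the answer) and fed the sample lines from the question. The comments are one reading of what each command does on this input, not a general guarantee.

```shell
# Hypothetical wrapper (not part of the answer): the same four commands,
# written one per line with a comment before each.
tidy_quotes() {
  sed -E '
    # At the first double quote: a comma right after it moves in front of it;
    # a trailing ,"<space> after the quoted text becomes ", (space dropped).
    s/(")(,)?([^"]*)((,)(") )?/\2\1\3\6\5/
    # Put a space between the first punctuation character that is directly
    # followed by a double quote and that quote.
    s/([[:punct:]])(")/\1 \2/
    # Closing quote + punctuation + (end of line or space): move the
    # punctuation inside the quote.
    s/("[^"]*)(")([[:punct:]])($| )/\1\3\2\4/
    # Restore the space after a closing quote followed by a comma.
    s/",/& /
  '
}

tidied=$(printf '%s\n' \
  '"I will do it," he said.' \
  'He said,"Be there ... or be square!"' \
  'He said",Be there ... or be square!"' \
  'They yelled:"Why now?"' \
  'Tell them "No"!' | tidy_quotes)
printf '%s\n' "$tidied"
```

None of the commands uses the g flag, so each one acts only on the first match per line.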

answered Jul 16, 2023 at 1:25 by sseLtaH
edited May 12 at 23:49 by Peter Mortensen

markp-fuso · Accepted Answer · 2024-05-13 13:57:34Z

You're probably looking for capture groups: the text matched by a pattern enclosed in parens can be referenced with a numbered backreference (\1, \2, ...) in the replacement half of the s command.

A simple example:

$ echo 'ABC' | sed -E 's/(A)(B)(C)/\1x\2y\3/'
AxByC

Where:

-E - enables extended regexes and allows the use of parens to designate capture groups; without -E you need to escape each paren (eg, sed 's/\(A\)\(B\)\(C\)/\1x\2y\3/')
(A) (B) (C) - (define) 1st, 2nd and 3rd capture groups
\1 \2 \3 - (use/reference) 1st, 2nd and 3rd capture groups
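
Applied to one of the patterns from the question, the same idea keeps the alphanumeric character on either side of the match. This is a minimal sketch: the variable name fixed is only for illustration, and the pattern is the question's own with parens added around the two bracket expressions.

```shell
# \1 holds the character before the comma, \2 the capital letter after the
# opening quote; only the text between them is rewritten (a space is added).
fixed=$(printf '%s\n' 'He said,"Be there ... or be square!"' |
  sed -E 's/([a-zA-Z0-9]),"([A-Z])/\1, "\2/g')
printf '%s\n' "$fixed"
```

Because both surrounding characters are written back through \1 and \2, nothing outside the comma/quote pair changes.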

Focusing on requirement to move a comma to the 'outside' of a quoted string ...

need to define the start and end of the quoted string
need to define everything within the pair of quotes
assumes no embedded double quotes

Looking at a pair of the requirements:

#    ",body_of_quote"    =>    ,"body_of_quote"

sed -E 's/",([^"]*")/,"\1/g'
          ^^ start of quote and initial comma
            ^^^^^^^^ capture group consisting of 0-or-more characters that are not a double quote, end of quote
                     ^^ reverse order of comma and double quote
                       ^^ copy of capture group

#    "body_of_quote,"    =>    "body_of_quote",

sed -E 's/("[^"]*),"/\1",/g'
          ^^^^^^^^ capture group consisting of start of quote plus 0-or-more characters that are not a double quote
                  ^^ comma and end of quote
                     ^^ copy of capture group
                       ^^ reverse order of comma and double quote

Using a file for the sample input:

$ cat sample.dat
"I will do it," he said.
He said,"Be there ... or be square!"
He said",Be there ... or be square!"
They yelled:"Why now?"
Tell them "No"!

Testing the sed scripts:

$ sed -E 's/",([^"]*")/,"\1/g' sample.dat | sed -E 's/("[^"]*),"/\1",/g' | grep -Ei 'he said'
"I will do it", he said.
He said,"Be there ... or be square!"
He said,"Be there ... or be square!"

Combining into a single compound sed script:

$ sed -E 's/",([^"]*")/,"\1/g; s/("[^"]*),"/\1",/g' sample.dat | grep -Ei 'he said'
"I will do it", he said.
He said,"Be there ... or be square!"
He said,"Be there ... or be square!"

NOTES:

a similar approach can be used for the requirement to add a space before a quoted string
not sure I understand the requirement for the last line (add a space after !?) but a capture group will likely come in useful there, too
if you have problems getting the rest of the sed scripts to work then I'd suggest asking a new question that focuses solely on the sed scripts you're having issues with
you've stated you don't want to modify your current awk script; keep in mind that everything we're doing here with sed can be done in awk and would be more efficient (ie, no need to spawn a separate (sub)process for the sed scripting)
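
As a rough sketch of how the remaining requirements might be handled in the same style, the two commands above can be followed by two more: one that adds a space between a comma or colon and a quoted string, and one that moves a ! from just after the closing quote to just inside it. The function name fix_quotes and the last two commands are assumptions for illustration, not part of the answer. They have only been checked against the five sample lines and, like the commands above, assume no embedded double quotes.

```shell
# Hypothetical helper: reads text on stdin, writes the corrected text.
fix_quotes() {
  sed -E '
    # ",body"  =>  ,"body"     (from the answer)
    s/",([^"]*")/,"\1/g
    # "body,"  =>  "body",     (from the answer)
    s/("[^"]*),"/\1",/g
    # ,"body"  =>  , "body"    and    :"body"  =>  : "body"   (assumed)
    s/([,:])("[^"]*")/\1 \2/g
    # "body"!  =>  "body!"     (assumed)
    s/("[^"]*)"!/\1!"/g
  '
}

result=$(printf '%s\n' \
  '"I will do it," he said.' \
  'He said,"Be there ... or be square!"' \
  'He said",Be there ... or be square!"' \
  'They yelled:"Why now?"' \
  'Tell them "No"!' | fix_quotes)
printf '%s\n' "$result"
```

The order matters here: the comma is first moved outside the quotes, and only then is the space inserted in front of the opening quote.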
