I am trying to convert my personal library of science fiction PDFs into EPUBs. The intermediate step, which joins the extracted lines into contiguous paragraphs, collates correctly, but some punctuation sequences come out wrong, either because the conversion tool emitted them that way or because of edge cases in my collating process.
The offending sequences are shown in the following test script:
{
cat <<-EnDoFiNpUt
"I will do it," he said.
He said,"Be there ... or be square!"
He said",Be there ... or be square!"
They yelled:"Why now?"
Tell them "No"!{end-of-lineORspace}
EnDoFiNpUt
} |
sed 's+[a-zA-Z0-9]",[A-Z]+{somethingThree}+g' |    # input line 3
sed 's+[a-zA-Z0-9],"[A-Z]+{somethingTwo}+g' |      # input line 2
sed 's+[a-zA-Z0-9]:"[A-Z]+{somethingFour}+g' |     # input line 4
sed 's+[a-zA-Z0-9]"[!]$+{somethingFive}+g' |       # input line 5
sed 's+[a-zA-Z0-9]," [a-zA-Z]+{somethingOne}+g'    # input line 1 (the word after the quote may start in lowercase)
My desired output is shown below (the lines of '^'s only mark the corrected characters and are not part of the output):
"I will do it", he said.
^^^
He said, "Be there ... or be square!"
^^^
He said, "Be there ... or be square!" (same result as for the 2nd scenario)
^^^
They yelled: "Why now?"
^^^
Tell them "No!"
^^^
The issue is that my patterns require an alphanumeric character before and after the punctuation, so those characters become part of the match. I want to replace only the punctuation being swapped and keep the characters that were matched before and after it.
With my limited understanding of sed, the only other posting that seemed to come close to my problem was this, but I couldn't decipher it, let alone apply the technique to my own case.
I would prefer to do this as a post-collation sed pass rather than rework my already complex AWK logic for paragraph recognition and collation.
How can I do that for those scenarios?
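For reference, here is a minimal sketch of the kind of thing I am after, assuming sed capture groups and backreferences are the right tool. Each surrounding character is wrapped in `\(...\)` and written back with `\1` or `\2`, so only the punctuation between them changes. The function name `fix_quotes` is hypothetical, the expressions use only POSIX basic regular expressions, and the "end of line or space" case is split into two expressions because alternation is not portable in basic regular expressions. I have only checked it against the five sample lines, so it may well misfire on other text.

```shell
#!/bin/sh
# Hypothetical helper: reads collated text on stdin, writes repaired text to stdout.
# \(...\) captures a neighbouring character; \1 and \2 put it back unchanged.
fix_quotes() {
    sed \
        -e 's/\([a-zA-Z0-9]\)",\([A-Z]\)/\1, "\2/g' \
        -e 's/\([a-zA-Z0-9]\),"\([A-Z]\)/\1, "\2/g' \
        -e 's/\([a-zA-Z0-9]\):"\([A-Z]\)/\1: "\2/g' \
        -e 's/\([a-zA-Z0-9]\)"!$/\1!"/' \
        -e 's/\([a-zA-Z0-9]\)"! /\1!" /g' \
        -e 's/\([a-zA-Z0-9]\),"\( [a-zA-Z]\)/\1",\2/g'
}

# Usage: feed the sample lines from the test script through the filter.
fix_quotes <<'EOF'
"I will do it," he said.
He said,"Be there ... or be square!"
He said",Be there ... or be square!"
They yelled:"Why now?"
Tell them "No"!
EOF
```

All six expressions run in a single sed invocation, in the order given, so the pass can be appended to the existing pipeline after the AWK collation without touching it.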