I have a series of lines, each with an arbitrary number of comma-separated values followed by a comment marked with '##'. The challenge is to do the following using only PCRE2 regex, for use in Perl:
- Store the phrase after '##'
- Add a pipe to the phrase
- Remove the phrase from the end of the string
- Copy this stored phrase to each comma separated value (EDIT: this can be any number of CSVs)
- Replace commas with '##'
- Ensure no '##' or commas remain at the end of the line
Here is my test string:
quirky, stable, fun ##Paul and Jill
mean, rude, sad ##Dave
rich, foolish, gorgeous ##Amanda
Desired outcome:
Paul and Jill|quirky##Paul and Jill|stable##Paul and Jill|fun
Dave|mean##Dave|rude##Dave|sad
Amanda|rich##Amanda|foolish##Amanda|gorgeous
I am using PCRE2 regex to build a Perl search/replace string for use in ExifTool.
This regex code:
(.+?)(?:,|##)
matches three times, once per comma-separated value, up to the hash marks, and stores each value in group 1 of its own match.
Meanwhile, this:
(?<=##)(.*)|$
finds the phrase after '##', which is great! But as soon as I put them together:
(.+?)(?:,|##)(?<=##)(.*)|$
all of a sudden the separate matches of the comma-separated values in group 1 become one large group, so what I want as (quirky) (stable) (fun) becomes (quirky, stable, fun).
This is not useful to me because substitution won't work. Using:
$2|$1
Would give:
Paul and Jill|quirky, stable, fun
but I want:
Paul and Jill|quirky##Paul and Jill|stable##Paul and Jill|fun
**EDIT: Thanks to @Toto, I modified the regex to this: ^((\w+)\W+)*?(\w+)\h+(##(.+))$ which captures an arbitrary number of CSVs, but then the substitution would be impossible, as far as I can tell, because it would have to know how many groups are captured.**
I figure that if each comma-separated value is a separate match with its own group 1, there may be a way to pair the group 2 phrase after the hash marks with each of those matches.
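To illustrate the problem with that regex, here is a minimal Perl sketch (the variable names are only for illustration). It suggests that a capture group inside a repeated group keeps only its last iteration, so the earlier values are not available to the replacement at all:

```perl
use strict;
use warnings;

my $line = 'quirky, stable, fun ##Paul and Jill';

# In list context a successful match returns ($1, $2, $3, $4, $5).
my @groups = $line =~ /^((\w+)\W+)*?(\w+)\h+(##(.+))$/;

# Group 2 sits inside the repeated group, so only its final
# iteration survives; "quirky" is no longer in any group.
print "group 2: $groups[1]\n";   # group 2: stable
print "group 3: $groups[2]\n";   # group 3: fun
print "group 5: $groups[4]\n";   # group 5: Paul and Jill
```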
Intuitively, this makes sense to me: "copy the phrase after the hash and then replace all the commas with a modified form of the phrase".
I don't need help modifying the phrase; it's just getting the sample text to parse correctly. A multi-step regex solution is fine too; I'm not trying to optimize this just yet.
####
Find: (?:^|\G(?!^))(?:(\w+),\h(?=.*##(.+)$)|(\w+)\h##(.+)$)
Replace: (?2$2|$1##)(?3$4|$3)
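The conditional replacement syntax above is, as far as I know, a Boost-style feature that Perl's own s/// does not interpret. Since a multi-step solution is acceptable, here is a minimal sketch in plain Perl. The helper name tag_values is hypothetical, and the sketch assumes each line ends with a single '##' phrase and that the values themselves contain no commas:

```perl
use strict;
use warnings;

# Hypothetical helper: rewrites one "a, b, c ##Phrase" line
# into "Phrase|a##Phrase|b##Phrase|c".
sub tag_values {
    my ($line) = @_;

    # Step 1: capture the phrase after '##' and strip it from the end.
    # Lines without a '##' phrase are returned unchanged.
    return $line unless $line =~ s/\s*##\s*(.+?)\s*$//;
    my $phrase = $1;

    # Step 2: trim the remainder and split it on commas,
    # dropping any empty entries.
    $line =~ s/^\s+|\s+$//g;
    my @values = grep { length } split /\s*,\s*/, $line;

    # Step 3: prefix each value with "phrase|" and join with '##',
    # so no comma or '##' is left at the end of the line.
    return join '##', map { "$phrase|$_" } @values;
}

my @tests = (
    'quirky, stable, fun ##Paul and Jill',
    'mean, rude, sad ##Dave',
    'rich, foolish, gorgeous ##Amanda',
);
print tag_values($_), "\n" for @tests;
```

Because the values are split into a list rather than captured in numbered groups, the number of comma-separated values does not need to be known in advance. How this is wired into an ExifTool expression is not shown here.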