I have a series of lines, each with an arbitrary number of comma-separated values followed by a comment marked with '##'. The challenge is, using only PCRE2 regex for use in Perl, to do the following:

  • Store the phrase after '##'
  • Add a pipe to the phrase
  • Remove the '##' phrase from the end of the string
  • Copy this stored phrase to each comma-separated value (EDIT: there can be any number of values)
  • Replace commas with '##'
  • Ensure no '##' or commas remain at the end of the line

Here is my test string:

quirky, stable, fun ##Paul and Jill
mean, rude, sad ##Dave
rich, foolish, gorgeous ##Amanda

Desired outcome:

Paul and Jill|quirky##Paul and Jill|stable##Paul and Jill|fun
Dave|mean##Dave|rude##Dave|sad
Amanda|rich##Amanda|foolish##Amanda|gorgeous

I am using PCRE2 regex to build a Perl search/replace string for use in ExifTool.

This regex code:

(.+?)(?:,|##)

matches three times on the first line, once per value up to the hash marks, and stores each value in group 1 of its own match.

Meanwhile, this:

(?<=##)(.*)|$

finds the phrase after '##', which is great! But as soon as I put them together:

(.+?)(?:,|##)(?<=##)(.*)|$

All of a sudden the separate matches for the comma-separated values collapse into one large group 1, so what I want as (quirky) (stable) (fun) becomes (quirky, stable, fun), which is not useful to me.
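The collapse can be reproduced outside ExifTool. Below is a small sketch in Python's `re` module, used here only as a stand-in for PCRE2 (an assumption; both engines treat these particular patterns the same way as far as I know). The lookbehind `(?<=##)` can only succeed immediately after `##`, so the lazy `(.+?)` is forced to keep expanding past every comma until it reaches the hash marks.

```python
import re

line = "quirky, stable, fun ##Paul and Jill"

# First pattern on its own: three separate matches, one value per match.
values = re.findall(r"(.+?)(?:,|##)", line)

# Combined pattern: (?<=##) fails after each comma, so the engine backtracks
# and stretches (.+?) until the delimiter it consumes is '##'.
combined = re.search(r"(.+?)(?:,|##)(?<=##)(.*)|$", line)

print(values)             # ['quirky', ' stable', ' fun ']
print(combined.groups())  # ('quirky, stable, fun ', 'Paul and Jill')
```

So the combined pattern matches once per line rather than once per value, which is why group 1 ends up holding the whole list.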

This is not useful because the substitution then won't work. Using:

$2|$1

Would give:

Paul and Jill|quirky, stable, fun

but I want:

Paul and Jill|quirky##Paul and Jill|stable##Paul and Jill|fun

EDIT: Thanks to @Toto, I modified the regex to this: ^((\w+)\W+)*?(\w+)\h+(##(.+))$ which matches an arbitrary number of values, but then the substitution seems impossible because it would have to know how many groups were captured, as far as I can tell.
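A quick check of that pattern shows a second obstacle: a capture group inside a repeated group keeps only its last iteration. The sketch below uses Python's `re` purely for illustration, with `[ \t]` standing in for `\h` because Python does not support `\h` (that substitution is mine, not part of the original pattern).

```python
import re

# [ \t] replaces \h, which Python's re module rejects.
pattern = re.compile(r"^((\w+)\W+)*?(\w+)[ \t]+(##(.+))$")

m = pattern.match("quirky, stable, fun ##Paul and Jill")

# Group 2 sits inside the repeated group, so it is overwritten on every
# iteration and only the final value survives.
print(m.group(2))  # stable
print(m.group(3))  # fun
print(m.group(5))  # Paul and Jill
```

"quirky" is matched but never retained, so no replacement string could refer to it, however many group references it contains.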

I figure if each match is a separate instance of group one, there may be a way to then match the group two phrase after the hash mark with each match from group one.

Intuitively, this makes sense to me: "copy the phrase after the hash and then replace all the commas with a modified form of the phrase".

I don't need help modifying the phrase; it's just getting the sample text to parse correctly. A multi-step regex solution is fine too, as I'm not trying to optimize this just yet.

  • How about this?
    – Toto
    Commented Feb 19, 2023 at 9:36
  • Thanks @Toto ... how would that work for an arbitrary number of comma separated values? I should have been more clear that there could be any number of values in front of the hash.
    Commented Feb 19, 2023 at 18:58
  • This: ^((\w+)\W+)*?(\w+)\h+(##(.+))$ (thanks for the inspiration @Toto) allows an arbitrary number of CSVs to be captured, but how would one create a dynamic substitution from that?
    Commented Feb 19, 2023 at 20:08
  • It is not possible to do in one step because PCRE2 doesn't support conditional replacement. Here is the first step that works with any number of values, but a second step is mandatory to remove everything after the #### mark.
    – Toto
    Commented Feb 20, 2023 at 9:29
  • With support of conditional replacement (as in Notepad++), the solution is Find: (?:^|\G(?!^))(?:(\w+),\h(?=.*##(.+)$)|(\w+)\h##(.+)$) Replace: (?2$2|$1##)(?3$4|$3)
    – Toto
    Commented Feb 20, 2023 at 10:28
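To make the two-step idea from the comments concrete, here is one way it might look. The patterns are my own reconstruction, since the ones linked in the comments are not preserved here, and Python's `re` is used only as a testable stand-in. It assumes each value is a single `\w+` word and writes `[ \t]` where PCRE2 or Perl would allow `\h`. The function name `two_step` is hypothetical.

```python
import re

def two_step(text: str) -> str:
    """Rewrite 'a, b, c ##Phrase' lines as 'Phrase|a##Phrase|b##Phrase|c'."""
    # Step 1: each value, plus its trailing comma (or the spaces before '##'),
    # becomes 'Phrase|value##'. The lookahead captures the phrase as group 2
    # without consuming it, so every match on the line can see it.
    step1 = re.sub(
        r"(\w+)(?:,[ \t]*|[ \t]+(?=##))(?=.*##(.+)$)",
        r"\2|\1##",
        text,
        flags=re.MULTILINE,
    )
    # Step 2: the last value is now followed by '####Phrase'; drop that tail.
    return re.sub(r"####.*$", "", step1, flags=re.MULTILINE)

sample = (
    "quirky, stable, fun ##Paul and Jill\n"
    "mean, rude, sad ##Dave\n"
    "rich, foolish, gorgeous ##Amanda"
)
print(two_step(sample))
```

Both replacement strings are plain (`\2|\1##` and empty), so neither step needs conditional replacement. Multi-word values or trailing whitespace after the phrase would need a looser pattern than this sketch uses.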

1 Answer

Your requirement can't be met in a single search/replace with the PCRE flavour, because it lacks conditional replacement.

But it can be done in Notepad++, which uses the Boost flavour:

  • Ctrl+H
  • Find what: (?:^|\G(?!^))(?:(\w+),\h(?=.*##(.+)$)|(\w+)\h##(.+)$)
  • Replace with: (?2$2|$1##)(?3$4|$3)
  • TICK Wrap around
  • SELECT Regular expression
  • UNTICK . matches newline
  • Replace all

Explanation:

(?:             # non capture group
    ^               # beginning of line
  |               # OR
    \G              # restart from last match position
    (?!^)           # not at beginning of line
)               # end group
(?:             # non capture group
    (\w+)           # group 1, 1 or more word character
    ,               # comma
    \h              # horizontal space
    (?=             # positive lookahead, make sure we have after:
        .*              # 0 or more any character but newline
        ##              # literally
        (.+)            # group 2, 1 or more any character but newline
        $               # end of line
    )               # end lookahead
  |               # OR
    (\w+)           # group 3, 1 or more word character
    \h              # horizontal space
    ##              # literally
    (.+)            # group 4, 1 or more any character but newline
    $               # end of line
)               # end group

Replacement:

(?2         # if group 2 exists
    $2          # print its content
    |           # with a pipe
    $1          # content of group 1
    ##          # literally
)           # end condition
(?3         # if group 3 exists
    $4          # print content of group 4  
    |           # a pipe
    $3          # content of group 3
)           # end condition
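For readers without Notepad++, the same conditional logic can be approximated in any language whose replace accepts a function. The Python sketch below is an illustration, not the Boost syntax itself: `[ \t]` replaces `\h`, and the `\G` anchor is dropped because Python's `re` lacks it, on the assumption that the input lines are well-formed. The names `expand` and `convert` are hypothetical.

```python
import re

# The two alternatives of the Find pattern, without the \G anchoring prefix.
PATTERN = re.compile(
    r"(\w+),[ \t](?=.*##(.+)$)|(\w+)[ \t]##(.+)$",
    re.MULTILINE,
)

def expand(match: re.Match) -> str:
    # Mirrors (?2$2|$1##): group 2 participates only in the first alternative.
    if match.group(2) is not None:
        return f"{match.group(2)}|{match.group(1)}##"
    # Mirrors (?3$4|$3): last value on the line, which also consumes the phrase.
    return f"{match.group(4)}|{match.group(3)}"

def convert(text: str) -> str:
    return PATTERN.sub(expand, text)

print(convert("mean, rude, sad ##Dave"))  # Dave|mean##Dave|rude##Dave|sad
```

The branch on `match.group(2)` does the job of the `(?2...)` and `(?3...)` conditions, which is the part a plain replacement string cannot express.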
