I have an XML file (thousands of records, simplified here) with a structure like this:

<LIST>
<ITEM_0>
<NAME>Item Name</NAME>
</ITEM_0>
...
<ITEM_9999>
<NAME>Item Name</NAME>
</ITEM_9999>
</LIST>

I need this result:

<LIST>
<ITEM>
<ID>0</ID>
<NAME>Item Name</NAME>
</ITEM>
...
<ITEM>
<ID>9999</ID>
<NAME>Item Name</NAME>
</ITEM>
</LIST>

Using Regex:

Find: \<ITEM_(.*)(>)
Replace: ITEM>\n<ID>\1\</ID>

I get:

<LIST>
<ITEM>
<ID>0</ID>
<NAME>Item Name</NAME>
</ITEM>
<ID>0</ID> <-- This line not wanted
...
<ITEM>
<ID>9999</ID>
<NAME>Item Name</NAME>
</ITEM>
<ID>9999</ID> <-- This line not wanted
</LIST>

It's matching the closing </ITEM_#> tags as well, even though (I think) I'm asking it to match only the opening <ITEM_#> tags. What am I doing wrong, and how can I fix it? I may be missing something about grouping (or greediness?), but I'm not sure what, and I have looked all over for similar questions. There are a million ways to slice and dice this with some other tool, but it bugs me to get so close and not quite there with Notepad++ (NPP).

Help appreciated- thanks.

Late Edit: Even if I get the 1st replace to work right, just the <ITEM_#> tag, I'm still left with the </ITEM_#> closing tag as another search/replace operation. The problem here is the current operation replaces both the <ITEM and </ITEM tags...

  • Why not do a regular replace and replace the </ITEM_ with something else and then run your regex replacement?
    – Blerg
    Commented Aug 1, 2016 at 20:02
  • Yes, thanks, would work but take 2 replaces, whereas x2 search/replace in 1 regex solution below works OK (but with the Q there still outstanding).
    – Catch21
    Commented Aug 2, 2016 at 21:17

3 Answers

0

Yes, it's likely that the .* is too "greedy" and captures as many characters as it can; you need the opposite – the shortest possible match instead.

One method would be to use [^>]* instead – this would still match as many as possible, but only until the first >, so <ITEM_([^>]*)> would only match the opening tag and nothing more.

Depending on regex syntax, .*? might also work – this explicitly switches the * to "non-greedy".
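One more thing worth checking, offered as an observation rather than a definitive diagnosis: Notepad++ uses the Boost regex engine, where `\<` is a zero-width start-of-word assertion, not an escaped literal `<`. Read that way, `\<ITEM_(.*)(>)` matches `ITEM_0>` inside both `<ITEM_0>` and `</ITEM_0>`, which would explain the extra `<ID>` lines (and why the replacement in the question works without a leading `<`). Python's `re` has no `\<`, so the sketch below substitutes `\b` as a stand-in; that substitution and the sample data are assumptions.

```python
import re

xml = """<LIST>
<ITEM_0>
<NAME>Item Name</NAME>
</ITEM_0>
</LIST>"""

# \b stands in for Boost's \< here: a word boundary sits before 'ITEM'
# after '<' and also after '/', so both tags match.
# '.' does not cross newlines by default, so greedy .* stays on one line.
word_start = re.compile(r"\bITEM_(.*)(>)")
reproduced = word_start.sub(r"ITEM>\n<ID>\1</ID>", xml)
print(reproduced)

# With a literal '<' instead, only the opening tag matches.
literal = re.compile(r"<ITEM_(.*)(>)")
fixed = literal.sub(r"<ITEM>\n<ID>\1</ID>", xml)
print(fixed)
```

The first result has an `<ID>0</ID>` line after `</ITEM>`, as in the question; the second does not, although the closing tag still reads `</ITEM_0>` and needs its own replacement.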

0

Thanks grawity, that helped me broaden my search to cover doing multiple search-and-replace operations in one regex.

Trying the following works:

Find: </ITEM_.*(>)|<ITEM_(.*)(>)
Replace: (?1</ITEM>)(?2<ITEM>\n<ID>\2</ID>)
RegEx

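The | separates the two alternatives being searched for, and (?1...) and (?2...) are conditional replacements: each one is inserted only if capture group 1 or 2, respectively, took part in the match.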

But I have to look for the closing </ITEM tag first, not the <ITEM tag as you would logically figure. So I have a solution, but can anyone answer the question as to why the above works but the following, looking for <ITEM tag first, fails when we're just reversing the order in which we look?

Find: <ITEM_(.*)(>)|</ITEM_.*(>)
Replace: (?1<ITEM>\n<ID>\1</ID>)(?2</ITEM>)
RegEx

Not essential, but enquiring minds might like to know. Thanks.
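A likely explanation for the difference: capture groups are numbered by their opening parenthesis, left to right across the whole pattern, not per alternative. So `?1` and `?2` refer to groups 1 and 2 wherever they sit. The sketch below uses Python's `re` (an assumption; Python has no `(?1...)` conditional replacement, so it only reports which groups take part in each match). The helper `participating_groups` is a made-up name for illustration.

```python
import re

# The two Find patterns from this answer, unchanged.
works = re.compile(r"</ITEM_.*(>)|<ITEM_(.*)(>)")
fails = re.compile(r"<ITEM_(.*)(>)|</ITEM_.*(>)")

def participating_groups(pattern, line):
    """Return the numbers of the capture groups that matched something."""
    m = pattern.search(line)
    return [i for i in range(1, pattern.groups + 1) if m.group(i) is not None]

for name, pat in (("works", works), ("fails", fails)):
    for tag in ("<ITEM_0>", "</ITEM_0>"):
        print(name, tag, participating_groups(pat, tag))
```

In the first pattern, group 1 is set only for a closing tag and group 2 only for an opening tag, so `(?1...)` and `(?2...)` line up with the two cases. In the reversed pattern, an opening tag sets groups 1 and 2 together (so both replacements are emitted), while a closing tag sets only group 3 (so neither is, and the tag is replaced with nothing). Under that reading, `(?3</ITEM>)` in place of `(?2</ITEM>)` would be the counterpart for the reversed order.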

0
  • Ctrl+H
  • Find what: <ITEM_(\d+)>([\s\S]*)</ITEM_\1>
  • Replace with: <ITEM>\n<ID>$1</ID>$2</ITEM>
  • CHECK Match case
  • CHECK Wrap around
  • CHECK Regular expression
  • UNCHECK . matches newline
  • Replace all

Explanation:

<ITEM_          # literally
(\d+)           # group 1, 1 or more digits, you can use [^>]* if other characters than digits are allowed
>               # literally
([\s\S]*)       # group 2, 0 or more of any character, including line breaks
</ITEM_         # literally
\1              # backreference to group 1
>               # literally

Replacement:

<ITEM>          # literally
\n              # linefeed, use \r\n for Windows EOL
<ID>$1</ID>     # ID tag, with the content of group 1
$2              # content of group 2
</ITEM>         # literally

