I need to delete all HTML tags, such as <p style="text-align: center;">, that appear inside <p class="glovo"></p>, except for <em> and </em>.

EXAMPLE:

<p class="glovo">In these <p style="text-align: center;"> situations we may be forgetting to really <em>bend</em> at our practice and <em>sweat</em> at it.</p>

MUST BECOME:

<p class="glovo">In these situations we may be forgetting to really <em>bend</em> at our practice and <em>sweat</em> at it.</p>

I use this GENERIC formula:

REGION-START(?=(?:(?!REGION-FINAL).)*?FIND REGEX)(?=(?:(?!REGION-FINAL).)).+?REGION-FINAL\R?

REGION-START = <p class="glovo">
REGION-FINAL = </p>
FIND REGEX = <(?!/)[^>]*[^/]>(?!<em>|</em>)

So, my final regex becomes:

FIND:

<p class="glovo">(?=(?:(?!</p>).)*?<(?!/)[^>]*[^/]>(?!<em>|</em>))(?=(?:(?!</p>).)).+?</p>\R?

REPLACE BY: (LEAVE EMPTY)

The problem is that my regex selects the ENTIRE <p class="glovo">...</p> element, not just the tags inside it. Can anyone help me?

1 Answer

  • Ctrl+H
  • Find what: (?:<p class="glovo">|\G).*?\K<(?!/?em>).*?>(?=.*</p>)
  • Replace with: LEAVE EMPTY
  • TICK Wrap around
  • SELECT Regular expression
  • Replace all

Explanation:

(?:                     # non capture group
    <p class="glovo">       # literally
  |                       # OR
    \G                      # restart from last match position
)                       # end group
.*?                     # 0 or more any character, not greedy
\K                      # forget all we have seen until this position
<                       # literally <
    (?!/?em>)               # not followed by em> or /em>, so <em> and </em> are skipped
    .*?                     # 0 or more any character, not greedy
    >                       # literally >
(?=.*</p>)              # positive lookahead, make sure we have </p> somewhere after
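
For readers who want to run the same cleanup outside Notepad++, here is a minimal Python sketch. Python's built-in re module does not support \K or \G, so this is not the regex above. It swaps in a two-step approach: first find each <p class="glovo">...</p> region, then strip every tag except <em> and </em> inside it. The function name strip_inner_tags is hypothetical, and the sketch assumes each region ends at the first </p> after its opening tag.

```python
import re

# Assumption: a region runs from <p class="glovo"> to the first following </p>.
REGION = re.compile(r'(<p class="glovo">)(.*?)(</p>)', re.DOTALL)
# Any tag that is not exactly <em> or </em>.
INNER_TAG = re.compile(r'<(?!/?em>)[^>]*>')


def strip_inner_tags(html: str) -> str:
    """Remove all tags except <em>/</em> inside each glovo paragraph."""
    def clean(match: re.Match) -> str:
        opening, body, closing = match.groups()
        # Only the body is cleaned; the outer <p ...> and </p> are kept.
        return opening + INNER_TAG.sub('', body) + closing

    return REGION.sub(clean, html)


sample = (
    '<p class="glovo">In these <p style="text-align: center;"> situations '
    'we may be forgetting to really <em>bend</em> at our practice and '
    '<em>sweat</em> at it.</p>'
)
print(strip_inner_tags(sample))
```

As with the Notepad++ replacement, only the tag itself is removed, so the spaces on either side of it both remain and the result has two spaces between "these" and "situations".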

  • Thanks, but your regex has a small bug: after Replace All, the final </p> tag is also deleted. It must not be deleted. Commented Oct 4, 2022 at 13:00
  • @HellenaCrainicu: True. See my updated answer.
    – Toto
    Commented Oct 4, 2022 at 13:43
  • Thanks. But what exactly does (?: (the non capture group) mean? When is it used? Commented Oct 4, 2022 at 13:47
  • @HellenaCrainicu: (?:....) is a non capture group. It groups part of the pattern without capturing it, which is what we want when the matched text is not reused afterwards. It is also slightly more efficient than a capture group, because the engine does not have to store the matched text.
    – Toto
    Commented Oct 4, 2022 at 13:52
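
The difference between the two kinds of group can be seen in a small Python sketch. The test string and patterns are made up for illustration and are not taken from the question. Both patterns match the same text, but only the capturing version stores what the first group matched.

```python
import re

text = "ababc"

# (ab)+ is a capture group: the last repetition it matched is stored.
capturing = re.match(r"(ab)+(c)", text)
# (?:ab)+ groups "ab" for the + quantifier but stores nothing.
non_capturing = re.match(r"(?:ab)+(c)", text)

print(capturing.groups())      # ('ab', 'c')
print(non_capturing.groups())  # ('c',)
```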
