I have an XML file where I would like to select blocks if and only if they contain a certain value:

<outer block>
  <name>A</name>
  <inner block>Hello</inner block>
</outer block>
<outer block>
  <name>B</name>
  <inner block>Hello again</inner block>
</outer block>
<outer block>
  <name>C</name>
  <inner block>Goodbye</inner block>
</outer block>
<outer block>
  <name>D</name>
  <inner block>Goodbye</inner block>
</outer block>

Notepad++ can query over multiple lines so I can treat this as a single line. I would like to select the two blocks with Goodbye so that the matched strings appear as:

<outer block>
  <name>C</name>
  <inner block>Goodbye</inner block>
</outer block>

<outer block>
  <name>D</name>
  <inner block>Goodbye</inner block>
</outer block>

I have got the following expression

<outer block>(.*?)Goodbye(.*?)</outer block>

, but the first match runs from the very first <outer block> all the way to the end of the first </outer block> after Goodbye.

Matched string 1

<outer block>
  <name>A</name>
  <inner block>Hello</inner block>
</outer block>
<outer block>
  <name>B</name>
  <inner block>Hello again</inner block>
</outer block>
<outer block>
  <name>C</name>
  <inner block>Goodbye</inner block>
</outer block>

and Matched string 2

<outer block>
  <name>D</name>
  <inner block>Goodbye</inner block>
</outer block>

Ideally, the match for C would look the same as the match for D.

I have tried lookaheads but Notepad++ does not accept them as valid expressions. Below is what I would expect the expression to look like.

<outer block>(.anything but another 'outer')Goodbye(.anything but another 'outer')</outer block>

Regex101.com does not use the particular flavour of regex that Notepad++ uses, so I'm at a loss.

1 Answer

I agree that it would be safer to do this with an XML/DOM parser. I don't know if this can be done with a Notepad++ plugin (perhaps some Python or some JS). You could also make a script on your side, without Notepad++, to do this, in your preferred language.
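As a rough idea of what such a script could look like, here is a minimal sketch in JavaScript (runnable with Node.js). It assumes the blocks are never nested, and the helper name `blocksContaining` is made up for this example. Instead of one clever pattern, it first cuts the text into individual blocks and then filters them:

```javascript
// Sample input, copied from the question (tag names contain a space, as posted).
const sample = [
  "<outer block>",
  "  <name>A</name>",
  "  <inner block>Hello</inner block>",
  "</outer block>",
  "<outer block>",
  "  <name>B</name>",
  "  <inner block>Hello again</inner block>",
  "</outer block>",
  "<outer block>",
  "  <name>C</name>",
  "  <inner block>Goodbye</inner block>",
  "</outer block>",
  "<outer block>",
  "  <name>D</name>",
  "  <inner block>Goodbye</inner block>",
  "</outer block>",
].join("\n");

// Hypothetical helper: return every <outer block>...</outer block> containing `word`.
// Assumption: blocks are not nested inside each other.
function blocksContaining(text, word) {
  // Step 1: cut the text into individual blocks.
  // The "s" flag lets "." match newlines; "g" returns all matches.
  const blocks = text.match(/<outer block>.*?<\/outer block>/gs) || [];
  // Step 2: keep only the blocks that mention the word.
  return blocks.filter((block) => block.includes(word));
}

const found = blocksContaining(sample, "Goodbye");
console.log(found.length); // 2
console.log(found.join("\n\n"));
```

Because every block is matched on its own before the filter runs, a match can never spill from one block into the next.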

Regex solution in Notepad++, as asked

But if you want to solve it inside Notepad++ with a relatively inefficient regular expression, I would use the following pattern:

<outer block>(?:.(?!<outer block>))*?Goodbye.*?<\/outer block>

I've modified the ungreedy pattern to match text before the "Goodbye" you are looking for, by adding a negative lookahead, which checks after each character if it is not followed by "<outer block>".

Test it live: https://regex101.com/r/yQJYR0/1
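The difference between the two patterns can also be reproduced outside Notepad++. The sketch below uses JavaScript's regex engine, which is not the engine Notepad++ uses, so treat it as an illustration rather than proof of Notepad++'s behaviour. The `s` flag plays the role of the ". matches newline" option in Notepad++'s Find dialog, and the sample text is the one from the question.

```javascript
// Sample input, copied from the question.
const sample = [
  "<outer block>",
  "  <name>A</name>",
  "  <inner block>Hello</inner block>",
  "</outer block>",
  "<outer block>",
  "  <name>B</name>",
  "  <inner block>Hello again</inner block>",
  "</outer block>",
  "<outer block>",
  "  <name>C</name>",
  "  <inner block>Goodbye</inner block>",
  "</outer block>",
  "<outer block>",
  "  <name>D</name>",
  "  <inner block>Goodbye</inner block>",
  "</outer block>",
].join("\n");

// The pattern from the question: lazy, but free to run across block boundaries.
const original = /<outer block>(.*?)Goodbye(.*?)<\/outer block>/gs;

// The pattern from this answer: a character is only consumed if it is
// not immediately followed by another "<outer block>".
const tempered = /<outer block>(?:.(?!<outer block>))*?Goodbye.*?<\/outer block>/gs;

const originalMatches = sample.match(original) || [];
const temperedMatches = sample.match(tempered) || [];

// Both find two matches, but the first match of the original swallows A and B.
console.log(originalMatches.length, temperedMatches.length); // 2 2
console.log(originalMatches[0].includes("<name>A</name>")); // true
console.log(temperedMatches[0].includes("<name>A</name>")); // false
console.log(temperedMatches[0]);
```

The original pattern starts at the first <outer block> it sees and lazily extends until it reaches Goodbye, which is why blocks A and B end up inside the first match. The lookahead stops that extension at the next opening tag, so the engine has to retry from a later <outer block>.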

As I said, it's not very efficient because it performs a lookahead check after every character. There might be a cleaner way to write the pattern. A small improvement is to have the negative lookahead test for the closing </outer block> tag instead of the opening one. This reduces the number of steps during the search:

<outer block>(?:.(?!<\/outer block>))*?Goodbye.*?<\/outer block>

Test the minor improvement: https://regex101.com/r/yQJYR0/2
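A quick way to convince yourself that the closing-tag variant selects the same blocks is to run both patterns on the sample and compare the results. The sketch below does that in JavaScript (again, not Notepad++'s own engine). It also tries a third spelling that puts the lookahead before the dot; that third form is an addition of this sketch, not one of the patterns tested in the links above.

```javascript
// Sample input, copied from the question.
const sample = [
  "<outer block>",
  "  <name>A</name>",
  "  <inner block>Hello</inner block>",
  "</outer block>",
  "<outer block>",
  "  <name>B</name>",
  "  <inner block>Hello again</inner block>",
  "</outer block>",
  "<outer block>",
  "  <name>C</name>",
  "  <inner block>Goodbye</inner block>",
  "</outer block>",
  "<outer block>",
  "  <name>D</name>",
  "  <inner block>Goodbye</inner block>",
  "</outer block>",
].join("\n");

const patterns = {
  // First pattern from the answer: stop before the next opening tag.
  openingTag: /<outer block>(?:.(?!<outer block>))*?Goodbye.*?<\/outer block>/gs,
  // Second pattern from the answer: stop before the closing tag.
  closingTag: /<outer block>(?:.(?!<\/outer block>))*?Goodbye.*?<\/outer block>/gs,
  // Assumption of this sketch: same idea with the lookahead placed before the dot.
  lookaheadFirst: /<outer block>(?:(?!<\/outer block>).)*?Goodbye.*?<\/outer block>/gs,
};

// Collect the matches of every pattern.
const results = Object.values(patterns).map((re) => sample.match(re) || []);

// Compare each result list with the first one.
const allSame = results.every(
  (r) => JSON.stringify(r) === JSON.stringify(results[0])
);

console.log(results[0].length); // 2
console.log(allSame); // true
```

On this sample all three spellings return the blocks for C and D. Note that none of them checks where Goodbye appears inside the block, so a block whose <name> contained Goodbye would be selected as well.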

It's not a very sexy solution, but at least it quickly solves your problem without having to write a script or a program.

DOMParser solution in JavaScript

Personally, I find it's a lot of work, compared to the quick and dirty regular expression. And it doesn't select the text in the editor. But for the ones that love commenting "Don't use regex, it's not made for that, use a parser", this would be the safer solution (except if your XML has errors).

You'll have to run it in full screen to see the results properly.

(function (doc) {
  /**
   * Escape some text to display it in HTML.
   *
   * @param {string} rawText The text to escape.
   * @return {string} The HTML escaped text.
   */
  function escape(rawText) {
    const span = doc.createElement("span");
    span.innerText = rawText;
    return span.innerHTML;
  }

  doc.addEventListener("DOMContentLoaded", () => {
    const inputTextarea = doc.getElementById("input");
    const searchButton = doc.getElementById("search");
    const resultsUl = doc.getElementById("results");

    // Click handler for the search button.
    searchButton.addEventListener("click", () => {
      const parser = new DOMParser();
      // Parse the input XML, wrapping it in a single root element.
      const xmlDoc = parser.parseFromString(
        "<document>" + inputTextarea.value + "</document>",
        "text/xml"
      );
      // Check whether a parse error occurred.
      const errorNode = xmlDoc.querySelector("parsererror");
      if (errorNode) {
        // The content of <parsererror> differs between browsers, so read its text directly.
        alert("Parse error!\n" + errorNode.textContent);
        return;
      }

      // Search for all inner blocks with an outer block as ancestor.
      const innerBlocks = xmlDoc.querySelectorAll("outer-block inner-block");

      // The list of outer blocks we want to find.
      const foundOuterBlocks = [];

      // Loop over each inner block to see if it contains the word "Goodbye".
      innerBlocks.forEach((innerBlock, i) => {
        if (innerBlock.textContent.match(/\bGoodbye\b/i)) {
          // Get the outer block, looping until we find it in the parents.
          let parent = innerBlock.parentNode;
          while (parent.nodeName !== "outer-block") {
            parent = parent.parentNode;
          }

          foundOuterBlocks.push(parent.outerHTML);
        }
      });

      console.log(foundOuterBlocks);

      // Update the results list.
      resultsUl.innerHTML = foundOuterBlocks
        .map((value) => {
          return "<li><pre><code>" + escape(value) + "</code></pre></li>";
        })
        .join("\n");
    });
  });
})(document);
form {
  display: flex;
  flex-direction: column;
  row-gap: .5em;
}

ul#results {
  margin: 1em 0;
  padding: 0;
  list-style: none;
  counter-reset: result-nbr;
}

ul#results li::before {
  content: "Found item n°" counter(result-nbr) ":";
  counter-increment: result-nbr;
  display: block;
  font-size: .8em;
}
<form action="#">
  <textarea name="input" id="input" cols="50" rows="8">&lt;outer-block&gt;
  &lt;name&gt;A&lt;/name&gt;
  &lt;inner-block&gt;Hello&lt;/inner-block&gt;
&lt;/outer-block&gt;
&lt;outer-block&gt;
  &lt;name&gt;B&lt;/name&gt;
  &lt;inner-block&gt;Hello again&lt;/inner-block&gt;
&lt;/outer-block&gt;
&lt;outer-block&gt;
  &lt;name&gt;C&lt;/name&gt;
  &lt;inner-block&gt;Goodbye&lt;/inner-block&gt;
&lt;/outer-block&gt;
&lt;outer-block&gt;
  &lt;name&gt;D&lt;/name&gt;
  &lt;inner-block&gt;Goodbye&lt;/inner-block&gt;
&lt;/outer-block&gt;</textarea>
  <input type="button" id="search"
         value="search for &lt;outer-block&gt; containing the word 'Goodbye' in a &lt;inner-block&gt;">
</form>

<ul id="results">
</ul>