0

I have a list of items, each spanning multiple lines. Every item starts with the same separator token (HTML <li>), and every piece of text I've seen is contained within a single paragraph (HTML <p>). I'd like to make a TSV from that list, with the following fields, in order:

  1. date
  2. name
  3. URL
  4. summary

From what I've seen, both the URL and the name appear twice in every item, so I went for the 1st URL and the 2nd name, as that seemed easiest to me. The summary might contain formatting tags (e.g. <strong>), so I match it with a negative lookahead; the date shouldn't contain any tags, so I used a negated character class for it instead.

The first 2 items are:

    <li><p style="margin-bottom: 0in"><a href="https://www.rt.com/shows/on-contact/550756-america-long-war-race/">On
    Contact: Race and America's long war </a>
    </p>
    <p style="margin-bottom: 0in"><a href="https://www.rt.com/shows/on-contact/550756-america-long-war-race/">
  <font color="#000080">
    <img src="rt.com-on_contact-220405-no_blurb_html_1dff87941f1c724a.jpg" name="Image1" alt="On Contact: Race and America's long war" align="bottom" width="280" height="157" border="1"/>
  </font>
</a>
</p>
    <p style="margin-bottom: 0in">On the show, Chris Hedges discusses
    America's inner and outer wars and its nexus with capitalism and
    empire with Professor of Social and Cultural Analysis and History at
    New York University Nikhil Pal Singh. The internal violence in the
    United... 
    </p>
    <p style="margin-bottom: 0in">Feb 27, 2022 10:36</p>
    <li><p style="margin-bottom: 0in"><a href="https://www.rt.com/shows/on-contact/550319-george-washington-genocidal-colonist/">
  <font color="#000080">
    <img src="rt.com-on_contact-220405-no_blurb_html_198feb67032166ff.png" name="Image3" alt="On Contact: George Washington and the legacy of white supremacy" align="bottom" width="280" height="157" border="1"/>
  </font>
</a>
</p>
    <p style="margin-bottom: 0in"><strong><a href="https://www.rt.com/shows/on-contact/550319-george-washington-genocidal-colonist/">On
    Contact: George Washington and the legacy of white supremacy </a></strong>
    </p>
    <p style="margin-bottom: 0in">On the show, Chris Hedges discusses
    George Washington, the fallible human being and one of the principal
    architects of the United States, with author Nathaniel Philbrick. As
    America fractures into ideologically hostile camps, it colors how
    we... 
    </p>
    <p style="margin-bottom: 0in">Feb 25, 2022 09:09 
    </p>
    <li>[...]

The regex I attempted is <li>.*<a href="([^"]+)".*alt="On Contact: ([^"]+)".*<p[^>]*>((?:.(?!<\/p>))+)<\/p><p[^>]*>([^<]+)< and, if it worked, each match would be replaced by $4\t$2\t$1\t$3. I'd like the regex to work in Notepad++.

Thank you kindly for your help

Update 1

The test string I used later has more list items and adds display tags to the summary (e.g. <strong>). Although it's inconsistent with the title, I also had to remove tabs, as they were interfering with the TSV creation, and I thought I might as well remove newlines in the same pass (I removed [\t\r\n]). The result:

<li><p style="margin-bottom: 0in"><a href="https://www.rt.com/shows/on-contact/550756-america-long-war-race/">OnContact: Race and America's long war </a></p><p style="margin-bottom: 0in"><a href="https://www.rt.com/shows/on-contact/550756-america-long-war-race/">  <font color="#000080">    <img src="rt.com-on_contact-220405-no_blurb_html_1dff87941f1c724a.jpg" name="Image1" alt="On Contact: Race and America's long war" align="bottom" width="280" height="157" border="1"/>  </font></a></p><p style="margin-bottom: 0in">On the show, Chris Hedges discussesAmerica's inner and outer wars and its nexus with capitalism and <strong>empire</strong> with Professor of Social and Cultural Analysis and History atNew York University Nikhil Pal Singh. The internal violence in theUnited... </p><p style="margin-bottom: 0in">Feb 27, 2022 10:36</p><li><p style="margin-bottom: 0in"><a href="https://www.rt.com/shows/on-contact/550319-george-washington-genocidal-colonist/">  <font color="#000080">    <img src="rt.com-on_contact-220405-no_blurb_html_198feb67032166ff.png" name="Image3" alt="On Contact: George Washington and the legacy of white supremacy" align="bottom" width="280" height="157" border="1"/>  </font></a></p><p style="margin-bottom: 0in"><strong><a href="https://www.rt.com/shows/on-contact/550319-george-washington-genocidal-colonist/">OnContact: George Washington and the legacy of white supremacy </a></strong></p><p style="margin-bottom: 0in">On the show, <span class="host">Chris Hedges</span> discusses George Washington, the fallible human being and one of the principalarchitects of the United States, with author Nathaniel Philbrick. AsAmerica fractures into ideologically hostile camps, it colors howwe... 
</p><p style="margin-bottom: 0in">Feb 25, 2022 09:09 </p><li><p style="margin-bottom: 0in"><a href="https://www.rt.com/shows/on-contact/549103-oppenheimer-bomb-culture-bird/">  <font color="#000080">    <img src="rt.com-on_contact-220405-no_blurb_html_e46c470920b1171d.jpg" name="Image4" alt="On Contact: Oppenheimer & the bomb culture" align="bottom" width="420" height="236" border="1"/>  </font></a></p><p style="margin-bottom: 0in"><strong><a href="https://www.rt.com/shows/on-contact/549103-oppenheimer-bomb-culture-bird/">OnContact: Oppenheimer &amp; the bomb culture </a></strong></p><p style="margin-bottom: 0in">On the show, Chris Hedges discusses J.Robert Oppenheimer and the making of the bomb with author <span class="author">Kai Bird.J. Robert Oppenheimer</span>, &ldquo;the father of the atomic bomb,&rdquo;was by the end of World War II one of the most celebrated men inAmerica.... </p><p style="margin-bottom: 0in">Feb 20, 2022 06:10 </p><li><p style="margin-bottom: 0in"><a href="https://www.rt.com/shows/on-contact/469859-war-iran-stephen-kinzer/">  <font color="#000080">    <img src="rt.com-on_contact-220405-no_blurb_html_15449064d00f77f3.jpg" name="Image149" alt="On Contact – War with Iran? Stephen Kinzer" align="bottom" width="420" height="236" border="1"/>  </font></a></p><p style="margin-bottom: 0in"><strong><a href="https://www.rt.com/shows/on-contact/469859-war-iran-stephen-kinzer/">OnContact &ndash; War with Iran? Stephen Kinzer </a></strong></p><p style="margin-bottom: 0in">Host Chris Hedges talks to journalistand author, Stephen Kinzer, on efforts by Saudi Arabia and Washington to cripple Iran&rsquo;s economy, inevitably putting Saudi Arabia, its Gulf allies and Washington on a collision course with the <em>Islamic</em>... 
</p><p style="margin-bottom: 0in">Sep 29, 2019 07:10 </p><li><p style="margin-bottom: 0in"><a href="https://www.rt.com/shows/on-contact/469339-future-amazon-rain-forest/">  <font color="#000080">    <img src="rt.com-on_contact-220405-no_blurb_html_b82502a96022a758.png" name="Image150" alt="The future of the Amazon rain forest – Sonia Bone Guajajara" align="bottom" width="280" height="157" border="1"/>  </font></a></p><p style="margin-bottom: 0in"><strong><a href="https://www.rt.com/shows/on-contact/469339-future-amazon-rain-forest/">Thefuture of the Amazon rain forest &ndash; Sonia Bone Guajajara </a></strong></p><p style="margin-bottom: 0in">Host Chris Hedges talks to Sonia BoneGuajajara, leader of 300 indigenous ethnic groups in Brazil, aboutthe future of the Amazon rain forest, its people, climate change,and the competing goals of agrobusiness, multinational corporations,and the... </p><p style="margin-bottom: 0in">Sep 22, 2019 07:15 </p></ul>
1
  • What have you tried, and how has what you've tried failed? Ideally, you should provide a Minimal Complete Verifiable Example of what you've tried, and include specific information on how it failed, with error messages and/or erroneous output. Super User is not a code-writing service; the best questions are those which provide useful information so that those who answer can guide you to devising your own correct answer. See How to Ask a Good Question.
    Commented Apr 5, 2022 at 12:46

2 Answers

0

I like to break apart the problem and try to optimize away any .* or .*? that I find. Note that if the structure of the HTML changes, this has a much higher chance to break.

I'm also a fan of regexes that support the /x flag so I can add whitespace and comments to help everything fit into my brain.

This is what I've come up with, peppered with comments to help understand what each section is doing:

<li>
(?>[<](?!a\b)[^<>]*[>]|[^<>]+)*
<a\shref="(?<url>[^"]+)"[^>]*>

# Match until we reach '<img'
(?>[<](?!img\b)[^<>]*[>]|[^<>]+)*
<img

# Match until we reach 'alt=' within '<img...>'
(?>[^<>=]*+(?<!alt)=|"[^<>"=]*"\s)*
alt="(?:On\sContact[\s–:\-–]*)?(?<on_contact>[^"]+)"[^<>]*>

# Match until it reaches a '<p...>' that does not contain some other opening '<' tag element.
(?>[<](?!p\b)[^<>]*[>]|[^<>]+|<p[^>]*>\s*<(?!\/?p\b)[^<>]*>)*
<p[^>]*>

# Match 'stuff stuff ... stuff stuff' without including trailing whitespace.
(?<desc>[^<>\s]+(?>\s+[^<>\s]+)*
  # Handle <strong>...</strong> nested tags
  (?>\s*[<](?!\/p)[^<>]*[>]|\s*[^<>\s]+(?>\s+[^<>\s]+)*)*
)

\s*<\/p>

# Match until we reach another '<p...>'
(?>[<](?!p\b)[^<>]*[>]|[^<>]+)*
<p[^>]*>

# Capture the date
(?<date>[^<]+)

# Match until we reach a '<li>' (or end of string)
(?>[<](?!li\b)[^<>]*[>]|[^<>]+)*

You can see this acting on your original text here.

The same regex, with the comment lines and whitespace stripped out, can be found here as well; you should be able to drop it straight into Notepad++ or whatever PCRE2-compliant tool you have.
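To try the "match until we reach '<img'" building block in isolation, here is a minimal Python sketch using re.VERBOSE, which behaves like the /x flag. It is not the regex above: the atomic group is swapped for a plain group that consumes "text, then one tag that is not <img" per iteration, so that it also runs on Python versions without atomic groups. The sample string and the group names url and alt are made up for illustration.

```python
import re

# Verbose mode lets the pattern carry its own comments, like the /x flag.
# Assumption: the atomic group is rewritten as "text, then one non-<img> tag",
# repeated, so each iteration can only match in one way.
UNTIL_IMG = re.compile(
    r"""
    <a\shref="(?P<url>[^"]+)"[^>]*>      # opening <a>, capture the URL
    (?:[^<>]*<(?!img\b)[^<>]*>)*         # any text, then a tag that is not <img
    [^<>]*                               # leftover text right before <img
    <img\b[^<>]*?\balt="(?P<alt>[^"]+)"  # first alt attribute inside the <img> tag
    """,
    re.VERBOSE,
)

# Hypothetical, shortened sample in the shape of the question's markup.
sample = (
    '<a href="https://www.rt.com/shows/on-contact/550756-america-long-war-race/">\n'
    '  <font color="#000080">\n'
    '    <img src="image1.jpg" name="Image1" alt="On Contact: Race and America\'s long war"/>\n'
)

match = UNTIL_IMG.search(sample)
url, alt = match.group("url"), match.group("alt")
print(url)
print(alt)
```

Because every tag between <a> and <img> has to be stepped over explicitly, the pattern cannot run past the end of the current item the way a stray .* can.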

1
  • I'm not sure if this answer will be edited in the future but with the version I marked as answer, I did a replace from <li>(?>[<](?!a\b)[^<>]*[>]|[^<>]+)*<a\shref="(?<url>[^"]+)"[^>]*>(?>[<](?!img\b)[^<>]*[>]|[^<>]+)*<img(?>[^<>=]*+(?<!alt)=|"[^<>"=]*"\s)*alt="(?:On\sContact[\s–:\-–]*)?(?<on_contact>[^"]+)"[^<>]*>(?>[<](?!p\b)[^<>]*[>]|[^<>]+|<p[^>]*>\s*<(?!\/?p\b)[^<>]*>)*<p[^>]*>(?<desc>[^<>\s]+(?>\s+[^<>\s]+)*(?>\s*[<](?!\/p)[^<>]*[>]|\s*[^<>\s]+(?>\s+[^<>\s]+)*)*)\s*<\/p>(?>[<](?!p\b)[^<>]*[>]|[^<>]+)*<p[^>]*>(?<date>[^<]+)(?>[<](?!li\b)[^<>]*[>]|[^<>]+)* to $4\t$2\t$1\t$3\n.
    – DynV
    Commented Apr 8, 2022 at 16:30
1

Your regex contains some errors that prevent it from matching the text.

  • Remove the backslash before the forward slash; the escape is useless in Notepad++: \/ ==> /
  • Replace all your greedy .* with non-greedy ones: .*?
  • Your tempered greedy token is in the wrong order: (?:.(?!</p>))+ should be (?:(?!</p>).)+
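The effect of the lookahead order in the last point can be checked with a small Python sketch; the one-paragraph sample string is made up for illustration.

```python
import re

# Hypothetical one-paragraph sample.
html = "<p>Feb 27, 2022 10:36</p>"

# Lookahead after the dot: the character right before </p> can never be
# consumed, so the group stops one character early and the literal </p>
# that follows the group fails to match there.
wrong_order = re.search(r"<p>((?:.(?!</p>))+)</p>", html)

# Lookahead before the dot: each character is checked before it is consumed,
# so the group stops exactly at </p>.
right_order = re.search(r"<p>((?:(?!</p>).)+)</p>", html)

print(wrong_order)           # None
print(right_order.group(1))  # Feb 27, 2022 10:36
```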

Moreover, the 2 <li> items in your sample text don't have the same structure:

  • the former has the image in its second <p> paragraph
  • the latter has the image in its first <p> paragraph

so the capture groups don't capture the same data.


You can review the regex here


I've changed your regex a little. Assuming the wanted paragraph doesn't contain any tags, it works for your example:

<li>.*?<a href="([^"]+)".*?alt="On Contact: ([^"]+)".*?<p[^>]*>((?:(?![<>]).)+?)</p>.*?<p[^>]*>([a-zA-Z]{3} \d\d?, \d{4} \d\d?:\d\d)\s*</p>

Demo & explanation


In action in Notepad++

  • Ctrl+H
  • Find what: <li>.*?<a href="([^"]+)".*?alt="On Contact: ([^"]+)".*?<p[^>]*>((?:(?![<>]).)+?)</p>.*?<p[^>]*>([a-zA-Z]{3} \d\d?, \d{4} \d\d?:\d\d)\s*</p>
  • Replace with: $4\n$2\n$1\n$3\n\n
  • CHECK Wrap around
  • CHECK Regular expression
  • CHECK . matches newline
  • Replace all
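Outside Notepad++, the same find/replace can be sketched in Python. This is only an illustration under a few assumptions: the sample item is a shortened, made-up version of the question's first item, re.DOTALL stands in for ". matches newline", and the fields are joined with tabs (as the question asks) rather than the newlines used in the replacement above. The squeeze helper is hypothetical and only collapses whitespace, so a multi-line summary stays on one TSV row.

```python
import re

# The pattern from above, unchanged; re.DOTALL plays the role of ". matches newline".
ITEM = re.compile(
    r'<li>.*?<a href="([^"]+)".*?alt="On Contact: ([^"]+)".*?'
    r'<p[^>]*>((?:(?![<>]).)+?)</p>.*?'
    r'<p[^>]*>([a-zA-Z]{3} \d\d?, \d{4} \d\d?:\d\d)\s*</p>',
    re.DOTALL,
)

# Hypothetical, shortened item in the shape of the question's sample.
sample = """<li><p><a href="https://www.rt.com/shows/on-contact/550756-america-long-war-race/">On
Contact: Race and America's long war </a>
</p>
<p><a href="https://www.rt.com/shows/on-contact/550756-america-long-war-race/">
<img src="image1.jpg" alt="On Contact: Race and America's long war"/>
</a>
</p>
<p>On the show, Chris Hedges discusses
America's inner and outer wars...
</p>
<p>Feb 27, 2022 10:36</p>
"""


def squeeze(text):
    """Collapse runs of whitespace (including newlines) into single spaces."""
    return " ".join(text.split())


# Field order matches the question: date, name, URL, summary.
rows = [
    "\t".join(squeeze(m.group(n)) for n in (4, 2, 1, 3))
    for m in ITEM.finditer(sample)
]
print("\n".join(rows))
```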


3
  • You're not wrong about them being different, but what matters is that I get the data for the TSV. The summary in both examples starts with "On the show, Chris Hedges discusses", but at least one of the couple hundred items in the full list doesn't. What I'm seeing is that if a paragraph contains a URL, it won't contain a summary. I think <p[^>]*>((?:(?!</p>).)+?)</p> should have something just before the capture to check that the paragraph doesn't contain a URL. I searched in vain; the best query I came up with was "regex skipping (paragraphs OR sentences) that contains a specific word".
    – DynV
    Commented Apr 5, 2022 at 17:02
  • PS: Another possibility would be to make a "hard" capture for the date with <p[^>]*>((?:Jan|Feb|Mar|...) [^<]+)< and capture the content of the paragraph just before it.
    – DynV
    Commented Apr 5, 2022 at 17:44
  • It seems to remove the undesired content but has a lot of white-space I don't want. My goal is to make a flat-file DB out of the list, thus why I requested a TSV both in the thread title and in the 1st paragraph of the OP (that I'd paste in Openoffice Calc). Only if/when the previous part of this post has been dealt with: Is it impossible to do the check that I suggested 2 comments ago, something like <p[^>]*>ADDITIONAL_REGEX_HERE((?:(?!</p>).)+?)</p>, or it's significantly harder to "script"?
    – DynV
    Commented Apr 5, 2022 at 21:35
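
Both ideas from these comments (skip any paragraph that contains a link, and anchor on a "hard" date) can be combined in one pattern. The Python sketch below is one possible reading, not a tested Notepad++ recipe: the tempered token gives up as soon as it meets <a inside the paragraph, and the month list is written out in full as an assumption, since the comment elides it. The sample item is shortened and made up.

```python
import re

# Assumption: three-letter English month abbreviations, as in the sample dates.
MONTHS = "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"

# A paragraph whose body never reaches "<a" before its "</p>", followed
# directly by a paragraph that starts with a month abbreviation.
SUMMARY_THEN_DATE = re.compile(
    r"<p[^>]*>((?:(?!</p>|<a\b).)+?)</p>\s*"
    r"<p[^>]*>((?:" + MONTHS + r") [^<]+)<",
    re.DOTALL,
)

# Hypothetical, shortened item: a linked title paragraph, then summary, then date.
sample = (
    '<li><p><strong><a href="https://www.rt.com/shows/on-contact/'
    '550319-george-washington-genocidal-colonist/">On Contact: George Washington '
    "and the legacy of white supremacy </a></strong>\n</p>\n"
    "<p>On the show, <strong>Chris Hedges</strong> discusses\nGeorge Washington...\n</p>\n"
    "<p>Feb 25, 2022 09:09 \n</p>\n"
)

# Collapse whitespace in the summary and trim the date.
pairs = [
    (" ".join(m.group(1).split()), m.group(2).strip())
    for m in SUMMARY_THEN_DATE.finditer(sample)
]
print(pairs)
```

The linked title paragraph is rejected because the token stops at <a, while inline tags such as <strong> inside the summary are allowed through.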
