Make TSV out of multi-line list

Question

I have a list of items in which each item spans multiple lines. The token that separates the items is unique (one per item, the HTML <li> tag), and every piece of text I've seen is contained within a single paragraph (HTML <p>). I'd like to make a TSV from it, with these columns in order:

date
name
URL
summary

From what I've seen, every item contains both the URL and the name twice, so I went for the 1st URL and the 2nd name, as that seemed easiest to me. The summary might contain inline formatting tags (e.g. <strong>), so I use a negative lookahead for it, whereas the date shouldn't contain any tag, so I used a negated character class there.

The first 2 items are

    <li><p style="margin-bottom: 0in"><a href="https://www.rt.com/shows/on-contact/550756-america-long-war-race/">On
    Contact: Race and America's long war </a>
    </p>
    <p style="margin-bottom: 0in"><a href="https://www.rt.com/shows/on-contact/550756-america-long-war-race/">
  <font color="#000080">
    <img src="rt.com-on_contact-220405-no_blurb_html_1dff87941f1c724a.jpg" name="Image1" alt="On Contact: Race and America's long war" align="bottom" width="280" height="157" border="1"/>
  </font>
</a>
</p>
    <p style="margin-bottom: 0in">On the show, Chris Hedges discusses
    America's inner and outer wars and its nexus with capitalism and
    empire with Professor of Social and Cultural Analysis and History at
    New York University Nikhil Pal Singh. The internal violence in the
    United... 
    </p>
    <p style="margin-bottom: 0in">Feb 27, 2022 10:36</p>
    <li><p style="margin-bottom: 0in"><a href="https://www.rt.com/shows/on-contact/550319-george-washington-genocidal-colonist/">
  <font color="#000080">
    <img src="rt.com-on_contact-220405-no_blurb_html_198feb67032166ff.png" name="Image3" alt="On Contact: George Washington and the legacy of white supremacy" align="bottom" width="280" height="157" border="1"/>
  </font>
</a>
</p>
    <p style="margin-bottom: 0in"><strong><a href="https://www.rt.com/shows/on-contact/550319-george-washington-genocidal-colonist/">On
    Contact: George Washington and the legacy of white supremacy </a></strong>
    </p>
    <p style="margin-bottom: 0in">On the show, Chris Hedges discusses
    George Washington, the fallible human being and one of the principal
    architects of the United States, with author Nathaniel Philbrick. As
    America fractures into ideologically hostile camps, it colors how
    we... 
    </p>
    <p style="margin-bottom: 0in">Feb 25, 2022 09:09 
    </p>
    <li>[...]

and the regex I attempted is <li>.*<a href="([^"]+)".*alt="On Contact: ([^"]+)".*<p[^>]*>((?:.(?!<\/p>))+)<\/p><p[^>]*>([^<]+)< and if it worked it would be replaced by $4\t$2\t$1\t$3. I'd like for the regex to work in Notepad++.

Thank you kindly for your help

Update 1

The test string I used later has more list items and has display tags added to the summary (e.g. <strong>). Although it's inconsistent with the question's title, I also had to remove tabs, as they were interfering with TSV creation, and I thought I might as well remove newlines in the same pass (I removed [\t\r\n]), resulting in:

<li><p style="margin-bottom: 0in"><a href="https://www.rt.com/shows/on-contact/550756-america-long-war-race/">OnContact: Race and America's long war </a></p><p style="margin-bottom: 0in"><a href="https://www.rt.com/shows/on-contact/550756-america-long-war-race/">  <font color="#000080">    <img src="rt.com-on_contact-220405-no_blurb_html_1dff87941f1c724a.jpg" name="Image1" alt="On Contact: Race and America's long war" align="bottom" width="280" height="157" border="1"/>  </font></a></p><p style="margin-bottom: 0in">On the show, Chris Hedges discussesAmerica's inner and outer wars and its nexus with capitalism and <strong>empire</strong> with Professor of Social and Cultural Analysis and History atNew York University Nikhil Pal Singh. The internal violence in theUnited... </p><p style="margin-bottom: 0in">Feb 27, 2022 10:36</p><li><p style="margin-bottom: 0in"><a href="https://www.rt.com/shows/on-contact/550319-george-washington-genocidal-colonist/">  <font color="#000080">    <img src="rt.com-on_contact-220405-no_blurb_html_198feb67032166ff.png" name="Image3" alt="On Contact: George Washington and the legacy of white supremacy" align="bottom" width="280" height="157" border="1"/>  </font></a></p><p style="margin-bottom: 0in"><strong><a href="https://www.rt.com/shows/on-contact/550319-george-washington-genocidal-colonist/">OnContact: George Washington and the legacy of white supremacy </a></strong></p><p style="margin-bottom: 0in">On the show, <span class="host">Chris Hedges</span> discusses George Washington, the fallible human being and one of the principalarchitects of the United States, with author Nathaniel Philbrick. AsAmerica fractures into ideologically hostile camps, it colors howwe... 
</p><p style="margin-bottom: 0in">Feb 25, 2022 09:09 </p><li><p style="margin-bottom: 0in"><a href="https://www.rt.com/shows/on-contact/549103-oppenheimer-bomb-culture-bird/">  <font color="#000080">    <img src="rt.com-on_contact-220405-no_blurb_html_e46c470920b1171d.jpg" name="Image4" alt="On Contact: Oppenheimer & the bomb culture" align="bottom" width="420" height="236" border="1"/>  </font></a></p><p style="margin-bottom: 0in"><strong><a href="https://www.rt.com/shows/on-contact/549103-oppenheimer-bomb-culture-bird/">OnContact: Oppenheimer &amp; the bomb culture </a></strong></p><p style="margin-bottom: 0in">On the show, Chris Hedges discusses J.Robert Oppenheimer and the making of the bomb with author <span class="author">Kai Bird.J. Robert Oppenheimer</span>, &ldquo;the father of the atomic bomb,&rdquo;was by the end of World War II one of the most celebrated men inAmerica.... </p><p style="margin-bottom: 0in">Feb 20, 2022 06:10 </p><li><p style="margin-bottom: 0in"><a href="https://www.rt.com/shows/on-contact/469859-war-iran-stephen-kinzer/">  <font color="#000080">    <img src="rt.com-on_contact-220405-no_blurb_html_15449064d00f77f3.jpg" name="Image149" alt="On Contact â€“ War with Iran? Stephen Kinzer" align="bottom" width="420" height="236" border="1"/>  </font></a></p><p style="margin-bottom: 0in"><strong><a href="https://www.rt.com/shows/on-contact/469859-war-iran-stephen-kinzer/">OnContact &ndash; War with Iran? Stephen Kinzer </a></strong></p><p style="margin-bottom: 0in">Host Chris Hedges talks to journalistand author, Stephen Kinzer, on efforts by Saudi Arabia and Washington to cripple Iran&rsquo;s economy, inevitably putting Saudi Arabia, its Gulf allies and Washington on a collision course with the <em>Islamic</em>... 
</p><p style="margin-bottom: 0in">Sep 29, 2019 07:10 </p><li><p style="margin-bottom: 0in"><a href="https://www.rt.com/shows/on-contact/469339-future-amazon-rain-forest/">  <font color="#000080">    <img src="rt.com-on_contact-220405-no_blurb_html_b82502a96022a758.png" name="Image150" alt="The future of the Amazon rain forest â€“ Sonia Bone Guajajara" align="bottom" width="280" height="157" border="1"/>  </font></a></p><p style="margin-bottom: 0in"><strong><a href="https://www.rt.com/shows/on-contact/469339-future-amazon-rain-forest/">Thefuture of the Amazon rain forest &ndash; Sonia Bone Guajajara </a></strong></p><p style="margin-bottom: 0in">Host Chris Hedges talks to Sonia BoneGuajajara, leader of 300 indigenous ethnic groups in Brazil, aboutthe future of the Amazon rain forest, its people, climate change,and the competing goals of agrobusiness, multinational corporations,and the... </p><p style="margin-bottom: 0in">Sep 22, 2019 07:15 </p></ul>
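
A side effect of deleting [\t\r\n] outright is visible in the test string above: words that were separated only by a line break end up glued together ("OnContact", "discussesAmerica's"). A minimal Python sketch of the difference between deleting those characters and collapsing each whitespace run to one space; the sample string is a shortened stand-in for the real title, not the actual data:

```python
import re

# Shortened stand-in for a title that wraps across two lines.
raw = "On\nContact: Race and\tAmerica's long war"

# Deleting tabs and line breaks glues neighbouring words together.
deleted = re.sub(r'[\t\r\n]', '', raw)

# Replacing every run of whitespace with one space keeps word boundaries
# and still leaves no tab or newline that could break a TSV record.
collapsed = re.sub(r'\s+', ' ', raw)

print(deleted)
print(collapsed)
```

In Notepad++ the equivalent would be to search for \s+ and replace with a single space, assuming no field needs its original spacing preserved.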

What have you tried, and how has what you've tried failed? Ideally, you should provide a Minimal Complete Verifiable Example of what you've tried, and include specific information on how it failed, with error messages and/or erroneous output. Super User is not a code-writing service; the best questions are those which provide useful information so that those who answer can guide you to devising your own correct answer. See How to Ask a Good Question. — Jeff Zeitlin, Commented Apr 5, 2022 at 12:46

OnlineCop · Accepted Answer · 2022-04-08 15:03:45Z

I like to break apart the problem and try to optimize away any .* or .*? that I find. Note that if the structure of the HTML changes, this has a much higher chance to break.

I'm also a fan of regexes that support the /x flag so I can add whitespace and comments to help everything fit into my brain.

This is what I've come up with, peppered with comments to help understand what each section is doing:

<li>
(?>[<](?!a\b)[^<>]*[>]|[^<>]+)*
<a\shref="(?<url>[^"]+)"[^>]*>

# Match until we reach '<img'
(?>[<](?!img\b)[^<>]*[>]|[^<>]+)*
<img

# Match until we reach 'alt=' within '<img...>'
(?>[^<>=]*+(?<!alt)=|"[^<>"=]*"\s)*
alt="(?:On\sContact[\s–:\-â€“]*)?(?<on_contact>[^"]+)"[^<>]*>

# Match until it reaches a '<p...>' that does not contain some other opening '<' tag element.
(?>[<](?!p\b)[^<>]*[>]|[^<>]+|<p[^>]*>\s*<(?!\/?p\b)[^<>]*>)*
<p[^>]*>

# Match 'stuff stuff ... stuff stuff' without including trailing whitespace.
(?<desc>[^<>\s]+(?>\s+[^<>\s]+)*
  # Handle <strong>...</strong> nested tags
  (?>\s*[<](?!\/p)[^<>]*[>]|\s*[^<>\s]+(?>\s+[^<>\s]+)*)*
)

\s*<\/p>

# Match until we reach another '<p...>'
(?>[<](?!p\b)[^<>]*[>]|[^<>]+)*
<p[^>]*>

# Capture the date
(?<date>[^<]+)

# Match until we reach a '<li>' (or end of string)
(?>[<](?!li\b)[^<>]*[>]|[^<>]+)*

You can see this acting on your original text here.

The same regex, but with the comment lines and whitespace stripped out can be found here as well, which should be able to just drop into Notepad++ or whatever PCRE2-compliant tool you have.
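
To illustrate the "match until we reach a tag" building block on its own, here is a hedged Python sketch. It is not the regex above: Python's re module spells named groups (?P<name>...) and only gained atomic groups in version 3.11, so the skipping part is rewritten here as a plain non-capturing group that consumes either one whole non-img tag or one text character. The helper name first_image_title and the shortened sample markup (src="image3.png") are assumptions made for the example.

```python
import re

# Consume whole tags that are not <img ...>, or single text characters,
# so the scan stops exactly where the first <img begins.
SKIP_TO_IMG = r'(?:<(?!img\b)[^<>]*>|[^<>])*'

FIRST_IMAGE_TITLE = re.compile(
    r'<li>' + SKIP_TO_IMG +
    r'<img\b[^<>]*?\balt="'
    # Optional "On Contact" prefix followed by spaces, colon, hyphen or en dash.
    r'(?:On\sContact[\s:\-\u2013]*)?'
    r'(?P<title>[^"]+)"'
)

def first_image_title(item):
    """Return the alt text of the first image in a list item, minus the prefix."""
    match = FIRST_IMAGE_TITLE.search(item)
    return match.group('title') if match else None

item = (
    '<li><p style="margin-bottom: 0in">'
    '<a href="https://www.rt.com/shows/on-contact/550319-george-washington-genocidal-colonist/">\n'
    '<font color="#000080"><img src="image3.png" name="Image3" '
    'alt="On Contact: George Washington and the legacy of white supremacy" '
    'width="280"/></font></a></p>'
)
print(first_image_title(item))
```

Because each alternative either swallows a complete tag or a single non-bracket character, the scan cannot run past the <img it is looking for, which is the property the .* and .*? versions lack.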

I'm not sure if this answer will be edited in the future but with the version I marked as answer, I did a replace from <li>(?>[<](?!a\b)[^<>]*[>]|[^<>]+)*<a\shref="(?<url>[^"]+)"[^>]*>(?>[<](?!img\b)[^<>]*[>]|[^<>]+)*<img(?>[^<>=]*+(?<!alt)=|"[^<>"=]*"\s)*alt="(?:On\sContact[\s–:\-â€“]*)?(?<on_contact>[^"]+)"[^<>]*>(?>[<](?!p\b)[^<>]*[>]|[^<>]+|<p[^>]*>\s*<(?!\/?p\b)[^<>]*>)*<p[^>]*>(?<desc>[^<>\s]+(?>\s+[^<>\s]+)*(?>\s*[<](?!\/p)[^<>]*[>]|\s*[^<>\s]+(?>\s+[^<>\s]+)*)*)\s*<\/p>(?>[<](?!p\b)[^<>]*[>]|[^<>]+)*<p[^>]*>(?<date>[^<]+)(?>[<](?!li\b)[^<>]*[>]|[^<>]+)* to $4\t$2\t$1\t$3\n. — DynV, Commented Apr 8, 2022 at 16:30

Toto · Accepted Answer · 2022-04-05 18:33:24Z


Your regex contains some errors that prevent it from matching the text.

Remove the escaping of the forward slash, which is unnecessary in Notepad++: \/ ==> /
Replace all your .* with non-greedy ones: .*?
Your Tempered Greedy Token is in the wrong order: (?:.(?!<\/p>))+ should be (?:(?!</p>).)+
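
The effect of the token order can be checked outside Notepad++. The following is a small Python sketch on a made-up paragraph (the sample text is an assumption, shortened from the question's summary): with the character consumed before the lookahead, the last consumed character is never directly followed by </p>, so the literal </p> that comes next in the pattern can never match.

```python
import re

html = '<p>wars and <strong>empire</strong> with Singh. </p><p>Feb 27, 2022 10:36</p>'

# Order used in the question: consume a character, THEN require that
# </p> does not follow it. The group always stops one character short.
char_then_check = re.search(r'<p[^>]*>((?:.(?!</p>))+)</p>', html, re.S)

# Tempered greedy token: check that </p> does not start here, THEN consume.
check_then_char = re.search(r'<p[^>]*>((?:(?!</p>).)+)</p>', html, re.S)

print(char_then_check)            # no match at all
print(check_then_char.group(1))   # the full paragraph content, tags included
```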

Moreover, the 2 <li>'s in your sample text don't have the same structure:

the former has the image in the second <p> paragraph
the latter has the image in the first <p> paragraph

so the capture groups don't capture the same data.
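
One way to see the effect of the differing structure is the Python sketch below. The two items are heavily shortened stand-ins for the question's samples (placeholder URLs u1/u2 and file names are assumptions). A pattern that accepts any paragraph after the image captures the summary in one item and the title link in the other, while a pattern that only accepts a tag-free paragraph lands on the summary in both.

```python
import re

# Title paragraph first, image paragraph second.
item_title_first = (
    '<li><p><a href="u1">On Contact: Race </a></p>'
    '<p><a href="u1"><img src="a.jpg" alt="On Contact: Race"/></a></p>'
    '<p>Summary one.</p><p>Feb 27, 2022 10:36</p>'
)
# Image paragraph first, title paragraph second.
item_image_first = (
    '<li><p><a href="u2"><img src="b.png" alt="On Contact: Washington"/></a></p>'
    '<p><strong><a href="u2">On Contact: Washington </a></strong></p>'
    '<p>Summary two.</p><p>Feb 25, 2022 09:09</p>'
)

# Takes the first paragraph after the image, whatever it holds.
any_paragraph = re.compile(
    r'<li>.*?alt="[^"]+".*?<p[^>]*>((?:(?!</p>).)+?)</p>', re.S)
# Only accepts a paragraph whose content has no tags at all.
tag_free_paragraph = re.compile(
    r'<li>.*?alt="[^"]+".*?<p[^>]*>((?:(?![<>]).)+?)</p>', re.S)

def third_field(pattern, item):
    return pattern.search(item).group(1)

for item in (item_title_first, item_image_first):
    print(third_field(any_paragraph, item), '|', third_field(tag_free_paragraph, item))
```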

You can review the regex here

I've changed your regex a little. Assuming the wanted paragraph doesn't contain any tags, it works for your example:

<li>.*?<a href="([^"]+)".*?alt="On Contact: ([^"]+)".*?<p[^>]*>((?:(?![<>]).)+?)</p>.*?<p[^>]*>([a-zA-Z]{3} \d\d?, \d{4} \d\d?:\d\d)\s*</p>

Demo & explanation

In action in Notepad++

Ctrl+H
Find what: <li>.*?<a href="([^"]+)".*?alt="On Contact: ([^"]+)".*?<p[^>]*>((?:(?![<>]).)+?)</p>.*?<p[^>]*>([a-zA-Z]{3} \d\d?, \d{4} \d\d?:\d\d)\s*</p>
Replace with: $4\n$2\n$1\n$3\n\n
CHECK Wrap around
CHECK Regular expression
CHECK . matches newline
Replace all
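
For readers who want to check the pattern outside Notepad++, here is a minimal Python sketch that applies the same regex and writes the four groups as one tab-separated record per item. Some points are assumptions rather than part of the steps above: re.DOTALL stands in for ". matches newline", the fields are joined with tabs in the order date, name, URL, summary as requested in the question, whitespace inside each field is collapsed, and the sample is a trimmed copy of the question's first item with a placeholder image name.

```python
import re

ITEM = re.compile(
    r'<li>.*?<a href="([^"]+)".*?alt="On Contact: ([^"]+)"'
    r'.*?<p[^>]*>((?:(?![<>]).)+?)</p>'
    r'.*?<p[^>]*>([a-zA-Z]{3} \d\d?, \d{4} \d\d?:\d\d)\s*</p>',
    re.DOTALL,  # plays the role of ". matches newline"
)

def to_tsv(html):
    """Return one 'date<TAB>name<TAB>URL<TAB>summary' line per matched item."""
    rows = []
    for url, name, summary, date in ITEM.findall(html):
        fields = (date, name, url, summary)
        # Collapse newlines/tabs/space runs so each record stays on one line.
        rows.append('\t'.join(' '.join(field.split()) for field in fields))
    return '\n'.join(rows)

sample = '''<li><p style="margin-bottom: 0in"><a href="https://www.rt.com/shows/on-contact/550756-america-long-war-race/">On
Contact: Race and America's long war </a>
</p>
<p style="margin-bottom: 0in"><a href="https://www.rt.com/shows/on-contact/550756-america-long-war-race/">
<img src="image1.jpg" alt="On Contact: Race and America's long war" width="280"/>
</a>
</p>
<p style="margin-bottom: 0in">On the show, Chris Hedges discusses
America's inner and outer wars.
</p>
<p style="margin-bottom: 0in">Feb 27, 2022 10:36</p>
'''
print(to_tsv(sample))
```

As with the Notepad++ version, titles whose alt text does not start with "On Contact: " and summaries that contain inline tags are not matched by this pattern.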


edited Apr 5, 2022 at 18:33

answered Apr 5, 2022 at 15:11


You're not wrong about them being different, but what matters is that I get the data for the TSV. The summary in both examples starts with "On the show, Chris Hedges discusses", but at least one of the couple hundred items in the full list doesn't match that. What I'm seeing is that if a paragraph contains a URL, it won't contain a summary. I think <p[^>]*>((?:(?!</p>).)+?) should have something just before the capture to check that the paragraph doesn't contain a URL. I searched in vain; the best query I came up with was "regex skipping (paragraphs OR sentences) that contains a specific word".
– DynV
Commented Apr 5, 2022 at 17:02
PS: Another possibility would be to make a "hard" capture for the date with <p[^>]*>((?:Jan|Feb|Mar|...) [^<]+)< and have the content of the paragraph just before it captured.
– DynV
Commented Apr 5, 2022 at 17:44
It seems to remove the undesired content but leaves a lot of white-space I don't want. My goal is to make a flat-file DB out of the list, which is why I requested a TSV both in the thread title and in the 1st paragraph of the OP (that I'd paste in Openoffice Calc). Only if/when the previous part of this post has been dealt with: is it impossible to do the check that I suggested 2 comments ago, something like <p[^>]*>ADDITIONAL_REGEX_HERE((?:(?!</p>).)+?), or is it significantly harder to "script"?
– DynV
Commented Apr 5, 2022 at 21:35
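
The "hard" date capture suggested in the comments above could be sketched as follows. This is only one possible reading of that idea, written in Python for easy testing: the full month list is an assumption filled in for the example, the summary group is a tempered token that cannot cross a </p>, and the sample item is a shortened stand-in with a placeholder URL. Because the date paragraph must follow immediately, only the paragraph right before the date can be captured, whether or not it contains inline tags.

```python
import re

# Assumed complete list of month abbreviations used by the dates.
MONTHS = 'Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec'

SUMMARY_AND_DATE = re.compile(
    r'<p[^>]*>((?:(?!</p>).)+?)\s*</p>\s*'       # paragraph right before the date
    r'<p[^>]*>((?:' + MONTHS + r') [^<]+)<',     # "hard" capture of the date
    re.S,
)

html = (
    '<li><p><a href="u"><img src="a.png" alt="On Contact: Washington"/></a></p>'
    '<p><strong><a href="u">On Contact: Washington </a></strong></p>'
    '<p>On the show, <span class="host">Chris Hedges</span> discusses Washington... \n</p>'
    '<p>Feb 25, 2022 09:09 </p>'
)

match = SUMMARY_AND_DATE.search(html)
summary = match.group(1)
date = match.group(2).strip()  # the date paragraph may carry a trailing space
print(date, summary, sep='\t')
```

The captured summary still contains its inline tags (here the span); stripping them would need a second replace pass.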
