I have a list of items, each spanning multiple lines. Every item starts with a unique separator token (the HTML <li> tag), and each piece of text I need is contained within a single paragraph (HTML <p>). I'd like to build a TSV from it, with these fields in order:
- date
- name
- URL
- summary
From what I've seen, in every item both the URL and the name appear twice, so I went for the 1st URL and the 2nd name, as that seemed easiest to me. The summary might contain inline formatting tags (e.g. <strong>), so I match it with a negative lookahead. The date shouldn't contain any tags, so I used a negated character class for it instead.
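To make the difference between those two techniques concrete, here is a minimal sketch in Python. It is only an illustration: the sample string is a shortened, hypothetical paragraph pair modeled on the first item, and Python's `re` module is not the engine Notepad++ uses, so the behaviour there is not verified by this snippet.

```python
import re

# Hypothetical, shortened sample modeled on the first item: a summary
# paragraph with an inline tag, followed by a date paragraph without tags.
sample = (
    '<p style="margin-bottom: 0in">On the show, Chris Hedges discusses '
    '<strong>empire</strong>... </p>'
    '<p style="margin-bottom: 0in">Feb 27, 2022 10:36</p>'
)

# Tempered pattern: consume a character only if "</p>" does not start there,
# so inner tags such as <strong> are allowed.
with_tags = re.compile(r'<p[^>]*>((?:(?!</p>).)+)</p>')

# Negated class: a run containing no "<" at all, so any inner tag
# prevents the paragraph from matching.
no_tags = re.compile(r'<p[^>]*>([^<]+)</p>')

summaries = with_tags.findall(sample)  # matches both paragraphs
dates = no_tags.findall(sample)        # matches only the tag-free paragraph

print(summaries)
print(dates)
```

The negated class is the stricter of the two, which is why it only suits the date field.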
The first two items are:
<li><p style="margin-bottom: 0in"><a href="https://www.rt.com/shows/on-contact/550756-america-long-war-race/">On
Contact: Race and America's long war </a>
</p>
<p style="margin-bottom: 0in"><a href="https://www.rt.com/shows/on-contact/550756-america-long-war-race/">
<font color="#000080">
<img src="rt.com-on_contact-220405-no_blurb_html_1dff87941f1c724a.jpg" name="Image1" alt="On Contact: Race and America's long war" align="bottom" width="280" height="157" border="1"/>
</font>
</a>
</p>
<p style="margin-bottom: 0in">On the show, Chris Hedges discusses
America's inner and outer wars and its nexus with capitalism and
empire with Professor of Social and Cultural Analysis and History at
New York University Nikhil Pal Singh. The internal violence in the
United...
</p>
<p style="margin-bottom: 0in">Feb 27, 2022 10:36</p>
<li><p style="margin-bottom: 0in"><a href="https://www.rt.com/shows/on-contact/550319-george-washington-genocidal-colonist/">
<font color="#000080">
<img src="rt.com-on_contact-220405-no_blurb_html_198feb67032166ff.png" name="Image3" alt="On Contact: George Washington and the legacy of white supremacy" align="bottom" width="280" height="157" border="1"/>
</font>
</a>
</p>
<p style="margin-bottom: 0in"><strong><a href="https://www.rt.com/shows/on-contact/550319-george-washington-genocidal-colonist/">On
Contact: George Washington and the legacy of white supremacy </a></strong>
</p>
<p style="margin-bottom: 0in">On the show, Chris Hedges discusses
George Washington, the fallible human being and one of the principal
architects of the United States, with author Nathaniel Philbrick. As
America fractures into ideologically hostile camps, it colors how
we...
</p>
<p style="margin-bottom: 0in">Feb 25, 2022 09:09
</p>
<li>[...]
The regex I attempted is:
<li>.*<a href="([^"]+)".*alt="On Contact: ([^"]+)".*<p[^>]*>((?:.(?!<\/p>))+)<\/p><p[^>]*>([^<]+)<
If it worked, each match would be replaced by $4\t$2\t$1\t$3. I'd like the regex to work in Notepad++.
Thank you kindly for your help
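For reference, the intended extraction can be sketched outside Notepad++. The Python snippet below is an assumption-laden illustration, not a tested Notepad++ solution: the two items are abridged from the flattened test string in Update 1, the image file names are shortened placeholders, and the name `to_tsv` is hypothetical. It differs from the attempt above in a few ways that are my own choices: the quantifiers are lazy so a match stays within one item, the whole alt text is captured because not every item starts with "On Contact: ", and the lookahead is placed before the dot. As written in the attempt, `(?:.(?!<\/p>))+` forbids every consumed character from being followed by `</p>`, so the `<\/p>` that comes right after the group can never match.

```python
import re

# Two items abridged from the flattened test string in Update 1.
# The image file names ("a.png", "b.jpg") are shortened placeholders.
html = (
    '<li><p style="margin-bottom: 0in">'
    '<a href="https://www.rt.com/shows/on-contact/550319-george-washington-genocidal-colonist/">'
    ' <img src="a.png" alt="On Contact: George Washington and the legacy of white supremacy"/> </a></p>'
    '<p style="margin-bottom: 0in"><strong>'
    '<a href="https://www.rt.com/shows/on-contact/550319-george-washington-genocidal-colonist/">'
    'OnContact: George Washington and the legacy of white supremacy </a></strong></p>'
    '<p style="margin-bottom: 0in">On the show, <span class="host">Chris Hedges</span>'
    ' discusses George Washington... </p>'
    '<p style="margin-bottom: 0in">Feb 25, 2022 09:09 </p>'
    '<li><p style="margin-bottom: 0in">'
    '<a href="https://www.rt.com/shows/on-contact/549103-oppenheimer-bomb-culture-bird/">'
    ' <img src="b.jpg" alt="On Contact: Oppenheimer & the bomb culture"/> </a></p>'
    '<p style="margin-bottom: 0in"><strong>'
    '<a href="https://www.rt.com/shows/on-contact/549103-oppenheimer-bomb-culture-bird/">'
    'OnContact: Oppenheimer & the bomb culture </a></strong></p>'
    '<p style="margin-bottom: 0in">On the show, Chris Hedges discusses J.Robert Oppenheimer... </p>'
    '<p style="margin-bottom: 0in">Feb 20, 2022 06:10 </p></ul>'
)

ITEM = re.compile(
    r'<li>.*?<a href="([^"]+)"'          # group 1: first URL of the item
    r'.*?alt="([^"]+)"'                  # group 2: name, from the image alt text
    r'.*?<p[^>]*>((?:(?!</p>).)+)</p>'   # group 3: summary, inline tags allowed
    r'\s*<p[^>]*>([^<]+)</p>'            # group 4: date, no tags allowed
    r'\s*(?=<li>|</ul>|\Z)',             # the date paragraph must close the item
    re.DOTALL,                           # let "." cross newlines in unflattened input
)

def to_tsv(text):
    """Return one 'date<TAB>name<TAB>URL<TAB>summary' line per matched item."""
    rows = []
    for m in ITEM.finditer(text):
        url, name, summary, date = (g.strip() for g in m.groups())
        rows.append("\t".join((date, name, url, summary)))
    return "\n".join(rows)

print(to_tsv(html))
```

The final lookahead is what keeps the title paragraph from being taken as the summary when the real summary happens to contain no tags. The constructs used (lazy quantifiers, lookaheads, negated classes) also exist in Notepad++'s regex search, where the ". matches newline" option plays the role of `re.DOTALL`, but the pattern has not been checked there.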
Update 1
The test string I used later has additional list items and inline display tags in the summary (e.g. <strong>). Although this is inconsistent with the title, I had to remove tabs because they were interfering with the TSV creation, and I thought I might as well remove newlines in the same pass (I removed [\t\r\n]), resulting in:
<li><p style="margin-bottom: 0in"><a href="https://www.rt.com/shows/on-contact/550756-america-long-war-race/">OnContact: Race and America's long war </a></p><p style="margin-bottom: 0in"><a href="https://www.rt.com/shows/on-contact/550756-america-long-war-race/"> <font color="#000080"> <img src="rt.com-on_contact-220405-no_blurb_html_1dff87941f1c724a.jpg" name="Image1" alt="On Contact: Race and America's long war" align="bottom" width="280" height="157" border="1"/> </font></a></p><p style="margin-bottom: 0in">On the show, Chris Hedges discussesAmerica's inner and outer wars and its nexus with capitalism and <strong>empire</strong> with Professor of Social and Cultural Analysis and History atNew York University Nikhil Pal Singh. The internal violence in theUnited... </p><p style="margin-bottom: 0in">Feb 27, 2022 10:36</p><li><p style="margin-bottom: 0in"><a href="https://www.rt.com/shows/on-contact/550319-george-washington-genocidal-colonist/"> <font color="#000080"> <img src="rt.com-on_contact-220405-no_blurb_html_198feb67032166ff.png" name="Image3" alt="On Contact: George Washington and the legacy of white supremacy" align="bottom" width="280" height="157" border="1"/> </font></a></p><p style="margin-bottom: 0in"><strong><a href="https://www.rt.com/shows/on-contact/550319-george-washington-genocidal-colonist/">OnContact: George Washington and the legacy of white supremacy </a></strong></p><p style="margin-bottom: 0in">On the show, <span class="host">Chris Hedges</span> discusses George Washington, the fallible human being and one of the principalarchitects of the United States, with author Nathaniel Philbrick. AsAmerica fractures into ideologically hostile camps, it colors howwe... 
</p><p style="margin-bottom: 0in">Feb 25, 2022 09:09 </p><li><p style="margin-bottom: 0in"><a href="https://www.rt.com/shows/on-contact/549103-oppenheimer-bomb-culture-bird/"> <font color="#000080"> <img src="rt.com-on_contact-220405-no_blurb_html_e46c470920b1171d.jpg" name="Image4" alt="On Contact: Oppenheimer & the bomb culture" align="bottom" width="420" height="236" border="1"/> </font></a></p><p style="margin-bottom: 0in"><strong><a href="https://www.rt.com/shows/on-contact/549103-oppenheimer-bomb-culture-bird/">OnContact: Oppenheimer & the bomb culture </a></strong></p><p style="margin-bottom: 0in">On the show, Chris Hedges discusses J.Robert Oppenheimer and the making of the bomb with author <span class="author">Kai Bird.J. Robert Oppenheimer</span>, “the father of the atomic bomb,”was by the end of World War II one of the most celebrated men inAmerica.... </p><p style="margin-bottom: 0in">Feb 20, 2022 06:10 </p><li><p style="margin-bottom: 0in"><a href="https://www.rt.com/shows/on-contact/469859-war-iran-stephen-kinzer/"> <font color="#000080"> <img src="rt.com-on_contact-220405-no_blurb_html_15449064d00f77f3.jpg" name="Image149" alt="On Contact – War with Iran? Stephen Kinzer" align="bottom" width="420" height="236" border="1"/> </font></a></p><p style="margin-bottom: 0in"><strong><a href="https://www.rt.com/shows/on-contact/469859-war-iran-stephen-kinzer/">OnContact – War with Iran? Stephen Kinzer </a></strong></p><p style="margin-bottom: 0in">Host Chris Hedges talks to journalistand author, Stephen Kinzer, on efforts by Saudi Arabia and Washington to cripple Iran’s economy, inevitably putting Saudi Arabia, its Gulf allies and Washington on a collision course with the <em>Islamic</em>... 
</p><p style="margin-bottom: 0in">Sep 29, 2019 07:10 </p><li><p style="margin-bottom: 0in"><a href="https://www.rt.com/shows/on-contact/469339-future-amazon-rain-forest/"> <font color="#000080"> <img src="rt.com-on_contact-220405-no_blurb_html_b82502a96022a758.png" name="Image150" alt="The future of the Amazon rain forest – Sonia Bone Guajajara" align="bottom" width="280" height="157" border="1"/> </font></a></p><p style="margin-bottom: 0in"><strong><a href="https://www.rt.com/shows/on-contact/469339-future-amazon-rain-forest/">Thefuture of the Amazon rain forest – Sonia Bone Guajajara </a></strong></p><p style="margin-bottom: 0in">Host Chris Hedges talks to Sonia BoneGuajajara, leader of 300 indigenous ethnic groups in Brazil, aboutthe future of the Amazon rain forest, its people, climate change,and the competing goals of agrobusiness, multinational corporations,and the... </p><p style="margin-bottom: 0in">Sep 22, 2019 07:15 </p></ul>
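The flattening step described in Update 1 can be sketched as follows. This is a hypothetical Python illustration, not the exact operation performed in Notepad++: the `raw` string is a made-up fragment with a tab and line breaks inserted for the demonstration. It also shows why some words in the test string above are glued together ("discussesAmerica's"): deleting a line break removes the only whitespace between two words, whereas replacing it with a space keeps them apart.

```python
import re

# Made-up fragment with a newline, a tab and a CRLF, for demonstration only.
raw = (
    '<p style="margin-bottom: 0in">On the show, Chris Hedges discusses\n'
    'George Washington,\tthe fallible human being\r\n'
    '</p>'
)

# What Update 1 describes: delete every tab, carriage return and newline.
flat = re.sub(r'[\t\r\n]', '', raw)

# Alternative: collapse each run of those characters into a single space,
# which still removes tabs and newlines but keeps words separated.
spaced = re.sub(r'[\t\r\n]+', ' ', raw)

print(flat)
print(spaced)
```

Either variant leaves no tab inside a field, so the TSV columns stay intact.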