Regex: Find the HTML pages that do not contain a particular word in a link

Question

I have these 2 lines on more than 3000 HTML pages:

<link rel="canonical" href="https://mywebsite.com/hi/about.html" />

and

<link rel="canonical" href="https://mywebsite.com/about.html" />

So, I want to use a regex to find all the pages whose canonical link line does NOT contain the word hi (the /hi/ part of the link).

Is hi always between mywebsite.com/ and /about.html or can it be anywhere in the url? — Toto, Commented May 25, 2020 at 17:08

Glorfindel · Accepted Answer · 2020-05-25 16:24:58Z

If the /hi/ is always after https://mywebsite.com you can use a negative lookahead to make sure you exclude those matches. In that case,

<link rel="canonical" href="https:\/\/mywebsite\.com\/(?!hi\/)

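might work for you (demo). The first part is just a literal match. The backslash before the dot makes it match a literal dot; the backslashes before the slashes are only required where / is the regex delimiter, as on Regex101, and are harmless in Notepad++. The (?!hi\/) is the negative lookahead: it makes sure that hi/ does not occur at that position. But Regex101 does a better job of explaining the regex than I can.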
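To see the lookahead in action outside Notepad++, here is a minimal Python sketch that runs the pattern above against the two sample lines from the question. The helper name `has_non_hi_canonical` is made up for illustration, and the slashes are left unescaped because Python's `re` module does not need them escaped.

```python
import re

# Same pattern as in the answer; "/" needs no escaping in Python's re module.
PATTERN = re.compile(r'<link rel="canonical" href="https://mywebsite\.com/(?!hi/)')

lines = [
    '<link rel="canonical" href="https://mywebsite.com/hi/about.html" />',
    '<link rel="canonical" href="https://mywebsite.com/about.html" />',
]

def has_non_hi_canonical(text):
    """True if text contains a canonical link that is not directly followed by hi/."""
    return PATTERN.search(text) is not None

for line in lines:
    print(has_non_hi_canonical(line), line)
```

Only the second line is reported as a match. Because the lookahead checks for hi/ including the slash, a path such as /hiking.html is not excluded.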

(I assume you're familiar with the Notepad++ capabilities for mass search, but if not, this link may help.)
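If you would rather script the mass search than run it in Notepad++, something along these lines could list the matching pages. This is only a sketch: the function name `pages_without_hi`, the `*.html` glob, and the UTF-8 encoding are assumptions, not something stated in the question.

```python
import re
from pathlib import Path

# Same pattern as in the answer; "/" needs no escaping in Python's re module.
PATTERN = re.compile(r'<link rel="canonical" href="https://mywebsite\.com/(?!hi/)')

def pages_without_hi(root):
    """Return the .html files under root that contain a canonical link not starting with /hi/."""
    matches = []
    for path in sorted(Path(root).rglob("*.html")):
        # errors="replace" keeps the scan going if a file is not valid UTF-8 (assumed encoding).
        text = path.read_text(encoding="utf-8", errors="replace")
        if PATTERN.search(text):
            matches.append(path)
    return matches

# Example use (the folder name is hypothetical):
# for page in pages_without_hi("C:/path/to/site"):
#     print(page)
```

Pages that have no canonical link at all are not listed, since the pattern needs the link itself to be present.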
