
I have these words; at least three of them can be found in any English sentence.

was, where, were, some, then, than, that, can, by, the, and, with, over, there, is, as, also, through, from, while, just, like, for, such, if, else, still, again, want, will, wish, make, made, well, have, had, has, it, be, do, say, others, go, know, see, think, look, give, use, find, tell, ask, work, seem, feel, try, leave, call, get, take, too, in, addition, to, could, who, he, she, because, of, your, yours, their, doesn't, are, an, these, this, those, but, at, whom, or, out, how, when, between, his, her, they, them, my, without, maybe, even, show, can't, must, couldn't, now, i'm, many, come, own, self, seen, it’s, we, any, other, coming, so, found, more, much, all, very, same, did, which, does, on

Also, I have these two HTML tags, but only the content of the first one is in English:

<meta name="description" content="Simply Red are a British soul and pop band which formed in Manchester in 1985. The lead vocalist of the band is singer and songwriter Mick Hucknall by">

and one tag in Russian:

<meta name="description" content="Simply Red - британская соул- и поп-группа, образованная в Манчестере в 1985 году. Ведущим вокалистом группы является певец и автор песен Мик Хакнелл.">

So, I want to find all HTML files that contain tags whose content is written in English. For this, I must find the HTML tags that contain at least 3 of the keywords listed at the beginning.

My regex, with just a few words (short version), looks like this:

SEARCH: (?-s)<meta name="description".+?(?:(was|is|as|on|and|in)).+>

and the larger version will be:

(?-s)<meta name="description".*?(was|where|were|some|then|than|that|can|by|the|and|with|over|there|is|as|also|through|from|while|just|like|for|such|if|else|still|again|want|will|wish|make|made|well|have|had|has|it|be|do|say|others|go|know|see|think|look|give|use|find|tell|ask|work|seem|feel|try|leave|call|get|take|too|in|addition|to|could|who|he|she|because|of|your|yours|their|doesn't|are|an|these|this|those|but|at|whom|or|out|how|when|between|his|her|they|them|my|without|maybe|even|show|can't|must|couldn't|now|i'm|many|come|own|self|seen|it’s|we|any|other|coming|so|found|more|much|all|very|same|did|which|does|on).+>

OK, the problem is that my regex also finds the second tag, whose content is written in Russian. I must find only the first one (in English).
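One likely reason for the unwanted match can be checked with a small Python sketch. It assumes the short pattern is used as written, except that the `(?-s)` prefix is left out, because in Python's `re` module `.` already stops at newlines unless `re.DOTALL` is set. The Russian tag below is abbreviated to its first sentence. Since the pattern has no word boundaries and is not restricted to the attribute value, the keyword can be found inside the attribute name itself:

```python
import re

# Short pattern from the question, without the (?-s) prefix (see lead-in).
short = re.compile(r'<meta name="description".+?(?:(was|is|as|on|and|in)).+>')

# Russian tag from the question, abbreviated to the first sentence.
russian_tag = (
    '<meta name="description" content="Simply Red - британская соул- '
    'и поп-группа, образованная в Манчестере в 1985 году.">'
)

m = short.search(russian_tag)

# Show which keyword matched and the text around it.
print(m.group(1))                                      # on
print(russian_tag[m.start(1) - 1 : m.start(1) + 6])    # content
```

The keyword "on" is matched inside the word "content", so the Russian text is never actually inspected.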

  • Maybe I'm wrong, but I suspect an XY problem. Why don't you check that the description content doesn't contain Russian letters?
    – Toto
    Commented Jul 1, 2021 at 16:32
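The check suggested in this comment could look roughly like the following Python sketch. The function name `description_has_cyrillic` is hypothetical, the sample tags are abbreviated from the question, and the sketch assumes that the basic Cyrillic Unicode block (U+0400 to U+04FF) is enough to recognise Russian text:

```python
import re

# Assumption: any character from the basic Cyrillic block marks Russian text.
CYRILLIC = re.compile(r"[\u0400-\u04FF]")

# Capture only the value of the content attribute, not the attribute names.
CONTENT = re.compile(r'<meta name="description" content="([^"]*)"')


def description_has_cyrillic(html: str) -> bool:
    """Return True if the description content contains a Cyrillic letter."""
    m = CONTENT.search(html)
    return bool(m and CYRILLIC.search(m.group(1)))


english_tag = (
    '<meta name="description" content="Simply Red are a British soul '
    'and pop band which formed in Manchester in 1985.">'
)
russian_tag = (
    '<meta name="description" content="Simply Red - британская соул- '
    'и поп-группа, образованная в Манчестере в 1985 году.">'
)

print(description_has_cyrillic(english_tag))
print(description_has_cyrillic(russian_tag))
```

This only separates Cyrillic from non-Cyrillic text; it would not tell English apart from other languages written in Latin letters.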

1 Answer

Your list is too big, so to demonstrate the technique, here is an example on a small list of four words: one, two, three, four.


Here is an explanation of the search string: (one|two|three|four).*(?-1).*(?-1)

  • (one|two|three|four) : Capture one of the words in the group
  • .* : Find any number of characters
  • (?-1) : Match the pattern of the nearest preceding capture group again (a relative subroutine call, so it can match a different word from the list each time)
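Subroutine calls such as (?-1) exist in PCRE-style regex flavors, but Python's built-in `re` module does not have them. A rough equivalent, sketched below, is to build the pattern by repeating the group's text three times. The helper name `at_least_three` is hypothetical, and the `\b` word boundaries are an added assumption that is not part of the search string above; they keep "one" from matching inside "stone".

```python
import re

WORDS = ["one", "two", "three", "four"]


def at_least_three(words):
    """Compile a pattern that needs three keyword matches on one line."""
    # Whole-word alternation; \b is an assumption added for this sketch.
    group = r"\b(?:" + "|".join(map(re.escape, words)) + r")\b"
    # Repeating the group text stands in for the two (?-1) calls.
    return re.compile(group + ".*" + group + ".*" + group)


pattern = at_least_three(WORDS)

print(bool(pattern.search("one and two and three")))  # True
print(bool(pattern.search("one and two")))            # False
```

As with the subroutine version, the same word occurring three times also counts as three matches.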
  • Super answer, thanks. So, in my case, the short version will be FIND: <meta name="description" content=".*(are|the|in|you).*(?-1).*(?-1).+>
    – Just Me
    Commented Jul 1, 2021 at 16:36
  • Yes, that's the idea.
    – harrymc
    Commented Jul 1, 2021 at 17:36
  • The regex above doesn't work in Python, so to make it work there it can be changed slightly, like this: <p class="text_obisnuit">.*((are|the|in|you).*){3,}.*</p>
    – Just Me
    Commented Jul 3, 2021 at 19:24
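An alternative for Python, offered only as a sketch, is to count keyword matches instead of encoding "at least three" in the pattern itself; a repeated group that contains `.*` may backtrack heavily on long lines that do not match. The function name `looks_english` is hypothetical, the keyword list is the short one from the comments, and the sample tags are abbreviated from the question. Whole-word matching with `\b` and the restriction to the content value are assumptions added here.

```python
import re

# Short keyword list from the comments; extend it with the full list as needed.
KEYWORDS = ["are", "the", "in", "you"]

WORD_RE = re.compile(
    r"\b(?:" + "|".join(map(re.escape, KEYWORDS)) + r")\b", re.IGNORECASE
)
# Look only at the attribute value, so "content" or "description" cannot match.
CONTENT_RE = re.compile(r'<meta name="description" content="([^"]*)"')


def looks_english(html: str, minimum: int = 3) -> bool:
    """Return True if the description holds at least `minimum` keywords."""
    m = CONTENT_RE.search(html)
    if m is None:
        return False
    return len(WORD_RE.findall(m.group(1))) >= minimum


english_tag = (
    '<meta name="description" content="Simply Red are a British soul '
    'and pop band which formed in Manchester in 1985.">'
)
russian_tag = (
    '<meta name="description" content="Simply Red - британская соул- '
    'и поп-группа, образованная в Манчестере в 1985 году.">'
)

print(looks_english(english_tag))
print(looks_english(russian_tag))
```

Raising `minimum` or enlarging `KEYWORDS` changes how strict the check is.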
