I sometimes need to search a local directory containing HTML documents for particular words. Usually I use a program called File Locator Pro, which works nicely most times.

However, in some cases the word I am looking for is also a commonly used keyword or variable name in JavaScript or HTML, such as "child". Because the search runs on the raw file contents, the results then explode with thousands of useless matches from within scripts or tags.

Is there some way I can do a search against HTML file contents across many HTML files where the search will ignore HTML tags and script?

This doesn't have to be using File Locator Pro; any solution is of interest but preferably one that works on Windows and doesn't require other expensive software.

3 Answers

1

I'd go with a well-known Linux tool ported to Windows: grep

Now you'll have to do some tricky chaining: first match what you're after, then filter out as many false positives as possible. Something like this searches for "age" (which in my test case also matches <image ...> tags and some JS):

grep -ri 'age' * | grep -Ev '<script[^>]*>[^<]*</script>' | grep -v '<[^>]*age[^>]*>' | grep -E '^[^.]*\.(php|html)'

Here is what each grep command does:

  • The first gets all lines containing "age", recursively with -r and case-insensitively with -i.
  • The second drops lines that contain a complete <script ...>...</script> block (-v inverts the match), removing those script blocks from the results. Scripts that span several lines are not caught.
  • The third removes matches from within a tag. This may exclude valid results such as <div id=age>age</div> when the tag is on the same line as the searched word.
  • The last filters the results on the filename to keep only php or html files. This needs extended regexes (grep option -E) for the "A or B" construction (A|B).
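
As a small variation on the pipeline above, GNU grep can restrict the file types itself with --include, which makes the final filename filter unnecessary. The following is only a sketch under assumptions: it presumes GNU grep (as shipped with Cygwin), it uses -E for the script filter so that the repetition operators are treated as operators, and search_outside_tags is a hypothetical wrapper name, not an existing command.

```shell
# search_outside_tags is a hypothetical wrapper name used for illustration.
# Assumes GNU grep: --include is a GNU extension.
search_outside_tags() {
    word=$1
    dir=${2:-.}
    # Only look inside *.html and *.php files, recursively, ignoring case.
    grep -ri --include='*.html' --include='*.php' -- "$word" "$dir" |
    # Drop lines holding a complete <script>...</script> block.
    grep -Ev '<script[^>]*>[^<]*</script>' |
    # Drop lines where the word appears inside a tag on that line.
    grep -v "<[^>]*$word[^>]*>"
}
```

Calling search_outside_tags age . would search the current directory. The same limits apply as for the original pipeline: filtering works line by line, so multi-line scripts and tags still slip through.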

This is probably a little convoluted, but you cannot reliably parse HTML with a regex, and running every file through an (X)HTML parser just to extract the text sounds pretty complex to achieve as well.

0

On Windows, you can use grepWin (from the Tortoise dev) to run grep-style searches with a GUI. It can achieve pretty much everything GNU grep can.

Another way would be to install Cygwin and then just use grep as usual.

0

The Windows command line is (still) not as powerful as on *nix systems, but even there your scenario isn't readily solved. As @Tensibai said, you basically want to parse the files for context-based occurrences. The lightweight Windows counterpart to grep is findstr, which is a bit better than the old find but nowhere near as powerful as grep. If you install Cygwin as @fab2s suggested, you could probably build a script that does something like the following:

  • find all the files you're interested in (*.html) [find]
  • output them with line numbers and with every line break changed to something otherwise unused (let's say a control character), so that each file sits on one line but still "knows" where its lines ended [sed]
  • strip out all the script blocks and wrapper tags [sed, again]
  • reverse the newline replacement [sed]

...and finally:

  • grep for your results [grep]
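
The steps above could be sketched roughly as follows. This is a minimal illustration under several assumptions, not a finished tool: html_text_grep is a hypothetical function name, tr is swapped in for sed to replace and restore the newlines, the line-numbering step is left out (matches are reported by file name only), and only lowercase <script> tags are recognised. It needs a POSIX shell with find, tr, sed and grep, such as the one Cygwin provides.

```shell
# html_text_grep is a hypothetical helper name, not an existing tool.
# Usage: html_text_grep WORD [DIR]
html_text_grep() {
    word=$1
    dir=${2:-.}
    nl_mark=$(printf '\036')    # control character standing in for newlines
    end_mark=$(printf '\037')   # control character standing in for </script>

    find "$dir" -type f -name '*.html' | while IFS= read -r file; do
        # Join the file into a single line so tags and scripts that span
        # several lines can be matched by one sed expression.
        tr '\n' "$nl_mark" < "$file" |
        # Turn each </script> into one character, delete from <script ...>
        # up to the nearest such character, then delete all remaining tags.
        sed -e "s|</script>|$end_mark|g" \
            -e "s|<script[^>]*>[^$end_mark]*$end_mark||g" \
            -e "s|<[^>]*>||g" |
        # Restore the newlines that survived the stripping.
        tr "$nl_mark" '\n' |
        # Search the remaining visible text, ignoring case.
        grep -i -- "$word" |
        # Prefix every hit with the file it came from.
        while IFS= read -r line; do
            printf '%s: %s\n' "$file" "$line"
        done
    done
}
```

For example, html_text_grep child /path/to/docs would list lines of visible text containing "child". Replacing </script> with a single character first is a workaround for sed having no non-greedy match; without it, one expression would swallow everything between the first opening and the last closing script tag.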
