I have a list of several thousand URLs, and I'd like to search each of these pages for a given word. How can I do this programmatically on Windows, preferably using VBScript or PowerShell?
2 Answers
Edit: The original question didn't specify VBScript or PowerShell. I'm leaving this Python suggestion in the hope that someone in the future will benefit.
What is the quickest way to do this programmatically on Windows? I guess 'quickest' is a function of your abilities.
With my skills, I would whip up a Python script for that, as that would be the quickest way for me. The script, as I would write it, would look something like this:
    import urllib.request

    search_string = ""   # String you're searching for
    sites_with_str = []  # List that'll contain URLs with search_string in them

    # 'with' closes the file for us (it's just polite to close your read handles)
    with open(r"c:\sites.txt") as url_file:
        for line in url_file:
            site = line.strip()
            if not site:
                continue
            try:
                html = urllib.request.urlopen(site, timeout=10).read().decode("utf-8", errors="replace")
            except (OSError, ValueError) as err:  # unreachable host, HTTP error, timeout, bad URL
                print("Skipping", site, err)
                continue
            if search_string in html:
                sites_with_str.append(site)

    # Print out the sites with the search string in them
    print('\n\nSites Containing Search String "' + search_string + '":')
    for site in sites_with_str:
        print(site)
Of course, that's only a rough sketch. urllib.request is part of Python 3's standard library, so grabbing a page needs no third-party package. And obviously it'd require a little recursive function and some HTML parsing if you wanted to search all pages within each site referenced in the input file.
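For the "search all pages within each site" case mentioned above, here is a minimal sketch of how that might look. It swaps the recursive function for an explicit queue (a breadth-first crawl), which avoids deep call stacks. The names LinkCollector, same_site_links and pages_containing are made up for illustration, and fetch is assumed to be any function you supply that takes a URL and returns the page's HTML as a string.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def same_site_links(base_url, html):
    """Return absolute, de-duplicated links in html that stay on base_url's host."""
    parser = LinkCollector()
    parser.feed(html)
    host = urlparse(base_url).netloc
    found = []
    for href in parser.links:
        # Resolve relative links and drop any "#fragment" part
        absolute = urldefrag(urljoin(base_url, href)).url
        if urlparse(absolute).netloc == host and absolute not in found:
            found.append(absolute)
    return found


def pages_containing(start_url, search_string, fetch, max_pages=50):
    """Crawl one site breadth-first; return URLs whose HTML contains search_string.

    fetch is a caller-supplied function (hypothetical here): url -> HTML text.
    max_pages caps the number of downloads so the crawl always terminates.
    """
    queue = deque([start_url])
    seen = {start_url}
    hits = []
    fetched = 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        fetched += 1
        try:
            html = fetch(url)
        except (OSError, ValueError):
            continue  # skip pages that fail to download
        if search_string in html:
            hits.append(url)
        for link in same_site_links(url, html):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return hits
```

You would call pages_containing once per line of the input file, passing whatever download function you already use as fetch. The max_pages cap is an arbitrary safety limit; large sites would need a higher value and probably some politeness delay between requests.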
-
Thanks for the suggestion. I've updated my question to indicate VBScript or Powershell. Commented Jul 12, 2011 at 16:34
-
Yes, I'm crying too, not having access to a real OS ;) Commented Jul 12, 2011 at 16:48
-
@Mark and you're being forced to not use Python?? What a sadistic situation, my friend :P Commented Jul 12, 2011 at 16:48
I solved my own problem, in case anyone else faces the same requirement:
    # Create a web client and identify the script with a User-Agent header
    $webClient = New-Object System.Net.WebClient
    $webClient.Headers.Add("user-agent", "PowerShell Script")

    # The input file holds one URL per line
    $info = Get-Content c:\path\to\file\urls.txt

    foreach ($i in $info) {
        $output = ""
        $startTime = Get-Date
        $output = $webClient.DownloadString($i)
        $endTime = Get-Date

        # -like does a case-insensitive wildcard match against the page source
        if ($output -like "*some dirty word*") {
            "Success`t`t" + $i + "`t`t" + ($endTime - $startTime).TotalSeconds + " seconds"
        }
    }
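For anyone following the Python route from the other answer, the same loop can be sketched with a thread pool, which may help with several thousand URLs because most of the time is spent waiting on the network. This is only a sketch under some assumptions: the function names fetch, check and scan are invented here, pages are assumed to decode as UTF-8, the worker count of 16 is arbitrary, and the match is whole-word and case-insensitive rather than the wildcard substring match used above.

```python
import re
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def fetch(url, timeout=10):
    """Download a page as text; assumes UTF-8 and replaces undecodable bytes."""
    req = urllib.request.Request(url, headers={"User-Agent": "Python script"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def check(url, word, fetcher=fetch):
    """Return (url, found, seconds); found is None when the download failed."""
    start = time.monotonic()
    try:
        text = fetcher(url)
    except (OSError, ValueError):
        return url, None, time.monotonic() - start
    # \b anchors work as expected when word starts and ends with a letter or digit
    pattern = r"\b" + re.escape(word) + r"\b"
    found = re.search(pattern, text, re.IGNORECASE) is not None
    return url, found, time.monotonic() - start


def scan(urls, word, fetcher=fetch, workers=16):
    """Check many URLs concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda u: check(u, word, fetcher), urls))
```

To use it, read the URL file into a list of stripped lines, call scan(urls, "word"), and print the tuples where found is True. Tuples where found is None are the pages that could not be downloaded.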