
I am trying to import data from multiple web pages hosted by a single online source. The data is posted by the source as one data set per web page for each week of the year. I would like to import the data for all 52 weeks in a year at the same time rather than modifying my code for each of the weeks and importing one at a time.

Here is my code to import one week's data:

week012012 = 
  Import["http://www.boxofficemojo.com/weekend/chart/?yr=2012&wknd=01&p=.htm", "Data"]

In case it is of interest or relevance, here is the further processing I do with the data after it has been imported:

week012012B = Cases[week012012, {_, _, _, _, _, _, _, _, _, _, _, _?NumericQ}, ∞];

Grid[week012012B, Frame -> All]

The site uses a consistent naming scheme for each week of the year, and indeed from year to year as well. If I were looking to get just a few weeks' data I could manually change the weekend number in the URL from 01 to 02, 03, 04, and so on, but I want all 52 weeks for 2012. The approach I have been playing with is to use string manipulation to modify the URL and then import and save the data for each of the weeks. Any suggestions?

  • The site seems to have an RSS feed. So this is probably a duplicate of How to scrape the headlines from New York Times and Wall Street Journal
    – Jens
    Commented Mar 21, 2013 at 3:28
  • Is the question actually just how to generate a list of URL strings? I'm not sure what you're asking. The parsing of the Import is not your concern, or is it?
    – Jens
    Commented Mar 21, 2013 at 3:32
  • Not looking for the RSS feed. Looking to import the data from 52 pages hosted on the site sequentially with a single command. Commented Mar 21, 2013 at 3:54
  • Then you should probably remove the "Reconstitution" from the title.
    – Jens
    Commented Mar 21, 2013 at 4:01
  • I disagree with the suggestion to alter the title, but I will not object if someone edits it. The reason is that the import reconstitutes data tables found on the 52 web pages. Commented Mar 30, 2013 at 21:27

1 Answer

wkstrngs = StringJoin /@ Map[ToString, PadLeft[IntegerDigits /@ Range[52]], {-1}];
wkurls = "http://www.boxofficemojo.com/weekend/chart/?yr=2012&wknd=" ~~
    # ~~ "&p=.htm" & /@ wkstrngs[[;; 5]] (* remove [[;; 5]] for all 52 weeks *)
(* {"http://www.boxofficemojo.com/weekend/chart/?yr=2012&wknd=01&p=.htm",
    "http://www.boxofficemojo.com/weekend/chart/?yr=2012&wknd=02&p=.htm",
    "http://www.boxofficemojo.com/weekend/chart/?yr=2012&wknd=03&p=.htm",
    "http://www.boxofficemojo.com/weekend/chart/?yr=2012&wknd=04&p=.htm",
    "http://www.boxofficemojo.com/weekend/chart/?yr=2012&wknd=05&p=.htm"} *)
fiveweeks = Import[#, "Data"] & /@ wkurls;
data5wks = Cases[#, {__, _?NumericQ}, Infinity] & /@ fiveweeks; (* thanks: Mike Honeychurch *)
Grid[#, Frame -> All] & /@ data5wks
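As an aside, the zero-padded week strings can also be produced directly with IntegerString, which avoids the IntegerDigits/PadLeft step. This is a minimal sketch assuming the same URL scheme as above; the variable names alldata and the "may take a while" caveat are my own:

    (* IntegerString[n, 10, 2] gives n in base 10, padded with zeros to 2 digits *)
    wkstrngs = IntegerString[Range[52], 10, 2];
    wkurls = "http://www.boxofficemojo.com/weekend/chart/?yr=2012&wknd=" <> # <>
        "&p=.htm" & /@ wkstrngs;
    alldata = Import[#, "Data"] & /@ wkurls; (* fetches all 52 pages; may take a while *)

IntegerString is Listable, so it threads over Range[52] without an explicit Map.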
  • This is exactly what I was looking to accomplish. Thanks. Commented Mar 21, 2013 at 4:02
  • If you're testing for a numeric last element, presumably {__, _?NumericQ} would suffice as the pattern? Commented Mar 21, 2013 at 6:39
  • @Mike, you are right; thank you.
    – kglr
    Commented Mar 21, 2013 at 7:24
