
I am trying to scrape tables from a website using the Google Chrome extension webscraper.io. The extension's tutorial documents how to scrape a website with several pages, say "page 1", "page 2" and "page 3", where each page is linked directly from the main page.

On the website I am trying to scrape, however, there is only a "next" button to reach the following page. If I follow the steps in the tutorial and create a link selector for the "next" page, the scraper only visits pages 1 and 2. Creating a separate "next" link for each page is not feasible because there are too many pages. How can I get the webscraper to include all pages? Is there a way to loop through pages using the webscraper extension?

I am aware of this possible duplicate: pagination Chrome web scraper. However, it was not well received and contains no useful answers.

1 Answer


Following the advanced documentation here, the problem is solved by making the "pagination" link selector a child of itself, in addition to being a child of the page where it first appears. The scraper then recursively follows each page's "next" link through all pages. In their words,

To extract items from all of the pagination links including the ones that are not visible at the beginning you need to create another Link selector that selects the pagination links. Figure 2 shows how the link selector should be created in the sitemap. When the scraper opens a category link it will extract items that are available in the page. After that it will find the pagination links and also visit those. If the pagination link selector is made a child to itself it will recursively discover all pagination pages.
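To make the self-reference concrete, here is a minimal Python sketch under stated assumptions. The dictionary mirrors the general shape of a sitemap as I recall the extension exporting it (`selectors`, `parentSelectors`, `SelectorLink`), but the exact field names, the selector ids `pagination` and `row`, the CSS selectors, and the URL are all illustrative assumptions; compare them with a sitemap exported from your own installation. The `pages_visited` function is a toy model of the traversal described in the quote, not the extension's actual code.

```python
import copy
import json

# Hypothetical sitemap: ids, CSS selectors and the URL are placeholders.
sitemap = {
    "_id": "paginated-tables",
    "startUrl": ["https://example.com/tables"],
    "selectors": [
        {
            "id": "pagination",
            "type": "SelectorLink",
            # Child of the start page AND of itself: this is the key step.
            "parentSelectors": ["_root", "pagination"],
            "selector": "a.next",
            "multiple": False,
        },
        {
            "id": "row",
            "type": "SelectorText",
            # Extract rows on the start page and on every paginated page.
            "parentSelectors": ["_root", "pagination"],
            "selector": "table tr",
            "multiple": True,
        },
    ],
}

# Toy site: each page has at most one "next" link.
NEXT_LINKS = {"page1": "page2", "page2": "page3", "page3": None}


def children_of(sitemap, parent_id):
    """Return the selectors that list parent_id among their parents."""
    return [s for s in sitemap["selectors"] if parent_id in s["parentSelectors"]]


def pages_visited(sitemap, next_links, start):
    """Simulate which pages are opened, following link selectors."""
    visited = []

    def visit(page, parent_id):
        if page in visited:
            return
        visited.append(page)
        for sel in children_of(sitemap, parent_id):
            if sel["type"] == "SelectorLink":
                target = next_links.get(page)
                if target is not None:
                    # The opened page is scraped with sel's children.
                    visit(target, sel["id"])

    visit(start, "_root")
    return visited


# Same sitemap, but with the pagination link as a child of _root only.
flat = copy.deepcopy(sitemap)
flat["selectors"][0]["parentSelectors"] = ["_root"]

print(pages_visited(sitemap, NEXT_LINKS, "page1"))  # ['page1', 'page2', 'page3']
print(pages_visited(flat, NEXT_LINKS, "page1"))     # ['page1', 'page2']
print(json.dumps(sitemap, indent=2))
```

In this model, the variant without the self-reference stops after page 2, which matches the behaviour described in the question: the link selector has no children on the page it opens, so no further "next" link is followed.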

  • How can I make something a child of itself, if it also needs to be a child of the other page?
    – ike
    Commented Jul 16, 2017 at 3:00
  • The instructions are very unclear, I don't see any way to replicate the chart they have.
    – ike
    Commented Jul 16, 2017 at 3:01
  • I figured it out. A selector can be a child of more than one parent, which wasn't clearly spelled out. Holding Ctrl while selecting parents worked.
    – ike
    Commented Jul 16, 2017 at 3:33
  • They explain how to do this in their tutorial, at 2:42 you see an example graph: youtube.com/watch?v=y_n2IsZlLds
    – Manu CJ
    Commented May 29, 2018 at 13:13
  • Is it possible to reuse the dependency graph in a command-line environment (e.g. with a headless browser)?
    – PEZO
    Commented Jun 21, 2018 at 16:49
