
I am trying to follow several URLs from a page into the next parser, which grabs another set of URLs. From that second page I also need the next-page URLs, but I wanted to build them by parsing and manipulating the page URL string and then yielding the result as the next request. However, the spider crawls but returns nothing, not even the output from the final parser where I load the item.

Note: I know that I can grab the next page quite simply with an if-statement on the next-page href. However, I wanted to try something different in case I ever face a situation where I have to do this.

Here's my scraper:

import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.item import Field
from itemloaders.processors import TakeFirst
from scrapy.loader import ItemLoader

class ZooplasItem(scrapy.Item):
    stuff = Field()

class ZooplasSpider(scrapy.Spider):
    name = 'zooplas'
    start_urls = ['https://www.zoopla.co.uk/overseas/']

    def start_request(self):
        for url in self.start_urls:
            yield scrapy.Request(
                url, 
                callback = self.parse, )

    def parse(self, response):
        container = response.xpath("//ul[@class='list-inline list-unstyled']//li")
        for links in container:
            urls = links.xpath(".//a/@href").get()
            yield response.follow(
                urls, callback = self.parse_places
            )
    def parse_places(self, response):
        container = response.xpath("//ul[@class='listing-results clearfix js-gtm-list']//li")
        for links in container:
            urls = links.xpath('(//div[@class="listing-results-right clearfix"]//a)[position() mod 3=1]//@href').get()
            yield response.follow(
                urls, callback = self.parse_listings
            )
        if response.xpath("//div[@id='content']//div//h1//text()").extract_first():
            page_on = response.xpath("//div[@id='content']//div//h1//text()").extract_first()
            name_of_page = page_on.split()[-1]
        else:
            pass
        if response.xpath("(//div[@class='paginate bg-muted'])//a[last()-1]//href").extract_first():
            url_link = response.xpath("(//div[@class='paginate bg-muted'])//a[last()-1]//href").extract_first()
            url_link = url_link.split('/')
            last_page = url_link[-1].split('=')[-1]
        else:
            pass
        all_pages = []
        for index, n in enumerate(url_link):
            for page_name, page_num in zip(name_of_page, last_page):

                if index == 5:
                    url_link[index] = page_name
                    testit='/'.join(url_link)
                    equal_split = testit.split('=')
                    for another_i, n2 in enumerate(equal_split):
                        if another_i == 3:
                            for range_val in range(1, page_num+1):
                                equal_split[another_i] = str(2)
                                all_pages.append('='.join(equal_split))


        for urls in all_pages:
            yield response.follow(
                urls, callback = self.parse.places
            )
    def parse_listings(self, response):
        loader = ItemLoader(ZooplasItem(), response=response)
        loader.default.output_processor = TakeFirst()
        loader.add_xpath("//article[@class='dp-sidebar-wrapper__summary']//h1//text()")
        yield loader.load_item()

process = CrawlerProcess(
    settings = {
        'FEED_URI':'zoopla.jl',
        'FEED_FORMAT':'jsonlines'
    }
)
process.crawl(ZooplasSpider)
process.start()

I know the way of grabbing the urls works as I have tried it on a single url using the following:

url = "https://www.zoopla.co.uk/overseas/property/ireland/?new_homes=include&include_sold=false&pn=16"

list_of_stuff = ['Ireland', 'Germany','France']
pages_of_stuff = [5, 7, 6]

test = []
all_pages = []
j=0
a = url.split('/')
for index, n in enumerate(a):
    for l_stuff, p_stuff in zip(list_of_stuff,pages_of_stuff):
        if index == 5:
            a[index] = l_stuff
            testit='/'.join(a)
            equal_split = testit.split('=')
            for another_i, n2 in enumerate(equal_split):
                if another_i == 3:
                    for range_val in range(1, p_stuff+1):
                        equal_split[another_i] = str(range_val)
                        print('='.join(equal_split))

This is the same logic as in the spider, just with different variable names. It outputs the following links, and they work:

https://www.zoopla.co.uk/overseas/property/Ireland/?new_homes=include&include_sold=false&pn=1
https://www.zoopla.co.uk/overseas/property/Ireland/?new_homes=include&include_sold=false&pn=2
https://www.zoopla.co.uk/overseas/property/Ireland/?new_homes=include&include_sold=false&pn=3
https://www.zoopla.co.uk/overseas/property/Ireland/?new_homes=include&include_sold=false&pn=4
https://www.zoopla.co.uk/overseas/property/Ireland/?new_homes=include&include_sold=false&pn=5
https://www.zoopla.co.uk/overseas/property/Germany/?new_homes=include&include_sold=false&pn=1
https://www.zoopla.co.uk/overseas/property/Germany/?new_homes=include&include_sold=false&pn=2
...
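As an aside, the same pagination URLs can be built without splitting on `/` and `=` by position, using the standard library's `urllib.parse`; a minimal sketch (the path-segment index 3 and the `pn` parameter are assumptions based on the URL layout shown above):

```python
from urllib.parse import urlsplit, urlunsplit, parse_qs, urlencode

def build_page_urls(base_url, countries, page_counts):
    """Rebuild per-country pagination URLs from a template URL."""
    parts = urlsplit(base_url)
    urls = []
    for country, last_page in zip(countries, page_counts):
        # Swap the country segment: /overseas/property/<country>/
        path_bits = parts.path.split('/')
        path_bits[3] = country  # assumes the /overseas/property/<country>/ layout
        for page in range(1, last_page + 1):
            query = parse_qs(parts.query)
            query['pn'] = [str(page)]  # overwrite the page-number parameter
            urls.append(urlunsplit((
                parts.scheme, parts.netloc, '/'.join(path_bits),
                urlencode(query, doseq=True), parts.fragment,
            )))
    return urls

pages = build_page_urls(
    "https://www.zoopla.co.uk/overseas/property/ireland/?new_homes=include&include_sold=false&pn=16",
    ['Ireland', 'Germany'], [5, 7],
)
```

This yields the same kind of URLs as above but does not depend on how many `/` or `=` characters happen to appear in the address.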

1 Answer


Your use case is well suited to Scrapy's crawl spider. You can write rules for how to extract links to the properties and how to follow links to the next pages. I have changed your code to use the CrawlSpider class, and I have changed your feed settings to the recommended FEEDS setting, since FEED_URI and FEED_FORMAT are deprecated in newer versions of Scrapy.

Read more about the crawl spider in the docs.

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.crawler import CrawlerProcess
from scrapy.spiders import CrawlSpider, Rule
from scrapy.item import Field
from itemloaders.processors import TakeFirst
from scrapy.loader import ItemLoader

class ZooplasItem(scrapy.Item):
    stuff = Field()
    country = Field()

class ZooplasSpider(CrawlSpider):
    name = 'zooplas'
    allowed_domains = ['zoopla.co.uk']
    start_urls = ['https://www.zoopla.co.uk/overseas/']

    rules = (
        Rule(LinkExtractor(restrict_css='a.link-novisit'), follow=True), # follow the countries links
        Rule(LinkExtractor(restrict_css='div.paginate'), follow=True), # follow pagination links
        Rule(LinkExtractor(restrict_xpaths="//a[contains(@class,'listing-result')]"), callback='parse_item', follow=True), # follow the link to actual property listing
    )

    def parse_item(self, response):
        # here you are on the details page for each property
        loader = ItemLoader(ZooplasItem(), response=response)
        loader.default_output_processor = TakeFirst()
        loader.add_xpath("stuff", "//article[@class='dp-sidebar-wrapper__summary']//h1//text()")
        loader.add_xpath("country","//li[@class='ui-breadcrumbs__item'][3]/a/text()")
        yield loader.load_item()

if __name__ == '__main__':
    process = CrawlerProcess(
        settings = {
            'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.80 Safari/537.36',
            'FEEDS': {
                'zoopla.jl': {
                    'format': 'jsonlines'
                }
            }
        }
    )
    process.crawl(ZooplasSpider)
    process.start()
  • Thanks for showing me a working example of the rules; I have been meaning to learn them and don't think I had understood them. I have read the documentation, and your example helped cement the working knowledge. Thanks so much! Commented Feb 8, 2022 at 9:26
  • I have a question: how would I parse information from the previous pages grabbed by the link extractor and load it into the item loader later on? Commented Feb 8, 2022 at 12:48
  • Because it looks like a restriction was applied on the last link extractor, and therefore I won't be able to parse any of the previous pages. Is there a way around this? E.g. (//ul[@class='list-inline list-unstyled'])[1]//li//a//text() is the XPath on the first page for grabbing the names of the countries. Commented Feb 8, 2022 at 12:53
  • You do not need the previous links to get the country name. Using this xpath on the property details page, you will get the country name //li[@class='ui-breadcrumbs__item'][3]/a/text(). I have edited the answer and added the xpath for you.
    – msenior_
    Commented Feb 8, 2022 at 13:48
  • Thanks for checking! However, I only wanted to know whether it is possible, as I have a few scrapers where something like this cannot happen; I'll have to get data directly from previous pages. Is something like this possible with link extractors, or will I have to use item loaders? Commented Feb 8, 2022 at 22:37
