
I'd like to know the Scrapy equivalent of the Requests r.content attribute. For example, say I have this script:

import requests
import pandas as pd
url = "https://www.example.com"
r = requests.get(url)
pd.read_html(r.content)

This returns a list of tables if the URL contains any. What's the equivalent in Scrapy?

I have tried:

response.body
response.text

but neither is working for this.

If I try:

pd.read_html(response.content)

I get:

AttributeError: 'HtmlResponse' object has no attribute 'content'

So what is the equivalent so that I can read pandas tables directly from the response?

The example I tried:

import scrapy
import pandas as pd
from scrapy.crawler import CrawlerProcess

class GsmSpider(scrapy.Spider):
    name = 'gsm'
    
    def start_requests(self):
        yield scrapy.Request(
            url = "https://www.gsmarena.com/makers.php3",
            callback = self.parse
        )

    def parse(self, response):
        data = pd.read_html(response.text)
        yield data

process = CrawlerProcess(
    settings={
        'FEED_URI': 'data.jl',
        'FEED_FORMAT': 'jsonlines'
    }
)
process.crawl(GsmSpider)
process.start()
  • When you say neither is working, do you mean they don't exist on the object? If so, that behavior would be unexpected; pd.read_html(response.text) should work.
    – lmonninger
    Commented Jan 24, 2022 at 18:09
  • @lmonninger I get the following error: ERROR: Spider must return request, item, or None, got 'list' Commented Jan 24, 2022 at 18:30
  • This likely has to do with how you are creating your pipeline then, i.e. you've reduced your problem too far. Check this out and see if it applies: stackoverflow.com/questions/39763002/…
    – lmonninger
    Commented Jan 24, 2022 at 18:35
  • @lmonninger I've updated with a working link and example but I still get the error no matter how I re-arrange it. Commented Jan 24, 2022 at 18:49

1 Answer


You can't just yield plain text. As the error says, you need to return a request, an item, or None. Read about items.
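For reference, a declared Item could look like the sketch below (TableItem and its field name are assumptions for illustration; yielding a plain dict, as in the spider that follows, also counts as an item):

import scrapy


class TableItem(scrapy.Item):
    # hypothetical field that would hold the extracted table data
    data = scrapy.Field()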

import scrapy
import pandas as pd


class GsmSpider(scrapy.Spider):
    name = 'gsm'

    def start_requests(self):
        yield scrapy.Request(
            url="https://www.gsmarena.com/makers.php3",
            callback=self.parse
        )

    def parse(self, response):
        data = pd.read_html(response.text)
        yield {'data': data}
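If you run the spider from the command line rather than a script, something like the following should produce the feed (a sketch; the filename gsm_spider.py is just an assumption about where the spider is saved):

scrapy runspider gsm_spider.py -o data.jl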

EDIT:

A more specific solution for OP's question (I have never used pandas before, so there may be a better way):

import numpy as np
import scrapy
import pandas as pd
from scrapy.crawler import CrawlerProcess


class GsmSpider(scrapy.Spider):
    name = 'gsm'

    def start_requests(self):
        yield scrapy.Request(
            url="https://www.gsmarena.com/makers.php3",
            callback=self.parse
        )

    def parse(self, response):
        data = pd.read_html(response.text)
        # pd.read_html returns a list of DataFrames; this page has only one,
        # but enumerate handles the general case. NaN is not JSON serializable,
        # so replace it with None before exporting.
        for i, d in enumerate(data):
            yield {f"data_{i+1}": d.replace(np.nan, None).to_dict()}


process = CrawlerProcess(
    settings={
        'FEEDS': {'data.jl': {'format': 'jsonlines'}}
    })
process.crawl(GsmSpider)
process.start()
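To sanity-check the output, the jsonlines feed can be read back with pandas (a quick sketch, assuming the run above produced data.jl in the working directory):

import pandas as pd

# each line of data.jl is one JSON object yielded by the spider
items = pd.read_json('data.jl', lines=True)
print(items.head())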
  • I get the following error: TypeError: Object of type DataFrame is not JSON serializable Commented Jan 24, 2022 at 19:46
  • How do you run the spider?
    – SuperUser
    Commented Jan 25, 2022 at 9:24
  • @dollarbill The code works for me. If you're having problems with other parts of the code you need to update it.
    – SuperUser
    Commented Jan 25, 2022 at 14:20
  • I have updated the answer with how I run it Commented Feb 1, 2022 at 17:07
  • @dollarbill See the edit.
    – SuperUser
    Commented Feb 1, 2022 at 18:03
