
I'd like to know the Scrapy equivalent of the Requests r.content attribute. For example, say I have this script:

import requests
import pandas as pd
url = "https://www.example.com"
r = requests.get(url)
pd.read_html(r.content)

This returns a list of tables if the URL contains any. What's the equivalent in Scrapy?

I have tried:

response.body
response.text

but neither is working for this.

If I try:

pd.read_html(response.content)

I get:

AttributeError: 'HtmlResponse' object has no attribute 'content'

So what is the equivalent so that I can read pandas tables directly from the response?

The example I tried:

import scrapy
import pandas as pd
from scrapy.crawler import CrawlerProcess

class GsmSpider(scrapy.Spider):
    name = 'gsm'
    
    def start_requests(self):
        yield scrapy.Request(
            url = "https://www.gsmarena.com/makers.php3",
            callback = self.parse
        )

    def parse(self, response):
        data = pd.read_html(response.text)
        yield data

process = CrawlerProcess(
    settings={
        'FEED_URI': 'data.jl',
        'FEED_FORMAT': 'jsonlines'
    }
)
process.crawl(GsmSpider)
process.start()
  • When you say neither is working, do you mean they don't exist on the object? If so, that behavior would be unexpected; pd.read_html(response.text) should work.
    – lmonninger
    Commented Jan 24, 2022 at 18:09
  • @lmonninger I get the following error: ERROR: Spider must return request, item, or None, got 'list' Commented Jan 24, 2022 at 18:30
  • This likely has to do with how you are creating your pipeline then, i.e. you've reduced your problem too far. Check this out and see if it applies: stackoverflow.com/questions/39763002/…
    – lmonninger
    Commented Jan 24, 2022 at 18:35
  • @lmonninger I've updated with a working link and example but I still get the error no matter how I re-arrange it. Commented Jan 24, 2022 at 18:49

1 Answer


You can't just yield plain text. As the error says, you need to return a request, an item, or None. Read about items.
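For reference, a declared Item could look like the sketch below (TableItem and its field name are assumptions for illustration; yielding a plain dict, as in the spider that follows, also counts as an item):

import scrapy


class TableItem(scrapy.Item):
    # hypothetical field that would hold the extracted table data
    data = scrapy.Field()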

import scrapy
import pandas as pd


class GsmSpider(scrapy.Spider):
    name = 'gsm'

    def start_requests(self):
        yield scrapy.Request(
            url="https://www.gsmarena.com/makers.php3",
            callback=self.parse
        )

    def parse(self, response):
        data = pd.read_html(response.text)
        yield {'data': data}
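If you run the spider from the command line rather than a script, something like the following should produce the feed (a sketch; the filename gsm_spider.py is just an assumption about where the spider is saved):

scrapy runspider gsm_spider.py -o data.jl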

EDIT:

A more specific solution for OP's question (I have never used pandas before, so there may be a better way):

import numpy as np
import scrapy
import pandas as pd
from scrapy.crawler import CrawlerProcess


class GsmSpider(scrapy.Spider):
    name = 'gsm'

    def start_requests(self):
        yield scrapy.Request(
            url="https://www.gsmarena.com/makers.php3",
            callback=self.parse
        )

    def parse(self, response):
        data = pd.read_html(response.text)
        # pd.read_html returns a list of DataFrames; this page has only one,
        # but enumerate handles the general case. NaN is not JSON serializable,
        # so replace it with None before exporting.
        for i, d in enumerate(data):
            yield {f"data_{i+1}": d.replace(np.nan, None).to_dict()}


process = CrawlerProcess(
    settings={
        'FEEDS': {'data.jl': {'format': 'jsonlines'}}
    })
process.crawl(GsmSpider)
process.start()
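To sanity-check the output, the jsonlines feed can be read back with pandas (a quick sketch, assuming the run above produced data.jl in the working directory):

import pandas as pd

# each line of data.jl is one JSON object yielded by the spider
items = pd.read_json('data.jl', lines=True)
print(items.head())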
  • I get the following error: TypeError: Object of type DataFrame is not JSON serializable Commented Jan 24, 2022 at 19:46
  • How do you run the spider?
    – SuperUser
    Commented Jan 25, 2022 at 9:24
  • @dollarbill The code works for me. If you're having problems with other parts of the code you need to update it.
    – SuperUser
    Commented Jan 25, 2022 at 14:20
  • I have updated the answer with how I run it Commented Feb 1, 2022 at 17:07
  • @dollarbill See the edit.
    – SuperUser
    Commented Feb 1, 2022 at 18:03
