
I'm writing a Python script to automatically check dog re-homing sites for dogs that we might be able to adopt as they become available. However, I'm stuck completing the form data on this site and can't figure out why.

The form's attributes state it should use a POST method, and I've gone through all of the form's inputs and created a payload.

I expect the page with the search results to be returned, with the HTML scraped from the results page so I can start processing it, but the scrape only ever returns the form page and never has the results.

I've tried using .get with the payload as params, requesting the URL with the payload appended, and using the requests-html library to render any JavaScript elements, all without success.

If you paste url_w_payload into a browser, it loads the page and says one of the fields is empty. If you then press Enter in the URL bar again to reload the page, without modifying the URL, it loads. Something to do with cookies, maybe?

import requests
from requests_html import HTMLSession

session = HTMLSession()

form_url = "https://www.rspca.org.uk/findapet?p_p_id=petSearch2016_WAR_ptlPetRehomingPortlets&p_p_lifecycle=1&p_p_state=normal&p_p_mode=view&_petSearch2016_WAR_ptlPetRehomingPortlets_action=search"

url_w_payload = "https://www.rspca.org.uk/findapet?p_p_id=petSearch2016_WAR_ptlPetRehomingPortlets&p_p_lifecycle=1&p_p_state=normal&p_p_mode=view&_petSearch2016_WAR_ptlPetRehomingPortlets_action=search&noPageView=false&animalType=DOG&freshSearch=false&arrivalSort=false&previousAnimalType=&location=WC2N5DU&previousLocation=&prevSearchedPostcode=&postcode=WC2N5DU&searchedLongitude=-0.1282688&searchedLatitude=51.5072106"

payload = {'noPageView': 'false','animalType': 'DOG', 'freshSearch': 'false', 'arrivalSort': 'false', 'previousAnimalType': '', 'location': 'WC2N5DU', 'previousLocation': '','prevSearchedPostcode': '', 'postcode': 'WC2N5DU', 'searchedLongitude': '-0.1282688',  'searchedLatitude': '51.5072106'}

# req = requests.post(form_url, data=payload)

# with open("requests_output.txt", "w") as f:
#     f.write(req.text)

ses = session.post(form_url, data=payload)

ses.html.render()

with open("session_output.txt", "w") as f:
    f.write(ses.text)

print("Done")

1 Answer


There are a few hoops to jump through with cookies and headers, but once you get those right, you'll get the proper response.

Here's how to do it:

import time
from urllib.parse import urlencode

import requests
from bs4 import BeautifulSoup

query_string = {
    "p_p_id": "petSearch2016_WAR_ptlPetRehomingPortlets",
    "p_p_lifecycle": 1,
    "p_p_state": "normal",
    "p_p_mode": "view",
    "_petSearch2016_WAR_ptlPetRehomingPortlets_action": "search",
}

payload = {
    'noPageView': 'false',
    'animalType': 'DOG',
    'freshSearch': 'false',
    'arrivalSort': 'false',
    'previousAnimalType': '',
    'location': 'WC2N5DU',
    'previousLocation': '',
    'prevSearchedPostcode': '',
    'postcode': 'WC2N5DU',
    'searchedLongitude': '-0.1282688',
    'searchedLatitude': '51.5072106',
}


def make_cookies(cookie_dict: dict) -> str:
    return "; ".join(f"{k}={v}" for k, v in cookie_dict.items())


with requests.Session() as connection:
    main_url = "https://www.rspca.org.uk"
    
    connection.headers["User-Agent"] = "Mozilla/5.0 (X11; Linux x86_64) " \
                                       "AppleWebKit/537.36 (KHTML, like Gecko) " \
                                       "Chrome/90.0.4430.212 Safari/537.36"
    r = connection.get(main_url)
    
    cookies = make_cookies(r.cookies.get_dict())
    additional_string = f"; cb-enabled=enabled; " \
                        f"LFR_SESSION_STATE_10110={int(time.time())}"
    
    post_url = f"https://www.rspca.org.uk/findapet?{urlencode(query_string)}"
    connection.headers.update(
        {
            "cookie": cookies + additional_string,
            "referer": post_url,
            "content-type": "application/x-www-form-urlencoded",
        }
    )
    response = connection.post(post_url, data=urlencode(payload)).text
    dogs = BeautifulSoup(response, "lxml").find_all("a", class_="detailLink")
    print("\n".join(f"{main_url}{dog['href']}" for dog in dogs))

Output (shortened for brevity; there's no need to paginate, as all the dogs come back in a single response):

https://www.rspca.org.uk/findapet/details/-/Animal/JAY_JAY/ref/217747/rehome/
https://www.rspca.org.uk/findapet/details/-/Animal/STORM/ref/217054/rehome/
https://www.rspca.org.uk/findapet/details/-/Animal/DASHER/ref/205702/rehome/
https://www.rspca.org.uk/findapet/details/-/Animal/EVE/ref/205701/rehome/
https://www.rspca.org.uk/findapet/details/-/Animal/SEBASTIAN/ref/178975/rehome/
https://www.rspca.org.uk/findapet/details/-/Animal/FIJI/ref/169578/rehome/
https://www.rspca.org.uk/findapet/details/-/Animal/ELLA/ref/154419/rehome/
https://www.rspca.org.uk/findapet/details/-/Animal/BEN/ref/217605/rehome/
https://www.rspca.org.uk/findapet/details/-/Animal/SNOWY/ref/214416/rehome/
https://www.rspca.org.uk/findapet/details/-/Animal/BENSON/ref/215141/rehome/
https://www.rspca.org.uk/findapet/details/-/Animal/BELLA/ref/207716/rehome/

and much more ...
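If you want to call this from a larger script, it can help to split the link extraction out of the request code so the parsing is testable on its own. A minimal sketch (the helper name parse_dog_links and the sample markup are mine, not from the site; pass it response from the code above):

```python
from bs4 import BeautifulSoup


def parse_dog_links(html: str, base_url: str = "https://www.rspca.org.uk") -> list:
    """Extract absolute detail-page URLs from a search results page."""
    soup = BeautifulSoup(html, "html.parser")
    return [f"{base_url}{a['href']}" for a in soup.find_all("a", class_="detailLink")]


# Quick check against a hand-written snippet of results markup
sample = (
    '<a class="detailLink" '
    'href="/findapet/details/-/Animal/JAY_JAY/ref/217747/rehome/">Jay Jay</a>'
)
print(parse_dog_links(sample))
```

This keeps the session/cookie plumbing separate from the HTML parsing, so you can rerun the parser on a saved page without hitting the site again.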

PS. I really enjoyed this challenge as I have two dogs from a shelter. Keep it up, man!

  • This looks amazing, thanks! Can't wait to try it out...
    – Steve
    Commented May 24, 2021 at 10:15
  • 1
    Worked great and I was able to put it in a function and call from my main script. Thanks for your help!
    – Steve
    Commented May 25, 2021 at 12:31
  • My pleasure, @Steve. Keep up the good work.
    – baduker
    Commented May 25, 2021 at 12:32
