
I am trying to scrape some text from https://www.memrise.com/course/2021573/french-1-145/garden/speed_review/?source_element=ms_mode&source_screen=eos_ms, but as you can see, when the link loads through the webdriver it automatically redirects to a login page. After I log in, it goes straight to the page I want to scrape, but BeautifulSoup just keeps scraping the login page.

How do I make it so that BeautifulSoup scrapes the page I want, and not the login page?

I have already tried putting a time.sleep() before the scrape to give myself time to log in, but that didn't work either.

import time

import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get("https://www.memrise.com/course/2021573/french-1-145/garden/speed_review/?source_element=ms_mode&source_screen=eos_ms").text, 'html.parser')
while True:
    front_half = soup.find_all(class_='qquestion qtext')
    print(front_half)
    time.sleep(1)
  • If it's handled on the server side (the way I see it), you cannot do anything, because the server will not return anything unless you send the login cookie with the GET request.
    – a-sam
    Commented Jul 30, 2019 at 13:55
  • Please can you explain a bit more? I don't understand what you mean.
    – Jack
    Commented Jul 30, 2019 at 15:30

2 Answers


What you need is probably a persistent session with requests. This answer likely covers exactly what you need. The general idea is simple:

  1. You open a session and send a request to the website.
  2. Send the login POST request so it logs you in.
  3. Query the URL with the same session.

You will need to understand how the login POST request is structured and what data is passed (username, email, etc.) and build a dict with that data.

import requests
from bs4 import BeautifulSoup

url = 'https://www.memrise.com/course/2021573/french-1-145/garden/speed_review/?source_element=ms_mode&source_screen=eos_ms'

session = requests.Session()

login_data = {
    'username': '<your username>',          # placeholder -- fill in
    'csrfmiddlewaretoken': '<csrf token>',  # placeholder -- pulled from the login form
    'password': '<your password>',          # placeholder -- fill in
    'next': '/course/2021573/french-1-145/garden/speed_review/?source_element=ms_mode&source_screen=eos_ms'
}

session.get(url)  # this will redirect you and may set some initial cookies

r = session.post('https://<theurl>/login.py', login_data)

if r.status_code == 200:  # the login request was accepted
    res = session.get(url)
    soup = BeautifulSoup(res.text, 'html.parser')
    ## (...) your scraping code
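Since the payload includes csrfmiddlewaretoken (a Django convention), you will usually need to pull that token out of the login page before posting. A minimal sketch, assuming the token sits in a hidden `<input>` named `csrfmiddlewaretoken` (the default Django field name; the form markup below is a stand-in, not the real login page):

```python
from bs4 import BeautifulSoup

def extract_csrf_token(login_page_html):
    """Pull the Django CSRF token out of the login form's hidden input."""
    soup = BeautifulSoup(login_page_html, 'html.parser')
    field = soup.find('input', attrs={'name': 'csrfmiddlewaretoken'})
    return field['value'] if field else None

# Stand-in for the HTML returned by session.get(url):
sample = '<form><input type="hidden" name="csrfmiddlewaretoken" value="abc123"></form>'
print(extract_csrf_token(sample))  # -> abc123
```

In the flow above you would call this on the response of the first `session.get(url)` and put the result into `login_data` before posting.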

What you could do is use selenium. Simply call browser.get("website.you.need") and it will take you to the login page; log in manually once. Then loop over the links you need to scrape from the same website in the same program, so the browser does not get closed and you do not lose the session. As long as the program keeps running, you can access the links you want.

Your code might look like this.

from selenium import webdriver
import time


browser = webdriver.Chrome("/usr/lib/chromium-browser/chromedriver")
browser.get('abc.com/page=1')
# this link will redirect you to the login page. Enter your credentials manually
# and wait for the login to succeed; 30 seconds should be enough
time.sleep(30)

links = ["abc.com/page=1", "abc.com/page=2"]

for link in links:
    browser.get(link)
    # this won't need a login, as you are not closing the browser
    time.sleep(5)
    html = browser.page_source
    # do your scraping here, or save the HTML source somewhere and scrape it later

browser.close()
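Once you have browser.page_source, the parsing side is the same as in the question. A sketch, reusing the 'qquestion qtext' classes from the original scraping code (the sample HTML below is a stand-in for the real page source):

```python
from bs4 import BeautifulSoup

def extract_questions(html):
    """Collect the text of every element whose class attribute is 'qquestion qtext'."""
    soup = BeautifulSoup(html, 'html.parser')
    return [el.get_text(strip=True) for el in soup.find_all(class_='qquestion qtext')]

# Stand-in for browser.page_source after logging in:
sample = '''
<div class="qquestion qtext">bonjour</div>
<div class="qquestion qtext">merci</div>
'''
print(extract_questions(sample))  # -> ['bonjour', 'merci']
```

Inside the loop you would call extract_questions(html) on each page instead of printing the raw tag list.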
