
I want to write a program that searches through a fairly large website and extracts certain things. I've had a couple online Python courses, but neither said anything about how to access the internet with Python. I have no idea where I ought to start with this.

  •
    You'll need to read about HTTP, HTML, and probably JS/PHP/etc., at least dip your toes into a more robust understanding of the DOM, then learn about text parsing/processing. Look at urllib/urllib2/httplib/requests/etc., and something like BeautifulSoup or even Selenium, depending upon the complexity and interactivity you need.
    – Silas Ray
    Commented Apr 3, 2013 at 22:00
  •
    Have you looked at the Python documentation? First result on Google for "Python Internet" by the way...
    – kindall
    Commented Apr 3, 2013 at 22:09

3 Answers


You first need to read about the standard Python library urllib2.
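urllib2 is Python 2; in Python 3 the same functionality lives in urllib.request. A minimal sketch of fetching a URL, using a `data:` URL as a stand-in for a real `http://` address so it runs without network access:

```python
from urllib.request import urlopen

# Python 3 merged urllib2 into urllib.request; the call pattern is the same.
# A data: URL stands in for a real http:// address so the sketch runs offline.
url = "data:text/html;charset=utf-8,<title>Hello</title>"

with urlopen(url) as response:
    html = response.read().decode("utf-8")

print(html)  # -> <title>Hello</title>
```

In real use you would pass a page URL such as `"http://example.com"` and check `response.status` before reading.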

Once you are comfortable with the basic ideas behind this lib, you can try requests, which makes interacting with the web, especially APIs, much easier. I suggest using it alongside httpie to test queries quick and dirty from the command line.

If you go a little further and build a library or an engine to crawl the web, you will need some sort of asynchronous programming; I recommend starting with Gevent.
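Gevent is a third-party package, so here is a hedged sketch of the same idea (overlapping many slow fetches instead of doing them one at a time) using only the standard library's thread pool; it is a stand-in for gevent's greenlets, not gevent itself. The `data:` URLs are placeholders so the example runs offline:

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

# gevent itself is a third-party package; this stdlib thread-pool sketch
# only illustrates the same idea of overlapping many slow fetches.
# data: URLs stand in for real pages so the example runs offline.
urls = [
    "data:text/plain,page-one",
    "data:text/plain,page-two",
    "data:text/plain,page-three",
]

def fetch(url):
    """Download one URL and return its body as text."""
    with urlopen(url) as response:
        return response.read().decode("utf-8")

# map() runs fetch() across the pool, preserving input order in the results.
with ThreadPoolExecutor(max_workers=3) as pool:
    pages = list(pool.map(fetch, urls))

print(pages)  # -> ['page-one', 'page-two', 'page-three']
```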

Finally, if you want to create a crawler/bot, you can take a look at Scrapy. You should, however, start with the basic libraries before diving into this one, as it can get quite complex.


It sounds like you want a web crawler/scraper. What sorts of things do you want to pull? Images? Links? Either way, that's exactly the job for a web crawler/scraper.

Start there; there should be lots of articles on Stack Overflow that will help you implement details such as connecting to the internet (getting a web response).

See this article.
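For the "extracts certain things" part, a minimal link extractor can be built with the standard library's html.parser alone; for messy real-world HTML a tolerant third-party parser like BeautifulSoup is usually easier. The `page` string here is an invented example, not from any real site:

```python
from html.parser import HTMLParser

# Minimal link extractor using only the standard library; for messy
# real-world HTML a tolerant parser like BeautifulSoup is usually easier.
class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag's attributes.
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

page = '<p><a href="/about">About</a> and <a href="https://example.com">home</a></p>'
parser = LinkExtractor()
parser.feed(page)
print(parser.links)  # -> ['/about', 'https://example.com']
```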


There is much more on the internet than just websites, but I assume you just want to crawl some HTML pages and extract data from them. You have many options for solving that problem. Here are some starting points: