
I am using wget to retrieve particular pdf files from http://www.aph.gov.au/

I only want to retrieve Hansard files (transcripts of Chamber proceedings).

Two scenarios:

  1. There is a page where Hansard transcripts are listed:

http://www.aph.gov.au/Parliamentary_Business/Hansard/Hansreps_2011

Clicking a day/date link on this page returns the response to a database query, which displays links to further files. I only want to retrieve the file labelled 'Download Current Hansard', which is the whole day's transcript (I don't want to retrieve the 'fragments').

I am able to click through to the query response, harvest the URL(s) for the whole day's transcript, package them in a file, and retrieve them using wget -i (see the example after this list).

I am seeking a way to use wget to grab the whole-day transcripts only.

  2. Only some years are listed on that page. However, going to the database, running an advanced search on Hansard, clicking a decade range at the upper left of the screen, and then a year, produces a listing of the days in that year. Again, the top-level link doesn't yield the PDF of the whole day's transcript, but clicking the title displays a page with a link to the whole day's transcript.

I would like to use wget to retrieve just the PDFs of the whole-day transcripts.
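
For reference, the retrieval step of the 'semi-manual' method in scenario 1 is just the following, where hansard_urls.txt is a hypothetical name for the hand-harvested list of whole-day URLs, one per line:

    wget -i hansard_urls.txt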

Any advice would be gratefully received. I am making progress with the 'semi-manual' method, but it is slow and labour-intensive.

1 Answer


You won't be able to do this using only wget.

You'll need to create a script that grabs the first page with the date links and parses it for the correct URLs. The script would then fetch the page at each of those URLs and parse it for the URL of the PDF.

This could be done with a custom Python script that uses the BeautifulSoup library.
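
A minimal sketch of that approach, assuming Python 3 with the requests and beautifulsoup4 packages installed; the link texts used as filters ('Hansard' for the day links, 'Download Current Hansard' for the whole-day PDF) are taken from the question and would need checking against the site's actual markup:

    import urllib.parse

    import requests
    from bs4 import BeautifulSoup

    LISTING_URL = "http://www.aph.gov.au/Parliamentary_Business/Hansard/Hansreps_2011"

    def links_containing(page_url, link_text):
        """Return absolute URLs of all <a> tags whose text contains link_text."""
        soup = BeautifulSoup(requests.get(page_url).text, "html.parser")
        return [urllib.parse.urljoin(page_url, a["href"])
                for a in soup.find_all("a", href=True)
                if link_text.lower() in a.get_text().lower()]

    # Step 1: harvest the day/date links from the listing page.
    day_pages = links_containing(LISTING_URL, "Hansard")

    # Step 2: on each day page, find the whole-day transcript link.
    pdf_urls = []
    for day_page in day_pages:
        pdf_urls.extend(links_containing(day_page, "Download Current Hansard"))

    # Step 3: write the PDF URLs to a file that wget -i can consume,
    # so the download step stays the same as in the question.
    with open("hansard_urls.txt", "w") as f:
        f.write("\n".join(pdf_urls) + "\n")

The 'Hansard' filter in step 1 is almost certainly too loose and will need tightening once you inspect the listing page's HTML; the point is that two rounds of fetch-and-parse take you from the listing page to a plain list of whole-day PDF URLs that wget can then download.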

