I am using wget to retrieve particular PDF files from http://www.aph.gov.au/
I only want to retrieve Hansard files (transcripts of Chamber proceedings).
Two scenarios:
- There is a page where Hansard transcripts are listed:
http://www.aph.gov.au/Parliamentary_Business/Hansard/Hansreps_2011
Clicking a day/date link on this page runs a database query whose response page lists links to further files. I only want the file labelled 'Download Current Hansard', which is the whole day's transcript (I don't want the 'fragments').
I can click through to the query response, harvest the URL(s) for the whole day's transcript, package them in a file, and retrieve them with wget -i.
I am looking for a way to make wget grab only those whole-day transcripts.
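To show what I mean, here is a sketch of the harvesting step I would like to automate. I don't know the exact markup aph.gov.au emits, so the link pattern (anchor text 'Download Current Hansard' directly after the href) is an assumption, and I test it against an inline stand-in page rather than the live site:

```shell
#!/bin/sh
# Stand-in for one query-response page; in live use this would come from
# something like: wget -q -O - "$day_url" > response.html
cat > response.html <<'EOF'
<a href="/hansard/fragment1.pdf">Fragment 1</a>
<a href="/hansard/whole_day.pdf">Download Current Hansard</a>
EOF

# Keep only links whose anchor text is 'Download Current Hansard'
# (an assumption about the markup), then strip everything but the href.
grep -o '<a href="[^"]*">Download Current Hansard' response.html \
  | sed 's/.*href="\([^"]*\)".*/\1/' > urls.txt

cat urls.txt           # -> /hansard/whole_day.pdf
# Then fetch the lot, resolving relative links against the site root:
# wget --base=http://www.aph.gov.au/ -i urls.txt
```

The last (commented) line is the wget -i step I already use; --base resolves relative URLs in the input file.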
- Only some years are listed on that page. For the rest, going to the database, running an advanced search on Hansard, clicking a decade range at the upper left of the screen, and then a year produces a listing of the days in that year. Again, the top-level link doesn't yield the PDF of the whole day's transcript; clicking the title displays a page containing a link to it.
I would like to use wget to retrieve just the PDFs of the whole day's transcripts.
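For this second scenario the same extraction would need to run over every day page in a year's listing. A sketch of that loop, again assuming (not verified) that the whole-day link is the one with anchor text 'Download Current Hansard', and using local stand-in files in place of the fetched day pages:

```shell
#!/bin/sh
# Print the whole-day PDF link from one saved day page.
# The 'Download Current Hansard' pattern is an assumption about the markup.
extract_whole_day() {
  grep -o '<a href="[^"]*">Download Current Hansard' "$1" \
    | sed 's/.*href="\([^"]*\)".*/\1/'
}

# Local stand-ins for two day pages; in live use each would come from
# wget -q -O - "$page_url" > "day_$d.html"
for d in 2003-05-13 2003-05-14; do
  cat > "day_$d.html" <<EOF
<a href="/hansard/$d/fragments.pdf">Fragments</a>
<a href="/hansard/$d/whole.pdf">Download Current Hansard</a>
EOF
done

# Collect one whole-day URL per day page, then fetch them in one pass.
for f in day_*.html; do
  extract_whole_day "$f"
done > whole_day_urls.txt

cat whole_day_urls.txt
# wget --base=http://www.aph.gov.au/ -i whole_day_urls.txt
```

This is only the pattern I have in mind; if wget itself can select links by anchor text (rather than by URL suffix, as -A/--accept does), that would be simpler still.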
Any advice would be gratefully received. I am making progress with the 'semi-manual' method, but it is slow and labour-intensive.