3

How do I extract all the external links of a web page and save them to a file?

If there are any command-line tools for this, that would be great.

An almost identical question was asked here, and the answer worked gracefully for google.com, but for some reason it doesn't work with YouTube, for instance. I'll explain: let's take this page as an example. If I try to run

lynx -dump http://www.youtube.com/playlist?list=PLAA9A2EFA0E3A2039&feature=plcp | awk '/http/{print $2}' | grep watch > links.txt

then, unlike with google.com, it first executes lynx's dump, then hands control to awk (for some reason with empty input), and finally writes nothing to the file links.txt. Only after that does it display the unfiltered lynx dump, with no way to redirect it anywhere else.

Thank you in advance!

1
  • Somewhere I saw a mention of a 'dog' command that can do the same thing, but I failed to find it anywhere else.
    – whoever
    Commented Apr 7, 2012 at 12:11

3 Answers

3
lynx -dump 'http://www.youtube.com/playlist?list=PLAA9A2EFA0E3A2039&feature=plcp' | awk '/http/{print $2}' | grep watch > links.txt

works. You need to quote the URL (or escape the &) so the shell does not interpret it.

In your original line, the unquoted & sends lynx to the background and starts a new command from everything after it, so awk gets empty input and nothing reaches links.txt. The backgrounded lynx still writes its output to the terminal you are in, but, as you noticed, it is not part of the > redirect (ambiguity: which process should write to the file?).

Addendum: I'm assuming a typo in your original command: the beginning and ending ' should not be present. Otherwise you would get different error messages from trying to execute a non-existent command. Removing them gives exactly the behavior you describe.

2
  • Thanks so much! I hate myself for being such a newbie, but all in all, two weeks of using Linux isn't much time, is it? Thanks once again.
    – whoever
    Commented Apr 7, 2012 at 12:43
  • @user1212010: This site relies on the questioner to mark the answer as correct if he/she feels it solved the problem. Checking it as such is the best way to say "Thanks" on SU :-)
    Commented Apr 7, 2012 at 12:53
0

Use your favorite search engine and look for 'website scraper script' or 'website scraping script' together with whatever programming language you are most comfortable with. You have thousands upon thousands of options, so make your search as specific as you can.

0

While there are lots of options to choose from, I would recommend using Python with BeautifulSoup - this gives you total control of the process, including following redirects, handling self-signed/expired SSL certificates, working around invalid HTML, extracting links only from specific page blocks, and so on.

For an example, check out this thread: https://stackoverflow.com/questions/1080411/retrieve-links-from-web-page-using-python-and-beautiful-soup

Installing BeautifulSoup is as trivial as running pip install beautifulsoup4 or easy_install beautifulsoup4 if you are on Linux. On win32 it is probably easiest to use one of the binary installers.
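
For reference, here's a minimal sketch of that approach (assuming Python 3 and the beautifulsoup4 package, which is imported as bs4; the URL and the 'watch' filter simply mirror the command in the question, so adjust them to your own page):

# Rough sketch, not a drop-in solution: fetch a page, collect the href of every
# <a> tag as an absolute URL, keep the ones containing "watch", and save them.
from urllib.parse import urljoin
from urllib.request import urlopen

from bs4 import BeautifulSoup

page_url = "http://www.youtube.com/playlist?list=PLAA9A2EFA0E3A2039&feature=plcp"

html = urlopen(page_url).read()
soup = BeautifulSoup(html, "html.parser")

links = []
for a in soup.find_all("a", href=True):
    absolute = urljoin(page_url, a["href"])  # resolve relative hrefs against the page URL
    if "watch" in absolute:                  # same filter as the grep in the question
        links.append(absolute)

with open("links.txt", "w") as f:
    f.write("\n".join(links) + "\n")

Filtering on the URL string is just the simplest option; with BeautifulSoup you could instead restrict the search to a specific page block, e.g. soup.find(id=...).find_all("a"), which is where it really pays off over a lynx/awk pipeline.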
