I need to get a file for an assignment: pg100.txt, available at https://www.gutenberg.org/cache/epub/100/pg100.txt. I log in to a Linux machine with ssh user@machine and run:

wget https://www.gutenberg.org/cache/epub/100/pg100.txt

I get a file, but its contents are garbled. I want to know: 1) How can I get the correct text file? 2) Why is the text garbled when I download it with wget, although it opens normally in a browser? I log in to the remote server (CentOS 7) from my Windows 10 machine via PuTTY.

I tried asking on SO, but the bot there redirected me here. If this is not the right place to ask, let me know where to ask.

1 Answer

Web servers describe the response body in the response headers.

To see only the headers, we can run:

$ wget --spider --server-response https://www.gutenberg.org/cache/epub/100/pg100.txt  
Spider mode enabled. Check if remote file exists.
--2019-10-14 09:13:55--  https://www.gutenberg.org/cache/epub/100/pg100.txt
Resolving www.gutenberg.org (www.gutenberg.org)... 152.19.134.47
Connecting to www.gutenberg.org (www.gutenberg.org)|152.19.134.47|:443... connected.
HTTP request sent, awaiting response... 
  HTTP/1.1 200 OK
  Server: Apache
  Content-Location: pg100.txt.utf8.gzip
  Vary: negotiate
  TCN: choice
  Last-Modified: Sun, 01 Oct 2017 05:16:47 GMT
  X-Frame-Options: sameorigin
  X-Connection: Close
  Content-Type: text/plain; charset=utf-8
  Content-Encoding: gzip
  X-Powered-By: 1
  Content-Length: 2023394
  Date: Mon, 14 Oct 2019 13:13:55 GMT
  X-Varnish: 1859043781 1856607983
  Age: 104391
  Via: 1.1 varnish
Length: 2023394 (1.9M) [text/plain]
Remote file exists.
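
To pick out just the relevant header from output like the above, a small filter can help. This is only a sketch: `content_encoding` is a hypothetical helper name, and it assumes the headers arrive on standard input in the `Name: value` form shown above.

```shell
# Hypothetical helper: print the value of the Content-Encoding header
# found in header text read from standard input.
content_encoding() {
    # awk ignores the leading indentation; the header name is matched
    # case-insensitively.
    awk 'tolower($1) == "content-encoding:" { print $2 }'
}

# Usage (wget writes the server response to stderr, hence 2>&1):
#   wget --spider --server-response \
#       https://www.gutenberg.org/cache/epub/100/pg100.txt 2>&1 | content_encoding
```

If this prints gzip, the body is compressed. If it prints nothing, the server did not send a Content-Encoding header.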

Once we see that the content is actually compressed with gzip, we can use gunzip to decompress it:

$ wget -O - https://www.gutenberg.org/cache/epub/100/pg100.txt | gunzip -c > pg100.txt
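
Piping straight into gunzip fails if the server ever sends the file uncompressed, because gunzip rejects plain text. A more defensive variant, as a rough sketch, is to download normally and decompress only when the file really is gzip data. The name `fix_gzip` is made up for this example.

```shell
# Hypothetical helper (not part of wget): decompress FILE in place,
# but only if it actually contains gzip data.
fix_gzip() {
    file=$1
    # gzip -t tests the stream on stdin and exits 0 only for valid gzip data.
    if gzip -t < "$file" 2>/dev/null; then
        gunzip -c < "$file" > "$file.tmp" && mv "$file.tmp" "$file"
    fi
}

# Usage, after a plain download:
#   wget https://www.gutenberg.org/cache/epub/100/pg100.txt
#   fix_gzip pg100.txt
```

A file that is already plain text is left untouched, so the helper can be run whether or not the response was compressed.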

A modern browser does this decompression automatically, which is why the page displays normally there.
