I'm trying to retrieve CSV data from a website through this link.

When downloaded manually, you get synop.201708.csv.gz, which is in fact a CSV wrongly named .gz; it weighs 2233 KB.

When I run this code:

import urllib

file_date = '201708'
file_url = "https://donneespubliques.meteofrance.fr/donnees_libres/Txt/Synop/Archive/synop.{}.csv.gz".format(file_date)
output_file_name = "{}.csv.gz".format(file_date)

print "downloading {} to {}".format(file_url, output_file_name)
urllib.urlretrieve(file_url, output_file_name)

I get a corrupted file of about 361 KB.

Any ideas why?

  • What is the content of the downloaded file? Trimmed data, or actually some web page with a warning about something? Commented Aug 18, 2017 at 14:54
  • The CSV file contains weather station data. Commented Aug 18, 2017 at 14:55
  • Change output_file_name = "{}.csv.gz".format(file_date) to output_file_name = "{}.csv".format(file_date). Commented Aug 18, 2017 at 14:59
  • @JoaoVitorino And how will changing the name of the output change the input being received? Commented Aug 18, 2017 at 15:00
  • @pvg Wow, that is crazy: my browser (Chrome) unzips the file without telling me and keeps the .gz name, which is why I thought I was getting an unzipped file. Commented Aug 18, 2017 at 15:04

2 Answers

What seems to be happening is that the MétéoFrance site is misusing the Content-Encoding header. The website reports that it is serving you a gzip file (Content-Type: application/x-gzip) and that it is encoding it in gzip format for the transfer (Content-Encoding: x-gzip). It also says that the response is an attachment, which should be saved under its normal name (Content-Disposition: attachment).

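In a vacuum, this would make sense (to a degree; compressing an already compressed file is mostly useless): the server serves a gzip file and compresses it again for transport. Upon receipt, your browser undoes the transport compression and saves the original gzip file. Here, however, the file was compressed only once. The browser still decompresses the stream, so what it saves is the plain CSV under a .gz name. urllib.urlretrieve does not undo Content-Encoding, so it saves the bytes as sent: a genuine, smaller gzip archive that only looks corrupted if you open it as a CSV.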
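One way to tell which of the two you actually have on disk is to look at the first two bytes, since every gzip stream starts with 0x1f 0x8b. The following is only a minimal sketch, assuming Python 3 (the question uses Python 2); the helper names looks_like_gzip and to_plain_bytes are made up for illustration.

```python
import gzip

# Every gzip stream starts with these two bytes (the gzip magic number).
GZIP_MAGIC = b"\x1f\x8b"


def looks_like_gzip(data):
    """Return True if the bytes start with the gzip magic number."""
    return data[:2] == GZIP_MAGIC


def to_plain_bytes(data):
    """Decompress gzip data; return anything else unchanged."""
    if looks_like_gzip(data):
        return gzip.decompress(data)
    return data
```

Reading the downloaded file in binary mode and passing its contents to to_plain_bytes should give the CSV text whether the file came from the browser (already decompressed) or from urllib (still compressed).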

As pvg said:

The file downloaded by urllib.urlretrieve is a compressed archive and not a CSV file; everything is fine.

I thought that I was supposed to get a CSV named .gz because, when I downloaded it manually through my browser (Chrome), it unzipped the file without telling me and kept the .gz name on the unzipped file.
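Since the downloaded file is a real gzip archive, it only needs to be decompressed to get the CSV. Here is a rough sketch of how that could look, assuming Python 3 (where urlretrieve lives in urllib.request rather than urllib); gunzip_file and download_csv are hypothetical helper names, and the URL is the one from the question.

```python
import gzip
import shutil
import urllib.request


def gunzip_file(gz_path, csv_path):
    """Decompress a gzip archive on disk into a plain file."""
    with gzip.open(gz_path, "rb") as src, open(csv_path, "wb") as dst:
        shutil.copyfileobj(src, dst)


def download_csv(file_date):
    """Download one monthly archive and return the path of the plain CSV."""
    file_url = ("https://donneespubliques.meteofrance.fr/donnees_libres/"
                "Txt/Synop/Archive/synop.{}.csv.gz".format(file_date))
    gz_path = "{}.csv.gz".format(file_date)
    csv_path = "{}.csv".format(file_date)
    # urlretrieve does not undo Content-Encoding, so this saves the gzip archive.
    urllib.request.urlretrieve(file_url, gz_path)
    gunzip_file(gz_path, csv_path)
    return csv_path

# Usage (needs network access): download_csv("201708")
```

Keeping the decompression in its own function means it can also be run on an archive that has already been downloaded.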
