4

I'm attempting to extract article information using the python newspaper3k package and then write to a CSV file. While the info is downloaded correctly, I'm having issues with the output to CSV. I don't think I fully understand unicode, despite my efforts to read about it.

from newspaper import Article, Source
import csv

first_article = Article(url="http://www.bloomberg.com/news/articles/2016-09-07/asian-stock-futures-deviate-as-s-p-500-ends-flat-crude-tops-46")

first_article.download()
if first_article.is_downloaded:
    first_article.parse()
    first_article.nlp

article_array = []
collate = {}

collate['title'] = first_article.title
collate['content'] = first_article.text
collate['keywords'] = first_article.keywords
collate['url'] = first_article.url
collate['summary'] = first_article.summary
print(collate['content'])
article_array.append(collate)

keys = article_array[0].keys()
with open('bloombergtest.csv', 'w') as output_file:
    csv_writer = csv.DictWriter(output_file, keys)
    csv_writer.writeheader()
    csv_writer.writerows(article_array)

output_file.close()

When I print collate['content'], which is first_article.text, the console outputs the article's content just fine. Everything shows up correctly, apostrophes and all. When I write to the CVS, the content cell text has odd characters in it. For example:

“At the end of the day, Europe’s economy isn’t in great shape, inflation doesn’t look exciting and there are a bunch of political risks to reckon with.

So far I have tried:

with open('bloombergtest.csv', 'w', encoding='utf-8') as output_file:

to no avail. I also tried utf-16 instead of 8, but that just resulted in the cells writing in an odd order. It didn't create the cells correctly in the CSV, although the output looked correct. I've also tried .encode('utf-8') are various variable but nothing has worked.

What's going on? Why would the console print the text correctly, while the CSV file has odd characters? How can I fix this?

3 Answers 3

11

Add encoding='utf-8-sig' to open(). Excel requires the UTF-8-encoded BOM code point (Byte Order Mark, U+FEFF) signature to interpret a file as UTF-8; otherwise, it assumes the default localized encoding.

0
7

Changing with open('bloombergtest.csv', 'w', encoding='utf-8') as output_file: to with open('bloombergtest.csv', 'w', encoding='utf-8-sig') as output_file:, worked, as recommended by Leon and Mark Tolonen.

0
4

That's most probably a problem with the software that you use to open or print the CSV file - it doesn't "understand" that CSV is encoded in UTF-8 and assumes ASCII, latin-1, ISO-8859-1 or a similar encoding for it.

You can aid that software in recognizing the CSV file's encoding by placing a BOM sequence in the beginning of your file (which, in general, is not recommended for UTF-8).

3
  • 1
    I'm opening it in excel. Is there no way to write universal characters? Commented Sep 10, 2016 at 12:01
  • @sirryankennedy Have you tried writing UTF-8 with BOM (as shown in the linked answer)?
    – Leon
    Commented Sep 10, 2016 at 13:13
  • @sirryankennedy: there is no "universal" encoding. Even plain ASCII is not "universal". If you want to use a one-byte encoding, convert to one that contains your curly quotes, such as Windows-1252.
    – Jongware
    Commented Sep 10, 2016 at 15:27

Not the answer you're looking for? Browse other questions tagged or ask your own question.