41

I am creating an XML file in Python, and one field of the XML holds the contents of a text file. I do it like this:

import xml.etree.ElementTree as ET

f = open('myText.txt', "r")
data = f.read()
f.close()

root = ET.Element("add")
doc = ET.SubElement(root, "doc")

field = ET.SubElement(doc, "field")
field.set("name", "text")
field.text = data

tree = ET.ElementTree(root)
tree.write("output.xml")

And then I get a UnicodeDecodeError. I already tried putting the special comment # -*- coding: utf-8 -*- at the top of my script, but I still got the error. I also tried forcing the encoding of my variable with data.encode('utf-8'), but the error remained. I know this issue is very common, but none of the solutions I found in other questions worked for me.

UPDATE

Traceback: Using only the special comment on the first line of the script

Traceback (most recent call last):
  File "D:\Python\lse\createxml.py", line 151, in <module>
    tree.write("D:\\python\\lse\\xmls\\" + items[ctr][0] + ".xml")
  File "C:\Python27\lib\xml\etree\ElementTree.py", line 820, in write
    serialize(write, self._root, encoding, qnames, namespaces)
  File "C:\Python27\lib\xml\etree\ElementTree.py", line 939, in _serialize_xml
    _serialize_xml(write, e, encoding, qnames, None)
  File "C:\Python27\lib\xml\etree\ElementTree.py", line 939, in _serialize_xml
    _serialize_xml(write, e, encoding, qnames, None)
  File "C:\Python27\lib\xml\etree\ElementTree.py", line 937, in _serialize_xml
    write(_escape_cdata(text, encoding))
  File "C:\Python27\lib\xml\etree\ElementTree.py", line 1073, in _escape_cdata
    return text.encode(encoding, "xmlcharrefreplace")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 243: ordinal not in range(128)

Traceback: Using .encode('utf-8')

Traceback (most recent call last):
  File "D:\Python\lse\createxml.py", line 148, in <module>
    field.text = data.encode('utf-8')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 227: ordinal not in range(128)

When I used .decode('utf-8') instead, the error message didn't appear and my XML file was created successfully. But the problem is that the XML is not viewable in my browser.

  • It would be useful to see the entire error message to see where it's coming from. In the meantime, try using decode instead of encode. Commented May 12, 2013 at 14:49
  • Updated, it successfully created my XML when I use decode, but the file is not viewable on my browser. Commented May 12, 2013 at 15:00
  • Note that the # -*- coding: utf-8 -*- comment only lets you put non-ASCII characters in the Python source file. It doesn't affect the encoding or decoding of strings in any way. Also, if the file myText.txt isn't ASCII, you should use codecs.open and provide the right encoding: codecs.open('myText.txt', 'r', 'utf-8').
    – Bakuriu
    Commented May 12, 2013 at 15:17
  • Additionally, you should add an encoding to tree.write if your text is not just ASCII (see also the docs) Commented May 12, 2013 at 15:40
  • Might have been a non-breaking space. Just saying. Option + Space on Mac. 0xC2 0xA0 in UTF-8.
    – superlukas
    Commented Mar 3, 2015 at 23:13
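
To illustrate the comments above, here is a minimal sketch, assuming the offending bytes really are a UTF-8 non-breaking space. The sample bytes are made up, not taken from the asker's file. In Python 2, calling .encode('utf-8') on a byte string first decodes it implicitly as ASCII, which is why the second traceback is a UnicodeDecodeError; the explicit ASCII decode below triggers the same failure and behaves the same on Python 2.7 and Python 3.

```python
# Hypothetical sample bytes: "a", a UTF-8 non-breaking space (0xC2 0xA0), "b".
raw = b"a\xc2\xa0b"

# Decoding as ASCII fails on the 0xC2 lead byte, like in the tracebacks.
try:
    raw.decode("ascii")
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = exc

# Decoding as UTF-8 succeeds: the two bytes 0xC2 0xA0 become one
# character, U+00A0 (NO-BREAK SPACE).
text = raw.decode("utf-8")
```

Here ascii_error.start is the byte offset of the 0xC2 byte, which corresponds to the "position" number reported in the error message.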

4 Answers

69

You need to decode the data from the input byte string into unicode before using it, to avoid encoding problems.

field.text = data.decode("utf8")
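
For context, a self-contained sketch of that one-liner, assuming the input file is UTF-8. The temporary file and its contents are hypothetical stand-ins for myText.txt; the snippet is written to run on both Python 2.7 and Python 3.

```python
import io
import os
import tempfile
import xml.etree.ElementTree as ET

# Hypothetical stand-in for myText.txt, written as raw UTF-8 bytes.
path = os.path.join(tempfile.mkdtemp(), "myText.txt")
with io.open(path, "wb") as f:
    f.write(u"price: 5\u20ac".encode("utf-8"))

# Read the raw bytes, then decode them before handing them to ElementTree.
with io.open(path, "rb") as f:
    data = f.read()

field = ET.Element("field")
field.text = data.decode("utf8")

# Serializing as UTF-8 now works and keeps the euro sign intact.
xml_bytes = ET.tostring(field, encoding="utf-8")
```
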
12

I was running into a similar error in pywikipediabot. The .decode method is a step in the right direction but for me it didn't work without adding 'ignore':

ignore_encoding = lambda s: s.decode('utf8', 'ignore')

Ignoring encoding errors can lead to data loss or produce incorrect output. But if you just want to get it done and the details aren't very important this can be a good way to move faster.
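
To make the trade-off concrete, here is a small sketch comparing the error handlers on a made-up byte string containing one invalid byte. The input is hypothetical; 'replace' is shown only as an alternative that at least marks where data was lost.

```python
# Hypothetical input: valid UTF-8 for "café", then a stray 0xFF byte,
# which can never appear in valid UTF-8.
bad = b"caf\xc3\xa9 \xff ok"

ignore_encoding = lambda s: s.decode('utf8', 'ignore')

# 'ignore' silently drops the invalid byte.
ignored = ignore_encoding(bad)

# 'replace' substitutes U+FFFD so the damage stays visible.
replaced = bad.decode('utf8', 'replace')

# The default ('strict') raises instead of guessing.
try:
    bad.decode('utf8')
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
```

With 'ignore' there is no trace left of the dropped byte, so a wrong guess about the file's encoding can go unnoticed.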

  • 11
    Do note that ignoring encoding errors will potentially lose data, or produce incorrect output.
    – tripleee
    Commented Feb 1, 2015 at 6:55
11

Python 2

The error occurs because ElementTree did not expect to find non-ASCII byte strings set in the XML when trying to write it out. You should use Unicode strings for non-ASCII text instead. Unicode strings can be made either by using the u prefix on string literals, e.g. u'€', or by decoding a byte string with the appropriate encoding, e.g. mystr.decode('utf-8').

The best practice is to decode all text data as it's read, rather than decoding mid-program. The io module provides an open() function which decodes text data to Unicode strings as it's read.

ElementTree will be much happier with Unicode strings and will encode them correctly when you call the tree's write() method.

Also, for best compatibility and readability, ensure that ET encodes to UTF-8 during write() and adds the relevant header.

Presuming your input file is UTF-8 encoded (0xC2 is a common UTF-8 lead byte), putting everything together, and using the with statement, your code should look like:

import io
import xml.etree.ElementTree as ET

with io.open('myText.txt', "r", encoding='utf-8') as f:
    data = f.read()

root = ET.Element("add")
doc = ET.SubElement(root, "doc")

field = ET.SubElement(doc, "field")
field.set("name", "text")
field.text = data

tree = ET.ElementTree(root)
tree.write("output.xml", encoding='utf-8', xml_declaration=True)

Output:

<?xml version='1.0' encoding='utf-8'?>
<add><doc><field name="text">data€</field></doc></add>
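
One way to sanity-check the result is to parse the written file back and compare the text. This is only a sketch: the temporary directory and the literal u"data\u20ac" are hypothetical stand-ins for the real output path and file contents, and it says nothing about how a particular browser renders the file.

```python
import io
import os
import tempfile
import xml.etree.ElementTree as ET

# Hypothetical output location, so the check doesn't touch real files.
out_path = os.path.join(tempfile.mkdtemp(), "output.xml")

root = ET.Element("add")
doc = ET.SubElement(root, "doc")
field = ET.SubElement(doc, "field")
field.set("name", "text")
field.text = u"data\u20ac"  # stand-in for the decoded file contents

ET.ElementTree(root).write(out_path, encoding='utf-8', xml_declaration=True)

# Inspect the raw bytes on disk.
with io.open(out_path, "rb") as f:
    raw = f.read()

# Parse the file back and pull out the field text.
parsed_text = ET.parse(out_path).getroot().find("doc/field").text
```

If parsed_text equals the original Unicode string, the file was written and declared consistently.
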
1

#!/usr/bin/python

# encoding=utf8

Try putting these lines at the start of your Python file.
