
I'm trying to get a Python 3 program to do some manipulations with a text file filled with information. However, when trying to read the file I get the following error:

Traceback (most recent call last):  
  File "SCRIPT LOCATION", line NUMBER, in <module>  
    text = file.read()
  File "C:\Python31\lib\encodings\cp1252.py", line 23, in decode  
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 2907500: character maps to `<undefined>`  

After reading this Q&A, see How to determine the encoding of text if you need help figuring out the encoding of the file you are trying to open.


15 Answers


The file in question is not using the CP1252 encoding. It's using another encoding. Which one you have to figure out yourself. Common ones are Latin-1 and UTF-8. Since 0x90 doesn't actually mean anything in Latin-1, UTF-8 (where 0x90 is a continuation byte) is more likely.

You specify the encoding when you open the file:

file = open(filename, encoding="utf8")
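Since the encoding has to be figured out, one rough approach (a sketch of mine, not part of this answer; `guess_open` and the candidate list are made up) is to try a few common encodings and see which one decodes without errors:

```python
def guess_open(filename, candidates=("utf-8", "cp1252", "latin-1")):
    # Try each candidate in turn; the first that decodes without error wins.
    # NOTE: decoding successfully does NOT prove the guess is right --
    # latin-1 in particular accepts any byte sequence whatsoever.
    for enc in candidates:
        try:
            with open(filename, encoding=enc) as f:
                return f.read(), enc
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings decoded the file")
```

This is only a heuristic; as the comments below point out, a file that happens to decode is not guaranteed to be legible.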
  • If you're using Python 2.7 and getting the same error, try the io module: io.open(filename, encoding="utf8"). Commented Jun 3, 2015 at 14:02
  • +1 for specifying the encoding on read. P.S. Is it supposed to be encoding="utf8" or encoding="utf-8"?
    – Davos
    Commented Feb 3, 2016 at 23:03
  • @1vand1ng0: of course Latin-1 works; it'll work for any file regardless of what the actual encoding of the file is. That's because all 256 possible byte values in a file have a Latin-1 codepoint to map to, but that doesn't mean you get legible results! If you don't know the encoding, even opening the file in binary mode instead might be better than assuming Latin-1. Commented Mar 6, 2017 at 14:10
  • I get the OP's error even though the encoding is already specified correctly as UTF-8 (as shown above) in open(). Any ideas?
    – enahel
    Commented Nov 15, 2017 at 7:11
  • @rob_7cc That's not necessary. 'utf8' is an alias for 'utf-8'. docs
    – wjandrea
    Commented Apr 26, 2023 at 19:29

If file = open(filename, encoding="utf-8") doesn't work and you just want to drop the characters that can't be decoded, try
file = open(filename, errors="ignore"). (docs)
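A minimal sketch (mine, not from the answer) of the data loss this implies — bytes the codec can't decode are silently dropped:

```python
# "café" encoded as UTF-8 contains two bytes (0xC3 0xA9) that ASCII
# cannot decode; errors="ignore" drops them without any warning.
data = "café".encode("utf-8")                  # b'caf\xc3\xa9'
print(data.decode("ascii", errors="ignore"))   # caf  <- the é is gone
```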

  • Warning: this will result in data loss when unknown characters are encountered (which may be fine depending on your situation). Commented Feb 28, 2019 at 0:46
  • Using file = open(filename, errors="ignore") will ignore any error and not display it in the terminal. It doesn't solve the actual issue.
    – Gangula
    Commented Dec 4, 2023 at 18:22
  • Good warnings to heed. This solution was helpful for an application where I was searching through a number of files and selecting only the ones that applied to my use case, i.e. I couldn't control the file types or encodings in the pool of files and was just picking out the ones I needed based on text within the file. I wasn't concerned with data integrity and was not modifying the files. Commented Apr 19 at 21:30

Alternatively, if you don't need to decode the file at all, such as when uploading it to a website, use:

open(filename, 'rb')

where r = reading, b = binary
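A small sketch (mine, using a throwaway temp file) showing that binary mode returns bytes and never attempts to decode, so no encoding error can occur:

```python
import tempfile, os

# Write two bytes that many text codecs reject, then read them back raw.
path = tempfile.mktemp()
with open(path, "wb") as f:
    f.write(b"\x90\xff")
with open(path, "rb") as f:   # binary mode: no decoding happens
    raw = f.read()
print(type(raw).__name__)     # bytes
os.remove(path)
```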

  • Perhaps emphasize that the b will produce bytes instead of str data. As you note, this is suitable if you don't need to process the bytes in any way.
    – tripleee
    Commented May 10, 2022 at 7:23
  • The top two answers didn't work, but this one did. I was trying to read a dictionary of pandas dataframes and kept getting errors.
    – Realhermit
    Commented Nov 17, 2022 at 18:40
  • @Realhermit Please see stackoverflow.com/questions/436220. Every text file has a particular encoding, and you have to know what it is in order to use it properly. The common guesses won't always be correct. Commented Apr 12, 2023 at 22:20

TLDR: Try: file = open(filename, encoding='cp437')

Why? When one uses:

file = open(filename)
text = file.read()

Python assumes the file uses the same codepage as the current environment (cp1252 in the case of the opening post) and tries to decode it to its own default UTF-8. If the file contains characters with values not defined in this codepage (like 0x90), we get a UnicodeDecodeError. Sometimes we don't know the encoding of the file, sometimes the file's encoding may not be handled by Python (e.g. cp790), and sometimes the file contains mixed encodings.

If such characters are unneeded, one may decide to replace them with the Unicode replacement character U+FFFD, with:

file = open(filename, errors='replace')

Another workaround is to use:

file = open(filename, errors='ignore')

The offending characters are then simply left out, but other errors will be masked too.

A very good solution is to specify the encoding, yet not any encoding (like cp1252), but the one which maps every single-byte value (0..255) to a character (like cp437 or latin1):

file = open(filename, encoding='cp437')

Codepage 437 is just an example. It is the original DOS encoding. All codes are mapped, so there are no errors while reading the file, no errors are masked out, the characters are preserved (not quite left intact but still distinguishable) and one can check their ord() values.

Please note that this advice is just a quick workaround for a nasty problem. The proper solution is to use binary mode, although it is not as quick.
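A minimal illustration of the difference (my sketch, using the 0x90 byte from the question's traceback):

```python
# 0x90 is undefined in cp1252, but cp437 maps every one of the 256
# possible byte values to some character, so decoding never fails.
raw = b"caf\x90"
print(raw.decode("cp437"))                      # cafÉ
print(raw.decode("cp1252", errors="replace"))   # caf + U+FFFD (shown as caf�)
```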

  • Probably you should emphasize even more that randomly guessing at the encoding is likely to produce garbage. You have to know the encoding of the data.
    – tripleee
    Commented May 10, 2022 at 7:21
  • There are many encodings that "have all characters defined" (you really mean "map every single-byte value to a character"). CP437 is very specifically associated with the Windows/DOS ecosystem. In most cases, Latin-1 (ISO-8859-1) will be a better starting guess. Commented Apr 12, 2023 at 22:22
  • @tripleee - The solution is a quick workaround for a nasty error, allowing you to check what is going on. Sometimes there is a garbage character placed inside a big, perfectly encoded text. Using the encoding of that character would break the decoding of the rest of the text. What's more, the encoding may not be handled by Python (e.g. cp790). Still, today I would rather use binary mode and handle the decoding myself.
    – rha
    Commented Apr 8 at 16:39
  • @Karl Knechtel - Yes, your phrase is better. I am going to edit my text.
    – rha
    Commented Apr 8 at 16:39
  • Thanks for the update. However, the part about decoding cp1252 "to its own default UTF-8" is still weird. Python is decoding cp1252 into the internal string representation, which is not UTF-8 (or necessarily any standard Unicode representation, although Python strings are defined to be Unicode).
    – tripleee
    Commented Apr 9 at 7:46

As an extension to @LennartRegebro's answer:

If you can't tell what encoding your file uses, the solution above does not work (it's not UTF-8), and you find yourself merely guessing, there are online tools you can use to identify the encoding. They aren't perfect, but they usually work just fine. After you figure out the encoding, you should be able to use the solution above.

EDIT: (Copied from comment)

The quite popular text editor Sublime Text has a command to display the encoding if it has been set:

  1. Go to View -> Show Console (or Ctrl+`)

  2. Type view.encoding() into the field at the bottom and hope for the best (I was unable to get anything but Undefined, but maybe you will have better luck...)

  • Some text editors will provide this information as well. I know that with vim you can get this via :set fileencoding (from this link). Commented Dec 17, 2016 at 15:20
  • Sublime Text, also -- open up the console and type view.encoding().
    – JimmidyJoo
    Commented Jul 12, 2017 at 20:27
  • Alternatively, you can open your file with Notepad. 'Save As' and you shall see a drop-down with the encoding used. Commented Mar 5, 2020 at 12:11
  • Please see stackoverflow.com/questions/436220 for more details on the general task. Commented Apr 12, 2023 at 22:23

Stop wasting your time: just add encoding="cp437" and errors='ignore' to your code for both reading and writing:

open('filename.csv', encoding="cp437", errors='ignore')
open(file_name, 'w', newline='', encoding="cp437", errors='ignore')

Godspeed

  • Before you apply that, be sure that you want your 0x90 to be decoded to 'É'. Check b'\x90'.decode('cp437').
    – hanna
    Commented Aug 6, 2020 at 15:56
  • This is absolutely horrible advice. Code page 437 is a terrible guess unless your source data comes from an MS-DOS system from the 1990s, and ignoring errors is often the worst possible way to silence the warnings. It's like cutting the wires to the "engine hot" and "fuel low" lights in your car to get rid of those annoying distractions.
    – tripleee
    Commented Oct 25, 2022 at 9:18
def read_files(file_path):
    with open(file_path, encoding='utf8') as f:
        text = f.read()
        return text

OR (AND)

def write_files(text, file_path):
    with open(file_path, 'wb') as f:
        f.write(text.encode('utf8', 'ignore'))

OR

document = Document()
document.add_heading(file_path.name, 0)
file_content = file_path.read_text(encoding='UTF-8')
document.add_paragraph(file_content)

OR

def read_text_from_file(cale_fisier):
    text = cale_fisier.read_text(encoding='UTF-8')
    print("What I read: ", text)
    return text  # return the text that was read

def save_text_into_file(cale_fisier, text):
    with open(cale_fisier, "w", encoding='utf-8') as f:  # open the file
        print("What I wrote: ", text)
        f.write(text)  # write the content to the file

OR

def read_text_from_file(file_path):
    with open(file_path, encoding='utf8', errors='ignore') as f:
        text = f.read()
        return text # return written text


def write_to_file(text, file_path):
    with open(file_path, 'wb') as f:
        f.write(text.encode('utf8', 'ignore')) # write the content to the file

OR

import os
import glob

def change_encoding(fname, from_encoding, to_encoding='utf-8') -> None:
    '''
    Read the file at path fname with its original encoding (from_encoding)
    and rewrites it with to_encoding.
    '''
    with open(fname, encoding=from_encoding) as f:
        text = f.read()

    with open(fname, 'w', encoding=to_encoding) as f:
        f.write(text)
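A short self-contained sketch (mine, with a made-up temp file) of what change_encoding does end to end, converting cp1252 to UTF-8:

```python
import tempfile, os

# Create a file in cp1252, then re-encode it as UTF-8 the same way
# change_encoding does: read with the source encoding, rewrite with
# the target one.
path = tempfile.mktemp()
with open(path, "w", encoding="cp1252") as f:
    f.write("déjà vu")

with open(path, encoding="cp1252") as f:   # read with original encoding
    text = f.read()
with open(path, "w", encoding="utf-8") as f:  # rewrite as UTF-8
    f.write(text)

with open(path, encoding="utf-8") as f:
    print(f.read())                        # déjà vu
os.remove(path)
```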

The code below will decode the file's UTF-8-encoded content:

with open("./website.html", encoding="utf8") as file:
    contents = file.read()

Before you apply the suggested solution, you can check which Unicode character appeared in your file (and in the error log), in this case 0x90: https://unicodelookup.com/#0x90/1 (or directly at the Unicode Consortium site http://www.unicode.org/charts/ by searching for 0x0090),

and then consider removing it from the file.
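Instead of an online lookup, you can also ask the exception itself, which records the offending bytes and their position; a quick sketch:

```python
# UnicodeDecodeError carries .object (the bytes), .start/.end (the
# offending slice) and .reason -- enough to locate the bad byte.
try:
    b"abc\x90def".decode("cp1252")
except UnicodeDecodeError as e:
    print(e.start, e.object[e.start:e.end], e.reason)
    # 3 b'\x90' character maps to <undefined>
```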

  • I have a web page at tripleee.github.io/8bit/#90 where you can look up the character's value in the various 8-bit encodings supported by Python. With enough data points, you can often infer a suitable encoding (though some of them are quite similar, so establishing exactly which encoding the original writer used will often involve some guesswork, too).
    – tripleee
    Commented May 10, 2022 at 7:24

For me, encoding with UTF-16 worked:

file = open('filename.csv', encoding="utf16")
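One way to check whether UTF-16 is even plausible is to look for a byte-order mark at the start of the file; a sketch of mine, with a throwaway temp file standing in for the CSV:

```python
import tempfile, os

# Python's "utf16" codec writes a BOM, so a real UTF-16 file usually
# starts with FF FE (little-endian) or FE FF (big-endian).
path = tempfile.mktemp()
with open(path, "w", encoding="utf16") as f:
    f.write("hello")
with open(path, "rb") as f:
    head = f.read(2)
print(head in (b"\xff\xfe", b"\xfe\xff"))   # True -> likely UTF-16
os.remove(path)
```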
  • Like many of the other answers on this page, randomly guessing which encoding the OP is actually dealing with is mostly a waste of time. The proper solution is to tell them how to figure out the correct encoding, not offer more guesses (the Python documentation contains a list of all of them; there are many, many more which are not suggested in any answer here yet, but which could be correct for any random visitor). UTF-16 is pesky in that the results will often look vaguely like valid Chinese or Korean text if you don't speak the language.
    – tripleee
    Commented Oct 25, 2022 at 9:13

For those working in Anaconda on Windows: I had the same problem, and Notepad++ helped me solve it.

Open the file in Notepad++. In the bottom right it will tell you the current file encoding. In the top menu, next to "View", locate "Encoding". Under "Encoding", go to "Character sets" and patiently look for the encoding that you need. In my case, "Windows-1252" was found under "Western European".

  • Only the viewing encoding is changed this way. To effectively change the file's encoding, change the preferences in Notepad++ and create a new document, as shown here: superuser.com/questions/1184299/….
    – hanna
    Commented Aug 6, 2020 at 10:36

In newer versions of Python (starting with 3.7), you can add the interpreter option -Xutf8, which should fix your problem. If you use PyCharm, just go to Run > Edit Configurations (in the Configuration tab, change the value of the Interpreter options field to -Xutf8).

Or, equivalently, you can just set the environment variable PYTHONUTF8 to 1.
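You can verify from within Python whether UTF-8 mode is actually active; a small sketch:

```python
import sys

# 1 when the interpreter was started with -Xutf8 or PYTHONUTF8=1,
# 0 otherwise (Python 3.7+).
print(sys.flags.utf8_mode)
```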

  • This assumes that the source data is UTF-8, which is by no means a given.
    – tripleee
    Commented May 10, 2022 at 7:22

If you are on Windows, the file may start with a UTF-8 BOM, indicating it definitely is a UTF-8 file. As per https://bugs.python.org/issue44510, I used encoding="utf-8-sig" and the csv file was read successfully.
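A sketch of mine (with a throwaway temp file) showing the difference between utf-8 and utf-8-sig when a BOM is present:

```python
import tempfile, os

# Writing with utf-8-sig prepends a BOM; reading it back with plain
# utf-8 leaks the BOM into the text, while utf-8-sig strips it.
path = tempfile.mktemp()
with open(path, "w", encoding="utf-8-sig") as f:
    f.write("hello")

with open(path, encoding="utf-8") as f:
    print(repr(f.read()))      # '\ufeffhello' -- BOM leaks into the text
with open(path, encoding="utf-8-sig") as f:
    print(repr(f.read()))      # 'hello' -- BOM stripped
os.remove(path)
```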


For me, changing the MySQL character encoding to match my code helped to sort out the solution. photo=open('pic3.png',encoding=latin1)

  • Like many other random guesses, "latin-1" will remove the error, but will not guarantee that the file is decoded correctly. You have to know which encoding the file actually uses. Also notice that latin1 without quotes is a syntax error (unless you have a variable with that name, and it contains a string which represents a valid Python character encoding name).
    – tripleee
    Commented Oct 25, 2022 at 9:07
  • In this particular example, the real problem is that a PNG file does not contain text at all. You should instead read the raw bytes (open('pic3.png', 'rb'), where the b signifies binary mode).
    – tripleee
    Commented Oct 25, 2022 at 9:09

This is an example of how I open and close a file with UTF-8, extracted from recent code:

def traducere_v1_txt(translator, file):
    data = []
    with open(f"{base_path}/{file}", "r", encoding='utf8', errors='ignore') as open_file:
        data = open_file.readlines()

    file_name = file.replace(".html", "")
    with open(f"Translated_Folder/{file_name}_{input_lang}.html", "w", encoding='utf8') as htmlfile:
        htmlfile.write(lxml1)
