123

I'm working with some CSV files, with the following code:

reader = csv.reader(open(filepath, "rU"))
try:
    for row in reader:
        print 'Row read successfully!', row
except csv.Error, e:
    sys.exit('file %s, line %d: %s' % (filename, reader.line_num, e))

And one file is throwing this error:

file my.csv, line 1: line contains NULL byte

What can I do? Google seems to suggest that it may be an Excel file that's been saved as a .csv improperly. Is there any way I can get round this problem in Python?

== UPDATE ==

Following @JohnMachin's comment below, I tried adding these lines to my script:

print repr(open(filepath, 'rb').read(200)) # dump 1st 200 bytes of file
data = open(filepath, 'rb').read()
print data.find('\x00')
print data.count('\x00')

And this is the output I got:

'\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1\x00\x00\x00\x00\x00\x00\x00\x00\ .... <snip>
8
13834

So the file does indeed contain NUL bytes.
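For anyone running this diagnostic on Python 3: binary reads return bytes, not str, so the searches need byte literals. A sketch against a fabricated sample file (the real my.csv isn't available here):

```python
# Python 3 sketch of the same diagnostic, run against a fabricated
# sample file. Note the b'' byte literals: in Python 3, reading in
# 'rb' mode returns bytes, and find/count need bytes arguments.
sample = b'\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1\x00\x00rest-of-file'
with open('sample.csv', 'wb') as f:
    f.write(sample)

data = open('sample.csv', 'rb').read()
print(repr(data[:200]))          # dump first 200 bytes
first_nul = data.find(b'\x00')
nul_count = data.count(b'\x00')
print(first_nul)                 # 8: the first NUL follows the 8 signature bytes
print(nul_count)                 # 2 in this fabricated sample
```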

9
  • What does od -c say the first line looks like? Commented Nov 12, 2010 at 15:22
  • what query should I run, something like cat my.csv | od -c | more ? with that I get: 0000000 D e p a r t m e n t F a m i l
    – AP257
    Commented Nov 12, 2010 at 15:35
  • How is the CSV generated ? From excel, you may be able to try a dialect. Otherwise look at say: stackoverflow.com/questions/2753022/…
    – dr jimbob
    Commented Nov 12, 2010 at 15:51
  • 1
    Thanks. It's not my CSV, and unfortunately I don't have the power to change it. I think it's been created as Excel and saved as CSV (boo). A dialect sounds like a good idea - I'll try that!
    – AP257
    Commented Nov 12, 2010 at 16:24
  • 1
    If it's actually been saved as CSV, it should work. One thing I sometimes find is TSV (tab separated) files masquerading as CSV, so you could try setting a delimiter of '\t'. If it's been saved as an Excel file, and the extension changed to CSV, no dialect is going to work. I think your only option in that case would be to use Excel to save copies as proper CSV.
    – Thomas K
    Commented Nov 12, 2010 at 17:38

19 Answers

121

As @S.Lott says, you should be opening your files in 'rb' mode, not 'rU' mode. However that may NOT be causing your current problem. As far as I know, using 'rU' mode would mess you up if there are embedded \r in the data, but not cause any other dramas. I also note that you have several files (all opened with 'rU' ??) but only one causing a problem.

If the csv module says that you have a "NULL" (silly message, should be "NUL") byte in your file, then you need to check out what is in your file. I would suggest that you do this even if using 'rb' makes the problem go away.

repr() is (or wants to be) your debugging friend. It will show unambiguously what you've got, in a platform-independent fashion (which is helpful to helpers who are unaware what od is or does). Do this:

print repr(open('my.csv', 'rb').read(200)) # dump 1st 200 bytes of file

and carefully copy/paste (don't retype) the result into an edit of your question (not into a comment).

Also note that if the file is really dodgy e.g. no \r or \n within reasonable distance from the start of the file, the line number reported by reader.line_num will be (unhelpfully) 1. Find where the first \x00 is (if any) by doing

data = open('my.csv', 'rb').read()
print data.find('\x00')

and make sure that you dump at least that many bytes with repr or od.

What does data.count('\x00') tell you? If there are many, you may want to do something like

for i, c in enumerate(data):
    if c == '\x00':
        print i, repr(data[i-30:i]) + ' *NUL* ' + repr(data[i+1:i+31])

so that you can see the NUL bytes in context.
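On Python 3 the same loop changes slightly, because iterating over a bytes object yields integers rather than one-character strings. A sketch with fabricated data:

```python
# Python 3 sketch: iterating over bytes yields ints, so the NUL test
# compares against 0 instead of '\x00'. The data here is fabricated.
data = b'header,one\x00,two\x00,three'
nul_positions = []
for i, byte in enumerate(data):
    if byte == 0:
        nul_positions.append(i)
        print(i, repr(data[max(i - 30, 0):i]) + ' *NUL* ' + repr(data[i + 1:i + 31]))
print(nul_positions)   # [10, 15]
```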

If you can see \x00 in the output (or \0 in your od -c output), then you definitely have NUL byte(s) in the file, and you will need to do something like this:

fi = open('my.csv', 'rb')
data = fi.read()
fi.close()
fo = open('mynew.csv', 'wb')
fo.write(data.replace('\x00', ''))
fo.close()
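As a later comment notes, on Python 3 the replace needs bytes literals because the data read in 'rb' mode is bytes. A sketch against a fabricated input file:

```python
# Python 3 sketch of the same NUL-stripping step; the b'' literals
# are required because binary-mode data is bytes, not str. The input
# file is fabricated for illustration.
with open('my.csv', 'wb') as f:
    f.write(b'a,b\x00,c\r\nd,\x00e,f\r\n')

with open('my.csv', 'rb') as fi:
    data = fi.read()
with open('mynew.csv', 'wb') as fo:
    fo.write(data.replace(b'\x00', b''))

with open('mynew.csv', 'rb') as f:
    cleaned = f.read()
print(cleaned)   # b'a,b,c\r\nd,e,f\r\n'
```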

By the way, have you looked at the file (including the last few lines) with a text editor? Does it actually look like a reasonable CSV file like the other (no "NULL byte" exception) files?

7
  • Thank you so much for this very detailed help. There are lots of \x00 characters in the file (see edit to question) - it's odd, because in a text editor it looks like a perfectly reasonable CSV file.
    – AP257
    Commented Nov 15, 2010 at 17:35
  • 1
    @AP257: '\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1 is the "signature" denoting an OLE2 Compound Document file -- e.g. an Excel 97-2003 .XLS file. I find "in a text editor it looks like a perfectly reasonable CSV file" to be utterly unbelievable. You must have been looking at a different file, a valid CSV file, in another folder or on another machine or at some other time. Note that your od output wasn't from an XLS file. Commented Nov 15, 2010 at 21:48
Works, but it would be nicer to do this on the fly with a file-like object that filters out the NULs and can be passed to csv.reader directly.
    – gerrit
    Commented Oct 15, 2015 at 17:37
  • I think od is a typo for os in this answer.
    – lin_bug
    Commented Apr 24, 2017 at 9:33
  • 7
    Shouldn't fo.write(data.replace('\x00', '')) be fo.write(data.replace(b'\x00', b''))? Python 3.6 here...
    – Boern
    Commented Jan 8, 2019 at 9:08
30
data_initial = open("staff.csv", "rb")
data = csv.reader((line.replace('\0','') for line in data_initial), delimiter=",")

This works for me.
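The same trick ported to Python 3, where the file is opened in text mode with newline='' as the csv docs recommend (file contents fabricated for illustration):

```python
import csv

# Python 3 sketch of the same on-the-fly NUL filter; newline='' is
# the csv docs' recommended mode for files handed to csv.reader.
# The input file is fabricated for illustration.
with open('staff.csv', 'w', newline='') as f:
    f.write('a,b\x00,c\nd,e,f\n')

with open('staff.csv', 'r', newline='') as data_initial:
    data = csv.reader((line.replace('\0', '') for line in data_initial),
                      delimiter=',')
    rows = list(data)
print(rows)   # [['a', 'b', 'c'], ['d', 'e', 'f']]
```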

1
  • Solved for my case, the null were the '\0' values. Thanks. Commented Feb 12, 2017 at 2:44
22

My problem was also that the file was UTF-16 encoded; reading it as UTF-16 fixed it.

Here's my code that ended up working:

f=codecs.open(location,"rb","utf-16")
csvread=csv.reader(f,delimiter='\t')
csvread.next()
for row in csvread:
    print row

Where location is the path to your csv file.
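On Python 3 the built-in open takes the encoding directly; a sketch with a fabricated UTF-16 file:

```python
import csv

# Python 3 sketch: UTF-16 stores each ASCII character with a NUL
# byte, which is what trips the csv module when the file is read as
# raw bytes. Decoding as UTF-16 makes the NULs disappear. The input
# file is fabricated for illustration.
with open('report.tsv', 'w', encoding='utf-16', newline='') as f:
    f.write('name\tage\nAda\t36\n')

with open('report.tsv', encoding='utf-16', newline='') as f:
    csvread = csv.reader(f, delimiter='\t')
    header = next(csvread)          # skip the header row, as above
    rows = list(csvread)
print(header)   # ['name', 'age']
print(rows)     # [['Ada', '36']]
```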

0
17

You could just inline a generator to filter out the null values if you want to pretend they don't exist. Of course this is assuming the null bytes are not really part of the encoding and really are some kind of erroneous artifact or bug.

with open(filepath, "rb") as f:
    reader = csv.reader( (line.replace('\0','') for line in f) )

    try:
        for row in reader:
            print 'Row read successfully!', row
    except csv.Error, e:
        sys.exit('file %s, line %d: %s' % (filename, reader.line_num, e))
14

I bumped into this problem as well. Using the Python csv module, I was trying to read an XLS file created in MS Excel and running into the NULL byte error you were getting. I looked around and found the xlrd Python module for reading and formatting data from MS Excel spreadsheet files. With the xlrd module, I am not only able to read the file properly, but I can also access many different parts of the file in a way I couldn't before.

I thought it might help you.

1
  • 8
    Thanks for pointing out that module. Interestingly enough, I went to download it and noticed the author was none other than @John_Machin who is also the top comment on this question.
    – Evan
    Commented Mar 19, 2012 at 23:28
11

Converting the encoding of the source file from UTF-16 to UTF-8 solved my problem.

How to convert a file to utf-8 in Python?

import codecs
BLOCKSIZE = 1048576 # or some other, desired size in bytes
with codecs.open(sourceFileName, "r", "utf-16") as sourceFile:
    with codecs.open(targetFileName, "w", "utf-8") as targetFile:
        while True:
            contents = sourceFile.read(BLOCKSIZE)
            if not contents:
                break
            targetFile.write(contents)
2

Why are you doing this?

 reader = csv.reader(open(filepath, "rU"))

The docs are pretty clear that you must do this:

with open(filepath, "rb") as src:
    reader= csv.reader( src )

The mode must be "rb" to read.

http://docs.python.org/library/csv.html#csv.reader

If csvfile is a file object, it must be opened with the ‘b’ flag on platforms where that makes a difference.

7
  • @AP257: "Doesn't help"? Means what? Any specific error messages?
    – S.Lott
    Commented Nov 15, 2010 at 19:40
  • 1
    @S.Lott: Means he gets the same answer as before. The reality is that he is dealing with a chameleon or shapeshifter file ... when he dumps it with od or looks at it in a text editor, it looks like a perfectly normal CSV file. However when he dumps the first few bytes with Python repr(), it makes like an Excel .XLS file (that's been renamed to have a CSV extension). Commented Nov 15, 2010 at 22:01
  • @John Machin: "an Excel .XLS file (that's been renamed to have a CSV extension" Makes sense that it cannot be processed at all.
    – S.Lott
    Commented Nov 15, 2010 at 22:03
  • 1
    @S.Lott: With that content, it makes sense that the csv module can't process it; however the xlrd module can process it. Sensibly, neither module infers anything from the name of the input file, if indeed the input is a file with a name. Commented Nov 15, 2010 at 23:08
  • 1
    @John Machin: "neither module infers anything from the name of the input file". True. My application framework depends on that fact. We don't trust the filename to mean anything, since people make mistakes ("lie"). So we have to check a bunch of alternatives until one clicks.
    – S.Lott
    Commented Nov 16, 2010 at 12:05
2

Apparently it's an XLS file and not a CSV file, as http://www.garykessler.net/library/file_sigs.html confirms.
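A quick way to test for that signature (D0 CF 11 E0 A1 B1 1A E1 marks an OLE2 compound document, the container used by legacy .xls files); the sample files in this sketch are fabricated:

```python
# Sketch: check whether a file starts with the OLE2 compound-document
# signature used by legacy .xls files. Sample files are fabricated.
OLE2_MAGIC = b'\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1'

def looks_like_ole2(path):
    with open(path, 'rb') as f:
        return f.read(8) == OLE2_MAGIC

with open('fake.csv', 'wb') as f:
    f.write(OLE2_MAGIC + b'\x00' * 16)   # an .xls renamed to .csv
with open('real.csv', 'wb') as f:
    f.write(b'a,b,c\n')                  # a genuine CSV

print(looks_like_ole2('fake.csv'))   # True
print(looks_like_ole2('real.csv'))   # False
```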

2
  • Not necessarily, but yes, this could be a cause. I did get this error when I tried parsing a CSV file that was saved by Excel from an XLSX file.
    – Cerin
    Commented Jan 22, 2015 at 18:29
  • With this magic number, that is the cause; XLSX files have a different magic number. Commented Jan 24, 2015 at 14:09
2

Instead of the csv reader, I read the file myself and split each line as a string:

lines = open(input_file, 'rb')

for line_all in lines:
    line = line_all.replace('\x00', '').split(";")
1

I got the same error. Saved the file in UTF-8 and it worked.

1
  • 1
    You may have got the same error message, but the cause would have been different -- you probably saved it originally as UTF-16 (what Notepad calls "Unicode"). Commented Nov 29, 2011 at 7:48
1

This happened to me when I created a CSV file with OpenOffice Calc. It didn't happen when I created the CSV file in my text editor, even if I later edited it with Calc.

I solved my problem by copy-pasting in my text editor the data from my Calc-created file to a new editor-created file.

1

I had the same problem opening a CSV produced from a webservice which inserted NULL bytes in empty headers. I did the following to clean the file:

with codecs.open ('my.csv', 'rb', 'utf-8') as myfile:
    data = myfile.read()
    # clean file first if dirty
    if data.count( '\x00' ):
        print 'Cleaning...'
        with codecs.open('my.csv.tmp', 'w', 'utf-8') as of:
            for line in data.splitlines(True):
                of.write(line.replace('\x00', ''))

        shutil.move( 'my.csv.tmp', 'my.csv' )

with codecs.open ('my.csv', 'rb', 'utf-8') as myfile:
    myreader = csv.reader(myfile, delimiter=',')
    # Continue with your business logic here...

Disclaimer: Be aware that this overwrites your original data. Make sure you have a backup copy of it. You have been warned!

1

I opened and saved the original csv file as a .csv file through Excel's "Save As" and the NULL byte disappeared.

I think the original encoding for the file I received was double byte unicode (it had a null character every other character) so saving it through excel fixed the encoding.

1

Remove or replace null bytes:

with open('your_file.csv', 'rb') as file:
    data = file.read().replace(b'\x00', b'')
with open('your_file.csv', 'wb') as file:
    file.write(data)
0

For all those 'rU' filemode haters: I just tried opening a CSV file from a Windows machine on a Mac with the 'rb' filemode and I got this error from the csv module:

Error: new-line character seen in unquoted field - do you need to 
open the file in universal-newline mode?

Opening the file in 'rU' mode works fine. I love universal-newline mode -- it saves me so much hassle.
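For what it's worth, the Python 3 replacement for 'rU' is opening in text mode with newline='', which lets the csv module do its own newline handling. A sketch with a fabricated CRLF file:

```python
import csv

# Python 3 sketch: newline='' lets the csv module handle \r\n line
# endings itself, covering the role 'rU' played in Python 2. The
# CRLF file is fabricated for illustration.
with open('win.csv', 'w', newline='') as f:
    f.write('a,b\r\nc,d\r\n')

with open('win.csv', newline='') as f:
    rows = list(csv.reader(f))
print(rows)   # [['a', 'b'], ['c', 'd']]
```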

0

I encountered this when using Scrapy and fetching a zipped CSV file without the correct middleware to unzip the response body before handing it to the csv reader. Hence the file was not really a CSV file and accordingly threw the line contains NULL byte error.

2
0

Have you tried using gzip.open?

with gzip.open('my.csv', 'rb') as data_file:

I was trying to open a file that had been compressed but had the extension '.csv' instead of '.csv.gz'. This error kept showing up until I used gzip.open.

0
0

What worked for me is taking a more manual approach of blacklisting certain characters. In the data I was working with, an ASCII control character indicated that the row was corrupted. This script looks for any "bad" characters, and if found, skips the row entirely. It assumes that the CSV header in the first row isn't corrupted though. With this approach, the corrupted data is intercepted before it reaches csv.DictReader which then throws a null byte error.

import io, csv

# Problematic ASCII control characters.
ascii_control_characters = list(range(0, 32)) # 0-31 inclusive.
ascii_control_characters.append(127) # Delete.
ascii_control_characters.remove(10) # Line feed.
ascii_control_characters.remove(13) # Carriage return.

with open('/foo/bar/baz.csv', 'r') as data_file:
    header = ''

    for index, line in enumerate(data_file):
        # Search line for problematic ASCII characters.
        bad_character_found = False

        for character in line:
            if ord(character) in ascii_control_characters:
                bad_character_found = True
                break

        # If a bad character is found, skip the line altogether.
        if bad_character_found:
            print(
                'Corrupted data found on line: ' + \
                 str(index + 1) + \
                 '. Skipping...'
            )

            continue

        if index == 0:
            header += line
            continue

        csv_data = header + line

        reader = csv.DictReader(io.StringIO(csv_data))

        for row in reader:
            # Process each CSV row here.
            pass
-2

One more case: if the CSV file contains empty rows, this error may show up. Checking each row before you read or write it is necessary:

for row in csvreader:
    if row:
        # do something with row

I solved my issue by adding this check in the code.
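The guard works because csv.reader yields an empty list for a blank input line, and an empty list is falsy. A small sketch with fabricated input:

```python
import csv
import io

# Sketch: csv.reader yields [] for blank lines, so a simple
# truthiness test skips them. Input is fabricated for illustration.
src = io.StringIO('a,b\n\n\nc,d\n')
rows = [row for row in csv.reader(src) if row]
print(rows)   # [['a', 'b'], ['c', 'd']]
```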
