
I'm working with data in CSV format and want to set every empty cell to an empty string.

The problem I'm facing is that these files have been manipulated by several people in different environments, so the cells contain a variety of junk values, such as:

' '
'NaN'
'nan'
'\n'
'   '

And so on.

I'm looking for a standard way to identify all of these types of "junk values."


4 Answers


Use .strip() to remove surrounding whitespace, then check whether the stripped value is one you want to ignore:

if value.strip() in ['', 'NaN', 'nan']:
    value = ''  # treat this cell as empty

Or, make it case-insensitive:

if value.strip().lower() in ['', 'nan']:
    value = ''  # treat this cell as empty
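For instance, applying that check while reading a file with the csv module might look something like this ('data.csv' and 'clean.csv' are just placeholder file names):

import csv

junk = {'', 'nan'}  # stripped, lowercased junk values

with open('data.csv', newline='') as src, open('clean.csv', 'w', newline='') as dst:
    reader = csv.reader(src)
    writer = csv.writer(dst)
    for row in reader:
        # normalize each cell: junk values become empty strings
        writer.writerow('' if cell.strip().lower() in junk else cell for cell in row)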

You can use str.isspace(), which catches whitespace-only values like ' ' and '\n', but it won't handle values like 'NaN' or 'nan'. There isn't really a standard way to deal with those, so in addition to isspace() I would also keep a blacklist, e.g.:

blacklist = ['NaN', 'nan'] # add more as needed

Then use isspace() plus your blacklist to filter out unwanted values.
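A small helper combining both checks could look like this (is_junk is just a name I made up):

def is_junk(value):
    # whitespace-only strings ('\n', '   ', ...) or blacklisted tokens count
    # as junk; note that ''.isspace() is False, so truly empty cells pass
    # through untouched (they are already the target value)
    return value.isspace() or value in blacklist

print(is_junk('\n'))    # True
print(is_junk('NaN'))   # True
print(is_junk('data'))  # False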


You could read the CSV into a pandas DataFrame and then use DataFrame.fillna().
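One caveat: fillna() only fills actual NaN values, so the junk strings have to be parsed as NaN first, e.g. via read_csv's na_values parameter. A minimal sketch ('data.csv' is a placeholder name):

import pandas as pd

# strings listed in na_values are treated as NaN on top of pandas' defaults,
# which already cover 'NaN', 'nan', and empty fields
df = pd.read_csv('data.csv', na_values=[' ', '   ', '\n'])

# fillna then turns every NaN cell into an empty string
df = df.fillna('')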


I think DataFrame.replace would be a good fit for your problem.

Here is some sample code:

import pandas as pd

# sample data containing several junk variants
dic = {'a': ['NAN', '', 'NaN'], 'b': ['', 'nan', '\n'], 'c': [1, '2', '3']}
df = pd.DataFrame(dic)

# every string in this list gets replaced with an empty string;
# note 'NAN' has to be listed too, since replace matches exact strings
replace_list = ['NAN', 'NaN', 'nan', '\n']
df_clean = df.replace(replace_list, '')
print(df_clean)

You can import CSV data into pandas and do the same thing.
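For example, reading with keep_default_na=False keeps the junk as literal strings so the same replace still matches ('data.csv' is a placeholder name):

# without keep_default_na=False, pandas would convert 'NaN'/'nan' to float
# NaN on read, and the string-based replace above would not match them
df = pd.read_csv('data.csv', keep_default_na=False)
df_clean = df.replace(replace_list, '')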

Hope it helps.
