386

I read data from a .csv file into a pandas DataFrame as below. For one of the columns, namely id, I want to specify the column type as int. The problem is that the id series has missing/empty values.

When I try to cast the id column to integer while reading the .csv, I get:

df= pd.read_csv("data.csv", dtype={'id': int}) 
error: Integer column has NA values

Alternatively, I tried to convert the column type after reading as below, but this time I get:

df= pd.read_csv("data.csv") 
df[['id']] = df[['id']].astype(int)
error: Cannot convert NA to integer

How can I tackle this?

10
  • 5
I think that integer values cannot be converted or stored in a series/dataframe if there are missing/NaN values. This, I think, is to do with numpy compatibility (I'm guessing here); if you want missing-value compatibility then I would store the values as floats
    – EdChum
    Commented Jan 22, 2014 at 16:14
  • 1
see here: pandas.pydata.org/pandas-docs/dev/…; you must have a float dtype when you have missing values (or technically an object dtype, but that is inefficient); what is your goal in using the int type?
    – Jeff
    Commented Jan 22, 2014 at 16:16
  • 8
    I believe this is a NumPy issue, not specific to Pandas. It's a shame since there are so many cases when having an int type that allows for the possibility of null values is much more efficient than a large column of floats.
    – ely
    Commented Jan 22, 2014 at 17:44
  • 1
I have a problem with this too. I have multiple dataframes which I want to merge based on a string representation of several "integer" columns. However, when one of those integer columns has a np.nan, the string casting produces a ".0", which throws off the merge. It just makes things slightly more complicated; it would be nice if there were a simple workaround.
    – dermen
    Commented Jul 11, 2015 at 3:52
  • 2
@Rhubarb, Optional Nullable Integer Support is now officially added in pandas 0.24.0 - finally :) - please find an updated answer below. pandas 0.24.x release notes
    – mork
    Commented Jan 25, 2019 at 17:14

31 Answers

364

Since version 0.24, pandas has the ability to hold integer dtypes with missing values.

Nullable Integer Data Type.

Pandas can represent integer data with possibly missing values using arrays.IntegerArray. This is an extension type implemented within pandas. It is not the default dtype for integers and will not be inferred; you must explicitly pass the dtype into array() or Series:

arr = pd.array([1, 2, np.nan], dtype=pd.Int64Dtype())
pd.Series(arr)

0      1
1      2
2    NaN
dtype: Int64

To convert a column to nullable integers, use:

df['myCol'] = df['myCol'].astype('Int64')
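
This also works when reading the file; a minimal sketch, assuming the data.csv and id column from the question:

import pandas as pd

# pass the nullable dtype straight to read_csv; missing ids come through as <NA>
df = pd.read_csv("data.csv", dtype={'id': 'Int64'})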
10
  • 90
    Note that dtype must be "Int64" and not "int64" (first 'i' must be capitalized) Commented Oct 3, 2019 at 18:08
  • 7
    df.myCol = df.myCol.astype('Int64') or df['myCol'] = df['myCol'].astype('Int64')
    – LoMaPh
    Commented Nov 4, 2019 at 21:38
  • 11
It may be obvious to some, but I think it is still worth noting that you can use any Int size (e.g. Int16, Int32), and indeed probably should if the dataframe is very large, to save memory.
    – wfgeo
    Commented Sep 21, 2020 at 12:42
  • 4
    I'm getting TypeError: cannot safely cast non-equivalent float64 to int64
    – Bera
    Commented Sep 13, 2021 at 11:31
  • 1
    As of pandas 1.4, IntegerArray and pandas.NA are still marked as experimental
    – creanion
    Commented Jul 18, 2022 at 15:01
264

The lack of a NaN representation in integer columns is a pandas "gotcha".

The usual workaround is to simply use floats.
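
For example (a minimal sketch, using the question's data.csv):

import pandas as pd

# let pandas infer dtypes: a column with missing values is read as float64
df = pd.read_csv("data.csv")
print(df['id'].dtype)  # float64, with NaN marking the missing ids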

8
  • 30
    Are there any other workarounds besides treating them like floats? Commented May 14, 2015 at 23:26
  • 5
    @jsc123 you can use the object dtype. This comes with a small health warning but for the most part works well. Commented May 19, 2015 at 15:16
  • 1
    Can you provide an example of how to use object dtype? I've been looking through the pandas docs and googling, and I've read it's the recommended method. But, I haven't found an example of how to use the object dtype.
    – MikeyE
    Commented Aug 15, 2016 at 3:23
  • 67
    In v0.24, you can now do df = df.astype(pd.Int32Dtype()) (to convert the entire dataFrame, or) df['col'] = df['col'].astype(pd.Int32Dtype()). Other accepted nullable integer types are pd.Int16Dtype and pd.Int64Dtype. Pick your poison.
    – cs95
    Commented Apr 2, 2019 at 7:56
  • 2
It is a NaN value, but isnan checking doesn't work at all :(
    – Winston
    Commented Jul 31, 2019 at 9:48
81

My use case is munging data prior to loading into a DB table:

df[col] = df[col].fillna(-1)
df[col] = df[col].astype(int)
df[col] = df[col].astype(str)
df[col] = df[col].replace('-1', np.nan)

Remove the NaNs, convert to int, convert to str, and then reinsert the NaNs.

It's not pretty but it gets the job done!

6
  • 2
    I have been pulling my hair out trying to load serial numbers where some are null and the rest are floats, this saved me. Commented Jan 15, 2019 at 17:51
  • 3
    The OP wants a column of integers. Converting it to string does not meet the condition. Commented Feb 21, 2019 at 1:33
  • 6
    Works only if col doesn't already have -1. Otherwise, it will mess with the data Commented Oct 10, 2019 at 4:55
  • 1
    then how to get back to int..??
    – abdoulsn
    Commented Jan 23, 2020 at 9:48
  • This produces a column of strings!! For a solution with current versions of pandas, see stackoverflow.com/questions/58029359/…
    – PatrickT
    Commented Oct 25, 2021 at 5:39
15

Whether your pandas Series is of object dtype or simply float dtype, the method below will work:

df = pd.read_csv("data.csv") 
df['id'] = df['id'].astype(float).astype('Int64')
3
  • Thank you @Abhishek Bhatia this worked for me. Commented Jan 26, 2022 at 16:26
  • This is one of the better answers on this thread.
    – drake
    Commented Dec 2, 2022 at 19:08
  • If the id is too large, will the read_csv float lose precision?
    – qwr
    Commented May 3 at 14:00
13

It is now possible to create a pandas column containing NaNs with an integer dtype, since this was officially added in pandas 0.24.0.

From the pandas 0.24.x release notes: "Pandas has gained the ability to hold integer dtypes with missing values."

0
8

As of Pandas 1.0.0 you can now use pandas.NA values. This does not force integer columns with missing values to be floats.

When reading in your data all you have to do is:

df= pd.read_csv("data.csv", dtype={'id': 'Int64'})  

Notice the 'Int64' is surrounded by quotes and the I is capitalized. This distinguishes pandas' 'Int64' from numpy's int64.

As a side note, this will also work with .astype()

df['id'] = df['id'].astype('Int64')

You might have to use round if you actually have floats.

df['id'] = df['id'].round().astype('Int64')

Documentation here https://pandas.pydata.org/pandas-docs/stable/user_guide/integer_na.html

1
  • I know this is a few years old, but thanks so much Commented May 22 at 14:57
7

I had the problem a few weeks ago with a few discrete features which were formatted as 'object'. This solution seemed to work.

for col in discrete:
    df[col] = pd.to_numeric(df[col], errors='coerce').astype(pd.Int64Dtype())
0
6

If you absolutely want to combine integers and NaNs in a column, you can use the 'object' data type:

df['col'] = (
    df['col'].fillna(0)
    .astype(int)
    .astype(object)
    .where(df['col'].notnull())
)

This will replace NaNs with an integer (doesn't matter which), convert to int, convert to object and finally reinsert NaNs.

5

You could use .dropna() if it is OK to drop the rows with the NaN values.

df = df.dropna(subset=['id'])

Alternatively, use .fillna() and .astype() to replace the NaN with values and convert them to int.

I ran into this problem when processing a CSV file with large integers, while some of them were missing (NaN). Using float as the type was not an option, because I might lose the precision.

My solution was to use str as the intermediate type. Then you can convert the string to int as you please later in the code. I replaced NaN with 0, but you could choose any value.

df = pd.read_csv(filename, dtype={'id':str})
df["id"] = df["id"].fillna("0").astype(int)

As an illustration, here is an example of how floats may lose precision:

s = "12345678901234567890"
f = float(s)
i = int(f)
i2 = int(s)
print(f, i, i2)

And the output is:

1.2345678901234567e+19 12345678901234567168 12345678901234567890
3

If you can modify your stored data, use a sentinel value for missing id. A common use case, inferred from the column name, is that id is an integer strictly greater than zero, so you could use 0 as a sentinel value and write:

if row['id']:
   regular_process(row)
else:
   special_process(row)
3

Most solutions here tell you how to use a placeholder integer to represent nulls. That approach isn't helpful if you're not certain the placeholder integer won't show up in your source data, though. My method will format floats without their decimal values and convert nulls to Nones. The result is an object column that will look like an integer field with null values when written to a CSV.

keep_df[col] = keep_df[col].apply(lambda x: None if pandas.isnull(x) else '{0:.0f}'.format(pandas.to_numeric(x)))
1
  • this approach can add a lot of memory overhead, especially on larger dataframes
    – zelusp
    Commented Feb 18, 2021 at 19:36
3

The issue with Int64, like many of the other solutions, is that if you have null values, they get replaced with <NA>, which does not behave like np.nan everywhere (for example, np.isnan raises on it). And if you convert values to -1 instead, you end up in a situation where you may be deleting your information. My solution is a little lame, but will provide int values alongside np.nan, allowing nan-based functions to work without compromising your values.

def to_int(x):
    try:
        return int(x)
    except (ValueError, TypeError):
        # NaN (and anything else non-numeric) stays as NaN
        return np.nan

df[column] = df[column].apply(to_int)
1
  • I don't think this is lame -- it's one of the better answers when starting with a float column. A slightly more concise version is df[column] = df[column].apply(lambda x: round(x) if pd.notna(x) else pd.NA)
    – grbruns
    Commented Apr 8 at 20:49
3

For anyone needing to have int values within NULL/NaN-containing columns, but working under the constraint of being unable to use the pandas 0.24.0 nullable integer features mentioned in other answers, I suggest converting the columns to object type using DataFrame.where:

df = df.where(pd.notnull(df), None)

This converts all NaNs in the dataframe to None, treating mixed-type columns as objects, but leaving the int values as int, rather than float.

2

If you want to use it when you chain methods, you can use assign:

df = (
     df.assign(col = lambda x: x['col'].astype('Int64'))
)
2

First you need to specify the newer nullable integer types, Int8 (... Int64), which can handle null integer data (pandas version >= 0.24.0):

df = df.astype('Int8')

But you may want to only target specific columns which have integer data mixed with NaN/nulls:

df = df.astype({'col1': 'Int8', 'col2': 'Int8', 'col3': 'Int8'})

At this point, the NaNs are converted into <NA>, and if you want to change the default null value with df.fillna(), you need to coerce the columns you wish to change to object dtype, otherwise you will see TypeError: <U1 cannot be converted to an IntegerDtype.

You can do this with df = df.astype(object) if you don't mind changing every column's dtype to object (individually, each value's type is still preserved), or with df = df.astype({"col1": object, "col2": object}) if you prefer to target individual columns.

This should help with forcing your integer columns mixed with nulls to stay formatted as integers and change the null values to whatever you like. I can't speak to the efficiency of this method, but it worked for my formatting and printing purposes.
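
A condensed sketch of that sequence (hypothetical column name col1, assuming a reasonably recent pandas):

import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [1, np.nan, 3]})

df = df.astype({'col1': 'Int8'})        # the NaN becomes <NA>
df = df.astype({'col1': object})        # loosen the dtype first...
df['col1'] = df['col1'].fillna('none')  # ...so a string fill value is accepted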

1

I ran into this issue working with pyspark. As this is a Python frontend for code running on a JVM, it requires type safety, and using float instead of int is not an option. I worked around the issue by wrapping pandas' pd.read_csv in a function that fills user-defined columns with user-defined fill values before casting them to the required type. Here is what I ended up using:

def custom_read_csv(file_path, custom_dtype=None, fill_values=None, **kwargs):
    if custom_dtype is None:
        return pd.read_csv(file_path, **kwargs)
    else:
        assert 'dtype' not in kwargs.keys()
        df = pd.read_csv(file_path, **kwargs)
        for col, typ in custom_dtype.items():
            # fall back to -1 when no fill value is given for a column
            if fill_values is None or col not in fill_values.keys():
                fill_val = -1
            else:
                fill_val = fill_values[col]
            df[col] = df[col].fillna(fill_val).astype(typ)
    return df
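
A hypothetical call, assuming the question's data.csv and filling missing ids with 0 before the int cast:

df = custom_read_csv("data.csv", custom_dtype={'id': int}, fill_values={'id': 0})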
1
import pandas as pd

df= pd.read_csv("data.csv")
df['id'] = pd.to_numeric(df['id'])
2
  • 4
    Is there a reason you prefer this formulation over that proposed in the accepted answer? If so, it'd be useful to edit your answer to provide that explanation—and especially since there are ten additional answers that are competing for attention. Commented Jun 6, 2020 at 0:38
  • 1
While this code may resolve the OP's issue, it is best to include an explanation as to how/why your code addresses it. In this way, future visitors can learn from your post, and apply it to their own code. SO is not a coding service, but a resource for knowledge. Also, high quality, complete answers are more likely to be upvoted. These features, along with the requirement that all posts are self-contained, are some of the strengths of SO as a platform that differentiate it from forums. You can edit to add additional info and/or to supplement your explanations with source documentation. Commented Jun 6, 2020 at 1:35
1

Try this:

df[['id']] = df[['id']].astype(pd.Int64Dtype())

If you print its dtypes, you will see that id is Int64 instead of the normal int64.

1

df['id'] = df['id'].astype('float').astype(pd.Int64Dtype())

0

First remove the rows which contain NaN. Then do the integer conversion on the remaining rows. At last, insert the removed rows again, as sketched below.
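
A rough sketch of that idea (hypothetical column name id; the converted rows are cast to object so that concatenating the NaN rows back does not silently re-coerce everything to float):

mask = df['id'].notna()
kept = df[mask].copy()
kept['id'] = kept['id'].astype(int).astype(object)  # plain ints, not floats
df = pd.concat([kept, df[~mask]]).sort_index()      # reinsert the NaN rows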

0

Use .fillna() to replace all NaN values with 0, and then convert the column to int using astype(int):

df['id'] = df['id'].fillna(0).astype(int)
1
  • 5
    Works but I think replacing NaN with 0 changes the meaning of the data. Commented Jan 26, 2022 at 16:27
0

I had a similar problem. This was my solution:

def toint(zahl):
    try:
        zahl = int(zahl)
    except (ValueError, TypeError):
        zahl = np.nan
    return zahl

print(toint(4.776655), toint(np.nan), toint('test'))

4 nan nan

df = pd.read_csv("data.csv")
df['id'] = df['id'].astype(float)
df['id'] = df['id'].apply(toint)
0

Since I didn't see the answer here, I might as well add it:

A one-liner to convert NaNs to an empty string, if for some reason you still can't handle np.nan or pd.NA, like me when relying on a library with an older version of pandas:

df.select_dtypes('number').fillna(-1).astype(int).astype(str).replace('-1', '')

1
  • 3
    caution with this approach... if any of your data really is -1, it will be overwritten.
    – bsauce
    Commented Mar 17, 2022 at 16:55
0

I think the approach of @Digestible1010101 is the most appropriate for pandas 1.2+ versions; something like this should do the job:

df = df.astype({
    'col_1': 'Int64',
    'col_2': 'Int64',
    'col_3': 'Int64',
    'col_4': 'Int64',
})
0

Similar to @hibernado's answer, but keeping it as integers (instead of strings)

df[col] = df[col].fillna(-1)
df[col] = df[col].astype(int)
df[col] = np.where(df[col] == -1, np.nan, df[col].astype(object))  # cast to object so the ints survive alongside NaN
0
df.loc[~df['id'].isna(), 'id'] = df.loc[~df['id'].isna(), 'id'].astype('int')

0

I use the following workaround:

condition = ~df['mixed_column'].isnull()
df['mixed_column'] = df['mixed_column'].mask(condition, df.loc[condition, 'mixed_column'].astype(int))
0

One workaround at file-read time, assuming you don't need any integer operations other than equality, is to not infer dtypes and to import everything as strings. This is less efficient but simple, and closer to what is actually in the CSV file, since CSV files have no notion of type.

df = pd.read_csv(path, dtype=str)
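
Equality checks then run on the string form; for example, with a hypothetical id value:

matches = df[df['id'] == '42']  # compare against the string representation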
-1

You may try:

df.id = df.id.astype(int, errors = 'ignore')

Any numeric type can be used instead of int.

1
  • 1
    This unfortunately won't work. According to the docs on astype, the 'ignore' flag will return the original object on error. So if there are NaN values, there is an error, and nothing will be changed. Commented Dec 22, 2023 at 4:34
-2

Assuming your DateColumn formatted as 3312018.0 should be converted to 03/31/2018 as a string, and some records are missing or 0:

df['DateColumn'] = df['DateColumn'].astype(int)
df['DateColumn'] = df['DateColumn'].astype(str)
df['DateColumn'] = df['DateColumn'].apply(lambda x: x.zfill(8))
df.loc[df['DateColumn'] == '00000000','DateColumn'] = '01011980'
df['DateColumn'] = pd.to_datetime(df['DateColumn'], format="%m%d%Y")
df['DateColumn'] = df['DateColumn'].apply(lambda x: x.strftime('%m/%d/%Y'))
