386

I read data from a .csv file into a pandas DataFrame as below. For one of the columns, namely id, I want to specify the column type as int. The problem is that the id series has missing/empty values.

When I try to cast the id column to integer while reading the .csv, I get:

df= pd.read_csv("data.csv", dtype={'id': int}) 
error: Integer column has NA values

Alternatively, I tried to convert the column type after reading as below, but this time I get:

df= pd.read_csv("data.csv") 
df[['id']] = df[['id']].astype(int)
error: Cannot convert NA to integer

How can I tackle this?

10
  • 5
I think that integer values cannot be converted or stored in a series/dataframe if there are missing/NaN values. This, I think, is to do with numpy compatibility (I'm guessing here); if you want missing-value compatibility then I would store the values as floats
    – EdChum
    Commented Jan 22, 2014 at 16:14
  • 1
see here: pandas.pydata.org/pandas-docs/dev/…; you must have a float dtype when you have missing values (or technically an object dtype, but that is inefficient); what is your goal in using the int type?
    – Jeff
    Commented Jan 22, 2014 at 16:16
  • 8
    I believe this is a NumPy issue, not specific to Pandas. It's a shame since there are so many cases when having an int type that allows for the possibility of null values is much more efficient than a large column of floats.
    – ely
    Commented Jan 22, 2014 at 17:44
  • 1
I have a problem with this too. I have multiple dataframes which I want to merge based on a string representation of several "integer" columns. However, when one of those integer columns has a np.nan, the string casting produces a ".0", which throws off the merge. It just makes things slightly more complicated; it would be nice if there were a simple workaround.
    – dermen
    Commented Jul 11, 2015 at 3:52
  • 2
@Rhubarb, Optional Nullable Integer Support is now officially added in pandas 0.24.0 - finally :) - please find an updated answer below. pandas 0.24.x release notes
    – mork
    Commented Jan 25, 2019 at 17:14

31 Answers

364

Since version 0.24, pandas has the ability to hold integer dtypes with missing values.

Nullable Integer Data Type.

Pandas can represent integer data with possibly missing values using arrays.IntegerArray. This is an extension type implemented within pandas. It is not the default dtype for integers and will not be inferred; you must explicitly pass the dtype into array() or Series:

arr = pd.array([1, 2, np.nan], dtype=pd.Int64Dtype())
pd.Series(arr)

0      1
1      2
2    NaN
dtype: Int64

To convert a column to nullable integers, use:

df['myCol'] = df['myCol'].astype('Int64')
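
This also works when reading the file; a minimal sketch, assuming the data.csv and id column from the question:

import pandas as pd

# pass the nullable dtype straight to read_csv; missing ids come through as <NA>
df = pd.read_csv("data.csv", dtype={'id': 'Int64'})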
10
  • 90
    Note that dtype must be "Int64" and not "int64" (first 'i' must be capitalized) Commented Oct 3, 2019 at 18:08
  • 7
    df.myCol = df.myCol.astype('Int64') or df['myCol'] = df['myCol'].astype('Int64')
    – LoMaPh
    Commented Nov 4, 2019 at 21:38
  • 11
It may be obvious to some, but I think it is still worth noting that you can use any Int size (e.g. Int16, Int32), and indeed probably should if the dataframe is very large, to save memory.
    – wfgeo
    Commented Sep 21, 2020 at 12:42
  • 4
    I'm getting TypeError: cannot safely cast non-equivalent float64 to int64
    – Bera
    Commented Sep 13, 2021 at 11:31
  • 1
    As of pandas 1.4, IntegerArray and pandas.NA are still marked as experimental
    – creanion
    Commented Jul 18, 2022 at 15:01
264

The lack of a NaN representation in integer columns is a pandas "gotcha".

The usual workaround is to simply use floats.
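
For example (a minimal sketch, using the question's data.csv):

import pandas as pd

# let pandas infer dtypes: a column with missing values is read as float64
df = pd.read_csv("data.csv")
print(df['id'].dtype)  # float64, with NaN marking the missing ids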

8
  • 30
    Are there any other workarounds besides treating them like floats? Commented May 14, 2015 at 23:26
  • 5
    @jsc123 you can use the object dtype. This comes with a small health warning but for the most part works well. Commented May 19, 2015 at 15:16
  • 1
    Can you provide an example of how to use object dtype? I've been looking through the pandas docs and googling, and I've read it's the recommended method. But, I haven't found an example of how to use the object dtype.
    – MikeyE
    Commented Aug 15, 2016 at 3:23
  • 67
    In v0.24, you can now do df = df.astype(pd.Int32Dtype()) (to convert the entire dataFrame, or) df['col'] = df['col'].astype(pd.Int32Dtype()). Other accepted nullable integer types are pd.Int16Dtype and pd.Int64Dtype. Pick your poison.
    – cs95
    Commented Apr 2, 2019 at 7:56
  • 2
It is a NaN value, but isnan checking doesn't work at all :(
    – Winston
    Commented Jul 31, 2019 at 9:48
81

My use case is munging data prior to loading into a DB table:

df[col] = df[col].fillna(-1)
df[col] = df[col].astype(int)
df[col] = df[col].astype(str)
df[col] = df[col].replace('-1', np.nan)

Remove the NaNs, convert to int, convert to str, and then reinsert the NaNs.

It's not pretty but it gets the job done!

6
  • 2
    I have been pulling my hair out trying to load serial numbers where some are null and the rest are floats, this saved me. Commented Jan 15, 2019 at 17:51
  • 3
    The OP wants a column of integers. Converting it to string does not meet the condition. Commented Feb 21, 2019 at 1:33
  • 6
    Works only if col doesn't already have -1. Otherwise, it will mess with the data Commented Oct 10, 2019 at 4:55
  • 1
    then how to get back to int..??
    – abdoulsn
    Commented Jan 23, 2020 at 9:48
  • This produces a column of strings!! For a solution with current versions of pandas, see stackoverflow.com/questions/58029359/…
    – PatrickT
    Commented Oct 25, 2021 at 5:39
15

Whether your pandas Series is of object dtype or simply float dtype, the method below will work:

df = pd.read_csv("data.csv") 
df['id'] = df['id'].astype(float).astype('Int64')
3
  • Thank you @Abhishek Bhatia this worked for me. Commented Jan 26, 2022 at 16:26
  • This is one of the better answers on this thread.
    – drake
    Commented Dec 2, 2022 at 19:08
  • If the id is too large, will the read_csv float lose precision?
    – qwr
    Commented May 3 at 14:00
13

It is now possible to create a pandas column containing NaNs with an integer dtype, since this was officially added in pandas 0.24.0.

From the pandas 0.24.x release notes: "Pandas has gained the ability to hold integer dtypes with missing values."

0
8

As of Pandas 1.0.0 you can now use pandas.NA values. This does not force integer columns with missing values to be floats.

When reading in your data all you have to do is:

df= pd.read_csv("data.csv", dtype={'id': 'Int64'})  

Notice the 'Int64' is surrounded by quotes and the I is capitalized. This distinguishes pandas' 'Int64' from numpy's int64.

As a side note, this will also work with .astype()

df['id'] = df['id'].astype('Int64')

You might have to use round if you actually have floats.

df['id'] = df['id'].round().astype('Int64')

Documentation here https://pandas.pydata.org/pandas-docs/stable/user_guide/integer_na.html

1
  • I know this is a few years old, but thanks so much Commented May 22 at 14:57
7

I had the problem a few weeks ago with a few discrete features which were formatted as 'object'. This solution seemed to work.

for col in discrete:
    df[col] = pd.to_numeric(df[col], errors='coerce').astype(pd.Int64Dtype())
0
6

If you absolutely want to combine integers and NaNs in a column, you can use the 'object' data type:

df['col'] = (
    df['col'].fillna(0)
    .astype(int)
    .astype(object)
    .where(df['col'].notnull())
)

This will replace NaNs with an integer (doesn't matter which), convert to int, convert to object and finally reinsert NaNs.

5

You could use .dropna() if it is OK to drop the rows with the NaN values.

df = df.dropna(subset=['id'])

Alternatively, use .fillna() and .astype() to replace the NaN with values and convert them to int.

I ran into this problem when processing a CSV file with large integers, while some of them were missing (NaN). Using float as the type was not an option, because I might lose the precision.

My solution was to use str as the intermediate type. Then you can convert the string to int as you please later in the code. I replaced NaN with 0, but you could choose any value.

df = pd.read_csv(filename, dtype={'id':str})
df["id"] = df["id"].fillna("0").astype(int)

As an illustration, here is an example of how floats may lose precision:

s = "12345678901234567890"
f = float(s)
i = int(f)
i2 = int(s)
print(f, i, i2)

And the output is:

1.2345678901234567e+19 12345678901234567168 12345678901234567890
3

If you can modify your stored data, use a sentinel value for missing id. A common use case, inferred from the column name, is that id is an integer strictly greater than zero, so you could use 0 as a sentinel value and write:

if row['id']:
   regular_process(row)
else:
   special_process(row)
3

Most solutions here tell you how to use a placeholder integer to represent nulls. That approach isn't helpful if you're not certain the placeholder integer won't show up in your source data, though. My method will format floats without their decimal values and convert nulls to Nones. The result is an object column that will look like an integer field with null values when written to a CSV.

keep_df[col] = keep_df[col].apply(lambda x: None if pandas.isnull(x) else '{0:.0f}'.format(pandas.to_numeric(x)))
1
  • this approach can add a lot of memory overhead, especially on larger dataframes
    – zelusp
    Commented Feb 18, 2021 at 19:36
3

The issue with Int64, like many of the other solutions, is that if you have null values, they get replaced with <NA>, which does not behave like np.nan everywhere (for example, np.isnan raises on it). And if you convert values to -1 instead, you end up in a situation where you may be deleting your information. My solution is a little lame, but will provide int values alongside np.nan, allowing nan-based functions to work without compromising your values.

def to_int(x):
    try:
        return int(x)
    except (ValueError, TypeError):
        # NaN (and anything else non-numeric) stays as NaN
        return np.nan

df[column] = df[column].apply(to_int)
1
  • I don't think this is lame -- it's one of the better answers when starting with a float column. A slightly more concise version is df[column] = df[column].apply(lambda x: round(x) if pd.notna(x) else pd.NA)
    – grbruns
    Commented Apr 8 at 20:49
3

For anyone needing to have int values within NULL/NaN-containing columns, but working under the constraint of being unable to use the pandas 0.24.0 nullable integer features mentioned in other answers, I suggest converting the columns to object type using DataFrame.where:

df = df.where(pd.notnull(df), None)

This converts all NaNs in the dataframe to None, treating mixed-type columns as objects, but leaving the int values as int, rather than float.

2

If you want to use it when you chain methods, you can use assign:

df = (
     df.assign(col = lambda x: x['col'].astype('Int64'))
)
2

First you need to specify the newer nullable integer types, Int8 (... Int64), which can handle null integer data (pandas version >= 0.24.0):

df = df.astype('Int8')

But you may want to only target specific columns which have integer data mixed with NaN/nulls:

df = df.astype({'col1': 'Int8', 'col2': 'Int8', 'col3': 'Int8'})

At this point, the NaNs are converted into <NA>, and if you want to change the default null value with df.fillna(), you need to coerce the columns you wish to change to object dtype, otherwise you will see TypeError: <U1 cannot be converted to an IntegerDtype.

You can do this with df = df.astype(object) if you don't mind changing every column's dtype to object (individually, each value's type is still preserved), or with df = df.astype({"col1": object, "col2": object}) if you prefer to target individual columns.

This should help with forcing your integer columns mixed with nulls to stay formatted as integers and change the null values to whatever you like. I can't speak to the efficiency of this method, but it worked for my formatting and printing purposes.
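
A condensed sketch of that sequence (hypothetical column name col1, assuming a reasonably recent pandas):

import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [1, np.nan, 3]})

df = df.astype({'col1': 'Int8'})        # the NaN becomes <NA>
df = df.astype({'col1': object})        # loosen the dtype first...
df['col1'] = df['col1'].fillna('none')  # ...so a string fill value is accepted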

1

I ran into this issue working with pyspark. As this is a Python frontend for code running on a JVM, it requires type safety, and using float instead of int is not an option. I worked around the issue by wrapping pandas' pd.read_csv in a function that fills user-defined columns with user-defined fill values before casting them to the required type. Here is what I ended up using:

def custom_read_csv(file_path, custom_dtype=None, fill_values=None, **kwargs):
    if custom_dtype is None:
        return pd.read_csv(file_path, **kwargs)
    else:
        assert 'dtype' not in kwargs.keys()
        df = pd.read_csv(file_path, **kwargs)
        for col, typ in custom_dtype.items():
            # fall back to -1 when no fill value is given for a column
            if fill_values is None or col not in fill_values.keys():
                fill_val = -1
            else:
                fill_val = fill_values[col]
            df[col] = df[col].fillna(fill_val).astype(typ)
    return df
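
A hypothetical call, assuming the question's data.csv and filling missing ids with 0 before the int cast:

df = custom_read_csv("data.csv", custom_dtype={'id': int}, fill_values={'id': 0})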
1
import pandas as pd

df= pd.read_csv("data.csv")
df['id'] = pd.to_numeric(df['id'])
2
  • 4
    Is there a reason you prefer this formulation over that proposed in the accepted answer? If so, it'd be useful to edit your answer to provide that explanation—and especially since there are ten additional answers that are competing for attention. Commented Jun 6, 2020 at 0:38
  • 1
While this code may resolve the OP's issue, it is best to include an explanation as to how/why your code addresses it. In this way, future visitors can learn from your post, and apply it to their own code. SO is not a coding service, but a resource for knowledge. Also, high quality, complete answers are more likely to be upvoted. These features, along with the requirement that all posts are self-contained, are some of the strengths of SO as a platform that differentiate it from forums. You can edit to add additional info and/or to supplement your explanations with source documentation. Commented Jun 6, 2020 at 1:35
1

Try this:

df[['id']] = df[['id']].astype(pd.Int64Dtype())

If you print its dtypes, you will see that id is Int64 instead of the normal int64.

1

df['id'] = df['id'].astype('float').astype(pd.Int64Dtype())

0

First remove the rows which contain NaN. Then do the integer conversion on the remaining rows. At last, insert the removed rows again, as sketched below.
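
A rough sketch of that idea (hypothetical column name id; the converted rows are cast to object so that concatenating the NaN rows back does not silently re-coerce everything to float):

mask = df['id'].notna()
kept = df[mask].copy()
kept['id'] = kept['id'].astype(int).astype(object)  # plain ints, not floats
df = pd.concat([kept, df[~mask]]).sort_index()      # reinsert the NaN rows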

0

Use .fillna() to replace all NaN values with 0, and then convert the column to int using astype(int):

df['id'] = df['id'].fillna(0).astype(int)
1
  • 5
    Works but I think replacing NaN with 0 changes the meaning of the data. Commented Jan 26, 2022 at 16:27
0

I had a similar problem. This was my solution:

def toint(zahl):
    try:
        zahl = int(zahl)
    except (ValueError, TypeError):
        zahl = np.nan
    return zahl

print(toint(4.776655), toint(np.nan), toint('test'))

4 nan nan

df = pd.read_csv("data.csv")
df['id'] = df['id'].astype(float)
df['id'] = df['id'].apply(toint)
0

Since I didn't see the answer here, I might as well add it:

A one-liner to convert NaNs to an empty string, if for some reason you still can't handle np.nan or pd.NA, like me when relying on a library with an older version of pandas:

df.select_dtypes('number').fillna(-1).astype(int).astype(str).replace('-1', '')

1
  • 3
    caution with this approach... if any of your data really is -1, it will be overwritten.
    – bsauce
    Commented Mar 17, 2022 at 16:55
0

I think the approach of @Digestible1010101 is the most appropriate for pandas 1.2+ versions; something like this should do the job:

df = df.astype({
    'col_1': 'Int64',
    'col_2': 'Int64',
    'col_3': 'Int64',
    'col_4': 'Int64',
})
0

Similar to @hibernado's answer, but keeping it as integers (instead of strings)

df[col] = df[col].fillna(-1)
df[col] = df[col].astype(int)
df[col] = np.where(df[col] == -1, np.nan, df[col].astype(object))  # cast to object so the ints survive alongside NaN
0
df.loc[~df['id'].isna(), 'id'] = df.loc[~df['id'].isna(), 'id'].astype('int')

0

I use the following workaround:

condition = ~df['mixed_column'].isnull()
df['mixed_column'] = df['mixed_column'].mask(condition, df.loc[condition, 'mixed_column'].astype(int))
0

One workaround at file-read time, assuming you don't need any integer operations other than equality, is to not infer dtypes and to import everything as strings. This is less efficient but simple, and closer to what is actually in the CSV file, since CSV files have no notion of type.

df = pd.read_csv(path, dtype=str)
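
Equality checks then run on the string form; for example, with a hypothetical id value:

matches = df[df['id'] == '42']  # compare against the string representation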
-1

You may try:

df.id = df.id.astype(int, errors = 'ignore')

Any numeric type can be used instead of int.

1
  • 1
    This unfortunately won't work. According to the docs on astype, the 'ignore' flag will return the original object on error. So if there are NaN values, there is an error, and nothing will be changed. Commented Dec 22, 2023 at 4:34
-2

Assuming your DateColumn formatted as 3312018.0 should be converted to 03/31/2018 as a string, and some records are missing or 0:

df['DateColumn'] = df['DateColumn'].astype(int)
df['DateColumn'] = df['DateColumn'].astype(str)
df['DateColumn'] = df['DateColumn'].apply(lambda x: x.zfill(8))
df.loc[df['DateColumn'] == '00000000','DateColumn'] = '01011980'
df['DateColumn'] = pd.to_datetime(df['DateColumn'], format="%m%d%Y")
df['DateColumn'] = df['DateColumn'].apply(lambda x: x.strftime('%m/%d/%Y'))
