
How to remove rows with duplicate index values?

In the weather DataFrame below, sometimes a scientist goes back and corrects observations -- not by editing the erroneous rows, but by appending a duplicate row to the end of a file.

I'm reading some automated weather data from the web (observations occur every 5 minutes and are compiled into monthly files for each weather station). After parsing a file, the DataFrame looks like:

                      Sta  Precip1hr  Precip5min  Temp  DewPnt  WindSpd  WindDir  AtmPress
Date                                                                                      
2001-01-01 00:00:00  KPDX          0           0     4       3        0        0     30.31
2001-01-01 00:05:00  KPDX          0           0     4       3        0        0     30.30
2001-01-01 00:10:00  KPDX          0           0     4       3        4       80     30.30
2001-01-01 00:15:00  KPDX          0           0     3       2        5       90     30.30
2001-01-01 00:20:00  KPDX          0           0     3       2       10      110     30.28

Example of a duplicate case:

import pandas as pd
import datetime

startdate = datetime.datetime(2001, 1, 1, 0, 0)
enddate = datetime.datetime(2001, 1, 1, 5, 0)
index = pd.date_range(start=startdate, end=enddate, freq='h')  # hourly; 'H' is deprecated in newer pandas
data1 = {'A' : range(6), 'B' : range(6)}
data2 = {'A' : [20, -30, 40], 'B' : [-50, 60, -70]}
df1 = pd.DataFrame(data=data1, index=index)
df2 = pd.DataFrame(data=data2, index=index[:3])
df3 = pd.concat([df2, df1])  # DataFrame.append was removed in pandas 2.0

df3
                       A   B
2001-01-01 00:00:00   20 -50
2001-01-01 01:00:00  -30  60
2001-01-01 02:00:00   40 -70
2001-01-01 03:00:00    3   3
2001-01-01 04:00:00    4   4
2001-01-01 05:00:00    5   5
2001-01-01 00:00:00    0   0
2001-01-01 01:00:00    1   1
2001-01-01 02:00:00    2   2

And so I need df3 to eventually become:

                       A   B
2001-01-01 00:00:00    0   0
2001-01-01 01:00:00    1   1
2001-01-01 02:00:00    2   2
2001-01-01 03:00:00    3   3
2001-01-01 04:00:00    4   4
2001-01-01 05:00:00    5   5

I thought that adding a column of row numbers (df3['rownum'] = range(df3.shape[0])) would help me select the bottom-most row for any value of the DatetimeIndex, but I am stuck on figuring out the groupby or pivot (or ???) statements to make that work cleanly.
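
The closest I've gotten is a brute-force sketch like this (the names last_rownums and df4 are just placeholders), which seems to work but feels clumsy:

df3['rownum'] = range(df3.shape[0])
last_rownums = df3.groupby(level=0)['rownum'].max()  # bottom-most rownum per timestamp
df4 = df3[df3['rownum'].isin(last_rownums)].drop(columns='rownum').sort_index()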

  • Another way to get duplicates is hourly data during the night when clocks are set back for daylight saving time: 1 AM, 2, 3, 2, 3 again, 4, ...
    – denis
    Commented Aug 28, 2017 at 9:42
  • When you say "remove duplicates", your context here implicitly means "keep the first", i.e. drop_duplicates(keep='first'). (That's not always the case; sometimes it's harder to figure out from the other fields which row should be kept, or you may need to merge multiple rows, filling NAs from various rows.)
    – smci
    Commented Jan 21, 2021 at 19:54

8 Answers


I would suggest using the duplicated method on the Pandas Index itself:

df3 = df3[~df3.index.duplicated(keep='first')]

While all the other methods work, .drop_duplicates is by far the least performant for the provided example. Furthermore, while the groupby method is only slightly less performant, I find the duplicated method to be more readable.

Using the sample data provided:

>>> %timeit df3.reset_index().drop_duplicates(subset='index', keep='first').set_index('index')
1000 loops, best of 3: 1.54 ms per loop

>>> %timeit df3.groupby(df3.index).first()
1000 loops, best of 3: 580 µs per loop

>>> %timeit df3[~df3.index.duplicated(keep='first')]
1000 loops, best of 3: 307 µs per loop

Note that you can keep the last element by changing the keep argument to 'last'.
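
For the OP's df3 specifically, the corrected rows were appended last, so there it is keep='last' plus a sort that restores the desired frame:

df3 = df3[~df3.index.duplicated(keep='last')].sort_index()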

It should also be noted that this method works with MultiIndex as well (using df1 as specified in Paul's example):

>>> %timeit df1.groupby(level=df1.index.names).last()
1000 loops, best of 3: 771 µs per loop

>>> %timeit df1[~df1.index.duplicated(keep='last')]
1000 loops, best of 3: 365 µs per loop
  • loc might not be necessary. Simply do df3 = df3[~df3.index.duplicated(keep='first')], which will drop all rows with a duplicate index except the first occurrence. Commented Sep 16, 2019 at 18:30
  • What does ~ do in df3 = df3.loc[~df3.index.duplicated(keep='first')], if anyone does not mind answering?
    – jsl5703
    Commented Feb 27, 2020 at 1:38
  • @jsl5703 It reverses the mask. So it turns everything that was True to False and vice versa. In this case, that means we select the rows that are not duplicated according to the method.
    – n8yoder
    Commented Feb 27, 2020 at 2:35
  • What witchcraft is this, using a bitwise (~) operation on a DataFrame, and it even works. Have an upvote and a comment, because an upvote does not seem enough. Added bonus for the %timeit.
    – Harper
    Commented Sep 21, 2020 at 13:30
  • For those who prefer "chaining": df3.query("~index.duplicated(keep='first')")
    – MikeGM
    Commented Mar 19, 2021 at 11:47

This adds the index as a DataFrame column, drops duplicates on that column, and then sets it back as the index:

df = (df.reset_index()
        .drop_duplicates(subset='index', keep='last')
        .set_index('index').sort_index())

Note that the trailing .sort_index() is optional: include it only if you need the result back in sorted order.

  • Another variation on this is: df.reset_index().drop_duplicates(cols='index', take_last=True).set_index('index') (old pandas spelling; in modern pandas that is subset='index' and keep='last').
    – Luciano
    Commented Feb 20, 2014 at 10:35
  • While this method does work, it also creates two temporary copies of the DataFrame and is significantly less performant than either the duplicated-index or groupby methods suggested in the other answers.
    – n8yoder
    Commented Dec 19, 2015 at 21:26
  • If your index is a MultiIndex, reset_index() adds columns level_0, level_1, etc., and if your index has a name, that name will be used in place of the "index" label. That makes this a bit more than a one-liner to do right for any DataFrame (see the sketch after these comments). index_label = getattr(df.index, 'names', getattr(df.index, 'name', 'index')), then cols=index_label, then set_index(index_labels), and even this isn't foolproof (it won't work for unnamed multiindexes).
    – hobs
    Commented May 23, 2016 at 22:35
  • Moving the index to a column, clearing duplicates, and resetting the index was awesome; that was exactly what I needed!
    – Sienna
    Commented Dec 27, 2016 at 19:03
  • Given idx = df.index.name or 'index', one could also do df2 = df.reset_index(); df2.drop_duplicates(idx, inplace=True); df2.set_index(idx, inplace=True) to avoid the intermediate copies (due to the inplace=True)
    – Anakhand
    Commented May 27, 2019 at 16:16
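
Following up on hobs' caveat above, here is a hedged sketch of a variant that handles an unnamed single index, a named index, or a fully named MultiIndex (the helper name drop_dup_index is made up for illustration, and unnamed MultiIndex levels are still not covered):

import pandas as pd

def drop_dup_index(df, keep='last'):
    # use the index's own level names; an unnamed single index becomes
    # a column literally called 'index' after reset_index()
    names = [n for n in df.index.names if n is not None]
    if not names:
        names = ['index']
    return (df.reset_index()
              .drop_duplicates(subset=names, keep=keep)
              .set_index(names))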

Oh my. This is actually so simple!

grouped = df3.groupby(level=0)
df4 = grouped.last()
df4
                      A   B  rownum
2001-01-01 00:00:00   0   0       6
2001-01-01 01:00:00   1   1       7
2001-01-01 02:00:00   2   2       8
2001-01-01 03:00:00   3   3       3
2001-01-01 04:00:00   4   4       4
2001-01-01 05:00:00   5   5       5

Follow-up edit 2013-10-29: In the case where I have a fairly complex MultiIndex, I think I prefer the groupby approach. Here's a simple example for posterity:

import numpy as np
import pandas

# fake index
idx = pandas.MultiIndex.from_tuples([('a', letter) for letter in list('abcde')])

# random data + naming the index levels
df1 = pandas.DataFrame(np.random.normal(size=(5,2)), index=idx, columns=['colA', 'colB'])
df1.index.names = ['iA', 'iB']

# artificially append some duplicate data
# (.append and .select were removed in pandas 2.0)
dupes = df1.loc[df1.index.get_level_values('iB').isin(['c', 'e'])]
df1 = pandas.concat([df1, dupes])
df1
#           colA      colB
#iA iB                    
#a  a  -1.297535  0.691787
#   b  -1.688411  0.404430
#   c   0.275806 -0.078871
#   d  -0.509815 -0.220326
#   e  -0.066680  0.607233
#   c   0.275806 -0.078871  # <--- dup 1
#   e  -0.066680  0.607233  # <--- dup 2

and here's the important part

# group the data, using df1.index.names tells pandas to look at the entire index
groups = df1.groupby(level=df1.index.names)  
groups.last() # or .first()
#           colA      colB
#iA iB                    
#a  a  -1.297535  0.691787
#   b  -1.688411  0.404430
#   c   0.275806 -0.078871
#   d  -0.509815 -0.220326
#   e  -0.066680  0.607233
  • That works if the levels have names; otherwise (if any name is None), something like level=[0,1] will work when there are 2 levels: df1.groupby(level=[0,1]).last(). This should be part of Pandas as a complement to drop_duplicates.
    – dashesy
    Commented Apr 12, 2015 at 18:38
  • @dashesy yeah. Using df.index.names is just an easy way to group by all levels of the index.
    – Paul H
    Commented Apr 12, 2015 at 18:45
  • Great solution, thank you! I will also add that this works in xarray as well, for dealing with duplicate DateTime indices that make ds.resample and ds.groupby operations fail.
    – drg
    Commented May 15, 2018 at 23:26
  • Amendment to my earlier comment: it works in xarray as long as you change the grouped = df3.groupby(level=0) to grouped = df3.groupby(dim='time') or whatever the dimension is that contains duplicates
    – drg
    Commented May 15, 2018 at 23:34

Remove duplicates (Keeping First)

import numpy as np

idx = np.unique(df.index.values, return_index=True)[1]
df = df.iloc[idx]

Remove duplicates (Keeping Last)

df = df[::-1]
df = df.iloc[np.unique(df.index.values, return_index=True)[1]]
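
Applied to the OP's df3 (the corrections were appended last, so reverse first; rev is just a temporary name, and note that np.unique also sorts, so the result comes back in index order):

rev = df3[::-1]
df3 = rev.iloc[np.unique(rev.index.values, return_index=True)[1]]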

Tests: 10k loops using OP's data

numpy method - 3.03 seconds
df.loc[~df.index.duplicated(keep='first')] - 4.43 seconds
df.groupby(df.index).first() - 21 seconds
reset_index() method - 29 seconds

If, like me, you like chainable data manipulation using the pandas dot notation (similar to piping), the following may be useful:

df3 = df3.query('~index.duplicated()')

This enables chaining statements like this:

df3.assign(C=2).query('~index.duplicated()').mean()
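
Step by step on the OP's df3, that chain does the following (C=2 just adds an arbitrary example column; the intermediate names are mine):

step1 = df3.assign(C=2)                     # new frame with a constant column C
step2 = step1.query('~index.duplicated()')  # keep the first occurrence of each index label
result = step2.mean()                       # column-wise means of the deduplicated frame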
  • I tried this but could not get it to work. I get an error like this: TypeError: 'Series' objects are mutable, thus they cannot be hashed. Did this actually work for you? Commented Sep 3, 2019 at 13:45
  • It worked for me, using Pandas 1.2.2. I had to change the construction of the index, since DatetimeIndex no longer supports start and end arguments, so I used date_range instead. Commented Feb 28, 2021 at 16:53
  • Can you elaborate a bit more on what df3.assign(C=2).query('~index.duplicated()').mean() does? It would be nice to have a step-by-step example.
    – MyCarta
    Commented Jul 15 at 16:35

Unfortunately, I don't think Pandas allows one to drop dups off the indices. I would suggest the following:

df3 = df3.reset_index()  # makes the date column part of your data
df3.columns = ['timestamp', 'A', 'B', 'rownum']  # set names
df3 = df3.drop_duplicates('timestamp', keep='last').set_index('timestamp')  # done!

I ran into this same error, and after digging into every DataFrame it turned out that one of them had two columns with the same name. You mention that you drop some columns; that duplication could be the reason.


Another way, using pandas.Index.drop_duplicates():

df.loc[df.index.drop_duplicates(keep='first'), :]

Beware, though: with a list of labels, .loc returns every row that matches each label, so on an index that still contains duplicates this can return the duplicated rows rather than dropping them (see the sketch below). It is also slower than the accepted answer, so just use that.
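
A quick demonstration of that pitfall on a toy frame (not the OP's data):

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3]}, index=['x', 'x', 'y'])
print(df.loc[df.index.drop_duplicates(keep='first'), :])
#    A
# x  1
# x  2   <- both 'x' rows come back: .loc returns every row matching each label
# y  3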

%timeit df.reset_index().drop_duplicates(subset='Notasi', keep='first').set_index('Notasi')
281 µs ± 1.41 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
%timeit df.groupby(df.index).first()
212 µs ± 3.65 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
%timeit df[~df.index.duplicated(keep='first')]
38.1 µs ± 116 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
%timeit df.loc[df.index.drop_duplicates(keep='first'), :]
104 µs ± 721 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
