
How to remove rows with duplicate index values?

In the weather DataFrame below, sometimes a scientist goes back and corrects observations -- not by editing the erroneous rows, but by appending a duplicate row to the end of a file.

I'm reading some automated weather data from the web (observations occur every 5 minutes and are compiled into monthly files for each weather station). After parsing a file, the DataFrame looks like:

                      Sta  Precip1hr  Precip5min  Temp  DewPnt  WindSpd  WindDir  AtmPress
Date                                                                                      
2001-01-01 00:00:00  KPDX          0           0     4       3        0        0     30.31
2001-01-01 00:05:00  KPDX          0           0     4       3        0        0     30.30
2001-01-01 00:10:00  KPDX          0           0     4       3        4       80     30.30
2001-01-01 00:15:00  KPDX          0           0     3       2        5       90     30.30
2001-01-01 00:20:00  KPDX          0           0     3       2       10      110     30.28

Example of a duplicate case:

import pandas as pd
import datetime

startdate = datetime.datetime(2001, 1, 1, 0, 0)
enddate = datetime.datetime(2001, 1, 1, 5, 0)
index = pd.date_range(start=startdate, end=enddate, freq='h')  # hourly; 'H' is deprecated in newer pandas
data1 = {'A' : range(6), 'B' : range(6)}
data2 = {'A' : [20, -30, 40], 'B' : [-50, 60, -70]}
df1 = pd.DataFrame(data=data1, index=index)
df2 = pd.DataFrame(data=data2, index=index[:3])
df3 = pd.concat([df2, df1])  # DataFrame.append was removed in pandas 2.0

df3
                       A   B
2001-01-01 00:00:00   20 -50
2001-01-01 01:00:00  -30  60
2001-01-01 02:00:00   40 -70
2001-01-01 03:00:00    3   3
2001-01-01 04:00:00    4   4
2001-01-01 05:00:00    5   5
2001-01-01 00:00:00    0   0
2001-01-01 01:00:00    1   1
2001-01-01 02:00:00    2   2

And so I need df3 to eventually become:

                       A   B
2001-01-01 00:00:00    0   0
2001-01-01 01:00:00    1   1
2001-01-01 02:00:00    2   2
2001-01-01 03:00:00    3   3
2001-01-01 04:00:00    4   4
2001-01-01 05:00:00    5   5

I thought that adding a column of row numbers (df3['rownum'] = range(df3.shape[0])) would help me select the bottom-most row for any value of the DatetimeIndex, but I am stuck on figuring out the groupby or pivot (or ???) statements to make that work cleanly.
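
The closest I've gotten is a brute-force sketch like this (the names last_rownums and df4 are just placeholders), which seems to work but feels clumsy:

df3['rownum'] = range(df3.shape[0])
last_rownums = df3.groupby(level=0)['rownum'].max()  # bottom-most rownum per timestamp
df4 = df3[df3['rownum'].isin(last_rownums)].drop(columns='rownum').sort_index()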

  • Another way to get duplicates is hourly data during the night when clocks are set back for daylight saving time: 1 AM, 2, 3, 2, 3 again, 4, ...
    – denis
    Commented Aug 28, 2017 at 9:42
  • When you say "remove duplicates", your context here implicitly means "keep the first", i.e. drop_duplicates(keep='first'). (That's not always the case; sometimes it's harder to figure out from the other fields which row should be kept, or you may need to merge multiple rows, filling NAs from various rows.)
    – smci
    Commented Jan 21, 2021 at 19:54

8 Answers


I would suggest using the duplicated method on the Pandas Index itself:

df3 = df3[~df3.index.duplicated(keep='first')]

While all the other methods work, .drop_duplicates is by far the least performant for the provided example. Furthermore, while the groupby method is only slightly less performant, I find the duplicated method to be more readable.

Using the sample data provided:

>>> %timeit df3.reset_index().drop_duplicates(subset='index', keep='first').set_index('index')
1000 loops, best of 3: 1.54 ms per loop

>>> %timeit df3.groupby(df3.index).first()
1000 loops, best of 3: 580 µs per loop

>>> %timeit df3[~df3.index.duplicated(keep='first')]
1000 loops, best of 3: 307 µs per loop

Note that you can keep the last element by changing the keep argument to 'last'.
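
For the OP's df3 specifically, the corrected rows were appended last, so there it is keep='last' plus a sort that restores the desired frame:

df3 = df3[~df3.index.duplicated(keep='last')].sort_index()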

It should also be noted that this method works with MultiIndex as well (using df1 as specified in Paul's example):

>>> %timeit df1.groupby(level=df1.index.names).last()
1000 loops, best of 3: 771 µs per loop

>>> %timeit df1[~df1.index.duplicated(keep='last')]
1000 loops, best of 3: 365 µs per loop
  • loc might not be necessary. Simply do df3 = df3[~df3.index.duplicated(keep='first')], which will drop all rows with a duplicate index except the first occurrence. Commented Sep 16, 2019 at 18:30
  • What does ~ do in df3 = df3.loc[~df3.index.duplicated(keep='first')], if anyone does not mind answering?
    – jsl5703
    Commented Feb 27, 2020 at 1:38
  • @jsl5703 It reverses the mask. So it turns everything that was True to False and vice versa. In this case, that means we select the rows that are not duplicated according to the method.
    – n8yoder
    Commented Feb 27, 2020 at 2:35
  • What witchcraft is this, using a bitwise (~) operation on a DataFrame, and it even works. Have an upvote and a comment, because an upvote does not seem enough. Added bonus for the %timeit.
    – Harper
    Commented Sep 21, 2020 at 13:30
  • For those who prefer "chaining": df3.query("~index.duplicated(keep='first')")
    – MikeGM
    Commented Mar 19, 2021 at 11:47

This adds the index as a DataFrame column, drops duplicates on that column, and then sets it back as the index:

df = (df.reset_index()
        .drop_duplicates(subset='index', keep='last')
        .set_index('index').sort_index())

Note that the trailing .sort_index() is optional: include it only if you need the result back in sorted order.

  • Another variation on this is: df.reset_index().drop_duplicates(cols='index', take_last=True).set_index('index') (old pandas spelling; in modern pandas that is subset='index' and keep='last').
    – Luciano
    Commented Feb 20, 2014 at 10:35
  • While this method does work, it also creates two temporary copies of the DataFrame and is significantly less performant than either the duplicated-index or groupby methods suggested in the other answers.
    – n8yoder
    Commented Dec 19, 2015 at 21:26
  • If your index is a MultiIndex, reset_index() adds columns level_0, level_1, etc., and if your index has a name, that name will be used in place of the "index" label. That makes this a bit more than a one-liner to do right for any DataFrame (see the sketch after these comments). index_label = getattr(df.index, 'names', getattr(df.index, 'name', 'index')), then cols=index_label, then set_index(index_labels), and even this isn't foolproof (it won't work for unnamed multiindexes).
    – hobs
    Commented May 23, 2016 at 22:35
  • Moving the index to a column, clearing duplicates, and resetting the index was awesome; that was exactly what I needed!
    – Sienna
    Commented Dec 27, 2016 at 19:03
  • Given idx = df.index.name or 'index', one could also do df2 = df.reset_index(); df2.drop_duplicates(idx, inplace=True); df2.set_index(idx, inplace=True) to avoid the intermediate copies (due to the inplace=True)
    – Anakhand
    Commented May 27, 2019 at 16:16
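
Following up on hobs' caveat above, here is a hedged sketch of a variant that handles an unnamed single index, a named index, or a fully named MultiIndex (the helper name drop_dup_index is made up for illustration, and unnamed MultiIndex levels are still not covered):

import pandas as pd

def drop_dup_index(df, keep='last'):
    # use the index's own level names; an unnamed single index becomes
    # a column literally called 'index' after reset_index()
    names = [n for n in df.index.names if n is not None]
    if not names:
        names = ['index']
    return (df.reset_index()
              .drop_duplicates(subset=names, keep=keep)
              .set_index(names))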

Oh my. This is actually so simple!

grouped = df3.groupby(level=0)
df4 = grouped.last()
df4
                      A   B  rownum
2001-01-01 00:00:00   0   0       6
2001-01-01 01:00:00   1   1       7
2001-01-01 02:00:00   2   2       8
2001-01-01 03:00:00   3   3       3
2001-01-01 04:00:00   4   4       4
2001-01-01 05:00:00   5   5       5

Follow-up edit 2013-10-29: In the case where I have a fairly complex MultiIndex, I think I prefer the groupby approach. Here's a simple example for posterity:

import numpy as np
import pandas

# fake index
idx = pandas.MultiIndex.from_tuples([('a', letter) for letter in list('abcde')])

# random data + naming the index levels
df1 = pandas.DataFrame(np.random.normal(size=(5,2)), index=idx, columns=['colA', 'colB'])
df1.index.names = ['iA', 'iB']

# artificially append some duplicate data
# (.append and .select were removed in pandas 2.0)
dupes = df1.loc[df1.index.get_level_values('iB').isin(['c', 'e'])]
df1 = pandas.concat([df1, dupes])
df1
#           colA      colB
#iA iB                    
#a  a  -1.297535  0.691787
#   b  -1.688411  0.404430
#   c   0.275806 -0.078871
#   d  -0.509815 -0.220326
#   e  -0.066680  0.607233
#   c   0.275806 -0.078871  # <--- dup 1
#   e  -0.066680  0.607233  # <--- dup 2

and here's the important part

# group the data, using df1.index.names tells pandas to look at the entire index
groups = df1.groupby(level=df1.index.names)  
groups.last() # or .first()
#           colA      colB
#iA iB                    
#a  a  -1.297535  0.691787
#   b  -1.688411  0.404430
#   c   0.275806 -0.078871
#   d  -0.509815 -0.220326
#   e  -0.066680  0.607233
  • That works if the levels have names; otherwise (if any name is None), something like level=[0,1] will work when there are 2 levels: df1.groupby(level=[0,1]).last(). This should be part of Pandas as a complement to drop_duplicates.
    – dashesy
    Commented Apr 12, 2015 at 18:38
  • @dashesy yeah. Using df.index.names is just an easy way to group by all levels of the index.
    – Paul H
    Commented Apr 12, 2015 at 18:45
  • Great solution, thank you! I will also add that this works in xarray as well, for dealing with duplicate DateTime indices that make ds.resample and ds.groupby operations fail.
    – drg
    Commented May 15, 2018 at 23:26
  • Amendment to my earlier comment: it works in xarray as long as you change the grouped = df3.groupby(level=0) to grouped = df3.groupby(dim='time') or whatever the dimension is that contains duplicates
    – drg
    Commented May 15, 2018 at 23:34

Remove duplicates (Keeping First)

import numpy as np

idx = np.unique(df.index.values, return_index=True)[1]
df = df.iloc[idx]

Remove duplicates (Keeping Last)

df = df[::-1]
df = df.iloc[np.unique(df.index.values, return_index=True)[1]]
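
Applied to the OP's df3 (the corrections were appended last, so reverse first; rev is just a temporary name, and note that np.unique also sorts, so the result comes back in index order):

rev = df3[::-1]
df3 = rev.iloc[np.unique(rev.index.values, return_index=True)[1]]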

Tests: 10k loops using OP's data

numpy method - 3.03 seconds
df.loc[~df.index.duplicated(keep='first')] - 4.43 seconds
df.groupby(df.index).first() - 21 seconds
reset_index() method - 29 seconds

If, like me, you like chainable data manipulation using the pandas dot notation (similar to piping), the following may be useful:

df3 = df3.query('~index.duplicated()')

This enables chaining statements like this:

df3.assign(C=2).query('~index.duplicated()').mean()
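
Step by step on the OP's df3, that chain does the following (C=2 just adds an arbitrary example column; the intermediate names are mine):

step1 = df3.assign(C=2)                     # new frame with a constant column C
step2 = step1.query('~index.duplicated()')  # keep the first occurrence of each index label
result = step2.mean()                       # column-wise means of the deduplicated frame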
  • I tried this but could not get it to work. I get an error like this: TypeError: 'Series' objects are mutable, thus they cannot be hashed. Did this actually work for you? Commented Sep 3, 2019 at 13:45
  • It worked for me, using Pandas 1.2.2. I had to change the construction of the index, since DatetimeIndex no longer supports start and end arguments, so I used date_range instead. Commented Feb 28, 2021 at 16:53
  • Can you elaborate a bit more on what df3.assign(C=2).query('~index.duplicated()').mean() does? It would be nice to have a step-by-step example.
    – MyCarta
    Commented Jul 15 at 16:35

Unfortunately, I don't think Pandas allows one to drop dups off the indices. I would suggest the following:

df3 = df3.reset_index()  # makes the date column part of your data
df3.columns = ['timestamp', 'A', 'B', 'rownum']  # set names
df3 = df3.drop_duplicates('timestamp', keep='last').set_index('timestamp')  # done!

I ran into this same error, and after digging into every DataFrame it turned out that one of them had two columns with the same name. You mention that you drop some columns; that duplication could be the reason.


Another way, using pandas.Index.drop_duplicates():

df.loc[df.index.drop_duplicates(keep='first'), :]

Beware, though: with a list of labels, .loc returns every row that matches each label, so on an index that still contains duplicates this can return the duplicated rows rather than dropping them (see the sketch below). It is also slower than the accepted answer, so just use that.
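
A quick demonstration of that pitfall on a toy frame (not the OP's data):

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3]}, index=['x', 'x', 'y'])
print(df.loc[df.index.drop_duplicates(keep='first'), :])
#    A
# x  1
# x  2   <- both 'x' rows come back: .loc returns every row matching each label
# y  3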

%timeit df.reset_index().drop_duplicates(subset='Notasi', keep='first').set_index('Notasi')
281 µs ± 1.41 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
%timeit df.groupby(df.index).first()
212 µs ± 3.65 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
%timeit df[~df.index.duplicated(keep='first')]
38.1 µs ± 116 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
%timeit df.loc[df.index.drop_duplicates(keep='first'), :]
104 µs ± 721 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
