2.15 Duplicate Data

If you want to identify and remove duplicate rows in a DataFrame, there are two methods that will help: duplicated and drop_duplicates. Each takes as its subset argument the columns to use to identify duplicated rows.

  • duplicated returns a boolean vector whose length is the number of rows, and which indicates whether a row is duplicated.
  • drop_duplicates removes duplicate rows.

By default, the first observed row of a duplicate set is considered unique, but each method has a keep parameter to specify which occurrences to keep (the same options also work on a Series, as sketched after the list).

  • keep='first' (default): mark / drop duplicates except for the first occurrence.
  • keep='last': mark / drop duplicates except for the last occurrence.
  • keep=False: mark / drop all duplicates.
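
Series offers the same duplicated and drop_duplicates methods, with identical keep semantics. A minimal sketch on toy data (separate from the df2 example that follows):

    import pandas as pd

    s = pd.Series(['one', 'one', 'two'])

    s.duplicated(keep='first').tolist()     # [False, True, False]
    s.duplicated(keep='last').tolist()      # [True, False, False]
    s.duplicated(keep=False).tolist()       # [True, True, False]
    s.drop_duplicates(keep=False).tolist()  # ['two']
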
In [1]: df2 = pd.DataFrame({'a': ['one', 'one', 'two', 'two', 'two', 'three', 'four'],
   ...:                     'b': ['x', 'y', 'x', 'y', 'x', 'x', 'x'],
   ...:                     'c': np.random.randn(7)})
   ...: 

In [2]: df2
Out[2]: 
       a  b       c
0    one  x -0.5146
1    one  y -0.4496
2    two  x  1.7346
3    two  y  0.6434
4    two  x  0.0261
5  three  x  0.0804
6   four  x -0.7974

In [3]: df2.duplicated('a')
Out[3]: 
0    False
1     True
2    False
3     True
4     True
5    False
6    False
dtype: bool

In [4]: df2.duplicated('a', keep='last')
Out[4]: 
0     True
1    False
2     True
3     True
4    False
5    False
6    False
dtype: bool

In [5]: df2.duplicated('a', keep=False)
Out[5]: 
0     True
1     True
2     True
3     True
4     True
5    False
6    False
dtype: bool

In [6]: df2.drop_duplicates('a')
Out[6]: 
       a  b       c
0    one  x -0.5146
2    two  x  1.7346
5  three  x  0.0804
6   four  x -0.7974

In [7]: df2.drop_duplicates('a', keep='last')
Out[7]: 
       a  b       c
1    one  y -0.4496
4    two  x  0.0261
5  three  x  0.0804
6   four  x -0.7974

In [8]: df2.drop_duplicates('a', keep=False)
Out[8]: 
       a  b       c
5  three  x  0.0804
6   four  x -0.7974
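
Note that drop_duplicates is equivalent to boolean indexing with the negation of duplicated, the same pattern used for index-based deduplication later in this section. A quick check against the df2 defined above:

    # keeping rows where duplicated() is False reproduces drop_duplicates()
    mask = ~df2.duplicated('a', keep='last')
    df2[mask].equals(df2.drop_duplicates('a', keep='last'))  # True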

You can also pass a list of columns to identify duplicated rows.

In [9]: df2.duplicated(['a', 'b'])
Out[9]: 
0    False
1    False
2    False
3    False
4     True
5    False
6    False
dtype: bool

In [10]: df2.drop_duplicates(['a', 'b'])
Out[10]: 
       a  b       c
0    one  x -0.5146
1    one  y -0.4496
2    two  x  1.7346
3    two  y  0.6434
5  three  x  0.0804
6   four  x -0.7974
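
If no columns are passed, all columns are used, so only rows that are identical across every column count as duplicates. A minimal sketch with a deliberately repeated row (a separate toy frame, since df2's random c column makes exact repeats unlikely):

    import pandas as pd

    df_toy = pd.DataFrame({'a': ['one', 'one', 'one'],
                           'b': ['x', 'x', 'y']})

    df_toy.duplicated().tolist()  # [False, True, False]
    df_toy.drop_duplicates()      # keeps rows 0 and 2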

To drop duplicates by index value, use Index.duplicated and then slice with the resulting boolean mask. The same options are available in the keep parameter.

In [11]: df3 = pd.DataFrame({'a': np.arange(6),
   ....:                     'b': np.random.randn(6)},
   ....:                    index=['a', 'a', 'b', 'c', 'b', 'a'])
   ....: 

In [12]: df3
Out[12]: 
   a       b
a  0 -0.6281
a  1 -0.3462
b  2  0.9681
c  3  0.7056
b  4 -2.1567
a  5  0.9506

In [13]: df3.index.duplicated()
Out[13]: array([False,  True, False, False,  True,  True], dtype=bool)

In [14]: df3[~df3.index.duplicated()]
Out[14]: 
   a       b
a  0 -0.6281
b  2  0.9681
c  3  0.7056

In [15]: df3[~df3.index.duplicated(keep='last')]
Out[15]: 
   a       b
c  3  0.7056
b  4 -2.1567
a  5  0.9506

In [16]: df3[~df3.index.duplicated(keep=False)]
Out[16]: 
   a       b
c  3  0.7056
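
If this pattern comes up often, it can be wrapped in a small helper. A sketch (drop_duplicate_index is a hypothetical name, not a pandas method):

    def drop_duplicate_index(df, keep='first'):
        """Drop rows whose index value has already appeared, per Index.duplicated.

        keep accepts 'first', 'last', or False, exactly as above.
        """
        return df[~df.index.duplicated(keep=keep)]

    drop_duplicate_index(df3)              # rows labelled a, b, c (first occurrences)
    drop_duplicate_index(df3, keep=False)  # only the row labelled c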