2.15 Duplicate Data
If you want to identify and remove duplicate rows in a DataFrame, there are
two methods that will help: duplicated and drop_duplicates. Each
takes as an argument the columns to use to identify duplicated rows.
duplicated returns a boolean vector whose length is the number of rows, and which indicates whether a row is duplicated.
drop_duplicates removes duplicate rows.
By default, the first observed row of a duplicate set is considered unique, but
each method has a keep parameter to specify targets to be kept.
keep='first' (default): mark / drop duplicates except for the first occurrence.
keep='last': mark / drop duplicates except for the last occurrence.
keep=False: mark / drop all duplicates.
In [1]: df2 = pd.DataFrame({'a': ['one', 'one', 'two', 'two', 'two', 'three', 'four'],
   ...:                     'b': ['x', 'y', 'x', 'y', 'x', 'x', 'x'],
   ...:                     'c': np.random.randn(7)})
   ...:
In [2]: df2
Out[2]:
a b c
0 one x -0.5146
1 one y -0.4496
2 two x 1.7346
3 two y 0.6434
4 two x 0.0261
5 three x 0.0804
6 four x -0.7974
In [3]: df2.duplicated('a')
Out[3]:
0 False
1 True
2 False
3 True
4 True
5 False
6 False
dtype: bool
In [4]: df2.duplicated('a', keep='last')
Out[4]:
0 True
1 False
2 True
3 True
4 False
5 False
6 False
dtype: bool
In [5]: df2.duplicated('a', keep=False)
Out[5]:
0 True
1 True
2 True
3 True
4 True
5 False
6 False
dtype: bool
In [6]: df2.drop_duplicates('a')
Out[6]:
a b c
0 one x -0.5146
2 two x 1.7346
5 three x 0.0804
6 four x -0.7974
In [7]: df2.drop_duplicates('a', keep='last')
Out[7]:
a b c
1 one y -0.4496
4 two x 0.0261
5 three x 0.0804
6 four x -0.7974
In [8]: df2.drop_duplicates('a', keep=False)
Out[8]:
a b c
5 three x 0.0804
6 four x -0.7974
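The relationship between the two methods can be checked directly: dropping duplicates should give the same result as selecting the rows that duplicated does not flag. The following is a minimal sketch, assuming a fixed integer column c in place of the random values used above so that the result is reproducible.

```python
import pandas as pd

# Same 'a' and 'b' columns as df2 above; 'c' uses fixed values
# (an assumption for reproducibility, not the random data shown earlier).
df = pd.DataFrame({
    "a": ["one", "one", "two", "two", "two", "three", "four"],
    "b": ["x", "y", "x", "y", "x", "x", "x"],
    "c": range(7),
})

# For every keep option, negating the duplicated() mask and indexing
# with it selects the same rows that drop_duplicates() returns.
for keep in ("first", "last", False):
    mask = df.duplicated("a", keep=keep)
    assert df[~mask].equals(df.drop_duplicates("a", keep=keep))

kept_first = df.drop_duplicates("a")
kept_last = df.drop_duplicates("a", keep="last")
kept_none = df.drop_duplicates("a", keep=False)
```

The boolean mask is the more flexible of the two forms, since it can be combined with other conditions before indexing.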
You can also pass a list of columns; rows are then treated as duplicates only when they match on every listed column.
In [9]: df2.duplicated(['a', 'b'])
Out[9]:
0 False
1 False
2 False
3 False
4 True
5 False
6 False
dtype: bool
In [10]: df2.drop_duplicates(['a', 'b'])
Out[10]:
a b c
0 one x -0.5146
1 one y -0.4496
2 two x 1.7346
3 two y 0.6434
5 three x 0.0804
6 four x -0.7974
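The choice of columns changes which rows count as duplicates. When no columns are given, pandas compares rows on all columns. The sketch below contrasts the two cases on a small frame with made-up values chosen for illustration.

```python
import pandas as pd

# Illustrative data: rows 2 and 4 agree on 'a' and 'b' but differ on 'c'.
df = pd.DataFrame({
    "a": ["one", "one", "two", "two", "two"],
    "b": ["x", "y", "x", "y", "x"],
    "c": [1, 2, 3, 4, 5],
})

# No columns given: rows are compared on every column, so nothing repeats.
all_cols = df.duplicated()

# Only 'a' and 'b' are compared: row 4 repeats row 2.
ab_only = df.duplicated(["a", "b"])

# Summing the boolean mask counts the rows flagged as duplicates.
n_dupes = int(ab_only.sum())
```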
To drop duplicates by index value, use Index.duplicated to build a boolean mask, then slice the DataFrame with the negated mask.
The same options are available for the keep parameter.
In [11]: df3 = pd.DataFrame({'a': np.arange(6),
   ....:                     'b': np.random.randn(6)},
   ....:                    index=['a', 'a', 'b', 'c', 'b', 'a'])
   ....:
In [12]: df3
Out[12]:
a b
a 0 -0.6281
a 1 -0.3462
b 2 0.9681
c 3 0.7056
b 4 -2.1567
a 5 0.9506
In [13]: df3.index.duplicated()
Out[13]: array([False, True, False, False, True, True], dtype=bool)
In [14]: df3[~df3.index.duplicated()]
Out[14]:
a b
a 0 -0.6281
b 2 0.9681
c 3 0.7056
In [15]: df3[~df3.index.duplicated(keep='last')]
Out[15]:
a b
c 3 0.7056
b 4 -2.1567
a 5 0.9506
In [16]: df3[~df3.index.duplicated(keep=False)]
Out[16]:
a b
c 3 0.7056
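The mask-and-slice pattern for index labels can be wrapped in a small function. The following is a sketch only: drop_duplicate_index is a hypothetical helper name, not part of pandas, and the frame reuses the index labels of df3 without the random column.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": np.arange(6)},
                  index=["a", "a", "b", "c", "b", "a"])


def drop_duplicate_index(frame, keep="first"):
    """Hypothetical helper: keep rows whose index label is not flagged
    as a duplicate by Index.duplicated with the given keep option."""
    return frame.loc[~frame.index.duplicated(keep=keep)]


first = drop_duplicate_index(df)                     # first row per label
last = drop_duplicate_index(df, keep="last")         # last row per label
unique_only = drop_duplicate_index(df, keep=False)   # labels seen only once
```

Using .loc with the boolean array is equivalent here to the plain df3[...] slicing shown above; it only makes the row selection explicit.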