5.7 Filtration
New in version 0.12.
The filter method returns a subset of the original object. Suppose we
want to take only elements that belong to groups with a group sum greater
than 2.
In [1]: sf = pd.Series([1, 1, 2, 3, 3, 3])
In [2]: sf.groupby(sf).filter(lambda x: x.sum() > 2)
Out[2]:
3 3
4 3
5 3
dtype: int64
The argument of filter must be a function that, applied to the group as a
whole, returns True or False.
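As a minimal sketch of that contract, the lambda from the Series example can be written as a named function. The name sum_exceeds_two is hypothetical and chosen only for illustration; the data is the same sf Series used above.

```python
import pandas as pd

sf = pd.Series([1, 1, 2, 3, 3, 3])

def sum_exceeds_two(group):
    # Hypothetical helper: `group` is the sub-Series holding one group's
    # values, and the function returns a single True/False for that group.
    return bool(group.sum() > 2)

# Groups 1 and 2 both sum to 2, so only the group of 3s passes.
kept = sf.groupby(sf).filter(sum_exceeds_two)
print(kept)
```

The predicate is evaluated once per group, not once per element, so every element of a group is kept or dropped together.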
Another useful operation is filtering out elements that belong to groups with only a couple of members.
In [3]: dff = pd.DataFrame({'A': np.arange(8), 'B': list('aabbbbcc')})
In [4]: dff.groupby('B').filter(lambda x: len(x) > 2)
Out[4]:
A B
2 2 b
3 3 b
4 4 b
5 5 b
Alternatively, instead of dropping the offending groups, we can return a like-indexed object in which the groups that do not pass the filter are filled with NaNs.
In [5]: dff.groupby('B').filter(lambda x: len(x) > 2, dropna=False)
Out[5]:
A B
0 NaN NaN
1 NaN NaN
2 2.0 b
3 3.0 b
4 4.0 b
5 5.0 b
6 NaN NaN
7 NaN NaN
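The two behaviours can be compared side by side. The following is a sketch that rebuilds the dff frame from above; the variable names (dropped, padded, and so on) are illustrative assumptions, not part of the pandas API.

```python
import numpy as np
import pandas as pd

dff = pd.DataFrame({'A': np.arange(8), 'B': list('aabbbbcc')})
grouped = dff.groupby('B')

# Default behaviour: rows from groups that fail the predicate are dropped.
dropped = grouped.filter(lambda x: len(x) > 2)

# dropna=False: the failing rows are kept in place but filled with NaN.
padded = grouped.filter(lambda x: len(x) > 2, dropna=False)

same_index = padded.index.equals(dff.index)   # original index is retained
n_masked = int(padded['A'].isna().sum())      # rows from groups 'a' and 'c'
surviving = padded.dropna().index.tolist()    # labels of the rows that passed
print(same_index, n_masked, surviving)
```

Because the masked rows hold NaN, the integer column A is shown as floating point in the padded result, as in the output above.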
For DataFrames with multiple columns, filters should explicitly specify a column as the filter criterion.
In [6]: dff['C'] = np.arange(8)
In [7]: dff.groupby('B').filter(lambda x: len(x['C']) > 2)
Out[7]:
A B C
2 2 b 2
3 3 b 3
4 4 b 4
5 5 b 5
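The criterion does not have to be the group length. As a hedged sketch, the predicate below keeps the groups whose mean of column C exceeds a threshold; the threshold of 3 and the name kept are arbitrary choices made for this illustration.

```python
import numpy as np
import pandas as pd

dff = pd.DataFrame({'A': np.arange(8), 'B': list('aabbbbcc')})
dff['C'] = np.arange(8)

# Group means of C are 0.5 for 'a', 3.5 for 'b' and 6.5 for 'c',
# so groups 'b' and 'c' pass the (arbitrary) threshold of 3.
kept = dff.groupby('B').filter(lambda x: x['C'].mean() > 3)
print(kept)
```

Selecting the column inside the function makes it explicit which values the decision is based on, while the returned frame still contains all columns.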
Note
Some functions, when applied to a groupby object, act as a filter on the input: they return
a reduced shape of the original (potentially eliminating groups) but leave the index unchanged.
Passing as_index=False will not affect these methods. Examples are head and tail.
In [8]: dff.groupby('B').head(2)
Out[8]:
A B C
0 0 a 0
1 1 a 1
2 2 b 2
3 3 b 3
6 6 c 6
7 7 c 7
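A short sketch of the same point using tail, and of head with and without as_index=False. The frame is rebuilt here so that the snippet is self-contained, and the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

dff = pd.DataFrame({'A': np.arange(8),
                    'B': list('aabbbbcc'),
                    'C': np.arange(8)})

# tail(1) keeps the last row of each group, under its original index label.
last_rows = dff.groupby('B').tail(1)

# head(2) with and without as_index=False; the two results are compared below.
first_two = dff.groupby('B').head(2)
first_two_no_index = dff.groupby('B', as_index=False).head(2)
unchanged = first_two.equals(first_two_no_index)

print(last_rows)
print(unchanged)
```

The rows come back in their original order and under their original labels rather than being re-indexed by group key.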