5.7 Filtration

New in version 0.12.

The filter method returns a subset of the original object. Suppose we want to take only elements that belong to groups with a group sum greater than 2.

In [1]: sf = pd.Series([1, 1, 2, 3, 3, 3])

In [2]: sf.groupby(sf).filter(lambda x: x.sum() > 2)
Out[2]: 
3    3
4    3
5    3
dtype: int64

The argument of filter must be a function that, applied to the group as a whole, returns True or False.
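
Because the function returns one boolean per group, the same selection can be written as a boolean mask over the original object. The following is a minimal sketch of that idea, not a description of how filter is implemented; the names group_sums, mask, via_mask and via_filter are illustrative only.

```python
import pandas as pd

sf = pd.Series([1, 1, 2, 3, 3, 3])

# Broadcast each group's sum back onto the members of that group.
group_sums = sf.groupby(sf).transform("sum")

# One boolean per element: True where the element's group passes the test.
mask = group_sums > 2

# Selecting with the mask keeps the original index labels.
via_mask = sf[mask]

# The same selection expressed with filter, for comparison.
via_filter = sf.groupby(sf).filter(lambda x: x.sum() > 2)
```

Both objects hold the three elements of the group whose sum exceeds 2. The filter form is often easier to read when the criterion is an arbitrary function of the whole group.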

Another useful operation is filtering out elements that belong to groups with only a couple of members.

In [3]: dff = pd.DataFrame({'A': np.arange(8), 'B': list('aabbbbcc')})

In [4]: dff.groupby('B').filter(lambda x: len(x) > 2)
Out[4]: 
   A  B
2  2  b
3  3  b
4  4  b
5  5  b

Alternatively, instead of dropping the offending groups, we can return a like-indexed object in which the rows of groups that do not pass the filter are filled with NaNs.

In [5]: dff.groupby('B').filter(lambda x: len(x) > 2, dropna=False)
Out[5]: 
     A    B
0  NaN  NaN
1  NaN  NaN
2  2.0    b
3  3.0    b
4  4.0    b
5  5.0    b
6  NaN  NaN
7  NaN  NaN
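
A similar NaN-filled, like-indexed result can be built by hand, which may help show what the output above represents. This is a sketch under the assumption that the group size is the criterion; group_sizes, keep and masked are illustrative names, and the approach is not claimed to be how filter works internally.

```python
import numpy as np
import pandas as pd

dff = pd.DataFrame({"A": np.arange(8), "B": list("aabbbbcc")})

# Size of each row's group, aligned with the original index.
group_sizes = dff.groupby("B")["A"].transform("size")

# True for rows whose group has more than two members.
keep = group_sizes > 2

# Rows that fail the test are kept in place but filled with NaN.
masked = dff.where(keep)

# Dropping the all-NaN rows recovers the plain filtered result.
dropped = masked.dropna()
```

Note that column A becomes floating point in masked, as in the output above, because NaN cannot be stored in an integer column.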

For DataFrames with multiple columns, filters should explicitly specify a column as the filter criterion.

In [6]: dff['C'] = np.arange(8)

In [7]: dff.groupby('B').filter(lambda x: len(x['C']) > 2)
Out[7]: 
   A  B  C
2  2  b  2
3  3  b  3
4  4  b  4
5  5  b  5
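
Selecting a column inside the function matters most when the criterion is a statistic rather than a length. As a hedged illustration, the sketch below keeps the groups whose mean of column C is greater than 3; the threshold and the name high_c are chosen only for this example.

```python
import numpy as np
import pandas as pd

dff = pd.DataFrame({"A": np.arange(8), "B": list("aabbbbcc")})
dff["C"] = np.arange(8)

# Group means of C are 0.5 for 'a', 3.5 for 'b' and 6.5 for 'c',
# so groups 'b' and 'c' pass the test and group 'a' is dropped.
high_c = dff.groupby("B").filter(lambda x: x["C"].mean() > 3)
```

Naming the column makes the function return a single boolean for the group, which is what filter expects.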

Note

Some functions, when applied to a groupby object, act as a filter on the input: they return a reduced subset of the original rows (potentially eliminating whole groups) while leaving the index of the surviving rows unchanged. Passing as_index=False does not affect these methods.

For example: head, tail.

In [8]: dff.groupby('B').head(2)
Out[8]: 
   A  B  C
0  0  a  0
1  1  a  1
2  2  b  2
3  3  b  3
6  6  c  6
7  7  c  7
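
The behaviour described in the note can be checked with a short sketch. The variable names are illustrative; the frame is rebuilt here so the example is self-contained.

```python
import numpy as np
import pandas as pd

dff = pd.DataFrame({"A": np.arange(8), "B": list("aabbbbcc"), "C": np.arange(8)})

# First two rows of each group, returned with their original index labels.
first_two = dff.groupby("B").head(2)

# Last row of each group, again with the original index labels.
last_one = dff.groupby("B").tail(1)

# as_index=False makes no difference to the result of head.
first_two_flat = dff.groupby("B", as_index=False).head(2)
```

Here first_two keeps index labels 0, 1, 2, 3, 6 and 7, and last_one keeps labels 1, 5 and 7, so the rows can still be matched against the original frame.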