5.8 Dispatching to instance methods
When doing an aggregation or transformation, you often just want to call an instance method on each data group. One way to do this is to pass a lambda function:
In [1]: df
Out[1]:
A B C D
0 foo one 0.4691 -0.8618
1 bar one -0.2829 -2.1046
2 foo two -1.5091 -0.4949
3 bar three -1.1356 1.0718
4 foo two 1.2121 0.7216
5 bar two -0.1732 -0.7068
6 foo one 0.1192 -1.0396
7 foo three -1.0442 0.2719
In [2]: grouped = df.groupby('A')
In [3]: grouped.agg(lambda x: x.std())
Out[3]:
C D
A
bar 0.5269 1.5920
foo 1.1133 0.7532
However, this is verbose and becomes untidy when you need to pass additional arguments. Using a bit of metaprogramming, GroupBy can “dispatch” method calls to the groups:
In [4]: grouped.std()
Out[4]:
C D
A
bar 0.5269 1.5920
foo 1.1133 0.7532
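The point about additional arguments can be illustrated with a small sketch. It rebuilds the numeric columns of df from the rounded values printed above (column B is left out for brevity) and passes ddof=0 both ways; the variable names via_lambda and via_dispatch are only illustrative.

```python
import numpy as np
import pandas as pd

# Rebuilt from the rounded values shown in the example above (column B omitted).
df = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
    "C": [0.4691, -0.2829, -1.5091, -1.1356, 1.2121, -0.1732, 0.1192, -1.0442],
    "D": [-0.8618, -2.1046, -0.4949, 1.0718, 0.7216, -0.7068, -1.0396, 0.2719],
})
grouped = df.groupby("A")

# Lambda form: the extra argument has to be threaded through by hand.
via_lambda = grouped.agg(lambda x: x.std(ddof=0))

# Dispatched form: the argument is passed directly to the method.
via_dispatch = grouped.std(ddof=0)

# Both spellings should produce the same per-group values.
print(np.allclose(via_lambda.values, via_dispatch.values))
```

Both calls compute the population standard deviation per group; the dispatched form simply avoids the wrapper lambda.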
What actually happens here is that a function wrapper is generated. When invoked, it takes any passed arguments and calls the named method (std in the example above) with those arguments on each group. The results are then combined much in the style of agg and transform (it actually uses apply to infer the gluing, documented next). This enables some operations to be carried out rather succinctly:
In [5]: tsdf = pd.DataFrame(np.random.randn(1000, 3),
...: index=pd.date_range('1/1/2000', periods=1000),
...: columns=['A', 'B', 'C'])
...:
In [6]: tsdf.iloc[::2] = np.nan
In [7]: grouped = tsdf.groupby(lambda x: x.year)
In [8]: grouped.fillna(method='pad')
Out[8]:
A B C
2000-01-01 NaN NaN NaN
2000-01-02 -1.0874 -0.6737 0.1136
2000-01-03 -1.0874 -0.6737 0.1136
2000-01-04 0.5770 -1.7150 -1.0393
2000-01-05 0.5770 -1.7150 -1.0393
2000-01-06 0.8449 1.0758 -0.1090
2000-01-07 0.8449 1.0758 -0.1090
... ... ... ...
2002-09-20 -1.0644 -0.4991 -1.9149
2002-09-21 -1.0644 -0.4991 -1.9149
2002-09-22 -0.3688 0.3842 -1.4366
2002-09-23 -0.3688 0.3842 -1.4366
2002-09-24 -3.1013 -0.5137 -1.9194
2002-09-25 -3.1013 -0.5137 -1.9194
2002-09-26 1.8759 0.4622 0.6205
[1000 rows x 3 columns]
In this example, we chopped the collection of time series into yearly chunks, then called fillna independently on each group. The first row stays NaN because there is no earlier value within its group to pad from.
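The wrapper mechanism described above can be approximated with a toy class. This is a minimal sketch, not the actual pandas implementation: MiniGroupBy and its combining rule are hypothetical, and real GroupBy objects glue results together with more care (for example, they preserve the original index for transformations).

```python
import pandas as pd


class MiniGroupBy:
    """Hypothetical stand-in for a GroupBy object; not pandas internals."""

    def __init__(self, obj, key):
        # Split once and keep the pieces, keyed by group label.
        self._groups = {label: group for label, group in obj.groupby(key)}

    def __getattr__(self, name):
        # Reached only when normal attribute lookup fails.
        if name.startswith("_"):
            raise AttributeError(name)

        def wrapper(*args, **kwargs):
            # Call the named method on every group with the same arguments.
            results = {
                label: getattr(group, name)(*args, **kwargs)
                for label, group in self._groups.items()
            }
            first = next(iter(results.values()))
            if isinstance(first, (pd.Series, pd.DataFrame)):
                # Per-group objects: stack them under their group labels.
                return pd.concat(results)
            # Per-group scalars: one value per group label.
            return pd.Series(results)

        return wrapper


s = pd.Series([1.0, 2.0, 4.0, 8.0], index=list("wxyz"))
keys = pd.Series(["a", "a", "b", "b"], index=s.index)

mini = MiniGroupBy(s, keys)
toy_std = mini.std(ddof=0)    # reduction: one value per group
toy_cumsum = mini.cumsum()    # non-reducing: one Series per group, stacked
real_std = s.groupby(keys).std(ddof=0)

print(toy_std)
print(real_std)
```

The toy version agrees with pandas for the reduction. For cumsum it returns a result indexed by (group label, original label), whereas pandas returns the original index, which is one reason the real gluing logic is more involved.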
New in version 0.14.1.
The nlargest and nsmallest methods work on Series-style groupbys:
In [9]: s = pd.Series([9, 8, 7, 5, 19, 1, 4.2, 3.3])
In [10]: g = pd.Series(list('abababab'))
In [11]: gb = s.groupby(g)
In [12]: gb.nlargest(3)
Out[12]:
a 4 19.0
0 9.0
2 7.0
b 1 8.0
3 5.0
7 3.3
dtype: float64
In [13]: gb.nsmallest(3)
Out[13]:
a 6 4.2
2 7.0
0 9.0
b 5 1.0
7 3.3
3 5.0
dtype: float64
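As a rough check of what the dispatched call does, the sketch below compares gb.nlargest(3) with the explicit apply-plus-lambda spelling on the same data. The names dispatched and explicit are only illustrative, and the comparison assumes a pandas version that provides nlargest on Series groupbys.

```python
import pandas as pd

s = pd.Series([9, 8, 7, 5, 19, 1, 4.2, 3.3])
g = pd.Series(list("abababab"))
gb = s.groupby(g)

# Dispatched form, as in the example above.
dispatched = gb.nlargest(3)

# Explicit form: call Series.nlargest on each group through apply.
explicit = gb.apply(lambda x: x.nlargest(3))

# Both should carry the same values under the same (group, original index) labels.
print(dispatched.tolist() == explicit.tolist())
print(list(dispatched.index) == list(explicit.index))
```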