5.9 Flexible apply
Some operations on the grouped data might not fit into either the aggregate or
transform categories. Or, you may simply want GroupBy to infer how to combine
the results. For these, use the apply function, which can be substituted for
both aggregate and transform in many standard use cases. However, apply can
handle some exceptional use cases, for example:
In [1]: df
Out[1]:
     A      B       C       D
0  foo    one  0.4691 -0.8618
1  bar    one -0.2829 -2.1046
2  foo    two -1.5091 -0.4949
3  bar  three -1.1356  1.0718
4  foo    two  1.2121  0.7216
5  bar    two -0.1732 -0.7068
6  foo    one  0.1192 -1.0396
7  foo  three -1.0442  0.2719
In [2]: grouped = df.groupby('A')
# could also just call .describe()
In [3]: grouped['C'].apply(lambda x: x.describe())
Out[3]:
A
bar count 3.0000
mean -0.5306
std 0.5269
min -1.1356
25% -0.7092
50% -0.2829
75% -0.2280
...
foo mean -0.1506
std 1.1133
min -1.5091
25% -1.0442
50% 0.1192
75% 0.4691
max 1.2121
Name: C, dtype: float64
The dimension of the returned result can also change:
In [4]: grouped = df.groupby('A')['C']
In [5]: def f(group):
   ...:     return pd.DataFrame({'original': group,
   ...:                          'demeaned': group - group.mean()})
   ...:
In [6]: grouped.apply(f)
Out[6]:
   demeaned  original
0    0.6197    0.4691
1    0.2477   -0.2829
2   -1.3585   -1.5091
3   -0.6051   -1.1356
4    1.3627    1.2121
5    0.3574   -0.1732
6    0.2698    0.1192
7   -0.8937   -1.0442
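The session above can be reproduced as a standalone script. This is a minimal sketch: it rebuilds only columns A and C of df, and the group_keys=False argument is an assumption added here so that the result keeps the original row labels on recent pandas versions, which may otherwise prepend the group key as an extra index level.

```python
import pandas as pd

# Columns "A" and "C" copied from the df shown above.
df = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
    "C": [0.4691, -0.2829, -1.5091, -1.1356, 1.2121, -0.1732, 0.1192, -1.0442],
})


def f(group):
    # One Series goes in, a two-column DataFrame comes out.
    return pd.DataFrame({"original": group,
                         "demeaned": group - group.mean()})


# group_keys=False (an assumption of this sketch) keeps the original index.
result = df.groupby("A", group_keys=False)["C"].apply(f)

print(type(result).__name__)
print(result.sort_index())
```

Each group is passed to f as a Series, and the per-group DataFrames are combined into a single DataFrame, so the result has one more dimension than the input.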
When the function passed to apply on a Series itself returns a Series, apply combines the returned values and may upcast the result to a DataFrame:
In [7]: def f(x):
   ...:     return pd.Series([x, x**2], index=['x', 'x^2'])
   ...:
In [8]: s
Out[8]:
0 9.0
1 8.0
2 7.0
3 5.0
4 19.0
5 1.0
6 4.2
7 3.3
dtype: float64
In [9]: s.apply(f)
Out[9]:
      x     x^2
0   9.0   81.00
1   8.0   64.00
2   7.0   49.00
3   5.0   25.00
4  19.0  361.00
5   1.0    1.00
6   4.2   17.64
7   3.3   10.89
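As a runnable sketch of the same idea, the script below builds s explicitly and compares the apply result with a column-wise construction. The vectorized variable and the comparison are illustrative additions, not part of the original example.

```python
import pandas as pd

s = pd.Series([9.0, 8.0, 7.0, 5.0, 19.0, 1.0, 4.2, 3.3])


def f(x):
    # Called once per element; returns a labelled Series for that element.
    return pd.Series([x, x ** 2], index=["x", "x^2"])


# Each returned Series becomes one row of the resulting DataFrame.
out = s.apply(f)

# Illustrative alternative: build the same columns without a per-element call.
vectorized = pd.DataFrame({"x": s, "x^2": s ** 2})

max_diff = (out - vectorized).abs().max().max()
print(out)
print(max_diff < 1e-9)
```

Constructing a Series for every element is comparatively slow, so the column-wise form is usually preferable when the computation can be expressed on the whole Series at once.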
Note
apply can act as a reducer, transformer, or filter function, depending on exactly what is passed to it.
Depending on the path taken and on exactly what you are grouping, the grouped column(s) may be included in
the output and may also be used to set the indices.
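The three roles can be sketched on a small, made-up frame. The data and the lambda functions below are hypothetical, and group_keys=False is an assumption chosen so that the transformer and filter results keep the original row labels across pandas versions.

```python
import pandas as pd

# Hypothetical data, not the df used elsewhere in this section.
df = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar"],
    "C": [1.0, 2.0, 3.0, 6.0],
})
grouped = df.groupby("A", group_keys=False)["C"]

# Reducer: one scalar per group, so the result is indexed by the group key.
reduced = grouped.apply(lambda x: x.max() - x.min())

# Transformer: a Series of the same length per group, indexed like the input.
transformed = grouped.apply(lambda x: x - x.mean())

# Filter: a subset of each group's rows, keeping their original labels.
filtered = grouped.apply(lambda x: x[x > x.mean()])

print(reduced)
print(transformed.sort_index())
print(filtered.sort_index())
```

The shape and index of each result follow from what the function returns for a group, which is why the same call can yield output indexed by group key in one case and by the original rows in another.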
Warning
In the current implementation, apply calls func twice on the first group to decide whether it can take a fast or slow code path. This can lead to unexpected behavior if func has side effects, as they will take effect twice for the first group.
In [10]: d = pd.DataFrame({"a":["x", "y"], "b":[1,2]})
In [11]: def identity(df):
   ....:     print(df)
   ....:     return df
   ....:
In [12]: d.groupby("a").apply(identity)
a b
0 x 1
a b
0 x 1
a b
1 y 2
Out[12]:
a b
0 x 1
1 y 2
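One way to stay clear of this is to keep func free of side effects and to perform any per-group bookkeeping in an explicit loop. The sketch below is illustrative: calls, seen, and total_b are hypothetical names, and whether the first group is actually evaluated twice depends on the installed pandas version, so the code only assumes that func runs at least once per group.

```python
import pandas as pd

d = pd.DataFrame({"a": ["x", "y"], "b": [1, 2]})

calls = []  # hypothetical log, used only to observe how often func runs


def total_b(group):
    # Side effect: may be recorded more than once for the first group.
    calls.append(group["b"].iloc[0])
    return group["b"].sum()


# Selecting [["b"]] passes only column "b" of each group to the function.
result = d.groupby("a")[["b"]].apply(total_b)
print(len(calls))

# Explicit iteration visits every group exactly once, so bookkeeping done
# here is not repeated.
seen = []
for key, group in d.groupby("a"):
    seen.append(key)
print(seen)
```

The returned values are the same no matter how many times func is called, so only code that relies on the side effect is affected.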