5.5 Aggregation
Once the GroupBy object has been created, several methods are available to perform a computation on the grouped data.
An obvious one is aggregation via the aggregate
or equivalently agg
method:
In [1]: df
Out[1]:
A B C D
0 foo one 0.4691 -0.8618
1 bar one -0.2829 -2.1046
2 foo two -1.5091 -0.4949
3 bar three -1.1356 1.0718
4 foo two 1.2121 0.7216
5 bar two -0.1732 -0.7068
6 foo one 0.1192 -1.0396
7 foo three -1.0442 0.2719
In [2]: grouped = df.groupby('A')
In [3]: grouped.aggregate(np.sum)
Out[3]:
C D
A
bar -1.5917 -1.7395
foo -0.7529 -1.4029
In [4]: grouped = df.groupby(['A', 'B'])
In [5]: grouped.aggregate(np.sum)
Out[5]:
C D
A B
bar one -0.2829 -2.1046
three -1.1356 1.0718
two -0.1732 -0.7068
foo one 0.5883 -1.9014
three -1.0442 0.2719
two -0.2969 0.2266
As you can see, the result of the aggregation will have the group names as the
new index along the grouped axis. In the case of multiple keys, the result is a
MultiIndex by default, though this can be
changed by using the as_index
option:
In [6]: grouped = df.groupby(['A', 'B'], as_index=False)
In [7]: grouped.aggregate(np.sum)
Out[7]:
A B C D
0 bar one -0.2829 -2.1046
1 bar three -1.1356 1.0718
2 bar two -0.1732 -0.7068
3 foo one 0.5883 -1.9014
4 foo three -1.0442 0.2719
5 foo two -0.2969 0.2266
In [8]: df.groupby('A', as_index=False).sum()
Out[8]:
A C D
0 bar -1.5917 -1.7395
1 foo -0.7529 -1.4029
Note that you could use the reset_index
DataFrame function to achieve the
same result as the column names are stored in the resulting MultiIndex
:
In [9]: df.groupby(['A', 'B']).sum().reset_index()
Out[9]:
A B C D
0 bar one -0.2829 -2.1046
1 bar three -1.1356 1.0718
2 bar two -0.1732 -0.7068
3 foo one 0.5883 -1.9014
4 foo three -1.0442 0.2719
5 foo two -0.2969 0.2266
Another simple aggregation example is to compute the size of each group.
This is included in GroupBy as the size
method. It returns a Series whose
index are the group names and whose values are the sizes of each group.
In [10]: grouped.size()
Out[10]:
A B
bar one 1
three 1
two 1
foo one 2
three 1
two 2
dtype: int64
In [11]: grouped.describe()
Out[11]:
C D
0 count 1.0000 1.0000
mean -0.2829 -2.1046
std NaN NaN
min -0.2829 -2.1046
25% -0.2829 -2.1046
50% -0.2829 -2.1046
75% -0.2829 -2.1046
... ... ...
5 mean -0.1485 0.1133
std 1.9242 0.8602
min -1.5091 -0.4949
25% -0.8288 -0.1908
50% -0.1485 0.1133
75% 0.5318 0.4174
max 1.2121 0.7216
[48 rows x 2 columns]
Note
Aggregation functions will not return the groups that you are aggregating over
if they are named columns, when as_index=True
, the default. The grouped columns will
be the indices of the returned object.
Passing as_index=False
will return the groups that you are aggregating over, if they are
named columns.
Aggregating functions are ones that reduce the dimension of the returned objects,
for example: mean, sum, size, count, std, var, sem, describe, first, last, nth, min, max
. This is
what happens when you do for example DataFrame.sum()
and get back a Series
.
nth
can act as a reducer or a filter, see here
5.5.1 Applying multiple functions at once
With grouped Series you can also pass a list or dict of functions to do aggregation with, outputting a DataFrame:
In [12]: grouped = df.groupby('A')
In [13]: grouped['C'].agg([np.sum, np.mean, np.std])
Out[13]:
sum mean std
A
bar -1.5917 -0.5306 0.5269
foo -0.7529 -0.1506 1.1133
If a dict is passed, the keys will be used to name the columns. Otherwise the function’s name (stored in the function object) will be used.
In [14]: grouped['D'].agg({'result1' : np.sum,
....: 'result2' : np.mean})
....:
Out[14]:
result2 result1
A
bar -0.5798 -1.7395
foo -0.2806 -1.4029
On a grouped DataFrame, you can pass a list of functions to apply to each column, which produces an aggregated result with a hierarchical index:
In [15]: grouped.agg([np.sum, np.mean, np.std])
Out[15]:
C D
sum mean std sum mean std
A
bar -1.5917 -0.5306 0.5269 -1.7395 -0.5798 1.5920
foo -0.7529 -0.1506 1.1133 -1.4029 -0.2806 0.7532
Passing a dict of functions has different behavior by default, see the next section.
5.5.2 Applying different functions to DataFrame columns
By passing a dict to aggregate
you can apply a different aggregation to the
columns of a DataFrame:
In [16]: grouped.agg({'C' : np.sum,
....: 'D' : lambda x: np.std(x, ddof=1)})
....:
Out[16]:
C D
A
bar -1.5917 1.5920
foo -0.7529 0.7532
The function names can also be strings. In order for a string to be valid it must be either implemented on GroupBy or available via dispatching:
In [17]: grouped.agg({'C' : 'sum', 'D' : 'std'})
Out[17]:
C D
A
bar -1.5917 1.5920
foo -0.7529 0.7532
Note
If you pass a dict to aggregate
, the ordering of the output colums is
non-deterministic. If you want to be sure the output columns will be in a specific
order, you can use an OrderedDict
. Compare the output of the following two commands:
In [18]: grouped.agg({'D': 'std', 'C': 'mean'})
Out[18]:
C D
A
bar -0.5306 1.5920
foo -0.1506 0.7532
In [19]: grouped.agg(OrderedDict([('D', 'std'), ('C', 'mean')]))
Out[19]:
D C
A
bar 1.5920 -0.5306
foo 0.7532 -0.1506
5.5.3 Cython-optimized aggregation functions
Some common aggregations, currently only sum
, mean
, std
, and sem
, have
optimized Cython implementations:
In [20]: df.groupby('A').sum()
Out[20]:
C D
A
bar -1.5917 -1.7395
foo -0.7529 -1.4029
In [21]: df.groupby(['A', 'B']).mean()
Out[21]:
C D
A B
bar one -0.2829 -2.1046
three -1.1356 1.0718
two -0.1732 -0.7068
foo one 0.2942 -0.9507
three -1.0442 0.2719
two -0.1485 0.1133
Of course sum
and mean
are implemented on pandas objects, so the above
code would work even without the special versions via dispatching (see below).