5.5 Aggregation

Once the GroupBy object has been created, several methods are available to perform a computation on the grouped data.

An obvious one is aggregation via the aggregate or equivalently agg method:

In [1]: df
Out[1]: 
     A      B       C       D
0  foo    one  0.4691 -0.8618
1  bar    one -0.2829 -2.1046
2  foo    two -1.5091 -0.4949
3  bar  three -1.1356  1.0718
4  foo    two  1.2121  0.7216
5  bar    two -0.1732 -0.7068
6  foo    one  0.1192 -1.0396
7  foo  three -1.0442  0.2719

In [2]: grouped = df.groupby('A')

In [3]: grouped.aggregate(np.sum)
Out[3]: 
          C       D
A                  
bar -1.5917 -1.7395
foo -0.7529 -1.4029

In [4]: grouped = df.groupby(['A', 'B'])

In [5]: grouped.aggregate(np.sum)
Out[5]: 
                C       D
A   B                    
bar one   -0.2829 -2.1046
    three -1.1356  1.0718
    two   -0.1732 -0.7068
foo one    0.5883 -1.9014
    three -1.0442  0.2719
    two   -0.2969  0.2266

As you can see, the result of the aggregation will have the group names as the new index along the grouped axis. In the case of multiple keys, the result is a MultiIndex by default, though this can be changed by using the as_index option:

In [6]: grouped = df.groupby(['A', 'B'], as_index=False)

In [7]: grouped.aggregate(np.sum)
Out[7]: 
     A      B       C       D
0  bar    one -0.2829 -2.1046
1  bar  three -1.1356  1.0718
2  bar    two -0.1732 -0.7068
3  foo    one  0.5883 -1.9014
4  foo  three -1.0442  0.2719
5  foo    two -0.2969  0.2266

In [8]: df.groupby('A', as_index=False).sum()
Out[8]: 
     A       C       D
0  bar -1.5917 -1.7395
1  foo -0.7529 -1.4029

Note that you could use the reset_index DataFrame function to achieve the same result as the column names are stored in the resulting MultiIndex:

In [9]: df.groupby(['A', 'B']).sum().reset_index()
Out[9]: 
     A      B       C       D
0  bar    one -0.2829 -2.1046
1  bar  three -1.1356  1.0718
2  bar    two -0.1732 -0.7068
3  foo    one  0.5883 -1.9014
4  foo  three -1.0442  0.2719
5  foo    two -0.2969  0.2266

Another simple aggregation example is to compute the size of each group. This is included in GroupBy as the size method. It returns a Series whose index are the group names and whose values are the sizes of each group.

In [10]: grouped.size()
Out[10]: 
A    B    
bar  one      1
     three    1
     two      1
foo  one      2
     three    1
     two      2
dtype: int64
In [11]: grouped.describe()
Out[11]: 
              C       D
0 count  1.0000  1.0000
  mean  -0.2829 -2.1046
  std       NaN     NaN
  min   -0.2829 -2.1046
  25%   -0.2829 -2.1046
  50%   -0.2829 -2.1046
  75%   -0.2829 -2.1046
...         ...     ...
5 mean  -0.1485  0.1133
  std    1.9242  0.8602
  min   -1.5091 -0.4949
  25%   -0.8288 -0.1908
  50%   -0.1485  0.1133
  75%    0.5318  0.4174
  max    1.2121  0.7216

[48 rows x 2 columns]

Note

Aggregation functions will not return the groups that you are aggregating over if they are named columns, when as_index=True, the default. The grouped columns will be the indices of the returned object.

Passing as_index=False will return the groups that you are aggregating over, if they are named columns.

Aggregating functions are ones that reduce the dimension of the returned objects, for example: mean, sum, size, count, std, var, sem, describe, first, last, nth, min, max. This is what happens when you do for example DataFrame.sum() and get back a Series.

nth can act as a reducer or a filter, see here

5.5.1 Applying multiple functions at once

With grouped Series you can also pass a list or dict of functions to do aggregation with, outputting a DataFrame:

In [12]: grouped = df.groupby('A')

In [13]: grouped['C'].agg([np.sum, np.mean, np.std])
Out[13]: 
        sum    mean     std
A                          
bar -1.5917 -0.5306  0.5269
foo -0.7529 -0.1506  1.1133

If a dict is passed, the keys will be used to name the columns. Otherwise the function’s name (stored in the function object) will be used.

In [14]: grouped['D'].agg({'result1' : np.sum,
   ....:                   'result2' : np.mean})
   ....: 
Out[14]: 
     result2  result1
A                    
bar  -0.5798  -1.7395
foo  -0.2806  -1.4029

On a grouped DataFrame, you can pass a list of functions to apply to each column, which produces an aggregated result with a hierarchical index:

In [15]: grouped.agg([np.sum, np.mean, np.std])
Out[15]: 
          C                       D                
        sum    mean     std     sum    mean     std
A                                                  
bar -1.5917 -0.5306  0.5269 -1.7395 -0.5798  1.5920
foo -0.7529 -0.1506  1.1133 -1.4029 -0.2806  0.7532

Passing a dict of functions has different behavior by default, see the next section.

5.5.2 Applying different functions to DataFrame columns

By passing a dict to aggregate you can apply a different aggregation to the columns of a DataFrame:

In [16]: grouped.agg({'C' : np.sum,
   ....:              'D' : lambda x: np.std(x, ddof=1)})
   ....: 
Out[16]: 
          C       D
A                  
bar -1.5917  1.5920
foo -0.7529  0.7532

The function names can also be strings. In order for a string to be valid it must be either implemented on GroupBy or available via dispatching:

In [17]: grouped.agg({'C' : 'sum', 'D' : 'std'})
Out[17]: 
          C       D
A                  
bar -1.5917  1.5920
foo -0.7529  0.7532

Note

If you pass a dict to aggregate, the ordering of the output colums is non-deterministic. If you want to be sure the output columns will be in a specific order, you can use an OrderedDict. Compare the output of the following two commands:

In [18]: grouped.agg({'D': 'std', 'C': 'mean'})
Out[18]: 
          C       D
A                  
bar -0.5306  1.5920
foo -0.1506  0.7532

In [19]: grouped.agg(OrderedDict([('D', 'std'), ('C', 'mean')]))
Out[19]: 
          D       C
A                  
bar  1.5920 -0.5306
foo  0.7532 -0.1506

5.5.3 Cython-optimized aggregation functions

Some common aggregations, currently only sum, mean, std, and sem, have optimized Cython implementations:

In [20]: df.groupby('A').sum()
Out[20]: 
          C       D
A                  
bar -1.5917 -1.7395
foo -0.7529 -1.4029

In [21]: df.groupby(['A', 'B']).mean()
Out[21]: 
                C       D
A   B                    
bar one   -0.2829 -2.1046
    three -1.1356  1.0718
    two   -0.1732 -0.7068
foo one    0.2942 -0.9507
    three -1.0442  0.2719
    two   -0.1485  0.1133

Of course sum and mean are implemented on pandas objects, so the above code would work even without the special versions via dispatching (see below).