In [1]: import numpy as np
In [2]: np.random.seed(123456)
In [3]: np.set_printoptions(precision=4, suppress=True)
In [4]: import pandas as pd
In [5]: pd.options.display.max_rows = 8
In [6]: import matplotlib
In [7]: matplotlib.style.use('ggplot')
In [8]: import matplotlib.pyplot as plt
In [9]: plt.close('all')
In [10]: from collections import OrderedDict
5.1 Introduction
By “group by” we are referring to a process involving one or more of the following steps:
- Splitting the data into groups based on some criteria
- Applying a function to each group independently
- Combining the results into a data structure
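As a rough illustration of these three steps, the sketch below performs them by hand on a small, made-up frame and then compares the result with the single built-in call. The frame and its column names (`key`, `value`) are hypothetical and chosen only for this example.

```python
import pandas as pd

# Hypothetical example data; the column names are illustrative only.
df = pd.DataFrame({
    "key": ["a", "b", "a", "b", "a"],
    "value": [1, 2, 3, 4, 5],
})

# Split: iterating over a GroupBy object yields (group name, sub-frame) pairs.
pieces = {name: group for name, group in df.groupby("key")}

# Apply: run a function on each piece independently.
sums = {name: group["value"].sum() for name, group in pieces.items()}

# Combine: assemble the per-group results into a single Series.
manual = pd.Series(sums, name="value")

# The built-in path performs all three steps in one call.
builtin = df.groupby("key")["value"].sum()
```

Doing the steps manually is rarely necessary, but it shows that each group is an ordinary sub-frame that you can work with yourself.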
Of these, the split step is the most straightforward. In fact, in many situations you may wish to split the data set into groups and do something with those groups yourself. In the apply step, we might wish to do one of the following:
Aggregation: computing a summary statistic (or statistics) about each group. Some examples:
- Compute group sums or means
- Compute group sizes / counts
Transformation: perform some group-specific computations and return a like-indexed object. Some examples:
- Standardizing data (zscore) within group
- Filling NAs within groups with a value derived from each group
Filtration: discard some groups, according to a group-wise computation that evaluates True or False. Some examples:
- Discarding data that belongs to groups with only a few members
- Filtering out data based on the group sum or mean
Some combination of the above: GroupBy will examine the results of the apply step and try to return a sensibly combined result if it doesn’t fit into any of the above three categories.
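The kinds of apply step listed above can be sketched on a small, made-up frame. The data and the names `key` and `value` are assumptions for illustration; the lambdas are one possible way to express each computation, not the only one.

```python
import numpy as np
import pandas as pd

# Hypothetical data; names are illustrative only.
df = pd.DataFrame({
    "key": ["a", "a", "a", "b", "b", "c"],
    "value": [1.0, 2.0, 3.0, 10.0, np.nan, 7.0],
})
grouped = df.groupby("key")["value"]

# Aggregation: one summary row per group.
summary = grouped.agg(["sum", "mean", "count"])

# Transformation: the result has the same index as the input.
# Standardize within each group (z-score, sample standard deviation).
zscore = grouped.transform(lambda x: (x - x.mean()) / x.std())
# Fill missing values with a value derived from each group (here, its mean).
filled = grouped.transform(lambda x: x.fillna(x.mean()))

# Filtration: keep only the rows of groups with at least two members.
kept = df.groupby("key").filter(lambda g: len(g) >= 2)
```

Note the difference in shape: `summary` has one row per group, `zscore` and `filled` have one entry per original row, and `kept` is a subset of the original rows.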
Since the set of object instance methods on pandas data structures is generally
rich and expressive, we often simply want to invoke, say, a DataFrame function
on each group. The name GroupBy should be quite familiar to those who have used
a SQL-based tool (or itertools), in which you can write code like:
SELECT Column1, Column2, mean(Column3), sum(Column4)
FROM SomeTable
GROUP BY Column1, Column2
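A rough pandas counterpart of the query above might look like the following sketch. The contents of `some_table` are made up for illustration; only the column names come from the SQL statement.

```python
import pandas as pd

# Hypothetical stand-in for SomeTable; the values are invented.
some_table = pd.DataFrame({
    "Column1": ["x", "x", "y", "y"],
    "Column2": ["p", "p", "q", "r"],
    "Column3": [1.0, 3.0, 5.0, 7.0],
    "Column4": [10, 20, 30, 40],
})

# Group by both key columns, then take the mean of Column3 and the
# sum of Column4. as_index=False keeps the keys as ordinary columns,
# which mirrors the shape of the SQL result set.
result = (
    some_table
    .groupby(["Column1", "Column2"], as_index=False)
    .agg({"Column3": "mean", "Column4": "sum"})
)
```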
We aim to make operations like this natural and easy to express using pandas. We’ll address each area of GroupBy functionality and then provide some non-trivial examples and use cases.
See the cookbook for some advanced strategies.