5.2 Splitting an object into groups
pandas objects can be split on any of their axes. The abstract definition of grouping is to provide a mapping of labels to group names. To create a GroupBy object (more on what the GroupBy object is later), you do the following:
>>> # default is axis=0
>>> grouped = obj.groupby(key)
>>> grouped = obj.groupby(key, axis=1)
>>> grouped = obj.groupby([key1, key2])
The mapping can be specified many different ways:
- A Python function, to be called on each of the axis labels
- A list or NumPy array of the same length as the selected axis
- A dict or Series, providing a
label -> group name
mapping- For DataFrame objects, a string indicating a column to be used to group. Of course
df.groupby('A')
is just syntactic sugar fordf.groupby(df['A'])
, but it makes life simpler- A list of any of the above things
Collectively we refer to the grouping objects as the keys. For example, consider the following DataFrame:
In [1]: df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
...: 'foo', 'bar', 'foo', 'foo'],
...: 'B' : ['one', 'one', 'two', 'three',
...: 'two', 'two', 'one', 'three'],
...: 'C' : np.random.randn(8),
...: 'D' : np.random.randn(8)})
...:
In [2]: df
Out[2]:
A B C D
0 foo one 0.4691 -0.8618
1 bar one -0.2829 -2.1046
2 foo two -1.5091 -0.4949
3 bar three -1.1356 1.0718
4 foo two 1.2121 0.7216
5 bar two -0.1732 -0.7068
6 foo one 0.1192 -1.0396
7 foo three -1.0442 0.2719
We could naturally group by either the A
or B
columns or both:
In [3]: grouped = df.groupby('A')
In [4]: grouped = df.groupby(['A', 'B'])
These will split the DataFrame on its index (rows). We could also split by the columns:
In [5]: def get_letter_type(letter):
...: if letter.lower() in 'aeiou':
...: return 'vowel'
...: else:
...: return 'consonant'
...:
In [6]: grouped = df.groupby(get_letter_type, axis=1)
Starting with 0.8, pandas Index objects now support duplicate values. If a non-unique index is used as the group key in a groupby operation, all values for the same index value will be considered to be in one group and thus the output of aggregation functions will only contain unique index values:
In [7]: lst = [1, 2, 3, 1, 2, 3]
In [8]: s = pd.Series([1, 2, 3, 10, 20, 30], lst)
In [9]: grouped = s.groupby(level=0)
In [10]: grouped.first()
Out[10]:
1 1
2 2
3 3
dtype: int64
In [11]: grouped.last()
Out[11]:
1 10
2 20
3 30
dtype: int64
In [12]: grouped.sum()
Out[12]:
1 11
2 22
3 33
dtype: int64
Note that no splitting occurs until it’s needed. Creating the GroupBy object only verifies that you’ve passed a valid mapping.
Note
Many kinds of complicated data manipulations can be expressed in terms of GroupBy operations (though can’t be guaranteed to be the most efficient). You can get quite creative with the label mapping functions.
5.2.1 GroupBy sorting
By default the group keys are sorted during the groupby
operation. You may however pass sort=False
for potential speedups:
In [13]: df2 = pd.DataFrame({'X' : ['B', 'B', 'A', 'A'], 'Y' : [1, 2, 3, 4]})
In [14]: df2.groupby(['X']).sum()
Out[14]:
Y
X
A 7
B 3
In [15]: df2.groupby(['X'], sort=False).sum()
Out[15]:
Y
X
B 3
A 7
Note that groupby
will preserve the order in which observations are sorted within each group.
For example, the groups created by groupby()
below are in the order they appeared in the original DataFrame
:
In [16]: df3 = pd.DataFrame({'X' : ['A', 'B', 'A', 'B'], 'Y' : [1, 4, 3, 2]})
In [17]: df3.groupby(['X']).get_group('A')
Out[17]:
X Y
0 A 1
2 A 3
In [18]: df3.groupby(['X']).get_group('B')
Out[18]:
X Y
1 B 4
3 B 2
5.2.2 GroupBy object attributes
The groups
attribute is a dict whose keys are the computed unique groups
and corresponding values being the axis labels belonging to each group. In the
above example we have:
In [19]: df.groupby('A').groups
Out[19]: {'bar': [1, 3, 5], 'foo': [0, 2, 4, 6, 7]}
In [20]: df.groupby(get_letter_type, axis=1).groups
Out[20]: {'consonant': ['B', 'C', 'D'], 'vowel': ['A']}
Calling the standard Python len
function on the GroupBy object just returns
the length of the groups
dict, so it is largely just a convenience:
In [21]: grouped = df.groupby(['A', 'B'])
In [22]: grouped.groups
Out[22]:
{('bar', 'one'): [1],
('bar', 'three'): [3],
('bar', 'two'): [5],
('foo', 'one'): [0, 6],
('foo', 'three'): [7],
('foo', 'two'): [2, 4]}
In [23]: len(grouped)
Out[23]: 6
GroupBy
will tab complete column names (and other attributes)
In [24]: df
Out[24]:
gender height weight
2000-01-01 male 42.8500 157.5006
2000-01-02 male 49.6073 177.3404
2000-01-03 male 56.2935 171.5246
2000-01-04 female 48.4211 144.2520
2000-01-05 male 46.5569 152.5262
2000-01-06 female 68.4489 168.2730
2000-01-07 male 70.7577 136.4315
2000-01-08 female 58.9095 176.4998
2000-01-09 female 76.4356 174.0941
2000-01-10 male 45.3061 177.5409
In [25]: gb = df.groupby('gender')
In [26]: gb.<TAB>
gb.agg gb.boxplot gb.cummin gb.describe gb.filter gb.get_group gb.height gb.last gb.median gb.ngroups gb.plot gb.rank gb.std gb.transform
gb.aggregate gb.count gb.cumprod gb.dtype gb.first gb.groups gb.hist gb.max gb.min gb.nth gb.prod gb.resample gb.sum gb.var
gb.apply gb.cummax gb.cumsum gb.fillna gb.gender gb.head gb.indices gb.mean gb.name gb.ohlc gb.quantile gb.size gb.tail gb.weight
5.2.3 GroupBy with MultiIndex
With hierarchically-indexed data, it’s quite natural to group by one of the levels of the hierarchy.
Let’s create a Series with a two-level MultiIndex
.
In [27]: arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
....: ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
....:
In [28]: index = pd.MultiIndex.from_arrays(arrays, names=['first', 'second'])
In [29]: s = pd.Series(np.random.randn(8), index=index)
In [30]: s
Out[30]:
first second
bar one -0.5752
two 0.2542
baz one -1.1437
two 0.2159
foo one 1.1936
two -0.0771
qux one -0.4085
two -0.8625
dtype: float64
We can then group by one of the levels in s
.
In [31]: grouped = s.groupby(level=0)
In [32]: grouped.sum()
Out[32]:
first
bar -0.3211
baz -0.9278
foo 1.1164
qux -1.2710
dtype: float64
If the MultiIndex has names specified, these can be passed instead of the level number:
In [33]: s.groupby(level='second').sum()
Out[33]:
second
one -0.9339
two -0.4696
dtype: float64
The aggregation functions such as sum
will take the level parameter
directly. Additionally, the resulting index will be named according to the
chosen level:
In [34]: s.sum(level='second')
Out[34]:
second
one -0.9339
two -0.4696
dtype: float64
Also as of v0.6, grouping with multiple levels is supported.
In [35]: s
Out[35]:
first second third
bar doo one 1.3461
two 1.5118
baz bee one 1.6271
two -0.9906
foo bop one -0.4417
two 1.2115
qux bop one 0.2685
two 0.0246
dtype: float64
In [36]: s.groupby(level=['first', 'second']).sum()
Out[36]:
first second
bar doo 2.8578
baz bee 0.6365
foo bop 0.7699
qux bop 0.2931
dtype: float64
More on the sum
function and aggregation later.
5.2.4 DataFrame column selection in GroupBy
Once you have created the GroupBy object from a DataFrame, for example, you
might want to do something different for each of the columns. Thus, using
[]
similar to getting a column from a DataFrame, you can do:
In [37]: grouped = df.groupby(['A'])
In [38]: grouped_C = grouped['C']
In [39]: grouped_D = grouped['D']
This is mainly syntactic sugar for the alternative and much more verbose:
In [40]: df['C'].groupby(df['A'])
Out[40]: <pandas.core.groupby.SeriesGroupBy object at 0x2b6c20009ed0>
Additionally this method avoids recomputing the internal grouping information derived from the passed key.