5.2 Splitting an object into groups

pandas objects can be split on any of their axes. The abstract definition of grouping is to provide a mapping of labels to group names. To create a GroupBy object (more on what the GroupBy object is later), you do the following:

>>> # default is axis=0
>>> grouped = obj.groupby(key)
>>> grouped = obj.groupby(key, axis=1)
>>> grouped = obj.groupby([key1, key2])

The mapping can be specified many different ways:

  • A Python function, to be called on each of the axis labels
  • A list or NumPy array of the same length as the selected axis
  • A dict or Series, providing a label -> group name mapping
  • For DataFrame objects, a string indicating a column to be used to group. Of course df.groupby('A') is just syntactic sugar for df.groupby(df['A']), but it makes life simpler
  • A list of any of the above things

Collectively we refer to the grouping objects as the keys. For example, consider the following DataFrame:

In [1]: df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
   ...:                           'foo', 'bar', 'foo', 'foo'],
   ...:                    'B' : ['one', 'one', 'two', 'three',
   ...:                           'two', 'two', 'one', 'three'],
   ...:                    'C' : np.random.randn(8),
   ...:                    'D' : np.random.randn(8)})
   ...: 

In [2]: df
Out[2]: 
     A      B       C       D
0  foo    one  0.4691 -0.8618
1  bar    one -0.2829 -2.1046
2  foo    two -1.5091 -0.4949
3  bar  three -1.1356  1.0718
4  foo    two  1.2121  0.7216
5  bar    two -0.1732 -0.7068
6  foo    one  0.1192 -1.0396
7  foo  three -1.0442  0.2719

We could naturally group by either the A or B columns or both:

In [3]: grouped = df.groupby('A')

In [4]: grouped = df.groupby(['A', 'B'])

These will split the DataFrame on its index (rows). We could also split by the columns:

In [5]: def get_letter_type(letter):
   ...:     if letter.lower() in 'aeiou':
   ...:         return 'vowel'
   ...:     else:
   ...:         return 'consonant'
   ...: 

In [6]: grouped = df.groupby(get_letter_type, axis=1)

Starting with 0.8, pandas Index objects now support duplicate values. If a non-unique index is used as the group key in a groupby operation, all values for the same index value will be considered to be in one group and thus the output of aggregation functions will only contain unique index values:

In [7]: lst = [1, 2, 3, 1, 2, 3]

In [8]: s = pd.Series([1, 2, 3, 10, 20, 30], lst)

In [9]: grouped = s.groupby(level=0)

In [10]: grouped.first()
Out[10]: 
1    1
2    2
3    3
dtype: int64

In [11]: grouped.last()
Out[11]: 
1    10
2    20
3    30
dtype: int64

In [12]: grouped.sum()
Out[12]: 
1    11
2    22
3    33
dtype: int64

Note that no splitting occurs until it’s needed. Creating the GroupBy object only verifies that you’ve passed a valid mapping.

Note

Many kinds of complicated data manipulations can be expressed in terms of GroupBy operations (though can’t be guaranteed to be the most efficient). You can get quite creative with the label mapping functions.

5.2.1 GroupBy sorting

By default the group keys are sorted during the groupby operation. You may however pass sort=False for potential speedups:

In [13]: df2 = pd.DataFrame({'X' : ['B', 'B', 'A', 'A'], 'Y' : [1, 2, 3, 4]})

In [14]: df2.groupby(['X']).sum()
Out[14]: 
   Y
X   
A  7
B  3

In [15]: df2.groupby(['X'], sort=False).sum()
Out[15]: 
   Y
X   
B  3
A  7

Note that groupby will preserve the order in which observations are sorted within each group. For example, the groups created by groupby() below are in the order they appeared in the original DataFrame:

In [16]: df3 = pd.DataFrame({'X' : ['A', 'B', 'A', 'B'], 'Y' : [1, 4, 3, 2]})

In [17]: df3.groupby(['X']).get_group('A')
Out[17]: 
   X  Y
0  A  1
2  A  3

In [18]: df3.groupby(['X']).get_group('B')
Out[18]: 
   X  Y
1  B  4
3  B  2

5.2.2 GroupBy object attributes

The groups attribute is a dict whose keys are the computed unique groups and corresponding values being the axis labels belonging to each group. In the above example we have:

In [19]: df.groupby('A').groups
Out[19]: {'bar': [1, 3, 5], 'foo': [0, 2, 4, 6, 7]}

In [20]: df.groupby(get_letter_type, axis=1).groups
Out[20]: {'consonant': ['B', 'C', 'D'], 'vowel': ['A']}

Calling the standard Python len function on the GroupBy object just returns the length of the groups dict, so it is largely just a convenience:

In [21]: grouped = df.groupby(['A', 'B'])

In [22]: grouped.groups
Out[22]: 
{('bar', 'one'): [1],
 ('bar', 'three'): [3],
 ('bar', 'two'): [5],
 ('foo', 'one'): [0, 6],
 ('foo', 'three'): [7],
 ('foo', 'two'): [2, 4]}

In [23]: len(grouped)
Out[23]: 6

GroupBy will tab complete column names (and other attributes)

In [24]: df
Out[24]: 
            gender   height    weight
2000-01-01    male  42.8500  157.5006
2000-01-02    male  49.6073  177.3404
2000-01-03    male  56.2935  171.5246
2000-01-04  female  48.4211  144.2520
2000-01-05    male  46.5569  152.5262
2000-01-06  female  68.4489  168.2730
2000-01-07    male  70.7577  136.4315
2000-01-08  female  58.9095  176.4998
2000-01-09  female  76.4356  174.0941
2000-01-10    male  45.3061  177.5409

In [25]: gb = df.groupby('gender')
In [26]: gb.<TAB>
gb.agg        gb.boxplot    gb.cummin     gb.describe   gb.filter     gb.get_group  gb.height     gb.last       gb.median     gb.ngroups    gb.plot       gb.rank       gb.std        gb.transform
gb.aggregate  gb.count      gb.cumprod    gb.dtype      gb.first      gb.groups     gb.hist       gb.max        gb.min        gb.nth        gb.prod       gb.resample   gb.sum        gb.var
gb.apply      gb.cummax     gb.cumsum     gb.fillna     gb.gender     gb.head       gb.indices    gb.mean       gb.name       gb.ohlc       gb.quantile   gb.size       gb.tail       gb.weight

5.2.3 GroupBy with MultiIndex

With hierarchically-indexed data, it’s quite natural to group by one of the levels of the hierarchy.

Let’s create a Series with a two-level MultiIndex.

In [27]: arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
   ....:           ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
   ....: 

In [28]: index = pd.MultiIndex.from_arrays(arrays, names=['first', 'second'])

In [29]: s = pd.Series(np.random.randn(8), index=index)

In [30]: s
Out[30]: 
first  second
bar    one      -0.5752
       two       0.2542
baz    one      -1.1437
       two       0.2159
foo    one       1.1936
       two      -0.0771
qux    one      -0.4085
       two      -0.8625
dtype: float64

We can then group by one of the levels in s.

In [31]: grouped = s.groupby(level=0)

In [32]: grouped.sum()
Out[32]: 
first
bar   -0.3211
baz   -0.9278
foo    1.1164
qux   -1.2710
dtype: float64

If the MultiIndex has names specified, these can be passed instead of the level number:

In [33]: s.groupby(level='second').sum()
Out[33]: 
second
one   -0.9339
two   -0.4696
dtype: float64

The aggregation functions such as sum will take the level parameter directly. Additionally, the resulting index will be named according to the chosen level:

In [34]: s.sum(level='second')
Out[34]: 
second
one   -0.9339
two   -0.4696
dtype: float64

Also as of v0.6, grouping with multiple levels is supported.

In [35]: s
Out[35]: 
first  second  third
bar    doo     one      1.3461
               two      1.5118
baz    bee     one      1.6271
               two     -0.9906
foo    bop     one     -0.4417
               two      1.2115
qux    bop     one      0.2685
               two      0.0246
dtype: float64

In [36]: s.groupby(level=['first', 'second']).sum()
Out[36]: 
first  second
bar    doo       2.8578
baz    bee       0.6365
foo    bop       0.7699
qux    bop       0.2931
dtype: float64

More on the sum function and aggregation later.

5.2.4 DataFrame column selection in GroupBy

Once you have created the GroupBy object from a DataFrame, for example, you might want to do something different for each of the columns. Thus, using [] similar to getting a column from a DataFrame, you can do:

In [37]: grouped = df.groupby(['A'])

In [38]: grouped_C = grouped['C']

In [39]: grouped_D = grouped['D']

This is mainly syntactic sugar for the alternative and much more verbose:

In [40]: df['C'].groupby(df['A'])
Out[40]: <pandas.core.groupby.SeriesGroupBy object at 0x2b6c20009ed0>

Additionally this method avoids recomputing the internal grouping information derived from the passed key.