7 Grouping

By “group by” we are referring to a process involving one or more of the following steps:

  • Splitting the data into groups based on some criteria
  • Applying a function to each group independently
  • Combining the results into a data structure
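The three steps above can be written out by hand to show what a group-by does. The following is only a minimal sketch: the frame `toy` and its values are made up for illustration (they are not the random data used in this section), and the dictionary-based splitting is a teaching device, not how pandas implements `groupby` internally.

```python
import pandas as pd

# Small hand-made frame (illustrative values, assumed for this sketch)
toy = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'],
                    'C': [1.0, 2.0, 3.0, 4.0]})

# 1. Split: collect the rows belonging to each distinct value of A
pieces = {key: toy[toy['A'] == key] for key in toy['A'].unique()}

# 2. Apply: compute the sum of C within each piece independently
sums = {key: piece['C'].sum() for key, piece in pieces.items()}

# 3. Combine: assemble the per-group results into a single Series
manual = pd.Series(sums, name='C').sort_index()
manual.index.name = 'A'

# groupby performs the same three steps in one call
grouped = toy.groupby('A')['C'].sum()

print(manual.equals(grouped))
```

Here both `manual` and `grouped` hold 6.0 for bar and 4.0 for foo, so the comparison prints True.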

See the Grouping section.

In [1]: df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
   ...:                           'foo', 'bar', 'foo', 'foo'],
   ...:                    'B' : ['one', 'one', 'two', 'three',
   ...:                           'two', 'two', 'one', 'three'],
   ...:                    'C' : np.random.randn(8),
   ...:                    'D' : np.random.randn(8)})
   ...: 

In [2]: df
Out[2]: 
     A      B         C         D
0  foo    one  0.469112 -0.861849
1  bar    one -0.282863 -2.104569
2  foo    two -1.509059 -0.494929
3  bar  three -1.135632  1.071804
4  foo    two  1.212112  0.721555
5  bar    two -0.173215 -0.706771
6  foo    one  0.119209 -1.039575
7  foo  three -1.044236  0.271860

Grouping by column A and then applying the sum function to the resulting groups:

In [3]: df.groupby('A')[['C', 'D']].sum()
Out[3]: 
            C         D
A                      
bar -1.591710 -1.739537
foo -0.752861 -1.402938

Grouping by multiple columns forms a hierarchical index, and again we can apply the sum function:

In [4]: df.groupby(['A','B']).sum()
Out[4]: 
                  C         D
A   B                        
bar one   -0.282863 -2.104569
    three -1.135632  1.071804
    two   -0.173215 -0.706771
foo one    0.588321 -1.901424
    three -1.044236  0.271860
    two   -0.296946  0.226626
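Because the result is indexed by (A, B) pairs, individual groups can be looked up by those keys. The sketch below assumes a small hand-made frame named `toy` (hypothetical values, not the data above) and shows one way to read from the hierarchical index and to flatten it back into columns.

```python
import pandas as pd

# Hand-made frame (illustrative values, assumed for this sketch)
toy = pd.DataFrame({'A': ['foo', 'foo', 'bar', 'foo'],
                    'B': ['one', 'two', 'one', 'one'],
                    'C': [1.0, 2.0, 3.0, 4.0]})

summed = toy.groupby(['A', 'B']).sum()

# The result is indexed by (A, B) pairs, so one group is addressed with a tuple
foo_one = summed.loc[('foo', 'one'), 'C']   # sum of the two foo/one rows: 1.0 + 4.0

# Selecting only the outer level returns every B group under that A value
foo_rows = summed.loc['foo']

# reset_index() turns the index levels back into ordinary columns
flat = summed.reset_index()

print(foo_one)
print(list(flat.columns))
```

Whether to keep the hierarchical index or flatten it with `reset_index()` depends on what follows: the index is convenient for lookups by group key, while plain columns are often easier to pass to further column-based operations.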