2.5 Grouping

The grouping docs.

Basic grouping with apply

Unlike agg, apply’s callable is passed a sub-DataFrame, which gives it access to all the columns.

In [1]: df = pd.DataFrame({'animal': 'cat dog cat fish dog cat cat'.split(),
   ...:                    'size': list('SSMMMLL'),
   ...:                    'weight': [8, 10, 11, 1, 20, 12, 12],
   ...:                    'adult' : [False] * 5 + [True] * 2}); df
   ...: 
Out[1]: 
   adult animal size  weight
0  False    cat    S       8
1  False    dog    S      10
2  False    cat    M      11
3  False   fish    M       1
4  False    dog    M      20
5   True    cat    L      12
6   True    cat    L      12

# List the size of the animal with the highest weight in each group.
In [2]: df.groupby('animal').apply(lambda subf: subf['size'][subf['weight'].idxmax()])
Out[2]: 
animal
cat     L
dog     M
fish    M
dtype: object
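
Under the assumption that only the per-group maxima are needed, the same result can be sketched without apply by combining idxmax with loc:

```python
import pandas as pd

df = pd.DataFrame({'animal': 'cat dog cat fish dog cat cat'.split(),
                   'size': list('SSMMMLL'),
                   'weight': [8, 10, 11, 1, 20, 12, 12],
                   'adult': [False] * 5 + [True] * 2})

# idxmax gives the row label of each group's maximum weight; loc then
# looks up 'size' at those labels.
idx = df.groupby('animal')['weight'].idxmax()
sizes = df.loc[idx].set_index('animal')['size']
```

This avoids a Python-level callable entirely, at the cost of only handling the "value at the argmax" pattern.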

Using get_group

In [3]: gb = df.groupby(['animal'])

In [4]: gb.get_group('cat')
Out[4]: 
   adult animal size  weight
0  False    cat    S       8
2  False    cat    M      11
5   True    cat    L      12
6   True    cat    L      12

Apply to different items in a group

In [5]: def GrowUp(x):
   ...:    avg_weight = sum(x[x['size'] == 'S'].weight * 1.5)
   ...:    avg_weight += sum(x[x['size'] == 'M'].weight * 1.25)
   ...:    avg_weight += sum(x[x['size'] == 'L'].weight)
   ...:    avg_weight /= len(x)
   ...:    return pd.Series(['L', avg_weight, True], index=['size', 'weight', 'adult'])
   ...: 

In [6]: expected_df = gb.apply(GrowUp)

In [7]: expected_df
Out[7]: 
       size   weight adult
animal                    
cat       L  12.4375  True
dog       L  20.0000  True
fish      L   1.2500  True

Expanding Apply

In [8]: S = pd.Series([i / 100.0 for i in range(1,11)])

In [9]: def CumRet(x, y):
   ...:    return x * (1 + y)
   ...: 

In [10]: def Red(x):
   ....:    # requires functools to have been imported
   ....:    return functools.reduce(CumRet, x, 1.0)
   ....: 

In [11]: S.expanding().apply(Red)
Out[11]: 
0    1.010000
1    1.030200
2    1.061106
3    1.103550
       ...   
6    1.314229
7    1.419367
8    1.547110
9    1.701821
dtype: float64
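
The expanding reduce above compounds returns step by step; a minimal sketch of the same computation uses the vectorized cumprod directly:

```python
import pandas as pd

S = pd.Series([i / 100.0 for i in range(1, 11)])

# prod(1 + y_i) over an expanding window is exactly a cumulative product.
cum_ret = (1 + S).cumprod()
```

This matches the reduce-based version element for element and is considerably faster on long Series.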

Replacing some values with mean of the rest of a group

In [12]: df = pd.DataFrame({'A' : [1, 1, 2, 2], 'B' : [1, -1, 1, 2]})

In [13]: gb = df.groupby('A')

In [14]: def replace(g):
   ....:    mask = g < 0
   ....:    g.loc[mask] = g[~mask].mean()
   ....:    return g
   ....: 

In [15]: gb.transform(replace)
Out[15]: 
     B
0  1.0
1  1.0
2  1.0
3  2.0
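
An alternative sketch that avoids mutating inside the callable: mask the negatives to NaN, then fill each NaN with the mean of the remaining values in its group.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 2], 'B': [1, -1, 1, 2]})

# where() turns negatives into NaN; transform('mean') ignores NaNs, so
# fillna substitutes the mean of the rest of each 'A' group.
masked = df['B'].where(df['B'] >= 0)
replaced = masked.fillna(masked.groupby(df['A']).transform('mean'))
```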

Sort groups by aggregated data

In [16]: df = pd.DataFrame({'code': ['foo', 'bar', 'baz'] * 2,
   ....:                    'data': [0.16, -0.21, 0.33, 0.45, -0.59, 0.62],
   ....:                    'flag': [False, True] * 3})
   ....: 

In [17]: code_groups = df.groupby('code')

In [18]: agg_n_sort_order = code_groups[['data']].transform(sum).sort_values(by='data')

In [19]: sorted_df = df.loc[agg_n_sort_order.index]

In [20]: sorted_df
Out[20]: 
  code  data   flag
1  bar -0.21   True
4  bar -0.59  False
0  foo  0.16  False
3  foo  0.45   True
2  baz  0.33  False
5  baz  0.62   True
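
The same ordering can be sketched without an intermediate index lookup by attaching the per-group sum as a temporary sort key (the `key` column name is arbitrary); a stable sort preserves the original row order within each group:

```python
import pandas as pd

df = pd.DataFrame({'code': ['foo', 'bar', 'baz'] * 2,
                   'data': [0.16, -0.21, 0.33, 0.45, -0.59, 0.62],
                   'flag': [False, True] * 3})

# Sort by each row's group sum; mergesort is stable, so ties (rows from
# the same group) keep their original relative order.
sorted_df = (df.assign(key=df.groupby('code')['data'].transform('sum'))
               .sort_values('key', kind='mergesort')
               .drop(columns='key'))
```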

Create multiple aggregated columns

In [21]: rng = pd.date_range(start="2014-10-07",periods=10,freq='2min')

In [22]: ts = pd.Series(data = list(range(10)), index = rng)

In [23]: def MyCust(x):
   ....:    if len(x) > 2:
   ....:       return x[1] * 1.234
   ....:    return pd.NaT
   ....: 

In [24]: mhc = {'Mean' : np.mean, 'Max' : np.max, 'Custom' : MyCust}

In [25]: ts.resample("5min").apply(mhc)
Out[25]: 
                     Max Custom  Mean
2014-10-07 00:00:00    2  1.234   1.0
2014-10-07 00:05:00    4    NaT   3.5
2014-10-07 00:10:00    7  7.404   6.0
2014-10-07 00:15:00    9    NaT   8.5

In [26]: ts
Out[26]: 
2014-10-07 00:00:00    0
2014-10-07 00:02:00    1
2014-10-07 00:04:00    2
2014-10-07 00:06:00    3
                      ..
2014-10-07 00:12:00    6
2014-10-07 00:14:00    7
2014-10-07 00:16:00    8
2014-10-07 00:18:00    9
Freq: 2T, dtype: int64

Create a value counts column and reassign back to the DataFrame

In [27]: df = pd.DataFrame({'Color': 'Red Red Red Blue'.split(),
   ....:                    'Value': [100, 150, 50, 50]}); df
   ....: 
Out[27]: 
  Color  Value
0   Red    100
1   Red    150
2   Red     50
3  Blue     50

In [28]: df['Counts'] = df.groupby(['Color']).transform(len)

In [29]: df
Out[29]: 
  Color  Value  Counts
0   Red    100       3
1   Red    150       3
2   Red     50       3
3  Blue     50       1
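
A sketch of the same result using the named 'size' aggregation, which counts rows per group and aligns with the original frame, so it can be assigned directly:

```python
import pandas as pd

df = pd.DataFrame({'Color': 'Red Red Red Blue'.split(),
                   'Value': [100, 150, 50, 50]})

# 'size' counts rows per group regardless of NaNs in any column.
df['Counts'] = df.groupby('Color')['Value'].transform('size')
```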

Shift groups of the values in a column based on the index

In [30]: df = pd.DataFrame(
   ....:    {u'line_race': [10, 10, 8, 10, 10, 8],
   ....:     u'beyer': [99, 102, 103, 103, 88, 100]},
   ....:     index=[u'Last Gunfighter', u'Last Gunfighter', u'Last Gunfighter',
   ....:            u'Paynter', u'Paynter', u'Paynter']); df
   ....: 
Out[30]: 
                 beyer  line_race
Last Gunfighter     99         10
Last Gunfighter    102         10
Last Gunfighter    103          8
Paynter            103         10
Paynter             88         10
Paynter            100          8

In [31]: df['beyer_shifted'] = df.groupby(level=0)['beyer'].shift(1)

In [32]: df
Out[32]: 
                 beyer  line_race  beyer_shifted
Last Gunfighter     99         10            NaN
Last Gunfighter    102         10           99.0
Last Gunfighter    103          8          102.0
Paynter            103         10            NaN
Paynter             88         10          103.0
Paynter            100          8           88.0

Select row with maximum value from each group

In [33]: df = pd.DataFrame({'host':['other','other','that','this','this'],
   ....:                    'service':['mail','web','mail','mail','web'],
   ....:                    'no':[1, 2, 1, 2, 1]}).set_index(['host', 'service'])
   ....: 

In [34]: mask = df.groupby(level=0).agg('idxmax')

In [35]: df_count = df.loc[mask['no']].reset_index()

In [36]: df_count
Out[36]: 
    host service  no
0  other     web   2
1   that    mail   1
2   this    mail   2
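
An alternative sketch without setting an index first: a stable sort puts each host's largest 'no' last, and drop_duplicates then keeps exactly that row per host.

```python
import pandas as pd

df = pd.DataFrame({'host': ['other', 'other', 'that', 'this', 'this'],
                   'service': ['mail', 'web', 'mail', 'mail', 'web'],
                   'no': [1, 2, 1, 2, 1]})

# Stable sort by value, keep the last (largest) row per host.
df_count = (df.sort_values('no', kind='mergesort')
              .drop_duplicates('host', keep='last')
              .sort_values('host')
              .reset_index(drop=True))
```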

Grouping like Python’s itertools.groupby

In [37]: df = pd.DataFrame([0, 1, 0, 1, 1, 1, 0, 1, 1], columns=['A'])

In [38]: df.A.groupby((df.A != df.A.shift()).cumsum()).groups
Out[38]: {1: [0], 2: [1], 3: [2], 4: [3, 4, 5], 5: [6], 6: [7, 8]}

In [39]: df.A.groupby((df.A != df.A.shift()).cumsum()).cumsum()
Out[39]: 
0    0
1    1
2    0
3    1
    ..
5    3
6    0
7    1
8    2
Name: A, dtype: int64
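
The key to the grouping above is the run id that the cumsum builds; a minimal sketch of just that step:

```python
import pandas as pd

s = pd.Series([0, 1, 0, 1, 1, 1, 0, 1, 1])

# Wherever the value differs from its predecessor, the comparison is
# True and the cumulative sum increments, assigning one id per run of
# consecutive equal values.
run_id = (s != s.shift()).cumsum()
```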

2.5.2 Splitting

Splitting a frame

Create a list of DataFrames, splitting on delimiter rows identified by logic applied to the row contents.

In [40]: df = pd.DataFrame(data={'Case' : ['A','A','A','B','A','A','B','A','A'],
   ....:                         'Data' : np.random.randn(9)})
   ....: 

In [41]: dfs = list(zip(*df.groupby((1 * (df['Case'] == 'B')).cumsum()
   ....:                            .rolling(window=3, min_periods=1).median())))[-1]

In [42]: dfs[0]
Out[42]: 
  Case      Data
0    A  0.469112
1    A -0.282863
2    A -1.509059
3    B -1.135632

In [43]: dfs[1]
Out[43]: 
  Case      Data
4    A  1.212112
5    A -0.173215
6    B  0.119209

In [44]: dfs[2]
Out[44]: 
  Case      Data
7    A -1.044236
8    A -0.861849
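
The rolling-median trick above is compact but opaque; a more readable sketch of the same split starts a new chunk on the row after each 'B', so every chunk ends with its delimiting 'B' row (if any):

```python
import pandas as pd

df = pd.DataFrame({'Case': ['A', 'A', 'A', 'B', 'A', 'A', 'B', 'A', 'A'],
                   'Data': range(9)})

# cumsum counts the 'B' rows seen so far; shifting by one keeps each 'B'
# in the chunk it terminates rather than the one it starts.
chunk_id = (df['Case'] == 'B').cumsum().shift(fill_value=0)
dfs = [g for _, g in df.groupby(chunk_id)]
```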

2.5.3 Pivot

The Pivot docs.

Partial sums and subtotals

In [45]: df = pd.DataFrame(data={'Province' : ['ON','QC','BC','AL','AL','MN','ON'],
   ....:                          'City' : ['Toronto','Montreal','Vancouver','Calgary','Edmonton','Winnipeg','Windsor'],
   ....:                          'Sales' : [13,6,16,8,4,3,1]})
   ....: 

In [46]: table = pd.pivot_table(df,values=['Sales'],index=['Province'],columns=['City'],aggfunc=np.sum,margins=True)

In [47]: table.stack('City')
Out[47]: 
                    Sales
Province City            
AL       All         12.0
         Calgary      8.0
         Edmonton     4.0
BC       All         16.0
...                   ...
All      Toronto     13.0
         Vancouver   16.0
         Windsor      1.0
         Winnipeg     3.0

[20 rows x 1 columns]

Frequency table like plyr in R

In [48]: grades = [48,99,75,80,42,80,72,68,36,78]

In [49]: df = pd.DataFrame( {'ID': ["x%d" % r for r in range(10)],
   ....:                     'Gender' : ['F', 'M', 'F', 'M', 'F', 'M', 'F', 'M', 'M', 'M'],
   ....:                     'ExamYear': ['2007','2007','2007','2008','2008','2008','2008','2009','2009','2009'],
   ....:                     'Class': ['algebra', 'stats', 'bio', 'algebra', 'algebra', 'stats', 'stats', 'algebra', 'bio', 'bio'],
   ....:                     'Participated': ['yes','yes','yes','yes','no','yes','yes','yes','yes','yes'],
   ....:                     'Passed': ['yes' if x > 50 else 'no' for x in grades],
   ....:                     'Employed': [True,True,True,False,False,False,False,True,True,False],
   ....:                     'Grade': grades})
   ....: 

In [50]: df.groupby('ExamYear').agg({'Participated': lambda x: x.value_counts()['yes'],
   ....:                     'Passed': lambda x: sum(x == 'yes'),
   ....:                     'Employed' : lambda x : sum(x),
   ....:                     'Grade' : lambda x : sum(x) / len(x)})
   ....: 
Out[50]: 
          Grade  Employed  Participated  Passed
ExamYear                                       
2007         74         3             3       2
2008         68         0             3       3
2009         60         2             3       2
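
For a pure frequency table, pd.crosstab gives the contingency counts directly; a sketch on the pass/fail column of the same data:

```python
import pandas as pd

grades = [48, 99, 75, 80, 42, 80, 72, 68, 36, 78]
df = pd.DataFrame({'ExamYear': ['2007'] * 3 + ['2008'] * 4 + ['2009'] * 3,
                   'Passed': ['yes' if x > 50 else 'no' for x in grades]})

# Rows are exam years, columns are 'no'/'yes', cells are counts.
freq = pd.crosstab(df['ExamYear'], df['Passed'])
```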

Plot pandas DataFrame with year over year data

To create a year and month cross-tabulation:

In [51]: df = pd.DataFrame({'value': np.random.randn(36)},
   ....:                   index=pd.date_range('2011-01-01', freq='M', periods=36))
   ....: 

In [52]: pd.pivot_table(df, index=df.index.month, columns=df.index.year,
   ....:                values='value', aggfunc='sum')
   ....: 
Out[52]: 
        2011      2012      2013
1  -2.104569  0.113648 -0.109050
2  -0.494929 -1.478427  1.643563
3   1.071804  0.524988 -1.469388
4   0.721555  0.404705  0.357021
..       ...       ...       ...
9   0.567020 -1.157892  0.413738
10  0.276232 -1.344312  0.276662
11 -1.087401  0.844885 -0.472035
12 -0.673690  1.075770 -0.013960

[12 rows x 3 columns]

2.5.4 Apply

Rolling Apply to Organize - Turning embedded lists into a multi-index frame

In [53]: df = pd.DataFrame(data={'A': [[2, 4, 8, 16], [100, 200], [10, 20, 30]],
   ....:                         'B': [['a', 'b', 'c'], ['jj', 'kk'], ['ccc']]},
   ....:                   index=['I', 'II', 'III'])
   ....: 

In [54]: def SeriesFromSubList(aList):
   ....:    return pd.Series(aList)
   ....: 

In [55]: df_orgz = pd.concat({ind: row.apply(SeriesFromSubList)
   ....:                      for ind, row in df.iterrows()})

Rolling Apply with a DataFrame returning a Series

Rolling apply on multiple columns, where the function calculates a Series and a scalar from that Series is returned

In [56]: df = pd.DataFrame(data=np.random.randn(2000,2)/10000,
   ....:                   index=pd.date_range('2001-01-01',periods=2000),
   ....:                   columns=['A','B']); df
   ....: 
Out[56]: 
                   A             B
2001-01-01 -0.000036 -6.153570e-07
2001-01-02 -0.000092  8.957173e-05
2001-01-03  0.000081 -1.206412e-04
2001-01-04  0.000257  1.431256e-04
...              ...           ...
2006-06-20 -0.000066  6.418233e-06
2006-06-21  0.000134  2.368990e-05
2006-06-22 -0.000053 -4.636504e-05
2006-06-23 -0.000016  1.640535e-04

[2000 rows x 2 columns]

In [57]: def gm(aDF,Const):
   ....:    v = ((((aDF.A+aDF.B)+1).cumprod())-1)*Const
   ....:    return (aDF.index[0],v.iloc[-1])
   ....: 

In [58]: S = pd.Series(dict([ gm(df.iloc[i:min(i+51,len(df)-1)],5) for i in range(len(df)-50) ])); S
Out[58]: 
2001-01-01    0.001555
2001-01-02    0.002357
2001-01-03    0.002620
2001-01-04    0.002146
                ...   
2006-05-01    0.003101
2006-05-02    0.003143
2006-05-03    0.003168
2006-05-04    0.003204
dtype: float64

Rolling Apply with a DataFrame returning a Scalar

Rolling apply on multiple columns, where the function returns a scalar (volume-weighted average price)

In [59]: rng = pd.date_range(start = '2014-01-01',periods = 100)

In [60]: df = pd.DataFrame({'Open' : np.random.randn(len(rng)),
   ....:                    'Close' : np.random.randn(len(rng)),
   ....:                    'Volume' : np.random.randint(100,2000,len(rng))}, index=rng); df
   ....: 
Out[60]: 
               Close      Open  Volume
2014-01-01  0.297665 -0.194914    1457
2014-01-02  1.659370 -0.317710    1882
2014-01-03  0.325790 -1.492666    1504
2014-01-04 -3.188821 -0.754563     222
...              ...       ...     ...
2014-04-07 -1.355489  1.592339     777
2014-04-08 -1.317901  0.066266     353
2014-04-09  1.422172 -1.693412     419
2014-04-10 -1.518677 -0.109194     465

[100 rows x 3 columns]

In [61]: def vwap(bars):
   ....:    return (bars.Close * bars.Volume).sum() / bars.Volume.sum()
   ....: 

In [62]: window = 5

In [63]: s = pd.concat([ (pd.Series(vwap(df.iloc[i:i+window]), index=[df.index[i+window]])) for i in range(len(df)-window) ]);

In [64]: s.round(2)
Out[64]: 
2014-01-06    0.65
2014-01-07    0.58
2014-01-08    0.01
2014-01-09    0.21
              ... 
2014-04-07   -0.31
2014-04-08   -0.40
2014-04-09   -0.88
2014-04-10   -0.92
dtype: float64
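
Because VWAP is just a ratio of two sums, it can also be sketched as a ratio of two rolling sums, with no Python-level apply at all (random data here, so the numbers differ from the transcript above):

```python
import numpy as np
import pandas as pd

rng = pd.date_range(start='2014-01-01', periods=100)
rs = np.random.RandomState(0)
df = pd.DataFrame({'Close': rs.randn(len(rng)),
                   'Volume': rs.randint(100, 2000, len(rng))}, index=rng)

# Rolling VWAP over 5 rows as a ratio of two vectorized rolling sums.
vwap = ((df['Close'] * df['Volume']).rolling(5).sum()
        / df['Volume'].rolling(5).sum())
```

Note that rolling() labels each value at the last row of its window, whereas the loop above labels it at the following row; a shift(1) would align the two.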