2.5 Grouping
The grouping docs.
Unlike agg, apply’s callable is passed a sub-DataFrame, which gives you access to all the columns.
In [1]: df = pd.DataFrame({'animal': 'cat dog cat fish dog cat cat'.split(),
...: 'size': list('SSMMMLL'),
...: 'weight': [8, 10, 11, 1, 20, 12, 12],
...: 'adult' : [False] * 5 + [True] * 2}); df
...:
Out[1]:
adult animal size weight
0 False cat S 8
1 False dog S 10
2 False cat M 11
3 False fish M 1
4 False dog M 20
5 True cat L 12
6 True cat L 12
# List the size of the animals with the highest weight.
In [2]: df.groupby('animal').apply(lambda subf: subf['size'][subf['weight'].idxmax()])
Out[2]:
animal
cat L
dog M
fish M
dtype: object
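The same result can also be sketched without apply, by looking up the row label of each group’s maximum weight (re-creating the df above; idxmax breaks ties by taking the first occurrence):

```python
import pandas as pd

df = pd.DataFrame({'animal': 'cat dog cat fish dog cat cat'.split(),
                   'size': list('SSMMMLL'),
                   'weight': [8, 10, 11, 1, 20, 12, 12],
                   'adult': [False] * 5 + [True] * 2})

# Index label of the heaviest row within each animal group
idx = df.groupby('animal')['weight'].idxmax()

# Look up the 'size' of those rows, keyed by animal
sizes = df.loc[idx].set_index('animal')['size']
```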
In [3]: gb = df.groupby(['animal'])
In [4]: gb.get_group('cat')
Out[4]:
adult animal size weight
0 False cat S 8
2 False cat M 11
5 True cat L 12
6 True cat L 12
Apply to different items in a group
In [5]: def GrowUp(x):
...: avg_weight = sum(x[x['size'] == 'S'].weight * 1.5)
...: avg_weight += sum(x[x['size'] == 'M'].weight * 1.25)
...: avg_weight += sum(x[x['size'] == 'L'].weight)
...: avg_weight /= len(x)
...: return pd.Series(['L',avg_weight,True], index=['size', 'weight', 'adult'])
...:
In [6]: expected_df = gb.apply(GrowUp)
In [7]: expected_df
Out[7]:
size weight adult
animal
cat L 12.4375 True
dog L 20.0000 True
fish L 1.2500 True
In [8]: S = pd.Series([i / 100.0 for i in range(1,11)])
In [9]: def CumRet(x,y):
...: return x * (1 + y)
...:
In [10]: def Red(x):
....: return functools.reduce(CumRet, x, 1.0)  # requires an earlier "import functools"
....:
In [11]: S.expanding().apply(Red)
Out[11]:
0 1.010000
1 1.030200
2 1.061106
3 1.103550
...
6 1.314229
7 1.419367
8 1.547110
9 1.701821
dtype: float64
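The expanding reduce above can also be written without apply: compounding returns is just a running product of (1 + return), so the same values fall out of cumprod (same S as above):

```python
import pandas as pd

S = pd.Series([i / 100.0 for i in range(1, 11)])

# Equivalent to S.expanding().apply(Red): a cumulative product of (1 + r)
cum_ret = (1 + S).cumprod()
```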
Replacing some values with mean of the rest of a group
In [12]: df = pd.DataFrame({'A' : [1, 1, 2, 2], 'B' : [1, -1, 1, 2]})
In [13]: gb = df.groupby('A')
In [14]: def replace(g):
....: mask = g < 0
....: g.loc[mask] = g[~mask].mean()
....: return g
....:
In [15]: gb.transform(replace)
Out[15]:
B
0 1.0
1 1.0
2 1.0
3 2.0
Sort groups by aggregated data
In [16]: df = pd.DataFrame({'code': ['foo', 'bar', 'baz'] * 2,
....: 'data': [0.16, -0.21, 0.33, 0.45, -0.59, 0.62],
....: 'flag': [False, True] * 3})
....:
In [17]: code_groups = df.groupby('code')
In [18]: agg_n_sort_order = code_groups[['data']].transform('sum').sort_values(by='data')
In [19]: sorted_df = df.loc[agg_n_sort_order.index]
In [20]: sorted_df
Out[20]:
code data flag
1 bar -0.21 True
4 bar -0.59 False
0 foo 0.16 False
3 foo 0.45 True
2 baz 0.33 False
5 baz 0.62 True
Create multiple aggregated columns
In [21]: rng = pd.date_range(start="2014-10-07",periods=10,freq='2min')
In [22]: ts = pd.Series(data = list(range(10)), index = rng)
In [23]: def MyCust(x):
....: if len(x) > 2:
....: return x[1] * 1.234
....: return pd.NaT
....:
In [24]: mhc = {'Mean' : np.mean, 'Max' : np.max, 'Custom' : MyCust}
In [25]: ts.resample("5min").apply(mhc)
Out[25]:
Max Custom Mean
2014-10-07 00:00:00 2 1.234 1.0
2014-10-07 00:05:00 4 NaT 3.5
2014-10-07 00:10:00 7 7.404 6.0
2014-10-07 00:15:00 9 NaT 8.5
In [26]: ts
Out[26]:
2014-10-07 00:00:00 0
2014-10-07 00:02:00 1
2014-10-07 00:04:00 2
2014-10-07 00:06:00 3
..
2014-10-07 00:12:00 6
2014-10-07 00:14:00 7
2014-10-07 00:16:00 8
2014-10-07 00:18:00 9
Freq: 2T, dtype: int64
Create a value counts column and reassign back to the DataFrame
In [27]: df = pd.DataFrame({'Color': 'Red Red Red Blue'.split(),
....: 'Value': [100, 150, 50, 50]}); df
....:
Out[27]:
Color Value
0 Red 100
1 Red 150
2 Red 50
3 Blue 50
In [28]: df['Counts'] = df.groupby(['Color']).transform(len)
In [29]: df
Out[29]:
Color Value Counts
0 Red 100 3
1 Red 150 3
2 Red 50 3
3 Blue 50 1
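A more direct spelling of the same idea (a sketch on the same data) uses the string alias 'size' on a single column, which avoids passing Python’s len and relying on whole-frame transform:

```python
import pandas as pd

df = pd.DataFrame({'Color': 'Red Red Red Blue'.split(),
                   'Value': [100, 150, 50, 50]})

# 'size' counts rows per group and broadcasts back to the original shape
df['Counts'] = df.groupby('Color')['Value'].transform('size')
```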
Shift groups of the values in a column based on the index
In [30]: df = pd.DataFrame(
....: {u'line_race': [10, 10, 8, 10, 10, 8],
....: u'beyer': [99, 102, 103, 103, 88, 100]},
....: index=[u'Last Gunfighter', u'Last Gunfighter', u'Last Gunfighter',
....: u'Paynter', u'Paynter', u'Paynter']); df
....:
Out[30]:
beyer line_race
Last Gunfighter 99 10
Last Gunfighter 102 10
Last Gunfighter 103 8
Paynter 103 10
Paynter 88 10
Paynter 100 8
In [31]: df['beyer_shifted'] = df.groupby(level=0)['beyer'].shift(1)
In [32]: df
Out[32]:
beyer line_race beyer_shifted
Last Gunfighter 99 10 NaN
Last Gunfighter 102 10 99.0
Last Gunfighter 103 8 102.0
Paynter 103 10 NaN
Paynter 88 10 103.0
Paynter 100 8 88.0
Select row with maximum value from each group
In [33]: df = pd.DataFrame({'host':['other','other','that','this','this'],
....: 'service':['mail','web','mail','mail','web'],
....: 'no':[1, 2, 1, 2, 1]}).set_index(['host', 'service'])
....:
In [34]: mask = df.groupby(level=0).agg('idxmax')
In [35]: df_count = df.loc[mask['no']].reset_index()
In [36]: df_count
Out[36]:
host service no
0 other web 2
1 that mail 1
2 this mail 2
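When the keys are ordinary columns rather than index levels, the same selection can be sketched directly with idxmax on one column:

```python
import pandas as pd

df = pd.DataFrame({'host': ['other', 'other', 'that', 'this', 'this'],
                   'service': ['mail', 'web', 'mail', 'mail', 'web'],
                   'no': [1, 2, 1, 2, 1]})

# Row label of the maximum 'no' within each host, then select those rows
best = df.loc[df.groupby('host')['no'].idxmax()]
```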
Grouping like Python’s itertools.groupby
In [37]: df = pd.DataFrame([0, 1, 0, 1, 1, 1, 0, 1, 1], columns=['A'])
In [38]: df.A.groupby((df.A != df.A.shift()).cumsum()).groups
Out[38]: {1: [0], 2: [1], 3: [2], 4: [3, 4, 5], 5: [6], 6: [7, 8]}
In [39]: df.A.groupby((df.A != df.A.shift()).cumsum()).cumsum()
Out[39]:
0 0
1 1
2 0
3 1
..
5 3
6 0
7 1
8 2
Name: A, dtype: int64
2.5.1 Expanding Data
Rolling Computation window based on values instead of counts
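A minimal sketch of a value-based window: with a DatetimeIndex, rolling accepts a time offset instead of a row count, so each window covers a fixed span of time even when observations are irregularly spaced (the data here is illustrative):

```python
import pandas as pd

# Irregularly spaced observations: a row-count window would mix time spans
idx = pd.to_datetime(['2014-01-01', '2014-01-02', '2014-01-05', '2014-01-06'])
s = pd.Series([1.0, 2.0, 4.0, 8.0], index=idx)

# A '2D' window sums every observation within the last two days
# (the window is closed on the right by default)
result = s.rolling('2D').sum()
```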
2.5.2 Splitting
Create a list of DataFrames, split using a delineation based on logic included in the rows (here, each row where Case is 'B' closes out a chunk).
In [40]: df = pd.DataFrame(data={'Case' : ['A','A','A','B','A','A','B','A','A'],
....: 'Data' : np.random.randn(9)})
....:
In [41]: dfs = list(zip(*df.groupby((1*(df['Case']=='B')).cumsum().rolling(window=3,min_periods=1).median())))[-1]
In [42]: dfs[0]
Out[42]:
Case Data
0 A 0.469112
1 A -0.282863
2 A -1.509059
3 B -1.135632
In [43]: dfs[1]
Out[43]:
Case Data
4 A 1.212112
5 A -0.173215
6 B 0.119209
In [44]: dfs[2]
Out[44]:
Case Data
7 A -1.044236
8 A -0.861849
2.5.3 Pivot
The Pivot docs.
In [45]: df = pd.DataFrame(data={'Province' : ['ON','QC','BC','AL','AL','MN','ON'],
....: 'City' : ['Toronto','Montreal','Vancouver','Calgary','Edmonton','Winnipeg','Windsor'],
....: 'Sales' : [13,6,16,8,4,3,1]})
....:
In [46]: table = pd.pivot_table(df,values=['Sales'],index=['Province'],columns=['City'],aggfunc=np.sum,margins=True)
In [47]: table.stack('City')
Out[47]:
Sales
Province City
AL All 12.0
Calgary 8.0
Edmonton 4.0
BC All 16.0
... ...
All Toronto 13.0
Vancouver 16.0
Windsor 1.0
Winnipeg 3.0
[20 rows x 1 columns]
Frequency table like plyr in R
In [48]: grades = [48,99,75,80,42,80,72,68,36,78]
In [49]: df = pd.DataFrame( {'ID': ["x%d" % r for r in range(10)],
....: 'Gender' : ['F', 'M', 'F', 'M', 'F', 'M', 'F', 'M', 'M', 'M'],
....: 'ExamYear': ['2007','2007','2007','2008','2008','2008','2008','2009','2009','2009'],
....: 'Class': ['algebra', 'stats', 'bio', 'algebra', 'algebra', 'stats', 'stats', 'algebra', 'bio', 'bio'],
....: 'Participated': ['yes','yes','yes','yes','no','yes','yes','yes','yes','yes'],
....: 'Passed': ['yes' if x > 50 else 'no' for x in grades],
....: 'Employed': [True,True,True,False,False,False,False,True,True,False],
....: 'Grade': grades})
....:
In [50]: df.groupby('ExamYear').agg({'Participated': lambda x: x.value_counts()['yes'],
....: 'Passed': lambda x: sum(x == 'yes'),
....: 'Employed' : lambda x : sum(x),
....: 'Grade' : lambda x : sum(x) / len(x)})
....:
Out[50]:
Grade Employed Participated Passed
ExamYear
2007 74 3 3 2
2008 68 0 3 3
2009 60 2 3 2
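For a plain frequency table of two categorical columns, pd.crosstab is a shorter sketch (same grades and years as above):

```python
import pandas as pd

grades = [48, 99, 75, 80, 42, 80, 72, 68, 36, 78]
df = pd.DataFrame({'ExamYear': ['2007', '2007', '2007', '2008', '2008',
                                '2008', '2008', '2009', '2009', '2009'],
                   'Passed': ['yes' if x > 50 else 'no' for x in grades]})

# Counts of Passed outcomes per exam year
table = pd.crosstab(df['ExamYear'], df['Passed'])
```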
Plot pandas DataFrame with year over year data
To create year and month crosstabulation:
In [51]: df = pd.DataFrame({'value': np.random.randn(36)},
....: index=pd.date_range('2011-01-01', freq='M', periods=36))
....:
In [52]: pd.pivot_table(df, index=df.index.month, columns=df.index.year,
....: values='value', aggfunc='sum')
....:
Out[52]:
2011 2012 2013
1 -2.104569 0.113648 -0.109050
2 -0.494929 -1.478427 1.643563
3 1.071804 0.524988 -1.469388
4 0.721555 0.404705 0.357021
.. ... ... ...
9 0.567020 -1.157892 0.413738
10 0.276232 -1.344312 0.276662
11 -1.087401 0.844885 -0.472035
12 -0.673690 1.075770 -0.013960
[12 rows x 3 columns]
2.5.4 Apply
Rolling Apply to Organize - Turning embedded lists into a multi-index frame
In [53]: df = pd.DataFrame(data={'A' : [[2,4,8,16],[100,200],[10,20,30]], 'B' : [['a','b','c'],['jj','kk'],['ccc']]},index=['I','II','III'])
In [54]: def SeriesFromSubList(aList):
....: return pd.Series(aList)
....:
In [55]: df_orgz = pd.concat(dict([ (ind,row.apply(SeriesFromSubList)) for ind,row in df.iterrows() ]))
Rolling Apply with a DataFrame returning a Series
Rolling apply on multiple columns, where the function computes a Series and then returns a scalar taken from that Series
In [56]: df = pd.DataFrame(data=np.random.randn(2000,2)/10000,
....: index=pd.date_range('2001-01-01',periods=2000),
....: columns=['A','B']); df
....:
Out[56]:
A B
2001-01-01 -0.000036 -6.153570e-07
2001-01-02 -0.000092 8.957173e-05
2001-01-03 0.000081 -1.206412e-04
2001-01-04 0.000257 1.431256e-04
... ... ...
2006-06-20 -0.000066 6.418233e-06
2006-06-21 0.000134 2.368990e-05
2006-06-22 -0.000053 -4.636504e-05
2006-06-23 -0.000016 1.640535e-04
[2000 rows x 2 columns]
In [57]: def gm(aDF,Const):
....: v = ((((aDF.A+aDF.B)+1).cumprod())-1)*Const
....: return (aDF.index[0],v.iloc[-1])
....:
In [58]: S = pd.Series(dict([ gm(df.iloc[i:min(i+51,len(df)-1)],5) for i in range(len(df)-50) ])); S
Out[58]:
2001-01-01 0.001555
2001-01-02 0.002357
2001-01-03 0.002620
2001-01-04 0.002146
...
2006-05-01 0.003101
2006-05-02 0.003143
2006-05-03 0.003168
2006-05-04 0.003204
dtype: float64
Rolling apply with a DataFrame returning a Scalar
Rolling Apply to multiple columns where function returns a Scalar (Volume Weighted Average Price)
In [59]: rng = pd.date_range(start = '2014-01-01',periods = 100)
In [60]: df = pd.DataFrame({'Open' : np.random.randn(len(rng)),
....: 'Close' : np.random.randn(len(rng)),
....: 'Volume' : np.random.randint(100,2000,len(rng))}, index=rng); df
....:
Out[60]:
Close Open Volume
2014-01-01 0.297665 -0.194914 1457
2014-01-02 1.659370 -0.317710 1882
2014-01-03 0.325790 -1.492666 1504
2014-01-04 -3.188821 -0.754563 222
... ... ... ...
2014-04-07 -1.355489 1.592339 777
2014-04-08 -1.317901 0.066266 353
2014-04-09 1.422172 -1.693412 419
2014-04-10 -1.518677 -0.109194 465
[100 rows x 3 columns]
In [61]: def vwap(bars): return ((bars.Close*bars.Volume).sum()/bars.Volume.sum())
In [62]: window = 5
In [63]: s = pd.concat([ (pd.Series(vwap(df.iloc[i:i+window]), index=[df.index[i+window]])) for i in range(len(df)-window) ]);
In [64]: s.round(2)
Out[64]:
2014-01-06 0.65
2014-01-07 0.58
2014-01-08 0.01
2014-01-09 0.21
...
2014-04-07 -0.31
2014-04-08 -0.40
2014-04-09 -0.88
2014-04-10 -0.92
dtype: float64
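The VWAP loop above can also be sketched with rolling sums, avoiding Python-level iteration (deterministic illustrative data here; note that built-in rolling windows are labeled by the last row they contain, whereas the loop above labels each window by the following date):

```python
import pandas as pd

df = pd.DataFrame({'Close': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
                   'Volume': [100, 200, 300, 400, 500, 600]},
                  index=pd.date_range('2014-01-01', periods=6))

window = 5
# Rolling dollar volume divided by rolling volume gives the VWAP;
# the first window - 1 entries are NaN until a full window accumulates
vwap = ((df.Close * df.Volume).rolling(window).sum()
        / df.Volume.rolling(window).sum())
```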