3.3 Aggregation
Once the Rolling
, Expanding
or EWM
objects have been created, several methods are available to
perform multiple computations on the data. This is very similar to a .groupby(...).agg
seen here.
In [1]: dfa = pd.DataFrame(np.random.randn(1000, 3),
...: index=pd.date_range('1/1/2000', periods=1000),
...: columns=['A', 'B', 'C'])
...:
In [2]: r = dfa.rolling(window=60,min_periods=1)
In [3]: r
Out[3]: Rolling [window=60,min_periods=1,center=False,axis=0]
We can aggregate by passing a function to the entire DataFrame, or select a Series (or multiple Series) via standard getitem.
In [4]: r.aggregate(np.sum)
Out[4]:
A B C
2000-01-01 0.469112 -0.282863 -1.509059
2000-01-02 -0.666520 0.929249 -1.682273
2000-01-03 -0.547311 -0.114987 -2.544122
2000-01-04 -2.651881 -0.609917 -1.472318
... ... ... ...
2002-09-23 8.342906 -8.420254 -4.488404
2002-09-24 8.922223 -8.817915 -1.041163
2002-09-25 6.181801 -9.888219 -4.544517
2002-09-26 4.105633 -7.922664 -4.677549
[1000 rows x 3 columns]
In [5]: r['A'].aggregate(np.sum)
Out[5]:
2000-01-01 0.469112
2000-01-02 -0.666520
2000-01-03 -0.547311
2000-01-04 -2.651881
...
2002-09-23 8.342906
2002-09-24 8.922223
2002-09-25 6.181801
2002-09-26 4.105633
Freq: D, Name: A, dtype: float64
In [6]: r[['A','B']].aggregate(np.sum)
Out[6]:
A B
2000-01-01 0.469112 -0.282863
2000-01-02 -0.666520 0.929249
2000-01-03 -0.547311 -0.114987
2000-01-04 -2.651881 -0.609917
... ... ...
2002-09-23 8.342906 -8.420254
2002-09-24 8.922223 -8.817915
2002-09-25 6.181801 -9.888219
2002-09-26 4.105633 -7.922664
[1000 rows x 2 columns]
As you can see, the result of the aggregation will have the selected columns, or all columns if none are selected.
3.3.1 Applying multiple functions at once
With windowed Series you can also pass a list or dict of functions to do aggregation with, outputting a DataFrame:
In [7]: r['A'].agg([np.sum, np.mean, np.std])
Out[7]:
sum mean std
2000-01-01 0.469112 0.469112 NaN
2000-01-02 -0.666520 -0.333260 1.134726
2000-01-03 -0.547311 -0.182437 0.843827
2000-01-04 -2.651881 -0.662970 1.182516
... ... ... ...
2002-09-23 8.342906 0.139048 1.006288
2002-09-24 8.922223 0.148704 0.996540
2002-09-25 6.181801 0.103030 1.031933
2002-09-26 4.105633 0.068427 1.064250
[1000 rows x 3 columns]
If a dict is passed, the keys will be used to name the columns. Otherwise the function’s name (stored in the function object) will be used.
In [8]: r['A'].agg({'result1' : np.sum,
...: 'result2' : np.mean})
...:
Out[8]:
result2 result1
2000-01-01 0.469112 0.469112
2000-01-02 -0.333260 -0.666520
2000-01-03 -0.182437 -0.547311
2000-01-04 -0.662970 -2.651881
... ... ...
2002-09-23 0.139048 8.342906
2002-09-24 0.148704 8.922223
2002-09-25 0.103030 6.181801
2002-09-26 0.068427 4.105633
[1000 rows x 2 columns]
On a widowed DataFrame, you can pass a list of functions to apply to each column, which produces an aggregated result with a hierarchical index:
In [9]: r.agg([np.sum, np.mean])
Out[9]:
A B C
sum mean sum mean sum mean
2000-01-01 0.469112 0.469112 -0.282863 -0.282863 -1.509059 -1.509059
2000-01-02 -0.666520 -0.333260 0.929249 0.464624 -1.682273 -0.841137
2000-01-03 -0.547311 -0.182437 -0.114987 -0.038329 -2.544122 -0.848041
2000-01-04 -2.651881 -0.662970 -0.609917 -0.152479 -1.472318 -0.368080
... ... ... ... ... ... ...
2002-09-23 8.342906 0.139048 -8.420254 -0.140338 -4.488404 -0.074807
2002-09-24 8.922223 0.148704 -8.817915 -0.146965 -1.041163 -0.017353
2002-09-25 6.181801 0.103030 -9.888219 -0.164804 -4.544517 -0.075742
2002-09-26 4.105633 0.068427 -7.922664 -0.132044 -4.677549 -0.077959
[1000 rows x 6 columns]
Passing a dict of functions has different behavior by default, see the next section.
3.3.2 Applying different functions to DataFrame columns
By passing a dict to aggregate
you can apply a different aggregation to the
columns of a DataFrame:
In [10]: r.agg({'A' : np.sum,
....: 'B' : lambda x: np.std(x, ddof=1)})
....:
Out[10]:
A B
2000-01-01 0.469112 NaN
2000-01-02 -0.666520 1.057107
2000-01-03 -0.547311 1.147878
2000-01-04 -2.651881 0.964644
... ... ...
2002-09-23 8.342906 1.035191
2002-09-24 8.922223 1.032920
2002-09-25 6.181801 1.039457
2002-09-26 4.105633 1.058789
[1000 rows x 2 columns]
The function names can also be strings. In order for a string to be valid it must be implemented on the windowed object
In [11]: r.agg({'A' : 'sum', 'B' : 'std'})
Out[11]:
A B
2000-01-01 0.469112 NaN
2000-01-02 -0.666520 1.057107
2000-01-03 -0.547311 1.147878
2000-01-04 -2.651881 0.964644
... ... ...
2002-09-23 8.342906 1.035191
2002-09-24 8.922223 1.032920
2002-09-25 6.181801 1.039457
2002-09-26 4.105633 1.058789
[1000 rows x 2 columns]
Furthermore you can pass a nested dict to indicate different aggregations on different columns.
In [12]: r.agg({'A' : ['sum','std'], 'B' : ['mean','std'] })
Out[12]:
A B
sum std mean std
2000-01-01 0.469112 NaN -0.282863 NaN
2000-01-02 -0.666520 1.134726 0.464624 1.057107
2000-01-03 -0.547311 0.843827 -0.038329 1.147878
2000-01-04 -2.651881 1.182516 -0.152479 0.964644
... ... ... ... ...
2002-09-23 8.342906 1.006288 -0.140338 1.035191
2002-09-24 8.922223 0.996540 -0.146965 1.032920
2002-09-25 6.181801 1.031933 -0.164804 1.039457
2002-09-26 4.105633 1.064250 -0.132044 1.058789
[1000 rows x 4 columns]