10 Resampling
Warning
The interface to .resample
has changed in 0.18.0 to be more groupby-like and hence more flexible.
See the whatsnew docs for a comparison with prior versions.
Pandas provides simple, powerful, and efficient functionality for performing resampling operations during frequency conversion (e.g., converting secondly data into 5-minutely data). This is extremely common in, but not limited to, financial applications.
.resample()
is a time-based groupby, followed by a reduction method on each of its groups.
Note
.resample() is similar to using a .rolling() operation with a time-based offset; see the related discussion comparing time-based rolling windows and resampling.
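A rough sketch of the difference (using a hypothetical series s, and assuming a pandas version where .rolling() accepts a time-based offset window): .rolling() evaluates a trailing window at every original timestamp, while .resample() produces one value per bin.
import numpy as np
import pandas as pd

rng = pd.date_range('1/1/2012', periods=600, freq='S')
s = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)

# Trailing 5-minute sum evaluated at every original timestamp
s.rolling('5Min').sum()

# One aggregated value per 5-minute bin
s.resample('5Min').sum()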
See the cookbook examples for some advanced strategies.
In [1]: rng = pd.date_range('1/1/2012', periods=100, freq='S')
In [2]: ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)
In [3]: ts.resample('5Min').sum()
Out[3]:
2012-01-01 26448
Freq: 5T, dtype: int64
The resample function is very flexible and allows you to specify many different parameters to control the frequency conversion and resampling operation.
The aggregation can be any function that takes an array and produces aggregated values; it is applied by calling the corresponding method on the returned Resampler:
In [4]: ts.resample('5Min').mean()
Out[4]:
2012-01-01 264.48
Freq: 5T, dtype: float64
In [5]: ts.resample('5Min').ohlc()
Out[5]:
open high low close
2012-01-01 65 498 18 101
In [6]: ts.resample('5Min').max()
Out[6]:
2012-01-01 498
Freq: 5T, dtype: int64
Any function available via dispatching is available as a method of the returned object, including sum, mean, std, sem, max, min, median, first, last, and ohlc.
For downsampling, closed
can be set to ‘left’ or ‘right’ to specify which
end of the interval is closed:
In [7]: ts.resample('5Min', closed='right').mean()
Out[7]:
2011-12-31 23:55:00 65.000000
2012-01-01 00:00:00 266.494949
Freq: 5T, dtype: float64
In [8]: ts.resample('5Min', closed='left').mean()
Out[8]:
2012-01-01 264.48
Freq: 5T, dtype: float64
Parameters like label
and loffset
are used to manipulate the resulting
labels. label
specifies whether the result is labeled with the beginning or
the end of the interval. loffset
performs a time adjustment on the output
labels.
In [9]: ts.resample('5Min').mean() # by default label='left'
Out[9]:
2012-01-01 264.48
Freq: 5T, dtype: float64
In [10]: ts.resample('5Min', label='left').mean()
Out[10]:
2012-01-01 264.48
Freq: 5T, dtype: float64
In [11]: ts.resample('5Min', label='left', loffset='1s').mean()
Out[11]:
2012-01-01 00:00:01 264.48
dtype: float64
The axis
parameter can be set to 0 or 1 and allows you to resample the
specified axis for a DataFrame.
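For example, a minimal sketch (assuming a DataFrame whose columns form a DatetimeIndex; the names here are illustrative) of resampling along the columns rather than the rows:
import numpy as np
import pandas as pd

cols = pd.date_range('1/1/2012', periods=120, freq='S')
wide = pd.DataFrame(np.random.randn(3, 120), columns=cols)

# Aggregate the columns into 1-minute bins instead of the rows
wide.resample('1Min', axis=1).mean()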
kind
can be set to ‘timestamp’ or ‘period’ to convert the resulting index
to/from time-stamp and time-span representations. By default resample
retains the input representation.
convention can be set to ‘start’ or ‘end’ when resampling period data (detailed below). It specifies how low frequency periods are converted to higher frequency periods.
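As a minimal sketch of these two options (the series names below are illustrative, not from the examples above): kind='period' labels the result with periods instead of timestamps, while convention controls whether upsampled period data is anchored at the start or the end of each low-frequency period.
import pandas as pd

# kind='period': label the 5-minute bins with a PeriodIndex instead of timestamps
rng = pd.date_range('1/1/2012', periods=100, freq='S')
ts2 = pd.Series(range(100), index=rng)
ts2.resample('5Min', kind='period').mean()

# convention: upsample yearly periods to quarterly, anchored at either end
p = pd.Series([1.0, 2.0], index=pd.period_range('2012', periods=2, freq='A'))
p.resample('Q', convention='start').asfreq()   # values land on Q1 of each year
p.resample('Q', convention='end').asfreq()     # values land on Q4 of each year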
10.1 Upsampling
For upsampling, you can specify a way to upsample and the limit
parameter to interpolate over the gaps that are created:
# from secondly to every 250 milliseconds
In [12]: ts[:2].resample('250L').asfreq()
Out[12]:
2012-01-01 00:00:00.000 65.0
2012-01-01 00:00:00.250 NaN
2012-01-01 00:00:00.500 NaN
2012-01-01 00:00:00.750 NaN
2012-01-01 00:00:01.000 234.0
Freq: 250L, dtype: float64
In [13]: ts[:2].resample('250L').ffill()
Out[13]:
2012-01-01 00:00:00.000 65
2012-01-01 00:00:00.250 65
2012-01-01 00:00:00.500 65
2012-01-01 00:00:00.750 65
2012-01-01 00:00:01.000 234
Freq: 250L, dtype: int64
In [14]: ts[:2].resample('250L').ffill(limit=2)
Out[14]:
2012-01-01 00:00:00.000 65.0
2012-01-01 00:00:00.250 65.0
2012-01-01 00:00:00.500 65.0
2012-01-01 00:00:00.750 NaN
2012-01-01 00:00:01.000 234.0
Freq: 250L, dtype: float64
10.2 Sparse Resampling
Sparse timeseries are ones where you have a lot fewer points relative to the amount of time you are looking to resample. Naively upsampling a sparse series can potentially generate lots of intermediate values. When you don't want to use a method to fill these values (e.g. when fill_method is None), the intermediate values will be filled with NaN.
Since resample is a time-based groupby, the following is a method to efficiently resample only the groups that are not all NaN.
In [15]: rng = pd.date_range('2014-1-1', periods=100, freq='D') + pd.Timedelta('1s')
In [16]: ts = pd.Series(range(100), index=rng)
If we want to resample to the full range of the series:
In [17]: ts.resample('3T').sum()
Out[17]:
2014-01-01 00:00:00 0.0
2014-01-01 00:03:00 NaN
2014-01-01 00:06:00 NaN
2014-01-01 00:09:00 NaN
...
2014-04-09 23:51:00 NaN
2014-04-09 23:54:00 NaN
2014-04-09 23:57:00 NaN
2014-04-10 00:00:00 99.0
Freq: 3T, dtype: float64
We can instead only resample those groups where we have points as follows:
In [18]: from functools import partial
In [19]: from pandas.tseries.frequencies import to_offset
In [20]: def round(t, freq):
....: freq = to_offset(freq)
....: return pd.Timestamp((t.value // freq.delta.value) * freq.delta.value)
....:
In [21]: ts.groupby(partial(round, freq='3T')).sum()
Out[21]:
2014-01-01 0
2014-01-02 1
2014-01-03 2
2014-01-04 3
..
2014-04-07 96
2014-04-08 97
2014-04-09 98
2014-04-10 99
dtype: int64
10.3 Aggregation
Similar to groupby aggregates and the window functions, aggregations on a Resampler can be applied selectively, column by column.
When resampling a DataFrame, the default is to act on all columns with the same function.
In [22]: df = pd.DataFrame(np.random.randn(1000, 3),
....: index=pd.date_range('1/1/2012', freq='S', periods=1000),
....: columns=['A', 'B', 'C'])
....:
In [23]: r = df.resample('3T')
In [24]: r.mean()
Out[24]:
A B C
2012-01-01 00:00:00 0.027477 0.045215 -0.056040
2012-01-01 00:03:00 -0.017939 -0.054207 0.050467
2012-01-01 00:06:00 -0.009675 0.024055 -0.008814
2012-01-01 00:09:00 -0.108410 0.060835 -0.029850
2012-01-01 00:12:00 0.068417 -0.134840 0.041435
2012-01-01 00:15:00 -0.054826 -0.017097 -0.004233
We can select a specific column or columns using standard getitem.
In [25]: r['A'].mean()
Out[25]:
2012-01-01 00:00:00 0.027477
2012-01-01 00:03:00 -0.017939
2012-01-01 00:06:00 -0.009675
2012-01-01 00:09:00 -0.108410
2012-01-01 00:12:00 0.068417
2012-01-01 00:15:00 -0.054826
Freq: 3T, Name: A, dtype: float64
In [26]: r[['A','B']].mean()
Out[26]:
A B
2012-01-01 00:00:00 0.027477 0.045215
2012-01-01 00:03:00 -0.017939 -0.054207
2012-01-01 00:06:00 -0.009675 0.024055
2012-01-01 00:09:00 -0.108410 0.060835
2012-01-01 00:12:00 0.068417 -0.134840
2012-01-01 00:15:00 -0.054826 -0.017097
You can pass a list or dict of functions to do aggregation with, outputting a DataFrame:
In [27]: r['A'].agg([np.sum, np.mean, np.std])
Out[27]:
sum mean std
2012-01-01 00:00:00 4.945870 0.027477 0.963542
2012-01-01 00:03:00 -3.229049 -0.017939 0.963678
2012-01-01 00:06:00 -1.741514 -0.009675 1.041111
2012-01-01 00:09:00 -19.513772 -0.108410 0.999055
2012-01-01 00:12:00 12.315043 0.068417 1.055563
2012-01-01 00:15:00 -5.482622 -0.054826 1.082826
If a dict is passed, the keys will be used to name the columns. Otherwise the function’s name (stored in the function object) will be used.
In [28]: r['A'].agg({'result1' : np.sum,
....: 'result2' : np.mean})
....:
Out[28]:
result2 result1
2012-01-01 00:00:00 0.027477 4.945870
2012-01-01 00:03:00 -0.017939 -3.229049
2012-01-01 00:06:00 -0.009675 -1.741514
2012-01-01 00:09:00 -0.108410 -19.513772
2012-01-01 00:12:00 0.068417 12.315043
2012-01-01 00:15:00 -0.054826 -5.482622
On a resampled DataFrame, you can pass a list of functions to apply to each column, which produces an aggregated result with a hierarchical index:
In [29]: r.agg([np.sum, np.mean])
Out[29]:
A B C \
sum mean sum mean sum
2012-01-01 00:00:00 4.945870 0.027477 8.138690 0.045215 -10.087267
2012-01-01 00:03:00 -3.229049 -0.017939 -9.757335 -0.054207 9.084093
2012-01-01 00:06:00 -1.741514 -0.009675 4.329810 0.024055 -1.586513
2012-01-01 00:09:00 -19.513772 -0.108410 10.950261 0.060835 -5.373065
2012-01-01 00:12:00 12.315043 0.068417 -24.271279 -0.134840 7.458211
2012-01-01 00:15:00 -5.482622 -0.054826 -1.709680 -0.017097 -0.423342
mean
2012-01-01 00:00:00 -0.056040
2012-01-01 00:03:00 0.050467
2012-01-01 00:06:00 -0.008814
2012-01-01 00:09:00 -0.029850
2012-01-01 00:12:00 0.041435
2012-01-01 00:15:00 -0.004233
By passing a dict to aggregate
you can apply a different aggregation to the
columns of a DataFrame:
In [30]: r.agg({'A' : np.sum,
....: 'B' : lambda x: np.std(x, ddof=1)})
....:
Out[30]:
A B
2012-01-01 00:00:00 4.945870 1.020321
2012-01-01 00:03:00 -3.229049 1.109096
2012-01-01 00:06:00 -1.741514 1.081234
2012-01-01 00:09:00 -19.513772 1.012368
2012-01-01 00:12:00 12.315043 1.018528
2012-01-01 00:15:00 -5.482622 1.044362
The function names can also be strings. In order for a string to be valid, it must be implemented on the Resampler object:
In [31]: r.agg({'A' : 'sum', 'B' : 'std'})
Out[31]:
A B
2012-01-01 00:00:00 4.945870 1.020321
2012-01-01 00:03:00 -3.229049 1.109096
2012-01-01 00:06:00 -1.741514 1.081234
2012-01-01 00:09:00 -19.513772 1.012368
2012-01-01 00:12:00 12.315043 1.018528
2012-01-01 00:15:00 -5.482622 1.044362
Furthermore, you can also specify multiple aggregation functions for each column separately.
In [32]: r.agg({'A' : ['sum','std'], 'B' : ['mean','std'] })
Out[32]:
A B
sum std mean std
2012-01-01 00:00:00 4.945870 0.963542 0.045215 1.020321
2012-01-01 00:03:00 -3.229049 0.963678 -0.054207 1.109096
2012-01-01 00:06:00 -1.741514 1.041111 0.024055 1.081234
2012-01-01 00:09:00 -19.513772 0.999055 0.060835 1.012368
2012-01-01 00:12:00 12.315043 1.055563 -0.134840 1.018528
2012-01-01 00:15:00 -5.482622 1.082826 -0.017097 1.044362