10 Resampling
Warning
The interface to .resample
has changed in 0.18.0 to be more groupby-like and hence more flexible.
See the whatsnew docs for a comparison with prior versions.
Pandas provides simple, powerful, and efficient functionality for performing resampling operations during frequency conversion (e.g., converting secondly data into 5-minutely data). This is extremely common in, but not limited to, financial applications.
.resample()
is a time-based groupby, followed by a reduction method on each of its groups.
Note
.resample() is similar to using a .rolling() operation with a time-based offset; see the related discussion comparing time-based rolling windows and resampling.
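A rough sketch of the difference (using a hypothetical series s, and assuming a pandas version where .rolling() accepts a time-based offset window): .rolling() evaluates a trailing window at every original timestamp, while .resample() produces one value per bin.
import numpy as np
import pandas as pd

rng = pd.date_range('1/1/2012', periods=600, freq='S')
s = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)

# Trailing 5-minute sum evaluated at every original timestamp
s.rolling('5Min').sum()

# One aggregated value per 5-minute bin
s.resample('5Min').sum()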
See the cookbook examples for some advanced strategies.
In [1]: rng = pd.date_range('1/1/2012', periods=100, freq='S')
In [2]: ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)
In [3]: ts.resample('5Min').sum()
Out[3]:
2012-01-01 26448
Freq: 5T, dtype: int64
The resample function is very flexible and allows you to specify many different parameters to control the frequency conversion and resampling operation.
The aggregation can be any function that takes an array and produces aggregated values; it is applied by calling the corresponding method on the returned Resampler:
In [4]: ts.resample('5Min').mean()
Out[4]:
2012-01-01 264.48
Freq: 5T, dtype: float64
In [5]: ts.resample('5Min').ohlc()
Out[5]:
open high low close
2012-01-01 65 498 18 101
In [6]: ts.resample('5Min').max()
Out[6]:
2012-01-01 498
Freq: 5T, dtype: int64
Any function available via dispatching is available as a method of the returned object, including sum, mean, std, sem, max, min, median, first, last, and ohlc.
For downsampling, closed
can be set to ‘left’ or ‘right’ to specify which
end of the interval is closed:
In [7]: ts.resample('5Min', closed='right').mean()
Out[7]:
2011-12-31 23:55:00 65.000000
2012-01-01 00:00:00 266.494949
Freq: 5T, dtype: float64
In [8]: ts.resample('5Min', closed='left').mean()
Out[8]:
2012-01-01 264.48
Freq: 5T, dtype: float64
Parameters like label
and loffset
are used to manipulate the resulting
labels. label
specifies whether the result is labeled with the beginning or
the end of the interval. loffset
performs a time adjustment on the output
labels.
In [9]: ts.resample('5Min').mean() # by default label='left'
Out[9]:
2012-01-01 264.48
Freq: 5T, dtype: float64
In [10]: ts.resample('5Min', label='left').mean()
Out[10]:
2012-01-01 264.48
Freq: 5T, dtype: float64
In [11]: ts.resample('5Min', label='left', loffset='1s').mean()
Out[11]:
2012-01-01 00:00:01 264.48
dtype: float64
The axis
parameter can be set to 0 or 1 and allows you to resample the
specified axis for a DataFrame.
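For example, a minimal sketch (assuming a DataFrame whose columns form a DatetimeIndex; the names here are illustrative) of resampling along the columns rather than the rows:
import numpy as np
import pandas as pd

cols = pd.date_range('1/1/2012', periods=120, freq='S')
wide = pd.DataFrame(np.random.randn(3, 120), columns=cols)

# Aggregate the columns into 1-minute bins instead of the rows
wide.resample('1Min', axis=1).mean()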
kind
can be set to ‘timestamp’ or ‘period’ to convert the resulting index
to/from time-stamp and time-span representations. By default resample
retains the input representation.
convention can be set to ‘start’ or ‘end’ when resampling period data (detailed below). It specifies how low frequency periods are converted to higher frequency periods.
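As a minimal sketch of these two options (the series names below are illustrative, not from the examples above): kind='period' labels the result with periods instead of timestamps, while convention controls whether upsampled period data is anchored at the start or the end of each low-frequency period.
import pandas as pd

# kind='period': label the 5-minute bins with a PeriodIndex instead of timestamps
rng = pd.date_range('1/1/2012', periods=100, freq='S')
ts2 = pd.Series(range(100), index=rng)
ts2.resample('5Min', kind='period').mean()

# convention: upsample yearly periods to quarterly, anchored at either end
p = pd.Series([1.0, 2.0], index=pd.period_range('2012', periods=2, freq='A'))
p.resample('Q', convention='start').asfreq()   # values land on Q1 of each year
p.resample('Q', convention='end').asfreq()     # values land on Q4 of each year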
10.1 Upsampling
For upsampling, you can specify a way to upsample and the limit
parameter to interpolate over the gaps that are created:
# from secondly to every 250 milliseconds
In [12]: ts[:2].resample('250L').asfreq()
Out[12]:
2012-01-01 00:00:00.000 65.0
2012-01-01 00:00:00.250 NaN
2012-01-01 00:00:00.500 NaN
2012-01-01 00:00:00.750 NaN
2012-01-01 00:00:01.000 234.0
Freq: 250L, dtype: float64
In [13]: ts[:2].resample('250L').ffill()
Out[13]:
2012-01-01 00:00:00.000 65
2012-01-01 00:00:00.250 65
2012-01-01 00:00:00.500 65
2012-01-01 00:00:00.750 65
2012-01-01 00:00:01.000 234
Freq: 250L, dtype: int64
In [14]: ts[:2].resample('250L').ffill(limit=2)
Out[14]:
2012-01-01 00:00:00.000 65.0
2012-01-01 00:00:00.250 65.0
2012-01-01 00:00:00.500 65.0
2012-01-01 00:00:00.750 NaN
2012-01-01 00:00:01.000 234.0
Freq: 250L, dtype: float64
10.2 Sparse Resampling
Sparse timeseries are ones where you have a lot fewer points relative to the amount of time you are looking to resample. Naively upsampling a sparse series can potentially generate lots of intermediate values. When you don't want to use a method to fill these values (e.g. when fill_method is None), the intermediate values will be filled with NaN.
Since resample is a time-based groupby, the following is a method to efficiently resample only the groups that are not all NaN.
In [15]: rng = pd.date_range('2014-1-1', periods=100, freq='D') + pd.Timedelta('1s')
In [16]: ts = pd.Series(range(100), index=rng)
If we want to resample to the full range of the series:
In [17]: ts.resample('3T').sum()
Out[17]:
2014-01-01 00:00:00 0.0
2014-01-01 00:03:00 NaN
2014-01-01 00:06:00 NaN
2014-01-01 00:09:00 NaN
...
2014-04-09 23:51:00 NaN
2014-04-09 23:54:00 NaN
2014-04-09 23:57:00 NaN
2014-04-10 00:00:00 99.0
Freq: 3T, dtype: float64
We can instead only resample those groups where we have points as follows:
In [18]: from functools import partial
In [19]: from pandas.tseries.frequencies import to_offset
In [20]: def round(t, freq):
....: freq = to_offset(freq)
....: return pd.Timestamp((t.value // freq.delta.value) * freq.delta.value)
....:
In [21]: ts.groupby(partial(round, freq='3T')).sum()
Out[21]:
2014-01-01 0
2014-01-02 1
2014-01-03 2
2014-01-04 3
..
2014-04-07 96
2014-04-08 97
2014-04-09 98
2014-04-10 99
dtype: int64
10.3 Aggregation
Similar to groupby aggregates and the window functions, aggregations on a Resampler can be applied selectively, column by column.
When resampling a DataFrame, the default is to act on all columns with the same function.
In [22]: df = pd.DataFrame(np.random.randn(1000, 3),
....: index=pd.date_range('1/1/2012', freq='S', periods=1000),
....: columns=['A', 'B', 'C'])
....:
In [23]: r = df.resample('3T')
In [24]: r.mean()
Out[24]:
A B C
2012-01-01 00:00:00 0.027477 0.045215 -0.056040
2012-01-01 00:03:00 -0.017939 -0.054207 0.050467
2012-01-01 00:06:00 -0.009675 0.024055 -0.008814
2012-01-01 00:09:00 -0.108410 0.060835 -0.029850
2012-01-01 00:12:00 0.068417 -0.134840 0.041435
2012-01-01 00:15:00 -0.054826 -0.017097 -0.004233
We can select a specific column or columns using standard getitem.
In [25]: r['A'].mean()
Out[25]:
2012-01-01 00:00:00 0.027477
2012-01-01 00:03:00 -0.017939
2012-01-01 00:06:00 -0.009675
2012-01-01 00:09:00 -0.108410
2012-01-01 00:12:00 0.068417
2012-01-01 00:15:00 -0.054826
Freq: 3T, Name: A, dtype: float64
In [26]: r[['A','B']].mean()
Out[26]:
A B
2012-01-01 00:00:00 0.027477 0.045215
2012-01-01 00:03:00 -0.017939 -0.054207
2012-01-01 00:06:00 -0.009675 0.024055
2012-01-01 00:09:00 -0.108410 0.060835
2012-01-01 00:12:00 0.068417 -0.134840
2012-01-01 00:15:00 -0.054826 -0.017097
You can pass a list or dict of functions to do aggregation with, outputting a DataFrame:
In [27]: r['A'].agg([np.sum, np.mean, np.std])
Out[27]:
sum mean std
2012-01-01 00:00:00 4.945870 0.027477 0.963542
2012-01-01 00:03:00 -3.229049 -0.017939 0.963678
2012-01-01 00:06:00 -1.741514 -0.009675 1.041111
2012-01-01 00:09:00 -19.513772 -0.108410 0.999055
2012-01-01 00:12:00 12.315043 0.068417 1.055563
2012-01-01 00:15:00 -5.482622 -0.054826 1.082826
If a dict is passed, the keys will be used to name the columns. Otherwise the function’s name (stored in the function object) will be used.
In [28]: r['A'].agg({'result1' : np.sum,
....: 'result2' : np.mean})
....:
Out[28]:
result2 result1
2012-01-01 00:00:00 0.027477 4.945870
2012-01-01 00:03:00 -0.017939 -3.229049
2012-01-01 00:06:00 -0.009675 -1.741514
2012-01-01 00:09:00 -0.108410 -19.513772
2012-01-01 00:12:00 0.068417 12.315043
2012-01-01 00:15:00 -0.054826 -5.482622
On a resampled DataFrame, you can pass a list of functions to apply to each column, which produces an aggregated result with a hierarchical index:
In [29]: r.agg([np.sum, np.mean])
Out[29]:
A B C \
sum mean sum mean sum
2012-01-01 00:00:00 4.945870 0.027477 8.138690 0.045215 -10.087267
2012-01-01 00:03:00 -3.229049 -0.017939 -9.757335 -0.054207 9.084093
2012-01-01 00:06:00 -1.741514 -0.009675 4.329810 0.024055 -1.586513
2012-01-01 00:09:00 -19.513772 -0.108410 10.950261 0.060835 -5.373065
2012-01-01 00:12:00 12.315043 0.068417 -24.271279 -0.134840 7.458211
2012-01-01 00:15:00 -5.482622 -0.054826 -1.709680 -0.017097 -0.423342
mean
2012-01-01 00:00:00 -0.056040
2012-01-01 00:03:00 0.050467
2012-01-01 00:06:00 -0.008814
2012-01-01 00:09:00 -0.029850
2012-01-01 00:12:00 0.041435
2012-01-01 00:15:00 -0.004233
By passing a dict to aggregate
you can apply a different aggregation to the
columns of a DataFrame:
In [30]: r.agg({'A' : np.sum,
....: 'B' : lambda x: np.std(x, ddof=1)})
....:
Out[30]:
A B
2012-01-01 00:00:00 4.945870 1.020321
2012-01-01 00:03:00 -3.229049 1.109096
2012-01-01 00:06:00 -1.741514 1.081234
2012-01-01 00:09:00 -19.513772 1.012368
2012-01-01 00:12:00 12.315043 1.018528
2012-01-01 00:15:00 -5.482622 1.044362
The function names can also be strings. In order for a string to be valid, it must be implemented on the Resampler object:
In [31]: r.agg({'A' : 'sum', 'B' : 'std'})
Out[31]:
A B
2012-01-01 00:00:00 4.945870 1.020321
2012-01-01 00:03:00 -3.229049 1.109096
2012-01-01 00:06:00 -1.741514 1.081234
2012-01-01 00:09:00 -19.513772 1.012368
2012-01-01 00:12:00 12.315043 1.018528
2012-01-01 00:15:00 -5.482622 1.044362
Furthermore, you can also specify multiple aggregation functions for each column separately.
In [32]: r.agg({'A' : ['sum','std'], 'B' : ['mean','std'] })
Out[32]:
A B
sum std mean std
2012-01-01 00:00:00 4.945870 0.963542 0.045215 1.020321
2012-01-01 00:03:00 -3.229049 0.963678 -0.054207 1.109096
2012-01-01 00:06:00 -1.741514 1.041111 0.024055 1.081234
2012-01-01 00:09:00 -19.513772 0.999055 0.060835 1.012368
2012-01-01 00:12:00 12.315043 1.055563 -0.134840 1.018528
2012-01-01 00:15:00 -5.482622 1.082826 -0.017097 1.044362