6.3 Timeseries friendly merging

6.3.1 Merging Ordered Data

A merge_ordered() function allows combining time series and other ordered data. In particular it has an optional fill_method keyword to fill/interpolate missing data:

In [1]: left = pd.DataFrame({'k': ['K0', 'K1', 'K1', 'K2'],
   ...:                      'lv': [1, 2, 3, 4],
   ...:                      's': ['a', 'b', 'c', 'd']})
   ...: 

In [2]: right = pd.DataFrame({'k': ['K1', 'K2', 'K4'],
   ...:                       'rv': [1, 2, 3]})
   ...: 

In [3]: pd.merge_ordered(left, right, fill_method='ffill', left_by='s')
Out[3]: 
     k   lv  s   rv
0   K0  1.0  a  NaN
1   K1  1.0  a  1.0
2   K2  1.0  a  2.0
3   K4  1.0  a  3.0
4   K1  2.0  b  1.0
5   K2  2.0  b  2.0
6   K4  2.0  b  3.0
7   K1  3.0  c  1.0
8   K2  3.0  c  2.0
9   K4  3.0  c  3.0
10  K1  NaN  d  1.0
11  K2  4.0  d  2.0
12  K4  4.0  d  3.0

6.3.2 Merging AsOf

New in version 0.19.0.

A merge_asof() is similar to an ordered left-join except that we match on nearest key rather than equal keys. For each row in the left DataFrame, we select the last row in the right DataFrame whose on key is less than the left’s key. Both DataFrames must be sorted by the key.

Optionally an asof merge can perform a group-wise merge. This matches the by key equally, in addition to the nearest match on the on key.

For example; we might have trades and quotes and we want to asof merge them.

In [4]: trades = pd.DataFrame({
   ...:     'time': pd.to_datetime(['20160525 13:30:00.023',
   ...:                             '20160525 13:30:00.038',
   ...:                             '20160525 13:30:00.048',
   ...:                             '20160525 13:30:00.048',
   ...:                             '20160525 13:30:00.048']),
   ...:     'ticker': ['MSFT', 'MSFT',
   ...:                'GOOG', 'GOOG', 'AAPL'],
   ...:     'price': [51.95, 51.95,
   ...:               720.77, 720.92, 98.00],
   ...:     'quantity': [75, 155,
   ...:                  100, 100, 100]},
   ...:     columns=['time', 'ticker', 'price', 'quantity'])
   ...: 

In [5]: quotes = pd.DataFrame({
   ...:     'time': pd.to_datetime(['20160525 13:30:00.023',
   ...:                             '20160525 13:30:00.023',
   ...:                             '20160525 13:30:00.030',
   ...:                             '20160525 13:30:00.041',
   ...:                             '20160525 13:30:00.048',
   ...:                             '20160525 13:30:00.049',
   ...:                             '20160525 13:30:00.072',
   ...:                             '20160525 13:30:00.075']),
   ...:     'ticker': ['GOOG', 'MSFT', 'MSFT',
   ...:                'MSFT', 'GOOG', 'AAPL', 'GOOG',
   ...:                'MSFT'],
   ...:     'bid': [720.50, 51.95, 51.97, 51.99,
   ...:             720.50, 97.99, 720.50, 52.01],
   ...:     'ask': [720.93, 51.96, 51.98, 52.00,
   ...:             720.93, 98.01, 720.88, 52.03]},
   ...:     columns=['time', 'ticker', 'bid', 'ask'])
   ...: 
In [6]: trades
Out[6]: 
                     time ticker   price  quantity
0 2016-05-25 13:30:00.023   MSFT   51.95        75
1 2016-05-25 13:30:00.038   MSFT   51.95       155
2 2016-05-25 13:30:00.048   GOOG  720.77       100
3 2016-05-25 13:30:00.048   GOOG  720.92       100
4 2016-05-25 13:30:00.048   AAPL   98.00       100

In [7]: quotes
Out[7]: 
                     time ticker     bid     ask
0 2016-05-25 13:30:00.023   GOOG  720.50  720.93
1 2016-05-25 13:30:00.023   MSFT   51.95   51.96
2 2016-05-25 13:30:00.030   MSFT   51.97   51.98
3 2016-05-25 13:30:00.041   MSFT   51.99   52.00
4 2016-05-25 13:30:00.048   GOOG  720.50  720.93
5 2016-05-25 13:30:00.049   AAPL   97.99   98.01
6 2016-05-25 13:30:00.072   GOOG  720.50  720.88
7 2016-05-25 13:30:00.075   MSFT   52.01   52.03

By default we are taking the asof of the quotes.

In [8]: pd.merge_asof(trades, quotes,
   ...:               on='time',
   ...:               by='ticker')
   ...: 
Out[8]: 
                     time ticker   price  quantity     bid     ask
0 2016-05-25 13:30:00.023   MSFT   51.95        75   51.95   51.96
1 2016-05-25 13:30:00.038   MSFT   51.95       155   51.97   51.98
2 2016-05-25 13:30:00.048   GOOG  720.77       100  720.50  720.93
3 2016-05-25 13:30:00.048   GOOG  720.92       100  720.50  720.93
4 2016-05-25 13:30:00.048   AAPL   98.00       100     NaN     NaN

We only asof within 2ms betwen the quote time and the trade time.

In [9]: pd.merge_asof(trades, quotes,
   ...:               on='time',
   ...:               by='ticker',
   ...:               tolerance=pd.Timedelta('2ms'))
   ...: 
Out[9]: 
                     time ticker   price  quantity     bid     ask
0 2016-05-25 13:30:00.023   MSFT   51.95        75   51.95   51.96
1 2016-05-25 13:30:00.038   MSFT   51.95       155     NaN     NaN
2 2016-05-25 13:30:00.048   GOOG  720.77       100  720.50  720.93
3 2016-05-25 13:30:00.048   GOOG  720.92       100  720.50  720.93
4 2016-05-25 13:30:00.048   AAPL   98.00       100     NaN     NaN

We only asof within 10ms betwen the quote time and the trade time and we exclude exact matches on time. Note that though we exclude the exact matches (of the quotes), prior quotes DO propogate to that point in time.

In [10]: pd.merge_asof(trades, quotes,
   ....:               on='time',
   ....:               by='ticker',
   ....:               tolerance=pd.Timedelta('10ms'),
   ....:               allow_exact_matches=False)
   ....: 
Out[10]: 
                     time ticker   price  quantity     bid     ask
0 2016-05-25 13:30:00.023   MSFT   51.95        75     NaN     NaN
1 2016-05-25 13:30:00.038   MSFT   51.95       155   51.97   51.98
2 2016-05-25 13:30:00.048   GOOG  720.77       100  720.50  720.93
3 2016-05-25 13:30:00.048   GOOG  720.92       100  720.50  720.93
4 2016-05-25 13:30:00.048   AAPL   98.00       100     NaN     NaN