6.3 Timeseries friendly merging
6.3.1 Merging Ordered Data
A merge_ordered()
function allows combining time series and other
ordered data. In particular it has an optional fill_method
keyword to
fill/interpolate missing data:
In [1]: left = pd.DataFrame({'k': ['K0', 'K1', 'K1', 'K2'],
...: 'lv': [1, 2, 3, 4],
...: 's': ['a', 'b', 'c', 'd']})
...:
In [2]: right = pd.DataFrame({'k': ['K1', 'K2', 'K4'],
...: 'rv': [1, 2, 3]})
...:
In [3]: pd.merge_ordered(left, right, fill_method='ffill', left_by='s')
Out[3]:
k lv s rv
0 K0 1.0 a NaN
1 K1 1.0 a 1.0
2 K2 1.0 a 2.0
3 K4 1.0 a 3.0
4 K1 2.0 b 1.0
5 K2 2.0 b 2.0
6 K4 2.0 b 3.0
7 K1 3.0 c 1.0
8 K2 3.0 c 2.0
9 K4 3.0 c 3.0
10 K1 NaN d 1.0
11 K2 4.0 d 2.0
12 K4 4.0 d 3.0
6.3.2 Merging AsOf
New in version 0.19.0.
A merge_asof()
is similar to an ordered left-join except that we match on nearest key rather than equal keys. For each row in the left
DataFrame, we select the last row in the right
DataFrame whose on
key is less than the left’s key. Both DataFrames must be sorted by the key.
Optionally an asof merge can perform a group-wise merge. This matches the by
key equally,
in addition to the nearest match on the on
key.
For example; we might have trades
and quotes
and we want to asof
merge them.
In [4]: trades = pd.DataFrame({
...: 'time': pd.to_datetime(['20160525 13:30:00.023',
...: '20160525 13:30:00.038',
...: '20160525 13:30:00.048',
...: '20160525 13:30:00.048',
...: '20160525 13:30:00.048']),
...: 'ticker': ['MSFT', 'MSFT',
...: 'GOOG', 'GOOG', 'AAPL'],
...: 'price': [51.95, 51.95,
...: 720.77, 720.92, 98.00],
...: 'quantity': [75, 155,
...: 100, 100, 100]},
...: columns=['time', 'ticker', 'price', 'quantity'])
...:
In [5]: quotes = pd.DataFrame({
...: 'time': pd.to_datetime(['20160525 13:30:00.023',
...: '20160525 13:30:00.023',
...: '20160525 13:30:00.030',
...: '20160525 13:30:00.041',
...: '20160525 13:30:00.048',
...: '20160525 13:30:00.049',
...: '20160525 13:30:00.072',
...: '20160525 13:30:00.075']),
...: 'ticker': ['GOOG', 'MSFT', 'MSFT',
...: 'MSFT', 'GOOG', 'AAPL', 'GOOG',
...: 'MSFT'],
...: 'bid': [720.50, 51.95, 51.97, 51.99,
...: 720.50, 97.99, 720.50, 52.01],
...: 'ask': [720.93, 51.96, 51.98, 52.00,
...: 720.93, 98.01, 720.88, 52.03]},
...: columns=['time', 'ticker', 'bid', 'ask'])
...:
In [6]: trades
Out[6]:
time ticker price quantity
0 2016-05-25 13:30:00.023 MSFT 51.95 75
1 2016-05-25 13:30:00.038 MSFT 51.95 155
2 2016-05-25 13:30:00.048 GOOG 720.77 100
3 2016-05-25 13:30:00.048 GOOG 720.92 100
4 2016-05-25 13:30:00.048 AAPL 98.00 100
In [7]: quotes
Out[7]:
time ticker bid ask
0 2016-05-25 13:30:00.023 GOOG 720.50 720.93
1 2016-05-25 13:30:00.023 MSFT 51.95 51.96
2 2016-05-25 13:30:00.030 MSFT 51.97 51.98
3 2016-05-25 13:30:00.041 MSFT 51.99 52.00
4 2016-05-25 13:30:00.048 GOOG 720.50 720.93
5 2016-05-25 13:30:00.049 AAPL 97.99 98.01
6 2016-05-25 13:30:00.072 GOOG 720.50 720.88
7 2016-05-25 13:30:00.075 MSFT 52.01 52.03
By default we are taking the asof of the quotes.
In [8]: pd.merge_asof(trades, quotes,
...: on='time',
...: by='ticker')
...:
Out[8]:
time ticker price quantity bid ask
0 2016-05-25 13:30:00.023 MSFT 51.95 75 51.95 51.96
1 2016-05-25 13:30:00.038 MSFT 51.95 155 51.97 51.98
2 2016-05-25 13:30:00.048 GOOG 720.77 100 720.50 720.93
3 2016-05-25 13:30:00.048 GOOG 720.92 100 720.50 720.93
4 2016-05-25 13:30:00.048 AAPL 98.00 100 NaN NaN
We only asof within 2ms
betwen the quote time and the trade time.
In [9]: pd.merge_asof(trades, quotes,
...: on='time',
...: by='ticker',
...: tolerance=pd.Timedelta('2ms'))
...:
Out[9]:
time ticker price quantity bid ask
0 2016-05-25 13:30:00.023 MSFT 51.95 75 51.95 51.96
1 2016-05-25 13:30:00.038 MSFT 51.95 155 NaN NaN
2 2016-05-25 13:30:00.048 GOOG 720.77 100 720.50 720.93
3 2016-05-25 13:30:00.048 GOOG 720.92 100 720.50 720.93
4 2016-05-25 13:30:00.048 AAPL 98.00 100 NaN NaN
We only asof within 10ms
betwen the quote time and the trade time and we exclude exact matches on time.
Note that though we exclude the exact matches (of the quotes), prior quotes DO propogate to that point
in time.
In [10]: pd.merge_asof(trades, quotes,
....: on='time',
....: by='ticker',
....: tolerance=pd.Timedelta('10ms'),
....: allow_exact_matches=False)
....:
Out[10]:
time ticker price quantity bid ask
0 2016-05-25 13:30:00.023 MSFT 51.95 75 NaN NaN
1 2016-05-25 13:30:00.038 MSFT 51.95 155 51.97 51.98
2 2016-05-25 13:30:00.048 GOOG 720.77 100 720.50 720.93
3 2016-05-25 13:30:00.048 GOOG 720.92 100 720.50 720.93
4 2016-05-25 13:30:00.048 AAPL 98.00 100 NaN NaN