7.1 Reshaping by pivoting DataFrame objects
import pandas.util.testing as tm; tm.N = 3
def unpivot(frame):
N, K = frame.shape
data = {'value' : frame.values.ravel('F'),
'variable' : np.asarray(frame.columns).repeat(N),
'date' : np.tile(np.asarray(frame.index), K)}
return pd.DataFrame(data, columns=['date', 'variable', 'value'])
df = unpivot(tm.makeTimeDataFrame())
Data is often stored in CSV files or databases in so-called “stacked” or “record” format:
In [1]: df
Out[1]:
date variable value
0 2000-01-03 A 0.469112
1 2000-01-04 A -0.282863
2 2000-01-05 A -1.509059
3 2000-01-03 B -1.135632
.. ... ... ...
8 2000-01-05 C -0.861849
9 2000-01-03 D -2.104569
10 2000-01-04 D -0.494929
11 2000-01-05 D 1.071804
[12 rows x 3 columns]
To select out everything for variable A we could do:
In [2]: df[df['variable'] == 'A']
Out[2]:
date variable value
0 2000-01-03 A 0.469112
1 2000-01-04 A -0.282863
2 2000-01-05 A -1.509059
But suppose we wish to do time series operations with the variables. A better
representation would be where the columns are the unique variables and an
index of dates identifies individual observations. To reshape the data into
this form, use the pivot function:
In [3]: df
Out[3]:
date variable value
0 2000-01-03 A 0.469112
1 2000-01-04 A -0.282863
2 2000-01-05 A -1.509059
3 2000-01-03 B -1.135632
.. ... ... ...
8 2000-01-05 C -0.861849
9 2000-01-03 D -2.104569
10 2000-01-04 D -0.494929
11 2000-01-05 D 1.071804
[12 rows x 3 columns]
In [4]: df.pivot(index='date', columns='variable', values='value')
Out[4]:
variable A B C D
date
2000-01-03 0.469112 -1.135632 0.119209 -2.104569
2000-01-04 -0.282863 1.212112 -1.044236 -0.494929
2000-01-05 -1.509059 -0.173215 -0.861849 1.071804
If the values argument is omitted, and the input DataFrame has more than
one column of values which are not used as column or index inputs to pivot,
then the resulting “pivoted” DataFrame will have hierarchical columns whose topmost level indicates the respective value
column:
In [5]: df['value2'] = df['value'] * 2
In [6]: df
Out[6]:
date variable value value2
0 2000-01-03 A 0.469112 0.938225
1 2000-01-04 A -0.282863 -0.565727
2 2000-01-05 A -1.509059 -3.018117
3 2000-01-03 B -1.135632 -2.271265
.. ... ... ... ...
8 2000-01-05 C -0.861849 -1.723698
9 2000-01-03 D -2.104569 -4.209138
10 2000-01-04 D -0.494929 -0.989859
11 2000-01-05 D 1.071804 2.143608
[12 rows x 4 columns]
In [7]: pivoted = df.pivot('date', 'variable')
In [8]: pivoted
Out[8]:
value value2 \
variable A B C D A B
date
2000-01-03 0.469112 -1.135632 0.119209 -2.104569 0.938225 -2.271265
2000-01-04 -0.282863 1.212112 -1.044236 -0.494929 -0.565727 2.424224
2000-01-05 -1.509059 -0.173215 -0.861849 1.071804 -3.018117 -0.346429
variable C D
date
2000-01-03 0.238417 -4.209138
2000-01-04 -2.088472 -0.989859
2000-01-05 -1.723698 2.143608
You of course can then select subsets from the pivoted DataFrame:
In [9]: pivoted['value2']
Out[9]:
variable A B C D
date
2000-01-03 0.938225 -2.271265 0.238417 -4.209138
2000-01-04 -0.565727 2.424224 -2.088472 -0.989859
2000-01-05 -3.018117 -0.346429 -1.723698 2.143608
Note that this returns a view on the underlying data in the case where the data are homogeneously-typed.