7.1 Reshaping by pivoting DataFrame objects
import pandas.util.testing as tm; tm.N = 3
def unpivot(frame):
N, K = frame.shape
data = {'value' : frame.values.ravel('F'),
'variable' : np.asarray(frame.columns).repeat(N),
'date' : np.tile(np.asarray(frame.index), K)}
return pd.DataFrame(data, columns=['date', 'variable', 'value'])
df = unpivot(tm.makeTimeDataFrame())
Data is often stored in CSV files or databases in so-called “stacked” or “record” format:
In [1]: df
Out[1]:
date variable value
0 2000-01-03 A 0.469112
1 2000-01-04 A -0.282863
2 2000-01-05 A -1.509059
3 2000-01-03 B -1.135632
.. ... ... ...
8 2000-01-05 C -0.861849
9 2000-01-03 D -2.104569
10 2000-01-04 D -0.494929
11 2000-01-05 D 1.071804
[12 rows x 3 columns]
To select out everything for variable A
we could do:
In [2]: df[df['variable'] == 'A']
Out[2]:
date variable value
0 2000-01-03 A 0.469112
1 2000-01-04 A -0.282863
2 2000-01-05 A -1.509059
But suppose we wish to do time series operations with the variables. A better
representation would be where the columns
are the unique variables and an
index
of dates identifies individual observations. To reshape the data into
this form, use the pivot
function:
In [3]: df
Out[3]:
date variable value
0 2000-01-03 A 0.469112
1 2000-01-04 A -0.282863
2 2000-01-05 A -1.509059
3 2000-01-03 B -1.135632
.. ... ... ...
8 2000-01-05 C -0.861849
9 2000-01-03 D -2.104569
10 2000-01-04 D -0.494929
11 2000-01-05 D 1.071804
[12 rows x 3 columns]
In [4]: df.pivot(index='date', columns='variable', values='value')
Out[4]:
variable A B C D
date
2000-01-03 0.469112 -1.135632 0.119209 -2.104569
2000-01-04 -0.282863 1.212112 -1.044236 -0.494929
2000-01-05 -1.509059 -0.173215 -0.861849 1.071804
If the values
argument is omitted, and the input DataFrame has more than
one column of values which are not used as column or index inputs to pivot
,
then the resulting “pivoted” DataFrame will have hierarchical columns whose topmost level indicates the respective value
column:
In [5]: df['value2'] = df['value'] * 2
In [6]: df
Out[6]:
date variable value value2
0 2000-01-03 A 0.469112 0.938225
1 2000-01-04 A -0.282863 -0.565727
2 2000-01-05 A -1.509059 -3.018117
3 2000-01-03 B -1.135632 -2.271265
.. ... ... ... ...
8 2000-01-05 C -0.861849 -1.723698
9 2000-01-03 D -2.104569 -4.209138
10 2000-01-04 D -0.494929 -0.989859
11 2000-01-05 D 1.071804 2.143608
[12 rows x 4 columns]
In [7]: pivoted = df.pivot('date', 'variable')
In [8]: pivoted
Out[8]:
value value2 \
variable A B C D A B
date
2000-01-03 0.469112 -1.135632 0.119209 -2.104569 0.938225 -2.271265
2000-01-04 -0.282863 1.212112 -1.044236 -0.494929 -0.565727 2.424224
2000-01-05 -1.509059 -0.173215 -0.861849 1.071804 -3.018117 -0.346429
variable C D
date
2000-01-03 0.238417 -4.209138
2000-01-04 -2.088472 -0.989859
2000-01-05 -1.723698 2.143608
You of course can then select subsets from the pivoted DataFrame:
In [9]: pivoted['value2']
Out[9]:
variable A B C D
date
2000-01-03 0.938225 -2.271265 0.238417 -4.209138
2000-01-04 -0.565727 2.424224 -2.088472 -0.989859
2000-01-05 -3.018117 -0.346429 -1.723698 2.143608
Note that this returns a view on the underlying data in the case where the data are homogeneously-typed.