7.2 Reshaping by stacking and unstacking
Closely related to the pivot
function are the related stack
and
unstack
functions currently available on Series and DataFrame. These
functions are designed to work together with MultiIndex
objects (see the
section on hierarchical indexing). Here are
essentially what these functions do:
stack
: “pivot” a level of the (possibly hierarchical) column labels, returning a DataFrame with an index with a new inner-most level of row labels.unstack
: inverse operation fromstack
: “pivot” a level of the (possibly hierarchical) row index to the column axis, producing a reshaped DataFrame with a new inner-most level of column labels.
The clearest way to explain is by example. Let’s take a prior example data set from the hierarchical indexing section:
In [1]: tuples = list(zip(*[['bar', 'bar', 'baz', 'baz',
...: 'foo', 'foo', 'qux', 'qux'],
...: ['one', 'two', 'one', 'two',
...: 'one', 'two', 'one', 'two']]))
...:
In [2]: index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
In [3]: df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])
In [4]: df
Out[4]:
A B
first second
bar one 0.469112 -0.282863
two -1.509059 -1.135632
baz one 1.212112 -0.173215
two 0.119209 -1.044236
foo one -0.861849 -2.104569
two -0.494929 1.071804
qux one 0.721555 -0.706771
two -1.039575 0.271860
In [5]: df2 = df[:4]
In [6]: df2
Out[6]:
A B
first second
bar one 0.469112 -0.282863
two -1.509059 -1.135632
baz one 1.212112 -0.173215
two 0.119209 -1.044236
The stack
function “compresses” a level in the DataFrame’s columns to
produce either:
- A Series, in the case of a simple column Index
- A DataFrame, in the case of a
MultiIndex
in the columns
If the columns have a MultiIndex
, you can choose which level to stack. The
stacked level becomes the new lowest level in a MultiIndex
on the columns:
In [7]: stacked = df2.stack()
In [8]: stacked
Out[8]:
first second
bar one A 0.469112
B -0.282863
two A -1.509059
B -1.135632
baz one A 1.212112
B -0.173215
two A 0.119209
B -1.044236
dtype: float64
With a “stacked” DataFrame or Series (having a MultiIndex
as the
index
), the inverse operation of stack
is unstack
, which by default
unstacks the last level:
In [9]: stacked.unstack() # last level (default)
Out[9]:
A B
first second
bar one 0.469112 -0.282863
two -1.509059 -1.135632
baz one 1.212112 -0.173215
two 0.119209 -1.044236
In [10]: stacked.unstack(1) # 2nd level
Out[10]:
second one two
first
bar A 0.469112 -1.509059
B -0.282863 -1.135632
baz A 1.212112 0.119209
B -0.173215 -1.044236
In [11]: stacked.unstack(0) # 1st level
Out[11]:
first bar baz
second
one A 0.469112 1.212112
B -0.282863 -0.173215
two A -1.509059 0.119209
B -1.135632 -1.044236
If the indexes have names, you can use the level names instead of specifying the level numbers:
In [12]: stacked
Out[12]:
first second
bar one A 0.469112
B -0.282863
two A -1.509059
B -1.135632
baz one A 1.212112
B -0.173215
two A 0.119209
B -1.044236
dtype: float64
In [13]: stacked.unstack('second')
Out[13]:
second one two
first
bar A 0.469112 -1.509059
B -0.282863 -1.135632
baz A 1.212112 0.119209
B -0.173215 -1.044236
Notice that the stack
and unstack
methods implicitly sort the index
levels involved. Hence a call to stack
and then unstack
, or viceversa,
will result in a sorted copy of the original DataFrame or Series:
In [14]: index = pd.MultiIndex.from_product([[2,1], ['a', 'b']])
In [15]: df = pd.DataFrame(np.random.randn(4), index=index, columns=['A'])
In [16]: df
Out[16]:
A
2 a -0.424972
b 0.567020
1 a 0.276232
b -1.087401
In [17]: all(df.unstack().stack() == df.sort_index())
Out[17]: True
while the above code will raise a TypeError
if the call to sort_index
is
removed.
7.2.1 Multiple Levels
You may also stack or unstack more than one level at a time by passing a list of levels, in which case the end result is as if each level in the list were processed individually.
In [18]: columns = pd.MultiIndex.from_tuples([
....: ('A', 'cat', 'long'), ('B', 'cat', 'long'),
....: ('A', 'dog', 'short'), ('B', 'dog', 'short')
....: ],
....: names=['exp', 'animal', 'hair_length']
....: )
....:
In [19]: df = pd.DataFrame(np.random.randn(4, 4), columns=columns)
In [20]: df
Out[20]:
exp A B A B
animal cat cat dog dog
hair_length long long short short
0 -0.673690 0.113648 -1.478427 0.524988
1 0.404705 0.577046 -1.715002 -1.039268
2 -0.370647 -1.157892 -1.344312 0.844885
3 1.075770 -0.109050 1.643563 -1.469388
In [21]: df.stack(level=['animal', 'hair_length'])
Out[21]:
exp A B
animal hair_length
0 cat long -0.673690 0.113648
dog short -1.478427 0.524988
1 cat long 0.404705 0.577046
dog short -1.715002 -1.039268
2 cat long -0.370647 -1.157892
dog short -1.344312 0.844885
3 cat long 1.075770 -0.109050
dog short 1.643563 -1.469388
The list of levels can contain either level names or level numbers (but not a mixture of the two).
# df.stack(level=['animal', 'hair_length'])
# from above is equivalent to:
In [22]: df.stack(level=[1, 2])
Out[22]:
exp A B
animal hair_length
0 cat long -0.673690 0.113648
dog short -1.478427 0.524988
1 cat long 0.404705 0.577046
dog short -1.715002 -1.039268
2 cat long -0.370647 -1.157892
dog short -1.344312 0.844885
3 cat long 1.075770 -0.109050
dog short 1.643563 -1.469388
7.2.2 Missing Data
These functions are intelligent about handling missing data and do not expect
each subgroup within the hierarchical index to have the same set of labels.
They also can handle the index being unsorted (but you can make it sorted by
calling sort_index
, of course). Here is a more complex example:
In [23]: columns = pd.MultiIndex.from_tuples([('A', 'cat'), ('B', 'dog'),
....: ('B', 'cat'), ('A', 'dog')],
....: names=['exp', 'animal'])
....:
In [24]: index = pd.MultiIndex.from_product([('bar', 'baz', 'foo', 'qux'),
....: ('one', 'two')],
....: names=['first', 'second'])
....:
In [25]: df = pd.DataFrame(np.random.randn(8, 4), index=index, columns=columns)
In [26]: df2 = df.ix[[0, 1, 2, 4, 5, 7]]
In [27]: df2
Out[27]:
exp A B A
animal cat dog cat dog
first second
bar one 0.357021 -0.674600 -1.776904 -0.968914
two -1.294524 0.413738 0.276662 -0.472035
baz one -0.013960 -0.362543 -0.006154 -0.923061
foo one 1.431256 1.340309 -1.170299 -0.226169
two 0.410835 0.813850 0.132003 -0.827317
qux two -1.413681 1.607920 1.024180 0.569605
As mentioned above, stack
can be called with a level
argument to select
which level in the columns to stack:
In [28]: df2.stack('exp')
Out[28]:
animal cat dog
first second exp
bar one A 0.357021 -0.968914
B -1.776904 -0.674600
two A -1.294524 -0.472035
B 0.276662 0.413738
... ... ...
foo two A 0.410835 -0.827317
B 0.132003 0.813850
qux two A -1.413681 0.569605
B 1.024180 1.607920
[12 rows x 2 columns]
In [29]: df2.stack('animal')
Out[29]:
exp A B
first second animal
bar one cat 0.357021 -1.776904
dog -0.968914 -0.674600
two cat -1.294524 0.276662
dog -0.472035 0.413738
... ... ...
foo two cat 0.410835 0.132003
dog -0.827317 0.813850
qux two cat -1.413681 1.024180
dog 0.569605 1.607920
[12 rows x 2 columns]
Unstacking can result in missing values if subgroups do not have the same
set of labels. By default, missing values will be replaced with the default
fill value for that data type, NaN
for float, NaT
for datetimelike,
etc. For integer types, by default data will converted to float and missing
values will be set to NaN
.
In [30]: df3 = df.iloc[[0, 1, 4, 7], [1, 2]]
In [31]: df3
Out[31]:
exp B
animal dog cat
first second
bar one -0.674600 -1.776904
two 0.413738 0.276662
foo one 1.340309 -1.170299
qux two 1.607920 1.024180
In [32]: df3.unstack()
Out[32]:
exp B
animal dog cat
second one two one two
first
bar -0.674600 0.413738 -1.776904 0.276662
foo 1.340309 NaN -1.170299 NaN
qux NaN 1.607920 NaN 1.024180
Alternatively, unstack takes an optional fill_value
argument, for specifying
the value of missing data.
In [33]: df3.unstack(fill_value=-1e9)
Out[33]:
exp B
animal dog cat
second one two one two
first
bar -6.746001e-01 4.137381e-01 -1.776904e+00 2.766617e-01
foo 1.340309e+00 -1.000000e+09 -1.170299e+00 -1.000000e+09
qux -1.000000e+09 1.607920e+00 -1.000000e+09 1.024180e+00
7.2.3 With a MultiIndex
Unstacking when the columns are a MultiIndex
is also careful about doing
the right thing:
In [34]: df
Out[34]:
exp A B A
animal cat dog cat dog
first second
bar one 0.357021 -0.674600 -1.776904 -0.968914
two -1.294524 0.413738 0.276662 -0.472035
baz one -0.013960 -0.362543 -0.006154 -0.923061
two 0.895717 0.805244 -1.206412 2.565646
foo one 1.431256 1.340309 -1.170299 -0.226169
two 0.410835 0.813850 0.132003 -0.827317
qux one -0.076467 -1.187678 1.130127 -1.436737
two -1.413681 1.607920 1.024180 0.569605
In [35]: df[:3].unstack(0)
Out[35]:
exp A B A \
animal cat dog cat dog
first bar baz bar baz bar baz bar
second
one 0.357021 -0.01396 -0.674600 -0.362543 -1.776904 -0.006154 -0.968914
two -1.294524 NaN 0.413738 NaN 0.276662 NaN -0.472035
exp
animal
first baz
second
one -0.923061
two NaN
In [36]: df2
Out[36]:
exp A B A
animal cat dog cat dog
first second
bar one 0.357021 -0.674600 -1.776904 -0.968914
two -1.294524 0.413738 0.276662 -0.472035
baz one -0.013960 -0.362543 -0.006154 -0.923061
foo one 1.431256 1.340309 -1.170299 -0.226169
two 0.410835 0.813850 0.132003 -0.827317
qux two -1.413681 1.607920 1.024180 0.569605
In [37]: df2.unstack(1)
Out[37]:
exp A B A \
animal cat dog cat dog
second one two one two one two one
first
bar 0.357021 -1.294524 -0.674600 0.413738 -1.776904 0.276662 -0.968914
baz -0.013960 NaN -0.362543 NaN -0.006154 NaN -0.923061
foo 1.431256 0.410835 1.340309 0.813850 -1.170299 0.132003 -0.226169
qux NaN -1.413681 NaN 1.607920 NaN 1.024180 NaN
exp
animal
second two
first
bar -0.472035
baz NaN
foo -0.827317
qux 0.569605