7 Reindexing and altering labels
reindex()
is the fundamental data alignment method in pandas.
It is used to implement nearly all other features relying on label-alignment
functionality. To reindex means to conform the data to match a given set of
labels along a particular axis. This accomplishes several things:
- Reorders the existing data to match a new set of labels
- Inserts missing value (NA) markers in label locations where no data for that label existed
- If specified, fill data for missing labels using logic (highly relevant to working with time series data)
Here is a simple example:
In [1]: s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
In [2]: s
Out[2]:
a 0.734560
b -0.445120
c -0.703433
d 0.320412
e 0.185202
dtype: float64
In [3]: s.reindex(['e', 'b', 'f', 'd'])
Out[3]:
e 0.185202
b -0.445120
f NaN
d 0.320412
dtype: float64
Here, the f
label was not contained in the Series and hence appears as
NaN
in the result.
With a DataFrame, you can simultaneously reindex the index and columns:
In [4]: df
Out[4]:
one three two
a 0.851097 NaN -0.429037
b 0.266049 -0.330979 0.963385
c 1.117346 -0.409168 2.243459
d NaN -0.305334 -0.432789
In [5]: df.reindex(index=['c', 'f', 'b'], columns=['three', 'two', 'one'])
Out[5]:
three two one
c -0.409168 2.243459 1.117346
f NaN NaN NaN
b -0.330979 0.963385 0.266049
For convenience, you may utilize the reindex_axis()
method, which
takes the labels and a keyword axis
parameter.
Note that the Index
objects containing the actual axis labels can be
shared between objects. So if we have a Series and a DataFrame, the
following can be done:
In [6]: rs = s.reindex(df.index)
In [7]: rs
Out[7]:
a 0.734560
b -0.445120
c -0.703433
d 0.320412
dtype: float64
In [8]: rs.index is df.index
Out[8]: True
This means that the reindexed Series’s index is the same Python object as the DataFrame’s index.
See also
MultiIndex / Advanced Indexing is an even more concise way of doing reindexing.
Note
When writing performance-sensitive code, there is a good reason to spend
some time becoming a reindexing ninja: many operations are faster on
pre-aligned data. Adding two unaligned DataFrames internally triggers a
reindexing step. For exploratory analysis you will hardly notice the
difference (because reindex
has been heavily optimized), but when CPU
cycles matter sprinkling a few explicit reindex
calls here and there can
have an impact.
7.1 Reindexing to align with another object
You may wish to take an object and reindex its axes to be labeled the same as
another object. While the syntax for this is straightforward albeit verbose, it
is a common enough operation that the reindex_like()
method is
available to make this simpler:
In [9]: df2
Out[9]:
one two
a 0.851097 -0.429037
b 0.266049 0.963385
c 1.117346 2.243459
In [10]: df3
Out[10]:
one two
a 0.106266 -1.354973
b -0.478782 0.037450
c 0.372515 1.317523
In [11]: df.reindex_like(df2)
Out[11]:
one two
a 0.851097 -0.429037
b 0.266049 0.963385
c 1.117346 2.243459
7.2 Aligning objects with each other with align
The align()
method is the fastest way to simultaneously align two objects. It
supports a join
argument (related to joining and merging):
join='outer'
: take the union of the indexes (default)join='left'
: use the calling object’s indexjoin='right'
: use the passed object’s indexjoin='inner'
: intersect the indexes
It returns a tuple with both of the reindexed Series:
In [12]: s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
In [13]: s1 = s[:4]
In [14]: s2 = s[1:]
In [15]: s1.align(s2)
Out[15]:
(a 0.498698
b -0.643722
c -0.028228
d 0.070209
e NaN
dtype: float64, a NaN
b -0.643722
c -0.028228
d 0.070209
e -0.791176
dtype: float64)
In [16]: s1.align(s2, join='inner')
Out[16]:
(b -0.643722
c -0.028228
d 0.070209
dtype: float64, b -0.643722
c -0.028228
d 0.070209
dtype: float64)
In [17]: s1.align(s2, join='left')
Out[17]:
(a 0.498698
b -0.643722
c -0.028228
d 0.070209
dtype: float64, a NaN
b -0.643722
c -0.028228
d 0.070209
dtype: float64)
For DataFrames, the join method will be applied to both the index and the columns by default:
In [18]: df.align(df2, join='inner')
Out[18]:
( one two
a 0.851097 -0.429037
b 0.266049 0.963385
c 1.117346 2.243459, one two
a 0.851097 -0.429037
b 0.266049 0.963385
c 1.117346 2.243459)
You can also pass an axis
option to only align on the specified axis:
In [19]: df.align(df2, join='inner', axis=0)
Out[19]:
( one three two
a 0.851097 NaN -0.429037
b 0.266049 -0.330979 0.963385
c 1.117346 -0.409168 2.243459, one two
a 0.851097 -0.429037
b 0.266049 0.963385
c 1.117346 2.243459)
If you pass a Series to DataFrame.align()
, you can choose to align both
objects either on the DataFrame’s index or columns using the axis
argument:
In [20]: df.align(df2.ix[0], axis=1)
Out[20]:
( one three two
a 0.851097 NaN -0.429037
b 0.266049 -0.330979 0.963385
c 1.117346 -0.409168 2.243459
d NaN -0.305334 -0.432789, one 0.851097
three NaN
two -0.429037
Name: a, dtype: float64)
7.3 Filling while reindexing
reindex()
takes an optional parameter method
which is a
filling method chosen from the following table:
Method | Action |
---|---|
pad / ffill | Fill values forward |
bfill / backfill | Fill values backward |
nearest | Fill from the nearest index value |
We illustrate these fill methods on a simple Series:
In [21]: rng = pd.date_range('1/3/2000', periods=8)
In [22]: ts = pd.Series(np.random.randn(8), index=rng)
In [23]: ts2 = ts[[0, 3, 6]]
In [24]: ts
Out[24]:
2000-01-03 0.393495
2000-01-04 2.410230
2000-01-05 -0.368339
2000-01-06 -1.934392
2000-01-07 2.398912
2000-01-08 0.521658
2000-01-09 -2.389278
2000-01-10 0.395639
Freq: D, dtype: float64
In [25]: ts2
Out[25]:
2000-01-03 0.393495
2000-01-06 -1.934392
2000-01-09 -2.389278
dtype: float64
In [26]: ts2.reindex(ts.index)
Out[26]:
2000-01-03 0.393495
2000-01-04 NaN
2000-01-05 NaN
2000-01-06 -1.934392
2000-01-07 NaN
2000-01-08 NaN
2000-01-09 -2.389278
2000-01-10 NaN
Freq: D, dtype: float64
In [27]: ts2.reindex(ts.index, method='ffill')
Out[27]:
2000-01-03 0.393495
2000-01-04 0.393495
2000-01-05 0.393495
2000-01-06 -1.934392
2000-01-07 -1.934392
2000-01-08 -1.934392
2000-01-09 -2.389278
2000-01-10 -2.389278
Freq: D, dtype: float64
In [28]: ts2.reindex(ts.index, method='bfill')
Out[28]:
2000-01-03 0.393495
2000-01-04 -1.934392
2000-01-05 -1.934392
2000-01-06 -1.934392
2000-01-07 -2.389278
2000-01-08 -2.389278
2000-01-09 -2.389278
2000-01-10 NaN
Freq: D, dtype: float64
In [29]: ts2.reindex(ts.index, method='nearest')
Out[29]:
2000-01-03 0.393495
2000-01-04 0.393495
2000-01-05 -1.934392
2000-01-06 -1.934392
2000-01-07 -1.934392
2000-01-08 -2.389278
2000-01-09 -2.389278
2000-01-10 -2.389278
Freq: D, dtype: float64
These methods require that the indexes are ordered increasing or decreasing.
Note that the same result could have been achieved using
fillna (except for method='nearest'
) or
interpolate:
In [30]: ts2.reindex(ts.index).fillna(method='ffill')
Out[30]:
2000-01-03 0.393495
2000-01-04 0.393495
2000-01-05 0.393495
2000-01-06 -1.934392
2000-01-07 -1.934392
2000-01-08 -1.934392
2000-01-09 -2.389278
2000-01-10 -2.389278
Freq: D, dtype: float64
reindex()
will raise a ValueError if the index is not monotonic
increasing or decreasing. fillna()
and interpolate()
will not make any checks on the order of the index.
7.4 Limits on filling while reindexing
The limit
and tolerance
arguments provide additional control over
filling while reindexing. Limit specifies the maximum count of consecutive
matches:
In [31]: ts2.reindex(ts.index, method='ffill', limit=1)
Out[31]:
2000-01-03 0.393495
2000-01-04 0.393495
2000-01-05 NaN
2000-01-06 -1.934392
2000-01-07 -1.934392
2000-01-08 NaN
2000-01-09 -2.389278
2000-01-10 -2.389278
Freq: D, dtype: float64
In contrast, tolerance specifies the maximum distance between the index and indexer values:
In [32]: ts2.reindex(ts.index, method='ffill', tolerance='1 day')
Out[32]:
2000-01-03 0.393495
2000-01-04 0.393495
2000-01-05 NaN
2000-01-06 -1.934392
2000-01-07 -1.934392
2000-01-08 NaN
2000-01-09 -2.389278
2000-01-10 -2.389278
Freq: D, dtype: float64
Notice that when used on a DatetimeIndex
, TimedeltaIndex
or
PeriodIndex
, tolerance
will coerced into a Timedelta
if possible.
This allows you to specify tolerance with appropriate strings.
7.5 Dropping labels from an axis
A method closely related to reindex
is the drop()
function.
It removes a set of labels from an axis:
In [33]: df
Out[33]:
one three two
a 0.851097 NaN -0.429037
b 0.266049 -0.330979 0.963385
c 1.117346 -0.409168 2.243459
d NaN -0.305334 -0.432789
In [34]: df.drop(['a', 'd'], axis=0)
Out[34]:
one three two
b 0.266049 -0.330979 0.963385
c 1.117346 -0.409168 2.243459
In [35]: df.drop(['one'], axis=1)
Out[35]:
three two
a NaN -0.429037
b -0.330979 0.963385
c -0.409168 2.243459
d -0.305334 -0.432789
Note that the following also works, but is a bit less obvious / clean:
In [36]: df.reindex(df.index.difference(['a', 'd']))
Out[36]:
one three two
b 0.266049 -0.330979 0.963385
c 1.117346 -0.409168 2.243459
7.6 Renaming / mapping labels
The rename()
method allows you to relabel an axis based on some
mapping (a dict or Series) or an arbitrary function.
In [37]: s
Out[37]:
a 0.498698
b -0.643722
c -0.028228
d 0.070209
e -0.791176
dtype: float64
In [38]: s.rename(str.upper)
Out[38]:
A 0.498698
B -0.643722
C -0.028228
D 0.070209
E -0.791176
dtype: float64
If you pass a function, it must return a value when called with any of the labels (and must produce a set of unique values). A dict or Series can also be used:
In [39]: df.rename(columns={'one' : 'foo', 'two' : 'bar'},
....: index={'a' : 'apple', 'b' : 'banana', 'd' : 'durian'})
....:
Out[39]:
foo three bar
apple 0.851097 NaN -0.429037
banana 0.266049 -0.330979 0.963385
c 1.117346 -0.409168 2.243459
durian NaN -0.305334 -0.432789
If the mapping doesn’t include a column/index label, it isn’t renamed. Also extra labels in the mapping don’t throw an error.
The rename()
method also provides an inplace
named
parameter that is by default False
and copies the underlying data. Pass
inplace=True
to rename the data in place.
New in version 0.18.0.
Finally, rename()
also accepts a scalar or list-like
for altering the Series.name
attribute.
In [40]: s.rename("scalar-name")
Out[40]:
a 0.498698
b -0.643722
c -0.028228
d 0.070209
e -0.791176
Name: scalar-name, dtype: float64
The Panel class has a related rename_axis()
class which can rename
any of its three axes.