7 Reindexing and altering labels

reindex() is the fundamental data alignment method in pandas. It is used to implement nearly all other features relying on label-alignment functionality. To reindex means to conform the data to match a given set of labels along a particular axis. This accomplishes several things:

  • Reorders the existing data to match a new set of labels
  • Inserts missing value (NA) markers in label locations where no data for that label existed
  • If specified, fill data for missing labels using logic (highly relevant to working with time series data)

Here is a simple example:

In [1]: s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])

In [2]: s
Out[2]: 
a    0.734560
b   -0.445120
c   -0.703433
d    0.320412
e    0.185202
dtype: float64

In [3]: s.reindex(['e', 'b', 'f', 'd'])
Out[3]: 
e    0.185202
b   -0.445120
f         NaN
d    0.320412
dtype: float64

Here, the f label was not contained in the Series and hence appears as NaN in the result.

With a DataFrame, you can simultaneously reindex the index and columns:

In [4]: df
Out[4]: 
        one     three       two
a  0.851097       NaN -0.429037
b  0.266049 -0.330979  0.963385
c  1.117346 -0.409168  2.243459
d       NaN -0.305334 -0.432789

In [5]: df.reindex(index=['c', 'f', 'b'], columns=['three', 'two', 'one'])
Out[5]: 
      three       two       one
c -0.409168  2.243459  1.117346
f       NaN       NaN       NaN
b -0.330979  0.963385  0.266049

For convenience, you may utilize the reindex_axis() method, which takes the labels and a keyword axis parameter.

Note that the Index objects containing the actual axis labels can be shared between objects. So if we have a Series and a DataFrame, the following can be done:

In [6]: rs = s.reindex(df.index)

In [7]: rs
Out[7]: 
a    0.734560
b   -0.445120
c   -0.703433
d    0.320412
dtype: float64

In [8]: rs.index is df.index
Out[8]: True

This means that the reindexed Series’s index is the same Python object as the DataFrame’s index.

See also

MultiIndex / Advanced Indexing is an even more concise way of doing reindexing.

Note

When writing performance-sensitive code, there is a good reason to spend some time becoming a reindexing ninja: many operations are faster on pre-aligned data. Adding two unaligned DataFrames internally triggers a reindexing step. For exploratory analysis you will hardly notice the difference (because reindex has been heavily optimized), but when CPU cycles matter sprinkling a few explicit reindex calls here and there can have an impact.

7.1 Reindexing to align with another object

You may wish to take an object and reindex its axes to be labeled the same as another object. While the syntax for this is straightforward albeit verbose, it is a common enough operation that the reindex_like() method is available to make this simpler:

In [9]: df2
Out[9]: 
        one       two
a  0.851097 -0.429037
b  0.266049  0.963385
c  1.117346  2.243459

In [10]: df3
Out[10]: 
        one       two
a  0.106266 -1.354973
b -0.478782  0.037450
c  0.372515  1.317523

In [11]: df.reindex_like(df2)
Out[11]: 
        one       two
a  0.851097 -0.429037
b  0.266049  0.963385
c  1.117346  2.243459

7.2 Aligning objects with each other with align

The align() method is the fastest way to simultaneously align two objects. It supports a join argument (related to joining and merging):

  • join='outer': take the union of the indexes (default)
  • join='left': use the calling object’s index
  • join='right': use the passed object’s index
  • join='inner': intersect the indexes

It returns a tuple with both of the reindexed Series:

In [12]: s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])

In [13]: s1 = s[:4]

In [14]: s2 = s[1:]

In [15]: s1.align(s2)
Out[15]: 
(a    0.498698
 b   -0.643722
 c   -0.028228
 d    0.070209
 e         NaN
 dtype: float64, a         NaN
 b   -0.643722
 c   -0.028228
 d    0.070209
 e   -0.791176
 dtype: float64)

In [16]: s1.align(s2, join='inner')
Out[16]: 
(b   -0.643722
 c   -0.028228
 d    0.070209
 dtype: float64, b   -0.643722
 c   -0.028228
 d    0.070209
 dtype: float64)

In [17]: s1.align(s2, join='left')
Out[17]: 
(a    0.498698
 b   -0.643722
 c   -0.028228
 d    0.070209
 dtype: float64, a         NaN
 b   -0.643722
 c   -0.028228
 d    0.070209
 dtype: float64)

For DataFrames, the join method will be applied to both the index and the columns by default:

In [18]: df.align(df2, join='inner')
Out[18]: 
(        one       two
 a  0.851097 -0.429037
 b  0.266049  0.963385
 c  1.117346  2.243459,         one       two
 a  0.851097 -0.429037
 b  0.266049  0.963385
 c  1.117346  2.243459)

You can also pass an axis option to only align on the specified axis:

In [19]: df.align(df2, join='inner', axis=0)
Out[19]: 
(        one     three       two
 a  0.851097       NaN -0.429037
 b  0.266049 -0.330979  0.963385
 c  1.117346 -0.409168  2.243459,         one       two
 a  0.851097 -0.429037
 b  0.266049  0.963385
 c  1.117346  2.243459)

If you pass a Series to DataFrame.align(), you can choose to align both objects either on the DataFrame’s index or columns using the axis argument:

In [20]: df.align(df2.ix[0], axis=1)
Out[20]: 
(        one     three       two
 a  0.851097       NaN -0.429037
 b  0.266049 -0.330979  0.963385
 c  1.117346 -0.409168  2.243459
 d       NaN -0.305334 -0.432789, one      0.851097
 three         NaN
 two     -0.429037
 Name: a, dtype: float64)

7.3 Filling while reindexing

reindex() takes an optional parameter method which is a filling method chosen from the following table:

Method Action
pad / ffill Fill values forward
bfill / backfill Fill values backward
nearest Fill from the nearest index value

We illustrate these fill methods on a simple Series:

In [21]: rng = pd.date_range('1/3/2000', periods=8)

In [22]: ts = pd.Series(np.random.randn(8), index=rng)

In [23]: ts2 = ts[[0, 3, 6]]

In [24]: ts
Out[24]: 
2000-01-03    0.393495
2000-01-04    2.410230
2000-01-05   -0.368339
2000-01-06   -1.934392
2000-01-07    2.398912
2000-01-08    0.521658
2000-01-09   -2.389278
2000-01-10    0.395639
Freq: D, dtype: float64

In [25]: ts2
Out[25]: 
2000-01-03    0.393495
2000-01-06   -1.934392
2000-01-09   -2.389278
dtype: float64

In [26]: ts2.reindex(ts.index)
Out[26]: 
2000-01-03    0.393495
2000-01-04         NaN
2000-01-05         NaN
2000-01-06   -1.934392
2000-01-07         NaN
2000-01-08         NaN
2000-01-09   -2.389278
2000-01-10         NaN
Freq: D, dtype: float64

In [27]: ts2.reindex(ts.index, method='ffill')
Out[27]: 
2000-01-03    0.393495
2000-01-04    0.393495
2000-01-05    0.393495
2000-01-06   -1.934392
2000-01-07   -1.934392
2000-01-08   -1.934392
2000-01-09   -2.389278
2000-01-10   -2.389278
Freq: D, dtype: float64

In [28]: ts2.reindex(ts.index, method='bfill')
Out[28]: 
2000-01-03    0.393495
2000-01-04   -1.934392
2000-01-05   -1.934392
2000-01-06   -1.934392
2000-01-07   -2.389278
2000-01-08   -2.389278
2000-01-09   -2.389278
2000-01-10         NaN
Freq: D, dtype: float64

In [29]: ts2.reindex(ts.index, method='nearest')
Out[29]: 
2000-01-03    0.393495
2000-01-04    0.393495
2000-01-05   -1.934392
2000-01-06   -1.934392
2000-01-07   -1.934392
2000-01-08   -2.389278
2000-01-09   -2.389278
2000-01-10   -2.389278
Freq: D, dtype: float64

These methods require that the indexes are ordered increasing or decreasing.

Note that the same result could have been achieved using fillna (except for method='nearest') or interpolate:

In [30]: ts2.reindex(ts.index).fillna(method='ffill')
Out[30]: 
2000-01-03    0.393495
2000-01-04    0.393495
2000-01-05    0.393495
2000-01-06   -1.934392
2000-01-07   -1.934392
2000-01-08   -1.934392
2000-01-09   -2.389278
2000-01-10   -2.389278
Freq: D, dtype: float64

reindex() will raise a ValueError if the index is not monotonic increasing or decreasing. fillna() and interpolate() will not make any checks on the order of the index.

7.4 Limits on filling while reindexing

The limit and tolerance arguments provide additional control over filling while reindexing. Limit specifies the maximum count of consecutive matches:

In [31]: ts2.reindex(ts.index, method='ffill', limit=1)
Out[31]: 
2000-01-03    0.393495
2000-01-04    0.393495
2000-01-05         NaN
2000-01-06   -1.934392
2000-01-07   -1.934392
2000-01-08         NaN
2000-01-09   -2.389278
2000-01-10   -2.389278
Freq: D, dtype: float64

In contrast, tolerance specifies the maximum distance between the index and indexer values:

In [32]: ts2.reindex(ts.index, method='ffill', tolerance='1 day')
Out[32]: 
2000-01-03    0.393495
2000-01-04    0.393495
2000-01-05         NaN
2000-01-06   -1.934392
2000-01-07   -1.934392
2000-01-08         NaN
2000-01-09   -2.389278
2000-01-10   -2.389278
Freq: D, dtype: float64

Notice that when used on a DatetimeIndex, TimedeltaIndex or PeriodIndex, tolerance will coerced into a Timedelta if possible. This allows you to specify tolerance with appropriate strings.

7.5 Dropping labels from an axis

A method closely related to reindex is the drop() function. It removes a set of labels from an axis:

In [33]: df
Out[33]: 
        one     three       two
a  0.851097       NaN -0.429037
b  0.266049 -0.330979  0.963385
c  1.117346 -0.409168  2.243459
d       NaN -0.305334 -0.432789

In [34]: df.drop(['a', 'd'], axis=0)
Out[34]: 
        one     three       two
b  0.266049 -0.330979  0.963385
c  1.117346 -0.409168  2.243459

In [35]: df.drop(['one'], axis=1)
Out[35]: 
      three       two
a       NaN -0.429037
b -0.330979  0.963385
c -0.409168  2.243459
d -0.305334 -0.432789

Note that the following also works, but is a bit less obvious / clean:

In [36]: df.reindex(df.index.difference(['a', 'd']))
Out[36]: 
        one     three       two
b  0.266049 -0.330979  0.963385
c  1.117346 -0.409168  2.243459

7.6 Renaming / mapping labels

The rename() method allows you to relabel an axis based on some mapping (a dict or Series) or an arbitrary function.

In [37]: s
Out[37]: 
a    0.498698
b   -0.643722
c   -0.028228
d    0.070209
e   -0.791176
dtype: float64

In [38]: s.rename(str.upper)
Out[38]: 
A    0.498698
B   -0.643722
C   -0.028228
D    0.070209
E   -0.791176
dtype: float64

If you pass a function, it must return a value when called with any of the labels (and must produce a set of unique values). A dict or Series can also be used:

In [39]: df.rename(columns={'one' : 'foo', 'two' : 'bar'},
   ....:           index={'a' : 'apple', 'b' : 'banana', 'd' : 'durian'})
   ....: 
Out[39]: 
             foo     three       bar
apple   0.851097       NaN -0.429037
banana  0.266049 -0.330979  0.963385
c       1.117346 -0.409168  2.243459
durian       NaN -0.305334 -0.432789

If the mapping doesn’t include a column/index label, it isn’t renamed. Also extra labels in the mapping don’t throw an error.

The rename() method also provides an inplace named parameter that is by default False and copies the underlying data. Pass inplace=True to rename the data in place.

New in version 0.18.0.

Finally, rename() also accepts a scalar or list-like for altering the Series.name attribute.

In [40]: s.rename("scalar-name")
Out[40]: 
a    0.498698
b   -0.643722
c   -0.028228
d    0.070209
e   -0.791176
Name: scalar-name, dtype: float64

The Panel class has a related rename_axis() class which can rename any of its three axes.