1.5 Miscellaneous indexing gotchas

1.5.1 Reindex versus ix gotchas

Many users will find themselves using the ix indexing capabilities as a concise means of selecting data from a pandas object:

In [1]: df = pd.DataFrame(np.random.randn(6, 4), columns=['one', 'two', 'three', 'four'],
   ...:                   index=list('abcdef'))
   ...: 

In [2]: df
Out[2]: 
      one     two   three    four
a -0.1726 -1.1223 -3.1767 -1.1547
b  0.3706  0.5508  0.1087  0.9402
c -1.0410 -0.9303  0.0771 -0.4607
d -0.0377  0.3060 -0.8745  0.9521
e -0.1760  0.0393 -0.2299  1.4152
f  1.7684  1.3841  0.9348 -1.0299

In [3]: df.ix[['b', 'c', 'e']]
Out[3]: 
      one     two   three    four
b  0.3706  0.5508  0.1087  0.9402
c -1.0410 -0.9303  0.0771 -0.4607
e -0.1760  0.0393 -0.2299  1.4152

This is, of course, completely equivalent in this case to using the reindex method:

In [4]: df.reindex(['b', 'c', 'e'])
Out[4]: 
      one     two   three    four
b  0.3706  0.5508  0.1087  0.9402
c -1.0410 -0.9303  0.0771 -0.4607
e -0.1760  0.0393 -0.2299  1.4152

Based on this, some might conclude that ix and reindex are 100% equivalent. They are, except in the case of integer indexing. For example, the above operation could alternatively have been expressed as:

In [5]: df.ix[[1, 2, 4]]
Out[5]: 
      one     two   three    four
b  0.3706  0.5508  0.1087  0.9402
c -1.0410 -0.9303  0.0771 -0.4607
e -0.1760  0.0393 -0.2299  1.4152

If you pass [1, 2, 4] to reindex, you get an entirely different result:

In [6]: df.reindex([1, 2, 4])
Out[6]: 
   one  two  three  four
1  NaN  NaN    NaN   NaN
2  NaN  NaN    NaN   NaN
4  NaN  NaN    NaN   NaN

So it’s important to remember that reindex performs strict label indexing only. This can lead to some potentially surprising results in pathological cases where an index contains, say, both integers and strings:

In [7]: s = pd.Series([1, 2, 3], index=['a', 0, 1])

In [8]: s
Out[8]: 
a    1
0    2
1    3
dtype: int64

In [9]: s.ix[[0, 1]]
Out[9]: 
0    2
1    3
dtype: int64

In [10]: s.reindex([0, 1])
Out[10]: 
0    2
1    3
dtype: int64

Because the index in this case does not contain solely integers, ix falls back on integer indexing. By contrast, reindex only looks for the values passed in the index, thus finding the integers 0 and 1. While it would be possible to insert some logic to check whether a passed sequence is all contained in the index, that logic would exact a very high cost in large data sets.
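The label-versus-position ambiguity described above can be avoided by stating the intent explicitly. The following is a minimal sketch, assuming a pandas version that provides the loc and iloc accessors (ix itself was deprecated in pandas 0.20 and removed in 1.0). The frame contents are illustrative values rather than the random data shown earlier.

```python
import numpy as np
import pandas as pd

# Illustrative data (an assumption): deterministic values instead of randn.
df = pd.DataFrame(np.arange(24).reshape(6, 4),
                  columns=['one', 'two', 'three', 'four'],
                  index=list('abcdef'))

# loc is label-based only; iloc is position-based only.
by_label = df.loc[['b', 'c', 'e']]
by_position = df.iloc[[1, 2, 4]]

# reindex always treats its argument as labels, so integers that are
# not in the index produce rows filled with NaN.
missing = df.reindex([1, 2, 4])

# Mixed integer/string index from the example above.
s = pd.Series([1, 2, 3], index=['a', 0, 1])
label_values = s.loc[[0, 1]].tolist()        # rows labelled 0 and 1
position_values = s.iloc[[0, 1]].tolist()    # first and second rows
reindexed_values = s.reindex([0, 1]).tolist()  # labels 0 and 1 again
```

With explicit accessors, the mixed-index Series no longer depends on any fallback rule: loc and reindex both look up the labels 0 and 1, while iloc selects the first two rows regardless of their labels.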

1.5.2 Reindex potentially changes underlying Series dtype

The use of reindex_like can potentially change the dtype of a Series.

In [11]: series = pd.Series([1, 2, 3])

In [12]: x = pd.Series([True])

In [13]: x.dtype
Out[13]: dtype('bool')

In [14]: x = pd.Series([True]).reindex_like(series)

In [15]: x.dtype
Out[15]: dtype('O')

This is because reindex_like silently inserts NaNs and the dtype changes accordingly. This can cause some issues when using numpy ufuncs such as numpy.logical_and.
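One possible way to keep the boolean dtype is to reindex against the other object's index and supply an explicit fill value. This is only a sketch: it assumes that treating missing entries as False is acceptable for the computation at hand, which may not hold in every case. The names flags, widened, and kept are illustrative.

```python
import numpy as np
import pandas as pd

series = pd.Series([1, 2, 3])
flags = pd.Series([True])

# reindex_like inserts NaN for the missing labels, so bool becomes object.
widened = flags.reindex_like(series)

# Assumption: missing entries should count as False.
# Filling with a bool value lets the result stay bool.
kept = flags.reindex(series.index, fill_value=False)

# With a real bool dtype, numpy ufuncs behave as expected.
result = np.logical_and(kept, series > 0)
```

Checking widened.dtype and kept.dtype after running this shows the difference: the first is object, the second remains bool.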

See this old issue for a more detailed discussion.