1.5 Miscellaneous indexing gotchas
1.5.1 Reindex versus ix gotchas
Many users will find themselves using the ix indexing capabilities as a
concise means of selecting data from a pandas object:
In [1]: df = pd.DataFrame(np.random.randn(6, 4), columns=['one', 'two', 'three', 'four'],
...: index=list('abcdef'))
...:
In [2]: df
Out[2]:
one two three four
a -0.1726 -1.1223 -3.1767 -1.1547
b 0.3706 0.5508 0.1087 0.9402
c -1.0410 -0.9303 0.0771 -0.4607
d -0.0377 0.3060 -0.8745 0.9521
e -0.1760 0.0393 -0.2299 1.4152
f 1.7684 1.3841 0.9348 -1.0299
In [3]: df.ix[['b', 'c', 'e']]
Out[3]:
one two three four
b 0.3706 0.5508 0.1087 0.9402
c -1.0410 -0.9303 0.0771 -0.4607
e -0.1760 0.0393 -0.2299 1.4152
This is, of course, completely equivalent in this case to using the reindex
method:
In [4]: df.reindex(['b', 'c', 'e'])
Out[4]:
one two three four
b 0.3706 0.5508 0.1087 0.9402
c -1.0410 -0.9303 0.0771 -0.4607
e -0.1760 0.0393 -0.2299 1.4152
Based on this, some might conclude that ix and reindex are 100% equivalent.
That is true except in the case of integer indexing. For example, the above
operation could alternatively have been expressed as:
In [5]: df.ix[[1, 2, 4]]
Out[5]:
one two three four
b 0.3706 0.5508 0.1087 0.9402
c -1.0410 -0.9303 0.0771 -0.4607
e -0.1760 0.0393 -0.2299 1.4152
If you pass [1, 2, 4] to reindex, you get an entirely different result:
In [6]: df.reindex([1, 2, 4])
Out[6]:
one two three four
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
4 NaN NaN NaN NaN
So it’s important to remember that reindex is strict label indexing only.
This can lead to some potentially surprising results in pathological cases
where an index contains, say, both integers and strings:
In [7]: s = pd.Series([1, 2, 3], index=['a', 0, 1])
In [8]: s
Out[8]:
a 1
0 2
1 3
dtype: int64
In [9]: s.ix[[0, 1]]
Out[9]:
0 2
1 3
dtype: int64
In [10]: s.reindex([0, 1])
Out[10]:
0 2
1 3
dtype: int64
Because the index in this case does not contain solely integers, ix falls
back on integer indexing. By contrast, reindex only looks for the values
passed in the index, thus finding the integers 0 and 1. While it would be
possible to insert some logic to check whether a passed sequence is all
contained in the index, that logic would exact a very high cost in large data
sets.
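The label-versus-position distinction above can also be sketched without ix, which was deprecated in pandas 0.20 and removed in 1.0. The snippet below is only an illustration under a few assumptions: loc and iloc stand in for the two behaviours that ix mixed together, and the frame is filled with np.arange values (rather than random numbers) so the results are reproducible.

```python
import numpy as np
import pandas as pd

# Assumption: deterministic data instead of np.random.randn, for repeatability.
df = pd.DataFrame(np.arange(24).reshape(6, 4),
                  columns=['one', 'two', 'three', 'four'],
                  index=list('abcdef'))

by_label = df.loc[['b', 'c', 'e']]     # label-based selection
by_position = df.iloc[[1, 2, 4]]       # position-based selection
reindexed = df.reindex([1, 2, 4])      # strict label lookup: no such labels, so all NaN

# Mixed index of strings and integers, as in the pathological case above.
s = pd.Series([1, 2, 3], index=['a', 0, 1])
s_label = s.loc[[0, 1]]        # labels 0 and 1
s_position = s.iloc[[0, 1]]    # first two positions, i.e. labels 'a' and 0
s_reindexed = s.reindex([0, 1])  # labels 0 and 1, same as loc here

print(by_label.equals(by_position))
print(reindexed)
print(s_label.tolist(), s_position.tolist(), s_reindexed.tolist())
```

Spelling out loc or iloc makes the intent explicit, so the outcome does not depend on whether the index happens to contain integers.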
1.5.2 Reindex potentially changes underlying Series dtype
The use of reindex_like can potentially change the dtype of a Series.
In [11]: series = pd.Series([1, 2, 3])
In [12]: x = pd.Series([True])
In [13]: x.dtype
Out[13]: dtype('bool')
In [14]: x = pd.Series([True]).reindex_like(series)
In [15]: x.dtype
Out[15]: dtype('O')
This is because reindex_like silently inserts NaNs and the dtype changes
accordingly. This can cause some issues when using numpy ufuncs such as
numpy.logical_and.
See this old issue for a more detailed discussion.
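One possible way to sidestep the dtype change is to call reindex with an explicit fill_value instead of reindex_like. This is a minimal sketch, not a recommendation from the original text, and it assumes that False is an acceptable stand-in for the entries that would otherwise become NaN.

```python
import pandas as pd

series = pd.Series([1, 2, 3])
x = pd.Series([True])

# reindex_like has to fill the two missing labels with NaN,
# which a bool Series cannot hold, so the dtype is widened.
widened = x.reindex_like(series)

# Assumption: missing entries may be treated as False.
# With a boolean fill value there is no NaN to accommodate.
kept = x.reindex(series.index, fill_value=False)

print(widened.dtype)
print(kept.dtype)
```

Whether False (or any other fill value) is appropriate depends on what a missing entry means in the data at hand.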