4.6 Missing data casting rules and indexing

While pandas supports storing arrays of integer and boolean type, these types cannot represent missing data. Until we can switch to a native NA type in NumPy, we've established some "casting rules" that apply whenever reindexing introduces missing data into, say, a Series or DataFrame. Here they are:

data type    Cast to
---------    -------
integer      float
boolean      object
float        no cast
object       no cast
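
The rules in the table can be checked with a minimal sketch like the one below. The Series contents, the dictionary names (originals, casts) and the target index are illustrative assumptions, not part of the original text; the only behavior relied on is that reindex introduces a missing value for a label that was not in the original index.

```python
import numpy as np
import pandas as pd

# Label 2 is absent from every Series below, so reindexing introduces an NA.
new_index = [0, 1, 2]

# Hypothetical sample data, one Series per row of the casting table.
originals = {
    "integer": pd.Series([1, 2], index=[0, 1]),
    "boolean": pd.Series([True, False], index=[0, 1]),
    "float": pd.Series([1.5, 2.5], index=[0, 1]),
    "object": pd.Series(["a", "b"], index=[0, 1], dtype=object),
}

# Record the dtype before and after reindexing.
casts = {}
for name, ser in originals.items():
    casts[name] = (ser.dtype, ser.reindex(new_index).dtype)
    print(name, casts[name][0], "->", casts[name][1])
```

Integer data should come back as float and boolean data as object, while float and object data keep their dtype.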

For example:

In [1]: s = pd.Series(np.random.randn(5), index=[0, 2, 4, 6, 7])

In [2]: s
Out[2]: 
0    1.3316
2    0.7153
4   -1.5454
6   -0.0084
7    0.6213
dtype: float64

In [3]: s > 0
Out[3]: 
0     True
2     True
4    False
6    False
7     True
dtype: bool

In [4]: (s > 0).dtype
Out[4]: dtype('bool')

In [5]: crit = (s > 0).reindex(list(range(8)))

In [6]: crit
Out[6]: 
0     True
1      NaN
2     True
3      NaN
4    False
5      NaN
6    False
7     True
dtype: object

In [7]: crit.dtype
Out[7]: dtype('O')

Ordinarily NumPy will complain if you try to use an object array (even one that contains only boolean values) instead of a boolean array to get or set values from an ndarray (e.g. when selecting values based on some criteria). If a boolean vector contains NAs, pandas raises an exception:

In [8]: reindexed = s.reindex(list(range(8))).fillna(0)

In [9]: reindexed[crit]
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-9-2da204ed1ac7> in <module>()
----> 1 reindexed[crit]

/home/takanori/.local/lib/python2.7/site-packages/pandas-0.18.1+293.g51e6adb-py2.7-linux-x86_64.egg/pandas/core/series.pyc in __getitem__(self, key)
    634             key = list(key)
    635 
--> 636         if com.is_bool_indexer(key):
    637             key = check_bool_indexer(self.index, key)
    638 

/home/takanori/.local/lib/python2.7/site-packages/pandas-0.18.1+293.g51e6adb-py2.7-linux-x86_64.egg/pandas/core/common.pyc in is_bool_indexer(key)
    190             if not lib.is_bool_array(key):
    191                 if isnull(key).any():
--> 192                     raise ValueError('cannot index with vector containing '
    193                                      'NA / NaN values')
    194                 return False

ValueError: cannot index with vector containing NA / NaN values

However, the missing values can be filled in with fillna, after which the indexing works fine:

In [10]: reindexed[crit.fillna(False)]
Out[10]: 
0    1.3316
2    0.7153
7    0.6213
dtype: float64

In [11]: reindexed[crit.fillna(True)]
Out[11]: 
0    1.3316
1    0.0000
2    0.7153
3    0.0000
5    0.0000
7    0.6213
dtype: float64
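
As an alternative to the fillna approach above, the cast to object can be avoided altogether by supplying a fill value at reindex time. The following is a minimal sketch, assuming that missing labels should be treated as False; the sample values and the variable names (full_index, selected) are illustrative and not taken from the session above.

```python
import numpy as np
import pandas as pd

# Hypothetical data: a float Series with gaps in its integer index.
s = pd.Series([1.5, -0.5, 2.0], index=[0, 2, 4])
full_index = list(range(5))

# Passing fill_value means no NA is introduced, so the mask stays boolean.
crit = (s > 0).reindex(full_index, fill_value=False)
print(crit.dtype)

# The data itself is reindexed and filled as in the example above.
reindexed = s.reindex(full_index).fillna(0)

# Boolean indexing now works without a separate fillna step on the mask.
selected = reindexed[crit]
print(selected)
```

Whether False or True is the right fill value depends on how rows with unknown criteria should be treated, just as with crit.fillna(False) versus crit.fillna(True).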