4.6 Missing data casting rules and indexing
pandas can store arrays of integer and boolean type, but these types cannot represent missing data. Until NumPy provides a native NA type, pandas applies the following “casting rules” whenever reindexing introduces missing data into, say, a Series or DataFrame:
Data type | Cast to |
---|---|
integer | float |
boolean | object |
float | no cast |
object | no cast |
For example, reindexing a boolean Series onto labels it does not contain casts it to object:
In [1]: s = pd.Series(np.random.randn(5), index=[0, 2, 4, 6, 7])
In [2]: s
Out[2]:
0 1.3316
2 0.7153
4 -1.5454
6 -0.0084
7 0.6213
dtype: float64
In [3]: s > 0
Out[3]:
0 True
2 True
4 False
6 False
7 True
dtype: bool
In [4]: (s > 0).dtype
Out[4]: dtype('bool')
In [5]: crit = (s > 0).reindex(list(range(8)))
In [6]: crit
Out[6]:
0 True
1 NaN
2 True
3 NaN
4 False
5 NaN
6 False
7 True
dtype: object
In [7]: crit.dtype
Out[7]: dtype('O')
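The session above exercises only the boolean row of the table. As a minimal sketch of the integer and float rows, the same check could look like the following; the names `counts` and `floats` and their values are illustrative and do not come from the original example.

```python
import numpy as np
import pandas as pd

# Illustrative integer Series: no missing values yet, so the dtype is int64.
counts = pd.Series([10, 20, 30], index=[0, 2, 4], dtype="int64")

# Labels 1 and 3 are absent, so reindexing introduces NaN there.
# An integer array cannot hold NaN, so the result is cast to float.
counts_reindexed = counts.reindex(range(5))
print(counts_reindexed.dtype)  # float64

# Float data can already represent NaN, so no cast is needed.
floats = pd.Series([1.5, 2.5], index=[0, 2])
floats_reindexed = floats.reindex(range(3))
print(floats_reindexed.dtype)  # float64
```

The cast only happens because new labels appear; reindexing onto labels that already exist leaves the integer dtype unchanged.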
Ordinarily NumPy will complain if you use an object array, even one that contains only boolean values, in place of a boolean array to get or set values in an ndarray (for example, when selecting values that meet some criterion). If a boolean vector contains NAs, pandas raises an exception:
In [8]: reindexed = s.reindex(list(range(8))).fillna(0)
In [9]: reindexed[crit]
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-9-2da204ed1ac7> in <module>()
----> 1 reindexed[crit]
/home/takanori/.local/lib/python2.7/site-packages/pandas-0.18.1+293.g51e6adb-py2.7-linux-x86_64.egg/pandas/core/series.pyc in __getitem__(self, key)
634 key = list(key)
635
--> 636 if com.is_bool_indexer(key):
637 key = check_bool_indexer(self.index, key)
638
/home/takanori/.local/lib/python2.7/site-packages/pandas-0.18.1+293.g51e6adb-py2.7-linux-x86_64.egg/pandas/core/common.pyc in is_bool_indexer(key)
190 if not lib.is_bool_array(key):
191 if isnull(key).any():
--> 192 raise ValueError('cannot index with vector containing '
193 'NA / NaN values')
194 return False
ValueError: cannot index with vector containing NA / NaN values
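Filling the missing values with fillna first makes the vector usable for indexing. The fill value decides whether the newly introduced labels are excluded (False) or included (True):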
In [10]: reindexed[crit.fillna(False)]
Out[10]:
0 1.3316
2 0.7153
7 0.6213
dtype: float64
In [11]: reindexed[crit.fillna(True)]
Out[11]:
0 1.3316
1 0.0000
2 0.7153
3 0.0000
5 0.0000
7 0.6213
dtype: float64
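One way to sidestep the cast entirely, offered here as a sketch rather than as the approach the text above prescribes, is to pass `fill_value` to `reindex` so that no missing data is introduced in the first place. The sample values below are copied from the rounded output shown earlier; they stand in for the random data of the original session.

```python
import pandas as pd

# Stand-in for the random Series used above (values taken from the printed output).
s = pd.Series([1.3316, 0.7153, -1.5454, -0.0084, 0.6213], index=[0, 2, 4, 6, 7])

# With fill_value=False the new labels get False instead of NaN,
# so the mask keeps its boolean dtype and is never cast to object.
crit = (s > 0).reindex(range(8), fill_value=False)
print(crit.dtype)  # bool

# Same idea for the data itself: fill new labels with 0 instead of NaN.
reindexed = s.reindex(range(8), fill_value=0)

# The mask contains no NA values, so boolean indexing works directly.
selected = reindexed[crit]
print(selected)
```

This selects the same rows as `reindexed[crit.fillna(False)]`, but the mask keeps its boolean dtype throughout.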