4.1 Missing data basics
4.1.1 When / why does data become missing?
Some might quibble over our usage of the word “missing”. By “missing” we simply mean null or “not present for whatever reason”. Many data sets simply arrive with missing data, either because the data exists but was not collected or because it never existed. For example, in a collection of financial time series, some of the series might start on different dates, so values prior to a series' start date would generally be marked as missing.
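The time series case can be sketched as follows. The two series, their values, and their dates are made up for illustration; the only behaviour assumed is that building a DataFrame from Series aligns them on the union of their indexes.

```python
import numpy as np
import pandas as pd

# Two hypothetical price series that begin on different dates.
early = pd.Series([10.0, 10.5, 11.0, 10.8],
                  index=pd.date_range("2020-01-01", periods=4, freq="D"))
late = pd.Series([20.0, 20.4],
                 index=pd.date_range("2020-01-03", periods=2, freq="D"))

# Combining them aligns on the union of the two indexes, so the dates
# before the later series starts are filled with NaN in its column.
prices = pd.DataFrame({"early": early, "late": late})
print(prices)
```

Here the first two rows of the "late" column come out as NaN, since that series has no observations for those dates.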
In pandas, one of the most common ways that missing data is introduced into a data set is by reindexing. For example:
In [1]: df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
...: columns=['one', 'two', 'three'])
...:
In [2]: df['four'] = 'bar'
In [3]: df['five'] = df['one'] > 0
In [4]: df
Out[4]:
one two three four five
a 1.7641 0.4002 0.9787 bar True
c 2.2409 1.8676 -0.9773 bar True
e 0.9501 -0.1514 -0.1032 bar True
f 0.4106 0.1440 1.4543 bar True
h 0.7610 0.1217 0.4439 bar True
In [5]: df2 = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
In [6]: df2
Out[6]:
one two three four five
a 1.7641 0.4002 0.9787 bar True
b NaN NaN NaN NaN NaN
c 2.2409 1.8676 -0.9773 bar True
d NaN NaN NaN NaN NaN
e 0.9501 -0.1514 -0.1032 bar True
f 0.4106 0.1440 1.4543 bar True
g NaN NaN NaN NaN NaN
h 0.7610 0.1217 0.4439 bar True
4.1.2 Values considered “missing”
As data comes in many shapes and forms, pandas aims to be flexible with regard to handling missing data. While NaN is the default missing value marker for reasons of computational speed and convenience, we need to be able to easily detect this value with data of different types: floating point, integer, boolean, and general object. In many cases, however, the Python None will arise, and we wish to also consider that “missing” or “null”.
Note
Prior to v0.10.0, inf and -inf were also considered to be “null” in computations. This is no longer the case by default; use the mode.use_inf_as_null option to recover the old behaviour.
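As a minimal sketch of the default behaviour described in the note, the snippet below checks that infinities are not flagged as null. It deliberately avoids the global option and instead converts infinities to NaN explicitly with replace(), which is an alternative chosen for this illustration and not something the note itself prescribes.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.inf, -np.inf, np.nan])

# By default only NaN is flagged as null; inf and -inf are ordinary values.
mask = pd.isnull(s)
print(mask.tolist())

# One way to treat infinities as missing without touching a global option
# is to replace them with NaN explicitly.
cleaned = s.replace([np.inf, -np.inf], np.nan)
cleaned_mask = pd.isnull(cleaned)
print(cleaned_mask.tolist())
```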
To make detecting missing values easier (and across different array dtypes), pandas provides the isnull() and notnull() functions, which are also methods on Series and DataFrame objects:
In [7]: df2['one']
Out[7]:
a 1.7641
b NaN
c 2.2409
d NaN
e 0.9501
f 0.4106
g NaN
h 0.7610
Name: one, dtype: float64
In [8]: pd.isnull(df2['one'])
Out[8]:
a False
b True
c False
d True
e False
f False
g True
h False
Name: one, dtype: bool
In [9]: df2['four'].notnull()
Out[9]:
a True
b False
c True
d False
e True
f True
g False
h True
Name: four, dtype: bool
In [10]: df2.isnull()
Out[10]:
one two three four five
a False False False False False
b True True True True True
c False False False False False
d True True True True True
e False False False False False
f False False False False False
g True True True True True
h False False False False False
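The same checks work across dtypes. The following sketch uses small made-up Series and a made-up DataFrame to show that both NaN and None are detected in an object column, and that summing the boolean result gives a per-column count of missing values.

```python
import numpy as np
import pandas as pd

floats = pd.Series([1.5, np.nan, 3.0])
objects = pd.Series(["a", None, np.nan], dtype=object)

# NaN is detected in a float Series; both None and NaN are detected
# in an object Series.
float_mask = pd.isnull(floats)
object_mask = objects.isnull()
print(float_mask.tolist())
print(object_mask.tolist())

# True counts as 1 when summed, so summing the mask counts the
# missing values in each column.
frame = pd.DataFrame({"x": [1.0, np.nan, np.nan], "y": ["a", None, "c"]})
missing_per_column = frame.isnull().sum()
print(missing_per_column)
```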
Warning
One has to be mindful that in Python (and NumPy), NaN values do not compare equal, but None values do. Note that pandas and NumPy rely on the fact that np.nan != np.nan, and that pandas treats None like np.nan.
In [11]: None == None
Out[11]: True
In [12]: np.nan == np.nan
Out[12]: False
So, in contrast to the isnull() checks above, a scalar equality comparison against None or np.nan does not provide useful information:
In [13]: df2['one'] == np.nan
Out[13]:
a False
b False
c False
d False
e False
f False
g False
h False
Name: one, dtype: bool
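A short sketch of the practical consequence, using a made-up Series: the equality test finds nothing, while isnull() (or the self-inequality property of NaN) locates the missing entries, and the resulting mask can be used to select rows.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, np.nan], index=["a", "b", "c", "d"])

# Equality against np.nan is False everywhere, even where s is NaN.
eq_mask = s == np.nan

# isnull() finds the missing entries.
null_mask = s.isnull()

# NaN is the only float value that is not equal to itself, so s != s
# produces the same mask for a float Series.
self_neq_mask = s != s

# Use the masks to pick out missing labels or keep only present values.
missing_labels = s[null_mask].index.tolist()
present = s[s.notnull()]
print(missing_labels)
print(present)
```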