5.10 Missing Data

pandas primarily uses the value np.nan to represent missing data. By default, missing values are excluded from computations. See the Missing Data section.

Missing values should not be included in the Categorical’s categories, only in the values. NaN is treated as a special value: it is never a category, but it may always appear among the values. In the Categorical’s codes, a missing value always has the code -1.

In [1]: s = pd.Series(["a", "b", np.nan, "a"], dtype="category")

# only two categories
In [2]: s
Out[2]: 
0      a
1      b
2    NaN
3      a
dtype: category
Categories (2, object): [a, b]

In [3]: s.cat.codes
Out[3]: 
0    0
1    1
2   -1
3    0
dtype: int8
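
Because missing values always carry the code -1, the codes can also be used to locate them. The following is a minimal sketch, not a recommended idiom; the variable names are illustrative only. It checks that the mask built from the codes agrees with the one returned by isnull().

```python
import numpy as np
import pandas as pd

s = pd.Series(["a", "b", np.nan, "a"], dtype="category")

# NaN is not listed among the categories
categories = list(s.cat.categories)

# A code of -1 marks a missing value
missing_from_codes = s.cat.codes == -1

# The same mask obtained through the regular missing-data API
missing_from_isnull = s.isnull()

print(categories)
print(missing_from_codes.equals(missing_from_isnull))
```

In practice isnull() is the clearer choice; the comparison against -1 is mainly useful when the codes are already being processed directly.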

Methods for working with missing data, e.g. isnull(), fillna(), dropna(), all work normally:

In [4]: s = pd.Series(["a", "b", np.nan], dtype="category")

In [5]: s
Out[5]: 
0      a
1      b
2    NaN
dtype: category
Categories (2, object): [a, b]

In [6]: pd.isnull(s)
Out[6]: 
0    False
1    False
2     True
dtype: bool

In [7]: s.fillna("a")
Out[7]: 
0    a
1    b
2    a
dtype: category
Categories (2, object): [a, b]
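
The fillna() call above works because "a" is already a category. As a hedged sketch of two related cases: dropna() removes the missing row and leaves the categories unchanged, while filling with a value that is not yet a category is, as far as we are aware, rejected. The exact exception type has differed between pandas versions, so the example catches both ValueError and TypeError; treat that detail as an assumption rather than documented behavior. The value "c" is hypothetical.

```python
import numpy as np
import pandas as pd

s = pd.Series(["a", "b", np.nan], dtype="category")

# dropna() removes the missing row; the categories stay [a, b]
dropped = s.dropna()

# Filling with a value that is not a category is rejected.
# The exception type depends on the pandas version (assumption),
# so both candidates are caught here.
try:
    s.fillna("c")
    raised = False
except (ValueError, TypeError):
    raised = True

# Register the new category first, then fill
filled = s.cat.add_categories(["c"]).fillna("c")

print(dropped)
print(raised)
print(filled)
```

Adding the category explicitly keeps the set of categories under the user's control instead of growing it silently during a fill.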