7.9 Factorizing values

To encode 1-d values as an enumerated type, use factorize, which returns an array of integer labels together with the unique values:

In [1]: x = pd.Series(['A', 'A', np.nan, 'B', 3.14, np.inf])

In [2]: x
Out[2]: 
0       A
1       A
2     NaN
3       B
4    3.14
5     inf
dtype: object

In [3]: labels, uniques = pd.factorize(x)

In [4]: labels
Out[4]: array([ 0,  0, -1,  1,  2,  3])

In [5]: uniques
Out[5]: Index([u'A', u'B', 3.14, inf], dtype='object')
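As a minimal sketch of how the two return values relate (the names missing and rebuilt are illustrative, not part of the pandas API): each non-negative label is a position into uniques, and -1 marks a missing value.

```python
import numpy as np
import pandas as pd

x = pd.Series(['A', 'A', np.nan, 'B', 3.14, np.inf])
labels, uniques = pd.factorize(x)

# Missing values receive the sentinel label -1 and do not appear in uniques.
missing = labels == -1

# Every other label is a position into uniques, so taking those positions
# rebuilds the original non-missing values in their original order.
rebuilt = uniques.take(labels[~missing])
```

Because -1 is also a valid "last element" position in Python indexing, mask out the missing labels before indexing into uniques, as done here.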

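factorize is similar to numpy.unique, but differs in its handling of NaN: factorize assigns NaN the label -1 and leaves it out of the uniques, whereas numpy.unique keeps NaN as one of the unique values: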

Note

The following numpy.unique call fails under Python 3 with a TypeError because of an ordering bug.

In [6]: pd.factorize(x, sort=True)
Out[6]: 
(array([ 2,  2, -1,  3,  0,  1]),
 Index([3.14, inf, u'A', u'B'], dtype='object'))

In [7]: np.unique(x, return_inverse=True)[::-1]
Out[7]: (array([3, 3, 0, 4, 1, 2]), array([nan, 3.14, inf, 'A', 'B'], dtype=object))
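Since the mixed-type call above does not run under Python 3, the same comparison can be sketched with an all-float array. The array values below is an assumption chosen for illustration; it is not the x used earlier in this section.

```python
import numpy as np
import pandas as pd

# Hypothetical all-float input, so numpy.unique can sort it on Python 3.
values = np.array([2.0, np.nan, 1.0, 2.0])

# factorize: NaN gets the label -1 and is excluded from the uniques.
labels, uniques = pd.factorize(values, sort=True)

# numpy.unique: NaN is kept as a unique value and gets an ordinary index.
np_uniques, np_inverse = np.unique(values, return_inverse=True)

print(labels, uniques)
print(np_inverse, np_uniques)
```

With factorize, a downstream step can detect missing data by testing for -1; with numpy.unique, the caller has to look for NaN in the uniques instead.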

Note

If you just want to handle one column as a categorical variable (like R’s factor), you can use df["cat_col"] = pd.Categorical(df["col"]) or df["cat_col"] = df["col"].astype("category"). For full docs on Categorical, see the Categorical introduction and the API documentation. This feature was introduced in version 0.15.
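A short sketch of the categorical route, assuming pandas 0.15 or later. The frame df and its column "col" are hypothetical; the .cat accessor exposes the same labels/uniques split that factorize returns.

```python
import pandas as pd

# Hypothetical frame with one string column and a missing entry.
df = pd.DataFrame({"col": ["b", "a", "b", None]})
df["cat_col"] = df["col"].astype("category")

# Integer codes play the role of factorize's labels; -1 marks missing values.
codes = df["cat_col"].cat.codes

# Categories play the role of the uniques (sorted for sortable values).
categories = df["cat_col"].cat.categories
```

Unlike factorize, which returns two separate objects, the categorical column keeps codes and categories together, so the original values remain visible in the frame.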