7.9 Factorizing values
To encode 1-d values as an enumerated type, use factorize:
In [1]: x = pd.Series(['A', 'A', np.nan, 'B', 3.14, np.inf])
In [2]: x
Out[2]:
0 A
1 A
2 NaN
3 B
4 3.14
5 inf
dtype: object
In [3]: labels, uniques = pd.factorize(x)
In [4]: labels
Out[4]: array([ 0, 0, -1, 1, 2, 3])
In [5]: uniques
Out[5]: Index([u'A', u'B', 3.14, inf], dtype='object')
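As a minimal sketch of how the two return values fit together, the original values can be rebuilt from labels and uniques. The name restored is illustrative only; the sketch assumes that the label -1 marks a missing value, as in the output above.

```python
import numpy as np
import pandas as pd

x = pd.Series(['A', 'A', np.nan, 'B', 3.14, np.inf])
labels, uniques = pd.factorize(x)

# Each non-negative label is a position in uniques;
# -1 is the sentinel for a missing value, so map it back to NaN.
restored = [uniques[i] if i >= 0 else np.nan for i in labels]
print(restored)
```

Note that indexing uniques directly with -1 would silently pick its last element, which is why the sentinel is handled explicitly here.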
Note that factorize is similar to numpy.unique, but differs in its handling of NaN:
Note
The following numpy.unique call will fail under Python 3 with a TypeError because of an ordering bug. See also here.
In [6]: pd.factorize(x, sort=True)
Out[6]:
(array([ 2, 2, -1, 3, 0, 1]),
Index([3.14, inf, u'A', u'B'], dtype='object'))
In [7]: np.unique(x, return_inverse=True)[::-1]
Out[7]: (array([3, 3, 0, 4, 1, 2]), array([nan, 3.14, inf, 'A', 'B'], dtype=object))
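The same difference can be seen with a float-only array, which avoids the mixed-type ordering problem and so should run on Python 3 as well. This is a small sketch with made-up data, not part of the original session: factorize drops NaN from the uniques and labels it -1, while numpy.unique keeps NaN as a value of its own.

```python
import numpy as np
import pandas as pd

values = np.array([1.0, np.nan, 2.0])

# factorize: NaN is excluded from uniques and gets the sentinel label -1.
f_labels, f_uniques = pd.factorize(values)

# numpy.unique: NaN is kept as a regular (last-sorted) unique value.
u_uniques, u_inverse = np.unique(values, return_inverse=True)

print(f_labels, f_uniques)
print(u_inverse, u_uniques)
```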
Note
If you just want to handle one column as a categorical variable (like R's factor), you can use df["cat_col"] = pd.Categorical(df["col"]) or df["cat_col"] = df["col"].astype("category"). For full docs on Categorical, see the Categorical introduction and the API documentation. This feature was introduced in version 0.15.
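A short sketch of the categorical route, using a hypothetical DataFrame df with a column "col" invented for illustration. The integer codes and the categories of the resulting column play roughly the same roles as the labels and uniques returned by factorize.

```python
import pandas as pd

# Hypothetical example data; None stands in for a missing value.
df = pd.DataFrame({"col": ["a", "b", "a", None]})
df["cat_col"] = df["col"].astype("category")

# .cat.codes holds the integer encoding (-1 for missing values),
# .cat.categories holds the distinct values.
codes = df["cat_col"].cat.codes
categories = df["cat_col"].cat.categories

print(list(codes))
print(list(categories))
```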