5.4 Working with categories

Categorical data has a categories and an ordered property, which list the possible values and state whether the ordering matters. These properties are exposed as s.cat.categories and s.cat.ordered. If you don’t specify categories and ordering explicitly, they are inferred from the values passed in.

In [1]: s = pd.Series(["a","b","c","a"], dtype="category")

In [2]: s.cat.categories
Out[2]: Index([u'a', u'b', u'c'], dtype='object')

In [3]: s.cat.ordered
Out[3]: False

It’s also possible to pass in the categories in a specific order:

In [4]: s = pd.Series(pd.Categorical(["a","b","c","a"], categories=["c","b","a"]))

In [5]: s.cat.categories
Out[5]: Index([u'c', u'b', u'a'], dtype='object')

In [6]: s.cat.ordered
Out[6]: False


New categorical data are NOT automatically ordered. You must explicitly pass ordered=True to indicate an ordered Categorical.
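
As a minimal sketch of what passing ordered=True changes, the example below builds an ordered Categorical and takes its minimum and maximum. The size labels are made-up example data, and a reasonably recent pandas release is assumed. With ordering enabled, min() and max() follow the category order rather than alphabetical order.

```python
import pandas as pd

# Hypothetical example data: the labels and their order are illustrative.
sizes = pd.Series(
    pd.Categorical(
        ["small", "large", "medium", "small"],
        categories=["small", "medium", "large"],
        ordered=True,
    )
)

is_ordered = sizes.cat.ordered   # ordering was requested explicitly
smallest = sizes.min()           # follows category order, not alphabetical order
largest = sizes.max()

print(is_ordered, smallest, largest)
```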


The result of Series.unique() is not always the same as Series.cat.categories, because Series.unique() makes two guarantees: it returns values in order of appearance, and it includes only values that are actually present.

In [7]: s = pd.Series(list('babc')).astype('category', categories=list('abcd'))

In [8]: s
0    b
1    a
2    b
3    c
dtype: category
Categories (4, object): [a, b, c, d]

# categories
In [9]: s.cat.categories
Out[9]: Index([u'a', u'b', u'c', u'd'], dtype='object')

# uniques
In [10]: s.unique()
[b, a, c]
Categories (3, object): [b, a, c]
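
The difference can also be checked programmatically. The sketch below assumes a pandas release that provides CategoricalDtype in pandas.api.types; it is used here in place of the categories keyword to astype, which later pandas releases removed. The variable names are illustrative.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Assumption: CategoricalDtype is available (newer pandas releases).
dtype = CategoricalDtype(categories=list("abcd"))
s = pd.Series(list("babc")).astype(dtype)

declared = list(s.cat.categories)   # every declared category, in declared order
observed = list(s.unique())         # only values present, in order of appearance

print(declared)
print(observed)
```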

5.4.1 Renaming categories

Renaming categories is done by assigning new values to the Series.cat.categories property or by using the Categorical.rename_categories() method:

In [11]: s = pd.Series(["a","b","c","a"], dtype="category")

In [12]: s
0    a
1    b
2    c
3    a
dtype: category
Categories (3, object): [a, b, c]

In [13]: s.cat.categories = ["Group %s" % g for g in s.cat.categories]

In [14]: s
0    Group a
1    Group b
2    Group c
3    Group a
dtype: category
Categories (3, object): [Group a, Group b, Group c]

In [15]: s.cat.rename_categories([1,2,3])
0    1
1    2
2    3
3    1
dtype: category
Categories (3, int64): [1, 2, 3]


In contrast to R’s factor, categorical data can have categories of types other than string.


Be aware that assigning new categories is an in-place operation, while most other operations under Series.cat return a new Series of dtype category by default.

Categories must be unique or a ValueError is raised:

In [16]: try:
   ....:     s.cat.categories = [1,1,1]
   ....: except ValueError as e:
   ....:     print("ValueError: " + str(e))
ValueError: Categorical categories must be unique
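
The two points above, that rename_categories() returns a new Series and that duplicate names are rejected, can be sketched together as follows. This is an illustrative example, not a reference; it uses only the method form, which leaves the original Series untouched.

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "a"], dtype="category")

# rename_categories returns a new Series; s keeps its original categories.
renamed = s.cat.rename_categories(["Group %s" % g for g in s.cat.categories])

# Duplicate category names are rejected with a ValueError.
try:
    s.cat.rename_categories([1, 1, 1])
    duplicate_error = None
except ValueError as e:
    duplicate_error = str(e)

print(list(renamed.cat.categories))
print(list(s.cat.categories))
print(duplicate_error)
```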

5.4.2 Appending new categories

Appending categories can be done by using the Categorical.add_categories() method:

In [17]: s = s.cat.add_categories([4])

In [18]: s.cat.categories
Out[18]: Index([u'Group a', u'Group b', u'Group c', 4], dtype='object')

In [19]: s
0    Group a
1    Group b
2    Group c
3    Group a
dtype: category
Categories (4, object): [Group a, Group b, Group c, 4]
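
A small sketch of why appending is useful: once a category exists, values can be set to it, and until then it simply shows up with a count of zero. The category name "d" is a made-up example.

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "a"], dtype="category")

# add_categories returns a new Series; existing values are unchanged.
s = s.cat.add_categories(["d"])
counts_before = s.value_counts()   # unused categories are reported with count 0

# "d" is now a valid category, so it can be assigned.
s.iloc[0] = "d"

print(counts_before["d"])
print(list(s))
```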

5.4.3 Removing categories

Removing categories can be done by using the Categorical.remove_categories() method. Values that belong to a removed category are replaced by np.nan:

In [20]: s = s.cat.remove_categories([4])

In [21]: s
0    Group a
1    Group b
2    Group c
3    Group a
dtype: category
Categories (3, object): [Group a, Group b, Group c]
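
The example above removes a category that no value uses, so no NaN appears. As a sketch of the other case (the data is illustrative), removing a category that is in use turns the affected values into missing values:

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "a"], dtype="category")

# "a" is in use, so every "a" becomes NaN once the category is removed.
trimmed = s.cat.remove_categories(["a"])

print(list(trimmed.cat.categories))
print(trimmed.isnull().tolist())
```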

5.4.4 Removing unused categories

Unused categories can be removed with the Categorical.remove_unused_categories() method:

In [22]: s = pd.Series(pd.Categorical(["a","b","a"], categories=["a","b","c","d"]))

In [23]: s
0    a
1    b
2    a
dtype: category
Categories (4, object): [a, b, c, d]

In [24]: s.cat.remove_unused_categories()
0    a
1    b
2    a
dtype: category
Categories (2, object): [a, b]
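
Unused categories typically appear after filtering, because a selection keeps the full category list. A sketch with illustrative data:

```python
import pandas as pd

s = pd.Series(["a", "b", "a", "c"], dtype="category")

# Boolean selection drops the rows but keeps "c" as a declared category.
subset = s[s != "c"]
cleaned = subset.cat.remove_unused_categories()

print(list(subset.cat.categories))
print(list(cleaned.cat.categories))
```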

5.4.5 Setting categories

If you want to remove and add categories in one step (which has some speed advantage), or simply set the categories to a predefined scale, use Categorical.set_categories().

In [25]: s = pd.Series(["one","two","four", "-"], dtype="category")

In [26]: s
0     one
1     two
2    four
3       -
dtype: category
Categories (4, object): [-, four, one, two]

In [27]: s = s.cat.set_categories(["one","two","three","four"])

In [28]: s
0     one
1     two
2    four
3     NaN
dtype: category
Categories (4, object): [one, two, three, four]


Be aware that Categorical.set_categories() cannot know whether a category is omitted intentionally, because it is misspelled, or (under Python 3) because of a type difference (e.g., NumPy’s S1 dtype and Python strings). This can result in surprising behaviour!
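
One way to guard against this, sketched below, is to compare the values that are present with the new categories before calling set_categories(). The helper name values_lost_by is hypothetical and not part of pandas, and the misspelled category is deliberate.

```python
import pandas as pd


def values_lost_by(series, new_categories):
    """Return present values that would become NaN under new_categories.

    Hypothetical helper for illustration; not a pandas function.
    """
    present = set(series.dropna().unique())
    return sorted(present - set(new_categories))


s = pd.Series(["one", "two", "four", "-"], dtype="category")
new_categories = ["one", "two", "three", "fuor"]   # "fuor" is a deliberate typo

lost = values_lost_by(s, new_categories)
result = s.cat.set_categories(new_categories)

print(lost)
print(result.isnull().tolist())
```

Here "-" is dropped on purpose, while "four" is lost only because of the typo; listing both before the call makes the second case visible.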