5.4 Working with categories
Categorical data has a categories and an ordered property, which list the possible values and
whether the ordering matters. These properties are exposed as s.cat.categories and
s.cat.ordered. If you don’t manually specify categories and ordering, they are inferred from the
passed-in values.
In [1]: s = pd.Series(["a","b","c","a"], dtype="category")
In [2]: s.cat.categories
Out[2]: Index([u'a', u'b', u'c'], dtype='object')
In [3]: s.cat.ordered
Out[3]: False
It’s also possible to pass in the categories in a specific order:
In [4]: s = pd.Series(pd.Categorical(["a","b","c","a"], categories=["c","b","a"]))
In [5]: s.cat.categories
Out[5]: Index([u'c', u'b', u'a'], dtype='object')
In [6]: s.cat.ordered
Out[6]: False
Note
New categorical data are NOT automatically ordered. You must explicitly pass ordered=True to
indicate an ordered Categorical.
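The note above can be illustrated with a minimal sketch that builds the same data twice, once without and once with ordered=True. The labels ("low", "medium", "high") and the variable names are illustrative assumptions, not part of the original examples.

```python
import pandas as pd

# Hypothetical labels; the list order defines the ranking low < medium < high.
levels = ["low", "medium", "high"]
values = ["medium", "low", "high", "low"]

unordered = pd.Series(pd.Categorical(values, categories=levels))
ordered = pd.Series(pd.Categorical(values, categories=levels, ordered=True))

print(unordered.cat.ordered)
print(ordered.cat.ordered)

# For an ordered categorical, min/max follow the category order,
# not the alphabetical order of the labels.
print(ordered.min(), ordered.max())
```

Only the ordered variant carries a ranking, so order-dependent operations such as min() and max() should be reserved for it.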
Note
The result of Series.unique() is not always the same as Series.cat.categories, because
Series.unique() has a couple of guarantees, namely that it returns categories in the order of
appearance, and it only includes values that are actually present.
In [7]: s = pd.Series(list('babc')).astype('category', categories=list('abcd'))
In [8]: s
Out[8]:
0 b
1 a
2 b
3 c
dtype: category
Categories (4, object): [a, b, c, d]
# categories
In [9]: s.cat.categories
Out[9]: Index([u'a', u'b', u'c', u'd'], dtype='object')
# uniques
In [10]: s.unique()
Out[10]:
[b, a, c]
Categories (3, object): [b, a, c]
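The difference between the two views can also be checked programmatically. The following is a small sketch, assuming the data is built with pd.Categorical and an explicit categories list; the names declared, observed and unused are illustrative.

```python
import pandas as pd

# "d" is declared as a category but never occurs in the data.
s = pd.Series(pd.Categorical(list("babc"), categories=list("abcd")))

declared = list(s.cat.categories)  # every declared category, in declared order
observed = list(s.unique())        # only values that occur, in order of first appearance

print(declared)
print(observed)

# Categories that were declared but never observed.
unused = [c for c in declared if c not in observed]
print(unused)
```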
5.4.1 Renaming categories
Renaming categories is done by assigning new values to the Series.cat.categories property or
by using the Categorical.rename_categories() method:
In [11]: s = pd.Series(["a","b","c","a"], dtype="category")
In [12]: s
Out[12]:
0 a
1 b
2 c
3 a
dtype: category
Categories (3, object): [a, b, c]
In [13]: s.cat.categories = ["Group %s" % g for g in s.cat.categories]
In [14]: s
Out[14]:
0 Group a
1 Group b
2 Group c
3 Group a
dtype: category
Categories (3, object): [Group a, Group b, Group c]
In [15]: s.cat.rename_categories([1,2,3])
Out[15]:
0 1
1 2
2 3
3 1
dtype: category
Categories (3, int64): [1, 2, 3]
Note
In contrast to R’s factor, categorical data can have categories of types other than string.
Note
Be aware that assigning new categories is an in-place operation, while most other operations
under Series.cat return a new Series of dtype category by default.
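As a sketch of the non-mutating route described in the note, the snippet below renames through rename_categories() and keeps the result in a separate variable. The name renamed is an illustrative assumption.

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "a"], dtype="category")

# rename_categories returns a new Series; s itself is left untouched.
renamed = s.cat.rename_categories(["Group %s" % g for g in s.cat.categories])

print(list(s.cat.categories))
print(list(renamed.cat.categories))
print(list(renamed))
```

Keeping the original Series unchanged makes it easier to compare the data before and after the rename.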
Categories must be unique or a ValueError is raised:
In [16]: try:
....:     s.cat.categories = [1,1,1]
....: except ValueError as e:
....:     print("ValueError: " + str(e))
....:
ValueError: Categorical categories must be unique
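The same uniqueness check should apply when renaming through the method rather than by assignment. The sketch below assumes rename_categories() rejects duplicate labels with a ValueError; the flag raised is illustrative.

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "a"], dtype="category")

try:
    # Three identical labels are not a valid set of categories.
    s.cat.rename_categories([1, 1, 1])
    raised = False
except ValueError as e:
    raised = True
    print("ValueError: " + str(e))

# The failed call leaves the original categories intact.
print(list(s.cat.categories))
```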
5.4.2 Appending new categories
Appending categories can be done by using the Categorical.add_categories() method:
In [17]: s = s.cat.add_categories([4])
In [18]: s.cat.categories
Out[18]: Index([u'Group a', u'Group b', u'Group c', 4], dtype='object')
In [19]: s
Out[19]:
0 Group a
1 Group b
2 Group c
3 Group a
dtype: category
Categories (4, object): [Group a, Group b, Group c, 4]
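A newly appended category is valid but unused until a value is assigned to it. The following sketch uses a fresh Series and a hypothetical new label "d" to show this.

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "a"], dtype="category")
extended = s.cat.add_categories(["d"])

print(list(extended.cat.categories))

# value_counts reports the new, still unused category with a count of 0.
counts = extended.value_counts()
print(counts["d"])

# Because "d" is now a known category, it can be assigned as a value.
extended.iloc[0] = "d"
print(list(extended))
```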
5.4.3 Removing categories
Removing categories can be done by using the Categorical.remove_categories() method. Values
whose category is removed are replaced by np.nan:
In [20]: s = s.cat.remove_categories([4])
In [21]: s
Out[21]:
0 Group a
1 Group b
2 Group c
3 Group a
dtype: category
Categories (3, object): [Group a, Group b, Group c]
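The example above removes a category that no value uses, so no entry changes. As a complementary sketch (with illustrative variable names), removing a category that is in use turns the affected entries into missing values:

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "a"], dtype="category")

# "a" is in use at positions 0 and 3, so those entries become missing.
reduced = s.cat.remove_categories(["a"])

print(list(reduced.cat.categories))
print(reduced.isna().tolist())
```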
5.4.4 Removing unused categories
Categories that no value uses can be dropped with remove_unused_categories():
In [22]: s = pd.Series(pd.Categorical(["a","b","a"], categories=["a","b","c","d"]))
In [23]: s
Out[23]:
0 a
1 b
2 a
dtype: category
Categories (4, object): [a, b, c, d]
In [24]: s.cat.remove_unused_categories()
Out[24]:
0 a
1 b
2 a
dtype: category
Categories (2, object): [a, b]
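Unused categories often appear after filtering, because selecting rows does not shrink the category list. A minimal sketch, with illustrative names subset and trimmed:

```python
import pandas as pd

s = pd.Series(["a", "b", "a", "c"], dtype="category")

# Filtering rows keeps the full category list, including the now-unused "c".
subset = s[s != "c"]
print(list(subset.cat.categories))

# Dropping unused categories afterwards leaves only those still present.
trimmed = subset.cat.remove_unused_categories()
print(list(trimmed.cat.categories))
```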
5.4.5 Setting categories
If you want to remove and add categories in one step (which has some speed advantage),
or simply set the categories to a predefined scale, use Categorical.set_categories().
In [25]: s = pd.Series(["one","two","four", "-"], dtype="category")
In [26]: s
Out[26]:
0 one
1 two
2 four
3 -
dtype: category
Categories (4, object): [-, four, one, two]
In [27]: s = s.cat.set_categories(["one","two","three","four"])
In [28]: s
Out[28]:
0 one
1 two
2 four
3 NaN
dtype: category
Categories (4, object): [one, two, three, four]
Note
Be aware that Categorical.set_categories() cannot know whether a category is omitted
intentionally, because it is misspelled, or (under Python 3) because of a type difference (e.g.,
NumPy’s S1 dtype and Python strings). This can result in surprising behaviour!
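One way to guard against this is to compare the values in use with the new categories before calling set_categories(). The helper below, set_categories_checked, is a hypothetical wrapper written for this illustration and is not part of pandas.

```python
import pandas as pd

def set_categories_checked(s, new_categories):
    """Hypothetical helper: set categories, but refuse to drop values silently."""
    in_use = set(s.dropna().unique())
    dropped = in_use - set(new_categories)
    if dropped:
        labels = ", ".join(sorted(map(repr, dropped)))
        raise ValueError("values would become NaN: " + labels)
    return s.cat.set_categories(new_categories)

s = pd.Series(["one", "two", "four"], dtype="category")

# "fuor" is a deliberate typo, so the existing value "four" would be lost.
try:
    set_categories_checked(s, ["one", "two", "three", "fuor"])
    message = None
except ValueError as e:
    message = str(e)
    print(message)

# With the correct spelling, every existing value keeps its category.
ok = set_categories_checked(s, ["one", "two", "three", "four"])
print(list(ok.cat.categories))
```

The check only catches values that would be lost. A category that is intentionally omitted still has to be handled by the caller, for example by catching the error and calling set_categories() directly.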