5.12 Gotchas
5.12.1 Memory Usage
The memory usage of a Categorical is proportional to the number of categories plus the length of the data. In contrast, an object dtype is a constant times the length of the data.
In [1]: s = pd.Series(['foo','bar']*1000)
# object dtype
In [2]: s.nbytes
Out[2]: 16000
# category dtype
In [3]: s.astype('category').nbytes
Out[3]: 2016
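The figure above can be broken into its two parts: the integer codes (one per element) and the stored categories. The following is a minimal sketch, assuming a pandas version in which `Series.values` returns the underlying Categorical for category data; the variable names are illustrative only.

```python
import pandas as pd

s = pd.Series(['foo', 'bar'] * 1000)

# The Categorical behind a category-dtype Series.
cat = s.astype('category').values

# One small integer code per element (int8 here, since there are only 2 categories).
codes_bytes = cat.codes.nbytes

# The categories themselves are stored once, regardless of the data length.
categories_bytes = cat.categories.values.nbytes

# Total reported by the Categorical.
total_bytes = cat.nbytes
```

With only two categories, almost all of the total comes from the codes, which is why the category dtype is so much smaller than the object dtype in this case.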
Note
If the number of categories approaches the length of the data, the Categorical will use nearly the same or more memory than an equivalent object dtype representation.
In [4]: s = pd.Series(['foo%04d' % i for i in range(2000)])
# object dtype
In [5]: s.nbytes
Out[5]: 16000
# category dtype
In [6]: s.astype('category').nbytes
Out[6]: 20000
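One way to anticipate this before converting is to look at the fraction of distinct values. The sketch below is only an illustration: `unique_ratio` is a hypothetical helper, not part of pandas, and any cut-off you apply to its result is your own assumption. It also shows that the integer codes need a wider dtype once there are more categories than an 8-bit integer can index.

```python
import pandas as pd

few = pd.Series(['foo', 'bar'] * 1000).astype('category')
many = pd.Series(['foo%04d' % i for i in range(2000)]).astype('category')


def unique_ratio(s):
    """Hypothetical helper: fraction of distinct values in a Series."""
    return float(s.nunique()) / len(s)


few_ratio = unique_ratio(few)     # close to 0: few categories, good candidate
many_ratio = unique_ratio(many)   # 1.0: every value is its own category

# Bytes used per code; more categories require a wider integer type.
few_code_width = few.cat.codes.dtype.itemsize
many_code_width = many.cat.codes.dtype.itemsize
```

A ratio near 1 means the categories array is about as long as the data, so the codes are pure overhead.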
5.12.2 Old style constructor usage
In versions of pandas earlier than 0.15, a Categorical could be constructed by passing in precomputed codes (then called labels) instead of values with categories. The codes were interpreted as pointers to the categories, with -1 as NaN. This type of constructor usage is replaced by the special constructor Categorical.from_codes().
Unfortunately, in some special cases, code that assumes the old-style constructor still runs without error on the current pandas version but produces different results, leading to subtle bugs:
>>> cat = pd.Categorical([1,2], [1,2,3])
>>> # old version
>>> cat.get_values()
array([2, 3], dtype=int64)
>>> # new version
>>> cat.get_values()
array([1, 2], dtype=int64)
Warning
If you used Categoricals with older versions of pandas, please audit your code before upgrading and change your code to use the from_codes() constructor.
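As a minimal sketch of the replacement, the example below builds a Categorical from precomputed codes with `Categorical.from_codes()` and contrasts it with the regular constructor, which treats its first argument as values. The input lists are illustrative only.

```python
import numpy as np
import pandas as pd

# Codes are positions into `categories`; -1 marks a missing value.
codes = [1, 2, -1]
categories = [1, 2, 3]

from_codes = pd.Categorical.from_codes(codes, categories=categories)

# Materialise the values: codes 1 and 2 point at categories 2 and 3,
# and the -1 code becomes NaN.
values = np.asarray(from_codes)

# The regular constructor interprets the first argument as values, not codes.
as_values = pd.Categorical([1, 2], categories=[1, 2, 3])
```

Using `from_codes()` makes the intent explicit, so the same input can no longer be silently read as values.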
5.12.3 Categorical is not a numpy array
Currently, categorical data and the underlying Categorical are implemented as a Python object and not as a low-level numpy array dtype. This leads to some problems.
numpy itself doesn’t know about the new dtype:
In [7]: try:
   ...:     np.dtype("category")
   ...: except TypeError as e:
   ...:     print("TypeError: " + str(e))
   ...:
TypeError: data type "category" not understood
In [8]: dtype = pd.Categorical(["a"]).dtype
In [9]: try:
   ...:     np.dtype(dtype)
   ...: except TypeError as e:
   ...:     print("TypeError: " + str(e))
   ...:
TypeError: data type not understood
Dtype comparisons work:
In [10]: dtype == np.str_
Out[10]: False
In [11]: np.str_ == dtype
Out[11]: False
To check if a Series contains Categorical data, with pandas 0.16 or later, use hasattr(s, 'cat'):
In [12]: hasattr(pd.Series(['a'], dtype='category'), 'cat')
Out[12]: True
In [13]: hasattr(pd.Series(['a']), 'cat')
Out[13]: False
Using numpy functions on a Series of type category should not work as Categoricals are not numeric data (even in the case that .categories is numeric).
In [14]: s = pd.Series(pd.Categorical([1,2,3,4]))
In [15]: try:
   ....:     np.sum(s)
   ....: except TypeError as e:
   ....:     print("TypeError: " + str(e))
   ....:
TypeError: Categorical cannot perform the operation sum
Note
If such a function works, please file a bug at https://github.com/pydata/pandas!
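When the categories really are numbers and a numeric result is wanted, one option is to leave the category dtype before reducing. The following is a minimal sketch of two such conversions, assuming every value can be cast to an integer (no missing values); it is not the only way to do this.

```python
import numpy as np
import pandas as pd

s = pd.Series(pd.Categorical([1, 2, 3, 4]))

# Option 1: cast the Series to a numeric dtype, then reduce.
total_via_astype = s.astype('int64').sum()

# Option 2: materialise a plain numpy array of the values first.
total_via_array = np.sum(np.asarray(s))
```

Both conversions discard the categorical information, so the result is an ordinary number.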
5.12.4 dtype in apply
Pandas currently does not preserve the dtype in apply functions: if you apply along rows you get a Series of object dtype (the same as getting a row, where getting one element returns a basic type), and applying along columns will also convert to object.
In [16]: df = pd.DataFrame({"a":[1,2,3,4],
   ....:                    "b":["a","b","c","d"],
   ....:                    "cats":pd.Categorical([1,2,3,2])})
   ....:
In [17]: df.apply(lambda row: type(row["cats"]), axis=1)
Out[17]:
0 <type 'int'>
1 <type 'int'>
2 <type 'int'>
3 <type 'int'>
dtype: object
In [18]: df.apply(lambda col: col.dtype, axis=0)
Out[18]:
a object
b object
cats object
dtype: object
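If the goal is only to inspect column dtypes, reading them directly avoids apply altogether. The sketch below uses `DataFrame.dtypes` and the `hasattr(s, 'cat')` check described earlier; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4],
                   "b": ["a", "b", "c", "d"],
                   "cats": pd.Categorical([1, 2, 3, 2])})

# Per-column dtypes, read without going through apply.
dtypes = df.dtypes
cats_dtype_name = str(dtypes["cats"])

# Selecting the column keeps its category dtype.
cats_is_categorical = hasattr(df["cats"], "cat")
a_is_categorical = hasattr(df["a"], "cat")
```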
5.12.5 Categorical Index
New in version 0.16.1.
A new CategoricalIndex index type is introduced in version 0.16.1. See the advanced indexing docs for a more detailed explanation.
Setting the index will create a CategoricalIndex:
In [19]: cats = pd.Categorical([1,2,3,4], categories=[4,2,3,1])
In [20]: strings = ["a","b","c","d"]
In [21]: values = [4,2,3,1]
In [22]: df = pd.DataFrame({"strings":strings, "values":values}, index=cats)
In [23]: df.index
Out[23]: CategoricalIndex([1, 2, 3, 4], categories=[4, 2, 3, 1], ordered=False, dtype='category')
# This now sorts by the categories order
In [24]: df.sort_index()
Out[24]:
  strings  values
4       d       1
2       b       2
3       c       3
1       a       4
In previous versions (<0.16.1) there was no index of type category, so setting the index to a categorical column converted the categorical data to a “normal” dtype first and therefore removed any custom ordering of the categories.
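The same applies when an existing categorical column is moved into the index. The following is a minimal sketch, assuming pandas 0.16.1 or later; the column names are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "strings": ["a", "b", "c", "d"],
    "cats": pd.Categorical([1, 2, 3, 4], categories=[4, 2, 3, 1]),
})

# Moving the categorical column into the index keeps the category dtype.
indexed = df.set_index("cats")

# Sorting follows the order of the categories (4, 2, 3, 1), not the numeric order.
sorted_df = indexed.sort_index()
```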
5.12.6 Side Effects
Constructing a Series from a Categorical will not copy the input Categorical. This means that changes to the Series will in most cases change the original Categorical:
In [25]: cat = pd.Categorical([1,2,3,10], categories=[1,2,3,4,10])
In [26]: s = pd.Series(cat, name="cat")
In [27]: cat
Out[27]:
[1, 2, 3, 10]
Categories (5, int64): [1, 2, 3, 4, 10]
In [28]: s.iloc[0:2] = 10
In [29]: cat
Out[29]:
[10, 10, 3, 10]
Categories (5, int64): [1, 2, 3, 4, 10]
In [30]: df = pd.DataFrame(s)
In [31]: df["cat"].cat.categories = [1,2,3,4,5]
In [32]: cat
Out[32]:
[5, 5, 3, 5]
Categories (5, int64): [1, 2, 3, 4, 5]
Use copy=True to prevent such a behaviour or simply don’t reuse Categoricals:
In [33]: cat = pd.Categorical([1,2,3,10], categories=[1,2,3,4,10])
In [34]: s = pd.Series(cat, name="cat", copy=True)
In [35]: cat
Out[35]:
[1, 2, 3, 10]
Categories (5, int64): [1, 2, 3, 4, 10]
In [36]: s.iloc[0:2] = 10
In [37]: cat
Out[37]:
[1, 2, 3, 10]
Categories (5, int64): [1, 2, 3, 4, 10]
Note
This also happens in some cases when you supply a numpy array instead of a Categorical: using an int array (e.g. np.array([1,2,3,4])) will exhibit the same behaviour, while using a string array (e.g. np.array(["a","b","c","a"])) will not.
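A defensive alternative that does not rely on how a particular pandas version shares memory is to copy the input explicitly before handing it to the Series. The sketch below does this for both a Categorical and an int array; it is one possible approach, not a required pattern.

```python
import numpy as np
import pandas as pd

cat = pd.Categorical([1, 2, 3, 10], categories=[1, 2, 3, 4, 10])
arr = np.array([1, 2, 3, 4])

# Hand the Series its own copy so later changes cannot reach the originals.
s_cat = pd.Series(cat.copy(), name="cat")
s_arr = pd.Series(arr.copy(), name="arr")

s_cat.iloc[0:2] = 10
s_arr.iloc[0:2] = 10

# The originals are unchanged; only the Series hold the new values.
cat_values = list(cat)
arr_values = list(arr)
```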