5.12 Gotchas

5.12.1 Memory Usage

The memory usage of a Categorical is proportional to the number of categories plus the length of the data. In contrast, an object dtype is a constant times the length of the data.

In [1]: s = pd.Series(['foo','bar']*1000)

# object dtype
In [2]: s.nbytes
Out[2]: 16000

# category dtype
In [3]: s.astype('category').nbytes
Out[3]: 2016

Note

If the number of categories approaches the length of the data, the Categorical will use nearly as much memory as, or more memory than, an equivalent object dtype representation.

In [4]: s = pd.Series(['foo%04d' % i for i in range(2000)])

# object dtype
In [5]: s.nbytes
Out[5]: 16000

# category dtype
In [6]: s.astype('category').nbytes
Out[6]: 20000
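The figures above can be reproduced by hand: a Categorical stores one integer code per element (using the smallest integer width that can index the categories) plus one object pointer per category. A rough sketch of the arithmetic, assuming a 64-bit build where an object pointer is 8 bytes:

```python
# Rough arithmetic behind the nbytes figures above
# (assuming a 64-bit build, i.e. 8 bytes per object pointer).

# 2000 elements, 2 categories: int8 codes (1 byte each) suffice.
few_cats = 2000 * 1 + 2 * 8       # 2000 codes + 2 category pointers
assert few_cats == 2016

# 2000 elements, 2000 categories: int16 codes (2 bytes each) are needed.
many_cats = 2000 * 2 + 2000 * 8   # 4000 codes + 16000 category pointers
assert many_cats == 20000
```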

5.12.2 Old style constructor usage

In versions of pandas earlier than 0.15, a Categorical could be constructed by passing in precomputed codes (then called labels) instead of values together with categories. The codes were interpreted as pointers into the categories, with -1 representing NaN. This style of construction has been replaced by the dedicated constructor Categorical.from_codes().

Unfortunately, in some special cases code written for the old style constructor will still run under the current pandas version, resulting in subtle bugs:

>>> cat = pd.Categorical([1,2], [1,2,3])
>>> # old version
>>> cat.get_values()
array([2, 3], dtype=int64)
>>> # new version
>>> cat.get_values()
array([1, 2], dtype=int64)

Warning

If you used Categoricals with older versions of pandas, please audit your code before upgrading and change it to use the from_codes() constructor.
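For reference, the from_codes() equivalent of the old-style call above looks like this (a minimal sketch: the codes [1, 2] are positions into the categories, exactly as the old constructor interpreted them):

```python
import pandas as pd

# Old-style intent: [1, 2] are codes (positions into the categories),
# not values, so the resulting values are [2, 3].
cat = pd.Categorical.from_codes([1, 2], categories=[1, 2, 3])
print(list(cat))  # [2, 3]

# A code of -1 still marks a missing value (NaN):
cat_with_nan = pd.Categorical.from_codes([0, -1], categories=[1, 2, 3])
```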

5.12.3 Categorical is not a numpy array

Currently, categorical data and the underlying Categorical are implemented as a Python object and not as a low-level numpy array dtype. This leads to some problems.

numpy itself doesn’t know about the new dtype:

In [7]: try:
   ...:     np.dtype("category")
   ...: except TypeError as e:
   ...:     print("TypeError: " + str(e))
   ...: 
TypeError: data type "category" not understood

In [8]: dtype = pd.Categorical(["a"]).dtype

In [9]: try:
   ...:     np.dtype(dtype)
   ...: except TypeError as e:
   ...:      print("TypeError: " + str(e))
   ...: 
TypeError: data type not understood

Dtype comparisons work:

In [10]: dtype == np.str_
Out[10]: False

In [11]: np.str_ == dtype
Out[11]: False

To check whether a Series contains categorical data, use hasattr(s, 'cat') (pandas 0.16 or later):

In [12]: hasattr(pd.Series(['a'], dtype='category'), 'cat')
Out[12]: True

In [13]: hasattr(pd.Series(['a']), 'cat')
Out[13]: False
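In recent pandas versions the dtype can also be inspected directly (a sketch assuming a pandas version that exports pd.CategoricalDtype at the top level):

```python
import pandas as pd

# Checking the dtype directly, as an alternative to hasattr(s, 'cat'):
cat_series = pd.Series(['a'], dtype='category')
obj_series = pd.Series(['a'])

print(isinstance(cat_series.dtype, pd.CategoricalDtype))  # True
print(isinstance(obj_series.dtype, pd.CategoricalDtype))  # False
```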

Using numpy functions on a Series of type category does not work, because Categoricals are not numeric data (even when .categories is numeric).

In [14]: s = pd.Series(pd.Categorical([1,2,3,4]))

In [15]: try:
   ....:     np.sum(s)
   ....: except TypeError as e:
   ....:      print("TypeError: " + str(e))
   ....: 
TypeError: Categorical cannot perform the operation sum
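If the categories themselves are numeric, one possible workaround (our suggestion, not shown in the docs above) is to materialize the values as a plain ndarray first and reduce that:

```python
import numpy as np
import pandas as pd

s = pd.Series(pd.Categorical([1, 2, 3, 4]))

# np.asarray extracts the category values as an ordinary int64
# ndarray, which numpy can reduce as usual.
total = np.sum(np.asarray(s))
print(total)  # 10
```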

Note

If such a function works, please file a bug at https://github.com/pydata/pandas!

5.12.4 dtype in apply

pandas currently does not preserve the dtype in apply functions: if you apply along rows you get a Series of object dtype (the same as getting a row, where retrieving one element returns a basic type), and applying along columns also converts to object.

In [16]: df = pd.DataFrame({"a":[1,2,3,4],
   ....:                    "b":["a","b","c","d"],
   ....:                    "cats":pd.Categorical([1,2,3,2])})
   ....: 

In [17]: df.apply(lambda row: type(row["cats"]), axis=1)
Out[17]: 
0    <type 'int'>
1    <type 'int'>
2    <type 'int'>
3    <type 'int'>
dtype: object

In [18]: df.apply(lambda col: col.dtype, axis=0)
Out[18]: 
a       object
b       object
cats    object
dtype: object

5.12.5 Categorical Index

New in version 0.16.1.

A new CategoricalIndex index type is introduced in version 0.16.1. See the advanced indexing docs for a more detailed explanation.

Setting the index will create a CategoricalIndex:

In [19]: cats = pd.Categorical([1,2,3,4], categories=[4,2,3,1])

In [20]: strings = ["a","b","c","d"]

In [21]: values = [4,2,3,1]

In [22]: df = pd.DataFrame({"strings":strings, "values":values}, index=cats)

In [23]: df.index
Out[23]: CategoricalIndex([1, 2, 3, 4], categories=[4, 2, 3, 1], ordered=False, dtype='category')

# This now sorts by the categories order
In [24]: df.sort_index()
Out[24]: 
  strings  values
4       d       1
2       b       2
3       c       3
1       a       4

In previous versions (<0.16.1) there is no index of type category, so setting the index to a categorical column will convert the categorical data to a “normal” dtype first and therefore remove any custom ordering of the categories.

5.12.6 Side Effects

Constructing a Series from a Categorical will not copy the input Categorical. This means that changes to the Series will in most cases change the original Categorical:

In [25]: cat = pd.Categorical([1,2,3,10], categories=[1,2,3,4,10])

In [26]: s = pd.Series(cat, name="cat")

In [27]: cat
Out[27]: 
[1, 2, 3, 10]
Categories (5, int64): [1, 2, 3, 4, 10]

In [28]: s.iloc[0:2] = 10

In [29]: cat
Out[29]: 
[10, 10, 3, 10]
Categories (5, int64): [1, 2, 3, 4, 10]

In [30]: df = pd.DataFrame(s)

In [31]: df["cat"].cat.categories = [1,2,3,4,5]

In [32]: cat
Out[32]: 
[5, 5, 3, 5]
Categories (5, int64): [1, 2, 3, 4, 5]

Use copy=True to prevent this behaviour, or simply don’t reuse Categoricals:

In [33]: cat = pd.Categorical([1,2,3,10], categories=[1,2,3,4,10])

In [34]: s = pd.Series(cat, name="cat", copy=True)

In [35]: cat
Out[35]: 
[1, 2, 3, 10]
Categories (5, int64): [1, 2, 3, 4, 10]

In [36]: s.iloc[0:2] = 10

In [37]: cat
Out[37]: 
[1, 2, 3, 10]
Categories (5, int64): [1, 2, 3, 4, 10]

Note

This also happens in some cases when you supply a numpy array instead of a Categorical: using an int array (e.g. np.array([1,2,3,4])) will exhibit the same behaviour, while using a string array (e.g. np.array(["a","b","c","a"])) will not.
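The difference can be checked with np.shares_memory (a sketch assuming default construction semantics; copy=False is passed explicitly to request that pandas adopt the array rather than copy it):

```python
import numpy as np
import pandas as pd

int_arr = np.array([1, 2, 3, 4])
str_arr = np.array(["a", "b", "c", "a"])

# An int array can be adopted directly by the Series, so memory
# is shared with the original array...
print(np.shares_memory(int_arr, pd.Series(int_arr, copy=False).values))

# ...while a unicode string array is converted to object dtype,
# which always makes a copy:
print(np.shares_memory(str_arr, pd.Series(str_arr, copy=False).values))
```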