.. currentmodule:: pandas .. ipython:: python :suppress: import numpy as np import pandas as pd np.random.seed(123456) np.set_printoptions(precision=4, suppress=True) pd.options.display.max_rows = 8 Gotchas ------- .. _categorical.rfactor: Memory Usage ~~~~~~~~~~~~ .. _categorical.memory: The memory usage of a ``Categorical`` is proportional to the number of categories times the length of the data. In contrast, an ``object`` dtype is a constant times the length of the data. .. ipython:: python s = pd.Series(['foo','bar']*1000) # object dtype s.nbytes # category dtype s.astype('category').nbytes .. note:: If the number of categories approaches the length of the data, the ``Categorical`` will use nearly the same or more memory than an equivalent ``object`` dtype representation. .. ipython:: python s = pd.Series(['foo%04d' % i for i in range(2000)]) # object dtype s.nbytes # category dtype s.astype('category').nbytes Old style constructor usage ~~~~~~~~~~~~~~~~~~~~~~~~~~~ In earlier versions than pandas 0.15, a `Categorical` could be constructed by passing in precomputed `codes` (called then `labels`) instead of values with categories. The `codes` were interpreted as pointers to the categories with `-1` as `NaN`. This type of constructor usage is replaced by the special constructor :func:`Categorical.from_codes`. Unfortunately, in some special cases, using code which assumes the old style constructor usage will work with the current pandas version, resulting in subtle bugs: .. code-block:: python >>> cat = pd.Categorical([1,2], [1,2,3]) >>> # old version >>> cat.get_values() array([2, 3], dtype=int64) >>> # new version >>> cat.get_values() array([1, 2], dtype=int64) .. warning:: If you used `Categoricals` with older versions of pandas, please audit your code before upgrading and change your code to use the :func:`~pandas.Categorical.from_codes` constructor. `Categorical` is not a `numpy` array ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Currently, categorical data and the underlying `Categorical` is implemented as a python object and not as a low-level `numpy` array dtype. This leads to some problems. `numpy` itself doesn't know about the new `dtype`: .. ipython:: python try: np.dtype("category") except TypeError as e: print("TypeError: " + str(e)) dtype = pd.Categorical(["a"]).dtype try: np.dtype(dtype) except TypeError as e: print("TypeError: " + str(e)) Dtype comparisons work: .. ipython:: python dtype == np.str_ np.str_ == dtype To check if a Series contains Categorical data, with pandas 0.16 or later, use ``hasattr(s, 'cat')``: .. ipython:: python hasattr(pd.Series(['a'], dtype='category'), 'cat') hasattr(pd.Series(['a']), 'cat') Using `numpy` functions on a `Series` of type ``category`` should not work as `Categoricals` are not numeric data (even in the case that ``.categories`` is numeric). .. ipython:: python s = pd.Series(pd.Categorical([1,2,3,4])) try: np.sum(s) #same with np.log(s),.. except TypeError as e: print("TypeError: " + str(e)) .. note:: If such a function works, please file a bug at https://github.com/pydata/pandas! dtype in apply ~~~~~~~~~~~~~~ Pandas currently does not preserve the dtype in apply functions: If you apply along rows you get a `Series` of ``object`` `dtype` (same as getting a row -> getting one element will return a basic type) and applying along columns will also convert to object. .. ipython:: python df = pd.DataFrame({"a":[1,2,3,4], "b":["a","b","c","d"], "cats":pd.Categorical([1,2,3,2])}) df.apply(lambda row: type(row["cats"]), axis=1) df.apply(lambda col: col.dtype, axis=0) Categorical Index ~~~~~~~~~~~~~~~~~ .. versionadded:: 0.16.1 A new ``CategoricalIndex`` index type is introduced in version 0.16.1. See the :ref:`advanced indexing docs ` for a more detailed explanation. Setting the index, will create create a ``CategoricalIndex`` .. ipython:: python cats = pd.Categorical([1,2,3,4], categories=[4,2,3,1]) strings = ["a","b","c","d"] values = [4,2,3,1] df = pd.DataFrame({"strings":strings, "values":values}, index=cats) df.index # This now sorts by the categories order df.sort_index() In previous versions (<0.16.1) there is no index of type ``category``, so setting the index to categorical column will convert the categorical data to a "normal" dtype first and therefore remove any custom ordering of the categories. Side Effects ~~~~~~~~~~~~ Constructing a `Series` from a `Categorical` will not copy the input `Categorical`. This means that changes to the `Series` will in most cases change the original `Categorical`: .. ipython:: python cat = pd.Categorical([1,2,3,10], categories=[1,2,3,4,10]) s = pd.Series(cat, name="cat") cat s.iloc[0:2] = 10 cat df = pd.DataFrame(s) df["cat"].cat.categories = [1,2,3,4,5] cat Use ``copy=True`` to prevent such a behaviour or simply don't reuse `Categoricals`: .. ipython:: python cat = pd.Categorical([1,2,3,10], categories=[1,2,3,4,10]) s = pd.Series(cat, name="cat", copy=True) cat s.iloc[0:2] = 10 cat .. note:: This also happens in some cases when you supply a `numpy` array instead of a `Categorical`: using an int array (e.g. ``np.array([1,2,3,4])``) will exhibit the same behaviour, while using a string array (e.g. ``np.array(["a","b","c","a"])``) will not.