5.5 Sorting and Order

Warning

The default for construction changed in v0.16.0 to ordered=False, from the prior implicit ordered=True.

If categorical data is ordered (s.cat.ordered == True), then the order of the categories has a meaning and certain operations are possible. If the categorical is unordered, .min()/.max() will raise a TypeError.

In [1]: s = pd.Series(pd.Categorical(["a","b","c","a"], ordered=False))

In [2]: s.sort_values(inplace=True)

In [3]: s = pd.Series(["a","b","c","a"]).astype('category', ordered=True)

In [4]: s.sort_values(inplace=True)

In [5]: s
Out[5]: 
0    a
3    a
1    b
2    c
dtype: category
Categories (3, object): [a < b < c]

In [6]: s.min(), s.max()
Out[6]: ('a', 'c')
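The contrast between the two cases can be sketched as follows. The variable names (unordered, ordered, min_error) are illustrative only, and the snippet assumes a pandas version in which pd.Categorical accepts the ordered keyword.

```python
import pandas as pd

# Same values, differing only in the ordered flag.
unordered = pd.Series(pd.Categorical(["a", "b", "c", "a"], ordered=False))
ordered = pd.Series(pd.Categorical(["a", "b", "c", "a"], ordered=True))

# min() on the unordered categorical is expected to raise TypeError.
try:
    unordered.min()
    min_error = None
except TypeError as exc:
    min_error = exc

# On the ordered categorical, min()/max() follow the category order.
smallest, largest = ordered.min(), ordered.max()
print(type(min_error).__name__, smallest, largest)
```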

You can make categorical data ordered with as_ordered() or unordered with as_unordered(). By default, both return a new object and leave the original unchanged.

In [7]: s.cat.as_ordered()
Out[7]: 
0    a
3    a
1    b
2    c
dtype: category
Categories (3, object): [a < b < c]

In [8]: s.cat.as_unordered()
Out[8]: 
0    a
3    a
1    b
2    c
dtype: category
Categories (3, object): [a, b, c]
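A minimal sketch of the "returns a new object" behaviour, assuming the default (non-inplace) call; the name unordered_copy is illustrative only.

```python
import pandas as pd

s = pd.Series(pd.Categorical(["a", "b", "c", "a"], ordered=True))

# as_unordered() returns a new Series; s itself keeps its ordered flag.
unordered_copy = s.cat.as_unordered()

print(s.cat.ordered, unordered_copy.cat.ordered)
```

To keep the change, assign the result back, for example s = s.cat.as_unordered().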

Sorting uses the order defined by the categories, not any lexical or numeric order inherent in the data type. This holds even for strings and numeric data:

In [9]: s = pd.Series([1,2,3,1], dtype="category")

In [10]: s = s.cat.set_categories([2,3,1], ordered=True)

In [11]: s
Out[11]: 
0    1
1    2
2    3
3    1
dtype: category
Categories (3, int64): [2 < 3 < 1]

In [12]: s.sort_values(inplace=True)

In [13]: s
Out[13]: 
1    2
2    3
0    1
3    1
dtype: category
Categories (3, int64): [2 < 3 < 1]

In [14]: s.min(), s.max()
Out[14]: (2, 1)
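The same holds for strings. The sketch below uses made-up size labels (not taken from the examples in this section) to compare sorting by category order with plain lexical sorting of the same values.

```python
import pandas as pd

# Hypothetical labels whose meaningful order differs from alphabetical order.
sizes = pd.Series(pd.Categorical(
    ["medium", "small", "large", "small"],
    categories=["small", "medium", "large"],
    ordered=True,
))

# Sorting the categorical follows small < medium < large.
by_category = sizes.sort_values().tolist()

# Sorting the plain strings follows lexical order instead.
lexical = sizes.astype(str).sort_values().tolist()

print(by_category)
print(lexical)
```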

5.5.1 Reordering

Reordering the categories is possible via the Categorical.reorder_categories() and the Categorical.set_categories() methods. For Categorical.reorder_categories(), all old categories must be included in the new categories and no new categories are allowed. After reordering, the sort order necessarily matches the order of the categories.

In [15]: s = pd.Series([1,2,3,1], dtype="category")

In [16]: s = s.cat.reorder_categories([2,3,1], ordered=True)

In [17]: s
Out[17]: 
0    1
1    2
2    3
3    1
dtype: category
Categories (3, int64): [2 < 3 < 1]

In [18]: s.sort_values(inplace=True)

In [19]: s
Out[19]: 
1    2
2    3
0    1
3    1
dtype: category
Categories (3, int64): [2 < 3 < 1]

In [20]: s.min(), s.max()
Out[20]: (2, 1)
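The restriction on reorder_categories() can be seen by passing an incomplete list. The sketch below assumes current pandas behaviour, in which the mismatch raises ValueError, while set_categories() accepts the shorter list and turns values outside it into missing values. The names reorder_error and trimmed are illustrative only.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 1], dtype="category")

# Dropping category 1 is not a pure reordering, so this is rejected.
try:
    s.cat.reorder_categories([2, 3], ordered=True)
    reorder_error = None
except ValueError as exc:
    reorder_error = exc

# set_categories() is more permissive: values whose category
# is no longer listed become NaN.
trimmed = s.cat.set_categories([2, 3], ordered=True)

print(type(reorder_error).__name__)
print(trimmed.isna().tolist())
```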

Note

Note the difference between assigning new categories and reordering the categories. Assigning new categories renames them, and therefore changes the individual values in the Series, but positions are preserved: a value whose category sorted last before renaming still sorts last under its new name. Reordering changes how values are sorted afterwards, but does not change the individual values in the Series.
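A small sketch of this difference, using rename_categories() as the renaming step (an assumption about which renaming API is in use; the replacement names x, y, z are made up for illustration).

```python
import pandas as pd

s = pd.Series(pd.Categorical(
    ["a", "b", "c", "a"], categories=["c", "b", "a"], ordered=True
))

# Renaming is positional: c -> x, b -> y, a -> z. The values change,
# but each keeps its place in the sort order.
renamed = s.cat.rename_categories(["x", "y", "z"])

# Reordering keeps the values and changes only how they sort.
reordered = s.cat.reorder_categories(["a", "b", "c"], ordered=True)

print(s.sort_values().tolist())
print(renamed.sort_values().tolist())
print(reordered.sort_values().tolist())
```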

Note

If the Categorical is not ordered, Series.min() and Series.max() will raise a TypeError. Numeric operations like +, -, *, / and operations based on them (e.g. Series.median(), which would need to compute the mean of two values if the length of the array is even) do not work on categorical data and raise a TypeError.
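The arithmetic restriction can be sketched as below. Converting back to an integer dtype with astype("int64") is shown as one possible workaround, not as a recommended pattern; the names add_error and as_int are illustrative only.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 1], dtype="category")

# Arithmetic on the categorical itself is expected to raise TypeError,
# even though the categories happen to be integers.
try:
    s + 1
    add_error = None
except TypeError as exc:
    add_error = exc

# One possible workaround: convert to a numeric dtype first.
as_int = s.astype("int64") + 1

print(type(add_error).__name__)
print(as_int.tolist())
```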

5.5.2 Multi Column Sorting

A categorical dtyped column will participate in a multi-column sort in a similar manner to other columns. The ordering of the categorical is determined by the categories of that column.

In [21]: dfs = pd.DataFrame({'A' : pd.Categorical(list('bbeebbaa'), categories=['e','a','b'], ordered=True),
   ....:                     'B' : [1,2,1,2,2,1,2,1] })
   ....: 

In [22]: dfs.sort_values(by=['A', 'B'])
Out[22]: 
   A  B
2  e  1
3  e  2
7  a  1
6  a  2
0  b  1
5  b  1
1  b  2
4  b  2
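The category order also applies when sort directions are mixed. The sketch below rebuilds the same frame and sorts column A descending and column B ascending; the ascending list and the name mixed are additions for illustration.

```python
import pandas as pd

dfs = pd.DataFrame({
    "A": pd.Categorical(list("bbeebbaa"), categories=["e", "a", "b"], ordered=True),
    "B": [1, 2, 1, 2, 2, 1, 2, 1],
})

# Descending on A means b, then a, then e (the reverse of e < a < b);
# within each A group, B is sorted ascending.
mixed = dfs.sort_values(by=["A", "B"], ascending=[False, True])

print(mixed["A"].tolist())
print(mixed["B"].tolist())
```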

Reordering the categories changes the result of subsequent sorts.

In [23]: dfs['A'] = dfs['A'].cat.reorder_categories(['a','b','e'])

In [24]: dfs.sort_values(by=['A','B'])
Out[24]: 
   A  B
7  a  1
6  a  2
0  b  1
5  b  1
1  b  2
4  b  2
2  e  1
3  e  2