5.5 Sorting and Order
Warning
The default for construction changed in v0.16.0 to ordered=False, from the prior implicit ordered=True.
If categorical data is ordered (s.cat.ordered == True), then the order of the categories has a meaning and certain operations are possible. If the categorical is unordered, .min()/.max() will raise a TypeError.
In [1]: s = pd.Series(pd.Categorical(["a","b","c","a"], ordered=False))
In [2]: s.sort_values(inplace=True)
In [3]: s = pd.Series(pd.Categorical(["a","b","c","a"], ordered=True))
In [4]: s.sort_values(inplace=True)
In [5]: s
Out[5]:
0 a
3 a
1 b
2 c
dtype: category
Categories (3, object): [a < b < c]
In [6]: s.min(), s.max()
Out[6]: ('a', 'c')
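The contrast between the unordered and ordered cases can be checked directly. The following is a minimal sketch, not part of the original session; the variable names are illustrative only.

```python
import pandas as pd

# Unordered categorical: min()/max() have no defined meaning.
unordered = pd.Series(pd.Categorical(["a", "b", "c", "a"], ordered=False))
try:
    unordered.min()
    raised = False
except TypeError:
    raised = True

# The same data, marked as ordered, supports min()/max().
ordered = unordered.cat.as_ordered()
lowest, highest = ordered.min(), ordered.max()
print(raised, lowest, highest)
```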
You can set categorical data to be ordered by using as_ordered() or unordered by using as_unordered(). These will by default return a new object.
In [7]: s.cat.as_ordered()
Out[7]:
0 a
3 a
1 b
2 c
dtype: category
Categories (3, object): [a < b < c]
In [8]: s.cat.as_unordered()
Out[8]:
0 a
3 a
1 b
2 c
dtype: category
Categories (3, object): [a, b, c]
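Because both methods return a new object, the original Series keeps its ordered flag. A short sketch of that behaviour, with hypothetical variable names:

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "a"], dtype="category")  # unordered by default
ordered_copy = s.cat.as_ordered()

original_flag = s.cat.ordered            # the original Series is untouched
copy_flag = ordered_copy.cat.ordered     # only the returned object is ordered
round_trip_flag = ordered_copy.cat.as_unordered().cat.ordered
print(original_flag, copy_flag, round_trip_flag)
```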
Sorting will use the order defined by categories, not any lexical order present on the data type. This is even true for strings and numeric data:
In [9]: s = pd.Series([1,2,3,1], dtype="category")
In [10]: s = s.cat.set_categories([2,3,1], ordered=True)
In [11]: s
Out[11]:
0 1
1 2
2 3
3 1
dtype: category
Categories (3, int64): [2 < 3 < 1]
In [12]: s.sort_values(inplace=True)
In [13]: s
Out[13]:
1 2
2 3
0 1
3 1
dtype: category
Categories (3, int64): [2 < 3 < 1]
In [14]: s.min(), s.max()
Out[14]: (2, 1)
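The same holds for strings. The sketch below uses made-up size labels to compare the category order with plain lexical sorting; it is an illustration rather than part of the original session.

```python
import pandas as pd

sizes = pd.Series(pd.Categorical(
    ["medium", "small", "large", "small"],
    categories=["small", "medium", "large"],  # the intended order
    ordered=True,
))

by_category = sizes.sort_values().tolist()  # follows the categories
lexical = sorted(sizes.tolist())            # plain string comparison
print(by_category)
print(lexical)
print(sizes.min(), sizes.max())
```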
5.5.1 Reordering
Reordering the categories is possible via the Categorical.reorder_categories() and the Categorical.set_categories() methods. For Categorical.reorder_categories(), all old categories must be included in the new categories and no new categories are allowed. This will necessarily make the sort order the same as the categories order.
In [15]: s = pd.Series([1,2,3,1], dtype="category")
In [16]: s = s.cat.reorder_categories([2,3,1], ordered=True)
In [17]: s
Out[17]:
0 1
1 2
2 3
3 1
dtype: category
Categories (3, int64): [2 < 3 < 1]
In [18]: s.sort_values(inplace=True)
In [19]: s
Out[19]:
1 2
2 3
0 1
3 1
dtype: category
Categories (3, int64): [2 < 3 < 1]
In [20]: s.min(), s.max()
Out[20]: (2, 1)
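The restriction on reorder_categories() can be seen by passing an incomplete list. This is a minimal sketch; the comparison with set_categories() assumes its usual behaviour of turning values whose category was dropped into missing values.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 1], dtype="category")

# reorder_categories() requires exactly the old categories.
try:
    s.cat.reorder_categories([2, 3], ordered=True)
    rejected = False
except ValueError:
    rejected = True

# set_categories() accepts the shorter list; values whose category
# was dropped become missing.
trimmed = s.cat.set_categories([2, 3], ordered=True)
missing = trimmed.isna().tolist()
print(rejected, missing)
```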
Note
Assigning new categories differs from reordering the categories. Assigning renames the categories, and therefore the individual values in the Series, but keeps each position's sort rank: if the first position was sorted last, the renamed value will still be sorted last. Reordering changes the way values are sorted afterwards, but does not change the individual values in the Series.
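A small sketch of the difference, using rename_categories() to stand in for "assigning new categories" (an assumption about which call the note refers to) and made-up labels:

```python
import pandas as pd

s = pd.Series(pd.Categorical(["a", "b", "c", "a"], ordered=True))

# Renaming changes the values but keeps each position's rank:
# "a" was first and becomes "z", which is still first.
renamed = s.cat.rename_categories(["z", "y", "x"])
renamed_values = renamed.tolist()
renamed_sorted = renamed.sort_values().tolist()

# Reordering keeps the values but changes how they sort.
reordered = s.cat.reorder_categories(["c", "b", "a"])
reordered_values = reordered.tolist()
reordered_sorted = reordered.sort_values().tolist()
print(renamed_values, renamed_sorted)
print(reordered_values, reordered_sorted)
```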
Note
If the Categorical is not ordered, Series.min() and Series.max() will raise TypeError. Numeric operations like +, -, *, / and operations based on them (e.g. Series.median(), which would need to compute the mean between two values if the length of an array is even) do not work and raise a TypeError.
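The arithmetic restriction applies even when the categories are numbers. The sketch below shows the failure and one possible workaround, converting back to a numeric dtype first; the workaround is a suggestion, not something the text above prescribes.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 1], dtype="category")

# Arithmetic on the categorical itself is rejected.
try:
    s + 1
    failed = False
except TypeError:
    failed = True

# Possible workaround: convert to a numeric dtype before computing.
as_numbers = (s.astype("int64") + 1).tolist()
print(failed, as_numbers)
```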
5.5.2 Multi Column Sorting
A column with a categorical dtype participates in a multi-column sort in the same way as other columns. The ordering of the categorical is determined by the categories of that column.
In [21]: dfs = pd.DataFrame({'A' : pd.Categorical(list('bbeebbaa'), categories=['e','a','b'], ordered=True),
....: 'B' : [1,2,1,2,2,1,2,1] })
....:
In [22]: dfs.sort_values(by=['A', 'B'])
Out[22]:
A B
2 e 1
3 e 2
7 a 1
6 a 2
0 b 1
5 b 1
1 b 2
4 b 2
Reordering the categories changes a future sort.
In [23]: dfs['A'] = dfs['A'].cat.reorder_categories(['a','b','e'])
In [24]: dfs.sort_values(by=['A','B'])
Out[24]:
A B
7 a 1
6 a 2
0 b 1
5 b 1
1 b 2
4 b 2
2 e 1
3 e 2
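The effect can also be checked programmatically by comparing the index order before and after reordering. The column names and data below are hypothetical and chosen so that no two rows tie on both keys.

```python
import pandas as pd

df = pd.DataFrame({
    "grade": pd.Categorical(
        ["mid", "low", "high", "low"],
        categories=["low", "mid", "high"],
        ordered=True,
    ),
    "score": [2, 1, 3, 0],
})

# Sort by the categorical first, then by the integer column.
before = df.sort_values(by=["grade", "score"]).index.tolist()

# Reversing the category order changes the next sort, not the data.
df["grade"] = df["grade"].cat.reorder_categories(["high", "mid", "low"])
after = df.sort_values(by=["grade", "score"]).index.tolist()
print(before, after)
```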