11 Sorting
Warning
The sorting API is substantially changed in 0.17.0, see here for these changes.
In particular, all sorting methods now return a new object by default, and DO NOT operate in-place (except by passing inplace=True
).
There are two obvious kinds of sorting that you may be interested in: sorting by label and sorting by actual values.
11.1 By Index
The primary method for sorting axis
labels (indexes) are the Series.sort_index()
and the DataFrame.sort_index()
methods.
In [1]: df
Out[1]:
one three two
a -0.626544 NaN -0.351587
b -0.138894 -0.177289 1.136249
c 0.011617 0.462215 -0.448789
d NaN 1.124472 -1.101558
In [2]: unsorted_df = df.reindex(index=['a', 'd', 'c', 'b'],
...: columns=['three', 'two', 'one'])
...:
# DataFrame
In [3]: unsorted_df.sort_index()
Out[3]:
three two one
a NaN -0.351587 -0.626544
b -0.177289 1.136249 -0.138894
c 0.462215 -0.448789 0.011617
d 1.124472 -1.101558 NaN
In [4]: unsorted_df.sort_index(ascending=False)
Out[4]:
three two one
d 1.124472 -1.101558 NaN
c 0.462215 -0.448789 0.011617
b -0.177289 1.136249 -0.138894
a NaN -0.351587 -0.626544
In [5]: unsorted_df.sort_index(axis=1)
Out[5]:
one three two
a -0.626544 NaN -0.351587
d NaN 1.124472 -1.101558
c 0.011617 0.462215 -0.448789
b -0.138894 -0.177289 1.136249
# Series
In [6]: unsorted_df['three'].sort_index()
Out[6]:
a NaN
b -0.177289
c 0.462215
d 1.124472
Name: three, dtype: float64
11.2 By Values
The Series.sort_values()
and DataFrame.sort_values()
are the entry points for value sorting (that is the values in a column or row).
DataFrame.sort_values()
can accept an optional by
argument for axis=0
which will use an arbitrary vector or a column name of the DataFrame to
determine the sort order:
In [7]: df1 = pd.DataFrame({'one':[2,1,1,1],'two':[1,3,2,4],'three':[5,4,3,2]})
In [8]: df1.sort_values(by='two')
Out[8]:
one three two
0 2 5 1
2 1 3 2
1 1 4 3
3 1 2 4
The by
argument can take a list of column names, e.g.:
In [9]: df1[['one', 'two', 'three']].sort_values(by=['one','two'])
Out[9]:
one two three
2 1 2 3
1 1 3 4
3 1 4 2
0 2 1 5
These methods have special treatment of NA values via the na_position
argument:
In [10]: s[2] = np.nan
In [11]: s.sort_values()
Out[11]:
0 A
3 Aaba
1 B
4 Baca
...
8 cat
7 dog
2 NaN
5 NaN
dtype: object
In [12]: s.sort_values(na_position='first')
Out[12]:
2 NaN
5 NaN
0 A
3 Aaba
...
4 Baca
6 CABA
8 cat
7 dog
dtype: object
11.3 searchsorted
Series has the searchsorted()
method, which works similar to
numpy.ndarray.searchsorted()
.
In [13]: ser = pd.Series([1, 2, 3])
In [14]: ser.searchsorted([0, 3])
Out[14]: array([0, 2])
In [15]: ser.searchsorted([0, 4])
Out[15]: array([0, 3])
In [16]: ser.searchsorted([1, 3], side='right')
Out[16]: array([1, 3])
In [17]: ser.searchsorted([1, 3], side='left')
Out[17]: array([0, 2])
In [18]: ser = pd.Series([3, 1, 2])
In [19]: ser.searchsorted([0, 3], sorter=np.argsort(ser))
Out[19]: array([0, 2])
11.4 smallest / largest values
New in version 0.14.0.
Series
has the nsmallest()
and nlargest()
methods which return the
smallest or largest \(n\) values. For a large Series
this can be much
faster than sorting the entire Series and calling head(n)
on the result.
In [20]: s = pd.Series(np.random.permutation(10))
In [21]: s
Out[21]:
0 9
1 7
2 1
3 5
..
6 3
7 0
8 6
9 8
dtype: int64
In [22]: s.sort_values()
Out[22]:
7 0
2 1
5 2
6 3
..
8 6
1 7
9 8
0 9
dtype: int64
In [23]: s.nsmallest(3)
Out[23]:
7 0
2 1
5 2
dtype: int64
In [24]: s.nlargest(3)
Out[24]:
0 9
9 8
1 7
dtype: int64
New in version 0.17.0.
DataFrame
also has the nlargest
and nsmallest
methods.
In [25]: df = pd.DataFrame({'a': [-2, -1, 1, 10, 8, 11, -1],
....: 'b': list('abdceff'),
....: 'c': [1.0, 2.0, 4.0, 3.2, np.nan, 3.0, 4.0]})
....:
In [26]: df.nlargest(3, 'a')
Out[26]:
a b c
5 11 f 3.0
3 10 c 3.2
4 8 e NaN
In [27]: df.nlargest(5, ['a', 'c'])
Out[27]:
a b c
5 11 f 3.0
3 10 c 3.2
4 8 e NaN
2 1 d 4.0
1 -1 b 2.0
In [28]: df.nsmallest(3, 'a')
Out[28]:
a b c
0 -2 a 1.0
1 -1 b 2.0
6 -1 f 4.0
In [29]: df.nsmallest(5, ['a', 'c'])
Out[29]:
a b c
0 -2 a 1.0
1 -1 b 2.0
6 -1 f 4.0
2 1 d 4.0
4 8 e NaN
11.5 Sorting by a multi-index column
You must be explicit about sorting when the column is a multi-index, and fully specify
all levels to by
.
In [30]: df1.columns = pd.MultiIndex.from_tuples([('a','one'),('a','two'),('b','three')])
In [31]: df1.sort_values(by=('a','two'))
Out[31]:
a b
one two three
3 1 2 4
2 1 3 2
1 1 4 3
0 2 5 1