11 Sorting

Warning

The sorting API is substantially changed in 0.17.0, see here for these changes. In particular, all sorting methods now return a new object by default, and DO NOT operate in-place (except by passing inplace=True).

There are two obvious kinds of sorting that you may be interested in: sorting by label and sorting by actual values.

11.1 By Index

The primary method for sorting axis labels (indexes) are the Series.sort_index() and the DataFrame.sort_index() methods.

In [1]: df
Out[1]: 
        one     three       two
a -0.626544       NaN -0.351587
b -0.138894 -0.177289  1.136249
c  0.011617  0.462215 -0.448789
d       NaN  1.124472 -1.101558

In [2]: unsorted_df = df.reindex(index=['a', 'd', 'c', 'b'],
   ...:                          columns=['three', 'two', 'one'])
   ...: 

# DataFrame
In [3]: unsorted_df.sort_index()
Out[3]: 
      three       two       one
a       NaN -0.351587 -0.626544
b -0.177289  1.136249 -0.138894
c  0.462215 -0.448789  0.011617
d  1.124472 -1.101558       NaN

In [4]: unsorted_df.sort_index(ascending=False)
Out[4]: 
      three       two       one
d  1.124472 -1.101558       NaN
c  0.462215 -0.448789  0.011617
b -0.177289  1.136249 -0.138894
a       NaN -0.351587 -0.626544

In [5]: unsorted_df.sort_index(axis=1)
Out[5]: 
        one     three       two
a -0.626544       NaN -0.351587
d       NaN  1.124472 -1.101558
c  0.011617  0.462215 -0.448789
b -0.138894 -0.177289  1.136249

# Series
In [6]: unsorted_df['three'].sort_index()
Out[6]: 
a         NaN
b   -0.177289
c    0.462215
d    1.124472
Name: three, dtype: float64

11.2 By Values

The Series.sort_values() and DataFrame.sort_values() are the entry points for value sorting (that is the values in a column or row). DataFrame.sort_values() can accept an optional by argument for axis=0 which will use an arbitrary vector or a column name of the DataFrame to determine the sort order:

In [7]: df1 = pd.DataFrame({'one':[2,1,1,1],'two':[1,3,2,4],'three':[5,4,3,2]})

In [8]: df1.sort_values(by='two')
Out[8]: 
   one  three  two
0    2      5    1
2    1      3    2
1    1      4    3
3    1      2    4

The by argument can take a list of column names, e.g.:

In [9]: df1[['one', 'two', 'three']].sort_values(by=['one','two'])
Out[9]: 
   one  two  three
2    1    2      3
1    1    3      4
3    1    4      2
0    2    1      5

These methods have special treatment of NA values via the na_position argument:

In [10]: s[2] = np.nan

In [11]: s.sort_values()
Out[11]: 
0       A
3    Aaba
1       B
4    Baca
     ... 
8     cat
7     dog
2     NaN
5     NaN
dtype: object

In [12]: s.sort_values(na_position='first')
Out[12]: 
2     NaN
5     NaN
0       A
3    Aaba
     ... 
4    Baca
6    CABA
8     cat
7     dog
dtype: object

11.3 searchsorted

Series has the searchsorted() method, which works similar to numpy.ndarray.searchsorted().

In [13]: ser = pd.Series([1, 2, 3])

In [14]: ser.searchsorted([0, 3])
Out[14]: array([0, 2])

In [15]: ser.searchsorted([0, 4])
Out[15]: array([0, 3])

In [16]: ser.searchsorted([1, 3], side='right')
Out[16]: array([1, 3])

In [17]: ser.searchsorted([1, 3], side='left')
Out[17]: array([0, 2])

In [18]: ser = pd.Series([3, 1, 2])

In [19]: ser.searchsorted([0, 3], sorter=np.argsort(ser))
Out[19]: array([0, 2])

11.4 smallest / largest values

New in version 0.14.0.

Series has the nsmallest() and nlargest() methods which return the smallest or largest \(n\) values. For a large Series this can be much faster than sorting the entire Series and calling head(n) on the result.

In [20]: s = pd.Series(np.random.permutation(10))

In [21]: s
Out[21]: 
0    9
1    7
2    1
3    5
    ..
6    3
7    0
8    6
9    8
dtype: int64

In [22]: s.sort_values()
Out[22]: 
7    0
2    1
5    2
6    3
    ..
8    6
1    7
9    8
0    9
dtype: int64

In [23]: s.nsmallest(3)
Out[23]: 
7    0
2    1
5    2
dtype: int64

In [24]: s.nlargest(3)
Out[24]: 
0    9
9    8
1    7
dtype: int64

New in version 0.17.0.

DataFrame also has the nlargest and nsmallest methods.

In [25]: df = pd.DataFrame({'a': [-2, -1, 1, 10, 8, 11, -1],
   ....:                    'b': list('abdceff'),
   ....:                    'c': [1.0, 2.0, 4.0, 3.2, np.nan, 3.0, 4.0]})
   ....: 

In [26]: df.nlargest(3, 'a')
Out[26]: 
    a  b    c
5  11  f  3.0
3  10  c  3.2
4   8  e  NaN

In [27]: df.nlargest(5, ['a', 'c'])
Out[27]: 
    a  b    c
5  11  f  3.0
3  10  c  3.2
4   8  e  NaN
2   1  d  4.0
1  -1  b  2.0

In [28]: df.nsmallest(3, 'a')
Out[28]: 
   a  b    c
0 -2  a  1.0
1 -1  b  2.0
6 -1  f  4.0

In [29]: df.nsmallest(5, ['a', 'c'])
Out[29]: 
   a  b    c
0 -2  a  1.0
1 -1  b  2.0
6 -1  f  4.0
2  1  d  4.0
4  8  e  NaN

11.5 Sorting by a multi-index column

You must be explicit about sorting when the column is a multi-index, and fully specify all levels to by.

In [30]: df1.columns = pd.MultiIndex.from_tuples([('a','one'),('a','two'),('b','three')])

In [31]: df1.sort_values(by=('a','two'))
Out[31]: 
    a         b
  one two three
3   1   2     4
2   1   3     2
1   1   4     3
0   2   5     1