In [1]: index = pd.MultiIndex.from_product([range(3), ['one', 'two']], names=['first', 'second'])

2.14 The query() Method (Experimental)

New in version 0.13.

DataFrame objects have a query() method that allows selection using an expression.

You can get the value of the frame where column b has values between the values of columns a and c. For example:

In [2]: n = 10

In [3]: df = pd.DataFrame(np.random.rand(n, 3), columns=list('abc'))

In [4]: df
Out[4]: 
         a       b       c
0   0.2370  0.0076  0.0198
1   0.3131  0.0995  0.1952
2   0.2073  0.1649  0.7119
3   0.0321  0.1974  0.9646
..     ...     ...     ...
6   0.8615  0.3707  0.3688
7   0.5429  0.6828  0.8229
8   0.1314  0.0776  0.8409
9   0.2210  0.7529  0.3430

[10 rows x 3 columns]

# pure python
In [5]: df[(df.a < df.b) & (df.b < df.c)]
Out[5]: 
        a       b       c
3  0.0321  0.1974  0.9646
4  0.5739  0.6992  0.9746
7  0.5429  0.6828  0.8229

# query
In [6]: df.query('(a < b) & (b < c)')
Out[6]: 
        a       b       c
3  0.0321  0.1974  0.9646
4  0.5739  0.6992  0.9746
7  0.5429  0.6828  0.8229

Do the same thing but fall back on a named index if there is no column with the name a.

In [7]: df = pd.DataFrame(np.random.randint(n / 2, size=(n, 2)), columns=list('bc'))

In [8]: df.index.name = 'a'

In [9]: df
Out[9]: 
    b  c
a       
0   4  2
1   3  2
2   2  0
3   1  3
.. .. ..
6   4  3
7   3  0
8   2  3
9   2  4

[10 rows x 2 columns]

In [10]: df.query('a < b and b < c')
Out[10]: 
Empty DataFrame
Columns: [b, c]
Index: []

If instead you don’t want to or cannot name your index, you can use the name index in your query expression:

In [11]: df = pd.DataFrame(np.random.randint(n, size=(n, 2)), columns=list('bc'))

In [12]: df
Out[12]: 
    b  c
0   1  8
1   3  3
2   5  1
3   9  2
.. .. ..
6   6  9
7   2  1
8   1  0
9   0  8

[10 rows x 2 columns]

In [13]: df.query('index < b < c')
Out[13]: 
   b  c
0  1  8

Note

If the name of your index overlaps with a column name, the column name is given precedence. For example,

In [14]: df = pd.DataFrame({'a': np.random.randint(5, size=5)})

In [15]: df.index.name = 'a'

In [16]: df.query('a > 2') # uses the column 'a', not the index
Out[16]: 
Empty DataFrame
Columns: [a]
Index: []

You can still use the index in a query expression by using the special identifier ‘index’:

In [17]: df.query('index > 2')
Out[17]: 
   a
a   
3  0
4  2

If for some reason you have a column named index, then you can refer to the index as ilevel_0 as well, but at this point you should consider renaming your columns to something less ambiguous.

2.14.1 MultiIndex query() Syntax

You can also use the levels of a DataFrame with a MultiIndex as if they were columns in the frame:

In [18]: n = 10

In [19]: colors = np.random.choice(['red', 'green'], size=n)

In [20]: foods = np.random.choice(['eggs', 'ham'], size=n)

In [21]: colors
Out[21]: 
array(['green', 'red', 'red', 'green', 'red', 'green', 'green', 'red',
       'green', 'red'], 
      dtype='|S5')

In [22]: foods
Out[22]: 
array(['eggs', 'eggs', 'eggs', 'eggs', 'eggs', 'ham', 'eggs', 'eggs',
       'eggs', 'eggs'], 
      dtype='|S4')

In [23]: index = pd.MultiIndex.from_arrays([colors, foods], names=['color', 'food'])

In [24]: df = pd.DataFrame(np.random.randn(n, 2), index=index)

In [25]: df
Out[25]: 
                 0       1
color food                
green eggs  0.2513 -0.1437
red   eggs  0.1958 -0.8763
      eggs -0.7964  0.2935
green eggs -0.5310  1.5091
...            ...     ...
      eggs  0.0068 -0.8226
red   eggs  0.3901  0.6442
green eggs -0.7360 -1.0821
red   eggs -0.4404  0.0792

[10 rows x 2 columns]

In [26]: df.query('color == "red"')
Out[26]: 
                 0       1
color food                
red   eggs  0.1958 -0.8763
      eggs -0.7964  0.2935
      eggs  0.5069  0.5864
      eggs  0.3901  0.6442
      eggs -0.4404  0.0792

If the levels of the MultiIndex are unnamed, you can refer to them using special names:

In [27]: df.index.names = [None, None]

In [28]: df
Out[28]: 
                 0       1
green eggs  0.2513 -0.1437
red   eggs  0.1958 -0.8763
      eggs -0.7964  0.2935
green eggs -0.5310  1.5091
...            ...     ...
      eggs  0.0068 -0.8226
red   eggs  0.3901  0.6442
green eggs -0.7360 -1.0821
red   eggs -0.4404  0.0792

[10 rows x 2 columns]

In [29]: df.query('ilevel_0 == "red"')
Out[29]: 
               0       1
red eggs  0.1958 -0.8763
    eggs -0.7964  0.2935
    eggs  0.5069  0.5864
    eggs  0.3901  0.6442
    eggs -0.4404  0.0792

The convention is ilevel_0, which means “index level 0” for the 0th level of the index.

2.14.2 query() Use Cases

A use case for query() is when you have a collection of DataFrame objects that have a subset of column names (or index levels/names) in common. You can pass the same query to both frames without having to specify which frame you’re interested in querying

In [30]: df = pd.DataFrame(np.random.rand(n, 3), columns=list('abc'))

In [31]: df
Out[31]: 
         a       b       c
0   0.1427  0.8672  0.1419
1   0.7378  0.7852  0.5929
2   0.1730  0.9451  0.1403
3   0.4987  0.0624  0.7161
..     ...     ...     ...
6   0.0155  0.7807  0.5830
7   0.8813  0.4309  0.6147
8   0.6297  0.2569  0.3005
9   0.8960  0.7063  0.2833

[10 rows x 3 columns]

In [32]: df2 = pd.DataFrame(np.random.rand(n + 2, 3), columns=df.columns)

In [33]: df2
Out[33]: 
         a       b       c
0   0.4871  0.6371  0.5722
1   0.8823  0.8011  0.8852
2   0.9926  0.1276  0.0486
3   0.0866  0.5796  0.2961
..     ...     ...     ...
8   0.3562  0.7629  0.8319
9   0.4292  0.4811  0.6428
10  0.6572  0.5602  0.7924
11  0.1042  0.1500  0.3098

[12 rows x 3 columns]

In [34]: expr = '0.0 <= a <= c <= 0.5'

In [35]: map(lambda frame: frame.query(expr), [df, df2])
Out[35]: 
[Empty DataFrame
 Columns: [a, b, c]
 Index: [],          a       b       c
 3   0.0866  0.5796  0.2961
 11  0.1042  0.1500  0.3098]

2.14.3 query() Python versus pandas Syntax Comparison

Full numpy-like syntax

In [36]: df = pd.DataFrame(np.random.randint(n, size=(n, 3)), columns=list('abc'))

In [37]: df
Out[37]: 
    a  b  c
0   0  8  9
1   3  3  9
2   4  1  6
3   6  9  4
.. .. .. ..
6   0  9  3
7   5  4  2
8   6  3  3
9   4  5  6

[10 rows x 3 columns]

In [38]: df.query('(a < b) & (b < c)')
Out[38]: 
   a  b  c
0  0  8  9
9  4  5  6

In [39]: df[(df.a < df.b) & (df.b < df.c)]
Out[39]: 
   a  b  c
0  0  8  9
9  4  5  6

Slightly nicer by removing the parentheses (by binding making comparison operators bind tighter than &/|)

In [40]: df.query('a < b & b < c')
Out[40]: 
   a  b  c
0  0  8  9
9  4  5  6

Use English instead of symbols

In [41]: df.query('a < b and b < c')
Out[41]: 
   a  b  c
0  0  8  9
9  4  5  6

Pretty close to how you might write it on paper

In [42]: df.query('a < b < c')
Out[42]: 
   a  b  c
0  0  8  9
9  4  5  6

2.14.4 The in and not in operators

query() also supports special use of Python’s in and not in comparison operators, providing a succinct syntax for calling the isin method of a Series or DataFrame.

# get all rows where columns "a" and "b" have overlapping values
In [43]: df = pd.DataFrame({'a': list('aabbccddeeff'), 'b': list('aaaabbbbcccc'),
   ....:                    'c': np.random.randint(5, size=12),
   ....:                    'd': np.random.randint(9, size=12)})
   ....: 

In [44]: df
Out[44]: 
    a  b  c  d
0   a  a  3  1
1   a  a  0  3
2   b  a  3  4
3   b  a  0  3
.. .. .. .. ..
8   e  c  1  0
9   e  c  4  7
10  f  c  0  1
11  f  c  0  0

[12 rows x 4 columns]

In [45]: df.query('a in b')
Out[45]: 
   a  b  c  d
0  a  a  3  1
1  a  a  0  3
2  b  a  3  4
3  b  a  0  3
4  c  b  4  0
5  c  b  4  7

# How you'd do it in pure Python
In [46]: df[df.a.isin(df.b)]
Out[46]: 
   a  b  c  d
0  a  a  3  1
1  a  a  0  3
2  b  a  3  4
3  b  a  0  3
4  c  b  4  0
5  c  b  4  7

In [47]: df.query('a not in b')
Out[47]: 
    a  b  c  d
6   d  b  3  1
7   d  b  1  2
8   e  c  1  0
9   e  c  4  7
10  f  c  0  1
11  f  c  0  0

# pure Python
In [48]: df[~df.a.isin(df.b)]
Out[48]: 
    a  b  c  d
6   d  b  3  1
7   d  b  1  2
8   e  c  1  0
9   e  c  4  7
10  f  c  0  1
11  f  c  0  0

You can combine this with other expressions for very succinct queries:

# rows where cols a and b have overlapping values and col c's values are less than col d's
In [49]: df.query('a in b and c < d')
Out[49]: 
   a  b  c  d
1  a  a  0  3
2  b  a  3  4
3  b  a  0  3
5  c  b  4  7

# pure Python
In [50]: df[df.b.isin(df.a) & (df.c < df.d)]
Out[50]: 
    a  b  c  d
1   a  a  0  3
2   b  a  3  4
3   b  a  0  3
5   c  b  4  7
7   d  b  1  2
9   e  c  4  7
10  f  c  0  1

Note

Note that in and not in are evaluated in Python, since numexpr has no equivalent of this operation. However, only the in/not in expression itself is evaluated in vanilla Python. For example, in the expression

df.query('a in b + c + d')

(b + c + d) is evaluated by numexpr and then the in operation is evaluated in plain Python. In general, any operations that can be evaluated using numexpr will be.

2.14.5 Special use of the == operator with list objects

Comparing a list of values to a column using ==/!= works similarly to in/not in

In [51]: df.query('b == ["a", "b", "c"]')
Out[51]: 
    a  b  c  d
0   a  a  3  1
1   a  a  0  3
2   b  a  3  4
3   b  a  0  3
.. .. .. .. ..
8   e  c  1  0
9   e  c  4  7
10  f  c  0  1
11  f  c  0  0

[12 rows x 4 columns]

# pure Python
In [52]: df[df.b.isin(["a", "b", "c"])]
Out[52]: 
    a  b  c  d
0   a  a  3  1
1   a  a  0  3
2   b  a  3  4
3   b  a  0  3
.. .. .. .. ..
8   e  c  1  0
9   e  c  4  7
10  f  c  0  1
11  f  c  0  0

[12 rows x 4 columns]

In [53]: df.query('c == [1, 2]')
Out[53]: 
   a  b  c  d
7  d  b  1  2
8  e  c  1  0

In [54]: df.query('c != [1, 2]')
Out[54]: 
    a  b  c  d
0   a  a  3  1
1   a  a  0  3
2   b  a  3  4
3   b  a  0  3
.. .. .. .. ..
6   d  b  3  1
9   e  c  4  7
10  f  c  0  1
11  f  c  0  0

[10 rows x 4 columns]

# using in/not in
In [55]: df.query('[1, 2] in c')
Out[55]: 
   a  b  c  d
7  d  b  1  2
8  e  c  1  0

In [56]: df.query('[1, 2] not in c')
Out[56]: 
    a  b  c  d
0   a  a  3  1
1   a  a  0  3
2   b  a  3  4
3   b  a  0  3
.. .. .. .. ..
6   d  b  3  1
9   e  c  4  7
10  f  c  0  1
11  f  c  0  0

[10 rows x 4 columns]

# pure Python
In [57]: df[df.c.isin([1, 2])]
Out[57]: 
   a  b  c  d
7  d  b  1  2
8  e  c  1  0

2.14.6 Boolean Operators

You can negate boolean expressions with the word not or the ~ operator.

In [58]: df = pd.DataFrame(np.random.rand(n, 3), columns=list('abc'))

In [59]: df['bools'] = np.random.rand(len(df)) > 0.5

In [60]: df.query('~bools')
Out[60]: 
        a       b       c  bools
4  0.9722  0.6328  0.7278  False
7  0.9512  0.4674  0.4987  False

In [61]: df.query('not bools')
Out[61]: 
        a       b       c  bools
4  0.9722  0.6328  0.7278  False
7  0.9512  0.4674  0.4987  False

In [62]: df.query('not bools') == df[~df.bools]
Out[62]: 
      a     b     c bools
4  True  True  True  True
7  True  True  True  True

Of course, expressions can be arbitrarily complex too

# short query syntax
In [63]: shorter = df.query('a < b < c and (not bools) or bools > 2')

# equivalent in pure Python
In [64]: longer = df[(df.a < df.b) & (df.b < df.c) & (~df.bools) | (df.bools > 2)]

In [65]: shorter
Out[65]: 
Empty DataFrame
Columns: [a, b, c, bools]
Index: []

In [66]: longer
Out[66]: 
Empty DataFrame
Columns: [a, b, c, bools]
Index: []

In [67]: shorter == longer
Out[67]: 
Empty DataFrame
Columns: [a, b, c, bools]
Index: []

2.14.7 Performance of query()

DataFrame.query() using numexpr is slightly faster than Python for large frames

../_images/query-perf.png

Note

You will only see the performance benefits of using the numexpr engine with DataFrame.query() if your frame has more than approximately 200,000 rows

../_images/query-perf-small.png

This plot was created using a DataFrame with 3 columns each containing floating point values generated using numpy.random.randn().