In [1]: index = pd.MultiIndex.from_product([range(3), ['one', 'two']], names=['first', 'second'])

2.14 The `query()` Method (Experimental)

New in version 0.13.

DataFrame objects have a query() method that allows selection using an expression.

You can get the value of the frame where column b has values between the values of columns a and c. For example:

In [2]: n = 10

In [3]: df = pd.DataFrame(np.random.rand(n, 3), columns=list('abc'))

In [4]: df
Out[4]: 
         a       b       c
0   0.2370  0.0076  0.0198
1   0.3131  0.0995  0.1952
2   0.2073  0.1649  0.7119
3   0.0321  0.1974  0.9646
..     ...     ...     ...
6   0.8615  0.3707  0.3688
7   0.5429  0.6828  0.8229
8   0.1314  0.0776  0.8409
9   0.2210  0.7529  0.3430

[10 rows x 3 columns]

# pure python
In [5]: df[(df.a < df.b) & (df.b < df.c)]
Out[5]: 
        a       b       c
3  0.0321  0.1974  0.9646
4  0.5739  0.6992  0.9746
7  0.5429  0.6828  0.8229

# query
In [6]: df.query('(a < b) & (b < c)')
Out[6]: 
        a       b       c
3  0.0321  0.1974  0.9646
4  0.5739  0.6992  0.9746
7  0.5429  0.6828  0.8229

Do the same thing but fall back on a named index if there is no column with the name a.

In [7]: df = pd.DataFrame(np.random.randint(n / 2, size=(n, 2)), columns=list('bc'))

In [8]: df.index.name = 'a'

In [9]: df
Out[9]: 
    b  c
a       
0   4  2
1   3  2
2   2  0
3   1  3
.. .. ..
6   4  3
7   3  0
8   2  3
9   2  4

[10 rows x 2 columns]

In [10]: df.query('a < b and b < c')
Out[10]: 
Empty DataFrame
Columns: [b, c]
Index: []

If instead you don’t want to or cannot name your index, you can use the name index in your query expression:

In [11]: df = pd.DataFrame(np.random.randint(n, size=(n, 2)), columns=list('bc'))

In [12]: df
Out[12]: 
    b  c
0   1  8
1   3  3
2   5  1
3   9  2
.. .. ..
6   6  9
7   2  1
8   1  0
9   0  8

[10 rows x 2 columns]

In [13]: df.query('index < b < c')
Out[13]: 
   b  c
0  1  8

Note

If the name of your index overlaps with a column name, the column name is given precedence. For example,

In [14]: df = pd.DataFrame({'a': np.random.randint(5, size=5)})

In [15]: df.index.name = 'a'

In [16]: df.query('a > 2') # uses the column 'a', not the index
Out[16]: 
Empty DataFrame
Columns: [a]
Index: []

You can still use the index in a query expression by using the special identifier ‘index’:

In [17]: df.query('index > 2')
Out[17]: 
   a
a   
3  0
4  2

If for some reason you have a column named index, then you can refer to the index as ilevel_0 as well, but at this point you should consider renaming your columns to something less ambiguous.

2.14.1 `MultiIndex` `query()` Syntax

You can also use the levels of a DataFrame with a MultiIndex as if they were columns in the frame:

In [18]: n = 10

In [19]: colors = np.random.choice(['red', 'green'], size=n)

In [20]: foods = np.random.choice(['eggs', 'ham'], size=n)

In [21]: colors
Out[21]: 
array(['green', 'red', 'red', 'green', 'red', 'green', 'green', 'red',
       'green', 'red'], 
      dtype='|S5')

In [22]: foods
Out[22]: 
array(['eggs', 'eggs', 'eggs', 'eggs', 'eggs', 'ham', 'eggs', 'eggs',
       'eggs', 'eggs'], 
      dtype='|S4')

In [23]: index = pd.MultiIndex.from_arrays([colors, foods], names=['color', 'food'])

In [24]: df = pd.DataFrame(np.random.randn(n, 2), index=index)

In [25]: df
Out[25]: 
                 0       1
color food                
green eggs  0.2513 -0.1437
red   eggs  0.1958 -0.8763
      eggs -0.7964  0.2935
green eggs -0.5310  1.5091
...            ...     ...
      eggs  0.0068 -0.8226
red   eggs  0.3901  0.6442
green eggs -0.7360 -1.0821
red   eggs -0.4404  0.0792

[10 rows x 2 columns]

In [26]: df.query('color == "red"')
Out[26]: 
                 0       1
color food                
red   eggs  0.1958 -0.8763
      eggs -0.7964  0.2935
      eggs  0.5069  0.5864
      eggs  0.3901  0.6442
      eggs -0.4404  0.0792

If the levels of the MultiIndex are unnamed, you can refer to them using special names:

In [27]: df.index.names = [None, None]

In [28]: df
Out[28]: 
                 0       1
green eggs  0.2513 -0.1437
red   eggs  0.1958 -0.8763
      eggs -0.7964  0.2935
green eggs -0.5310  1.5091
...            ...     ...
      eggs  0.0068 -0.8226
red   eggs  0.3901  0.6442
green eggs -0.7360 -1.0821
red   eggs -0.4404  0.0792

[10 rows x 2 columns]

In [29]: df.query('ilevel_0 == "red"')
Out[29]: 
               0       1
red eggs  0.1958 -0.8763
    eggs -0.7964  0.2935
    eggs  0.5069  0.5864
    eggs  0.3901  0.6442
    eggs -0.4404  0.0792

The convention is ilevel_0, which means “index level 0” for the 0th level of the index.

2.14.2 `query()` Use Cases

A use case for query() is when you have a collection of DataFrame objects that have a subset of column names (or index levels/names) in common. You can pass the same query to both frames without having to specify which frame you’re interested in querying

In [30]: df = pd.DataFrame(np.random.rand(n, 3), columns=list('abc'))

In [31]: df
Out[31]: 
         a       b       c
0   0.1427  0.8672  0.1419
1   0.7378  0.7852  0.5929
2   0.1730  0.9451  0.1403
3   0.4987  0.0624  0.7161
..     ...     ...     ...
6   0.0155  0.7807  0.5830
7   0.8813  0.4309  0.6147
8   0.6297  0.2569  0.3005
9   0.8960  0.7063  0.2833

[10 rows x 3 columns]

In [32]: df2 = pd.DataFrame(np.random.rand(n + 2, 3), columns=df.columns)

In [33]: df2
Out[33]: 
         a       b       c
0   0.4871  0.6371  0.5722
1   0.8823  0.8011  0.8852
2   0.9926  0.1276  0.0486
3   0.0866  0.5796  0.2961
..     ...     ...     ...
8   0.3562  0.7629  0.8319
9   0.4292  0.4811  0.6428
10  0.6572  0.5602  0.7924
11  0.1042  0.1500  0.3098

[12 rows x 3 columns]

In [34]: expr = '0.0 <= a <= c <= 0.5'

In [35]: map(lambda frame: frame.query(expr), [df, df2])
Out[35]: 
[Empty DataFrame
 Columns: [a, b, c]
 Index: [],          a       b       c
 3   0.0866  0.5796  0.2961
 11  0.1042  0.1500  0.3098]

2.14.3 `query()` Python versus pandas Syntax Comparison

Full numpy-like syntax

In [36]: df = pd.DataFrame(np.random.randint(n, size=(n, 3)), columns=list('abc'))

In [37]: df
Out[37]: 
    a  b  c
0   0  8  9
1   3  3  9
2   4  1  6
3   6  9  4
.. .. .. ..
6   0  9  3
7   5  4  2
8   6  3  3
9   4  5  6

[10 rows x 3 columns]

In [38]: df.query('(a < b) & (b < c)')
Out[38]: 
   a  b  c
0  0  8  9
9  4  5  6

In [39]: df[(df.a < df.b) & (df.b < df.c)]
Out[39]: 
   a  b  c
0  0  8  9
9  4  5  6

Slightly nicer by removing the parentheses (by binding making comparison operators bind tighter than &/|)

In [40]: df.query('a < b & b < c')
Out[40]: 
   a  b  c
0  0  8  9
9  4  5  6

Use English instead of symbols

In [41]: df.query('a < b and b < c')
Out[41]: 
   a  b  c
0  0  8  9
9  4  5  6

Pretty close to how you might write it on paper

In [42]: df.query('a < b < c')
Out[42]: 
   a  b  c
0  0  8  9
9  4  5  6

2.14.4 The `in` and `not in` operators

query() also supports special use of Python’s in and not in comparison operators, providing a succinct syntax for calling the isin method of a Series or DataFrame.

# get all rows where columns "a" and "b" have overlapping values
In [43]: df = pd.DataFrame({'a': list('aabbccddeeff'), 'b': list('aaaabbbbcccc'),
   ....:                    'c': np.random.randint(5, size=12),
   ....:                    'd': np.random.randint(9, size=12)})
   ....: 

In [44]: df
Out[44]: 
    a  b  c  d
0   a  a  3  1
1   a  a  0  3
2   b  a  3  4
3   b  a  0  3
.. .. .. .. ..
8   e  c  1  0
9   e  c  4  7
10  f  c  0  1
11  f  c  0  0

[12 rows x 4 columns]

In [45]: df.query('a in b')
Out[45]: 
   a  b  c  d
0  a  a  3  1
1  a  a  0  3
2  b  a  3  4
3  b  a  0  3
4  c  b  4  0
5  c  b  4  7

# How you'd do it in pure Python
In [46]: df[df.a.isin(df.b)]
Out[46]: 
   a  b  c  d
0  a  a  3  1
1  a  a  0  3
2  b  a  3  4
3  b  a  0  3
4  c  b  4  0
5  c  b  4  7

In [47]: df.query('a not in b')
Out[47]: 
    a  b  c  d
6   d  b  3  1
7   d  b  1  2
8   e  c  1  0
9   e  c  4  7
10  f  c  0  1
11  f  c  0  0

# pure Python
In [48]: df[~df.a.isin(df.b)]
Out[48]: 
    a  b  c  d
6   d  b  3  1
7   d  b  1  2
8   e  c  1  0
9   e  c  4  7
10  f  c  0  1
11  f  c  0  0

You can combine this with other expressions for very succinct queries:

# rows where cols a and b have overlapping values and col c's values are less than col d's
In [49]: df.query('a in b and c < d')
Out[49]: 
   a  b  c  d
1  a  a  0  3
2  b  a  3  4
3  b  a  0  3
5  c  b  4  7

# pure Python
In [50]: df[df.b.isin(df.a) & (df.c < df.d)]
Out[50]: 
    a  b  c  d
1   a  a  0  3
2   b  a  3  4
3   b  a  0  3
5   c  b  4  7
7   d  b  1  2
9   e  c  4  7
10  f  c  0  1

Note

Note that in and not in are evaluated in Python, since numexpr has no equivalent of this operation. However, only the in/not in expression itself is evaluated in vanilla Python. For example, in the expression

df.query('a in b + c + d')

(b + c + d) is evaluated by numexpr and then the in operation is evaluated in plain Python. In general, any operations that can be evaluated using numexpr will be.

2.14.5 Special use of the `==` operator with `list` objects

Comparing a list of values to a column using ==/!= works similarly to in/not in

In [51]: df.query('b == ["a", "b", "c"]')
Out[51]: 
    a  b  c  d
0   a  a  3  1
1   a  a  0  3
2   b  a  3  4
3   b  a  0  3
.. .. .. .. ..
8   e  c  1  0
9   e  c  4  7
10  f  c  0  1
11  f  c  0  0

[12 rows x 4 columns]

# pure Python
In [52]: df[df.b.isin(["a", "b", "c"])]
Out[52]: 
    a  b  c  d
0   a  a  3  1
1   a  a  0  3
2   b  a  3  4
3   b  a  0  3
.. .. .. .. ..
8   e  c  1  0
9   e  c  4  7
10  f  c  0  1
11  f  c  0  0

[12 rows x 4 columns]

In [53]: df.query('c == [1, 2]')
Out[53]: 
   a  b  c  d
7  d  b  1  2
8  e  c  1  0

In [54]: df.query('c != [1, 2]')
Out[54]: 
    a  b  c  d
0   a  a  3  1
1   a  a  0  3
2   b  a  3  4
3   b  a  0  3
.. .. .. .. ..
6   d  b  3  1
9   e  c  4  7
10  f  c  0  1
11  f  c  0  0

[10 rows x 4 columns]

# using in/not in
In [55]: df.query('[1, 2] in c')
Out[55]: 
   a  b  c  d
7  d  b  1  2
8  e  c  1  0

In [56]: df.query('[1, 2] not in c')
Out[56]: 
    a  b  c  d
0   a  a  3  1
1   a  a  0  3
2   b  a  3  4
3   b  a  0  3
.. .. .. .. ..
6   d  b  3  1
9   e  c  4  7
10  f  c  0  1
11  f  c  0  0

[10 rows x 4 columns]

# pure Python
In [57]: df[df.c.isin([1, 2])]
Out[57]: 
   a  b  c  d
7  d  b  1  2
8  e  c  1  0

2.14.6 Boolean Operators

You can negate boolean expressions with the word not or the ~ operator.

In [58]: df = pd.DataFrame(np.random.rand(n, 3), columns=list('abc'))

In [59]: df['bools'] = np.random.rand(len(df)) > 0.5

In [60]: df.query('~bools')
Out[60]: 
        a       b       c  bools
4  0.9722  0.6328  0.7278  False
7  0.9512  0.4674  0.4987  False

In [61]: df.query('not bools')
Out[61]: 
        a       b       c  bools
4  0.9722  0.6328  0.7278  False
7  0.9512  0.4674  0.4987  False

In [62]: df.query('not bools') == df[~df.bools]
Out[62]: 
      a     b     c bools
4  True  True  True  True
7  True  True  True  True

Of course, expressions can be arbitrarily complex too

# short query syntax
In [63]: shorter = df.query('a < b < c and (not bools) or bools > 2')

# equivalent in pure Python
In [64]: longer = df[(df.a < df.b) & (df.b < df.c) & (~df.bools) | (df.bools > 2)]

In [65]: shorter
Out[65]: 
Empty DataFrame
Columns: [a, b, c, bools]
Index: []

In [66]: longer
Out[66]: 
Empty DataFrame
Columns: [a, b, c, bools]
Index: []

In [67]: shorter == longer
Out[67]: 
Empty DataFrame
Columns: [a, b, c, bools]
Index: []

2.14.7 Performance of `query()`

DataFrame.query() using numexpr is slightly faster than Python for large frames

Note

You will only see the performance benefits of using the numexpr engine with DataFrame.query() if your frame has more than approximately 200,000 rows

This plot was created using a DataFrame with 3 columns each containing floating point values generated using numpy.random.randn().

2.14 The query() Method (Experimental)

2.14.1 MultiIndex query() Syntax

2.14.2 query() Use Cases

2.14.3 query() Python versus pandas Syntax Comparison

2.14.4 The in and not in operators

2.14.5 Special use of the == operator with list objects

2.14.6 Boolean Operators

2.14.7 Performance of query()

2.14 The `query()` Method (Experimental)

2.14.1 `MultiIndex` `query()` Syntax

2.14.2 `query()` Use Cases

2.14.3 `query()` Python versus pandas Syntax Comparison

2.14.4 The `in` and `not in` operators

2.14.5 Special use of the `==` operator with `list` objects

2.14.7 Performance of `query()`