In [1]: dates = pd.date_range('1/1/2000', periods=8)
In [2]: df = pd.DataFrame(np.random.randn(8, 4), index=dates, columns=['A', 'B', 'C', 'D'])
2.11 Boolean indexing
Another common operation is the use of boolean vectors to filter the data.
The operators are: |
for or
, &
for and
, and ~
for not
. These must be grouped by using parentheses.
Using a boolean vector to index a Series works exactly as in a numpy ndarray:
In [3]: s = pd.Series(range(-3, 4))
In [4]: s
Out[4]:
0 -3
1 -2
2 -1
3 0
4 1
5 2
6 3
dtype: int64
In [5]: s[s > 0]
Out[5]:
4 1
5 2
6 3
dtype: int64
In [6]: s[(s < -1) | (s > 0.5)]
Out[6]:
0 -3
1 -2
4 1
5 2
6 3
dtype: int64
In [7]: s[~(s < 0)]
Out[7]:
3 0
4 1
5 2
6 3
dtype: int64
You may select rows from a DataFrame using a boolean vector the same length as the DataFrame’s index (for example, something derived from one of the columns of the DataFrame):
In [8]: df
Out[8]:
A B C D
2000-01-01 0.4691 -0.2829 -1.5091 -1.1356
2000-01-02 1.2121 -0.1732 0.1192 -1.0442
2000-01-03 -0.8618 -2.1046 -0.4949 1.0718
2000-01-04 0.7216 -0.7068 -1.0396 0.2719
2000-01-05 -0.4250 0.5670 0.2762 -1.0874
2000-01-06 -0.6737 0.1136 -1.4784 0.5250
2000-01-07 0.4047 0.5770 -1.7150 -1.0393
2000-01-08 -0.3706 -1.1579 -1.3443 0.8449
In [9]: df[df['A'] > 0]
Out[9]:
A B C D
2000-01-01 0.4691 -0.2829 -1.5091 -1.1356
2000-01-02 1.2121 -0.1732 0.1192 -1.0442
2000-01-04 0.7216 -0.7068 -1.0396 0.2719
2000-01-07 0.4047 0.5770 -1.7150 -1.0393
List comprehensions and map
method of Series can also be used to produce
more complex criteria:
In [10]: df2 = pd.DataFrame({'a' : ['one', 'one', 'two', 'three', 'two', 'one', 'six'],
....: 'b' : ['x', 'y', 'y', 'x', 'y', 'x', 'x'],
....: 'c' : np.random.randn(7)})
....:
In [11]: df2
Out[11]:
a b c
0 one x 1.0758
1 one y -0.1090
2 two y 1.6436
3 three x -1.4694
4 two y 0.3570
5 one x -0.6746
6 six x -1.7769
# only want 'two' or 'three'
In [12]: criterion = df2['a'].map(lambda x: x.startswith('t'))
In [13]: df2[criterion]
Out[13]:
a b c
2 two y 1.6436
3 three x -1.4694
4 two y 0.3570
# equivalent but slower
In [14]: df2[[x.startswith('t') for x in df2['a']]]
Out[14]:
a b c
2 two y 1.6436
3 three x -1.4694
4 two y 0.3570
# Multiple criteria
In [15]: df2[criterion & (df2['b'] == 'x')]
Out[15]:
a b c
3 three x -1.4694
Note, with the choice methods Selection by Label, Selection by Position, and Advanced Indexing you may select along more than one axis using boolean vectors combined with other indexing expressions.
In [16]: df2.loc[criterion & (df2['b'] == 'x'),'b':'c']
Out[16]:
b c
3 x -1.4694