In [1]: tips.head()
Out[1]:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4
5.2 WHERE
Filtering in SQL is done via a WHERE clause.
SELECT *
FROM tips
WHERE time = 'Dinner'
LIMIT 5;
DataFrames can be filtered in multiple ways; the most intuitive of which is using boolean indexing.
In [2]: tips[tips['time'] == 'Dinner'].head(5)
Out[2]:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4
The above statement is simply passing a Series
of True/False objects to the DataFrame,
returning all rows with True.
In [3]: is_dinner = tips['time'] == 'Dinner'
In [4]: is_dinner.value_counts()
Out[4]:
True 176
False 68
Name: time, dtype: int64
In [5]: tips[is_dinner].head(5)
Out[5]:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4
Just like SQL’s OR and AND, multiple conditions can be passed to a DataFrame using | (OR) and & (AND).
-- tips of more than $5.00 at Dinner meals
SELECT *
FROM tips
WHERE time = 'Dinner' AND tip > 5.00;
# tips of more than $5.00 at Dinner meals
In [2]: tips[(tips['time'] == 'Dinner') & (tips['tip'] > 5.00)]
Out[2]:
total_bill tip sex smoker day time size
23 39.42 7.58 Male No Sat Dinner 4
44 30.40 5.60 Male No Sun Dinner 4
47 32.40 6.00 Male No Sun Dinner 4
52 34.81 5.20 Female No Sun Dinner 4
.. ... ... ... ... ... ... ...
211 25.89 5.16 Male Yes Sat Dinner 4
212 48.33 9.00 Male No Sat Dinner 4
214 28.17 6.50 Female Yes Sat Dinner 3
239 29.03 5.92 Male No Sat Dinner 3
[15 rows x 7 columns]
-- tips by parties of at least 5 diners OR bill total was more than $45
SELECT *
FROM tips
WHERE size >= 5 OR total_bill > 45;
# tips by parties of at least 5 diners OR bill total was more than $45
In [3]: tips[(tips['size'] >= 5) | (tips['total_bill'] > 45)]
Out[3]:
total_bill tip sex smoker day time size
59 48.27 6.73 Male No Sat Dinner 4
125 29.80 4.20 Female No Thur Lunch 6
141 34.30 6.70 Male No Thur Lunch 6
142 41.19 5.00 Male No Thur Lunch 5
.. ... ... ... ... ... ... ...
185 20.69 5.00 Male No Sun Dinner 5
187 30.46 2.00 Male Yes Sun Dinner 5
212 48.33 9.00 Male No Sat Dinner 4
216 28.15 3.00 Male Yes Sat Dinner 5
[13 rows x 7 columns]
NULL checking is done using the notnull()
and isnull()
methods.
In [4]: frame = pd.DataFrame({'col1': ['A', 'B', np.NaN, 'C', 'D'],
...: 'col2': ['F', np.NaN, 'G', 'H', 'I']})
...:
In [5]: frame
Out[5]:
col1 col2
0 A F
1 B NaN
2 NaN G
3 C H
4 D I
Assume we have a table of the same structure as our DataFrame above. We can see only the records
where col2
IS NULL with the following query:
SELECT *
FROM frame
WHERE col2 IS NULL;
In [6]: frame[frame['col2'].isnull()]
Out[6]:
col1 col2
1 B NaN
Getting items where col1
IS NOT NULL can be done with notnull()
.
SELECT *
FROM frame
WHERE col1 IS NOT NULL;
In [7]: frame[frame['col1'].notnull()]
Out[7]:
col1 col2
0 A F
1 B NaN
3 C H
4 D I