In [1]: tips.head()
Out[1]:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4
5.3 GROUP BY
In pandas, SQL’s GROUP BY operations are performed using the similarly named
groupby()
method. groupby()
typically refers to a
process where we’d like to split a dataset into groups, apply some function (typically aggregation)
, and then combine the groups together.
A common SQL operation would be getting the count of records in each group throughout a dataset. For instance, a query getting us the number of tips left by sex:
SELECT sex, count(*)
FROM tips
GROUP BY sex;
/*
Female 87
Male 157
*/
The pandas equivalent would be:
In [2]: tips.groupby('sex').size()
Out[2]:
sex
Female 87
Male 157
dtype: int64
Notice that in the pandas code we used size()
and not
count()
. This is because
count()
applies the function to each column, returning
the number of not null
records within each.
In [3]: tips.groupby('sex').count()
Out[3]:
total_bill tip smoker day time size
sex
Female 87 87 87 87 87 87
Male 157 157 157 157 157 157
Alternatively, we could have applied the count()
method
to an individual column:
In [4]: tips.groupby('sex')['total_bill'].count()
Out[4]:
sex
Female 87
Male 157
Name: total_bill, dtype: int64
Multiple functions can also be applied at once. For instance, say we’d like to see how tip amount
differs by day of the week - agg()
allows you to pass a dictionary
to your grouped DataFrame, indicating which functions to apply to specific columns.
SELECT day, AVG(tip), COUNT(*)
FROM tips
GROUP BY day;
/*
Fri 2.734737 19
Sat 2.993103 87
Sun 3.255132 76
Thur 2.771452 62
*/
In [5]: tips.groupby('day').agg({'tip': np.mean, 'day': np.size})
Out[5]:
tip day
day
Fri 2.734737 19
Sat 2.993103 87
Sun 3.255132 76
Thur 2.771452 62
Grouping by more than one column is done by passing a list of columns to the
groupby()
method.
SELECT smoker, day, COUNT(*), AVG(tip)
FROM tips
GROUP BY smoker, day;
/*
smoker day
No Fri 4 2.812500
Sat 45 3.102889
Sun 57 3.167895
Thur 45 2.673778
Yes Fri 15 2.714000
Sat 42 2.875476
Sun 19 3.516842
Thur 17 3.030000
*/
In [6]: tips.groupby(['smoker', 'day']).agg({'tip': [np.size, np.mean]})
Out[6]:
tip
size mean
smoker day
No Fri 4.0 2.812500
Sat 45.0 3.102889
Sun 57.0 3.167895
Thur 45.0 2.673778
Yes Fri 15.0 2.714000
Sat 42.0 2.875476
Sun 19.0 3.516842
Thur 17.0 3.030000