5.7 Operations
Apart from Series.min(), Series.max() and Series.mode(), the following operations are possible with categorical data:
Series methods like Series.value_counts() will use all categories, even if some categories are not present in the data:
In [1]: s = pd.Series(pd.Categorical(["a", "b", "c", "c"], categories=["c", "a", "b", "d"]))
In [2]: s.value_counts()
Out[2]:
c    2
b    1
a    1
d    0
dtype: int64
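As a minimal sketch of the behaviour above, the snippet below counts the same Series twice: once as is, and once after dropping the unused categories with Series.cat.remove_unused_categories(). The variable names counts and observed_counts are illustrative only.

```python
import pandas as pd

s = pd.Series(pd.Categorical(["a", "b", "c", "c"], categories=["c", "a", "b", "d"]))

# Every declared category is counted, so the unused "d" is reported with 0.
counts = s.value_counts()

# Removing unused categories first limits the result to values that occur.
observed_counts = s.cat.remove_unused_categories().value_counts()

print(counts)
print(observed_counts)
```

Dropping unused categories is one way to get counts for observed values only; whether that is desirable depends on whether the zero entries carry meaning for the analysis.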
Groupby will also show “unused” categories:
In [3]: cats = pd.Categorical(["a", "b", "b", "b", "c", "c", "c"], categories=["a", "b", "c", "d"])
In [4]: df = pd.DataFrame({"cats": cats, "values": [1, 2, 2, 2, 3, 4, 5]})
In [5]: df.groupby("cats").mean()
Out[5]:
      values
cats
a        1.0
b        2.0
c        4.0
d        NaN
In [6]: cats2 = pd.Categorical(["a", "a", "b", "b"], categories=["a", "b", "c"])
In [7]: df2 = pd.DataFrame({"cats": cats2, "B": ["c", "d", "c", "d"], "values": [1, 2, 3, 4]})
In [8]: df2.groupby(["cats","B"]).mean()
Out[8]:
        values
cats B
a    c     1.0
     d     2.0
b    c     3.0
     d     4.0
c    c     NaN
     d     NaN
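Whether groupby reports unused categories can be controlled explicitly. The sketch below assumes a pandas version in which DataFrame.groupby accepts the observed keyword; the default for that keyword has differed between releases, so passing it explicitly avoids relying on it. The names all_groups and seen_groups are illustrative only.

```python
import pandas as pd

cats = pd.Categorical(["a", "b", "b", "b", "c", "c", "c"], categories=["a", "b", "c", "d"])
df = pd.DataFrame({"cats": cats, "values": [1, 2, 2, 2, 3, 4, 5]})

# observed=False keeps every declared category, so "d" shows up with NaN.
all_groups = df.groupby("cats", observed=False).mean()

# observed=True keeps only the categories that occur in the data.
seen_groups = df.groupby("cats", observed=True).mean()

print(all_groups)
print(seen_groups)
```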
Pivot tables also include unused categories:
In [9]: raw_cat = pd.Categorical(["a", "a", "b", "b"], categories=["a", "b", "c"])
In [10]: df = pd.DataFrame({"A": raw_cat, "B": ["c", "d", "c", "d"], "values": [1, 2, 3, 4]})
In [11]: pd.pivot_table(df, values="values", index=["A", "B"])
Out[11]:
A  B
a  c    1.0
   d    2.0
b  c    3.0
   d    4.0
c  c    NaN
   d    NaN
Name: values, dtype: float64
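The exact shape of the pivot table result has varied between pandas versions. The sketch below assumes a version in which pd.pivot_table accepts the observed and dropna keywords and returns a DataFrame; under that assumption, passing observed=False and dropna=False asks pandas to keep rows for the unused category instead of discarding all-NaN entries. The name table is illustrative only.

```python
import pandas as pd

raw_cat = pd.Categorical(["a", "a", "b", "b"], categories=["a", "b", "c"])
df = pd.DataFrame({"A": raw_cat, "B": ["c", "d", "c", "d"], "values": [1, 2, 3, 4]})

# Assumption: observed and dropna are available in the installed pandas.
# observed=False requests all declared categories of "A";
# dropna=False asks pandas not to drop entries that are entirely NaN.
table = pd.pivot_table(
    df, values="values", index=["A", "B"], observed=False, dropna=False
)

print(table)
```

Combinations that occur in the data keep their aggregated values either way; only the handling of the unused category "c" depends on these keywords and on the pandas version.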