5.8 Data munging
The optimized pandas data access methods .loc, .iloc, .ix, .at, and .iat
work as normal. The only differences are the return type (for getting) and
that only values already in categories can be assigned.
5.8.1 Getting
If the slicing operation returns either a DataFrame or a column of type Series,
the category dtype is preserved.
In [1]: idx = pd.Index(["h","i","j","k","l","m","n",])
In [2]: cats = pd.Series(["a","b","b","b","c","c","c"], dtype="category", index=idx)
In [3]: values = [1,2,2,2,3,4,5]
In [4]: df = pd.DataFrame({"cats":cats,"values":values}, index=idx)
In [5]: df.iloc[2:4,:]
Out[5]:
  cats  values
j    b       2
k    b       2
In [6]: df.iloc[2:4,:].dtypes
Out[6]:
cats category
values int64
dtype: object
In [7]: df.loc["h":"j","cats"]
Out[7]:
h a
i b
j b
Name: cats, dtype: category
Categories (3, object): [a, b, c]
In [8]: df.ix["h":"j",0:1]
Out[8]:
  cats
h    a
i    b
j    b
In [9]: df[df["cats"] == "b"]
Out[9]:
  cats  values
i    b       2
j    b       2
k    b       2
The category dtype is not preserved, for example, when you take a single row;
the resulting Series is of dtype object:
# get the complete "h" row as a Series
In [10]: df.loc["h", :]
Out[10]:
cats a
values 1
Name: h, dtype: object
Selecting a single item from categorical data also returns the value itself, not a categorical of length 1.
In [11]: df.iat[0,0]
Out[11]: 'a'
In [12]: df["cats"].cat.categories = ["x","y","z"]
In [13]: df.at["h","cats"] # returns a string
Out[13]: 'x'
Note
This is in contrast to R’s factor function, where factor(c(1,2,3))[1]
returns a single value factor.
To get a single-value Series of type category, pass in a list with a single value:
In [14]: df.loc[["h"],"cats"]
Out[14]:
h x
Name: cats, dtype: category
Categories (3, object): [x, y, z]
5.8.2 String and datetime accessors
New in version 0.17.1.
The accessors .dt and .str will work if the s.cat.categories are of an appropriate
type:
In [15]: str_s = pd.Series(list('aabb'))
In [16]: str_cat = str_s.astype('category')
In [17]: str_cat
Out[17]:
0 a
1 a
2 b
3 b
dtype: category
Categories (2, object): [a, b]
In [18]: str_cat.str.contains("a")
Out[18]:
0 True
1 True
2 False
3 False
dtype: bool
In [19]: date_s = pd.Series(pd.date_range('1/1/2015', periods=5))
In [20]: date_cat = date_s.astype('category')
In [21]: date_cat
Out[21]:
0 2015-01-01
1 2015-01-02
2 2015-01-03
3 2015-01-04
4 2015-01-05
dtype: category
Categories (5, datetime64[ns]): [2015-01-01, 2015-01-02, 2015-01-03, 2015-01-04, 2015-01-05]
In [22]: date_cat.dt.day
Out[22]:
0 1
1 2
2 3
3 4
4 5
dtype: int64
Note
The returned Series (or DataFrame) is of the same type as if you used the
.str.<method> / .dt.<method> on a Series of that type (and not of
type category!).
That means that the values returned by the accessor methods and properties of a
Series are equal to the values returned by the same methods and properties after
that Series has been converted to type category:
In [23]: ret_s = str_s.str.contains("a")
In [24]: ret_cat = str_cat.str.contains("a")
In [25]: ret_s.dtype == ret_cat.dtype
Out[25]: True
In [26]: ret_s == ret_cat
Out[26]:
0 True
1 True
2 True
3 True
dtype: bool
Note
The work is done on the categories and then a new Series is constructed. This has
performance implications if you have a Series of type string in which many elements
are repeated (i.e. the number of unique elements in the Series is much smaller than the
length of the Series). In this case it can be faster to convert the original Series
to one of type category and use .str.<method> or .dt.<property> on that.
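The note above can be illustrated with a minimal sketch. The data is hypothetical (three distinct strings repeated many times), and the timings are only indicative, since they depend on the machine and the pandas version:

```python
import timeit

import pandas as pd

# Hypothetical data: three distinct strings repeated many times.
s = pd.Series(["apple", "banana", "cherry"] * 10000)
s_cat = s.astype("category")

# Both calls produce the same values; on the categorical Series the work is
# done on the three categories before the full-length result is rebuilt.
upper_plain = s.str.upper()
upper_cat = s_cat.str.upper()
same_values = upper_plain.tolist() == upper_cat.tolist()

# Rough timings only; no particular speed-up is guaranteed.
t_plain = timeit.timeit(lambda: s.str.upper(), number=5)
t_cat = timeit.timeit(lambda: s_cat.str.upper(), number=5)

print("identical values:", same_values)
print("plain: %.4fs, categorical: %.4fs" % (t_plain, t_cat))
```

Whether the conversion pays off also depends on the cost of astype('category') itself, so it is worth measuring on your own data.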
5.8.3 Setting
Setting values in a categorical column (or Series) works as long as the value is included in the categories:
In [27]: idx = pd.Index(["h","i","j","k","l","m","n"])
In [28]: cats = pd.Categorical(["a","a","a","a","a","a","a"], categories=["a","b"])
In [29]: values = [1,1,1,1,1,1,1]
In [30]: df = pd.DataFrame({"cats":cats,"values":values}, index=idx)
In [31]: df.iloc[2:4,:] = [["b",2],["b",2]]
In [32]: df
Out[32]:
  cats  values
h    a       1
i    a       1
j    b       2
k    b       2
l    a       1
m    a       1
n    a       1
In [33]: try:
   ....:     df.iloc[2:4,:] = [["c",3],["c",3]]
   ....: except ValueError as e:
   ....:     print("ValueError: " + str(e))
   ....: 
ValueError: Cannot setitem on a Categorical with a new category, set the categories first
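The error message suggests setting the categories first. One way to do that, shown here as a minimal sketch on a small hypothetical Series, is Series.cat.add_categories. The exception type for the failed assignment is assumed to vary between pandas releases, so the sketch catches both ValueError and TypeError:

```python
import pandas as pd

# Hypothetical categorical Series with categories "a" and "b" only.
s = pd.Series(pd.Categorical(["a", "a", "a"], categories=["a", "b"]))

# Assigning a value that is not among the categories fails.
try:
    s.iloc[0] = "c"
    failed = False
except (ValueError, TypeError):
    # Assumption: the exception type differs between pandas versions.
    failed = True

# Register the new category first, then the assignment succeeds.
s = s.cat.add_categories(["c"])
s.iloc[0] = "c"

print(failed)
print(s)
```

add_categories appends the new category after the existing ones and leaves the existing values untouched.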
Setting values by assigning categorical data will also check that the categories match:
In [34]: df.loc["j":"k","cats"] = pd.Categorical(["a","a"], categories=["a","b"])
In [35]: df
Out[35]:
  cats  values
h    a       1
i    a       1
j    a       2
k    a       2
l    a       1
m    a       1
n    a       1
In [36]: try:
   ....:     df.loc["j":"k","cats"] = pd.Categorical(["b","b"], categories=["a","b","c"])
   ....: except ValueError as e:
   ....:     print("ValueError: " + str(e))
   ....: 
ValueError: Cannot set a Categorical with another, without identical categories
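If the incoming categorical carries different categories, one possible workaround is to align it with the target before assigning. The following is a sketch under the assumption that every incoming value already exists among the target's categories; the names target and incoming are illustrative only:

```python
import pandas as pd

# Target Series with categories "a" and "b".
target = pd.Series(pd.Categorical(["a", "a", "a"], categories=["a", "b"]))

# Incoming data has an extra, unused category "c".
incoming = pd.Categorical(["b", "b"], categories=["a", "b", "c"])

# Align the incoming categories with the target's. Values that are not in
# the new categories would become NaN, so check the data beforehand.
aligned = incoming.set_categories(target.cat.categories)

# With identical categories the assignment is accepted.
target.iloc[0:2] = aligned

print(target)
```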
Assigning a Categorical to parts of a column of other types will use the values:
In [37]: df = pd.DataFrame({"a":[1,1,1,1,1], "b":["a","a","a","a","a"]})
In [38]: df.loc[1:2,"a"] = pd.Categorical(["b","b"], categories=["a","b"])
In [39]: df.loc[2:3,"b"] = pd.Categorical(["b","b"], categories=["a","b"])
In [40]: df
Out[40]:
   a  b
0  1  a
1  b  a
2  b  b
3  1  b
4  1  a
In [41]: df.dtypes
Out[41]:
a object
b object
dtype: object
5.8.4 Merging
You can concatenate two DataFrames containing categorical data, but the categories of these categoricals need to be the same:
In [42]: cat = pd.Series(["a","b"], dtype="category")
In [43]: vals = [1,2]
In [44]: df = pd.DataFrame({"cats":cat, "vals":vals})
In [45]: res = pd.concat([df,df])
In [46]: res
Out[46]:
  cats  vals
0    a     1
1    b     2
0    a     1
1    b     2
In [47]: res.dtypes
Out[47]:
cats category
vals int64
dtype: object
In this case the categories are not the same and so an error is raised:
In [48]: df_different = df.copy()
In [49]: df_different["cats"].cat.categories = ["c","d"]
In [50]: try:
   ....:     pd.concat([df,df_different])
   ....: except ValueError as e:
   ....:     print("ValueError: " + str(e))
   ....: 
ValueError: incompatible categories in categorical concat
The same applies to df.append(df_different).
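If both frames should keep the category dtype, one option is to give the columns the same categories before concatenating. This is a minimal sketch with hypothetical frames df1 and df2, where the shared category list is written out by hand:

```python
import pandas as pd

# Two hypothetical frames whose categorical columns use different categories.
df1 = pd.DataFrame({"cats": pd.Series(["a", "b"], dtype="category"),
                    "vals": [1, 2]})
df2 = pd.DataFrame({"cats": pd.Series(["c", "d"], dtype="category"),
                    "vals": [3, 4]})

# Give both columns the same categories, in the same order.
shared = ["a", "b", "c", "d"]
df1["cats"] = df1["cats"].cat.set_categories(shared)
df2["cats"] = df2["cats"].cat.set_categories(shared)

# The categories now match, so the category dtype carries through.
res = pd.concat([df1, df2], ignore_index=True)

print(res)
print(res.dtypes)
```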
5.8.5 Unioning
New in version 0.19.0.
If you want to combine categoricals that do not necessarily have
the same categories, the union_categoricals function will
combine a list-like of categoricals. The new categories
will be the union of the categories being combined.
In [51]: from pandas.types.concat import union_categoricals
In [52]: a = pd.Categorical(["b", "c"])
In [53]: b = pd.Categorical(["a", "b"])
In [54]: union_categoricals([a, b])
Out[54]:
[b, c, a, b]
Categories (3, object): [b, c, a]
By default, the resulting categories will be ordered as
they appear in the data. If you want the categories to
be lexsorted, use the sort_categories=True argument.
In [55]: union_categoricals([a, b], sort_categories=True)
Out[55]:
[b, c, a, b]
Categories (3, object): [a, b, c]
Note
Apart from the “easy” case of combining two categoricals with the same
categories and order information (i.e. categoricals you could also append),
union_categoricals only works with unordered categoricals and will
raise if any are ordered.
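The behaviour described in this section can be checked with a short sketch. It assumes a pandas release that exposes union_categoricals under pandas.api.types, which differs from the import path shown above, and it assumes that the ordered case raises a TypeError:

```python
import pandas as pd
# Assumption: this import path is available in your pandas release.
from pandas.api.types import union_categoricals

a = pd.Categorical(["b", "c"])
b = pd.Categorical(["a", "b"])

# Unordered categoricals: the categories are unioned in order of appearance.
combined = union_categoricals([a, b])
print(list(combined.categories))

# Ordered categoricals with different categories cannot be combined.
ordered_a = pd.Categorical(["b", "c"], categories=["b", "c"], ordered=True)
ordered_b = pd.Categorical(["a", "b"], categories=["a", "b"], ordered=True)
try:
    union_categoricals([ordered_a, ordered_b])
    raised = False
except TypeError:
    raised = True
print(raised)
```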