5.8 Data munging
The optimized pandas data access methods .loc, .iloc, .ix, .at, and .iat work as normal.
The only differences are the return type (when getting) and that only values already in
categories can be assigned.
5.8.1 Getting
If the slicing operation returns either a DataFrame or a column of type Series, the category
dtype is preserved.
In [1]: idx = pd.Index(["h","i","j","k","l","m","n",])
In [2]: cats = pd.Series(["a","b","b","b","c","c","c"], dtype="category", index=idx)
In [3]: values = [1,2,2,2,3,4,5]
In [4]: df = pd.DataFrame({"cats":cats,"values":values}, index=idx)
In [5]: df.iloc[2:4,:]
Out[5]:
cats values
j b 2
k b 2
In [6]: df.iloc[2:4,:].dtypes
Out[6]:
cats category
values int64
dtype: object
In [7]: df.loc["h":"j","cats"]
Out[7]:
h a
i b
j b
Name: cats, dtype: category
Categories (3, object): [a, b, c]
In [8]: df.ix["h":"j",0:1]
Out[8]:
cats
h a
i b
j b
In [9]: df[df["cats"] == "b"]
Out[9]:
cats values
i b 2
j b 2
k b 2
One case where the category dtype is not preserved is selecting a single row: the
resulting Series has dtype object:
# get the complete "h" row as a Series
In [10]: df.loc["h", :]
Out[10]:
cats a
values 1
Name: h, dtype: object
Returning a single item from categorical data will also return the value, not a categorical of length “1”.
In [11]: df.iat[0,0]
Out[11]: 'a'
In [12]: df["cats"].cat.categories = ["x","y","z"]
In [13]: df.at["h","cats"] # returns a string
Out[13]: 'x'
Note
This differs from R’s factor function, where factor(c(1,2,3))[1] returns a single-value factor.
To get a single-value Series of type category, pass in a list with a single value:
In [14]: df.loc[["h"],"cats"]
Out[14]:
h x
Name: cats, dtype: category
Categories (3, object): [x, y, z]
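The three access patterns above can be compared side by side. The following is a minimal sketch on a small, made-up frame (the column names and data are illustrative, not taken from the session above); it only checks the resulting dtypes.

```python
import pandas as pd

idx = pd.Index(["h", "i", "j"])
df = pd.DataFrame(
    {
        "cats": pd.Series(["a", "b", "b"], dtype="category", index=idx),
        "values": [1, 2, 2],
    },
    index=idx,
)

# A full row mixes a categorical and an integer column, so it becomes object.
row = df.loc["h", :]

# Scalar access returns the underlying value, not a length-1 categorical.
scalar = df.at["h", "cats"]

# A list selector keeps the result a Series of dtype category.
single = df.loc[["h"], "cats"]

row_dtype = str(row.dtype)
single_dtype = str(single.dtype)
```

Wrapping the label in a list is therefore the way to keep the category dtype when only one row is needed.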
5.8.2 String and datetime accessors
New in version 0.17.1.
The accessors .dt and .str will work if the s.cat.categories are of an appropriate type:
In [15]: str_s = pd.Series(list('aabb'))
In [16]: str_cat = str_s.astype('category')
In [17]: str_cat
Out[17]:
0 a
1 a
2 b
3 b
dtype: category
Categories (2, object): [a, b]
In [18]: str_cat.str.contains("a")
Out[18]:
0 True
1 True
2 False
3 False
dtype: bool
In [19]: date_s = pd.Series(pd.date_range('1/1/2015', periods=5))
In [20]: date_cat = date_s.astype('category')
In [21]: date_cat
Out[21]:
0 2015-01-01
1 2015-01-02
2 2015-01-03
3 2015-01-04
4 2015-01-05
dtype: category
Categories (5, datetime64[ns]): [2015-01-01, 2015-01-02, 2015-01-03, 2015-01-04, 2015-01-05]
In [22]: date_cat.dt.day
Out[22]:
0 1
1 2
2 3
3 4
4 5
dtype: int64
Note
The returned Series (or DataFrame) is of the same type as if you used
.str.<method> / .dt.<method> on a Series of that type (and not of type category!).
That means that the values returned from methods and properties on the accessors of a
Series are equal to the values returned from the same methods and properties on the
accessors of that Series converted to type category:
In [23]: ret_s = str_s.str.contains("a")
In [24]: ret_cat = str_cat.str.contains("a")
In [25]: ret_s.dtype == ret_cat.dtype
Out[25]: True
In [26]: ret_s == ret_cat
Out[26]:
0 True
1 True
2 True
3 True
dtype: bool
Note
The work is done on the categories and then a new Series is constructed. This has
performance implications if you have a Series of strings in which many elements
are repeated (i.e. the number of unique elements in the Series is much smaller than the
length of the Series). In this case it can be faster to convert the original Series
to one of type category and use .str.<method> or .dt.<property> on that.
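As a rough illustration of the note above, the sketch below builds a Series with many repeated strings (made-up data) and applies the same string method to the plain and the categorical version. It only checks that both give the same values; no timings are claimed, so measure with your own data before relying on a speed-up.

```python
import pandas as pd

# 20,000 strings, but only two distinct values (illustrative data).
s = pd.Series(["apple", "banana"] * 10000)
cat = s.astype("category")

# Applied to the plain Series, the method visits every element.
plain_result = s.str.upper()

# Applied to the categorical Series, the result is rebuilt from the categories.
cat_result = cat.str.upper()

n_categories = len(cat.cat.categories)
same_values = list(plain_result) == list(cat_result)
result_is_categorical = str(cat_result.dtype) == "category"
```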
5.8.3 Setting
Setting values in a categorical column (or Series) works as long as the value is included in the categories:
In [27]: idx = pd.Index(["h","i","j","k","l","m","n"])
In [28]: cats = pd.Categorical(["a","a","a","a","a","a","a"], categories=["a","b"])
In [29]: values = [1,1,1,1,1,1,1]
In [30]: df = pd.DataFrame({"cats":cats,"values":values}, index=idx)
In [31]: df.iloc[2:4,:] = [["b",2],["b",2]]
In [32]: df
Out[32]:
cats values
h a 1
i a 1
j b 2
k b 2
l a 1
m a 1
n a 1
In [33]: try:
....: df.iloc[2:4,:] = [["c",3],["c",3]]
....: except ValueError as e:
....: print("ValueError: " + str(e))
....:
ValueError: Cannot setitem on a Categorical with a new category, set the categories first
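The error message suggests setting the categories first. One way to do that, sketched here on a small stand-alone Series rather than the DataFrame above, is Series.cat.add_categories. The exception type is caught broadly because, as far as I know, it is ValueError in the version shown here but TypeError in some later releases; treat that as an assumption.

```python
import pandas as pd

s = pd.Series(pd.Categorical(["a", "a", "a"], categories=["a", "b"]))

try:
    s.iloc[0] = "c"  # "c" is not among the categories
    rejected = False
except (ValueError, TypeError):
    # Assumption: the exception type differs between pandas versions.
    rejected = True

# Extend the categories first, then the assignment is accepted.
s = s.cat.add_categories(["c"])
s.iloc[0] = "c"

final_values = list(s)
final_categories = list(s.cat.categories)
```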
Setting values by assigning categorical data will also check that the categories match:
In [34]: df.loc["j":"k","cats"] = pd.Categorical(["a","a"], categories=["a","b"])
In [35]: df
Out[35]:
cats values
h a 1
i a 1
j a 2
k a 2
l a 1
m a 1
n a 1
In [36]: try:
....: df.loc["j":"k","cats"] = pd.Categorical(["b","b"], categories=["a","b","c"])
....: except ValueError as e:
....: print("ValueError: " + str(e))
....:
ValueError: Cannot set a Categorical with another, without identical categories
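If the incoming categorical merely carries extra, unused categories, one possible workaround is to align its categories with the target before assigning. This is a sketch on a stand-alone Series with made-up data; note that set_categories turns any value outside the new categories into NaN, so it is only safe when the dropped categories are unused.

```python
import pandas as pd

target = pd.Series(pd.Categorical(["a", "a", "a"], categories=["a", "b"]))
incoming = pd.Categorical(["b", "b"], categories=["a", "b", "c"])

# Re-express the incoming data with exactly the target's categories.
# "c" is unused here, so no value is lost.
aligned = incoming.set_categories(target.cat.categories)

target.iloc[0:2] = aligned

result_values = list(target)
result_categories = list(target.cat.categories)
```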
Assigning a Categorical to parts of a column of other types will use the values:
In [37]: df = pd.DataFrame({"a":[1,1,1,1,1], "b":["a","a","a","a","a"]})
In [38]: df.loc[1:2,"a"] = pd.Categorical(["b","b"], categories=["a","b"])
In [39]: df.loc[2:3,"b"] = pd.Categorical(["b","b"], categories=["a","b"])
In [40]: df
Out[40]:
a b
0 1 a
1 b a
2 b b
3 1 b
4 1 a
In [41]: df.dtypes
Out[41]:
a object
b object
dtype: object
5.8.4 Merging
You can concat two DataFrames containing categorical data together, but the categories of these categoricals need to be the same:
In [42]: cat = pd.Series(["a","b"], dtype="category")
In [43]: vals = [1,2]
In [44]: df = pd.DataFrame({"cats":cat, "vals":vals})
In [45]: res = pd.concat([df,df])
In [46]: res
Out[46]:
cats vals
0 a 1
1 b 2
0 a 1
1 b 2
In [47]: res.dtypes
Out[47]:
cats category
vals int64
dtype: object
In this case the categories are not the same and so an error is raised:
In [48]: df_different = df.copy()
In [49]: df_different["cats"].cat.categories = ["c","d"]
In [50]: try:
....: pd.concat([df,df_different])
....: except ValueError as e:
....: print("ValueError: " + str(e))
....:
ValueError: incompatible categories in categorical concat
The same applies to df.append(df_different).
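When the categories differ, one option is to give both inputs the same category list before concatenating. The sketch below does this for two stand-alone Series with made-up data; the helper list `shared` and the choice of ordering (first input's categories, then any new ones) are illustrative assumptions, not a pandas convention.

```python
import pandas as pd

left = pd.Series(["a", "b"], dtype="category")
right = pd.Series(["c", "d"], dtype="category")

# Build one shared category list: left's categories, then any new ones.
shared = list(left.cat.categories) + [
    c for c in right.cat.categories if c not in left.cat.categories
]

left_aligned = left.cat.set_categories(shared)
right_aligned = right.cat.set_categories(shared)

# With identical categories on both sides, the category dtype survives.
combined = pd.concat([left_aligned, right_aligned], ignore_index=True)

combined_values = list(combined)
combined_categories = list(combined.cat.categories)
```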
5.8.5 Unioning
New in version 0.19.0.
If you want to combine categoricals that do not necessarily have the same categories,
the union_categoricals function will combine a list-like of categoricals. The new
categories will be the union of the categories being combined.
In [51]: from pandas.types.concat import union_categoricals
In [52]: a = pd.Categorical(["b", "c"])
In [53]: b = pd.Categorical(["a", "b"])
In [54]: union_categoricals([a, b])
Out[54]:
[b, c, a, b]
Categories (3, object): [b, c, a]
By default, the resulting categories will be ordered as they appear in the data.
If you want the categories to be lexsorted, use the sort_categories=True argument.
In [55]: union_categoricals([a, b], sort_categories=True)
Out[55]:
[b, c, a, b]
Categories (3, object): [a, b, c]
Note
Apart from the “easy” case of combining two categoricals with the same
categories and order information (e.g. what you could also append),
union_categoricals only works with unordered categoricals and will
raise if any are ordered.
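The restriction in the note can be checked with a short sketch. The import is wrapped in a fallback because the function's location is an assumption that depends on the pandas version: this section imports it from pandas.types.concat, while later releases expose it under pandas.api.types. The sample data is made up.

```python
import pandas as pd

try:
    # Location in later pandas releases (assumption).
    from pandas.api.types import union_categoricals
except ImportError:
    # Location used in this section.
    from pandas.types.concat import union_categoricals

# Unordered categoricals with different categories can be combined.
a = pd.Categorical(["b", "c"])
b = pd.Categorical(["a", "b"])
combined = union_categoricals([a, b])

# Ordered categoricals with different categories are rejected.
x = pd.Categorical(["a", "b"], ordered=True)
y = pd.Categorical(["b", "c"], ordered=True)
try:
    union_categoricals([x, y])
    ordered_rejected = False
except TypeError:
    ordered_rejected = True

combined_values = list(combined)
combined_categories = list(combined.categories)
```

The combined result is a plain Categorical; wrap it in pd.Series(...) if a Series of dtype category is needed.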