5.8 Data munging

The optimized pandas data access methods .loc, .iloc, .ix, .at, and .iat work as normal. The only differences are the return type (when getting) and that only values already in categories can be assigned (when setting).

5.8.1 Getting

If the slicing operation returns either a DataFrame or a column of type Series, the category dtype is preserved.

In [1]: idx = pd.Index(["h","i","j","k","l","m","n",])

In [2]: cats = pd.Series(["a","b","b","b","c","c","c"], dtype="category", index=idx)

In [3]: values = [1,2,2,2,3,4,5]

In [4]: df = pd.DataFrame({"cats":cats,"values":values}, index=idx)

In [5]: df.iloc[2:4,:]
Out[5]: 
  cats  values
j    b       2
k    b       2

In [6]: df.iloc[2:4,:].dtypes
Out[6]: 
cats      category
values       int64
dtype: object

In [7]: df.loc["h":"j","cats"]
Out[7]: 
h    a
i    b
j    b
Name: cats, dtype: category
Categories (3, object): [a, b, c]

In [8]: df.ix["h":"j",0:1]
Out[8]: 
  cats
h    a
i    b
j    b

In [9]: df[df["cats"] == "b"]
Out[9]: 
  cats  values
i    b       2
j    b       2
k    b       2
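The .ix indexer used above was deprecated and then removed in later pandas releases. The following is a minimal sketch of the same kinds of selection using only .loc and .iloc, assuming a release in which .ix is no longer available. The frame mirrors the one above; the variable names (sub, by_position, by_mask, column) are illustrative only.

```python
import pandas as pd

idx = pd.Index(["h", "i", "j", "k", "l", "m", "n"])
cats = pd.Series(["a", "b", "b", "b", "c", "c", "c"], dtype="category", index=idx)
df = pd.DataFrame({"cats": cats, "values": [1, 2, 2, 2, 3, 4, 5]}, index=idx)

# Label-based rows combined with position-based columns, without .ix:
# turn the column positions into labels first, then use .loc.
sub = df.loc["h":"j", df.columns[0:1]]

# Each selection returns a DataFrame or a column Series,
# so the "cats" column should keep its category dtype.
by_position = df.iloc[2:4, :]
by_mask = df[df["cats"] == "b"]
column = df.loc["h":"j", "cats"]

for result in (sub["cats"], by_position["cats"], by_mask["cats"], column):
    print(result.dtype)
```

Resolving positions to labels through df.columns keeps the selection purely label-based, which avoids the label/position ambiguity that .ix had.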

The category dtype is not preserved when you select a single row: the resulting Series mixes values from columns of different dtypes, so it is of dtype object:

# get the complete "h" row as a Series
In [10]: df.loc["h", :]
Out[10]: 
cats      a
values    1
Name: h, dtype: object

Selecting a single item from categorical data also returns the value itself, not a categorical of length 1.

In [11]: df.iat[0,0]
Out[11]: 'a'

In [12]: df["cats"].cat.categories = ["x","y","z"]

In [13]: df.at["h","cats"] # returns a string
Out[13]: 'x'

Note

This differs from R’s factor function, where factor(c(1,2,3))[1] returns a single-value factor.

To get a single-value Series of type category, pass in a list with a single value:

In [14]: df.loc[["h"],"cats"]
Out[14]: 
h    x
Name: cats, dtype: category
Categories (3, object): [x, y, z]
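The three cases above (one row, one cell, a one-element list) can be compared side by side. This is a small sketch on a shortened, illustrative frame. It uses Series.cat.rename_categories instead of assigning to .cat.categories, on the assumption that a recent pandas release is installed where that assignment is no longer supported.

```python
import pandas as pd

idx = pd.Index(["h", "i", "j"])
df = pd.DataFrame(
    {"cats": pd.Categorical(["a", "b", "c"]), "values": [1, 2, 3]}, index=idx
)
# rename_categories returns a new Series, so assign it back to the column.
df["cats"] = df["cats"].cat.rename_categories(["x", "y", "z"])

row = df.loc["h", :]             # one row mixes columns: a plain object Series
scalar = df.at["h", "cats"]      # one cell: the value itself
single = df.loc[["h"], "cats"]   # list with one label: a Series of length 1

print(row.dtype)
print(repr(scalar))
print(single.dtype, len(single))
```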

5.8.2 String and datetime accessors

New in version 0.17.1.

The accessors .dt and .str work if the categories (s.cat.categories) are of an appropriate type:

In [15]: str_s = pd.Series(list('aabb'))

In [16]: str_cat = str_s.astype('category')

In [17]: str_cat
Out[17]: 
0    a
1    a
2    b
3    b
dtype: category
Categories (2, object): [a, b]

In [18]: str_cat.str.contains("a")
Out[18]: 
0     True
1     True
2    False
3    False
dtype: bool

In [19]: date_s = pd.Series(pd.date_range('1/1/2015', periods=5))

In [20]: date_cat = date_s.astype('category')

In [21]: date_cat
Out[21]: 
0   2015-01-01
1   2015-01-02
2   2015-01-03
3   2015-01-04
4   2015-01-05
dtype: category
Categories (5, datetime64[ns]): [2015-01-01, 2015-01-02, 2015-01-03, 2015-01-04, 2015-01-05]

In [22]: date_cat.dt.day
Out[22]: 
0    1
1    2
2    3
3    4
4    5
dtype: int64

Note

The returned Series (or DataFrame) has the same type as if you had used .str.<method> / .dt.<method> on a Series of the underlying type (and not of type category!).

In other words, the values returned by the accessor methods and properties of a Series are equal to those returned by the same methods and properties after that Series has been converted to type category:

In [23]: ret_s = str_s.str.contains("a")

In [24]: ret_cat = str_cat.str.contains("a")

In [25]: ret_s.dtype == ret_cat.dtype
Out[25]: True

In [26]: ret_s == ret_cat
Out[26]: 
0    True
1    True
2    True
3    True
dtype: bool

Note

The work is done on the categories, and then a new Series is constructed. This has performance implications if you have a Series of type string in which many elements are repeated (i.e. the number of unique elements in the Series is much smaller than the length of the Series). In that case it can be faster to convert the original Series to one of type category and use .str.<method> or .dt.<property> on that.
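To illustrate the idea of working on the categories first, here is a conceptual sketch. It is not pandas' internal implementation: it applies a string method to the unique categories only and then expands the result back to full length through the integer codes. The data is made up, and the sketch assumes there are no missing values (whose code would be -1).

```python
import pandas as pd

# Few unique strings, many repeats (illustrative data).
s = pd.Series(["apple", "banana", "apple", "cherry"] * 1000)
s_cat = s.astype("category")

# Apply the string method to the unique categories only ...
upper_categories = s_cat.cat.categories.str.upper()
# ... then expand back to full length using the integer codes.
codes = s_cat.cat.codes.to_numpy()
manual = pd.Series(upper_categories.take(codes), index=s_cat.index)

via_accessor = s_cat.str.upper()

print(len(s_cat.cat.categories), len(s_cat))
print(manual.tolist() == via_accessor.tolist())
```

Here the method runs on 3 strings instead of 4000, which is where any speed-up would come from. Whether it pays off in practice depends on the data and should be measured.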

5.8.3 Setting

Setting values in a categorical column (or Series) works as long as the value is included in the categories:

In [27]: idx = pd.Index(["h","i","j","k","l","m","n"])

In [28]: cats = pd.Categorical(["a","a","a","a","a","a","a"], categories=["a","b"])

In [29]: values = [1,1,1,1,1,1,1]

In [30]: df = pd.DataFrame({"cats":cats,"values":values}, index=idx)

In [31]: df.iloc[2:4,:] = [["b",2],["b",2]]

In [32]: df
Out[32]: 
  cats  values
h    a       1
i    a       1
j    b       2
k    b       2
l    a       1
m    a       1
n    a       1

In [33]: try:
   ....:     df.iloc[2:4,:] = [["c",3],["c",3]]
   ....: except ValueError as e:
   ....:     print("ValueError: " + str(e))
   ....: 
ValueError: Cannot setitem on a Categorical with a new category, set the categories first
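As the error message suggests, the new category has to exist before a value can be assigned. One possible sketch uses Series.cat.add_categories on a shortened, illustrative frame. The exception type for the rejected assignment has differed between pandas versions, so the sketch catches both TypeError and ValueError.

```python
import pandas as pd

idx = pd.Index(["h", "i", "j", "k"])
df = pd.DataFrame(
    {
        "cats": pd.Categorical(["a", "a", "a", "a"], categories=["a", "b"]),
        "values": [1, 1, 1, 1],
    },
    index=idx,
)

# "c" is not yet a category, so this assignment should be rejected.
try:
    df.loc["j":"k", "cats"] = "c"
    rejected = False
except (TypeError, ValueError):
    rejected = True

# Register the new category first, then assign.
df["cats"] = df["cats"].cat.add_categories(["c"])
df.loc["j":"k", "cats"] = "c"

print(rejected)
print(df["cats"].tolist())
print(list(df["cats"].cat.categories))
```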

Setting values by assigning categorical data also checks that the categories match:

In [34]: df.loc["j":"k","cats"] = pd.Categorical(["a","a"], categories=["a","b"])

In [35]: df
Out[35]: 
  cats  values
h    a       1
i    a       1
j    a       2
k    a       2
l    a       1
m    a       1
n    a       1

In [36]: try:
   ....:     df.loc["j":"k","cats"] = pd.Categorical(["b","b"], categories=["a","b","c"])
   ....: except ValueError as e:
   ....:     print("ValueError: " + str(e))
   ....: 
ValueError: Cannot set a Categorical with another, without identical categories
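If the incoming categorical only differs in its declared categories, one way around the error is to re-express it in terms of the target column's categories before assigning. This is a sketch on a shortened, illustrative frame using Categorical.set_categories. Note that any value not among the target categories would silently become NaN.

```python
import pandas as pd

idx = pd.Index(["h", "i", "j", "k"])
df = pd.DataFrame(
    {"cats": pd.Categorical(["a", "a", "a", "a"], categories=["a", "b"])},
    index=idx,
)

# Same values as in the failing example above, with an extra category "c".
new = pd.Categorical(["b", "b"], categories=["a", "b", "c"])

# Re-express the new data using the target column's categories.
aligned = new.set_categories(df["cats"].cat.categories)

df.loc["j":"k", "cats"] = aligned
print(df["cats"].tolist())
print(df["cats"].dtype)
```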

Assigning a Categorical to parts of a column of another dtype uses the values:

In [37]: df = pd.DataFrame({"a":[1,1,1,1,1], "b":["a","a","a","a","a"]})

In [38]: df.loc[1:2,"a"] = pd.Categorical(["b","b"], categories=["a","b"])

In [39]: df.loc[2:3,"b"] = pd.Categorical(["b","b"], categories=["a","b"])

In [40]: df
Out[40]: 
   a  b
0  1  a
1  b  a
2  b  b
3  1  b
4  1  a

In [41]: df.dtypes
Out[41]: 
a    object
b    object
dtype: object

5.8.4 Merging

You can concatenate two DataFrames containing categorical data, but the categories of these categoricals need to be the same:

In [42]: cat = pd.Series(["a","b"], dtype="category")

In [43]: vals = [1,2]

In [44]: df = pd.DataFrame({"cats":cat, "vals":vals})

In [45]: res = pd.concat([df,df])

In [46]: res
Out[46]: 
  cats  vals
0    a     1
1    b     2
0    a     1
1    b     2

In [47]: res.dtypes
Out[47]: 
cats    category
vals       int64
dtype: object

In the following case the categories are not the same, so an error is raised:

In [48]: df_different = df.copy()

In [49]: df_different["cats"].cat.categories = ["c","d"]

In [50]: try:
   ....:     pd.concat([df,df_different])
   ....: except ValueError as e:
   ....:     print("ValueError: " + str(e))
   ....: 
ValueError: incompatible categories in categorical concat

The same applies to df.append(df_different).
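When the categories differ, one option is to give both columns the same categories before concatenating. The sketch below builds the shared categories with Index.union and applies them with Series.cat.set_categories. The frames are illustrative, and the names all_cats, left and right are not part of pandas.

```python
import pandas as pd

df = pd.DataFrame({"cats": pd.Series(["a", "b"], dtype="category"), "vals": [1, 2]})
df_different = pd.DataFrame(
    {"cats": pd.Series(["c", "d"], dtype="category"), "vals": [1, 2]}
)

# Build one shared set of categories from both columns.
all_cats = df["cats"].cat.categories.union(df_different["cats"].cat.categories)

# assign() returns new frames, so the originals are left untouched.
left = df.assign(cats=df["cats"].cat.set_categories(all_cats))
right = df_different.assign(cats=df_different["cats"].cat.set_categories(all_cats))

res = pd.concat([left, right])
print(res["cats"].dtype)
print(list(res["cats"].cat.categories))
```

Because both inputs now share identical categories, the concatenated column should keep its category dtype.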

5.8.5 Unioning

New in version 0.19.0.

If you want to combine categoricals that do not necessarily have the same categories, the union_categoricals function will combine a list-like of categoricals. The new categories will be the union of the categories being combined.

In [51]: from pandas.types.concat import union_categoricals

In [52]: a = pd.Categorical(["b", "c"])

In [53]: b = pd.Categorical(["a", "b"])

In [54]: union_categoricals([a, b])
Out[54]: 
[b, c, a, b]
Categories (3, object): [b, c, a]

By default, the resulting categories will be ordered as they appear in the data. If you want the categories to be lexsorted, use the sort_categories=True argument.

In [55]: union_categoricals([a, b], sort_categories=True)
Out[55]: 
[b, c, a, b]
Categories (3, object): [a, b, c]

Note

Apart from the “easy” case of combining two categoricals with the same categories and order information (i.e. what you could also append), union_categoricals only works with unordered categoricals and will raise if any are ordered.
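A short sketch of how union_categoricals might be used to combine two category Series, and of the error for ordered input. Later pandas releases expose the function from pandas.api.types, so the import falls back to the location used in this section only if that fails. The Series and variable names are illustrative.

```python
import pandas as pd

try:
    from pandas.api.types import union_categoricals  # later releases
except ImportError:
    from pandas.types.concat import union_categoricals  # location used in this section

s1 = pd.Series(["b", "c"], dtype="category")
s2 = pd.Series(["a", "b"], dtype="category")

# .values on a category Series is the underlying Categorical.
combined = union_categoricals([s1.values, s2.values])
s_all = pd.Series(combined)

print(s_all.tolist())
print(list(s_all.cat.categories))

# Ordered categoricals with different categories are rejected.
o1 = pd.Categorical(["a", "b"], ordered=True)
o2 = pd.Categorical(["b", "c"], ordered=True)
try:
    union_categoricals([o1, o2])
    raised = False
except TypeError:
    raised = True
print(raised)
```

Wrapping the result in pd.Series gives a fresh default integer index. If the original index labels matter, they need to be carried over separately.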