5.2 Object Creation

Categorical Series or columns in a DataFrame can be created in several ways:

By specifying dtype="category" when constructing a Series:

In [1]: s = pd.Series(["a","b","c","a"], dtype="category")

In [2]: s
Out[2]: 
0    a
1    b
2    c
3    a
dtype: category
Categories (3, object): [a, b, c]

By converting an existing Series or column to a category dtype:

In [3]: df = pd.DataFrame({"A":["a","b","c","a"]})

In [4]: df["B"] = df["A"].astype('category')

In [5]: df
Out[5]: 
   A  B
0  a  a
1  b  b
2  c  c
3  a  a

By using some special functions:

In [6]: df = pd.DataFrame({'value': np.random.randint(0, 100, 20)})

In [7]: labels = [ "{0} - {1}".format(i, i + 9) for i in range(0, 100, 10) ]

In [8]: df['group'] = pd.cut(df.value, range(0, 105, 10), right=False, labels=labels)

In [9]: df.head(10)
Out[9]: 
    value    group
0      65  60 - 69
1      49  40 - 49
2      56  50 - 59
3      43  40 - 49
..    ...      ...
6      32  30 - 39
7      87  80 - 89
8      36  30 - 39
9       8    0 - 9

[10 rows x 2 columns]

See documentation for cut().

By passing a pandas.Categorical object to a Series or assigning it to a DataFrame.

In [10]: raw_cat = pd.Categorical(["a","b","c","a"], categories=["b","c","d"],
   ....:                          ordered=False)
   ....: 

In [11]: s = pd.Series(raw_cat)

In [12]: s
Out[12]: 
0    NaN
1      b
2      c
3    NaN
dtype: category
Categories (3, object): [b, c, d]

In [13]: df = pd.DataFrame({"A":["a","b","c","a"]})

In [14]: df["B"] = raw_cat

In [15]: df
Out[15]: 
   A    B
0  a  NaN
1  b    b
2  c    c
3  a  NaN

You can also specify differently ordered categories or make the resulting data ordered, by passing these arguments to astype():

In [16]: s = pd.Series(["a","b","c","a"])

In [17]: s_cat = s.astype("category", categories=["b","c","d"], ordered=False)

In [18]: s_cat
Out[18]: 
0    NaN
1      b
2      c
3    NaN
dtype: category
Categories (3, object): [b, c, d]

Categorical data has a specific category dtype:

In [19]: df.dtypes
Out[19]: 
A      object
B    category
dtype: object

Note

In contrast to R’s factor function, categorical data is not converting input values to strings and categories will end up the same data type as the original values.

Note

In contrast to R’s factor function, there is currently no way to assign/change labels at creation time. Use categories to change the categories after creation time.

To get back to the original Series or numpy array, use Series.astype(original_dtype) or np.asarray(categorical):

In [20]: s = pd.Series(["a","b","c","a"])

In [21]: s
Out[21]: 
0    a
1    b
2    c
3    a
dtype: object

In [22]: s2 = s.astype('category')

In [23]: s2
Out[23]: 
0    a
1    b
2    c
3    a
dtype: category
Categories (3, object): [a, b, c]

In [24]: s3 = s2.astype('string')

In [25]: s3
Out[25]: 
0    a
1    b
2    c
3    a
dtype: object

In [26]: np.asarray(s2)
Out[26]: array(['a', 'b', 'c', 'a'], dtype=object)

If you have already codes and categories, you can use the from_codes() constructor to save the factorize step during normal constructor mode:

In [27]: splitter = np.random.choice([0,1], 5, p=[0.5,0.5])

In [28]: s = pd.Series(pd.Categorical.from_codes(splitter, categories=["train", "test"]))