7.8 Computing indicator / dummy variables

To convert a categorical variable into a “dummy” or “indicator” DataFrame, for example a column in a DataFrame (a Series) which has k distinct values, use get_dummies() to derive a DataFrame containing k columns of 1s and 0s:

In [1]: df = pd.DataFrame({'key': list('bbacab'), 'data1': range(6)})

In [2]: df
Out[2]: 
   data1 key
0      0   b
1      1   b
2      2   a
3      3   c
4      4   a
5      5   b

In [3]: pd.get_dummies(df['key'])
Out[3]: 
     a    b    c
0  0.0  1.0  0.0
1  0.0  1.0  0.0
2  1.0  0.0  0.0
3  0.0  0.0  1.0
4  1.0  0.0  0.0
5  0.0  1.0  0.0

Sometimes it’s useful to prefix the column names, for example when merging the result with the original DataFrame:

In [4]: dummies = pd.get_dummies(df['key'], prefix='key')

In [5]: dummies
Out[5]: 
   key_a  key_b  key_c
0    0.0    1.0    0.0
1    0.0    1.0    0.0
2    1.0    0.0    0.0
3    0.0    0.0    1.0
4    1.0    0.0    0.0
5    0.0    1.0    0.0

In [6]: df[['data1']].join(dummies)
Out[6]: 
   data1  key_a  key_b  key_c
0      0    0.0    1.0    0.0
1      1    0.0    1.0    0.0
2      2    1.0    0.0    0.0
3      3    0.0    0.0    1.0
4      4    1.0    0.0    0.0
5      5    0.0    1.0    0.0

This function is often used along with discretization functions like cut:

In [7]: values = np.random.randn(10)

In [8]: values
Out[8]: 
array([ 0.4691, -0.2829, -1.5091, -1.1356,  1.2121, -0.1732,  0.1192,
       -1.0442, -0.8618, -2.1046])

In [9]: bins = [0, 0.2, 0.4, 0.6, 0.8, 1]

In [10]: pd.get_dummies(pd.cut(values, bins))
Out[10]: 
    (0, 0.2]  (0.2, 0.4]  (0.4, 0.6]  (0.6, 0.8]  (0.8, 1]
0        0.0         0.0         1.0         0.0       0.0
1        0.0         0.0         0.0         0.0       0.0
2        0.0         0.0         0.0         0.0       0.0
3        0.0         0.0         0.0         0.0       0.0
..       ...         ...         ...         ...       ...
6        1.0         0.0         0.0         0.0       0.0
7        0.0         0.0         0.0         0.0       0.0
8        0.0         0.0         0.0         0.0       0.0
9        0.0         0.0         0.0         0.0       0.0

[10 rows x 5 columns]
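Many rows in the output above are all zeros because the corresponding values fall outside every bin. The following is a minimal sketch of that behaviour with a small, hand-picked list of values (the variable names binned and dummies are illustrative only). The cast to int is an assumption made so the result looks the same regardless of the dtype that get_dummies returns in a given pandas version.

```python
import pandas as pd

values = [0.47, -0.28, 0.12, 1.21]
bins = [0, 0.2, 0.4, 0.6, 0.8, 1]

# Values outside the interval (0, 1] are not assigned to any bin (NaN).
binned = pd.cut(values, bins)

# One indicator column per bin, including bins that received no values.
dummies = pd.get_dummies(binned).astype(int)

# A row sum of 0 marks a value that fell outside every bin.
print(dummies.sum(axis=1).tolist())
```

If such rows matter for your analysis, consider widening the bin edges so that every value is covered.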

See also Series.str.get_dummies.
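As a brief illustration of the string method mentioned above, the sketch below assumes each element holds several labels joined by a '|' separator (the tags data is made up for this example). Series.str.get_dummies splits each string on the separator and builds one indicator column per distinct label.

```python
import pandas as pd

# Each element holds one or more labels joined by '|'.
tags = pd.Series(['a|b', 'a', 'b|c'])

# Split on the separator and create one indicator column per label.
tag_dummies = tags.str.get_dummies(sep='|')
print(tag_dummies)
```

Unlike pd.get_dummies, a single row here can have more than one indicator set.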

New in version 0.15.0.

get_dummies() also accepts a DataFrame. By default all categorical variables (categorical in the statistical sense, those with object or categorical dtype) are encoded as dummy variables.

In [11]: df = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['c', 'c', 'b'],
   ....:                    'C': [1, 2, 3]})
   ....: 

In [12]: df
Out[12]: 
   A  B  C
0  a  c  1
1  b  c  2
2  a  b  3

In [13]: pd.get_dummies(df)
Out[13]: 
   C  A_a  A_b  B_b  B_c
0  1  1.0  0.0  0.0  1.0
1  2  0.0  1.0  0.0  1.0
2  3  1.0  0.0  1.0  0.0

All non-object columns are included untouched in the output.

You can control the columns that are encoded with the columns keyword.

In [14]: pd.get_dummies(df, columns=['A'])
Out[14]: 
   B  C  A_a  A_b
0  c  1  1.0  0.0
1  c  2  0.0  1.0
2  b  3  1.0  0.0

Notice that the B column is still included in the output; it just hasn’t been encoded. You can drop B before calling get_dummies if you don’t want to include it in the output.
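One possible way to do this, sketched with the same example frame, is to call DataFrame.drop on the column first. The name encoded is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['c', 'c', 'b'],
                   'C': [1, 2, 3]})

# Remove B, then encode only A; C passes through unchanged.
encoded = pd.get_dummies(df.drop('B', axis=1), columns=['A'])
print(encoded)
```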

As with the Series version, you can pass values for the prefix and prefix_sep. By default the column name is used as the prefix, and ‘_’ as the prefix separator. You can specify prefix and prefix_sep in 3 ways:

  • string: Use the same value for prefix or prefix_sep for each column to be encoded.
  • list: Must be the same length as the number of columns being encoded.
  • dict: Mapping column name to prefix.

In [15]: simple = pd.get_dummies(df, prefix='new_prefix')

In [16]: simple
Out[16]: 
   C  new_prefix_a  new_prefix_b  new_prefix_b  new_prefix_c
0  1           1.0           0.0           0.0           1.0
1  2           0.0           1.0           0.0           1.0
2  3           1.0           0.0           1.0           0.0

In [17]: from_list = pd.get_dummies(df, prefix=['from_A', 'from_B'])

In [18]: from_list
Out[18]: 
   C  from_A_a  from_A_b  from_B_b  from_B_c
0  1       1.0       0.0       0.0       1.0
1  2       0.0       1.0       0.0       1.0
2  3       1.0       0.0       1.0       0.0

In [19]: from_dict = pd.get_dummies(df, prefix={'B': 'from_B', 'A': 'from_A'})

In [20]: from_dict
Out[20]: 
   C  from_A_a  from_A_b  from_B_b  from_B_c
0  1       1.0       0.0       0.0       1.0
1  2       0.0       1.0       0.0       1.0
2  3       1.0       0.0       1.0       0.0
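The examples above only vary prefix. As a small sketch of the prefix_sep keyword, the snippet below reuses the list form of prefix and swaps the default '_' separator for '.'; the name dotted is illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['c', 'c', 'b'],
                   'C': [1, 2, 3]})

# Join each prefix and level name with '.' instead of the default '_'.
dotted = pd.get_dummies(df, prefix=['from_A', 'from_B'], prefix_sep='.')
print(list(dotted.columns))
```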

New in version 0.18.0.

It is sometimes useful to keep only k-1 levels of a categorical variable to avoid collinearity when feeding the result to statistical models. You can switch to this mode by turning on drop_first.

In [21]: s = pd.Series(list('abcaa'))

In [22]: s
Out[22]: 
0    a
1    b
2    c
3    a
4    a
dtype: object

In [23]: pd.get_dummies(s)
Out[23]: 
     a    b    c
0  1.0  0.0  0.0
1  0.0  1.0  0.0
2  0.0  0.0  1.0
3  1.0  0.0  0.0
4  1.0  0.0  0.0

In [24]: pd.get_dummies(s, drop_first=True)
Out[24]: 
     b    c
0  0.0  0.0
1  1.0  0.0
2  0.0  1.0
3  0.0  0.0
4  0.0  0.0
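To make the collinearity concern concrete, here is a rough sketch that uses NumPy's matrix_rank (an addition for illustration, not part of get_dummies). It assumes the dummies are combined with an intercept column of ones, as in a typical linear model. With all k columns, every row of the dummies sums to 1, so the design matrix loses a rank; with drop_first it has full column rank.

```python
import numpy as np
import pandas as pd

s = pd.Series(list('abcaa'))

# Cast to float so the arithmetic is the same across pandas versions.
full = pd.get_dummies(s).astype(float)
reduced = pd.get_dummies(s, drop_first=True).astype(float)

# Each row of the full encoding sums to 1, duplicating an intercept column.
print(full.sum(axis=1).tolist())

# Build design matrices with an explicit intercept column.
intercept = np.ones((len(s), 1))
X_full = np.hstack([intercept, full.values])
X_reduced = np.hstack([intercept, reduced.values])

# Compare the rank of each matrix with its number of columns.
print(np.linalg.matrix_rank(X_full), X_full.shape[1])
print(np.linalg.matrix_rank(X_reduced), X_reduced.shape[1])
```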

When a column contains only one level, drop_first omits it from the result entirely.

In [25]: df = pd.DataFrame({'A':list('aaaaa'),'B':list('ababc')})

In [26]: df
Out[26]: 
   A  B
0  a  a
1  a  b
2  a  a
3  a  b
4  a  c

In [27]: pd.get_dummies(df)
Out[27]: 
   A_a  B_a  B_b  B_c
0  1.0  1.0  0.0  0.0
1  1.0  0.0  1.0  0.0
2  1.0  1.0  0.0  0.0
3  1.0  0.0  1.0  0.0
4  1.0  0.0  0.0  1.0

In [28]: pd.get_dummies(df, drop_first=True)
Out[28]: 
   B_b  B_c
0  0.0  0.0
1  1.0  0.0
2  0.0  0.0
3  1.0  0.0
4  0.0  1.0