7.8 Computing indicator / dummy variables
To convert a categorical variable into a “dummy” or “indicator” DataFrame, for example a column in a DataFrame (a Series) which has k distinct values, use get_dummies() to derive a DataFrame containing k columns of 1s and 0s:
In [1]: df = pd.DataFrame({'key': list('bbacab'), 'data1': range(6)})
In [2]: df
Out[2]:
data1 key
0 0 b
1 1 b
2 2 a
3 3 c
4 4 a
5 5 b
In [3]: pd.get_dummies(df['key'])
Out[3]:
a b c
0 0.0 1.0 0.0
1 0.0 1.0 0.0
2 1.0 0.0 0.0
3 0.0 0.0 1.0
4 1.0 0.0 0.0
5 0.0 1.0 0.0
Sometimes it’s useful to prefix the column names, for example when merging the result with the original DataFrame:
In [4]: dummies = pd.get_dummies(df['key'], prefix='key')
In [5]: dummies
Out[5]:
key_a key_b key_c
0 0.0 1.0 0.0
1 0.0 1.0 0.0
2 1.0 0.0 0.0
3 0.0 0.0 1.0
4 1.0 0.0 0.0
5 0.0 1.0 0.0
In [6]: df[['data1']].join(dummies)
Out[6]:
data1 key_a key_b key_c
0 0 0.0 1.0 0.0
1 1 0.0 1.0 0.0
2 2 1.0 0.0 0.0
3 3 0.0 0.0 1.0
4 4 1.0 0.0 0.0
5 5 0.0 1.0 0.0
This function is often used along with discretization functions like cut:
In [7]: values = np.random.randn(10)
In [8]: values
Out[8]:
array([ 0.4691, -0.2829, -1.5091, -1.1356, 1.2121, -0.1732, 0.1192,
-1.0442, -0.8618, -2.1046])
In [9]: bins = [0, 0.2, 0.4, 0.6, 0.8, 1]
In [10]: pd.get_dummies(pd.cut(values, bins))
Out[10]:
(0, 0.2] (0.2, 0.4] (0.4, 0.6] (0.6, 0.8] (0.8, 1]
0 0.0 0.0 1.0 0.0 0.0
1 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0
.. ... ... ... ... ...
6 1.0 0.0 0.0 0.0 0.0
7 0.0 0.0 0.0 0.0 0.0
8 0.0 0.0 0.0 0.0 0.0
9 0.0 0.0 0.0 0.0 0.0
[10 rows x 5 columns]
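Most rows in the output above are all zeros because many of the random values fall outside the bins, and cut leaves those values without a category. A minimal sketch of this behaviour follows; the hand-picked values are an assumption made here for reproducibility and are not part of the original session.

```python
import numpy as np
import pandas as pd

# Hand-picked values (illustrative only): -1.0 lies outside every bin.
values = np.array([0.1, 0.5, -1.0, 0.95])
bins = [0, 0.2, 0.4, 0.6, 0.8, 1]

# cut assigns each value to an interval, or to a missing category if none fits.
binned = pd.cut(values, bins)
dummies = pd.get_dummies(binned)

# Each in-range value has exactly one indicator set; out-of-range values have none.
row_sums = dummies.sum(axis=1)
print(dummies)
print(row_sums)
```

A row that sums to zero therefore signals a value that did not land in any bin.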
See also Series.str.get_dummies.
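As a brief illustration of the string method mentioned above, the sketch below splits each string on a separator and builds one indicator column per distinct token. The sample data is made up for this example.

```python
import pandas as pd

# Illustrative data: each entry holds zero or more '|'-separated labels.
s = pd.Series(['a|b', 'a', 'a|c'])

# One indicator column per distinct label; '|' is the default separator.
str_dummies = s.str.get_dummies(sep='|')
print(str_dummies)
```

Unlike pd.get_dummies, a single row can have several indicators set, because one string may contain several labels.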
New in version 0.15.0.
get_dummies() also accepts a DataFrame. By default, all categorical variables (categorical in the statistical sense: those with object or categorical dtype) are encoded as dummy variables.
In [11]: df = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['c', 'c', 'b'],
....: 'C': [1, 2, 3]})
....:
In [12]: df
Out[12]:
A B C
0 a c 1
1 b c 2
2 a b 3
In [13]: pd.get_dummies(df)
Out[13]:
C A_a A_b B_b B_c
0 1 1.0 0.0 0.0 1.0
1 2 0.0 1.0 0.0 1.0
2 3 1.0 0.0 1.0 0.0
All non-object columns are included untouched in the output. You can control which columns are encoded with the columns keyword.
In [14]: pd.get_dummies(df, columns=['A'])
Out[14]:
B C A_a A_b
0 c 1 1.0 0.0
1 c 2 0.0 1.0
2 b 3 1.0 0.0
Notice that the B column is still included in the output; it just hasn’t been encoded. You can drop B before calling get_dummies if you don’t want to include it in the output.
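One way to drop B first might look like the sketch below. It assumes a pandas version in which DataFrame.drop accepts the columns keyword; in older versions, df.drop('B', axis=1) serves the same purpose.

```python
import pandas as pd

df = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['c', 'c', 'b'],
                   'C': [1, 2, 3]})

# Remove B up front so that it is neither encoded nor passed through.
without_b = pd.get_dummies(df.drop(columns=['B']), columns=['A'])
print(without_b)
```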
As with the Series version, you can pass values for prefix and prefix_sep. By default the column name is used as the prefix, and ‘_’ as the prefix separator. You can specify prefix and prefix_sep in 3 ways:
- string: Use the same value for prefix or prefix_sep for each column to be encoded.
- list: Must be the same length as the number of columns being encoded.
- dict: Mapping column name to prefix.
In [15]: simple = pd.get_dummies(df, prefix='new_prefix')
In [16]: simple
Out[16]:
C new_prefix_a new_prefix_b new_prefix_b new_prefix_c
0 1 1.0 0.0 0.0 1.0
1 2 0.0 1.0 0.0 1.0
2 3 1.0 0.0 1.0 0.0
In [17]: from_list = pd.get_dummies(df, prefix=['from_A', 'from_B'])
In [18]: from_list
Out[18]:
C from_A_a from_A_b from_B_b from_B_c
0 1 1.0 0.0 0.0 1.0
1 2 0.0 1.0 0.0 1.0
2 3 1.0 0.0 1.0 0.0
In [19]: from_dict = pd.get_dummies(df, prefix={'B': 'from_B', 'A': 'from_A'})
In [20]: from_dict
Out[20]:
C from_A_a from_A_b from_B_b from_B_c
0 1 1.0 0.0 0.0 1.0
1 2 0.0 1.0 0.0 1.0
2 3 1.0 0.0 1.0 0.0
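The examples above vary only prefix. The following sketch shows prefix_sep as well, on the same three-column frame; the separator '.' and the prefixes 'x' and 'y' are arbitrary choices made for illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['c', 'c', 'b'],
                   'C': [1, 2, 3]})

# Keep the default prefixes (the column names) but change the separator.
dotted = pd.get_dummies(df, prefix_sep='.')

# Combine a list of prefixes with a custom separator.
renamed = pd.get_dummies(df, prefix=['x', 'y'], prefix_sep='.')

print(list(dotted.columns))
print(list(renamed.columns))
```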
New in version 0.18.0.
Sometimes it is useful to keep only k-1 levels of a categorical variable to avoid collinearity when feeding the result to statistical models. You can switch to this mode by turning on drop_first.
In [21]: s = pd.Series(list('abcaa'))
In [22]: s
Out[22]:
0 a
1 b
2 c
3 a
4 a
dtype: object
In [23]: pd.get_dummies(s)
Out[23]:
a b c
0 1.0 0.0 0.0
1 0.0 1.0 0.0
2 0.0 0.0 1.0
3 1.0 0.0 0.0
4 1.0 0.0 0.0
In [24]: pd.get_dummies(s, drop_first=True)
Out[24]:
b c
0 0.0 0.0
1 1.0 0.0
2 0.0 1.0
3 0.0 0.0
4 0.0 0.0
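To see why keeping all k levels causes collinearity, note that the k indicator columns always sum to 1 in every row, so they are linearly dependent on an intercept column. The sketch below checks this with a matrix rank computation; the helper with_intercept is a hypothetical name introduced here, not a pandas function.

```python
import numpy as np
import pandas as pd

s = pd.Series(list('abcaa'))

full = pd.get_dummies(s).astype(float)                      # k = 3 columns
reduced = pd.get_dummies(s, drop_first=True).astype(float)  # k - 1 = 2 columns

def with_intercept(dummies):
    """Hypothetical helper: prepend a column of ones, as a regression model would."""
    return np.column_stack([np.ones(len(dummies)), dummies.to_numpy()])

# 4 columns but rank 3: the design matrix is rank deficient.
rank_full = np.linalg.matrix_rank(with_intercept(full))

# 3 columns and rank 3: dropping the first level restores full column rank.
rank_reduced = np.linalg.matrix_rank(with_intercept(reduced))

print(rank_full, rank_reduced)
```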
When drop_first is turned on and a column contains only one level, that column is omitted from the result.
In [25]: df = pd.DataFrame({'A':list('aaaaa'),'B':list('ababc')})
In [26]: df
Out[26]:
A B
0 a a
1 a b
2 a a
3 a b
4 a c
In [27]: pd.get_dummies(df)
Out[27]:
A_a B_a B_b B_c
0 1.0 1.0 0.0 0.0
1 1.0 0.0 1.0 0.0
2 1.0 1.0 0.0 0.0
3 1.0 0.0 1.0 0.0
4 1.0 0.0 0.0 1.0
In [28]: pd.get_dummies(df, drop_first=True)
Out[28]:
B_b B_c
0 0.0 0.0
1 1.0 0.0
2 0.0 0.0
3 1.0 0.0
4 0.0 1.0