7.6 Cross tabulations
Use the crosstab
function to compute a cross-tabulation of two (or more)
factors. By default crosstab
computes a frequency table of the factors
unless an array of values and an aggregation function are passed.
It takes a number of arguments
index
: array-like, values to group by in the rowscolumns
: array-like, values to group by in the columnsvalues
: array-like, optional, array of values to aggregate according to the factorsaggfunc
: function, optional, If no values array is passed, computes a frequency tablerownames
: sequence, defaultNone
, must match number of row arrays passedcolnames
: sequence, defaultNone
, if passed, must match number of column arrays passedmargins
: boolean, defaultFalse
, Add row/column margins (subtotals)normalize
: boolean, {‘all’, ‘index’, ‘columns’}, or {0,1}, defaultFalse
. Normalize by dividing all values by the sum of values.
Any Series passed will have their name attributes used unless row or column names for the cross-tabulation are specified
For example:
In [1]: foo, bar, dull, shiny, one, two = 'foo', 'bar', 'dull', 'shiny', 'one', 'two'
In [2]: a = np.array([foo, foo, bar, bar, foo, foo], dtype=object)
In [3]: b = np.array([one, one, two, one, two, one], dtype=object)
In [4]: c = np.array([dull, dull, shiny, dull, dull, shiny], dtype=object)
In [5]: pd.crosstab(a, [b, c], rownames=['a'], colnames=['b', 'c'])
Out[5]:
b one two
c dull shiny dull shiny
a
bar 1 0 0 1
foo 2 1 1 0
If crosstab
receives only two Series, it will provide a frequency table.
In [6]: df = pd.DataFrame({'A': [1, 2, 2, 2, 2], 'B': [3, 3, 4, 4, 4],
...: 'C': [1, 1, np.nan, 1, 1]})
...:
In [7]: df
Out[7]:
A B C
0 1 3 1.0
1 2 3 1.0
2 2 4 NaN
3 2 4 1.0
4 2 4 1.0
In [8]: pd.crosstab(df.A, df.B)
Out[8]:
B 3 4
A
1 1 0
2 1 3
Any input passed containing Categorical
data will have all of its
categories included in the cross-tabulation, even if the actual data does
not contain any instances of a particular category.
In [9]: foo = pd.Categorical(['a', 'b'], categories=['a', 'b', 'c'])
In [10]: bar = pd.Categorical(['d', 'e'], categories=['d', 'e', 'f'])
In [11]: pd.crosstab(foo, bar)
Out[11]:
col_0 d e f
row_0
a 1 0 0
b 0 1 0
c 0 0 0
7.6.1 Normalization
New in version 0.18.1.
Frequency tables can also be normalized to show percentages rather than counts
using the normalize
argument:
In [12]: pd.crosstab(df.A, df.B, normalize=True)
Out[12]:
B 3 4
A
1 0.2 0.0
2 0.2 0.6
normalize
can also normalize values within each row or within each column:
In [13]: pd.crosstab(df.A, df.B, normalize='columns')
Out[13]:
B 3 4
A
1 0.5 0.0
2 0.5 1.0
crosstab
can also be passed a third Series and an aggregation function
(aggfunc
) that will be applied to the values of the third Series within each
group defined by the first two Series:
In [14]: pd.crosstab(df.A, df.B, values=df.C, aggfunc=np.sum)
Out[14]:
B 3 4
A
1 1.0 NaN
2 1.0 2.0
7.6.2 Adding Margins
Finally, one can also add margins or normalize this output.
In [15]: pd.crosstab(df.A, df.B, values=df.C, aggfunc=np.sum, normalize=True,
....: margins=True)
....:
Out[15]:
B 3 4 All
A
1 0.25 0.0 0.25
2 0.25 0.5 0.75
All 0.50 0.5 1.00