7.6 Cross tabulations

Use the crosstab function to compute a cross-tabulation of two (or more) factors. By default crosstab computes a frequency table of the factors unless an array of values and an aggregation function are passed.

It takes a number of arguments

index: array-like, values to group by in the rows
columns: array-like, values to group by in the columns
values: array-like, optional, array of values to aggregate according to the factors
aggfunc: function, optional, If no values array is passed, computes a frequency table
rownames: sequence, default None, must match number of row arrays passed
colnames: sequence, default None, if passed, must match number of column arrays passed
margins: boolean, default False, Add row/column margins (subtotals)
normalize: boolean, {‘all’, ‘index’, ‘columns’}, or {0,1}, default False. Normalize by dividing all values by the sum of values.

Any Series passed will have their name attributes used unless row or column names for the cross-tabulation are specified

For example:

In [1]: foo, bar, dull, shiny, one, two = 'foo', 'bar', 'dull', 'shiny', 'one', 'two'

In [2]: a = np.array([foo, foo, bar, bar, foo, foo], dtype=object)

In [3]: b = np.array([one, one, two, one, two, one], dtype=object)

In [4]: c = np.array([dull, dull, shiny, dull, dull, shiny], dtype=object)

In [5]: pd.crosstab(a, [b, c], rownames=['a'], colnames=['b', 'c'])
Out[5]: 
b    one        two      
c   dull shiny dull shiny
a                        
bar    1     0    0     1
foo    2     1    1     0

If crosstab receives only two Series, it will provide a frequency table.

In [6]: df = pd.DataFrame({'A': [1, 2, 2, 2, 2], 'B': [3, 3, 4, 4, 4],
   ...:                    'C': [1, 1, np.nan, 1, 1]})
   ...: 

In [7]: df
Out[7]: 
   A  B    C
0  1  3  1.0
1  2  3  1.0
2  2  4  NaN
3  2  4  1.0
4  2  4  1.0

In [8]: pd.crosstab(df.A, df.B)
Out[8]: 
B  3  4
A      
1  1  0
2  1  3

Any input passed containing Categorical data will have all of its categories included in the cross-tabulation, even if the actual data does not contain any instances of a particular category.

In [9]: foo = pd.Categorical(['a', 'b'], categories=['a', 'b', 'c'])

In [10]: bar = pd.Categorical(['d', 'e'], categories=['d', 'e', 'f'])

In [11]: pd.crosstab(foo, bar)
Out[11]: 
col_0  d  e  f
row_0         
a      1  0  0
b      0  1  0
c      0  0  0

7.6.1 Normalization

New in version 0.18.1.

Frequency tables can also be normalized to show percentages rather than counts using the normalize argument:

In [12]: pd.crosstab(df.A, df.B, normalize=True)
Out[12]: 
B    3    4
A          
1  0.2  0.0
2  0.2  0.6

normalize can also normalize values within each row or within each column:

In [13]: pd.crosstab(df.A, df.B, normalize='columns')
Out[13]: 
B    3    4
A          
1  0.5  0.0
2  0.5  1.0

crosstab can also be passed a third Series and an aggregation function (aggfunc) that will be applied to the values of the third Series within each group defined by the first two Series:

In [14]: pd.crosstab(df.A, df.B, values=df.C, aggfunc=np.sum)
Out[14]: 
B    3    4
A          
1  1.0  NaN
2  1.0  2.0

7.6.2 Adding Margins

Finally, one can also add margins or normalize this output.

In [15]: pd.crosstab(df.A, df.B, values=df.C, aggfunc=np.sum, normalize=True,
   ....:             margins=True)
   ....: 
Out[15]: 
B       3    4   All
A                   
1    0.25  0.0  0.25
2    0.25  0.5  0.75
All  0.50  0.5  1.00