5 Descriptive statistics

Series, DataFrame, and Panel provide a large number of methods for computing descriptive statistics and other related operations. Most of these are aggregations (hence producing a lower-dimensional result), like sum(), mean(), and quantile(), but some of them, like cumsum() and cumprod(), produce an object of the same size. Generally speaking, these methods take an axis argument, just like ndarray.{sum, std, ...}, but the axis can be specified by name or integer:

  • Series: no axis argument needed
  • DataFrame: “index” (axis=0, default), “columns” (axis=1)
  • Panel: “items” (axis=0), “major” (axis=1, default), “minor” (axis=2)

For example:

In [1]: df
Out[1]: 
        one     three       two
a -0.144420       NaN  1.670200
b  0.007400 -1.249974  1.433805
c  1.185509  1.342610 -1.803779
d       NaN  0.868571 -1.927031

In [2]: df.mean(0)
Out[2]: 
one      0.349496
three    0.320403
two     -0.156701
dtype: float64

In [3]: df.mean(1)
Out[3]: 
a    0.762890
b    0.063744
c    0.241447
d   -0.529230
dtype: float64

All such methods have a skipna option signaling whether to exclude missing data (True by default):

In [4]: df.sum(0, skipna=False)
Out[4]: 
one           NaN
three         NaN
two     -0.626805
dtype: float64

In [5]: df.sum(axis=1, skipna=True)
Out[5]: 
a    1.525780
b    0.191231
c    0.724340
d   -1.058460
dtype: float64

Combined with the broadcasting / arithmetic behavior, one can describe various statistical procedures, like standardization (rendering data zero mean and standard deviation 1), very concisely:

In [6]: ts_stand = (df - df.mean()) / df.std()

In [7]: ts_stand.std()
Out[7]: 
one      1.0
three    1.0
two      1.0
dtype: float64

In [8]: xs_stand = df.sub(df.mean(1), axis=0).div(df.std(1), axis=0)

In [9]: xs_stand.std(1)
Out[9]: 
a    1.0
b    1.0
c    1.0
d    1.0
dtype: float64

Note that methods like cumsum() and cumprod() preserve the location of NA values:

In [10]: df.cumsum()
Out[10]: 
        one     three       two
a -0.144420       NaN  1.670200
b -0.137020 -1.249974  3.104005
c  1.048488  0.092636  1.300226
d       NaN  0.961208 -0.626805

Here is a quick reference summary table of common functions. Each also takes an optional level parameter which applies only if the object has a hierarchical index.

Function    Description
count       Number of non-null observations
sum         Sum of values
mean        Mean of values
mad         Mean absolute deviation
median      Arithmetic median of values
min         Minimum
max         Maximum
mode        Mode
abs         Absolute value
prod        Product of values
std         Bessel-corrected sample standard deviation
var         Unbiased variance
sem         Standard error of the mean
skew        Sample skewness (3rd moment)
kurt        Sample kurtosis (4th moment)
quantile    Sample quantile (value at %)
cumsum      Cumulative sum
cumprod     Cumulative product
cummax      Cumulative maximum
cummin      Cumulative minimum
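
For example, on an object with a two-level MultiIndex, the level argument aggregates within one level of the index. A minimal sketch (the level names 'outer' and 'inner' and the values are made up for illustration):

import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_product([['x', 'y'], [1, 2]],
                                 names=['outer', 'inner'])
mdf = pd.DataFrame({'val': [1.0, 2.0, 3.0, 4.0]}, index=idx)

# Sum within each 'outer' group; the result is indexed by that level only
mdf.sum(level='outer')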

Note that some NumPy methods, like mean, std, and sum, will exclude NAs on Series input by default:

In [11]: np.mean(df['one'])
Out[11]: 0.34949612490375265

In [12]: np.mean(df['one'].values)
Out[12]: nan

Series also has a method nunique() which will return the number of unique non-null values:

In [13]: series = pd.Series(np.random.randn(500))

In [14]: series[20:500] = np.nan

In [15]: series[10:20] = 5

In [16]: series.nunique()
Out[16]: 11
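
By default nunique() ignores missing values; in pandas versions that support the dropna keyword, you can also count NaN as a distinct value. A quick sketch:

import numpy as np
import pandas as pd

s = pd.Series([1.0, 1.0, 2.0, np.nan])
s.nunique()              # 2 -- NaN is excluded
s.nunique(dropna=False)  # 3 -- NaN counted as a value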

5.1 Summarizing data: describe

There is a convenient describe() function which computes a variety of summary statistics about a Series or the columns of a DataFrame (excluding NAs of course):

In [17]: series = pd.Series(np.random.randn(1000))

In [18]: series[::2] = np.nan

In [19]: series.describe()
Out[19]: 
count    500.000000
mean      -0.054836
std        1.055274
min       -3.463789
25%       -0.716140
50%       -0.059378
75%        0.627730
max        3.400646
dtype: float64

In [20]: frame = pd.DataFrame(np.random.randn(1000, 5), columns=['a', 'b', 'c', 'd', 'e'])

In [21]: frame.iloc[::2] = np.nan

In [22]: frame.describe()
Out[22]: 
                a           b           c           d           e
count  500.000000  500.000000  500.000000  500.000000  500.000000
mean    -0.034552   -0.063424    0.018501   -0.007295    0.004320
std      1.015046    1.002699    1.051238    1.038535    1.006397
min     -2.915767   -3.294023   -3.610499   -2.825702   -3.010899
25%     -0.765493   -0.705856   -0.617814   -0.709969   -0.689996
50%     -0.079172   -0.050934    0.010791    0.018474    0.005807
75%      0.666677    0.611545    0.748653    0.724091    0.656439
max      2.820663    2.925597    3.416896    3.056469    3.007143

You can select specific percentiles to include in the output:

In [23]: series.describe(percentiles=[.05, .25, .75, .95])
Out[23]: 
count    500.000000
mean      -0.054836
std        1.055274
min       -3.463789
            ...    
50%       -0.059378
75%        0.627730
95%        1.641186
max        3.400646
dtype: float64

By default, the median is always included.

For a non-numerical Series object, describe() will give a simple summary of the number of unique values and most frequently occurring values:

In [24]: s = pd.Series(['a', 'a', 'b', 'b', 'a', 'a', np.nan, 'c', 'd', 'a'])

In [25]: s.describe()
Out[25]: 
count     9
unique    4
top       a
freq      5
dtype: object

Note that on a mixed-type DataFrame object, describe() will restrict the summary to include only numerical columns or, if there are none, only categorical columns:

In [26]: frame = pd.DataFrame({'a': ['Yes', 'Yes', 'No', 'No'], 'b': range(4)})

In [27]: frame.describe()
Out[27]: 
              b
count  4.000000
mean   1.500000
std    1.290994
min    0.000000
25%    0.750000
50%    1.500000
75%    2.250000
max    3.000000

This behaviour can be controlled by providing a list of types as the include/exclude arguments. The special value all can also be used:

In [28]: frame.describe(include=['object'])
Out[28]: 
         a
count    4
unique   2
top     No
freq     2

In [29]: frame.describe(include=['number'])
Out[29]: 
              b
count  4.000000
mean   1.500000
std    1.290994
min    0.000000
25%    0.750000
50%    1.500000
75%    2.250000
max    3.000000

In [30]: frame.describe(include='all')
Out[30]: 
          a     b
count     4  4.00
unique    2   NaN
top      No   NaN
freq      2   NaN
...     ...   ...
25%     NaN  0.75
50%     NaN  1.50
75%     NaN  2.25
max     NaN  3.00

[11 rows x 2 columns]

That feature relies on select_dtypes(). Refer to it for details about the accepted inputs.
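
As a rough sketch of how select_dtypes() is used on its own (the frame here is made up for illustration):

import pandas as pd

df = pd.DataFrame({'a': ['Yes', 'No'], 'b': [1, 2], 'c': [0.5, 1.5]})

# Keep only numeric columns -- the same type strings describe() accepts
df.select_dtypes(include=['number'])

# Or drop the object (string) columns instead
df.select_dtypes(exclude=['object'])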

5.2 Index of Min/Max Values

The idxmin() and idxmax() functions on Series and DataFrame compute the index labels with the minimum and maximum corresponding values:

In [31]: s1 = pd.Series(np.random.randn(5))

In [32]: s1
Out[32]: 
0   -0.250753
1    0.369150
2    0.916294
3   -0.618059
4   -0.804420
dtype: float64

In [33]: s1.idxmin(), s1.idxmax()
Out[33]: (4, 2)

In [34]: df1 = pd.DataFrame(np.random.randn(5,3), columns=['A','B','C'])

In [35]: df1
Out[35]: 
          A         B         C
0 -0.676993  2.007616 -0.297283
1 -0.297308  1.239574 -0.358859
2  0.122141  0.442947 -0.748188
3 -1.225176 -1.010445 -1.737071
4  0.032349 -0.782580  0.841748

In [36]: df1.idxmin(axis=0)
Out[36]: 
A    3
B    3
C    3
dtype: int64

In [37]: df1.idxmax(axis=1)
Out[37]: 
0    B
1    B
2    B
3    B
4    C
dtype: object

When there are multiple rows (or columns) matching the minimum or maximum value, idxmin() and idxmax() return the first matching index:

In [38]: df3 = pd.DataFrame([2, 1, 1, 3, np.nan], columns=['A'], index=list('edcba'))

In [39]: df3
Out[39]: 
     A
e  2.0
d  1.0
c  1.0
b  3.0
a  NaN

In [40]: df3['A'].idxmin()
Out[40]: 'd'

Note

idxmin and idxmax are called argmin and argmax in NumPy.
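
The NumPy functions operate on integer positions rather than index labels, which is the key difference. A small sketch (the series is made up for illustration):

import numpy as np
import pandas as pd

s = pd.Series([2.0, 1.0, 3.0], index=['x', 'y', 'z'])
s.idxmin()           # 'y' -- the index label of the minimum
np.argmin(s.values)  # 1   -- the integer position of the minimum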

5.3 Value counts (histogramming) / Mode

The value_counts() Series method and top-level function compute a histogram of a 1D array of values. It can also be used as a function on regular arrays:

In [41]: data = np.random.randint(0, 7, size=50)

In [42]: data
Out[42]: 
array([1, 4, 3, 1, 2, 3, 3, 3, 1, 4, 3, 0, 6, 6, 6, 0, 6, 1, 1, 6, 5, 5, 6,
       6, 0, 6, 1, 4, 2, 0, 2, 6, 0, 3, 0, 4, 0, 2, 0, 6, 3, 3, 6, 1, 0, 2,
       6, 2, 0, 1])

In [43]: s = pd.Series(data)

In [44]: s.value_counts()
Out[44]: 
6    12
0    10
3     8
1     8
2     6
4     4
5     2
dtype: int64

In [45]: pd.value_counts(data)
Out[45]: 
6    12
0    10
3     8
1     8
2     6
4     4
5     2
dtype: int64
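
value_counts() also accepts keywords that change what is counted; for instance, normalize returns relative frequencies and bins counts values within equal-width bins. A brief sketch using the Series and array above:

# Proportions instead of raw counts
s.value_counts(normalize=True)

# Discretize into 3 equal-width bins first, then count per bin
pd.value_counts(data, bins=3)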

Similarly, you can get the most frequently occurring value(s) (the mode) of the values in a Series or DataFrame:

In [46]: s5 = pd.Series([1, 1, 3, 3, 3, 5, 5, 7, 7, 7])

In [47]: s5.mode()
Out[47]: 
0    3
1    7
dtype: int64

In [48]: df5 = pd.DataFrame({"A": np.random.randint(0, 7, size=50),
   ....:                     "B": np.random.randint(-10, 15, size=50)})
   ....: 

In [49]: df5.mode()
Out[49]: 
   A   B
0  3  11
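
When the columns have different numbers of modes, DataFrame.mode() pads the shorter columns with NaN. A small sketch with made-up values:

df6 = pd.DataFrame({'A': [1, 1, 2], 'B': [3, 4, 5]})

# 'A' has a single mode; every value of 'B' ties, so all three are modes
df6.mode()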

5.4 Discretization and quantiling

Continuous values can be discretized using the cut() (bins based on values) and qcut() (bins based on sample quantiles) functions:

In [50]: arr = np.random.randn(20)

In [51]: factor = pd.cut(arr, 4)

In [52]: factor
Out[52]: 
[(-1.187, -0.351], (-0.351, 0.483], (1.316, 2.15], (-0.351, 0.483], (0.483, 1.316], ..., (0.483, 1.316], (0.483, 1.316], (-1.187, -0.351], (-0.351, 0.483], (-0.351, 0.483]]
Length: 20
Categories (4, object): [(-1.187, -0.351] < (-0.351, 0.483] < (0.483, 1.316] < (1.316, 2.15]]

In [53]: factor = pd.cut(arr, [-5, -1, 0, 1, 5])

In [54]: factor
Out[54]: 
[(-1, 0], (0, 1], (1, 5], (0, 1], (0, 1], ..., (1, 5], (1, 5], (-5, -1], (0, 1], (-1, 0]]
Length: 20
Categories (4, object): [(-5, -1] < (-1, 0] < (0, 1] < (1, 5]]
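
The bins can also be given human-readable names through the labels argument (a minimal sketch; the label names are made up):

pd.cut(arr, [-5, -1, 0, 1, 5],
       labels=['very low', 'low', 'high', 'very high'])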

qcut() computes sample quantiles. For example, we could slice up some normally distributed data into equal-size quartiles like so:

In [55]: arr = np.random.randn(30)

In [56]: factor = pd.qcut(arr, [0, .25, .5, .75, 1])

In [57]: factor
Out[57]: 
[[-2.282, -0.351], (0.76, 2.106], (0.76, 2.106], [-2.282, -0.351], [-2.282, -0.351], ..., (0.76, 2.106], (0.76, 2.106], (0.116, 0.76], [-2.282, -0.351], (-0.351, 0.116]]
Length: 30
Categories (4, object): [[-2.282, -0.351] < (-0.351, 0.116] < (0.116, 0.76] < (0.76, 2.106]]

In [58]: pd.value_counts(factor)
Out[58]: 
(0.76, 2.106]       8
[-2.282, -0.351]    8
(0.116, 0.76]       7
(-0.351, 0.116]     7
dtype: int64

We can also pass infinite values to define the bins:

In [59]: arr = np.random.randn(20)

In [60]: factor = pd.cut(arr, [-np.inf, 0, np.inf])

In [61]: factor
Out[61]: 
[(0, inf], (-inf, 0], (-inf, 0], (-inf, 0], (-inf, 0], ..., (0, inf], (0, inf], (0, inf], (-inf, 0], (0, inf]]
Length: 20
Categories (2, object): [(-inf, 0] < (0, inf]]
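
If you also need the computed bin edges, both cut() and qcut() accept retbins=True (a short sketch):

# Returns both the Categorical and the array of bin edges that were computed
factor, bins = pd.cut(arr, 4, retbins=True)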