5 Descriptive statistics
There exist a large number of methods for computing descriptive statistics and other related operations on Series, DataFrame, and Panel. Most of these are aggregations (hence producing a lower-dimensional result) like sum(), mean(), and quantile(), but some of them, like cumsum() and cumprod(), produce an object of the same size. Generally speaking, these methods take an axis argument, just like ndarray.{sum, std, ...}, but the axis can be specified by name or integer:
- Series: no axis argument needed
- DataFrame: “index” (axis=0, default), “columns” (axis=1)
- Panel: “items” (axis=0), “major” (axis=1, default), “minor” (axis=2)
For example:
In [1]: df
Out[1]:
one three two
a -0.144420 NaN 1.670200
b 0.007400 -1.249974 1.433805
c 1.185509 1.342610 -1.803779
d NaN 0.868571 -1.927031
In [2]: df.mean(0)
Out[2]:
one 0.349496
three 0.320403
two -0.156701
dtype: float64
In [3]: df.mean(1)
Out[3]:
a 0.762890
b 0.063744
c 0.241447
d -0.529230
dtype: float64
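The axis can equivalently be given by name; a minimal sketch using the df above:
df.mean(axis='index')    # same as df.mean(0): aggregate over the index
df.mean(axis='columns')  # same as df.mean(1): aggregate over the columns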
All such methods have a skipna option signaling whether to exclude missing data (True by default):
In [4]: df.sum(0, skipna=False)
Out[4]:
one NaN
three NaN
two -0.626805
dtype: float64
In [5]: df.sum(axis=1, skipna=True)
Out[5]:
a 1.525780
b 0.191231
c 0.724340
d -1.058460
dtype: float64
Combined with the broadcasting / arithmetic behavior, one can describe various statistical procedures, like standardization (rendering data zero mean and standard deviation 1), very concisely:
In [6]: ts_stand = (df - df.mean()) / df.std()
In [7]: ts_stand.std()
Out[7]:
one 1.0
three 1.0
two 1.0
dtype: float64
In [8]: xs_stand = df.sub(df.mean(1), axis=0).div(df.std(1), axis=0)
In [9]: xs_stand.std(1)
Out[9]:
a 1.0
b 1.0
c 1.0
d 1.0
dtype: float64
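As a quick sanity check on the standardization above, the column means should now be numerically zero as well:
ts_stand.mean()  # approximately 0 for each column, up to floating-point error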
Note that methods like cumsum() and cumprod() preserve the location of NA values:
In [10]: df.cumsum()
Out[10]:
one three two
a -0.144420 NaN 1.670200
b -0.137020 -1.249974 3.104005
c 1.048488 0.092636 1.300226
d NaN 0.961208 -0.626805
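For contrast, NumPy's cumulative functions propagate NaN rather than skipping it; a small sketch on the 'three' column above:
df['three'].cumsum()           # NA only at label 'a', as shown above
np.cumsum(df['three'].values)  # NaN from the first position onward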
Here is a quick reference summary table of common functions. Each also takes an optional level parameter which applies only if the object has a hierarchical index (a short sketch follows the table).
Function | Description
--- | ---
count | Number of non-null observations
sum | Sum of values
mean | Mean of values
mad | Mean absolute deviation
median | Arithmetic median of values
min | Minimum
max | Maximum
mode | Mode
abs | Absolute value
prod | Product of values
std | Bessel-corrected sample standard deviation
var | Unbiased variance
sem | Standard error of the mean
skew | Sample skewness (3rd moment)
kurt | Sample kurtosis (4th moment)
quantile | Sample quantile (value at %)
cumsum | Cumulative sum
cumprod | Cumulative product
cummax | Cumulative maximum
cummin | Cumulative minimum
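As a minimal sketch of the level parameter, here with made-up index and level names (in later pandas versions the same result is spelled with groupby):
idx = pd.MultiIndex.from_product([['one', 'two'], ['a', 'b']],
                                 names=['first', 'second'])
s = pd.Series(np.random.randn(4), index=idx)
s.sum(level='first')  # one sum per value of the 'first' level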
Note that by chance some NumPy methods, like mean, std, and sum, will exclude NAs on Series input by default:
In [11]: np.mean(df['one'])
Out[11]: 0.34949612490375265
In [12]: np.mean(df['one'].values)
Out[12]: nan
Series also has a method nunique() which will return the number of unique non-null values:
In [13]: series = pd.Series(np.random.randn(500))
In [14]: series[20:500] = np.nan
In [15]: series[10:20] = 5
In [16]: series.nunique()
Out[16]: 11
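Note the distinction from unique(), which returns the distinct values themselves and does include NaN; a brief sketch on the same series:
series.unique()   # array of 12 distinct values, NaN included
series.nunique()  # 11: NaN is not counted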
5.1 Summarizing data: describe
There is a convenient describe() function which computes a variety of summary statistics about a Series or the columns of a DataFrame (excluding NAs of course):
In [17]: series = pd.Series(np.random.randn(1000))
In [18]: series[::2] = np.nan
In [19]: series.describe()
Out[19]:
count 500.000000
mean -0.054836
std 1.055274
min -3.463789
25% -0.716140
50% -0.059378
75% 0.627730
max 3.400646
dtype: float64
In [20]: frame = pd.DataFrame(np.random.randn(1000, 5), columns=['a', 'b', 'c', 'd', 'e'])
In [21]: frame.iloc[::2] = np.nan
In [22]: frame.describe()
Out[22]:
a b c d e
count 500.000000 500.000000 500.000000 500.000000 500.000000
mean -0.034552 -0.063424 0.018501 -0.007295 0.004320
std 1.015046 1.002699 1.051238 1.038535 1.006397
min -2.915767 -3.294023 -3.610499 -2.825702 -3.010899
25% -0.765493 -0.705856 -0.617814 -0.709969 -0.689996
50% -0.079172 -0.050934 0.010791 0.018474 0.005807
75% 0.666677 0.611545 0.748653 0.724091 0.656439
max 2.820663 2.925597 3.416896 3.056469 3.007143
You can select specific percentiles to include in the output:
In [23]: series.describe(percentiles=[.05, .25, .75, .95])
Out[23]:
count 500.000000
mean -0.054836
std 1.055274
min -3.463789
...
50% -0.059378
75% 0.627730
95% 1.641186
max 3.400646
dtype: float64
By default, the median is always included.
For a non-numerical Series object, describe() will give a simple summary of the number of unique values and the most frequently occurring values:
In [24]: s = pd.Series(['a', 'a', 'b', 'b', 'a', 'a', np.nan, 'c', 'd', 'a'])
In [25]: s.describe()
Out[25]:
count 9
unique 4
top a
freq 5
dtype: object
Note that on a mixed-type DataFrame object, describe() will restrict the summary to include only numerical columns or, if none are, only categorical columns:
In [26]: frame = pd.DataFrame({'a': ['Yes', 'Yes', 'No', 'No'], 'b': range(4)})
In [27]: frame.describe()
Out[27]:
b
count 4.000000
mean 1.500000
std 1.290994
min 0.000000
25% 0.750000
50% 1.500000
75% 2.250000
max 3.000000
This behaviour can be controlled by providing a list of types as include/exclude arguments. The special value all can also be used:
In [28]: frame.describe(include=['object'])
Out[28]:
a
count 4
unique 2
top No
freq 2
In [29]: frame.describe(include=['number'])
Out[29]:
b
count 4.000000
mean 1.500000
std 1.290994
min 0.000000
25% 0.750000
50% 1.500000
75% 2.250000
max 3.000000
In [30]: frame.describe(include='all')
Out[30]:
a b
count 4 4.00
unique 2 NaN
top No NaN
freq 2 NaN
... ... ...
25% NaN 0.75
50% NaN 1.50
75% NaN 2.25
max NaN 3.00
[11 rows x 2 columns]
This feature relies on select_dtypes(). Refer to that section for details about the accepted inputs.
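For reference, the underlying selection looks something like this on the frame above:
frame.select_dtypes(include=['number'])  # just the numeric column 'b'
frame.select_dtypes(include=['object'])  # just the string column 'a'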
5.2 Index of Min/Max Values
The idxmin() and idxmax() functions on Series and DataFrame compute the index labels with the minimum and maximum corresponding values:
In [31]: s1 = pd.Series(np.random.randn(5))
In [32]: s1
Out[32]:
0 -0.250753
1 0.369150
2 0.916294
3 -0.618059
4 -0.804420
dtype: float64
In [33]: s1.idxmin(), s1.idxmax()
Out[33]: (4, 2)
In [34]: df1 = pd.DataFrame(np.random.randn(5,3), columns=['A','B','C'])
In [35]: df1
Out[35]:
A B C
0 -0.676993 2.007616 -0.297283
1 -0.297308 1.239574 -0.358859
2 0.122141 0.442947 -0.748188
3 -1.225176 -1.010445 -1.737071
4 0.032349 -0.782580 0.841748
In [36]: df1.idxmin(axis=0)
Out[36]:
A 3
B 3
C 3
dtype: int64
In [37]: df1.idxmax(axis=1)
Out[37]:
0 B
1 B
2 B
3 B
4 C
dtype: object
When there are multiple rows (or columns) matching the minimum or maximum value, idxmin() and idxmax() return the first matching index:
In [38]: df3 = pd.DataFrame([2, 1, 1, 3, np.nan], columns=['A'], index=list('edcba'))
In [39]: df3
Out[39]:
A
e 2.0
d 1.0
c 1.0
b 3.0
a NaN
In [40]: df3['A'].idxmin()
Out[40]: 'd'
Note: idxmin() and idxmax() are called argmin() and argmax() in NumPy.
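The NumPy functions return positions rather than index labels; a sketch using s1 from above, where the two coincide only because s1 carries the default integer index:
s1.values.argmin()  # 4: the position of the minimum
s1.idxmin()         # 4: the label, equal here since the index is 0..4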
5.3 Value counts (histogramming) / Mode
The value_counts() Series method and top-level function compute a histogram of a 1D array of values. The top-level function can also be applied to regular arrays:
In [41]: data = np.random.randint(0, 7, size=50)
In [42]: data
Out[42]:
array([1, 4, 3, 1, 2, 3, 3, 3, 1, 4, 3, 0, 6, 6, 6, 0, 6, 1, 1, 6, 5, 5, 6,
6, 0, 6, 1, 4, 2, 0, 2, 6, 0, 3, 0, 4, 0, 2, 0, 6, 3, 3, 6, 1, 0, 2,
6, 2, 0, 1])
In [43]: s = pd.Series(data)
In [44]: s.value_counts()
Out[44]:
6 12
0 10
3 8
1 8
2 6
4 4
5 2
dtype: int64
In [45]: pd.value_counts(data)
Out[45]:
6 12
0 10
3 8
1 8
2 6
4 4
5 2
dtype: int64
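value_counts() can also report relative frequencies via its normalize argument; a brief sketch:
s.value_counts(normalize=True)  # each count divided by len(s), summing to 1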
Similarly, you can get the most frequently occurring value(s) (the mode) of the values in a Series or DataFrame:
In [46]: s5 = pd.Series([1, 1, 3, 3, 3, 5, 5, 7, 7, 7])
In [47]: s5.mode()
Out[47]:
0 3
1 7
dtype: int64
In [48]: df5 = pd.DataFrame({"A": np.random.randint(0, 7, size=50),
....: "B": np.random.randint(-10, 15, size=50)})
....:
In [49]: df5.mode()
Out[49]:
A B
0 3 11
5.4 Discretization and quantiling
Continuous values can be discretized using the cut() (bins based on values) and qcut() (bins based on sample quantiles) functions:
In [50]: arr = np.random.randn(20)
In [51]: factor = pd.cut(arr, 4)
In [52]: factor
Out[52]:
[(-1.187, -0.351], (-0.351, 0.483], (1.316, 2.15], (-0.351, 0.483], (0.483, 1.316], ..., (0.483, 1.316], (0.483, 1.316], (-1.187, -0.351], (-0.351, 0.483], (-0.351, 0.483]]
Length: 20
Categories (4, object): [(-1.187, -0.351] < (-0.351, 0.483] < (0.483, 1.316] < (1.316, 2.15]]
In [53]: factor = pd.cut(arr, [-5, -1, 0, 1, 5])
In [54]: factor
Out[54]:
[(-1, 0], (0, 1], (1, 5], (0, 1], (0, 1], ..., (1, 5], (1, 5], (-5, -1], (0, 1], (-1, 0]]
Length: 20
Categories (4, object): [(-5, -1] < (-1, 0] < (0, 1] < (1, 5]]
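cut() also accepts a labels argument to name the resulting bins (the label strings here are made up):
pd.cut(arr, [-5, -1, 0, 1, 5], labels=['low', 'mid-neg', 'mid-pos', 'high'])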
qcut() computes sample quantiles. For example, we could slice up some normally distributed data into equal-size quartiles like so:
In [55]: arr = np.random.randn(30)
In [56]: factor = pd.qcut(arr, [0, .25, .5, .75, 1])
In [57]: factor
Out[57]:
[[-2.282, -0.351], (0.76, 2.106], (0.76, 2.106], [-2.282, -0.351], [-2.282, -0.351], ..., (0.76, 2.106], (0.76, 2.106], (0.116, 0.76], [-2.282, -0.351], (-0.351, 0.116]]
Length: 30
Categories (4, object): [[-2.282, -0.351] < (-0.351, 0.116] < (0.116, 0.76] < (0.76, 2.106]]
In [58]: pd.value_counts(factor)
Out[58]:
(0.76, 2.106] 8
[-2.282, -0.351] 8
(0.116, 0.76] 7
(-0.351, 0.116] 7
dtype: int64
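Passing an integer number of quantiles is equivalent to the evenly spaced list; the quartiles above could also be requested as:
pd.qcut(arr, 4)  # same quartile bins as [0, .25, .5, .75, 1]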
We can also pass infinite values to define the bins:
In [59]: arr = np.random.randn(20)
In [60]: factor = pd.cut(arr, [-np.inf, 0, np.inf])
In [61]: factor
Out[61]:
[(0, inf], (-inf, 0], (-inf, 0], (-inf, 0], (-inf, 0], ..., (0, inf], (0, inf], (0, inf], (-inf, 0], (0, inf]]
Length: 20
Categories (2, object): [(-inf, 0] < (0, inf]]
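If the computed bin edges themselves are needed, both cut() and qcut() accept retbins=True; a minimal sketch:
factor, bins = pd.cut(arr, [-np.inf, 0, np.inf], retbins=True)
bins  # array([-inf, 0., inf])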