5 Descriptive statistics
A large number of methods for computing descriptive statistics and other related
operations exist on Series, DataFrame, and Panel. Most of these
are aggregations (hence producing a lower-dimensional result) like
sum(), mean(), and quantile(),
but some of them, like cumsum() and cumprod(),
produce an object of the same size. Generally speaking, these methods take an
axis argument, just like ndarray.{sum, std, ...}, but the axis can be
specified by name or integer:
- Series: no axis argument needed
- DataFrame: “index” (axis=0, default), “columns” (axis=1)
- Panel: “items” (axis=0), “major” (axis=1, default), “minor” (axis=2)
For example:
In [1]: df
Out[1]:
one three two
a -0.144420 NaN 1.670200
b 0.007400 -1.249974 1.433805
c 1.185509 1.342610 -1.803779
d NaN 0.868571 -1.927031
In [2]: df.mean(0)
Out[2]:
one 0.349496
three 0.320403
two -0.156701
dtype: float64
In [3]: df.mean(1)
Out[3]:
a 0.762890
b 0.063744
c 0.241447
d -0.529230
dtype: float64
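Because the axis can be named, these calls can be written equivalently as:
df.mean(axis='index')    # same as df.mean(0): aggregate down each column
df.mean(axis='columns')  # same as df.mean(1): aggregate across each row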
All such methods have a skipna option signaling whether to exclude missing
data (True by default):
In [4]: df.sum(0, skipna=False)
Out[4]:
one NaN
three NaN
two -0.626805
dtype: float64
In [5]: df.sum(axis=1, skipna=True)
Out[5]:
a 1.525780
b 0.191231
c 0.724340
d -1.058460
dtype: float64
Combined with the broadcasting / arithmetic behavior, one can describe various statistical procedures, like standardization (rendering data zero mean and standard deviation 1), very concisely:
In [6]: ts_stand = (df - df.mean()) / df.std()
In [7]: ts_stand.std()
Out[7]:
one 1.0
three 1.0
two 1.0
dtype: float64
In [8]: xs_stand = df.sub(df.mean(1), axis=0).div(df.std(1), axis=0)
In [9]: xs_stand.std(1)
Out[9]:
a 1.0
b 1.0
c 1.0
d 1.0
dtype: float64
Note that methods like cumsum() and cumprod()
preserve the location of NA values:
In [10]: df.cumsum()
Out[10]:
one three two
a -0.144420 NaN 1.670200
b -0.137020 -1.249974 3.104005
c 1.048488 0.092636 1.300226
d NaN 0.961208 -0.626805
Here is a quick reference summary table of common functions. Each also takes an
optional level parameter which applies only if the object has a
hierarchical index.
| Function | Description |
|---|---|
| count | Number of non-null observations |
| sum | Sum of values |
| mean | Mean of values |
| mad | Mean absolute deviation |
| median | Arithmetic median of values |
| min | Minimum |
| max | Maximum |
| mode | Mode |
| abs | Absolute value |
| prod | Product of values |
| std | Bessel-corrected sample standard deviation |
| var | Unbiased variance |
| sem | Standard error of the mean |
| skew | Sample skewness (3rd moment) |
| kurt | Sample kurtosis (4th moment) |
| quantile | Sample quantile (value at %) |
| cumsum | Cumulative sum |
| cumprod | Cumulative product |
| cummax | Cumulative maximum |
| cummin | Cumulative minimum |
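Each of these reductions also honors the level parameter on a hierarchically indexed object. A minimal sketch (the index and values here are made up for illustration):
import pandas as pd

# a Series with a two-level (hierarchical) index
idx = pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1), ('b', 2)])
s = pd.Series([1.0, 2.0, 3.0, 4.0], index=idx)

# sum within each outer-level label: 'a' -> 3.0, 'b' -> 7.0
s.sum(level=0)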
Note that some NumPy methods, like mean, std, and sum,
will exclude NAs on Series input by default, whereas the same call on the
underlying NumPy array will not:
In [11]: np.mean(df['one'])
Out[11]: 0.34949612490375265
In [12]: np.mean(df['one'].values)
Out[12]: nan
Series also has a method nunique() which will return the
number of unique non-null values:
In [13]: series = pd.Series(np.random.randn(500))
In [14]: series[20:500] = np.nan
In [15]: series[10:20] = 5
In [16]: series.nunique()
Out[16]: 11
5.1 Summarizing data: describe
There is a convenient describe() function which computes a variety of summary
statistics about a Series or the columns of a DataFrame (excluding NAs of
course):
In [17]: series = pd.Series(np.random.randn(1000))
In [18]: series[::2] = np.nan
In [19]: series.describe()
Out[19]:
count 500.000000
mean -0.054836
std 1.055274
min -3.463789
25% -0.716140
50% -0.059378
75% 0.627730
max 3.400646
dtype: float64
In [20]: frame = pd.DataFrame(np.random.randn(1000, 5), columns=['a', 'b', 'c', 'd', 'e'])
In [21]: frame.iloc[::2] = np.nan
In [22]: frame.describe()
Out[22]:
a b c d e
count 500.000000 500.000000 500.000000 500.000000 500.000000
mean -0.034552 -0.063424 0.018501 -0.007295 0.004320
std 1.015046 1.002699 1.051238 1.038535 1.006397
min -2.915767 -3.294023 -3.610499 -2.825702 -3.010899
25% -0.765493 -0.705856 -0.617814 -0.709969 -0.689996
50% -0.079172 -0.050934 0.010791 0.018474 0.005807
75% 0.666677 0.611545 0.748653 0.724091 0.656439
max 2.820663 2.925597 3.416896 3.056469 3.007143
You can select specific percentiles to include in the output:
In [23]: series.describe(percentiles=[.05, .25, .75, .95])
Out[23]:
count 500.000000
mean -0.054836
std 1.055274
min -3.463789
...
50% -0.059378
75% 0.627730
95% 1.641186
max 3.400646
dtype: float64
By default, the median is always included.
For a non-numerical Series object, describe() will give a simple
summary of the number of unique values and most frequently occurring values:
In [24]: s = pd.Series(['a', 'a', 'b', 'b', 'a', 'a', np.nan, 'c', 'd', 'a'])
In [25]: s.describe()
Out[25]:
count 9
unique 4
top a
freq 5
dtype: object
Note that on a mixed-type DataFrame object, describe() will
restrict the summary to include only numerical columns or, if there are none,
only categorical columns:
In [26]: frame = pd.DataFrame({'a': ['Yes', 'Yes', 'No', 'No'], 'b': range(4)})
In [27]: frame.describe()
Out[27]:
b
count 4.000000
mean 1.500000
std 1.290994
min 0.000000
25% 0.750000
50% 1.500000
75% 2.250000
max 3.000000
This behaviour can be controlled by providing a list of types as include/exclude
arguments. The special value all can also be used:
In [28]: frame.describe(include=['object'])
Out[28]:
a
count 4
unique 2
top No
freq 2
In [29]: frame.describe(include=['number'])
Out[29]:
b
count 4.000000
mean 1.500000
std 1.290994
min 0.000000
25% 0.750000
50% 1.500000
75% 2.250000
max 3.000000
In [30]: frame.describe(include='all')
Out[30]:
a b
count 4 4.00
unique 2 NaN
top No NaN
freq 2 NaN
... ... ...
25% NaN 0.75
50% NaN 1.50
75% NaN 2.25
max NaN 3.00
[11 rows x 2 columns]
That feature relies on select_dtypes(); refer to it for details about accepted inputs.
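As a rough illustration, select_dtypes() accepts the same kind of include/exclude lists directly:
frame.select_dtypes(include=['number'])   # just the numeric column 'b'
frame.select_dtypes(include=['object'])   # just the object column 'a'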
5.2 Index of Min/Max Values
The idxmin() and idxmax() functions on Series
and DataFrame compute the index labels with the minimum and maximum
corresponding values:
In [31]: s1 = pd.Series(np.random.randn(5))
In [32]: s1
Out[32]:
0 -0.250753
1 0.369150
2 0.916294
3 -0.618059
4 -0.804420
dtype: float64
In [33]: s1.idxmin(), s1.idxmax()
Out[33]: (4, 2)
In [34]: df1 = pd.DataFrame(np.random.randn(5,3), columns=['A','B','C'])
In [35]: df1
Out[35]:
A B C
0 -0.676993 2.007616 -0.297283
1 -0.297308 1.239574 -0.358859
2 0.122141 0.442947 -0.748188
3 -1.225176 -1.010445 -1.737071
4 0.032349 -0.782580 0.841748
In [36]: df1.idxmin(axis=0)
Out[36]:
A 3
B 3
C 3
dtype: int64
In [37]: df1.idxmax(axis=1)
Out[37]:
0 B
1 B
2 B
3 B
4 C
dtype: object
When there are multiple rows (or columns) matching the minimum or maximum
value, idxmin() and idxmax() return the first
matching index:
In [38]: df3 = pd.DataFrame([2, 1, 1, 3, np.nan], columns=['A'], index=list('edcba'))
In [39]: df3
Out[39]:
A
e 2.0
d 1.0
c 1.0
b 3.0
a NaN
In [40]: df3['A'].idxmin()
Out[40]: 'd'
Note
idxmin and idxmax are called argmin and argmax in NumPy.
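The distinction matters: NumPy's argmin/argmax return integer positions, while idxmin/idxmax return index labels. A minimal sketch (the toy Series is made up for illustration):
import numpy as np
import pandas as pd

s = pd.Series([3.0, 1.0, 2.0], index=['x', 'y', 'z'])
np.argmin(s.values)  # 1 -- the integer *position* of the minimum
s.idxmin()           # 'y' -- the index *label* of the minimum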
5.3 Value counts (histogramming) / Mode
The value_counts() Series method and top-level function compute a histogram
of a 1D array of values. The function form can also be used on regular arrays:
In [41]: data = np.random.randint(0, 7, size=50)
In [42]: data
Out[42]:
array([1, 4, 3, 1, 2, 3, 3, 3, 1, 4, 3, 0, 6, 6, 6, 0, 6, 1, 1, 6, 5, 5, 6,
6, 0, 6, 1, 4, 2, 0, 2, 6, 0, 3, 0, 4, 0, 2, 0, 6, 3, 3, 6, 1, 0, 2,
6, 2, 0, 1])
In [43]: s = pd.Series(data)
In [44]: s.value_counts()
Out[44]:
6 12
0 10
3 8
1 8
2 6
4 4
5 2
dtype: int64
In [45]: pd.value_counts(data)
Out[45]:
6 12
0 10
3 8
1 8
2 6
4 4
5 2
dtype: int64
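If relative frequencies are preferred over raw counts, value_counts() also accepts a normalize flag; for example, continuing with s from above:
s.value_counts(normalize=True)  # each count divided by the total, summing to 1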
Similarly, you can get the most frequently occurring value(s) (the mode) of the values in a Series or DataFrame:
In [46]: s5 = pd.Series([1, 1, 3, 3, 3, 5, 5, 7, 7, 7])
In [47]: s5.mode()
Out[47]:
0 3
1 7
dtype: int64
In [48]: df5 = pd.DataFrame({"A": np.random.randint(0, 7, size=50),
....: "B": np.random.randint(-10, 15, size=50)})
....:
In [49]: df5.mode()
Out[49]:
A B
0 3 11
5.4 Discretization and quantiling
Continuous values can be discretized using the cut() (bins based on values)
and qcut() (bins based on sample quantiles) functions:
In [50]: arr = np.random.randn(20)
In [51]: factor = pd.cut(arr, 4)
In [52]: factor
Out[52]:
[(-1.187, -0.351], (-0.351, 0.483], (1.316, 2.15], (-0.351, 0.483], (0.483, 1.316], ..., (0.483, 1.316], (0.483, 1.316], (-1.187, -0.351], (-0.351, 0.483], (-0.351, 0.483]]
Length: 20
Categories (4, object): [(-1.187, -0.351] < (-0.351, 0.483] < (0.483, 1.316] < (1.316, 2.15]]
In [53]: factor = pd.cut(arr, [-5, -1, 0, 1, 5])
In [54]: factor
Out[54]:
[(-1, 0], (0, 1], (1, 5], (0, 1], (0, 1], ..., (1, 5], (1, 5], (-5, -1], (0, 1], (-1, 0]]
Length: 20
Categories (4, object): [(-5, -1] < (-1, 0] < (0, 1] < (1, 5]]
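When interval notation is not wanted, cut() also takes a labels argument naming the bins; a small sketch (the label names are made up):
pd.cut(arr, [-5, -1, 0, 1, 5], labels=['very low', 'low', 'high', 'very high'])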
qcut() computes sample quantiles. For example, we could slice up some
normally distributed data into equal-size quartiles like so:
In [55]: arr = np.random.randn(30)
In [56]: factor = pd.qcut(arr, [0, .25, .5, .75, 1])
In [57]: factor
Out[57]:
[[-2.282, -0.351], (0.76, 2.106], (0.76, 2.106], [-2.282, -0.351], [-2.282, -0.351], ..., (0.76, 2.106], (0.76, 2.106], (0.116, 0.76], [-2.282, -0.351], (-0.351, 0.116]]
Length: 30
Categories (4, object): [[-2.282, -0.351] < (-0.351, 0.116] < (0.116, 0.76] < (0.76, 2.106]]
In [58]: pd.value_counts(factor)
Out[58]:
(0.76, 2.106] 8
[-2.282, -0.351] 8
(0.116, 0.76] 7
(-0.351, 0.116] 7
dtype: int64
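Passing an integer to qcut() is shorthand for that list of evenly spaced quantile edges:
pd.qcut(arr, 4)  # same quartile bins as pd.qcut(arr, [0, .25, .5, .75, 1])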
We can also pass infinite values to define the bins:
In [59]: arr = np.random.randn(20)
In [60]: factor = pd.cut(arr, [-np.inf, 0, np.inf])
In [61]: factor
Out[61]:
[(0, inf], (-inf, 0], (-inf, 0], (-inf, 0], (-inf, 0], ..., (0, inf], (0, inf], (0, inf], (-inf, 0], (0, inf]]
Length: 20
Categories (2, object): [(-inf, 0] < (0, inf]]