3.1 Statistical Functions

3.1.1 Percent Change

Series, DataFrame, and Panel each have a pct_change method that computes the percent change over a given number of periods. Use fill_method to fill NA/null values before the percent change is computed.

In [1]: ser = pd.Series(np.random.randn(8))

In [2]: ser.pct_change()
Out[2]: 
0         NaN
1   -1.602976
2    4.334938
3   -0.247456
4   -2.067345
5   -1.142903
6   -1.688214
7   -9.759729
dtype: float64

In [3]: df = pd.DataFrame(np.random.randn(10, 4))

In [4]: df.pct_change(periods=3)
Out[4]: 
           0         1         2         3
0        NaN       NaN       NaN       NaN
1        NaN       NaN       NaN       NaN
2        NaN       NaN       NaN       NaN
3  -0.218320 -1.054001  1.987147 -0.510183
..       ...       ...       ...       ...
6  -2.596833 -1.959538 -2.111697 -3.798900
7  -0.117826 -2.169058  0.036094 -0.067696
8   2.492606 -1.357320 -1.205802 -1.558697
9  -1.012977  2.324558 -1.003744 -0.371806

[10 rows x 4 columns]
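As a rough illustration of what pct_change computes, the sketch below compares it with an explicit shift-and-divide on a small series. The prices values are made up for the example, and the series contains no missing values, so fill_method plays no role here.

```python
import numpy as np
import pandas as pd

# Hypothetical data chosen so the changes are easy to check by eye.
prices = pd.Series([100.0, 110.0, 99.0, 118.8])

# Percent change relative to the previous element.
one_period = prices.pct_change()
manual_one = prices / prices.shift(1) - 1

# Percent change relative to the element two positions back.
two_period = prices.pct_change(periods=2)
manual_two = prices / prices.shift(2) - 1

print(one_period)
print(two_period)
```

The first periods entries are NaN in both cases because there is no earlier value to compare against.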

3.1.2 Covariance

The Series object has a method cov to compute covariance between series (excluding NA/null values).

In [5]: s1 = pd.Series(np.random.randn(1000))

In [6]: s2 = pd.Series(np.random.randn(1000))

In [7]: s1.cov(s2)
Out[7]: 0.00068010881743108204

Analogously, DataFrame has a method cov to compute pairwise covariances among the series in the DataFrame, also excluding NA/null values.

Note

Assuming the missing data are missing at random, this yields an unbiased estimate of the covariance matrix. However, for many applications this estimate may not be acceptable, because the estimated covariance matrix is not guaranteed to be positive semi-definite. This can lead to estimated correlations with absolute values greater than one, and/or a non-invertible covariance matrix. See Estimation of covariance matrices for more details.

In [8]: frame = pd.DataFrame(np.random.randn(1000, 5), columns=['a', 'b', 'c', 'd', 'e'])

In [9]: frame.cov()
Out[9]: 
          a         b         c         d         e
a  1.000882 -0.003177 -0.002698 -0.006889  0.031912
b -0.003177  1.024721  0.000191  0.009212  0.000857
c -0.002698  0.000191  0.950735 -0.031743 -0.005087
d -0.006889  0.009212 -0.031743  1.002983 -0.047952
e  0.031912  0.000857 -0.005087 -0.047952  1.042487

DataFrame.cov also supports an optional min_periods keyword that specifies the required minimum number of observations for each column pair in order to have a valid result.

In [10]: frame = pd.DataFrame(np.random.randn(20, 3), columns=['a', 'b', 'c'])

In [11]: frame.loc[:5, 'a'] = np.nan

In [12]: frame.loc[5:10, 'b'] = np.nan

In [13]: frame.cov()
Out[13]: 
          a         b         c
a  1.210090 -0.430629  0.018002
b -0.430629  1.240960  0.347188
c  0.018002  0.347188  1.301149

In [14]: frame.cov(min_periods=12)
Out[14]: 
          a         b         c
a  1.210090       NaN  0.018002
b       NaN  1.240960  0.347188
c  0.018002  0.347188  1.301149
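To make the pairwise behaviour concrete, the following sketch rebuilds a frame with the same missing-data pattern and checks one entry of the covariance matrix by hand. The seed and the variable names (complete, n_pairs, and so on) are illustrative assumptions, not part of the original example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # arbitrary seed, for reproducibility only
frame = pd.DataFrame(rng.standard_normal((20, 3)), columns=["a", "b", "c"])
frame.loc[:5, "a"] = np.nan    # labels 0 through 5 (label slices are inclusive)
frame.loc[5:10, "b"] = np.nan  # labels 5 through 10

# Rows where both 'a' and 'b' are observed: labels 11 through 19.
complete = frame[["a", "b"]].dropna()
n_pairs = len(complete)

# The (a, b) entry of frame.cov() uses only those complete pairs.
pairwise = frame.cov().loc["a", "b"]
manual = complete["a"].cov(complete["b"])

# With min_periods=12, the 9 available (a, b) pairs are too few.
restricted = frame.cov(min_periods=12)

print(n_pairs, pairwise, manual)
print(restricted)
```

Only the (a, b) entry becomes NaN under min_periods=12: each of 'a' and 'b' still shares 14 observations with 'c' and with itself.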

3.1.3 Correlation

Several methods for computing correlations are provided:

Method name          Description
pearson (default)    Standard correlation coefficient
kendall              Kendall Tau correlation coefficient
spearman             Spearman rank correlation coefficient

All of these are currently computed using pairwise complete observations.

Note

Please see the caveats associated with this method of calculating correlation matrices in the covariance section.

In [15]: frame = pd.DataFrame(np.random.randn(1000, 5), columns=['a', 'b', 'c', 'd', 'e'])

In [16]: frame.loc[::2] = np.nan

# Series with Series
In [17]: frame['a'].corr(frame['b'])
Out[17]: 0.013479040400098794

In [18]: frame['a'].corr(frame['b'], method='spearman')
Out[18]: -0.0072898851595406371

# Pairwise correlation of DataFrame columns
In [19]: frame.corr()
Out[19]: 
          a         b         c         d         e
a  1.000000  0.013479 -0.049269 -0.042239 -0.028525
b  0.013479  1.000000 -0.020433 -0.011139  0.005654
c -0.049269 -0.020433  1.000000  0.018587 -0.054269
d -0.042239 -0.011139  0.018587  1.000000 -0.017060
e -0.028525  0.005654 -0.054269 -0.017060  1.000000

Note that non-numeric columns will be automatically excluded from the correlation calculation.
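One way to see how the spearman method relates to the default is that, for data without ties or missing values, it agrees with the Pearson coefficient computed on the ranks. The sketch below checks this on random data; the seed and variable names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)  # arbitrary seed
frame = pd.DataFrame(rng.standard_normal((200, 2)), columns=["a", "b"])

# Spearman rank correlation as reported by DataFrame.corr.
spearman = frame.corr(method="spearman").loc["a", "b"]

# Pearson correlation of the ranked columns.
pearson_of_ranks = frame["a"].rank().corr(frame["b"].rank())

print(spearman, pearson_of_ranks)
```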

Like cov, corr also supports the optional min_periods keyword:

In [20]: frame = pd.DataFrame(np.random.randn(20, 3), columns=['a', 'b', 'c'])

In [21]: frame.loc[:5, 'a'] = np.nan

In [22]: frame.loc[5:10, 'b'] = np.nan

In [23]: frame.corr()
Out[23]: 
          a         b         c
a  1.000000 -0.076520  0.160092
b -0.076520  1.000000  0.135967
c  0.160092  0.135967  1.000000

In [24]: frame.corr(min_periods=12)
Out[24]: 
          a         b         c
a  1.000000       NaN  0.160092
b       NaN  1.000000  0.135967
c  0.160092  0.135967  1.000000

A related method corrwith is implemented on DataFrame to compute the correlation between like-labeled Series contained in different DataFrame objects.

In [25]: index = ['a', 'b', 'c', 'd', 'e']

In [26]: columns = ['one', 'two', 'three', 'four']

In [27]: df1 = pd.DataFrame(np.random.randn(5, 4), index=index, columns=columns)

In [28]: df2 = pd.DataFrame(np.random.randn(4, 4), index=index[:4], columns=columns)

In [29]: df1.corrwith(df2)
Out[29]: 
one     -0.125501
two     -0.493244
three    0.344056
four     0.004183
dtype: float64

In [30]: df2.corrwith(df1, axis=1)
Out[30]: 
a   -0.675817
b    0.458296
c    0.190809
d   -0.186275
e         NaN
dtype: float64
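As a sketch of what "like-labeled" means here, the default column-wise corrwith can be reproduced by correlating each pair of same-named columns, which align on their shared row labels. The frames left and right and the seed are hypothetical stand-ins for df1 and df2.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed
columns = ["one", "two", "three"]
left = pd.DataFrame(rng.standard_normal((5, 3)), index=list("abcde"), columns=columns)
right = pd.DataFrame(rng.standard_normal((4, 3)), index=list("abcd"), columns=columns)

# Column-wise correlation between the two frames.
by_method = left.corrwith(right)

# The same result, one column at a time; Series.corr aligns on the
# row labels the two columns share ('a' through 'd').
by_hand = pd.Series({col: left[col].corr(right[col]) for col in columns})

print(by_method)
print(by_hand)
```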

3.1.4 Data ranking

The rank method produces a data ranking with ties being assigned the mean of the ranks (by default) for the group:

In [31]: s = pd.Series(np.random.randn(5), index=list('abcde'))

In [32]: s['d'] = s['b'] # so there's a tie

In [33]: s.rank()
Out[33]: 
a    5.0
b    2.5
c    1.0
d    2.5
e    4.0
dtype: float64

rank is also a DataFrame method and can rank either the rows (axis=0) or the columns (axis=1). NaN values are excluded from the ranking.

In [34]: df = pd.DataFrame(np.random.randn(10, 6))

In [35]: df[4] = df[2][:5] # some ties

In [36]: df
Out[36]: 
           0         1         2         3         4         5
0  -0.904948 -1.163537 -1.457187  0.135463 -1.457187  0.294650
1  -0.976288 -0.244652 -0.748406 -0.999601 -0.748406 -0.800809
2   0.401965  1.460840  1.256057  1.308127  1.256057  0.876004
3   0.205954  0.369552 -0.669304  0.038378 -0.669304  1.140296
..       ...       ...       ...       ...       ...       ...
6   0.376892  0.959292  0.095572 -0.593740       NaN -0.069180
7  -1.002601  1.957794 -0.120708  0.094214       NaN -1.467422
8  -0.547231  0.664402 -0.519424 -0.073254       NaN -1.263544
9  -0.250277 -0.237428 -1.056443  0.419477       NaN  1.375064

[10 rows x 6 columns]

In [37]: df.rank(axis=1)
Out[37]: 
      0    1    2    3    4    5
0   4.0  3.0  1.5  5.0  1.5  6.0
1   2.0  6.0  4.5  1.0  4.5  3.0
2   1.0  6.0  3.5  5.0  3.5  2.0
3   4.0  5.0  1.5  3.0  1.5  6.0
..  ...  ...  ...  ...  ...  ...
6   4.0  5.0  3.0  1.0  NaN  2.0
7   2.0  5.0  3.0  4.0  NaN  1.0
8   2.0  5.0  3.0  4.0  NaN  1.0
9   2.0  3.0  1.0  4.0  NaN  5.0

[10 rows x 6 columns]

rank optionally takes a parameter ascending, which is True by default; when False, the data is reverse-ranked, with larger values assigned a smaller rank.

rank supports different tie-breaking methods, specified with the method parameter:

  • average : average rank of tied group
  • min : lowest rank in the group
  • max : highest rank in the group
  • first : ranks assigned in the order they appear in the array
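The tie-breaking options listed above can be compared side by side on a small series with one tied pair. The scores values are made up for illustration.

```python
import pandas as pd

# 'a' and 'c' tie for the largest value.
scores = pd.Series([3, 1, 3, 2], index=list("abcd"))

# One column of ranks per tie-breaking method.
ranks = pd.DataFrame(
    {method: scores.rank(method=method) for method in ["average", "min", "max", "first"]}
)

# Reverse ranking: larger values receive smaller ranks.
descending = scores.rank(ascending=False)

print(ranks)
print(descending)
```

The tied values occupy ranks 3 and 4, so they receive 3.5 under average, 3 under min, 4 under max, and 3 then 4 in order of appearance under first.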