4.4 Calculations with missing data

Missing values propagate naturally through arithmetic operations between pandas objects.

In [1]: a
Out[1]: 
      one     two
a     NaN  0.4002
c     NaN  1.8676
e  0.9501 -0.1514
f  0.4106  0.1440
h  0.4106  0.1217

In [2]: b
Out[2]: 
      one     two   three
a     NaN  0.4002  0.9787
c     NaN  1.8676 -0.9773
e  0.9501 -0.1514 -0.1032
f  0.4106  0.1440  1.4543
h     NaN  0.1217  0.4439

In [3]: a + b
Out[3]: 
      one  three     two
a     NaN    NaN  0.8003
c     NaN    NaN  3.7351
e  1.9002    NaN -0.3027
f  0.8212    NaN  0.2881
h     NaN    NaN  0.2434
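
The result above combines two effects: labels are aligned first (so a column present in only one operand, such as three, comes out entirely NaN), and any NaN operand yields a NaN result. The following is a minimal sketch on a small made-up pair of frames, not the data shown above; it also illustrates the fill_value argument of DataFrame.add, which treats a value missing on one side as the given number.

```python
import numpy as np
import pandas as pd

# Illustrative frames (values are invented for this sketch).
a = pd.DataFrame(
    {"one": [np.nan, 1.0, 2.0], "two": [1.0, 2.0, 3.0]},
    index=list("abc"),
)
b = pd.DataFrame(
    {"one": [np.nan, 1.0, np.nan], "two": [1.0, 2.0, 3.0], "three": [4.0, 5.0, 6.0]},
    index=list("abc"),
)

# Plain addition: NaN in either operand gives NaN, and the column
# "three" (absent from a) is NaN everywhere.
total = a + b

# With fill_value=0, a value missing on only one side is treated as 0.
# Positions missing on both sides remain NaN.
filled = a.add(b, fill_value=0)

print(total)
print(filled)
```

Whether filling with zero is appropriate depends on what a missing value means in the data at hand; plain addition is the more conservative choice.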

The descriptive statistics and computational methods discussed in the data structure overview (and listed in the API reference) are all written to account for missing data. For example:

  • When summing data, NA (missing) values will be treated as zero
  • If the data are all NA, the result will be NA
  • Methods like cumsum and cumprod ignore NA values, but preserve them in the resulting arrays

In [4]: df
Out[4]: 
      one     two   three
a     NaN  0.4002  0.9787
c     NaN  1.8676 -0.9773
e  0.9501 -0.1514 -0.1032
f  0.4106  0.1440  1.4543
h     NaN  0.1217  0.4439

In [5]: df['one'].sum()
Out[5]: 1.3606869194639617

In [6]: df.mean(axis=1)
Out[6]: 
a    0.6894
c    0.4451
e    0.2318
f    0.6696
h    0.2828
dtype: float64

In [7]: df.cumsum()
Out[7]: 
      one     two   three
a     NaN  0.4002  0.9787
c     NaN  2.2677  0.0015
e  0.9501  2.1164 -0.1018
f  1.3607  2.2604  1.3525
h     NaN  2.3821  1.7964
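
The rules listed above can be checked on a small Series. This is a sketch with invented values, assuming a recent pandas release. One version detail that does not come from this section: from pandas 0.22 onward, the sum of an all-NA Series is 0 by default, and passing min_count=1 gives the NA result described in the list. The mean of all-NA data is NaN either way.

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 1.0, 2.0, np.nan], index=list("abcd"))
all_na = pd.Series([np.nan, np.nan])

# NA values are skipped by default, so they contribute nothing to the sum.
total = s.sum()

# skipna=False makes the missing values propagate instead.
strict = s.sum(skipna=False)

# cumsum skips NA when accumulating but keeps NA at the same positions.
running = s.cumsum()

# All-NA input: the mean is NaN; the sum is NaN when at least one
# valid value is required via min_count=1.
all_na_mean = all_na.mean()
all_na_sum = all_na.sum(min_count=1)

print(total, strict, all_na_mean, all_na_sum)
print(running)
```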

4.4.1 NA values in GroupBy

NA groups in GroupBy are automatically excluded, which is consistent with the behavior of R. For example:

In [8]: df
Out[8]: 
      one     two   three
a     NaN  0.4002  0.9787
c     NaN  1.8676 -0.9773
e  0.9501 -0.1514 -0.1032
f  0.4106  0.1440  1.4543
h     NaN  0.1217  0.4439

In [9]: df.groupby('one').mean()
Out[9]: 
           two   three
one                   
0.4106  0.1440  1.4543
0.9501 -0.1514 -0.1032

See the groupby section for more information.
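
As a small illustration of the exclusion rule, the sketch below groups on a key column that contains missing values. The frame and column names are invented for the example. The dropna=False argument used in the second call is an assumption about the installed version: it is available in pandas 1.1 and later and is not mentioned in this section.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "key": ["x", np.nan, "x", "y", np.nan],
        "value": [1.0, 2.0, 3.0, 4.0, 5.0],
    }
)

# Default behavior: rows whose key is NA do not form a group.
default_means = df.groupby("key")["value"].mean()

# Assumes pandas >= 1.1: keep the NA key as its own group.
with_na = df.groupby("key", dropna=False)["value"].mean()

print(default_means)
print(with_na)
```

Keeping the NA group can be useful for auditing how many rows would otherwise be dropped silently from an aggregation.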