4.4 Calculations with missing data

Missing values propagate naturally through arithmetic operations between pandas objects.

In [1]: a
Out[1]: 
      one     two
a     NaN  0.4002
c     NaN  1.8676
e  0.9501 -0.1514
f  0.4106  0.1440
h  0.4106  0.1217

In [2]: b
Out[2]: 
      one     two   three
a     NaN  0.4002  0.9787
c     NaN  1.8676 -0.9773
e  0.9501 -0.1514 -0.1032
f  0.4106  0.1440  1.4543
h     NaN  0.1217  0.4439

In [3]: a + b
Out[3]: 
      one  three     two
a     NaN    NaN  0.8003
c     NaN    NaN  3.7351
e  1.9002    NaN -0.3027
f  0.8212    NaN  0.2881
h     NaN    NaN  0.2434
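
The propagation rule can be reproduced on a small scale. The sketch below uses two hypothetical frames, `left` and `right`, whose values are illustrative and are not the ones printed above. It also shows `DataFrame.add` with `fill_value=0` as one possible way to treat a value that is missing on only one side as zero. That is an optional alternative, not something the example above relies on.

```python
import numpy as np
import pandas as pd

# Hypothetical frames for illustration; not the data shown above.
left = pd.DataFrame(
    {"one": [np.nan, 1.0, 2.0], "two": [0.5, 1.5, 2.5]},
    index=["a", "b", "c"],
)
right = pd.DataFrame(
    {"one": [1.0, np.nan, 3.0], "two": [0.5, 0.5, 0.5], "three": [9.0, 9.0, 9.0]},
    index=["a", "b", "c"],
)

# Plain addition: the result is NaN wherever either operand is NaN,
# and for every label present in only one frame (column "three").
total = left + right

# With fill_value=0, a value missing on one side only is treated as 0
# before adding; a value missing on both sides would still be NaN.
filled = left.add(right, fill_value=0)

print(total)
print(filled)
```

Here `total["three"]` is entirely NaN because `left` has no such column, while `filled["three"]` carries the values from `right` unchanged.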

The descriptive statistics and computational methods discussed in the data structure overview are all written to account for missing data. For example:

When summing data, NA (missing) values are treated as zero.
If the data are all NA, the result is NA.
Methods like cumsum and cumprod skip NA values in the running total, but preserve them at their original positions in the result.

In [4]: df
Out[4]: 
      one     two   three
a     NaN  0.4002  0.9787
c     NaN  1.8676 -0.9773
e  0.9501 -0.1514 -0.1032
f  0.4106  0.1440  1.4543
h     NaN  0.1217  0.4439

In [5]: df['one'].sum()
Out[5]: 1.3606869194639617

In [6]: df.mean(axis=1)
Out[6]: 
a    0.6894
c    0.4451
e    0.2318
f    0.6696
h    0.2828
dtype: float64

In [7]: df.cumsum()
Out[7]: 
      one     two   three
a     NaN  0.4002  0.9787
c     NaN  2.2677  0.0015
e  0.9501  2.1164 -0.1018
f  1.3607  2.2604  1.3525
h     NaN  2.3821  1.7964
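
The same rules can be checked on a standalone Series. This is a minimal sketch with made-up values; the names `s` and `all_na` are hypothetical. One caveat: the default result of summing all-NA data differs between pandas versions (from 0.22 onward the default is 0 rather than NA), so the sketch passes `min_count=1` to request an NA result explicitly instead of relying on the default.

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration.
s = pd.Series([np.nan, 1.0, 2.0, np.nan])
all_na = pd.Series([np.nan, np.nan])

# Default behaviour: NA values are skipped, so only 1.0 and 2.0 are summed.
skip_total = s.sum()

# skipna=False lets a single NA propagate to the result.
strict_total = s.sum(skipna=False)

# min_count=1 requires at least one non-NA value; with none present,
# the result is NaN regardless of the version's default.
all_na_total = all_na.sum(min_count=1)

# cumsum skips NA when accumulating but keeps NaN at the NA positions.
running = s.cumsum()

print(skip_total, strict_total, all_na_total)
print(running)
```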

4.4.1 NA values in GroupBy

NA groups in GroupBy are automatically excluded, which is consistent with the behavior of R. For example:

In [8]: df
Out[8]: 
      one     two   three
a     NaN  0.4002  0.9787
c     NaN  1.8676 -0.9773
e  0.9501 -0.1514 -0.1032
f  0.4106  0.1440  1.4543
h     NaN  0.1217  0.4439

In [9]: df.groupby('one').mean()
Out[9]: 
           two   three
one                   
0.4106  0.1440  1.4543
0.9501 -0.1514 -0.1032
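
The exclusion of NA keys can also be seen on a smaller, hypothetical frame. The sketch below assumes pandas 1.1 or later for its second half, where `groupby` accepts a `dropna` argument. Passing `dropna=False` is one way to keep the NA keys as their own group when that is wanted. The column names `key` and `value` are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two rows have a missing grouping key.
frame = pd.DataFrame(
    {
        "key": [1.0, np.nan, 1.0, 2.0, np.nan],
        "value": [1.0, 2.0, 3.0, 4.0, 5.0],
    }
)

# Default: rows whose key is NA are left out, giving groups 1.0 and 2.0 only.
default_means = frame.groupby("key")["value"].mean()

# Assumes pandas >= 1.1: dropna=False keeps the NA keys as a third group.
with_na = frame.groupby("key", dropna=False)["value"].mean()

print(default_means)
print(with_na)
```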

See the groupby section for more information.