4.4 Calculations with missing data
Missing values propagate naturally through arithmetic operations between pandas objects.
In [1]: a
Out[1]:
      one     two
a     NaN  0.4002
c     NaN  1.8676
e  0.9501 -0.1514
f  0.4106  0.1440
h  0.4106  0.1217
In [2]: b
Out[2]:
      one     two   three
a     NaN  0.4002  0.9787
c     NaN  1.8676 -0.9773
e  0.9501 -0.1514 -0.1032
f  0.4106  0.1440  1.4543
h     NaN  0.1217  0.4439
In [3]: a + b
Out[3]:
      one  three     two
a     NaN    NaN  0.8003
c     NaN    NaN  3.7351
e  1.9002    NaN -0.3027
f  0.8212    NaN  0.2881
h     NaN    NaN  0.2434
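The propagation rule can be reproduced with a small self-contained sketch. The frames below use made-up values (they are not the data shown above) and are only meant to illustrate two cases: a column present in one operand only, and a cell that is missing in either operand.

```python
import numpy as np
import pandas as pd

# Hypothetical data, loosely modelled on the frames above.
a = pd.DataFrame(
    {"one": [np.nan, 1.0, 2.0], "two": [0.5, 1.5, 2.5]},
    index=["a", "c", "e"],
)
b = pd.DataFrame(
    {
        "one": [np.nan, 1.0, np.nan],
        "two": [0.5, 1.5, 2.5],
        "three": [1.0, 2.0, 3.0],
    },
    index=["a", "c", "e"],
)

# Row and column labels are aligned first, then values are added.
total = a + b

# "three" exists only in b, so every entry in that column is missing.
all_three_missing = bool(total["three"].isna().all())

# A missing operand on either side gives a missing result.
one_missing = total["one"].isna().tolist()

print(total)
```

Only cells where both operands hold a value produce a number; everything else comes out as NaN.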
The descriptive statistics and computational methods discussed in the data structure overview are all written to account for missing data. For example:
- When summing data, NA (missing) values will be treated as zero
- If the data are all NA, the result will be NA
- Methods like cumsum and cumprod ignore NA values, but preserve them in the resulting arrays
In [4]: df
Out[4]:
      one     two   three
a     NaN  0.4002  0.9787
c     NaN  1.8676 -0.9773
e  0.9501 -0.1514 -0.1032
f  0.4106  0.1440  1.4543
h     NaN  0.1217  0.4439
In [5]: df['one'].sum()
Out[5]: 1.3606869194639617
In [6]: df.mean(1)
Out[6]:
a 0.6894
c 0.4451
e 0.2318
f 0.6696
h 0.2828
dtype: float64
In [7]: df.cumsum()
Out[7]:
      one     two   three
a     NaN  0.4002  0.9787
c     NaN  2.2677  0.0015
e  0.9501  2.1164 -0.1018
f  1.3607  2.2604  1.3525
h     NaN  2.3821  1.7964
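The rules above can be checked on a small Series. This is a minimal sketch with made-up values. One assumption to note: the result of summing all-NA data has differed between pandas versions, so the sketch passes min_count=1 to ask explicitly for an NA result when no valid value is present. The skipna=False call is included only for contrast with the default behavior.

```python
import numpy as np
import pandas as pd

# Hypothetical values; the index mirrors the labels used above.
s = pd.Series([np.nan, np.nan, 0.5, 1.5, np.nan], index=list("acefh"))

# Default: NA entries are skipped, so this is 0.5 + 1.5.
total = s.sum()

# For contrast: with skipna=False a single NA makes the result NA.
strict_total = s.sum(skipna=False)

# cumsum skips NA when accumulating but keeps NA at the same positions.
running = s.cumsum()

# All-NA input. min_count=1 requires at least one valid value,
# otherwise the result is NA (assumption: the default is version-dependent).
all_missing = pd.Series([np.nan, np.nan])
all_missing_total = all_missing.sum(min_count=1)

print(total, strict_total, all_missing_total)
print(running)
```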
4.4.1 NA values in GroupBy
NA groups in GroupBy are automatically excluded. This behavior is consistent with R. For example:
In [8]: df
Out[8]:
      one     two   three
a     NaN  0.4002  0.9787
c     NaN  1.8676 -0.9773
e  0.9501 -0.1514 -0.1032
f  0.4106  0.1440  1.4543
h     NaN  0.1217  0.4439
In [9]: df.groupby('one').mean()
Out[9]:
           two   three
one
0.4106  0.1440  1.4543
0.9501 -0.1514 -0.1032
See the groupby section for more information.
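The exclusion of NA groups can be seen in a small self-contained sketch. The frame and the column name "key" are made up for illustration. Rows whose grouping value is NA do not form a group and do not contribute to any result.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: three of the five rows have a missing grouping key.
df = pd.DataFrame(
    {
        "key": [np.nan, np.nan, 0.9, 0.4, np.nan],
        "two": [0.4, 1.9, -0.2, 0.1, 0.1],
    },
    index=list("acefh"),
)

# Rows with a NA key are left out; only the two valid keys remain.
means = df.groupby("key").mean()
group_keys = means.index.tolist()

print(means)
```

Here only the rows labelled e and f end up in a group, so the result has two rows.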