4 Flexible binary operations

With binary operations between pandas data structures, there are two key points of interest:

  • Broadcasting behavior between higher- (e.g. DataFrame) and lower-dimensional (e.g. Series) objects.
  • Missing data in computations

We will demonstrate how to manage these issues independently, though they can be handled simultaneously.

4.1 Matching / broadcasting behavior

DataFrame has the methods add(), sub(), mul(), div() and related functions radd(), rsub(), ... for carrying out binary operations. For broadcasting behavior, Series input is of primary interest. Using these functions, you can use to either match on the index or columns via the axis keyword:

In [1]: df = pd.DataFrame({'one' : pd.Series(np.random.randn(3), index=['a', 'b', 'c']),
   ...:                    'two' : pd.Series(np.random.randn(4), index=['a', 'b', 'c', 'd']),
   ...:                    'three' : pd.Series(np.random.randn(3), index=['b', 'c', 'd'])})
   ...: 

In [2]: df
Out[2]: 
        one     three       two
a -0.539499       NaN  0.638015
b  0.599181  0.348593  2.283162
c  1.217597 -0.090397 -0.940880
d       NaN -0.066985  0.373389

In [3]: row = df.ix[1]

In [4]: column = df['two']

In [5]: df.sub(row, axis='columns')
Out[5]: 
        one     three       two
a -1.138679       NaN -1.645146
b  0.000000  0.000000  0.000000
c  0.618416 -0.438990 -3.224041
d       NaN -0.415579 -1.909772

In [6]: df.sub(row, axis=1)
Out[6]: 
        one     three       two
a -1.138679       NaN -1.645146
b  0.000000  0.000000  0.000000
c  0.618416 -0.438990 -3.224041
d       NaN -0.415579 -1.909772

In [7]: df.sub(column, axis='index')
Out[7]: 
        one     three  two
a -1.177514       NaN  0.0
b -1.683981 -1.934568  0.0
c  2.158476  0.850483  0.0
d       NaN -0.440375  0.0

In [8]: df.sub(column, axis=0)
Out[8]: 
        one     three  two
a -1.177514       NaN  0.0
b -1.683981 -1.934568  0.0
c  2.158476  0.850483  0.0
d       NaN -0.440375  0.0

Furthermore you can align a level of a multi-indexed DataFrame with a Series.

In [9]: dfmi = df.copy()

In [10]: dfmi.index = pd.MultiIndex.from_tuples([(1,'a'),(1,'b'),(1,'c'),(2,'a')],
   ....:                                        names=['first','second'])
   ....: 

In [11]: dfmi.sub(column, axis=0, level='second')
Out[11]: 
                   one     three       two
first second                              
1     a      -1.177514       NaN  0.000000
      b      -1.683981 -1.934568  0.000000
      c       2.158476  0.850483  0.000000
2     a            NaN -0.705001 -0.264626

With Panel, describing the matching behavior is a bit more difficult, so the arithmetic methods instead (and perhaps confusingly?) give you the option to specify the broadcast axis. For example, suppose we wished to demean the data over a particular axis. This can be accomplished by taking the mean over an axis and broadcasting over the same axis:

In [12]: major_mean = wp.mean(axis='major')

In [13]: major_mean
Out[13]: 
      Item1     Item2
A -0.698246 -0.266135
B  0.494307  0.674535
C  0.126692 -0.296539
D  0.593198 -0.132742

In [14]: wp.sub(major_mean, axis='major')
Out[14]: 
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 5 (major_axis) x 4 (minor_axis)
Items axis: Item1 to Item2
Major_axis axis: 2000-01-01 00:00:00 to 2000-01-05 00:00:00
Minor_axis axis: A to D

And similarly for axis="items" and axis="minor".

Note

I could be convinced to make the axis argument in the DataFrame methods match the broadcasting behavior of Panel. Though it would require a transition period so users can change their code...

4.2 Missing data / operations with fill values

In Series and DataFrame (though not yet in Panel), the arithmetic functions have the option of inputting a fill_value, namely a value to substitute when at most one of the values at a location are missing. For example, when adding two DataFrame objects, you may wish to treat NaN as 0 unless both DataFrames are missing that value, in which case the result will be NaN (you can later replace NaN with some other value using fillna if you wish).

In [15]: df
Out[15]: 
        one     three       two
a -0.539499       NaN  0.638015
b  0.599181  0.348593  2.283162
c  1.217597 -0.090397 -0.940880
d       NaN -0.066985  0.373389

In [16]: df2
Out[16]: 
        one     three       two
a -0.539499  1.000000  0.638015
b  0.599181  0.348593  2.283162
c  1.217597 -0.090397 -0.940880
d       NaN -0.066985  0.373389

In [17]: df + df2
Out[17]: 
        one     three       two
a -1.078997       NaN  1.276030
b  1.198361  0.697187  4.566323
c  2.435193 -0.180794 -1.881759
d       NaN -0.133971  0.746779

In [18]: df.add(df2, fill_value=0)
Out[18]: 
        one     three       two
a -1.078997  1.000000  1.276030
b  1.198361  0.697187  4.566323
c  2.435193 -0.180794 -1.881759
d       NaN -0.133971  0.746779

4.3 Flexible Comparisons

Starting in v0.8, pandas introduced binary comparison methods eq, ne, lt, gt, le, and ge to Series and DataFrame whose behavior is analogous to the binary arithmetic operations described above:

In [19]: df.gt(df2)
Out[19]: 
     one  three    two
a  False  False  False
b  False  False  False
c  False  False  False
d  False  False  False

In [20]: df2.ne(df)
Out[20]: 
     one  three    two
a  False   True  False
b  False  False  False
c  False  False  False
d   True  False  False

These operations produce a pandas object the same type as the left-hand-side input that if of dtype bool. These boolean objects can be used in indexing operations, see here

4.4 Boolean Reductions

You can apply the reductions: empty, any(), all(), and bool() to provide a way to summarize a boolean result.

In [21]: (df > 0).all()
Out[21]: 
one      False
three    False
two      False
dtype: bool

In [22]: (df > 0).any()
Out[22]: 
one      True
three    True
two      True
dtype: bool

You can reduce to a final boolean value.

In [23]: (df > 0).any().any()
Out[23]: True

You can test if a pandas object is empty, via the empty property.

In [24]: df.empty
Out[24]: False

In [25]: pd.DataFrame(columns=list('ABC')).empty
Out[25]: True

To evaluate single-element pandas objects in a boolean context, use the method bool():

In [26]: pd.Series([True]).bool()
Out[26]: True

In [27]: pd.Series([False]).bool()
Out[27]: False

In [28]: pd.DataFrame([[True]]).bool()
Out[28]: True

In [29]: pd.DataFrame([[False]]).bool()
Out[29]: False

Warning

You might be tempted to do the following:

>>> if df:
     ...

Or

>>> df and df2

These both will raise as you are trying to compare multiple values.

ValueError: The truth value of an array is ambiguous. Use a.empty, a.any() or a.all().

See gotchas for a more detailed discussion.

4.5 Comparing if objects are equivalent

Often you may find there is more than one way to compute the same result. As a simple example, consider df+df and df*2. To test that these two computations produce the same result, given the tools shown above, you might imagine using (df+df == df*2).all(). But in fact, this expression is False:

In [30]: df+df == df*2
Out[30]: 
     one  three   two
a   True  False  True
b   True   True  True
c   True   True  True
d  False   True  True

In [31]: (df+df == df*2).all()
Out[31]: 
one      False
three    False
two       True
dtype: bool

Notice that the boolean DataFrame df+df == df*2 contains some False values! That is because NaNs do not compare as equals:

In [32]: np.nan == np.nan
Out[32]: False

So, as of v0.13.1, NDFrames (such as Series, DataFrames, and Panels) have an equals() method for testing equality, with NaNs in corresponding locations treated as equal.

In [33]: (df+df).equals(df*2)
Out[33]: True

Note that the Series or DataFrame index needs to be in the same order for equality to be True:

In [34]: df1 = pd.DataFrame({'col':['foo', 0, np.nan]})

In [35]: df2 = pd.DataFrame({'col':[np.nan, 0, 'foo']}, index=[2,1,0])

In [36]: df1.equals(df2)
Out[36]: False

In [37]: df1.equals(df2.sort_index())
Out[37]: True

4.6 Comparing array-like objects

You can conveniently do element-wise comparisons when comparing a pandas data structure with a scalar value:

In [38]: pd.Series(['foo', 'bar', 'baz']) == 'foo'
Out[38]: 
0     True
1    False
2    False
dtype: bool

In [39]: pd.Index(['foo', 'bar', 'baz']) == 'foo'
Out[39]: array([ True, False, False], dtype=bool)

Pandas also handles element-wise comparisons between different array-like objects of the same length:

In [40]: pd.Series(['foo', 'bar', 'baz']) == pd.Index(['foo', 'bar', 'qux'])
Out[40]: 
0     True
1     True
2    False
dtype: bool

In [41]: pd.Series(['foo', 'bar', 'baz']) == np.array(['foo', 'bar', 'qux'])
Out[41]: 
0     True
1     True
2    False
dtype: bool

Trying to compare Index or Series objects of different lengths will raise a ValueError:

In [55]: pd.Series(['foo', 'bar', 'baz']) == pd.Series(['foo', 'bar'])
ValueError: Series lengths must match to compare

In [56]: pd.Series(['foo', 'bar', 'baz']) == pd.Series(['foo'])
ValueError: Series lengths must match to compare

Note that this is different from the numpy behavior where a comparison can be broadcast:

In [42]: np.array([1, 2, 3]) == np.array([2])
Out[42]: array([False,  True, False], dtype=bool)

or it can return False if broadcasting can not be done:

In [43]: np.array([1, 2, 3]) == np.array([1, 2])
Out[43]: False

4.7 Combining overlapping data sets

A problem occasionally arising is the combination of two similar data sets where values in one are preferred over the other. An example would be two data series representing a particular economic indicator where one is considered to be of “higher quality”. However, the lower quality series might extend further back in history or have more complete data coverage. As such, we would like to combine two DataFrame objects where missing values in one DataFrame are conditionally filled with like-labeled values from the other DataFrame. The function implementing this operation is combine_first(), which we illustrate:

In [44]: df1 = pd.DataFrame({'A' : [1., np.nan, 3., 5., np.nan],
   ....:                     'B' : [np.nan, 2., 3., np.nan, 6.]})
   ....: 

In [45]: df2 = pd.DataFrame({'A' : [5., 2., 4., np.nan, 3., 7.],
   ....:                     'B' : [np.nan, np.nan, 3., 4., 6., 8.]})
   ....: 

In [46]: df1
Out[46]: 
     A    B
0  1.0  NaN
1  NaN  2.0
2  3.0  3.0
3  5.0  NaN
4  NaN  6.0

In [47]: df2
Out[47]: 
     A    B
0  5.0  NaN
1  2.0  NaN
2  4.0  3.0
3  NaN  4.0
4  3.0  6.0
5  7.0  8.0

In [48]: df1.combine_first(df2)
Out[48]: 
     A    B
0  1.0  NaN
1  2.0  2.0
2  3.0  3.0
3  5.0  4.0
4  3.0  6.0
5  7.0  8.0

4.8 General DataFrame Combine

The combine_first() method above calls the more general DataFrame method combine(). This method takes another DataFrame and a combiner function, aligns the input DataFrame and then passes the combiner function pairs of Series (i.e., columns whose names are the same).

So, for instance, to reproduce combine_first() as above:

In [49]: combiner = lambda x, y: np.where(pd.isnull(x), y, x)

In [50]: df1.combine(df2, combiner)
Out[50]: 
     A    B
0  1.0  NaN
1  2.0  2.0
2  3.0  3.0
3  5.0  4.0
4  3.0  6.0
5  7.0  8.0