4 Flexible binary operations
With binary operations between pandas data structures, there are two key points of interest:
- Broadcasting behavior between higher- (e.g. DataFrame) and lower-dimensional (e.g. Series) objects.
- Missing data in computations
We will demonstrate how to manage these issues independently, though they can be handled simultaneously.
4.1 Matching / broadcasting behavior
DataFrame has the methods add()
, sub()
,
mul()
, div()
and related functions
radd()
, rsub()
, ...
for carrying out binary operations. For broadcasting behavior,
Series input is of primary interest. Using these functions, you can use to
either match on the index or columns via the axis keyword:
In [1]: df = pd.DataFrame({'one' : pd.Series(np.random.randn(3), index=['a', 'b', 'c']),
...: 'two' : pd.Series(np.random.randn(4), index=['a', 'b', 'c', 'd']),
...: 'three' : pd.Series(np.random.randn(3), index=['b', 'c', 'd'])})
...:
In [2]: df
Out[2]:
one three two
a -0.539499 NaN 0.638015
b 0.599181 0.348593 2.283162
c 1.217597 -0.090397 -0.940880
d NaN -0.066985 0.373389
In [3]: row = df.ix[1]
In [4]: column = df['two']
In [5]: df.sub(row, axis='columns')
Out[5]:
one three two
a -1.138679 NaN -1.645146
b 0.000000 0.000000 0.000000
c 0.618416 -0.438990 -3.224041
d NaN -0.415579 -1.909772
In [6]: df.sub(row, axis=1)
Out[6]:
one three two
a -1.138679 NaN -1.645146
b 0.000000 0.000000 0.000000
c 0.618416 -0.438990 -3.224041
d NaN -0.415579 -1.909772
In [7]: df.sub(column, axis='index')
Out[7]:
one three two
a -1.177514 NaN 0.0
b -1.683981 -1.934568 0.0
c 2.158476 0.850483 0.0
d NaN -0.440375 0.0
In [8]: df.sub(column, axis=0)
Out[8]:
one three two
a -1.177514 NaN 0.0
b -1.683981 -1.934568 0.0
c 2.158476 0.850483 0.0
d NaN -0.440375 0.0
Furthermore you can align a level of a multi-indexed DataFrame with a Series.
In [9]: dfmi = df.copy()
In [10]: dfmi.index = pd.MultiIndex.from_tuples([(1,'a'),(1,'b'),(1,'c'),(2,'a')],
....: names=['first','second'])
....:
In [11]: dfmi.sub(column, axis=0, level='second')
Out[11]:
one three two
first second
1 a -1.177514 NaN 0.000000
b -1.683981 -1.934568 0.000000
c 2.158476 0.850483 0.000000
2 a NaN -0.705001 -0.264626
With Panel, describing the matching behavior is a bit more difficult, so the arithmetic methods instead (and perhaps confusingly?) give you the option to specify the broadcast axis. For example, suppose we wished to demean the data over a particular axis. This can be accomplished by taking the mean over an axis and broadcasting over the same axis:
In [12]: major_mean = wp.mean(axis='major')
In [13]: major_mean
Out[13]:
Item1 Item2
A -0.698246 -0.266135
B 0.494307 0.674535
C 0.126692 -0.296539
D 0.593198 -0.132742
In [14]: wp.sub(major_mean, axis='major')
Out[14]:
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 5 (major_axis) x 4 (minor_axis)
Items axis: Item1 to Item2
Major_axis axis: 2000-01-01 00:00:00 to 2000-01-05 00:00:00
Minor_axis axis: A to D
And similarly for axis="items"
and axis="minor"
.
Note
I could be convinced to make the axis argument in the DataFrame methods match the broadcasting behavior of Panel. Though it would require a transition period so users can change their code...
4.2 Missing data / operations with fill values
In Series and DataFrame (though not yet in Panel), the arithmetic functions
have the option of inputting a fill_value, namely a value to substitute when
at most one of the values at a location are missing. For example, when adding
two DataFrame objects, you may wish to treat NaN as 0 unless both DataFrames
are missing that value, in which case the result will be NaN (you can later
replace NaN with some other value using fillna
if you wish).
In [15]: df
Out[15]:
one three two
a -0.539499 NaN 0.638015
b 0.599181 0.348593 2.283162
c 1.217597 -0.090397 -0.940880
d NaN -0.066985 0.373389
In [16]: df2
Out[16]:
one three two
a -0.539499 1.000000 0.638015
b 0.599181 0.348593 2.283162
c 1.217597 -0.090397 -0.940880
d NaN -0.066985 0.373389
In [17]: df + df2
Out[17]:
one three two
a -1.078997 NaN 1.276030
b 1.198361 0.697187 4.566323
c 2.435193 -0.180794 -1.881759
d NaN -0.133971 0.746779
In [18]: df.add(df2, fill_value=0)
Out[18]:
one three two
a -1.078997 1.000000 1.276030
b 1.198361 0.697187 4.566323
c 2.435193 -0.180794 -1.881759
d NaN -0.133971 0.746779
4.3 Flexible Comparisons
Starting in v0.8, pandas introduced binary comparison methods eq, ne, lt, gt, le, and ge to Series and DataFrame whose behavior is analogous to the binary arithmetic operations described above:
In [19]: df.gt(df2)
Out[19]:
one three two
a False False False
b False False False
c False False False
d False False False
In [20]: df2.ne(df)
Out[20]:
one three two
a False True False
b False False False
c False False False
d True False False
These operations produce a pandas object the same type as the left-hand-side input
that if of dtype bool
. These boolean
objects can be used in indexing operations,
see here
4.4 Boolean Reductions
You can apply the reductions: empty
, any()
,
all()
, and bool()
to provide a
way to summarize a boolean result.
In [21]: (df > 0).all()
Out[21]:
one False
three False
two False
dtype: bool
In [22]: (df > 0).any()
Out[22]:
one True
three True
two True
dtype: bool
You can reduce to a final boolean value.
In [23]: (df > 0).any().any()
Out[23]: True
You can test if a pandas object is empty, via the empty
property.
In [24]: df.empty
Out[24]: False
In [25]: pd.DataFrame(columns=list('ABC')).empty
Out[25]: True
To evaluate single-element pandas objects in a boolean context, use the method
bool()
:
In [26]: pd.Series([True]).bool()
Out[26]: True
In [27]: pd.Series([False]).bool()
Out[27]: False
In [28]: pd.DataFrame([[True]]).bool()
Out[28]: True
In [29]: pd.DataFrame([[False]]).bool()
Out[29]: False
Warning
You might be tempted to do the following:
>>> if df:
...
Or
>>> df and df2
These both will raise as you are trying to compare multiple values.
ValueError: The truth value of an array is ambiguous. Use a.empty, a.any() or a.all().
See gotchas for a more detailed discussion.
4.5 Comparing if objects are equivalent
Often you may find there is more than one way to compute the same
result. As a simple example, consider df+df
and df*2
. To test
that these two computations produce the same result, given the tools
shown above, you might imagine using (df+df == df*2).all()
. But in
fact, this expression is False:
In [30]: df+df == df*2
Out[30]:
one three two
a True False True
b True True True
c True True True
d False True True
In [31]: (df+df == df*2).all()
Out[31]:
one False
three False
two True
dtype: bool
Notice that the boolean DataFrame df+df == df*2
contains some False values!
That is because NaNs do not compare as equals:
In [32]: np.nan == np.nan
Out[32]: False
So, as of v0.13.1, NDFrames (such as Series, DataFrames, and Panels)
have an equals()
method for testing equality, with NaNs in
corresponding locations treated as equal.
In [33]: (df+df).equals(df*2)
Out[33]: True
Note that the Series or DataFrame index needs to be in the same order for equality to be True:
In [34]: df1 = pd.DataFrame({'col':['foo', 0, np.nan]})
In [35]: df2 = pd.DataFrame({'col':[np.nan, 0, 'foo']}, index=[2,1,0])
In [36]: df1.equals(df2)
Out[36]: False
In [37]: df1.equals(df2.sort_index())
Out[37]: True
4.6 Comparing array-like objects
You can conveniently do element-wise comparisons when comparing a pandas data structure with a scalar value:
In [38]: pd.Series(['foo', 'bar', 'baz']) == 'foo'
Out[38]:
0 True
1 False
2 False
dtype: bool
In [39]: pd.Index(['foo', 'bar', 'baz']) == 'foo'
Out[39]: array([ True, False, False], dtype=bool)
Pandas also handles element-wise comparisons between different array-like objects of the same length:
In [40]: pd.Series(['foo', 'bar', 'baz']) == pd.Index(['foo', 'bar', 'qux'])
Out[40]:
0 True
1 True
2 False
dtype: bool
In [41]: pd.Series(['foo', 'bar', 'baz']) == np.array(['foo', 'bar', 'qux'])
Out[41]:
0 True
1 True
2 False
dtype: bool
Trying to compare Index
or Series
objects of different lengths will
raise a ValueError:
In [55]: pd.Series(['foo', 'bar', 'baz']) == pd.Series(['foo', 'bar'])
ValueError: Series lengths must match to compare
In [56]: pd.Series(['foo', 'bar', 'baz']) == pd.Series(['foo'])
ValueError: Series lengths must match to compare
Note that this is different from the numpy behavior where a comparison can be broadcast:
In [42]: np.array([1, 2, 3]) == np.array([2])
Out[42]: array([False, True, False], dtype=bool)
or it can return False if broadcasting can not be done:
In [43]: np.array([1, 2, 3]) == np.array([1, 2])
Out[43]: False
4.7 Combining overlapping data sets
A problem occasionally arising is the combination of two similar data sets
where values in one are preferred over the other. An example would be two data
series representing a particular economic indicator where one is considered to
be of “higher quality”. However, the lower quality series might extend further
back in history or have more complete data coverage. As such, we would like to
combine two DataFrame objects where missing values in one DataFrame are
conditionally filled with like-labeled values from the other DataFrame. The
function implementing this operation is combine_first()
,
which we illustrate:
In [44]: df1 = pd.DataFrame({'A' : [1., np.nan, 3., 5., np.nan],
....: 'B' : [np.nan, 2., 3., np.nan, 6.]})
....:
In [45]: df2 = pd.DataFrame({'A' : [5., 2., 4., np.nan, 3., 7.],
....: 'B' : [np.nan, np.nan, 3., 4., 6., 8.]})
....:
In [46]: df1
Out[46]:
A B
0 1.0 NaN
1 NaN 2.0
2 3.0 3.0
3 5.0 NaN
4 NaN 6.0
In [47]: df2
Out[47]:
A B
0 5.0 NaN
1 2.0 NaN
2 4.0 3.0
3 NaN 4.0
4 3.0 6.0
5 7.0 8.0
In [48]: df1.combine_first(df2)
Out[48]:
A B
0 1.0 NaN
1 2.0 2.0
2 3.0 3.0
3 5.0 4.0
4 3.0 6.0
5 7.0 8.0
4.8 General DataFrame Combine
The combine_first()
method above calls the more general
DataFrame method combine()
. This method takes another DataFrame
and a combiner function, aligns the input DataFrame and then passes the combiner
function pairs of Series (i.e., columns whose names are the same).
So, for instance, to reproduce combine_first()
as above:
In [49]: combiner = lambda x, y: np.where(pd.isnull(x), y, x)
In [50]: df1.combine(df2, combiner)
Out[50]:
A B
0 1.0 NaN
1 2.0 2.0
2 3.0 3.0
3 5.0 4.0
4 3.0 6.0
5 7.0 8.0