.. currentmodule:: pandas .. ipython:: python :suppress: import numpy as np import pandas as pd np.set_printoptions(precision=4, suppress=True) pd.options.display.max_rows = 8 index = pd.date_range('1/1/2000', periods=8) s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e']) df = pd.DataFrame(np.random.randn(8, 3), index=index, columns=['A', 'B', 'C']) wp = pd.Panel(np.random.randn(2, 5, 4), items=['Item1', 'Item2'], major_axis=pd.date_range('1/1/2000', periods=5), minor_axis=['A', 'B', 'C', 'D']) .. _basics.binop: Flexible binary operations -------------------------- With binary operations between pandas data structures, there are two key points of interest: * Broadcasting behavior between higher- (e.g. DataFrame) and lower-dimensional (e.g. Series) objects. * Missing data in computations We will demonstrate how to manage these issues independently, though they can be handled simultaneously. Matching / broadcasting behavior ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ DataFrame has the methods :meth:`~DataFrame.add`, :meth:`~DataFrame.sub`, :meth:`~DataFrame.mul`, :meth:`~DataFrame.div` and related functions :meth:`~DataFrame.radd`, :meth:`~DataFrame.rsub`, ... for carrying out binary operations. For broadcasting behavior, Series input is of primary interest. Using these functions, you can use to either match on the *index* or *columns* via the **axis** keyword: .. ipython:: python df = pd.DataFrame({'one' : pd.Series(np.random.randn(3), index=['a', 'b', 'c']), 'two' : pd.Series(np.random.randn(4), index=['a', 'b', 'c', 'd']), 'three' : pd.Series(np.random.randn(3), index=['b', 'c', 'd'])}) df row = df.ix[1] column = df['two'] df.sub(row, axis='columns') df.sub(row, axis=1) df.sub(column, axis='index') df.sub(column, axis=0) .. ipython:: python :suppress: df_orig = df Furthermore you can align a level of a multi-indexed DataFrame with a Series. .. ipython:: python dfmi = df.copy() dfmi.index = pd.MultiIndex.from_tuples([(1,'a'),(1,'b'),(1,'c'),(2,'a')], names=['first','second']) dfmi.sub(column, axis=0, level='second') With Panel, describing the matching behavior is a bit more difficult, so the arithmetic methods instead (and perhaps confusingly?) give you the option to specify the *broadcast axis*. For example, suppose we wished to demean the data over a particular axis. This can be accomplished by taking the mean over an axis and broadcasting over the same axis: .. ipython:: python major_mean = wp.mean(axis='major') major_mean wp.sub(major_mean, axis='major') And similarly for ``axis="items"`` and ``axis="minor"``. .. note:: I could be convinced to make the **axis** argument in the DataFrame methods match the broadcasting behavior of Panel. Though it would require a transition period so users can change their code... Missing data / operations with fill values ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ In Series and DataFrame (though not yet in Panel), the arithmetic functions have the option of inputting a *fill_value*, namely a value to substitute when at most one of the values at a location are missing. For example, when adding two DataFrame objects, you may wish to treat NaN as 0 unless both DataFrames are missing that value, in which case the result will be NaN (you can later replace NaN with some other value using ``fillna`` if you wish). .. ipython:: python :suppress: df2 = df.copy() df2['three']['a'] = 1. .. ipython:: python df df2 df + df2 df.add(df2, fill_value=0) .. _basics.compare: Flexible Comparisons ~~~~~~~~~~~~~~~~~~~~ Starting in v0.8, pandas introduced binary comparison methods eq, ne, lt, gt, le, and ge to Series and DataFrame whose behavior is analogous to the binary arithmetic operations described above: .. ipython:: python df.gt(df2) df2.ne(df) These operations produce a pandas object the same type as the left-hand-side input that if of dtype ``bool``. These ``boolean`` objects can be used in indexing operations, see :ref:`here` .. _basics.reductions: Boolean Reductions ~~~~~~~~~~~~~~~~~~ You can apply the reductions: :attr:`~DataFrame.empty`, :meth:`~DataFrame.any`, :meth:`~DataFrame.all`, and :meth:`~DataFrame.bool` to provide a way to summarize a boolean result. .. ipython:: python (df > 0).all() (df > 0).any() You can reduce to a final boolean value. .. ipython:: python (df > 0).any().any() You can test if a pandas object is empty, via the :attr:`~DataFrame.empty` property. .. ipython:: python df.empty pd.DataFrame(columns=list('ABC')).empty To evaluate single-element pandas objects in a boolean context, use the method :meth:`~DataFrame.bool`: .. ipython:: python pd.Series([True]).bool() pd.Series([False]).bool() pd.DataFrame([[True]]).bool() pd.DataFrame([[False]]).bool() .. warning:: You might be tempted to do the following: .. code-block:: python >>> if df: ... Or .. code-block:: python >>> df and df2 These both will raise as you are trying to compare multiple values. .. code-block:: python ValueError: The truth value of an array is ambiguous. Use a.empty, a.any() or a.all(). See :ref:`gotchas` for a more detailed discussion. .. _basics.equals: Comparing if objects are equivalent ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Often you may find there is more than one way to compute the same result. As a simple example, consider ``df+df`` and ``df*2``. To test that these two computations produce the same result, given the tools shown above, you might imagine using ``(df+df == df*2).all()``. But in fact, this expression is False: .. ipython:: python df+df == df*2 (df+df == df*2).all() Notice that the boolean DataFrame ``df+df == df*2`` contains some False values! That is because NaNs do not compare as equals: .. ipython:: python np.nan == np.nan So, as of v0.13.1, NDFrames (such as Series, DataFrames, and Panels) have an :meth:`~DataFrame.equals` method for testing equality, with NaNs in corresponding locations treated as equal. .. ipython:: python (df+df).equals(df*2) Note that the Series or DataFrame index needs to be in the same order for equality to be True: .. ipython:: python df1 = pd.DataFrame({'col':['foo', 0, np.nan]}) df2 = pd.DataFrame({'col':[np.nan, 0, 'foo']}, index=[2,1,0]) df1.equals(df2) df1.equals(df2.sort_index()) Comparing array-like objects ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ You can conveniently do element-wise comparisons when comparing a pandas data structure with a scalar value: .. ipython:: python pd.Series(['foo', 'bar', 'baz']) == 'foo' pd.Index(['foo', 'bar', 'baz']) == 'foo' Pandas also handles element-wise comparisons between different array-like objects of the same length: .. ipython:: python pd.Series(['foo', 'bar', 'baz']) == pd.Index(['foo', 'bar', 'qux']) pd.Series(['foo', 'bar', 'baz']) == np.array(['foo', 'bar', 'qux']) Trying to compare ``Index`` or ``Series`` objects of different lengths will raise a ValueError: .. code-block:: ipython In [55]: pd.Series(['foo', 'bar', 'baz']) == pd.Series(['foo', 'bar']) ValueError: Series lengths must match to compare In [56]: pd.Series(['foo', 'bar', 'baz']) == pd.Series(['foo']) ValueError: Series lengths must match to compare Note that this is different from the numpy behavior where a comparison can be broadcast: .. ipython:: python np.array([1, 2, 3]) == np.array([2]) or it can return False if broadcasting can not be done: .. ipython:: python :okwarning: np.array([1, 2, 3]) == np.array([1, 2]) Combining overlapping data sets ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ A problem occasionally arising is the combination of two similar data sets where values in one are preferred over the other. An example would be two data series representing a particular economic indicator where one is considered to be of "higher quality". However, the lower quality series might extend further back in history or have more complete data coverage. As such, we would like to combine two DataFrame objects where missing values in one DataFrame are conditionally filled with like-labeled values from the other DataFrame. The function implementing this operation is :meth:`~DataFrame.combine_first`, which we illustrate: .. ipython:: python df1 = pd.DataFrame({'A' : [1., np.nan, 3., 5., np.nan], 'B' : [np.nan, 2., 3., np.nan, 6.]}) df2 = pd.DataFrame({'A' : [5., 2., 4., np.nan, 3., 7.], 'B' : [np.nan, np.nan, 3., 4., 6., 8.]}) df1 df2 df1.combine_first(df2) General DataFrame Combine ~~~~~~~~~~~~~~~~~~~~~~~~~ The :meth:`~DataFrame.combine_first` method above calls the more general DataFrame method :meth:`~DataFrame.combine`. This method takes another DataFrame and a combiner function, aligns the input DataFrame and then passes the combiner function pairs of Series (i.e., columns whose names are the same). So, for instance, to reproduce :meth:`~DataFrame.combine_first` as above: .. ipython:: python combiner = lambda x, y: np.where(pd.isnull(x), y, x) df1.combine(df2, combiner)