6 Function application

To apply your own or another library’s functions to pandas objects, you should be aware of the three methods below. The appropriate method to use depends on whether your function expects to operate on an entire DataFrame or Series, row- or column-wise, or elementwise.

Tablewise Function Application: pipe()
Row or Column-wise Function Application: apply()
Elementwise Function Application: applymap()

6.1 Tablewise Function Application

New in version 0.16.2.

DataFrames and Series can of course just be passed into functions. However, if the function needs to be called in a chain, consider using the pipe() method. Compare the following:

# f, g, and h are functions taking and returning ``DataFrames``
>>> f(g(h(df), arg1=1), arg2=2, arg3=3)

with the equivalent:

>>> (df.pipe(h)
       .pipe(g, arg1=1)
       .pipe(f, arg2=2, arg3=3)
    )

Pandas encourages the second style, which is known as method chaining. pipe makes it easy to use your own or another library’s functions in method chains, alongside pandas’ methods.
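
The comparison above can be tried with concrete functions. The following is a minimal sketch: drop_missing, scale, and add_offset are hypothetical stand-ins for h, g, and f, and the small DataFrame is invented for illustration.

```python
import pandas as pd

# Hypothetical helpers (not part of pandas): each takes a DataFrame as its
# first positional argument and returns a DataFrame.
def drop_missing(frame):
    return frame.dropna()

def scale(frame, factor=1):
    return frame * factor

def add_offset(frame, offset=0):
    return frame + offset

# Illustrative data only.
df = pd.DataFrame({'one': [1.0, 2.0, None], 'two': [4.0, 5.0, 6.0]})

# Nested style: reads inside-out.
nested = add_offset(scale(drop_missing(df), factor=10), offset=1)

# Chained style: reads top to bottom, in the order the steps run.
chained = (df.pipe(drop_missing)
             .pipe(scale, factor=10)
             .pipe(add_offset, offset=1))

print(chained.equals(nested))
```

Both spellings call the same functions with the same arguments; only the reading order differs.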

In the example above, the functions f, g, and h each expected the DataFrame as the first positional argument. What if the function you wish to apply takes its data as, say, the second argument? In this case, provide pipe with a tuple of (callable, data_keyword). .pipe will route the DataFrame to the argument specified in the tuple.
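
As a small sketch of the tuple form, assume a hypothetical function column_total whose data argument comes second and is named data. Neither the function nor the sample values come from pandas or from this guide.

```python
import pandas as pd

# Hypothetical function: the DataFrame is the *second* argument, named ``data``.
def column_total(column, data):
    return data[column].sum()

# Illustrative data only.
df = pd.DataFrame({'h': [1, 2, 3], 'hr': [0, 1, 4]})

# pipe routes ``df`` to the ``data`` keyword; 'hr' is passed positionally,
# so this call is the same as column_total('hr', data=df).
total = df.pipe((column_total, 'data'), 'hr')
print(total)
```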

For example, we can fit a regression using statsmodels. Their API expects a formula first and a DataFrame as the second argument, data. We pass in the function, keyword pair (sm.poisson, 'data') to pipe:

In [1]: import statsmodels.formula.api as sm

In [2]: bb = pd.read_csv('https://raw.githubusercontent.com/pydata/pandas/master/doc/data/baseball.csv',
   ...:     index_col='id')
   ...: 

In [3]: (bb.query('h > 0')
   ...:    .assign(ln_h = lambda df: np.log(df.h))
   ...:    .pipe((sm.poisson, 'data'), 'hr ~ ln_h + year + g + C(lg)')
   ...:    .fit()
   ...:    .summary()
   ...: )
   ...: 
Optimization terminated successfully.
         Current function value: 2.116284
         Iterations 24
Out[3]: 
<class 'statsmodels.iolib.summary.Summary'>
"""
                          Poisson Regression Results                          
==============================================================================
Dep. Variable:                     hr   No. Observations:                   68
Model:                        Poisson   Df Residuals:                       63
Method:                           MLE   Df Model:                            4
Date:                Fri, 30 Sep 2016   Pseudo R-squ.:                  0.6878
Time:                        13:50:04   Log-Likelihood:                -143.91
converged:                       True   LL-Null:                       -460.91
                                        LLR p-value:                6.774e-136
===============================================================================
                  coef    std err          z      P>|z|      [95.0% Conf. Int.]
-------------------------------------------------------------------------------
Intercept   -1267.3636    457.867     -2.768      0.006     -2164.767  -369.960
C(lg)[T.NL]    -0.2057      0.101     -2.044      0.041        -0.403    -0.008
ln_h            0.9280      0.191      4.866      0.000         0.554     1.302
year            0.6301      0.228      2.762      0.006         0.183     1.077
g               0.0099      0.004      2.754      0.006         0.003     0.017
===============================================================================
"""

The pipe method is inspired by Unix pipes and, more recently, by dplyr and magrittr, which have introduced the popular (%>%) (read pipe) operator for R. The implementation of pipe here is quite clean and feels right at home in Python. We encourage you to view the source code (pd.DataFrame.pipe?? in IPython).

6.2 Row or Column-wise Function Application

Arbitrary functions can be applied along the axes of a DataFrame or Panel using the apply() method, which, like the descriptive statistics methods, takes an optional axis argument:

In [4]: df.apply(np.mean)
Out[4]: 
one      0.272336
three    0.448588
two      1.197674
dtype: float64

In [5]: df.apply(np.mean, axis=1)
Out[5]: 
a   -0.222203
b    0.072798
c    1.551760
d    1.262101
dtype: float64

In [6]: df.apply(lambda x: x.max() - x.min())
Out[6]: 
one      1.781053
three    2.646671
two      3.221620
dtype: float64

In [7]: df.apply(np.cumsum)
Out[7]: 
        one     three       two
a -0.437898       NaN -0.006509
b  0.905258 -0.591153 -0.540119
c  0.817008  1.464366  2.147891
d       NaN  1.345763  4.790695

In [8]: df.apply(np.exp)
Out[8]: 
        one     three        two
a  0.645392       NaN   0.993512
b  3.831113  0.553689   0.586484
c  0.915532  7.810885  14.702397
d       NaN  0.888161  14.052545

Depending on the return type of the function passed to apply(), the result will either be of lower dimension or the same dimension.
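
To make the effect of the return type concrete, here is a minimal sketch on an invented three-row DataFrame: a function that returns a scalar per column yields a Series, while a function that returns a Series of the same length yields a DataFrame.

```python
import pandas as pd

# Illustrative data only.
df = pd.DataFrame({'one': [1.0, 2.0, 3.0], 'two': [4.0, 5.0, 6.0]},
                  index=['a', 'b', 'c'])

# Scalar per column: the result drops one dimension (DataFrame -> Series).
reduced = df.apply(lambda col: col.sum())

# Series per column: the result keeps the DataFrame's shape.
same_shape = df.apply(lambda col: col.cumsum())

print(type(reduced).__name__, type(same_shape).__name__)
```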

apply() combined with some cleverness can be used to answer many questions about a data set. For example, suppose we wanted to extract the date on which the maximum value for each column occurred:

In [9]: tsdf = pd.DataFrame(np.random.randn(1000, 3), columns=['A', 'B', 'C'],
   ...:                     index=pd.date_range('1/1/2000', periods=1000))
   ...: 

In [10]: tsdf.apply(lambda x: x.idxmax())
Out[10]: 
A   2001-01-06
B   2000-01-25
C   2000-07-08
dtype: datetime64[ns]

You may also pass additional arguments and keyword arguments to the apply() method. For instance, consider the following function you would like to apply:

def subtract_and_divide(x, sub, divide=1):
    return (x - sub) / divide

You may then apply this function as follows:

df.apply(subtract_and_divide, args=(5,), divide=3)
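
A self-contained version of the call above, using a small invented DataFrame so the arithmetic is easy to follow: the value in args is passed as sub, and divide is forwarded as a keyword argument.

```python
import pandas as pd

def subtract_and_divide(x, sub, divide=1):
    return (x - sub) / divide

# Illustrative data only.
df = pd.DataFrame({'one': [5.0, 8.0], 'two': [11.0, 14.0]})

# Each column is passed as ``x``; 5 becomes ``sub`` and 3 becomes ``divide``.
result = df.apply(subtract_and_divide, args=(5,), divide=3)
print(result)
```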

Another useful feature is the ability to pass Series methods to carry out some Series operation on each column or row:

In [11]: tsdf
Out[11]: 
                   A         B         C
2000-01-01 -0.867744 -0.739910  0.263414
2000-01-02 -0.252225  1.662201 -0.773457
2000-01-03 -0.203230  0.012023  0.567693
2000-01-04       NaN       NaN       NaN
...              ...       ...       ...
2000-01-07       NaN       NaN       NaN
2000-01-08 -0.333957 -0.364556 -0.363141
2000-01-09 -1.198297  0.938670 -1.528172
2000-01-10  0.501602  3.103739 -1.647841

[10 rows x 3 columns]

In [12]: tsdf.apply(pd.Series.interpolate)
Out[12]: 
                   A         B         C
2000-01-01 -0.867744 -0.739910  0.263414
2000-01-02 -0.252225  1.662201 -0.773457
2000-01-03 -0.203230  0.012023  0.567693
2000-01-04 -0.229375 -0.063293  0.381526
...              ...       ...       ...
2000-01-07 -0.307812 -0.289240 -0.176974
2000-01-08 -0.333957 -0.364556 -0.363141
2000-01-09 -1.198297  0.938670 -1.528172
2000-01-10  0.501602  3.103739 -1.647841

[10 rows x 3 columns]

Finally, apply() takes an argument raw, which is False by default. With raw=False, each row or column is converted into a Series before the function is applied. When raw is set to True, the passed function instead receives an ndarray object, which can improve performance if you do not need the indexing functionality.
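
The difference can be observed by checking what the applied function receives. This is a minimal sketch on invented data; it makes no claim about how large the speed-up is in practice.

```python
import numpy as np
import pandas as pd

# Illustrative data only.
df = pd.DataFrame({'one': [1.0, 2.0], 'two': [3.0, 4.0]})

def is_ndarray(col):
    return isinstance(col, np.ndarray)

# Default (raw=False): the function sees a Series for each column.
as_series = df.apply(is_ndarray)

# raw=True: the function sees a plain NumPy array for each column.
as_arrays = df.apply(is_ndarray, raw=True)

# A reduction that needs no index labels gives the same values either way.
sums_default = df.apply(lambda col: col.sum())
sums_raw = df.apply(lambda col: col.sum(), raw=True)
```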

6.3 Applying elementwise Python functions

Since not all functions can be vectorized (accept NumPy arrays and return another array or value), the methods applymap() on DataFrame and analogously map() on Series accept any Python function taking a single value and returning a single value. For example:

In [13]: df4
Out[13]: 
        one     three       two
a -0.539499       NaN  0.638015
b  0.599181  0.348593  2.283162
c  1.217597 -0.090397 -0.940880
d       NaN -0.066985  0.373389

In [14]: f = lambda x: len(str(x))

In [15]: df4['one'].map(f)
Out[15]: 
a    15
b    14
c    13
d     3
Name: one, dtype: int64

In [16]: df4.applymap(f)
Out[16]: 
   one  three  two
a   15      3   14
b   14     14   13
c   13     16   15
d    3     16   14

Series.map() has an additional feature: it can be used to easily “link” or “map” values defined by a secondary Series. This is closely related to merging/joining functionality:

In [17]: s = pd.Series(['six', 'seven', 'six', 'seven', 'six'],
   ....:               index=['a', 'b', 'c', 'd', 'e'])
   ....: 

In [18]: t = pd.Series({'six' : 6., 'seven' : 7.})

In [19]: s
Out[19]: 
a      six
b    seven
c      six
d    seven
e      six
dtype: object

In [20]: s.map(t)
Out[20]: 
a    6.0
b    7.0
c    6.0
d    7.0
e    6.0
dtype: float64
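
One consequence of this lookup behaviour is worth seeing in a sketch: a value with no matching label in the secondary Series maps to NaN. The 'eight' entry below is invented for illustration.

```python
import pandas as pd

# 'eight' has no entry in the lookup Series ``t``.
s = pd.Series(['six', 'seven', 'eight'], index=['a', 'b', 'c'])
t = pd.Series({'six': 6., 'seven': 7.})

mapped = s.map(t)
print(mapped)
```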

6.4 Applying with a Panel

Applying with a Panel will pass a Series to the applied function. If the applied function returns a Series, the result of the application will be a Panel. If the applied function reduces to a scalar, the result of the application will be a DataFrame.

Note

Prior to 0.13.1 apply on a Panel would only work on ufuncs (e.g. np.sum/np.max).

In [21]: import pandas.util.testing as tm

In [22]: panel = tm.makePanel(5)

In [23]: panel
Out[23]: 
<class 'pandas.core.panel.Panel'>
Dimensions: 3 (items) x 5 (major_axis) x 4 (minor_axis)
Items axis: ItemA to ItemC
Major_axis axis: 2000-01-03 00:00:00 to 2000-01-07 00:00:00
Minor_axis axis: A to D

In [24]: panel['ItemA']
Out[24]: 
                   A         B         C         D
2000-01-03 -0.058321  1.541779  1.642293  0.526628
2000-01-04 -0.015366 -1.028011  0.646377  2.143954
2000-01-05 -0.246782  1.095148  0.463691 -1.616592
2000-01-06 -0.867023 -0.880842 -1.465336  0.287073
2000-01-07  0.508859 -0.556211 -0.610470 -0.004885

A transformational apply.

In [25]: result = panel.apply(lambda x: x*2, axis='items')

In [26]: result
Out[26]: 
<class 'pandas.core.panel.Panel'>
Dimensions: 3 (items) x 5 (major_axis) x 4 (minor_axis)
Items axis: ItemA to ItemC
Major_axis axis: 2000-01-03 00:00:00 to 2000-01-07 00:00:00
Minor_axis axis: A to D

In [27]: result['ItemA']
Out[27]: 
                   A         B         C         D
2000-01-03 -0.116642  3.083559  3.284586  1.053255
2000-01-04 -0.030733 -2.056023  1.292754  4.287908
2000-01-05 -0.493565  2.190296  0.927382 -3.233183
2000-01-06 -1.734047 -1.761684 -2.930673  0.574147
2000-01-07  1.017718 -1.112422 -1.220940 -0.009770

A reduction operation.

In [28]: panel.apply(lambda x: x.dtype, axis='items')
Out[28]: 
                  A        B        C        D
2000-01-03  float64  float64  float64  float64
2000-01-04  float64  float64  float64  float64
2000-01-05  float64  float64  float64  float64
2000-01-06  float64  float64  float64  float64
2000-01-07  float64  float64  float64  float64

A similar reduction-type operation:

In [29]: panel.apply(lambda x: x.sum(), axis='major_axis')
Out[29]: 
      ItemA     ItemB     ItemC
A -0.678634 -1.814294  0.389953
B  0.171863  3.260122  0.264703
C  0.676554 -0.926901  0.250588
D  1.336178 -1.196204  2.904298

This last reduction is equivalent to:

In [30]: panel.sum('major_axis')
Out[30]: 
      ItemA     ItemB     ItemC
A -0.678634 -1.814294  0.389953
B  0.171863  3.260122  0.264703
C  0.676554 -0.926901  0.250588
D  1.336178 -1.196204  2.904298

A transformation operation that returns a Panel, here computing the z-score across the major_axis:

In [31]: result = panel.apply(
   ....:            lambda x: (x-x.mean())/x.std(),
   ....:            axis='major_axis')
   ....: 

In [32]: result
Out[32]: 
<class 'pandas.core.panel.Panel'>
Dimensions: 3 (items) x 5 (major_axis) x 4 (minor_axis)
Items axis: ItemA to ItemC
Major_axis axis: 2000-01-03 00:00:00 to 2000-01-07 00:00:00
Minor_axis axis: A to D

In [33]: result['ItemA']
Out[33]: 
                   A         B         C         D
2000-01-03  0.156137  1.261375  1.256006  0.193170
2000-01-04  0.242782 -0.888986  0.425952  1.397600
2000-01-05 -0.224012  0.887640  0.273691 -1.402895
2000-01-06 -1.475113 -0.765838 -1.334072  0.014773
2000-01-07  1.300207 -0.494191 -0.621577 -0.202649

Apply can also accept multiple axes in the axis argument. This will pass a DataFrame of the cross-section to the applied function.

In [34]: f = lambda x: ((x.T-x.mean(1))/x.std(1)).T

In [35]: result = panel.apply(f, axis = ['items','major_axis'])

In [36]: result
Out[36]: 
<class 'pandas.core.panel.Panel'>
Dimensions: 4 (items) x 5 (major_axis) x 3 (minor_axis)
Items axis: A to D
Major_axis axis: 2000-01-03 00:00:00 to 2000-01-07 00:00:00
Minor_axis axis: ItemA to ItemC

In [37]: result.loc[:,:,'ItemA']
Out[37]: 
                   A         B         C         D
2000-01-03  0.619463  0.896525  1.075947  0.420792
2000-01-04  1.059012 -1.096130  0.891046  0.754187
2000-01-05  0.058672  1.047126  0.492211 -1.064774
2000-01-06 -0.613789 -0.435602 -1.154147 -1.084692
2000-01-07  0.044797 -0.908652 -1.071002 -0.483408

This is equivalent to the following:

In [38]: result = pd.Panel(dict([ (ax, f(panel.loc[:,:,ax]))
   ....:                         for ax in panel.minor_axis ]))
   ....: 

In [39]: result
Out[39]: 
<class 'pandas.core.panel.Panel'>
Dimensions: 4 (items) x 5 (major_axis) x 3 (minor_axis)
Items axis: A to D
Major_axis axis: 2000-01-03 00:00:00 to 2000-01-07 00:00:00
Minor_axis axis: ItemA to ItemC

In [40]: result.loc[:,:,'ItemA']
Out[40]: 
                   A         B         C         D
2000-01-03  0.619463  0.896525  1.075947  0.420792
2000-01-04  1.059012 -1.096130  0.891046  0.754187
2000-01-05  0.058672  1.047126  0.492211 -1.064774
2000-01-06 -0.613789 -0.435602 -1.154147 -1.084692
2000-01-07  0.044797 -0.908652 -1.071002 -0.483408