6 Function application
To apply your own or another library’s functions to pandas objects,
you should be aware of the three methods below. The appropriate
method to use depends on whether your function expects to operate
on an entire DataFrame or Series, row- or column-wise, or elementwise.
- Tablewise Function Application: pipe()
- Row or Column-wise Function Application: apply()
- Elementwise function application: applymap()
6.1 Tablewise Function Application
New in version 0.16.2.
DataFrames and Series can of course just be passed into functions.
However, if the function needs to be called in a chain, consider using the pipe() method.
Compare the following
# f, g, and h are functions taking and returning ``DataFrames``
>>> f(g(h(df), arg1=1), arg2=2, arg3=3)
with the equivalent
>>> (df.pipe(h)
       .pipe(g, arg1=1)
       .pipe(f, arg2=2, arg3=3)
    )
pandas encourages the second style, which is known as method chaining.
pipe makes it easy to use your own or another library’s functions
in method chains, alongside pandas’ methods.
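As a minimal sketch of the comparison above, the snippet below stands in three made-up helpers (drop_missing, scale, and add_total are hypothetical names, not pandas functions) for h, g, and f, and checks that the nested call and the pipe() chain give the same result.

```python
import pandas as pd


# Hypothetical helpers: each takes a DataFrame as its first argument
# and returns a DataFrame, like f, g, and h in the text.
def drop_missing(df):
    return df.dropna()


def scale(df, factor=1):
    return df * factor


def add_total(df, name="total"):
    return df.assign(**{name: df.sum(axis=1)})


df = pd.DataFrame({"one": [1.0, 2.0, None], "two": [4.0, 5.0, 6.0]})

# Nested style: read from the inside out.
nested = add_total(scale(drop_missing(df), factor=10), name="total")

# Method-chaining style: read from top to bottom.
chained = (df.pipe(drop_missing)
             .pipe(scale, factor=10)
             .pipe(add_total, name="total"))
```

Both forms call the same functions with the same arguments; the chained form simply lists the steps in the order they run.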
In the example above, the functions f, g, and h each expected the DataFrame as the first positional argument.
What if the function you wish to apply takes its data as, say, the second argument?
In this case, provide pipe with a tuple of (callable, data_keyword).
.pipe will route the DataFrame to the argument specified in the tuple.
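A small sketch of the tuple form, assuming a made-up function describe_column (hypothetical, for illustration only) that takes its data through a keyword named data rather than as the first argument:

```python
import pandas as pd


# Hypothetical function: the DataFrame is NOT the first argument;
# it arrives through the keyword argument "data".
def describe_column(column, data):
    return data[column].describe()


df = pd.DataFrame({"h": [1, 2, 3, 4]})

# pipe() passes df as data=df and forwards "h" as the first positional argument.
summary = df.pipe((describe_column, "data"), "h")

# The equivalent direct call.
direct = describe_column("h", data=df)
```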
For example, we can fit a regression using statsmodels. Their API expects a formula first and a DataFrame as the second argument, data. We pass in the function, keyword pair (sm.poisson, 'data') to pipe:
In [1]: import statsmodels.formula.api as sm
In [2]: bb = pd.read_csv('https://raw.githubusercontent.com/pydata/pandas/master/doc/data/baseball.csv',
   ...:                  index_col='id')
   ...:
In [3]: (bb.query('h > 0')
   ...:    .assign(ln_h = lambda df: np.log(df.h))
   ...:    .pipe((sm.poisson, 'data'), 'hr ~ ln_h + year + g + C(lg)')
   ...:    .fit()
   ...:    .summary()
   ...: )
   ...:
Optimization terminated successfully.
Current function value: 2.116284
Iterations 24
Out[3]:
<class 'statsmodels.iolib.summary.Summary'>
"""
                          Poisson Regression Results
==============================================================================
Dep. Variable:                     hr   No. Observations:                   68
Model:                        Poisson   Df Residuals:                       63
Method:                           MLE   Df Model:                            4
Date:                Fri, 30 Sep 2016   Pseudo R-squ.:                  0.6878
Time:                        13:50:04   Log-Likelihood:                -143.91
converged:                       True   LL-Null:                       -460.91
                                        LLR p-value:                6.774e-136
===============================================================================
                  coef    std err          z      P>|z|      [95.0% Conf. Int.]
-------------------------------------------------------------------------------
Intercept   -1267.3636    457.867     -2.768      0.006     -2164.767  -369.960
C(lg)[T.NL]    -0.2057      0.101     -2.044      0.041        -0.403    -0.008
ln_h            0.9280      0.191      4.866      0.000         0.554     1.302
year            0.6301      0.228      2.762      0.006         0.183     1.077
g               0.0099      0.004      2.754      0.006         0.003     0.017
===============================================================================
"""
The pipe method is inspired by Unix pipes and, more recently, dplyr and magrittr, which
have introduced the popular (%>%) (read pipe) operator for R.
The implementation of pipe here is quite clean and feels right at home in Python.
We encourage you to view the source code (pd.DataFrame.pipe?? in IPython).
6.2 Row or Column-wise Function Application
Arbitrary functions can be applied along the axes of a DataFrame or Panel
using the apply() method, which, like the descriptive
statistics methods, takes an optional axis argument:
In [4]: df.apply(np.mean)
Out[4]:
one 0.272336
three 0.448588
two 1.197674
dtype: float64
In [5]: df.apply(np.mean, axis=1)
Out[5]:
a -0.222203
b 0.072798
c 1.551760
d 1.262101
dtype: float64
In [6]: df.apply(lambda x: x.max() - x.min())
Out[6]:
one 1.781053
three 2.646671
two 3.221620
dtype: float64
In [7]: df.apply(np.cumsum)
Out[7]:
one three two
a -0.437898 NaN -0.006509
b 0.905258 -0.591153 -0.540119
c 0.817008 1.464366 2.147891
d NaN 1.345763 4.790695
In [8]: df.apply(np.exp)
Out[8]:
one three two
a 0.645392 NaN 0.993512
b 3.831113 0.553689 0.586484
c 0.915532 7.810885 14.702397
d NaN 0.888161 14.052545
Depending on the return type of the function passed to apply(),
the result will either be of lower dimension or the same dimension.
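The two cases can be seen side by side in the sketch below, which uses a small made-up DataFrame: a function that returns a scalar per column reduces the result to a Series, while a function that returns a Series of the same length keeps the DataFrame shape.

```python
import pandas as pd

# Illustrative data, not the df used in the session above.
df = pd.DataFrame({"one": [1.0, 2.0, 3.0], "two": [4.0, 5.0, 6.0]},
                  index=["a", "b", "c"])

# Scalar per column: the result is a Series indexed by column name.
reduced = df.apply(lambda x: x.max() - x.min())

# Series per column: the result is a DataFrame with the same shape as df.
same_shape = df.apply(lambda x: x.cumsum())
```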
apply() combined with some cleverness can be used to answer many questions
about a data set. For example, suppose we wanted to extract the date where the
maximum value for each column occurred:
In [9]: tsdf = pd.DataFrame(np.random.randn(1000, 3), columns=['A', 'B', 'C'],
   ...:                     index=pd.date_range('1/1/2000', periods=1000))
   ...:
In [10]: tsdf.apply(lambda x: x.idxmax())
Out[10]:
A 2001-01-06
B 2000-01-25
C 2000-07-08
dtype: datetime64[ns]
You may also pass additional arguments and keyword arguments to the apply()
method. For instance, consider the following function you would like to apply:
def subtract_and_divide(x, sub, divide=1):
    return (x - sub) / divide
You may then apply this function as follows:
df.apply(subtract_and_divide, args=(5,), divide=3)
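To make the call above concrete, here is a self-contained sketch with a small made-up DataFrame. The positional extra argument sub is supplied through args, and divide is forwarded as a keyword argument.

```python
import pandas as pd


def subtract_and_divide(x, sub, divide=1):
    return (x - sub) / divide


# Illustrative data chosen so the arithmetic is easy to check by hand.
df = pd.DataFrame({"one": [5.0, 8.0], "two": [11.0, 14.0]})

# Each column is passed as x; sub=5 comes from args, divide=3 from the keyword.
result = df.apply(subtract_and_divide, args=(5,), divide=3)

# The same computation written directly on the whole DataFrame.
expected = (df - 5) / 3
```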
Another useful feature is the ability to pass Series methods to carry out some Series operation on each column or row:
In [11]: tsdf
Out[11]:
A B C
2000-01-01 -0.867744 -0.739910 0.263414
2000-01-02 -0.252225 1.662201 -0.773457
2000-01-03 -0.203230 0.012023 0.567693
2000-01-04 NaN NaN NaN
... ... ... ...
2000-01-07 NaN NaN NaN
2000-01-08 -0.333957 -0.364556 -0.363141
2000-01-09 -1.198297 0.938670 -1.528172
2000-01-10 0.501602 3.103739 -1.647841
[10 rows x 3 columns]
In [12]: tsdf.apply(pd.Series.interpolate)
Out[12]:
A B C
2000-01-01 -0.867744 -0.739910 0.263414
2000-01-02 -0.252225 1.662201 -0.773457
2000-01-03 -0.203230 0.012023 0.567693
2000-01-04 -0.229375 -0.063293 0.381526
... ... ... ...
2000-01-07 -0.307812 -0.289240 -0.176974
2000-01-08 -0.333957 -0.364556 -0.363141
2000-01-09 -1.198297 0.938670 -1.528172
2000-01-10 0.501602 3.103739 -1.647841
[10 rows x 3 columns]
Finally, apply() takes an argument raw, which is False by default. With the
default, each row or column is converted into a Series before the function is
applied. When set to True, the passed function will instead receive an ndarray
object, which has positive performance implications if you do not need the
indexing functionality.
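A minimal sketch of the difference, using a made-up helper is_ndarray (hypothetical) to reveal what the applied function actually receives in each mode:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(6.0).reshape(3, 2), columns=["one", "two"])


# Hypothetical helper: reports whether apply() handed over a bare ndarray.
def is_ndarray(x):
    return isinstance(x, np.ndarray)


received_default = df.apply(is_ndarray)        # raw=False: x is a Series
received_raw = df.apply(is_ndarray, raw=True)  # raw=True: x is an ndarray

# A function that needs no index gives the same answer either way.
range_default = df.apply(lambda x: x.max() - x.min())
range_raw = df.apply(lambda x: x.max() - x.min(), raw=True)
```

With raw=True the function cannot use labels such as x['a'] or methods such as x.idxmax(), since an ndarray carries no index.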
See also
The section on GroupBy demonstrates related, flexible functionality for grouping by some criterion, applying, and combining the results into a Series, DataFrame, etc.
6.3 Applying elementwise Python functions
Since not all functions can be vectorized (accept NumPy arrays and return
another array or value), the methods applymap() on DataFrame
and analogously map() on Series accept any Python function taking
a single value and returning a single value. For example:
In [13]: df4
Out[13]:
one three two
a -0.539499 NaN 0.638015
b 0.599181 0.348593 2.283162
c 1.217597 -0.090397 -0.940880
d NaN -0.066985 0.373389
In [14]: f = lambda x: len(str(x))
In [15]: df4['one'].map(f)
Out[15]:
a 15
b 14
c 13
d 3
Name: one, dtype: int64
In [16]: df4.applymap(f)
Out[16]:
one three two
a 15 3 14
b 14 14 13
c 13 16 15
d 3 16 14
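The same idea in a self-contained form, with made-up data. To keep the sketch independent of the DataFrame-level method, it applies Series.map() to each column through apply(); this is an assumption made for illustration, and for a scalar-in, scalar-out function it produces the same elementwise result.

```python
import pandas as pd

# Illustrative data with values whose string lengths are easy to count.
df4 = pd.DataFrame({"one": [1.5, 22.25], "two": [333.0, 4.125]})


def text_length(x):
    # Takes a single value and returns a single value.
    return len(str(x))


# Elementwise on one Series.
by_series = df4["one"].map(text_length)

# Elementwise on every column of the DataFrame.
by_frame = df4.apply(lambda col: col.map(text_length))
```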
Series.map() has an additional feature: it can be used to easily
“link” or “map” values defined by a secondary series. This is closely related
to merging/joining functionality:
In [17]: s = pd.Series(['six', 'seven', 'six', 'seven', 'six'],
   ....:               index=['a', 'b', 'c', 'd', 'e'])
   ....:
In [18]: t = pd.Series({'six' : 6., 'seven' : 7.})
In [19]: s
Out[19]:
a six
b seven
c six
d seven
e six
dtype: object
In [20]: s.map(t)
Out[20]:
a 6.0
b 7.0
c 6.0
d 7.0
e 6.0
dtype: float64
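One detail worth checking when using a lookup Series this way is what happens to values that have no entry in it. The sketch below, with made-up data, shows that such values come back as NaN, much like unmatched keys in a left join.

```python
import numpy as np
import pandas as pd

s = pd.Series(["six", "seven", "eight"], index=["a", "b", "c"])

# Lookup table: the index holds the keys, the values hold the replacements.
t = pd.Series({"six": 6.0, "seven": 7.0})

# "eight" has no entry in t, so position "c" becomes NaN.
mapped = s.map(t)
```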
6.4 Applying with a Panel
Applying with a Panel will pass a Series to the applied function. If the applied
function returns a Series, the result of the application will be a Panel. If the applied function
reduces to a scalar, the result of the application will be a DataFrame.
Note
Prior to 0.13.1 apply on a Panel would only work on ufuncs (e.g. np.sum/np.max).
In [21]: import pandas.util.testing as tm
In [22]: panel = tm.makePanel(5)
In [23]: panel
Out[23]:
<class 'pandas.core.panel.Panel'>
Dimensions: 3 (items) x 5 (major_axis) x 4 (minor_axis)
Items axis: ItemA to ItemC
Major_axis axis: 2000-01-03 00:00:00 to 2000-01-07 00:00:00
Minor_axis axis: A to D
In [24]: panel['ItemA']
Out[24]:
A B C D
2000-01-03 -0.058321 1.541779 1.642293 0.526628
2000-01-04 -0.015366 -1.028011 0.646377 2.143954
2000-01-05 -0.246782 1.095148 0.463691 -1.616592
2000-01-06 -0.867023 -0.880842 -1.465336 0.287073
2000-01-07 0.508859 -0.556211 -0.610470 -0.004885
A transformational apply.
In [25]: result = panel.apply(lambda x: x*2, axis='items')
In [26]: result
Out[26]:
<class 'pandas.core.panel.Panel'>
Dimensions: 3 (items) x 5 (major_axis) x 4 (minor_axis)
Items axis: ItemA to ItemC
Major_axis axis: 2000-01-03 00:00:00 to 2000-01-07 00:00:00
Minor_axis axis: A to D
In [27]: result['ItemA']
Out[27]:
A B C D
2000-01-03 -0.116642 3.083559 3.284586 1.053255
2000-01-04 -0.030733 -2.056023 1.292754 4.287908
2000-01-05 -0.493565 2.190296 0.927382 -3.233183
2000-01-06 -1.734047 -1.761684 -2.930673 0.574147
2000-01-07 1.017718 -1.112422 -1.220940 -0.009770
A reduction operation.
In [28]: panel.apply(lambda x: x.dtype, axis='items')
Out[28]:
A B C D
2000-01-03 float64 float64 float64 float64
2000-01-04 float64 float64 float64 float64
2000-01-05 float64 float64 float64 float64
2000-01-06 float64 float64 float64 float64
2000-01-07 float64 float64 float64 float64
A similar reduction type operation
In [29]: panel.apply(lambda x: x.sum(), axis='major_axis')
Out[29]:
ItemA ItemB ItemC
A -0.678634 -1.814294 0.389953
B 0.171863 3.260122 0.264703
C 0.676554 -0.926901 0.250588
D 1.336178 -1.196204 2.904298
This last reduction is equivalent to
In [30]: panel.sum('major_axis')
Out[30]:
ItemA ItemB ItemC
A -0.678634 -1.814294 0.389953
B 0.171863 3.260122 0.264703
C 0.676554 -0.926901 0.250588
D 1.336178 -1.196204 2.904298
A transformation operation that returns a Panel, this time computing
the z-score across the major_axis.
In [31]: result = panel.apply(
   ....:     lambda x: (x-x.mean())/x.std(),
   ....:     axis='major_axis')
   ....:
In [32]: result
Out[32]:
<class 'pandas.core.panel.Panel'>
Dimensions: 3 (items) x 5 (major_axis) x 4 (minor_axis)
Items axis: ItemA to ItemC
Major_axis axis: 2000-01-03 00:00:00 to 2000-01-07 00:00:00
Minor_axis axis: A to D
In [33]: result['ItemA']
Out[33]:
A B C D
2000-01-03 0.156137 1.261375 1.256006 0.193170
2000-01-04 0.242782 -0.888986 0.425952 1.397600
2000-01-05 -0.224012 0.887640 0.273691 -1.402895
2000-01-06 -1.475113 -0.765838 -1.334072 0.014773
2000-01-07 1.300207 -0.494191 -0.621577 -0.202649
Apply can also accept multiple axes in the axis argument. This will pass a
DataFrame of the cross-section to the applied function.
In [34]: f = lambda x: ((x.T-x.mean(1))/x.std(1)).T
In [35]: result = panel.apply(f, axis = ['items','major_axis'])
In [36]: result
Out[36]:
<class 'pandas.core.panel.Panel'>
Dimensions: 4 (items) x 5 (major_axis) x 3 (minor_axis)
Items axis: A to D
Major_axis axis: 2000-01-03 00:00:00 to 2000-01-07 00:00:00
Minor_axis axis: ItemA to ItemC
In [37]: result.loc[:,:,'ItemA']
Out[37]:
A B C D
2000-01-03 0.619463 0.896525 1.075947 0.420792
2000-01-04 1.059012 -1.096130 0.891046 0.754187
2000-01-05 0.058672 1.047126 0.492211 -1.064774
2000-01-06 -0.613789 -0.435602 -1.154147 -1.084692
2000-01-07 0.044797 -0.908652 -1.071002 -0.483408
This is equivalent to the following
In [38]: result = pd.Panel(dict([ (ax, f(panel.loc[:,:,ax]))
   ....:                          for ax in panel.minor_axis ]))
   ....:
In [39]: result
Out[39]:
<class 'pandas.core.panel.Panel'>
Dimensions: 4 (items) x 5 (major_axis) x 3 (minor_axis)
Items axis: A to D
Major_axis axis: 2000-01-03 00:00:00 to 2000-01-07 00:00:00
Minor_axis axis: ItemA to ItemC
In [40]: result.loc[:,:,'ItemA']
Out[40]:
A B C D
2000-01-03 0.619463 0.896525 1.075947 0.420792
2000-01-04 1.059012 -1.096130 0.891046 0.754187
2000-01-05 0.058672 1.047126 0.492211 -1.064774
2000-01-06 -0.613789 -0.435602 -1.154147 -1.084692
2000-01-07 0.044797 -0.908652 -1.071002 -0.483408