4.3 plyr
plyr is an R library for the split-apply-combine strategy for data
analysis. The functions revolve around three data structures in R: a
for arrays, l for lists, and d for data.frame. The table below shows
how these data structures could be mapped in Python.
R | Python |
---|---|
array | list |
lists | dictionary or list of objects |
data.frame | dataframe |
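To make the mapping concrete, here is a minimal sketch of the three Python counterparts, followed by the same split-apply-combine step done by hand and with pandas. The variable names and sample values are invented for illustration and are not part of the original example.

```python
import pandas as pd

# Hypothetical sample data, chosen only to illustrate the mapping.
values = [1.0, 2.0, 3.0, 4.0]        # R array      -> Python list
records = [                          # R list       -> list of dicts
    {"month": 5, "x": 1.0},
    {"month": 6, "x": 2.0},
    {"month": 5, "x": 3.0},
    {"month": 6, "x": 4.0},
]
frame = pd.DataFrame(records)        # R data.frame -> pandas DataFrame

# Split-apply-combine by hand on the list of dicts.
groups = {}
for rec in records:
    groups.setdefault(rec["month"], []).append(rec["x"])     # split
manual_means = {m: sum(xs) / len(xs) for m, xs in groups.items()}  # apply + combine

# The same operation expressed with pandas.
pandas_means = frame.groupby("month")["x"].mean().to_dict()
```

Both approaches produce one mean per month; the pandas version simply hides the split and combine bookkeeping behind groupby().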
4.3.1 ddply
An expression using a data.frame called df in R where you want to
summarize x by month and week:
require(plyr)
df <- data.frame(
  x = runif(120, 1, 168),
  y = runif(120, 7, 334),
  z = runif(120, 1.7, 20.7),
  month = rep(c(5,6,7,8),30),
  week = sample(1:4, 120, TRUE)
)
ddply(df, .(month, week), summarize,
      mean = round(mean(x), 2),
      sd = round(sd(x), 2))
In pandas the equivalent expression, using the groupby() method, would
be:
In [1]: df = pd.DataFrame({
   ...:     'x': np.random.uniform(1., 168., 120),
   ...:     'y': np.random.uniform(7., 334., 120),
   ...:     'z': np.random.uniform(1.7, 20.7, 120),
   ...:     'month': [5,6,7,8]*30,
   ...:     'week': np.random.randint(1,4, 120)
   ...: })
   ...:
In [2]: grouped = df.groupby(['month','week'])
In [3]: grouped['x'].agg(['mean', 'std'])
Out[3]:
                 mean        std
month week
5     1     63.653367  40.601965
      2     78.126605  53.342400
      3     92.091886  57.630110
6     1     81.747070  54.339218
...               ...        ...
7     3     71.688795  37.595638
8     1     62.741922  34.618153
      2     91.774627  49.790202
      3     73.936856  60.773900

[12 rows x 2 columns]
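The session above leaves the result unrounded and labels the columns mean and std, whereas the R call rounds to two decimals and names the columns mean and sd. The following is a sketch of a closer match, not the only way to write it. It assumes a pandas version that supports named aggregation (0.25 or later); the seed and the name summary are arbitrary choices made here. Note also that the upper bound of the integer sampler is exclusive, so weeks 1 through 4 (as drawn by sample(1:4, ...) in R) require an upper bound of 5.

```python
import numpy as np
import pandas as pd

# Seeded generator so the sketch is reproducible; the seed value is arbitrary.
rng = np.random.default_rng(0)

df = pd.DataFrame({
    "x": rng.uniform(1.0, 168.0, 120),
    "y": rng.uniform(7.0, 334.0, 120),
    "z": rng.uniform(1.7, 20.7, 120),
    "month": [5, 6, 7, 8] * 30,
    # High end is exclusive: (1, 5) draws weeks 1, 2, 3, 4.
    "week": rng.integers(1, 5, 120),
})

summary = (
    df.groupby(["month", "week"])["x"]
      .agg(mean="mean", sd="std")   # named aggregation: output name = function
      .round(2)                     # mirrors round(..., 2) in the R call
      .reset_index()                # month and week become ordinary columns
)
print(summary)
```

Calling reset_index() returns the group keys as regular columns, which is closer to the flat data.frame that ddply produces. Both sd() in R and std in pandas compute the sample standard deviation by default.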
For more details and examples see the groupby documentation.