4.2 Base R
4.2.1 Slicing with R’s c
R makes it easy to access data.frame columns by name
df <- data.frame(a=rnorm(5), b=rnorm(5), c=rnorm(5), d=rnorm(5), e=rnorm(5))
df[, c("a", "c", "e")]
or by integer location
df <- data.frame(matrix(rnorm(1000), ncol=100))
df[, c(1:10, 25:30, 40, 50:100)]
Selecting multiple columns by name in pandas is straightforward:
In [1]: df = pd.DataFrame(np.random.randn(10, 3), columns=list('abc'))
In [2]: df[['a', 'c']]
Out[2]:
a c
0 0.469112 -1.509059
1 -1.135632 -0.173215
2 0.119209 -0.861849
3 -2.104569 1.071804
.. ... ...
6 0.276232 -0.673690
7 0.113648 0.524988
8 0.404705 -1.715002
9 -1.039268 -1.157892
[10 rows x 2 columns]
In [3]: df.loc[:, ['a', 'c']]
Out[3]:
a c
0 0.469112 -1.509059
1 -1.135632 -0.173215
2 0.119209 -0.861849
3 -2.104569 1.071804
.. ... ...
6 0.276232 -0.673690
7 0.113648 0.524988
8 0.404705 -1.715002
9 -1.039268 -1.157892
[10 rows x 2 columns]
Selecting multiple noncontiguous columns by integer location can be achieved with a combination of the iloc indexer attribute and numpy.r_.
In [4]: named = list('abcdefg')
In [5]: n = 30
In [6]: columns = named + np.arange(len(named), n).tolist()
In [7]: df = pd.DataFrame(np.random.randn(n, n), columns=columns)
In [8]: df.iloc[:, np.r_[:10, 24:30]]
Out[8]:
a b c d e f g \
0 -1.344312 0.844885 1.075770 -0.109050 1.643563 -1.469388 0.357021
1 -0.076467 -1.187678 1.130127 -1.436737 -1.413681 1.607920 1.024180
2 0.176444 0.403310 -0.154951 0.301624 -2.179861 -1.369849 -0.954208
3 0.132885 -0.023688 2.410179 1.450520 0.206053 -0.251905 -2.213588
.. ... ... ... ... ... ... ...
26 0.725238 0.624607 -0.141185 -0.143948 -0.328162 2.095086 -0.608888
27 1.262419 1.950057 0.301038 -0.933858 0.814946 0.181439 -0.110015
28 -1.585746 -0.899734 0.921494 -0.211762 -0.059182 0.058308 0.915377
29 -0.986248 0.169729 -1.158091 1.019673 0.646039 0.917399 -0.010435
7 8 9 24 25 26 27 \
0 -0.674600 -1.776904 -0.968914 -1.170299 -0.226169 0.410835 0.813850
1 0.569605 0.875906 -2.211372 0.959726 -1.110336 -0.619976 0.149748
2 1.462696 -1.743161 -0.826591 0.084844 0.432390 1.519970 -0.493662
3 1.063327 1.266143 0.299368 -2.484478 -0.281461 0.030711 0.109121
.. ... ... ... ... ... ... ...
26 -0.926422 1.872601 -2.513465 -0.846188 1.190624 0.778507 1.008500
27 -2.364638 -1.584814 0.307941 -1.341814 0.334281 -0.162227 1.007824
28 -0.696321 0.150664 -3.060395 0.403620 -0.026602 -0.240481 0.577223
29 0.366366 0.922729 0.869610 -1.209247 -0.671466 0.332872 -2.013086
28 29
0 0.132003 -0.827317
1 -0.732339 0.687738
2 0.600178 0.274230
3 1.126203 -0.977349
.. ... ...
26 1.424017 0.717110
27 2.826008 1.458383
28 -1.088417 0.326687
29 -1.602549 0.333109
[30 rows x 16 columns]
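For reference, numpy.r_ is not pandas-specific: it simply concatenates slice specifications and scalars into a single integer index array, which iloc then consumes. A quick illustration:

import numpy as np

# slices and scalars are expanded and concatenated into one array
np.r_[:10, 24:30]
# array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 24, 25, 26, 27, 28, 29])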
4.2.2 aggregate
In R you may want to split data into subsets and compute the mean for each. Using a data.frame called df and splitting it into groups by1 and by2:
df <- data.frame(
  v1 = c(1,3,5,7,8,3,5,NA,4,5,7,9),
  v2 = c(11,33,55,77,88,33,55,NA,44,55,77,99),
  by1 = c("red", "blue", 1, 2, NA, "big", 1, 2, "red", 1, NA, 12),
  by2 = c("wet", "dry", 99, 95, NA, "damp", 95, 99, "red", 99, NA, NA))

aggregate(x=df[, c("v1", "v2")], by=list(df$by1, df$by2), FUN = mean)
The groupby() method is similar to the base R aggregate function.
In [9]: df = pd.DataFrame({
...: 'v1': [1,3,5,7,8,3,5,np.nan,4,5,7,9],
...: 'v2': [11,33,55,77,88,33,55,np.nan,44,55,77,99],
...: 'by1': ["red", "blue", 1, 2, np.nan, "big", 1, 2, "red", 1, np.nan, 12],
...: 'by2': ["wet", "dry", 99, 95, np.nan, "damp", 95, 99, "red", 99, np.nan,
...: np.nan]
...: })
...:
In [10]: g = df.groupby(['by1','by2'])
In [11]: g[['v1','v2']].mean()
Out[11]:
v1 v2
by1 by2
1 95 5.0 55.0
99 5.0 55.0
2 95 7.0 77.0
99 NaN NaN
big damp 3.0 33.0
blue dry 3.0 33.0
red red 4.0 44.0
wet 1.0 11.0
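Unlike R’s aggregate, which returns the grouping variables as ordinary columns, groupby() places them in the (here hierarchical) index. A sketch of how to recover the flat, R-like shape, using the df defined above:

# as_index=False keeps the grouping keys as regular columns,
# matching the shape of R's aggregate output
df.groupby(['by1', 'by2'], as_index=False)[['v1', 'v2']].mean()

# equivalently, flatten the index after the fact
g[['v1', 'v2']].mean().reset_index()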
For more details and examples see the groupby documentation.
4.2.3 match
A common way to select data in R is using %in%, which is defined using the function match. The operator %in% is used to return a logical vector indicating whether there is a match or not:
s <- 0:4
s %in% c(2,4)
The isin() method is similar to the R %in% operator:
In [12]: s = pd.Series(np.arange(5),dtype=np.float32)
In [13]: s.isin([2, 4])
Out[13]:
0 False
1 False
2 True
3 False
4 True
dtype: bool
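The negated form !(s %in% c(2,4)) carries over as well; the boolean Series returned by isin() is inverted with the ~ operator:

# True where there is no match, like !(s %in% c(2,4)) in R
~s.isin([2, 4])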
The match function returns a vector of the positions of matches of its first argument in its second:
s <- 0:4
match(s, c(2,4))
The pandas match() function can be used to replicate this (note that the positions are 0-based, and the given sentinel, here np.nan, marks the values with no match):
In [14]: s = pd.Series(np.arange(5),dtype=np.float32)
In [15]: pd.Series(pd.match(s,[2,4],np.nan))
Out[15]:
0 NaN
1 NaN
2 0.0
3 NaN
4 1.0
dtype: float64
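Note that pd.match was deprecated and removed in later versions of pandas. In recent versions the same positions can be obtained with Index.get_indexer(), which uses -1 (rather than a sentinel of your choosing) for non-matches:

# 0-based positions of each value of s in [2, 4]; -1 means no match
pd.Index([2, 4]).get_indexer(s)
# array([-1, -1,  0, -1,  1])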
For more details and examples see the reshaping documentation.
4.2.4 tapply
tapply is similar to aggregate, but data can be in a ragged array, since the subclass sizes are possibly irregular. Using a data.frame called baseball, and retrieving information based on the array team:
baseball <-
  data.frame(team = gl(5, 5,
                       labels = paste("Team", LETTERS[1:5])),
             player = sample(letters, 25),
             batting.average = runif(25, .200, .400))

tapply(baseball$batting.average, baseball$team, max)
In pandas we may use the pivot_table() method to handle this:
In [16]: import random
In [17]: import string
In [18]: baseball = pd.DataFrame({
....: 'team': ["team %d" % (x+1) for x in range(5)]*5,
....: 'player': random.sample(list(string.ascii_lowercase),25),
....: 'batting avg': np.random.uniform(.200, .400, 25)
....: })
....:
In [19]: baseball.pivot_table(values='batting avg', columns='team', aggfunc=np.max)
Out[19]:
team
team 1 0.352134
team 2 0.295327
team 3 0.397191
team 4 0.394457
team 5 0.396194
Name: batting avg, dtype: float64
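Since the grouping here involves only a single column, a plain groupby() is an equally direct translation of tapply; a sketch using the baseball frame above:

# maximum batting average per team, analogous to tapply(..., max)
baseball.groupby('team')['batting avg'].max()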
For more details and examples see the reshaping documentation.
4.2.5 subset
New in version 0.13.
The query() method is similar to the base R subset function. In R you might want to get the rows of a data.frame where one column’s values are less than another column’s values:
df <- data.frame(a=rnorm(10), b=rnorm(10))
subset(df, a <= b)
df[df$a <= df$b,] # note the comma
In pandas, there are a few ways to perform subsetting. You can use query() or pass an expression as if it were an index/slice, as well as standard boolean indexing:
In [20]: df = pd.DataFrame({'a': np.random.randn(10), 'b': np.random.randn(10)})
In [21]: df.query('a <= b')
Out[21]:
a b
1 0.174950 0.552887
2 -0.023167 0.148084
3 -0.495291 -0.300218
4 -0.860736 0.197378
5 -1.134146 1.720780
7 -0.290098 0.083515
8 0.238636 0.946550
In [22]: df[df.a <= df.b]
Out[22]:
a b
1 0.174950 0.552887
2 -0.023167 0.148084
3 -0.495291 -0.300218
4 -0.860736 0.197378
5 -1.134146 1.720780
7 -0.290098 0.083515
8 0.238636 0.946550
In [23]: df.loc[df.a <= df.b]
Out[23]:
a b
1 0.174950 0.552887
2 -0.023167 0.148084
3 -0.495291 -0.300218
4 -0.860736 0.197378
5 -1.134146 1.720780
7 -0.290098 0.083515
8 0.238636 0.946550
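Compound conditions translate directly as well. Boolean indexing requires the element-wise operators & and | with parentheses around each condition, while query() also accepts R-like and/or. A sketch, assuming the same df:

# R: subset(df, a <= b & b > 0)
df[(df.a <= df.b) & (df.b > 0)]   # parentheses are required
df.query('a <= b and b > 0')      # '&' and '|' work inside query() too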
For more details and examples see the query documentation.
4.2.6 with
New in version 0.13.
An expression using a data.frame called df in R with the columns a and b would be evaluated using with like so:
df <- data.frame(a=rnorm(10), b=rnorm(10))
with(df, a + b)
df$a + df$b # same as the previous expression
In pandas the equivalent expression, using the eval() method, would be:
In [24]: df = pd.DataFrame({'a': np.random.randn(10), 'b': np.random.randn(10)})
In [25]: df.eval('a + b')
Out[25]:
0 -0.091430
1 -2.483890
2 -0.252728
3 -0.626444
...
6 -0.332214
7 0.799331
8 -2.377245
9 2.104677
dtype: float64
In [26]: df.a + df.b # same as the previous expression
Out[26]:
0 -0.091430
1 -2.483890
2 -0.252728
3 -0.626444
...
6 -0.332214
7 0.799331
8 -2.377245
9 2.104677
dtype: float64
In certain cases eval() will be much faster than evaluation in pure Python. For more details and examples see the eval documentation.
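Whether eval() pays off depends on the size of the frame and the complexity of the expression; large frames and longer expressions benefit most. A minimal sketch for timing both forms yourself, using only the standard library’s timeit:

import timeit

import numpy as np
import pandas as pd

# eval() hands the whole expression to numexpr (when available),
# avoiding intermediate Python-level temporaries
big = pd.DataFrame({'a': np.random.randn(10 ** 6),
                    'b': np.random.randn(10 ** 6)})

print(timeit.timeit(lambda: big.eval('a + b'), number=10))
print(timeit.timeit(lambda: big.a + big.b, number=10))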