4.2 Base R

4.2.1 Slicing with R’s c

R makes it easy to access data.frame columns by name:

df <- data.frame(a=rnorm(5), b=rnorm(5), c=rnorm(5), d=rnorm(5), e=rnorm(5))
df[, c("a", "c", "e")]

or by integer location:

df <- data.frame(matrix(rnorm(1000), ncol=100))
df[, c(1:10, 25:30, 40, 50:100)]

Selecting multiple columns by name in pandas is straightforward:

In [1]: df = pd.DataFrame(np.random.randn(10, 3), columns=list('abc'))

In [2]: df[['a', 'c']]
Out[2]: 
           a         c
0   0.469112 -1.509059
1  -1.135632 -0.173215
2   0.119209 -0.861849
3  -2.104569  1.071804
..       ...       ...
6   0.276232 -0.673690
7   0.113648  0.524988
8   0.404705 -1.715002
9  -1.039268 -1.157892

[10 rows x 2 columns]

In [3]: df.loc[:, ['a', 'c']]
Out[3]: 
           a         c
0   0.469112 -1.509059
1  -1.135632 -0.173215
2   0.119209 -0.861849
3  -2.104569  1.071804
..       ...       ...
6   0.276232 -0.673690
7   0.113648  0.524988
8   0.404705 -1.715002
9  -1.039268 -1.157892

[10 rows x 2 columns]
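
A point worth noting when coming from R: label-based slices with loc include both endpoints, unlike Python's positional slices. A minimal sketch using the same df as above:

# Label slicing with .loc is inclusive of both endpoints,
# so this selects columns 'a', 'b' and 'c'.
df.loc[:, 'a':'c']

# A positional slice with .iloc follows normal Python semantics
# and excludes the endpoint: columns 0 and 1 only.
df.iloc[:, 0:2]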

Selecting multiple noncontiguous columns by integer location can be achieved with a combination of the iloc indexer attribute and numpy.r_.

In [4]: named = list('abcdefg')

In [5]: n = 30

In [6]: columns = named + np.arange(len(named), n).tolist()

In [7]: df = pd.DataFrame(np.random.randn(n, n), columns=columns)

In [8]: df.iloc[:, np.r_[:10, 24:30]]
Out[8]: 
           a         b         c         d         e         f         g  \
0  -1.344312  0.844885  1.075770 -0.109050  1.643563 -1.469388  0.357021   
1  -0.076467 -1.187678  1.130127 -1.436737 -1.413681  1.607920  1.024180   
2   0.176444  0.403310 -0.154951  0.301624 -2.179861 -1.369849 -0.954208   
3   0.132885 -0.023688  2.410179  1.450520  0.206053 -0.251905 -2.213588   
..       ...       ...       ...       ...       ...       ...       ...   
26  0.725238  0.624607 -0.141185 -0.143948 -0.328162  2.095086 -0.608888   
27  1.262419  1.950057  0.301038 -0.933858  0.814946  0.181439 -0.110015   
28 -1.585746 -0.899734  0.921494 -0.211762 -0.059182  0.058308  0.915377   
29 -0.986248  0.169729 -1.158091  1.019673  0.646039  0.917399 -0.010435   

           7         8         9        24        25        26        27  \
0  -0.674600 -1.776904 -0.968914 -1.170299 -0.226169  0.410835  0.813850   
1   0.569605  0.875906 -2.211372  0.959726 -1.110336 -0.619976  0.149748   
2   1.462696 -1.743161 -0.826591  0.084844  0.432390  1.519970 -0.493662   
3   1.063327  1.266143  0.299368 -2.484478 -0.281461  0.030711  0.109121   
..       ...       ...       ...       ...       ...       ...       ...   
26 -0.926422  1.872601 -2.513465 -0.846188  1.190624  0.778507  1.008500   
27 -2.364638 -1.584814  0.307941 -1.341814  0.334281 -0.162227  1.007824   
28 -0.696321  0.150664 -3.060395  0.403620 -0.026602 -0.240481  0.577223   
29  0.366366  0.922729  0.869610 -1.209247 -0.671466  0.332872 -2.013086   

          28        29  
0   0.132003 -0.827317  
1  -0.732339  0.687738  
2   0.600178  0.274230  
3   1.126203 -0.977349  
..       ...       ...  
26  1.424017  0.717110  
27  2.826008  1.458383  
28 -1.088417  0.326687  
29 -1.602549  0.333109  

[30 rows x 16 columns]
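
Under the hood, np.r_ simply translates the slice notation into a single integer array, which iloc then consumes like any other list of positions:

import numpy as np

# np.r_ concatenates slices and scalars into one index array:
np.r_[:10, 24:30]
# array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 24, 25, 26, 27, 28, 29])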

4.2.2 aggregate

In R you may want to split data into subsets and compute the mean for each. Using a data.frame called df and splitting it into groups by1 and by2:

df <- data.frame(
  v1 = c(1,3,5,7,8,3,5,NA,4,5,7,9),
  v2 = c(11,33,55,77,88,33,55,NA,44,55,77,99),
  by1 = c("red", "blue", 1, 2, NA, "big", 1, 2, "red", 1, NA, 12),
  by2 = c("wet", "dry", 99, 95, NA, "damp", 95, 99, "red", 99, NA, NA))
aggregate(x=df[, c("v1", "v2")], by=list(df$by1, df$by2), FUN = mean)

The groupby() method is similar to the base R aggregate function.

In [9]: df = pd.DataFrame({
   ...:   'v1': [1,3,5,7,8,3,5,np.nan,4,5,7,9],
   ...:   'v2': [11,33,55,77,88,33,55,np.nan,44,55,77,99],
   ...:   'by1': ["red", "blue", 1, 2, np.nan, "big", 1, 2, "red", 1, np.nan, 12],
   ...:   'by2': ["wet", "dry", 99, 95, np.nan, "damp", 95, 99, "red", 99, np.nan,
   ...:           np.nan]
   ...: })
   ...: 

In [10]: g = df.groupby(['by1', 'by2'])

In [11]: g[['v1', 'v2']].mean()
Out[11]: 
            v1    v2
by1  by2            
1    95    5.0  55.0
     99    5.0  55.0
2    95    7.0  77.0
     99    NaN   NaN
big  damp  3.0  33.0
blue dry   3.0  33.0
red  red   4.0  44.0
     wet   1.0  11.0
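
Note that groupby() returns the group keys as a (here hierarchical) index, whereas R's aggregate returns them as ordinary columns. If you want output shaped like R's, passing as_index=False keeps the keys as regular columns; a quick sketch:

# as_index=False returns by1/by2 as ordinary columns,
# mirroring the flat data.frame that aggregate() produces.
df.groupby(['by1', 'by2'], as_index=False)[['v1', 'v2']].mean()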

For more details and examples see the groupby documentation.

4.2.3 match

A common way to select data in R is with %in%, which is defined using the function match. The %in% operator returns a logical vector indicating whether each element has a match:

s <- 0:4
s %in% c(2,4)

The isin() method is similar to the R %in% operator:

In [12]: s = pd.Series(np.arange(5), dtype=np.float32)

In [13]: s.isin([2, 4])
Out[13]: 
0    False
1    False
2     True
3    False
4     True
dtype: bool
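
As with %in% in R, the result is a boolean mask, so negation composes naturally; R's !(s %in% c(2,4)) becomes:

# ~ inverts the boolean mask, selecting the non-matching elements.
s[~s.isin([2, 4])]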

The match function returns a vector of the positions of matches of its first argument in its second:

s <- 0:4
match(s, c(2,4))

The top-level pd.match() function can be used to replicate this:

In [14]: s = pd.Series(np.arange(5), dtype=np.float32)

In [15]: pd.Series(pd.match(s, [2, 4], np.nan))
Out[15]: 
0    NaN
1    NaN
2    0.0
3    NaN
4    1.0
dtype: float64
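
Note that pd.match was deprecated and later removed in newer pandas releases; assuming a recent version, Index.get_indexer yields the same positions, with -1 (rather than a sentinel of your choice) marking non-matches:

# get_indexer returns the position of each element of s within
# the index [2, 4], using -1 where there is no match.
pd.Index([2, 4]).get_indexer(s)
# array([-1, -1,  0, -1,  1])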

For more details and examples see the reshaping documentation.

4.2.4 tapply

tapply is similar to aggregate, but the data can be in a ragged array, since the subgroup sizes may be irregular. Using a data.frame called baseball and retrieving information based on the array team:

baseball <-
  data.frame(team = gl(5, 5,
             labels = paste("Team", LETTERS[1:5])),
             player = sample(letters, 25),
             batting.average = runif(25, .200, .400))

tapply(baseball$batting.average, baseball$team, max)

In pandas we may use the pivot_table() method to handle this:

In [16]: import random

In [17]: import string

In [18]: baseball = pd.DataFrame({
   ....:    'team': ["team %d" % (x + 1) for x in range(5)] * 5,
   ....:    'player': random.sample(list(string.ascii_lowercase), 25),
   ....:    'batting avg': np.random.uniform(.200, .400, 25)
   ....:    })
   ....: 

In [19]: baseball.pivot_table(values='batting avg', columns='team', aggfunc=np.max)
Out[19]: 
team
team 1    0.352134
team 2    0.295327
team 3    0.397191
team 4    0.394457
team 5    0.396194
Name: batting avg, dtype: float64
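
The same per-team maxima can of course be computed with a plain groupby, which avoids reshaping entirely:

# Group by team and take the maximum batting average in each group.
baseball.groupby('team')['batting avg'].max()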

For more details and examples see the reshaping documentation.

4.2.5 subset

New in version 0.13.

The query() method is similar to the base R subset function. In R you might want to get the rows of a data.frame where one column’s values are less than another column’s values:

df <- data.frame(a=rnorm(10), b=rnorm(10))
subset(df, a <= b)
df[df$a <= df$b,]  # note the comma

In pandas, there are a few ways to perform subsetting: you can use query(), index directly with a boolean mask, or pass the mask to loc:

In [20]: df = pd.DataFrame({'a': np.random.randn(10), 'b': np.random.randn(10)})

In [21]: df.query('a <= b')
Out[21]: 
          a         b
1  0.174950  0.552887
2 -0.023167  0.148084
3 -0.495291 -0.300218
4 -0.860736  0.197378
5 -1.134146  1.720780
7 -0.290098  0.083515
8  0.238636  0.946550

In [22]: df[df.a <= df.b]
Out[22]: 
          a         b
1  0.174950  0.552887
2 -0.023167  0.148084
3 -0.495291 -0.300218
4 -0.860736  0.197378
5 -1.134146  1.720780
7 -0.290098  0.083515
8  0.238636  0.946550

In [23]: df.loc[df.a <= df.b]
Out[23]: 
          a         b
1  0.174950  0.552887
2 -0.023167  0.148084
3 -0.495291 -0.300218
4 -0.860736  0.197378
5 -1.134146  1.720780
7 -0.290098  0.083515
8  0.238636  0.946550
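
query() can also refer to local Python variables when they are prefixed with @, which has no direct analogue in subset. A small sketch, assuming a threshold variable named limit:

limit = 0

# @limit refers to the local Python variable, not a column.
df.query('a <= @limit')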

For more details and examples see the query documentation.

4.2.6 with

New in version 0.13.

An expression using a data.frame called df in R with the columns a and b would be evaluated using with like so:

df <- data.frame(a=rnorm(10), b=rnorm(10))
with(df, a + b)
df$a + df$b  # same as the previous expression

In pandas the equivalent expression, using the eval() method, would be:

In [24]: df = pd.DataFrame({'a': np.random.randn(10), 'b': np.random.randn(10)})

In [25]: df.eval('a + b')
Out[25]: 
0   -0.091430
1   -2.483890
2   -0.252728
3   -0.626444
       ...   
6   -0.332214
7    0.799331
8   -2.377245
9    2.104677
dtype: float64

In [26]: df.a + df.b  # same as the previous expression
Out[26]: 
0   -0.091430
1   -2.483890
2   -0.252728
3   -0.626444
       ...   
6   -0.332214
7    0.799331
8   -2.377245
9    2.104677
dtype: float64
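
eval() also accepts assignment expressions, which is closer in spirit to R's within() than to with; in recent pandas versions the result is a new DataFrame unless inplace=True is passed. A minimal sketch:

# Creates a new column c = a + b; the original df is left
# untouched unless inplace=True is passed.
df2 = df.eval('c = a + b')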

In certain cases eval() will be much faster than evaluation in pure Python. For more details and examples see the eval documentation.