4.1 Quick Reference
We’ll start off with a quick reference guide pairing some common R operations using dplyr with pandas equivalents.
4.1.1 Querying, Filtering, Sampling
R | pandas |
---|---|
dim(df) | df.shape |
head(df) | df.head() |
slice(df, 1:10) | df.iloc[:10] |
filter(df, col1 == 1, col2 == 1) | df.query('col1 == 1 & col2 == 1') |
df[df$col1 == 1 & df$col2 == 1,] | df[(df.col1 == 1) & (df.col2 == 1)] |
select(df, col1, col2) | df[['col1', 'col2']] |
select(df, col1:col3) | df.loc[:, 'col1':'col3'] |
select(df, -(col1:col3)) | df.drop(cols_to_drop, axis=1) but see [1] |
distinct(select(df, col1)) | df[['col1']].drop_duplicates() |
distinct(select(df, col1, col2)) | df[['col1', 'col2']].drop_duplicates() |
sample_n(df, 10) | df.sample(n=10) |
sample_frac(df, 0.01) | df.sample(frac=0.01) |
[1] | R’s shorthand for a subrange of columns (select(df, col1:col3)) can be approached cleanly in pandas if you have the list of columns, for example df[cols[1:3]] or df.drop(cols[1:3], axis=1), but doing this by column name is a bit messy. |
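As a rough illustration of the rows above, the following sketch runs a few of the pandas expressions on a small made-up frame. The data, the extra column col4, and the random_state value are assumptions added for the example; only the column names col1 to col3 come from the table.

```python
import pandas as pd

# Hypothetical data; only the names col1..col3 mirror the table above.
df = pd.DataFrame({
    "col1": [1, 1, 2, 1],
    "col2": [1, 2, 1, 1],
    "col3": [10, 20, 30, 40],
    "col4": [0.1, 0.2, 0.3, 0.4],
})

# filter(df, col1 == 1, col2 == 1): query string vs. boolean mask.
by_query = df.query("col1 == 1 & col2 == 1")
by_mask = df[(df.col1 == 1) & (df.col2 == 1)]

# slice(df, 1:2): R counts from 1 and includes the end,
# iloc counts from 0 and excludes the end.
first_rows = df.iloc[:2]

# select(df, -(col1:col3)): build the label range first, then drop it.
cols_to_drop = df.loc[:, "col1":"col3"].columns
remaining = df.drop(cols_to_drop, axis=1)

# distinct(select(df, col1, col2))
unique_pairs = df[["col1", "col2"]].drop_duplicates()

# sample_n(df, 2); random_state is set only to make the draw repeatable.
sampled = df.sample(n=2, random_state=0)
```

Building cols_to_drop from a label slice is one possible way to handle the "by column name" case that footnote [1] calls messy; it relies on the columns being contiguous in the frame.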
4.1.2 Sorting
R | pandas |
---|---|
arrange(df, col1, col2) | df.sort_values(['col1', 'col2']) |
arrange(df, desc(col1)) | df.sort_values('col1', ascending=False) |
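The sorting rows can be tried out with a sketch like the one below. The data is invented for the example, and the mixed-direction case is an addition not listed in the table: passing a list to ascending gives one flag per sort key.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({"col1": [2, 1, 2, 1], "col2": [1, 2, 0, 1]})

# arrange(df, col1, col2)
asc = df.sort_values(["col1", "col2"])

# arrange(df, desc(col1))
desc = df.sort_values("col1", ascending=False)

# arrange(df, col1, desc(col2)): one ascending flag per sort key.
mixed = df.sort_values(["col1", "col2"], ascending=[True, False])
```

Unlike arrange, sort_values keeps the original index labels on the sorted rows; chaining .reset_index(drop=True) renumbers them if that matters.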
4.1.3 Transforming
R | pandas |
---|---|
select(df, col_one = col1) | df.rename(columns={'col1': 'col_one'})['col_one'] |
rename(df, col_one = col1) | df.rename(columns={'col1': 'col_one'}) |
mutate(df, c=a-b) | df.assign(c=df.a-df.b) |
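A minimal sketch of the transforming rows, on made-up data. Two details here go beyond the table and are offered as suggestions rather than the documented equivalents: double brackets are used so the select-with-rename case returns a one-column DataFrame (the single-bracket form in the table returns a Series), and a lambda is passed to assign so the expression can refer to the renamed frame inside a method chain.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({"col1": [1, 2], "a": [5, 7], "b": [2, 3]})

# rename(df, col_one = col1): returns a new frame, df itself is unchanged.
renamed = df.rename(columns={"col1": "col_one"})

# select(df, col_one = col1): rename, then keep only that column.
# Double brackets keep the result a DataFrame.
selected = df.rename(columns={"col1": "col_one"})[["col_one"]]

# mutate(df, c = a - b)
with_c = df.assign(c=df.a - df.b)

# A callable is evaluated against the frame being assigned to,
# which helps when the frame is an intermediate step in a chain.
chained = df.rename(columns={"col1": "col_one"}).assign(c=lambda d: d.a - d.b)
```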
4.1.4 Grouping and Summarizing
R | pandas |
---|---|
summary(df) | df.describe() |
gdf <- group_by(df, col1) | gdf = df.groupby('col1') |
summarise(gdf, avg=mean(col1, na.rm=TRUE)) | df.groupby('col1').agg({'col1': 'mean'}) |
summarise(gdf, total=sum(col1)) | df.groupby('col1').sum() |
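The grouping rows might look like this in practice. This sketch departs from the table in two labeled ways: it summarizes a separate, hypothetical value column col2 rather than the grouping column itself, and it uses pandas named aggregation so the output columns get the names avg and total, as they would in summarise.

```python
import pandas as pd

# Hypothetical data; col2 is an assumed value column, not from the table.
df = pd.DataFrame({
    "col1": ["x", "x", "y"],
    "col2": [1.0, 3.0, 5.0],
})

# gdf <- group_by(df, col1)
gdf = df.groupby("col1")

# summarise(gdf, avg = mean(col2)): the keyword names the output column.
avg = gdf.agg(avg=("col2", "mean"))

# summarise(gdf, total = sum(col2))
total = gdf.agg(total=("col2", "sum"))

# as_index=False keeps col1 as an ordinary column, closer to dplyr's output.
flat = df.groupby("col1", as_index=False).agg(
    avg=("col2", "mean"),
    total=("col2", "sum"),
)
```

pandas skips missing values in mean and sum by default, which lines up with na.rm=TRUE on the R side.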