5.4 JOIN
JOINs can be performed with join()
or merge()
. By default,
join()
will join the DataFrames on their indices. Each method has
parameters allowing you to specify the type of join to perform (LEFT, RIGHT, INNER, FULL) or the
columns to join on (column names or indices).
In [1]: df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'D'],
...: 'value': np.random.randn(4)})
...:
In [2]: df2 = pd.DataFrame({'key': ['B', 'D', 'D', 'E'],
...: 'value': np.random.randn(4)})
...:
In [3]: df1
Out[3]:
key value
0 A -0.837865
1 B 0.609390
2 C 1.428977
3 D 0.581303
In [4]: df2
Out[4]:
key value
0 B -1.000217
1 D 1.522258
2 D 0.662935
3 E -0.799717
Assume we have two database tables of the same name and structure as our DataFrames.
Now let’s go over the various types of JOINs.
5.4.1 INNER JOIN
SELECT *
FROM df1
INNER JOIN df2
ON df1.key = df2.key;
# merge performs an INNER JOIN by default
In [5]: pd.merge(df1, df2, on='key')
Out[5]:
key value_x value_y
0 B 0.609390 -1.000217
1 D 0.581303 1.522258
2 D 0.581303 0.662935
merge()
also offers parameters for cases when you’d like to join one DataFrame’s
column with another DataFrame’s index.
In [6]: indexed_df2 = df2.set_index('key')
In [7]: pd.merge(df1, indexed_df2, left_on='key', right_index=True)
Out[7]:
key value_x value_y
1 B 0.609390 -1.000217
3 D 0.581303 1.522258
3 D 0.581303 0.662935
5.4.2 LEFT OUTER JOIN
-- show all records from df1
SELECT *
FROM df1
LEFT OUTER JOIN df2
ON df1.key = df2.key;
# show all records from df1
In [8]: pd.merge(df1, df2, on='key', how='left')
Out[8]:
key value_x value_y
0 A -0.837865 NaN
1 B 0.609390 -1.000217
2 C 1.428977 NaN
3 D 0.581303 1.522258
4 D 0.581303 0.662935
5.4.3 RIGHT JOIN
-- show all records from df2
SELECT *
FROM df1
RIGHT OUTER JOIN df2
ON df1.key = df2.key;
# show all records from df2
In [9]: pd.merge(df1, df2, on='key', how='right')
Out[9]:
key value_x value_y
0 B 0.609390 -1.000217
1 D 0.581303 1.522258
2 D 0.581303 0.662935
3 E NaN -0.799717
5.4.4 FULL JOIN
pandas also allows for FULL JOINs, which display both sides of the dataset, whether or not the joined columns find a match. As of writing, FULL JOINs are not supported in all RDBMS (MySQL).
-- show all records from both tables
SELECT *
FROM df1
FULL OUTER JOIN df2
ON df1.key = df2.key;
# show all records from both frames
In [10]: pd.merge(df1, df2, on='key', how='outer')
Out[10]:
key value_x value_y
0 A -0.837865 NaN
1 B 0.609390 -1.000217
2 C 1.428977 NaN
3 D 0.581303 1.522258
4 D 0.581303 0.662935
5 E NaN -0.799717