5.4 JOIN

JOINs can be performed with join() or merge(). By default, join() will join the DataFrames on their indices. Each method has parameters allowing you to specify the type of join to perform (LEFT, RIGHT, INNER, FULL) or the columns to join on (column names or indices).

In [1]: df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'D'],
   ...:                     'value': np.random.randn(4)})
   ...: 

In [2]: df2 = pd.DataFrame({'key': ['B', 'D', 'D', 'E'],
   ...:                     'value': np.random.randn(4)})
   ...: 

In [3]: df1
Out[3]: 
  key     value
0   A -0.837865
1   B  0.609390
2   C  1.428977
3   D  0.581303

In [4]: df2
Out[4]: 
  key     value
0   B -1.000217
1   D  1.522258
2   D  0.662935
3   E -0.799717

Assume we have two database tables of the same name and structure as our DataFrames.

Now let’s go over the various types of JOINs.

5.4.1 INNER JOIN

SELECT *
FROM df1
INNER JOIN df2
  ON df1.key = df2.key;
# merge performs an INNER JOIN by default
In [5]: pd.merge(df1, df2, on='key')
Out[5]: 
  key   value_x   value_y
0   B  0.609390 -1.000217
1   D  0.581303  1.522258
2   D  0.581303  0.662935

merge() also offers parameters for cases when you’d like to join one DataFrame’s column with another DataFrame’s index.

In [6]: indexed_df2 = df2.set_index('key')

In [7]: pd.merge(df1, indexed_df2, left_on='key', right_index=True)
Out[7]: 
  key   value_x   value_y
1   B  0.609390 -1.000217
3   D  0.581303  1.522258
3   D  0.581303  0.662935

5.4.2 LEFT OUTER JOIN

-- show all records from df1
SELECT *
FROM df1
LEFT OUTER JOIN df2
  ON df1.key = df2.key;
# show all records from df1
In [8]: pd.merge(df1, df2, on='key', how='left')
Out[8]: 
  key   value_x   value_y
0   A -0.837865       NaN
1   B  0.609390 -1.000217
2   C  1.428977       NaN
3   D  0.581303  1.522258
4   D  0.581303  0.662935

5.4.3 RIGHT JOIN

-- show all records from df2
SELECT *
FROM df1
RIGHT OUTER JOIN df2
  ON df1.key = df2.key;
# show all records from df2
In [9]: pd.merge(df1, df2, on='key', how='right')
Out[9]: 
  key   value_x   value_y
0   B  0.609390 -1.000217
1   D  0.581303  1.522258
2   D  0.581303  0.662935
3   E       NaN -0.799717

5.4.4 FULL JOIN

pandas also allows for FULL JOINs, which display both sides of the dataset, whether or not the joined columns find a match. As of writing, FULL JOINs are not supported in all RDBMS (MySQL).

-- show all records from both tables
SELECT *
FROM df1
FULL OUTER JOIN df2
  ON df1.key = df2.key;
# show all records from both frames
In [10]: pd.merge(df1, df2, on='key', how='outer')
Out[10]: 
  key   value_x   value_y
0   A -0.837865       NaN
1   B  0.609390 -1.000217
2   C  1.428977       NaN
3   D  0.581303  1.522258
4   D  0.581303  0.662935
5   E       NaN -0.799717