TIL — Full Outer Join use case in ML

A simple solution to merge two tables in sql without losing information

Imagine the following Problem: You have two tables with differents informations about users. The first A gives information about the age of the user and B gives where the user lives.

Table A (user information)

user_idnameage
1Alice30
2Bob25

Table B (user metadata)

user_idnamecity
2BobParis
3CharlieLondon

I was surprise that using DBT, it was not uncommon to decouple the features concerning users. But at the end of the day, you would like to merge both table.

How to do it ?

More surprising, it is not that easy because of the primary key on table A and B (user_id) are not aligned. And it would be a loss to either do a left join or right join because you want all rows.

FULL OUTER JOIN and COALESCE Approach


To achieve the desired result, you need to handle two key aspects:

  1. Preserving all rows: You want to include all rows from both tables, even if there is no corresponding match in the other table. This is where a FULL OUTER JOIN comes in. A full outer join includes all rows from both tables, filling in NULL for columns where a match is missing.

  2. Handling non-null values for the key (user_id): After performing the join, you may have NULL values for the user_id in one table but not the other (e.g., user_id from Table A may be NULL for Charlie since he exists only in Table B). Using COALESCE, you can take the first non-null value from either table for the user_id.

Try 1, Basic Full Outer Join.

SELECT 
    A.user_id AS A_user_id, A.name AS A_name, A.age,
    B.user_id AS B_user_id, B.name AS B_name, B.city
FROM 
    A
FULL OUTER JOIN 
    B
ON 
    A.user_id = B.user_id;
A_user_idA_nameageB_user_idB_namecity
1Alice30NULLNULLNULL
2Bob252BobParis
NULLNULLNULL3CharlieLondon

Try 2, Full Outer Join with Coalesce

SELECT 
    COALESCE(A.user_id, B.user_id) AS user_id, 
    COALESCE(A.name, B.name) AS name, 
    A.age, 
    B.city
FROM 
    A
FULL OUTER JOIN 
    B
ON 
    A.user_id = B.user_id;
user_idnameagecity
1Alice30NULL
2Bob25Paris
3CharlieNULLLondon

Using FULL OUTER JOIN with COALESCE is an efficient solution when merging tables with partial overlap in keys. This ensures a complete, non-lossy merge, accommodating records from both tables while handling null values gracefully. This approach aligns well with dbt’s philosophy of modular data management, where tables often carry distinct but complementary feature sets for the same entities.