Finding Common Identifiers & Choosing your Primary Table
To join tables in SQL, there needs to be a common identifier among the tables being joined — a value that uniquely identifies each row from the others in each of the tables being combined, like an id or email address. When writing your query, you need to indicate the columns where this value is present.
Imagine you have one primary table that you want to combine with another table. The left join function works by looking for the common identifier in the other table, and then adding the values in the corresponding columns to the rows that match the common identifier. If there are rows in the other table with an identifier not present in the primary table, these rows will simply be disregarded. And vice versa, if there are identifiers in the primary table that don’t exist in the other table, these columns will simply be empty (NULL will be printed in the cells).
In this illustration, you can also see that the values for F and G in the other table are not carried over to the primary table as these identifiers are not present in the primary table.
You might recognize this way of looking up data in another table to populate the cells with matching identifiers from working with VLOOKUP in Excel or Google Sheets. It can work in the exact same way if you want to populate one sheet with information from a different sheet:
The same logic from the VLOOKUP function can be used to join tables in SQL.
So how do you figure out what your common identifier is? This depends on the nature of the data you’re trying to join. There need to be some fields that are common among the tables. In Weld’s sample data, for example, you can use the ‘customer_id’ to join the Customer table and Order table, since this field is present in both. Or, you could use the ‘order_id’ to join the Order table with the Order Line table. In your own data, you’d have to see what data you could use, for example an email address to identify customers across two tables if an id is not available across the two tables.
In some cases, there are no common identifiers between two tables, and therefore no meaningful way to join them. In this case, you might need to find a manual way of entering fields from one table into another in order to join those two tables. For example, you could manually enter your customer ID from your CRM system into your invoicing system to be able to match the customers across those two tables.
Once you’ve found the common identifier among the two tables, you need to define which of them should be your primary table (the “left” table in left join). The left join function keeps all rows in your primary table, and adds values in the columns in the corresponding rows of the other table. So you should pick the table in which you’d like to have all rows present as the primary table.
For example: If you want to have a table containing all your customers and potential customers from your CRM combined with how much they’ve been invoiced if they’re a paying customer in your invoicing system, you should pick the customer table from the CRM as the primary table. This will ensure you have a row for each customer in your CRM, and one with their invoicing information if that’s relevant. If you used the table from your invoicing system as the primary table, the resulting table would entail only your paying customers, and all your potential customers from your CRM would be left out.