K#13. Beware of filtering on left joins on SQL
This SQL mistake fools even experienced Data Scientists
Hey there 👋
Welcome back to K's DataLadder ✨! Each week, I share a story from my life as a Tech Data Scientist to help you level up in your career (and sometimes in life too).
We've grown to an amazing community of 1260 curious and driven data lovers. Thank you for being part of this journey 🫶
If you’re new here, check out this welcome post to get an idea of what this newsletter is all about!
Reading Time: 5 minutes
Checkpoint
I'm celebrating a big milestone this month, but more on that next week!
In the meantime, life's been good to me lately.
I've been back in Morocco for a week now (my homeland) and it’s been therapeutic for me - it recharges the battery.
Other than that, I'm on holiday for the next two weeks with two main goals:
Finally edit those two YouTube videos I've been sitting on (wish me luck!)
Bring back the 1:1 chats I offered a couple of weeks ago
Speaking of which, if you're interested in a 1:1 coaching session for FREE, it’s right here 👇! Spots are limited so I’ll only pick the most motivated ones.
This Week's Story
This week was intense. A major A/B test I’ve been working on finished running and it was time for analysis. The results are to be presented next week to the big bosses.
But this week was also my last before going off on holiday, so I needed to analyze as much as I could before handing things over to my colleague.
Working under a tight schedule is a breeding ground for careless mistakes.
That's why I had my team's senior data scientist review my queries, and I was right to do so because here’s what he told me:
All my queries had the same mistake.
But what SQL error could be so sneaky that even experienced data scientists fall for it?
Before I tell you, can you guess what it might be?
I hope you played along!
Analyzing an A/B test can be quite complex. You’re literally playing detective chasing clues to uncover why metrics went down.
Like a detective, you need to step into the user's shoes, consider various scenarios, and factor in technical aspects like latency issues.
To gather all this intel, we need to join multiple tables to extract users who were exposed to the test, their metadata, consumption, app interactions, and more.
With so many tables, it's easy to make mistakes with JOINs.
Why this matters
Mastering SQL is one of the most crucial skills any data scientist must have. I’d even say it’s the most important technical skill to master for big tech jobs.
Many data scientists fall into this trap, and today, I'm sharing one of the biggest pitfalls to save you time and face.
Troubleshooting with ChatGPT for SQL vs. Python
You can use ChatGPT for a lot of coding now, especially for doing data manipulation, aggregations, or creating visualizations on Python. But SQL is different.
Python is for working with data once it’s ready. But SQL is for preparing your data from scratch.
You need to do SQL yourself because SQL often involves writing complex queries across multiple tables.
Sure, ChatGPT can help with troubleshooting SQL errors but it can't fully understand your data since it doesn't have access to all the tables you’re using.
Python is easier to get help with because most of the time you’re working with one big table, so you can share column names, and use specific functions.
It’s basically tailored access to StackOverflow solutions to your problems.
The Sneaky SQL Problem
LEFT JOINs gone wrong
So what's this mistake that even my colleague with 4 years at Spotify says he makes?
It's about LEFT JOINs and WHERE conditions.
Don’t be fooled by its innocent appearance, the demon has tricked many souls.
It’s a difficult mistake to realize once it is done because it works just fine and produces results that seem reasonable quite often.
Here’s an example. When you do:
SELECT x,y,z
FROM t1
LEFT JOIN t2 USING(val)
WHERE t2.condition = something
It functions as an INNER JOIN. Anything in t1 that does not exist in t2 will automatically not satisfy WHERE t2.condition = something.
So any user in your dataset that does not meet the condition you are interested in will be filtered out.
The general rule of thumb is to never use WHERE conditions on the table that you just LEFT JOINed in.
Any time you want to do a LEFT JOIN and have WHERE conditions on the table that you are LEFT JOINing in, try breaking it down to separate CTEs.
First, a CTE with t2 + WHERE conditions and then a separate CTE that includes the previous CTE using a LEFT JOIN.
The Main Point
If it’s still not clear, here’s a more specific example my peer gave me to explain it:
t1:
id, val_1
1, 1
1, 2
2, 1
t2:
id, val_2
1, 1
3, 1
Now when we perform a JOIN of these tables with INNER JOIN on id we get t1 INNER JOIN t2 USING(id)
id, val_1, val_2
1, 1, 1
1, 2, 1
When we do LEFT JOIN on id we get t1 LEFT JOIN t2 USING(id)
id, val_1, val_2
1, 1, 1
1, 2, 1
2, 1, NULL
Now you can see from that second table that if we add any condition in a WHERE filter on val_2, the last row will disappear because val_2 is NULL there (unless our filter is specifically WHERE val_2 IS NULL in which case we keep only what does not exist in t2). And if the last row disappears, this table becomes equal to the INNER JOIN table.
This is why using WHERE conditions on the right table of a LEFT JOIN can unexpectedly turn it into an INNER JOIN, potentially excluding data you meant to keep.
The Important Rule: Don't use WHERE conditions on the table you just LEFT JOINed unless you're sure that's what you want.
As my colleague said, "It's hard to notice this mistake because it often works and gives results that look okay."
Next time you're writing complex SQL queries, double-check your JOINs. It will save you from problems later!
Also, if you haven’t done it yet:
Please leave a ❤️ or a comment to let me know that you read me, it would mean the world. Until then, see you next week for more data stories 🫶