FORUM Marketing Analytics 106. How order in FROM clause changes the results

106. How order in FROM clause changes the results

I tried solving 106 problem using the following query:

SELECT 
  utm_campaign, 
  SUM(amount) AS total_revenue,
  COUNT(DISTINCT a.id) AS total_users,
  SUM(amount)/COUNT(DISTINCT b.id) AS ARPU
FROM purchases a 
JOIN users b 
  ON a.user_id = b.id 
    AND a.refunded = FALSE 
-- WHERE utm_campaign IS NOT NULL
GROUP BY 1
ORDER BY 4 DESC

which should be quite similar with the proposed solution, the difference is on the order of FROM clause. I used purchase table before user table and it gives a very different results.

I’d like to understand why is this the case as this is a quite surprising SQL behavior to me.

REPLIES

Hi @itslutfi 👋

That’s a great question, JOIN-ing is one of the fundamentals of SQL and Data Analysis. I’d recommend you to skim through the lessons for INNER and LEFT joins. 💡

I’m also glad that you stumbled upon the joining problem with that example, it’s super awesome because you’ll see this pattern in all conversion calculations later in the course and at work.

Debugging a query

Let’s simplify your query a bit and focus only on joining tables. I think I know what went wrong, let me give you an overview of how I usually debug such problems with queries.

💡 PRO TIP: Use first letters of tables for aliases. That way it’s easy to understand which tables you use for filtering, sorting, selecting, etc.

Here’s our simplified query:

SELECT 
  COUNT(DISTINCT(u.id)) AS users_count,
  COUNT(DISTINCT(p.id)) AS purchases_count
FROM users u
JOIN purchases p
  ON p.user_id = u.id 
    AND refunded = FALSE

Here’s the result set:

users_count purchases_count
805 812

Now let’s change the order of tables:

SELECT 
  COUNT(DISTINCT(u.id)) AS users_count,
  COUNT(DISTINCT(p.id)) AS purchases_count
FROM purchases p
JOIN users u
  ON p.user_id = u.id 
    AND refunded = FALSE
users_count purchases_count
805 812

As you can see, the results are identical and the order of tables doesn’t make a difference. It means the problem is somewhere else. Since our query is very simple, the only other place to look at is the type of JOIN we’re using.

Fixing the problem

Let’s look at the JOIN type in our query. You’re using INNER JOIN (JOIN is just a shorter alias) and by definition it returns only rows that are present in both tables. That’s why when we change the order of tables the results are identical.

Let’s look at our task again. We need to calculate average revenue per user:

  1. We count all users.
  2. We sum up their revenue.
  3. We divide 2 by 3.

Atm our query has a bug in the first step – INNER JOIN removes all users without purchases. ⚠

To count all users and still be able to sum up their revenue we need to use LEFT JOIN – it keeps all rows from the left table (users) and joins only matching rows from the right table (purchases).

SELECT 
  COUNT(DISTINCT(u.id)) AS users_count,
  COUNT(DISTINCT(p.id)) AS purchases_count
FROM users u
LEFT JOIN purchases p
  ON p.user_id = u.id 
    AND refunded = FALSE
users_count purchases_count
6082 812

As you can see, the amount of purchases stayed the same and the number of users is way higher. In fact, this query counts all users in our database, we can verify it by counting them separately:

SELECT COUNT(*) AS users_count
FROM users

Things to think about

Now what if we change the order in the query with LEFT JOIN?

SELECT 
  COUNT(DISTINCT(u.id)) AS users_count,
  COUNT(DISTINCT(p.id)) AS purchases_count
FROM purchases p
LEFT JOIN users u
  ON p.user_id = u.id 
    AND refunded = FALSE

Of course, the results will be completely different because of our JOIN type. LEFT JOIN will keep all records from the left table (in that case purchases) and only join matching records from the right table (users).

Summary

To select a correct JOIN type we need to deeply understand what we’re calculating. Ideally, break down our task into a simple algorithm and then validate each step with simpler queries like this one:

SELECT COUNT(*) AS users_count
FROM users

Most imporatnly, always remember that INNER JOIN returns the records that are present in both table. You can visualize it as an intersection.

The LEFT JOIN always keeps all records from the left table and only joins the matching records from the right table. The base of your conversion (the denominator) usually comes from the left table, because you need to count all records from it.

😅 I hope that was useful! 😊 Ping me if you have more questions. 🍻

Thank you for your great answer! this is beyond helpful!

I understand better more both about the bug in the query and also the underlying data. It is such a very useful breakdown.

Hello, Thanks for such a detailed explanation, along with the de-bugging scenarios. That was a lot helpful for me to understand & dispel my mis-understanding.

Thanks

WRITE A REPLY

“no fuss, just things you actually need”

Start learning with SQL Habit today

Master Data Analysis with SQL through the story of how a startup succeeded through data.