FORUM SQL Habit 105. 'Logical fallacy' explanation of my alternative query using 'where' clause

105. 'Logical fallacy' explanation of my alternative query using 'where' clause

In 105. a great and important point is made in the difference of limiting a query in the join conditions vs. limiting a query in the where clause.

I understand the scenario of why limiting the query in the join condition makes sense (filtering the purchases table to only consider all purchases that were not refunded) and I see how using the where clause ultimately returns wrong output (filtering in the where clause will ignore the ‘negative space’ of all the NULLs in purchase columns for users who did not make a purchase).

I tried challenging my understanding a bit and came up with the ‘b’ CTE below (‘a’ CTE is the correct query as described in the exercise). My question is then: how am I making a logical error in trying to utilise the where clause in the ‘b’ CTE below? As far as I can see, the difference in counts between ‘a’ and ‘b’ is 104 rows (purchases that were refunded?).

WITH a AS (
  SELECT
    SUM(amount) / COUNT(DISTINCT(u.id)) AS ARPU
  FROM users u
  LEFT JOIN purchases p
  ON u.id = p.user_id
  AND refunded = FALSE
),

b AS (
  SELECT
    SUM(amount) / COUNT(DISTINCT(u.id)) AS ARPU
  FROM users as u
  LEFT JOIN purchases AS p
  ON u.id = p.user_id
  WHERE refunded IS NULL OR refunded = FALSE
)

SELECT * FROM b
-- SELECT * FROM a

/*
The 'a' CTE above is correct. I can see that 'b' CTE is not correct,
but I'm unable to wrap my mind around why, for the 'b' version, the
'amount' is greater in relation to the distinct user count.
*/

REPLIES

Hi @QueryingQuail 👋

You’ve brought in a great challenge, thank you so much for such a clear description. 🚀

I think there’s no logical error – we need to tell the database to join relevant purchases ⚠ keeping all users in the table (because we’re calculating ARPU – average revenue per user).

The error happens because we filter in the wrong place. Or even better – at the wrong time, it’s all about the order in which SQL engine executes the statements in our query.

Here’s a link to the lesson with the full order.

WHERE clause is processed after FROM and JOIN (we filter the final joined result set), so it already gives us a feeling that we might lose records.

Let’s break down your CTEs to smaller steps (same way SQL engine would process them):

WITH a AS (
  SELECT
    COUNT(*) AS records_count,
    COUNT(u.id) AS user_records_count,
    COUNT(p.id) AS purchase_records_count,
    COUNT(DISTINCT(u.id)) AS uniq_user_records_count,
    COUNT(DISTINCT(p.id)) AS uniq_purchase_records_count
  FROM users u
  LEFT JOIN purchases p
    ON u.id = p.user_id
      AND refunded = FALSE
),

b AS (
  SELECT
    COUNT(*) AS records_count,
    COUNT(u.id) AS user_records_count,
    COUNT(p.id) AS purchase_records_count,
    COUNT(DISTINCT(u.id)) AS uniq_user_records_count,
    COUNT(DISTINCT(p.id)) AS uniq_purchase_records_count
  FROM users as u
  LEFT JOIN purchases AS p
    ON u.id = p.user_id  
)

SELECT * FROM a
-- SELECT * FROM b

As you can see, the uniq_user_records_count is the same for both result sets, but the uniq_purchase_records_count is way higher in b – we haven’t filtered them out yet.

At this point our a result set is ready – we’ve already joined non-refunded purchases and ready to calculate ARPU.

:drum: Now comes the error. Let’s look at the raw b data from a new angle:

WITH b AS (
  SELECT
    p.id AS purchase_id,
    p.*,
    u.*
  FROM users as u
  LEFT JOIN purchases AS p
    ON u.id = p.user_id  
)

SELECT * 
FROM b
ORDER BY purchase_id ASC NULLS FIRST

☝ I didn’t modify the data, but I just sorted it in a way that the error is clear – there’re lots of user records without purchases.

These records all have refunded and all purchase table columns equal to NULL. So when we apply our filter in the WHERE clause (it’ll be applied to the whole joined result set) we’ll filter all these users records without purchases. 💣

Let’s use the same “angle” to look at the final b result set after filtering:

WITH b AS (
  SELECT
    p.id AS purchase_id,
    p.*,
    u.*
  FROM users as u
  LEFT JOIN purchases AS p
    ON u.id = p.user_id  
  WHERE
    refunded = FALSE  
)

SELECT * 
FROM b
ORDER BY purchase_id ASC NULLS FIRST

As you can see, no more NULL records – they’re all gone. 💥

WRITE A REPLY

“no fuss, just things you actually need”

Start learning with SQL Habit today

Master Data Analysis with SQL through the story of how a startup succeeded through data.