FORUM Product Analytics Grouping and counting with LEFT JOIN

Grouping and counting with LEFT JOIN

I’m in lesson 98 “Grouping and counting with LEFT JOIN”, but I noticed something, when doing the following query:

SELECT
  name,
  COUNT(user_id)
FROM books b
LEFT JOIN books_users u
  ON u.book_id = b.id
GROUP BY 1
ORDER BY 2 ASC

You actually get some books with 0 values, because nobody has started reading them. But then for avoiding duplicated values from same users we add the DISTINCT, however, all books now again have at least a value of 1, I’m not sure why this happens.

SELECT
  name,
  COUNT(DISTINCT(user_id))
FROM books b
LEFT JOIN books_users u
  ON u.book_id = b.id
GROUP BY 1
ORDER BY 2 DESC

REPLIES

Hi @ecasanova95 👋

Awesome question, I think I have a great tip to share with you.

You already know why we’re using DISTINCT here – to avoid counting the same users twice in case there’re duplicated records in the books_users table. More details in the other question of yours.

You see that the results are different and my first guess would be exactly this – there’re duplicated records. Let’s check how many books are read by the same user twice or more:

WITH book_stats AS (
  SELECT 
    book_id,
    COUNT(user_id) AS users_count,
    COUNT(DISTINCT(user_id)) AS uniq_users_count
  FROM books_users
  GROUP BY 1
)

SELECT *
FROM book_stats
WHERE
  users_count != uniq_users_count

☝ The trick here is very simple – we count all users for each book, the number of unique users for each book and compare these 2 numbers. As you can see, this query returns no results which means there’re no duplicated records.

Since there’re no duplicated records your queries should return the same result because DISTINCT doesn’t do much. Where’s the problem then?

Take a look at the sorting order in your queries. Your queries return exactly the same result, they’re just sorted in the different order. 💥

WRITE A REPLY

“no fuss, just things you actually need”

Start learning with SQL Habit today

Master Data Analysis with SQL through the story of how a startup succeeded through data.