Fundamentals of Data Analysis Calculating purchase rate. Part 1

26. Calculating purchase rate. Part 1

“People are purchasing Bindle subscriptions, that’s so amazing! I got an email from this venture firm, B Permutator or something. They’re asking what’s our purchase rate, do we know enough SQL to calculate it?” asked Sandra.

Technically speaking – yes. Let’s look at the purchase rate of users who signed up within the first week of Bindle. We already know how to calculate users within a time range:

  COUNT(*) AS total_users_count
FROM users
  signup_date BETWEEN '2018-01-01' AND '2018-01-07'

In the same way, we can calculate how many of them became customers:

  COUNT(*) AS total_customers_count
FROM users
  signup_date BETWEEN '2018-01-01' AND '2018-01-07'
  AND status = 'customer'

The only thing – we need to manually divide the number of customers by the number of users to calculate the purchase rate. Remember, how SQL itself was inspired by accounting? Clearly there must be a way to easily calculate rates and percentages.

CASE statement

The only difference between 2 queries above ☝ is the filter for customer status status = 'customer'. Wouldn’t it be great if we were able to count all users and customers in one query (and then divide one by the other)? It’s possible via the CASE statement: we can use conditions (when talking about if/else our well-known filters are often referred to as conditions) in the SELECT clause. Here’s a query that counts the number of users and customers at the same time:

  COUNT(*) AS total_users_count,
  COUNT(CASE WHEN status = 'customer' THEN id END) AS total_customers_count
FROM users
  signup_date BETWEEN '2018-01-01' AND '2018-01-07'

In this query ☝ we have used a couple of new SQL features (tricks):

  • we have multiple aggregate functions. That’s totally OK, remember how aggregate functions work? They calculate a single value based on all filtered rows. For example, the COUNT(*) function will calculate the total number of rows.
  • we used the CASE statement in the second COUNT in a way that it’ll only count a user record only if this user has a customer status.
  • we used the CASE statement without ELSE. The default value of any CASE statement is NULL (refresher: NULL is a special type in SQL that stands for no value). Think of it as SQL would always add ELSE NULL to your CASE statement if you didn’t specify it:

is equivalent to

And the final SQL feature we used today: counting rows using the COUNT aggregate function will count only rows with existing values (⚠ it’ll skip all NULL values). This is why our CASE statement for customers returned the user’s id if the user has the customer status.

To have a better feeling how the CASE statement works inside COUNT we can imagine it as a 2 step process. First, we run a query to have a new column with the CASE statement like so:

  CASE WHEN status = 'customer' THEN id END AS new_user_status
FROM users

and then we run COUNT(new_user_status) on it (oh yeah, you totally can query … another query 😄 Why not? The result of any query is a table like any other SQL table).

Why using id in the CASE statement?

The most frequent question I get at this point is “What’s the purpose of this id column inside the CASE statement?”

  COUNT(CASE WHEN status = 'customer' THEN id END) AS customers_count
FROM users

Well, we can use anything but NULL to make this query count users with customer status:

  COUNT(CASE WHEN status = 'customer' THEN id END),
  COUNT(CASE WHEN status = 'customer' THEN 1 END),
  COUNT(CASE WHEN status = 'customer' THEN TRUE END),
  COUNT(CASE WHEN status = 'customer' THEN FALSE END),
  COUNT(CASE WHEN status = 'customer' THEN status END),
  COUNT(CASE WHEN status = 'customer' THEN 'customer' END)
FROM users

☝ As you can see, all these COUNT()-s return the same value. This is because the COUNT() function ignores NULL values. You can visualize it like so: the COUNT() functions goes row-by-row in our table; if the argument (in our case the result of the CASE statement) is not NULL – it’ll increment the counter by 1.

🔍 So why id then? IMO because it’s the most useful column for the job. First of all, id-s are unique identifiers. I read such a query as If the CASE condition is TRUE, then count this record in. Second, we often want to count only unique records and this is where id column is absolutely necessary. More on that later in the course. 🚀

That was a looong lesson 😓 If you have any questions drop me a message via 👉 chat in the bottom right corner. Don’t forget to practice and run all the queries from this lesson ☝

Anatoli Makarevich, author of SQL Habit About SQL Habit

Hi, it’s Anatoli, the author of SQL Habit. 👋

SQL Habit is a course (or, as some of the students say, “business simulator”). It’s based on a story of a fictional startup called Bindle. You’ll play a role of their Data Analyst 📊 and solve real-life challenges from Business, Marketing, and Product Management.

SQL Habit course is made of bite-sized lessons (you’re looking at one atm) and exercises. They always have a real-life setting and detailed explanations. You can immediately apply everything you’ve learned at work. 🚀

“well worth the money”

Fluent in SQL in a month

Master Data Analysis with SQL with real life examples from Product Management, Marketing, Finance and more.
-- Type your query here, for example this one -- lists all records from users table: SELECT * FROM users
Loading chart... ⏳