How to detect recurring payments with SQL

Introduction

Recently I spotted a strange 200$ transaction in my bank statement. It was from a service I thought I’ve canceled. 💥 You know what was next: I wrote to customer support, asked for a refund and after some time got my 200$ back. This story had a happy ending but it made me wonder how many such recurring transactions I haven’t noticed.

💡 It’s 2020 and I thought that bank apps are already smart enough to detect subscriptions. I went to my N26 bank’s Subscriptions report hoping to find a nice list of recurring transactions there. Drum roll… it was empty! 🙈 I definitely pay for some subscription services (Spotify, for example 🎧) but none of them showed up in the report.

No subscriptions in bank report

I’ve no idea how N26 detects subscriptions and why it’s broken for me. Good news are that N26, as any other bank, allows to download all transactions as a CSV file. Let’s download it and improvise a simple subscription detection mechanism based solely on SQL. 🚀

Setting a goal

What we are about to do is a typical data research or an Ad Hoc analysis. It’s not far from the reality of data analysts, marketers and product managers mining their data.

In my experience, any analysis (or anything in general) has much higher chances of success if it has a goal. 🎯 Here I want to know which companies charge me regularly, for how long and how much money have I already paid.

To answer these questions ☝ we’ll have to write one (potentially massive) SQL query. To kick it off I usually try to visualize the result set that our final query will return. Such visualization helps a lot to anticipate intermediate steps and bottlenecks. Looking at our goals ☝ I’d imagine the final result like this:

payee first_transaction_date total_money_spent_eur
Spotify 2019-04-01 -250
Netflix 2020-01-01 -30

It’s a good goal for a small blog post. 😊 Let’s get started! 🚀

Importing the data

Step 0: Inspecting the CSV report

Let’s download the CSV file with our bank transactions. Here’s a link for the N26 report file that will be used in this blog post (the data is made up, of course, but the format is real).

💡 Feel free to download the transactions from you bank and follow along with them! 🚀

⚠ In case you have a problem with anything just leave a comment below or ping me on Twitter.

Let’s inspect the CSV file with transactions, it should look something like this:

Bank transactions CSV report

Basically, CSV files are tables. In the next couple of steps we’ll replicate this table ☝ in the actual SQL database and copy the data.

Step 1: Creating a database

I’ll use a PostgreSQL database with a free client called Postico. I’ll create a simple database via the client’s interface:

Create a new database via Postico UI

Alternative approach is to use SQL queries for database or table creation as well. To create a new database you’ll need to run a query from the template (template1 in case of PostgreSQL) database:

Create a new database via SQL query

Let’s call our database personal_finance ☝ in case you want to add more data in the future (other bank accounts, investments, etc).

Step 2: Creating a transactions table

Now we need a table to host the transactions from the CSV file, let’s call it n26_transactions (if you use Chase bank it’ll be chase_transactions). Let’s examine what columns will we need by looking at the first row in the CSV file:

Bank transactions CSV report

Date, Payee, Account number, etc. In our table, they’ll be called created_at (date is a reserved word in SQL and it’s better not to use it as a column name), payee, account_number, etc.

💡 We’re using lower case for column names with underscores _ instead of spaces. In programming, it’s called snake case.

Now that we have a list of columns let’s create a table. Simply click + Table if you’re using Postico:

Create transactions table via Postico UI

To add new columns switch to the Structure tab of the n26_transactions table and add columns (created_at column has type date, other columns have type text):

Add columns to transactions table via Postico UI

The hacker way to add a table is by using a CREATE TABLE SQL query:

⚠ Note that we used text columns even for columns with numbers (amount_eur, etc). The reason for it is that all values in the CSV file are wrapped in quotation marks "" and from the database perspective they’re all text (it could lead to import errors when SQL engine can’t recognize a number in the empty string, for example). No problem though, we’ll be able to change column types later.

Step 3: Importing transactions data

At this point we’ve got our personal_finance database and n26_transactions table. It’s time to import the data! Let’s use a special COPY query for it. Here the final query that imports transactions from the CSV file:

SQL query for importing transactions from CSV file

⚠ The order of column names in the COPY command should map the order in the CSV file. Otherwise SQL will try to map data to the default order of the columns in the n26_transactions table.

⚠ Don’t forget to adjust the path to the CSV file in the FROM command. As you can see, the query uses the path to the CSV file from my computer: /Users/makaroni4/Downloads/n26-csv-transactions-fake.csv.

Step 4: Inspecting the data

At this point, we should have all transactions from the CSV file in the transactions table. I start every data research with a SELECT * query and look at the raw data:

The records follow the same order as in the original CSV file. It’s quite hard to spot any recurring payments this way. Let’s order transactions by the payee and transaction date (created_at column):

⚠ Note that we filtered out income transactions (with a positive amount).

I bet you’ve already spotted the transactions from Spotify, it’s definitely a subscription! 💥

Detecting recurring transactions

Step 5: Calculating transactions frequency

Looks like Spotify bills us every month on the 3rd day, but other companies could do it every 30 or 31 days, for example. Detecting the frequency is definitely a bottleneck here.

Let’s try counting days between each transaction for every payee. To do this we’ll employ the ROW_NUMBER() window function to index all transactions within each payee and the LAG() window function to access the date of the previous transaction:

As you can see, this query counts days between consecutive transactions for every payee. It also produces a lot of noise in the result (records with NULL date_diff). They definitely make it hard for us to spot the recurring transactions, so let’s filter them out.

We’ll do it by wrapping the previous transaction into a CTE (Common Table Expression or a subquery) and filter out the ones with NULL date_diff:

WITH transactions_with_date_diff AS (
  SELECT  
    ROW_NUMBER() OVER(PARTITION BY payee ORDER BY created_at),
    created_at - LAG(created_at) OVER(PARTITION BY payee ORDER BY created_at) AS date_diff,
    *
  FROM n26_transactions
)

SELECT *
FROM transactions_with_date_diff
WHERE
  date_diff IS NOT NULL

Sweet, we’re down to 400 transactions from the initial 800. Let’s browse through the result and spot more recurring transactions to find patterns in billing dates.

Here’s the one from Vodafone Mobile, it’s definitely a subscription. The date_diff varies from 28 to 33. Let’s try to filter only records with date_diff in this range (let’s round it to 25 and 35):

WITH transactions_with_date_diff AS (
  SELECT  
    ROW_NUMBER() OVER(PARTITION BY payee ORDER BY created_at),
    created_at - LAG(created_at) OVER(PARTITION BY payee ORDER BY created_at) AS date_diff,
    *
  FROM n26_transactions
)

SELECT *
FROM transactions_with_date_diff
WHERE
  date_diff IS NOT NULL
  AND date_diff BETWEEN 25 AND 35

Awesome, we’re down to just 90 transactions and most of them look pretty recurring! 😄 🚀

Step 6: Building final report

Let’s recall the final report we planned, we wanted to know:

  • when a subscription started
  • how many transactions we have had so far
  • how much money we have spent already
  • the latest payment date

Our job at this point is to convert the result of the previous query ☝ into the final report. We’ll definitely need to aggregate data, let’s translate our requirements into SQL statements:

  • the earliest transaction date: MIN(created_at)
  • the number of transactions: COUNT(*)
  • the total amount payed: SUM(amount_eur::numeric). ⚠ Since all columns are strings we need to convert the amount_eur string value into a number.
  • the latest transaction date: MAX(created_at)

Let’s put everything into one query:

WITH transactions_with_date_diff AS (
  SELECT  
    ROW_NUMBER() OVER(PARTITION BY payee ORDER BY created_at),
    created_at - LAG(created_at) OVER(PARTITION BY payee ORDER BY created_at) AS date_diff,
    LAST_VALUE(amount_eur) OVER(PARTITION BY payee ORDER BY created_at) AS latest_amount,
    *
  FROM n26_transactions
)

SELECT
  payee,
  COUNT(*) AS transactions_count,
  MIN(created_at) AS subscription_started,
  MAX(created_at) AS latest_transaction,
  SUM(amount_eur::numeric) AS total_amount
FROM transactions_with_date_diff
WHERE
  date_diff IS NOT NULL
  AND date_diff BETWEEN 25 AND 35
GROUP BY 1
ORDER BY 2 DESC

As you can see in the result, there’s a bunch of payee-s with just 1 transaction, they don’t look like recurring transactions. Let’s filter them out for now (if they’re real subscriptions they’ll show up in the report next month):

WITH transactions_with_date_diff AS (
  SELECT  
    ROW_NUMBER() OVER(PARTITION BY payee ORDER BY created_at),
    created_at - LAG(created_at) OVER(PARTITION BY payee ORDER BY created_at) AS date_diff,
    LAST_VALUE(amount_eur) OVER(PARTITION BY payee ORDER BY created_at) AS latest_amount,
    *
  FROM n26_transactions
)

SELECT
  payee,
  COUNT(*) AS transactions_count,
  MIN(created_at) AS subscription_started,
  MAX(created_at) AS latest_transaction,
  SUM(amount_eur::numeric) AS total_amount
FROM transactions_with_date_diff
WHERE
  date_diff IS NOT NULL
  AND date_diff BETWEEN 25 AND 35
GROUP BY 1
HAVING COUNT(*) > 1
ORDER BY 2 DESC

Pretty neat, huh? I think this query used pretty much every SQL concept available – filters, aggregate and window functions, and sorting. 🚀

BONUS: Transactions with dynamic payee name

Let’s get back to our initial inspection query:

SELECT *
FROM transactions
ORDER BY payee ASC, created_at DESC

If you scroll the result set long enough you’ll see a LOCAL GYM recurring transactions:

payee
LOCAL GYM AUG-29609684
LOCAL GYM DEC-33094239
LOCAL GYM JAN-33997758

Our final query didn’t spot them because the payee was different for each month (somehow the LOCAL GYM managed to include the month of payment as part of the payee value and not as a description).

There are a couple of ways we could fix that issue. We can manually adjust the payee values in the subquery and use it for our report:

SELECT
  (CASE 
  WHEN payee LIKE 'LOCAL GYM %' THEN 'LOCAL GYM'
  ELSE payee
  END) AS adjusted_payee,
  *  
FROM n26_transactions
ORDER BY payee ASC, created_at DESC

or we can extract the first word from the payee value and use it as the new payee label. Anyways, you can turn the result of these queries into subqueries and use them as the input for our final query in Step 6.

SELECT
  SPLIT_PART(payee, ' ', 1) adjusted_payee,
  payee,
  *  
FROM n26_transactions
ORDER BY payee ASC, created_at DESC

The first approach is more precise but requires us to go over all transactions manually to spot recurring ones (which kind of defeats the purpose of our research).

Anyways, since we used CASE statement and scalar function SPLIT_PART we covered all aspects of SQL in one blog post! 💥 🍾 👏 🚀

Epilogue

As you can see, a “simple” SQL query could be very insightful for doing personal finance. Imagine how powerful data insights are when you’re running marketing campaigns or building a product.

In case you want to learn how SQL works and how to use it to run a modern Internet company I recommend you to try the SQL Habit course. It’s a simulation of a startup company that will teach you not just SQL mastery but how to use data for running successful marketing campaigns, building web, and mobile products.