Introduction to SQL’s MATCH_RECOGNIZE Clause SQL is a great way to perform analysis on your data. It is very common and supported by many database engines including big data solutions. SQL is used in many cases to analyze data in our data lake. However, when it comes to pattern detection, SQL traditionally hasn’t been used, pattern detection by SQL is hard, and sometimes even impossible. That’s why Python is normally used instead.
Recently, support for the MATCH_RECOGNZE clause was added to the query engine we use. The MATCH_RECOGNIZE clause is a part of the SQL specification, and it is used to detect patterns using the well-known regular expression syntax. This clause is already supported by different query engines like AWS Athena, Oracle, Snowflake, and Trino. It can save you a lot of time and trouble by looking for patterns in your data. It can also reduce cost since the data stays in the database without moving it to process.
Applying MATCH_RECOGNIZE for Churn Prediction
We will explain how MATCH_RECOGNIZE can be used to predict churn. In this scenario, we looked at account entities and counted them over time. We would like the number of entities to be as high as possible since revenue is directly derived from it. A drop in the number of entities for a specific account may indicate an unsatisfied customer who is about to leave the service.
Using MATCH_RECOGNIZE, we attempted to predict churn in this data by using a single query to find patterns. In our case, the query used millions of records for data containing many accounts over a period of 40 weeks and returned in less than 10 seconds.
Read on to learn how you can do a simple churn prediction by SQL. You can apply it to your own data, or use pattern detection in other situations as well.
Data Preparation and AnalysisWe looked at the number of managed entities for accounts over time and tried to see a recent, continuous, and meaningful decrease over time. Just like in every data science problem, we first had to clean the data, and you will most likely have to do the same in order to get the most meaningful results.
In our case, we began by defining a filter for meaningful entities. For example, we didn’t want to take into consideration entities customers added to the service for testing or unused entities. This is an example of how the data looks after the cleaning, for a single entity over a period of several weeks:
Identifying Patterns through SQL Using MATCH_RECOGNIZE
After filtering, we wanted to look at the most recent data and search for accounts with week by week decrease and a total decrease. In the example we looked for a total decrease of 40% or more. Previously, this would have been very hard or impossible to do by SQL, but by using MATCH_RECOGNIZE we were able to do it in a single query:
The MATCH_RECOGNIZE clause has several different parts. The DEFINE part contains the variable definition, which can later be used in the other parts. We defined the following variables:
- START – the start of the decrease. We require at least 10 entities per customer for a start
- DECREASE – One or more decreases. We check that the number of entities is lower than the previous one. We are also ignoring small decreases of 5% of the entities or less
- FINAL – the final number of entities in which we expect to have a drop of 40% from the start number
The pattern we used is the following:
It means that we expect 3 or more decreases and that the final entity number will be in the last week, since we are looking for actionable information.
The MEASURES part is used to define the output structure. You can see that the variables are used so we will know when the drop started, and the number of entities in the start and end of the drop. Here are some example results:
Case Study: Analyzing a Single Account
We will look at one account example. In the example you can see the data over time for account 6 from the table above:
The decrease started on 2023-10-17. The number of entities dropped from 40 to 22, which is a 45% decrease. Since we ignore the two small increases – we get the full drop. Now we can go back to the source data and learn more about the account and try to keep it alive.
Conclusion: The Benefits of Using MATCH_RECOGNIZE
MATCH_RECOGNIZE is a great capability which makes our query engine stronger. When the data stays in the database, you get great performance. Using SQL to do pattern matching over time can save you a lot of time and costs.
We have provided an example for simple churn prediction. Hopefully you can apply it to your own data, or use pattern detection in other situations as well.
Try Imperva for Free
Protect your business for 30 days on Imperva.