Feature extraction is expensive, especially when dealing with big data. That’s why it’s great to be able to preprocess close to the database – the data stays in the DB and doesn’t have to move out unless necessary.
One common approach to text data representation is N-grams. An N-gram is a contiguous sequence of N items, such as words or characters, extracted from a given corpus. In this blog, we will present, step by step, a simple method for character n-gram extraction, selection, and evaluation using SQL only. Here are the main steps:
Once n-grams are selected and evaluated, they can be integrated into different text mining and Natural Language Processing tasks. Read on to understand how to speed up those tasks and how SQL can facilitate choosing the best n-grams for your model.
N-gram Extraction
We are using Presto as our big-data query engine, which supports word n-gram processing as part of its array functions. By the way, if you’re using a different SQL flavor – don’t you worry – the functions we use are quite common. What we couldn’t find was a way to process character n-grams, or a way to slice a string. Although it’s pretty safe to assume that more advanced built-in functions, such as the above, will eventually be added to big-data query engines, it’s really great to discover that what’s missing can be added with just a few lines of code:
-- get the list of 4-grams from a text field
WITH ngrams_arr AS (
SELECT CONCAT(
REGEXP_EXTRACT_ALL(text_field, '....'),
REGEXP_EXTRACT_ALL(SUBSTRING(text_field, 2), '....'),
REGEXP_EXTRACT_ALL(SUBSTRING(text_field, 3), '....'),
REGEXP_EXTRACT_ALL(SUBSTRING(text_field, 4), '....')
) AS ngrams_arr,
COUNT(*) AS hits
FROM my_table
WHERE LENGTH(text_field) > 4
GROUP BY text_field),
-- group the ngrams and find the top 10k according to hits
ngrams AS (
SELECT ngram, COUNT(1) AS records, SUM(hits) AS hits
FROM ngrams_arr CROSS JOIN UNNEST (ngrams_arr) AS t(ngram)
GROUP BY ngram
ORDER BY hits DESC
LIMIT 10000)
Pretty simple, right? Let’s have a look at the first query, named ngrams_arr. We extract all 4-gram combinations from the corpus (the text_field column in the my_table table) by: first, removing duplicates using grouping; second, keeping only texts longer than N characters; and lastly, listing all n-grams by repeating the regex extraction function N (=4) times while shifting the extraction start index by one each time using the SUBSTRING function. To enrich the distinct n-gram arrays, we’ve added the hits column, which counts the number of occurrences of each distinct text. Note that hits is just an example and can be replaced by, or combined with, any other statistical measure.
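For intuition, the same sliding-window extraction and hit counting can be sketched in plain Python (hypothetical helper names; in practice this work stays in the query engine):

```python
from collections import Counter

def char_ngrams(text, n=4):
    """All overlapping character n-grams of a string (sliding window),
    equivalent to the shifted REGEXP_EXTRACT_ALL calls in the query."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def top_ngrams(corpus, n=4, k=10):
    """Count n-gram hits across a corpus, weighting each distinct text
    by its number of occurrences (the role of the hits column)."""
    counts = Counter()
    text_hits = Counter(corpus)          # hits per distinct text (GROUP BY)
    for text, hits in text_hits.items():
        if len(text) > n:                # mirrors LENGTH(text_field) > 4
            for gram in char_ngrams(text, n):
                counts[gram] += hits
    return counts.most_common(k)         # mirrors ORDER BY hits DESC LIMIT k
```

For example, `char_ngrams("abcdef")` yields `["abcd", "bcde", "cdef"]`, exactly the overlapping 4-grams the four shifted regex extractions produce together.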
Then, in the second query, named ngrams, we simply unnest the distinct n-gram arrays into single n-grams and aggregate the statistical measure from the first query, in this case hits. That is, for every n-gram, we count its total number of hits in the corpus by summing the hits of every distinct n-gram array that contains it.
The result is a list of n-grams and their corpus hits. More information can be included, like the number of records in which each n-gram was found. Here is an example of results found in remote execution attacks:
As you can see in the results, “/etc/passwd” is a very common phrase in such attacks.
N-grams Selection
Next, the best n-grams are chosen by tuning and testing the previous N-gram Extraction query with different hyperparameter values, including different corpus samples and n-gram sizes.
Let’s explain this concept using a real-world problem. Imperva’s Web Application Firewall (WAF) monitors web application traffic and logs statistics on identified attacks and legitimate traffic. Billions of HTTP requests are logged and analyzed daily. To find the best n-gram candidates for attack detection tasks, we do the following:
- Define the N-grams Extraction query hyperparameters
- Calculate the top hit legit n-grams by executing the N-grams Extraction query on Clean Traffic statistics.
- Calculate the top hit malicious n-grams by executing the N-grams Extraction query on Attack Traffic statistics.
- Conclude the top meaningful n-grams by keeping the top-hit malicious n-grams that are unlikely to occur in clean traffic.
The described selection method can be easily executed as follows:
WITH attack_ngrams AS (
-- extract attacks ngrams ...),
clean_traffic_ngrams AS (
-- extract clean traffic ngrams ...)
SELECT attack_ngrams.ngram, attack_ngrams.hits
FROM attack_ngrams INNER JOIN clean_traffic_ngrams ON attack_ngrams.ngram = clean_traffic_ngrams.ngram
ORDER BY attack_ngrams.hits / clean_traffic_ngrams.hits DESC
LIMIT 1000
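The join-and-rank logic of that query can be sketched in plain Python (a hypothetical helper, assuming two dicts mapping n-grams to their hit counts from the extraction step):

```python
def select_ngrams(attack_hits, clean_hits, limit=1000):
    """Rank n-grams appearing in both attack and clean traffic by the
    ratio of attack hits to clean hits (higher = more attack-specific).

    attack_hits, clean_hits: dicts mapping n-gram -> hit count.
    """
    shared = set(attack_hits) & set(clean_hits)      # the INNER JOIN on ngram
    ranked = sorted(shared,
                    key=lambda g: attack_hits[g] / clean_hits[g],
                    reverse=True)                    # ORDER BY ratio DESC
    return [(g, attack_hits[g]) for g in ranked[:limit]]
```

An n-gram like “/etc” with many attack hits but few clean hits ends up at the top of the list, while generic tokens common in both kinds of traffic sink to the bottom.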
Below is an example result for the above selection method, where the last column represents the ratio between attack and clean traffic hits; the higher the ratio, the more meaningful the n-gram could be:
Note that this is just one example of n-gram selection. Another selection method to consider for web traffic could be based, for instance, on extracting more features per n-gram, like the number of different payloads (entire text fields) it appears in, or the number of protected sites in which the n-gram is included.
There are many more methods that can be used for n-gram selection and representation, including common methods like TF-IDF, and other custom selection logic that best fits your needs. You don’t have to choose one in advance – grab a few, compare, and select the best choice.
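As an illustration of one such common method, here is a minimal pure-Python TF-IDF sketch (a hypothetical helper for intuition only; a real pipeline would likely use a library implementation):

```python
import math

def tfidf(ngram_docs):
    """ngram_docs: list of documents, each a list of extracted n-grams.
    Returns one {ngram: tf-idf score} dict per document, using term
    frequency times log-scaled inverse document frequency."""
    n_docs = len(ngram_docs)
    doc_freq = {}                                  # in how many docs each n-gram appears
    for doc in ngram_docs:
        for gram in set(doc):
            doc_freq[gram] = doc_freq.get(gram, 0) + 1
    scores = []
    for doc in ngram_docs:
        total = len(doc)
        scores.append({gram: (doc.count(gram) / total)
                             * math.log(n_docs / doc_freq[gram])
                       for gram in set(doc)})
    return scores
```

An n-gram that appears in every document gets an IDF of log(1) = 0, so ubiquitous n-grams are suppressed while rarer, more discriminative ones score higher.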
N-grams Evaluation
Once n-grams have been extracted (in our use case, HTTP attack n-grams) and selection methods have been chosen, the top n-grams can be evaluated on a validation set as follows:
Qualitative Evaluation
Retrieve validation data entries not used for selection, including the extracted n-grams, and explore the results. You can use the following query to see your n-grams in action:
-- assuming a previous CTE named chosen_ngrams holds the selected n-grams
WITH chosen_ngrams_arr AS (
SELECT ARRAY_AGG(ngram) AS ngrams_arr
FROM chosen_ngrams)
SELECT text_field, FILTER(ngrams_arr, x -> POSITION(x IN text_field) > 0)
FROM my_table CROSS JOIN chosen_ngrams_arr
LIMIT 100
Here are example results from our WAF dataset:
Also, to zoom out a bit from the individual-payload perspective, we grouped the n-grams and ran the evaluation on both attack and clean traffic. Below are the results as a heatmap for some popular keywords:
As expected, the n-grams are rare in clean traffic, and very common in attacks.
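The FILTER expression used in the query above boils down to a simple substring membership check; a pure-Python equivalent (a hypothetical helper, handy for spot-checking individual payloads) would be:

```python
def matched_ngrams(text, chosen):
    """Return the chosen n-grams that occur in a payload, mirroring
    FILTER(ngrams_arr, x -> POSITION(x IN text_field) > 0)."""
    return [gram for gram in chosen if gram in text]
```

For instance, checking a shell-injection payload against a few candidate n-grams highlights only the ones actually present in the text.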
Quantitative Evaluation
Compare different n-gram selection experiments by calculating the n-grams’ correlation with the validation data and choosing the best option:
SELECT ANY_MATCH(ngrams_arr, x -> POSITION(x IN text_field) > 0), COUNT(1) AS hits
FROM my_table CROSS JOIN chosen_ngrams_arr
GROUP BY 1
Note: if you are using AWS Athena, the ANY_MATCH function is not supported at the moment. You can use CARDINALITY(FILTER(..)) > 0 instead.
When we evaluated our selected n-grams on clean traffic and attacks, we got the following experiment results:
This means it is much more likely to find the n-grams we selected in attack payloads, as expected. Running experiments on different extraction and selection methods allows us to empirically compare our different n-gram sets and choose the best option. Here is an example of the results of two experiments (exp1 and exp2) sharing the same selection method:
You can define your own comparison measurements, which will help you get to the n-grams you want.
N-grams to feature vectors
If you want to leverage the chosen n-grams from the previous steps as a feature vector for any future data problem, one possible way is to one-hot encode the n-grams per entry.
First, follow the SQL trick above for extracting the selected n-grams from the data entries, then load the data into a pandas dataframe. Your dataframe should have a column with the extracted n-grams. Then, to encode and normalize the results, use the pandas explode and get_dummies functions:
encoded_df = pd.get_dummies(df["extracted_ngrams"].explode(), dtype=bool).groupby(level=0).max()
Assuming you have a significant amount of data and preprocess your dataframe entries in batches, make sure your chunks are consistent by using the reindex function with your model’s n-gram list:
encoded_df = encoded_df.reindex(columns=ngrams_list, fill_value=False, copy=False)
Here is an example of input and output data for the following 1-gram list: a, b, c, d:
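The same encode-then-reindex behavior can be illustrated in pure Python (a hypothetical helper, using the 1-gram list a, b, c, d as the fixed model vocabulary):

```python
def one_hot(extracted, ngram_list):
    """One-hot encode extracted n-grams per entry against a fixed model
    n-gram list - the role reindex plays for the encoded dataframe:
    every row gets the same columns, in the same order, with False
    filled in for n-grams absent from the entry."""
    return [[gram in row for gram in ngram_list] for row in extracted]
```

For example, an entry whose extracted n-grams are ["a", "c"] encodes to [True, False, True, False] against the list a, b, c, d, and an entry matching nothing encodes to an all-False row of the same width.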
Python, and specifically pandas in this case, is used for encoding only, while the query engine does the heavy lifting – processing the large text data.
Takeaway
The N-gram is one common and effective feature extraction approach used for text data representation and as a basis for many machine learning algorithms. When dealing with big data, and when multiple rounds of trial and error are required, feature extraction in general, and N-grams as a specific use case, can be done close to the database to optimize processing and reduce cost.
In this blog, we showed that analyzing, experimenting with, and preprocessing N-grams for any use can be done effectively and easily using SQL, allowing you to utilize the query engine as the first point of contact with the dataset. You can try it yourself by simply running an SQL query.