Feature contribution is a method that assigns each feature a weight reflecting its impact on the model's prediction. Feature contribution can be calculated for an entire dataset or for a single data point.
In our previous blog post, we showed how to visualize feature contributions to make it easy to find the features that contribute the least or the most to the model's results. We did that using popular methods for calculating the contribution (e.g., SHAP).
In this post, we will show how to further leverage traditional feature contribution calculation methods: we will use feature contributions to quantify more accurately how good or bad a feature is for the model, and create a tailor-made function that rates the features according to the model's specific goals.
We utilize the feature contribution results to calculate metrics on features as shown in the chart below:
The analysis will be done with SQL, which we use because it supports advanced analysis on large amounts of data.
Rank features by your own metric
Each model has its own purpose and its own metric to optimize. In some cases you will want the model to be correct in as many classifications as possible; in that case, you are likely to choose accuracy: the percentage of correct predictions. In our model, the worst thing that could happen is a False Positive, so our most important metric was precision: the percentage of correct positive predictions out of all positive predictions.
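In terms of the four classification outcomes (TP, TN, FP, FN), these two metrics are:

\[
\text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN},
\qquad
\text{precision} = \frac{TP}{TP + FP}
\]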
Below you can find three use cases for scoring features and using the feature contribution to know more about your model and feature performance.
Use Case 1 – Find the worst features
We had many features that were doing more harm than good. Thus, we decided to remove some of them.
To reduce both the number of features and the number of false positives (FP), we decided to score the features by how badly they perform, based on their classification result – FP, FN, TP, TN. We wanted a scoring method that gives more weight to the positive classification results, because a low FP rate is crucial to the product. However, we also gave some weight to the negative classification results, so the model's results would still make sense.
Consider a model that classifies the input as 0 or 1. It uses the features f1, …, fn. Below is an example of inference results, stored in a table named contribution_data. Each result has a classification and feature contribution data. The contribution data is a mapping between a feature and its contribution, as calculated by a standard feature contribution algorithm.
Note: the table below shows only features above a predefined absolute contribution value. Many features have contributions close to 0; we ignore those features because it simplifies the calculations we show next and also shortens the computation time.
Another important thing to note: for readability, we changed the sign of avg_contrib so that a contribution that ‘harms’ the final classification is negative, and one that supports it is positive.
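To make the queries below easier to follow, here is a minimal sketch of the table structure they assume. The column names type and contribution_map are taken from the queries; the exact DDL and types are an assumption:

-- Hypothetical schema for the inference results table (Presto/Trino types)
-- type:             the classification result – 'TP', 'TN', 'FP' or 'FN'
-- contribution_map: feature name -> contribution, e.g. {'f1': 0.35, 'f7': -0.22}
CREATE TABLE contribution_data (
    type             VARCHAR,
    contribution_map MAP(VARCHAR, DOUBLE)
);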
We calculate the average contribution and the ratio between the true and false results. We do this per (feature, classification) pair. From these metrics we get to the final feature score:
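Spelled out, the scores computed in step 3 of the SQL below are:

\[
\text{score}(f) = \frac{C_{TP}(f) + C_{TN}(f) - C_{FP}(f) - C_{FN}(f)}{\text{total\_hits}(f)},
\qquad
\text{fp\_score}(f) = \frac{-\,C_{FP}(f)}{\text{hits}_{FP}(f)}
\]

where C_X(f) is the summed contribution of feature f over the results classified as X, hits_X(f) is the number of such results, and total_hits(f) is the total number of results in which f had a significant contribution.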
Here is the SQL to calculate the feature scores:
-- STEP 1: average contribution for each classification type and feature
WITH feature_class_data AS (
SELECT type, feature,
SUM(contribution) AS contrib,
COUNT(*) AS hits
FROM contribution_data CROSS JOIN UNNEST (contribution_map) AS t(feature, contribution)
WHERE ABS(contribution) >= 0.1
GROUP BY 1, 2),
-- STEP 2: aggregation per feature
feature_stats AS (
SELECT feature,
MAP_AGG(type, contrib) AS contrib, -- {"FP": 43434.4, "FN": -432433.02, ...}
MAP_AGG(type, hits) AS hits, -- {"FP": 3434239, "FN": 9876, ...}
1.0 * SUM(hits) AS total_hits
FROM feature_class_data
GROUP BY feature)
-- STEP 3: score per feature
SELECT feature,
(contrib['TP'] + contrib['TN'] - contrib['FP'] - contrib['FN']) / total_hits AS score,
-contrib['FP'] / hits['FP'] AS fp_score
FROM feature_stats
ORDER BY score -- lowest scores (the 'worst' features) first
The results will look something like this. The top rows are the ‘worst’ features; the features at the bottom are the most useful ones. Using this ranking, you can decide which features should not be used.
Keep in mind that while a feature can cause many False Positive predictions, it may also contribute positively to True Positive predictions. Removing it is not necessarily the best option; refining the feature could also help the performance of the model.
Use Case 2 – Find unstable features and features with unfulfilled potential
In this example, we will look for features that improve the classification on some inputs but harm the classification of other inputs.
Instead of looking at the average contribution of each feature, we will now look at the feature’s contribution Standard Deviation (STD). A feature with a high STD is one whose contribution varies a lot across different inputs. This is how it’s done with SQL:
WITH types_stats AS (
SELECT
feature,
STDDEV(contribution) AS contrib_std,
MIN(contribution) AS min_contrib,
MAX(contribution) AS max_contrib
FROM contribution_data CROSS JOIN UNNEST (contribution_map) AS t(feature, contribution)
WHERE ABS(contribution) >= 0.1
GROUP BY 1)
SELECT *
FROM types_stats
WHERE
max_contrib > 0.2 -- in some cases the feature improves the classification significantly
AND min_contrib < 0.02 -- in other cases the feature is insignificant or worsens the classification
ORDER BY contrib_std DESC -- the contribution changes drastically on different inputs
The output will look like this:
The features in the output table are those with the highest STD and thus the most relevant to investigate. See if it is possible to isolate the predictions in which these features perform badly, and look for the right way to remediate: by creating a supporting feature or tweaking the existing one.
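For example, a minimal isolation query could look like the sketch below (the feature name 'feature_x' and the -0.1 cutoff are hypothetical placeholders):

-- Pull only the results in which a specific feature pushed against the final classification
-- (with the sign convention above, a negative contribution is one that 'harms' the classification)
SELECT type, contribution_map
FROM contribution_data
WHERE ELEMENT_AT(contribution_map, 'feature_x') < -0.1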
Use Case 3 – Feature ranking during the ML life cycle
When developing a machine learning model, we usually start from an initial model that we update until it is fine-tuned and deployed. After deployment, we monitor its performance:
Throughout the development and deployment process, the code and data may change. These changes may impact the performance of the model and the relevance of the features. Features should be monitored during development and deployment just like any other metric. You can log the feature scoring after performing inference, or report it to a monitoring system and track it over time.
For example, if a feature starts having a negative impact on the model’s performance, you should know about it as soon as it happens. It will help you understand why it happened and implement a fix, when necessary.
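As a minimal sketch of that kind of tracking, assuming a hypothetical inference_date column on contribution_data, the per-day average contribution of each feature could be computed like this:

-- Average contribution per feature per day (inference_date is a hypothetical column)
SELECT inference_date,
       feature,
       AVG(contribution) AS avg_contrib
FROM contribution_data CROSS JOIN UNNEST (contribution_map) AS t(feature, contribution)
WHERE ABS(contribution) >= 0.1
GROUP BY 1, 2
ORDER BY inference_date, feature

Plotting avg_contrib over time, or reporting it to a monitoring system, makes it easy to spot a feature whose contribution starts drifting negative.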
In our ML testing post we explain how and why a model should be tested continuously throughout the development and deployment process. In the post, we talk about monitoring different test sets. Similarly, monitoring features’ importance for the different test sets lets us know more about our features and how changing the model and the data affects the features’ impact, over time.
Summary
We showed several ways to leverage feature contribution data by creating tailor-made metrics that suit the model’s goal. In the same way, you can create your own metric that works best for your model. You can use it to initially tune your model and to keep it tuned over time.