Author: Nora Petrova
It is estimated that companies lose on average 7% of their annual expenditure to fraud, costing the UK £110 billion a year, and £3 trillion globally. To put that into perspective, in 2020 there were only 7 companies in the world in the trillion-dollar club. Can data innovation help us solve this issue?
Although many companies are still relying on manual processes to review and prevent fraud (20% for Europe, 16% for North America, 30% for Asia and 20% for Africa*), the table below shows that the crime landscape is constantly evolving and fraudsters are becoming more sophisticated over time.
Companies have to respond with more sophisticated fraud detection solutions in turn. Machine learning is now being adopted worldwide to mitigate the levels of fraud across industries and remove the necessity for manual intervention.
Despite the variety of techniques fraudsters use to cover their tracks, very few manage to completely obfuscate the fraud traces they leave behind.
In this blog, I’ll give an overview of how machine learning approaches can detect and learn fraud signals, with a focus on credit card fraud detection, and share the best approaches we’ve found in tackling it.
Using machine learning to detect fraud is part of the wider effort of anomaly detection, which aims to identify rare events that differ significantly from the majority of the data. There are two main categories within anomaly detection:
Unsupervised techniques aim to detect anomalies in unlabelled datasets by grouping events according to their features. Clustering is a commonly used technique, and, more recently, autoencoders have also proven useful for feature extraction. Autoencoders learn a lower-dimensional latent representation of the input data and then learn to reconstruct it. It is analogous to compression and decompression as shown in the image below. The assumption here is that in a given dataset where the majority of events are non-fraudulent, the model will learn to reconstruct them better than anomalous ones and the reconstruction error will be higher for the outlier events.
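To make the reconstruction-error idea concrete, here is a minimal sketch. It does not train a neural network: as a simplifying assumption, a linear projection onto a single principal direction stands in for the autoencoder's encoder and decoder. The toy points and the function names are hypothetical.

```python
import math

def fit_linear_bottleneck(points):
    """Return the mean and principal direction of 2-D points (a 1-D linear 'latent space')."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    cxx = sum((x - mean_x) ** 2 for x, _ in points) / n
    cyy = sum((y - mean_y) ** 2 for _, y in points) / n
    cxy = sum((x - mean_x) * (y - mean_y) for x, y in points) / n
    # Angle of the major axis of the 2x2 covariance matrix.
    theta = 0.5 * math.atan2(2 * cxy, cxx - cyy)
    return (mean_x, mean_y), (math.cos(theta), math.sin(theta))

def reconstruction_error(point, mean, direction):
    """Encode a point to one number, decode it back, and measure what was lost."""
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    code = dx * direction[0] + dy * direction[1]              # "encode"
    rec_x, rec_y = code * direction[0], code * direction[1]   # "decode"
    return math.hypot(dx - rec_x, dy - rec_y)

# Toy data: 20 "normal" events close to the line y = 2x, plus one outlier at the end.
points = [(float(i), 2.0 * i + ((i % 3) - 1) * 0.1) for i in range(1, 21)]
points.append((10.0, 5.0))

mean, direction = fit_linear_bottleneck(points)
scores = [reconstruction_error(p, mean, direction) for p in points]
most_anomalous = max(range(len(points)), key=scores.__getitem__)
print(points[most_anomalous], round(scores[most_anomalous], 2))
```

Because most points follow the same pattern, the one-dimensional representation captures them well, and the event that breaks the pattern should come out with the largest error. A real autoencoder would learn a non-linear version of the same mapping.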
Supervised techniques require the dataset to contain a label indicating whether an event is fraudulent or not. However, labelled datasets are scarce, with only a few high-quality public datasets available. Additionally, the labels are usually derived from chargebacks issued by customers, which introduces a level of noise.
Tree-based models, such as Random Forest and XGBoost, are heavily used for supervised fraud detection tasks. More recently, graph-based models have been gaining popularity. They make use of the graph underlying a transaction network to learn network properties that make fraud easier to identify.
Key aspects of setting up the problem of fraud detection for machine learning include understanding the type of fraud that we’re trying to detect (e.g. credit card fraud, identity theft, etc), engineering features that capture domain knowledge, balancing the data using well-known data re-balancing techniques and selecting the right approach given the available data.
Let’s start by diving into feature selection and engineering. As mentioned earlier, fraudsters tend to leave a pattern of activity that can be distinguished from activity generated by non-fraudulent users. The challenge is to represent the available data in a way that makes this distinction between the classes as easy as possible for the machine learning model. Feature engineering can simplify this considerably by leveraging the domain knowledge humans have when it comes to fraud. Representing the problem in a way that makes it easy for the machine learning model to learn from is the key element to building a successful fraud detection solution.
A powerful way of going about feature building, in the context of credit card fraud, is to include features that capture various aspects of the customer’s activity. We can consider:
The features that help provide context are ones containing information about more global fraud patterns. This could be a particular email domain or a specific grouping of credit cards for a certain country. Further context is provided by features related to the customer’s typical past behaviour in the system, such as the usual amount they spend per order, devices they tend to use, and their typical location.
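As an illustration of features based on a customer’s typical past behaviour, here is a minimal sketch. The field names (amount, device, country) and the three features are hypothetical choices for the example, not a prescribed feature set.

```python
from statistics import mean

def behaviour_features(history, transaction):
    """Compare a new transaction with the customer's own past orders."""
    usual_amount = mean(order["amount"] for order in history)
    known_devices = {order["device"] for order in history}
    known_countries = {order["country"] for order in history}
    return {
        # How large is this order relative to the customer's usual spend?
        "amount_ratio": transaction["amount"] / usual_amount,
        # Has the customer used this device or ordered from this country before?
        "new_device": transaction["device"] not in known_devices,
        "new_country": transaction["country"] not in known_countries,
    }

history = [
    {"amount": 20.0, "device": "phone-1", "country": "GB"},
    {"amount": 30.0, "device": "phone-1", "country": "GB"},
    {"amount": 25.0, "device": "laptop-1", "country": "GB"},
]
transaction = {"amount": 250.0, "device": "tablet-9", "country": "BR"}

features = behaviour_features(history, transaction)
print(features)
```

Each feature turns a piece of domain knowledge ("this is unusual for this customer") into a number or flag that a model can split on directly.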
Finally, I’d like to emphasise the importance of engineering features related to the network inherent in the data. This involves tracking entities (e.g. customers, credit cards, devices, etc) across the network and utilising the network structure. Looking at the problem from the perspective of different entities in the network is at times much more effective at revealing fraud patterns.
Supervised tree-based techniques tend to perform really well when a significant portion of the domain knowledge can be represented as features and the complexity of the fraud is relatively low. Models such as Random Forest and XGBoost perform well out of the box on a variety of fraud datasets that we’ve tested them on. Using additional boosting techniques, such as AdaBoost, can sometimes help push performance for tree-based models. Using multiple classifiers with majority voting is a robust technique that mitigates some of the limitations specific to certain tree-based models.
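Majority voting itself is simple to sketch. The snippet below assumes that each model has already produced 0/1 predictions for the same transactions; the three prediction lists are made up for illustration and do not come from trained models.

```python
from collections import Counter

def majority_vote(predictions_per_model):
    """Combine 0/1 predictions from several models (one list of predictions per model)."""
    combined = []
    for votes in zip(*predictions_per_model):
        label, _ = Counter(votes).most_common(1)[0]
        combined.append(label)
    return combined

# Hypothetical predictions for five transactions (1 = fraud, 0 = not fraud).
random_forest_preds = [0, 1, 0, 1, 0]
xgboost_preds = [0, 1, 1, 1, 0]
adaboost_preds = [0, 0, 0, 1, 1]

ensemble_preds = majority_vote([random_forest_preds, xgboost_preds, adaboost_preds])
print(ensemble_preds)
```

Using an odd number of models avoids ties for a binary label. In practice a library implementation, such as scikit-learn's VotingClassifier with hard voting, would usually be preferable to a hand-rolled function.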
On the unsupervised side, Isolation Forest is a tree-based model that lends itself to being used on unlabelled data when the contamination rate (the proportion of observed fraud) is known or can be estimated reliably. Autoencoders are also gaining popularity and are being combined with classifiers. This approach can work very well when the autoencoder can learn a useful lower-dimensional latent representation of the feature space. This enables even simple classifiers to differentiate between the classes much more effectively.
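To show why the contamination rate matters, here is a minimal sketch of how such a rate can turn anomaly scores into fraud flags. It does not build an Isolation Forest: the scores are hypothetical, and the thresholding rule is a simplified assumption about how an estimated rate is typically applied.

```python
def flag_by_contamination(anomaly_scores, contamination):
    """Flag the highest-scoring fraction of events as anomalies."""
    n_flagged = max(1, round(len(anomaly_scores) * contamination))
    # The score of the n-th most anomalous event becomes the cut-off.
    threshold = sorted(anomaly_scores, reverse=True)[n_flagged - 1]
    return [score >= threshold for score in anomaly_scores]

# Hypothetical anomaly scores for ten events (higher = more unusual).
scores = [0.10, 0.12, 0.09, 0.95, 0.11, 0.13, 0.08, 0.10, 0.88, 0.12]

flags = flag_by_contamination(scores, contamination=0.2)
flagged_indices = [i for i, flagged in enumerate(flags) if flagged]
print(flagged_indices)
```

If the rate is overestimated, ordinary events get flagged; if it is underestimated, real fraud slips below the cut-off. This is why the approach depends on the rate being known or reliably estimated.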
Graph-based neural networks are gaining traction and are more frequently used within anomaly detection. I will cover them in more detail below but first, I want to briefly touch on sampling techniques and useful metrics when dealing with imbalanced data.
The library imblearn provides a lot of sampling techniques and I recommend having a look if you have a highly imbalanced dataset and the problem can be simplified by re-balancing it through undersampling, oversampling or a hybrid approach. The library implements popular sampling techniques such as ADASYN, SMOTE, KMeans SMOTE and others.
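For intuition, here is a minimal sketch of the simplest re-balancing approach, plain random oversampling, written without any library. It is not SMOTE or ADASYN, which synthesise new minority samples rather than duplicating existing ones. The toy rows and the function name are hypothetical.

```python
import random
from collections import Counter

def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority-class rows until all classes have equal counts."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for label, count in counts.items():
        pool = [row for row, row_label in zip(rows, labels) if row_label == label]
        for _ in range(target - count):
            out_rows.append(rng.choice(pool))
            out_labels.append(label)
    return out_rows, out_labels

# Toy dataset: one feature (amount), 8 legitimate rows and 2 fraudulent ones.
rows = [[12.0], [8.5], [30.0], [15.0], [9.9], [22.0], [11.0], [18.0], [950.0], [875.0]]
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

balanced_rows, balanced_labels = random_oversample(rows, labels)
print(Counter(balanced_labels))
```

Re-balancing should be applied to the training split only, so that evaluation still reflects the real class distribution.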
Choosing the right metric is very important. The most common metrics used in supervised anomaly detection include accuracy, F1 Score, ROC AUC, the Precision-Recall Curve and the Matthews Correlation Coefficient (MCC). If you are relying heavily on F1 Score for your classification problem in fraud detection, I’d recommend having a look at MCC. It is more robust than F1 Score because it takes all four cells of the confusion matrix into account, including true negatives, whereas the F1 Score, as the harmonic mean of precision and recall, ignores true negatives.
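The difference between the two metrics is easiest to see on a concrete confusion matrix. The counts below are hypothetical and deliberately chosen so that the positive class dominates and the classifier gets almost no negatives right.

```python
import math

def f1_score(tp, fp, fn, tn):
    """Harmonic mean of precision and recall; tn is accepted but never used."""
    return 2 * tp / (2 * tp + fp + fn)

def mcc(tp, fp, fn, tn):
    """Matthews Correlation Coefficient, using all four cells of the confusion matrix."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denominator if denominator else 0.0

# Hypothetical counts: 95 actual positives, 5 actual negatives.
counts = {"tp": 90, "fp": 4, "fn": 5, "tn": 1}

f1 = f1_score(**counts)
matthews = mcc(**counts)
print(f"F1 = {f1:.3f}, MCC = {matthews:.3f}")
```

Here the F1 Score stays high because it never looks at the true negatives, while MCC stays close to zero because the classifier identifies only one of the five negatives correctly.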
Now, let’s dive into graph-based neural networks and what they can offer beyond the most common techniques, which do not utilise the graph structure in the data. Graph-based approaches most frequently use the adjacency matrix between entities in the network, which is passed into the model alongside node and edge attributes. As an example, consider a simple transactional dataset where we have credit cards, devices, location and amount. We can construct a graph where credit cards and devices are nodes in the network and an edge exists if a transaction between a particular credit card and device has occurred. We can also include the location as a node attribute.
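The graph construction described above can be sketched in a few lines. The transactions are made up, and attaching locations to card nodes and summed amounts to edges is just one possible choice of attributes for the example.

```python
# Hypothetical transactions: each one links a credit card to a device.
transactions = [
    {"card": "card_A", "device": "device_1", "location": "London", "amount": 25.0},
    {"card": "card_A", "device": "device_2", "location": "London", "amount": 40.0},
    {"card": "card_B", "device": "device_2", "location": "Leeds", "amount": 12.5},
    {"card": "card_C", "device": "device_3", "location": "Paris", "amount": 300.0},
]

# Cards and devices are both nodes; give each one a row/column index.
nodes = sorted({t["card"] for t in transactions} | {t["device"] for t in transactions})
index = {node: i for i, node in enumerate(nodes)}

# Symmetric adjacency matrix: 1 where a card and a device share a transaction.
adjacency = [[0] * len(nodes) for _ in nodes]
for t in transactions:
    i, j = index[t["card"]], index[t["device"]]
    adjacency[i][j] = adjacency[j][i] = 1

# Node attribute: locations seen for each card. Edge attribute: total amount per edge.
card_locations = {}
edge_amount = {}
for t in transactions:
    card_locations.setdefault(t["card"], set()).add(t["location"])
    edge = (t["card"], t["device"])
    edge_amount[edge] = edge_amount.get(edge, 0.0) + t["amount"]

for node, row in zip(nodes, adjacency):
    print(node, row)
```

The adjacency matrix, together with the node and edge attributes, is the kind of input a graph-based model expects. For large networks a sparse representation would be used instead of nested lists.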
Graphs allow us to model inter-dependencies and capture the relational nature of transactions. Individual transactions may display little or no indication of being fraudulent, but when they are considered within the larger context, patterns may emerge. When we frame the problem in this way, we can consider anomalies at the level of a node, an edge or an entire subgraph. It is a powerful way to approach the problem, and the following example shows why. Suppose there is fraudulent activity in the network: a user has obtained a group of stolen credit cards and is trying all of them to test which ones work. The pattern will be difficult to detect if we just consider the transactions individually, but if we consider the immediate neighbourhood of the user in the graph, we will observe that this user is associated with a disproportionate number of credit cards, where the norm may be 2-3 cards per customer.
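The card-testing scenario can be sketched as a neighbourhood check. The user names and transactions are invented, and flagging users with more than three times the median number of cards is a hypothetical threshold chosen for the example, not a recommended rule.

```python
from collections import defaultdict
from statistics import median

def cards_per_user(transactions):
    """Size of each user's immediate neighbourhood: the distinct cards they have used."""
    neighbours = defaultdict(set)
    for user, card in transactions:
        neighbours[user].add(card)
    return {user: len(cards) for user, cards in neighbours.items()}

def flag_card_testers(transactions, ratio=3.0):
    """Flag users linked to far more cards than is typical in the network."""
    degrees = cards_per_user(transactions)
    typical = median(degrees.values())
    return sorted(user for user, degree in degrees.items() if degree > ratio * typical)

# Hypothetical (user, card) pairs: three ordinary users and one trying 12 stolen cards.
transactions = [
    ("alice", "card_1"), ("alice", "card_2"), ("alice", "card_1"),
    ("bob", "card_3"), ("bob", "card_4"), ("bob", "card_5"),
    ("carol", "card_6"), ("carol", "card_7"),
] + [("mallory", f"stolen_{i}") for i in range(12)]

suspects = flag_card_testers(transactions)
print(cards_per_user(transactions), suspects)
```

Each of the suspect's transactions may look unremarkable on its own; the signal only appears once the transactions are grouped around the user node.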
If you’d like to try some of the latest graph-based models, spektral is a really useful library which includes implementations of many of them.
In conclusion, fraud detection is a key area where machine learning can lead to billions in savings for businesses while providing customers with a safer environment. Through advanced feature engineering and modelling techniques, such as graph networks, autoencoders and clustering, machine learning can help detect fraudulent events as anomalies in a standard customer purchasing journey. Implementing the right machine learning solution means going beyond simple heuristics and connecting the user’s behaviour to the common behaviour experienced by the merchant, the region, the industry, and many other dimensions. In this unique scenario where a 1% improvement in model performance can bring so much value to both businesses and customers, research in fraud detection may be one of the most interesting spaces to keep an eye on in AI.
As Google Cloud Specialization Partner of the Year, Datatonic has a proven track record in developing cutting-edge solutions in AI, Machine Learning and Cloud Modernization. Driven by technical excellence in Data Science and in-depth industry experience, we’ve delivered AI and Machine Learning powered solutions for top-tier clients globally.