How To Detect Credit Card Fraud Through Transaction Time Using Machine Learning
Our objective in this notebook is to detect credit card fraud transactions, given parameters such as the time between transactions, transaction amount, and some obfuscated columns.
Table of Contents
- Background
- Data ingestion
- Data preprocessing
- Exploratory Data Analysis
- Model Training
- Testing the Model
- Conclusion
- Credits
Background
Purpose
If you work at a bank, you know that millions of fraudulent transactions occur every day. According to Forbes, U.S. card issuers and merchants could lose over $12 billion in 2020 due to fraud. With this model, even on a generic public dataset, we can detect fraud with a minimum of 92% accuracy.
Using machine learning technology, Cocolevio can create custom models to detect credit card fraud for banks or financial institutions, allowing companies and cardholders to rest easy with an additional layer of security to protect their finances.
Introduction
In this notebook, we used a public dataset from Kaggle. You can download and use the dataset HERE. Thanks to the ULB Machine Learning Group for providing and preprocessing this dataset.
The first step will be observing our data and understanding what we are working with. Second, we will “clean” the dataset, dropping empty values and scaling where needed. Third, we will balance our dataset so we get reliable predictions for both fraudulent and non-fraudulent transactions. Fourth, we will build and train a model to help us predict outcomes.
For this notebook, we are given some columns: Time, V1-V28, Amount, and Class.
Time corresponds to the number of seconds elapsed between a given transaction and the first transaction in the dataset.
V1-V28 are obfuscated columns of personal data and additional features that may contain sensitive information. The data values for these columns were produced through an obfuscation process. For security reasons, we cannot backtrack these numbers to any values that make sense to us, so our model will only be specific to the non-sensitive data here. However, we can still show that we can create an accurate model for this use case.
Amount refers to the transaction amount.
Class is 1 for fraudulent transactions and 0 otherwise.
Data Ingestion
Initial Thoughts
According to the author of the dataset, the distribution of transactions is heavily skewed for non-fraudulent transactions. While there are only 492 fraudulent transactions, there are 284,315 non-fraudulent transactions. This is not a good balance, and we will rectify this situation during the data preprocessing stage. If we leave the balance heavily skewed towards not fraud, our model will rarely ever predict a transaction as fraudulent. Since there are only two options for a transaction (fraud/not fraud), we want the distribution in our dataset to be as close to equal as possible.
Removing a significant number of values will also affect our model and has its own disadvantages: since we will significantly reduce our training points, our accuracy estimates may be skewed too high.
Unfortunately, adding to this dataset is impossible, as we do not know all of the original features and attributes (V1-V28).
Data Preprocessing
Purpose
The data preprocessing stage aims to minimize potential errors in the model as much as possible. Generally, a model is only as good as the data passed into it, and the data preprocessing ensures that the model has as accurate a dataset as possible. While we cannot perfectly clean the dataset, we can at least follow some basic steps to ensure our dataset has the best possible chance of generating a good model.
First, let’s check for null values in this dataset. Null values are empty entries that carry no useful information. If we skip removing them, our model will be less accurate because it will form “connections” for meaningless values rather than focusing all of its resources on creating connections for useful ones.
import pandas as pd
import numpy as np
df = pd.read_csv('datasets/creditcard.csv')
In [2]:
print("Presence of null values: " + str(df.isnull().values.any()))
Presence of null values: False
Now that we’ve confirmed there are no null values, we can see what the distribution really looks like between the fraudulent and non-fraudulent transactions.
In [3]:
not_fraud_df = df[df['Class'] == 0]
fraud_df = df[df['Class'] == 1]
print("Number of nonfraudulent transactions: " + str(len(not_fraud_df.index)))
print("Number of fraudulent transactions: " + str(len(fraud_df.index)))
Number of nonfraudulent transactions: 284315
Number of fraudulent transactions: 492
So this is the first major hurdle of preprocessing this dataset. Building a model on such an uneven fraudulent vs. non-fraudulent distribution is very difficult: our model will predict significantly more non-fraudulent transactions simply because those transactions dominate the data.
On the other hand, having too few data points means our model has much less to learn from, so it cannot establish as many connections as it could with the full dataset. A significantly smaller test set also makes our accuracy estimates less reliable.
However, not equalizing the distributions makes our model practically useless due to the tremendous difference in the size of the not-fraud compared to the fraud.
So, let’s start by equalizing the number of fraud and not-fraud transactions. This will allow for better prediction, since the model will be trained to expect fraud and not-fraud with equal probability for this binary outcome.
If we had left the distribution as is, the model would be heavily skewed towards non-fraud, and the data from 492 frauds would have almost no effect on the model at all.
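As an aside, undersampling is not the only way to balance the classes. The sketch below shows the opposite approach, oversampling the fraud rows with replacement until they match the non-fraud count; it is only an illustration using the fraud_df and not_fraud_df frames defined above, and it is not the route we take in this notebook.
# alternative (not used here): repeat fraud rows until the classes are the same size
oversampled_fraud_df = fraud_df.sample(n=len(not_fraud_df.index), replace=True, random_state=42)
oversampled_df = pd.concat([not_fraud_df, oversampled_fraud_df]).sample(frac=1, random_state=42)
print(oversampled_df['Class'].value_counts())
Oversampling keeps every non-fraudulent row but repeats fraud rows, which carries its own risk of the model memorizing those repeats.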
Equalization of data between fraudulent and non-fraudulent
Now, we’re going to equalize the number of fraudulent and non-fraudulent transactions. Let’s start by extracting 492 non-fraudulent transactions (the code below simply takes the first 492 such rows) and shuffling the combined dataset.
not_fraud_df = df.loc[df['Class'] == 0][:492]
equalized_df = pd.concat([fraud_df, not_fraud_df])
equalized_df = equalized_df.sample(frac = 1, random_state = 42)
not_fraud_df = equalized_df[equalized_df['Class'] == 0]
fraud_df = equalized_df[equalized_df['Class'] == 1]
print("Number of nonfraudulent transactions: " + str(len(not_fraud_df.index)))
print("Number of fraudulent transactions: " + str(len(fraud_df.index)))
Number of nonfraudulent transactions: 492
Number of fraudulent transactions: 492
Now that we’ve equalized the number of fraudulent and non-fraudulent transactions, we should normalize all column values to best identify features in the dataset. In doing so, we minimize inaccuracies in having large values that may skew results.
Normalization is a process by which we scale values between specified limits, usually -1 to 1 or 0 to 1. This process is important because many machine learning models are heavily affected by differences in magnitude: the difference between 200 and 1 will cause far larger inaccuracies in our model than the difference between 1 and 0.1. Normalization helps us eliminate these sources of error rather than letting them propagate throughout our analysis.
In [5]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaler.fit(equalized_df['Time'].values.reshape(-1, 1))
equalized_df['Time'] = scaler.fit_transform(equalized_df['Time'].values.reshape(-1, 1))
scaler.fit(equalized_df['Amount'].values.reshape(-1, 1))
equalized_df['Amount'] = scaler.fit_transform(equalized_df['Amount'].values.reshape(-1, 1))
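As a quick sanity check (an addition, not part of the original notebook), we can confirm that the scaled Time and Amount columns now fall between 0 and 1:
# verify that the scaled columns lie in the [0, 1] range
print(equalized_df[['Time', 'Amount']].describe().loc[['min', 'max']])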
With this scaling done, we can now begin analyzing the dataset for potential features to train on.
In the next section, we’ll be performing Exploratory Data Analysis (EDA) on the dataset.
Exploratory Data Analysis
The purpose of EDA is to enhance our understanding of trends in the dataset without involving complicated machine-learning models. Often, we can see obvious traits using graphs and charts just from plotting dataset columns against each other.
We’ve completed the necessary preprocessing steps, so let’s create a correlation map to see the relations between different features.
A correlation map (or correlation matrix) is a visual tool that illustrates the relationships between different dataset columns. A cell is lighter when the two columns tend to move in the same direction together, and darker when one column decreases while the other increases. Strong light and dark spots in our correlation matrix suggest the data contains relationships a model can exploit, which bodes well for its future reliability.
In [18]:
import seaborn as sns
import matplotlib.pyplot as plt

def plot_corr(df):
    corr = df.corr()
    mask = np.zeros_like(corr, dtype=bool)
    mask[np.triu_indices_from(mask)] = True
    # Set up the matplotlib figure
    f, ax = plt.subplots(figsize=(11, 9))
    # Generate a custom diverging colormap
    cmap = sns.diverging_palette(220, 10, as_cmap=True)
    # Draw the heatmap with the mask and correct aspect ratio
    sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
                square=True, linewidths=.5, cbar_kws={"shrink": .5})

plot_corr(equalized_df)
From these column values, we can see some very useful correlations present within our new dataset. The Class (meaning fraud or not fraud) is heavily correlated with the Time (seconds between each transaction) but is negatively correlated with V3, V5-V7, V15-V18, etc. While this doesn’t tell us anything specific, it does show that the data has trends we can use to improve the accuracy of our model.
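To put numbers behind the heatmap, the short snippet below (an addition for illustration) lists the features most strongly correlated with Class:
# correlation of every feature with the Class label, from most negative to most positive
class_corr = equalized_df.corr()['Class'].drop('Class').sort_values()
print(class_corr.head(5))  # strongest negative correlations
print(class_corr.tail(5))  # strongest positive correlations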
Let’s also plot a pie chart of the distribution between fraud and not fraud for easier viewing.
import matplotlib.pyplot as plt
not_fraud = len(equalized_df[equalized_df['Class'] == 0].values)
fraud = len(equalized_df[equalized_df['Class'] == 1].values)
pie_chart = pd.DataFrame({'a': [0, 1], 'b': [not_fraud, fraud]})
pie_chart.plot.pie(subplots=True)
[Pie chart output: the fraud and non-fraud slices are equal in size]
As you can see in the pie chart above, the distribution between fraud and not fraud is equal, which means our code snippets for the equalization were successful.
Normally, the EDA section is where we use traditional statistical analysis techniques, rather than machine learning, to gain insights about our dataset. For example, had we known the V1-V28 columns as ‘Store’ or ‘Region’, EDA would tell us how much attributes like those contribute to fraud.
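Even with obfuscated columns, simple group statistics are an example of that kind of traditional analysis. The sketch below (an addition for illustration) compares the scaled transaction Amount between the two classes:
# compare the (scaled) transaction amount across fraud and non-fraud groups
print(equalized_df.groupby('Class')['Amount'].describe()[['mean', '50%', 'max']])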
Model Training
In this section, we will be creating and training our model for predicting whether a transaction is fraudulent. Since there are multiple algorithms we can use to build our model, we will compare the accuracy scores after testing and pick the most accurate algorithm.
For my use, I exported the normalized, equal-parts fraud and not-fraud data to a .csv file called “cleaned credit card.csv”. You can use the code below to do this, or simply continue with the equalized_df generated earlier; both provide the same data.
equalized_df.to_csv('datasets/cleaned credit card.csv')
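If you go the .csv route, the file can be read back in a later session with pandas; a minimal sketch (index_col=0 assumes the default index written by to_csv above):
# reload the cleaned, balanced dataset later
equalized_df = pd.read_csv('datasets/cleaned credit card.csv', index_col=0)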
Let’s begin by importing the different models we will be using for classification.
In [9]:
from xgboost import XGBClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
From this list, we are using XGBoost, DecisionTree, RandomForest, Naive Bayes, LogisticRegression, SVC, and KNeighborsClassifier to perform our predictions. We will then see which algorithm produces the highest accuracy and select it as our algorithm of choice for future use. We also want to partition our dataset into training, testing, and validation sets, so let’s import what we need for that as well.
In [10]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
Now, we can begin building and training our model.
Let’s split our data into test, train, and validation using train_test_split.
Our process has three phases: training, testing, and validation. Training is first, and it’s where our model generates “intuition” about how to approach fraudulent and non-fraudulent transactions. It is similar to a student studying and developing knowledge about a topic before an exam.
The testing phase is where we see how the model performs against data where we know the outcome. This would be the exam if we continue the analogy from before. The algorithms will perform differently, similar to how students score differently on exams. We generate an accuracy score from this phase to compare the different algorithms.
Validation testing ensures that the model isn’t overfitting to our specific dataset. Overfitting is when the model develops an intuition too specific to the training set. Overfitting is a problem because our model is no longer flexible. It may work on the initial set, but subsequent uses will cause our model to fail. Continuing the exam analogy, the validation testing phase is like another exam version with different questions. If a student happened to cheat on the first exam by knowing the questions, the second exam will give a better representation of performance.
Note that validation doesn’t completely prove or disprove overfitting, but it does give us insight into it.
training,test = train_test_split(equalized_df, train_size = 0.7, test_size = 0.3, shuffle=True)
training, valid = train_test_split(training, train_size = 0.7, test_size =0.3, shuffle=True)
training_label = training.pop('Class')
test_label = test.pop('Class')
valid_label = valid.pop('Class')
We assign the ‘Class’ column to be our label as that is what we are trying to classify by. Our training and testing will use these labels to compare the predicted output versus the actual output.
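Before training, it is worth confirming how many rows landed in each split and that both classes appear in each; a short check (an addition, not in the original notebook):
# confirm split sizes and the fraction of fraud in each split
for name, labels in [('train', training_label), ('test', test_label), ('validation', valid_label)]:
    print(name, len(labels), 'fraud fraction:', round(labels.mean(), 3))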
In [12]:
import pprint
pp = pprint.PrettyPrinter(indent=4)
# instantiate the algorithms
xgb = XGBClassifier()
dtc = DecisionTreeClassifier()
rfc = RandomForestClassifier()
nbc = GaussianNB()
LR = LogisticRegression(random_state=0, solver='lbfgs', multi_class='multinomial')
SVM = SVC(kernel='rbf', C=1, gamma='auto')
knn = KNeighborsClassifier(n_neighbors=3)
# train the models
xgb.fit(training, training_label)
dtc.fit(training, training_label)
rfc.fit(training, training_label)
nbc.fit(training, training_label)
LR.fit(training, training_label)
SVM.fit(training, training_label)
knn.fit(training, training_label)
# try and predict an outcome from the test set
xgb_predict = xgb.predict(test)
dtc_predict = dtc.predict(test)
rfc_predict = rfc.predict(test)
nbc_predict = nbc.predict(test)
LR_predict = LR.predict(test)
SVM_predict = SVM.predict(test)
knn_predict = knn.predict(test)
# judge accuracy using built-in function
accuracy = dict()
accuracy['XGBoost'] = accuracy_score(test_label, xgb_predict)
accuracy['Naive_bayes'] = accuracy_score(test_label, nbc_predict)
accuracy['DecisionTree'] = accuracy_score(test_label, dtc_predict)
accuracy['RandomForest'] = accuracy_score(test_label, rfc_predict)
accuracy['support_vector_Machines'] = accuracy_score(test_label, SVM_predict)
accuracy['Linear Regression'] = accuracy_score(test_label, LR_predict)  # logistic regression, despite the key name
accuracy['KNN'] = accuracy_score(test_label, knn_predict)
The accuracies for the different algorithms are shown below, sorted by algorithm name, each with its decimal accuracy:
In [13]:
pp.pprint(accuracy)
{   'DecisionTree': 0.9932432432432432,
'KNN': 0.9493243243243243,
'Linear Regression': 0.9797297297297297,
'Naive_bayes': 0.9797297297297297,
'RandomForest': 0.9864864864864865,
'XGBoost': 0.9932432432432432,
'support_vector_Machines': 0.9290540540540541
}
From the preliminary testing, we can see that all of the algorithms score above 90% accuracy. While this is good, we must also check whether the model is overfitting to the data we have, so we test against a separate validation set. If the validation accuracy is also high, we can be slightly more confident that our model isn’t overfitting.
However, it appears that XGBoost and DecisionTree are performing the best, each scoring above 99% on the testing set.
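Another way to probe overfitting before moving on is k-fold cross-validation, which repeats the train/test split several times and averages the scores. A minimal sketch with scikit-learn's cross_val_score on a fresh DecisionTree (an addition; the notebook itself relies on the validation split in the next section, and this assumes equalized_df still contains its Class column):
from sklearn.model_selection import cross_val_score

# 5-fold cross-validation on the balanced dataset
features = equalized_df.drop(columns=['Class'])
labels = equalized_df['Class']
cv_scores = cross_val_score(DecisionTreeClassifier(), features, labels, cv=5)
print('Cross-validation accuracy per fold:', cv_scores.round(3))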
Testing the Model
After performing the training tests, let’s perform validation tests to see if the model is overfitting to our test data.
In [14]:
# perform validation testing for dataset
xgb_predict = xgb.predict(valid)
dtc_predict = dtc.predict(valid)
rfc_predict = rfc.predict(valid)
nbc_predict = nbc.predict(valid)
LR_predict = LR.predict(valid)
SVM_predict = SVM.predict(valid)
knn_predict = knn.predict(valid)
Here, we will store the accuracies for all the algorithms and look for the highest accuracy prediction to use for our model.
In [15]:
# judge accuracy using built-in function
accuracy['XGBoost'] = accuracy_score(valid_label, xgb_predict)
accuracy['Naive_bayes'] = accuracy_score(valid_label, nbc_predict)
accuracy['DecisionTree'] = accuracy_score(valid_label, dtc_predict)
accuracy['RandomForest'] = accuracy_score(valid_label,rfc_predict)
accuracy['support_vector_Machines'] = accuracy_score(valid_label,SVM_predict)
accuracy['Linear Regression'] = accuracy_score(valid_label,LR_predict)
accuracy['KNN'] = accuracy_score(valid_label,knn_predict)
The accuracies for the validation testing are below, in the same format as the testing set:
In [16]:
pp.pprint(accuracy)
{   'DecisionTree': 1.0,
'KNN': 0.9516908212560387,
'Linear Regression': 0.9710144927536232,
'Naive_bayes': 0.9758454106280193,
'RandomForest': 0.9903381642512077,
'XGBoost': 1.0,
'support_vector_Machines': 0.9468599033816425
}
These accuracies are extremely high as well. Even with validation testing, there is still a chance that the model is overfitting to this particular dataset, so we must stay aware of that possibility in the future.
Because we also cut so many data points to equalize the fraud/not-fraud distribution, the model is evaluated on relatively few examples, so we should treat these figures cautiously and assume a minimum accuracy of around 90% going forward.
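Accuracy alone can also hide how many frauds slip through, so a confusion matrix and recall score are useful complements. A minimal sketch on the validation predictions of the best-scoring model (an addition to the original notebook):
from sklearn.metrics import confusion_matrix, recall_score

# rows are actual classes, columns are predicted classes
print(confusion_matrix(valid_label, xgb_predict))
# recall: the fraction of actual frauds the model caught
print('Fraud recall:', recall_score(valid_label, xgb_predict))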
In [17]:
max_accuracy = max(accuracy,key=accuracy.get)
pp.pprint("Most accurate algorithm is: " + str(max_accuracy))
'Most accurate algorithm is: XGBoost'
For this dataset, XGBoost, RandomForest, and DecisionTree are the highest-accuracy classifiers, so we will use them for future classification tasks. The specific choice depends on your priorities: speed, accuracy, or resource usage. Each of these algorithms trades those off differently, so choosing one requires in-depth knowledge of what a company prioritizes most for its use case.
Overall, even with obfuscated data, we can perform reliable fraud detection using very little fraud/non-fraud data.
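If the chosen classifier will be reused later, it can be saved to disk and loaded back. A minimal sketch using joblib (an assumption; model persistence is not part of the original notebook, and the file name is purely illustrative):
import joblib

# save the trained XGBoost model, then load it back and sanity-check it
joblib.dump(xgb, 'fraud_xgb.joblib')  # illustrative file name
loaded_model = joblib.load('fraud_xgb.joblib')
print(loaded_model.predict(valid[:5]))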
Conclusion
During this notebook, we built a model that could accurately detect credit card fraud transactions with at least 92% accuracy. This model has many valid real-world use cases. For example, a bank could take a similar approach and reduce the money spent trying to detect fraud by automating it with a machine. Our model could also save the consumer a lot of time and money by having an extra layer of security for lost cards and stolen items.
While the model may be susceptible to overfitting, we have shown that it can detect credit card fraud reliably. We can protect consumers with this technology by integrating a similar model into the transaction process to notify the consumer and bank of fraud within minutes rather than days.
If you’re a banker looking to stop losing money to fraud, Cocolevio’s machine learning models can help you prevent fraud before it affects your business.
Credits
Thanks again to the Machine Learning Group at ULB for providing the dataset and Kaggle for hosting it. Thanks to the many Kagglers out there for their contributions to Data Science.