Efficiently Identify Prospects Using Machine Learning
This notebook uses machine learning models to identify prospects, which in this dataset are the customers who have subscribed to a particular term deposit product.
Table of Contents
- Data ingestion
- Exploratory Data Analysis
- Data preprocessing
- Model Training
- Model Testing
- Conclusion
- Credits
Data Ingestion
Introduction
The purpose of this notebook is to create a model that identifies prospective clients, which in this dataset are the customers that have subscribed to a particular term deposit product.
Equipped with the information on the customer base and previous marketing campaign efforts, we can understand the campaign’s effectiveness. This model identifies subscribed customers based on a classification algorithm.
This notebook can be useful for any business that wishes to run a re-targeting or similar audience marketing campaign and cross-sell a new product.
With similar data to the columns in this dataset, one can identify customers most likely to subscribe to the product and plan subsequent campaigns accordingly by focusing on a subset of potential customers.
Narrowing the customers targeted by a particular campaign reduces costs and the risk of a failed campaign and/or overall product failure.
Thanks to the University of California Irvine for providing the dataset.
Source: [Moro et al., 2014] S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22–31, June 2014.
Understanding the Data
- job: Job category the customer belongs to (categorical: ’admin.’,’blue-collar’,’entrepreneur’,’housemaid’,’management’,’retired’,’self-employed’,’services’,’student’,’technician’,’unemployed’)
- marital: Marital status of the customer (categorical: ’divorced’,’married’,’single’,’unknown’; note: ’divorced’ means divorced or widowed)
- education: Education level of customer (categorical: ’basic.4y’,’basic.6y’,’basic.9y’,’high.school’,’illiterate’,’professional.course’,’university.degree’,’unknown’)
- default: Is the customer a defaulter? (categorical: ’no’,’yes’,’unknown’)
- balance: Overall balance amount of the customer.
- housing: Does the customer have any housing loan? (categorical: ’no’,’yes’,’unknown’)
- loan: Does the customer have any personal loan? (categorical: ’no’,’yes’,’unknown’)
Related to the last contact of the current campaign.
- contact: Customer contact communication type (categorical: ’cellular’,’telephone’)
- day: Last day when the customer had been contacted. (numeric: 1, 2, 3, … 31)
- month: Last month when the customer had been contacted (categorical: ’jan’, ’feb’, ’mar’, …, ’nov’, ’dec’)
- duration: Last contact duration with the customer, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then the target variable y=’no’, i.e. the customer has not subscribed). Yet, in a practical scenario, the duration is not known before a call is performed, and after the call ends the value of y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model (a short sketch for dropping the column appears after this list).
Other attributes
- campaign: The number of contacts performed during this campaign and for this client (numeric, includes the last contact)
- pdays: The number of days that passed after the client was last contacted from a previous campaign (numeric; 999 means the client was not previously contacted)
- previous: The number of contacts performed before this campaign and for this client (numeric)
- poutcome: The outcome of the previous marketing campaign (categorical: ’failure’,’nonexistent’,’success’)
Output variable (desired target)
- y: The target variable. It defines whether the client subscribed to the term deposit product for which the campaign was launched. (binary: ’yes’,’no’)
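As noted for duration above, a realistic (non-benchmark) model should not use that column. Once the data has been loaded into a data frame named df_bank (see the loading sketch in the Statistical Analysis section below), dropping it is a one-liner. This is an optional sketch only; the rest of this notebook keeps the column for benchmarking purposes.
# Optional: drop 'duration' when building a realistic predictive model (not done in this notebook)
df_bank_realistic = df_bank.drop(columns=['duration'])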
Statistical Analysis
First, we hide the warnings in the output of the Python code. These warnings are not necessary and do not provide any additional useful information.
# ignore warnings
import warnings
warnings.filterwarnings('ignore')
Now let’s import the required libraries. The pandas library is used for easy manipulation of data frames, while seaborn and Matplotlib are used for visualization.
# importing the required libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
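The rest of the notebook works with a data frame named df_bank, so the dataset needs to be loaded first. A minimal loading sketch, assuming the UCI file is available locally as bank-full.csv (semicolon-separated) and that the target column y is renamed to subscribed to match the column name used throughout this notebook:
# Load the dataset; the file name and separator are assumptions about the local copy
df_bank = pd.read_csv('bank-full.csv', sep=';')
# Rename the target column so the rest of the notebook can refer to it as 'subscribed'
df_bank = df_bank.rename(columns={'y': 'subscribed'})
df_bank.head()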
Performing statistical analysis on the fields in the dataset.
# statistical description of the fields
df_bank.describe()
The results show 7 numeric variables and 10 categorical variables. We first have to convert the categorical variables to numeric in order to proceed with the analysis.
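One quick way to verify this split between numeric and categorical fields is to inspect the column dtypes; a small sketch along these lines:
# Count numeric vs. non-numeric (object/categorical) columns
print(df_bank.select_dtypes(include='number').shape[1], 'numeric columns')
print(df_bank.select_dtypes(exclude='number').shape[1], 'categorical columns')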
# checking for null data
print("Presence of any null values: " + str(df_bank.isnull().values.any()))
We observe that there are no null values in the dataset.
Exploratory Data Analysis
Now we will explore the non-numeric (i.e. categorical) variables and get some valuable insights. The code below plots four bar plots that help us analyze the bank's customer data.
# Checking customer base i.e. clients of the bank
fig = plt.figure(figsize=(15,15))
plt.subplot(2,3,1)
pd.value_counts(df_bank['education']).plot.bar()
plt.title('EDUCATION')
plt.subplot(2,3,2)
pd.value_counts(df_bank['poutcome']).plot.bar()
plt.title('OUTCOME')
plt.subplot(2,3,3)
pd.value_counts(df_bank['contact']).plot.bar()
plt.title('CONTACT')
plt.subplot(2,3,4)
pd.value_counts(df_bank['job']).plot.bar()
plt.title('JOB')
plt.axis('tight')
plt.show()
The above plots show that many of the bank's customers have a secondary level of education and blue-collar jobs, that the most common contact communication type is cellular, and that the outcome of the previous marketing campaign is mostly unknown.
Let us further analyze five more bar plots of the bank's customer data.
# Analyzing the customer base using bar graphs
fig = plt.figure(figsize=(15,15))
plt.subplot(2,3,1)
pd.value_counts(df_bank['month']).plot.bar()
plt.title('MONTH')
plt.subplot(2,3,2)
pd.value_counts(df_bank['default']).plot.bar()
plt.title('DEFAULT')
plt.subplot(2,3,3)
pd.value_counts(df_bank['housing']).plot.bar()
plt.title('HOUSING')
plt.subplot(2,3,4)
pd.value_counts(df_bank['loan']).plot.bar()
plt.title('LOAN')
plt.subplot(2,3,5)
pd.value_counts(df_bank['subscribed']).plot.bar()
plt.title('SUBSCRIBED')
plt.axis('tight')
plt.show()
The above plots show that customers were mostly contacted during the month of May, and that most customers have not subscribed to the term deposit.
Also, many of the bank's customers have a housing loan, are not defaulters, and do not have a personal loan.
We will now look at the number of customers that have actually subscribed after the campaign. First, we calculate the percentage of subscribed customers and then plot it in a pie chart.
We add appropriate labels and use the explode feature to offset the subscribed slice of the pie chart.
# Check the percentage of subscribed customers
df_not_subscribed = df_bank.loc[df_bank['subscribed'] == 'no']
df_subscribed = df_bank.loc[df_bank['subscribed'] == 'yes']
n_not_sub = len(df_not_subscribed.index)
n_sub = len(df_subscribed.index)
# Pie chart
labels = ['Not Subscribed', 'Subscribed']
explode = (0, 0.1)
# add colors
colors = ['#ff9999', '#66b3ff']
sizes = [(n_not_sub/(n_not_sub+n_sub))*100, (n_sub/(n_not_sub+n_sub))*100]
fig1, ax1 = plt.subplots()
ax1.pie(sizes, labels=labels, explode=explode, colors=colors, autopct='%1.1f%%',
        shadow=True, startangle=90)
# Equal aspect ratio ensures that the pie is drawn as a circle
ax1.axis('equal')
plt.tight_layout()
plt.show()
The pie chart shows that only 11.7% of customers have subscribed to term deposits after the campaign. For numerical data, it is necessary to check for outliers [1].
We are going to plot two box plots, one for balance [2] and the other for duration [3].
# Box plots to check for outliers
fig = plt.figure(figsize=(15, 15))
plt.subplot(2,3,1)
sns.boxplot(x='balance', y='subscribed', data=df_bank)
plt.title('BALANCE')
plt.subplot(2,3,2)
sns.boxplot(x='duration', y='subscribed', data=df_bank)
plt.title('DURATION')
plt.show()
A few outliers can be observed [1] in balance [2] and duration [3]. The balance of most of the customers seems to be between $0 and $20,000. Also, the customers who have subscribed had a longer duration of contact with bank personnel on average than those who have not subscribed.
Contact duration plays an important role and highly affects the subscription decision.
Data Preprocessing
First, we have to convert the categorical variables to dummy variables [4] (a numeric entity).
To do this, we will create a function convertToDummy and use the pandas get_dummies method to convert the categorical variables to dummy columns.
Then we will delete one of the dummy columns to avoid the dummy variable trap [5] (multicollinearity [6] issues).
# Function to convert a categorical column to dummy variables
def convertToDummy(df, column):
    # Create dummy variables for the categorical column
    df_dummies = pd.get_dummies(column)
    # Delete one of the dummy columns to avoid multicollinearity issues
    del df_dummies[df_dummies.columns[-1]]
    # Add the new columns to the existing data frame
    df = pd.concat([df, df_dummies], axis=1)
    return df
Now we convert the dependent variable subscribed, mapping its yes and no values to binary 1 and 0 using the map function.
# Convert the target variable to binary for further processing
df_bank['subscribed'] = df_bank['subscribed'].map({'yes': 1, 'no': 0})
First we convert the categorical variables to dummy variables [4], and then delete the original columns, as they are no longer required.
We are also going to delete some columns which are not significant for the creation of the model.
Their low correlation [7] with the target can also be observed in the correlation matrix (heat map) plotted below. Deleting these columns will not have a noticeable effect on the accuracy of the model.
# delete unwanted columns
del df_bank['day']
del df_bank['month']
# Create dummy variables for categorical variables
df_bank = convertToDummy(df_bank, df_bank['marital'])
df_bank = convertToDummy(df_bank, df_bank['job'])
df_bank = convertToDummy(df_bank, df_bank['education'])
df_bank = convertToDummy(df_bank, df_bank['poutcome'])
df_bank = convertToDummy(df_bank, df_bank['contact'])
# Delete the original categorical columns as new variables are created
del df_bank['marital']
del df_bank['job']
del df_bank['education']
del df_bank['poutcome']
del df_bank['contact']
# Convert default, housing, loan to binary for further processing
df_bank['default'] = df_bank['default'].map({'yes': 1, 'no': 0})
df_bank['loan'] = df_bank['loan'].map({'yes': 1, 'no': 0})
df_bank['housing'] = df_bank['housing'].map({'yes': 1, 'no': 0})
Checking correlation matrix (Heat Map)
We will plot a heat map to check the correlation [7] between the variables. We will use the heatmap method from the seaborn package.
# Plotting heat map for displaying correlation matrix
fig = plt.figure(figsize=(20,20))
corr = df_bank.corr()
sns.heatmap(corr, annot = True)
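To complement the heat map, it can also help to list each feature's correlation with the target and sort by magnitude; a small sketch that reuses the corr matrix computed above:
# Correlation of each feature with the target, sorted by absolute value
corr_with_target = corr['subscribed'].drop('subscribed')
print(corr_with_target.reindex(corr_with_target.abs().sort_values(ascending=False).index))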
Model Training
Let’s import the libraries that are required for creating the model.
# Importing the required libraries
from xgboost import XGBClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
Now we will convert the dependent [8] and independent variables [9] into a matrix of features, i.e. the array form that the models expect as input.
# Creating matrix of features of independent variables
x = df_bank.iloc[:, df_bank.columns != 'subscribed'].values
# Target vector
y = df_bank['subscribed'].values
We will now divide our dataset into training sets, test sets, and validation sets. This is done to avoid overfitting [10] issues.
We will divide the whole dataset, assigning 70% to the training set and 30% to the test set.
We then split the test set further, keeping 70% of it as the test set and 30% as a validation set (so overall roughly 70% training, 21% test, and 9% validation).
Now we will use the train_test_split method from the sklearn.model_selection library to split the dataset.
# Split dataset into training and test set
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)
x_test, x_valid, y_test, y_valid = train_test_split(x_test, y_test, test_size=0.3, random_state=0)
We will now perform feature scaling, that is, standardizing [11] the variables before passing them to the classification models, using the StandardScaler class from the sklearn.preprocessing package.
# Feature Scaling
sc_x = StandardScaler()
x_train = sc_x.fit_transform(x_train)
# Use the scaler fitted on the training data to transform the test and validation sets
x_test = sc_x.transform(x_test)
x_valid = sc_x.transform(x_valid)
The function below builds the models from the dataset that is passed to it.
We first initialize all the classification models, then fit each model to the dataset provided as a parameter (typically the training dataset).
def classify(x_train, y_train):
    # Initializing the models
    lr = LogisticRegression(random_state = 0)
    xgb = XGBClassifier(random_state = 0)
    dtc = DecisionTreeClassifier(random_state = 0)
    rfc = RandomForestClassifier(random_state = 0)
    nbc = GaussianNB()
    svm = SVC(kernel='rbf', C=1, gamma='auto')
    knn = KNeighborsClassifier(n_neighbors=3)
    # Fitting models to the training data set
    lr.fit(x_train, y_train)
    xgb.fit(x_train, y_train)
    dtc.fit(x_train, y_train)
    rfc.fit(x_train, y_train)
    nbc.fit(x_train, y_train)
    svm.fit(x_train, y_train)
    knn.fit(x_train, y_train)
    classifiers = [lr, xgb, dtc, rfc, nbc, svm, knn]
    return classifiers
Invoke the above method, and pass the training dataset through to train the model.
# Passing training data to the classify function to fit various classification models
classifiers = classify(x_train, y_train)
Testing the Model
The function mentioned below displays the accuracy of all the models in a graphical view.
# Function to plot the accuracy of the models
def plot_accuracy_plot(accuracy):
    dims = (11.7, 8.27)
    fig, ax = plt.subplots(figsize = dims)
    plt.xlabel('Accuracy')
    plt.title('Classifier Accuracy')
    sns.set_color_codes("muted")
    splot = sns.barplot(ax=ax, x='Accuracy', y='Classifier',
                        data=accuracy, color="b")
    plt.show()
The function below performs the predictions and returns the accuracy scores as a data frame. We first call each classifier's predict method on the data passed in, calculate the accuracy using the accuracy_score [12] method, and save it to a dictionary.
# This function performs prediction,
# plots the accuracy score of all classifiers and
# returns the accuracy score data frame
def predict(x_test, classifiers, y_test):
    lr_test_pred = classifiers[0].predict(x_test)
    xgb_test_pred = classifiers[1].predict(x_test)
    dtc_test_pred = classifiers[2].predict(x_test)
    rfc_test_pred = classifiers[3].predict(x_test)
    nbc_test_pred = classifiers[4].predict(x_test)
    svm_test_pred = classifiers[5].predict(x_test)
    knn_test_pred = classifiers[6].predict(x_test)
    # judge accuracy using built-in function
    accuracy_test = dict()
    accuracy_test['Logistic Regression'] = accuracy_score(y_test, lr_test_pred)
    accuracy_test['XGBoost'] = accuracy_score(y_test, xgb_test_pred)
    accuracy_test['DecisionTree'] = accuracy_score(y_test, dtc_test_pred)
    accuracy_test['RandomForest'] = accuracy_score(y_test, rfc_test_pred)
    accuracy_test['Naive_bayes'] = accuracy_score(y_test, nbc_test_pred)
    accuracy_test['support_vector_Machines'] = accuracy_score(y_test, svm_test_pred)
    accuracy_test['KNN'] = accuracy_score(y_test, knn_test_pred)
    print(accuracy_test)
    # Convert the dictionary into a data frame
    df_acc = pd.DataFrame([accuracy_test.keys(), accuracy_test.values()]).T
    df_acc.columns = ['Classifier', 'Accuracy']
    # Plot accuracy plot
    plot_accuracy_plot(df_acc)
    return df_acc.sort_values('Accuracy', ascending = False)
We then convert the dictionary into a data frame, plot the accuracies by invoking the plot_accuracy_plot method, and return the accuracy data frame sorted in descending order.
Next, we invoke the prediction method and pass through the test dataset.
# Predicting the test data set
predict(x_test, classifiers, y_test)
Result: XGBoost provides the highest accuracy score [12] of 89.99% among the classifiers for the test dataset.
Now, we invoke the prediction method and pass the validation dataset.
# Predicting the validation data set
predict(x_valid, classifiers, y_valid)
Result: XGBoost provides the highest accuracy score [12] of 89.83% among the classifiers for the validation dataset.
Conclusion
In summary, our team at Cocolevio has learned how to create a model, with around 89% accuracy, that identifies the prospects most likely to subscribe to the product, enabling us to target those customers and focus our advertising efforts on them.
This will improve marketing and operational efficiency, as we will be focused on a particular section of the customer base which we now know is more likely to subscribe to our services.
Credits
Thanks again to the University of California Irvine for providing this dataset.
References
[1] outliers: A value that "lies outside" (is much smaller or larger than) most of the other values in a set of data. For example, in the scores 25, 29, 3, 32, 85, 33, 27, 28, both 3 and 85 are outliers.
[2] balance: Overall balance amount of the customer
[3] duration: Last contact duration with the customer, in seconds (numeric)
[4] dummy variables: Dummy variables are "proxy" variables, i.e. numeric stand-ins for qualitative (categorical) variables.
[5] dummy variable trap: The Dummy Variable trap is a scenario in which the independent variables are multicollinear — a scenario in which two or more variables are highly correlated; in simple terms, one variable can be predicted from the others.
[6] multicollinearity: Multicollinearity occurs when independent variables in a regression model are correlated. This correlation is a problem because independent variables should be independent. If the degree of correlation between variables is high enough, it can cause problems when you fit the model and interpret the results.
[7] correlation: Correlation is a statistical measure that indicates the extent to which two or more variables fluctuate together. A positive correlation exists when two variables move in the same direction. A basic example of a positive correlation is height and weight: taller people tend to be heavier, and vice versa.
[8] dependent variable: The dependent variable is sometimes called the "outcome variable"; it is also known as the target variable.
[9] independent variable: An independent variable is a variable believed to affect the dependent variable
[10] overfitting: Overfitting refers to a model that models the training data too well. Overfitting happens when a model learns the detail and noise in the training data to the extent that it negatively impacts the performance of the model on new data.
[11] standardizing: Standardization is the process of putting different variables on the same scale. This process allows you to compare values between different ranges of variables.
[12] accuracy score: Accuracy is the fraction of predictions the model got right: the number of correct predictions divided by the total number of predictions, multiplied by 100 to express it as a percentage. The higher the accuracy score, the better the model.