Churn Modelling Marketing Data with Julia
Logistic Regression Propensity Models of Marketing Data with Julia.
julia · logistic-regression · propensity-models · marketing-analysis · statistics
4 Minutes, 24 Seconds
2022-05-30 13:30 +0000
Introduction
I prepared this analysis to learn logistic regression in Julia. The work is fairly straightforward: I model whether a customer will exit a website based on a number of customer attributes. I'll improve the model in an upcoming post. Asking if a customer will exit is not as interesting as asking when; that will be my next project.
Imports
using Pkg
using DataFrames
using CSV
using Plots
using GLM
using StatsBase
using Lathe
using MLBase
using ClassImbalance
using ROCAnalysis
using PyCall
sklearn = pyimport("sklearn.metrics")
PyObject <module 'sklearn.metrics' from '/Users/jnapolitano/venvs/finance/lib/python3.9/site-packages/sklearn/metrics/__init__.py'>
function load_csv()
df = DataFrame(CSV.File("./Churn_Modelling.csv"))
return df
end
load_csv (generic function with 1 method)
Loading Data
marketing_df = load_csv()
first(marketing_df,5)
println(size(marketing_df))
describe(marketing_df)
(10000, 14)
# Check column names
names(marketing_df)
14-element Vector{Symbol}:
:RowNumber
:CustomerId
:Surname
:CreditScore
:Geography
:Gender
:Age
:Tenure
:Balance
:NumOfProducts
:HasCrCard
:IsActiveMember
:EstimatedSalary
:Exited
Check Class Imbalance
# Count the classes
countmap(marketing_df.Exited)
Dict{Int64, Int64} with 2 entries:
0 => 7963
1 => 2037
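The counts above can also be turned into class proportions with plain Base Julia, no StatsBase required. In this sketch `labels` is a stand-in for the `Exited` column, rebuilt from the counts reported above:

```julia
# `labels` stands in for marketing_df.Exited, using the counts above.
labels = vcat(zeros(Int, 7963), ones(Int, 2037))

# Count each class by hand (what countmap does internally).
counts = Dict{Int,Int}()
for y in labels
    counts[y] = get(counts, y, 0) + 1
end

# Share of each class: roughly 80% stay, 20% exit -- a ~4:1 imbalance.
proportions = Dict(k => v / length(labels) for (k, v) in counts)
```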
Data Preprocessing
One Hot Encoding
# One hot encoding
Lathe.preprocess.OneHotEncode(marketing_df,:Geography)
Lathe.preprocess.OneHotEncode(marketing_df,:Gender)
select!(marketing_df, Not([:RowNumber, :CustomerId,:Surname,:Geography,:Gender,:Male]))
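If Lathe is unavailable, the same one-hot encoding can be sketched with DataFrames alone. The `onehot!` helper below is my own, not part of DataFrames.jl; the `Geography` levels mirror the dataset:

```julia
using DataFrames

# Add one 0/1 indicator column per category level, then drop the original.
# `onehot!` is a hypothetical helper, not a DataFrames.jl function.
function onehot!(df::DataFrame, col::Symbol)
    for level in unique(df[!, col])
        df[!, Symbol(level)] = Int.(df[!, col] .== level)
    end
    select!(df, Not(col))
end

demo = DataFrame(Geography = ["France", "Spain", "Germany", "France"])
onehot!(demo, :Geography)
# demo now has France, Spain, and Germany indicator columns.
```

As in the Lathe version, one level per categorical (here `Male`, and implicitly `Germany` in the model formula) should be dropped as the baseline to avoid perfect collinearity.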
Split Train and Test Data
# Train test split
using Lathe.preprocess: TrainTestSplit
train, test = TrainTestSplit(marketing_df,.75);
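Lathe's `TrainTestSplit` can be reproduced in a few lines of Base Julia. This sketch splits row indices rather than the DataFrame itself, and seeds the RNG for repeatability; the helper name is mine:

```julia
using Random

# Shuffle row indices with a seeded RNG, then cut at the given ratio.
function train_test_indices(n::Int, ratio::Float64; seed = 42)
    idx = shuffle(MersenneTwister(seed), 1:n)
    cut = floor(Int, ratio * n)
    return idx[1:cut], idx[cut+1:end]
end

train_idx, test_idx = train_test_indices(10_000, 0.75)
# train = df[train_idx, :]; test = df[test_idx, :]
```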
Build Model
# Fit the model (a binomial GLM; ProbitLink() makes this probit rather than logistic regression)
fm = @formula(Exited ~ CreditScore + Age + Tenure + Balance + NumOfProducts + HasCrCard + IsActiveMember + EstimatedSalary + Female + France + Spain)
logit = glm(fm, train, Binomial(), ProbitLink())
StatsModels.TableRegressionModel{GeneralizedLinearModel{GLM.GlmResp{Vector{Float64}, Binomial{Float64}, ProbitLink}, GLM.DensePredChol{Float64, LinearAlgebra.Cholesky{Float64, Matrix{Float64}}}}, Matrix{Float64}}
Exited ~ 1 + CreditScore + Age + Tenure + Balance + NumOfProducts + HasCrCard + IsActiveMember + EstimatedSalary + Female + France + Spain
Coefficients:
───────────────────────────────────────────────────────────────────────────────────────
Coef. Std. Error z Pr(>|z|) Lower 95% Upper 95%
───────────────────────────────────────────────────────────────────────────────────────
(Intercept) -1.90933 0.165007 -11.57 <1e-30 -2.23274 -1.58592
CreditScore -0.000321917 0.000183184 -1.76 0.0789 -0.000680951 3.71172e-5
Age 0.040893 0.00165251 24.75 <1e-99 0.0376541 0.0441318
Tenure -0.008864 0.00611129 -1.45 0.1469 -0.0208419 0.0031139
Balance 1.65933e-6 3.30286e-7 5.02 <1e-06 1.01198e-6 2.30668e-6
NumOfProducts -0.040173 0.0309946 -1.30 0.1949 -0.100921 0.0205753
HasCrCard -0.00442931 0.0386394 -0.11 0.9087 -0.0801612 0.0713026
IsActiveMember -0.557894 0.0365213 -15.28 <1e-51 -0.629475 -0.486314
EstimatedSalary 2.2925e-7 3.07604e-7 0.75 0.4561 -3.73644e-7 8.32143e-7
Female 0.301642 0.0354259 8.51 <1e-16 0.232209 0.371076
France -0.450226 0.0446176 -10.09 <1e-23 -0.537674 -0.362777
Spain -0.443184 0.051707 -8.57 <1e-16 -0.544527 -0.34184
───────────────────────────────────────────────────────────────────────────────────────
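Because this is a probit model, a coefficient shifts the latent score, not the probability directly: the predicted probability is Φ(Xβ), the standard normal CDF of the linear predictor. A minimal sketch, assuming the Distributions package is available; the example customer and the subset of terms are made up for illustration:

```julia
using Distributions

# Standard normal CDF: maps a linear predictor to a probability.
Φ(z) = cdf(Normal(), z)

# Hypothetical 45-year-old female customer, using only the intercept,
# Age, and Female coefficients from the table above (illustrative only).
z = -1.90933 + 0.040893 * 45 + 0.301642 * 1
p = Φ(z)
# IsActiveMember would shift z by -0.557894, lowering p substantially.
```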
Model Predictions and Evaluation
# Predict the target variable on test data
prediction = predict(logit,test)
2406-element Vector{Union{Missing, Float64}}:
0.24401107345293602
0.1266535868551322
0.031721959583257124
0.11357816519004983
0.24824114578495612
0.024688755265128235
0.14209354336141483
0.18528877855991494
0.15470097145575007
0.25962439112051505
0.15117890643161475
0.2110682947689441
0.06358192272871947
⋮
0.24899439141513482
0.23449577199293972
0.13610439167926225
0.1737934374110589
0.1341643450975004
0.5831068095078078
0.2950497674661655
0.04139159536998556
0.06795785137729822
0.017204995327274736
0.12888818685657766
0.15310112069144077
# Convert probability score to class
prediction_class = [x < 0.5 ? 0 : 1 for x in prediction];
prediction_df = DataFrame(y_actual = test.Exited, y_predicted = prediction_class, prob_predicted = prediction);
prediction_df.correctly_classified = prediction_df.y_actual .== prediction_df.y_predicted
2406-element BitVector:
0
1
1
1
1
1
1
1
1
0
1
1
1
⋮
1
1
1
1
1
1
1
1
1
1
1
1
Prediction Accuracy
accuracy = mean(prediction_df.correctly_classified)
0.8100581878636741
Confusion Matrix
# confusion_matrix = confusmat(2,prediction_df.y_actual, prediction_df.y_predicted)
confusion_matrix = MLBase.roc(prediction_df.y_actual, prediction_df.y_predicted)
ROCNums{Int64}
p = 510
n = 1896
tp = 105
tn = 1844
fp = 52
fn = 405
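Accuracy alone hides how badly the minority class is handled. Precision and recall for the exit class can be computed directly from the `ROCNums` output above; the counts below are copied from it:

```julia
# Counts from the confusion matrix above.
tp, tn, fp, fn = 105, 1844, 52, 405

precision = tp / (tp + fp)   # ≈ 0.67: most predicted exits are real exits
recall    = tp / (tp + fn)   # ≈ 0.21: but most real exits are missed
f1        = 2 * precision * recall / (precision + recall)
```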
Results
The model is missing far too many exiting cases: of 510 actual exits it catches only 105 and misses 405, roughly four times as many misses as hits.
fpr, tpr, thresholds = sklearn.roc_curve(prediction_df.y_actual, prediction_df.prob_predicted)
([0.0, 0.0, 0.0, 0.0005274261603375527, 0.0005274261603375527, 0.0010548523206751054, 0.0010548523206751054, 0.0015822784810126582, 0.0015822784810126582, 0.0026371308016877636 … 0.8829113924050633, 0.9066455696202531, 0.9066455696202531, 0.9193037974683544, 0.9193037974683544, 0.92457805907173, 0.92457805907173, 0.9725738396624473, 0.9725738396624473, 1.0], [0.0, 0.00196078431372549, 0.00392156862745098, 0.00392156862745098, 0.00784313725490196, 0.00784313725490196, 0.01568627450980392, 0.01568627450980392, 0.03137254901960784, 0.03137254901960784 … 0.9921568627450981, 0.9921568627450981, 0.9941176470588236, 0.9941176470588236, 0.996078431372549, 0.996078431372549, 0.9980392156862745, 0.9980392156862745, 1.0, 1.0], [1.8467335270755767, 0.8467335270755767, 0.8140811888019499, 0.8092555110984978, 0.7970873802691381, 0.79684704533007, 0.7719016175181805, 0.7709263202992206, 0.7060214606993195, 0.6994801619873218 … 0.04233143871590189, 0.03786940431261241, 0.037850945580692276, 0.035665362242897694, 0.03532968973176317, 0.03416668456674327, 0.03407543014692377, 0.020932892669754958, 0.020885871157504798, 0.00597005405256463])
# Plot ROC curve
plot(fpr, tpr)
title!("ROC curve")
The Class Imbalance Problem
# Count the classes
countmap(marketing_df.Exited)
Dict{Int64, Int64} with 2 entries:
0 => 7963
1 => 2037
SMOTE to Fix Imbalance
X2, y2 = smote(marketing_df[!,[:CreditScore,:Age ,:Tenure, :Balance, :NumOfProducts, :HasCrCard, :IsActiveMember, :EstimatedSalary, :Female , :France, :Spain]], marketing_df.Exited, k = 5, pct_under = 150, pct_over = 200)
df_balanced = X2
df_balanced.Exited = y2;
df = df_balanced;
# Count the classes
countmap(df.Exited)
Dict{Int64, Int64} with 2 entries:
0 => 6111
1 => 6111
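SMOTE synthesizes new minority rows by interpolating between nearest neighbors. A cruder alternative, shown only for contrast, is random oversampling, which duplicates minority rows until the classes match. This sketch works on index vectors rather than the DataFrame, and the helper name is mine:

```julia
using Random

# Duplicate minority-class indices (with replacement) until counts match.
function oversample_indices(labels::Vector{Int}; seed = 42)
    rng = MersenneTwister(seed)
    pos = findall(==(1), labels)
    neg = findall(==(0), labels)
    minority, majority = length(pos) < length(neg) ? (pos, neg) : (neg, pos)
    extra = rand(rng, minority, length(majority) - length(minority))
    return vcat(majority, minority, extra)
end

labels = vcat(zeros(Int, 7963), ones(Int, 2037))   # original class counts
idx = oversample_indices(labels)
# labels[idx] now contains 7963 of each class.
```

Unlike SMOTE, this adds no new information, which is why SMOTE is generally preferred.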
Retest
# Train test split
train, test = TrainTestSplit(df,.75);
# Model Building
fm = @formula(Exited ~ CreditScore + Age + Tenure + Balance + NumOfProducts + HasCrCard + IsActiveMember + EstimatedSalary + Female + France + Spain)
logit = glm(fm, train, Binomial(), ProbitLink())
# Predict the target variable on test data
prediction = predict(logit,test)
# Convert probability score to class
prediction_class = [x < 0.5 ? 0 : 1 for x in prediction];
prediction_df = DataFrame(y_actual = test.Exited, y_predicted = prediction_class, prob_predicted = prediction);
prediction_df.correctly_classified = prediction_df.y_actual .== prediction_df.y_predicted
# Accuracy Score
accuracy = mean(prediction_df.correctly_classified)
print("Accuracy of the model is : ",accuracy)
# Confusion Matrix
confusion_matrix = MLBase.roc(prediction_df.y_actual, prediction_df.y_predicted)
Accuracy of the model is : 0.7169563791407019
ROCNums{Int64}
p = 1550
n = 1499
tp = 1091
tn = 1095
fp = 404
fn = 459
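Comparing exit-class recall across the two confusion matrices makes the final discussion concrete; the counts below are copied from the two `ROCNums` outputs above:

```julia
# Exit-class recall, before and after SMOTE.
recall_imbalanced = 105 / 510     # ≈ 0.21: first model misses most exits
recall_balanced   = 1091 / 1550   # ≈ 0.70: second model catches most exits

# Balanced accuracy of the second model (mean of per-class recall).
balanced_accuracy = (1091 / 1550 + 1095 / 1499) / 2
```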
fpr, tpr, thresholds = sklearn.roc_curve(prediction_df.y_actual, prediction_df.prob_predicted)
# Plot ROC curve
([0.0, 0.0, 0.0, 0.00066711140760507, 0.00066711140760507, 0.00133422281521014, 0.00133422281521014, 0.0020013342228152103, 0.0020013342228152103, 0.00266844563042028 … 0.9846564376250834, 0.9893262174783188, 0.9893262174783188, 0.9913275517011341, 0.9913275517011341, 0.9973315543695798, 0.9986657771847899, 0.9993328885923949, 0.9993328885923949, 1.0], [0.0, 0.0006451612903225806, 0.0025806451612903226, 0.0025806451612903226, 0.005161290322580645, 0.005161290322580645, 0.007741935483870968, 0.007741935483870968, 0.00903225806451613, 0.00903225806451613 … 0.9980645161290322, 0.9980645161290322, 0.9987096774193548, 0.9987096774193548, 0.9993548387096775, 0.9993548387096775, 0.9993548387096775, 0.9993548387096775, 1.0, 1.0], [1.9907624292252022, 0.9907624292252022, 0.983731024429679, 0.97951657298985, 0.9730082291507035, 0.9713532719467679, 0.9629327481173712, 0.9604203755106321, 0.9593444340323958, 0.9584649467140461 … 0.06923199350115271, 0.06553287523911823, 0.06469253560487893, 0.058594401854125504, 0.057872556108602216, 0.034170953161915506, 0.03357051125028141, 0.03297342671224324, 0.030937011626933943, 0.023743078872535135])
plot(fpr, tpr)
title!("ROC curve")
Final Discussion
When accounting for class imbalance, model accuracy drops from about 81 percent to about 72 percent.
While this seems counterintuitive, the second model is the better model overall.
The first model's 81 percent accuracy is inflated by chance: the non-exit class is far larger, so predicting that a customer stays is usually right, which pushes reported accuracy up.
Once the classes are balanced, accuracy settles at about 72 percent, but the model now catches most true exits instead of missing them. I am confident this model would generalize appropriately.
The first model, on the other hand, recalls only about 20-25 percent of true exits (105 of 510).