Detecting a possible event of bank churn using machine learning classification models 💸🏦

analysis
exploratory data analysis
visualisation
statistics
machine learning
classification model
Author

Arindam Baruah

Published

January 18, 2024

Source: Medium

1 What is a bank churn and why does it occur?

With the advent of digital banking, it has become extremely convenient for users to open a new bank account. In contrast to the age-old practice of visiting a branch to submit multiple documents and credit histories, the process has been revamped to the point where a user needs less than 10 minutes to open an account from the comfort of their couch.

However, digital banking has also made it convenient for users to close a bank account, an event more commonly termed a “churn”. While this may appear to be a win for consumers, it is important to understand the reasons for a bank churn and the effects that follow from it.

Reasons for a churn to occur

A bank account churn could occur for a multitude of reasons, which banks need to analyse in order to reduce customer attrition and remain competitive in the market. Some of the reasons are:

  • Fees and Charges: High fees or unexpected charges can prompt customers to switch banks. This may include monthly maintenance fees, ATM fees, overdraft fees, and other charges.
  • Low Interest Rates: If a bank offers low interest rates on savings accounts or certificates of deposit, customers might look for better opportunities elsewhere to maximize their returns.
  • Poor Customer Service: Inadequate customer service, long wait times, and unhelpful staff can lead to frustration and dissatisfaction, prompting customers to seek better service elsewhere.
  • Branch Accessibility: Limited access to physical branches or ATMs can be a significant factor. If a customer moves to an area where their current bank has limited presence, they might switch to a more accessible option.
  • Technology and Online Services: Customers may switch to a bank that provides better online and mobile banking services, as technology plays an increasingly crucial role in the banking experience.
  • Incentives and Promotions: Banks often attract new customers by offering promotions, bonuses, or better interest rates. Existing customers may churn to take advantage of these offers.
  • Changes in Financial Needs: As individuals’ financial situations evolve, their banking needs may change. For example, a customer might require more advanced financial products, and if their current bank can’t meet those needs, they may switch to a different institution.
  • Mergers and Acquisitions: Changes resulting from bank mergers or acquisitions, such as alterations in account terms, fees, or service quality, can drive customers to seek alternatives.
  • Ethical or Social Reasons: Some customers may choose to switch banks due to concerns about a bank’s ethical practices, social responsibility, or involvement in controversial activities.
  • Security Concerns: If a bank experiences a security breach or if customers perceive their accounts to be at risk, they may opt to move their funds to a more secure institution.
  • Better Financial Products: Customers may switch banks to access better financial products, such as higher-interest savings accounts, more competitive loan rates, or improved credit card offerings.
  • Life Events: Major life events like marriage, divorce, retirement, or the death of a spouse can prompt individuals to reassess their banking relationships and switch to better-suited options.

2 Importing the relevant libraries and dataset

To begin our analysis of the bank account data, we will load all the required libraries and then take a glimpse of what our data looks like.

library(tidyverse)
library(naniar)
library(bookdown)
library(stringr)
library(stringi)
library(lubridate)
library(DT)
library(forcats)
library(ggthemes)
library(corrplot)
library(mltools)
library(data.table)
library(visdat)
library(janitor)
library(cowplot)
library(caTools)
library(pscl)
library(ROCR)
library(caret)
library(xgboost)
library(randomForest)
library(lightgbm)
library(Matrix)
library(catboost)
library(kableExtra)
library(plotly)
library(ggExtra)
Table 1: Bank churn dataset
id CustomerId Surname CreditScore Geography Gender Age Tenure Balance NumOfProducts HasCrCard IsActiveMember EstimatedSalary Exited
0 15674932 Okwudilichukwu 668 France Male 33 3 0.0 2 1 0 181449.97 0
1 15749177 Okwudiliolisa 627 France Male 33 1 0.0 2 1 1 49503.50 0
2 15694510 Hsueh 678 France Male 40 10 0.0 2 1 0 184866.69 0
3 15741417 Kao 581 France Male 34 2 148882.5 1 1 1 84560.88 0
4 15766172 Chiemenam 716 Spain Male 33 5 0.0 2 1 1 15068.83 0
5 15771669 Genovese 588 Germany Male 36 4 131778.6 1 1 0 136024.31 1

3 Data description

Based on brief research, the description of each variable in the dataset is as follows:

  1. CustomerId: Unique identifier for each customer
  2. Surname: Name associated with the customer
  3. CreditScore: Credit score of the customer
  4. Geography: Location of the bank account based on geographical location of the bank
  5. Gender: Gender of the customer
  6. Age: Current age of the customer
  7. Tenure: Length of time since the opening of the bank account
  8. Balance: Current credit balance in the account
  9. NumOfProducts: Number of banking services used by the customer
  10. HasCrCard: An indicator for whether the customer has a credit card
  11. IsActiveMember: An indicator for whether the customer regularly uses the bank account for transactions
  12. EstimatedSalary: Estimated earning declared as salary for the customer
  13. Exited: Has the customer closed the bank account

4 Data cleaning

Figure 1: Check for missing values in the dataset

As we can observe from Figure 1, the dataset is clean and does not have any missing values to be dealt with. While this is ideal, it is not the only check that we must perform on the data.
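
The code behind Figure 1 is not shown here. As a minimal sketch of an equivalent check in base R, the snippet below counts the missing values in each column. The data frame `bank_df` and its values are illustrative assumptions (one missing value is planted on purpose so that the check has something to report) and are not taken from the dataset.

```r
# Toy stand-in for the bank churn data (illustrative values only;
# one NA is inserted deliberately to show what the check reports)
bank_df <- data.frame(
  CreditScore = c(668, 627, NA, 581),
  Age         = c(33, 33, 40, 34),
  Balance     = c(0, 0, 0, 148882.5)
)

# Number of missing values in each column
missing_per_column <- colSums(is.na(bank_df))
print(missing_per_column)

# TRUE only when the whole data frame is free of missing values
is_complete <- !anyNA(bank_df)
print(is_complete)
```

On a dataset without missing values, every count would be zero and `is_complete` would be `TRUE`.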

5 Data sanity checks

Table 2: Number of occurrences of each customer
CustomerId Total_occurrences
15682355 121
15570194 99
15585835 98
15595588 91
15648067 90
15793331 90

Based on Table 2, we can observe that the same customer ID appears multiple times in the dataset when, in fact, it was supposed to appear just once. Since it does not uniquely identify a customer, the CustomerId variable serves no purpose in the current dataset. It shall be dropped in the feature selection section.
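
A table of this kind can be produced by counting how often each identifier occurs. The following is a minimal base R sketch under the assumption that the IDs are held in a plain vector; the vector `customer_ids` is a small made-up sample that reuses a few IDs from Table 2, not the real counts.

```r
# Small illustrative sample of customer IDs (not the real frequencies)
customer_ids <- c("15682355", "15570194", "15682355",
                  "15585835", "15682355", "15570194")

# Count the occurrences of each ID and sort from most to least frequent
occurrences <- sort(table(customer_ids), decreasing = TRUE)

id_counts <- data.frame(
  CustomerId        = names(occurrences),
  Total_occurrences = as.integer(occurrences)
)
print(id_counts)

# TRUE when at least one ID appears more than once
has_duplicates <- anyDuplicated(customer_ids) > 0
print(has_duplicates)
```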

The estimated salary variable indicates the income of each customer declared as a salary. Salaries can never be negative. Let us quickly check if that indeed is the case.

Figure 2: Salary distribution of customers

Based on the histogram of the salaries as illustrated by Figure 2, we can observe that the salaries are all positive, as expected.

Bank account customers are generally required to be adults (18 years or older). We will check if that holds true for the current dataset and attempt to detect any anomalous data such as negative ages.

Figure 3: Age distribution of the bank customers

Based on Figure 3, we can observe that the customers are all of the right age (18 years or older) and that there are no anomalous data entries for this variable.
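
The salary and age checks can also be expressed as explicit range tests alongside the histograms. This is a minimal sketch, assuming a data frame with the `Age` and `EstimatedSalary` columns described earlier; the data frame `bank_df` and its values are illustrative assumptions.

```r
# Toy stand-in for the two columns being checked (illustrative values)
bank_df <- data.frame(
  Age             = c(33, 40, 18, 92),
  EstimatedSalary = c(181449.97, 49503.50, 15068.83, 136024.31)
)

# Salaries should never be negative
salary_ok <- all(bank_df$EstimatedSalary >= 0)

# Customers are expected to be adults
age_ok <- all(bank_df$Age >= 18)

# Rows that break either rule, for manual inspection
violations <- bank_df[bank_df$EstimatedSalary < 0 | bank_df$Age < 18, ]

print(c(salary_ok = salary_ok, age_ok = age_ok))
print(nrow(violations))
```

A range test of this kind catches a single anomalous row that could be hard to spot in a histogram.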

6 Exploratory Data Analysis

Before we create a prediction model, we need to understand how our variables are correlated with one another. This will be done through various visualisations as follows:

6.1 Correlation plot

Figure 4: Correlation plot
Key takeaway

Based on the correlation plot in Figure 4, we can infer that no single variable is highly correlated with the churn indicator. We can also observe that none of the features are highly correlated with one another. This suggests that multicollinearity between pairs of features is unlikely to be a concern.
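
The numbers behind a correlation plot come from a pairwise correlation matrix. The sketch below computes one with base R on simulated data; the data frame `bank_df`, the simulated distributions and the seed are all assumptions made for illustration and do not reproduce the values in Figure 4.

```r
set.seed(42)
n <- 200

# Simulated numeric columns standing in for the real features
bank_df <- data.frame(
  CreditScore = round(rnorm(n, mean = 650, sd = 80)),
  Age         = round(runif(n, min = 18, max = 80)),
  Balance     = pmax(0, rnorm(n, mean = 60000, sd = 50000)),
  Exited      = rbinom(n, size = 1, prob = 0.2)
)

# Pairwise Pearson correlations between all numeric columns
corr_matrix <- cor(bank_df)
print(round(corr_matrix, 2))

# Largest absolute correlation between two different variables
off_diagonal <- corr_matrix[upper.tri(corr_matrix)]
max_abs_corr <- max(abs(off_diagonal))
print(max_abs_corr)
```

A matrix like `corr_matrix` could then be passed to a plotting function such as `corrplot()` from the corrplot package loaded earlier.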

6.2 Geography wise churn

Let us try to understand if there are any geographical regions which have accounted for high churns.

Figure 5: Bank accounts for each geographical location

Based on Figure 5, we can observe that the churn percentage is considerably higher in Germany than in France and Spain, where the churn percentages appear to be similar.
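
A churn percentage per group is the mean of the 0/1 `Exited` indicator within that group, multiplied by 100. The following is a minimal sketch of that calculation; the eight rows in `bank_df` are made up for illustration and do not reflect the real churn rates.

```r
# Toy data: country of the account and whether it was closed (1) or not (0)
bank_df <- data.frame(
  Geography = c("France", "France", "France", "France",
                "Spain", "Spain", "Germany", "Germany"),
  Exited    = c(0, 0, 0, 1, 0, 1, 1, 1)
)

# Mean of the churn indicator per country, expressed as a percentage
churn_pct <- tapply(bank_df$Exited, bank_df$Geography,
                    function(x) 100 * mean(x))
print(churn_pct)
```

The same pattern applies to any other grouping variable, for example `IsActiveMember` in place of `Geography`.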

Key takeaway

The higher churn percentage for Germany could indicate issues pertaining to the economy of the country or factors such as the loan interest rates there. It could also indicate that the banking or customer services in Germany are lacking compared to the other countries, resulting in the higher churn percentage. Diving deeper into the banking practices of Germany would provide a better understanding of the reasons for the high attrition of customers.

6.3 Age wise bank churn

Figure 6: Age wise churn data

As we can observe through Figure 6, churn is concentrated among customers between the ages of 25 and 60.

Key takeaway

The data indicates that customers tend to close their accounts a few years after opening them, which could be due to numerous reasons such as:

  1. Better customer services from other banks
  2. Low credit score or financial bankruptcy of the customer
  3. High loan interest rates charged by the bank, which may have led the customer to switch to another bank after repaying the loan.

6.4 Credit score wise churn

Let us now try to analyse if the credit scores could tell us anything about the likelihood of the bank account churns.

Figure 7: Credit score distribution of churned accounts
Key takeaway

Based on the credit score distributions of the churned and active accounts as illustrated in Figure 7, we cannot observe any discernible difference in credit scores that would indicate the likelihood of a bank account churn. This suggests that the credit score of a customer may not be a strong factor in them closing their bank account.

6.5 Transaction activity wise churn

Next, we can analyse if the churns are mostly from bank accounts which are inactive, possibly due to very low transactions.

Figure 8: Transaction activity wise churn
Key takeaway

Based on Figure 8, we can observe that the percentage of bank churns for inactive accounts is more than twice that of the active accounts.

This indicates that the probability of a churn rises as the transaction activity of the bank account falls. It could also reflect a measure taken by the bank to reduce the burden of keeping services activated for accounts which have shown little to no activity for an extended period of time.

6.6 Balance wise churn

Let us try to analyse if the account balance can indicate whether a churn might take place.

For this, we will flag any balance which is 0.
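
One way to build such a flag and compare churn across the two groups is sketched below. The column name `BalanceFlag`, its labels and the six toy rows are assumptions made for illustration, not the names or values used in the original analysis.

```r
# Toy data: account balance and churn indicator (illustrative values)
bank_df <- data.frame(
  Balance = c(0, 0, 148882.5, 131778.6, 0, 90000),
  Exited  = c(0, 0, 0, 1, 1, 1)
)

# Flag accounts with a balance of exactly zero (hypothetical column name)
bank_df$BalanceFlag <- ifelse(bank_df$Balance == 0,
                              "Zero balance", "Positive balance")

# Churn percentage within each balance group
churn_by_flag <- aggregate(Exited ~ BalanceFlag, data = bank_df,
                           FUN = function(x) 100 * mean(x))
print(churn_by_flag)
```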

Figure 9: Account balance wise churn
Key takeaway

Figure 9 illustrates the percentage of bank accounts churned for accounts with positive balance and zero balance. Contrary to expectation, the churn percentage is actually higher for accounts with a positive balance than for zero-balance accounts.

This could be because account holders with zero balances are generally inactive and may not proactively close their accounts; instead, the closure may be carried out by the bank to reduce the maintenance cost of keeping zero-balance accounts active. Account holders with a positive balance, however, may proactively choose to close their accounts for the various reasons stated in Section 1, which may lead to the higher percentage of churned accounts.

7 Model creation

Here, we will fit a logistic regression model to the training data as follows.

model_logit <- glm(Exited~.,family=binomial(link='logit'),data=train)
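
The construction of the `train` data frame is not shown in this article. As a minimal sketch of how a split and fit of this kind might look, the snippet below simulates a small dataset, holds out 20% of the rows and fits the same `glm()` call. The 80/20 ratio, the seed, the simulated coefficients and the `test` and `test_prob` names are all assumptions made for illustration.

```r
set.seed(123)
n <- 500

# Simulated features standing in for the real dataset
bank_df <- data.frame(
  Age            = round(runif(n, min = 18, max = 80)),
  IsActiveMember = rbinom(n, size = 1, prob = 0.5),
  NumOfProducts  = sample(1:4, n, replace = TRUE)
)

# Simulated churn indicator with an assumed dependence on age and activity
linear_pred <- -3 + 0.05 * bank_df$Age - 1 * bank_df$IsActiveMember
bank_df$Exited <- rbinom(n, size = 1, prob = plogis(linear_pred))

# Hold out 20% of the rows for testing (assumed ratio)
train_idx <- sample(seq_len(n), size = floor(0.8 * n))
train <- bank_df[train_idx, ]
test  <- bank_df[-train_idx, ]

# Same model specification as above, fitted on the training rows only
model_logit <- glm(Exited ~ ., family = binomial(link = "logit"), data = train)

# Predicted churn probabilities for the held-out rows
test_prob <- predict(model_logit, newdata = test, type = "response")
print(head(round(test_prob, 3)))
```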

Let us study the performance of the logistic regression model through the Receiver Operating Characteristic (ROC) curve.

Figure 10: ROC-AUC metric

Based on the ROC curve as illustrated by Figure 10, we can observe that while a sizeable area of the graph lies under the curve, this could possibly be improved by a better classification algorithm.
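
The area under the ROC curve (AUC) equals the probability that a randomly chosen churned account receives a higher predicted score than a randomly chosen active one, so it can be computed directly from ranks. The helper `auc_from_scores()` below is a hypothetical function written for illustration using this rank-based (Mann-Whitney) formula; it is not the method used to produce Figure 10, and the scores are made up.

```r
# Rank-based AUC: labels are 0/1, scores are predicted probabilities
auc_from_scores <- function(scores, labels) {
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  ranks <- rank(scores)  # tied scores receive their average rank
  (sum(ranks[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy example: one churned account is scored below one active account
labels <- c(0, 0, 1, 1)
scores <- c(0.1, 0.4, 0.35, 0.8)

auc_value <- auc_from_scores(scores, labels)
print(auc_value)
```

An AUC of 0.5 corresponds to random guessing and 1 to a perfect ranking, so the value gives a single number with which to compare classifiers.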

Figure 11: Confusion matrix for Logistic Regression classification model

Figure 11 depicts a more intuitive way to understand the performance of the logistic regression model. The model was able to classify the accounts with an accuracy of 83.4%.
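
Accuracy can be read off a confusion matrix as the share of predictions on the diagonal. The sketch below builds a confusion matrix from toy predictions and also computes recall for the churned class, which can differ noticeably from accuracy when churned accounts are the minority. The 0.5 probability threshold and all values are assumptions made for illustration.

```r
# Toy ground truth and predicted churn probabilities (illustrative values)
actual <- c(0, 0, 0, 0, 0, 0, 1, 1, 1, 1)
prob   <- c(0.1, 0.2, 0.3, 0.4, 0.6, 0.2, 0.7, 0.8, 0.3, 0.9)

# Convert probabilities to class labels with an assumed 0.5 threshold
predicted <- as.integer(prob >= 0.5)

# Rows are predictions, columns are the true classes
conf_mat <- table(
  Predicted = factor(predicted, levels = c(0, 1)),
  Actual    = factor(actual, levels = c(0, 1))
)
print(conf_mat)

# Share of all accounts classified correctly
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)

# Share of truly churned accounts that the model caught
recall <- conf_mat["1", "1"] / sum(conf_mat[, "1"])

print(c(accuracy = accuracy, recall = recall))
```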

Let us try an extreme gradient boosting ensemble method, commonly known as the XGBoost classifier.

[1] train-logloss:0.430351 
[2] train-logloss:0.381727 
Figure 12: Confusion matrix for XGBoost classification model

As depicted by Figure 12, the XGBoost classifier achieves a marginally better classification accuracy than the logistic regression model, with an accuracy of 85.1%.

Let us utilise the LGBM algorithm and train it on the given dataset.

Figure 13: Confusion matrix for Light GBM classification model

Based on Figure 13, we can observe that the LightGBM classifier has classified poorly, with many negatives observed in its classification. We can disregard this model for our purpose of classification.

Figure 14: Importance of each variable based on the Catboost classification algorithm
Figure 15: Confusion matrix for Catboost classification model

Upon utilising the CatBoost classifier, we can observe from Figure 15 that it has a marginally better accuracy than the logistic regression, XGBoost and LightGBM classifiers.

Additionally, Figure 14 indicates the variables which have the highest importance in predicting bank account churns. We can observe that the variables “Number of products”, followed by “Age” and “Activity”, are the most critical for predicting the churned accounts.