Assignment - Retention Prediction
Retention prediction, also known as customer churn prediction, is a process in data science and business analytics that aims to identify and predict the likelihood of customers or users discontinuing their relationship with a company, product, or service. The term retention
refers to the ability of a business to retain its customers over a specific period.
Bank Customer Retention Prediction
In this assignment, you will build a Machine Learning model to predict whether a bank customer is likely to churn (exit) or not based on various features such as credit score, age, tenure, and more. The dataset contains information about the customers, including their demographics, banking behaviors, and whether they have exited the bank (the label to be predicted). The dataset consists of the following columns:
- CustomerId: Unique identifier for each customer
- Surname: Customer's last name
- CreditScore: Customer's credit score
- Geography: Customer's country of residence
- Gender: Customer's gender
- Age: Customer's age
- Tenure: Number of years the customer has been with the bank
- Balance: Account balance
- NumOfProducts: Number of bank products the customer uses
- HasCrCard: Whether the customer has a credit card (1 = yes, 0 = no)
- IsActiveMember: Whether the customer is an active bank member (1 = yes, 0 = no)
- EstimatedSalary: Estimated salary of the customer
- Exited: Whether the customer has exited the bank (1 = yes, 0 = no)
Repository
TODOs
- Load the dataset into a DataFrame using Python's
Pandas
library. - Explore the dataset to understand the features, data types, and potential missing values.
- Preprocess the data by handling missing values, converting categorical variables, and scaling numerical features (if needed).
- Split the data into training and testing sets.
- Choose a suitable classification algorithm (e.g., Logistic Regression, Random Forest, etc.) and train the model on the training data.
- Evaluate the model's performance on the testing data using appropriate metrics (e.g., accuracy, precision, recall, etc.).
- Fine-tune the model if necessary to achieve better results.
- Finally, use the trained model to predict customer churn on new data.
- Complete the assignment using the
notebook
in the repository.- Push your solution back to Github once completed.
- Submit your notebook on Gradescope
- Look for Assignment - ML under assignments
HINTS
- Load the dataset into a DataFrame using Python's pandas library.
- Explore the dataset to understand the features, data types, and potential missing values.
- For data exploration, you can use Pandas methods like
head()
,info()
,describe()
, andvalue_counts()
. - To preprocess the data, consider using the
OneHotEncoder
from scikit-learn to encode categorical variables. - Use scikit-learn's
LogisticRegression
orRandomForestClassifier
for building the classification model. - Evaluate the model's performance using metrics such as accuracy.
- Visualize the results and explore the important features that contribute to customer churn using matplotlib or seaborn.