Lending Club Loan
Jifu Zhao, 05 March 2018
Lending Club Loan Data Analysis (imbalanced classification problem)
Classification is one of the two most common data science problems (the other is regression). For supervised classification, imbalanced data is quite common yet very challenging. For example, credit card fraud detection, disease classification, and network intrusion detection are all classification problems with imbalanced data. In this project, working with the Lending Club loan data, we hope to correctly predict whether or not a loan will default using the historical data.
Contents
This blog can be roughly divided into the following 7 parts. From the problem statement to the final conclusion, as a case study, I will go through the major procedures of a typical data science project. (For more details, please refer to my GitHub Jupyter notebook.)
- Problem Statement
- Data Exploration
- Data Cleaning and Initial Feature Engineering
- Visualization
- Further Feature Engineering
- Machine Learning
- Conclusions
1. Problem Statement
For companies like Lending Club, correctly predicting whether or not a loan will default is very important. In this project, using historical data, more specifically the Lending Club loan data from 2007 to 2015, we hope to build a machine learning model that can predict the chance of default for future loans. As I will show later, this dataset is highly imbalanced and includes a lot of features, which makes this problem more challenging.
2. Data Exploration
There are several ways to download the dataset: for example, you can go to Lending Club’s website, or you can go to Kaggle. I will use the loan data from 2007 to 2015 as the training set (plus validation set), and use the data from 2016 as the test set. Below is a summary of the dataset (part of the columns).
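For reference, here is a minimal sketch of how the data could be loaded with pandas, assuming the two yearly extracts have been saved locally (the file names below are placeholders, not the actual download names):

```python
import pandas as pd

# Hypothetical local file names for the 2007-2015 and 2016 extracts
train = pd.read_csv('loan_2007_2015.csv', low_memory=False)
test = pd.read_csv('loan_2016.csv', low_memory=False)

print(train.shape, test.shape)
print(train['loan_status'].value_counts())
```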
We should notice some differences between the training and test sets and look into the details (see the quick check after this list). Some major differences are:
- In the test set, id, member_id, and url are entirely missing, which is different from the training set
- In the training set, open_acc_6m, open_il_6m, open_il_12m, open_il_24m, mths_since_rcnt_il, total_bal_il, il_util, open_rv_12m, open_rv_24m, max_bal_bc, all_util, inq_fi, total_cu_tl, and inq_last_12m are almost entirely missing, which is different from the test set
- desc, mths_since_last_delinq, mths_since_last_record, mths_since_last_major_derog, annual_inc_joint, dti_joint, and verification_status_joint have a large number of missing values
- There are multiple loan statuses, but we are only concerned with whether or not the loan defaults
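Differences like these are easy to surface by comparing the fraction of missing values per column in the two sets. Continuing from the loading sketch above, a quick check might look like:

```python
# Fraction of missing values per column, compared across train and test
missing = pd.DataFrame({
    'train_missing': train.isnull().mean(),
    'test_missing': test.isnull().mean(),
}).sort_values('train_missing', ascending=False)

# Columns that are (almost) entirely missing in one set but not the other
print(missing[(missing['train_missing'] > 0.95) | (missing['test_missing'] > 0.95)])
```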
3. Data Cleaning and Initial Feature Engineering
Data cleaning and feature engineering are two of the most important steps. For this project, I did the following.
I. Transform features int_rate and revol_util in the test set
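In the raw 2016 export these two columns are stored as percent strings such as "13.56%", so they need to be cast to floats. A minimal sketch, assuming that string format:

```python
# Strip the percent sign and convert to float; non-parsable entries become NaN
for col in ['int_rate', 'revol_util']:
    test[col] = pd.to_numeric(test[col].astype(str).str.rstrip('%'), errors='coerce')
```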
II. Transform target values loan_status
In the training set, only 7% of all data have label 1. It’s clear that our dataset is highly imbalanced. (Note that here we also treat the current status as label 0 to increase the difficulty. In other related projects, some people simply drop all the loans in current status. I have also explored that case, and you can easily get over 99% AUC on both the training and test sets even with logistic regression.)
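The exact mapping of loan_status values is not spelled out here, but a hedged sketch of the binary encoding could look like the following (the status strings are assumptions based on the public data dictionary):

```python
# Assumed "bad" statuses mapped to label 1; everything else (including
# "Current" and "Fully Paid") becomes label 0
bad_status = ['Charged Off', 'Default',
              'Does not meet the credit policy. Status:Charged Off']

for df in (train, test):
    df['target'] = df['loan_status'].isin(bad_status).astype(int)

print(train['target'].mean())  # roughly 0.07 on the 2007-2015 data
```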
III. Drop useless features
Now, we have successfully reduced the number of features from 74 to 40. Next, let’s focus on more detailed feature engineering. First, let’s look at the data again. From the table below, we can see that:
- Most features are numerical, but there are several categorical features.
- There are still some missing values among numerical and categorical features.
IV. Feature transformation
- Transform numerical values into categorical values
- Transform categorical values into numerical values (discrete)
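Both transformations can be done with standard pandas tools. The columns and bins below are illustrative assumptions, not the exact choices from the notebook:

```python
# Numerical -> categorical: bin a continuous feature such as annual income
# into quintiles (pd.qcut returns integer bin codes with labels=False)
train['annual_inc_cat'] = pd.qcut(train['annual_inc'], q=5,
                                  labels=False, duplicates='drop')

# Categorical -> numerical (discrete): map the ordered loan grades to integers
grade_map = {g: i for i, g in enumerate(sorted(train['grade'].dropna().unique()))}
train['grade_num'] = train['grade'].map(grade_map)
```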
V. Fill missing values
- For numerical features, use median
- For categorical features, use mode (here, we don’t have missing categorical values)
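A short sketch of this fill strategy with pandas:

```python
# Median for numerical columns, mode for categorical columns
# (the latter is a no-op here since no categorical values are missing)
num_cols = train.select_dtypes(include='number').columns
cat_cols = train.select_dtypes(exclude='number').columns

train[num_cols] = train[num_cols].fillna(train[num_cols].median())
for col in cat_cols:
    train[col] = train[col].fillna(train[col].mode().iloc[0])
```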
4. Visualization
I. Visualize categorical features
II. Visualize numerical features
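The notebook contains many plots; as a representative sketch (assuming seaborn/matplotlib, with grade and the binary target used as example columns):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Categorical feature vs. target
sns.countplot(data=train, x='grade', hue='target')
plt.show()

# Correlation heatmap for the numerical features
corr = train.select_dtypes(include='number').corr()
sns.heatmap(corr, cmap='coolwarm', center=0)
plt.show()
```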
5. Further Feature Engineering
From the above heatmap and the categorical variable countplot, we can see that some features have strong correlations:
- loan_amnt, funded_amnt, funded_amnt_inv, installment
- int_rate, sub_grade
- total_pymnt, total_pymnt_inv, total_rec_prncp
- out_prncp, out_prncp_inv
- recoveries, collection_recovery_fee
We can drop some of them to reduce redundancy (a sketch of this step follows). Now, we have only 14 categorical features and 17 numerical features. Let’s check the correlation again.
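The exact columns dropped are not listed here, so treat the following as an illustrative sketch that keeps one column from each correlated group:

```python
# Assumed choice: keep loan_amnt, int_rate, total_pymnt, out_prncp, recoveries
redundant = ['funded_amnt', 'funded_amnt_inv', 'installment',
             'sub_grade',
             'total_pymnt_inv', 'total_rec_prncp',
             'out_prncp_inv',
             'collection_recovery_fee']
train = train.drop(columns=[c for c in redundant if c in train.columns])
test = test.drop(columns=[c for c in redundant if c in test.columns])
```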
6. Machine Learning
After the above procedures, we are ready to build the predictive models. In this part, I explored three different models: Logistic regression, Random Forest, and Deep Learning.
I used to use scikit-learn a lot, but there is one problem with scikit-learn: you need to do one-hot encoding manually, which can sometimes dramatically increase the feature space. In this part, for logistic regression and random forest, I use the H2O package, which has better support for categorical features. For the deep learning model, I use Keras with the TensorFlow backend.
I. Logistic Regression
After a grid search over alpha and lambda, I got a test AUC of 0.841.
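A minimal sketch of the H2O workflow for this grid search; the grid values themselves are assumptions:

```python
import h2o
from h2o.estimators.glm import H2OGeneralizedLinearEstimator
from h2o.grid.grid_search import H2OGridSearch

h2o.init()  # starts a local H2O cluster

# Convert the pandas frames and mark the target as categorical
train_h2o = h2o.H2OFrame(train)
test_h2o = h2o.H2OFrame(test)
train_h2o['target'] = train_h2o['target'].asfactor()
test_h2o['target'] = test_h2o['target'].asfactor()

features = [c for c in train_h2o.columns if c not in ('target', 'loan_status')]

# Assumed grid over the elastic-net mixing parameter and regularization strength
hyper_params = {'alpha': [0.0, 0.5, 1.0], 'lambda': [1e-4, 1e-3, 1e-2]}

glm_grid = H2OGridSearch(model=H2OGeneralizedLinearEstimator(family='binomial'),
                         hyper_params=hyper_params)
glm_grid.train(x=features, y='target', training_frame=train_h2o)

best_glm = glm_grid.get_grid(sort_by='auc', decreasing=True).models[0]
print(best_glm.model_performance(test_h2o).auc())
```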
II. Random Forest
After a grid search over the main hyperparameters, I got a test AUC of 0.848.
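The random forest follows the same H2O grid-search pattern; the hyperparameters and ranges below are assumptions, not the ones used in the notebook:

```python
from h2o.estimators.random_forest import H2ORandomForestEstimator

rf_grid = H2OGridSearch(model=H2ORandomForestEstimator(ntrees=200, seed=42),
                        hyper_params={'max_depth': [10, 20, 30]})
rf_grid.train(x=features, y='target', training_frame=train_h2o)

best_rf = rf_grid.get_grid(sort_by='auc', decreasing=True).models[0]
print(best_rf.model_performance(test_h2o).auc())
```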
Feature Importance
As shown above, the top 9 most important features are:
- out_prncp: Remaining outstanding principal for total amount funded
- recoveries: Post charge off gross recovery
- last_pymnt_amnt: Last total payment amount received
- total_pymnt: Payments received to date for total amount funded
- int_rate: Interest Rate on the loan
- addr_state: The state provided by the borrower in the loan application
- total_rec_late_fee: Late fees received to date
- loan_amnt: The listed amount of the loan applied for by the borrower. If at some point in time, the credit department reduces the loan amount, then it will be reflected in this value.
- total_rec_int: Interest received to date
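These importances can be read directly from the trained H2O model, for example:

```python
# Variable importances of the best random forest from the sketch above
print(best_rf.varimp(use_pandas=True).head(9))
best_rf.varimp_plot(num_of_features=9)
```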
III. Neural Networks
In this part, let’s manually build a fully-connected neural network (NN) model to finish the classification task. I use a relatively small model with only two hidden layers. Without comprehensive parameter tuning, the model gives an AUC of 0.834.
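A hedged sketch of such a network, written against tf.keras; the layer sizes, one-hot encoding step, and training settings are assumptions rather than the exact configuration used:

```python
from tensorflow import keras
from tensorflow.keras import layers

# Keras needs a purely numeric matrix, so one-hot encode the categorical
# columns (assumes high-cardinality text/ID columns were dropped earlier)
X_train = pd.get_dummies(train.drop(columns=['target', 'loan_status'], errors='ignore')).astype('float32')
X_test = pd.get_dummies(test.drop(columns=['target', 'loan_status'], errors='ignore')).astype('float32')
X_train, X_test = X_train.align(X_test, join='left', axis=1, fill_value=0)
y_train, y_test = train['target'].values, test['target'].values

# Two hidden layers with a sigmoid output for the binary target
model = keras.Sequential([
    layers.Dense(64, activation='relu', input_shape=(X_train.shape[1],)),
    layers.Dense(32, activation='relu'),
    layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy',
              metrics=[keras.metrics.AUC(name='auc')])

model.fit(X_train.values, y_train, epochs=10, batch_size=1024,
          validation_split=0.1, verbose=2)
print(model.evaluate(X_test.values, y_test, verbose=0))
```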
7. Conclusions
From the above analysis, we can see that the three algorithms, Logistic Regression, Random Forest, and Neural Networks, have pretty similar performance on the test set. Based on our simple analysis and grid search, Random Forest gives the best result.
There are a lot of other methods, such as AdaBoost and XGBoost, and we can tune a lot of parameters for the different models, especially for Neural Networks. Here, I didn’t explore all possible algorithms or conduct comprehensive parameter tuning. For more details, please refer to my GitHub Jupyter notebook.
Footnote:
In this project, to increase the problem difficulty, loans with a status like “current” are treated as status 0, which might not be very appropriate. There are other online analyses that exclude all loans that are still in “current” status. I also conducted the analysis with that kind of dataset, which is less imbalanced. I can easily get over 0.99 AUC on both the training and test datasets even with simple logistic regression. You can try it yourself if interested.