LendingClub: bias in data? Machine learning and investment strategy
As an investor, how confident should I be in the peer-to-peer platform's decision to accept a loan? Should I always trust the interest rate assigned by LendingClub?
The peer-to-peer lending industry has grown significantly since its inception in 2007, with LendingClub as a major player. With billions in annual loans, there are significant opportunities to capitalize on this alternative investment instrument. A lot has been written about sophisticated investment strategies that use LendingClub's massive historical data sets to understand which features best predict whether a loan will be fully paid or will be charged off.
Many machine learning models, including logistic regression, random forest, quadratic discriminant analysis and other ensemble methods have been found to predict the likelihood of charge off with acceptable accuracy.
Upon receiving a new loan request, the peer-to-peer platform LendingClub relies on a team of highly skilled data scientists to decide whether the loan should be rejected and, if it is accepted, which interest rate to assign.
Bias in data can mislead the most careful investors who rely on machine learning to craft an investment strategy. Why shouldn’t we trust the data at hand when training a model for predicting credit risk?
Can we always trust the LendingClub decision? If we invest in a loan that the platform should have rejected but did not, we are likely to end up losing money.
We can also question the grade (and with it the interest rate) assigned by the platform. Statistically, loans with higher interest rates are the ones that borrowers most often fail to repay. If a flaw in LendingClub's risk assessment led to an excessively high interest rate, the failure of the loan would not depend only on the goodwill of the borrower: the rate itself would have contributed to it.
What if we adjust the data using machine learning?
We could certainly give it a try. This is a more complex, heuristic approach that requires different models solving different tasks at different steps. In this post we present a workflow that could be exciting to implement.
Step 1: new loans are released on the marketplace
LendingClub has approved and released a list of loans ready for funding. Our investor comes to us for advice about which loans to add to her portfolio.
Step 2: we find out which loans LendingClub should have rejected
We use a model M1 to predict the probability P_r that a loan would be rejected. M1 is useful because it flags loans that were released on the marketplace even though they resemble loans LendingClub has rejected in the past. Our investor should be wary of putting money into such loans. M1 is a binary classifier that can be trained on the full LendingClub data sets of accepted and rejected loans, using features such as years of employment, FICO credit score, loan amount and purpose.
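To make the idea of M1 concrete, here is a minimal sketch that assumes a logistic regression trained by plain gradient descent. The eight applications, the scaling constants and the function names (train_m1, predict_p_r) are made up for illustration; they are not LendingClub data or an existing API, and the purpose feature is left out to keep the example short.

```python
import math

# Toy, made-up applications: (FICO score, years employed, loan amount in USD).
# Label 1 = rejected, 0 = accepted. Illustrative rows only, not real data.
APPLICATIONS = [
    ((580, 0, 30000), 1),
    ((600, 1, 25000), 1),
    ((610, 0, 35000), 1),
    ((620, 2, 28000), 1),
    ((700, 5, 10000), 0),
    ((720, 8, 12000), 0),
    ((740, 10, 8000), 0),
    ((760, 6, 15000), 0),
]


def scale(features):
    """Bring the features to comparable ranges (assumed constants)."""
    fico, years, amount = features
    return [(fico - 650) / 100, (years - 4) / 5, (amount - 20000) / 10000]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_m1(rows, epochs=2000, lr=0.5):
    """Fit logistic regression weights with batch gradient descent."""
    weights = [0.0, 0.0, 0.0]
    bias = 0.0
    n = len(rows)
    for _ in range(epochs):
        grad_w = [0.0, 0.0, 0.0]
        grad_b = 0.0
        for features, label in rows:
            x = scale(features)
            z = sum(w * xi for w, xi in zip(weights, x)) + bias
            error = sigmoid(z) - label
            for i, xi in enumerate(x):
                grad_w[i] += error * xi
            grad_b += error
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict_p_r(model, features):
    """Return P_r, the estimated probability that the loan is rejected."""
    weights, bias = model
    x = scale(features)
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


m1 = train_m1(APPLICATIONS)
p_r_weak = predict_p_r(m1, (590, 1, 32000))   # resembles the rejected rows
p_r_strong = predict_p_r(m1, (750, 9, 9000))  # resembles the accepted rows
```

On real data, a library implementation with proper validation would be the more sensible choice. The hand-written loop is only there to show what the classifier computes.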
Step 3: we find out which loans are likely to be fully paid
The subset of loans classified as good in step 2 now undergoes a second classification. We use another model M2 to estimate the probability of charge off P_co. M2 is the credit risk classification model that most data scientists build when they explore LendingClub data.
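Steps 2 and 3 chain two classifiers, which can be sketched as a simple screening function. The sketch assumes that P_r and P_co have already been computed by M1 and M2. The ScoredLoan class, the two cut-off values and the example scores are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ScoredLoan:
    loan_id: str
    p_r: float   # probability of rejection from M1 (step 2)
    p_co: float  # probability of charge off from M2 (step 3)


# Hypothetical cut-offs; in practice they would be tuned on validation data.
MAX_P_R = 0.5
MAX_P_CO = 0.2


def screen(loans):
    """Split loans into three buckets following steps 2 and 3."""
    flagged, likely_paid, likely_charge_off = [], [], []
    for loan in loans:
        if loan.p_r > MAX_P_R:
            # M1 says LendingClub would normally have rejected this loan.
            flagged.append(loan.loan_id)
        elif loan.p_co <= MAX_P_CO:
            likely_paid.append(loan.loan_id)
        else:
            likely_charge_off.append(loan.loan_id)
    return flagged, likely_paid, likely_charge_off


loans = [
    ScoredLoan("A", p_r=0.80, p_co=0.10),
    ScoredLoan("B", p_r=0.10, p_co=0.05),
    ScoredLoan("C", p_r=0.20, p_co=0.45),
]
flagged, likely_paid, likely_charge_off = screen(loans)
```

Loans in the third bucket are not discarded: they are the ones examined in step 4.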
Step 4: we find out risky loans worth investment
Even loans predicted to charge off can be worth the investment, depending on the interest rate and on when they charge off. We look at the loans that were classified as charge off and use a new model M3 to estimate their life duration L_d in months. For example, we may find that a 36-month loan will charge off around the 30th month of its lifetime. If its interest rate is high enough, investing in that loan can still be profitable.
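The 36-month example can be checked with the standard fixed-payment (annuity) formula. This is a rough sketch under simplifying assumptions: the borrower makes every payment on time for L_d months and nothing afterwards, and there are no recoveries, no service fees and no discounting. The loan amount and the two interest rates are made-up inputs.

```python
def monthly_payment(principal, annual_rate, term_months):
    """Fixed monthly payment of a fully amortizing loan (annual_rate > 0)."""
    r = annual_rate / 12
    return principal * r / (1 - (1 + r) ** -term_months)


def profit_if_charged_off(principal, annual_rate, term_months, l_d):
    """Nominal profit when payments stop after l_d months."""
    payment = monthly_payment(principal, annual_rate, term_months)
    return payment * l_d - principal


# A 10,000 USD, 36-month loan that charges off after 30 payments.
profit_low_rate = profit_if_charged_off(10_000, 0.08, 36, l_d=30)
profit_high_rate = profit_if_charged_off(10_000, 0.20, 36, l_d=30)
```

With these inputs the 8% loan ends below the amount invested, while the 20% loan ends above it, which is the effect described above.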
Step 5: we adjust the interest rate, loan amount and payment term
Loans that we predicted to be fully paid in step 3 might still charge off. We can never be 100% sure that the model has not misclassified some of them. One reason why a borrower we trusted to repay in full could end up in default is an inappropriate interest rate. A loan amount that is too high, or a payment term that is too short, could also lead to a bad outcome. The idea is to use a model M4 that predicts an adjusted interest rate L_sg for such a borrower by looking at borrowers with a similar profile who fully paid their loans in the past. Adjusting the interest rate to those neighbors lets us estimate a more realistic return on investment. Similarly, we can use a model M5 to estimate an adjusted amount L_a that better suits the borrower's profile, and a model M6 to estimate the adjusted term L_t.
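One simple way to read "borrowers who have a similar profile" is a k-nearest-neighbors average, so the sketch below uses that technique as a stand-in for M4. The historical rows, the two features, the scaling constants and k = 3 are all assumptions chosen for illustration.

```python
import math

# Made-up historical loans that were fully paid:
# (FICO score, annual income in USD, interest rate in %).
FULLY_PAID = [
    (700, 60000, 9.5),
    (705, 62000, 10.0),
    (695, 58000, 10.5),
    (760, 120000, 6.0),
    (640, 35000, 16.0),
]


def distance(a, b):
    """Euclidean distance on roughly scaled features (assumed scales)."""
    fico_gap = (a[0] - b[0]) / 50
    income_gap = (a[1] - b[1]) / 20000
    return math.hypot(fico_gap, income_gap)


def adjusted_rate(borrower, history, k=3):
    """L_sg: mean rate of the k most similar fully paid loans."""
    nearest = sorted(history, key=lambda row: distance(borrower, row))[:k]
    return sum(row[2] for row in nearest) / len(nearest)


# New borrower described by (FICO score, annual income in USD).
l_sg = adjusted_rate((702, 61000), FULLY_PAID)
```

The same pattern, with the loan amount or the term as the averaged column, would give a first approximation of M5 and M6.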
Step 6: we predict late payments
Loans can be late in some months and still end up fully paid at the end of the term. However, a large amount of late payment fees is usually an indicator of a future charge-off. We can learn from the late payments recorded on current loans to predict whether our investor should expect late payments when she invests in a new loan request that resembles loans currently running late. We therefore use a model M7 to predict the amount of late fees L_f that the investor could expect from this new loan. Similarly, we use a model M8 to estimate the probability P_l that the loan will be late at some point.
Step 7: we calculate a more realistic return on investment
At this step, we have a new loan L together with its probability of charge off P_co, its probability of late payment P_l, its adjusted interest rate L_sg, its adjusted amount L_a, its adjusted term L_t, and the estimated amount of late fees L_f that the borrower would likely pay. We feed all these numbers into a function F1 that calculates an estimated return on investment L_roi, in dollars, for the new loan.
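The post does not specify the form of F1, so the following is only one possible sketch, written as an expected value over the two outcomes "fully paid" and "charged off". It also takes the life duration L_d from step 4 as an input, treats L_f as the fees collected if the loan turns late, and ignores service fees, recoveries and discounting. All of these are assumptions, as are the example inputs.

```python
def monthly_payment(principal, annual_rate, term_months):
    """Fixed monthly payment of a fully amortizing loan (annual_rate > 0)."""
    r = annual_rate / 12
    return principal * r / (1 - (1 + r) ** -term_months)


def f1_expected_roi(l_a, l_sg, l_t, p_co, l_d, p_l, l_f):
    """Hypothetical F1: expected profit in dollars for one loan.

    l_a: adjusted amount, l_sg: adjusted annual rate (0.12 means 12%),
    l_t: adjusted term in months, p_co: probability of charge off,
    l_d: months paid before a charge off, p_l: probability of being late,
    l_f: late fees collected if the loan is late.
    """
    payment = monthly_payment(l_a, l_sg, l_t)
    profit_if_paid = payment * l_t - l_a
    profit_if_charged_off = payment * l_d - l_a
    expected_late_fees = p_l * l_f
    return ((1 - p_co) * profit_if_paid
            + p_co * profit_if_charged_off
            + expected_late_fees)


l_roi = f1_expected_roi(l_a=10_000, l_sg=0.12, l_t=36,
                        p_co=0.10, l_d=30, p_l=0.20, l_f=50.0)
```

A more careful version would discount the cash flows and model partial recoveries, but this form is enough to compare loans on a common dollar scale.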
Step 8: we build a diverse investment portfolio
Now our investor knows how much money she can expect to make on each loan in the LendingClub list. To choose the right ones and build a portfolio that maximizes the total return on investment while minimizing the risk, we use a function F2 that splits the loans into individual notes. Given the maximum amount our investor is willing to invest and the risk she is willing to take, the function returns a list of notes worth investing in.
Does this workflow make sense? Which features are required to train each of the eight models described in this post? How can we avoid collinearity between the models? Does this approach result in a fairer marketplace?
Thank you for reading this far. I hope you enjoyed this post. Let me know your thoughts!
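F2 is likewise left open in this post. A minimal sketch, assuming a greedy heuristic rather than a true portfolio optimizer, could look like the code below. The note size, the per-loan cap, the tuple layout (loan id, expected return per dollar, P_co) and the example loans are all hypothetical.

```python
NOTE_SIZE = 25  # assumed note denomination in USD


def f2_select_notes(loans, budget, max_p_co, max_notes_per_loan=4):
    """Hypothetical F2: greedy, diversified allocation of notes.

    loans: list of (loan_id, expected_return_per_dollar, p_co).
    Returns a dict mapping loan_id to the number of notes to buy.
    """
    # Keep only loans within the investor's risk tolerance.
    eligible = [loan for loan in loans if loan[2] <= max_p_co]
    # Best expected return first.
    eligible.sort(key=lambda loan: loan[1], reverse=True)

    allocation = {}
    remaining = budget
    # Buy one note per loan per round, so money is spread across loans.
    for _ in range(max_notes_per_loan):
        for loan_id, _roi, _p_co in eligible:
            if remaining < NOTE_SIZE:
                return allocation
            allocation[loan_id] = allocation.get(loan_id, 0) + 1
            remaining -= NOTE_SIZE
    return allocation


loans = [
    ("A", 0.15, 0.05),
    ("B", 0.22, 0.30),  # highest return, but too risky for this investor
    ("C", 0.10, 0.08),
    ("D", 0.18, 0.12),
]
portfolio = f2_select_notes(loans, budget=125, max_p_co=0.15)
```

The round-robin loop spreads the budget across loans before adding a second note to any of them, which is one simple way to express diversification.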