========================================================
Prosper is a peer to peer lending company that offers personal loans at low rates. These loans are unsecured, which means you do not have to put up any collateral (like a house or car) that could get taken away if you can’t make payments. Each loan is typically funded by multiple people all over the United States. In this way, Prosper is a marketplace connecting those who need a loan to those who have extra money to lend. Individuals can either invest in personal loans or request to borrow money. I was particularly interested in exploring this dataset as I had just finished reading the Big Short from Michael Lewis. He narrates in much detail of how greedy wallstreet bankers choose to chase the money at any cost. The loan orgination became overly complicated by adding numerous layers of abstractions aka securitization. Ultimately this led to total disregard and actually allowed the convenient manipulation of the lending process at it’s core.
In contrast, briefly reading up about the prosper.com P2P business model, I was keen to understand how prosper.com loan origination prospered. Interestingly prosper.com was in the midst of its Series C (third round) of funding around mid 2008. Just around that time Wallstreet was massively imploding leading up to the Lehman crisis. Note: Post “dabbling” in the prosper data it is clear that subprime loans vs P2P vary in orders of magnitude. Nevertheless the basic principle of risk vs returns applies here as well.
Useful links:
Some paramater definitions:
## 'data.frame': 113937 obs. of 81 variables:
## $ ListingKey : Factor w/ 113066 levels "00003546482094282EF90E5",..: 7180..
## $ ListingNumber : int 193129 1209647 81716 658116 909464 1074836 750899 768..
## $ ListingCreationDate : Factor w/ 113064 levels "2005-11-09 20:44:28.847000000",....
## $ CreditGrade : Factor w/ 9 levels "","A","AA","B",..: 5 1 8 1 1 1 1 1 1 1..
## $ Term : int 36 36 36 36 36 60 36 36 36 36 ...
## $ LoanStatus : Factor w/ 12 levels "Cancelled","Chargedoff",..: 3 4 3 4 4..
## $ ClosedDate : Factor w/ 2803 levels "","2005-11-25 00:00:00",..: 1138 1 ..
## $ BorrowerAPR : num 0.165 0.12 0.283 0.125 0.246 ...
## $ BorrowerRate : num 0.158 0.092 0.275 0.0974 0.2085 ...
## $ LenderYield : num 0.138 0.082 0.24 0.0874 0.1985 ...
## $ EstimatedEffectiveYield : num NA 0.0796 NA 0.0849 0.1832 ...
## $ EstimatedLoss : num NA 0.0249 NA 0.0249 0.0925 ...
## $ EstimatedReturn : num NA 0.0547 NA 0.06 0.0907 ...
## $ ProsperRating..numeric. : int NA 6 NA 6 3 5 2 4 7 7 ...
## $ ProsperRating..Alpha. : Factor w/ 8 levels "","A","AA","B",..: 1 2 1 2 6 4 7 5 3 3..
## $ ProsperScore : num NA 7 NA 9 4 10 2 4 9 11 ...
## $ ListingCategory..numeric. : int 0 2 0 16 2 1 1 2 7 7 ...
## $ BorrowerState : Factor w/ 52 levels "","AK","AL","AR",..: 7 7 12 12 25 34 ..
## $ Occupation : Factor w/ 68 levels "","Accountant/CPA",..: 37 43 37 52 21..
## $ EmploymentStatus : Factor w/ 9 levels "","Employed",..: 9 2 4 2 2 2 2 2 2 2 ...
## $ EmploymentStatusDuration : int 2 44 NA 113 44 82 172 103 269 269 ...
## $ IsBorrowerHomeowner : Factor w/ 2 levels "False","True": 2 1 1 2 2 2 1 1 2 2 ...
## $ CurrentlyInGroup : Factor w/ 2 levels "False","True": 2 1 2 1 1 1 1 1 1 1 ...
## $ GroupKey : Factor w/ 707 levels "","00343376901312423168731",..: 1 1 ..
## $ DateCreditPulled : Factor w/ 112992 levels "2005-11-09 00:30:04.487000000",....
## $ CreditScoreRangeLower : int 640 680 480 800 680 740 680 700 820 820 ...
## $ CreditScoreRangeUpper : int 659 699 499 819 699 759 699 719 839 839 ...
## $ FirstRecordedCreditLine : Factor w/ 11586 levels "","1947-08-24 00:00:00",..: 8639 6..
## $ CurrentCreditLines : int 5 14 NA 5 19 21 10 6 17 17 ...
## $ OpenCreditLines : int 4 14 NA 5 19 17 7 6 16 16 ...
## $ TotalCreditLinespast7years : int 12 29 3 29 49 49 20 10 32 32 ...
## $ OpenRevolvingAccounts : int 1 13 0 7 6 13 6 5 12 12 ...
## $ OpenRevolvingMonthlyPayment : num 24 389 0 115 220 1410 214 101 219 219 ...
## $ InquiriesLast6Months : int 3 3 0 0 1 0 0 3 1 1 ...
## $ TotalInquiries : num 3 5 1 1 9 2 0 16 6 6 ...
## $ CurrentDelinquencies : int 2 0 1 4 0 0 0 0 0 0 ...
## $ AmountDelinquent : num 472 0 NA 10056 0 ...
## $ DelinquenciesLast7Years : int 4 0 0 14 0 0 0 0 0 0 ...
## $ PublicRecordsLast10Years : int 0 1 0 0 0 0 0 1 0 0 ...
## $ PublicRecordsLast12Months : int 0 0 NA 0 0 0 0 0 0 0 ...
## $ RevolvingCreditBalance : num 0 3989 NA 1444 6193 ...
## $ BankcardUtilization : num 0 0.21 NA 0.04 0.81 0.39 0.72 0.13 0.11 0.11 ...
## $ AvailableBankcardCredit : num 1500 10266 NA 30754 695 ...
## $ TotalTrades : num 11 29 NA 26 39 47 16 10 29 29 ...
## $ TradesNeverDelinquent..percentage. : num 0.81 1 NA 0.76 0.95 1 0.68 0.8 1 1 ...
## $ TradesOpenedLast6Months : num 0 2 NA 0 2 0 0 0 1 1 ...
## $ DebtToIncomeRatio : num 0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.25 0.25 ...
## $ IncomeRange : Factor w/ 8 levels "$0","$1-24,999",..: 4 5 7 4 3 3 4 4 4 ..
## $ IncomeVerifiable : Factor w/ 2 levels "False","True": 2 2 2 2 2 2 2 2 2 2 ...
## $ StatedMonthlyIncome : num 3083 6125 2083 2875 9583 ...
## $ LoanKey : Factor w/ 113066 levels "00003683605746079487FF7",..: 1003..
## $ TotalProsperLoans : int NA NA NA NA 1 NA NA NA NA NA ...
## $ TotalProsperPaymentsBilled : int NA NA NA NA 11 NA NA NA NA NA ...
## $ OnTimeProsperPayments : int NA NA NA NA 11 NA NA NA NA NA ...
## $ ProsperPaymentsLessThanOneMonthLate: int NA NA NA NA 0 NA NA NA NA NA ...
## $ ProsperPaymentsOneMonthPlusLate : int NA NA NA NA 0 NA NA NA NA NA ...
## $ ProsperPrincipalBorrowed : num NA NA NA NA 11000 NA NA NA NA NA ...
## $ ProsperPrincipalOutstanding : num NA NA NA NA 9948 ...
## $ ScorexChangeAtTimeOfListing : int NA NA NA NA NA NA NA NA NA NA ...
## $ LoanCurrentDaysDelinquent : int 0 0 0 0 0 0 0 0 0 0 ...
## $ LoanFirstDefaultedCycleNumber : int NA NA NA NA NA NA NA NA NA NA ...
## $ LoanMonthsSinceOrigination : int 78 0 86 16 6 3 11 10 3 3 ...
## $ LoanNumber : int 19141 134815 6466 77296 102670 123257 88353 90051 121..
## $ LoanOriginalAmount : int 9425 10000 3001 10000 15000 15000 3000 10000 10000 10..
## $ LoanOriginationDate : Factor w/ 1873 levels "2005-11-15 00:00:00",..: 426 1866 2..
## $ LoanOriginationQuarter : Factor w/ 33 levels "Q1 2006","Q1 2007",..: 18 8 2 32 24 3..
## $ MemberKey : Factor w/ 90831 levels "00003397697413387CAF966",..: 11071..
## $ MonthlyLoanPayment : num 330 319 123 321 564 ...
## $ LP_CustomerPayments : num 11396 0 4187 5143 2820 ...
## $ LP_CustomerPrincipalPayments : num 9425 0 3001 4091 1563 ...
## $ LP_InterestandFees : num 1971 0 1186 1052 1257 ...
## $ LP_ServiceFees : num -133.2 0 -24.2 -108 -60.3 ...
## $ LP_CollectionFees : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LP_GrossPrincipalLoss : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LP_NetPrincipalLoss : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LP_NonPrincipalRecoverypayments : num 0 0 0 0 0 0 0 0 0 0 ...
## $ PercentFunded : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Recommendations : int 0 0 0 0 0 0 0 0 0 0 ...
## $ InvestmentFromFriendsCount : int 0 0 0 0 0 0 0 0 0 0 ...
## $ InvestmentFromFriendsAmount : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Investors : int 258 1 41 158 20 1 1 1 1 1 ...
…There is stark dip in loans made around 2009. This is not surprising given this is around the time of subprime crisis. Loan amounts also fluctuate over the years with steady return to pre-2009 period and significant acceleration from 2012 onwards. Note: 2014 includes only Q1 data however loan values are already at 2013 exit levels with only just around half of the 2013 loans orginated
… significant shift in the median loan value from 2012 to 2013. Average loan amounts are trending higher with a decent spread
…loan amounts spike around the 5000 ranges and are interestingly normally distributed in the mid ranges. Execption are 4000$ loans
… decent normal distribution of loans by rating, suggesting maturity in lending confidence over time. Not so surprising is that the median value of loans decreases as risk increases
… trend over the year shows pick up in lower rated loans. However for the higher risk loans from 2011 onwards the box plots suggest decreasing variability … IQR = 0 in some cases! Especially 2013 has some “strange”" box plots for E and HR
…digging into 2013 makes it clear that for risk rating E and HR there is very little variability. It would be very interesting to understand these loans in more detail (to who are these being made to? what is the investor profile? etc)
The year on year trend of loans by risk rating shows significant upwards trend of higher quality loans (AA, A, B & C)
… further the listing rating vs underlying prosper score of borrowers trend shows a clear maturity in risk grading and significant alignment of listing rating to borrower profiles. Interestingly there is a shift towards “riskier lending practices” over the years: e.g. AA listings in 2012 has borrowers with a score at least 8 or higher, whereas in 2013 borrowers of with score of 7 and 6 are included and in 2014 even those with score of 5 are offered loans
## [1] "... Loans by ListingCategory"
##
## Debt NA Other HomImp Biz Auto
## 58242 16824 10470 7422 7172 2565
## Personal HomExp Med Tax LargePurch Wedding
## 2395 1996 1522 885 876 771
## Vacation Student Mbike Engagement Baby Costmetic
## 768 750 304 217 199 91
## Boat GreenLoans RV
## 85 59 52
## [1] "... Loans by Recommendations"
##
## 0 1 2 3 4 5 6 7 8 9
## 109416 3507 567 108 26 14 4 5 3 6
## 14 16 18 19 21 24 39
## 1 2 2 1 1 1 1
## [1] "... Loans by InvestmentFromFriendsCount"
##
## 0 1 2 3 4 5 6 7 8 9
## 111539 1830 215 40 15 8 4 2 3 5
## 13 15 20 33
## 1 1 1 1
## [1] "... Loans by EstimatedEffectiveYield"
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -0.183 0.116 0.162 0.169 0.224 0.320 28812
## [1] "... Loans by DebtToIncomeRatio"
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.140 0.220 0.276 0.320 10.010 8482
## [1] "... Loans by StatedMonthlyIncome"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 3208 4667 5613 6833 1750000
… by reviewing some of the relevant loan characteristics (subset to those where a rating is available) this is what I observe
… Margins for high risk loans are significantly higher and ROI drops significantly (Returns/Yield). Presumably premium on risk is key factor for this distortion
…Lender yield has outliers at 30% and mainly in 2012 and another outlier at 35%. Estimated return is a pretty well normallly distributed. ROI i.e. returns compared to forecasted yield has significantly narrower spread
… as expected estimated loss is aligned with risk rating. This is not surprising given Prosper calculates this based on historical performance of Prosper loans with similar characteristics
… grouping loans which either defaulted, charged off or payment delayed as bad loans shows significant decrease in such loans in 2013 (2014 is partial data). It also shows that bad loan ratio is higher in the higher risk categories (sort of expected).
I then detailed into borrower scores (as a proxy for valuations) as a starting point to understand underlying risk assessment by Prosper. Interestingly, bad loans are not necessarily all due to borrowers with lower risk scores. Actually the borrower score “heatmap” by risk rating seems more or less similar over the years across 2011 till 2013 (I exclude 2014 as it has only Q4 data and earlier years due to low volumes). Further investigation is needed to understand why then there is a higher portion of bad loans in 2012 compared to 2013. One possible factor to exclude would be that 2013 loans might not be through entire lifecycle
There are 81 variables and nearly 114000 observations. Data has a pre-July 2009 and post July 2009 break. For example risk rating for pre-July is captured in CreditGrade whereas Alpha has risk ratings post July 2009. Similarly ProsperScore is available only post JUly 2009 and same applies to EstimatedReturns etc. Some variables are categorical and hence had to factor them to be able to assess. E.g. ListingCategory. I remapped this one from numerical to a proper label.
I was interested in nature of the loan orginations/trends, credit rating and ROI (yield). Specifically this means getting an understanding of * Loan size spread (loan amount spread i.e. smaller loans vs larger loans) * Is this spread different by risk category? * Prosper Ratings allow investors to consider a loan’s level of risk * Yields (LenderYield and EstimatedLoss)
Propser rating is what investors use to assess credit grade (AA-lowRisk till HR-highRisk). To understand how this compares to the credit score (I believe this would be FICO) I created mean of CreditScoreRangeLower and CreditScoreRangerUpper. I also ordered the prosper rating so it is ordered against their rating order
I created few new variables
Loan amounts spike around the 5000 ranges and are interestingly normally distributed in between. Zooming in shows that loan rangers are per 1000 suggesting this could be a pratical minimal loan quantum slab. The other anomaly I noticed is that loan amount have near zero variability for high risk ratings esp in 2013 and 2014.
## Warning in ggcorr(data = subset(pl, !is.na(ProsperScore)), hjust
## = 0.75, : data in column(s) 'ListingKey', 'ListingCreationDate',
## 'CreditGrade', 'LoanStatus', 'ClosedDate', 'ProsperRating..Alpha.',
## 'ProsperScore', 'BorrowerState', 'Occupation', 'EmploymentStatus',
## 'IsBorrowerHomeowner', 'CurrentlyInGroup', 'GroupKey', 'DateCreditPulled',
## 'FirstRecordedCreditLine', 'IncomeRange', 'IncomeVerifiable',
## 'LoanKey', 'LoanOriginationDate', 'LoanOriginationQuarter', 'MemberKey',
## 'LoanOriginationDate.Y', 'ConcatenatedRating', 'ListingCategory',
## 'LoanStatus.health' are not numeric and were ignored
… Overall correlation seems sparse. However there are certain variables where correlation is visible: Investors, Term, Fees, Yield, BorrowerRate, Rating, CreditLimit, Score, LoanOriginalAmount, CreditScore
… suggests that key correlation factors of interest would be loan amount, effective yield, credit score (risk) and perceived risk (investors)
…as I would have expected the prosper rating is aligned to the credit scores. Possibly bit on the conservative side for the lower risk categories. Trend over the years shows that a narrower credit score range is emerging. Further the risk categories (ratings) are clearly being adjusted
- relaxation for less riskier ones (e.g. average credit score for AA shifting from 800 to 750 over the years)
- tightening for the more risker ones (average credit score for HR 500 to 600)
## [1] 0.2012595
While reading about the typical compliants made by borrowers one of the main struggles highlighted with Prosper was getting approval for loan. The information provided there further highlights that Prosper prefers borrower with a good salary.
Given loans are unsecured, I hypothesised that a key reason could be that salary is used as a direct measure in mitigating the risk
… Interest and (orgination) fees increase with higher risk categories but is lower for highest two categories (wondering why?)
In contrast service fees decrease for higher risk categories. This must be to due to lower loan amounts and payment time frames. Hence a median value of the service fee in $ amount is not meaningful. Better would be to compute service fee in relation to the loan amount.
Similary collection fees are higher on average for higher risk categories with some anamolies (e.g. cat E has lower collection fees on average than cat D).
Further analysis required to understand what these data points are telling
- Unsurprisingly investor size dwindles for higher risk (above C) categories.
- Equally unsurprising is the fact that those earning less are categorized as higher risk.
Prosper ratings are very well tuned to underlying risk (credit scores) and trend over the years shows that Prosper is managing the loans towards a tight range b/w average credit score of 600 to 800 .
LenderYields are strongly related to the credit scores. Interestingly Propser risk averseness is being lowered (ableit not significantly) over the years. This could be also due to competition from other peer-to-peer lenders (e.g. LendingClub).
… Excluding the outliers in 2006 and 2007, lenderyield curve trends towards a near linear alignment to the credit score. Yields trend to a decent 5% for loans for the most safest loans i.e. credit score of more then 800. Coupled with larger loan amounts this is something I would consider investing in providing for good cashflow.
…Estimated loss decreases with increasing income. Note that I have capped the salary at 10’000 as data is sparse above this range. Regression line flattens to ~6%.
Comparing the debt to income ratio to borrower rate highlights that higher leverage exposure increases borrower rate. However I was surprised that the regression line was not steeper - is near linear and pretty flat
….majority of borrowers are salaried employees. Leverage for this group shows some very interesting patterns. The debt ratio is very varied for those in the lower salary bracket. The regression model “swings”" massively in this range and then as salaries ranges increase, a near constant debt to income ratio emerges. Interestingly amplititude increases for those categorized as “self-employed”
Initially I started to focus on yield taking an investor’s point of view. It turned out that salary (as stated by borrower) and leverage (debt to income) were better parameter to observe. Given credit score and prosper rating are closely aligned, using the risk rating as an additional variable provided a differentiated picture. This led me to then look at the employment status of the borrowers in more detail. Insight was that majority of prosper borrowers are salaried.
The debt ratio is very varied for those in the lower salary bracket. The regression model “swings”" massively in this range and as salary ranges increase, a near constant debt to income ratio emerges. Interestingly amplititude increases for those categorized as “self-employed”
Work in progress. An area of focus I would be interested is in forecasting loans going sour.
LenderYield as a measure of expected returns clearly shows an increase with higher risk. The estimated return also shows a similar pattern though variablity is significantly different. Further average estimated return descreases as risk increases and actually drops for the HR risk category. Finally tying this together and using the Estimated return in relation to the interest rate shows a total different picture. For higher risk the ratio drops drastically which confirms that pushing for higher yield without understanding the underlying risk is not always a profitable endveaour.
While reading the typical compliants made by borrowers one of the main struggles highlighted with Prosper was getting approval for loan. The information provided there further highlights that Prosper prefers borrower with a good salary and job. Given loans are unsecured, I hypothesised that a key reason could be that salary is used as a direct measure in mitigating the risk.
Above plots also show following:
Focusing on the “stable” years of 2008 and 2013 onwards, I see that lender yield curve trends towards a near linear alignment to the credit score. Yields achieve decent 5% for loans with near risk free credit score of more then 800 even though the yields are significanlty lower compared to the more risker loans. However, given that larger loan amounts in this range which will deliver a higher (monthly) cashflow in absolute terms, this woudl be my sweet-spot where I would consider investing.
Other interesting observations from this chart
On a personal note: I thoroughly enjoyed this project even though it was a quite a challenge. The sheer number of parameters to connect is daunting and then to craft into a “story” is a key challenge. I realised that understanding the underlying domain is crucial as this allows to ask prodding questions. This combined with each insight then allows for an iterative approach of piecing together the puzzle. Thanks for this well designed learning experience.