========================================================
Prosper is a peer to peer lending company that offers personal loans at low rates. These loans are unsecured, which means you do not have to put up any collateral (like a house or car) that could get taken away if you can’t make payments. Each loan is typically funded by multiple people all over the United States. In this way, Prosper is a marketplace connecting those who need a loan to those who have extra money to lend. Individuals can either invest in personal loans or request to borrow money. I was particularly interested in exploring this dataset as I had just finished reading the Big Short from Michael Lewis. He narrates in much detail of how greedy wallstreet bankers choose to chase the money at any cost. The loan orgination became overly complicated by adding numerous layers of abstractions aka securitization. Ultimately this led to total disregard and actually allowed the convenient manipulation of the lending process at it’s core.
In contrast, briefly reading up about the prosper.com P2P business model, I was keen to understand how prosper.com loan origination prospered. Interestingly prosper.com was in the midst of its Series C (third round) of funding around mid 2008. Just around that time Wallstreet was massively imploding leading up to the Lehman crisis. Note: Post “dabbling” in the prosper data it is clear that subprime loans vs P2P vary in orders of magnitude. Nevertheless the basic principle of risk vs returns applies here as well.

Useful links:

Some paramater definitions:

## 'data.frame':    113937 obs. of  81 variables:
##  $ ListingKey                         : Factor w/ 113066 levels "00003546482094282EF90E5",..: 7180..
##  $ ListingNumber                      : int  193129 1209647 81716 658116 909464 1074836 750899 768..
##  $ ListingCreationDate                : Factor w/ 113064 levels "2005-11-09 20:44:28.847000000",....
##  $ CreditGrade                        : Factor w/ 9 levels "","A","AA","B",..: 5 1 8 1 1 1 1 1 1 1..
##  $ Term                               : int  36 36 36 36 36 60 36 36 36 36 ...
##  $ LoanStatus                         : Factor w/ 12 levels "Cancelled","Chargedoff",..: 3 4 3 4 4..
##  $ ClosedDate                         : Factor w/ 2803 levels "","2005-11-25 00:00:00",..: 1138 1 ..
##  $ BorrowerAPR                        : num  0.165 0.12 0.283 0.125 0.246 ...
##  $ BorrowerRate                       : num  0.158 0.092 0.275 0.0974 0.2085 ...
##  $ LenderYield                        : num  0.138 0.082 0.24 0.0874 0.1985 ...
##  $ EstimatedEffectiveYield            : num  NA 0.0796 NA 0.0849 0.1832 ...
##  $ EstimatedLoss                      : num  NA 0.0249 NA 0.0249 0.0925 ...
##  $ EstimatedReturn                    : num  NA 0.0547 NA 0.06 0.0907 ...
##  $ ProsperRating..numeric.            : int  NA 6 NA 6 3 5 2 4 7 7 ...
##  $ ProsperRating..Alpha.              : Factor w/ 8 levels "","A","AA","B",..: 1 2 1 2 6 4 7 5 3 3..
##  $ ProsperScore                       : num  NA 7 NA 9 4 10 2 4 9 11 ...
##  $ ListingCategory..numeric.          : int  0 2 0 16 2 1 1 2 7 7 ...
##  $ BorrowerState                      : Factor w/ 52 levels "","AK","AL","AR",..: 7 7 12 12 25 34 ..
##  $ Occupation                         : Factor w/ 68 levels "","Accountant/CPA",..: 37 43 37 52 21..
##  $ EmploymentStatus                   : Factor w/ 9 levels "","Employed",..: 9 2 4 2 2 2 2 2 2 2 ...
##  $ EmploymentStatusDuration           : int  2 44 NA 113 44 82 172 103 269 269 ...
##  $ IsBorrowerHomeowner                : Factor w/ 2 levels "False","True": 2 1 1 2 2 2 1 1 2 2 ...
##  $ CurrentlyInGroup                   : Factor w/ 2 levels "False","True": 2 1 2 1 1 1 1 1 1 1 ...
##  $ GroupKey                           : Factor w/ 707 levels "","00343376901312423168731",..: 1 1 ..
##  $ DateCreditPulled                   : Factor w/ 112992 levels "2005-11-09 00:30:04.487000000",....
##  $ CreditScoreRangeLower              : int  640 680 480 800 680 740 680 700 820 820 ...
##  $ CreditScoreRangeUpper              : int  659 699 499 819 699 759 699 719 839 839 ...
##  $ FirstRecordedCreditLine            : Factor w/ 11586 levels "","1947-08-24 00:00:00",..: 8639 6..
##  $ CurrentCreditLines                 : int  5 14 NA 5 19 21 10 6 17 17 ...
##  $ OpenCreditLines                    : int  4 14 NA 5 19 17 7 6 16 16 ...
##  $ TotalCreditLinespast7years         : int  12 29 3 29 49 49 20 10 32 32 ...
##  $ OpenRevolvingAccounts              : int  1 13 0 7 6 13 6 5 12 12 ...
##  $ OpenRevolvingMonthlyPayment        : num  24 389 0 115 220 1410 214 101 219 219 ...
##  $ InquiriesLast6Months               : int  3 3 0 0 1 0 0 3 1 1 ...
##  $ TotalInquiries                     : num  3 5 1 1 9 2 0 16 6 6 ...
##  $ CurrentDelinquencies               : int  2 0 1 4 0 0 0 0 0 0 ...
##  $ AmountDelinquent                   : num  472 0 NA 10056 0 ...
##  $ DelinquenciesLast7Years            : int  4 0 0 14 0 0 0 0 0 0 ...
##  $ PublicRecordsLast10Years           : int  0 1 0 0 0 0 0 1 0 0 ...
##  $ PublicRecordsLast12Months          : int  0 0 NA 0 0 0 0 0 0 0 ...
##  $ RevolvingCreditBalance             : num  0 3989 NA 1444 6193 ...
##  $ BankcardUtilization                : num  0 0.21 NA 0.04 0.81 0.39 0.72 0.13 0.11 0.11 ...
##  $ AvailableBankcardCredit            : num  1500 10266 NA 30754 695 ...
##  $ TotalTrades                        : num  11 29 NA 26 39 47 16 10 29 29 ...
##  $ TradesNeverDelinquent..percentage. : num  0.81 1 NA 0.76 0.95 1 0.68 0.8 1 1 ...
##  $ TradesOpenedLast6Months            : num  0 2 NA 0 2 0 0 0 1 1 ...
##  $ DebtToIncomeRatio                  : num  0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.25 0.25 ...
##  $ IncomeRange                        : Factor w/ 8 levels "$0","$1-24,999",..: 4 5 7 4 3 3 4 4 4 ..
##  $ IncomeVerifiable                   : Factor w/ 2 levels "False","True": 2 2 2 2 2 2 2 2 2 2 ...
##  $ StatedMonthlyIncome                : num  3083 6125 2083 2875 9583 ...
##  $ LoanKey                            : Factor w/ 113066 levels "00003683605746079487FF7",..: 1003..
##  $ TotalProsperLoans                  : int  NA NA NA NA 1 NA NA NA NA NA ...
##  $ TotalProsperPaymentsBilled         : int  NA NA NA NA 11 NA NA NA NA NA ...
##  $ OnTimeProsperPayments              : int  NA NA NA NA 11 NA NA NA NA NA ...
##  $ ProsperPaymentsLessThanOneMonthLate: int  NA NA NA NA 0 NA NA NA NA NA ...
##  $ ProsperPaymentsOneMonthPlusLate    : int  NA NA NA NA 0 NA NA NA NA NA ...
##  $ ProsperPrincipalBorrowed           : num  NA NA NA NA 11000 NA NA NA NA NA ...
##  $ ProsperPrincipalOutstanding        : num  NA NA NA NA 9948 ...
##  $ ScorexChangeAtTimeOfListing        : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ LoanCurrentDaysDelinquent          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ LoanFirstDefaultedCycleNumber      : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ LoanMonthsSinceOrigination         : int  78 0 86 16 6 3 11 10 3 3 ...
##  $ LoanNumber                         : int  19141 134815 6466 77296 102670 123257 88353 90051 121..
##  $ LoanOriginalAmount                 : int  9425 10000 3001 10000 15000 15000 3000 10000 10000 10..
##  $ LoanOriginationDate                : Factor w/ 1873 levels "2005-11-15 00:00:00",..: 426 1866 2..
##  $ LoanOriginationQuarter             : Factor w/ 33 levels "Q1 2006","Q1 2007",..: 18 8 2 32 24 3..
##  $ MemberKey                          : Factor w/ 90831 levels "00003397697413387CAF966",..: 11071..
##  $ MonthlyLoanPayment                 : num  330 319 123 321 564 ...
##  $ LP_CustomerPayments                : num  11396 0 4187 5143 2820 ...
##  $ LP_CustomerPrincipalPayments       : num  9425 0 3001 4091 1563 ...
##  $ LP_InterestandFees                 : num  1971 0 1186 1052 1257 ...
##  $ LP_ServiceFees                     : num  -133.2 0 -24.2 -108 -60.3 ...
##  $ LP_CollectionFees                  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LP_GrossPrincipalLoss              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LP_NetPrincipalLoss                : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LP_NonPrincipalRecoverypayments    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ PercentFunded                      : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Recommendations                    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ InvestmentFromFriendsCount         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ InvestmentFromFriendsAmount        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Investors                          : int  258 1 41 158 20 1 1 1 1 1 ...

Loan orignation distribution

…loan amounts spike around the 5000 ranges and are interestingly normally distributed in the mid ranges. Execption are 4000$ loans

Loan orignation distribution by rating

… decent normal distribution of loans by rating, suggesting maturity in lending confidence over time. Not so surprising is that the median value of loans decreases as risk increases

Loan orignation distribution by rating over the years

… trend over the year shows pick up in lower rated loans. However for the higher risk loans from 2011 onwards the box plots suggest decreasing variability … IQR = 0 in some cases! Especially 2013 has some “strange”" box plots for E and HR

…digging into 2013 makes it clear that for risk rating E and HR there is very little variability. It would be very interesting to understand these loans in more detail (to who are these being made to? what is the investor profile? etc)


Review of loan details (ListingCat, Recommendations…)
## [1] "... Loans by ListingCategory"
## 
##       Debt         NA      Other     HomImp        Biz       Auto 
##      58242      16824      10470       7422       7172       2565 
##   Personal     HomExp        Med        Tax LargePurch    Wedding 
##       2395       1996       1522        885        876        771 
##   Vacation    Student      Mbike Engagement       Baby  Costmetic 
##        768        750        304        217        199         91 
##       Boat GreenLoans         RV 
##         85         59         52
## [1] "... Loans by Recommendations"
## 
##      0      1      2      3      4      5      6      7      8      9 
## 109416   3507    567    108     26     14      4      5      3      6 
##     14     16     18     19     21     24     39 
##      1      2      2      1      1      1      1
## [1] "... Loans by InvestmentFromFriendsCount"
## 
##      0      1      2      3      4      5      6      7      8      9 
## 111539   1830    215     40     15      8      4      2      3      5 
##     13     15     20     33 
##      1      1      1      1
## [1] "... Loans by EstimatedEffectiveYield"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  -0.183   0.116   0.162   0.169   0.224   0.320   28812
## [1] "... Loans by DebtToIncomeRatio"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.140   0.220   0.276   0.320  10.010    8482
## [1] "... Loans by StatedMonthlyIncome"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    3208    4667    5613    6833 1750000

… by reviewing some of the relevant loan characteristics (subset to those where a rating is available) this is what I observe


Financial returns

… Margins for high risk loans are significantly higher and ROI drops significantly (Returns/Yield). Presumably premium on risk is key factor for this distortion

…Lender yield has outliers at 30% and mainly in 2012 and another outlier at 35%. Estimated return is a pretty well normallly distributed. ROI i.e. returns compared to forecasted yield has significantly narrower spread

… as expected estimated loss is aligned with risk rating. This is not surprising given Prosper calculates this based on historical performance of Prosper loans with similar characteristics

Univariate Analysis

What is the structure of your dataset?

There are 81 variables and nearly 114000 observations. Data has a pre-July 2009 and post July 2009 break. For example risk rating for pre-July is captured in CreditGrade whereas Alpha has risk ratings post July 2009. Similarly ProsperScore is available only post JUly 2009 and same applies to EstimatedReturns etc. Some variables are categorical and hence had to factor them to be able to assess. E.g. ListingCategory. I remapped this one from numerical to a proper label.

What is/are the main feature(s) of interest in your dataset?

I was interested in nature of the loan orginations/trends, credit rating and ROI (yield). Specifically this means getting an understanding of * Loan size spread (loan amount spread i.e. smaller loans vs larger loans) * Is this spread different by risk category? * Prosper Ratings allow investors to consider a loan’s level of risk * Yields (LenderYield and EstimatedLoss)

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Propser rating is what investors use to assess credit grade (AA-lowRisk till HR-highRisk). To understand how this compares to the credit score (I believe this would be FICO) I created mean of CreditScoreRangeLower and CreditScoreRangerUpper. I also ordered the prosper rating so it is ordered against their rating order

Did you create any new variables from existing variables in the dataset?

I created few new variables

  • “Year” variable to group loan orginations by years.
  • ConcatenatedRating variable combining CreditRating (for pre 2009) and Alpha (for post 2009) risk ratings
  • I used the effective yield to total yield as % to assess returns by risk category (ProsperRating)
  • I created a mean score given credit scores are given in lower and upper bound
  • I also grouped loan status to differentiate b/w good and bad loans: defaulted, charged off and delayed payments.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

Loan amounts spike around the 5000 ranges and are interestingly normally distributed in between. Zooming in shows that loan rangers are per 1000 suggesting this could be a pratical minimal loan quantum slab. The other anomaly I noticed is that loan amount have near zero variability for high risk ratings esp in 2013 and 2014.


Bivariate Plots Section

## Warning in ggcorr(data = subset(pl, !is.na(ProsperScore)), hjust
## = 0.75, : data in column(s) 'ListingKey', 'ListingCreationDate',
## 'CreditGrade', 'LoanStatus', 'ClosedDate', 'ProsperRating..Alpha.',
## 'ProsperScore', 'BorrowerState', 'Occupation', 'EmploymentStatus',
## 'IsBorrowerHomeowner', 'CurrentlyInGroup', 'GroupKey', 'DateCreditPulled',
## 'FirstRecordedCreditLine', 'IncomeRange', 'IncomeVerifiable',
## 'LoanKey', 'LoanOriginationDate', 'LoanOriginationQuarter', 'MemberKey',
## 'LoanOriginationDate.Y', 'ConcatenatedRating', 'ListingCategory',
## 'LoanStatus.health' are not numeric and were ignored

… Overall correlation seems sparse. However there are certain variables where correlation is visible: Investors, Term, Fees, Yield, BorrowerRate, Rating, CreditLimit, Score, LoanOriginalAmount, CreditScore

… suggests that key correlation factors of interest would be loan amount, effective yield, credit score (risk) and perceived risk (investors)


Credit grade - Credit score vs Prosper rating

…as I would have expected the prosper rating is aligned to the credit scores. Possibly bit on the conservative side for the lower risk categories. Trend over the years shows that a narrower credit score range is emerging. Further the risk categories (ratings) are clearly being adjusted
- relaxation for less riskier ones (e.g. average credit score for AA shifting from 800 to 750 over the years)
- tightening for the more risker ones (average credit score for HR 500 to 600)


Salary vs Loan originations by risk rating

## [1] 0.2012595

While reading about the typical compliants made by borrowers one of the main struggles highlighted with Prosper was getting approval for loan. The information provided there further highlights that Prosper prefers borrower with a good salary.
Given loans are unsecured, I hypothesised that a key reason could be that salary is used as a direct measure in mitigating the risk


Comparing other parameters against risk

… Interest and (orgination) fees increase with higher risk categories but is lower for highest two categories (wondering why?)

In contrast service fees decrease for higher risk categories. This must be to due to lower loan amounts and payment time frames. Hence a median value of the service fee in $ amount is not meaningful. Better would be to compute service fee in relation to the loan amount.

Similary collection fees are higher on average for higher risk categories with some anamolies (e.g. cat E has lower collection fees on average than cat D).
Further analysis required to understand what these data points are telling
- Unsurprisingly investor size dwindles for higher risk (above C) categories.
- Equally unsurprising is the fact that those earning less are categorized as higher risk.


Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Prosper ratings are very well tuned to underlying risk (credit scores) and trend over the years shows that Prosper is managing the loans towards a tight range b/w average credit score of 600 to 800 .

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

  • With increasing salary, loan amounts seem to increase as well.
  • Strangely there is a dip in loan amounts in the lower salary ranges. This could be either due to Prosper limiting loans for this segment (less likely given the dip) or folks in that salary band requiring lower loan amounts compared to their salary (most likely the case).
  • There is clear outlier at salary = 0. By digging into the employment status it is clear that this in part due to loans being made to non working folks

What was the strongest relationship you found?

LenderYields are strongly related to the credit scores. Interestingly Propser risk averseness is being lowered (ableit not significantly) over the years. This could be also due to competition from other peer-to-peer lenders (e.g. LendingClub).

Multivariate Plots Section

LenderYield vs MeanScore vs LoanOriginalAmount

… Excluding the outliers in 2006 and 2007, lenderyield curve trends towards a near linear alignment to the credit score. Yields trend to a decent 5% for loans for the most safest loans i.e. credit score of more then 800. Coupled with larger loan amounts this is something I would consider investing in providing for good cashflow.


Loss and setting the borrower rate

…Estimated loss decreases with increasing income. Note that I have capped the salary at 10’000 as data is sparse above this range. Regression line flattens to ~6%.
Comparing the debt to income ratio to borrower rate highlights that higher leverage exposure increases borrower rate. However I was surprised that the regression line was not steeper - is near linear and pretty flat


Leverage

….majority of borrowers are salaried employees. Leverage for this group shows some very interesting patterns. The debt ratio is very varied for those in the lower salary bracket. The regression model “swings”" massively in this range and then as salaries ranges increase, a near constant debt to income ratio emerges. Interestingly amplititude increases for those categorized as “self-employed”

Multivariate Analysis

Talk about some of the relationships you observe in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

Initially I started to focus on yield taking an investor’s point of view. It turned out that salary (as stated by borrower) and leverage (debt to income) were better parameter to observe. Given credit score and prosper rating are closely aligned, using the risk rating as an additional variable provided a differentiated picture. This led me to then look at the employment status of the borrowers in more detail. Insight was that majority of prosper borrowers are salaried.

Were there any interesting or surprising interactions between features?

The debt ratio is very varied for those in the lower salary bracket. The regression model “swings”" massively in this range and as salary ranges increase, a near constant debt to income ratio emerges. Interestingly amplititude increases for those categorized as “self-employed”

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

Work in progress. An area of focus I would be interested is in forecasting loans going sour.


Final Plots and Summary

Plot One

Description One

LenderYield as a measure of expected returns clearly shows an increase with higher risk. The estimated return also shows a similar pattern though variablity is significantly different. Further average estimated return descreases as risk increases and actually drops for the HR risk category. Finally tying this together and using the Estimated return in relation to the interest rate shows a total different picture. For higher risk the ratio drops drastically which confirms that pushing for higher yield without understanding the underlying risk is not always a profitable endveaour.

Plot Two

Description Two

While reading the typical compliants made by borrowers one of the main struggles highlighted with Prosper was getting approval for loan. The information provided there further highlights that Prosper prefers borrower with a good salary and job. Given loans are unsecured, I hypothesised that a key reason could be that salary is used as a direct measure in mitigating the risk.

Above plots also show following:

  • With increasing salary and excluding the outliers, loan amounts increase as well. This correlates very well for salary ranges from 2k to 15k with cor = 0.6365651 ). This compared to an overall correlation factor across all salary ranges of 0.2012595.
  • The significant outliers outside of these “middle income” ranges highlight that
  • a dip in loan amounts in the lower salary ranges. This is to be attributed to the fact that in these ranges, non working folks are included
  • for higher salary ranges, reason for outliers is mainly due to sparse loan originations. Due to this I can imagine that a more adhoc pricing approach is followed by Prosper in these ranges and making assessments more prone to error.

Plot Three

Description Three

Focusing on the “stable” years of 2008 and 2013 onwards, I see that lender yield curve trends towards a near linear alignment to the credit score. Yields achieve decent 5% for loans with near risk free credit score of more then 800 even though the yields are significanlty lower compared to the more risker loans. However, given that larger loan amounts in this range which will deliver a higher (monthly) cashflow in absolute terms, this woudl be my sweet-spot where I would consider investing.

Other interesting observations from this chart

  • With the recovery starting from 2012 onwards, loan sizes are increasing (see the lighter hue along the average yield curve). Interestingly lender yields have come under pressure over the years for the lower risk grade loans. E.g. in the intial years the larger loans had a much higher yield. This suggests a leveling-off and adjustment of the risk permium in-line with risk as Prosper gets a better handle of pricing the loans.
  • Related to above point is another interesting observation. In the early years as shown for 2008 the yield spread appears to be priced by Prosper in verticals i.e. each risk rating had an equal yield spread. However over time the yields seem to have evolved (stretched) horizontally i.e. the lower risk rated loans (i.e. AA/A/B) have tighter yield ranges while allowing for a broader credit scores range. This “layering” and also allowing for a broader range of scores per risk rating in my view is a direct result of the expansion of the client base for Prosper as well as improved pricing (yield) model.
  • What also clearly sticks out is that smaller loan amounts are typical for riskier loans (smaller “bubbles” for higher risk listings). This suggests that risk is limitedby restricting loan size for loans of higher risk credit grade. Conversely and unsurprising is that lender yield is much higher given higher risk. However as I had observed above this is offset by higher fees and default risk.

Reflection

  1. The loan originations dropped drastically in 2009. But then picked up pretty quickly into 2013. 2014 data is only Q1 but loan levels are at 50% of 2013! This suggests that peer to peer lending has picked up. Will be interesting to see how this trend continues into 2015.
  2. As an investor, unsecured lending inherently poses risk of losing everything. This might also suggest why loan amounts are at micro levels. A suggestion would be that Prosper offers “collateral” as in a fixed insured investment return in case of borrower default similar to a credit default swap. This would motivate investors to lend money for more riskier loans.
  3. Analyzing the data points clearly highlights the opportunities of P2P lending. It seems best suited for the lower to middle income population. Borrowers funding requirements are for the more regular (mundane) financial needs in contrast to larger big ticket items which require banks to fund. On the other hand investors seem to be those interested in getting a regular income and not necessarily looking for appreciation.
  4. Quality of the data clearly shows that there is pre and post 2009 period. The pre-2009 is marked by Prosper establishing their business model and inherently outlier prone. Unfortunatley as their business model firms up, the 2009 crisis jolts it significantly. Starting from 2012 onwards I see a stabilization and trends starting to emerge esp from 2013 onwards.

On a personal note: I thoroughly enjoyed this project even though it was a quite a challenge. The sheer number of parameters to connect is daunting and then to craft into a “story” is a key challenge. I realised that understanding the underlying domain is crucial as this allows to ask prodding questions. This combined with each insight then allows for an iterative approach of piecing together the puzzle. Thanks for this well designed learning experience.