Foreword

  • Snippets and results.
  • Source: 'Credit Risk Modeling with R' from DataCamp fitted into Jupyter/IPython, using the IRkernel.


Introduction to Data Structure

Exploring the credit data

Explore the data to get an idea of the number, and percentage, of defaults.

Banks make a profit from loaning to homeowners. Whenever a bank extends a mortgage loan, it takes a risk: the risk of default. In the data, a '1' represents a default, and a '0' represents non-default. A banks wants to minimize its risk exposure, therefore, minimize its expected losses.

$$ Expected~Loss = Probability~of~Default * Exposure~at~Default * Loss~given~Default $$

First, import the data.

In [3]:
loan_data <- read.csv('loan_data.csv', header = TRUE, sep = ';')

# Or with
#loan_data2 <- read.table('loan_data.csv', header = TRUE, sep = ";")

Adjust the data.

In [4]:
loan_data$int_rate <- as.numeric(loan_data$int_rate) # variable is a factor
loan_data$annual_inc <- as.numeric(loan_data$annual_inc) # variable is a factor
loan_data$age <- as.integer(loan_data$age) # variable is a factor

Learn more about the variable structures and spot unexpected tendencies in the data.

In [5]:
# View the structure of loan_data
str(loan_data)
'data.frame':	29092 obs. of  8 variables:
 $ loan_status   : int  0 0 0 0 0 0 1 0 1 0 ...
 $ loan_amnt     : int  5000 2400 10000 5000 3000 12000 9000 3000 10000 1000 ...
 $ int_rate      : num  10.6 NA 13.5 NA NA ...
 $ grade         : Factor w/ 7 levels "A","B","C","D",..: 2 3 3 1 5 2 3 2 2 4 ...
 $ emp_length    : int  10 25 13 3 9 11 0 3 3 0 ...
 $ home_ownership: Factor w/ 4 levels "MORTGAGE","OTHER",..: 4 4 4 4 4 3 4 4 4 4 ...
 $ annual_inc    : num  24000 12252 49200 36000 48000 ...
 $ age           : int  33 31 24 39 24 28 22 22 28 22 ...

The CrossTable() function examines the relationship between loan_status (1 or 0) and certain factor variables.

Hypothesis: the proportion of defaults in the group of customers with grade G (worst credit rating score) should substantially be higher than the proportion of defaults in the grade A group (best credit rating score).

In [6]:
# Load the gmodels package
# Make sure the package is available for both RStudio and Jupyter
# install.packages('gmodels') in R
library(gmodels)

# Call CrossTable() on loan_status
CrossTable(loan_data$loan_status)
 
   Cell Contents
|-------------------------|
|                       N |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  29092 

 
          |         0 |         1 | 
          |-----------|-----------|
          |     25865 |      3227 | 
          |     0.889 |     0.111 | 
          |-----------|-----------|



 

Notes on function CrossTable():

  • x: A vector or a matrix. If y is specified, x must be a vector
  • y: A vector in a matrix or a dataframe
  • digits: Number of digits after the decimal point for cell proportions
  • max.width: In the case of a 1 x n table, the default will be to print the output horizontally. If the number of columns exceeds max.width, the table will be wrapped for each successive increment of max.width columns. If you want a single column vertical table, set max.width to 1 expected If TRUE, chisq will be set to TRUE and expected cell counts from the χ2 will be included
  • prop.r: If TRUE, row proportions will be included
  • prop.c: If TRUE, column proportions will be included
  • prop.t: If TRUE, table proportions will be included
  • prop.chisq: If TRUE, chi-square contribution of each cell will be included
  • chisq: If TRUE, the results of a chi-square test will be included
  • fisher: If TRUE, the results of a Fisher Exact test will be included
  • mcnemar: If TRUE, the results of a McNemar test will be included
  • resid: If TRUE, residual (Pearson) will be included
  • sresid: If TRUE, standardized residual will be included
  • asresid: If TRUE, adjusted standardized residual will be included missing.include If TRUE, then remove any unused factor levels
  • format: Either SAS (default) or SPSS, depending on the type of output desired.
  • dnn: the names to be given to the dimensions in the result (the dimnames names).
  • ...
In [7]:
# Call CrossTable() on grade and loan_status
CrossTable(loan_data$grade, loan_data$loan_status,
           prop.r = TRUE, # If TRUE, row proportions will be included
           prop.c = FALSE, # If TRUE, column proportions will be included
           prop.t = FALSE, # prop.t If TRUE, table proportions will be included
           prop.chisq = FALSE) # prop.chisq If TRUE, chi-square contribution of each cell will be included
 
   Cell Contents
|-------------------------|
|                       N |
|           N / Row Total |
|-------------------------|

 
Total Observations in Table:  29092 

 
                | loan_data$loan_status 
loan_data$grade |         0 |         1 | Row Total | 
----------------|-----------|-----------|-----------|
              A |      9084 |       565 |      9649 | 
                |     0.941 |     0.059 |     0.332 | 
----------------|-----------|-----------|-----------|
              B |      8344 |       985 |      9329 | 
                |     0.894 |     0.106 |     0.321 | 
----------------|-----------|-----------|-----------|
              C |      4904 |       844 |      5748 | 
                |     0.853 |     0.147 |     0.198 | 
----------------|-----------|-----------|-----------|
              D |      2651 |       580 |      3231 | 
                |     0.820 |     0.180 |     0.111 | 
----------------|-----------|-----------|-----------|
              E |       692 |       176 |       868 | 
                |     0.797 |     0.203 |     0.030 | 
----------------|-----------|-----------|-----------|
              F |       155 |        56 |       211 | 
                |     0.735 |     0.265 |     0.007 | 
----------------|-----------|-----------|-----------|
              G |        35 |        21 |        56 | 
                |     0.625 |     0.375 |     0.002 | 
----------------|-----------|-----------|-----------|
   Column Total |     25865 |      3227 |     29092 | 
----------------|-----------|-----------|-----------|

 

Hypothesis confirmed: the proportion of defaults increases from Grade A (0.059) to Grade G (0.375).

Histograms

In [9]:
# Create histogram of loan_amnt: hist_1
hist_1 <- hist(loan_data$loan_amnt)

# Print locations of the breaks in hist_1
hist_1$breaks
  1. 0
  2. 2000
  3. 4000
  4. 6000
  5. 8000
  6. 10000
  7. 12000
  8. 14000
  9. 16000
  10. 18000
  11. 20000
  12. 22000
  13. 24000
  14. 26000
  15. 28000
  16. 30000
  17. 32000
  18. 34000
  19. 36000
In [12]:
# Change number of breaks and add labels
# Knowing the location of the breaks is important because
# if they are poorly chosen, the histogram may be misleading
hist_2 <- hist(loan_data$loan_amnt, breaks = 200, xlab = "Loan amount", 
               main = "Histogram of the loan amount")

Note that there are some high peaks at round values: 5000, 10000, 15000, and so on because clients tend to borrow round numbers.

Ouliers

In the histogram above, there is a lot of blank space on the right-hand side of the plot. This is an indication of possible outliers.

If outliers are observed for several variables, it might be useful to look at bivariate plots. It's possible the outliers belong to the same observation.

A person with the huge annual wage of $6 million appeared to be 144 years old. This must be a mistake? Check it out.

In [13]:
# Plot the age variable
hist(loan_data$age, ylab = 'Frequency', main = 'Histogram of loan_data$age')
In [14]:
plot(loan_data$age, ylab = 'Age')
In [15]:
# Save the outlier's index to index_highage
index_highage <- which(loan_data$age > 122)

index_highage
19486
In [16]:
# Make bivariate scatterplot of age and annual income
plot(loan_data$age, loan_data$annual_inc, xlab = "Age", ylab = "Annual Income")
In [17]:
# Create data set new_data with outlier deleted
new_data <- loan_data[-index_highage, ]

Missing data and coarse classification

How to deal with problematic data:

  1. delete the data
  2. replace the data
  3. keep the data

1. Deleting missing data

Some observations are missing interest rates. Identify how many rows are missing and delete them.

In [18]:
# Look at summary of loan_data (look for NA)
summary(loan_data$int_rate)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   5.42    7.90   10.99   11.00   13.47   23.22    2776 
In [19]:
# Get indices of missing interest rates: na_index
na_index <- which(is.na(loan_data$int_rate))

# Remove observations with missing interest rates: loan_data_delrow_na
loan_data_delrow_na <- loan_data[-na_index, ]

summary(loan_data_delrow_na$int_rate)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   5.42    7.90   10.99   11.00   13.47   23.22 
In [20]:
# Make a copy of loan_data
loan_data_delcol_na <- loan_data

# Delete the interest rate column from loan_data_delcol_na
loan_data_delcol_na$int_rate <- NULL

summary(loan_data_delcol_na)
  loan_status       loan_amnt     grade      emp_length      home_ownership 
 Min.   :0.0000   Min.   :  500   A:9649   Min.   : 0.000   MORTGAGE:12002  
 1st Qu.:0.0000   1st Qu.: 5000   B:9329   1st Qu.: 2.000   OTHER   :   97  
 Median :0.0000   Median : 8000   C:5748   Median : 4.000   OWN     : 2301  
 Mean   :0.1109   Mean   : 9594   D:3231   Mean   : 6.145   RENT    :14692  
 3rd Qu.:0.0000   3rd Qu.:12250   E: 868   3rd Qu.: 8.000                   
 Max.   :1.0000   Max.   :35000   F: 211   Max.   :62.000                   
                                  G:  56   NA's   :809                      
   annual_inc           age       
 Min.   :   4000   Min.   : 20.0  
 1st Qu.:  40000   1st Qu.: 23.0  
 Median :  56424   Median : 26.0  
 Mean   :  67169   Mean   : 27.7  
 3rd Qu.:  80000   3rd Qu.: 30.0  
 Max.   :6000000   Max.   :144.0  
                                  
In [21]:
summary(loan_data)
  loan_status       loan_amnt        int_rate     grade      emp_length    
 Min.   :0.0000   Min.   :  500   Min.   : 5.42   A:9649   Min.   : 0.000  
 1st Qu.:0.0000   1st Qu.: 5000   1st Qu.: 7.90   B:9329   1st Qu.: 2.000  
 Median :0.0000   Median : 8000   Median :10.99   C:5748   Median : 4.000  
 Mean   :0.1109   Mean   : 9594   Mean   :11.00   D:3231   Mean   : 6.145  
 3rd Qu.:0.0000   3rd Qu.:12250   3rd Qu.:13.47   E: 868   3rd Qu.: 8.000  
 Max.   :1.0000   Max.   :35000   Max.   :23.22   F: 211   Max.   :62.000  
                                  NA's   :2776    G:  56   NA's   :809     
  home_ownership    annual_inc           age       
 MORTGAGE:12002   Min.   :   4000   Min.   : 20.0  
 OTHER   :   97   1st Qu.:  40000   1st Qu.: 23.0  
 OWN     : 2301   Median :  56424   Median : 26.0  
 RENT    :14692   Mean   :  67169   Mean   : 27.7  
                  3rd Qu.:  80000   3rd Qu.: 30.0  
                  Max.   :6000000   Max.   :144.0  
                                                   

2. Replacing missing data

  • With continuous data: median imputation.
  • With categorical data: most-frequent (category) imputation.
In [22]:
# Compute the median of int_rate
median_ir <- median(loan_data$int_rate, na.rm = TRUE)

# Make copy of loan_data
loan_data_replace <- loan_data

# Replace missing interest rates (na_index from above) with median
loan_data_replace$int_rate[na_index] <- median_ir

# Check if the NAs are gone
summary(loan_data_replace$int_rate)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   5.42    8.49   10.99   11.00   13.11   23.22 

3. Keeping the missing data

In some situations, the fact that an input is missing is an important information. NAs can be kept in a separate "missing" category using coarse classification.

Coarse classification requires to bin responses into groups that contain ranges of values. With this binning technique, place all NAs in their own bin.

In [23]:
# Make the necessary replacements in the coarse classification example below
# create the variable and fill it with NA
loan_data$ir_cat <- rep(NA, length(loan_data$int_rate))

loan_data$ir_cat[which(loan_data$int_rate <= 8)] <- "0-8"
loan_data$ir_cat[which(loan_data$int_rate > 8 & loan_data$int_rate <= 11)] <- "8-11"
loan_data$ir_cat[which(loan_data$int_rate > 11 & loan_data$int_rate <= 13.5)] <- "11-13.5"
loan_data$ir_cat[which(loan_data$int_rate > 13.5)] <- "13.5+"
loan_data$ir_cat[which(is.na(loan_data$int_rate))] <- "Missing"

loan_data$ir_cat <- as.factor(loan_data$ir_cat)

# Look at your new variable using plot()
plot(loan_data$ir_cat)

Data splitting and confusion matrices

First, start splitting the data into one training set (say 2/3) and one testing test (say 1/3). Second, use cross-validation splits (different train-test splits).

In [24]:
# Set seed of 567
set.seed(567)

# Store row numbers for training set: index_train
index_train <- sample(1:nrow(loan_data), 2 / 3 * nrow(loan_data))

# Create training set: training_set
training_set <- loan_data[index_train, ]

# Create test set: test_set
test_set <- loan_data[-index_train, ]

str(training_set)
'data.frame':	19394 obs. of  9 variables:
 $ loan_status   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ loan_amnt     : int  25000 16000 8500 3500 3600 6600 3000 7500 7800 22750 ...
 $ int_rate      : num  11.36 14.11 7.51 8.88 7.49 ...
 $ grade         : Factor w/ 7 levels "A","B","C","D",..: 2 4 1 2 1 1 1 2 1 1 ...
 $ emp_length    : int  11 13 0 0 5 25 10 5 3 3 ...
 $ home_ownership: Factor w/ 4 levels "MORTGAGE","OTHER",..: 4 4 1 4 1 3 4 3 4 1 ...
 $ annual_inc    : num  91000 45000 110000 55000 40000 ...
 $ age           : int  34 25 29 24 59 35 24 24 24 25 ...
 $ ir_cat        : Factor w/ 5 levels "0-8","11-13.5",..: 2 3 1 4 1 1 1 4 1 1 ...
In [25]:
str(test_set)
'data.frame':	9698 obs. of  9 variables:
 $ loan_status   : int  0 0 0 0 0 0 0 0 1 0 ...
 $ loan_amnt     : int  5000 2400 10000 5000 3000 3000 1000 3600 21000 10000 ...
 $ int_rate      : num  10.6 NA 13.5 NA NA ...
 $ grade         : Factor w/ 7 levels "A","B","C","D",..: 2 3 3 1 5 2 4 1 2 2 ...
 $ emp_length    : int  10 25 13 3 9 3 0 13 17 5 ...
 $ home_ownership: Factor w/ 4 levels "MORTGAGE","OTHER",..: 4 4 4 4 4 4 4 1 4 4 ...
 $ annual_inc    : num  24000 12252 49200 36000 48000 ...
 $ age           : int  33 31 24 39 24 22 22 27 29 22 ...
 $ ir_cat        : Factor w/ 5 levels "0-8","11-13.5",..: 4 5 2 5 5 4 3 1 2 2 ...

Creating a confusion matrix

Assume that you have run a model and stored the predicted outcomes in a vector called model_pred. In the next section, the focus is on logistic regression; a new topic.

See notes on machine learning for a complete coverage of these train-test techniques like decision or classification trees, random forests, knn, and an introduction to regressions.

See how the model performed with a confusion matrix.

confusion matrix

$$ Classification~Accuracy = \frac{(TP+TN)}{(TP+FP+TN+FN)} $$$$ Classification~Sensitivity = \frac{TP}{(TP+FN)} $$$$ Classification~Specificity = \frac{TN}{(FP+TN)} $$

R-code:

# Create confusion matrix
conf_matrix <- table(test_set$loan_status, model_pred)

# Compute classification accuracy
TN <- conf_matrix[1,1]
FP <- conf_matrix[1,2]
FN <- conf_matrix[2,1]
TP <- conf_matrix[2,2]
(TP+TN)/(TP+FP+TN+FN)

# Compute sensitivity
TP/(TP+FN)


Logistic Regression

Basic logistic regression or logit

It's a generalized linear model (GLM), as opposed to simple and multivariate linear regression (LM).

The former allows for different distributions of the data (such as the logistic distribution) while the later make the assumption the data are normally distributed.

In this GLM-Logit regression, the dependent variable is loan_status and age is the sole predictor.

In [227]:
# Build a glm model with variable ir_cat as a predictor
log_model_cat <- glm(loan_status ~ ir_cat, family = 'binomial', data = training_set)

# Print the parameter estimates 
log_model_cat
Call:  glm(formula = loan_status ~ ir_cat, family = "binomial", data = training_set)

Coefficients:
  (Intercept)  ir_cat11-13.5    ir_cat13.5+     ir_cat8-11  ir_catMissing  
      -2.8893         1.0146         1.3555         0.5859         0.7522  

Degrees of Freedom: 19393 Total (i.e. Null);  19389 Residual
Null Deviance:	    13490 
Residual Deviance: 13100 	AIC: 13110
In [39]:
# Look at the different categories in ir_cat using table()
table(loan_data$ir_cat)
    0-8 11-13.5   13.5+    8-11 Missing 
   7130    6954    6002    6230    2776 
In [31]:
# B0 and B1 per categories
log_model_cat$coefficients
log_model_cat$coefficients[1]
(Intercept)
-2.88926885469467
ir_cat11-13.5
1.01460330826764
ir_cat13.5+
1.35545006766776
ir_cat8-11
0.585892518088457
ir_catMissing
0.752239438142
(Intercept): -2.88926885469467

How do we interpret the parameter estimate for the interest rates that are between 8% and 11%?

In [33]:
B_0_8 <- as.numeric(log_model_cat$coefficients[1])
B_11_135 <- as.numeric(B_0_8 + log_model_cat$coefficients[2])
B_135 <- as.numeric(B_0_8 + log_model_cat$coefficients[3])
B_8_11 <- as.numeric(B_0_8 + log_model_cat$coefficients[4])
B_m <- as.numeric(B_0_8 + log_model_cat$coefficients[5])

$\beta$ < 0, default decreases as age increases (within the category):

In [34]:
B_0_8
B_8_11
-2.88926885469467
-2.30337633660622

The odds of defaulting decrease more with age (steeper negative curve) in 0-8% category than in the 8-11% category.

In [36]:
odds_0_8 <- exp(-2.88926885469467) # exp(B_0_8) 
odds_8_11 <- exp(-2.30337633660622) # exp(B_8_11)

Probabilities of default: $e^{\beta} < 1$.

In [37]:
odds_0_8
odds_8_11
0.055616861756827
0.0999209069338507

The odds of defaulting decrease as age increases (within the category).

As age goes up by 1, the odds increase by...

Compared to the reference category 0-8%, the odds of defaulting for category 8-11% change by a multiple of:

In [38]:
odds_8_11/odds_0_8
1.79659376271056

Multiple variables in a logistic regression model

The interpretation of a single parameter still holds when including several variables in a model.

When including several variables and asking for the interpretation when a certain variable changes, it is assumed that the other variables remain constant: ceteris paribus, literally meaning "keeping all others the same".

The significance of a parameter is often refered to as a p-value; denoted as $Pr(>|t|)$.

In a GLM,

  • mild significance is denoted by a '.',
  • very strong significance is denoted by '***'.

When a parameter is not significant, you cannot assure the parameter is significantly different from 0.

In [42]:
# Build the logistic regression model
log_model_multi <- glm(loan_status ~ 
                       age + ir_cat + grade + loan_amnt + annual_inc,
                       family = 'binomial',
                       data = training_set)

# Obtain significance levels using summary()
summary(log_model_multi)
Warning message:
"glm.fit: fitted probabilities numerically 0 or 1 occurred"
Call:
glm(formula = loan_status ~ age + ir_cat + grade + loan_amnt + 
    annual_inc, family = "binomial", data = training_set)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.9727  -0.5361  -0.4380  -0.3344   3.3508  

Coefficients:
                Estimate Std. Error z value Pr(>|z|)    
(Intercept)   -2.395e+00  1.279e-01 -18.720  < 2e-16 ***
age           -5.375e-03  3.875e-03  -1.387 0.165423    
ir_cat11-13.5  6.058e-01  1.323e-01   4.578 4.70e-06 ***
ir_cat13.5+    5.256e-01  1.472e-01   3.570 0.000357 ***
ir_cat8-11     3.798e-01  1.177e-01   3.228 0.001246 ** 
ir_catMissing  3.771e-01  1.296e-01   2.910 0.003617 ** 
gradeB         2.697e-01  1.061e-01   2.541 0.011040 *  
gradeC         5.430e-01  1.213e-01   4.476 7.62e-06 ***
gradeD         9.425e-01  1.376e-01   6.847 7.54e-12 ***
gradeE         1.067e+00  1.669e-01   6.397 1.59e-10 ***
gradeF         1.568e+00  2.275e-01   6.891 5.53e-12 ***
gradeG         1.648e+00  3.713e-01   4.439 9.05e-06 ***
loan_amnt     -1.885e-06  4.128e-06  -0.457 0.647917    
annual_inc    -5.344e-06  7.404e-07  -7.218 5.29e-13 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 13489  on 19393  degrees of freedom
Residual deviance: 12933  on 19380  degrees of freedom
AIC: 12961

Number of Fisher Scoring iterations: 5

annual_inc is statistically significant where loan_amount is not.

Predicting the probability of default

Predict the probability for all the test set cases at once using the predict() function.

In [43]:
log_model_small <- glm(formula = loan_status ~ age + ir_cat,
                       family = "binomial", data = training_set)

summary(log_model_small)
Call:
glm(formula = loan_status ~ age + ir_cat, family = "binomial", 
    data = training_set)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.6458  -0.5403  -0.4455  -0.3327   2.5155  

Coefficients:
               Estimate Std. Error z value Pr(>|z|)    
(Intercept)   -2.631128   0.123920 -21.233  < 2e-16 ***
age           -0.009414   0.003875  -2.430   0.0151 *  
ir_cat11-13.5  1.016129   0.077845  13.053  < 2e-16 ***
ir_cat13.5+    1.357660   0.076945  17.645  < 2e-16 ***
ir_cat8-11     0.587283   0.084217   6.973 3.09e-12 ***
ir_catMissing  0.753561   0.099771   7.553 4.26e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 13489  on 19393  degrees of freedom
Residual deviance: 13090  on 19388  degrees of freedom
AIC: 13102

Number of Fisher Scoring iterations: 5
In [47]:
# Build the logistic regression model
predictions_all_small <- predict(log_model_small, newdata = test_set, type = "response")

# Look at the range of the object "predictions_all_small"
format(range(predictions_all_small), trim = FALSE, digits = 3, nsmall = 2)
  1. '0.0349'
  2. '0.1868'

Looking at the range of predicted probabilities, a small range means that predictions for the test set cases do not lie far apart, and therefore the model might not be very good at discriminating good from bad customers.

The probability of default predictive model:

$$ P(loan\_status~=~1~|~age,~ir\_cat) = \frac{1}{1+e^{-(\hat{\beta_0}+\hat{\beta_1}*age+\hat{\beta_2}*ir\_cat)}} $$

In the 0-8% category, the coefficient is the intercept without any adjustements (easy!)):

$$ P(loan\_status~=~1~|~age=33,~ir\_cat={[0-8]}) = \frac{1}{1+e^{-(-2.631128+(-0.009413)*33+0*1)}} $$

Probability of default:

In [49]:
1 / 
    (1 + exp(-1 * 
         (as.numeric(log_model_small$coefficient[1]) +(as.numeric(log_model_small$coefficient[2]) * 33 + 0))))
0.0501252024509596

Making more discriminative models

Small predicted default probabilities are to be expected with low default rates.

Building bigger models (which basically means: including more predictors) can expand the range of predictions. First, transform the training_set and test-set: bin the emp_length variable, remove the emp_length and int_rate variables.

In [228]:
# Bin the `emp_length` variable
training_set_modified <- training_set
test_set_modified <- test_set

training_set_modified$emp_cat <- rep(NA, length(training_set$emp_length))
test_set_modified$emp_cat <- rep(NA, length(test_set$emp_length))

str(training_set_modified)
str(test_set_modified)
'data.frame':	19394 obs. of  10 variables:
 $ loan_status   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ loan_amnt     : int  25000 16000 8500 3500 3600 6600 3000 7500 7800 22750 ...
 $ int_rate      : num  11.36 14.11 7.51 8.88 7.49 ...
 $ grade         : Factor w/ 7 levels "A","B","C","D",..: 2 4 1 2 1 1 1 2 1 1 ...
 $ emp_length    : int  11 13 0 0 5 25 10 5 3 3 ...
 $ home_ownership: Factor w/ 4 levels "MORTGAGE","OTHER",..: 4 4 1 4 1 3 4 3 4 1 ...
 $ annual_inc    : num  91000 45000 110000 55000 40000 ...
 $ age           : int  34 25 29 24 59 35 24 24 24 25 ...
 $ ir_cat        : Factor w/ 5 levels "0-8","11-13.5",..: 2 3 1 4 1 1 1 4 1 1 ...
 $ emp_cat       : logi  NA NA NA NA NA NA ...
'data.frame':	9698 obs. of  10 variables:
 $ loan_status   : int  0 0 0 0 0 0 0 0 1 0 ...
 $ loan_amnt     : int  5000 2400 10000 5000 3000 3000 1000 3600 21000 10000 ...
 $ int_rate      : num  10.6 NA 13.5 NA NA ...
 $ grade         : Factor w/ 7 levels "A","B","C","D",..: 2 3 3 1 5 2 4 1 2 2 ...
 $ emp_length    : int  10 25 13 3 9 3 0 13 17 5 ...
 $ home_ownership: Factor w/ 4 levels "MORTGAGE","OTHER",..: 4 4 4 4 4 4 4 1 4 4 ...
 $ annual_inc    : num  24000 12252 49200 36000 48000 ...
 $ age           : int  33 31 24 39 24 22 22 27 29 22 ...
 $ ir_cat        : Factor w/ 5 levels "0-8","11-13.5",..: 4 5 2 5 5 4 3 1 2 2 ...
 $ emp_cat       : logi  NA NA NA NA NA NA ...
In [229]:
training_set_modified$emp_cat[which(training_set_modified$emp_length <= 15)] <- "0-15"
training_set_modified$emp_cat[which(training_set_modified$emp_length > 15 & training_set_modified$emp_length <= 30)] <- "15-30"
training_set_modified$emp_cat[which(training_set_modified$emp_length > 30 & training_set_modified$emp_length <= 45)] <- "30-45"
training_set_modified$emp_cat[which(training_set_modified$emp_length > 45)] <- "45+"
training_set_modified$emp_cat[which(is.na(training_set_modified$emp_length))] <- "Missing"

training_set_modified$emp_cat <- as.factor(training_set_modified$emp_cat)

# Look at your new variable using plot()
plot(training_set_modified$emp_cat)
In [230]:
test_set_modified$emp_cat[which(test_set_modified$emp_length <= 15)] <- "0-15"
test_set_modified$emp_cat[which(test_set_modified$emp_length > 15 & test_set_modified$emp_length <= 30)] <- "15-30"
test_set_modified$emp_cat[which(test_set_modified$emp_length > 30 & test_set_modified$emp_length <= 45)] <- "30-45"
test_set_modified$emp_cat[which(test_set_modified$emp_length > 45)] <- "45+"
test_set_modified$emp_cat[which(is.na(test_set_modified$emp_length))] <- "Missing"

test_set_modified$emp_cat <- as.factor(test_set_modified$emp_cat)

# Look at your new variable using plot()
plot(test_set_modified$emp_cat)
In [232]:
training_set_modified$emp_length <- NULL
test_set_modified$emp_length <- NULL
training_set_modified$int_rate <- NULL
test_set_modified$int_rate <- NULL

str(training_set_modified)
str(test_set_modified)
'data.frame':	19394 obs. of  8 variables:
 $ loan_status   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ loan_amnt     : int  25000 16000 8500 3500 3600 6600 3000 7500 7800 22750 ...
 $ grade         : Factor w/ 7 levels "A","B","C","D",..: 2 4 1 2 1 1 1 2 1 1 ...
 $ home_ownership: Factor w/ 4 levels "MORTGAGE","OTHER",..: 4 4 1 4 1 3 4 3 4 1 ...
 $ annual_inc    : num  91000 45000 110000 55000 40000 ...
 $ age           : int  34 25 29 24 59 35 24 24 24 25 ...
 $ ir_cat        : Factor w/ 5 levels "0-8","11-13.5",..: 2 3 1 4 1 1 1 4 1 1 ...
 $ emp_cat       : Factor w/ 5 levels "0-15","15-30",..: 1 1 1 1 1 2 1 1 1 1 ...
'data.frame':	9698 obs. of  8 variables:
 $ loan_status   : int  0 0 0 0 0 0 0 0 1 0 ...
 $ loan_amnt     : int  5000 2400 10000 5000 3000 3000 1000 3600 21000 10000 ...
 $ grade         : Factor w/ 7 levels "A","B","C","D",..: 2 3 3 1 5 2 4 1 2 2 ...
 $ home_ownership: Factor w/ 4 levels "MORTGAGE","OTHER",..: 4 4 4 4 4 4 4 1 4 4 ...
 $ annual_inc    : num  24000 12252 49200 36000 48000 ...
 $ age           : int  33 31 24 39 24 22 22 27 29 22 ...
 $ ir_cat        : Factor w/ 5 levels "0-8","11-13.5",..: 4 5 2 5 5 4 3 1 2 2 ...
 $ emp_cat       : Factor w/ 5 levels "0-15","15-30",..: 1 2 1 1 1 1 1 1 2 1 ...
In [233]:
# Change the code below to construct a logistic regression model using all available predictors in the data set
#log_model_full <- glm(loan_status ~ 
#                      loan_status + loan_amnt + grade + home_ownership + annual_inc + age + emp_cat + ir_cat,
#                      family = "binomial",
#                      data = training_set2)

log_model_full <- glm(loan_status ~ .,
                      family = "binomial",
                      data = training_set_modified)

summary(log_model_full)
Warning message:
"glm.fit: fitted probabilities numerically 0 or 1 occurred"
Call:
glm(formula = loan_status ~ ., family = "binomial", data = training_set_modified)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.2441  -0.5319  -0.4373  -0.3277   3.3293  

Coefficients:
                      Estimate Std. Error z value Pr(>|z|)    
(Intercept)         -2.455e+00  1.344e-01 -18.270  < 2e-16 ***
loan_amnt           -1.359e-06  4.128e-06  -0.329 0.742037    
gradeB               2.692e-01  1.064e-01   2.529 0.011443 *  
gradeC               5.515e-01  1.216e-01   4.535 5.77e-06 ***
gradeD               9.598e-01  1.380e-01   6.955 3.54e-12 ***
gradeE               1.093e+00  1.673e-01   6.532 6.51e-11 ***
gradeF               1.587e+00  2.279e-01   6.964 3.31e-12 ***
gradeG               1.677e+00  3.712e-01   4.520 6.20e-06 ***
home_ownershipOTHER  1.057e-01  3.669e-01   0.288 0.773217    
home_ownershipOWN   -1.019e-01  9.425e-02  -1.081 0.279846    
home_ownershipRENT  -1.633e-02  5.264e-02  -0.310 0.756417    
annual_inc          -5.211e-06  7.686e-07  -6.780 1.20e-11 ***
age                 -5.182e-03  3.883e-03  -1.335 0.182010    
ir_cat11-13.5        6.311e-01  1.328e-01   4.754 2.00e-06 ***
ir_cat13.5+          5.333e-01  1.475e-01   3.615 0.000300 ***
ir_cat8-11           3.980e-01  1.181e-01   3.371 0.000748 ***
ir_catMissing        3.906e-01  1.297e-01   3.012 0.002591 ** 
emp_cat15-30         1.261e-01  8.579e-02   1.469 0.141768    
emp_cat30-45        -4.398e-02  2.666e-01  -0.165 0.868948    
emp_cat45+           1.042e+00  5.240e-01   1.989 0.046717 *  
emp_catMissing       7.640e-01  1.163e-01   6.569 5.06e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 13489  on 19393  degrees of freedom
Residual deviance: 12890  on 19373  degrees of freedom
AIC: 12932

Number of Fisher Scoring iterations: 5
In [447]:
# Make PD-predictions for all test set elements using the the full logistic regression model
predictions_all_full <- predict(log_model_full, newdata = test_set_modified, type = "response")

# Look at the predictions range in percentage
format(range(predictions_all_full)*100, trim = FALSE, digits = 6, nsmall = 2)
  1. ' 0.00052998'
  2. '53.88021929'

Evaluating the logistic regression model results

Specify a cut-off (acceptance threshold). A good cut-off makes the difference to obtain a good confusion matrix.

In [235]:
# Make a binary predictions-vector using a cut-off of 15%
pred_cutoff_15 <- ifelse(predictions_all_full > 0.15, 1, 0) # more tolerant

# and of 20%
pred_cutoff_20 <- ifelse(predictions_all_full > 0.20, 1, 0) # less tolerant

pred_cutoff <- cbind(pred_cutoff_15, pred_cutoff_20)
head(pred_cutoff, 10)
pred_cutoff_15pred_cutoff_20
100
210
310
400
511
800
1011
1200
1500
1700
In [236]:
# Make a binary predictions-vector using a cut-off of 15%
pred_cutoff_15 <- ifelse(predictions_all_full > 0.15, 1, 0)

# Construct a confusion matrix
table(test_set$loan_status, pred_cutoff_15)
   pred_cutoff_15
       0    1
  0 6807 1809
  1  688  394
In [237]:
# Check the length
str(test_set_modified)
str(predictions_all_full)
'data.frame':	9698 obs. of  8 variables:
 $ loan_status   : int  0 0 0 0 0 0 0 0 1 0 ...
 $ loan_amnt     : int  5000 2400 10000 5000 3000 3000 1000 3600 21000 10000 ...
 $ grade         : Factor w/ 7 levels "A","B","C","D",..: 2 3 3 1 5 2 4 1 2 2 ...
 $ home_ownership: Factor w/ 4 levels "MORTGAGE","OTHER",..: 4 4 4 4 4 4 4 1 4 4 ...
 $ annual_inc    : num  24000 12252 49200 36000 48000 ...
 $ age           : int  33 31 24 39 24 22 22 27 29 22 ...
 $ ir_cat        : Factor w/ 5 levels "0-8","11-13.5",..: 4 5 2 5 5 4 3 1 2 2 ...
 $ emp_cat       : Factor w/ 5 levels "0-15","15-30",..: 1 2 1 1 1 1 1 1 2 1 ...
 Named num [1:9698] 0.1084 0.1637 0.1566 0.0774 0.2031 ...
 - attr(*, "names")= chr [1:9698] "1" "2" "3" "4" ...

Construct a confusion matrix

In [238]:
pred_cutoff_15 <- ifelse(predictions_all_full > 0.15, 1, 0)
pred_cutoff_20 <- ifelse(predictions_all_full > 0.20, 1, 0)

conf_matrix_15 <- table(test_set_modified$loan_status, pred_cutoff_15)
conf_matrix_20 <- table(test_set_modified$loan_status, pred_cutoff_20)

$ Classification~Accuracy = \frac{(TP+TN)}{(TP+FP+TN+FN)} $

$ Classification~Sensitivity = \frac{TP}{(TP+FN)} $

$ Classification~Specificity = \frac{TN}{(FP+TN)} $

In [239]:
# Compute classification accuracy
TN <- conf_matrix_15[1,1]
FP <- conf_matrix_15[1,2]
FN <- conf_matrix_15[2,1]
TP <- conf_matrix_15[2,2]

# Compute accuracy
A_15 <- (TP + TN) / (TP + FP + TN + FN)
A_15

# Compute sensitivity
S_15 <- TP / (TP + FN)
S_15

# Compute specificity
Sp_15 <- TN / (FP + TN)
Sp_15
0.742524231800371
0.364140480591497
0.790041782729805
In [240]:
# Compute classification accuracy
TN <- conf_matrix_20[1,1]
FP <- conf_matrix_20[1,2]
FN <- conf_matrix_20[2,1]
TP <- conf_matrix_20[2,2]

# Compute accuracy
A_20 <- (TP + TN) / (TP + FP + TN + FN)
A_20

# Compute sensitivity
S_20 <- TP / (TP + FN)
S_20

# Compute specificity
Sp_20 <- TN / (FP + TN)
Sp_20
0.841719942256135
0.162661737523105
0.926996285979573

Moving from a cut-off of 15% to 20%: accuracy increases, sensitivity decreases and specificity increases.

Pointer: performing simulations with loops would enable the graphing of the results on a cut-off range (0 to 100%).

Alternative models

The model used so far: glm(loan_status ~ .,family = "binomial", data = training_set2)

By default, the attribute is set on family = "binomial"(link = logit), but there are other types of models such as family = "binomial"(link = probit) and family = "binomial"(link = cloglog).

They behave like the logit model, but their specifications are more complex.

In [241]:
# Fit the logit, probit and cloglog-link logistic regression models
log_model_logit <- glm(loan_status ~ age + emp_cat + ir_cat + loan_amnt,
                       family = binomial(link = logit), data = training_set_modified)

log_model_probit <- glm(loan_status ~ age + emp_cat + ir_cat + loan_amnt,
                       family = binomial(link = probit), data = training_set_modified)

log_model_cloglog <- glm(loan_status ~ age + emp_cat + ir_cat + loan_amnt,
                       family = binomial(link = cloglog), data = training_set_modified)
In [242]:
summary(log_model_logit)
Call:
glm(formula = loan_status ~ age + emp_cat + ir_cat + loan_amnt, 
    family = binomial(link = logit), data = training_set_modified)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.9297  -0.5390  -0.4440  -0.3266   2.5631  

Coefficients:
                 Estimate Std. Error z value Pr(>|z|)    
(Intercept)    -2.621e+00  1.270e-01 -20.633  < 2e-16 ***
age            -8.501e-03  3.875e-03  -2.194  0.02824 *  
emp_cat15-30    6.664e-02  8.465e-02   0.787  0.43115    
emp_cat30-45   -1.150e-01  2.654e-01  -0.433  0.66487    
emp_cat45+      8.973e-01  5.244e-01   1.711  0.08706 .  
emp_catMissing  7.962e-01  1.149e-01   6.931 4.17e-12 ***
ir_cat11-13.5   1.056e+00  7.828e-02  13.492  < 2e-16 ***
ir_cat13.5+     1.401e+00  7.758e-02  18.056  < 2e-16 ***
ir_cat8-11      6.137e-01  8.447e-02   7.265 3.73e-13 ***
ir_catMissing   7.823e-01  1.001e-01   7.818 5.37e-15 ***
loan_amnt      -1.034e-05  3.674e-06  -2.814  0.00489 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 13489  on 19393  degrees of freedom
Residual deviance: 13034  on 19383  degrees of freedom
AIC: 13056

Number of Fisher Scoring iterations: 5
In [243]:
summary(log_model_probit)
Call:
glm(formula = loan_status ~ age + emp_cat + ir_cat + loan_amnt, 
    family = binomial(link = probit), data = training_set_modified)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.9268  -0.5387  -0.4444  -0.3260   2.5870  

Coefficients:
                 Estimate Std. Error z value Pr(>|z|)    
(Intercept)    -1.481e+00  6.423e-02 -23.063  < 2e-16 ***
age            -4.372e-03  2.004e-03  -2.181   0.0291 *  
emp_cat15-30    3.748e-02  4.430e-02   0.846   0.3975    
emp_cat30-45   -5.438e-02  1.342e-01  -0.405   0.6854    
emp_cat45+      5.298e-01  3.031e-01   1.748   0.0805 .  
emp_catMissing  4.307e-01  6.484e-02   6.643 3.08e-11 ***
ir_cat11-13.5   5.297e-01  3.827e-02  13.842  < 2e-16 ***
ir_cat13.5+     7.195e-01  3.847e-02  18.702  < 2e-16 ***
ir_cat8-11      2.997e-01  4.075e-02   7.354 1.92e-13 ***
ir_catMissing   3.860e-01  4.967e-02   7.771 7.78e-15 ***
loan_amnt      -5.490e-06  1.928e-06  -2.848   0.0044 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 13489  on 19393  degrees of freedom
Residual deviance: 13034  on 19383  degrees of freedom
AIC: 13056

Number of Fisher Scoring iterations: 5
In [244]:
summary(log_model_cloglog)
Call:
glm(formula = loan_status ~ age + emp_cat + ir_cat + loan_amnt, 
    family = binomial(link = cloglog), data = training_set_modified)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.9500  -0.5388  -0.4440  -0.3269   2.5555  

Coefficients:
                 Estimate Std. Error z value Pr(>|z|)    
(Intercept)    -2.663e+00  1.203e-01 -22.144  < 2e-16 ***
age            -8.011e-03  3.637e-03  -2.203  0.02761 *  
emp_cat15-30    6.102e-02  7.910e-02   0.771  0.44046    
emp_cat30-45   -1.088e-01  2.513e-01  -0.433  0.66489    
emp_cat45+      7.575e-01  4.595e-01   1.649  0.09922 .  
emp_catMissing  7.251e-01  1.017e-01   7.128 1.02e-12 ***
ir_cat11-13.5   1.006e+00  7.503e-02  13.414  < 2e-16 ***
ir_cat13.5+     1.322e+00  7.386e-02  17.898  < 2e-16 ***
ir_cat8-11      5.894e-01  8.139e-02   7.241 4.45e-13 ***
ir_catMissing   7.490e-01  9.566e-02   7.830 4.88e-15 ***
loan_amnt      -9.560e-06  3.431e-06  -2.787  0.00533 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 13489  on 19393  degrees of freedom
Residual deviance: 13035  on 19383  degrees of freedom
AIC: 13057

Number of Fisher Scoring iterations: 6
In [443]:
# Make predictions for all models using the test set
predictions_logit <- predict(log_model_logit, 
                             newdata = test_set_modified, 
                             type = "response")
predictions_probit <- predict(log_model_probit, 
                              newdata = test_set_modified, 
                              type = "response")
predictions_cloglog <- predict(log_model_cloglog, 
                               newdata = test_set_modified, 
                               type = "response")
In [246]:
# Use a cut-off of 14% to make binary predictions-vectors
cutoff <- 0.14

true_value <- test_set_modified$loan_status
class_pred_logit <- ifelse(predictions_logit > cutoff, 1, 0)
class_pred_probit <- ifelse(predictions_probit > cutoff, 1, 0)
class_pred_cloglog <- ifelse(predictions_cloglog > cutoff, 1, 0)

class_pred <- cbind(true_value, class_pred_logit, class_pred_probit, class_pred_cloglog)
head(class_pred, 20)
true_valueclass_pred_logitclass_pred_probitclass_pred_cloglog
10000
20000
30000
40000
50000
80000
100111
120000
151000
170000
260111
280000
300000
330111
340111
380000
390000
411000
420000
441111
In [247]:
str(true_value)
str(class_pred_logit)
 int [1:9698] 0 0 0 0 0 0 0 0 1 0 ...
 Named num [1:9698] 0 0 0 0 0 0 1 0 0 0 ...
 - attr(*, "names")= chr [1:9698] "1" "2" "3" "4" ...
In [248]:
# Make a confusion matrix for the three models
tab_class_logit <- table(true_value, class_pred_logit)
tab_class_probit <- table(true_value, class_pred_probit) 
tab_class_cloglog <- table(true_value, class_pred_cloglog)

tab_class_logit
tab_class_probit
tab_class_cloglog
          class_pred_logit
true_value    0    1
         0 6548 2068
         1  650  432
          class_pred_probit
true_value    0    1
         0 6563 2053
         1  650  432
          class_pred_cloglog
true_value    0    1
         0 6552 2064
         1  652  430
In [249]:
# Compute the classification accuracy for all three models
acc_logit <- sum(diag(tab_class_logit)) / nrow(test_set_modified)
acc_probit <- sum(diag(tab_class_probit)) / nrow(test_set_modified)
acc_cloglog <- sum(diag(tab_class_cloglog)) / nrow(test_set_modified)

acc_logit * 100
acc_probit * 100
acc_cloglog * 100
71.973602804702
72.1282738709012
71.9942256135286

The three models yield the same results (given they all are predictive models and there is a certain margin of error involved).



Decision Trees

Computing the gain for a tree

The Gini-measure is used to create the perfect split for a tree.

The data set contains 500 cases, 89 of these cases are defaults. This led to a Gini of 0.292632 in the root node.

The Gini of a:

$$ certain~node = 2 * proportion~of~defaults * proportion~of~nondefaults $$
gini_root <- 2 * (89 / 500) * (411 / 500)

Use these Gini measures to calculate the gain of the leaf nodes with respect to the root node.

Gain <-
    gini_root - 
    (prop(cases left leaf) * gini_left) - 
    (prop(cases right leaf * gini_right))

In [75]:
# The Gini-measure of the root node is given below (89 + 411 = 500)
gini_root <- 2 * 89 / 500 * 411 / 500
gini_root

# Compute the Gini measure for the left leaf node (401 + 45 = 446)
gini_ll <- 2 * 401 / 446 * 45 / 446
gini_ll

# Compute the Gini measure for the right leaf node (10 + 44 = 54)
gini_rl <- 2 * 10 / 54 * 44 / 54
gini_rl

# Compute the gain
gain <- gini_root - 446 / 500 * gini_ll - 54 / 500 * gini_rl
gain
0.292632
0.181433368859217
0.301783264746228
0.0982008423849859

Compare the gain-column in small_tree$splits with our computed gain, multiplied by 500, and assure they are the same (for now, the small_tree object is not calculated):

small_tree$splits

    count ncat  improve index adj
age   500    1 49.10042  32.5   0
In [20]:
improve <- gain * 500
improve
49.1004211924929

How would a change in the right leaf proportions from 10/44 to 20/34 change the gain?

In [77]:
# The Gini-measure of the root node is given below (89 + 411 = 500)
gini_root <- 2 * 89 / 500 * 411 / 500
gini_root

# Compute the Gini measure for the left leaf node (401 + 45 = 446)
gini_ll <- 2 * 401 / 446 * 45 / 446
gini_ll

# HERE
# Compute the Gini measure for the right leaf node (20 + 34 = 54)
gini_rl <- 2 * 20 / 54 * 34 / 54
gini_rl

# Compute the gain
gain <- gini_root - 446 / 500 * gini_ll - 54 / 500 * gini_rl
gain
0.292632
0.181433368859217
0.46639231824417
0.0804230646072081

The gain would go down from 0.10 to 0.80.

Building decision trees using the rpart package

But, it is hard building nice decision tree for credit risk data because of unbalanced data.

Overcome the unbalance:

  1. Undersampling or oversampling
    • Accuracy issue will disappear
    • Only training set
  2. Changing the prior probabilities
  3. Including a loss matrix

1. Undersampling the training set

Under- or oversampling. The training set is undersampled, such that after transformation, 1/3 of the training set consists of defaults, and 2/3 of non-defaults (undersampling defaulting).

In [287]:
# Import the data
undersampled_training_set <- read.csv('undersampled_training_set.csv', header = TRUE, sep = ';')

#true_val <- true_val$true_value
str(undersampled_training_set)
'data.frame':	6570 obs. of  8 variables:
 $ loan_status   : int  1 1 1 1 1 1 1 1 1 1 ...
 $ loan_amnt     : int  15000 6600 2200 24250 2500 5000 10000 10000 3000 3000 ...
 $ grade         : Factor w/ 7 levels "A","B","C","D",..: 5 4 1 4 3 2 3 2 3 1 ...
 $ home_ownership: Factor w/ 4 levels "MORTGAGE","OTHER",..: 4 4 1 4 4 4 1 4 1 1 ...
 $ annual_inc    : num  62000 30000 45000 136000 18984 ...
 $ age           : int  21 29 31 31 22 45 36 28 27 37 ...
 $ emp_cat       : Factor w/ 5 levels "0-15","15-30",..: 1 1 1 1 2 1 1 1 2 5 ...
 $ ir_cat        : Factor w/ 5 levels "0-8","11-13.5",..: 3 3 1 3 3 5 2 4 3 1 ...
In [288]:
# Load package rpart() in your workspace.
library(rpart)
In [289]:
# Change the code provided in the video such that a decision tree is constructed using the undersampled training set.
# If cp is not met, further splits will no longer be pursued. cp's default value is 0.01, but for complex problems, it is advised to relax cp to 0.001.
tree_undersample <- rpart(loan_status ~ ., method = "class",
                          data =  undersampled_training_set,
                          control = rpart.control(cp = 0.001))

# Plot the decision tree; uniform for equal-size branches
plot(tree_undersample, uniform = TRUE)

# Add labels to the decision tree
text(tree_undersample)

Mmmm... a bit crowded.

2. Changing the prior probabilities

With argument: parms = list(prior = c(non_default_proportion, default_proportion)), using the complete training set.

In [290]:
# Change the code below such that a tree is constructed with adjusted prior probabilities.
# parms = list(prior=c(non_default_proportion, default_proportion))
tree_prior <- rpart(loan_status ~ ., method = "class",
                    data = training_set_modified,
                    parms = list(prior = c(0.7, 0.3)),
                    control = rpart.control(cp = 0.001))

# Plot the decision tree
plot(tree_prior, uniform = TRUE)

# Add labels to the decision tree
text(tree_prior)

3. Including a loss matrix

Change the relative importance of misclassifying a default as non-default versus a non-default as a default. Stress that misclassifying a default as a non-default should be penalized more heavily.

Use argument: parms = list(loss = matrix(c(0, cost_def_as_nondef, cost_nondef_as_def, 0), ncol=2)).

In [291]:
# Change the code below such that a decision tree is constructed
# using a loss matrix penalizing 10 times more heavily for misclassified defaults.
# parms = list(loss = matrix(c(0, cost_def_as_nondef, cost_nondef_as_def, 0), ncol=2))
tree_loss_matrix <- rpart(loan_status ~ ., method = "class",
                    data = training_set_modified,
                    parms = list(loss = matrix(c(0, 10, 1, 0), ncol=2)),
                    control = rpart.control(cp = 0.001))

# Plot the decision tree
plot(tree_loss_matrix, uniform = TRUE)

# Add labels to the decision tree
text(tree_loss_matrix)

Pruning the decision tree

Pruning a tree is necessary to avoid overfitting (as above).

Use printcp() and plotcp() for pruning purposes.

In [292]:
library(rpart.plot)

tree_prior from Changing the prior probabilities

In [293]:
# Use printcp() to identify for which complexity parameter the cross-validated error rate is minimized.
printcp(tree_prior)
Classification tree:
rpart(formula = loan_status ~ ., data = training_set_modified, 
    method = "class", parms = list(prior = c(0.7, 0.3)), control = rpart.control(cp = 0.001))

Variables actually used in tree construction:
[1] age            annual_inc     emp_cat        grade          home_ownership
[6] ir_cat         loan_amnt     

Root node error: 5818.2/19394 = 0.3

n= 19394 

          CP nsplit rel error  xerror     xstd
1  0.0038701      0   1.00000 1.00000 0.020363
2  0.0035241      7   0.96930 0.99594 0.019931
3  0.0032501      8   0.96577 0.98528 0.019582
4  0.0018334     11   0.95510 0.98091 0.019416
5  0.0015774     12   0.95326 0.98218 0.019278
6  0.0014739     14   0.95011 0.98332 0.019283
7  0.0014239     21   0.93923 0.98356 0.019263
8  0.0013504     25   0.93313 0.98473 0.019216
9  0.0012914     27   0.93043 0.98517 0.019236
10 0.0011997     32   0.92398 0.98726 0.019278
11 0.0011571     35   0.92038 0.98836 0.019307
12 0.0011482     36   0.91922 0.98747 0.019310
13 0.0011014     41   0.91313 0.98831 0.019295
14 0.0010990     42   0.91203 0.98920 0.019256
15 0.0010822     45   0.90873 0.98975 0.019248
16 0.0010435     46   0.90765 0.99154 0.019234
17 0.0010314     49   0.90452 0.99093 0.019193
18 0.0010284     59   0.89355 0.99093 0.019193
19 0.0010170     64   0.88819 0.99065 0.019201
20 0.0010000     67   0.88514 0.99092 0.019201
In [294]:
# Plot the cross-validated error rate as a function of the complexity parameter
plotcp(tree_prior)
In [295]:
# Create an index for of the row with the minimum xerror
index <- which.min(tree_prior$cptable[ , "xerror"])
index

# Create tree_min
tree_min <- tree_prior$cptable[index, "CP"]
tree_min
4: 4
0.0018334497550684
In [296]:
#  Prune the tree using tree_min
ptree_prior <- prune(tree_prior, cp = tree_min)

# Use prp() to plot the pruned tree
prp(ptree_prior, extra = 1)

tree_loss_matrix from Including a loss matrix

In [410]:
tree_loss_matrix <- rpart(loan_status ~ ., method = "class",
                    data = training_set_modified,
                    parms = list(loss = matrix(c(0, 10, 1, 0), ncol=2)),
                    control = rpart.control(cp = 0.001))

# Use printcp() to identify for which complexity parameter the cross-validated error rate is minimized.
printcp(tree_loss_matrix)
Classification tree:
rpart(formula = loan_status ~ ., data = training_set_modified, 
    method = "class", parms = list(loss = matrix(c(0, 10, 1, 
        0), ncol = 2)), control = rpart.control(cp = 0.001))

Variables actually used in tree construction:
[1] age            annual_inc     emp_cat        grade          home_ownership
[6] ir_cat         loan_amnt     

Root node error: 17249/19394 = 0.8894

n= 19394 

          CP nsplit rel error  xerror     xstd
1  0.1342687      0   1.00000 10.0000 0.025322
2  0.0116528      1   0.86573  6.5112 0.039715
3  0.0038263      3   0.84243  5.7460 0.040169
4  0.0027828      5   0.83477  5.3094 0.040091
5  0.0023190      8   0.82642  5.3340 0.040102
6  0.0020291      9   0.82411  5.4306 0.040141
7  0.0019132     10   0.82208  5.5213 0.040160
8  0.0018938     14   0.81442  5.5172 0.040160
9  0.0018165     17   0.80874  5.4648 0.040148
10 0.0016233     23   0.79784  5.4579 0.040147
11 0.0014204     25   0.79460  5.3971 0.040128
12 0.0012175     27   0.79176  5.1747 0.040007
13 0.0011208     29   0.78932  5.1670 0.040004
14 0.0010725     32   0.78596  5.1873 0.040016
15 0.0010629     34   0.78381  5.1898 0.040016
16 0.0010435     39   0.77767  5.2300 0.040039
17 0.0010000     40   0.77662  5.1903 0.040016
In [411]:
# Plot the cross-validated error rate as a function of the complexity parameter
plotcp(tree_loss_matrix)
In [412]:
# Create an index for of the row with the minimum xerror
index <- which.min(tree_loss_matrix$cptable[ , "xerror"])
index

# Create tree_min
tree_min <- tree_prior$cptable[index, "CP"]
tree_min
13: 13
0.00110135096511118
In [406]:
# Prune the tree (using the above cp will be too much)
# on the graph above, the error touches the asymptote for the first time
ptree_loss_matrix <- prune(tree_loss_matrix, cp = 0.0014)

# Use prp() and argument extra = 1 to plot the pruned tree
prp(ptree_loss_matrix, extra = 1)

tree_undersampled from Undersampling the training set

In [301]:
# Use printcp() to identify for which complexity parameter the cross-validated error rate is minimized.
printcp(tree_undersample)
Classification tree:
rpart(formula = loan_status ~ ., data = undersampled_training_set, 
    method = "class", control = rpart.control(cp = 0.001))

Variables actually used in tree construction:
[1] age            annual_inc     emp_cat        grade          home_ownership
[6] ir_cat         loan_amnt     

Root node error: 2190/6570 = 0.33333

n= 6570 

          CP nsplit rel error xerror     xstd
1  0.0059361      0   1.00000 1.0000 0.017447
2  0.0044140      4   0.97443 1.0041 0.017465
3  0.0036530      7   0.96119 1.0110 0.017495
4  0.0031963      8   0.95753 1.0155 0.017514
5  0.0029680      9   0.95434 1.0151 0.017512
6  0.0025114     11   0.94840 1.0123 0.017500
7  0.0024353     14   0.94018 1.0123 0.017500
8  0.0022831     17   0.93288 1.0114 0.017497
9  0.0016743     18   0.93059 1.0123 0.017500
10 0.0015982     21   0.92557 1.0192 0.017529
11 0.0015221     31   0.90548 1.0279 0.017565
12 0.0014840     34   0.90091 1.0288 0.017569
13 0.0013699     50   0.87671 1.0279 0.017565
14 0.0012177     65   0.85525 1.0324 0.017584
15 0.0011416     68   0.85160 1.0393 0.017611
16 0.0010654     76   0.84247 1.0365 0.017600
17 0.0010000     79   0.83927 1.0379 0.017606
In [302]:
plotcp(tree_undersample)
In [303]:
index <- which.min(tree_undersample$cptable[ , "xerror"])
index

tree_min <- tree_undersample$cptable[index, "CP"]
tree_min
1: 1
0.00593607305936073
In [409]:
# take the second minimum or 5th position where cp = 0.0029680
ptree_undersample <- prune(tree_undersample, cp = 0.0029680)

prp(ptree_undersample, extra = 1)

One final tree using more options

Case weights

This vector contains weights of 1 for the non-defaults in the training set, and weights of 3 for defaults in the training sets. By specifying higher weights for default, the model will assign higher importance to classifying defaults correctly.

In [326]:
# Import the data
training_set2 <- read.csv('training_set2.csv', header = TRUE, sep = ';')
case_weights <- read.csv('case_weights.csv', header = TRUE, sep = ';')
In [327]:
str(training_set2)
'data.frame':	19394 obs. of  8 variables:
 $ loan_status   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ loan_amnt     : int  25000 16000 8500 9800 3600 6600 3000 7500 6000 22750 ...
 $ grade         : Factor w/ 7 levels "A","B","C","D",..: 2 4 1 2 1 1 1 2 1 1 ...
 $ home_ownership: Factor w/ 4 levels "MORTGAGE","OTHER",..: 4 4 1 1 1 3 4 3 4 1 ...
 $ annual_inc    : num  91000 45000 110000 102000 40000 ...
 $ age           : int  34 25 29 24 59 35 24 24 26 25 ...
 $ emp_cat       : Factor w/ 5 levels "0-15","15-30",..: 1 1 1 1 1 2 1 1 1 1 ...
 $ ir_cat        : Factor w/ 5 levels "0-8","11-13.5",..: 2 3 1 4 1 1 1 4 1 1 ...
In [328]:
case_weights <- as.numeric(case_weight$case_weights)
str(case_weights)
class(case_weights)
 num [1:19394] 1 1 1 1 1 1 1 1 1 1 ...
'numeric'
In [329]:
# set a seed and run the code to obtain a tree using weights, minsplit and minbucket
set.seed(345)
tree_weights <- rpart(loan_status ~ ., method = "class",
                      data = training_set2, weights = case_weights,
                      control = rpart.control(minsplit = 5, minbucket = 2, cp = 0.001))
In [330]:
printcp(tree_weights)
Classification tree:
rpart(formula = loan_status ~ ., data = training_set2, weights = case_weights, 
    method = "class", control = rpart.control(minsplit = 5, minbucket = 2, 
        cp = 0.001))

Variables actually used in tree construction:
[1] age            annual_inc     emp_cat        grade          home_ownership
[6] ir_cat         loan_amnt     

Root node error: 6570/19394 = 0.33876

n= 19394 

          CP nsplit rel error  xerror     xstd
1  0.0036530      0   1.00000 1.00000 0.010495
2  0.0030441      5   0.98158 1.00015 0.010495
3  0.0020802      6   0.97854 0.99269 0.010471
4  0.0019026      9   0.97230 0.99467 0.010478
5  0.0018265     11   0.96849 0.99665 0.010484
6  0.0016743     13   0.96484 0.99528 0.010480
7  0.0014713     14   0.96317 0.99574 0.010481
8  0.0013699     18   0.95510 1.00015 0.010495
9  0.0012177     19   0.95373 1.00107 0.010498
10 0.0010654     23   0.94825 1.00685 0.010517
11 0.0010000     26   0.94505 1.01233 0.010535
In [331]:
# Plot the cross-validated error rate for a changing cp
plotcp(tree_weights)
In [332]:
# Create an index for of the row with the minimum xerror
index <- which.min(tree_weights$cp[ , "xerror"])
index

# Create tree_min
tree_min <- tree_weights$cp[index, "CP"]
tree_min
3: 3
0.00208016235413496
In [333]:
# Prune the tree using tree_min
ptree_weights <- prune(tree_weights, cp = tree_min)

# Plot the pruned tree using the rpart.plot()-package
prp(ptree_weights, extra = 1)

Confusion matrices and accuracy of our final trees

The number of splits varies quite a bit from one tree to another:

Tree Obs CP Splits
ptree_prior 19394 0.0035 7
ptree_loss_matrix 19394 0.0025 26
ptree_loss_undersample 6570 0.0030 11
ptree_weight 19394 0.0010 7

Make predictions using the test set.

In [322]:
test_set_modified <- read.csv('test_set_modified.csv', header = TRUE, sep = ';')
In [457]:
# Make predictions for each of the pruned trees using the test set.
pred_prior <- predict(ptree_prior,
                            newdata = test_set_modified,
                            type = "class")

pred_loss_matrix <- predict(ptree_loss_matrix,
                            newdata = test_set_modified,
                            type = "class")

pred_undersample <- predict(ptree_undersample, 
                            newdata = test_set_modified, 
                            type = "class")

pred_weights <- predict(ptree_weights,
                            newdata = test_set_modified,
                            type = "class")
In [462]:
pred_prior <- as.numeric(as.character(pred_prior))
pred_loss_matrix <- as.numeric(as.character(pred_loss_matrix))
pred_undersample <- as.numeric(as.character(pred_undersample))
pred_weights <- as.numeric(as.character(pred_weights))
In [414]:
pred_loss_matrix3 <- predict(ptree_loss_matrix,
                            newdata = test_set_modified,
                            type = "prob")
head(pred_loss_matrix3)
head(pred_loss_matrix3[,2])
head(predictions_cloglog)
01
10.88363290.1163671
20.81051560.1894844
30.81051560.1894844
40.93593750.0640625
50.81051560.1894844
60.88363290.1163671
1
0.116367076631977
2
0.189484378968758
3
0.189484378968758
4
0.0640625
5
0.189484378968758
6
0.116367076631977
1
0.0879097705958422
2
0.112652255231631
3
0.133300504497918
4
0.0977594843277697
5
0.111524809986488
8
0.0973610420437803

Construct the confusion matrix for each of these trees.

Nevertheless, it is important to be aware of the fact that not only the accuracy is important, but also the sensitivity and specificity.

$ Classification~Accuracy = \frac{(TP+TN)}{(TP+FP+TN+FN)} $

$ Classification~Sensitivity = \frac{TP}{(TP+FN)} $

$ Classification~Specificity = \frac{TN}{(FP+TN)} $

In [415]:
# construct confusion matrices using the predictions.
confmat_prior <- table(test_set_modified$loan_status, pred_prior)
confmat_loss_matrix <- table(test_set_modified$loan_status, pred_loss_matrix)
confmat_undersample <- table(test_set_modified$loan_status, pred_undersample)
confmat_weights <- table(test_set_modified$loan_status, pred_weights)

# Compute the accuracies
acc_prior <- sum(diag(confmat_prior)) / nrow(test_set_modified)
acc_loss_matrix <- sum(diag(confmat_loss_matrix)) / nrow(test_set_modified)
acc_undersample <- sum(diag(confmat_undersample)) / nrow(test_set_modified)
acc_weights <- sum(diag(confmat_weights)) / nrow(test_set_modified)

acc_prior
acc_loss_matrix
acc_undersample
acc_weights
0.86377230071156
0.514592141899557
0.855212952459524
0.877797256883572

ptree_prior and ptree_undersample are good predictor. ptree_weight is the best model.



Evaluating a Credit Risk Model

Computing a bad rate given a fixed acceptance rate

Compute the bad rate that a bank can expect (the cut-off) when using the pruned tree ptree_prior that you fitted before, and an acceptance rate of 80% given the tree below.

In [338]:
# Make predictions for the probability of default using the pruned tree and the test set.
prob_default_prior <- predict(ptree_prior, newdata = test_set_modified)[ ,2]

head(prob_default_prior, 10)
1
0.233007967040122
2
0.581384868511394
3
0.377205650678396
4
0.233007967040122
5
0.406340561441676
6
0.233007967040122
7
0.586116363435164
8
0.233007967040122
9
0.233007967040122
10
0.233007967040122
In [340]:
# Obtain the cutoff for acceptance rate 80%
cutoff_prior <- quantile(prob_default_prior, 0.80)

cutoff_prior
80%: 0.377205650678396
In [341]:
# Obtain the binary predictions.
bin_pred_prior_80 <- ifelse(prob_default_prior > cutoff_prior, 1, 0)

head(bin_pred_prior_80, 10)

# where the prob_default_prior > cutoff_prior,
# the bin_pred_default_prior yields 1 or aprediction of default
1
0
2
1
3
0
4
0
5
1
6
0
7
1
8
0
9
0
10
0
In [342]:
# Obtain the actual default status for the accepted loans
accepted_status_prior_80 <- test_set_modified$loan_status[bin_pred_prior_80 == 0]

head(cbind(bin_pred_prior_80, accepted_status_prior_80), 10)
Warning message in cbind(bin_pred_prior_80, accepted_status_prior_80):
"number of rows of result is not a multiple of vector length (arg 2)"
bin_pred_prior_80accepted_status_prior_80
100
210
300
400
510
601
710
800
900
1000
In [343]:
# Obtain the bad rate for the accepted loans
sum(accepted_status_prior_80) / length(accepted_status_prior_80)
0.0976079809545403

The strategy table and strategy curve

This table is a useful tool for banks, as they can give them a better insight to define an acceptance strategy.

predictions_cloglog should be obtained from section 2 (extract the probabilities from the results)...

In [344]:
head(predictions_cloglog)
1
0.0879097705958422
2
0.112652255231631
3
0.133300504497918
4
0.0977594843277697
5
0.111524809986488
8
0.0973610420437803
In [350]:
strategy_bank <- function(prob_of_def) {
    cutoff = rep(NA, 21)
    bad_rate = rep(NA, 21)
    accept_rate = seq(1,0,by = -0.05)
    
    for (i in 1:21){
        cutoff[i] = quantile(prob_of_def, accept_rate[i])
        pred_i = ifelse(prob_of_def > cutoff[i], 1, 0)
        pred_as_good = test_set$loan_status[pred_i==0]
        bad_rate[i] = sum(pred_as_good) / length(pred_as_good)
    }
    table = cbind(accept_rate, cutoff=round(cutoff,4), bad_rate = round(bad_rate,4))
    
    return(list(table = table,
                bad_rate = bad_rate,
                accept_rate = accept_rate,
                cutoff = cutoff))
}
In [416]:
# need the probability of default (1)
head(predictions_cloglog)

pred_loss_matrix2 <- predict(ptree_loss_matrix,
                            newdata = test_set_modified,
                            type = "prob")
head(pred_loss_matrix2)
head(pred_loss_matrix2[,2])
predictions_loss <- pred_loss_matrix2[,2]
1
0.0879097705958422
2
0.112652255231631
3
0.133300504497918
4
0.0977594843277697
5
0.111524809986488
8
0.0973610420437803
01
10.88363290.1163671
20.81051560.1894844
30.81051560.1894844
40.93593750.0640625
50.81051560.1894844
60.88363290.1163671
1
0.116367076631977
2
0.189484378968758
3
0.189484378968758
4
0.0640625
5
0.189484378968758
6
0.116367076631977
In [417]:
# Apply the function strategy_bank to both predictions_cloglog and predictions_loss_matrix
strategy_cloglog <- strategy_bank(predictions_cloglog)

strategy_loss_matrix <- strategy_bank(predictions_loss)
In [418]:
# Obtain the strategy tables for both prediction-vectors
strategy_cloglog$table
accept_ratecutoffbad_rate
1.00 0.35990.1116
0.95 0.18580.1066
0.90 0.17810.1034
0.85 0.16980.0997
0.80 0.15570.0950
0.75 0.14070.0913
0.70 0.13560.0863
0.65 0.13090.0834
0.60 0.12430.0789
0.55 0.11320.0772
0.50 0.10500.0742
0.45 0.09720.0726
0.40 0.09340.0670
0.35 0.09020.0592
0.30 0.08660.0536
0.25 0.07900.0494
0.20 0.05420.0448
0.15 0.05220.0378
0.10 0.05040.0392
0.05 0.04810.0350
0.00 0.03100.0000
In [419]:
strategy_loss_matrix$table
accept_ratecutoffbad_rate
1.00 0.41670.1116
0.95 0.18950.1120
0.90 0.18950.1120
0.85 0.18950.1120
0.80 0.18950.1120
0.75 0.14720.1126
0.70 0.14720.1126
0.65 0.13090.1128
0.60 0.13090.1128
0.55 0.12200.1140
0.50 0.11640.1147
0.45 0.07530.1123
0.40 0.06980.1165
0.35 0.06980.1165
0.30 0.06910.1133
0.25 0.04890.1101
0.20 0.04240.1101
0.15 0.04240.1101
0.10 0.04240.1101
0.05 0.04240.1101
0.00 0.00000.1875

Keep in mind that a pruned tree model depends on the datasets and the cp coefficient... The next graph can vary a lot depending on these two factors.

In [420]:
# Plot the strategy functions
par(mfrow = c(1,2))

plot(strategy_cloglog$accept_rate, strategy_cloglog$bad_rate, 
     type = "l", xlab = "Acceptance rate", ylab = "Bad rate", 
     lwd = 2, main = "logistic regression")

plot(strategy_loss_matrix$accept_rate, strategy_loss_matrix$bad_rate, 
     type = "l", xlab = "Acceptance rate", 
     ylab = "Bad rate", lwd = 2, main = "tree")

This is what another pruned tree model produced:

Based on the last set of plots, if:

  • Bank A has to use an acceptance rate of 45%.
  • Bank B has to use an acceptance rate of 85%.

To minimize the 'Bad rate':

  • Bank A would prefer the tree model.
  • Bank B would prefer the logistic regression model.

ROC-curves for comparison of logistic regression models

ROC-curves are created using the pROC-package.

In [432]:
# Load the pROC-package
library(pROC)
In [448]:
# Construct the objects containing ROC-information
# from section 2
# roc(response, predictor)
# response is the loan status indicator in the test_set
ROC_logit <- roc(test_set_modified$loan_status, predictions_logit)
ROC_probit <- roc(test_set_modified$loan_status, predictions_probit)
ROC_cloglog <- roc(test_set_modified$loan_status, predictions_cloglog)
ROC_all_full <- roc(test_set_modified$loan_status, predictions_all_full)
In [449]:
# Draw all ROCs on one plot
plot(ROC_logit)

lines(ROC_probit, col= 'blue')
lines(ROC_cloglog, col= 'red')
lines(ROC_all_full, col= 'green')
Call:
roc.default(response = test_set_modified$loan_status, predictor = predictions_logit)

Data: predictions_logit in 8660 controls (test_set_modified$loan_status 0) < 1037 cases (test_set_modified$loan_status 1).
Area under the curve: 0.6345
In [450]:
# Compute the AUCs
auc(ROC_logit)
auc(ROC_probit)
auc(ROC_cloglog)
auc(ROC_all_full)
0.634524220470758
0.634373002598988
0.634576667906401
0.655078548664762

ROC-curves for comparison of tree-based models

Repreat the comparison for the tree-based models.

In [463]:
# Construct the objects containing ROC-information
# roc(response, predictor)
ROC_undersample <- roc(test_set_modified$loan_status, pred_undersample)
ROC_prior <- roc(test_set_modified$loan_status, pred_prior)
ROC_loss_matrix <- roc(test_set_modified$loan_status, pred_loss_matrix)
ROC_weights <- roc(test_set_modified$loan_status, pred_weights)
In [464]:
# Draw the ROC-curves in one plot
plot(ROC_undersample)

lines(ROC_prior, col= 'blue')
lines(ROC_loss_matrix, col= 'red')
lines(ROC_weights, col= 'green')
Call:
roc.default(response = test_set_modified$loan_status, predictor = pred_undersample)

Data: pred_undersample in 8660 controls (test_set_modified$loan_status 0) < 1037 cases (test_set_modified$loan_status 1).
Area under the curve: 0.5183
In [465]:
# Compute the AUCs
auc(ROC_undersample)
auc(ROC_prior)
auc(ROC_loss_matrix)
auc(ROC_weights)
0.518281995719577
0.542597617928783
0.622127417203204
0.5143738266139

Another round of pruning based on AUC

The variable home_ownership was deleted from the model, as it improved the overall AUC. After repeating this process for two additional rounds, the variables age and ir_cat were deleted, leading to the model:

log_3_remove_ir <- glm(loan_status ~ loan_amnt + grade + annual_inc + emp_cat, family = binomial, data = training_set)

with an AUC of 0.6545. Now, it's your turn to see whether the AUC can still be improved by deleting another variable from the model.

In [468]:
# Build four models each time deleting one variable in log_3_remove_ir
log_4_remove_amnt <- glm(loan_status ~ grade + annual_inc + emp_cat, 
                         family = binomial, data = training_set_modified)

log_4_remove_grade <- glm(loan_status ~ loan_amnt  + annual_inc + emp_cat, 
                          family = binomial, data = training_set_modified)

log_4_remove_inc <- glm(loan_status ~ loan_amnt + grade  + emp_cat , 
                        family = binomial, data = training_set_modified)

log_4_remove_emp <- glm(loan_status ~ loan_amnt + grade + annual_inc, 
                        family = binomial, data = training_set_modified)
Warning message:
"glm.fit: fitted probabilities numerically 0 or 1 occurred"Warning message:
"glm.fit: fitted probabilities numerically 0 or 1 occurred"Warning message:
"glm.fit: fitted probabilities numerically 0 or 1 occurred"
In [469]:
# Make PD-predictions for each of the models
pred_4_remove_amnt <- predict(log_4_remove_amnt, newdata = test_set_modified, type = "response")
pred_4_remove_grade <- predict(log_4_remove_grade, newdata = test_set_modified, type = "response")
pred_4_remove_inc <- predict(log_4_remove_inc, newdata = test_set_modified, type = "response")
pred_4_remove_emp <- predict(log_4_remove_emp, newdata = test_set_modified, type = "response")
In [471]:
# Compute the AUCs
auc(test_set_modified$loan_status, pred_4_remove_amnt)
auc(test_set_modified$loan_status, pred_4_remove_grade)
auc(test_set_modified$loan_status, pred_4_remove_inc)
auc(test_set_modified$loan_status, pred_4_remove_emp)
0.656049048930896
0.590615806387675
0.641297623051038
0.653983499658145

log_4_remove_amnt is the best model for the area under the curve AUC.

Further model reduction?

Deleting the variable loan_amnt, the AUC can be further improved to 0.6548! The resulting model is:

log_4_remove_amnt <- glm(loan_status ~ grade + annual_inc + emp_cat, family = binomial, data = training_set)

Is it possible to reduce the logistic regression model to only two variable without reducing the AUC?

In [472]:
# Build three models each time deleting one variable in log_4_remove_amnt
log_5_remove_grade <- glm(loan_status ~ annual_inc + emp_cat, family = binomial, data = training_set_modified)  
log_5_remove_inc <- glm(loan_status ~ grade  + emp_cat , family = binomial, data = training_set_modified)
log_5_remove_emp <- glm(loan_status ~ grade + annual_inc, family = binomial, data = training_set_modified)
Warning message:
"glm.fit: fitted probabilities numerically 0 or 1 occurred"Warning message:
"glm.fit: fitted probabilities numerically 0 or 1 occurred"
In [473]:
# Make PD-predictions for each of the models
pred_5_remove_grade <- predict(log_5_remove_grade, newdata = test_set_modified, type = "response")
pred_5_remove_inc <- predict(log_5_remove_inc, newdata = test_set_modified, type = "response")
pred_5_remove_emp <- predict(log_5_remove_emp, newdata = test_set_modified, type = "response")
In [475]:
# Compute the AUCs
auc(test_set_modified$loan_status, pred_5_remove_grade)
auc(test_set_modified$loan_status, pred_5_remove_inc)
auc(test_set_modified$loan_status, pred_5_remove_emp)
0.588585611808802
0.635949933299333
0.654052483068721
In [476]:
# Plot the ROC-curve for the best model here
plot(roc(test_set_modified$loan_status, pred_4_remove_amnt))
Call:
roc.default(response = test_set_modified$loan_status, predictor = pred_4_remove_amnt)

Data: pred_4_remove_amnt in 8660 controls (test_set_modified$loan_status 0) < 1037 cases (test_set_modified$loan_status 1).
Area under the curve: 0.656

log_4_remove_amnt still is the best model for the area under the curve AUC.



Conclusion

Other methods for predicting the risk of defaulting:

  • Discriminant analysis
  • Random forest
  • Neural networks
  • Support vector machines

But, the timing aspect is neglected in all the above methods (including logistic regressions and decision/classification trees).

A new popular method is survival analysis.

Survival analysis models the probability of default that change over time. Time-varying covariates can be included.