Regression modelling

This article sets out a few practical recipes for modelling with (life) insurance data. Insurance events are typically of low probability - these recipes consider some of the limitations of “small data” model fitting (where the observations of interest are sparse) and other topics for insurance like comparisons to standard tables.

Pat Reen https://www.linkedin.com/in/patrick-reen/
2021-10-14

Overview

Background

Here I’ve pulled together a few simple regression tools using (life) insurance data. See earlier article on data transformations/ preparation. Insurance events are typically of low probability - these recipes consider some of the limitations of “small data” model fitting (where the observations of interest are sparse) and other topics for insurance like comparisons to standard tables. Themes covered:

See link above to GitHub repository which has the detailed code.

Libraries

A list of packages used in the recipes.

Show code
library(rmdformats) # theme for the HTML doc
library(bookdown)   # bibliography formatting
library(kableExtra) # formatting tables
library(scales)     # data formatting  
library(dplyr)      # tidyverse: data manipulation
library(tidyr)      # tidyverse: tidy messy data
library(corrplot)   # correlation matrix visualisation, optional
library(ggplot2)    # tidyverse: graphs
library(pdp)        # tidyverse: arranging plots
library(GGally)     # tidyverse: extension of ggplot2
library(ggthemes)   # tidyverse: additional themes for ggplot, optional
library(plotly)     # graphs, including 3D 
library(caret)      # sampling techniques
library(broom)      # tidyverse: visualising statistical objects
library(pROC)       # visualising ROC curves
library(lmtest)     # tests for regression models

# packages below have some interaction with earlier packages, not always needed
library(arm)        # binned residual plot
library(msme)       # statistical tests, pearson dispersion
library(MASS)       # statistics

Further reading

A few books and articles of interest:

Model selection

The sections below provide a refresher on linear and logistic regression; some considerations for insurance data; model selection and testing model fit.

Splitting data

Training vs testing data

Split data into training and testing data sets. We will used 75% of the data for training and the rest for testing.

Show code
# Determine the number of rows for training
nrow(df)
[1] 600000
Show code
# Create a random sample of row IDs
sample_rows <- sample(nrow(df),0.75*nrow(df))
# Create the training dataset
df_train <- df[sample_rows,]
# Create the test dataset
df_test <- df[-sample_rows,]

Class imbalance

Looking at the random sample we have created, we have a significant imbalance between successes and failures.

Show code
df_train %>% select(inc_count_acc,inc_count_sick) %>% table()
             inc_count_sick
inc_count_acc      0      1
            0 447547   1574
            1    879      0

Generally, “in insurance applications, the prediction of a claim or no claim on an individual policy is rarely the point of statistical modelling … The model is useful provided it explains the variability in claims behaviour, as a function of risk factors.” (Piet de Jong, Gillian Z. Heller 2008, p108–109) So, as long as we have sufficient actual claims to justify the level of predictor variables fitted to the model we should be ok.

However, in some cases where it is important to correctly predict the binary categorical response variable, we may need to create a sample that has a roughly equal proportion of classes (of successes and failures) and then fit our model to that data. E.g. where we are looking to accurately predict fraud within banking transactions data.

To do that we need to refine our sampling method

Showing a method for down sampling below - this is more useful for pure classification models; not that useful for insurance applications.

Show code
# Determine the number of rows for training
df_train_ds <- downSample(df_train,factor(df_train$inc_count_sick)) 
df_train_ds %>% select(inc_count_sick) %>% table()
.
   0    1 
1574 1574 

Regression

Background

Linear regression is a method of modelling the relationship between a response (dependent variable) and one or more explanatory variables (predictors; independent variables). For a data set , the relationship between y, the response/ dependent variables and the vector of x’s, the explanatory variables, is linear of the form

\(y_{i} = \beta_{0} + \beta_{1}x_{i1} + ... + \beta_{p}x_{ip} + \epsilon_{i} = \mathbf{x}_{i}^t\mathbf{\beta} + \epsilon_{i}\), \(i = 1,...,n\)

Key assumptions

Linear vs logistic regression

For logistic regression the log odds are linear. We can transform to odds ratios by taking the exponential of the coefficients or exp(coef(model)) - this shows the relative change to odds of the response variables, where

Logistic regression for claims incidence

A simple regression model is shown below. Interpreting co-efficients and other output:

Other output

Show code
model_1 <- glm(inc_count_sick~age,data=df_train,family="binomial") # use the default link
summary(model_1)

Call:
glm(formula = inc_count_sick ~ age, family = "binomial", data = df_train)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.1719  -0.0925  -0.0764  -0.0662   3.8636  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -10.143801   0.221246  -45.85   <2e-16 ***
age           0.095743   0.004542   21.08   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 20946  on 449999  degrees of freedom
Residual deviance: 20499  on 449998  degrees of freedom
AIC: 20503

Number of Fisher Scoring iterations: 9

Choice of link function: the link function defines the relationship of response variables to mean. It is usually sufficient to use the standard link. Section below shows a model with the link specified (results are the same as the model above).

Show code
model_2 <- glm(inc_count_sick~age,data=df_train,family=binomial(link="logit")) # specify the link, logit is default for binomial
summary(model_2)

Adding more response variables: When deciding on explanatory variables to model, consider the properties of the data (like correlation or colinearity).

Show code
model_3 <- glm(inc_count_sick~age+policy_year+sex+waiting_period,data=df_train,family=binomial(link="logit")) 
summary(model_3)

Call:
glm(formula = inc_count_sick ~ age + policy_year + sex + waiting_period, 
    family = binomial(link = "logit"), data = df_train)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.2363  -0.0979  -0.0751  -0.0522   4.0480  

Coefficients:
                    Estimate Std. Error z value Pr(>|z|)    
(Intercept)        -9.507516   0.247391 -38.431  < 2e-16 ***
age                 0.092744   0.005108  18.155  < 2e-16 ***
policy_year         0.012077   0.010813   1.117  0.26402    
sexf               -0.723963   0.080513  -8.992  < 2e-16 ***
sexu               -0.392234   0.133309  -2.942  0.00326 ** 
waiting_period14d  -0.225423   0.107046  -2.106  0.03522 *  
waiting_period720d -1.831494   0.184362  -9.934  < 2e-16 ***
waiting_period90d  -1.590302   0.155151 -10.250  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 20946  on 449999  degrees of freedom
Residual deviance: 20032  on 449992  degrees of freedom
AIC: 20048

Number of Fisher Scoring iterations: 9

Linear regression for claims incidence

Choice of response distribution: In the logistic regression example above we considered the Binomial response distribution. Other common count distributions are

Show code
lower<-qpois(0.001, lambda=5)
upper<-qpois(0.999, lambda=5)
x<-seq(lower,upper,1)

data.frame(x, y=dpois(x, lambda=5)) %>% ggplot(mapping=aes(y=y,x=x)) + geom_col()+
labs(x=NULL, y="Density", title = "Density, poisson [dpois(x,lambda=5)]")+
theme_pander() + scale_color_gdocs()

Show code
lower<-qnbinom(0.001, size=2, mu=10)
upper<-qnbinom(0.999, size=2, mu=10)
x<-seq(lower,upper,1)

data.frame(x, y=dnbinom(x, size=2, mu=10)) %>% ggplot(mapping=aes(y=y,x=x)) + geom_col()+
labs(x=NULL, y="Density", title = "Density, negative binomial [dnbinom(x, size=2, mu=10)]")+
theme_pander() + scale_color_gdocs()

And amount

Show code
x<-seq(-3.5,3.5,0.1)  
data.frame(x,y=dnorm(x, mean=0, sd=1)) %>% ggplot(mapping=aes(y=y,x=x)) + geom_line()+
labs(x=NULL, y="Density", title = "Density, standard normal [dnorm(x, mean=0, sd=1)]")+
theme_pander() + scale_color_gdocs()

Show code
lower<-qgamma(0.001, shape=5, rate=3)
upper<-qgamma(0.999, shape=5, rate=3)
x<-seq(lower,upper,0.01)

data.frame(x,y=dgamma(x, shape=5, rate=3)) %>% ggplot(mapping=aes(y=y,x=x)) + geom_line()+
labs(x=NULL, y="Density", title = "Density, gamma [dgamma(x,shape=5, rate=3)]")+
theme_pander() + scale_color_gdocs()

An example of an amount model, assuming a normal response distribution

Show code
amount_model_3 <- glm(inc_amount_sick~age+sex+waiting_period,data=df_train) 
tidy(amount_model_3)
# A tibble: 7 x 5
  term               estimate std.error statistic   p.value
  <chr>                 <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)          -95.8      6.89     -13.9  5.59e- 44
2 age                    3.08     0.129     24.0  1.05e-126
3 sexf                 -17.2      1.80      -9.56 1.24e- 21
4 sexu                 -10.7      3.34      -3.19 1.41e-  3
5 waiting_period14d     -6.54     3.74      -1.75 8.01e-  2
6 waiting_period720d   -33.8      4.23      -7.98 1.51e- 15
7 waiting_period90d    -31.7      4.08      -7.77 7.80e- 15

(Un)grouped data

The models above were based upon ungrouped data, but earlier we noted that data are often grouped. In grouping data, some detail is usually lost e.g. we no longer have any true continuous categorical response variables. In this case, the columns relating to actual events (here claims) are a sum of the individual instance; more at Piet de Jong, Gillian Z. Heller (2008), p49, 105.

The example below compares a model of counts by age and sex on the grouped and shows that the derived parameters are materially similar.

Show code
# grouped model
model_4 <- glm(cbind(inc_count_sick,exposure-inc_count_sick)~age+sex,data=df_grp,family=binomial(link="logit")) # cbind gives a matrix of successes and failures

Use tidy to visualise the modelled result for grouped data:

Show code
tidy(model_4)
# A tibble: 4 x 5
  term        estimate std.error statistic   p.value
  <chr>          <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)  -9.83     0.192      -51.3  0        
2 age           0.0918   0.00394     23.3  4.41e-120
3 sexf         -0.741    0.0703     -10.5  5.56e- 26
4 sexu         -0.461    0.119       -3.88 1.04e-  4

Modelling using ungrouped data, coefficients materially similar

Show code
model_5 <- glm(inc_count_sick~age+sex,data=df,family=binomial(link="logit")) 

Use tidy to visualise the modelled result for ungrouped data:

Show code
tidy(model_5)
# A tibble: 4 x 5
  term        estimate std.error statistic   p.value
  <chr>          <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)  -9.83     0.192      -51.3  0        
2 age           0.0918   0.00394     23.3  4.42e-120
3 sexf         -0.741    0.0703     -10.5  5.57e- 26
4 sexu         -0.461    0.119       -3.88 1.04e-  4

Offsets

When the data are grouped, it is important to consider the level of exposure in a given group. This can be achieved with an offset term (Piet de Jong, Gillian Z. Heller (2008), p66-67). This is important in an insurance context where we are often interested in modelling rates of claim.

Show code
df_grp_filtered <- df_grp %>% filter(inc_count_tot>1, exposure>10)
model_6 <- glm(inc_amount_tot~age+sex+offset(sum_assured),data=df_grp_filtered) 
tidy(model_6)
# A tibble: 4 x 5
  term         estimate std.error statistic  p.value
  <chr>           <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept) -3454137.   526285.     -6.56 2.39e-10
2 age            50880.    10976.      4.64 5.37e- 6
3 sexf          808044.   134595.      6.00 5.70e- 9
4 sexu         1087955.   542771.      2.00 4.59e- 2
Show code
model_7 <- glm(inc_amount_tot~age+sex,data=df_grp_filtered) 
tidy(model_7)
# A tibble: 4 x 5
  term        estimate std.error statistic p.value
  <chr>          <dbl>     <dbl>     <dbl>   <dbl>
1 (Intercept)    8153.     5657.     1.44   0.151 
2 age             204.      118.     1.73   0.0842
3 sexf          -3454.     1447.    -2.39   0.0176
4 sexu          -3801.     5834.    -0.652  0.515 

Actuals or AvE?

The models we have fitted above are based upon actual claim incidences. We can consider the difference between some expected claim rate (e.g. from a standard table) and the actuals.

Using the grouped data again, we model A/E with a normal response distribution. The model shows that none of the coeffients are significant indicating no material difference to expected, which is consistent with the data as the observations were derived from the expected probabilities initially.

Show code
# grouped model
model_8 <- glm(inc_count_sick/inc_count_sick_exp~age+sex,data=df_grp) 
# use tidy to visualise the modelled result
summary(model_8)

Call:
glm(formula = inc_count_sick/inc_count_sick_exp ~ age + sex, 
    data = df_grp)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
  -1.19    -0.97    -0.93    -0.87  2571.52  

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.725747   0.456016   1.591    0.112
age          0.004747   0.009713   0.489    0.625
sexf        -0.091246   0.154067  -0.592    0.554
sexu         0.174939   0.208211   0.840    0.401

(Dispersion parameter for gaussian family taken to be 514.2801)

    Null deviance: 56358742  on 109589  degrees of freedom
Residual deviance: 56357894  on 109586  degrees of freedom
AIC: 995154

Number of Fisher Scoring iterations: 2

Stepwise regression

This method builds a regression model from a set of candidate predictor variables by entering predictors based on p values, in a stepwise manner until there are no variables left to enter any more. Model may over/ understate the importance of predictors and the direction of stepping (forward or backward) can impact the outcome - so some degree of interpretation is necessary.

Show code
# specify a null model with no predictors
null_model_sick <- glm(inc_count_sick ~ 1, data = df_train, family = "binomial")

# specify the full model using all of the potential predictors
full_model_sick <- glm(inc_count_sick ~ cal_year + policy_year + sex + smoker + benefit_period + waiting_period + occupation + poly(age,3) + sum_assured + policy_year*age + policy_year*sum_assured, data = df_train, family = "binomial")

# alternatively, glm(y~ . -x1) fits model using all variables excluding x1

# use a forward stepwise algorithm to build a parsimonious model
step_model_sick <- step(null_model_sick, scope = list(lower = null_model_sick, upper = full_model_sick), direction = "forward")
Start:  AIC=20948.4
inc_count_sick ~ 1

                 Df Deviance   AIC
+ age             1    20499 20503
+ poly(age, 3)    3    20497 20505
+ waiting_period  3    20581 20589
+ policy_year     1    20833 20837
+ sum_assured     1    20842 20846
+ sex             2    20844 20850
+ smoker          2    20925 20931
+ cal_year        1    20932 20936
<none>                 20946 20948
+ benefit_period  2    20945 20951
+ occupation      4    20941 20951

Step:  AIC=20503.31
inc_count_sick ~ age

                 Df Deviance   AIC
+ waiting_period  3    20135 20145
+ sex             2    20398 20406
+ smoker          2    20478 20486
+ sum_assured     1    20482 20488
<none>                 20499 20503
+ policy_year     1    20498 20504
+ cal_year        1    20499 20505
+ poly(age, 3)    2    20497 20505
+ occupation      4    20494 20506
+ benefit_period  2    20498 20506

Step:  AIC=20145.08
inc_count_sick ~ age + waiting_period

                 Df Deviance   AIC
+ sex             2    20033 20047
+ smoker          2    20114 20128
+ sum_assured     1    20118 20130
<none>                 20135 20145
+ policy_year     1    20134 20146
+ poly(age, 3)    2    20133 20147
+ cal_year        1    20135 20147
+ occupation      4    20129 20147
+ benefit_period  2    20134 20148

Step:  AIC=20047.43
inc_count_sick ~ age + waiting_period + sex

                 Df Deviance   AIC
+ smoker          2    20012 20030
+ sum_assured     1    20017 20033
<none>                 20033 20047
+ policy_year     1    20032 20048
+ poly(age, 3)    2    20031 20049
+ cal_year        1    20033 20049
+ occupation      4    20028 20050
+ benefit_period  2    20032 20050

Step:  AIC=20030.01
inc_count_sick ~ age + waiting_period + sex + smoker

                 Df Deviance   AIC
+ sum_assured     1    19995 20015
<none>                 20012 20030
+ policy_year     1    20011 20031
+ poly(age, 3)    2    20009 20031
+ cal_year        1    20012 20032
+ occupation      4    20006 20032
+ benefit_period  2    20011 20033

Step:  AIC=20015.39
inc_count_sick ~ age + waiting_period + sex + smoker + sum_assured

                 Df Deviance   AIC
<none>                 19995 20015
+ policy_year     1    19994 20016
+ poly(age, 3)    2    19993 20017
+ cal_year        1    19995 20017
+ benefit_period  2    19994 20018
+ occupation      4    19995 20023
Show code
summary(full_model_sick)

Call:
glm(formula = inc_count_sick ~ cal_year + policy_year + sex + 
    smoker + benefit_period + waiting_period + occupation + poly(age, 
    3) + sum_assured + policy_year * age + policy_year * sum_assured, 
    family = "binomial", data = df_train)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.2568  -0.0983  -0.0740  -0.0512   4.0416  

Coefficients: (1 not defined because of singularities)
                          Estimate Std. Error z value Pr(>|z|)    
(Intercept)             -4.710e+01  6.626e+01  -0.711  0.47714    
cal_year                 2.058e-02  3.283e-02   0.627  0.53070    
policy_year             -1.077e-01  1.018e-01  -1.058  0.28991    
sexf                    -7.244e-01  8.052e-02  -8.997  < 2e-16 ***
sexu                    -3.912e-01  1.333e-01  -2.934  0.00334 ** 
smokers                  3.464e-01  7.424e-02   4.666 3.07e-06 ***
smokeru                 -1.076e-01  1.239e-01  -0.869  0.38493    
benefit_period2yr       -9.245e-02  8.108e-02  -1.140  0.25414    
benefit_period5yr       -3.854e-02  7.952e-02  -0.485  0.62796    
waiting_period14d       -2.306e-01  1.071e-01  -2.154  0.03121 *  
waiting_period720d      -1.837e+00  1.844e-01  -9.964  < 2e-16 ***
waiting_period90d       -1.595e+00  1.552e-01 -10.281  < 2e-16 ***
occupationsed           -2.305e-02  9.765e-02  -0.236  0.81336    
occupationtechn         -5.071e-02  1.022e-01  -0.496  0.61970    
occupationwhite         -6.020e-02  1.013e-01  -0.594  0.55238    
occupationblue          -3.262e-02  1.189e-01  -0.274  0.78383    
poly(age, 3)1            3.006e+02  4.680e+01   6.423 1.34e-10 ***
poly(age, 3)2           -4.075e+01  2.723e+01  -1.497  0.13445    
poly(age, 3)3           -7.897e+00  2.002e+01  -0.394  0.69328    
sum_assured              4.281e-05  2.744e-05   1.560  0.11867    
age                             NA         NA      NA       NA    
policy_year:age          2.424e-03  2.110e-03   1.149  0.25072    
policy_year:sum_assured -4.251e-08  3.969e-06  -0.011  0.99145    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 20946  on 449999  degrees of freedom
Residual deviance: 19987  on 449978  degrees of freedom
AIC: 20031

Number of Fisher Scoring iterations: 9
Show code
summary(step_model_sick)

Call:
glm(formula = inc_count_sick ~ age + waiting_period + sex + smoker + 
    sum_assured, family = "binomial", data = df_train)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.2830  -0.0973  -0.0745  -0.0519   4.0246  

Coefficients:
                     Estimate Std. Error z value Pr(>|z|)    
(Intercept)        -9.640e+00  2.432e-01 -39.635  < 2e-16 ***
age                 8.953e-02  4.755e-03  18.829  < 2e-16 ***
waiting_period14d  -2.295e-01  1.071e-01  -2.144  0.03202 *  
waiting_period720d -1.835e+00  1.844e-01  -9.951  < 2e-16 ***
waiting_period90d  -1.594e+00  1.552e-01 -10.273  < 2e-16 ***
sexf               -7.238e-01  8.052e-02  -8.989  < 2e-16 ***
sexu               -3.916e-01  1.333e-01  -2.937  0.00331 ** 
smokers             3.466e-01  7.424e-02   4.669 3.03e-06 ***
smokeru            -1.067e-01  1.239e-01  -0.862  0.38885    
sum_assured         4.333e-05  1.064e-05   4.072 4.66e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 20946  on 449999  degrees of freedom
Residual deviance: 19995  on 449990  degrees of freedom
AIC: 20015

Number of Fisher Scoring iterations: 9

The form of the final step model is glm(formula = inc_count_sick ~ poly(age, 3) + waiting_period + sex + smoker + sum_assured, family = “binomial,” data = df_train) i.e. dropping some of the explanatory variables from the full model.

Confidence intervals

We can compute the confidence intervals for one or more parameters in a fitted model.

Show code
confint(step_model_sick) # add second argument specifying which parameters we need a confint for
                           2.5 %        97.5 %
(Intercept)        -1.011939e+01 -9.1658786725
age                 8.021044e-02  0.0988490230
waiting_period14d  -4.333078e-01 -0.0131725684
waiting_period720d -2.205152e+00 -1.4802247926
waiting_period90d  -1.900406e+00 -1.2909882621
sexf               -8.848852e-01 -0.5690645113
sexu               -6.636738e-01 -0.1401306177
smokers             1.985219e-01  0.4896777597
smokeru            -3.586694e-01  0.1276681057
sum_assured         2.248727e-05  0.0000642007

Predictions

predict() is a generic function that can be used to predict results from various model forms. The function takes the form below. For logistic regression, setting prediction type to response produces a probability rather than log odds (which are difficult to interpret).

Show code
# predictions
pred_inc_count_sick <- as.data.frame(
  predict(step_model_sick, data= df_train, # model and data
  type="response", # or terms for coefficients
  se.fit = TRUE, # default is false
  interval = "confidence", #default "none", also "prediction"
  level = 0.95
  # ...
  )
)

# add back to data
ifrm("df_train_pred")
pred_inc_count_sick <- rename(pred_inc_count_sick,pred_rate_inc_count_sick=fit,se_rate_inc_count_sick=se.fit)
df_train_pred <- cbind(df_train,pred_inc_count_sick)

We can plot the results by age and sex against the crude rates from earlier:

Show code
# from earlier, summarise data by age and sex
df_train_pred %>% filter(sex != "u", between(age, 30,60)) %>% group_by(age,sex) %>% 
summarise(total_sick=sum(inc_count_sick),total_acc=sum(inc_count_acc), pred_total_sick=sum(pred_rate_inc_count_sick),exposure=n(),.groups = 'drop') %>% 
mutate(sick_rate = total_sick/exposure,pred_sick_rate = pred_total_sick/exposure, acc_rate = total_acc/exposure) %>%
# used ggplot to graph the results
ggplot(aes(x=age,y=sick_rate,color=sex)) +
# ylim(0,1) +
geom_point() +
# add a modeled line
geom_line(aes(x=age,y=pred_sick_rate)) +
theme_pander() + scale_color_gdocs()

Out of sample predictions

Earlier we split the data into “training” data used to create the model and “test” data which we intended to use for performance validation. Adding predictions to test data below.

Show code
# predictions
pred_inc_count_sick <- as.data.frame(
  predict(step_model_sick, data= df_test, # test data
  type="response",
  se.fit = TRUE, 
  interval = "confidence",
  level = 0.95
  # ...
  )
)

# add back to data
ifrm("df_test_pred")
pred_inc_count_sick <- rename(pred_inc_count_sick,pred_rate_inc_count_sick=fit,se_rate_inc_count_sick=se.fit)
df_test_pred <- cbind(df_test,pred_inc_count_sick)

# summary stats for prediction
hist(df_test_pred$pred_rate_inc_count_sick,main = "Histogram of predicted sickness rate",xlab = "Probability of claim")
Show code
summary(df_test_pred$pred_rate_inc_count_sick)
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
6.024e-05 1.363e-03 2.787e-03 3.498e-03 4.748e-03 3.925e-02 

To translate the predicted probabilities into a vector of claim/no-claim for each policy we could define a claim as occurring if modelled/ predicted probability of claim is greater than some threshold value. More on this later under evaluation techniques.

Show code
# add binary prediction based upon a threshold probability
df_test_pred$pred_inc_count_sick <- ifelse(df_test_pred$pred_rate_inc_count_sick>0.003,1,0) # example threshold is ~3rd quartile probability of claim. for balanced data this should be closer to 0.5.

Evaluation

Techniques

AIC

For least squares regression the \(R^{2}\) statistic (coefficient of determination) measures the proportion of variance in the dependent variable that can be explained by the independent variables. Adjusted \(R^{2}\), adjusts for the number of predictors in the model. The adjusted \(R^{2}\) increases when the new predictor improves the model more than would be expected by chance. The glm function uses a maximum likelihood estimator which does not minimize the squared error.

AIC stands for Akaike Information Criteria. \(AIC = -2l+2p\) where \(l\) is the log likelihood and \(p\) are the number of parameters. It is analogous to adjusted \(R^{2}\) and is the measure of fit which penalizes model for the number of independent variables. We prefer a model with a lower AIC value.

The results below show the AIC for model 3 is lower than model 1 and that the final step model has the lowest AIC of those evaluated (preferred).

Show code
AIC(model_1)
[1] 20503.31
Show code
AIC(model_3)
[1] 20048.19
Show code
stepAIC(step_model_sick)
Start:  AIC=20015.39
inc_count_sick ~ age + waiting_period + sex + smoker + sum_assured

                 Df Deviance   AIC
<none>                 19995 20015
- sum_assured     1    20012 20030
- smoker          2    20017 20033
- sex             2    20097 20113
- age             1    20352 20370
- waiting_period  3    20360 20374

Call:  glm(formula = inc_count_sick ~ age + waiting_period + sex + smoker + 
    sum_assured, family = "binomial", data = df_train)

Coefficients:
       (Intercept)                 age   waiting_period14d  
        -9.640e+00           8.953e-02          -2.295e-01  
waiting_period720d   waiting_period90d                sexf  
        -1.835e+00          -1.594e+00          -7.238e-01  
              sexu             smokers             smokeru  
        -3.916e-01           3.466e-01          -1.067e-01  
       sum_assured  
         4.333e-05  

Degrees of Freedom: 449999 Total (i.e. Null);  449990 Residual
Null Deviance:      20950 
Residual Deviance: 20000    AIC: 20020

Anova

An anova comparison below between model_3 and the step model using a Chi-squared test shows a small p value for the stepped model - indicating that the model is an improvement. F-test can be used on continuous response models.

Show code
anova(model_3,step_model_sick, test="Chisq")
Analysis of Deviance Table

Model 1: inc_count_sick ~ age + policy_year + sex + waiting_period
Model 2: inc_count_sick ~ age + waiting_period + sex + smoker + sum_assured
  Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
1    449992      20032                          
2    449990      19995  2   36.795 1.023e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Likelihood ratio test

The likelihood ratio test compares two models based upon log likelihoods; more at Piet de Jong, Gillian Z. Heller (2008), p74.

The test below concludes that the step model is more accurate than the less complex model.

Show code
lrtest(model_3,step_model_sick)
Likelihood ratio test

Model 1: inc_count_sick ~ age + policy_year + sex + waiting_period
Model 2: inc_count_sick ~ age + waiting_period + sex + smoker + sum_assured
  #Df   LogLik Df  Chisq Pr(>Chisq)    
1   8 -10016.1                         
2  10  -9997.7  2 36.795  1.023e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Other tests

Other validations tests could be considered like the Wald test, Score test; see Piet de Jong, Gillian Z. Heller (2008), p74-77.

Residual checks

Residuals/ errors are the observed less fitted values. Traditional residual plots (shown below) are usually a good starting point (we expect to see no trend in the plot of residuals vs fitted values), but are not as informative for logistic regression or for data with a low probability outcome.

Standard model plots

Plots, linear regression example:

Show code
par(mfrow = c(2, 2))
plot(amount_model_3)

Logistic regression example:

Show code
par(mfrow = c(2, 2))
plot(step_model_sick)

Alternatives

For logistic regression we can try other tools to test the residuals against the model assumptions. GLMs assume that the residuals/ errors are normally distributed. Plotting the density of the residuals gives us:

Show code
hist(rstandard(step_model_sick),breaks= c(seq(-1,1,by=0.001),seq(2,5, by=1)),freq=FALSE,main = "Histogram of residuals",xlab = "Residuals")
curve(dnorm, add = TRUE)

Focusing the x axis range:

Show code
hist(rstandard(step_model_sick),breaks= c(seq(-1,1,by=0.001),seq(2,5, by=1)), xlim = c(-0.5,0.5),freq=FALSE,main = "Histogram of residuals",xlab = "Residuals")

A binned residual plot divides data into bins based upon their fitted values, showing the average residuals vs fitted value for each bin (Jeff Webb 2017):

Show code
# from the arm package
binnedplot(fitted(step_model_sick), 
           residuals(step_model_sick, type = "response"), 
           nclass = 50, 
           xlab = "Expected Values", 
           ylab = "Average residual", 
           main = "Binned residual plot", 
           cex.pts = 0.8, 
           col.pts = 1, 
           col.int = "gray")

Grey lines are 2 se bands (~95%). Apart from a few outliers, most of the residuals are within those bands.

P-P plots

The P-P (probability–probability) plot is a visualization that plots CDFs of the two distributions (empirical and theoretical) against each other, an unrelated dummy example below. It can be used to assess the residuals for normality.

Show code
x <- rnorm(100)
probDist <- pnorm(x)
#create PP plot
plot(ppoints(length(x)), sort(probDist), main = "PP Plot", xlab = "Observed Probability", ylab = "Expected Probability")
#add diagonal line
abline(0,1)

Other tests

Pearson dispersion test: This function calculates Pearson Chi2 statistic and the Pearson-based dispersion statistic. Values of the dispersion greater than 1 indicate model overdispersion. Values under 1 indicate under-dispersion.

Show code
P__disp(step_model_sick)
pearson.chi2   dispersion 
457855.78948      1.01748 

Confusion matrix

With a binary prediction from the model loaded within the dataframe (defined earlier), we can compare this to the actual outcomes to determine the validity of the model. In this comparison, the true positive rate is called the Sensitivity. The inverse of the false-positive rate is called the Specificity.

Where:

A perfect classification model could have Sensitivity and Specificity close to 1. However, we noted earlier, in insurance applications we are not often interested in in an accurate prediction as to whether a given policy gives rise to a claim. Rather we are interested in understanding the claim rates and how they are explained by the response variables/ risk factors (Piet de Jong, Gillian Z. Heller 2008, p108–109).

A confusion matrix is a tabular representation of Observed vs Predicted values. It helps to quantify the efficiency (or accuracy) of the model.

Show code
confusion_matrix <- df_test_pred %>% select(inc_count_sick,pred_inc_count_sick) %>% table()
confusion_matrix
              pred_inc_count_sick
inc_count_sick      0      1
             0 239855 208606
             1    808    731
Show code
cat("Accuracy Rate =",(confusion_matrix[1,1]+confusion_matrix[2,2])/sum(confusion_matrix[]),
"; Missclasification Rate =",(confusion_matrix[1,2]+confusion_matrix[2,1])/sum(confusion_matrix[]),
"; True Positive Rate/Sensitivity  =",confusion_matrix[2,2]/sum(confusion_matrix[2,]))
Accuracy Rate = 0.5346356 ; Missclasification Rate = 0.4653644 ; True Positive Rate/Sensitivity  = 0.4749838
Show code
cat("False Positive Rate =",confusion_matrix[1,2]/sum(confusion_matrix[1,]),
"; Specificity =",1-confusion_matrix[1,2]/sum(confusion_matrix[1,]))
False Positive Rate = 0.4651597 ; Specificity = 0.5348403

ROC/ AUC

The ROC (Receiver Operating Characteristic) curve is a graph with:

ROC curves start at 0 on the x and y axis and rise to 1. The faster the curve reaches a True Positive Rate of 1, the better the curve generally. A model on the diagonal is only showing a 50/50 chance of correctly guessing the probability of claim. Area under an ROC curve (AUC) is a measure of the usefulness of a model in general, where a greater area means more useful. AUC is a tool for comparing models (generally, the closer the AUC is to 1, the better, but there are some cases where AUC can be misleading; AUC = 0.5 is a model on the diagonal).

In our binary model, the AUC is not much better than 0.5, indicating a very weak predictive ability. It isn’t hugely surprising that our model is not very effective at predicting individual claims. As noted earlier, for insurance applications, we are usually more concerned with predicting the claim rate and how it varies by predictors like age and sex (Piet de Jong, Gillian Z. Heller 2008, p108–109).

Show code
par(mfrow = c(1,1))
roc = roc(df_test_pred$inc_count_sick, df_test_pred$pred_rate_inc_count_sick, plot = TRUE, print.auc = TRUE)
Show code
as.numeric(roc$auc)
[1] 0.5074699
Show code
coords(roc, "best", ret = "threshold")
    threshold
1 0.001932091

References

Jeff Webb. 2017. “Statistics and Predictive Analytics.” https://bookdown.org/jefftemplewebb/IS-6489/logistic-regression.html#assessing-logistic-model-fit.
Piet de Jong, Gillian Z. Heller. 2008. Generalised Linear Models for Insurance Data. Cambridge University Press. https://ggplot2.tidyverse.org/reference/ggtheme.html.

References