This article sets out a few practical recipes for modelling with (life) insurance data. Insurance events are typically of low probability - these recipes consider some of the limitations of “small data” model fitting (where the observations of interest are sparse) and other topics for insurance like comparisons to standard tables.
Here I’ve pulled together a few simple regression tools using (life) insurance data. See earlier article on data transformations/ preparation. Insurance events are typically of low probability - these recipes consider some of the limitations of “small data” model fitting (where the observations of interest are sparse) and other topics for insurance like comparisons to standard tables. Themes covered:
See link above to GitHub repository which has the detailed code.
A list of packages used in the recipes.
library(rmdformats) # theme for the HTML doc
library(bookdown) # bibliography formatting
library(kableExtra) # formatting tables
library(scales) # data formatting
library(dplyr) # tidyverse: data manipulation
library(tidyr) # tidyverse: tidy messy data
library(corrplot) # correlation matrix visualisation, optional
library(ggplot2) # tidyverse: graphs
library(pdp) # tidyverse: arranging plots
library(GGally) # tidyverse: extension of ggplot2
library(ggthemes) # tidyverse: additional themes for ggplot, optional
library(plotly) # graphs, including 3D
library(caret) # sampling techniques
library(broom) # tidyverse: visualising statistical objects
library(pROC) # visualising ROC curves
library(lmtest) # tests for regression models
# packages below have some interaction with earlier packages, not always needed
library(arm) # binned residual plot
library(msme) # statistical tests, pearson dispersion
library(MASS) # statistics
A few books and articles of interest:
The sections below provide a refresher on linear and logistic regression; some considerations for insurance data; model selection and testing model fit.
Split data into training and testing data sets. We will used 75% of the data for training and the rest for testing.
# Determine the number of rows for training
nrow(df)
[1] 600000
Looking at the random sample we have created, we have a significant imbalance between successes and failures.
inc_count_sick
inc_count_acc 0 1
0 447547 1574
1 879 0
Generally, “in insurance applications, the prediction of a claim or no claim on an individual policy is rarely the point of statistical modelling … The model is useful provided it explains the variability in claims behaviour, as a function of risk factors.” (Piet de Jong, Gillian Z. Heller 2008, p108–109) So, as long as we have sufficient actual claims to justify the level of predictor variables fitted to the model we should be ok.
However, in some cases where it is important to correctly predict the binary categorical response variable, we may need to create a sample that has a roughly equal proportion of classes (of successes and failures) and then fit our model to that data. E.g. where we are looking to accurately predict fraud within banking transactions data.
To do that we need to refine our sampling method
Showing a method for down sampling below - this is more useful for pure classification models; not that useful for insurance applications.
.
0 1
1574 1574
Linear regression is a method of modelling the relationship between a response (dependent variable) and one or more explanatory variables (predictors; independent variables). For a data set , the relationship between y, the response/ dependent variables and the vector of x’s, the explanatory variables, is linear of the form
\(y_{i} = \beta_{0} + \beta_{1}x_{i1} + ... + \beta_{p}x_{ip} + \epsilon_{i} = \mathbf{x}_{i}^t\mathbf{\beta} + \epsilon_{i}\), \(i = 1,...,n\)
Key assumptions
For logistic regression the log odds are linear. We can transform to odds ratios by taking the exponential of the coefficients or exp(coef(model)) - this shows the relative change to odds of the response variables, where
A simple regression model is shown below. Interpreting co-efficients and other output:
Other output
Call:
glm(formula = inc_count_sick ~ age, family = "binomial", data = df_train)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.1719 -0.0925 -0.0764 -0.0662 3.8636
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -10.143801 0.221246 -45.85 <2e-16 ***
age 0.095743 0.004542 21.08 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 20946 on 449999 degrees of freedom
Residual deviance: 20499 on 449998 degrees of freedom
AIC: 20503
Number of Fisher Scoring iterations: 9
Choice of link function: the link function defines the relationship of response variables to mean. It is usually sufficient to use the standard link. Section below shows a model with the link specified (results are the same as the model above).
Adding more response variables: When deciding on explanatory variables to model, consider the properties of the data (like correlation or colinearity).
Call:
glm(formula = inc_count_sick ~ age + policy_year + sex + waiting_period,
family = binomial(link = "logit"), data = df_train)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.2363 -0.0979 -0.0751 -0.0522 4.0480
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -9.507516 0.247391 -38.431 < 2e-16 ***
age 0.092744 0.005108 18.155 < 2e-16 ***
policy_year 0.012077 0.010813 1.117 0.26402
sexf -0.723963 0.080513 -8.992 < 2e-16 ***
sexu -0.392234 0.133309 -2.942 0.00326 **
waiting_period14d -0.225423 0.107046 -2.106 0.03522 *
waiting_period720d -1.831494 0.184362 -9.934 < 2e-16 ***
waiting_period90d -1.590302 0.155151 -10.250 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 20946 on 449999 degrees of freedom
Residual deviance: 20032 on 449992 degrees of freedom
AIC: 20048
Number of Fisher Scoring iterations: 9
Choice of response distribution: In the logistic regression example above we considered the Binomial response distribution. Other common count distributions are
lower<-qpois(0.001, lambda=5)
upper<-qpois(0.999, lambda=5)
x<-seq(lower,upper,1)
data.frame(x, y=dpois(x, lambda=5)) %>% ggplot(mapping=aes(y=y,x=x)) + geom_col()+
labs(x=NULL, y="Density", title = "Density, poisson [dpois(x,lambda=5)]")+
theme_pander() + scale_color_gdocs()
lower<-qnbinom(0.001, size=2, mu=10)
upper<-qnbinom(0.999, size=2, mu=10)
x<-seq(lower,upper,1)
data.frame(x, y=dnbinom(x, size=2, mu=10)) %>% ggplot(mapping=aes(y=y,x=x)) + geom_col()+
labs(x=NULL, y="Density", title = "Density, negative binomial [dnbinom(x, size=2, mu=10)]")+
theme_pander() + scale_color_gdocs()
And amount
x<-seq(-3.5,3.5,0.1)
data.frame(x,y=dnorm(x, mean=0, sd=1)) %>% ggplot(mapping=aes(y=y,x=x)) + geom_line()+
labs(x=NULL, y="Density", title = "Density, standard normal [dnorm(x, mean=0, sd=1)]")+
theme_pander() + scale_color_gdocs()
lower<-qgamma(0.001, shape=5, rate=3)
upper<-qgamma(0.999, shape=5, rate=3)
x<-seq(lower,upper,0.01)
data.frame(x,y=dgamma(x, shape=5, rate=3)) %>% ggplot(mapping=aes(y=y,x=x)) + geom_line()+
labs(x=NULL, y="Density", title = "Density, gamma [dgamma(x,shape=5, rate=3)]")+
theme_pander() + scale_color_gdocs()
An example of an amount model, assuming a normal response distribution
# A tibble: 7 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -95.8 6.89 -13.9 5.59e- 44
2 age 3.08 0.129 24.0 1.05e-126
3 sexf -17.2 1.80 -9.56 1.24e- 21
4 sexu -10.7 3.34 -3.19 1.41e- 3
5 waiting_period14d -6.54 3.74 -1.75 8.01e- 2
6 waiting_period720d -33.8 4.23 -7.98 1.51e- 15
7 waiting_period90d -31.7 4.08 -7.77 7.80e- 15
The models above were based upon ungrouped data, but earlier we noted that data are often grouped. In grouping data, some detail is usually lost e.g. we no longer have any true continuous categorical response variables. In this case, the columns relating to actual events (here claims) are a sum of the individual instance; more at Piet de Jong, Gillian Z. Heller (2008), p49, 105.
The example below compares a model of counts by age and sex on the grouped and shows that the derived parameters are materially similar.
Use tidy to visualise the modelled result for grouped data:
tidy(model_4)
# A tibble: 4 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -9.83 0.192 -51.3 0
2 age 0.0918 0.00394 23.3 4.41e-120
3 sexf -0.741 0.0703 -10.5 5.56e- 26
4 sexu -0.461 0.119 -3.88 1.04e- 4
Modelling using ungrouped data, coefficients materially similar
Use tidy to visualise the modelled result for ungrouped data:
tidy(model_5)
# A tibble: 4 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -9.83 0.192 -51.3 0
2 age 0.0918 0.00394 23.3 4.42e-120
3 sexf -0.741 0.0703 -10.5 5.57e- 26
4 sexu -0.461 0.119 -3.88 1.04e- 4
When the data are grouped, it is important to consider the level of exposure in a given group. This can be achieved with an offset term (Piet de Jong, Gillian Z. Heller (2008), p66-67). This is important in an insurance context where we are often interested in modelling rates of claim.
# A tibble: 4 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -3454137. 526285. -6.56 2.39e-10
2 age 50880. 10976. 4.64 5.37e- 6
3 sexf 808044. 134595. 6.00 5.70e- 9
4 sexu 1087955. 542771. 2.00 4.59e- 2
# A tibble: 4 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 8153. 5657. 1.44 0.151
2 age 204. 118. 1.73 0.0842
3 sexf -3454. 1447. -2.39 0.0176
4 sexu -3801. 5834. -0.652 0.515
The models we have fitted above are based upon actual claim incidences. We can consider the difference between some expected claim rate (e.g. from a standard table) and the actuals.
Using the grouped data again, we model A/E with a normal response distribution. The model shows that none of the coeffients are significant indicating no material difference to expected, which is consistent with the data as the observations were derived from the expected probabilities initially.
Call:
glm(formula = inc_count_sick/inc_count_sick_exp ~ age + sex,
data = df_grp)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.19 -0.97 -0.93 -0.87 2571.52
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.725747 0.456016 1.591 0.112
age 0.004747 0.009713 0.489 0.625
sexf -0.091246 0.154067 -0.592 0.554
sexu 0.174939 0.208211 0.840 0.401
(Dispersion parameter for gaussian family taken to be 514.2801)
Null deviance: 56358742 on 109589 degrees of freedom
Residual deviance: 56357894 on 109586 degrees of freedom
AIC: 995154
Number of Fisher Scoring iterations: 2
This method builds a regression model from a set of candidate predictor variables by entering predictors based on p values, in a stepwise manner until there are no variables left to enter any more. Model may over/ understate the importance of predictors and the direction of stepping (forward or backward) can impact the outcome - so some degree of interpretation is necessary.
# specify a null model with no predictors
null_model_sick <- glm(inc_count_sick ~ 1, data = df_train, family = "binomial")
# specify the full model using all of the potential predictors
full_model_sick <- glm(inc_count_sick ~ cal_year + policy_year + sex + smoker + benefit_period + waiting_period + occupation + poly(age,3) + sum_assured + policy_year*age + policy_year*sum_assured, data = df_train, family = "binomial")
# alternatively, glm(y~ . -x1) fits model using all variables excluding x1
# use a forward stepwise algorithm to build a parsimonious model
step_model_sick <- step(null_model_sick, scope = list(lower = null_model_sick, upper = full_model_sick), direction = "forward")
Start: AIC=20948.4
inc_count_sick ~ 1
Df Deviance AIC
+ age 1 20499 20503
+ poly(age, 3) 3 20497 20505
+ waiting_period 3 20581 20589
+ policy_year 1 20833 20837
+ sum_assured 1 20842 20846
+ sex 2 20844 20850
+ smoker 2 20925 20931
+ cal_year 1 20932 20936
<none> 20946 20948
+ benefit_period 2 20945 20951
+ occupation 4 20941 20951
Step: AIC=20503.31
inc_count_sick ~ age
Df Deviance AIC
+ waiting_period 3 20135 20145
+ sex 2 20398 20406
+ smoker 2 20478 20486
+ sum_assured 1 20482 20488
<none> 20499 20503
+ policy_year 1 20498 20504
+ cal_year 1 20499 20505
+ poly(age, 3) 2 20497 20505
+ occupation 4 20494 20506
+ benefit_period 2 20498 20506
Step: AIC=20145.08
inc_count_sick ~ age + waiting_period
Df Deviance AIC
+ sex 2 20033 20047
+ smoker 2 20114 20128
+ sum_assured 1 20118 20130
<none> 20135 20145
+ policy_year 1 20134 20146
+ poly(age, 3) 2 20133 20147
+ cal_year 1 20135 20147
+ occupation 4 20129 20147
+ benefit_period 2 20134 20148
Step: AIC=20047.43
inc_count_sick ~ age + waiting_period + sex
Df Deviance AIC
+ smoker 2 20012 20030
+ sum_assured 1 20017 20033
<none> 20033 20047
+ policy_year 1 20032 20048
+ poly(age, 3) 2 20031 20049
+ cal_year 1 20033 20049
+ occupation 4 20028 20050
+ benefit_period 2 20032 20050
Step: AIC=20030.01
inc_count_sick ~ age + waiting_period + sex + smoker
Df Deviance AIC
+ sum_assured 1 19995 20015
<none> 20012 20030
+ policy_year 1 20011 20031
+ poly(age, 3) 2 20009 20031
+ cal_year 1 20012 20032
+ occupation 4 20006 20032
+ benefit_period 2 20011 20033
Step: AIC=20015.39
inc_count_sick ~ age + waiting_period + sex + smoker + sum_assured
Df Deviance AIC
<none> 19995 20015
+ policy_year 1 19994 20016
+ poly(age, 3) 2 19993 20017
+ cal_year 1 19995 20017
+ benefit_period 2 19994 20018
+ occupation 4 19995 20023
summary(full_model_sick)
Call:
glm(formula = inc_count_sick ~ cal_year + policy_year + sex +
smoker + benefit_period + waiting_period + occupation + poly(age,
3) + sum_assured + policy_year * age + policy_year * sum_assured,
family = "binomial", data = df_train)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.2568 -0.0983 -0.0740 -0.0512 4.0416
Coefficients: (1 not defined because of singularities)
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.710e+01 6.626e+01 -0.711 0.47714
cal_year 2.058e-02 3.283e-02 0.627 0.53070
policy_year -1.077e-01 1.018e-01 -1.058 0.28991
sexf -7.244e-01 8.052e-02 -8.997 < 2e-16 ***
sexu -3.912e-01 1.333e-01 -2.934 0.00334 **
smokers 3.464e-01 7.424e-02 4.666 3.07e-06 ***
smokeru -1.076e-01 1.239e-01 -0.869 0.38493
benefit_period2yr -9.245e-02 8.108e-02 -1.140 0.25414
benefit_period5yr -3.854e-02 7.952e-02 -0.485 0.62796
waiting_period14d -2.306e-01 1.071e-01 -2.154 0.03121 *
waiting_period720d -1.837e+00 1.844e-01 -9.964 < 2e-16 ***
waiting_period90d -1.595e+00 1.552e-01 -10.281 < 2e-16 ***
occupationsed -2.305e-02 9.765e-02 -0.236 0.81336
occupationtechn -5.071e-02 1.022e-01 -0.496 0.61970
occupationwhite -6.020e-02 1.013e-01 -0.594 0.55238
occupationblue -3.262e-02 1.189e-01 -0.274 0.78383
poly(age, 3)1 3.006e+02 4.680e+01 6.423 1.34e-10 ***
poly(age, 3)2 -4.075e+01 2.723e+01 -1.497 0.13445
poly(age, 3)3 -7.897e+00 2.002e+01 -0.394 0.69328
sum_assured 4.281e-05 2.744e-05 1.560 0.11867
age NA NA NA NA
policy_year:age 2.424e-03 2.110e-03 1.149 0.25072
policy_year:sum_assured -4.251e-08 3.969e-06 -0.011 0.99145
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 20946 on 449999 degrees of freedom
Residual deviance: 19987 on 449978 degrees of freedom
AIC: 20031
Number of Fisher Scoring iterations: 9
summary(step_model_sick)
Call:
glm(formula = inc_count_sick ~ age + waiting_period + sex + smoker +
sum_assured, family = "binomial", data = df_train)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.2830 -0.0973 -0.0745 -0.0519 4.0246
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -9.640e+00 2.432e-01 -39.635 < 2e-16 ***
age 8.953e-02 4.755e-03 18.829 < 2e-16 ***
waiting_period14d -2.295e-01 1.071e-01 -2.144 0.03202 *
waiting_period720d -1.835e+00 1.844e-01 -9.951 < 2e-16 ***
waiting_period90d -1.594e+00 1.552e-01 -10.273 < 2e-16 ***
sexf -7.238e-01 8.052e-02 -8.989 < 2e-16 ***
sexu -3.916e-01 1.333e-01 -2.937 0.00331 **
smokers 3.466e-01 7.424e-02 4.669 3.03e-06 ***
smokeru -1.067e-01 1.239e-01 -0.862 0.38885
sum_assured 4.333e-05 1.064e-05 4.072 4.66e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 20946 on 449999 degrees of freedom
Residual deviance: 19995 on 449990 degrees of freedom
AIC: 20015
Number of Fisher Scoring iterations: 9
The form of the final step model is glm(formula = inc_count_sick ~ poly(age, 3) + waiting_period + sex + smoker + sum_assured, family = “binomial,” data = df_train) i.e. dropping some of the explanatory variables from the full model.
We can compute the confidence intervals for one or more parameters in a fitted model.
confint(step_model_sick) # add second argument specifying which parameters we need a confint for
2.5 % 97.5 %
(Intercept) -1.011939e+01 -9.1658786725
age 8.021044e-02 0.0988490230
waiting_period14d -4.333078e-01 -0.0131725684
waiting_period720d -2.205152e+00 -1.4802247926
waiting_period90d -1.900406e+00 -1.2909882621
sexf -8.848852e-01 -0.5690645113
sexu -6.636738e-01 -0.1401306177
smokers 1.985219e-01 0.4896777597
smokeru -3.586694e-01 0.1276681057
sum_assured 2.248727e-05 0.0000642007
predict() is a generic function that can be used to predict results from various model forms. The function takes the form below. For logistic regression, setting prediction type to response produces a probability rather than log odds (which are difficult to interpret).
# predictions
pred_inc_count_sick <- as.data.frame(
predict(step_model_sick, data= df_train, # model and data
type="response", # or terms for coefficients
se.fit = TRUE, # default is false
interval = "confidence", #default "none", also "prediction"
level = 0.95
# ...
)
)
# add back to data
ifrm("df_train_pred")
pred_inc_count_sick <- rename(pred_inc_count_sick,pred_rate_inc_count_sick=fit,se_rate_inc_count_sick=se.fit)
df_train_pred <- cbind(df_train,pred_inc_count_sick)
We can plot the results by age and sex against the crude rates from earlier:
# from earlier, summarise data by age and sex
df_train_pred %>% filter(sex != "u", between(age, 30,60)) %>% group_by(age,sex) %>%
summarise(total_sick=sum(inc_count_sick),total_acc=sum(inc_count_acc), pred_total_sick=sum(pred_rate_inc_count_sick),exposure=n(),.groups = 'drop') %>%
mutate(sick_rate = total_sick/exposure,pred_sick_rate = pred_total_sick/exposure, acc_rate = total_acc/exposure) %>%
# used ggplot to graph the results
ggplot(aes(x=age,y=sick_rate,color=sex)) +
# ylim(0,1) +
geom_point() +
# add a modeled line
geom_line(aes(x=age,y=pred_sick_rate)) +
theme_pander() + scale_color_gdocs()
Earlier we split the data into “training” data used to create the model and “test” data which we intended to use for performance validation. Adding predictions to test data below.
# predictions
pred_inc_count_sick <- as.data.frame(
predict(step_model_sick, data= df_test, # test data
type="response",
se.fit = TRUE,
interval = "confidence",
level = 0.95
# ...
)
)
# add back to data
ifrm("df_test_pred")
pred_inc_count_sick <- rename(pred_inc_count_sick,pred_rate_inc_count_sick=fit,se_rate_inc_count_sick=se.fit)
df_test_pred <- cbind(df_test,pred_inc_count_sick)
# summary stats for prediction
hist(df_test_pred$pred_rate_inc_count_sick,main = "Histogram of predicted sickness rate",xlab = "Probability of claim")
summary(df_test_pred$pred_rate_inc_count_sick)
Min. 1st Qu. Median Mean 3rd Qu. Max.
6.024e-05 1.363e-03 2.787e-03 3.498e-03 4.748e-03 3.925e-02
To translate the predicted probabilities into a vector of claim/no-claim for each policy we could define a claim as occurring if modelled/ predicted probability of claim is greater than some threshold value. More on this later under evaluation techniques.
# add binary prediction based upon a threshold probability
df_test_pred$pred_inc_count_sick <- ifelse(df_test_pred$pred_rate_inc_count_sick>0.003,1,0) # example threshold is ~3rd quartile probability of claim. for balanced data this should be closer to 0.5.
For least squares regression the \(R^{2}\) statistic (coefficient of determination) measures the proportion of variance in the dependent variable that can be explained by the independent variables. Adjusted \(R^{2}\), adjusts for the number of predictors in the model. The adjusted \(R^{2}\) increases when the new predictor improves the model more than would be expected by chance. The glm function uses a maximum likelihood estimator which does not minimize the squared error.
AIC stands for Akaike Information Criteria. \(AIC = -2l+2p\) where \(l\) is the log likelihood and \(p\) are the number of parameters. It is analogous to adjusted \(R^{2}\) and is the measure of fit which penalizes model for the number of independent variables. We prefer a model with a lower AIC value.
The results below show the AIC for model 3 is lower than model 1 and that the final step model has the lowest AIC of those evaluated (preferred).
AIC(model_1)
[1] 20503.31
AIC(model_3)
[1] 20048.19
stepAIC(step_model_sick)
Start: AIC=20015.39
inc_count_sick ~ age + waiting_period + sex + smoker + sum_assured
Df Deviance AIC
<none> 19995 20015
- sum_assured 1 20012 20030
- smoker 2 20017 20033
- sex 2 20097 20113
- age 1 20352 20370
- waiting_period 3 20360 20374
Call: glm(formula = inc_count_sick ~ age + waiting_period + sex + smoker +
sum_assured, family = "binomial", data = df_train)
Coefficients:
(Intercept) age waiting_period14d
-9.640e+00 8.953e-02 -2.295e-01
waiting_period720d waiting_period90d sexf
-1.835e+00 -1.594e+00 -7.238e-01
sexu smokers smokeru
-3.916e-01 3.466e-01 -1.067e-01
sum_assured
4.333e-05
Degrees of Freedom: 449999 Total (i.e. Null); 449990 Residual
Null Deviance: 20950
Residual Deviance: 20000 AIC: 20020
An anova comparison below between model_3 and the step model using a Chi-squared test shows a small p value for the stepped model - indicating that the model is an improvement. F-test can be used on continuous response models.
anova(model_3,step_model_sick, test="Chisq")
Analysis of Deviance Table
Model 1: inc_count_sick ~ age + policy_year + sex + waiting_period
Model 2: inc_count_sick ~ age + waiting_period + sex + smoker + sum_assured
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 449992 20032
2 449990 19995 2 36.795 1.023e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The likelihood ratio test compares two models based upon log likelihoods; more at Piet de Jong, Gillian Z. Heller (2008), p74.
The test below concludes that the step model is more accurate than the less complex model.
lrtest(model_3,step_model_sick)
Likelihood ratio test
Model 1: inc_count_sick ~ age + policy_year + sex + waiting_period
Model 2: inc_count_sick ~ age + waiting_period + sex + smoker + sum_assured
#Df LogLik Df Chisq Pr(>Chisq)
1 8 -10016.1
2 10 -9997.7 2 36.795 1.023e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Other validations tests could be considered like the Wald test, Score test; see Piet de Jong, Gillian Z. Heller (2008), p74-77.
Residuals/ errors are the observed less fitted values. Traditional residual plots (shown below) are usually a good starting point (we expect to see no trend in the plot of residuals vs fitted values), but are not as informative for logistic regression or for data with a low probability outcome.
Plots, linear regression example:
Logistic regression example:
For logistic regression we can try other tools to test the residuals against the model assumptions. GLMs assume that the residuals/ errors are normally distributed. Plotting the density of the residuals gives us:
Focusing the x axis range:
A binned residual plot divides data into bins based upon their fitted values, showing the average residuals vs fitted value for each bin (Jeff Webb 2017):
# from the arm package
binnedplot(fitted(step_model_sick),
residuals(step_model_sick, type = "response"),
nclass = 50,
xlab = "Expected Values",
ylab = "Average residual",
main = "Binned residual plot",
cex.pts = 0.8,
col.pts = 1,
col.int = "gray")
Grey lines are 2 se bands (~95%). Apart from a few outliers, most of the residuals are within those bands.
The P-P (probability–probability) plot is a visualization that plots CDFs of the two distributions (empirical and theoretical) against each other, an unrelated dummy example below. It can be used to assess the residuals for normality.
Pearson dispersion test: This function calculates Pearson Chi2 statistic and the Pearson-based dispersion statistic. Values of the dispersion greater than 1 indicate model overdispersion. Values under 1 indicate under-dispersion.
P__disp(step_model_sick)
pearson.chi2 dispersion
457855.78948 1.01748
With a binary prediction from the model loaded within the dataframe (defined earlier), we can compare this to the actual outcomes to determine the validity of the model. In this comparison, the true positive rate is called the Sensitivity. The inverse of the false-positive rate is called the Specificity.
Where:
A perfect classification model could have Sensitivity and Specificity close to 1. However, we noted earlier, in insurance applications we are not often interested in in an accurate prediction as to whether a given policy gives rise to a claim. Rather we are interested in understanding the claim rates and how they are explained by the response variables/ risk factors (Piet de Jong, Gillian Z. Heller 2008, p108–109).
A confusion matrix is a tabular representation of Observed vs Predicted values. It helps to quantify the efficiency (or accuracy) of the model.
pred_inc_count_sick
inc_count_sick 0 1
0 239855 208606
1 808 731
Accuracy Rate = 0.5346356 ; Missclasification Rate = 0.4653644 ; True Positive Rate/Sensitivity = 0.4749838
False Positive Rate = 0.4651597 ; Specificity = 0.5348403
The ROC (Receiver Operating Characteristic) curve is a graph with:
ROC curves start at 0 on the x and y axis and rise to 1. The faster the curve reaches a True Positive Rate of 1, the better the curve generally. A model on the diagonal is only showing a 50/50 chance of correctly guessing the probability of claim. Area under an ROC curve (AUC) is a measure of the usefulness of a model in general, where a greater area means more useful. AUC is a tool for comparing models (generally, the closer the AUC is to 1, the better, but there are some cases where AUC can be misleading; AUC = 0.5 is a model on the diagonal).
In our binary model, the AUC is not much better than 0.5, indicating a very weak predictive ability. It isn’t hugely surprising that our model is not very effective at predicting individual claims. As noted earlier, for insurance applications, we are usually more concerned with predicting the claim rate and how it varies by predictors like age and sex (Piet de Jong, Gillian Z. Heller 2008, p108–109).
as.numeric(roc$auc)
[1] 0.5074699
coords(roc, "best", ret = "threshold")
threshold
1 0.001932091