Background

Bayesian statistics reflects uncertainty about model parameters by assuming some prior probability distribution for those parameters. That prior uncertainty is adjusted for observed data to produce a posterior distribution.

Under a frequentest approach, model parameters are fixed and the observations are assumed to be sampled from a probability distribution with those parameters.

Bayes theorem refresher:

\(\Huge {p}(\theta,x) = \frac{{p}(x,\theta){p}(\theta)}{{p}(x)}\)

Where

\({p}(\theta,x)\) is the posterior, a combination of the likelihood, prior and evidence. Probability distribution of \(\theta\) given the observed data, \(x\).
\({p}(x,\theta)\) is the likelihood which summarizes the likelihood of observing data \(x\) under different values of the underlying support parameter \(\theta\).
\({p}(\theta)\) is the prior, the probability distribution of \(\theta\), independent of the observed data.
\({p}(x)\) is the evidence (normalising term), the probability of the observed data \(x\), independent of \(\theta\).

The evidence term is proportional to the likelihood and therefore the theorem is often stated as

\(\Huge p(\theta | x) \propto p(x | \theta) p(\theta)\)

A paper presented at the 2021 Actuaries (Institute) Summit Hugh Miller, Isabella Lyons (2012) (p2) sets out the ‘tensions’ that exist between Frequentest and Bayesian approaches and suggests that:

“Despite actuaries being pioneers in early work on credibility theory, Bayesian approaches have largely given way to frequentist approaches. This is to our loss, since such methods have advanced substantially over the past couple of decades and they are well-suited to problems with latent variables or noise around underlying ‘true’ values.”

Here we’ll look at a few simple applications of Bayesian techniques.

Applications

The sections below set out a simple example Bayesian refresher, regression models and other applications.

Libraries

A list of packages used in the recipes.

Mental health example

The 2017 Actuaries Institute Green Paper on Mental Health and Insurance Mental Health and Insurance (2017) talks about some of the challenges in obtaining reliable data and projections on mental health in insurance (Mental Health and Insurance 2017, p26):

“There are sources of data and research about the prevalence and profile of people with different mental health conditions, such as the 2007 National Survey of Mental Health and Wellbeing and many other epidemiological and clinical studies… [T]o have a complete and informed view of all this information is a difficult task for a full-time academic, let alone for an insurer.

Even then, the nature and granularity of the data needed for insurance applications is different from that for population, public health or clinical purposes. Firstly, the ‘exposure’ (the number of insured people and their characteristics) needs to be recorded. There is probably a lot of relevant information that is not collected at the start of an insurance policy and can therefore never be analysed.

Secondly, the claims information needs to be available at a detailed level regarding the nature and cause of the claim and other characteristics of the individual, their history and the cover provided. There is a clear need for consistency in definition, language and data standards to improve the quality of information available across the system.

Naturally, when the insurance cover for mental health conditions is not provided (as in the case of most travel insurance products) there will not be relevant data from the insurance history. However, even when relevant data exists within insurers there are practical and commercial difficulties in turning it into a useful form.”

The paper goes on to talk about some of the insurance challenges in detail including

subjectivity in diagnosis
reliance on self-reporting
varying severity and prospects of recovery
impact of co-morbidities
correlation with financial incentive

Based upon data from a single insurer ~19% of IP and TPD claims relate to mental health, with mental health claims contributing 26% to the total cost of claims (in 2015, Mental Health and Insurance (2017), p17). This is roughly consistent with the proportion of the general population that experiences an episode of mental illness in a 12 month period, “Mental Health and Insurance” (2020), p110.

In this simple example, we want to know what the rate of mental illness, \(\theta\) is in the insured population. Assume some true underlying \(\theta\), here a fixed value of 20%. Create a population, \(X\) with \(X \sim B(N,\theta)\) where N=100000. Pull a random sample from population.

Show code

theta_true <- 0.2 
pop <- rbinom(100000,1,theta_true)
sample_n = 500
sample <- sample(pop,sample_n)
sample_success <- sum(sample)
sample_p <- sample_success/sample_n

Choosing a prior

The posterior is proportional to the product of the prior and the likelihood. The beta distribution is a conjugate prior for the binomial distribution (posterior is the same probability distribution family as prior). Assume \(\theta \sim {\sf Beta}(\alpha, \beta)\), then:

\(f(\theta;\alpha,\beta) = \theta^{\alpha-1}(1-\theta)^{\beta-1}\)

It represents the probabilities assigned to values of \(\theta\) in the domain (0,1) given values for the parameters \(\alpha\) and \(\beta\). The binomial posterior distribution represents the probability of values of \(X\) given \(\theta\). Varying the values of \(\alpha\) and \(\beta\) can represent a wide range of different prior beliefs about the distribution of \(\theta\).

Priors can be “informative” “uninformative” or “weakly informative”, e.g.

non-informative - uniform dist => \(\theta\) could be anywhere in (0,1)
weakly informative uniform prior => believe \(\theta\) more likely to be at either end of the ranges (less likely to be in the middle)
strongly informative prior
- a and b both high (bimodial) => true theta likely to be at center of range
- a higher than b (strong “success”) => success more likely than failure (where “success” is a mental health claim)
- b higher than a (strong “failure”) => failure more likely than success

Defining prior plots:

Show code

plot_prior <- function(a,b){
  theta = seq(0,1,0.005)
  p_theta = dbeta(theta, a, b)
  p <- qplot(theta, p_theta, geom='line')
  p <- p + theme_bw()
  p <- p + ylab(expression(paste('p(',theta,')', sep = '')))
  p <- p + xlab(expression(theta))
  return(p)}

Non-informative prior, uniform

Show code

plot_prior(1,1) + labs(title="Uniform Prior, a= 1 and b = 1") + theme(plot.title = element_text(hjust = 0.5))

Weakly informative prior, bimodial

Show code

plot_prior(0.5,0.5) + labs(title="Bimodial Prior, a= 0.5 and b = 0.5") + theme(plot.title = element_text(hjust = 0.5))

Informative prior, “failure” more likely

The higher the number of observations, the greater the influence on the posterior

Show code

plot_prior(5,30) + labs(title="Uniform Prior, a= 5 and b = 30") + theme(plot.title = element_text(hjust = 0.5))

Show code

plot_prior(20,120) + labs(title="Uniform Prior, a= 20 and b = 120") + theme(plot.title = element_text(hjust = 0.5))

Deriving posterior

Prior plot values. Re-express binomial as beta dist.

Show code

  prior_a = 50 #successes
  prior_b = 250 #failures
  prior_n = prior_a + prior_b #observations
  prior_theta = prior_a/prior_n
  prior <- data.frame('Dist'='Prior','x'=seq(0,1,0.005), 'y'=dbeta(seq(0,1,0.005),prior_a,prior_b))

Observations are taken from the earlier binomial sample.

Show code

  obs_a <- sample_success + 1
  obs_b <- sample_n - sample_success + 1
  observations <- data.frame('Dist'='Observations',x=seq(0,1,0.005), y=dbeta(seq(0,1,0.005),obs_a,obs_b))

In this example, the posterior is a simple combination of the prior and observations.

Show code

  post_a <- sample_success + prior_a - 1 #combination of prior and observed successes
  post_b <- sample_n - sample_success + prior_b - 1
  post_theta <- post_a / (post_a + post_b)
  posterior <- data.frame('Dist'='Posterior','x'=seq(0,1,0.005), 'y'=dbeta(seq(0,1,0.005),post_a,post_b))

This can be plotted as:

Show code

model_plot <- rbind(prior,observations,posterior)
with(model_plot, Dist <- factor(Dist, levels = c('Prior', 'Observations','Posterior'), ordered = TRUE))

qplot(model_plot$x, model_plot$y, geom='line',color = model_plot$Dist) +
ylab(expression(paste('p(',theta,')', sep = ''))) +
xlab(expression(theta)) +
geom_vline(xintercept = post_theta, linetype="dotted", color = "black", label="blah") +
scale_color_discrete(name = "Dist")  + 
annotate("text",x=post_theta, y=0, label=label_percent()(post_theta)) +
labs(title="Informative prior", subtitle="Low success") + 
theme(plot.title = element_text(hjust = 0.5)) +
theme(plot.subtitle = element_text(hjust = 0.5))

Regression - logistic

Background

RSTANARM is an interface in R for STAN where Stan is a platform for Bayesian inference - it is a probabilistic programming framework where the guts of the inference method happens in the background, while the user focuses on parameterising the model in R. RJAGS is an interface in R for JAGS (Just Another Gibbs Sample) where JAGS is similarly a platform for Bayesian modeling.

Models in both STAN and JAGS can make use of Markov chains Monte Carlo (MCMC) methods. MCMC methods provide a way to take random samples approximately from a posterior distribution. Such samples can be used to summarize any aspect of the posterior distribution of a statistical model.

RSTANARM and RJAGS have pre-written code for common models like regression.

Under a Frequentest approach to linear regression, for a given set of response variables y, the relationship between y and a set of regressor variables x is assumed to be linear (informed by a set of parameters \(\beta\)), with an error term, i.e. \(y = X\beta + \epsilon\).

The error term, \(\epsilon\) is assumed to be normally distributed and the regression co-efficients are estimated by minimising the error term commonly using ordinary least squares (OLS). We make some other assumptions about the error term - not correlated with X and i.i.d.; and define \(Var(\epsilon)=\sigma^2\). The generalised form allows the response variable to have an error that is not normal - this is achieved by allowing the linear model to be related to the response variable via a link function and by allowing the magnitude of the variance of each measurement to be a function of its predicted value.

With Bayesian inference we are interested in the posterior distributions of the parameters \(\beta\) and \(\sigma\), rather than point estimates. We also want to influence those distributions with some prior knowledge of their behaviour:

\(p(\beta,\sigma|y,X) \propto p(y|\beta,\sigma)p(\beta,\sigma)\)

\(p(\beta,\sigma|y,X)\) is not a normalised probability density function (i.e. is proportional to the likelihood and prior). However, we can sample from it and therefore estimate any quantity of interest e.g. mean or variance, (William H. Press 2007, p825). For this we use MCMC, William H. Press (2007), p825:

“MCMC is a random sampling method… the goal is to visit a point \(x\) with a probability proportional to some given distribution function \(\pi(x)\) [not necessarily a probability]. Why would we want to sample a distribution in this way? THe answer is that Bayesian methods, often implemented using MCMC, provide a powerful way of estimating the parameters of a model and their degree of uncertainty.”

Heart disease data

The Framingham Heart Study is a longitudinal study of cardiovastular disease in a sample population within Framingham, Massachusetts. A subset of data for teaching is available at request from the U.S. National Heart, Lung and Blood Institute. This dataset is not appropriate for publication and has been anonymised, but is nevertheless useful for illustrative purposes here. The dataset:

“..is a subset of the data collected as part of the Framingham study and includes laboratory, clinic, questionnaire, and adjudicated event data on 4,434 participants. Participant clinic data was collected during three examination periods, approximately 6 years apart, from roughly 1956 to 1968. Each participant was followed for a total of 24 years for the outcome of the following events: Angina Pectoris, Myocardial Infarction, Atherothrombotic Infarction or Cerebral Hemorrhage (Stroke) or death.”

We’ll look at a logistic regression model example using predefined Bayesian models in RSTANARM.

Show code

df_heart <- read.csv(file = 'frmgham2.csv') 
df_heart <- df_heart[c(1:23)] %>% filter(PERIOD==2) # ignoring health event data, only looking at observations at period == 2 (observations for participants are not independent over time; this does have limitations e.g. observations are truncated at death which could be due to heart disease)
ignore_cols <- names(df_heart) %in% c("PERIOD","RANDID","TIME", "PREVMI", "PREVAP") #exclude identifier and time columns; exclude  PREVMI and PREVAP as they are a subset of PREVCHD
df_heart <- df_heart[!ignore_cols]
head(df_heart)

  SEX TOTCHOL AGE SYSBP DIABP CURSMOKE CIGPDAY   BMI DIABETES BPMEDS
1   2     260  52   105  69.5        0       0 29.43        0      0
2   1     283  54   141  89.0        1      30 25.34        0      0
3   2     232  67   183 109.0        1      20 30.18        0      0
4   2     343  51   109  77.0        1      30 23.48        0      0
5   2     230  49   177 102.0        0       0 31.36        0      1
6   2     220  70   149  81.0        0       0 36.76        0      0
  HEARTRTE GLUCOSE educ PREVCHD PREVSTRK PREVHYP HDLC LDLC
1       80      86    2       0        0       0   NA   NA
2       75      87    1       0        0       0   NA   NA
3       60      89    3       0        0       1   NA   NA
4       90      72    3       0        0       0   NA   NA
5      120      86    2       0        0       1   NA   NA
6       80      98    1       1        0       1   NA   NA

Data description highlights:

RANDIND: unique identifier.
SEX: 1=male; 2=female (categorical, 1/2).
TIME: number of days since baseline exam (continuous).
PERIOD: examination cycle (categorical, 1,2 or 3). We’ll focus on period 2 - observations for the same participant are not independent over time.
TOTCHOL: total cholesterol level (continuous).
AGE: Age of the patient rounded to nearest year (continuous).
SYSBP: systolic blood pressure (continuous).
DIABP: diastolic blood pressure (continuous).
CURSMOKE: whether or not the patient is a current smoker (categorical, 0/1).
CIGPDAY: the number of cigarettes smoked per day on average (continuous).
BMI: Body Mass Index (continuous).
DIABETES: whether or not the patient had diabetes (categorical, 0/1).
BPMEDS: whether or not the patient was on blood pressure medication (categorical, 0/1).
HEARTRTE: heart rate (continuous).
GLUCOSE: glucose level (continuous).
EDU: attained Education (categorical, 1=0-11 year; 2=High School Diploma, GED; 3=Some College, Vocational School; 4=College (BS, BA) degree or more).
HDLC: High Density Lipoprotein Cholesterol - only available for period 3 (continuous).
LDLC: Low Density Lipoprotein Cholesterol - only available for period 3 (continuous).
PREVCHD: prevalence of Coronary Heart Disease (categorical, 0/1).
PREVAP: prevalence of Angina Pectoris (categorical, 0/1). Excluded from model as it is a subset of PREVCHD.
PREVMI: Prevalence of Myocardial Infarction (categorical, 0/1). Excluded from model as it is a subset of PREVCHD.
PREVSTRK: whether or not the patient had previously had a stroke (categorical, 0/1).
PREVHYP: whether or not the patient was hypertensive (categorical, 0/1).

Using the DataExplore package, we can see that the data is mostly complete apart from missing entries in ‘glucose’ and to a lesser extent ‘education’.

Show code

introduce(df_heart)

  rows columns discrete_columns continuous_columns
1 3930      18                0                 16
  all_missing_columns total_missing_values complete_rows
1                   2                 8720             0
  total_observations memory_usage
1              70740       334464

Show code

plot_intro(df_heart)

Show code

plot_missing(df_heart)

Given around 12% of glucose records are missing as well as all of HDL and LDL (only available for period 3), drop the columns. Also drop rows with other missing data as these are small in number.

Show code

df_heart$GLUCOSE <- NULL
df_heart$HDLC <- NULL
df_heart$LDLC <- NULL
df_heart <- na.omit(df_heart)

Plots of the data:

Show code

plot_bar(df_heart, title="Barplot")

Of the 3.6k remaining records, ~7% have a flag for heart disease:

Show code

table(df_heart$PREVCHD)


   0    1 
3319  253

Further plots and checking distributions:

Show code

plot_histogram(df_heart, title="Histogram for continuous")

Show code

qq_data <- df_heart[,c("BMI", "DIABP", "HEARTRTE", "SYSBP", "TOTCHOL")]
plot_qq(qq_data, title="QQ plot")

Show code

log_qq_data <- update_columns(qq_data, 2:4, function(x) log(x + 1))
plot_qq(log_qq_data, title="QQ plot, logged data")

Most significantly correlated with heart disease are AGE, SYSBP, DIABETES, BPMEDS and PREVHYP.

Show code

plot_correlation(na.omit(df_heart))

Split data into training and testing data sets.

Show code

# Create a random sample of row IDs
sample_rows <- sample(nrow(df_heart),0.75*nrow(df_heart))
# Create the training dataset
df_heart_train <- df_heart[sample_rows,]
# Create the test dataset
df_heart_test <- df_heart[-sample_rows,]

Fit a simple GLM using all data:

Show code

full_model_glm_heart <- glm(PREVCHD~.,data=df_heart_train, family = "binomial")
summary(full_model_glm_heart)


Call:
glm(formula = PREVCHD ~ ., family = "binomial", data = df_heart_train)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -6.8479472  1.2549718  -5.457 4.85e-08 ***
SEX         -1.1151811  0.1822901  -6.118 9.50e-10 ***
TOTCHOL      0.0031949  0.0017375   1.839 0.065946 .  
AGE          0.0846106  0.0115302   7.338 2.17e-13 ***
SYSBP        0.0100912  0.0050029   2.017 0.043691 *  
DIABP       -0.0147439  0.0096099  -1.534 0.124971    
CURSMOKE     0.3163451  0.2544935   1.243 0.213854    
CIGPDAY     -0.0076423  0.0105696  -0.723 0.469652    
BMI          0.0141148  0.0203822   0.693 0.488621    
DIABETES     0.9371080  0.2676098   3.502 0.000462 ***
BPMEDS       0.3173614  0.2360895   1.344 0.178870    
HEARTRTE    -0.0102143  0.0066348  -1.540 0.123682    
educ         0.0004762  0.0777335   0.006 0.995112    
PREVSTRK     0.2917608  0.5267898   0.554 0.579684    
PREVHYP      0.2978378  0.2191547   1.359 0.174137    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1361.4  on 2678  degrees of freedom
Residual deviance: 1178.9  on 2664  degrees of freedom
AIC: 1208.9

Number of Fisher Scoring iterations: 6

Use stepwise regression to reduce the explanatory variables:

Show code

# specify a null model with no predictors
null_model_glm_heart <- glm(PREVCHD ~ 1, data = df_heart_train, family = "binomial")
# use a forward stepwise algorithm to build a parsimonious model
step_model_glm_heart <- step(null_model_glm_heart, scope = list(lower = null_model_glm_heart, upper = full_model_glm_heart), direction = "forward")

Summary of the model:

Show code

summary(step_model_glm_heart)


Call:
glm(formula = PREVCHD ~ AGE + SEX + DIABETES + SYSBP + TOTCHOL + 
    HEARTRTE + BPMEDS, family = "binomial", data = df_heart_train)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -7.587973   0.917775  -8.268  < 2e-16 ***
AGE          0.089303   0.010475   8.525  < 2e-16 ***
SEX         -1.093518   0.175776  -6.221 4.94e-10 ***
DIABETES     1.018590   0.263099   3.872 0.000108 ***
SYSBP        0.008739   0.003593   2.432 0.015006 *  
TOTCHOL      0.003100   0.001745   1.777 0.075605 .  
HEARTRTE    -0.009997   0.006488  -1.541 0.123349    
BPMEDS       0.344999   0.232319   1.485 0.137538    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1361.4  on 2678  degrees of freedom
Residual deviance: 1184.6  on 2671  degrees of freedom
AIC: 1200.6

Number of Fisher Scoring iterations: 6

Dropping BPMEDS, HEARTRTE and BMI as little impact on AIC:

Show code

final_model_glm_heart <- glm(PREVCHD ~ AGE + SEX + PREVHYP + DIABETES + TOTCHOL, family = "binomial",data = df_heart_train)
summary(final_model_glm_heart)


Call:
glm(formula = PREVCHD ~ AGE + SEX + PREVHYP + DIABETES + TOTCHOL, 
    family = "binomial", data = df_heart_train)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -7.667411   0.742295 -10.329  < 2e-16 ***
AGE          0.093372   0.010254   9.106  < 2e-16 ***
SEX         -1.045255   0.170871  -6.117 9.52e-10 ***
PREVHYP      0.441363   0.173035   2.551 0.010750 *  
DIABETES     1.000627   0.259912   3.850 0.000118 ***
TOTCHOL      0.003196   0.001726   1.851 0.064132 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1361.4  on 2678  degrees of freedom
Residual deviance: 1190.5  on 2673  degrees of freedom
AIC: 1202.5

Number of Fisher Scoring iterations: 6

Fit a Bayes glm using the same explanatory variables. Here parameter estimates no longer have test statistics and p-values as in the frequentist regression. This is because Bayesian estimation samples from the posterior distribution. This means that instead of a point estimate and a test statistic, we get a distribution of plausible values for the parameters, and the estimates section of the summary summarizes those distributions.

Further documentation on Bayesian generalised linear models via Stan is here. Documentation on choice of priors is here. If priors are not specified, default weekly informative priors are used - you can call a summary of the modeled priors (see below). Default priors are autoscaled to make them less informative (see below for notes on scaling).

Show code

model_bayes_glm_heart <- stan_glm(PREVCHD ~ AGE + SEX + PREVHYP + DIABETES + TOTCHOL, family = "binomial",data=df_heart_train, iter=2000, warmup=500)
# chains is number of sample paths from posterior; iter is number of samples within a chain; warmup is number of iterations to discard

A summary of the model below. “Sigma” is the standard deviation of the errors; “mean_PPD” is the mean of the posterior predictive samples.MCMC diagnostics - “RHat” is a measure of within chain variance compared to across chain variance (<1.1 implies convergence); “log-posterior” is similar to likelihood.

Show code

print(summary(model_bayes_glm_heart),digits=4)


Model Info:
 function:     stan_glm
 family:       binomial [logit]
 formula:      PREVCHD ~ AGE + SEX + PREVHYP + DIABETES + TOTCHOL
 algorithm:    sampling
 sample:       6000 (posterior sample size)
 priors:       see help('prior_summary')
 observations: 2679
 predictors:   6

Estimates:
              mean    sd      10%     50%     90%  
(Intercept) -7.6675  0.7337 -8.6151 -7.6631 -6.7333
AGE          0.0933  0.0102  0.0801  0.0933  0.1065
SEX         -1.0460  0.1705 -1.2678 -1.0453 -0.8319
PREVHYP      0.4442  0.1769  0.2241  0.4418  0.6744
DIABETES     0.9833  0.2669  0.6348  0.9900  1.3165
TOTCHOL      0.0032  0.0017  0.0010  0.0032  0.0054

Fit Diagnostics:
           mean   sd     10%    50%    90% 
mean_PPD 0.0704 0.0068 0.0616 0.0702 0.0791

The mean_ppd is the sample average posterior predictive distribution of the outcome variable (for details see help('summary.stanreg')).

MCMC diagnostics
              mcse   Rhat   n_eff
(Intercept)   0.0103 0.9998 5052 
AGE           0.0002 1.0000 4593 
SEX           0.0023 1.0002 5532 
PREVHYP       0.0024 0.9999 5369 
DIABETES      0.0034 1.0002 6087 
TOTCHOL       0.0000 1.0000 5686 
mean_PPD      0.0001 0.9999 7301 
log-posterior 0.0332 1.0004 2868 

For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).

The stan model used default normal priors for intercepts and coefficients. Default priors are automatically scaled to limit how informative they are - below we can see the adjusted (scaled) prior for the coefficients differs from the default 2.5:

Show code

# summary of priors
print(prior_summary(model_bayes_glm_heart),digits=4)

Priors for model 'model_bayes_glm_heart' 
------
Intercept (after predictors centered)
 ~ normal(location = 0, scale = 2.5)

Coefficients
  Specified prior:
    ~ normal(location = [0,0,0,...], scale = [2.5,2.5,2.5,...])
  Adjusted prior:
    ~ normal(location = [0,0,0,...], scale = [0.2922,5.0215,4.9996,...])
------
See help('prior_summary.stanreg') for more details

Credible intervals express the probability that a parameter falls within a given range (vs a confidence interval which is the probability that a range contains the true value of the parameter). These credible intervals are computed by finding the relevant quantiles of the draws from the posterior distribution. They produce similar ranges to the confidence intervals from the frequentist approach.

Show code

confint(final_model_glm_heart, level=0.95)

                    2.5 %       97.5 %
(Intercept) -9.1471208758 -6.234324626
AGE          0.0735684164  0.113808648
SEX         -1.3850915606 -0.714354726
PREVHYP      0.1065103678  0.785995423
DIABETES     0.4723299199  1.495122321
TOTCHOL     -0.0002292817  0.006540454

Show code

posterior_interval(model_bayes_glm_heart, prob=0.95)

                     2.5%        97.5%
(Intercept) -9.0944495644 -6.226308036
AGE          0.0737930669  0.113502760
SEX         -1.3732792483 -0.714472520
PREVHYP      0.0962806425  0.788979071
DIABETES     0.4395704549  1.489720023
TOTCHOL     -0.0001426672  0.006534955

Adjusting scales and changing priors

stan_glm allows a selection of prior distributions for the coefficients, intercept, and auxiliary parameters. The prior distributions can be altered by specifying different locations/scales/dfs, but all the prior take the form of a single chosen probability distribution.

Some drivers for wanting to specify a prior: * Available information/ research might suggest a base value for our parameters. Particularly valuable if our own observed data is sparse. * We may not have a good idea of the value, but know that it is constrained in some way e.g. is positive.

The scale is the standard deviation of the prior. As noted earlier, the model autoscales to ensure priors are not too informative. We can turn this off when we specify our own priors. Documentation on choice of priors is here.

To allow for varying distributions across the priors, we need to define the model in STAN or JAGS. See the linear regression example below for an example of this in JAGS.

Specifying our own priors below. Little impact on the posterior estimates.

Show code

model_bayes_glm_heart_scale <- stan_glm(PREVCHD ~ AGE + SEX + PREVHYP + DIABETES + TOTCHOL, family = "binomial",data=df_heart_train, iter=2000, warmup=500,
                            prior_intercept = normal(location = 0, scale = 10, autoscale = FALSE),
                            prior = normal(location = 0, scale = 5, autoscale = FALSE),
                            prior_aux = exponential(rate = 1, autoscale = FALSE)
                            )


SAMPLING FOR MODEL 'bernoulli' NOW (CHAIN 1).
Chain 1: 
Chain 1: Gradient evaluation took 9.1e-05 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0.91 seconds.
Chain 1: Adjust your expectations accordingly!
Chain 1: 
Chain 1: 
Chain 1: Iteration:    1 / 2000 [  0%]  (Warmup)
Chain 1: Iteration:  200 / 2000 [ 10%]  (Warmup)
Chain 1: Iteration:  400 / 2000 [ 20%]  (Warmup)
Chain 1: Iteration:  501 / 2000 [ 25%]  (Sampling)
Chain 1: Iteration:  700 / 2000 [ 35%]  (Sampling)
Chain 1: Iteration:  900 / 2000 [ 45%]  (Sampling)
Chain 1: Iteration: 1100 / 2000 [ 55%]  (Sampling)
Chain 1: Iteration: 1300 / 2000 [ 65%]  (Sampling)
Chain 1: Iteration: 1500 / 2000 [ 75%]  (Sampling)
Chain 1: Iteration: 1700 / 2000 [ 85%]  (Sampling)
Chain 1: Iteration: 1900 / 2000 [ 95%]  (Sampling)
Chain 1: Iteration: 2000 / 2000 [100%]  (Sampling)
Chain 1: 
Chain 1:  Elapsed Time: 10.25 seconds (Warm-up)
Chain 1:                13.006 seconds (Sampling)
Chain 1:                23.256 seconds (Total)
Chain 1: 

SAMPLING FOR MODEL 'bernoulli' NOW (CHAIN 2).
Chain 2: 
Chain 2: Gradient evaluation took 7.7e-05 seconds
Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 0.77 seconds.
Chain 2: Adjust your expectations accordingly!
Chain 2: 
Chain 2: 
Chain 2: Iteration:    1 / 2000 [  0%]  (Warmup)
Chain 2: Iteration:  200 / 2000 [ 10%]  (Warmup)
Chain 2: Iteration:  400 / 2000 [ 20%]  (Warmup)
Chain 2: Iteration:  501 / 2000 [ 25%]  (Sampling)
Chain 2: Iteration:  700 / 2000 [ 35%]  (Sampling)
Chain 2: Iteration:  900 / 2000 [ 45%]  (Sampling)
Chain 2: Iteration: 1100 / 2000 [ 55%]  (Sampling)
Chain 2: Iteration: 1300 / 2000 [ 65%]  (Sampling)
Chain 2: Iteration: 1500 / 2000 [ 75%]  (Sampling)
Chain 2: Iteration: 1700 / 2000 [ 85%]  (Sampling)
Chain 2: Iteration: 1900 / 2000 [ 95%]  (Sampling)
Chain 2: Iteration: 2000 / 2000 [100%]  (Sampling)
Chain 2: 
Chain 2:  Elapsed Time: 7.179 seconds (Warm-up)
Chain 2:                14.675 seconds (Sampling)
Chain 2:                21.854 seconds (Total)
Chain 2: 

SAMPLING FOR MODEL 'bernoulli' NOW (CHAIN 3).
Chain 3: 
Chain 3: Gradient evaluation took 8e-05 seconds
Chain 3: 1000 transitions using 10 leapfrog steps per transition would take 0.8 seconds.
Chain 3: Adjust your expectations accordingly!
Chain 3: 
Chain 3: 
Chain 3: Iteration:    1 / 2000 [  0%]  (Warmup)
Chain 3: Iteration:  200 / 2000 [ 10%]  (Warmup)
Chain 3: Iteration:  400 / 2000 [ 20%]  (Warmup)
Chain 3: Iteration:  501 / 2000 [ 25%]  (Sampling)
Chain 3: Iteration:  700 / 2000 [ 35%]  (Sampling)
Chain 3: Iteration:  900 / 2000 [ 45%]  (Sampling)
Chain 3: Iteration: 1100 / 2000 [ 55%]  (Sampling)
Chain 3: Iteration: 1300 / 2000 [ 65%]  (Sampling)
Chain 3: Iteration: 1500 / 2000 [ 75%]  (Sampling)
Chain 3: Iteration: 1700 / 2000 [ 85%]  (Sampling)
Chain 3: Iteration: 1900 / 2000 [ 95%]  (Sampling)
Chain 3: Iteration: 2000 / 2000 [100%]  (Sampling)
Chain 3: 
Chain 3:  Elapsed Time: 10.085 seconds (Warm-up)
Chain 3:                12.913 seconds (Sampling)
Chain 3:                22.998 seconds (Total)
Chain 3: 

SAMPLING FOR MODEL 'bernoulli' NOW (CHAIN 4).
Chain 4: 
Chain 4: Gradient evaluation took 9.1e-05 seconds
Chain 4: 1000 transitions using 10 leapfrog steps per transition would take 0.91 seconds.
Chain 4: Adjust your expectations accordingly!
Chain 4: 
Chain 4: 
Chain 4: Iteration:    1 / 2000 [  0%]  (Warmup)
Chain 4: Iteration:  200 / 2000 [ 10%]  (Warmup)
Chain 4: Iteration:  400 / 2000 [ 20%]  (Warmup)
Chain 4: Iteration:  501 / 2000 [ 25%]  (Sampling)
Chain 4: Iteration:  700 / 2000 [ 35%]  (Sampling)
Chain 4: Iteration:  900 / 2000 [ 45%]  (Sampling)
Chain 4: Iteration: 1100 / 2000 [ 55%]  (Sampling)
Chain 4: Iteration: 1300 / 2000 [ 65%]  (Sampling)
Chain 4: Iteration: 1500 / 2000 [ 75%]  (Sampling)
Chain 4: Iteration: 1700 / 2000 [ 85%]  (Sampling)
Chain 4: Iteration: 1900 / 2000 [ 95%]  (Sampling)
Chain 4: Iteration: 2000 / 2000 [100%]  (Sampling)
Chain 4: 
Chain 4:  Elapsed Time: 5.002 seconds (Warm-up)
Chain 4:                12.771 seconds (Sampling)
Chain 4:                17.773 seconds (Total)
Chain 4:

Show code

print(summary(model_bayes_glm_heart_scale),digits=4)


Model Info:
 function:     stan_glm
 family:       binomial [logit]
 formula:      PREVCHD ~ AGE + SEX + PREVHYP + DIABETES + TOTCHOL
 algorithm:    sampling
 sample:       6000 (posterior sample size)
 priors:       see help('prior_summary')
 observations: 2679
 predictors:   6

Estimates:
              mean    sd      10%     50%     90%  
(Intercept) -7.7081  0.7267 -8.6259 -7.7068 -6.7884
AGE          0.0940  0.0101  0.0812  0.0941  0.1071
SEX         -1.0480  0.1672 -1.2645 -1.0456 -0.8365
PREVHYP      0.4450  0.1744  0.2238  0.4439  0.6671
DIABETES     0.9902  0.2546  0.6699  0.9993  1.3099
TOTCHOL      0.0032  0.0017  0.0009  0.0032  0.0053

Fit Diagnostics:
           mean   sd     10%    50%    90% 
mean_PPD 0.0703 0.0066 0.0620 0.0702 0.0788

The mean_ppd is the sample average posterior predictive distribution of the outcome variable (for details see help('summary.stanreg')).

MCMC diagnostics
              mcse   Rhat   n_eff
(Intercept)   0.0102 1.0001 5109 
AGE           0.0001 1.0007 4845 
SEX           0.0028 1.0004 3682 
PREVHYP       0.0028 1.0000 3851 
DIABETES      0.0041 0.9998 3785 
TOTCHOL       0.0000 0.9999 5841 
mean_PPD      0.0001 0.9998 5050 
log-posterior 0.0351 1.0016 2381 

For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).

Model fit

We are able to extract predictions from each iteration of the posterior sampling. We can extract predictions for new data. Could be tested against the observations for goodness of fit. The follwing from the manual, RSTANARM Manual (2020), p43:

“The posterior predictive distribution (posterior_predict) is the distribution of the outcome implied by the model after using the observed data to update our beliefs about the unknown parameters in the model. Simulating data from the posterior predictive distribution using the observed predictors is useful for checking the fit of the model. Drawing from the posterior predictive distribution at interesting values of the predictors also lets us visualize how a manipulation of a predictor affects (a function of)the outcome(s). With new observations of predictor variables we can use the posterior predictive distribution to generate predicted outcomes.”

Show code

posteriors <- posterior_predict(model_bayes_glm_heart,newdata=df_heart_test)
posterior <- as.data.frame(posteriors) # each column is a data point from the dataset and each row is a prediction
# Print 10 predictions for 5 lives
print("10 predictions for 5 lives")

[1] "10 predictions for 5 lives"

Show code

posteriors[1:10, 1:5]

      2 6 15 20 26
 [1,] 0 0  0  0  0
 [2,] 0 0  0  0  0
 [3,] 0 0  0  0  0
 [4,] 0 0  0  0  0
 [5,] 0 0  0  0  0
 [6,] 0 0  0  0  0
 [7,] 0 0  0  0  0
 [8,] 0 0  0  0  0
 [9,] 0 0  0  0  0
[10,] 0 0  0  0  0

We can plot the posterior distributions for the model parameters from the samples:

Show code

posterior_heart <- as.matrix(model_bayes_glm_heart)

plot1<- mcmc_areas(posterior_heart, pars = c("AGE"), prob = 0.80)
plot2<- mcmc_areas(posterior_heart, pars = c("SEX"), prob = 0.80)
plot3<- mcmc_areas(posterior_heart, pars = c("PREVHYP"), prob = 0.80)
plot4<- mcmc_areas(posterior_heart, pars = c("DIABETES"), prob = 0.80)
plot5<- mcmc_areas(posterior_heart, pars = c("TOTCHOL"), prob = 0.80)

grid.arrange(plot1,plot2, plot3, plot4, plot5, top="Posterior distributions (median + 80% intervals)")

We can extract the \(R^2\) from the posterior distribution and plot:

Show code

r2_posterior <- bayes_R2(model_bayes_glm_heart)
summary(r2_posterior)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
0.04251 0.07077 0.07970 0.08027 0.08920 0.14700

Show code

hist(r2_posterior) #expect normal

Posterior predictive checks can be extracted using “pp_check”

“dens_overlay” - each light blue line represents the distribution of predicted scores from a single replication, and the dark blue line represents the observed data. If the model fits, the dark blue line should align closely with the light blue lines.
“stat” in the “pp_check” function, we can get a distribution of the mean of the dependent variable from all replications. These are the light blue bars: the means from each replication plotted as a histogram. The mean from the observed data is then plotted on top as a dark blue bar.

Here the outcome is binary so the density overlay outcomes are concentrated at one or zero.

Show code

pp_check(model_bayes_glm_heart, "stat")

Show code

pp_check(model_bayes_glm_heart, "dens_overlay")

Further considerations: The loo package provides “efficient approximate leave-one-out cross-validation (LOO), approximate standard errors for estimated predictive errors (including comparisons), and the widely applicable information criterion (WAIC).The loo package can be used to estimate predictive error of any MCMC item-level log likelihood output”.

Regression - linear

The above example looked at a binary response. In this example we’ll look at predicting a continuous variable using data on my household electricity consumption over 2016-2021. The above example used stan_glm in RSTANARM. Here we will use a model defined in RJAGS with additional flexibility around the choice of priors, which is also possible in a paramaterised model for STAN. RJAGS combines the power of R with the “Just Another Gibbs Sampler” or “JAGS” engine. To get started, first download the “JAGS” program, then set up the packages as above.

Both JAGS and STAN allow for the parameterisation of a range of different models. The example below was chosen for comparison with a conventional GLM approach.

Data prep:

Daily electricity usage for a household in Sydney over 2019-21. Rows represent half hourly readings/ estimates.
Data are grouped into daily usage.
Also includes are temperatures, sunlight and rainfall data for the area taken from BOM.
Calculated fields for weekdays vs weekends, seasons and a post-covid working from home indicator.

Show code

df_usage <- read.csv(file = 'elec_usage.csv')
df_usage <- df_usage %>% mutate(Date = as.Date(Date, format="%Y-%m-%d"), EndDate = parse_date_time(EndDate, '%Y-%m-%d %H:%M:%S'), StartDate = parse_date_time(StartDate, '%Y-%m-%d %H:%M:%S'))
df_usage <- df_usage[c(-2,-3,-4,-6)] %>% # drop start and end dates, rate type, duration
group_by(Date) %>% summarise(Usage = sum(Usage), MaxTemp = max(MaxTemp), MinTemp = min(MinTemp), Rain_mm = max(Rain_mm), Solar_Exposure = max(Solar_Exposure)) %>%
# Add in fields to identify seasons, weekend/ weekday and an indicator for the post-covid/ work from home period
mutate(Season=
  if_else(month(Date)>= 3 & month(Date) < 6,"Autumn", 
  if_else(month(Date)>= 6 & month(Date) < 9,"Winter", 
  if_else(month(Date)>= 9 & month(Date) < 12,"Spring","Summer"))),
Covid_Ind=if_else(Date>= "2020-03-15",1,0),
Weekend=
  if_else(wday(Date)==7 | wday(Date)==1,1,0))

df_usage$Season <- factor(df_usage$Season, levels = c("Summer","Autumn","Winter","Spring")) # change season levels

head(df_usage)

# A tibble: 6 × 9
  Date       Usage MaxTemp MinTemp Rain_mm Solar_Exposure Season
  <date>     <dbl>   <dbl>   <dbl>   <dbl>          <dbl> <fct> 
1 2019-12-21 21.1     26.7    21       0             27.1 Summer
2 2019-12-22  9.53    19.9    18.6     0             11.9 Summer
3 2019-12-23 15.7     22.1    17.2     0             14.4 Summer
4 2019-12-24 18.7     24      18.6     1.8           18.3 Summer
5 2019-12-25 18.9     23.2    21.6     0.2           27.8 Summer
6 2019-12-26 14.6     23.6    21.4     0             30.8 Summer
# ℹ 2 more variables: Covid_Ind <dbl>, Weekend <dbl>

Data is mostly complete:

Show code

introduce(df_usage)

# A tibble: 1 × 9
   rows columns discrete_columns continuous_columns
  <int>   <int>            <int>              <int>
1   732       9                2                  7
# ℹ 5 more variables: all_missing_columns <int>,
#   total_missing_values <int>, complete_rows <int>,
#   total_observations <int>, memory_usage <dbl>

Show code

plot_intro(df_usage)

Show code

plot_missing(df_usage)

Drop records with NAs as they are few, also only include non-zero usage, review explanatory variables; Gamma dist potentially a better fit than normal for Usage given the skewness in the tail; alternatively use normal with a log link.

Show code

df_usage <- na.omit(df_usage)
df_usage <- df_usage %>% filter(Usage>0)

plot_bar(df_usage, title="Barplot")

Show code

plot_histogram(df_usage, title="Histogram for continuous")

Show code

qq_data <- df_usage[,c("MaxTemp", "MinTemp", "Solar_Exposure", "Usage")]
plot_qq(qq_data, title="QQ plot")

Show code

log_qq_data <- update_columns(qq_data, 2:4, function(x) log(x + 1))
plot_qq(log_qq_data, title="QQ plot, logged data")

Show code

usage_fit_gamma <- fitdist(df_usage$Usage, distr = "gamma", method = "mme")
summary(usage_fit_gamma)

Fitting of the distribution ' gamma ' by matching moments 
Parameters : 
       estimate  Std. Error
shape 2.9114935 0.179487630
rate  0.1164095 0.007621289
Loglikelihood:  -2813.89   AIC:  5631.78   BIC:  5640.902 
Correlation matrix:
          shape      rate
shape 1.0000000 0.9416261
rate  0.9416261 1.0000000

Show code

plot(usage_fit_gamma)

Usage correlated with season, max/min temp and covid wfh indicator. Max and Min Temps are not surprisingly highly correlated with one another, with the seasons and with solar exposure.

Show code

plot_correlation(na.omit(df_usage))

Split data into training and testing data sets:

Show code

# Create a random sample of row IDs
sample_rows <- sample(nrow(df_usage),0.75*nrow(df_usage))
# Create the training dataset
df_usage_train <- df_usage[sample_rows,]
# Create the test dataset
df_usage_test <- df_usage[-sample_rows,]

Fit a simple GLM using all variables:

Show code

full_model_glm_usage <- glm(Usage~.,data=df_usage_train[-1], gaussian(link="log"))
summary(full_model_glm_usage)


Call:
glm(formula = Usage ~ ., family = gaussian(link = "log"), data = df_usage_train[-1])

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)     5.4312991  0.1936811  28.042  < 2e-16 ***
MaxTemp        -0.0238070  0.0083260  -2.859  0.00442 ** 
MinTemp        -0.0978096  0.0093057 -10.511  < 2e-16 ***
Rain_mm        -0.0001651  0.0013963  -0.118  0.90594    
Solar_Exposure -0.0155053  0.0037217  -4.166 3.63e-05 ***
SeasonAutumn    0.0503508  0.0812128   0.620  0.53554    
SeasonWinter    0.0421299  0.0913116   0.461  0.64471    
SeasonSpring    0.0188279  0.0816951   0.230  0.81782    
Covid_Ind      -0.0768909  0.0992380  -0.775  0.43880    
Weekend        -0.1074911  0.0336111  -3.198  0.00147 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 85.51279)

    Null deviance: 113590  on 529  degrees of freedom
Residual deviance:  44465  on 520  degrees of freedom
AIC: 3873.8

Number of Fisher Scoring iterations: 9

Use stepwise regression to reduce the explanatory variables:

Show code

# specify a null model with no predictors
null_model_glm_usage <- glm(Usage ~ 1, data = df_usage_train, gaussian(link="log"))
# use a forward stepwise algorithm to build a parsimonious model
step_model_glm_usage <- step(null_model_glm_usage, scope = list(lower = null_model_glm_usage, upper = full_model_glm_usage), direction = "forward")

Summary of the model:

Show code

summary(step_model_glm_usage)


Call:
glm(formula = Usage ~ MinTemp + Solar_Exposure + Weekend + MaxTemp, 
    family = gaussian(link = "log"), data = df_usage_train)

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)     5.394821   0.093699  57.576  < 2e-16 ***
MinTemp        -0.099655   0.006987 -14.264  < 2e-16 ***
Solar_Exposure -0.016588   0.003219  -5.152 3.65e-07 ***
Weekend        -0.107299   0.033428  -3.210  0.00141 ** 
MaxTemp        -0.021957   0.007943  -2.764  0.00590 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 84.89477)

    Null deviance: 113590  on 529  degrees of freedom
Residual deviance:  44568  on 525  degrees of freedom
AIC: 3865

Number of Fisher Scoring iterations: 7

For simplicity, we’ll assume a model defined by MaxTemp and MinTemp only, noting the significant correlation with other explanatory variables.

Show code

gamma_model_glm_usage <- glm(Usage ~ MaxTemp + MinTemp, family = stats::Gamma(link="inverse"), data = df_usage_train)
summary(gamma_model_glm_usage)


Call:
glm(formula = Usage ~ MaxTemp + MinTemp, family = stats::Gamma(link = "inverse"), 
    data = df_usage_train)

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.0283252  0.0040743  -6.952 1.07e-11 ***
MaxTemp      0.0007415  0.0003301   2.246   0.0251 *  
MinTemp      0.0037811  0.0003119  12.121  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Gamma family taken to be 0.1921047)

    Null deviance: 189.507  on 529  degrees of freedom
Residual deviance:  99.968  on 527  degrees of freedom
AIC: 3875

Number of Fisher Scoring iterations: 5

Changing the family to gaussian with log link (later graph shows little change in predictions, although AIC is higher). Select gaussian for ease of comparison with the Bayes model.

Show code

final_model_glm_usage <- glm(Usage ~ MaxTemp + MinTemp, family = gaussian(link="log"), data = df_usage_train)
summary(final_model_glm_usage)


Call:
glm(formula = Usage ~ MaxTemp + MinTemp, family = gaussian(link = "log"), 
    data = df_usage_train)

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  5.414092   0.098035  55.226  < 2e-16 ***
MaxTemp     -0.039265   0.007692  -5.105 4.64e-07 ***
MinTemp     -0.094550   0.007262 -13.021  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 90.75013)

    Null deviance: 113590  on 529  degrees of freedom
Residual deviance:  47825  on 527  degrees of freedom
AIC: 3898.4

Number of Fisher Scoring iterations: 7

Extract coefficients:

Show code

final_model_glm_usage_coeff <- as.data.frame(coef(final_model_glm_usage))
final_model_glm_usage_coeff

            coef(final_model_glm_usage)
(Intercept)                  5.41409167
MaxTemp                     -0.03926501
MinTemp                     -0.09454971

Project the result for a dummy record showing the link:

Show code

print("Dummy prediction with min of 10 and max of 20:")

[1] "Dummy prediction with min of 10 and max of 20:"

Show code

predict(final_model_glm_usage, newdata= as.data.frame(cbind(MaxTemp=20,MinTemp=10)), # model and data
  type="response", # or terms for coefficients
  se.fit = FALSE)

       1 
39.77722

Show code

# replicate
exp(final_model_glm_usage_coeff[1,] + final_model_glm_usage_coeff[2,]*20+final_model_glm_usage_coeff[3,]*10)

[1] 39.77722

Adding the predictions to the test data:

Show code

# predictions, normal
pred_usage <- as.data.frame(
  predict(final_model_glm_usage, newdata= df_usage_test, # model and data
  type="response", # or terms for coefficients
  se.fit = TRUE, # default is false
  interval = "confidence", #default "none", also "prediction"
  level = 0.95
  )
)

# predictions, gamma
usage_pred_gamma <- 
  predict(gamma_model_glm_usage, newdata= df_usage_test, # model and data
  type="response", # or terms for coefficients
  se.fit = FALSE, # default is false
  )

# rename
pred_usage <- rename(pred_usage,usage_pred=fit,se_usage_pred=se.fit)

# add back to data
df_usage_test_pred <- cbind(df_usage_test,pred_usage, usage_pred_gamma)

Plotting the results shows a poorer fit for winter when the usage spikes and in instances with outlier temperatures. Predictions for the test data are not hugely different to the gamma family model.

Show code

df_usage_test_pred  %>%
ggplot() +
geom_point(aes(x=Date,y=Usage,color=Season)) +
# add a modeled line
geom_line(aes(x=Date,y=usage_pred)) +
geom_line(aes(x=Date,y=usage_pred_gamma), linetype = "dashed") +
labs(x="Date", y="Usage", title = "Actual vs predicted usage by date and season",subtitle="Dashed=Gamma Pred; Solid=Normal Pred") +
theme_pander() + scale_color_gdocs()

Show code

# plot temperatures
df_usage_test_pred  %>%
ggplot() +
geom_point(aes(x=Date,y=MaxTemp,color=Season)) + 
labs(x="Date", y="Usage", title = "MaxTemp by date and season") +
theme_pander() + scale_color_gdocs()

Plots show slightly higher residuals at higher predicted values for the step regression model and final selected. QQ plot suggests skewness in the data, which we could see earlier in the histogram of usage.

Show code

par(mfrow = c(2, 2))
plot(final_model_glm_usage)

Show code

plot(step_model_glm_usage)

Define a regression model in RJAGS. The structure here is more flexible than stan_glm which is predefined. This allows for a wider selection of prior distributions and a mixture of modeling choices. Also possible to specify model forms in this way for STAN.

\(Y_i \propto N(m_i,s^2)\)

Where

\(m_i = a + bX_1i +cX_2i\)
\(a, b, c and s\) have some assumed prior distributions.

Note that the distribution syntax is slightly different to R, for example in R a normal distribution is specified using the mean and standard deviation, but in JAGS the second parameter of the dnorm command is the distribution’s precision (1 / variance).

Show code

usage_model <- "model{
  #Likelihood model for Y[i]
  for (i in 1:length(Y)) {
      Y[i] ~ dnorm(m[i], s^(-2)) # link=log
      log(m[i]) <- a + b*X_1[i] + c*X_2[i]
  }

  # Prior models for a, b, c, s
  a ~ dnorm(0, 0.001)     # Intercept
  b ~ dnorm(0, 0.001)     # MaxTemp
  c ~ dnorm(0, 0.001)     # MinTemp
  s ~ dunif(0, 10)        # Error

}"

Simulate from the model:

burn-in period of 2k iterations aims for convergence and throws away these earlier test samples.
10k iterations for 3 chains.
saving every 10th sample only in order to save space.

Show code

usage_sim <- jags.model(textConnection(usage_model),
              data=list(Y = df_usage_train$Usage, X_1 = df_usage_train$MaxTemp, X_2 = df_usage_train$MinTemp),
              n.chains = 3)

Compiling model graph
   Resolving undeclared variables
   Allocating nodes
Graph information:
   Observed stochastic nodes: 530
   Unobserved stochastic nodes: 4
   Total graph size: 2919

Initializing model

Show code

usage_posterior <- coda.samples(usage_sim, 
                                 variable.names = c("a","b","c","s"), 
                                 n.iter = 10000, n.burnin = 2000, n.thin = 10, DIC = F)

Summary of the results below:

The mean of the b chain value was -0.040. This is an estimate of the posterior mean of b. This is consistent with the GLM above
The corresponding naive standard error, 0.000253, measures the potential error in this estimate. This error is calculated by dividing the standard deviation of the b chain values by the square root of the number of iterations (or sample size). This is a naive approach to calculating error in a setting with dependent samples.

Also showing the density plots for the posteriors as well as the trace plots for each of the chains.

Show code

summary(usage_posterior)


Iterations = 1001:11000
Thinning interval = 1 
Number of chains = 3 
Sample size per chain = 10000 

1. Empirical mean and standard deviation for each variable,
   plus standard error of the mean:

      Mean       SD  Naive SE Time-series SE
a  5.41410 0.106067 6.124e-04      0.0070111
b -0.03912 0.008177 4.721e-05      0.0007667
c -0.09484 0.007196 4.155e-05      0.0004215
s  9.50785 0.254874 1.472e-03      0.0018820

2. Quantiles for each variable:

      2.5%      25%      50%      75%    97.5%
a  5.21749  5.34109  5.41256  5.48393  5.62757
b -0.05514 -0.04475 -0.03913 -0.03349 -0.02289
c -0.10930 -0.09970 -0.09463 -0.08996 -0.08094
s  8.98564  9.33206  9.51959  9.69793  9.94864

Show code

plot(usage_posterior,trace=FALSE)

Show code

par(mfrow = c(2, 2))
plot(usage_posterior,density=FALSE)

We can plot the posterior distributions for the model parameters from the samples:

Show code

posterior_usage <- as.matrix(usage_posterior)

plot1<- mcmc_areas(posterior_usage, pars = c("a"), prob = 0.80)
plot2<- mcmc_areas(posterior_usage, pars = c("b"), prob = 0.80)
plot3<- mcmc_areas(posterior_usage, pars = c("c"), prob = 0.80)
plot4<- mcmc_areas(posterior_usage, pars = c("s"), prob = 0.80)

grid.arrange(plot1,plot2, plot3, plot4, top="Posterior distributions (median + 80% intervals)")

Similar to the earlier example, we could extract the \(R^2\) from the posterior distribution as well as run posterior predictive checks.

Further considerations: extracting the mean of the parameter projections, we could produce predictions for the test data. We could also produce predictions for test data point based upon each of the samples from the posterior, so above we would produce 2400 predictions for each data point (3x10k, discarding the first 2k from each chain and only saving 1/10 of the samples).

Estimating COVID 19 infections in NSW, Australia

Hugh Miller, Isabella Lyons (2012) discusses a method of forecasting COVID infections using the package “epinow2”, Sam Abbott (2020). The package makes use of STAN.

“[Epinow2] estimates the time-varying reproduction number on cases by date of infection… Imputed infections are then mapped to observed data (for example cases by date of report) via a series of uncertain delay distributions…”

That is to say, the model makes some assumptions about lags in symptoms and in reporting (assuming some distribution for those delays) in order to estimate the true underlying number of infections (backwards). Using that backcast it projects out how those infections might be reported over time. The model also estimates the effective reproduction number in order to predict future infections.

Covid case and test data sourced from the department of NSW Website on 17/01/2022.

We should be careful in placing too heavy a reliance on any models of incidence which are based upon case data (particularly where case rates are expected to be under reported or where testing is incomplete). The example is purely illustrative and does not attempt to correct for gaps in the data.

Case data:

Show code

df_cases <- read.csv(file = 'confirmed_cases_table1_location.csv') 
head(df_cases)

  notification_date postcode lhd_2010_code        lhd_2010_name
1        2020-01-25     2134          X700               Sydney
2        2020-01-25     2121          X760      Northern Sydney
3        2020-01-25     2071          X760      Northern Sydney
4        2020-01-27     2033          X720 South Eastern Sydney
5        2020-03-01     2077          X760      Northern Sydney
6        2020-03-01     2163          X710 South Western Sydney
  lga_code19      lga_name19
1      11300     Burwood (A)
2      16260  Parramatta (C)
3      14500 Ku-ring-gai (A)
4      16550    Randwick (C)
5      14000     Hornsby (A)
6      12850   Fairfield (C)

Show code

df_tests <- read.csv(file = 'pcr_testing_table1_location_agg.csv') 
head(df_tests)

   test_date postcode lhd_2010_code   lhd_2010_name lga_code19
1 2020-01-01     2038          X700          Sydney      14170
2 2020-01-01     2039          X700          Sydney      14170
3 2020-01-01     2040          X700          Sydney      14170
4 2020-01-01     2041          X700          Sydney      14170
5 2020-01-01     2069          X760 Northern Sydney      14500
6 2020-01-01     2074          X760 Northern Sydney      14500
       lga_name19 test_count
1  Inner West (A)          1
2  Inner West (A)          1
3  Inner West (A)          2
4  Inner West (A)          1
5 Ku-ring-gai (A)          1
6 Ku-ring-gai (A)          1

Show code

# cases
df_cases <- df_cases %>% mutate(descr="cases",date=as.Date(notification_date),cases=1) %>%
filter(date > "2021-04-30",date <= "2021-12-31") %>% 
group_by(date, descr) %>% 
summarise(confirm=sum(cases)) 
head(df_cases)

# A tibble: 6 × 3
# Groups:   date [6]
  date       descr confirm
  <date>     <chr>   <dbl>
1 2021-05-01 cases       2
2 2021-05-02 cases       8
3 2021-05-03 cases       9
4 2021-05-04 cases       7
5 2021-05-05 cases       9
6 2021-05-06 cases       4

Plotting recorded cases by date (PCR test outcomes only and only to 31 December 2021):

Show code

df_cases %>% mutate(omicron=ifelse(date > "2021-11-26","first case+","pre")) %>%
ggplot(data=., mapping = aes(x=date, y=confirm)) + 
geom_col(aes(fill=omicron)) +
scale_x_date(date_breaks = "months" , date_labels = "%b-%y", limits = c(as.Date("2021-04-30"), as.Date("2021-12-31"))) +
labs(x="Date", y="Cases", title = "Covid cases by date, NSW") +
theme_pander() + scale_color_gdocs()

Test data:

Show code

df_tests <- df_tests %>% mutate(descr="tests",date=as.Date(test_date)) %>%
filter(date > "2021-04-30",date <= "2021-12-31") %>% 
group_by(date, descr) %>% 
summarise(tests=sum(test_count)) 
head(df_tests)

# A tibble: 6 × 3
# Groups:   date [6]
  date       descr tests
  <date>     <chr> <int>
1 2021-05-01 tests  4514
2 2021-05-02 tests  4775
3 2021-05-03 tests 14877
4 2021-05-04 tests 10868
5 2021-05-05 tests 13126
6 2021-05-06 tests 24029

Plotting tests by date (PCR tests only; to 31 December 2021):

Show code

ggplot(data=df_tests, mapping = aes(x=date, y=tests))+ 
geom_bar(stat="identity") +
scale_x_date(date_breaks = "months" , date_labels = "%b-%y", limits = c(as.Date("2021-04-30"), as.Date("2021-12-31"))) +
labs(x="Date", y="Tests", title = "PCR tests by date, NSW") +
theme_pander() + scale_color_gdocs()

Merged to get positive test rate:

Show code

# merge
df <- merge(x = df_cases, y = df_tests, by = "date", all.x = TRUE) %>%
mutate(pos_rate=pmin(1, confirm/tests)) %>%
fill(pos_rate, .direction = "downup") %>%
# add smoothed positive rate
mutate(pos_rate_sm = fitted(loess(pos_rate~as.numeric(date),span=0.1))) # loess - Local Polynomial Regression Fitting; span - the parameter which controls the degree of smoothing
# drop descriptions
df$descr.x <- NULL
df$descr.y <- NULL

Plot of positive rate shows that the proportion of tests returning a positive increases to above 5% from 25th December 2021, suggesting that the case data beyond this point is less complete.

Show code

# positive rate
ggplot(data=df, mapping = aes(x=date))+ 
geom_point(aes(y=(pos_rate))) +
geom_line(aes(y=pos_rate_sm)) +
scale_x_date(date_breaks = "months" , date_labels = "%b-%y", limits = c(as.Date("2021-04-30"), as.Date("2021-12-31"))) +
scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
labs(x="Date", y="Positive test rate", title = "Positive PCR test rate by date, NSW") +
theme_pander() + scale_color_gdocs()

There are some pragmatic methods for adjusting the case numbers for increasing test positivity rates. For our purposes we will focus the model on case data before 23 December 2021 in order to exclude the period where the test data became less reliable/ complete, noting that underlying data completeness issues are likely present.

Show code

# Reporting delays (delay from symptom onset to reporting/ testing positive)
reporting_delay <- estimate_delay(rlnorm(100, log(6), 1), max_value = 15)
cat("Mean reporting delay = ",reporting_delay$mean, "; ")

# Symptom generation time (time between infection events in an infector-infectee pair) and incubation period (time between moment of infection and symptom onset)
generation_time <- get_generation_time(disease = "SARS-CoV-2", source = "ganyani")
cat("Mean generation time =",generation_time$mean, "; ")
incubation_period <- get_incubation_period(disease = "SARS-CoV-2", source = "lauer")
cat("Mean incubation period =",incubation_period$mean, "; ")

Show code

df_cases_fit <- df_cases %>% filter(date <= "2021-12-22") # exclude data before 23 December 2021

estimates <- epinow(reported_cases = df_cases_fit[c("date","confirm")], # case data from NSW
                    generation_time = generation_time,
                    delays = delay_opts(incubation_period, reporting_delay),
                    rt = rt_opts(prior = list(mean = 2, sd = 0.2)),
                    stan = stan_opts(cores = 6, samples = 3000, warmup = 500, chains = 6, control = list(adapt_delta = 0.95)))

Importing a previous run of the above:

Show code

load("covid_estimates_summary.RData")

plot(estimates) produces predefined plots of cases by date reported and date of infection as well as the effective reproduction number. Given the dramatic increase in reported cases towards the end of the period, it is not surprising that the range of forecast cases is wide. Interestingly, the forecast/ partial forecast reproduction number reduces; presumably this is influenced by the earlier observations where case numbers were low. This issue has not been investigated further, however there are articles that discuss the impact of changes to the testing procedures and other limitations.

Merging forecast reported cases with the actual case data:

Show code

df <- merge(x = df, y = df_reported, by = "date", all.x = TRUE) %>% filter(date < "2021-12-30")

Plotting forecast vs actuals below - the week to 29 December is fully forecast from the model. Actual cases fall within the projected 90% confidence interval range, noting that the test positivity rate increased materially from 25th December so it is likely that the actual cases are under reported.

Show code

colours <- c("actual" = "#3366cc","modelled" = "#dc3912")

ggplot(data=df %>% filter(date >= "2021-12-01"), aes(x=date)) +
geom_line(aes(y=confirm, color="actual")) +
geom_line(aes(y=mean,color="modelled")) +
geom_ribbon(aes(ymin=lower_90, ymax=upper_90), fill="#dc3912", alpha=0.2) +
labs(x="Date", y="Reported cases", title = "Reported cases by date",color="Legend") +
scale_color_manual(values = colours) +
theme_pander()

Common issues in MCMC projections

Divergent transitions: Common fix is changing the step size using the adapt_delta argument in STAN.
Exceeding maximum tree depth: the is a point of efficiency. See notes here for assistance in STAN and other potential warnings.

Bayesian applications

Background

Applications

Libraries

Mental health example

Choosing a prior

Non-informative prior, uniform

Weakly informative prior, bimodial

Informative prior, “failure” more likely

Deriving posterior

Regression - logistic

Background

Heart disease data

Adjusting scales and changing priors

Model fit

Regression - linear

Estimating COVID 19 infections in NSW, Australia

Common issues in MCMC projections

Further reading

References