Background

Bayesian statistics reflects uncertainty about model parameters by assuming some prior probability distribution for those parameters. That prior uncertainty is adjusted for observed data to produce a posterior distribution.

Under a frequentest approach, model parameters are fixed and the observations are assumed to be sampled from a probability distribution with those parameters.

Bayes theorem refresher:

\(\Huge {p}(\theta,x) = \frac{{p}(x,\theta){p}(\theta)}{{p}(x)}\)

Where

\({p}(\theta,x)\) is the posterior, a combination of the likelihood, prior and evidence. Probability distribution of \(\theta\) given the observed data, \(x\).
\({p}(x,\theta)\) is the likelihood which summarizes the likelihood of observing data \(x\) under different values of the underlying support parameter \(\theta\).
\({p}(\theta)\) is the prior, the probability distribution of \(\theta\), independent of the observed data.
\({p}(x)\) is the evidence (normalising term), the probability of the observed data \(x\), independent of \(\theta\).

The evidence term is proportional to the likelihood and therefore the theorem is often stated as

\(\Huge p(\theta | x) \propto p(x | \theta) p(\theta)\)

A paper presented at the 2021 Actuaries (Institute) Summit Hugh Miller, Isabella Lyons (2012) (p2) sets out the ‘tensions’ that exist between Frequentest and Bayesian approaches and suggests that:

“Despite actuaries being pioneers in early work on credibility theory, Bayesian approaches have largely given way to frequentist approaches. This is to our loss, since such methods have advanced substantially over the past couple of decades and they are well-suited to problems with latent variables or noise around underlying ‘true’ values.”

Here we’ll look at a few simple applications of Bayesian techniques.

Applications

The sections below set out a simple example Bayesian refresher, regression models and other applications.

Libraries

A list of packages used in the recipes.

Mental health example

The 2017 Actuaries Institute Green Paper on Mental Health and Insurance “Mental Health and Insurance” (2017) talks about some of the challenges in obtaining reliable data and projections on mental health in insurance (“Mental Health and Insurance” 2017, p26):

"There are sources of data and research about the prevalence and profile of people with different mental health conditions, such as the 2007 National Survey of Mental Health and Wellbeing and many other epidemiological and clinical studies… [T]o have a complete and informed view of all this information is a difficult task for a full-time academic, let alone for an insurer.

Even then, the nature and granularity of the data needed for insurance applications is different from that for population, public health or clinical purposes. Firstly, the ‘exposure’ (the number of insured people and their characteristics) needs to be recorded. There is probably a lot of relevant information that is not collected at the start of an insurance policy and can therefore never be analysed.

Secondly, the claims information needs to be available at a detailed level regarding the nature and cause of the claim and other characteristics of the individual, their history and the cover provided. There is a clear need for consistency in definition, language and data standards to improve the quality of information available across the system.

Naturally, when the insurance cover for mental health conditions is not provided (as in the case of most travel insurance products) there will not be relevant data from the insurance history. However, even when relevant data exists within insurers there are practical and commercial difficulties in turning it into a useful form."

The paper goes on to talk about some of the insurance challenges in detail including

subjectivity in diagnosis
reliance on self-reporting
varying severity and prospects of recovery
impact of co-morbidities
correlation with financial incentive

Based upon data from a single insurer ~19% of IP and TPD claims relate to mental health, with mental health claims contributing 26% to the total cost of claims (in 2015, “Mental Health and Insurance” (2017), p17). This is roughly consistent with the proportion of the general population that experiences an episode of mental illness in a 12 month period, “Mental Health and Insurance” (2020), p110.

In this simple example, we want to know what the rate of mental illness, \(\theta\) is in the insured population. Assume some true underlying \(\theta\), here a fixed value of 20%. Create a population, \(X\) with \(X \sim B(N,\theta)\) where N=100000. Pull a random sample from population.

Show code

theta_true <- 0.2 
pop <- rbinom(100000,1,theta_true)
sample_n = 500
sample <- sample(pop,sample_n)
sample_success <- sum(sample)
sample_p <- sample_success/sample_n

Choosing a prior

The posterior is proportional to the product of the prior and the likelihood. The beta distribution is a conjugate prior for the binomial distribution (posterior is the same probability distribution family as prior). Assume \(\theta \sim {\sf Beta}(\alpha, \beta)\), then:

\(f(\theta;\alpha,\beta) = \theta^{\alpha-1}(1-\theta)^{\beta-1}\)

It represents the probabilities assigned to values of \(\theta\) in the domain (0,1) given values for the parameters \(\alpha\) and \(\beta\). The binomial posterior distribution represents the probability of values of \(X\) given \(\theta\). Varying the values of \(\alpha\) and \(\beta\) can represent a wide range of different prior beliefs about the distribution of \(\theta\).

Priors can be “informative” “uninformative” or “weakly informative,” e.g.

non-informative - uniform dist => \(\theta\) could be anywhere in (0,1)
weakly informative uniform prior => believe \(\theta\) more likely to be at either end of the ranges (less likely to be in the middle)
strongly informative prior
- a and b both high (bimodial) => true theta likely to be at center of range
- a higher than b (strong “success”) => success more likely than failure (where “success” is a mental health claim)
- b higher than a (strong “failure”) => failure more likely than success

Defining prior plots:

Show code

plot_prior <- function(a,b){
  theta = seq(0,1,0.005)
  p_theta = dbeta(theta, a, b)
  p <- qplot(theta, p_theta, geom='line')
  p <- p + theme_bw()
  p <- p + ylab(expression(paste('p(',theta,')', sep = '')))
  p <- p + xlab(expression(theta))
  return(p)}

Non-informative prior, uniform

Show code

plot_prior(1,1) + labs(title="Uniform Prior, a= 1 and b = 1") + theme(plot.title = element_text(hjust = 0.5))

Weakly informative prior, bimodial

Show code

plot_prior(0.5,0.5) + labs(title="Bimodial Prior, a= 0.5 and b = 0.5") + theme(plot.title = element_text(hjust = 0.5))

Informative prior, “failure” more likely

The higher the number of observations, the greater the influence on the posterior

Show code

plot_prior(5,30) + labs(title="Uniform Prior, a= 5 and b = 30") + theme(plot.title = element_text(hjust = 0.5))

Show code

plot_prior(20,120) + labs(title="Uniform Prior, a= 20 and b = 120") + theme(plot.title = element_text(hjust = 0.5))

Deriving posterior

Prior plot values. Re-express binomial as beta dist.

Show code

  prior_a = 50 #successes
  prior_b = 250 #failures
  prior_n = prior_a + prior_b #observations
  prior_theta = prior_a/prior_n
  prior <- data.frame('Dist'='Prior','x'=seq(0,1,0.005), 'y'=dbeta(seq(0,1,0.005),prior_a,prior_b))

Observations are taken from the earlier binomial sample.

Show code

  obs_a <- sample_success + 1
  obs_b <- sample_n - sample_success + 1
  observations <- data.frame('Dist'='Observations',x=seq(0,1,0.005), y=dbeta(seq(0,1,0.005),obs_a,obs_b))

In this example, the posterior is a simple combination of the prior and observations.

Show code

  post_a <- sample_success + prior_a - 1 #combination of prior and observed successes
  post_b <- sample_n - sample_success + prior_b - 1
  post_theta <- post_a / (post_a + post_b)
  posterior <- data.frame('Dist'='Posterior','x'=seq(0,1,0.005), 'y'=dbeta(seq(0,1,0.005),post_a,post_b))

This can be plotted as:

Show code

model_plot <- rbind(prior,observations,posterior)
with(model_plot, Dist <- factor(Dist, levels = c('Prior', 'Observations','Posterior'), ordered = TRUE))

qplot(model_plot$x, model_plot$y, geom='line',color = model_plot$Dist) +
ylab(expression(paste('p(',theta,')', sep = ''))) +
xlab(expression(theta)) +
geom_vline(xintercept = post_theta, linetype="dotted", color = "black", label="blah") +
scale_color_discrete(name = "Dist")  + 
annotate("text",x=post_theta, y=0, label=label_percent()(post_theta)) +
labs(title="Informative prior", subtitle="Low success") + 
theme(plot.title = element_text(hjust = 0.5)) +
theme(plot.subtitle = element_text(hjust = 0.5))

Regression - logistic

Background

RSTANARM is an interface in R for STAN where Stan is a platform for Bayesian inference - it is a probabilistic programming framework where the guts of the inference method happens in the background, while the user focuses on parameterising the model in R. RJAGS is an interface in R for JAGS (Just Another Gibbs Sample) where JAGS is similarly a platform for Bayesian modeling.

Models in both STAN and JAGS can make use of Markov chains Monte Carlo (MCMC) methods. MCMC methods provide a way to take random samples approximately from a posterior distribution. Such samples can be used to summarize any aspect of the posterior distribution of a statistical model.

RSTANARM and RJAGS have pre-written code for common models like regression.

Under a Frequentest approach to linear regression, for a given set of response variables y, the relationship between y and a set of regressor variables x is assumed to be linear (informed by a set of parameters \(\beta\)), with an error term, i.e. \(y = X\beta + \epsilon\).

The error term, \(\epsilon\) is assumed to be normally distributed and the regression co-efficients are estimated by minimising the error term commonly using ordinary least squares (OLS). We make some other assumptions about the error term - not correlated with X and i.i.d.; and define \(Var(\epsilon)=\sigma^2\). The generalised form allows the response variable to have an error that is not normal - this is achieved by allowing the linear model to be related to the response variable via a link function and by allowing the magnitude of the variance of each measurement to be a function of its predicted value.

With Bayesian inference we are interested in the posterior distributions of the parameters \(\beta\) and \(\sigma\), rather than point estimates. We also want to influence those distributions with some prior knowledge of their behaviour:

\(p(\beta,\sigma|y,X) \propto p(y|\beta,\sigma)p(\beta,\sigma)\)

\(p(\beta,\sigma|y,X)\) is not a normalised probability density function (i.e. is proportional to the likelihood and prior). However, we can sample from it and therefore estimate any quantity of interest e.g. mean or variance, (William H. Press 2007, p825). For this we use MCMC, William H. Press (2007), p825:

“MCMC is a random sampling method… the goal is to visit a point \(x\) with a probability proportional to some given distribution function \(\pi(x)\) [not necessarily a probability]. Why would we want to sample a distribution in this way? THe answer is that Bayesian methods, often implemented using MCMC, provide a powerful way of estimating the parameters of a model and their degree of uncertainty.”

Heart disease data

The Framingham Heart Study is a longitudinal study of cardiovastular disease in a sample population within Framingham, Massachusetts. A subset of data for teaching is available at request from the U.S. National Heart, Lung and Blood Institute. This dataset is not appropriate for publication and has been anonymised, but is nevertheless useful for illustrative purposes here. The dataset:

“..is a subset of the data collected as part of the Framingham study and includes laboratory, clinic, questionnaire, and adjudicated event data on 4,434 participants. Participant clinic data was collected during three examination periods, approximately 6 years apart, from roughly 1956 to 1968. Each participant was followed for a total of 24 years for the outcome of the following events: Angina Pectoris, Myocardial Infarction, Atherothrombotic Infarction or Cerebral Hemorrhage (Stroke) or death.”

We’ll look at a logistic regression model example using predefined Bayesian models in RSTANARM.

Show code

df_heart <- read.csv(file = 'frmgham2.csv') 
df_heart <- df_heart[c(1:23)] %>% filter(PERIOD==2) # ignoring health event data, only looking at observations at period == 2 (observations for participants are not independent over time; this does have limitations e.g. observations are truncated at death which could be due to heart disease)
ignore_cols <- names(df_heart) %in% c("PERIOD","RANDID","TIME", "PREVMI", "PREVAP") #exclude identifier and time columns; exclude  PREVMI and PREVAP as they are a subset of PREVCHD
df_heart <- df_heart[!ignore_cols]
head(df_heart)

  SEX TOTCHOL AGE SYSBP DIABP CURSMOKE CIGPDAY   BMI DIABETES BPMEDS
1   2     260  52   105  69.5        0       0 29.43        0      0
2   1     283  54   141  89.0        1      30 25.34        0      0
3   2     232  67   183 109.0        1      20 30.18        0      0
4   2     343  51   109  77.0        1      30 23.48        0      0
5   2     230  49   177 102.0        0       0 31.36        0      1
6   2     220  70   149  81.0        0       0 36.76        0      0
  HEARTRTE GLUCOSE educ PREVCHD PREVSTRK PREVHYP HDLC LDLC
1       80      86    2       0        0       0   NA   NA
2       75      87    1       0        0       0   NA   NA
3       60      89    3       0        0       1   NA   NA
4       90      72    3       0        0       0   NA   NA
5      120      86    2       0        0       1   NA   NA
6       80      98    1       1        0       1   NA   NA

Data description highlights:

RANDIND: unique identifier.
SEX: 1=male; 2=female (categorical, 1/2).
TIME: number of days since baseline exam (continuous).
PERIOD: examination cycle (categorical, 1,2 or 3). We’ll focus on period 2 - observations for the same participant are not independent over time.
TOTCHOL: total cholesterol level (continuous).
AGE: Age of the patient rounded to nearest year (continuous).
SYSBP: systolic blood pressure (continuous).
DIABP: diastolic blood pressure (continuous).
CURSMOKE: whether or not the patient is a current smoker (categorical, 0/1).
CIGPDAY: the number of cigarettes smoked per day on average (continuous).
BMI: Body Mass Index (continuous).
DIABETES: whether or not the patient had diabetes (categorical, 0/1).
BPMEDS: whether or not the patient was on blood pressure medication (categorical, 0/1).
HEARTRTE: heart rate (continuous).
GLUCOSE: glucose level (continuous).
EDU: attained Education (categorical, 1=0-11 year; 2=High School Diploma, GED; 3=Some College, Vocational School; 4=College (BS, BA) degree or more).
HDLC: High Density Lipoprotein Cholesterol - only available for period 3 (continuous).
LDLC: Low Density Lipoprotein Cholesterol - only available for period 3 (continuous).
PREVCHD: prevalence of Coronary Heart Disease (categorical, 0/1).
PREVAP: prevalence of Angina Pectoris (categorical, 0/1). Excluded from model as it is a subset of PREVCHD.
PREVMI: Prevalence of Myocardial Infarction (categorical, 0/1). Excluded from model as it is a subset of PREVCHD.
PREVSTRK: whether or not the patient had previously had a stroke (categorical, 0/1).
PREVHYP: whether or not the patient was hypertensive (categorical, 0/1).

Using the DataExplore package, we can see that the data is mostly complete apart from missing entries in ‘glucose’ and to a lesser extent ‘education.’

Show code

introduce(df_heart)

  rows columns discrete_columns continuous_columns
1 3930      18                0                 16
  all_missing_columns total_missing_values complete_rows
1                   2                 8720             0
  total_observations memory_usage
1              70740       334464

Show code

plot_intro(df_heart)

Show code

plot_missing(df_heart)

Given around 12% of glucose records are missing as well as all of HDL and LDL (only available for period 3), drop the columns. Also drop rows with other missing data as these are small in number.

Show code

df_heart$GLUCOSE <- NULL
df_heart$HDLC <- NULL
df_heart$LDLC <- NULL
df_heart <- na.omit(df_heart)

Plots of the data:

Show code

plot_bar(df_heart, title="Barplot")

Of the 3.6k remaining records, ~7% have a flag for heart disease:

Show code

table(df_heart$PREVCHD)


   0    1 
3319  253

Further plots and checking distributions:

Show code

plot_histogram(df_heart, title="Histogram for continuous")

Show code

qq_data <- df_heart[,c("BMI", "DIABP", "HEARTRTE", "SYSBP", "TOTCHOL")]
plot_qq(qq_data, title="QQ plot")

Show code

log_qq_data <- update_columns(qq_data, 2:4, function(x) log(x + 1))
plot_qq(log_qq_data, title="QQ plot, logged data")

Most significantly correlated with heart disease are AGE, SYSBP, DIABETES, BPMEDS and PREVHYP.

Show code

plot_correlation(na.omit(df_heart))

Split data into training and testing data sets.

Show code

# Create a random sample of row IDs
sample_rows <- sample(nrow(df_heart),0.75*nrow(df_heart))
# Create the training dataset
df_heart_train <- df_heart[sample_rows,]
# Create the test dataset
df_heart_test <- df_heart[-sample_rows,]

Fit a simple GLM using all data:

Show code

full_model_glm_heart <- glm(PREVCHD~.,data=df_heart_train, family = "binomial")
summary(full_model_glm_heart)


Call:
glm(formula = PREVCHD ~ ., family = "binomial", data = df_heart_train)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.3248  -0.4057  -0.2748  -0.1827   2.8921  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -5.962557   1.250208  -4.769 1.85e-06 ***
SEX         -1.114213   0.177181  -6.289 3.20e-10 ***
TOTCHOL      0.001581   0.001784   0.886  0.37553    
AGE          0.078303   0.011428   6.852 7.28e-12 ***
SYSBP        0.010581   0.005241   2.019  0.04352 *  
DIABP       -0.016745   0.009933  -1.686  0.09184 .  
CURSMOKE     0.042076   0.261488   0.161  0.87216    
CIGPDAY     -0.005157   0.010770  -0.479  0.63204    
BMI          0.026662   0.019952   1.336  0.18144    
DIABETES     0.939410   0.262773   3.575  0.00035 ***
BPMEDS       0.375898   0.231239   1.626  0.10404    
HEARTRTE    -0.011837   0.006496  -1.822  0.06842 .  
educ        -0.033598   0.077837  -0.432  0.66599    
PREVSTRK     0.092152   0.521049   0.177  0.85962    
PREVHYP      0.236107   0.218069   1.083  0.27893    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1366.6  on 2678  degrees of freedom
Residual deviance: 1184.2  on 2664  degrees of freedom
AIC: 1214.2

Number of Fisher Scoring iterations: 6

Use stepwise regression to reduce the explanatory variables:

Show code

# specify a null model with no predictors
null_model_glm_heart <- glm(PREVCHD ~ 1, data = df_heart_train, family = "binomial")
# use a forward stepwise algorithm to build a parsimonious model
step_model_glm_heart <- step(null_model_glm_heart, scope = list(lower = null_model_glm_heart, upper = full_model_glm_heart), direction = "forward")

Summary of the model:

Show code

summary(step_model_glm_heart)


Call:
glm(formula = PREVCHD ~ AGE + SEX + DIABETES + SYSBP + HEARTRTE + 
    BPMEDS, family = "binomial", data = df_heart_train)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.3219  -0.4014  -0.2759  -0.1873   2.8906  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -6.508481   0.831483  -7.828 4.97e-15 ***
AGE          0.086628   0.010331   8.385  < 2e-16 ***
SEX         -1.021098   0.166346  -6.138 8.34e-10 ***
DIABETES     1.037655   0.258433   4.015 5.94e-05 ***
SYSBP        0.008310   0.003601   2.308   0.0210 *  
HEARTRTE    -0.012274   0.006396  -1.919   0.0550 .  
BPMEDS       0.405392   0.226323   1.791   0.0733 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1366.6  on 2678  degrees of freedom
Residual deviance: 1190.6  on 2672  degrees of freedom
AIC: 1204.6

Number of Fisher Scoring iterations: 6

Dropping BPMEDS, HEARTRTE and BMI as little impact on AIC:

Show code

final_model_glm_heart <- glm(PREVCHD ~ AGE + SEX + PREVHYP + DIABETES + TOTCHOL, family = "binomial",data = df_heart_train)
summary(final_model_glm_heart)


Call:
glm(formula = PREVCHD ~ AGE + SEX + PREVHYP + DIABETES + TOTCHOL, 
    family = "binomial", data = df_heart_train)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.3239  -0.4108  -0.2797  -0.1870   2.9304  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -7.047973   0.742204  -9.496  < 2e-16 ***
AGE          0.090472   0.010145   8.918  < 2e-16 ***
SEX         -1.015567   0.168042  -6.044 1.51e-09 ***
PREVHYP      0.397207   0.171145   2.321   0.0203 *  
DIABETES     1.026242   0.254112   4.039 5.38e-05 ***
TOTCHOL      0.001445   0.001779   0.812   0.4166    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1366.6  on 2678  degrees of freedom
Residual deviance: 1198.5  on 2673  degrees of freedom
AIC: 1210.5

Number of Fisher Scoring iterations: 6

Fit a Bayes glm using the same explanatory variables. Here parameter estimates no longer have test statistics and p-values as in the frequentist regression. This is because Bayesian estimation samples from the posterior distribution. This means that instead of a point estimate and a test statistic, we get a distribution of plausible values for the parameters, and the estimates section of the summary summarizes those distributions.

Further documentation on Bayesian generalised linear models via Stan is here. Documentation on choice of priors is here. If priors are not specified, default weekly informative priors are used - you can call a summary of the modeled priors (see below). Default priors are autoscaled to make them less informative (see below for notes on scaling).

Show code

model_bayes_glm_heart <- stan_glm(PREVCHD ~ AGE + SEX + PREVHYP + DIABETES + TOTCHOL, family = "binomial",data=df_heart_train, iter=2000, warmup=500)
# chains is number of sample paths from posterior; iter is number of samples within a chain; warmup is number of iterations to discard

A summary of the model below. “Sigma” is the standard deviation of the errors; “mean_PPD” is the mean of the posterior predictive samples.MCMC diagnostics - “RHat” is a measure of within chain variance compared to across chain variance (<1.1 implies convergence); “log-posterior” is similar to likelihood.

Show code

print(summary(model_bayes_glm_heart),digits=4)


Model Info:
 function:     stan_glm
 family:       binomial [logit]
 formula:      PREVCHD ~ AGE + SEX + PREVHYP + DIABETES + TOTCHOL
 algorithm:    sampling
 sample:       6000 (posterior sample size)
 priors:       see help('prior_summary')
 observations: 2679
 predictors:   6

Estimates:
              mean    sd      10%     50%     90%  
(Intercept) -7.0759  0.7518 -8.0309 -7.0752 -6.1078
AGE          0.0908  0.0103  0.0776  0.0905  0.1040
SEX         -1.0178  0.1678 -1.2333 -1.0174 -0.7994
PREVHYP      0.4005  0.1696  0.1868  0.3976  0.6185
DIABETES     1.0197  0.2583  0.6913  1.0249  1.3447
TOTCHOL      0.0015  0.0018 -0.0008  0.0014  0.0037

Fit Diagnostics:
           mean   sd     10%    50%    90% 
mean_PPD 0.0707 0.0067 0.0620 0.0705 0.0795

The mean_ppd is the sample average posterior predictive distribution of the outcome variable (for details see help('summary.stanreg')).

MCMC diagnostics
              mcse   Rhat   n_eff
(Intercept)   0.0107 0.9998 4977 
AGE           0.0002 0.9996 4032 
SEX           0.0024 1.0002 4776 
PREVHYP       0.0024 0.9999 5157 
DIABETES      0.0032 1.0000 6535 
TOTCHOL       0.0000 1.0004 5728 
mean_PPD      0.0001 1.0002 6925 
log-posterior 0.0317 1.0007 2999 

For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).

The stan model used default normal priors for intercepts and coefficients. Default priors are automatically scaled to limit how informative they are - below we can see the adjusted (scaled) prior for the coefficients differs from the default 2.5:

Show code

# summary of priors
print(prior_summary(model_bayes_glm_heart),digits=4)

Priors for model 'model_bayes_glm_heart' 
------
Intercept (after predictors centered)
 ~ normal(location = 0, scale = 2.5)

Coefficients
  Specified prior:
    ~ normal(location = [0,0,0,...], scale = [2.5,2.5,2.5,...])
  Adjusted prior:
    ~ normal(location = [0,0,0,...], scale = [0.2926,5.0406,4.9991,...])
------
See help('prior_summary.stanreg') for more details

Credible intervals express the probability that a parameter falls within a given range (vs a confidence interval which is the probability that a range contains the true value of the parameter). These credible intervals are computed by finding the relevant quantiles of the draws from the posterior distribution. They produce similar ranges to the confidence intervals from the frequentist approach.

Show code

confint(final_model_glm_heart, level=0.95)

                  2.5 %       97.5 %
(Intercept) -8.52662803 -5.614429675
AGE          0.07086231  0.110670109
SEX         -1.34930362 -0.689678256
PREVHYP      0.06522520  0.737205187
DIABETES     0.51152722  1.511076030
TOTCHOL     -0.00208170  0.004889717

Show code

posterior_interval(model_bayes_glm_heart, prob=0.95)

                    2.5%        97.5%
(Intercept) -8.547304593 -5.606970389
AGE          0.071086672  0.111312772
SEX         -1.345066980 -0.690262317
PREVHYP      0.072906510  0.732413757
DIABETES     0.496030231  1.505244789
TOTCHOL     -0.002016689  0.004926423

Adjusting scales and changing priors

stan_glm allows a selection of prior distributions for the coefficients, intercept, and auxiliary parameters. The prior distributions can be altered by specifying different locations/scales/dfs, but all the prior take the form of a single chosen probability distribution.

Some drivers for wanting to specify a prior: * Available information/ research might suggest a base value for our parameters. Particularly valuable if our own observed data is sparse. * We may not have a good idea of the value, but know that it is constrained in some way e.g. is positive.

The scale is the standard deviation of the prior. As noted earlier, the model autoscales to ensure priors are not too informative. We can turn this off when we specify our own priors. Documentation on choice of priors is here.

To allow for varying distributions across the priors, we need to define the model in STAN or JAGS. See the linear regression example below for an example of this in JAGS.

Specifying our own priors below. Little impact on the posterior estimates.

Show code

model_bayes_glm_heart_scale <- stan_glm(PREVCHD ~ AGE + SEX + PREVHYP + DIABETES + TOTCHOL, family = "binomial",data=df_heart_train, iter=2000, warmup=500,
                            prior_intercept = normal(location = 0, scale = 10, autoscale = FALSE),
                            prior = normal(location = 0, scale = 5, autoscale = FALSE),
                            prior_aux = exponential(rate = 1, autoscale = FALSE)
                            )


SAMPLING FOR MODEL 'bernoulli' NOW (CHAIN 1).
Chain 1: 
Chain 1: Gradient evaluation took 0.001 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 10 seconds.
Chain 1: Adjust your expectations accordingly!
Chain 1: 
Chain 1: 
Chain 1: Iteration:    1 / 2000 [  0%]  (Warmup)
Chain 1: Iteration:  200 / 2000 [ 10%]  (Warmup)
Chain 1: Iteration:  400 / 2000 [ 20%]  (Warmup)
Chain 1: Iteration:  501 / 2000 [ 25%]  (Sampling)
Chain 1: Iteration:  700 / 2000 [ 35%]  (Sampling)
Chain 1: Iteration:  900 / 2000 [ 45%]  (Sampling)
Chain 1: Iteration: 1100 / 2000 [ 55%]  (Sampling)
Chain 1: Iteration: 1300 / 2000 [ 65%]  (Sampling)
Chain 1: Iteration: 1500 / 2000 [ 75%]  (Sampling)
Chain 1: Iteration: 1700 / 2000 [ 85%]  (Sampling)
Chain 1: Iteration: 1900 / 2000 [ 95%]  (Sampling)
Chain 1: Iteration: 2000 / 2000 [100%]  (Sampling)
Chain 1: 
Chain 1:  Elapsed Time: 46.861 seconds (Warm-up)
Chain 1:                56.413 seconds (Sampling)
Chain 1:                103.274 seconds (Total)
Chain 1: 

SAMPLING FOR MODEL 'bernoulli' NOW (CHAIN 2).
Chain 2: 
Chain 2: Gradient evaluation took 0.001 seconds
Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 10 seconds.
Chain 2: Adjust your expectations accordingly!
Chain 2: 
Chain 2: 
Chain 2: Iteration:    1 / 2000 [  0%]  (Warmup)
Chain 2: Iteration:  200 / 2000 [ 10%]  (Warmup)
Chain 2: Iteration:  400 / 2000 [ 20%]  (Warmup)
Chain 2: Iteration:  501 / 2000 [ 25%]  (Sampling)
Chain 2: Iteration:  700 / 2000 [ 35%]  (Sampling)
Chain 2: Iteration:  900 / 2000 [ 45%]  (Sampling)
Chain 2: Iteration: 1100 / 2000 [ 55%]  (Sampling)
Chain 2: Iteration: 1300 / 2000 [ 65%]  (Sampling)
Chain 2: Iteration: 1500 / 2000 [ 75%]  (Sampling)
Chain 2: Iteration: 1700 / 2000 [ 85%]  (Sampling)
Chain 2: Iteration: 1900 / 2000 [ 95%]  (Sampling)
Chain 2: Iteration: 2000 / 2000 [100%]  (Sampling)
Chain 2: 
Chain 2:  Elapsed Time: 38.091 seconds (Warm-up)
Chain 2:                46.276 seconds (Sampling)
Chain 2:                84.367 seconds (Total)
Chain 2: 

SAMPLING FOR MODEL 'bernoulli' NOW (CHAIN 3).
Chain 3: 
Chain 3: Gradient evaluation took 0 seconds
Chain 3: 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
Chain 3: Adjust your expectations accordingly!
Chain 3: 
Chain 3: 
Chain 3: Iteration:    1 / 2000 [  0%]  (Warmup)
Chain 3: Iteration:  200 / 2000 [ 10%]  (Warmup)
Chain 3: Iteration:  400 / 2000 [ 20%]  (Warmup)
Chain 3: Iteration:  501 / 2000 [ 25%]  (Sampling)
Chain 3: Iteration:  700 / 2000 [ 35%]  (Sampling)
Chain 3: Iteration:  900 / 2000 [ 45%]  (Sampling)
Chain 3: Iteration: 1100 / 2000 [ 55%]  (Sampling)
Chain 3: Iteration: 1300 / 2000 [ 65%]  (Sampling)
Chain 3: Iteration: 1500 / 2000 [ 75%]  (Sampling)
Chain 3: Iteration: 1700 / 2000 [ 85%]  (Sampling)
Chain 3: Iteration: 1900 / 2000 [ 95%]  (Sampling)
Chain 3: Iteration: 2000 / 2000 [100%]  (Sampling)
Chain 3: 
Chain 3:  Elapsed Time: 40.355 seconds (Warm-up)
Chain 3:                48.702 seconds (Sampling)
Chain 3:                89.057 seconds (Total)
Chain 3: 

SAMPLING FOR MODEL 'bernoulli' NOW (CHAIN 4).
Chain 4: 
Chain 4: Gradient evaluation took 0 seconds
Chain 4: 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
Chain 4: Adjust your expectations accordingly!
Chain 4: 
Chain 4: 
Chain 4: Iteration:    1 / 2000 [  0%]  (Warmup)
Chain 4: Iteration:  200 / 2000 [ 10%]  (Warmup)
Chain 4: Iteration:  400 / 2000 [ 20%]  (Warmup)
Chain 4: Iteration:  501 / 2000 [ 25%]  (Sampling)
Chain 4: Iteration:  700 / 2000 [ 35%]  (Sampling)
Chain 4: Iteration:  900 / 2000 [ 45%]  (Sampling)
Chain 4: Iteration: 1100 / 2000 [ 55%]  (Sampling)
Chain 4: Iteration: 1300 / 2000 [ 65%]  (Sampling)
Chain 4: Iteration: 1500 / 2000 [ 75%]  (Sampling)
Chain 4: Iteration: 1700 / 2000 [ 85%]  (Sampling)
Chain 4: Iteration: 1900 / 2000 [ 95%]  (Sampling)
Chain 4: Iteration: 2000 / 2000 [100%]  (Sampling)
Chain 4: 
Chain 4:  Elapsed Time: 33.01 seconds (Warm-up)
Chain 4:                46.822 seconds (Sampling)
Chain 4:                79.832 seconds (Total)
Chain 4:

Show code

print(summary(model_bayes_glm_heart_scale),digits=4)


Model Info:
 function:     stan_glm
 family:       binomial [logit]
 formula:      PREVCHD ~ AGE + SEX + PREVHYP + DIABETES + TOTCHOL
 algorithm:    sampling
 sample:       6000 (posterior sample size)
 priors:       see help('prior_summary')
 observations: 2679
 predictors:   6

Estimates:
              mean    sd      10%     50%     90%  
(Intercept) -7.0698  0.7338 -8.0009 -7.0838 -6.1308
AGE          0.0908  0.0101  0.0780  0.0909  0.1040
SEX         -1.0192  0.1661 -1.2337 -1.0173 -0.8074
PREVHYP      0.4000  0.1728  0.1778  0.4028  0.6215
DIABETES     1.0105  0.2562  0.6795  1.0130  1.3361
TOTCHOL      0.0014  0.0018 -0.0008  0.0014  0.0037

Fit Diagnostics:
           mean   sd     10%    50%    90% 
mean_PPD 0.0706 0.0066 0.0623 0.0705 0.0791

The mean_ppd is the sample average posterior predictive distribution of the outcome variable (for details see help('summary.stanreg')).

MCMC diagnostics
              mcse   Rhat   n_eff
(Intercept)   0.0097 0.9997 5764 
AGE           0.0001 0.9999 5503 
SEX           0.0027 1.0001 3852 
PREVHYP       0.0028 1.0007 3678 
DIABETES      0.0042 1.0001 3749 
TOTCHOL       0.0000 0.9999 6107 
mean_PPD      0.0001 1.0007 5266 
log-posterior 0.0361 1.0006 2327 

For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).

Model fit

We are able to extract predictions from each iteration of the posterior sampling. We can extract predictions for new data. Could be tested against the observations for goodness of fit. The follwing from the manual, “RSTANARM Manual” (2020), p43:

“The posterior predictive distribution (posterior_predict) is the distribution of the outcome implied by the model after using the observed data to update our beliefs about the unknown parameters in the model. Simulating data from the posterior predictive distribution using the observed predictors is useful for checking the fit of the model. Drawing from the posterior predictive distribution at interesting values of the predictors also lets us visualize how a manipulation of a predictor affects (a function of)the outcome(s). With new observations of predictor variables we can use the posterior predictive distribution to generate predicted outcomes.”

Show code

posteriors <- posterior_predict(model_bayes_glm_heart,newdata=df_heart_test)
posterior <- as.data.frame(posteriors) # each column is a data point from the dataset and each row is a prediction
# Print 10 predictions for 5 lives
print("10 predictions for 5 lives")

[1] "10 predictions for 5 lives"

Show code

posteriors[1:10, 1:5]

      4 6 12 17 18
 [1,] 0 0  0  0  0
 [2,] 0 0  0  0  0
 [3,] 0 0  0  0  0
 [4,] 0 0  0  0  0
 [5,] 0 1  1  0  0
 [6,] 0 0  1  0  0
 [7,] 0 0  0  0  0
 [8,] 0 0  0  0  0
 [9,] 0 0  0  0  0
[10,] 0 0  0  0  0

We can plot the posterior distributions for the model parameters from the samples:

Show code

posterior_heart <- as.matrix(model_bayes_glm_heart)

plot1<- mcmc_areas(posterior_heart, pars = c("AGE"), prob = 0.80)
plot2<- mcmc_areas(posterior_heart, pars = c("SEX"), prob = 0.80)
plot3<- mcmc_areas(posterior_heart, pars = c("PREVHYP"), prob = 0.80)
plot4<- mcmc_areas(posterior_heart, pars = c("DIABETES"), prob = 0.80)
plot5<- mcmc_areas(posterior_heart, pars = c("TOTCHOL"), prob = 0.80)

grid.arrange(plot1,plot2, plot3, plot4, plot5, top="Posterior distributions (median + 80% intervals)")

We can extract the \(R^2\) from the posterior distribution and plot:

Show code

r2_posterior <- bayes_R2(model_bayes_glm_heart)
summary(r2_posterior)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
0.04243 0.07155 0.08063 0.08122 0.08992 0.14692

Show code

hist(r2_posterior) #expect normal

Posterior predictive checks can be extracted using “pp_check”

“dens_overlay” - each light blue line represents the distribution of predicted scores from a single replication, and the dark blue line represents the observed data. If the model fits, the dark blue line should align closely with the light blue lines.
“stat” in the “pp_check” function, we can get a distribution of the mean of the dependent variable from all replications. These are the light blue bars: the means from each replication plotted as a histogram. The mean from the observed data is then plotted on top as a dark blue bar.

Here the outcome is binary so the density overlay outcomes are concentrated at one or zero.

Show code

pp_check(model_bayes_glm_heart, "stat")

Show code

pp_check(model_bayes_glm_heart, "dens_overlay")

Further considerations: The loo package provides “efficient approximate leave-one-out cross-validation (LOO), approximate standard errors for estimated predictive errors (including comparisons), and the widely applicable information criterion (WAIC).The loo package can be used to estimate predictive error of any MCMC item-level log likelihood output.”

Regression - linear

The above example looked at a binary response. In this example we’ll look at predicting a continuous variable using data on my household electricity consumption over 2016-2021. The above example used stan_glm in RSTANARM. Here we will use a model defined in RJAGS with additional flexibility around the choice of priors, which is also possible in a paramaterised model for STAN. RJAGS combines the power of R with the “Just Another Gibbs Sampler” or “JAGS” engine. To get started, first download the “JAGS” program, then set up the packages as above.

Both JAGS and STAN allow for the parameterisation of a range of different models. The example below was chosen for comparison with a conventional GLM approach.

Data prep:

Daily electricity usage for a household in Sydney over 2019-21. Rows represent half hourly readings/ estimates.
Data are grouped into daily usage.
Also includes are temperatures, sunlight and rainfall data for the area taken from BOM.
Calculated fields for weekdays vs weekends, seasons and a post-covid working from home indicator.

Show code

df_usage <- read.csv(file = 'elec_usage.csv')
df_usage <- df_usage %>% mutate(Date = as.Date(Date, format="%Y-%m-%d"), EndDate = parse_date_time(EndDate, '%Y-%m-%d %H:%M:%S'), StartDate = parse_date_time(StartDate, '%Y-%m-%d %H:%M:%S'))
df_usage <- df_usage[c(-2,-3,-4,-6)] %>% # drop start and end dates, rate type, duration
group_by(Date) %>% summarise(Usage = sum(Usage), MaxTemp = max(MaxTemp), MinTemp = min(MinTemp), Rain_mm = max(Rain_mm), Solar_Exposure = max(Solar_Exposure)) %>%
# Add in fields to identify seasons, weekend/ weekday and an indicator for the post-covid/ work from home period
mutate(Season=
  if_else(month(Date)>= 3 & month(Date) < 6,"Autumn", 
  if_else(month(Date)>= 6 & month(Date) < 9,"Winter", 
  if_else(month(Date)>= 9 & month(Date) < 12,"Spring","Summer"))),
Covid_Ind=if_else(Date>= "2020-03-15",1,0),
Weekend=
  if_else(wday(Date)==7 | wday(Date)==1,1,0))

df_usage$Season <- factor(df_usage$Season, levels = c("Summer","Autumn","Winter","Spring")) # change season levels

head(df_usage)

# A tibble: 6 x 9
  Date       Usage MaxTemp MinTemp Rain_mm Solar_Exposure Season
  <date>     <dbl>   <dbl>   <dbl>   <dbl>          <dbl> <fct> 
1 2019-12-21 21.1     26.7    21       0             27.1 Summer
2 2019-12-22  9.53    19.9    18.6     0             11.9 Summer
3 2019-12-23 15.7     22.1    17.2     0             14.4 Summer
4 2019-12-24 18.7     24      18.6     1.8           18.3 Summer
5 2019-12-25 18.9     23.2    21.6     0.2           27.8 Summer
6 2019-12-26 14.6     23.6    21.4     0             30.8 Summer
# ... with 2 more variables: Covid_Ind <dbl>, Weekend <dbl>

Data is mostly complete:

Show code

introduce(df_usage)

# A tibble: 1 x 9
   rows columns discrete_columns continuous_columns all_missing_colum~
  <int>   <int>            <int>              <int>              <int>
1   732       9                2                  7                  0
# ... with 4 more variables: total_missing_values <int>,
#   complete_rows <int>, total_observations <int>, memory_usage <dbl>

Show code

plot_intro(df_usage)

Show code

plot_missing(df_usage)

Drop records with NAs as they are few, also only include non-zero usage, review explanatory variables; Gamma dist potentially a better fit than normal for Usage given the skewness in the tail; alternatively use normal with a log link.

Show code

df_usage <- na.omit(df_usage)
df_usage <- df_usage %>% filter(Usage>0)

plot_bar(df_usage, title="Barplot")

Show code

plot_histogram(df_usage, title="Histogram for continuous")

Show code

qq_data <- df_usage[,c("MaxTemp", "MinTemp", "Solar_Exposure", "Usage")]
plot_qq(qq_data, title="QQ plot")

Show code

log_qq_data <- update_columns(qq_data, 2:4, function(x) log(x + 1))
plot_qq(log_qq_data, title="QQ plot, logged data")

Show code

usage_fit_gamma <- fitdist(df_usage$Usage, distr = "gamma", method = "mme")
summary(usage_fit_gamma)

Fitting of the distribution ' gamma ' by matching moments 
Parameters : 
       estimate
shape 2.9114935
rate  0.1164095
Loglikelihood:  -2813.89   AIC:  5631.78   BIC:  5640.902

Show code

plot(usage_fit_gamma)

Usage correlated with season, max/min temp and covid wfh indicator. Max and Min Temps are not surprisingly highly correlated with one another, with the seasons and with solar exposure.

Show code

plot_correlation(na.omit(df_usage))

Split data into training and testing data sets:

Show code

# Create a random sample of row IDs
sample_rows <- sample(nrow(df_usage),0.75*nrow(df_usage))
# Create the training dataset
df_usage_train <- df_usage[sample_rows,]
# Create the test dataset
df_usage_test <- df_usage[-sample_rows,]

Fit a simple GLM using all variables:

Show code

full_model_glm_usage <- glm(Usage~.,data=df_usage_train[-1], gaussian(link="log"))
summary(full_model_glm_usage)


Call:
glm(formula = Usage ~ ., family = gaussian(link = "log"), data = df_usage_train[-1])

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-47.739   -4.237   -0.169    4.152   49.830  

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)     5.3435178  0.1980361  26.983  < 2e-16 ***
MaxTemp        -0.0236931  0.0081617  -2.903  0.00385 ** 
MinTemp        -0.0963046  0.0094105 -10.234  < 2e-16 ***
Rain_mm        -0.0005874  0.0014031  -0.419  0.67565    
Solar_Exposure -0.0170304  0.0036434  -4.674 3.76e-06 ***
SeasonAutumn    0.0853256  0.0869296   0.982  0.32678    
SeasonWinter    0.0690973  0.0969988   0.712  0.47657    
SeasonSpring    0.0600080  0.0872295   0.688  0.49180    
Covid_Ind      -0.0298564  0.1116571  -0.267  0.78927    
Weekend        -0.0736053  0.0342713  -2.148  0.03220 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 86.66816)

    Null deviance: 114574  on 529  degrees of freedom
Residual deviance:  45065  on 520  degrees of freedom
AIC: 3880.9

Number of Fisher Scoring iterations: 8

Use stepwise regression to reduce the explanatory variables:

Show code

# specify a null model with no predictors
null_model_glm_usage <- glm(Usage ~ 1, data = df_usage_train, gaussian(link="log"))
# use a forward stepwise algorithm to build a parsimonious model
step_model_glm_usage <- step(null_model_glm_usage, scope = list(lower = null_model_glm_usage, upper = full_model_glm_usage), direction = "forward")

Summary of the model:

Show code

summary(step_model_glm_usage)


Call:
glm(formula = Usage ~ MinTemp + Solar_Exposure + MaxTemp + Weekend, 
    family = gaussian(link = "log"), data = df_usage_train)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-47.704   -4.045   -0.127    4.071   49.756  

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)     5.386132   0.093447  57.639  < 2e-16 ***
MinTemp        -0.099404   0.007008 -14.184  < 2e-16 ***
Solar_Exposure -0.017416   0.003131  -5.563 4.24e-08 ***
MaxTemp        -0.021716   0.007748  -2.803  0.00525 ** 
Weekend        -0.073802   0.034035  -2.168  0.03057 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 86.06824)

    Null deviance: 114574  on 529  degrees of freedom
Residual deviance:  45185  on 525  degrees of freedom
AIC: 3872.3

Number of Fisher Scoring iterations: 7

For simplicity, we’ll assume a model defined by MaxTemp and MinTemp only, noting the significant correlation with other explanatory variables.

Show code

gamma_model_glm_usage <- glm(Usage ~ MaxTemp + MinTemp, family = Gamma(link="inverse"), data = df_usage_train)
summary(gamma_model_glm_usage)


Call:
glm(formula = Usage ~ MaxTemp + MinTemp, family = Gamma(link = "inverse"), 
    data = df_usage_train)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-2.08514  -0.27462  -0.05095   0.18899   2.08795  

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.0290515  0.0039115  -7.427  4.5e-13 ***
MaxTemp      0.0006336  0.0003152   2.011   0.0449 *  
MinTemp      0.0040192  0.0003017  13.321  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Gamma family taken to be 0.1800649)

    Null deviance: 197.76  on 529  degrees of freedom
Residual deviance: 103.91  on 527  degrees of freedom
AIC: 3882.3

Number of Fisher Scoring iterations: 5

Changing the family to gaussian with log link (later graph shows little change in predictions, although AIC is higher). Select gaussian for ease of comparison with the Bayes model.

Show code

final_model_glm_usage <- glm(Usage ~ MaxTemp + MinTemp, family = gaussian(link="log"), data = df_usage_train)
summary(final_model_glm_usage)


Call:
glm(formula = Usage ~ MaxTemp + MinTemp, family = gaussian(link = "log"), 
    data = df_usage_train)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-52.033   -4.352   -0.494    4.846   48.033  

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  5.384399   0.097722  55.099  < 2e-16 ***
MaxTemp     -0.037435   0.007621  -4.912  1.2e-06 ***
MinTemp     -0.095235   0.007292 -13.060  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 91.93297)

    Null deviance: 114574  on 529  degrees of freedom
Residual deviance:  48448  on 527  degrees of freedom
AIC: 3905.2

Number of Fisher Scoring iterations: 7

Extract coefficients:

Show code

final_model_glm_usage_coeff <- as.data.frame(coef(final_model_glm_usage))
final_model_glm_usage_coeff

            coef(final_model_glm_usage)
(Intercept)                  5.38439891
MaxTemp                     -0.03743478
MinTemp                     -0.09523517

Project the result for a dummy record showing the link:

Show code

print("Dummy prediction with min of 10 and max of 20:")

[1] "Dummy prediction with min of 10 and max of 20:"

Show code

predict(final_model_glm_usage, newdata= as.data.frame(cbind(MaxTemp=20,MinTemp=10)), # model and data
  type="response", # or terms for coefficients
  se.fit = FALSE)

      1 
39.7795

Show code

# replicate
exp(final_model_glm_usage_coeff[1,] + final_model_glm_usage_coeff[2,]*20+final_model_glm_usage_coeff[3,]*10)

[1] 39.7795

Adding the predictions to the test data:

Show code

# predictions, normal
pred_usage <- as.data.frame(
  predict(final_model_glm_usage, newdata= df_usage_test, # model and data
  type="response", # or terms for coefficients
  se.fit = TRUE, # default is false
  interval = "confidence", #default "none", also "prediction"
  level = 0.95
  )
)

# predictions, gamma
usage_pred_gamma <- 
  predict(gamma_model_glm_usage, newdata= df_usage_test, # model and data
  type="response", # or terms for coefficients
  se.fit = FALSE, # default is false
  )

# rename
pred_usage <- rename(pred_usage,usage_pred=fit,se_usage_pred=se.fit)

# add back to data
df_usage_test_pred <- cbind(df_usage_test,pred_usage, usage_pred_gamma)

Plotting the results shows a poorer fit for winter when the usage spikes and in instances with outlier temperatures. Predictions for the test data are not hugely different to the gamma family model.

Show code

df_usage_test_pred  %>%
ggplot() +
geom_point(aes(x=Date,y=Usage,color=Season)) +
# add a modeled line
geom_line(aes(x=Date,y=usage_pred)) +
geom_line(aes(x=Date,y=usage_pred_gamma), linetype = "dashed") +
labs(x="Date", y="Usage", title = "Actual vs predicted usage by date and season",subtitle="Dashed=Gamma Pred; Solid=Normal Pred") +
theme_pander() + scale_color_gdocs()

Show code

# plot temperatures
df_usage_test_pred  %>%
ggplot() +
geom_point(aes(x=Date,y=MaxTemp,color=Season)) + 
labs(x="Date", y="Usage", title = "MaxTemp by date and season") +
theme_pander() + scale_color_gdocs()

Plots show slightly higher residuals at higher predicted values for the step regression model and final selected. QQ plot suggests skewness in the data, which we could see earlier in the histogram of usage.

Show code

par(mfrow = c(2, 2))
plot(final_model_glm_usage)

Show code

plot(step_model_glm_usage)

Define a regression model in RJAGS. The structure here is more flexible than stan_glm which is predefined. This allows for a wider selection of prior distributions and a mixture of modeling choices. Also possible to specify model forms in this way for STAN.

\(Y_i \propto N(m_i,s^2)\)

Where

\(m_i = a + bX_1i +cX_2i\)
\(a, b, c and s\) have some assumed prior distributions.

Note that the distribution syntax is slightly different to R, for example in R a normal distribution is specified using the mean and standard deviation, but in JAGS the second parameter of the dnorm command is the distribution’s precision (1 / variance).

Show code

usage_model <- "model{
  #Likelihood model for Y[i]
  for (i in 1:length(Y)) {
      Y[i] ~ dnorm(m[i], s^(-2)) # link=log
      log(m[i]) <- a + b*X_1[i] + c*X_2[i]
  }

  # Prior models for a, b, c, s
  a ~ dnorm(0, 0.001)     # Intercept
  b ~ dnorm(0, 0.001)     # MaxTemp
  c ~ dnorm(0, 0.001)     # MinTemp
  s ~ dunif(0, 10)        # Error

}"

Simulate from the model:

burn-in period of 2k iterations aims for convergence and throws away these earlier test samples.
10k iterations for 3 chains.
saving every 10th sample only in order to save space.

Show code

usage_sim <- jags.model(textConnection(usage_model),
              data=list(Y = df_usage_train$Usage, X_1 = df_usage_train$MaxTemp, X_2 = df_usage_train$MinTemp),
              n.chains = 3)

Compiling model graph
   Resolving undeclared variables
   Allocating nodes
Graph information:
   Observed stochastic nodes: 530
   Unobserved stochastic nodes: 4
   Total graph size: 3457

Initializing model

Show code

usage_posterior <- coda.samples(usage_sim, 
                                 variable.names = c("a","b","c","s"), 
                                 n.iter = 10000, n.burnin = 2000, n.thin = 10, DIC = F)

Summary of the results below:

The mean of the b chain value was -0.040. This is an estimate of the posterior mean of b. This is consistent with the GLM above
The corresponding naive standard error, 0.000253, measures the potential error in this estimate. This error is calculated by dividing the standard deviation of the b chain values by the square root of the number of iterations (or sample size). This is a naive approach to calculating error in a setting with dependent samples.

Also showing the density plots for the posteriors as well as the trace plots for each of the chains.

Show code

summary(usage_posterior)


Iterations = 1001:11000
Thinning interval = 1 
Number of chains = 3 
Sample size per chain = 10000 

1. Empirical mean and standard deviation for each variable,
   plus standard error of the mean:

      Mean       SD  Naive SE Time-series SE
a  5.37886 0.105507 6.091e-04      0.0069295
b -0.03689 0.008541 4.931e-05      0.0008687
c -0.09569 0.007709 4.451e-05      0.0004970
s  9.55431 0.245370 1.417e-03      0.0019104

2. Quantiles for each variable:

      2.5%     25%      50%      75%    97.5%
a  5.17760  5.3076  5.37331  5.44719  5.59401
b -0.05425 -0.0428 -0.03664 -0.03079 -0.02105
c -0.11062 -0.1009 -0.09578 -0.09058 -0.08016
s  9.03848  9.3878  9.56855  9.74158  9.96367

Show code

plot(usage_posterior,trace=FALSE)

Show code

par(mfrow = c(2, 2))
plot(usage_posterior,density=FALSE)

We can plot the posterior distributions for the model parameters from the samples:

Show code

posterior_usage <- as.matrix(usage_posterior)

plot1<- mcmc_areas(posterior_usage, pars = c("a"), prob = 0.80)
plot2<- mcmc_areas(posterior_usage, pars = c("b"), prob = 0.80)
plot3<- mcmc_areas(posterior_usage, pars = c("c"), prob = 0.80)
plot4<- mcmc_areas(posterior_usage, pars = c("s"), prob = 0.80)

grid.arrange(plot1,plot2, plot3, plot4, top="Posterior distributions (median + 80% intervals)")

Similar to the earlier example, we could extract the \(R^2\) from the posterior distribution as well as run posterior predictive checks.

Further considerations: extracting the mean of the parameter projections, we could produce predictions for the test data. We could also produce predictions for test data point based upon each of the samples from the posterior, so above we would produce 2400 predictions for each data point (3x10k, discarding the first 2k from each chain and only saving 1/10 of the samples).

Estimating COVID 19 infections in NSW, Australia

Hugh Miller, Isabella Lyons (2012) discusses a method of forecasting COVID infections using the package “epinow2”, Sam Abbott (2020). The package makes use of STAN.

“[Epinow2] estimates the time-varying reproduction number on cases by date of infection… Imputed infections are then mapped to observed data (for example cases by date of report) via a series of uncertain delay distributions…”

That is to say, the model makes some assumptions about lags in symptoms and in reporting (assuming some distribution for those delays) in order to estimate the true underlying number of infections (backwards). Using that backcast it projects out how those infections might be reported over time. The model also estimates the effective reproduction number in order to predict future infections.

Covid case and test data sourced from the department of NSW Website on 17/01/2022.

We should be careful in placing too heavy a reliance on any models of incidence which are based upon case data (particularly where case rates are expected to be under reported or where testing is incomplete). The example is purely illustrative and does not attempt to correct for gaps in the data.

Case data:

Show code

df_cases <- read.csv(file = 'confirmed_cases_table1_location.csv') 
head(df_cases)

  notification_date postcode lhd_2010_code        lhd_2010_name
1        2020-01-25     2134          X700               Sydney
2        2020-01-25     2121          X760      Northern Sydney
3        2020-01-25     2071          X760      Northern Sydney
4        2020-01-27     2033          X720 South Eastern Sydney
5        2020-03-01     2077          X760      Northern Sydney
6        2020-03-01     2163          X710 South Western Sydney
  lga_code19      lga_name19
1      11300     Burwood (A)
2      16260  Parramatta (C)
3      14500 Ku-ring-gai (A)
4      16550    Randwick (C)
5      14000     Hornsby (A)
6      12850   Fairfield (C)

Show code

df_tests <- read.csv(file = 'pcr_testing_table1_location_agg.csv') 
head(df_tests)

   test_date postcode lhd_2010_code   lhd_2010_name lga_code19
1 2020-01-01     2038          X700          Sydney      14170
2 2020-01-01     2039          X700          Sydney      14170
3 2020-01-01     2040          X700          Sydney      14170
4 2020-01-01     2041          X700          Sydney      14170
5 2020-01-01     2069          X760 Northern Sydney      14500
6 2020-01-01     2074          X760 Northern Sydney      14500
       lga_name19 test_count
1  Inner West (A)          1
2  Inner West (A)          1
3  Inner West (A)          2
4  Inner West (A)          1
5 Ku-ring-gai (A)          1
6 Ku-ring-gai (A)          1

Show code

# cases
df_cases <- df_cases %>% mutate(descr="cases",date=as.Date(notification_date),cases=1) %>%
filter(date > "2021-04-30",date <= "2021-12-31") %>% 
group_by(date, descr) %>% 
summarise(confirm=sum(cases)) 
head(df_cases)

# A tibble: 6 x 3
# Groups:   date [6]
  date       descr confirm
  <date>     <chr>   <dbl>
1 2021-05-01 cases       2
2 2021-05-02 cases       8
3 2021-05-03 cases       9
4 2021-05-04 cases       7
5 2021-05-05 cases       9
6 2021-05-06 cases       4

Plotting recorded cases by date (PCR test outcomes only and only to 31 December 2021):

Show code

df_cases %>% mutate(omicron=ifelse(date > "2021-11-26","first case+","pre")) %>%
ggplot(data=., mapping = aes(x=date, y=confirm)) + 
geom_col(aes(fill=omicron)) +
scale_x_date(date_breaks = "months" , date_labels = "%b-%y", limits = c(as.Date("2021-04-30"), as.Date("2021-12-31"))) +
labs(x="Date", y="Cases", title = "Covid cases by date, NSW") +
theme_pander() + scale_color_gdocs()

Test data:

Show code

df_tests <- df_tests %>% mutate(descr="tests",date=as.Date(test_date)) %>%
filter(date > "2021-04-30",date <= "2021-12-31") %>% 
group_by(date, descr) %>% 
summarise(tests=sum(test_count)) 
head(df_tests)

# A tibble: 6 x 3
# Groups:   date [6]
  date       descr tests
  <date>     <chr> <int>
1 2021-05-01 tests  4514
2 2021-05-02 tests  4775
3 2021-05-03 tests 14877
4 2021-05-04 tests 10868
5 2021-05-05 tests 13126
6 2021-05-06 tests 24029

Plotting tests by date (PCR tests only; to 31 December 2021):

Show code

ggplot(data=df_tests, mapping = aes(x=date, y=tests))+ 
geom_bar(stat="identity") +
scale_x_date(date_breaks = "months" , date_labels = "%b-%y", limits = c(as.Date("2021-04-30"), as.Date("2021-12-31"))) +
labs(x="Date", y="Tests", title = "PCR tests by date, NSW") +
theme_pander() + scale_color_gdocs()

Merged to get positive test rate:

Show code

# merge
df <- merge(x = df_cases, y = df_tests, by = "date", all.x = TRUE) %>%
mutate(pos_rate=pmin(1, confirm/tests)) %>%
fill(pos_rate, .direction = "downup") %>%
# add smoothed positive rate
mutate(pos_rate_sm = fitted(loess(pos_rate~as.numeric(date),span=0.1))) # loess - Local Polynomial Regression Fitting; span - the parameter which controls the degree of smoothing
# drop descriptions
df$descr.x <- NULL
df$descr.y <- NULL

Plot of positive rate shows that the proportion of tests returning a positive increases to above 5% from 25th December 2021, suggesting that the case data beyond this point is less complete.

Show code

# positive rate
ggplot(data=df, mapping = aes(x=date))+ 
geom_point(aes(y=(pos_rate))) +
geom_line(aes(y=pos_rate_sm)) +
scale_x_date(date_breaks = "months" , date_labels = "%b-%y", limits = c(as.Date("2021-04-30"), as.Date("2021-12-31"))) +
scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
labs(x="Date", y="Positive test rate", title = "Positive PCR test rate by date, NSW") +
theme_pander() + scale_color_gdocs()

There are some pragmatic methods for adjusting the case numbers for increasing test positivity rates. For our purposes we will focus the model on case data before 23 December 2021 in order to exclude the period where the test data became less reliable/ complete, noting that underlying data completeness issues are likely present.

Show code

# Reporting delays (delay from symptom onset to reporting/ testing positive)
reporting_delay <- estimate_delay(rlnorm(100, log(6), 1), max_value = 15)
cat("Mean reporting delay = ",reporting_delay$mean, "; ")

# Symptom generation time (time between infection events in an infector-infectee pair) and incubation period (time between moment of infection and symptom onset)
generation_time <- get_generation_time(disease = "SARS-CoV-2", source = "ganyani")
cat("Mean generation time =",generation_time$mean, "; ")
incubation_period <- get_incubation_period(disease = "SARS-CoV-2", source = "lauer")
cat("Mean incubation period =",incubation_period$mean, "; ")

Show code

df_cases_fit <- df_cases %>% filter(date <= "2021-12-22") # exclude data before 23 December 2021

estimates <- epinow(reported_cases = df_cases_fit[c("date","confirm")], # case data from NSW
                    generation_time = generation_time,
                    delays = delay_opts(incubation_period, reporting_delay),
                    rt = rt_opts(prior = list(mean = 2, sd = 0.2)),
                    stan = stan_opts(cores = 6, samples = 3000, warmup = 500, chains = 6, control = list(adapt_delta = 0.95)))

Importing a previous run of the above:

Show code

load("covid_estimates_summary.RData")

plot(estimates) produces predefined plots of cases by date reported and date of infection as well as the effective reproduction number. Given the dramatic increase in reported cases towards the end of the period, it is not surprising that the range of forecast cases is wide. Interestingly, the forecast/ partial forecast reproduction number reduces; presumably this is influenced by the earlier observations where case numbers were low. This issue has not been investigated further, however there are articles that discuss the impact of changes to the testing procedures and other limitations.

Merging forecast reported cases with the actual case data:

Show code

df <- merge(x = df, y = df_reported, by = "date", all.x = TRUE) %>% filter(date < "2021-12-30")

Plotting forecast vs actuals below - the week to 29 December is fully forecast from the model. Actual cases fall within the projected 90% confidence interval range, noting that the test positivity rate increased materially from 25th December so it is likely that the actual cases are under reported.

Show code

colours <- c("actual" = "#3366cc","modelled" = "#dc3912")

ggplot(data=df %>% filter(date >= "2021-12-01"), aes(x=date)) +
geom_line(aes(y=confirm, color="actual")) +
geom_line(aes(y=mean,color="modelled")) +
geom_ribbon(aes(ymin=lower_90, ymax=upper_90), fill="#dc3912", alpha=0.2) +
labs(x="Date", y="Reported cases", title = "Reported cases by date",color="Legend") +
scale_color_manual(values = colours) +
theme_pander()

Common issues in MCMC projections

Divergent transitions: Common fix is changing the step size using the adapt_delta argument in STAN.
Exceeding maximum tree depth: the is a point of efficiency. See notes here for assistance in STAN and other potential warnings.

Bayesian applications

Background

Applications

Libraries

Mental health example

Choosing a prior

Non-informative prior, uniform

Weakly informative prior, bimodial

Informative prior, “failure” more likely

Deriving posterior

Regression - logistic

Background

Heart disease data

Adjusting scales and changing priors

Model fit

Regression - linear

Estimating COVID 19 infections in NSW, Australia

Common issues in MCMC projections

Further reading

References