A quick Bayesian refresher, an extension to regression models and other applications.
Bayesian statistics reflects uncertainty about model parameters by assuming some prior probability distribution for those parameters. That prior uncertainty is adjusted for observed data to produce a posterior distribution.
Under a frequentest approach, model parameters are fixed and the observations are assumed to be sampled from a probability distribution with those parameters.
\(\Huge {p}(\theta,x) = \frac{{p}(x,\theta){p}(\theta)}{{p}(x)}\)
Where
The evidence term is proportional to the likelihood and therefore the theorem is often stated as
\(\Huge p(\theta | x) \propto p(x | \theta) p(\theta)\)
A paper presented at the 2021 Actuaries (Institute) Summit Hugh Miller, Isabella Lyons (2012) (p2) sets out the ‘tensions’ that exist between Frequentest and Bayesian approaches and suggests that:
“Despite actuaries being pioneers in early work on credibility theory, Bayesian approaches have largely given way to frequentist approaches. This is to our loss, since such methods have advanced substantially over the past couple of decades and they are well-suited to problems with latent variables or noise around underlying ‘true’ values.”
Here we’ll look at a few simple applications of Bayesian techniques.
The sections below set out a simple example Bayesian refresher, regression models and other applications.
A list of packages used in the recipes.
The 2017 Actuaries Institute Green Paper on Mental Health and Insurance Mental Health and Insurance (2017) talks about some of the challenges in obtaining reliable data and projections on mental health in insurance (Mental Health and Insurance 2017, p26):
“There are sources of data and research about the prevalence and profile of people with different mental health conditions, such as the 2007 National Survey of Mental Health and Wellbeing and many other epidemiological and clinical studies… [T]o have a complete and informed view of all this information is a difficult task for a full-time academic, let alone for an insurer.
Even then, the nature and granularity of the data needed for insurance applications is different from that for population, public health or clinical purposes. Firstly, the ‘exposure’ (the number of insured people and their characteristics) needs to be recorded. There is probably a lot of relevant information that is not collected at the start of an insurance policy and can therefore never be analysed.
Secondly, the claims information needs to be available at a detailed level regarding the nature and cause of the claim and other characteristics of the individual, their history and the cover provided. There is a clear need for consistency in definition, language and data standards to improve the quality of information available across the system.
Naturally, when the insurance cover for mental health conditions is not provided (as in the case of most travel insurance products) there will not be relevant data from the insurance history. However, even when relevant data exists within insurers there are practical and commercial difficulties in turning it into a useful form.”
The paper goes on to talk about some of the insurance challenges in detail including
Based upon data from a single insurer ~19% of IP and TPD claims relate to mental health, with mental health claims contributing 26% to the total cost of claims (in 2015, Mental Health and Insurance (2017), p17). This is roughly consistent with the proportion of the general population that experiences an episode of mental illness in a 12 month period, “Mental Health and Insurance” (2020), p110.
In this simple example, we want to know what the rate of mental illness, \(\theta\) is in the insured population. Assume some true underlying \(\theta\), here a fixed value of 20%. Create a population, \(X\) with \(X \sim B(N,\theta)\) where N=100000. Pull a random sample from population.
The posterior is proportional to the product of the prior and the likelihood. The beta distribution is a conjugate prior for the binomial distribution (posterior is the same probability distribution family as prior). Assume \(\theta \sim {\sf Beta}(\alpha, \beta)\), then:
\(f(\theta;\alpha,\beta) = \theta^{\alpha-1}(1-\theta)^{\beta-1}\)
It represents the probabilities assigned to values of \(\theta\) in the domain (0,1) given values for the parameters \(\alpha\) and \(\beta\). The binomial posterior distribution represents the probability of values of \(X\) given \(\theta\). Varying the values of \(\alpha\) and \(\beta\) can represent a wide range of different prior beliefs about the distribution of \(\theta\).
Priors can be “informative” “uninformative” or “weakly informative”, e.g.
Defining prior plots:
plot_prior <- function(a,b){
theta = seq(0,1,0.005)
p_theta = dbeta(theta, a, b)
p <- qplot(theta, p_theta, geom='line')
p <- p + theme_bw()
p <- p + ylab(expression(paste('p(',theta,')', sep = '')))
p <- p + xlab(expression(theta))
return(p)}
plot_prior(1,1) + labs(title="Uniform Prior, a= 1 and b = 1") + theme(plot.title = element_text(hjust = 0.5))
plot_prior(0.5,0.5) + labs(title="Bimodial Prior, a= 0.5 and b = 0.5") + theme(plot.title = element_text(hjust = 0.5))
The higher the number of observations, the greater the influence on the posterior
plot_prior(5,30) + labs(title="Uniform Prior, a= 5 and b = 30") + theme(plot.title = element_text(hjust = 0.5))
plot_prior(20,120) + labs(title="Uniform Prior, a= 20 and b = 120") + theme(plot.title = element_text(hjust = 0.5))
Prior plot values. Re-express binomial as beta dist.
prior_a = 50 #successes
prior_b = 250 #failures
prior_n = prior_a + prior_b #observations
prior_theta = prior_a/prior_n
prior <- data.frame('Dist'='Prior','x'=seq(0,1,0.005), 'y'=dbeta(seq(0,1,0.005),prior_a,prior_b))
Observations are taken from the earlier binomial sample.
obs_a <- sample_success + 1
obs_b <- sample_n - sample_success + 1
observations <- data.frame('Dist'='Observations',x=seq(0,1,0.005), y=dbeta(seq(0,1,0.005),obs_a,obs_b))
In this example, the posterior is a simple combination of the prior and observations.
post_a <- sample_success + prior_a - 1 #combination of prior and observed successes
post_b <- sample_n - sample_success + prior_b - 1
post_theta <- post_a / (post_a + post_b)
posterior <- data.frame('Dist'='Posterior','x'=seq(0,1,0.005), 'y'=dbeta(seq(0,1,0.005),post_a,post_b))
This can be plotted as:
model_plot <- rbind(prior,observations,posterior)
with(model_plot, Dist <- factor(Dist, levels = c('Prior', 'Observations','Posterior'), ordered = TRUE))
qplot(model_plot$x, model_plot$y, geom='line',color = model_plot$Dist) +
ylab(expression(paste('p(',theta,')', sep = ''))) +
xlab(expression(theta)) +
geom_vline(xintercept = post_theta, linetype="dotted", color = "black", label="blah") +
scale_color_discrete(name = "Dist") +
annotate("text",x=post_theta, y=0, label=label_percent()(post_theta)) +
labs(title="Informative prior", subtitle="Low success") +
theme(plot.title = element_text(hjust = 0.5)) +
theme(plot.subtitle = element_text(hjust = 0.5))
RSTANARM is an interface in R for STAN where Stan is a platform for Bayesian inference - it is a probabilistic programming framework where the guts of the inference method happens in the background, while the user focuses on parameterising the model in R. RJAGS is an interface in R for JAGS (Just Another Gibbs Sample) where JAGS is similarly a platform for Bayesian modeling.
Models in both STAN and JAGS can make use of Markov chains Monte Carlo (MCMC) methods. MCMC methods provide a way to take random samples approximately from a posterior distribution. Such samples can be used to summarize any aspect of the posterior distribution of a statistical model.
RSTANARM and RJAGS have pre-written code for common models like regression.
Under a Frequentest approach to linear regression, for a given set of response variables y, the relationship between y and a set of regressor variables x is assumed to be linear (informed by a set of parameters \(\beta\)), with an error term, i.e. \(y = X\beta + \epsilon\).
The error term, \(\epsilon\) is assumed to be normally distributed and the regression co-efficients are estimated by minimising the error term commonly using ordinary least squares (OLS). We make some other assumptions about the error term - not correlated with X and i.i.d.; and define \(Var(\epsilon)=\sigma^2\). The generalised form allows the response variable to have an error that is not normal - this is achieved by allowing the linear model to be related to the response variable via a link function and by allowing the magnitude of the variance of each measurement to be a function of its predicted value.
With Bayesian inference we are interested in the posterior distributions of the parameters \(\beta\) and \(\sigma\), rather than point estimates. We also want to influence those distributions with some prior knowledge of their behaviour:
\(p(\beta,\sigma|y,X) \propto p(y|\beta,\sigma)p(\beta,\sigma)\)
\(p(\beta,\sigma|y,X)\) is not a normalised probability density function (i.e. is proportional to the likelihood and prior). However, we can sample from it and therefore estimate any quantity of interest e.g. mean or variance, (William H. Press 2007, p825). For this we use MCMC, William H. Press (2007), p825:
“MCMC is a random sampling method… the goal is to visit a point \(x\) with a probability proportional to some given distribution function \(\pi(x)\) [not necessarily a probability]. Why would we want to sample a distribution in this way? THe answer is that Bayesian methods, often implemented using MCMC, provide a powerful way of estimating the parameters of a model and their degree of uncertainty.”
The Framingham Heart Study is a longitudinal study of cardiovastular disease in a sample population within Framingham, Massachusetts. A subset of data for teaching is available at request from the U.S. National Heart, Lung and Blood Institute. This dataset is not appropriate for publication and has been anonymised, but is nevertheless useful for illustrative purposes here. The dataset:
“..is a subset of the data collected as part of the Framingham study and includes laboratory, clinic, questionnaire, and adjudicated event data on 4,434 participants. Participant clinic data was collected during three examination periods, approximately 6 years apart, from roughly 1956 to 1968. Each participant was followed for a total of 24 years for the outcome of the following events: Angina Pectoris, Myocardial Infarction, Atherothrombotic Infarction or Cerebral Hemorrhage (Stroke) or death.”
We’ll look at a logistic regression model example using predefined Bayesian models in RSTANARM.
df_heart <- read.csv(file = 'frmgham2.csv')
df_heart <- df_heart[c(1:23)] %>% filter(PERIOD==2) # ignoring health event data, only looking at observations at period == 2 (observations for participants are not independent over time; this does have limitations e.g. observations are truncated at death which could be due to heart disease)
ignore_cols <- names(df_heart) %in% c("PERIOD","RANDID","TIME", "PREVMI", "PREVAP") #exclude identifier and time columns; exclude PREVMI and PREVAP as they are a subset of PREVCHD
df_heart <- df_heart[!ignore_cols]
head(df_heart)
SEX TOTCHOL AGE SYSBP DIABP CURSMOKE CIGPDAY BMI DIABETES BPMEDS
1 2 260 52 105 69.5 0 0 29.43 0 0
2 1 283 54 141 89.0 1 30 25.34 0 0
3 2 232 67 183 109.0 1 20 30.18 0 0
4 2 343 51 109 77.0 1 30 23.48 0 0
5 2 230 49 177 102.0 0 0 31.36 0 1
6 2 220 70 149 81.0 0 0 36.76 0 0
HEARTRTE GLUCOSE educ PREVCHD PREVSTRK PREVHYP HDLC LDLC
1 80 86 2 0 0 0 NA NA
2 75 87 1 0 0 0 NA NA
3 60 89 3 0 0 1 NA NA
4 90 72 3 0 0 0 NA NA
5 120 86 2 0 0 1 NA NA
6 80 98 1 1 0 1 NA NA
Data description highlights:
Using the DataExplore package, we can see that the data is mostly complete apart from missing entries in ‘glucose’ and to a lesser extent ‘education’.
introduce(df_heart)
rows columns discrete_columns continuous_columns
1 3930 18 0 16
all_missing_columns total_missing_values complete_rows
1 2 8720 0
total_observations memory_usage
1 70740 334464
plot_intro(df_heart)
plot_missing(df_heart)
Given around 12% of glucose records are missing as well as all of HDL and LDL (only available for period 3), drop the columns. Also drop rows with other missing data as these are small in number.
df_heart$GLUCOSE <- NULL
df_heart$HDLC <- NULL
df_heart$LDLC <- NULL
df_heart <- na.omit(df_heart)
Plots of the data:
plot_bar(df_heart, title="Barplot")
Of the 3.6k remaining records, ~7% have a flag for heart disease:
table(df_heart$PREVCHD)
0 1
3319 253
Further plots and checking distributions:
plot_histogram(df_heart, title="Histogram for continuous")
log_qq_data <- update_columns(qq_data, 2:4, function(x) log(x + 1))
plot_qq(log_qq_data, title="QQ plot, logged data")
Most significantly correlated with heart disease are AGE, SYSBP, DIABETES, BPMEDS and PREVHYP.
plot_correlation(na.omit(df_heart))
Split data into training and testing data sets.
Fit a simple GLM using all data:
Call:
glm(formula = PREVCHD ~ ., family = "binomial", data = df_heart_train)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -6.8479472 1.2549718 -5.457 4.85e-08 ***
SEX -1.1151811 0.1822901 -6.118 9.50e-10 ***
TOTCHOL 0.0031949 0.0017375 1.839 0.065946 .
AGE 0.0846106 0.0115302 7.338 2.17e-13 ***
SYSBP 0.0100912 0.0050029 2.017 0.043691 *
DIABP -0.0147439 0.0096099 -1.534 0.124971
CURSMOKE 0.3163451 0.2544935 1.243 0.213854
CIGPDAY -0.0076423 0.0105696 -0.723 0.469652
BMI 0.0141148 0.0203822 0.693 0.488621
DIABETES 0.9371080 0.2676098 3.502 0.000462 ***
BPMEDS 0.3173614 0.2360895 1.344 0.178870
HEARTRTE -0.0102143 0.0066348 -1.540 0.123682
educ 0.0004762 0.0777335 0.006 0.995112
PREVSTRK 0.2917608 0.5267898 0.554 0.579684
PREVHYP 0.2978378 0.2191547 1.359 0.174137
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1361.4 on 2678 degrees of freedom
Residual deviance: 1178.9 on 2664 degrees of freedom
AIC: 1208.9
Number of Fisher Scoring iterations: 6
Use stepwise regression to reduce the explanatory variables:
# specify a null model with no predictors
null_model_glm_heart <- glm(PREVCHD ~ 1, data = df_heart_train, family = "binomial")
# use a forward stepwise algorithm to build a parsimonious model
step_model_glm_heart <- step(null_model_glm_heart, scope = list(lower = null_model_glm_heart, upper = full_model_glm_heart), direction = "forward")
Summary of the model:
summary(step_model_glm_heart)
Call:
glm(formula = PREVCHD ~ AGE + SEX + DIABETES + SYSBP + TOTCHOL +
HEARTRTE + BPMEDS, family = "binomial", data = df_heart_train)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -7.587973 0.917775 -8.268 < 2e-16 ***
AGE 0.089303 0.010475 8.525 < 2e-16 ***
SEX -1.093518 0.175776 -6.221 4.94e-10 ***
DIABETES 1.018590 0.263099 3.872 0.000108 ***
SYSBP 0.008739 0.003593 2.432 0.015006 *
TOTCHOL 0.003100 0.001745 1.777 0.075605 .
HEARTRTE -0.009997 0.006488 -1.541 0.123349
BPMEDS 0.344999 0.232319 1.485 0.137538
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1361.4 on 2678 degrees of freedom
Residual deviance: 1184.6 on 2671 degrees of freedom
AIC: 1200.6
Number of Fisher Scoring iterations: 6
Dropping BPMEDS, HEARTRTE and BMI as little impact on AIC:
Call:
glm(formula = PREVCHD ~ AGE + SEX + PREVHYP + DIABETES + TOTCHOL,
family = "binomial", data = df_heart_train)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -7.667411 0.742295 -10.329 < 2e-16 ***
AGE 0.093372 0.010254 9.106 < 2e-16 ***
SEX -1.045255 0.170871 -6.117 9.52e-10 ***
PREVHYP 0.441363 0.173035 2.551 0.010750 *
DIABETES 1.000627 0.259912 3.850 0.000118 ***
TOTCHOL 0.003196 0.001726 1.851 0.064132 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1361.4 on 2678 degrees of freedom
Residual deviance: 1190.5 on 2673 degrees of freedom
AIC: 1202.5
Number of Fisher Scoring iterations: 6
Fit a Bayes glm using the same explanatory variables. Here parameter estimates no longer have test statistics and p-values as in the frequentist regression. This is because Bayesian estimation samples from the posterior distribution. This means that instead of a point estimate and a test statistic, we get a distribution of plausible values for the parameters, and the estimates section of the summary summarizes those distributions.
Further documentation on Bayesian generalised linear models via Stan is here. Documentation on choice of priors is here. If priors are not specified, default weekly informative priors are used - you can call a summary of the modeled priors (see below). Default priors are autoscaled to make them less informative (see below for notes on scaling).
model_bayes_glm_heart <- stan_glm(PREVCHD ~ AGE + SEX + PREVHYP + DIABETES + TOTCHOL, family = "binomial",data=df_heart_train, iter=2000, warmup=500)
# chains is number of sample paths from posterior; iter is number of samples within a chain; warmup is number of iterations to discard
A summary of the model below. “Sigma” is the standard deviation of the errors; “mean_PPD” is the mean of the posterior predictive samples.MCMC diagnostics - “RHat” is a measure of within chain variance compared to across chain variance (<1.1 implies convergence); “log-posterior” is similar to likelihood.
Model Info:
function: stan_glm
family: binomial [logit]
formula: PREVCHD ~ AGE + SEX + PREVHYP + DIABETES + TOTCHOL
algorithm: sampling
sample: 6000 (posterior sample size)
priors: see help('prior_summary')
observations: 2679
predictors: 6
Estimates:
mean sd 10% 50% 90%
(Intercept) -7.6675 0.7337 -8.6151 -7.6631 -6.7333
AGE 0.0933 0.0102 0.0801 0.0933 0.1065
SEX -1.0460 0.1705 -1.2678 -1.0453 -0.8319
PREVHYP 0.4442 0.1769 0.2241 0.4418 0.6744
DIABETES 0.9833 0.2669 0.6348 0.9900 1.3165
TOTCHOL 0.0032 0.0017 0.0010 0.0032 0.0054
Fit Diagnostics:
mean sd 10% 50% 90%
mean_PPD 0.0704 0.0068 0.0616 0.0702 0.0791
The mean_ppd is the sample average posterior predictive distribution of the outcome variable (for details see help('summary.stanreg')).
MCMC diagnostics
mcse Rhat n_eff
(Intercept) 0.0103 0.9998 5052
AGE 0.0002 1.0000 4593
SEX 0.0023 1.0002 5532
PREVHYP 0.0024 0.9999 5369
DIABETES 0.0034 1.0002 6087
TOTCHOL 0.0000 1.0000 5686
mean_PPD 0.0001 0.9999 7301
log-posterior 0.0332 1.0004 2868
For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).
The stan model used default normal priors for intercepts and coefficients. Default priors are automatically scaled to limit how informative they are - below we can see the adjusted (scaled) prior for the coefficients differs from the default 2.5:
# summary of priors
print(prior_summary(model_bayes_glm_heart),digits=4)
Priors for model 'model_bayes_glm_heart'
------
Intercept (after predictors centered)
~ normal(location = 0, scale = 2.5)
Coefficients
Specified prior:
~ normal(location = [0,0,0,...], scale = [2.5,2.5,2.5,...])
Adjusted prior:
~ normal(location = [0,0,0,...], scale = [0.2922,5.0215,4.9996,...])
------
See help('prior_summary.stanreg') for more details
Credible intervals express the probability that a parameter falls within a given range (vs a confidence interval which is the probability that a range contains the true value of the parameter). These credible intervals are computed by finding the relevant quantiles of the draws from the posterior distribution. They produce similar ranges to the confidence intervals from the frequentist approach.
confint(final_model_glm_heart, level=0.95)
2.5 % 97.5 %
(Intercept) -9.1471208758 -6.234324626
AGE 0.0735684164 0.113808648
SEX -1.3850915606 -0.714354726
PREVHYP 0.1065103678 0.785995423
DIABETES 0.4723299199 1.495122321
TOTCHOL -0.0002292817 0.006540454
posterior_interval(model_bayes_glm_heart, prob=0.95)
2.5% 97.5%
(Intercept) -9.0944495644 -6.226308036
AGE 0.0737930669 0.113502760
SEX -1.3732792483 -0.714472520
PREVHYP 0.0962806425 0.788979071
DIABETES 0.4395704549 1.489720023
TOTCHOL -0.0001426672 0.006534955
stan_glm allows a selection of prior distributions for the coefficients, intercept, and auxiliary parameters. The prior distributions can be altered by specifying different locations/scales/dfs, but all the prior take the form of a single chosen probability distribution.
Some drivers for wanting to specify a prior: * Available information/ research might suggest a base value for our parameters. Particularly valuable if our own observed data is sparse. * We may not have a good idea of the value, but know that it is constrained in some way e.g. is positive.
The scale is the standard deviation of the prior. As noted earlier, the model autoscales to ensure priors are not too informative. We can turn this off when we specify our own priors. Documentation on choice of priors is here.
To allow for varying distributions across the priors, we need to define the model in STAN or JAGS. See the linear regression example below for an example of this in JAGS.
Specifying our own priors below. Little impact on the posterior estimates.
model_bayes_glm_heart_scale <- stan_glm(PREVCHD ~ AGE + SEX + PREVHYP + DIABETES + TOTCHOL, family = "binomial",data=df_heart_train, iter=2000, warmup=500,
prior_intercept = normal(location = 0, scale = 10, autoscale = FALSE),
prior = normal(location = 0, scale = 5, autoscale = FALSE),
prior_aux = exponential(rate = 1, autoscale = FALSE)
)
SAMPLING FOR MODEL 'bernoulli' NOW (CHAIN 1).
Chain 1:
Chain 1: Gradient evaluation took 9.1e-05 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0.91 seconds.
Chain 1: Adjust your expectations accordingly!
Chain 1:
Chain 1:
Chain 1: Iteration: 1 / 2000 [ 0%] (Warmup)
Chain 1: Iteration: 200 / 2000 [ 10%] (Warmup)
Chain 1: Iteration: 400 / 2000 [ 20%] (Warmup)
Chain 1: Iteration: 501 / 2000 [ 25%] (Sampling)
Chain 1: Iteration: 700 / 2000 [ 35%] (Sampling)
Chain 1: Iteration: 900 / 2000 [ 45%] (Sampling)
Chain 1: Iteration: 1100 / 2000 [ 55%] (Sampling)
Chain 1: Iteration: 1300 / 2000 [ 65%] (Sampling)
Chain 1: Iteration: 1500 / 2000 [ 75%] (Sampling)
Chain 1: Iteration: 1700 / 2000 [ 85%] (Sampling)
Chain 1: Iteration: 1900 / 2000 [ 95%] (Sampling)
Chain 1: Iteration: 2000 / 2000 [100%] (Sampling)
Chain 1:
Chain 1: Elapsed Time: 10.25 seconds (Warm-up)
Chain 1: 13.006 seconds (Sampling)
Chain 1: 23.256 seconds (Total)
Chain 1:
SAMPLING FOR MODEL 'bernoulli' NOW (CHAIN 2).
Chain 2:
Chain 2: Gradient evaluation took 7.7e-05 seconds
Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 0.77 seconds.
Chain 2: Adjust your expectations accordingly!
Chain 2:
Chain 2:
Chain 2: Iteration: 1 / 2000 [ 0%] (Warmup)
Chain 2: Iteration: 200 / 2000 [ 10%] (Warmup)
Chain 2: Iteration: 400 / 2000 [ 20%] (Warmup)
Chain 2: Iteration: 501 / 2000 [ 25%] (Sampling)
Chain 2: Iteration: 700 / 2000 [ 35%] (Sampling)
Chain 2: Iteration: 900 / 2000 [ 45%] (Sampling)
Chain 2: Iteration: 1100 / 2000 [ 55%] (Sampling)
Chain 2: Iteration: 1300 / 2000 [ 65%] (Sampling)
Chain 2: Iteration: 1500 / 2000 [ 75%] (Sampling)
Chain 2: Iteration: 1700 / 2000 [ 85%] (Sampling)
Chain 2: Iteration: 1900 / 2000 [ 95%] (Sampling)
Chain 2: Iteration: 2000 / 2000 [100%] (Sampling)
Chain 2:
Chain 2: Elapsed Time: 7.179 seconds (Warm-up)
Chain 2: 14.675 seconds (Sampling)
Chain 2: 21.854 seconds (Total)
Chain 2:
SAMPLING FOR MODEL 'bernoulli' NOW (CHAIN 3).
Chain 3:
Chain 3: Gradient evaluation took 8e-05 seconds
Chain 3: 1000 transitions using 10 leapfrog steps per transition would take 0.8 seconds.
Chain 3: Adjust your expectations accordingly!
Chain 3:
Chain 3:
Chain 3: Iteration: 1 / 2000 [ 0%] (Warmup)
Chain 3: Iteration: 200 / 2000 [ 10%] (Warmup)
Chain 3: Iteration: 400 / 2000 [ 20%] (Warmup)
Chain 3: Iteration: 501 / 2000 [ 25%] (Sampling)
Chain 3: Iteration: 700 / 2000 [ 35%] (Sampling)
Chain 3: Iteration: 900 / 2000 [ 45%] (Sampling)
Chain 3: Iteration: 1100 / 2000 [ 55%] (Sampling)
Chain 3: Iteration: 1300 / 2000 [ 65%] (Sampling)
Chain 3: Iteration: 1500 / 2000 [ 75%] (Sampling)
Chain 3: Iteration: 1700 / 2000 [ 85%] (Sampling)
Chain 3: Iteration: 1900 / 2000 [ 95%] (Sampling)
Chain 3: Iteration: 2000 / 2000 [100%] (Sampling)
Chain 3:
Chain 3: Elapsed Time: 10.085 seconds (Warm-up)
Chain 3: 12.913 seconds (Sampling)
Chain 3: 22.998 seconds (Total)
Chain 3:
SAMPLING FOR MODEL 'bernoulli' NOW (CHAIN 4).
Chain 4:
Chain 4: Gradient evaluation took 9.1e-05 seconds
Chain 4: 1000 transitions using 10 leapfrog steps per transition would take 0.91 seconds.
Chain 4: Adjust your expectations accordingly!
Chain 4:
Chain 4:
Chain 4: Iteration: 1 / 2000 [ 0%] (Warmup)
Chain 4: Iteration: 200 / 2000 [ 10%] (Warmup)
Chain 4: Iteration: 400 / 2000 [ 20%] (Warmup)
Chain 4: Iteration: 501 / 2000 [ 25%] (Sampling)
Chain 4: Iteration: 700 / 2000 [ 35%] (Sampling)
Chain 4: Iteration: 900 / 2000 [ 45%] (Sampling)
Chain 4: Iteration: 1100 / 2000 [ 55%] (Sampling)
Chain 4: Iteration: 1300 / 2000 [ 65%] (Sampling)
Chain 4: Iteration: 1500 / 2000 [ 75%] (Sampling)
Chain 4: Iteration: 1700 / 2000 [ 85%] (Sampling)
Chain 4: Iteration: 1900 / 2000 [ 95%] (Sampling)
Chain 4: Iteration: 2000 / 2000 [100%] (Sampling)
Chain 4:
Chain 4: Elapsed Time: 5.002 seconds (Warm-up)
Chain 4: 12.771 seconds (Sampling)
Chain 4: 17.773 seconds (Total)
Chain 4:
Model Info:
function: stan_glm
family: binomial [logit]
formula: PREVCHD ~ AGE + SEX + PREVHYP + DIABETES + TOTCHOL
algorithm: sampling
sample: 6000 (posterior sample size)
priors: see help('prior_summary')
observations: 2679
predictors: 6
Estimates:
mean sd 10% 50% 90%
(Intercept) -7.7081 0.7267 -8.6259 -7.7068 -6.7884
AGE 0.0940 0.0101 0.0812 0.0941 0.1071
SEX -1.0480 0.1672 -1.2645 -1.0456 -0.8365
PREVHYP 0.4450 0.1744 0.2238 0.4439 0.6671
DIABETES 0.9902 0.2546 0.6699 0.9993 1.3099
TOTCHOL 0.0032 0.0017 0.0009 0.0032 0.0053
Fit Diagnostics:
mean sd 10% 50% 90%
mean_PPD 0.0703 0.0066 0.0620 0.0702 0.0788
The mean_ppd is the sample average posterior predictive distribution of the outcome variable (for details see help('summary.stanreg')).
MCMC diagnostics
mcse Rhat n_eff
(Intercept) 0.0102 1.0001 5109
AGE 0.0001 1.0007 4845
SEX 0.0028 1.0004 3682
PREVHYP 0.0028 1.0000 3851
DIABETES 0.0041 0.9998 3785
TOTCHOL 0.0000 0.9999 5841
mean_PPD 0.0001 0.9998 5050
log-posterior 0.0351 1.0016 2381
For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).
We are able to extract predictions from each iteration of the posterior sampling. We can extract predictions for new data. Could be tested against the observations for goodness of fit. The follwing from the manual, RSTANARM Manual (2020), p43:
“The posterior predictive distribution (posterior_predict) is the distribution of the outcome implied by the model after using the observed data to update our beliefs about the unknown parameters in the model. Simulating data from the posterior predictive distribution using the observed predictors is useful for checking the fit of the model. Drawing from the posterior predictive distribution at interesting values of the predictors also lets us visualize how a manipulation of a predictor affects (a function of)the outcome(s). With new observations of predictor variables we can use the posterior predictive distribution to generate predicted outcomes.”
posteriors <- posterior_predict(model_bayes_glm_heart,newdata=df_heart_test)
posterior <- as.data.frame(posteriors) # each column is a data point from the dataset and each row is a prediction
# Print 10 predictions for 5 lives
print("10 predictions for 5 lives")
[1] "10 predictions for 5 lives"
posteriors[1:10, 1:5]
2 6 15 20 26
[1,] 0 0 0 0 0
[2,] 0 0 0 0 0
[3,] 0 0 0 0 0
[4,] 0 0 0 0 0
[5,] 0 0 0 0 0
[6,] 0 0 0 0 0
[7,] 0 0 0 0 0
[8,] 0 0 0 0 0
[9,] 0 0 0 0 0
[10,] 0 0 0 0 0
We can plot the posterior distributions for the model parameters from the samples:
posterior_heart <- as.matrix(model_bayes_glm_heart)
plot1<- mcmc_areas(posterior_heart, pars = c("AGE"), prob = 0.80)
plot2<- mcmc_areas(posterior_heart, pars = c("SEX"), prob = 0.80)
plot3<- mcmc_areas(posterior_heart, pars = c("PREVHYP"), prob = 0.80)
plot4<- mcmc_areas(posterior_heart, pars = c("DIABETES"), prob = 0.80)
plot5<- mcmc_areas(posterior_heart, pars = c("TOTCHOL"), prob = 0.80)
grid.arrange(plot1,plot2, plot3, plot4, plot5, top="Posterior distributions (median + 80% intervals)")
We can extract the \(R^2\) from the posterior distribution and plot:
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.04251 0.07077 0.07970 0.08027 0.08920 0.14700
hist(r2_posterior) #expect normal
Posterior predictive checks can be extracted using “pp_check”
Here the outcome is binary so the density overlay outcomes are concentrated at one or zero.
pp_check(model_bayes_glm_heart, "stat")
pp_check(model_bayes_glm_heart, "dens_overlay")
Further considerations: The loo package provides “efficient approximate leave-one-out cross-validation (LOO), approximate standard errors for estimated predictive errors (including comparisons), and the widely applicable information criterion (WAIC).The loo package can be used to estimate predictive error of any MCMC item-level log likelihood output”.
The above example looked at a binary response. In this example we’ll look at predicting a continuous variable using data on my household electricity consumption over 2016-2021. The above example used stan_glm in RSTANARM. Here we will use a model defined in RJAGS with additional flexibility around the choice of priors, which is also possible in a paramaterised model for STAN. RJAGS combines the power of R with the “Just Another Gibbs Sampler” or “JAGS” engine. To get started, first download the “JAGS” program, then set up the packages as above.
Both JAGS and STAN allow for the parameterisation of a range of different models. The example below was chosen for comparison with a conventional GLM approach.
Data prep:
df_usage <- read.csv(file = 'elec_usage.csv')
df_usage <- df_usage %>% mutate(Date = as.Date(Date, format="%Y-%m-%d"), EndDate = parse_date_time(EndDate, '%Y-%m-%d %H:%M:%S'), StartDate = parse_date_time(StartDate, '%Y-%m-%d %H:%M:%S'))
df_usage <- df_usage[c(-2,-3,-4,-6)] %>% # drop start and end dates, rate type, duration
group_by(Date) %>% summarise(Usage = sum(Usage), MaxTemp = max(MaxTemp), MinTemp = min(MinTemp), Rain_mm = max(Rain_mm), Solar_Exposure = max(Solar_Exposure)) %>%
# Add in fields to identify seasons, weekend/ weekday and an indicator for the post-covid/ work from home period
mutate(Season=
if_else(month(Date)>= 3 & month(Date) < 6,"Autumn",
if_else(month(Date)>= 6 & month(Date) < 9,"Winter",
if_else(month(Date)>= 9 & month(Date) < 12,"Spring","Summer"))),
Covid_Ind=if_else(Date>= "2020-03-15",1,0),
Weekend=
if_else(wday(Date)==7 | wday(Date)==1,1,0))
df_usage$Season <- factor(df_usage$Season, levels = c("Summer","Autumn","Winter","Spring")) # change season levels
head(df_usage)
# A tibble: 6 × 9
Date Usage MaxTemp MinTemp Rain_mm Solar_Exposure Season
<date> <dbl> <dbl> <dbl> <dbl> <dbl> <fct>
1 2019-12-21 21.1 26.7 21 0 27.1 Summer
2 2019-12-22 9.53 19.9 18.6 0 11.9 Summer
3 2019-12-23 15.7 22.1 17.2 0 14.4 Summer
4 2019-12-24 18.7 24 18.6 1.8 18.3 Summer
5 2019-12-25 18.9 23.2 21.6 0.2 27.8 Summer
6 2019-12-26 14.6 23.6 21.4 0 30.8 Summer
# ℹ 2 more variables: Covid_Ind <dbl>, Weekend <dbl>
Data is mostly complete:
introduce(df_usage)
# A tibble: 1 × 9
rows columns discrete_columns continuous_columns
<int> <int> <int> <int>
1 732 9 2 7
# ℹ 5 more variables: all_missing_columns <int>,
# total_missing_values <int>, complete_rows <int>,
# total_observations <int>, memory_usage <dbl>
plot_intro(df_usage)
plot_missing(df_usage)
Drop records with NAs as they are few, also only include non-zero usage, review explanatory variables; Gamma dist potentially a better fit than normal for Usage given the skewness in the tail; alternatively use normal with a log link.
plot_histogram(df_usage, title="Histogram for continuous")
log_qq_data <- update_columns(qq_data, 2:4, function(x) log(x + 1))
plot_qq(log_qq_data, title="QQ plot, logged data")
Fitting of the distribution ' gamma ' by matching moments
Parameters :
estimate Std. Error
shape 2.9114935 0.179487630
rate 0.1164095 0.007621289
Loglikelihood: -2813.89 AIC: 5631.78 BIC: 5640.902
Correlation matrix:
shape rate
shape 1.0000000 0.9416261
rate 0.9416261 1.0000000
plot(usage_fit_gamma)
Usage correlated with season, max/min temp and covid wfh indicator. Max and Min Temps are not surprisingly highly correlated with one another, with the seasons and with solar exposure.
plot_correlation(na.omit(df_usage))
Split data into training and testing data sets:
Fit a simple GLM using all variables:
Call:
glm(formula = Usage ~ ., family = gaussian(link = "log"), data = df_usage_train[-1])
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.4312991 0.1936811 28.042 < 2e-16 ***
MaxTemp -0.0238070 0.0083260 -2.859 0.00442 **
MinTemp -0.0978096 0.0093057 -10.511 < 2e-16 ***
Rain_mm -0.0001651 0.0013963 -0.118 0.90594
Solar_Exposure -0.0155053 0.0037217 -4.166 3.63e-05 ***
SeasonAutumn 0.0503508 0.0812128 0.620 0.53554
SeasonWinter 0.0421299 0.0913116 0.461 0.64471
SeasonSpring 0.0188279 0.0816951 0.230 0.81782
Covid_Ind -0.0768909 0.0992380 -0.775 0.43880
Weekend -0.1074911 0.0336111 -3.198 0.00147 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for gaussian family taken to be 85.51279)
Null deviance: 113590 on 529 degrees of freedom
Residual deviance: 44465 on 520 degrees of freedom
AIC: 3873.8
Number of Fisher Scoring iterations: 9
Use stepwise regression to reduce the explanatory variables:
# specify a null model with no predictors
null_model_glm_usage <- glm(Usage ~ 1, data = df_usage_train, gaussian(link="log"))
# use a forward stepwise algorithm to build a parsimonious model
step_model_glm_usage <- step(null_model_glm_usage, scope = list(lower = null_model_glm_usage, upper = full_model_glm_usage), direction = "forward")
Summary of the model:
summary(step_model_glm_usage)
Call:
glm(formula = Usage ~ MinTemp + Solar_Exposure + Weekend + MaxTemp,
family = gaussian(link = "log"), data = df_usage_train)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.394821 0.093699 57.576 < 2e-16 ***
MinTemp -0.099655 0.006987 -14.264 < 2e-16 ***
Solar_Exposure -0.016588 0.003219 -5.152 3.65e-07 ***
Weekend -0.107299 0.033428 -3.210 0.00141 **
MaxTemp -0.021957 0.007943 -2.764 0.00590 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for gaussian family taken to be 84.89477)
Null deviance: 113590 on 529 degrees of freedom
Residual deviance: 44568 on 525 degrees of freedom
AIC: 3865
Number of Fisher Scoring iterations: 7
For simplicity, we’ll assume a model defined by MaxTemp and MinTemp only, noting the significant correlation with other explanatory variables.
Call:
glm(formula = Usage ~ MaxTemp + MinTemp, family = stats::Gamma(link = "inverse"),
data = df_usage_train)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.0283252 0.0040743 -6.952 1.07e-11 ***
MaxTemp 0.0007415 0.0003301 2.246 0.0251 *
MinTemp 0.0037811 0.0003119 12.121 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for Gamma family taken to be 0.1921047)
Null deviance: 189.507 on 529 degrees of freedom
Residual deviance: 99.968 on 527 degrees of freedom
AIC: 3875
Number of Fisher Scoring iterations: 5
Changing the family to gaussian with log link (later graph shows little change in predictions, although AIC is higher). Select gaussian for ease of comparison with the Bayes model.
Call:
glm(formula = Usage ~ MaxTemp + MinTemp, family = gaussian(link = "log"),
data = df_usage_train)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.414092 0.098035 55.226 < 2e-16 ***
MaxTemp -0.039265 0.007692 -5.105 4.64e-07 ***
MinTemp -0.094550 0.007262 -13.021 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for gaussian family taken to be 90.75013)
Null deviance: 113590 on 529 degrees of freedom
Residual deviance: 47825 on 527 degrees of freedom
AIC: 3898.4
Number of Fisher Scoring iterations: 7
Extract coefficients:
final_model_glm_usage_coeff <- as.data.frame(coef(final_model_glm_usage))
final_model_glm_usage_coeff
coef(final_model_glm_usage)
(Intercept) 5.41409167
MaxTemp -0.03926501
MinTemp -0.09454971
Project the result for a dummy record showing the link:
print("Dummy prediction with min of 10 and max of 20:")
[1] "Dummy prediction with min of 10 and max of 20:"
predict(final_model_glm_usage, newdata= as.data.frame(cbind(MaxTemp=20,MinTemp=10)), # model and data
type="response", # or terms for coefficients
se.fit = FALSE)
1
39.77722
# replicate
exp(final_model_glm_usage_coeff[1,] + final_model_glm_usage_coeff[2,]*20+final_model_glm_usage_coeff[3,]*10)
[1] 39.77722
Adding the predictions to the test data:
# predictions, normal
pred_usage <- as.data.frame(
predict(final_model_glm_usage, newdata= df_usage_test, # model and data
type="response", # or terms for coefficients
se.fit = TRUE, # default is false
interval = "confidence", #default "none", also "prediction"
level = 0.95
)
)
# predictions, gamma
usage_pred_gamma <-
predict(gamma_model_glm_usage, newdata= df_usage_test, # model and data
type="response", # or terms for coefficients
se.fit = FALSE, # default is false
)
# rename
pred_usage <- rename(pred_usage,usage_pred=fit,se_usage_pred=se.fit)
# add back to data
df_usage_test_pred <- cbind(df_usage_test,pred_usage, usage_pred_gamma)
Plotting the results shows a poorer fit for winter when the usage spikes and in instances with outlier temperatures. Predictions for the test data are not hugely different to the gamma family model.
df_usage_test_pred %>%
ggplot() +
geom_point(aes(x=Date,y=Usage,color=Season)) +
# add a modeled line
geom_line(aes(x=Date,y=usage_pred)) +
geom_line(aes(x=Date,y=usage_pred_gamma), linetype = "dashed") +
labs(x="Date", y="Usage", title = "Actual vs predicted usage by date and season",subtitle="Dashed=Gamma Pred; Solid=Normal Pred") +
theme_pander() + scale_color_gdocs()
# plot temperatures
df_usage_test_pred %>%
ggplot() +
geom_point(aes(x=Date,y=MaxTemp,color=Season)) +
labs(x="Date", y="Usage", title = "MaxTemp by date and season") +
theme_pander() + scale_color_gdocs()
Plots show slightly higher residuals at higher predicted values for the step regression model and final selected. QQ plot suggests skewness in the data, which we could see earlier in the histogram of usage.
Define a regression model in RJAGS. The structure here is more flexible than stan_glm which is predefined. This allows for a wider selection of prior distributions and a mixture of modeling choices. Also possible to specify model forms in this way for STAN.
\(Y_i \propto N(m_i,s^2)\)
Where
Note that the distribution syntax is slightly different to R, for example in R a normal distribution is specified using the mean and standard deviation, but in JAGS the second parameter of the dnorm command is the distribution’s precision (1 / variance).
usage_model <- "model{
#Likelihood model for Y[i]
for (i in 1:length(Y)) {
Y[i] ~ dnorm(m[i], s^(-2)) # link=log
log(m[i]) <- a + b*X_1[i] + c*X_2[i]
}
# Prior models for a, b, c, s
a ~ dnorm(0, 0.001) # Intercept
b ~ dnorm(0, 0.001) # MaxTemp
c ~ dnorm(0, 0.001) # MinTemp
s ~ dunif(0, 10) # Error
}"
Simulate from the model:
usage_sim <- jags.model(textConnection(usage_model),
data=list(Y = df_usage_train$Usage, X_1 = df_usage_train$MaxTemp, X_2 = df_usage_train$MinTemp),
n.chains = 3)
Compiling model graph
Resolving undeclared variables
Allocating nodes
Graph information:
Observed stochastic nodes: 530
Unobserved stochastic nodes: 4
Total graph size: 2919
Initializing model
usage_posterior <- coda.samples(usage_sim,
variable.names = c("a","b","c","s"),
n.iter = 10000, n.burnin = 2000, n.thin = 10, DIC = F)
Summary of the results below:
b chain value was -0.040. This is an estimate of the posterior mean of b. This is consistent with the GLM aboveb chain values by the square root of the number of iterations (or sample size). This is a naive approach to calculating error in a setting with dependent samples.Also showing the density plots for the posteriors as well as the trace plots for each of the chains.
summary(usage_posterior)
Iterations = 1001:11000
Thinning interval = 1
Number of chains = 3
Sample size per chain = 10000
1. Empirical mean and standard deviation for each variable,
plus standard error of the mean:
Mean SD Naive SE Time-series SE
a 5.41410 0.106067 6.124e-04 0.0070111
b -0.03912 0.008177 4.721e-05 0.0007667
c -0.09484 0.007196 4.155e-05 0.0004215
s 9.50785 0.254874 1.472e-03 0.0018820
2. Quantiles for each variable:
2.5% 25% 50% 75% 97.5%
a 5.21749 5.34109 5.41256 5.48393 5.62757
b -0.05514 -0.04475 -0.03913 -0.03349 -0.02289
c -0.10930 -0.09970 -0.09463 -0.08996 -0.08094
s 8.98564 9.33206 9.51959 9.69793 9.94864
plot(usage_posterior,trace=FALSE)
We can plot the posterior distributions for the model parameters from the samples:
posterior_usage <- as.matrix(usage_posterior)
plot1<- mcmc_areas(posterior_usage, pars = c("a"), prob = 0.80)
plot2<- mcmc_areas(posterior_usage, pars = c("b"), prob = 0.80)
plot3<- mcmc_areas(posterior_usage, pars = c("c"), prob = 0.80)
plot4<- mcmc_areas(posterior_usage, pars = c("s"), prob = 0.80)
grid.arrange(plot1,plot2, plot3, plot4, top="Posterior distributions (median + 80% intervals)")
Similar to the earlier example, we could extract the \(R^2\) from the posterior distribution as well as run posterior predictive checks.
Further considerations: extracting the mean of the parameter projections, we could produce predictions for the test data. We could also produce predictions for test data point based upon each of the samples from the posterior, so above we would produce 2400 predictions for each data point (3x10k, discarding the first 2k from each chain and only saving 1/10 of the samples).
Hugh Miller, Isabella Lyons (2012) discusses a method of forecasting COVID infections using the package “epinow2”, Sam Abbott (2020). The package makes use of STAN.
“[Epinow2] estimates the time-varying reproduction number on cases by date of infection… Imputed infections are then mapped to observed data (for example cases by date of report) via a series of uncertain delay distributions…”
That is to say, the model makes some assumptions about lags in symptoms and in reporting (assuming some distribution for those delays) in order to estimate the true underlying number of infections (backwards). Using that backcast it projects out how those infections might be reported over time. The model also estimates the effective reproduction number in order to predict future infections.
Covid case and test data sourced from the department of NSW Website on 17/01/2022.
We should be careful in placing too heavy a reliance on any models of incidence which are based upon case data (particularly where case rates are expected to be under reported or where testing is incomplete). The example is purely illustrative and does not attempt to correct for gaps in the data.
Case data:
notification_date postcode lhd_2010_code lhd_2010_name
1 2020-01-25 2134 X700 Sydney
2 2020-01-25 2121 X760 Northern Sydney
3 2020-01-25 2071 X760 Northern Sydney
4 2020-01-27 2033 X720 South Eastern Sydney
5 2020-03-01 2077 X760 Northern Sydney
6 2020-03-01 2163 X710 South Western Sydney
lga_code19 lga_name19
1 11300 Burwood (A)
2 16260 Parramatta (C)
3 14500 Ku-ring-gai (A)
4 16550 Randwick (C)
5 14000 Hornsby (A)
6 12850 Fairfield (C)
test_date postcode lhd_2010_code lhd_2010_name lga_code19
1 2020-01-01 2038 X700 Sydney 14170
2 2020-01-01 2039 X700 Sydney 14170
3 2020-01-01 2040 X700 Sydney 14170
4 2020-01-01 2041 X700 Sydney 14170
5 2020-01-01 2069 X760 Northern Sydney 14500
6 2020-01-01 2074 X760 Northern Sydney 14500
lga_name19 test_count
1 Inner West (A) 1
2 Inner West (A) 1
3 Inner West (A) 2
4 Inner West (A) 1
5 Ku-ring-gai (A) 1
6 Ku-ring-gai (A) 1
# A tibble: 6 × 3
# Groups: date [6]
date descr confirm
<date> <chr> <dbl>
1 2021-05-01 cases 2
2 2021-05-02 cases 8
3 2021-05-03 cases 9
4 2021-05-04 cases 7
5 2021-05-05 cases 9
6 2021-05-06 cases 4
Plotting recorded cases by date (PCR test outcomes only and only to 31 December 2021):
df_cases %>% mutate(omicron=ifelse(date > "2021-11-26","first case+","pre")) %>%
ggplot(data=., mapping = aes(x=date, y=confirm)) +
geom_col(aes(fill=omicron)) +
scale_x_date(date_breaks = "months" , date_labels = "%b-%y", limits = c(as.Date("2021-04-30"), as.Date("2021-12-31"))) +
labs(x="Date", y="Cases", title = "Covid cases by date, NSW") +
theme_pander() + scale_color_gdocs()
Test data:
# A tibble: 6 × 3
# Groups: date [6]
date descr tests
<date> <chr> <int>
1 2021-05-01 tests 4514
2 2021-05-02 tests 4775
3 2021-05-03 tests 14877
4 2021-05-04 tests 10868
5 2021-05-05 tests 13126
6 2021-05-06 tests 24029
Plotting tests by date (PCR tests only; to 31 December 2021):
ggplot(data=df_tests, mapping = aes(x=date, y=tests))+
geom_bar(stat="identity") +
scale_x_date(date_breaks = "months" , date_labels = "%b-%y", limits = c(as.Date("2021-04-30"), as.Date("2021-12-31"))) +
labs(x="Date", y="Tests", title = "PCR tests by date, NSW") +
theme_pander() + scale_color_gdocs()
Merged to get positive test rate:
# merge
df <- merge(x = df_cases, y = df_tests, by = "date", all.x = TRUE) %>%
mutate(pos_rate=pmin(1, confirm/tests)) %>%
fill(pos_rate, .direction = "downup") %>%
# add smoothed positive rate
mutate(pos_rate_sm = fitted(loess(pos_rate~as.numeric(date),span=0.1))) # loess - Local Polynomial Regression Fitting; span - the parameter which controls the degree of smoothing
# drop descriptions
df$descr.x <- NULL
df$descr.y <- NULL
Plot of positive rate shows that the proportion of tests returning a positive increases to above 5% from 25th December 2021, suggesting that the case data beyond this point is less complete.
# positive rate
ggplot(data=df, mapping = aes(x=date))+
geom_point(aes(y=(pos_rate))) +
geom_line(aes(y=pos_rate_sm)) +
scale_x_date(date_breaks = "months" , date_labels = "%b-%y", limits = c(as.Date("2021-04-30"), as.Date("2021-12-31"))) +
scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
labs(x="Date", y="Positive test rate", title = "Positive PCR test rate by date, NSW") +
theme_pander() + scale_color_gdocs()
There are some pragmatic methods for adjusting the case numbers for increasing test positivity rates. For our purposes we will focus the model on case data before 23 December 2021 in order to exclude the period where the test data became less reliable/ complete, noting that underlying data completeness issues are likely present.
# Reporting delays (delay from symptom onset to reporting/ testing positive)
reporting_delay <- estimate_delay(rlnorm(100, log(6), 1), max_value = 15)
cat("Mean reporting delay = ",reporting_delay$mean, "; ")
# Symptom generation time (time between infection events in an infector-infectee pair) and incubation period (time between moment of infection and symptom onset)
generation_time <- get_generation_time(disease = "SARS-CoV-2", source = "ganyani")
cat("Mean generation time =",generation_time$mean, "; ")
incubation_period <- get_incubation_period(disease = "SARS-CoV-2", source = "lauer")
cat("Mean incubation period =",incubation_period$mean, "; ")
df_cases_fit <- df_cases %>% filter(date <= "2021-12-22") # exclude data before 23 December 2021
estimates <- epinow(reported_cases = df_cases_fit[c("date","confirm")], # case data from NSW
generation_time = generation_time,
delays = delay_opts(incubation_period, reporting_delay),
rt = rt_opts(prior = list(mean = 2, sd = 0.2)),
stan = stan_opts(cores = 6, samples = 3000, warmup = 500, chains = 6, control = list(adapt_delta = 0.95)))
Importing a previous run of the above:
load("covid_estimates_summary.RData")
plot(estimates) produces predefined plots of cases by date reported and date of infection as well as the effective reproduction number. Given the dramatic increase in reported cases towards the end of the period, it is not surprising that the range of forecast cases is wide. Interestingly, the forecast/ partial forecast reproduction number reduces; presumably this is influenced by the earlier observations where case numbers were low. This issue has not been investigated further, however there are articles that discuss the impact of changes to the testing procedures and other limitations.
Merging forecast reported cases with the actual case data:
Plotting forecast vs actuals below - the week to 29 December is fully forecast from the model. Actual cases fall within the projected 90% confidence interval range, noting that the test positivity rate increased materially from 25th December so it is likely that the actual cases are under reported.
colours <- c("actual" = "#3366cc","modelled" = "#dc3912")
ggplot(data=df %>% filter(date >= "2021-12-01"), aes(x=date)) +
geom_line(aes(y=confirm, color="actual")) +
geom_line(aes(y=mean,color="modelled")) +
geom_ribbon(aes(ymin=lower_90, ymax=upper_90), fill="#dc3912", alpha=0.2) +
labs(x="Date", y="Reported cases", title = "Reported cases by date",color="Legend") +
scale_color_manual(values = colours) +
theme_pander()
A few books and articles of interest on STAN: