A quick Bayesian refresher, an extension to regression models, and other applications.
Bayesian statistics reflects uncertainty about model parameters by assuming some prior probability distribution for those parameters. That prior uncertainty is adjusted for observed data to produce a posterior distribution.
Under a frequentist approach, model parameters are fixed and the observations are assumed to be sampled from a probability distribution with those parameters.
\(\Huge p(\theta | x) = \frac{p(x | \theta) p(\theta)}{p(x)}\)
Where \(p(\theta | x)\) is the posterior, \(p(x | \theta)\) is the likelihood, \(p(\theta)\) is the prior and \(p(x)\) is the evidence.
The evidence term does not depend on \(\theta\), so the posterior is proportional to the product of the likelihood and the prior, and the theorem is often stated as
\(\Huge p(\theta | x) \propto p(x | \theta) p(\theta)\)
A paper presented at the 2021 Actuaries Institute Summit (Hugh Miller, Isabella Lyons (2012), p2) sets out the ‘tensions’ that exist between frequentist and Bayesian approaches and suggests that:
“Despite actuaries being pioneers in early work on credibility theory, Bayesian approaches have largely given way to frequentist approaches. This is to our loss, since such methods have advanced substantially over the past couple of decades and they are well-suited to problems with latent variables or noise around underlying ‘true’ values.”
Here we’ll look at a few simple applications of Bayesian techniques.
The sections below set out a simple Bayesian refresher example, regression models and other applications.
A list of packages used in the recipes.
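The library calls below are inferred from the functions used in the recipes that follow (an assumption rather than the author's original list):
library(dplyr) # data manipulation (%>%, filter, mutate)
library(tidyr) # fill()
library(lubridate) # date parsing (parse_date_time, month, wday)
library(ggplot2) # plotting (qplot, ggplot)
library(scales) # label_percent(), percent_format()
library(ggthemes) # theme_pander(), scale_color_gdocs()
library(DataExplorer) # introduce(), plot_intro(), plot_missing(), update_columns()
library(fitdistrplus) # fitdist()
library(rstanarm) # stan_glm(), posterior_predict(), bayes_R2(), pp_check()
library(bayesplot) # mcmc_areas()
library(gridExtra) # grid.arrange()
library(rjags) # jags.model(), coda.samples()
library(EpiNow2) # epinow(), estimate_delay(), get_generation_time()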
The 2017 Actuaries Institute Green Paper “Mental Health and Insurance” (2017) discusses some of the challenges in obtaining reliable data and projections on mental health in insurance (“Mental Health and Insurance” 2017, p26):
"There are sources of data and research about the prevalence and profile of people with different mental health conditions, such as the 2007 National Survey of Mental Health and Wellbeing and many other epidemiological and clinical studies… [T]o have a complete and informed view of all this information is a difficult task for a full-time academic, let alone for an insurer.
Even then, the nature and granularity of the data needed for insurance applications is different from that for population, public health or clinical purposes. Firstly, the ‘exposure’ (the number of insured people and their characteristics) needs to be recorded. There is probably a lot of relevant information that is not collected at the start of an insurance policy and can therefore never be analysed.
Secondly, the claims information needs to be available at a detailed level regarding the nature and cause of the claim and other characteristics of the individual, their history and the cover provided. There is a clear need for consistency in definition, language and data standards to improve the quality of information available across the system.
Naturally, when the insurance cover for mental health conditions is not provided (as in the case of most travel insurance products) there will not be relevant data from the insurance history. However, even when relevant data exists within insurers there are practical and commercial difficulties in turning it into a useful form."
The paper goes on to discuss a number of the insurance challenges in detail.
Based upon data from a single insurer, ~19% of IP and TPD claims relate to mental health, with mental health claims contributing 26% of the total cost of claims (in 2015; “Mental Health and Insurance” (2017), p17). This is roughly consistent with the proportion of the general population that experiences an episode of mental illness in a 12-month period (“Mental Health and Insurance” (2020), p110).
In this simple example, we want to know the rate of mental illness, \(\theta\), in the insured population. Assume some true underlying \(\theta\), here a fixed value of 20%. Create a population \(X\) with \(X \sim B(N,\theta)\), where N = 100,000, and pull a random sample from the population.
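A minimal sketch of this setup; the seed and the sample size of 1,000 are assumptions:
set.seed(123) # assumed seed for reproducibility
theta_true <- 0.20 # assumed true underlying rate
N <- 100000
population <- rbinom(N, 1, theta_true) # each life is a Bernoulli(theta) draw
sample_n <- 1000 # assumed sample size
sample_success <- sum(sample(population, sample_n)) # successes in the random sample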
The posterior is proportional to the product of the prior and the likelihood. The beta distribution is a conjugate prior for the binomial distribution (the posterior is in the same probability distribution family as the prior). Assume \(\theta \sim {\sf Beta}(\alpha, \beta)\), then:
\(f(\theta;\alpha,\beta) = \frac{\theta^{\alpha-1}(1-\theta)^{\beta-1}}{B(\alpha,\beta)} \propto \theta^{\alpha-1}(1-\theta)^{\beta-1}\)
This density represents the probabilities assigned to values of \(\theta\) in the domain (0,1), given values for the parameters \(\alpha\) and \(\beta\). The binomial likelihood represents the probability of values of \(X\) given \(\theta\). Varying the values of \(\alpha\) and \(\beta\) can represent a wide range of different prior beliefs about the distribution of \(\theta\).
Priors can be “informative,” “uninformative” or “weakly informative,” e.g.:
Defining prior plots:
plot_prior <- function(a, b){
  theta <- seq(0, 1, 0.005) # grid of theta values over (0,1)
  p_theta <- dbeta(theta, a, b) # beta prior density at each theta
  p <- qplot(theta, p_theta, geom = 'line')
  p <- p + theme_bw()
  p <- p + ylab(expression(paste('p(', theta, ')', sep = '')))
  p <- p + xlab(expression(theta))
  return(p)
}
plot_prior(1,1) + labs(title="Uniform Prior, a= 1 and b = 1") + theme(plot.title = element_text(hjust = 0.5))
plot_prior(0.5,0.5) + labs(title="Bimodal Prior, a = 0.5 and b = 0.5") + theme(plot.title = element_text(hjust = 0.5))
The larger \(\alpha + \beta\) (the effective number of prior observations), the more concentrated the prior and the greater its influence on the posterior:
plot_prior(5,30) + labs(title="Informative Prior, a = 5 and b = 30") + theme(plot.title = element_text(hjust = 0.5))
plot_prior(20,120) + labs(title="Informative Prior, a = 20 and b = 120") + theme(plot.title = element_text(hjust = 0.5))
Set the prior parameter values, re-expressing the binomial counts as beta distribution parameters:
prior_a = 50 #successes
prior_b = 250 #failures
prior_n = prior_a + prior_b #observations
prior_theta = prior_a/prior_n
prior <- data.frame('Dist'='Prior','x'=seq(0,1,0.005), 'y'=dbeta(seq(0,1,0.005),prior_a,prior_b))
Observations are taken from the earlier binomial sample.
obs_a <- sample_success + 1 # likelihood re-expressed as a beta density: Beta(successes + 1, failures + 1)
obs_b <- sample_n - sample_success + 1
observations <- data.frame('Dist'='Observations',x=seq(0,1,0.005), y=dbeta(seq(0,1,0.005),obs_a,obs_b))
In this example, the posterior is a simple combination of the prior and observations.
post_a <- sample_success + prior_a # conjugate update: prior successes + observed successes
post_b <- sample_n - sample_success + prior_b # prior failures + observed failures
post_theta <- post_a / (post_a + post_b)
posterior <- data.frame('Dist'='Posterior','x'=seq(0,1,0.005), 'y'=dbeta(seq(0,1,0.005),post_a,post_b))
This can be plotted as:
model_plot <- rbind(prior,observations,posterior)
model_plot$Dist <- factor(model_plot$Dist, levels = c('Prior', 'Observations', 'Posterior'), ordered = TRUE) # note: assigning inside with() would not modify model_plot
qplot(model_plot$x, model_plot$y, geom='line',color = model_plot$Dist) +
ylab(expression(paste('p(',theta,')', sep = ''))) +
xlab(expression(theta)) +
geom_vline(xintercept = post_theta, linetype = "dotted", color = "black") +
scale_color_discrete(name = "Dist") +
annotate("text",x=post_theta, y=0, label=label_percent()(post_theta)) +
labs(title="Informative prior", subtitle="Low success") +
theme(plot.title = element_text(hjust = 0.5)) +
theme(plot.subtitle = element_text(hjust = 0.5))
RSTANARM is an R interface to STAN, where Stan is a platform for Bayesian inference - a probabilistic programming framework in which the guts of the inference method happen in the background, while the user focuses on parameterising the model in R. RJAGS is similarly an R interface to JAGS (Just Another Gibbs Sampler), a platform for Bayesian modelling.
Models in both STAN and JAGS can make use of Markov chain Monte Carlo (MCMC) methods. MCMC methods provide a way to draw random samples approximately from a posterior distribution. Such samples can be used to summarise any aspect of the posterior distribution of a statistical model.
RSTANARM and RJAGS have pre-written code for common models like regression.
Under a frequentist approach to linear regression, for a given response variable y, the relationship between y and a set of regressor variables X is assumed to be linear (governed by a set of parameters \(\beta\)), with an error term, i.e. \(y = X\beta + \epsilon\).
The error term \(\epsilon\) is assumed to be normally distributed, and the regression coefficients are estimated by minimising the sum of squared errors, commonly using ordinary least squares (OLS). We make some other assumptions about the error term - it is i.i.d. and not correlated with X - and define \(Var(\epsilon)=\sigma^2\). The generalised form allows the response variable to have a non-normal error distribution. This is achieved by relating the linear model to the response variable via a link function, and by allowing the magnitude of the variance of each measurement to be a function of its predicted value.
With Bayesian inference we are interested in the posterior distributions of the parameters \(\beta\) and \(\sigma\), rather than point estimates. We also want to influence those distributions with some prior knowledge of their behaviour:
\(p(\beta,\sigma|y,X) \propto p(y|\beta,\sigma)p(\beta,\sigma)\)
\(p(\beta,\sigma|y,X)\) is not a normalised probability density function (i.e. it is proportional to the product of the likelihood and prior). However, we can sample from it and therefore estimate any quantity of interest, e.g. the mean or variance (William H. Press 2007, p825). For this we use MCMC, William H. Press (2007), p825:
“MCMC is a random sampling method… the goal is to visit a point \(x\) with a probability proportional to some given distribution function \(\pi(x)\) [not necessarily a probability]. Why would we want to sample a distribution in this way? The answer is that Bayesian methods, often implemented using MCMC, provide a powerful way of estimating the parameters of a model and their degree of uncertainty.”
The Framingham Heart Study is a longitudinal study of cardiovascular disease in a sample population within Framingham, Massachusetts. A subset of data for teaching is available on request from the U.S. National Heart, Lung and Blood Institute. This dataset is not appropriate for publication and has been anonymised, but is nevertheless useful for illustrative purposes here. The dataset:
“..is a subset of the data collected as part of the Framingham study and includes laboratory, clinic, questionnaire, and adjudicated event data on 4,434 participants. Participant clinic data was collected during three examination periods, approximately 6 years apart, from roughly 1956 to 1968. Each participant was followed for a total of 24 years for the outcome of the following events: Angina Pectoris, Myocardial Infarction, Atherothrombotic Infarction or Cerebral Hemorrhage (Stroke) or death.”
We’ll look at a logistic regression model example using predefined Bayesian models in RSTANARM.
df_heart <- read.csv(file = 'frmgham2.csv')
df_heart <- df_heart[c(1:23)] %>% filter(PERIOD==2) # ignoring health event data, only looking at observations at period == 2 (observations for participants are not independent over time; this does have limitations e.g. observations are truncated at death which could be due to heart disease)
ignore_cols <- names(df_heart) %in% c("PERIOD","RANDID","TIME", "PREVMI", "PREVAP") #exclude identifier and time columns; exclude PREVMI and PREVAP as they are a subset of PREVCHD
df_heart <- df_heart[!ignore_cols]
head(df_heart)
SEX TOTCHOL AGE SYSBP DIABP CURSMOKE CIGPDAY BMI DIABETES BPMEDS
1 2 260 52 105 69.5 0 0 29.43 0 0
2 1 283 54 141 89.0 1 30 25.34 0 0
3 2 232 67 183 109.0 1 20 30.18 0 0
4 2 343 51 109 77.0 1 30 23.48 0 0
5 2 230 49 177 102.0 0 0 31.36 0 1
6 2 220 70 149 81.0 0 0 36.76 0 0
HEARTRTE GLUCOSE educ PREVCHD PREVSTRK PREVHYP HDLC LDLC
1 80 86 2 0 0 0 NA NA
2 75 87 1 0 0 0 NA NA
3 60 89 3 0 0 1 NA NA
4 90 72 3 0 0 0 NA NA
5 120 86 2 0 0 1 NA NA
6 80 98 1 1 0 1 NA NA
Data description highlights:
Using the DataExplorer package, we can see that the data is mostly complete, apart from missing entries in ‘glucose’ and, to a lesser extent, ‘education.’
introduce(df_heart)
rows columns discrete_columns continuous_columns
1 3930 18 0 16
all_missing_columns total_missing_values complete_rows
1 2 8720 0
total_observations memory_usage
1 70740 334464
plot_intro(df_heart)
plot_missing(df_heart)
Given that around 12% of glucose records are missing, as well as all of the HDLC and LDLC values (only available for period 3), drop those columns. Also drop rows with other missing data, as these are small in number.
df_heart$GLUCOSE <- NULL
df_heart$HDLC <- NULL
df_heart$LDLC <- NULL
df_heart <- na.omit(df_heart)
Plots of the data:
plot_bar(df_heart, title="Barplot")
Of the 3.6k remaining records, ~7% have a flag for heart disease:
table(df_heart$PREVCHD)
0 1
3319 253
Further plots and checking distributions:
plot_histogram(df_heart, title="Histogram for continuous")
qq_data <- df_heart # assumed: qq_data was defined earlier as (a subset of) df_heart; the column indices below are illustrative
log_qq_data <- update_columns(qq_data, 2:4, function(x) log(x + 1))
plot_qq(log_qq_data, title="QQ plot, logged data")
The variables most significantly correlated with heart disease are AGE, SYSBP, DIABETES, BPMEDS and PREVHYP.
plot_correlation(na.omit(df_heart))
Split data into training and testing data sets.
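The split code is not shown; a minimal sketch, assuming a 75/25 split and an arbitrary seed:
set.seed(123) # assumed seed
train_rows <- sample(seq_len(nrow(df_heart)), size = floor(0.75 * nrow(df_heart)))
df_heart_train <- df_heart[train_rows, ]
df_heart_test <- df_heart[-train_rows, ]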
Fit a simple GLM using all explanatory variables on the training data:
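Presumably fitted as below; the name full_model_glm_heart is referenced by the stepwise search later, and the formula matches the Call shown in the output:
full_model_glm_heart <- glm(PREVCHD ~ ., family = "binomial", data = df_heart_train)
summary(full_model_glm_heart)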
Call:
glm(formula = PREVCHD ~ ., family = "binomial", data = df_heart_train)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.3248 -0.4057 -0.2748 -0.1827 2.8921
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -5.962557 1.250208 -4.769 1.85e-06 ***
SEX -1.114213 0.177181 -6.289 3.20e-10 ***
TOTCHOL 0.001581 0.001784 0.886 0.37553
AGE 0.078303 0.011428 6.852 7.28e-12 ***
SYSBP 0.010581 0.005241 2.019 0.04352 *
DIABP -0.016745 0.009933 -1.686 0.09184 .
CURSMOKE 0.042076 0.261488 0.161 0.87216
CIGPDAY -0.005157 0.010770 -0.479 0.63204
BMI 0.026662 0.019952 1.336 0.18144
DIABETES 0.939410 0.262773 3.575 0.00035 ***
BPMEDS 0.375898 0.231239 1.626 0.10404
HEARTRTE -0.011837 0.006496 -1.822 0.06842 .
educ -0.033598 0.077837 -0.432 0.66599
PREVSTRK 0.092152 0.521049 0.177 0.85962
PREVHYP 0.236107 0.218069 1.083 0.27893
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1366.6 on 2678 degrees of freedom
Residual deviance: 1184.2 on 2664 degrees of freedom
AIC: 1214.2
Number of Fisher Scoring iterations: 6
Use stepwise regression to reduce the explanatory variables:
# specify a null model with no predictors
null_model_glm_heart <- glm(PREVCHD ~ 1, data = df_heart_train, family = "binomial")
# use a forward stepwise algorithm to build a parsimonious model
step_model_glm_heart <- step(null_model_glm_heart, scope = list(lower = null_model_glm_heart, upper = full_model_glm_heart), direction = "forward")
Summary of the model:
summary(step_model_glm_heart)
Call:
glm(formula = PREVCHD ~ AGE + SEX + DIABETES + SYSBP + HEARTRTE +
BPMEDS, family = "binomial", data = df_heart_train)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.3219 -0.4014 -0.2759 -0.1873 2.8906
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -6.508481 0.831483 -7.828 4.97e-15 ***
AGE 0.086628 0.010331 8.385 < 2e-16 ***
SEX -1.021098 0.166346 -6.138 8.34e-10 ***
DIABETES 1.037655 0.258433 4.015 5.94e-05 ***
SYSBP 0.008310 0.003601 2.308 0.0210 *
HEARTRTE -0.012274 0.006396 -1.919 0.0550 .
BPMEDS 0.405392 0.226323 1.791 0.0733 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1366.6 on 2678 degrees of freedom
Residual deviance: 1190.6 on 2672 degrees of freedom
AIC: 1204.6
Number of Fisher Scoring iterations: 6
Refitting with a reduced set of explanatory variables, dropping those with little impact on AIC:
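Presumably refitted as below; final_model_glm_heart is referenced when computing confidence intervals later, and the formula matches the Call in the output:
final_model_glm_heart <- glm(PREVCHD ~ AGE + SEX + PREVHYP + DIABETES + TOTCHOL,
                             family = "binomial", data = df_heart_train)
summary(final_model_glm_heart)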
Call:
glm(formula = PREVCHD ~ AGE + SEX + PREVHYP + DIABETES + TOTCHOL,
family = "binomial", data = df_heart_train)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.3239 -0.4108 -0.2797 -0.1870 2.9304
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -7.047973 0.742204 -9.496 < 2e-16 ***
AGE 0.090472 0.010145 8.918 < 2e-16 ***
SEX -1.015567 0.168042 -6.044 1.51e-09 ***
PREVHYP 0.397207 0.171145 2.321 0.0203 *
DIABETES 1.026242 0.254112 4.039 5.38e-05 ***
TOTCHOL 0.001445 0.001779 0.812 0.4166
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1366.6 on 2678 degrees of freedom
Residual deviance: 1198.5 on 2673 degrees of freedom
AIC: 1210.5
Number of Fisher Scoring iterations: 6
Fit a Bayesian GLM using the same explanatory variables. Here parameter estimates no longer have test statistics and p-values as in the frequentist regression, because Bayesian estimation samples from the posterior distribution. Instead of a point estimate and a test statistic, we get a distribution of plausible values for each parameter, and the estimates section of the summary summarises those distributions.
Further documentation on Bayesian generalised linear models via Stan is here. Documentation on the choice of priors is here. If priors are not specified, default weakly informative priors are used - you can call a summary of the modelled priors (see below). Default priors are autoscaled to make them less informative (see below for notes on scaling).
model_bayes_glm_heart <- stan_glm(PREVCHD ~ AGE + SEX + PREVHYP + DIABETES + TOTCHOL, family = "binomial",data=df_heart_train, iter=2000, warmup=500)
# chains (default 4) is the number of sample paths from the posterior; iter is the number of iterations per chain, including warmup; warmup is the number of initial iterations to discard
A summary of the model is below. “mean_PPD” is the mean of the posterior predictive samples (for models with an error term, a “sigma” row would give the standard deviation of the errors). MCMC diagnostics: “Rhat” is a measure of within-chain variance compared to across-chain variance (<1.1 implies convergence); “log-posterior” is analogous to the likelihood.
Model Info:
function: stan_glm
family: binomial [logit]
formula: PREVCHD ~ AGE + SEX + PREVHYP + DIABETES + TOTCHOL
algorithm: sampling
sample: 6000 (posterior sample size)
priors: see help('prior_summary')
observations: 2679
predictors: 6
Estimates:
mean sd 10% 50% 90%
(Intercept) -7.0759 0.7518 -8.0309 -7.0752 -6.1078
AGE 0.0908 0.0103 0.0776 0.0905 0.1040
SEX -1.0178 0.1678 -1.2333 -1.0174 -0.7994
PREVHYP 0.4005 0.1696 0.1868 0.3976 0.6185
DIABETES 1.0197 0.2583 0.6913 1.0249 1.3447
TOTCHOL 0.0015 0.0018 -0.0008 0.0014 0.0037
Fit Diagnostics:
mean sd 10% 50% 90%
mean_PPD 0.0707 0.0067 0.0620 0.0705 0.0795
The mean_ppd is the sample average posterior predictive distribution of the outcome variable (for details see help('summary.stanreg')).
MCMC diagnostics
mcse Rhat n_eff
(Intercept) 0.0107 0.9998 4977
AGE 0.0002 0.9996 4032
SEX 0.0024 1.0002 4776
PREVHYP 0.0024 0.9999 5157
DIABETES 0.0032 1.0000 6535
TOTCHOL 0.0000 1.0004 5728
mean_PPD 0.0001 1.0002 6925
log-posterior 0.0317 1.0007 2999
For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).
The Stan model used default normal priors for the intercept and coefficients. Default priors are automatically scaled to limit how informative they are - below we can see that the adjusted (scaled) prior for the coefficients differs from the default 2.5:
# summary of priors
print(prior_summary(model_bayes_glm_heart),digits=4)
Priors for model 'model_bayes_glm_heart'
------
Intercept (after predictors centered)
~ normal(location = 0, scale = 2.5)
Coefficients
Specified prior:
~ normal(location = [0,0,0,...], scale = [2.5,2.5,2.5,...])
Adjusted prior:
~ normal(location = [0,0,0,...], scale = [0.2926,5.0406,4.9991,...])
------
See help('prior_summary.stanreg') for more details
Credible intervals express the probability that a parameter falls within a given range (versus a confidence interval, which is constructed so that, over repeated samples, a given proportion of such intervals contain the true value of the parameter). These credible intervals are computed by finding the relevant quantiles of the draws from the posterior distribution. They produce similar ranges to the confidence intervals from the frequentist approach.
confint(final_model_glm_heart, level=0.95)
2.5 % 97.5 %
(Intercept) -8.52662803 -5.614429675
AGE 0.07086231 0.110670109
SEX -1.34930362 -0.689678256
PREVHYP 0.06522520 0.737205187
DIABETES 0.51152722 1.511076030
TOTCHOL -0.00208170 0.004889717
posterior_interval(model_bayes_glm_heart, prob=0.95)
2.5% 97.5%
(Intercept) -8.547304593 -5.606970389
AGE 0.071086672 0.111312772
SEX -1.345066980 -0.690262317
PREVHYP 0.072906510 0.732413757
DIABETES 0.496030231 1.505244789
TOTCHOL -0.002016689 0.004926423
stan_glm allows a selection of prior distributions for the coefficients, intercept, and auxiliary parameters. The prior distributions can be altered by specifying different locations/scales/degrees of freedom, but all the priors take the form of a single chosen probability distribution.
Some drivers for wanting to specify a prior:
* Available information/research might suggest a base value for our parameters; this is particularly valuable if our own observed data is sparse.
* We may not have a good idea of the value, but know that it is constrained in some way, e.g. is positive.
The scale is the standard deviation of the prior. As noted earlier, the model autoscales to ensure priors are not too informative. We can turn this off when we specify our own priors. Documentation on choice of priors is here.
To allow for varying distributions across the priors, we need to define the model in STAN or JAGS. See the linear regression example below for an example of this in JAGS.
Specifying our own priors below. Little impact on the posterior estimates.
model_bayes_glm_heart_scale <- stan_glm(PREVCHD ~ AGE + SEX + PREVHYP + DIABETES + TOTCHOL, family = "binomial",data=df_heart_train, iter=2000, warmup=500,
prior_intercept = normal(location = 0, scale = 10, autoscale = FALSE),
prior = normal(location = 0, scale = 5, autoscale = FALSE),
prior_aux = exponential(rate = 1, autoscale = FALSE) # note: binomial models have no auxiliary parameter, so this prior is unused here
)
SAMPLING FOR MODEL 'bernoulli' NOW (CHAIN 1).
Chain 1:
Chain 1: Gradient evaluation took 0.001 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 10 seconds.
Chain 1: Adjust your expectations accordingly!
Chain 1:
Chain 1:
Chain 1: Iteration: 1 / 2000 [ 0%] (Warmup)
Chain 1: Iteration: 200 / 2000 [ 10%] (Warmup)
Chain 1: Iteration: 400 / 2000 [ 20%] (Warmup)
Chain 1: Iteration: 501 / 2000 [ 25%] (Sampling)
Chain 1: Iteration: 700 / 2000 [ 35%] (Sampling)
Chain 1: Iteration: 900 / 2000 [ 45%] (Sampling)
Chain 1: Iteration: 1100 / 2000 [ 55%] (Sampling)
Chain 1: Iteration: 1300 / 2000 [ 65%] (Sampling)
Chain 1: Iteration: 1500 / 2000 [ 75%] (Sampling)
Chain 1: Iteration: 1700 / 2000 [ 85%] (Sampling)
Chain 1: Iteration: 1900 / 2000 [ 95%] (Sampling)
Chain 1: Iteration: 2000 / 2000 [100%] (Sampling)
Chain 1:
Chain 1: Elapsed Time: 46.861 seconds (Warm-up)
Chain 1: 56.413 seconds (Sampling)
Chain 1: 103.274 seconds (Total)
Chain 1:
(similar sampling output for chains 2-4 omitted)
Model Info:
function: stan_glm
family: binomial [logit]
formula: PREVCHD ~ AGE + SEX + PREVHYP + DIABETES + TOTCHOL
algorithm: sampling
sample: 6000 (posterior sample size)
priors: see help('prior_summary')
observations: 2679
predictors: 6
Estimates:
mean sd 10% 50% 90%
(Intercept) -7.0698 0.7338 -8.0009 -7.0838 -6.1308
AGE 0.0908 0.0101 0.0780 0.0909 0.1040
SEX -1.0192 0.1661 -1.2337 -1.0173 -0.8074
PREVHYP 0.4000 0.1728 0.1778 0.4028 0.6215
DIABETES 1.0105 0.2562 0.6795 1.0130 1.3361
TOTCHOL 0.0014 0.0018 -0.0008 0.0014 0.0037
Fit Diagnostics:
mean sd 10% 50% 90%
mean_PPD 0.0706 0.0066 0.0623 0.0705 0.0791
The mean_ppd is the sample average posterior predictive distribution of the outcome variable (for details see help('summary.stanreg')).
MCMC diagnostics
mcse Rhat n_eff
(Intercept) 0.0097 0.9997 5764
AGE 0.0001 0.9999 5503
SEX 0.0027 1.0001 3852
PREVHYP 0.0028 1.0007 3678
DIABETES 0.0042 1.0001 3749
TOTCHOL 0.0000 0.9999 6107
mean_PPD 0.0001 1.0007 5266
log-posterior 0.0361 1.0006 2327
For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).
We are able to extract predictions from each iteration of the posterior sampling, including predictions for new data; these could be tested against the observations for goodness of fit. The following is from the manual, “RSTANARM Manual” (2020), p43:
“The posterior predictive distribution (posterior_predict) is the distribution of the outcome implied by the model after using the observed data to update our beliefs about the unknown parameters in the model. Simulating data from the posterior predictive distribution using the observed predictors is useful for checking the fit of the model. Drawing from the posterior predictive distribution at interesting values of the predictors also lets us visualize how a manipulation of a predictor affects (a function of) the outcome(s). With new observations of predictor variables we can use the posterior predictive distribution to generate predicted outcomes.”
posteriors <- posterior_predict(model_bayes_glm_heart,newdata=df_heart_test)
posterior <- as.data.frame(posteriors) # each column is a data point from the dataset and each row is a prediction
# Print 10 predictions for 5 lives
print("10 predictions for 5 lives")
[1] "10 predictions for 5 lives"
posteriors[1:10, 1:5]
4 6 12 17 18
[1,] 0 0 0 0 0
[2,] 0 0 0 0 0
[3,] 0 0 0 0 0
[4,] 0 0 0 0 0
[5,] 0 1 1 0 0
[6,] 0 0 1 0 0
[7,] 0 0 0 0 0
[8,] 0 0 0 0 0
[9,] 0 0 0 0 0
[10,] 0 0 0 0 0
We can plot the posterior distributions for the model parameters from the samples:
posterior_heart <- as.matrix(model_bayes_glm_heart)
plot1<- mcmc_areas(posterior_heart, pars = c("AGE"), prob = 0.80)
plot2<- mcmc_areas(posterior_heart, pars = c("SEX"), prob = 0.80)
plot3<- mcmc_areas(posterior_heart, pars = c("PREVHYP"), prob = 0.80)
plot4<- mcmc_areas(posterior_heart, pars = c("DIABETES"), prob = 0.80)
plot5<- mcmc_areas(posterior_heart, pars = c("TOTCHOL"), prob = 0.80)
grid.arrange(plot1,plot2, plot3, plot4, plot5, top="Posterior distributions (median + 80% intervals)")
We can extract the \(R^2\) from the posterior distribution and plot:
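A sketch of the extraction using rstanarm's bayes_R2() helper; the name r2_posterior matches the histogram call below:
r2_posterior <- bayes_R2(model_bayes_glm_heart) # one R^2 draw per posterior sample
summary(r2_posterior)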
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.04243 0.07155 0.08063 0.08122 0.08992 0.14692
hist(r2_posterior) #expect normal
Posterior predictive checks can be extracted using “pp_check.” Here the outcome is binary, so the density overlay outcomes are concentrated at one or zero.
pp_check(model_bayes_glm_heart, "stat")
pp_check(model_bayes_glm_heart, "dens_overlay")
Further considerations: The loo package provides “efficient approximate leave-one-out cross-validation (LOO), approximate standard errors for estimated predictive errors (including comparisons), and the widely applicable information criterion (WAIC). The loo package can be used to estimate predictive error of any MCMC item-level log likelihood output.”
The above example looked at a binary response. In this example we'll look at predicting a continuous variable, using data on my household electricity consumption over 2016-2021. The above example used stan_glm in RSTANARM; here we will use a model defined in RJAGS, with additional flexibility around the choice of priors (this is also possible in a parameterised model for STAN). RJAGS combines the power of R with the “Just Another Gibbs Sampler” or “JAGS” engine. To get started, first download the “JAGS” program, then set up the packages as above.
Both JAGS and STAN allow for the parameterisation of a range of different models. The example below was chosen for comparison with a conventional GLM approach.
Data prep:
df_usage <- read.csv(file = 'elec_usage.csv')
df_usage <- df_usage %>% mutate(Date = as.Date(Date, format="%Y-%m-%d"), EndDate = parse_date_time(EndDate, '%Y-%m-%d %H:%M:%S'), StartDate = parse_date_time(StartDate, '%Y-%m-%d %H:%M:%S'))
df_usage <- df_usage[c(-2,-3,-4,-6)] %>% # drop start and end dates, rate type, duration
group_by(Date) %>% summarise(Usage = sum(Usage), MaxTemp = max(MaxTemp), MinTemp = min(MinTemp), Rain_mm = max(Rain_mm), Solar_Exposure = max(Solar_Exposure)) %>%
# Add in fields to identify seasons, weekend/ weekday and an indicator for the post-covid/ work from home period
mutate(Season=
if_else(month(Date)>= 3 & month(Date) < 6,"Autumn",
if_else(month(Date)>= 6 & month(Date) < 9,"Winter",
if_else(month(Date)>= 9 & month(Date) < 12,"Spring","Summer"))),
Covid_Ind=if_else(Date>= "2020-03-15",1,0),
Weekend=
if_else(wday(Date)==7 | wday(Date)==1,1,0))
df_usage$Season <- factor(df_usage$Season, levels = c("Summer","Autumn","Winter","Spring")) # change season levels
head(df_usage)
# A tibble: 6 x 9
Date Usage MaxTemp MinTemp Rain_mm Solar_Exposure Season
<date> <dbl> <dbl> <dbl> <dbl> <dbl> <fct>
1 2019-12-21 21.1 26.7 21 0 27.1 Summer
2 2019-12-22 9.53 19.9 18.6 0 11.9 Summer
3 2019-12-23 15.7 22.1 17.2 0 14.4 Summer
4 2019-12-24 18.7 24 18.6 1.8 18.3 Summer
5 2019-12-25 18.9 23.2 21.6 0.2 27.8 Summer
6 2019-12-26 14.6 23.6 21.4 0 30.8 Summer
# ... with 2 more variables: Covid_Ind <dbl>, Weekend <dbl>
Data is mostly complete:
introduce(df_usage)
# A tibble: 1 x 9
rows columns discrete_columns continuous_columns all_missing_colum~
<int> <int> <int> <int> <int>
1 732 9 2 7 0
# ... with 4 more variables: total_missing_values <int>,
# complete_rows <int>, total_observations <int>, memory_usage <dbl>
plot_intro(df_usage)
plot_missing(df_usage)
Drop records with NAs, as they are few, and only include non-zero usage. Reviewing the explanatory variables, a gamma distribution is potentially a better fit than the normal for Usage, given the skewness in the tail; alternatively, use a normal with a log link.
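A sketch of the assumed cleaning step:
df_usage <- df_usage %>% na.omit() %>% filter(Usage > 0) # drop NAs and zero-usage days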
plot_histogram(df_usage, title="Histogram for continuous")
qq_data <- df_usage # assumed: qq_data was defined earlier from df_usage; the column indices below are illustrative
log_qq_data <- update_columns(qq_data, 2:4, function(x) log(x + 1))
plot_qq(log_qq_data, title="QQ plot, logged data")
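The gamma fit below appears to come from fitdistrplus; a sketch of the assumed call (method of moments, matching the output):
usage_fit_gamma <- fitdist(df_usage$Usage, "gamma", method = "mme")
summary(usage_fit_gamma)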
Fitting of the distribution ' gamma ' by matching moments
Parameters :
estimate
shape 2.9114935
rate 0.1164095
Loglikelihood: -2813.89 AIC: 5631.78 BIC: 5640.902
plot(usage_fit_gamma)
Usage is correlated with season, max/min temperature and the covid WFH indicator. Max and min temperatures are, not surprisingly, highly correlated with one another, with the seasons and with solar exposure.
plot_correlation(na.omit(df_usage))
Split data into training and testing data sets:
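As before, a sketch of the assumed split (the proportion and seed are assumptions):
set.seed(123)
train_rows <- sample(seq_len(nrow(df_usage)), size = floor(0.75 * nrow(df_usage)))
df_usage_train <- df_usage[train_rows, ]
df_usage_test <- df_usage[-train_rows, ]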
Fit a simple GLM using all variables:
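Presumably fitted as below; full_model_glm_usage is referenced by the stepwise search later, and the formula matches the Call in the output (the Date column is dropped):
full_model_glm_usage <- glm(Usage ~ ., family = gaussian(link = "log"), data = df_usage_train[-1])
summary(full_model_glm_usage)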
Call:
glm(formula = Usage ~ ., family = gaussian(link = "log"), data = df_usage_train[-1])
Deviance Residuals:
Min 1Q Median 3Q Max
-47.739 -4.237 -0.169 4.152 49.830
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.3435178 0.1980361 26.983 < 2e-16 ***
MaxTemp -0.0236931 0.0081617 -2.903 0.00385 **
MinTemp -0.0963046 0.0094105 -10.234 < 2e-16 ***
Rain_mm -0.0005874 0.0014031 -0.419 0.67565
Solar_Exposure -0.0170304 0.0036434 -4.674 3.76e-06 ***
SeasonAutumn 0.0853256 0.0869296 0.982 0.32678
SeasonWinter 0.0690973 0.0969988 0.712 0.47657
SeasonSpring 0.0600080 0.0872295 0.688 0.49180
Covid_Ind -0.0298564 0.1116571 -0.267 0.78927
Weekend -0.0736053 0.0342713 -2.148 0.03220 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for gaussian family taken to be 86.66816)
Null deviance: 114574 on 529 degrees of freedom
Residual deviance: 45065 on 520 degrees of freedom
AIC: 3880.9
Number of Fisher Scoring iterations: 8
Use stepwise regression to reduce the explanatory variables:
# specify a null model with no predictors
null_model_glm_usage <- glm(Usage ~ 1, data = df_usage_train, gaussian(link="log"))
# use a forward stepwise algorithm to build a parsimonious model
step_model_glm_usage <- step(null_model_glm_usage, scope = list(lower = null_model_glm_usage, upper = full_model_glm_usage), direction = "forward")
Summary of the model:
summary(step_model_glm_usage)
Call:
glm(formula = Usage ~ MinTemp + Solar_Exposure + MaxTemp + Weekend,
family = gaussian(link = "log"), data = df_usage_train)
Deviance Residuals:
Min 1Q Median 3Q Max
-47.704 -4.045 -0.127 4.071 49.756
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.386132 0.093447 57.639 < 2e-16 ***
MinTemp -0.099404 0.007008 -14.184 < 2e-16 ***
Solar_Exposure -0.017416 0.003131 -5.563 4.24e-08 ***
MaxTemp -0.021716 0.007748 -2.803 0.00525 **
Weekend -0.073802 0.034035 -2.168 0.03057 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for gaussian family taken to be 86.06824)
Null deviance: 114574 on 529 degrees of freedom
Residual deviance: 45185 on 525 degrees of freedom
AIC: 3872.3
Number of Fisher Scoring iterations: 7
For simplicity, we’ll assume a model defined by MaxTemp and MinTemp only, noting the significant correlation with other explanatory variables.
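Presumably fitted as below; gamma_model_glm_usage is used for predictions later, and the formula matches the Call in the output:
gamma_model_glm_usage <- glm(Usage ~ MaxTemp + MinTemp, family = Gamma(link = "inverse"), data = df_usage_train)
summary(gamma_model_glm_usage)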
Call:
glm(formula = Usage ~ MaxTemp + MinTemp, family = Gamma(link = "inverse"),
data = df_usage_train)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.08514 -0.27462 -0.05095 0.18899 2.08795
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.0290515 0.0039115 -7.427 4.5e-13 ***
MaxTemp 0.0006336 0.0003152 2.011 0.0449 *
MinTemp 0.0040192 0.0003017 13.321 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for Gamma family taken to be 0.1800649)
Null deviance: 197.76 on 529 degrees of freedom
Residual deviance: 103.91 on 527 degrees of freedom
AIC: 3882.3
Number of Fisher Scoring iterations: 5
Changing the family to gaussian with a log link (a later graph shows little change in predictions, although the AIC is higher). We select the gaussian for ease of comparison with the Bayes model.
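Presumably fitted as below; final_model_glm_usage is referenced when extracting coefficients later:
final_model_glm_usage <- glm(Usage ~ MaxTemp + MinTemp, family = gaussian(link = "log"), data = df_usage_train)
summary(final_model_glm_usage)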
Call:
glm(formula = Usage ~ MaxTemp + MinTemp, family = gaussian(link = "log"),
data = df_usage_train)
Deviance Residuals:
Min 1Q Median 3Q Max
-52.033 -4.352 -0.494 4.846 48.033
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.384399 0.097722 55.099 < 2e-16 ***
MaxTemp -0.037435 0.007621 -4.912 1.2e-06 ***
MinTemp -0.095235 0.007292 -13.060 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for gaussian family taken to be 91.93297)
Null deviance: 114574 on 529 degrees of freedom
Residual deviance: 48448 on 527 degrees of freedom
AIC: 3905.2
Number of Fisher Scoring iterations: 7
Extract coefficients:
final_model_glm_usage_coeff <- as.data.frame(coef(final_model_glm_usage))
final_model_glm_usage_coeff
coef(final_model_glm_usage)
(Intercept) 5.38439891
MaxTemp -0.03743478
MinTemp -0.09523517
Project the result for a dummy record, replicating the prediction by hand via the inverse of the log link:
print("Dummy prediction with min of 10 and max of 20:")
[1] "Dummy prediction with min of 10 and max of 20:"
predict(final_model_glm_usage, newdata= as.data.frame(cbind(MaxTemp=20,MinTemp=10)), # model and data
type="response", # or terms for coefficients
se.fit = FALSE)
1
39.7795
# replicate
exp(final_model_glm_usage_coeff[1,] + final_model_glm_usage_coeff[2,]*20+final_model_glm_usage_coeff[3,]*10)
[1] 39.7795
Adding the predictions to the test data:
# predictions, normal
pred_usage <- as.data.frame(
predict(final_model_glm_usage, newdata= df_usage_test, # model and data
type="response", # or terms for coefficients
se.fit = TRUE, # default is false
interval = "confidence", #default "none", also "prediction"
level = 0.95
)
)
# predictions, gamma
usage_pred_gamma <-
predict(gamma_model_glm_usage, newdata= df_usage_test, # model and data
type="response", # or terms for coefficients
se.fit = FALSE # default is false
)
# rename
pred_usage <- rename(pred_usage,usage_pred=fit,se_usage_pred=se.fit)
# add back to data
df_usage_test_pred <- cbind(df_usage_test,pred_usage, usage_pred_gamma)
Plotting the results shows a poorer fit for winter when the usage spikes and in instances with outlier temperatures. Predictions for the test data are not hugely different to the gamma family model.
df_usage_test_pred %>%
ggplot() +
geom_point(aes(x=Date,y=Usage,color=Season)) +
# add a modeled line
geom_line(aes(x=Date,y=usage_pred)) +
geom_line(aes(x=Date,y=usage_pred_gamma), linetype = "dashed") +
labs(x="Date", y="Usage", title = "Actual vs predicted usage by date and season",subtitle="Dashed=Gamma Pred; Solid=Normal Pred") +
theme_pander() + scale_color_gdocs()
# plot temperatures
df_usage_test_pred %>%
ggplot() +
geom_point(aes(x=Date,y=MaxTemp,color=Season)) +
labs(x="Date", y="Usage", title = "MaxTemp by date and season") +
theme_pander() + scale_color_gdocs()
Plots show slightly higher residuals at higher predicted values for both the stepwise regression model and the final selected model. The QQ plot suggests skewness in the data, which we could see earlier in the histogram of usage.
Define a regression model in RJAGS. The structure here is more flexible than stan_glm, which is predefined: it allows a wider selection of prior distributions and a mixture of modelling choices. It is also possible to specify model forms in this way for STAN.
\(Y_i \sim N(m_i,s^2)\)
Where \(\log(m_i) = a + b X_{1,i} + c X_{2,i}\), with \(X_1\) the daily maximum temperature and \(X_2\) the daily minimum temperature.
Note that the distribution syntax is slightly different to R, for example in R a normal distribution is specified using the mean and standard deviation, but in JAGS the second parameter of the dnorm command is the distribution’s precision (1 / variance).
usage_model <- "model{
#Likelihood model for Y[i]
for (i in 1:length(Y)) {
Y[i] ~ dnorm(m[i], s^(-2)) # link=log
log(m[i]) <- a + b*X_1[i] + c*X_2[i]
}
# Prior models for a, b, c, s
a ~ dnorm(0, 0.001) # Intercept
b ~ dnorm(0, 0.001) # MaxTemp
c ~ dnorm(0, 0.001) # MinTemp
s ~ dunif(0, 10) # Error
}"
Simulate from the model:
usage_sim <- jags.model(textConnection(usage_model),
data=list(Y = df_usage_train$Usage, X_1 = df_usage_train$MaxTemp, X_2 = df_usage_train$MinTemp),
n.chains = 3)
Compiling model graph
Resolving undeclared variables
Allocating nodes
Graph information:
Observed stochastic nodes: 530
Unobserved stochastic nodes: 4
Total graph size: 3457
Initializing model
usage_posterior <- coda.samples(usage_sim,
                                variable.names = c("a","b","c","s"),
                                n.iter = 10000, n.burnin = 2000, n.thin = 10, DIC = F)
# note: coda.samples' thinning argument is 'thin'; n.burnin, n.thin and DIC are not
# coda.samples arguments and appear to be silently ignored, consistent with the
# summary below (thinning interval 1, iterations 1001:11000)
Summary of the results below:
* The mean of the b chain values was about -0.04. This is an estimate of the posterior mean of b, and is consistent with the GLM above.
* The naive SE is calculated by dividing the standard deviation of the chain values by the square root of the number of iterations (or sample size). This is a naive approach to calculating error in a setting with dependent samples.
Also showing the density plots for the posteriors as well as the trace plots for each of the chains.
summary(usage_posterior)
Iterations = 1001:11000
Thinning interval = 1
Number of chains = 3
Sample size per chain = 10000
1. Empirical mean and standard deviation for each variable,
plus standard error of the mean:
Mean SD Naive SE Time-series SE
a 5.37886 0.105507 6.091e-04 0.0069295
b -0.03689 0.008541 4.931e-05 0.0008687
c -0.09569 0.007709 4.451e-05 0.0004970
s 9.55431 0.245370 1.417e-03 0.0019104
2. Quantiles for each variable:
2.5% 25% 50% 75% 97.5%
a 5.17760 5.3076 5.37331 5.44719 5.59401
b -0.05425 -0.0428 -0.03664 -0.03079 -0.02105
c -0.11062 -0.1009 -0.09578 -0.09058 -0.08016
s 9.03848 9.3878 9.56855 9.74158 9.96367
plot(usage_posterior,trace=FALSE)
We can plot the posterior distributions for the model parameters from the samples:
posterior_usage <- as.matrix(usage_posterior)
plot1<- mcmc_areas(posterior_usage, pars = c("a"), prob = 0.80)
plot2<- mcmc_areas(posterior_usage, pars = c("b"), prob = 0.80)
plot3<- mcmc_areas(posterior_usage, pars = c("c"), prob = 0.80)
plot4<- mcmc_areas(posterior_usage, pars = c("s"), prob = 0.80)
grid.arrange(plot1,plot2, plot3, plot4, top="Posterior distributions (median + 80% intervals)")
Similar to the earlier example, we could extract the \(R^2\) from the posterior distribution as well as run posterior predictive checks.
Further considerations: extracting the mean of the parameter projections, we could produce predictions for the test data. We could also produce predictions for each test data point based upon each of the samples from the posterior; above, this would give 2,400 predictions per data point (3 chains x 10k iterations, discarding the first 2k from each chain and saving only 1/10 of the samples).
Hugh Miller, Isabella Lyons (2012) discusses a method of forecasting COVID infections using the “EpiNow2” package, Sam Abbott (2020). The package makes use of STAN.
“[Epinow2] estimates the time-varying reproduction number on cases by date of infection… Imputed infections are then mapped to observed data (for example cases by date of report) via a series of uncertain delay distributions…”
That is to say, the model makes some assumptions about lags in symptoms and in reporting (assuming some distribution for those delays) in order to estimate the true underlying number of infections (backwards). Using that backcast it projects out how those infections might be reported over time. The model also estimates the effective reproduction number in order to predict future infections.
Covid case and test data were sourced from the NSW Government website on 17/01/2022.
We should be careful about placing too heavy a reliance on any model of incidence based upon case data (particularly where case rates are expected to be under-reported or where testing is incomplete). The example is purely illustrative and does not attempt to correct for gaps in the data.
Case data:
notification_date postcode lhd_2010_code lhd_2010_name
1 2020-01-25 2134 X700 Sydney
2 2020-01-25 2121 X760 Northern Sydney
3 2020-01-25 2071 X760 Northern Sydney
4 2020-01-27 2033 X720 South Eastern Sydney
5 2020-03-01 2077 X760 Northern Sydney
6 2020-03-01 2163 X710 South Western Sydney
lga_code19 lga_name19
1 11300 Burwood (A)
2 16260 Parramatta (C)
3 14500 Ku-ring-gai (A)
4 16550 Randwick (C)
5 14000 Hornsby (A)
6 12850 Fairfield (C)
test_date postcode lhd_2010_code lhd_2010_name lga_code19
1 2020-01-01 2038 X700 Sydney 14170
2 2020-01-01 2039 X700 Sydney 14170
3 2020-01-01 2040 X700 Sydney 14170
4 2020-01-01 2041 X700 Sydney 14170
5 2020-01-01 2069 X760 Northern Sydney 14500
6 2020-01-01 2074 X760 Northern Sydney 14500
lga_name19 test_count
1 Inner West (A) 1
2 Inner West (A) 1
3 Inner West (A) 2
4 Inner West (A) 1
5 Ku-ring-gai (A) 1
6 Ku-ring-gai (A) 1
# A tibble: 6 x 3
# Groups: date [6]
date descr confirm
<date> <chr> <dbl>
1 2021-05-01 cases 2
2 2021-05-02 cases 8
3 2021-05-03 cases 9
4 2021-05-04 cases 7
5 2021-05-05 cases 9
6 2021-05-06 cases 4
Plotting recorded cases by date (PCR test outcomes only and only to 31 December 2021):
df_cases %>% mutate(omicron=ifelse(date > "2021-11-26","first case+","pre")) %>%
ggplot(data=., mapping = aes(x=date, y=confirm)) +
geom_col(aes(fill=omicron)) +
scale_x_date(date_breaks = "months" , date_labels = "%b-%y", limits = c(as.Date("2021-04-30"), as.Date("2021-12-31"))) +
labs(x="Date", y="Cases", title = "Covid cases by date, NSW") +
theme_pander() + scale_color_gdocs()
Test data:
# A tibble: 6 x 3
# Groups: date [6]
date descr tests
<date> <chr> <int>
1 2021-05-01 tests 4514
2 2021-05-02 tests 4775
3 2021-05-03 tests 14877
4 2021-05-04 tests 10868
5 2021-05-05 tests 13126
6 2021-05-06 tests 24029
Plotting tests by date (PCR tests only; to 31 December 2021):
ggplot(data=df_tests, mapping = aes(x=date, y=tests))+
geom_bar(stat="identity") +
scale_x_date(date_breaks = "months" , date_labels = "%b-%y", limits = c(as.Date("2021-04-30"), as.Date("2021-12-31"))) +
labs(x="Date", y="Tests", title = "PCR tests by date, NSW") +
theme_pander() + scale_color_gdocs()
Merged to get positive test rate:
# merge
df <- merge(x = df_cases, y = df_tests, by = "date", all.x = TRUE) %>%
mutate(pos_rate=pmin(1, confirm/tests)) %>%
fill(pos_rate, .direction = "downup") %>%
# add smoothed positive rate
mutate(pos_rate_sm = fitted(loess(pos_rate~as.numeric(date),span=0.1))) # loess - Local Polynomial Regression Fitting; span - the parameter which controls the degree of smoothing
# drop descriptions
df$descr.x <- NULL
df$descr.y <- NULL
Plot of positive rate shows that the proportion of tests returning a positive increases to above 5% from 25th December 2021, suggesting that the case data beyond this point is less complete.
# positive rate
ggplot(data=df, mapping = aes(x=date))+
geom_point(aes(y=(pos_rate))) +
geom_line(aes(y=pos_rate_sm)) +
scale_x_date(date_breaks = "months" , date_labels = "%b-%y", limits = c(as.Date("2021-04-30"), as.Date("2021-12-31"))) +
scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
labs(x="Date", y="Positive test rate", title = "Positive PCR test rate by date, NSW") +
theme_pander() + scale_color_gdocs()
There are some pragmatic methods for adjusting the case numbers for increasing test positivity rates. For our purposes we will focus the model on case data before 23 December 2021 in order to exclude the period where the test data became less reliable/ complete, noting that underlying data completeness issues are likely present.
# Reporting delays (delay from symptom onset to reporting/ testing positive)
reporting_delay <- estimate_delay(rlnorm(100, log(6), 1), max_value = 15)
cat("Mean reporting delay = ",reporting_delay$mean, "; ")
# Symptom generation time (time between infection events in an infector-infectee pair) and incubation period (time between moment of infection and symptom onset)
generation_time <- get_generation_time(disease = "SARS-CoV-2", source = "ganyani")
cat("Mean generation time =",generation_time$mean, "; ")
incubation_period <- get_incubation_period(disease = "SARS-CoV-2", source = "lauer")
cat("Mean incubation period =",incubation_period$mean, "; ")
df_cases_fit <- df_cases %>% filter(date <= "2021-12-22") # restrict to data before 23 December 2021
estimates <- epinow(reported_cases = df_cases_fit[c("date","confirm")], # case data from NSW
generation_time = generation_time,
delays = delay_opts(incubation_period, reporting_delay),
rt = rt_opts(prior = list(mean = 2, sd = 0.2)),
stan = stan_opts(cores = 6, samples = 3000, warmup = 500, chains = 6, control = list(adapt_delta = 0.95)))
Importing a previous run of the above:
load("covid_estimates_summary.RData")
plot(estimates) produces predefined plots of cases by date reported and date of infection as well as the effective reproduction number. Given the dramatic increase in reported cases towards the end of the period, it is not surprising that the range of forecast cases is wide. Interestingly, the forecast/ partial forecast reproduction number reduces; presumably this is influenced by the earlier observations where case numbers were low. This issue has not been investigated further, however there are articles that discuss the impact of changes to the testing procedures and other limitations.
Merging forecast reported cases with the actual case data:
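The merge code is not shown; a sketch, assuming the saved estimates object follows the usual EpiNow2 structure (a summarised table with date, variable, mean and 90% bounds - the lower_90/upper_90 names match the plot below):
df_forecast <- estimates$estimates$summarised %>% # assumed path within the estimates object
  filter(variable == "reported_cases") %>%
  select(date, mean, lower_90, upper_90)
df <- merge(x = df, y = df_forecast, by = "date", all.x = TRUE)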
Plotting forecast vs actuals below - the week to 29 December is fully forecast from the model. Actual cases fall within the projected 90% confidence interval range, noting that the test positivity rate increased materially from 25th December so it is likely that the actual cases are under reported.
colours <- c("actual" = "#3366cc","modelled" = "#dc3912")
ggplot(data=df %>% filter(date >= "2021-12-01"), aes(x=date)) +
geom_line(aes(y=confirm, color="actual")) +
geom_line(aes(y=mean,color="modelled")) +
geom_ribbon(aes(ymin=lower_90, ymax=upper_90), fill="#dc3912", alpha=0.2) +
labs(x="Date", y="Reported cases", title = "Reported cases by date",color="Legend") +
scale_color_manual(values = colours) +
theme_pander()
A few books and articles of interest on STAN: