Relevant readings:
We’d like to extend a normal model of data to incorporate a single covariate.
This is the model we’re starting with:
data {
  int N;
  vector[N] y;
}
parameters {
  real mu;
  real<lower=0> sigma;
}
model {
  mu ~ normal(0, 10);
  sigma ~ exponential(1);
  y ~ normal(mu, sigma);
}
We’ll need a couple of things:

- a vector of `x` values
- an intercept (`alpha`) and a slope (`beta`)
- a linear model for `mu`, so that `mu` changes with `x`
- priors for the new parameters (we’ll use “weakly informative” priors)

Open `stan/normal.stan` and we’ll make the appropriate changes.
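Here’s one way the updated model could look — a sketch, using weakly informative `normal(0, 10)` priors on the new parameters (declaring `mu` in `transformed parameters` keeps it available to later blocks):

```stan
data {
  int N;
  vector[N] y;
  vector[N] x;  // the new covariate
}
parameters {
  real alpha;
  real beta;
  real<lower=0> sigma;
}
transformed parameters {
  // linear model: mu now varies across observations
  vector[N] mu = alpha + beta * x;
}
model {
  alpha ~ normal(0, 10);
  beta ~ normal(0, 10);
  sigma ~ exponential(1);
  y ~ normal(mu, sigma);
}
```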
We’ll use data from the pscl
package, which is a collection of statistical tools and data sets for political science.
install.packages("pscl")
First, load the data into R. We’ll tweak the format a little to be in line with contemporary best practices.
library(tidyverse)
library(pscl)
# data
df <-
as_tibble(unionDensity) %>%
rownames_to_column(var = "country")
Second, center the data (to have a mean of 0) and scale it (to have a standard deviation of 1) using the scale()
function.
Note that `scale()` returns a matrix, which we don’t want in this case! The `[, 1]` bit turns it into a vector (`as.vector(scale(union))` would do the job as well).
# scale data
df_scaled <-
df %>%
mutate(
union = scale(union)[, 1],
left = scale(left)[, 1]
)
summary(df_scaled)
## country union left size
## Length:20 Min. :-1.57656 Min. :-1.1154 Min. : 4.394
## Class :character 1st Qu.:-0.90333 1st Qu.:-0.9087 1st Qu.: 7.567
## Mode :character Median : 0.05786 Median :-0.0930 Median : 8.196
## Mean : 0.00000 Mean : 0.0000 Mean : 8.391
## 3rd Qu.: 0.86707 3rd Qu.: 0.6460 3rd Qu.: 9.713
## Max. : 1.51097 Max. : 2.1955 Max. :11.439
## concen
## Min. :0.860
## 1st Qu.:1.125
## Median :1.520
## Mean :1.402
## 3rd Qu.:1.595
## Max. :2.060
Then, we format the data for Stan:
d <- list(
N = length(df_scaled$union),
y = df_scaled$union,
x = df_scaled$left
)
library(rstan)
options(mc.cores = parallel::detectCores())
rstan_options(auto_write = TRUE)
# fit model
# print model fit
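One way to fill in these two steps — a sketch, assuming the updated model with `alpha` and `beta` is saved at `stan/normal.stan`:

```r
# fit model
fit <- stan("stan/normal.stan", data = d)

# print model fit
print(fit, pars = c("alpha", "beta", "sigma"))
```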
Quick comparison to OLS/MLE:
fit_ols <- lm(union ~ left, data = df_scaled)
summary(fit_ols)
##
## Call:
## lm(formula = union ~ left, data = df_scaled)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.8204 -0.5476 -0.1897 0.5764 1.5046
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.500e-17 1.689e-01 0.000 1.00000
## left 6.780e-01 1.733e-01 3.913 0.00102 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7552 on 18 degrees of freedom
## Multiple R-squared: 0.4597, Adjusted R-squared: 0.4296
## F-statistic: 15.31 on 1 and 18 DF, p-value: 0.001019
# trace plots
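A sketch of the trace plots, assuming the fitted object is called `fit`:

```r
# trace plots for the main parameters
traceplot(fit, pars = c("alpha", "beta", "sigma"))
```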
Also look at and interpret `Rhat`, ESS (`n_eff`), and MCSE (`se_mean`) in the model output above. You can plot these too, but that’s more useful when there are more parameters.
Look at the coefficient output above and plot the coefficients below.
# coefficient plots
# pairs plot
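A sketch of both plots, assuming the fitted object is called `fit`:

```r
# coefficient plot: posterior intervals for alpha and beta
plot(fit, pars = c("alpha", "beta"))

# pairs plot to check for problematic posterior correlations
pairs(fit, pars = c("alpha", "beta", "sigma"))
```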
We don’t just care about coefficients—we want to assess how well our model fits our data. To do this, we can see what values of y would be generated using the posterior distribution of the parameters. In other words, these simulated y values take into account and carry forward the uncertainty in our parameter estimates.
You could do this in R using the posterior draws from all the coefficients… but we’re already stepping through our data and sampling parameter values, so why not generate some y values along the way?
So, the most efficient way to generate posterior predictive values is from inside Stan, using a new code block called `generated quantities`:
generated quantities {
  // simulate data from the posterior
  vector[N] y_rep;
  for (i in 1:N) {
    y_rep[i] = normal_rng(mu[i], sigma);
  }
}
Add this block to `stan/normal.stan`. When we’re interested in calculating model fit statistics, we’ll use this same block to calculate log-likelihood values.
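When we get there, the same loop can also fill in a pointwise log-likelihood vector — a sketch, assuming `mu` is the vector of fitted values:

```stan
generated quantities {
  // simulate data from the posterior
  vector[N] y_rep;
  // pointwise log-likelihood, for model comparison later (e.g., with loo)
  vector[N] log_lik;
  for (i in 1:N) {
    y_rep[i] = normal_rng(mu[i], sigma);
    log_lik[i] = normal_lpdf(y[i] | mu[i], sigma);
  }
}
```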
# recompile and refit the model
One of the features of the `bayesplot` package is a family of functions for visual posterior predictive checks, `ppc_*`. These let you plot some feature or statistic of the simulated y values against the actual data.

In some cases, we’ll be able to use all the simulated values to calculate a summary statistic. In other cases, we’ll want to plot only a subset of the `y_rep` values.
library(bayesplot)
# extract y_rep draws from the stanfit object
# plot posterior predictive density overlay
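A sketch of those two steps, assuming the refitted object is called `fit`:

```r
# extract y_rep draws from the stanfit object (one row per draw)
y_rep <- as.matrix(fit, pars = "y_rep")

# overlay densities of a subset of simulated datasets on the observed data
ppc_dens_overlay(y = d$y, yrep = y_rep[1:50, ])
```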
You can read more about posterior predictive checks in `bayesplot` here:
How would you extend the regression model to handle multiple covariates?
You could do it one-by-one: `vector[N] x1;`, `vector[N] x2;`, … but that’s not very flexible. Instead, we’ll use a design matrix:
# scale the rest of the data
df_scaled <-
df_scaled %>%
mutate(size = scale(size)[, 1],
concen = scale(concen)[, 1])
# it's a convention to use an upper-case X for a matrix of xs
X <-
df_scaled %>%
select(left, size, concen) %>%
as.matrix()
# alternatively (and this is better when you want to create dummy variables!)
# you'll want to drop the intercept using `[, -1]`
X <- model.matrix(union ~ left + size + concen, data = df_scaled)[, -1]
You’ll need to update the Stan model with these things:

- a `matrix` of `X` values
- a `vector` of `beta` parameters to estimate

As long as the x values are on the same scale, you can use the same prior for each `beta`.
Give it a try yourself, and see if you can fit the Stan model using more than one covariate!
# fit model
You can read about matrices in Stan here: https://mc-stan.org/docs/2_19/reference-manual/vector-and-matrix-data-types.html
After you’ve tinkered with the model a bit, you can find an example here: https://mc-stan.org/docs/2_19/stan-users-guide/linear-regression.html
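For comparison with your own attempt, the matrix version might look something like this — a sketch, where `K` is the number of covariates:

```stan
data {
  int N;
  int K;
  matrix[N, K] X;
  vector[N] y;
}
parameters {
  real alpha;
  vector[K] beta;
  real<lower=0> sigma;
}
model {
  alpha ~ normal(0, 10);
  beta ~ normal(0, 10);  // vectorized: the same prior for each beta
  sigma ~ exponential(1);
  y ~ normal(alpha + X * beta, sigma);
}
```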