Posterior Predictive in INLA
Mar 17, 2018
Uncelebrated Greatness
This is a general impression I get. For some reason, introductions to parametric Bayesian models often skip over some of their most powerful features.
- Online learning is easy, because old findings and new data are combined by Bayesian updating.
- Generating new data is easy.
- Quantifying the probability of a new observation is easy.
- Tracking uncertainty accurately is easy.
Experts in Bayesian modeling are so used to these features that they don’t consider them remarkable. Toolkit introductions, meanwhile, tend to walk through the steps below, and I often wind up spending more time afterward figuring out how to do the things listed above.
- Specify a model by selecting a parametric random distribution
- Fit the parameters of the model to some data
- Use a metric like DIC or WAIC to compare it to a couple other possible models
Those are important steps, but the posterior predictive distribution completes the Bayesian workflow, and Bayesian updating of the model keeps the model accurate after new data comes in.
This post checks out INLA as a tool and peeks into the use of the posterior predictive distribution. It doesn’t cover Bayesian updating.
INLA
Integrated nested Laplace approximation (INLA) is covered in a book that just came out last month, Bayesian Regression Modeling with INLA. I’m enjoying reading it. It does a great job relating different families of generalized linear model to their respective applications.
The promise of INLA is much faster inference through approximation instead of more computationally expensive methods.
I was disappointed, though, that the book recommended re-fitting the model in order to make predictions for new predictor values (by specifying NA as the result). Here I’m calling the “dependent variable” or “model output” the result, and I’m calling the “covariates” or “independent variables” the predictors.
Re-fitting the model is easy and fine for small models, but when there’s tons of data, or when the data is no longer available, or when there’s a lot of computation to do and you need to move quickly, it’s important to know that there are other possibilities.
Any Bayesian model has a “posterior predictive” distribution: the distribution for new observations that you get by integrating the likelihood over the posterior distribution of the parameters. The posterior predictive is the one that’s great for generating fake data and for determining how probable a new observation is.
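In symbols, writing $y$ for the observed data, $\theta$ for the model parameters, and $\tilde{y}$ for a new observation, the posterior predictive density is

$$
p(\tilde{y} \mid y) = \int p(\tilde{y} \mid \theta) \, p(\theta \mid y) \, d\theta
$$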
It took a while to figure out how to use the posterior predictive distribution given a fitted INLA model. I got significant assistance from the INLA help mailing list. I review what I learned below.
Data and model
For the example below, I use a simple univariate linear model with noise, but INLA handles much more complex models. Here is the “ground truth” data, representing real-world measurements.
N <- 100
x <- 1:N
y <- 5 + 4 * x + rnorm(N, sd=3)
The true parameters are the following.
- The intercept is 5.
- The coefficient of the predictor is 4.
- The standard deviation of the noise is 3.
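One small addition you could make (it isn’t in the code above): a set.seed call before the rnorm line, so the simulated “ground truth” data is reproducible from run to run.
set.seed(1)  # optional; any fixed seed makes the fake data repeatable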
Next, we fit an INLA model and check out the posterior distributions for the parameters.
library(INLA)
## Loading required package: sp
## Loading required package: Matrix
## This is INLA_17.06.20 built 2018-03-14 23:31:50 UTC.
## See www.r-inla.org/contact-us for how to get help.
m <- inla(y ~ x, data=data.frame(x=x, y=y))
summary(m)
##
## Call:
## "inla(formula = y ~ x, data = data.frame(x = x, y = y))"
##
## Time used:
## Pre-processing Running inla Post-processing Total
## 0.7702 0.1244 0.2804 1.1749
##
## Fixed effects:
## mean sd 0.025quant 0.5quant 0.975quant mode kld
## (Intercept) 4.7720 0.5813 3.6282 4.7720 5.9146 4.7720 0
## x 4.0053 0.0100 3.9856 4.0053 4.0249 4.0053 0
##
## The model has no random effects
##
## Model hyperparameters:
## mean sd 0.025quant 0.5quant
## Precision for the Gaussian observations 0.1227 0.0171 0.0913 0.1222
## 0.975quant mode
## Precision for the Gaussian observations 0.1585 0.1215
##
## Expected number of effective parameters(std dev): 2.00(0.00)
## Number of equivalent replicates : 50.00
##
## Marginal log-Likelihood: -268.32
The model summary can be interpreted as saying the following.
- The intercept has a 95% probability of falling between 3.6282 and 5.9146.
- The coefficient of the predictor variable has a 95% probability of falling between 3.9856 and 4.0249. (The model is pretty sure about this one.)
- The “noise” is characterized by precision.
INLA likes to report precision. Standard deviation is more familiar to me. The mean precision, when converted to standard deviation, is close to the true standard deviation used to generate the noise.
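Precision is the reciprocal of the variance, so the conversion is just

$$
\sigma = \frac{1}{\sqrt{\tau}}
$$

where $\tau$ is the precision.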
sqrt(1 / 0.1227)
## [1] 2.854813
There’s a convenience function in the brinla R package that supports the book. It reports standard deviation.
library(brinla)
bri.hyperpar.summary(m)
## mean sd q0.025 q0.5
## SD for the Gaussian observations 2.876217 0.2036968 2.511406 2.861232
## q0.975 mode
## SD for the Gaussian observations 3.314951 2.821355
You can also plot the density of the whole posterior distribution for the standard deviation.
bri.hyperpar.plot(m)
## Loading required package: ggplot2
It can also plot the posterior distributions of the model coefficients.
bri.fixed.plot(m)
You can plot the Bayesian Pearson residuals, which should be scattered vertically in a Gaussian way unless there’s something your model needs to take into account but isn’t.
bri.lmresid.plot(m, x, xlab="x", smooth=TRUE)
The residuals look Gaussian throughout the range of the predictor variable, which is good.
Sampling
There is a function in the INLA package that can sample from the joint posterior distribution of a fitted INLA model. It is called inla.posterior.sample. The first thing it’s likely to say to you is the error below.
sim <- inla.posterior.sample(200, m)
## Error in inla.posterior.sample(200, m): You need an inla-object computed with option 'control.compute=list(config = TRUE)'.
This time I use the option that allows sampling from the posterior.
m <- inla(y ~ x,
data=data.frame(x=x, y=y),
control.compute=list(config=TRUE))
sim <- inla.posterior.sample(N, m)
I could have used a different number of simulated samples than N, but for the plot later, it’s convenient to use the same number.
The resulting object is a data structure that I found pretty intimidating at first.
length(sim)
## [1] 100
str(sim[[1]])
## List of 3
## $ hyperpar: Named num 0.122
## ..- attr(*, "names")= chr "Precision for the Gaussian observations"
## $ latent : num [1:102, 1] 9.32 13.31 17.3 21.29 25.28 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:102] "Predictor:001" "Predictor:002" "Predictor:003" "Predictor:004" ...
## .. ..$ : chr "sample1"
## $ logdens :List of 3
## ..$ hyperpar: num 3.13
## ..$ latent : num 463
## ..$ joint : num 466
It turns out that the last entries of the one-column matrix called latent contain samples for the parameters of the model.
s <- sim[[1]]$latent
tail(s[,1], n=3)
## Predictor:100 (Intercept) x
## 404.532084 5.324567 3.992068
Using lapply, we can grab those values from each sample by name, so that we have parameters for the model sampled from the posterior distribution according to their posterior probability.
a <- unlist(lapply(sim, function(s) s$latent["(Intercept)", 1]))
b <- unlist(lapply(sim, function(s) s$latent["x", 1]))
The noise is retrieved from a hyperparameter in a similar way.
sigma <- unlist(lapply(sim,
function(s) 1/sqrt(s$hyperpar["Precision for the Gaussian observations"])))
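As a quick sanity check (this snippet isn’t from the original post), the sample means of these draws should land close to the marginal summaries INLA printed above.
# means of the posterior draws; compare to the fixed-effect and SD summaries
c(intercept=mean(a), slope=mean(b), sd=mean(sigma))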
Now we can generate data from the posterior predictive distribution. It ought to look like the original “ground truth” data.
y.sim <- a + b * x + rnorm(N, sd=sigma)
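The line above pairs the i-th posterior draw with the i-th value of x, which is enough for one fake dataset. If you wanted a whole replicate dataset for every posterior draw, a sketch along these lines (not from the original post) would do it:
# one row per observation, one column per posterior draw
y.rep <- sapply(seq_along(sim),
                function(i) a[i] + b[i] * x + rnorm(N, sd=sigma[i]))
dim(y.rep)  # N rows by length(sim) columns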
I use dplyr::bind_rows to put together the original data and the generated data, to check that they look about the same.
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
d <- bind_rows(list(original=data.frame(x=x, y=y),
generated=data.frame(x=x, y=y.sim)),
.id="data.type")
ggplot(d, aes(x, y, color=data.type)) + geom_point()
This post is getting long, so I won’t use the posterior predictive to quantify the probability of new values of the dependent variable given associated predictors.
Trickery
One of the authors of Bayesian Regression Modeling with INLA, Julian Faraway, suggests that sampling is at odds with the spirit of INLA, an approximation method.
He creates a “random effect” variable in the model and uses that instead of the precision hyperparameter we used above. I do understand that step, because the book has many examples like it.
I am not yet sure, though, when it’s possible to use this approximation of the posterior predictive and when it breaks down.
Here is a link to the page; searching for “trickery” in your browser will take you to the part I’m talking about.