Data Science and Statistics: Monte Carlo Simulation Using JAGS

Introduction to Monte Carlo simulation using JAGS

This practical gives an introduction to the Monte Carlo method using JAGS (Just Another Gibbs Sampler). First we have to install some R packages that act as front ends for JAGS:

  • R2jags allows fitting JAGS models from within R (depends on rjags);
  • coda provides functions for output analysis and diagnostics for MCMC;
  • lattice provides functions for visualisation of an MCMC object;
  • MCMCvis is an R package used to visualise, manipulate, and summarise MCMC output.

```{r eval=FALSE}
install.packages('R2jags', dependencies = TRUE) # installs coda too

install.packages('coda', dependencies = TRUE)

install.packages('lattice', dependencies = TRUE)
install.packages('MCMCvis', dependencies = TRUE)
```

```{r message=FALSE, warning=FALSE}
library(R2jags)
library(MCMCvis)
library(coda)
library(lattice)
```
Fitting a Bayesian model using R2jags

The main steps of fitting a Bayesian model using R2jags are the following:

  1. Define the JAGS model (as a function);
  2. Prepare the data, which should be passed on as a list. Note that this list should include other known variables not in the data, like the length N of the data vector, when this N is used in the JAGS model definition (e.g. in a for loop);
  3. Identify the parameters whose posterior distributions are of interest;
  4. Define starting values for JAGS. We need starting values for all the stochastic nodes (unknown parameters, missing data) in each chain. Initial values should be passed on as a list. Note that JAGS runs without user-defined starting values, in which case it generates its own initial values. But this is not recommended, especially for complex models;
  5. Fit the model in JAGS.
    The most important arguments of the `jags` function include:
     *data* (data list),
     *inits* (list of initial values),
     *parameters.to.save* (parameters of interest),
     *n.chains* (number of MC chains - each gives a sequence of simulated values),
     *n.iter* (number of iterations in the sampling process),
     *n.burnin* (number of initial samples to discard),
     *model.file* (the JAGS model definition);
    
  6. Convergence diagnostics. Update if convergence hasn't been reached;
  7. Once we are satisfied that the chains have converged, we can start the statistical inference. E.g. we can plot the posterior density, get point estimates and interval estimates by extracting the posterior mean and credible intervals.
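Putting these steps together, a minimal R2jags workflow looks roughly like the sketch below (the model, data and parameter names here are placeholders for illustration, not part of the examples that follow):

```{r eval=FALSE}
# a minimal sketch of the R2jags workflow (placeholder model and names)
jags.model.example <- function(){
    theta ~ dbeta(1, 1)     # prior
    for (i in 1:N){
        y[i] ~ dbern(theta) # likelihood
    }
}

example.fit <- jags(
    data = list(y = c(0, 1, 1, 0, 1), N = 5), # step 2: data list
    inits = list(list(theta = 0.5)),          # step 4: one list per chain
    parameters.to.save = c('theta'),          # step 3: parameters of interest
    n.chains = 1,
    n.iter = 1000,
    model.file = jags.model.example           # step 1: the model function
)
print(example.fit) # step 7: inference
```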

To demonstrate the structure of a Bayesian model fit using R2jags we will consider some examples with increasing complexity.
For reproducibility we should always set the seed, e.g. `set.seed(45621)`.

Coin example

Suppose we want to know the probability of getting 8 or more heads when we toss a fair coin 10 times.
When we toss a fair coin 10 times, the number of heads Y follows a binomial distribution with parameters n=10 and p=0.5. We are interested in $P(Y \geq 8)$.
First we represent this model using the BUGS language. Recall that `~` is used to represent stochastic dependence, and `<-` is used to represent logical dependence.

```{r include=TRUE}
# model definition
jags.model.coin <- function(){
    Y ~ dbin(0.5, 10) # our data model
    P8 <- ifelse(Y > 7, 1, 0) # indicator of the event of interest
}
```


For each iteration a sample from $\mathrm{Binom}(10, 0.5)$ is drawn, and it is checked whether this sample is strictly greater than 7 (since the distribution is discrete, this is equivalent to $\geq 8$). For example, a run of length five could give us

$$Y = (5,5,8,4,9)$$
$$P8 = (0,0,1,0,1)$$

In this simple example we have no data, and since this is Monte Carlo sampling from a known distribution, there is no real need to provide initial values (although JAGS will still generate them itself). For the same reason we don't have to worry about convergence either (see later).
We will monitor both parameters (Y and P8). This can be specified separately or directly in the jags function. We will generate 100 samples without discarding any, using one chain.

```{r message=FALSE, include=FALSE, results='hide'}
jags.model.fit.coin <- jags(
    data = list(),
    model.file = jags.model.coin,
    parameters.to.save = c('Y','P8'),
    n.chains = 1,
    DIC = FALSE,
    n.burnin = 0,
    n.iter = 100
)
```

Note that the only reason we need `DIC = FALSE` in the above is that our model here doesn't contain a likelihood.
To get the numerical summary of the above model run we use the `print` function.

```{r message=TRUE, warning=FALSE}
print(jags.model.fit.coin)
```


This gives us the mean, standard deviation and some quantiles of the generated samples (along with some other information) for all the monitored parameters. *What is your point estimate for* $P(Y \geq 8)$?
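As a sanity check, this simple example can also be solved exactly: the tail probability of a $\mathrm{Binom}(10, 0.5)$ distribution is available in closed form in R.

```{r}
# exact probability P(Y >= 8) for comparison with the Monte Carlo estimate
1 - pbinom(7, size = 10, prob = 0.5) # = 56/1024, approximately 0.055
```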
We can also plot the simulated samples using the `traceplot` function.

```{r}
traceplot(jags.model.fit.coin)
```

For more diagnostics and visualisation we can convert the jags output into an MCMC object. Using the MCMC object we can look at numerical summaries, trace plots and density plots.

```{r}
# convert into MCMC object
jags.fit.mcmc.coin <- as.mcmc(jags.model.fit.coin)

# get numerical summary
summary(jags.fit.mcmc.coin)

# get traceplots
xyplot(jags.fit.mcmc.coin)

# get density estimate
densityplot(jags.fit.mcmc.coin)
```


The functions `xyplot` and `densityplot` of the lattice package give us trace plots and density plots of all the parameters. If we want to concentrate on individual parameters, we can use the `MCMCtrace` function of the MCMCvis package.

```{r}
MCMCtrace(jags.fit.mcmc.coin,
          params = 'Y', # parameters of interest
          type = 'density', # density plot
          ind = TRUE, # separate density lines for each chain
          pdf = FALSE # plots are NOT exported into a pdf
          )

MCMCtrace(jags.fit.mcmc.coin,
          params = 'P8',
          type = 'trace',
          ind = TRUE,
          pdf = FALSE
          )
```


Note that the above outputs can also be generated manually using the simulated samples. These can be extracted from the jags object, e.g. `jags.model.fit.coin$BUGSoutput$sims.list$Y` gives us the 100 binomial samples generated by the model fit.
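For instance, the Monte Carlo estimate of $P(Y \geq 8)$ can be recomputed by hand from the extracted samples (a short sketch, assuming the fit object from above):

```{r}
# recompute the estimate manually from the raw samples
Y.samples <- jags.model.fit.coin$BUGSoutput$sims.list$Y
mean(Y.samples >= 8) # should match the posterior mean reported for P8
```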
We can also extract summaries of interest. For example, we can get the point estimate and the two endpoints of a 95% credible interval using the following.

```{r}
jags.model.fit.coin$BUGSoutput$summary[,1] # mean
jags.model.fit.coin$BUGSoutput$summary[,3] # 2.5% quantile
jags.model.fit.coin$BUGSoutput$summary[,7] # 97.5% quantile
```


Exercise

  1. Increase the number of iterations when fitting the model. What effect will this have on your point estimate and why?

  2. Produce output using 2 or 3 chains with varying number of iterations. How similar are the outputs produced by the different chains? Note that using multiple chains is crucial when assessing chain convergence.

  3. By editing the model function, find the probability that a clinical trial with 30 subjects, each with probability 0.7 of response, will show 15 or fewer responses.

  4. By writing your own model find the expectation of the cube of a normally distributed random variable with mean 1 and standard deviation 2. Note that if only one node is monitored you might have to use `print(jags.model.fit.coin[drop=F])` to print the summary.

Solutions

1. Increase the number of iterations when fitting the model. What effect will this have on your point estimate and why?

We can improve the accuracy of the point estimate by taking a larger sample size, say `n.iter=10000`.

2. Produce output using 2 or 3 chains with varying number of iterations. How similar are the outputs produced by the different chains? Note that using multiple chains is crucial when assessing chain convergence.

For small sample sizes the estimates produced by the two chains don't match up as well as they do for a large sample size.

3. By editing the model function, find the probability that a clinical trial with 30 subjects, each with probability 0.7 of response, will show 15 or fewer responses.

```{r}
jags.model.coin2 <- function(){
    Y ~ dbin(0.7, 30) # our data model
    P15 <- ifelse(Y <= 15, 1, 0) # the probability of interest
}

jags.model.fit.coin2 <- jags(
    data = list(),
    model.file = jags.model.coin2,
    parameters.to.save = c('Y','P15'),
    n.chains = 1,
    DIC = FALSE,
    n.burnin = 0,
    n.iter = 10000
)

print(jags.model.fit.coin2[drop=F]) # look at the mean of P15
```

4. By writing your own model find the expectation of the cube of a normally distributed random variable with mean 1 and standard deviation 2. Note that if only one node is monitored you might have to use `print(jags.model.fit.normal1[drop=F])` to print the summary.

```{r}
jags.model.normal1 <- function(){
    y ~ dnorm(1, 0.25) # JAGS parameterises dnorm by precision: 0.25 = 1/2^2, i.e. sd = 2
    x <- y^3
}

jags.model.fit.normal1 <- jags(
    data = list(),
    model.file = jags.model.normal1,
    parameters.to.save = c('y','x'),
    n.chains = 1,
    DIC = FALSE,
    n.burnin = 0,
    n.iter = 10000
)

print(jags.model.fit.normal1[drop=F]) # look at the mean of x
```


# Drug example without data

Consider a drug to be given for relief of chronic pain. We want to know the probability that 15 or more patients out of 20 will experience a positive response.
Even though the response rate of the drug is unknown, experience with similar compounds has suggested that a response rate between 0.2 and 0.6 could be feasible.
Note that in the Bayesian world the response rate, as an unknown parameter, is treated as a random variable, and needs a prior when we are defining the model.
Since we know that a response rate between 0.2 and 0.6 is feasible, we can interpret this as a distribution with mean 0.4 and standard deviation 0.1, and thus suggest a Beta(9.2, 13.8) prior for the response rate (which also has to be between 0 and 1).
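This choice of prior can be verified directly, since the mean and standard deviation of a Beta(a, b) distribution are available in closed form:

```{r}
# check that Beta(9.2, 13.8) has mean 0.4 and standard deviation 0.1
a <- 9.2; b <- 13.8
a / (a + b) # mean = 0.4
sqrt(a * b / ((a + b)^2 * (a + b + 1))) # sd = 0.1
```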
The model definition has to include the data model, the priors for the unknown parameters, and any other quantities of interest. Thus we have the following.

```{r}
### Drug example in JAGS
# model
jags.model.drug <- function(){
    theta ~ dbeta(9.2, 13.8) # prior for the unknown parameter
    y ~ dbin(theta, 20) # the data model
    P.crit <- ifelse(y >= 15, 1, 0)
}
```

Even though we don't have any available data, monitoring the parameters y and P.crit will allow us to plot the predictive distribution of y and get a point estimate for the predictive probability that 15 or more patients will experience a positive response.
To build the distribution we set the number of iterations to 10000, which will provide a fairly large sample size.

```{r include=FALSE}
jags.model.fit.drug <- jags(
    data = list(),
    n.iter = 10000,
    DIC = FALSE,
    parameters.to.save = c('theta','y','P.crit'),
    model.file = jags.model.drug,
    n.chains = 1
)
```


The node *theta* will just give us samples from the known Beta(9.2, 13.8) distribution, y will give us samples from the predictive distribution, while the 0-1 indicator P.crit provides the estimated tail-area probability.
Once the model is fitted we can look at the outcome. The *print* function gives us some numerical summaries, including point estimates, while density plots can be used to visualise the distribution of the monitored parameters. What is the predictive probability of having 15 or more responses?

```{r}
# look at the outcome
print(jags.model.fit.drug)

# traceplot
traceplot(jags.model.fit.drug)

# convert into MCMC object for more visualisation tools
jags.fit.mcmc.drug <- as.mcmc(jags.model.fit.drug)
# get density plots of the monitored nodes
densityplot(jags.fit.mcmc.drug)
```


Since the predictive distribution is discrete in this case, we might also want to look at histograms. The `jags.hist` function of the jagsplot package can be used to create histograms of the R2jags output.

```{r eval=FALSE}
# install the jagsplot package
remotes::install_github('mbtyers/jagsplot')
```

```{r}
library(jagsplot)

# plot histogram of the discrete distribution
jags.hist(jags.model.fit.drug, which.param = c('y', 'P.crit'))
```

The package has other visualisation tools too, e.g. `jags.dens` creates density plots of selected parameters of an R2jags object.
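For example (a sketch; the `which.param` argument is assumed to behave as it does for `jags.hist`):

```{r}
# density plot of a selected parameter
jags.dens(jags.model.fit.drug, which.param = c('theta'))
```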


Exercise

  1. Edit the model code to specify a Uniform(0,1) prior on the response rate theta, and re-run the analysis. Plot the predictive distribution for the number of successes. What is now the predictive probability that 15 or more out of 20 new patients will experience a positive response?
```{r}
jags.model.drug2 <- function(){
    theta ~ dunif(0, 1) # prior for the unknown parameter
    y ~ dbin(theta, 20) # the data model
    P.crit <- ifelse(y >= 15, 1, 0) # 15 or more responses
}

jags.model.fit.drug2 <- jags(
    data = list(),
    n.iter = 10000,
    DIC = FALSE,
    parameters.to.save = c('theta', 'y', 'P.crit'),
    model.file = jags.model.drug2,
    n.chains = 1
)

print(jags.model.fit.drug2) # the mean of P.crit gives the required probability

# plot the predictive distribution
jags.hist(jags.model.fit.drug2, which.param = c('y'))
```


Inference on proportions - Drug example with data

In the previous example we considered the early investigation of a new drug, where we assumed that the response rate has a Beta(9.2, 13.8) prior distribution.
Now suppose that upon treating 20 volunteers with the compound we observe $r=15$ responses. This is our data.
As a next step, we would consider continuing the development programme if the drug managed to achieve at least a further 25 successes out of 40 future patients.
Thus, we are interested in the posterior response rate, and want to know the probability that we can continue the drug development programme.
For this we add an extra node to the model definition. This represents the future trial that we are trying to predict. The model in this case combines the prior on the response rate with the information from the data (captured by the likelihood) to get a posterior response rate, and then uses this posterior to make future predictions. In particular, we first find the posterior predictive distribution for 40 future patients, then use it to find the probability of at least 25 responses out of 40 in this future trial.

```{r}
# model
jags.model.drug3 <- function(){
    theta ~ dbeta(a, b) # prior
    r ~ dbin(theta, n) # likelihood
    r.pred ~ dbin(theta, m) # predictive distribution
    P.crit <- ifelse(r.pred >= ncrit, 1, 0) # probability of interest
}
```

Note that in this case r is known (this is the data - not a random object), while r.pred is unknown.
In the previous example the model definition included the values of all the known parameters. A more flexible approach is to use dummy variables, and then pass their known values on to jags as part of the data list. Thus the jags.data variable below not only contains our data r=15, but also all the known parameters in our model definition.

```{r}
# data
a <- 9.2
b <- 13.8
r <- 15 # responses
n <- 20 # number of patients
m <- 40 # number of future patients
ncrit <- 25 # threshold for future trial

jags.data.drug3 <- list('a', 'b', 'r', 'n', 'm', 'ncrit')
```
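Equivalently, the known values can be passed directly as a named list; both forms are accepted by the `jags` function:

```{r eval=FALSE}
# the same data as a named list
jags.data.drug3 <- list(a = 9.2, b = 13.8, r = 15, n = 20, m = 40, ncrit = 25)
```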

Next we choose the parameters we want to monitor: theta will give us the posterior response rate, r.pred will give the posterior predictive distribution, while P.crit gives the probability of interest. (Again, r is known, thus monitoring this node would give us no extra information.)
The stochastic nodes need initial values, thus we choose some feasible values for theta and r.pred, this time for two chains. Since the sampling is from an unknown distribution, convergence has to be assessed; this is the reason we need at least two chains with distinct initial values. Note, however, that we won't assess convergence here; this is covered later.

```{r}
# parameters we want to monitor
jags.model.param.drug3 <- c('theta', 'r.pred', 'P.crit')

# specify initial values
jags.inits1 <- list('theta' = 0.7, 'r.pred' = 20)
jags.inits2 <- list('theta' = 0.5, 'r.pred' = 28)
jags.inits.drug3 <- list(jags.inits1, jags.inits2)
```

Finally we fit the model (using two chains this time).

```{r include=FALSE}
jags.model.fit.drug3 <- jags(
    data = jags.data.drug3,
    inits = jags.inits.drug3,
    parameters.to.save = jags.model.param.drug3,
    n.chains = 2,
    n.iter = 10000,
    model.file = jags.model.drug3
)
```


In order to produce summary statistics for all the variables we have monitored, and to provide point and interval estimates for the treatment response rate, we can use the `print` function.

```{r}
print(jags.model.fit.drug3)
```
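Since the Beta prior is conjugate to the binomial likelihood, the posterior for theta is available in closed form and provides a sanity check: with $r = 15$ successes out of $n = 20$, the posterior is Beta(9.2 + 15, 13.8 + 5).

```{r}
# analytic posterior mean of theta for comparison with the MCMC estimate
(9.2 + 15) / (9.2 + 15 + 13.8 + 5) # approximately 0.563
```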


Exercise

  1. Produce kernel density plots of the posterior distribution for theta and the predictive distribution for r.pred.

  2. Compare the model code with the code for the drug example without data. Make sure you understand the difference between the two analyses.

  3. Edit the model code to specify a Uniform(0,1) prior on the response rate theta (or equivalently, a Beta(1,1) prior), and re-run the analysis. How is the posterior estimate of theta affected? How is the probability that 25 or more patients will experience a positive response out of 40 new patients affected?

# Solutions

## 1.
```{r}
# kernel density plot for theta
jags.dens(jags.model.fit.drug3)
# predictive distribution for r.pred
jags.hist(jags.model.fit.drug3, which.param = c('r.pred'))
```

## 2.

## 3. Use a=1 and b=1 and re-run the previous code
```{r}
# data
a <- 1
b <- 1
r <- 15 # responses
n <- 20 # number of patients
m <- 40 # number of future patients
ncrit <- 25 # threshold for future trial

jags.data2.drug4 <- list('a', 'b', 'r', 'n', 'm', 'ncrit')

jags.model.fit.drug4 <- jags(
    data = jags.data2.drug4,
    inits = jags.inits.drug3,
    parameters.to.save = jags.model.param.drug3,
    n.chains = 2,
    n.iter = 10000,
    model.file = jags.model.drug3
)

print(jags.model.fit.drug4)
```


Inference using the normal distribution - THM example

Regional water companies in the UK are required to take routine measurements of trihalomethane (THM) concentrations in tap water samples for regulatory purposes.

We want to estimate the average THM concentration in a particular water zone, z. In particular we want to know the probability that the THM levels in the water supply exceed $145\mu g/l$.
Suppose we know that the assay measurement error has a standard deviation of $5\mu g/l$. This means that we can model the THM concentration using a normal distribution with unknown mean and standard deviation $5\mu g/l$.
To estimate the mean THM concentration, two independent measurements were taken, and their mean was found to be $130\mu g/l$. This is our data. Note however that we will need to specify values for both THM observations when fitting the model, so rather than just inputting their mean as the data, we will enter both values as 130.
Since the mean of the THM concentration is an unknown parameter, we also need to specify a prior for this. Historical data on THM levels in other zones supplied from the same source showed that the mean THM concentration was $120\mu g/l$ with standard deviation $10\mu g/l$, which suggests a $N(120, 10^2)$ prior.
Combining all the information gives the following model definition and data list. Note that for logical nodes we can take advantage of the vectorised way R does computations, but in the case of stochastic dependence we have to use a for loop.

```{r}
# model
jags.model.thm <- function(){
    theta ~ dnorm(120, 1/100) # prior (JAGS uses precision: 1/100 = 1/10^2)
    # We have N=2 observations, hence the for loop
    for (i in 1:N){
        y[i] ~ dnorm(theta, 1/(5^2)) # likelihood
    }
    ypred ~ dnorm(theta, 1/(5^2)) # predictive distribution
}

# data (note that some known parameters were included in the model definition)
N <- 2
y <- c(130, 130)
thm.data <- list('N', 'y')
```

Exercise

  1. Using what you've learnt so far finish the analysis of this example. In particular, edit the model definition and fit the model, so that you can answer the following.
    (a) Calculate the probability that the zone mean THM concentration exceeds $130\mu g/l$. (For this you will have to add an extra line to the model definition.)
    (b) Calculate the probability that a future THM concentration measured in the zone exceeds $130\mu g/l$.

  2. Now try using a non-informative prior on the unknown mean THM concentration (assume either a uniform prior with a wide range, or a normal prior with a large variance (small precision)). Repeat your analysis.

# Solution

## 1. Add the following to the model definition:
```{r}
jags.model.thm <- function(){
    theta ~ dnorm(120, 1/100) # prior
    # We have N=2 observations, hence the for loop
    for (i in 1:N){
        y[i] ~ dnorm(theta, 1/(5^2)) # likelihood
    }
    ypred ~ dnorm(theta, 1/(5^2)) # predictive distribution
    # indicator: a future THM measurement exceeds 130
    pred.crit <- ifelse(ypred > 130, 1, 0)
    # indicator: the current mean THM level exceeds 130
    theta.crit <- ifelse(theta > 130, 1, 0)
}
```

```{r}
thm.param <- c('theta', 'ypred', 'pred.crit', 'theta.crit')

# initial values: random draws for the stochastic nodes
thm.inits <- function(){
    list('theta' = rnorm(1, 100, 5), 'ypred' = rnorm(1, 120, 5))
}

# fit the model
thm.model.fit <- jags(
    data = thm.data,
    inits = thm.inits,
    parameters.to.save = thm.param,
    n.chains = 2,
    n.iter = 10000,
    n.burnin = 0,
    model.file = jags.model.thm
)

# the means of theta.crit and pred.crit give the quantities of interest
print(thm.model.fit)
```
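As a check, with a conjugate normal prior and known measurement variance the posterior is also available in closed form, so the probability that the zone mean exceeds 130 can be computed directly (a sketch using the numbers above):

```{r}
# closed-form posterior for a normal mean with known variance
prior.prec <- 1/10^2 # prior precision
data.prec <- 2/5^2 # two observations with sd 5 each
post.prec <- prior.prec + data.prec # posterior precision = 0.09
post.mean <- (prior.prec*120 + data.prec*130)/post.prec # approximately 128.9
1 - pnorm(130, post.mean, sqrt(1/post.prec)) # P(theta > 130), approximately 0.37
```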


## 2. For example we can use the prior `theta ~ dunif(0, 10^6)` and re-run the code above.

Inference using count data - Flying bombs

Exercise: From scratch, implement the following problem.
Data below are the numbers of flying bomb hits on London during World War II in a $36\,km^2$ area of South London. (The area was partitioned into $0.25\,km^2$ grid squares and the number of bombs falling in each square was counted.)

| Hits, $x$ | 0 | 1 | 2 | 3 | 4 | 7 |
|---|---|---|---|---|---|---|
| Number of areas, $n$ | 229 | 211 | 93 | 35 | 7 | 1 |

(Total hits $\sum_i n_i x_i = 537$, and total number of areas $\sum_i n_i = 576$.)
If the hits are assumed to be random, a Poisson distribution with some unknown hit rate $\theta$ should fit the data. We use the dgamma(0.5, 0.00001) distribution as prior for the mean $\theta$ of the Poisson distribution (this is an approximation of the improper Gamma(0.5, 0) distribution); in the code below this rate is called lambda.
Find a point estimate for the unknown parameter and also plot its posterior distribution.

```{r}
set.seed(123)
# model
jags.model.bomb <- function(){
    # prior (approximates the improper Gamma(0.5, 0))
    lambda ~ dgamma(0.5, 0.00001)
    # We have N=576 observations, hence the for loop
    for (i in 1:N){
        y[i] ~ dpois(lambda) # likelihood
    }
}

# data
y <- c(rep(0,229), rep(1,211), rep(2,93), rep(3,35), rep(4,7), rep(7,1))
N <- length(y)
bomb.data <- list('N', 'y')

# initial values (for 2 chains)
bomb.inits1 <- list('lambda' = 0.8)
bomb.inits2 <- list('lambda' = 1)
bomb.inits <- list(bomb.inits1, bomb.inits2)

# parameters to monitor
jags.param <- c('lambda')

# fit model
jags.model.fit.bomb <- jags(
    data = bomb.data,
    inits = bomb.inits,
    parameters.to.save = jags.param,
    n.chains = 2,
    n.iter = 10000,
    model.file = jags.model.bomb
)

# look at numerical summary
print(jags.model.fit.bomb)
# point estimate for the parameter is approximately 0.933

# plot posterior density
bomb.mcmc <- as.mcmc(jags.model.fit.bomb)
MCMCtrace(
    bomb.mcmc,
    params = 'lambda',
    type = 'density',
    ind = TRUE,
    pdf = FALSE
)
```
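Since the Gamma prior is conjugate to the Poisson likelihood, the posterior here is Gamma(0.5 + 537, 0.00001 + 576), which gives a closed-form check on the MCMC estimate:

```{r}
# analytic posterior mean of the hit rate
(0.5 + 537) / (0.00001 + 576) # approximately 0.933
```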

