By Sheng BI
library(magrittr); library(MASS);
library(ggplot2); library(microbenchmark); library(gridExtra);
library(truncnorm);
Bayesian binary model setup:
$$\begin{array}{ccccc} y_{i} & \sim & \text{Bernoulli}(\mu_{i})\\ \mu_{i} & = & g^{-1}(\eta_{i}) & , & \eta_{i}=x_{i}\beta\\ \beta & \sim & \text{Normal}(\cdot,\cdot) \end{array}$$
In the Bayesian context, the dependent variable $y_{i}$ is always modeled as a random quantity, since the regression parameters that generate it are themselves unknown.
Here, $\mu_{i}$ is the probability that $y_{i}$ equals 1, and $g(\cdot)$ is a link function whose inverse maps $x_{i}\beta$ to a value in the interval $[0,1]$.
Now, let us introduce the latent variable $z_{i}=x_{i}\beta+\epsilon_{i}$, $\epsilon_{i}\sim\text{Normal}(0,\sigma^{2})$, where the $\epsilon_{i}$'s are i.i.d. and $\sigma^{2}$ is assumed to be known.
Having defined the latent variable $z_i$, we see that $y_i$ can be recovered from $z_i$ exactly:
$$ y_{i}=\begin{cases} 1 & z_{i}>0\\ 0 & z_{i}\leq0 \end{cases}; $$
in other words, $y_{i}$ becomes deterministic given $z_{i}$, while the $z_{i}$'s themselves remain unknown (random).
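To make the connection with the probit model explicit, the success probability implied by the latent variable is
$$ \text{Pr}(y_{i}=1|x_{i},\beta)=\text{Pr}(z_{i}>0)=\text{Pr}\!\left(\frac{\epsilon_{i}}{\sigma}>-\frac{x_{i}\beta}{\sigma}\right)=1-\Phi\!\left(-\frac{x_{i}\beta}{\sigma}\right)=\Phi\!\left(\frac{x_{i}\beta}{\sigma}\right), $$
so the inverse link is the normal CDF, and $\sigma=1$ gives the standard probit model.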
.....
The likelihood of the model is
$$ \begin{array}{ccl} L(\beta|Y,X) & = & \prod_{i=1}^{N}\text{Pr}(y_{i}=1|x_{i},\beta)^{y_{i}}\,\text{Pr}(y_{i}=0|x_{i},\beta)^{1-y_{i}}\\ & = & \prod_{i=1}^{N}\text{Pr}(x_{i}\beta+\epsilon_{i}>0)^{y_{i}}\,\text{Pr}(x_{i}\beta+\epsilon_{i}\leq0)^{1-y_{i}}\\ & = & \prod_{i=1}^{N}\left[1-\Phi(-x_{i}\beta/\sigma)\right]^{y_{i}}\left[\Phi(-x_{i}\beta/\sigma)\right]^{1-y_{i}}. \end{array} $$
Note that $\Phi(-x_{i}\beta/\sigma)=1-\Phi(x_{i}\beta/\sigma)$, and that $\sigma=1$ recovers the standard probit likelihood.
.....
The assumed prior for $\beta$: $\beta\sim\text{Normal}(\underline{\beta},\underline{V})$.
..... Using the information carried by the $z_{i}$'s, the conditional posterior of $\beta$ is $\beta|Y,Z,X\sim\text{Normal}(\overline{\beta},\overline{V})$, and we Gibbs-sample from $p(Z|\beta,Y,X)$ and $p(\beta|Y,Z,X)$ repeatedly,
with $\overline{V}=(\underline{V}^{-1}+X^{T}X/\sigma^{2})^{-1}$
and $\overline{\beta}=\overline{V}(\underline{V}^{-1}\underline{\beta}+X^{T}Z/\sigma^{2})$.
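These expressions follow from standard Normal-Normal updating: conditional on $Z$, the relation $Z=X\beta+\epsilon$ is an ordinary linear regression with known error variance $\sigma^{2}$, so completing the square gives
$$ p(\beta|Y,Z,X)\;\propto\;\exp\left\{ -\tfrac{1}{2}(\beta-\underline{\beta})^{T}\underline{V}^{-1}(\beta-\underline{\beta})\right\} \exp\left\{ -\tfrac{1}{2\sigma^{2}}(Z-X\beta)^{T}(Z-X\beta)\right\} \;\propto\;\exp\left\{ -\tfrac{1}{2}(\beta-\overline{\beta})^{T}\overline{V}^{-1}(\beta-\overline{\beta})\right\}, $$
which is the kernel of a $\text{Normal}(\overline{\beta},\overline{V})$ density.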
Finally, it is important to notice that
$$\begin{array}{ccl} p(\beta|Y,X) & = & \int p(\beta,Z|Y,X)\,dZ\\ & \propto & p(\beta)\times L(\beta|Y,X), \end{array}$$
which means the marginal posterior of $\beta$ is exactly the one we would obtain without introducing the latent variable $Z$, so our procedure is justified.
The algorithm proceeds in two steps:
1. Draw $z_{1},\dots,z_{n}$ independently, with $z_{i}\sim\text{TN}(x_{i}\beta,\sigma^{2},y_{i})$, the normal distribution truncated on the side determined by $y_{i}$.
2. Draw $\beta\sim\text{N}(\overline{\beta},\overline{V})$; under a flat prior this reduces to $\beta\sim\text{N}(\hat{\beta}(Z),\sigma^{2}(X^{T}X)^{-1})$ with $\hat{\beta}(Z)=(X^{T}X)^{-1}X^{T}Z$.
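Step 1 rests on the full conditional of each latent variable: given $\beta$ and the observed $y_{i}$,
$$ p(z_{i}|\beta,y_{i},x_{i})\;\propto\;\phi\!\left(\frac{z_{i}-x_{i}\beta}{\sigma}\right)\left[\mathbf{1}(z_{i}>0)\right]^{y_{i}}\left[\mathbf{1}(z_{i}\leq0)\right]^{1-y_{i}}, $$
that is, a $\text{Normal}(x_{i}\beta,\sigma^{2})$ distribution truncated to $(0,\infty)$ when $y_{i}=1$ and to $(-\infty,0]$ when $y_{i}=0$.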
My data generating process is
$z_{i}=1+2x_{1}+3x_{2}+4x_{3}+5x_{4}+\epsilon_{i}$, where $\epsilon_{i}\sim\text{N}(0,3)$
$y_{i}=1$ if $z_{i}\geq0$, and $y_{i}=0$ if $z_{i}<0$
## Data generation process.
set.seed(2019)
dgp <- function() {
  ####### Data Generating Process
  N <- 1000;                  ## nb. of observations
  k <- 5;                     ## nb. of regression parameters
  x1.m <- 0; x1.sd <- 1;
  x2.m <- 0; x2.sd <- 1;
  x3.m <- 0; x3.sd <- 1;
  x4.m <- 0; x4.sd <- 1;
  x1 <- rnorm(n = N, mean = x1.m, sd = x1.sd);  ## feature 1
  x2 <- rnorm(n = N, mean = x2.m, sd = x2.sd);  ## feature 2
  x3 <- rnorm(n = N, mean = x3.m, sd = x3.sd);  ## feature 3
  x4 <- rnorm(n = N, mean = x4.m, sd = x4.sd);  ## feature 4
  # Input data matrix (explanatory variables; first column is the intercept)
  X <- matrix(cbind(rep(1, N), x1, x2, x3, x4), ncol = k);
  # True values of the unknown parameters
  btrue <- c(1, 2, 3, 4, 5);  ## regression coefficients
  sigmaY <- diag(3, N);       ## error covariance: iid N(0, 3)
  # True form of the causal relationship
  lat <- X %*% matrix(btrue, k, 1) + mvrnorm(1, mu = rep(0, N), Sigma = sigmaY);
  Y <- lat; Y[, 1] <- ifelse(lat[, 1] >= 0, 1, 0);
  result <- list();
  result$X <- X; result$Y <- Y;
  result$N <- N; result$k <- k; result$sig <- sigmaY;
  result$btrue <- btrue; result$lat <- lat;
  return(result);
}
myData <- dgp();
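As a quick sanity check on the simulated data (an illustrative snippet, not part of the original pipeline), we can inspect the class balance and a few empirical moments:
## quick checks on the simulated data
table(myData$Y)                 ## counts of y = 0 and y = 1
colMeans(myData$X)              ## feature means (first column is the intercept)
summary(as.vector(myData$lat))  ## distribution of the latent variable z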
## draw from truncated normal using rtruncnorm
rtn <- function(row, s = myData$sig[1, 1]) {
  ## row[1] = mean x_i * beta, row[2] = observed y_i (selects the truncation side)
  if (row[2] >= 0.5) {
    result <- rtruncnorm(n = 1, a = 0, b = Inf, mean = row[1], sd = sqrt(s));
  } else {
    result <- rtruncnorm(n = 1, a = -Inf, b = 0, mean = row[1], sd = sqrt(s));
  }
  return(result)
}
## draw from truncated normal using inverse method.
rtn2 <- function(row, s = myData$sig[1, 1]) {
  ## inverse-CDF sampling from N(row[1], s) truncated by the sign constraint implied by row[2]
  u <- runif(1, 0, 1)
  v <- -row[1] / sqrt(s)   ## standardised truncation point
  if (row[2] >= 0.5) {
    result <- row[1] + sqrt(s) * qnorm(pnorm(v) + u * (1 - pnorm(v)))  ## truncated to (0, Inf)
  } else {
    result <- row[1] + sqrt(s) * qnorm(u * pnorm(v))                   ## truncated to (-Inf, 0]
  }
  return(result)
}
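Since microbenchmark is already loaded, we can compare the two samplers on a single draw; the test row c(0.5, 1) (mean 0.5, y = 1) is an arbitrary choice for illustration:
## timing comparison of the two truncated-normal samplers (illustrative only)
set.seed(1)
microbenchmark(rtruncnorm.based = rtn(c(0.5, 1)),
               inverse.cdf      = rtn2(c(0.5, 1)),
               times = 1000)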
Let us first examine the results of OLS, GLM.Probit and GLM.Logit.
my.ols <- lm(Y~X-1, data = myData)
summary(my.ols)
my.probit <- glm(Y~X-1, data = myData, family = binomial(link = 'probit'))
summary(my.probit)
my.logit <- glm(Y~X-1, data = myData, family = binomial(link = 'logit'))
summary(my.logit)
This is a nonlinear model, so the OLS linear probability fit simply gets the functional form wrong. Recall that (1) OLS produces a weighted average of the $y_i$'s, and (2) a weighted average is like finding the balance point, analogous to the center of mass in physics.
Relative to the true coefficients, the probit estimates tend to be too small, while the logit estimates tend to be too large; the deviance nevertheless shows that probit fits better than logit overall. Part of the probit shrinkage is a scale effect: glm's probit link assumes a unit latent error variance, so it estimates $\beta/\sigma=\beta/\sqrt{3}$ rather than $\beta$ itself.
One main cause of the remaining deviation between logit and probit is the shape of the error distribution: the logistic distribution has fatter tails than the standard normal used by probit. A natural consequence is that the standard errors of the regression estimates are higher in the logit model.
plot(seq(-10,10,0.1), dnorm(seq(-10,10,0.1)), type = 'l', col = 'blue', xlab = "", ylab = 'density')
lines(seq(-10,10,0.1), dlogis(seq(-10,10,0.1)), col = 'red')
legend('topright', legend = c("Logit", "Probit"),
       col = c("red", "blue"), lty = 1, cex = 1.5, inset = 0.1, box.lty = 0)
Another observation is that the 95% confidence intervals of the logit model generally contain the true values of the regression parameters, while those of the probit model do not.
# gmodels::ci(my.ols, confidence = 0.95)
cat('Probit 95% Confidence Interval: '); gmodels::ci(my.probit, confidence = 0.95);
cat('Logit 95% Confidence Interval: '); gmodels::ci(my.logit, confidence = 0.95);
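If gmodels is not available, base R's confint(), which uses profile likelihood for GLMs, offers a comparable cross-check:
## profile-likelihood intervals as an alternative check
confint(my.probit, level = 0.95)
confint(my.logit, level = 0.95)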
Let us now implement Albert and Chib's (1993) data augmentation algorithm for Bayesian probit models.
gibbs.probit1 <- function(d = myData,
                          b.prior.m.start = my.probit$coefficients,
                          b.prior.v.start = diag(summary(my.probit)$coefficients[, 2])^2) {
  Y <- d$Y; X <- d$X; sig <- d$sig[1, 1];
  N <- d$N; k <- d$k;
  T <- 3000;                                   ## number of Gibbs iterations
  b <- matrix(0, T, k);                        ## storage for the draws; b[1, ] = 0 is the starting value
  ## prior mean and variance for beta (taken here from a GLM fit)
  b.prior.v <- b.prior.v.start;
  b.prior.m <- b.prior.m.start;
  b.prior.v.inv <- solve(b.prior.v);
  ## posterior variance and the fixed parts of the posterior mean
  b.post.v <- solve(b.prior.v.inv + crossprod(X) / sig);
  b.post.m.A <- b.post.v %*% (b.prior.v.inv %*% b.prior.m);
  b.post.m.B1 <- b.post.v %*% t(X) / sig;
  M <- matrix(NA, N, 2); ystar <- rep(0, N);
  M[, 2] <- Y;                                 ## second column: observed y_i (selects the truncation side)
  for (t in 2:T) {
    M[, 1] <- X %*% b[t - 1, ];                ## first column: current means x_i * beta
    ystar <- apply(M, 1, rtn);                 ## step 1: draw the latent z_i from truncated normals
    b.post.m <- b.post.m.A + b.post.m.B1 %*% ystar
    ## step 2: draw beta from its conditional posterior Normal(b.post.m, b.post.v)
    b[t, ] <- mvrnorm(n = 1, mu = b.post.m, Sigma = b.post.v);
  }
  return(b)
}
temp.probit <- gibbs.probit1()
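Before examining the trace plots, a quick summary (discarding the first 1000 draws as burn-in, the same choice used for the quantile tables below) compares the posterior means with the true coefficients:
## posterior means after a 1000-draw burn-in vs. the true coefficients
round(rbind(posterior.mean = colMeans(temp.probit[1001:3000, ]),
            true.value     = myData$btrue), 3)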
This data augmentation method offers a convenient framework for iteratively sampling from the conditional densities $p(z|\beta)$ and $p(\beta|z)$.
Firstly, as you can see from the trace plot, although the chain appears to converge, there is evidence of strong autocorrelation: a lot of zigzagging is going on! Take the coefficient of $x_4$ as an example:
plot(c(1:3000), temp.probit[,5], ylim=c(0,6),ylab = 'beta_x4', xlab = 'index of iterations')
abline(h=5, col = 'blue')
abline(h=mean(temp.probit[,5]), col = 'red')
text(x = 2000, y = 5.2, labels = 'true value of beta_x4', col = 'blue')
text(x = 2000, y = mean(temp.probit[,5]) - 0.7, labels = 'mean of simulated beta_x4', col = 'red')
To confirm the strong autocorrelation, let us use the acf function.
ggacf <- function(x, ci = 0.95, type = "correlation", xlab = "Lag", ylab = NULL,
                  ylim = NULL, main = NULL, ci.col = "blue", lag.max = NULL) {
  x <- as.data.frame(x)
  x.acf <- acf(x, plot = FALSE, lag.max = lag.max, type = type)
  ci.line <- qnorm((1 - ci) / 2) / sqrt(x.acf$n.used)   ## confidence band for the ACF
  d.acf <- data.frame(lag = x.acf$lag, acf = x.acf$acf)
  y.label <- if (is.null(ylab)) ifelse(type == "partial", "PACF", "ACF") else ylab
  g <- ggplot(d.acf, aes(x = lag, y = acf)) +
    geom_hline(yintercept = 0) +
    geom_segment(aes(xend = lag, yend = 0)) +
    geom_hline(yintercept = ci.line, color = ci.col, linetype = "dashed") +
    geom_hline(yintercept = -ci.line, color = ci.col, linetype = "dashed") +
    theme_bw() +
    xlab("Lag") +
    ylab(y.label) +
    ggtitle(ifelse(is.null(main), "", main))
  g
}
grid.arrange(ggacf(temp.probit[,1]), ggacf(temp.probit[,2]), ggacf(temp.probit[,3]),
ggacf(temp.probit[,4]), ggacf(temp.probit[,5]), nrow=2)
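One way to quantify the autocorrelation (assuming the coda package is installed; it is not loaded above) is the effective sample size of each coefficient's chain:
## effective sample size of each chain after burn-in (requires the coda package)
coda::effectiveSize(temp.probit[1001:3000, ])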
Why the strong autocorrelation? Because we have strong posterior correlation between $Z$ and $\beta$!
Secondly, the results can differ substantially depending on the values of $\beta$ supplied as the prior mean and variance (which is how the starting information enters the sampler above).
With the GLM.Probit estimates in this role, Gibbs sampling yields the following quantiles for our regression parameters:
temp.probit.quantile <- apply(temp.probit[1001:3000, ], 2, quantile)
colnames(temp.probit.quantile) <- c('1', 'x1','x2','x3','x4')
temp.probit.quantile %>% t
Let the GLM.Logit estimates play this role instead. Gibbs sampling then yields the following quantiles for our regression parameters:
temp.logit <- gibbs.probit1(d = myData,
b.prior.m.start = my.logit$coefficients,
b.prior.v.start = diag(summary(my.logit)$coefficients[,2])^2 )
temp.logit.quantile <- apply(temp.logit[1001:3000, ], 2, quantile)
colnames(temp.logit.quantile) <- c('1', 'x1','x2','x3','x4')
temp.logit.quantile %>% t
Now the trace plot of the coefficient of $x_{4}$ is:
plot(c(1:3000), temp.logit[,5], ylim=c(0,7.2),ylab = 'beta_x4', xlab = 'index of iterations')
abline(h=5, col = 'blue')
abline(h=mean(temp.logit[,5]), col = 'red')
text(x = 2000, y = 4.8, labels = 'true value of beta_x4', col = 'blue')
text(x = 2000, y = mean(temp.logit[,5]) - 0.7, labels = 'mean of simulated beta_x4', col = 'red')
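To see the sensitivity more directly, we can put the post-burn-in posterior means of the two runs next to the true coefficients (again an illustrative check using the same burn-in as above):
## posterior means under the two prior configurations vs. the truth
round(rbind(probit.start = colMeans(temp.probit[1001:3000, ]),
            logit.start  = colMeans(temp.logit[1001:3000, ]),
            true.value   = myData$btrue), 3)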
Surprisingly, little of the literature comments on this point. In later studies we will look into other algorithms and examine whether we can obtain more robust results.