Bayesian probit regression with Data Augmentation and known variance

By Sheng BI

In [1]:
library(magrittr); library(MASS);
library(ggplot2); library(microbenchmark); library(gridExtra);
library(truncnorm);

Today we are going to introduce Albert and Chib's (1993) data augmentation algorithm for Bayesian probit models.

Bayesian binary model setup:

$$ \begin{array}{ccccc} y_{i} & \sim & \text{Bernoulli}(\mu_{i})\\ \mu_{i} & = & g^{-1}(\eta_{i}) & , & \eta_{i}=x_{i}\beta\\ \beta & \sim & \text{Normal}(\cdot,\cdot) \end{array} $$

In the Bayesian context, the dependent variable $y_{i}$ should always be regarded as random, since the regression parameters that govern its distribution are unknown.

Here, $\mu_{i}$ is the probability that $y_{i}$ equals 1, and $g(\cdot)$ is the link function: its inverse $g^{-1}(\cdot)$ maps $x_{i}\beta$ to a value in the interval $[0,1]$. For the probit model with known error variance $\sigma^{2}$ considered here, $g^{-1}(\eta)=\Phi(\eta/\sigma)$, the normal CDF evaluated at the standardized linear predictor.

Now, let us introduce the latent variable: $z_{i}=x_{i}\beta+\epsilon_{i}$, $\epsilon_{i}\sim\text{Normal}(0,\sigma^{2})$, where $\epsilon_{i}$'s are $iid$, and $\sigma^{2}$ is assumed to be known.

Having defined the latent variable $z_i$, we realize that we could use $z_i$ to infer the value of $y_i$ perfectly.

$$ y_{i}=\begin{cases} 1 & z_{i}>0\\ 0 & z_{i}\leq0 \end{cases}, $$

in other words, $y_{i}$ becomes deterministic given $z_{i}$, while the $z_{i}$'s themselves are unobserved and random.

.....

The likelihood of the model is

$$ \begin{array}{ccl} L(\beta|Y,X) & = & \prod_{i=1}^{N}\text{Pr}(y_{i}=1|x_{i},\beta)^{y_{i}}\,\text{Pr}(y_{i}=0|x_{i},\beta)^{1-y_{i}}\\ & = & \prod_{i=1}^{N}\text{Pr}(x_{i}\beta+\epsilon_{i}>0)^{y_{i}}\,\text{Pr}(x_{i}\beta+\epsilon_{i}\leq0)^{1-y_{i}}\\ & = & \prod_{i=1}^{N}\left[1-\Phi(-x_{i}\beta/\sigma)\right]^{y_{i}}\left[\Phi(-x_{i}\beta/\sigma)\right]^{1-y_{i}} \end{array} $$

Note that we have $\Phi(-x_{i}\beta/\sigma) = 1 - \Phi(x_{i}\beta/\sigma)$; in the standard probit model $\sigma$ is fixed at 1.
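To make this concrete, here is a minimal R sketch of that log-likelihood. The function name loglik.probit and the argument s (the known error variance, 1 for the standard probit) are illustrative and not used elsewhere in this notebook:

loglik.probit <- function(beta, X, y, s = 1) {
    ## sum_i [ y_i * log Phi(x_i beta / sqrt(s)) + (1 - y_i) * log Phi(-x_i beta / sqrt(s)) ]
    eta <- as.vector(X %*% beta) / sqrt(s)
    sum(y * pnorm(eta, log.p = TRUE) + (1 - y) * pnorm(-eta, log.p = TRUE))
}
## e.g., once my.probit and myData exist (both are defined later in this notebook):
## loglik.probit(coef(my.probit), myData$X, as.vector(myData$Y))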

.....

The assumed prior for $\beta$: $\beta\sim\text{Normal}(\underline{\beta},\underline{V})$.

..... We use the information in the $z_{i}$'s to derive the posterior, and we will Gibbs-sample from $p(Z|\beta,Y,X)$ and $p(\beta|Y,Z,X)$ repeatedly.

  1. As for $p(Z|\beta,Y,X)$, we know from the conditional independence of the $z_{i}$'s that each $z_{i}$ can be sampled from a truncated normal given $Y$ and $\beta$:
$$ \begin{array}{cccc} z_{i}|\beta,Y & \sim & \text{TruncatedNormal}(x_{i}\beta,\sigma^{2})\times I(z_{i}>0) & \text{if }y_{i}=1\\ z_{i}|\beta,Y & \sim & \text{TruncatedNormal}(x_{i}\beta,\sigma^{2})\times I(z_{i}\leq0) & \text{if }y_{i}=0 \end{array} $$
  2. As for $p(\beta|Y,Z,X)$, we notice that $p(\beta|Y,Z,X)=p(\beta|Z,X)$, because knowing $Z$ means knowing $Y$. Moreover, $p(\beta|Z,X)$ can be derived exactly as in the linear regression setup, so we can reuse the previous result:
$$ \beta|Z,X\sim N(\overline{\beta},\overline{V}), $$

with $\overline{V}=(\underline{V}^{-1}+X^{T}X/\sigma^{2})^{-1}$

and $\overline{\beta}=\overline{V}(\underline{V}^{-1}\underline{\beta}+X^{T}Z/\sigma^{2})$
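In R, this conjugate update can be written as a small helper. This is a sketch; the names post.beta, prior.m, prior.V and the argument s (the known error variance) are illustrative and not used elsewhere in the notebook:

post.beta <- function(X, Z, prior.m, prior.V, s) {
    V.bar <- solve(solve(prior.V) + crossprod(X) / s)                       ## posterior variance
    b.bar <- V.bar %*% (solve(prior.V) %*% prior.m + crossprod(X, Z) / s)   ## posterior mean
    list(mean = b.bar, var = V.bar)
}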

At last, it is important to notice that

$$ \begin{array}{ccl} p(\beta|Y,X) & = & \int p(\beta,Z|Y,X)\,dZ\\ & \propto & p(\beta)\times L(\beta|Y,X), \end{array} $$

which means this marginal posterior is exactly the one we would obtain without introducing the latent variable $Z$, so our procedure is justified.
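The proportionality in the second line follows because integrating each $z_{i}$ over its admissible region recovers the corresponding Bernoulli probability; for instance, for $y_{i}=1$,

$$ \int_{0}^{\infty}\frac{1}{\sigma}\,\phi\!\left(\frac{z_{i}-x_{i}\beta}{\sigma}\right)dz_{i}=1-\Phi\!\left(\frac{-x_{i}\beta}{\sigma}\right)=\Phi\!\left(\frac{x_{i}\beta}{\sigma}\right), $$

and similarly for $y_{i}=0$.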

This algorithm proceeds in two steps.

  1. Draw $z_{1},...z_{n}$ independently with $z_{i}\sim\text{TN}(x_{i}\beta,\sigma^{2},y_{i})$

  2. Draw $\beta\sim\text{N}(\hat{\beta}(Z),\sigma^{2}(X^{T}X)^{-1})$, where $\hat{\beta}(Z)=(X^{T}X)^{-1}X^{T}Z$. (This is the flat-prior special case; with the normal prior above, draw $\beta\sim\text{N}(\overline{\beta},\overline{V})$ instead.) A minimal R sketch of these two steps is given below.
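Here is that sketch of one iteration under the flat prior; the helper name gibbs.step and its arguments are illustrative (MASS and truncnorm are loaded above):

gibbs.step <- function(beta, X, y, s) {
    eta <- as.vector(X %*% beta)
    ## Step 1: draw each latent z_i from a normal truncated according to y_i
    z <- ifelse(y == 1,
                rtruncnorm(length(y), a = 0,    b = Inf, mean = eta, sd = sqrt(s)),
                rtruncnorm(length(y), a = -Inf, b = 0,   mean = eta, sd = sqrt(s)))
    ## Step 2: draw beta from N(beta.hat(z), s * (X'X)^{-1})
    XtX.inv <- solve(crossprod(X))
    mvrnorm(1, mu = XtX.inv %*% crossprod(X, z), Sigma = s * XtX.inv)
}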

My data generating process is

$z_{i}=1+2x_{1}+3x_{2}+4x_{3}+5x_{4}+\epsilon_{i}$, where $\epsilon_{i}\sim\text{N}(0,3)$ (i.e., the error variance is $\sigma^{2}=3$)

$y_{i}=1$ if $z_{i}\geq0$, and $y_{i}=0$ if $z_{i}<0$

In [2]:
## Data generating process.

set.seed(2019)

dgp <- function() {
    
    ####### Data Generating Process
    N  <- 1000; ## nb. of observations
    k  <- 5;    ## nb. of regression parameters;
    x1.m <- 0; x1.sd <- 1;
    x2.m <- 0; x2.sd <- 1;
    x3.m <- 0; x3.sd <- 1;
    x4.m <- 0; x4.sd <- 1;
    x1 <- rnorm(n = N, mean = x1.m, sd = x1.sd); ## feature 1
    x2 <- rnorm(n = N, mean = x2.m, sd = x2.sd); ## feature 2
    x3 <- rnorm(n = N, mean = x3.m, sd = x3.sd); ## feature 3
    x4 <- rnorm(n = N, mean = x4.m, sd = x4.sd); ## feature 4
    
    # Input Data matrix
    X <- matrix(cbind(rep(1, N),x1,x2,x3,x4), ncol = k); ## Explanatory variables as a matrix
    
    # True values of unknown parameters
    btrue  <- c(1,2,3,4,5); ## regression coefficients
    sigmaY <- diag(3, N);
    
    # True forms of causal relationship
    lat <- X %*% matrix(btrue, k, 1) + mvrnorm(1,mu = rep(0,N), Sigma = sigmaY);
    Y <- lat; Y[,1] <- ifelse(lat[,1]>=0, 1, 0);
    result <- list();
    result$X <- X; result$Y <- Y; 
    result$N <- N; result$k <- k; result$sig <- sigmaY; 
    result$btrue <- btrue; result$lat <- lat;     
    return(result);
}

myData <- dgp();

## draw from truncated normal using rtruncnorm;
## row[1] is the latent mean x_i %*% beta, row[2] is the observed y_i
rtn <- function(row, s=myData$sig[1,1]) {
    if (row[2] >= 0.5) {
        result <- rtruncnorm(n=1, a=0, b=Inf, mean =row[1], sd=sqrt(s))};
    if (row[2] < 0.5) {
        result <- rtruncnorm(n=1, a=-Inf, b=0, mean=row[1], sd=sqrt(s))};
    return(result)
}

## draw from truncated normal using the inverse-CDF method (same row convention as rtn)
rtn2 <- function(row, s=myData$sig[1,1]) {
    u <- runif(1,0,1)
    v <- -row[1]/sqrt(s)
    if (row[2] >= 0.5) {
        result <- row[1] + sqrt(s) * qnorm(pnorm(v) + u*(1-pnorm(v)))
        }
    if (row[2] < 0.5) {
        result <- row[1] + sqrt(s) * qnorm(u * pnorm(v))
    }
    return(result)
}
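As a quick sanity check (not part of the original notebook), both samplers should return a draw on the side of zero dictated by the second element of the input row:

set.seed(1)
c(rtn(c(0.3, 1)), rtn2(c(0.3, 1)))   ## y = 1: both draws should be positive
c(rtn(c(0.3, 0)), rtn2(c(0.3, 0)))   ## y = 0: both draws should be non-positive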

Let us first examine the results of OLS, GLM.Probit and GLM.Logit.

In [3]:
my.ols <- lm(Y~X-1, data = myData)
summary(my.ols)

my.probit <- glm(Y~X-1, data = myData, family = binomial(link = 'probit'))
summary(my.probit)

my.logit <- glm(Y~X-1, data = myData, family = binomial(link = 'logit'))
summary(my.logit)
Call:
lm(formula = Y ~ X - 1, data = myData)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.04836 -0.26331  0.02119  0.26137  0.78773 

Coefficients:
   Estimate Std. Error t value Pr(>|t|)    
X1 0.555783   0.009908   56.09   <2e-16 ***
X2 0.105361   0.009880   10.66   <2e-16 ***
X3 0.146481   0.009718   15.07   <2e-16 ***
X4 0.231994   0.009910   23.41   <2e-16 ***
X5 0.262081   0.009700   27.02   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3119 on 995 degrees of freedom
Multiple R-squared:  0.8167,	Adjusted R-squared:  0.8158 
F-statistic: 886.8 on 5 and 995 DF,  p-value: < 2.2e-16
Warning message:
“glm.fit: fitted probabilities numerically 0 or 1 occurred”
Call:
glm(formula = Y ~ X - 1, family = binomial(link = "probit"), 
    data = myData)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-2.62663  -0.01802   0.00000   0.03704   2.41589  

Coefficients:
   Estimate Std. Error z value Pr(>|z|)    
X1  0.67206    0.09984   6.731 1.68e-11 ***
X2  1.34917    0.13452  10.030  < 2e-16 ***
X3  1.96084    0.17514  11.196  < 2e-16 ***
X4  2.77463    0.22623  12.265  < 2e-16 ***
X5  3.47285    0.28980  11.984  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1386.29  on 1000  degrees of freedom
Residual deviance:  279.68  on  995  degrees of freedom
AIC: 289.68

Number of Fisher Scoring iterations: 9
Warning message:
“glm.fit: fitted probabilities numerically 0 or 1 occurred”
Call:
glm(formula = Y ~ X - 1, family = binomial(link = "logit"), data = myData)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-2.59840  -0.05798   0.00016   0.08117   2.41351  

Coefficients:
   Estimate Std. Error z value Pr(>|z|)    
X1   1.2086     0.1837   6.580 4.69e-11 ***
X2   2.3981     0.2551   9.402  < 2e-16 ***
X3   3.4780     0.3350  10.381  < 2e-16 ***
X4   4.9543     0.4406  11.245  < 2e-16 ***
X5   6.1819     0.5624  10.993  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1386.29  on 1000  degrees of freedom
Residual deviance:  282.44  on  995  degrees of freedom
AIC: 292.44

Number of Fisher Scoring iterations: 8

This is a nonlinear model, so OLS on a linear probability specification simply gets the functional form wrong. Recall that (1) OLS produces a weighted average of the $y_{i}$'s, and (2) a weighted average acts like a balance point, much like a center of mass in physics.

Relative to the true coefficients $(1,2,3,4,5)$, the probit estimates fall below the true values, while the logit estimates tend to exceed them; the residual deviance (279.68 vs. 282.44), however, shows that the probit fit is slightly better than the logit fit. (Keep in mind that glm's probit link fixes the latent error variance at 1, whereas the data were generated with $\sigma^{2}=3$, so the GLM probit targets roughly $\beta/\sqrt{3}$ rather than $\beta$.)

One main cause of the deviation between the logit and probit estimates is the shape of the two distributions: the logistic distribution has fatter tails than the normal. A natural consequence is that the standard errors of the regression estimates are larger in the logit model, and the coefficients are inflated by roughly the ratio of the two scales, $\pi/\sqrt{3}\approx1.8$.
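A quick way to see this scale difference in the fitted models (a sketch; the elementwise ratio is expected to sit near $\pi/\sqrt{3}\approx1.8$):

## ratio of logit to probit coefficient estimates
coef(my.logit) / coef(my.probit)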

In [6]:
grid <- seq(-10, 10, 0.1)
plot(grid, dnorm(grid), col = 'blue', type = 'l', lty = 2, xlab = '', ylab = 'density')
lines(grid, dlogis(grid), col = 'red', lty = 1)

legend('topright', legend=c("Logit", "Probit"),
       col=c("red", "blue"), lty=1:2, cex=1.5, inset = 0.1, box.lty=0)

Another observation is that the 95% confidence intervals of the logit model generally include the true values of the regression parameters, while those of the probit model do not.

In [13]:
# gmodels::ci(my.ols, confidence = 0.95)
cat('Probit 95% Confidence Interval: '); gmodels::ci(my.probit, confidence = 0.95);
cat('Logit 95% Confidence Interval: ');  gmodels::ci(my.logit, confidence = 0.95);
Probit 95% Confidence Interval: 
     Estimate  CI lower  CI upper  Std. Error       p-value
X1  0.6720582 0.4761319 0.8679844  0.09984261  1.682977e-11
X2  1.3491695 1.0851977 1.6131412  0.13451811  1.129167e-23
X3  1.9608360 1.6171468 2.3045253  0.17514156  4.279085e-29
X4  2.7746311 2.3306855 3.2185767  0.22623150  1.403630e-34
X5  3.4728495 2.9041568 4.0415421  0.28980169  4.334254e-33
Logit 95% Confidence Interval: 
     Estimate  CI lower  CI upper  Std. Error       p-value
X1   1.208555 0.8481486  1.568962   0.1836607  4.692801e-11
X2   2.398110 1.8975942  2.898626   0.2550594  5.345167e-21
X3   3.477984 2.8205002  4.135467   0.3350489  3.040932e-25
X4   4.954335 4.0897888  5.818881   0.4405664  2.440568e-29
X5   6.181944 5.0783644  7.285524   0.5623764  4.150806e-28

Let us implement Albert and Chib's (1993) data augmentation algorithm for Bayesian probit models.

In [26]:
gibbs.probit1 <- function(d = myData, 
                     b.prior.m.start = my.probit$coefficients, 
                     b.prior.v.start = diag(summary(my.probit)$coefficients[,2])^2 ) {
    Y <- d$Y; X <- d$X; sig <- d$sig[1,1];
    N <- d$N; k <- d$k; 
    T <- 3000;  ## number of Gibbs iterations
    
    ## storage for the draws; the chain starts at b[1, ] = 0
    b <- matrix(0, T, k); 
        
    ## prior hyperparameters (taken from the GLM fit by default)
    b.prior.v <- b.prior.v.start;
    b.prior.m <- b.prior.m.start; 
    b.prior.v.inv <- solve(b.prior.v);
    
    b.post.v <- solve(b.prior.v.inv + crossprod(X)/sig);
    b.post.m.A <-  b.post.v %*% (b.prior.v.inv %*% b.prior.m);
    b.post.m.B1 <- b.post.v %*% t(X) / sig;
    
    M <- matrix(NA, N, 2); ystar <- rep(0,N);
    M[,2] <- Y;
    
    for (t in 2:T) {
        M[,1] <- X %*% b[t-1, ];
        
        ## Step 1: draw the latent z_i's from their truncated normals given the current beta
        ystar <- apply(M, 1, rtn); # Ystar <- matrix(ystar, ncol = 1);
        
        ## Step 2: draw beta from its conditional normal posterior given the latent draws
        b.post.m <- b.post.m.A + b.post.m.B1 %*% ystar
        # b.post.m <- b.post.v %*% (b.prior.v.inv %*% b.prior.m + crossprod(X,Ystar)/sig[1,1]);
        
        b[t, ]  <- mvrnorm(n = 1, mu = b.post.m, Sigma = b.post.v);
    }
    return(b)
}
temp.probit <- gibbs.probit1()

This data augmentation method offers a convenient framework for iteratively sampling from the conditional densities $p(z|\beta)$ and $p(\beta|z)$.
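For comparison, the same sampler can also be run with a deliberately vague prior instead of the GLM-based one; the hyperparameter values below (zero prior mean, variance 100 on the diagonal) are illustrative choices, not part of the original analysis:

temp.vague <- gibbs.probit1(d = myData,
                            b.prior.m.start = rep(0, myData$k),
                            b.prior.v.start = diag(100, myData$k))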

However, the result is not satisfactory: the chain mixes poorly and appears to converge only locally.

Firstly, as you can see from the trace plot, although the chain appears to have converged, there is evidence of strong autocorrelation: a lot of zigzagging is going on. Take the coefficient of $x_4$ as an example:

In [37]:
#graph.probit1 <- function() {
    #p1 <- ggplot(data = NULL, mapping = aes(x=1:3000, y=temp[,3])) + 
    #geom_point(shape = 1, size = 2, alpha = 0.4); p1;}
#graph.probit1()


plot(c(1:3000), temp.probit[,5], ylim=c(0,6),ylab = 'beta_x4', xlab = 'index of iterations')
abline(h=5, col = 'blue')
abline(h=mean(temp.probit[,5]), col = 'red')
text(x = 2000, y = 5.2, labels = 'true value of beta_x4', col = 'blue')
text(x = 2000, y = mean(temp.probit[,5]) - 0.7, labels = 'mean of simulated beta_x4', col = 'red')

To confirm the strong autocorrelation, let us use the acf function.

In [29]:
ggacf <- function(x, ci=0.95, type="correlation", xlab="Lag", ylab=NULL,
                  ylim=NULL, main=NULL, ci.col="blue", lag.max=NULL) {

    x <- as.data.frame(x)

    x.acf <- acf(x, plot=F, lag.max=lag.max, type=type)

    ci.line <- qnorm((1 - ci) / 2) / sqrt(x.acf$n.used)

    d.acf <- data.frame(lag=x.acf$lag, acf=x.acf$acf)

    g <- ggplot(d.acf, aes(x=lag, y=acf)) +
        geom_hline(yintercept=0) +
        geom_segment(aes(xend=lag, yend=0)) +
        geom_hline(yintercept=ci.line, color=ci.col, linetype="dashed") +
        geom_hline(yintercept=-ci.line, color=ci.col, linetype="dashed") +
        theme_bw() +
        xlab("Lag") +
        ggtitle(ifelse(is.null(main), "", main)) +
        if (is.null(ylab))
            ylab(ifelse(type=="partial", "PACF", "ACF"))
        else
            ylab(ylab)

    g
}

grid.arrange(ggacf(temp.probit[,1]), ggacf(temp.probit[,2]), ggacf(temp.probit[,3]), 
             ggacf(temp.probit[,4]), ggacf(temp.probit[,5]), nrow=2)

Why the strong autocorrelation? Because we have strong posterior correlation between $Z$ and $\beta$!
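A quick numeric confirmation (a sketch using base R's acf; coda::effectiveSize would summarize the same information as an effective sample size, but the coda package is not loaded here):

## lag-1 autocorrelation of each coefficient's chain, after a burn-in of 1000 draws
sapply(1:5, function(j) acf(temp.probit[1001:3000, j], plot = FALSE)$acf[2])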

Now, secondly, the results differ substantially depending on how the prior for $\beta$ is specified. (In gibbs.probit1 the GLM estimates and their standard errors are plugged in as the prior mean and prior variance, so they act as much more than mere starting values.)

With the prior built from the GLM.Probit fit, Gibbs sampling yields the following quantiles for the regression parameters:

In [30]:
temp.probit.quantile <- apply(temp.probit[1001:3000, ], 2, quantile)
colnames(temp.probit.quantile)  <- c('1', 'x1','x2','x3','x4')
temp.probit.quantile %>% t
           0%       25%       50%       75%     100%
1   0.4161944 0.6602616 0.7127929 0.7678285 1.014837
x1  1.2059133 1.3921811 1.4580071 1.5249096 1.762157
x2  1.7374494 2.0825515 2.1545970 2.2244572 2.521394
x3  2.7207804 3.0504949 3.1342240 3.2218829 3.675385
x4  3.4010645 3.7660640 3.8752351 3.9892661 4.439693

Now let the prior be built from the GLM.Logit fit instead. Gibbs sampling yields the following quantiles for the regression parameters:

In [31]:
temp.logit <- gibbs.probit1(d = myData, 
                     b.prior.m.start = my.logit$coefficients, 
                     b.prior.v.start = diag(summary(my.logit)$coefficients[,2])^2 )
temp.logit.quantile <- apply(temp.logit[1001:3000, ], 2, quantile)
colnames(temp.logit.quantile)  <- c('1', 'x1','x2','x3','x4')
temp.logit.quantile %>% t
           0%      25%      50%      75%     100%
1   0.7014335 1.108976 1.194717 1.284904 1.591236
x1  1.8240039 2.297301 2.404181 2.507419 2.941009
x2  2.9345786 3.329766 3.456125 3.600544 4.052139
x3  4.1829478 4.756783 4.917394 5.069595 5.653067
x4  5.1013179 5.994524 6.163860 6.355668 7.101978

Now the trace plot of the coefficient of $x_4$ is:

In [39]:
plot(c(1:3000), temp.logit[,5], ylim=c(0,7.2),ylab = 'beta_x4', xlab = 'index of iterations')
abline(h=5, col = 'blue')
abline(h=mean(temp.logit[,5]), col = 'red')
text(x = 2000, y = 4.8, labels = 'true value of beta_x4', col = 'blue')
text(x = 2000, y = mean(temp.logit[,5]) - 0.7, labels = 'mean of simulated beta_x4', col = 'red')

Surprisingly, little of the literature comments on this point. In later studies we will look into other algorithms and examine whether we can obtain more robust results.

to be continued