BMA Ridge Regression
What is Bayesian Model Averaging (BMA)?
Bayesian model averaging is a well-known model selection method that provides a coherent mechanism for accounting for model uncertainty. Suppose we have a collection of models \(\mathcal{M}_{1}, \mathcal{M}_{2}, \ldots, \mathcal{M}_{M}\) and we want to select the “best model” \(\mathcal{M}^*\) among the collection. In the BMA method, we assume that the model collection has a prior distribution such that \(\pi\left(\mathcal{M}_{m}\right)>0\) and \(\sum_{m=1}^{M} \pi\left(\mathcal{M}_{m}\right)=1\).
The BMA framework can be written as:
\[\begin{array}{rlr} X_{i} \mid \boldsymbol{\theta}_{m}, \mathcal{M}_{m} & \stackrel{\text { ind }}{\sim} f_{m}\left(x_{i} \mid \boldsymbol{\theta}_{m}\right), & i=1, \ldots, n \\ \boldsymbol{\theta}_{m} & \stackrel{\text { ind }}{\sim} g_{m}\left(\boldsymbol{\theta}_{m}\right), & m=1, \ldots, M \\ \mathcal{M}_{m} & \sim \pi\left(\mathcal{M}_{m}\right) \end{array}\]One can calculate the posterior probability \(\pi\left(\mathcal{M}_{m} \mid \mathbf{x}\right)\) of a model \(\mathcal{M}_m\) using Bayes' theorem, and further derive the predictive density \(f(x \mid \mathbf{x})\); both are written out below, and a simple example then illustrates the BMA model.
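Explicitly, by Bayes' theorem,
\[\pi\left(\mathcal{M}_{m} \mid \mathbf{x}\right)=\frac{f_{m}\left(\mathbf{x} \mid \mathcal{M}_{m}\right) \pi\left(\mathcal{M}_{m}\right)}{\sum_{j=1}^{M} f_{j}\left(\mathbf{x} \mid \mathcal{M}_{j}\right) \pi\left(\mathcal{M}_{j}\right)}, \qquad f_{m}\left(\mathbf{x} \mid \mathcal{M}_{m}\right)=\int f_{m}\left(\mathbf{x} \mid \boldsymbol{\theta}_{m}\right) g_{m}\left(\boldsymbol{\theta}_{m}\right) d \boldsymbol{\theta}_{m},\]and the predictive density is the posterior-weighted mixture
\[f(x \mid \mathbf{x})=\sum_{m=1}^{M} f_{m}(x \mid \mathbf{x})\, \pi\left(\mathcal{M}_{m} \mid \mathbf{x}\right).\]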
Example
Suppose we have two models:
\[\begin{aligned} &\mathcal{M}_{0}: Y_{i} \mid \lambda \stackrel{iid}{\sim} \operatorname{Geo}(\lambda), \quad g_{0}(\lambda)=1,\ 0<\lambda<1, \quad \pi(\mathcal{M}_0) = \frac{1}{2}\\ &\mathcal{M}_{1}: Y_{i} \mid \lambda \stackrel{iid}{\sim} \operatorname{Poi}(\lambda), \quad g_{1}(\lambda)=\frac{1}{\lambda}, \quad \pi(\mathcal{M}_1) = \frac{1}{2} \end{aligned}\]Note that \(g_1\) is an improper prior.
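To fix the parameterizations (the derivations below depend on them; in particular, the \((1-\lambda)^{S_{n}}\) factor that appears later assumes the geometric counts the number of failures before the first success, with support \(y = 0, 1, 2, \ldots\)), the two likelihoods are taken to be
\[f_{0}(y \mid \lambda)=\lambda(1-\lambda)^{y}, \qquad f_{1}(y \mid \lambda)=\frac{\lambda^{y} e^{-\lambda}}{y !}, \qquad y=0,1,2,\ldots\]We first derive the posterior distribution of \(\lambda\) under each model: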
- $\mathcal{M}_1$:
Posterior distribution:
\[\begin{aligned} g_{1}(\lambda \mid \mathbf{y}) &\propto f_{1}(\mathbf{y} \mid \lambda) g_{1}(\lambda)=\prod_{i=1}^{n} \frac{\lambda^{y_{i}} e^{-\lambda}}{y_{i} !} \cdot \frac{1}{\lambda}=\frac{\lambda^{S_{n}-1} e^{-n \lambda}}{\prod_{i=1}^{n} y_{i} !}, \quad S_{n}=\sum_{i=1}^{n} y_{i} \\ \Longrightarrow g_{1}(\lambda \mid \mathbf{y}) &=\frac{n^{S_{n}}}{\Gamma\left(S_{n}\right)} \lambda^{S_{n}-1} e^{-n \lambda} \sim \operatorname{Gamma}\left(S_{n}, \frac{1}{n}\right) \end{aligned}\]Assuming that $Y_{n+1} \mid \lambda$ is independent of $Y_{1}, \cdots, Y_{n}$, we have the conditional predictive density:
\[f_{1}\left(y_{n+1} \mid \mathbf{y}, \lambda\right)=f_{1}\left(y_{n+1} \mid \lambda\right)=\frac{\lambda^{y_{n+1}} e^{-\lambda}}{y_{n+1} !}\]Hence,
\[\begin{aligned} f_{1}\left(y_{n+1} \mid \mathbf{y}\right) &=\int_{0}^{\infty} f_{1}\left(y_{n+1} \mid \lambda\right) g_{1}(\lambda \mid \mathbf{y}) d \lambda \\ &=\frac{n^{S_{n}}}{\Gamma\left(S_{n}\right)}\int_{0}^{\infty} \lambda^{S_{n}-1} e^{-n \lambda} \cdot \frac{\lambda^{y_{n+1}} e^{-\lambda}}{y_{n+1} !} d \lambda\\ &=\frac{n^{S_{n}}}{\Gamma\left(S_{n}\right)}\int_{0}^{\infty} \frac{\lambda^{S_{n+1}-1} e^{-(n+1) \lambda}}{y_{n+1} !} d \lambda, \quad S_{n+1}=S_{n}+y_{n+1}\\ &=\frac{n^{S_{n}} \Gamma\left(S_{n+1}\right)}{(n+1)^{S_{n+1}} y_{n+1} ! \Gamma\left(S_{n}\right)}\\ &=\left(\frac{n}{n+1}\right)^{S_{n}}\left(\frac{1}{n+1}\right)^{y_{n+1}}\binom{S_{n}+y_{n+1}-1}{y_{n+1}}\\ &\sim \operatorname{NB}\left(S_{n}, \frac{1}{n+1}\right) \end{aligned}\]
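As a sanity check, this closed form is the negative binomial pmf with \(r = S_n\) and success probability \(n/(n+1)\) in the parameterization used by `scipy.stats.nbinom`, namely \(\binom{k+r-1}{k} p^{r}(1-p)^{k}\). A minimal sketch (the counts here are made up, purely for illustration):

```python
import numpy as np
from scipy.stats import nbinom
from scipy.special import comb

# Illustrative counts, not from the text
y = np.array([2, 0, 3, 1, 4])
n, S_n = len(y), y.sum()

y_new = np.arange(10)
# Closed form: (n/(n+1))^{S_n} (1/(n+1))^{y} C(S_n + y - 1, y)
closed_form = ((n / (n + 1)) ** S_n * (1 / (n + 1)) ** y_new
               * comb(S_n + y_new - 1, y_new))
# scipy's NB(r, p) pmf: C(k + r - 1, k) p^r (1 - p)^k
scipy_pmf = nbinom.pmf(y_new, S_n, n / (n + 1))
assert np.allclose(closed_form, scipy_pmf)
```

We then calculate $\pi(\mathcal{M}_1, \mathbf{y})$: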
\[\begin{aligned} \pi\left(\mathcal{M}_1, \mathbf{y}\right)&=f_{1}\left(\mathbf{y} \mid \mathcal{M}_{1}\right) \pi\left(\mathcal{M}_{1}\right)=\pi\left(\mathcal{M}_{1}\right) \int_{0}^{\infty} f_{1}(\mathbf{y} \mid \lambda)\, g_{1}(\lambda)\, d\lambda \\ &=\frac{1}{2}\left(\int_{0}^{\infty} \frac{\lambda^{S_{n}-1} e^{-n \lambda}}{\prod_{i=1}^{n} y_{i} !} d \lambda\right)=\frac{1}{2} \cdot \frac{\Gamma\left(S_{n}\right)}{n^{S_{n}} \prod_{i=1}^{n} y_{i} !} \end{aligned}\]- $\mathcal{M}_0$:
Posterior distribution:
\[g_{0}(\lambda \mid \mathbf{y}) \propto \lambda^{n}(1-\lambda)^{S_{n}} \Longrightarrow g_{0}(\lambda \mid \mathbf{y})=\frac{1}{B\left(n+1, S_{n}+1\right)} \lambda^{n}(1-\lambda)^{S_{n}} \sim \operatorname{Beta}\left(n+1, S_{n}+1\right)\]The conditional predictive density:
\[\begin{aligned} f_{0}\left(y_{n+1} \mid \mathbf{y}\right) &=\int_{0}^{1} f_{0}\left(y_{n+1} \mid \lambda\right) g_{0}(\lambda \mid \mathbf{y}) d \lambda \\ &=\int_{0}^{1} \frac{\lambda(1-\lambda)^{y_{n+1}}}{B\left(n+1, S_{n}+1\right)} \lambda^{n}(1-\lambda)^{S_{n}} d \lambda \\ &=\frac{1}{B\left(n+1, S_{n}+1\right)} \int_{0}^{1} \lambda^{n+1}(1-\lambda)^{S_{n+1}} d \lambda \\ &=\frac{B\left(n+2, S_{n+1}+1\right)}{B\left(n+1, S_{n}+1\right)} \end{aligned}\]
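This ratio of Beta functions is a proper pmf over \(y_{n+1} = 0, 1, 2, \ldots\); a quick numerical check (same illustrative counts as above, truncating the infinite sum) confirms it:

```python
import numpy as np
from scipy.special import betaln

y = np.array([2, 0, 3, 1, 4])  # illustrative counts, as before
n, S_n = len(y), y.sum()

y_new = np.arange(500)  # truncation; the tail mass beyond 500 is negligible here
log_pmf = betaln(n + 2, S_n + y_new + 1) - betaln(n + 1, S_n + 1)
print(np.exp(log_pmf).sum())  # approximately 1.0
```

For $\pi\left(\mathcal{M}_{0}, \mathbf{y}\right)$, we have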
\[\pi\left(\mathcal{M}_{0}, \mathbf{y}\right)=f_{0}\left(\mathbf{y} \mid \mathcal{M}_{0}\right) \pi\left(\mathcal{M}_{0}\right)=\frac{1}{2} \int_{0}^{1} \lambda^{n}(1-\lambda)^{S_{n}} d \lambda=\frac{1}{2} B\left(n+1, S_{n}+1\right)\]By Bayes' theorem, we have the predictive density:
\[f(y \mid \mathbf{y})=f_{0}(y \mid \mathbf{y}) \pi\left(\mathcal{M}_{0} \mid \mathbf{y}\right)+f_{1}(y \mid \mathbf{y}) \pi\left(\mathcal{M}_{1} \mid \mathbf{y}\right)\]where
\[\pi\left(\mathcal{M}_{0} \mid \mathbf{y}\right)=\frac{\frac{1}{2} B\left(n+1, S_{n}+1\right)}{\frac{1}{2}\left(B\left(n+1, S_{n}+1\right)+\frac{\Gamma\left(S_{n}\right)}{n^{S_{n}} \prod y_{i} !}\right)}=\frac{B\left(n+1, S_{n}+1\right)}{B\left(n+1, S_{n}+1\right)+\frac{\Gamma\left(S_{n}\right)}{n^{S_{n}} \prod y_{i} !}}\]and
\[\pi\left(\mathcal{M}_{1} \mid \mathbf{y}\right)=\frac{\frac{\Gamma\left(S_{n}\right)}{n^{S_{n}} \prod y_{i} !}}{B\left(n+1, S_{n}+1\right)+\frac{\Gamma\left(S_{n}\right)}{n^{S_{n}} \prod y_{i} !}}\]
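Putting the pieces together, here is a minimal sketch (same illustrative counts as above) that computes the two posterior model probabilities in log space for numerical stability and then evaluates the BMA predictive pmf:

```python
import numpy as np
from scipy.special import betaln, gammaln

y = np.array([2, 0, 3, 1, 4])  # illustrative counts, as before
n, S_n = len(y), y.sum()

# Log marginal likelihoods; the common prior factor 1/2 cancels in the weights
log_m0 = betaln(n + 1, S_n + 1)                                 # Geometric model
log_m1 = gammaln(S_n) - S_n * np.log(n) - gammaln(y + 1).sum()  # Poisson model

# Posterior model probabilities via a stable softmax
log_w = np.array([log_m0, log_m1])
w = np.exp(log_w - log_w.max())
w /= w.sum()

# BMA predictive pmf f(y_{n+1} | y) on a grid of future values
y_new = np.arange(10)
f0 = np.exp(betaln(n + 2, S_n + y_new + 1) - betaln(n + 1, S_n + 1))
f1 = np.exp(gammaln(S_n + y_new) - gammaln(S_n) - gammaln(y_new + 1)
            + S_n * np.log(n / (n + 1)) - y_new * np.log(n + 1))
bma_pred = w[0] * f0 + w[1] * f1

print("pi(M0 | y) =", w[0].round(4), " pi(M1 | y) =", w[1].round(4))
print("f(y | y) on 0..9:", bma_pred.round(4))
```

For these made-up counts the sample variance is close to the sample mean, so the Poisson model should receive most of the posterior weight.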
BMA in Ridge Regression
In ridge regression, the estimator of $\boldsymbol{\beta}$ is defined as:
\[\hat{\boldsymbol{\beta}}_{\mathrm{R}}=\left(\mathbf{X}^{T} \mathbf{X}+\lambda \mathbf{I}\right)^{-1} \mathbf{X}^{T} \mathbf{Y}\]The ridge estimator can be viewed as the solution of a least-squares problem with an $\ell_{2}$ penalty:
\[\hat{\boldsymbol{\beta}}_{\mathrm{R}}=\underset{\boldsymbol{\beta} \in \mathbb{R}^{p}}{\arg \min }\left\{\|\mathbf{Y}-\mathbf{X} \boldsymbol{\beta}\|^{2}+\lambda \sum_{k=1}^{p} \beta_{k}^{2}\right\}\]
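As a quick illustration of the closed form (a sketch on simulated data; note the formula assumes no intercept term):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 50, 5, 2.0
X = rng.normal(size=(n, p))
beta_true = rng.normal(size=p)
Y = X @ beta_true + rng.normal(scale=0.5, size=n)

# Closed-form ridge estimator: (X^T X + lambda * I)^{-1} X^T Y
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)
print(beta_ridge)
```

Solving the linear system with `np.linalg.solve` rather than forming the explicit inverse is the numerically safer way to evaluate the estimator.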