Model

An alternative approach to modeling covariates is to use Pólya-gamma augmentation. Polson et al. (2013) propose a strategy that uses Pólya-gamma latent variables for fully Bayesian inference in binomial likelihoods. Linderman et al. (2015) use it to develop models for categorical and multinomial data with dependencies among the multinomial parameters. We extend their method to incorporate covariates.

We use the same notation as in the main text. Our corpus contains $D$ documents. We observe a matrix $\mathbf{X}$ of $M$ covariates whose dimension is $D \times M$. The topic model assigns one of $K$ topics to each observed word, and $\mathbf{z}_d$ is the vector of assigned topics for document $d$. We introduce the stick-breaking representation of the multinomial distribution described in Linderman et al. (2015),
\begin{align}
\mathbf{z}_{d} &\sim \textsf{SB-Multi}(n_d, \boldsymbol{\psi}_d),
\end{align}
which rewrites the $K$-dimensional multinomial distribution with $K-1$ binomial distributions.

\begin{align}
\textsf{SB-Multi}(\mathbf{z}_d \mid n_d, \boldsymbol{\psi}_d)
&= \prod_{k=1}^{K-1} \textsf{Bin}(n_{dk} \mid n_d - \textstyle\sum_{k' < k} n_{dk'}, \psi_{dk}) \\
&= \prod_{k=1}^{K-1} \binom{N_{dk}}{n_{dk}} \bigg( \frac{\exp(\psi_{dk})}{1 + \exp(\psi_{dk})} \bigg)^{n_{dk}} \bigg( \frac{1}{1 + \exp(\psi_{dk})} \bigg)^{N_{dk} - n_{dk}} \\
&= \prod_{k=1}^{K-1} \binom{N_{dk}}{n_{dk}} \frac{\exp(\psi_{dk})^{n_{dk}}}{(1 + \exp(\psi_{dk}))^{N_{dk}}} \tag{1} \label{eq:sb-multi} \\
N_{dk} &= n_d - \sum_{k' < k} n_{dk'}
\end{align}
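As a concrete check, the density in Equation (1) can be evaluated as a product of $K-1$ binomials. Below is a minimal sketch in Python, assuming NumPy and SciPy are available; the function name `sb_multi_logpmf` is ours, not from the references:

```python
import numpy as np
from scipy.special import expit  # logistic function exp(x)/(1+exp(x))
from scipy.stats import binom

def sb_multi_logpmf(n_d, psi_d):
    """Log-density of SB-Multi as a product of K-1 binomials (Eq. 1)."""
    n_d = np.asarray(n_d, dtype=int)   # counts n_d1, ..., n_dK
    remaining = n_d.sum()              # N_d1 = n_d (total count)
    logp = 0.0
    for k, psi in enumerate(psi_d):    # psi_d has K-1 entries
        p = expit(psi)                 # success probability of the k-th stick
        logp += binom.logpmf(n_d[k], remaining, p)
        remaining -= n_d[k]            # N_{d,k+1} = N_dk - n_dk
    return logp
```

With $K = 2$ the representation collapses to a single binomial, which gives a quick sanity check: `sb_multi_logpmf([3, 1], [0.0])` should equal $\log \textsf{Bin}(3 \mid 4, 0.5)$.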

We model the parameters in Equation (1) with the covariates. First, we assume that the coefficients follow a multivariate normal distribution. $\boldsymbol{\lambda}$ is an $M \times (K-1)$ matrix, so we introduce the vectorization transformation to draw all elements in a single draw from the multivariate normal distribution.

\begin{align}
\text{vec}(\boldsymbol{\lambda}) &\sim \mathcal{N}(\text{vec}(\boldsymbol{\mu}_0), \boldsymbol{\Sigma}_0 \otimes \boldsymbol{\Lambda}_0^{-1}) \tag{2} \label{eq:prior_lambda} \\
\boldsymbol{\Sigma}_0 &\sim \mathcal{W}^{-1}(\mathbf{V}_0, \boldsymbol{\nu}_0)
\end{align}
where we set the priors $\boldsymbol{\mu}_0 = \mathbf{0}$, $\boldsymbol{\Sigma}_0$ is a $(K-1) \times (K-1)$ identity matrix (for topics), and $\boldsymbol{\Lambda}_0$ is an $M \times M$ identity matrix (for covariates). $\boldsymbol{\Sigma}_0 \otimes \boldsymbol{\Lambda}_0^{-1}$ then becomes a diagonal matrix, and Equation (2) is the same as $\boldsymbol{\lambda}_{k} \sim \mathcal{N}((\boldsymbol{\mu}_{0})_{k}, (\boldsymbol{\Sigma}_0)_{kk} \boldsymbol{\Lambda}_0^{-1})$ for $k = 1, \ldots, K-1$.
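The Kronecker-structured prior in Equation (2) can be built directly with NumPy. The sketch below uses illustrative dimensions and relies on the fact that $\text{vec}(\cdot)$ stacks columns, so the column-major (`order="F"`) reshape recovers the matrix:

```python
import numpy as np

M, K = 3, 4                                     # illustrative dimensions
mu0 = np.zeros((M, K - 1))                      # prior mean matrix
Sigma0 = np.eye(K - 1)                          # topic covariance (identity prior)
Lambda0 = np.eye(M)                             # covariate precision (identity prior)

# Covariance of vec(lambda): Sigma0 kron Lambda0^{-1}, an M(K-1) x M(K-1) matrix.
cov = np.kron(Sigma0, np.linalg.inv(Lambda0))

rng = np.random.default_rng(0)
lam_vec = rng.multivariate_normal(mu0.ravel(order="F"), cov)
lam = lam_vec.reshape((M, K - 1), order="F")    # undo the column-stacking vec()
```

With both priors set to identity matrices, `cov` is itself the identity, which matches the per-column form $\boldsymbol{\lambda}_{k} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ stated above.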

Next, we use the covariates to model the parameters in Equation (1):
\begin{align}
\boldsymbol{\psi}_d &\sim \mathcal{N}(\boldsymbol{\lambda}^\top \mathbf{x}_d, \boldsymbol{\Sigma}_0)
\end{align}
Social scientists often use categorical variables (e.g., the authorship of the document) as covariates. Modeling the mean of the multivariate normal distribution with covariates allows us to create variation in the document-topic distribution even when two or more documents have the same set of covariates. The multivariate normal distribution can be generalized to the matrix normal distribution,
\begin{align}
\boldsymbol{\Psi} \sim \mathcal{MN}(\mathbf{M}, \mathbf{U}, \boldsymbol{\Sigma}_0),
\end{align}
where $\boldsymbol{\Psi}$ is a $D \times (K-1)$ matrix, each row of $\mathbf{M}$ is equal to $\boldsymbol{\lambda}^\top \mathbf{x}_d$, and $\mathbf{U}$ is the $D \times D$ identity matrix (documents are independent). This generalization allows a vectorized implementation.
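Because $\mathbf{U} = \mathbf{I}_D$, drawing the full matrix $\boldsymbol{\Psi}$ needs only one Cholesky factor of $\boldsymbol{\Sigma}_0$. A sketch of the vectorized draw (the function name is ours):

```python
import numpy as np

def draw_psi(X, lam, Sigma0, rng):
    """Draw Psi ~ MN(X @ lam, I_D, Sigma0) in one vectorized step."""
    D = X.shape[0]
    mean = X @ lam                          # row d is lam^T x_d
    L = np.linalg.cholesky(Sigma0)          # Sigma0 = L L^T
    Z = rng.standard_normal((D, lam.shape[1]))
    # With row covariance I_D, rows are independent draws mean_d + L z_d.
    return mean + Z @ L.T
```

Each row of the result is an independent $\mathcal{N}(\boldsymbol{\lambda}^\top \mathbf{x}_d, \boldsymbol{\Sigma}_0)$ draw, so a loop over documents is unnecessary.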

Estimation

We sample $\boldsymbol{\lambda}$, $\boldsymbol{\Psi}$, and the Pólya-gamma auxiliary variables $\boldsymbol{\omega}$.

Sampling $\boldsymbol{\Psi}$

Equation (1) has the same form as Theorem 1 of Polson et al. (2013), and we can introduce Pólya-gamma auxiliary variables.

\begin{align}
p(\mathbf{z}_d, \boldsymbol{\omega}_d \mid n_d, \boldsymbol{\psi}_d)
&\propto \prod_{k = 1}^{K - 1} \exp \big( (n_{dk} - N_{dk}/2) \psi_{dk} - \omega_{dk} \psi_{dk}^2 /2 \big) \\
&= \prod_{k = 1}^{K - 1} \exp \bigg( -\frac{\omega_{dk}}{2} \bigg( \psi_{dk}^2 - \textstyle \frac{2}{\omega_{dk}} (n_{dk} - N_{dk}/2) \psi_{dk} \bigg) \bigg) \\
&\propto \prod_{k = 1}^{K - 1} \exp \bigg( -\frac{\omega_{dk}}{2} \bigg( \psi_{dk} - \textstyle \frac{1}{\omega_{dk}}(n_{dk} - N_{dk}/2) \bigg)^2 \bigg) \\
&= \mathcal{N}\big( \boldsymbol{\psi}_d \mid \boldsymbol{\Omega}_d^{-1} \boldsymbol{\kappa}_d, \boldsymbol{\Omega}_{d}^{-1} \big) \\
\omega_{dk} &\sim \text{PG}(N_{dk}, \psi_{dk}) \quad \text{for } k = 1, \ldots, K-1 \\
\kappa_{dk} &= n_{dk} - \frac{N_{dk}}{2} \quad \text{for } k = 1, \ldots, K-1 \\
\boldsymbol{\Omega}_{d} &= \text{diag}(\omega_{d1}, \ldots, \omega_{d, K-1})
\end{align}
We can use the multivariate normal distribution to sample $\boldsymbol{\psi}_d$:
\begin{align}
p(\boldsymbol{\psi}_d \mid \mathbf{z}_d, \boldsymbol{\omega}_d)
&\propto p(\mathbf{z}_d \mid \boldsymbol{\psi}_d, \boldsymbol{\kappa}_d, \boldsymbol{\Omega}_d)\, p(\boldsymbol{\psi}_d \mid \boldsymbol{\Sigma}_0, \mathbf{X}, \boldsymbol{\lambda}) \\
&\propto \mathcal{N}(\boldsymbol{\psi}_d \mid \tilde{\boldsymbol{\mu}}, \tilde{\boldsymbol{\Sigma}}) \\
\tilde{\boldsymbol{\mu}} &= \tilde{\boldsymbol{\Sigma}}\big[\boldsymbol{\kappa}_d + \boldsymbol{\Sigma}_0^{-1} \boldsymbol{\lambda}^\top \mathbf{x}_d\big] \\
\tilde{\boldsymbol{\Sigma}} &= \big[\boldsymbol{\Omega}_d + \boldsymbol{\Sigma}_{0}^{-1}\big]^{-1},
\end{align}
where the second proportionality follows from the product of two Gaussians (Matrix Cookbook, Section 8.1.8).
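The augmented Gibbs update for $\boldsymbol{\psi}_d$ can be sketched as follows. Purely for illustration, we approximate the Pólya-gamma draw with a truncated version of its infinite sum-of-gammas representation; in practice a dedicated sampler (e.g., the `polyagamma` package) would replace `rand_pg`. The function names are ours:

```python
import numpy as np

def rand_pg(b, c, rng, trunc=200):
    """Approximate PG(b, c) via a truncated sum-of-gammas representation."""
    k = np.arange(1, trunc + 1)
    g = rng.gamma(shape=b, scale=1.0, size=trunc)
    return (g / ((k - 0.5) ** 2 + (c / (2 * np.pi)) ** 2)).sum() / (2 * np.pi ** 2)

def sample_psi_d(n_dk, N_dk, psi_d, x_d, lam, Sigma0_inv, rng):
    """One augmented Gibbs update for the (K-1)-vector psi_d."""
    # Step 1: draw the auxiliary variables omega_dk ~ PG(N_dk, psi_dk).
    omega = np.array([rand_pg(N, psi, rng) for N, psi in zip(N_dk, psi_d)])
    kappa = n_dk - N_dk / 2.0                      # kappa_dk = n_dk - N_dk/2
    # Step 2: draw psi_d from N(mu_tilde, Sigma_tilde).
    Sigma_t = np.linalg.inv(np.diag(omega) + Sigma0_inv)
    mu_t = Sigma_t @ (kappa + Sigma0_inv @ (lam.T @ x_d))
    return rng.multivariate_normal(mu_t, Sigma_t)
```

One check on the approximate sampler: $\mathbb{E}[\text{PG}(b, 0)] = b/4$, so averaging many `rand_pg(1.0, 0.0, rng)` draws should be close to 0.25.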

Sampling $\boldsymbol{\lambda}$

Sampling $\boldsymbol{\lambda}$ given $\boldsymbol{\Psi}$ and $\mathbf{X}$ is the same as Bayesian multivariate linear regression in Rossi et al. (2012, pp. 31-34).
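Under the matrix-normal prior of Equation (2), the conditional draw of $\boldsymbol{\lambda}$ given $\boldsymbol{\Sigma}_0$ has the standard conjugate form $\boldsymbol{\lambda} \mid \boldsymbol{\Psi}, \boldsymbol{\Sigma}_0 \sim \mathcal{MN}(\boldsymbol{\mu}_n, \boldsymbol{\Lambda}_n^{-1}, \boldsymbol{\Sigma}_0)$ with $\boldsymbol{\Lambda}_n = \mathbf{X}^\top \mathbf{X} + \boldsymbol{\Lambda}_0$ and $\boldsymbol{\mu}_n = \boldsymbol{\Lambda}_n^{-1}(\mathbf{X}^\top \boldsymbol{\Psi} + \boldsymbol{\Lambda}_0 \boldsymbol{\mu}_0)$. A minimal sketch (function name is ours; not the implementation in Rossi et al.):

```python
import numpy as np

def sample_lambda(Psi, X, mu0, Lambda0, Sigma0, rng):
    """Draw lam | Psi, Sigma0 ~ MN(mu_n, Lambda_n^{-1}, Sigma0)."""
    Lambda_n = X.T @ X + Lambda0                 # posterior precision (rows)
    Lambda_n_inv = np.linalg.inv(Lambda_n)
    mu_n = Lambda_n_inv @ (X.T @ Psi + Lambda0 @ mu0)
    A = np.linalg.cholesky(Lambda_n_inv)         # row covariance factor
    B = np.linalg.cholesky(Sigma0)               # column covariance factor
    Z = rng.standard_normal(mu_n.shape)
    return mu_n + A @ Z @ B.T                    # matrix normal draw
```

As a sanity check, an extremely tight prior precision $\boldsymbol{\Lambda}_0$ should pin the draw to the prior mean $\boldsymbol{\mu}_0$.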

Ordering effect

The stick-breaking representation of the multinomial distribution has a potential ordering issue (Zhang and Zhou, 2017). We can regard it as a distribution that orders categories according to their proportions. This can be an issue in our model because topics are pre-labeled, and their order does not necessarily match the proportions of the topics.

References

  • Linderman, S. W., Johnson, M. J., & Adams, R. P. (2015). Dependent multinomial models made easy: Stick breaking with the Pólya-gamma augmentation. Advances in Neural Information Processing Systems, 2015, 3456-3464.
  • Polson, N. G., Scott, J. G., & Windle, J. (2013). Bayesian inference for logistic models using Pólya-gamma latent variables. Journal of the American Statistical Association, 108(504), 1339-1349.
  • Rossi, P. E., Allenby, G. M., & McCulloch, R. (2012). Bayesian statistics and marketing. John Wiley & Sons.
  • Zhang, Q., & Zhou, M. (2017). Permuted and augmented stick-breaking Bayesian multinomial regression. The Journal of Machine Learning Research, 18(1), 7479-7511.