Model

An alternative approach to modeling covariates is to use Pólya-Gamma augmentation. Polson et al. (2013) propose a strategy that uses Pólya-Gamma latent variables for fully Bayesian inference with binomial likelihoods. Linderman et al. (2015) use it to develop models for categorical and multinomial data with dependencies among the multinomial parameters. We extend their method to incorporate covariates.

We use the same notation as in the main text. Our corpus contains \(D\) documents. We observe a \(D \times M\) matrix \(\mathbf{X}\) of \(M\) covariates. The topic model assigns one of \(K\) topics to each observed word; \(\mathbf{z}_d\) is the vector of assigned topics for document \(d\), \(n_d\) is the number of words in document \(d\), and \(n_{dk}\) is the number of those words assigned to topic \(k\). We introduce the stick-breaking representation of the multinomial distribution (\(\textsf{SB-Multi}\)) described in Linderman et al. (2015), \[\begin{align} \mathbf{z}_{d} &\sim \textsf{SB-Multi}(n_d, \boldsymbol{\psi}_d). \end{align}\] \(\textsf{SB-Multi}\) rewrites the \(K\)-dimensional multinomial distribution as a product of \(K-1\) binomial distributions.

\[\begin{align} \textsf{SB-Multi}(\mathbf{z}_d \mid n_d, \boldsymbol{\psi}_d) &= \prod_{k=1}^{K-1} \textsf{Bin}(n_{dk} \mid n_d - \textstyle\sum_{k' < k} n_{dk'}, \psi_{dk}) \\ &= \prod_{k=1}^{K-1} \binom{N_{dk}}{n_{dk}} \bigg( \frac{\exp(\psi_{dk})}{1 + \exp(\psi_{dk})} \bigg)^{n_{dk}} \bigg( \frac{1}{1 + \exp(\psi_{dk})} \bigg)^{N_{dk} - n_{dk}} \\ &= \prod_{k=1}^{K-1} \binom{N_{dk}}{n_{dk}} \frac{\exp(\psi_{dk})^{n_{dk}}}{(1 + \exp(\psi_{dk}))^{N_{dk}}} \tag{1} \label{eq:sb-multi} \\ N_{dk} &= n_d - \sum_{k' < k} n_{dk'} \end{align}\]
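As a quick numerical check of equation (1), the following sketch evaluates the last line of the stick-breaking likelihood and compares it with an ordinary multinomial whose probabilities are obtained by breaking the stick. The function names `sb_to_probs` and `sb_multi_logpmf` and the example values are hypothetical and are not part of the original implementation.

```python
from math import lgamma
import numpy as np

def sb_to_probs(psi):
    """Map K-1 stick-breaking parameters to K multinomial probabilities."""
    sigma = 1.0 / (1.0 + np.exp(-np.asarray(psi, dtype=float)))
    # Stick length left before each break: prod_{k' < k} (1 - sigma_k')
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - sigma)))
    return np.concatenate((sigma * remaining[:-1], remaining[-1:]))

def sb_multi_logpmf(counts, psi):
    """Log of the last line of equation (1), summed over k = 1..K-1."""
    counts = np.asarray(counts, dtype=float)
    psi = np.asarray(psi, dtype=float)
    # N_dk = n_d - sum_{k' < k} n_dk'
    N = counts.sum() - np.concatenate(([0.0], np.cumsum(counts)[:-2]))
    n = counts[:-1]
    log_coef = np.array([lgamma(a + 1) - lgamma(b + 1) - lgamma(a - b + 1)
                         for a, b in zip(N, n)])
    # log of exp(psi)^n / (1 + exp(psi))^N, computed stably
    return float(np.sum(log_coef + n * psi - N * np.logaddexp(0.0, psi)))

counts = [3, 1, 2]   # hypothetical topic counts n_dk for K = 3
psi = [0.2, -0.5]    # hypothetical psi_d
pi = sb_to_probs(psi)
```

The implied probabilities `pi` sum to one, and `sb_multi_logpmf(counts, psi)` should agree with the multinomial log-probability of `counts` under `pi`.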

We model the parameters \(\boldsymbol{\psi}_d\) in equation (1) with the covariates. First, we assume that the coefficients follow a multivariate normal distribution. \(\boldsymbol{\lambda}\) is an \(M \times (K-1)\) matrix, so we apply the vectorization transformation and draw all of its elements from a single multivariate normal distribution.

\[\begin{align} \text{vec}({\boldsymbol{\lambda}}) &\sim \mathcal{N}(\text{vec}(\boldsymbol{\mu}_0), \boldsymbol{\Sigma}_0 \otimes \boldsymbol{\Lambda}_0^{-1}) \tag{2} \label{eq:prior_lambda} \\ \boldsymbol{\Sigma}_0 &\sim \mathcal{W}^{-1}(\mathbf{V}_0, \boldsymbol{\nu}_0) \end{align}\] where we use the prior values \(\boldsymbol{\mu}_0 = \mathbf{0}\), \(\boldsymbol{\Sigma}_0\) is the \((K-1) \times (K-1)\) identity matrix (for topics), and \(\boldsymbol{\Lambda}_0\) is the \(M \times M\) identity matrix (for covariates). In this case \(\boldsymbol{\Sigma}_0 \otimes \boldsymbol{\Lambda}_0^{-1}\) is a diagonal matrix, and equation (2) is equivalent to \(\boldsymbol{\lambda}_{k} \sim \mathcal{N}((\boldsymbol{\mu}_{0})_{k}, (\boldsymbol{\Sigma}_0)_{kk} \boldsymbol{\Lambda}_0^{-1}), \text{ for } k = 1, \ldots, K-1\), where \(\boldsymbol{\lambda}_k\) is the \(k\)-th column of \(\boldsymbol{\lambda}\).
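The prior in equation (2) can be sketched in a few lines, assuming that \(\text{vec}(\cdot)\) stacks the columns of \(\boldsymbol{\lambda}\) (column-major order). The sizes and variable names below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
M, K = 3, 4                      # hypothetical numbers of covariates and topics
mu0 = np.zeros((M, K - 1))       # prior mean of lambda
Sigma0 = np.eye(K - 1)           # topic-side covariance
Lambda0 = np.eye(M)              # covariate-side precision

# Covariance of vec(lambda): block (k, k') equals Sigma0[k, k'] * inv(Lambda0)
cov = np.kron(Sigma0, np.linalg.inv(Lambda0))

# Draw vec(lambda) once, then undo the column stacking
vec_lambda = rng.multivariate_normal(mu0.reshape(-1, order="F"), cov)
lam = vec_lambda.reshape((M, K - 1), order="F")
```

With identity matrices for both factors, `cov` is itself an identity matrix, so each column of `lam` is an independent standard normal draw.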

Next, we use covariates to model the parameters \(\boldsymbol{\psi}_d\) in equation (1). \[\begin{align} \boldsymbol{\psi}_d &\sim \mathcal{N}(\boldsymbol{\lambda}^\top \mathbf{x}_d, \boldsymbol{\Sigma}_0) \end{align}\] Social scientists often use categorical variables (e.g., authorship of the document) as covariates. Modeling the mean of the multivariate normal distribution with covariates, rather than fixing \(\boldsymbol{\psi}_d\) at that mean, allows the document-topic distribution to vary even when two or more documents have the same set of covariates. The multivariate normal distribution can be generalized to the matrix normal distribution. \[\begin{align} \boldsymbol{\Psi}\sim \mathcal{MN}(\mathbf{M}, \mathbf{U}, \boldsymbol{\Sigma}_0), \end{align}\] where \(\boldsymbol{\Psi}\) is a \(D \times (K-1)\) matrix, the \(d\)-th row of \(\mathbf{M}\) is \((\boldsymbol{\lambda}^\top \mathbf{x}_d)^\top\) (that is, \(\mathbf{M} = \mathbf{X}\boldsymbol{\lambda}\)), and \(\mathbf{U}\) is the \(D\times D\) identity matrix (documents are independent). This generalization allows a vectorized implementation.
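Because \(\mathbf{U}\) is the identity, a matrix normal draw of \(\boldsymbol{\Psi}\) reduces to adding row-wise correlated noise to \(\mathbf{X}\boldsymbol{\lambda}\). The sketch below illustrates this under hypothetical sizes and a hypothetical non-identity \(\boldsymbol{\Sigma}_0\) (chosen only so that the covariance structure is visible); `sample_psi_prior` is not a function from the original implementation.

```python
import numpy as np

def sample_psi_prior(X, lam, Sigma0, rng):
    """Draw Psi ~ MN(X @ lam, I_D, Sigma0) for all documents at once."""
    D = X.shape[0]
    L = np.linalg.cholesky(Sigma0)               # Sigma0 = L @ L.T
    E = rng.standard_normal((D, Sigma0.shape[0]))
    return X @ lam + E @ L.T                     # each row has covariance Sigma0

rng = np.random.default_rng(0)
D, M, K = 50000, 2, 3                            # hypothetical sizes
X = rng.standard_normal((D, M))
lam = rng.standard_normal((M, K - 1))
Sigma0 = np.array([[1.0, 0.3], [0.3, 2.0]])      # hypothetical covariance
Psi = sample_psi_prior(X, lam, Sigma0, rng)
```

The empirical covariance of the residuals `Psi - X @ lam` should be close to `Sigma0`, which is one way to check that the vectorized draw matches the row-by-row definition.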

Estimation

We sample \(\boldsymbol{\lambda}\), \(\boldsymbol{\Psi}\), and the Pólya-Gamma auxiliary variables \(\boldsymbol{\omega}\).

Sampling \(\boldsymbol{\Psi}\)

Equation (1) has the same form as Theorem 1 of Polson et al. (2013), so we can introduce Pólya-Gamma auxiliary variables.

\[\begin{align} p(\mathbf{z}_d, \boldsymbol{\omega}_d \mid n_d, \boldsymbol{\psi}_d) &\propto \prod_{k = 1}^{K - 1} \exp \big( (n_{dk} - {N_{dk}}/{2}) \psi_{dk} - \omega_{dk} \psi_{dk}^2 /2 \big) \\ &= \prod_{k = 1}^{K - 1} \exp \bigg( -\frac{\omega_{dk}}{2} \bigg( \psi_{dk}^2 - \textstyle \frac{2}{\omega_{dk}} (n_{dk} - N_{dk}/2) \psi_{dk} \bigg) \bigg) \\ &\propto \prod_{k = 1}^{K - 1} \exp \bigg( -\frac{\omega_{dk}}{2} \bigg( \psi_{dk} - \textstyle \frac{1}{\omega_{dk}}(n_{dk} - N_{dk}/2) \bigg)^2 \bigg) \\ &\propto \mathcal{N}\big( \boldsymbol{\psi}_d \mid \boldsymbol{\Omega}_d^{-1} \boldsymbol{\kappa}_d, \boldsymbol{\Omega}_{d}^{-1} \big) \\ \omega_{dk} &\sim \text{PG}(N_{dk}, \psi_{dk}) \text{ for } k = 1, \ldots, K-1 \\ \kappa_{dk} &= n_{dk} - \frac{N_{dk}}{2} \text{ for } k = 1, \ldots, K-1 \\ \boldsymbol{\Omega}_{d} &= \text{diag}(\omega_{d1}, \ldots, \omega_{d, K-1}) \end{align}\] Viewed as a function of \(\boldsymbol{\psi}_d\), the augmented likelihood is therefore Gaussian, and we can use the multivariate normal distribution to sample \(\boldsymbol{\psi}_d\). \[\begin{align} p(\boldsymbol{\psi}_d \mid \mathbf{z}_d, \boldsymbol{\omega}_d) &\propto p(\mathbf{z}_d \mid \boldsymbol{\psi}_d, \boldsymbol{\kappa}_d, \boldsymbol{\Omega}_d) p(\boldsymbol{\psi}_d \mid \boldsymbol{\Sigma}_0, \mathbf{X}, \boldsymbol{\lambda})\\ &\propto \mathcal{N}(\boldsymbol{\psi}_d \mid \tilde{\boldsymbol{\mu}}, \tilde{\boldsymbol{\Sigma}}) \\ \tilde{\boldsymbol{\mu}} &= \tilde{\boldsymbol{\Sigma}}[\boldsymbol{\kappa}_d + \boldsymbol{\Sigma}_0^{-1} \boldsymbol{\lambda}^\top \mathbf{x}_d] \\ \tilde{\boldsymbol{\Sigma}} &= [\boldsymbol{\Omega}_d + \boldsymbol{\Sigma}_{0}^{-1}]^{-1}, \end{align}\] where the second proportionality follows from the Matrix Cookbook, Section 8.1.8 (product of Gaussians).
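One Gibbs update of \(\boldsymbol{\omega}_d\) and \(\boldsymbol{\psi}_d\) for a single document can be sketched as follows. This is only an illustration under stated assumptions: the Pólya-Gamma draw uses a truncated version of the infinite sum of gamma variables that defines the distribution in Polson et al. (2013), not their exact sampler, and the function names, truncation level, and example values are hypothetical.

```python
import numpy as np

def sample_pg_approx(b, c, rng, trunc=200):
    """Approximate PG(b, c) draw from a truncated sum of gamma variables."""
    if b == 0:
        return 0.0                                   # PG(0, c) is a point mass at 0
    k = np.arange(1, trunc + 1)
    g = rng.gamma(shape=b, scale=1.0, size=trunc)
    denom = (k - 0.5) ** 2 + c ** 2 / (4.0 * np.pi ** 2)
    return float(np.sum(g / denom) / (2.0 * np.pi ** 2))

def sample_psi_d(counts, x_d, lam, Sigma0, psi_d, rng):
    """Draw omega_d given psi_d, then psi_d given omega_d, for one document."""
    counts = np.asarray(counts, dtype=float)
    # N_dk = n_d - sum_{k' < k} n_dk' and kappa_dk = n_dk - N_dk / 2
    N = counts.sum() - np.concatenate(([0.0], np.cumsum(counts)[:-2]))
    kappa = counts[:-1] - N / 2.0
    omega = np.array([sample_pg_approx(N[k], psi_d[k], rng)
                      for k in range(len(N))])
    Sigma0_inv = np.linalg.inv(Sigma0)
    Sigma_tilde = np.linalg.inv(np.diag(omega) + Sigma0_inv)
    Sigma_tilde = (Sigma_tilde + Sigma_tilde.T) / 2.0   # enforce symmetry
    mu_tilde = Sigma_tilde @ (kappa + Sigma0_inv @ (lam.T @ x_d))
    return rng.multivariate_normal(mu_tilde, Sigma_tilde), omega

rng = np.random.default_rng(1)
counts = np.array([4, 1, 3])       # hypothetical n_dk, K = 3
x_d = np.array([1.0, 0.0])         # hypothetical covariates, M = 2
lam = np.zeros((2, 2))             # M x (K-1) coefficients
psi_d, omega_d = sample_psi_d(counts, x_d, lam, np.eye(2), np.zeros(2), rng)
```

In practice one would likely replace `sample_pg_approx` with an exact Pólya-Gamma sampler from a dedicated library; the truncated sum is used here only to keep the sketch self-contained.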

Sampling \(\boldsymbol{\lambda}\)

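Given \(\boldsymbol{\Psi}\) and \(\mathbf{X}\), sampling \(\boldsymbol{\lambda}\) is the same as in the Bayesian multivariate linear regression of Rossi et al. (2012, pp. 31-34), with \(\boldsymbol{\Psi}\) as the outcome and \(\mathbf{X}\) as the regressors.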
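Under the prior in equation (2), the standard conjugate result gives \(\text{vec}(\boldsymbol{\lambda}) \mid \boldsymbol{\Psi}, \mathbf{X}, \boldsymbol{\Sigma}_0 \sim \mathcal{N}(\text{vec}(\boldsymbol{\mu}_n), \boldsymbol{\Sigma}_0 \otimes \boldsymbol{\Lambda}_n^{-1})\) with \(\boldsymbol{\Lambda}_n = \mathbf{X}^\top\mathbf{X} + \boldsymbol{\Lambda}_0\) and \(\boldsymbol{\mu}_n = \boldsymbol{\Lambda}_n^{-1}(\mathbf{X}^\top\boldsymbol{\Psi} + \boldsymbol{\Lambda}_0\boldsymbol{\mu}_0)\). The sketch below implements this draw while holding \(\boldsymbol{\Sigma}_0\) fixed (the inverse-Wishart update is omitted); the symbols \(\boldsymbol{\mu}_n\) and \(\boldsymbol{\Lambda}_n\), the function name, and the simulated data are assumptions made for illustration.

```python
import numpy as np

def sample_lambda(Psi, X, mu0, Lambda0, Sigma0, rng):
    """Draw lambda | Psi, X, Sigma0 under the conjugate prior in equation (2)."""
    Lambda_n = X.T @ X + Lambda0
    mu_n = np.linalg.solve(Lambda_n, X.T @ Psi + Lambda0 @ mu0)
    # Matrix normal draw: lambda = mu_n + A Z B^T, where
    # A A^T = inv(Lambda_n) (covariate side) and B B^T = Sigma0 (topic side)
    A = np.linalg.cholesky(np.linalg.inv(Lambda_n))
    B = np.linalg.cholesky(Sigma0)
    Z = rng.standard_normal(mu_n.shape)
    return mu_n + A @ Z @ B.T

rng = np.random.default_rng(2)
D, M, K = 2000, 2, 3                              # hypothetical sizes
X = rng.standard_normal((D, M))
lam_true = np.array([[1.0, -0.5], [0.3, 2.0]])    # hypothetical coefficients
Sigma0 = 0.01 * np.eye(K - 1)
Psi = X @ lam_true + 0.1 * rng.standard_normal((D, K - 1))
draw = sample_lambda(Psi, X, np.zeros((M, K - 1)), np.eye(M), Sigma0, rng)
```

With this much simulated data and little noise, `draw` should lie close to `lam_true`, which gives a simple sanity check on the update.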

Ordering effect

The stick-breaking representation of the multinomial distribution has a potential ordering issue (Zhang and Zhou, 2017). We can regard \(\textsf{SB-Multi}\) as a distribution that orders categories according to their proportions. This can be an issue in our model because topics are pre-labeled, and their order does not necessarily match the proportions of the topics.
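A small numerical illustration of the ordering effect: when every \(\psi_{dk}\) is zero, or is drawn from a zero-mean normal distribution, the implied topic proportions are not uniform but decrease with the topic index. The helper `sb_to_probs` and the sizes used here are hypothetical.

```python
import numpy as np

def sb_to_probs(psi):
    """Map K-1 stick-breaking parameters to K multinomial probabilities."""
    sigma = 1.0 / (1.0 + np.exp(-np.asarray(psi, dtype=float)))
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - sigma)))
    return np.concatenate((sigma * remaining[:-1], remaining[-1:]))

# K = 5 topics with all psi_dk = 0: each break takes half of the remaining stick
pi_zero = sb_to_probs(np.zeros(4))
print(pi_zero)   # proportions 0.5, 0.25, 0.125, 0.0625, 0.0625

# Average proportions when psi_dk ~ N(0, 1) independently
rng = np.random.default_rng(0)
draws = rng.standard_normal((20000, 4))
mean_pi = np.mean([sb_to_probs(p) for p in draws], axis=0)
```

Both `pi_zero` and `mean_pi` put the most mass on the first topic, so a symmetric prior on \(\boldsymbol{\psi}_d\) is not symmetric across topic labels.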

References

  • Linderman, S. W., Johnson, M. J., & Adams, R. P. (2015). Dependent multinomial models made easy: Stick breaking with the Pólya-gamma augmentation. Advances in Neural Information Processing Systems, 2015, 3456-3464.
  • Polson, N. G., Scott, J. G., & Windle, J. (2013). Bayesian inference for logistic models using Pólya-Gamma latent variables. Journal of the American Statistical Association, 108(504), 1339-1349.
  • Rossi, P. E., Allenby, G. M., & McCulloch, R. (2012). Bayesian statistics and marketing. John Wiley & Sons.
  • Zhang, Q., & Zhou, M. (2017). Permuted and augmented stick-breaking Bayesian multinomial regression. The Journal of Machine Learning Research, 18(1), 7479-7511.