Model

Which model to use?

The table below compares keyATM models and other popular topic models by the inputs they accept.

Model                               Keywords  Covariate  Time Structure
keyATM Base                         ○         ×          ×
keyATM Covariate                    ○         ○
keyATM HMM                          ○         ×          ○
keyATM Label                        ○         ×          ×
LDA Weighted                        ×         ×          ×
LDA Weighted Cov                    ×         ○          ×
LDA Weighted HMM                    ×         ×          ○
Latent Dirichlet Allocation (LDA)   ×         ×          ×
Structural Topic Model (STM)        ×         ○

(○: model incorporates the feature, ×: model does not incorporate the feature, △: model can handle the feature but with some limitations)

The next table compares inference methods and speeds. CGS stands for Collapsed Gibbs Sampling and SS stands for Slice Sampling. Variational inference approximates the target distribution, while CGS and SS sample from the exact distribution.

Model                               Inference             Speed
keyATM Base                         CGS + SS              Fast
keyATM Covariate                    CGS + SS              Moderate (Depends on # of covariates)
keyATM HMM                          CGS + SS              Fast
keyATM Label                        CGS + SS              Fast
LDA Weighted                        CGS + SS              Fast
LDA Weighted Cov                    CGS + SS              Moderate (Depends on # of covariates)
LDA Weighted HMM                    CGS + SS              Fast
Latent Dirichlet Allocation (LDA)   Variational EM / CGS  Depends on implementation
Structural Topic Model (STM)        Variational EM        Very Fast

Preprocessing

Can we use n-grams?

Yes, but preprocessing needs an extra step. Let’s try a bigram model. Start from a tokens object (see Preparation). quanteda can then create an n-gram tokens object (see quanteda’s manual for details):

data_tokens_n2 <- tokens_ngrams(data_tokens, n = 2)  # bigram
head(data_tokens_n2[[1]], 3)
## [1] "fellow-citizens_senate" "senate_house"           "house_representatives"

You can pass this preprocessed object to keyATM just as you would for a unigram model.

data_dfm_n2 <- dfm(data_tokens_n2) %>%
                 dfm_trim(min_termfreq = 3, min_docfreq = 2)
keyATM_docs_n2 <- keyATM_read(data_dfm_n2)

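Keywords must match the n-gram format: each keyword is a bigram whose words are joined by an underscore, exactly as it appears in the tokens object.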

keywords_n2 <- list(Government = c("federal_government", "vice_president"),
                    People     = c("men_women", "fellow_citizens"),
                    Peace      = c("peace_world"))
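A keyword that does not match any n-gram in the vocabulary cannot guide its topic, so it can help to check the keyword list before fitting. The following is a minimal sketch in base R, not part of keyATM: `missing_keywords()`, `vocab`, and `keywords_demo` are hypothetical names introduced for illustration, and the toy vocabulary stands in for the feature names of your own document-feature matrix.

```r
# Hypothetical helper (not a keyATM function): for each topic, list the
# keywords that do not appear in the vocabulary.
missing_keywords <- function(keywords, vocabulary) {
  missing <- lapply(keywords, function(words) setdiff(words, vocabulary))
  missing[lengths(missing) > 0]  # keep only topics with at least one miss
}

# Toy bigram vocabulary for illustration. With quanteda, you would pass
# the feature names of your own dfm, e.g. featnames(data_dfm_n2).
vocab <- c("federal_government", "vice_president", "men_women",
           "fellow_citizens", "peace_world")

# "peace" is a unigram, so it cannot match a bigram vocabulary.
keywords_demo <- list(Government = c("federal_government", "vice_president"),
                      People     = c("men_women", "fellow_citizens"),
                      Peace      = c("peace_world", "peace"))

result <- missing_keywords(keywords_demo, vocab)
print(result)
```

An empty list means every keyword was found. Any keyword that is reported should be rewritten as an n-gram or dropped.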

Then, you can fit the keyATM models (here we use the base model).

out <- keyATM(docs              = keyATM_docs_n2,    # text input
              no_keyword_topics = 3,                 # number of topics without keywords
              keywords          = keywords_n2,       # keywords
              model             = "base",            # select the model
              options           = list(seed = 250))
top_words(out, 5)
##             1_Government            2_People           3_Peace
## 1 federal_government [✓] fellow_citizens [✓]        four_years
## 2     vice_president [✓]       united_states         years_ago
## 3            one_another     american_people political_parties
## 4             good_faith       men_women [✓]   foreign_nations
## 5            oath_office       people_united         old_world
##               Other_1           Other_2                   Other_3
## 1  general_government government_people             united_states
## 2 constitution_united     among_nations              among_people
## 3         public_debt         god_bless administration_government
## 4   constitution_laws     nations_world             within_limits
## 5     free_government      great_nation          chief_magistrate

Fitting

It takes time to fit the model. What should I do?

Please note that the number of unique words, the total length of the documents, and the number of topics all affect the speed. If you use the covariate model, the number of covariates matters as well, because we need to estimate coefficients for the covariates.

If you want to speed up fitting, the first thing to do is review your preprocessing steps. Documents usually include many low-frequency words that do not help interpretation. quanteda provides various functions to trim those words.
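As a rough illustration of trimming, the sketch below builds a document-feature matrix from a toy corpus and compares its size before and after `dfm_trim()`. The texts and the thresholds are assumptions chosen for the example; suitable cutoffs depend on your corpus.

```r
library(quanteda)

# Toy corpus for illustration only; substitute your own documents.
txt <- c(d1 = "The federal government serves the people.",
         d2 = "The people trust the federal government.",
         d3 = "Peace in the world requires patience and zymurgy.")

toks     <- tokens(txt, remove_punct = TRUE)
dfm_full <- dfm(toks)  # lowercases features by default

# Keep features that occur at least twice overall and in at least two
# documents. These thresholds are arbitrary example values.
dfm_small <- dfm_trim(dfm_full, min_termfreq = 2, min_docfreq = 2)

# The number of unique words and the total document length both affect speed.
size_before <- c(unique_words = nfeat(dfm_full),
                 total_length = sum(ntoken(dfm_full)))
size_after  <- c(unique_words = nfeat(dfm_small),
                 total_length = sum(ntoken(dfm_small)))
print(rbind(size_before, size_after))
```

Trimming can leave some documents with very few words or none, so it is worth inspecting `ntoken()` on the trimmed object before fitting.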

Can I run keyATM on cloud computing services?

Yes! For example, Professor Louis Aslett provides an easy-to-use Amazon Machine Image of RStudio. When you select an instance, note that keyATM does not need many cores (one or two are enough, because Collapsed Gibbs Sampling cannot be parallelized), but make sure the instance has enough memory to handle your data.

Is a theta matrix stored with store_theta set to TRUE directly interpretable as samples from the posterior, and thus appropriate for estimating uncertainty?

Yes. Because we use Collapsed Gibbs Sampling, thetas are not sampled directly from the posterior distribution. Instead, the store_theta option calculates the marginal posterior (Equation 11 in our paper) at each iteration, so the stored values can be used to assess uncertainty.
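One way to turn per-iteration thetas into uncertainty estimates is to summarize them across iterations. The sketch below assumes the stored values are available as a list with one documents-by-topics matrix per iteration. Where the fitted object keeps them is not stated here, so `theta_draws` is a simulated, hypothetical stand-in, and the burn-in length is an arbitrary example value.

```r
set.seed(1)
n_iter <- 200; n_docs <- 4; n_topics <- 3

# Hypothetical stand-in for stored thetas: each matrix has one row per
# document, and each row sums to one.
theta_draws <- lapply(seq_len(n_iter), function(i) {
  g <- matrix(rgamma(n_docs * n_topics, shape = 2), n_docs, n_topics)
  g / rowSums(g)
})

# Discard early iterations (the burn-in length is an assumption).
burn_in <- 50
kept    <- theta_draws[-seq_len(burn_in)]

# Stack into an iterations x documents x topics array.
theta_arr <- aperm(simplify2array(kept), c(3, 1, 2))

# Posterior mean and 95% interval for every document-topic proportion.
post_mean <- apply(theta_arr, c(2, 3), mean)
post_lo   <- apply(theta_arr, c(2, 3), quantile, probs = 0.025)
post_hi   <- apply(theta_arr, c(2, 3), quantile, probs = 0.975)

print(round(post_mean, 3))
```

The same pattern works for any function of theta, such as the difference in a topic's proportion between two groups of documents: compute it within each iteration, then summarize across iterations.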