The table below summarizes keyATM models and other popular models based on the inputs they can use.

| Model | Keywords | Covariates | Time structure |
|---|---|---|---|
| LDA Weighted Cov | × | ○ | × |
| LDA Weighted HMM | × | × | ○ |
| Latent Dirichlet Allocation (LDA) | × | × | × |
| Structural Topic Model (STM) | × | ○ | △ |

(○: the model incorporates the feature, ×: the model does not incorporate the feature, △: the model can handle the feature but with some limitations)
The next table compares inference methods and speeds. CGS stands for Collapsed Gibbs Sampling and SS stands for Slice Sampling. Variational inference approximates the target distribution, while CGS and SS sample from the exact distribution.

| Model | Inference | Speed |
|---|---|---|
| keyATM Base | CGS + SS | Fast |
| keyATM Covariate | CGS + SS | Moderate (depends on # of covariates) |
| keyATM HMM | CGS + SS | Fast |
| keyATM Label | CGS + SS | Fast |
| LDA Weighted | CGS + SS | Fast |
| LDA Weighted Cov | CGS + SS | Moderate (depends on # of covariates) |
| LDA Weighted HMM | CGS + SS | Fast |
| Latent Dirichlet Allocation (LDA) | Variational EM / CGS | Depends on implementation |
| Structural Topic Model (STM) | Variational EM | Very Fast |
Yes, but you need an extra step in the preprocessing. Let's try a bigram model. You need a tokens object (see Preparation). quanteda can then create an n-gram tokens object (see quanteda's manual for details).
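A minimal sketch, assuming the tokens object from the Preparation section is named `data_tokens` (use your own object name):

```r
library(quanteda)

# Create a bigram tokens object; adjacent tokens are
# joined with "_" by default
data_tokens_n2 <- tokens_ngrams(data_tokens, n = 2)

# Inspect the first few bigrams of the first document
head(data_tokens_n2[[1]], 3)
```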
##  "fellow-citizens_senate" "senate_house" "house_representatives"
You can pass this preprocessed object to keyATM just as you would in the unigram model.
data_dfm_n2 <- dfm(data_tokens_n2) %>%
  dfm_trim(min_termfreq = 3, min_docfreq = 2)
keyATM_docs_n2 <- keyATM_read(data_dfm_n2)
Keywords should respect the n-gram format (for example, `federal_government` rather than `government`).

Then you can fit the keyATM models (here we use the base model).
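A hedged sketch of the fitting step; the keyword list below is illustrative (it mirrors the checked words in the output that follows), so pick bigram keywords that suit your own corpus:

```r
library(keyATM)

# Illustrative bigram keywords (hypothetical; choose your own)
keywords_n2 <- list(
  Government = c("federal_government", "vice_president"),
  People     = c("fellow_citizens", "men_women"),
  Peace      = c("foreign_nations")
)

out_n2 <- keyATM(
  docs              = keyATM_docs_n2,
  no_keyword_topics = 3,
  keywords          = keywords_n2,
  model             = "base"
)
top_words(out_n2)
```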
##   1_Government           2_People             3_Peace
## 1 federal_government [✓] fellow_citizens [✓]  four_years
## 2 vice_president [✓]     united_states        years_ago
## 3 one_another            american_people      political_parties
## 4 good_faith             men_women [✓]        foreign_nations
## 5 oath_office            people_united        old_world
##   Other_1                Other_2              Other_3
## 1 general_government     government_people    united_states
## 2 constitution_united    among_nations        among_people
## 3 public_debt            god_bless            administration_government
## 4 constitution_laws      nations_world        within_limits
## 5 free_government        great_nation         chief_magistrate
Please note that the number of unique words, the total length of the documents, and the number of topics affect the speed. If you use the covariate model, the number of covariates matters as well, because we need to estimate the coefficients for the covariates.
If you want to speed up fitting, the first thing to do is review your preprocessing steps. Documents usually include many low-frequency words that do not help interpretation, and quanteda provides various functions to trim those words.
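As a sketch, where the thresholds are illustrative and `data_dfm` stands in for your own dfm object:

```r
library(quanteda)

# Drop words that appear fewer than 5 times overall or in
# fewer than 2 documents (tune these thresholds for your corpus)
data_dfm_small <- dfm_trim(data_dfm, min_termfreq = 5, min_docfreq = 2)
```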
Yes! For example, Professor Louis Aslett provides an easy-to-use Amazon Machine Image of RStudio here. When you select an instance, note that keyATM does not need multiple cores (one or two cores are enough, because Collapsed Gibbs Sampling cannot be parallelized), but make sure the memory can handle your data.
Are thetas stored with `store_theta = TRUE` directly interpretable as samples from the posterior and thus appropriate for estimating uncertainty?
Yes. Since we use Collapsed Gibbs sampling, thetas are not sampled directly from the posterior distribution. Instead, the `store_theta` option calculates the marginal posterior (Equation 11 in our paper) for each iteration, so we can use it to quantify uncertainty.
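A minimal sketch of requesting stored thetas; `keyATM_docs` and `keywords` are assumed to exist from the earlier preparation steps:

```r
library(keyATM)

# Store the marginal posterior of theta at each iteration
out <- keyATM(
  docs              = keyATM_docs,
  no_keyword_topics = 3,
  keywords          = keywords,
  model             = "base",
  options           = list(store_theta = TRUE)
)
```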