The table below summarizes keyATM models and other popular topic models by the inputs they can use.
Model | Keywords | Covariate | Time Structure |
---|---|---|---|
keyATM Base | ○ | × | × |
keyATM Covariate | ○ | ○ | △ |
keyATM Dynamic | ○ | × | ○ |
LDA Weighted | × | × | × |
LDA Weighted Cov | × | ○ | × |
LDA Weighted HMM | × | × | ○ |
Latent Dirichlet Allocation (LDA) | × | × | × |
Structural Topic Model (STM) | × | ○ | △ |
(○: model incorporates the feature, ×: model does not incorporate the feature, △: model can handle the feature but with some limitations)
The next table compares inference methods and speeds. CGS stands for Collapsed Gibbs Sampling and SS stands for Slice Sampling. Variational inference approximates the target distribution, while CGS and SS sample from the exact distribution.
Model | Inference | Speed |
---|---|---|
keyATM Base | CGS + SS | Fast |
keyATM Covariate | CGS + SS / PG | Moderate (Depends on # of covariates) |
keyATM Dynamic | CGS + SS | Fast |
LDA Weighted | CGS + SS | Fast |
LDA Weighted Cov | CGS + SS | Moderate (Depends on # of covariates) |
LDA Weighted HMM | CGS + SS | Fast |
Latent Dirichlet Allocation (LDA) | Variational EM / CGS | Depends on implementation |
Structural Topic Model (STM) | Variational EM | Very Fast |
Yes, but you need an extra preprocessing step. Let's try a bigram model. You need a tokens object (see Preparation). Then quanteda can create an n-gram tokens object (see quanteda's manual for details):
data_tokens_n2 <- tokens_ngrams(data_tokens, n = 2) # bigram
head(data_tokens_n2[[1]], 3)
## [1] "fellow-citizens_senate" "senate_house" "house_representatives"
You can pass this preprocessed object to keyATM just as you would for the unigram model.
data_dfm_n2 <- dfm(data_tokens_n2) %>%
  dfm_trim(min_termfreq = 3, min_docfreq = 2)
keyATM_docs_n2 <- keyATM_read(data_dfm_n2)
Keywords must follow the same n-gram format as the tokens (here, two words joined by an underscore).
keywords_n2 <- list(Government = c("federal_government", "vice_president"),
                    People     = c("men_women", "fellow_citizens"),
                    Peace      = c("peace_world"))
Then, you can fit the keyATM models (here we use the base model).
out <- keyATM(docs              = keyATM_docs_n2,  # text input
              no_keyword_topics = 3,               # number of topics without keywords
              keywords          = keywords_n2,     # keywords
              model             = "base",          # select the model
              options           = list(seed = 250))
top_words(out, 5)
## 1_Government 2_People 3_Peace
## 1 federal_government [✓] fellow_citizens [✓] four_years
## 2 vice_president [✓] united_states years_ago
## 3 one_another american_people political_parties
## 4 good_faith men_women [✓] foreign_nations
## 5 oath_office people_united old_world
## Other_1 Other_2 Other_3
## 1 general_government government_people united_states
## 2 constitution_united among_nations among_people
## 3 public_debt god_bless administration_government
## 4 constitution_laws nations_world within_limits
## 5 free_government great_nation chief_magistrate
We can use the `dfm_select()` function from the quanteda package.
keyATM_docs <- keyATM_read(texts = data_dfm)
law_all <- colnames(dfm_select(data_dfm, pattern = "law*"))  # all terms that start with `law`
keywords <- list(
  Government     = c(law_all, "executive"),
  Constitution   = c("constitution", "rights"),
  ForeignAffairs = c("foreign", "war")
)
Please also consider the `read_keywords()` function, which reads a dictionary object from quanteda into a named list.
Please note that the number of unique words, the total length of the documents, and the number of topics all affect the speed. If you use the `cov` model, the number of covariates matters as well, because we need to estimate coefficients for the covariates.
If you want to speed up fitting, the first thing to do is review your preprocessing. Documents usually include many low-frequency words that do not help interpretation. quanteda provides various functions to trim those words.
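As a minimal sketch of such trimming, the example below applies quanteda's `dfm_trim()` to a small toy corpus. The texts and the thresholds are made up for illustration; suitable cutoffs depend on your data.

```r
library(quanteda)

# Toy corpus (hypothetical data, for illustration only)
toy_texts <- c(d1 = "tax reform tax policy budget",
               d2 = "tax budget deficit policy",
               d3 = "budget policy zebra")
toy_dfm <- dfm(tokens(toy_texts))

# Keep only terms that occur at least twice overall
# and appear in at least two documents
toy_dfm_trimmed <- dfm_trim(toy_dfm, min_termfreq = 2, min_docfreq = 2)

nfeat(toy_dfm)              # vocabulary size before trimming
nfeat(toy_dfm_trimmed)      # vocabulary size after trimming
featnames(toy_dfm_trimmed)  # terms that survive trimming
```

A smaller vocabulary means fewer unique words for the sampler to handle. The trimmed dfm can then be passed to `keyATM_read()` as usual.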
keyATM can resume fitting.
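One way to set this up, as far as we understand the package documentation, is the `resume` entry of the `options` list, which takes a file path used to save and reload the intermediate state. The fragment below is only a sketch: the file name and the iteration count are hypothetical placeholders, and the exact behavior should be checked against the `keyATM()` help page for your installed version.

```r
# Options fragment (a sketch, not a full fitting script).
# `resume` is assumed to take a file path for saving and reloading the state;
# the path below is a hypothetical placeholder.
resume_options <- list(
  seed       = 250,
  iterations = 1000,
  resume     = "keyATM_resume.rds"
)
```

The list would be supplied as `options = resume_options` in the `keyATM()` call, with the same path reused on a later call to continue from the saved state.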
Yes! For example, Professor Louis Aslett provides an easy-to-use Amazon Machine Image of RStudio. When you select an instance, note that keyATM does not need many cores (one or two are enough, because Collapsed Gibbs Sampling cannot be parallelized), but make sure the instance has enough memory to handle your data.
Are the values stored by setting `store_theta` to `TRUE` directly interpretable as samples from the posterior and thus appropriate for estimating uncertainty?
Yes. Because we use Collapsed Gibbs Sampling, the thetas are integrated out and are not sampled directly from the posterior distribution. Instead, the `store_theta` option computes the marginal posterior (Equation 11 in our paper) at each iteration, so the stored values can be used to assess uncertainty.