Documentation
For the full documentation and the inner workings of the algorithm, we refer to the research paper:
Dangl, Thomas and Salbrechter, Stefan, Guided Topic Modeling with Word2Vec: A Technical Note (September 19, 2023). Available at SSRN: https://ssrn.com/abstract=4575985
Word2Vec Embeddings
Guided Topic Modeling (GTM) uses special word embeddings obtained from a Word2Vec model that was trained on 10 million Thomson Reuters news articles (2.5 billion words) covering the period from 1996 to 2017. The hyperparameters of the Word2Vec algorithm are specifically adapted to the task of topic modeling. We therefore use a rather low vector dimension of 64 to avoid data sparsity in vector space (curse of dimensionality); this also avoids the overly specific topic clusters we observed when working with higher-dimensional vectors. Furthermore, with the task of topic modeling in mind, we use the CBOW algorithm with a rather large window size of 18.
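A minimal sketch of this configuration using Gensim is shown below; the toy corpus and the min_count value are illustrative placeholders, not the training setup used for the actual model.

```python
from gensim.models import Word2Vec

# Toy corpus standing in for the 10 million Thomson Reuters articles.
sentences = [
    ["stocks", "rallied", "after", "strong", "earnings"],
    ["bond", "yields", "declined", "sharply"],
]

model = Word2Vec(
    sentences,
    vector_size=64,  # low dimensionality to avoid sparsity in vector space
    window=18,       # large context window suited to topic modeling
    sg=0,            # CBOW (sg=0) rather than skip-gram
    min_count=1,     # placeholder; a real corpus would use a higher threshold
)
```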
Polar Word Embeddings
Standard Word2Vec embeddings do not contain any information about word polarity, which makes them unsuitable for the task of sentiment analysis. We therefore trained and adapted Word2Vec embeddings primarily for the financial domain, i.e., for words that are considered positive or negative in a financial context. Using a fully data-driven method, we take the feedback of the stock market to identify positive and negative words. We perform a PCA on the word embeddings obtained from Word2Vec and replace the least informative dimensions with the word polarity measure. Using these adapted embeddings allows the generation of polar topics: for example, defining the seed words "rise" and "surge" (positive topic) or "fall" and "decline" (negative topic) yields strictly polar topics.
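The following is a minimal sketch of this idea, assuming random stand-ins for the embeddings and the market-derived polarity scores; the number of replaced dimensions and all variable names are illustrative assumptions, not the exact procedure from the paper.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
embeddings = rng.standard_normal((1000, 64))  # stand-in for the 64-d Word2Vec vectors
polarity = rng.uniform(-1.0, 1.0, size=1000)  # stand-in market-feedback polarity per word

# Rotate into the PCA basis, where dimensions are ordered by explained variance.
pca = PCA(n_components=64)
components = pca.fit_transform(embeddings)

# Replace the least informative (last) dimension with the polarity measure.
components[:, -1] = polarity

# Use the modified PCA-space vectors as the polar embeddings.
polar_embeddings = components
```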
Vocabulary
The vocabulary consists of a total of 190,323 unique unigrams (single words) and bigrams (word pairs). To detect bigrams, we use the Gensim Phrases model. The trained model can be downloaded here. Apply this model to your dataset to obtain all bigrams available in GTM.
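A hedged sketch of applying the downloaded Phrases model is given below; the file name gtm_bigrams.pkl is a placeholder for wherever you store the download.

```python
from gensim.models.phrases import Phrases

# Placeholder path to the downloaded, pre-trained Phrases model.
bigram_model = Phrases.load("gtm_bigrams.pkl")

# Tokenized documents; bigrams recognized by the model are joined with "_".
docs = [["stock", "prices", "rise"], ["interest", "rates", "fall"]]
docs_with_bigrams = [bigram_model[tokens] for tokens in docs]
```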