A common task in natural language processing is identifying the topics and themes that run through a set of tweets, articles, blog posts, or long-form texts.
For someone without much knowledge of data science, it might be tempting to quickly Google search "how to extract topics from text", find out that Latent Dirichlet allocation (LDA) is the go-to model to do that, then write a Python script and hope for the best.
Of course, there are other models which might be better for your use-case, depending on the kind of data you want to process. We're going to be comparing the practical considerations of LDA with one such model called non-negative matrix factorization (NMF).
LDA and NMF are two of the most popular models, and while there are others available they seem to be rarely used, or purely academic.
Latent Dirichlet allocation
In short, LDA treats each document (tweet, article, essay, etc.) as a mixture of topics, and each topic as a distribution over words. It tries to find the topics that best explain the words observed in each document. You specify the number of topics when creating the model.
If you imagine topics as bags of words, then LDA will try to split a document in such a way that most of the words in a document end up in the bags. It then outputs, for each document, the proportion of it that belongs to each topic.
For a more in-depth explanation check out this article.
Now, for the major points to be aware of:
- Assumes each document has multiple topics.
- Works best with longer texts such as full articles, essays, and books.
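- Can be updated as you process new documents with the same model, provided the implementation supports incremental training (Gensim's does).
- Results are not deterministic, meaning you might get different results each time for the same data set unless you fix the random seed.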
- Gensim's default implementation is good, but MALLET's LDA often gives better topics.
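To make the "bags of words" idea concrete, here is a minimal sketch of LDA using a toy collapsed Gibbs sampler in plain Python. It is an illustration only, not how Gensim or any other library necessarily implements LDA. The function name `toy_lda`, the sample documents, and the `alpha` and `beta` values are all assumptions made for this example.

```python
import random
from collections import Counter


def toy_lda(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=None):
    """Toy collapsed Gibbs sampler for LDA (illustrative, not optimized).

    docs is a list of token lists. Returns (doc_topic, topic_words), where
    doc_topic[d] holds the topic proportions of document d and
    topic_words[k] counts the words currently assigned to topic k.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})

    # Start by assigning every token to a random topic and counting.
    assignments = [[rng.randrange(num_topics) for _ in doc] for doc in docs]
    doc_topic_counts = [[0] * num_topics for _ in docs]
    topic_words = [Counter() for _ in range(num_topics)]
    topic_totals = [0] * num_topics
    for d, doc in enumerate(docs):
        for w, k in zip(doc, assignments[d]):
            doc_topic_counts[d][k] += 1
            topic_words[k][w] += 1
            topic_totals[k] += 1

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Take the token out of its current topic.
                k = assignments[d][i]
                doc_topic_counts[d][k] -= 1
                topic_words[k][w] -= 1
                topic_totals[k] -= 1

                # Weight each topic by how common it is in this document
                # and how strongly it is associated with this word.
                weights = [
                    (doc_topic_counts[d][t] + alpha)
                    * (topic_words[t][w] + beta)
                    / (topic_totals[t] + beta * vocab_size)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]

                # Put the token into the newly sampled topic.
                assignments[d][i] = k
                doc_topic_counts[d][k] += 1
                topic_words[k][w] += 1
                topic_totals[k] += 1

    # Smooth and normalize the counts into per-document proportions.
    doc_topic = []
    for counts in doc_topic_counts:
        total = sum(counts) + alpha * num_topics
        doc_topic.append([(c + alpha) / total for c in counts])
    return doc_topic, topic_words


# Made-up sample documents, already tokenized.
docs = [
    "cat dog pet cat vet".split(),
    "dog pet leash vet dog".split(),
    "stock market trade stock bank".split(),
    "bank trade market loan stock".split(),
]
doc_topic, topic_words = toy_lda(docs, num_topics=2, seed=0)
for proportions in doc_topic:
    print([round(p, 2) for p in proportions])
```

Each printed row is one document's share of each topic, and the shares sum to 1. Passing a different `seed`, or none at all, can change the result, which is the non-determinism mentioned in the list above.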
Non-negative matrix factorization
Given a matrix X where each row is a word and each column is a document, with each entry holding how often that word is used in that document, you can decompose X into two matrices W and H whose product WH approximates X, with the constraint that no entry of W or H is negative.
W describes each component (topic) as a set of word weights, and H gives each document a weight for each topic. Essentially, each document has a set of scores for how well it fits each topic.
It's simpler than LDA, and has many applications besides topic modeling. It's a good option for analysis on shorter texts.
For a more in-depth explanation check out this practical guide.
- Calculates how well each document fits each topic, rather than assuming a document has multiple topics.
- Usually faster than LDA.
- Works best with shorter texts such as tweets or titles.
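- Results are deterministic when the initialization is deterministic or the random seed is fixed, giving more consistency when running the same data. With random initialization, results can vary between runs.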
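The decomposition described above can be sketched in plain Python using multiplicative updates, one well-known way to fit NMF. This is a minimal sketch rather than what any particular library does internally. The helper names (`matmul`, `transpose`, `toy_nmf`), the tiny word-by-document matrix, and the iteration count are assumptions chosen for illustration.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def toy_nmf(x, num_topics, iterations=200, seed=0, eps=1e-9):
    """Factor x (words x documents) into w (words x topics) and
    h (topics x documents), keeping every entry non-negative."""
    rng = random.Random(seed)
    n_words, n_docs = len(x), len(x[0])
    w = [[rng.random() + eps for _ in range(num_topics)] for _ in range(n_words)]
    h = [[rng.random() + eps for _ in range(n_docs)] for _ in range(num_topics)]

    for _ in range(iterations):
        # Update h: scale each entry by (w^T x) / (w^T w h).
        wt = transpose(w)
        num = matmul(wt, x)
        den = matmul(matmul(wt, w), h)
        h = [
            [h[k][j] * num[k][j] / (den[k][j] + eps) for j in range(n_docs)]
            for k in range(num_topics)
        ]
        # Update w: scale each entry by (x h^T) / (w h h^T).
        ht = transpose(h)
        num = matmul(x, ht)
        den = matmul(w, matmul(h, ht))
        w = [
            [w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(num_topics)]
            for i in range(n_words)
        ]
    return w, h


# Made-up counts: rows are words, columns are four documents.
vocab = ["cat", "dog", "vet", "stock", "bank", "trade"]
x = [
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 2, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 2],
]
w, h = toy_nmf(x, num_topics=2)
for j, doc_scores in enumerate(transpose(h)):
    print(j, [round(s, 2) for s in doc_scores])
```

Each printed row is one document's score for each topic. Because the updates only multiply and divide non-negative numbers, W and H stay non-negative throughout, and with a fixed `seed` the run is repeatable.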
Pre-processing and parametrization
An important caveat to consider is that both models are highly dependent on the data you process with them, and the parameters you define. Make sure to pre-process your data: tokenize it, remove stopwords, and apply stemming or lemmatisation.
Once you have well-formatted data, try both models on it and visualize the results. Visualization is a data scientist's debugger.
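As a rough sketch of those pre-processing steps, the snippet below tokenizes text, drops stopwords, and stems what is left. The stopword list is deliberately tiny, and `crude_stem` is a hypothetical stand-in for a real stemmer or lemmatiser. Both are assumptions for illustration, and in practice you would likely use an NLP library for them.

```python
import re

# Tiny illustrative stopword list; real projects use a much fuller one.
STOPWORDS = {"a", "an", "and", "are", "in", "is", "of", "on", "the", "to"}


def crude_stem(token):
    """Hypothetical stand-in for a real stemmer: strips a few common suffixes."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Lowercase, split into alphabetic tokens, drop stopwords, then stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOPWORDS]


print(preprocess("The cats are chasing the dogs in the garden"))
# ['cat', 'chas', 'dog', 'garden']
```

Note that stemming can produce non-words such as "chas". That is usually fine for topic modeling, since the goal is only to map related word forms to the same token.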
Finally, another big factor in the accuracy of topic modeling is figuring out how many topics to generate. Both models require you to define the number right off the bat. Choose too many topics and you'll end up just getting back what you put in. Choose too few topics and you'll get documents in places they have no business being.
Make sure to test different values, and look for the number of topics that gives the highest coherence score.
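To give a feel for what a coherence score measures, here is a simplified sketch that scores a topic by how often its top words appear in the same documents, using normalized pointwise mutual information (NPMI) at the document level. It is not the exact measure any library computes. The function name `npmi_coherence` and the sample data are assumptions, and scores from this sketch fall between -1 and 1, so they are not comparable with thresholds quoted for other coherence measures.

```python
import math
from itertools import combinations


def npmi_coherence(top_words, docs):
    """Average document-level NPMI over all pairs of a topic's top words."""
    doc_sets = [set(doc) for doc in docs]
    n = len(doc_sets)

    def prob(*words):
        # Fraction of documents that contain all of the given words.
        return sum(all(w in s for w in words) for s in doc_sets) / n

    scores = []
    for a, b in combinations(top_words, 2):
        p_a, p_b, p_ab = prob(a), prob(b), prob(a, b)
        if p_ab == 0:
            scores.append(-1.0)  # the pair never co-occurs
        elif p_ab == 1:
            scores.append(1.0)  # both words are in every document (limiting value)
        else:
            scores.append(math.log(p_ab / (p_a * p_b)) / (-math.log(p_ab)))
    return sum(scores) / len(scores) if scores else 0.0


# Made-up tokenized documents.
docs = [
    "cat dog pet vet".split(),
    "dog pet leash vet".split(),
    "stock market trade bank".split(),
    "bank trade market loan stock".split(),
]
coherent = npmi_coherence(["dog", "pet", "vet"], docs)
mixed = npmi_coherence(["dog", "stock", "vet"], docs)
print(round(coherent, 2), round(mixed, 2))
```

A topic whose top words tend to appear together scores higher than one mixing unrelated words. To compare candidate topic counts, you could train a model for each count, score the top words of every topic this way, and compare the averages.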
What coherence score is good enough?
This depends on your input data and on which coherence measure you use, since different measures are on different scales. For a measure that ranges from 0 to 1, try to get around 0.7; anything above 0.5 is perfectly usable.
You'll almost always get some errors in your results, but you can minimize them with careful pre-processing, a properly trained and parametrized model, and post-processing suited to your needs.
So, which one should you use?
Use LDA if you're working with longer texts, but if you're working with short texts of one or a few sentences then NMF might give you much better results.