Topic Modeling to Understand Online Reviews

With so many online reviews across many social media websites, it is hard for companies to keep track of their online reputation. Businesses can benefit immensely if they can understand general trends of what their customers are talking about online. A common method to quickly understand trends in topics being discussed in a corpus of text is Latent Dirichlet Allocation.

Latent Dirichlet Allocation assumes each document is a mixture of topics, and each topic is a mixture of words. It then estimates the probability distribution of topics in a given document and of words in a given topic.


We will perform topic modeling via Latent Dirichlet Allocation (LDA) on online reviews of a beauty retailer from various social media sources.



  1. Cleaning Text Data: Before modeling the reviews with LDA, we clean the review text by lemmatizing, removing punctuation, removing stop words, and filtering for English-only reviews.
  2. Identifying Bigrams and Trigrams: We want to identify bigrams and trigrams so we can concatenate each one and treat it as a single token. Bigrams are two-word phrases, e.g. ‘social media’, where ‘social’ and ‘media’ are more likely to co-occur than to appear separately. Likewise, trigrams are three-word phrases that tend to co-occur, e.g. ‘Procter and Gamble’. We use the Pointwise Mutual Information (PMI) score to identify significant bigrams and trigrams worth concatenating. We also keep only n-grams matching the part-of-speech patterns (noun/adjective, noun) and (noun/adjective, any, noun/adjective), since these structures commonly mark noun-type n-grams. This helps the LDA model cluster topics more cleanly.
  3. Filtering Nouns: Nouns are the most likely indicators of a topic. For example, in the sentence ‘The store is nice’, we know the sentence is talking about ‘store’; the other words provide context and explanation about that topic. Filtering for nouns therefore leaves the text with the words that are most interpretable in the topic model.
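The cleaning, PMI-based n-gram scoring, and noun filtering above can be sketched in miniature as follows — the stop-word list and part-of-speech lookup here are tiny hand-made stand-ins for what nltk or spaCy would provide in a real pipeline:

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list and toy POS lookup; a real pipeline
# would use nltk/spaCy for stop words, lemmatization, and tagging.
STOP_WORDS = {"the", "is", "a", "an", "and", "was", "to", "of"}
TOY_TAGS = {"store": "NOUN", "staff": "NOUN", "media": "NOUN",
            "nice": "ADJ", "social": "ADJ", "helpful": "ADJ"}

def clean(text):
    """Lowercase, strip punctuation/digits, and drop stop words."""
    tokens = re.sub(r"[^a-z\s]", " ", text.lower()).split()
    return [t for t in tokens if t not in STOP_WORDS]

def pmi_bigrams(docs, min_count=2):
    """Score adjacent word pairs by PMI(a, b) = log(P(a, b) / (P(a) * P(b)))."""
    uni, bi = Counter(), Counter()
    for d in docs:
        uni.update(d)
        bi.update(zip(d, d[1:]))
    n_u, n_b = sum(uni.values()), sum(bi.values())
    return {p: math.log((c / n_b) / ((uni[p[0]] / n_u) * (uni[p[1]] / n_u)))
            for p, c in bi.items() if c >= min_count}

def keep_nouns(tokens):
    """Keep only tokens tagged as nouns."""
    return [t for t in tokens if TOY_TAGS.get(t) == "NOUN"]

docs = [clean("The store staff is nice!"),
        clean("Social media is nice."),
        clean("Social media was helpful to the store.")]
print(pmi_bigrams(docs))    # ('social', 'media') is the only frequent pair
print(keep_nouns(docs[0]))  # → ['store', 'staff']
```

High-PMI pairs like ‘social media’ would then be joined into a single token (e.g. ‘social_media’) before fitting LDA.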


  1. Optimizing the number of topics: LDA requires that we specify the number of topics in the corpus up front. Several measures can be optimized to choose it, such as predictive likelihood, perplexity, and coherence. The literature indicates that maximizing coherence, particularly the C_v measure, leads to better human interpretability: it assesses how interpretable the topics are given the set of words generated for each. We therefore optimize C_v coherence.

    The number of topics that yields maximum coherence is around 3–4. We will examine both, because 4 topics may still be coherent while providing more information.
  2. Using gensim’s LDA package to perform topic modeling: With the optimal number of topics, we use gensim’s LDA implementation to model the data. After comparing the 3-topic and 4-topic models, we concluded that grouping into 4 topics yielded more coherent and insightful topics:

    These 4 main topics can be summarized as: hair salon service, product selection and pricing, brow bar and makeup service, and customer service.
  3. Enhancing interpretability via the relevance score: Sometimes, words ranked among the top words for a topic appear there only because they are globally frequent across the corpus. The relevance score prioritizes terms that belong more exclusively to a given topic, which can increase interpretability even more. The relevance of term w to topic k, given a weight lambda, is defined as:

    relevance(w, k | lambda) = lambda * log(phi_kw) + (1 - lambda) * log(phi_kw / p_w)

    where phi_kw is the probability of term w in topic k and p_w is the marginal probability of term w across the corpus.
    The first term measures the probability of term w occurring in topic k, and the second term measures the lift of the term’s probability within the topic over its marginal probability across the corpus. A lower lambda value gives more weight to the second term, and therefore to topic exclusivity. We can use Python’s pyLDAvis package for this: for example, when we lower lambda, topic 0’s top-ranked terms become even more relevant to the topic of hair salon service.

    The pyLDAvis tool also conveys two other important pieces of information. Each circle represents a topic, and the distance between circles visualizes how related the topics are to each other; the plot shows that our topics are quite distinct. The size of each circle represents how prevalent that topic is across the corpus of reviews. Circle 1 represents the customer service topic, and the fact that it is the biggest circle means that reviews mention customer service the most. Circle 2 represents the hair topic, which the visualization indicates makes up 22.3% of all tokens.
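The relevance re-ranking itself is straightforward to compute. In the sketch below, the phi_kw (probability of word w in topic k) and p_w (marginal probability of w) values are made up for illustration, not taken from a fitted model:

```python
import math

def relevance(phi_kw, p_w, lam):
    """relevance(w, k | lambda) = lambda * log(phi_kw)
                                  + (1 - lambda) * log(phi_kw / p_w),
    where phi_kw = P(word w | topic k) and p_w = marginal P(word w)."""
    return lam * math.log(phi_kw) + (1 - lam) * math.log(phi_kw / p_w)

# Hypothetical values: 'haircut' is rarer overall but concentrated in one
# topic, while 'good' is frequent everywhere in the corpus.
words = {"haircut": (0.04, 0.01), "good": (0.05, 0.06)}

for lam in (1.0, 0.3):
    ranked = sorted(words, key=lambda w: relevance(*words[w], lam), reverse=True)
    print(lam, ranked)
# lam=1.0 ranks 'good' first (pure in-topic frequency);
# lam=0.3 ranks 'haircut' first (topic exclusivity dominates).
```

Setting lambda to 1 ranks purely by in-topic probability; lowering it surfaces terms distinctive to the topic, which is what pyLDAvis’s lambda slider adjusts.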


After applying the above steps, here are the 4 topics and top words for each:

The model has enabled us to understand the 4 most common topics in online reviews about the beauty retailer: customer service, hair salon service, product selection, and eyebrow/makeup service.