Online reviews: keyword clusters


When analyzing online reviews, we often focus on keywords. For various purposes such as review classification or keyword suggestion, we may need to group those words by closeness of meaning. Given a set of keywords, we may want to split it into clusters of words. One way to do so is to use Word2Vec to map each word to a vector, and then apply hierarchical clustering to those vectors.

First, let’s look at a simple example. We have the following input: [“banana”, “apple”, “lemon”, “nice”, “car”, “RV”, “truck”, “desk”]

     From a clustering algorithm, we would expect this output:

  • cluster 1: [“banana”, “apple”, “lemon”]
  • cluster 2: [“car”, “RV”, “truck”]
  • out of the clusters: [“nice”, “desk”]

1. Word2Vec

In this post we focus more on the grouping algorithms than on the details of the implementation of the Word2Vec model.

In short, Word2Vec provides word embeddings: it associates a vector with each word. Word2Vec trains a shallow neural network to predict the neighboring words of each word in a corpus. Once the network is trained, the vector for a word is read from the projection layer, before the output layer that tries to predict the context. Words with close meanings should be represented by vectors with close values.

We trained our Word2Vec model on 10^6 online reviews with the gensim Word2Vec implementation. We used a window of 10 words and generated 300-dimensional vectors.

Our use cases involve data from various industries, so when we want to work with keywords on online reviews for a given industry, we train Word2Vec on reviews from this industry, since this context may be useful in specifying the meaning of a word. For example, considering the word “limb”, the meaning may be different in restaurant reviews than in hospital reviews.

2. Hierarchical clustering

To generate clusters, we will apply a strategy of hierarchical clustering. This is an iterative process to create a new cluster at each step by aggregating two clusters. At each step it tries to minimize the increase of a given distance. For a set of n vectors, there are n-1 steps that take you from n singleton clusters to one cluster with everything. This process can also be well represented graphically to show any level of granularity within a single graph (as below).

example of hierarchical clusters
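The iterative process described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the code used to produce the figures: the function names and the toy 2D points are made up, and the squared distance between cluster means is used as a placeholder merge criterion (the Ward distance is introduced later in the post).

```python
from itertools import combinations


def mean(points):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid_distance(c1, c2):
    """Placeholder merge criterion: squared distance between cluster means."""
    return sq_dist(mean(c1), mean(c2))


def agglomerate(points, cluster_distance=centroid_distance):
    """Merge the two closest clusters until one remains; return the merge history."""
    clusters = [[p] for p in points]  # start from n singleton clusters
    history = []
    while len(clusters) > 1:
        # Pick the pair of clusters with the smallest distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]]),
        )
        history.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return history


# Toy data (hypothetical): two tight pairs and one isolated point.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
merges = agglomerate(points)
print(len(merges))  # n - 1 merges for n points
```

The merge history is what a dendrogram draws: reading it from the first merge to the last goes from the finest to the coarsest level of granularity.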

3. Distances used for clustering

Of course the key to generating these clusters is the distance metric used to determine which keywords or clusters are “close” together. To measure the dispersion in a set of vectors we can use different metrics. A classic distance is the empirical variance, corresponding to the geometric distance between the vectors.

3.1 Empirical variance

Let S = (x_i)_{i \in [|1,n|]} be a set of n points, and G its center of inertia (its mean). Then the empirical variance is:

V = \sum_{i = 1}^{n}{||x_i - G||^2}

But we can define two other variances : the inter-class variance V_{inter}, and the intra-class variance V_{intra}.

Let (C_k)_{k \in [|1, K|]} be K subsets of S forming a partition of S (the clusters). (G_k)_{k \in [|1, K|]} are the centers (means) of the clusters (C_k)_{k \in [|1, K|]}, and (|C_k|)_{k \in [|1, K|]} are the numbers of elements in each cluster.

V_{inter} = \sum_{k \in [|1, K|]}{|C_k|\times|| G_k - G ||^2}
V_{intra} = \sum_{k \in [|1, K|]}{\sum_{i \in C_k}{||x_i - G_k||^2}}

We understand that V_{intra} is a measure of the dispersion inside each subset, and V_{inter} is a measure of the distance between the clusters.

Huygens’ theorem shows that:

V = V_{inter} + V_{intra}
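The decomposition can be checked numerically on a tiny example. The points and the partition below are made up for illustration; the three quantities follow the definitions of V, V_{inter} and V_{intra} given above.

```python
def mean(points):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Hypothetical partition of five 2D points into two clusters.
clusters = [
    [(0.0, 0.0), (0.0, 2.0)],
    [(4.0, 0.0), (6.0, 0.0), (5.0, 3.0)],
]
all_points = [p for c in clusters for p in c]
G = mean(all_points)

V = sum(sq_dist(x, G) for x in all_points)
V_inter = sum(len(c) * sq_dist(mean(c), G) for c in clusters)
V_intra = sum(sq_dist(x, mean(c)) for c in clusters for x in c)

print(V, V_inter, V_intra)  # 40.0 30.0 10.0
```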

Therefore, when we do clustering, since V is a constant, minimizing the intra-class variance is the same as maximizing the inter-class variance. This is what we want to generate: clusters that are far from one another (high V_{inter}), each of them tight around its own center (low V_{intra}).

Now we will see that, at each aggregation of two clusters, the Ward distance gives a way to make the aggregation with the lowest increase of V_{intra}.

3.2 Ward distance

Ward distance is a distance between two clusters, which we can use in hierarchical clustering: at each step of the algorithm, we try to aggregate two clusters with the smallest Ward distance between them. This distance between two clusters is chosen because of the following result:

Aggregating the two clusters with the minimum Ward distance is equivalent to making the aggregation with the smallest increase of V_{intra}.
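The post does not spell out the formula. The sketch below assumes the standard definition of the Ward distance between two clusters A and B with means G_A and G_B, namely \frac{|A| |B|}{|A| + |B|} ||G_A - G_B||^2, and checks on made-up points that it equals the increase of V_{intra} caused by merging A and B.

```python
def mean(points):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def intra(cluster):
    """Contribution of one cluster to V_intra: sum of squared distances to its mean."""
    g = mean(cluster)
    return sum(sq_dist(x, g) for x in cluster)


def ward_distance(a, b):
    """Ward distance (standard definition, assumed here)."""
    return len(a) * len(b) / (len(a) + len(b)) * sq_dist(mean(a), mean(b))


# Hypothetical clusters.
A = [(0.0, 0.0), (0.0, 2.0)]
B = [(4.0, 0.0), (6.0, 0.0), (5.0, 3.0)]

# Increase of V_intra when A and B are replaced by their union.
increase = intra(A + B) - intra(A) - intra(B)
print(ward_distance(A, B), increase)
```

Because the two numbers agree, picking the pair with the smallest Ward distance at each step is the same as picking the merge that adds the least to V_{intra}.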

This is an example of hierarchical clustering using Ward distance on a set of keywords from hospital online-reviews.


3.3 Thresholds on cluster distance

Now that we have an efficient method to perform hierarchical clustering which tries to keep the intra-class variance as low as possible, we have to determine where to stop this process of aggregating groups, i.e., which level of granularity to use as our clusters in any particular application.

There are different ways to choose the right number of clusters (when to stop the aggregations). A classic way to do that in hierarchical clustering is to use the maximum of the second derivative of the Ward distance, which means to effectively look for a big gap in the Ward distance and to stop before this gap.

But we are looking for clusters with high word similarity, so we don’t want to use the second derivative method, which is relative. We also want to find a standard criterion that is independent of the cluster size. Therefore, we will use thresholds on a given distance but the distance should be normalized.

Even though we use the Ward distance as an aggregation distance, we do not necessarily have to use it to delimit the clusters. Instead we can use a threshold on a “dispersion” metric for each cluster C_k. Let’s consider the following three metrics, in addition to the Ward distance:

  • Mean variance in the cluster: \frac{\sum_{i \in C_k}{||x_i - G_k||^2}}{|C_k|}
  • Maximum distance between two elements in the cluster: \max\limits_{i,j \in C_k}{||x_i - x_j||^2}
  • Maximum distance to the mean of the cluster: \max\limits_{i \in C_k}{||x_i - G_k||^2}
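The three metrics listed above can be written as small Python functions. This is a sketch on made-up 2D points; like the formulas in the list, all three use squared norms.

```python
from itertools import combinations


def mean(points):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_variance(cluster):
    """Mean squared distance of the cluster's points to the cluster mean."""
    g = mean(cluster)
    return sum(sq_dist(x, g) for x in cluster) / len(cluster)


def max_pairwise_distance(cluster):
    """Largest squared distance between two points of the cluster (0 for a singleton)."""
    return max((sq_dist(a, b) for a, b in combinations(cluster, 2)), default=0.0)


def max_distance_to_mean(cluster):
    """Largest squared distance between a point of the cluster and the cluster mean."""
    g = mean(cluster)
    return max(sq_dist(x, g) for x in cluster)


# Hypothetical cluster of three points; its mean is (1, 1).
cluster = [(0.0, 0.0), (2.0, 0.0), (1.0, 3.0)]
print(mean_variance(cluster))          # 8 / 3
print(max_pairwise_distance(cluster))  # 10.0
print(max_distance_to_mean(cluster))   # 4.0
```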


On this and other examples, the maximum distance between two elements seems to be worse than the other metrics. The maximum distance to the mean is a good metric, but it doesn’t consider all the words in the cluster. We will use a threshold on the mean variance to delimit the clusters. In the next dendrogram we plot the threshold: each cluster has to have a mean variance below 0.6. In this example we have 4 clusters.


But we need to be careful when using metrics other than the Ward distance to select clusters. Unlike the Ward distance, such a metric may not increase at each iteration. Here is an example of this:


This would give an advantage to the maximum distance between two vectors in each cluster over other metrics because the maximum distance between two vectors can’t decrease when we add words in a cluster. For our purposes, we will keep the mean variance but choose to stop the aggregation the first time that the mean variance is over the threshold, in order to keep this method consistent.
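A tiny numeric case, with made-up points, shows both behaviours at once: adding a point at the center of a spread-out cluster lowers its mean variance, while the maximum distance between two of its elements does not decrease.

```python
from itertools import combinations


def mean(points):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_variance(cluster):
    """Mean squared distance of the cluster's points to the cluster mean."""
    g = mean(cluster)
    return sum(sq_dist(x, g) for x in cluster) / len(cluster)


def max_pairwise_distance(cluster):
    """Largest squared distance between two points of the cluster."""
    return max((sq_dist(a, b) for a, b in combinations(cluster, 2)), default=0.0)


# Hypothetical cluster before and after absorbing a point at its center.
before = [(-2.0, 0.0), (2.0, 0.0)]
after = before + [(0.0, 0.0)]

print(mean_variance(before), mean_variance(after))                  # goes down
print(max_pairwise_distance(before), max_pairwise_distance(after))  # stays the same
```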

Now that we have a way to select the clusters, let’s explore some other issues and possible ways to address them.

4. Improve the quality of the clusters

4.1 Outliers

We select clusters based on their mean variance. But a cluster can have a low mean variance because many of its vectors are close to each other while one vector is far from them. We want to remove vectors which are too far from the center of the cluster, as we see in the left part of the following diagram:


In the left part of this diagram, the mean variance is lower in the red cluster than in the blue one, but the point in the bottom left corner of the red cluster needs to be removed from it.

4.2 “Dimension” effect

We used the mean variance to select the clusters because it’s normalized. We did that because we don’t want to preselect the size of the clusters.

But we can notice that, for the same mean variance and a different number of elements in the clusters, the lexical proximity seems to increase with the size of the cluster. For example, in the dendrogram that we used to illustrate the thresholds, [“arrogant”, “personality”, “bedside”] has the same mean variance as [“encouraging”, “understanding”, “caring”, “passionate”, “listens”]. This may be caused by the high dimension (300) of the vector space.

We can use a better metric to keep the selection of clusters independent of their size. We have good results with: \frac{||x_i - G_k ||}{\sqrt{|C_k|}}

But it seems difficult to use the mean of this metric directly with a threshold after the Ward aggregation, because the mean of this metric does not necessarily increase as the clusters are aggregated.

4.3 New filter against those two problems

Even though the mean of the last metric \frac{||x_i - G_k ||}{\sqrt{|C_k|}} is not useful in selecting the clusters, we may have another use for it. Instead of using it at the cluster level, we can use it at the word level. For each vector, if it’s too far from its cluster’s center, it will be removed from the cluster. This is very useful for two purposes: to get rid of outliers, and to get rid of small clusters of words with very distant meanings.

Ultimately, we have a three-step process. Let’s return to our first example: [“banana”, “apple”, “lemon”, “nice”, “car”, “RV”, “truck”, “desk”]

  1. Hierarchical aggregation with Ward distance.
  2. Select clusters with a threshold t1 = 0.6 on their mean variance. We have three clusters:
    – [“banana”, “apple”, “lemon”]
    – [“RV”, “car”, “truck”]
    – [“nice”, “desk”]
  3. Remove outlier x_i from cluster C_k when \frac{||x_i - G_k ||}{\sqrt{|C_k|}} is higher than a threshold t2 = 0.4. We limit the “dimension” effect at the same time.
    Both “nice” and “desk” fail the test, so we have the expected result:
    – cluster 1: [“banana”, “apple”, “lemon”]
    – cluster 2: [“RV”, “car”, “truck”]
    – out of the clusters: [“desk”, “nice”]
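The word-level filter of the last step can be sketched as follows. The 2D points are made up, the function name is hypothetical, and the sketch assumes that the cluster center is computed once, with the outliers still included, and that a vector is removed when its score is above the threshold.

```python
from math import sqrt


def mean(points):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def filter_outliers(cluster, t2):
    """Split a cluster into (kept, removed) using ||x - G|| / sqrt(|C|) > t2."""
    g = mean(cluster)  # center computed once, outliers included (assumption)
    size = len(cluster)
    kept, removed = [], []
    for x in cluster:
        score = sqrt(sq_dist(x, g)) / sqrt(size)
        (removed if score > t2 else kept).append(x)
    return kept, removed


# Four points close together and one far away: only the far one is removed.
cluster = [(0.0, 0.0), (0.2, 0.0), (0.0, 0.2), (0.2, 0.2), (2.0, 2.0)]
kept, removed = filter_outliers(cluster, t2=0.4)
print(kept, removed)

# A small cluster of two distant points is dissolved entirely.
pair_kept, pair_removed = filter_outliers([(0.0, 0.0), (2.0, 0.0)], t2=0.4)
print(pair_kept, pair_removed)
```

The second call mirrors the [“nice”, “desk”] case: in a two-element cluster both vectors are equally far from the center, so either both stay or both are removed.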

5. Further extensions – Use of clusters to extract lexical information

Now that we can select groups of words with very similar meaning, we have some ideas for further research.

An advantage of the Word2Vec word embedding is that it may have geometrical interpretations. We have the example from the Word2Vec creators, who found the word “smallest” by looking at the word representation with the closest cosine distance to vector(“biggest”) – vector(“big”) + vector(“small”).
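The analogy query can be illustrated with a few hand-made vectors. The 3-dimensional values below are invented so that the arithmetic is easy to follow (real Word2Vec vectors are learned and have hundreds of dimensions), and the function names are hypothetical. As in the usual analogy setup, the query words themselves are excluded from the candidates.

```python
from math import sqrt

# Made-up vectors: axis 0 ~ size, axis 1 ~ superlative, axis 2 ~ everything else.
vectors = {
    "big": (1.0, 0.0, 0.1),
    "biggest": (1.0, 1.0, 0.1),
    "small": (-1.0, 0.0, 0.1),
    "smallest": (-1.0, 1.0, 0.1),
    "car": (0.0, 0.0, 1.0),
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))


def analogy(positive, negative, vectors):
    """Word closest (by cosine similarity) to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    query = [
        sum(vectors[w][d] for w in positive) - sum(vectors[w][d] for w in negative)
        for d in range(dim)
    ]
    candidates = [w for w in vectors if w not in positive and w not in negative]
    return max(candidates, key=lambda w: cosine_similarity(query, vectors[w]))


# vector("biggest") - vector("big") + vector("small")
result = analogy(positive=["biggest", "small"], negative=["big"], vectors=vectors)
print(result)  # smallest
```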

We easily understand the meaning of the vector (vector(“biggest”) – vector(“big”)). But what about (vector(“big”) – vector(“small”))? Along the short segment from vector(“small”) to vector(“big”), we hope to find adjectives related to size, increasing from “small” to “big”.

We tried to detect lexical information with tools such as Principal Component Analysis (PCA) on a huge group of words but without success, so we hope that this clustering will help us to achieve this.

It may be interesting to use tools such as PCA on each cluster to detect geometrical structures with lexical interpretations.

This is an idea of the kind of results that we may expect:


6. Conclusion

Vectors generated by Word2Vec can be used to find clusters of words. With hierarchical clustering, we try to find clusters without constraining their number or their size. We can manually choose threshold values, depending on how scattered we want the clusters to be.

These clustering techniques are useful to find groups of words with the same meaning. They can be used to find keywords more precisely: let’s say that we have 100 keywords related to a theme and we want to use Word2Vec to detect new keywords related to this theme. Instead of looking for vectors close to each word representation or close to the entire 100-vector set, we can look for vectors close to each cluster that we generated.

It may be very interesting to analyze the geometric distribution of the vectors inside the clusters with tools such as PCA, and to find out whether it corresponds to a lexical structure.


Efficient Estimation of Word Representations in Vector Space, Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean, 2013

SciPy Hierarchical Clustering and Dendrogram Tutorial, Jörn Hees, 2015

Semantic Convolution with Word2Vec

We work with millions of online reviews from hundreds of sources. One of the unusual characteristics of reviews compared to the vast majority of text corpora is that, almost by definition, reviews are structured in such a way that they can be categorized (in one or many dimensions depending on the review site and/or industry). However, we often find ourselves doing text classification/tagging on topics that are not already labeled by the review site. This article is an informal introduction to a set of techniques we have developed to leverage existing unlabeled corpora in conjunction with the labeled data. In particular, we present a semi-supervised learning algorithm for multi-label text classification.

In recent years, a lot of text classification projects have used supervised learning methods (Naive Bayes, SVM) primarily due to their substantial improvements over non-supervised strategies such as traditional clustering in NLP tasks. Until very recently, most NLP classification work was done with the traditional Bag of Words (BOW) approach – perhaps with a bit of context through the use of a limited range of N-grams and skip-grams. BOW is a feature extraction technique where the text is represented as the frequency of each word in the document, disregarding grammar and ordering but keeping multiplicity. In most cases, defining a pipeline combining the BOW feature extraction technique with a Tf-Idf transform and a simple classifier (Naive Bayes, SVM) produces decent results with respect to most classification metrics.
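As a reminder of what the BOW representation looks like, here is a minimal sketch that turns a text into a vector of term frequencies over a fixed vocabulary. The vocabulary, the example sentence and the whitespace tokenization are simplifying assumptions made for illustration.

```python
from collections import Counter


def term_frequencies(text, vocabulary):
    """Count how often each vocabulary term occurs in the text (order is ignored)."""
    counts = Counter(text.lower().split())
    vector = [0] * len(vocabulary)
    for term, index in vocabulary.items():
        vector[index] = counts[term]
    return vector


# Hypothetical vocabulary mapping each term to its index in the vector.
vocabulary = {"doctor": 0, "great": 1, "staff": 2}
print(term_frequencies("Great doctor and great staff", vocabulary))  # [1, 2, 1]
```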

Semi-Supervised Learning with Word2Vec

In most tutorials, Word2Vec is presented as a stand-alone neural net preprocessor for feature extraction. Word2Vec generates a vector for each word in the text corpus, in a high-dimensional space, such that words that share contextual meaning are located close to one another. To use Word2Vec for classification, each word can be replaced by its corresponding word vector, and the word vectors are usually combined through a naive algorithm such as addition with normalization or cross product to get a sentence or text vector. Then, using these document vectors, we can use a simple classifier for multi-label classification. The advantage of using Word2Vec over a simple BOW feature extraction technique is that it supports semi-supervised learning, since the vocabulary from both the labeled and the unlabeled text can be used to generate the word vectors. This allows the words to have more contextual meaning. However, we have found that this approach does not appear to provide significant improvements over a BOW approach, especially when there isn’t a lot of labeled data for training the classifier.
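The "addition with normalization" combination mentioned above can be sketched as follows. The 2-dimensional embeddings are made up for the example and the function name is hypothetical; words without a vector are simply skipped, which is one possible convention among others.

```python
from math import sqrt

# Made-up word vectors (real ones would come from a trained Word2Vec model).
embeddings = {
    "friendly": (0.9, 0.1),
    "staff": (0.2, 0.8),
    "parking": (-0.5, 0.5),
}


def document_vector(tokens, embeddings):
    """Sum the vectors of the known tokens and scale the result to unit length."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return None  # no known word: no document vector
    dim = len(known[0])
    summed = [sum(v[d] for v in known) for d in range(dim)]
    norm = sqrt(sum(x * x for x in summed))
    return [x / norm for x in summed] if norm else summed


vec = document_vector("very friendly staff".split(), embeddings)
print(vec)
```

The resulting vector can then be fed to any standard classifier in place of the BOW term frequencies.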

Semantic Convolution for Low Support Topics

A common problem in multi-label text classification is a major imbalance of labels in the text corpus. We often see cases where most (>60%) of the sampled data is about the most prevalent topic, and more than half of the topic labels exist in <0.1% of the sampled data. Almost inherently with NLP and a BOW approach, this causes a p (number of features) >> n (size of training corpus) problem. Based on a general rule of thumb, getting 1,000 training examples for a low support topic would require millions of labeled training examples, which is prohibitively expensive.

In this world of ‘big data’ the data itself is actually cheap, but developing a tagged training set can be expensive. In the course of our development, we devised an elegant and scalable way to develop and maintain a robust training set across tens of industries (this will be the topic of a separate blog post).

The premise of Semantic Convolution is simple: if a particular word is a good indicator of a particular label, then words with similar meanings (semantics) should also be good indicators of the label. Since we have qualitative evidence that Word2Vec vectors encode a semantic meaning, we can use it to help find words with similar meanings from non-labeled corpora. This allows us to apply a Semantic transform after getting the term frequencies in the BOW pipeline, and before applying the Tf-Idf transform. To apply the Semantic Transform, we use the Word2Vec data to generate a correlation matrix between words with similar contextual meaning in the vocabulary.

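Given that vocabulary is a dictionary mapping each term to an index, the code to generate the correlation_matrix is:

import scipy.sparse

# Start from the identity so that every word stays correlated with itself.
correlation_matrix = scipy.sparse.identity(len(vocabulary), format="dok")
for word in vocabulary:
    try:
        # Keep at most 5 nearest neighbours whose similarity is above 0.5.
        # (With recent gensim versions the call is word2vec_model.wv.most_similar.)
        similar_words = [x[0] for x in word2vec_model.most_similar(word, topn=5) if x[1] > 0.5]
    except KeyError:
        # most_similar raises KeyError for words missing from the Word2Vec vocabulary.
        similar_words = []

    for similar_word in similar_words:
        if similar_word in vocabulary:
            correlation_matrix[vocabulary[word], vocabulary[similar_word]] = 1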

Using this correlation matrix we can generate the term-document matrix with the augmented term frequencies.

term_frequency_vector += term_frequency_vector * correlation_matrix

Applying this transformation with the correlation matrix increases the word count of all words with contextually similar meaning in the text. This improves the feature collection for low support topics, which allows more precise classification of reviews about low support topics with higher confidence. This allows small amounts of labeled data to be more useful for the machine-learning model, which reduces the cost of developing a robust training set. Also, as mentioned above, this leverages semi-supervised learning from the unlabeled data by building the vocabulary and Word2Vec vectors based on the entire text corpus.
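To make the effect of the transform concrete, here is a small worked example in plain Python, with nested lists standing in for the sparse matrices. The three-word vocabulary and the pair of similar words are made up; the update itself is the one given above, term_frequency_vector += term_frequency_vector * correlation_matrix, applied to a single document.

```python
# Hypothetical vocabulary and similar-word pairs (in practice the pairs
# would come from the Word2Vec model).
vocabulary = {"doctor": 0, "physician": 1, "parking": 2}
similar_pairs = [("doctor", "physician"), ("physician", "doctor")]

# Identity matrix plus a 1 for every pair of similar words.
size = len(vocabulary)
correlation = [[1 if i == j else 0 for j in range(size)] for i in range(size)]
for word, similar_word in similar_pairs:
    correlation[vocabulary[word]][vocabulary[similar_word]] = 1


def augment(term_frequency, correlation):
    """Return tf + tf * C for one document (row vector times matrix)."""
    n = len(term_frequency)
    product = [
        sum(term_frequency[i] * correlation[i][j] for i in range(n))
        for j in range(n)
    ]
    return [tf + p for tf, p in zip(term_frequency, product)]


term_frequency = [2, 0, 1]  # "doctor" twice, "parking" once, no "physician"
augmented = augment(term_frequency, correlation)
print(augmented)  # [4, 2, 2]
```

The document never mentions “physician”, yet its augmented count for that term is 2 because “doctor” occurs twice. Since the matrix starts from the identity, the original counts are doubled by this update, which only changes their scale.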


Ultimately, Semantic Convolution provides more value from the little labeled data available and improves the performance of the machine learning algorithm for classification tasks, especially for the low support categories. Also, semi-supervised learning with Word2Vec leverages the information gained from the vast amounts of unlabeled data while increasing both the precision and the support of the machine-learning model.

Dweep Shah and Anthony Johnson

Meaningless Words to Useful Phrases in Spark – word2phrase

Introduction to word2phrase

When we communicate, we know that individual words in the right places can change the meaning of what we’re trying to say. Add “very” in front of an adjective and you place more emphasis on the adjective. Add “york” after the word “new” and you get a location. Throw in “times” after that and now it’s a newspaper.

It follows that when working with data, these meanings should be known.  The three separate words “new”, “york”, and “times” are very different than “New York Times” as one phrase.  This is where the word2phrase algorithm comes into play.

Words to Phrases

At its core, word2phrase takes in a sentence of individual words and potentially turns bigrams (two consecutive words) into a phrase by joining the two words together with a symbol (underscore in our case).  Whether or not a bigram is turned into a phrase is determined by the training set and parameters set by the user.  Note that every two consecutive words are considered, so in a sentence with w1 w2 w3, bigrams would be w1w2, w2w3.

In our word2phrase implementation in Spark (and similarly in Gensim), there are two distinct steps: a training (estimator) step and an application (transform) step.

*For clarity, note that “new york” is a bigram, while “new_york” is a phrase.

Estimator Step

The training step is where we pass a training set to the word2phrase estimator. The estimator takes this dataset and produces a model using the algorithm. The model is called the transformer; we pass it the datasets that we want to transform, i.e. sentences with bigrams that we may want to turn into phrases.

The training set is an array of sentences. The algorithm takes these sentences and applies the following formula to give a score to each bigram:

score(wi, wj) = (count(wiwj) – delta) / (count(wi) * count(wj))

where wi and wj are word i and word j, and delta is a discounting coefficient that can be set to prevent phrases consisting of infrequent words from being formed. Here wiwj denotes word j following word i.

After the score for each bigram is calculated, those above a set threshold (this value can be changed by the user) will be transformed into phrases. The model produced by the estimator step is thus an array of bigrams: the ones that should be turned into phrases.
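The scoring rule can be sketched in plain Python. This is an illustration of the formula above on three made-up sentences, not the Spark implementation: the function names, the whitespace tokenization, and the values of delta and threshold are assumptions chosen so that the numbers are easy to check by hand.

```python
from collections import Counter


def score_bigrams(sentences, delta):
    """score(wi, wj) = (count(wi wj) - delta) / (count(wi) * count(wj))."""
    unigrams = Counter()
    bigrams = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        unigrams.update(tokens)
        # Consecutive pairs only; no bigram spans two sentences.
        bigrams.update(zip(tokens, tokens[1:]))
    return {
        (wi, wj): (count - delta) / (unigrams[wi] * unigrams[wj])
        for (wi, wj), count in bigrams.items()
    }


def learn_phrases(sentences, delta, threshold):
    """Bigrams whose score is above the threshold."""
    scores = score_bigrams(sentences, delta)
    return {bigram for bigram, score in scores.items() if score > threshold}


# Toy training set (hypothetical).
sentences = [
    "new york is big",
    "i love new york",
    "new york has new shops",
]
scores = score_bigrams(sentences, delta=1)
phrases = learn_phrases(sentences, delta=1, threshold=0.1)
print(scores[("new", "york")])  # (3 - 1) / (4 * 3)
print(phrases)                  # {('new', 'york')}
```

With delta = 1, every bigram seen only once gets a score of 0, which is how the discount keeps rare word pairs from becoming phrases.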

Transformer Step

The transform step is incredibly simple; pass in any array of sentences to your model and it will search for matching bigrams.  All matching bigrams in the array you passed in will then be turned to phrases.

You can repeat these steps to produce trigrams (i.e. three words into a phrase). For example, “I read the New York Times” may produce “I read the new_york Times” after the first run, but running it again gives “I read the new_york_times”, because in the second run “new_york” is itself an individual word.
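The transform step and its repetition can be sketched as follows. The function name is hypothetical, the phrase sets are written by hand instead of being learned, the text is lowercased, and matching is done greedily from left to right, which is an assumption about how overlapping bigrams are resolved.

```python
def apply_phrases(sentence, phrases):
    """Join each matching bigram with an underscore, scanning left to right."""
    tokens = sentence.split()
    out = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in phrases:
            out.append(tokens[i] + "_" + tokens[i + 1])
            i += 2  # both words are consumed by the phrase
        else:
            out.append(tokens[i])
            i += 1
    return " ".join(out)


# First run: "new york" is a known bigram.
first_pass = apply_phrases("i read the new york times", {("new", "york")})
print(first_pass)   # i read the new_york times

# Second run: "new_york" now counts as a single word, so it can pair with "times".
second_pass = apply_phrases(first_pass, {("new_york", "times")})
print(second_pass)  # i read the new_york_times
```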


First we create our training dataset; it’s a dataframe in which the bigrams “new york” and “test drive” appear frequently. (The sentences make no sense, as they are made of randomly generated words. See below for a link to the full dataframe.)

You can copy/paste this into your spark shell to test it, so long as you have the word2phrase algorithm included (available as a maven package with coordinates com.reputation.spark:word2phrase:1.0.1).

Download the package, create our test dataframe:

spark-shell --packages com.reputation.spark:word2phrase:1.0.1


val wordDataFrame = sqlContext.createDataFrame(Seq(
(0, "new york test drive cool york how always learn media new york ."),
(1, "online york new york learn to media cool time ."),
(2, "media play how cool times play ."),
(3, "code to to code york to loaded times media ."),
(4, "play awesome to york ."),
// ... rows omitted here; see the link to the full dataframe below ...
(1099, "work please ideone how awesome times ."),
(1100, "play how play awesome to new york york awesome use new york work please loaded      always like ."),
(1101, "learn like I media online new york ."),
(1102, "media follow learn code code there to york times ."),
(1103, "cool use play work please york cool new york how follow ."),
(1104, "awesome how loaded media use us cool new york online code judge ideone like ."),
(1105, "judge media times time ideone new york new york time us fun ."),
(1106, "new york to time there media time fun there new like media time time ."),
(1107, "awesome to new times learn cool code play how to work please to learn to ."),
(1108, "there work please online new york how to play play judge how always work please ."),
(1109, "fun ideone to play loaded like how ."),
(1110, "fun york test drive awesome play times ideone new us media like follow .")
)).toDF("label", "inputWords")

We set the input and output column names and create the model (the estimator step, represented by the fit(wordDataFrame) function).

scala> val t = new Word2Phrase().setInputCol("inputWords").setOutputCol("out")
t: = deltathresholdScal_f07fb0d91c1f

scala> val model = t.fit(wordDataFrame)

Here are some of the scores (Table 1) calculated by the algorithm before removing those below the threshold (note all the scores above the threshold are shown here).  The default values have delta -> 100, threshold -> 0.00001, and minWords -> 0.

Table 1
bigram        score
test drive    0.002214815139686856
work please   0.002047826661381…
new york      5.946183949006843E-4
york new      -1.64600247723372…
york york     -6.43001404062082…
york how      -6.64999302561707…
how new       -6.80666229773923…
new new       -7.42968903739342…
to new        -7.52757602015383E-5
york to       -9.25567252744992…

only showing top 10 rows

So our model produces three bigrams that will be searched for in the transform step:

test drive
work please
new york

We then use this model to transform our original dataframe sentences and view the results.  Unfortunately you can’t see the entire row in the spark-shell, but in the out column it’s clear that all instances of “new york” and “test drive” have been transformed into “new_york” and “test_drive”.

scala> val bi_gram_data = model.transform(wordDataFrame)
bi_gram_data: org.apache.spark.sql.DataFrame = [label: int, inputWords: string … 1 more field]


Table 2
label  inputWords            out
0      new york test dri…    new_york test_dri…
1      online york new y…    online york new_y…
2      media play how co…    media play how co…
3      code to to code y…    code to to code y…
4      play awesome to y…    play awesome to y…
5      like I I always .     like I I always .
6      how to there lear…    how to there lear…
7      judge time us pla…    judge time us pla…
8      judge test drive …    judge test_drive …
9      judge follow fun …    judge follow fun …
10     how I follow ideo…    how I follow ideo…
11     use use learn I t…    use use learn I t…
12     us new york alway…    us new_york alway…
13     there always how …    there always how …
14     always time media…    always time media…
15     how test drive to…    how test_drive to…
16     cool us online ti…    cool us online ti…
17     follow time aweso…    follow time aweso…
18     us york test driv…    us york test_driv…
19     use fun new york …    use fun new_york …

only showing top 20 rows

The algorithm and test dataset (testSentences.scala) are available at this repository.