## Introduction

When analyzing online reviews, we often focus on keywords. For various purposes such as review classification or keyword suggestion, we may need to group those words by closeness of meaning. Given a set of keywords, we may want to split it into clusters of words. One way we can do so is by using Word2Vec to map each word to a vector value, and then apply hierarchical clustering to those vectors.

First, let’s look at a simple example. We have the following input: [“banana”, “apple”, “lemon”, “nice”, “car”, “RV”, “truck”, “desk”]

From a clustering algorithm, we would expect this output:

• cluster 1: [“banana”, “apple”, “lemon”]
• cluster 2: [“car”,”rv”,”truck”]
• out of the clusters: [“nice”, “desk”]

## 1. Word2Vec

In this post we focus more on the grouping algorithms than on the details of the implementation of the Word2Vec model.

In short, Word2Vec provides word embedding. It associates vector values to words. Word2Vec trains a neural network to predict the neighboring words around each word from a corpus. Once it’s trained, the vector value is extracted from the neural network layers before the projections try to predict the context. Words with close meanings should be represented by vectors with close values.

We trained our Word2Vec model on 10^6 online reviews with the gensim Word2Vec implementation. We used a window of 10 words and generated 300 dimensional vectors.

Our use cases involve data from various industries, so when we want to work with keywords on online reviews for a given industry, we train Word2Vec on reviews from this industry, since this context may be useful in specifying the meaning of a word. For example, considering the word “limb”, the meaning may be different in restaurant reviews than in hospital reviews.

## 2. Hierarchical clustering

To generate clusters, we will apply a strategy of hierarchical clustering. This is an iterative process to create a new cluster at each step by aggregating two clusters. At each step it tries to minimize the increase of a given distance. For a set of n vectors, there are n-1 steps that take you from n singleton clusters to one cluster with everything. This process can also be well represented graphically to show any level of granularity within a single graph (as below).

## 3. Distances used for clustering

Of course the key to generating these clusters is the distance metric used to determine which keywords or clusters are “close” together. To measure the dispersion in a set of vectors we can use different metrics. A classic distance is the empirical variance, corresponding to the geometric distance between the vectors.

### 3.1 Empirical variance

Let $S = (x_i)_{i \in [|1,n|]}$ be a set of n points, and G its center of inertia (its mean). Then the empirical variance is:

$V = \sum_{i = 1}^{n}{||x_i - G||^2}$

But we can define two other variances : the inter-class variance $V_{inter}$, and the intra-class variance $V_{intra}$.

Let $(C_k)_{k \in [|1, K|]}$ be K subset of S, creating a partition of S (the clusters). $(G_k)_{k \in [|1, K|]}$ is the center (mean) of the clusters $(C_i)_{k \in [|1, K|]}$ and $|C_k|_{k \in [|1, K|]}$ is the number of elements in each cluster.

$V_{inter} = \sum_{k \in [|1, K|]}{|C_k|\times|| G_k - G ||^2}$
$V_{intra} = \sum_{k \in [|1, K|]}{\sum_{i \in C_k}{||x_i - G_k||^2}}$

We understand than $V_{intra}$ is a measure of the dispersion inside each subset, and $V_{inter}$ is a measure of the distance  between the clusters.

Huygens theorem shows that:

$V = V_{inter} + V_{intra}$

Therefore, when we do clustering, since V is a constant, minimizing the intra-class variance is the same than maximizing the inter-class variance. This is what we want to generate: clusters very distant one from another (high $V_{inter}$ ), and every cluster close to his own center (low $V_{intra}$ ).

Now we will see that at each aggregation of two clusters, there is a way to make the aggregation with the lowest increase of $V_{intra}$ with ward distance.

3.2 Ward distance

Ward distance is a distance between two clusters, which we can use in hierarchical clustering: at each step of the algorithm, we try to aggregate two clusters with the smallest Ward distance between them. This distance between two clusters is chosen because of the following result:

Aggregating the two clusters with the minimum Ward distance is equivalent to make the aggregation with the smallest increase of $V_{intra}$ .

This is an example of hierarchical clustering using Ward distance on a set of keywords from hospital online-reviews.

### 3.3 Thresholds on cluster distance

Now that we have an efficient method to perform hierarchical clustering which tries to keep the intra-class variance as low as possible, we have to determine where to stop this process of aggregating groups; i.e which level of granularity to use as our clusters in any particular application.

There are different ways to choose the right number of clusters (when to stop the aggregations). A classic way to do that in hierarchical clustering is to use the maximum of the second derivative of the Ward distance, which means to effectively look for a big gap in the Ward distance and to stop before this gap.

But we are looking for clusters with high word similarity, so we don’t want to use the second derivative method, which is relative. We also want to find a standard criterion that is independent of the cluster size. Therefore, we will use thresholds on a given distance but the distance should be normalized.

Even though we use the Ward distance as an aggregation distance, we do not necessarily have to use it to delimit the clusters. Instead we can use a threshold on a “dispersion” metric for each cluster $C_k$. Let’s consider the following three metrics, in addition to the Ward distance:

• Mean variance in the cluster: $\frac{\sum_{i \in C_k}{||x_i - G_k||^2}}{|C_k|}$
• Maximum distance between two elements in the cluster: $\max\limits_{i,j \in C_k}{||x_i - x_j||^2}$
• Maximum distance to the mean of the cluster: $\max\limits_{i \in C_k}{||x_i - G_k||^2}$

On this and other examples, the maximum distance seems to be worse than the other metrics. The maximum distance to the mean is a good metric but it doesn’t consider all the words in the cluster. We will use a threshold on the mean variance to delimit the clusters. In the next dendrogram we plot the threshold: Each cluster has to have a mean variance below 0.6. In this example we have 4 clusters.

But, we need to be careful when using other metrics than the Ward distance to select clusters. The metric may not increase at each iteration, unlike the Ward distance. Here is an example of this:

This would give an advantage to the maximum distance between two vectors in each cluster over other metrics because the maximum distance between two vectors can’t decrease when we add words in a cluster. For our purposes, we will keep the mean variance but choose to stop the aggregation the first time that the mean variance is over the threshold, in order to keep this method consistent.

Now that we have a way to select the clusters, let’s explore some other issues and possible ways to address them.

## 4. Improve the quality of the clusters

### 4.1 Outliers

We select clusters on the mean variance. But we can have a low mean variance due to a high number of vectors close to each other and one vector too far of them. We want to remove vectors which are too far from the center of the cluster as we see in the left part of the following schema :

In this schema, the left mean variance is lower in the red cluster than in the blue one, but the point on the bottom left corner of the red cluster needs to be removed from it.

### 4.2 “Dimension” effect

We used the mean variance to select the clusters because it’s normalized. We did that because we don’t want to preselect the size of the clusters.

But we can notice that for the same mean variance and a different number of elements in the clusters the lexical proximity seems to increase with the size of the cluster. For example in the dendrogram that we used to illustrate the thresholds, [“arrogant”,”personality”,”bedside”] has the same mean variance as [“encouraging”, “understanding”, “caring”, “passionate”,”listens”]). This may be caused by the high dimension (300) of the vector space.

We can use a better metric to keep the selection of clusters independent of their size. We have good results with: $\frac{||x_i - G_k ||}{\sqrt{|C_k|}}$

But, it seems difficult to use the mean of this metric directly with a threshold after the Ward aggregation because the mean of this metric is not increasing with the aggregations of the clusters.

### 4.3 New filter against those two problems

Even though the mean of the last metric  $\frac{||x_i - G_k ||}{\sqrt{|C_k|}}$ is not useful in selecting the clusters, we may have another use for it. Instead of using it on a cluster level, we can use it on a word level. For each vector, if it’s too far from its cluster’s center, it will be removed from the cluster. This is very useful for two purposes: to get rid of outliers, and to get rid of small clusters of words with very distant meanings.

Ultimately, we have a three steps process. Let’s return to our the first example: [“banana”, “apple”, “lemon”, “nice”, “car”, “RV”, “truck”, “desk”]

1.  Hierarchical aggregation with Ward distance
2. Select clusters with a threshold t1 = 0.6 on their mean varianceWe have three clusters :
– [“banana”, “apple”, “lemon”]
– [“rv”, “car”, “truck”]
– [“nice”,”desk”]
3. Remove outlier $x_i$ from cluster $C_k$ when $\frac{||x_i - G_k ||}{\sqrt{|C_k|}}$ is lower than a threshold t2 = 0.4.We limit the “dimension” effect at the same time.
Both “nice” and “desk” don’t pass the test, so we have the expected result :
– cluster 1 : [“banana”, “apple”, “lemon”]
– cluster 2: [“rv”, “car”, “truck”]
– out of the clusters: [“desk”,”nice”]

## 5. Further extensions – Use of clusters to extract lexical information

Now that we can select groups of words with very similar meaning, we have some ideas for further research.

An advantage of the Word2Vec word embedding is that it may have geometrical interpretations. We have the example from the Word2Vec creators, who found the word “smallest” by looking at the word representation with the closest cosine distance to vector(“biggest”) – vector(“big”) + vector(“small”).

We understand easily the meaning of the vector (vector(“biggest”) – vector(“big”)). But what about (vector(“big”) – vector(“small”)) ? Along the little segment from vector(“small”) to vector(“big”) we hope to find some adjectives related to the size and increasing from “small” to “big”.

We tried to detect lexical information with tools such as Principal Component Analysis (PCA) on a huge group of words but without success, so we hope that this clustering will help us to achieve this.

It may be interesting to use tools such as PCA on each cluster to detect geometrical structures with lexical interpretations.

This is an idea of the kind of results that we may expect:

## 6. Conclusion

Vectors generated by Word2Vec can be used to find clusters of words. We try to find clusters without restraining their number or their size, with hierarchical clustering. We can manually choose thresholds values, depending on how scattered we want the clusters to be.

These clustering techniques are useful to find groups of words with the same meaning. It can be used to find keywords more precisely: Let say that we have 100 keywords related to a theme and we want to use Word2Vec to detect new keywords related to this theme. Instead of looking for vectors close to each word representation or close to the entire 100 vector set, we can look for vectors close to each cluster that we generated.

It may be very interesting to analyze geometrical repartitions of the vectors inside the clusters with tools such as PCA, and to find if they correspond to a lexical structure.

###### Sources

Efficient Estimation of Word Representations in Vector Space, Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean, 2013

SciPy Hierarchical Clustering and Dendrogram Tutorial, Jörn Hees, 2015