Online reviews: keyword clusters

Introduction

When analyzing online reviews, we often focus on keywords. For various purposes such as review classification or keyword suggestion, we may need to group those words by closeness of meaning. Given a set of keywords, we may want to split it into clusters of words. One way we can do so is by using Word2Vec to map each word to a vector value, and then apply hierarchical clustering to those vectors.

First, let’s look at a simple example. We have the following input: [“banana”, “apple”, “lemon”, “nice”, “car”, “RV”, “truck”, “desk”]

From a clustering algorithm, we would expect this output:

  • cluster 1: [“banana”, “apple”, “lemon”]
  • cluster 2: [“car”, “rv”, “truck”]
  • out of the clusters: [“nice”, “desk”]

1. Word2Vec

In this post we focus more on the grouping algorithms than on the details of the implementation of the Word2Vec model.

In short, Word2Vec provides word embeddings: it associates a vector with each word. Word2Vec trains a neural network to predict the neighboring words around each word of a corpus. Once trained, a word’s vector is extracted from the network’s hidden (projection) layer, just before the output layer that predicts the context. Words with close meanings should be represented by vectors with close values.

We trained our Word2Vec model on 10^6 online reviews with the gensim Word2Vec implementation. We used a window of 10 words and generated 300-dimensional vectors.
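As an illustration, a minimal gensim training call with these settings could look like the following sketch (the toy corpus and the min_count/workers values are ours, not the original setup; only the window and vector size come from the text above):

from gensim.models import Word2Vec

# Hypothetical tokenized corpus: one review per element (about 10^6 in practice).
reviews = [
    ["the", "doctor", "was", "very", "caring", "and", "attentive"],
    ["great", "food", "but", "the", "service", "was", "slow"],
]

# Window of 10 words, 300-dimensional vectors, as described above.
# (On gensim < 4.0 the size argument is called size instead of vector_size.)
model = Word2Vec(reviews, vector_size=300, window=10, min_count=1, workers=4)

vector = model.wv["caring"]  # a 300-dimensional numpy array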

Our use cases involve data from various industries, so when we want to work with keywords from online reviews for a given industry, we train Word2Vec on reviews from that industry, since this context may be useful in pinning down the meaning of a word. For example, the word “limb” may mean something different in restaurant reviews than in hospital reviews.

2. Hierarchical clustering

To generate clusters, we apply hierarchical clustering. This is an iterative process that creates a new cluster at each step by aggregating two existing clusters, choosing the merge that minimizes the increase of a given distance. For a set of n vectors, there are n-1 steps that take you from n singleton clusters to one cluster containing everything. This process can also be represented graphically, showing every level of granularity within a single diagram (as below).

example of hierarchical clusters

3. Distances used for clustering

Of course, the key to generating these clusters is the distance metric used to determine which keywords or clusters are “close” together. To measure the dispersion in a set of vectors we can use different metrics. A classic one is the empirical variance, which captures the geometric dispersion of the vectors around their mean.

3.1 Empirical variance

Let S = (x_i)_{i \in [|1,n|]} be a set of n points, and G its center of inertia (its mean). Then the empirical variance is:

V = \sum_{i = 1}^{n}{||x_i - G||^2}

But we can define two other variances: the inter-class variance V_{inter}, and the intra-class variance V_{intra}.

Let (C_k)_{k \in [|1, K|]} be K subsets of S forming a partition of S (the clusters). (G_k)_{k \in [|1, K|]} are the centers (means) of the clusters (C_k)_{k \in [|1, K|]}, and (|C_k|)_{k \in [|1, K|]} are the numbers of elements in each cluster.

V_{inter} = \sum_{k \in [|1, K|]}{|C_k|\times|| G_k - G ||^2}
V_{intra} = \sum_{k \in [|1, K|]}{\sum_{i \in C_k}{||x_i - G_k||^2}}

We see that V_{intra} is a measure of the dispersion inside each subset, and V_{inter} is a measure of the distance between the clusters.

Huygens’ theorem shows that:

V = V_{inter} + V_{intra}

Therefore, when we do clustering, since V is constant, minimizing the intra-class variance is the same as maximizing the inter-class variance. This is what we want: clusters that are far from one another (high V_{inter}), with every cluster close to its own center (low V_{intra}).
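A quick numerical check of this decomposition, on random points and an arbitrary partition (both made up for the example):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))                   # 8 points in dimension 3
labels = np.array([0, 0, 0, 1, 1, 1, 2, 2])   # an arbitrary partition into 3 clusters

G = X.mean(axis=0)
V = ((X - G) ** 2).sum()                      # total empirical variance (not divided by n)

V_inter, V_intra = 0.0, 0.0
for k in np.unique(labels):
    C_k = X[labels == k]
    G_k = C_k.mean(axis=0)
    V_inter += len(C_k) * ((G_k - G) ** 2).sum()
    V_intra += ((C_k - G_k) ** 2).sum()

assert np.isclose(V, V_inter + V_intra)       # Huygens' theorem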

Now we will see that the Ward distance gives us, at each aggregation of two clusters, the merge with the lowest increase of V_{intra}.

3.2 Ward distance

Ward distance is a distance between two clusters that we can use in hierarchical clustering: at each step of the algorithm, we aggregate the two clusters with the smallest Ward distance between them. This distance is chosen because of the following result:

Aggregating the two clusters with the minimum Ward distance is equivalent to making the aggregation with the smallest increase of V_{intra}.

Below is an example of hierarchical clustering using the Ward distance on a set of keywords from hospital online reviews.

dendrogram_with_ward_distance
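With SciPy (the library behind the tutorial cited in the sources), a Ward aggregation like the one above can be sketched as follows; the keyword list is made up for the example and model is the Word2Vec model trained earlier:

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

keywords = ["caring", "rude", "arrogant", "waiting", "billing"]  # hypothetical keywords
vectors = np.array([model.wv[w] for w in keywords])

# Agglomerative clustering with the Ward criterion.
Z = linkage(vectors, method="ward")

dendrogram(Z, labels=keywords)
plt.show()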

3.3 Thresholds on cluster distance

Now that we have an efficient method to perform hierarchical clustering while keeping the intra-class variance as low as possible, we have to determine where to stop this process of aggregating groups, i.e. which level of granularity to use as our clusters in any particular application.

There are different ways to choose the right number of clusters (when to stop the aggregations). A classic way in hierarchical clustering is to use the maximum of the second derivative of the Ward distance, which in effect means looking for a big gap in the Ward distance and stopping just before this gap.

But we are looking for clusters with high word similarity, so we don’t want to use the second-derivative method, which is relative. We also want a standard criterion that is independent of the cluster size. Therefore, we will use a threshold on a given distance, but that distance should be normalized.

Even though we use the Ward distance as an aggregation distance, we do not necessarily have to use it to delimit the clusters. Instead we can use a threshold on a “dispersion” metric for each cluster C_k. Let’s consider the following three metrics, in addition to the Ward distance:

  • Mean variance in the cluster: \frac{\sum_{i \in C_k}{||x_i - G_k||^2}}{|C_k|}
  • Maximum distance between two elements in the cluster: \max\limits_{i,j \in C_k}{||x_i - x_j||^2}
  • Maximum distance to the mean of the cluster: \max\limits_{i \in C_k}{||x_i - G_k||^2}

dendrograms_with_different_metrics

On this and other examples, the maximum distance between two elements seems to perform worse than the other metrics. The maximum distance to the mean is a good metric, but it doesn’t take all the words in the cluster into account. We will therefore use a threshold on the mean variance to delimit the clusters. In the next dendrogram we plot the threshold: each cluster must have a mean variance below 0.6, which gives 4 clusters in this example.

threshold_dendrogram
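Here is a rough sketch of that selection rule, assuming the linkage matrix Z and the vectors array from the previous snippet: we follow the merges in order and keep the last cut where every cluster’s mean variance is still below the threshold t1 = 0.6.

import numpy as np
from scipy.cluster.hierarchy import fcluster

def mean_variance(points):
    """Mean squared distance to the cluster center."""
    center = points.mean(axis=0)
    return ((points - center) ** 2).sum(axis=1).mean()

def select_clusters(Z, vectors, t1=0.6):
    """Stop the aggregation the first time a cluster's mean variance exceeds t1."""
    n = len(vectors)
    best = np.arange(1, n + 1)                     # start from singleton clusters
    for k in range(n - 1, 0, -1):                  # follow the merges one by one
        labels = fcluster(Z, t=k, criterion="maxclust")
        worst = max(mean_variance(vectors[labels == c]) for c in np.unique(labels))
        if worst > t1:
            break                                  # first offending merge: keep the previous cut
        best = labels
    return best

labels = select_clusters(Z, vectors, t1=0.6)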

But we need to be careful when using metrics other than the Ward distance to select clusters: unlike the Ward distance, such a metric may not increase at each iteration. Here is an example of this:

variance_not_increasing

This would give an advantage to the maximum distance between two vectors in each cluster over the other metrics, because that maximum distance can’t decrease when we add words to a cluster. For our purposes, we will keep the mean variance but stop the aggregation the first time the mean variance exceeds the threshold, in order to keep the method consistent.

Now that we have a way to select the clusters, let’s explore some other issues and possible ways to address them.

4. Improve the quality of the clusters

4.1 Outliers

We select clusters on the mean variance. But a cluster can have a low mean variance because it contains many vectors close to each other and one vector far away from them. We want to remove vectors that are too far from the center of their cluster, as shown on the left of the following diagram:

cluster_outlier

In this diagram, the mean variance is lower in the red cluster than in the blue one, but the point in the bottom-left corner of the red cluster should be removed from it.

4.2 “Dimension” effect

We used the mean variance to select the clusters because it is normalized: we don’t want to preselect the size of the clusters.

But we notice that, for the same mean variance and a different number of elements, the lexical proximity seems to increase with the size of the cluster. For example, in the dendrogram we used to illustrate the thresholds, [“arrogant”, “personality”, “bedside”] has the same mean variance as [“encouraging”, “understanding”, “caring”, “passionate”, “listens”]. This may be caused by the high dimension (300) of the vector space.

We can use a better metric to keep the selection of clusters independent of their size. We have good results with: \frac{||x_i - G_k ||}{\sqrt{|C_k|}}

But it seems difficult to use the mean of this metric directly with a threshold after the Ward aggregation, because this mean does not increase monotonically with the aggregations of the clusters.

4.3 New filter against those two problems

Even though the mean of the last metric \frac{||x_i - G_k ||}{\sqrt{|C_k|}} is not useful for selecting the clusters, we may have another use for it. Instead of using it at the cluster level, we can use it at the word level: each vector that is too far from its cluster’s center is removed from the cluster. This is very useful for two purposes: to get rid of outliers, and to get rid of small clusters of words with very distant meanings.
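A sketch of this word-level filter, reusing the vectors and labels arrays from the earlier snippets and a hypothetical threshold t2:

import numpy as np

def filter_outliers(vectors, labels, t2=0.4):
    """Drop a word from its cluster when ||x_i - G_k|| / sqrt(|C_k|) exceeds t2."""
    keep = np.ones(len(vectors), dtype=bool)
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        center = vectors[idx].mean(axis=0)
        normalized = np.linalg.norm(vectors[idx] - center, axis=1) / np.sqrt(len(idx))
        keep[idx] = normalized <= t2
    return keep

keep = filter_outliers(vectors, labels, t2=0.4)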

Ultimately, we have a three-step process. Let’s return to our first example: [“banana”, “apple”, “lemon”, “nice”, “car”, “RV”, “truck”, “desk”]

  1. Hierarchical aggregation with Ward distance (clustering_example)
  2. Select clusters with a threshold t1 = 0.6 on their mean variance (clustering_example_threshold). We have three clusters:
    – [“banana”, “apple”, “lemon”]
    – [“rv”, “car”, “truck”]
    – [“nice”, “desk”]
  3. Remove outlier x_i from cluster C_k when \frac{||x_i - G_k ||}{\sqrt{|C_k|}} is greater than a threshold t2 = 0.4. This also limits the “dimension” effect.
    Both “nice” and “desk” fail this test, so we have the expected result:
    – cluster 1: [“banana”, “apple”, “lemon”]
    – cluster 2: [“rv”, “car”, “truck”]
    – out of the clusters: [“desk”, “nice”]

5. Further extensions – Use of clusters to extract lexical information

Now that we can select groups of words with very similar meaning, we have some ideas for further research.

An advantage of the Word2Vec word embedding is that it may have geometrical interpretations. We have the example from the Word2Vec creators, who found the word “smallest” by looking at the word representation with the closest cosine distance to vector(“biggest”) – vector(“big”) + vector(“small”).

The meaning of the vector (vector(“biggest”) – vector(“big”)) is easy to grasp. But what about (vector(“big”) – vector(“small”))? Along the segment from vector(“small”) to vector(“big”), we hope to find adjectives related to size, ordered from “small” to “big”.
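For reference, the analogy from the Word2Vec paper can be reproduced with gensim along these lines (a sketch; the actual neighbours depend on the trained model):

# vector("biggest") - vector("big") + vector("small") should land near "smallest".
result = model.wv.most_similar(positive=["biggest", "small"], negative=["big"], topn=1)
print(result)  # with a well-trained model, something like [("smallest", 0.7...)]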

We tried to detect lexical information with tools such as Principal Component Analysis (PCA) on a huge group of words, but without success, so we hope that this clustering will help us achieve it.

It may be interesting to use tools such as PCA on each cluster to detect geometrical structures with lexical interpretations.

This is an idea of the kind of results that we may expect:

PCA_on_clusters

6. Conclusion

Vectors generated by Word2Vec can be used to find clusters of words. With hierarchical clustering, we find those clusters without constraining their number or their size. We can manually choose threshold values, depending on how scattered we want the clusters to be.

These clustering techniques are useful for finding groups of words with the same meaning. They can also be used to find keywords more precisely: say we have 100 keywords related to a theme and we want to use Word2Vec to detect new keywords related to this theme. Instead of looking for vectors close to each word representation, or close to the entire set of 100 vectors, we can look for vectors close to each cluster that we generated.
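A sketch of that idea, with made-up clusters and the model from earlier: instead of querying the neighbours of each keyword separately, we query the neighbours of each cluster’s mean vector.

import numpy as np

clusters = {
    "staff_attitude": ["caring", "rude", "arrogant"],   # hypothetical clusters
    "waiting": ["wait", "waiting", "queue"],
}

for name, words in clusters.items():
    centroid = np.mean([model.wv[w] for w in words], axis=0)
    # Words close to the cluster center are candidate new keywords for the theme.
    candidates = model.wv.similar_by_vector(centroid, topn=10)
    print(name, [w for w, _ in candidates if w not in words])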

It may be very interesting to analyze geometrical repartitions of the vectors inside the clusters with tools such as PCA, and to find if they correspond to a lexical structure.

Sources

Efficient Estimation of Word Representations in Vector Space, Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean, 2013

SciPy Hierarchical Clustering and Dendrogram Tutorial, Jörn Hees, 2015


Sentiment analysis : Machine-Learning approach

Following up on my earlier post, as the frequency-based models were not very accurate and a good rule-based model was very hard to elaborate, we implemented what we know to be state-of-the-art methods for sentiment analysis on short sentences and list the pros and cons of these methods. We train all of them on a dataset of 10,000 sentences, classified as positive, neutral, or negative by human experts. We then benchmark the models on a held-out sample of 500 sentences.

Word representations in a vector space

Feature extraction

To build a deep-learning model for sentiment analysis, we first have to represent our sentences in a vector space. We studied frequency-based methods in a previous post. They represent a sentence either by a bag-of-words, which is a list of the words that appear in the sentence with their frequencies, or by a term frequency–inverse document frequency (tf-idf) vector, where the word frequencies in a sentence are weighted by how rare the words are in the entire corpus.

These methods are very useful for long texts. For example, we can describe a newspaper article or a book very precisely by its most frequent words. However, for very short sentences they are not accurate at all: first, because ten words are not much to aggregate, and second, because the structure of the sentence is very important for analyzing sentiment, and tf-idf models hardly capture negations, amplifications, and concessions. For instance, “Very good food, but bad service…” would have the same representation as “Bad food, but very good service!”.

Word vectors

We represent our sentences with vectors that take into account both the words that appear and the semantic structure. A first way to do this is to represent every word with an n-feature vector, and to represent our sentence with an n × length matrix. We can, for instance, build a vector of the same size as the vocabulary (say 10,000) and represent the i-th word with a 1 in the i-th position and 0 elsewhere.

Tomas Mikolov developed another way to represent words in a vector space, with features that capture the semantic compositionality. He trains the following neural network on a very large corpus:

Neural network trained to get Word2Vec's word vectors

He trains this model and represents the word “ants” by the output vector of the hidden layer. The features of the word vectors we obtain capture most of the semantic information, because they capture enough information to estimate the statistical distribution of the words that can follow “ants” in a sentence.

What we do is similar. We represent every word by an index vector, and we integrate into our deep-learning model a hidden layer of linear neurons that transforms these big vectors into much smaller ones. We take these smaller vectors as the input of a convolutional neural network. We train the model as a whole, so that the word vectors we use are trained to fit the sentiment information of the words, i.e. so that the features we get capture enough information on the words to predict the sentiment of the sentence.
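As a rough sketch of this architecture in Keras (layer sizes and hyperparameters here are illustrative, not our production model): word indices go through a small embedding layer, then a convolutional block, and the whole thing is trained end to end on the sentiment labels.

from tensorflow.keras import layers, models

VOCAB_SIZE = 10_000   # size of the index vocabulary
MAX_LEN = 50          # sentences padded / truncated to 50 tokens
EMBED_DIM = 100       # size of the learned word vectors

cnn_model = models.Sequential([
    layers.Input(shape=(MAX_LEN,), dtype="int32"),          # a sentence = MAX_LEN word indices
    layers.Embedding(VOCAB_SIZE, EMBED_DIM),                 # index -> small dense word vector
    layers.Conv1D(128, kernel_size=3, activation="relu"),    # convolutions over windows of words
    layers.GlobalMaxPooling1D(),
    layers.Dense(64, activation="sigmoid"),
    layers.Dropout(0.5),                                      # against overfitting
    layers.Dense(3, activation="softmax"),                    # negative / neutral / positive
])
cnn_model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])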

Sentence representations

Doc2vec

We want to build a representation of a sentence that takes into account not only the words that appear, but also the sentence’s semantic structure. The easiest way to do this is to stack the word vectors into a matrix that represents the sentence. Another way to do it, also developed by Tomas Mikolov, is usually called Doc2Vec.

He modifies the neural network we used for Word2Vec so that it takes as input both the word vectors that come before and a vector that depends on the sentence they are in. We treat the features of this sentence vector as parameters of our model and optimize them using gradient descent. Doing that, we get for every sentence a set of features that represents the structure of the sentence. These features capture most of the useful information on how the words follow each other.

Neural Network trained to get Doc2Vec's document vectors

Pros and cons for sentiment analysis

These document vectors are very useful for us, because the sentiment of a sentence can be deduced very precisely from these semantic features. As a matter of fact, users writing reviews with positive or negative sentiments will have completely different ways of composing their words. Feeding a logistic regression with these vectors and training the regression to predict sentiment is known to be one of the best methods for sentiment analysis, both for fine-grained (Very negative / Negative / Neutral / Positive / Very positive) and for more general Negative / Positive classification.
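A hedged sketch of that pipeline with gensim’s Doc2Vec and scikit-learn (the tiny corpus and labels are placeholders):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.linear_model import LogisticRegression

# Placeholder corpus: tokenized sentences with sentiment labels (1 = positive, 0 = negative).
sentences = [["very", "good", "food"], ["bad", "service"], ["great", "wine"], ["terrible", "wait"]]
labels = [1, 0, 1, 0]

docs = [TaggedDocument(words, [i]) for i, words in enumerate(sentences)]
d2v = Doc2Vec(docs, vector_size=100, epochs=20, min_count=1)

X = [d2v.dv[i] for i in range(len(sentences))]
clf = LogisticRegression().fit(X, labels)

# Inference: infer_vector runs a small gradient descent for each new sentence,
# which is the costly step discussed below.
vec = d2v.infer_vector(["good", "food", "but", "slow", "service"])
print(clf.predict([vec]))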

We implemented and benchmarked such a method but chose not to put it into production. As a matter of fact, building the document vector of a sentence is not an easy operation: for every sentence, we have to run a gradient descent in order to find the right coefficients for this vector. Compared to our other methods for sentiment analysis, where the preprocessing is a very short algorithm (a matter of milliseconds) and the evaluation is almost instantaneous, Doc2Vec classification requires a significant hardware investment and/or takes much longer to process. Before taking that leap, we decided to explore representing our sentences by a matrix of word vectors and classifying sentiments with a deep-learning model.

Convolutional neural networks


The next method we explored for sentiment classification uses a multi-layer neural network with a convolutional layer, multiple dense layers of neurons with a sigmoid activation function, and additional layers designed to prevent overfitting. We explained how convolutional layers work in a previous article. It is a technique that was designed for computer vision, and that improves the accuracy of most image classification and object detection models.

The idea is to apply convolutions to the image with a set of filters, and to take the new images they produce as inputs of the next layer. Depending on the filter we apply, the output image will either capture the edges, smooth the image, or sharpen the key patterns. Training the filters’ coefficients helps our model build extremely relevant features to feed the next layers. These features work like local patches that learn compositionality. During training, the model automatically learns the best patches for the classification problem we want to solve. The features it learns are location-invariant: it convolves an object at the bottom of the frame exactly the same way as one at the top. This is key not only for object detection, but for sentiment analysis as well.

Convolution used for edge detection


Applications in Natural Language Processing

As these models became more and more popular in computer vision, a lot of people tried to apply them in other fields. They obtained significantly good results in speech recognition and in natural language processing. In speech recognition, the trick is to build the frequency-intensity distribution of the signal at every timestamp and to convolve these images.

For NLP tasks like sentiment analysis, we do something very similar. We build word vectors and convolve the image built by juxtaposing these vectors in order to build relevant features.

Intuitively, the filters will enable us to highlight intensely positive or intensely negative words, and to understand the relation between negations and what follows them. They capture relevant information about how the words follow each other and learn particular words or n-grams that carry sentiment information. We then feed the outputs of these convolutions to a fully connected deep neural network, which selects the best of these features in order to classify the sentiment of the sentence. The results on our datasets are pretty good.

Convolutional neural networks for Natural Language Processing

LSTM

We also studied, implemented and benchmarked the Long Short-Term Memory (LSTM) recurrent neural network model. It has a very interesting architecture for processing natural language, and it works much the way we do: it reads the sentence from the first word to the last one and tries to figure out the sentiment after each step. For example, for the sentence “The food sucks, the wine was worse.”, it will read “The”, then “food”, then “sucks”, “the” and “wine”. It keeps in mind both a vector that represents what came before (memory) and a partial output. For instance, it will already think that the sentence is negative halfway through, and then it continues to update as it processes more data.

Recurrent neural networks - The human way to do sentiment analysis

This is the general idea, but the implementation of these networks is much more complex because it is easy to keep recent information in mind, but very difficult to have a model that captures most of the useful long-term dependencies while avoiding the problems linked to vanishing gradient.
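Structurally, an LSTM classifier only swaps the convolutional block of the earlier Keras sketch for a recurrent layer; the machinery that preserves long-term dependencies lives inside the LSTM cell itself (again, sizes are illustrative):

from tensorflow.keras import layers, models

VOCAB_SIZE, MAX_LEN, EMBED_DIM = 10_000, 50, 100   # same illustrative constants as before

lstm_model = models.Sequential([
    layers.Input(shape=(MAX_LEN,), dtype="int32"),
    layers.Embedding(VOCAB_SIZE, EMBED_DIM),
    # The LSTM reads the sentence token by token, carrying a memory of what came before.
    layers.LSTM(128),
    layers.Dense(3, activation="softmax"),              # negative / neutral / positive
])
lstm_model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])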

This RNN structure looks very well suited to sentiment analysis tasks, and it performs well for speech recognition and for translation. However, it slows down the evaluation process considerably and doesn’t improve accuracy that much in our application, so it should be implemented with care.

Sentiment trees – RNTN model

Richard Socher et al. describe another interesting method for sentiment analysis in the paper Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank. The idea is that every word carries a sentiment, and the structure of the sentence should enable us to compose these sentiments in order to get the overall sentiment of the sentence.

Stanford Sentiment Treebank example

They implement a model called the RNTN (Recursive Neural Tensor Network). It represents the words by vectors and uses a class of tensor-based composition functions to describe compositionality. Stanford has a very large corpus of movie reviews turned into trees by their NLP libraries, where every node is classified from very negative to very positive by a human annotator. They trained the RNTN model on this corpus and got very good results. Unfortunately, since it is trained on movie reviews, it doesn’t perform quite as well on our reviews.

The big advantage of this model is that it is very interpretable: we can understand very precisely how it works, visualize which words it detects as positive or negative, and see how it understands their compositions. However, it requires an extremely large training set (around 10,000 sentences with fine-grained annotations on every node) for every specific application. As we continue to gather more and more detailed training data, this is just one of the types of models we are exploring to continue improving the sentiment models we have in production!


Semantic Convolution with Word2Vec

At Reputation.com, we work with millions of online reviews from hundreds of sources. One of the unusual characteristics of reviews compared to the vast majority of text corpora is that, almost by definition, reviews are structured in such a way that they can be categorized (in one or many dimensions depending on the review site and/or industry). However, we often find ourselves doing text classification/tagging on topics that are not already labeled by the review site.  This article is an informal introduction to a set of techniques we have developed to leverage existing unlabeled corpora in conjunction with the labeled data. In particular, we present a semi-supervised learning algorithm for multi-label text classification.

In recent years, a lot of text classification projects have used supervised learning methods (Naive Bayes, SVM) primarily due to their substantial improvements over non-supervised strategies such as traditional clustering in NLP tasks. Until very recently, most NLP classification work was done with the traditional Bag of Words (BOW) approach – perhaps with a bit of context through the use of a limited range of N-grams and skip-grams. BOW is a feature extraction technique where the text is represented as the frequency of each word in the document, disregarding grammar and ordering but keeping multiplicity. In most cases, defining a pipeline combining the BOW feature extraction technique with a Tf-Idf transform and a simple classifier (Naive Bayes, SVM) produces decent results with respect to most classification metrics.
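As a reference point, such a baseline pipeline is only a few lines with scikit-learn (a sketch with placeholder training data):

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

bow_pipeline = Pipeline([
    ("bow", CountVectorizer(ngram_range=(1, 2))),   # bag of words, unigrams and bigrams
    ("tfidf", TfidfTransformer()),
    ("clf", MultinomialNB()),
])

# Placeholder labeled reviews.
texts = ["great staff, very friendly", "waited two hours, terrible billing experience"]
topics = ["staff", "billing"]
bow_pipeline.fit(texts, topics)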

Semi-Supervised Learning with Word2Vec

In most tutorials, Word2Vec is presented as a stand-alone neural net preprocessor for feature extraction. Word2Vec generates a vector for each word in the text corpus, in a high-dimensional space, such that words that share contextual meaning are located in close proximity to one another. To use Word2Vec for classification, each word can be replaced by its corresponding word vector, and the word vectors are usually combined through a naive algorithm such as addition with normalization or cross product to get a sentence or text vector. Then, using these document vectors, we can train a simple classifier for multi-label classification. The advantage of using Word2Vec over a simple BOW feature extraction technique is that it supports semi-supervised learning, since the vocabulary from both the labeled and the unlabeled text can be used to generate the word vectors. This gives the words more contextual meaning. However, we have found that this approach does not provide significant improvements over a BOW approach, especially when there isn’t a lot of labeled data for training the classifier.
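A sketch of that word-vector averaging approach, assuming a trained gensim word2vec_model and the placeholder texts and topics from the previous snippet:

import numpy as np
from sklearn.linear_model import LogisticRegression

def document_vector(tokens, keyed_vectors, dim=300):
    """Average of the word vectors found in the model, normalized."""
    vectors = [keyed_vectors[t] for t in tokens if t in keyed_vectors]
    if not vectors:
        return np.zeros(dim)
    v = np.mean(vectors, axis=0)
    return v / (np.linalg.norm(v) or 1.0)

X = [document_vector(text.split(), word2vec_model.wv) for text in texts]
doc_clf = LogisticRegression().fit(X, topics)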

Semantic Convolution for Low Support Topics

A common problem in multi-label text classification is a major imbalance of labels in the corpus. We often see cases where most (>60%) of the sampled data is about the most prevalent topic, and more than half of the topic labels occur in <0.1% of the sampled data. With NLP and a BOW approach, this almost inherently causes a p (number of features) >> n (size of training corpus) problem. Based on general rules of thumb, getting 1,000 training examples for a low-support topic would require millions of labeled training examples, which is prohibitively expensive.

In this world of ‘big data’ the data itself is actually cheap, but developing a tagged training set can be expensive. In the course of our development, we devised an elegant and scalable way to develop and maintain a robust training set across tens of industries (this will be the topic of a separate blog post).

The premise of Semantic Convolution is simple: if a particular word is a good indicator of a particular label, then words with similar meanings (semantics) should also be good indicators of that label. Since we have qualitative evidence that Word2Vec vectors encode semantic meaning, we can use them to help find words with similar meanings in non-labeled corpora. This allows us to apply a Semantic transform after getting the term frequencies in the BOW pipeline, and before applying the Tf-Idf transform. To apply the Semantic transform, we use the Word2Vec data to generate a correlation matrix between words with similar contextual meaning in the vocabulary.

Given a vocabulary dictionary mapping each term to an index, the code to generate the correlation_matrix is:

import scipy.sparse

# Start from the identity matrix so that each word is correlated with itself.
correlation_matrix = scipy.sparse.identity(len(vocabulary), format="dok")

for word in vocabulary:
    similar_words = []
    try:
        # Keep the 5 nearest neighbors with a cosine similarity above 0.5.
        similar_words = [w for w, sim in word2vec_model.most_similar(word, topn=5) if sim > 0.5]
    except KeyError:
        pass  # the word is not in the Word2Vec vocabulary

    for similar_word in similar_words:
        if similar_word in vocabulary:
            correlation_matrix[vocabulary[word], vocabulary[similar_word]] = 1

Using this correlation matrix we can generate the term-document matrix with the augmented term frequencies.

term_frequency_vector += term_frequency_vector * correlation_matrix

Applying this transformation with the correlation matrix increases the word count of all words with contextually similar meaning in the text. This improves the feature collection for low support topics, which allows more precise classification of reviews about low support topics with higher confidence. This allows small amounts of labeled data to be more useful for the machine-learning model, which reduces the cost of developing a robust training set. Also, as mentioned above, this leverages semi-supervised learning from the unlabeled data by building the vocabulary and Word2Vec vectors based on the entire text corpora.

Conclusions

Ultimately, the Semantic Convolution extracts more value from the little labeled data we have and improves the performance of the machine-learning algorithm for classification tasks, especially for low-support categories. Also, semi-supervised learning with Word2Vec leverages the information contained in the vast amounts of unlabeled data while increasing both the precision and the support of the machine-learning model.

Dweep Shah and Anthony Johnson


Meaningless Words to Useful Phrases in Spark – word2phrase

Introduction to word2phrase

When we communicate, we often know that individual words in the correct placements can change the meaning of what we’re trying to say. Add “very” in front of an adjective and you place more emphasis on the adjective. Add “york” after the word “new” and you get a location. Throw in “times” after that and now it’s a newspaper.

It follows that when working with data, these meanings should be known. The three separate words “new”, “york”, and “times” are very different from “New York Times” as one phrase. This is where the word2phrase algorithm comes into play.

Words to Phrases

At its core, word2phrase takes in a sentence of individual words and potentially turns bigrams (two consecutive words) into a phrase by joining the two words together with a symbol (an underscore in our case). Whether or not a bigram is turned into a phrase is determined by the training set and parameters set by the user. Note that every pair of consecutive words is considered, so in a sentence with w1 w2 w3, the bigrams would be w1 w2 and w2 w3.

In our word2phrase implementation in Spark (done similarly in Gensim), there are two distinct steps: a training (estimator) step and an application (transform) step.

*For clarity, note that “new york” is a bigram, while “new_york” is a phrase.

Estimator Step

The training step is where we pass a training set to the word2phrase estimator. The estimator takes this dataset and produces a model using the algorithm. The model is called the transformer; we pass it the datasets we want to transform, i.e. sentences with bigrams that we may want to turn into phrases.

The training dataset is an array of sentences. The algorithm takes these sentences and applies the following formula to give a score to each bigram:

score(wi, wj) = (count(wiwj) – delta) / (count(wi) * count(wj))

where wi and wj are word i and word j, delta is a discounting coefficient that can be set to prevent phrases consisting of infrequent words from being formed, and count(wiwj) is the number of times word j follows word i.

After the score for each bigram is calculated, those above a set threshold (this value can be changed by the user) will be transformed into phrases. The model produced by the estimator step is thus an array of bigrams: the ones that should be turned into phrases.
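In plain Python, the scoring step of the estimator boils down to something like the following sketch (a simplified, non-distributed illustration of the formula, not the Spark implementation):

from collections import Counter

def score_bigrams(sentences, delta=100):
    """Score each consecutive word pair with (count(wi wj) - delta) / (count(wi) * count(wj))."""
    word_counts = Counter(w for s in sentences for w in s)
    bigram_counts = Counter((s[i], s[i + 1]) for s in sentences for i in range(len(s) - 1))
    return {
        (wi, wj): (c - delta) / (word_counts[wi] * word_counts[wj])
        for (wi, wj), c in bigram_counts.items()
    }

# Toy corpus: with enough occurrences, ("new", "york") scores above the default threshold.
scores = score_bigrams([["i", "read", "the", "new", "york", "times"]] * 200)
phrases = [bigram for bigram, s in scores.items() if s > 0.00001]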

Transformer Step

The transform step is incredibly simple: pass any array of sentences to your model and it will search for matching bigrams. All matching bigrams in the array you passed in will then be turned into phrases.

You can repeat these steps to produce trigrams (i.e. three words joined into a phrase). For example, “I read the New York Times” may produce “I read the new_york Times” after the first run; run it again to get “I read the new_york_times”, because in the second run “new_york” is treated as an individual word.

Example

First we create our training dataset; it’s a dataframe where the bigrams “new york” and “test drive” appear frequently. (The sentences make no sense as they are randomly generated words. See below for a link to the full dataframe.)

You can copy/paste this into your spark shell to test it, so long as you have the word2phrase algorithm included (available as a maven package with coordinates com.reputation.spark:word2phrase:1.0.1).

Download the package, create our test dataframe:

spark-shell --packages com.reputation.spark:word2phrase:1.0.1

import org.apache.spark.ml.feature.Word2Phrase

val wordDataFrame = sqlContext.createDataFrame(Seq(
(0, "new york test drive cool york how always learn media new york ."),
(1, "online york new york learn to media cool time ."),
(2, "media play how cool times play ."),
(3, "code to to code york to loaded times media ."),
(4, "play awesome to york ."),
.
.
.
(1099, "work please ideone how awesome times ."),
(1100, "play how play awesome to new york york awesome use new york work please loaded      always like ."),
(1101, "learn like I media online new york ."),
(1102, "media follow learn code code there to york times ."),
(1103, "cool use play work please york cool new york how follow ."),
(1104, "awesome how loaded media use us cool new york online code judge ideone like ."),
(1105, "judge media times time ideone new york new york time us fun ."),
(1106, "new york to time there media time fun there new like media time time ."),
(1107, "awesome to new times learn cool code play how to work please to learn to ."),
(1108, "there work please online new york how to play play judge how always work please ."),
(1109, "fun ideone to play loaded like how ."),
(1110, "fun york test drive awesome play times ideone new us media like follow .")
)).toDF("label", "inputWords")

We set the input and output column names and create the model (the estimator step, represented by the fit(wordDataFrame) function).

scala> val t = new Word2Phrase().setInputCol("inputWords").setOutputCol("out")
t: org.apache.spark.ml.feature.Word2Phrase = deltathresholdScal_f07fb0d91c1f

scala> val model = t.fit(wordDataFrame)

Here are some of the scores (Table 1) calculated by the algorithm before removing those below the threshold (note all the scores above the threshold are shown here).  The default values have delta -> 100, threshold -> 0.00001, and minWords -> 0.

Table 1
bigram score
 test drive  0.002214815139686856
 work please 0.002047826661381…
 new york  5.946183949006843E-4
 york new  -1.64600247723372…
 york york  -6.43001404062082…
 york how  -6.64999302561707…
 how new  -6.80666229773923…
 new new  -7.42968903739342…
 to new  -7.52757602015383E-5
 york to  -9.25567252744992…

only showing top 10 rows

So our model produces three bigrams that will be searched for in the transform step:

test drive
work please
new york

We then use this model to transform our original dataframe sentences and view the results.  Unfortunately you can’t see the entire row in the spark-shell, but in the out column it’s clear that all instances of “new york” and “test drive” have been transformed into “new_york” and “test_drive”.

scala> val bi_gram_data = model.transform(wordDataFrame)
bi_gram_data: org.apache.spark.sql.DataFrame = [label: int, inputWords: string … 1 more field]

scala> bi_gram_data.show()

Table 2
label inputWords out
0 new york test dri…  new_york test_dri…
1 online york new y…  online york new_y…
2 media play how co…  media play how co…
3 code to to code y…  code to to code y…
4 play awesome to y…  play awesome to y…
5  like I I always .   like I I always .
6 how to there lear…  how to there lear…
7 judge time us pla…  judge time us pla…
8 judge test drive …  judge test_drive …
9 judge follow fun …  judge follow fun …
 10  how I follow ideo…  how I follow ideo…
 11  use use learn I t…  use use learn I t…
 12  us new york alway…  us new_york alway…
 13  there always how …  there always how …
 14  always time media…  always time media…
 15 how test drive to…  how test_drive to…
 16  cool us online ti…  cool us online ti…
 17 follow time aweso…  follow time aweso…
 18  us york test driv…  us york test_driv…
 19  use fun new york …  use fun new_york …

only showing top 20 rows

The algorithm and test dataset (testSentences.scala) are available at this repository.