With so many online reviews across many social media websites, it is hard for companies to keep track of their online reputation. Businesses can benefit immensely if they can understand general trends of what their customers are talking about online. A common method to quickly understand trends in topics being discussed in a corpus of text is Latent Dirichlet Allocation.
Latent Dirichlet Allocation assumes each document consists of a combination of topics, and each topic consists of a combination of words. It then approximates probability distributions of topics in a given document and of words in a given topic.
We will perform topic modeling via Latent Dirichlet Allocation (LDA) on online reviews of a beauty retailer from various social media sources.
- Cleaning Text Data: Before we model the reviews data with LDA, we clean the review text by lemmatizing, removing punctuation, removing stop words, and filtering for only English reviews.
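A minimal sketch of this cleaning step in pure Python (the stop-word list and lemma map below are tiny illustrative stand-ins; in practice we use NLP libraries such as NLTK or spaCy, plus a language detector for the English filter):

```python
import re

# Tiny illustrative stand-ins for real NLTK/spaCy resources
STOP_WORDS = {"the", "is", "a", "an", "and", "i", "it", "was"}
LEMMAS = {"stores": "store", "loved": "love", "products": "product"}

def clean_review(text):
    """Lowercase, strip punctuation, lemmatize, and drop stop words."""
    text = re.sub(r"[^\w\s]", " ", text.lower())       # remove punctuation
    tokens = [LEMMAS.get(t, t) for t in text.split()]  # crude lemmatization
    return [t for t in tokens if t not in STOP_WORDS]

print(clean_review("I loved the products, and the store is nice!"))
# → ['love', 'product', 'store', 'nice']
```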
- Identifying Bigrams and Trigrams: We want to identify bigrams and trigrams so we can concatenate them and treat them as single tokens. Bigrams are two-word phrases, e.g. ‘social media’, where ‘social’ and ‘media’ are more likely to co-occur than to appear separately. Likewise, trigrams are three-word phrases that tend to co-occur, e.g. ‘Procter and Gamble’. We use the Pointwise Mutual Information (PMI) score to identify significant bigrams and trigrams to concatenate. We also filter bigrams and trigrams with the part-of-speech patterns (noun/adj, noun) and (noun/adj, any, noun/adj), because these are common structures for noun phrases. This helps the LDA model cluster topics better.
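As a self-contained illustration (gensim's Phrases model does this at scale), PMI for adjacent word pairs can be computed directly from token counts; pairs scoring above a chosen threshold are then joined into a single token:

```python
from collections import Counter
from math import log

def pmi_bigrams(tokens, min_count=2):
    """Score adjacent word pairs with PMI = log P(a,b) / (P(a) * P(b))."""
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return {
        (a, b): log((c / (n - 1)) / ((unigrams[a] / n) * (unigrams[b] / n)))
        for (a, b), c in bigrams.items() if c >= min_count
    }

tokens = ["great", "social", "media", "buzz", "social", "media",
          "works", "great", "media"]
scores = pmi_bigrams(tokens)
# 'social' always co-occurs with 'media', so it is the only pair that
# qualifies, with a high positive score; we would concatenate it as
# 'social_media'
```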
- Filtering Nouns: Nouns are the most likely indicators of a topic. For example, in the sentence ‘The store is nice’, we know the sentence is talking about ‘store’. The other words provide context and explanation about the topic (‘store’) itself. Therefore, filtering for nouns keeps the words that are most interpretable in the topic model.
- Optimizing the number of topics: LDA requires that we specify the number of topics that exist in a corpus of text. There are several common measures that can be optimized, such as predictive likelihood, perplexity, and coherence. Much of the literature indicates that maximizing coherence, particularly a measure named Cv, leads to better human interpretability. This measure assesses the interpretability of topics given the set of top words in each generated topic. Therefore, we will optimize this measure.
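We used gensim's CoherenceModel with coherence='c_v' over a range of topic counts. The Cv computation itself is involved; as a self-contained stand-in, here is the simpler UMass-style coherence, which likewise rewards topics whose top words co-occur in the same documents:

```python
from itertools import combinations
from math import log

def umass_coherence(topic_words, docs):
    """UMass-style coherence: sum over word pairs of
    log((D(wi, wj) + 1) / D(wj)), with D counting containing documents."""
    def d(*words):
        return sum(all(w in doc for w in words) for doc in docs)
    return sum(log((d(wi, wj) + 1) / d(wj))
               for wi, wj in combinations(topic_words, 2) if d(wj) > 0)

docs = [{"hair", "salon", "stylist"}, {"hair", "salon", "color"},
        {"price", "product", "selection"}, {"price", "product", "stock"}]
coherent = umass_coherence(["hair", "salon"], docs)   # words that co-occur
mixed = umass_coherence(["hair", "price"], docs)      # words that never do
# coherent > mixed, so the coherent topic count would be preferred
```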
The number of topics that yields maximum coherence is around 3-4. We will examine both, because 4 topics may still be coherent while providing more information.
- Using gensim’s LDA package to perform topic modeling: With the optimal number of topics, we use gensim’s LDA package to model the data. After comparing 3 topics (left) and 4 topics (right), we concluded that grouping into 4 topics yielded more coherent and insightful topics:
These 4 main topics can be summarized as: hair salon service, product selection and pricing, brow bar and makeup service, and customer service.
- Further enhancing interpretability via relevance score: Sometimes, words ranked as top words for a given topic are ranked high simply because they are globally frequent across the corpus. The relevance score helps to prioritize terms that belong more exclusively to a given topic, which can increase interpretability even more. The relevance of term w to topic k, given a weight parameter λ, is defined as:

r(w, k | λ) = λ · log p(w | k) + (1 − λ) · log [ p(w | k) / p(w) ]
The first term measures the probability of term w occurring in topic k, and the second term measures the lift in the term's probability within a topic relative to its marginal probability across the corpus. A lower lambda value gives more weight to the second term, i.e. to topic exclusivity. We can use Python's pyLDAvis for this. For example, when lowering lambda, we can see that topic 0 ranks terms that are even more specific to the topic of hair salon service the highest.

The pyLDAvis tool also gives two other important pieces of information. The circles represent the topics, and the distance between the circles visualizes how related topics are to each other; the above plot shows that our topics are quite distinct. Additionally, the size of a circle represents how prevalent that topic is across the corpus of reviews. Circle number 1 represents the topic about customer service, and the fact that it is the biggest circle means that the reviews mention customer service the most. Circle number 2 represents the topic about hair, and the visualization indicates that this topic makes up 22.3% of all tokens.
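To make the reranking concrete, here is the relevance computation on toy probabilities (the numbers are invented for illustration): a globally frequent word tops the raw p(w|k) ranking, while a topic-exclusive word tops the relevance ranking at low lambda:

```python
from math import log

def relevance(p_w_given_k, p_w, lam):
    """r(w, k | lam) = lam*log p(w|k) + (1 - lam)*log(p(w|k)/p(w))."""
    return {w: lam * log(p) + (1 - lam) * log(p / p_w[w])
            for w, p in p_w_given_k.items()}

# Toy numbers: 'great' is frequent everywhere, 'stylist' is topic-specific
p_w_given_k = {"great": 0.10, "stylist": 0.08}
p_w = {"great": 0.09, "stylist": 0.01}

by_prob = max(p_w_given_k, key=p_w_given_k.get)   # 'great'
rel = relevance(p_w_given_k, p_w, lam=0.2)
by_relevance = max(rel, key=rel.get)              # 'stylist'
```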
After applying the above steps, here are the 4 topics and top words for each:
The model has enabled us to understand the 4 most common topics in online reviews about the beauty retailer: customer service, hair salon service, product selection, and eyebrow/makeup service.
It is pretty obvious nowadays that reviews are powerful. We can tell what people care about by what they write in their reviews, and we can see what these reviewers like or dislike by analyzing the sentiment of the popular topics being discussed. An issue that arises when performing topic modeling is that people use different words for the same thing, so topics overlap. Two different reviews can address the same topic using different wording. For example, one reviewer can use the word “cafe” while another says “restaurant”, but they can be talking about the same thing.
There are various techniques that can help cluster words together to avoid splitting a single topic. This blog focuses on using GloVe, an unsupervised learning algorithm developed by Jeffrey Pennington, Richard Socher, and Christopher D. Manning at Stanford. In this blog we will be using Stanford’s pre-trained word vectors that will help us detect words that are similar to each other.
Let’s look at an example of how GloVe can improve topic identification in a certain collection (corpus) of reviews.
The goal of this analysis is to better identify when a review is talking about a particular topic even if it uses different phrasing. Specifically, we are going to look at reviews in the restaurant/bar industry.
We can take a look at the most frequent words in a corpus of reviews. After doing simple tokenizing, here are the top 20 words and their raw frequency in 4,842 restaurant reviews:
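In code, this raw count is a straightforward Counter over the tokenized reviews (toy corpus here):

```python
from collections import Counter

reviews = ["the beer here is great", "great drink selection",
           "the wait for a drink was long"]
tokens = [t for review in reviews for t in review.split()]
print(Counter(tokens).most_common(3))
# → [('the', 2), ('great', 2), ('drink', 2)]
```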
Now, there is already some overlap in these words. For example, beer and drink are the same topic but are phrased differently. How can we make a corpus of words so that there is less overlap of words?
Methodology (k-means clustering)
Since there is some overlap, it would be beneficial to group similar words so the generated topics are better defined. A clustering technique is helpful here. Specifically, we used k-means clustering, which measures the distance between data points – in this case our words – and groups them into k clusters by minimizing the distance between each word and the centroid of its cluster. We minimized the Euclidean distance between each word and its centroid, and set k using the number of words n and a reduction factor f, where k = n·f.
Now how do we get the distance between words? Lucky for us, we aren’t the first people who have wanted to analyze the distances between words. As mentioned above, we have a pre-trained word vector for each word in the corpus. This vector can be used as the Euclidean coordinates of the word. We used a 50-dimensional vector for simplicity and speed.
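Here is a minimal k-means sketch on toy 2-D vectors standing in for the 50-dimensional GloVe vectors (a real pipeline would use scikit-learn's KMeans with k-means++ initialization):

```python
from math import dist  # Euclidean distance (Python 3.8+)

def kmeans(points, k, iters=20):
    """Assign each vector to its nearest centroid, recompute centroids, repeat."""
    centroids = points[:k]  # naive init, for illustration only
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        centroids = [tuple(sum(x) / len(cl) for x in zip(*cl)) if cl
                     else centroids[i] for i, cl in enumerate(clusters)]
    return clusters

# Toy 2-D stand-ins for GloVe word vectors
vectors = {"beer": (1.0, 1.1), "drink": (1.1, 0.9), "drinks": (0.9, 1.0),
           "wait": (5.0, 5.1), "waitress": (5.1, 4.9)}
n, f = len(vectors), 0.4
k = max(1, round(n * f))               # k = n*f as described above
clusters = kmeans(list(vectors.values()), k)
# the drink-related vectors end up in one cluster, the wait-related in the other
```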
Additionally, we can reduce the number of meaningless clusters by computing the average Euclidean distance within each cluster. If this average distance is greater than the average distance over all clusters that have at least two words, we throw out the cluster. Of our 2,704 clusters, 786 had at least two words, and the average Euclidean distance across those clusters was 1.755. Furthermore, 344 clusters had an average Euclidean distance greater than 1.755 and were discarded. Below are the meaningless clusters with their corresponding average Euclidean distances.
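This filtering rule can be sketched as follows (toy 2-D vectors; the real computation runs on the GloVe vectors):

```python
from itertools import combinations
from math import dist

def avg_pairwise(cluster):
    """Average Euclidean distance over all pairs in a cluster."""
    pairs = list(combinations(cluster, 2))
    return sum(dist(a, b) for a, b in pairs) / len(pairs)

def keep_tight_clusters(clusters):
    """Discard clusters whose internal spread exceeds the overall average."""
    multi = [c for c in clusters if len(c) >= 2]
    cutoff = sum(avg_pairwise(c) for c in multi) / len(multi)
    return [c for c in multi if avg_pairwise(c) <= cutoff]

tight = [(0.0, 0.0), (0.1, 0.0)]    # near-synonyms
loose = [(0.0, 0.0), (9.0, 9.0)]    # unrelated words lumped together
kept = keep_tight_clusters([tight, loose, [(7.0, 7.0)]])  # singleton ignored
# only the tight cluster survives
```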
The following shows good clusters that were produced and their corresponding average Euclidean distance:
After we remove the bad clusters and keep the good ones, we replace the words in each good cluster with the most frequent word in that cluster. For example, say we have the cluster [beer, drink, drinks]. All of the words in this cluster are replaced by the most frequently mentioned word, drink. This allows us to see the most frequent topics instead of individual words. Our new most frequent word groups are as follows:
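A sketch of this replacement step:

```python
from collections import Counter

def canonicalize(tokens, clusters):
    """Replace every word in a cluster by the cluster's most frequent member."""
    freq = Counter(tokens)
    mapping = {}
    for cluster in clusters:
        head = max(cluster, key=lambda w: freq[w])
        mapping.update({w: head for w in cluster})
    return [mapping.get(t, t) for t in tokens]

tokens = ["beer", "drink", "drinks", "drink", "wait", "food"]
merged = canonicalize(tokens, [["beer", "drink", "drinks"]])
print(Counter(merged).most_common(1))
# → [('drink', 4)]
```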
There are a few important things to note from these results:
- Drink and beer get clustered together which increases their frequency and ranking
- Wait, ask, waitress, husband, and lunch are also ranked higher because of their associated clusters
- Place, time, staff and order all have decreased rankings because they have no clusters
Each of these words can be considered a topic that includes everything within its cluster. Using k-means clustering to group these words takes into account the different ways people can say things (and not just synonyms). Grouping words together helps us detect when people talk about a certain topic, and allows topics that can be phrased in numerous different ways to surface. However, word clustering is not perfect and should not be the only tool used to classify topics. We are continuing to work on this in order to identify better and better ways to group similar customer feedback and draw insights for our clients.
When analyzing online reviews, we often focus on keywords. For various purposes such as review classification or keyword suggestion, we may need to group those words by closeness of meaning. Given a set of keywords, we may want to split it into clusters of words. One way we can do so is by using Word2Vec to map each word to a vector value, and then apply hierarchical clustering to those vectors.
First, let’s look at a simple example. We have the following input: [“banana”, “apple”, “lemon”, “nice”, “car”, “RV”, “truck”, “desk”]
From a clustering algorithm, we would expect this output:
- cluster 1: [“banana”, “apple”, “lemon”]
- cluster 2: [“car”,”rv”,”truck”]
- out of the clusters: [“nice”, “desk”]
In this post we focus more on the grouping algorithms than on the details of the implementation of the Word2Vec model.
1. Word2Vec

In short, Word2Vec provides word embeddings: it associates vector values with words. Word2Vec trains a neural network to predict the neighboring words around each word in a corpus. Once it's trained, each word's vector value is extracted from the hidden (projection) layer of the network. Words with close meanings should be represented by vectors with close values.
We trained our Word2Vec model on 10^6 online reviews with the gensim Word2Vec implementation. We used a window of 10 words and generated 300 dimensional vectors.
Our use cases involve data from various industries, so when we want to work with keywords on online reviews for a given industry, we train Word2Vec on reviews from this industry, since this context may be useful in specifying the meaning of a word. For example, considering the word “limb”, the meaning may be different in restaurant reviews than in hospital reviews.
2. Hierarchical clustering
To generate clusters, we apply hierarchical clustering. This is an iterative process that creates a new cluster at each step by aggregating two existing clusters, trying at each step to minimize the increase of a given distance. For a set of n vectors, there are n−1 steps that take you from n singleton clusters to one cluster containing everything. This process can also be represented graphically to show any level of granularity within a single graph (as below).
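With SciPy, the whole aggregation is a couple of calls (toy 2-D vectors stand in for the 300-dimensional Word2Vec vectors):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Toy 2-D stand-ins for Word2Vec vectors: two obvious groups
vectors = np.array([[0.0, 0.0], [0.1, 0.1], [0.0, 0.2],
                    [5.0, 5.0], [5.1, 4.9]])

Z = linkage(vectors, method="ward")              # the n-1 aggregation steps
labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree at 2 clusters
# the first three vectors share one label, the last two the other
```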
3. Distances used for clustering
Of course the key to generating these clusters is the distance metric used to determine which keywords or clusters are “close” together. To measure the dispersion in a set of vectors we can use different metrics. A classic distance is the empirical variance, corresponding to the geometric distance between the vectors.
3.1 Empirical variance
Let S = {x₁, …, x_n} be a set of n points, and G its center of inertia (its mean). Then the empirical variance is:

V = (1/n) · Σᵢ ‖xᵢ − G‖²

But we can define two other variances: the inter-class variance V_inter and the intra-class variance V_intra.

Let C₁, …, C_K be K subsets of S forming a partition of S (the clusters), G_k the center (mean) of cluster C_k, and n_k the number of elements in C_k:

V_intra = (1/n) · Σ_k Σ_{x ∈ C_k} ‖x − G_k‖²

V_inter = (1/n) · Σ_k n_k · ‖G_k − G‖²

We see that V_intra is a measure of the dispersion inside each subset, and V_inter is a measure of the distance between the clusters.

Huygens' theorem shows that:

V = V_intra + V_inter

Therefore, when we do clustering, since V is a constant, minimizing the intra-class variance is the same as maximizing the inter-class variance. This is what we want: clusters very distant from one another (high V_inter), and every cluster close to its own center (low V_intra).
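The decomposition is easy to check numerically; a small sketch with two hand-made clusters:

```python
# Numeric check of Huygens' theorem: V = V_intra + V_inter
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
clusters = [points[:2], points[2:]]

def mean(pts):
    return tuple(sum(c) / len(pts) for c in zip(*pts))

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

n, G = len(points), mean(points)
V = sum(sq_dist(p, G) for p in points) / n
V_intra = sum(sq_dist(p, mean(c)) for c in clusters for p in c) / n
V_inter = sum(len(c) * sq_dist(mean(c), G) for c in clusters) / n
print(V, V_intra, V_inter)   # → 26.0 1.0 25.0, and 26.0 == 1.0 + 25.0
```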
Now we will see that the Ward distance gives a way, at each aggregation of two clusters, to make the aggregation with the lowest increase of the intra-class variance V_intra.
3.2 Ward distance
Ward distance is a distance between two clusters, which we can use in hierarchical clustering: at each step of the algorithm, we aggregate the two clusters with the smallest Ward distance between them. For two clusters A and B with centers G_A, G_B and sizes n_A, n_B, it is defined as:

d(A, B) = (n_A · n_B) / (n_A + n_B) · ‖G_A − G_B‖²

This distance is chosen because of the following result: aggregating the two clusters with the minimum Ward distance is equivalent to making the aggregation with the smallest increase of the intra-class variance V_intra.
This is an example of hierarchical clustering using Ward distance on a set of keywords from hospital online-reviews.
3.3 Thresholds on cluster distance
Now that we have an efficient method to perform hierarchical clustering which keeps the intra-class variance as low as possible, we have to determine where to stop the process of aggregating groups, i.e. which level of granularity to use as our clusters in any particular application.

There are different ways to choose the right number of clusters (when to stop the aggregations). A classic way in hierarchical clustering is to use the maximum of the second derivative of the Ward distance, which effectively means looking for a big gap in the Ward distance and stopping before this gap.

But we are looking for clusters with high word similarity, so we don't want to use the second-derivative method, which is relative to the aggregation sequence; we want a standard criterion that is independent of the cluster size. Therefore, we will use thresholds on a given distance, but the distance should be normalized.

Even though we use the Ward distance as the aggregation distance, we do not necessarily have to use it to delimit the clusters. Instead, we can use a threshold on a “dispersion” metric computed for each cluster. Let's consider the following three metrics, in addition to the Ward distance:
- Mean variance in the cluster: (1/n_k) · Σ_{x ∈ C_k} ‖x − G_k‖²
- Maximum distance between two elements in the cluster: max_{x,y ∈ C_k} ‖x − y‖
- Maximum distance to the mean of the cluster: max_{x ∈ C_k} ‖x − G_k‖
On this and other examples, the maximum distance between two elements seems to perform worse than the other metrics. The maximum distance to the mean is a good metric, but it doesn't consider all the words in the cluster. We will use a threshold on the mean variance to delimit the clusters. In the next dendrogram we plot the threshold: each cluster has to have a mean variance below 0.6. In this example we have 4 clusters.
But we need to be careful when using metrics other than the Ward distance to select clusters: the metric may not increase at each iteration, unlike the Ward distance. Here is an example of this:

This would give an advantage to the maximum distance between two vectors in each cluster over the other metrics, because the maximum distance can't decrease when we add words to a cluster. For our purposes, we will keep the mean variance but choose to stop the aggregation the first time the mean variance exceeds the threshold, in order to keep the method consistent.
Now that we have a way to select the clusters, let’s explore some other issues and possible ways to address them.
4. Improve the quality of the clusters
4.1 Outliers

We select clusters based on their mean variance. But a cluster can have a low mean variance because it contains a high number of vectors close to each other plus one vector far away from them. We want to remove the vectors that are too far from the center of their cluster, as in the left part of the following diagram:

In this diagram, the mean variance is lower in the red cluster than in the blue one, but the point in the bottom left corner of the red cluster needs to be removed from it.
4.2 “Dimension” effect
We used the mean variance to select the clusters because it is normalized: we don't want to preselect the size of the clusters.

But we notice that, for the same mean variance and a different number of elements in the clusters, the lexical proximity seems to increase with the size of the cluster. For example, in the dendrogram that we used to illustrate the thresholds, [“arrogant”, ”personality”, ”bedside”] has the same mean variance as [“encouraging”, “understanding”, “caring”, “passionate”, ”listens”]. This may be caused by the high dimension (300) of the vector space.

We can use a better metric to keep the selection of clusters independent of their size: a normalized similarity between each word vector and its cluster's center gives good results. But it seems difficult to use the mean of this metric directly with a threshold after the Ward aggregation, because the mean of this metric does not increase monotonically with the aggregations of the clusters.
4.3 New filter against those two problems
Even though the mean of the last metric is not useful in selecting the clusters, we may have another use for it. Instead of using it on a cluster level, we can use it on a word level. For each vector, if it’s too far from its cluster’s center, it will be removed from the cluster. This is very useful for two purposes: to get rid of outliers, and to get rid of small clusters of words with very distant meanings.
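A sketch of this word-level filter, using plain Euclidean distance to the cluster center as the per-word metric (the post's normalized metric works the same way, with the inequality reversed since it is a similarity):

```python
from math import dist

def center(cluster):
    return tuple(sum(c) / len(cluster) for c in zip(*cluster))

def remove_outliers(cluster, max_dist):
    """Drop any vector farther than max_dist from its cluster's center."""
    g = center(cluster)
    return [v for v in cluster if dist(v, g) <= max_dist]

cluster = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0)]   # one obvious outlier
print(remove_outliers(cluster, max_dist=3.0))
# → [(0.0, 0.0), (0.1, 0.0)]
```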
Ultimately, we have a three-step process. Let's return to the first example: [“banana”, “apple”, “lemon”, “nice”, “car”, “RV”, “truck”, “desk”]
- Hierarchical aggregation with Ward distance
- Select clusters with a threshold t1 = 0.6 on their mean variance. We have three clusters:
– [“banana”, “apple”, “lemon”]
– [“rv”, “car”, “truck”]
– [“nice”, “desk”]
- Remove outliers from each cluster: a word is removed when its similarity to the cluster's center is lower than a threshold t2 = 0.4. This limits the “dimension” effect at the same time.
Both “nice” and “desk” fail the test, so we have the expected result:
– cluster 1 : [“banana”, “apple”, “lemon”]
– cluster 2: [“rv”, “car”, “truck”]
– out of the clusters: [“desk”,”nice”]
5. Further extensions – Use of clusters to extract lexical information
Now that we can select groups of words with very similar meaning, we have some ideas for further research.
An advantage of the Word2Vec word embedding is that it may have geometric interpretations. The Word2Vec creators give the example of finding the word “smallest” by looking for the word representation closest, by cosine distance, to vector(“biggest”) − vector(“big”) + vector(“small”).

We easily understand the meaning of the vector (vector(“biggest”) − vector(“big”)). But what about (vector(“big”) − vector(“small”))? Along the segment from vector(“small”) to vector(“big”), we hope to find adjectives related to size, increasing from “small” to “big”.
We tried to detect lexical information with tools such as Principal Component Analysis (PCA) on a huge group of words but without success, so we hope that this clustering will help us to achieve this.
It may be interesting to use tools such as PCA on each cluster to detect geometrical structures with lexical interpretations.
This is an idea of the kind of results that we may expect:
Vectors generated by Word2Vec can be used to find clusters of words. With hierarchical clustering, we try to find clusters without constraining their number or their size. We can manually choose threshold values, depending on how scattered we want the clusters to be.

These clustering techniques are useful for finding groups of words with the same meaning. They can be used to find keywords more precisely: say that we have 100 keywords related to a theme and we want to use Word2Vec to detect new keywords related to this theme. Instead of looking for vectors close to each word representation, or close to the entire set of 100 vectors, we can look for vectors close to each cluster that we generated.
It may be very interesting to analyze the geometric distribution of the vectors inside the clusters with tools such as PCA, and to find out whether it corresponds to a lexical structure.
- Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean. Efficient Estimation of Word Representations in Vector Space, 2013.
- Jörn Hees. SciPy Hierarchical Clustering and Dendrogram Tutorial, 2015.
Following up on my earlier post: as the frequency-based models were not very accurate and a good rule-based model was very hard to elaborate, we implemented what we know to be state-of-the-art methods for sentiment analysis on short sentences, and made a list of the pros and cons of these methods. We train all of them on a dataset of 10,000 sentences, classified as positive, neutral, or negative by human experts. We then benchmark the models on a held-out sample of 500 sentences.
Word representations in a vector space
To build a deep-learning model for sentiment analysis, we first have to represent our sentences in a vector space. We studied frequency-based methods in a previous post: they represent a sentence either by a bag-of-words, which is a list of the words that appear in the sentence with their frequencies, or by a term frequency – inverse document frequency (tf-idf) vector, where the word frequencies in our sentences are weighted by the inverse of their frequencies across the entire corpus.
These methods are very useful for long texts. For example, we can describe a newspaper article or a book very precisely by its most frequent words. For very short sentences, however, they are not accurate at all: first, because ten or so words are not enough to aggregate meaningful frequencies, but also because the structure of the sentence is very important for sentiment, and tf-idf models hardly capture negations, amplifications, and concessions. For instance, “Very good food, but bad for service…” would have the same representation as “Bad for food, but very good service!”.
We represent our sentences with vectors that take into account both the words that appear and the semantic structure. A first way to do this is to represent every word with an n-feature vector, and to represent a sentence with an n × length matrix. For instance, we can build a vector of the same size as the vocabulary (say 10,000) and represent the i-th word with a 1 in the i-th position and 0 elsewhere.
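A sketch of this one-hot representation on a toy vocabulary:

```python
def one_hot_matrix(sentence, vocab):
    """Return one one-hot vector (of size len(vocab)) per word, in order."""
    index = {w: i for i, w in enumerate(vocab)}
    matrix = []
    for word in sentence.split():
        row = [0] * len(vocab)
        row[index[word]] = 1
        matrix.append(row)
    return matrix

vocab = ["bad", "food", "good", "service", "very"]
print(one_hot_matrix("very good food", vocab))
# → [[0, 0, 0, 0, 1], [0, 0, 1, 0, 0], [0, 1, 0, 0, 0]]
```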
Tomas Mikolov developed another way to represent words in a vector space, with features that capture the semantic compositionality. He trains the following neural network on a very large corpus:
He trains this model and represents the word “ants” by the output vector of the hidden layer. The word vectors we obtain capture most of the semantic information, because they capture enough information to evaluate the statistical distribution of the words that can follow “ants” in a sentence.
What we do is similar: we represent every word by an index vector, and we integrate into our deep learning model a hidden layer of linear neurons that transforms these big vectors into much smaller ones. We take these smaller vectors as the input of a convolutional neural network. We train the model as a whole, so that the word vectors are trained to fit the sentiment information of the words, i.e. so that the features we get capture enough information about the words to predict the sentiment of the sentence.
We want to build a representation of a sentence that takes into account not only the words that appear, but also the sentence’s semantic structure. The easiest way to do this is to superpose these word vectors and build a matrix that represents the sentence. There is another way to do it, that was also developed by Tomas Mikolov and is usually called Doc2Vec.
He modifies the neural network used for Word2Vec so that it takes as input both the word vectors that come before and a vector that depends on the sentence they are in. We take the features of this sentence vector as parameters of our model and optimize them using gradient descent. Doing that, we get for every sentence a set of features that represent the structure of the sentence; these features capture most of the useful information on how the words follow each other.
Pros and cons for sentiment analysis
These document vectors are very useful for us, because the sentiment of a sentence can be deduced very precisely from these semantic features. As a matter of fact, users writing reviews with positive or negative sentiments will compose their words in completely different ways. Feeding a logistic regression with these vectors and training the regression to predict sentiment is known to be one of the best methods for sentiment analysis, both for fine-grained (Very negative / Negative / Neutral / Positive / Very positive) and for more general Negative / Positive classification.
We implemented and benchmarked such a method but chose not to put it in production. As a matter of fact, building the document vector of a sentence is not an easy operation: for every sentence, we have to run a gradient descent to find the right coefficients for this vector. Compared to our other methods for sentiment analysis, where preprocessing is a very short algorithm (a matter of milliseconds) and evaluation is almost instantaneous, Doc2Vec classification requires a significant hardware investment and/or takes much longer to process. Before taking that leap, we decided to explore representing our sentences by a matrix of word vectors and classifying sentiments with a deep learning model.
Convolutional neural networks
The next method we explored for sentiment classification uses a multi-layer neural network with a convolutional layer, multiple dense layers of neurons with a sigmoid activation function, and additional layers designed to prevent overfitting. We explained how convolutional layers work in a previous article. It is a technique that was designed for computer vision, and that improves the accuracy of most image classification and object detection models.
The idea is to apply convolutions to the image with a set of filters, and to take the new images they produce as inputs of the next layer. Depending on the filter we apply, the output image will either capture the edges, smooth the image, or sharpen the key patterns. Training the filters' coefficients helps our model build extremely relevant features to feed the next layers. These features work like local patches that learn compositionality; during training, the model automatically learns the best patches for the classification problem we want to solve. The features it learns are location-invariant: it convolves an object at the bottom of the frame exactly the same way as an object at the top. This is key not only for object detection, but for sentiment analysis as well.
Convolution used for edge detection
Applications in Natural Language Processing
As these models became more and more popular in computer vision, a lot of people tried to apply them in other fields, and they achieved notably good results in speech recognition and natural language processing. In speech recognition, the trick is to build the frequency intensity distribution of the signal for every timestamp and to convolve these images.
For NLP tasks like sentiment analysis, we do something very similar. We build word vectors and convolve the image built by juxtaposing these vectors in order to build relevant features.
Intuitively, the filters will enable us to highlight the intensely positive or intensely negative words. They will enable us to understand the relation between negations and what follows, and things like that. It will capture relevant information about how the words follow each other. It will also learn particular words or n-grams that bear sentiment information. We then feed a fully connected deep neural network with the outputs of these convolutions. It selects the best of these features in order to classify the sentiment of the sentence. The results on our datasets are pretty good.
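To make the idea concrete, here is a toy 1-D convolution over a sentence matrix, with hand-made 2-feature word vectors and a single hand-made filter (in the real model, the word vectors and filter coefficients are all learned):

```python
def convolve_words(sentence_vectors, filt):
    """Slide a filter over consecutive word vectors and return one
    activation per window (a 1-D convolution over the sentence)."""
    w = len(filt)
    return [sum(f * x
                for vec, fil in zip(sentence_vectors[i:i + w], filt)
                for x, f in zip(vec, fil))
            for i in range(len(sentence_vectors) - w + 1)]

# Toy word vectors: [positivity, negativity]
sentence = [[0.1, 0.0],   # "the"
            [0.9, 0.0],   # "good"
            [0.0, 0.8]]   # "sucks"
detector = [[1.0, -1.0]]  # width-1 filter: positivity minus negativity
print(convolve_words(sentence, detector))
# → [0.1, 0.9, -0.8]
```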
We also studied, implemented, and benchmarked the Long Short-Term Memory (LSTM) recurrent neural network model. It has a very interesting architecture for processing natural language: it works much the way we do, reading the sentence from the first word to the last and trying to figure out the sentiment after each step. For example, for the sentence “The food sucks, the wine was worse.”, it will read “The”, then “food”, then “sucks”, “the” and “wine”. It keeps in mind both a vector that represents what came before (memory) and a partial output. For instance, it will already think that the sentence is negative halfway through, and then continue to update as it processes more data.
This is the general idea, but the implementation of these networks is much more complex because it is easy to keep recent information in mind, but very difficult to have a model that captures most of the useful long-term dependencies while avoiding the problems linked to vanishing gradient.
This RNN structure looks very promising for sentiment analysis tasks, and it performs well for speech recognition and for translation. However, it slows down the evaluation process considerably and doesn't improve accuracy that much in our application, so it should be implemented with care.
Sentiment trees – RNTN model
Richard Socher et al. describe another cool method for sentiment analysis in the paper Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank. They argue that every word has a sentiment meaning, and that the structure of the sentence should enable us to compose these sentiments in order to get the overall sentiment of the sentence.

They implement a model called the RNTN (Recursive Neural Tensor Network). It represents the words by vectors and uses a class of tensor-multiplication-based functions to describe compositionality. Stanford has a very large corpus of movie reviews turned into trees by their NLP libraries, with every node classified from very negative to very positive by a human annotator. They trained the RNTN model on this corpus and got very good results. Unfortunately, since it is trained on IMDB movie review data, it doesn't perform quite as well on our reviews.
The big advantage of this model is that it is very interpretable: we can understand very precisely how it works, visualize which words it detects as positive or negative, and see how it understands compositions. However, we would need to build an extremely large training set (around 10,000 sentences with fine-grained annotations on every node) for every specific application. As we continue to gather more and more detailed training data, this is just one of the types of models we are exploring to continue improving the sentiment models we have in production!
We give our tenants insights about their online reputation based on their online reviews and ratings. In doing so, one thing we try to do is pull apart the text of reviews to understand what the reviews are dealing with, and tell our clients what their customers are talking about and how happy those customers are with key aspects of our clients’ business.
So for example, we might identify 100 reviews for our client mentioning price, and leveraging the star rating of those reviews, we might discern that 80% of those reviews are positive and the average rating of those reviews is 4.0 stars. However, this method could be improved: a positive review mentioning price is not necessarily positive about price. For example:
The food was awesome, and the service absolutely excellent. The price was very high for a coffee-shop style restaurant.
This 5-star review is clearly negative about the price of the restaurant. We need a model that tells us the local sentiment of a sentence or subsentence in order to understand which elements drive the rating of the review. I’ll explain some of the techniques we have studied, implemented and benchmarked in order to build our Sentiment Mining Tool.
Naive Bayes Classifier
Naive Bayes is the first and the easiest method to classify sentiment in a text. It’s based on Bayes’ formula for conditional probabilities: P(A|B) = P(B|A) · P(A) / P(B).
We’ll represent a text as a Bag of Words: a set of features of the form “the word w appears f times”, one for each word w in the sentence, with f the frequency of w. Under the Naive Bayes assumption that these features are independent, the formula lets us deduce the probability that the sentence is positive (A) given that each word w appears f times (B). From the frequencies in a large enough dataset, we can estimate the probability that a sentence is positive (A) and the probabilities of every feature, and hence of their intersection (B). Training the model on a set of 10,000 annotated sentences, we get a set of informative features that help predict whether a sentence is positive or negative. Here are the 10 most informative features we get:
Naive Bayes classifier’s informative features
This method is the easiest to implement, and its big advantage is that it’s completely transparent. When we run it, we know that the classifier found a set of strongly positive or strongly negative words, and that this is why it classified the sentence the way it did.
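As an illustrative sketch of the approach (a toy corpus stands in for the 10,000 annotated sentences, and scikit-learn’s implementation stands in for ours):

```python
# Bag-of-words Naive Bayes sentiment classifier, sketched with
# scikit-learn on a toy corpus (illustration only, not our model).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = [
    "the food was awesome",
    "absolutely excellent service",
    "horrible experience very rude staff",
    "the wait was terrible and the food was bad",
]
train_labels = ["pos", "pos", "neg", "neg"]

# CountVectorizer builds the bag-of-words features; MultinomialNB
# applies the Bayes formula under the independence assumption.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

print(model.predict(["the service was awesome"])[0])  # pos
```

With a real training set, inspecting the learned log-probabilities per word is what yields the “most informative features” shown above.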
How to improve it
However, this method has several drawbacks.
First, it fails to identify the neutral class. Words can carry a positive or a negative meaning (“good”, “awesome”, “horrible”, …), but no word carries a distinctly neutral one. Neutrality usually shows in the absence of such positively or negatively charged words, or in a sentence structure that reflects the absence of strong emotion. The Bag of Words representation captures neither.
It also fails to understand intensity and negation. Comparing “good” and “quite good”, for instance, the first is more likely to appear in a positive sentence than the second. We tried some methods to address this: adding a list of meaningful bigrams (so that we would read “quite good” as a single word, for instance), and training the model on bigrams instead of single words, but neither improved our model very much. We also fail to identify negations most of the time, because this model doesn’t take word order into account.
Most of all, the Naive Bayes model doesn’t perform well on the local sentiment analysis problem. In a long text, a high frequency of positive words (“sensational”, “tasty”, …) makes it very likely that the author is expressing positive sentiment. But since our goal is to determine local sentiment, we want to run the tool on short sentences and subsentences. (We already have a star rating that tells us the author’s overall sentiment.) We don’t have enough words in a short sentence to aggregate, so we need to understand the semantic structure very precisely.
The Bag of Words representation is a very bad fit for this. In the sentence “The food could have been more tasty.”, for instance, we detect the word “tasty”, which is related to a positive feeling, but we don’t understand that “could have been more” is a kind of negation or nuance. Many short sentences are like that, and evaluating on a dataset of short sentences alone reduced our accuracy from around 77% to less than 65%.
Rule-based sentiment models
To improve the Naive Bayes methods and make it fit the short sentences sentiment analysis challenge, we added some rules to take into account negations, intensity markers (“more”, “extremely”, “absolutely”, “the most”, …), nuance, and other semantic structures that appear very often near sentimental phrases and change their meanings. For instance, in “The food wasn’t very tasty”, we want to understand that “not very tasty” is less negative than “not tasty” or “not tasty at all”.
We leveraged the results of the Naive Bayes training to build a large vocabulary of positive and negative words. When we process a given sentence, we assign every word a positive and a negative score, and compute the overall scores through an analysis of the semantic structure based on the open-source library spaCy’s pipelines for part-of-speech tagging and dependency parsing. We get metrics for positive, negative and neutral scores, the neutral score being defined as the proportion of words in the sentence that are neither positive nor negative. We then used a deep-learning technique to learn from our training set the relation between these scores and the sentiment. Here are the graphs we obtained for negative, neutral and positive sentences:
The model decides very well whether an expressive sentence is positive or negative (we get around 75% accuracy), but struggles to find a criterion for neutrality or absence of sentiment (on our test set, it is wrong 80% of the time). It’s much better than the Naive Bayes, but 75% is below the state of the art for positive/negative classification.
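To make the idea concrete, here is a highly simplified, self-contained sketch of the rule layer. The lexicon, the token window, and all the weights are invented for illustration; the production version scores words from the Naive Bayes training and walks spaCy’s dependency parse instead of a flat window.

```python
# Toy rule-based scorer: lexicon scores, modified by intensifiers
# and negations found in a short window before the sentiment word.
LEXICON = {"tasty": 1.0, "good": 1.0, "awesome": 1.5, "rude": -1.0}
NEGATIONS = {"not", "n't", "never"}
INTENSIFIERS = {"very": 1.5, "extremely": 2.0, "absolutely": 2.0}

def score(tokens):
    total = 0.0
    for i, tok in enumerate(tokens):
        if tok not in LEXICON:
            continue
        value = LEXICON[tok]
        window = tokens[max(0, i - 3):i]  # a few tokens of left context
        intensity = 1.0
        for w in window:
            intensity *= INTENSIFIERS.get(w, 1.0)
        if any(w in NEGATIONS for w in window):
            # Under negation the intensifier softens the score, so
            # "n't very tasty" comes out less negative than "n't tasty".
            total += -value / intensity
        else:
            total += value * intensity
    return total

print(score("the food was n't very tasty".split()))  # about -0.67
print(score("the food was n't tasty".split()))       # -1.0
```

Even this toy version encodes the key behavior the Bag of Words misses: negation flips the polarity, and nearby modifiers rescale it.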
As an online reputation management company we care about ROI because it reveals the true value of our service and product.
We have clearly seen that working with us results in a valuable increase in web traffic for our clients. However, as a Data Science team we want to go deeper in order to identify and quantify what really drives better online traffic.
Towards this goal, I analyzed web traffic data (in particular Google My Business Insights data) from the GMB pages of our customers and specifically focused on the volumes of actions (calls, website clicks and driving direction requests) and views (appearances of the listing on Google Search and Google Maps) that these listings had over time. Throughout my exploration, I tried to tie this data with data we have internally such as rating and review histories for these locations and Google Search results for these locations, which gives us information about the locations and their competitors.
The difficulty of this analysis lies in the fact that this data has a lot of variance due to various factors, only some of which are tied directly to Online Reputation and Reputation Management. For instance, traffic fluctuations might be due to seasonality, changes in search engine ranking algorithms, improved customer reaction toward a GMB listing that has more or better reviews, a new advertising campaign or a sale the company launched, or just underlying variance due to the unpredictability of hundreds of millions of consumers.
Despite these challenges, we have been able to come up with some interesting insights when looking at views and actions across different locations in particular enterprises or industries. I’ll discuss two of these here.
Distributions of views, actions, and reviews
Unsurprisingly, a location that has a high number of views also has a high number of actions. Actions (click throughs) can only come from people that have viewed your page.
What’s interesting is how different these volumes of GMB views can be. The distribution of the number of views by location for locations within an enterprise tends to follow a log-normal distribution. (It is interesting to note that we would have exactly the same type of distribution histogram by considering actions instead of views). This skew can make analysis or modeling very challenging.
Furthermore, it turns out that the distribution of the number of reviews by location has a similarly right-skewed shape and is highly correlated with view volume. Not surprisingly, the more traffic a location has, the more reviews it usually has.
As mentioned earlier, one of the key challenges of this kind of analysis is working with data that spans different orders of magnitude. Besides the smoothing effect that a logarithmic transformation of these values (views, actions, number of reviews) provides, the transformation makes sense because fluctuations of these values over time are mostly relative, which gives the log values a natural interpretation. Indeed, a location that has 100 reviews might get 10 more in the coming month about as easily as a location that has 10 reviews could get 1 more, and the same goes for web views.
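The effect is easy to see on synthetic data (a numpy sketch; the numbers below are simulated, not our traffic data):

```python
# Log-normal "view counts": heavily right-skewed on the raw scale,
# roughly symmetric after a log transform.
import numpy as np

rng = np.random.default_rng(0)
views = rng.lognormal(mean=6.0, sigma=1.5, size=10_000)

# Raw scale: the mean sits far above the median (long right tail).
print(np.mean(views) > 2 * np.median(views))                 # True

# Log scale: mean and median nearly coincide.
log_views = np.log(views)
print(abs(np.mean(log_views) - np.median(log_views)) < 0.1)  # True
```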
However, this transformation has some caveats. One is that locations with smaller values tend to fluctuate by a relatively higher percentage than bigger, more stable locations. This is a simple manifestation of regression to the mean (higher values tend to decrease, smaller values tend to increase), but it is challenging to account for in models. Another is that values, and especially aggregated values, lose their primary quantitative sense. A model minimizing a metric (MSE for a regression, for instance) over log values and then switching back to real values will not be successful if you ultimately want to optimize MSE over the real values. Defining what metric to use is not obvious but fundamental, and is, of course, part of the art of Data Science.
More on the relationship between reviews, views, and actions
Admittedly it is not terribly newsworthy that locations with more views have more reviews on average and vice versa. We obviously want to dig deeper and study questions such as: “Does having or generating more online feedback (i.e. reviews) for a location generate more online traffic or actions?”
We have been looking at this relationship from a number of directions, and one of the most interesting insights has come from looking at the conversion rate (the ratio of actions to views) for a location as a function of how many reviews that location has.
As we discussed, there is an obvious correlation between views and actions, but this conversion rate is not strictly constant. In fact it is generally distributed in a range between 0 and 0.3 and provides us interesting information.
Furthermore, we have found a strong relationship between this rate and the volume of reviews per location. In the graph below, we plotted this conversion rate for different groups based upon their relative number of reviews compared to their top competitors. We define this relative number of reviews as the ratio of our location’s number of reviews to the average number of reviews of the top three locations that show up in a Google categorical search for this location (e.g. “tire store near me”). We defined this variable because we found that, even more than the absolute number of reviews, having more reviews than your competitors plays a critical role in influencing customer behavior.
As I mentioned above, we are performing other analyses to understand how not just having more reviews, but generating more reviews, drives more online views and conversions, and we will continue to surface these findings over the coming months.
At reputation.com, we process large amounts of text data for our customers, with the goal of figuring out what people are talking about in a set of reviews and what that can tell us about customer sentiment for our clients. There are a lot of open source tools that we leverage to help us extract information about the text, and one of those tools is Doc2Vec, the algorithm developed by Quoc Le and Tomas Mikolov at Google. This article is an introduction to some ways we can leverage Doc2Vec to gain insight into a set of online reviews for our clients.
Doc2Vec is a shallow, three-layer neural network that simultaneously learns a vector representation for each word and each sentence of a corpus in a vector space with a fixed number of dimensions (e.g. 300).
Doc2Vec and Classification
To start with, let’s look at how we could use Doc2Vec to help us categorize sentences in reviews. We start by training 10 Doc2Vec models on 100,000 online reviews related to the dental industry. Before feeding the text to the algorithm, we clean it by lemmatizing and removing stop words to reduce the initial dimensionality.
A good model is a model in which two sentences with very different meanings are represented by vectors distant in the vector space. For instance, given the three sentences:
- “I love this dentist, he has such great bedside manners”,
- “Dr. Doe truly cares about his patients, and makes them feel comfortable”,
- “The parking is always full”
the distance between the vectors representing the first two sentences should be shorter than between sentences 1 and 3.
We have a database of sentences about the dental industry that have been manually tagged as dealing with certain aspects of the experience of going to the dentist. Those sentences were extracted from online reviews. They can deal with multiple aspects but as they are relatively short in practice they are often about only one topic. To evaluate our model, we are going to use sentences from the categories “Parking and Facilities” and “Bedside Manner”. Let’s focus on a pool of a hundred sentences that have been manually tagged as being about “Parking and Facilities” and a thousand sentences about “Bedside Manner”. We use the following process:
- We pick two sentences from “Parking and Facilities” and compute the cosine distance between their representative vectors in our Doc2Vec models
- We look at all the sentences in the “Bedside Manner” pool one by one and determine if our two sentences from “Parking and Facilities” are closer to each other or to the sentence about “Bedside Manner”. If the distances of the two sentences about “Parking and Facilities” to the sentence of “Bedside Manner” are greater than the distance between the two sentences about “Parking and Facilities”, then it is a success.
- We do this for all pairs of sentences in the “Parking and Facilities” pool.
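The procedure above can be sketched with cosine distances (toy 2-D vectors here, hand-picked so the same-topic pair is obviously closer; in practice these are the Doc2Vec vectors of real sentences):

```python
import numpy as np
from itertools import combinations

def cosine_distance(u, v):
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

parking = [np.array([1.0, 0.1]), np.array([0.9, 0.2]), np.array([1.0, 0.3])]
bedside = [np.array([0.1, 1.0]), np.array([0.2, 0.9])]

successes = trials = 0
for u, v in combinations(parking, 2):   # every pair of same-topic sentences
    within = cosine_distance(u, v)
    for w in bedside:                   # against each cross-topic sentence
        trials += 1
        if cosine_distance(u, w) > within and cosine_distance(v, w) > within:
            successes += 1

print(successes / trials)  # 1.0 on these hand-picked vectors
```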
Across all those comparisons, the success rate is 74%, meaning that 74% of the time the two sentences about “Parking and Facilities” were closer to each other than either was to a given sentence from the “Bedside Manner” category. As a first pass this is good, but not great. Our model clearly captures some of the nuance of the language, but not enough to serve as a standalone classification algorithm. In practice, we use it as just one component of the machine-learning tagging model we have built to tag reviews without manual inspection.
Doc2Vec and Topic Modeling
Now, let’s see how the model can be used to spot recurring topics. To do that, we cluster a sample of data from the dentist industry running a KMeans algorithm on the set of vectors representing the sentences. We want each cluster to represent a semantic entity, meaning that vectors from the same cluster should be close in meaning. The more clusters you build, the smaller they are and the more similar the sentences are within the cluster. But having too many small clusters does not provide relevant information: we end up with very specific clusters, and different clusters about the same topic. Therefore, we are interested in finding K, the number of clusters that will allow us to have the smallest number of coherent clusters. To do that, let’s draw a plot of the clusters’ average inertia (sum of squares within cluster) divided by the number of points and find the “elbow” of the curve: the point where inertia starts decreasing more slowly with the number of clusters.
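A sketch of that elbow search with scikit-learn (synthetic blobs stand in for the Doc2Vec sentence vectors):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the sentence vectors.
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

avg_inertias = []
for k in range(1, 10):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    avg_inertias.append(km.inertia_ / len(X))  # average inertia per point

# Inertia always decreases as k grows; the "elbow" is where the
# decrease flattens out (around k=4 for these blobs).
print(avg_inertias[0] > avg_inertias[3] > avg_inertias[-1])  # True
```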
We notice a break around 14 clusters. Let’s now try to figure out what each cluster represents: in each cluster, let’s read a few sentences and words close to the center. Here is a sample of the results:
- Great Service and all of the staff were friendly and professional.
- The care was excellent and the medical staff was at the cutting edge.
- I highly recommend Dr. xxx.
- I highly recommend Dr yyy.
- Worst experience ever.
- They had me waiting 5 hours in the waiting room.
- 2 hours and still waiting.
- I filled out my paper work and couldn’t have been waiting more than 5 minutes before being called back.
- Love this place good doctors and nurses
- Wonderful staff, and Dr. Bruckel was warm, personable, and reassuring.
- Fast friendly and helpful but still personable.
- Waste of time and money.
- Do not waste your time here.
- Great service in the emergency room.
- They took great care of me
- Went to the e.r. with a kidney stone.
- I went to the ER for some chest pain, I got xray, bloodwork, etc.
- Would definitely recommend this hospital for labor and delivery.
- I would definitely recommend this hospital to anyone.
- The doctor had a great bedside manner.
- He has the best bedside manner of any doctor.
- The staff was courteous and very professional.
- This is an urgent care.
- I highly recommend this urgent care.
- They saved my life.
- Always treated well and with respect.
- This place saved my life, I’m Very thankful with the doctors and nurses.
For most clusters, a dominant theme emerges, e.g. recommendation or wait time. Some of these themes span multiple clusters; some clusters, however, seem to mix multiple unrelated topics. Looking at the inertia of each cluster helps us identify some of the better clusters (e.g. 1, 3, and 8): the lower the inertia, the more coherent the cluster, and the more likely the sentences within it share a similar meaning. We can also look at the distance between clusters to find good candidates for regrouping (e.g. 2 and 9).
Ultimately, as we saw with classification, we can see that Doc2Vec is a useful tool in identifying key topics, but not a standalone tool, at least in the version implemented here. Nonetheless, in these and other applications we have already found and are continuing to find valuable ways Doc2Vec can help us extract actionable insights for our clients.
Word Clouds can provide a powerful visual representation of text data. When it comes to customer review data, word clouds can quickly showcase the most prominent keywords people are using and their associated sentiment. The basic word cloud works by selecting the most frequent keywords in a set of data (in our case customer reviews or surveys) for a given client or industry. The font size of text represents the frequency with which a keyword has been mentioned in these reviews. The color of text (ranging from dark green for positive to dark red for negative) represents the average sentiment of these mentions. This way users can gain immediate insight into what customers are talking about the most and their relative sentiment with respect to those topics without having to read through each customer review one by one. Our goal in the data science labs was to experiment with different ways of depicting a Word Cloud to maximize the insight this tool could provide.
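The data behind such a cloud is simple to compute. Here is a minimal sketch (made-up reviews; the per-review sentiment score is an assumed input, e.g. a star rating mapped to [-1, 1]):

```python
from collections import defaultdict

reviews = [
    ("the nurse was kind and the staff friendly", 0.8),
    ("rude staff and a surprise bill", -0.7),
    ("the nurse answered all my questions", 0.6),
]

counts = defaultdict(int)        # mention count drives the font size
sentiment = defaultdict(float)   # average sentiment drives the color

for text, score in reviews:
    for word in set(text.split()):   # count each word once per review
        counts[word] += 1
        sentiment[word] += score

print(counts["nurse"])                                 # 2
print(round(sentiment["staff"] / counts["staff"], 2))  # 0.05
```

A real implementation would also lemmatize, drop stop words, and restrict to meaningful keywords before rendering.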
The cloud below showcases customer review data relating to staff professionalism in a hospital. The word “nurse” indicates that a lot of customers evaluated their nurses in reviews and they had somewhat positive experiences. People were even more positive about the “staff” in general. On the other hand, there were various complaints around “bills”, “ultrasounds”, and “rude” and “unprofessional” staff.
But what are the main insights with respect to these reviews? Are we to conclude that experiences with the nurses were not as positive as with the rest of the staff? How positive were these experiences compared to nurse services in other hospitals? Do people mention the nurses and overall staff this frequently in other hospitals as well? Unfortunately, this single cloud cannot answer those questions.
Towards digging into that, we wanted to figure out how to get more context into the word cloud. However, when it comes to specific customization, the free version of ZingChart will not do the trick. For example, the only data I could pass in for a tooltip when hovering over the word is the word itself and its count. I would have to pay technical support fees to extend this feature. Because of this I decided to explore other libraries. In particular I decided to explore what I could do with the popular data visualization tool D3.
This possibility was particularly intriguing to us, as it provided an option to layout a benchmark word cloud side-by-side with the initial word cloud. If both clouds could have exactly the same text and text position, the two could easily align side by side to provide a quick visual benchmark for each word. Thus, users could not only gain general insights of their own customer reviews, but also compare easily to various benchmarks such as industry averages or performance over a longer time period.
In the cloud below, I implemented this for one hospital’s reviews over two time periods, the last 3 months and the last 24 months. Immediately you can see a value to the benchmark. With the two clouds side-by-side (more recent reviews on the left) you can see that feedback about “doctors” and care related to “children” is in fact getting worse. You can also quickly see that sentiment around waiting (“wait”, “hour”, “minute”) is in fact getting more negative compared to some of the other words showing up in yellow here.
This is a big improvement, and I feel this is a good way to add context to a word cloud. However, we do run into some difficulties with this approach. In the case above, all of the words are in pretty much the same place, but this is primarily because the relative frequencies of the words are very similar in the two data sets. In some cases however, this is not true. In a followup post, next month, I will go through one of these cases in more depth and discuss the custom cloud we had to build to overcome these issues.
This blog post discusses profiling methods, specifically for the Python programming language.
Within the data science team, one of the things we are working to build is a processing model for large amounts of textual and review data using natural language processing.
Because we are processing data at such a large scale, it is important that our model is properly optimized to reduce any unnecessary overhead. As such, it is important to identify which areas in our code are taking up the most time. This is where profiling comes in.
Program profiling is a form of analysis that measures things such as the memory usage, time usage, the usage of particular instructions, or the frequency and duration of function calls. It is a way to understand where the largest amount of resources are being spent in order to target optimizations to these areas.
Our use case was to find optimizations in a series of Python files used in our model. In order to find which parts of the program were stalling execution, profiling was used. Python has many native and third party profiling tools that allow for a range of analysis for runtime, memory usage and visualization. Some of the tools we looked at were cProfiler, line_profiler, memory_profiler and QCachegrind. For the purpose of our use case, we are most interested in profiling methods that enable us to see which parts of the program were using up the most time, and if there are any blocking resources.
Profiling Using the Standard Python Library
Profiling Python can be done with the standard Python library, as well as third party modules and programs.
The standard Python library provides three different implementations of the same profiling interface: cProfile, Profile and Hotshot. The most popular of the three is cProfile.
cProfile can be run in terminal, as well as imported as a module in Python.
For each function, it reports ncalls, tottime, percall, cumtime, and percall: the number of calls to that function, the total time spent in the function excluding calls to other functions, the time per call, the cumulative time including other function calls, and the time per cumulative call.
197 function calls (192 primitive calls) in 0.002 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 0.001 0.001 <string>:1(<module>)
1 0.000 0.000 0.001 0.001 re.py:212(compile)
1 0.000 0.000 0.001 0.001 re.py:268(_compile)
1 0.000 0.000 0.000 0.000 sre_compile.py:172(_compile_charset)
1 0.000 0.000 0.000 0.000 sre_compile.py:201(_optimize_charset)
4 0.000 0.000 0.000 0.000 sre_compile.py:25(_identityfunction)
3/1 0.000 0.000 0.000 0.000 sre_compile.py:33(_compile)
Although we are able to see timing information on a function basis, we aren’t able to see which lines specifically are taking up the most time.
Here is an example that runs in the terminal with cProfiler and reduces the output to the 33 top lines with highest cumulative time:
python -m cProfile -s 'cumulative' program.py > temp_file && head -n 33 temp_file && rm temp_file
Third Party Profiling Modules
Third party modules include line profiler and memory profiler for line by line profiling, and QCacheGrind for program visualization.
line_profiler – Line-by-line Timing
line_profiler is a third party module that does line-by-line profiling in a Python program. It shows the time spent on each individual line in the Python program. After installing line_profiler, you use it by decorating the functions that you want to profile using ‘@profile’. Then you create a kernprof script of your Python file that can be used with line_profiler.
pip install line_profiler
kernprof -l myProgram.py
python -m line_profiler myProgram.py.lprof
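For example, a hypothetical myProgram.py might look like this. kernprof injects `profile` into builtins when it runs the script; the fallback below keeps the file runnable with plain `python` as well.

```python
try:
    profile                      # defined by kernprof at runtime
except NameError:
    def profile(func):           # no-op fallback outside kernprof
        return func

@profile
def slow_sum(n):
    total = 0
    for i in range(n):           # line_profiler will time each line here
        total += i
    return total

if __name__ == "__main__":
    print(slow_sum(100_000))     # 4999950000
```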
Timer unit: 1e-06 s
Function: Proc2 at line 149
Line # Hits Time Per Hit % Time Line Contents
150 def Proc2(IntParIO):
151 50000 82003 1.6 13.5 IntLoc = IntParIO + 10
152 50000 63162 1.3 10.4 while 1:
153 50000 69065 1.4 11.4 if Char1Glob == 'A':
154 50000 66354 1.3 10.9 IntLoc = IntLoc - 1
155 50000 67263 1.3 11.1 IntParIO = IntLoc - IntGlob
156 50000 65494 1.3 10.8 EnumLoc = Ident1
157 50000 68001 1.4 11.2 if EnumLoc == Ident1:
158 50000 63739 1.3 10.5 break
159 50000 61575 1.2 10.1 return IntParIO
In this output, we can see by line the amount of times a line is executed, the time per execution, the total execution time and percentage time usage. This helps you zero in on which lines are actually causing slowdowns in your program.
memory_profiler – Line-by-line memory usage
memory_profiler is another third party package that is similar to line profiler. It does line by line profiling of a Python program with memory as opposed to time.
pip install -U memory_profiler
python -m memory_profiler example.py
Line # Mem usage Increment Line Contents
4 5.97 MB 0.00 MB def my_func():
5 13.61 MB 7.64 MB a = [1] * (10 ** 6)
6 166.20 MB 152.59 MB b = [2] * (2 * 10 ** 7)
7 13.61 MB -152.59 MB del b
8 13.61 MB 0.00 MB return a
The output is similar to line_profiler’s. In the output above, memory usage jumps as the large lists are allocated and drops back when b is deleted. This is helpful when a program performs operations that require a lot of memory.
Visualization with QCacheGrind
QCacheGrind is a visual profiling tool that can be used to view the call stack of a program and see the cumulative time spent in each function on the stack. You can visually trace through the call stack, and even view the time usage line by line in the source file.
pip install pyprof2calltree
brew install graphviz
brew install qcachegrind --with-graphviz
python -m cProfile -o myscript.cprof myProgram.py
pyprof2calltree -k -i myscript.cprof
Result and Comparison
Profiling helped us zero in on an iteration loop that was taking a large percentage of time. It turns out repeated index references to a large DataFrame object were driving a large percentage of the time usage. This is because while the Pandas DataFrame is a powerful data structure to apply vector operations and aggregation across large amounts of data, it’s inherently a slower data structure when it comes to accessing indexed rows repeatedly or iterating through a number of rows compared to a simple dictionary. After identifying this via profiling, the program was optimized by converting the DataFrame to a list of dictionaries.
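A small sketch of the difference (toy data and timings; the exact speedup depends on the workload, but repeated row access through iterrows is reliably slower than iterating a list of dicts):

```python
import time
import pandas as pd

df = pd.DataFrame({"a": range(20_000), "b": range(20_000)})

# Row-by-row access through the DataFrame.
t0 = time.perf_counter()
total_df = sum(row["a"] for _, row in df.iterrows())
df_time = time.perf_counter() - t0

# Same loop after converting to a list of plain dicts.
records = df.to_dict("records")
t0 = time.perf_counter()
total_dicts = sum(r["a"] for r in records)
dict_time = time.perf_counter() - t0

print(total_df == total_dicts)   # True
print(dict_time < df_time)       # True, usually by orders of magnitude
```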
We found line_profiler to be the most useful in terms of finding which areas of code to optimize. Using line_profiler we can see the percentage time usage of each function we are interested in line by line. Tools such as cProfile and QCachegrind are able to give a broad perspective on which functions are taking the most time, but do not show which lines of the function are the trouble areas. memory_profiler is good for programs that use heavy amounts of memory, but for our use case memory was not limited.
This blog post is an introduction to the technology stack we use within the Data Science & Data Engineering Teams at Reputation.com to build our Data Science Labs portal. The Data Science Labs portal serves as a platform for our team to prototype data products, and iterate and experiment with new ideas to develop the exact algorithms and data products that will provide the most insight and value for our customers. For many of the tools we are building, the product use-case is being developed alongside the algorithms, so the more quickly we can prototype and collaborate with the product, design, and engineering teams, the more effective we will be at providing the most valuable and useful data products within our customer-facing product stack.
The portal is being actively built and iterated upon by our team of data scientists, data engineers and interns. After a few rounds of experimentation over the course of 2016, we have found that the following stack provides the best combination of feature set and flexibility to help us be as efficient as possible in prototyping new data products.
Since most of the data science toolkits the team uses daily are Python-based (Jupyter iPython, Pandas, Sklearn, TensorFlow), naturally we opted to go with a Python ecosystem for our Data Science Labs backend. We have found Flask to work well as our backend web framework because it has the following characteristics:
- Micro framework
- Rich documentation
- Large open source community support
- Simple to learn and use
- Routing is easy
- Small core codebase and extensible
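As a taste of that simplicity, a complete Flask route fits in a few lines (illustrative endpoint, not one of ours):

```python
from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/api/health")
def health():
    return jsonify(status="ok")

# Flask's built-in test client exercises the route without a server.
client = app.test_client()
print(client.get("/api/health").get_json())  # {'status': 'ok'}
```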
We also considered Django and CherryPy. Django is much more of a full-fledged solution, providing a ton of features that our application doesn’t need—such as deep ORM integration. There’s also a steep learning curve to ramp up. CherryPy, on the other hand, is comparable to Flask in that it’s an easy-to-learn micro framework. But Flask won out because of its great community support, better documentation, and vast library of plugins and extensions.
We were looking for a frontend framework for our Data Science Labs that:
- is easy to ramp up
- provides a clean abstraction to help us separate our logic and view, and
- allows us to build reusable components.
We chose Vue.js over AngularJS and ReactJS. All three frameworks use two-way data binding for reactive programming. Angular, being the oldest of the three, has a lot of community support and a deep knowledge base, but it is a behemoth in terms of scale and size. Angular is also less flexible, imposing a specific structure on how you lay out your codebase, and is definitely much more opinionated than Vue.js or ReactJS.
ReactJS is rising in popularity and on its way to surpassing AngularJS as the dominant framework. It has many more advanced features than Vue.js, such as mobile rendering and server-side rendering, but neither are priorities for our project.
Vue.js, while the newcomer, already has a large following and tremendous momentum, mainly thanks to its reputation as a minimalistic, simple framework. We find Vue.js’s syntax much more pleasant to work with than ReactJS’s JSX, producing code that is more readable and maintainable, and it is much easier to separate your CSS, JS and view files. With the shortest ramp-up time of the three, we find Vue.js the most suitable framework for our project. To summarize, Vue.js proved to be the right choice for us because it is:
- easy to learn
- very flexible; high compatibility with other libraries
- focused on the view layer in its core library, which you can build on top of as you go
Click here for more detailed comparisons between Vue and other frameworks.
As I mentioned, this stack is actively evolving with every new prototype we build, so check back here for updates as this platform evolves.