At Reputation.com, we work with millions of online reviews from hundreds of sources. One of the unusual characteristics of reviews compared to the vast majority of text corpora is that, almost by definition, reviews are structured in such a way that they can be categorized (in one or many dimensions depending on the review site and/or industry). However, we often find ourselves doing text classification/tagging on topics that are not already labeled by the review site. This article is an informal introduction to a set of techniques we have developed to leverage existing unlabeled corpora in conjunction with the labeled data. In particular, we present a semi-supervised learning algorithm for multi-label text classification.
In recent years, a lot of text classification projects have used supervised learning methods (Naive Bayes, SVM) primarily due to their substantial improvements over non-supervised strategies such as traditional clustering in NLP tasks. Until very recently, most NLP classification work was done with the traditional Bag of Words (BOW) approach – perhaps with a bit of context through the use of a limited range of N-grams and skip-grams. BOW is a feature extraction technique where the text is represented as the frequency of each word in the document, disregarding grammar and ordering but keeping multiplicity. In most cases, defining a pipeline combining the BOW feature extraction technique with a Tf-Idf transform and a simple classifier (Naive Bayes, SVM) produces decent results with respect to most classification metrics.
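To make the BOW and Tf-Idf steps concrete, here is a minimal sketch using only the Python standard library. The whitespace tokenizer, the smoothed IDF formula, and the two sample reviews are assumptions chosen for illustration; the classifier stage of the pipeline is left out.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Count how often each lowercased token occurs, ignoring order and grammar."""
    return Counter(text.lower().split())

def tf_idf(documents):
    """Weight raw term counts by a smoothed inverse document frequency."""
    counts = [bag_of_words(doc) for doc in documents]
    n_docs = len(documents)
    # Number of documents in which each term appears at least once.
    doc_freq = Counter(term for c in counts for term in c)
    return [
        {term: tf * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
         for term, tf in c.items()}
        for c in counts
    ]

# Hypothetical sample reviews.
reviews = [
    "the wait was long and the staff was rude",
    "great doctor and friendly staff",
]
counts = bag_of_words(reviews[0])
weights = tf_idf(reviews)
```

Terms that appear in every review (such as "staff" here) keep a weight equal to their raw count, while terms that are rarer across the corpus (such as "wait") are weighted up.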
Semi-Supervised Learning with Word2Vec
In most tutorials, Word2Vec is presented as a stand-alone neural net preprocessor for feature extraction. Word2Vec generates a vector for each word in the text corpus in a vector space such that words that share contextual meaning are located close to one another. To use Word2Vec for classification, each word is replaced by its corresponding word vector, and the word vectors are usually combined through a naive algorithm such as addition with normalization or cross product to get a sentence or text vector. A simple classifier can then be trained on these document vectors for multi-label classification. The advantage of using Word2Vec over a simple BOW feature extraction technique is that it supports semi-supervised learning, since the vocabulary from both the labeled and unlabeled text can be used to generate the word vectors. This gives the words more contextual meaning. However, we have found that this approach does not appear to provide significant improvements over a BOW approach, especially when there isn’t a lot of labeled data for training the classifier.
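The "addition with normalization" option can be sketched as follows. The three-dimensional toy vectors and the function name document_vector are hypothetical; a trained Word2Vec model would supply the real word vectors.

```python
import numpy as np

def document_vector(tokens, word_vectors):
    """Sum the vectors of the known tokens and scale the result to unit length."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        # No token is in the vocabulary: fall back to a zero vector.
        dim = len(next(iter(word_vectors.values())))
        return np.zeros(dim)
    total = np.sum(known, axis=0)
    norm = np.linalg.norm(total)
    return total / norm if norm > 0 else total

# Hypothetical 3-dimensional vectors standing in for trained Word2Vec output.
toy_vectors = {
    "friendly": np.array([1.0, 0.0, 0.0]),
    "staff": np.array([0.0, 1.0, 0.0]),
}
doc_vec = document_vector("friendly staff today".split(), toy_vectors)
```

Out-of-vocabulary tokens ("today" in this example) are simply skipped, which is one common choice rather than the only one.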
Semantic Convolution for Low Support Topics
A common problem in multi-label text classification is a major imbalance of labels in the text corpus. We often see cases where most (>60%) of the sampled data is about the most prevalent topic, and more than half the topic labels exist in <0.1% of the sampled data. Almost inherently with NLP and a BOW approach, this causes a p (number of features) >> n (size of training corpus) problem. Based on general rules of thumb, getting 1,000 training examples for a low support topic would require millions of labeled training examples, which is prohibitively expensive.
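The arithmetic behind that estimate is simple enough to check directly. The sketch below assumes that reviews are labeled in random order and that the topic appears in exactly 0.1% of them; since the text says fewer than 0.1%, the result is a lower bound.

```python
import math
from fractions import Fraction

def labeled_examples_needed(target_positives, prevalence):
    """Expected number of randomly sampled documents that must be labeled to
    collect `target_positives` examples of a topic with the given prevalence."""
    return math.ceil(target_positives / prevalence)

# 1,000 positive examples of a topic found in 0.1% of reviews.
needed = labeled_examples_needed(1000, Fraction(1, 1000))
```

At a prevalence of 0.1% this already comes to one million labeled reviews, and it grows in inverse proportion as the topic gets rarer.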
In this world of ‘big data’ the data itself is actually cheap, but developing a tagged training set can be expensive. In the course of our development, we devised an elegant and scalable way to develop and maintain a robust training set across tens of industries (this will be the topic of a separate blog post).
The premise of Semantic Convolution is simple: if a particular word is a good indicator of a particular label, then words with similar meanings (semantics) should also be good indicators of the label. Since we have qualitative evidence that Word2Vec vectors encode semantic meaning, we can use them to help find words with similar meanings from non-labeled corpora. This allows us to apply a Semantic Transform after getting the term frequencies in the BOW pipeline, and before applying the Tf-Idf transform. To apply the Semantic Transform, we use the Word2Vec data to generate a correlation matrix between words with similar contextual meaning in the vocabulary.
Here vocabulary is a dictionary mapping each term to an index. The code to generate the correlation matrix is:
    import scipy.sparse

    # Start from the identity so that every term is correlated with itself.
    correlation_matrix = scipy.sparse.identity(len(vocabulary), format="dok")
    for word, idx in vocabulary.items():
        # most_similar returns (word, similarity) pairs; keep only close neighbors.
        neighbors = word2vec_model.most_similar(word, topn=5)
        similar_words = [w for w, score in neighbors if score > 0.5]
        for similar_word in similar_words:
            if similar_word in vocabulary:
                correlation_matrix[idx, vocabulary[similar_word]] = 1
Using this correlation matrix we can generate the term-document matrix with the augmented term frequencies.
    term_frequency_vector += term_frequency_vector * correlation_matrix
Applying this transformation with the correlation matrix increases the word count of all words with contextually similar meaning in the text. This improves the feature collection for low support topics, which allows more precise classification of reviews about low support topics with higher confidence. It also makes small amounts of labeled data more useful for the machine-learning model, which reduces the cost of developing a robust training set. Finally, as mentioned above, this leverages semi-supervised learning from the unlabeled data by building the vocabulary and Word2Vec vectors based on the entire text corpus.
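A small worked example may help show what the transform does to a single review. This sketch uses a dense NumPy matrix and a hand-written neighbors table in place of Word2Vec output; both are assumptions made to keep the example self-contained, and at real vocabulary sizes a sparse matrix would be the practical choice.

```python
import numpy as np

vocabulary = {"rude": 0, "impolite": 1, "parking": 2}

# Hypothetical neighbor lists standing in for Word2Vec similarity lookups.
neighbors = {"rude": ["impolite"], "impolite": ["rude"], "parking": []}

# Identity plus a 1 for every pair of semantically similar terms.
correlation_matrix = np.identity(len(vocabulary))
for word, idx in vocabulary.items():
    for similar_word in neighbors[word]:
        if similar_word in vocabulary:
            correlation_matrix[idx, vocabulary[similar_word]] = 1

# One review that mentions "impolite" once and "parking" twice.
term_frequency_vector = np.array([0.0, 1.0, 2.0])

# Same update as in the text: original counts plus the correlated counts.
augmented = term_frequency_vector + term_frequency_vector @ correlation_matrix
```

The review never uses the word "rude", yet its augmented count for "rude" becomes 1 because "impolite" is a neighbor. Because the matrix starts from the identity, the counts of words that do appear are doubled; the Tf-Idf transform that follows is applied to these augmented counts.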
Ultimately, Semantic Convolution extracts more value from the little labeled data available and improves the performance of the machine learning algorithm on classification tasks, especially for the low support categories. In addition, semi-supervised learning with Word2Vec leverages the information gained from the vast amounts of unlabeled data while increasing both the precision and the support of the machine-learning model.
Dweep Shah and Anthony Johnson
Extracting meaning from online reviews is key to turning seemingly anecdotal reviews into actionable customer satisfaction insights that point to improvement opportunities or to authentic and potentially differentiating strengths. One way to do that is to apply machine learning to automatically read customer reviews and identify the most relevant topics that are the subject of the review. With this information, you can find themes in what customers are saying about a business across thousands of reviews and then help businesses identify areas in which they are receiving a disproportionate number of negative reviews so that they can focus operational efforts on these areas and improve customer experience as well as their online reputation.
We have been working for a while on several approaches, models, and data sets to extract topics and categories from customer reviews with high precision. In this post I will give an overview of a few neural network models that provide satisfactory results for physician-related reviews. To start, we built a taxonomy of categories that are relevant to physician reviews, looking both at clinical patient experience topics from standard patient assessment surveys designed by CMS (Centers for Medicare & Medicaid Services) and at non-clinical topics related to parking, technology/amenities, and cleanliness that are commonly referred to in physician reviews. Then we gathered training data by having a group of crowd-sourced individuals tag a set of 10,000 reviews with the following categories (this is the subject of an upcoming blog entry):
- Administrative Process
- Bedside Manner
- Getting an Appointment
- Likely/Unlikely to recommend
- Staff Courtesy
- Price/Billing issues
- Wait Time
Given this training data, we used a biologically-inspired variant of Artificial Neural Networks to build a classifier that automatically assigns categories to online physician reviews. These neural network classifiers are based on how an animal’s visual cortex processes and exploits the strong spatially local correlation present in natural images. Those models are generally used for image recognition, but are being increasingly used in other fields, especially text classification. Given the promising results documented in this space, we decided to evaluate Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN) with respect to our classification problem.
After an initial trial, we decided to focus our implementation on CNN models as they execute faster, are easier to understand, and had comparable results to RNN.
Principle of CNNs
The starting point of our CNNs was to represent a review by a matrix where each row is a vector that represents a word. These vectors could be low-dimensional representations or one-hot vectors that index words into a vocabulary. Given this matrix, you can then apply several convolutional filters on groups of rows followed by a 1-max pooling (the largest number from each feature map is recorded) in order to extract the meaning of the group of words considered at the beginning. Finally, a softmax layer is applied to generate assessed probabilities of the review belonging to each class.
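The forward pass described above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not our production model: the weights are random rather than trained, the ReLU non-linearity is an assumption the text does not mention, and the shapes (7 words, 4-dimensional word vectors, 3 filters, 2 classes) are arbitrary.

```python
import numpy as np

def cnn_forward(review_matrix, filters, biases, out_weights, out_bias):
    """Convolve over word windows, 1-max pool each feature map, then softmax."""
    n_words = review_matrix.shape[0]
    pooled = []
    for filt, bias in zip(filters, biases):
        height = filt.shape[0]  # number of consecutive words the filter spans
        # Slide the filter down the rows; ReLU is an assumed non-linearity.
        feature_map = [
            max(0.0, float(np.sum(review_matrix[i:i + height] * filt) + bias))
            for i in range(n_words - height + 1)
        ]
        pooled.append(max(feature_map))  # 1-max pooling: keep the largest value
    logits = out_weights @ np.array(pooled) + out_bias
    exp = np.exp(logits - logits.max())  # shift logits for numerical stability
    return exp / exp.sum()

rng = np.random.default_rng(0)
review_matrix = rng.normal(size=(7, 4))                 # one row per word
filters = [rng.normal(size=(h, 4)) for h in (2, 3, 3)]  # windows of 2 or 3 words
biases = [0.0, 0.0, 0.0]
out_weights = rng.normal(size=(2, 3))                   # 2 classes, 3 pooled features
out_bias = np.zeros(2)
probs = cnn_forward(review_matrix, filters, biases, out_weights, out_bias)
```

Each filter spans the full width of the word vectors, so it slides only along the word axis. Because pooling keeps a single number per filter, reviews of different lengths all produce a feature vector of the same size.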
CNN Implementation Approach
We generated a CNN model for each category independently broken into two classes: reviews that belong to this category and reviews that do not.
To fit all of the hidden parameters of these models, we fed them with reviews from the training data that were already categorized and modified the parameters incrementally in order to minimize a loss function (a function that represents the difference between the prediction and the real categories).
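The two ideas in this section, one binary problem per category and incremental updates that reduce a loss, can be illustrated with a much smaller model. The sketch below uses a logistic model trained by gradient descent as a stand-in for the CNN; the two-column feature matrix, the cross-entropy loss, and the learning rate are all assumptions for illustration.

```python
import numpy as np

def binary_targets(review_labels, category):
    """1.0 if the review is tagged with `category`, else 0.0."""
    return np.array([1.0 if category in labels else 0.0 for labels in review_labels])

def cross_entropy(probs, targets):
    """Average difference between predicted probabilities and the real categories."""
    eps = 1e-12
    return float(-np.mean(targets * np.log(probs + eps)
                          + (1 - targets) * np.log(1 - probs + eps)))

def train(features, targets, steps=200, learning_rate=0.5):
    """Gradient descent on a logistic model, standing in for the CNN parameters."""
    weights = np.zeros(features.shape[1])
    losses = []
    for _ in range(steps):
        probs = 1.0 / (1.0 + np.exp(-(features @ weights)))
        losses.append(cross_entropy(probs, targets))
        gradient = features.T @ (probs - targets) / len(targets)
        weights -= learning_rate * gradient  # small step that lowers the loss
    return weights, losses

review_labels = [{"Wait Time"}, {"Staff Courtesy"}, {"Wait Time", "Bedside Manner"}, set()]
# Hypothetical features per review: [constant bias term, count of the word "wait"].
features = np.array([[1.0, 2.0], [1.0, 0.0], [1.0, 1.0], [1.0, 0.0]])
targets = binary_targets(review_labels, "Wait Time")
weights, losses = train(features, targets)
```

Repeating the same procedure with a different category name yields an independent model for each category.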
CNN Model Effectiveness
To assess the performance of these models, we split the 10,000 reviews into 8,000 reviews for training and 2,000 for testing. Given a model built on the training data, we predicted whether each review in the test data belonged in each category and assessed the precision and recall of our predictions with respect to each category.
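For reference, per-category precision and recall can be computed as in the sketch below. The six predictions and labels are made-up values for a single category, not results from our test set.

```python
def precision_recall(predicted, actual):
    """Precision and recall for one category from parallel lists of booleans."""
    pairs = list(zip(predicted, actual))
    true_pos = sum(p and a for p, a in pairs)
    false_pos = sum(p and not a for p, a in pairs)
    false_neg = sum(a and not p for p, a in pairs)
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall

# Hypothetical model output and ground truth for one category.
predicted = [True, True, True, False, False, True]
actual = [True, True, False, True, False, True]
precision, recall = precision_recall(predicted, actual)
```

Precision is the share of reviews the model placed in the category that really belong there; recall is the share of reviews that belong in the category that the model found.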
For the largest categories, we found that our models delivered an overall precision of 81% and an overall recall of 75%. At first sight, those results did not appear very good. However, when we dug deeper, we found that when we considered each element of the crowd-sourced (human-based) training data to be a prediction in itself, these tags exhibited precision and recall metrics lower than 70% (versus the consensus of the group). Thus, our model outperformed the individual human taggers.
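One way to score an individual tagger against the group is sketched below. The majority-vote definition of consensus, the inclusion of the tagger's own vote in that consensus, and the toy tag table are all assumptions; the numbers are not our data.

```python
def majority(votes):
    """True when more than half of the annotators applied the tag."""
    return sum(votes) * 2 > len(votes)

def annotator_scores(tags, annotator):
    """Precision and recall of one annotator measured against the group consensus."""
    own = [row[annotator] for row in tags]
    agreed = [majority(row) for row in tags]
    true_pos = sum(o and c for o, c in zip(own, agreed))
    precision = true_pos / sum(own) if sum(own) else 0.0
    recall = true_pos / sum(agreed) if sum(agreed) else 0.0
    return precision, recall

# Rows are reviews, columns are three hypothetical annotators,
# and each value means "this annotator tagged the review as Wait Time".
tags = [
    [True, True, False],
    [True, False, False],
    [False, True, True],
    [True, True, True],
]
precision, recall = annotator_scores(tags, annotator=0)
```

In this toy table the first annotator reaches a precision and recall of two thirds each, which shows how an individual can fall well short of the consensus even when the group as a whole is reliable.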
Furthermore, looking deeper into the training data, we realized that some reviews were truly ambiguous and the categories not precise or discerning enough, which resulted in a high degree of disagreement between humans evaluating the same review. After removing the most ambiguous reviews from the training data set, we observed a marked increase in the overall accuracy of the model. What to do about those ambiguous reviews or how to fine-tune the categories will be the subject of a future post.
The results from CNN models are promising, and we are pushing them further by experimenting with several modifications of the model such as: oversampling the training set in order to have balanced data for each category, splitting reviews by characters instead of by words, and initializing with a low-dimensional representation of words using Word2Vec. Stay tuned for further updates.
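Of the modifications listed, oversampling is the easiest to illustrate. The sketch below shows plain random oversampling for one binary category; the function name, the fixed seed, and the toy data are hypothetical, and other balancing schemes are possible.

```python
import random

def oversample(examples, labels, seed=0):
    """Duplicate random minority-class examples until both classes are the same size."""
    rng = random.Random(seed)
    pairs = list(zip(examples, labels))
    positives = [p for p in pairs if p[1]]
    negatives = [p for p in pairs if not p[1]]
    minority, majority = sorted((positives, negatives), key=len)
    if not minority:
        return pairs  # nothing to duplicate
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    balanced = pairs + extra
    rng.shuffle(balanced)
    return balanced

# Hypothetical training split: 2 reviews in the category, 8 outside it.
examples = ["review %d" % i for i in range(10)]
labels = [True, True] + [False] * 8
balanced = oversample(examples, labels)
```

Oversampling should be applied to the training split only, so that duplicated reviews never leak into the test data.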
A couple of months ago, we looked at the relationship between review site rankings on a business’s local SERP/SEO and the number of reviews of the business on those sites. We found a significant positive relationship between the number of reviews and how highly those sites ranked in local search results.
Most of the analysis that time around centered on automobile dealers around the country and where their Facebook and DealerRater presences ranked in Google search results targeting that dealer. This time around we expanded that analysis to multiple review sites and multiple industries and found that the relationship between review volume and local search ranking varies wildly by domain and industry. We also dug a little deeper into the data to try to estimate the value of adding reviews on these sites over time, and we found that new reviews are valuable in two ways. First, new reviews help review sites rise on search engine results pages, and second, the more reviews that site acquires, the better its chances are of staying at the top of the results page as well.
Reviews and SEO across sites and industries
First, let’s look at the relationship between review volume and domain ranking in local search for that same set of automobile dealers. (Note: None of these analyses include Google, since the Google review presence is usually anchored on the right-hand side of the page.)
The Facebook and DealerRater lines here match up pretty well to the data we presented before. We also see a correlation between review volume and domain rank for the other domains, but it is notable that the apparent impact of additional reviews varies a bit by source. For instance, for four of these domains the average rank of the review site for a location with no reviews is between 8.5 and 10. Having 100 reviews on DealerRater brings the expected rank of that domain down to the top half of the first page, whereas for cars.com and Facebook, we would still expect 100 reviews to leave that site below the fold when someone is searching for that location. Edmunds.com is even worse. It seems no matter how many reviews you get, Google is determined to pin your Edmunds presence to the top of the 2nd page.
This data would lead us to hypothesize that, on average, an additional review on DealerRater is worth considerably more for a car dealer than an additional review on one of these other sites. But before we explore that hypothesis a little more, let’s look at similar data for a few other industries, starting with hospitals:
These are the review site domains that most commonly showed up when we googled over 1000 US hospitals. Again we see the expected directional relationship: more reviews means a generally better SERP/SEO ranking. However, none of these curves are as steep as the steepest curves for auto dealers. It’s also very interesting to note how much Google seems to value a healthgrades.com page regardless of whether there are any reviews on it.
And here is the data for the Self-Storage Unit industry. There isn’t as much breadth in this industry, as there aren’t as many review sites with high volume, but it is very interesting to note that in the storage industry, Facebook has a very strong correlation between review volume and SERP/SEO rank.
All of this is very interesting, but it raises several questions. Most notably, what makes the SERP/SEO ranking of particular review sites seem to be so responsive to review volume in particular industries? And is there actually causation here or does something else explain why some of these correlations are so strong?
Review volume impact on local SEO over time
Let’s address the causation question by looking at some more dynamic data, specifically by looking at how these rankings and volumes change over time. This is still a long way from a controlled experiment, but it would be more compelling if we could show that as review volumes rise for a particular location on a particular site, the SERP/SEO rank of that location tends to fall (that is, the site moves closer to the top of the page).
Over the last couple of months we gathered SERP/SEO data once a week for several thousand US auto dealers. We then looked at the rankings for major review sites over time and how those changes correlated with total review volume and with changes in review volume. To model this, we fit a Markov Chain that predicted the probability of any weekly SERP/SEO ranking for a review site based upon the domain, that site’s ranking the previous week, the total number of reviews for that location on that site, and whether the number of reviews went up or not.
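The core of such a model can be sketched by counting transitions. The version below conditions only on the previous week's rank and on whether a new review arrived; the domain and total review count that our model also used are left out, and the observations are invented for illustration.

```python
from collections import Counter, defaultdict

def fit_transitions(observations):
    """Estimate P(rank this week | rank last week, got a new review) by counting."""
    counts = defaultdict(Counter)
    for prev_rank, got_new_review, next_rank in observations:
        counts[(prev_rank, got_new_review)][next_rank] += 1
    return {
        state: {rank: n / sum(c.values()) for rank, n in c.items()}
        for state, c in counts.items()
    }

def expected_rank(transitions, prev_rank, got_new_review):
    """Average rank next week for a site starting in the given state."""
    dist = transitions[(prev_rank, got_new_review)]
    return sum(rank * p for rank, p in dist.items())

# Hypothetical weekly observations: (rank last week, new review?, rank this week).
observations = [
    (5, True, 4), (5, True, 5), (5, True, 4), (5, True, 5),
    (5, False, 5), (5, False, 6), (5, False, 5), (5, False, 5),
]
transitions = fit_transitions(observations)
with_review = expected_rank(transitions, 5, True)
without_review = expected_rank(transitions, 5, False)
```

Comparing the two expected ranks for the same starting position gives the kind of "impact of a new review" figure discussed in the rest of this section.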
The first thing we wanted to measure was this – Does getting new reviews positively impact your search engine rank? According to our data, the answer would appear to be yes. In the graph below we plot the predicted impact of getting a new review on one review site according to our model.
According to our data, after we normalize for domain, rank, and total number of reviews prior, review sites that got at least one new review in a given week tended to be placed higher the following week than sites that did not get new reviews. Obviously this impact is much higher when you have no reviews or very few reviews (an average improvement of 1/3 of a spot for sites getting their first review!), and it levels off pretty quickly once you have around a dozen reviews.
Our model spit out one other interesting insight. It found that review volume is important not just for getting a site ranked highly on SERP, but for keeping it there as well. Review site rankings drift from week to week, and our Markov Chain model captures that drift. But what the model also found is that for review sites with a high volume of reviews, regardless of where they ranked the week before, they tended to drift more towards the top of the page (or were more likely to stay there) than review sites with very few reviews.
This graph plots how much an auto dealer’s review volume will impact the drift of that ranking on average. In other words, if you have no reviews, your review site page will lose one spot every three weeks, on average, relative to the norm. If you have 50+ reviews, it will gain 1 spot every 5 weeks on average, relative to the norm. You might ask, “how can I gain a spot if I am already at the top?” Well, links that are in the top spot tend to lose that spot about 20% of the time. If that site has 50+ reviews, it will be much less likely to do so.
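To put those two drift rates side by side, the sketch below accumulates them over a 15-week window. The window length is an arbitrary assumption, and the calculation treats the average drift as constant from week to week.

```python
from fractions import Fraction

# Weekly drift in rank positions relative to the norm; positive means moving up the page.
drift_no_reviews = Fraction(-1, 3)  # loses one spot every three weeks
drift_50_plus = Fraction(1, 5)      # gains one spot every five weeks

weeks = 15
change_no_reviews = drift_no_reviews * weeks
change_50_plus = drift_50_plus * weeks
gap = change_50_plus - change_no_reviews
```

Over that window a page with no reviews slips about five spots relative to the norm while a page with 50+ reviews gains about three, a gap of roughly eight positions.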
Hopefully this analysis shines a light on the value of generating a healthy review volume on the review sites that you want your customers to be able to find on search engines. It also makes clear that those reviews are valuable not just because they help those review sites climb to the top of search engine results, but because they help those sites stay there as well. Keep in mind that these effects can vary considerably from domain to domain, and the most responsive domains may also vary from industry to industry.