We conducted a survey to understand the healthcare decisions respondents made for themselves or on behalf of a family member using online resources. A family member can include a child, spouse, or aging parent. Decisions were around providers, meaning doctors, clinics, hospitals, and health systems.
The Screener and Methodology
We began the survey with a short screener. Respondents whose answers to the screener's questions did not qualify were automatically exited from the survey. We looked for respondents who lived in a U.S. state, were 26-64 years old, made the majority of the healthcare decisions for themselves and their family, and had a consumer primary health insurance plan.
Consumer plans included coverage through an employer (the respondent's or their significant other's), coverage purchased directly from a private insurance company, or coverage purchased through a state health insurance exchange (healthcare.gov). Consumer plans did not include government programs like Medicare, Medicaid, VA, TriCare, or other military benefits programs, or coverage under a parent's insurance plan.
The survey results in this post also exclude anyone who wasn't sure of their insurance plan name or who did not have health insurance. The last part of the screener assessed whether the respondent had used the internet to find a new healthcare provider in the past two years.
The Kinds of Questions We Asked
- How do people search for providers?
- Where do they start their search? (search engine, map, etc. – does it matter based on what they are looking for?)
- Do they shop for doctors or hospitals? Do they search for an individual doctor's name?
- Do they start with a branded term – i.e. the name of a health care system or HMO?
- Do they search by a specialty or a condition?
- If one provider's listing has more detail than another's (credentials, hours, ratings, etc.), does that influence their decision making?
Consumption of Online Ratings and Reviews
- Are consumers using online ratings and reviews to choose a healthcare provider (doctor, hospital, clinic)?
- For what type of care? (urgent/emergent vs. elective vs. chronic specialty care)
- If so, how often are they looking at them and what sites do they trust?
Influence and Trust of Online Ratings and Reviews
- How are consumers using online reviews to make decisions about healthcare providers?
- Is there a difference between trust in first party sites (hospital- or practice-owned) and independent third party sites (Google, Major Review Site, Healthgrades)?
- What are consumers' perceptions of online reviews about healthcare providers?
- How much does it influence their decision to choose a healthcare provider?
- How much do they trust online reviews vs. a personal recommendation from a family member or friend? How about another doctor?
- What is more important to the decision maker – the star ratings or the actual qualitative information in the free text?
- Have they ever left an online review for a healthcare provider?
- What is the expectation around response to online reviews?
The answers to who makes the majority of the healthcare decisions for themselves and their families were interesting to see broken down by gender: it appears that women make slightly more of these decisions than men do.
But while 87% say they make the majority of healthcare decisions for themselves and their family, 65% say they don’t have children living in their home when those decisions are made. That implies that most of the female respondents make healthcare decisions for themselves, not dependents.
Broken down by insurance type, 63% of respondents are covered by an employer's health insurance plan, while 19.02% are covered by a government program like Medicare, Medicaid, VA, TriCare, or other military benefits programs.
When asked what they had spent the most time researching online in the last year, respondents said "healthcare providers" more often than retirement or automobile purchases. In fact, 64% have used the internet to research a healthcare provider in the past two years, and among them it's not surprising that:
- 87% say it’s for a primary care provider,
- 70% say it’s for a specialist, and
- 51% say it’s for urgent care or the emergency room.
But it's more interesting to see that people read reviews on other sites far more than on government sites:
- 13% medicare.gov,
- 16% Facebook,
- 21% Major Review Site,
- 48% doctor office/hospital site,
- 59% Healthgrades/Vitals, and
- 65% Google!
Respondents use Google five times more than medicare.gov! Healthgrades/Vitals is used 4-5 times more, and a local doctor office's or hospital's site 3.8 times more.
In the last 12 months, 46% of respondents have used the internet to evaluate a doctor, hospital, or clinic. 82% of respondents have read online reviews to evaluate a healthcare provider, and 80% of them say these ratings and reviews actually influenced their decision to select a provider. This is a big jump from the 37% of respondents in our 2016 survey who used review sites to select a new primary care physician:
Breaking down this question by red/blue state affiliation shows that both are almost equally likely to use star ratings/reviews to choose a provider, with a very slight edge of red states over blue states:
The largest spread in our survey shows up on the question of trust. Respondents' trust breaks down as follows:
- Google: 30.74%
- Healthgrades: 37.69%
- medicare.gov: 3.80%
That means they trust Google 8 times more than medicare.gov and Healthgrades 10 times more!
Broken down by age, this question of trust shows that the youngest respondents trust Google the most, while all other age groups trust Healthgrades/Vitals/WebMD/ZocDoc the most. All age groups trust medicare.gov the least, except for ages 18-29, who trust a Major Review Site the least.
When looking at a doctor's online reviews, the three most important factors when deciding where to go for care were "how positive the reviews are" (73%), "review recency" (72%), and "overall star rating" (54%). Less important were the "number of reviews the doctor has" (45%), "if the patient reviews have a response from the doctor/provider" (26%), "if the patient reviews were written by a verified or veteran reviewer" (21%), and lastly the "length of the reviews" (9%).
60% of respondents said a healthcare provider must have a minimum of 4 out of 5 stars for them to use that provider. The largest group (44%) said they read 6-10 patient reviews to fairly assess a provider. For a review to impact their decision to use a provider, the largest group (40%) says it must have been written within the last six months, and 25% say it must have been written within the last year.
68% of respondents have selected one provider over another based on star ratings or online reviews. Respondents aged 18-44 did so slightly more often than respondents aged 60+.
56% of respondents said they would not pay more out of pocket to see a doctor with better online reviews, but 66% said they would be willing to wait longer. When broken down by age, 18-29 year old respondents were more likely than other age ranges to pay more out of pocket, but were not much more likely than other demographics to wait.
At Reputation.com, we work with millions of online reviews from hundreds of sources. One of the unusual characteristics of reviews compared to the vast majority of text corpora is that, almost by definition, reviews are structured in such a way that they can be categorized (in one or many dimensions depending on the review site and/or industry). However, we often find ourselves doing text classification/tagging on topics that are not already labeled by the review site. This article is an informal introduction to a set of techniques we have developed to leverage existing unlabeled corpora in conjunction with the labeled data. In particular, we present a semi-supervised learning algorithm for multi-label text classification.
In recent years, a lot of text classification projects have used supervised learning methods (Naive Bayes, SVM) primarily due to their substantial improvements over non-supervised strategies such as traditional clustering in NLP tasks. Until very recently, most NLP classification work was done with the traditional Bag of Words (BOW) approach – perhaps with a bit of context through the use of a limited range of N-grams and skip-grams. BOW is a feature extraction technique where the text is represented as the frequency of each word in the document, disregarding grammar and ordering but keeping multiplicity. In most cases, defining a pipeline combining the BOW feature extraction technique with a Tf-Idf transform and a simple classifier (Naive Bayes, SVM) produces decent results with respect to most classification metrics.
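As a point of reference, here is a minimal sketch of such a pipeline in scikit-learn; the variables train_texts, train_labels, and test_texts are placeholders, not our actual data:

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# BOW counts -> Tf-Idf weighting -> simple classifier
bow_pipeline = Pipeline([
    ("counts", CountVectorizer(ngram_range=(1, 2))),  # unigrams plus bigrams for a little context
    ("tfidf", TfidfTransformer()),
    ("clf", MultinomialNB()),
])

bow_pipeline.fit(train_texts, train_labels)   # train_texts: list of strings, train_labels: topic ids
predicted = bow_pipeline.predict(test_texts)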
Semi-Supervised Learning with Word2Vec
In most tutorials, Word2Vec is presented as a stand-alone neural net preprocessor for feature extraction. Word2Vec generates a vector for each word in the text corpora in a high-dimensional space such that words that share contextual meaning are located in close proximity to one another. To use Word2Vec for classification, each word can be replaced by its corresponding word vector, and the word vectors are usually combined through a naive algorithm such as addition with normalization or cross product to get a sentence or document vector. Then, using these document vectors, we can train a simple classifier for multi-label classification. The advantage of using Word2Vec over a simple BOW feature extraction technique is that it supports semi-supervised learning: the vocabulary from both the labeled and unlabeled text can be used to generate the word vectors, which gives the words more contextual meaning. However, we have found that this approach does not provide significant improvements over a BOW approach, especially when there isn't a lot of labeled data for training the classifier.
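For illustration, a rough sketch of that averaging approach with gensim (4.x) and scikit-learn; labeled_texts, unlabeled_texts, and labels are hypothetical, and the vectors are combined by a simple normalized mean rather than anything more clever:

import numpy as np
from gensim.models import Word2Vec
from sklearn.linear_model import LogisticRegression

# Train word vectors on the full corpus (labeled + unlabeled); this is where the
# semi-supervised benefit comes from, since unlabeled text still shapes the embedding space.
all_tokens = [text.lower().split() for text in labeled_texts + unlabeled_texts]
w2v = Word2Vec(all_tokens, vector_size=100, window=5, min_count=2)

def doc_vector(tokens, model):
    """Average the vectors of in-vocabulary words and normalize the result."""
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    if not vecs:
        return np.zeros(model.vector_size)
    v = np.mean(vecs, axis=0)
    return v / (np.linalg.norm(v) or 1.0)

X = np.array([doc_vector(text.lower().split(), w2v) for text in labeled_texts])
clf = LogisticRegression(max_iter=1000).fit(X, labels)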
Semantic Convolution for Low Support Topics
A common problem in multi-label text classification is a major imbalance of labels in the text corpus. We often see cases where most (>60%) of the sampled data is about the most prevalent topic, and more than half the topic labels exist in <0.1% of the sampled data. With NLP and a BOW approach, this almost inherently causes a p >> n problem (the number of features far exceeds the size of the training corpus). Based on general rules of thumb, getting 1,000 training examples for a low support topic would require millions of labeled training examples, which is prohibitively expensive.
In this world of ‘big data’ the data itself is actually cheap, but developing a tagged training set can be expensive. In the course of our development, we devised an elegant and scalable way to develop and maintain a robust training set across tens of industries (this will be the topic of a separate blog post).
The premise of Semantic Convolution is simple: if a particular word is a good indicator of a particular label, then words with similar meanings (semantics) should also be good indicators of that label. Since we have qualitative evidence that Word2Vec vectors encode semantic meaning, we can use them to help find words with similar meanings in non-labeled corpora. This allows us to apply a Semantic transform after getting the term frequencies in the BOW pipeline and before applying the Tf-Idf transform. To apply the Semantic transform, we use the Word2Vec data to generate a correlation matrix between words with similar contextual meaning in the vocabulary.
Here, vocabulary is a dictionary mapping each term to an index; the code to generate the correlation matrix is:

import scipy.sparse

correlation_matrix = scipy.sparse.identity(len(vocabulary), format="dok")
for word in vocabulary:
    # most_similar returns (word, similarity) pairs; keep only close neighbors
    similar_words = [w for w, score in word2vec_model.most_similar(word, topn=5)
                     if score > 0.5]
    for similar_word in similar_words:
        if similar_word in vocabulary:
            correlation_matrix[vocabulary[word], vocabulary[similar_word]] = 1
Using this correlation matrix we can generate the term-document matrix with the augmented term frequencies.
term_frequency_vector += term_frequency_vector * correlation_matrix
Applying this transformation with the correlation matrix increases the word count of all words with contextually similar meaning in the text. This improves the feature collection for low support topics, which allows more precise classification of reviews about low support topics with higher confidence. This allows small amounts of labeled data to be more useful for the machine-learning model, which reduces the cost of developing a robust training set. Also, as mentioned above, this leverages semi-supervised learning from the unlabeled data by building the vocabulary and Word2Vec vectors based on the entire text corpora.
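One way to wire this into the earlier pipeline is as a custom transformer that sits between the count vectorizer and the Tf-Idf step. This is a minimal sketch, assuming the vocabulary and correlation_matrix built above (converted to CSR format for fast multiplication):

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

class SemanticTransform(BaseEstimator, TransformerMixin):
    """Augment term frequencies with the counts of contextually similar words."""
    def __init__(self, correlation_matrix):
        self.correlation_matrix = correlation_matrix  # sparse, vocab_size x vocab_size

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X + X @ self.correlation_matrix

semantic_pipeline = Pipeline([
    ("counts", CountVectorizer(vocabulary=vocabulary)),         # same vocabulary used for the matrix
    ("semantic", SemanticTransform(correlation_matrix.tocsr())),
    ("tfidf", TfidfTransformer()),
    ("clf", MultinomialNB()),
])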
Ultimately, Semantic Convolution extracts more value from limited labeled data and improves the performance of the machine-learning algorithm on classification tasks, especially for low support categories. Also, semi-supervised learning with Word2Vec leverages the information in the vast amounts of unlabeled data while increasing both the precision and the support of the machine-learning model.
Dweep Shah and Anthony Johnson
Extracting meaning from online reviews is key to turning seemingly anecdotal reviews into actionable customer satisfaction insights that point to improvement opportunities or to authentic, potentially differentiating strengths. One way to do that is to apply machine learning to automatically read customer reviews and identify the most relevant topics that are the subject of each review. With this information, you can find themes in what customers are saying about a business across thousands of reviews, and then help businesses identify areas in which they receive a disproportionate number of negative reviews so that they can focus operational efforts on those areas and improve both customer experience and their online reputation.
We have been working for a while on several approaches, models, and data sets to extract topics and categories from customer reviews with high precision. In this post we will give an overview of a few neural network models that provide satisfactory results for physician-related reviews. To start, we built a taxonomy of categories relevant to physician reviews, looking both at clinical patient experience topics from standard patient assessment surveys designed by CMS (the Centers for Medicare and Medicaid Services) and at non-clinical topics related to parking, technology/amenities, and cleanliness that are commonly referred to in physician reviews. Then we gathered training data by having a group of crowd-sourced individuals tag a set of 10,000 reviews with the following categories (this is the subject of an upcoming blog entry):
- Administrative Process
- Bedside Manner
- Getting an Appointment
- Likely/Unlikely to recommend
- Staff Courtesy
- Price/Billing issues
- Wait Time
Given this training data, we used a biologically-inspired variant of Artificial Neural Networks to build a classifier that automatically assigns categories to online physician reviews. These neural network classifiers are based on how an animal’s visual cortex processes and exploits the strong spatially local correlation present in natural images. Those models are generally used for image recognition, but are being increasingly used in other fields, especially text classification. Given the promising results documented in this space, we decided to evaluate Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN) with respect to our classification problem.
After an initial trial, we decided to focus our implementation on CNN models as they execute faster, are easier to understand, and had comparable results to RNN.
Principle of CNNs
The starting point of our CNNs was to represent a review by a matrix where each row is a vector that represents a word. These word vectors can be low-dimensional representations or one-hot vectors that index words into a vocabulary. Given this matrix, you can then apply several convolutional filters to groups of rows, each followed by a 1-max pooling (the largest number from each feature map is recorded), in order to extract the meaning of the group of words considered at the beginning. Finally, a softmax layer is applied to generate the assessed probabilities of the review belonging to each class.
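A minimal sketch of this architecture in Keras, with illustrative (not our actual) hyperparameters; each per-category model is a two-class classifier, as described in the next section:

from tensorflow.keras import layers, Model

MAX_LEN, VOCAB_SIZE, EMBED_DIM, NUM_CLASSES = 200, 20000, 128, 2

inputs = layers.Input(shape=(MAX_LEN,))                     # word indices for one review
embedded = layers.Embedding(VOCAB_SIZE, EMBED_DIM)(inputs)  # each row is a word vector

# Convolutional filters over windows of 3, 4, and 5 consecutive words,
# each followed by 1-max pooling (keep the largest value per feature map).
pooled = []
for width in (3, 4, 5):
    conv = layers.Conv1D(filters=100, kernel_size=width, activation="relu")(embedded)
    pooled.append(layers.GlobalMaxPooling1D()(conv))

merged = layers.Concatenate()(pooled)
outputs = layers.Dense(NUM_CLASSES, activation="softmax")(merged)  # probability per class

model = Model(inputs, outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])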
CNN Implementation Approach
We generated a CNN model for each category independently, each broken into two classes: reviews that belong to the category and reviews that do not.
Thus, to produce all of the hidden parameters of these models, we fed them with reviews from the training data that were already categorized and modified the parameters incrementally in order to minimize a loss function (a function that represents the difference between the prediction and the real categories).
CNN Model Effectiveness
To assess the performance of these models, we split the 10,000 reviews into 8,000 reviews for training and 2,000 for testing. Given a model built on the training data, we predicted whether each review in the test data belonged in each category and assessed the precision and recall of our predictions with respect to each category.
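The per-category evaluation can be sketched with scikit-learn as follows; reviews, tags, categories, and predict_category are placeholders for our data and per-category models:

from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score

# 8,000 reviews for training, 2,000 for testing
train_reviews, test_reviews, train_tags, test_tags = train_test_split(
    reviews, tags, train_size=8000, test_size=2000, random_state=0)

for category in categories:
    y_true = [category in review_tags for review_tags in test_tags]  # human labels
    y_pred = predict_category(category, test_reviews)                # per-category model output
    print(category,
          "precision:", precision_score(y_true, y_pred),
          "recall:", recall_score(y_true, y_pred))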
For the largest categories, we found that our models delivered an overall precision of 81% and an overall recall of 75%. At first sight, those results did not appear very good. However, when we dug deeper, we found that when we considered each element of the crowd-sourced (human-based) training data to be a prediction in itself, these tags exhibited precision and recall metrics lower than 70% (versus the consensus of the group). Thus, our model outperformed human classification.
Furthermore, looking deeper into the training data, we realized that some reviews were truly ambiguous and some categories were not precise or discerning enough, which resulted in a high degree of disagreement between humans evaluating the same review. After removing the most ambiguous reviews from the training data set, we observed a marked increase in the overall accuracy of the model. What to do about those ambiguous reviews, and how to fine-tune the categories, will be the subject of a future post.
The results from CNN models are promising, and we are pushing them further by experimenting with several modifications of the model such as: oversampling the training set in order to have balanced data for each category, splitting reviews by characters instead of by words, and initializing with a low-dimensional representation of words using Word2Vec. Stay tuned for further updates.
A couple of months ago, we looked at the relationship between review site rankings on a business’s local SERP/SEO and the number of reviews of the business on those sites. We found a significant positive relationship between the number of reviews and how highly those sites ranked in local search results.
Most of the analysis that time around centered on automobile dealers around the country and where their Facebook and DealerRater presences ranked in Google search results targeting that dealer. This time around we expanded that analysis to multiple review sites and multiple industries and found that the relationship between review volume and local search ranking varies wildly by domain and industry. We also dug a little deeper into the data to try to estimate the value of adding reviews on these sites over time, and we found that new reviews are valuable in two ways. First, new reviews help review sites rise on search engine results pages, and second, the more reviews that site acquires, the better its chances are of staying at the top of the results page as well.
Reviews and SEO across sites and industries
First, let’s look at the relationship between review volume and domain ranking in local search for that same set of automobile dealers. (Note: None of these analyses include Google, since the Google review presence is usually anchored on the right-hand side of the page.)
The Facebook and DealerRater lines here match up pretty well to the data we presented before. We also see a correlation between review volume and domain rank for the other domains, but it is notable that the apparent impact of additional reviews varies a bit by source. For instance, for four of these domains the average rank of the review site for a location with no reviews is between 8.5 and 10. Having 100 reviews on DealerRater brings the expected rank of that domain down to the top half of the first page, whereas for cars.com and Facebook, we would still expect 100 reviews to leave that site below the fold when someone is searching for that location. Edmunds.com is even worse. It seems no matter how many reviews you get, Google is determined to pin your Edmunds presence to the top of the 2nd page.
This data would lead us to hypothesize that, on average, an additional review on DealerRater is worth considerably more for a car dealer than an additional review on one of these other sites. But before we explore that hypothesis a little more, let's look at similar data for a few other industries. Next let's look at hospitals:
These are the review site domains that most commonly showed up when we googled over 1000 US hospitals. Again we see the expected directional relationship: more reviews mean a generally better SERP/SEO ranking. However, none of these curves are as steep as the steepest curves for auto dealers. It's also very interesting to note how much Google seems to value a healthgrades.com page regardless of whether there are any reviews on it.
And here is the data for the Self-Storage Unit industry. There isn’t as much breadth in this industry, as there aren’t as many review sites with high volume, but it is very interesting to note that in the storage industry, Facebook has a very strong correlation between review volume and SERP/SEO rank.
All of this is very interesting, but it raises several questions. Most notably, what makes the SERP/SEO ranking of particular review sites seem to be so responsive to review volume in particular industries? And is there actually causation here or does something else explain why some of these correlations are so strong?
Review volume impact on local SEO over time
Let's address the causation question by looking at some more dynamic data, specifically at how these rankings and volumes change over time. This is still a long way from a controlled experiment, but it would be more compelling if we could show that as review volumes rise for a particular location on a particular site, the SERP/SEO ranking of that site's listing for the location tends to improve.
Over the last couple of months we gathered SERP/SEO data once a week for several thousand US auto dealers. We then looked at the rankings for major review sites over time and how those changes correlated with total review volume and with changes in review volume. To model this, we fit a Markov Chain that predicted the probability of any weekly SERP/SEO ranking for a review site based upon the domain, that site’s ranking the previous week, the total number of reviews for that location on that site, and whether the number of reviews went up or not.
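As a rough sketch of how such a transition model can be estimated from the weekly snapshots (the dataframe and column names here are illustrative, not our production schema):

import pandas as pd

# weekly: one row per (location, domain, week) with columns rank, prev_rank,
# review_count, and got_new_review
weekly["volume_bucket"] = pd.cut(weekly["review_count"],
                                 bins=[-1, 0, 10, 50, float("inf")],
                                 labels=["0", "1-10", "11-50", "50+"])

# Empirical transition probabilities: P(this week's rank | domain, last week's rank,
# review volume bucket, whether a new review arrived that week)
transition_probs = (weekly
    .groupby(["domain", "prev_rank", "volume_bucket", "got_new_review"])["rank"]
    .value_counts(normalize=True))

# Expected rank for each state; comparing states that differ only in got_new_review
# estimates the lift from a new review, and comparing volume buckets estimates the drift.
expected_rank = (weekly
    .groupby(["domain", "prev_rank", "volume_bucket", "got_new_review"])["rank"]
    .mean())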
The first thing we wanted to measure was this – Does getting new reviews positively impact your search engine rank? According to our data, the answer would appear to be yes. In the graph below we plot the predicted impact of getting a new review on one review site according to our model.
According to our data, after we normalize for domain, rank, and total number of reviews prior, review sites that got at least one new review in a given week tended to be placed higher the following week than sites that did not get new reviews. Obviously this impact is much higher when you have no reviews or very few reviews (an average improvement of 1/3 of a spot for sites getting their first review!), and it levels off pretty quickly once you have around a dozen reviews.
Our model spit out one other interesting insight. It found that review volume is important not just for getting a site ranked highly on SERP, but for keeping it there as well. Review site rankings drift from week to week, and our Markov Chain model captures that drift. But what the model also found is that for review sites with a high volume of reviews, regardless of where they ranked the week before, they tended to drift more towards the top of the page (or were more likely to stay there) than review sites with very few reviews.
This graph plots how much an auto dealer’s review volume will impact the drift of that ranking on average. In other words, if you have no reviews, your review site page will lose one spot every three weeks, on average, relative to the norm. If you have 50+ reviews, it will gain 1 spot every 5 weeks on average, relative to the norm. You might ask, “how can I gain a spot if I am already at the top?” Well, links that are in the top spot tend to lose that spot about 20% of the time. If that site has 50+ reviews, it will be much less likely to do so.
Hopefully this analysis shines a light on the value of generating a healthy review volume on the review sites you want your customers to find on search engines, and makes it clear that those reviews are valuable not just because they help those review sites climb to the top of search engine results, but because they help those sites stay there as well. Also, beware that these effects can vary considerably from domain to domain, and the most responsive domains may also vary from industry to industry.
Introduction to word2phrase
When we communicate, we often know that individual words in the right places can change the meaning of what we're trying to say. Add "very" in front of an adjective and you place more emphasis on the adjective. Add "york" after the word "new" and you get a location. Throw in "times" after that and now it's a newspaper.
It follows that when working with data, these meanings should be known. The three separate words "new", "york", and "times" are very different from "New York Times" as one phrase. This is where the word2phrase algorithm comes into play.
At its core, word2phrase takes in a sentence of individual words and potentially turns bigrams (two consecutive words) into a phrase by joining the two words together with a symbol (an underscore in our case). Whether or not a bigram is turned into a phrase is determined by the training set and parameters set by the user. Note that every pair of consecutive words is considered, so in a sentence with words w1 w2 w3, the bigrams would be w1 w2 and w2 w3.
In our word2phrase implementation in Spark (done similarly in Gensim), there are two distinct steps: a training (estimator) step and an application (transform) step.
*For clarity, note that “new york” is a bigram, while “new_york” is a phrase.
The training step is where we pass a training set to the word2phrase estimator. The estimator takes this dataset and produces a model using the algorithm. The model acts as the transformer: we pass it the datasets we want to transform, i.e. sentences with bigrams that we may want to turn into phrases.
In the training step, the dataset is an array of sentences. The algorithm takes these sentences and applies the following formula to give a score to each bigram:
score(wi, wj) = (count(wiwj) – delta) / (count(wi) * count(wj))
where wi and wj are word i and word j, count(wiwj) is the number of times word j immediately follows word i, and delta is a discounting coefficient that can be set to prevent phrases consisting of infrequent words from being formed.
After the score for each bigram is calculated, those above a set threshold (this value can be changed by the user) will be transformed into phrases. The model produced by the estimator step is thus an array of bigrams: the ones that should be turned into phrases.
The transform step is incredibly simple; pass in any array of sentences to your model and it will search for matching bigrams. All matching bigrams in the array you passed in will then be turned to phrases.
You can repeat these steps to produce trigrams (i.e. three words joined into a phrase). For example, "I read the New York Times" may produce "I read the new_york Times" after the first run, but run it again and you get "I read the new_york_times", because in the second run "new_york" is treated as an individual word.
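To make the estimator and transform steps concrete, here is a small pure-Python sketch of the same scoring and joining logic (the function names are ours; the production version is the Spark package shown below):

from collections import Counter

def fit_phrases(sentences, delta=100, threshold=0.00001):
    """Estimator step: return the set of bigrams whose score exceeds the threshold."""
    word_counts, bigram_counts = Counter(), Counter()
    for tokens in sentences:
        word_counts.update(tokens)
        bigram_counts.update(zip(tokens, tokens[1:]))
    phrases = set()
    for (wi, wj), count in bigram_counts.items():
        score = (count - delta) / (word_counts[wi] * word_counts[wj])
        if score > threshold:
            phrases.add((wi, wj))
    return phrases

def transform(tokens, phrases):
    """Transform step: join every matching bigram with an underscore."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in phrases:
            out.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

Fitting a second model on the once-transformed corpus and transforming again is what produces trigrams like new_york_times.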
First we create our training dataset; it's a dataframe where the bigrams "new york" and "test drive" appear frequently. (The sentences make no sense as they are randomly generated words. See below for a link to the full dataframe.)
You can copy/paste this into your spark shell to test it, so long as you have the word2phrase algorithm included (available as a maven package with coordinates com.reputation.spark:word2phrase:1.0.1).
Download the package, create our test dataframe:
spark-shell --packages com.reputation.spark:word2phrase:1.0.1
val wordDataFrame = sqlContext.createDataFrame(Seq(
  (0, "new york test drive cool york how always learn media new york ."),
  (1, "online york new york learn to media cool time ."),
  (2, "media play how cool times play ."),
  (3, "code to to code york to loaded times media ."),
  (4, "play awesome to york ."),
  // ... intermediate rows omitted ...
  (1099, "work please ideone how awesome times ."),
  (1100, "play how play awesome to new york york awesome use new york work please loaded always like ."),
  (1101, "learn like I media online new york ."),
  (1102, "media follow learn code code there to york times ."),
  (1103, "cool use play work please york cool new york how follow ."),
  (1104, "awesome how loaded media use us cool new york online code judge ideone like ."),
  (1105, "judge media times time ideone new york new york time us fun ."),
  (1106, "new york to time there media time fun there new like media time time ."),
  (1107, "awesome to new times learn cool code play how to work please to learn to ."),
  (1108, "there work please online new york how to play play judge how always work please ."),
  (1109, "fun ideone to play loaded like how ."),
  (1110, "fun york test drive awesome play times ideone new us media like follow .")
)).toDF("label", "inputWords")
We set the input and output column names and create the model (the estimator step, represented by the fit(wordDataFrame) function).
scala> val t = new Word2Phrase().setInputCol("inputWords").setOutputCol("out")
t: org.apache.spark.ml.feature.Word2Phrase = deltathresholdScal_f07fb0d91c1f
scala> val model = t.fit(wordDataFrame)
Here are some of the top-scoring bigrams (Table 1) found by the algorithm before those below the threshold are removed (note that all the bigrams scoring above the threshold are among them). The default values are delta -> 100, threshold -> 0.00001, and minWords -> 0.
- test drive
- work please
- new york
- york new
- york york
- york how
- how new
- new new
- to new
- york to

(only showing the top 10 rows)
So our model produces three bigrams that will be searched for in the transform step:
We then use this model to transform our original dataframe sentences and view the results. Unfortunately you can’t see the entire row in the spark-shell, but in the out column it’s clear that all instances of “new york” and “test drive” have been transformed into “new_york” and “test_drive”.
scala> val bi_gram_data = model.transform(wordDataFrame)
bi_gram_data: org.apache.spark.sql.DataFrame = [label: int, inputWords: string … 1 more field]
|          inputWords|                 out|
|new york test dri…| new_york test_dri…|
|online york new y…| online york new_y…|
|media play how co…| media play how co…|
|code to to code y…| code to to code y…|
|play awesome to y…| play awesome to y…|
| like I I always .| like I I always .|
|how to there lear…| how to there lear…|
|judge time us pla…| judge time us pla…|
|judge test drive …| judge test_drive …|
|judge follow fun …| judge follow fun …|
| how I follow ideo…| how I follow ideo…|
| use use learn I t…| use use learn I t…|
| us new york alway…| us new_york alway…|
| there always how …| there always how …|
| always time media…| always time media…|
|how test drive to…| how test_drive to…|
| cool us online ti…| cool us online ti…|
|follow time aweso…| follow time aweso…|
| us york test driv…| us york test_driv…|
| use fun new york …| use fun new_york …|
only showing top 20 rows
The algorithm and test dataset (testSentences.scala) are available at this repository.
One of the goals of the Analytics team has been to provide newer, more in-depth ways to analyze the millions of comments that Reputation aggregates from various sources for each customer. One way to do this is through natural language processing (NLP) techniques like part-of-speech (POS) tagging, named entity recognition (NER), and stemming/lemmatization. Combining these NLP techniques with our existing segmentation tools allows us to begin comparing statistics across sets defined by the language content of those comments. For example, we could look at the set of Walgreens comments that mention Rite-Aid and see that these had higher than average ratings in comparison to the total set of Walgreens comments.
These evaluations, however, initially required us to load the set of comments we wished to analyze into Python and then run each comment through a natural language parser locally, one at a time, every time we wanted to run an analysis. The overhead required to parse each of these reviews began to impede our ability to rapidly test different types of analyses, so we began to look into alternative methods. What we were ultimately looking for was a pre-processed database that would let us look up a comment by id and receive a set of POS tags, named entities, and lemmas without having to re-parse the comment each time. This natural language pre-processing would need to be done retroactively on the tens of millions of comments already stored in our database, as well as incrementally on any new comments pulled in every few days.
Since much of our analysis framework was already implemented in Python, we began adding this new NLP piece in Python as well. Of the various NLP libraries available to Python at the time of this writing, the one that seemed to work best on the 2-3 sentence reviews in our database was the CoreNLP library from Stanford. Essentially CoreNLP comes with a series of models that have been trained on a large corpus of sample words for different languages (presently English, Arabic, Chinese, French and German). These models are then used to evaluate the likely part-of-speech of new inputs based on patterns learned from the original training data. The library also uses similar processes to determine which words in a given input are references to some named entity (for example an organization, individual name, or location name) and to identify the stem form of each word for easier pattern analysis.
The downside of using CoreNLP, however, is that in order to run, it starts up a new, separate Java process which is then passed one comment at a time for parsing. Starting up this Java process creates 5-10 minutes of overhead for processing a set of comments of any size, and even once this separate process is running it can take a few minutes to fully parse an average length comment (3-5 sentences). Thus to run all the millions of historical comments through CoreNLP in a serial fashion would be computationally infeasible. Instead, we decided to use Apache Spark to bring up a distributed cluster to run these comments through CoreNLP in parallel.
Spark provides a set of libraries in either Python, Scala, R, or Java that handle the hassle of creating a distributed cluster of nodes and efficiently distributing data between them. While it can be used for a wide variety of purposes, we used it to take the set of comments that we needed to evaluate and figure out how to split those comments amongst clusters of varying sizes in order to reduce the time necessary to run all of our historical data through CoreNLP. Using Spark also provided the added bonus of easily integrating with AWS’ Elastic Map-Reduce (EMR) service, which has an easy-to-use command line interface for bringing up clusters of EC2 nodes. Amazon has preconfigured settings to automatically pass the relevant information about each EMR cluster through to Spark so that we can easily bring up any number of nodes with the same code. This makes it easy to setup a cron task to automatically parse the last few days worth of reviews on a regular basis.
Additionally, while we originally set out to create a Python application to interact with Spark and CoreNLP, we eventually discovered that we needed the ability to more carefully control which information CoreNLP passed to each Spark process. Since Spark is capable of running multiple threads on each node in order to better parallelize and since each thread runs a separate version of our Spark application, we noticed that each Python application in each thread was instantiating its own CoreNLP Java process. This meant that if we had 4 threads running on the same node, we would also have 4 CoreNLP Java processes running on that node, which would slow that node’s performance to a crawl. To get around this, we had to translate our application into Scala instead. Scala allows for the existence of transient variables, which allowed us to write our code in such a way that when multiple threads are running on the same node, they all use the same CoreNLP Java process, but whenever a new node is brought up it brings up a new process. (Thanks to Databrick’s Spark/CoreNLP wrapper for this idea!)
Below is some of the code from our Scala-based Spark application. It is designed to do the following:
- Pull in some number of reviews from our Vertica database.
- Distribute those reviews to a cluster of independent nodes.
- Run each review through the CoreNLP process for that node.
- Format CoreNLP's output so that it can be uploaded back into Vertica
- Upload the natural language data (POS tags, NER tags, and lemmas) back into the database
Click here for Github Gist
Once our Spark application was working on local developer machines, we began testing it through EMR's distributed clusters instead. Initially we ran into some headaches getting Spark to fully utilize the resources made available to it through EMR. There is a line in the code above that pulls in the number of nodes available through the Spark Config (val num_exec = sc.getConf.get("spark.executor.instances").toInt). This line tells Spark how many nodes it has available so that it can partition the data accordingly. Below are two screenshots of the CPU usage per node in AWS from before this change and after it:
Before proper partitioning – Notice that in this case, the node in blue is the only one that appears to be actually doing any parsing. This is because Spark defaults to assuming a single data partition, so it runs all the comments through the master node.
After proper partitioning – By explicitly telling Spark how many nodes to use, we can see that it now runs some comments through all 8 nodes. (Thanks to Cloudera for explaining this and more about how to properly tune Spark jobs!)
Additionally, we ran into some trouble getting EMR to communicate with Vertica through the database's security restrictions, which involved playing with our VPN settings. Once these hurdles were dealt with, we were able to begin testing the scaling power of this CoreNLP/Spark/EMR solution. The following graph shows the number of minutes it took Spark to run as a function of the number of thousands of comments run per instance in the EMR cluster. As you can see, the time to run increases linearly with how many comments each node is required to run.
Minutes to Run vs. # of thousands of comments per node in cluster – This graph shows the time it takes Spark to run our process as a function of the number of comments per distributed node in the cluster. It shows a more or less linear relationship up until the point where there are more than a million comments per node.
The outlier point at 1000 on the x-axis (= 1 million comments per node) is from when we ran all of our historical comments. Further research is required to figure out why performance seems to have degraded for that point.
Interestingly, we also found that when the number of comments per node increases above about a million or so, the EMR task would fail without outputting any errors in the logs (this is what happened with the rightmost data point on the above graph). This may be due to insufficient resources to run the number of comments assigned to that node (we used Amazon's m3.xlarge instances for each node on each run), but we haven't done enough analysis to confirm this. The short-term solution to this problem was simply to provide more nodes and get the ratio of comments per node back down to around 1 million or so.
At Reputation, our clientele includes both businesses and individuals, and for businesses, one of the primary components of online reputation is the corpus of online reviews written about that business. As a result, we’ve spent a lot of time assessing how online reviews impact a company’s reputation and ultimately, the bottom line.
Online reviews can impact your company and its reputation in a number of ways. At the most granular level, each individual review can impact whether a potential future customer walks through your door and even anchor what that future customer is likely to think of you. These reviews also aggregate into a high-level summary of your company (e.g. an overall star rating and review count) which each review site (e.g. Google, Facebook, Cars) will use to rank you internally within their site and may be all that many prospective customers ever see about you.
Over the next few months we will dig into various components of these interactions, but for now we will start at an even higher level, looking at how these various ratings and reviews impact how your business is represented on popular search engines such as Google and Bing.
As the online reputation space matures, our clients wind up asking more and more about the impact of online reviews on other business metrics. These questions are normally packaged up under the topic of return on investment (ROI), which is a little nebulous. It could mean anything from store-level foot traffic, to increasing the effectiveness of marketing spend, to creating visibility, to showing a concrete impact on sales. Over the years, we have been able to make cases for most of these, but the most fun one so far has been reverse engineering what is going through Google's mind: online reputation's impact on a location's local search engine visibility. Our hypothesis is simple: we think that Google boosts review sites on the Search Engine Results Page (SERP) based on their review volume and ratings.
A small housekeeping note – we define Local SERP as what shows up when you do a local search, on your mobile phone or on the web, for a specific area.
We are starting with a data set that includes about 20k reviews from Dealer Rater and Facebook.
Let's assume that different sites are treated differently by Google and start by looking at the same data segmented by the source of the reviews:
Since we are dealing with only two factors here, we can look at a scatterplot to get a first level sense of interactions:
A quick analysis of these graphs reveals two interesting relationships:
- The number of reviews has a clear impact on the SERP Rank (right-most column, top graph)
- Rating has a correlation with Rank, but not one that appears to be very strong (left-most column, middle graph)
Let's run a normal linear regression and see if we can get an overall sense of the importance of the two factors. There are arguably better ways of doing this, but it will help confirm our suspicions about the relative importance of these factors:
fit <- lm(Rank ~ Rating + log(review_link_text + 1), data = source_data)
summary(fit)

Call:
lm(formula = Rank ~ Rating + log(review_link_text + 1), data = source_data)

Residuals:
    Min      1Q  Median      3Q     Max
-7.3197 -2.4955 -0.8922  1.6404 22.2808

Coefficients:
                           Estimate Std. Error t value Pr(>|t|)
(Intercept)                 8.47824    0.06445 131.546  < 2e-16 ***
Rating                      0.53441    0.07849   6.809 9.98e-12 ***
log(review_link_text + 1)  -0.99965    0.01278 -78.191  < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.668 on 48297 degrees of freedom
Multiple R-squared: 0.1157, Adjusted R-squared: 0.1156
F-statistic: 3158 on 2 and 48297 DF, p-value: < 2.2e-16
Using a relative importance metric calculation, we get:
Relative importance metrics:
lmg last first pratt
Rating 0.01974774 0.007525053 0.03137841 0.01543049
log(review_link_text + 1) 0.98025226 0.992474947 0.96862159 0.98456951
This confirms what we saw in the graphs – there is a very strong correlation between the review count and the Rank. Rating is also there – it has a small impact, but this needs more analysis. We will look at that in more detail in another blog post and come up with some strategies for that.
Let’s explore the relationship between quantity and SEO visibility a little more and see if we can get a better picture by using a quantile binning of the review count:
It is pretty clear that having a baseline volume of reviews has a huge impact on your SEO visibility – the first 10 reviews that a location gets can boost a location up from hovering around the bottom of the first page or second page to clearly in the top half of the first page.
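For illustration, that binning step could look like this in Python/pandas (the dataframe and column names are hypothetical):

import pandas as pd

# data: one row per (location, review site) with a review_count and a serp_rank
data["volume_bin"] = pd.qcut(data["review_count"], q=10, duplicates="drop")
print(data.groupby("volume_bin")["serp_rank"].agg(["mean", "median", "count"]))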
With search engine rankings, generally speaking we are most interested in how many people click on the links (it is a decent approximation of how many users are looking at your locations specifically.) Other companies like Moz.com have done interesting research on this – and, since all we really care about is approximating the CTR, this should be more than enough: https://moz.com/blog/google-organic-click-through-rates-in-2014
If we take this CTR data into account, we can see that having 50 reviews can increase the expected click through rate by 266% compared to a baseline location.
Now this leaves many open questions – for example, we have just identified a clear correlation between the number of reviews and the SEO ranking, but certainly not a causal effect. In future posts on this topic, we will be looking at the following questions:
- Can we identify some likelihood of causality here?
- How does the industry you are in change this?
- Does Google treat different review sites differently?
- What are other factors that affect this?
- How long does it take for review volume to have an impact on search rank?