A couple of months ago, we looked at the relationship between review site rankings on a business’s local SERP/SEO and the number of reviews of the business on those sites. We found a significant positive relationship between the number of reviews and how highly those sites ranked in local search results.
Most of the analysis that time around centered on automobile dealers around the country and where their Facebook and DealerRater presences ranked in Google search results targeting that dealer. This time around we expanded that analysis to multiple review sites and multiple industries and found that the relationship between review volume and local search ranking varies wildly by domain and industry. We also dug a little deeper into the data to try to estimate the value of adding reviews on these sites over time, and we found that new reviews are valuable in two ways. First, new reviews help review sites rise on search engine results pages, and second, the more reviews that site acquires, the better its chances are of staying at the top of the results page as well.
Reviews and SEO across sites and industries
First, let’s look at the relationship between review volume and domain ranking in local search for that same set of automobile dealers. (Note: None of these analyses include Google, since the Google review presence is usually anchored on the right-hand side of the page.)
The Facebook and DealerRater lines here match up pretty well to the data we presented before. We also see a correlation between review volume and domain rank for the other domains, but it is notable that the apparent impact of additional reviews varies a bit by source. For instance, for four of these domains the average rank of the review site for a location with no reviews is between 8.5 and 10. Having 100 reviews on DealerRater brings the expected rank of that domain down to the top half of the first page, whereas for cars.com and Facebook, we would still expect 100 reviews to leave that site below the fold when someone is searching for that location. Edmunds.com is even worse. It seems no matter how many reviews you get, Google is determined to pin your Edmunds presence to the top of the 2nd page.
This data leads us to hypothesize that, on average, an additional review on DealerRater is worth considerably more to a car dealer than an additional review on one of these other sites. But before we explore that hypothesis further, let’s look at similar data for a few other industries, starting with hospitals:
These are the review site domains that most commonly showed up when we googled over 1000 US hospitals. Again we see the expected directional relationship: more reviews generally means a better SERP/SEO ranking. However, none of these curves is as steep as the steepest curves for auto dealers. It’s also interesting to note how much Google seems to value a healthgrades.com page regardless of whether there are any reviews on it.
And here is the data for the Self-Storage Unit industry. There isn’t as much breadth in this industry, as there aren’t as many review sites with high volume, but it is very interesting to note that in the storage industry, Facebook has a very strong correlation between review volume and SERP/SEO rank.
All of this is very interesting, but it raises several questions. Most notably, what makes the SERP/SEO ranking of particular review sites seem to be so responsive to review volume in particular industries? And is there actually causation here or does something else explain why some of these correlations are so strong?
Review volume impact on local SEO over time
Let’s address the causation question by looking at some more dynamic data, specifically at how these rankings and volumes change over time. This is still a long way from a controlled experiment, but it would be more compelling if we could show that as review volumes rise for a particular location on a particular site, the SERP/SEO rank number of that site for that location tends to fall (that is, the site moves closer to the top of the page).
Over the last couple of months we gathered SERP/SEO data once a week for several thousand US auto dealers. We then looked at the rankings for major review sites over time and how those changes correlated with total review volume and with changes in review volume. To model this, we fit a Markov Chain that predicted the probability of any weekly SERP/SEO ranking for a review site based upon the domain, that site’s ranking the previous week, the total number of reviews for that location on that site, and whether the number of reviews went up or not.
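To make the modeling idea concrete, here is a minimal sketch of estimating transition probabilities from week-over-week observations. It is not the model we fit: it assumes the state is reduced to last week's rank and whether a new review arrived, leaving out domain and total review count. The Observation type, the function names, and the sample numbers are all hypothetical.

```scala
object RankChainSketch {
  // One observed week-over-week change for a review site on a location's results page.
  // (Hypothetical simplification: domain and total review count are left out.)
  final case class Observation(prevRank: Int, gotNewReview: Boolean, nextRank: Int)

  // Empirical transition probabilities: P(nextRank | prevRank, gotNewReview).
  def transitionProbs(obs: Seq[Observation]): Map[(Int, Boolean), Map[Int, Double]] =
    obs.groupBy(o => (o.prevRank, o.gotNewReview)).map { case (state, group) =>
      val total = group.size.toDouble
      state -> group.groupBy(_.nextRank).map { case (rank, g) => rank -> g.size / total }
    }

  // Expected rank next week for a given state, if that state was ever observed.
  def expectedNextRank(
      probs: Map[(Int, Boolean), Map[Int, Double]],
      prevRank: Int,
      gotNewReview: Boolean): Option[Double] =
    probs.get((prevRank, gotNewReview)).map(_.map { case (rank, p) => rank * p }.sum)

  // Made-up observations, for illustration only.
  val sample: Seq[Observation] = Seq(
    Observation(5, gotNewReview = true, 4),
    Observation(5, gotNewReview = true, 4),
    Observation(5, gotNewReview = true, 5),
    Observation(5, gotNewReview = true, 5),
    Observation(5, gotNewReview = false, 5),
    Observation(5, gotNewReview = false, 6)
  )
}
```

Comparing expectedNextRank for the same previous rank with and without a new review gives a crude version of the "impact of a new review" comparison; with real data you would also want to condition on domain and review count, as described above.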
The first thing we wanted to measure was this – Does getting new reviews positively impact your search engine rank? According to our data, the answer would appear to be yes. In the graph below we plot the predicted impact of getting a new review on one review site according to our model.
According to our data, after we normalize for domain, rank, and total number of reviews prior, review sites that got at least one new review in a given week tended to be placed higher the following week than sites that did not get new reviews. Obviously this impact is much higher when you have no reviews or very few reviews (an average improvement of 1/3 of a spot for sites getting their first review!), and it levels off pretty quickly once you have around a dozen reviews.
Our model spit out one other interesting insight. It found that review volume is important not just for getting a site ranked highly on SERP, but for keeping it there as well. Review site rankings drift from week to week, and our Markov Chain model captures that drift. But what the model also found is that for review sites with a high volume of reviews, regardless of where they ranked the week before, they tended to drift more towards the top of the page (or were more likely to stay there) than review sites with very few reviews.
This graph plots how much an auto dealer’s review volume will impact the drift of that ranking on average. In other words, if you have no reviews, your review site page will lose one spot every three weeks, on average, relative to the norm. If you have 50+ reviews, it will gain 1 spot every 5 weeks on average, relative to the norm. You might ask, “how can I gain a spot if I am already at the top?” Well, links that are in the top spot tend to lose that spot about 20% of the time. If that site has 50+ reviews, it will be much less likely to do so.
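The per-week figures above can be turned into a rough cumulative comparison. The sketch below assumes the average drift simply adds up week after week, which is a simplification (rankings are bounded and the figures are averages relative to the norm); the names are hypothetical.

```scala
object DriftSketch {
  // Average drift in spots per week relative to the norm; positive means moving up the page.
  val noReviewsPerWeek: Double = -1.0 / 3.0  // loses one spot every three weeks
  val fiftyPlusPerWeek: Double = 1.0 / 5.0   // gains one spot every five weeks

  // Assumes drift accumulates linearly, which is only a rough approximation.
  def cumulativeDrift(perWeek: Double, weeks: Int): Double = perWeek * weeks
}
```

Under that assumption, over 15 weeks a page with no reviews would slip about 5 spots relative to the norm while a page with 50+ reviews would gain about 3, a gap of roughly 8 spots.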
Hopefully this analysis shines a light on the value of generating a healthy review volume on the review sites that you want your customers to be able to find on search engines. Those reviews are valuable not just because they help those review sites climb to the top of search engine results, but because they help those sites stay there as well. Be aware, too, that these effects can vary considerably from domain to domain, and the most responsive domains may also vary from industry to industry.
Introduction to word2phrase
When we communicate, we often know that individual words in the correct placements can change the meaning of what we’re trying to say. Add “very” in front of an adjective and you place more emphasis on the adjective. Add “york” in after the word “new” and you get a location. Throw in “times” after that and now it’s a newspaper.
It follows that when working with data, these meanings should be known. The three separate words “new”, “york”, and “times” are very different than “New York Times” as one phrase. This is where the word2phrase algorithm comes into play.
At its core, word2phrase takes in a sentence of individual words and potentially turns bigrams (two consecutive words) into a phrase by joining the two words together with a symbol (underscore in our case). Whether or not a bigram is turned into a phrase is determined by the training set and parameters set by the user. Note that every two consecutive words are considered, so in a sentence with w1 w2 w3, bigrams would be w1w2, w2w3.
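The bigram enumeration described above can be sketched in plain Scala, with no Spark required. This is an illustration only, assuming whitespace tokenization; the object and function names are hypothetical and are not part of the word2phrase package.

```scala
object BigramSketch {
  // Split a sentence on whitespace and pair every word with the word that follows it.
  def bigrams(sentence: String): Seq[(String, String)] = {
    val words = sentence.trim.split("\\s+").toSeq.filter(_.nonEmpty)
    words.zip(words.drop(1))
  }
}
```

Calling BigramSketch.bigrams("w1 w2 w3") yields the two pairs (w1, w2) and (w2, w3); a one-word sentence yields no bigrams.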
In our word2phrase implementation in Spark (and done similarly in Gensim), there are two distinct steps: a training (estimator) step and an application (transform) step.
*For clarity, note that “new york” is a bigram, while “new_york” is a phrase.
The training step is where we pass a training set to the word2phrase estimator. The estimator takes this dataset and produces a model using the algorithm. The model is called the transformer; we pass it the datasets that we want to transform, i.e. sentences with bigrams that we may want to turn into phrases.
In the training set, the dataset is an array of sentences. The algorithm will take these sentences and apply the following formula to give a score to each bigram:
score(wi, wj) = (count(wiwj) - delta) / (count(wi) * count(wj))
where wi and wj are word i and word j, wiwj denotes word j immediately following word i, and delta is a discounting coefficient that can be set to prevent phrases consisting of infrequent words from being formed.
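To see why delta matters, here is the formula as a small Scala function, evaluated on made-up counts. The function name and the counts are hypothetical; only the formula itself comes from the text above.

```scala
object ScoreSketch {
  // score(wi, wj) = (count(wiwj) - delta) / (count(wi) * count(wj))
  def score(bigramCount: Long, countI: Long, countJ: Long, delta: Double): Double =
    (bigramCount - delta) / (countI.toDouble * countJ.toDouble)

  // A frequent pair: seen together 150 times, the words 200 and 160 times overall.
  val frequent: Double = score(150, 200, 160, delta = 100.0)
  // A rare pair: two words seen 5 times each, always together.
  val rare: Double = score(5, 5, 5, delta = 100.0)
  // The same rare pair without discounting.
  val rareNoDelta: Double = score(5, 5, 5, delta = 0.0)
}
```

With delta set to 100 the rare pair scores negative and can never clear a positive threshold, while the frequent pair scores positive. With delta set to 0 the rare pair would score far higher than the frequent one, which is the behavior the discount is there to prevent.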
After the score for each bigram is calculated, those above a set threshold (this value can be changed by the user) will be transformed into phrases. The model produced by the estimator step is thus an array of bigrams: the ones that should be turned into phrases.
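Put together, the training step amounts to counting words and bigrams, scoring each bigram, and keeping those above the threshold. The sketch below does this on in-memory collections rather than Spark dataframes. It is an illustration under assumptions (whitespace tokenization, a strict "greater than" comparison, hypothetical names), not the package's actual code, and it uses a small delta because the toy corpus is tiny.

```scala
object TrainSketch {
  private def tokenize(sentence: String): Seq[String] =
    sentence.trim.split("\\s+").toSeq.filter(_.nonEmpty)

  // Returns the bigrams whose score exceeds the threshold.
  def train(sentences: Seq[String], delta: Double, threshold: Double): Set[(String, String)] = {
    val tokenized = sentences.map(tokenize)
    // How often each word occurs across all sentences.
    val wordCounts: Map[String, Int] =
      tokenized.flatten.groupBy(identity).map { case (w, ws) => w -> ws.size }
    // How often each pair of consecutive words occurs.
    val bigramCounts: Map[(String, String), Int] =
      tokenized.flatMap(ws => ws.zip(ws.drop(1)))
        .groupBy(identity).map { case (b, bs) => b -> bs.size }
    // Score every bigram and keep the ones above the threshold.
    bigramCounts.iterator.collect {
      case ((wi, wj), n)
          if (n - delta) / (wordCounts(wi).toDouble * wordCounts(wj)) > threshold => (wi, wj)
    }.toSet
  }

  // Toy corpus, made up for this example.
  val corpus: Seq[String] = Seq(
    "new york is big", "i like new york", "new york rocks", "york is old", "is it new")
}
```

On this corpus, TrainSketch.train(TrainSketch.corpus, delta = 1.0, threshold = 0.1) keeps only the bigram (new, york). Setting delta to 0 would also let one-off pairs such as (i, like) through, since both words occur only once and always together.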
The transform step is incredibly simple; pass in any array of sentences to your model and it will search for matching bigrams. All matching bigrams in the array you passed in will then be turned to phrases.
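A minimal sketch of the transform step, again in plain Scala: scan each sentence left to right and join any pair of consecutive words that the model lists. How the actual package resolves overlapping matches or letter case is not covered here, so the greedy, case-sensitive matching below is an assumption, and the names are hypothetical.

```scala
object TransformSketch {
  // Join every listed bigram with an underscore, scanning left to right.
  def transform(sentence: String, phrases: Set[(String, String)]): String = {
    @annotation.tailrec
    def loop(rest: List[String], acc: List[String]): List[String] = rest match {
      // The next two words form a known bigram: emit them as one phrase.
      case a :: b :: tail if phrases.contains((a, b)) => loop(tail, s"${a}_$b" :: acc)
      // Otherwise keep the word as it is and move on.
      case a :: tail => loop(tail, a :: acc)
      case Nil => acc.reverse
    }
    val words = sentence.trim.split("\\s+").toList.filter(_.nonEmpty)
    loop(words, Nil).mkString(" ")
  }
}
```

For example, TransformSketch.transform("i like new york a lot", Set(("new", "york"))) gives "i like new_york a lot", and a sentence with no matching bigrams comes back unchanged.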
You can repeat these steps to produce trigrams (i.e. three words joined into a phrase). For example, “I read the New York Times” may produce “I read the new_york Times” after the first run; run it again to get “I read the new_york_times”, because in the second run “new_york” is itself an individual word.
First we create our training dataset; it’s a dataframe in which the bigrams “new york” and “test drive” appear frequently. (The sentences make no sense, as they are randomly generated words. See below for a link to the full dataframe.)
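The repeated-run idea can be sketched by applying one phrase set per pass, where the second pass treats the phrases from the first as ordinary words. The phrase sets below are written by hand for illustration; in practice each would come from a separate training run. The names are hypothetical and the input is assumed to be lowercased already.

```scala
object TwoPassSketch {
  // Join each listed bigram with an underscore, scanning left to right.
  def join(words: List[String], phrases: Set[(String, String)]): List[String] = words match {
    case a :: b :: tail if phrases((a, b)) => s"${a}_$b" :: join(tail, phrases)
    case a :: tail => a :: join(tail, phrases)
    case Nil => Nil
  }

  // Apply one phrase set per pass, feeding each pass the output of the previous one.
  def applyPasses(sentence: String, passes: Seq[Set[(String, String)]]): String =
    passes.foldLeft(sentence.split(" ").toList)((ws, ps) => join(ws, ps)).mkString(" ")

  val firstPass: Set[(String, String)] = Set(("new", "york"))
  val secondPass: Set[(String, String)] = Set(("new_york", "times"))
}
```

Applying only firstPass to "i read the new york times" gives "i read the new_york times"; applying firstPass and then secondPass gives "i read the new_york_times".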
You can copy/paste this into your spark shell to test it, so long as you have the word2phrase algorithm included (available as a maven package with coordinates com.reputation.spark:word2phrase:1.0.1).
Download the package, create our test dataframe:
spark-shell --packages com.reputation.spark:word2phrase:1.0.1
val wordDataFrame = sqlContext.createDataFrame(Seq(
  (0, "new york test drive cool york how always learn media new york ."),
  (1, "online york new york learn to media cool time ."),
  (2, "media play how cool times play ."),
  (3, "code to to code york to loaded times media ."),
  (4, "play awesome to york ."),
  // rows 5 through 1098 are not shown here; see the full dataset
  (1099, "work please ideone how awesome times ."),
  (1100, "play how play awesome to new york york awesome use new york work please loaded always like ."),
  (1101, "learn like I media online new york ."),
  (1102, "media follow learn code code there to york times ."),
  (1103, "cool use play work please york cool new york how follow ."),
  (1104, "awesome how loaded media use us cool new york online code judge ideone like ."),
  (1105, "judge media times time ideone new york new york time us fun ."),
  (1106, "new york to time there media time fun there new like media time time ."),
  (1107, "awesome to new times learn cool code play how to work please to learn to ."),
  (1108, "there work please online new york how to play play judge how always work please ."),
  (1109, "fun ideone to play loaded like how ."),
  (1110, "fun york test drive awesome play times ideone new us media like follow .")
)).toDF("label", "inputWords")
We set the input and output column names and create the model (the estimator step, represented by the fit(wordDataFrame) function).
scala> import org.apache.spark.ml.feature.Word2Phrase
scala> val t = new Word2Phrase().setInputCol("inputWords").setOutputCol("out")
t: org.apache.spark.ml.feature.Word2Phrase = deltathresholdScal_f07fb0d91c1f
scala> val model = t.fit(wordDataFrame)
Here are some of the scores (Table 1) calculated by the algorithm before removing those below the threshold (note all the scores above the threshold are shown here). The default values have delta -> 100, threshold -> 0.00001, and minWords -> 0.
| test drive
| work please
| new york
| york new
| york york
| york how
| how new
| new new
| to new
| york to
only showing top 10 rows
So our model produces three bigrams that will be searched for in the transform step:
We then use this model to transform our original dataframe sentences and view the results. Unfortunately you can’t see the entire row in the spark-shell, but in the out column it’s clear that all instances of “new york” and “test drive” have been transformed into “new_york” and “test_drive”.
scala> val bi_gram_data = model.transform(wordDataFrame)
bi_gram_data: org.apache.spark.sql.DataFrame = [label: int, inputWords: string … 1 more field]
|inputWords|out|
|new york test dri…|new_york test_dri…|
|online york new y…|online york new_y…|
|media play how co…|media play how co…|
|code to to code y…|code to to code y…|
|play awesome to y…|play awesome to y…|
|like I I always .|like I I always .|
|how to there lear…|how to there lear…|
|judge time us pla…|judge time us pla…|
|judge test drive …|judge test_drive …|
|judge follow fun …|judge follow fun …|
|how I follow ideo…|how I follow ideo…|
|use use learn I t…|use use learn I t…|
|us new york alway…|us new_york alway…|
|there always how …|there always how …|
|always time media…|always time media…|
|how test drive to…|how test_drive to…|
|cool us online ti…|cool us online ti…|
|follow time aweso…|follow time aweso…|
|us york test driv…|us york test_driv…|
|use fun new york …|use fun new_york …|
only showing top 20 rows
The algorithm and test dataset (testSentences.scala) are available at this repository.