At reputation.com, we process large amounts of text data for our customers, with the goal of figuring out what people are talking about in a set of reviews and what that can tell us about customer sentiment for our clients. We leverage a number of open source tools to help us extract information from the text, and one of those tools is Doc2Vec, an algorithm developed by Quoc Le and Tomas Mikolov at Google. This article is an introduction to some of the ways we leverage Doc2Vec to gain insight into a set of online reviews for our clients.
Doc2Vec is a shallow, three-layer neural network that simultaneously learns a vector representation for each word and each sentence of a corpus, in a vector space with a fixed number of dimensions (e.g. 300).
Doc2Vec and Classification
To start with, let’s look at how we could use Doc2Vec to help us categorize sentences in reviews. We start by training 10 Doc2Vec models on 100,000 online reviews related to the dental industry. Before feeding the text to the algorithm, we clean it by lemmatizing and removing stop words to reduce the initial dimensionality.
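The cleaning step can be sketched as follows. The tiny stop list and suffix rules below are illustrative stand-ins for a real stop-word list and lemmatizer (e.g. NLTK's or spaCy's):

```python
# Illustrative text-cleaning step: lowercase, drop stop words, reduce words
# to a base form. The stop list and crude suffix stripping are placeholders
# for a proper stop-word list and lemmatizer.
import re

STOP_WORDS = {"the", "a", "an", "is", "was", "and", "i", "he", "she", "has",
              "this", "such"}

def naive_lemma(token: str) -> str:
    # Crude suffix stripping in place of true lemmatization.
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def clean(sentence: str) -> list[str]:
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return [naive_lemma(t) for t in tokens if t not in STOP_WORDS]

print(clean("I love this dentist, he has such great bedside manners"))
# → ['love', 'dentist', 'great', 'bedside', 'manner']
```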
A good model is one in which two sentences with very different meanings are represented by vectors that are far apart in the vector space. For instance, given the three sentences:
- “I love this dentist, he has such great bedside manners”,
- “Dr. Doe truly cares about his patients, and makes them feel comfortable”,
- “The parking is always full”
the distance between the vectors representing the first two sentences should be shorter than the distance between the vectors for sentences 1 and 3.
We have a database of sentences about the dental industry that have been manually tagged as dealing with certain aspects of the experience of going to the dentist. Those sentences were extracted from online reviews. They can deal with multiple aspects, but since they are relatively short, in practice they are often about only one topic. To evaluate our model, we are going to use sentences from the categories “Parking and Facilities” and “Bedside Manner”. Let’s focus on a pool of a hundred sentences that have been manually tagged as being about “Parking and Facilities” and a thousand sentences about “Bedside Manner”. We use the following process:
- We pick two sentences from “Parking and Facilities” and compute the cosine distance between their representative vectors in our Doc2Vec models
- We go through the “Bedside Manner” pool one sentence at a time and check whether our two “Parking and Facilities” sentences are closer to each other or to the “Bedside Manner” sentence. If both “Parking and Facilities” sentences are farther from the “Bedside Manner” sentence than they are from each other, we count it as a success.
- We repeat this for every pair of sentences in the “Parking and Facilities” pool.
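The steps above can be sketched as follows. The random vectors stand in for actual Doc2Vec sentence vectors; the success criterion is the one described, namely that both between-category distances must exceed the within-category distance:

```python
# Sketch of the pairwise evaluation: for every pair of "Parking and
# Facilities" vectors, check against every "Bedside Manner" vector.
from itertools import combinations
import numpy as np

def cosine_distance(u, v):
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def success_rate(parking_vecs, bedside_vecs):
    successes = trials = 0
    for u, v in combinations(parking_vecs, 2):
        within = cosine_distance(u, v)
        for w in bedside_vecs:
            trials += 1
            # Success: both parking sentences are farther from the
            # bedside-manner sentence than from each other.
            if cosine_distance(u, w) > within and cosine_distance(v, w) > within:
                successes += 1
    return successes / trials

rng = np.random.default_rng(0)  # random stand-ins for Doc2Vec vectors
rate = success_rate(rng.normal(size=(10, 300)), rng.normal(size=(20, 300)))
print(f"success rate: {rate:.2f}")
```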
Across all those comparisons, the success rate is 74%, meaning that 74% of the time the two sentences about “Parking and Facilities” were closer to each other than either was to a given sentence from the “Bedside Manner” category. As a first pass, this is good but not great: the model clearly captures some of the nuance of the language, but not enough to serve as a standalone classification algorithm. In practice, we use it as just one component of the machine-learning tagging model we have built to tag reviews without manual inspection.
Doc2Vec and Topic Modeling
Now, let’s see how the model can be used to spot recurring topics. To do that, we cluster a sample of data from the dental industry by running a KMeans algorithm on the set of vectors representing the sentences. We want each cluster to represent a semantic entity, meaning that vectors from the same cluster should be close in meaning. The more clusters you build, the smaller they are and the more similar the sentences within each cluster. But having too many small clusters does not provide relevant information: we end up with overly specific clusters, and with different clusters about the same topic. Therefore, we are interested in finding K, the smallest number of clusters that still gives us coherent clusters. To do that, let’s plot the clusters’ average inertia (the within-cluster sum of squares divided by the number of points) and find the “elbow” of the curve: the point where inertia starts decreasing more slowly as the number of clusters grows.
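A minimal sketch of this elbow search, assuming scikit-learn's KMeans and using random vectors in place of the Doc2Vec sentence embeddings:

```python
# Run KMeans over a range of K and record average inertia per point;
# the "elbow" is where this curve stops dropping quickly.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
vectors = rng.normal(size=(300, 50))  # stand-in sentence vectors

avg_inertia = {}
for k in range(2, 16):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(vectors)
    # inertia_ is the within-cluster sum of squares; normalize per point.
    avg_inertia[k] = km.inertia_ / len(vectors)

for k, v in avg_inertia.items():
    print(k, round(v, 1))
```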
We notice a break around 14 clusters. Let’s now try to figure out what each cluster represents: in each cluster, let’s read a few sentences and words close to the center. Here is a sample of the results:
- Great Service and all of the staff were friendly and professional.
- The care was excellent and the medical staff was at the cutting edge.
- I highly recommend Dr. xxx.
- I highly recommend Dr yyy.
- Worst experience ever.
- They had me waiting 5 hours in the waiting room.
- 2 hours and still waiting.
- I filled out my paper work and couldn’t have been waiting more than 5 minutes before being called back.
- Love this place good doctors and nurses
- Wonderful staff, and Dr. Bruckel was warm, personable, and reassuring.
- Fast friendly and helpful but still personable.
- Waste of time and money.
- Do not waste your time here.
- Great service in the emergency room.
- They took great care of me
- Went to the e.r. with a kidney stone.
- I went to the ER for some chest pain, I got xray, bloodwork, etc.
- Would definitely recommend this hospital for labor and delivery.
- I would definitely recommend this hospital to anyone.
- The doctor had a great bedside manner.
- He has the best bedside manner of any doctor.
- The staff was courteous and very professional.
- This is an urgent care.
- I highly recommend this urgent care.
- They saved my life.
- Always treated well and with respect.
- This place saved my life, I’m Very thankful with the doctors and nurses.
For most clusters, a dominant theme emerges, e.g. recommendation or wait time. Some of these themes span multiple clusters. Some clusters, however, seem to mix multiple unrelated topics. Looking at the inertia of each cluster helps us identify some of the better ones (e.g. 1, 3, and 8): the lower the inertia, the more coherent the cluster, and the more likely its sentences are to have similar meanings. We can also look at the distance between clusters to find good candidates for regrouping (e.g. 2 and 9).
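Both diagnostics can be sketched as follows, again with random stand-in vectors: per-cluster inertia is computed here as the mean squared distance of a cluster's members to its center, and the closest pair of centers is a candidate for regrouping.

```python
# Per-cluster inertia (coherence) and pairwise distances between cluster
# centers (regrouping candidates), on stand-in sentence vectors.
import numpy as np
from sklearn.cluster import KMeans
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
vectors = rng.normal(size=(300, 50))
km = KMeans(n_clusters=14, n_init=10, random_state=0).fit(vectors)

# Per-cluster inertia: mean squared distance of members to their center.
for c in range(km.n_clusters):
    members = vectors[km.labels_ == c]
    inertia = np.mean(np.sum((members - km.cluster_centers_[c]) ** 2, axis=1))
    print(f"cluster {c}: size={len(members)}, inertia={inertia:.1f}")

# Distance matrix between centers: small entries suggest clusters to merge.
center_dist = squareform(pdist(km.cluster_centers_))
i, j = np.unravel_index(
    np.argmin(center_dist + np.eye(14) * 1e9),  # mask the zero diagonal
    center_dist.shape,
)
print(f"closest pair of clusters: {i} and {j}")
```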
Ultimately, as we saw with classification, Doc2Vec is a useful tool for identifying key topics, but not a standalone one, at least in the version implemented here. Nonetheless, in these and other applications we have already found, and continue to find, valuable ways Doc2Vec can help us extract actionable insights for our clients.