Extracting meaning from online reviews is key to turn seemingly anecdotal reviews into actionable customer satisfaction insights that point to improvement opportunities or authentic, and potentially differentiating strengths. One way to do that is to apply machine learning to automatically read customer reviews and identify the most relevant topics that are the subject of the review. With this information, you can find themes in what customers are saying about a business across thousands of reviews and then help businesses identify areas in which they are receiving a disproportionate number of negative reviews so that they can focus operational efforts on these areas and improve customer experience as well as their online reputation.
We have been working for a while on several approaches, models, and data sets to extract topics and categories from customer reviews with a high precision. In this post I will give an overview of a few neural network models that provide satisfactory results for physician-related reviews. To start, we built a taxonomy of categories that are relevant to physician reviews looking both at clinical patient experience topics from standard patient assessment surveys designed by CMS (Center for Medicare and Medicaid Services) as well as non-clinical topics related to parking, technology/amenities, and cleanliness that are commonly referred to in physician reviews. Then we gathered training data by having a group of crowd-sourced individuals tag a set of 10,000 reviews with the following categories (this is the subject of an upcoming blog entry):
- Administrative Process
- Bedside Manner
- Getting an Appointment
- Likely/Unlikely to recommend
- Staff Courtesy
- Price/Billing issues
- Wait Time
Given this training data, we used a biologically-inspired variant of Artificial Neural Networks to build a classifier that automatically assigns categories to online physician reviews. These neural network classifiers are based on how an animal’s visual cortex processes and exploits the strong spatially local correlation present in natural images. Those models are generally used for image recognition, but are being increasingly used in other fields, especially text classification. Given the promising results documented in this space, we decided to evaluate Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN) with respect to our classification problem.
After an initial trial, we decided to focus our implementation on CNN models as they execute faster, are easier to understand, and had comparable results to RNN.
Principle of CNNs
The starting point of our CNNs was to represent a review by a matrix where each row is a vector that represents a word. This vector could be low-dimensional representations or one-hot vectors that index words into a vocabulary. Given this vector, you can then apply several convolutional filters on groups of rows followed by a 1-max pooling (the largest number from each feature map is recorded) in order to extract the meaning of the group of words considered at the beginning. Finally, a softmax layer is applied to generate assessed probabilities of the review belonging to each class.
CNN Implementation Approach
We generated a CNN model for each category independently broken into two classes: reviews that belong to this category and reviews that do not.
Thus, to produce all of the hidden parameters of these models, we fed them with reviews from the training data that were already categorized and modified the parameters incrementally in order to minimize a loss function (a function that represents the difference between the prediction and the real categories).
CNN Model Effectiveness
To assess the performance of these models, we split the 10,000 reviews into 8,000 reviews for training and 2,000 for testing. Given a model built on the training data, we predicted whether each review in the test data belonged in each category and assessed the precision and recall of our predictions with respect to each category.
For the largest categories, we found that our models delivered an overall precision of 81% and an overall recall of 75%. At first sight, those results did not appear very good. However, when we dug deeper, we found that when we considered at each element of crowd-sourced (human-based) training data to be a prediction in itself, these tags exhibited precision and recall metrics lower than 70% (versus the consensus of the group). Thus, our model outperformed human classification.
Furthermore, looking deeper into the training data, we realized that some reviews were truly ambiguous and the categories not precise or discerning enough, which resulted in a high degree a disagreement between humans evaluating the same review. After removing the most ambiguous reviews from the training data set, we observed a marked increase of the overall accuracy of the model. What to do about those ambiguous reviews or how to fine-tune the categories will be the subject of a future post.
The results from CNN models are promising, and we are pushing them further by experimenting with several modifications of the model such as: oversampling the training set in order to have balanced data for each category, splitting reviews by characters instead of by words, and initializing with a low-dimensional representation of words using Word2Vec. Stay tuned for further updates.