Sentiment Analysis: Frequency-Based Models

We give our tenants insights about their online reputation based on their online reviews and ratings. In doing so, one thing we try to do is pull apart the text of reviews to understand what the reviews are dealing with, and tell our clients what their customers are talking about and how happy those customers are with key aspects of our clients’ business.

So for example, we might identify 100 reviews for our client mentioning price, and leveraging the star rating of those reviews, we might discern that 80% of those reviews are positive and the average rating of those reviews is 4.0 stars. However, this method could be improved: a positive review mentioning price is not necessarily positive about price. For example:

The food was awesome, and the service absolutely excellent. The price was very high for a coffee-shop style restaurant.

This 5 star review is obviously negative about the price of the restaurant. We need a model that tells us the local sentiment of a sentence or a subsentence in order to be able to understand what elements drive the rating of the review. I’ll explain some of the techniques we have studied, implemented and benchmarked in order to build our Sentiment Mining Tool.

Naive Bayes Classifier

Naive Bayes is the first and the easiest method to classify sentiment in a text. It’s based on the Bayes formula for conditional probabilities:

Bayes formula: $P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$


We represent a text as a Bag of Words: a set of features of the form “the word w appears f times”, one for each word w in the sentence, where f is the frequency of w in the sentence. Under the Naive Bayes assumption that these features are independent, the formula lets us compute the probability that the sentence is positive (A) given that each word w appears f times (B). From the frequencies in a large enough dataset, we can estimate the probability that a sentence is positive (A), the probability of every feature, and therefore the probability of their intersection (B). Training the model on a set of 10,000 annotated sentences, we get a set of informative features that help predict whether a sentence is positive or negative. Here are the 10 most informative features we get:

[Figure: Naive Bayes classifier’s informative features (sentiment-bearing keywords)]

This method is the easiest to implement, and its big advantage is that it is completely transparent. When we run it on a sentence, we know that the classifier found a set of strongly positive or strongly negative words, and that this is why the sentence was classified the way it was.
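To make the mechanics concrete, here is a minimal sketch of a Bag-of-Words Naive Bayes classifier in plain Python. It is not our production code: the four training sentences are made up for illustration, the function names are hypothetical, and add-one (Laplace) smoothing is an assumption we introduce so that unseen word/class pairs do not get zero probability.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labeled_sentences):
    """Count sentences per class and word occurrences per class."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for sentence, label in labeled_sentences:
        class_counts[label] += 1
        for word in sentence.lower().split():
            word_counts[label][word] += 1
            vocabulary.add(word)
    return class_counts, word_counts, vocabulary

def log_posteriors(sentence, model):
    """Return the unnormalized log-probability of each class given the words."""
    class_counts, word_counts, vocabulary = model
    total_sentences = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        # Log prior: how frequent the class is in the training set.
        score = math.log(class_counts[label] / total_sentences)
        total_words = sum(word_counts[label].values())
        for word in sentence.lower().split():
            if word not in vocabulary:
                continue  # ignore words never seen during training
            # Log likelihood with add-one smoothing (an assumption of this sketch).
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        scores[label] = score
    return scores

def classify(sentence, model):
    """Pick the class with the highest log-posterior."""
    scores = log_posteriors(sentence, model)
    return max(scores, key=scores.get)

# Made-up toy training set, for illustration only.
training_data = [
    ("the food was awesome", "pos"),
    ("excellent service and tasty food", "pos"),
    ("the price was horrible", "neg"),
    ("horrible service and bland food", "neg"),
]
model = train_naive_bayes(training_data)
print(classify("awesome tasty food", model))    # pos
print(classify("horrible bland price", model))  # neg
```

Because the score is a sum of per-word terms, each word's contribution to the decision can be inspected directly, which is the transparency described above.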

How to improve it

However, this method has several drawbacks.

First, it fails to identify the neutral class. As a matter of fact, words can have a positive or a negative meaning (“good”, “awesome”, ”horrible”, …) but no word has a neutral connotation. Often, it’s all about the absence of such positively or negatively meaningful words or about the structure of the sentence that reflects the absence of strong emotion. The Bag of Words representation doesn’t address this problem.

It also fails to understand intensity and negations. Comparing “good” and “quite good”, for instance, the first is more likely to appear in a positive sentence than the second. We tried some methods to address this: adding a list of meaningful bigrams (which means that we would read “quite good” as a single word, for instance), or training the model on bigrams instead of single words, but neither improved our model very much. We also fail to identify negations most of the time, because this model doesn’t take word order into account.
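A small illustration of the word-order problem, using two made-up sentences: they say opposite things, yet their Bag-of-Words representations are identical, so any classifier built on that representation must give them the same label.

```python
from collections import Counter

def bag_of_words(sentence):
    """Map a sentence to its word counts, discarding word order."""
    return Counter(sentence.lower().split())

first = bag_of_words("the food was good not bad")
second = bag_of_words("the food was bad not good")

# Same words with the same frequencies: the two bags are equal.
print(first == second)  # True
```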

Most of all, the Naive Bayes model doesn’t perform very well on the local sentiment analysis problem. In a long text, a high frequency of positive words (“sensational”, “tasty”, …) makes it very likely that the author is expressing positive sentiment. But as our goal is to determine the local sentiment, we want to run the tool on short sentences and subsentences. (We already have a star rating that tells us the author’s overall sentiment.) A short sentence doesn’t contain enough words to aggregate, so we need to understand its semantic structure very precisely.

The Bag of Words representation is a very bad way to do this. For instance, in the sentence “The food could have been more tasty.”, we detect the word “tasty”, which is related to a positive feeling, but we don’t understand that “could have been more” is a kind of negation or nuance. Many short sentences are like that, and evaluating on a dataset of only short sentences reduced our accuracy from around 77% to less than 65%.

Rule-based sentiment models

To improve the Naive Bayes method and make it fit the challenge of short-sentence sentiment analysis, we added rules that take into account negations, intensity markers (“more”, “extremely”, “absolutely”, “the most”, …), nuance, and other semantic structures that very often appear near sentiment-bearing phrases and change their meaning. For instance, in “The food wasn’t very tasty”, we want to understand that “not very tasty” is less negative than “not tasty” or “not tasty at all”.
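The following is a deliberately simplified sketch of how such rules can work. The tiny lexicon, the weights, and the left-to-right scan are all assumptions made for illustration; it does not use a parser and is not the rule set we actually deployed.

```python
# Hypothetical lexicon: word -> (polarity, strength).
LEXICON = {
    "tasty": (1, 1.0),
    "good": (1, 1.0),
    "bland": (-1, 1.0),
    "horrible": (-1, 1.5),
}
# Hypothetical intensity markers and negation words.
INTENSIFIERS = {"very": 1.5, "extremely": 2.0, "absolutely": 2.0}
NEGATIONS = {"not", "wasn't", "isn't", "never"}

def score_sentence(sentence):
    """Return (positive, negative) scores for one short sentence."""
    positive, negative = 0.0, 0.0
    negated, intensity = False, 1.0
    # Normalize curly apostrophes so "wasn’t" matches "wasn't".
    tokens = sentence.lower().replace("’", "'").strip(".!?").split()
    for token in tokens:
        if token in NEGATIONS:
            negated = True
        elif token in INTENSIFIERS:
            intensity *= INTENSIFIERS[token]
        elif token in LEXICON:
            polarity, strength = LEXICON[token]
            if negated:
                # A negated word flips polarity; an intensifier inside the
                # negation softens it ("not very tasty" < "not tasty").
                polarity = -polarity
                strength /= intensity
            else:
                strength *= intensity
            if polarity > 0:
                positive += strength
            else:
                negative += strength
            # Modifiers apply only to the next sentiment word.
            negated, intensity = False, 1.0
    return positive, negative

print(score_sentence("The food wasn’t very tasty"))
print(score_sentence("The food wasn’t tasty"))
print(score_sentence("The food was very tasty"))
```

With these toy weights, “wasn’t very tasty” gets a smaller negative score than “wasn’t tasty”, which is the ordering described above.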

We leveraged the results of the Naive Bayes training to build a large vocabulary of positive and negative words. When we process a given sentence, we give every word a positive and a negative score, and calculate the overall scores through a precise analysis of the semantic structure, based on the part-of-speech tagging and dependency parsing pipelines of the open-source library spaCy. We get positive, negative and neutral scores, the neutral score being defined as the proportion of words in the sentence that are neither positive nor negative. We used a deep-learning technique to learn from our training set the relation between these scores and the sentiment. Here are the graphs we obtained for negative, neutral and positive sentences:

[Figures: Sentiment scores for negative sentences; Sentiment scores for neutral sentences; Sentiment scores for positive sentences]

The model decides quite well whether an expressive sentence is positive or negative (we get around 75% accuracy), but it struggles to learn a criterion for neutrality or absence of sentiment (on our test set, it is wrong 80% of the time). It is much better than Naive Bayes, but 75% is below the state of the art for positive/negative classification.

Finding Optimizations in Python With Program Profiling

This blog post discusses profiling methods, specifically for the Python programming language.

Within the data science team, one of the things we are working to build is a processing model for large amounts of textual and review data using natural language processing.

Because we are processing data at such a large scale, it is important that our model is properly optimized to reduce any unnecessary overhead. As such, it is important to identify which areas in our code are taking up the most time. This is where profiling comes in.

Program profiling is a form of analysis that measures things such as memory usage, time usage, the usage of particular instructions, or the frequency and duration of function calls. It is a way to understand where most resources are being spent, in order to target optimizations at those areas.

Use Case

Our use case was to find optimizations in a series of Python files used in our model. We used profiling to find which parts of the program were stalling execution. Python has many native and third party profiling tools that allow for a range of analysis of runtime and memory usage, as well as visualization. Some of the tools we looked at were cProfile, line_profiler, memory_profiler and QCacheGrind. For our use case, we were most interested in profiling methods that let us see which parts of the program were using the most time, and whether there were any blocking resources.

Profiling Using the Standard Python Library

Profiling Python can be done with the standard Python library, as well as third party modules and programs.

The standard Python library provides three different implementations of the same profiling interface: cProfile, profile and hotshot (hotshot exists only in Python 2 and was removed in Python 3). The most popular of the three is cProfile.


cProfile can be run from the terminal, as well as imported as a module in Python.

It reports profiling results per function in the columns ncalls, tottime, percall, cumtime and percall: the number of calls to the function, the total time spent in the function excluding calls to other functions, tottime divided by ncalls, the cumulative time spent in the function and everything it calls, and cumtime divided by the number of primitive calls.


import cProfile
import re
cProfile.run('re.compile("foo|bar")')


    197 function calls (192 primitive calls) in 0.002 seconds

Ordered by: standard name

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    1    0.000    0.000    0.001    0.001 <string>:1(<module>)
    1    0.000    0.000    0.001    0.001
    1    0.000    0.000    0.001    0.001
    1    0.000    0.000    0.000    0.000
    1    0.000    0.000    0.000    0.000
    4    0.000    0.000    0.000    0.000
  3/1    0.000    0.000    0.000    0.000

Although we are able to see timing information on a function basis, we aren’t able to see which lines specifically are taking up the most time.

Here is an example that runs cProfile from the terminal and reduces the output to the first 33 lines, sorted by highest cumulative time (myscript.py is a placeholder for the script being profiled):

python -m cProfile -s cumulative myscript.py > temp_file && head -n 33 temp_file && rm temp_file
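The same kind of report can be produced from inside Python with the standard library's pstats module. The sketch below profiles a toy function (compile_patterns is a made-up workload, not part of our model), sorts by cumulative time, and keeps only the top 10 entries.

```python
import cProfile
import io
import pstats
import re

def compile_patterns():
    """Toy workload: compile one hundred distinct regular expressions."""
    return [re.compile("foo|bar%d" % i) for i in range(100)]

# Collect profiling data only around the code of interest.
profiler = cProfile.Profile()
profiler.enable()
compile_patterns()
profiler.disable()

# Write the report into a string buffer instead of stdout,
# sorted by cumulative time and limited to the top 10 entries.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("cumulative").print_stats(10)
report = buffer.getvalue()
print(report)
```

Profiling a specific block this way avoids the noise of interpreter start-up and imports that appears when profiling a whole script.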

Third Party Profiling Modules

Third party modules include line_profiler and memory_profiler for line-by-line profiling, and QCacheGrind for program visualization.

line_profiler – Line-by-line Timing

line_profiler is a third party module that does line-by-line profiling of a Python program. It shows the time spent on each individual line of the functions you select. After installing line_profiler, you use it by decorating the functions that you want to profile with ‘@profile’. Then you run your Python file through kernprof, the launcher script that ships with line_profiler, using the -l flag; it writes the results to a .lprof file that you can view with python -m line_profiler.


pip install line_profiler


kernprof -l myscript.py   # myscript.py is a placeholder for your own script

python -m line_profiler myscript.py.lprof

Timer unit: 1e-06 s

Function: Proc2 at line 149

Line #      Hits         Time  Per Hit   % Time  Line Contents
  149                                           @profile
  150                                           def Proc2(IntParIO):
  151     50000        82003      1.6     13.5      IntLoc = IntParIO + 10
  152     50000        63162      1.3     10.4      while 1:
  153     50000        69065      1.4     11.4          if Char1Glob == 'A':
  154     50000        66354      1.3     10.9              IntLoc = IntLoc - 1
  155     50000        67263      1.3     11.1              IntParIO = IntLoc - IntGlob
  156     50000        65494      1.3     10.8              EnumLoc = Ident1
  157     50000        68001      1.4     11.2          if EnumLoc == Ident1:
  158     50000        63739      1.3     10.5              break
  159     50000        61575      1.2     10.1      return IntParIO

In this output, we can see for each line the number of times it was executed (Hits), its total execution time (Time), the time per execution (Per Hit) and its share of the function's total time (% Time). This helps you zero in on which lines are actually causing slowdowns in your program.
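One practical wrinkle is that ‘@profile’ is only defined when the script runs under kernprof, so a decorated script fails with a NameError when run normally. A common workaround, sketched below with a made-up function, is to fall back to a no-op decorator. The file name myscript.py in the comment is a placeholder.

```python
# Run with line_profiler:  kernprof -l -v myscript.py
# Run normally:            python myscript.py
try:
    profile  # defined by `kernprof -l` when the script runs under it
except NameError:
    def profile(func):
        """No-op fallback so the script also runs without kernprof."""
        return func

@profile
def count_long_words(words, min_length=5):
    """Count the words that have at least min_length characters."""
    total = 0
    for word in words:
        if len(word) >= min_length:
            total += 1
    return total

if __name__ == "__main__":
    print(count_long_words(["price", "food", "service", "tasty"]))  # 3
```

The -v flag asks kernprof to print the line-by-line results as soon as the script finishes, instead of requiring a separate python -m line_profiler step.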

memory_profiler – Line-by-line memory usage

memory_profiler is another third party package that is similar to line_profiler. It profiles a Python program line by line, but reports memory usage instead of time.

pip install -U memory_profiler

python -m memory_profiler myscript.py   # myscript.py is a placeholder for your own script

Line #    Mem usage  Increment   Line Contents
    3                           @profile
    4      5.97 MB    0.00 MB   def my_func():
    5     13.61 MB    7.64 MB       a = [1] * (10 ** 6)
    6    166.20 MB  152.59 MB       b = [2] * (2 * 10 ** 7)
    7     13.61 MB -152.59 MB       del b
    8     13.61 MB    0.00 MB       return a

The output is laid out like that of line_profiler. In the output above, memory usage jumps by about 152.59 MB when the large list b is allocated on line 6, and falls back by the same amount when b is deleted on line 7. This is helpful when a program performs operations that require a lot of memory.
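When installing a third party package is not an option, the standard library's tracemalloc module gives a coarser view of the same behaviour. Note that this is a different tool from memory_profiler: it reports current and peak traced allocations rather than a line-by-line table. The sketch below uses a smaller version of my_func than the one above so that it runs quickly.

```python
import tracemalloc

def my_func():
    a = [1] * (10 ** 5)
    b = [2] * (10 ** 6)  # large temporary allocation
    del b                # released before the function returns
    return a

tracemalloc.start()
result = my_func()
# current: memory still allocated now; peak: highest level reached while tracing.
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"current: {current / 1e6:.2f} MB, peak: {peak / 1e6:.2f} MB")
```

The peak is much higher than the current usage because b was alive only briefly, which mirrors the jump and drop in the Increment column above.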

Visualization with QCacheGrind

QCacheGrind is a visual profiling tool. It can be used to view the call stack of a program and see the cumulative time usage of each function in the call stack. You can visually trace through the call stack, and even view the time usage line by line in the source file.


pip install pyprof2calltree

brew install graphviz

brew install qcachegrind --with-graphviz


python -m cProfile -o myscript.cprof myscript.py

pyprof2calltree -k -i myscript.cprof

Result and Comparison

Profiling helped us zero in on an iteration loop that was taking a large percentage of time. It turns out repeated index references to a large DataFrame object were driving a large percentage of the time usage. This is because while the Pandas DataFrame is a powerful data structure to apply vector operations and aggregation across large amounts of data, it’s inherently a slower data structure when it comes to accessing indexed rows repeatedly or iterating through a number of rows compared to a simple dictionary. After identifying this via profiling, the program was optimized by converting the DataFrame to a list of dictionaries.

We found line_profiler to be the most useful for finding which areas of code to optimize. Using line_profiler, we can see the percentage of time used by each line of the functions we are interested in. Tools such as cProfile and QCacheGrind give a broad perspective on which functions are taking the most time, but do not show which lines of a function are the trouble areas. memory_profiler is good for programs that use large amounts of memory, but in our use case memory was not the limiting factor.

Semantic Convolution with Word2Vec

We work with millions of online reviews from hundreds of sources. One of the unusual characteristics of reviews compared to the vast majority of text corpora is that, almost by definition, reviews are structured in such a way that they can be categorized (in one or many dimensions depending on the review site and/or industry). However, we often find ourselves doing text classification/tagging on topics that are not already labeled by the review site. This article is an informal introduction to a set of techniques we have developed to leverage existing unlabeled corpora in conjunction with the labeled data. In particular, we present a semi-supervised learning algorithm for multi-label text classification.

In recent years, a lot of text classification projects have used supervised learning methods (Naive Bayes, SVM), primarily due to their substantial improvements over unsupervised strategies such as traditional clustering in NLP tasks. Until very recently, most NLP classification work was done with the traditional Bag of Words (BOW) approach, perhaps with a bit of context through the use of a limited range of N-grams and skip-grams. BOW is a feature extraction technique where the text is represented as the frequency of each word in the document, disregarding grammar and ordering but keeping multiplicity. In most cases, a pipeline combining the BOW feature extraction technique with a Tf-Idf transform and a simple classifier (Naive Bayes, SVM) produces decent results with respect to most classification metrics.
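As a rough illustration of the first two stages of such a pipeline, the sketch below computes term frequencies and a plain Tf-Idf weighting, tf × log(N / df), on three made-up documents. Libraries typically use smoothed variants of this formula, so the exact numbers will differ from what a production pipeline produces.

```python
import math
from collections import Counter

# Made-up corpus, for illustration only.
documents = [
    "the food was tasty",
    "the service was slow",
    "the price was high",
]

# Bag of Words: term frequencies per document.
term_counts = [Counter(doc.split()) for doc in documents]

# Document frequency: in how many documents each term appears.
document_frequency = Counter(
    term for counts in term_counts for term in counts
)

def tf_idf(counts):
    """Weight each term frequency by log(N / df), the unsmoothed variant."""
    n_docs = len(documents)
    return {
        term: freq * math.log(n_docs / document_frequency[term])
        for term, freq in counts.items()
    }

weights = tf_idf(term_counts[0])
print(weights)
```

Words that appear in every document, such as “the” and “was”, get a weight of zero, while “tasty”, which appears in only one document, gets the highest weight.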

Semi-Supervised Learning with Word2Vec

In most tutorials, Word2Vec is presented as a stand-alone neural net preprocessor for feature extraction. Word2Vec generates a vector for each word in the text corpus, in a high-dimensional space, such that words that share contextual meaning are located close to one another. To use Word2Vec for classification, each word can be replaced by its corresponding word vector, and the word vectors are usually combined through a naive algorithm such as addition with normalization or cross product to get a sentence or text vector. Then, using these document vectors, we can use a simple classifier for multi-label classification. The advantage of using Word2Vec over a simple BOW feature extraction technique is that it supports semi-supervised learning, since the vocabulary from both the labeled and the unlabeled text can be used to generate the word vectors. This allows the words to carry more contextual meaning. However, we have found that this approach does not appear to provide significant improvements over a BOW approach, especially when there isn’t a lot of labeled data for training the classifier.
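The “addition with normalization” step can be sketched as follows. The three-dimensional vectors are invented for illustration; real Word2Vec vectors come from a trained model and typically have hundreds of dimensions.

```python
import math

# Hypothetical word vectors standing in for a trained Word2Vec model.
word_vectors = {
    "cheap": [0.9, 0.1, 0.0],
    "inexpensive": [0.8, 0.2, 0.0],
    "tasty": [0.0, 0.1, 0.9],
}
DIMENSIONS = 3

def sentence_vector(sentence):
    """Sum the vectors of the known words and scale the sum to unit length."""
    total = [0.0] * DIMENSIONS
    for word in sentence.lower().split():
        vector = word_vectors.get(word)
        if vector is None:
            continue  # skip words without a vector
        total = [t + v for t, v in zip(total, vector)]
    norm = math.sqrt(sum(t * t for t in total))
    return [t / norm for t in total] if norm else total

def cosine(u, v):
    """Cosine similarity of two unit-length vectors (their dot product)."""
    return sum(a * b for a, b in zip(u, v))

cheap = sentence_vector("cheap")
print(cosine(cheap, sentence_vector("inexpensive")))  # close to 1
print(cosine(cheap, sentence_vector("tasty")))        # close to 0
```

The resulting sentence vectors can then be fed to any simple classifier in place of Bag-of-Words features.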

Semantic Convolution for Low Support Topics

A common problem in multi-label text classification is a major imbalance of labels in a text corpus. We often see cases where most (>60%) of the sampled data is about the most prevalent topic, and more than half of the topic labels exist in <0.1% of the sampled data. Almost inherently with NLP and a BOW approach, this causes a p (number of features) >> n (size of training corpus) problem. As a rule of thumb, collecting 1,000 training examples for a topic that appears in less than 0.1% of the data would require labeling more than a million examples, which is prohibitively expensive.

In this world of ‘big data’ the data itself is actually cheap, but developing a tagged training set can be expensive. In the course of our development, we devised an elegant and scalable way to develop and maintain a robust training set across tens of industries (this will be the topic of a separate blog post).

The premise of Semantic Convolution is simple: if a particular word is a good indicator of a particular label, then words with similar meanings (semantics) should also be good indicators of the label. Since we have qualitative evidence that Word2Vec vectors encode a semantic meaning, we can use it to help find words with similar meanings from non-labeled corpora. This allows us to apply a Semantic transform after getting the term frequencies in the BOW pipeline, and before applying the Tf-Idf transform. To apply the Semantic Transform, we use the Word2Vec data to generate a correlation matrix between words with similar contextual meaning in the vocabulary.

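Given that vocabulary is a dictionary mapping each term to an index, the code to generate the correlation_matrix is:

import scipy.sparse

# Start from the identity so that every word is correlated with itself.
correlation_matrix = scipy.sparse.identity(len(vocabulary), format="dok")
for word in vocabulary:
    try:
        # Keep up to 5 nearest neighbours with a similarity above 0.5.
        # (With gensim 4+, call word2vec_model.wv.most_similar instead.)
        similar_words = [x[0] for x in word2vec_model.most_similar(word, topn=5) if x[1] > 0.5]
    except KeyError:
        # The word is not in the Word2Vec vocabulary.
        similar_words = []

    for similar_word in similar_words:
        if similar_word in vocabulary:
            correlation_matrix[vocabulary[word], vocabulary[similar_word]] = 1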

Using this correlation matrix we can generate the term-document matrix with the augmented term frequencies.

term_frequency_vector += term_frequency_vector * correlation_matrix

Applying this transformation with the correlation matrix increases the word count of all words in the text with contextually similar meaning. This improves the feature collection for low support topics, which allows more precise classification of reviews about those topics, with higher confidence. It also makes small amounts of labeled data more useful to the machine-learning model, which reduces the cost of developing a robust training set. Also, as mentioned above, this leverages semi-supervised learning from the unlabeled data by building the vocabulary and the Word2Vec vectors from the entire text corpus.
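To see the effect on a single document, here is a small worked example in plain Python. The three-word vocabulary and the neighbour list are invented stand-ins for a real vocabulary and for Word2Vec's nearest neighbours, and dense lists replace the sparse matrix for readability.

```python
# Hypothetical vocabulary and similar-word lists, for illustration only.
vocabulary = {"cheap": 0, "inexpensive": 1, "tasty": 2}
similar_words = {"cheap": ["inexpensive"], "inexpensive": ["cheap"]}

# Correlation matrix: identity plus a 1 for every pair of similar words.
size = len(vocabulary)
correlation = [[1 if i == j else 0 for j in range(size)] for i in range(size)]
for word, neighbours in similar_words.items():
    for neighbour in neighbours:
        correlation[vocabulary[word]][vocabulary[neighbour]] = 1

def augment(term_frequencies):
    """Return tf + tf . C, the same update as the line of code above."""
    product = [
        sum(term_frequencies[i] * correlation[i][j] for i in range(size))
        for j in range(size)
    ]
    return [tf + p for tf, p in zip(term_frequencies, product)]

# Term frequencies for the document "cheap tasty tasty".
counts = [1, 0, 2]
print(augment(counts))  # [2, 1, 4]
```

The document never mentions “inexpensive”, yet after the transform that feature has a non-zero count, so a classifier trained on reviews containing “inexpensive” can still pick up the signal.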


Ultimately, Semantic Convolution extracts more value from the little labeled data available and improves the performance of the machine-learning algorithm on classification tasks, especially for the low support categories. Also, semi-supervised learning with Word2Vec leverages the information gained from the vast amounts of unlabeled data while increasing both the precision and the support of the machine-learning model.

Dweep Shah and Anthony Johnson