SENTIMENT ANALYSIS
A PROJECT REPORT SUBMITTED TO
THE NATIONAL INSTITUTE OF ENGINEERING
(An Autonomous College)
In partial fulfillment for the award of degree of
Bachelor of Engineering
In
Computer Science & Engineering
Submitted By
MUKUND SHENOY K P SAMEER VARMA
(4NI10CS040) (4NI10CS047)
PREETHAM P SACHINKUMAR KULKARNI
(4NI10CS054) (4NI10CS066)
Under The Guidance Of
Dr. T H SREENIVAS
Professor
Department of CS & E
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
THE NATIONAL INSTITUTE OF ENGINEERING
(An Autonomous College)
Mysore-570 008
2013-14
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
THE NATIONAL INSTITUTE OF ENGINEERING
(Autonomous Institution)
Mysore-570008
CERTIFICATE
This is to certify that the project work “SENTIMENT ANALYSIS”, is a bonafide work carried
out by Mukund Shenoy K (4NI10CS040), P Sameer Varma (4NI10CS047), Preetham P
(4NI10CS054), SachinKumar Kulkarni (4NI10CS066) in partial fulfillment for the award of
degree of Bachelor of Engineering in Computer Science and Engineering, of Visvesvaraya
Technological University, Belgaum during the year 2013-2014. It is certified that all corrections
/ suggestions indicated during the internal viva have been incorporated and the corrected copy has been
kept in the Department Library. This project report has been approved in partial fulfillment for the
award of the said degree as per the academic regulations of The National Institute of Engineering
(An Autonomous College).
Project Guide Head of Department
Dr. T H Sreenivas Dr. G.Raghavendra Rao
Assistant Professor Prof. and Head of Department
Department of CS&E Department of CS&E
Dr. G L Shekar
Principal
NIE, Mysore
Examiners Signature with Date
1……………………..
2……………………..
ACKNOWLEDGEMENT
We would like to express our sincere gratitude to all those who helped us in completing the
project successfully.
We express our profound thanks to Dr. G L Shekar, Principal, NIE, Mysore for all the support
and encouragement.
We are grateful to Dr. G. Raghavendra Rao, Prof. and Head of the Department of Computer
Science and Engineering, NIE, Mysore, for his support and encouragement in facilitating the
progress of this work.
We sincerely extend our thanks to project Guide Dr. T H SREENIVAS, Professor, Department
of Computer Science and Engineering, NIE, Mysore for his valuable guidance, constant
assistance, support, endurance and constructive suggestions for the betterment of this project,
without which the completion of this project would not have been possible.
Finally we thank our families and friends for being a constant source of inspiration and advice.
Mukund Shenoy K
P Sameer Varma
Preetham P
SachinKumar Kulkarni
ABSTRACT
Sentiment analysis or opinion mining is the computational study of people’s opinions,
sentiments, attitudes, and emotions expressed in written language. It is one of the most active
research areas in natural language processing and text mining in recent years. We have used the
Naïve Bayes classifier method for sentiment analysis and trained the system using movie
reviews. We then classified the reviews as positive or negative.
We have used a combination of methods like negation handling, word n-grams, and
feature selection in addition to the Naïve Bayes method for improving the accuracy of
classification. Naïve Bayes is a very simple probabilistic model that tends to work well on text
classifications and usually takes orders of magnitude less time to train when compared to other
models in sentiment classification. We can achieve a high degree of accuracy using the Naïve
Bayes model, comparable to the current state-of-the-art models in sentiment
classification.
Contents
1.INTRODUCTION ............................................................................................ 1
1.1 SENTIMENT ANALYSIS……………..…………………………………….1
1.2 APPLICATIONS OF SENTIMENT ANALYSIS………………………….......2
1.3 CHALLENGES OF SENTIMENT ANALYSIS …………….......………….…3
1.4 FEATURES OF SENTIMENT ANALYSIS ………………………………......3
1.4.1 TERM PRESENCE vs TERM FREQUENCY..........................................3
1.4.2 TERM POSITION.................................................................................4
1.4.3 N-GRAM FEATURES...........................................................................4
2. LITERATURE SURVEY ……….…………………………………………....5
2.1 NAÏVE BAYES CLASSIFIER…………………………………………….....5
2.2 MAXIMUM ENTROPY CLASSIFIER…………………………………….....6
2.3 SUPPORT VECTOR MACHINE……………………………………...….….6
2.4 PYTHON PROGRAMMING LANGUAGE……………………….………….7
3. SYSTEM REQUIREMENTS ....................................................................... 8
3.1 HARDWARE REQUIREMENTS …………………………………………...8
3.2 SOFTWARE REQUIREMENTS ………………………………………….....8
4. SYSTEM DESIGN ……….……...………….……………………………....9
4.1 DATA……………………………………………………………………......9
4.2 NAÏVE BAYES CLASSIFIER……………………………………………....10
4.3 LAPLACIAN SMOOTHING…………………………………………….....11
4.4 NEGATION HANDLING………………………………………………......11
4.5 N- GRAMS………………………………………………………………...12
4.6 FEATURE SELECTION…………………………………………………...13
5. SYSTEM IMPLEMENTATION……………………………………………......14
5.1 TRAINING PHASE…………………………………………………….......14
5.2 NEGATE SEQUENCE HANDLING…………………………………….......15
5.3 CLASSIFICATION PHASE…………………………………………….......16
5.4 GUI DEVELOPMENT…..............................................................................18
6. SYSTEM TESTING…...………….…………..…………………………....20
6.1 UNIT TESTING………………………………………………………........20
6.2 INTEGRATION TESTING………………………………………………....20
6.3 WHITE BOX TESTING…………………………………………………....21
6.4 BLACK BOX TESTING…............................................................................21
6.5 FUNCTIONAL TESTING……………………………………………….....21
6.6 PERFORMANCE TEST…………………………………………….……...22
6.7 TESTING PHASE……………………………………………………….....22
CONCLUSION…...………….………….……...…...………………...………....23
FUTURE ENHANCEMENTS..……………….….....……...…………………......24
BIBILIOGRAPHY…...………….......….…………...…………………………...25
APPENDIX A...………….………………..………...………………...………...26
APPENDIX B…...………………………..………...…..……………...………...28
CHAPTER 1
INTRODUCTION
1.1 SENTIMENT ANALYSIS
Among the most researched topics in natural language processing is sentiment analysis.
Sentiment analysis involves the extraction of subjective information from documents, such as online
reviews, to determine the polarity with respect to certain objects. It is useful for identifying
trends of public opinion in the social media, for the purpose of marketing and consumer
research. It has its uses in getting customer feedback about new product launches, political
campaigns and even in financial markets. It aims to determine the attitude of a speaker or a
writer with respect to some topic or simply the contextual polarity of a document.
Sentiment Analysis is a Natural Language Processing and Information Extraction task that aims
to obtain writer’s feelings expressed in positive or negative comments, questions and requests,
by analyzing a large numbers of documents. Generally speaking, sentiment analysis aims to
determine the attitude of a speaker or a writer with respect to some topic or the overall tonality
of a document. In recent years, the exponential increase in the Internet usage and exchange of
public opinion is the driving force behind Sentiment Analysis today. The Web is a huge
repository of structured and unstructured data. The analysis of this data to extract latent public
opinion and sentiment is a challenging task.
Liu defines a sentiment or opinion as a quintuple:
“<o_j, f_jk, so_ijkl, h_i, t_l>, where o_j is a target object, f_jk is a feature of the object o_j, so_ijkl is the
sentiment value of the opinion of the opinion holder h_i on feature f_jk of object o_j at time t_l,
so_ijkl is +ve, -ve, or neutral, or a more granular rating, h_i is an opinion holder, and t_l is the time
when the opinion is expressed.”
The analysis of sentiments may be document based where the sentiment in the entire document
is summarized as positive, negative or objective. It can be sentence based where individual
sentences, bearing sentiments, in the text are classified. SA can be phrase based where the
phrases in a sentence are classified according to polarity.
Sentiment Analysis identifies the phrases in a text that bear some sentiment. The author may
speak about some objective facts or subjective opinions. It is necessary to distinguish between
the two. SA finds the subject towards whom the sentiment is directed. A text may contain many
entities but it is necessary to find the entity towards which the sentiment is directed. It identifies
the polarity and degree of the sentiment. Sentiments are classified as objective (facts), positive
(denotes a state of happiness, bliss or satisfaction on part of the writer) or negative (denotes a
state of sorrow, dejection or disappointment on part of the writer). The sentiments can further
be given a score based on their degree of positivity, negativity or objectivity.
1.2 APPLICATIONS OF SENTIMENT ANALYSIS
Word of mouth (WOM) is the process of conveying information from person to person
and plays a major role in customer buying decisions. In commercial situations, WOM involves
consumers sharing attitudes, opinions, or reactions about businesses, products, or services with
other people. WOM communication functions based on social networking and trust. People
rely on families, friends, and others in their social network. Research also indicates that people
appear to trust seemingly disinterested opinions from people outside their immediate social
network, such as online reviews. This is where Sentiment Analysis comes into play. Growing
availability of opinion rich resources like online review sites, blogs, social networking sites
have made this “decision-making process” easier for us. With the explosion of Web 2.0 platforms,
consumers have a soapbox of unprecedented reach and power through which they can share
opinions. Major companies have realized that these consumer voices influence the opinions of other
consumers.
Sentiment Analysis thus finds use in the consumer market for product reviews, in marketing for
gauging consumer attitudes and trends, in social media for finding the general opinion about recent
hot topics, and in the movie industry for finding whether a recently released movie is a hit.
Pang-Lee broadly classifies the applications into the following categories:
a. Applications to Review-Related Websites like Movie Reviews, Product Reviews etc.
b. Applications as a Sub-Component Technology like detecting antagonistic, heated language in
emails, spam detection, context-sensitive information detection, etc.
c. Applications in Business and Government Intelligence Knowing Consumer attitudes and
trends.
1.3 CHALLENGES FOR SENTIMENT ANALYSIS
Sentiment Analysis approaches aim to extract positive and negative sentiment bearing
words from a text and classify the text as positive, negative or else objective if it cannot find
any sentiment bearing words. In this respect, it can be thought of as a text categorization task.
In text classification there are many classes corresponding to different topics whereas in
Sentiment Analysis we have only 3 broad classes. Thus it may seem that Sentiment Analysis is easier
than text classification, which is not quite the case.
1.4 FEATURES FOR SENTIMENT ANALYSIS
Feature engineering is an extremely basic and essential task for Sentiment Analysis.
Converting a piece of text to a feature vector is the basic step in any data driven approach to
SA. In the following section, some commonly used features used in Sentiment Analysis and
their critiques are mentioned.
1.4.1 Term Presence vs. Term Frequency
Term frequency has always been considered essential in traditional Information
Retrieval and Text Classification tasks. But Pang and Lee found that term presence is more
important for Sentiment Analysis than term frequency; that is, binary-valued feature vectors, in
which the entries merely indicate whether a term occurs (value 1) or not (value 0), work better.
This is not counter-intuitive, since the presence of even a single strong sentiment-bearing word
can reverse the polarity of an entire sentence. It has also been observed that rarely occurring
words carry more information than frequently occurring ones (such rare words are known as
hapax legomena).
1.4.2 Term Position
Words appearing in certain positions in the text carry more sentiment or weightage than
words appearing elsewhere. This is similar to IR where words appearing in topic Titles,
Subtitles, or Abstracts are given more weightage than those appearing in the body. For example,
even if a review contains positive words throughout, a negative sentiment expressed in the final
sentence can play the deciding role in determining the overall sentiment. Thus, words appearing
in such deciding positions are generally given more weightage than those appearing elsewhere.
1.4.3 N-gram Features
N-grams are capable of capturing context to some extent and are widely used in Natural
Language Processing tasks. Whether higher order n-grams are useful is a matter of debate.
Pang reported that unigrams outperform bigrams when classifying movie reviews by sentiment
polarity, but Dave found that in some settings, bigrams and trigrams perform better.
CHAPTER 2
LITERATURE SURVEY
Sentiment analysis is a complicated problem but experiments have been done
using Naive Bayes, maximum entropy classifiers and support vector machines.
2.1 NAÏVE BAYES CLASSIFIER
A Naive Bayes classifier is a simple probabilistic model based on the Bayes rule
along with a strong independence assumption.
The Naïve Bayes model involves a simplifying conditional independence assumption.
That is given a class (positive or negative), the words are conditionally independent of
each other. This assumption does not affect the accuracy in text classification by much
but makes really fast classification algorithms applicable for the problem.
In our case, the maximum likelihood probability of a word belonging to a particular
class is given by the expression:
P(w | class) = count(w, class) / count(class)
where count(w, class) is the frequency of the word w in documents of that class and count(class) is the total count of words in that class.
The frequency counts of the words are stored in hash tables during the training phase.
According to the Bayes Rule, the probability of a particular document d belonging to a
class ci is given by:
P(ci | d) ∝ P(ci) * P(x1 | ci) * P(x2 | ci) * ... * P(xn | ci)
In simple terms, a naive Bayes classifier assumes that the value of a particular feature is
unrelated to the presence or absence of any other feature, given the class variable. For example,
a fruit may be considered to be an apple if it is red, round, and about 3" in diameter. A naive
Bayes classifier considers each of these features to contribute independently to the probability
that this fruit is an apple, regardless of the presence or absence of the other features.
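As an illustration of this independence assumption (with made-up probabilities, not taken from any real data), the class score is just the prior multiplied by the per-feature likelihoods:

# Illustrative only: hypothetical per-feature likelihoods for the "apple" example.
p_given_apple = {"red": 0.7, "round": 0.9, "diameter_3in": 0.6}
p_given_other = {"red": 0.2, "round": 0.4, "diameter_3in": 0.3}
p_apple, p_other = 0.5, 0.5  # assumed equal priors

score_apple, score_other = p_apple, p_other
for f in ["red", "round", "diameter_3in"]:
    score_apple *= p_given_apple[f]
    score_other *= p_given_other[f]
# score_apple (0.189) > score_other (0.012), so the fruit is classified as an apple.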
2.2 MAXIMUM ENTROPY CLASSIFIER
The Max Entropy classifier is a probabilistic classifier which belongs to the class of
exponential models. Unlike the Naive Bayes classifier that we discussed in the previous article,
the Max Entropy does not assume that the features are conditionally independent of each other.
The MaxEnt is based on the Principle of Maximum Entropy and from all the models that fit our
training data, selects the one which has the largest entropy. The Max Entropy classifier can be
used to solve a large variety of text classification problems such as language detection, topic
classification, sentiment analysis and more.
Due to the minimum assumptions that the Maximum Entropy classifier makes, we regularly use
it when we don’t know anything about the prior distributions and when it is unsafe to make any
such assumptions. Moreover Maximum Entropy classifier is used when we can’t assume the
conditional independence of the features. This is particularly true in Text Classification
problems where our features are usually words which obviously are not independent. The Max
Entropy requires more time to train comparing to Naive Bayes, primarily due to the
optimization problem that needs to be solved in order to estimate the parameters of the model.
Nevertheless, after computing these parameters, the method provides robust results and it is
competitive in terms of CPU and memory consumption.
2.3 SUPPORT VECTOR MACHINE
In machine learning, support vector machines (SVMs, also called support vector networks)
are supervised learning models with associated learning algorithms that analyze data and
recognize patterns, used for classification and regression analysis. Given a set of training
examples, each marked as belonging to one of two categories, an SVM training algorithm
builds a model that assigns new examples into one category or the other, making it a non-
probabilistic binary linear classifier. An SVM model is a representation of the examples as
points in space, mapped so that the examples of the separate categories are divided by a clear
gap that is as wide as possible. New examples are then mapped into that same space and
predicted to belong to a category based on which side of the gap they fall on.
More formally, a support vector machine constructs a hyperplane or set of hyperplanes in
a high- or infinite-dimensional space, which can be used for classification, regression, or other
tasks. Intuitively, a good separation is achieved by the hyperplane that has the largest distance
to the nearest training data point of any class (so-called functional margin), since in general the
larger the margin the lower the generalization error of the classifier.
2.4 PYTHON PROGRAMMING LANGUAGE
Python is a widely used general-purpose, high-level programming language. Its design
philosophy emphasizes code readability, and its syntax allows programmers to express
concepts in fewer lines of code than would be possible in languages such as C. The language
provides constructs intended to enable clear programs on both a small and a large scale. Python
supports multiple programming paradigms, including object-oriented, imperative, functional,
and procedural styles. It features a dynamic type system and automatic memory
management and has a large and comprehensive standard library. Like other dynamic
languages, Python is often used as a scripting language, but is also used in a wide range of non-
scripting contexts. Using third-party tools, such as Py2exe, or Pyinstaller, Python code can be
packaged into standalone executable programs. Python interpreters are available for many
operating systems.
We chose Python because it has a shallow learning curve, its syntax and semantics are
transparent, and it has good string-handling functionality. As an interpreted language, Python
facilitates interactive exploration. As an object-oriented language, Python permits data and
methods to be encapsulated and re-used easily. As a dynamic language, Python permits
attributes to be added to objects on the fly, and permits variables to be typed dynamically,
facilitating rapid development. Python comes with an extensive standard library, including
components for graphical programming, numerical processing, and web connectivity.
Python is heavily used in industry, scientific research, and education around the world. Python
is often praised for the way it facilitates productivity, quality, and maintainability of software.
CHAPTER 3
SYSTEM REQUIREMENTS
3.1 HARDWARE REQUIREMENTS
2GHz+ CPU
1GB RAM
1GB Hard disk Space
3.2 SOFTWARE REQUIREMENTS
Python IDLE 3.3
Web Browser
CHAPTER 4
SYSTEM DESIGN
4.1 DATA
We used a publicly available dataset of movie reviews from the Internet Movie Database
(IMDb), which was compiled by Andrew Maas. It is a set of 25,000 highly polar movie reviews
for training, and 25,000 for testing. Both the training and test sets have an equal number of
positive and negative reviews. We chose movie reviews as our data set because it covers a wide
range of human emotions and captures most of the adjectives relevant to sentiment
classification. Also, most existing research on sentiment classification uses movie review data
for benchmarking.
We used the 25,000 documents in the training set to build our supervised learning model. The
other 25,000 were used for evaluating the accuracy of our classifier.
4.2 NAÏVE BAYES CLASSIFIER
A Naive Bayes classifier is a simple probabilistic model based on the Bayes rule along
with a strong independence assumption.
The Naïve Bayes model involves a simplifying conditional independence assumption. That is
given a class (positive or negative), the words are conditionally independent of each other. This
assumption does not affect the accuracy in text classification by much but makes really fast
classification algorithms applicable for the problem.
In our case, the maximum likelihood probability of a word belonging to a particular class is
given by the expression:
P(w | class) = count(w, class) / count(class)
where count(w, class) is the frequency of the word w in documents of that class and count(class) is the total count of words in that class.
The frequency counts of the words are stored in hash tables during the training phase.
According to the Bayes Rule, the probability of a particular document d belonging to a class ci is
given by:
P(ci | d) ∝ P(ci) * P(x1 | ci) * P(x2 | ci) * ... * P(xn | ci)
In simple terms, a naive Bayes classifier assumes that the value of a particular feature is
unrelated to the presence or absence of any other feature, given the class variable. For example,
a fruit may be considered to be an apple if it is red, round, and about 3" in diameter. A naive
Bayes classifier considers each of these features to contribute independently to the probability
that this fruit is an apple, regardless of the presence or absence of the other features.
Under the simplifying conditional independence assumption, given a class (positive or
negative), the words are conditionally independent of each other. It is because of this simplifying
assumption that the model is termed “naïve”.
Here the xi's are the individual words of the document. The classifier outputs the class with the
maximum posterior probability. We also remove duplicate words from the document, since they
do not add any additional information; this variant of the naïve Bayes algorithm is called Bernoulli
Naïve Bayes. Including just the presence of a word instead of its count has been found to
improve performance marginally when there is a large number of training examples.
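A minimal sketch of this Bernoulli-style counting (pos_counts, neg_counts and train_on are hypothetical names; the project's actual training module is listed in Appendix B) might look like the following:

from collections import defaultdict

pos_counts = defaultdict(int)  # word -> number of positive documents containing it
neg_counts = defaultdict(int)  # word -> number of negative documents containing it

def train_on(document_text, label):
    # Bernoulli Naive Bayes: each word is counted at most once per document.
    for word in set(document_text.lower().split()):
        if label == "pos":
            pos_counts[word] += 1
        else:
            neg_counts[word] += 1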
4.3 LAPLACIAN SMOOTHING
If the classifier encounters a word that has not been seen in the training set, the
probability of both the classes would become zero and there won’t be anything to compare
between. This problem can be solved by Laplacian smoothing:
P(w | class) = (count(w, class) + k) / (count(class) + k * |V|), where |V| is the number of distinct words in the vocabulary.
Usually, k is chosen as 1. This way, there is equal probability for the new word to be in either
class. Since Bernoulli Naïve Bayes is used, the total number of words in a class is computed
differently. For the purpose of this calculation, each document is reduced to a set of unique
words with no duplicates.
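A sketch of the smoothed classification rule, assuming k = 1 and the count dictionaries from the sketch above (smoothed_log_prob and classify are illustrative helpers; the exact normalisation used in our code is shown in Appendix B):

from math import log

def smoothed_log_prob(word, counts, total, vocab_size, k=1):
    # Laplacian smoothing: unseen words get a small, non-zero probability.
    return log((counts[word] + k) / float(total + k * vocab_size))

def classify(words, pos_counts, neg_counts, pos_total, neg_total, vocab_size):
    p = sum(smoothed_log_prob(w, pos_counts, pos_total, vocab_size) for w in set(words))
    n = sum(smoothed_log_prob(w, neg_counts, neg_total, vocab_size) for w in set(words))
    return "positive" if p > n else "negative"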
4.4 NEGATION HANDLING
Negation handling was one of the factors that contributed significantly to the accuracy
of our classifier. A major problem faced during the task of sentiment classification is that of
handling negations. Since we are using each word as feature, the word “good” in the phrase
“not good” will be contributing to positive sentiment rather that negative sentiment as the
presence of “not” before it is not taken into account.
To solve this problem we devised a simple algorithm for handling negations using state
variables and bootstrapping. We built on the idea of using an alternate representation of
negated forms as shown by Das & Chen. Our algorithm uses a state variable to store the
negation state. It transforms a word followed by a not or n’t into “not_” + word. Whenever the
negation state variable is set, the words read are treated as “not_” + word. The state variable is
reset when a punctuation mark is encountered or when there is double negation.
The pseudo code of the algorithm is described below:
PSEUDO CODE:
    negated := False
    for each word in document:
        if negated = True:
            transform word to “not_” + word
        if word is “not” or “n’t”:
            negated := not negated
        if a punctuation mark is encountered:
            negated := False
The number of negated forms in the training set might not be adequate for correct classification:
it is possible that many words with strong sentiment occur only in their normal forms in the
training set, even though their negated forms would also be of strong polarity.
We addressed this problem by adding negated forms to the opposite class along with normal
forms of all the features during the training phase. That is to say if we encounter the word
“good” in a positive document during the training phase, we increment the count of “good” in
the positive class and also increment the count of “not_good” for the negative class. This is to
ensure that the number of “not_” forms is sufficient for classification. This modification
resulted in a significant improvement in classification accuracy (about 1%) due to
bootstrapping of negated forms during training. This form of negation handling can be applied
to a variety of text related applications.
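A sketch of this bootstrapping step during training (add_word is an illustrative helper; it mirrors the behaviour of the training module in Appendix B):

def add_word(word, label, pos_counts, neg_counts):
    # Credit the word to its own class and its negated form to the opposite class.
    if label == "pos":
        pos_counts[word] += 1
        neg_counts["not_" + word] += 1
    else:
        neg_counts[word] += 1
        pos_counts["not_" + word] += 1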
4.5 N-GRAMS
Generally, information about sentiment is conveyed by adjectives or more specifically
by certain combinations of adjectives with other parts of speech. Adding features like
consecutive pairs of words (bigrams), or even triplets of words (trigrams), can capture this
information. Words like "very" or "definitely" don't provide much sentiment information on
their own, but phrases like "very bad" or "definitely recommended" increase the probability of
a document being negatively or positively biased. By including bigrams and trigrams, we were
able to capture this information about adjectives and adverbs. Using bigrams and trigrams
requires a substantial amount of data in the training set, but this is not a problem as our training set
had 25,000 reviews. But the data may not be enough to add 4-grams, as this may over-fit the
training set. The counts of the n-grams were stored in a hash table along with the counts of
unigrams.
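A sketch of how bigram and trigram features can be generated alongside unigrams (extract_ngram_features is an illustrative helper; it assumes the words have already been tokenised and negation-tagged):

def extract_ngram_features(words):
    # Emit unigrams, bigrams and trigrams as space-joined strings.
    features = list(words)
    features += [" ".join(words[i:i + 2]) for i in range(len(words) - 1)]
    features += [" ".join(words[i:i + 3]) for i in range(len(words) - 2)]
    return features

# extract_ngram_features(["very", "bad", "acting"])
# -> ['very', 'bad', 'acting', 'very bad', 'bad acting', 'very bad acting']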
4.6 FEATURE SELECTION
Feature selection is the process of removing redundant features, while retaining those
features that have high disambiguation capabilities. The use of higher dimensional features like
bigrams and trigrams presents a problem, that of the number of features increasing from
300,000 to about 11,000,000. Most of these features are redundant and noisy in nature.
Including them would affect both efficiency and accuracy. A basic filtering step of removing
the features or terms which occur only once is performed, reducing the number of features to
about 1,500,000. The features are further filtered on the basis of mutual
information.
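A common way to compute the mutual information between a binary term feature and the class follows the formulation in [5]; the sketch below (with hypothetical count names n11, n10, n01, n00) is one such formulation and is not necessarily the exact computation used in our code:

from math import log

def mutual_information(n11, n10, n01, n00):
    # n11: positive docs containing the term, n10: negative docs containing it,
    # n01: positive docs without the term,    n00: negative docs without it.
    n = float(n11 + n10 + n01 + n00)
    def part(n_tc, n_t, n_c):
        return (n_tc / n) * log((n * n_tc) / (n_t * n_c), 2) if n_tc else 0.0
    return (part(n11, n11 + n10, n11 + n01) +
            part(n10, n11 + n10, n10 + n00) +
            part(n01, n01 + n00, n11 + n01) +
            part(n00, n01 + n00, n10 + n00))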
CHAPTER 5
SYSTEM IMPLEMENTATION
System implementation deals with the implementation details of each module used in
the project and the relationships existing between them. The implementation of the project is
done in the Python language using the Python Integrated Development Environment (IDLE) 3.3.
The download link for Python is provided in [1]; it is open-source software that can be
downloaded from the official website of Python. The implementation involves three modules:
training, classification, and testing.
5.1 TRAINING PHASE
Training phase is the first phase where the system is trained with a suitably large
collection of positive and negative movie reviews obtained from the internet movie database
IMDb. The training data set consists of 12,500 positive reviews and 12,500 negative reviews,
which can be downloaded from [2]. To get higher accuracy and efficiency we need a very large
data set for training, because our analysis includes negation handling as well as bigrams and
trigrams. The training phase starts off with creating two
empty positive and negative dictionaries called pos and neg respectively.
We take each review from the positive training data set and give it as an argument to the negate
sequence function. This function returns a list of words that are inserted into both the pos and neg
dictionaries (the word itself into pos and its negated form into neg), after removing duplicate words
from the list. A similar procedure is applied to each review from the negative data set, thereby
populating the neg and pos dictionaries. We later call the prune_features function, which removes
from both dictionaries the words that appear only once, thereby reducing the number of entries and
facilitating faster access. We use the concept of pickling for storing the dictionaries in files, called
pickled files, which can be used later during the classification phase. Pickling is done by the
function pickle.dump(), which retains the data structure of the dictionary even after it is stored in a
file; this is one of the major advantages of pickling.
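A minimal sketch of this pickling step (the file name follows the one used in Appendix B; the counts shown are illustrative only):

import pickle

pos = {"good": 1203, "not_bad": 87}    # illustrative counts only

with open('mydatapositive.pickle', 'wb') as myposdata:
    pickle.dump(pos, myposdata)        # store the dictionary on disk

with open('mydatapositive.pickle', 'rb') as myposdata:
    restored = pickle.load(myposdata)  # restored == pos, still a dictionary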
5.2 NEGATE SEQUENCE HANDLING
Negation handling was one of the factors that contributed significantly to the accuracy
of our classifier. A major problem faced during the task of sentiment classification is that of
handling negations. Since we are using each word as a feature, the word “good” in the phrase
“not good” will contribute to positive sentiment rather than negative sentiment, as the
presence of “not” before it is not taken into account.
To solve this problem we devised a simple algorithm for handling negations using state
variables and bootstrapping. We built on the idea of using an alternate representation of
negated forms as shown by Das & Chen in [3]. Our algorithm uses a state variable to store the
negation state. It transforms a word followed by a not or n’t into “not_” + word. Whenever the
negation state variable is set, the words read are treated as “not_” + word. The state variable is
reset when a punctuation mark is encountered or when there is double negation. The pseudo
code of the algorithm is described below:
PSEUDO CODE:
    negated := False
    for each word in document:
        if negated = True:
            transform word to “not_” + word
        if word is “not” or “n’t”:
            negated := not negated
        if a punctuation mark is encountered:
            negated := False
The number of negated forms in the training set might not be adequate for correct classification:
it is possible that many words with strong sentiment occur only in their normal forms in the
training set, even though their negated forms would also be of strong polarity.
We addressed this problem by adding negated forms to the opposite class along with normal
forms of all the features during the training phase. That is to say if we encounter the word
“good” in a positive document during the training phase, we increment the count of “good” in
the positive class and also increment the count of “not_good” for the negative class. This is to
ensure that the number of “not_” forms is sufficient for classification. This modification
resulted in a significant improvement in classification accuracy (about 1%) due to
bootstrapping of negated forms during training. This form of negation handling can be applied
to a variety of text related applications.
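As an illustration, a simplified unigram-only version of this transformation might look as follows (negate_unigrams is an illustrative helper; the full function used in the project, listed in Appendix B, also emits bigrams and trigrams):

def negate_unigrams(text):
    # Simplified sketch: tag words that follow a negation with the "not_" prefix.
    delims = "?.,!:;"
    negation = False
    result = []
    for word in text.lower().split():
        stripped = word.strip(delims)
        result.append("not_" + stripped if negation else stripped)
        if stripped in ("not", "no") or stripped.endswith("n't"):
            negation = not negation
        if any(c in word for c in delims):
            negation = False
    return result

# negate_unigrams("The plot was not good.")
# -> ['the', 'plot', 'was', 'not', 'not_good']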
5.3 CLASSIFICATION PHASE
The classification phase involves user interaction: the user inputs a query review and gets an
output displaying whether the result is positive or negative. Classification starts off with loading
the pickled files and getting an input from the user; the input review is passed as an argument to
the negate sequence function, thereby getting a list of words. We then apply Bayes' theorem and
Laplacian smoothing.
A Naive Bayes classifier is a simple probabilistic model based on the Bayes rule along with a
strong independence assumption.
The Naïve Bayes model involves a simplifying conditional independence assumption. That is
given a class (positive or negative), the words are conditionally independent of each other. This
assumption does not affect the accuracy in text classification by much but makes really fast
classification algorithms applicable for the problem.
In our case, the maximum likelihood probability of a word belonging to a particular class is
given by the expression:
P(w | class) = count(w, class) / count(class)
The frequency counts of the words are stored in hash tables during the training phase.
According to the Bayes Rule, the probability of a particular document d belonging to a class ci is
given by:
P(ci | d) ∝ P(ci) * P(x1 | ci) * P(x2 | ci) * ... * P(xn | ci)
In simple terms, a naive Bayes classifier assumes that the value of a particular feature is
unrelated to the presence or absence of any other feature, given the class variable. For example,
a fruit may be considered to be an apple if it is red, round, and about 3" in diameter. A naive
Bayes classifier considers each of these features to contribute independently to the probability
that this fruit is an apple, regardless of the presence or absence of the other features.
If the classifier encounters a word that has not been seen in the training set, the probability of
both the classes would become zero and there won’t be anything to compare between. This
problem can be solved by Laplacian smoothing:
P(w | class) = (count(w, class) + k) / (count(class) + k * |V|), where |V| is the number of distinct words in the vocabulary.
Usually, k is chosen as 1. This way, there is equal probability for the new word to be in either
class. Since Bernoulli Naïve Bayes is used, the total number of words in a class is computed
differently. For the purpose of this calculation, each document is reduced to a set of unique
words with no duplicates.
After applying Bayes' theorem and Laplacian smoothing we get positive and negative
probabilities. If the positive probability is greater than the negative probability, the positive
result is displayed; otherwise, the negative result is displayed. This ends the classification
phase.
5.4 GUI DEVELOPMENT
Tkinter is Python's de-facto standard GUI (Graphical User Interface) package. It is a
thin object-oriented layer on top of Tcl/Tk. Tkinter is not the only GUI programming toolkit for
Python, but it is the most commonly used one.
Tkinter is a Python binding to the Tk GUI toolkit. It is the standard Python interface to the Tk
GUI toolkit [4] and is included with the standard Windows and Mac OS X installs of Python.
The name Tkinter comes from “Tk interface”. Tkinter was written by Fredrik Lundh.
As with most other modern Tk bindings, Tkinter is implemented as a Python wrapper around a
complete Tcl interpreter embedded in the Python interpreter. Tkinter calls are translated into
Tcl commands which are fed to this embedded interpreter, thus making it possible to mix
Python and Tcl in a single application.
Python 2.7 and Python 3.1 incorporate the "themed Tk" ("ttk") functionality of Tk 8.5 [4]. This
allows Tk widgets to be easily themed to look like the native desktop environment in which the
application is running, thereby addressing a long-standing criticism of Tk (and hence of
Tkinter).
PSEUDO CODE:
    e := Entry(master)
    call e.pack()
    call e.focus_set()
    define fun():
        classify(e.get())
    b := Button(master, command=fun)
    call b.pack()
    call mainloop()
We have created the text box using the Entry function and then created a button. When the
button is clicked, the function fun() is called via the command argument; inside fun() we call the
classify method, passing the content typed in the text box as an argument.
In the classify method, appropriate pop-ups are created displaying the messages "positive",
"negative" or "no features to compare".
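A minimal, self-contained sketch of this wiring is shown below; it uses the same Python 2 module names (Tkinter, tkMessageBox) as the code in Appendix B, but classify here is only a stand-in for the real classifier:

from Tkinter import Tk, Entry, Button, mainloop
import tkMessageBox

master = Tk()

def classify(text):
    # Stand-in for the real classifier from the classification module.
    tkMessageBox.showinfo("Result", "POSITIVE" if "good" in text.lower() else "NEGATIVE")

e = Entry(master, width=80)    # text box for the review
e.pack()
e.focus_set()

def fun():
    classify(e.get())          # pass the text box contents to the classifier

b = Button(master, text="GET RESULT", command=fun)
b.pack()
mainloop()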
CHAPTER 6
SYSTEM TESTING
System testing involves unit testing, integration testing, white-box testing, and black-box
testing. Strategies for integrating software components into a functional product include the
bottom-up strategy, the top-down strategy, and the sandwich strategy. Careful planning and
scheduling are required to ensure that modules are available for integration into the evolving
software product when needed. A series of tests is performed on the proposed system before it
is ready for user acceptance testing.
6.1 UNIT TESTING
Instead of testing the system as a whole, unit testing focuses on the modules that
make up the system. Each module is taken up individually and tested for correctness of
coding and logic.
The advantages of unit testing are:
1. The size of each module is quite small, so errors can easily be located.
2. Confusing interactions of multiple errors in widely different parts of the software are
eliminated.
3. Module-level testing can be exhaustive.
6.2 INTEGRATION TESTING
It tests for the errors resulting from the integration of modules. One concern of
integration testing is whether parameters match on both sides in type, permissible range, and
meaning. Integration testing is a functional black-box test method: it treats each
module as an impenetrable mechanism for information. The only concern during integration
testing is that the modules work together properly.
6.3 WHITE BOX TESTING (Code Testing)
The code-testing strategy examines the logic of the program. To follow this testing
method, the analyst develops test cases that result in executing every instruction in the
program or module so that every path through the program is tested. A path is a specific
combination of conditions that is handled by the program. Code testing does not check the
range of data that the program will accept.
It exercises all logical decisions on their true and false sides, and executes all loops at their
boundaries and within their operational bounds.
6.4 BLACK BOX TESTING (Specification Testing)
To perform specification testing, the analyst examines the specification, starting from
what the program should do and how it should perform under various conditions. Then test
cases are developed for each condition or combinations of conditions and submitted for
processing. By examining the results, the analyst can determine whether the program
performs according to its specified requirements. This testing strategy sounds exhaustive. If
every statement in the program is checked for its validity, there doesn’t seem to be much
scope for errors.
6.5 FUNCTIONAL TESTING
In this type of testing, the software is tested for the functional requirements. The tests
are written in order to check if the application behaves as expected. Although functional
testing is often done toward the end of the development cycle, it can, and should, be
started much earlier. Individual components and processes can be tested early on, even
before it's possible to do functional testing on the entire system. Functional testing covers
how well the system executes the functions it is supposed to execute—including user
commands, data manipulation, searches and business processes, user screens, and
integrations. Functional testing covers the obvious surface type of functions, as well as the
back-end operations (such as security and how upgrades affect the system).
6.6 PERFORMANCE TEST
In software engineering, performance testing is testing that is performed, from one
perspective, to determine how fast some aspect of a system performs under a particular
workload. It can also serve to validate and verify other quality attributes of the system, such
as scalability, reliability and resource usage. Performance testing is a subset of performance
engineering, an emerging computer science practice which strives to build performance into the
design and architecture of a system, prior to the onset of actual coding effort.
Performance testing can compare two systems to find which performs better. Or it can
measure what parts of the system or workload cause the system to perform badly. In the
diagnostic case, software engineers use tools such as profilers to measure what parts of a
device or software contribute most to the poor performance or to establish throughput levels
(and thresholds) for maintaining acceptable response times. For the cost-effectiveness of a
new system, performance test efforts should begin at the inception of the
development project and extend through to deployment. The later a performance defect is
detected, the higher the cost of remediation. This is true in the case of functional testing, but
even more so with performance testing, due to the end-to-end nature of its scope.
In performance testing, it is often crucial (and often difficult to arrange) for the test
conditions to be similar to the expected actual use. This is, however, not entirely possible in
actual practice. The reason is that production systems have a random nature of the workload
and while the test workloads do their best to mimic what may happen in the production
environment, it is impossible to exactly replicate this workload variability - except in the
simplest system.
6.7 TESTING PHASE
Testing involves the test data set, which contains 12,500 positive reviews and 12,500
negative reviews. At first the 12,500 positive reviews are given to the system and the number of
reviews concluded to be positive by the analysis is counted, thereby giving the efficiency of the
system for positive reviews. A similar procedure is carried out for the 12,500 negative reviews
present in the test data set.
CONCLUSION
Our results show that a simple Naive Bayes classifier can be enhanced to match the
classification accuracy of more complicated models for sentiment analysis by choosing the
right type of features and removing noise by appropriate feature selection. Naive Bayes
classifiers due to their conditional independence assumptions are extremely fast to train and can
scale over large datasets. They are also robust to noise and less prone to overfitting. Ease of
implementation is also a major advantage of Naive Bayes classifier. They were thought to be
less accurate than their more sophisticated counterparts like support vector machines and
logistic regression but we have shown through this project that a significantly high accuracy
can be achieved. The ideas used in this paper can also be applied to the more general domain of
text classification.
RESULTS
We implemented the classifier in Python using hash tables to store the counts of words in their
respective classes. Training involved preprocessing data and applying negation handling before
counting the words. Since we were using Bernoulli Naive Bayes, each word is counted only
once per document. On a laptop running an Intel i3 processor at 2.1 GHz, training took around
90 seconds and used about 1 GB of memory. The memory usage is largely due to the bigrams
and trigrams stored prior to feature selection.
Positive efficiency: 11207 out of 12500 (89.65%)
Negative efficiency: 10856 out of 12500 (86.84%)
FUTURE ENHANCEMENTS
This particular field of sentiment analysis can be utilized to obtain the opinion of the public
across the Internet through various social networking sites such as Twitter, Facebook, etc.
A system can be designed such that, when a movie name is given as input with a hashtag,
the system collects tweets about the movie, performs sentiment analysis on all the tweets
collected, and gives a result based on the number of positive and
negative tweets. This method can also be used to gather opinions on general topics from the
public across the Internet.
The training should be performed incrementally and periodically with new and larger data sets,
thereby increasing the efficiency of the system.
Our project need not be restricted to only movie review classifications. By changing the
training data set to some other domain the project can be scaled to classify other pieces of text.
There is also scope for applying this project to reviews which contain an almost equal number of
positive and negative words, which in general are termed "neutral". This can be
achieved by using an extra dictionary and some extra functions.
BIBLIOGRAPHY
[1]. www.python.org/download/
[2]. Large Movie Review Dataset. (n.d.). Retrieved from
http://ai.stanford.edu/~amaas/data/sentiment/
[3]. Socher, Richard, et al. "Semi-supervised recursive autoencoders for predicting sentiment
distributions." Proceedings of the Conference on Empirical Methods in Natural Language
Processing. Association for Computational Linguistics, 2011.
[4]. www.wiki.python.org/moin/TkInter
[5]. Manning, Christopher D., Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. Vol. 1. Cambridge: Cambridge University Press, 2008.
[6]. Pauls, Adam, and Dan Klein. "Faster and smaller n-gram language models." Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. Vol. 1. 2011.
[7]. www.wikipedia.com/
APPENDIX-A
Negative review classification
Positive review classification
No Features to compare classification
APPENDIX B
Python implementation codes
Training module code:
import os
from math import log, exp
import pickle
from operator import mul
class MyDict(dict):
    # Dictionary that returns 0 for missing keys, so counts can be incremented directly.
    def __getitem__(self, key):
        if key in self:
            return self.get(key)
        return 0

pos = MyDict()
neg = MyDict()
totals = [0, 0]

def negate_sequence(text):
    """
    Detects negations and transforms negated words into "not_" form.
    Also emits bigrams and trigrams over the transformed words.
    """
    negation = False
    delims = "?.,!:;"
    result = []
    words = text.split()
    prev = None
    pprev = None
    for word in words:
        stripped = word.strip(delims).lower()
        negated = "not_" + stripped if negation else stripped
        result.append(negated)
        if prev:
            bigram = prev + " " + negated
            result.append(bigram)
            if pprev:
                trigram = pprev + " " + bigram
                result.append(trigram)
            pprev = prev
        prev = negated
        if any(neg in word for neg in ["not", "n't", "no"]):
            negation = not negation
        if any(c in word for c in delims):
            negation = False
    return result

def train():
    global pos, neg, totals
    limit = 12500
    # Count each feature once per positive document; bootstrap its negated form into neg.
    for file in os.listdir("/Users/Preetham/Desktop/NIE/aclImdb/train/pos")[:limit]:
        for word in set(negate_sequence(open("/Users/Preetham/Desktop/NIE/aclImdb/train/pos/" + file).read())):
            pos[word] += 1
            neg['not_' + word] += 1
    # Same for negative documents, bootstrapping negated forms into pos.
    for file in os.listdir("/Users/Preetham/Desktop/NIE/aclImdb/train/neg")[:limit]:
        for word in set(negate_sequence(open("/Users/Preetham/Desktop/NIE/aclImdb/train/neg/" + file).read())):
            neg[word] += 1
            pos['not_' + word] += 1
    prune_features()
    # Pickle the dictionaries and totals for use by the classification and test modules.
    with open('mydatapositive.pickle', 'wb') as myposdata:
        pickle.dump(pos, myposdata)
    with open('mydatanegative.pickle', 'wb') as mynegdata:
        pickle.dump(neg, mynegdata)
    totals[0] = sum(pos.values())
    totals[1] = sum(neg.values())
    with open('mytotal.pickle', 'wb') as mytotaldata:
        pickle.dump(totals, mytotaldata)

def prune_features():
    """
    Remove features that appear only once.
    """
    global pos, neg
    for k in list(pos.keys()):
        if pos[k] <= 1 and neg[k] <= 1:
            del pos[k]
    for k in list(neg.keys()):
        if neg[k] <= 1 and pos[k] <= 1:
            del neg[k]

train()
prune_features()
print("______________")
Classification module code:
import os
import pickle
from math import log, exp
from Tkinter import *
import tkMessageBox

master = Tk()

class MyDict(dict):
    # Dictionary that returns 0 for missing keys, so counts can be incremented directly.
    def __getitem__(self, key):
        if key in self:
            return self.get(key)
        return 0

pos = MyDict()
neg = MyDict()
totals = [0, 0]

# Load the dictionaries and totals produced by the training module.
with open('mydatapositive.pickle', 'rb') as myposdata:
    pos = pickle.load(myposdata)
with open('mydatanegative.pickle', 'rb') as mynegdata:
    neg = pickle.load(mynegdata)
with open('mytotal.pickle', 'rb') as mytotaldata:
    totals = pickle.load(mytotaldata)

# Manually boost the counts of "good" and its negated form.
pos['good'] += 100
neg['not_' + 'good'] += 100
totals[0] += 100
totals[1] += 100

def negate_sequence(text):
    """
    Detects negations and transforms negated words into "not_" form.
    """
    negation = False
    delims = "?.,!:;"
    result = []
    words = text.split()
    prev = None
    pprev = None
    for word in words:
        stripped = word.strip(delims).lower()
        # Use the same "not_" prefix as the training module.
        negated = "not_" + stripped if negation else stripped
        result.append(negated)
        if prev:
            bigram = prev + " " + negated
            result.append(bigram)
            if pprev:
                trigram = pprev + " " + bigram
                result.append(trigram)
            pprev = prev
        prev = negated
        if any(neg in word for neg in ["not", "n't", "no"]):
            negation = not negation
        if any(c in word for c in delims):
            negation = False
    return result

def classify(tex):
    # Keep only features that were seen during training.
    words = set(word for word in negate_sequence(tex) if word in pos or word in neg)
    if len(words) == 0:
        tkMessageBox.showinfo("Result", "NO FEATURES TO COMPARE")
        return True
    pprob, nprob = 0, 0
    for word in words:
        # Laplace-smoothed log probabilities for each class.
        pp = log(((pos[word] * 1.0) + 1) / (2.0 * totals[0]))
        np = log(((neg[word] * 1.0) + 1) / (2.0 * totals[1]))
        pprob += pp
        nprob += np
    if pprob > nprob:
        tkMessageBox.showinfo("Result", "POSITIVE")
    else:
        tkMessageBox.showinfo("Result", "NEGATIVE")

master.geometry('720x1020')
e = Entry(master, width=300)
e.pack()
e.focus_set()

def fun():
    classify(e.get())

b = Button(master, text="GET RESULT", width=25, command=fun)
b.pack()
mainloop()
Test module code:
import os
import pickle
from math import log, exp

class MyDict(dict):
    # Dictionary that returns 0 for missing keys, so counts can be incremented directly.
    def __getitem__(self, key):
        if key in self:
            return self.get(key)
        return 0

pos = MyDict()
neg = MyDict()
totals = [0, 0]

# Load the dictionaries and totals produced by the training module.
with open('mydatapositive.pickle', 'rb') as myposdata:
    pos = pickle.load(myposdata)
with open('mydatanegative.pickle', 'rb') as mynegdata:
    neg = pickle.load(mynegdata)
with open('mytotal.pickle', 'rb') as mytotaldata:
    totals = pickle.load(mytotaldata)

pcount = 0
ncount = 0

def negate_sequence(text):
    """
    Detects negations and transforms negated words into "not_" form.
    """
    negation = False
    delims = "?.,!:;"
    result = []
    words = text.split()
    prev = None
    pprev = None
    for word in words:
        stripped = word.strip(delims).lower()
        # Use the same "not_" prefix as the training module.
        negated = "not_" + stripped if negation else stripped
        result.append(negated)
        if prev:
            bigram = prev + " " + negated
            result.append(bigram)
            if pprev:
                trigram = pprev + " " + bigram
                result.append(trigram)
            pprev = prev
        prev = negated
        if any(neg in word for neg in ["not", "n't", "no"]):
            negation = not negation
        if any(c in word for c in delims):
            negation = False
    return result

def classify():
    global pcount, pos, neg, totals, ncount
    limit = 12500
    # Positive test reviews: count how many the classifier labels positive.
    for file in os.listdir("/Users/Preetham/Desktop/NIE/aclImdb/test/pos")[:limit]:
        words = set(word for word in negate_sequence(open("/Users/Preetham/Desktop/NIE/aclImdb/test/pos/" + file).read()) if word in pos or word in neg)
        if len(words) == 0:
            print("No features to compare on")
        pprob, nprob = 0, 0
        for word in words:
            pp = log(((pos[word] * 1.0) + 1) / (2.0 * totals[0]))
            np = log(((neg[word] * 1.0) + 1) / (2.0 * totals[1]))
            pprob += pp
            nprob += np
        if pprob > nprob:
            pcount += 1
    # Negative test reviews: count how many the classifier labels negative.
    for file in os.listdir("/Users/Preetham/Desktop/NIE/aclImdb/test/neg")[:limit]:
        words = set(word for word in negate_sequence(open("/Users/Preetham/Desktop/NIE/aclImdb/test/neg/" + file).read()) if word in pos or word in neg)
        if len(words) == 0:
            print("No features to compare on")
        pprob, nprob = 0, 0
        for word in words:
            pp = log(((pos[word] * 1.0) + 1) / (2.0 * totals[0]))
            np = log(((neg[word] * 1.0) + 1) / (2.0 * totals[1]))
            pprob += pp
            nprob += np
        if pprob <= nprob:
            ncount += 1
    print("POSITIVE EFFICIENCY")
    print(pcount)
    print((pcount / 12500.0) * 100.0)
    print("NEGATIVE EFFICIENCY")
    print(ncount)
    print((ncount / 12500.0) * 100.0)

classify()