SENTIMENT ANALYSIS
A PROJECT REPORT SUBMITTED TO
THE NATIONAL INSTITUTE OF ENGINEERING
(An Autonomous College)
In partial fulfillment for the award of degree of
Bachelor of Engineering
In
Computer Science & Engineering
Submitted By
MUKUND SHENOY K P SAMEER VARMA
(4NI10CS040) (4NI10CS047)
PREETHAM P SACHINKUMAR KULKARNI
(4NI10CS054) (4NI10CS066)
Under The Guidance Of
Dr. T H SREENIVAS
Professor
Department of CS & E
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
THE NATIONAL INSTITUTE OF ENGINEERING
(An Autonomous College)
Mysore-570 008
2013-14
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
THE NATIONAL INSTITUTE OF ENGINEERING
(Autonomous Institution)
Mysore-570008
CERTIFICATE
This is to certify that the project work “SENTIMENT ANALYSIS”, is a bonafide work carried
out by Mukund Shenoy K (4NI10CS040), P Sameer Varma (4NI10CS047), Preetham P
(4NI10CS054), SachinKumar Kulkarni (4NI10CS066) in partial fulfillment for the award of
degree of Bachelor of Engineering in Computer Science and Engineering, of Visvesvaraya
Technological University, Belgaum during the year 2013-2014. It is certified that all corrections
/ suggestions indicated during the internal viva have been incorporated and the corrected copy has been
kept in the Department Library. This project report has been approved in partial fulfillment for the
award of the said degree as per the academic regulations of The National Institute of Engineering
(An Autonomous College).
Project Guide Head of Department
Dr. T H Sreenivas Dr. G.Raghavendra Rao
Assistant Professor Prof. and Head of Department
Department of CS&E Department of CS&E
Dr. G L Shekar
Principal
NIE, Mysore
Examiners Signature with Date
1……………………..
2……………………..
ACKNOWLEDGEMENT
We would like to express our sincere gratitude to all those who helped us in completing the
project successfully.
We express our profound thanks to Dr. G L Shekar, Principal, NIE, Mysore for all the support
and encouragement.
We are grateful to Dr. G. Raghavendra Rao, Prof. and Head of the Department of Computer
Science and Engineering, NIE, Mysore, for his support and encouragement in facilitating the
progress of this work.
We sincerely extend our thanks to project Guide Dr. T H SREENIVAS, Professor, Department
of Computer Science and Engineering, NIE, Mysore for his valuable guidance, constant
assistance, support, endurance and constructive suggestions for the betterment of this project,
without which the completion of this project would not have been possible.
Finally we thank our families and friends for being a constant source of inspiration and advice.
Mukund Shenoy K
P Sameer Varma
Preetham P
SachinKumar Kulkarni
ABSTRACT
Sentiment analysis or opinion mining is the computational study of people’s opinions,
sentiments, attitudes, and emotions expressed in written language. It is one of the most active
research areas in natural language processing and text mining in recent years. We have used the
Naïve Bayes classifier method for sentiment analysis and trained the system using movie
reviews. We then classified the reviews as positive or negative.
We have used a combination of methods like negation handling, word n-grams, and
feature selection in addition to the Naïve Bayes method for improving the accuracy of
classification. Naïve Bayes is a very simple probabilistic model that tends to work well on text
classifications and usually takes orders of magnitude less time to train when compared to other
models in sentiment classification. We can achieve a high degree of accuracy using the Naïve
Bayes model, comparable to the current state-of-the-art models in sentiment
classification.
Contents
1.INTRODUCTION ............................................................................................ 1
1.1 SENTIMENT ANALYSIS……………..…………………………………….1
1.2 APPLICATIONS OF SENTIMENT ANALYSIS………………………….......2
1.3 CHALLENGES OF SENTIMENT ANALYSIS …………….......………….…3
1.4 FEATURES OF SENTIMENT ANALYSIS ………………………………......3
1.4.1 TERM PRESENCE vs TERM FREQUENCY..........................................3
1.4.2 TERM POSITION.................................................................................4
1.4.3 N-GRAM FEATURES...........................................................................4
2. LITERATURE SURVEY ……….…………………………………………....5
2.1 NAÏVE BAYES CLASSIFIER…………………………………………….....5
2.2 MAXIMUM ENTROPY CLASSIFIER…………………………………….....6
2.3 SUPPORT VECTOR MACHINE……………………………………...….….6
2.4 PYTHON PROGRAMMING LANGUAGE……………………….………….7
3. SYSTEM REQUIREMENTS ....................................................................... 8
3.1 HARDWARE REQUIREMENTS …………………………………………...8
3.2 SOFTWARE REQUIREMENTS ………………………………………….....8
4. SYSTEM DESIGN ……….……...………….……………………………....9
4.1 DATA……………………………………………………………………......9
4.2 NAÏVE BAYES CLASSIFIER……………………………………………....10
4.3 LAPLACIAN SMOOTHING…………………………………………….....11
4.4 NEGATION HANDLING………………………………………………......11
4.5 N- GRAMS………………………………………………………………...12
4.6 FEATURE SELECTION…………………………………………………...13
5. SYSTEM IMPLEMENTATION……………………………………………......14
5.1 TRAINING PHASE…………………………………………………….......14
5.2 NEGATE SEQUENCE HANDLING…………………………………….......15
5.3 CLASSIFICATION PHASE…………………………………………….......16
5.4 GUI DEVELOPMENT…..............................................................................18
6. SYSTEM TESTING…...………….…………..…………………………....20
6.1 UNIT TESTING………………………………………………………........20
6.2 INTEGRATION TESTING………………………………………………....20
6.3 WHITE BOX TESTING…………………………………………………....21
6.4 BLACK BOX TESTING…............................................................................21
6.5 FUNCTIONAL TESTING……………………………………………….....21
6.6 PERFORMANCE TEST…………………………………………….……...22
6.7 TESTING PHASE……………………………………………………….....22
CONCLUSION…...………….………….……...…...………………...………....23
FUTURE ENHANCEMENTS..……………….….....……...…………………......24
BIBILIOGRAPHY…...………….......….…………...…………………………...25
APPENDIX A...………….………………..………...………………...………...26
APPENDIX B…...………………………..………...…..……………...………...28
CHAPTER 1
INTRODUCTION
1.1 SENTIMENT ANALYSIS
Among the most researched topics in natural language processing is sentiment analysis.
Sentiment analysis involves the extraction of subjective information from documents, such as online
reviews, to determine the polarity with respect to certain objects. It is useful for identifying
trends of public opinion in the social media, for the purpose of marketing and consumer
research. It has its uses in getting customer feedback about new product launches, political
campaigns and even in financial markets. It aims to determine the attitude of a speaker or a
writer with respect to some topic or simply the contextual polarity of a document.
Sentiment Analysis is a Natural Language Processing and Information Extraction task that aims
to obtain writer’s feelings expressed in positive or negative comments, questions and requests,
by analyzing a large numbers of documents. Generally speaking, sentiment analysis aims to
determine the attitude of a speaker or a writer with respect to some topic or the overall tonality
of a document. In recent years, the exponential increase in the Internet usage and exchange of
public opinion is the driving force behind Sentiment Analysis today. The Web is a huge
repository of structured and unstructured data. The analysis of this data to extract latent public
opinion and sentiment is a challenging task.
Liu defines a sentiment or opinion as a quintuple:
“<o_j, f_jk, so_ijkl, h_i, t_l>, where o_j is a target object, f_jk is a feature of the object o_j, so_ijkl is the
sentiment value of the opinion of the opinion holder h_i on feature f_jk of object o_j at time t_l,
so_ijkl is +ve, -ve, or neutral, or a more granular rating, h_i is an opinion holder, and t_l is the time
when the opinion is expressed.”
The analysis of sentiments may be document based where the sentiment in the entire document
is summarized as positive, negative or objective. It can be sentence based where individual
sentences, bearing sentiments, in the text are classified. SA can be phrase based where the
phrases in a sentence are classified according to polarity.
Sentiment Analysis identifies the phrases in a text that bear some sentiment. The author may
speak about some objective facts or subjective opinions. It is necessary to distinguish between
the two. SA finds the subject towards whom the sentiment is directed. A text may contain many
entities but it is necessary to find the entity towards which the sentiment is directed. It identifies
the polarity and degree of the sentiment. Sentiments are classified as objective (facts), positive
(denotes a state of happiness, bliss or satisfaction on part of the writer) or negative (denotes a
state of sorrow, dejection or disappointment on part of the writer). The sentiments can further
be given a score based on their degree of positivity, negativity or objectivity.
1.2 APPLICATIONS OF SENTIMENT ANALYSIS
Word of mouth (WOM) is the process of conveying information from person to person
and plays a major role in customer buying decisions. In commercial situations, WOM involves
consumers sharing attitudes, opinions, or reactions about businesses, products, or services with
other people. WOM communication functions based on social networking and trust. People
rely on families, friends, and others in their social network. Research also indicates that people
appear to trust seemingly disinterested opinions from people outside their immediate social
network, such as online reviews. This is where Sentiment Analysis comes into play. Growing
availability of opinion rich resources like online review sites, blogs, social networking sites
have made this “decision-making process” easier for us. With the explosion of Web 2.0 platforms,
consumers have a soapbox of unprecedented reach and power through which they can share
opinions. Major companies have realized that these consumer voices influence the opinions of other
consumers.
Sentiment Analysis thus finds use in the consumer market for product reviews, in marketing for
gauging consumer attitudes and trends, in social media for finding the general opinion about recent
hot topics, and in the movie industry for finding whether a recently released movie is a hit.
Pang-Lee broadly classifies the applications into the following categories:
a. Applications to Review-Related Websites like Movie Reviews, Product Reviews etc.
b. Applications as a Sub-Component Technology like detecting antagonistic, heated language in
emails, spam detection, context-sensitive information detection, etc.
c. Applications in Business and Government Intelligence Knowing Consumer attitudes and
trends.
1.3 CHALLENGES FOR SENTIMENT ANALYSIS
Sentiment Analysis approaches aim to extract positive and negative sentiment bearing
words from a text and classify the text as positive, negative or else objective if it cannot find
any sentiment bearing words. In this respect, it can be thought of as a text categorization task.
In text classification there are many classes corresponding to different topics whereas in
Sentiment Analysis we have only 3 broad classes. Thus it may seem that Sentiment Analysis is easier
than text classification, which is not quite the case.
1.4 FEATURES FOR SENTIMENT ANALYSIS
Feature engineering is an extremely basic and essential task for Sentiment Analysis.
Converting a piece of text to a feature vector is the basic step in any data driven approach to
SA. In the following section, some commonly used features used in Sentiment Analysis and
their critiques are mentioned.
1.4.1 Term Presence vs. Term Frequency
Term frequency has always been considered essential in traditional Information
Retrieval and Text Classification tasks. But Pang and Lee found that term presence is more
important for Sentiment Analysis than term frequency; that is, binary-valued feature vectors, in
which the entries merely indicate whether a term occurs (value 1) or not (value 0), work better.
This is not counter-intuitive, since the presence of even a single strong sentiment-bearing word
can reverse the polarity of an entire sentence. It has also been observed that rarely occurring
words carry more information than frequently occurring ones (such rare words are known as
hapax legomena).
1.4.2 Term Position
Words appearing in certain positions in the text carry more sentiment or weightage than
words appearing elsewhere. This is similar to IR where words appearing in topic Titles,
Subtitles, or Abstracts are given more weightage than those appearing in the body. For example,
even if a review contains positive words throughout, a negative sentiment expressed in the final
sentence can play the deciding role in determining the overall sentiment. Thus, words appearing
in such deciding positions are generally given more weightage than those appearing elsewhere.
1.4.3 N-gram Features
N-grams are capable of capturing context to some extent and are widely used in Natural
Language Processing tasks. Whether higher order n-grams are useful is a matter of debate.
Pang reported that unigrams outperform bigrams when classifying movie reviews by sentiment
polarity, but Dave found that in some settings, bigrams and trigrams perform better.
CHAPTER 2
LITERATURE SURVEY
Sentiment analysis is a complicated problem but experiments have been done
using Naive Bayes, maximum entropy classifiers and support vector machines.
2.1 NAÏVE BAYES CLASSIFIER
A Naive Bayes classifier is a simple probabilistic model based on the Bayes rule
along with a strong independence assumption.
The Naïve Bayes model involves a simplifying conditional independence assumption.
That is given a class (positive or negative), the words are conditionally independent of
each other. This assumption does not affect the accuracy in text classification by much
but makes really fast classification algorithms applicable for the problem.
In our case, the maximum likelihood probability of a word belonging to a particular
class is given by the expression:
P(w | class) = count(w, class) / count(class)
where count(w, class) is the frequency of the word w in documents of that class and count(class) is the total count of words in that class.
The frequency counts of the words are stored in hash tables during the training phase.
According to the Bayes Rule, the probability of a particular document d belonging to a
class ci is given by:
P(ci | d) ∝ P(ci) * P(x1 | ci) * P(x2 | ci) * ... * P(xn | ci)
In simple terms, a naive Bayes classifier assumes that the value of a particular feature is
unrelated to the presence or absence of any other feature, given the class variable. For example,
a fruit may be considered to be an apple if it is red, round, and about 3" in diameter. A naive
Bayes classifier considers each of these features to contribute independently to the probability
that this fruit is an apple, regardless of the presence or absence of the other features.
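As an illustration of this independence assumption (with made-up probabilities, not taken from any real data), the class score is just the prior multiplied by the per-feature likelihoods:

# Illustrative only: hypothetical per-feature likelihoods for the "apple" example.
p_given_apple = {"red": 0.7, "round": 0.9, "diameter_3in": 0.6}
p_given_other = {"red": 0.2, "round": 0.4, "diameter_3in": 0.3}
p_apple, p_other = 0.5, 0.5  # assumed equal priors

score_apple, score_other = p_apple, p_other
for f in ["red", "round", "diameter_3in"]:
    score_apple *= p_given_apple[f]
    score_other *= p_given_other[f]
# score_apple (0.189) > score_other (0.012), so the fruit is classified as an apple.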
2.2 MAXIMUM ENTROPY CLASSIFIER
The Max Entropy classifier is a probabilistic classifier which belongs to the class of
exponential models. Unlike the Naive Bayes classifier that we discussed in the previous article,
the Max Entropy does not assume that the features are conditionally independent of each other.
The MaxEnt is based on the Principle of Maximum Entropy and from all the models that fit our
training data, selects the one which has the largest entropy. The Max Entropy classifier can be
used to solve a large variety of text classification problems such as language detection, topic
classification, sentiment analysis and more.
Due to the minimum assumptions that the Maximum Entropy classifier makes, we regularly use
it when we don’t know anything about the prior distributions and when it is unsafe to make any
such assumptions. Moreover Maximum Entropy classifier is used when we can’t assume the
conditional independence of the features. This is particularly true in Text Classification
problems where our features are usually words which obviously are not independent. The Max
Entropy requires more time to train comparing to Naive Bayes, primarily due to the
optimization problem that needs to be solved in order to estimate the parameters of the model.
Nevertheless, after computing these parameters, the method provides robust results and it is
competitive in terms of CPU and memory consumption.
2.3 SUPPORT VECTOR MACHINE
In machine learning, support vector machines (SVMs, also called support vector networks)
are supervised learning models with associated learning algorithms that analyze data and
recognize patterns, used for classification and regression analysis. Given a set of training
examples, each marked as belonging to one of two categories, an SVM training algorithm
builds a model that assigns new examples into one category or the other, making it a non-
probabilistic binary linear classifier. An SVM model is a representation of the examples as
points in space, mapped so that the examples of the separate categories are divided by a clear
gap that is as wide as possible. New examples are then mapped into that same space and
predicted to belong to a category based on which side of the gap they fall on.
More formally, a support vector machine constructs a hyperplane or set of hyperplanes in
a high- or infinite-dimensional space, which can be used for classification, regression, or other
tasks. Intuitively, a good separation is achieved by the hyperplane that has the largest distance
to the nearest training data point of any class (so-called functional margin), since in general the
larger the margin the lower the generalization error of the classifier.
2.4 PYTHON PROGRAMMING LANGUAGE
Python is a widely used general-purpose, high-level programming language. Its design
philosophy emphasizes code readability, and its syntax allows programmers to express
concepts in fewer lines of code than would be possible in languages such as C. The language
provides constructs intended to enable clear programs on both a small and a large scale. Python
supports multiple programming paradigms, including object-oriented, imperative, functional,
and procedural styles. It features a dynamic type system and automatic memory
management and has a large and comprehensive standard library. Like other dynamic
languages, Python is often used as a scripting language, but is also used in a wide range of non-
scripting contexts. Using third-party tools, such as Py2exe, or Pyinstaller, Python code can be
packaged into standalone executable programs. Python interpreters are available for many
operating systems.
We chose Python because it has a shallow learning curve, its syntax and semantics are
transparent, and it has good string-handling functionality. As an interpreted language, Python
facilitates interactive exploration. As an object-oriented language, Python permits data and
methods to be encapsulated and re-used easily. As a dynamic language, Python permits
attributes to be added to objects on the fly, and permits variables to be typed dynamically,
facilitating rapid development. Python comes with an extensive standard library, including
components for graphical programming, numerical processing, and web connectivity.
Python is heavily used in industry, scientific research, and education around the world. Python
is often praised for the way it facilitates productivity, quality, and maintainability of software.
CHAPTER 3
SYSTEM REQUIREMENTS
3.1 HARDWARE REQUIREMENTS
2GHz+ CPU
1GB RAM
1GB Hard disk Space
3.2 SOFTWARE REQUIREMENTS
Python IDLE 3.3
Web Browser
CHAPTER 4
SYSTEM DESIGN
4.1 DATA
We used a publicly available dataset of movie reviews from the Internet Movie Database
(IMDb), which was compiled by Andrew Maas. It is a set of 25,000 highly polar movie reviews
for training, and 25,000 for testing. Both the training and test sets have an equal number of
positive and negative reviews. We chose movie reviews as our data set because it covers a wide
range of human emotions and captures most of the adjectives relevant to sentiment
classification. Also, most existing research on sentiment classification uses movie review data
for benchmarking.
We used the 25,000 documents in the training set to build our supervised learning model. The
other 25,000 were used for evaluating the accuracy of our classifier.
4.2 NAÏVE BAYES CLASSIFIER
A Naive Bayes classifier is a simple probabilistic model based on the Bayes rule along
with a strong independence assumption.
The Naïve Bayes model involves a simplifying conditional independence assumption. That is
given a class (positive or negative), the words are conditionally independent of each other. This
assumption does not affect the accuracy in text classification by much but makes really fast
classification algorithms applicable for the problem.
In our case, the maximum likelihood probability of a word belonging to a particular class is
given by the expression:
P(w | class) = count(w, class) / count(class)
where count(w, class) is the frequency of the word w in documents of that class and count(class) is the total count of words in that class.
The frequency counts of the words are stored in hash tables during the training phase.
According to the Bayes Rule, the probability of a particular document d belonging to a class ci is
given by:
P(ci | d) ∝ P(ci) * P(x1 | ci) * P(x2 | ci) * ... * P(xn | ci)
In simple terms, a naive Bayes classifier assumes that the value of a particular feature is
unrelated to the presence or absence of any other feature, given the class variable. For example,
a fruit may be considered to be an apple if it is red, round, and about 3" in diameter. A naive
Bayes classifier considers each of these features to contribute independently to the probability
that this fruit is an apple, regardless of the presence or absence of the other features.
Under the simplifying conditional independence assumption, given a class (positive or
negative), the words are conditionally independent of each other. It is because of this simplifying
assumption that the model is termed “naïve”.
Here the xi's are the individual words of the document. The classifier outputs the class with the
maximum posterior probability. We also remove duplicate words from the document, since they
do not add any additional information; this variant of the naïve Bayes algorithm is called Bernoulli
Naïve Bayes. Including just the presence of a word instead of its count has been found to
improve performance marginally when there is a large number of training examples.
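A minimal sketch of this Bernoulli-style counting (pos_counts, neg_counts and train_on are hypothetical names; the project's actual training module is listed in Appendix B) might look like the following:

from collections import defaultdict

pos_counts = defaultdict(int)  # word -> number of positive documents containing it
neg_counts = defaultdict(int)  # word -> number of negative documents containing it

def train_on(document_text, label):
    # Bernoulli Naive Bayes: each word is counted at most once per document.
    for word in set(document_text.lower().split()):
        if label == "pos":
            pos_counts[word] += 1
        else:
            neg_counts[word] += 1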
4.3 LAPLACIAN SMOOTHING
If the classifier encounters a word that has not been seen in the training set, the
probability of both the classes would become zero and there won’t be anything to compare
between. This problem can be solved by Laplacian smoothing:
P(w | class) = (count(w, class) + k) / (count(class) + k * |V|), where |V| is the number of distinct words in the vocabulary.
Usually, k is chosen as 1. This way, there is equal probability for the new word to be in either
class. Since Bernoulli Naïve Bayes is used, the total number of words in a class is computed
differently. For the purpose of this calculation, each document is reduced to a set of unique
words with no duplicates.
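A sketch of the smoothed classification rule, assuming k = 1 and the count dictionaries from the sketch above (smoothed_log_prob and classify are illustrative helpers; the exact normalisation used in our code is shown in Appendix B):

from math import log

def smoothed_log_prob(word, counts, total, vocab_size, k=1):
    # Laplacian smoothing: unseen words get a small, non-zero probability.
    return log((counts[word] + k) / float(total + k * vocab_size))

def classify(words, pos_counts, neg_counts, pos_total, neg_total, vocab_size):
    p = sum(smoothed_log_prob(w, pos_counts, pos_total, vocab_size) for w in set(words))
    n = sum(smoothed_log_prob(w, neg_counts, neg_total, vocab_size) for w in set(words))
    return "positive" if p > n else "negative"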
4.4 NEGATION HANDLING
Negation handling was one of the factors that contributed significantly to the accuracy
of our classifier. A major problem faced during the task of sentiment classification is that of
handling negations. Since we are using each word as feature, the word “good” in the phrase
“not good” will be contributing to positive sentiment rather that negative sentiment as the
presence of “not” before it is not taken into account.
To solve this problem we devised a simple algorithm for handling negations using state
variables and bootstrapping. We built on the idea of using an alternate representation of
negated forms as shown by Das & Chen. Our algorithm uses a state variable to store the
negation state. It transforms a word followed by a not or n’t into “not_” + word. Whenever the
negation state variable is set, the words read are treated as “not_” + word. The state variable is
reset when a punctuation mark is encountered or when there is double negation.
The pseudo code of the algorithm is described below:
PSEUDO CODE:
    negated := False
    for each word in document:
        if negated = True:
            transform word to “not_” + word
        if word is “not” or “n’t”:
            negated := not negated
        if a punctuation mark is encountered:
            negated := False
The number of negated forms in the training set might not be adequate for correct classification:
it is possible that many words with strong sentiment occur only in their normal forms in the
training set, even though their negated forms would also be of strong polarity.
We addressed this problem by adding negated forms to the opposite class along with normal
forms of all the features during the training phase. That is to say if we encounter the word
“good” in a positive document during the training phase, we increment the count of “good” in
the positive class and also increment the count of “not_good” for the negative class. This is to
ensure that the number of “not_” forms is sufficient for classification. This modification
resulted in a significant improvement in classification accuracy (about 1%) due to
bootstrapping of negated forms during training. This form of negation handling can be applied
to a variety of text related applications.
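A sketch of this bootstrapping step during training (add_word is an illustrative helper; it mirrors the behaviour of the training module in Appendix B):

def add_word(word, label, pos_counts, neg_counts):
    # Credit the word to its own class and its negated form to the opposite class.
    if label == "pos":
        pos_counts[word] += 1
        neg_counts["not_" + word] += 1
    else:
        neg_counts[word] += 1
        pos_counts["not_" + word] += 1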
4.5 N-GRAMS
Generally, information about sentiment is conveyed by adjectives or more specifically
by certain combinations of adjectives with other parts of speech. Adding features like
consecutive pairs of words (bigrams), or even triplets of words (trigrams), can capture this
information. Words like "very" or "definitely" don't provide much sentiment information on
their own, but phrases like "very bad" or "definitely recommended" increase the probability of
a document being negatively or positively biased. By including bigrams and trigrams, we were
able to capture this information about adjectives and adverbs. Using bigrams and trigrams
requires a substantial amount of data in the training set, but this is not a problem as our training set
had 25,000 reviews. But the data may not be enough to add 4-grams, as this may over-fit the
training set. The counts of the n-grams were stored in a hash table along with the counts of
unigrams.
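A sketch of how bigram and trigram features can be generated alongside unigrams (extract_ngram_features is an illustrative helper; it assumes the words have already been tokenised and negation-tagged):

def extract_ngram_features(words):
    # Emit unigrams, bigrams and trigrams as space-joined strings.
    features = list(words)
    features += [" ".join(words[i:i + 2]) for i in range(len(words) - 1)]
    features += [" ".join(words[i:i + 3]) for i in range(len(words) - 2)]
    return features

# extract_ngram_features(["very", "bad", "acting"])
# -> ['very', 'bad', 'acting', 'very bad', 'bad acting', 'very bad acting']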
4.6 FEATURE SELECTION
Feature selection is the process of removing redundant features, while retaining those
features that have high disambiguation capabilities. The use of higher dimensional features like
bigrams and trigrams presents a problem, that of the number of features increasing from
300,000 to about 11,000,000. Most of these features are redundant and noisy in nature.
Including them would affect both efficiency and accuracy. A basic filtering step of removing
the features or terms which occur only once is performed, reducing the number of features to
about 1,500,000. The features are further filtered on the basis of mutual
information.
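A common way to compute the mutual information between a binary term feature and the class follows the formulation in [5]; the sketch below (with hypothetical count names n11, n10, n01, n00) is one such formulation and is not necessarily the exact computation used in our code:

from math import log

def mutual_information(n11, n10, n01, n00):
    # n11: positive docs containing the term, n10: negative docs containing it,
    # n01: positive docs without the term,    n00: negative docs without it.
    n = float(n11 + n10 + n01 + n00)
    def part(n_tc, n_t, n_c):
        return (n_tc / n) * log((n * n_tc) / (n_t * n_c), 2) if n_tc else 0.0
    return (part(n11, n11 + n10, n11 + n01) +
            part(n10, n11 + n10, n10 + n00) +
            part(n01, n01 + n00, n11 + n01) +
            part(n00, n01 + n00, n10 + n00))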
CHAPTER 5
SYSTEM IMPLEMENTATION
System implementation deals with the implementation details of each module used in
the project and the relationships existing between them. The implementation of the project is
done in the Python language using the Python Integrated Development Environment (IDLE) 3.3.
The download link for Python is provided in [1]; it is open-source software that can be
downloaded from the official website of Python. The implementation involves three modules:
training, classification, and testing.
5.1 TRAINING PHASE
Training phase is the first phase where the system is trained with a suitably large
collection of positive and negative movie reviews obtained from the internet movie database
IMDb. The training data set consists of 12,500 positive reviews and 12,500 negative reviews,
which can be downloaded from [2]. To get higher accuracy and efficiency we need a very large
data set for training, because our analysis includes negation handling as well as bigrams and
trigrams. The training phase starts off with creating two
empty positive and negative dictionaries called pos and neg respectively.
We take each review from the positive training data set and give it as an argument to the negate
sequence function. This function returns a list of words that are inserted into both the pos and neg
dictionaries (the word itself into pos and its negated form into neg), after removing duplicate words
from the list. A similar procedure is applied to each review from the negative data set, thereby
populating the neg and pos dictionaries. We later call the prune_features function, which removes
from both dictionaries the words that appear only once, thereby reducing the number of entries and
facilitating faster access. We use the concept of pickling for storing the dictionaries in files, called
pickled files, which can be used later during the classification phase. Pickling is done by the
function pickle.dump(), which retains the data structure of the dictionary even after it is stored in a
file; this is one of the major advantages of pickling.
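A minimal sketch of this pickling step (the file name follows the one used in Appendix B; the counts shown are illustrative only):

import pickle

pos = {"good": 1203, "not_bad": 87}    # illustrative counts only

with open('mydatapositive.pickle', 'wb') as myposdata:
    pickle.dump(pos, myposdata)        # store the dictionary on disk

with open('mydatapositive.pickle', 'rb') as myposdata:
    restored = pickle.load(myposdata)  # restored == pos, still a dictionary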
5.2 NEGATE SEQUENCE HANDLING
Negation handling was one of the factors that contributed significantly to the accuracy
of our classifier. A major problem faced during the task of sentiment classification is that of
handling negations. Since we are using each word as a feature, the word “good” in the phrase
“not good” will contribute to positive sentiment rather than negative sentiment, as the
presence of “not” before it is not taken into account.
To solve this problem we devised a simple algorithm for handling negations using state
variables and bootstrapping. We built on the idea of using an alternate representation of
negated forms as shown by Das & Chen in [3]. Our algorithm uses a state variable to store the
negation state. It transforms a word followed by a not or n’t into “not_” + word. Whenever the
negation state variable is set, the words read are treated as “not_” + word. The state variable is
reset when a punctuation mark is encountered or when there is double negation. The pseudo
code of the algorithm is described below:
PSEUDO CODE:
    negated := False
    for each word in document:
        if negated = True:
            transform word to “not_” + word
        if word is “not” or “n’t”:
            negated := not negated
        if a punctuation mark is encountered:
            negated := False
The number of negated forms in the training set might not be adequate for correct classification:
it is possible that many words with strong sentiment occur only in their normal forms in the
training set, even though their negated forms would also be of strong polarity.
We addressed this problem by adding negated forms to the opposite class along with normal
forms of all the features during the training phase. That is to say if we encounter the word
“good” in a positive document during the training phase, we increment the count of “good” in
the positive class and also increment the count of “not_good” for the negative class. This is to
ensure that the number of “not_” forms is sufficient for classification. This modification
resulted in a significant improvement in classification accuracy (about 1%) due to
bootstrapping of negated forms during training. This form of negation handling can be applied
to a variety of text related applications.
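As an illustration, a simplified unigram-only version of this transformation might look as follows (negate_unigrams is an illustrative helper; the full function used in the project, listed in Appendix B, also emits bigrams and trigrams):

def negate_unigrams(text):
    # Simplified sketch: tag words that follow a negation with the "not_" prefix.
    delims = "?.,!:;"
    negation = False
    result = []
    for word in text.lower().split():
        stripped = word.strip(delims)
        result.append("not_" + stripped if negation else stripped)
        if stripped in ("not", "no") or stripped.endswith("n't"):
            negation = not negation
        if any(c in word for c in delims):
            negation = False
    return result

# negate_unigrams("The plot was not good.")
# -> ['the', 'plot', 'was', 'not', 'not_good']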
5.3 CLASSIFICATION PHASE
The classification phase involves user interaction: the user inputs a query review and gets an
output displaying whether the result is positive or negative. Classification starts off with loading
the pickled files and getting an input from the user; the input review is passed as an argument to
the negate sequence function, thereby getting a list of words. We then apply Bayes' theorem and
Laplacian smoothing.
A Naive Bayes classifier is a simple probabilistic model based on the Bayes rule along with a
strong independence assumption.
The Naïve Bayes model involves a simplifying conditional independence assumption. That is
given a class (positive or negative), the words are conditionally independent of each other. This
assumption does not affect the accuracy in text classification by much but makes really fast
classification algorithms applicable for the problem.
In our case, the maximum likelihood probability of a word belonging to a particular class is
given by the expression:
P(w | class) = count(w, class) / count(class)
The frequency counts of the words are stored in hash tables during the training phase.
According to the Bayes Rule, the probability of a particular document d belonging to a class ci is
given by:
P(ci | d) ∝ P(ci) * P(x1 | ci) * P(x2 | ci) * ... * P(xn | ci)
In simple terms, a naive Bayes classifier assumes that the value of a particular feature is
unrelated to the presence or absence of any other feature, given the class variable. For example,
a fruit may be considered to be an apple if it is red, round, and about 3" in diameter. A naive
Bayes classifier considers each of these features to contribute independently to the probability
that this fruit is an apple, regardless of the presence or absence of the other features.
If the classifier encounters a word that has not been seen in the training set, the probability of
both the classes would become zero and there won’t be anything to compare between. This
problem can be solved by Laplacian smoothing:
P(w | class) = (count(w, class) + k) / (count(class) + k * |V|), where |V| is the number of distinct words in the vocabulary.
Usually, k is chosen as 1. This way, there is equal probability for the new word to be in either
class. Since Bernoulli Naïve Bayes is used, the total number of words in a class is computed
differently. For the purpose of this calculation, each document is reduced to a set of unique
words with no duplicates.
After applying Bayes' theorem and Laplacian smoothing we get positive and negative
probabilities. If the positive probability is greater than the negative probability, the positive
result is displayed; otherwise, the negative result is displayed. This ends the classification
phase.
5.4 GUI DEVELOPMENT
Tkinter is Python's de-facto standard GUI (Graphical User Interface) package. It is a
thin object-oriented layer on top of Tcl/Tk. Tkinter is not the only GUI programming toolkit for
Python, but it is the most commonly used one.
Tkinter is a Python binding to the Tk GUI toolkit. It is the standard Python interface to the Tk
GUI toolkit [4] and is included with the standard Windows and Mac OS X installs of Python.
The name Tkinter comes from “Tk interface”. Tkinter was written by Fredrik Lundh.
As with most other modern Tk bindings, Tkinter is implemented as a Python wrapper around a
complete Tcl interpreter embedded in the Python interpreter. Tkinter calls are translated into
Tcl commands which are fed to this embedded interpreter, thus making it possible to mix
Python and Tcl in a single application.
Python 2.7 and Python 3.1 incorporate the "themed Tk" ("ttk") functionality of Tk 8.5 [4]. This
allows Tk widgets to be easily themed to look like the native desktop environment in which the
application is running, thereby addressing a long-standing criticism of Tk (and hence of
Tkinter).
PSEUDO CODE:
    e := Entry(master)
    call e.pack()
    call e.focus_set()
    define fun():
        classify(e.get())
    b := Button(master, command=fun)
    call b.pack()
    call mainloop()
We have created the text box using the Entry function and then created a button. When the
button is clicked, the function fun() is called via the command argument; inside fun() we call the
classify method, passing the content typed in the text box as an argument.
In the classify method, appropriate pop-ups are created displaying the messages "positive",
"negative" or "no features to compare".
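A minimal, self-contained sketch of this wiring is shown below; it uses the same Python 2 module names (Tkinter, tkMessageBox) as the code in Appendix B, but classify here is only a stand-in for the real classifier:

from Tkinter import Tk, Entry, Button, mainloop
import tkMessageBox

master = Tk()

def classify(text):
    # Stand-in for the real classifier from the classification module.
    tkMessageBox.showinfo("Result", "POSITIVE" if "good" in text.lower() else "NEGATIVE")

e = Entry(master, width=80)    # text box for the review
e.pack()
e.focus_set()

def fun():
    classify(e.get())          # pass the text box contents to the classifier

b = Button(master, text="GET RESULT", command=fun)
b.pack()
mainloop()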
CHAPTER 6
SYSTEM TESTING
System testing involves unit testing, integration testing, white-box testing, and black-box
testing. Strategies for integrating software components into a functional product include the
bottom-up strategy, the top-down strategy, and the sandwich strategy. Careful planning and
scheduling are required to ensure that modules are available for integration into the evolving
software product when needed. A series of tests is performed on the proposed system before it
is ready for user acceptance testing.
6.1 UNIT TESTING
Instead of testing the system as a whole, unit testing focuses on the modules that
make up the system. Each module is taken up individually and tested for correctness of
coding and logic.
The advantages of unit testing are:
1. The size of each module is quite small, so errors can easily be located.
2. Confusing interactions of multiple errors in widely different parts of the software are
eliminated.
3. Module-level testing can be exhaustive.
6.2 INTEGRATION TESTING
It tests for the errors resulting from the integration of modules. One concern of
integration testing is whether parameters match on both sides in type, permissible range, and
meaning. Integration testing is a functional black-box test method: it treats each
module as an impenetrable mechanism for information. The only concern during integration
testing is that the modules work together properly.
6.3 WHITE BOX TESTING (Code Testing)
The code-testing strategy examines the logic of the program. To follow this testing
method, the analyst develops test cases that result in executing every instruction in the
program or module so that every path through the program is tested. A path is a specific
combination of conditions that is handled by the program. Code testing does not check the
range of data that the program will accept.
It exercises all logical decisions on their true and false sides, and executes all loops at their
boundaries and within their operational bounds.
6.4 BLACK BOX TESTING (Specification Testing)
To perform specification testing, the analyst examines the specification, starting from
what the program should do and how it should perform under various conditions. Then test
cases are developed for each condition or combinations of conditions and submitted for
processing. By examining the results, the analyst can determine whether the program
performs according to its specified requirements. This testing strategy sounds exhaustive. If
every statement in the program is checked for its validity, there doesn’t seem to be much
scope for errors.
6.5 FUNCTIONAL TESTING
In this type of testing, the software is tested for the functional requirements. The tests
are written in order to check if the application behaves as expected. Although functional
testing is often done toward the end of the development cycle, it can, and should, be
started much earlier. Individual components and processes can be tested early on, even
before it's possible to do functional testing on the entire system. Functional testing covers
how well the system executes the functions it is supposed to execute—including user
commands, data manipulation, searches and business processes, user screens, and
integrations. Functional testing covers the obvious surface type of functions, as well as the
back-end operations (such as security and how upgrades affect the system).
6.6 PERFORMANCE TEST
In software engineering, performance testing is testing that is performed, from one
perspective, to determine how fast some aspect of a system performs under a particular
workload. It can also serve to validate and verify other quality attributes of the system, such
as scalability, reliability and resource usage. Performance testing is a subset of performance
engineering, an emerging computer science practice which strives to build performance into the
design and architecture of a system, prior to the onset of actual coding effort.
Performance testing can compare two systems to find which performs better. Or it can
measure what parts of the system or workload cause the system to perform badly. In the
diagnostic case, software engineers use tools such as profilers to measure what parts of a
device or software contribute most to the poor performance or to establish throughput levels
(and thresholds) for maintaining acceptable response times. For the cost-effectiveness of a
new system, performance test efforts should begin at the inception of the
development project and extend through to deployment. The later a performance defect is
detected, the higher the cost of remediation. This is true in the case of functional testing, but
even more so with performance testing, due to the end-to-end nature of its scope.
In performance testing, it is often crucial (and often difficult to arrange) for the test
conditions to be similar to the expected actual use. This is, however, not entirely possible in
actual practice. The reason is that production systems have a random nature of the workload
and while the test workloads do their best to mimic what may happen in the production
environment, it is impossible to exactly replicate this workload variability - except in the
simplest system.
6.7 TESTING PHASE
Testing involves the test data set, which contains 12,500 positive reviews and 12,500
negative reviews. At first the 12,500 positive reviews are given to the system and the number of
reviews concluded to be positive by the analysis is counted, thereby giving the efficiency of the
system for positive reviews. A similar procedure is carried out for the 12,500 negative reviews
present in the test data set.
CONCLUSION
Our results show that a simple Naive Bayes classifier can be enhanced to match the
classification accuracy of more complicated models for sentiment analysis by choosing the
right type of features and removing noise by appropriate feature selection. Naive Bayes
classifiers due to their conditional independence assumptions are extremely fast to train and can
scale over large datasets. They are also robust to noise and less prone to overfitting. Ease of
implementation is also a major advantage of Naive Bayes classifier. They were thought to be
less accurate than their more sophisticated counterparts like support vector machines and
logistic regression but we have shown through this project that a significantly high accuracy
can be achieved. The ideas used in this paper can also be applied to the more general domain of
text classification.
RESULTS
We implemented the classifier in Python using hash tables to store the counts of words in their
respective classes. Training involved preprocessing data and applying negation handling before
counting the words. Since we were using Bernoulli Naive Bayes, each word is counted only
once per document. On a laptop running an Intel i3 processor at 2.1 GHz, training took around
90 seconds and used about 1 GB of memory. The memory usage is largely due to the bigrams
and trigrams stored prior to feature selection.
Positive efficiency: 11207 out of 12500 (89.65%)
Negative efficiency: 10856 out of 12500 (86.84%)
FUTURE ENHANCEMENTS
This particular field of sentiment analysis can be utilized to obtain the opinion of the public
across the Internet through various social networking sites such as Twitter, Facebook, etc.
A system can be designed such that, when a movie name is given as input with a hashtag,
the system collects tweets about the movie, performs sentiment analysis on all the tweets
collected, and gives a result based on the number of positive and
negative tweets. This method can also be used to gather opinions on general topics from the
public across the Internet.
The training should be performed incrementally and periodically with new and larger data sets,
thereby increasing the efficiency of the system.
Our project need not be restricted to only movie review classifications. By changing the
training data set to some other domain the project can be scaled to classify other pieces of text.
There is also scope for applying this project to reviews which contain an almost equal number of
positive and negative words, which in general are termed "neutral". This can be
achieved by using an extra dictionary and some extra functions.
BIBLIOGRAPHY
[1]. www.python.org/download/
[2]. Large Movie Review Dataset. (n.d.). Retrieved from
http://ai.stanford.edu/~amaas/data/sentiment/
[3]. Socher, Richard, et al. "Semi-supervised recursive autoencoders for predicting sentiment
distributions." Proceedings of the Conference on Empirical Methods in Natural Language
Processing. Association for Computational Linguistics, 2011.
[4]. www.wiki.python.org/moin/TkInter
[5]. Manning, Christopher D., Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. Vol. 1. Cambridge: Cambridge University Press, 2008.
[6]. Pauls, Adam, and Dan Klein. "Faster and smaller n-gram language models." Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. Vol. 1. 2011.
[7]. www.wikipedia.com/
APPENDIX-A
Negative review classification
Positive review classification
No Features to compare classification
APPENDIX B
Python implementation codes
Training module code:
import os
from math import log, exp
import pickle
from operator import mul
class MyDict(dict):
    # Dictionary that returns 0 for missing keys, so counts can be incremented directly.
    def __getitem__(self, key):
        if key in self:
            return self.get(key)
        return 0

pos = MyDict()
neg = MyDict()
totals = [0, 0]

def negate_sequence(text):
    """
    Detects negations and transforms negated words into "not_" form.
    Also emits bigrams and trigrams over the transformed words.
    """
    negation = False
    delims = "?.,!:;"
    result = []
    words = text.split()
    prev = None
    pprev = None
    for word in words:
        stripped = word.strip(delims).lower()
        negated = "not_" + stripped if negation else stripped
        result.append(negated)
        if prev:
            bigram = prev + " " + negated
            result.append(bigram)
            if pprev:
                trigram = pprev + " " + bigram
                result.append(trigram)
            pprev = prev
        prev = negated
        if any(neg in word for neg in ["not", "n't", "no"]):
            negation = not negation
        if any(c in word for c in delims):
            negation = False
    return result

def train():
    global pos, neg, totals
    limit = 12500
    # Count each feature once per positive document; bootstrap its negated form into neg.
    for file in os.listdir("/Users/Preetham/Desktop/NIE/aclImdb/train/pos")[:limit]:
        for word in set(negate_sequence(open("/Users/Preetham/Desktop/NIE/aclImdb/train/pos/" + file).read())):
            pos[word] += 1
            neg['not_' + word] += 1
    # Same for negative documents, bootstrapping negated forms into pos.
    for file in os.listdir("/Users/Preetham/Desktop/NIE/aclImdb/train/neg")[:limit]:
        for word in set(negate_sequence(open("/Users/Preetham/Desktop/NIE/aclImdb/train/neg/" + file).read())):
            neg[word] += 1
            pos['not_' + word] += 1
    prune_features()
    # Pickle the dictionaries and totals for use by the classification and test modules.
    with open('mydatapositive.pickle', 'wb') as myposdata:
        pickle.dump(pos, myposdata)
    with open('mydatanegative.pickle', 'wb') as mynegdata:
        pickle.dump(neg, mynegdata)
    totals[0] = sum(pos.values())
    totals[1] = sum(neg.values())
    with open('mytotal.pickle', 'wb') as mytotaldata:
        pickle.dump(totals, mytotaldata)

def prune_features():
    """
    Remove features that appear only once.
    """
    global pos, neg
    for k in list(pos.keys()):
        if pos[k] <= 1 and neg[k] <= 1:
            del pos[k]
    for k in list(neg.keys()):
        if neg[k] <= 1 and pos[k] <= 1:
            del neg[k]

train()
prune_features()
print("______________")
Classification module code:
import os
import pickle
from math import log, exp
from Tkinter import *
import tkMessageBox

master = Tk()

class MyDict(dict):
    # Dictionary that returns 0 for missing keys, so counts can be incremented directly.
    def __getitem__(self, key):
        if key in self:
            return self.get(key)
        return 0

pos = MyDict()
neg = MyDict()
totals = [0, 0]

# Load the dictionaries and totals produced by the training module.
with open('mydatapositive.pickle', 'rb') as myposdata:
    pos = pickle.load(myposdata)
with open('mydatanegative.pickle', 'rb') as mynegdata:
    neg = pickle.load(mynegdata)
with open('mytotal.pickle', 'rb') as mytotaldata:
    totals = pickle.load(mytotaldata)

# Manually boost the counts of "good" and its negated form.
pos['good'] += 100
neg['not_' + 'good'] += 100
totals[0] += 100
totals[1] += 100

def negate_sequence(text):
    """
    Detects negations and transforms negated words into "not_" form.
    """
    negation = False
    delims = "?.,!:;"
    result = []
    words = text.split()
    prev = None
    pprev = None
    for word in words:
        stripped = word.strip(delims).lower()
        # Use the same "not_" prefix as the training module.
        negated = "not_" + stripped if negation else stripped
        result.append(negated)
        if prev:
            bigram = prev + " " + negated
            result.append(bigram)
            if pprev:
                trigram = pprev + " " + bigram
                result.append(trigram)
            pprev = prev
        prev = negated
        if any(neg in word for neg in ["not", "n't", "no"]):
            negation = not negation
        if any(c in word for c in delims):
            negation = False
    return result

def classify(tex):
    # Keep only features that were seen during training.
    words = set(word for word in negate_sequence(tex) if word in pos or word in neg)
    if len(words) == 0:
        tkMessageBox.showinfo("Result", "NO FEATURES TO COMPARE")
        return True
    pprob, nprob = 0, 0
    for word in words:
        # Laplace-smoothed log probabilities for each class.
        pp = log(((pos[word] * 1.0) + 1) / (2.0 * totals[0]))
        np = log(((neg[word] * 1.0) + 1) / (2.0 * totals[1]))
        pprob += pp
        nprob += np
    if pprob > nprob:
        tkMessageBox.showinfo("Result", "POSITIVE")
    else:
        tkMessageBox.showinfo("Result", "NEGATIVE")

master.geometry('720x1020')
e = Entry(master, width=300)
e.pack()
e.focus_set()

def fun():
    classify(e.get())

b = Button(master, text="GET RESULT", width=25, command=fun)
b.pack()
mainloop()
Test module code:
import os
import pickle
from math import log, exp

class MyDict(dict):
    # Dictionary that returns 0 for missing keys, so counts can be incremented directly.
    def __getitem__(self, key):
        if key in self:
            return self.get(key)
        return 0

pos = MyDict()
neg = MyDict()
totals = [0, 0]

# Load the dictionaries and totals produced by the training module.
with open('mydatapositive.pickle', 'rb') as myposdata:
    pos = pickle.load(myposdata)
with open('mydatanegative.pickle', 'rb') as mynegdata:
    neg = pickle.load(mynegdata)
with open('mytotal.pickle', 'rb') as mytotaldata:
    totals = pickle.load(mytotaldata)

pcount = 0
ncount = 0

def negate_sequence(text):
    """
    Detects negations and transforms negated words into "not_" form.
    """
    negation = False
    delims = "?.,!:;"
    result = []
    words = text.split()
    prev = None
    pprev = None
    for word in words:
        stripped = word.strip(delims).lower()
        # Use the same "not_" prefix as the training module.
        negated = "not_" + stripped if negation else stripped
        result.append(negated)
        if prev:
            bigram = prev + " " + negated
            result.append(bigram)
            if pprev:
                trigram = pprev + " " + bigram
                result.append(trigram)
            pprev = prev
        prev = negated
        if any(neg in word for neg in ["not", "n't", "no"]):
            negation = not negation
        if any(c in word for c in delims):
            negation = False
    return result

def classify():
    global pcount, pos, neg, totals, ncount
    limit = 12500
    # Positive test reviews: count how many the classifier labels positive.
    for file in os.listdir("/Users/Preetham/Desktop/NIE/aclImdb/test/pos")[:limit]:
        words = set(word for word in negate_sequence(open("/Users/Preetham/Desktop/NIE/aclImdb/test/pos/" + file).read()) if word in pos or word in neg)
        if len(words) == 0:
            print("No features to compare on")
        pprob, nprob = 0, 0
        for word in words:
            pp = log(((pos[word] * 1.0) + 1) / (2.0 * totals[0]))
            np = log(((neg[word] * 1.0) + 1) / (2.0 * totals[1]))
            pprob += pp
            nprob += np
        if pprob > nprob:
            pcount += 1
    # Negative test reviews: count how many the classifier labels negative.
    for file in os.listdir("/Users/Preetham/Desktop/NIE/aclImdb/test/neg")[:limit]:
        words = set(word for word in negate_sequence(open("/Users/Preetham/Desktop/NIE/aclImdb/test/neg/" + file).read()) if word in pos or word in neg)
        if len(words) == 0:
            print("No features to compare on")
        pprob, nprob = 0, 0
        for word in words:
            pp = log(((pos[word] * 1.0) + 1) / (2.0 * totals[0]))
            np = log(((neg[word] * 1.0) + 1) / (2.0 * totals[1]))
            pprob += pp
            nprob += np
        if pprob <= nprob:
            ncount += 1
    print("POSITIVE EFFICIENCY")
    print(pcount)
    print((pcount / 12500.0) * 100.0)
    print("NEGATIVE EFFICIENCY")
    print(ncount)
    print((ncount / 12500.0) * 100.0)

classify()