Text Sentiment analysis


Transcript of Text Sentiment analysis

Slide 1

Deep Learning for Sentiment Analysis
Presenter: Hung D. Phan, Institution of Information Technology

Slide 2

Outline

1. Introduction

2. Sentiment analysis approaches

    3. Overview of deep learning for applications.

    4. Deep learning for sentiment detection.

    5. Future research direction

Slide 3

1. Introduction

Each sentence and paragraph contains its own sentiment feature.

For example, with the sentences:

"This is a good movie." → positive comment.

"This movie contains bad words, bad characters and unrelated scenes." → negative comment.

Slide 4

1. Introduction

Slide 5

1. Introduction

Purpose of sentiment detection: classify the comment.

Extract relationships between sentences in a paragraph:

Judgment and evaluation

Emotional state

Intended emotional communication

Slide 6

Outline

1. Introduction

2. Sentiment analysis approaches

    3. Overview of deep learning for applications.

    4. Deep learning for sentiment detection.

    5. Future research direction

Slide 7

2. Sentiment analysis approaches

Issues:

Classifying the polarity of a given text at the document, sentence, or feature/aspect level.

Beyond polarity, sentiment classification looks at emotional states: angry, happy, sad, etc.

Early work on polarity detection:

Peter D. Turney [1]: the classification of a review is predicted by the average semantic orientation of the phrases in the review that contain adjectives or adverbs.

Bo Pang and Lillian Lee [2]: exploiting class relationships for sentiment categorization with respect to rating scales.

Benjamin Snyder and Regina Barzilay [3]: focus on restaurant reviews, analyzing specific aspects of each restaurant.

Slide 8

Peter D. Turney [1]

Purpose: classification of film reviews.

Provides a simple unsupervised learning algorithm for classifying reviews as recommended (thumbs up) or not recommended (thumbs down).

The classification of a review is predicted by the average semantic orientation of the phrases in the review that contain adjectives or adverbs.

In this paper, the semantic orientation of a phrase is calculated as the mutual information between the given phrase and the word "excellent" minus the mutual information between the given phrase and the word "poor".

Slide 9

Peter D. Turney [1]

1) Identify phrases in the input text that contain adjectives or adverbs (using part-of-speech tagging).

2) Estimate the semantic orientation of each extracted phrase (using Pointwise Mutual Information (PMI) and Information Retrieval (IR)).

3) Assign the given review to a class, recommended or not recommended.

Slide 10

PMI-IR method

The Pointwise Mutual Information (PMI) between two words, word1 and word2, is defined as follows (Church & Hanks, 1989):

$$\mathrm{PMI}(word_1, word_2) = \log_2\!\left(\frac{p(word_1 \wedge word_2)}{p(word_1)\,p(word_2)}\right)$$

The Semantic Orientation (SO) of a phrase is calculated here as follows:

$$\mathrm{SO}(phrase) = \mathrm{PMI}(phrase, \text{"excellent"}) - \mathrm{PMI}(phrase, \text{"poor"})$$

The SO is estimated from hits (the number of documents matching a query):

$$\mathrm{SO}(phrase) = \log_2\!\left(\frac{\mathrm{hits}(phrase\ \mathrm{NEAR}\ \text{"excellent"}) \cdot \mathrm{hits}(\text{"poor"})}{\mathrm{hits}(phrase\ \mathrm{NEAR}\ \text{"poor"}) \cdot \mathrm{hits}(\text{"excellent"})}\right)$$
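To make the computation concrete, here is a minimal Python sketch of the hit-count estimate; the 0.01 smoothing constant follows Turney's setup, but the hit counts themselves are invented for illustration:

```python
from math import log2

def semantic_orientation(hits_near_excellent, hits_near_poor,
                         hits_excellent, hits_poor, smoothing=0.01):
    """Hit-count estimate of Semantic Orientation (Turney, 2002).

    The hits_* arguments are assumed to come from a search engine's
    NEAR-query document counts; smoothing avoids division by zero
    for rare phrases.
    """
    return log2(((hits_near_excellent + smoothing) * (hits_poor + smoothing)) /
                ((hits_near_poor + smoothing) * (hits_excellent + smoothing)))

# A review is classified by the average SO of its extracted phrases.
phrase_sos = [semantic_orientation(2, 1, 1000, 800),
              semantic_orientation(0, 3, 1000, 800)]
avg = sum(phrase_sos) / len(phrase_sos)
print("recommended" if avg > 0 else "not recommended")
```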

Slide 11

Peter D. Turney [1]

Slide 12

Peter D. Turney [1]

Disadvantage: the average SO tends to err on the side of guessing that a review is not recommended when it is actually recommended.

Slide 13

Bo Pang and Lillian Lee [2]

Determine the author's evaluation with respect to a multi-point scale (one to five stars).

2 main steps:

Evaluating human performance at the task.

Applying a meta-algorithm, based on a metric-labeling formulation of the problem, that alters a given n-ary classifier's output in an explicit attempt to ensure that similar items receive similar labels.

Slide 14

Bo Pang and Lillian Lee [2]

The idea of metric labeling is provided by Jon Kleinberg and Éva Tardos ([28]).

Extract the cost of the labeling, which represents the error in labelling. The total cost has the standard metric-labeling form (assignment costs plus distance-weighted separation costs):

$$\sum_{x} c\big(x, l(x)\big) + \sum_{(u,v) \in E} w_{uv}\, d\big(l(u), l(v)\big)$$

Metric labeling: minimize the cost.

Slide 15

Bo Pang and Lillian Lee [2]

Explicitly incorporate label similarity information (for instance, "one star" is closer to "two stars" than to "four stars") by thinking of the task as one of metric labeling (Kleinberg and Tardos, 2002), where label relations are encoded via a distance metric.

To detect the similarity between items and labels, three algorithms have been researched, based on Support Vector Machines:

1. One-vs-all

2. Regression

3. Metric labeling

Consider what item similarity measure to apply, proposing one based on positive-sentence percentage.

Slide 16

Bo Pang and Lillian Lee [2]: One-vs-all

Each training point belongs to one of N different classes. The goal is to construct a function which, given a new data point, will correctly predict the class to which the new point belongs [5].

(i) Solve K different binary problems: classify "class k" versus "the rest", for k = 1, ..., K.

(ii) Assign a test sample to the class giving the largest fk(x) (most positive) value, where fk(x) is the solution from the kth problem.

Purpose: classify reviews into output labels (score rank) and evaluate the accuracy.
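As an illustration of one-vs-all, here is a hedged scikit-learn sketch; the toy reviews and star labels are invented, and LinearSVC merely stands in for the SVM setup Pang and Lee used:

```python
# One binary SVM per class; a test point gets the class whose
# decision value f_k(x) is largest.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

reviews = ["a wonderful, moving film", "dull plot and flat acting",
           "decent but forgettable", "an absolute masterpiece"]
stars = [3, 0, 1, 3]  # illustrative labels on a 4-point scale (0..3)

X = TfidfVectorizer().fit_transform(reviews)
clf = OneVsRestClassifier(LinearSVC()).fit(X, stars)
print(clf.predict(X[:2]))  # predicted star ranks for the first two reviews
```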

Slide 17

Slide 18

Bo Pang and Lillian Lee [2]: Regression

The idea is to find the hyperplane that best fits the training data, where training points whose labels are within distance epsilon of the hyperplane incur no loss.

The preference for a label l is the negative of the distance between l and the value predicted for x by the fitted hyperplane function.

Koppel and Schler (2005) found that applying linear regression to classify documents (in a different corpus than ours) with respect to a three-point rating scale provided greater accuracy than OVA SVMs and other algorithms.

Slide 19

Bo Pang and Lillian Lee [2]: Metric labeling

Let d be a distance metric on labels, and let nn_k(x) denote the k nearest neighbors of item x according to some item-similarity function.

Then it is quite natural to pose the problem as finding a mapping of instances x to labels l_x (respecting the original labels of the training instances) that minimizes

$$\sum_{x \in \mathrm{test}} \Big[ \pi(x, l_x) + \alpha \sum_{y \in \mathrm{nn}_k(x)} f\big(d(l_x, l_y)\big) \Big]$$

where π(x, l_x) is derived from the initial classifier's preference for giving x the label l_x, f is a monotonically increasing function, and α trades off the two terms.
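A minimal sketch of this objective, assuming an item-cost matrix `pref`, neighbor lists `nn`, and the absolute-difference label metric; all names and numbers are illustrative:

```python
# Brute-force metric labeling on a tiny example: the total cost is the
# per-item cost plus alpha times the label distances between neighbors.
from itertools import product
import numpy as np

def labeling_cost(labels, pref, nn, alpha=0.5):
    item_cost = sum(pref[i, l] for i, l in enumerate(labels))
    smooth_cost = sum(abs(labels[i] - labels[j])
                      for i in range(len(labels)) for j in nn[i])
    return item_cost + alpha * smooth_cost

pref = np.array([[0.1, 0.9, 0.8],   # classifier cost of each label per item
                 [0.7, 0.2, 0.6],
                 [0.8, 0.5, 0.1]])
nn = [[1], [0, 2], [1]]             # nearest neighbors by item similarity

# Exhaustive search is feasible only for tiny instances like this one.
best = min(product(range(3), repeat=3),
           key=lambda ls: labeling_cost(ls, pref, nn))
print(best)
```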

Slide 20

Bo Pang and Lillian Lee [2]

To detect the similarity between items, a traditional measure uses an overlap-based measure such as the cosine between term-frequency-based document vectors.

Ratings can be determined by the positive-sentence percentage (PSP) of a text, i.e., the number of positive sentences divided by the number of subjective sentences.

Slide 21

Benjamin Snyder and Regina Barzilay [3]

Input: in a restaurant review, such opinions may include food, ambience and service.

Algorithm: the Good Grief algorithm jointly learns ranking models for individual aspects by modeling the dependencies between assigned ranks.

Analyzing meta-relations between opinions, such as agreement and contrast.

Models the dependencies between different labels via the agreement relation.

Slide 22

Benjamin Snyder and Regina Barzilay [3]

The m-aspect ranking model contains m+1 components ((w[1], b[1]), ..., (w[m], b[m]), a). The first m components are individual ranking models, one per aspect; the final component is the agreement model.

Predict a joint rank for the m aspects which satisfies the individual ranking models as well as the agreement model.

The decoder then predicts the m ranks which minimize the overall grief.

Slide 23

Benjamin Snyder and Regina Barzilay [3]

Slide 24

2. Sentiment analysis approaches

Objects to analyze:

Text content (adjectives, adverbs).

The accuracy of the review.

Multiple features/aspects.

Methods:

Extensions of Support Vector Machines.

Unsupervised learning.

Disadvantage: the order of words is ignored and important information is lost.

Slide 25

Outline

1. Introduction

    2. Sentiment analysis approaches

    3. Overview of deep learning for applications.

    4. Deep learning for sentiment detection.

    5. Future research direction

Slide 26

3. Overview of deep learning for applications

Deep learning is a set of algorithms in machine learning that attempt to learn in multiple levels of representation, corresponding to different levels of abstraction. It typically uses artificial neural networks. [11]

Deep learning applications:

Handwriting recognition.

Speech processing.

Slide 27

Neural network

Artificial neural networks are models inspired by animal central nervous systems (in particular the brain) that are capable of machine learning and pattern recognition. They are usually presented as systems of interconnected "neurons" that can compute values from inputs by feeding information through the network.

Main components:

Input, output.

Weights.

Activation function.

Slide 28

The simplest model: the Perceptron

Output: $y = f(\mathbf{w} \cdot \mathbf{x} + b)$, where $f$ is a threshold (step) activation.

Learning: after each training example $(\mathbf{x}, t)$, update the weights with

$$\mathbf{w} \leftarrow \mathbf{w} + \eta\,(t - y)\,\mathbf{x}$$
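A minimal sketch of the perceptron learning rule (the standard algorithm, not code from the slides), here learning the logical AND function:

```python
import numpy as np

def train_perceptron(X, t, eta=0.1, epochs=20):
    w = np.zeros(X.shape[1] + 1)            # last entry is the bias
    Xb = np.hstack([X, np.ones((len(X), 1))])
    for _ in range(epochs):
        for x, target in zip(Xb, t):
            y = 1 if w @ x > 0 else 0       # step activation
            w += eta * (target - y) * x     # perceptron learning rule
    return w

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
w = train_perceptron(X, np.array([0, 0, 0, 1]))
print([(1 if w @ np.append(x, 1) > 0 else 0) for x in X])  # [0, 0, 0, 1]
```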

Slide 29

Activation function

A common choice is the sigmoid (logistic) function $\sigma(z) = 1/(1 + e^{-z})$.

This is similar to the behavior of the linear perceptron in neural networks (see http://en.wikipedia.org/wiki/Linear_perceptron).

However, it is a nonlinear function, which allows such networks to compute nontrivial problems using only a small number of nodes.
Slide 30

Types of Artificial Neural Network:

The feed-forward neural network was the first and arguably most simple type of artificial neural network devised. In this network the information moves in only one direction, forwards: from the input nodes, data goes through the hidden nodes (if any) and to the output nodes.

Recurrent neural networks (RNNs) are models with bi-directional data flow. While a feed-forward network propagates data linearly from input to output, RNNs also propagate data from later processing stages to earlier stages, and can be used as general sequence processors.

Slide 31

The Boltzmann machine

A Boltzmann machine is a network of units with an "energy" defined for the network. It also has binary units, but unlike Hopfield nets, Boltzmann machine units are stochastic. The global energy E in a Boltzmann machine is identical in form to that of a Hopfield network:

$$E = -\Big(\sum_{i<j} w_{ij}\, s_i\, s_j + \sum_i \theta_i\, s_i\Big)$$

Problems:

The time the machine must be run in order to collect equilibrium statistics grows exponentially with the machine's size, and with the magnitude of the connection strengths.

Connection strengths are more plastic when the units being connected have activation probabilities intermediate between zero and one, leading to a so-called variance trap. The net effect is noise which causes the connection strengths to random walk until the activities saturate.

Slide 32

Restricted Boltzmann Machines (RBM)

Boltzmann Machines (BMs) are a particular form of log-linear Markov Random Field (MRF), i.e., one for which the energy function is linear in its free parameters.

Restriction: no intra-layer connections between hidden-hidden and between visible-visible units.

The energy function E(v,h) of an RBM is defined as:

$$E(v, h) = -b^{\top} v - c^{\top} h - h^{\top} W v$$
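A small NumPy check of this energy function, with random parameters standing in for a trained model:

```python
import numpy as np

rng = np.random.default_rng(0)
n_visible, n_hidden = 6, 4
W = rng.normal(size=(n_hidden, n_visible))  # hidden-visible weights
b = rng.normal(size=n_visible)              # visible biases
c = rng.normal(size=n_hidden)               # hidden biases

def rbm_energy(v, h):
    # E(v,h) = -b'v - c'h - h'Wv
    return -b @ v - c @ h - h @ W @ v

v = rng.integers(0, 2, n_visible)  # binary visible configuration
h = rng.integers(0, 2, n_hidden)   # binary hidden configuration
print(rbm_energy(v, h))
```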

Slide 33

Deep learning steps

Two main steps:

1. Pre-train one layer at a time, treating each layer in turn as an unsupervised restricted Boltzmann machine (RBM).

2. Fine-tune using supervised backpropagation.

The resulting model is called a deep belief network, and may be built from other building blocks than RBMs.

Slide 34

Deep belief network training

1. Train the first layer as an RBM that models the raw input x = h(0) as its visible layer.

2. Use that first layer to obtain a representation of the input that will be used as data for the second layer. Two common solutions exist: this representation can be chosen as the mean activations p(h(1) = 1 | h(0)) or as samples of p(h(1) | h(0)).

3. Train the second layer as an RBM, taking the transformed data (samples or mean activations) as training examples (for the visible layer of that RBM).

4. Iterate (2 and 3) for the desired number of layers, each time propagating upward either samples or mean values.

5. Fine-tune all the parameters of this deep architecture with respect to a proxy for the DBN log-likelihood, or with respect to a supervised training criterion (after adding extra learning machinery to convert the learned representation into supervised predictions, e.g. a linear classifier).
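A schematic sketch of steps 1-4 with a toy CD-1 RBM; the class and function names are illustrative, and the contrastive-divergence update is deliberately minimal:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

class RBM:
    def __init__(self, n_visible, n_hidden):
        self.W = rng.normal(0, 0.01, (n_visible, n_hidden))
        self.b = np.zeros(n_visible)   # visible bias
        self.c = np.zeros(n_hidden)    # hidden bias

    def mean_hidden(self, v):          # p(h = 1 | v)
        return sigmoid(v @ self.W + self.c)

    def fit(self, v0, lr=0.1, epochs=5):
        for _ in range(epochs):
            ph0 = self.mean_hidden(v0)
            h0 = (rng.random(ph0.shape) < ph0).astype(float)    # sample h
            v1 = sigmoid(h0 @ self.W.T + self.b)                # reconstruct
            ph1 = self.mean_hidden(v1)
            self.W += lr * (v0.T @ ph0 - v1.T @ ph1) / len(v0)  # CD-1 update
            self.b += lr * (v0 - v1).mean(axis=0)
            self.c += lr * (ph0 - ph1).mean(axis=0)

def pretrain_dbn(x, layer_sizes):
    """Greedy layer-wise pre-training: each layer's mean activations
    become the next layer's 'visible' data (steps 1-4 above)."""
    rbms, data = [], x
    for n_hidden in layer_sizes:
        rbm = RBM(data.shape[1], n_hidden)
        rbm.fit(data)
        data = rbm.mean_hidden(data)
        rbms.append(rbm)
    return rbms  # then fine-tune the whole stack (step 5)

x = rng.integers(0, 2, (100, 20)).astype(float)
stack = pretrain_dbn(x, [16, 8])
```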

Slide 35

3.2. Deep learning application

Handwriting recognition:

The MNIST dataset consists of handwritten digit images and is divided into 60,000 examples for the training set and 10,000 examples for testing.

In Dan Claudiu Ciresan and Ueli Meier [15]:

Multi-layer perceptron (MLP).

Train 5 MLPs with 2 to 9 hidden layers and varying numbers of hidden units. Mostly but not always, the number of hidden units per layer decreases towards the output layer.

Slide 36

3.2. Deep learning application

In [15]:

Slide 37

3.2. Deep learning application

Speech recognition:

In Geoffrey Hinton [17], deep neural networks are used to make acoustic models for speech recognition.

Most current speech recognition systems use hidden Markov models (HMMs) to deal with the temporal variability of speech and Gaussian mixture models (GMMs) to determine how well each state of each HMM fits a frame or a short window of frames of coefficients that represents the acoustic input.

To evaluate the fit: use a feed-forward neural network.

Input: frames of coefficients.

Output: posterior probabilities over HMM states.

Slide 38

Outline

1. Introduction

    2. Sentiment analysis approaches

    3. Overview of deep learning for applications.

    4. Deep learning for sentiment detection.

    5. Future research direction

Slide 39

4. Deep learning for sentiment analysis

General approach: use a semantic word space.

Semantic word spaces have been very useful but cannot express the meaning of longer phrases in a principled way.

Solution: the Sentiment Treebank, with 215,154 phrases in the parse trees of 11,855 sentences.

Recursive Neural Tensor Network: predict the compositional semantic effects present in the new corpus.

Slide 40

4. Deep learning for sentiment analysis

Example of the Recursive Neural Tensor Network accurately predicting five sentiment classes, from very negative to very positive (- -, -, 0, +, + +), at every node of a parse tree, capturing the negation in the sentence.

Slide 41

Recursive Neural Tensor Network (RNTN)

Represent a phrase through word vectors and a parse tree, and then compute vectors for higher nodes in the tree using the same tensor-based composition function.

Related research areas:

Semantic Vector Spaces.

Compositionality in Vector Spaces.

Logical Form.

Deep Learning.

Sentiment analysis.

Slide 42

Semantic Vector Spaces

The dominant approach in semantic vector spaces uses distributional similarities of single words.

Variants of this idea use more complex frequencies, such as how often a word appears in a certain syntactic context (Pado and Lapata, 2007; Erk and Pado, 2008).

To overcome this, the neural word vector approach (Bengio, 2003) has been implemented.

Slide 43

Compositionality in Vector Spaces

Compositionality algorithms and related datasets capture two-word compositions: Mitchell and Lapata (2010) [24] use two-word phrases and analyze similarities computed by vector addition, multiplication and others.

Some related models:

Holographic reduced representations (Plate, 1995 [21]).

Compositional matrix space model (Rudolph and Giesbrecht, 2010).

Slide 44

Compositionality in Vector Spaces

Compositional matrix space model:

Assigns ordinal sentiment scores to phrases.

Accounts for critical interactions among the words in each sentiment-bearing phrase.

The score of a phrase is computed from the product of the matrices of its words, where W_k denotes the matrix representing the k-th word of the phrase.

Slide 45

Compositionality in Vector Spaces

Compositional matrix space model (continued):

Slide 46

Compositionality in Vector Spaces

With the Stanford system:

Recursive neural networks (RNN).

Matrix-vector RNNs.

New algorithm: Recursive Neural Tensor Network (RNTN).

Slide 47

Recursive Neural Model

Translate the input text to vectors.

Compute parent vectors in a bottom-up fashion using different types of compositionality functions g.

Example for the trigram "not very good", with word vectors a ("not"), b ("very"), c ("good"); in the slide's figure the leaves are labeled 0, 0, + and the root is -:

p1 = g(b, c)

p2 = g(p1, a)

Slide 48

Recursive Neural Network

The parent of two children vectors b and c is computed as:

$$p_1 = f\!\left(W \begin{bmatrix} b \\ c \end{bmatrix}\right), \qquad p_2 = f\!\left(W \begin{bmatrix} a \\ p_1 \end{bmatrix}\right)$$

f: the tanh function, a standard element-wise nonlinearity.

Compute the label value of a node vector a by a softmax classifier:

$$y^{a} = \mathrm{softmax}(W_s\, a)$$
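A minimal sketch of this composition for the trigram example, with random W and Ws standing in for trained parameters (dimensions are illustrative):

```python
import numpy as np

d, n_classes = 4, 5
rng = np.random.default_rng(1)
W = rng.normal(0, 0.1, (d, 2 * d))        # composition matrix
Ws = rng.normal(0, 0.1, (n_classes, d))   # sentiment classifier

def compose(left, right):
    return np.tanh(W @ np.concatenate([left, right]))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

a, b, c = rng.normal(size=(3, d))  # vectors for "not", "very", "good"
p1 = compose(b, c)                 # "very good"
p2 = compose(p1, a)                # "not very good"
print(softmax(Ws @ p2))            # distribution over 5 sentiment classes
```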

Slide 49

Slide 50

Recursive Neural Tensor Network (RNTN)

Provide an interaction that allows the model to have greater interactions between the input vectors.

RNTN: the main idea is to use the same, tensor-based composition function for all nodes.

A single layer tensor composition:

$$p = f\!\left(\begin{bmatrix} b \\ c \end{bmatrix}^{\top} V^{[1:d]} \begin{bmatrix} b \\ c \end{bmatrix} + W \begin{bmatrix} b \\ c \end{bmatrix}\right)$$

where $V^{[1:d]}$ is the tensor defining multiple bilinear forms.
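A minimal sketch of the tensor composition, again with random parameters; each slice V[k] contributes one bilinear form to the k-th coordinate of the parent vector:

```python
import numpy as np

d = 4
rng = np.random.default_rng(2)
V = rng.normal(0, 0.1, (d, 2 * d, 2 * d))  # tensor with d slices
W = rng.normal(0, 0.1, (d, 2 * d))         # standard RNN matrix

def rntn_compose(b, c):
    bc = np.concatenate([b, c])
    bilinear = np.array([bc @ V[k] @ bc for k in range(d)])  # [b;c]'V[k][b;c]
    return np.tanh(bilinear + W @ bc)

b, c = rng.normal(size=(2, d))
print(rntn_compose(b, c))
```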

Slide 51

Recursive Neural Tensor Network (RNTN)

Slide 52

Tensor Backprop through Structure

The error as a function of the RNTN parameters θ = (V, W, Ws, L) for a sentence is:

$$E(\theta) = \sum_{i} \sum_{j} t_{j}^{i} \log y_{j}^{i} + \lambda \lVert \theta \rVert^{2}$$

The full derivative for slice V[k] for this trigram tree is then the sum at each node:

$$\frac{\partial E}{\partial V^{[k]}} = \delta^{p_2,k} \begin{bmatrix} a \\ p_1 \end{bmatrix} \begin{bmatrix} a \\ p_1 \end{bmatrix}^{\top} + \delta^{p_1,k} \begin{bmatrix} b \\ c \end{bmatrix} \begin{bmatrix} b \\ c \end{bmatrix}^{\top}$$

Slide 53

Recursive Neural Tensor Network (RNTN)

Slide 54

Stanford Sentiment analysis source code

Libraries are available in Java, C# and Python.

Extract from input text:

POS, NER: CRF tagging.

Parsed sentiment tree.

Online demo:

nlp.stanford.edu:8080/sentiment/rntnDemo.html

http://nlp.stanford.edu/sentiment/treebank.html
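For instance, the Stanza package is one Python route to Stanford-style sentiment; this is an assumption on tooling, since the slides do not name a specific library, and Stanza's sentiment processor is a reimplementation trained on the Sentiment Treebank rather than the original Java RNTN:

```python
import stanza

stanza.download("en")  # fetch English models once
nlp = stanza.Pipeline(lang="en", processors="tokenize,sentiment")

doc = nlp("Stanford University is located in California. "
          "It is a great university, founded in 1891.")
for sentence in doc.sentences:
    # sentiment: 0 = negative, 1 = neutral, 2 = positive
    print(sentence.text, "->", sentence.sentiment)
```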
Slide 55

Stanford Sentiment analysis source code

Input text: "Stanford University is located in California. It is a great university, founded in 1891."

Example token annotation (word, character offsets, POS tag, NER tag):

Stanford  0  8  NNP  ORGANIZATION

Parse tree of the second sentence (truncated in the transcript):

(ROOT (S (NP (PRP It)) (VP ... (JJ great) (NN university)) (, ,) (VP (VBN founded ... (CD 1891)))))) (. .)))

Slide 56

Stanford Sentiment analysis source code

Slide 57

Outline

1. Introduction

    2. Sentiment analysis approaches

    3. Overview of deep learning for applications.

    4. Deep learning for sentiment detection.

    5. Future research direction

Slide 58

5. Future research direction

Overview of deep learning in sentiment detection.

Other sentiment analysis research:

Sentiment Treebank.

Paragraph positive/negative detection.

Research in Vietnamese: the Vietnamese Treebank (VLSP).

Word and phrase processing.

Slide 59

THANK YOU FOR YOUR ATTENTION