Post on 21-Jan-2015
MINING USERS' OPINIONS ON HOTELS
BRIEF RECAP ON CA1
Literature Review / Background
Web is a huge database of opinions on hotels
Commercial Possibilities / Business Intelligence
“What others think” is an important element in decision making
Opinion Mining / Sentiment Analysis
Far From a Solved Problem
Impossible for a human to read every single opinion, but machines can be trained to do this
People often express more than one opinion in a single review
Use of Sarcasm and Negation
Sentiment words change polarity with topic and domain, e.g. "big": positive when the swimming pool is big enough to swim in, negative when it describes a long queue
How to train a machine to analyze sentiments
Natural Language Processing (NLP): transforms opinions into a format the machine can understand
Artificial Intelligence: the machine uses the information produced by NLP and a lot of math to analyze sentiments
Makes the machine distinguish facts from opinions the way a human reader would
Problems of Machine
Subjectivity and Sentiment
Analyze polarity
Opinion rating
Sentiment intensity
Different domains / topic context
Facts Vs Opinion
Examples of ambiguity for a machine
“The swimming pool is better than the tennis court”: comparisons are hard to classify
“This hotel is very boleh lah”: use of slang and culture-specific expressions
“This breakfast is as good as none”: negativity is not obvious to a machine
“The weather is hot”: the statement has a different polarity in different contexts
WHAT IS DONE IN CA1
EXTRACTION – Preparing machine to analyze data
Review and aspects extraction process
Extract important datasets from review websites
Word handling to refine datasets
Use part-of-speech tagging to label the text and extract aspects, which are nouns
Determine aspects / features that people are concerned about from these reviews by occurrence and context
Part of Speech Tagging
Assigning a grammatical label (noun, verb, adjective, ...) to every word in the text so that the machine can process it
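A minimal sketch of how tagged words could be turned into candidate aspects by occurrence. It assumes the review has already been tagged; the `tagged_review` data, the Penn Treebank-style tag names and the `extract_aspects` helper are illustrative assumptions, not the tagger or code used in CA1.

```python
from collections import Counter

# Hypothetical pre-tagged review: (word, tag) pairs with Penn Treebank-style tags.
tagged_review = [
    ("the", "DT"), ("pool", "NN"), ("was", "VBD"), ("clean", "JJ"),
    ("and", "CC"), ("the", "DT"), ("staff", "NN"), ("were", "VBD"),
    ("friendly", "JJ"), ("but", "CC"), ("the", "DT"), ("pool", "NN"),
    ("closed", "VBD"), ("early", "RB"),
]

def extract_aspects(tagged_tokens, min_count=1):
    """Count noun tokens (tags starting with 'NN') as candidate aspects."""
    counts = Counter(
        word.lower() for word, tag in tagged_tokens if tag.startswith("NN")
    )
    # Most frequent nouns first; drop nouns seen fewer than min_count times.
    return [(word, n) for word, n in counts.most_common() if n >= min_count]

aspects = extract_aspects(tagged_review)
```

Raising `min_count` keeps only the aspects that many reviewers mention, which is one simple way to read "by occurrence".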
Word Handling
Dictionary / Spelling Correction
Slang Check
Foreign language check
Singular / Plural conversion
Duplicate check
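The word-handling steps above can be sketched roughly as follows. This covers only the slang check, singular / plural conversion and duplicate check; spelling correction and the foreign language check are left out. The `SLANG` table, the `KEEP_AS_IS` exceptions and the deliberately naive `singularize` rule are assumptions made for illustration.

```python
import re

# Hypothetical slang table; a real one would be much larger.
SLANG = {"gr8": "great", "rm": "room"}
# Words ending in "s" that the naive rule below must not touch.
KEEP_AS_IS = {"this", "always", "yes", "bus"}

def singularize(word):
    """Very naive plural-to-singular conversion (an assumption, not a full stemmer)."""
    if word in KEEP_AS_IS:
        return word
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word

def normalize(review):
    """Lowercase, tokenize, replace slang, then convert plurals to singular."""
    tokens = re.findall(r"[a-z0-9']+", review.lower())
    return [singularize(SLANG.get(t, t)) for t in tokens]

def dedupe(reviews):
    """Keep the first review of each group that normalizes to the same tokens."""
    seen, unique = set(), []
    for review in reviews:
        key = tuple(normalize(review))
        if key not in seen:
            seen.add(key)
            unique.append(review)
    return unique

tokens = normalize("The rooms were gr8")
unique_reviews = dedupe(["Great rooms!", "great room", "Bad pool"])
```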
END OF CA1
CA2 : Data Processing
Classifying Sentiments using some existing methods
Naïve Bayes: determines the polarity of sentiments
Maximum Entropy: uses probability distributions built on the basis of partial knowledge
Support Vector Machine: analyzes patterns and classifies sentiments
Naïve Bayes Classifier
To determine polarity of sentiments
P(X | Y) = P(X)P(Y | X) / P(Y)
Probability that a sentiment is positive or negative, given its contents
Probability of a word occurring given a positive or negative sentiment
Assumption: words are independent of each other given the sentiment
P(sentiment | sentence) = P(sentiment)P(sentence | sentiment) / P(sentence)
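A minimal sketch of a Naïve Bayes polarity classifier built on the formula above. Because of the independence assumption, P(sentence | sentiment) is taken as the product of the per-word probabilities, and P(sentence) is dropped since it is the same for every sentiment. The training sentences, the `pos` / `neg` labels and the add-one (Laplace) smoothing are assumptions for illustration.

```python
import math
from collections import Counter, defaultdict

def train(examples):
    """examples: list of (sentence, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for sentence, label in examples:
        label_counts[label] += 1
        for word in sentence.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return label_counts, word_counts, vocab

def classify(sentence, model):
    """Return the label maximizing log P(label) + sum of log P(word | label)."""
    label_counts, word_counts, vocab = model
    total = sum(label_counts.values())
    scores = {}
    for label in label_counts:
        score = math.log(label_counts[label] / total)      # prior P(sentiment)
        n_words = sum(word_counts[label].values())
        for word in sentence.lower().split():
            # Add-one smoothing so unseen words do not zero out the product.
            p = (word_counts[label][word] + 1) / (n_words + len(vocab))
            score += math.log(p)
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical training data.
model = train([
    ("the room was clean and great", "pos"),
    ("friendly staff and great pool", "pos"),
    ("the room was dirty", "neg"),
    ("rude staff and terrible breakfast", "neg"),
])
```

With this toy data, `classify("great pool", model)` favours `pos` and `classify("dirty room", model)` favours `neg`.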
Problems with Naïve Bayes
Learned polarity does not change with domain
Words within a sentiment are assumed to have no relationship with each other
Words not found in the lexicon might be missed by Naïve Bayes, resulting in inaccurate polarity
No opinion rating to determine which sentiment is more polar
Solutions to the Naïve Bayes problems
Establish domain-sentiment relations
Establish domain-aspect relations
Establish aspect-sentiment relations
Estimate polarity for unseeded sentiments
Estimate strength of polarity on sentiments
Establishing relations
Establish domains by categorizing the aspects found into domains such as food, location and security
Finding occurrence of aspects / sentiments within sentences for a particular domain
Finding polarity of sentences, aspects and sentiments and establishing relations
[Diagram: relations between Domain, Aspects and Sentiments]
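The relation-building steps above could be sketched as a co-occurrence count: whenever an aspect of a domain and a sentiment word appear in the same sentence, the (domain, aspect, sentiment) relation is strengthened. The `DOMAIN_ASPECTS` and `SENTIMENT_WORDS` tables and the whitespace tokenization are assumptions for illustration.

```python
from collections import defaultdict
from itertools import product

# Hypothetical domain -> aspects mapping and sentiment vocabulary.
DOMAIN_ASPECTS = {
    "food": {"breakfast", "buffet"},
    "location": {"beach", "airport"},
}
SENTIMENT_WORDS = {"good", "cold", "far", "close"}

def build_relations(sentences):
    """Count sentence-level co-occurrences of (domain, aspect, sentiment)."""
    edges = defaultdict(int)
    for sentence in sentences:
        tokens = set(sentence.lower().split())
        sentiments = SENTIMENT_WORDS & tokens
        for domain, aspects in DOMAIN_ASPECTS.items():
            for aspect, sentiment in product(aspects & tokens, sentiments):
                edges[(domain, aspect, sentiment)] += 1
    return dict(edges)

relations = build_relations([
    "the breakfast was cold",
    "breakfast buffet was good",
    "the beach is close",
])
```

The resulting counts can serve as edge weights in the graph of sentiment and aspect nodes.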
Finding polarity for unseeded sentiments
After establishing relations, we have a graph of nodes (Sentiments / Aspects)
Some nodes have no polarity after Naïve Bayes, but their connected nodes might have polarity
Determine the probability that the node is positive or negative given its surrounding nodes
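One simple way to estimate this probability, shown here only as a sketch, is the weighted share of positive nodes among the node's labelled neighbours. The graph layout (`node -> {neighbour: weight}`), the example weights and the `neighbour_polarity` helper are assumptions, not the method fixed by the slides.

```python
def neighbour_polarity(node, graph, polarity):
    """Estimate P(positive) for an unseeded node from its labelled neighbours.

    Returns None when no neighbour carries a polarity.
    """
    pos = neg = 0.0
    for neighbour, weight in graph.get(node, {}).items():
        label = polarity.get(neighbour)
        if label == "pos":
            pos += weight
        elif label == "neg":
            neg += weight
    if pos + neg == 0:
        return None
    return pos / (pos + neg)

# Hypothetical graph: edge weights are co-occurrence counts.
graph = {"spacious": {"room": 3, "clean": 2, "noisy": 1}}
polarity = {"clean": "pos", "noisy": "neg"}   # "room" is an unlabelled aspect

p_positive = neighbour_polarity("spacious", graph, polarity)
```

Here the positive neighbour has weight 2 and the negative one weight 1, so the estimate leans positive.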
Estimating the strength of polarity
Determine the strength of polarity of an unseeded node from the number of traversals needed to reach it from surrounding nodes that have polarity
Find the shortest paths to the unseeded nodes, which together form a spanning tree
The length of the shortest path determines the strength of polarity
Implementation
Using Dijkstra's algorithm to find the spanning tree
Implementation
Find the cost to get from surrounding nodes to an unseeded node
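A minimal sketch of Dijkstra's algorithm started from the seeded nodes. It returns the cost of reaching every node and the parent links that form the shortest-path tree. The example graph, its edge costs and the `1 / (1 + cost)` strength formula are assumptions chosen for illustration; the slides do not specify how cost maps to strength.

```python
import heapq

def dijkstra(graph, sources):
    """Shortest-path costs and tree parents from a set of seeded nodes."""
    dist = {s: 0.0 for s in sources}
    parent = {s: None for s in sources}
    heap = [(0.0, s) for s in sources]
    heapq.heapify(heap)
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry, a shorter path was already found
        for v, cost in graph.get(u, {}).items():
            nd = d + cost
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                parent[v] = u
                heapq.heappush(heap, (nd, v))
    return dist, parent

def strength(cost):
    """Hypothetical mapping: the further from a seeded node, the weaker the polarity."""
    return 1.0 / (1.0 + cost)

# Hypothetical graph: node -> {neighbour: traversal cost}.
graph = {
    "clean": {"room": 1.0, "spacious": 4.0},
    "room": {"clean": 1.0, "spacious": 2.0},
    "spacious": {"clean": 4.0, "room": 2.0, "airy": 1.0},
    "airy": {"spacious": 1.0},
}

dist, parent = dijkstra(graph, ["clean"])   # "clean" is the seeded node
airy_strength = strength(dist["airy"])
```

The `parent` links describe the spanning tree: following them from an unseeded node leads back to the seeded node that reaches it most cheaply.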
END OF CA2
What is going to happen in CA3?
Prototyping
Refining parameters to come up with a prototype, mainly to solve the following problems:
Analyze polarity
Opinion rating
Sentiment intensity
Different domains / topic context
Manually analyze reviews and check the prototype against them for effectiveness, seeking to improve accuracy
Prototype testing
Enlarging the dataset using various hotel review sites
Merging results to find correlations between sentiment expressions on different sites
Testing on a different domain, such as food, to get domain-dependent results