8/10/2019 Data Crackers YELP
Data Mining on YELP Dataset
Advisor - Duc Tran Thanh
Team - Data Crackers
Prashanth Sandela
Vimal Chandra Gorijala
Parineetha Gandhi Tirumali
Table of Contents

1. Project Vision
2. Data Mining Task
   2.1. Data Mining Problem
   2.2. Evaluation Metrics
3. Hypothesis
4. Data Processing
   4.1. Data
   4.2. Initial Dataset
   4.3. Data Quality Problems
   4.4. Data Processing Tasks
   4.5. Resulting Dataset
5. Feature Selection
   5.1. Dataset
   5.2. Rationales Behind Feature Selection
   5.3. Feature Selection Tasks
   5.4. Selected Features
6. Model Development and Tuning by Prashanth Sandela
   6.1. Implementation of Own Model (Naïve Bayes)
   6.2. Naïve Bayes Multinomial Classification Model
   6.3. Experimental Results
7. Model Development and Tuning by Vimal Chandra Gorijala
   7.1. Naïve Bayes Multinomial Model
   7.2. Naïve Bayes Multinomial Text Model
   7.3. Results Comparison
8. Model Development and Tuning by Parineetha Gandhi
   8.1. K Nearest Neighbors Model
   8.2. Decision Tree
9. Main Findings in the Project
10. Results and Comparison
11. Project Management
12. List of Queries
1. Project Vision

In today's fast-growing world there are many businesses: startups, growing companies, and well-established firms. For every business, its rating is vital to its survival in the market. This rating is given by users who consume the goods and services of a business. A user expresses his experience with a business in the form of reviews and star ratings through many platforms, the most famous of which is YELP. A review can be positive, negative or neutral. The aim of our project is to build a classifier that classifies any given review into star-rating labels (-1, 0 and 1). We planned to use various data mining models to classify reviews into user star-rating labels, applying various model-tuning techniques to attain optimal classification accuracy.
2. Data Mining Task

2.1. Data Mining Problem

The data mining task we are trying to solve is multi-class classification. The classes we used in this project are -1 (negative), 0 (neutral) and 1 (positive).
2.2. Evaluation Metrics

The following are the evaluation metrics we used to assess the quality of the solution.

1) Accuracy (percentage of correctly classified instances): this metric is appropriate because it tells us directly how the model is performing, but we cannot rely on it alone.

2) ROC value: this measures the trade-off between the true-positive rate and the false-positive rate, and is one of the most important values output by WEKA. An "optimal" classifier has ROC values approaching 1, while 0.5 is comparable to random guessing.

Combining the above metrics, we can assess the performance of the model and attain the best results.
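Both metrics can be sketched in a few lines. This is an illustrative implementation, not WEKA's; the function names are our own, and the ROC value is computed as the binary rank statistic (the probability that a random positive instance outranks a random negative one), whereas WEKA reports a per-class variant for multi-class problems.

```python
def accuracy(y_true, y_pred):
    """Fraction of correctly classified instances."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def roc_auc(y_true, scores):
    """Binary AUC via the rank-sum statistic: probability that a
    random positive instance scores higher than a random negative one."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t != 1]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [1, 1, -1, -1]
print(accuracy(y_true, [1, -1, -1, -1]))        # 0.75
print(roc_auc(y_true, [0.9, 0.4, 0.6, 0.1]))    # 0.75
```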
3. Hypothesis

Since this is classification based on text, the words in the reviews are the features we should use to classify them correctly. For example, a review containing words like "good", "excellent", "awesome" or "yumm food" should be classified into the positive class. We planned to concentrate on such words and apply transformations like stop-word removal and stemming, with the help of different tools, to make the best possible use of them, so that each review can be classified correctly. We intended to concentrate mainly on Bayesian algorithms, as they perform well on text classification.

We also intended to use combinations of words called bigrams, for example "very good" or "yum yum". A user's view of a business is mostly expressed in combinations of words, so we expected that using bigrams would give the model good accuracy.

Other features such as business id and user id can individually improve accuracy, but they should not be used in combination.

We discuss in the results below how model learning is affected by the approaches in this hypothesis.
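The bigram idea from the hypothesis can be sketched with a generic n-gram helper (not tied to any particular library; the function name is our own):

```python
def ngrams(tokens, n):
    """All contiguous n-word sequences from a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the food was very good".split()
print(ngrams(tokens, 2))
# ['the food', 'food was', 'was very', 'very good']
```

With n=2 this captures sentiment-bearing pairs such as "very good" that unigrams would split apart.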
4. Data Processing

4.1. Data

We obtained the data from http://www.yelp.com/dataset_challenge. It covers 40,000 businesses, 1.3 million reviews and 250,000 users. The data was in JSON format; we pre-processed it and converted it into CSV format to obtain the review text, class labels and other features. There were many irrelevant fields, such as neighborhoods and votes; we removed all of them and kept only the required ones. Initially the reviews carry star-rating class labels from 1 to 5, which we reduced as follows: 1 and 2 map to negative (-1), 3 to neutral (0), and 4 and 5 to positive (1). The figure below shows the distribution of all the reviews over the modified class labels.
Figure: number of reviews per class label.
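The label reduction described above amounts to a small mapping; the function name here is our own:

```python
def star_to_label(stars):
    """Collapse 1-5 star ratings into three class labels."""
    if stars <= 2:
        return -1   # 1 and 2 stars -> negative
    if stars == 3:
        return 0    # 3 stars -> neutral
    return 1        # 4 and 5 stars -> positive

print([star_to_label(s) for s in [1, 2, 3, 4, 5]])
# [-1, -1, 0, 1, 1]
```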
4.2. Initial Dataset

The initial YELP dataset consists of data about businesses, users and reviews. Below is a snapshot of the dataset in JSON format.

Business:
{
  'type': 'business',
  'business_id': (encrypted business id),
  'name': (business name),
  'neighborhoods': [(hood names)],
  'full_address': (localized address),
  'city': (city),
  'state': (state),
  'latitude': (latitude),
  'longitude': (longitude),
  'stars': (star rating, rounded to half-stars),
  'review_count': (review count),
  'categories': [(localized category names)],
  'open': True / False,
}

User:
{
  'type': 'user',
  'user_id': (encrypted user id),
  'name': (first name),
  'review_count': (review count),
  'average_stars': (floating point average, like 4.31),
  'votes': {(vote type): (count)},
  'friends': [(friend user_ids)],
  'elite': [(years_elite)],
  'yelping_since': (date, formatted like '2012-03'),
  'compliments': {(compliment_type): ...},
}

Review:
{
  'type': 'review',
  'business_id': (encrypted business id),
  'user_id': (encrypted user id),
  'stars': (star rating, rounded to half-stars),
  'text': (review text),
  'date': (date, formatted like '2012-03-14'),
  'votes': {(vote type): (count)},
}
Figure: review counts by class label: Negative 213,509; Neutral 163,761; Positive 748,188.
4.3. Data Quality Problems

The dataset has several quality issues:

1) Presence of unwanted columns, and the need to merge the files in the dataset
2) Special characters
3) Numeric data
4) Characters from other languages
5) Stop words
6) business_id, review_id and user_id are hash values, which occupy a lot of space
4.4. Data Processing Tasks

4.4.1. Removing Unwanted Columns and Merging the Files

Among all the columns, we kept only business_id, user_id, review_id, review_text, review_count and stars. The three files were then combined into a single dataset with only these attributes. To accomplish this, the entire dataset (all three files) was first converted from JSON to CSV using a Python script; the resulting datasets were then combined using the ETL tool Pentaho.

[Figure: ETL mapping in Pentaho]
4.4.2. Removing Special Characters, Numbers and Other-Language Words

The main field of interest is review_text, the text a user entered as a review of a business, alongside the stars. It contains special characters, new-line characters and other-language characters, which were removed by the PHP script below.
4.4.3. Removing Stop Words and Converting to Lower Case

Stop words are words whose removal does not change the meaning or weight of a sentence, so removing them decreases the number of tokens. Converting all text to lower case makes comparing two words straightforward, since both are in the same case. The PHP script below performs the operations of Sections 4.4.2 and 4.4.3.
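In Python, the same cleaning steps look roughly like this. The stop-word list here is a tiny illustrative subset, not the full list the project used:

```python
import re

# Illustrative subset only; the project used a much larger stop-word list.
STOP_WORDS = {"the", "is", "a", "an", "and", "was"}

def clean_review(text):
    """Lowercase, strip non-letter characters, and drop stop words."""
    text = re.sub(r"[^a-z ]", " ", text.lower())
    tokens = [w for w in text.split() if w not in STOP_WORDS]
    return " ".join(tokens)

print(clean_review("The food was GREAT!!! 10/10"))
# food great
```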
4.5. Resulting Dataset

The resulting dataset consists of business_id, review_id, user_id, stars and review_text in CSV format. A sample of the file is shown below.
nYer89hXYAoddMEKTxw7kA,k2u1F6spBGhgk2JtAe97QA,HeDqdFYkKaeDvPtiFy6Xmw,event favorite event a long time lindsey a fabulous job setting up keeping movie played completely hush hush absolutely love filmbar always a great beer wine selection wonderful staff a wacky selection art film movie night wayne world one time favorites naturally beyond thrilled invited a super foxey date a guy a delicious dog short leash a fabulous time thank lindsey filmbar yelp a fantastic evening party time excellent ,5
nYer89hXYAoddMEKTxw7kA,hdZ3rlgFXctCOUhzoOebvA,XXblLOSqYlq0tXhxHfXUHQ,great time funny movie loved going film bar first time tables eat fantastic ,4
nYer89hXYAoddMEKTxw7kA,usQTOj7LQ9v0Fl98gRa3Iw,2fPxXAysOrZLrahZQyJCNg,wayne world short leash filmbar need more a great tuesday night adventure wayne world took waaaaaay back favorite aiko a chicken dog short leash ages great yelp coordinated outing usual thanks lindsey yelp crew thanks kelly a staff filmbar place bomb day week thanks brad kat short leash bunch having trailer available event ,5
nYer89hXYAoddMEKTxw7kA,XTFE2ERq7YvaqGUgQYzVNA,OO6prfuGEMalQcQcU3WCaw,fab concept test out a new well new hadn t film bar previously independent cinema drink a generously gratis beer wine choosing short lease hot dogs oh conveniently parked outside plus samples appetizers heck yes add excitement anticipation knowing filmed az movie going shown a perfect weeknight event now know movies don t actually based arizona awesome ideas next mention great filmbar date ideas online dating attempts pan out thanks film bar yelp short lease fun fellow yelpers a great time ps post movie trivia answers ,5
// Read reviews from MySQL, strip unwanted characters and stop words,
// and append the cleaned rows to a CSV file in batches of 1000.
$result = mysql_query("SELECT business_id, user_id, review_id, text, stars, review_count FROM reviews");
$i = 0;
$optString = "";
while ($rows = mysql_fetch_array($result)) {
    $i++;
    // Keep only ASCII letters and spaces; everything else (including
    // newlines) becomes a space.
    $text = preg_replace("/[^a-z ]/i", " ", $rows['text']);
    // Remove stop words and convert to lower case.
    $words = explode(" ", $text);
    $processed_text = "";
    foreach ($words as $s) {
        $s = strtolower($s);
        // in_array avoids the pitfall of array_search returning index 0,
        // which compares loosely equal to false.
        if ($s !== "" && !in_array($s, $stopWords))
            $processed_text .= $s . " ";
    }
    $optString .= "'" . $rows['business_id'] . "',";
    $optString .= "'" . $rows['user_id'] . "',";
    $optString .= "'" . $rows['review_id'] . "',";
    $optString .= "'" . $processed_text . "',";
    $optString .= $rows['stars'] . ",";
    $optString .= $rows['review_count'];
    $optString .= "\n";
    if ($i % 1000 == 0) {
        // Flush the batch to disk.
        $fd = fopen("reviews_DetailedStopWords.csv", "a+");
        fwrite($fd, $optString);
        fclose($fd);
        $optString = "";
        echo "$i\n";
    }
}
5. Feature Selection

5.1. Dataset

After data preprocessing, the dataset is in clean, structured CSV format with the required columns. This file is loaded into HDFS, and a `reviews` table is created over the dataset. This table is used for feature selection.
5.2. Rationales Behind Feature Selection

Now that the content is processed, the next task is to reduce the size of the dataset by replacing the hash values of business_id, review_id and user_id with unique numeric identifiers. We create new tables to store these mappings, so that the original id values are not lost.

Our final aim is to classify stars based on the reviews, so we narrowed down stars 1-5 into three classes: positive, negative and neutral. Reviews with 1 or 2 stars fall under negative, reviews with 4 or 5 stars under positive, and 3-star reviews under neutral.

The essential feature for classifying a review is review_text. We tested classification of 25,000 reviews in WEKA on unigrams, bigrams and trigrams, using 66% of the data for training the model and 34% for testing. The results showed that n-grams gave significantly higher correctness, so based on this experiment we planned to use n-grams to train the model.
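The n-gram frequency counting behind this experiment can be sketched in Python. This is toy data; the project kept the top 2000 n-grams, here we keep the top 3:

```python
from collections import Counter

def top_ngrams(reviews, n, k):
    """Count all n-grams across the reviews and keep the k most frequent."""
    counts = Counter()
    for review in reviews:
        tokens = review.split()
        counts.update(" ".join(tokens[i:i + n])
                      for i in range(len(tokens) - n + 1))
    return counts.most_common(k)

reviews = ["great food great service", "great food bad service"]
print(top_ngrams(reviews, 2, 3))   # 'great food' appears twice, the rest once
```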
We planned to use 67% of the data to train the model and 33% for testing. We removed stop words in the data preprocessing step. Review text is entered by end users of the YELP application, so spelling mistakes are highly likely, and users express their feelings in various ways: some may type "gooood" instead of "good", or "coooool" instead of "cool". When we calculate term frequencies, such variants risk being ignored, so the words need to be normalized: lemmatization reduces a word to its dictionary form, while stemming strips it down to a root word.

For this phase we implemented the Lovins stemmer algorithm. A UDF was created for it, which takes the complete text as input, processes it, and produces the output accordingly. After this phase, we divided the table into unigrams, bigrams and trigrams and calculated word frequencies.
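To illustrate what the stemming UDF does, here is a toy suffix-stripping stemmer. This is only a sketch: the real Lovins algorithm uses a large table of endings plus recoding rules, which this does not implement.

```python
# Illustrative suffix list only; not the Lovins ending table.
SUFFIXES = ["ation", "ness", "ing", "ers", "ed", "ly", "s"]

def toy_stem(word):
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[: -len(suf)]
    return word

print([toy_stem(w) for w in ["eating", "eaters", "eats", "happily"]])
# ['eat', 'eat', 'eat', 'happi']
```

The point of the real stemmer is the same: variants like "eating", "eaters" and "eats" collapse onto one term, so their frequencies accumulate.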
$> hadoop fs -put review.csv
$> hive
HIVE> CREATE TABLE reviews (business_id STRING, review_id STRING, user_id STRING, review_text STRING, stars INT)
      ROW FORMAT DELIMITED
      FIELDS TERMINATED BY ','
      LINES TERMINATED BY '\n';
HIVE> LOAD DATA INPATH 'review.csv' INTO TABLE reviews;
HIVE> SELECT * FROM reviews LIMIT 10;
/* Displays the columns in the correct format */
5.3. Feature Selection Tasks

5.3.1. Assigning Numeric Ids to Key Attributes

Below are the queries used to assign numeric ids to the key attributes business_id, user_id and review_id.

5.3.2. Narrowing Stars

Narrowing stars means converting 1 and 2 stars to negative (-1), 3 stars to neutral (0), and 4 and 5 stars to positive (1).

5.3.3. Removing Stop Words

Stop words were removed in the data processing phase.

5.3.4. Stemming Review Text

We created a UDF `lovinsStemmer()` based on the Lovins stemming algorithm provided by the University of Waikato (http://www.cs.waikato.ac.nz/~eibe/stemmers/). After applying stemming, we removed some newly generated stop words using the UDF `stopWords()`. Below are the queries for these tasks.
HIVE> CREATE TABLE business AS
      SELECT DISTINCT id, business_id
      FROM (SELECT RANK() OVER (ORDER BY business_id) AS id,
                   business_id
            FROM reviews) a;

HIVE> CREATE TABLE users AS
      SELECT DISTINCT id, user_id
      FROM (SELECT RANK() OVER (ORDER BY user_id) AS id,
                   user_id
            FROM reviews) a;

HIVE> CREATE TABLE processed_reviews AS
      SELECT RANK() OVER (ORDER BY business_id) AS business_id,
             RANK() OVER (ORDER BY user_id)     AS user_id,
             RANK() OVER (ORDER BY review_id)   AS review_id,
             review_text,
             stars
      FROM reviews;

HIVE> CREATE TABLE processed_stars_reviews AS
      SELECT business_id, review_id, user_id, review_text,
             CASE WHEN stars = 1 OR stars = 2 THEN -1
                  WHEN stars = 3 THEN 0
                  WHEN stars = 4 OR stars = 5 THEN 1
             END AS stars
      FROM processed_reviews;

HIVE> CREATE TABLE stemmed_stars_reviews AS
      SELECT review_id,
             stopWords(lovinsStemmer(review_text)) AS review_text,
             stars
      FROM processed_stars_reviews;
5.3.5. Dividing Training and Test Data

The total number of reviews in the dataset is 1,125,458. 67% of them, i.e. 754,056 reviews, are used for training the model and the remaining 371,402 for testing it. Since we had already created unique review_ids from 1 to 1,125,458, the split is straightforward; the query is shown below.
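A small sketch of this id-based split, assuming sequential review ids from 1 to 1,125,458. Note that the Hive predicate puts id 754,056 itself in the test set, so the counts differ by one from the figures in the prose:

```python
TOTAL = 1_125_458
CUTOFF = 754_056

train_ids = range(1, CUTOFF)          # WHERE review_id < 754056
test_ids = range(CUTOFF, TOTAL + 1)   # WHERE review_id >= 754056

print(len(train_ids), len(test_ids))  # 754055 371403
```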
5.3.6. Classification Using WEKA

In the initial report, we used the StringToWordVector filter, which uses WordTokenizer to convert the review text into vectors, and applied the Naïve Bayes Multinomial classification algorithm. We got 66.7% of instances correctly classified. The data sample had 5,000 instances, with 66% used as training data and 34% as testing data.

In this phase we applied the NGram tokenizer, which converts the text into n-grams (unigrams, bigrams, trigrams). On top of this we also applied the Attribute Selection filter, which uses the InfoGainAttributeEval function to evaluate the worth of each attribute by measuring its information gain with respect to the class, and got 70.02% of instances correctly classified. This data sample had 25,000 instances, with 66% used for training and the rest for testing.
HIVE> CREATE TABLE test_data AS
      SELECT * FROM processed_stars_reviews
      WHERE review_id >= 754056;

HIVE> CREATE TABLE training_data AS
      SELECT * FROM processed_stars_reviews
      WHERE review_id < 754056;
5.3.7. Generating N-grams

N-gram generation is a predefined function in Hive. We use it to build unigrams, bigrams and trigrams and to calculate their frequencies, selecting the top-2000 word list. Below are the queries to create the unigram, bigram and trigram tables.

HIVE> CREATE TABLE unigrams AS
      SELECT ngrams(sentences(review_text), 1, 2000, 2000) FROM training_data;

HIVE> CREATE TABLE bigrams AS
      SELECT ngrams(sentences(review_text), 2, 2000, 2000) FROM training_data;

HIVE> CREATE TABLE trigrams AS
      SELECT ngrams(sentences(review_text), 3, 2000, 2000) FROM training_data;

5.4. Selected Features

These are the selected features for our classification:

1) Business id
2) User id
3) Bigrams
4) Review text

6. Model Development and Tuning by Prashanth Sandela

6.1. Implementation of Own Model (Naïve Bayes)

6.1.1. Idea Behind Developing the Model in Hive

I developed my own model for classifying star ratings based on text. For this model, I considered the features business id, user id, review id, review text and stars. The implementation is based on the probability of each n-gram given either business id or user id, computed on the training data, and
the model is then applied to the test data to classify each review's stars as -1, 0 or 1, i.e. negative, neutral or positive. This model achieved an accuracy of 69.5%.
6.1.2. Model Development and Description

This model was developed purely with Hive queries on Amazon Web Services, storing data in S3 and developing and deploying on Elastic MapReduce with 3 EC2 instances. The model runs on the entire YELP dataset of 1.3 million records, with 67% training data and 33% test data.

Steps followed to develop the model:

1. Divide training and test data
2. From the training data, find each n-gram, its frequency, star and probability
3. From the test data, find review id, n-gram and frequency
4. Score the test data against the trained statistics
5. Compare test and training dataset words, including a few other features
6. Retrieve the percentage match of training and test data

The queries Bigrams1, Numerics, Bigrams_1, Bigrams_stag_1 and Bigrams_stag_2 were used to design the model. This is the final model after tuning. Below is an example of the model's implementation, starting with classification on unigrams.
Training Data:

Word       Frequency  Star  Probability
Good       100        1     0.33
Excellent  50         1     0.16
Bad        100        -1    0.5
Good       10         -1    0.05
Nice       15         0     0.1

Total Stars:

Stars  Total Count
1      300
-1     200
0      150

The queries bigrams_test_1_1 and stats are used to compute the results. In the example above, I calculated the probability of each word as its frequency divided by the total word count of its class. The word "Good" appears under both the 1 and -1 star ratings; based on its probabilities, "Good" will be classified as +1. Below is how a review is classified based on its text.
Test Data:

Review_id  Word   Count in reviews  New_Star  Original_Star
1          Good   10                1         1
1          Bad    5                 -1        1
2          Bad    30                -1        -1
2          Worst  40                -1        -1

For review id 1, the word counts are weighed against the word probabilities, and the review is classified with star rating 1. Similarly, review id 2 is classified with star rating -1.
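The worked example above can be turned into a small sketch of the probability model. The probabilities are copied from the training table; a word unseen in training, such as "worst", simply contributes nothing in this sketch:

```python
# word -> {class: probability}, taken from the training table above
word_prob = {
    "good":      {1: 0.33, -1: 0.05},
    "excellent": {1: 0.16},
    "bad":       {-1: 0.5},
    "nice":      {0: 0.1},
}

def classify(word_counts):
    """Assign the class whose count-weighted word-probability mass is largest."""
    scores = {-1: 0.0, 0: 0.0, 1: 0.0}
    for word, count in word_counts.items():
        for label, p in word_prob.get(word, {}).items():
            scores[label] += count * p
    return max(scores, key=scores.get)

# Review 1 from the test table: 10x "good", 5x "bad"
print(classify({"good": 10, "bad": 5}))    # 1
# Review 2: 30x "bad", 40x "worst"
print(classify({"bad": 30, "worst": 40}))  # -1
```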
I considered only businesses with at least 10 reviews and users with at least 10 reviews. I divided the data at each business and user level. For example, if there are 100 reviews for a specific business, then 66 reviews go to training and the rest to test data when I consider business id as a feature. This division happens for every business id. A similar process is repeated for user id, and also when the features business id and user id are considered together.

In Hive, I was not able to compute the ROC measure for the result metrics.
6.1.3. Model Tuning

The dataset supplied to this model had stop words removed and the text enriched. The model tuning proceeded as follows:

1) Refining and sampling of training and test data
Initially I simply cut the dataset down to 100,000 records, taking the first 67,000 as training data and 33,000 as test data. I realized that more than 90% of the records in the training data were positive. So I split the data into random samples, which improved the accuracy by nearly 7%, from 43% to ~51%.

2) Change of stemmer
Initially I used the Lovins stemmer; after some research I found that the Porter stemmer performs better. I used a Java program to implement this stemmer, which improved accuracy by ~0.5%.

3) N-grams and frequency counts
Using different n-grams changed the accuracy, with bigrams showing the best results. Furthermore, there was a slight increase in accuracy when I considered a term frequency of 5000. This tuning increased accuracy by 4%.

4) Determining the best approach to increase accuracy
Before arriving at the probability model, I tried various other approaches, such as sum and count models, which did not help much in producing accurate results. Using the probability model increased accuracy significantly.

5) Change of features
We have two more features: business id and user id. When I used business id or user id as an extra feature, accuracy increased significantly. But when I used business id and user id together, accuracy actually dropped. This makes sense: using both features means the model searches for matching business id and user id pairs when classifying stars, and since such combinations are nearly unique, the accuracy was reduced.

6) Applying the model to the overall dataset
When I ran the overall dataset at a ratio of 67% training to 33% test data, I got an accuracy of ~74%. Accuracy on the sample data was ~71%, so it is a bit higher on the full data.
6.1.4. Pros and Cons

a) Hive is an SQL-like language, so the model is easy to implement.
b) The main advantage of this model is that we can tune it to any extent.
c) There is no limitation on the size of the data or the number of fields.
d) It never runs out of memory.
e) The model can be run on a cluster using all the required resources.
f) It is difficult to design and implement this model: any change requires considering many downstream effects, e.g. if a query is changed, what might be the effect on the result? One must be very careful while making changes.
g) Hive provides many predefined functions, and new extensions can easily be accommodated by designing a UDF (User Defined Function).
6.2. Naïve Bayes Multinomial Classification Model

6.2.1. About the Model

I used the WEKA data modeling tool to classify stars using the Naïve Bayes Multinomial model, which is available in its list of Bayes models. WEKA provides pre-defined models implemented with many filters and features.
6.2.2. Model Tuning

Tuning was performed on a dataset of 100,000 records with 67% training data and 33% test data.

1) Initial accuracy was around 47% without any tuning.
2) I supplied a new stop-word list rather than the default one; there was a slight accuracy increase of roughly 0.5%.
3) Using n-grams instead of the WordTokenizer showed better accuracy.
4) Among the n-grams, accuracy was best when bigrams were used on the dataset.
5) Replacing the default NullStemmer with the LovinsStemmer gave a slight increase in accuracy.
6) Increasing wordsToKeep also improves accuracy: going from 1000 to 5000 words changed accuracy by ~2%.
7) I used the Attribute Selection filter with the Ranker search strategy (threshold 0, generateRanking: true, numToSelect: -1, startSet left empty), which increased accuracy by 1.5%.
8) Using different features changes the accuracy of the output. I used business id and user id together expecting an increase in accuracy, but using the two attributes together reduced it, which is expected: a user gives only one or two reviews for a business, so when both features are used together the number of instances per combination narrows down to one or two, which decreases both probability and accuracy. So I used only one attribute at a time. Using user id as a feature gave the better improvement, increasing accuracy by 3%.
9) Overall accuracy is 76%.
6.2.3. Pros and Cons

1) Using this model with WEKA gives the flexibility of many filters and attributes, for both supervised and unsupervised learning.
2) WEKA only works on small datasets; working with larger datasets is not possible.
3) The algorithm is already implemented, so no effort is needed to change any standard task.
4) If we want to add new functionality that is not available, it is difficult to implement.
6.3. Experimental Results

Sl.No | Action | *Naïve Model | *Naïve Bayes Multinomial | ROC for NBM | Discussion
1 | Initial dataset | 44% | 46% | 0.47 | No filters applied; these are the initial model results.
2 | Refining of training and test data | +7% | N/A | 0.54 | The default training set I selected was mostly positive, so sampling the training data helped increase my model's accuracy. In Naïve Bayes Multinomial, randomization is handled automatically by WEKA (using a randomizer).
3 | Change of stemmer | +0.5% | N/A | 0.55 | Changing from the Lovins stemmer to the Porter stemmer showed a slight increase in accuracy. The Lovins stemmer is not in WEKA's default stemmer list, so it could not be used with Naïve Bayes Multinomial.
4 | N-grams: unigrams | +3% | +3.5% | 0.59 | Of the three n-grams, bigrams gave the optimal results in both cases, so I went ahead and implemented bigrams.
  | N-grams: bigrams | +7% | +7.5% | 0.68 |
  | N-grams: trigrams | +2% | +2% | 0.60 |
5 | Features: business id and user id together | +2% | +1% | 0.70 | Using both together in the probability model reduced accuracy, possibly because the full outer join on business id and user id searches for specific instances that occur in the training set but not in the test set. In this case, Naïve Bayes Multinomial gave the better result.
  | Features: business id | +2% | +3% | 0.74 | Using business id or user id increased accuracy, more so with user id alone. This resembles per-user sentiment analysis: a user tends to use the same sort of text to express his feelings, and a small number of users wrote a large share of the reviews, which explains the increase from using user id as a feature.
  | Features: user id | +5% | +5% | 0.74 |
6 | Bag of words | +5% | +4% | 0.75 | Initially I kept 1000 words for the frequency count; using 3000 words increased accuracy.
7 | Overall accuracy on 100,000 records | ~73% | ~76% | 0.78 |
8 | Accuracy on complete dataset | ~75% | N/A | N/A | I could not fit the complete dataset in WEKA's memory even after allocating 6 GB to WEKA.

* All accuracy rates are rounded to the nearest value.
7. Model Development and Tuning by Vimal Chandra Gorijala

7.1. Naïve Bayes Multinomial Model

7.1.1. About the Model

Multinomial Naïve Bayes is a version of Naïve Bayes designed for text documents, and it is mainly useful for multi-class classification. Initially we had 5 classes for the reviews, but we reduced them to three (positive, neutral, negative) so that we can train the model better. Here the probability of a review d being in class c is computed as

    P(c | d) ∝ P(c) · ∏(1 ≤ k ≤ n_d) P(t_k | c)

where P(t_k | c) is the conditional probability of term t_k occurring in a review of class c. We interpret P(t_k | c) as a measure of how much evidence t_k contributes that c is the correct class, and P(c) is the prior probability of a review occurring in class c. If a review's terms do not provide clear evidence for one class versus another, we choose the one that has higher prior probability.
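A minimal sketch of this scoring rule, using log probabilities and add-one smoothing on toy data (not the actual WEKA implementation):

```python
import math

# Toy training corpus: class label -> list of documents.
class_docs = {
    1:  ["good food", "excellent service good"],
    -1: ["bad food", "worst service"],
}

vocab = {w for docs in class_docs.values() for d in docs for w in d.split()}

def log_score(tokens, c):
    """log P(c) + sum over tokens of log P(t_k | c), with add-one smoothing."""
    words = [w for d in class_docs[c] for w in d.split()]
    total = len(words)
    n_docs = sum(len(docs) for docs in class_docs.values())
    score = math.log(len(class_docs[c]) / n_docs)           # log prior P(c)
    for t in tokens:
        tf = words.count(t)
        score += math.log((tf + 1) / (total + len(vocab)))  # smoothed P(t|c)
    return score

review = "good service".split()
best = max(class_docs, key=lambda c: log_score(review, c))
print(best)  # 1
```

Working in log space avoids underflow when a review contains many terms, which is how multinomial NB is implemented in practice.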
We used WEKA to implement the model. Initially the dataset containing the features review text,
business id, review id, user id and the class label are fed to the tool. The preprocessing is done and they
are converted into word Vectors or NGrams based on the filters applied. Now we implement the model
on them.
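As an illustration of the classification rule above, here is a minimal from-scratch Python sketch of multinomial Naïve Bayes with Laplace smoothing. The toy reviews and helper names are invented for the example; the project itself used WEKA's implementation.

```python
import math
from collections import Counter

def train_mnb(docs, labels, alpha=1.0):
    """Estimate the prior P(c) and smoothed term likelihoods P(t|c)."""
    classes = sorted(set(labels))
    prior = {c: labels.count(c) / len(labels) for c in classes}
    term_counts = {c: Counter() for c in classes}
    for doc, c in zip(docs, labels):
        term_counts[c].update(doc.lower().split())
    vocab = {t for counts in term_counts.values() for t in counts}
    cond = {}
    for c in classes:
        total = sum(term_counts[c].values())
        # Laplace smoothing: add alpha to every count so no P(t|c) is zero
        cond[c] = {t: (term_counts[c][t] + alpha) / (total + alpha * len(vocab))
                   for t in vocab}
    return prior, cond, vocab

def classify(doc, prior, cond, vocab):
    """Return argmax_c of log P(c) + sum_k log P(tk|c); unseen terms are skipped."""
    scores = {}
    for c in prior:
        score = math.log(prior[c])
        for t in doc.lower().split():
            if t in vocab:
                score += math.log(cond[c][t])
        scores[c] = score
    return max(scores, key=scores.get)

# toy reviews, invented for illustration
docs = ["great food great service", "awful slow service", "food was okay",
        "great place", "awful food"]
labels = ["positive", "negative", "neutral", "positive", "negative"]
prior, cond, vocab = train_mnb(docs, labels)
print(classify("great service", prior, cond, vocab))  # prints "positive"
```

Working in log space avoids underflow when a review has many terms, which is why the product in the formula becomes a sum here.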
7.1.2. Model Tuning
In WEKA we can change various properties to increase the performance of the model; the data sample used here has 60,000 records. The following are some of the properties we varied:
1. Using N-grams rather than word vectors; bigrams in particular increased accuracy.
2. Increasing the WordsToKeep count from 1000 to 5000 or 10000, depending on the size of the dataset.
3. Increasing the minimum term frequency from 1 to 10, meaning a term with fewer than 10 occurrences is not considered.
4. Converting all the text to lowercase tokens.
5. Using the attribute selection filter InfoGainAttributeEval on top of N-grams, so that only the top-ranked attributes are fed to the model.
6. Using cross-validation instead of the percentage split option.
7. Utilizing additional features like business_id or user_id.
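A rough sense of what settings 1-4 above do can be given with a small pure-Python sketch that mimics the behaviour of WEKA's StringToWordVector filter. The function name and toy reviews are mine for illustration, not WEKA's API.

```python
from collections import Counter

def to_feature_terms(texts, words_to_keep=1000, min_term_freq=2, use_bigrams=True):
    """Lowercase the text, optionally add bigrams, keep only the top-N
    most frequent terms, and drop terms rarer than min_term_freq."""
    counts = Counter()
    for text in texts:
        tokens = text.lower().split()
        terms = list(tokens)
        if use_bigrams:
            terms += [" ".join(pair) for pair in zip(tokens, tokens[1:])]
        counts.update(terms)
    return [t for t, n in counts.most_common(words_to_keep) if n >= min_term_freq]

reviews = ["Very good food", "very good service", "bad service", "Good food"]
print(to_feature_terms(reviews, words_to_keep=5, min_term_freq=2))
```

On this toy input, bigrams like "very good" survive the frequency cut while rare unigrams like "bad" are dropped, which is the same effect the minimum-term-frequency and WordsToKeep settings had on the real dataset.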
7.1.3. Experimental Results
Features and Parameters              Accuracy (%)   ROC
Initial run with review text         48             0.46
Stopwords                            53             0.51
Stemmer                              54             0.52
Unigrams                             59             0.58
Min word frequency from 5-10         65             0.64
Business id and user id              65.23          0.64
Bigrams                              72             0.73
Trigrams                             63             0.62
User id                              74             0.75
Business id                          74             0.75
Attribute selection filter           78             0.79
Bag of words count to 5000           79             0.81
Overall accuracy                     79.49          0.83
Observation: In the Naive Bayes Multinomial model, varying the minimum term frequency and using bigrams improved the performance drastically. The reason behind this is that the data contains many bigrams, and these settings concentrate the model on the highly frequent words in the reviews.
7.2. Naïve Bayes Multinomial Text Model
7.2.1. About Model
The Multinomial Naïve Bayes Text model operates directly on string attributes; other types of input attributes are accepted but ignored during training and classification. It uses word frequencies rather than a binary bag-of-words representation, which makes it useful mainly for text data.
We used WEKA to implement the model. A data sample of 60,000 instances was used.
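The difference between the frequency-based representation this model uses and a binary bag of words can be shown in a few lines; the vocabulary and review below are illustrative only.

```python
from collections import Counter

def binary_bow(text, vocab):
    """Binary bag of words: 1 if the term occurs at all, else 0."""
    tokens = set(text.lower().split())
    return [1 if t in tokens else 0 for t in vocab]

def frequency_bow(text, vocab):
    """Frequency representation: how many times each term occurs."""
    counts = Counter(text.lower().split())
    return [counts[t] for t in vocab]

vocab = ["good", "bad", "service"]
review = "good good good service"
print(binary_bow(review, vocab))     # [1, 0, 1]
print(frequency_bow(review, vocab))  # [3, 0, 1]
```

The frequency vector preserves how emphatically a review repeats a word, which is the extra signal the Text model exploits.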
7.2.2. Model Tuning
In WEKA we can change various properties to increase the performance of the model. The following are some of them:
1. Converting all the text to lowercase tokens.
2. Varying the minimum word frequency.
3. Using N-grams instead of word vectors.
4. Utilizing additional features like business_id or user_id.
5. Using cross-validation instead of the percentage split option.
6. Increasing the WordsToKeep count from 1000 to 5000 or 10000 depending on the size of the
dataset.
7.2.3. Experiment Results
Features and Parameters              Accuracy (%)   ROC
Initial run with review text         54             0.53
Stopwords                            56             0.57
Stemmer                              59             0.60
Unigrams                             61             0.62
Min word frequency from 5-10         66             0.70
Business id and user id              65             0.64
Bigrams                              73             0.75
Trigrams                             64             0.66
User id                              77             0.79
Business id                          77             0.79
Overall accuracy                     79.6           0.84
Observation: The same reasons mentioned for the Naive Bayes Multinomial model are responsible for the drastic increase in the accuracy of this model. From the comparison of results we can say that the Naïve Bayes Multinomial Text model has slightly higher accuracy (about 0.11 percentage points) than the Naïve Bayes Multinomial model. The reason could be that the Naïve Bayes Multinomial Text model carries out some extra processing, which gives it the slight edge.
8. Model Development and Tuning by Parineetha Gandhi
The dataset fed to the tool has 25,000 reviews, of which 16,576 are positive, 5,650 are negative and 2,772 are neutral.
8.1. K Nearest Neighbors Model
8.1.1. About Model
k is a constant given by the user, and an unlabeled vector is classified by assigning the label that is most frequent among the k training samples nearest to that vector. The nearest neighbors are defined by a distance measure; with Euclidean distance, the distance between vectors x and y is

d(x, y) = sqrt(Σi (xi − yi)²)
-
8/10/2019 Data Crackers YELP
18/24
18
The value of k should be chosen according to the data: a larger value of k reduces the effect of noise on the classification, but makes the boundaries between the classes less distinct.
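The voting scheme described above can be sketched in a few lines of Python. The toy two-dimensional feature vectors are invented for illustration; the project itself ran KNN inside WEKA.

```python
import math
from collections import Counter

def euclidean(x, y):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def knn_classify(query, samples, k=3):
    """Label the query with the most frequent label among its k nearest samples."""
    nearest = sorted(samples, key=lambda s: euclidean(query, s[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# toy 2-D vectors, e.g. (positive-term count, negative-term count) of a review
samples = [((5, 0), "positive"), ((4, 1), "positive"), ((0, 4), "negative"),
           ((1, 5), "negative"), ((2, 2), "neutral")]
print(knn_classify((4, 0), samples, k=3))  # prints "positive"
```

With k=3 (odd) a clean majority usually exists among three classes, which is consistent with the tie-breaking point made in the observations below.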
8.1.2. Model Tuning
I tuned the model by varying the k value. Initially I applied k=1 and observed that changing the tokenizer did not make much difference, which gave an accuracy of 67.2314%.
In the second attempt I tuned the model further by applying k=15 with a bigram tokenizer, and observed that the accuracy increased to 69.125%.
Parameters Tuned:
The following are the parameters I changed while tuning the model. The table in section 8.1.3 shows the results obtained by varying them:
- TFTransform and IDFTransform
- minTermFreq
- outputWordCounts
- lowercaseTokens
- Stemmer
- stopwords
- tokenizer
I also tuned the models with model-specific parameters. For example, for KNN the distance function (e.g. Euclidean distance) and the number of nearest neighbors have been changed; for the decision tree, parameters like the Laplace value and the binary split option have been changed.
8.1.3. Experiment Results
Observation: The results obtained are best when k=5. A likely reason is that k is odd: when classifying into more than two groups, or when using an even value for k, it might be necessary to break a tie in the number of nearest neighbors.
When KNN-specific parameters like Euclidean distance and Manhattan distance were considered, it was observed that Euclidean distance gave better results.
Results obtained when KNN-specific parameters were used
8.2. Decision Tree
8.2.1. About Model
A decision tree classifies instances by sorting them down the tree from the root to some leaf node, which provides the classification of the instance. Each internal node tests an attribute of the instance, and each branch descending from a node corresponds to one of the possible values of that attribute.
8.2.2. Model Tuning
Parameters Tuned:
I applied the same parameters as for the KNN model and got an accuracy of 74.25%.
I performed percentage split in most of the cases, as cross-fold validation was taking quite a long time for each experiment.
Initially I ran the model without applying any parameters on the dataset and observed that the accuracy is 72.41% and the ROC is 0.72.
The best results were obtained when the Laplace value was set to true, which gives an ROC of about 0.82. The reason could be that the Laplace correction biases the probability estimates toward a uniform distribution, so no class at a leaf receives a zero probability.
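The effect of the Laplace correction at a tree leaf can be illustrated with a short sketch; the class counts below are made up for the example (WEKA applies the correction internally when useLaplace is true).

```python
def leaf_probabilities(class_counts, laplace=True):
    """Class probability estimates at a decision-tree leaf.
    With the Laplace correction each count is incremented by 1, so no class
    gets probability 0 and the estimates are pulled toward uniform."""
    k = len(class_counts)
    total = sum(class_counts.values())
    if laplace:
        return {c: (n + 1) / (total + k) for c, n in class_counts.items()}
    return {c: n / total if total else 1 / k for c, n in class_counts.items()}

counts = {"positive": 9, "negative": 0, "neutral": 1}
print(leaf_probabilities(counts, laplace=False))  # negative gets 0.0
print(leaf_probabilities(counts, laplace=True))   # negative gets 1/13, about 0.077
```

Avoiding hard zero probabilities is what makes the ROC curve smoother and is a plausible explanation for the jump to 0.82.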
Decision Tree Specific Parameters:
Parameters     Value   Accuracy   ROC
Binary split   TRUE    71.54      0.72
numFolds       10      69.93      0.73
useLaplace     TRUE    69.98      0.82
Results obtained when Decision Tree specific parameters were used
Result comparison between KNN and Decision Tree
Observation:
The results obtained are best when k=5; a likely reason is that k is odd, since when classifying into more than two groups, or when using an even value for k, it might be necessary to break a tie in the number of nearest neighbors.
For the decision tree the best results were obtained when the Laplace value was set to true, which increased the ROC to 0.819.
9. Main Findings in the Project
The Naive Bayes Multinomial Text model performed the best among all the models we tried. The reasons are listed below.
- Varying the minimum term frequency parameter drastically affected the performance: words which are not repeated frequently and are not useful for classification are ignored.
- The review text mostly contains bigrams like "very good", "feeling awesome" etc., so using bigrams as features for classifying the reviews helped a lot.
- Using additional features like user id and business id increased the performance. For example, if a user gives most of his reviews as positive for different businesses, the next review given by him for any other business will most likely be positive; likewise, if a business has mostly positive reviews, the next incoming review will most likely be positive. For these features to work, the reviews of a user must be present in both the training and the test set, and the same holds for businesses.
- The Naïve Bayes Multinomial model has almost the same accuracy as the above model for the same reasons, but the Multinomial Naïve Bayes Text model does some extra processing, which increases its accuracy.
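The user-id effect described above amounts to a per-user prior. Below is a minimal sketch of how such a prior could be estimated from the training set; the data and function name are illustrative, not part of the WEKA setup.

```python
from collections import Counter, defaultdict

def user_priors(training_reviews):
    """Per-user class distribution from (user_id, label) pairs.
    Usable as a prior for that user's next review, but only helps if the
    user also appears in the test set, as noted above."""
    by_user = defaultdict(Counter)
    for user_id, label in training_reviews:
        by_user[user_id][label] += 1
    return {u: {c: n / sum(cnt.values()) for c, n in cnt.items()}
            for u, cnt in by_user.items()}

train = [("u1", "positive"), ("u1", "positive"), ("u1", "negative"),
         ("u2", "negative")]
priors = user_priors(train)
print(priors["u1"])  # u1's reviews are mostly positive
```

A classifier can then weight its text-based score by this distribution, which is effectively what adding user_id as a feature lets Naive Bayes do.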
10. Results and Comparison
Graph shows accuracy results obtained by different models
From the graph it is clear that Naïve Bayes Multinomial Text gives the best accuracy, about 80%. It can be observed that, from left to right, the Naïve Bayes models started with accuracies of about 48% and 54%, whereas KNN and Decision Tree started with better accuracy; we were expecting better final accuracy from those models, but it did not turn out as we expected. Accuracy increased as we included more features and filters. Among the N-grams, bigrams showed good results, so we used bigrams for further processing. Among the features business id, user id and both together, we observed better accuracy when we considered user id alone. For KNN the accuracy was good with k=5, and considering KNN-specific parameters like Euclidean distance made the results considerably higher. For the Decision Tree, setting the Laplace value increased the ROC. Including these model-specific options improved the accuracy slightly for these two models.
So, Naïve Bayes Multinomial Text classification gave a good accuracy when compared with the other models.
11. Project Management
11.1. Task Allocation and Timelines
We used the project management website www.Asana.com to manage the entire project and workload. Below is the timeline allocation of the workload for each team member. We also used this tool to store all intermediate files, reports, scripts and snippets.
11.2. Self-Assessment:
- Everyone on the team contributed equally. There was no total dependency on, or delay from, anyone in the team.
- Everyone was equally active and enthusiastic to learn something new.
- Before taking any decision, we made sure that everyone was clear about the requirements and the expected output. We followed a process of Knowledge Transfer and Reverse Knowledge Transfer to make sure that everyone was on the same page.
- There were a lot of discussions in the initial phase of the project so that everything would go without any hurdles in the end.
- Everyone in the team has decent knowledge of different tools and technologies like Pentaho Data Integration, WEKA, MySQL, Java, PHP and Big Data components, so whenever any sort of decision had to be made, there was always someone to address it.
- Everyone used the Asana project management tool actively.
11.3. What can be improved?
- Domain knowledge
- Increasing awareness and usability of tools to all the members of the team.
12. List of Queries:
12.1. Bigrams_1
CREATE TABLE bigrams AS
SELECT word, star, frequency FROM (
  SELECT word, star, frequency, rank() OVER (ORDER BY frequency DESC) AS slno FROM (
    SELECT word,
      CASE WHEN pos_count >= neg_count AND pos_count >= nut_count THEN 1
           WHEN neg_count >= nut_count THEN -1
           ELSE 0
      END AS star,
      CASE WHEN pos_count >= neg_count AND pos_count >= nut_count THEN pos_count
           WHEN neg_count >= nut_count THEN neg_count
           ELSE nut_count
      END AS frequency
    FROM (
      SELECT DISTINCT
        CASE WHEN neg.gram.ngram[0] IS NOT NULL THEN concat(neg.gram.ngram[0], " ", neg.gram.ngram[1])
             WHEN nut.gram.ngram[0] IS NOT NULL THEN concat(nut.gram.ngram[0], " ", nut.gram.ngram[1])
             ELSE concat(pos.gram.ngram[0], " ", pos.gram.ngram[1])
        END AS word,
        CASE WHEN pos.gram.estfrequency IS NULL THEN 0 ELSE pos.gram.estfrequency END AS pos_count,
        CASE WHEN neg.gram.estfrequency IS NULL THEN 0 ELSE neg.gram.estfrequency END AS neg_count,
        CASE WHEN nut.gram.estfrequency IS NULL THEN 0 ELSE nut.gram.estfrequency END AS nut_count
      FROM bigrams_neg AS neg
      FULL OUTER JOIN bigrams_nut AS nut ON neg.gram.ngram = nut.gram.ngram
      FULL OUTER JOIN bigrams_pos AS pos ON pos.gram.ngram = nut.gram.ngram
    ) AS a
  ) AS b
) AS c
WHERE slno
12.5. Bigrams_stag_2
CREATE TABLE bigrams_stag_2 AS
SELECT review_id, max(prob_sum) AS prob_max
FROM bigrams_stag_1
GROUP BY review_id;
12.6. Bigrams_test_1_1
CREATE TABLE bigram_test_1_1 AS
SELECT test.review_id, new_star, test.stars AS original_star
FROM (
  SELECT stag1.review_id AS review_id, star AS new_star
  FROM bigrams_stag_1 AS stag1
  INNER JOIN bigrams_stag_2 AS stag2
    ON stag1.review_id = stag2.review_id AND stag1.prob_sum = stag2.prob_max
) a
INNER JOIN test_data test ON a.review_id = test.review_id;
12.7. Stats
This gives the final statistics, showing the number of correctly classified instances and wrongly classified instances.
SELECT stats, COUNT(*)
FROM (
  SELECT CASE WHEN new_star = original_star THEN 1 ELSE 0 END AS stats,
         new_star, original_star
  FROM bigram_test_1_1
) res
GROUP BY res.stats;

SELECT original_star, COUNT(*) FROM bigram_test_1_1 GROUP BY original_star;

SELECT new_star, COUNT(*) FROM bigram_test_1_1 GROUP BY new_star;
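The Stats query returns one count of correctly classified instances (stats = 1) and one of wrongly classified instances (stats = 0); the overall accuracy follows directly. A tiny sketch, with illustrative counts rather than the project's actual numbers:

```python
def accuracy_from_stats(stats_counts):
    """stats_counts mirrors the Stats query output: {1: correct, 0: wrong}."""
    correct = stats_counts.get(1, 0)
    total = correct + stats_counts.get(0, 0)
    return correct / total if total else 0.0

print(accuracy_from_stats({1: 73000, 0: 27000}))  # 0.73
```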