8/10/2019 Data Crackers YELP
Data Mining on YELP Dataset
Advisor - Duc Tran Thanh
Team - Data Crackers
Prashanth Sandela
Vimal Chandra Gorijala
Parineetha Gandhi Tirumali
Table of Contents

1. Project Vision
2. Data Mining Task
   2.1. Data Mining Problem
   2.2. Evaluation Metrics
3. Hypothesis
4. Data Processing
   4.1. Data
   4.2. Initial Dataset
   4.3. Data Quality Problems
   4.4. Data Processing Tasks
   4.5. Resulting Dataset
5. Feature Selection
   5.1. Dataset
   5.2. Rationales Behind Feature Selection
   5.3. Feature Selection Tasks
   5.4. Selected Features
6. Model Development and Tuning by Prashanth Sandela
   6.1. Implementation of Own Model (Naïve Bayes)
   6.2. Naïve Bayes Multinomial Classification Model
   6.3. Experimental Results
7. Model Development and Tuning by Vimal Chandra Gorijala
   7.1. Naïve Bayes Multinomial Model
   7.2. Naïve Bayes Multinomial Text Model
   7.3. Results Comparison
8. Model Development and Tuning by Parineetha Gandhi
   8.1. K Nearest Neighbors Model
   8.2. Decision Tree
9. Main Findings in the Project
10. Results and Comparison
11. Project Management
12. List of Queries
1. Project Vision

In today's fast-growing world there are many businesses: startups, growing companies, and well-established firms. For every business, its rating is vital to its survival in the market. This rating is given by users who consume the goods and services of a business. A user expresses his experience with a business in the form of reviews and star ratings through many platforms, the most famous of which is YELP. A review can be positive, negative or neutral. The aim of our project is to build a classifier that classifies any given review into star-rating labels (-1, 0 and 1). We planned to use various data mining models to classify reviews into user star-rating labels, applying various model-tuning techniques to attain optimal classification accuracy.
2. Data Mining Task

2.1. Data Mining Problem

The data mining task we are trying to solve is multi-class classification. The classes we used in this project are -1 (negative), 0 (neutral) and 1 (positive).
2.2. Evaluation Metrics

The following are the evaluation metrics we used to assess the quality of the solution.

1) Accuracy (percentage of correctly classified instances): this metric is appropriate because it tells us directly how the model is performing, but we cannot rely on it alone.

2) ROC value: this measures the trade-off between the true-positive rate and the false-positive rate, and is one of the most important values output by WEKA. An "optimal" classifier has ROC values approaching 1, while 0.5 is comparable to random guessing.

Combining the above metrics, we can assess the performance of the model and attain the best results.
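Both metrics can be sketched in a few lines. This is an illustrative implementation, not WEKA's; the function names are our own, and the ROC value is computed as the binary rank statistic (the probability that a random positive instance outranks a random negative one), whereas WEKA reports a per-class variant for multi-class problems.

```python
def accuracy(y_true, y_pred):
    """Fraction of correctly classified instances."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def roc_auc(y_true, scores):
    """Binary AUC via the rank-sum statistic: probability that a
    random positive instance scores higher than a random negative one."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t != 1]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [1, 1, -1, -1]
print(accuracy(y_true, [1, -1, -1, -1]))        # 0.75
print(roc_auc(y_true, [0.9, 0.4, 0.6, 0.1]))    # 0.75
```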
3. Hypothesis

Since this is classification based on text, the words in the reviews are the features we should use to classify them correctly. For example, a review containing words like "good", "excellent", "awesome" or "yumm food" should be classified into the positive class. We planned to concentrate on such words and apply transformations like stop-word removal and stemming, with the help of different tools, to make the best possible use of them, so that each review can be classified correctly. We intended to concentrate mainly on Bayesian algorithms, as they perform well on text classification.

We also intended to use combinations of words called bigrams, for example "very good" or "yum yum". A user's view of a business is mostly expressed in combinations of words, so we expected that using bigrams would give the model good accuracy.

Other features such as business id and user id can individually improve accuracy, but they should not be used in combination.

We discuss in the results below how model learning is affected by the approaches in this hypothesis.
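The bigram idea from the hypothesis can be sketched with a generic n-gram helper (not tied to any particular library; the function name is our own):

```python
def ngrams(tokens, n):
    """All contiguous n-word sequences from a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the food was very good".split()
print(ngrams(tokens, 2))
# ['the food', 'food was', 'was very', 'very good']
```

With n=2 this captures sentiment-bearing pairs such as "very good" that unigrams would split apart.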
4. Data Processing

4.1. Data

We obtained the data from http://www.yelp.com/dataset_challenge. It covers 40,000 businesses, 1.3 million reviews and 250,000 users. The data was in JSON format; we pre-processed it and converted it into CSV format to obtain the review text, class labels and other features. There were many irrelevant fields, such as neighborhoods and votes; we removed all of them and kept only the required ones. Initially the reviews carry star-rating class labels from 1 to 5, which we reduced as follows: 1 and 2 map to negative (-1), 3 to neutral (0), and 4 and 5 to positive (1). The figure below shows the distribution of all the reviews over the modified class labels.
Figure: number of reviews per class label.
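The label reduction described above amounts to a small mapping; the function name here is our own:

```python
def star_to_label(stars):
    """Collapse 1-5 star ratings into three class labels."""
    if stars <= 2:
        return -1   # 1 and 2 stars -> negative
    if stars == 3:
        return 0    # 3 stars -> neutral
    return 1        # 4 and 5 stars -> positive

print([star_to_label(s) for s in [1, 2, 3, 4, 5]])
# [-1, -1, 0, 1, 1]
```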
4.2. Initial Dataset

The initial YELP dataset consists of data about businesses, users and reviews. Below is a snapshot of the dataset in JSON format.

Business:
{
  'type': 'business',
  'business_id': (encrypted business id),
  'name': (business name),
  'neighborhoods': [(hood names)],
  'full_address': (localized address),
  'city': (city),
  'state': (state),
  'latitude': (latitude),
  'longitude': (longitude),
  'stars': (star rating, rounded to half-stars),
  'review_count': (review count),
  'categories': [(localized category names)],
  'open': True / False,
}

User:
{
  'type': 'user',
  'user_id': (encrypted user id),
  'name': (first name),
  'review_count': (review count),
  'average_stars': (floating point average, like 4.31),
  'votes': {(vote type): (count)},
  'friends': [(friend user_ids)],
  'elite': [(years_elite)],
  'yelping_since': (date, formatted like '2012-03'),
  'compliments': {(compliment_type): ...},
}

Review:
{
  'type': 'review',
  'business_id': (encrypted business id),
  'user_id': (encrypted user id),
  'stars': (star rating, rounded to half-stars),
  'text': (review text),
  'date': (date, formatted like '2012-03-14'),
  'votes': {(vote type): (count)},
}
Figure: review counts by class label: Negative 213,509; Neutral 163,761; Positive 748,188.
4.3. Data Quality Problems

The dataset has several quality issues:

1) Presence of unwanted columns, and the need to merge the files in the dataset
2) Special characters
3) Numeric data
4) Characters from other languages
5) Stop words
6) business_id, review_id and user_id are hash values, which occupy a lot of space
4.4. Data Processing Tasks

4.4.1. Removing Unwanted Columns and Merging the Files

Among all the columns, we kept only business_id, user_id, review_id, review_text, review_count and stars. The three files were then combined into a single dataset with only these attributes. To accomplish this, the entire dataset (all three files) was first converted from JSON to CSV using a Python script; the resulting datasets were then combined using the ETL tool Pentaho.

[Figure: ETL mapping in Pentaho]
4.4.2. Removing Special Characters, Numbers and Other-Language Words

The main field of interest is review_text, the text a user entered as a review of a business, alongside the stars. It contains special characters, new-line characters and other-language characters, which were removed by the PHP script below.
4.4.3. Removing Stop Words and Converting to Lower Case

Stop words are words whose removal does not change the meaning or weight of a sentence, so removing them decreases the number of tokens. Converting all text to lower case makes comparing two words straightforward, since both are in the same case. The PHP script below performs the operations of Sections 4.4.2 and 4.4.3.
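In Python, the same cleaning steps look roughly like this. The stop-word list here is a tiny illustrative subset, not the full list the project used:

```python
import re

# Illustrative subset only; the project used a much larger stop-word list.
STOP_WORDS = {"the", "is", "a", "an", "and", "was"}

def clean_review(text):
    """Lowercase, strip non-letter characters, and drop stop words."""
    text = re.sub(r"[^a-z ]", " ", text.lower())
    tokens = [w for w in text.split() if w not in STOP_WORDS]
    return " ".join(tokens)

print(clean_review("The food was GREAT!!! 10/10"))
# food great
```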
4.5. Resulting Dataset

The resulting dataset consists of business_id, review_id, user_id, stars and review_text in CSV format. A sample of the file is shown below.
nYer89hXYAoddMEKTxw7kA,k2u1F6spBGhgk2JtAe97QA,HeDqdFYkKaeDvPtiFy6Xmw,event favorite event a long time lindsey a fabulous job setting up keeping movie played completely hush hush absolutely love filmbar always a great beer wine selection wonderful staff a wacky selection art film movie night wayne world one time favorites naturally beyond thrilled invited a super foxey date a guy a delicious dog short leash a fabulous time thank lindsey filmbar yelp a fantastic evening party time excellent ,5
nYer89hXYAoddMEKTxw7kA,hdZ3rlgFXctCOUhzoOebvA,XXblLOSqYlq0tXhxHfXUHQ,great time funny movie loved going film bar first time tables eat fantastic ,4
nYer89hXYAoddMEKTxw7kA,usQTOj7LQ9v0Fl98gRa3Iw,2fPxXAysOrZLrahZQyJCNg,wayne world short leash filmbar need more a great tuesday night adventure wayne world took waaaaaay back favorite aiko a chicken dog short leash ages great yelp coordinated outing usual thanks lindsey yelp crew thanks kelly a staff filmbar place bomb day week thanks brad kat short leash bunch having trailer available event ,5
nYer89hXYAoddMEKTxw7kA,XTFE2ERq7YvaqGUgQYzVNA,OO6prfuGEMalQcQcU3WCaw,fab concept test out a new well new hadn t film bar previously independent cinema drink a generously gratis beer wine choosing short lease hot dogs oh conveniently parked outside plus samples appetizers heck yes add excitement anticipation knowing filmed az movie going shown a perfect weeknight event now know movies don t actually based arizona awesome ideas next mention great filmbar date ideas online dating attempts pan out thanks film bar yelp short lease fun fellow yelpers a great time ps post movie trivia answers ,5
// Read reviews from MySQL, strip unwanted characters and stop words,
// and append the cleaned rows to a CSV file in batches of 1000.
$result = mysql_query("SELECT business_id, user_id, review_id, text, stars, review_count FROM reviews");
$i = 0;
$optString = "";
while ($rows = mysql_fetch_array($result)) {
    $i++;
    // Keep only ASCII letters and spaces; everything else (including
    // newlines) becomes a space.
    $text = preg_replace("/[^a-z ]/i", " ", $rows['text']);
    // Remove stop words and convert to lower case.
    $words = explode(" ", $text);
    $processed_text = "";
    foreach ($words as $s) {
        $s = strtolower($s);
        // in_array avoids the pitfall of array_search returning index 0,
        // which compares loosely equal to false.
        if ($s !== "" && !in_array($s, $stopWords))
            $processed_text .= $s . " ";
    }
    $optString .= "'" . $rows['business_id'] . "',";
    $optString .= "'" . $rows['user_id'] . "',";
    $optString .= "'" . $rows['review_id'] . "',";
    $optString .= "'" . $processed_text . "',";
    $optString .= $rows['stars'] . ",";
    $optString .= $rows['review_count'];
    $optString .= "\n";
    if ($i % 1000 == 0) {
        // Flush the batch to disk.
        $fd = fopen("reviews_DetailedStopWords.csv", "a+");
        fwrite($fd, $optString);
        fclose($fd);
        $optString = "";
        echo "$i\n";
    }
}
5. Feature Selection

5.1. Dataset

After data preprocessing, the dataset is in clean, structured CSV format with the required columns. This file is loaded into HDFS, and a `reviews` table is created over the dataset. This table is used for feature selection.
5.2. Rationales Behind Feature Selection

Now that the content is processed, the next task is to reduce the size of the dataset by replacing the hash values of business_id, review_id and user_id with unique numeric identifiers. We create new tables to store these mappings, so that the original id values are not lost.

Our final aim is to classify stars based on the reviews, so we narrowed down stars 1-5 into three classes: positive, negative and neutral. Reviews with 1 or 2 stars fall under negative, reviews with 4 or 5 stars under positive, and 3-star reviews under neutral.

The essential feature for classifying a review is review_text. We tested classification of 25,000 reviews in WEKA on unigrams, bigrams and trigrams, using 66% of the data for training the model and 34% for testing. The results showed that n-grams gave significantly higher correctness, so based on this experiment we planned to use n-grams to train the model.
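The n-gram frequency counting behind this experiment can be sketched in Python. This is toy data; the project kept the top 2000 n-grams, here we keep the top 3:

```python
from collections import Counter

def top_ngrams(reviews, n, k):
    """Count all n-grams across the reviews and keep the k most frequent."""
    counts = Counter()
    for review in reviews:
        tokens = review.split()
        counts.update(" ".join(tokens[i:i + n])
                      for i in range(len(tokens) - n + 1))
    return counts.most_common(k)

reviews = ["great food great service", "great food bad service"]
print(top_ngrams(reviews, 2, 3))   # 'great food' appears twice, the rest once
```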
We planned to use 67% of the data to train the model and 33% for testing. We removed stop words in the data preprocessing step. Review text is entered by end users of the YELP application, so spelling mistakes are highly likely, and users express their feelings in various ways: some may type "gooood" instead of "good", or "coooool" instead of "cool". When we calculate term frequencies, such variants risk being ignored, so the words need to be normalized: lemmatization reduces a word to its dictionary form, while stemming strips it down to a root word.

For this phase we implemented the Lovins stemmer algorithm. A UDF was created for it, which takes the complete text as input, processes it, and produces the output accordingly. After this phase, we divided the table into unigrams, bigrams and trigrams and calculated word frequencies.
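To illustrate what the stemming UDF does, here is a toy suffix-stripping stemmer. This is only a sketch: the real Lovins algorithm uses a large table of endings plus recoding rules, which this does not implement.

```python
# Illustrative suffix list only; not the Lovins ending table.
SUFFIXES = ["ation", "ness", "ing", "ers", "ed", "ly", "s"]

def toy_stem(word):
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[: -len(suf)]
    return word

print([toy_stem(w) for w in ["eating", "eaters", "eats", "happily"]])
# ['eat', 'eat', 'eat', 'happi']
```

The point of the real stemmer is the same: variants like "eating", "eaters" and "eats" collapse onto one term, so their frequencies accumulate.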
$> hadoop fs -put review.csv
$> hive
HIVE> CREATE TABLE reviews (business_id STRING, review_id STRING, user_id STRING, review_text STRING, stars INT)
      ROW FORMAT DELIMITED
      FIELDS TERMINATED BY ','
      LINES TERMINATED BY '\n';
HIVE> LOAD DATA INPATH 'review.csv' INTO TABLE reviews;
HIVE> SELECT * FROM reviews LIMIT 10;
/* Displays the columns in the correct format */
5.3. Feature Selection Tasks

5.3.1. Assigning Numeric Ids to Key Attributes

Below are the queries used to assign numeric ids to the key attributes business_id, user_id and review_id.

5.3.2. Narrowing Stars

Narrowing stars means converting 1 and 2 stars to negative (-1), 3 stars to neutral (0), and 4 and 5 stars to positive (1).

5.3.3. Removing Stop Words

Stop words were removed in the data processing phase.

5.3.4. Stemming Review Text

We created a UDF `lovinsStemmer()` based on the Lovins stemming algorithm provided by the University of Waikato (http://www.cs.waikato.ac.nz/~eibe/stemmers/). After applying stemming, we removed some newly generated stop words using the UDF `stopWords()`. Below are the queries for these tasks.
HIVE> CREATE TABLE business AS
      SELECT DISTINCT id, business_id
      FROM (SELECT RANK() OVER (ORDER BY business_id) AS id,
                   business_id
            FROM reviews) a;

HIVE> CREATE TABLE users AS
      SELECT DISTINCT id, user_id
      FROM (SELECT RANK() OVER (ORDER BY user_id) AS id,
                   user_id
            FROM reviews) a;

HIVE> CREATE TABLE processed_reviews AS
      SELECT RANK() OVER (ORDER BY business_id) AS business_id,
             RANK() OVER (ORDER BY user_id)     AS user_id,
             RANK() OVER (ORDER BY review_id)   AS review_id,
             review_text,
             stars
      FROM reviews;

HIVE> CREATE TABLE processed_stars_reviews AS
      SELECT business_id, review_id, user_id, review_text,
             CASE WHEN stars = 1 OR stars = 2 THEN -1
                  WHEN stars = 3 THEN 0
                  WHEN stars = 4 OR stars = 5 THEN 1
             END AS stars
      FROM processed_reviews;

HIVE> CREATE TABLE stemmed_stars_reviews AS
      SELECT review_id,
             stopWords(lovinsStemmer(review_text)) AS review_text,
             stars
      FROM processed_stars_reviews;
5.3.5. Dividing Training and Test Data

The total number of reviews in the dataset is 1,125,458. 67% of them, i.e. 754,056 reviews, are used for training the model and the remaining 371,402 for testing it. Since we had already created unique review_ids from 1 to 1,125,458, the split is straightforward; the query is shown below.
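A small sketch of this id-based split, assuming sequential review ids from 1 to 1,125,458. Note that the Hive predicate puts id 754,056 itself in the test set, so the counts differ by one from the figures in the prose:

```python
TOTAL = 1_125_458
CUTOFF = 754_056

train_ids = range(1, CUTOFF)          # WHERE review_id < 754056
test_ids = range(CUTOFF, TOTAL + 1)   # WHERE review_id >= 754056

print(len(train_ids), len(test_ids))  # 754055 371403
```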
5.3.6. Classification Using WEKA

In the initial report, we used the StringToWordVector filter, which uses WordTokenizer to convert the review text into vectors, and applied the Naïve Bayes Multinomial classification algorithm. We got 66.7% of instances correctly classified. The data sample had 5,000 instances, with 66% used as training data and 34% as testing data.

In this phase we applied the NGram tokenizer, which converts the text into n-grams (unigrams, bigrams, trigrams). On top of this we also applied the Attribute Selection filter, which uses the InfoGainAttributeEval function to evaluate the worth of each attribute by measuring its information gain with respect to the class, and got 70.02% of instances correctly classified. This data sample had 25,000 instances, with 66% used for training and the rest for testing.
HIVE> CREATE TABLE test_data AS
      SELECT * FROM processed_stars_reviews
      WHERE review_id >= 754056;

HIVE> CREATE TABLE training_data AS
      SELECT * FROM processed_stars_reviews
      WHERE review_id < 754056;
5.3.7. Generating N-grams

N-gram generation is a predefined function in Hive. We use it to build unigrams, bigrams and trigrams and to calculate their frequencies, selecting the top-2000 word list. Below are the queries to create the unigram, bigram and trigram tables.

HIVE> CREATE TABLE unigrams AS
      SELECT ngrams(sentences(review_text), 1, 2000, 2000) FROM training_data;

HIVE> CREATE TABLE bigrams AS
      SELECT ngrams(sentences(review_text), 2, 2000, 2000) FROM training_data;

HIVE> CREATE TABLE trigrams AS
      SELECT ngrams(sentences(review_text), 3, 2000, 2000) FROM training_data;

5.4. Selected Features

These are the selected features for our classification:

1) Business id
2) User id
3) Bigrams
4) Review text

6. Model Development and Tuning by Prashanth Sandela

6.1. Implementation of Own Model (Naïve Bayes)

6.1.1. Idea Behind Developing the Model in Hive

I developed my own model for classifying star ratings based on text. For this model, I considered the features business id, user id, review id, review text and stars. The implementation is based on the probability of each n-gram given either business id or user id, computed on the training data, and
the model is then applied to the test data to classify each review's stars as -1, 0 or 1, i.e. negative, neutral or positive. This model achieved an accuracy of 69.5%.
6.1.2. Model Development and Description

This model was developed purely with Hive queries on Amazon Web Services, storing data in S3 and developing and deploying on Elastic MapReduce with 3 EC2 instances. The model runs on the entire YELP dataset of 1.3 million records, with 67% training data and 33% test data.

Steps followed to develop the model:

1. Divide training and test data
2. From the training data, find each n-gram, its frequency, star and probability
3. From the test data, find review id, n-gram and frequency
4. Score the test data against the trained statistics
5. Compare test and training dataset words, including a few other features
6. Retrieve the percentage match of training and test data

The queries Bigrams1, Numerics, Bigrams_1, Bigrams_stag_1 and Bigrams_stag_2 were used to design the model. This is the final model after tuning. Below is an example of the model's implementation, starting with classification on unigrams.
Training Data:

Word       Frequency  Star  Probability
Good       100        1     0.33
Excellent  50         1     0.16
Bad        100        -1    0.5
Good       10         -1    0.05
Nice       15         0     0.1

Total Stars:

Stars  Total Count
1      300
-1     200
0      150

The queries bigrams_test_1_1 and stats are used to compute the results. In the example above, I calculated the probability of each word as its frequency divided by the total word count of its class. The word "Good" appears under both the 1 and -1 star ratings; based on its probabilities, "Good" will be classified as +1. Below is how a review is classified based on its text.
Test Data:

Review_id  Word   Count in reviews  New_Star  Original_Star
1          Good   10                1         1
1          Bad    5                 -1        1
2          Bad    30                -1        -1
2          Worst  40                -1        -1

For review id 1, the word counts are weighed against the word probabilities, and the review is classified with star rating 1. Similarly, review id 2 is classified with star rating -1.
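The worked example above can be turned into a small sketch of the probability model. The probabilities are copied from the training table; a word unseen in training, such as "worst", simply contributes nothing in this sketch:

```python
# word -> {class: probability}, taken from the training table above
word_prob = {
    "good":      {1: 0.33, -1: 0.05},
    "excellent": {1: 0.16},
    "bad":       {-1: 0.5},
    "nice":      {0: 0.1},
}

def classify(word_counts):
    """Assign the class whose count-weighted word-probability mass is largest."""
    scores = {-1: 0.0, 0: 0.0, 1: 0.0}
    for word, count in word_counts.items():
        for label, p in word_prob.get(word, {}).items():
            scores[label] += count * p
    return max(scores, key=scores.get)

# Review 1 from the test table: 10x "good", 5x "bad"
print(classify({"good": 10, "bad": 5}))    # 1
# Review 2: 30x "bad", 40x "worst"
print(classify({"bad": 30, "worst": 40}))  # -1
```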
I considered only businesses with at least 10 reviews and users with at least 10 reviews. I divided the data at each business and user level. For example, if there are 100 reviews for a specific business, then 66 reviews go to training and the rest to test data when I consider business id as a feature. This division happens for every business id. A similar process is repeated for user id, and also when the features business id and user id are considered together.

In Hive, I was not able to compute the ROC measure for the result metrics.
6.1.3. Model Tuning

The dataset supplied to this model had stop words removed and the text enriched. The model tuning proceeded as follows:

1) Refining and sampling of training and test data
Initially I simply cut the dataset down to 100,000 records, taking the first 67,000 as training data and 33,000 as test data. I realized that more than 90% of the records in the training data were positive. So I split the data into random samples, which improved the accuracy by nearly 7%, from 43% to ~51%.

2) Change of stemmer
Initially I used the Lovins stemmer; after some research I found that the Porter stemmer performs better. I used a Java program to implement this stemmer, which improved accuracy by ~0.5%.

3) N-grams and frequency counts
Using different n-grams changed the accuracy, with bigrams showing the best results. Furthermore, there was a slight increase in accuracy when I considered a term frequency of 5000. This tuning increased accuracy by 4%.

4) Determining the best approach to increase accuracy
Before arriving at the probability model, I tried various other approaches, such as sum and count models, which did not help much in producing accurate results. Using the probability model increased accuracy significantly.

5) Change of features
We have two more features: business id and user id. When I used business id or user id as an extra feature, accuracy increased significantly. But when I used business id and user id together, accuracy actually dropped. This makes sense: using both features means the model searches for matching business id and user id pairs when classifying stars, and since such combinations are nearly unique, the accuracy was reduced.

6) Applying the model to the overall dataset
When I ran the overall dataset at a ratio of 67% training to 33% test data, I got an accuracy of ~74%. Accuracy on the sample data was ~71%, so it is a bit higher on the full data.
6.1.4. Pros and Cons

a) Hive is an SQL-like language, so the model is easy to implement.
b) The main advantage of this model is that we can tune it to any extent.
c) There is no limitation on the size of the data or the number of fields.
d) It never runs out of memory.
e) The model can be run on a cluster using all the required resources.
f) It is difficult to design and implement this model: any change requires considering many downstream effects, e.g. if a query is changed, what might be the effect on the result? One must be very careful while making changes.
g) Hive provides many predefined functions, and new extensions can easily be accommodated by designing a UDF (User Defined Function).
6.2. Naïve Bayes Multinomial Classification Model

6.2.1. About the Model

I used the WEKA data modeling tool to classify stars using the Naïve Bayes Multinomial model, which is available in its list of Bayes models. WEKA provides pre-defined models implemented with many filters and features.
6.2.2. Model Tuning

Tuning was performed on a dataset of 100,000 records with 67% training data and 33% test data.

1) Initial accuracy was around 47% without any tuning.
2) I supplied a new stop-word list rather than the default one; there was a slight accuracy increase of roughly 0.5%.
3) Using n-grams instead of the WordTokenizer showed better accuracy.
4) Among the n-grams, accuracy was best when bigrams were used on the dataset.
5) Replacing the default NullStemmer with the LovinsStemmer gave a slight increase in accuracy.
6) Increasing wordsToKeep also improves accuracy: going from 1000 to 5000 words changed accuracy by ~2%.
7) I used the Attribute Selection filter with the Ranker search strategy (threshold 0, generateRanking: true, numToSelect: -1, startSet left empty), which increased accuracy by 1.5%.
8) Using different features changes the accuracy of the output. I used business id and user id together expecting an increase in accuracy, but using the two attributes together reduced it, which is expected: a user gives only one or two reviews for a business, so when both features are used together the number of instances per combination narrows down to one or two, which decreases both probability and accuracy. So I used only one attribute at a time. Using user id as a feature gave the better improvement, increasing accuracy by 3%.
9) Overall accuracy is 76%.
6.2.3. Pros and Cons

1) Using this model with WEKA gives the flexibility of many filters and attributes, for both supervised and unsupervised learning.
2) WEKA only works on small datasets; working with larger datasets is not possible.
3) The algorithm is already implemented, so no effort is needed to change any standard task.
4) If we want to add new functionality that is not available, it is difficult to implement.
6.3. Experimental Results

Sl.No | Action | *Naïve Model | *Naïve Bayes Multinomial | ROC for NBM | Discussion
1 | Initial dataset | 44% | 46% | 0.47 | No filters applied; these are the initial model results.
2 | Refining of training and test data | +7% | N/A | 0.54 | The default training set I selected was mostly positive, so sampling the training data helped increase my model's accuracy. In Naïve Bayes Multinomial, randomization is handled automatically by WEKA (using a randomizer).
3 | Change of stemmer | +0.5% | N/A | 0.55 | Changing from the Lovins stemmer to the Porter stemmer showed a slight increase in accuracy. The Lovins stemmer is not in WEKA's default stemmer list, so it could not be used with Naïve Bayes Multinomial.
4 | N-grams: unigrams | +3% | +3.5% | 0.59 | Of the three n-grams, bigrams gave the optimal results in both cases, so I went ahead and implemented bigrams.
  | N-grams: bigrams | +7% | +7.5% | 0.68 |
  | N-grams: trigrams | +2% | +2% | 0.60 |
5 | Features: business id and user id together | +2% | +1% | 0.70 | Using both together in the probability model reduced accuracy, possibly because the full outer join on business id and user id searches for specific instances that occur in the training set but not in the test set. In this case, Naïve Bayes Multinomial gave the better result.
  | Features: business id | +2% | +3% | 0.74 | Using business id or user id increased accuracy, more so with user id alone. This resembles per-user sentiment analysis: a user tends to use the same sort of text to express his feelings, and a small number of users wrote a large share of the reviews, which explains the increase from using user id as a feature.
  | Features: user id | +5% | +5% | 0.74 |
6 | Bag of words | +5% | +4% | 0.75 | Initially I kept 1000 words for the frequency count; using 3000 words increased accuracy.
7 | Overall accuracy on 100,000 records | ~73% | ~76% | 0.78 |
8 | Accuracy on complete dataset | ~75% | N/A | N/A | I could not fit the complete dataset in WEKA's memory even after allocating 6 GB to WEKA.

* All accuracy rates are rounded to the nearest value.
7. Model Development and Tuning by Vimal Chandra Gorijala

7.1. Naïve Bayes Multinomial Model

7.1.1. About the Model

Multinomial Naïve Bayes is a version of Naïve Bayes designed for text documents, and it is mainly useful for multi-class classification. Initially we had 5 classes for the reviews, but we reduced them to three (positive, neutral, negative) so that we can train the model better. Here the probability of a review d being in class c is computed as

    P(c | d) ∝ P(c) · ∏(1 ≤ k ≤ n_d) P(t_k | c)

where P(t_k | c) is the conditional probability of term t_k occurring in a review of class c. We interpret P(t_k | c) as a measure of how much evidence t_k contributes that c is the correct class, and P(c) is the prior probability of a review occurring in class c. If a review's terms do not provide clear evidence for one class versus another, we choose the one that has higher prior probability.
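A minimal sketch of this scoring rule, using log probabilities and add-one smoothing on toy data (not the actual WEKA implementation):

```python
import math

# Toy training corpus: class label -> list of documents.
class_docs = {
    1:  ["good food", "excellent service good"],
    -1: ["bad food", "worst service"],
}

vocab = {w for docs in class_docs.values() for d in docs for w in d.split()}

def log_score(tokens, c):
    """log P(c) + sum over tokens of log P(t_k | c), with add-one smoothing."""
    words = [w for d in class_docs[c] for w in d.split()]
    total = len(words)
    n_docs = sum(len(docs) for docs in class_docs.values())
    score = math.log(len(class_docs[c]) / n_docs)           # log prior P(c)
    for t in tokens:
        tf = words.count(t)
        score += math.log((tf + 1) / (total + len(vocab)))  # smoothed P(t|c)
    return score

review = "good service".split()
best = max(class_docs, key=lambda c: log_score(review, c))
print(best)  # 1
```

Working in log space avoids underflow when a review contains many terms, which is how multinomial NB is implemented in practice.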
We used WEKA to implement the model. Initially the dataset containing the features review text,
business id, review id, user id and the class label are fed to the tool. The preprocessing is done and they
are converted into word Vectors or NGrams based on the filters applied. Now we implement the model
on them.
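As an illustration of the classification rule above, here is a minimal from-scratch Python sketch of multinomial Naïve Bayes with Laplace smoothing. The toy reviews and helper names are invented for the example; the project itself used WEKA's implementation.

```python
import math
from collections import Counter

def train_mnb(docs, labels, alpha=1.0):
    """Estimate the prior P(c) and smoothed term likelihoods P(t|c)."""
    classes = sorted(set(labels))
    prior = {c: labels.count(c) / len(labels) for c in classes}
    term_counts = {c: Counter() for c in classes}
    for doc, c in zip(docs, labels):
        term_counts[c].update(doc.lower().split())
    vocab = {t for counts in term_counts.values() for t in counts}
    cond = {}
    for c in classes:
        total = sum(term_counts[c].values())
        # Laplace smoothing: add alpha to every count so no P(t|c) is zero
        cond[c] = {t: (term_counts[c][t] + alpha) / (total + alpha * len(vocab))
                   for t in vocab}
    return prior, cond, vocab

def classify(doc, prior, cond, vocab):
    """Return argmax_c of log P(c) + sum_k log P(tk|c); unseen terms are skipped."""
    scores = {}
    for c in prior:
        score = math.log(prior[c])
        for t in doc.lower().split():
            if t in vocab:
                score += math.log(cond[c][t])
        scores[c] = score
    return max(scores, key=scores.get)

# toy reviews, invented for illustration
docs = ["great food great service", "awful slow service", "food was okay",
        "great place", "awful food"]
labels = ["positive", "negative", "neutral", "positive", "negative"]
prior, cond, vocab = train_mnb(docs, labels)
print(classify("great service", prior, cond, vocab))  # prints "positive"
```

Working in log space avoids underflow when a review has many terms, which is why the product in the formula becomes a sum here.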
7.1.2. Model Tuning
In WEKA we can change various properties to increase the performance of the model; the data sample used here has 60,000 records. The following are some of the properties we varied:
1. Using N-grams rather than word vectors; bigrams in particular increased accuracy.
2. Increasing the WordsToKeep count from 1000 to 5000 or 10000, depending on the size of the dataset.
3. Increasing the minimum term frequency from 1 to 10, meaning a term with fewer than 10 occurrences is not considered.
4. Converting all the text to lowercase tokens.
5. Using the attribute selection filter InfoGainAttributeEval on top of N-grams, so that only the top-ranked attributes are fed to the model.
6. Using cross-validation instead of the percentage split option.
7. Utilizing additional features like business_id or user_id.
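A rough sense of what settings 1-4 above do can be given with a small pure-Python sketch that mimics the behaviour of WEKA's StringToWordVector filter. The function name and toy reviews are mine for illustration, not WEKA's API.

```python
from collections import Counter

def to_feature_terms(texts, words_to_keep=1000, min_term_freq=2, use_bigrams=True):
    """Lowercase the text, optionally add bigrams, keep only the top-N
    most frequent terms, and drop terms rarer than min_term_freq."""
    counts = Counter()
    for text in texts:
        tokens = text.lower().split()
        terms = list(tokens)
        if use_bigrams:
            terms += [" ".join(pair) for pair in zip(tokens, tokens[1:])]
        counts.update(terms)
    return [t for t, n in counts.most_common(words_to_keep) if n >= min_term_freq]

reviews = ["Very good food", "very good service", "bad service", "Good food"]
print(to_feature_terms(reviews, words_to_keep=5, min_term_freq=2))
```

On this toy input, bigrams like "very good" survive the frequency cut while rare unigrams like "bad" are dropped, which is the same effect the minimum-term-frequency and WordsToKeep settings had on the real dataset.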
7.1.3. Experimental Results
Features and Parameters              Accuracy (%)   ROC
Initial run with review text         48             0.46
Stopwords                            53             0.51
Stemmer                              54             0.52
Unigrams                             59             0.58
Min word frequency from 5-10         65             0.64
Business id and user id              65.23          0.64
Bigrams                              72             0.73
Trigrams                             63             0.62
User id                              74             0.75
Business id                          74             0.75
Attribute selection filter           78             0.79
Bag of words count to 5000           79             0.81
Overall accuracy                     79.49          0.83
Observation: In the Naive Bayes Multinomial model, varying the minimum term frequency and using bigrams improved the performance drastically. The reason behind this is that the data contains many bigrams, and these settings concentrate the model on the highly frequent words in the reviews.
7.2. Naïve Bayes Multinomial Text Model
7.2.1. About Model
The Multinomial Naïve Bayes Text model operates directly on string attributes; other types of input attributes are accepted but ignored during training and classification. It uses word frequencies rather than a binary bag-of-words representation, which makes it useful mainly for text data.
We used WEKA to implement the model. A data sample of 60,000 instances was used.
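The difference between the frequency-based representation this model uses and a binary bag of words can be shown in a few lines; the vocabulary and review below are illustrative only.

```python
from collections import Counter

def binary_bow(text, vocab):
    """Binary bag of words: 1 if the term occurs at all, else 0."""
    tokens = set(text.lower().split())
    return [1 if t in tokens else 0 for t in vocab]

def frequency_bow(text, vocab):
    """Frequency representation: how many times each term occurs."""
    counts = Counter(text.lower().split())
    return [counts[t] for t in vocab]

vocab = ["good", "bad", "service"]
review = "good good good service"
print(binary_bow(review, vocab))     # [1, 0, 1]
print(frequency_bow(review, vocab))  # [3, 0, 1]
```

The frequency vector preserves how emphatically a review repeats a word, which is the extra signal the Text model exploits.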
7.2.2. Model Tuning
In WEKA we can change various properties to increase the performance of the model. The following are some of them:
1. Converting all the text to lowercase tokens.
2. Varying the minimum word frequency.
3. Using N-grams instead of word vectors.
4. Utilizing additional features like business_id or user_id.
5. Using cross-validation instead of the percentage split option.
6. Increasing the WordsToKeep count from 1000 to 5000 or 10000 depending on the size of the
dataset.
7.2.3. Experiment Results
Features and Parameters              Accuracy (%)   ROC
Initial run with review text         54             0.53
Stopwords                            56             0.57
Stemmer                              59             0.60
Unigrams                             61             0.62
Min word frequency from 5-10         66             0.70
Business id and user id              65             0.64
Bigrams                              73             0.75
Trigrams                             64             0.66
User id                              77             0.79
Business id                          77             0.79
Overall accuracy                     79.6           0.84
Observation: The same reasons mentioned for the Naive Bayes Multinomial model are responsible for the drastic increase in the accuracy of this model. From the comparison of results we can say that the Naïve Bayes Multinomial Text model has slightly higher accuracy (about 0.11 percentage points) than the Naïve Bayes Multinomial model. The reason could be that the Naïve Bayes Multinomial Text model carries out some extra processing, which gives it the slight edge.
8. Model Development and Tuning by Parineetha Gandhi
The dataset fed to the tool has 25,000 reviews, of which 16,576 are positive, 5,650 are negative and 2,772 are neutral.
8.1. K Nearest Neighbors Model
8.1.1. About Model
k is a constant given by the user, and an unlabeled vector is classified by assigning the label that is most frequent among the k training samples nearest to that vector. The nearest neighbors are defined by a distance measure; with Euclidean distance, the distance between vectors x and y is

d(x, y) = sqrt(Σi (xi − yi)²)
-
8/10/2019 Data Crackers YELP
18/24
18
The value of k should be chosen according to the data: a larger value of k reduces the effect of noise on the classification, but makes the boundaries between the classes less distinct.
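The voting scheme described above can be sketched in a few lines of Python. The toy two-dimensional feature vectors are invented for illustration; the project itself ran KNN inside WEKA.

```python
import math
from collections import Counter

def euclidean(x, y):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def knn_classify(query, samples, k=3):
    """Label the query with the most frequent label among its k nearest samples."""
    nearest = sorted(samples, key=lambda s: euclidean(query, s[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# toy 2-D vectors, e.g. (positive-term count, negative-term count) of a review
samples = [((5, 0), "positive"), ((4, 1), "positive"), ((0, 4), "negative"),
           ((1, 5), "negative"), ((2, 2), "neutral")]
print(knn_classify((4, 0), samples, k=3))  # prints "positive"
```

With k=3 (odd) a clean majority usually exists among three classes, which is consistent with the tie-breaking point made in the observations below.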
8.1.2. Model Tuning
I tuned the model by varying the k value. Initially I applied k=1 and observed that changing the tokenizer did not make much difference, which gave an accuracy of 67.2314%.
In the second attempt I tuned the model further by applying k=15 with a bigram tokenizer, and observed that the accuracy increased to 69.125%.
Parameters Tuned:
The following are the parameters I changed while tuning the model. The table in section 8.1.3 shows the results obtained by varying them:
- TFTransform and IDFTransform
- minTermFreq
- outputWordCounts
- lowercaseTokens
- Stemmer
- stopwords
- tokenizer
I also tuned the models with model-specific parameters. For example, for KNN the distance function (e.g. Euclidean distance) and the number of nearest neighbors have been changed; for the decision tree, parameters like the Laplace value and the binary split option have been changed.
8.1.3. Experiment Results
Observation: The results obtained are best when k=5. A likely reason is that k is odd: when classifying into more than two groups, or when using an even value for k, it might be necessary to break a tie in the number of nearest neighbors.
When KNN-specific parameters like Euclidean distance and Manhattan distance were considered, it was observed that Euclidean distance gave better results.
Results obtained when KNN-specific parameters were used
8.2. Decision Tree
8.2.1. About Model
A decision tree classifies instances by sorting them down the tree from the root to some leaf node, which provides the classification of the instance. Each internal node tests an attribute of the instance, and each branch descending from a node corresponds to one of the possible values of that attribute.
8.2.2. Model Tuning
Parameters Tuned:
I applied the same parameters as for the KNN model and got an accuracy of 74.25%.
I performed percentage split in most of the cases, as cross-fold validation was taking quite a long time for each experiment.
Initially I ran the model without applying any parameters on the dataset and observed that the accuracy is 72.41% and the ROC is 0.72.
The best results were obtained when the Laplace value was set to true, which gives an ROC of about 0.82. The reason could be that the Laplace correction biases the probability estimates toward a uniform distribution, so no class at a leaf receives a zero probability.
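The effect of the Laplace correction at a tree leaf can be illustrated with a short sketch; the class counts below are made up for the example (WEKA applies the correction internally when useLaplace is true).

```python
def leaf_probabilities(class_counts, laplace=True):
    """Class probability estimates at a decision-tree leaf.
    With the Laplace correction each count is incremented by 1, so no class
    gets probability 0 and the estimates are pulled toward uniform."""
    k = len(class_counts)
    total = sum(class_counts.values())
    if laplace:
        return {c: (n + 1) / (total + k) for c, n in class_counts.items()}
    return {c: n / total if total else 1 / k for c, n in class_counts.items()}

counts = {"positive": 9, "negative": 0, "neutral": 1}
print(leaf_probabilities(counts, laplace=False))  # negative gets 0.0
print(leaf_probabilities(counts, laplace=True))   # negative gets 1/13, about 0.077
```

Avoiding hard zero probabilities is what makes the ROC curve smoother and is a plausible explanation for the jump to 0.82.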
Decision Tree Specific Parameters:
Parameters     Value   Accuracy   ROC
Binary split   TRUE    71.54      0.72
numFolds       10      69.93      0.73
useLaplace     TRUE    69.98      0.82
Results obtained when Decision Tree specific parameters were used
Result comparison between KNN and Decision Tree
Observation:
The results obtained are best when k=5; a likely reason is that k is odd, since when classifying into more than two groups, or when using an even value for k, it might be necessary to break a tie in the number of nearest neighbors.
For the decision tree the best results were obtained when the Laplace value was set to true, which increased the ROC to 0.819.
9. Main Findings in the Project
The Naive Bayes Multinomial Text model performed the best among all the models we tried. The reasons are listed below.
- Varying the minimum term frequency parameter drastically affected the performance: words which are not repeated frequently and are not useful for classification are ignored.
- The review text mostly contains bigrams like "very good", "feeling awesome" etc., so using bigrams as features for classifying the reviews helped a lot.
- Using additional features like user id and business id increased the performance. For example, if a user gives most of his reviews as positive for different businesses, the next review given by him for any other business will most likely be positive; likewise, if a business has mostly positive reviews, the next incoming review will most likely be positive. For these features to work, the reviews of a user must be present in both the training and the test set, and the same holds for businesses.
- The Naïve Bayes Multinomial model has almost the same accuracy as the above model for the same reasons, but the Multinomial Naïve Bayes Text model does some extra processing, which increases its accuracy.
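The user-id effect described above amounts to a per-user prior. Below is a minimal sketch of how such a prior could be estimated from the training set; the data and function name are illustrative, not part of the WEKA setup.

```python
from collections import Counter, defaultdict

def user_priors(training_reviews):
    """Per-user class distribution from (user_id, label) pairs.
    Usable as a prior for that user's next review, but only helps if the
    user also appears in the test set, as noted above."""
    by_user = defaultdict(Counter)
    for user_id, label in training_reviews:
        by_user[user_id][label] += 1
    return {u: {c: n / sum(cnt.values()) for c, n in cnt.items()}
            for u, cnt in by_user.items()}

train = [("u1", "positive"), ("u1", "positive"), ("u1", "negative"),
         ("u2", "negative")]
priors = user_priors(train)
print(priors["u1"])  # u1's reviews are mostly positive
```

A classifier can then weight its text-based score by this distribution, which is effectively what adding user_id as a feature lets Naive Bayes do.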
10. Results and Comparison
Graph shows accuracy results obtained by different models
From the graph it is clear that Naïve Bayes Multinomial Text gives the best accuracy, about 80%. It can be observed that, from left to right, the Naïve Bayes models started with accuracies of about 48% and 54%, whereas KNN and Decision Tree started with better accuracy; we were expecting better final accuracy from those models, but it did not turn out as we expected. Accuracy increased as we included more features and filters. Among the N-grams, bigrams showed good results, so we used bigrams for further processing. Among the features business id, user id and both together, we observed better accuracy when we considered user id alone. For KNN the accuracy was good with k=5, and considering KNN-specific parameters like Euclidean distance made the results considerably higher. For the Decision Tree, setting the Laplace value increased the ROC. Including these model-specific options improved the accuracy slightly for these two models.
So, Naïve Bayes Multinomial Text classification gave a good accuracy when compared with the other models.
11. Project Management
11.1. Task Allocation and Timelines
We used the project management website www.Asana.com to manage the entire project and workload. Below is the timeline allocation of the workload for each team member. We also used this tool to store all intermediate files, reports, scripts and snippets.
11.2. Self-Assessment:
- Everyone on the team contributed equally. There was no total dependency on, or delay from, anyone in the team.
- Everyone was equally active and enthusiastic to learn something new.
- Before taking any decision, we made sure that everyone was clear about the requirements and the expected output. We followed a process of Knowledge Transfer and Reverse Knowledge Transfer to make sure that everyone was on the same page.
- There were a lot of discussions in the initial phase of the project so that everything would go without any hurdles in the end.
- Everyone in the team has decent knowledge of different tools and technologies like Pentaho Data Integration, WEKA, MySQL, Java, PHP and Big Data components, so whenever any sort of decision had to be made, there was always someone to address it.
- Everyone used the Asana project management tool actively.
11.3. What can be improved?
- Domain knowledge
- Increasing awareness and usability of tools to all the members of the team.
12. List of Queries:
12.1. Bigrams_1
CREATE TABLE bigrams AS
SELECT word, star, frequency FROM (
  SELECT word, star, frequency, rank() OVER (ORDER BY frequency DESC) AS slno FROM (
    SELECT word,
      CASE WHEN pos_count >= neg_count AND pos_count >= nut_count THEN 1
           WHEN neg_count >= nut_count THEN -1
           ELSE 0
      END AS star,
      CASE WHEN pos_count >= neg_count AND pos_count >= nut_count THEN pos_count
           WHEN neg_count >= nut_count THEN neg_count
           ELSE nut_count
      END AS frequency
    FROM (
      SELECT DISTINCT
        CASE WHEN neg.gram.ngram[0] IS NOT NULL THEN concat(neg.gram.ngram[0], " ", neg.gram.ngram[1])
             WHEN nut.gram.ngram[0] IS NOT NULL THEN concat(nut.gram.ngram[0], " ", nut.gram.ngram[1])
             ELSE concat(pos.gram.ngram[0], " ", pos.gram.ngram[1])
        END AS word,
        CASE WHEN pos.gram.estfrequency IS NULL THEN 0 ELSE pos.gram.estfrequency END AS pos_count,
        CASE WHEN neg.gram.estfrequency IS NULL THEN 0 ELSE neg.gram.estfrequency END AS neg_count,
        CASE WHEN nut.gram.estfrequency IS NULL THEN 0 ELSE nut.gram.estfrequency END AS nut_count
      FROM bigrams_neg AS neg
      FULL OUTER JOIN bigrams_nut AS nut ON neg.gram.ngram = nut.gram.ngram
      FULL OUTER JOIN bigrams_pos AS pos ON pos.gram.ngram = nut.gram.ngram
    ) AS a
  ) AS b
) AS c
WHERE slno
12.5. Bigrams_stag_2
CREATE TABLE bigrams_stag_2 AS
SELECT review_id, max(prob_sum) AS prob_max
FROM bigrams_stag_1
GROUP BY review_id;
12.6. Bigrams_test_1_1
CREATE TABLE bigram_test_1_1 AS
SELECT test.review_id, new_star, test.stars AS original_star
FROM (
  SELECT stag1.review_id AS review_id, star AS new_star
  FROM bigrams_stag_1 AS stag1
  INNER JOIN bigrams_stag_2 AS stag2
    ON stag1.review_id = stag2.review_id AND stag1.prob_sum = stag2.prob_max
) a
INNER JOIN test_data test ON a.review_id = test.review_id;
12.7. Stats
This gives the final statistics, showing the number of correctly classified instances and wrongly classified instances.
SELECT stats, COUNT(*)
FROM (
  SELECT CASE WHEN new_star = original_star THEN 1 ELSE 0 END AS stats,
         new_star, original_star
  FROM bigram_test_1_1
) res
GROUP BY res.stats;

SELECT original_star, COUNT(*) FROM bigram_test_1_1 GROUP BY original_star;

SELECT new_star, COUNT(*) FROM bigram_test_1_1 GROUP BY new_star;
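The Stats query returns one count of correctly classified instances (stats = 1) and one of wrongly classified instances (stats = 0); the overall accuracy follows directly. A tiny sketch, with illustrative counts rather than the project's actual numbers:

```python
def accuracy_from_stats(stats_counts):
    """stats_counts mirrors the Stats query output: {1: correct, 0: wrong}."""
    correct = stats_counts.get(1, 0)
    total = correct + stats_counts.get(0, 0)
    return correct / total if total else 0.0

print(accuracy_from_stats({1: 73000, 0: 27000}))  # 0.73
```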