Rating Prediction Based on TripAdvisor Reviews · useful in many application like bussiness...

Rating Prediction Based on TripAdvisor Reviews

Dangyi LiuA53221859

[email protected] San Diego

Yu ChaiA53213872


Chenxi ZhengA53221997


Yilun ZhangA53099979


Abstract—Rating is becoming increasingly importantfor customers, which helps them obtain useful informationof products more easily and more efficiently. Normally, acustomer rates products from different aspects and thenmakes an overall rating. According to this human thinkingprocess, we are curious about how ratings in differentaspects would influence the final overall ratings. Therefore,we make two different prediction models: 1. Use reviewcontext to predict overall rates predictions. 2. Use reviewcontext to predict rates in different aspects(cleanliness,location, rooms, service, sleep quality and value) andthen use these ratings to predict the overall ratings. Bycomparing MSE of two models, we found out that the firstmodel is more accurate than the second one.

Keywords—linear model, rating predictions

I. INTRODUCTION

In nowadays, more and more people start to searchon internet for useful information. In order to helpcustomers finding information efficiently and attractingmore customers, many websites starts to ask customerswith their reviews based on their experience. In thisway, other customers could get useful information easilyand choose their products based on those informations.For example, Tripadvisor asks customers’ reviews basedon their experience on different hotels. Other customerscould easily make decisions about hotels, instead ofwasting amount of time to do the research. In orderto get more review context from customers, agents likeTripadvisor will always give customers some bonus tothose who give feedbacks for their products. Also, somecustomers would like to share their experience withothers. Thus, more and more people comment theirperspectives on the websites including many aspectslike utilities and services. It will not only help othercustomers to make decisions, but help the hotels to knowwhat aspects that customers care about. However, peoplerate the hotels based on their objective thoughts. In otherwords, different people emphasize on various aspects.

For example, a customer focuses on cheap price. Thecustomers may still wont be satisfied if the hotel hasexcellent utilities and service with expensive price. Sothe hotel may still get low overall review rates from thiscustomer.

Therefore, to give an overall rating, a logical customerwill first make rates on different aspects like cleanliness,location, rooms, service, sleep quality or values and soon, based on their experience, then combine them into anoverall rating. Therefore, we are interested in how ratesin different aspects could influence the overall rates. Inother words, we want to make more accurate predictionsfor overall rates based on rates in different aspects.

We use two models. we first use review context topredict the overall rating. Then we use reivew context topredict the rating in several aspects: cleanliness, location,rooms, service, sleep quality, and value. Then using theserates, we predict the overall ratings. To compare thosetwo models, we identify if the model could predict theoverall rating more accurately by using rating in severalaspects. A challenge for this task is that, the data set isreally large and we didnt get complete data set. So wefirst need to preprocess the data set then do the predicton.

In this paper, we introduce some related works andwe describe the dataset that being used in detail. Then wego into some details about the model. Finally providesour conclusion.

II. RELATED WORK

With the advancement of internet, the Web has be-come one of the major sources for collecting opinionsand suggestions from consumer s. There are now nu-merous Web sites containing such information and manyscientists have done some relative research on it.

In [5], Satoshi pointed out the importance of knowingthe reputation of manufacture own product and theircompetitor’s product. Since it was extremely hard to

collect and analyze survey data manually, he developed anew framework which could mining product reputationson the Internet. This framework could automaticallycollect peoples opinions about target products from Webpages, and it uses text mining techniques to obtain thereputations of those products. In this paper, they gener-ated syntactic and linguistic rules to determine whetherany given statement is valid, and if so, what is its attitude(positive or negative).

However, only the known attitude in review analysiswas not enough for manufactures, potential customerswanted to know more detail about the product. Therefore,later in [6], focused on online customer reviews ofproducts, Liu proposed a framework for analyzing andcomparing consumer opinions of competing products,and provided a visual side-by-side and feature-by-featurecomparison for potential customers and product manu-factures. This paper came up with a new technique basedon language pattern mining, which proposed to extractproduct features from Pros and Cons in a particular type.

When scientists figured out the method to extractinformation from reviews, more and more research fo-cused on summarizing users opinions and categorizingreviews. In paper [7], authors suggested that classifyingdocuments not by topic but by overall sentiment. Theycompared three ML method (Naive Bayes, maximumentropy classification, SVM) with the sentiment classi-fication on traditional topic-based categorization. In ourpaper, we also consider this factor and will figure out thebetter one while comparing the overall rating directlyfrom sentence with overall rating derived from aspectratings and aspect rating derived from sentence.

But, how could we analyze the comparative sen-tences? We are not the first one to think about it. In [8],the paper prosed the study of identifying comparativesentence. They first analyzed different types of compar-ative sentences from both the linguistic point of viewand the practical usage point of view. Then conclude arule mining and machine learning approach to identifyingsuch comparative sentences. Those sentences are veryuseful in many application like bussiness intelligence andso on.

The closet work to our study is [9], which introducedthe problem of Latent Aspect Rating Analysis(LARA)and proposed a two-stage approach:in the first stage,keywords specified by users are used in a bootstrappingalgorithm to identify the aspects and segment the reviewcontent; in the second stage, a generative Latent Rating

regression(LRR) model is applied to infer aspect ratingand weights in a review.

III. DATASET DESCRIPTION

A. Data Source

We use dataset from DAIS(The Database and Infor-mation Systems Laboratory at The University of Illinoisat Urbana-Champaign)[2]. This dataset consists of over1,500,000 reviews of hotels from Tripadvisor and werandomly pick up nearly tenth of it as our dataset.

B. Data Cleaning

Since the dataset is collected from the website andsome parts of the review data are missing, we have tofirst preprocess the datase t and get rid of the incompletereviews. Normally, each review in the dataset includeseight fields:

1) Author2) AuthorLocation3) Content4) Date5) HotelID6) Ratings7) ReviewID8) Title

For field Ratings, there are seven subfields:

1) Cleanliness2) Location3) Overall4) Rooms5) Service6) Sleep Quality7) Value

After preprocess, we get 64004 pieces of review data,covering 1310 hotels, spanning over 32 months, rangingfrom Feb. 2010 to Sept. 2012. Then we refine the reviewdata by removing stopword in Content and makingcontent words insensitive to capitalization and shuffle thewhole dataset in the end. Then we can conduct somestatistics analysis and explore more information aboutthe dataset in order to fulfill our prediction task.

C. Statistiscs of Ratings over Aspects

First, we draw the box plot as Figure.1 over subfieldsin field Ratings. From the box plot, we can see themedians of each subfield are [4.0, 5.0, 5.0, 4.0, 4.0,

4.0]. The lower quartile(25% percent) of Overall Ratingis equal to its median, namely 4.0 and high quartile(75%percent) of Cleanliness and Location are both equal totheir medians, namely 5.0. And there are some flier data,or outlier data lie in 1.0 and 2.0 in these three subfields.And for the rest four subfields, their rating distributionare quite equivalent, all with higher quartile of 5.0 andlower quartile of 3.0. And they all get lower whisker at1.0. From this figure, we have an approximate understandabout the distribution of ratings in different aspects.

Overall Cleanliness Location Rooms Service Sleep Quality Valuerating aspect

1.0

1.5

2.0

2.5

3.0

3.5

4.0

4.5

5.0

Ratin

gs

Fig. 1: Box Plot of Ratings over Aspects

D. Average Overall Ratings over Time

Then we explore the relationship between averageoverall ratings and time. We first calculate the numberof reviews over time to see the distribution of reviewnumbers. It is shown in Figure 2 and we can see thatthere exists several period of time with small amountof reviews which are under 500. Normally, the numberof reviews ranges in [1000, 3000], but there also existsperiod of time with large amount of review over 3000.We then plot the figure to show to relationship betweenaverage overall rating and time, and it is shown as Figure3(there is no specific meaning behind different color).The size of area in Figure 3 represents the amountof review numbers, where bigger circle corresponds tolarger amount of reviews. There are some outliers in theupper part of figure which comes from a small amountof reviews. Otherwise, we can see there is no apparenttrend over time, the average overall rating normally liesin [3.9, 4.2].

E. Length of Review and Overall Rating

Intuitively, the length of review has some connectionwith overall rating. We also explore the relationshipbetween average overall ratings and length of contentin reviews. We first count the number of reviews fordifferent length of content and show the result in Figure4. We can see normally the length of reviews lies in

0 5 10 15 20 25 30month span(201002-201209)

0

1000

2000

3000

4000

num

ber o

f rev

iews

Fig. 2: Box Plot of Ratings over Aspects

201001 201006 201011 201104 201109 201202 201207date(year_month)

3.9

4.0

4.1

4.2

4.3

4.4

avg_

over

all_r

atin

g

Fig. 3: Changes of Overall Rating over Time

[6, 200]. There are only few number of reviews withcontent length over 500. The relationship between lengthof review and corresponding overall review can be shownin Figure 5. We can see for those with review lengthunder 200, their average overall rating almost intensivelylies in [3.5, 4.5]. However, for those with larger reviewlength, the average overall ratings varies from [1.0, 5.0]dispersively.

IV. PREDICTION TASK

Our task is to predict overall rating using two differentmodels

• (Direct method) predict overall rating directlybased on review text.

• (Indirect method) predict ratings for individualaspects first, then use individual ratings to predictoverall rating.

We want to know which model produces a betterresult. The metric we use to evaluate models is mean

0 500 1000 1500 2000length of content

0

100

200

300

400

500

600

700nu

mbe

r of r

evie

ws

Fig. 4: Number of Reviews over Content Length

0 100 200 300 400 500 600length of review

1.0

1.5

2.0

2.5

3.0

3.5

4.0

4.5

5.0

avg_

over

all_r

atin

g

Fig. 5: Length of Review and Corresponding OverallRating

square error,

MSE = ‖ y − prediction ‖22

We take 10000 reviews out of all 64004 reviews as thetest set, the MSE is calculated on test set.

V. MODELS DESCRIPTION

A. Preprocess of review text

Both methods require us to convert the review contentinto word vector. We’re taking the following steps.

1) Split text into words:

1) Remove non-roman characters. These includesdigits,

2) punctuations, greek letters, etc. To perserve theseperator between words, we replace these char-acters with spaces.

3) Convert letters into lowercase.

4) Split the text into words. It’s possible thatwe mistakenly split the contractions, e.g. split“we’ll” into “we” and “ll”.

5) Remove stopwords such as “a”, “they”, “be-cause”, “between”, “is”. It’s also necessary forus to remove the fragments of contractions suchas “ll”, “ve”, “t”, “s”.

2) Generate Word Set: In our review context, we onlycare about unigrams because most of informative wordsare adjective such as dirty and terribles. Also, we take1000 most common unigrams to make the predictions,in order to avoid rare frequency and meaningless words.

3) Convert Word Bag into Word Vector: Given thecontext review from customers, we need to formalizehow important of each word in the review context.The importance of words increased proportionally to thefrequency of the word appears in the document. However,there are some words are not important but appears a lotof times like ”the”, ”is”. So in order to see the importanceof the word in the review context, we need to balance thefrequency of the word appears in current review contextand other reivew context. Therefore, we choose to use tf-idf (term frequency-inverse document frequency) to filterthe context in different fields like cleaning and so on. Thetf-idf is defined as follows.

idf(t) = logTotal number of documents

Number of documents with term t in it

tf(t) =Number of times term t appears in a document

Total number of terms in the document

tf-idf(t) = tf(t)× idf(t)

B. Direct Method

In direct method, we predict overall rating directlybased on word vector extracted from review contextusing the following linear model.

overall rating = α+ θ0tf-idf(t1) + θ1tf-idf(t2) + . . .

To prevent overfitting, instead of minimizing MSE,we add L2 regularization. We’ll also need to seperatea validation set from training set in order to train thehyperparameter λ. We take 10000 reviews out of thetraining set as the validation set.

The result shows that, this method could achieve anMSE of 0.6598 on test data, when λ = 1000.

C. Indirect Method

In constract, when using indirect method, we take thefollowing steps to make prediction.

1) Train a model to predict overall rating based onindividual ratings

2) For each aspect, train a model to predict therating based on word vector.

The general idea is similar with direct method thoughwe use the aspect rating as mediator. The results of MSEis 0.6632.

1) From Individual to Overall: We use a linear modelto fit between overall rating and individual ratings, withthe following form.

overall rating = α+ θ0 cleanliness + θ1 location

+ θ2 rooms + θ3 service

+ θ4 sleep quality + θ5 value

Since we only have 7 parameters in this model, wecould use linear regression without regularization.

2) From Word Vector to Individual Rating: For eachaspect, we use linear regression with L2 regularizationto predict the individual rating, with the following form.

individual rating = α+ θ0tf-idf(t1) + θ1tf-idf(t2) + . . .

3) Result: For the first step, we calculate the combina-tion of aspect ratings in order to predict the overall rating.Here is the result of the coefficients for each aspect:

Aspect CoefficientCleanliness 0.13864Location 0.08302Rooms 0.25638Service 0.27541Sleep Quality 0.12467Value 0.18121

which achieves an MSE of 0.2118 on test data. We couldsee that the aspect people care most is service, whichhas a heavy impact on overall rating, and location hasthe least impact on overall rating.

For the second step, we take λ = 1000. The MSE foreach aspect are

Aspect MSECleanliness 0.6566Location 0.6504Rooms 0.7569Service 0.7811Sleep Quality 0.8298Value 0.8218

We could draw a wordcloud for each aspect, showingwords with highest and lowest coeffients.

Highest Lowest

Cleanliness

Location

Rooms

Service

Sleep Quality

Value

Fig. 6: Wordcloud for Each Aspect

VI. CONCLUSION

The paper discusses two ways to predict the overallrating from user review context. In the first method, weapply the indirect method. The first step is that we predictthe aspect rating from review context; the second stepis that we predict the overall rating from the weights ofaspect ratings. While in the another method, we calculatethe overall rating directly from the review context. Bothof them are applied with linear model and derive themodel and result according to the minimization of MSE.In our result, we found the MSE of the indirect methodis 0.6632 and the MSE of direct model is 0.6598. Basedon the result, we could conclude the following things:

• Contrary to our original assumption, directmethod performs better than indirect method. Aswe known, when any person goes through a re-view text, s/he would label the key word withoutconscious. For example, when one notices theword “rude” in a review, s/he would label itas service attitude and then reduce their overallrating to this hotel according to the differentimportance of service attitude when they choosehotels. But, in contrast, the machine learning doesnot care about it. It does not care about how theword labels but how it would influence overallrating.

• Though the indirect model use the mediate keys,aspect rating, they are still derived the finalresult by linear model, which is same as thedirect model. In addition, since we derive themodel according to MSE, the direct model wouldprovide the result with the minimal MSE, whileindirect model could only guarantee the localminimal.

• However, the difference between those two meth-ods is very small which means both methodscould be accepted in usage.

REFERENCES

[1] H. Kopka and P. W. Daly, A Guide to LATEX, 3rd ed. Harlow,England: Addison-Wesley, 1999.

[2] Hongning Wang, Chi Wang, ChengXiang Zhai and Jiawei Han.Learning Online Discussion Structures by Conditional RandomFields. The 34th Annual International ACM SIGIR Conference(SIGIR’2011), P435-444, 2011.

[3] K. Dave, S. Lawrence, and D. M. Pennock. Mining the peanutgallery: opinion extraction and semantic classication of productreviews. In WWW ’03, pages 519 - 528, 2003.

[4] M. Hu and B. Liu. Mining and summarizing customer reviews.In W. Kim, R. Kohavi, J. Gehrke, and W. DuMouchel, editors,KDD, pages 168- 177. ACM, 2004.

[5] S. Morinaga, K. Yamanishi, K. Tateishi, and T. Fukushima.Mining product reputations on the web. In KDD ’02, pages 341-349, 2002.

[6] Liu, Bing, Minqing Hu, and Junsheng Cheng. “Opinion ob-server: analyzing and comparing opinions on the web.” Proceed-ings of the 14th international conference on World Wide Web.ACM, 2005.

[7] Pang, Bo, Lillian Lee, and Shivakumar Vaithyanathan. “Thumbsup?: sentiment classification using machine learning techniques.”Proceedings of the ACL-02 conference on Empirical methods innatural language processing-Volume 10. Association for Com-putational Linguistics, 2002.

[8] Jindal, Nitin, and Bing Liu. “Identifying comparative sentencesin text documents.” Proceedings of the 29th annual internationalACM SIGIR conference on Research and development in infor-mation retrieval. ACM, 2006.

[9] Wang, Hongning, Yue Lu, and Chengxiang Zhai. “Latent aspectrating analysis on review text data: a rating regression approach.”Proceedings of the 16th ACM SIGKDD international conferenceon Knowledge discovery and data mining. ACM, 2010.

Rating Prediction Based on TripAdvisor Reviews · useful in many application like bussiness...

Documents

Transcript of Rating Prediction Based on TripAdvisor Reviews · useful in many application like bussiness...