8/15/2019 Yelp Advisor Report

Classifying Yelp Restaurants

Team Yelpadvisor: Stephanie Wuerth, Bichen Wu, Tsu-Fang Lu

December 14, 2015

Problem Statement and Background

Our goal is to classify restaurants into existing labels using the Yelp academic dataset. We also hope to

further classify restaurants with more specific labels than their given labels. For example, a restaurant

may just be labeled as “Chinese” but can we further classify it as “Sichuan” or “Taiwanese”?

Our dataset is Yelp’s academic dataset, which is provided for use as part of the Yelp Dataset

Challenge (1). This dataset spans approximately 10 years of Yelp reviews (text and star rating) and 5 years

of Yelp tips, along with hourly sums of check-ins for each business. It also includes general information

about the business, such as its categories, ambiance, business hours, and address, and some information

about the Yelp reviewers such as the number of reviews they have written and the average star rating they

have given. The dataset includes 10 cities: Edinburgh, Karlsruhe, Charlotte, Urbana, Madison, Las Vegas,

Phoenix, Pittsburgh, Montreal, and Waterloo (Canada).

To measure accuracy, we compare our predicted label to the true label. We measure basic accuracy,

precision and recall, as well as AUC. We also compare our model’s accuracy to a baseline model.

Some potential applications of our classification models include:

1. Help Yelp automate restaurant labeling without user inputs.

2. Label vaguely labeled restaurants more specifically or label restaurants with missing labels.

3. Inform customers about restaurants’ specialties and particular cuisines by further sub-categorizing

restaurants into more specific labels.

Methods

Data collection

We used the Yelp academic dataset, which is made available by Yelp for the Yelp Dataset Challenge (1). To obtain this data, we registered for the Yelp Dataset Challenge at http://www.yelp.com/dataset_challenge.

Data preparation

The data provided is in JSON format, but a Python script for converting to CSV is offered at

https://github.com/Yelp/dataset-examples. We used this script (json_to_csv_converter.py) to convert the

JSON data into CSV files, then we read those into a Python notebook and stored the data in Pandas

dataframes. We subset the data to keep what is potentially useful for our chosen problem. We use 9 of the 10

cities in the Yelp Academic Dataset for our model. Karlsruhe, Germany data is omitted because most reviews there are not written in English, and review text is the richest component of our dataset. We

further subset by selecting only restaurants (excluding Hotels, Spas, etc.). Within restaurants we further

subset for the 20 most common types of restaurant, as dictated by their given labels. Labels chosen and

number of restaurants with each label in our subsetted dataset are given in Fig. 1 (see Appendix). We

also stripped end-of-line characters, carriage returns, and text matching certain regex patterns from the review texts so that our bag of words

model would work better.
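
As a rough illustration of this kind of cleanup, the sketch below lowercases a review, removes URLs and punctuation, and collapses line breaks into single spaces. The specific patterns and the function name clean_review are assumptions made for the example; the report does not list which regex patterns were actually removed.

```python
import re

# Hypothetical patterns; the report does not say which ones were used.
_URL = re.compile(r"https?://\S+")
_NON_LETTER = re.compile(r"[^a-z\s']")
_WHITESPACE = re.compile(r"\s+")


def clean_review(text: str) -> str:
    """Lowercase a review and strip URLs, punctuation, and line breaks."""
    text = text.lower()
    text = _URL.sub(" ", text)
    text = _NON_LETTER.sub(" ", text)          # drops digits and punctuation
    return _WHITESPACE.sub(" ", text).strip()  # collapses \n, \r, tabs, spaces


cleaned = clean_review("Great tacos!\r\nVisit http://x.com now")
```

Collapsing all whitespace at the end means end-of-line characters and carriage returns never reach the tokenizer as separate tokens.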

Featurization

We featurize our review text using a Bag of Words (BoW) model, building a training matrix of number of

restaurants by size of vocabulary as follows:

All reviews received by each restaurant in the training set (70% of total) are joined and tokenized with

stopwords removed, then words are counted to create the sparse BoW vector for each restaurant.

We tested several different feature inclusions:

- N-grams: unigrams only, or bigrams + unigrams

- Number of features retained: 6000, 15000, 100,000, or ~200,000 (which is the total count of

unique words in our training corpus)

- Feature weights: raw frequencies or term frequency-inverse document frequency (TF-IDF)

weighting. We note here the specifics of the TF-IDF weighting: we used the default parameters of

the sklearn.feature_extraction.text.TfidfTransformer() tool (norm='l2', use_idf=True,

smooth_idf=True, sublinear_tf=False). The norm parameter means we normalize the final

vectors to unit length, and the smooth_idf and use_idf parameters mean our features are weighted according to

tf * (idf + 1), where tf is the frequency of the feature in the restaurant's merged reviews, and idf

is the smoothed, log-scaled inverse document frequency of the feature across the entire training corpus (all restaurant reviews).
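
The featurization described above can be sketched with scikit-learn as follows. This is a minimal example on made-up text, not the project's actual notebook code: the merged_reviews list and its contents are hypothetical stand-ins for the per-restaurant merged reviews. With smooth_idf=True, scikit-learn computes the idf term as ln((1 + n) / (1 + df)) + 1, where n is the number of documents and df is the number of documents containing the term; the last lines check that formula against the fitted transformer.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical stand-in: one string of merged reviews per restaurant.
merged_reviews = [
    "great pizza and great crust",
    "spicy noodles and fried rice",
    "pizza place with cold beer",
]

# Bag of words with English stop words removed, unigrams + bigrams,
# and a cap on the number of retained features.
vectorizer = CountVectorizer(
    stop_words="english",
    ngram_range=(1, 2),
    max_features=200000,
)
counts = vectorizer.fit_transform(merged_reviews)  # restaurants x vocabulary

# Defaults: norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False.
tfidf = TfidfTransformer()
weighted = tfidf.fit_transform(counts)

# Recompute the smoothed idf by hand to make the weighting explicit.
n_docs = counts.shape[0]
doc_freq = np.asarray((counts > 0).sum(axis=0)).ravel()
expected_idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Squared L2 norm of each weighted row (should be 1 after normalization).
row_sq_norms = np.asarray(weighted.multiply(weighted).sum(axis=1)).ravel()
```

Switching between the raw-count and TF-IDF variants then only requires passing either counts or weighted to the classifier.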

Another featurization we tried, but did not implement in the final pipeline, is to use the star rating matrix,

which is a matrix of number of users by number of restaurants. Each element in the matrix corresponds to

a user’s rating for a certain restaurant. Then we performed matrix factorization (through PCA and

Alternating Least Squares) to obtain a factor matrix of number of factors by number of restaurants. We

treated each vector (with the length of factors) as a data point to represent each restaurant.

Learning

First we describe the learning methods used for the supervised problem of classifying restaurants into

their existing labels. Then we describe the methods for the unsupervised problem of classifying

restaurants into subcategories.

Supervised text-based classification into existing labels

Models tested: Logistic regression and random forest

Logistic regression marginally outperformed our random forest models, so we have chosen the logistic

regression model as our primary model.

Parameter choices:

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,

intercept_scaling=1, max_iter=100, multi_class='ovr',

penalty='l2', random_state=None, solver='liblinear', tol=0.0001,

verbose=0)

Logistic regression: Multi_class = “ovr” indicates that a binary problem is fit for each label. So in our

case, for each of the 20 categories we model whether a restaurant does or does not fall into that category.

This is a logical choice since some restaurants fall into more than one category (for example, many “Sushi

Bars” are also “Japanese”).
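
To make the one-binary-problem-per-label setup concrete, here is a small sketch on invented data. It uses scikit-learn's OneVsRestClassifier wrapper around a liblinear logistic regression with a multi-label indicator matrix, which is one way to express the same idea; it is not necessarily how the project code was organized, and the toy counts and label lists are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# Toy count matrix; columns are counts of ["sushi", "ramen", "taco", "salsa"].
X = np.array([
    [5, 0, 0, 0],
    [4, 1, 0, 0],
    [0, 6, 0, 0],
    [0, 0, 5, 2],
    [0, 0, 3, 4],
    [1, 0, 4, 1],
])

# A restaurant may carry more than one label.
labels = [
    ["Japanese", "Sushi Bars"],
    ["Japanese", "Sushi Bars"],
    ["Japanese"],
    ["Mexican"],
    ["Mexican"],
    ["Mexican"],
]

binarizer = MultiLabelBinarizer()
Y = binarizer.fit_transform(labels)  # one 0/1 column per label

# One independent binary logistic regression is fit for each column of Y.
model = OneVsRestClassifier(LogisticRegression(C=1.0, solver="liblinear"))
model.fit(X, Y)
predicted = model.predict(X)  # same shape as Y, one 0/1 decision per label
```

Because each label gets its own classifier, a restaurant can be predicted as both "Japanese" and "Sushi Bars" at the same time.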

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',

max_depth=None, max_features=6000, max_leaf_nodes=None,

min_samples_leaf=1, min_samples_split=2,

min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=4,

oob_score=False, random_state=None, verbose=0,

warm_start=False)

Random forests: We tested a number of parameter choices, but the best performance was achieved by

keeping 6000 features, 100 estimators, bootstrap on, and gini criterion. We initially included fewer

features, because according to the scikit-learn documentation (2), for classification tasks, the number of features

used in a random forest model is often optimized with max_features=sqrt(n_features). n_features in our

case is ~200,000, so ~450 would be a good choice for max_features. However, we saw increased

accuracy when we included more features.
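
For reference, the square-root heuristic can be worked out directly. The vocabulary size below is the approximate figure quoted in the text, not an exact count.

```python
import math

n_features = 200_000                 # approximate vocabulary size from the text
sqrt_rule = math.isqrt(n_features)   # integer part of the square root
ratio = 6000 / sqrt_rule             # how far 6000 exceeds the heuristic
```

Here sqrt_rule evaluates to 447, so the 6000 used in the final configuration is more than ten times the heuristic value.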

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

Multinomial Naïve Bayes (MNB) (baseline model):

The alpha parameter is left at its default of 1, which applies additive (Laplace) smoothing.

We also tried Bernoulli Naive Bayes by binarizing features such that presence of a word (count of 1 or

more) gave a feature value of 1 while absence of a word gave a feature value of 0. This method gave us

fairly high accuracy, but zero recall for all categories, so we present Multinomial NB as our baseline

model.

Clustering for Sub-Categorization

For sub-categorization, we implemented spectral clustering, which can be summarized as the following

procedure (3):

1. Form the affinity matrix $A$, with $A_{ij} = \exp(-\|s_i - s_j\|^2 / (2\delta^2))$ for $i \neq j$ and $A_{ii} = 0$.

2. Define $D$ to be the diagonal matrix whose $(i,i)$-element is the sum of $A$'s $i$-th row, and construct

$L = D^{-1/2} A D^{-1/2}$.

3. Find the $k$ eigenvectors $x_1, \ldots, x_k$ corresponding to the $k$ largest eigenvalues of $L$, and form the matrix

$X = [x_1 \cdots x_k]$ by stacking them as columns.

4. Re-normalize each row of $X$ to unit length to form the matrix $Y$.

5. Treat each row of $Y$ as a data point and run K-means clustering on $Y$.

5. Treat each row of Y as a data point, do K-means clustering on Y.

We implemented this algorithm ourselves and applied it to cluster: (1) all restaurants into groups in order

to see whether or not these groups correspond to a sensical composition of given restaurant types, and (2)

Chinese restaurants into subcategories. The parameter in this algorithm is δ, which controls the

connectivity of data points. The smaller δ is, the more separated clusters will appear. We set δ to the

value that makes the number of separated clusters equal to 5.
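
The five steps can be sketched in NumPy as shown below. This is an illustrative reimplementation following Ng, Jordan, and Weiss (3), not the project's own code: the function name, the numerical guards against division by zero, and the toy two-blob data are all assumptions. It takes the k eigenvectors with the largest eigenvalues of L, as in the cited paper.

```python
import numpy as np
from sklearn.cluster import KMeans


def spectral_clustering(S, k, delta, seed=0):
    """Cluster the rows of S into k groups (Ng, Jordan, and Weiss style)."""
    # Step 1: Gaussian affinity matrix with a zero diagonal.
    sq_norms = (S ** 2).sum(axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (S @ S.T)
    sq_dists = np.maximum(sq_dists, 0.0)  # clip tiny negative round-off
    A = np.exp(-sq_dists / (2.0 * delta ** 2))
    np.fill_diagonal(A, 0.0)

    # Step 2: L = D^{-1/2} A D^{-1/2}, with D the diagonal of row sums.
    degree = np.maximum(A.sum(axis=1), 1e-12)  # guard for isolated points
    inv_sqrt = 1.0 / np.sqrt(degree)
    L = A * inv_sqrt[:, None] * inv_sqrt[None, :]

    # Step 3: eigh returns eigenvalues in ascending order, so the last
    # k columns are the eigenvectors with the k largest eigenvalues.
    _, eigenvectors = np.linalg.eigh(L)
    X = eigenvectors[:, -k:]

    # Step 4: normalize each row of X to unit length.
    row_norms = np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1e-12)
    Y = X / row_norms

    # Step 5: K-means on the rows of Y.
    return KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(Y)


# Toy usage: two well-separated blobs in the plane.
rng = np.random.default_rng(0)
blob_a = rng.normal(0.0, 0.1, size=(10, 2))
blob_b = rng.normal(5.0, 0.1, size=(10, 2))
labels = spectral_clustering(np.vstack([blob_a, blob_b]), k=2, delta=1.0)
```

For real review data, S would be the matrix of restaurant feature vectors, and delta would need tuning as described above.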

Other things we tried:

Initially we were working on a different problem involving time series analysis of daily review counts.

We took the time series of the daily review counts as our features and our hope was to (1) find different

customer influx patterns for different types of restaurants, and (2) predict customer influx to certain cities

and venues based on these time series. Our analysis failed because after running some statistical tests, we

learned that there is not enough information in the time series for us to distinguish different types of

restaurants and predict customer influx.

Another method we tried was to factorize the star-rating matrix to yield a factor matrix corresponding to

restaurants, then use this as features and apply k-means on it to find restaurant types. However, the

star-rating matrix is very sparse. Even using ALS (Alternating Least Squares) factorization, our average

prediction error (measured by root of mean squared error) was larger than 1 star. So we abandoned this

feature and used bag of words instead.

Results

Supervised Labeling

How many features should we include?

Figure 2 in the appendix shows that, for the case of unigrams only and no TF-IDF weighting, accuracy,

precision, and AUC are all maximized by including the entire corpus. The effect of increased corpus size

on recall is less clear cut. Since recall is not drastically decreased by including more features, we can base

our choices on the precision and accuracies.

Should we weight our data by TF-IDF? Should we include bigrams?

In Figure 3, we compare the performance of our logistic regression model for four featurization choices.

In all cases, ~200,000 features are used, but we vary inclusion of unigrams vs. unigrams + bigrams, and

we test whether or not to weight with TF-IDF. The top plot shows that using bigrams in addition to unigrams has little effect on the overall accuracies. We see a slight improvement for Italian and Chinese

restaurants when adding bigrams, but this improvement is not substantial. The middle and bottom plots

show precision and recall. We see that using TF-IDF weights generally increases precision but decreases

recall. We average over all 20 categories for these measures in the table below:

Measure     Raw frequencies,   TF-IDF weights,   Raw frequencies,     TF-IDF weights,
            unigrams only      unigrams only     bigrams + unigrams   bigrams + unigrams
Accuracy    0.9517             0.9486            0.9562*              0.9447
AUC         0.9220             0.9492            0.9347               0.9498*
Precision   0.7215             0.8611            0.7737               0.9040*
Recall      0.5672*            0.3304            0.5551               0.2665

Table 1. Comparison of different featurization choices for the logistic regression model (with ~200,000

features retained). Measures are averaged across scores for the 20 categories.

The highest score for each accuracy measure is marked with an asterisk. The highest overall accuracy is achieved

by including bigrams and unigrams, and weighting the features by their raw counts. Precision and AUC

are improved by weighting by TF-IDF, but recall is markedly decreased. Such low recall would cause us,

for example, to fail to recommend a relevant restaurant to a Yelp user, so we choose not to weight our

data by TF-IDF. Thus our final choice for featurization is: ~200,000 features of bigrams and unigrams,

weighted by raw word counts.

Discussion of individual category accuracies

Figure 4 shows accuracy, precision and recall for each category for our chosen model and featurization.

Alongside our model’s accuracies, we include “Always False” accuracies, which is the accuracy for a

model that simply predicts false uniformly for each label. We see that for all categories, our model

outperforms the "Always False" classifier. However, accuracy is very close to this "Always False"

classifier for the rarest categories: Sushi Bars, Delis, Steakhouses, Seafood, and Chicken Wings. With

larger numbers of these types of restaurants, performance for these categories would potentially be

improved.
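
The following sketch shows why accuracy alone can look strong for a rare label and why precision and recall are reported alongside it. The numbers are invented for illustration (5 positives among 100 restaurants and a hypothetical classifier's predictions); they are not taken from our results.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Toy binary problem: 5 of 100 restaurants carry a rare label.
y_true = np.zeros(100, dtype=int)
y_true[:5] = 1

# "Always False" predicts that no restaurant has the label.
always_false = np.zeros(100, dtype=int)

# Hypothetical classifier: finds 3 of the 5 positives, with 1 false alarm.
y_pred = np.zeros(100, dtype=int)
y_pred[:3] = 1
y_pred[50] = 1

baseline_accuracy = accuracy_score(y_true, always_false)  # 95 of 100 correct
model_accuracy = accuracy_score(y_true, y_pred)           # 97 of 100 correct
model_precision = precision_score(y_true, y_pred)         # 3 of 4 predicted positives
model_recall = recall_score(y_true, y_pred)               # 3 of 5 true positives
```

In this toy case the classifier beats "Always False" by only two percentage points of accuracy, even though it recovers more than half of the rare positives.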

Taking into account accuracy, precision, and recall, we see that our classifier is best at labeling Mexican, Pizza, and Chinese. It is not as successful at classifying American (Traditional and New), or at classifying

the label "Food." This makes sense because American restaurants and "Food" restaurants have less

obviously identifying word features than Mexican or Chinese restaurants. To visualize this effect, we

examine word clouds for some of these cases (Figure 5). The word clouds display the most frequent

words in all reviews for a given category, sized by their frequencies. Stopwords are removed in addition

to the word “food,” which is common to all categories.

Random Forest Model Results

Here we also report the accuracy measures for our Random Forest model, because it performed nearly as

well as the Logistic Regression model. For the results presented here, we used the same training matrix as

in the primary model (unigrams + bigrams, raw counts), but we only retain 6000 features. The parameters

used are given in the Methods section. We halve our test dataset into validation (for testing parameter

combinations) and final test data sets. Table 2 summarizes accuracy scores for this model (both validation

and final test scores), with logistic regression included for comparison. Accuracy and recall are below the

logistic regression model, but precision is higher. If given more time to test more parameter combinations, it

is plausible that we could achieve higher accuracy with this random forests approach. Recall might

improve with shallower trees or fewer features considered, since these parameters give a simpler model

with lower variance, but with this comes potentially higher bias (lower accuracy).

Model                        Accuracy   Precision   Recall
Logistic Regression          0.9562     0.7737      0.5551
Random Forest (validation)   0.9530     0.8867      0.4666
Random Forest (final test)   0.9522     0.8972      0.4550

Table 2. Accuracy measure comparisons between primary (logistic regression) and a random forest

model. Measures are averaged across scores for the 20 categories.

The random forest model allows us to examine the most important features. Here we list some of the most important features (in decreasing order): pizza, chinese, bar, mexican, pizzas, burger, subway, mcdonalds,

mexican food, chinese food, sandwiches, bartender, sandwich, sushi, tacos, taco, bartenders, italian, coffee,

burrito, crust, fries, burgers, bar food, drive, fried rice, pizza good, asada, salsa, pepperoni, beer, good pizza,

fast food, pasta, rice, carne asada, waitress, breakfast, italian food, burritos, subs, wings, best pizza, happy

hour, bars, bread, mein, drinks, pizza place, beers, sub, fast, great pizza, cafe, restaurant, italian restaurant, eggs, japanese, place, rice beans, deli, taco bell, great, carne, chinese restaurant, pub.

Many of these features are obvious identifiers for certain labels.
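
A ranking like the one above can be read off a fitted scikit-learn forest through its feature_importances_ attribute. The sketch below uses a made-up four-word vocabulary and toy counts, so the resulting order is illustrative only.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical vocabulary and per-restaurant counts.
vocabulary = np.array(["pizza", "crust", "taco", "place"])
X = np.array([
    [4, 2, 0, 1],
    [3, 1, 0, 2],
    [5, 3, 0, 1],
    [2, 1, 0, 3],
    [0, 0, 3, 1],
    [0, 0, 4, 2],
    [0, 1, 2, 1],
    [0, 0, 5, 3],
])
y = np.array([1, 1, 1, 1, 0, 0, 0, 0])  # 1 = labeled "Pizza"

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Importances are normalized to sum to 1; sort from most to least important.
importances = forest.feature_importances_
order = np.argsort(importances)[::-1]
ranked_terms = vocabulary[order]
```

With a real vocabulary, the array of terms would come from the fitted vectorizer rather than being written by hand.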

Comparison to baseline model

In the table below we summarize accuracy measures (averaged across our 20 categories) for our primary

model and our baseline model. We see significant improvement in all measures except for recall. The low

precision of the baseline model indicates that it underfits our data, which is expected of a simple model

such as Naive Bayes.

Model                           Accuracy   AUC      Precision   Recall
Multinomial NB (Baseline)       0.9119     0.8690   0.4818      0.7648
Logistic Regression (Primary)   0.9562     0.9347   0.7737      0.5551

Table 3. Accuracy measure comparisons between primary (logistic regression) and baseline (multinomial

naive Bayes) models. Measures are averaged across scores for the 20 categories.

We compare performance against the baseline model for all categories in Figure 6. In the top panel we see

that our model is more accurate than our baseline model for all categories. While the improvement does

not appear drastic, it should be noted that "Always False" never outperforms our primary model, but it

outperforms the baseline model for 11 of the 20 categories (these 11 being Fast Food, American

(Traditional), Sandwiches, Food, American (New), Breakfast and Brunch, Cafes, Delis, Steakhouses,

Seafood, and Chicken Wings). The 9 categories for which the baseline model surpasses “Always False” in accuracy are all categories we expect to have more unique vocabularies, such as ethnic cuisine. Our

primary model outperforms the baseline model most significantly for labels Sandwiches (improvement by

>20%) and Fast Food (improvement by >10%). Better accuracy is expected for logistic regression as

compared to naive Bayes for a problem such as ours because naive Bayes is a simplification of logistic

regression. Naive Bayes assumes that features (words) are generated independently given the class (in our

case, the “class” is true or false for each label), whereas logistic regression does not make this

assumption. As such, we expect the naive Bayes model to have higher bias but lower variance, and that it

will underfit our data, leading to low precision and high recall.

Spectral Clustering

We applied spectral clustering on: (1) all restaurants to classify them into groups and analyze the true

labels that comprise these groups, and (2) Chinese restaurants to classify them into subcategories. In order

to figure out which labels each cluster corresponds to, we printed out the top 5 true labels of each of the 5

clusters. As shown in Figure 7, we see that cluster 3 corresponds to pizza or Italian restaurants, and cluster 2

corresponds to bars and nightlife venues. The other three clusters are more difficult to interpret

because they contain mixed types of restaurants. Figure 8 is the result of applying spectral clustering on

Chinese restaurants. Many of the Chinese restaurants have true labels in addition to “Chinese,” such as

“Taiwanese” or “Buffet.” So, as we did for the clusters of all restaurants, we can again print the most

common true labels (other than Chinese) for the restaurants in our Chinese clusters. First we notice that

the most frequent labels in each cluster are “Asian Fusion” and “Buffet,” which provide little information

about cuisine type. Beyond those, the first cluster contains the labels Japanese and Sushi Bars,

which suggests that these restaurants lean toward Japanese food. The fifth cluster contains Thai,

Vietnamese, and Szechuan restaurants, whose cuisines are relatively spicy.

Tools

We performed all of our analysis in iPython notebooks because this platform is useful for visualizing

results alongside code. We used Pandas and NumPy for data manipulations because these are tools all

group members use. At first, we built our BoW features (and TF-IDF weights) using handwritten code

adapted from CS294 homework, but later we migrated towards scikit-learn tools for this task.

sklearn.feature_extraction.text.CountVectorizer() was used to form BoW training matrices. This tool

simplified a few tasks:

(1) setting the maximum feature retention count (“max_features” parameter),

(2) setting which n-grams to include (“ngram_range” parameter), and

(3) setting which stop words to remove (we removed words from the given “english” stop word list).

Once those matrices were built, we could transform the counts into their TF-IDF representation with

sklearn.feature_extraction.text.TfidfTransformer().

For supervised labeling, we implemented several models from scikit-learn. The justification is that these

tools are easy to use, especially in an iPython notebook. Models we used include:

from sklearn.linear_model: LogisticRegression()

from sklearn.naive_bayes: BernoulliNB() and MultinomialNB()

from sklearn.ensemble: RandomForestClassifier()

We also used these tools for quantifying model performance:

from sklearn.metrics: roc_curve, roc_auc_score, auc

For unsupervised clustering, we used k-means from scikit-learn as the final step of our own spectral clustering implementation.

For visualization, we used Matplotlib because it is well-suited for simple graphics, and can be used inline

in an iPython notebook. We also used the wordcloud package to create some appealing visualizations of

our review text.

Lessons Learned

Supervised Labeling: We explored a number of machine learning approaches for the supervised problem

of classifying Yelp restaurants into existing labels. Our best model was a logistic regression model,

closely followed by a random forests model. We thus selected the logistic regression model as our primary model, and we compare it to a baseline model (multinomial naive bayes). The features used were

the words from all of the reviews written for each restaurant that we aimed to classify. We evaluated a

number of featurization choices for these words including:

(1) whether to use unigrams only or whether to additionally include bigrams,

(2) whether to weight the features by raw word counts or TF-IDF weights, and

(3) how many features to include.

As seen in Figure 2 and Table 1, we achieved the best performance for the logistic regression model by

using bigrams+unigrams, retaining 200,000+ features, and representing features as raw word counts.

We measure accuracy in a number of ways:

(1) accuracy (did we correctly predict that a restaurant does or does not fall within a certain category?),

(2) area under the ROC curve,

(3) precision, and

(4) recall.

Scores for these accuracy measures are displayed in Table 3. Our logistic regression model outperforms

our baseline model substantially in accuracy, AUC, and precision, but the baseline model has higher

recall. We also show that our primary model outperforms the “Always False” model for all 20 categories,

whereas our baseline model does not for many categories. Our primary model performs best at classifying

ethnic cuisine such as “Chinese” and “Mexican,” which we hypothesize is due to these types of

restaurants having special and unique identifying words such as “Mexican” and “tacos” for Mexican

restaurants and “Chinese” and “noodles” for Chinese restaurants. This is corroborated by the word cloud

visualizations in Figure 5 and in looking at the most important features for our random forests model.

Unsupervised Labeling:

Unsupervised learning for subcategorization is relatively more difficult. In this project, we applied

spectral clustering on the review text in order to find subcategories of restaurants. The intuition is, let’s

say, for Chinese restaurants, people may use “hot”, “spicy” to describe a Sichuan restaurant and use “milk

tea”, “salted popcorn chicken” in reviews for Taiwanese restaurants. However, the difficulty is, it’s not

obvious what each cluster corresponds to.

One way to figure this out is to look at the percentage of existing labels. For example, if in a cluster, 50%

of restaurants are “bar”, 25% are “night life”, then we could reason this cluster corresponds to the bar type

of restaurants. Though we do observe this in some of the clusters (refer to the results section), there are

also clusters with mixed labels that are not easy to interpret. A more fundamental question to ask is, is the

clustering based on restaurant types? Or, is it perhaps more related to something else like star-rating, cost, or other latent factors? A key lesson for us is that unsupervised learning doesn’t always give us the result

we expect.
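
The label-percentage check described above can be sketched as a small helper. The function name and the toy cluster assignments are hypothetical; the idea is simply to count, within each cluster, what share of restaurants carries each existing label.

```python
from collections import Counter


def top_labels_per_cluster(cluster_ids, restaurant_labels, top_n=5):
    """Return {cluster: [(label, share), ...]} for the most common labels."""
    members_by_cluster = {}
    for cluster, labels in zip(cluster_ids, restaurant_labels):
        members_by_cluster.setdefault(cluster, []).append(labels)

    summary = {}
    for cluster, members in members_by_cluster.items():
        counts = Counter(label for labels in members for label in labels)
        # Share = fraction of restaurants in the cluster carrying the label.
        summary[cluster] = [
            (label, count / len(members))
            for label, count in counts.most_common(top_n)
        ]
    return summary


# Toy usage: cluster 0 is half bars, one quarter nightlife.
cluster_ids = [0, 0, 0, 0, 1, 1]
restaurant_labels = [
    ["Bars"],
    ["Bars", "Nightlife"],
    ["Pizza"],
    ["Italian"],
    ["Chinese"],
    ["Chinese", "Buffets"],
]
summary = top_labels_per_cluster(cluster_ids, restaurant_labels)
```

A cluster whose top shares are spread thinly over unrelated labels is one of the hard-to-interpret cases mentioned above.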

Team Contributions

*CS294* Bichen (40%): Time series analysis (majority of the “Project Preliminary Data Analysis”

submission), star-rating matrix factorization, spectral clustering of review texts for unsupervised

subcategory classification.

*CS294* Stephanie (40%): Initial reading in of data and exploration of business dataset (majority of

“Project Data Exploration” submission). Completion of bag of words featurization. Small scale

supervised labeling (majority of results presented in PowerPoint presentation). Majority of text featurization and supervised labeling presented in poster presentation and presented here.

*CS194* Tsu-Fang (20%): Data exploration on review texts and user data. Started bag of words

featurization and TF-IDF analysis. Tested value of adding restaurant name feature and TF-IDF effects on

model accuracies after logistic regression and naive bayes (not shown). Ported and formatted results for

poster / presentation.

References

(1) Yelp academic dataset. https://www.yelp.com/academic_dataset.

(2) “Ensemble Methods.” http://scikit-learn.org/stable/modules/ensemble.html

(3) Ng, Andrew Y., Michael I. Jordan, and Yair Weiss. "On spectral clustering: Analysis and an

algorithm." Advances in neural information processing systems 2 (2002): 849-856.

Our GitHub repository is here: https://github.com/tsufanglu/Yelp-Dataset-Challenge

The most relevant notebooks to this report are:

CatsAllCities.ipynb

Yelp_Restaurnats_Spectral_Clustering.ipynb

They are located in the code folder of the repo:

https://github.com/tsufanglu/Yelp-Dataset-Challenge/tree/master/code

Appendix (Figures)

Figure 1. Chosen restaurant labels and their counts.

Figure 2. Accuracies, precisions, and recalls for our logistic regression model colored by the number of

words retained in the training corpuses (no TF-IDF weighting). These indicate that we ought to keep as

many words as possible as features.

Figure 3. Comparison of accuracies for 4 different featurization choices. In each case 211964 words (or

211964 bigrams + unigrams in the bigram case) are retained for training.

Figure 4. Accuracy measures for our chosen model, broken down by category.

Figure 5. Word clouds for American (New) (Upper Left), American (Traditional) (Upper Right),

Mexican, and Chinese. Notice Chinese has words unique to it such as “Chinese,” “noodle,” “rice,” and

“dumpling”; Mexican has unique words like “Mexican,” “taco,” and “burrito,” but the upper 2 word clouds

do not show obviously unique words.

Figure 6. Baseline comparisons. There is substantial improvement over baseline for accuracy, AUC, and

precision. The simple baseline model has higher recall. Bottom panel labels serve as a guide for all

panels.

Figure 7. Spectral clustering result of all restaurants. Most frequent 5 labels in each cluster.

Figure 8. Spectral clustering results for Chinese restaurants. Most frequent 5 labels in each cluster.
