8/15/2019 Yelp Advisor Report
1/15
Classifying Yelp Restaurants
Team Yelpadvisor: Stephanie Wuerth, Bichen Wu, Tsu-Fang Lu
December 14, 2015
Problem Statement and Background
Our goal is to classify restaurants into existing labels using the Yelp academic dataset. We also hope to
further classify restaurants with more specific labels than their given labels. For example, a restaurant
may just be labeled as “Chinese” but can we further classify it as “Sichuan” or “Taiwanese”?
Our dataset is the Yelp academic dataset, which is provided for use as part of the Yelp Dataset
Challenge (1). This dataset spans approximately 10 years of Yelp reviews (text and star rating) and 5 years
of Yelp tips, along with hourly sums of check-ins for each business. It also includes general information
about the business, such as its categories, ambiance, business hours, and address, and some information
about the Yelp reviewers such as the number of reviews they have written and the average star rating they
have given. The dataset includes 10 cities: Edinburgh, Karlsruhe, Charlotte, Urbana, Madison, Las Vegas,
Phoenix, Pittsburgh, Montreal, and Waterloo (Canada).
To measure accuracy, we compare our predicted label to the true label. We measure basic accuracy,
precision and recall, as well as AUC. We also compare our model’s accuracy to a baseline model.
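As an illustration of these measures, the following is a minimal sketch of how they can be computed for a single label with scikit-learn. The arrays are made-up toy values (not results from our data), and the 0.5 decision threshold is an assumption.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, roc_auc_score)

# Toy ground truth for one label (1 = restaurant has the label); illustrative only.
y_true = np.array([1, 0, 0, 1, 0, 1, 0, 0])
# Hypothetical predicted probabilities from some classifier.
y_prob = np.array([0.9, 0.2, 0.4, 0.35, 0.1, 0.8, 0.6, 0.05])
# Assumed decision threshold of 0.5.
y_pred = (y_prob >= 0.5).astype(int)

accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
# AUC is computed from the scores, not from the thresholded predictions.
auc = roc_auc_score(y_true, y_prob)
```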
Some potential applications of our classification models include:
1. Help Yelp automate restaurant labeling without user inputs.
2. Label vaguely labeled restaurants more specifically or label restaurants with missing labels.
3. Inform customers about restaurants’ specialties and particular cuisines by further sub-categorizing
restaurants into more specific labels.
Methods
Data collection
We used the Yelp academic dataset, which is made available by Yelp for the Yelp Dataset Challenge (1). To
obtain this data, we registered for the Yelp Dataset Challenge at http://www.yelp.com/dataset_challenge.
Data preparation
The data is provided in JSON format, but a Python script for converting it to CSV is offered at
https://github.com/Yelp/dataset-examples. We used this script (json_to_csv_converter.py) to convert the
JSON data into CSV files, which we then read into a Python notebook and stored in Pandas
dataframes. We subset the data to what is potentially useful for our chosen problem. We use 9 of the 10
cities in the Yelp academic dataset for our model. Data for Karlsruhe, Germany is omitted because most
reviews there are not written in English, and review text is the richest component of our dataset. We
further subset by selecting only restaurants (excluding Hotels, Spas, etc.). Within restaurants we further
subset to the 20 most common types of restaurant, as given by their existing labels. The chosen labels and
the number of restaurants with each label in our subsetted dataset are given in Figure 1 (see Appendix). We
also removed end-of-line characters, carriage returns, and certain regex patterns from the review texts so
that our bag of words model would work better.
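The subsetting and text cleaning described above can be sketched roughly as follows. This is not our exact notebook code: the toy dataframe, its column names, and the helper clean_review are assumptions for illustration, and the regular expressions shown cover only line breaks and repeated whitespace.

```python
import re
import pandas as pd

# Toy stand-in for the converted business CSV; column names are assumptions.
business = pd.DataFrame({
    "business_id": ["b1", "b2", "b3", "b4"],
    "city": ["Phoenix", "Karlsruhe", "Madison", "Las Vegas"],
    "categories": [["Restaurants", "Mexican"], ["Restaurants", "German"],
                   ["Hotels & Travel"], ["Restaurants", "Pizza", "Italian"]],
})

# Drop Karlsruhe, then keep only businesses tagged as restaurants.
is_restaurant = business["categories"].apply(lambda cats: "Restaurants" in cats)
restaurants = business[(business["city"] != "Karlsruhe") & is_restaurant]

def clean_review(text):
    """Replace line breaks with spaces and collapse repeated whitespace."""
    text = re.sub(r"[\r\n]+", " ", text)
    return re.sub(r"\s+", " ", text).strip()

cleaned = clean_review("Great tacos!\r\nWill come\n\nback.")
```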
Featurization
We featurize our review text using a Bag of Words (BoW) model, building a training matrix of number of
restaurants by size of vocabulary as follows:
All reviews received by each restaurant in the training set (70% of total) are joined and tokenized with
stopwords removed, then words are counted to create the sparse BoW vector for each restaurant.
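A minimal sketch of this step, assuming a dataframe of reviews with hypothetical columns business_id and text: the reviews of each restaurant are joined into one document, and scikit-learn's CountVectorizer builds the sparse count matrix. The three reviews are invented for illustration.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Toy reviews; column names and contents are assumptions for illustration.
reviews = pd.DataFrame({
    "business_id": ["b1", "b1", "b2"],
    "text": ["The tacos were great", "Great salsa and tacos", "Crispy crust pizza"],
})

# One document per restaurant: all of its reviews joined together.
docs = reviews.groupby("business_id")["text"].apply(" ".join)

# Unigrams + bigrams with English stop words removed; max_features=None keeps
# the whole vocabulary (set it to e.g. 6000 to keep only the most frequent terms).
vectorizer = CountVectorizer(stop_words="english", ngram_range=(1, 2),
                             max_features=None)
X = vectorizer.fit_transform(docs)  # sparse matrix: restaurants x vocabulary

# Count of the unigram "tacos" in the first restaurant's merged reviews.
tacos_count = X[0, vectorizer.vocabulary_["tacos"]]
```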
We tested several different feature inclusions:
- N-grams: unigrams only, or bigrams + unigrams
- Number of features retained: 6000, 15,000, 100,000, or ~200,000 (which is the total count of
unique words in our training corpus)
- Feature weights: raw frequencies or term frequency-inverse document frequency (TF-IDF)
weighting. We note here the specifics of the TF-IDF weighting: we used the default parameters of
the sklearn.feature_extraction.text.TfidfTransformer() tool (norm='l2', use_idf=True,
smooth_idf=True, sublinear_tf=False). The norm parameter means we normalize the final
vectors, and the smooth_idf and use_idf parameters mean our features are weighted according to
tf * (idf + 1), where tf is the frequency of the feature in the restaurant's merged reviews, and idf
is the inverse frequency of the feature in the entire training corpus (all restaurant reviews).
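The TF-IDF weighting can be illustrated with a small sketch that applies TfidfTransformer, with the parameters listed above, to a toy count matrix (the counts are invented). In current scikit-learn releases the smoothed idf is computed as ln((1 + n) / (1 + df)) + 1, so a term that appears in every document keeps an idf of exactly 1; the exact formula may differ between library versions.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Toy count matrix: 2 restaurants x 3 terms (illustrative values only).
counts = np.array([[2, 1, 0],
                   [0, 1, 3]])

tfidf = TfidfTransformer(norm="l2", use_idf=True,
                         smooth_idf=True, sublinear_tf=False)
weights = tfidf.fit_transform(counts).toarray()

# With norm='l2', every restaurant vector has unit Euclidean length.
row_norms = np.linalg.norm(weights, axis=1)
# Learned idf per term: rarer terms receive larger values.
idf = tfidf.idf_
```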
Another featurization we tried, but did not include in the final pipeline, uses the star rating matrix,
which is a matrix of number of users by number of restaurants. Each element in the matrix corresponds to
a user’s rating for a certain restaurant. We performed matrix factorization (through PCA and
Alternating Least Squares) to obtain a factor matrix of number of factors by number of restaurants. We
treated each vector (with length equal to the number of factors) as a data point representing a restaurant.
Learning
First we describe the learning methods used for the supervised problem of classifying restaurants into
their existing labels. Then we describe the methods for the unsupervised problem of classifying
restaurants into subcategories.
Supervised text-based classification into existing labels
Models tested: Logistic regression and random forest
Logistic regression marginally outperformed our random forest models, so we have chosen the logistic
regression model as our primary model.
Parameter choices:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, max_iter=100, multi_class='ovr',
penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
verbose=0)
Logistic regression: multi_class='ovr' indicates that a binary problem is fit for each label. So in our
case, for each of the 20 categories we model whether a restaurant does or does not fall into that category.
This is a logical choice since some restaurants fall into more than one category (for example, many “Sushi
Bars” are also “Japanese”).
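One way to fit a separate binary logistic regression per label when restaurants can carry several labels is sketched below. The use of MultiLabelBinarizer and OneVsRestClassifier is an assumption made for this illustration rather than a record of our notebook code, and the features and label sets are toy values.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# Toy count features (rows = restaurants) and their label sets; illustrative only.
X = np.array([[3, 0, 1], [2, 0, 2], [0, 3, 0],
              [0, 2, 1], [3, 0, 0], [0, 4, 0]])
labels = [["Japanese", "Sushi Bars"], ["Japanese", "Sushi Bars"], ["Pizza"],
          ["Pizza"], ["Japanese"], ["Pizza"]]

# One 0/1 indicator column per label, so a restaurant can be positive for several.
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)

# One independent L2-regularized (the default) binary model per label column.
clf = OneVsRestClassifier(LogisticRegression(C=1.0, solver="liblinear"))
clf.fit(X, Y)
pred = clf.predict(X)  # same shape as Y: restaurants x labels
```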
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
max_depth=None, max_features=6000, max_leaf_nodes=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=4,
oob_score=False, random_state=None, verbose=0,
warm_start=False)
Random forests: We tested a number of parameter choices, but the best performance was achieved by
keeping 6000 features, 100 estimators, bootstrap on, and the Gini criterion. We initially included fewer
features because, according to the scikit-learn documentation (2), a common choice for classification tasks
is max_features=sqrt(n_features). In our case n_features
is ~200,000, so ~500 would be a good choice for max_features. However, we saw increased
accuracy when we included more features.
MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
Multinomial Naïve Bayes (MNB) (baseline model):
The alpha parameter is left at its default of 1, which applies additive (Laplace) smoothing.
We also tried Bernoulli Naive Bayes by binarizing features such that presence of a word (count of 1 or
more) gave a feature value of 1 while absence of a word gave a feature value of 0. This method gave us
fairly high accuracy, but zero recall for all categories, so we present Multinomial NB as our baseline
model.
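The two naive Bayes variants can be compared with a short sketch such as the one below. The count matrix and labels are toy values; BernoulliNB's binarize=0.0 setting maps any count above zero to 1, which matches the binarization described above.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Toy word counts (rows = restaurants) and one binary label; illustrative only.
X = np.array([[4, 0, 1], [3, 1, 0], [0, 5, 0],
              [0, 2, 2], [5, 0, 0], [1, 3, 0]])
y = np.array([1, 1, 0, 0, 1, 0])

# Multinomial NB works directly on the raw counts.
mnb = MultinomialNB(alpha=1.0).fit(X, y)
# Bernoulli NB first binarizes: counts > 0 become 1, zeros stay 0.
bnb = BernoulliNB(alpha=1.0, binarize=0.0).fit(X, y)

mnb_pred = mnb.predict(X)
bnb_pred = bnb.predict(X)
```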
Clustering for Sub-Categorization
For sub-categorization, we implemented spectral clustering, which can be summarized by the following
procedure (3):
1. Form the affinity matrix $A$, with $A_{ij} = \exp(-\|s_i - s_j\|^2 / \delta^2)$ for $i \neq j$ and $A_{ii} = 0$.
2. Define $D$ to be a diagonal matrix whose $(i,i)$-element is the sum of $A$'s $i$-th row, and construct
$L = D^{-1/2} A D^{-1/2}$.
3. Find the $k$ eigenvectors $x_1, \ldots, x_k$ corresponding to the $k$ largest eigenvalues of $L$. Form the matrix
$X = [x_1 \ldots x_k]$.
4. Re-normalize each row of $X$ to form the matrix $Y$.
5. Treat each row of $Y$ as a data point and run K-means clustering on $Y$.
We implemented this algorithm ourselves and applied it to cluster: (1) all restaurants into groups in order
to see whether or not these groups correspond to a sensible composition of given restaurant types, and (2)
Chinese restaurants into subcategories. The parameter in this algorithm is δ, which controls the
connectivity of data points. The smaller δ is, the more separated the clusters will appear. We set δ to the
value that makes the number of separated clusters equal to 5.
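The procedure can be sketched in NumPy as follows. This is an illustrative reimplementation, not our notebook code: the function name spectral_clustering, the toy data, and the exact affinity form exp(-||s_i - s_j||^2 / δ^2) are assumptions. The sketch takes the eigenvectors with the largest eigenvalues of D^{-1/2} A D^{-1/2}, as in Ng et al. (3), which is equivalent to taking the smallest eigenvalues of the normalized Laplacian I - D^{-1/2} A D^{-1/2}.

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering(S, k, delta):
    """Cluster the rows of S into k groups (hypothetical helper, see lead-in)."""
    # Step 1: Gaussian affinity with zero diagonal (assumed form of the kernel).
    sq_dists = np.sum((S[:, None, :] - S[None, :, :]) ** 2, axis=2)
    A = np.exp(-sq_dists / delta ** 2)
    np.fill_diagonal(A, 0.0)
    # Step 2: L = D^{-1/2} A D^{-1/2}, where D holds the row sums of A.
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    L = A * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    # Step 3: eigh returns eigenvalues in ascending order, so the last k
    # columns are the eigenvectors of the k largest eigenvalues.
    _, eigvecs = np.linalg.eigh(L)
    X = eigvecs[:, -k:]
    # Step 4: re-normalize each row to unit length.
    Y = X / np.linalg.norm(X, axis=1, keepdims=True)
    # Step 5: k-means on the rows of Y.
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(Y)

# Toy data: two well-separated blobs of 10 points each.
rng = np.random.default_rng(0)
S = np.vstack([rng.normal(0.0, 0.1, size=(10, 2)),
               rng.normal(5.0, 0.1, size=(10, 2))])
cluster_ids = spectral_clustering(S, k=2, delta=1.0)
```

On real review features, δ would need tuning as described above; here it is simply fixed at 1.0 because the toy blobs are far apart.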
Other things we tried:
Initially we were working on a different problem involving time series analysis of daily review counts.
We took the time series of the daily review counts as our features and our hope was to (1) find different
customer influx patterns for different types of restaurants, and (2) predict customer influx to certain cities
and venues based on these time series. Our analysis failed because after running some statistical tests, we
learned that there is not enough information in the time series for us to distinguish different types of
restaurants and predict customer influx.
Another method we tried was to factorize the star-rating matrix to yield a factor matrix corresponding to
restaurants, then use this as features and apply k-means on it to find restaurant types. However, the
star-rating matrix is very sparse. Even using ALS (Alternating Least Squares) factorization, our average
prediction error (measured by root of mean squared error) was larger than 1 star. So we abandoned this
feature and used bag of words instead.
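For reference, ALS factorization of a sparse rating matrix can be sketched as below. The function als, the regularization strength, the number of factors, and the toy ratings are all assumptions for illustration. The RMSE computed here is over the observed training entries of the toy matrix, not the held-out error quoted above.

```python
import numpy as np

def als(R, mask, n_factors=2, reg=0.1, n_iters=50, seed=0):
    """Factor R ~ U @ V.T using only entries where mask is True (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    U = rng.normal(scale=0.1, size=(n_users, n_factors))
    V = rng.normal(scale=0.1, size=(n_items, n_factors))
    ridge = reg * np.eye(n_factors)
    for _ in range(n_iters):
        # Fix V and solve a small ridge regression for each user.
        for u in range(n_users):
            obs = mask[u]
            if obs.any():
                Vo = V[obs]
                U[u] = np.linalg.solve(Vo.T @ Vo + ridge, Vo.T @ R[u, obs])
        # Fix U and solve a small ridge regression for each restaurant.
        for i in range(n_items):
            obs = mask[:, i]
            if obs.any():
                Uo = U[obs]
                V[i] = np.linalg.solve(Uo.T @ Uo + ridge, Uo.T @ R[obs, i])
    return U, V

# Toy users x restaurants star ratings; 0 marks a missing rating.
R = np.array([[5, 4, 0, 1],
              [4, 5, 1, 0],
              [1, 0, 5, 4],
              [0, 1, 4, 5]], dtype=float)
mask = R > 0
U, V = als(R, mask)
pred = U @ V.T
rmse = np.sqrt(np.mean((pred[mask] - R[mask]) ** 2))
```

Each row of V is the factor vector of one restaurant, which is what would be passed to k-means.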
Results
Supervised Labeling
How many features should we include?
Figure 2 in the appendix shows that, for the case of unigrams only and no TF-IDF weighting, accuracy,
precision, and AUC are all maximized by including the entire corpus. The effect of increased corpus size
on recall is less clear cut. Since recall is not drastically decreased by including more features, we can base
our choices on the precision and accuracies.
Should we weight our data by TF-IDF? Should we include bigrams?
In Figure 3, we compare the performance of our logistic regression model for four featurization choices.
In all cases, ~200,000 features are used, but we vary inclusion of unigrams vs. unigrams + bigrams, and
we test whether or not to weight with TF-IDF. The top plot shows that using bigrams in addition to
unigrams has little effect on the overall accuracies. We see a slight improvement for Italian and Chinese
restaurants when adding bigrams, but this improvement is not substantial. The middle and bottom plots
show precision and recall. We see that using TF-IDF weights generally increases precision but decreases
recall. We average over all 20 categories for these measures in the table below:
             Raw frequencies,   TF-IDF weights,   Raw frequencies,     TF-IDF weights,
             unigrams only      unigrams only     bigrams + unigrams   bigrams + unigrams
Accuracy     0.9517             0.9486            0.9562               0.9447
AUC          0.9220             0.9492            0.9347               0.9498
Precision    0.7215             0.8611            0.7737               0.9040
Recall       0.5672             0.3304            0.5551               0.2665
Table 1. Comparison of different featurization choices for the logistic regression model (with ~200,000
features retained). Measures are averaged across scores for the 20 categories.
The highest scores for each accuracy measure are shown in bold. The highest overall accuracy is achieved
by including bigrams and unigrams, and weighting the features by their raw counts. Precision and AUC
are improved by weighting by TF-IDF, but recall is markedly decreased. Such low recall would cause us,
for example, to fail to recommend a relevant restaurant to a Yelp user, so we choose not to weight our
data by TF-IDF. Thus our final choice for featurization is: ~200,000 features of bigrams and unigrams,
weighted by raw word counts.
Discussion of individual category accuracies
Figure 4 shows accuracy, precision and recall for each category for our chosen model and featurization.
Alongside our model’s accuracies, we include “Always False” accuracies, i.e., the accuracy of a
model that simply predicts false for every restaurant for each label. We see that for all categories, our model
outperforms the "Always False" classifier. However, accuracy is very close to this "Always False"
classifier for the rarest categories: Sushi Bars, Delis, Steakhouses, Seafood, and Chicken Wings. With
larger numbers of these types of restaurants, performance for these categories would potentially be
improved.
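The "Always False" accuracy for a label is simply the fraction of restaurants that do not carry it, which is why it is so high for rare categories. A minimal sketch with made-up labels (the function name is hypothetical):

```python
import numpy as np

def always_false_accuracy(y_true):
    """Accuracy of a classifier that predicts 0 (label absent) for every restaurant."""
    y_true = np.asarray(y_true)
    return float(np.mean(y_true == 0))

# Toy label column: 2 of 10 restaurants carry the label.
y_true = np.array([0, 0, 0, 1, 0, 0, 0, 0, 1, 0])
baseline_acc = always_false_accuracy(y_true)
```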
Taking into account accuracy, precision, and recall, we see that our classifier is best at labeling Mexican,
Pizza, and Chinese. It is not as successful at classifying American (Traditional and New), or at classifying
the label "Food." This makes sense because American restaurants and "Food" restaurants have less
obviously identifying word features than Mexican or Chinese restaurants. To visualize this effect, we
examine word clouds for some of these cases (Figure 5). The word clouds display the most frequent
words in all reviews for a given category, sized by their frequencies. Stopwords are removed in addition
to the word “food,” which is common to all categories.
Random Forest Model Results
Here we also report the accuracy measures for our Random Forest model, because it performed nearly as
well as the Logistic Regression model. For the results presented here, we used the same training matrix as
in the primary model (unigrams + bigrams, raw counts), but we only retain 6000 features. The parameters
used are given in the Methods section. We halve our test dataset into validation (for testing parameter
combinations) and final test data sets. Table 2 summarizes accuracy scores for this model (both validation
and final test scores), with logistic regression included for comparison. Accuracy and recall are below the
logistic regression model, but precision is higher. If given more time to test more parameter combinations, it
is plausible that we could achieve higher accuracy with this random forests approach. Recall might
improve with shallower trees or fewer features considered, since these parameters give a simpler model
with lower variance, but with this comes potentially higher bias (lower accuracy).
                              Accuracy   Precision   Recall
Logistic Regression           0.9562     0.7737      0.5551
Random Forest (validation)    0.9530     0.8867      0.4666
Random Forest (final test)    0.9522     0.8972      0.4550
Table 2. Accuracy measure comparisons between primary (logistic regression) and a random forest
model. Measures are averaged across scores for the 20 categories.
The random forest model allows us to examine the most important features. Here we list some of the most
important features (in decreasing order): pizza, chinese, bar, mexican, pizzas, burger, subway, mcdonalds,
mexican food, chinese food, sandwiches, bartender, sandwich, sushi, tacos, taco, bartenders, italian, coffee,
burrito, crust, fries, burgers, bar food, drive, fried rice, pizza good, asada, salsa, pepperoni, beer, good pizza,
fast food, pasta, rice, carne asada, waitress, breakfast, italian food, burritos, subs, wings, best pizza, happy
hour, bars, bread, mein, drinks, pizza place, beers, sub, fast, great pizza, cafe, restaurant, italian restaurant,
eggs, japanese, place, rice beans, deli, taco bell, great, carne, chinese restaurant, pub.
Many of these features are obvious identifiers for certain labels.
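A ranking like this can be read off a fitted scikit-learn forest through its feature_importances_ attribute (impurity-based importances). The sketch below uses a toy count matrix and invented feature names, so the resulting order is only illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy vocabulary and counts; the first two words separate the classes perfectly.
feature_names = np.array(["pizza", "tacos", "great", "place"])
X = np.array([[3, 0, 1, 1], [2, 0, 0, 1], [4, 0, 1, 0],
              [0, 3, 1, 1], [0, 2, 0, 1], [0, 4, 1, 0]])
y = np.array([1, 1, 1, 0, 0, 0])  # 1 = "Pizza" label, 0 = not

forest = RandomForestClassifier(n_estimators=100, criterion="gini",
                                bootstrap=True, random_state=0)
forest.fit(X, y)

# Sort features from most to least important.
order = np.argsort(forest.feature_importances_)[::-1]
ranked = feature_names[order]
```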
Comparison to baseline model
In the table below we summarize accuracy measures (averaged across our 20 categories) for our primary
model and our baseline model. We see significant improvement in all measures except for recall. The low
precision of the baseline model indicates that it underfits our data, which is expected of a simple model
such as Naive Bayes.
                                 Accuracy   AUC      Precision   Recall
Multinomial NB (Baseline)        0.9119     0.8690   0.4818      0.7648
Logistic Regression (Primary)    0.9562     0.9347   0.7737      0.5551
Table 3. Accuracy measure comparisons between primary (logistic regression) and baseline (multinomial
naive Bayes) models. Measures are averaged across scores for the 20 categories.
We compare performance against the baseline model for all categories in Figure 6. In the top panel we see
that our model is more accurate than our baseline model for all categories. While the improvement does
not appear drastic, it should be noted that "Always False" never outperforms our primary model, but it
outperforms the baseline model for 11 of the 20 categories (these 11 being Fast Food, American
(Traditional), Sandwiches, Food, American (New), Breakfast and Brunch, Cafes, Delis, Steakhouses,
Seafood, and Chicken Wings). The 9 categories for which the baseline model surpasses “Always False”
in accuracy are all categories we expect to have more unique vocabularies, such as ethnic cuisine. Our
primary model outperforms the baseline model most significantly for labels Sandwiches (improvement by
>20%) and Fast Food (improvement by >10%). Better accuracy is expected for logistic regression as
compared to naive Bayes for a problem such as ours because naive Bayes is a simplification of logistic
regression. Naive Bayes assumes that features (words) are generated independently given the class (in our
case, the “class” is true or false for each label), whereas logistic regression does not make this
assumption. As such, we expect the naive Bayes model to have higher bias but lower variance, and that it
will underfit our data, leading to low precision and high recall.
Spectral Clustering
We applied spectral clustering on: (1) all restaurants to classify them into groups and analyze the true
labels that comprise these groups, and (2) Chinese restaurants to classify them into subcategories. In order
to figure out which labels each cluster corresponds to, we printed out the top 5 true labels of each of the 5
clusters. As shown in Figure 7, cluster 3 corresponds to pizza or Italian restaurants, and cluster 2
corresponds to bar and nightlife venues. The other three clusters are more difficult to interpret
because they contain mixed types of restaurants. Figure 8 is the result of applying spectral clustering on
Chinese restaurants. Many of the Chinese restaurants have true labels in addition to “Chinese,” such as
“Taiwanese” or “Buffet.” So, as we did for the clusters of all restaurants, we can again print the most
common true labels (other than Chinese) for the restaurants in our Chinese clusters. First, we notice that
the most frequent labels in each cluster are “Asian Fusion” and “Buffet,” which provide little information
about restaurant type. Beyond these, the first cluster contains the Japanese and Sushi Bars labels,
which suggests that these restaurants lean toward Japanese food. The fifth cluster contains Thai,
Vietnamese, and Szechuan restaurants, whose cuisines are relatively spicy.
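Listing the most common true labels per cluster takes only a few lines. The sketch below uses a hypothetical helper, top_labels_per_cluster, with invented cluster assignments and label sets.

```python
from collections import Counter

def top_labels_per_cluster(cluster_ids, label_lists, n_top=5, exclude=()):
    """Return the n_top most frequent true labels in each cluster."""
    counts = {}
    for cid, labels in zip(cluster_ids, label_lists):
        kept = [label for label in labels if label not in exclude]
        counts.setdefault(cid, Counter()).update(kept)
    return {cid: c.most_common(n_top) for cid, c in counts.items()}

# Toy cluster assignments and true label sets; illustrative only.
cluster_ids = [0, 0, 1, 1, 1]
label_lists = [["Chinese", "Buffets"],
               ["Chinese", "Buffets", "Asian Fusion"],
               ["Chinese", "Szechuan"],
               ["Chinese", "Thai"],
               ["Chinese", "Szechuan"]]

# Exclude the label shared by every restaurant so it does not dominate.
top = top_labels_per_cluster(cluster_ids, label_lists, exclude={"Chinese"})
```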
Tools
We performed all of our analysis in iPython notebooks because this platform is useful for visualizing
results alongside code. We used Pandas and NumPy for data manipulations because these are tools all
group members use. At first, we built our BoW features (and TF-IDF weights) using handwritten code
adapted from CS294 homework, but later we migrated towards scikit-learn tools for this task.
sklearn.feature_extraction.text.CountVectorizer() was used to form BoW training matrices. This tool
simplified a few tasks:
(1) setting the maximum feature retention count (“max_features” parameter),
(2) setting which n-grams to include (“ngram_range” parameter), and
(3) setting which stop words to remove (we removed words from the given “english” stop word list).
Once those matrices were built, we could transform the counts into their TF-IDF representation with
sklearn.feature_extraction.text.TfidfTransformer().
For supervised labeling, we implemented several models from scikit-learn. The justification is that these
tools are easy to use, especially in an iPython notebook. Models we used include:
from sklearn.linear_model: LogisticRegression()
from sklearn.naive_bayes: BernoulliNB() and MultinomialNB()
from sklearn.ensemble: RandomForestClassifier()
We also used these tools for quantifying model performance:
from sklearn.metrics: roc_curve, roc_auc_score, auc
For unsupervised clustering, we used k-means from scikit-learn as the final step of our own spectral
clustering implementation.
For visualization, we used Matplotlib because it is well-suited for simple graphics, and can be used inline
in an iPython notebook. We also used the wordcloud package to create some appealing visualizations of
our review text.
Lessons Learned
Supervised Labeling: We explored a number of machine learning approaches for the supervised problem
of classifying Yelp restaurants into existing labels. Our best model was a logistic regression model,
closely followed by a random forests model. We thus selected the logistic regression model as our
primary model, and we compare it to a baseline model (multinomial naive Bayes). The features used were
the words from all of the reviews written for each restaurant that we aimed to classify. We evaluated a
number of featurization choices for these words including:
(1) whether to use unigrams only or whether to additionally include bigrams,
(2) whether to weight the features by raw word counts or TF-IDF weights, and
(3) how many features to include.
As seen in Figure 2 and Table 1, we achieved the best performance for the logistic regression model by
using bigrams+unigrams, retaining 200,000+ features, and representing features as raw word counts.
We measure accuracy in a number of ways:
(1) accuracy (did we correctly predict that a restaurant does or does not fall within a certain category?),
(2) area under the ROC curve,
(3) precision, and
(4) recall.
Scores for these accuracy measures are displayed in Table 3. Our logistic regression model outperforms
our baseline model substantially in accuracy, AUC, and precision, but the baseline model has higher
recall. We also show that our primary model outperforms the “Always False” model for all 20 categories,
whereas our baseline model does not for many categories. Our primary model performs best at classifying
ethnic cuisine such as “Chinese” and “Mexican,” which we hypothesize is due to these types of
restaurants having special and unique identifying words such as “Mexican” and “tacos” for Mexican
restaurants and “Chinese” and “noodles” for Chinese restaurants. This is corroborated by the word cloud
visualizations in Figure 5 and in looking at the most important features for our random forests model.
Unsupervised Labeling:
Unsupervised learning for subcategorization is relatively more difficult. In this project, we applied
spectral clustering on the review text in order to find subcategories of restaurants. The intuition is, let’s
say, for Chinese restaurants, people may use “hot”, “spicy” to describe a Sichuan restaurant and use “milk
tea”, “salted popcorn chicken” in reviews for Taiwanese restaurants. However, the difficulty is, it’s not
obvious what each cluster corresponds to.
One way to figure this out is to look at the percentage of existing labels. For example, if in a cluster, 50%
of restaurants are “bar”, 25% are “night life”, then we could reason this cluster corresponds to the bar type
of restaurants. Though we do observe this in some of the clusters (refer to the results section), there are
also clusters with mixed labels that are not easy to interpret. A more fundamental question to ask is, is the
clustering based on restaurant types? Or, is it perhaps more related to something else like star-rating, cost,
or other latent factors? A key lesson for us is that unsupervised learning doesn’t always give us the result
we expect.
Team Contributions
*CS294* Bichen (40%): Time series analysis (majority of the “Project Preliminary Data Analysis”
submission), star-rating matrix factorization, spectral clustering of review texts for unsupervised
subcategory classification.
*CS294* Stephanie (40%): Initial reading in of data and exploration of business dataset (majority of
“Project Data Exploration” submission). Completion of bag of words featurization. Small scale
supervised labeling (majority of results presented in PowerPoint presentation). Majority of text
featurization and supervised labeling presented in poster presentation and presented here.
*CS194* Tsu-Fang (20%): Data exploration on review texts and user data. Started bag of words
featurization and TF-IDF analysis. Tested value of adding restaurant name feature and TF-IDF effects on
model accuracies after logistic regression and naive bayes (not shown). Ported and formatted results for
poster / presentation.
References
(1) Yelp academic dataset. https://www.yelp.com/academic_dataset.
(2) “Ensemble Methods.” http://scikit-learn.org/stable/modules/ensemble.html
(3) Ng, Andrew Y., Michael I. Jordan, and Yair Weiss. "On spectral clustering: Analysis and an
algorithm." Advances in neural information processing systems 2 (2002): 849-856.
Our GitHub repository is here: https://github.com/tsufanglu/Yelp-Dataset-Challenge
The most relevant notebooks to this report are:
CatsAllCities.ipynb
Yelp_Restaurnats_Spectral_Clustering.ipynb
They are located in the code folder of the repo:
https://github.com/tsufanglu/Yelp-Dataset-Challenge/tree/master/code
Appendix (Figures)
Figure 1. Chosen restaurant labels and their counts.
Figure 2. Accuracies, precisions, and recalls for our logistic regression model, colored by the number of
words retained in the training corpus (no TF-IDF weighting). These indicate that we ought to keep as
many words as possible as features.
Figure 3. Comparison of accuracies for 4 different featurization choices. In each case 211964 words (or
211964 bigrams + unigrams in the bigram case) are retained for training.
Figure 4. Accuracy measures for our chosen model, broken down by category.
Figure 5. Word clouds for American (New) (Upper Left), American (Traditional) (Upper Right),
Mexican, and Chinese. Notice Chinese has words unique to it such as “Chinese,” “noodle,” “rice,” and
“dumpling”; Mexican has unique words like “Mexican,” “taco,” and “burrito,” but the upper 2 word clouds
do not show obviously unique words.
Figure 6. Baseline comparisons. There is substantial improvement over baseline for accuracy, AUC, and
precision. The simple baseline model has higher recall. Bottom panel labels serve as a guide for all
panels.
Figure 7. Spectral clustering result of all restaurants. Most frequent 5 labels in each cluster.
Figure 8. Spectral clustering results for Chinese restaurants. Most frequent 5 labels in each cluster.