
Yelp Dataset Challenge

MSIS 5633, Deliverable 2, 25 NOV 2015

James Lynn (CWID 11644030)

Yolande Mbah Mbole (CWID 11696431)

Vegard Oelstad (CWID 11681522)


Executive Summary

Yelp is a web-based company providing crowd-sourced reviews of local businesses via Yelp.com. Its stated goal is to connect people with great local businesses. In recent years, Yelp has made subsets of its data available to the public to promote innovative uses of data and groundbreaking research.

The goal of our project is to leverage this Yelp data to create a classification scheme utilizing Ratings and Price information. The analysis should provide insights into what makes some restaurants earn top rankings while others fall short. Obviously, consumers expect high quality in terms of service, food, ambiance etc. The question is which dimensions are more important. Can a restaurant fall short in some areas and still be rated highly?

Our project could benefit those looking to open a new restaurant by identifying key areas to focus on. It could also help educate inexperienced restaurateurs on customer expectations and what it takes to succeed in terms of ratings and customer perception. Every advantage helps when you consider that a study by Cornell University and Michigan State University researchers found that 27% of restaurant startups fail after the first year, and Chef Robert Irvine of TV's Restaurant Impossible has cited inexperience as the primary reason most restaurants fail.

One factor the analysis identified as a lever for improving a restaurant is its opening hours. Although longer opening hours may increase revenue, shorter hours are associated with higher ratings. Combined with the fact that the majority of reviews are concerned with food and service, this suggests managers may want to consider reducing hours to raise their ratings, which in turn can bring in more customers and more revenue.

Project Schedule, Duration and Estimates

Initial Project Timeline

YELP DATASET CHALLENGE ANALYSIS TIMELINE

Task | Lead | Est. Duration (days) | Start Date | End Date
Kick Off Meeting | Team | 1 | 9/2/15 | 9/2/15
Prepare project proposal | Team | 7 | 9/6/15 | 9/12/15
Submit project proposal | Team | 1 | 9/13/15 | 9/13/15
Define data requirements for analysis | Team | 5 | 9/13/15 | 9/18/15
Data consolidation | Team | 27 | 9/18/15 | 10/15/15
Data cleaning | Team | 27 | 9/18/15 | 10/15/15
Data reduction | Team | 27 | 9/18/15 | 10/15/15
Prepare first deliverable | Team | 3 | 10/15/15 | 10/17/15
Submit first deliverable | Team | 1 | 10/18/15 | 10/18/15
Build models | Team | 10 | 10/19/15 | 10/30/15
Analyze models | Team | 24 | 11/1/15 | 11/24/15
Prepare second deliverable | Team | 3 | 11/25/15 | 11/28/15
Submit second deliverable | Team | 1 | 11/29/15 | 11/29/15
Prepare report and presentation | Team | 11 | 11/30/15 | 12/10/15
Submit final deliverable | Team | 1 | 12/11/15 | 12/11/15

Page 3: YELP Data Set Challenge

3 YELP Dataset Challenge, 2nd deliverable, Lynn, Mbole, Oelstad

Final Project Timeline

YELP DATASET CHALLENGE ANALYSIS TIMELINE

Task | Lead | Est. Duration (days) | Start Date | End Date
Kick Off Meeting | Team | 1 | 9/2/15 | 9/2/15
Prepare project proposal | Team | 7 | 9/6/15 | 9/12/15
Submit project proposal | Team | 1 | 9/13/15 | 9/13/15
** Major Group meeting | Team | 1 | 9/14/15 | 9/14/15
Define data requirements for analysis | Team | 4 | 9/15/15 | 9/18/15
Data cleaning and data consolidation | Team | 27 | 9/18/15 | 10/15/15
Prepare first deliverable | Team | 3 | 10/15/15 | 10/17/15
Submit first deliverable | Team | 1 | 10/18/15 | 10/18/15
** Major Group meeting | Team | 1 | 10/19/15 | 10/19/15
Data Transformation | Team | 18 | 10/20/15 | 11/7/15
Data Reduction | Team | 6 | 11/8/15 | 11/14/15
** Major Group meeting | Team | 1 | 11/15/15 | 11/15/15
Build models | Team | 5 | 11/16/15 | 11/20/15
Analyze models and start preparing 2nd deliverable | Team | 3 | 11/21/15 | 11/23/15
** Major Group meeting | Team | 1 | 11/23/15 | 11/23/15
Finalize second deliverable | Team | 1 | 11/24/15 | 11/24/15
Submit second deliverable | Team | 1 | 11/25/15 | 11/25/15
** Major Group meeting | Team | 1 | 11/26/15 | 11/26/15
Prepare report and presentation | Team | 10 | 11/27/15 | 12/6/15
Submit final deliverable | Team | 1 | 12/7/15 | 12/7/15

Comparing our initial timeline with the final one: we initially planned to complete data reduction before submitting the first deliverable, but we were only able to do so after submitting it because data cleaning and consolidation took longer than expected. We also added the duration of the data transformation task to our updated timeline. We met almost every week, but only the major meetings are included in the final timeline. Another difference between the planned and actual schedules is that data transformation took more time than planned, so we used some of the time reserved for building and analyzing models on data transformation instead. It worked out well, and we completed the project on time.

Work Breakdown Structure

YELP Data Mining Project

First Deliverable

- Define data requirements for analysis

- Data cleaning and consolidation

Second Deliverable

- Data Transformation

- Data reduction

- Building and analyzing models

Final Deliverable

- Report

- Final Presentation

Project Proposal


Statement of Scope

Project Objective

The objective of our analysis is to uncover the factors most important in categorizing a Yelp restaurant into the high review category (a 4, 4.5, or 5 star rating).

Target Variable

TARGET – this target variable is a binary field with values of 0 or 1. It is created by assigning a value of 1 to restaurants in the High review category; all other restaurants are assigned a value of 0.
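For reference, the derivation can be sketched in R. This is a minimal illustration only: the data frame name restaurants is hypothetical, while stars and target are the fields listed in our data dictionary.

# Sketch of the target derivation: 1 = High review category (4, 4.5, or 5 stars)
restaurants$target <- ifelse(restaurants$stars >= 4, 1, 0)
table(restaurants$target)   # quick check of how many restaurants fall in each class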

Predictor Variables

Our initial file included over 100 possible predictor variables. To limit the scope, we started with the variables below and used a decision tree to identify the most important variables in determining the desired outcome. In addition, we selected a few additional variables based on our intuition and curiosity to see how well they performed in terms of classification and prediction. The bolded variables are those actually selected for use in our models.

- Ethnicity – type of food (e.g. Italian, Mexican, etc.)
- Neighborhood Flag – binary variable to indicate whether neighborhoods were listed; could be an indicator of trendy locations
- Review Count – number of Yelp reviews
- Good for Kids – whether restaurant is good for kids
- Alcohol – full bar, beer and wine, none, etc.
- Noise Level – loud, very loud, average, etc.
- Attire – dressy, casual, etc.
- Coat Check – True, False
- Romantic – True, False
- Classy – True, False
- Intimate – True, False
- Hipster – True, False
- Divey – True, False
- Touristy – True, False
- Trendy – True, False
- Upscale – True, False
- Casual – True, False
- Good for Dessert – True, False
- Good for Late Night – True, False
- Good for Lunch – True, False
- Good for Dinner – True, False
- Good for Breakfast – True, False
- Good for Brunch – True, False
- Live Music – True, False
- Dairy Free – True, False
- Gluten Free – True, False
- Vegan – True, False
- Vegetarian – True, False
- Wi-Fi – True, False
- Takes Reservations – True, False
- Smoking – Yes, No, Outdoor
- Hours Open – open/close time broken out by day of week
- Text Topics 1-20 – themes identified through text mining
- Total Reviews voted as cool
- Total hours open on weekends
- Total Tips
- Total Likes of Tips
- Percentage of reviews voted Funny
- Percentage of reviews voted Useful
- Percentage of reviews voted Cool

People Benefitting from the Analysis

The primary benefactors of this analysis will be restaurant owners and operators. They will receive insights into the most important dimensions of a highly rated restaurant.

Consumers may also benefit. When restaurants aren't yet rated or have only a few reviews, the criteria may help consumers decide whether or not to take a chance on a restaurant.

Yelp and advertisers may also benefit. They can use the information from the analysis to approach businesses in a more consultative fashion by providing offerings and recommendations that help restaurants improve key areas of weakness or consumer perceptions in those areas.

Companies who help restaurants could benefit. Perhaps a restaurant scores low for ambiance. Companies specializing in remodeling or interior design could approach these restaurants with proposals or ideas on how improvements could be made.

Finally, job seekers may benefit. The results of the analysis would give them clues on the major values and characteristics that distinguish one restaurant from another. They would then be able to make a better choice of the restaurant they want to work for based on the attributes they value most.

Constraints and Limitations

There are a number of possible constraints associated with this project.

1. Small sample size of highly rated, expensive restaurants - While there are over 6,000 restaurants in the data set rated as a 4, 4.5, or 5, there are only about 175 with those ratings that also fall into the most expensive category (price range of 4). Given that fact, we adjusted our original project idea of investigating why expensive restaurants receive low ratings to something broader. We are now looking to predict high restaurant ratings irrespective of price.

2. Format of the data - Several data fields include nuggets of information that are not easily accessible without text mining. Even with text mining, over 400 concepts emerge, and these concepts must be combined into themes. This is a time-consuming and inexact process.


3. Samples - The samples we are using are from a few U.S. cities - Pittsburgh, Charlotte, Urbana-Champaign, Phoenix, Las Vegas, and Madison. The samples may not be representative of the U.S. as a whole.

4. Timing – As of the time this paper was written, we have received no formal feedback on our original project proposal. Should changes be required, we will have less time to adapt.

5. Expertise – A good data science team is comprised of individuals with expertise in several disciplines – computer science, statistics/math, and the business domain. Our group lacks anyone with an in-depth statistics/math background.

Project Costs

The project team associated with this analysis consists of 3 senior data analysts. We estimate the time required to be 50 hours per analyst (150 hours total). At a rate of $250 per hour, the total project cost comes to $37,500. This estimate does not take into account the opportunity cost of other projects that are not undertaken.

Since we are using free analysis software and there are no data charges, additional out-of-pocket costs are negligible.

Feasibility and Risk Assessment

Despite our team’s shortcomings in the realm of statistics, we felt our project was feasible based on the training we have received in MSIS 5633. We felt the biggest challenge facing us was the conversion of JSON files to a format easily readable by SPSS Modeler. The rest of the project was less daunting.

Timing and resource availability presented another challenge. With a distance learning student and a student athlete on the team, scheduling meetings was sometimes difficult. We overcame the challenge by scheduling regular meetings on Google Hangouts and maintaining ongoing, open communication via email.

We were fortunate to have a robust data set from Yelp. The data set permitted us to easily adjust or modify our sample and the specific data to be used in the project. We also had the necessary programs to perform our analysis with each team member having access to Excel, JMP, R, SAS, SPSS Modeler and Tableau. These tools, combined with training on key data mining and analysis techniques from MSIS 5633 gave us the tools required to successfully achieve our project goals.

Implementing the Plan / Measuring Results

To implement our plan, we would identify start-up restaurants in the cities our sample was based on (Pittsburgh, Charlotte, Urbana-Champaign, Phoenix, Las Vegas, and Madison) and present our ideas to them.

Our analytic program will be successful if we can determine whether there are factors in the Yelp data set that accurately identify what most contributes to an expensive restaurant receiving a poor rating. If we discover that none of the factors predict a low rating, that is an interesting insight that may be of value to Yelp. If we discover there are factors that lead to low ratings, that will be of interest to Yelp, restaurant owners, and possibly diners.


Beyond our analysis, we would like to succeed by helping struggling restaurants. By leveraging our insights, they could improve the number of customer visits as well as their reviews. If the number of customers significantly increases alongside high ratings, our analysis has done more than succeed.

Our potential clients would mainly be start-up restaurants, as well as restaurants with very low ratings (1 or 2 stars). We could present our findings at a range of industry events such as the National Restaurant Association Conference or the Restaurant Finance & Development Conference, or at something more interesting like the TV show Restaurant Impossible.

Beyond that, we would present our model to customers who may have a vested interest in helping struggling restaurants turn their businesses around. This could include chefs who help with menu selections, interior designers who could improve the look, musicians who could improve the ambience, etc.

Scope Proposal

The scope of this project was limited to U.S. restaurants in the Yelp Dataset Challenge data. We focused on identifying the factors common to highly rated restaurants within this group that are not present in restaurants with lower ratings.

Data Dictionary

Our data dictionary is extensive given the number of variables provided by Yelp and the number of derived fields we created. We elected to maintain a large data dictionary to illustrate the breadth of data we had available and the new fields we created. We also used variable screening methods that leveraged a large number of variables to identify those useful to our model.

Yelp Data Set Challenge Master Data Dictionary

Variable | Description | Type | Length | Format
(The SAS informat of every field matches its format; derived numeric fields have no assigned format.)

Ages_Allowed | Describes ages allowed in restaurant (e.g. 19plus). | Char | 7 | $CHAR7.
Alcohol | Describes if/how alcohol is served (e.g. full bar, beer and wine, etc.). | Char | 13 | $CHAR13.
Attire | Describes appropriate dress for restaurant (e.g. dressy, casual). | Char | 6 | $CHAR6.
BYOB | True, False, or NA flag. | Char | 5 | $CHAR5.
BYOB_Corkage | True, False, or NA flag. | Char | 11 | $CHAR11.
Caters | True, False, or NA flag. | Char | 5 | $CHAR5.
Coat_Check | True, False, or NA flag. | Char | 5 | $CHAR5.
Corkage | True, False, or NA flag. | Char | 5 | $CHAR5.
Credit_Cards | True, False, or NA flag. | Char | 6 | $CHAR6.
Delivery | True, False, or NA flag. | Char | 5 | $CHAR5.
Dogs_Allowed | True, False, or NA flag. | Char | 5 | $CHAR5.
Drive_Thru | True, False, or NA flag. | Char | 5 | $CHAR5.
Friday_close | Close time for this day in 24 hour format. | Char | 5 | $CHAR5.
Friday_open | Open time for this day in 24 hour format. | Char | 5 | $CHAR5.
Good_For_Dancing | True, False, or NA flag. | Char | 5 | $CHAR5.
Good_For_Groups | True, False, or NA flag. | Char | 5 | $CHAR5.
Good_For_Kids2 | True, False, or NA flag. | Char | 5 | $CHAR5.
Good_For_breakfast | True, False, or NA flag. | Char | 5 | $CHAR5.
Good_For_brunch | True, False, or NA flag. | Char | 5 | $CHAR5.
Good_For_dessert | True, False, or NA flag. | Char | 5 | $CHAR5.
Good_For_dinner | True, False, or NA flag. | Char | 5 | $CHAR5.
Good_For_latenight | True, False, or NA flag. | Char | 5 | $CHAR5.
Good_For_lunch | True, False, or NA flag. | Char | 5 | $CHAR5.
Good_for_Kids | True, False, or NA flag. | Char | 5 | $CHAR5.
Happy_Hour | True, False, or NA flag. | Char | 5 | $CHAR5.
Has_TV | True, False, or NA flag. | Char | 5 | $CHAR5.
Monday_close | Close time for this day in 24 hour format. | Char | 5 | $CHAR5.
Monday_open | Open time for this day in 24 hour format. | Char | 5 | $CHAR5.
Music_dj | True, False, or NA flag. | Char | 5 | $CHAR5.
Music_jukebox | True, False, or NA flag. | Char | 5 | $CHAR5.
Music_karaoke | True, False, or NA flag. | Char | 5 | $CHAR5.
Music_live | True, False, or NA flag. | Char | 5 | $CHAR5.
Music_playlist | True, False, or NA flag. | Char | 5 | $CHAR5.
Music_video | True, False, or NA flag. | Char | 5 | $CHAR5.
Noise_Level | Describes noise level (e.g. average, quiet, loud). | Char | 9 | $CHAR9.
Open_24_Hrs | True, False, or NA flag. | Char | 5 | $CHAR5.
Order_at_Counter | True, False, or NA flag. | Char | 5 | $CHAR5.
Outdoor_Seating | True, False, or NA flag. | Char | 5 | $CHAR5.
Parking_garage | True, False, or NA flag. | Char | 5 | $CHAR5.
Parking_lot | True, False, or NA flag. | Char | 5 | $CHAR5.
Parking_street | True, False, or NA flag. | Char | 5 | $CHAR5.
Parking_valet | True, False, or NA flag. | Char | 5 | $CHAR5.
Parking_validated | True, False, or NA flag. | Char | 5 | $CHAR5.
Payment_amex | True, False, or NA flag. | Char | 5 | $CHAR5.
Payment_cash_only | True, False, or NA flag. | Char | 5 | $CHAR5.
Payment_discover | True, False, or NA flag. | Char | 5 | $CHAR5.
Payment_mastercard | True, False, or NA flag. | Char | 5 | $CHAR5.
Payment_visa | True, False, or NA flag. | Char | 5 | $CHAR5.
Saturday_close | Close time for this day in 24 hour format. | Char | 5 | $CHAR5.
Saturday_open | Open time for this day in 24 hour format. | Char | 5 | $CHAR5.
Smoking | Describes if/where smoking is permitted (e.g. no, outdoor). | Char | 7 | $CHAR7.
Sunday_close | Close time for this day in 24 hour format. | Char | 5 | $CHAR5.
Sunday_open | Open time for this day in 24 hour format. | Char | 5 | $CHAR5.
Take_out | True, False, or NA flag. | Char | 5 | $CHAR5.
Takes_Reservations | True, False, or NA flag. | Char | 5 | $CHAR5.
Thursday_close | Close time for this day in 24 hour format. | Char | 5 | $CHAR5.
Thursday_open | Open time for this day in 24 hour format. | Char | 5 | $CHAR5.
Tuesday_close | Close time for this day in 24 hour format. | Char | 5 | $CHAR5.
Tuesday_open | Open time for this day in 24 hour format. | Char | 5 | $CHAR5.
Waiter_Service | True, False, or NA flag. | Char | 5 | $CHAR5.
Wednesday_close | Close time for this day in 24 hour format. | Char | 5 | $CHAR5.
Wednesday_open | Open time for this day in 24 hour format. | Char | 5 | $CHAR5.
Wheelchair_Accessible | True, False, or NA flag. | Char | 5 | $CHAR5.
Wi_Fi | Describes wi-fi availability and cost (e.g. no, free). | Char | 4 | $CHAR4.
afternoon_check-ins* | Derived from check-ins file. Sum of afternoon check-ins from 11AM to 3PM. | Num | 8
avgstars_review_file* | Derived from reviews file. Average star rating in the reviews file for a restaurant. | Num | 8
background_music | True, False, or NA flag. | Char | 5 | $CHAR5.
business_id | Unique identifier for individual restaurants. Also the primary key. | Char | 22 | $CHAR22.
casual | True, False, or NA flag. | Char | 5 | $CHAR5.
categories | Catchall field from Yelp that includes restaurant type, foods, etc. | Char | 199 | $CHAR199.
city | City where restaurant is located. | Char | 35 | $CHAR35.
classy | True, False, or NA flag. | Char | 5 | $CHAR5.
cool_pct* | Derived from reviews file. Percent of total reviews that were voted cool. | Num | 8
dairy_free | True, False, or NA flag. | Char | 5 | $CHAR5.
divey | True, False, or NA flag. | Char | 5 | $CHAR5.
ethnicity* | Derived from restaurants file. Text mining done to create flags for food type. | Char | 25
evening_check-ins* | Derived from check-ins file. Sum of evening check-ins from 6PM to 11PM. | Num | 8
frihours* | Derived from open and close times. Number of hours open this day. | Num | 8
full_address | Full physical address of restaurant. | Char | 110 | $CHAR110.
fullweek_hours* | Derived from open and close times. Number of hours open for the week. | Num | 8
funny_pct* | Derived from reviews file. Percent of total reviews that were voted funny. | Num | 8
gluten_free | True, False, or NA flag. | Char | 5 | $CHAR5.
halal | True, False, or NA flag. | Char | 5 | $CHAR5.
hipster | True, False, or NA flag. | Char | 5 | $CHAR5.
intimate | True, False, or NA flag. | Char | 5 | $CHAR5.
kosher | True, False, or NA flag. | Char | 5 | $CHAR5.
lateafternoon_check-ins* | Derived from check-ins file. Sum of check-ins from 3PM to 6PM. | Num | 8
latenight_check-ins* | Derived from check-ins file. Sum of check-ins from 11PM to 5AM. | Num | 8
latitude | Latitude of restaurant. | Num | 8 | BEST16.
longitude | Longitude of restaurant. | Num | 8 | BEST17.
monhours* | Derived from open and close times. Number of hours open this day. | Num | 8
morning_check-ins* | Derived from check-ins file. Sum of morning check-ins from 5AM to 11AM. | Num | 8
name | Name of restaurant. | Char | 61 | $CHAR61.
neighborhoods | Neighborhood restaurant is located in. | Char | 52 | $CHAR52.
open | Whether the restaurant is still in business (True or False). | Char | 5 | $CHAR5.
pct_likes_of_tips* | Derived from tips file. Percentage of tips that were liked by other users. | Num | 8
price_range | 1 to 4, with 4 being the most expensive. | Char | 2 | $CHAR2.
rating* | Derived from stars field. Low (1-2), Medium (2.5-3.5), High (3.5-5). | Char | 3
restaurant_type* | Derived from text mining categories field. Type of restaurant (e.g. Bar, Pub, Fast Food). | Char | 25
review_count | Total number of reviews for restaurant as reported on Yelp business file. | Num | 8 | BEST4.
romantic | True, False, or NA flag. | Char | 5 | $CHAR5.
sathours* | Derived from open and close times. Number of hours open this day. | Num | 8
soy_free | True, False, or NA flag. | Char | 5 | $CHAR5.
stars | Overall rating of restaurant. | Num | 8 | BEST3.
state | State where restaurant is located. | Char | 3 | $CHAR3.
sunhours* | Derived from open and close times. Number of hours open this day. | Num | 8
target* | Derived dependent variable. 1 when restaurant has High rating, zero otherwise. | Num | 8
thurshours* | Derived from open and close times. Number of hours open this day. | Num | 8
tot_check-ins* | Derived from check-ins file. Total number of check-ins for restaurant. | Num | 8
tot_cool* | Derived from tips file. Total number of tips voted cool. | Num | 8
tot_funny* | Derived from tips file. Total number of tips voted funny. | Num | 8
tot_reviews* | Derived from reviews file. Total number of reviews for restaurant. | Num | 8
tot_tip_likes* | Derived from tips file. Total number of likes for all tips for a restaurant. | Num | 8
tot_tips* | Derived from tips file. Total number of tips for restaurant. | Num | 8
tot_useful* | Derived from tips file. Total number of reviews voted useful. | Num | 8
touristy | True, False, or NA flag. | Char | 5 | $CHAR5.
trendy | True, False, or NA flag. | Char | 5 | $CHAR5.
tueshours* | Derived from open and close times. Number of hours open this day. | Num | 8
type | Type of record (e.g. business, review, tip, etc.). | Char | 8 | $CHAR8.
upscale | True, False, or NA flag. | Char | 5 | $CHAR5.
useful_pct* | Derived field. Percent of total reviews that were voted useful. | Num | 8
vegan | True, False, or NA flag. | Char | 5 | $CHAR5.
vegetarian | True, False, or NA flag. | Char | 5 | $CHAR5.
wedhours* | Derived from open and close times. Number of hours open this day. | Num | 8
weekday_afternoon_check-ins* | Derived from check-ins file. Sum of weekday afternoon check-ins from 11AM to 3PM. | Num | 8
weekday_evening_check-ins* | Derived from check-ins file. Sum of weekday evening check-ins from 6PM to 11PM. | Num | 8
weekday_hours* | Derived from check-ins file. Sum of hours open Monday-Friday. | Num | 8
weekday_lateafternoon_check-ins* | Derived from check-ins file. Sum of weekday check-ins from 3PM to 6PM. | Num | 8
weekday_latenight_check-ins* | Derived from check-ins file. Sum of weekday check-ins from 11PM to 5AM. | Num | 8
weekday_morn_check-ins* | Derived from check-ins file. Sum of weekday morning check-ins from 5AM to 11AM. | Num | 8
weekend_afternoon_check-ins* | Derived from check-ins file. Sum of weekend afternoon check-ins from 11AM to 3PM. | Num | 8
weekend_evening_check-ins* | Derived from check-ins file. Sum of weekend evening check-ins from 6PM to 11PM. | Num | 8
weekend_hours* | Derived from check-ins file. Sum of hours open Saturday-Sunday. | Num | 8
weekend_lateafternoon_check-ins* | Derived from check-ins file. Sum of weekend check-ins from 3PM to 6PM. | Num | 8
weekend_latenight_check-ins* | Derived from check-ins file. Sum of weekend check-ins from 11PM to 6AM. | Num | 8
weekend_morn_check-ins* | Derived from check-ins file. Sum of weekend morning check-ins from 5AM to 11AM. | Num | 8
budget_tm* | Derived from text mining tips file. Concepts related to money. 0=False, 1=True. | Num | 8
drinks_tm* | Derived from text mining tips file. Concepts related to drinks in general, e.g. beer, juice, water, tea, shakes. 0=False, 1=True. | Num | 8
food_tm* | Derived from text mining tips file. Concepts related to food, ingredients, vegetables, fruits, dessert. 0=False, 1=True. | Num | 8
hours_tm* | Derived from text mining tips file. Concepts related to days, dates, time, open, closed, etc. 0=False, 1=True. | Num | 8
location_tm* | Derived from text mining tips file. Concepts related to the location and its ambiance, e.g. seats, doors, kitchen, Arizona. 0=False, 1=True. | Num | 8
negative_tm* | Derived from text mining tips file. Concepts related to negative feelings, e.g. rude, dirty. 0=False, 1=True. | Num | 8
people_tm* | Derived from text mining tips file. Concepts related to individuals, e.g. family, friends, kids, wife. 0=False, 1=True. | Num | 8
positive_tm* | Derived from text mining tips file. Concepts generally related to positive feelings, e.g. clean, crispy. 0=False, 1=True. | Num | 8
service_tm* | Derived from text mining tips file. Concepts related to how the service is viewed, e.g. waitress, manager, wait time. 0=False, 1=True. | Num | 8
neighborhood_flg* | Derived from neighborhood field. 1 if neighborhood was listed, 0 if not. | Num | 8
text_topic1* | Derived from text mining reviews. Concepts related to: "+taco,+salsa,+chip,+burrito,mexican" | Num | 8
text_topic2* | Derived from text mining reviews. Concepts related to: "+customer,+know,+bad,+manager,+location" | Num | 8
text_topic3* | Derived from text mining reviews. Concepts related to: "+pizza,+crust,+slice,+cheese,+thin" | Num | 8
text_topic4* | Derived from text mining reviews. Concepts related to: "+great,+great food,+great service,+service,+food" | Num | 8
text_topic5* | Derived from text mining reviews. Concepts related to: "+burger,fries,+fry,+bun,+onion" | Num | 8
text_topic6* | Derived from text mining reviews. Concepts related to: "+wine,+restaurant,+dish,+dessert,+meal" | Num | 8
text_topic7* | Derived from text mining reviews. Concepts related to: "+sushi,+roll,+fish,+tuna,+roll" | Num | 8
text_topic8* | Derived from text mining reviews. Concepts related to: "+breakfast,+egg,+coffee,+toast,+pancake" | Num | 8
text_topic9* | Derived from text mining reviews. Concepts related to: "+thai,+rice,+dish,+noodle,thai" | Num | 8
text_topic10* | Derived from text mining reviews. Concepts related to: "+buffet,+crab,+dessert,+leg,+selection" | Num | 8
text_topic11* | Derived from text mining reviews. Concepts related to: "+beer,+bar,+selection,+drink,+night" | Num | 8
text_topic12* | Derived from text mining reviews. Concepts related to: "+sandwich,+bread,+lunch,+salad,+meat" | Num | 8
text_topic13* | Derived from text mining reviews. Concepts related to: "+hour,+happy,+happy hour,+drink,+special" | Num | 8
text_topic14* | Derived from text mining reviews. Concepts related to: "+price,+steak,+good,good,+portion" | Num | 8
text_topic15* | Derived from text mining reviews. Concepts related to: "de,est,le,à,+pour" | Num | 8
text_topic16* | Derived from text mining reviews. Concepts related to: "+steak,+rib,+chicken,bbq,+sauce" | Num | 8
text_topic17* | Derived from text mining reviews. Concepts related to: "+minute,+wait,+table,+wait,+order" | Num | 8
text_topic18* | Derived from text mining reviews. Concepts related to: "always,+staff,+friendly,+love,+location" | Num | 8
text_topic19* | Derived from text mining reviews. Concepts related to: "+time,first,+first time,vegas,+love" | Num | 8
text_topic20* | Derived from text mining reviews. Concepts related to: "+salad,+lunch,+chicken,always,+special" | Num | 8

* Denotes that this is a derived or calculated field.

Data Access


Our data was downloaded from the Yelp Dataset Challenge web page at http://www.yelp.com/dataset_challenge. Clicking the ‘Get the Data’ button and completing a form starts the download.

The data includes information on the businesses that have been reviewed, the reviews, the user/reviewer, user check-ins, and user provided tips. Yelp defines the data as follows:

The Challenge Dataset:

- 1.6M reviews and 500K tips by 366K users for 61K businesses
- 481K business attributes, e.g., hours, parking availability, ambience
- Social network of 366K users for a total of 2.9M social edges
- Aggregated check-ins over time for each of the 61K businesses

Cities:

- U.K.: Edinburgh
- Germany: Karlsruhe
- Canada: Montreal and Waterloo
- U.S.: Pittsburgh, Charlotte, Urbana-Champaign, Phoenix, Las Vegas, Madison

From the data, we focused only on records associated with restaurants. The process of consolidating and cleaning the data is outlined in the sections that follow.

Data Consolidation

Yelp provided the data in 5 files. Descriptions of each file are included below.

File Name | Description | File Format | Size | Number of Records
yelp_academic_dataset_business | List of reviewed businesses | JSON | 54MB | 61,181
yelp_academic_dataset_review | Review information on businesses | JSON | 1.39GB | 1,569,264
yelp_academic_dataset_user | Information on Yelp users/reviewers | JSON | 162MB | 366,715
yelp_academic_dataset_checkin | Information on check-ins at businesses | JSON | 20MB | 45,166
yelp_academic_dataset_tip | Tips for each business | JSON | 96MB | 495,107

A lot of data cleansing and manipulation had to be done to consolidate the data into a single data set for modeling purposes. To get to a single data set, we went through the six-step process below.

1. Identify restaurants on the business file
2. Create a subset of the business file that only includes restaurants
3. Create subsets of the reviews, check-ins, and tips files
4. Summarize data from the review, check-in, and tips files (e.g. sum the number of check-ins/tips/reviews for each restaurant) and create a file for the summarized data containing only business ID and summary fields that can be appended back to the restaurants file
5. Text mine key text fields in the review and tips files to create content category flags for each restaurant
6. Merge the summary tables back to the restaurant/business file, which serves as the final modeling data set

Here is a sample of the SQL code used to merge the individual files back to the master.

proc sql;
  create table yelp.yelp_restaurant_reviews as
  select a.*, b.rating, b.stars as avg_star_rating
  from yelp.yelp_restaurant_reviews a
  left join yelp.yelp_restaurants b
    on a.business_id = b.business_id;
quit;
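Step 4 was handled in the same spirit. As an illustration only (not the exact code we ran), the tip counts and tip likes could be rolled up to one row per restaurant in R as follows, assuming the tips file has business_id and likes columns as implied by the derived fields in the data dictionary:

# roll the tips file up to one row per restaurant (step 4)
tip_counts <- aggregate(likes ~ business_id, data = tips, FUN = length)
names(tip_counts)[2] <- "tot_tips"                      # number of tips per restaurant
tip_likes <- aggregate(likes ~ business_id, data = tips, FUN = sum)
names(tip_likes)[2] <- "tot_tip_likes"                  # total likes across those tips
tip_summary <- merge(tip_counts, tip_likes, by = "business_id")
# append the summary fields back to the restaurant file (the step 6 merge, in R)
yelp_restaurants <- merge(yelp_restaurants, tip_summary, by = "business_id", all.x = TRUE)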

Data Cleaning

The data cleaning process was extensive and time-consuming with the Yelp data. The JSON data required extensive formatting, and some Yelp data fields combine somewhat unrelated data into a single field.

To convert the JSON fields into a more usable tab-delimited text format, we used the jsonlite R package and the following commands for each file. The file names were changed for each run to match the file being processed.

library(jsonlite)                                    # load jsonlite library
yelp <- "yelp_academic_dataset_review.json"          # assign file name to yelp variable
reviews <- stream_in(file(yelp))                     # read in the file
reviews <- flatten(reviews, recursive = TRUE)        # flatten the nested JSON structure
reviews$text <- gsub('\n', ' ', reviews$text)        # strip line feeds from the text field
reviews$text <- gsub('\r', ' ', reviews$text)        # strip carriage returns from the text field
reviews <- data.frame(lapply(reviews, as.character), stringsAsFactors = FALSE)  # coerce columns so write.table works
write.table(reviews, "yelp_reviews.txt", sep = "\t", row.names = FALSE)         # write out a tab-delimited text file

The business/restaurant file had a field labeled categories, which is basically a list of key/value pairs. A great deal of text mining leveraging SPSS Text Analytics was required to clean it and to create new fields from this attribute.

Data Transformation

Our data transformation focused primarily on the conversion of free-form text fields into flags that indicate whether a restaurant had reviews, tips, or category descriptions containing certain keywords or themes. To accomplish these transformations, we essentially constructed text mining models to create fields that could be fed into our final classification and prediction models.

Our text mining initiatives leveraged SPSS Modeler Text Analytics to accomplish this task for text in the Tips file and Restaurants File. SAS Text Analytics was used to create clusters from the review files.


A number of derived fields were also created. These were generally ways to summarize data that was already available in a different form. The hours each restaurant was open on a daily, weekly, and weekend level were calculated from the start and close time, for example.
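As an illustration, the daily hours calculation can be sketched in R, assuming the open and close fields hold 24-hour "HH:MM" strings as described in the data dictionary (Monday is shown; the other days follow the same pattern):

# convert an "HH:MM" string to fractional hours
to_hours <- function(t) {
  parts <- strsplit(t, ":")
  sapply(parts, function(p) as.numeric(p[1]) + as.numeric(p[2]) / 60)
}
open_hrs  <- to_hours(yelp_restaurants$Monday_open)
close_hrs <- to_hours(yelp_restaurants$Monday_close)
# handle places that close after midnight (e.g. open 17:00, close 02:00)
yelp_restaurants$monhours <- ifelse(close_hrs >= open_hrs,
                                    close_hrs - open_hrs,
                                    24 - open_hrs + close_hrs)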

Some of the more important derived fields are described below.

Rating – a field that bins Yelp star ratings from a 1 to 5 (in increments of .5) scale into Low, Medium, or High

Text Mining Fields – we mined the tips and reviews for the restaurants to create a list of indicators for the key concepts that emerge. An example of a theme is budget_tm, which includes concepts involving keywords surrounding price. A value of 1 indicates that a restaurant had a tip related to budget; 0 indicates that it did not.

Target – a field that serves as the target variable for our analysis. It identifies the restaurants with a rating of High.

Categories – The business file categories field contains a lot of valuable information about each restaurant. Unfortunately, the information is often unrelated and must be parsed out using a text mining tool to create indicator variables. The field may contain multiple values – Mexican, Tex-Mex, Nightlife, Lounge, etc.

In all, more than 30 fields were created through the text mining process. Those fields, as well as other derived fields, are denoted in the data dictionary with an asterisk.
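The concept extraction itself was done in SPSS Modeler Text Analytics and SAS, but the idea behind a flag like budget_tm can be approximated with a simple keyword match in R. The keyword list below is illustrative only, not the concept set the text mining tools actually produced:

# crude stand-in for budget_tm: 1 if a restaurant has at least one money-related tip
money_terms <- "price|cheap|expensive|cost|deal|\\$"
tips$money_hit <- grepl(money_terms, tolower(tips$text))
budget_flag <- aggregate(money_hit ~ business_id, data = tips, FUN = max)
names(budget_flag)[2] <- "budget_tm"    # 1 = at least one money-related tip, 0 = none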

Data Reduction

Data reduction efforts focused on restricting our data to only the businesses we identified as restaurants. To do that, we restricted our business file universe to restaurants using the code below, which looks for the keyword restaurant in the Yelp categories field and creates a new restaurant indicator. The second line of code then subsets the data using that indicator variable. The business IDs from this subset of restaurants were used to restrict the records in our reviews, tips, and check-ins files to restaurants only.

# Identify restaurants
business$restaurant_flg <- grepl("Restaurant|restaurant", business$categories)
yelp_restaurants <- business[business$restaurant_flg, ]   # keep only rows flagged as restaurants

Our next task was to reduce the review data set to include only reviews that corresponded to our newly created list of restaurants. The code below shows our approach to this process using R.

ids <- yelp_restaurants$business_id
# subset the reviews to the restaurant business IDs
restaurant_reviews <- reviews[reviews$business_id %in% ids, ]

Descriptive Analysis

Using JMP 12, we did some descriptive analysis to get a better understanding of the distributions of some of the key variables.

Ethnicity

First, plotting the ethnicity variable against the target variable (see Data Transformation) shows the likelihood of a restaurant being a 4-5 star restaurant for each ethnicity.

In the graph, we can see that certain ethnicities stand out. In terms of high likelihood of high rating, Polish, Russian, Scandinavian, and African restaurants seem to be well received. On the other end of the scale, American, Irish, Mexican, and Unknown restaurants are not particularly successful.

To illustrate an essential problem with this analysis, we also brought in a frequency table for the different ethnicities. Here we see that most of the ethnicities have relatively few records on which to base any assumptions.
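The same likelihood-plus-frequency check can be reproduced outside JMP. A small R sketch, assuming the merged modeling file is held in the yelp_restaurants data frame with the derived ethnicity and target fields:

# share of highly rated restaurants and record count for each ethnicity
eth_rate  <- aggregate(target ~ ethnicity, data = yelp_restaurants, FUN = mean)
eth_count <- as.data.frame(table(yelp_restaurants$ethnicity))
eth_summary <- merge(eth_rate, eth_count, by.x = "ethnicity", by.y = "Var1")
names(eth_summary)[2:3] <- c("pct_high_rated", "n_restaurants")
eth_summary[order(-eth_summary$pct_high_rated), ]   # highest likelihood of a high rating first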


Based on the frequency table above, the most frequent ethnicities are American, Asian, Mexican, Italian, and Unknown. Interestingly, this list is close to the opposite of the list with the highest likelihood of a high rating. This could be taken as an indicator that one of the aspects needed for a good review is scarcity or originality, which would make sense for various reasons. When a restaurant serves the only food of its kind, there are fewer restaurants to compare it to. You see the same thing with people who taste very high-end food: their standards rise after going to a Michelin-rated restaurant, compared to someone who has never tasted a Michelin-star-worthy meal.

Weekly hours

Another interesting observation is the importance of weekly hours. In the graph below, you can see that the likelihood of a high rating decreases as the number of hours goes above 70.


Again, we built a simple frequency table to double-check that we are not making assumptions based on a small sample size.

As seen in the frequency table, there are at least 400 reviews for each block of full-week hours between 30 and 110 hours, so making assumptions within this range should be reasonably safe. Focusing on fewer hours may help increase the quality of the restaurant: it can help ensure that high-quality staff are already at the restaurant, whereas covering more shifts increases the chance of having to hire less qualified workers.

Location

It is also interesting to examine the importance of location, so we made maps in Tableau to show the relationship between location, number of reviews, and rating.

(Tableau maps, one per city, showing restaurant locations, number of reviews, and ratings: Karlsruhe, Germany; Edinburgh, U.K.; Montreal, Canada; Waterloo, Canada; Pittsburgh, PA; Madison, WI; Urbana-Champaign, IL; Charlotte, NC; Phoenix, AZ; Las Vegas, NV.)

As seen in the maps above, the distribution of highly rated restaurants appears to be independent of how central the location is in each city. There do, however, appear to be more high-end restaurants in the larger cities.

Restaurant Type

Another aspect, similar to restaurant ethnicity, is the restaurant type. Below you can see graphs and summary statistics generated using JMP 12.


We see that certain groups seem to be underrepresented in the high rating category, for example fast food, caterers, and buffets. Among those relatively more represented in the highly rated category are bakeries, cafés, delis, coffee/tea houses, food trucks, and tapas bars. Again, originality seems to play a role, as we saw in the analysis of ethnicity.

Select Modeling Techniques

We elected to build multiple models in order to have a range of techniques and potential outcomes. This section provides the details on each model – why it was selected, how it was used, how it was built, and its results.

Model 1 – The Decision Tree

Our first model choice was a decision tree. Given the high number of potential independent variables in our data set, we needed a way to quickly identify the variables most useful in classifying each record into the highly rated restaurant bucket or non-highly rated restaurant bucket using our target variable. A decision tree seemed to be a logical choice. Decision trees offer a number of benefits in this sort of scenario:

1. They are easy to understand and visualize
2. They are easy to implement
3. They handle most any kind of data, so little pre-processing is required (missing value corrections, binning, correlation analysis, etc. generally aren't needed)
4. Outliers generally aren't a problem

Consequently, decision trees provide a quick way to explore data and determine which variables may be of interest in predictive modeling.

Model 1 – Data Splitting and Sub-sampling

Before building the model, we had to determine how the data would be split and sampled within SPSS Modeler. Model 1 uses three data partitions:

- Training (used to build the model) – 60% of the file
- Testing (used to evaluate the model on a different data sample) – 20% of the file
- Validation (used to verify the accuracy of the model on a third sample) – 20% of the file

Our data set size of over 21,000 records allowed for the three partitions. The ratio of these splits should provide sufficient quantities to minimize variance in each. We used the default seed setting to ensure that our seed assignment was repeatable in various iterations and models.
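The partitioning was done with SPSS Modeler's Partition node. For readers working in R, the same 60/20/20 assignment with a repeatable seed could be sketched as follows (the seed value is arbitrary):

set.seed(1234)                                    # repeatable partition assignment
n <- nrow(yelp_restaurants)
yelp_restaurants$partition <- sample(c("Training", "Testing", "Validation"),
                                     size = n, replace = TRUE,
                                     prob = c(0.6, 0.2, 0.2))
prop.table(table(yelp_restaurants$partition))     # should come out near 60/20/20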

SPSS Modeler Partition Settings


These settings did a good job of randomly assigning target records to each partition. The screen capture below illustrates that the distribution of 0 and 1 values (High Rating = 1, Non-High Rating = 0) is roughly proportional across the Training, Testing, and Validation data sets.

Model 1 – Building the Model

The construction of our initial decision tree model was based on our goal of identifying the variables that are most important in classifying our target variable. With that in mind, our target variable was the target field itself.

Most potential classifier/predictive variables were fed into the model in an effort to screen for independent variables for other model types. The only fields that were excluded were those that had a direct tie to the target variable (e.g. the target variable was derived from ratings so all variations of the ratings field were excluded).

Input Fields for the Decision Tree


Input Fields for Decision Tree Continued


Input Fields for Decision Tree Continued

With the input variables ready and partitions created, the next step was to select the appropriate type of decision tree to build. Past experience has shown that the decision tree variants within SPSS Modeler produce similar results. Even so, we decided to experiment with CART, QUEST, C5 and CHAID trees to determine which provided the best initial results. The screen capture below shows the resulting SPSS Modeler stream.

As we will see, the CART tree performed best on our data, so that is where we will focus our build screen captures. For the final CART model we made a few changes to the default settings in an attempt to enhance performance. The first change was to enable boosting, which builds a series of trees to improve the fit.

The second change was to broaden the tree depth in an attempt to bring in more variables that may be of importance in future model builds (e.g. predictive models).
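The trees themselves were built in SPSS Modeler, but an equivalent CART-style tree can be sketched in R with the rpart package. The predictor list below is abbreviated (field names from our data dictionary), and the boosting and depth settings described above are not reproduced in this sketch:

library(rpart)
train <- yelp_restaurants[yelp_restaurants$partition == "Training", ]
valid <- yelp_restaurants[yelp_restaurants$partition == "Validation", ]
cart_fit <- rpart(factor(target) ~ tot_cool + text_topic4 + weekend_hours + text_topic2 +
                    tot_tips + review_count + fullweek_hours,
                  data = train, method = "class")
pred <- predict(cart_fit, valid, type = "class")
mean(as.character(pred) == as.character(valid$target))   # share correctly classified on Validation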

Model 1 – Assessing the Model


Our primary metric in evaluating and assessing decision trees was the percentage of records accurately classified on the Validation data set. Generally speaking, all of our decision trees performed well. They all correctly classified our target variable around 66-68% of the time.

You can see from the following table that CART had the best performance at 68.46%.

Model | % Correct (Validation Data)
CART | 68.46%
QUEST | 66.50%
C5 | 68.37%
CHAID | 67.66%

CART Results – Default Settings

The CART results with default settings are listed below. The model performed consistently from Training to Testing to Validation, which means there was little overfitting. Additionally, 10 variables showed up as having the most predictive importance. Four of those – tot_cool, text_topic4, weekend_hours and text_topic2 – stood out from the pack. These may be key variables to focus on in something like a logistic regression model.

The actual tree output and decision rules have been omitted since we were using this model only to identify the variables with the most predictive importance.

CART Results – Enhanced Settings


Running the same CART tree with boosting improved results a bit: the percentage accurately classified moved up to 70.25%. The list of variables with the most predictive importance looked very different, however. The top 10 fields are entirely different, and their predictive importance as assessed by the tree is much more evenly balanced.

Based on our results, we have two good decision tree models for classifying records based on our target variable. The question now becomes whether the variables identified can be used in a predictive model.

Model 2 – Logistic Regression

The second model builds on the output of the first. The original decision tree identified four variables that may be useful in a predictive model: tot_cool, text_topic4, weekend_hours and text_topic2. The goal of this model is to determine whether these fields can be used to predict our target variable (a high Yelp rating). Given that we have a binary target variable, a binary logistic regression model seems appropriate.

Binary logistic regression requires that the dependent variable be binary (have only two possible values, like 0/1 or True/False). Our target variable meets that criterion. Although logistic regression models appear similar to linear regression, they do not rely on many of the assumptions that linear regression models do. In particular, logistic regression does not require the following:

- A linear relationship between the independent and dependent variables
- Normally distributed independent variables
- Normally distributed error terms
- Homoscedasticity

In addition, ordinal and nominal variables can be used as predictors.

These differences mean that the tests required for the linear regression models discussed in class do not apply to this modeling technique.

Model 2 – Data Splitting and Sub-sampling

This model will use the same data splitting and sub-sampling techniques described for Model 1. It will leverage a Training data set (60% of original file), Test data set (20% of original file), and Validation data set (20% of original file). The rationale for this decision is the same as for Model 1.

Model 2 – Building the Model

Construction of the logistic regression model is an outflow of the decision tree created for Model 1. The target variable will be the binary target field created to indicate whether a restaurant was rated highly.

The independent predictor variables will include the variables that stood out in the original decision tree (tot_cool, text_topic4, weekend_hours and text_topic2).

The Logistic node was selected in IBM SPSS Modeler for this model. The resulting stream is shown below.

Logistic Regression Model Stream in IBM SPSS Modeler

The Enter method was leveraged for variable selection. Using this approach, all variables are entered in a single step. This makes sense in our scenario because we want to test the variables identified in the decision tree together.


The Model Evaluation setting was changed to calculate predictor importance. This will result in output that shows the predictive power of each model variable.

Aside from these selections, the default settings were used.
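For reference, the equivalent fit outside SPSS Modeler would look roughly like the R sketch below, reusing the train partition from the tree sketch above. It also shows how the McFadden pseudo R-square referenced in the assessment is computed:

logit_fit <- glm(target ~ tot_cool + text_topic4 + weekend_hours + text_topic2,
                 data = train, family = binomial)        # Enter method: all four variables at once
summary(logit_fit)                                        # coefficients and significance
# McFadden pseudo R-square: 1 - logLik(model) / logLik(intercept-only model)
null_fit <- glm(target ~ 1, data = train, family = binomial)
1 - as.numeric(logLik(logit_fit)) / as.numeric(logLik(null_fit))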

Model 2 – Assessing the Model

Our primary metric for evaluating this model is accuracy in predicting our target value of 1 in the Validation data set. As illustrated in the screen shot below, the model did not do a good job of prediction: it correctly identified the target variable in the Validation data set only 39.44% of the time.

Pseudo R-square values confirm that the model did not fit well. McFadden pseudo R-square values between .2 and .4 generally indicate that a model has an excellent fit; this model is much lower at .078.


The independent variables and their predictive importance are shown below. Interestingly, the predictive importance differed between the decision tree and the logistic regression model. Tot_cool, the number of reviews voted cool, remained at the top in both models, however.

The equation for this logistic regression model was:

Although the equation isn't terribly predictive, it is interesting that the total number of cool ratings has a positive impact toward a high rating while weekend hours has a slightly negative one.

While the variables from our decision tree in Model 1 seemed to work well for classification, they did not perform well for prediction. We had to try different approaches to boost predictive performance.

Model 3 – Logistic Regression Part II


Our first logistic regression model was constructed using variables that looked promising from the decision tree in Model 1. Since that model did not perform well in terms of predictive power, we decided to try logistic regression again, this time focusing on variables selected using our intuition and curiosity. For this model, more variables were selected; the idea was to let the model pick those with the most predictive power.

Model 3 – Data Splitting and Sub-sampling

Once again, we used the same data splitting and sub-sampling methodology as in the prior models: 60% Training, 20% Testing, and 20% Validation.

Model 3 – Building the Model

For this model, the same target variable was used. The independent variables shifted to include 50 variables related to type of food, food specialties, total reviews, types of reviews, hours open, check-in times and days, and a range of text mining fields. For brevity, the fields are not listed here. The model assessment section highlights those selected by the model, however.

For this model, the variable selection method was set to Stepwise. Stepwise is a good method to use when you have a large number of potential independent variables and are unsure which may be best for modeling. It allows for multiple model iterations in which variables are added and removed until the best combination of variables has been selected.

Aside from this change, all settings remain the same as in the previous logistic regression model.
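In R terms, the idea is roughly what step() does with a binomial glm, sketched below with a handful of the 50 candidates. Note that step() selects on AIC rather than the significance tests SPSS Modeler's Stepwise method uses, so this is an analogy, not a reproduction:

full_fit <- glm(target ~ weekend_hours + ethnicity + restaurant_type + touristy +
                  Good_For_breakfast + Good_For_latenight + tot_tips,
                data = train, family = binomial)
step_fit <- step(full_fit, direction = "both", trace = FALSE)   # add and drop terms iteratively
summary(step_fit)                                               # the surviving predictors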

Model 3 – Assessing the Model

Using the same criteria to evaluate this logistic regression model, we see that it correctly predicted true values for the target variable only 39.74% of the time. This is a slight improvement over the previous model, but its predictive power is still weak.


The list of variables pulled into the model shows the variables with the most predictive importance.

A few interesting variables rise to the top – weekend hours, ethnicity, restaurant type, afternoon check-ins, touristy, good for breakfast, and good for late night could all inform restaurant decision making to drive higher reviews. Unfortunately, their predictive performance is relatively low, and decision making based on the variables selected would be unreliable at best.

The regression equation for this model becomes extremely long making it virtually unusable. For that reason, it has been omitted.

The McFadden Pseudo R Square value has improved, but it remains below the .2 level at which we could say the model is well fitted.

Model 4 – Fit Least Squares


To investigate the topics found in the text mining, we ran a least squares regression with the 20 topics as the variables using the JMP 12 software. The software ranks the topics by LogWorth (calculated as −log10(p-value), so higher values correspond to smaller p-values) and uses the most significant topics to build the best model.
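For readers without JMP, the same fit can be approximated in Python with statsmodels. The snippet below is only a sketch; the exported file and the column names (text_topic1 through text_topic20, stars) are assumptions.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("yelp_text_topics.csv")  # hypothetical export of the scored records

topics = [f"text_topic{i}" for i in range(1, 21)]
fit = smf.ols("stars ~ " + " + ".join(topics), data=df).fit()

# LogWorth = -log10(p-value); larger values indicate more influential topics.
logworth = (-np.log10(fit.pvalues.drop("Intercept"))).sort_values(ascending=False)
print(round(fit.rsquared, 3))
print(logworth.head(6))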

Model 4 – Data Splitting and Sub-Sampling

There was no need to do any splitting, as JMP was able to run through all the variables and records without any splits or samples.

Model 4 – Building the model

To build the model, we used the Fit Model function in JMP. With this model, we used stars as the Y variable to be predicted, and the text_topic1-20 to construct model effects. The personality was Standard Least Squares.


Model 4 – Assessing the model

In the output above, you can see the importance of the different topics. The R-square of only 0.22, however, shows that the model explains little of the variation in ratings. What can be taken from this model is the LogWorth value for the different topics: text_topic2 and text_topic4 are the most important, followed by topics 18, 17, 6, and 12 in descending order of importance. It is interesting to note that text_topic2 and text_topic4 also stood out in our decision tree model.

Looking at the term groups below, the most important themes are manager, location, food, service, wine, dessert, staff, friendliness, wait time, and bread, salad and meat. For anyone opening a restaurant, these are the areas that deserve the most focus.

text_topic2*  "+customer,+know,+bad,+manager,+location"
text_topic4*  "+great,+great food,+great service,+service,+food"
text_topic6*  "+wine,+restaurant,+dish,+dessert,+meal"
text_topic18* "always,+staff,+friendly,+love,+location"
text_topic17* "+minute,+wait,+table,+wait,+order"
text_topic12* "+sandwich,+bread,+lunch,+salad,+meat"

Model 5 - Text Profiling

To find the terms most associated with the different star ratings, we ran the reviews through SAS' Text Profiler tool. The resulting output gives the most commonly occurring terms in the reviews at each star level.

Model 5 – Data Splitting and Sub-Sampling

The data was first reduced to a 5% sample to make its size manageable. The sample was then split into three sections: training (20%), validation (50%), and testing (30%).

Model 5 – Building the Model

To build the model, the 5% sample was run through a partition node to create the 20-50-30 training, validation, and testing split. Next came a text parsing node to extract the text fields used in the analysis, then a text filter node to remove unnecessary terms, special characters, and so on. Finally, before the text profile node, a text topic node created a set of categorical variables to be used in the text profiling.
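The nodes above are components of the SAS text mining workflow. Purely as an illustration of the same flow (sample, partition, parse, filter, extract topics, then profile terms by star level), a rough Python analogue might look like the sketch below; the file and column names ("text", "stars") are assumptions.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

reviews = pd.read_csv("yelp_reviews.csv")             # hypothetical review extract
sample = reviews.sample(frac=0.05, random_state=42)   # 5% sub-sample, as in the model

# 20-50-30 partition: 20% training, then the remaining 80% split into 50% and 30%.
train, rest = train_test_split(sample, train_size=0.20, random_state=42)
validation, test = train_test_split(rest, train_size=0.625, random_state=42)

# Parsing and filtering: tokenize, drop stop words and rare terms.
vectorizer = CountVectorizer(stop_words="english", min_df=10)
vectorizer.fit(train["text"])
terms = vectorizer.get_feature_names_out()

# Topic extraction, analogous to the Text Topic node.
lda = LatentDirichletAllocation(n_components=20, random_state=42)
lda.fit(vectorizer.transform(train["text"]))

# Profiling: most common terms per star level, analogous to the Text Profile node.
for stars, group in train.groupby("stars"):
    counts = vectorizer.transform(group["text"]).sum(axis=0).A1
    print(stars, [terms[i] for i in counts.argsort()[::-1][:10]])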

Text Parsing settings:


Text Filter settings:


Text Topic settings:

Text Profile settings:

Text Profile output:

Model 5 – Assessing the model

With the text profile, we can see that there are certain areas customers seem most concerned about when rating. For the low-rated restaurants, the terms focus on staff/service, mistakes such as hair in the food, price, portion size, and taste. For the better-rated restaurants, the main terms in the reviews are more about the owner, the town, service, and great food.

This model illustrates well how the YELP reviews can be analyzed. Since it is hard to predict the rating from any set of terms or other attributes, the most useful approach seems to be descriptive analytics: finding the commonalities among the best reviews.

Model 5 Modification

When analyzing Model 5, we came across one problem: adjectives. Although they describe the tone of a review, adjectives don't say much about which specific parts of the restaurant to focus on when trying to make it successful. Hence, we kept only the nouns found in the reviews by ignoring all other parts of speech in the text parsing node. The terms the reviewers focused on as a result are shown below:


Modified Model 5 Output:

In this model, we can see that the most important things to the reviewers seem to be staff/service, town, food, portion and price.
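As a side note, the noun-only parsing step could be mimicked outside SAS with a part-of-speech tagger. The snippet below is a minimal sketch using NLTK; it assumes the "punkt" and "averaged_perceptron_tagger" resources have been downloaded, and the sample sentence is invented.

from nltk import word_tokenize, pos_tag

# Requires: nltk.download("punkt") and nltk.download("averaged_perceptron_tagger")
def nouns_only(review_text):
    """Return only the noun tokens (POS tags starting with 'NN') from a review."""
    tokens = word_tokenize(review_text)
    return [word for word, tag in pos_tag(tokens) if tag.startswith("NN")]

print(nouns_only("The staff forgot our order and the portions were tiny for the price."))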

Model 6 – Linear Regression

Model 6 was built using linear regression to estimate the degree to which the nature of Reviews and Tips influences ratings. The target variable for the model was "Stars," the number of stars associated with each rating.

Model 6 - Building the model

Before building the model, we assessed the numeric independent variables to determine which to include. Based on the results of the correlation analysis, we excluded from the model every independent variable with a correlation above 0.7 with another independent variable.
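That screening step can be expressed compactly in pandas. The sketch below is illustrative only and assumes the candidate numeric predictors sit in a DataFrame X.

import numpy as np

def drop_correlated(X, threshold=0.7):
    """Drop one variable from every pair whose absolute correlation exceeds the threshold."""
    corr = X.corr().abs()
    # Keep only the upper triangle so each pair is considered once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return X.drop(columns=to_drop)

# Example (hypothetical): X_reduced = drop_correlated(X)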

Correlation between Independent variables


Correlation between dependent variable and independent variables

This left us with 5 input variables which were included in the final model:

Model 6 - Assessing the Model

The basic results show the Percentage of total reviews voted “Cool” to have the greatest predictor importance on Ratings, followed by the Percentage of total reviews voted “Funny”.


tot_tips and tot_tip_likes had the same degree of importance, which was not very significant. It was interesting to discover that the Percentage of reviews voted “Useful” had a predictor importance of zero, though it had a strong correlation with the Target variable.

The regression equation: Stars = 3.407 – 0.01232*funny_pct – 0.0012*useful_pct + 0.013202*cool_pct + 0.002287*tot_tips + 0.02388*tot_tip_likes
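Applied directly, the equation works as in the small sketch below; the example inputs are invented, and attributing the final coefficient to tot_tip_likes is our reading of the five-variable input list above.

def predict_stars(funny_pct, useful_pct, cool_pct, tot_tips, tot_tip_likes):
    """Apply the Model 6 regression equation to estimate a star rating."""
    return (3.407
            - 0.01232 * funny_pct
            - 0.0012 * useful_pct
            + 0.013202 * cool_pct
            + 0.002287 * tot_tips
            + 0.02388 * tot_tip_likes)

# Invented example: 10% funny, 30% useful, 20% cool votes, 15 tips, 5 tip likes.
print(round(predict_stars(10, 30, 20, 15, 5), 2))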

The results of the regression are presented in the following screenshots:

The adjusted R squared value of 0.108 means that the model does not do a very good job of explaining variation in the dependent variable. Looking at the F value and t values, it seems that the independent variables selected for the model do have some limited ability to explain variation in the dependent variable.


Discussion

From the models we created, it is clearly difficult to predict the review a given customer will leave. This is natural: people are highly diverse and focus on different things, and no two people will think exactly the same thing about a place. Even so, there is some support for saying that certain factors may improve the chances of satisfied customers.

In Model 1, we saw that the most important factors were total cool reviews, text topic 4 (food and service), weekend hours, and text topic 2 (customer, manager, and location). This is similar to what we found in Models 4 and 5 in terms of text topics, and to Models 2, 3, and 6 in terms of the importance of weekend hours and total cool reviews.

Though the number of cool reviews may not tell us much about what to focus on when building a successful restaurant, the apparent importance of weekend hours is of interest. As seen in the plot below, there is a trend similar to the one found in the descriptive analytics part: fewer hours, more stars. The reason is hard to explain without further investigation and data from the businesses, but one possibility, as suggested in the descriptive analysis section, is that fewer shifts may help ensure a high-quality staff at all times.

The suggestion about the staff does seem to hold up in the other models too. Looking at topic 4 and Model 5, the two main things people are concerned about are in fact staff/service and food. The argument that fewer shifts help improve quality is therefore also supported by those models (we must not forget that food is as closely connected to people as service is, since it is the chefs preparing the food who determine how good it tastes).

Conclusion

From the above models, we can see that the data provided by YELP does not work very well with predictive models. The better way to analyze the reviews is therefore through text analytics and grouping. Through the Text Profiler, we found that the most important terms relate to food and service. Within service, people use words like love, good service, hair, bug, and care. In other words, restaurants that focus on the quality of their staff, cleanliness, and quality food will most likely succeed in the business. We also found that one thing restaurants may need to do is reduce their hours; this may resolve a number of quality issues and, in turn, help increase the rating of the restaurant.