Using NLP to Find “Interesting” Collections of...


Engineering/Operations A TripAdvisor Blog

Using NLP to Find “Interesting” Collections of Hotels
Craig Schmidt posted October 5, 2015

There are a lot of hotels on TripAdvisor. At the moment, there are 1790 hotels listed for Paris, 1054 hotels for London, and 466 hotels in New York City. We have been working on better ways to explore these hotels and find an interesting place to stay. In this blog post, I’ll describe an upcoming feature on TripAdvisor that uses Natural Language Processing (NLP) to find groups of hotels that share an interesting theme. It is a semi-automated process that can be scaled across many cities, and there is some interesting technology behind it.

For New York City, we might show a list of collections like this…



We have collections for “Times Square Views”, “Trendy Soho”, “Catch a Show”, and “Art Deco Classic”, among others. Each of those is a group of about 10 hotels that fit the theme. How can we automate finding these interesting groups, even for cities we’re not familiar with?

Topic Modeling is Boring

The most textbook approach might be to build a topic model based on the review text for the hotels. I first tried a Latent Dirichlet Allocation (LDA) model. It assumes that each review is a mixture of a number of latent topics. You can represent each review as a vector with a “bag of words” representation of the review text.


I tried building an LDA model, but got very boring results. Here are some of the first topics for New York, which are a mix of general hotel-related topics.

    0.018 breakfast + 0.015 nice + 0.015 coffee + 0.011 small + 0.009 staff + 0.008 location + 0.008 great + 0.008 clean +
    0.032 staff + 0.020 great + 0.016 location + 0.014 stayed + 0.013 friendly + 0.012 helpful + 0.011 new + 0.010 york +
    0.015 location + 0.015 great + 0.014 square + 0.013 times + 0.010 walk + 0.010 subway + 0.009 clean + 0.008 around + 0.007
    0.010 new + 0.007 distrikt + 0.007 great + 0.006 york + 0.006 breakfast + 0.006 city + 0.006 nyc + 0.005 service + 0.005
    0.012 staff + 0.009 yotel + 0.009 desk + 0.008 time + 0.007 front + 0.006 arrived + 0.006 made + 0.006 service + 0.005
    0.023 new + 0.019 breakfast + 0.018 york + 0.013 square + 0.011 time + 0.011 empire + 0.010 state + 0.010 building
    0.019 great + 0.017 staff + 0.016 bar + 0.012 breakfast + 0.012 location + 0.011 wine + 0.011 new + 0.009 york + 0.009
    0.019 great + 0.019 park + 0.018 square + 0.017 location + 0.016 times + 0.016 central + 0.014 staff + 0.012 walking
    0.025 air + 0.010 conditioning + 0.009 staff + 0.008 night + 0.007 service + 0.007 stayed + 0.006 smoking + 0.006 new

Single words just don’t have the richness to represent interesting topics. What about ‘breakfast’? Was it good, bad, free, or wonderful? Most hotel reviews talk about the room, the location, the staff, and more all at once, and these topics show that mixing. But we want to focus on a single notable aspect of a hotel, not summarize it completely.

Phrases as raw material

Since single words aren’t very expressive, I wanted to find some good phrases in our review text. What is a phrase, exactly? In this case, we mean an n-gram of words that has a specific semantic meaning. We tried the approach from the paper “Mining Quality Phrases from Massive Text Corpora”, by Liu et al. at the University of Illinois and Microsoft (some interesting slides). Because they released an open source implementation of their SegPhrase algorithm, we were able to try it out.

It is a multi-step process, but the output is a large set of candidate phrases, each with a score for how likely it is to be a real phrase. We dubbed this score the “phrasiness” metric. So what are the strongest phrases in New York?

    club_quarters           0.999620288
    hampton_inn             0.999542304
    rush_hour               0.999526829
    frosted_glass           0.999506234
    ritz_carlton            0.999476254
    usa_today               0.999473892
    jersey_boys             0.999458328
    holiday_inn_express     0.999450456
    art_deco                0.9994495
    gordon_ramsay           0.999448261
    battery_park            0.999418922
    grand_central_station   0.999410719
    naked_cowboy            0.999401511
    yankee_stadium          0.999390047
    penn_station            0.999386755
    columbus_circle         0.999381838
    charlie_chaplin         0.999381761
    scrambled_eggs          0.999379073

    jet_lag                 0.999370422
    affinia_dumont          0.999364144
    harry_potter            0.999357816
    les_halles              0.999352377
    air_conditioning        0.999346666
    mamma_mia               0.999345891
    hudson_river            0.999345247
    pinot_noir              0.999344796
    woody_allen             0.999337025
    fairy_tale              0.999306646
    grand_central           0.999304571
    radio_city_music_hall   0.999301883

The number shown is the phrasiness score. All of these are clear semantic concepts, not just sequences of words that frequently co-occur.

This is a list of “phrases” with much lower “phrasiness” scores:

    the_staff_were_great        0.116338786
    located_on_the_ground_floor 0.116338612
    the_rooms_are_very_small    0.116337216
    requests_were_honored       0.116334616
    an_older_crowd              0.11633381
    booked_our_return           0.116329291
    quite_rudely                0.116321934
    and_pedro                   0.116312719
    helped_ourselves            0.11630484
    after_spending_over         0.116302541
    my_husband's_birthday       0.116302346
    being_slightly              0.116297622
    no_inconvenience            0.116296059
    far_greater                 0.116295743
    longer_period_of_time       0.116291693

You can imagine that these phrases frequently occur in reviews, but don’t really represent a single concept.

We used a threshold of 0.5 for phrasiness, which reduced the approximately 250 thousand potential phrases for New York to around 35 thousand. While people often talk about air_conditioning in a review, and it is clearly a phrase, that’s not the kind of aspirational topic that people will want to browse through. Importantly, there are plenty of phrases that have more interesting emotional content, like art_deco and radio_city_music_hall. So we have the raw material we need to find collections. In fact, we may have too much raw material, as there are tens of thousands of good phrases just for New York hotels.

word2vec Party Tricks

We have a bunch of phrases, and many of them are very similar. These phrases …

    amazing_view_of_central_park
    amazing_views_of_central_park
    beautiful_view_of_central_park
    beautiful_views_of_central_park
    central_park_views
    fantastic_view_of_central_park
    fantastic_views_of_central_park
    great_view_of_central_park
    great_view_over_central_park
    great_views_of_central_park
    nice_view_of_central_park
    spectacular_view_of_central_park
    wonderful_views_of_central_park

seem like a great basis for making a hotel collection. But they are mixed in with tens of thousands of other phrases. It would be very helpful to cluster the related phrases into groups. But what features should we use to do the clustering? We need a similarity metric that puts these phrases close to each other.

The word2vec algorithm from Mikolov and other Google researchers is perfect for this case. Given a corpus of text, it maps each word into a numerical vector of, say, 100 dimensions. The mapping is done in a way that puts words that are used similarly in the corpus near each other in the vector space.

The papers on word2vec give lots of party tricks you can do by adding and subtracting the vectors. For example, vector(‘Paris’) – vector(‘France’) + vector(‘Italy’) results in a vector that is very close to vector(‘Rome’), and vector(‘king’) – vector(‘man’) + vector(‘woman’) is close to vector(‘queen’).

I took all 9 billion words of review text at TripAdvisor, and built a word2vec model with around 1 million unique words. I used the C implementation of word2vec, which can be easily run from the command line.

Here are some example words, and their nearest neighbors:

    great
        terrific        0.932
        fantastic       0.908
        fabulous        0.876
        wonderful       0.861
        brilliant       0.852
        fab             0.849
        superb          0.818
        good            0.787
        lovely          0.782
        phenomenal      0.773
    noisy
        noisey          0.934
        loud            0.880
        nosiy           0.748


The numbers in the table are the cosine similarity of the top word with each of its near neighbors. We can see that word2vec is doing a nice job of mapping related words to be near each other.
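To make the cosine similarity scores and the vector arithmetic concrete, here is a small sketch with hand-made vectors. The two-dimensional vectors, the vocabulary, and the helper names cosine and nearest are hypothetical, chosen only so the arithmetic is easy to follow; a trained word2vec model would supply learned vectors with far more dimensions.

```python
import numpy as np

# Hypothetical 2-d toy vectors (axes loosely "royalty" and "gender").
# These are NOT trained word2vec vectors.
vectors = {
    "king":  np.array([1.0, 1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0, 1.0]),
    "woman": np.array([0.0, -1.0]),
    "apple": np.array([-1.0, 0.0]),
}


def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def nearest(target, exclude=(), top_n=3):
    """Rank vocabulary words by cosine similarity to the target vector."""
    scored = [(w, cosine(target, v)) for w, v in vectors.items() if w not in exclude]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


# vector('king') - vector('man') + vector('woman')
analogy = vectors["king"] - vectors["man"] + vectors["woman"]
best = nearest(analogy, exclude=("king", "man", "woman"))
print(best)
```

With these toy vectors the top-ranked word is queen, because the arithmetic lands exactly on its vector.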

For our short phrases, we just added the individual word vectors to create a phrase vector, ignoring stop words. So now we have a numerical vector of features for each phrase, and we can use any clustering technique.
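The phrase-vector step can be sketched in a few lines. The toy word vectors, the stop-word list, and the function name phrase_vector are assumptions for illustration. The post does not say which stop-word list was used, nor how out-of-vocabulary words were handled, so this sketch simply skips them.

```python
import numpy as np

# Hypothetical toy word vectors; in practice these come from the word2vec model.
word_vectors = {
    "great":   np.array([0.9, 0.1, 0.0]),
    "view":    np.array([0.1, 0.8, 0.2]),
    "central": np.array([0.0, 0.3, 0.9]),
    "park":    np.array([0.1, 0.2, 0.8]),
}

# Assumed stop-word list (not specified in the post).
STOP_WORDS = {"of", "the", "a", "an", "and", "to", "in"}


def phrase_vector(phrase):
    """Sum the vectors of the non-stop words in an underscore-joined phrase."""
    words = [w for w in phrase.split("_") if w not in STOP_WORDS]
    # Assumption: words missing from the model are skipped.
    known = [word_vectors[w] for w in words if w in word_vectors]
    if not known:
        raise ValueError(f"no known words in phrase: {phrase}")
    return np.sum(known, axis=0)


vec = phrase_vector("great_view_of_central_park")
print(vec)
```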

With word2vec, people usually use the cosine similarity, where a larger value is better. However, in clustering we usually work with a distance, where smaller is better. If we normalize our word2vec vectors so they have unit norm, then we can just minimize the Euclidean distance, as this is equivalent to maximizing the cosine similarity (see the Properties section here for the simple derivation). So going forward, we will use the Euclidean distance between two normalized word2vec phrase vectors as our distance metric.
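The equivalence follows from expanding the squared distance between unit vectors: ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a·b = 2 - 2 cos(a, b). A quick numerical check, using random vectors as stand-ins for phrase vectors (an assumption for illustration only):

```python
import numpy as np

# Random stand-ins for two 100-dimensional phrase vectors.
rng = np.random.default_rng(0)
a = rng.normal(size=100)
b = rng.normal(size=100)

# Normalize to unit length.
a = a / np.linalg.norm(a)
b = b / np.linalg.norm(b)

cos_sim = float(np.dot(a, b))           # cosine similarity of unit vectors
sq_dist = float(np.sum((a - b) ** 2))   # squared Euclidean distance

# For unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b)
print(sq_dist, 2 - 2 * cos_sim)
```

Because the squared distance is a decreasing function of the cosine similarity, the pair with the smallest distance is also the pair with the largest similarity.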

    noisy (continued)
        noicy           0.729
        distracting     0.726
        busy            0.711
        crowded         0.705
        annoying        0.696
        disruptive      0.693
        rowdy           0.685
    rude
        unfriendly      0.908
        unhelpful       0.898
        arrogant        0.882
        impolite        0.876
        unprofessional  0.872
        surly           0.864
        dismissive      0.849
        abrupt          0.843
        ignorant        0.840
        hostile         0.826
    bed
        beds            0.814
        mattress        0.780
        sofabed         0.717
        couch           0.709
        matress         0.708
        duvet           0.690
        bedroom         0.683
        futon           0.653
        beds            0.652
        room            0.650

Clustering phrases into tight topics

Usually, in a clustering problem we are interested in all of the clusters. Here, we have a slightly different situation. We want to group similar keywords into very focused groups, like our “view of central park” phrases, above. A small fraction of the clusters, say 100 or so, will be the basis for our collections. The rest of the clusters will be focused on more boring topics like air conditioning, and we’ll just ignore those.

What clustering technique will do a nice job of giving us those tight clusters? The default k-means approach with a fairly small k probably isn’t going to work too well. It will produce broad groups, and concentrate on getting each phrase into a good cluster, even though we don’t really care about most of the phrases.

I chose the agglomerative clustering routine in the Python sklearn library to do the clustering. The “complete linkage” option will try the hardest to keep each cluster pure.

Agglomerative clustering starts with each phrase in its own cluster, so we have N clusters of size 1. It then considers combining each pair of current clusters into a new one. With complete linkage, the score of a combined cluster is the maximum of the distances between every pair of phrases in it. The two clusters with the smallest maximum score after combination are actually combined, leaving us with N-1 clusters. This process is repeated until we have 2 clusters left.
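Here is a minimal sketch of complete-linkage agglomerative clustering with sklearn. The five points are hypothetical unit-length stand-ins for normalized phrase vectors, and the parameter values are illustrative rather than the settings used for the real data.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Hypothetical unit vectors given as angles on the unit circle:
# two tight pairs (0/5 degrees, 90/95 degrees) and one point at 180 degrees.
angles = np.deg2rad([0, 5, 90, 95, 180])
X = np.column_stack([np.cos(angles), np.sin(angles)])

# Complete linkage scores a merge by the largest pairwise distance,
# which favors compact clusters. Merging stops at n_clusters.
model = AgglomerativeClustering(n_clusters=2, linkage="complete")
labels = model.fit_predict(X)

print(labels)           # cluster id for each point
print(model.children_)  # the pair of nodes merged at each step
```

On this toy data the pair at 0 and 5 degrees ends up in one cluster and the other three points in the second, because joining the two tight pairs would create a larger maximum distance than joining the 90/95 pair with the point at 180 degrees.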

The overall combination process forms a tree of nodes. At internal nodes, a cluster is composed of the phrases that are leaves above the node in the tree. If we pick a threshold score, we can prune off subtrees where every cluster has a score better than the threshold. We only need to make the agglomerative tree once, and then we can quickly experiment with different thresholds.
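One way to build the tree once and then try several thresholds is SciPy's hierarchical clustering routines. This is a swapped-in library for illustration (the post used sklearn), and the toy points, the threshold values, and the helper name cut are assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Hypothetical unit vectors: two tight pairs and one point on its own.
angles = np.deg2rad([0, 5, 90, 95, 180])
X = np.column_stack([np.cos(angles), np.sin(angles)])

# Build the complete-linkage merge tree a single time.
Z = linkage(X, method="complete", metric="euclidean")


def cut(threshold):
    """Flat clusters whose complete-linkage score is at most the threshold."""
    return fcluster(Z, t=threshold, criterion="distance")


tight = cut(0.2)    # only very close phrases are grouped
loose = cut(1.45)   # a looser threshold merges more of the tree

print(tight)
print(loose)
```

Cutting the same tree at 0.2 keeps only the two tight pairs together and leaves the last point alone, while cutting at 1.45 gives two broader clusters; no re-clustering is needed between the two calls.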

Let’s look at an example for our New York City hotel data. I chose 18 phrases, such that they formed some natural clusters. There are 4 keywords about the view, 6 about being comfortable, 2 about the wi-fi, 1 about breakfast, 3 about friendly staff, and 2 about feeling safe. With luck, the phrases should be clustered into these natural clusters at some point. Here is the clustering, with the natural clusters shown as different colors.


The numbers shown in internal nodes give the order of the merges, where smaller numbers were merged first, and hence are more similar. The leaf nodes were given numbers 0 to 17, and so internal node 18 was the first combined cluster, with comfortable_beds_and_pillows and comfy_bed_and_pillows. The next node, 19, combined friendly_accommodating_staff and friendly_assistance, etc.

In general, the clustering did an excellent job on the test case. A phrase like sooooo_comfy is the least like the other phrases talking about comfort, so it goes in last among the red group.

With the right threshold, we can get natural clusters by themselves. In practice, we can get many useful clusters to use for collections.

We can also go back and find the hotels that frequently mention the phrases in a cluster. These hotels would be the members of the collection formed by a given cluster.
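A simple way to picture this step is to count, per hotel, how often its reviews mention any phrase in the cluster and keep the top hotels. The post does not describe the actual scoring, so the raw-count ranking, the hotel ids, the mention data, and the function name rank_hotels below are all hypothetical.

```python
from collections import Counter

# Phrases taken from one cluster.
cluster_phrases = {"broadway_shows", "radio_city_music_hall", "theater_shows"}

# Hypothetical phrase mentions extracted from each hotel's reviews.
hotel_mentions = {
    "hotel_a": ["broadway_shows", "broadway_shows", "scrambled_eggs", "theater_shows"],
    "hotel_b": ["air_conditioning", "jet_lag"],
    "hotel_c": ["radio_city_music_hall", "air_conditioning"],
}


def rank_hotels(cluster, mentions, top_n=10):
    """Rank hotels by how many of their phrase mentions fall in the cluster."""
    scores = Counter()
    for hotel, phrases in mentions.items():
        hits = sum(1 for phrase in phrases if phrase in cluster)
        if hits:
            scores[hotel] = hits
    return scores.most_common(top_n)


ranking = rank_hotels(cluster_phrases, hotel_mentions)
print(ranking)
```

The default of 10 mirrors the collection size of about 10 hotels mentioned earlier. A real system would likely normalize by the number of reviews so that large hotels do not dominate.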

The human curation post-process


Up to this point we have had a totally automated process that uses phrase detection, word2vec features, and agglomerative clustering to find clusters of phrases. The remaining step is to find the interesting ones. At this point we rely on human curation to pick out the interesting clusters and give them a snappy name.

The manual process is fairly easy. We provide a list of phrases for a cluster, the hotels that are the most related to the cluster, and even some example sentences from the review text for those hotels. It is fairly easy to look at that information and determine if it would be a good way to explore the hotels of a city. It is hard to mathematically define what is interesting, but easy for a human to know it when they see it. The human also comes up with a clever name, which is also simple given the list of phrases.

Some interesting collections

The “Catch a Show” collection has phrases like this:

    at_radio_city_music_hall
    b'way_shows
    beacon_theater
    beacon_theatre
    broadway_dance_center
    broadway_play
    broadway_plays
    broadway_shows
    broadway_shows_and_great_restaurants
    broadway_shows_and_restaurants
    comedy_shows
    david_letterman_show
    easy_walk_to_broadway_shows
    evening_entertainment
    great_shows
    radio_city_hall
    radio_city_music
    radio_city_music_hall
    radio_city_music_hall_and
    theater_shows
    theatre_shows
    walking_distance_to_broadway_shows
    walking_distance_to_broadway_theaters
    walking_distance_to_shows
    walking_distance_to_theatre

My personal favorite when I’m in New York, the “Near The High Line” collection has:

    chelsea_market_and_high_line
    chelsea_market_and_the_highline
    high_line
    high_line_park
    highline_park

    highline_walk
    highline_walkway
    the_high_line_park

The whole process provides insight into a particular city, picking out interesting neighborhoods, features of the hotels, and nearby attractions: all the things someone staying in the hotel might be interested in. Different cities will result in different collections.

Once the user has some ideas, they can focus in on a few hotels, and see if they would be a great place to stay. It wouldn’t be possible without the insights that Natural Language Processing provides.

One response to “Using NLP to Find “Interesting” Collections of Hotels”

Arnab Dutta says:
November 3, 2015 at 11:22 pm

Nice post! Came across this post as I was searching for a POC for a large micro-blogging site. Just to add, did you consider adding Semantic Markup/Annotations to the concepts/entities extracted using other information sources?