


Information Technology & Tourism (2018) 20:37–58
https://doi.org/10.1007/s40558-018-0121-z


ORIGINAL RESEARCH

A study on online travel reviews through intelligent data analysis

Michela Fazzolari1  · Marinella Petrocchi1

Received: 29 November 2017 / Revised: 24 July 2018 / Accepted: 22 August 2018 / Published online: 29 August 2018 © Springer-Verlag GmbH Germany, part of Springer Nature 2018

Abstract

The purpose of this paper is to show the application of a set of intelligent data analysis techniques to about 7 million online travel reviews, with the aim of automatically extracting useful information. The reviews, collected from two popular online tourism-related review platforms, are all those posted by reviewers about one specific Italian location, from 2010 to 2017. To carry out the study, the following methodology is applied: a preliminary statistical analysis is performed to acquire general knowledge about the datasets, such as the geographical distribution of reviewers, their activities, and a comparison between the time of visits and the average scores of the reviews. Then, Natural Language Processing techniques are applied to extract and compare the most frequent words used in the two platforms. Finally, an Association Rule Learning algorithm is applied to extract preferred destinations for distinct groups of reviewers, by mining interesting associations between the countries of origin of the reviewers and the most frequent destinations visited. By processing the available data, it is possible to automatically disclose valuable information for consumers and providers. The information automatically extracted can be exploited, for example, to build a recommender system for customers or a market analysis tool for service providers.

Keywords Online travel reviews · Frequent itemsets · Reviewers activities · Recurrent destinations · Text mining · Association rule mining

The research leading to these results has received funding from the regional project Review-Land (reviewland.projects.iit.cnr.it), co-funded by Fondazione Cassa di Risparmio di Lucca, Lucca, Italy and IIT-CNR, Pisa, Italy.

* Michela Fazzolari [email protected]

Marinella Petrocchi [email protected]

1 Institute of Informatics and Telematics, National Research Council, Pisa, Italy


1 Introduction

Social media, forums, and blogs are privileged vehicles for posting and spreading online reviews (Litvin et al. 2008; Xiang and Gretzel 2010). The goods and services discussed every day on the Internet belong to the most disparate categories, such as food, clothes, music, toys, hotels, and restaurants. According to surveys available online, a positive (or negative) review about a product can be as effective as a recommendation by a friend.¹ As a natural consequence of the high resonance produced by online reviews, the impact on businesses is more than significant: even sporadic negative reviews can instill an overall bad feeling about a specific product or company, whereas positive opinions usually bring a number of powerful benefits, such as an improvement in search engines’ ranking, a stronger perception of trust, and increased sales (Raguseo and Vitari 2017; Chong et al. 2017; Phillips et al. 2017).

In particular, travel reviews have been deeply investigated in recent years. These reviews should support customers in making travel decisions (Gretzel and Yoo 2008) and service providers in adjusting their businesses. Nevertheless, the available information is often overwhelming, and it becomes unfeasible for consumers and providers to examine all reviews one by one. This contribution applies a set of intelligent data analysis techniques to a large quantity of online reviews, collected from two popular e-advice websites, namely Booking² and TripAdvisor,³ which have hosted users’ opinions for decades.

The aim of this study is to extract useful information that is originally implicit in review data, with the main purpose of supporting both providers and potential customers, helping the former to adapt their services and the latter to improve their decision processes.

The analyses focus on a specific Italian territory, the province of Lucca, popular for its tourist attractions, local food and folk events, but they can easily be extended and generalized to other locations. The series of diverse analyses aims at providing fine-grained insights as well as improving the decision processes of consumers and providers. In fact, the analyses take into account: (1) the reviewers’ activities, (2) the kind and frequency of review terms, and (3) the preferred destinations of reviewers.

The reviews of hotels, restaurants and attractions of the territory under investigation are collected and analyzed, and different techniques inherited from the wide field of intelligent data analysis are exploited. Overall, this work deals with more than 7 million reviews and more than 150,000 reviewers, crawled by purpose-built software.

In particular, the following methodology has been applied:

¹ See, e.g., https://www.brightlocal.com/learn/local-consumer-review-survey/.
² https://www.booking.com.
³ https://www.tripadvisor.com.


• A preliminary statistical analysis of the available data has been carried out to obtain general knowledge, such as the geographical distribution of reviewers, their activities, and a comparison between the time of visits and the average scores.

• Then, Natural Language Processing techniques have been applied to extract and compare the most frequent words in the review texts of the two datasets.

• Finally, an algorithm for mining association rules has been applied to detect sets of destinations that often appear together. The underlying idea is to extract preferred destinations for distinct groups of reviewers, according to their nationalities, both within the territory under investigation and around the world.

The main findings of the analyses are: (1) a characterization of the reviewers’ activities in terms of the number of their reviews: a small number of reviewers post a high number of reviews, and vice versa; (2) over time, the number of reviews and the average numerical score are inversely correlated: the higher the number of reviews, the lower the average score. This indicates that, during periods of high tourist inflow, overall user satisfaction decreases (leading to further insights on the accommodation capacity of the locations under investigation); (3) for the two platforms under investigation, i.e., Booking and TripAdvisor, the review sets are comparable, both in terms of content and average scores; (4) by mining the locations reviewed by users over a 1-year period, it was possible to reconstruct the users’ tours, both around the world and within the specific territory under investigation: the analysis highlights that the reviewers’ geographical origins influence their travel destinations and tours.

With respect to related work, this paper offers a detailed picture of travel-related data and data sources. More than one decade of data, posted by users all over the world, has been analyzed. From a practical point of view, the extracted information could be further exploited by both visitors and providers:

• Visitors: a Recommender System could be designed, based, e.g., on the extracted information about recurrent travel tours (the output of the analysis in Sect. 4.3), with the aim of suggesting similar routes to potential customers whose characteristics are similar to those of the reviewers who took the recurrent tours.

• Providers: a Market Analysis Tool could be constructed, which automatically analyzes the information extracted in all the phases of the methodology and suggests marketing strategies to providers, such as tuning their offers according to the geographical origin of the visitors (Sect. 4.1) or re-interpreting the overall score of their business based on the total number of reviews (see Sect. 4.1), to cite just a few application examples.

Overall, we show that applying intelligent data analysis techniques to the Social Web, with a focus on online tourism-related review platforms, proves to be an effective way to perform multi-dimensional automated analyses that ease the human exploration and interpretation of reviews. In this work, the running scenario revolves around an Italian tourist hub; nevertheless, the same investigations could easily be generalized and carried out on different locations, with diverse degrees of granularity.

The paper is structured as follows. The next section presents a literature review in the area of online reviews analysis. Section 3.1 introduces the datasets used in this study. Section 3.2 describes the kind of analyses performed and the methodology used. In Sect. 4, the results obtained are reported and an interpretation of the findings is given. Finally, Sect. 5 concludes the paper and gives directions for future work.

2 Literature review

Given that potential travellers tend to trust statements of other travellers more than the advertising of tourism providers, social media “represents an important platform for electronic commerce and has one of the most metamorphic impacts on business” (Akman and Mishra 2017). Thus, for about a decade, the so-called electronic Word-of-Mouth has had a strong impact on customers’ booking processes, as well as on their booking decisions (Xiang and Gretzel 2010; Gretzel and Yoo 2008; Litvin et al. 2008; Sparks and Browning 2011).

Information on tourism transactions, customers’ behaviour and the tourism market structure is available online. Thus, several investigations have focused on analyzing such data. In the following, we provide a literature review of recent work dealing with the extraction of useful information from online reviews.

2.1 Valuable information for businesses

Investigating behavioural patterns and preferences of users through data available on e-commerce and tourism-related platforms could provide valuable information for businesses in establishing marketing strategies and offering added value to their customers. For example, Amaro and Duarte (2017) investigate how the origin of travellers influences their use of social media for planning a trip and highlight that travel marketers can use this knowledge to adapt social media strategies according to the origin of customers. For a number of years, extensive and nationally representative surveys have been carried out, “to evaluate the specific aspects of ratings information that affect people attitudes toward e-commerce”. This is the case, e.g., of the work in Flanagin et al. (2014), which highlights how people, while taking into account the average rating of a product, still pay little attention to the number of reviews leading to that average. Consumer perception has also been investigated in Krawczyk and Xiang (2016), with respect to the lodging industry. Here the authors apply a text analysis approach to create perceptual maps from the most frequent terms used in a dataset related to an online travel agency. These maps are useful for producing insights on brands. Given that online opinions have the potential to transform the way of doing business, the work in Yang et al. (2016) analyzes the motivations of different user types (buyers vs. sellers) in sharing online information and comments, and the impact of the kind of sharing on e-commerce. Similarly, the work in Ye et al. (2009, 2011) finds relationships between sales and bookings, while in Pantano et al. (2017) the authors address the prediction of future preferences of tourists. Furthermore, work such as Wang et al. (2017) highlights the need for a better presentation of online reviews, to create more comfortable shopping experiences and reduce the risks that consumers perceive in approaching social shopping.

2.2 Valuable information for customers

Usually, travel-related reviews consider hotels, restaurants, and tourism attractions. From a customer perspective, the work in Zhou et al. (2017) considers TripAdvisor reviews to implement a travel planning tool, useful to travellers as a decision support system. Online reviews have also been analyzed in Rossetti et al. (2016), where the authors apply the topic model method to process textual reviews, with the aim of supporting decisions and providing recommendations to tourists. Aspect-opinion mining has also been investigated in relation to online reviews. In Fang et al. (2015), the authors address the problem of multimodal aspect-opinion mining and consider user-generated photos and textual documents, to capture correlations between aspects and opinions. As more and more users describe their travel experiences on travel websites, a great number of online reviews are generated daily. Therefore, it becomes hard for users to identify helpful reviews in a reasonable time. To address this issue, different approaches have been proposed. Summarization systems extract the most representative expressions on product features. In Hu et al. (2017), the authors propose a novel multi-text summarization technique for identifying the top-k most informative sentences of hotel reviews, by also considering critical factors such as author credibility and conflicting opinions. The helpfulness of online reviews has been investigated too. In fact, predicting the helpfulness of reviews allows users to focus only on the prominent ones, thus saving time by concentrating on a subset of them. The work in Chen et al. (2016) predicts review helpfulness by relying on the information embedded in the review text. Usually, the implicit assumption when studying review helpfulness is that reviews are independent of one another. The work in Zhou and Guo (2017) studies the impact of the order of reviews on their helpfulness, by analyzing restaurant reviews collected from yelp.com, and finds that the order of a review negatively relates to its helpfulness.

2.3 Intelligent data analysis applied to travelling and tourism activities

Business intelligence aims at automatically extracting valuable information from different data sources and analyzing it by means of complex data mining methods, such as machine learning and statistical approaches. The new knowledge gained can then be used as input to decision support or intelligent and adaptive systems (like recommender systems). Several intelligent data analysis techniques have been investigated in the literature, both supervised and unsupervised. In particular, many contributions exploit Natural Language Processing (NLP) to automatically analyze the text of online reviews (Alaei et al. 2018). The work in García-Pablos et al. (2016) presents an NLP platform based on a modular architecture. The platform is used to process textual content from online reviews and extract valuable information from it. The authors use a set of manually annotated hotel reviews for the training and the evaluation of the system. Another contribution is in Menner et al. (2016), where the authors describe approaches to extract relevant topics from touristic reviews. The work adopts different data mining techniques and compares and evaluates them according to their accuracy level. Berezina et al. (2016) present a text-mining approach to detect hotel characteristics that appear in reviews of satisfied and dissatisfied customers. The results of the study highlight both theoretical and managerial implications, helping to understand the points of view of satisfied and dissatisfied customers.

In recent years, big data analytics techniques have been applied to process online reviews. For example, the research in Salehan and Kim (2016) uses a sentiment analysis approach for big data analytics, aiming at investigating how the sentiment polarity of online consumer reviews affects their readership and helpfulness. The proposed method can be adopted by online providers to develop automated systems for sorting and classifying big volumes of online reviews.

Among data mining methods, Association Rule Mining (ARM) algorithms have been effectively used to extract interesting patterns from tourism data.

For example, the authors in Versichele et al. (2014) apply an ARM algorithm to Bluetooth tracking data of tourists in Ghent, Belgium. The aim is to mine interesting patterns that combine the visits to different attractions. A further contribution is described in Aghdam et al. (2014), where several knowledge discovery techniques are applied to analyze tourists’ behavioral patterns in a Malaysian city. A quantitative analysis is performed by applying clustering and association rule mining techniques, while a qualitative analysis is performed by using the Nvivo software. The aim is to discover hidden knowledge, to suggest appropriate places to visit according to the tourist profile. A similar approach is presented in Qi and Wong (2015), where the authors show the use of data mining methods in tourism studies. In fact, they apply an ARM algorithm to user-generated data on TripAdvisor about Macau, with the aim of understanding what cultured tourists like most. Thus, the Apriori algorithm is used to predict tourists’ preferences for the different local attractions and to define the profiles of cultured and non-cultured tourists.

3 Methodology

3.1 Datasets

The data used in this work have been collected from two tourism-related review platforms, Booking and TripAdvisor. To collect the data, the available APIs have been used when possible; otherwise, ad-hoc software has been developed to crawl the needed information, having asked the platforms’ customer service for permission to use the collected data for scientific purposes only. The analyses are mainly focused on the TripAdvisor dataset, whereas the Booking dataset is considered for comparison purposes.


It is worth noting that TripAdvisor gives some advice before reviewers post a review: the last time we checked, when writing the review title, the following suggestion appeared: “Summarize your visit or highlight an interesting detail”. Then, when compiling the text of the review, the following statement was shown to the reviewer: “By sharing your experience, you are helping travelers make better choices and plan their dream trips. Thank you!” Although such statements could somehow influence the content of a review, they may vary, at least occasionally. Thus, it was impossible to perform an experimental evaluation of the supposed degree of influence.

Remarkably, when writing a review, Booking does not provide any advice or suggestions.

Finally, for the sake of completeness, we remark that any registered user can write a review for a facility appearing on TripAdvisor, while this is not true for Booking, where only verified guests can write a review. Verified guests are those who book the facility through the platform itself, and Booking asks them to write the review only after their stay.

TripAdvisor: Data were collected in two different ways: web scraping and API access. The data include all the reviews available on TripAdvisor for the Lucca territory, up to January 2017. The web scraping process was performed by a Python script that navigated through the data available on the Province of Lucca web page,⁴ collecting all the reviews on Hotels, Restaurants, and Attractions (aka Things To Do). Overall, 309,988 reviews have been obtained, divided into 71,365 reviews about 1081 hotels, 204,308 reviews about 1965 restaurants and 34,315 reviews about 487 attractions, posted by 155,018 reviewers. Metadata, such as the language of the review, were obtained through the available APIs. The reviewers’ profiles have also been stored, which include, when available, the age and country of origin of the reviewer. Finally, for each reviewer, all the reviews posted all over the world have been collected, for a total of 6,949,809 reviews. This last group of reviews has been used for the analyses described in Sect. 4.3. The characteristics of the TripAdvisor dataset are summarized in Table 1, which reports the kind and number of reviews and the analyses in which they have been used (see Sect. 3.2).

Table 1 TripAdvisor dataset characteristics

Subject                                         No. of reviews   Type of analyses
Lucca and province (all languages)              309,988          a.1, a.2, a.3
Lucca and province hotels (all languages)       71,365           a.4
Lucca and province hotels (English language)    31,732           b.1
All over the world (all languages)              6,949,809        c.1, c.2

⁴ https://www.tripadvisor.com/Tourism-g187898-Lucca_Province_of_Lucca_Tuscany-Vacations.html.

Booking: This dataset includes reviews of hotels located within the province of Lucca. The reviews were downloaded during August 2016, using a web scraper. The dataset also includes reviews about two extremely popular tourism cities, i.e., Paris and New York. This dataset has been used to perform some comparative analyses about the hotels of different cities and includes reviews in all languages. Nevertheless, for the analyses that involve the application of NLP techniques, only reviews in English have been considered, mainly due to the adoption of existing NLP tools specialized for the English language. For each review, the data exploited are the score, the text, the review date, and the hotel the review refers to. A summary of the characteristics of the dataset is in Table 2 (this table also reports the kind and number of reviews and the analyses in which they have been used).

Review date and visit date: On Booking, reviews are associated with the date on which the review was posted (the review date). The system prevents users from posting reviews later than a fixed deadline (roughly, one month from the time of visit). On TripAdvisor, each review is associated with two dates: the review date and the date of the visit (visit date). The former is always available, with daily granularity, while the latter is present in 92% of cases, with monthly granularity (i.e., the exact day of the visit is not available).

Table 2 Booking dataset characteristics

Subject                                         No. of reviews   Type of analyses
Lucca and province hotels (all languages)       45,798           a.4
Lucca and province hotels (English language)    8262             b.1, b.2
New York hotels (English language)              182,438          b.2
Paris hotels (English language)                 93,164           b.2

Fig. 1 TripAdvisor: number of days elapsed between visit date and review date

To assess the time elapsed between a visit and the review submission, for each review on TripAdvisor we compute the time delta, i.e., the difference between the review date and the visit date. The visit date is approximated with the first day of the month. We then draw the density histogram of the time delta together with a fitted Kernel Density Estimation (KDE), restricting the time delta to at most 500 days (see Fig. 1a). We also draw the Cumulative Distribution Function, as shown in Fig. 1b. These results highlight that, in 50% of cases, a review is submitted within 23 days from the visit time, in 75% of cases within 34 days, and in 80% of cases within 40 days.

Thus, in the following, for those analyses involving TripAdvisor only, we adopt the review date (thus approximating the visit date with roughly monthly granularity). For those analyses involving both platforms, we also consider the review date (which is the only one available on Booking).
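The time delta computation described above can be sketched as follows. This is only an illustrative sketch, not the software used in the study: the function names, the sample records, and the way the 500-day limit is applied (dropping larger values before computing fractions) are assumptions made here for the example.

```python
from datetime import date

MAX_DAYS = 500  # upper bound on the time delta, as in the text


def time_delta(review_date, visit_year, visit_month):
    """Days between the visit and the review.

    The visit date is approximated with the first day of the visit month,
    since only month and year of the visit are available.
    """
    return (review_date - date(visit_year, visit_month, 1)).days


def fraction_within(deltas, days, max_days=MAX_DAYS):
    """Empirical CDF value: share of reviews submitted within `days` days,
    computed only over time deltas not exceeding `max_days` (assumption)."""
    kept = [d for d in deltas if d <= max_days]
    if not kept:
        return 0.0
    return sum(1 for d in kept if d <= days) / len(kept)


# Hypothetical records: (review date, visit year, visit month)
sample = [
    (date(2016, 8, 20), 2016, 8),
    (date(2016, 9, 5), 2016, 8),
    (date(2016, 7, 11), 2016, 7),
    (date(2018, 7, 1), 2016, 7),  # beyond the 500-day limit, ignored below
]

deltas = [time_delta(*record) for record in sample]
share_23 = fraction_within(deltas, 23)
print(deltas, share_23)
```

Evaluating `fraction_within` at several thresholds yields the points of an empirical Cumulative Distribution Function like the one mentioned for Fig. 1b.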

3.2 Data analysis

Hereafter, we show how the datasets described in Sect. 3.1 have been processed to perform several analyses.

(a) At first, a set of statistical analyses has been carried out to describe the nature of the data under examination. Some analyses aim at describing the reviewers’ characteristics, while others focus on trends over time. In particular, we have investigated:

1. The geographical distribution of the reviewers of the province of Lucca: for each reviewer, the country of origin is extracted from the user profile, when available.

2. The activity of reviewers who visited the province of Lucca: we analyze the number of reviews written by each reviewer and investigate whether the empirical data follow a specific probability distribution.

3. The trend over time of the two following indexes: the number of reviews and the average score, for hotels, restaurants and attractions. To this aim, reviews have been grouped on a monthly basis, and for each month we compute the number of reviews and the average score.

4. A comparison between the reviews belonging to different platforms, but sharing the same domain, the province of Lucca. Thus, we select the same temporal period and analyze how the average score varies over time for the two datasets. For this analysis, we only consider hotel reviews, since the Booking dataset includes only this kind of business.
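As an illustration of analyses a.2 and a.3, the sketch below counts how many reviewers wrote a given number of reviews and groups reviews by month to obtain the number of reviews and the average score. The record layout, the sample values, and the use of the Pearson coefficient to quantify the relation between the two monthly indexes are assumptions introduced for this example; the text above does not prescribe them.

```python
from collections import Counter, defaultdict
from math import sqrt

# Hypothetical records: (reviewer id, month of the review date, score)
records = [
    ("u1", "2016-07", 4), ("u1", "2016-08", 3), ("u1", "2016-08", 4),
    ("u2", "2016-07", 5), ("u3", "2016-08", 3), ("u4", "2016-09", 5),
]


def activity_histogram(rows):
    """Map 'number of reviews written' to 'number of reviewers with that count'."""
    per_reviewer = Counter(reviewer for reviewer, _, _ in rows)
    return Counter(per_reviewer.values())


def monthly_trend(rows):
    """Return {month: (number of reviews, average score)}, sorted by month."""
    scores = defaultdict(list)
    for _, month, score in rows:
        scores[month].append(score)
    return {m: (len(s), sum(s) / len(s)) for m, s in sorted(scores.items())}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (dev_x * dev_y)


histogram = activity_histogram(records)
trend = monthly_trend(records)
counts = [n for n, _ in trend.values()]
averages = [avg for _, avg in trend.values()]
r = pearson(counts, averages)
print(histogram, trend, r)
```

On this toy sample the coefficient is negative, i.e., months with more reviews have a lower average score; real data would of course need to be checked rather than assumed to behave this way.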

(b) The second group of analyses involves the application of Natural Language Processing techniques. For the sake of simplicity, we focus on the English language, and therefore we consider only reviews written in English. Nevertheless, similar analyses can be performed for other languages. To compute the word frequency in a set of reviews, we first remove punctuation marks, non-text characters and special symbols, so that each review is represented as a sequence of words (tokenization). Then, we remove stop-words, i.e., words that provide little or no useful information to the text analysis and can hence be considered as noise; common stop-words include articles, conjunctions, prepositions, pronouns, and also words that appear very often either in sentences of the considered language (language-specific stop-words) or in the analyzed context (domain-specific stop-words). Finally, we apply a stemming process, to reduce each token (i.e., word) to its stem or root form, by removing its suffix. Thus, words having closely related semantics are grouped. We use the Scikit-learn library (Pedregosa et al. 2011) to tokenize, and the NLTK library (Bird et al. 2009) for both stop-word listing and stemming (Snowball package). We define the frequency of a word in a set of reviews as the number of occurrences of its stem divided by the total number of occurrences of all the stems. The analyses on the two platforms aim at comparing the most frequent words extracted from the review texts, focusing on hotels. In particular, we show:

1. A comparison between the TripAdvisor and Booking datasets, when considering the same territory, i.e., the Lucca province.

2. A comparison among three different cities, famous for their tourist attractions and facilities, i.e., Paris, New York, and Lucca. While acknowledging that the worldwide popularity of Paris and New York is not even remotely comparable to that of Lucca, the motivation for this analysis is to see whether the most frequent terms extracted are shared, or not, among different locations. If so, a behavioral similarity, at least at the review content level, could be argued for users reviewing different locations.
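The word frequency definition given in (b) can be sketched as follows. To keep the example dependency-free, the Scikit-learn tokenizer and the NLTK stop-word list and Snowball stemmer mentioned in the text are replaced here by stand-ins: a regular-expression tokenizer, a tiny hand-written stop-word set, and a crude suffix stripper. These stand-ins, the function names, and the sample reviews are assumptions for illustration only and do not reproduce the behaviour of the actual libraries.

```python
import re
from collections import Counter

# Stand-in stop-word list (assumption; the study uses the NLTK list)
STOP_WORDS = {"the", "a", "an", "and", "was", "were", "is", "in", "of", "to"}

# Stand-in for a real stemmer: strip the first matching suffix (assumption)
SUFFIXES = ("ing", "ed", "ly", "es", "s")


def tokenize(text):
    """Lowercase the text and keep alphabetic runs only."""
    return re.findall(r"[a-z]+", text.lower())


def stem(token):
    """Remove one known suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def stem_frequencies(reviews):
    """Frequency of a stem = its occurrences / occurrences of all stems."""
    counts = Counter(
        stem(token)
        for review in reviews
        for token in tokenize(review)
        if token not in STOP_WORDS
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {s: c / total for s, c in counts.items()}


# Hypothetical review texts
sample_reviews = [
    "The rooms were clean and the staff was friendly.",
    "Clean room, friendly staff, great location!",
]

frequencies = stem_frequencies(sample_reviews)
print(sorted(frequencies.items(), key=lambda item: -item[1]))
```

Computing such a table separately for each dataset (for instance, one per platform or per city) and comparing the top-ranked stems corresponds to the comparisons listed in points 1 and 2 above.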

(c) The last set of analyses involves the application of an Association Rule Mining algorithm, i.e., the Apriori algorithm (Agrawal et al. 1993). The analyses focus only on the TripAdvisor dataset and consider not only the reviews related to Lucca and its territory, but also reviews related to other places, written by all reviewers who have reviewed Lucca at least once. The province of Lucca has a surface of 1,773.22 square kilometers. Within the province, four geographical zones exist: ‘Lucchesia’, which includes downtown and the immediate surroundings, ‘Versilia’, the part of the province on the coast, ‘Garfagnana’, the part of the province which is mainly mountainous, and ‘Valle del Serchio’, a flat countryside which progressively becomes mountainous. Such zones are comparable in territorial extension (Valle del Serchio and Garfagnana are a bit larger than the other two zones). Apart from Lucca, there exist other municipalities, like Capannori, Pietrasanta, Camaiore, Forte dei Marmi, Montecarlo and Viareggio, which are smaller than Lucca, but feature various tourism attractions (mainly, art and culture for Pietrasanta, the sea for Viareggio and Forte dei Marmi, and the villas and countryside for Capannori, Camaiore, and Montecarlo), which make them attractive for visitors. The aim of this study is to identify recurrent travelling preferences of Lucca visitors, mainly according to their nationality. Specifically, a two-level analysis has been performed:

1. Identification of preferred destinations within the province of Lucca over a three-month period, to discover which are the preferred local destinations during a short stay in the Lucca province.

2. Identification of preferred destinations throughout the world, over a 1-year period, to find out which are the most visited countries over a mid-term period.

The methodology includes the following steps: (a) for each reviewer, we consider a review posted at time t and we look for her reviews in t ± Δt. (b) We construct a pattern containing the reviewer id and the places she reviewed in the period t ± Δt. (c) We infer the most recurrent places visited over the same period by applying the Apriori algorithm to mine frequent itemsets, i.e., to find sets of places that appear together in many patterns. An itemset is considered frequent when the fraction of patterns containing it (its support) reaches an empirically chosen minSup threshold.
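Steps (a) to (c) can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the implementation used in the study: reviews are reduced to hypothetical `(reviewer, place, date)` tuples, only reviews of a given `anchor` place are used as the reference time t, the reviewer id is used solely to collapse identical patterns, and the data are invented.

```python
from datetime import date, timedelta
from itertools import combinations

def build_patterns(reviews, anchor, delta):
    """Steps (a)-(b): for every review of `anchor` posted at time t, collect the
    places the same reviewer reviewed within t ± delta."""
    patterns = set()
    for reviewer, place, t in reviews:
        if place != anchor:
            continue
        places = frozenset(p for r, p, d in reviews
                           if r == reviewer and abs(d - t) <= delta)
        patterns.add((reviewer, places))  # identical patterns collapse
    return [places for _, places in patterns]

def apriori(patterns, min_sup):
    """Step (c): return {itemset: support} for itemsets with support >= min_sup."""
    n = len(patterns)
    level = {frozenset([item]) for p in patterns for item in p}
    frequent, k = {}, 1
    while level:
        current = {}
        for cand in level:
            sup = sum(cand <= p for p in patterns) / n
            if sup >= min_sup:
                current[cand] = sup
        frequent.update(current)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-itemsets.
        joined = {a | b for a, b in combinations(current, 2) if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        level = {c for c in joined
                 if all(frozenset(s) in current for s in combinations(c, k - 1))}
    return frequent

# Invented toy data, for illustration only.
reviews = [
    ("u1", "Lucca", date(2016, 7, 1)),
    ("u1", "Viareggio", date(2016, 7, 20)),
    ("u1", "Paris", date(2017, 3, 1)),
    ("u2", "Lucca", date(2016, 8, 5)),
]
patterns = build_patterns(reviews, anchor="Lucca", delta=timedelta(days=45))
frequent = apriori(patterns, min_sup=0.5)
```

The prune step is what distinguishes Apriori from brute-force enumeration: a candidate itemset is counted only if all of its smaller subsets were already found frequent.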

4 Findings

4.1 Datasets characterization

In this section, we report the results of statistical analyses. Figure 2 shows the worldwide distribution of the TripAdvisor reviewers who left at least one review about facilities in the Province of Lucca, over the period from 08-2002 to 01-2017. The colours in the figure are proportional to the density of the users coming from the corresponding country. As expected, the majority of visitors are Italian, followed by British and American visitors. Even if most of the tourists come from neighbouring countries, the map highlights that, among extra-European countries, most tourists come from Russia, Australia, and Canada.

Fig. 2 TripAdvisor: geographical distribution of reviewers

The second analysis concerns the number of reviews per reviewer. The graphs in Fig. 3a show how many reviewers posted a certain number of reviews for the Lucca territory. We investigate whether the empirical data follow a known probability distribution by applying the powerlaw Python package (Alstott et al. 2014). The empirical distribution is compared with the exponential, power law, truncated power law and lognormal distributions. We first compare the power law distribution with the exponential one and, as expected (Muchnik et al. 2013), the former fits better, with a p value equal to 6.8e−12. The truncated power law distribution fits slightly better than the lognormal one, which, in turn, fits better than the power law one. The truncated power law and lognormal distributions are thus the ones that best fit our empirical data (Fig. 3b). This result highlights that a small number of reviewers post a high number of reviews, whereas most reviewers post only a few.
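To make the comparison between candidate distributions more concrete, the sketch below computes a log-likelihood ratio between a power law and an exponential fit. It is an illustrative approximation only: the study uses the powerlaw package, whereas this code treats the data as continuous, takes `xmin` as given, uses closed-form maximum likelihood estimates, and runs on synthetic data generated here purely for demonstration.

```python
import math
import random

def compare_power_law_vs_exponential(xs, xmin):
    """Return (alpha, R, p). R > 0 favours the power law over the exponential;
    p is the significance of the sign of R (normal approximation)."""
    xs = [x for x in xs if x >= xmin]
    n = len(xs)
    # MLE of the exponent of a continuous power law on x >= xmin.
    alpha = 1.0 + n / sum(math.log(x / xmin) for x in xs)
    # MLE of the rate of an exponential shifted to start at xmin.
    lam = 1.0 / (sum(xs) / n - xmin)
    diffs = []
    for x in xs:
        ll_power = math.log((alpha - 1) / xmin) - alpha * math.log(x / xmin)
        ll_expon = math.log(lam) - lam * (x - xmin)
        diffs.append(ll_power - ll_expon)
    R = sum(diffs)
    mean = R / n
    var = sum((d - mean) ** 2 for d in diffs) / n
    p = math.erfc(abs(R) / math.sqrt(2 * n * var))
    return alpha, R, p

# Synthetic heavy-tailed sample (true exponent 2.5), for illustration only.
rng = random.Random(42)
sample = [(1 - rng.random()) ** (-1 / 1.5) for _ in range(5000)]
alpha, R, p = compare_power_law_vs_exponential(sample, xmin=1.0)
```

A positive ratio R indicates that the power law explains the sample better than the exponential, which mirrors the first comparison reported in the text.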

The third analysis shows the trends, over a period of 6 years (2011 to the beginning of 2017), of the number of reviews and the average score. Figure 4a reports the trends of the number of reviews for hotels, restaurants and attractions. The graph clearly shows a cyclic pattern: the maximum number of reviews for each year corresponds to summer, while, for the last 3 years (2014–2016), there are also three local maxima corresponding to the Christmas periods (a vertical dotted line indicates the end of the year). These peaks are less visible than the summer ones, but they can clearly be inferred from the numerical data.

Then, we plot the trends of the number of reviews and the average score for hotels and restaurants (Fig. 4b, c, respectively). Interestingly, a positive peak in the number of reviews corresponds to a negative peak in the average score (for attractions, this relation is not evident).

Finally, we compare the average scores of hotel reviews over the same period of time on the two platforms, namely TripAdvisor and Booking. The scores are normalized to the interval [0, 1], since the Booking scores range over [2.5, 10], while the TripAdvisor scores range over [1, 5]. The analysis considers all the hotels reviewed in the explored data (i.e., including hotels that appear in only one dataset). As shown in Fig. 5, the average score of TripAdvisor hotel reviews varies from a minimum of 0.811 to a maximum of 0.877, while for Booking the range is [0.806, 0.885].
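The normalization step can be illustrated with a short sketch. The text does not spell out the rescaling formula, so a plain min-max rescaling is assumed here; the function name `normalize` and the sample scores are hypothetical.

```python
def normalize(score, lo, hi):
    """Min-max rescaling of `score` from [lo, hi] to [0, 1] (assumed formula)."""
    if not lo <= score <= hi:
        raise ValueError("score outside the platform range")
    return (score - lo) / (hi - lo)

# Score ranges as stated in the text.
BOOKING_RANGE = (2.5, 10.0)
TRIPADVISOR_RANGE = (1.0, 5.0)

# Invented example scores.
booking = normalize(8.5, *BOOKING_RANGE)
tripadvisor = normalize(4.5, *TRIPADVISOR_RANGE)
```

Under this assumption, the lowest possible score on either platform maps to 0 and the highest to 1, which makes the two averages directly comparable.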

Summarizing, we showed the geographical distribution of people reviewing facilities for a certain locality. Such a simple analysis gives an idea, at a glance, of the origin of the visitors who most often travel to a particular venue. Also, we found that the majority of reviewers tend to post few reviews, while a few reviewers are very active. Interestingly, we showed that, for hotels and restaurants, peaks in the number of reviews tend to coincide with dips in the conveyed appreciation (in terms of average scores). Finally, we pointed out that reviews of hotels in the same locality, posted on different platforms, have similar average scores. The last outcome could reflect a sort of consistency between the reviewers of the two platforms in assigning scores.

Fig. 3 TripAdvisor dataset: comparison with statistical distributions

Fig. 4 TripAdvisor, Lucca: reviews over time and comparison with average score

Fig. 5 Average scores: Booking vs. TripAdvisor, Province of Lucca

4.2 Content analysis

Here, we show the result of the analyses performed on the review text, by applying NLP techniques.

The first analysis is a comparison between the TripAdvisor and Booking datasets, considering the same domain, i.e., the Lucca territory. To this aim, all the available hotels in the two datasets, and all the associated reviews in English have been selected. From such reviews, the most frequent terms are extracted.

The wordclouds in Fig. 6 identify the 20 most frequent words extracted from the two datasets. The size of a word is proportional to its frequency. 80% of the terms are shared, because the two datasets cover the same domain (but not necessarily the same hotels). The same results are shown in Table 3, which reports the percentage of occurrences of each term in the corresponding dataset.

We run a similar comparison only for the Booking dataset, by analyzing the most frequent terms that appear in reviews of hotels located in three different touristic cities, namely Lucca, New York and Paris. While acknowledging that the latter two cities are undoubtedly more popular than the former, the rationale behind this analysis is to examine whether reviewers tend to report on the same content, with a comparable frequency. We extract three sets of stems, each containing the ten most frequent stems for one of these cities. Table 4 shows the union of the three sets (turning back to words instead of stems), reporting, for each term, the rank, the number of times that the word appears in all the reviews of the single city, and the frequency as defined above. The most frequent words considerably overlap: 11 words (in bold) are within the first 15 positions in all the sets (the ranking is obtained with respect to the occurrence value).
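The construction of such a union table can be sketched as follows. The function name `union_of_top_terms` and the per-city counts are invented for illustration (with k = 2 instead of 10 to keep the toy example small); the real analysis works on stems extracted from the Booking reviews.

```python
from collections import Counter

def union_of_top_terms(counts_by_city, k=10):
    """For each term in the union of the per-city top-k sets, report
    (rank, occurrences, frequency) in every city."""
    top = {city: {t for t, _ in c.most_common(k)}
           for city, c in counts_by_city.items()}
    union = set().union(*top.values())
    table = {}
    for term in union:
        row = {}
        for city, c in counts_by_city.items():
            ranking = [t for t, _ in c.most_common()]  # sorted by occurrences
            total = sum(c.values())
            rank = ranking.index(term) + 1 if term in c else None
            row[city] = (rank, c[term], c[term] / total)
        table[term] = row
    return table

# Invented stem counts, for illustration only.
counts_by_city = {
    "Lucca": Counter({"room": 5, "location": 4, "breakfast": 3, "staff": 1}),
    "Paris": Counter({"room": 6, "staff": 5, "location": 3, "breakfast": 1}),
}
table = union_of_top_terms(counts_by_city, k=2)
```

A term such as "staff" enters the union because it is in the top k of one city, and the table then also exposes its (lower) rank in the other city, which is how terms ranked well below tenth place can appear in Table 4.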

Fig. 6 Hotels in Lucca: the 20 most frequent terms in TripAdvisor and Booking

Table 3 Hotels in Lucca: comparison between frequent terms in TripAdvisor and Booking

TripAdvisor               Booking
Term        Percentage    Term        Percentage
Hotel       0.025         Room        0.025
Room        0.023         Location    0.020
Stay        0.018         Breakfast   0.017
Breakfast   0.013         Staff       0.015
Lucca       0.011         Good        0.013
Good        0.010         Help        0.012
Staff       0.009         Great       0.012
Great       0.009         Lucca       0.010
Us          0.009         Friend      0.009
Love        0.008         Nice        0.009
Place       0.008         Stay        0.009
Night       0.008         Hotel       0.008
Help        0.008         Noth        0.008
Would       0.008         Clean       0.008
Location    0.008         Love        0.008
One         0.007         Park        0.007
Day         0.007         Us          0.007
Clean       0.007         Walk        0.006
Friend      0.007         City        0.006
Nice        0.007         Comfort     0.006

The three city subsets are highly imbalanced, since the Booking dataset includes 8262 English reviews for Lucca, 182,438 English reviews for New York and 93,164 English reviews for Paris. Thus, we repeat the same analysis on a balanced dataset, with 8262 reviews for each city. The results are similar to the ones obtained for the imbalanced dataset. Table 4 shows the results over the original datasets.
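A balanced dataset of this kind can be obtained, for instance, by downsampling each city to the size of the smallest one. The text does not say how the 8262 reviews per city were selected, so uniform random sampling without replacement is assumed in this sketch; the function name `balance` and the placeholder data are hypothetical.

```python
import random

def balance(reviews_by_city, seed=0):
    """Downsample every city to the size of the smallest one
    (uniform sampling without replacement, an assumption)."""
    rng = random.Random(seed)
    n = min(len(reviews) for reviews in reviews_by_city.values())
    return {city: rng.sample(reviews, n)
            for city, reviews in reviews_by_city.items()}

# Placeholder review ids, for illustration only.
reviews_by_city = {
    "Lucca": ["l%d" % i for i in range(3)],
    "New York": ["n%d" % i for i in range(10)],
    "Paris": ["p%d" % i for i in range(7)],
}
balanced = balance(reviews_by_city)
```

Fixing the seed keeps the subsample reproducible, so that the balanced and imbalanced rankings can be compared on the same draw.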

4.3 Travelling preferences

In this section we report the results of the analyses performed to identify the preferred destinations among visitors who posted at least one review for the Lucca territory. We perform a two-level analysis, by extracting their preferred destinations (1) within the province of Lucca, over a three-month period; (2) throughout the world, over a 1-year period.

4.3.1 Province of Lucca

For each of the reviews posted by a reviewer about a venue in the municipality of Lucca, we consider all the reviews written by that reviewer in the same, previous, and following month (collapsing duplicate reviews).

Table 5(a) shows the itemsets representing the most recurrent municipalities with places reviewed in the above described reviews. Due to page limits, we only show frequent itemsets with support > 0.015 that include the municipality of Lucca as an item. In 10% of cases, the reviewers who visited Lucca also visited Viareggio, a seaside municipality. A similar analysis is performed on the geographical areas of the province of Lucca: Lucchesia (including the city center), Versilia (the seaside zone), Garfagnana (the upper valley of the river Serchio), and Valle Serchio (the lower valley of the river Serchio). Table 5(b) (minSup = 0.03) shows that people visiting Lucchesia also tend to visit Versilia (16% of cases), while only a small share of the tourists who visit Lucca also visit the other zones of the province.

Table 4 Booking: union of the ten most frequent words in 3 cities

            Lucca                   Paris                   New York
Term        Rank  Occurr.  Freq.    Rank  Occurr.  Freq.    Rank  Occurr.  Freq.
Room        1     2979     0.025    1     72,787   0.041    1     150,119  0.041
Location    2     2393     0.020    2     51,742   0.029    2     105,237  0.029
Breakfast   3     2054     0.017    9     19,671   0.011    9     32,384   0.009
Staff       4     1761     0.015    3     41,650   0.023    3     75,115   0.021
Good        5     1550     0.013    5     28,575   0.016    6     38,689   0.011
Help        6     1412     0.012    8     20,329   0.011    13    30,231   0.008
Great       7     1397     0.012    11    18,139   0.010    5     47,970   0.013
Friend      9     1092     0.009    10    18,922   0.011    11    30,685   0.008
Nice        10    1081     0.009    13    15,086   0.009    15    25,627   0.007
Hotel       12    1016     0.009    4     37,830   0.021    4     68,529   0.019
Clean       14    929      0.008    6     22,225   0.013    7     37,615   0.010
Bed         22    702      0.006    14    13,845   0.008    8     35,418   0.010
Small       41    460      0.004    7     20,387   0.011    12    30,466   0.008
Time        53    381      0.003    35    6532     0.004    10    32,304   0.009

4.3.2 Worldwide destinations

Starting from people reviewing the province of Lucca, we identify recurrent destinations throughout the world. We set Δt equal to 6 months, thus considering an overall temporal window of 1 year, and minSup = 0.01. An excerpt of the results is reported in Table 6, showing itemsets with support > 0.025 that include the Italy item.

The countries most reviewed by visitors of Lucca are France (almost 12%) and Spain (10%), followed by United Kingdom and USA (9% and 7%, respectively). These results are consistent with the geographical distribution in Sect. 4.1, even if France and Spain received more reviews over a 1-year period.

Table 5 Recurrent municipalities and geographical zones: province of Lucca

(a) Municipalities
Supp.   Itemsets
0.101   Lucca, Viareggio
0.051   Lucca, Camaiore
0.046   Lucca, Pietrasanta
0.039   Lucca, Capannori
0.033   Lucca, Forte dei Marmi
0.027   Lucca, Camaiore, Viareggio
0.023   Lucca, Pietrasanta, Viareggio
0.021   Lucca, Montecarlo
0.018   Lucca, Camaiore, Pietrasanta
0.017   Lucca, Forte dei Marmi, Viareggio

(b) Geographical zones
Supp.   Itemsets
1       Lucchesia
0.16    Lucchesia, Versilia
0.04    Lucchesia, Valle Serchio
0.03    Lucchesia, Garfagnana

Then, we group reviewers by their nationality, focusing on Italian, French, British and American reviewers: 39,090 Italian, 4593 British, 3864 American and 1659 French reviewers posted at least one review for Lucca. We run the analysis for each nationality, with minSup = 0.05 and Δt = 6 months. For each nationality, the first 10 rows of the obtained results are shown in Table 7(a), (b), (c), (d) (we only show itemsets containing the “Italy” item). As expected, reviewers tend to post more reviews for places located in their own countries. It is thus possible to highlight the preferred destinations for each nationality of reviewers. Notably, here we consider reviewers who posted at least one review for Lucca, but the analysis can be easily extended to a wider dataset. From our results, the preferred destinations for Italian reviewers are France and Spain, followed by United Kingdom and Greece. Moreover, in 44% of cases a reviewer posted a review both for places in Spain and France. The lower part of Table 7(a) shows that combinations of the aforementioned countries are often visited during the same year, while we notice the absence of extra-European countries in this excerpt. Similar analyses are run for other nationalities, as shown in Table 7(b), (c), (d). France, Spain and United Kingdom are generally the preferred destinations. Nevertheless, since the cut-off of the support is lower, we notice that French and British reviewers also posted reviews for places in the US. Furthermore, American travellers are also interested in the Northern part of Europe (Germany and Netherlands).
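The per-nationality analysis amounts to computing the support of an itemset separately within each group of reviewers. The following sketch illustrates this grouping step only; the function name `support_by_group` and the toy patterns are invented, and a full analysis would run the frequent itemset mining on each group rather than test a single itemset.

```python
from collections import defaultdict

def support_by_group(patterns, itemset):
    """patterns: iterable of (nationality, set_of_countries).
    Returns the support of `itemset` within each nationality."""
    groups = defaultdict(list)
    for nationality, countries in patterns:
        groups[nationality].append(set(countries))
    target = frozenset(itemset)
    return {nationality: sum(target <= p for p in group) / len(group)
            for nationality, group in groups.items()}

# Invented patterns, for illustration only.
patterns = [
    ("Italian", {"Italy", "France"}),
    ("Italian", {"Italy", "Spain", "France"}),
    ("Italian", {"Italy"}),
    ("Italian", {"Italy", "Greece"}),
    ("British", {"Italy", "UK"}),
    ("British", {"Italy", "UK", "France"}),
]
supports = support_by_group(patterns, {"Italy", "France"})
```

Because the support is normalized by the size of each group, values for nationalities with very different numbers of reviewers remain comparable.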

5 Conclusions, limitations and future work

The impact of online reviews becomes more and more relevant in the era of social media, both for businesses and consumers. Travelers rely on online reviews to make decisions about trip planning, whereas businesses take advantage of them to establish effective marketing strategies. Nevertheless, the great amount of available data makes it unfeasible to analyze all the available reviews one by one. Thus, in recent years, several efforts have been made to propose methods that automatically analyze and summarize the features of reviews.

Past contributions present the application of automated techniques to acquire new useful knowledge from tourism-related data. This work is in line with past research, and shows how different analyses, performed on two travel-related review datasets, lead to enriched information on the reviewers' activities, on the kind and frequencies of the review terms, and on the preferred destinations of reviewers, according to their country of origin. Summarizing, the contribution presented in this paper (1) is aligned with current trends of literature investigation, (2) offers diverse kinds of analyses, and (3) lays the basis for the design and development of fine-grained recommender systems.

Table 6 World: recurrent destinations of people reviewing Lucca places

Supp.   Itemsets
0.116   Italy, France
0.101   Italy, Spain
0.091   Italy, UK
0.073   Italy, USA
0.058   Italy, Germany
0.039   Italy, Greece
0.037   Italy, France, Spain
0.033   Italy, France, UK
0.032   Italy, Netherlands
0.032   Italy, Austria
0.028   Italy, UK, Spain
0.028   Italy, Switzerland
0.026   Italy, France, USA

Table 7 World: recurrent destinations, according to reviewer nationalities

(a) Italian
Supp.   Itemsets
0.65    Italy, France
0.53    Italy, Spain
0.47    Italy, UK
0.44    Italy, France, Spain
0.44    Italy, Greece
0.35    Italy, France, UK
0.35    Italy, Spain, UK
0.32    Italy, Spain, Greece
0.32    Italy, France, Greece
0.29    Italy, France, UK, Spain

(b) French
Supp.   Itemsets
1       Italy
0.87    Italy, France
0.24    Italy, UK
0.24    Italy, Spain
0.24    Italy, France, Spain
0.21    Italy, France, USA
0.21    Italy, France, UK
0.21    Italy, USA
0.13    Italy, Greece
0.11    Italy, Belgium, France

(c) British
Supp.   Itemsets
1.00    Italy
0.91    Italy, UK
0.44    Italy, France
0.44    Italy, Spain
0.42    Italy, France, UK
0.42    Italy, UK, Spain
0.28    Italy, USA
0.26    Italy, UK, USA
0.21    Italy, France, Spain
0.21    Italy, France, USA

(d) American
Supp.   Itemsets
1.00    Italy
0.89    Italy, USA
0.29    Italy, France, USA
0.25    Italy, Spain, USA
0.21    Italy, UK
0.21    Italy, UK, USA
0.14    Italy, Germany
0.14    Italy, Germany, USA
0.11    Italy, Netherlands
0.11    Italy, Mexico, USA

The analyses have been performed by applying several techniques coming from the field of intelligent data analysis, which include (a) an overall statistical study on the reviewers' activities, (b) a content analysis by applying NLP techniques on the subset of English reviews, mainly to compare two travel-related review platforms, and (c) the extraction of frequent patterns of destinations by exploiting an Association Rule Mining algorithm. As in Aghdam et al. (2014), the main advantage of using such an algorithm is the possibility to analyze the association between the characteristics of the reviewers and their traveling choices. This allows the exploration of different relationships in the data, which can be used as a reference during a decision-making process (Liao et al. 2010).

The outcome allowed for a better characterization of the visitors’ habits and preferences: from the time of the year of highest visitor influx (and the relative change in the visitors’ satisfaction) to frequent destination patterns common to groups of users. While our running scenario revolves around an Italian tourist hub, the same explorations could easily be carried out on different locations, with diverse degrees of granularity. For example, as future work, we aim at extending the application of the Association Rule Mining algorithm to consider not only the country of origin of reviewers, but also additional features, such as gender and age. In this case, particular attention should be paid to the number of features considered, since an excessive number of features could lead to the generation of inconsistent rules. Besides, additional challenges could limit the adoption of the algorithm in travel and tourism contexts (Li et al. 2010), such as infrequent itemsets, negative rules or the evaluation of candidate rules that are not targeted to the application. Moreover, it should be kept in mind that association rule mining is generally used as an exploratory algorithm to identify patterns and generate hypotheses, which, once generated, need to be tested (Olson and Shi 2007).

The contribution presented in this paper lays the basis for the design and development of a decision support system useful for consumers and providers. In particular, we are currently working on the development of a recommender system that can concurrently take into account all the information extracted by the application of the intelligent data analysis techniques. The main aim is to suggest routes to potential travelers whose characteristics are similar to those of reviewers who took recurrent tours. Another work in progress is the comparison of specific characteristics of locations, through, e.g., the analysis of the appreciation of the attractions and “things to do” expressed by the reviewers, about different towns in the same region (e.g., things to do in Lucca, Pisa, Florence, and Siena, all located in the Tuscany region). While our current NLP analyses deal only with the English language, a further contribution could take into account more languages, to verify whether the most frequent terms that appear in reviews are shared or whether, instead, reviewers writing in different languages appreciate different aspects of a certain facility and/or business.

Besides the challenges posed by relying on Association Rule Mining, additional limitations are mainly related to the fact that not all visitors post a review after their travels. Thus, the obtained results give an incomplete overview of the visitors’ traveling behaviour. Future work may consider the possibility of extracting user profiles from data, keeping in mind, however, that not all the possible traveler profiles are represented in the datasets. Another direction for future work is to incorporate such user profiles into a market analysis tool for providers, with the aim of defining marketing strategies tailored to specific groups of users.

References

Aghdam AR, Kamalpour M, Chen D, Sim ATH, Hee JM (2014) Identifying places of interest for tourists using knowledge discovery techniques. In: 2014 International conference on industrial automation, information and communications technology, pp 130–134. https://doi.org/10.1109/IAICT.2014.6922099

Agrawal R, Imieliński T, Swami A (1993) Mining association rules between sets of items in large databases. SIGMOD Rec 22(2):207–216. https://doi.org/10.1145/170036.170072

Akman I, Mishra A (2017) Factors influencing consumer intention in social commerce adoption. IT People 30(2):356–370. https://doi.org/10.1108/ITP-01-2016-0006

Alaei AR, Becken S, Stantic B (2018) Sentiment analysis in tourism: capitalizing on big data. J Travel Res. https://doi.org/10.1177/0047287517747753

Alstott J, Bullmore E, Plenz D (2014) Powerlaw: a Python package for analysis of heavy-tailed distributions. PLoS One 9(1):1–11. https://doi.org/10.1371/journal.pone.0085777

Amaro S, Duarte P (2017) Social media use for travel purposes: a cross cultural comparison between Portugal and the UK. Inf Technol Tour 17(2):161–181. https://doi.org/10.1007/s40558-017-0074-7

Berezina K, Bilgihan A, Cobanoglu C, Okumus F (2016) Understanding satisfied and dissatisfied hotel customers: text mining of online hotel reviews. J Hosp Mark Manag 25(1):1–24. https://doi.org/10.1080/19368623.2015.983631

Bird S, Klein E, Loper E (2009) Natural language processing with Python. O’Reilly Media, California

Chen J, Zhang C, Niu Z (2016) Identifying helpful online reviews with word embedding features. In: Knowledge science, engineering and management. Springer, New York, pp 123–133. https://doi.org/10.1007/978-3-319-47650-6_10

Chong AYL, Ch’ng E, Liu MJ, Li B (2017) Predicting consumer product demands via big data: the roles of online promotional marketing and online reviews. Int J Prod Res 55(17):5142–5156. https://doi.org/10.1080/00207543.2015.1066519

Fang Q, Xu C, Sang J, Hossain MS, Muhammad G (2015) Word-of-mouth understanding: entity-centric multimodal aspect-opinion mining in social media. IEEE Trans Multimed 17(12):2281–2296. https://doi.org/10.1109/TMM.2015.2491019

Flanagin A, Metzger M, Pure R, Markov A, Hartsell E (2014) Mitigating risk in e-commerce transactions: perceptions of information credibility and the role of user-generated ratings in product quality and purchase intention. Electron Commer Res 14(1):1–23. https://doi.org/10.1007/s10660-014-9139-2

García-Pablos A, Cuadros M, Linaza MT (2016) Automatic analysis of textual hotel reviews. Inf Technol Tour 16(1):45–69. https://doi.org/10.1007/s40558-015-0047-7

Gretzel U, Yoo KH (2008) Use and impact of online travel reviews. Springer, Vienna, pp 35–46. https://doi.org/10.1007/978-3-211-77280-5_4

Hu YH, Chen YL, Chou HL (2017) Opinion mining from online hotel reviews: a text summarization approach. Inf Process Manag 53(2):436–449. https://doi.org/10.1016/j.ipm.2016.12.002

Krawczyk M, Xiang Z (2016) Perceptual mapping of hotel brands using online reviews: a text analytics approach. Inf Technol Tour 16(1):23–43. https://doi.org/10.1007/s40558-015-0033-0

Li G, Law R, Rong J, Vu HQ (2010) Incorporating both positive and negative association rules into the analysis of outbound tourism in Hong Kong. J Travel Tour Mark 27(8):812–828. https://doi.org/10.1080/10548408.2010.527248

Liao SH, Chen YJ, Deng M (2010) Mining customer knowledge for tourism new product development and customer relationship management. Expert Syst Appl 37(6):4212–4223. https://doi.org/10.1016/j.eswa.2009.11.081

Litvin SW, Goldsmith RE, Pan B (2008) Electronic word-of-mouth in hospitality and tourism management. Tour Manag 29(3):458–468. https://doi.org/10.1016/j.tourman.2007.05.011

Menner T, Höpken W, Fuchs M, Lexhagen M (2016) Topic detection: identifying relevant topics in tourism reviews. In: Inversini A, Schegg R (eds) Information and communication technologies in tourism 2016. Springer International Publishing, Cham, pp 411–423

Muchnik L, Pei S, Parra LC, Reis SD, Andrade JS Jr, Havlin S, Makse HA (2013) Origins of power-law degree distribution in the heterogeneity of human activity in social networks. Sci Rep 3:1783

Olson D, Shi Y (2007) Introduction to business data mining. McGraw-Hill/Irwin series operations and decision sciences, McGraw Hill, New York. https://books.google.it/books?id=m_j4AAAACAAJ

Pantano E, Priporas CV, Stylos N (2017) You will like it! Using open data to predict tourists’ response to a tourist attraction. Tour Manag 60:430–438. https://doi.org/10.1016/j.tourman.2016.12.020

Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, Duchesnay E (2011) Scikit-learn: machine learning in Python. J Mach Learn Res 12:2825–2830. http://scikit-learn.org

Phillips P, Barnes S, Zigan K, Schegg R (2017) Understanding the impact of online reviews on hotel performance: an empirical analysis. J Travel Res 56(2):235–249. https://doi.org/10.1177/0047287516636481

Qi S, Wong CUI (2015) An application of apriori algorithm association rules mining to profiling the heritage visitors of Macau. In: Tussyadiah I, Inversini A (eds) Information and communication technologies in tourism 2015. Springer International Publishing, Cham, pp 139–151

Raguseo E, Vitari C (2017) The effect of brand on the impact of e-wom on hotels’ financial performance. Int J Electron Commer 21(2):249–269. https://doi.org/10.1080/10864415.2016.1234287

Rossetti M, Stella F, Zanker M (2016) Analyzing user reviews in tourism with topic models. Inf Technol Tour 16(1):5–21. https://doi.org/10.1007/s40558-015-0035-y

Salehan M, Kim DJ (2016) Predicting the performance of online consumer reviews: a sentiment mining approach to big data analytics. Decis Support Syst 81:30–40. https://doi.org/10.1016/j.dss.2015.10.006

Sparks BA, Browning V (2011) The impact of online reviews on hotel booking intentions and perception of trust. Tour Manag 32(6):1310–1323. https://doi.org/10.1016/j.tourman.2010.12.011

Versichele M, de Groote L, Bouuaert MC, Neutens T, Moerman I, de Weghe NV (2014) Pattern mining in tourist attraction visits through association rule learning on bluetooth tracking data: a case study of Ghent, Belgium. Tour Manag 44:67–81. https://doi.org/10.1016/j.tourman.2014.02.009

Wang Q, Wang L, Zhang X, Mao Y, Wang P (2017) The impact research of online reviews’ sentiment polarity presentation on consumer purchase decision. IT People 30(3):522–541. https://doi.org/10.1108/ITP-06-2014-0116

Xiang Z, Gretzel U (2010) Role of social media in online travel information search. Tour Manag 31(2):179–188. https://doi.org/10.1016/j.tourman.2009.02.016

Yang J, Sia C, Liu L, Chen H (2016) Sellers versus buyers: differences in user information sharing on social commerce sites. IT People 29(2):444–470. https://doi.org/10.1108/ITP-01-2015-0002

Ye Q, Law R, Gu B (2009) The impact of online user reviews on hotel room sales. Int J Hosp Manag 28(1):180–182. https://doi.org/10.1016/j.ijhm.2008.06.011

Ye Q, Law R, Gu B, Chen W (2011) The influence of user-generated content on traveler behavior: an empirical investigation on the effects of e-word-of-mouth to hotel online bookings. Comput Hum Behav 27(2):634–639. https://doi.org/10.1016/j.chb.2010.04.014

Zhou S, Guo B (2017) The order effect on online review helpfulness. Decis Support Syst 93(C):77–87. https://doi.org/10.1016/j.dss.2016.09.016

Zhou X, Wang M, Li D (2017) From stay to play–a travel planning tool based on crowdsourcing user-generated contents. Appl Geogr 78:1–11. https://doi.org/10.1016/j.apgeog.2016.10.002