
POLITECNICO DI MILANO
Master of Science in Computer Engineering

Department of Computer Engineering

REPUTATION ANALYSIS ON THE WEB 2.0

Supervisor: Prof. Lorenzo Cantoni
Master Thesis of: Sarp Erdag, matricola 722413

Academic Year: 2008 - 2009


Thanks to my family, especially my father, for supporting me through all aspects of my two-year graduate studies.

Special thanks to Davide Eynard and Alessandro Inversini, who patiently guided and advised me during the creation of this work.


Executive Summary

This report discusses and analyzes the technical side of the project “Web2rism”, which is being carried out by a team of researchers and developers at the “Webatelier” lab, part of the Università della Svizzera Italiana in Switzerland. The project is an e-tourism centric online reputation monitoring software whose aim is to measure the popularity of touristic destinations and services. This research within the project mainly focuses on the methodologies, tools, frameworks and custom applications used and designed within the context of the ORM software being developed.

The report begins with an introduction to what “ORM” is and why it is becoming an important issue businesses should consider. Later on, existing solutions to common ORM problems are addressed and compared. Drawing a conclusion from the best alternatives available, the topic is then connected to how we approached building an ORM application for the tourism industry. Finally, the ingredients for building up the model are presented, and our decisions and the details of how we proceeded with the implementation are explained.

The system architecture behind the project relies heavily on data mining and web scraping techniques, so we have tested a broad range of tools and frameworks to ease the operation. The project is also built as an advanced web application that uses semantic technologies and knowledge bases. Up to the current phase of the development process, I have been able to create six different scrapers that gather data from different social media and UGC sites, set up an RDF store for storing the collected data, and build an API over our knowledge base that does the necessary analysis and shaping of raw data before it is presented to the user. There is also a manager application that enables easy administration of the whole data gathering process. The thesis then discusses my implementation strategies and methodologies and presents the evaluation of the chosen and applied technologies.

As a final conclusion reached through the research studies and the completed phase of software development, we were satisfied with the system architecture and the model designed for an e-tourism centric ORM application. The final chapter of the report discusses the possible future steps that should be taken to bring the project to the next level.


Contents

Executive Summary

1 Introduction
  1.1 Objectives and motivations
  1.2 The Web2rism project
  1.3 My Participation
  1.4 Terms and Abbreviations
  1.5 Citations

2 Background
  2.1 Online Reputation Analysis & Management
    2.1.1 Short History of ORM
    2.1.2 Scope of ORM
    2.1.3 A Business Approach to Online Reputation Management and Monitoring
  2.2 Existing Solutions

3 My Approach
  3.1 The Positioning of Web2rism
  3.2 Reputation Analysis Methodology
    3.2.1 Content
    3.2.2 Sentiments
    3.2.3 Authorship
    3.2.4 Query expansion
    3.2.5 Location Factor
  3.3 Technical Methodology

4 Implementation
  4.1 Tools and Frameworks Used
    4.1.1 Data & Knowledgebase Tools
    4.1.2 Programming Languages and Web Frameworks
    4.1.3 APIs and Wrappers
    4.1.4 User Interface Design
    4.1.5 Sentiment Analysis
  4.2 Scrapers, Scraping and Parsing Techniques
    4.2.1 Web2rism Scrapers
    4.2.2 A Custom Scraper Example
  4.3 System Architecture
    4.3.1 The MVC Architecture and its Usage
    4.3.2 Layers and Components
    4.3.3 Project Files Organization
    4.3.4 System Workflow

5 Tests and Evaluations
  5.0.5 Functional Scrapers
  5.0.6 KnowledgeBase & Scraper Management
  5.0.7 System Performance

6 Conclusion
  6.1 Current Status of the Work
  6.2 Future Work
    6.2.1 Extending the Data Gathering Layer
    6.2.2 Optimization & Scalability Issues
    6.2.3 A Logging System
    6.2.4 Easier Interaction with the RDF Store
    6.2.5 More Organized Author and Location Classification
    6.2.6 UI Enhancements


List of Figures

2.1 Reputation Analysis Results of Perspective Hotel Singapore on Brand Karma
2.2 Nielsen’s approach for Advanced Buzz Management
2.3 Comparison of the best of existing ORM tools using Google Insights
3.1 Content Analysis for Reputation Measurement
4.1 System Architecture
4.2 ScraperManager application UML Diagram
4.3 Scraper Manager UI
4.4 An example of a SPARQL query used to get data from the KB
5.1 A view from Google BlogSearch Scraper’s findings
5.2 A view from Technorati Scraper’s findings
5.3 A view from Twitter Scraper’s findings
5.4 A view from Flickr Scraper’s findings
5.5 A view from YouTube Scraper’s findings
5.6 A view from WikiTravel Scraper’s findings
5.7 A view from a call to the KB API to get “Tweets” about Italy


Chapter 1

Introduction

This chapter acts as an introductory entry to the whole report. First, my motivation for being involved in the project related to my thesis is explained and its connection with my objectives is presented. Next, general information about the whole project and how I participated in it is expounded. Lastly, the structure of the report is given and several terms and acronyms that the reader may come across while reading the document are listed.

1.1 Objectives and motivations

During my academic years, in addition to computer and software engineering itself, I have always been interested in the human side of things. Software engineering and creating useful products is a team’s work. Although one can code and develop applications on one’s own, in order to perfect, market and monetize them, a group of people with different areas of expertise is required.

In this manner, I interpret the phrase “human side of things” as “communication between the stakeholders of a product being engineered”. I believe a good engineer has to have excellent communication skills and an entrepreneurial point of view to reach higher positions. On this basis, creating my own product, marketing it and managing the business around it has been my dream. Moreover, better and clearer communication is not only required in the development phase but also after the product is released. Especially with the emergence of Web 2.0 and social media, in the web applications sector it is now much more crucial to continuously be in contact with the users, customers and fans who are interacting with a software application.

During the last year of my studies, I started to grow an interest in online reputation management and analysis tools, which give us detailed feedback on what people are already saying about a specific brand in social media, blogs, comments, online communities and forums. I also had in-depth experience in web development and was in search of a big project that I could be a part of in both technical and product design terms.

As a result, I decided to do my thesis within the Web2rism project, currently being developed at the Webatelier lab at the Università della Svizzera Italiana. As can be deduced from its name, Web2rism (Web 2.0 + Tourism) is a project that aims to study the reputation of a destination starting from user generated content available online. The project requires deep research into how the data collected from the web can be turned into a reputation indicator. There is also the need for sentiment analysis methods to actually understand how people are talking about a certain entity over the web. In addition, the project team was using knowledge bases instead of traditional relational databases for Web2rism, so there is a strong connection to the Semantic Web within the project. All these aspects of the study have been extremely appealing to me and suited my areas of interest well.

1.2 The Web2rism project

Developed by the Webatelier lab at the Università della Svizzera Italiana (USI, Lugano, Switzerland) and funded by the Swiss Confederation’s innovation promotion agency CTI and PromAX, the Web2rism project aims to bring together the latest trends in tourism and touristic destination management on the world wide web.

Webatelier.net is a laboratory of the Faculty of Communication Sciences of the Università della Svizzera Italiana, directed by Prof. Lorenzo Cantoni. The lab deals with a broad range of topics related to new media in communication and is specialized in research and development in the field of online communication and ICT in general, stressing the “human side” of it.

The development of Web2rism started in January 2009 and, as a two-year project, its planned finish date is January 2011.


1.3 My Participation

The development of Web2rism has the characteristics and requirements of an enterprise-level project. Although it is being developed in an academic environment, because several companies and associations are stakeholders in the project, their needs and demands have to be analyzed carefully and converted into a solution that fulfills them. The Web2rism development team at Webatelier was divided into two groups: one that deals with communication with the stakeholders and develops a model to be converted into a software application, and another that actually develops the software.

In this manner, there has been a strong technical side to the whole project, and this is where my participation came in. The technical background of Web2rism consists of four main layers, and my contribution has been in the creation and use of the software tools used in the data gathering, storing and analysis parts.

All the technical aspects of the project, including the layers mentioned here, will be explained in detail in the upcoming ‘Implementation’ chapter.

1.4 Terms and Abbreviations

Before getting into deeper details about the project, it is important to clearly define some terms that will be used throughout this report.

Web scraping: Also known as Web harvesting or Web data extraction, scraping is a computer software technique for extracting information from websites. Usually, such software programs simulate human exploration of the Web by either implementing low-level Hypertext Transfer Protocol (HTTP) or embedding certain full-fledged Web browsers. Web scraping is closely related to Web indexing, which indexes Web content using a bot and is a universal technique adopted by most search engines. In contrast, Web scraping focuses more on the transformation of unstructured Web content, typically in HTML format, into structured data that can be stored and analyzed in a central local database or spreadsheet.

Social Media: Media designed to be disseminated through social interaction, created using highly accessible and scalable publishing techniques.


Social media supports the human need for social interaction, using Internet and Web-based technologies to transform broadcast media monologues (one to many) into social media dialogues (many to many). It supports the democratization of knowledge and information, transforming people from content consumers into content producers. Businesses also refer to social media as user-generated content (UGC) or consumer-generated media (CGM).

Wiki: A wiki is a website that allows the easy creation and editing of any number of interlinked Web pages, using a simplified markup language or a WYSIWYG text editor, within the browser.

Sentiment Analysis: Also known as opinion mining, sentiment analysis refers to a broad (definitionally challenged) area of natural language processing, computational linguistics and text mining. It aims to determine the attitude of a speaker or a writer with respect to some topic. The attitude may be their judgment or evaluation (see appraisal theory), their affective state (that is to say, the emotional state of the author when writing) or the intended emotional communication (that is to say, the emotional effect the author wishes to have on the reader).

CGM: Abbreviation for “Consumer Generated Media”.

JSON: Abbreviation for “JavaScript Object Notation”, a lightweight data-interchange format. It is easy for humans to read and write, and easy for machines to parse and generate. It is based on a subset of the JavaScript programming language, Standard ECMA-262 3rd Edition, December 1999. JSON is a text format that is completely language independent but uses conventions familiar to programmers of the C family of languages, including C, Java, JavaScript, Perl, Python, and many others. These properties make JSON an ideal data-interchange language.

Official Website: http://www.json.org

XML: Abbreviation for “Extensible Markup Language”. XML is a simple, very flexible text format. Originally designed to meet the challenges of large-scale electronic publishing, XML also plays an increasingly important role in the exchange of a wide variety of data on the Web and elsewhere. XML defines a set of rules for encoding documents electronically.


Official Website: http://www.w3.org/XML/

AJAX: Abbreviation for “Asynchronous JavaScript and XML”. Ajax is a group of interrelated web development techniques used on the client side to create interactive web applications or rich Internet applications. With AJAX, web applications can retrieve data from the server asynchronously in the background without interfering with the display and behavior of the existing page.

RSS: Abbreviation for “Really Simple Syndication”. RSS is a family of web feed formats used to publish frequently updated works, such as blog entries, news headlines, audio and video, in a standardized format.

RDF: Abbreviation for “Resource Description Framework”. It is a family of World Wide Web Consortium (W3C) specifications originally designed as a metadata data model. It has come to be used as a general method for the conceptual description or modeling of information that is implemented in web resources, using a variety of syntax formats.

Official Website: http://www.w3.org/RDF/

RDFS: Abbreviation for “RDF Schema”. RDFS (also abbreviated as RDF(S), RDF-S, or RDF/S) is an extensible knowledge representation language, providing basic elements for the description of ontologies, otherwise called Resource Description Framework vocabularies, intended to structure RDF resources.

OWL: Abbreviation for “Web Ontology Language”. OWL is a family of knowledge representation languages for authoring ontologies, and is endorsed by the World Wide Web Consortium.

DBMS: Abbreviation for “Database Management System”. A DBMS is a set of computer programs that controls the creation, maintenance and use of the database of an organization and its end users.

RDBMS: Abbreviation for “Relational Database Management System”. An RDBMS is a DBMS in which data is stored in the form of tables and the relationships among the data are also stored in the form of tables.

SQL: Abbreviation for “Structured Query Language”. SQL is a database computer language designed for managing data in relational database management systems.

SPARQL: An RDF query language; its name is a recursive acronym that stands for SPARQL Protocol and RDF Query Language. It was standardized by the RDF Data Access Working Group (DAWG) of the World Wide Web Consortium, and is considered a key Semantic Web technology. SPARQL allows a query to consist of triple patterns, conjunctions, disjunctions, and optional patterns.

API: Abbreviation for “Application Programming Interface”. An interface in computer science that defines the ways by which an application program may request services from libraries and/or operating systems. An API determines the vocabulary and calling conventions the programmer should employ to use the services.

URL: Abbreviation for “Uniform Resource Locator”. A URL is a subset of the Uniform Resource Identifier (URI) that specifies where an identified resource is available and the mechanism for retrieving it. In popular language, a URL is also referred to as a Web address.

SDK: Abbreviation for “Software Development Kit”. An SDK or “devkit” is typically a set of development tools that allows a software engineer to create applications for a certain software package, software framework, hardware platform or similar platform.

grep: Abbreviation for “Global Regular Expression Print”. grep is a command-line text search utility originally written for Unix.

LAMP: The acronym LAMP refers to a solution stack of software, usually free and open source, used to run dynamic Web sites or servers. The original expansion is as follows: Linux, the operating system; Apache, the Web server; MySQL, the database management system (or database server); plus one of several scripting languages: Perl, PHP or Python.

CRON: Cron is a time-based job scheduler in Unix-like computer operating systems. ‘Cron’ is short for Chronograph, and it enables users to schedule jobs (commands or shell scripts) to run automatically at a certain time or date. It is commonly used to perform system maintenance or administration.


CGI: Abbreviation for “Common Gateway Interface”, a standard protocol for interfacing external application software with an information server, commonly a web server.

MVC: Abbreviation for “Model View Controller”, an architectural pattern used in software engineering. The pattern isolates business logic from input and presentation, permitting independent development, testing and maintenance of each. Each model is associated with one or more views (projections) suitable for presentation (not necessarily visual presentation).

ORM (1): Abbreviation for “Object-relational mapping”. ORM in computer software is a programming technique for converting data between incompatible type systems in relational databases and object-oriented programming languages. This creates, in effect, a “virtual object database” that can be used from within the programming language.

ORM (2): Abbreviation for “Online Reputation Management”.

REST: Abbreviation for “Representational State Transfer”. REST is a style of software architecture for distributed hypermedia systems such as the World Wide Web. REST-style architectures consist of clients and servers.

GUI: Abbreviation for “Graphical User Interface”. A type of user interface that allows people to interact with electronic devices, such as computers, hand-held devices (MP3 players, portable media players or gaming devices), household appliances and office equipment, with images rather than text commands.

CSS: Abbreviation for “Cascading Style Sheets”, a style sheet language used to describe the presentation semantics of a document written in a markup language.

WYSIWYG: Abbreviation for “What You See Is What You Get”. The term is used in computing to describe a system in which content displayed during editing appears very similar to the final output, which might be a printed document, web page, slide presentation or even the lighting for a theatrical event.

UGC: Abbreviation for “User-Generated Content”. Also known as consumer-generated media (CGM) or user-created content (UCC), it refers to various kinds of publicly available media content produced by end-users. Most of the popular media sites like YouTube, Wikipedia, Facebook or Twitter are examples of UGC sites, where the members of the site provide the content and make the site grow.

W3C: Abbreviation for “The World Wide Web Consortium”. W3C is the main international standards organization for the World Wide Web.

Official Website: http://www.w3.org/

1.5 Citations

1. Liu, B. “Web Data Mining: Exploring Hyperlinks, Contents and Usage Data”. Springer, December 2006. (http://www.cs.uic.edu/~liub/WebMiningBook.html)

2. Baumgartner, R. “Web Data Extraction System”. (http://www.cs.washington.edu/homes/gatter/download/Web Data Extraction System.pdf)

3. Baraglia, R.; Silvestri, F. “Dynamic personalization of web sites without user intervention”, 2007.

4. Google. “Using JSON with Google Data APIs”. Retrieved July 3, 2009.

5. Schrenk, M. (2007). “Webbots, Spiders, and Screen Scrapers”. No Starch Press. ISBN 978-1-59327-120-6.

6. GooSeeker. “Semantic annotation based web scraping”. (http://www.gooseeker.com/en/node/knowledgebase/freeformat)

7. “Django FAQ”. Lawrence Journal-World. Retrieved 2008-04-01.

8. “Django Book 2.0”. (http://www.djangobook.com/en/2.0/)

9. “Django threading review”. (http://code.djangoproject.com/wiki/DjangoSpecifications/Core/Threading)

10. “What is Python Good For?”. General Python FAQ. Python Software Foundation. Retrieved 2008-09-05.

11. “General Python FAQ”. python.org. Python Software Foundation. Retrieved 2009-06-27. (http://www.python.org/doc/faq/general/)

12. Boodhoo, J.-P. (August 2006). “Design Patterns: Model View Presenter”. Retrieved 2009-07-07.

13. World Wide Web Consortium (December 9, 2008). “The Forms Working Group”. Retrieved 2009-07-07.

14. “Resource Description Framework (RDF) Model and Syntax Specification”. (http://www.w3.org/TR/PR-rdf-syntax/)

15. Harth, A.; Decker, S. “Optimized Index Structures for Querying RDF from the Web”. 3rd Latin American Web Congress, Buenos Aires, Argentina, October 31 to November 2, 2005, pp. 71-80. (http://sw.deri.org/2005/02/dexa/yars.pdf)

16. Grune, D.; Jacobs, C.J.H. “Parsing Techniques: A Practical Guide”. VU University Amsterdam, Amsterdam, The Netherlands.

17. Schawbel, D. (2009). “Top 10 Reputation Tracking Tools Worth Paying For”. (http://mashable.com/2008/12/29/brand-reputation-monitoring-tools/)

18. Fernando, A. (2004). “Big Blogger is Watching You! Reputation Management in an Opinionated, Hyperlinked World”. Communication World.

19. Thompson, N. (2003). “More Companies Pay Heed to Their ‘Word of Mouse’ Reputation”. New York Times.

20. Kinzie, S.; Nakashima, E. (July 2, 2007). “Calling In Pros to Refine Your Google Image”. The Washington Post.

21. “W3C Semantic Web Activity News - SPARQL is a Recommendation”. W3.org. 2008-01-15. Retrieved 2009-10-01. (http://www.w3.org/blog/SW/2008/01/15/sparql_is_a_recommendation)

22. Flickr API documentation. (http://www.flickr.com/services/api/)

23. Technorati API documentation. (http://technorati.com/developers/api/)

24. Twitter API documentation. (http://apiwiki.twitter.com/)

25. BeautifulSoup documentation. (http://www.crummy.com/software/BeautifulSoup/documentation.html)

26. Google Data API documentation. (http://code.google.com/apis/gdata/overview.html)

27. Joseki server documentation. (http://www.joseki.org/documentation.html)

28. W3C. “SPARQL Query Language for RDF”. W3C Recommendation, 15 January 2008. (http://www.w3.org/TR/rdf-sparql-query/)

29. Jena documentation. (http://jena.sourceforge.net/documentation.html)

30. Google Visualization API documentation. (http://code.google.com/intl/it-IT/apis/visualization/documentation/)


Chapter 2

Background

This chapter aims to provide the necessary background for better understanding the characteristics of the project I worked on. It begins by explaining what Online Reputation Analysis and Management means and how it is used in several different industries. Later on, a short history and the scope of the practice are presented. This section is then connected to a part explaining how ORM systems are affecting companies and brands. Finally, existing solutions from around the world are analyzed in detail and my opinions about each of them are given. In addition, the last part contains information about how they inspired me and how I used them in the technical building blocks of the Web2rism project.

2.1 Online Reputation Analysis & Management

Online reputation management, or ORM, is the practice of consistent research and analysis of one’s personal or professional, business or industry reputation as represented by the content across all types of online media. It is also sometimes referred to as online reputation monitoring, maintaining the same abbreviation.

ORM is a relatively new industry but has been brought to the forefront of professionals’ consciousness due to the overwhelming nature of both amateur UGC and professional journalistic content. The type of online content monitored in ORM spans professional journalism sponsored by traditional news and media giants as well as user-created and user-generated blogs, ratings, reviews and comments, and all manner of specialized websites. These websites can be about any particular subject, such as an individual, group, company, business, product, event, concept, or trend.

ORM partly grew out of the need to manage consumer-generated media (CGM).

The appeal of reputation mechanisms is that, when they work, they facilitate cooperation without the need for costly enforcement institutions. They therefore have the potential of providing more economically efficient outcomes in a wide range of moral hazard settings where societies currently rely on the threat of litigation to induce cooperation. The rising importance of online reputation systems not only invites, but also necessitates rigorous research on their functioning and consequences.

Marketing and social media experts see ORM as the future of branding, and it is an absolute necessity for any company striving to protect the positive image and brand equity it has worked so hard to achieve.

2.1.1 Short History of ORM

As CGM grew with the rise of social media and other similar user-based online content aggregators, it began to affect search results more and more, bringing with it increased attention to the matter of managing these results.

eBay was one of the first web companies to harness the power of CGM feedback. By using user-generated feedback ratings, buyers and sellers were given reputations that helped other users make purchasing and selling decisions. ReputationDefender was one of the first companies that offered to proactively manage online reputations. ClaimID is another company that early on presented services designed to promote personal ORM. Other ORM tools include Trackur, SinoBuzz, BrandsEye and Google Alerts.

The UK market for ORM was expected to grow by around 30% in 2008, to an estimated value of $60 million.

2.1.2 Scope of ORM

Specifically, the online media monitored in ORM include:

• Social networks (e.g. Facebook, MySpace, FriendFeed)

• Social news/bookmarking sites (e.g. Delicious, Digg)


• Traditional/mainstream websites

• Consumer review sites (e.g. Yelp, Epinions)

• Sites like PersonRatings.com which allow reviews of individuals.

• Collaborative Research sites (e.g. Yahoo Answers, Rediff Q&A)

• Independent discussion forums

• User-generated content (UGC) / Consumer Generated Media (CGM)

• Blogs

• Blogging communities (e.g. Open Diary, LiveJournal, Xanga)

• Microblogs (e.g. Twitter, Identica, Jaiku, Plurk)

2.1.3 A Business Approach to Online Reputation Management and Monitoring

This section lists how ORM is being used by companies and how it is useful for their brand identities.

Before the development of the web, news moved slowly and organizations could take their time to develop structured responses to problems. Currently, rapid developments on CGM sites mean that the general public can quickly air their views. These views can make or break a brand. Consumers trust these published opinions and base their buying decisions on them. Any information available to potential clients affects a company’s reputation and its customers’ buying decisions.

Similarly, ex-employees and brand activists can easily get their personal viewpoints out there. Competitors can also spread malicious rumors and lies about a company or brand in the hope of stealing its market share. These types of unsubstantiated reporting can affect a company’s corporate image. Sites containing this kind of information are indexed by search engines and appear in search results for brand names. More importantly, the information can spread to the traditional media, compounding the damage.

The goal of Online Reputation Management is to achieve high rankings and indexing in the search engines for all positively associated web sites. The result is an increase in a brand’s overall positive web presence, which helps the company own the top spots of the search engine rankings for its brand. ORM enables companies to protect and manage their reputation by becoming actively involved in the outcome of search engine results through a three-step process.

The three steps involved in Online Reputation Management are:

• Monitoring and tracking what is being said online. Monitoring gives an immediate heads-up if adverse information is appearing, and it is an essential and useful tactic for controlling adverse information on the search engines and social media sites.

• Analyzing how the visible information affects a brand. At this stage of the analysis, it is possible to assess the present position of a brand.

• Influencing. This stage is when the company starts influencing the results by participating in the conversation and eliminating negative sites.

This report, as its name suggests, will focus mostly on Online Reputation Monitoring, the first stage of Online Reputation Management work. The letter “M” in the abbreviation “ORM” stands both for “Management” and “Monitoring”; therefore I decided to use the word “analysis” and will continue with the phrase “Reputation Analysis”, which corresponds to the first stage of the ORM process.

2.2 Existing Solutions

This section describes other people’s approaches to the reputation analysis issue. Each has specific pros and cons, targeting different market areas. For each tool, first a general description, usually extracted from the official website of the service, is presented; then my personal comments, how the tool inspired me, and how I used it in the technical building blocks of the Web2rism project are explained.

Google Trends

Google Trends is a public web facility of Google Inc., based on Google Search, that shows how often a particular search term is entered relative to the total search volume across various regions of the world, and in various languages. Google Trends also allows the user to compare the volume of searches between two or more terms. An additional feature of Google Trends is its ability to show news related to the search term overlaid on the chart, showing how new events affect search popularity.

Personal Comment: Google Trends is a great tool that depends on the most popular and powerful search engine on the planet. However, it is not built with a semantic point of view; it just measures the density of searches connected to given keywords. In this manner, we cannot really speak of a reputation analysis tool but of a “density of buzz” tracker.

Website: http://www.google.com/trends

Google Insights for Search

Google Insights for Search is a service by Google similar to Google Trends, providing insights into the search terms people have been entering into the Google search engine. Unlike Google Trends, Google Insights for Search provides a visual representation of regional interest on a map. It has been noted, however, that term order is important, and that different results will be found if the keywords are placed in a different order.

Personal Comment: In many aspects, Google Insights has a lot of similarities with Google Trends. There has been quite a lot of discussion about why Google released a second, similar product when it already had Google Trends, but it was then announced on the Google AdWords blog that Insights was slightly more targeted towards advertisers. For instance, it contains categories (alternatively called verticals) to restrict your terms to. It also shows top searches and top rising searches in the neighborhood of the keywords you enter. Overall, this seems to be a huge extension to Google Trends, Google Ad Planner, and the tools available to advertisers within AdWords.

For this tool, Google does not provide an API; it is only possible to get a report of the data output in CSV or Excel format. In the Web2rism project, the data analytics provided by Google Insights were used heavily, and we had to manually scrape and parse the CSV file. It would be easier for third-party developers to create tools that use Google Insights if the system had an API.

Website: http://www.google.com/insights


Image 1: Web Search Interest data from Google Insights, comparing “Microsoft” vs “Google”

Google Alerts

Google Alerts is a service offered by the search engine company Google which notifies its users by email (or as a feed) about the latest web and news pages of their choice. Google currently offers six types of alert searches: “News”, “Web”, “Blogs”, “Comprehensive”, “Video” and “Groups”. A News alert is an email that lets the user know if new articles make it into the top ten results of his/her Google News search. A Web alert is an email that lets the user know if new web pages appear in the top twenty results for the user’s Google Web search. A News & Web alert is an email that lets the user know when new articles related to his/her search term make it into the top ten results for a Google News search or the top twenty results for a Google Web search. A Groups alert is an email that lets the user know if new posts make it into the top fifty results of the Google Groups search.

Personal Comment: The tool is not an all-in-one reputation management tool. It is only an alert system that keeps the user informed over time about a keyword he/she has specified. It can be put to better use when integrated with Google’s other reputation and trend analysis services. Generally speaking, rather than offering a whole solution, Google provides separate, distributed tools. This may actually be part of their company vision, since Google’s attitude has always been about letting people reach any data in an easier way. They may not aim to create a powerful ORM suite but to provide tools that let the world of developers create their own applications.

Website: http://www.google.com/alerts

Circos Brand Karma

Circos is a leading technology company that specializes in extracting brand sentiments from the actual text in consumer-written reviews and comments. The company’s proprietary technologies apply semantic analysis to social media, surfacing rich insights about brands based on personal preferences. Circos specializes in social media for the hotel and tourism industries, and its Brand Karma product is helping leading hotel brands increase revenue, improve search engine optimization, and credibly brand themselves online.

Personal Comment: Circos Brand Karma has quite a lot of similarities with the Web2rism project since it also focuses on the tourism industry. In terms of UI design, it has also been a good example and inspiration for how I can build up the look & feel of the reporting pages of our project. Another good point noted about Brand Karma is that it does not only give the popularity of a touristic destination or hotel’s name but also provides real business results, like increased revenue, customer satisfaction, and loyalty.

Website: https://brandkarma.circos.com/

Radian6

Radian6 offers a solution where you can set up certain keywords to monitor on a dashboard, automatically track the keywords on blogs, image sharing sites and microblogging sites, and then have it report back to you with an analysis of the results. Data is captured in real time as it is discovered and delivered to dashboard analysis widgets. The solution covers all forms of social media including blogs, top video and image sharing sites, forums, opinion sites, mainstream online media and emerging media like Twitter. Conversational dynamics are constantly tallied to track the viral nature of each post.


Figure 2.1: Reputation Analysis Results of Perspective Hotel Singapore on Brand Karma

Personal Comment: Radian6 monitors a considerably large number of social media and trend sites. The latest news on its official site was that it has now integrated WebTrends web analytics and SalesForce.com CRM information. Radian6’s analysis works over website addresses instead of keywords. It is more suited to web analytics experts who are trying to learn more about the status of the reputation of their websites.

Website: http://www.radian6.com/

TNS Cymfony

TNS Cymfony offers the Maestro Platform, which is built on a Natural Language Processing engine that automatically identifies, classifies, qualifies and benchmarks important people, places, companies and topics for you. The platform is able to distinguish between different media sources, such as traditional media and social media. Cymfony’s differentiation is that their engine dissects articles, paragraphs and sentences to determine who and what is being talked about, whether something or someone is a key focus or a passing reference, and how the various entities mentioned relate to one another.

Personal Comment: Although I was not able to test the Maestro software (a demo is only available upon approved request), I liked TNS Cymfony’s approach because they are not only doing keyword-based analysis. Their technology uses a more sophisticated form of information extraction based on detailed grammatical analysis of the text. Grammar-based approaches eliminate irrelevant content and are far more intelligent than keyword searching. This provided us with examples of how we can connect Web2rism to a sentiment analysis / NLP system. See section 4.1.5 of this report for more details about the sentiment analysis work.

Website: http://www.cymfony.com/solutions/our-approach/orchestra-platform

Sentiment Metrics

Sentiment Metrics has a reputation management tool that, just like the other services mentioned, helps you monitor what is being said about you, your brand and your products across blogs, forums and news sites. The reports produced by this software focus on sentiment, which tells you whether a mention is positive, negative or neutral.

Personal Comment: Sentiment Metrics’ analysis tool is also only demoed upon booking, which is why I was not able to do a hands-on test. As far as I can tell from the company website, they offer quite a powerful tool, and they have managed to get it used by big brands like Samsung, LG and Honda. They are also using sentiment analysis technologies, and it is indicated that they are working with academics. In this way, the tool has similarities with Web2rism, since Web2rism is also being developed in a university’s academic environment.

Website: http://www.sentimentmetrics.com/

Nielsen Buzzmetrics

Nielsen offers Buzzmetrics, which supplies you with key brand health metrics and consumer commentary from all consumer-generated media. They also have ThreatTracker, which alerts you to real-time online reputation threats and gives you a scorecard to show how you are doing relative to the competition. Nielsen uncovers and integrates data-driven insights culled from nearly 100 million blogs, social networks, groups, boards and other consumer-generated media platforms.

Personal Comment: Nielsen’s harvest / clean / analyze / find relevancy approach to buzz management resembles Web2rism’s. As explained in section 3.2 (Reputation Analysis Methodology), Web2rism has a very similar workflow for gathering data and converting it into a reputation measurement. Moreover, Nielsen’s approach looks more professional and enterprise-level when compared to the other tools.

Website: http://en-us.nielsen.com/

Figure 2.2: Nielsen’s approach for Advanced Buzz Management.

Cision

Cision offers the Cision Social Media service, which claims to monitor over 100 million blogs, tens of thousands of online forums, and over 450 leading rich media sites. One of the main benefits, just as with Nielsen Buzzmetrics, is that these companies have been monitoring and measuring traditional media sites for decades, so they can provide a more comprehensive solution across the board.

Cision’s product is unique in that it offers 24/7 buzz reporting. Their service is powered by Radian6, which is mentioned above. They also have a dashboard and daily reports, just like the other services, where they tell you what’s going on with your brand twice a day through email.

Personal Comment: Cision’s product uses a mix of the best tools, and they provide a more umbrella-style approach to the industry. They have several different packages of their social media intelligence software and none of them could be tested for free, so I was only able to deduce what kind of service they provide by browsing the company / product website and examining the screenshots.

Website: http://www.cision.com/en/

BrandsEye

BrandsEye was developed by Quirk and has been used internally by the development team and their clients while its algorithm was tested and tweaked. It has been recognised as signifying a massive leap from the pre-existing ORM tools that merely track and monitor the brand buzz. BrandsEye not only traces and assesses your online presence but provides you with a real-time Reputation Score for both you and your competitors. This allows companies to monitor the sentiments and opinions of their own customers, while making educated judgements about how to respond to attacks on their online reputation.

Personal Comment: BrandsEye offers reputation management packages for bloggers, small businesses and enterprises. The tool tracks every online mention of your brand, giving you a score that accurately reflects the state of your reputation over time. Part of the differentiation is that you can actually tag mentions of your brand and rank them in terms of a number of pre-determined criteria. BrandsEye’s target market spans a bigger portion of users than the other tools explained above, and it has affordable solutions even for non-profit hobby bloggers.

Website: http://www.brandseye.com/

To reach a final conclusion about which existing solution is the most popular one in use, I did an interesting study and used Google Insights itself. I did not include market-specific tools like Circos Brand Karma or general keyword-based volume measurers like Google Trends or Insights, in order to see the best among enterprise-level buzz management suites. As a result, the graph shown below was generated.


Figure 2.3: Comparison of the best of existing ORM tools using Google Insights


Chapter 3

My Approach

In this chapter the approach and methods that I intended to use to create the ORM software of the Web2rism project are explained. In the first section, the positioning of the whole project and how a tourism-focused ORM tool should function are discussed; next, the reputation model and how I decided to translate the business requirements into pieces of code are presented.

3.1 The Positioning of Web2rism

This section explains how ORM should be used in the Web2rism project.

“Online Reputation Management” is a relatively new industry, and there are good opportunities both in re-defining the concept and in the creation of tools. In the business world a strong reputation is a company’s greatest asset. But in the web world, reputation no longer hangs on what’s real; it hangs on a “perception of reality” created by CGM and UGC. Every day, for good or bad, someone, somewhere is talking about someone’s business. While constructive criticism is always welcome, malicious attacks on company reputations can spread like a global wildfire without the company owners even knowing it.

When we look at the existing solutions, we see that most ORM solutions address broad audiences and are not focused on particular markets. Since these tools can, in theory, be used to track anything that is representable by at least one keyword on the web, the sector of the company using the ORM software does not become very important. In this manner, the specification of an E-Tourism centric ORM solution is really vital. Questions like “How does the reputation of a hotel differ from the reputation of a newly released car?” or “How and where do people talk about a touristic experience they had?” were the starting points at Webatelier. We found that checking tourism blogs and hotel review sites is much more important than going directly to YouTube or Flickr to measure the buzz. The important thing here is that people can be talking a lot about something, but that alone does not indicate whether what they are talking about is a good quality product or service. The tracking and calculation of “reputation” should depend on something whose rating can be measured.

A good way of learning what potential customers will require and ask of an E-Tourism ORM was to get them to use our development versions. Since Lugano is already a highly touristic area, Webatelier was able to make the necessary contacts with local touristic businesses and let them test our software, and we were able to get frequent feedback from them.

3.2 Reputation Analysis Methodology

This section explains how we built up the model to calculate the reputation of touristic destinations or services. Although the analysis model is still evolving, its current state already provides enough data to collect, filter and convert into a reputation indicator.

3.2.1 Content

We categorized the various types of “objects” that can affect the reputation of a touristic destination into four main kinds.

• Reviews: These are the reviews about hotels and services on tourism-related portals such as TripAdvisor. The reviews are written by people who stayed at the hotel and used the services provided.

• Photos: These are the photos taken during touristic experiences; they have connections with the destination being analyzed. For Web2rism, the main sources of photo content were Flickr and Picasa.

• Videos: These are the videos taken during touristic experiences; they have connections with the destination being analyzed. For Web2rism, the main source of video content was YouTube.


• Blog Posts: These include the posts submitted to blogging sites or personal webpages, and the little pieces of text (such as Tweets from Twitter) submitted to micro-blogging platforms. As there are different ways for someone to set up a blog site, we first decided to get links from blog indexes; Google’s Blogspot, Technorati and Wordpress are among our blog tracking portals.

After we categorized the content, the essential part was to decide how we were going to extract the indicators that affect the reputation of brands. Since we are dealing with UGC sites, as in most social-media type systems, the generated user content carries ratings and comments. In addition to these, one more factor we took into consideration was the “density” of comments and ratings: the more ratings or comments a content object has, the more saturated it is in terms of people’s opinions. A sketch of how these factors can be combined is given after the figure below.

Figure 3.1: Content Analysis for Reputation Measurement
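To make the idea concrete, here is a purely illustrative Python sketch of combining the two factors. The function name, the multiplicative combination and the logarithmic damping of the density factor are my own assumptions for illustration, not Web2rism’s actual reputation formula.

    import math

    def content_score(ratings, comment_count):
        # Illustrative only: combine the average rating of a content
        # object with a "density" factor. The logarithmic damping and
        # the multiplicative combination are assumptions of this
        # sketch, not the project's final model.
        if not ratings:
            return 0.0
        avg_rating = sum(ratings) / float(len(ratings))
        # The more ratings and comments an object has, the more
        # saturated it is in terms of people's opinions.
        density = math.log(1 + len(ratings) + comment_count)
        return avg_rating * density

    # Example: a hotel review rated [4, 5, 3] that received 12 comments.
    print content_score([4, 5, 3], 12)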

3.2.2 Sentiments

The raw form of scraped content in text format does not really mean anything unless it has ratings or indicators showing that it carries positive or negative value. Therefore, a sentiment analysis layer built on top of the data gathering part is required. This way we are able to detect which language is being used in the comment texts and see whether they are written in a positive manner or not. Details on how we used sentiment analysis tools and technologies are presented in Chapter 4: Implementation.

3.2.3 Authorship

A person can have many identities on the Web. One can be a member of many different social sites and may use a different nickname on each. However, we needed to understand whether a person on one site and a person using the same nickname on another site are the same person or not. To handle this, “author” objects are attached to “content” objects and can be shared among different contents, as sketched below. Calculating and detecting the author object can be a big problem, though, and in the current phase we have been studying the issue more deeply.
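As a minimal sketch of this data structure (the class and attribute names are hypothetical, not the project’s real schema), a single author object can be referenced by content objects gathered from several sites:

    class Author(object):
        # Hypothetical shared author object.
        def __init__(self, nickname):
            self.nickname = nickname
            self.profiles = {}  # site name -> profile URL

    class Content(object):
        # Hypothetical content object (review, photo, video, blog post).
        def __init__(self, kind, author):
            self.kind = kind
            self.author = author  # shared Author instance

    # The same (presumed) person on Flickr and Twitter points to one
    # Author object, so their contributions can be aggregated.
    alice = Author("alice_travels")
    alice.profiles["flickr"] = "http://www.flickr.com/photos/alice_travels"
    alice.profiles["twitter"] = "http://twitter.com/alice_travels"
    photo = Content("photo", alice)
    tweet = Content("blogpost", alice)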

3.2.4 Query expansion

When using popular search engines, there is a feature that gives you related search suggestions as you type letters into the search box. Inspired by this, we decided people might be looking for topics related to what they have started searching for. When a keyword (representing a hotel, a destination or a brand) on Web2rism is about to be sent to the system for analysis, it is first expanded, and suggestions about possible similar searches are presented to the user; a minimal sketch of the idea follows.
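In the Python fragment below, the suggestion list and the filtering rule are stand-ins for whatever suggestion source the system actually consults; they are assumptions for illustration, not Web2rism’s code.

    def expand_query(keyword, suggestions):
        # Illustrative sketch: expand a keyword with related terms
        # before the analysis starts. `suggestions` stands in for a
        # real suggestion source (e.g. a search engine's suggest
        # feature); the containment rule below is an assumption.
        expanded = [keyword]
        for term in suggestions:
            if keyword.lower() in term.lower() and term not in expanded:
                expanded.append(term)
        return expanded

    # A user starts an analysis for "Lugano"; related searches are
    # offered alongside the original keyword.
    print expand_query("Lugano", ["Lugano hotels", "Lake Lugano", "Milano"])
    # -> ['Lugano', 'Lugano hotels', 'Lake Lugano']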

3.2.5 Location Factor

With the increasing number of mobile devices and applications connected to social media sites and UGC platforms, uploaded content now often carries geo-tagging properties. For example, many of the photos on Flickr now have latitude and longitude values representing geographical coordinates assigned to them. Although not clearly defined and converted to code yet, there is the idea of also adding the geo-location attributes of the content objects to the reputation calculations. More information about the location factor is given in the Future Work section of the concluding chapter of this report.


3.3 Technical Methodology

This section presents our projections of the best way to implement the idea of an E-Tourism centric ORM system as a software application, based on the project’s needs and characteristics.

Needs and Decisions

• The application has to be reachable from anywhere in the world.

• It should use open source tools and not depend on commercial software.

• The project is being developed in an academic environment, so it should give students and researchers the possibility to study related areas.

• The software should be easy to understand, keeping the learning curve as short as possible for future developers.

• It should use the latest trends in software development so that the best results can be achieved.

• The software development should respect the DRY (Don’t Repeat Yourself) principle so that existing code can be reused for other purposes.

Considering all these necessities, we decided to create a highly scalable, high-performance web application that is able to use semantic technologies. We chose completely open source tools and frameworks that use open standards, to ease the interoperability of the different layers of our system with each other. We also saw the necessity of developing our own API, as has been the trend in most web applications for the past few years: sites with APIs are usually more trusted and much more open to extensions, which are usually built by third-party developers. Since Web2rism is an academic project where many people get involved from time to time, we thought newcomers should be able to understand how the already built system works and not lose time with design changes.

The upcoming “Implementation” chapter presents which tools, frameworks and methods we chose and how we designed the system architecture.


Chapter 4

Implementation

In this chapter, the various tools and frameworks used throughout the software development process of the Web2rism project are presented. In addition, the results of using these tools and how they were tested and evaluated are analyzed. Later on, a detailed picture of how I built the overall system and how the components interact with each other is given.

4.1 Tools and Frameworks Used

This section lists the various programming languages, frameworks, tools and applications used in the various layers of the development process. Trying, learning, experimenting with and deciding on the tools to use was an important part of the project, because we needed to work with several different UGC sites that provide their data specifically for third-party application development.

4.1.1 Data & Knowledgebase Tools

Joseki

Joseki is an HTTP engine that supports the SPARQL Protocol and the SPARQL RDF query language. Joseki’s features include RDF data generation from files and databases and an HTTP (GET and POST) implementation of the SPARQL protocol.

Official Website: http://www.joseki.org/
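Because Joseki implements the SPARQL protocol over plain HTTP GET, a store served by it can be queried from Python with nothing but the standard library. The endpoint URL below is a placeholder for illustration; a real installation exposes its own service URL.

    import urllib

    # Placeholder endpoint; substitute the actual Joseki service URL.
    ENDPOINT = "http://localhost:2020/sparql"

    query = """
    SELECT ?s ?p ?o
    WHERE { ?s ?p ?o }
    LIMIT 10
    """

    # Joseki's HTTP GET implementation of the SPARQL protocol takes
    # the query as a URL parameter; SELECT results typically come
    # back as SPARQL Query Results XML.
    params = urllib.urlencode({"query": query})
    response = urllib.urlopen(ENDPOINT + "?" + params)
    print response.read()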


Jena

Jena is a Java framework for building Semantic Web applications. It provides a programmatic environment for RDF, RDFS, OWL and SPARQL, and includes a rule-based inference engine. Jena is open source and grew out of work within the HP Labs Semantic Web Programme.

Jena is included in the Joseki distribution along with ARQ, a SPARQL query engine for Jena. We used Jena’s API to extract data from and write to our knowledge base’s RDF graphs. The graphs are represented as an abstract “model”; each graph can be sourced with data from files, databases, URLs or a combination of these. A model can also be queried through SPARQL and updated through SPARUL. Detailed explanations of how we used SPARQL and SPARUL queries to query our KB model are given in section 4.3.2 (Layers and Components), under Data Storage.

Official Website: http://jena.sourceforge.net/

SQLite3

SQLite is a software library that implements a self-contained, serverless, zero-configuration, transactional SQL database engine. The code for SQLite is in the public domain and is thus free for use for any purpose, commercial or private. SQLite is an embedded SQL database engine. Unlike most other SQL databases, SQLite does not have a separate server process. SQLite reads and writes directly to ordinary disk files. A complete SQL database with multiple tables, indices, triggers, and views is contained in a single disk file.

The reason why I chose SQLite instead of MySQL is that I needed to store the configurations of the different scrapers in the ScraperManager application in a fast and simple way. As explained in Section 4.3 - System Architecture, each scraper script is embedded dynamically into the system, and the threads that are run by these scripts have to be logged permanently. Since the manager application will only be used by the administrators of the Web2rism project, there is no need to set up a full MySQL installation. SQLite's compact and easy-to-integrate structure met this need quickly and adequately.

Official Website: http://www.sqlite.org/


4.1.2 Programming Languages and Web Frameworks

Programming Language

As the main programming language, I used Python (version 2.5.1). The reason I chose to code in Python is that it is a trending high-level language used by the technology and trend leader of the world wide web, Google. In addition, Python lets you work more quickly and integrate your systems more effectively. It is also open source and free to use, even for commercial products, because of its OSI-approved license.

Official Website: http://www.python.org/

Web Framework

Because of the need for web user interfaces, especially for management consoles, I used Django (version 1.1), which is the most popular Python web framework available. It follows the model-view-controller architectural pattern. Django's primary goal is to ease the creation of complex, database-driven websites. The framework also emphasizes reusability and “pluggability” of components, rapid development, and the principle of DRY (Don't Repeat Yourself). Python is used throughout, even for settings files and data models.

Official Website: http://www.djangoproject.com/

External Applications:

Another advantage of Python is that it eases the plugging of external Python applications into your projects. This way, I was able to search for applications created by other programmers to solve common problems. For instance, to convert the models (business objects), which are filled with the data coming from the knowledge-base, into JSON objects, I used simplejson. Below is a full list of the pluggable Python apps I used in the server-side programming process.

Simplejson

Simplejson is a simple, fast, extensible JSON encoder/decoder for Python. It is compatible with Python 2.4 and later with no external dependencies. It covers the full JSON specification for both encoding and decoding, with Unicode support.


Official Website: http://www.undefined.org/python/

BeautifulSoup

Beautiful Soup is an HTML/XML parser for Python that can turn even invalid markup into a parse tree. It provides simple, idiomatic ways of navigating, searching, and modifying the parse tree. I used BeautifulSoup in the process of parsing the HTML data after scraping sites that do not provide their own APIs to let us easily gather data.

Official Website: http://www.crummy.com/software/BeautifulSoup/

WWW::Mechanize

Mechanize is a handy Perl module behaving much like BeautifulSoup for Python. Mechanize automatically stores and sends cookies, follows redirects, can follow links, and can submit forms. Form fields can be populated and submitted. I used Mechanize while getting used to scraper programming and for testing custom scrapers written in Perl.

Official Website: http://search.cpan.org/dist/WWW-Mechanize/

4.1.3 APIs and Wrappers

Google BlogSearch

Blog Search is Google search technology focused on blogs. Google is a strong believer in the self-publishing phenomenon represented by blogging, and its Blog Search helps users explore the blogging universe more effectively. Its results include all blogs, not just those published through Blogger, so it is possible to get the most accurate and up-to-date results. The goal of Blog Search is to include every blog that publishes a site feed (either RSS or Atom). It is not restricted to Blogger blogs, or blogs from any other service.

Web2rism required the crawling of travel blogs from all over the world, which is why blog search engines like Google's were crucial for the project. Google Blogsearch offers its raw data in ATOM or RSS formats, and it was easy to integrate into our data gathering layer.

Official Website: http://blogsearch.google.com/
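
As a minimal sketch, such a feed can be consumed with the Universal Feed Parser library (the feed URL shape follows the Atom feed example cited in Chapter 5; both the URL and the keyword are illustrative):

import feedparser  # Universal Feed Parser, assumed to be installed

# Fetch the Atom feed of a blog search for a keyword and list the entries.
feed_url = ("http://blogsearch.google.com/blogsearch_feeds"
            "?hl=en&q=lugano&output=atom")
feed = feedparser.parse(feed_url)
for entry in feed.entries:
    print entry.title, "-", entry.get("author", "unknown")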

Google Suggest


As you type into the search box on Google Web Search, Google Suggest offers searches similar to the one you are typing. Start to type [ Como ] – even just [ Co ] – and you will be able to pick searches for Como Lago, Como Italy, and Cómo te llamas (which is the Spanish way of asking someone's name). Type some more, and you may see a link straight to the site Google thinks you are looking for – all from the search box.

This feature of Google was used in the query expansion part of Web2rism, where the given keywords are inserted into Google's suggestion function and the suggestions are re-presented to the user to help him specify what he is really looking for.

Official Website: http://www.googlelabs.com/
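
A minimal sketch of such an expansion call, using the unofficial Suggest XML endpoint (the URL and parameter names shown here are the commonly cited unofficial ones and may change without notice):

import urllib, urllib2
from xml.etree import ElementTree

def google_suggest(keyword):
    # Ask the (unofficial) Suggest endpoint for completions of a keyword.
    url = ("http://google.com/complete/search?output=toolbar&q="
           + urllib.quote(keyword))
    xml = urllib2.urlopen(url).read()
    tree = ElementTree.fromstring(xml)
    return [s.get("data") for s in tree.findall(".//suggestion")]

print google_suggest("Como")  # e.g. ["como lago", "como italy", ...]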

Yahoo Search Suggests

Yahoo's search suggestion service is similar to what Google has. However, our analysis showed that it provided quite different results for the same search terms, which is why we also included this service in the query expansion part of Web2rism. Details about the query expansion part are explained in the upcoming section 4.3 - System Architecture.

Official Website: http://www.yahoo.com/

Google Charts API

The Google Chart API lets you dynamically generate charts. It basically returns a PNG-format image in response to a URL. Several types of image can be generated, including line, bar, and pie charts. For each image type, you can specify attributes such as size, colors, and labels. The Charts API is used in the visualization of the data shaped by the reputation model for the end-user, which is in most cases the manager/owner of a brand or of a touristic destination.

Official Website: http://code.google.com/apis/chart/
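
As a small illustration, a chart is requested simply by composing a URL (parameter names as in the public Chart API documentation; the data shown is made up):

import urllib

# Compose a Chart API URL for a bar chart of blog posts per day.
params = urllib.urlencode({
    "cht":  "bvs",                 # chart type: vertical bar chart
    "chs":  "300x150",             # image size in pixels
    "chd":  "t:10,25,40,30",       # the data series, text encoding
    "chxt": "x",                   # show the x axis
    "chxl": "0:|Mon|Tue|Wed|Thu",  # x axis labels
})
chart_url = "http://chart.apis.google.com/chart?" + params
print chart_url  # embed this URL in an <img> tag on the UI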

Google Data API

Google's YouTube APIs and tools enable developers to integrate YouTube's video content and functionality into their websites, software applications, or devices. It is possible to search for videos, retrieve standard feeds, and see related content. A program can also authenticate as a user to upload videos, modify user playlists, and more.

I used the YouTube API in the Web2rism scrapers to fetch the texts of the comments written for a given video, count the density of videos uploaded about a given query, and find the top contributing authors. The API works over HTTP, returning XML. Though it is possible to use these services with a simple HTTP client, there are libraries that provide helpful tools to streamline the code and keep up with server-side changes. Because I had been coding in Python, I used the Gdata Python Client, a library created by Google for using their APIs from Python code.

Official Website: http://code.google.com/apis/gdata/

Google Data (gdata) Python Client Library

The Google Data Python Client Library provides a library and source code that make it easy to access data through Google Data APIs. This library is able to communicate with many of the Google products that provide their own APIs, such as Blogger, Google Calendar, Maps, Picasa Web Albums, etc.

Official Website: http://code.google.com/p/gdata-python-client/
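
A minimal sketch of fetching video comments through this client library (the search keyword and the video-id extraction are illustrative):

import gdata.youtube.service

yt_service = gdata.youtube.service.YouTubeService()

# Search for videos matching a keyword, then fetch the comments
# of the first hit.
query = gdata.youtube.service.YouTubeVideoQuery()
query.vq = "lake como"
feed = yt_service.YouTubeQuery(query)

if feed.entry:
    video_id = feed.entry[0].id.text.split("/")[-1]
    comments = yt_service.GetYouTubeVideoCommentFeed(video_id=video_id)
    for comment in comments.entry:
        print comment.content.text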

Google Insights

Google Insights for Search is a service by Google, similar to Google Trends, providing insights into the search terms people have been entering into the Google search engine. Unlike Google Trends, Google Insights for Search provides a visual representation of regional interest on a map.

The consumption of Google Insights data was rather a hard process, because the site does not offer an API. There is only a CSV file of the results, generated automatically, and I had to parse this file manually to extract the required data.

Official Website: http://www.google.com/insights/search/
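
A sketch of that manual parsing, assuming the export contains titled blocks separated by blank lines (the block title and column layout here are illustrative, not the exact export format):

import csv

def parse_insights_regions(path):
    # Scan the exported CSV for a "Top regions" block and collect
    # (region, score) rows until the blank line that ends the block.
    regions = []
    in_block = False
    for row in csv.reader(open(path)):
        if row and row[0].startswith("Top regions"):
            in_block = True
            continue
        if in_block:
            if not row:
                break
            regions.append((row[0], row[1]))
    return regions

print parse_insights_regions("insights_lugano.csv")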


Twitter Search API

The popular micro-blogging platform Twitter exposes its data via an easy-to-use API. The API is divided into two parts: Search and REST. I used the Search API to browse through the tweets of Twitter users and to calculate trending topics and top contributing users.

Official Website: http://apiwiki.twitter.com/
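
A minimal sketch of a Search API call as it worked at the time of writing (the keyword is illustrative):

import urllib, urllib2
import simplejson

# Ask the Search API for recent tweets mentioning a keyword.
params = urllib.urlencode({"q": "lugano", "rpp": 100})
url = "http://search.twitter.com/search.json?" + params
data = simplejson.loads(urllib2.urlopen(url).read())
for tweet in data["results"]:
    print tweet["from_user"], ":", tweet["text"]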

Technorati API

Technorati, the famous blog indexing and search service, has a program called ‘The Technorati Developer Program’, which helps power users and tool developers integrate Technorati data directly into their favorite applications. They provide an SDK, example scripts, a mailing list, and other helpful resources to assist the development process. Technorati exposes many of its data services via an application programming interface. Technorati's API returns results in its own proprietary XML as well as in common feed formats such as RSS.

The consumption of Technorati data played an important role in the Web2rism project since, as indicated in the Google BlogSearch explanation, the crawling of travel blogs from all over the world was crucial for the data gathering part.

Official Website: http://technorati.com/developers/api/

Flickr API

The Flickr API consists of a set of callable methods and some API endpoints. To perform an action using the Flickr API, you select a calling convention, send a request to an endpoint specifying a method and some arguments, and receive a formatted response. The function calls can return data in JSON or XML formats.

I used the Flickr API in the project to fetch comments given to photos, receive the meta-tags of uploaded media, and calculate trending topics, locations and top contributing users. An important feature of Flickr is that it provides geo-tagging functionality for uploaded photos, and this played an important role in the reputation model of the Web2rism project.


Official Website: http://www.flickr.com/services/api/
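
A minimal sketch of a call through the REST endpoint (the API key is a placeholder and the search text illustrative):

import urllib, urllib2
import simplejson

# Call flickr.photos.search over the REST endpoint; a key is required.
params = urllib.urlencode({
    "method":         "flickr.photos.search",
    "api_key":        "YOUR_API_KEY",  # placeholder
    "text":           "lugano",
    "format":         "json",
    "nojsoncallback": 1,               # plain JSON, no JSONP wrapper
})
url = "http://api.flickr.com/services/rest/?" + params
data = simplejson.loads(urllib2.urlopen(url).read())
for photo in data["photos"]["photo"]:
    print photo["id"], photo["title"]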

4.1.4 User Interface Design

jQuery

jQuery is a fast and concise JavaScript library that simplifies HTML document traversing, event handling, animating, and Ajax interactions for rapid web development. Microsoft and Nokia have announced plans to bundle jQuery on their platforms: Microsoft adopting it initially within Visual Studio for use within Microsoft's ASP.NET AJAX framework and ASP.NET MVC Framework, whilst Nokia will integrate it into their Web Run-Time platform.

Because it is becoming an industry standard and a must-have library for web development, I used jQuery for the faster creation and better durability of the user interfaces, instead of coding purely in JavaScript.

Official Website: http://www.jquery.com

4.1.5 Sentiment Analysis

Before diving more into how sentiment analysis practices are used in the project, it is essential to give some definitions and explanations on the matter.

Sentiment analysis, or opinion mining, refers to a broad area of natural language processing, computational linguistics and text mining. Generally speaking, it aims to determine the attitude of a speaker or a writer with respect to some topic. The attitude may be their judgement or evaluation, their affective state or the intended emotional communication.

After the “Data Gathering” part, where text-based data is collected from several different resources, the collection is sent through a sentiment analysis tool of our choice that is able to detect whether a piece of text is written in a positive manner or the opposite. For instance, consider a comment on a YouTube video about the Lake of Como. By basic analysis of the comment, it can be understood whether the given text or sentence is positive or negative. Basically, if the comment text contains words like “beautiful”, “good” or “relaxing”, and the sentence is grammatically positive, it can be deduced that the comment is a positive one.
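
The idea can be illustrated with a deliberately naive word-counting check (a toy sketch only; the real analysis is delegated to a dedicated tool, as described below):

# A toy polarity check based on word lists; for illustration only.
POSITIVE = set(["beautiful", "good", "relaxing", "amazing"])
NEGATIVE = set(["ugly", "bad", "boring", "awful"])

def naive_polarity(text):
    words = text.lower().split()
    score = sum(1 for w in words if w in POSITIVE) \
          - sum(1 for w in words if w in NEGATIVE)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print naive_polarity("The Lake of Como is beautiful and relaxing")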


Mood / Sentiment detection

The kind of analysis mentioned above is called mood/sentiment detection, and in most content analyzer tools there are four different kinds of analysis:

• Polarity Basic Analysis: basic analysis of whether the given text or sentence is positive or negative.

• Subjectivity Basic Analysis: analyzes whether the given text expresses an opinion or states a fact.

• Hierarchical Polarity Analysis: analyzes the polarity (i.e. positivity or negativity) of the given text using subjectivity results. It uses the subjectivity classifier to extract subjective sentences from reviews to be used for polarity classification.

• Polarity Whole Analysis (cross-validation of hierarchical polarity analysis): analyzes the polarity of the given text or sentence using multiple checks and tests with different training and test sets. Cross-validation performs multiple divisions of the data into training and test sets and then averages the results, in order to reduce evaluation variance and tighten confidence intervals.

For sentiment analysis, a famous multi-lingual tool called LingPipe is used. When checked against movie review data, LingPipe gives 81% accuracy (the average of the four kinds of test results).

Natural Language Detection

The web does not consist of one language. And since the matter here is e-tourism, it is very likely to find information about a touristic destination in the language of the country the place belongs to. In these cases, before doing content analysis, there is the need for language detection. Going back to the YouTube video comment example mentioned at the beginning of this section: using natural language detection tools, it is possible to find out the language of the comment text, or of any other text data that our system needs to analyze after the data gathering phase.

For natural language detection in Web2rism, the Lextek Language Identifier is used. This tool is capable of identifying not only what language a given text is written in, but also its character encoding and which other languages are most similar. It offers more language and encoding modules than any other language identifier; currently, there are 260 language and encoding modules available for use in the analysis.

Official website: http://www.lextek.com/langid/li

4.2 Scrapers, Scraping and Parsing Techniques

By definition, web scraping (also called web harvesting or web data extraction) refers to a computer software technique for extracting information from websites. Usually, such software programs simulate human exploration of the Web by either implementing low-level Hypertext Transfer Protocol, or embedding certain full-fledged web browsers. Web scraping is closely related to web indexing, which indexes web content using a bot; this is a universal technique adopted by most search engines. Web scraping, on the other hand, focuses more on the transformation of unstructured web content, typically in HTML format, into structured data that can be stored and analyzed in a central local database or spreadsheet. Web scraping is also related to web automation, which simulates human web browsing using computer software.

The process of automatically collecting web information shares a common goal with the Semantic Web vision, which is a more ambitious initiative that still requires breakthroughs in text processing, semantic understanding, artificial intelligence and human-computer interaction. Web scraping, instead, favors practical solutions based on existing technologies, even though some solutions are entirely ad hoc.

4.2.1 Web2rism Scrapers

There are different levels of automation that existing web-scraping technologies currently provide. These include text grepping, regular expression matching, HTML parsing, DOM parsing and semantic annotation recognition. In the Web2rism project, however, my approach to the term “scraper” included one more technique: the consumption of web services and APIs provided by the target sites. Since we were interested mostly in social-media and UGC sites, the major players of the sector were already providing their website content in organized formats (mostly JSON or XML) via their web services or APIs.


During the development phase of the Web2rism software, I created 7 different scraper scripts that use the APIs and/or HTML parsing tools mentioned in the “Tools and Frameworks Used” section of this chapter. The scrapers for Flickr, YouTube, Technorati and Twitter use the sites' respective APIs, while I built custom parsers for WikiTravel, Google Blogsearch and Google Insights.

4.2.2 A Custom Scraper Example

This section shows a scraper script example coded in Python with the BeautifulSoup library. The code is pretty simple and straightforward. Using the urllib2 library, an HTTP connection is opened to the remote site to be scraped, and the page source is retrieved. After that, the BeautifulSoup library comes in and we sift through the HTML tags of the source until we reach the area we are interested in. Later on, we can fill a custom object and prepare it to be saved to any database / knowledge-base.


from BeautifulSoup import BeautifulSoup
import urllib2
import sys

# Force UTF-8 as the default encoding (a common Python 2 workaround).
reload(sys)
sys.setdefaultencoding("utf-8")


def WikiTravelURL(city_name):
    domain = "http://wikitravel.org"
    city_url = domain + "/en/" + city_name
    return city_url


def ScrapeWikiTravel(city_name):
    city_url = WikiTravelURL(city_name)
    city_html = urllib2.urlopen(city_url).read()
    city_soup = BeautifulSoup(city_html)
    see_section = city_soup.find(attrs={"name": "See"})
    do_section = city_soup.find(attrs={"name": "Do"})

    places_to_see = []
    if see_section is not None:
        data = see_section.findNextSibling("ul")
        # Walk the sibling "ul" lists until the "Do" section is reached.
        while data is not None and data.findNextSibling("a") != do_section:
            if data.li is not None:
                destination_name = None
                if data.li.b is not None:
                    # The tag-free destination name; it can later be
                    # checked against the knowledge-base.
                    if data.li.b.a is not None:
                        destination_name = data.li.b.a.next
                    else:
                        destination_name = data.li.b.next
                elif data.li.span is not None:
                    if data.li.span.span is not None:
                        destination_name = data.li.span.span.next
                    else:
                        destination_name = data.li.span.next
                if destination_name is not None:
                    places_to_see.append(destination_name)
            data = data.findNextSibling("ul")
    return places_to_see


4.3 System Architecture

In this part, the general architecture of the system and how the various components of the system communicate with each other are explained.

4.3.1 The MVC Architecture and its Usage

The main aim of the MVC architecture is to separate the business logic and application data from the presentation shown to the user.

The reason why I decided to use the MVC design pattern is that patterns are reusable and expressive: when a problem recurs, there is no need to invent a new solution; we just have to follow the pattern and adapt it as necessary.

1) Model: the model object knows about all the data that needs to be displayed. It is the model that is aware of all the operations that can be applied to transform that data. It only represents the data of an application and is not aware of the presentation, or of how that data will be displayed in the browser.

2) View: the view represents the presentation of the application. The view object refers to the model: it uses the query methods of the model to obtain the contents and renders them. The view is not dependent on the application logic; it remains the same if there is any modification in the business logic. In other words, it is the responsibility of the view to maintain consistency in its presentation when the model changes.

3) Controller: whenever the user sends a request for something, it always goes through the controller. The controller is responsible for intercepting requests from the view and passing them to the model for the appropriate action. After the action has been taken on the data, the controller is responsible for directing the appropriate view to the user. In GUIs, the views and the controllers often work very closely together.

However, the web framework I used, Django, interprets the term MVC in a different way. In Django, the controller is called the “view”, and the view is called the “template”. The “view” describes the data that gets presented to the user: it is not necessarily how the data looks, but which data is presented. The view describes which data you see, not how you see it.


It’s a subtle distinction.

So, in Django's case, a “view” is the Python callback function for a particular URL, because that callback function describes which data is presented. Furthermore, it is sensible to separate content from presentation – which is where templates come in. In Django, a “view” describes which data is presented, but a view normally delegates to a template, which describes how the data is presented.

As for the “controller”, it is probably the framework itself: the machinery that sends a request to the appropriate view, according to the Django URL configuration.

As a result, it can be said that Django is an “MTV” framework – that is, “model”, “template”, and “view”.
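
A minimal sketch of this split (all names are hypothetical, for illustration only, and do not reproduce the actual Web2rism code):

# views.py -- the Django "view": decides WHICH data is presented.
from django.shortcuts import render_to_response
from skyscraper.models import Scraper  # hypothetical app/model

def scraper_status(request):
    scrapers = Scraper.objects.all()
    # Delegate HOW the data looks to the template.
    return render_to_response("scraper_status.html", {"scrapers": scrapers})

# urls.py -- Django itself plays the "controller", routing URLs to views.
from django.conf.urls.defaults import patterns
urlpatterns = patterns("",
    (r"^scrapers/$", "skyscraper.views.scraper_status"),
)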

4.3.2 Layers and Components

To provide a better understanding of the whole system, it is essential to follow the architecture chart below.

4.3.2.1 Data Gathering

Scraper

As explained in section 4.2.1 - Web2rism Scrapers, a scraper in our project is a standalone code script that either uses an API over HTTP, or just uses an HTML/XML parser together with an HTTP connection library. The two methods are explained in detail below:

1) Using APIs

In this method, the script uses the API of the website that needs to be scraped. Compared to scrapers that use parsers, this method is much easier to use, since most of the APIs already return data in JSON or XML formats. For YouTube, Twitter, Technorati and Flickr, I created API-based scraper scripts in Python with the related API wrappers.

One problem that can be faced while using APIs is that the sites offering them may have technical difficulties from time to time. For example, Technorati's service was unstable during the weeks I was coding the Technorati scraper, and they were making frequent changes to their API.


Figure 4.1: System Architecture

While designing apps that use 3rd-party APIs, it is vital to keep up with the changes in the APIs used and to get frequent news and updates about those changes, so that the developed software can be adapted and unexpected errors can be avoided.

2) Using a Parser

In this method, the scraper basically travels to the given URI and starts reading the HTML source line by line. Then, using the parser library, the necessary information is extracted and shaped into the format we would like to save into the knowledge-base. The problem encountered with this method is that a specific kind of page on a website does not have the same format all the time. For instance, for WikiTravel we needed to scrape the “City” pages, where each city is reviewed with sections like “See”, “Do”, “Eat”, “Stay”, etc., and not all the city pages included all of these sections. For a big city like Rome, the page was filled with all kinds of data for all the sections, but for a small town like Lecco, most of these sections were empty. Moreover, the pages' HTML sources can greatly change if the design of the site is changed, and our custom scraper scripts can never be aware of such situations. Custom scrapers run exactly the way we want, but they require much more attention and maintenance than API-based scrapers.

A nice feature of the designed system is that the scraper scripts can work in a language-independent way. Since the scrapers are important building blocks of the system, and the project as a whole is an academic one where many students with different programming backgrounds can be involved, we decided to implement this feature with an eye to the future extensibility of the application. The scraper class (a model, in Django terms) has two attributes named “command” and “script_file”. Using the ScraperManager admin UI, administrators can install new scrapers into the system. Details on how the scrapers are managed are given in the upcoming “ScraperManager” subsection.

ScraperThread

Crawling through big stacks of interconnected pages may take a long time, especially for scrapers that use parsers. To ease the management of scrapers and see their statuses, I created a multi-threaded system that enables running multiple scrapers at the same time. Each time a scraper runs, it creates a thread that runs in the background and saves information about the active thread to the database. It updates this information in the configuration database (SQLite3) when a change occurs in the status of the scraper (such as the completion of the scraping activity or the occurrence of an exception).
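
The sketch below shows the general idea with hypothetical table and script names; it is an illustration of the mechanism, not the actual ScraperThread code:

import threading, subprocess, sqlite3

def run_scraper(scraper_id, command):
    # Mark the scraper as running, execute its script as a subprocess,
    # then record the final status in the configuration database.
    db = sqlite3.connect("scrapermanager.db")
    db.execute("UPDATE scraperthread SET status = ? WHERE scraper_id = ?",
               ("running", scraper_id))
    db.commit()
    try:
        subprocess.call(command, shell=True)
        status = "idle"     # finished; ready to be re-run
    except OSError:
        status = "error"
    db.execute("UPDATE scraperthread SET status = ? WHERE scraper_id = ?",
               (status, scraper_id))
    db.commit()
    db.close()

# Each scraper runs in its own background thread.
t = threading.Thread(target=run_scraper,
                     args=(1, "python twitter_scraper.py"))
t.start()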

ScraperManager

The scraper manager, called “SkyScraper” as will be described in the upcoming subsection “Project Files Organization”, is the application that lets the administrators of Web2rism see the status of the scrapers attached to the system. It is possible to start/stop scrapers, add new scrapers to the system, or deactivate/remove existing ones.


Figure 4.2: ScraperManager application UML Diagram

SkyScraper has a CRON job that is scheduled to run all the active scrapers at certain time intervals. The CRON job checks the ScraperManager configuration database (SQLite3), reads the list of active scrapers and runs them automatically at the exact time they are scheduled to run. Apart from this, the scrapers can be run manually, without being bound to the CRON job.
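
For illustration, such a job could be wired up with a crontab entry of the following shape (the command name and path are hypothetical):

# m h dom mon dow   command
0 * * * * python /path/to/web2rism/manage.py run_due_scrapers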

4.3.2.2 Data Storage

KnowledgeBase

As mentioned in previous sections, Web2rism uses a semantic RDF store. The data is saved in a subject, predicate, object model and queried with SPARQL and SPARUL queries over Jena. In my Django project, I created an interface to easily connect to the KB by posting “get” and “update” queries. These queries are simply sent over HTTP using a normal HTTP POST, and the server is able to return results in XML or JSON formats. Later on, our KnowledgeBase API gets the results and shapes them in the needed way for easier presentation on the UI.
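
A minimal sketch of such a “get” query over HTTP (the endpoint URL is hypothetical; the output parameter follows common SPARQL-server conventions):

import urllib, urllib2
import simplejson

SPARQL_ENDPOINT = "http://localhost:2020/sparql"  # hypothetical

def kb_get(query):
    # POST a SPARQL SELECT query and parse the JSON result bindings.
    params = urllib.urlencode({"query": query, "output": "json"})
    response = urllib2.urlopen(SPARQL_ENDPOINT, params)
    return simplejson.loads(response.read())

results = kb_get("""
    SELECT ?s ?p ?o
    WHERE { ?s ?p ?o . FILTER regex(str(?o), "Lugano", "i") }
    LIMIT 50
""")
for b in results["results"]["bindings"]:
    print b["s"]["value"], b["p"]["value"], b["o"]["value"]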


Figure 4.3: Scraper Manager UI

The data mined by the Web2rism scraper scripts is saved directly into the RDF store in a raw format. For instance, “tweets” collected over Twitter's API have attributes like the author, date/time, text and geo-location of the “tweet” objects. Below can be seen how such attributes are distributed as statements about a resource inside the RDF model.
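
An illustrative set of triples in Turtle notation (the namespace and property names here are hypothetical; the actual Web2rism vocabulary may differ):

@prefix w2r: <http://web2rism.example/ns#> .   # hypothetical namespace

w2r:tweet_1234  w2r:author        "traveller42" ;
                w2r:created_at    "2009-08-15T10:31:00Z" ;
                w2r:text          "Lago di Como is beautiful today!" ;
                w2r:geo_location  "45.81,9.08" .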

For different needs, like asking the RDF store about specific subjects, it is possible to create complex queries. Below is a SPARQL query that asks the KB for “anything” related to the keyword “Lugano”.


Figure 4.4: An example of a SPARQL query used to get data from the KB

Configuration Database

This is a simple SQLite3 database used for the ScraperManager app's ORM. I decided to use a separate DB instead of the KnowledgeBase to make things faster and to be able to use the ORM capability of Django. Django natively supports ORM only for MySQL, SQLite and PostgreSQL; SQLite3 was the easiest to use among them, and it is the best choice for rather small databases with few tables.

Django also offers an automatically generated admin page for easy management of the application models in the project. So if a problem occurs in the ScraperManager UI, it is also possible to manage everything with direct access to the database tables over this admin panel.

4.3.2.3 Data Analysis

KnowledgeBase API

One of the most interesting points of the Web2rism project was to create our own API that lets any developer use the data in the way they want. The API acts as a service connected to the knowledge-base, and it functions as an analyzer and shaper of the raw data coming from the knowledge-base. These analyses include language detection, sentiment analysis, popularity analysis and the ordering of data according to several different filters.

For example, video comments coming from YouTube are saved in the KB in raw format, as if they were coming directly from YouTube's databases. When a reputation analysis is going to be made, the reputation calculator can directly use the KB-API and get the highest rated comments of the videos of the touristic places which the search keyword represents. The API's usage also eases the development process, because it returns data in JSON format; using any programming language, it is possible to read and easily traverse the gathered data.

To convert the data gathered from the KB into JSON, I first gathered the subject-predicate-object formatted data from the RDF store and then assigned the data it contains to objects like “video”, “tweet”, “photo”, etc. I then used the SimpleJson Python library to encode them into JSON strings and output them as an HTTP response.
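
A minimal sketch of such an API view (the helper names are hypothetical; kb_get is the function sketched in the Data Storage part above):

import simplejson
from django.http import HttpResponse

def tweets_api(request):
    # Query the KB, reshape the raw bindings into "tweet" objects,
    # and return them as a JSON HTTP response.
    keyword = request.GET.get("q", "")
    bindings = kb_get(build_tweet_query(keyword))  # hypothetical helper
    tweets = [{"author":  b["author"]["value"],
               "text":    b["text"]["value"],
               "created": b["created"]["value"]}
              for b in bindings["results"]["bindings"]]
    return HttpResponse(simplejson.dumps(tweets),
                        mimetype="application/json")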

Details on how the API methods function can be seen in Chapter 5: Tests and Evaluations.

4.3.2.4 Presentation

The UI for viewing scrapers makes dense use of the Google Charts API, because of the complex graphs that need to be generated from numeric data. For the management parts, the client-side JavaScript library jQuery is used. I decided to keep the interface as simple as possible, since data visualization is already an important issue in usability matters. The whole presentation layer of the project runs over the web, powered by the HTML4 and CSS3 standards.

4.3.3 Project Files Organization

Python's most appealing feature is that it greatly eases the pluggability of Pythonic applications with each other. Related to this, while creating a Django or Python project, laying out the files and folders and separating pieces of code logically to increase pluggability is very important. Explained below is how I created the different Django applications that, when attached together, make up the data gathering and analysis parts of the Web2rism project.

Core

Contains core functions that are used throughout the whole Django project. For instance, the post and get functions that enable interaction with the KB are defined here. In addition, there are some classes that provide custom template tags to be used in the views.


KnowledgeBase API

This is the source of the KB API. The models file is empty here, since I have not used any ORM for mapping business models to the semantic knowledge-base. There are just basic views and class files that do the data analysis work and return JSON-formatted data to the user.

Scrapers

This directory contains the custom scraper scripts that are attached to the system over the scraper manager interface. All the files are uploaded via a web interface, and the scripts can be in any language.

SkyScraper

I named the scraper-manager layer “SkyScraper”, as it resembles an umbrella-like structure acting on top of all the scrapers attached to the system. This directory contains the Scraper and ScraperThread models (classes), as well as the views and URL configuration files of the “skyscraper” Django application.

The schema below shows how the files are organized in the project directory of my implementation.

4.3.4 System Workflow

This subsection describes in detail the workflow of a typical data gathering scenario, consisting of a sequence of connected steps. In each step, the components I designed have specific interactions with each other.

1. A scraper script is coded in any programming language.

2. The scraper script is fed into the scraper manager.

3. The scraper manager assigns a default schedule. This schedule can later be modified by the administrators.

4. According to the schedule, the manager creates a scraper-thread and runs the scraper script at certain time intervals.

5. The thread runs and starts collecting data from the web.


6. The collected data is filtered, analyzed (sentiment) and shaped to be used.

7. The shaped data is saved to the Knowledge-Base.

8. The thread ends and updates its state in the database to idle, ready to re-run.

9. The data gathering and analysis part is over; the user can now query the system using a keyword representing the place whose reputation he wants to learn.

10. The keyword is received and query expansion is applied.

11. The expanded query is sent to the KB API which reads the Knowledge-Base and exposes its results in JSON format.

12. The UI gets the results and displays them to the user.


Chapter 5

Tests and Evaluations

In this chapter, the current status and evaluation of my work in the Web2rism project are presented.

5.0.5 Functional Scrapers

Google Trends / Insights

The scraper that runs over Google Insights for Search and Google Trends gets the regions, cities and languages where the hits for a certain keyword are coming from on Google's search engine. It is also able to get the related news from Google News. Currently, it is not connected to the RDF store; it works on the fly, just displaying what it can find in the CSV file generated by Google's service. An example of what the Trends CSV contains can be seen here: http://www.google.com/trends/viz?q=lugano&graph=all_csv&sa=N

Google Blogsearch

The Google BlogSearch scraper can get the titles and excerpts of posts containing the search keyword, as well as the authorship properties. The data is scraped from the ATOM feed generated by custom searches on Google's side. In addition, there is an on-the-fly calculation of which author has written the most blog posts about the queried keyword; for instance, for the keyword “Lugano”, it shows that the top blogger is Prof. Lorenzo Cantoni from USI. It can also show a timeline graph of blog post densities related to a keyword, grouped by days. As a reference, it also lists the latest posts of the search. An example of the ATOM-formatted blog search results can be seen here: feed://blogsearch.google.com/blogsearch_feeds?hl=en&q=lugano&output=atom


Figure 5.1: A view from Google BlogSearch Scraper’s findings

Technorati

The Technorati scraper runs using Technorati's native API. It can list the number of blog entries found and the post densities varying by date. As with Google BlogSearch, the graphs are generated using the Google Charts API. There is also the top-authors calculation.

Twitter

Twitter's API is quite powerful, and I have tried to get as much as possible out of the 140-character texts. Currently, Web2rism's Twitter scraper calculates the ratio of tweets containing the search keyword out of 1000 tweets posted in the last 7 days. It also brings up who the most active Twitter users are (those who have been writing the most tweets about the queried keyword). The Twitter scraper works fully integrated with the Knowledge-Base, saving tweet object data to the RDF store. There are two different scrapers: an older version that works on the fly and makes the calculations and


Figure 5.2: A view from Technorati Scraper’s findings

measurements explained above, and another one that works by saving data to the RDF store while the analysis is done over the API.

Flickr

The Flickr scraper runs over Flickr's own API, and it is able to list the “most interesting” photos, the number of photos, the top authors and a photos-by-date timeline graph. Although not fully complete, the analysis tool on the API also provides a function that lets users query a specific area by adjusting the radius around a target location, to see the photos posted by Flickr users inside that zone.

YouTube

The YouTube scraper runs over Google's native GData API. It is able to fetch the top videos for the query keyword, get their ratings, calculate average ratings, get comments, calculate comment ratings, get favorited content and get the tags associated with the videos. There are also options like


Figure 5.3: A view from Twitter Scraper’s findings

Figure 5.4: A view from Flickr Scraper’s findings


getting the top authors, the most used tags, and so on. The GData API is really powerful and easy to use, therefore it can be extended greatly for future releases too. Currently, the YouTube scraper does not work over the RDF store but just displays data on the fly.

Figure 5.5: A view from YouTube Scraper’s findings

WikiTravel

The WikiTravel scraper uses parser libraries and runs somewhat problematically, because of WikiTravel's unstable and not fully uniform page formats. It is able to get the recommended places to see for the queried city or country, and to list the top contributing authors.

5.0.6 KnowledgeBase & Scraper Management

The Knowledge-Base API provides the query expansion functions that use the Google Suggest and Yahoo Search Suggestion interfaces. In addition, there are functions for querying the Knowledge-Base for tweets and YouTube videos. All the functions return results in JSON format. The scraper manager is also fully functional, letting the administrators of Web2rism attach new scraper scripts to the system. Linked with the scraper management


Figure 5.6: A view from WikiTravel Scraper’s findings

screen, there is a script upload form on the manager UI. The form asks for the name, the description and the UNIX command that will be required to run the scraper script over the CRON job on a scheduled basis. It is possible to see the active scrapers' threads and their statuses, start/stop them, or delete and create new scrapers.

5.0.7 System Performance

In terms of system resources, it is foreseen that an online reputation analysis tool will consume a lot, but we have not yet been able to test the whole system running all the scrapers continuously. This is because the project is still in a heavy development phase and more time is required until the test phase comes. However, considering how resource-hungry the system is, an innovative idea like distributed computing on Hadoop clusters can be applied. Especially after the RDF store gets filled with huge amounts of data, the system will need really powerful server machines and CPU / memory / cache usage optimizations. To make the system scalable, I see the opportunity to try cloud computing on services like Google AppEngine or Amazon's S3, or to use specialized software frameworks for the distributed processing of large data sets on compute clusters.


Figure 5.7: A view from a call to the KB API to get “Tweets” about Italy


Chapter 6

Conclusion

What characterizes this project most is the constructive approach taken, and the assembly of a set of tools that power the final big picture. The Web2rism project spans two years, and although the core is complete, the project as a whole is far from finished. This chapter has two sections: the first explains the current status of the work, and the second lists my opinions on possible future enhancements.

6.1 Current Status of the Work

As discussed in detail in the preceding chapters, Web2rism currently has a functional auto-scraping system that saves data into an RDF Knowledge-Base. During the development phase, I needed to demo my work from time to time to the other people involved, therefore I started by building scrapers working on the fly (not saving any data anywhere, just showing what is scraped). Later on, I started converting these scrapers to work in conjunction with the Knowledge-Base. Currently, the Twitter and YouTube scrapers are working in RDF mode. The Knowledge-Base is also functional for both of these scrapers, providing functions that are callable over HTTP to get data in JSON format from the store. In addition, the ScraperManager application is functional, both usable over the UI I created and accessible through the native Django admin panel. As a final conclusion of the research and studies made, we were satisfied with the system architecture in general and with the model developed for an e-tourism centric ORM application.

The current status of the overall Web2rism application can be seen at this URL: http://web2rism.ath.cx/

The latest version of my work, which we named “Django2rism”, is available here: http://django2rism.ath.cx/

6.2 Future Work

This section analyzes the required features, and their characteristics, that I find necessary for a marketable version of the product.

6.2.1 Extending the Data Gathering Layer

Web2rism is all about measuring the “buzz density” on the web about a specific brand, product or place. So, as in every ORM or ORA tool, the most essential components of the system are its scrapers. As future work, the existing scrapers should be extended to work more efficiently, and new scrapers should be coded. For example, Lonely Planet (www.lonelyplanet.com) has been a quite good and trustworthy source for me during the time I travelled in Europe, and it contains hotel and touristic destination reviews. In addition, they are growing rapidly, increasing their content size. TripAdvisor is another example, as it is one of the most popular trip planning sites. However, TripAdvisor does not have an API of its own, and the site pages are extremely crowded and big in size, which makes scraping them with parsers harder.

In addition, I see the need to create a generic “buzz data” model that will help us organize and shape the collected data. Right now, most things happen on the fly, and the collected data is not being shaped and interlinked with other scraped data. For instance, the “buzz” object can have attributes like “author”, “create_date”, “text”, “geo_location”, “sentiment_positiveness”, “view_estimation”, “site_url” and so on. As you may have noticed, this “buzz” object can be used to shape data coming from any UGC site. Be it Twitter, YouTube, WikiTravel or TripAdvisor, every piece of content on a UGC site has an author, a creation date, a location where the posting was made, an indication of whether it is favorited or not, or how much rating it has. If the collected data is shaped according to such a generic object, I believe it will be much easier to sift through the whole buzz collection and calculate the overall reputation of the brand or product being analysed.
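
A minimal Django model sketch of such a generic object (the field names follow the attributes listed above; everything here is illustrative, not implemented code):

from django.db import models

class Buzz(models.Model):
    # One generic piece of UGC "buzz", whatever site it was scraped from.
    author                 = models.CharField(max_length=100)
    create_date            = models.DateTimeField()
    text                   = models.TextField()
    geo_location           = models.CharField(max_length=50, blank=True)
    sentiment_positiveness = models.FloatField(null=True)
    view_estimation        = models.IntegerField(null=True)
    site_url               = models.URLField()
    source_site            = models.CharField(max_length=50)  # e.g. "twitter"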


6.2.2 Optimization & Scalability Issues

Web2rism's scrapers and management panels work fine when only a few clients are connected and querying the system, but I see there is a lot of work to be done to enable the system to serve a high number of users. First of all, constantly scraping remote sources and saving data to local databases consumes bandwidth and storage space. Here, distributed computing mechanisms can be used instead of working on only one computer. For example, lately the distributed computing scene has seen good examples of the usage of Hadoop from the Apache community. Hadoop is an open-source implementation of frameworks for reliable, scalable, distributed computing and data storage. It has its own file system and a high-performance coordination service for distributed applications. Considering Hadoop's trend increasing day by day, and foreseeing Web2rism as a large-scale application being developed in an academic environment, I think future scalability problems can be solved in this way.

6.2.3 A Logging System

Currently, the system does not have advanced exception handling capabilities. There are many open source error logging and tracking frameworks available; one should be chosen and attached to the system for better management.

6.2.4 Easier Interaction with the RDF Store

Although I was not involved in the data storing layer, I was not really comfortable using the Jena API over the Joseki RDF server, and I tended towards using an RDF engine compatible with the Django project I was developing. For this, I wanted to use a pluggable engine named “django-rdf”. Django-RDF is an RDF engine implemented as a generic, reusable Django app, providing complete RDF support to Django projects without requiring any modifications to existing framework or app source code, or incurring any performance penalty on existing control flow paths. The biggest obstacle to adopting a web-framework-reliant RDF engine in the system was obvious: Django-RDF works only with Django and is not usable with programming languages other than Python. Web2rism's web layer could be changed to PHP, ASP or other similar web technologies in the future, and the team did not want to use a language-dependent framework. However, in my opinion, with a stable Django RDF engine it is possible to do web development using Django just as you are used to, then turn the knob and - with no additional effort - expose your project on the semantic web.

6.2.5 More Organized Author and Location Classification

This is a harder issue and may require some research, but there is a vital need especially for “authorship recognition”. By authorship recognition, we mean understanding that two accounts belong to the same user, for example by checking whether a person's nickname is the same on two different UGC sites. Similar to the author recognition work, a location classification should be done; here, conversion of the collected buzz locations into GPS coordinates can be useful. Both of these issues are included in the “Reputation Analysis” method, but they have not been included in the software yet.

6.2.6 UI Enhancements

Currently, the project's user interfaces lack the attractiveness and usefulness of slick web apps. There is the need to create a basic yet useful UI based on usability tests and the needs of the customers. Of course, this is a step that has to be taken much later in the whole development process, since most of the required features rely on server-side coding.