schema.org usage for hotels


© Copyright 2015 STI INNSBRUCK www.sti-innsbruck.at

Elias Kärle – June 15th, 2015 – OC Meeting

schema.org usage for hotels

An analysis based on the Web Data Commons data set


Contents

1. Motivation

2. Data

3. Analysis


1. Motivation


• Dieter Fensel has a Wikipedia page


• Italian swimmer vs. @cyberandy
• How did he do it?


• Schema.org annotation

• Hotels and tourism: do they use annotations?


1) How many hotels use schema.org?

2) How is schema.org used?
   1) Which classes?
   2) Which attributes?
   3) Is schema.org used correctly?

3) Who is using schema.org in tourism?

2. Data

What is schema.org?

• Initiative founded in 2011
• Vocabulary for structuring data in web sites
• Embedded into HTML
  – Microdata
  – RDFa
  – JSON-LD

Source: http://www.schema.org
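
To make the embedding idea concrete, here is a minimal sketch of a schema.org Hotel annotation in JSON-LD, generated with Python. The hotel, its address, and the helper name `to_jsonld_script` are made up for illustration; they do not come from the analysed data.

```python
import json

# Hypothetical hotel: every value below is invented for illustration.
hotel = {
    "@context": "http://schema.org",
    "@type": "Hotel",
    "name": "Example Hotel",
    "url": "http://www.example.com/",
    "telephone": "+43 000 000000",
    "address": {
        "@type": "PostalAddress",
        "streetAddress": "Example Street 1",
        "postalCode": "6020",
        "addressLocality": "Innsbruck",
        "addressCountry": "AT",
    },
}


def to_jsonld_script(entity: dict) -> str:
    """Wrap a schema.org entity in the script tag used to embed JSON-LD in HTML."""
    body = json.dumps(entity, indent=2, ensure_ascii=False)
    return f'<script type="application/ld+json">\n{body}\n</script>'


snippet = to_jsonld_script(hotel)
print(snippet)
```

The printed snippet would go into the page's HTML. Microdata and RDFa express the same information as attributes on the visible HTML elements instead.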


Analysis of all web sites: Common Crawl

• Founded in 2007
• Non-profit organisation
• Crawls the web 4 times per year
• Data dumps are openly available to the public
• November 2013: 2.3 billion web pages, 148 TB
• December 2014: 2.1 billion web pages, 160 TB

Source: http://commoncrawl.org/the-data/get-started/


Only survey structured data:

WebDataCommons:
• 2012: Freie Universität Berlin & KIT
• Currently University of Mannheim
• Operated by Chris Bizer
• Extracts structured data from the Common Crawl
  – WebTables: 147 million relational tables (11 billion HTML tables)
  – Hyperlink Graph: 3.5 billion web pages, 128 billion links
  – Semantically annotated data:
    • November 2013: 44 TB, 2.2 bn URLs
    • December 2014: 160 TB, 2 bn URLs

Source: http://webdatacommons.org/structureddata/


• November 2013 corpus

• Subset: schema.org/Hotel
  – 35 GB
  – 127 million triples

• OWLIM-SE Repository – thanks Ontotext

• SPARQL Queries

• Linux Debian 3.2, STI – thanks David
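
The analysis itself ran as SPARQL queries against an OWLIM-SE repository. As a rough stand-in that needs no triple store, the sketch below shows how entities typed as schema.org/Hotel could be picked out of a dump, assuming the data comes as N-Quads lines (subject, predicate, object, page URL). The sample quads and the function name `hotel_subjects` are made up for illustration.

```python
import re

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
SCHEMA_HOTEL = "http://schema.org/Hotel"

# Matches quads whose predicate, object and graph are IRIs. Quads with a
# literal object never state a type, so they simply do not match.
TYPE_QUAD = re.compile(
    r'^(?P<s>\S+)\s+<(?P<p>[^>]+)>\s+<(?P<o>[^>]+)>\s+<(?P<g>[^>]+)>\s*\.\s*$'
)


def hotel_subjects(lines):
    """Return the set of (subject, page URL) pairs typed as schema.org/Hotel."""
    found = set()
    for line in lines:
        m = TYPE_QUAD.match(line.strip())
        if m and m["p"] == RDF_TYPE and m["o"] == SCHEMA_HOTEL:
            found.add((m["s"], m["g"]))
    return found


# Made-up sample quads in the assumed N-Quads layout.
sample = [
    '_:n1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Hotel> <http://example.com/a> .',
    '_:n1 <http://schema.org/Hotel/name> "Example Hotel" <http://example.com/a> .',
    '_:n2 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Restaurant> <http://example.com/b> .',
]

hotels = hotel_subjects(sample)
print(len(hotels))
```

Counting the returned pairs corresponds to counting the rdf:type statements for Hotel; it says nothing yet about how many distinct real-world hotels are behind them.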

3. Analysis

1) How many hotels are annotated with schema.org?

4,841,353
• Hotels are annotated several times
  – own website
  – booking websites

740,298 (counting distinct names)
• Loses all hotels that share a name
  – Adler, Post, ...

Bind to address!
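
The difference between counting by name and binding the name to the address can be sketched on a handful of made-up records. The hosts, names and addresses below are invented, and the `normalise` helper is only an assumption about how values might be compared.

```python
# Made-up records: (page host, hotel name, street, ZIP).
records = [
    ("hotel-adler.example", "Adler", "Main Street 1", "6020"),
    ("booking.example", "Adler", "Main Street 1", "6020"),      # same hotel, second source
    ("adler-seefeld.example", "Adler", "Lake Road 5", "6100"),  # different hotel, same name
    ("post.example", "Post", "Market Square 2", "6370"),
]


def normalise(text: str) -> str:
    """Lower-case and collapse whitespace so trivial variations compare equal."""
    return " ".join(text.lower().split())


# Counting by name alone merges the two different "Adler" hotels.
by_name = {normalise(name) for _, name, _, _ in records}

# Binding the name to the address keeps them apart, while the duplicate
# annotation from the booking site still collapses into one hotel.
by_name_and_address = {
    (normalise(name), normalise(street), normalise(zip_code))
    for _, name, street, zip_code in records
}

print(len(records), len(by_name), len(by_name_and_address))
```

Here four annotations shrink to two names but three hotels, which mirrors the gap between the raw count and the name-only count above.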


Hotel: 4,841,353
Address: 3,035,000
Country: 1,904,000
Name: 1,125,000
Region: 1,902,000
ZIP: 2,011,000
Street: 2,284,000


Hotels per Country

Austria: 148
Tirol: 287
Innsbruck: 63

Rank  Country  Hotels
 1.   US       1021513
 2.   CA       52360
 3.   CN       20648
 4.   GB       11580
 5.   DE       3163
 6.   MX       1921
 7.   PR       1250
 8.   AR       1016
 9.   PH       765
10.   IN       699
11.   TR       681
12.   AE       391
13.   KR       377
14.   RO       373
15.   QA       343
16.   PA       299
17.   SA       292
18.   AU       290
19.   BR       258
20.   CH       238
21.   TH       234
22.   SR       217
23.   HK       156
24.   EC       150
25.   AT       148
26.   CO       143
27.   PE       129
28.   BE       127
29.   ID       109
30.   BH       93

Obvious annotation errors (for example, more hotels in Tirol than in Austria as a whole)


Hotels grouped by ZIP in Tirol (pie chart)

ZIP    Place        Share
6020   Innsbruck    18%
6370   Kitzbühel    10%
6100   Seefeld      8%
6450   Sölden       4%
6580   St. Anton    4%
6456   Obergurgl    3%
6215   Achenkirch   2%
6213   Pertisau     2%
6365   Kirchberg    2%
6010                2%
other               45%


What categories of hotels are annotated?

http://schema.org/Rating


Hotel: 4,841,353
Address: 3,035,000
Country: 1,904,000
Name: 1,125,000
Region: 1,902,000
Rating: 2,377,000
RatingValue: 2,375,000


What categories of hotels are annotated?

[Bar chart; category labels not recoverable. Bar values in extraction order: 866,932; 651,606; 426,925; 176,800; 135,958; 35,079; 66,208; 15,476; 941]


2) How is schema.org used?

Pie chart "schema.org usage":

Property                                           Share
http://schema.org/Hotel/name                       15%
http://schema.org/Hotel/review                     14%
http://www.w3.org/1999/02/22-rdf-syntax-ns#type    13%
http://schema.org/Hotel/image                      9%
http://schema.org/Hotel/address                    8%
http://schema.org/Hotel/aggregateRating            7%
http://schema.org/Hotel/rating                     6%
http://schema.org/Hotel/description                5%
http://schema.org/Hotel/url                        5%
http://schema.org/Hotel/geo                        4%
Other                                              13%

Property                                           #          %
http://schema.org/Hotel/name                       5666474    117.0432
http://schema.org/Hotel/review                     5226132    107.9478
http://www.w3.org/1999/02/22-rdf-syntax-ns#type    4841353    100
http://schema.org/Hotel/image                      3439579    71.04582
http://schema.org/Hotel/address                    3035301    62.6953
http://schema.org/Hotel/aggregateRating            2723587    56.25673
http://schema.org/Hotel/rating                     2377406    49.10623
http://schema.org/Hotel/description                1934486    39.95755
http://schema.org/Hotel/url                        1749830    36.14341
http://schema.org/Hotel/geo                        1323333    27.33395
http://schema.org/Hotel/telephone                  1124948    23.23623
http://schema.org/Hotel/faxNumber                  703274     14.52639
http://schema.org/Hotel/photo                      642159     13.26404
http://schema.org/Hotel/openingHours               558353     11.533
http://schema.org/Hotel/logo                       549525     11.35065
http://schema.org/Hotel/branchof                   369942     7.641294
http://schema.org/Hotel/additionalType             308168     6.365328
http://schema.org/Hotel/photos                     224887     4.645127
http://schema.org/Hotel/maps                       86935      1.795676
http://schema.org/Hotel/breadcrumb                 82122      1.696261
http://schema.org/Hotel/priceRange                 52071      1.075546
http://schema.org/Hotel/price                      37634      0.777345
http://schema.org/Hotel/email                      31854      0.657957
http://schema.org/Hotel/event                      24838      0.513038
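
The percentage column appears to be each property count divided by the number of rdf:type statements for Hotel (4841353, shown as 100). A short sketch that recomputes a few rows from the counts above; the short dictionary keys are labels chosen here, not identifiers from the data set.

```python
# Counts copied from the table above; keys are shortened labels.
counts = {
    "rdf:type": 4841353,
    "name": 5666474,
    "review": 5226132,
    "image": 3439579,
    "address": 3035301,
}

hotels = counts["rdf:type"]  # denominator: one type statement per annotated hotel

# Percentage of each property relative to the number of annotated hotels.
percent = {prop: 100 * n / hotels for prop, n in counts.items()}

for prop, p in percent.items():
    note = " (more than one per hotel on average)" if p > 100 else ""
    print(f"{prop:10s} {p:9.4f}{note}")
```

Values above 100, as for name and review, mean the property occurs more than once per annotated hotel on average.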


3) Who uses schema.org in tourism?

Hypothesis:

"Schema.org is mainly used by booking and rating websites, barely by hotels themselves."


Approach:
• Hotels on booking & rating sites
  – Search for annotation on own web site
• Countercheck with annotated hotel websites
  – Multiple appearance in data set?

Currently: exemplary (top booking sites)

Next step: full data set
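
One way the full-data-set step might be approached is to group annotated hotels by the host of the page they were found on: a host that annotates many different hotels is probably a booking or rating site, a host with a single hotel is probably the hotel's own site. This is only a sketch under that assumption, not the method used in the talk; the URLs, names and the `PORTAL_THRESHOLD` cut-off are made up.

```python
from collections import defaultdict
from urllib.parse import urlparse

# Made-up (page URL, hotel name) pairs standing in for extracted annotations.
pages = [
    ("http://booking.example/hotel/adler", "Adler"),
    ("http://booking.example/hotel/post", "Post"),
    ("http://booking.example/hotel/krone", "Krone"),
    ("http://www.hotel-adler.example/", "Adler"),
    ("http://www.hotel-adler.example/rooms", "Adler"),  # same annotation on a subpage
]

PORTAL_THRESHOLD = 3  # hypothetical cut-off: distinct hotels per host


def hotels_per_host(pairs):
    """Map each host to the set of distinct hotel names annotated on it."""
    names = defaultdict(set)
    for url, name in pairs:
        names[urlparse(url).netloc.lower()].add(name.strip().lower())
    return names


per_host = hotels_per_host(pages)
portals = {host for host, names in per_host.items() if len(names) >= PORTAL_THRESHOLD}
own_sites = set(per_host) - portals

print(sorted(portals), sorted(own_sites))
```

Using a set of names per host also keeps a hotel that repeats the same annotation on every subpage from being counted more than once.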


Summary:

• Main user of schema.org/Hotel: booking and rating sites

• Errors:
  – incomplete
  – wrong classes
  – wrong attributes
  – wrong datatypes

• Comprehensive error analysis: Uni Mannheim (R. Meusel & H. Paulheim) [1]

[1] http://dws.informatik.uni-mannheim.de/fileadmin/lehrstuehle/ki/pub/MeuselPaulheim-HeuristicsForFixingCommonErrorsInDeployedSchemaOrgMicrodata-ESWC2015.pdf


Annotation "Hotel" is correct, but identical on every subpage!


What did we do with that knowledge?

• Talk at TFF 2015 in Mayrhofen

• Paper for SEMANTICS 2015

• Consulting for some TFF participants
  – Hotel Adlers Innsbruck
  – Hotel in Seefeld
  – ...