How is the Semantic Web Being Used? An Analysis of the Billion
Triples Challenge Corpus
Mike Dean, Principal Engineer, BBN Technologies, mdean@bbn.com
1
Assumptions
• Technology – Intermediate
  – Familiarity with RDF and OWL
• Interest in
  – Semantic Web usage patterns
  – Semantic Web Challenge
2
Presenter Background
• Principal Engineer at BBN Technologies (1984-present)
• Principal Investigator for DARPA Agent Markup Language (DAML) Integration and Transition (2000-2005)
  – Chaired the Joint US/EU Committee that developed DAML+OIL and SWRL
• Developer and/or Principal Investigator for many Semantic Web tools, datasets, and applications (2000-present)
• Member of the W3C RDF Core, Web Ontology, and Rule Interchange Format Working Groups
  – Co-editor of the W3C OWL Reference
• Member of the Semantic Web Challenge Advisory Board since its inception
• Local co-chair for ISWC2009
• Other SemTech presentations
  – Semantic Query: Solving the Needs of a Net-Centric Data Sharing Environment (2007, w/ Matt Fisher)
  – Semantic Queries and Mediation in a RESTful Architecture (2008, w/ John Gilman and Matt Fisher)
  – Use of SWRL for Ontology Translation (2008)
  – Semantic Web @ BBN: Application to the Digital Whitewater Challenge (2009, w/ John Hebeler)
3
Semantic Web Challenge
• Founded in 2003 by Michel Klein and Ubbo Visser
• Demonstrates the value of the Semantic Web through applications
• Submissions evaluated according to a set of minimal requirements and additional desirable features
• Has become an annual event at International Semantic Web Conferences
  – 22 submissions in 2008
4
2008 Billion Triples Challenge
• A new Semantic Web Challenge track in 2008
  – Do "something interesting" with a large subset of a billion provided triples
  – Co-chaired by Jim Hendler and Peter Mika
• 12 real web data sets
  – Not a scientific sample
  – Enough to be interesting and probably representative
  – Stable snapshot
• Our analysis initially arose from discussing a possible application
  – We now know "yes, there is enough data to support what we wanted to do"
  – Tools and techniques should be generally applicable to other corpora
5
2008 Billion Triples Corpus

Data Set   Format   Triples   URLs   Size   Composition
Webscope WARC 82,768,342 1,979,022 2.7 GB Heterogeneous
Falcon WARC 32,512,340 541,518 834 MB Heterogeneous
Swoogle WARC 174,981,639 1,468,766 3.2 GB Heterogeneous
Watson WARC 59,750,019 130,701 267 MB Heterogeneous
SWSE-1 WARC 30,346,451 194,259 4 GB Heterogeneous
SWSE-2 WARC 60,504,716 389,107 2.4 GB Heterogeneous
DBpedia tar.gz 110,241,463 29 1.9 GB Homogeneous
Geonames WARC 69,778,255 6,668,395 3.4 GB Homogeneous
SwetoDBLP tar.gz 14,936,600 1 167 MB Homogeneous
WordNet tar.gz 1,942,887 1 17 MB Homogeneous
Freebase tar.gz 63,069,952 1 569 MB Heterogeneous
US Census tar.gz 445,752,172 1 3.3 GB Homogeneous
TOTAL   1,146,584,836   11,371,801   22.8 GB
http://www.cs.vu.nl/~pmika/swc/btc.html
6
Data Set Characterization
• Metrics that can impact selection/tuning of KB implementations
  – Statement count
  – Number of classes and predicates
  – Statements per subject/predicate/object
  – Degree of interconnectedness (percentage of non-literal statements, with/without rdf:type)
  – RDFS and OWL reasoning employed
  – Use of reification
7
Analysis
• Stream processing of the compressed data set archives
  – Statement counts
  – Datatype, language, predicate, and type counts
  – Use of RDF, RDFS, OWL, FOAF, and other vocabularies
  – (May include duplicate statements)
• Load each dataset into its own Parliament KB
  – (Eliminates duplicates within dataset)
  – (Both programs used code based on Peter Mika's WARC example with the OpenRDF RIO parser and no inference)
• Process the statement and resource tables
  – Mark each node as resource and/or literal
  – URI, blank node, and literal counts
  – Chain length statistics and histograms
  – (Parliament worked very well here. Each operation took 1-736 seconds.)
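The stream-processing pass can be sketched in Python. This is a minimal illustration, assuming the archives have been converted to gzipped N-Triples (the actual analysis read the WARC archives with code based on Peter Mika's example and the OpenRDF RIO parser); the function names here are illustrative, not the published code:

```python
import gzip
from collections import Counter

def stream_predicates(path):
    # N-Triples is line-oriented, and the subject and predicate terms
    # never contain unescaped whitespace, so the second whitespace-
    # separated token on each line is always the predicate IRI.
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            yield line.split(None, 2)[1]

def count_statements(path):
    # Total statement count plus per-predicate counts, computed in a
    # single pass without building an in-memory model.
    counts = Counter(stream_predicates(path))
    return sum(counts.values()), counts
```

Duplicate statements are intentionally not eliminated here, matching the caveat above; deduplication happens only when each dataset is loaded into its own KB.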
8
Stream Processing
• Many Semantic Web tools provide streaming parsers rather than, or in addition to, model access
  – Analogous to XML SAX vs. DOM
• For suitable applications, this can be much faster than loading statements into a KB
• Streaming analysis of the 2009 corpus ran at an overall rate of 103K statements/second on a Mac laptop with a portable external disk
  – Compare to loading 10-20K statements/second on a server
9
Classes and Predicates

Data Set   Classes   Predicates
Webscope 724 782
Falcon 19,660 29,248
Swoogle 33,318 33,981
Watson 13,660 18,091
SWSE-1 115 1,040
SWSE-2 104 625
DBpedia 4 288
Geonames 1 17
SwetoDBLP 11 145
WordNet 22 41
Freebase 0 5,008
US Census 8 1,682
10
Statements
• Statement (subject, predicate, object)
  – Resource object
    • rdf:type predicate
    • Other predicate
  – Literal object
    • rdf:datatype
    • Plain literal
      – xml:lang
      – Neither datatype nor language
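This taxonomy maps directly onto a small classifier. A hedged sketch over terms in N-Triples syntax (the `classify` helper and its bucket names are mine, chosen to match the columns of the statement-percentage table, not taken from the analysis code):

```python
RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"

def classify(predicate, obj):
    # Bucket one statement into the five columns of the statement
    # table: rdf:type, other resource object, typed literal,
    # language-tagged literal, or plain literal with neither.
    # In N-Triples syntax a literal object starts with '"'; the
    # datatype (^^) or language tag (@) follows the closing quote.
    if obj.startswith('"'):
        suffix = obj.rsplit('"', 1)[1]
        if "^^" in suffix:
            return "rdf:datatype"
        if "@" in suffix:
            return "xml:lang"
        return "neither"
    return "rdf:type" if predicate == RDF_TYPE else "resource"
```

Tallying `classify` over a dataset and dividing by the statement count reproduces the percentages in the table on the next slide.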
11
Statement % (distinct values)
12
Dataset rdf:type rdf:resource rdf:datatype xml:lang Neither
Webscope 24 (724) 32 3 (10) 14 (93) 27
Falcon 16 (19,660) 50 16 (72) 9 (252) 18
Swoogle 15 (33,318) 39 2 (87) 18 (280) 26
Watson 16 (13,660) 40 2 (79) 29 (162) 13
SWSE-1 13 (115) 53 0 (1) 32 (6) 1
SWSE-2 13 (104) 53 0 (1) 32 (15) 1
DBpedia 0 (4) 91 0 (6) 8 (1) 0
Geonames 10 (1) 49 0 (0) 1 (342) 41
SwetoDBLP 18 (11) 28 14 (4) 0 (0) 41
WordNet 24 (22) 30 0 (1) 46 (1) 0
Freebase 0 (0) 62 0 (0) 19 (169) 19
US Census 0 (8) 19 78 (2) 0 (0) 3
Resources and Literals
• Node
  – Resource
    • URI
    • Blank Node
  – Literal
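The node breakdown can be computed the same way. A minimal sketch, again over N-Triples-syntax terms and tallying subject and object positions (predicates are always URIs); this is one plausible reading of how the Node % table was derived, and the function names are illustrative:

```python
from collections import Counter

def node_kind(term):
    # N-Triples syntax: URIs are wrapped in <...>, blank nodes start
    # with "_:", and everything else is a literal.
    if term.startswith("<"):
        return "uri"
    if term.startswith("_:"):
        return "blank"
    return "literal"

def node_percentages(triples):
    # Tally node kinds over subject and object positions, then
    # convert the counts to percentages.
    counts = Counter()
    for s, p, o in triples:
        counts[node_kind(s)] += 1
        counts[node_kind(o)] += 1
    total = sum(counts.values())
    return {k: 100.0 * v / total for k, v in counts.items()}
```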
13
Node %Data Set URI Blank Node Literal
Webscope 24 53 23
Falcon 56 13 31
Swoogle 31 34 41
Watson 29 32 40
SWSE-1 39 36 25
SWSE-2 35 42 23
DBpedia 74 0 26
Geonames 45 0 55
SwetoDBLP 27 17 56
WordNet 55 0 45
Freebase 52 0 48
US Census 0 98 2
14
Chain Lengths
• How long are the linked-list chains used by Parliament?
  – How many statements share the same subject, predicate, or object?
• Histograms proved unwieldy
  – Presenting summary statistics instead
• rdf:type statements significantly impact results
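Under this definition a chain length is just the multiplicity of a term at a given statement position, so the summary statistics reduce to a counting pass. A minimal sketch (the function name is illustrative; Parliament itself derives these figures from its internal statement and resource tables):

```python
from collections import Counter
from statistics import mean, pstdev

def chain_stats(triples, position):
    # Chain length = number of statements sharing the same term at a
    # position (0 = subject, 1 = predicate, 2 = object), mirroring
    # Parliament's per-resource linked lists.
    chains = Counter(t[position] for t in triples)
    lengths = list(chains.values())
    return mean(lengths), pstdev(lengths)
```

Filtering rdf:type statements out of `triples` before calling this shows their outsized effect on the object-chain statistics noted above.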
15
Mean chain lengths (std dev)

Data Set   Subject   Predicate   Object   Literal Object
Webscope 3.96 (9.77) 87,900 (722,575) 3.43 (2170) 4.33 (659)
Falcon 4.22 (13) 983 (31,773) 2.56 (328) 2.31 (217)
Swoogle 5.65 (36) 4,464 (188,023) 3.27 (1,793) 3.38 (569)
Watson 5.58 (56) 3,040 (98,288) 2.87 (918) 2.91 (407)
SWSE-1 5.25 (15) 25,404 (289,000) 2.46 (1,138) 2.29 (187)
SWSE-2 5.37 (15) 83,773 (739,736) 2.89 (1,741) 2.87 (300)
DBpedia 15 (39) 300,855 (3,560,666) 3.84 (148) 1.17 (22)
Geonames 10.4 (1.66) 4,096,150 (3,167,048) 2.81 (1,623) 1.67 (15)
SwetoDBLP 5.63 (3.82) 103,009 (325,380) 2.93 (629) 2.36 (168)
Wordnet 4.18 (2.04) 47,387 (100,907) 2.53 (295) 2.39 (271)
Freebase 4.45 (15) 12,329 (316,363) 2.79 (1,286) 1.83 (116)
US Census 5.39 (9.18) 265,005 (1,921,537) 5.29 (15,916) 227 (115,616)
16
RDF/RDFS/OWL Usage
• 80,309,558 rdf:type statements in 11 data sets
• 4,033,540 rdfs:subClassOf statements in 6 data sets
• 2,988,396 owl:Class instances in 6 data sets
• 1,492,214 rdf:_1 statements in 7 data sets
• 1,042,032 owl:Restriction instances in 5 data sets
• 480,771 owl:sameAs statements in 9 data sets
• 299,962 rdfs:Class instances in the same 6 data sets as owl:Class
• 265,124 rdfs:domain statements in 6 data sets
• 252,175 rdfs:range statements in 6 data sets
• ~238,000 reified statements in 4 data sets
• 50,482 instances of rdf:Bag in 5 data sets
• 22,154 instances of owl:Ontology in 5 data sets
• 14,913 owl:imports statements in 3 data sets
• 83 rdf:_2000 statements in 3 data sets
• 1 rdf:_10763 statement in 1 data set
17
Popular Vocabularies
• FOAF
  – 29,308,169 Person instances in 7 data sets
  – 25,864,527 knows statements in 6 data sets
• Dublin Core
  – 43,591,844 title statements in 7 data sets
  – 4,416,716 date statements in 6 data sets
• Geospatial
  – 7,075,380 wgs84_pos:lat statements in 9 data sets
  – 4,436 georss:point statements in 5 data sets
• SKOS
  – 6,619,912 subject statements in 4 data sets
  – 403,912 Concept instances in 4 data sets
• RSS 1.0
  – 2,893,750 item instances in 6 data sets
• OWL-S
  – 92 Profile instances (OWL-S 0.9-1.2) in 3 data sets
• OWL-Time
  – No usage?
18
Errors
• 95,937 Java exceptions
• Lots of bad languages and datatypes
• Lots of namespace/URI typos/confusion
• Slightly different statement counts, due to exceptions, duplicates, etc.
  – 1,063,616,774 statements (4% less)
19
Crawled Data
• Webscope, Falcon, Swoogle, Watson, SWSE-1, and SWSE-2 consisted of crawled data from a wide range of sites
  – Included some data I published in 2002
20
DBpedia
• Information extracted from Wikipedia pages
• Example
<http://dbpedia.org/resource/San_Jose%2C_California>
    rdfs:label "San Jose, California"@en ;
    dbpedia:officialName "City of San Jose"@en ;
    geo:lat "37.304"^^xsd:float ;
    geo:long "-121.873"^^xsd:float ;
    dbpedia:populationTotal "929936" ;
    dbpedia:areaLandSqMi "174.9" ;
    dbpedia:timezone <http://dbpedia.org/resource/Pacific_Time_Zone> ;
    foaf:homepage <http://www.sanjoseca.gov> ;
    foaf:img <http://upload.wikimedia.org/wikipedia/commons/3/3f/SJPan.jpg> ;
    foaf:page <http://en.wikipedia.org/wiki/San_Jose%2C_California> ;
    dbpedia:wikilink <http://dbpedia.org/resource/April_3> , ... ;
    owl:sameAs <http://sws.geonames.org/5392171/> ,
        <http://www.rdfabout.com/rdf/usgov/geo/us/ca/counties/santa_clara_county/san_jose> .
• See http://dbpedia.org
21
Freebase
• Collections of curated datasets
  – RDF-like data model
  – Data exports available, but no standard mapping to RDF until rdf.freebase.com was announced at ISWC2008
• Follows Linked Data principles
• Standard RDF dump still not available
• Some anomalies in the corpus mappings affected statistics
  – Used freebase:type rather than rdf:type
  – Language codes had a prepended /, e.g. "/en"
  – freebase.org (a different site) should be freebase.com
• Example
<http://www.freebase.org/guid/9202a8c04000641f800000000006809a>
    <http://www.freebase.org/type/object/name> "San Jose, California"@/en ;
    <http://www.freebase.org/type/object/type>
        <http://www.freebase.org/location/citytown> ,
        <http://www.freebase.org/location/us_citytown> ;
    <http://www.freebase.org/location/citytown/founded> "1777-11-29" ;
    <http://www.freebase.org/location/location/area> "461.5" .
• See http://freebase.com and http://rdf.freebase.com
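The anomalies above are mechanical enough to patch with line-level string rewrites before counting. A hypothetical cleanup pass (not part of the published analysis; the rewrites are illustrative rather than a full N-Triples parser, and the order matters because the predicate rewrite assumes the host has already been corrected):

```python
import re

def normalize_freebase(line):
    # 1. freebase.org (a different site) -> freebase.com
    line = line.replace("://www.freebase.org/", "://www.freebase.com/")
    # 2. Strip the stray "/" prepended to language codes: @/en -> @en
    line = re.sub(r'@/([A-Za-z][A-Za-z0-9-]*)', r'@\1', line)
    # 3. Map Freebase's own type predicate to rdf:type
    line = line.replace(
        "<http://www.freebase.com/type/object/type>",
        "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>")
    return line
```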
22
Geonames
• 8 million geographic names and locations
• Example
<http://sws.geonames.org/6484236/>
    a geonames:Feature ;
    geonames:featureClass geonames:S ;
    geonames:featureCode geonames:S.HTL ;
    geonames:inCountry <http://www.geonames.org/countries/#US> ;
    geonames:locationMap "http://www.geonames.org/6484236/the-fairmont-san-jose.html" ;
    geonames:name "The Fairmont San Jose" ;
    geonames:nearbyFeatures <http://sws.geonames.org/6484236/nearby.rdf> ;
    geonames:parentFeature <http://sws.geonames.org/5332921/> ;
    geo:lat "37.3326" ;
    geo:long "-121.8893" .
• See http://geonames.org
23
SwetoDBLP
• Metadata on publications in Computer Science (originally Databases and Logic Programming)
• Example
<http://dblp.uni-trier.de/rec/bibtex/conf/geos/KolasHD05>
    a opus:Article_in_Proceedings ;
    rdfs:label "Geospatial Semantic Web: Architecture of Ontologies." ;
    opus:author [ a rdf:Seq ;
        rdf:_1 <http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/k/Kolas:Dave.html> ;
        rdf:_2 <http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/h/Hebeler:John.html> ;
        rdf:_3 <http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/d/Dean:Mike.html> ] ;
    opus:isIncludedIn <http://dblp.uni-trier.de/rec/bibtex/conf/geos/2005> ;
    opus:book_title "GeoS" ;
    opus:year "2005"^^xsd:gYear ;
    opus:pages "183-194" ;
    dcelem:relation "http://www.informatik.uni-trier.de/~ley/db/conf/geos/geos2005.html#KolasHD05" ;
    opus:last_modified_date "2005-11-08"^^xsd:date .
<http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/d/Dean:Mike.html>
    a foaf:Person ;
    foaf:name "Mike Dean" .
• See http://lsdis.cs.uga.edu/projects/semdis/swetodblp/
24
WordNet
• Lexical database of English, including multiple word senses and synonym sets
• Example
wn20instances:wordsense-semantic-adjective-1
    a wn20schema:AdjectiveWordSense ;
    rdfs:label "semantic"@en-us ;
    wn20schema:adjectivePertainsTo wn20instances:wordsense-semantics-noun-1 ;
    wn20schema:tagCount "3"@en-us ;
    wn20schema:word wn20instances:word-semantic .
wn20instances:word-semantic
    a wn20schema:Word ;
    wn20schema:lexicalForm "semantic"@en-us .
• See http://www.w3.org/2006/03/wn/wn20/
25
US Census
• 1 billion triples published by Joshua Tauberer in April 2007
• Highly tabular data
• Example
<http://www.rdfabout.com/rdf/usgov/geo/us/ca/counties/santa_clara_county/san_jose>
    a <http://www.rdfabout.com/rdf/schema/usgovt/Town> ;
    dc:title "San Jose" ;
    dcterms:hasPart <http://www.rdfabout.com/rdf/usgov/geo/us/ca/counties/santa_clara_county/san_jose/fruitdale> ,
        <http://www.rdfabout.com/rdf/usgov/geo/us/ca/counties/santa_clara_county/san_jose/seven_trees> , ... ;
    dcterms:isPartOf <http://www.rdfabout.com/rdf/usgov/geo/us/ca/counties/santa_clara_county> ;
    census:details <http://www.rdfabout.com/rdf/usgov/geo/us/ca/counties/santa_clara_county/san_jose/censustables> ;
    census:households 559949 ;
    census:landArea "1144714122 m^2" ;
    census:population 1621316 ;
    census:waterArea "20064384 m^2" ;
    geo:lat "37.318892" ;
    geo:long "-121.928244" .
• See http://www.rdfabout.com/demo/census/
26
2009 Corpus
• All crawled data, using Falcon-S, Sindice, Swoogle, SWSE, and Watson
• 1,151,383,509 statements in 116 chunks of 10 million
• Represented in NQuads format
  – Explicit source/context for each statement
  – No parsing errors
• See http://vmlion25.deri.ie/
  – Includes sampled statistics (which I found to be highly accurate)
• Sources by "Pay Level Domain"
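Because N-Quads is line-oriented with the context as the final term, the source of each statement can be recovered with simple splitting. A minimal sketch, assuming well-formed lines with no comments (the function name is illustrative):

```python
def parse_quad(line):
    # One N-Quads line: subject predicate object context .
    # Subject, predicate, and context contain no unescaped
    # whitespace, so split from both ends; the object (which may be
    # a literal with embedded spaces) is whatever remains.
    line = line.rstrip().rstrip(".").rstrip()
    subj, pred, rest = line.split(None, 2)
    obj, ctx = rest.rsplit(None, 1)
    return subj, pred, obj, ctx
```

Grouping statements by the returned context (or by its Pay Level Domain) supports the per-source reporting mentioned above.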
27
LUBM
• The Lehigh University Benchmark (LUBM) is widely used for Semantic Web benchmarking
  – Synthetic data generated for a specified number of universities
• Example
<http://www.Department0.University0.edu/FullProfessor0>
    a ub:FullProfessor ;
    ub:doctoralDegreeFrom <http://www.University241.edu> ;
    ub:emailAddress "FullProfessor0@Department0.University0.edu" ;
    ub:mastersDegreeFrom <http://www.University875.edu> ;
    ub:name "FullProfessor0" ;
    ub:researchInterest "Research20" ;
    ub:teacherOf <http://www.Department0.University0.edu/GraduateCourse1> ,
        <http://www.Department0.University0.edu/Course0> ,
        <http://www.Department0.University0.edu/GraduateCourse0> ;
    ub:telephone "xxx-xxx-xxxx" ;
    ub:undergraduateDegreeFrom <http://www.University84.edu> ;
    ub:worksFor <http://www.Department0.University0.edu> .
• See http://swat.cse.lehigh.edu/projects/lubm/
28
Statement % (distinct values)
29
Dataset rdf:type rdf:resource rdf:datatype xml:lang Neither
Webscope 24 (724) 32 3 (10) 14 (93) 27
Falcon 16 (19,660) 50 16 (72) 9 (252) 18
Swoogle 15 (33,318) 39 2 (87) 18 (280) 26
Watson 16 (13,660) 40 2 (79) 29 (162) 13
SWSE-1 13 (115) 53 0 (1) 32 (6) 1
SWSE-2 13 (104) 53 0 (1) 32 (15) 1
DBpedia 0 (4) 91 0 (6) 8 (1) 0
Geonames 10 (1) 49 0 (0) 1 (342) 41
SwetoDBLP 18 (11) 28 14 (4) 0 (0) 41
WordNet 24 (22) 30 0 (1) 46 (1) 0
Freebase 0 (0) 62 0 (0) 19 (169) 19
US Census 0 (8) 19 78 (2) 0 (0) 3
BTC 2009 12 (283,612) 57 1 (198) 7 (386) 22
LUBM 1 20 (15) 48 0 (0) 0 (0) 32
RDF/RDFS/OWL Usage

Term   2008   2009
rdf:type 80,309,558 143,293,758
rdfs:subClassOf 4,033,540 2,712,766
owl:Class 2,988,396 2,680,081
rdf:_1 1,492,214 757,717
owl:Restriction 1,042,032 440,750
owl:sameAs 480,771 6,565,347
rdfs:Class 299,962 186,770
rdfs:domain 265,124 195,053
rdfs:range 252,175 187,746
reified statements ~238,000 ~328,000
rdf:Bag 50,482 47,843
owl:Ontology 22,154 445,994
owl:imports 14,913 212,731
rdf:_2000 83 2,018
rdf:_10763 1 43
rdf:_32061 0 130
Popular Vocabularies

Term   2008   2009
foaf:Person 29,308,169 38,790,680
foaf:knows 25,864,527 35,811,115
dc:date 43,591,844 12,537,177
dc:title 4,416,716 22,326,441
wgs84_pos:lat 7,075,380 7,398,911
georss:point 4,436 367,291
skos:subject 6,619,912 18,257,337
skos:Concept 403,912 697,311
rss:item 2,893,750 13,687,021
owls:Profile 92 138
31
Corpus Composition

2008   2009
US Census 39% rdfabout.com 2%
DBpedia 10% dbpedia.org 35%
GeoNames 6% geonames.org 11%
Freebase 6% freebase.com 1%
OTHER 40% OTHER 50%
32
Further Analysis
• Node-level comparison of the 2009 corpus
• Increased factoring of rdf:type statements
  – How many rdf:type's are associated with each resource?
• Overlap between 2008 and 2009 corpora
• Analysis and reporting by Pay Level Domain rather than dataset
  – By vocabulary (aggregated source vs. aggregated predicate/type)
• Drilldown into particular patterns, e.g. the 32K-element set/bag
• Additional graph metrics (e.g. diameter)
33
2008 Billion Triples Winners
• SemaPlorer: map-based exploration and visualization
• SearchWebDB: inexact keyword search
• MaRVIN: scalable reasoning from LarKC
• i-MoCo: storage and browsing of 250M+ triples with an iPhone application
• SAOR: Scalable Authoritative OWL Reasoning
• Virtuoso: sophisticated storage and querying
34
2009 Challenge
• Consider entering the Semantic Web Challenge
• Submissions due October 1
• Submissions will be presented and winners named at the 8th International Semantic Web Conference (ISWC2009) October 25-29 near Washington, DC
35
More Information
• Semantic Web Challenge
  – http://challenge.semanticweb.org
• Analysis Code and Raw Data
  – 2008: http://asio.bbn.com/2008/10/btc/
  – 2009: http://asio.bbn.com/2009/06/btc/
• Parliament
  – http://parliament.semwebcentral.org
• ISWC2009
  – http://iswc2009.semanticweb.org
36