NJVR: The NanJing Vocabulary Repository

Gong Cheng, Min Liu, Yuzhong Qu (Nanjing University)


Presentation given by Yuzhong Qu at CSWS2012.

Transcript of NJVR: The NanJing Vocabulary Repository

Page 1: NJVR: The NanJing Vocabulary Repository

NJVR: The NanJing Vocabulary Repository

Gong Cheng, Min Liu, Yuzhong Qu

Nanjing University

Page 2: NJVR: The NanJing Vocabulary Repository

Motivation

Ontology-related research topics: summarization, ranking, matching

A large and representative collection of real-world vocabularies

Page 3: NJVR: The NanJing Vocabulary Repository

State of the art

Top-down efforts
– Size: small (hundreds)
– Access: directly (via browsing)

Bottom-up efforts
– Size: large (thousands)
– Access: indirectly (via searching)

Our goal: a repository that is large and directly accessible

Page 4: NJVR: The NanJing Vocabulary Repository

Contribution

• NJVR: A large and freely accessible vocabulary repository
  – Source: An index of 4.1 B RDF triples distributed in 15.9 M RDF documents crawled from 5.8K pay-level domains (PLDs)
  – Constitution:
    • RDF descriptions of 2,996 dereferenceable vocabularies crawled from 261 PLDs
    • Document-level statistical data on their instantiations (e.g. term frequency)
  – Accessibility: Publicly downloadable
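To make "document-level statistical data" concrete, here is a minimal Python sketch of per-document term frequency. The slides do not define how instantiation is counted. The sketch assumes a class is instantiated when it appears as the object of rdf:type and a property is instantiated when it appears as a predicate. The function name and the sample triples are hypothetical.

```python
from collections import Counter

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def term_frequencies(triples):
    """Count term usage within one RDF document.

    Assumption (not stated in the slides): a property is used whenever it
    occurs as a predicate, and a class is used whenever it occurs as the
    object of an rdf:type triple.
    """
    counts = Counter()
    for _subject, predicate, obj in triples:
        counts[predicate] += 1          # property instantiation
        if predicate == RDF_TYPE:
            counts[obj] += 1            # class instantiation
    return counts


# Hypothetical document with two foaf:Person instances and one foaf:knows link.
doc = [
    ("http://example.org/alice", RDF_TYPE, "http://xmlns.com/foaf/0.1/Person"),
    ("http://example.org/bob", RDF_TYPE, "http://xmlns.com/foaf/0.1/Person"),
    ("http://example.org/alice", "http://xmlns.com/foaf/0.1/knows", "http://example.org/bob"),
]
freq = term_frequencies(doc)
```

Running this once per crawled document and storing the counters keyed by document URI would give one possible shape for such statistics.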

Page 5: NJVR: The NanJing Vocabulary Repository

Construction of NJVR

1. Crawling

2. Vocabulary identification

3. Vocabulary instantiation

Page 6: NJVR: The NanJing Vocabulary Repository

Crawling (2007—May 2011)

1. Initialization (of the URI pool)
  – Other freely-accessible repositories, e.g. pingthesemanticweb.com
  – LOD cloud
  – Search results, e.g. Swoogle, Google

2. URI dereference and document parsing
  – java.net package
  – Jena

3. Pool expansion
  – URIs in parsed documents
  – Submissions from the users of Falcons
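The three crawling steps can be read as a loop over a URI pool. The Python sketch below is only an illustration of that loop, not the authors' crawler, which used the java.net package and Jena. The `dereference` callback, the `max_documents` limit, and the in-memory `FAKE_WEB` are assumptions introduced here so that the example runs without network access.

```python
from collections import deque
from urllib.parse import urldefrag


def crawl(seed_uris, dereference, max_documents=1000):
    """Breadth-first crawl over a URI pool.

    `dereference` is a hypothetical callback that fetches and parses the
    document behind a URI and returns a list of (s, p, o) string triples,
    or None when nothing could be retrieved.
    """
    pool = deque(seed_uris)            # step 1: initialization of the URI pool
    seen = set(seed_uris)
    documents = {}
    while pool and len(documents) < max_documents:
        uri = pool.popleft()
        doc_uri = urldefrag(uri).url   # a hash URI dereferences to its document
        if doc_uri in documents:
            continue
        triples = dereference(doc_uri) # step 2: dereference and parse
        if triples is None:
            continue
        documents[doc_uri] = triples
        for triple in triples:         # step 3: expand the pool
            for node in triple:
                if node.startswith("http") and node not in seen:
                    seen.add(node)
                    pool.append(node)
    return documents


# Hypothetical two-document web used in place of real HTTP requests.
FAKE_WEB = {
    "http://example.org/a": [
        ("http://example.org/a", "http://example.org/p", "http://example.org/b"),
    ],
    "http://example.org/b": [
        ("http://example.org/b", "http://example.org/p", "literal value"),
    ],
}
docs = crawl(["http://example.org/a"], FAKE_WEB.get)
```

User submissions, the other pool-expansion source on the slide, would simply be appended to the same pool.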

Page 7: NJVR: The NanJing Vocabulary Repository

Vocabulary identification

• Bottom-up strategy
  1. Term: URI that identifies a class/property in its dereference document
  2. Vocabulary: Terms in a common namespace are grouped
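The grouping step can be sketched in a few lines of Python. The slides do not say how the namespace of a term is determined. The sketch assumes the common heuristic of cutting a URI after "#" when present and otherwise after the last "/". Both function names are hypothetical.

```python
from collections import defaultdict


def namespace_of(term_uri):
    """Return the namespace part of a term URI.

    Assumed heuristic: cut after '#' if the URI has a fragment,
    otherwise cut after the last '/'.
    """
    if "#" in term_uri:
        return term_uri.split("#", 1)[0] + "#"
    return term_uri.rsplit("/", 1)[0] + "/"


def group_into_vocabularies(term_uris):
    """Group terms that share a namespace into one vocabulary."""
    vocabularies = defaultdict(set)
    for term in term_uris:
        vocabularies[namespace_of(term)].add(term)
    return dict(vocabularies)


terms = [
    "http://xmlns.com/foaf/0.1/Person",
    "http://xmlns.com/foaf/0.1/knows",
    "http://purl.org/dc/elements/1.1/creator",
]
vocabs = group_into_vocabularies(terms)
```

Here the two FOAF terms fall into one vocabulary and dc:creator into another, which mirrors the bottom-up idea of deriving vocabularies from the terms that were actually found.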

Page 8: NJVR: The NanJing Vocabulary Repository

Results

• 455,718 terms
  – 396,023 classes, 59,868 properties (many are in the YAGO namespace)

• 2,996 vocabularies
  – From 261 PLDs (many are from w3.org)

• Instantiation found for
  – 115,707 classes (29.2%), e.g. foaf:Person
  – 25,963 properties (43.4%), e.g. dc:creator
  – 1,874 vocabularies (62.6%)
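The percentages follow directly from the counts on this slide, as the short Python check below shows. It also computes the difference between the class and property counts and the term total. One possible reading of that difference is that some URIs are counted both as a class and as a property, but the slides do not state this.

```python
# Counts copied from the slide: (instantiated, total).
counts = {
    "classes": (115_707, 396_023),
    "properties": (25_963, 59_868),
    "vocabularies": (1_874, 2_996),
}

# Share of each kind for which at least one instantiation was found.
shares = {
    name: round(100 * instantiated / total, 1)
    for name, (instantiated, total) in counts.items()
}

# Classes plus properties versus the reported number of terms.
surplus = 396_023 + 59_868 - 455_718

print(shares, surplus)
```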

Page 9: NJVR: The NanJing Vocabulary Repository

Applications of NJVR

• Vocabulary ranking
• Vocabulary matching
• …

Page 10: NJVR: The NanJing Vocabulary Repository

NJVR for vocabulary ranking

• Using NJVR as a test case for vocabulary ranking

Page 11: NJVR: The NanJing Vocabulary Repository

Future work

• Removal of low-quality vocabularies from NJVR
• Comparative analysis of NJVR and other repositories
• …

Page 12: NJVR: The NanJing Vocabulary Repository

Just use it!

ws.nju.edu.cn/njvr

ws.nju.edu.cn/falcons