PhD Proposal Defense - Prateek Jain



Slides from Prateek Jain's PhD Proposal Defense.

Transcript of PhD Proposal Defense - Prateek Jain

Page 1: PhD Proposal Defense - Prateek Jain

1

About 22 years ago..

Page 2: PhD Proposal Defense - Prateek Jain

11 years later…

Image from Scientific American Website

Page 3: PhD Proposal Defense - Prateek Jain

3

Page 4: PhD Proposal Defense - Prateek Jain

4

Page 5: PhD Proposal Defense - Prateek Jain

5

Page 6: PhD Proposal Defense - Prateek Jain

6

Tim Berners-Lee 2006

1. Use URIs as names for things.

2. Use HTTP URIs so that people can look up those names.

3. When someone looks up a URI, provide useful information, using the standards (RDF*, SPARQL).

4. Include links to other URIs, so that they can discover more things.

Page 7: PhD Proposal Defense - Prateek Jain

7

In 2006 Web of Data

Page 8: PhD Proposal Defense - Prateek Jain

8


Linked Open Data

• Massive collection of instance data

• Primarily connected via owl:sameAs relationship

• Excellent source of information for background knowledge

• Labeled as mainstream Semantic Web

Page 9: PhD Proposal Defense - Prateek Jain

9

Is it really mainstream Semantic Web?

• What is the relationship between the models whose instances are being linked?

• How to do querying on LOD without knowing individual datasets?

• How to perform schema level reasoning over LOD cloud?

Page 10: PhD Proposal Defense - Prateek Jain

10

What can be done?

• Relationships are at the heart of Semantics

• LOD primarily consists of owl:sameAs links

• LOD captures instance level relationships, but lacks class level relationships.
  o Superclass
  o Subclass
  o Equivalence

• How to find these relationships?
  o Match the LOD ontologies using state-of-the-art ontology matching tools.
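The distinction between instance level and class level links can be sketched with a few toy triples. This is only an illustration: the `dsA:` and `dsB:` identifiers are made-up placeholders, not real LOD URIs, and the helper name `split_by_level` is hypothetical.

```python
# Toy triples (subject, predicate, object). All dsA:/dsB: names are
# made-up placeholders, not identifiers from any real LOD dataset.
OWL_SAME_AS = "owl:sameAs"
RDF_TYPE = "rdf:type"
RDFS_SUBCLASS_OF = "rdfs:subClassOf"
OWL_EQUIVALENT_CLASS = "owl:equivalentClass"

triples = [
    # Instance level link, the kind LOD already provides.
    ("dsA:Dayton", OWL_SAME_AS, "dsB:Dayton_Ohio"),
    ("dsA:Dayton", RDF_TYPE, "dsA:City"),
    ("dsB:Dayton_Ohio", RDF_TYPE, "dsB:PopulatedPlace"),
    # Class level link, the kind that is usually missing.
    ("dsA:City", RDFS_SUBCLASS_OF, "dsB:PopulatedPlace"),
]

CLASS_LEVEL = {RDFS_SUBCLASS_OF, OWL_EQUIVALENT_CLASS}


def split_by_level(all_triples):
    """Separate owl:sameAs instance links from class level links."""
    instance_links = [t for t in all_triples if t[1] == OWL_SAME_AS]
    class_links = [t for t in all_triples if t[1] in CLASS_LEVEL]
    return instance_links, class_links


instance_links, class_links = split_by_level(triples)
```

Without the last triple, a query phrased against `dsB:PopulatedPlace` has no way to reach entities typed only as `dsA:City`, which is the gap schema level matching tries to close.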

Page 11: PhD Proposal Defense - Prateek Jain

Linked Data Alignment and Enrichment

Proposal Defense June 11th, 2012

Prateek Jain

Kno.e.sis Center

Wright State University, Dayton, OH

Page 12: PhD Proposal Defense - Prateek Jain

Agenda

• Motivation and Significance of this research

• Research questions and proposed solutions

• State of the current research and planned work

• Questions and comments

14th February 2012 12

Page 13: PhD Proposal Defense - Prateek Jain

Linked Open Data


• A set of best practices for publishing and connecting structured data on the Web

• Practices have been adopted by an increasing number of data providers in the past 5 years

• Latest count is at 295 datasets with over 50 Billion triples (approx)

Page 14: PhD Proposal Defense - Prateek Jain

14

Linked Open Data 2007 (May)

Linking Open Data cloud diagram, this and subsequent pages, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/

Page 15: PhD Proposal Defense - Prateek Jain

15

Linked Open Data 2007 (Oct)

Page 16: PhD Proposal Defense - Prateek Jain

16

Linked Open Data 2009

Page 17: PhD Proposal Defense - Prateek Jain

17

Linked Open Data 2011

Page 18: PhD Proposal Defense - Prateek Jain

18

Linked Open Data

Number of Datasets

2011-09-19 295

2010-09-22 203

2009-07-14 95

2008-09-18 45

2007-10-08 25

2007-05-01 12

Number of triples (Sept 2011)

31,634,213,770

with 503,998,829 out-links

From http://www4.wiwiss.fu-berlin.de/lodcloud/state/

Page 19: PhD Proposal Defense - Prateek Jain

6 years of existence: how many applications come to your mind?

Page 20: PhD Proposal Defense - Prateek Jain

20

Page 21: PhD Proposal Defense - Prateek Jain

21


Reality…

• “We DID NOT use the entire DBpedia or LOD. The only component of LOD which helped us with Watson was the YAGO class hierarchy present in DBpedia. We had strict information gain requirements and other components honestly did not help much.”

– Researcher with the Watson Team


Page 22: PhD Proposal Defense - Prateek Jain

Why?

Page 23: PhD Proposal Defense - Prateek Jain

23

A simple query..

“Identify congress members who have voted ‘No’ on pro-environmental legislation in the past four years, with high-pollution industry in their congressional districts.”

But even with LOD we cannot answer this query.

Page 24: PhD Proposal Defense - Prateek Jain

24

Example: GovTrack

[Figure: RDF graph for GovTrack vote data. Resources and literals shown: Bills:h3962; “H.R. 3962: Affordable Health Care for America Act”; Votes:2009-887/+; people/P000197; “Nancy Pelosi”; “On Passage: H R 3962 Affordable Health Care for America Act”; Vote: 2009-887; “Aye”. Edge labels shown: vote:hasAction, vote:vote, dc:title, vote:hasOption, rdfs:label, vote:votedBy, name.]

Page 25: PhD Proposal Defense - Prateek Jain

25

Example: GeoNames

rdfs:subClassOf?

Page 26: PhD Proposal Defense - Prateek Jain

26

Our Approach

Use knowledge contributed by users

To enhance existing approaches to solve these issues:

• Ontology integration

• Detecting relationships within and across datasets

• Querying multiple datasets

LOD Cloud

Page 27: PhD Proposal Defense - Prateek Jain

28

Circling Back

• LOD captures instance level relationships, but lacks class level relationships.
  o Superclass
  o Subclass
  o Equivalence

Page 28: PhD Proposal Defense - Prateek Jain

BLOOMS – Bootstrapping …

Page 29: PhD Proposal Defense - Prateek Jain

30

• BLOOMS - Bootstrapping-based Linked Open Data Ontology Matching System

• Developed specifically for LOD Ontologies

• Identifies schema level links between different LOD datasets

• Aligns ontologies belonging to diverse domains using diverse data sources

• Technique relies on using hierarchy in other datasets (therefore bootstrapping)

Page 30: PhD Proposal Defense - Prateek Jain

31

Existing Approaches

“A survey of approaches to automatic schema matching” by Erhard Rahm and Philip A. Bernstein, The VLDB Journal 10: 334–350 (2001)

Page 31: PhD Proposal Defense - Prateek Jain

32

LOD Ontology Alignment

• Actual result from these techniques: Nation = Menstruation, Confidence = 0.9

• They perform extremely well on established benchmarks, but typically not in the wild.

• LOD ontologies are of a very different nature

• Created by community for community.

• Emphasis on number of instances, not number of meaningful relationships.

• Require solutions beyond syntactic and structural matching.

Page 32: PhD Proposal Defense - Prateek Jain

33

Rabbit out of a hat?

• Traditional auxiliary data sources (WordNet, Upper Level Ontologies) have limited coverage.

• Community-generated data is noisy, but is rich in
  o Content
  o Structure

• Has a “self-healing property”

• Problems like Ontology Matching have a dimension of context associated with them.

Page 33: PhD Proposal Defense - Prateek Jain

34

Wikipedia

• The English version alone has more than 2.9 million articles

• Continually expanded by approx. 100,000 active volunteer editors

• Multiple points of view are mentioned with proper contexts

• Article creation/correction is an ongoing activity

Page 34: PhD Proposal Defense - Prateek Jain

35

Ontology Matching using Wikipedia

• On Wikipedia, categories are used to organize the entire project.

• Wikipedia's category system consists of overlapping trees.

• Simple rules for categorization

Page 35: PhD Proposal Defense - Prateek Jain

36

BLOOMS Approach – Step 1

• Pre-process the input ontology
  o Remove property restrictions
  o Remove individuals, properties

• Tokenize the class names
  o Remove underscores, hyphens and other delimiters
  o Break down complex class names

• Example: SemanticWeb => Semantic Web
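The tokenization step above can be sketched as follows. This is a minimal sketch, not the BLOOMS implementation: the function name `tokenize_class_name` and the exact splitting rules (any non-alphanumeric character is a delimiter, camel case is split at a lowercase-to-uppercase boundary) are assumptions.

```python
import re


def tokenize_class_name(name: str) -> str:
    """Split an ontology class name into space-separated words."""
    # Treat underscores, hyphens and any other non-alphanumeric
    # characters as delimiters and replace them with spaces.
    spaced = re.sub(r"[^A-Za-z0-9]+", " ", name)
    # Break camel case: insert a space wherever a lowercase letter or
    # digit is immediately followed by an uppercase letter.
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", spaced)
    # Collapse repeated whitespace and trim the ends.
    return " ".join(spaced.split())


print(tokenize_class_name("SemanticWeb"))      # Semantic Web
print(tokenize_class_name("Populated_Place"))  # Populated Place
```

Acronyms such as `RDFGraph` are left as a single token by this rule; handling them would need an extra pattern.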

Page 36: PhD Proposal Defense - Prateek Jain

37

BLOOMS Approach – Step 2

• Identify the articles in Wikipedia corresponding to the concept.
  o Each article related to the concept indicates a sense of the usage of the word.

• For each article found in the previous step
  o Identify the Wikipedia categories to which it belongs.
  o For each category found, find its parent categories till level 4.

• Once the “BLOOMS tree” for each sense of the source concept is created (Ts), utilize it for comparison with the “BLOOMS tree” of the target concepts (Tt).
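Building one such tree can be sketched as a breadth-first walk up a category graph that stops at level 4. The sketch below assumes the article (the sense) sits at level 0 and its own categories at level 1; the slide does not state where counting starts. The data in `TOY_ARTICLE_CATEGORIES` and `TOY_PARENTS` is invented for illustration and is not taken from Wikipedia.

```python
from collections import deque

# Invented toy data: article -> categories, category -> parent categories.
TOY_ARTICLE_CATEGORIES = {"Jazz": ["Category:Jazz"]}
TOY_PARENTS = {
    "Category:Jazz": ["Category:Music genres"],
    "Category:Music genres": ["Category:Music"],
    "Category:Music": ["Category:Arts"],
    "Category:Arts": ["Category:Culture"],
    "Category:Culture": ["Category:Society"],
}


def build_tree(article, article_categories, parents, max_level=4):
    """Return {node: level} for the article and its categories up to max_level."""
    levels = {article: 0}  # assumption: the sense itself is level 0
    queue = deque()
    for category in article_categories.get(article, []):
        if category not in levels:
            levels[category] = 1
            queue.append(category)
    while queue:
        node = queue.popleft()
        if levels[node] >= max_level:
            continue  # do not climb beyond the level limit
        for parent in parents.get(node, []):
            if parent not in levels:
                levels[parent] = levels[node] + 1
                queue.append(parent)
    return levels


tree = build_tree("Jazz", TOY_ARTICLE_CATEGORIES, TOY_PARENTS)
```

In this toy graph the walk stops at `Category:Arts` (level 4), so `Category:Culture` and anything above it never enters the tree.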

Page 37: PhD Proposal Defense - Prateek Jain

38

BLOOMS Approach – Step 3

• In the tree $T_s$, remove all nodes whose parent node occurs in $T_t$, to create $T_s'$.
  o All leaves of $T_s'$ are of level 4 or occur in $T_t$.
  o The pruned nodes do not contribute any additional new knowledge.

• Compute the overlap $O_s$ between the source and target tree.
  o $O_s = o(T_s, T_t) = \frac{n}{k-1}$, where $n = |T_s' \cap T_t|$ is the number of nodes of $T_s'$ that also occur in $T_t$, and $k = |T_s'|$ is the number of nodes in $T_s'$.

• The decision of alignment is made as follows.
  o If, for $T_s \in T_C$ and $T_t \in T_D$, we have $T_s = T_t$, then $C = D$.
  o If $\min\{o(T_s, T_t), o(T_t, T_s)\} \ge x$, then set C rdfs:subClassOf D if $o(T_s, T_t) \le o(T_t, T_s)$, and set D rdfs:subClassOf C if $o(T_s, T_t) \ge o(T_t, T_s)$.
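The pruning, overlap and decision rules can be sketched on two abstract toy trees. Several things here are assumptions rather than details from the slide: trees are stored as `{node: [children]}` with the sense as root, pruning is read as "keep a shared node but drop everything below it", tree equality is checked by comparing roots and edges, the overlap of a one-node pruned tree is taken to be 0, and the threshold value 0.3 is arbitrary.

```python
def reachable(tree, root):
    """All nodes reachable from root in a {node: [children]} mapping."""
    seen, stack = set(), [root]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(tree.get(node, []))
    return seen


def edges(tree, root):
    """Set of (parent, child) edges reachable from root."""
    return {(p, c) for p in reachable(tree, root) for c in tree.get(p, [])}


def prune(tree_s, root_s, nodes_t):
    """Node set of Ts': stop descending once a node also occurs in Tt."""
    kept, stack = set(), [root_s]
    while stack:
        node = stack.pop()
        if node in kept:
            continue
        kept.add(node)
        if node not in nodes_t:
            stack.extend(tree_s.get(node, []))
    return kept


def overlap(tree_s, root_s, tree_t, root_t):
    """o(Ts, Tt) = n / (k - 1), with n = |Ts' ∩ Tt| and k = |Ts'|."""
    nodes_t = reachable(tree_t, root_t)
    pruned = prune(tree_s, root_s, nodes_t)
    n, k = len(pruned & nodes_t), len(pruned)
    return n / (k - 1) if k > 1 else 0.0  # assumption for the k = 1 case


def decide(c, d, tree_c, root_c, tree_d, root_d, x):
    """Return an alignment triple for classes c and d, or None."""
    if root_c == root_d and edges(tree_c, root_c) == edges(tree_d, root_d):
        return (c, "=", d)
    o_cd = overlap(tree_c, root_c, tree_d, root_d)
    o_dc = overlap(tree_d, root_d, tree_c, root_c)
    if min(o_cd, o_dc) >= x:
        # When both overlaps are equal the slide's rule allows either
        # direction; this sketch returns the first one.
        if o_cd <= o_dc:
            return (c, "rdfs:subClassOf", d)
        return (d, "rdfs:subClassOf", c)
    return None


# Abstract toy trees; node names carry no meaning.
tree_c = {"sense:C": ["cat:A"], "cat:A": ["cat:B"], "cat:B": ["cat:E"]}
tree_d = {"sense:D": ["cat:A", "cat:F"], "cat:A": ["cat:B"], "cat:F": ["cat:G"]}

o_cd = overlap(tree_c, "sense:C", tree_d, "sense:D")
o_dc = overlap(tree_d, "sense:D", tree_c, "sense:C")
result = decide("C", "D", tree_c, "sense:C", tree_d, "sense:D", x=0.3)
```

Here the pruned source tree is {sense:C, cat:A}, so n = 1 and k = 2, while the pruned target tree has four nodes of which one is shared, so the two overlaps differ and the rule picks a subclass direction.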

Page 38: PhD Proposal Defense - Prateek Jain

39

Example

Page 39: PhD Proposal Defense - Prateek Jain

40

Evaluation Objectives

• To examine BLOOMS as a tool for the purpose of LOD ontology matching.

• To examine the ability of BLOOMS to serve as a general purpose ontology matching system.

Page 40: PhD Proposal Defense - Prateek Jain

41


Circling Back

• LOD primarily consists of owl:sameAs links


Page 41: PhD Proposal Defense - Prateek Jain

Part of Relationship Identification

Page 42: PhD Proposal Defense - Prateek Jain

43

Partonomy Identification

• Currently entities across datasets are linked using primarily the owl:sameAs relationship

• Relationships such as partonomy (part-of), and causality can allow creating even more intelligent applications such as Watson

• Approach: PLATO (Part-Of relation finder on Linked Open DAta Tool)

Page 43: PhD Proposal Defense - Prateek Jain

44

PLATO Approach

• PLATO generates all possible partonomically linked pairs between the entities in the dataset.
  o Utilize “strongly” associated entities

• Identify the type of each entity in the pair using WordNet.
  o Use class names
  o Gives the lexicographer files for the synsets corresponding to these entities

• Use this information to determine the applicable OWL partonomy properties.
  o Using Winston’s taxonomy

Page 44: PhD Proposal Defense - Prateek Jain

45

Winston’s Taxonomy

Page 45: PhD Proposal Defense - Prateek Jain

46

PLATO Approach – Step 2

• PLATO generates linguistic patterns for each applicable property based on linguistic cues suggested by Winston.
  o Cell Wall is made of Cellulose

• Tests the lexical patterns for each entity pair in a corpus-driven manner.
  o Using the Web as a corpus

• PLATO counts the total number of web pages that contain the pattern
  o Parse the page and identify the occurrence of the pattern.
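Pattern generation and counting can be sketched against a small in-memory corpus standing in for the Web. The templates in `TEMPLATES` and the pages in `TOY_PAGES` are invented for illustration; the actual cue phrases and the way PLATO retrieves and parses pages are not given on the slide.

```python
import re

# Illustrative templates for a "made of" style cue; invented for this sketch.
TEMPLATES = ["{whole} is made of {part}", "{whole} are made of {part}"]

# Stand-in for web pages; invented text.
TOY_PAGES = [
    "In plants the cell wall is made of cellulose and other polymers.",
    "The Cell Wall  is made of cellulose.",
    "Cellulose is a polysaccharide.",
]


def make_patterns(whole, part, templates):
    """Instantiate every template for one ordered entity pair."""
    return [t.format(whole=whole, part=part) for t in templates]


def count_pages(pattern, pages):
    """Number of pages containing the pattern, ignoring case and extra spaces."""
    words = [re.escape(word) for word in pattern.split()]
    regex = re.compile(r"\s+".join(words), re.IGNORECASE)
    return sum(1 for page in pages if regex.search(page))


def evidence_for(whole, part, templates, pages):
    """Total page count over all patterns for one ordered pair."""
    return sum(count_pages(p, pages) for p in make_patterns(whole, part, templates))


forward = evidence_for("Cell Wall", "Cellulose", TEMPLATES, TOY_PAGES)
backward = evidence_for("Cellulose", "Cell Wall", TEMPLATES, TOY_PAGES)
```

Both orderings of the pair are tested because the pattern is directional: swapping the entities expresses the opposite part-of claim.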

Page 46: PhD Proposal Defense - Prateek Jain

47

PLATO Approach – Step 3

• Asserts the partonomy property with the strongest supporting evidence
  o Cell Wall is made of Cellulose, 48
  o Cellulose is made of Cell Wall, 10

• PLATO also enriches the schema by generalizing from the instance level assertions.
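Choosing the assertion with the strongest evidence can be sketched with the two counts from the slide (48 and 10). The `generalize` helper shows one possible reading of lifting an instance level assertion to the schema; how PLATO actually generalizes is not described here, and the class names `CellPart` and `Substance` are hypothetical.

```python
def strongest_assertion(evidence):
    """Return the triple with the highest page count, or None if there is no support."""
    if not evidence:
        return None
    best = max(evidence, key=evidence.get)
    return best if evidence[best] > 0 else None


def generalize(assertion, entity_class):
    """Lift an instance level triple to the classes of its subject and object.

    One possible reading of schema enrichment, not the documented method.
    """
    subject, prop, obj = assertion
    return (entity_class[subject], prop, entity_class[obj])


# Page counts taken from the slide.
evidence = {
    ("Cell Wall", "made of", "Cellulose"): 48,
    ("Cellulose", "made of", "Cell Wall"): 10,
}

# Hypothetical class assignments for the two entities.
entity_class = {"Cell Wall": "CellPart", "Cellulose": "Substance"}

best = strongest_assertion(evidence)
schema_level = generalize(best, entity_class)
```

With these counts the forward direction wins, so the asserted triple is that Cell Wall is made of Cellulose.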

Page 47: PhD Proposal Defense - Prateek Jain

48

Evaluation Objectives

• To examine PLATO as a tool for finding different kinds of part-of relations.

• To examine PLATO as a tool for finding part-of relations within a dataset.

• To examine PLATO as a tool for finding part-of relations across datasets.

Page 48: PhD Proposal Defense - Prateek Jain

49

Page 49: PhD Proposal Defense - Prateek Jain


Columns: BLOOMS, BLOOMS+, PLATO, Others

2010: 1 paper at ISWC; 1 paper at OM workshop; paper at AAAI SS; paper at GEOS

2011: 1 paper at ESWC; workshop at ICBO

2012: 1 paper at ACM Hypertext

Total of 7 publications covering this research

Page 50: PhD Proposal Defense - Prateek Jain

Research Plan


• Evaluation of BLOOMS on LOD ontologies

• Evaluation of PLATO

• Automatic classification of datasets

• Property alignment on LOD

Page 51: PhD Proposal Defense - Prateek Jain

Questions?