
[email protected] http://www.mpi-inf.mpg.de/~weikum/

Gerhard Weikum

DB & IR: Both Sides Now

in collaboration with Georgiana Ifrim, Gjergji Kasneci, Josiane Parreira, Maya Ramanath, Ralf Schenkel, Fabian Suchanek, Martin Theobald


DB and IR: Two Parallel Universes

Database Systems:
• canonical application: accounting
• data type: numbers, short strings
• foundation: algebraic / logic based
• search paradigm: Boolean retrieval (exact queries, result sets/bags)
• market leaders: Oracle, IBM DB2, MS SQL Server, etc.

Information Retrieval:
• canonical application: libraries
• data type: text
• foundation: probabilistic / statistics based
• search paradigm: ranked retrieval (vague queries, result lists)
• market leaders: Google, Yahoo!, MSN, Verity, Fast, etc.

Parallel universes forever?


Why DB&IR Now? – Application Needs

Simplify life for application areas like:
• Global health-care management for monitoring epidemics
• News archives for journalists, press agencies, etc.
• Product catalogs for houses, cars, vacation places, etc.
• Customer support & CRM in insurance, telecom, retail, software, etc.
• Bulletin boards for social communities
• Enterprise search for projects, skills, know-how, etc.
• Personalized & collaborative search in digital libraries, Web, etc.
• Comprehensive archive of blogs with time-travel search

Typical data:
Disease (DId, Name, Category, Pathogen …)
UMLS-Categories ( … )
Patient (… Age, HId, Date, Report, TreatedDId)
Hospital (HId, Address …)

Typical query: symptoms of tropical virus diseases and reported anomalies with young patients in central Europe in the last two weeks


Why DB&IR Now? – Platform Desiderata

Platform desiderata (from app developer's viewpoint):
• Flexible ranking on text, categorical, numerical attributes
  • cope with "too many answers" and "no answers"
• High update rate concurrently with high query load
• Ontologies (dimensions, facets) for products, locations, orgs, etc.
  • for query rewriting (relaxation, strengthening)
• Complex queries combining text & structured attributes
  • XPath/XQuery Full-Text with ranking

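[Figure: 2×2 matrix of search paradigm vs. data type, with an "Integrated DB&IR Platform" at the center]
• Structured search (SQL, XQuery) on structured data (records): DB Systems
• Unstructured search (keywords) on unstructured data (documents): IR Systems, Search Engines
• Unstructured search (keywords) on structured data (records): Keyword Search on Relational Graphs (IIT Bombay, UCSD, MSR, Hebrew U, CU Hong Kong, Duke U, ...)
• Structured search (SQL, XQuery) on unstructured data (documents): Querying entities & relations from IE (MSR Beijing, UW Seattle, IBM Almaden, UIUC, MPI, …)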


Why DB&IR Forever?

Turn the Web, Web2.0, and Web3.0 into the world's most comprehensive knowledge base ("semantic DB")!

• Data enrichment at very large scale
• Text and speech are key sources of knowledge production (publications, patents, conferences, meetings, ...)

Source | 2000 | 2007
indexed Web | 2 Bio. | 20 Bio.
Flickr photos | --- | 100 Mio.
digital photos | ? | 150 Bio.
Wikipedia | 8 000 | 1.8 Mio.
OECD researchers | 7.4 Mio. | 8.4 Mio.
patents world-wide | ? | 60 Mio.
US Library of Congress | 115 Mio. | 134 Mio.
Google Scholar | --- | 500 Mio.


Outline

• Past: Matter, Antimatter, and Wormholes
• Present: XML and Graph IR
• Future: From Data to Knowledge


Quiz Time

Gerard Salton: in which country was he born (and did he grow up)?

A: USA
B: England
C: Netherlands
D: Germany
E: Singapore
F: Indonesia

Answer: D: Germany

Gerard Salton, 1927 – 1995
Prof. Cornell Univ., 1965 – 1995


Parallel Universes: A Closer Look

Matter:
• user = programmer
• query = precise spec. of info request
• interaction via API
• strength: indexing, QP
• weakness: user model
• eval. measure: efficiency (throughput, response time, TPC-H, XMark, …)

Antimatter:
• user = your kids
• query = approximation of user's real info needs
• interaction process via GUI
• strength: ranking model
• weakness: interoperability
• eval. measure: effectiveness (precision, recall, F1, MAP, NDCG, TREC & INEX benchmarks, …)
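
The effectiveness measures named above can be made concrete with a small sketch. The document IDs and relevance judgments are hypothetical, and MAP would simply be the mean of `average_precision` over a set of queries; NDCG is not covered here.

```python
def precision_recall_f1(retrieved, relevant):
    """Set-based precision, recall, and F1 for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


def average_precision(ranking, relevant):
    """Sum of precision@rank at each relevant hit, divided by #relevant."""
    relevant = set(relevant)
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


ranking = ["d3", "d1", "d7", "d2"]   # hypothetical ranked result list
relevant = {"d1", "d2", "d5"}        # hypothetical relevance judgments
print(precision_recall_f1(ranking, relevant))
print(average_precision(ranking, relevant))
```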


[Figure: timeline from 1990 to 2005 with a DB strand and an IR strand and the work connecting them]
Labeled milestones:
• VAGUE (Motro)
• Proximal Nodes (Baeza-Yates et al.)
• Prob. DB (Cavallo & Pittarelli)
• Prob. Tuples (Barbara et al.)
• Prob. Datalog (Fuhr et al.)
• WHIRL (Cohen)
• Web Query Languages: W3QS, WebOQL, Araneus …
• Semistructured Data: Lore, Xyleme …
• 1st Gen. XML IR: XXL, XIRQL, Elixir, JuruXML
• 2nd Gen. XML IR: XRank, Timber, TIJAH, XSearch, FleXPath, CoXML, TopX, MarkLogic, Fast …
• Uncertain & Prob. Relations: Mystiq, Trio …
• Web Entity Search: Libra, Avatar, ExDB …
• Faceted Search: Flamenco …
• Further labels: Struct. Docs, Multimedia IR, Digital Libraries, Deep Web Search, INEX, XPath, XPath Full-Text, Graph IR


WHIRL: IR over Relations [W.W. Cohen: SIGMOD’98]

Add text-similarity selection and join to relational algebra

Example:
Select * From Movies M, Reviews R
Where M.Plot ~ "fight" And M.Year > 1990 And R.Rating > 3
And M.Title ~ R.Title And M.Plot ~ R.Comment

Movies:
Title | Plot | … | Year
Matrix | In the near future … computer hacker Neo … fight training … | … | 1999
Hero | In ancient China … fights … sword fight … fights Broken Sword … | … | 2002
Shrek 2 | In Far Far Away … our lovely hero fights with cat killer … | … | 2004

Reviews:
Title | Comment | … | Rating
Matrix 1 | … cool fights … new techniques … | … | 4
Matrix Reloaded | … fights … and more fights … fairly boring … | … | 1
Matrix Eigenvalues | … matrix spectrum … orthonormal … | … | 5
Ying xiong aka. Hero | … fight for peace … sword fight … dramatic colors … | … | 5

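Scoring and ranking:

$$ s(\langle x,y \rangle, q: A \sim B) = \text{cosine}(x.A, y.B) $$

$$ s(\langle x,y \rangle, q_1 \wedge \dots \wedge q_m) = \prod_{i=1}^{m} s(\langle x,y \rangle, q_i) $$

$$ x_j \sim tf(\text{word } j \text{ in } x) \cdot idf(\text{word } j) \quad \text{with dampening \& normalization} $$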

• DB&IR for query-time data integration
• More recent work: MinorThird, Spider, DBLife, etc.
• But scoring models fairly ad hoc
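
A minimal sketch of a WHIRL-style similarity join, assuming whitespace tokenization, a dampened tf of 1 + log tf, and idf = log((n+1)/df); these weighting details are assumptions of the sketch, not Cohen's exact formulas. It covers only the `M.Title ~ R.Title` literal together with the two exact predicates; further similarity literals would multiply into the score.

```python
import math
from collections import Counter


def tfidf_vectors(texts):
    """Map each text to a unit-length tf*idf vector (dict: word -> weight)."""
    docs = [Counter(t.lower().split()) for t in texts]
    n = len(docs)
    df = Counter(w for d in docs for w in d)          # document frequencies
    vectors = []
    for d in docs:
        v = {w: (1 + math.log(tf)) * math.log((n + 1) / df[w])
             for w, tf in d.items()}
        norm = math.sqrt(sum(x * x for x in v.values())) or 1.0
        vectors.append({w: x / norm for w, x in v.items()})
    return vectors


def cosine(u, v):
    """Cosine of two unit-length sparse vectors."""
    return sum(x * v.get(w, 0.0) for w, x in u.items())


def similarity_join(movies, reviews, min_year=1990, min_rating=3):
    """Rank (movie, review) pairs by title similarity; exact predicates filter."""
    titles = [m[0] for m in movies] + [r[0] for r in reviews]
    vecs = tfidf_vectors(titles)
    mvecs, rvecs = vecs[:len(movies)], vecs[len(movies):]
    results = []
    for (mtitle, year), mv in zip(movies, mvecs):
        if year <= min_year:
            continue
        for (rtitle, rating), rv in zip(reviews, rvecs):
            if rating <= min_rating:
                continue
            score = cosine(mv, rv)
            if score > 0:
                results.append((score, mtitle, rtitle))
    return sorted(results, reverse=True)


movies = [("Matrix", 1999), ("Hero", 2002), ("Shrek 2", 2004)]
reviews = [("Matrix 1", 4), ("Matrix Reloaded", 1),
           ("Matrix Eigenvalues", 5), ("Ying xiong aka. Hero", 5)]
for score, mtitle, rtitle in similarity_join(movies, reviews):
    print(round(score, 3), mtitle, "~", rtitle)
```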


XXL: Early XML IR [Anja Theobald, GW: Adding Relevance to XML, WebDB'00]

Which professors from Saarbruecken (SB) are teaching IR and have research projects on XML?

Union of heterogeneous sources without global schema

Similarity-aware XPath:
//~Professor [//* = "~SB"] [//~Course [//* = "~IR"] ] [//~Research [//* = "~XML"] ]

[Figure: two example XML documents from different sources]
• Professor: Name: Gerhard Weikum; Address ... (City: SB, Country: Germany); Teaching: Course (Title: IR; Description: Information retrieval ...; Syllabus ... with Book, Article, ...); Research: Project (Title: Intelligent Search of Heterogeneous XML Data; Funding: EU; ...)
• Lecturer: Name: Ralf Schenkel; Address: Max-Planck Institute for Informatics, Germany; Activities: Seminar (Contents: Ranked retrieval …; Literature: …), Scientific (Name: INEX task coordinator (Initiative for the Evaluation of XML …)), Other (Sponsor: EU)


Scoring and ranking:
• tf*idf for content condition
• ontological similarity for relaxed tag condition
• score aggregation with probabilistic independence

Ontological similarity measures:
• Wu & Palmer: |path| through lca(x,y)
• Dice coeff.: 2 #(x,y) / (#x + #y) on Web

Query expansion model: disjunction of tags

[Figure: ontology graph around "professor" with weighted edges such as HYPONYM (0.749) and RELATED (0.48); neighboring concepts: magician, wizard, intellectual, artist, alchemist, director, primadonna, teacher, scholar, academic / academician / faculty member, scientist, researcher, investigator, mentor, lecturer]
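
The Dice coefficient and the expansion of a relaxed tag such as ~Professor into a weighted disjunction can be sketched as follows. The toy ontology, its edge weights, the threshold, and the rule that similarity along a path is the product of edge weights are all assumptions of this sketch rather than details confirmed by the slides.

```python
def dice(count_x, count_y, count_xy):
    """Dice coefficient 2 * #(x,y) / (#x + #y) from (co-)occurrence counts."""
    total = count_x + count_y
    return 2.0 * count_xy / total if total else 0.0


def expand_tag(tag, ontology, threshold):
    """Expand ~tag into {similar_tag: similarity} for similarity >= threshold.

    ontology maps a tag to {neighbor: edge weight in (0, 1]}; the similarity
    of a path is taken to be the product of its edge weights (assumption).
    """
    best = {tag: 1.0}
    frontier = [tag]
    while frontier:
        current = frontier.pop()
        for neighbor, weight in ontology.get(current, {}).items():
            sim = best[current] * weight
            if sim >= threshold and sim > best.get(neighbor, 0.0):
                best[neighbor] = sim
                frontier.append(neighbor)
    return best


# Hypothetical ontology fragment; weights are illustrative only.
ontology = {
    "professor": {"lecturer": 0.749, "mentor": 0.48},
    "lecturer": {"teacher": 0.7, "professor": 0.749},
}
expansion = expand_tag("professor", ontology, threshold=0.5)
print(" | ".join(sorted(expansion)))    # tags of the resulting disjunction
print(dice(100, 300, 50))               # hypothetical Web counts
```

Raising the threshold shrinks the disjunction, which is one simple way to limit both topic drift and the cost of evaluating large expansions.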


The Past: Lessons Learned

• DB&IR: added flexible ranking to (semi-)structured querying to cope with schema and instance diversity

• but ranking seems "ad hoc" and is not consistently good in benchmarks

• to win a benchmark, tuning is needed, but tuning is easier if ranking is principled!

• ontologies are a mixed blessing: diverse quality, subtle concept similarity, danger of topic drift

• ontology-based query expansion (into large disjunctions) poses an efficiency challenge

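[Figure: precision vs. recall plot]

Query expansion example:
// ~Professor [...]
is expanded into
// { Professor, Researcher, Lecturer, Scientist, Scholar, Academic, ... }[...]

[Figure: ontology fragment with the nodes entity, substance, solid, food, produce, edible fruit, pome, apple, Golden Delicious, element, gold]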


Outline

Past: Matter, Antimatter, and Wormholes
• Present: XML and Graph IR
• Future: From Data to Knowledge


Quiz Time

Which is the largest XML data collection in the universe?

A: Yahoo! Answers
B: INEX benchmark
C: Derwent WPI
D: Elsevier Scopus
E: 51.com
F: Traffic violations in EU

Answer: C: Derwent WPI


TopX: 2nd Generation XML IR

"Semantic" XPath Full-Text query:
/Article [ftcontains(//Person, "Max Planck")] [ftcontains(//Work, "quantum physics")]//Children[@Gender = "female"]//Birthdates

supported by TopX engine:
http://infao5501.ag5.mpi-sb.mpg.de:8080/topx/
http://topx.sourceforge.net

• Exploit tags & structure for better precision
• Can relax tag names & structure for better recall
• Principled ranking by probabilistic IR (Okapi BM25 for XML)
• Efficient top-k query processing (using improved TA)
• Robust ontology integration (self-throttling to avoid topic drift)
• Efficient query expansion (on demand, by extended TA)
• Relevance feedback for automatic query rewriting

[Martin Theobald, Ralf Schenkel, GW: VLDB’05, VLDB Journal]
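
For readers unfamiliar with TA, here is a minimal sketch of the basic threshold algorithm for top-k retrieval with sum aggregation. It is the textbook variant, not TopX's improved or extended version, and the index lists and scores are hypothetical.

```python
import heapq


def threshold_algorithm(index_lists, k):
    """Basic TA: round-robin sorted access plus random access, sum aggregation.

    index_lists: one list per query condition, each holding (doc, score)
    pairs sorted by descending score; a doc missing from a list scores 0.
    """
    lookup = [dict(lst) for lst in index_lists]       # random-access tables
    seen = {}                                         # doc -> aggregated score
    max_depth = max((len(lst) for lst in index_lists), default=0)
    for depth in range(max_depth):
        threshold = 0.0
        for lst in index_lists:                       # one sorted access per list
            if depth < len(lst):
                doc, score = lst[depth]
                threshold += score                    # best score still possible
                if doc not in seen:
                    seen[doc] = sum(tbl.get(doc, 0.0) for tbl in lookup)
        top = heapq.nlargest(k, seen.items(), key=lambda item: item[1])
        if len(top) == k and top[-1][1] >= threshold:
            return top                                # no unseen doc can do better
    return heapq.nlargest(k, seen.items(), key=lambda item: item[1])


# Hypothetical per-condition index lists
person_list = [("d1", 0.9), ("d2", 0.8), ("d3", 0.3), ("d4", 0.1)]
work_list = [("d2", 0.7), ("d3", 0.6), ("d1", 0.2), ("d4", 0.1)]
print(threshold_algorithm([person_list, work_list], k=2))
```

In this example the scan stops at depth 3 of 4, because the threshold (0.3 + 0.2) has dropped below the score of the second-best document seen so far.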


Commercial Break
[Martin Theobald, Ralf Schenkel, GW: VLDB'05]

TopX demo today 3:30 – 5:30


Principled Ranking by Probabilistic IR

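Odds for item d with terms d_i being relevant for query q = {q1, …, qm}:

$$ s(d,q) = \frac{P[d \in R(q) \mid \text{contents of } d]}{P[d \notin R(q) \mid \text{contents of } d]} = \frac{P[R \mid d]}{P[\neg R \mid d]} $$

With binary features and conditional independence of features [Robertson & Sparck-Jones 1976]:

$$ s(d,q) \sim \prod_{i=1}^{m} \frac{P[d_i \mid R]}{P[d_i \mid \neg R]} \sim \sum_{i \in q \cap d} \left( \log \frac{p_i}{1-p_i} + \log \frac{1-q_i}{q_i} \right) $$

$$ \text{with} \quad p_i = P[d_i \mid R], \qquad q_i = P[d_i \mid \neg R] $$

Now estimate p_i and q_i values from
• relevance feedback,
• pseudo-relevance feedback,
• corpus statistics
by MLE (with statistical smoothing) and store precomputed p_i, q_i in index

$$ \hat{p}_i = \#\text{(rel.) docs} \,/\, \#\text{docs}, \qquad q_i \approx P[d_i \mid \text{corpus}] $$

Relationship to tf*idf:

$$ \sum_{i} \log \left( \frac{tf(i,d)}{\sum_k tf(k,d)} \cdot \frac{\sum_k df(k)}{df(i)} \right) \quad \text{with} \quad \hat{p}_i = \frac{tf(i,d)}{\sum_k tf(k,d)}, \quad \hat{q}_i = \frac{df(i)}{\sum_k df(k)} $$

"God does not play dice." (Einstein)
IR does.

related to but different from statistical language models

• led to Okapi BM25 (wins TREC tasks)
• adapted and extended to XML in TopX, ...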
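
A minimal sketch of scoring with the term weights log p/(1-p) + log (1-q)/q. It assumes no relevance information is available, so p_i is fixed at 0.5 and q_i is estimated from document frequencies with add-0.5 smoothing; both choices are assumptions of the sketch, and this is not Okapi BM25. The corpus statistics are hypothetical.

```python
import math


def term_weight(p, q):
    """Weight of one query term that occurs in the document."""
    return math.log(p / (1 - p)) + math.log((1 - q) / q)


def estimate_q(df, n_docs):
    """Smoothed corpus-based estimate of q_i = P[d_i | not R]."""
    return (df + 0.5) / (n_docs + 1.0)


def score(doc_terms, query_terms, df, n_docs, p=0.5):
    """Sum the term weights over terms shared by query and document."""
    return sum(term_weight(p, estimate_q(df[t], n_docs))
               for t in set(query_terms) & set(doc_terms))


# Hypothetical corpus statistics: document frequencies in a 1000-doc corpus
df = {"xml": 50, "retrieval": 200, "the": 990}
query = ["xml", "retrieval"]
doc_a = ["the", "xml", "retrieval"]
doc_b = ["the", "retrieval"]
print(score(doc_a, query, df, 1000), score(doc_b, query, df, 1000))
```

With p fixed at 0.5 the first logarithm vanishes and each weight reduces to log((1-q)/q), an idf-like quantity, which is the link to tf*idf drawn on the slide.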


Probabilistic Ranking for SQL

SQL queries that return many answers need ranking

Examples:
• Houses (Id, City, Price, #Rooms, View, Pool, SchoolDistrict, …)
  Select * From Houses Where View = "Lake" And City In ("Redmond", "Bellevue")
• Movies (Id, Title, Genre, Country, Era, Format, Director, Actor1, Actor2, …)
  Select * From Movies Where Genre = "Romance" And Era = "90s"

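Odds for tuple d with attributes XY relevant for query q: X1=x1 ∧ … ∧ Xm=xm:

$$ s(d,q) = \frac{P[R \mid d]}{P[\neg R \mid d]} \sim \frac{P[d \mid R]}{P[d \mid \neg R]} = \frac{P[XY \mid R]}{P[XY \mid \neg R]} \sim \frac{P[Y \mid R]}{P[Y]} \cdot \frac{1}{P[X \mid Y]} $$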

Estimate probabilities, exploiting workload W:

[S. Chaudhuri, G. Das, V. Hristidis, GW: TODS'06]

$$ P[Y \mid R] \approx P[Y \mid X, W] $$

Example: frequent queries
• … Where Genre = "Romance" And Actor1 = "Hugh Grant"
• … Where Actor1 = "Hugh Grant" And Actor2 = "Julia Roberts"
boost HG and JR movies in ranking for Genre = "Romance" And Era = "90s"
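
A deliberately simplified sketch of the workload idea: the ratio P[Y | X, W] / P[Y] is approximated per unspecified attribute from a query log and from the table, with add-one smoothing and an independence assumption across attributes. The 1/P[X | Y] factor is omitted, "related" workload queries are those sharing at least one condition with the current query, and all rows and log entries are hypothetical. None of this is claimed to match the estimators of the TODS'06 paper. Under this narrow definition only the first logged query contributes, so the Julia Roberts boost from the slide would need a richer association model.

```python
def workload_score(row, query, workload, table, attrs, smooth=1.0):
    """Approximate P[Y | X, W] / P[Y] for the unspecified attributes of a row."""
    related = [w for w in workload
               if any(w.get(a) == v for a, v in query.items())]
    score = 1.0
    for attr in attrs:
        if attr in query:
            continue                       # X attributes are fixed by the query
        value = row[attr]
        in_workload = sum(1 for w in related if w.get(attr) == value)
        p_y_given_xw = (in_workload + smooth) / (len(related) + 2 * smooth)
        in_table = sum(1 for t in table if t[attr] == value)
        p_y = (in_table + smooth) / (len(table) + 2 * smooth)
        score *= p_y_given_xw / p_y
    return score


def rank(query, workload, table, attrs):
    """Answer the exact-match query, then order the answers by workload score."""
    answers = [t for t in table if all(t[a] == v for a, v in query.items())]
    return sorted(answers, reverse=True,
                  key=lambda t: workload_score(t, query, workload, table, attrs))


# Hypothetical table and query log
table = [
    {"Title": "Notting Hill", "Genre": "Romance", "Era": "90s",
     "Actor1": "Hugh Grant", "Actor2": "Julia Roberts"},
    {"Title": "Titanic", "Genre": "Romance", "Era": "90s",
     "Actor1": "Leonardo DiCaprio", "Actor2": "Kate Winslet"},
    {"Title": "Speed", "Genre": "Action", "Era": "90s",
     "Actor1": "Keanu Reeves", "Actor2": "Sandra Bullock"},
]
workload = [
    {"Genre": "Romance", "Actor1": "Hugh Grant"},
    {"Actor1": "Hugh Grant", "Actor2": "Julia Roberts"},
    {"Genre": "Action", "Actor1": "Keanu Reeves"},
]
query = {"Genre": "Romance", "Era": "90s"}
ranked = rank(query, workload, table, attrs=["Actor1", "Actor2"])
print([t["Title"] for t in ranked])
```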


From Tables and Trees to Graphs

Example:
Conferences (CId, Title, Location, Year)
Journals (JId, Title)
CPublications (PId, Title, CId)
JPublications (PId, Title, Vol, No, Year)
Authors (PId, Person)
Editors (CId, Person)
Select * From * Where * Contains "Gray, DeWitt, XML, Performance" And Year > 95

Schema-agnostic keyword search over multiple tables: graph of tuples with foreign-key relationships as edges

[BANKS, Discover, DBExplorer, KUPS, SphereSearch, BLINKS]

Result is connected tree with nodes that contain as many query keywords as possible

Ranking:

$$ s(\text{tree}, q) = \alpha \sum_{\text{nodes } n} \text{nodeScore}(n, q) + (1 - \alpha) \cdot \frac{1}{1 + \sum_{\text{edges } e} \text{edgeScore}(e)} $$

with nodeScore based on tf*idf or prob. IR and edgeScore reflecting importance of relationships (or confidence, authority, etc.)

Related use cases:
• XML beyond trees
• RDF graphs
• ER graphs (e.g. from IE)
• social networks

Top-k querying: compute best trees, e.g. Steiner trees (NP-hard)
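
The ranking function can be sketched directly, assuming a weighted combination with a mixing parameter alpha and edge scores that enter through the reciprocal term, so that larger edge totals lower the score. The value alpha = 0.5, the node names, and all scores are hypothetical, and the sketch only compares given candidate trees; it does not search for Steiner trees.

```python
def tree_score(nodes, edges, node_score, edge_score, alpha=0.5):
    """alpha * sum of node scores + (1 - alpha) / (1 + sum of edge scores)."""
    node_part = sum(node_score[n] for n in nodes)
    edge_part = 1.0 / (1.0 + sum(edge_score[e] for e in edges))
    return alpha * node_part + (1 - alpha) * edge_part


# Hypothetical tuple graph: two keyword-matching tuples and one connector
node_score = {"author:Gray": 0.8, "paper:XML": 0.6, "conf:VLDB": 0.0}
edge_score = {
    ("author:Gray", "paper:XML"): 1.0,
    ("author:Gray", "conf:VLDB"): 1.0,
    ("conf:VLDB", "paper:XML"): 1.0,
}
direct = tree_score(["author:Gray", "paper:XML"],
                    [("author:Gray", "paper:XML")],
                    node_score, edge_score)
detour = tree_score(["author:Gray", "conf:VLDB", "paper:XML"],
                    [("author:Gray", "conf:VLDB"), ("conf:VLDB", "paper:XML")],
                    node_score, edge_score)
print(direct, detour)   # the more compact tree scores higher
```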


The Present: Observations & Opportunities

• Probabilistic IR and statistical language models yield principled ranking and high effectiveness (related to prob. relational models (Suciu, Getoor, …) but different)

• Structural similarity and ranking based on tree edit distance (FleXPath, Timber, …)

• Aim for comprehensive XML ranking model capturing content, structure, ontologies

• Aim to generate structure skeleton in XPath query from user feedback

• Good progress on performance but still many open efficiency issues

[Figure: example XML trees that arrange movie, actor, director, and plot elements in different structures]

Example: the keyword query "life physicist Max Planck" becomes
//article[//person "Max Planck"] [//category "physicist"] //biography


Outline

Past: Matter, Antimatter, and Wormholes
Present: XML and Graph IR
• Future: From Data to Knowledge


Quiz Time

Who said:
"Information is not knowledge. Knowledge is not wisdom. Wisdom is not truth. Truth is not beauty. Beauty is not love. Love is not music. Music is the best."

A: Richard Feynman
B: Sigmund Freud
C: Larry Page
D: Frank Zappa
E: Marie Curie
F: Lao-tse

Answer: D: Frank Zappa


Knowledge Queries

Turn the Web, Web2.0, and Web3.0 into the world's most comprehensive knowledge base ("semantic DB")!

Answer "knowledge queries" such as:
• proteins that inhibit both protease and some other enzyme
• neutron stars with X-ray bursts > 10^40 erg s^-1 & black holes in 10''
• differences in Rembetiko music from Greece and from Turkey
• connection between Thomas Mann and Goethe
• market impact of Web2.0 technology in December 2006
• sympathy or antipathy for Germany from May to August 2006
• Nobel laureate who survived both world wars and his children
• drama with three women making a prophecy to a British nobleman that he will become king


Three Roads to Knowledge

• Handcrafted High-Quality Knowledge Bases (Semantic-Web-style ontologies, encyclopedias, etc.)

• Large-scale Information Extraction & Harvesting: (using pattern matching, NLP, statistical learning, etc. for product search, Web entity/object search, ...)

• Social Wisdom from Web 2.0 Communities (social tagging, folksonomies, human computing, e.g.: del.icio.us, flickr, answers.yahoo, iknow.baidu, ...)



High-Quality Knowledge Sources

• universal "common-sense" ontologies:
  • SUMO (Suggested Upper Merged Ontology): 60 000 OWL axioms
  • Cyc: 5 Mio. facts (OpenCyc: 2 Mio. facts)
• domain-specific ontologies:
  • UMLS (Unified Medical Language System): 1 Mio. biomedical concepts, 135 categories, 54 relations (e.g. virus causes disease | symptom)
  • GeneOntology, etc.
• thesauri and concept networks:
  • WordNet: 200 000 concepts (word senses) and hypernym/hyponym relations
  • can be cast into OWL-lite (or typed graph with statistical weights)
• lexical sources:
  • Wikipedia (1.8 Mio. articles, 40 Mio. links, 100 languages) etc.
• hand-tagged natural-language corpora:
  • TEI (Text Encoding Initiative) markup of historic encyclopedia
  • FrameNet: sentences classified into frames with semantic roles

growing with strong momentum


High-Quality Knowledge Sources
General-purpose thesauri and concept networks: WordNet family

enzyme -- (any of several complex proteins that are produced by cells and act as catalysts in specific biochemical reactions)
  => protein -- (any of a large group of nitrogenous organic compounds that are essential constituents of living cells; ...)
    => macromolecule, supermolecule ...
      => organic compound -- (any compound of carbon and another element or a radical) ...
  => catalyst, accelerator -- ((chemistry) a substance that initiates or accelerates a chemical reaction without itself being affected)
    => activator -- ((biology) any agency bringing about activation; ...)

can be cast into
• OWL-lite or into
• graph, with weights for relation strengths (derived from co-occurrence statistics)


High-Quality Knowledge Sources
Wikipedia and other lexical sources


Exploit Hand-Crafted Knowledge
Wikipedia, WordNet, and other lexical sources

{{Infobox_Scientist
| name = Max Planck
| birth_date = [[April 23]], [[1858]]
| birth_place = [[Kiel]], [[Germany]]
| death_date = [[October 4]], [[1947]]
| death_place = [[Göttingen]], [[Germany]]
| residence = [[Germany]]
| nationality = [[Germany|German]]
| field = [[Physicist]]
| work_institution = [[University of Kiel]]</br> [[Humboldt-Universität zu Berlin]]</br> [[Georg-August-Universität Göttingen]]
| alma_mater = [[Ludwig-Maximilians-Universität München]]
| doctoral_advisor = [[Philipp von Jolly]]
| doctoral_students = [[Gustav Ludwig Hertz]]</br>…
| known_for = [[Planck's constant]], [[Quantum mechanics|quantum theory]]
| prizes = [[Nobel Prize in Physics]] (1918)
…


YAGO: Yet Another Great Ontology
[F. Suchanek, G. Kasneci, GW: WWW 2007]

• Turn Wikipedia into explicit knowledge base (semantic DB)

• Exploit hand-crafted categories and templates

• Represent facts as explicit knowledge triples:

relation (entity1, entity2)

(in 1st-order logic, compatible with RDF, OWL-lite, XML, etc.)

• Map (and disambiguate) relations into WordNet concept DAG

Examples:
entity1 --relation--> entity2
Max_Planck --bornIn--> Kiel
Kiel --isInstanceOf--> City
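
A minimal sketch of facts stored as relation (entity1, entity2) triples with wildcard lookup. The `Fact` type and the `match` helper are hypothetical names chosen for illustration; they are not YAGO's storage format or API.

```python
from collections import namedtuple

# One fact: relation(entity1, entity2)
Fact = namedtuple("Fact", "relation entity1 entity2")

facts = {
    Fact("bornIn", "Max_Planck", "Kiel"),
    Fact("isInstanceOf", "Kiel", "City"),
}


def match(facts, relation=None, entity1=None, entity2=None):
    """Return the facts matching a pattern; None acts as a wildcard."""
    return sorted(
        f for f in facts
        if relation in (None, f.relation)
        and entity1 in (None, f.entity1)
        and entity2 in (None, f.entity2)
    )


print(match(facts, relation="bornIn"))    # who was born where?
print(match(facts, entity1="Kiel"))       # everything known about Kiel
```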


YAGO Knowledge Representation

[Figure: YAGO graph with three layers: concepts, individuals, words]
• concepts: Entity, Person, Location, Scientist, Physicist, Biologist, City, Country, connected by subclass edges
• individuals: Max_Planck (instanceOf Physicist; bornOn April 23, 1858; diedOn October 4, 1947; bornIn Kiel; hasWon Nobel Prize; FatherOf Erwin_Planck), Kiel (instanceOf City)
• words: "Max Planck", "Dr. Planck", "Max Karl Ernst Ludwig Planck", each connected by a means edge

Knowledge Base | # Facts
KnowItAll | 30 000
SUMO | 60 000
WordNet | 200 000
OpenCyc | 300 000
Cyc | 5 000 000
YAGO | 6 000 000

Accuracy: 97%

Online access and download at http://www.mpi-inf.mpg.de/~suchanek/yago/


NAGA: Graph IR on YAGO [G. Kasneci et al.: WWW‘07]

Graph-based search on YAGO-style knowledge bases with built-in ranking based on confidence and informativeness

statistical language model for result graphs

conjunctive queries, e.g.:
$x isa scientist
$x bornIn Kiel

queries with regular expressions, e.g.:
$x isa scientist
$x (hasFirstName | hasLastName) Ling
$x worksFor $y
$y locatedIn* Zhejiang
$x (coAuthor | advisor)* Beng Chin Ooi
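
How a path expression such as locatedIn* combines with conjunctive conditions can be sketched over a small fact set. The entities below are hypothetical placeholders, facts are plain (relation, source, target) tuples, and the sketch performs no ranking; it is not NAGA's query processor.

```python
def reachable(facts, start, relation):
    """All entities reachable from start via zero or more `relation` edges."""
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for rel, src, dst in facts:
            if rel == relation and src == node and dst not in seen:
                seen.add(dst)
                stack.append(dst)
    return seen


def scientists_in(facts, region):
    """Evaluate: $x isa scientist, $x worksFor $y, $y locatedIn* region."""
    answers = set()
    for rel, x, y in facts:
        if rel == "worksFor" and ("isa", x, "scientist") in facts:
            if region in reachable(facts, y, "locatedIn"):
                answers.add((x, y))
    return answers


# Hypothetical knowledge base fragment
facts = {
    ("isa", "Ling_A", "scientist"),
    ("worksFor", "Ling_A", "Some_University"),
    ("locatedIn", "Some_University", "Hangzhou"),
    ("locatedIn", "Hangzhou", "Zhejiang"),
    ("isa", "Bob_B", "scientist"),
    ("worksFor", "Bob_B", "Other_Lab"),
    ("locatedIn", "Other_Lab", "Kiel"),
}
print(scientists_in(facts, "Zhejiang"))
```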


Ranking Factors

Confidence: Prefer results that are likely to be correct
• Certainty of IE
• Authenticity and Authority of Sources
Example:
bornIn (Max Planck, Kiel) from "Max Planck was born in Kiel" (Wikipedia)
livesIn (Elvis Presley, Mars) from "They believe Elvis hides on Mars" (Martian Bloggeria)

Informativeness: Prefer results that are likely important; may prefer results that are likely new to user
• Frequency in answer
• Frequency in corpus (e.g. Web)
• Frequency in query log
Example:
q: isa (Einstein, $y) with answers isa (Einstein, scientist), isa (Einstein, vegetarian)
q: isa ($x, vegetarian) with answers isa (Einstein, vegetarian), isa (Al Nobody, vegetarian)

Compactness: Prefer results that are tightly connected
• Size of answer graph
[Figure: answer graph over the nodes Einstein, vegetarian, Bohr, Nobel Prize, Tom Cruise, 1962 with isa, bornIn, diedIn, and won edges]


Information Extraction (IE): Text to Records

Person | BirthDate | BirthPlace | ...
Max Planck | 4/23, 1858 | Kiel
Albert Einstein | 3/14, 1879 | Ulm
Mahatma Gandhi | 10/2, 1869 | Porbandar

Person | ScientificResult
Max Planck | Quantum Theory

Person | Collaborator
Max Planck | Albert Einstein
Max Planck | Niels Bohr

Constant | Value | Dimension
Planck's constant | 6.626 × 10^-34 | Js

combine NLP, pattern matching, lexicons, statistical learning


Knowledge Acquisition from the Web

Learn Semantic Relations from Entire Corpora at Large Scale (as exhaustively as possible but with high accuracy)

Examples:
• all cities, all basketball players, all composers
• headquarters of companies, CEOs of companies, synonyms of proteins
• birthdates of people, capitals of countries, rivers in cities
• which musician plays which instruments
• who discovered or invented what
• which enzyme catalyzes which biochemical reaction

Existing approaches and tools (Snowball [Gravano et al. 2000], KnowItAll [Etzioni et al. 2004], …):

almost-unsupervised pattern matching and learning:
seeds (known facts) → patterns (in text) → (extraction) rules → (new) facts


Methods for Web-Scale Fact Extraction

seeds → text → rules → new facts

Example:
seed | text | rule
city (Seattle) | in downtown Seattle | in downtown X
city (Seattle) | Seattle and other towns | X and other towns
city (Las Vegas) | Las Vegas and other towns | X and other towns
plays (Zappa, guitar) | playing guitar: … Zappa | playing Y: … X
plays (Davis, trumpet) | Davis … blows trumpet | X … blows Y

New facts:
in downtown Beijing → city (Beijing)
Coltrane blows sax → plays (Coltrane, sax)

New facts serve as seeds for the next iteration:
seed | text | rule
city (Beijing) | old center of Beijing | old center of X
plays (Coltrane, sax) | sax player Coltrane | Y player X

Assessment of facts & generation of rules based on statistics

Rules can be more sophisticated:
playing NN: (ADJ|ADV)* NP & class(NN)=instrument & class(head(NP))=person → plays(head(NP), NN)
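
The seeds → text → rules → new facts loop can be sketched with plain string patterns. This is a toy under strong assumptions: entities are single words, a rule is the whole text snippet with the seed arguments replaced by placeholders, and there is no statistical assessment of rules or facts. It is not the algorithm of Snowball or KnowItAll, and the snippets are made up.

```python
import re


def learn_patterns(seeds, snippets):
    """Turn snippets that mention all arguments of a seed fact into patterns."""
    patterns = set()
    for args in seeds:
        for text in snippets:
            if all(re.search(r"\b%s\b" % re.escape(a), text) for a in args):
                pattern = text
                for i, a in enumerate(args):
                    pattern = pattern.replace(a, "ARG%d" % i, 1)
                patterns.add(pattern)
    return patterns


def apply_patterns(patterns, arity, corpus):
    """Match the patterns against new text and collect argument tuples."""
    facts = set()
    for pattern in patterns:
        regex = re.escape(pattern)
        for i in range(arity):
            regex = regex.replace("ARG%d" % i, r"(?P<a%d>\w+)" % i)
        for text in corpus:
            m = re.search(regex, text)
            if m:
                facts.add(tuple(m.group("a%d" % i) for i in range(arity)))
    return facts


snippets = ["in downtown Seattle", "Seattle and other towns", "Davis blows trumpet"]
corpus = ["in downtown Beijing", "Shanghai and other towns", "Coltrane blows sax"]

city_rules = learn_patterns([("Seattle",)], snippets)
plays_rules = learn_patterns([("Davis", "trumpet")], snippets)
print(sorted(city_rules), sorted(plays_rules))
print(apply_patterns(city_rules, 1, corpus))
print(apply_patterns(plays_rules, 2, corpus))
```

Feeding the extracted facts back in as seeds gives the iteration shown above; without an assessment step, though, a single bad pattern would quickly pollute the fact set.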


Performance of Web-IE

State-of-the-art precision/recall results:

relation | precision | recall | corpus | systems
countries | 80% | 90% | Web | KnowItAll
cities | 80% | ??? | Web | KnowItAll
scientists | 60% | ??? | Web | KnowItAll
headquarters | 90% | 50% | News | Snowball, LEILA
birthdates | 80% | 70% | Wikipedia | LEILA
instanceOf | 40% | 20% | Web | Text2Onto, LEILA
Open IE | 80% | ??? | Web | TextRunner

Anecdotal evidence:
invented (A.G. Bell, telephone)
married (Hillary Clinton, Bill Clinton)
isa (yoga, relaxation technique)
isa (zearalenone, mycotoxin)
contains (chocolate, theobromine)
contains (Singapore sling, gin)

invented (Johannes Kepler, logarithm tables)
married (Segolene Royal, Francois Hollande)
isa (yoga, excellent way)
isa (your day, good one)
contains (chocolate, raisins)
plays (the liver, central role)
makes (everybody, mistakes)

precision value-chain: entities 80%, attributes 70%, facts 60%, events 50%


Beyond Surface Learning with LEILA

Almost-unsupervised Statistical Learning with Dependency Parsing

Learning to Extract Information by Linguistic Analysis [F. Suchanek, G. Ifrim, GW: KDD'06]

Limitation of surface patterns: who discovered or invented what
"Tesla's work formed the basis of AC electric power"
"Al Gore funded more work for a better basis of the Internet"

Examples: (Cologne, Rhine), (Cairo, Nile), …
Counterexamples: (Cairo, Rhine), (Rome, 0911), (, [0..9]*), …

Parsed sentences:
"Cologne lies on the banks of the Rhine"
"People in Cairo like wine from the Rhine valley"
"Paris was founded on an island in the Seine" yields (Paris, Seine)
[Figure: link-grammar linkages (Ss, Pv, MVp, Ds, Js, Jp, Mp, Os, …) and constituent labels (NP, VP, PP) for each sentence]

LEILA outperforms other Web-IE methods in terms of precision, recall, F1, but:
• dependency parser is slow
• one relation at a time


IE Efficiency and Accuracy Tradeoffs

IE is cool, but what's in it for DB folks?

• precision vs. recall: two-stage processing (filter pipeline)
  1) recall-oriented harvesting
  2) precision-oriented scrutinizing
• preprocessing
  • indexing: NLP trees & graphs, N-grams, PoS-tag patterns?
  • exploit ontologies? exploit usage logs?
• turn crawl & extract into set-oriented query processing
  • candidate finding
  • efficient phrase, pattern, and proximity queries
  • optimizing entire text-mining workflows [Ipeirotis et al.: SIGMOD'06]

[see also tutorials by Cohen, Doan/Ramakrishnan/Vaithyanathan, Agichtein/Sarawagi]


The Future: Challenges

• Generalize YAGO approach (Wikipedia + WordNet)
• Methods for comprehensive, highly accurate mappings across many knowledge sources
  • cross-lingual, cross-temporal
  • scalable in size, diversity, number of sources
• Pursue DB support towards efficient IE (and NLP)
• Achieve Web-scale IE throughput that can
  • sustain rate of new content production (e.g. blogs)
  • with > 90% accuracy and Wikipedia-like coverage
• Integrate handcrafted knowledge with NLP/ML-based IE
• Incorporate social tagging and human computing


Outline

Past: Matter, Antimatter, and Wormholes
Present: XML and Graph IR
Future: From Data to Knowledge


Major Trends in DB and IR

Database Systems | Information Retrieval
malleable schema (later) | deep NLP, adding structure
record linkage | info extraction
graph mining | entity-relationship graph IR
programmability | search as Web Service
dataspaces | Web objects
Web 2.0 | Web 2.0

Further trends shown on the slide: ontologies, ranking, statistical language models, data uncertainty


Conclusion

• DB&IR integration agenda:
  • models: ranking, ontologies, prob. SQL?, graph IR?
  • languages and APIs: XQuery Full-Text++?
  • systems: drop SQL, go light-weight? combine with P2P, Deep Web, ...?
• Rethink progress measures and experimental methodology
• Address killer app(s) and grand challenge(s):
  • from data to knowledge (Web, products, enterprises)
  • integrate knowledge bases, info extraction, social wisdom
  • cope with uncertainty; ranking as first-class principle
• Bridge cultural differences between DB and IR:
  • co-locate SIGIR and SIGMOD


DB&IR: Both Sides Now

Joni Mitchell (1969): Both Sides Now

…
I've looked at life from both sides now,
From up and down, and still somehow
It's life's illusions I recall.
I really don't know life at all.

Thank You !