
Lab name TBA 1IIS internal talk

Intelligent Information Integration (I3)

Chun-Nan Hsu

Institute of Information Science

Academia Sinica, Taipei, Taiwan

Copyright © 1998 Chun-Nan Hsu. All rights reserved.

Prepared for a presentation at IIS, AS, Taiwan

October 12, 1998


Information Distribution and Information Integration

Real query: need a list of attorneys in the Phoenix metro area who specialize in immigration and deportation. Also show their years in service, educational background, and languages spoken.

The answer IS on the Web!
» US West yellow pages web site
» US bar association member directory web site
» Alumni directories of law schools
» and more…

BUT…


Intelligent Information Integration

Environment assumptions:
» Autonomous information sources
» Heterogeneous but relevant
» Query only (no or limited update allowed)

Desiderata:
» Extensible – easily add new sources
» Flexible – can be queried in as many ways as integrated sources
» Scalable – integrate 1,000s, 10,000s, 100,000s of relevant information sources


Solution: Information Integration Systems (IIS)

Also known as:
» information mediation agents, information mediators
» information gathering agents
» information brokering agents, information brokers

Key ideas:
» users access data through a domain model
» information sources are represented by a source model
» the mediator reformulates a domain model query into source model sub-queries
» the mediator constructs a query plan that determines the order of data flow and execution to retrieve data


Architecture

[Figure: the Information Integration Service sits between human and computer users and heterogeneous data sources]

Heterogeneous data sources:
» Text, images/video, spreadsheets
» Hierarchical and network databases
» Relational databases
» Object and knowledge bases

Components of the Information Integration Service:
» Wrappers (SQL, ORB) – provide translation and communication with sources
» Source model – provides source descriptions and semantic integration
» Domain model – describes the domain: terms and their relations
» Query planner – determines how to answer input queries
» Query plan optimizer/executor – optimizes and executes a query plan
» User services – query, monitor


Query processing flow

[Figure: query processing flow through the Information Integration Service. Components shown: information sources, wrappers (ORB, SQL), source model, domain model, query planner, query plan optimizer/executor. Data-flow labels: user query, query plan, subqueries, translated query and data, answer.]


Query plan

Query: Find immigration attorneys in Phoenix and their educational background...

P (Us-west.com):
SELECT name1, phone, address
FROM LAWFIRMS
WHERE location = Phoenix and class = law-firms and specialty = immigration

S (Law.yale.edu, Law.harvard.edu):
SELECT name2, year, degree
FROM ALUMNI
WHERE name2 is one of name1

Output:
SELECT *
FROM P,S,…
WHERE P.name1 = S.name2
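The plan above can be sketched in Python as follows. This is only an illustration under stated assumptions: two in-memory SQLite databases stand in for the remote Web sources, the table columns are simplified, and every row is invented. The mediator runs subquery P first, restricts subquery S to the names P returned, and performs the final join itself.

```python
import sqlite3

# Two in-memory databases stand in for the remote sources (an assumption);
# all rows are invented for illustration.
yellow_pages = sqlite3.connect(":memory:")
yellow_pages.execute(
    "CREATE TABLE lawfirms (name1 TEXT, phone TEXT, location TEXT, specialty TEXT)")
yellow_pages.executemany(
    "INSERT INTO lawfirms VALUES (?, ?, ?, ?)",
    [("A. Lee", "555-0101", "Phoenix", "immigration"),
     ("B. Cruz", "555-0102", "Phoenix", "tax"),
     ("C. Diaz", "555-0103", "Tucson", "immigration")])

alumni = sqlite3.connect(":memory:")
alumni.execute("CREATE TABLE alumni (name2 TEXT, year INTEGER, degree TEXT)")
alumni.executemany(
    "INSERT INTO alumni VALUES (?, ?, ?)",
    [("A. Lee", 1985, "JD"), ("D. Kim", 1990, "JD")])


def run_plan():
    # Subquery P: selection pushed to the yellow-pages source.
    p_rows = yellow_pages.execute(
        "SELECT name1, phone FROM lawfirms WHERE location = ? AND specialty = ?",
        ("Phoenix", "immigration")).fetchall()
    names = [name for name, _ in p_rows]
    if not names:
        return []
    # Subquery S: restricted to the names returned by P ("name2 is one of name1").
    placeholders = ",".join("?" for _ in names)
    s_rows = alumni.execute(
        f"SELECT name2, year, degree FROM alumni WHERE name2 IN ({placeholders})",
        names).fetchall()
    # Final join on P.name1 = S.name2, performed at the mediator.
    education = {name: (year, degree) for name, year, degree in s_rows}
    return [(name, phone) + education[name]
            for name, phone in p_rows if name in education]


result = run_plan()
```

Sending the names from P along with subquery S keeps the second source from returning its whole alumni table, which is the point of ordering the data flow in a plan.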


Representation and integration of domain and source models

System                    Domain representation                                     Integration
SIMS                      LOOM (description logic, object-like with inheritance)    Links between classes and attributes
TSIMMIS                   OEM (Object Exchange Model)                               Domain classes as views of sources
IM, Infomaster, Softbot   Datalog                                                   Source classes as views of domains


Integrating domain and source models -- Example

Domain as view of source:
pilots(pilot,airline) :- s1(pilot,aircraft), s2(aircraft,airline).

Source as view of domain:
s1(pilot,_) :- pilots(pilot,_).
s2(_,airline) :- pilots(_,airline).

Referential integrity constraint on aircraft

[Figure: domain class pilots (pilot, airline) connected by source-links to source classes s1 (pilot, aircraft) and s2 (aircraft, airline)]
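The two styles of integration can be contrasted with a small Python sketch. The tuples are invented, and reading "source as view of domain" as a containment check is an assumption made here for illustration, not something the slide states.

```python
# Invented source contents.
s1 = {("mike", "b747"), ("ann", "b747"), ("joe", "a320")}  # s1(pilot, aircraft)
s2 = {("b747", "united"), ("a320", "delta")}               # s2(aircraft, airline)

# Domain as view of source: pilots is defined by joining s1 and s2 on aircraft.
pilots = {(pilot, airline)
          for pilot, aircraft in s1
          for aircraft2, airline in s2
          if aircraft == aircraft2}

# Source as view of domain (assumed reading): each source only promises that
# its projection is contained in the matching projection of the domain relation.
s1_in_pilots = {p for p, _ in s1} <= {p for p, _ in pilots}
s2_in_pilots = {al for _, al in s2} <= {al for _, al in pilots}
```

Here the containment holds trivially because pilots was built from the two sources; with independently maintained sources it is a constraint the mediator relies on rather than a definition.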


Representation of queries

Queries:
» Enumerate?
» Conjunctive?
» Negation?
» Disjunctive?
» Aggregate operators? (group-by, having, etc. SQL stuff)


Properties of Query Plans

Quality of the answer:
» Is anything not asked for returned?
» Maximally contained? (due to O. Duschka, 1998)

Executable (retrievable) query plans:
» One that contains no domain model term


Query Planning in SIMS -- decompose

Q: Pilots for the same airline as Mike

Q(?p) :- pilots(?p, ?a), pilots(mike, ?a).

Sources:
s1(pilot,aircraft).
s2(aircraft,airline).

Decomposed query:
Q(?p) :- s1(mike, ?a), s2(?a, ?al), s2(?a2, ?al), s1(?p, ?a2).

[Figure: domain class pilots (pilot, airline) with source-links to source classes s1 (pilot, aircraft) and s2 (aircraft, airline)]
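To see what the decomposed query computes, the sketch below evaluates it directly over two small in-memory relations. The tuples and the function name `same_airline_as` are invented for illustration; a real mediator would send each atom to a wrapper instead of scanning Python lists.

```python
# Invented source contents.
s1 = [("mike", "b747"), ("ann", "b777"), ("joe", "a320")]         # s1(pilot, aircraft)
s2 = [("b747", "united"), ("b777", "united"), ("a320", "delta")]  # s2(aircraft, airline)


def same_airline_as(name):
    """Evaluate Q(?p) :- s1(name,?a), s2(?a,?al), s2(?a2,?al), s1(?p,?a2)."""
    answers = set()
    for pilot, a in s1:                 # s1(name, ?a)
        if pilot != name:
            continue
        for a_, al in s2:               # s2(?a, ?al)
            if a_ != a:
                continue
            for a2, al_ in s2:          # s2(?a2, ?al)
                if al_ != al:
                    continue
                for p, a2_ in s1:       # s1(?p, ?a2)
                    if a2_ == a2:
                        answers.add(p)
    return answers


result = same_airline_as("mike")
```

Each loop corresponds to one atom of the decomposed query, read left to right, so the nesting order is one possible execution order of the plan.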


Query Planning in SIMS -- partition

Q: What are Mike's flight hours?

Q(?h) :- pilots(mike, ?h).

Sources:
s3(pilot,aircraft,hours).
s4(pilot,aircraft,hours).

Partitioned subqueries:
Q(?h) :- s3(mike, _, ?h).
Q(?h) :- s4(mike, _, ?h).

[Figure: domain classes pilots (with flight-hours), civil pilots and military pilots, the latter two linked to pilots by subset-of; source classes s3 and s4]
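Partitioning amounts to asking each source the same question and taking the union of the answers. A minimal sketch, with invented tuples and a hypothetical helper `flight_hours`:

```python
# Invented source contents; both sources share the schema (pilot, aircraft, hours).
s3 = [("mike", "b747", 1200), ("ann", "b777", 800)]
s4 = [("mike", "f16", 300)]


def flight_hours(name, sources):
    """Run Q(?h) :- s(name, _, ?h) against every source and union the results."""
    hours = set()
    for source in sources:
        hours |= {h for pilot, _aircraft, h in source if pilot == name}
    return hours


result = flight_hours("mike", [s3, s4])
```

Because the subqueries are independent, they could also be issued in parallel; the sketch runs them one after another for simplicity.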


Query planning in SIMS

There are 7 other such operators (Arens et al. 1995, JIIS) for query “reformulation”

In addition, there are 9 other operators for opening a source, moving data around, etc. (Knoblock, 1996, AIPS)

Planning involves selecting appropriate operators and determining the best order for these operators

There are always many choices and search is required to find the “optimal” query plan


Recursive query plan

Q: Pilots for the same airline as Mike

Q(?p) :- pilots(?p, ?a), pilots(mike, ?a).

Sources:
s1(pilot,aircraft).

Non-recursive query plan? Maximally contained?
– (this part is due to O. Duschka, 1997, PhD thesis, Stanford Univ.)

[Figure: domain class pilots (pilot, airline) with source-links to the single source class s1 (pilot, aircraft)]
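One way to see why a recursive plan is needed when only s1 is available is the sketch below. It assumes, and this is not stated on the slide, that every pilot works for exactly one airline and every aircraft belongs to exactly one airline. Under that assumption, any pilot who shares an aircraft with a pilot already known to fly for Mike's airline also flies for it, and the answer is the fixpoint of that rule. The tuples and the function name are invented.

```python
# Invented source contents: s1(pilot, aircraft).
s1 = {("mike", "b747"), ("ann", "b747"), ("ann", "b777"),
      ("sue", "b777"), ("joe", "a320")}


def same_airline_closure(start, flights):
    """Fixpoint of: Q(p) :- s1(start, a), s1(p, a).
                    Q(p) :- Q(p1), s1(p1, a), s1(p, a).
    Valid only under the one-airline-per-pilot/aircraft assumption."""
    reached = {start}
    changed = True
    while changed:
        changed = False
        # Aircraft flown by any pilot reached so far.
        aircraft = {a for p, a in flights if p in reached}
        for p, a in flights:
            if a in aircraft and p not in reached:
                reached.add(p)
                changed = True
    return reached


result = same_airline_closure("mike", s1)
```

No fixed number of joins over s1 finds every such pilot for every database, since the chain of shared aircraft can be arbitrarily long; that is why the loop runs until nothing changes.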


Negative results of query planning using source-as-view

Query planning for a query plan equivalent to an input datalog query is UNDECIDABLE
» otherwise, theorem proving for first-order logic would be decidable
» (see O. Duschka, 1998, PhD thesis, Stanford University)

Query planning for conjunctive, comparison-free queries with a minimal number of sources accessed is NP-complete
» otherwise, containment of two datalog programs would be polynomial
» (see A. Levy, 1995, PODS)


Domain as view of source

Simply replacing domain terms in a query with their view definitions will yield an executable query plan

Adding a new source may require changing the whole domain model/source model integration
» not a problem for source-as-view
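The replacement step can be sketched as a simple unfolding procedure. The encoding below is hypothetical: atoms are `(predicate, arguments)` tuples, variables start with `?`, and the single view definition is the pilots example used earlier in the talk. Only one level of unfolding is handled.

```python
import itertools

# Hypothetical view definition: pilots(?P, ?AL) :- s1(?P, ?A), s2(?A, ?AL).
VIEWS = {"pilots": (("?P", "?AL"),
                    [("s1", ("?P", "?A")), ("s2", ("?A", "?AL"))])}
_fresh = itertools.count()


def unfold(query_body):
    """Replace every domain atom by the body of its view definition."""
    plan = []
    for pred, args in query_body:
        if pred not in VIEWS:
            plan.append((pred, args))      # already a source atom
            continue
        head_args, body = VIEWS[pred]
        binding = dict(zip(head_args, args))
        suffix = next(_fresh)
        for body_pred, body_args in body:
            new_args = []
            for arg in body_args:
                if arg in binding:         # head variable: use the caller's term
                    new_args.append(binding[arg])
                elif arg.startswith("?"):  # body-only variable: rename apart
                    new_args.append(f"{arg}_{suffix}")
                else:                      # constant: keep as is
                    new_args.append(arg)
            plan.append((body_pred, tuple(new_args)))
    return plan


def is_executable(plan):
    """A plan is executable when it mentions no domain model term."""
    return all(pred not in VIEWS for pred, _ in plan)


query = [("pilots", ("?p", "?a")), ("pilots", ("mike", "?a"))]
plan = unfold(query)
```

Renaming the body-only variable for each unfolded atom keeps the two copies of the aircraft variable distinct, which matches the `?a` and `?a2` of the decomposed SIMS query.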


Query optimizations

Semantic query optimization (Hsu and Knoblock, 1999, IEEE TKDE)

Less “semantic” (using local completeness, functional dependency, etc.) (Kwok AAAI-96, Levy)

Exploring parallelism in plans (Knoblock, IJCAI-95)

Replanning failed retrieval (Knoblock, IJCAI-95)

Caching (static)

Dynamic caching (using partial results from subqueries)


Basic idea of adaptive semantic query optimization

Input query: Give me all the papers written by “Chunnan”

Semantic rules:
R1: If AUTHOR is an “AIer” then PAPER is an “AI” paper
R2: “Chunnan” is an “AIer”
R3: ...

Optimized query: Give me all the “AI” papers written by “Chunnan”

[Figure: components shown are the databases, the PESTO query optimizer, and the BASIL learner/KDDer]
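The effect of rules R1 and R2 on the input query can be sketched as follows. This is not the PESTO implementation; the dictionary encoding of queries, the function names, and the tiny paper database are all invented for illustration.

```python
# Hypothetical encoding of rule R2: "Chunnan" is an "AIer".
AIERS = {"Chunnan"}

# Invented database contents.
PAPERS = [
    {"title": "t1", "author": "Chunnan", "topic": "AI"},
    {"title": "t2", "author": "Lee", "topic": "DB"},
    {"title": "t3", "author": "Lee", "topic": "AI"},
]


def optimize(query):
    """Apply R1: papers by an "AIer" are "AI" papers, so the topic constraint
    can be added to the query without changing its answer."""
    optimized = dict(query)
    if query.get("author") in AIERS and "topic" not in query:
        optimized["topic"] = "AI"
    return optimized


def answer(query):
    """Return titles of papers matching every attribute constraint in the query."""
    return [paper["title"] for paper in PAPERS
            if all(paper[key] == value for key, value in query.items())]


query = {"author": "Chunnan"}
optimized = optimize(query)
```

The added constraint pays off when it lets the executor use an index on topic or skip sources known to hold no AI papers; in this toy database it only shows that the answer is unchanged.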


Web wrapper

Name        Degree   School    Affiliation
WL Hsu      PhD      Cornell   IIS, Sinica
CS Ho       PhD      NTU       EE, NTIT
C.Chen      PhD      SUNY      EE, NTIT
C.Wu        PhD      Utexas    Cedu, NNU
Mark Liao   PhD      NWU       IIS, Sinica
CJ Liau     PhD      NTU       IIS, Sinica
WK Cheng    PhD      TKU       Tunghai
WC Wang     MS       Syracus   FIT
: : :
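A hand-coded Web wrapper for a table like the one above might look like the sketch below. It uses Python's standard `html.parser` module; the HTML markup of the page is an assumption (the slide shows only the extracted tuples), and the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class TableWrapper(HTMLParser):
    """Collect the text of every table cell, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def wrap(html):
    """Turn the first row into attribute names and the rest into records."""
    parser = TableWrapper()
    parser.feed(html)
    parser.close()
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


# Assumed page markup; the two data rows are taken from the table above.
PAGE = """<table>
<tr><th>Name</th><th>Degree</th><th>School</th><th>Affiliation</th></tr>
<tr><td>WL Hsu</td><td>PhD</td><td>Cornell</td><td>IIS, Sinica</td></tr>
<tr><td>WC Wang</td><td>MS</td><td>Syracus</td><td>FIT</td></tr>
</table>"""
records = wrap(PAGE)
```

A wrapper like this breaks as soon as the page layout changes, which is the maintenance problem that motivates wrapper induction.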


Wrapper construction

For structured databases, wrapper construction is an engineering problem

Web sources require an information extractor

Hand-encoded Web information extractor?
» Web pages change frequently (8% monthly failure rate at Junglee)

Web wrapper induction? YES (Hsu 1999, J of Info Systems; Kushmerick 1997, PhD Thesis, U of WA)

XML will make wrapper induction easier


Major players (1)

SIMS, Ariadne
» Arens, Knoblock, Minton, Shen, Hsu, et al. at ISI of USC
» flexible query planner, adaptive semantic query optimizer

Information Manifold
» Levy, Srivastava, Kirk, et al. at AT&T Lab
» query reformulation, relevant source selection

TSIMMIS
» Hammer, Garcia-Molina, Papakonstantinou, Ullman et al. at Stanford University
» object-based data modeling (OEM)


Major players (2)

Softbot family: Occam, Razor, etc.
» Etzioni, Weld, Kwok, Kushmerick, Friedman
» fast query planning, wrapper induction, query optimization

Infomaster
» Duschka and Genesereth at Stanford
» recursive query plans, theoretical analysis of III

Others
» HERMES at U of Maryland, Broker Agents at SRI, Ontobroker at AIFB Germany, etc.
» Taiwan? Academia Sinica (WL Hsu, CN Hsu) and VF at NTU (YJ Hsu), others?


Positive results of intelligent information integration

Spin-offs
» Junglee (www.junglee.com)
– Key scientists: Mike Stonebraker? Peter Norvig
– Largest integration: 700 Web sites, 30 attributes, 1000+ wrappers
– Bought by Amazon.com for ~$180,000,000
» Jango (www.jango.com)
– Key scientists: Dan Weld, Oren Etzioni
– Bought by Excite.com

Startups
» Socratix Systems
– Key scientist: Oliver Duschka


Competing alternatives

Hardwired
» mostly applied?

Schema Integration
» dying?

Distributed Heterogeneous Multi-Databases
» dying? Name too long?

Data warehousing
» kicking real good!
» Updating is a tough problem

Software Reverse Engineering
» Taiwan has an edge on this?


Research challenges

Optimization

Probabilistic representation of domain-source models

Probabilistic query answering, anytime, imprecise query answering

Automatic locating and integrating relevant new sources

Sharing information between incompatible sources (F-C? Exchange rate? Aliases?)

Wrapper induction

Cooperative agents for information integration


Information sources of intelligent information integration

Journals
» Journal of intelligent systems, information systems, intelligent information systems, cooperative information systems, agents(?)
» Other journals for AI, databases

Meetings
» 1998 AAAI Workshop on AI & Information Integration
» 1998 ECAI Workshop on Intelligent Information Integration
» 1997 SIGMOD Workshop on Semistructured data
» 1997 German Annual Conference on AI Workshop on III
» 1995 AAAI Symposium on Information Gathering from Distributed Heterogeneous Sources


More information sources on Intelligent Information Integration

Best papers are usually published in AAAI, IJCAI, SIGMOD and PODS

Upcoming meeting:
» IJCAI-99 Workshop on Intelligent Information Integration (proposed)


The Future: Networked Information Mediators

[Figure: a network of mediators between human and computer users and heterogeneous data sources: Attorney Mediator, Phoenix Attorney Mediator, Immigration Attorney Mediator, Law school Mediator, Good Attorney Mediator, Bad Attorney Mediator]