Lab name TBA, IIS internal talk
Intelligent Information Integration (I3)
Chun-Nan Hsu
Institute of Information Science
Academia Sinica, Taipei, Taiwan
Copyright © 1998 Chun-Nan Hsu. All rights reserved.
Prepared for a presentation at IIS, AS, Taiwan
October 12, 1998

Information Distribution and Information Integration
Real query: I need a list of attorneys in the Phoenix metro area who specialize in immigration and deportation. Also show their years in service, educational background, and languages spoken.
The answer IS on the Web!
» US West yellow pages Web site
» US bar association member directory Web site
» Alumni directories of law schools
» and more…
BUT…

Intelligent Information Integration
Environment assumptions:
» Autonomous information sources
» Heterogeneous but relevant
» Query only (no or limited update allowed)
Desiderata:
» Extensible – easily add new sources
» Flexible – can be queried in as many ways as integrated sources
» Scalable – integrate 1,000s, 10,000s, 100,000s of relevant information sources

Solution: Information Integration Systems (IIS)
Also known as:
» information mediation agents, information mediators
» information gathering agents
» information brokering agents, information brokers
Key ideas:
» users access data through a domain model
» information sources are represented by a source model
» the mediator reformulates a domain model query into source model sub-queries
» the mediator constructs a query plan that determines the order of data flow and execution to retrieve data

Architecture
[Figure: architecture of the information integration service, which sits between human and computer users and the heterogeneous data sources.]
Human & computer users
» User services: query, monitor
Information integration service
» Domain model: describes the domain (terms and their relations)
» Source model: provides source descriptions and semantic integration
» Query planner: determines how to answer input queries
» Query plan optimizer/executor: optimizes and executes a query plan
» Wrappers (alongside SQL and ORB interfaces): provide translation and communication with sources
Heterogeneous data sources
» Text, images/video, spreadsheets
» Hierarchical & network databases
» Relational databases
» Object & knowledge bases
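
Query processing flow
[Figure: query processing flow. The user query enters the information integration service, where the query planner uses the domain model and the source model to produce a query plan. The query plan optimizer/executor sends subqueries through the wrappers and the SQL and ORB interfaces to the information sources; translated queries and data pass between these interfaces and the sources, and the answer is returned to the user.]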

Query plan
Query: Find immigration attorneys in Phoenix and their educational background...
P (Us-west.com):
SELECT name1, phone, address
FROM LAWFIRMS
WHERE location = Phoenix and class = law-firms and specialty = immigration
S (Law.yale.edu, Law.harvard.edu):
SELECT name2, year, degree
FROM ALUMNI
WHERE name2 is one of name1
output:
SELECT *
FROM P, S, …
WHERE P.name1 = S.name2
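The plan above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the rows in LAWFIRMS and ALUMNI are made-up stand-ins for what wrappers around the two Web sources might return, and the function names (subquery_p, subquery_s, join_output) are hypothetical.

```python
# Hypothetical rows standing in for the two wrapped Web sources.
LAWFIRMS = [
    {"name1": "Lee", "phone": "555-0100", "address": "1 Main St",
     "location": "Phoenix", "class": "law-firms", "specialty": "immigration"},
    {"name1": "Kim", "phone": "555-0101", "address": "2 Oak Ave",
     "location": "Phoenix", "class": "law-firms", "specialty": "tax"},
]
ALUMNI = [
    {"name2": "Lee", "year": 1985, "degree": "JD"},
    {"name2": "Park", "year": 1990, "degree": "JD"},
]


def subquery_p(lawfirms):
    """Subquery P: select and project immigration law firms in Phoenix."""
    return [
        {"name1": r["name1"], "phone": r["phone"], "address": r["address"]}
        for r in lawfirms
        if r["location"] == "Phoenix"
        and r["class"] == "law-firms"
        and r["specialty"] == "immigration"
    ]


def subquery_s(alumni, names):
    """Subquery S: keep alumni whose name2 is one of the name1 values from P."""
    return [r for r in alumni if r["name2"] in names]


def join_output(p_rows, s_rows):
    """Output step: join P and S on P.name1 = S.name2."""
    return [{**p, **s} for p in p_rows for s in s_rows
            if p["name1"] == s["name2"]]


P = subquery_p(LAWFIRMS)
S = subquery_s(ALUMNI, {p["name1"] for p in P})
answer = join_output(P, S)
```

Note the data flow: the names returned by P are passed into S, so the alumni source is asked only about relevant attorneys.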

Representation and integration of domain and source models
System | Domain representation | Integration
SIMS | LOOM (description logic, object-like with inheritance) | Links between classes and attributes
TSIMMIS | OEM (Object exchange model) | Domain classes as views of sources
IM, Infomaster, Softbot | Datalog | Source classes as views of domains

Integrating domain and source models -- Example
Domain as view of source:
pilots(pilot,airline) :- s1(pilot,aircraft), s2(aircraft,airline).
Source as view of domain:
s1(pilot,_) :- pilots(pilot,_).
s2(_,airline) :- pilots(_,airline).
Source-links:
[Figure: the domain class pilots (pilot, airline) is connected by source-links to the source classes s1 (pilot, aircraft) and s2 (aircraft, airline), with a referential integrity constraint on aircraft.]

Representation of queries
Queries
» Enumerate?
» Conjunctive?
» Negation?
» Disjunctive?
» Aggregate operators? (group-by, having, etc. SQL stuff)

Properties of Query Plans
Quality of the answer
» Is anything returned that was not asked for?
» Maximally contained? (due to O. Duschka, 1998)
Executable (retrievable) query plans
» One that contains no domain model term

Query Planning in SIMS -- decompose
Q: Pilots for the same airline as Mike
Q(?p) :- pilots(?p, ?a), pilots(mike, ?a).
Sources:
s1(pilot,aircraft).
s2(aircraft,airline).
Decomposed query:
Q(?p) :- s1(mike, ?a), s2(?a, ?al), s2(?a2, ?al), s1(?p, ?a2).
[Figure: the domain class pilots (pilot, airline) is connected by source-links to the source classes s1 (pilot, aircraft) and s2 (aircraft, airline).]
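To make the decomposed query concrete, here is a minimal sketch that evaluates it over in-memory relations. The tuples in S1 and S2 and the airline names are invented for illustration; in a real mediator each source would be reached through a wrapper rather than held in memory.

```python
# Hypothetical contents of the two sources.
S1 = {("mike", "b747"), ("ann", "a320"), ("bob", "b737")}   # s1(pilot, aircraft)
S2 = {("b747", "alpha_air"), ("a320", "alpha_air"),
      ("b737", "beta_air")}                                  # s2(aircraft, airline)


def same_airline_pilots(name, s1, s2):
    """Evaluate Q(?p) :- s1(name,?a), s2(?a,?al), s2(?a2,?al), s1(?p,?a2)."""
    return {
        p
        for (m, a) in s1 if m == name          # s1(name, ?a)
        for (a1, al) in s2 if a1 == a          # s2(?a, ?al)
        for (a2, al2) in s2 if al2 == al       # s2(?a2, ?al)
        for (p, a3) in s1 if a3 == a2          # s1(?p, ?a2)
    }


colleagues = same_airline_pilots("mike", S1, S2)
```

With this data the answer contains mike himself, because the query as written does not exclude him.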

Query Planning in SIMS -- partition
Q: What are Mike's flight hours?
Q(?h) :- pilots(mike, ?h).
Sources:
s3(pilot,aircraft,hours).
s4(pilot,aircraft,hours).
Partitioned subqueries:
Q(?h) :- s3(mike, _, ?h).
Q(?h) :- s4(mike, _, ?h).
[Figure: the domain class pilots (pilot, flight-hours) has two subclasses, civil pilots and military pilots, linked by subset-of; the source classes s3 and s4 (pilot, aircraft, hours) supply the subclasses.]
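The partition step amounts to running the same subquery against each source and taking the union of the results. A minimal sketch, assuming made-up tuples for s3 and s4 and a hypothetical helper named flight_hours:

```python
# Hypothetical contents of the two partitioned sources.
S3 = [("mike", "c172", 120), ("ann", "c172", 300)]   # s3(pilot, aircraft, hours)
S4 = [("mike", "f16", 450)]                           # s4(pilot, aircraft, hours)


def flight_hours(name, *sources):
    """Union of Q(?h) :- s(name, _, ?h) over every given source."""
    return [h for src in sources for (p, _, h) in src if p == name]


mike_hours = flight_hours("mike", S3, S4)
```

Each subquery is independent of the others, which is what allows a planner to issue them in parallel.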

Query planning in SIMS
There are 7 other such operators (Arens et al. 1995, JIIS) for query “reformulation”
In addition, there are 9 other operators for opening a source, moving data around, etc. (Knoblock, 1996, AIPS)
Planning involves selecting appropriate operators and determining the best order for these operators
There are always many choices, and search is required to find the “optimal” query plan

Recursive query plan
Q: Pilots for the same airline as Mike
Q(?p) :- pilots(?p, ?a), pilots(mike, ?a).
Sources:
s1(pilot,aircraft).
Non-recursive query plan? Maximally contained?
– (this part is due to O. Duschka, 1997, PhD thesis, Stanford Univ.)
[Figure: the domain class pilots (pilot, airline) is connected by source-links to the single source class s1 (pilot, aircraft).]

Negative results of query planning using source-as-view
Query planning for a query plan equivalent to an input datalog query is UNDECIDABLE
» otherwise, theorem-proving for first-order logic would be decidable
» (see O. Duschka, 1998, PhD thesis, Stanford University)
Query planning for conjunctive, comparison-free queries with a minimal number of sources accessed is NP-complete
» otherwise, containment of two datalog programs would be polynomial
» (see A. Levy, 1995, PODS)

Domain as view of source
Simply replacing domain terms in a query with their view definitions will yield an executable query plan
Adding a new source may require changing the whole integration of the domain model and the source model
» not a problem for source-as-view
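The "simply replacing domain terms" step is view unfolding. The sketch below is a minimal illustration, not the algorithm of any system named in this talk: it assumes the pilots view from the earlier example slide, represents atoms as (predicate, arguments) tuples, assumes view bodies contain only variables, and uses a hypothetical unfold function that renames variables local to a view so that separate uses do not clash.

```python
from itertools import count

# Domain as view of source: pilots(P, AL) :- s1(P, A), s2(A, AL).
VIEWS = {
    "pilots": (("P", "AL"), [("s1", ("P", "A")), ("s2", ("A", "AL"))]),
}


def unfold(body, views):
    """Replace every domain atom in body with its view definition."""
    fresh = count(1)
    plan = []
    for pred, args in body:
        if pred not in views:
            plan.append((pred, args))       # already a source term
            continue
        head_vars, view_body = views[pred]
        binding = dict(zip(head_vars, args))
        for sub_pred, sub_args in view_body:
            new_args = []
            for var in sub_args:
                if var not in binding:      # variable local to the view
                    binding[var] = f"?v{next(fresh)}"
                new_args.append(binding[var])
            plan.append((sub_pred, tuple(new_args)))
    return plan


# Q(?p) :- pilots(?p, ?al), pilots(mike, ?al).
query = [("pilots", ("?p", "?al")), ("pilots", ("mike", "?al"))]
plan = unfold(query, VIEWS)
executable = all(pred not in VIEWS for pred, _ in plan)
```

The resulting plan mentions only s1 and s2, so it is executable in the sense used earlier in the talk: it contains no domain model term.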

Query optimizations
Semantic query optimization (Hsu and Knoblock, 1999, IEEE TKDE)
Less “semantic” (using local completeness, functional dependency, etc.) (Kwok AAAI-96, Levy)
Exploring parallelism in plans (Knoblock, IJCAI-95)
Replanning failed retrieval (Knoblock, IJCAI-95)
Caching (static)
Dynamic caching (using partial results from subqueries)

Basic idea of adaptive semantic query optimization
Input query: Give me all the papers written by “Chunnan”
Semantic rules:
R1: If AUTHOR is an “AIer”, then PAPER is an “AI” paper
R2: “Chunnan” is an “AIer”
R3: ...
Optimized query: Give me all the “AI” papers written by “Chunnan”
[Figure: components shown are the PESTO query optimizer, the BASIL learner/KDDer, the semantic rules, and the databases.]
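A toy version of the rewrite on this slide can be written as follows. It is a sketch only, not how PESTO works internally: the dictionary encoding of R1 and R2, the query format, and the optimize function are all assumptions made for illustration.

```python
# Hypothetical encodings of the semantic rules from the slide.
AUTHOR_GROUP = {"Chunnan": "AIer"}      # R2: "Chunnan" is an "AIer"
GROUP_IMPLIES_TOPIC = {"AIer": "AI"}    # R1: AUTHOR is an "AIer" => PAPER is "AI"


def optimize(query):
    """Add a topic constraint implied by the rules; the answers are unchanged."""
    optimized = dict(query)
    group = AUTHOR_GROUP.get(query.get("author"))
    topic = GROUP_IMPLIES_TOPIC.get(group)
    if topic is not None and "topic" not in optimized:
        optimized["topic"] = topic
    return optimized


input_query = {"author": "Chunnan"}
optimized_query = optimize(input_query)
```

The added constraint is redundant with respect to the rules, so it does not change the answer; its value is that a source indexed or partitioned by topic can answer the narrower query more cheaply.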

Web wrapper
Name | Degree | School | Affiliation
WL Hsu | PhD | Cornell | IIS, Sinica
CS Ho | PhD | NTU | EE, NTIT
C.Chen | PhD | SUNY | EE, NTIT
C.Wu | PhD | Utexas | Cedu, NNU
Mark Liao | PhD | NWU | IIS, Sinica
CJ Liau | PhD | NTU | IIS, Sinica
WK Cheng | PhD | TKU | Tunghai
WC Wang | MS | Syracus | FIT
: | : | : |
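A hand-coded wrapper for a page like this could look like the sketch below, which turns an HTML table into a list of records using only Python's standard html.parser module. The PAGE markup is an assumption about how such a page might be written (the talk does not show the underlying HTML), and the TableWrapper class and wrap function are hypothetical names.

```python
from html.parser import HTMLParser


class TableWrapper(HTMLParser):
    """Collect the text of every table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None    # cells of the row being read, if any
        self._cell = None   # text pieces of the cell being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def wrap(html):
    """Return one dict per data row, keyed by the header row."""
    parser = TableWrapper()
    parser.feed(html)
    parser.close()
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


# Assumed markup for the first two rows of the table above.
PAGE = """<table>
<tr><th>Name</th><th>Degree</th><th>School</th><th>Affiliation</th></tr>
<tr><td>WL Hsu</td><td>PhD</td><td>Cornell</td><td>IIS, Sinica</td></tr>
<tr><td>CS Ho</td><td>PhD</td><td>NTU</td><td>EE, NTIT</td></tr>
</table>"""
records = wrap(PAGE)
```

A wrapper of this kind depends on the page keeping its table layout, so it has to be revised whenever the page is redesigned.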

Wrapper construction
For structured databases, wrapper construction is an engineering problem
Web sources require an information extractor
Hand-encoded Web information extractor?
» Web pages change frequently (8% monthly failure rate at Junglee)
Web wrapper induction? YES (Hsu 1999, J of Info Systems; Kushmerick 1997, PhD Thesis, U of WA)
XML will make wrapper induction easier

Major players (1)
SIMS, Ariadne
» Arens, Knoblock, Minton, Shen, Hsu, et al. at ISI of USC
» flexible query planner, adaptive semantic query optimizer
Information Manifold
» Levy, Srivastava, Kirk, et al. at AT&T Labs
» query reformulation, relevant source selection
TSIMMIS
» Hammer, Garcia-Molina, Papakonstantinou, Ullman et al. at Stanford University
» object-based data modeling (OEM)

Major players (2)
Softbot family: Occam, Razor, etc.
» Etzioni, Weld, Kwok, Kushmerick, Friedman
» Fast query planning, wrapper induction, query optimization
Infomaster
» Duschka and Genesereth at Stanford
» recursive query plans, theoretical analysis of III
Others
» HERMES at U of Maryland, Broker Agents at SRI, Ontobroker at AIFB Germany, etc.
» Taiwan? Academia Sinica (WL Hsu, CN Hsu) and VF at NTU (YJ Hsu), others?

Positive results of intelligent information integration
Spin-offs
» Junglee (www.junglee.com)
– Key scientists: Mike Stonebraker? Peter Norvig
– Largest integration: 700 Web sites, 30 attributes, 1000+ wrappers
– Bought by Amazon.com for ~$180,000,000
» Jango (www.jango.com)
– Key scientists: Dan Weld, Oren Etzioni
– Bought by Excite.com
Startups
» Socratix Systems
– Key scientist: Oliver Duschka

Competing alternatives
Hardwired
» mostly applied?
Schema Integration
» dying?
Distributed Heterogeneous Multi-Databases
» dying? Name too long?
Data warehousing
» kicking real good!
» Updating is a tough problem
Software Reverse Engineering
» Taiwan has an edge on this?

Research challenges
Optimization
Probabilistic representation of domain-source models
Probabilistic query answering, anytime, imprecise query answering
Automatic locating and integrating of relevant new sources
Sharing information between incompatible sources (F-C? Exchange rate? Aliases?)
Wrapper induction
Cooperative agents for information integration

Information sources of intelligent information integration
Journals
» Journal of intelligent systems, information systems, intelligent information systems, cooperative information systems, agents(?)
» Other journals for AI, databases
Meetings
» 1998 AAAI Workshop on AI & Information Integration
» 1998 ECAI Workshop on Intelligent Information Integration
» 1997 SIGMOD Workshop on Semistructured Data
» 1997 German Annual Conference on AI Workshop on III
» 1995 AAAI Symposium on Information Gathering from Distributed Heterogeneous Sources

More information sources on Intelligent Information Integration
Best papers are usually published in AAAI, IJCAI, SIGMOD and PODS
Upcoming meeting:
» IJCAI-99 Workshop on Intelligent Information Integration (proposed)