Automating the Committee Meeting: Intelligent Integration of Information From Diverse Sources...

Automating the Committee Meeting: Intelligent Integration of Information From Diverse Sources Pedrito Maynard-Zhang Department of Computer Science & Systems Analysis

Automating the Committee Meeting: Intelligent Integration of Information From Diverse Sources

Pedrito Maynard-Zhang

Department of Computer Science & Systems Analysis

Information Integration

Information integration is ubiquitous:
• Committee meetings
• Research papers
• Information retrieval on the web
• Assessing intelligence on the battlefield
• …

Outline

• Introduction
• Automating Information Integration
  – Database Integration
  – Model Integration
  – Conflict Resolution and Meta-Information
• Integrating Learned Probabilistic Information
• Conclusion and Current Work

Multi-Disciplinary Research

• Databases (e.g., Halevy’s group at U. of Washington)
• Artificial Intelligence (e.g., Stanford’s Knowledge Systems Laboratory)
• Business (e.g., MIT-Sloan’s Aggregators Group)
• Decision Analysis (e.g., Clemen & Winkler’s work at Duke)

Database Integration

[Figure: a bioinformatics query is posed to a mediation layer (Genes, Proteins, Nucleotide Sequences), which sits on top of the source databases (Gene-Clinics, Locus-Link, Entrez, OMIM).]

Database Integration

• Application: Querying distributed databases
• Examples
  – Bioinformatics
  – Corporate data management
  – Question-answer systems on the web
  – Detecting bioterrorism

Model Integration

[Figure: an expert system (“if cancer then operate…”), a probabilistic model, and a mathematical model are combined into a single “super model”.]

Model Integration

• Applications: Diagnosis and prediction
• Examples:
  – Medical diagnosis
  – NASA spacecraft design and diagnosis
  – Expert system integration
  – Combining commonsense knowledge bases

Challenges

• Efficient query processing and optimization
  – Parsing XML
• Defining expressive yet tractable mediator languages
• Handling heterogeneous source languages
  – Wrapper technology development

Challenges

• Resolving ontological differences
  – e.g., realizing that the field “Name” for one source stores the same information as “First Name” and “Last Name” for another.
• Detecting conflicts
• Resolving conflicts
  – Resolution done manually in practice
  – We can automate more!

Uninformed Integration

What’s the weather like?

[Figure: three sources answer “raining”, “sunny”, and “raining”.]

Intelligent Integration

What’s the weather like?

[Figure: the same three answers, now labeled with their sources: meteorologist (“raining”), practical joker (“sunny”), own eyes (“raining”).]
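The contrast between uninformed and intelligent integration can be sketched as a vote with and without source weights. This is only an illustration: the credibility numbers are hypothetical (the slides give none), the function name `weighted_vote` is made up for this sketch, and the pairing of sources with answers simply follows the order in which they appear on the slide.

```python
from collections import defaultdict

def weighted_vote(reports, credibility=None):
    """Return the answer with the largest total weight, plus all totals.

    reports: dict mapping source name -> reported answer.
    credibility: dict mapping source name -> nonnegative weight
                 (hypothetical values; None means every source counts equally).
    """
    totals = defaultdict(float)
    for source, answer in reports.items():
        weight = 1.0 if credibility is None else credibility[source]
        totals[answer] += weight
    return max(totals, key=totals.get), dict(totals)

reports = {
    "meteorologist": "raining",
    "practical joker": "sunny",
    "own eyes": "raining",
}
# Assumed credibility weights, chosen only for illustration.
credibility = {"meteorologist": 0.8, "practical joker": 0.1, "own eyes": 0.95}

uninformed_answer, uninformed_totals = weighted_vote(reports)
informed_answer, informed_totals = weighted_vote(reports, credibility)
```

Both votes pick "raining" here, but the margin changes from 2 against 1 to 1.75 against 0.1 once the meta-information about the sources is used.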

Types of Meta-Information

• Credibility, experience, political clout
• Areas of expertise
• How source acquired information:
  – Source’s sources
  – Processes source used to accumulate information
• Structure of the data representation

Outline

• Introduction
• Automating Information Integration
• Integrating Learned Probabilistic Information
  – Medical Scenario
  – Semantic Framework
  – LinOP-Based Aggregation
  – Aggregating Bayesian Networks
  – Experimental Validation
• Conclusion and Current Work

Medical Expert System Scenario

[Figure: three doctors, with 3, 20, and 10 years of experience, contribute to a single expert system.]

Source Meta-Information

• Doctors learned probabilistic models from patient data using some known standard learning algorithm.

• We know the relative amount of experience doctors have had (i.e., years of practice).

Popular Aggregation Approaches

• Intuition approach: Take simple weighted averages, etc. → unexpected behavior
• Axiomatic approach: Find aggregation algorithm satisfying certain “obvious” properties → impossibility results
• Problem: Not semantically grounded

Aggregation Semantics

[Figure: M samples are generated from the true distribution p. Running the learning algorithm on all M samples would give the optimal distribution p* (not available in practice). Instead, each source runs the learning algorithm on its own samples to get p1, p2, …, pL, and an aggregation algorithm combines these into the aggregate distribution p̂.]

Linear Opinion Pool (LinOP)

• LinOP: Weighted sum of joint distributions.
• Precisely, for joint distributions $p_i$ and joint variable instantiation $w$,

  $\mathrm{LinOP}(p_1, p_2, \ldots, p_L)(w) = \sum_i \alpha_i \, p_i(w)$.

  $\alpha_i$ weights: relative experience.
• Satisfies unanimity, non-dictatorship, and marginalization.
• Doesn’t preserve shared independences.
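A minimal Python sketch of the pool, assuming each joint distribution is stored as a dict from instantiations to probabilities and that the weights are nonnegative and sum to 1. The helper names and the toy numbers are hypothetical. The example also checks the last bullet: two sources that each treat X and Y as independent produce a pooled distribution in which they are not.

```python
from itertools import product

def linop(dists, weights):
    """Linear opinion pool: pointwise weighted sum of joint distributions.

    dists: list of dicts mapping a joint instantiation w -> probability,
           all defined over the same instantiations.
    weights: nonnegative weights that sum to 1 (one per source).
    """
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    outcomes = dists[0].keys()
    return {w: sum(a * p[w] for a, p in zip(weights, dists)) for w in outcomes}

def independent_joint(px1, py1):
    """Joint over binary (X, Y) in which X and Y are independent."""
    return {(x, y): (px1 if x else 1 - px1) * (py1 if y else 1 - py1)
            for x, y in product([0, 1], repeat=2)}

def marginal(joint, index):
    """P(variable at position `index` equals 1)."""
    return sum(p for w, p in joint.items() if w[index] == 1)

# Two sources that both believe X and Y are independent (toy numbers).
p1 = independent_joint(0.9, 0.9)
p2 = independent_joint(0.1, 0.1)
pooled = linop([p1, p2], [0.5, 0.5])

# Under independence, P(X=1, Y=1) would equal P(X=1) * P(Y=1).
joint_11 = pooled[(1, 1)]
product_11 = marginal(pooled, 0) * marginal(pooled, 1)
```

Here `joint_11` is 0.41 while `product_11` is 0.25, so the independence that both sources share is lost in the pool.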

LinOP and Joint Learning

If
  – sources learn joint distributions using maximum likelihood or MAP learning and
  – the same learning framework would be used on the combined data set to learn p*
then
  p* ≈ LinOP(p1, p2, …, pL).
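For the maximum-likelihood case the relationship can be checked directly on toy data. The sketch below assumes that each source estimates its joint distribution by relative frequencies and that source i is weighted by its share of the data, Mi / M. The data sets are hypothetical, and the MAP case is not covered.

```python
from collections import Counter

def ml_estimate(samples, outcomes):
    """Maximum-likelihood joint distribution: relative frequencies."""
    counts = Counter(samples)
    return {w: counts[w] / len(samples) for w in outcomes}

outcomes = [(0, 0), (0, 1), (1, 0), (1, 1)]

# Hypothetical data sets for two sources (toy numbers, not from the talk).
data1 = [(0, 0)] * 3 + [(1, 1)] * 1                  # M1 = 4
data2 = [(0, 0)] * 2 + [(0, 1)] * 5 + [(1, 1)] * 9   # M2 = 16

p1 = ml_estimate(data1, outcomes)
p2 = ml_estimate(data2, outcomes)

# p* is what the same learner would produce from the combined data set.
p_star = ml_estimate(data1 + data2, outcomes)

# LinOP with weights proportional to the amount of data each source saw.
m1, m2 = len(data1), len(data2)
weights = [m1 / (m1 + m2), m2 / (m1 + m2)]
pooled = {w: weights[0] * p1[w] + weights[1] * p2[w] for w in outcomes}
```

With these weights `pooled` matches `p_star` on every instantiation, because the pooled counts are just the sums of the sources' counts.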

Bayesian Network (BN)

• Summary: Compact, graphical representation of a probability distribution.

• Definition: Directed acyclic graph (DAG) over nodes (random variables); each node has a local conditional probability distribution (CPD) associated with it.

• Exploits causal structure in the domain.

Alarm BN

[Figure: Burglary and Earthquake point to Alarm; Alarm points to JohnCalls and MaryCalls.]

P(B) = .001
P(E) = .002

B  E  P(A)
+  +  .95
+  -  .94
-  +  .29
-  -  .001

A  P(J)
+  .90
-  .05

A  P(M)
+  .70
-  .01
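As a worked example of how a BN encodes a joint distribution, the following sketch multiplies the local CPD entries from the Alarm slide. It assumes the edge structure implied by the table columns (Burglary and Earthquake are the parents of Alarm; Alarm is the parent of JohnCalls and MaryCalls); the function names are made up for this illustration.

```python
# CPD entries from the Alarm BN slide; True means "+" and False means "-".
P_B = 0.001
P_E = 0.002
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}   # P(A=+ | B, E)
P_J = {True: 0.90, False: 0.05}                       # P(J=+ | A)
P_M = {True: 0.70, False: 0.01}                       # P(M=+ | A)

def bernoulli(p_true, value):
    """Probability of `value` for a binary variable with P(True) = p_true."""
    return p_true if value else 1.0 - p_true

def joint(b, e, a, j, m):
    """P(B=b, E=e, A=a, J=j, M=m) as a product of local CPD entries."""
    return (bernoulli(P_B, b) * bernoulli(P_E, e)
            * bernoulli(P_A[(b, e)], a)
            * bernoulli(P_J[a], j) * bernoulli(P_M[a], m))

# John and Mary both call and the alarm rings, but there is
# neither a burglary nor an earthquake.
p = joint(b=False, e=False, a=True, j=True, m=True)
print(f"{p:.6f}")  # prints 0.000628
```

Ten numbers specify the whole network, whereas a full joint table over five binary variables has 32 entries.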

BN Advantages

• Compact representation and graph encodes conditional independences.
• Elicitation easy in practice.
• Inference efficient in practice.
• Can be learned from data.
• Deployed successfully: medical diagnosis, Microsoft Office, NASA Mission Control, and more.

BN Learning

• Idea: Select BN most likely to have generated data.
• Standard algorithm:
  – Search over structures by adding, deleting, and reversing edges.
  – Parameterize and score structures using statistics from the data.
  – Penalize complex structures.
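The parameterize, score, and penalize steps can be illustrated with a BIC-style score. The slides do not say which score is used, so BIC is an assumption here, as are the function name `bic_score` and the toy data. The search over structures is omitted; the sketch only compares two fixed candidates over binary variables.

```python
import math
from collections import Counter

def bic_score(data, parents):
    """BIC-style score of a structure over binary variables.

    data: list of tuples, one value (0 or 1) per variable.
    parents: dict mapping variable index -> tuple of parent indices.
    """
    m = len(data)
    log_lik = 0.0
    n_params = 0
    for child, pa in parents.items():
        # Statistics from the data: counts of (parent values, child value).
        family_counts = Counter((tuple(row[p] for p in pa), row[child])
                                for row in data)
        parent_counts = Counter(tuple(row[p] for p in pa) for row in data)
        for (pa_vals, _), n in family_counts.items():
            # The ML parameter is n / parent count; the cell adds n * log(theta).
            log_lik += n * math.log(n / parent_counts[pa_vals])
        # One free parameter per parent configuration of a binary child.
        n_params += 2 ** len(pa)
    # Complexity penalty grows with the number of parameters.
    return log_lik - 0.5 * math.log(m) * n_params

# Hypothetical data in which Y usually copies X.
data = [(0, 0)] * 40 + [(1, 1)] * 40 + [(0, 1)] * 10 + [(1, 0)] * 10

empty = {0: (), 1: ()}      # no edges
x_to_y = {0: (), 1: (0,)}   # single edge X -> Y

score_empty = bic_score(data, empty)
score_edge = bic_score(data, x_to_y)
```

On this data the structure with the edge scores higher despite its extra parameter, so a search step that adds the edge would be accepted.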

Aggregating BNs

• Each source i learns BN pi.
• p* is the BN we would learn from the combined data set.
• We want to approximate p* as closely as possible by aggregating p1, …, pL.
• Source information: estimates for the relative experience of the sources and the total amount of data seen (M).

AGGR: BN Aggregation Algorithm

• Idea: Use BN learning algorithm.
• Problem: We don’t have data.
• Key observation: We can use LinOP to approximate the statistics needed for the parameterization and scoring steps!
• Also, we can use LinOP properties to make algorithm reasonably efficient.
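The key observation can be sketched as follows. This is not the AGGR algorithm itself, only an illustration of replacing data counts with LinOP-based estimates; the function names, the toy distributions, the weights, and the estimate of M are all hypothetical.

```python
def linop_marginal(dists, weights, keep):
    """Marginal of the LinOP over the variable positions in `keep`.

    By the marginalization property this equals the LinOP of the
    sources' marginals; here it is computed from small joints for clarity.
    """
    out = {}
    for a, p in zip(weights, dists):
        for w, prob in p.items():
            key = tuple(w[i] for i in keep)
            out[key] = out.get(key, 0.0) + a * prob
    return out

def approx_cpd(dists, weights, child, parents):
    """Approximate P(child | parents) from LinOP marginals instead of data."""
    fam = linop_marginal(dists, weights, list(parents) + [child])
    par = linop_marginal(dists, weights, list(parents))
    return {key: prob / par[key[:-1]]
            for key, prob in fam.items() if par[key[:-1]] > 0}

def approx_counts(dists, weights, keep, m_estimate):
    """Approximate counts: estimated sample size times the LinOP marginal."""
    marg = linop_marginal(dists, weights, keep)
    return {key: m_estimate * prob for key, prob in marg.items()}

# Two sources' joint distributions over binary (X, Y) (toy numbers).
p1 = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
p2 = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.2, (1, 1): 0.3}
weights = [0.25, 0.75]   # assumed relative experience
m_estimate = 1000        # assumed estimate of the total amount of data M

cpd_y_given_x = approx_cpd([p1, p2], weights, child=1, parents=(0,))
counts_xy = approx_counts([p1, p2], weights, [0, 1], m_estimate)
```

The approximate CPD feeds the parameterization step and the approximate counts feed the scoring step, both without access to any of the sources' data.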

Asia BN

[Figure: the Asia network, with nodes Visit to Asia, Smoking, Tuberculosis, Lung Cancer, Bronchitis, Abnormality in Chest, X-Ray, and Dyspnea.]

Experimental Setup

• Generate data for sources from well-known ASIA BN which relates smoking, visiting Asia, and lung cancer.

• Compare our algorithm AGGR against the optimal algorithm OPT that has access to the combined data set.

• Accuracy measure: KL divergence from generating distribution.
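The accuracy measure can be sketched for small discrete distributions. The direction KL(generating || learned) and the use of natural logarithms are assumptions; the slides do not specify either. The example distributions are toy values.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in nats for discrete distributions given as dicts.

    Terms with p(w) == 0 contribute nothing; q(w) == 0 where p(w) > 0
    makes the divergence infinite.
    """
    total = 0.0
    for w, pw in p.items():
        if pw == 0.0:
            continue
        qw = q.get(w, 0.0)
        if qw == 0.0:
            return math.inf
        total += pw * math.log(pw / qw)
    return total

# Toy distributions standing in for the generating and a learned model.
generating = {"a": 0.5, "b": 0.5}
learned = {"a": 0.75, "b": 0.25}

d = kl_divergence(generating, learned)
```

A divergence of zero means the learned distribution matches the generating one exactly; larger values mean a worse fit.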

Sensitivity to M Experiments

• Sensitivity to M
  – Size of the combined data set M varies.
  – AGGR’s estimate of M is accurate.
• Sensitivity to Estimate of M
  – Size of the combined data set M is fixed.
  – AGGR’s estimate of M varies.

Sensitivity to M

[Plot: KL divergence (axis from 0.000 to 0.100) against M (200 to 3000), with curves for S1, S2, OPT, and AGGR.]

Sensitivity to Estimate of M

[Plot: KL divergence (axis from 0.0000 to 0.0016) against log10(M'/M) (-1.00 to 1.00) at M = 10k, with curves for S1, S2, OPT, and AGGR.]

Subpopulations

• Each source’s data may come from a different subpopulation P(D|Si), where D is the data.

• We want to learn P(D).

• P(D) = LinOP(P(D|S1), P(D|S2), …, P(D|SL)) with sources’ weights based on P(Si).

• We can apply the same algorithm.
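The LinOP identity on this slide is the law of total probability. A short derivation, assuming the subpopulations S_1, …, S_L are mutually exclusive and together cover the whole population, and writing α_i for the LinOP weight of source i (notation assumed here):

```latex
P(D) \;=\; \sum_{i=1}^{L} P(D, S_i)
     \;=\; \sum_{i=1}^{L} P(S_i)\, P(D \mid S_i)
     \;=\; \mathrm{LinOP}\bigl(P(D \mid S_1), \ldots, P(D \mid S_L)\bigr)
     \quad \text{with weights } \alpha_i = P(S_i).
```

The weights sum to 1 because the P(S_i) do, so this is a valid linear opinion pool.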

Subpopulations Experiments

• In the Asia network domain, one doctor practices in San Francisco, another in Cincinnati.
• Subpopulations have different priors for smoking and having visited Asia, so doctors’ beliefs are biased.
• The aggregate distribution comes much closer to the original distribution.

Asia BN

[Figure: the Asia network extended with a Doctor node. Nodes: Doctor, Visit to Asia, Smoking, Tuberculosis, Lung Cancer, Bronchitis, Abnormality in Chest, X-Ray, and Dyspnea.]

Subpopulations

[Plot: KL divergence (axis from 0.00 to 0.18) against M (200 to 3000), with curves for S1, S2, OPT, and AGGR.]

Contributions

• A semantic framework for aggregating learned probabilistic models.

• A LinOP-based algorithm for aggregating learned BNs.

• Experiments showing algorithm behaves well.

Outline

• Introduction
• Automating Information Integration
• Integrating Learned Probabilistic Information
• Conclusion and Current Work

Conclusion

• Conflict resolution is key in automated information integration.
• This is a difficult task in general.
• However, information about sources is often readily available.
• Principled use of this information can greatly enhance the ability to resolve conflicts intelligently.

Current Work

• Allow dependence between sources’ data sets in probabilistic aggregation work.

• Apply semantic framework to aggregation in other learning paradigms.

• Explore application of algorithms to database integration, RoboCup, stock market prediction, etc.

• Making committee meetings obsolete!

Multi-Agent Research Zone

• Research interests:
  – Information integration
  – Multi-agent machine learning
  – RoboCup soccer simulation league testbed
• Masters students
  – Jian Xu: Information integration in medical informatics
  – Linxin Gan: Ensemble learning in stock market prediction

CSA Graduate Program

• Masters in Computer Science
• Research areas include:
  – machine learning, KRR, and MAS
  – information retrieval, databases, and NLP
  – networking and virtual environments
  – simulation and evolutionary computation
  – software engineering and formal methods

http://unixgen.muohio.edu/~maynarp/