Daniel Sonntag (RIC/AM)
ECAI 2004
Distributed NLP and Machine Learning for Question Answering Grid
Agenda
• Introduction:
  • What is QA? Challenges
  • Survey on Question Answering Process, Distributed Computing, and JavaSpaces
• QA components:
  • Distributed QA Grid Architecture
  • Question Answering Process Workflow
• ML and QA Grid:
  • Master Control Protocol
  • Learning of MCP with Association Rules
Introduction: What is QA?
• QA is the task of finding answers to natural language questions by searching large document collections (providing short answers, in contrast to IR).
• Polar, disjunctive, and constituent interrogative questions
• Definition, enumeration, and single-instance (factoid) questions
• Examples from a search engine query log:
  • Who invented surf music?
  • How to copy a DVD?
  • How tall is the sears tower?
  • What are the 7 wonders of the world?
  • Where is a good restaurant in Valencia?
Introduction: What is QA?
• QA Challenges:
  • Questions and answers may have different surface structures.
  • Lexical gaps and partial failures affect many components.
  • How sophisticated (linguistically/knowledge-intensive) do the employed techniques have to be?
• QA system example: HTTP://askjeeves.com
Introduction: Survey on QA process
• QA process characteristics:
  • QA process/QA component usage can be a complex composite process.
  • The best way of selecting and applying single components is not obvious (no closed approach known; QA workflow/best solution rather nondeterministic).
  • Data access, data availability, and good performance of single components cannot be guaranteed.
Introduction: Survey on QA process
• Central idea to enhance performance and robustness:
  • Vary processing steps depending on complexity of the query.
  • Use shallow or deep QA strategies depending on question type (use shallow processing for fact-based questions, shown for Person, Location, Date, Quantity).
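The strategy choice on this slide can be sketched as a lookup on the detected answer type. This is only an illustration: the function name and the "shallow"/"deep" labels are assumptions, and only the four answer types come from the slide.

```python
# Answer types for which the slide reports shallow processing to work.
SHALLOW_TYPES = {"Person", "Location", "Date", "Quantity"}


def choose_strategy(answer_type: str) -> str:
    """Return 'shallow' for fact-based answer types, 'deep' otherwise.

    Hypothetical helper; the slides do not name such a function.
    """
    return "shallow" if answer_type in SHALLOW_TYPES else "deep"


print(choose_strategy("Date"))
print(choose_strategy("Definition"))
```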
Introduction: Survey on QA process
• Problem: no robustness and scalability for more difficult queries
  • unexpected availability problems, unexpected problems of coverage -> performance unknown/unexpected in advance
  • Duration and measures, short/long queries, ungrammatical queries, unrecognized tokens, no answer in knowledge base.
• Proposed solution:
  • Specification: Computational and conceptual difficulty is a question of:
    • Availability of PRs/LRs (processing resources/language resources) for single query words (unlike query type) -> treat (every) instance individually.
• New problem: How to abstract from query memory?
Introduction: Proposed Solution: QA Grid
• QA is a Semantic Grid application!
  • Data Mining to reveal semantics about QA components
  • Planning techniques for Grid Computing
Introduction: Related Grid Architectures
• Resource reuse, distributing and accessing language resources
• Model resources in an object-oriented way (Interfaces).
• 1997: GATE (General Architecture for Text Engineering), Cunningham et al., University of Sheffield
• 2004: UIMA (Unstructured Information Management Architecture), IBM Research (Watson)
  • Rapid combination of UIM technologies (-> Sem. Grid)
  • Flexible deployment options (-> Sem. Grid)
• 2005: DataMiningGrid, DC and cons.
Introduction: Distributed Computing and JavaSpaces
• Distributed Computing task: Co-operation of several computers on a processing-intensive problem.
• Object-oriented view: component objects work in parallel or sequentially -> implementation as Java objects, JavaSpaces
• Uncouple LRs and PRs (instead of hard-wiring them)
• Benefit from the inferred meta data of Grid components
• Allow for parallel processing
• Allow for variation in predefined QA stream
• Decide on the most suitable components for answering a specific query.
Introduction: Distributed Computing and JavaSpaces
• JavaSpaces:
  • Persistent object exchange area
  • Remote processes can co-ordinate their actions and exchange data.
• Scalability: add new components to match size of single/parallel question processing problem
• Address answer retrieval problem: Simply replace/add new server without changing client application
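The exchange-area idea can be illustrated with a small in-memory tuple space. This is a sketch in Python, not the JavaSpaces API: the class `TupleSpace`, its dictionary entries, and the worker below are assumptions made for illustration. Only the write/read/take style of coordination mirrors what a space offers.

```python
import threading


class TupleSpace:
    """In-memory stand-in for a shared object exchange area (hypothetical)."""

    def __init__(self):
        self._entries = []
        self._cond = threading.Condition()

    def write(self, entry: dict) -> None:
        """Put an entry into the space and wake up waiting processes."""
        with self._cond:
            self._entries.append(entry)
            self._cond.notify_all()

    def _match(self, template: dict):
        # An entry matches if it agrees with the template on every template key.
        for entry in self._entries:
            if all(entry.get(k) == v for k, v in template.items()):
                return entry
        return None

    def read(self, template: dict, timeout=None):
        """Return a matching entry without removing it, or None on timeout."""
        with self._cond:
            self._cond.wait_for(lambda: self._match(template) is not None, timeout)
            return self._match(template)

    def take(self, template: dict, timeout=None):
        """Remove and return a matching entry, or None on timeout."""
        with self._cond:
            self._cond.wait_for(lambda: self._match(template) is not None, timeout)
            entry = self._match(template)
            if entry is not None:
                self._entries.remove(entry)
            return entry


space = TupleSpace()


def tokenizer_worker():
    # A component that only knows the space, not the client.
    task = space.take({"kind": "question"}, timeout=1.0)
    if task is not None:
        tokens = task["text"].rstrip("?").split()
        space.write({"kind": "tokens", "id": task["id"], "tokens": tokens})


worker = threading.Thread(target=tokenizer_worker)
worker.start()
space.write({"kind": "question", "id": 1, "text": "When was Frank Sinatra born?"})
result = space.take({"kind": "tokens", "id": 1}, timeout=1.0)
worker.join()
print(result["tokens"])
```

Because client and worker share only the space, a different tokenizer could be swapped in without touching the client code, which is the replace/add-server point made above.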
Distributed QA Grid Architecture
• JavaSpace concept for QA
  • exchange area
  • data communication
  • data synchronisation
  • standard appl. scenario
  • classic blackboard pattern
-> declare components
-> focus on workflow control
QA Components Declaration
• Goal: Suitable components for QA workflow variations
  • classification of interchangeable, exchangeable, and decomposable components
  • LRs: data-only resources: lexica, corpora, thesauri, ontologies
  • PRs: programmatic or algorithmic resources: POS-Tagger, NE-Rec. ...
• Define equivalence class in the sense of the same IO-behaviour.
-> abstraction from individual LRs/PRs when appropriate
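One way to read "equivalence class in the sense of the same IO-behaviour" is to group components by their input and output types. The sketch below assumes each component declares both as plain strings; the `Component` fields, the type labels, and the second tagger are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    name: str
    kind: str         # "PR" (processing resource) or "LR" (language resource)
    input_type: str   # assumed label for what the component consumes
    output_type: str  # assumed label for what the component produces


def equivalence_classes(components):
    """Group component names by identical (input_type, output_type)."""
    classes = defaultdict(list)
    for c in components:
        classes[(c.input_type, c.output_type)].append(c.name)
    return dict(classes)


components = [
    Component("HMM POS-Tagger", "PR", "tokens", "pos_tags"),
    Component("Rule POS-Tagger", "PR", "tokens", "pos_tags"),  # hypothetical
    Component("NE-Rec.", "PR", "tokens", "entities"),
]

classes = equivalence_classes(components)
for io_signature, names in classes.items():
    print(io_signature, names)
```

Members of one class are candidates for exchange at the same workflow stage, which is the abstraction from individual PRs/LRs that the slide asks for.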
QA Processing Stages/ Basic Workflow (abstract)
[Figure: workflow diagram; stage labels: Answer Type Detection, Answer Template Filling, Doc/Sent/Para Retrieval, Answer Verification, Answer Zooming]
QA Processing Stages/ Basic Workflow
LRs: WordNet, Wortschatz, Leo, domain dictionaries
PRs: Tagger, Chunker, Duden (SOAP), NER, LSA
A{QE,{PR1, LR1}}
... ... ...
QA Components Declaration
• Every PR may contain several other PRs and LRs.
• Each LR is atomic.
• Each PR/LR defines a subtask.
• Different top-level components may have the same subtask associated.
• Example: {PR, PR}, {PR,{PR,LR}}
QA Workflow Problems (and Potentials for Training?)
• It is not clear which PR/LR should be applied at stage X.
• PRs/LRs modify input/answer, and possibly add misleading information.
• Example: Calculate document similarity on filtered or unfiltered docs?
• Optimisation of QA process:
ML and QA Grid
• Concentrate on mining the protocol for the selection of PRs/LRs sets associated with top-level components (Problem 1).
• Optimise single components vs. optimise complete answering process.
• Accept an unreliable decision.
• Sets only define the plan to subsequent steps in workflow.
Reveal meta data about input, components, overall result.
HMM POS-Tagger, PCFG Sentence Parser, SVM Classifier
Master Control Protocol
• When was Frank Sinatra born? December 12, 1915
(When) (was) (Frank) (Sinatra) (born)
{Date} ........... {Person}
input#PR1#output#quality ... (at stage x)
input#{PR2,{PR3,LR1}}#output#quality (at stage y)
overall result quality
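The `input#components#output#quality` lines above suggest a simple delimited record per processing stage. A minimal parsing sketch follows, assuming exactly four `#`-separated fields and a numeric quality score; the slide does not specify the quality scale, and the names `ProtocolRecord` and `parse_record` are hypothetical.

```python
from typing import NamedTuple


class ProtocolRecord(NamedTuple):
    stage_input: str
    components: str   # e.g. "PR1" or "{PR2,{PR3,LR1}}", kept as written
    stage_output: str
    quality: float    # assumption: quality is stored as a number


def parse_record(line: str) -> ProtocolRecord:
    """Split one protocol line of the form input#components#output#quality."""
    parts = line.split("#")
    if len(parts) != 4:
        raise ValueError(f"expected 4 '#'-separated fields, got {len(parts)}")
    stage_input, components, stage_output, quality = parts
    return ProtocolRecord(stage_input, components, stage_output, float(quality))


# Made-up example line; the quality value 1.0 is illustrative only.
record = parse_record(
    "When was Frank Sinatra born?#{PR2,{PR3,LR1}}#December 12, 1915#1.0"
)
print(record.components, record.quality)
```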
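Learning of Master Control Protocol
• Association Rules
  • Let $I = \{i_1, \dots, i_n\}$ be a set of events, $X, Y \subseteq I$ sets of events of $I$, and database $D$ a multiset of transactions $T \subseteq I$.
  • An association rule is an implication: $X \Rightarrow Y$, with $X \cap Y = \emptyset$.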
Learning of Master Control Protocol
• Statements of association rules, prominent examples from basket analysis:
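As an illustration in the basket-analysis setting, the two measures usually attached to an association rule, support and confidence, can be computed as follows. The baskets are made-up toy data and the function names are assumptions, not taken from the talk.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """support(antecedent and consequent) / support(antecedent).

    Raises ZeroDivisionError if the antecedent never occurs.
    """
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Toy baskets, invented for illustration.
baskets = [
    frozenset({"bread", "butter"}),
    frozenset({"bread", "butter", "milk"}),
    frozenset({"bread"}),
    frozenset({"milk"}),
]

rule_support = support(baskets, {"bread", "butter"})
rule_confidence = confidence(baskets, {"bread"}, {"butter"})
print(rule_support, round(rule_confidence, 3))
```

Here the rule bread -> butter holds in 2 of 4 baskets (support 0.5) and in 2 of the 3 baskets that contain bread (confidence 2/3).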
Learning of Master Control Protocol
• Simplification: patterns of flat simple items
  • PR1 = {PR2,{PR3, LR1}} -> {PR1, PR2, PR3, LR1}
• Filter statistically relevant implications and association rules of form:
  • [instance prop..., PRs..., LRs...] -> successful answer
  • [PRs..., LRs...] -> [PRs..., LRs..., successful answer]
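The flattening step and the filter for rules that end in a successful answer can be sketched as below. Several things are assumptions: nested components are written as nested Python lists, only single-item antecedents are considered, and the thresholds and the protocol entries are invented for illustration.

```python
SUCCESS = "successful answer"


def flatten(name, parts):
    """Flatten PR1 = {PR2,{PR3, LR1}} into the flat item set {PR1, PR2, PR3, LR1}."""
    items = {name}
    stack = list(parts)
    while stack:
        part = stack.pop()
        if isinstance(part, (list, tuple)):
            stack.extend(part)   # descend into a nested component set
        else:
            items.add(part)
    return frozenset(items)


def rules_for_success(transactions, min_support=0.5, min_confidence=0.8):
    """Keep rules `item -> successful answer` that pass both thresholds."""
    n = len(transactions)
    items = set().union(*transactions) - {SUCCESS}
    rules = {}
    for item in sorted(items):
        with_item = [t for t in transactions if item in t]
        both = sum(SUCCESS in t for t in with_item)
        sup, conf = both / n, both / len(with_item)
        if sup >= min_support and conf >= min_confidence:
            rules[item] = (sup, conf)
    return rules


flat = flatten("PR1", ["PR2", ["PR3", "LR1"]])

# Invented protocol: one flat item set per answered question.
protocol = [
    flat | {SUCCESS},
    frozenset({"PR1", "PR2", SUCCESS}),
    frozenset({"PR4", "LR1"}),
    frozenset({"PR1", "PR3", "LR1", SUCCESS}),
]

rules = rules_for_success(protocol)
print(sorted(rules))
```

A full miner would also enumerate multi-item antecedents (for example with Apriori); the single-item restriction only keeps the sketch short.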
Learning of Master Control Protocol
• We believe that some very interesting rules can be inferred from: actual input words, question words, abstract terms.
• Reveal patterns about strength of applying spec. workflows and components.
• Get meta information about exchangeability of components.
• Use rules for workflow decisions on incoming questions:
  • e.g. Person -> P4 (Lookup in Factbook)
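Applying learned rules to an incoming question could look like the following sketch: pick the applicable rule with the highest confidence, otherwise fall back to a default. Only the rule Person -> P4 comes from the slide; the second rule, the confidence values, and the default workflow "P0" are hypothetical.

```python
# (antecedent features, workflow, confidence); confidences are made up.
RULES = [
    (frozenset({"Person"}), "P4", 0.90),          # from the slide: Lookup in Factbook
    (frozenset({"Date", "Person"}), "P2", 0.95),  # hypothetical rule
]


def choose_workflow(features, rules=RULES, default="P0"):
    """Return the workflow of the most confident rule whose antecedent
    is contained in the question's features, or `default` if none applies."""
    applicable = [rule for rule in rules if rule[0] <= set(features)]
    if not applicable:
        return default
    return max(applicable, key=lambda rule: rule[2])[1]


print(choose_workflow({"Person"}))
print(choose_workflow({"Date", "Person"}))
print(choose_workflow({"Location"}))
```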
Conclusion
• (1) We implemented an opportunistic problem solving strategy for semantic Grid applications (for QA, blackboard pattern).
• (2) Grid serves for QA components to assemble knowledge.
• (3) Assemble special purpose Grids via automatic training
• (4) Grid Semantics = Learned Control Protocol
Reading Material