IBM’s DeepQA, or Watson

Little history

• Carnegie Mellon (CMU) collab.
• OpenEphyra (2002)
• Piquant (2004)
• Initially 15% accuracy
• 15% is not very good, is it?

OpenEphyra, Piquant, & Jeopardy!

Source: [1] http://www.aaai.org/Magazine/Watson/watson.php

Principles

• Massive parallelism: Exploit massive parallelism in the consideration of multiple interpretations and hypotheses.

• Many experts: Facilitate the integration, application, and contextual evaluation of a wide range of loosely coupled probabilistic question and content analytics.

• Pervasive confidence estimation: No component commits to an answer; all components produce features and associated confidences, scoring different question and content interpretations. An underlying confidence-processing substrate learns how to stack and combine the scores.

• Integrate shallow and deep knowledge: Balance the use of strict semantics and shallow semantics, leveraging many loosely formed ontologies.

Source: [4] http://xkcd.com/720/ Randall Munroe (CC BY-NC 2.5)

20 researchers, 3 years later (2008)

Source: [1] http://www.aaai.org/Magazine/Watson/watson.php

What’s Watson’s source of information?

• Structured content: databases, taxonomies, ontologies
• Domain data: encyclopedias, dictionaries, thesauri, newswire articles, literary works
• Machine learning: test-question training sets

Learning framework

• Trained with a set of approximately 25,000 Jeopardy! questions comprising 5.7 million question-answer pairs (instances), where each instance had 550 features.
• Implemented machine learning techniques such as transfer learning, stacking, and successive refinement.

Learning framework is based on phases

• Configurable
• Uses 7 phases for Jeopardy!
• Trained with a set of approximately 25,000 Jeopardy! questions comprising 5.7 million question-answer pairs (instances), where each instance had 550 features.
• Implemented machine learning techniques such as transfer learning, stacking, and successive refinement.

Phases

• 1. Hitlist normalization
• 2. Base
• 3. Transfer learning
• 4. Merge evidence
• 5. Elite
• 6. Evidence diffusion
• 7. Multi-answers

– Within each phase there are 3 main steps:

• 1. Evidence merging
• 2. Postprocessing
• 3. Classifier training/application
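The phase structure above can be sketched as a tiny pipeline: each phase merges evidence, postprocesses the candidate list, and applies a classifier, and phases are chained so each narrows the hitlist. This is a hypothetical illustration, not Watson's actual code; all function names and the toy scoring are invented.

```python
# Hypothetical sketch of a phase-based ranking pipeline: each phase
# merges evidence, postprocesses features, then applies a classifier.

def merge_evidence(candidates):
    # Step 1: combine duplicate (answer, score) rows by summing scores.
    merged = {}
    for answer, score in candidates:
        merged[answer] = merged.get(answer, 0.0) + score
    return merged

def postprocess(merged, top_k):
    # Step 2: keep only the top_k answers by current score.
    ranked = sorted(merged.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_k]

def classify(ranked):
    # Step 3 (application mode): turn scores into pseudo-confidences.
    total = sum(score for _, score in ranked) or 1.0
    return [(answer, score / total) for answer, score in ranked]

def run_phase(candidates, top_k):
    return classify(postprocess(merge_evidence(candidates), top_k))

# Seven such phases could then be chained, each narrowing the hitlist
# (e.g. Base keeps 100 candidates, Elite keeps only 5).
candidates = [("sword", 0.4), ("Excalibur", 0.3), ("sword", 0.2), ("lance", 0.1)]
print(run_phase(candidates, top_k=2))
```

In the real system each step is far richer (550 features, trained models), but the merge / postprocess / classify skeleton per phase is the same.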

1. Hitlist Normalization:

• Merge identical strings from different sources, then partition into question classes. Different classes of questions, such as multiple choice, questions with a useless LAT (lexical answer type, e.g. “it” or “this”), date questions, and so forth, may require different weighting of evidence. The DeepQA confidence-estimation framework supports this through the concept of routes; in the Jeopardy! system, question classes that profited from specialized routing were identified manually. Routes are, in effect, question archetypes.

2. Base:

• Weed out extremely bad candidates: only the top 100 candidates after hitlist normalization are passed to later phases. With at most 100 answers per question, the standardized features are then recomputed. This recomputation at the start of the Base phase is the primary reason the Hitlist Normalization phase exists: by eliminating a large volume of junk answers (i.e., ones that were not remotely close to being considered), the remaining answers provide a more useful distribution of feature values against which to compare each answer.
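Feature standardization here typically means something like z-scoring each feature over the surviving candidates. A minimal sketch of that idea (the function name and sample values are illustrative, not from the DeepQA implementation):

```python
import statistics

def standardize(feature_values):
    # Recompute z-scores over the surviving candidates only, so each
    # answer's feature is measured against a distribution free of the
    # junk candidates removed by hitlist normalization.
    mean = statistics.mean(feature_values)
    stdev = statistics.pstdev(feature_values) or 1.0  # guard against zero spread
    return [(v - mean) / stdev for v in feature_values]

# E.g. a hypothetical "passage support" feature for the top candidates:
raw = [4.0, 2.0, 2.0, 0.0]
print(standardize(raw))
```

Dropping the junk answers before standardizing shifts the mean and shrinks the spread, so the z-scores of the real contenders become far more discriminative.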

3. Transfer learning:

• For uncommon question classes, i.e., adding more routed models.

• The phase-based framework supports a straightforward parameter-transfer approach to transfer learning: one phase’s output from a general model is passed into the next phase as a feature of a specialized model.

• Because logistic regression uses a linear combination of weighted features, the weights learned in the transfer phase can be roughly interpreted as an update to the parameters learned from the general task.
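A small sketch of that parameter-transfer idea: the general model's score enters the specialized model as just another feature. All weights and function names below are invented for illustration; they are not Watson's actual parameters.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def general_model(features):
    # General route: a logistic model trained over all question classes
    # (weights here are made up for illustration).
    w = [0.8, 0.5]
    return sigmoid(sum(wi * xi for wi, xi in zip(w, features)))

def specialized_model(features):
    # Specialized route for an uncommon question class: the general
    # model's score is passed in as an extra feature. Its learned weight
    # (1.2 here) then acts roughly as an update to the general
    # parameters, rather than a model trained entirely from scratch.
    general_score = general_model(features)
    w = [0.1, -0.2, 1.2]  # small class-specific corrections + transfer weight
    x = features + [general_score]
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))

print(specialized_model([1.0, 0.5]))
```

The appeal is data efficiency: the rare question class only needs enough training examples to learn small corrections, not a full model.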

Logistic regression

• The research group experimented with:

• Logistic regression
• Support vector machines (SVMs) with linear and nonlinear kernels
• Single- and multilayer neural nets
• Boosting
• Decision trees
• Locally weighted learning
• Etc.

• Logistic regression was found to be the best method for classifying / gauging weights, and is used in all phases / steps.
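One reason logistic regression fits this task: its output is a probability that can serve directly as an answer confidence, and its weights are interpretable per feature. A toy sketch of training one by gradient descent on made-up question-answer features (not Watson's 550 features or its actual trainer):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy training set: (feature vector, is_correct_answer) pairs.
data = [([3.0, 1.0], 1), ([0.5, 0.2], 0), ([2.5, 0.8], 1), ([0.2, 1.5], 0)]
w = [0.0, 0.0]
lr = 0.1

for _ in range(1000):  # batch gradient descent on the log-loss
    grad = [0.0, 0.0]
    for x, y in data:
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
        for i in range(2):
            grad[i] += (p - y) * x[i]
    w = [wi - lr * g / len(data) for wi, g in zip(w, grad)]

# The learned probability doubles as a confidence score for ranking.
confidence = sigmoid(sum(wi * xi for wi, xi in zip(w, [2.8, 0.9])))
print(confidence)
```

Ranking candidate answers by this probability gives both the final ordering and the confidence Watson needs to decide whether to buzz in.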

4. Merge Evidence (answer merging):

• Merges evidence between equivalent answers.
• Selects a canonical form, e.g.:

• John F. Kennedy
• J.F.K.
• Kennedy

• Robust methods are needed: natural language processing (NLP).
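A minimal sketch of equivalence merging: variants are grouped, their evidence scores are summed, and the best-scored variant becomes the canonical form. The hard-coded equivalence table is purely illustrative; detecting that "J.F.K." and "John F. Kennedy" corefer is exactly where the robust NLP is needed.

```python
# Hypothetical sketch: merge equivalent answer strings and keep the
# highest-scoring variant as the canonical form. Real equivalence
# detection needs robust natural-language processing, not a lookup table.

EQUIVALENTS = {
    "john f. kennedy": "kennedy-group",
    "j.f.k.": "kennedy-group",
    "kennedy": "kennedy-group",
}

def merge_answers(scored_answers):
    groups = {}
    for answer, score in scored_answers:
        key = EQUIVALENTS.get(answer.lower(), answer.lower())
        groups.setdefault(key, []).append((answer, score))
    merged = []
    for variants in groups.values():
        canonical = max(variants, key=lambda av: av[1])[0]  # best-scored variant
        merged.append((canonical, sum(s for _, s in variants)))
    return merged

print(merge_answers([("John F. Kennedy", 0.5), ("J.F.K.", 0.3), ("Nixon", 0.2)]))
```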

• The framework can also merge answers connected by a relation other than equivalence; it merges answers when it detects a more_specific relation between them.

• “MYTHING IN ACTION: One legend says this was given by the Lady of the Lake & thrown back in the lake on King Arthur’s death.”

• Watson merged the two answers “sword” and “Excalibur”, and selected “sword” as the canonical form because it had the higher initial score.

5. Elite:

• Runs near the end of the learning pipeline; trains on, and applies to, only the top five answers as ranked by the previous phase.

• Otherwise similar to phase 2, Base.

6. Evidence Diffusion:

• Diffuses evidence between related answers, subject to diffusion criteria.
• Similar to the Answer Merging phase, but combines evidence from related answers rather than equivalent ones.

• “WORLD TRAVEL: If you want to visit this country, you can fly into Sunan International Airport or ... or not visit this country.” (Correct answer: North Korea)

• Most sources would cite Pyongyang as the location of the airport, overwhelming the answer North Korea.

• In this phase, evidence may be diffused from the source (Pyongyang) to the target (North Korea), provided that:

• 1. The target meets the expected answer type (is a country)
• 2. There is a semantic relation (located_in)
• 3. The transitivity of the relation allows for meaningful diffusion given the question.
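The three criteria can be sketched as a gate in front of the score transfer. The relation store, type table, and function names below are invented for illustration; DeepQA's actual relation detection is far richer.

```python
# Hypothetical sketch of evidence diffusion: move evidence from a
# source answer to a related target answer only when all three
# criteria hold (type match, semantic relation, transitivity).

RELATIONS = {("Pyongyang", "North Korea"): "located_in"}
TYPES = {"North Korea": "country", "Pyongyang": "city"}
TRANSITIVE_FOR_QUESTION = {"located_in"}  # relations meaningful for this question

def diffuse(scores, expected_type):
    for source in list(scores):
        for target in list(scores):
            relation = RELATIONS.get((source, target))
            if relation is None:
                continue  # criterion 2: a semantic relation must exist
            if TYPES.get(target) != expected_type:
                continue  # criterion 1: target must meet the expected answer type
            if relation not in TRANSITIVE_FOR_QUESTION:
                continue  # criterion 3: relation must support meaningful diffusion
            scores[target] += scores[source]  # diffuse evidence to the target
    return scores

scores = {"Pyongyang": 0.7, "North Korea": 0.2}
print(diffuse(scores, expected_type="country"))
```

After diffusion, North Korea inherits the airport evidence that had accumulated on Pyongyang, letting the type-correct answer win.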

7. Multi-answers:

• Join answer candidates for questions requiring multiple answers.

3 Steps…

• 1. Evidence Merging: combines evidence for a given question-answer pair across different occurrences (e.g., different passages containing a given answer).

• 2. Postprocessing: transforms the matrix of question-answer pairs and their feature values (e.g., removing answers and/or features, or deriving new features from existing ones), for instance to adjust feature sensitivity and dynamic range, or to make feature values relative to the other candidates.

• 3. Classifier Training/Application: runs in either training mode, in which a model is produced over training data, or application mode, in which the previously trained models are used to rank and estimate confidence in answers for a given question.

3. Application classifier

• After merging, answers are ranked and confidences estimated based on the merged scores.

• Watson uses machine learning to assign a confidence level to each merged answer, indicating how likely it is to be correct.

• Ensemble methods:
• Mixture of experts
• Stacked generalisation metalearner
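Stacked generalisation means the outputs of several base scorers become the feature vector of a metalearner. A minimal sketch with two invented base scorers and a logistic metalearner (all names and weights are hypothetical):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical base scorers, each judging one kind of evidence.
def passage_scorer(answer_evidence):
    return answer_evidence["passage_hits"] / 10.0

def type_scorer(answer_evidence):
    return 1.0 if answer_evidence["type_matches"] else 0.0

def stacked_confidence(answer_evidence):
    # The metalearner (a logistic model with illustrative weights)
    # takes the base scorers' outputs as its feature vector.
    base_outputs = [passage_scorer(answer_evidence), type_scorer(answer_evidence)]
    meta_weights = [2.0, 1.5]
    bias = -1.0
    z = bias + sum(w * s for w, s in zip(meta_weights, base_outputs))
    return sigmoid(z)

evidence = {"passage_hits": 7, "type_matches": True}
print(stacked_confidence(evidence))
```

In the real system the metalearner's weights are themselves trained, which is how the framework "learns how to stack and combine the scores" of its many loosely coupled experts.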

This is just the learning framework.

Sources

• [1] http://www.aaai.org/Magazine/Watson/watson.php
“Building Watson: An Overview of the DeepQA Project”. David Ferrucci, Eric Brown, Jennifer Chu-Carroll, James Fan, David Gondek, Aditya A. Kalyanpur, Adam Lally, J. William Murdock, Eric Nyberg, John Prager, Nico Schlaefer, and Chris Welty. AI Magazine, Fall 2010. Copyright ©2010 AAAI (Association for the Advancement of Artificial Intelligence). All rights reserved.

• [2] http://brenocon.com/watson_special_issue/14%20a%20framework%20for%20merging%20and%20ranking%20answers.pdf
“A framework for merging and ranking of answers in DeepQA”. D. C. Gondek, A. Lally, A. Kalyanpur, J. W. Murdock, P. A. Duboue, L. Zhang, Y. Pan, Z. M. Qiu, and C. Welty.

• [3] https://laplacian.wordpress.com/2011/02/27/how-ibms-watson-computer-thinks-on-jeopardy/
Blog post, “Free Won’t”.

• [4] http://imgs.xkcd.com/comics/recipes.png
Randall Munroe, xkcd #720 (CC BY-NC 2.5).

Questions?

• Further reading:

• http://en.wikipedia.org/wiki/Learning_to_rank
• http://en.wikipedia.org/wiki/Supervised_learning
• http://en.wikipedia.org/wiki/Neuro-linguistic_programming