Automatically Labeling Facts in a Never-Ending Language Learning System

Estevam R. Hruschka Jr. Federal University of São Carlos

Joint Work with the Carnegie Mellon Read The Web Group

Never-Ending Learning

•  Main task: acquire a growing competence without asymptote
–  over years
–  multiple functions
–  where learning one thing improves the ability to learn the next
–  acquiring data from humans and the environment

•  Many candidate domains:
–  Robots
–  Softbots
–  Game players

NELL: Never-Ending Language Learner

Inputs:
•  initial ontology
•  handful of examples of each predicate in the ontology
•  the web
•  occasional interaction with human trainers

The task:
•  run 24x7, forever
•  each day:
1. extract more facts from the web to populate the initial ontology
2. learn to read (perform #1) better than yesterday

http://rtw.ml.cmu.edu


Goal:
•  run 24x7, forever
•  each day:
1. extract more facts from the web to populate the given ontology
2. learn to read better than yesterday

Today...
Running 24x7 since January 2010

Input:
•  ontology defining ~800 categories and relations
•  10-20 seed examples of each
•  1 billion web pages (ClueWeb – Jamie Callan)

Result:
•  continuously growing KB with over 70,000,000 extracted beliefs

Human Advice


Knowledge Base Validation in NELL

•  Human Supervision: RTW group members
•  Conversing Learning: NELL can autonomously talk to people in web communities and ask for help
•  Web Querying: NELL can query the Web on specific facts to verify correctness, or to predict the validity of a new fact
•  Hiring Labelers: NELL can autonomously hire people (using web services such as Mechanical Turk) to label data and help the system validate acquired knowledge

Conversing Learning

Basic Steps:

•  Decide which task is going to be asked
•  Determine who are the oracles the ML system is going to consult
•  Propose a method of conversation with oracles, often humans
•  Determine how to feed the community inputs back to the ML system

Decide which task is going to be asked:
•  Learned facts
•  Learned inference rules
•  Metadata (mainly for automatically extending the ontology)

Who are the oracles the ML system is going to consult?

Yahoo! Answers
–  very popular on the Web
–  a lot of metadata to harvest

Twitter
–  millions of users worldwide
–  a system that was not designed to work as a QA environment

Both web communities have an API to connect to their database


Propose a method of conversation with oracles, often humans

Macro Question-Answering. For each posted question:
–  Ask for simple yes/no answers
–  Try to understand every answer
–  Discard answers too difficult to understand
–  Conclude based only on fully understood answers
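
The macro question-answering steps above could be sketched roughly as follows. This is only an illustration: the cue-word sets, the function names `interpret_answer` and `conclude`, and the majority-vote rule are all assumptions, not NELL's actual answer-understanding method.

```python
import re
from collections import Counter

# Hypothetical cue words; a real system would need far richer language understanding.
YES_CUES = {"yes", "yeah", "yep", "true", "correct"}
NO_CUES = {"no", "nope", "false", "incorrect"}

def interpret_answer(text):
    """Return 'yes', 'no', or None when the answer is not fully understood."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    has_yes = bool(words & YES_CUES)
    has_no = bool(words & NO_CUES)
    if has_yes == has_no:  # no cue at all, or conflicting cues: discard
        return None
    return "yes" if has_yes else "no"

def conclude(answers):
    """Majority vote over understood answers; None when there is no clear majority."""
    votes = Counter(v for v in map(interpret_answer, answers) if v is not None)
    if votes["yes"] == votes["no"]:
        return None
    return "yes" if votes["yes"] > votes["no"] else "no"

answers = ["No.", "NO, not in every case.", "Yes, always true", "It depends on the league"]
print(conclude(answers))
```

In this sketch the last answer is discarded because it contains no recognizable cue, so the conclusion rests only on the three answers that were understood.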

How to feed the community inputs back to the ML system?

Suggested actions to NELL:
–  Synonym/co-reference resolution
–  Automatically update the Knowledge Base

Some Initial Results with First-Order Rules:
•  Take the top 10% of rules from the Rule Learner
•  60 rules were converted into questions and asked with both the regular and the Yes/No question approach
•  The 120 questions received a total of 350 answers

•  Rule extracted from NELL in Prolog format:
stateLocatedInCountry(x,y) :- statehascapital(x,z), citylocatedincoutry(z,y)
•  Converted into a question: Is this statement always true? If state X has capital Z and city Z is located in country Y then state X is located in country Y.
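
A minimal sketch of how such a rule might be turned into a question. The `TEMPLATES` table, the helper names, and the parsing approach are assumptions made for illustration; the slides do not describe how NELL actually performs the conversion.

```python
import re

# Hypothetical templates mapping predicate names to English phrases.
TEMPLATES = {
    "stateLocatedInCountry": "state {0} is located in country {1}",
    "statehascapital": "state {0} has capital {1}",
    "citylocatedincoutry": "city {0} is located in country {1}",
}

def parse_literal(literal):
    """Split 'pred(a,b)' into ('pred', ['A', 'B'])."""
    match = re.fullmatch(r"\s*(\w+)\(([^)]*)\)\s*", literal)
    if match is None:
        raise ValueError(f"not a literal: {literal!r}")
    args = [a.strip().upper() for a in match.group(2).split(",")]
    return match.group(1), args

def verbalize(literal):
    """Render one literal as an English phrase using its template."""
    name, args = parse_literal(literal)
    return TEMPLATES[name].format(*args)

def rule_to_question(rule):
    head, body = rule.split(":-")
    # Split the body only on commas that follow a closing parenthesis.
    body_literals = re.split(r"(?<=\))\s*,\s*", body.strip())
    conditions = " and ".join(verbalize(lit) for lit in body_literals)
    return f"Is this statement always true? If {conditions} then {verbalize(head)}."

rule = "stateLocatedInCountry(x,y):-statehascapital(x,z), citylocatedincoutry(z,y)"
print(rule_to_question(rule))
```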

Question: (Yes or No?) If athlete Z is member of team X and athlete Z plays in league Y, then team X plays in league Y.

•  Twitter answers sample: No. (Z in X) ∧ (Z in Y) → (X in Y)

•  Yahoo! Answers sample: NO, Not in EVERY case. Athlete Z could be a member of football team X and he could also play in his pub’s Friday nights dart team. The Dart team could play in league Y (and Z therefore by definition plays in league Y). This does not mean that the football team plays in the darts league!

Some Initial Results with Facts Validation:

Some Initial Results with Metadata:
•  Question: Could you please give me some examples of clothing?
•  Answer 01: Snowshoes, rain ponchos, galoshes, sunhats, visors, scarves, mittens, and wellies are all examples of weather specific clothing!
•  Answer 02: pants
•  Answer 03: Training shoes can be worn by anyone for any purpose, but the term means to train in sports

Some Initial Results with Metadata:
•  Users replied with 552 seeds for 129 categories
•  Total of 5900 promotions with seeds created by NELL’s developers
•  Total of 5300 promotions with seeds extracted from answers of Twitter users (similar precision)

Some Initial Results with Metadata:
•  For Relation Discovery Components
–  Symmetry: Is it always true that if a person P1 is neighbor of a person P2, then P2 is neighbor of P1?
–  Anti-symmetry: Is it always true that if a person P1 is the coach of a person P2, then P2 is not coach of P1?

Some Initial Results with Metadata:
•  Feature Weighting/Selection for CMC
–  Logistic Regression features are based on noun phrase morphology
–  (true or false) hotel names tend to be compound noun phrases having “hotel” as the last word.
–  (true or false) a word having “burgh” as suffix (e.g., Pittsburgh) tends to be a city name.
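
To make the two true/false statements concrete, here is a small sketch of noun-phrase morphology features of the kind they describe. The feature names and the function `morphology_features` are hypothetical and are not taken from CMC itself.

```python
def morphology_features(noun_phrase):
    """Binary features computed from the surface form of a noun phrase."""
    words = noun_phrase.lower().split()
    last = words[-1] if words else ""
    return {
        "is_compound": len(words) > 1,          # more than one word
        "last_word=hotel": last == "hotel",     # e.g. "Hilton Garden Hotel"
        "suffix=burgh": last.endswith("burgh"), # e.g. "Pittsburgh"
        "capitalized": noun_phrase[:1].isupper(),
    }

print(morphology_features("Hilton Garden Hotel"))
print(morphology_features("Pittsburgh"))
```

Answers from the community to questions like the ones above could then be used to raise or lower the weight given to an individual feature such as `suffix=burgh`.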

Ongoing and future work

•  Asking the right community and the right person
•  Asking the right thing to maximize the results with minimum questions (multi-view Active Learning)
•  Better Question-Answering methods
•  Asking in different languages and exploring time zones

OpenEval: Web Information Query Evaluation

Mehdi Samadi, Manuela Veloso and Manuel Blum
Computer Science Department
Carnegie Mellon University, Pittsburgh, PA

AAAI 2013, July 16, Bellevue, WA, USA

Motivation

•  Querying by human or agent
•  Information validation
•  Open Web
•  Online/Anytime
•  Scalable
•  Few seed examples for training
•  Small ontology

[Figure: information validation example with the queries healthyFood(shrimp) (“Shrimp is healthy”) and healthyFood(apple), the confidence values 0.72 and 0.88, and the caption “I can wait more…”]

Learning

[Figure: ontology with the categories Food and Animal; predicates healthyFood, unHealthyFood, ...; example instances Apple, Kale, Black Beans, Salmon, Walnut, Banana, …]

1- Given an input predicate instance and a keyword, OpenEval first formulates a search query.

A predicate instance: healthyFood(Apple)

Convert to a query: {“apple”}.
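
A rough sketch of this query-formulation step. The function name `formulate_query`, the quoting of arguments, and the way the optional keyword is appended are assumptions; the slide only shows the instance healthyFood(Apple) becoming the query {“apple”}.

```python
import re

def formulate_query(predicate_instance, keyword=None):
    """Turn 'healthyFood(Apple)' into a list of search terms."""
    match = re.fullmatch(r"\s*(\w+)\s*\(([^)]*)\)\s*", predicate_instance)
    if match is None:
        raise ValueError(f"not a predicate instance: {predicate_instance!r}")
    # Each argument becomes a quoted, lower-cased phrase.
    arguments = [a.strip().lower() for a in match.group(2).split(",") if a.strip()]
    terms = [f'"{a}"' for a in arguments]
    if keyword:
        terms.append(keyword.lower())  # assumed: keyword is simply added to the query
    return terms

print(formulate_query("healthyFood(Apple)"))             # ['"apple"']
print(formulate_query("healthyFood(Apple)", "vitamin"))  # ['"apple"', 'vitamin']
```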

2- OpenEval queries the open Web and processes the retrieved unstructured Web pages.

Extracting CBIs

3- OpenEval extracts a set of Context-Based Instances (CBI).

A predicate instance: healthyFood(Shrimp)

Convert to a query: {“shrimp”}.

Example of extracted context, with the instance replaced by X:

X pomaceous fruit apple tree, species Malus domestica rose family widely known members genus Malus used humans. X grow small, deciduous trees. tree originated Central Asia, wild ancestora
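
The example context above suggests that stop words are dropped and the instance is replaced by a placeholder X. A minimal sketch under that assumption follows; the stop-word list, the window size, the function name `extract_cbis`, and the sample sentence are all made up for illustration, and only single-word instances are handled.

```python
import re

# Small illustrative stop-word list (an assumption, not OpenEval's list).
STOP_WORDS = {"a", "an", "and", "are", "as", "by", "in", "is", "it", "of", "on", "the", "to"}

def extract_cbis(text, instance, window=5):
    """Return bags of context words around each mention of a single-word instance."""
    tokens = re.findall(r"[a-z]+", text.lower())
    target = instance.lower()
    # Drop stop words and replace the instance itself with the placeholder X.
    content = ["X" if t == target else t for t in tokens if t not in STOP_WORDS]
    cbis = []
    for i, token in enumerate(content):
        if token == "X":
            start = max(0, i - window)
            cbis.append(content[start:i + window + 1])
    return cbis

snippet = "Shrimp is a good source of protein and it is low in calories."
print(extract_cbis(snippet, "shrimp", window=3))
```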

Learning

OpenEval extracts CBIs for each predicate.

[Figure: CBIs extracted for healthyFood and unHealthyFood under the Food category, marked as positive (+) and negative (-) examples]

OpenEval trains an SVM for each predicate using training CBIs.

What does OpenEval learn?

[Figure: healthyFood(apple) and healthyFood(apple) with the keyword “vitamin”]

Learn how to map instances to an appropriate predicate (i.e., sense) that they belong to.

Learning

[Figure: positive (+) and negative (-) CBIs for healthyFood, unHealthyFood, ...]

•  Choose predicate with maximum entropy.
•  Choose a keyword for the selected predicate.
•  Extract CBIs for the predicate using the selected keyword.
•  Re-train an SVM for the predicate.
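
The first step of this loop, choosing the predicate with maximum entropy, might look like the sketch below. The slides do not say which distribution the entropy is taken over; here it is assumed, purely for illustration, to be the binary entropy of each classifier's accuracy on its CBIs. The accuracy values and function names are hypothetical.

```python
import math

def binary_entropy(p):
    """Entropy (in bits) of a Bernoulli distribution with parameter p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def choose_predicate(accuracy_by_predicate):
    """Pick the predicate whose classifier is most uncertain."""
    return max(accuracy_by_predicate,
               key=lambda name: binary_entropy(accuracy_by_predicate[name]))

# Hypothetical per-predicate classification accuracies.
accuracies = {"healthyFood": 0.95, "unHealthyFood": 0.70, "academicfield": 0.85}
print(choose_predicate(accuracies))
```

Under this assumption the learner spends its next round of keyword selection, CBI extraction, and re-training on the predicate whose classifier is currently least certain.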

Predicate Instance Evaluator

healthyFood(shrimp)?

Given the input time, which CBIs should be extracted?

keywords:
Vitamin 0.88
Calories 0.83
Grow 0.69
Tree 0.66
Amount 0.59
Minerals 0.49
...
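
One simple way to answer the question "given the input time, which CBIs should be extracted?" is to spend the time on the highest-scoring keywords first. The sketch below assumes a fixed cost per query and a greedy choice; both are assumptions for illustration, not OpenEval's documented strategy.

```python
def select_keywords(keyword_scores, time_budget, cost_per_query=1.0):
    """Greedily take the highest-scoring keywords that fit in the time budget."""
    ranked = sorted(keyword_scores.items(), key=lambda item: item[1], reverse=True)
    affordable = max(0, int(time_budget // cost_per_query))
    return [keyword for keyword, _ in ranked[:affordable]]

# Keyword scores as listed on the slide.
scores = {"Vitamin": 0.88, "Calories": 0.83, "Grow": 0.69,
          "Tree": 0.66, "Amount": 0.59, "Minerals": 0.49}
print(select_keywords(scores, time_budget=3))  # ['Vitamin', 'Calories', 'Grow']
```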


OpenEval in the last iteration:

•  academicfield 0.8976357986206526 Environmental Anthropology. Several excellent textbooks and readers in environmental anthropology have now appeared, establishing a basic survey of the field.

•  academicfield 0.912473775634353 Anesthesiology. The Department of Anesthesiology is committed to excellence in clinical service, education, research and faculty development.

•  worksfor 0.9845774661303888 (charles osgood, cbs). Charles Osgood, often referred to as CBS News' poet-in-residence, has been anchor of "CBS News Sunday Morning" since 1994.


Hiring Labelers:
•  Currently NELL can autonomously hire people (using Amazon’s Mechanical Turk)
•  A default number of instances is sampled (uniformly distributed) from each Category and each Relation
•  Can be used to estimate precision
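
A small sketch of the sampling and precision-estimation idea described above. The toy knowledge base, the function names, and the sample size are illustrative assumptions; no Mechanical Turk calls are shown.

```python
import random

def sample_instances(kb, per_predicate=3, seed=0):
    """Draw the same number of instances, uniformly at random, from each predicate."""
    rng = random.Random(seed)
    return {
        predicate: rng.sample(instances, min(per_predicate, len(instances)))
        for predicate, instances in kb.items()
    }

def estimate_precision(labels):
    """Fraction of sampled instances that labelers marked as correct."""
    return sum(labels) / len(labels) if labels else None

# Toy knowledge base (illustrative values only, including one wrong belief).
kb = {
    "company": ["Google", "Yahoo!", "Microsoft", "Pittsburgh"],
    "city": ["Mountain View", "Pittsburgh", "Bellevue"],
}
sample = sample_instances(kb, per_predicate=2)
print(sample)
print(estimate_precision([True, True, False, True]))  # 0.75
```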


Hiring Labelers:
•  Task is to validate Category and Relation instances
–  Category instances: Is Google a company? Is Mountain View a city?
–  Relation instances: Is Google headquartered in Mountain View? Does Tom Mitchell work for Carnegie Mellon?
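
Validation questions like these could be generated from simple templates. The sketch below is hypothetical: the relation names `headquarteredIn` and `worksFor` and the template table are assumptions chosen to reproduce the example questions, not NELL's actual ontology names.

```python
def category_question(entity, category):
    """'Google', 'company' -> 'Is Google a company?'"""
    article = "an" if category[:1].lower() in "aeiou" else "a"
    return f"Is {entity} {article} {category}?"

# Hypothetical relation-specific templates.
RELATION_TEMPLATES = {
    "headquarteredIn": "Is {0} headquartered in {1}?",
    "worksFor": "Does {0} work for {1}?",
}

def relation_question(relation, subject, obj):
    """Fill the template registered for a relation with its two arguments."""
    return RELATION_TEMPLATES[relation].format(subject, obj)

print(category_question("Google", "company"))
print(relation_question("worksFor", "Tom Mitchell", "Carnegie Mellon"))
```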


Hiring Labelers:
•  Research Questions:
–  Sampling Strategies/Adaptive Sampling
–  Quality of answers/turkers

NELL: Never-Ending Language Learner

NELL turned 4 on Jan 12! Congratulations NELL!!

NELL is grown enough for a new step
•  Knowledge on Demand
–  Ask NELL

Thank you very much Google Mountain View!

And thanks to Google, DARPA, NSF, CNPq for partial funding!
And thanks to Yahoo! for M45 computing
and thanks to Microsoft for fellowship to Edith Law
and thanks to Carnegie Mellon University
and thanks to Federal University of São Carlos
