Knowledge acquisition using automated techniques

Methods of Knowledge Extraction Deepti Aggarwal SIEL|SERL, IIIT-Hyderabad, India

Transcript of Knowledge acquisition using automated techniques

Page 1: Knowledge acquisition using automated techniques

Methods of

Knowledge Extraction

Deepti Aggarwal

SIEL|SERL, IIIT-Hyderabad, India

Page 2: Knowledge acquisition using automated techniques

Agenda

Introduction to the Web as a knowledge repository

Automated extraction techniques (Input sources, extracted structures, input pre-processing, extraction methods, output generation)

Issues with automated extraction

Page 3: Knowledge acquisition using automated techniques

What is knowledge?

Familiarity with someone or something, gained through experience

Includes facts, information, descriptions, skills

Page 4: Knowledge acquisition using automated techniques

Types of Knowledge

Explicit Knowledge

Always present explicitly in records

Objective facts having a definite answer

E.g., Hyderabad is the capital of A.P.

Implicit Knowledge

Not present explicitly for analysis

Cultural beliefs with subjective judgments

E.g., Hyderabad is the best city to live in India.

Page 5: Knowledge acquisition using automated techniques

How has the representation of knowledge changed over time?

From public library to global library

Page 6: Knowledge acquisition using automated techniques

How is knowledge represented over the Web?

Millions of documents, blogs, forums, and social networks scattered across the Web

Diverse topics, in different formats, from diverse people, in diverse languages, with different points of view

Page 7: Knowledge acquisition using automated techniques

Benefits of knowledge extraction over the Web

Question Answering systems

Search engines

Validating knowledge

Tracking a particular piece of information

Predicting market, polls etc.

Community advertisements

Explicit knowledge

Implicit knowledge

Page 8: Knowledge acquisition using automated techniques

Problems with knowledge acquisition over web

Abundance of data

Relevance of information

Personalized retrieval

Page 9: Knowledge acquisition using automated techniques

Possible approaches

Manual filtering

Automated techniques

Combination of both

Page 10: Knowledge acquisition using automated techniques

Automated Extraction

Page 11: Knowledge acquisition using automated techniques

Working of automated extraction systems

Input sources -> Extraction system -> Database of all facts, relations

Stages of the extraction system: defining output structures, input pre-processing, extraction methods, output processing

Page 12: Knowledge acquisition using automated techniques

Input sources

Types

Page 13: Knowledge acquisition using automated techniques

Input sources

web documents

news articles

blogs

social network activity (user profiles, posts, comments)

Sentence-level parsing is required.

Page 14: Knowledge acquisition using automated techniques

Defining the structures of output

Named Entities and their relations

Page 15: Knowledge acquisition using automated techniques

Output structures

Named entities

Named entity relations

Page 16: Knowledge acquisition using automated techniques

1. Named Entity: Definition

It is an atomic element in a body of text.

Types: person, organization, location etc.

Different named entities, when linked together, form a relation.

Page 17: Knowledge acquisition using automated techniques

1. Named Entity: An example

Sachin Tendulkar was born in Bombay.

Sachin Tendulkar: NE of type ‘Person’

Bombay: NE of type ‘Location’

Page 18: Knowledge acquisition using automated techniques

2. Named Entity Relationship: Structure

Subject - Relation - Object

Subject: NE of any type

Relation: verb, adjective, adverb

Object: NE of any type

Page 19: Knowledge acquisition using automated techniques

2. Named Entity Relationship: An Example

Sachin Tendulkar was born in Bombay

Subject: Sachin Tendulkar

Relation: was born in

Object: Bombay
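The Subject - Relation - Object structure can be held in a small record type. This is only a sketch; the class name `Relation` and its field names are illustrative assumptions, not part of the original slides.

```python
from typing import NamedTuple


class Relation(NamedTuple):
    """One extracted fact: two named entities joined by a relation (names are hypothetical)."""
    subject: str   # NE of any type
    relation: str  # verb, adjective or adverb phrase linking the two entities
    obj: str       # NE of any type


# The example sentence from the slide, stored as a triple.
fact = Relation(subject="Sachin Tendulkar", relation="was born in", obj="Bombay")
print(fact.subject, "|", fact.relation, "|", fact.obj)
```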

Page 20: Knowledge acquisition using automated techniques

Co-referencing

Sachin was born in Bombay. He is a ...

Sachin Tendulkar …. Mr. Tendulkar … Master Blaster ...
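For the alias case (Mr. Tendulkar, Master Blaster), a very rough sketch is a lookup table that rewrites each alias to one canonical name. The table `ALIASES` and the function `resolve_aliases` are hypothetical; pronouns such as "He" need real co-reference resolution and are not handled here.

```python
# Hypothetical alias table; a real system would learn or look up these links.
ALIASES = {
    "Mr. Tendulkar": "Sachin Tendulkar",
    "Master Blaster": "Sachin Tendulkar",
}


def resolve_aliases(text, aliases=ALIASES):
    """Replace every known alias with its canonical entity name."""
    # Longest aliases first, so a short alias never clobbers part of a longer one.
    for alias in sorted(aliases, key=len, reverse=True):
        text = text.replace(alias, aliases[alias])
    return text


print(resolve_aliases("Mr. Tendulkar was born in Bombay."))
```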

Page 21: Knowledge acquisition using automated techniques

Input pre-processing

Libraries

Page 22: Knowledge acquisition using automated techniques

NLP libraries:

Splitting each sentence into tokens (words, digits) using a Tokenizer

Recognizing language constructs (nouns, verbs, pronouns) using a Part-of-speech Tagger

Example: Sachin/NNP Tendulkar/NNP was/VBD born/VBN in/IN Bombay/NNP
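The two steps above can be imitated with a toy tokenizer and a lookup-based tagger. This is a sketch under strong assumptions: `TOY_LEXICON`, `tokenize` and `toy_pos_tag` are made-up names, and the "capitalized word is a proper noun" rule stands in for a trained tagger from an NLP library.

```python
import re

# Hypothetical mini-lexicon; real taggers are trained on annotated corpora.
TOY_LEXICON = {"was": "VBD", "born": "VBN", "in": "IN"}


def tokenize(sentence):
    """Split a sentence into word, digit and punctuation tokens."""
    return re.findall(r"[A-Za-z]+|\d+|[^\w\s]", sentence)


def toy_pos_tag(tokens):
    """Assign a tag to each token using the lexicon plus simple fallbacks."""
    tagged = []
    for tok in tokens:
        if tok.lower() in TOY_LEXICON:
            tag = TOY_LEXICON[tok.lower()]
        elif tok.isdigit():
            tag = "CD"      # cardinal number
        elif not tok.isalnum():
            tag = "PUNCT"   # placeholder label, not a Penn Treebank tag
        elif tok[0].isupper():
            tag = "NNP"     # assumption: capitalized word = proper noun
        else:
            tag = "NN"
        tagged.append((tok, tag))
    return tagged


tagged = toy_pos_tag(tokenize("Sachin Tendulkar was born in Bombay"))
print(" ".join(f"{tok}/{tag}" for tok, tag in tagged))
```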

Page 23: Knowledge acquisition using automated techniques

NLP libraries (contd.):

Linking the individual constituents of a sentence into a parse tree using a Parser

Identifying the types of named entities using a Named Entity Recognizer

Example: Sachin Tendulkar/PERSON was born in Bombay/LOCATION
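A minimal way to picture named entity recognition is a gazetteer lookup: a fixed list of known names and their types. The names `GAZETTEER` and `tag_entities` are hypothetical, and real recognizers rely on trained models rather than exact string matches.

```python
# Hypothetical gazetteer: known entity names mapped to their types.
GAZETTEER = {
    "Sachin Tendulkar": "PERSON",
    "Bombay": "LOCATION",
}


def tag_entities(sentence, gazetteer=GAZETTEER):
    """Return (name, type) pairs for gazetteer entries found, in sentence order."""
    found = []
    for name, etype in gazetteer.items():
        start = sentence.find(name)
        if start != -1:
            found.append((start, name, etype))
    return [(name, etype) for _, name, etype in sorted(found)]


print(tag_entities("Sachin Tendulkar was born in Bombay"))
```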

Page 24: Knowledge acquisition using automated techniques

NLP libraries (contd.):

Identify all co-references and replace them with the actual entity using a Co-reference Resolution tool

Identify the specific meaning of a word using Word Sense Disambiguation

External vocabularies: MindNet, DBpedia, WordNet

E.g., contextual meaning of ‘crane’: noun-bird, verb-lift/move
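The ‘crane’ example can be sketched as a sense lookup keyed on the word and its part of speech. The tiny sense table below only restates the slide's two senses; `TOY_SENSES`, `coarse_pos` and `disambiguate` are hypothetical names, and a real system would consult a vocabulary such as WordNet and use the surrounding context.

```python
# Toy sense inventory taken from the slide's example; not a real vocabulary.
TOY_SENSES = {
    ("crane", "noun"): "bird",
    ("crane", "verb"): "lift/move",
}


def coarse_pos(tag):
    """Collapse Penn-style tags (NN, NNP, VB, VBD, ...) to noun or verb."""
    if tag.startswith("NN"):
        return "noun"
    if tag.startswith("VB"):
        return "verb"
    return None


def disambiguate(word, tag):
    """Pick a sense from the word and its part-of-speech tag, or None if unknown."""
    return TOY_SENSES.get((word.lower(), coarse_pos(tag)))


print(disambiguate("crane", "NN"))
print(disambiguate("crane", "VB"))
```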

Page 25: Knowledge acquisition using automated techniques

Extraction methods

Page 26: Knowledge acquisition using automated techniques

Extracting relationships among NEs: Standard process

1. Identify named entities within a sentence.

2. Find the verb or adjective that connects the identified named entities.

3. Connect them together to form a relation.
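The three steps can be sketched end to end for a single sentence. Here the entities are assumed to be known already, and the text between them is taken as the connecting phrase, which is a simplification of "find the verb or adjective". The function name `extract_relation` is hypothetical.

```python
def extract_relation(sentence, entities):
    """Return (subject, connector, object) for the first two known entities found."""
    # Step 1: locate the named entities within the sentence.
    spans = sorted((sentence.find(e), e) for e in entities if e in sentence)
    if len(spans) < 2:
        return None
    (subj_start, subj), (obj_start, obj) = spans[0], spans[1]
    # Step 2: take the text between the two entities as the connecting phrase.
    connector = sentence[subj_start + len(subj):obj_start].strip()
    # Step 3: connect them together to form a relation.
    return (subj, connector, obj)


print(extract_relation("Sachin Tendulkar was born in Bombay.",
                       ["Bombay", "Sachin Tendulkar"]))
```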

Page 27: Knowledge acquisition using automated techniques

Extracting relationships among NEs: Required process

1. Identify part-of-speech constructs: noun, verb, adjective, etc.

2. Determine co-references, acronyms and abbreviations.

3. Connect them together to form a relationship.

Page 28: Knowledge acquisition using automated techniques

Extraction Methods

Natural Language Processing: rule-based.

Based on sentence structure

E.g., for the English language, a rule can be “noun-verb-noun”

Machine Learning: supervised and unsupervised learning.

Features are detected from the training data

E.g., to extract instances of some medical diseases, the system is trained on the symptoms of each given disease.
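A “noun-verb-noun” rule can be sketched as a pattern over part-of-speech tags. The sketch assumes the input is already tagged, and it folds prepositions (IN) into the verb group so that "was born in" stays together; that choice and the name `match_noun_verb_noun` are assumptions, not part of the slides.

```python
def match_noun_verb_noun(tagged):
    """Find noun-verb-noun triples in a list of (token, tag) pairs."""

    def coarse(tag):
        if tag.startswith("NN"):
            return "N"
        if tag.startswith("VB") or tag == "IN":  # assumption: preposition joins the verb group
            return "V"
        return "O"

    # Merge consecutive tokens of the same coarse class into one group.
    groups = []
    for tok, tag in tagged:
        cls = coarse(tag)
        if groups and groups[-1][0] == cls:
            groups[-1][1].append(tok)
        else:
            groups.append((cls, [tok]))

    # Slide a window of three groups and keep those matching N-V-N.
    triples = []
    for i in range(len(groups) - 2):
        a, b, c = groups[i], groups[i + 1], groups[i + 2]
        if (a[0], b[0], c[0]) == ("N", "V", "N"):
            triples.append((" ".join(a[1]), " ".join(b[1]), " ".join(c[1])))
    return triples


tagged = [("Sachin", "NNP"), ("Tendulkar", "NNP"), ("was", "VBD"),
          ("born", "VBN"), ("in", "IN"), ("Bombay", "NNP")]
print(match_noun_verb_noun(tagged))
```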

Page 29: Knowledge acquisition using automated techniques

Extraction Methods (contd.)

Other methods: vocabulary-based systems, context-based clustering.

E.g., maintaining a mapping file of all countries and their nationalities helps to determine the nationality of a person when the birthplace is known.

Hybrid:

NLP libraries pre-process the input data; a machine learning approach then extracts the relations, using an external vocabulary such as WordNet.
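The mapping-file idea from the vocabulary-based method can be sketched with two lookup tables. Both tables and the function `infer_nationality` are hypothetical, and the city-to-country table is an extra assumption beyond the country-to-nationality mapping the slide mentions. Birthplace is only a heuristic for nationality.

```python
# Hypothetical mapping files; a real system would load these from disk.
CITY_TO_COUNTRY = {"Bombay": "India", "Hyderabad": "India"}
COUNTRY_TO_NATIONALITY = {"India": "Indian"}


def infer_nationality(birthplace):
    """Guess a nationality from a birthplace, or return None if unmapped."""
    # Accept either a city (mapped to its country) or a country name directly.
    country = CITY_TO_COUNTRY.get(birthplace, birthplace)
    return COUNTRY_TO_NATIONALITY.get(country)


print(infer_nationality("Bombay"))
```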

Page 30: Knowledge acquisition using automated techniques

Output generation

Page 31: Knowledge acquisition using automated techniques

Types of output systems

1. Identify all mentions of named entities and their relations.

E.g., from a given corpus, extract all named entity relations.

2. Identify missing relations in a database.

E.g., given a database, extract the missing attributes of given entities from the corpus.

3. Link various entities within a database.

E.g., given a database, link two entities together with some relation extracted from the corpus.
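The second output type, filling in missing attributes, can be sketched as a merge of extracted triples into existing records. The dictionary layout and the name `fill_missing` are assumptions for illustration; existing values are never overwritten.

```python
def fill_missing(database, triples):
    """Fill empty attributes of known entities from (subject, relation, object) triples."""
    for subject, relation, obj in triples:
        record = database.get(subject)
        # Only touch entities already in the database, and only empty attributes.
        if record is not None and record.get(relation) is None:
            record[relation] = obj
    return database


# Hypothetical database with one missing attribute.
database = {"Sachin Tendulkar": {"born in": None, "plays for": "India"}}
triples = [("Sachin Tendulkar", "born in", "Bombay")]
print(fill_missing(database, triples))
```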

Page 33: Knowledge acquisition using automated techniques

Issues with automated extraction

Accuracy, running time, dependency

Page 34: Knowledge acquisition using automated techniques

Issue 1: Challenges of language structure

Co-reference resolution

Ambiguous, complex sentences

Abbreviations

Acronyms

Page 35: Knowledge acquisition using automated techniques

See an example…

“Tom called his father last night. They talked for an hour. He said he would be home the next day.”

Who does ‘He’ refer to: Tom or his father?

Page 36: Knowledge acquisition using automated techniques

“You see sir, I can talk English, I can walk English, I can laugh English, I can run English, because English is such a funny language.”

Amitabh in Namak Halal

Page 37: Knowledge acquisition using automated techniques

Issue 2: Accuracy

Named entity detection: 90%; relationship extraction: 50-70%.

Noise is introduced at each step.

E.g., disambiguating the word ‘crane’ with WordNet introduces contextual errors, which then decrease the accuracy of rule-based relationship extraction.
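How noise accumulates across steps can be illustrated with a back-of-the-envelope calculation. It assumes, purely for illustration, that the stages fail independently, so the end-to-end accuracy is the product of the per-stage accuracies. The stage values below are hypothetical and are not the figures from the slide.

```python
import math


def end_to_end_accuracy(stage_accuracies):
    """Product of per-stage accuracies, assuming the stages err independently."""
    return math.prod(stage_accuracies)


# Hypothetical per-stage accuracies for a three-step pipeline.
stages = [0.95, 0.90, 0.80]
print(round(end_to_end_accuracy(stages), 3))
```

Even with every stage at 80% or better, the combined figure under this assumption drops below 70%.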

Page 38: Knowledge acquisition using automated techniques

Issue 3: Efficiency

Feature detection steps are expensive.

They can require days of computation.

Page 39: Knowledge acquisition using automated techniques

Issue 4: Dependency

On external vocabulary sources, like Wikipedia, WordNet, MindNet, etc.

Maintaining and updating vocabulary sources is manual: costly and requires expertise.

Their limited size produces context-based noise.

Domain-dependent: medical domain

Corpus-dependent: Wikipedia, news corpus

Relation-specific: Date and Place-of-event

Page 40: Knowledge acquisition using automated techniques

Issue 5: Problem with Implicit knowledge extraction

Community knowledge is learned and shared.

No one can be an expert.

Cultural competence and the perception of workers are fed into a system as variables.

Cultural Consensus Theory provides models to include such variables in the system.

Page 41: Knowledge acquisition using automated techniques

Can we do better?

Can we seek human intelligence to improve the accuracy of automated techniques?

Page 44: Knowledge acquisition using automated techniques

Thank you

Questions?