Knowledge acquisition using automated techniques

Post on 22-Nov-2014


Methods of Knowledge Extraction

Deepti Aggarwal

SIEL|SERL, IIIT-Hyderabad, India

Agenda

Introduction to the Web as a knowledge repository

Automated extraction techniques (Input sources, extracted structures, input pre-processing, extraction methods, output generation)

Issues with automated extraction

What is knowledge?

A familiarity with someone or something, gained through experience

Includes facts, information, descriptions, skills

Types of Knowledge

Explicit Knowledge

Always present explicitly in records

Objective facts having a definite answer

E.g., Hyderabad is the capital of A.P.

Implicit Knowledge

Not present explicitly for analysis

Cultural beliefs with subjective judgments

E.g., Hyderabad is the best city to live in India.

How has knowledge been represented over time?

From public library to global library

How is knowledge represented over the Web?

Millions of documents, blogs, forums, and social networks scattered across the Web

Diverse topics, different formats, from diverse people in diverse languages, with different points of view

Benefits of knowledge extraction over the Web

Question Answering systems

Search engines

Validating knowledge

Tracking a particular piece of information

Predicting markets, polls, etc.

Community advertisements

Explicit knowledge

Implicit knowledge

Problems with knowledge acquisition over web

Abundance of data

Relevance of information

Personalized retrieval

Possible approaches

Manual filtering

Automated techniques

Combination of both

Automated Extraction

Working of automated extraction systems

[Figure: Input sources -> Extraction system -> Database of all facts, relations. Stages of the extraction system: defining output structures, input pre-processing, extraction methods, output processing.]

Input sources: Types

Web documents

news articles

blogs

social networks activities (user profiles, posts, comments)

Sentence-level parsing is required.

Defining the structures of output

Named Entities and their relations

Output structures

Named entities

Named entity relations

1. Named Entity: Definition

It is an atomic element in a body of text.

Types: person, organization, location etc.

Different named entities, when linked together, form a relation.

1. Named Entity: An example

Sachin Tendulkar was born in Bombay.

Sachin Tendulkar: NE of type ‘Person’; Bombay: NE of type ‘Location’

2. Named Entity Relationship: Structure

Subject - Relation - Object

Subject: NE of any type

Relation: Verb, Adjective, Adverb

Object: NE of any type

2. Named Entity Relationship: An Example

Sachin Tendulkar was born in Bombay

Subject: Sachin Tendulkar; Relation: was born in; Object: Bombay
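The Subject - Relation - Object structure can be written down as a small data type. The following is only a sketch: the class and field names (NamedEntity, Relation, ne_type) are illustrative assumptions and do not come from the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NamedEntity:
    text: str      # surface form, e.g. "Sachin Tendulkar"
    ne_type: str   # e.g. "Person", "Location"


@dataclass(frozen=True)
class Relation:
    subject: NamedEntity
    relation: str  # connecting verb, adjective or adverb phrase
    obj: NamedEntity

    def as_triple(self):
        # Flatten to a plain (subject, relation, object) tuple of strings.
        return (self.subject.text, self.relation, self.obj.text)


sachin = NamedEntity("Sachin Tendulkar", "Person")
bombay = NamedEntity("Bombay", "Location")
born_in = Relation(sachin, "was born in", bombay)
print(born_in.as_triple())
```

Keeping the entity type next to the text makes it possible to filter relations later, for example to keep only Person-Location pairs.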

Co-referencing

Sachin was born in Bombay. He is a ...

Sachin Tendulkar …. Mr. Tendulkar … Master Blaster ...
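A very rough way to handle name variants such as those above is to rewrite every known alias to one canonical name. This sketch assumes the alias list is already given (the function name normalize_mentions is hypothetical); it does not resolve pronouns like "He", which need a real co-reference resolver.

```python
import re


def normalize_mentions(text, canonical, aliases):
    """Rewrite every alias of an entity to its canonical name in one pass."""
    # Include the canonical form itself and try longer forms first, so that
    # "Sachin Tendulkar" is not partially rewritten via the alias "Sachin".
    forms = sorted(set(aliases) | {canonical}, key=len, reverse=True)
    alternation = "|".join(re.escape(form) for form in forms)
    # Only match whole words, not substrings of longer words.
    pattern = re.compile(r"(?<!\w)(?:" + alternation + r")(?!\w)")
    return pattern.sub(canonical, text)


text = "Sachin was born in Bombay. Mr. Tendulkar is also called Master Blaster."
aliases = ["Sachin", "Mr. Tendulkar", "Master Blaster"]
resolved = normalize_mentions(text, "Sachin Tendulkar", aliases)
print(resolved)
```

Doing the substitution in a single regular-expression pass avoids rewriting text that an earlier replacement has just produced.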

Input pre-processing

Libraries

NLP libraries:

Split each sentence into tokens (words, digits) using a Sentence Tokenizer

Recognize language constructs (nouns, verbs, pronouns) using a Part-of-speech Tagger

Example: Sachin/NNP Tendulkar/NNP was/VBD born/VBN in/IN Bombay/NNP
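The tags in this example appear to follow the Penn Treebank convention (NNP: proper noun, VBD: past-tense verb, VBN: past participle, IN: preposition). As a small sketch, tagger output in this word/TAG form can be read back into (token, tag) pairs; the helper name parse_tagged is an assumption, and no particular NLP library is implied.

```python
def parse_tagged(tagged_sentence):
    """Turn 'word/TAG word/TAG ...' into a list of (word, TAG) pairs."""
    pairs = []
    for item in tagged_sentence.split():
        # Split on the last slash so tokens containing '/' stay intact.
        token, _, tag = item.rpartition("/")
        pairs.append((token, tag))
    return pairs


tagged = parse_tagged("Sachin/NNP Tendulkar/NNP was/VBD born/VBN in/IN Bombay/NNP")
proper_nouns = [token for token, tag in tagged if tag == "NNP"]
print(tagged)
print(proper_nouns)
```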

NLP libraries (contd.):

Link the individual constituents of a sentence using a Parser to form a parse tree

Identify the types of named entities using a Named Entity Recognizer

Example: Sachin Tendulkar/PERSON was born in Bombay/LOCATION
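To illustrate how the tagged tokens lead to typed entities, here is a toy sketch that groups consecutive proper nouns and looks the result up in a small gazetteer. The gazetteer contents and function names are hypothetical; real Named Entity Recognizers usually rely on trained models, not a lookup table.

```python
# Hypothetical gazetteer mapping known names to entity types.
GAZETTEER = {"Sachin Tendulkar": "PERSON", "Bombay": "LOCATION"}


def chunk_proper_nouns(tagged):
    """Group runs of consecutive NNP tokens into candidate entity strings."""
    chunks, current = [], []
    for token, tag in tagged:
        if tag == "NNP":
            current.append(token)
        elif current:
            chunks.append(" ".join(current))
            current = []
    if current:  # flush a run that ends the sentence
        chunks.append(" ".join(current))
    return chunks


def label_entities(tagged, gazetteer=GAZETTEER):
    """Attach a type to each candidate, or UNKNOWN if it is not listed."""
    return [(c, gazetteer.get(c, "UNKNOWN")) for c in chunk_proper_nouns(tagged)]


tagged = [("Sachin", "NNP"), ("Tendulkar", "NNP"), ("was", "VBD"),
          ("born", "VBN"), ("in", "IN"), ("Bombay", "NNP")]
entities = label_entities(tagged)
print(entities)
```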

NLP libraries (contd.):

Identify all co-references and replace them with the actual entity using a Co-reference Resolution tool

Identify the specific meaning of a word using Word Sense Disambiguation

External vocabularies: MindNet, DBpedia, WordNet

E.g., contextual meaning of ‘crane’: noun (bird), verb (lift/move)
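The ‘crane’ example can be sketched as a lookup keyed by part of speech. The sense inventory below is a hand-written stand-in for an external vocabulary such as WordNet, and the names used are assumptions; real disambiguation also has to choose among several senses within the same part of speech.

```python
# Hypothetical, hand-written sense inventory (stand-in for an external vocabulary).
SENSE_INVENTORY = {
    "crane": {"noun": "bird", "verb": "lift/move"},
}


def coarse_pos(tag):
    """Map a Penn Treebank style tag to a coarse part of speech."""
    if tag.startswith("NN"):
        return "noun"
    if tag.startswith("VB"):
        return "verb"
    return None


def disambiguate(word, tag, inventory=SENSE_INVENTORY):
    """Return the sense for this word and tag, or None if it is not covered."""
    senses = inventory.get(word.lower(), {})
    return senses.get(coarse_pos(tag))


print(disambiguate("crane", "NN"))
print(disambiguate("crane", "VB"))
```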

Extraction methods

Extracting relationships among NEs: Standard process

1. Identify named entities within a sentence.

2. Find the verb or adjective that connects the identified named entities.

3. Connect them together to form a relation.
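The three steps above can be sketched over already-tagged tokens. This is a simplified illustration under stated assumptions: entities are approximated by runs of NNP tags, and the words between two neighbouring entities are taken as the relation if they contain a verb (VB*) or adjective (JJ*). The function name extract_relations is hypothetical.

```python
def extract_relations(tagged):
    """Return (subject, relation, object) triples from (token, tag) pairs."""
    # Step 1: identify named entities as spans of consecutive proper nouns.
    spans = []
    i = 0
    while i < len(tagged):
        if tagged[i][1] == "NNP":
            j = i
            while j < len(tagged) and tagged[j][1] == "NNP":
                j += 1
            spans.append((i, j, " ".join(tok for tok, _ in tagged[i:j])))
            i = j
        else:
            i += 1

    triples = []
    for (_, end1, subj), (start2, _, obj) in zip(spans, spans[1:]):
        # Step 2: look for a verb or adjective between the two entities.
        between = tagged[end1:start2]
        if any(tag.startswith(("VB", "JJ")) for _, tag in between):
            # Step 3: connect subject, connecting words and object.
            relation = " ".join(tok for tok, _ in between)
            triples.append((subj, relation, obj))
    return triples


tagged = [("Sachin", "NNP"), ("Tendulkar", "NNP"), ("was", "VBD"),
          ("born", "VBN"), ("in", "IN"), ("Bombay", "NNP")]
triples = extract_relations(tagged)
print(triples)
```

Pairing only neighbouring entities keeps the sketch short; sentences with more than two entities or nested clauses would need a parse tree.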

Extracting relationships among NEs: Required process

1. Identify part-of-speech constructs: noun, verb, adjective etc.

2. Determine co-references, acronyms and abbreviations.

3. Connect them together to form a relationship.

Extraction Methods

Natural Language Processing: rule based.

Based on sentence structure

E.g., for the English language, a rule can be “noun-verb-noun”

Machine Learning: supervised and unsupervised learning.

Features are detected from the training data

E.g., to extract instances of some medical diseases, the system is trained over all the symptoms of each given disease.
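As one possible reading of the supervised case, the sketch below trains a tiny Naive Bayes classifier on symptom lists and predicts a disease label. The slides do not name a learning algorithm, so Naive Bayes is an assumption, and the training data is a toy example invented purely for illustration.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """examples: list of (label, [features]). Returns counts used for scoring."""
    label_counts = Counter()
    feature_counts = defaultdict(Counter)
    vocab = set()
    for label, feats in examples:
        label_counts[label] += 1
        feature_counts[label].update(feats)
        vocab.update(feats)
    return label_counts, feature_counts, vocab


def classify(model, feats):
    """Pick the label with the highest log-probability (add-one smoothing)."""
    label_counts, feature_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        score = math.log(label_counts[label] / total)
        denom = sum(feature_counts[label].values()) + len(vocab)
        for f in feats:
            score += math.log((feature_counts[label][f] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy, hypothetical training data: (disease, observed symptoms).
examples = [
    ("flu", ["fever", "cough", "fatigue"]),
    ("flu", ["fever", "sore throat"]),
    ("migraine", ["headache", "nausea"]),
    ("migraine", ["headache", "light sensitivity"]),
]
model = train(examples)
print(classify(model, ["fever", "cough"]))
print(classify(model, ["headache"]))
```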

Extraction Methods (contd.)

Other methods: vocabulary-based systems, context-based clustering.

E.g., maintaining a mapping file of all countries and their nationalities helps to determine the nationality of a person when their birthplace is known.

Hybrid:

NLP libraries pre-process the input data, then a machine learning approach extracts the relations using an external vocabulary such as WordNet.
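The vocabulary-based example (a mapping of countries to nationalities) might look like the sketch below. Both lookup tables and the function name are hypothetical, and the city-to-country table is an extra assumption needed because a birthplace is often a city. Inferring nationality from birthplace is itself only a heuristic.

```python
# Hypothetical mapping file contents.
COUNTRY_TO_NATIONALITY = {"India": "Indian", "Japan": "Japanese", "France": "French"}
# Hypothetical second table, since birthplaces are usually cities.
CITY_TO_COUNTRY = {"Bombay": "India", "Hyderabad": "India"}


def infer_nationality(birth_place):
    """Guess a nationality from a birthplace, or return None if unknown."""
    country = CITY_TO_COUNTRY.get(birth_place, birth_place)
    return COUNTRY_TO_NATIONALITY.get(country)


print(infer_nationality("Bombay"))
print(infer_nationality("Japan"))
```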

Output generation

Types of output systems

1. Identify all mentions of named entities and their relations.

E.g., from a given corpus, extract all named entity relations.

2. Identify missing relations in a database.

E.g., given a database, extract the missing attributes of given entities from the corpus.

3. Link various entities within a database.

E.g., given a database, link two entities together with some relation extracted from the corpus.
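The second output type (filling in missing attributes) can be sketched as follows, assuming relations have already been extracted as (subject, relation, object) triples. The database layout, the relation-to-attribute mapping and the function name are all illustrative assumptions.

```python
def fill_missing(database, triples, relation_to_attribute):
    """Fill empty attributes of known entities from extracted triples (in place)."""
    for subject, relation, obj in triples:
        attribute = relation_to_attribute.get(relation)
        record = database.get(subject)
        # Only fill attributes that are mapped, known and currently empty.
        if attribute and record is not None and record.get(attribute) is None:
            record[attribute] = obj
    return database


database = {"Sachin Tendulkar": {"type": "Person", "birth_place": None}}
triples = [("Sachin Tendulkar", "was born in", "Bombay")]
relation_to_attribute = {"was born in": "birth_place"}

fill_missing(database, triples, relation_to_attribute)
print(database)
```

Existing values are never overwritten here, which keeps noisy extractions from replacing curated data.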


Issues with automated extraction

Accuracy, running time, dependency

Issue 1: Challenges of language structure

Co-reference resolution

Ambiguous, complex sentences

Abbreviations

Acronyms

See an example…

“Tom called his father last night. They talked for an hour. He said he would be home the next day.”

Who is ‘He’ referring to? Tom or his father?

“You see sir, I can talk English, I can walk English, I can laugh English, I can run English, because English is such a funny language.”

Amitabh in Namak Halal

Issue 2: Accuracy

Named entity detection: 90%, relationship extraction: 50-70%.

Introduction of noise at each step.

E.g., disambiguating the ambiguous word ‘crane’ with WordNet introduces contextual errors, which then decrease the accuracy of rule-based relationship extraction.
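A back-of-the-envelope sketch shows how noise at each step compounds. It assumes the steps fail independently, which is a simplification, and the 0.8 accuracy for the relation-linking step is a hypothetical figure chosen only for illustration; only the 90% entity figure comes from the slide.

```python
import math


def pipeline_accuracy(step_accuracies):
    """Overall accuracy if every step must succeed and errors are independent."""
    return math.prod(step_accuracies)


# A relation needs both of its named entities detected correctly.
both_entities = pipeline_accuracy([0.9, 0.9])
# Adding a hypothetical relation-linking step with 0.8 accuracy.
with_linking = pipeline_accuracy([0.9, 0.9, 0.8])

print(round(both_entities, 3))
print(round(with_linking, 3))
```

Under these assumptions two 90% entity detections already cap relation accuracy at roughly 81%, and each further step lowers it again.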

Issue 3: Efficiency

Feature detection steps are expensive.

They can require days of computation.

Issue 4: Dependency

Dependency on external vocabulary sources, like Wikipedia, WordNet, MindNet etc.

Maintenance and updating of vocabulary sources is manual: costly and requires expertise.

Limited size produces context-based noise.

Domain-dependent: medical domain

Corpus-dependent: Wikipedia, news corpus

Relation-specific: Date and Place-of-event

Issue 5: Problem with Implicit knowledge extraction

Community Knowledge is learned and shared

No one can be an expert.

Cultural competence and perception of workers are fed into a system as variables.

Cultural Consensus Theory provides models to include such variables into the system.

Can we do better?

Can we seek human intelligence to improve the accuracy of automated techniques?


Thank you

Questions?