Lidia Pivovarova

Lidia Pivovarova, PhD student, lecturer, researcher, Saint-Petersburg State University


And that is all about her.

Transcript of Lidia Pivovarova

Page 1: Lidia Pivovarova

Lidia Pivovarova: PhD student, lecturer, researcher

Saint-Petersburg State University

Page 2: Lidia Pivovarova

My supervisor

V. Sh. Rubashkin, Doctor of Technical Sciences, professor, PhD in Philosophy

Page 3: Lidia Pivovarova

Our goals (in general)

• Natural Language Understanding (NLU)

• Conceptual Modeling

• The title of my future PhD thesis:

Ontology-based Information Extraction for newspaper texts

Page 4: Lidia Pivovarova

Ontology & Ontoeditor

• We are developing a universal ontology

• This means:
– A top level
– A general model applicable to all domains
– Several domains developed in depth

Page 5: Lidia Pivovarova

Conceptual model

• Common approach:
– Hierarchy of objects

Page 6: Lidia Pivovarova

Our approach: attribute tree

Rubashkin V. Sh. Representation and Analysis of Meaning in Intelligent Information Systems. Moscow: Nauka, Main Editorial Board for Physical and Mathematical Literature, 1989. 192 pp. (Problems of Artificial Intelligence). ISBN 5-02-01-4213-1 [in Russian]

Page 7: Lidia Pivovarova

InTez ontology

• An attribute tree: objects alternate with attributes

• A small fragment:

TRANSPORT
  o BY ENERGY SOURCE
      ELECTRIC TRANSPORT
      ATOMIC TRANSPORT
      FUEL TRANSPORT
      WIND-DRIVEN TRANSPORT
  o BY ENVIRONMENT TYPE
      AIR TRANSPORT
      WATER TRANSPORT
      LAND TRANSPORT
      SPACE TRANSPORT
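The alternation of objects and attributes can be sketched as a small data structure. This is only an illustration of the idea on the TRANSPORT fragment; the `Node` class and its `add` and `dump` helpers are assumptions made for this example and are not part of InTez.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    """A node of the attribute tree; `kind` is "object" or "attribute"."""
    name: str
    kind: str
    children: list[Node] = field(default_factory=list)

    def add(self, name: str) -> Node:
        # A child always gets the opposite kind, so the levels alternate.
        child_kind = "attribute" if self.kind == "object" else "object"
        child = Node(name, child_kind)
        self.children.append(child)
        return child


transport = Node("TRANSPORT", "object")
by_energy = transport.add("BY ENERGY SOURCE")
for name in ("ELECTRIC TRANSPORT", "ATOMIC TRANSPORT",
             "FUEL TRANSPORT", "WIND-DRIVEN TRANSPORT"):
    by_energy.add(name)
by_environment = transport.add("BY ENVIRONMENT TYPE")
for name in ("AIR TRANSPORT", "WATER TRANSPORT",
             "LAND TRANSPORT", "SPACE TRANSPORT"):
    by_environment.add(name)


def dump(node: Node, depth: int = 0) -> list[str]:
    """Return the tree as indented lines, one node per line."""
    lines = ["  " * depth + f"{node.name} [{node.kind}]"]
    for child in node.children:
        lines.extend(dump(child, depth + 1))
    return lines


print("\n".join(dump(transport)))
```

Each object class is reached through the attribute that distinguishes it, which is what lets one concept be classified along several independent dimensions.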

Page 8: Lidia Pivovarova

Attribute tree: the most natural way to represent different links between concepts

• value <-> attribute: *great color vs. great volume

• attribute <-> object class: SOLID -> SHAPE vs. *LIQUID -> SHAPE

• extension relations: incompatibility, intersection, inclusion

Page 9: Lidia Pivovarova

e.g. HUMAN

AGE
– CHILD
– ADULT
– AGED

SEX
– MALE
– FEMALE

Formal definitions:
Girl = child & female
Boy = child & male

BOY and GIRL: incompatibility

BOY and STUDENT: intersection
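A minimal sketch of how extension relations could follow from such formal definitions. It assumes each concept is defined by at most one value per attribute; the OCCUPATION attribute and the `relation` function are hypothetical and exist only for this example.

```python
# Concept -> {attribute: value}. AGE and SEX come from the slide;
# OCCUPATION is a hypothetical attribute added so STUDENT can be defined.
DEFINITIONS = {
    "CHILD": {"AGE": "child"},
    "GIRL": {"AGE": "child", "SEX": "female"},
    "BOY": {"AGE": "child", "SEX": "male"},
    "STUDENT": {"OCCUPATION": "student"},
}


def relation(a: str, b: str) -> str:
    """Classify the extension relation between two defined concepts."""
    da, db = DEFINITIONS[a], DEFINITIONS[b]
    # Two different values of the same attribute cannot hold together.
    if any(attr in db and db[attr] != value for attr, value in da.items()):
        return "incompatibility"
    # If one definition contains the other, the more specific concept
    # is included in the more general one.
    pairs_a, pairs_b = set(da.items()), set(db.items())
    if pairs_a >= pairs_b or pairs_b >= pairs_a:
        return "inclusion"
    # Otherwise nothing prevents or forces overlap.
    return "intersection"


print(relation("BOY", "GIRL"))
print(relation("BOY", "CHILD"))
print(relation("BOY", "STUDENT"))
```

The point of the sketch is that none of the three relations has to be stored explicitly: they can be derived from the positions of the concepts in the attribute tree.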

Page 10: Lidia Pivovarova

«Associative» relations

- unified

• PART -> WHOLE

• OBJECT -> LOCALIZATION

• OBJECT -> FUNCTION

- specialized

• COUNTRY –> CAPITAL

• ORGANIZATION –> CHIEF

Page 11: Lidia Pivovarova

Lexicon

- an internal part of the working ontology

CONCEPTS, LEXICAL UNITS

Lexical units:

• words or collocations

• standard terms (names of ontology concepts) or “synonyms”

Page 12: Lidia Pivovarova

Functionality

An ontology is a model of a terminology system.

From the technological point of view, the ontology is a library of program functions (*.dll).

The functions have the form F(1)(D), F(2)(D1, D2), where D, D1, D2 are ontology concepts.
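To make the one-place and two-place function shapes concrete, here is a hedged sketch. The slides do not say which functions the library exports, so `generalization`, `is_subconcept`, and the `PARENT` table below are invented for illustration and should not be read as the InTez API.

```python
from typing import Optional

# Hypothetical child -> parent links; not real InTez data.
PARENT = {
    "ELECTRIC TRANSPORT": "TRANSPORT",
    "AIR TRANSPORT": "TRANSPORT",
    "TRANSPORT": None,
}


def generalization(d: str) -> Optional[str]:
    """A one-place function F(D): here, the direct parent of concept D."""
    return PARENT[d]


def is_subconcept(d1: str, d2: str) -> bool:
    """A two-place function F(D1, D2): here, 'D1 is D2 or lies below D2'."""
    current: Optional[str] = d1
    while current is not None:
        if current == d2:
            return True
        current = PARENT[current]
    return False


print(generalization("AIR TRANSPORT"))
print(is_subconcept("AIR TRANSPORT", "TRANSPORT"))
```

A client program would then call such functions on concepts instead of reading the ontology data directly.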

Page 13: Lidia Pivovarova

Ontoeditor InTez

Developed by: V.Sh. Rubashkin and B.U. Chuprin

http://inttez.ru/ - in Russian, sorry

Page 14: Lidia Pivovarova

Technological point of view

The ontology is a library of program functions (*.dll).

The functions have the form F(1)(D), F(2)(D1, D2), where D, D1, D2 are ontology concepts.

Page 15: Lidia Pivovarova

Information Extraction

Valery Sh. Rubashkin, Boris Chuprin, Lidia Pivovarova, Anton Babanov, Olga Usmanova

We are developing an intelligent environment that supports the domain expert's activity and can adapt to the features of the texts. The environment has to minimize the expert's effort, not replace the expert.

Page 16: Lidia Pivovarova

General System Description

[Diagram. Components and labels: The Ontology; TEXTS; Lemmatization, part-of-speech tagging, semantic mark-up; Morph. analyzer; Semantic analyzer; Situation State; Search Patterns]

Page 17: Lidia Pivovarova

The Factors

Factors – the required information aspects.

~ 100 factors

Factors:

- qualitative, e.g. social tension, investment attractiveness, level of sovereignty, human rights activity

- quantitative, e.g. the number of unemployed, the average salary, the inflation level, the amount of imports

Page 18: Lidia Pivovarova

Numerical values

Qualitative factors:

very small, small, less than average, average, more than average, large, very large.

Quantitative factors:

the number + <unit>

e. g.

an average salary -> monetary unit (ruble, $, …)

the number of unemployed -> no units

Page 19: Lidia Pivovarova

The Patterns

Qualitative factors -> “factor + numerical value” patterns.

e. g. Social tension <-- spontaneous meeting (large)

Quantitative factors -> “only factor” patterns.

e. g. The number of unemployed <-- become unemployed

Search algorithm

1) find a pattern

2) find a number + unit

if none is found:

3) find words such as large, small, increase, decrease, etc.
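The three search steps can be sketched as follows. This is a simplification under stated assumptions: a pattern is matched as a plain lowercase substring (the real patterns are sets of words and ontology concepts), and the `extract` function, its return format, and the sample sentences are invented for the example.

```python
import re

QUALITATIVE_WORDS = ("large", "small", "increase", "decrease")
NUMBER = r"\d+(?:[.,]\d+)?"


def extract(sentence, pattern, units=()):
    """Return (kind, value, unit) for a factor pattern, or None if absent."""
    text = sentence.lower()
    # Step 1: the pattern itself must occur in the sentence.
    if pattern.lower() not in text:
        return None
    # Step 2: look for a number; if the factor has units, require one.
    if units:
        alternatives = "|".join(re.escape(u.lower()) for u in units)
        match = re.search(rf"({NUMBER})\s*({alternatives})\b", text)
        if match:
            return ("number", match.group(1), match.group(2))
    else:
        match = re.search(NUMBER, text)
        if match:
            return ("number", match.group(0), None)
    # Step 3: fall back to qualitative words.
    for word in QUALITATIVE_WORDS:
        if re.search(rf"\b{word}", text):
            return ("qualitative", word, None)
    return ("unknown", None, None)


print(extract("This month 1200 workers will become unemployed",
              "become unemployed"))
print(extract("The average salary rose to 30000 rubles",
              "average salary", units=("rubles",)))
print(extract("A large spontaneous meeting took place",
              "spontaneous meeting"))
```

The fallback order matters: an explicit number is preferred, and qualitative words are consulted only when no number with a suitable unit is found.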

Page 20: Lidia Pivovarova

Pattern Formation Process

A pattern is a set of words and ontology concepts.

Ontology provides:

- pattern generalization

- synonym accumulation

- information about units

Pattern formation: the user marks a relevant fragment in a text or chooses a concept from the ontology.

Page 21: Lidia Pivovarova

Example

As is known, the European Union firmly demanded that Lithuania close both generating units of the Ignalina nuclear power plant. It also promised to transfer 3 billion euros for this purpose.

Factors:

EU pressure on Lithuania.

EU financial aid to Lithuania.

Page 22: Lidia Pivovarova

Discussion

• I am not sure that such things as level of sovereignty can be found in newspapers

• We had very few examples, so we weren't able to test it

• I think that it is necessary to collaborate with experts (sociologists) to address such a task

• Russian language: very few resources

Page 23: Lidia Pivovarova

Ontology Learning

V. Bocharov, L. Pivovarova, V. Rubashkin, B. Chuprin. Ontological Parsing of Encyclopedia Information. In: Computational Linguistics and Intelligent Text Processing, 11th International Conference, CICLing 2010, Iasi, Romania, March 21-27, 2010, Proceedings. Lecture Notes in Computer Science. Springer Berlin / Heidelberg, 2010, pp. 564-579

~ 2500 concepts

~ 1000 words and collocations

Should include ~ 100000 concepts

Page 24: Lidia Pivovarova

Ontology learning: our approach

- reuse of traditional lexicographical information

A. M. Prohorov (ed.). Russian Encyclopedic Dictionary, Moscow (2001) [In Russian]

- without toponyms and proper names:
26,375 entries
21,782 different terms

Page 25: Lidia Pivovarova

Basic hypothesis

Usually, the hypernym of a dictionary term is the first nominative-case noun of its definition (the “basic word”).

Page 26: Lidia Pivovarova

Basic hypothesis

ПЕРИСТИЛЬ – прямоугольный двор, сад, площадь, окруженные с 4 сторон крытой колоннадой.

PERISTYLE – a colonnade surrounding a building or court.

ЯТАГАН – рубяще-колющее оружие (среднее между саблей и кинжалом) у народов Ближнего и Среднего Востока (известно с 16 в.).

YATAGHAN - a long knife or short saber that lacks a guard for the hand at the juncture of blade and hilt and that usually has a double curve to the edge and a nearly straight back.

Page 27: Lidia Pivovarova

General framework

Dictionary entry (text + labels + abbreviations)

-> Lexicographical processing -> Dictionary entry (text only)

-> Morphology and syntax -> Dependency tree

-> Relation extraction -> Relation (term <–> basic word)

-> Import to ontology

Page 28: Lidia Pivovarova

Lexicographical processing

• term recognition

• replacement of abbreviations by full forms of words

• removal of labels

• elimination of bracketed text

Page 29: Lidia Pivovarova

Lexicographical processing

на Сев. Кавказе (in the N. Caucasus) -> на Северном Кавказе (in the North Caucasus)

в 18 в. (in the 18th c.) -> в 18 веке (in the 18th century)
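Two of the lexicographical steps, bracketed-text elimination and abbreviation expansion, can be sketched with plain string operations. The phrase-level `ABBREVIATIONS` table and the `clean_entry` function are assumptions for this example; because Russian full forms depend on case, a real expansion step would need morphological agreement rather than a fixed lookup. Term recognition and label removal are not covered here.

```python
import re

# Hypothetical phrase-level expansions taken from the slide examples.
ABBREVIATIONS = {
    "на Сев. Кавказе": "на Северном Кавказе",
    "в 18 в.": "в 18 веке",
}


def clean_entry(text: str) -> str:
    """Drop bracketed text, then expand the known abbreviations."""
    # Remove "( ... )" together with the whitespace in front of it.
    text = re.sub(r"\s*\([^()]*\)", "", text)
    for short, full in ABBREVIATIONS.items():
        text = text.replace(short, full)
    return text


yataghan = ("рубяще-колющее оружие (среднее между саблей и кинжалом) "
            "у народов Ближнего и Среднего Востока (известно с 16 в.).")
print(clean_entry(yataghan))
# An invented fragment, used only to show abbreviation expansion.
print(clean_entry("распространён на Сев. Кавказе в 18 в."))
```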

Page 30: Lidia Pivovarova

Morphology and syntax

• Simple context-free grammar (noun groups only) – Tomita formalism

• AOT tool to compile grammar (immediate constituent structure)

• Dependency tree

Page 31: Lidia Pivovarova

Morphology and syntax

[ANP] -> [ADJ] [NP root] : $0.grm := case_number_gender($1.grm, $2.type_grm, $2.grm);

[GP] -> [NP root] [NP grm="рд"];

[PP] -> [PREP root] [NP];

[NP] -> [NOUN];

[NP] -> [NP root] [PP];

[NP] -> [PP] | [GP] | [ANP];

Page 32: Lidia Pivovarova

Morphology and syntax: example

Халат - верхняя одежда у некоторых азиатских народов.

Oriental robe – an outer garment of some Asian peoples.

Page 33: Lidia Pivovarova

Immediate constituent structure

[Tree diagram over the phrase ВЕРХНЯЯ ОДЕЖДА У НЕКОТОРЫХ АЗИАТСКИХ НАРОДОВ; node labels: ANP, ANP, ANP, PP, NP]

Page 34: Lidia Pivovarova

Dependency tree

[Tree diagram over the phrase ВЕРХНЯЯ ОДЕЖДА У НЕКОТОРЫХ АЗИАТСКИХ НАРОДОВ; node labels: ANP, ANP, ANP, PP, NP]

Page 35: Lidia Pivovarova

Disambiguation

                                                                      Before syntax   After syntax
Average number of lemmas for one word form                            1.27            1.06
Average number of morphological analysis outputs for one word form    2.26            1.64

Page 36: Lidia Pivovarova

Disambiguation

• о чукотском море (about the Chukchi Sea)

• море:

– мор (pestilence), prepositional case, singular, masculine gender;

– море (sea), prepositional case, singular, neuter gender;

– мора (mora), prepositional case, singular, feminine gender

• чукотском (Chukchi): adjective in the prepositional case, masculine or neuter gender

• мора has to be rejected: it is feminine and does not agree with the adjective
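The agreement check from this example can be sketched as a simple filter. The reading dictionaries and the `agree` function are made up for illustration; the actual system works on the output of the morphological analyzer and the grammar, not on hand-written lists.

```python
# Morphological readings of the word form "море" (prepositional, singular).
NOUN_READINGS = [
    {"lemma": "мор", "gloss": "pestilence", "gender": "masc"},
    {"lemma": "море", "gloss": "sea", "gender": "neut"},
    {"lemma": "мора", "gloss": "mora", "gender": "fem"},
]

# Genders allowed by the adjective form "чукотском".
ADJECTIVE_GENDERS = {"masc", "neut"}


def agree(noun_readings, adjective_genders):
    """Keep only the noun readings whose gender the adjective allows."""
    return [r for r in noun_readings if r["gender"] in adjective_genders]


kept = agree(NOUN_READINGS, ADJECTIVE_GENDERS)
print([r["lemma"] for r in kept])
```

Syntax reduces the ambiguity here but does not remove it: the masculine and neuter readings both survive.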

Page 37: Lidia Pivovarova

Relation recognition

Relation description                       Notation
GENERALIZATION (IS-A), the default value   Gen
INSTANCE (reverse to Gen)                  Spec
IDENTITY                                   Same
PART                                       Part
WHOLE (reverse to Part)                    Whole
FUNCTION                                   Func
OTHER                                      Other

Page 38: Lidia Pivovarova

Logical-linguistic rules

• a specific rule is attached to a certain word

• describes, first, the type of relation indicated by this word

• and, second, an instruction either to save this word as a basic word or to reject it and move on to the next basic word candidate

Page 39: Lidia Pivovarova

Examples of GENERALIZATION relation rules

Basic word: род, вид, сорт, тип… (kind, sort, type, class, etc.)

Example: ПИДЖИНЫ – тип языков, используемых как средство межэтнического общения в среде разноязычного населения.

PIDGINS – a type of language used for communication between people who speak different languages.

Rule:
1. Save default type of relation (<Gen>)
2. Save next noun as a basic word

Result:
ПИДЖИН язык GEN
PIDGIN language GEN

Page 40: Lidia Pivovarova

Examples of GENERALIZATION relation rules

Basic word: жанр (genre)

Example: МИСТЕРИЯ – жанр средневекового западноевропейского религиозного театра.

MYSTERY – a genre of the religious medieval theatre.

Rule:
1. Save word as a basic word with default relation type
2. Save default type of relation (<Gen>)
3. Save the next noun as a basic word context.

Result:
МИСТЕРИЯ жанр GEN
МИСТЕРИЯ театр GEN

MYSTERY genre GEN
MYSTERY theatre GEN

Page 41: Lidia Pivovarova

Main types of rules

1.
– save the first basic word;
– change the type of relation;
– save the next basic word.

2.
– reject the first basic word;
– change the type of relation;
– save the next basic word.

Page 42: Lidia Pivovarova

“Complicated” rules

Basic word: инструмент, прибор, аппарат... (instrument, tool, device, etc.)

Example: ФЕН – электрический аппарат для сушки волос.

HAIRDRYER – an electric device for hair drying.

Rule:
Save the word, then move to the next preposition.
If it is для (for):
- change relation type to <Func>
- save next noun

Result:
ФЕН аппарат GEN
ФЕН сушка FUNC

HAIRDRYER device GEN
HAIRDRYER drying FUNC
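The rule examples from the last few slides can be sketched as a small table-driven procedure. Several simplifications are assumed: definitions arrive as already lemmatized (lemma, part of speech) pairs, the first noun stands in for the first nominative-case noun, and the `RULES` encoding and the `extract_relations` function are invented for this example rather than taken from the system.

```python
# basic word -> (keep the word itself?, relation for the next noun,
#                preposition that must precede that noun, or None)
RULES = {
    "тип": (False, "GEN", None),      # reject, take the next noun
    "жанр": (True, "GEN", None),      # keep, also take the next noun
    "аппарат": (True, "FUNC", "для"), # keep, noun after "для" is a function
}


def next_index(tokens, start, pos):
    """Index of the first token with the given part of speech, or None."""
    for i in range(start, len(tokens)):
        if tokens[i][1] == pos:
            return i
    return None


def extract_relations(tokens):
    """Return (basic word, relation) pairs for one definition."""
    first = next_index(tokens, 0, "NOUN")
    if first is None:
        return []
    word = tokens[first][0]
    if word not in RULES:
        return [(word, "GEN")]  # basic hypothesis: first noun is the hypernym
    keep, relation, preposition = RULES[word]
    result = [(word, "GEN")] if keep else []
    start = first + 1
    if preposition is not None:
        prep = next_index(tokens, start, "PREP")
        if prep is None or tokens[prep][0] != preposition:
            return result
        start = prep + 1
    following = next_index(tokens, start, "NOUN")
    if following is not None:
        result.append((tokens[following][0], relation))
    return result


hairdryer = [("электрический", "ADJ"), ("аппарат", "NOUN"), ("для", "PREP"),
             ("сушка", "NOUN"), ("волос", "NOUN")]
pidgin = [("тип", "NOUN"), ("язык", "NOUN")]
mystery = [("жанр", "NOUN"), ("средневековый", "ADJ"),
           ("западноевропейский", "ADJ"), ("религиозный", "ADJ"),
           ("театр", "NOUN")]
for definition in (hairdryer, pidgin, mystery):
    print(extract_relations(definition))
```

Keeping the rules in a table keyed by the basic word mirrors the statement that each rule is attached to a particular word, so new words can be added without touching the procedure.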

Page 43: Lidia Pivovarova

OTHER relation

АБОРТ – прерывание беременности в сроки до 28 недель (то есть до момента, когда возможно рождение жизнеспособного плода).

ABORTION – the termination of a pregnancy after, accompanied by, resulting in, or closely followed by the death of the embryo or fetus.

Page 44: Lidia Pivovarova

OTHER relation

ХОМИНГ – способность животного возвращаться со значительного расстояния на свой участок обитания, к гнезду, логову и т. д.

HOMING – the ability of an animal to come back from a considerable distance to its home range, nest, lair, etc.

Page 45: Lidia Pivovarova

OTHER relation

- features: характеристика (characteristic), признак (attribute), свойство (property), число (number), показатель (index), степень (degree), количество (quantity), характер (character), масса (mass), состояние (condition), способность (ability), место (place), источник (source)

- transformations: переход (transition), извлечение (extraction), превращение (transformation), введение (introduction), выделение (emission), возникновение (origination), нарушение (deviation), прерывание (termination), развитие (evolution), образование (formation), увеличение (increase), уменьшение (decrease)

Page 46: Lidia Pivovarova

The last slide about rules

• 18 rules

• 91 basic words

• 8484 dictionary entries where rules are used

• 4679 different basic words

• 1978 basic terms

Page 47: Lidia Pivovarova

Most frequent basic words

1   УСТРОЙСТВО    DEVICE       332
2   МИНЕРАЛ       MINERAL      322
3   ЕДИНИЦА       UNIT         293
4   ПРИБОР        INSTRUMENT   292
5   ВЕЩЕСТВО      SUBSTANCE    277
6   ПРОЦЕСС       PROCESS      243
7   ИНСТРУМЕНТ    TOOL         235
8   ЭЛЕМЕНТ       ELEMENT      228
9   ЗАБОЛЕВАНИЕ   DISEASE      210
10  НАУКА         DISCIPLINE   199
11  СОЕДИНЕНИЕ    COMPOUND     184
12  БОЛЕЗНЬ       ILLNESS      174
13  ПОРОДА        BREED        170
14  ОРГАН         ORGAN        168

Page 48: Lidia Pivovarova

Evaluation

• Expert evaluation, 200 entries

• For about 90% of entries (179 of 200), the results obtained by the expert and by our software are identical.

• 21 dictionary entries are processed incorrectly by the program:
– 16 of the 21 errors can be eliminated by minor modifications
– in 5, a basic word is missing from the definition text

Page 49: Lidia Pivovarova

Inconvenient entries

• АБРАЗИВНЫЙ ИНСТРУМЕНТ – служит для механической обработки (шлифование, притирка и другие).

• ABRASIVE TOOL – is designed for mechanical processing (grinding, reseating, etc.).

• АБИТУРИЕНТ – оканчивающий среднее учебное заведение.

• COLLEGE APPLICANT – a person graduating from high school.

Page 50: Lidia Pivovarova

Import to ontology

Page 51: Lidia Pivovarova

Manual process

• choosing a basic word in the ontology taxonomy (attribute tree)

• forming a subset of dictionary entries

• adding subset terms to the ontology

• postediting

Page 52: Lidia Pivovarova

Wikipedia

– Article design … varies
• where is «the first sentence of definition»?

– Topics … are peculiar
• computer games ~ 2000 articles

– Articles without definitions
• «List of FTP server return codes»
• «March 25 is the 84th day of the year…»

Page 53: Lidia Pivovarova

Wikipedia: preliminary results

• Expert evaluation, 500 entries

• For 82% of entries (410 of 500), the results obtained by the expert and by our software are identical

• 40% of the errors (36 of 90 entries) - irregularities in the article texts

Page 54: Lidia Pivovarova

Wikipedia vs. Encyclopedia

Basic word                   Wikipedia   Encyclopedia
род (kind)                   3084        58
вид (sort)                   2526        384
образование (formation)      2215        114
название (name)              2129        594
персонаж (character)         1809        8
семейство (family)           1644        7
растение (plant)             1388        146
птица (bird)                 1319        5
единица (unit)               1316        286
система (system)             1239        391
район (region)               1182        9
группа (group)               1077        224
организация (organization)   1005        50

Page 55: Lidia Pivovarova

Wikipedia vs. Encyclopedia

• Device

Encyclopedia – 331, Wikipedia – 672

For example: A Stargate is a portal device within the Stargate fictional universe that allows practical, rapid travel between two distant locations

• Science

Encyclopedia – 196, Wikipedia – 338

For example: A vampirology

Page 56: Lidia Pivovarova

Future work

1. An improvement of ontoeditor

2. An expansion of syntax

3. An expansion of rules

4. Collocation extraction techniques

5. Better evaluation

6. Studies of dictionary structures

Page 57: Lidia Pivovarova

What else about me?

• Teaching: Information Retrieval, Information Systems

• Supervising: Lena Bilyk & Lena Sergeeva, “Citation Extraction from Newspaper Texts”

• Co-organizing:
– Natural Language Processing seminar
– Russian Summer School of Information Retrieval

Page 58: Lidia Pivovarova

Thank you!