Standards, Use and Prospects for Language Resource Management

Standards, Use and Prospects for Language Resource Management Key-Sun Choi 16 Aug. 2008 TII, Moscow


Page 1: Standards, Use and Prospects for Language Resource Management

Standards, Use and Prospects for Language Resource Management

Key-Sun Choi, 16 Aug. 2008, TII, Moscow

Page 2: Standards, Use and Prospects for Language Resource Management

MOTIVATION

Page 3: Standards, Use and Prospects for Language Resource Management

Wikipedia

• Web-based collaborative authoring multi-lingual encyclopedia
  – 8.29 M pages / 253 languages (2007/9)
  – 2.0 M pages / English (2007/9) ~ now 5.0 M pages

[Figure: Category Classification. The category page "Computer science" has the subcategories "Algorithms", "Databases", and "Computer scientists"; pages shown below them include Martin Kay, Robert Watson, Parallel database, SQL, and Divide & Conquer.]

Page 4: Standards, Use and Prospects for Language Resource Management

Problem: IS-A Relation Extraction from Wikipedia

• Relation Classification from Category System
  – By Term Formation Rule, Wikipedia Structure (Ponzetto & Strube, 2007)

[Figure: Relation Classification. The upper-lower level category relations from "Computer science" to "Algorithms", "Databases", and "Computer scientists" are each classified as an IS-A relation or a Not IS-A relation; the edge labels shown are IS-A, IS-A, Not IS-A.]

Page 5: Standards, Use and Prospects for Language Resource Management

Relation Extraction by Pattern

• (Ryu & Choi, 2007)
  – http://cseight.kaist.ac.kr:8080/RelExt

[Figure: "Text mode" linked to "Computer display mode" by an IS-A relation]

Page 6: Standards, Use and Prospects for Language Resource Management

Problem: IS-A Relation Extraction from Wiktionary

• Web-based Collaborative Multilingual Dictionary
  – 617,639 entries / 401 languages
• ISA relation extraction from Definition Pattern
  – http://cseight.kaist.ac.kr:8080/Wiktionary
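
The slide does not spell out the definition patterns, so the following is only a minimal sketch of the idea, assuming dictionary-style definitions such as "A device for ...". The function name, the stop-word list, and the "last word before the first stop word is the genus" heuristic are all hypothetical and are not taken from the system at the URL above.

```python
import re

# Hypothetical boundary words that usually end the genus phrase of a definition.
STOP = r"\b(?:that|which|who|for|of|used|with|in|to|on)\b"

def isa_from_definition(headword, definition):
    """Return (headword, 'IS-A', genus) or None if no genus noun is found."""
    # Drop a leading article: "A device ..." -> "device ..."
    text = re.sub(r"^(?:an?|the)\s+", "", definition.strip(), flags=re.IGNORECASE)
    # Keep only the chunk before the first boundary word.
    chunk = re.split(STOP, text, maxsplit=1, flags=re.IGNORECASE)[0]
    words = re.findall(r"[A-Za-z][A-Za-z\-]*", chunk)
    if not words:
        return None
    # Crude head choice: the last word of the chunk is treated as the genus.
    return (headword, "IS-A", words[-1].lower())

print(isa_from_definition("camera", "A device for taking still or moving pictures."))
```

A definition that starts with a boundary word (for example a verb sense beginning with "To ...") yields no chunk and therefore no IS-A candidate. Definitions of the form "A kind of ..." would need an extra rule.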


Page 7: Standards, Use and Prospects for Language Resource Management

Problem: IS-A Relation Extraction from WordNet

• Semantic Word Net (English)
  – 117,798 nouns, 82,115 synsets (Ver. 3.0)
  – ISA relation extraction through ISA between Synsets
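
As a rough illustration of deriving word-level IS-A pairs from synset-level IS-A links, here is a small sketch. It uses a hand-written table built from the slide's example rather than the real WordNet database, and the table layout and function name are assumptions.

```python
# Toy synset-level IS-A table from the slide's example. Synsets are written as
# comma-separated member words instead of real WordNet identifiers (assumption).
HYPERNYM = {
    "chemical engineering": "engineering, applied science",
    "computer science, computing": "engineering, applied science",
    "electrical engineering": "engineering, applied science",
}

def isa_pairs(hypernym):
    """Expand synset-level IS-A links into word-level (hyponym, hypernym) pairs."""
    pairs = set()
    for child, parent in hypernym.items():
        for child_word in child.split(", "):
            for parent_word in parent.split(", "):
                pairs.add((child_word, parent_word))
    return pairs

for pair in sorted(isa_pairs(HYPERNYM)):
    print(pair)
```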

[Figure: synsets (labeled Synset #12, #22, #23, #33). The synsets "chemical engineering", "computer science, computing", and "electrical engineering" are each linked by IS-A to the synset "engineering, applied science".]

Page 8: Standards, Use and Prospects for Language Resource Management

LMF: Lexical Markup Framework

Page 9: Standards, Use and Prospects for Language Resource Management

Wikipedia: IS-A Annotation

• IS-A (Entry, Term in Page)
• IS-A (Term in Page, Term in Page)
• Synonymy (Entry, Term in Page)

Page 10: Standards, Use and Prospects for Language Resource Management

What is common representation?

• Graph Structure

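[Figure: annotation formats A to F mapped through a common PIVOT representation, and a graph-structured annotation of "New York-based" with node labels cat: NP, cat: PUNC type: hyphen, cat: VBG, cat: JJ, cat: NNP, cat: ADJP and edges labeled role: alt]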

Page 11: Standards, Use and Prospects for Language Resource Management

Linguistic Annotation Framework

• ISO-GrAF: Graph Structure-based Annotation
  – GrAF XML schema type hierarchy
    • graphElementType; Attributes: ID, type
    • edgeType extends graphElementType
    • nodeType extends graphElementType
    • spanType extends nodeType; Attributes: start, end
    • graphElementSetType
    • edgeSetType extends graphElementSetType
    • nodeSetType extends graphElementSetType
    • featureStructureType
    • featureType
    • annotationSetType
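
The type hierarchy above can be mirrored loosely with Python dataclasses. This is a sketch, not the GrAF schema itself: the `features` dict stands in for featureStructureType, the `source` and `target` fields on `Edge` are assumptions, and the category labels and character offsets for "New York-based" are illustrative.

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class GraphElement:                      # graphElementType: ID, type
    id: str
    type: str = ""
    features: Dict[str, str] = field(default_factory=dict)  # assumption

@dataclass
class Node(GraphElement):                # nodeType extends graphElementType
    pass

@dataclass
class Span(Node):                        # spanType extends nodeType: start, end
    start: int = 0
    end: int = 0

@dataclass
class Edge(GraphElement):                # edgeType extends graphElementType
    source: str = ""                     # assumption: ID of the parent node
    target: str = ""                     # assumption: ID of the child node

text = "New York-based"
spans = [
    Span(id="s1", type="token", features={"cat": "NNP"}, start=0, end=8),
    Span(id="s2", type="token", features={"cat": "PUNC", "type": "hyphen"}, start=8, end=9),
    Span(id="s3", type="token", features={"cat": "VBG"}, start=9, end=14),
]
adjp = Node(id="n1", type="phrase", features={"cat": "ADJP"})
edges = [Edge(id=f"e{i}", type="child", source=adjp.id, target=s.id)
         for i, s in enumerate(spans, start=1)]

for s in spans:
    print(s.id, repr(text[s.start:s.end]), s.features)
```

Because spans only carry offsets into the primary text, the annotation stays stand-off: the text itself is never modified.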

Page 12: Standards, Use and Prospects for Language Resource Management

Problem: Causality between Terms

• Causal relation between terms
• Term clustering based on inter-term causality
  – Terms with similar causality tend to be similar concepts.
  – Realization & Evaluation

[ Skin cancer ] usually appears in adulthood, but it is caused by [ sun exposure ] and [ sunburns ] that began in childhood.

[Figure: causal graph over the terms IL-2, TG, Egr-1, IFN-gamma, Stat5, and Interleukin-2]

Page 13: Standards, Use and Prospects for Language Resource Management

Is it true? Terms with similar causality tend to be similar concepts.

The oral bacteria that cause gum disease appear to be the culprit. Cigarette smoking and use of smokeless tobacco products may also cause gum disease. Gum disease is the second most common cause of toothache.

[Figure: causal graph linking "cigarette", "oral bacteria", and "smokeless tobacco product" to "gum disease", and "gum disease" to "toothache"]

Periodontal disease can lead to toothache. Cigarette smoking is the number one environmental risk for periodontal disease.

[Figure: causal graph linking "cigarette" to "periodontal disease", and "periodontal disease" to "toothache"]

Page 14: Standards, Use and Prospects for Language Resource Management

What to do

• Is it true?
  – Terms with similar causality tend to be similar concepts.
• We test term clustering based on causal information
  – Show that causality is an effective feature for term clustering.
• Focus on
  – Causal NP pair extraction (Chang and Choi, 2004)
  – Causal term pair extraction
  – Term clustering based on causal similarity
  – Term clustering evaluation
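
The slides do not define the similarity measure, so the following sketch simply assumes Jaccard overlap between the causes and effects of two terms. The causal pairs are read off the gum disease and periodontal disease example; the function names are hypothetical.

```python
# (cause, effect) term pairs taken from the gum disease example.
CAUSAL_PAIRS = [
    ("oral bacteria", "gum disease"),
    ("cigarette", "gum disease"),
    ("smokeless tobacco product", "gum disease"),
    ("gum disease", "toothache"),
    ("cigarette", "periodontal disease"),
    ("periodontal disease", "toothache"),
]

def causal_profile(term, pairs):
    """Set of causal features of a term: what causes it and what it causes."""
    profile = set()
    for cause, effect in pairs:
        if effect == term:
            profile.add(("caused_by", cause))
        if cause == term:
            profile.add(("causes", effect))
    return profile

def causal_similarity(a, b, pairs):
    """Jaccard overlap of two causal profiles (an assumed measure)."""
    pa, pb = causal_profile(a, pairs), causal_profile(b, pairs)
    union = pa | pb
    return len(pa & pb) / len(union) if union else 0.0

print(causal_similarity("gum disease", "periodontal disease", CAUSAL_PAIRS))
print(causal_similarity("gum disease", "toothache", CAUSAL_PAIRS))
```

Under this measure the two disease terms share a cause and an effect, while "gum disease" and "toothache" share nothing, which is the intuition the slide is testing.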

Page 15: Standards, Use and Prospects for Language Resource Management

Features on term clustering (1/3)

• Useful features for term clustering
  – Internal feature
    • Word lexicon/structure in terms
    • (Bourigault and Jacquemin, 1999): POS sequences including insertion
      – NPDNInsAj = Noun1 ((Adv? Adj){0-3} Prep Det? (Adv? Adj){0-3}) Noun3
      – 93~98% precision
  – Outer-term feature
    • Structural modifier/modifiee of term
    • Some words near the term
    • (Maynard et al., 2000)
      – Hand-made semantic frame information

Page 16: Standards, Use and Prospects for Language Resource Management

Feature Structure Representation

(1) Employee
• {<SEX, female>, <NAME, Sandy Jones>, <AGE, 30>}
(2) Sound segment /p/
• {<CONSONANTAL, +>, <ANTERIOR, +>, <VOICED, ->, <CONTINUANT, ->}
(3) Grammatical features of the verb ‘love’
• {<POS, verb>, <VALENCE, transitive>, <SEMANTIC_RELATION, loving>}
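
One simple way to hold such attribute-value sets in code is a dictionary. The sketch below assumes flat (non-nested) feature structures and a hypothetical helper name; it rejects an attribute that appears with two different values, since a feature structure maps each attribute to one value.

```python
def feature_structure(pairs):
    """Build a flat feature structure (attribute -> value) from <attr, value> pairs."""
    fs = {}
    for attr, value in pairs:
        if attr in fs and fs[attr] != value:
            raise ValueError(f"conflicting values for {attr}: {fs[attr]!r} vs {value!r}")
        fs[attr] = value
    return fs

employee = feature_structure([("SEX", "female"), ("NAME", "Sandy Jones"), ("AGE", 30)])
segment_p = feature_structure([("CONSONANTAL", "+"), ("ANTERIOR", "+"),
                               ("VOICED", "-"), ("CONTINUANT", "-")])
verb_love = feature_structure([("POS", "verb"), ("VALENCE", "transitive"),
                               ("SEMANTIC_RELATION", "loving")])

print(employee["NAME"], segment_p["VOICED"], verb_love["VALENCE"])
```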

Page 17: Standards, Use and Prospects for Language Resource Management

FSR: Graph vs. Matrix Notation


Page 18: Standards, Use and Prospects for Language Resource Management

Related Works on term clustering (3/3)

• Discussion
  – Causal information is one kind of “long-distance contextual information”

Cigarette smoking and use of smokeless tobacco products may also cause gum disease.

[Figure: dependency path linking "Smokeless tobacco product" through "use" and "cause" to "Gum disease"]

Page 19: Standards, Use and Prospects for Language Resource Management

Event & ternary extraction

Skin cancer usually appears in adulthood, but it is caused by sun exposure and sunburns that began in childhood.

Processing steps: NP chunking, reference finding, cue phrase filtering, verb selection.

[Figure: Dependency Structure of the sentence, with verb nodes "appears", "caused", and "began", function words "usually", "in", "but", "is", "by", and "and", pronouns "it" and "that", and NP nodes "Skin cancer", "adulthood", "Sun exposure", "Sunburns", and "child"]

Causal event pair candidate: <cause event, cue phrase, effect event>
• Skin cancer – RNP caused by CNP – sun exposure
• Skin cancer – RNP caused by CNP – sunburns
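
The cue-phrase step can be illustrated with a small sketch. It replaces the NP chunker and dependency parser of the slide with a single regular expression, takes the resolved antecedent of "it" as an input, and emits candidates in the order shown in the example (RNP, cue phrase, CNP). All names are hypothetical, and only the "RNP caused by CNP" cue is covered.

```python
import re

# One cue phrase only: "<RNP> is caused by <CNP>". Real systems use many cues.
CUE = re.compile(r"(?P<rnp>.+?)\s+is\s+caused\s+by\s+(?P<cnp>.+)", re.IGNORECASE)

def causal_candidates(clause, antecedents=None):
    """Return (result NP, cue phrase, cause NP) candidates for one clause."""
    antecedents = antecedents or {}
    match = CUE.search(clause)
    if not match:
        return []
    result_np = match.group("rnp").strip()
    # Reference finding is assumed to be done elsewhere; we only look it up.
    result_np = antecedents.get(result_np.lower(), result_np)
    # Split a coordinated cause NP: "sun exposure and sunburns".
    causes = [c.strip() for c in re.split(r"\s+and\s+|,\s*", match.group("cnp")) if c.strip()]
    return [(result_np, "RNP caused by CNP", cause) for cause in causes]

for triple in causal_candidates("it is caused by sun exposure and sunburns",
                                {"it": "Skin cancer"}):
    print(triple)
```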

Page 20: Standards, Use and Prospects for Language Resource Management

Representation Scheme

• Morpho-syntactic Annotation Framework
• Syntactic Annotation Framework

Page 21: Standards, Use and Prospects for Language Resource Management

Morpho-Syntactic Annotation Framework: MAF

<token id="t1">to</token>
<token id="t2">eventually</token>
<token id="t3">decide</token>
<wordForm lemma="to_decide" tokens="t1 t3"/>
<wordForm lemma="eventually" tokens="t2"/>
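
To show how a word form can point at non-adjacent tokens, the sketch below parses the fragment with Python's standard `xml.etree.ElementTree`. The enclosing `<maf>` root element and the function name are assumptions added only so that the fragment parses as a single document.

```python
import xml.etree.ElementTree as ET

# The <maf> wrapper is hypothetical; the slide shows only the inner elements.
MAF_FRAGMENT = """<maf>
  <token id="t1">to</token>
  <token id="t2">eventually</token>
  <token id="t3">decide</token>
  <wordForm lemma="to_decide" tokens="t1 t3"/>
  <wordForm lemma="eventually" tokens="t2"/>
</maf>"""

def resolve_word_forms(xml_text):
    """Map each wordForm lemma to the surface tokens it refers to."""
    root = ET.fromstring(xml_text)
    tokens = {t.get("id"): t.text for t in root.iter("token")}
    return [(wf.get("lemma"), [tokens[i] for i in wf.get("tokens").split()])
            for wf in root.iter("wordForm")]

for lemma, surface in resolve_word_forms(MAF_FRAGMENT):
    print(lemma, surface)
```

The lemma "to_decide" resolves to the discontinuous tokens "to" and "decide", with "eventually" left as a separate word form.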

Page 22: Standards, Use and Prospects for Language Resource Management

MAF: token

<token id="t1">The</token>
<token id="t2">victim</token>
<token id="t3">’s</token>
<token id="t4">friends</token>
<token id="t5">told</token>
<token id="t6">police</token>
<token id="t7">that</token>
<token id="t8">Krueger</token>
<token id="t9">drove</token>
<token id="t10">into</token>
<token id="t11">the</token>
<token id="t12">quarry</token>
<token id="t13">and</token>
<token id="t14">never</token>
<token id="t15">surfaced</token>
<token id="t16">.</token>

Page 23: Standards, Use and Prospects for Language Resource Management

Syntactic Annotation Framework

Page 24: Standards, Use and Prospects for Language Resource Management

Semantic Annotation Framework: TimeML

• no more than 60 days
  – <TIMEX3 tid="t1" type="DURATION" value="P60D" mod="EQUAL_OR_LESS"> no more than 60 days </TIMEX3>
• the dawn of 2000
  – <TIMEX3 tid="t2" type="DATE" value="2000" mod="START"> the dawn of 2000 </TIMEX3>
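
A short sketch of reading these annotations: it parses one TIMEX3 element with the standard library and interprets a day-count duration value such as "P60D" (ISO 8601 style). The function names are hypothetical, and only the `P<n>D` form of duration is handled.

```python
import re
import xml.etree.ElementTree as ET
from datetime import timedelta

def parse_timex3(xml_text):
    """Return the attributes and covered text of a single TIMEX3 element."""
    el = ET.fromstring(xml_text)
    return {
        "tid": el.get("tid"),
        "type": el.get("type"),
        "value": el.get("value"),
        "mod": el.get("mod"),
        "text": (el.text or "").strip(),
    }

def duration_days(value):
    """Interpret values like 'P60D' as a timedelta; return None otherwise."""
    match = re.fullmatch(r"P(\d+)D", value)
    return timedelta(days=int(match.group(1))) if match else None

t1 = parse_timex3('<TIMEX3 tid="t1" type="DURATION" value="P60D" '
                  'mod="EQUAL_OR_LESS"> no more than 60 days </TIMEX3>')
print(t1["text"], t1["mod"], duration_days(t1["value"]))
```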

Page 25: Standards, Use and Prospects for Language Resource Management

ONTOLOGY EXTRACTION/LEARNING AND QUESTION-ANSWERING

Page 26: Standards, Use and Prospects for Language Resource Management


Page 27: Standards, Use and Prospects for Language Resource Management

Word Segmentation

Page 28: Standards, Use and Prospects for Language Resource Management

MULTILINGUAL INFORMATION FRAMEWORK

Page 29: Standards, Use and Prospects for Language Resource Management

IT Ontology

[Figure: IT Core Ontology. Node labels: software system, Embedded software, Embedded system, appliance, OS, Middleware, Embedded OS, platform, App. Program, Dev. Env., Comm. middleware, browser, Media player, DVD player, Set-top box, MP3 player, Digital Camera, vendor, Real-time Embed. OS, Non-real-time embed. OS, RTOS, VxWorks, pSOS, VRTX, WinCE, Microsoft, Wind River. Relation labels: consists_of, reside_on, vender.]

Page 30: Standards, Use and Prospects for Language Resource Management

A Scenario

Components: User, Control Server, Ontology Reasoner, Rule Reasoner

Message sequence:
• What is the best RTOS Vendor? Do you know?
• No
• What is RTOS?
• Real-time Operating System
• What are instances?
• VxWorks
• Vendor?
• Wind River .. Microsoft
• Which is better?

Page 31: Standards, Use and Prospects for Language Resource Management

Dialogue acts

Well-known examples of communicative functions (“core dialogue acts”):
• question
  – WH-question
  – YN-question
  – check/verification
• statement/inform
• answer (WH-answer, YN-answer)
• confirmation, disconfirmation
• request
• instruct
• promise
• acknowledgement
• greeting

Page 32: Standards, Use and Prospects for Language Resource Management

General-purpose functions

Applicable in any dimension are:
• Information-seeking functions: WH-question, YN-question, Alternatives-question, Check, ..
• Information-providing functions: Inform, WH-Answer, YN-Answer, Confirmation, Disconfirmation, Agreement, Correction, ..
• Commissive functions: Offer, Promise, AcceptRequest, ..
• Directive functions: Instruct, Request, Suggest, ..
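
For annotation tooling, the four classes above can be kept as a lookup table. This is a sketch that lists only the functions named on the slide (the lists there are open-ended); the variable names are hypothetical.

```python
# Only the functions named on the slide; each list is incomplete by design.
GENERAL_PURPOSE = {
    "information-seeking": ["WH-question", "YN-question", "Alternatives-question", "Check"],
    "information-providing": ["Inform", "WH-Answer", "YN-Answer", "Confirmation",
                              "Disconfirmation", "Agreement", "Correction"],
    "commissive": ["Offer", "Promise", "AcceptRequest"],
    "directive": ["Instruct", "Request", "Suggest"],
}

# Reverse index: communicative function -> class of general-purpose function.
FUNCTION_CLASS = {fn: cls for cls, fns in GENERAL_PURPOSE.items() for fn in fns}

print(FUNCTION_CLASS["Promise"])
print(FUNCTION_CLASS["Check"])
```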

Page 33: Standards, Use and Prospects for Language Resource Management

DiaML concrete syntax

Page 34: Standards, Use and Prospects for Language Resource Management

From sentence to ontologies

• Sentence: A camera is a device that takes video.
• Term recognition: A [camera] is a [device] that [take]s [video].
• Dependency analysis: camera, is, device, takes, that, video
• Triplets extraction
  – (camera, ISA, device)
  – (camera, hasPropertyOf, that AND (take video))
• Ontology: camera, device, artifact, video, contents, ···
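
The triplet step can be sketched for this one sentence shape. The pattern below stands in for the term recognizer and dependency analysis of the slide, handles only "A X is a Y that V-s Z", and strips a final "s" as a very crude lemmatizer; all of this is an assumption made for illustration.

```python
import re

# Matches only the shape "A X is a Y that V-s Z." (one-word X, Y, V, Z).
PATTERN = re.compile(
    r"^An?\s+(?P<x>\w+)\s+is\s+an?\s+(?P<y>\w+)\s+that\s+(?P<v>\w+?)s\s+(?P<z>\w+)\.?$",
    re.IGNORECASE,
)

def extract_triplets(sentence):
    """Return ISA and hasPropertyOf triplets for the supported sentence shape."""
    match = PATTERN.match(sentence.strip())
    if not match:
        return []
    x, y, v, z = (match.group(name).lower() for name in ("x", "y", "v", "z"))
    return [
        (x, "ISA", y),
        (x, "hasPropertyOf", f"that AND ({v} {z})"),
    ]

for triplet in extract_triplets("A camera is a device that takes video."):
    print(triplet)
```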

Page 35: Standards, Use and Prospects for Language Resource Management

Standards for language processing

• Primary resources (text, dialogues)
  – Structural mark-up
  – Basic annotations
  – [TEI, MPEG7, TMX (XHTML…), etc.]
• NLP structures (annotations)
  – POS tagging
  – Chunks (cf. Named Entities)
  – Deep syntactic structures
  – Co-references etc.
  – [Eagles/ISLE, CES, MATE, …]
• Knowledge structures
  – Hierarchies of types
  – Relations between concepts (subjects/topics etc.)
  – Links to primary resources
  – [Topic Maps, OIL, RDF]
• Lexical structures (language models)
  – Terminologies
  – Transfer lexica
  – LTAG/HPSG/LFG lexica
  – [TBX, OLIF, Eagles/ISLE (Genelex)]
• Links
• Meta-data [Dublin Core, OLAC, ISLE, MPEG7, RDF]
• Access protocols [Corba, SOAP]

Page 36: Standards, Use and Prospects for Language Resource Management

Context

• ISO TC37 - Terminology and other language resources
  – SC3 - Computer applications in terminology
    • ISO 12200 - Martif
      – Latest version of TEI Terminology chapter
    • ISO 12620 - Data categories
    • ISO CD (DIS: under ballot) 16642 - TMF (Terminological Markup Framework)
  – SC4 - Language resources

Page 37: Standards, Use and Prospects for Language Resource Management

TC37/SC4 details

• Scope: Platform for designing and implementing linguistic resource formats and processes
  – Multi-layer annotation of linguistic resources
  – Exchange of information between NLP modules
• General strategy
  – Involve a wide community from academia and industry
    • Identification of experts in the various work items
    • Involvement through national standardizing bodies
• Agenda
  – Current: identification of possible work items and working groups
  – Constituency meeting and technical workshop at LREC (May 2002)

Page 38: Standards, Use and Prospects for Language Resource Management

Organization

• Chair:
  – Laurent Romary, France
• Secretary:
  – Key-Sun Choi, Korea
• International Advisory Committee
  – Chair: Prof. Antonio Zampolli, Italy

Page 39: Standards, Use and Prospects for Language Resource Management

SC4 and other standardizing bodies

• W3C: basic protocols and formats
  – XML (Schemas), XPath, XPointer
  – + RDF, SVG, SMIL, SOAP
• MPEG: multimedia, XML based
  – e.g. MPEG7-4
  – Word and phone lattices
• ISO TC37/SC4: language resources, NLP perspective
  – e.g. linguistic annotations, lexical formats
• TEI: text representation
  – Reference for primary sources
  – e.g. text archives
• Oscar

Diagram labels: Text, Audio/Speech, Technical background

What about gestures?
• Kinetic in the TEI
• SMIL?

Page 40: Standards, Use and Prospects for Language Resource Management

TC37/SC4 Work Items

• WG1/WI-0: Terminology of Language Resources
• WG1/WI-1: Linguistic annotation framework
• WG1/WI-2: Meta-data for multimodal and multilingual information
• WG2/WI-3: Structural content representation scheme
• WG2/WI-4: Multimodal content representation scheme
• WG2/WI-5: Discourse level representation scheme

Page 41: Standards, Use and Prospects for Language Resource Management

TC37/SC4 Work Items - cont.

• WG3/WI-6a: Multilingual text representation
• WG4/WI-7: NLP Lexica
• WG5/WI-8: Net-based distributed cooperative work for the creation of LRs

Page 42: Standards, Use and Prospects for Language Resource Management

WI-0

• Terminology of Language Resources
  – Basic terminology of the various sub-fields of language resources and general methodology
  – Project leader: Klaus-Dirk Schmitz
  – Sources:
    • ISO 1087
    • LREC proceedings + KAIST
    • English dictionaries in Linguistics?
  – Support from GTW

Page 43: Standards, Use and Prospects for Language Resource Management

WI-1

• Linguistic annotation framework
  – Basic mechanisms and data structures for linguistic annotation and representation [data architecture]
    • Methods and principles for the design of an annotation scheme
    • Structural nodes and information units, Data category specification
    • Linking and pointing mechanisms, Feature Structures, Meta-Markup
    • « Stand-off » and « in-line » views - equivalences, combining levels.
    • Administrative data categories

Page 44: Standards, Use and Prospects for Language Resource Management

WI-1 - cont.

– Project leader: Nancy Ide (TBC)
– Contributors: Alan Melby, Koiti Hasida, Lee Gillam, Yves Savourel, Laurent Romary…
– Possible sources:
  • TMF, iso12620-revised, Mate (general methodology)
  • TEI (Linking mechanisms, feature structures)
  • Link with Linguistic DS

Page 45: Standards, Use and Prospects for Language Resource Management

WI-2

• Meta-data for multimodal and multilingual information
  – Description of a meta-data representation scheme to document linguistic information structures and processes
    • General content description
    • Local content description
  – Project leader: Peter Wittenburg, MPI (Nijmegen, NL)
  – Participants: Steven Bird, TEI aware person
  – Possible sources:
    • OLAC, Mile, TEI Header
  – Liaison: TC46 (SC9), MPEG7/MDS, SCORM

Page 46: Standards, Use and Prospects for Language Resource Management

WI-3

• Structural content representation scheme
  – Definition of annotation/representation scheme(s) for morpho-syntax and syntax, to be used for annotation and interchange purposes
    • Meta-model for morpho-syntactic annotation
    • Meta-model(s) for syntactic annotation (lexicalized grammar, elementary trees, dependency structures)
    • + corresponding Data category registries
  – Project leader: John Carroll ??
  – Participants: Nuria Bell
  – Possible sources:
    • Eagles, TAGML, Linguistic DS
    • SIGPARSE
    • Working group with representatives from existing TreeBanks initiatives

Page 47: Standards, Use and Prospects for Language Resource Management

WI-4

• Multimodal meaning representation scheme
  – Representation scheme for the semantic content of multimodal information (textual, spoken, graphical and gestural)
    • Meta-model for content representation (Events, participants, etc.)
    • Data category registry for multimodal content
  – Project leader: Harry Bunt (id=“1”)
  – Possible sources:
    • SIGSEM working group on semantic content
      – Chair: #1
  – « Liaison »
    • Semantic web activities

Page 48: Standards, Use and Prospects for Language Resource Management

WI-5

• Discourse level representation scheme
    • Meta-model for discourse and dialogue representation
    • Meta-model for discourse level annotation (e.g. reference annotation)
    • + corresponding DatCat registry
  – Possible sources:
    • SIGDIAL
    • DRI - Discourse Resource Initiative
    • Mate

Page 49: Standards, Use and Prospects for Language Resource Management

WI-6

• Multilingual text representation scheme
  – Framework for representing language specific and multi-lingual textual information
    • Translation Memory
    • Alignment – Parallel Corpora
    • Word count algorithms (characters, words, segments)
  – Possible sources:
    • TMX for translation memories
    • TEI based linking mechanism (or see WI-1) for Parallel texts

Page 50: Standards, Use and Prospects for Language Resource Management

• WI 6A
• Translation Memory, Alignment of parallel corpora
  – Sources:
    • OSCAR/TMX for translation memories
    • TEI based linking mechanism (or see WI-1) for Parallel texts

Page 51: Standards, Use and Prospects for Language Resource Management

• WI 6b
• Segmentation and counting algorithms (characters, words, sentences etc.)
  – Sources:
    • OSCAR

Page 52: Standards, Use and Prospects for Language Resource Management

• WI 6C
• Meta-markup for GIL (Globalization, Internationalization and Localization)
  – Possible sources:
    • OSCAR/OpenTag

Page 53: Standards, Use and Prospects for Language Resource Management

WI-7

• NLP lexica
  – Lexicon representation formats for the various types of NLP applications (Machine Readable Lexica)
    • Define a set of meta-models (classes of applications)
    • Specific data categories (derivation, phonology, etc.)
    • Based on the work done in other work items
  – Sources
    • Eagles/multext
    • ISLE Computational Working group/Genelex
    • OLIF

Page 54: Standards, Use and Prospects for Language Resource Management

WI-8

• Net-based distributed cooperative work for the creation of LRs
  – Principles and methods for designing collaborative and cooperative compilation of LRs
  – Define what is specific to LRs with regard to:
    • Traceability of resources, version control, validation, quality management
    • Protocols (Corba, SOAP), Workflow standards, Data management
  – Contacts: Christian Galinski, Remi Zajac
  – Sources

Page 55: Standards, Use and Prospects for Language Resource Management

Liaison - OSCAR

• Brief history of LR exchange standards
• Parallel events since 1997
  – Open Tag - meta-markup (XML vs. Others)
• Major current OSCAR activities
  – TMX - Translation Memory eXchange
  – Counting and segmentation algorithms
  – TBX (Terminologies) and OLIF (MT lexica)
  – XLIFF and CGS - Annotation of source code and localisation of web sites
  – xml:lang etc.: J. DeCamp and S.-E. Wright

Page 56: Standards, Use and Prospects for Language Resource Management

Liaison - TEI

– General architecture and data modeling
  • WI-1
– Annotations (paragraph level, external annotations)
  • WI-1
– TEI Header
  • WI-2
– NLP lexica
  • WI-7
– Feature structures
  • WI-1