Dutch HLT Resources: from BLARK to Priority Lists Helmer Strik, Diana Binnenpoorte, Janienke Sturm,...


Dutch HLT Resources: from BLARK to Priority Lists

Helmer Strik, Diana Binnenpoorte, Janienke Sturm, Folkert de Vriend, and Catia Cucchiarini*

A2RT, Dept. of Language and Speech, Nijmegen

* NTU, Dutch Language Union, The Hague

Walter Daelemans

Dept. of CNTS Language Technology, Antwerp

Dutch HLT Platform: NTU

NTU - Nederlandse Taalunie (Dutch Language Union)

Mission: Strengthening the position of the Dutch Language

Dutch HLT Platform

Aim: To contribute to the further development of an adequate language and speech technology infrastructure for Dutch

Dutch HLT Platform: Other participants

• Ministry of the Flemish Community
• Flemish Institute for the Promotion of Scientific-technological Research in Industry
• Fund for Scientific Research - Flanders
• Dutch Ministry of Education, Culture and Sciences
• Dutch Ministry of Economic Affairs
• Netherlands Organisation for Scientific Research
• Senter (an agency of the Dutch Ministry of Economic Affairs)

Dutch HLT Platform: Four action lines

A. Performing a market place function
B. Strengthening the HLT infrastructure
C. Working out standards and evaluation criteria
D. Developing a management, maintenance, and distribution plan

This presentation: Platform BC

A. -
B. Strengthening the HLT infrastructure
C. Working out standards and evaluation criteria
D. -

B+C => Platform BC

• Focus on method (skip many details)
• More details: see publications, web sites

Platform BC: What?

1. BLARK: Basic LAnguage Resources Kit
2. Inventory & Evaluation
3. Priority lists

Platform BC: Who?

Steering committee:
• 8 HLT experts
• NTU
• NWO (funding body)

4 field researchers

Platform BC: How?

1. BLARK
2. Inventory & Eval.
3. Priority lists
=> Report 1

Feedback:
• Dutch HLT Field
• Workshop 15/11/2001

1. BLARK
2. Inventory & Eval.
3. Priority lists
=> Report 2

1. BLARK: Basic LAnguage Resources Kit

Components:
• Applications: classes of applications rather than specific applications or products.
• Modules (or semi-products): the basic software components of HLT applications.
• Data: sets of language data and descriptions in machine readable form.

BLARK: Basic LAnguage Resources Kit

2 matrices:
1. Modules x Data
2. Modules x Applications

=> BLARK

Matrix: Modules (rows) x Data and Applications (columns)

Data columns: monolingual lexicon, multilingual lexicon, thesauri, annotated corpora, unannotated corpora, speech corpora, multilingual corpora, multimodal corpora, multimedia corpora

Application columns: CALL, access control, speech input, speech output, dialog systems, document production, info access, translation

Marks per module, left to right (empty cells not shown):

Language Technology
Grapheme-phoneme conv.                  ++ ++ + ++ ++ + +
Token detection                         ++ + ++ + + + + + +
Sent boundary detection                 + ++ ++ + ++ ++ + ++ ++ ++
Name recognition                        + + + ++ ++ ++ + ++ ++ + ++ ++ ++
Spelling correction                     +
Lemmatising                             ++ ++ + + + + + + + +
Morphological analysis                  ++ ++ + + + ++ + ++ ++ ++
Morphological synthesis                 ++ ++ + + ++ + ++ ++
Word sort disambig.                     ++ ++ + + ++ + ++ ++ ++ ++
Parsers and grammars                    ++ ++ + ++ ++ ++ ++ ++ ++
Shallow parsing                         ++ ++ ++ + ++ ++ ++ ++ ++ ++
Constituent recognition                 ++ ++ + + ++ ++ ++ ++ ++ ++
Semantic analysis                       ++ ++ ++ ++ ++ + ++ ++ ++ ++ ++
Referent resolution                     + ++ ++ + + ++ ++ ++ ++ ++
Word meaning disambig.                  + ++ ++ + + ++ + + + ++ ++
Pragmatic analysis                      + + ++ ++ ++ + ++ ++ ++ + ++
Text generation                         ++ ++ ++ ++ ++ + ++ ++ ++ ++
Lang. dep. translation                  ++ ++ ++ ++ + ++ ++

Speech Technology
Complete speech recog.                  ++ + ++ + ++ + ++ ++ ++ ++ ++ ++ ++ ++ ++
Acoustic models                         ++ + ++ + ++ + + + ++ + ++ ++ + + +
Language models                         + ++ + + + + + ++ + ++ ++ ++ ++ ++
Pronunciation lexicon                   ++ + + ++ + + + ++ + ++ + ++ + ++ ++
Robust speech recognition               + + + + + + ++ + + ++ ++ + + +
Non-native speech recog.                + ++ + ++ ++ + + ++ + + + + +
Speaker adaptation                      + + + ++ + + ++ + + ++ + + ++ +
Lexicon adaptation                      ++ + + ++ + + + ++ + ++ + ++ + ++ ++
Prosody recognition                     + + ++ + ++ + + + ++ + ++ ++ ++ ++ ++
Complete speech synth.                  ++ + + + + + ++ ++ + + ++
Allophone synthesis                     + + + + + + + + + +
Di-phone synthesis                      ++ + + + + + ++ ++ + + +
Unit selection                          ++ + + + + + ++ ++ + + +
Prosody prediction for Text-to-Speech   ++ + + + + + ++ ++ ++ + ++
Aut. phon. transcription                ++ ++ + + ++ + + + ++ + + + + + + +
Aut. phon. segmentation                 ++ ++ + + ++ + + + ++ + + + + + + +
Phoneme alignment                       + + + ++ + + + ++ + + + +
Distance calc. phonemes                 + + + ++ + + + ++ + + + +
Speaker identification                  + ++ ++ ++ + ++ + + ++ + + + +
Speaker verification                    + ++ ++ ++ + ++ + ++ + + + +
Speaker tracking                        + ++ ++ ++ + ++ + + + + +
Language identification                 + ++ + + ++ ++ + + + + + + + +
Dialect identification                  + ++ + + ++ ++ + + + + + + + +
Confidence measures                     + + + ++ + ++ + ++ ++ ++ ++ + + +
Utterance verification                  + + + ++ + + + + + ++ ++ + + +
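As a rough illustration of how a Modules x Applications matrix of this kind could be held in code, the sketch below stores each cell as an empty string, "+", or "++". The module names, the cell values, and the numeric weights are hypothetical placeholders chosen for the example; they are not the marks from the matrix, and the slides do not define a scoring scheme.

```python
# Hypothetical sketch: cell values and weights are placeholders,
# not the marks of the actual BLARK matrices.
MARK_WEIGHT = {"": 0, "+": 1, "++": 2}  # assumed weighting, not from the slides

modules_x_applications = {
    "Module A": {"CALL": "++", "dialog systems": "+", "translation": ""},
    "Module B": {"CALL": "", "dialog systems": "++", "translation": "++"},
}


def applications_served(matrix, module):
    """Return the applications that carry any mark for this module."""
    return [app for app, mark in matrix[module].items() if mark]


def importance_score(matrix, module):
    """Sum the assumed weights of all marks in the module's row."""
    return sum(MARK_WEIGHT[mark] for mark in matrix[module].values())


if __name__ == "__main__":
    for name in modules_x_applications:
        served = applications_served(modules_x_applications, name)
        score = importance_score(modules_x_applications, name)
        print(name, served, score)
```

A Modules x Data matrix can be stored the same way, with data types as the inner keys.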

BLARK: Language technology

• Modules
  • Robust modular text preprocessing
  • Morphological analysis and morphosyntactic disambiguation / unknown words
  • Robust syntactic analysis
  • Aspects of semantic analysis (word meaning and reference)

• Data
  • Monolingual lexicon
  • Annotated corpus of written Dutch
  • Benchmarks for evaluation

BLARK: Speech technology

• Modules
  • Automatic speech recognition
  • Speech synthesis system
  • Tools for annotation of speech corpora
  • Confidence measures and utterance verification
  • Identification (speaker, language, dialect)

• Data
  • Monolingual speech corpora for specific applications
  • Multilingual speech corpora
  • Multimodal/medial speech corpora
  • Benchmarks for evaluation

2. Inventory & Evaluation

B. Inventory: Which components in BLARK are available?

C. Evaluation: And of sufficient quality?

Checklist approach

=> B&C together: platform BC

See matrix 3 - Availability

Modules                                 Availability

Grapheme-phoneme conversion             8
Token detection                         9
Sentence boundary detection             3
Name recognition                        4
Spelling correction                     3
Lemmatising                             9
Morphological analysis                  7
Morphological synthesis                 9
Word sort disambiguation                7
Parsers and grammars                    3
Shallow parsing                         2
Constituent recognition                 5
Semantic analysis                       3
Referent resolution                     2
Word meaning disambiguation             2
Pragmatic analysis                      1
Text generation                         3
Language dependent translation          3
Complete speech recognition             4
Acoustic models                         8
Language models                         3
Pronunciation lexicon                   5
Robust speech recognition               2
Non-native speech recognition           2
Speaker adaptation                      2
Lexicon adaptation                      2
Prosody recognition                     2
Complete speech synthesis               6
Allophone synthesis                     7
Di-phone synthesis                      6
Unit selection                          1
Prosody prediction for Text-to-Speech   3
Autom. phonetic transcription           3
Autom. phonetic segmentation            5
Phoneme alignment                       8
Distance calculation of phonemes        8
Speaker identification                  2
Speaker verification                    2
Speaker tracking                        2
Language identification                 2
Dialect identification                  2
Confidence measures                     2
Utterance verification                  2

Data                                    Availability

Unannotated corpora                     9
Annotated corpora                       5
Speech corpora                          4
Multi lingual corpora                   3
Multi modal corpora                     1
Multi media corpora                     1
Test corpora                            1
Monolingual lexicons                    8
Multilingual lexicons                   6
Thesaurus                               4

3. Priority lists

BLARK + Inventory => Priority lists

Priority lists

The prioritisation was based on the following requirements:
• The components should currently be unavailable, inaccessible, or of insufficient quality.
• The components should be relevant for a large number of applications.
• Developing the components should be possible in the short term.
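The three requirements above can be read as a filter over the inventory. The sketch below is one possible reading, not the procedure the platform actually followed. The availability scores are taken from the availability matrix, on the assumption that a lower score means less available. The `n_applications` and `short_term` values, both thresholds, and the sort order are hypothetical, chosen only to make the example run.

```python
from dataclasses import dataclass


@dataclass
class Component:
    name: str
    availability: int     # score from the availability matrix (assumed: lower = less available)
    n_applications: int   # hypothetical count of applications the component serves
    short_term: bool      # hypothetical: can it be developed in the short term?


def prioritise(components, max_availability=3, min_applications=5):
    """Keep components meeting all three requirements; thresholds are assumptions."""
    candidates = [
        c for c in components
        if c.availability <= max_availability      # currently unavailable or insufficient
        and c.n_applications >= min_applications   # relevant for many applications
        and c.short_term                           # feasible in the short term
    ]
    # Assumed ordering: most widely relevant first, then least available.
    return sorted(candidates, key=lambda c: (-c.n_applications, c.availability))


components = [
    Component("Token detection", 9, 8, True),
    Component("Parsers and grammars", 3, 8, True),
    Component("Shallow parsing", 2, 7, True),
    Component("Pragmatic analysis", 1, 6, False),
]

priority_names = [c.name for c in prioritise(components)]
print(priority_names)
```

With these illustrative inputs, token detection drops out because it is already widely available, and pragmatic analysis drops out because it is marked as not feasible in the short term.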

Priority list: Language technology

1. Annotated corpus of written Dutch
2. Syntactic analysis
3. Robust text pre-processing
4. Semantic annotations for treebank in 1
5. Translation equivalents
6. Benchmarks for evaluation

Priority list: Speech technology

1. Automatic speech recognition
2. Speech corpora
3. Multi-media speech corpora
4. Tools for (semi-) automatic transcription of speech data
5. Speech synthesis
6. Benchmarks for evaluation

Feedback

Report 1

Feedback:
• Sent to the Dutch-Flemish HLT field (2000)
• Workshop 15/11/2001

=> Report 2

Platform BC: How?

1. BLARK
2. Inventory & Eval.
3. Priority lists
=> Report 1

Feedback:
• Dutch HLT Field
• Workshop 15/11/2001

1. BLARK
2. Inventory & Eval.
3. Priority lists
=> Report 2

When BLARK is established...

• Intellectual rights by NTU
• Actual management and maintenance of resources by HLT agency, to be founded
• Maintenance of expertise by Dutch-Flemish steering committees and HLT management committee, both to be founded

General conclusions

• Goals have been achieved so that the proper prior conditions for development of materials in BLARK are created
• This work, carried out in the Dutch speaking area, can be profitable for other countries when starting similar activities:
  • Presentations & publications
  • Part of the report is translated into English

Web sites

http://www.taaluniversum.org/tst/

http://www.hltcentral.org/htmlengine.shtml?id=996

http://lands.let.kun.nl/TSpublic/strik/platform-BC.html

That’s it


Objectives

• strengthening the position of Dutch in HLT
• establishing the proper conditions for a successful management and maintenance of basic HLT resources developed through governmental funding
• stimulating co-operation between academia and industry in the field of HLT
• contributing to the realisation of European co-operation in HLT-relevant areas
• establishing a network that brings together supply and demand for knowledge, products, and services

Platform BC: Who?

Steering committee: 8 HLT experts

              Lang. Tech.      Speech Tech.

Flanders      1. WD            1. JPM
              2. FvE           2. DvC

Netherlands   1. GB            1. HS
              2. AN/DH/FdJ     2. RV / AD