Dutch HLT Resources: from BLARK to Priority Lists
Helmer Strik, Diana Binnenpoorte, Janienke Sturm,
Folkert de Vriend, and Catia Cucchiarini*
A2RT, Dept. of Language and Speech, Nijmegen
* NTU, Dutch Language Union, The Hague
Walter Daelemans
Dept. of CNTS Language Technology, Antwerp
Dutch HLT Platform: NTU
NTU - Nederlandse Taalunie (Dutch Language Union)
Mission: Strengthening the position of the Dutch language
Dutch HLT Platform
Aim: To contribute to the further development of an adequate language and speech technology infrastructure for Dutch
Dutch HLT Platform: Other participants
• Ministry of the Flemish Community
• Flemish Institute for the Promotion of Scientific-technological Research in Industry
• Fund for Scientific Research - Flanders
• Dutch Ministry of Education, Culture and Sciences
• Dutch Ministry of Economic Affairs
• Netherlands Organisation for Scientific Research
• Senter (an agency of the Dutch Ministry of Economic Affairs)
Dutch HLT Platform: Four action lines
A. Performing a market place function
B. Strengthening the HLT infrastructure
C. Working out standards and evaluation criteria
D. Developing a management, maintenance, and distribution plan
This presentation: Platform BC
A. -
B. Strengthening the HLT infrastructure
C. Working out standards and evaluation criteria
D. -
B+C => Platform BC
• Focus on method (skip many details)
• More details: see publications, web sites
Platform BC: What?
1. BLARK: Basic LAnguage Resources Kit
2. Inventory & Evaluation
3. Priority lists
Platform BC: Who?
Steering committee:
• 8 HLT experts
• NTU
• NWO (funding body)
4 field researchers
Platform BC: How?
1. BLARK
2. Inventory & Eval.
3. Priority lists
=> Report 1
Feedback:
• Dutch HLT Field
• Workshop 15/11/2001
1. BLARK
2. Inventory & Eval.
3. Priority lists
=> Report 2
1. BLARK: Basic LAnguage Resources Kit
Components:
• Applications: classes of applications rather than specific applications or products.
• Modules (or semi-products): the basic software components of HLT applications.
• Data: sets of language data and descriptions in machine-readable form.
BLARK: Basic LAnguage Resources Kit
2 matrices:
1. Modules x Data
2. Modules x Applications
=> BLARK
[Matrices 1 and 2: Modules (rows) x Data and Applications (columns); cells are marked + or ++]
Data columns: monolingual lexicons, multilingual lexicons, thesauri, annotated corpora, unannotated corpora, speech corpora, multilingual corpora, multimodal corpora, multimedia corpora
Application columns: CALL, access control, speech input, speech output, dialogue systems, document production, information access, translation
Language technology modules (rows): grapheme-phoneme conversion, token detection, sentence boundary detection, name recognition, spelling correction, lemmatising, morphological analysis, morphological synthesis, word sort disambiguation, parsers and grammars, shallow parsing, constituent recognition, semantic analysis, referent resolution, word meaning disambiguation, pragmatic analysis, text generation, language dependent translation
Speech technology modules (rows): complete speech recognition, acoustic models, language models, pronunciation lexicon, robust speech recognition, non-native speech recognition, speaker adaptation, lexicon adaptation, prosody recognition, complete speech synthesis, allophone synthesis, di-phone synthesis, unit selection, prosody prediction for Text-to-Speech, automatic phonetic transcription, automatic phonetic segmentation, phoneme alignment, distance calculation of phonemes, speaker identification, speaker verification, speaker tracking, language identification, dialect identification, confidence measures, utterance verification
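As a rough illustration of how a Modules x Applications matrix could be held in machine-readable form, the Python sketch below stores one row per module and counts the columns in which a module carries a mark. The numeric mark values (0, 1, 2), the helper names `parse_row` and `relevance`, and the two example rows are assumptions made for illustration; they are not the actual cells of the BLARK matrices.

```python
# Importance marks as they appear in the matrices; an empty cell means no mark.
# Mapping the marks to 0/1/2 is an assumption of this sketch.
MARK_VALUE = {"": 0, "+": 1, "++": 2}

# Application columns of matrix 2, as named on the slide.
APPLICATIONS = [
    "CALL", "access control", "speech input", "speech output",
    "dialogue systems", "document production", "information access",
    "translation",
]


def parse_row(marks, columns=APPLICATIONS):
    """Turn one row of marks into a {column: importance} mapping."""
    if len(marks) != len(columns):
        raise ValueError("one mark (possibly empty) is needed per column")
    return {col: MARK_VALUE[mark] for col, mark in zip(columns, marks)}


def relevance(row):
    """Number of columns for which the module carries any mark."""
    return sum(1 for value in row.values() if value > 0)


# HYPOTHETICAL rows: placeholder marks, not the real matrix cells.
modules_x_applications = {
    "Module A": parse_row(["+", "", "++", "++", "++", "+", "++", "++"]),
    "Module B": parse_row(["", "", "", "", "", "+", "", ""]),
}

relevance_by_module = {
    name: relevance(row) for name, row in modules_x_applications.items()
}
```

A per-module count of this kind is one possible way to express "relevant for many applications"; the Modules x Data matrix could be stored the same way with the data columns instead.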
BLARK: Language technology
• Modules
  • Robust modular text preprocessing
  • Morphological analysis and morphosyntactic disambiguation / unknown words
  • Robust syntactic analysis
  • Aspects of semantic analysis (word meaning and reference)
• Data
  • Monolingual lexicon
  • Annotated corpus of written Dutch
  • Benchmarks for evaluation
BLARK: Speech technology
• Modules
  • Automatic speech recognition
  • Speech synthesis system
  • Tools for annotation of speech corpora
  • Confidence measures and utterance verification
  • Identification (speaker, language, dialect)
• Data
  • Monolingual speech corpora for specific applications
  • Multilingual speech corpora
  • Multimodal/medial speech corpora
  • Benchmarks for evaluation
2. Inventory & Evaluation
B. Inventory: Which components in BLARK are available?
C. Evaluation: And of sufficient quality?
Checklist approach
=> B&C together: platform BC
See matrix 3 - Availability
Modules Availability
Grapheme-phoneme conversion 8
Token detection 9
Sentence boundary detection 3
Name recognition 4
Spelling correction 3
Lemmatising 9
Morphological analysis 7
Morphological synthesis 9
Word sort disambiguation 7
Parsers and grammars 3
Shallow parsing 2
Constituent recognition 5
Semantic analysis 3
Referent resolution 2
Word meaning disambiguation 2
Pragmatic analysis 1
Text generation 3
Language dependent translation 3
Complete speech recognition 4
Acoustic models 8
Language models 3
Pronunciation lexicon 5
Robust speech recognition 2
Non-native speech recognition 2
Speaker adaptation 2
Lexicon adaptation 2
Prosody recognition 2
Complete speech synthesis 6
Allophone synthesis 7
Di-phone synthesis 6
Unit selection 1
Prosody prediction for Text-to-Speech 3
Autom. phonetic transcription 3
Autom. phonetic segmentation 5
Phoneme alignment 8
Distance calculation of phonemes 8
Speaker identification 2
Speaker verification 2
Speaker tracking 2
Language identification 2
Dialect identification 2
Confidence measures 2
Utterance verification 2
Data
Unannotated corpora 9
Annotated corpora 5
Speech corpora 4
Multilingual corpora 3
Multimodal corpora 1
Multimedia corpora 1
Test corpora 1
Monolingual lexicons 8
Multilingual lexicons 6
Thesaurus 4
3. Priority lists
BLARK + Inventory => Priority lists
Priority lists
The prioritisation was based on the following requirements:
• The components should currently be unavailable, inaccessible, or of insufficient quality.
• The components should be relevant for a large number of applications.
• Developing the components should be possible in the short term.
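The three requirements can be read as a filter over the inventory. The Python sketch below is one possible way to express that, not the procedure the platform actually followed. It uses availability scores from matrix 3 and assumes that a higher score means better availability. The thresholds, the `n_applications` counts, and the `short_term` flags are hypothetical values introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Component:
    name: str
    availability: int    # score from matrix 3; higher is assumed to mean more available
    n_applications: int  # HYPOTHETICAL count of applications the component is relevant for
    short_term: bool     # HYPOTHETICAL judgement: can it be developed in the short term?


def prioritise(components, max_availability=3, min_applications=5):
    """Keep components that meet all three requirements, most widely relevant first."""
    candidates = [
        c for c in components
        if c.availability <= max_availability      # requirement 1: not (sufficiently) available
        and c.n_applications >= min_applications   # requirement 2: relevant for many applications
        and c.short_term                           # requirement 3: feasible in the short term
    ]
    return sorted(candidates, key=lambda c: (-c.n_applications, c.availability, c.name))


# Availability scores are taken from matrix 3; the other two fields are placeholders.
components = [
    Component("Token detection", 9, 6, True),
    Component("Shallow parsing", 2, 7, True),
    Component("Parsers and grammars", 3, 6, True),
    Component("Pragmatic analysis", 1, 5, False),
    Component("Complete speech recognition", 4, 8, True),
]

ranked = prioritise(components)
ranked_names = [c.name for c in ranked]
```

With these placeholder values, only "Shallow parsing" and "Parsers and grammars" pass all three filters. Different thresholds would give a different list, which is why the cut-offs are parameters rather than fixed values.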
Priority list: Language technology
1. Annotated corpus of written Dutch
2. Syntactic analysis
3. Robust text pre-processing
4. Semantic annotations for treebank in 1
5. Translation equivalents
6. Benchmarks for evaluation
Priority list: Speech technology
1. Automatic speech recognition
2. Speech corpora
3. Multi-media speech corpora
4. Tools for (semi-) automatic transcription of speech data
5. Speech synthesis
6. Benchmarks for evaluation
Feedback
Report 1
Feedback:
• Sent to the Dutch-Flemish HLT field (2000)
• Workshop 15/11/2001
=> Report 2
When BLARK is established...
• Intellectual rights by NTU
• Actual management and maintenance of resources by HLT agency, to be founded
• Maintenance of expertise by Dutch-Flemish steering committees and HLT management committee, both to be founded
General conclusions
• The goals have been achieved, so the proper preconditions for developing the materials in BLARK are now in place.
• This work, carried out in the Dutch-speaking area, can be profitable for other countries when starting similar activities:
  • Presentations & publications
  • Part of the report is translated into English
Web sites
http://www.taaluniversum.org/tst/
http://www.hltcentral.org/htmlengine.shtml?id=996
http://lands.let.kun.nl/TSpublic/strik/platform-BC.html
That’s it
Objectives
• strengthening the position of Dutch in HLT
• establishing the proper conditions for successful management and maintenance of basic HLT resources developed through governmental funding
• stimulating co-operation between academia and industry in the field of HLT
• contributing to the realisation of European co-operation in HLT-relevant areas
• establishing a network that brings together supply and demand for knowledge, products, and services
Platform BC: Who?
Steering committee: 8 HLT experts
              Lang. Tech.      Speech Tech.
Flanders      1. WD            1. JPM
              2. FvE           2. DvC
Netherlands   1. GB            1. HS
              2. AN/DH/FdJ     2. RV / AD