11th ECESS Meeting College Language Resources


Page 1

Language Resources College 11th ECESS meeting

11th ECESS Meeting, College Language Resources

0. Minutes making for College ‘Language Resources’

1. Goal of meeting

2. Status members of College

3. Interests and acceptance of associated members and observers

4. Acceptance of College minutes of last meeting

5. College-Action List of 10th meeting

6. Status of partners

• Pronunciation lexica (Pool Lex1, Pool Lex2)

• Acoustic data for TTS voices (Pool Voice1, Pool Voice2)

• Text Corpora (Pool Text1, Pool Text2)

Page 2

7. The actual state of LR specification

• Accepting specification for Text Corpora (Pool Text1, Pool Text2)

• Accepting specification for Acoustic data for TTS voices (minimal requirements, Pool Voice2)

8. Further plans of partners

9. Discussion: General issues

• ECESS LR specification documents (public webpage)

• LR distribution (internal page)

• Splitting LR

10. Discussion: Further directions of LR College

• Extension of LR collection (new languages)

• Specification for new types of Pools

• Publications, promotion of ECESS LR

11. New Action List of College

Page 3

• Status and further plans of partners

• Interests and acceptance of members, associated members and observers

• Accepting the specification for Text Corpora (Pool Text1, Pool Text2)

• Finalizing the specification for Acoustic data for TTS voices (Pool Voice2)

• ECESS LR specification documents (public and internal page)

• Extension of LR collection

1. Goal of Meeting

Page 4

2. Status members of College

Current members of LR College

• AMU (Coordinator) Grażyna Demenko

• Siemens (Ute Ziegenhain)

• Middle East Technical University, Ankara (Tolga Çiloğlu)

• CAS (Jinhua Tao)

• Uni Munich (Uwe Reichel)

Associated partners and Observers

• Nokia (Imre Kiss)

• Microsoft Portugal (Daniela Braga)

• University of Bielefeld (Dafydd Gibbon)

• CNRS Aix en Provence (Daniel Hirst)

Page 5

3. Interests and acceptance of associated members and observers

Voting on new members of LR College

• CNRS, Aix en Provence (Daniel Hirst)

• University of Bielefeld (Dafydd Gibbon)

• Others potentially interested in LR?

Page 6

• introduction of the agenda

• Dafydd Gibbon (Uni Bielefeld) wants to contribute (MBROLA diphone voice, German lexicon)

• CNRS wants to become member of LR college

• present resources: UK lexicon, UK baseline voice, Mandarin lexicon, Mandarin voice, Polish lexicon (extended format), Catalan (UK baseline voice and Polish lexicon still have to be validated)

• POS tagging still has to be specified (size of text, domains, tokenisation problems, tag set, format of POS tags, validation)

• minimal requirements for recording voice (Hartmut Pfitzinger)

• plans of partners (table of supported languages)

4. Acceptance of College minutes of last meeting

Page 7

• discussion, general issues: settled documents are on the public web page; documents which are still under discussion will be on the internal page only

• agreed specifications will be renamed as ECESS version, not TC-STAR anymore

• splitting LRs, for instance phonetic lexicon: proper names should be put in a separate lexicon, because they are task specific, may confuse the OOV routines, and increase production costs

• in college "tools", Maribor acts as a distributor of tools needed for evaluation

• promotion of ECESS LR (LREC 2008)

• extension of LR collection (new pools, languages)

Page 8

5. College-Action List of 10th meeting

• Finalizing specifications for Text Corpora POS: PT1, PT2

• Finalizing specifications for Acoustic data for TTS voices (PV2)

• Lexicons PL1, PL2: final documentation, reports of validation to be published on the internal ECESS pages

• Extension of LR collection (new types of Pools e.g., speaker characterization/emotional/pathological voices/speech)

Page 9

6. Status of partners

• Pronunciation lexica (Pool Lex1, Pool Lex2)

• Acoustic data for TTS voices (Pool Voice1, Pool Voice2)

• Text Corpora (Pool Text1, Pool Text2)

Page 10

1.1.1 Language List

Partner | Lexica-Pool [1]: Language | Baseline Voices-Pool [2]: Amount-Sex-Language | Tagging Corpus-Pool: Amount in Words
CAS | (CN) | 1fCN | CN 150K Syllables (plan)
Uni Bonn | DE | (1fDE) | DE
Uni Munich | (DE) | (2fDE, 2mDE) | 200K DE
Siemens | UK | 1mUK [3] | 40K DE
Uni Poznań | PL | (1mPL) |
Uni. METU | (Tr) | (1mTr, 1fTr) | 10K Tr

[1] Lexica with (without) brackets show (no) deviation from the LC-STAR lexicon (common words).

[2] Voices with (without) brackets show (no) deviation from the TC-STAR baseline voice specifications.

[3] One voice will be given to the ECESS consortium without precondition.

Language Codes

UK = UK-English; JP = Japanese; CN = Mandarin; DE = German; PL = Polish; EU = Basque; ES = Spanish; FI = Finnish; SI = Slovenian; Tr = Turkish; PT = Portuguese

Page 11

7. The actual state of LR specification

• Accepting the specification for Text Corpora (Pool Text): Ute Ziegenhain, Siemens; tagged text corpora (end of Sept.)

• Finalizing the specification for Acoustic data for TTS voices (Pool Voice2), IPDS Kiel

• Preparing Polish lexicon (extended version) for validation

Page 12

8. Further plans of partners

Pools for Acoustic data for TTS voices (one voice counts as an ED, Exchange Deliverable)

(1) PV1: Pool Voice1, according to TC-STAR specs

(2) (PV2): Pool Voice2, according to minimum requirements

Pools for Pronunciation lexica

(3) PL1: Pool Lex1, according to LC-STAR specs

(4) (PL2): Pool Lex2, according to minimum requirements (minimal requirements are being specified)

Pools for Text Corpora

(5) PT1: Pool Text1, according to ECESS specs

(6) (PT2): Pool Text2, according to minimum requirements
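The six pools listed above can be written down as a small lookup table. This is only an illustrative sketch: the `Pool` class, its field names, and the `pools_for` helper are assumptions made for the example and are not part of any ECESS specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pool:
    code: str      # short code used on the slides, e.g. "PV1"
    name: str      # full pool name, e.g. "Pool Voice1"
    resource: str  # type of language resource held in the pool
    spec: str      # specification the pool follows, as stated on the slide


# Entries copied from the slide; the structure itself is hypothetical.
POOLS = [
    Pool("PV1", "Pool Voice1", "Acoustic data for TTS voices", "TC-STAR specs"),
    Pool("PV2", "Pool Voice2", "Acoustic data for TTS voices", "minimum requirements"),
    Pool("PL1", "Pool Lex1", "Pronunciation lexica", "LC-STAR specs"),
    Pool("PL2", "Pool Lex2", "Pronunciation lexica", "minimum requirements"),
    Pool("PT1", "Pool Text1", "Text corpora", "ECESS specs"),
    Pool("PT2", "Pool Text2", "Text corpora", "minimum requirements"),
]


def pools_for(resource: str) -> list[str]:
    """Return the pool codes that hold the given resource type."""
    return [p.code for p in POOLS if p.resource == resource]
```

For example, `pools_for("Text corpora")` gives the two text pools, PT1 and PT2.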

Types of LR and related Pools:

Pool Text1

Preliminary Text corpora specifications (requirements are being specified)

Pools Lex1 and Voice1

- Pool Lex1: According to LC-STAR specs as described earlier (documents available from the ECESS website)

- Pool Voice1: According to TC-STAR specs as described earlier (documents available from the ECESS website)

Pools Lex2 and Voice2

- Specifications of Minimum Requirements and thresholds will be defined during the first Period of ECESS (coordinated by Uni. Munich).

- Preferably defined as a subset of TC/LC-STAR criteria.

- Expected size of text data: 100K tokens minimum.

- Domains: as specified for the TC-STAR text corpora (in line with acoustic data creation).

- LC-STAR tag set (or similar, e.g., WSJ for UK); for languages with no LC-STAR lexicon: comparable tag set.

- 100% of the POS tags have to be manually checked.

- Format of tagging: PennTree preferable (differences to be marked in the LSP).

- LSP for every language mandatory.
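Three of the text-data requirements above are directly checkable: the 100K token minimum, the manual check of every POS tag, and the mandatory LSP. A minimal sketch of such a check follows, assuming a hypothetical in-memory `Token` record with a `manually_checked` flag; the real corpus format is still to be specified, so none of these names come from the ECESS documents.

```python
from dataclasses import dataclass

MIN_TOKENS = 100_000  # "100K tokens minimum" from the slide


@dataclass
class Token:
    form: str               # word form as it appears in the text
    pos: str                # POS tag (tag set is language dependent)
    manually_checked: bool  # hypothetical flag: tag was verified by a human


def check_corpus(tokens, has_lsp):
    """Return a list of human-readable problems; an empty list means all checks pass."""
    problems = []
    if len(tokens) < MIN_TOKENS:
        problems.append(f"only {len(tokens)} tokens, need at least {MIN_TOKENS}")
    unchecked = sum(1 for t in tokens if not t.manually_checked)
    if unchecked:
        problems.append(f"{unchecked} POS tags not manually checked")
    if not has_lsp:
        problems.append("LSP document missing")
    return problems
```

Domain coverage and tag-set conformance are left out because they depend on the language-specific documents.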

Page 13

Uni Bielefeld:

Input for ECESS

The topics proposed so far by the Bielefeld partner are based on current Bielefeld activities and need to be adapted to ECESS needs. After further discussion, it is suggested that the top priority should be in the area of lexicon design, i.e. formal specification and XML model for a flexible lexicon format which will permit extension in the following areas:

a) Multilingual lexicon for speech synthesis

b) Integrated lexicon for multimodal speech synthesis (e.g. gesture sublexicons)

c) Integrated lexicon for NLP and synthesis components.

A demonstration core lexicon for German is being prepared.
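To make the idea of an extensible XML lexicon format concrete, here is a small sketch using Python's standard `xml.etree.ElementTree`. The element and attribute names (`lexicon`, `entry`, `orth`, `pos`, `pron`, `gesture`) are invented for illustration and do not reflect the Bielefeld proposal, which is not described in detail on the slide. The point is only that a reader can handle the core fields and still tolerate extension elements such as a gesture reference.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; the schema is an assumption for this sketch.
SAMPLE = """<lexicon lang="de">
  <entry id="e1">
    <orth>Haus</orth>
    <pos>NOM</pos>
    <pron alphabet="SAMPA">h aU s</pron>
    <gesture ref="g12"/>
  </entry>
</lexicon>"""

CORE_FIELDS = {"orth", "pos", "pron"}


def load_entries(xml_text):
    """Parse lexicon entries, keeping core fields and listing any extension elements."""
    root = ET.fromstring(xml_text)
    entries = []
    for entry in root.findall("entry"):
        entries.append({
            "lang": root.get("lang"),
            "orth": entry.findtext("orth"),
            "pos": entry.findtext("pos"),
            "pron": entry.findtext("pron"),
            # Anything outside the core fields is treated as an extension.
            "extras": [child.tag for child in entry if child.tag not in CORE_FIELDS],
        })
    return entries
```

A synthesis-only consumer could ignore `extras`, while a multimodal system would look up the referenced gesture sublexicon.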

Page 14

9. Discussion. General issues

• ECESS LR specification documents (public page):

The language independent specification is public and should be accessible from the public web-page.

• LR distribution (internal webpage): contact information

• LSPs specifications (internal page): The language specific data (LSP: language specific peculiarities) is part of the LR dedicated for a pool. The LSPs have to be approved by the LR college and be located on the internal webpage of ECESS (College LR).

• Splitting LR: The data in the lexicon pool could be divided into a lexicon of common words and a lexicon of proper names; partners interested only in parts of the lexica could then choose what they want to deliver and exchange. Advantages: some partners may only want to deliver or get certain parts of a particular language; production costs for the different parts are more comparable.
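The proposed split can be sketched as a simple partition of lexicon entries. The entry layout (orthography, POS tag, transcription) and the `NAM` tag used to mark proper names are assumptions for this example; an actual lexicon would use whatever tag set and format its specification and LSP define.

```python
def split_lexicon(entries, is_proper_name):
    """Partition entries into (common words, proper names), preserving order."""
    common, proper = [], []
    for entry in entries:
        (proper if is_proper_name(entry) else common).append(entry)
    return common, proper


# Illustrative entries: (orthography, POS tag, transcription).
# The tags and transcriptions are hypothetical sample data.
entries = [
    ("house", "NOM", "h aU s"),
    ("London", "NAM", "l V n d @ n"),
    ("run", "VER", "r V n"),
]

common, proper = split_lexicon(entries, lambda e: e[1] == "NAM")
```

Passing the proper-name test as a function keeps the split independent of any particular tag set, so each language can supply its own criterion.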

Page 15

• Extension of LR collection

New types of Pools (e.g. acoustic databases for speaker characterization, emotional databases, special databases with pathological voices/speech) depending on interests and needs of ECESS. Inclusion of new languages.

• Specification for new types of Pools: preliminary remarks

• Promotion of ECESS LR, publications: SASR, Poland 2008, update the publication list


10. Discussion. Further directions of LR College

Page 16

• Make Ute's specifications available to partners; decide on them by end of Sept.

• promotion of ECESS activities SASR Workshop, Poland 2008 (flyers, presentation) (AW)

• LR – publications/SASR/Poland’2008 (AW)

• emotional databases (exchange the information) (IH)

• Specifications for the acoustic data, make the info available (Hartmut), (AW)

• lexicon (PL) evaluation (AW)

• Availability of lexica (splitting) (AW)

• Collect info about lexica for inflected languages (adding new specification) (ZK)


11. New Action List of College