Cocosda 2001: Brief Overview of recent activities in Europe (Khalid CHOUKRI, ELRA/ELDA)

Cocosda 2001, ELRA/ELDA: Brief Overview of recent activities in Europe. Khalid CHOUKRI, ELRA/ELDA, 55 Rue Brillat-Savarin, F-75013 Paris, France. Tel. +33 1 43 13 33 33 -- Fax. +33 1 43 13 33 30. Email: [email protected] Web: http://www.elda.fr/


Cocosda 2001 ELRA/ELDA KC/1

Brief Overview of recent activities in Europe

Khalid CHOUKRI
ELRA/ELDA

55 Rue Brillat-Savarin, F-75013 Paris, France
Tel. +33 1 43 13 33 33 -- Fax. +33 1 43 13 33 30

Email: [email protected]
Web: http://www.elda.fr/

Cocosda 2001 ELRA/ELDA KC/2

Outline

• ELRA …

• European vs National activities

• Speech resources collections

• Other projects (Enabler, EuroMap, etc.)

• Evaluation

• LREC2002

Cocosda 2001 ELRA/ELDA KC/3

European Language Resources Association: An improved infrastructure for data sharing

Centralized not-for-profit organization for the collection, distribution, and validation of speech, text, and terminology resources and tools.

Extension to:

• Multimodal/Multimedia Resources

• Evaluation

Cocosda 2001 ELRA/ELDA KC/4

European Language Resources Association: An improved infrastructure for data sharing

A Repository Center:

• Technical & logistic issues

• Commercial issues (prices, fees, royalties)

• Legal issues (licensing, IPR)

• Information dissemination

An Association of users of Language Resources

Cocosda 2001 ELRA/ELDA KC/5

Brief Overview of recent activities in Europe
European Union Level

Cocosda 2001 ELRA/ELDA KC/6

Brief Overview of recent activities in Europe
National Level

Cocosda 2001 ELRA/ELDA KC/7

Brief Overview of recent activities in Europe
European Union Level

European R&D Framework Programmes (FP): dating back to the early Eighties

On-going Actions

• FP5, with a thematic programme on Information Society Technologies

• MLIS (Multi-Lingual Information Society)

• INCO (International Cooperation)

• E-Content

Cocosda 2001 ELRA/ELDA KC/8

Brief Overview of recent activities in Europe
European Union Level

European R&D Framework Programmes (FP): dating back to the early Eighties

On-going Actions

• E-Content: "Promoting European Digital Content on the Global Networks"

Action line 1: "Improving access to and expanding use of public sector information"

Action line 2: "Enhancing content production in a multilingual and multicultural environment"

Cocosda 2001 ELRA/ELDA KC/9

Brief Overview of recent activities in Europe
European Union Level

Some projects within FP5 and previous FPs … related to Cocosda concerns

• Resources production: SpeechDat Family

• Specifications of new types of resources: Natural Interaction and Multimodality, within the ISLE (International Standards for Language Engineering) project

• Dialog & Evaluation: Seneca

• Evaluation: CLASS

• Standards: EAGLES and its extension … the EU/US collaborative project ISLE

• Networks: ELSNET, ENABLER

• Information gathering & dissemination: Euromap and its follow-up, Hope

Cocosda 2001 ELRA/ELDA KC/10

SpeechDat Family

• SpeechDat(M): fixed telephone network, 1K speakers

• SpeechDat-II: fixed and mobile networks, 1-5K speakers

• SpeechDat-II: Speaker Verification

• SpeechDat-E (CEE: Polish, Czech, Slovak, Russian, Hungarian)

• SALA (Speech Across Latin America) and now SALA-II

• SpeechDat-Car (inc. cellular)

• SpeeCon (consumer products)

• OrienTel

Cocosda 2001 ELRA/ELDA KC/11

SpeeCon Project

Participant number | Participant name | Participant short name | Country
1 | Siemens Aktiengesellschaft | Siemens | Germany
2 | Ericsson Eurolab Deutschland GmbH | EEDN | Germany
3 | IBM Deutschland Entwicklung GmbH | IBM | Germany
4 | Lernout & Hauspie Speech Products NV | L&H | Belgium
5 | Matra Nortel Communications | Matra | France
6 | Nokia Corporation | Nokia | Finland
7 | Philips Speech Processing Aachen, Zweigniederlassung der Philips GmbH | Philips | Germany
8 | Sony International (Europe) GmbH | Sony | Germany
9 | TEMIC TELEFUNKEN microelectronic GmbH | TEMIC | Germany
10 | DaimlerChrysler AG | DCAG | Germany

Cocosda 2001 ELRA/ELDA KC/12

SpeeCon Project

Dialectal zone | Language | Region | Remarks
Esl_ES | Spanish | Spain | (excluding Latin America)
Rus_RU 1) | Russian | Russia |
Ita_IT | Italian | Italy |
Sve_SE_FI | Swedish | Sweden and Finland |
Deu_DE_AT | German | Germany and Austria | (excluding e.g. Belgium, Luxembourg, Switzerland)
Eng_GB | English | United Kingdom |
Dan_DK | Danish | Denmark |
Dut_BE | Dutch | Belgium |
Fra_CA | French | Canada |
Fra_FR | French | France | (excluding e.g. Belgium, Luxembourg, Switzerland)
Fin_FI | Finnish | Finland |
Zho_CN_HK | Mandarin | P. R. China (incl. Hongkong) | (excluding e.g. Taiwan)
Dut_NL | Dutch | The Netherlands |
Jpn_JP | Japanese | Japan |
Pol_PL | Polish | Poland |
Por_PT | Portuguese | Portugal | (excluding Brazil)
Deu_CH | German | Switzerland |
Eng_US | English | USA | (excluding e.g. Canada)
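The dialectal zone codes in the SpeeCon table appear to combine a three-letter language tag with one or more two-letter region tags (for example, Deu_DE_AT for German in Germany and Austria). The Python sketch below shows one way such codes might be split and grouped. The pattern and the function names `parse_zone` and `group_by_language` are assumptions made for illustration, inferred only from the codes on the slide; they are not part of any SpeeCon specification.

```python
import re

# Hypothetical pattern inferred from the slide: a capitalised three-letter
# language tag followed by one or more two-letter region tags, joined by "_".
ZONE_PATTERN = re.compile(r"^([A-Z][a-z]{2})((?:_[A-Z]{2})+)$")


def parse_zone(code: str) -> tuple[str, list[str]]:
    """Split a zone code such as 'Deu_DE_AT' into (language, [regions])."""
    match = ZONE_PATTERN.match(code)
    if match is None:
        raise ValueError(f"unrecognised dialectal zone code: {code!r}")
    language = match.group(1)
    # group(2) looks like '_DE_AT'; drop the leading underscore, then split.
    regions = match.group(2).strip("_").split("_")
    return language, regions


def group_by_language(codes):
    """Map each language tag to the list of region groups recorded for it."""
    grouped = {}
    for code in codes:
        language, regions = parse_zone(code)
        grouped.setdefault(language, []).append(regions)
    return grouped


if __name__ == "__main__":
    for language, zones in group_by_language(
        ["Deu_DE_AT", "Deu_CH", "Fra_FR", "Fra_CA"]
    ).items():
        print(language, zones)
```

Grouping by language tag makes it easy to see that German and French are each collected in two separate dialectal zones, while most other languages have a single zone.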

Cocosda 2001 ELRA/ELDA KC/13

SpeechDat Family: OrienTel

Multilingual access to interactive communication services for the Mediterranean and the Middle East

• 7 linguistic regions

• 10 OrienTel countries

• 23 databases

Cocosda 2001 ELRA/ELDA KC/14

SpeechDat Family: OrienTel

Linguistic affiliation | OrienTel countries | Languages covered
Maghreb Arabic (excluding Algeria and parts of Libya) | Morocco | Standard Arabic, Colloquial Moroccan Arabic, French
Maghreb Arabic (excluding Algeria and parts of Libya) | Tunisia | Standard Arabic, Colloquial Tunisian Arabic, French
Egyptian Arabic (excluding parts of Libya) | Egypt | Standard Arabic, Colloquial Egyptian Arabic, English
Levantine Arabic (excluding Syria, Lebanon and Jordan) | Israel and Palestine Authorities | Hebrew, Standard Arabic, Coll. South Levantine Arabic
Gulf Arabic (excluding Kuwait, Bahrain, Qatar, Oman and Yemen) | United Arab Emirates | Standard Arabic, Colloquial Gulf Arabic, English
Gulf Arabic (excluding Kuwait, Bahrain, Qatar, Oman and Yemen) | Saudi Arabia | Standard Arabic, Colloquial Gulf Arabic, English
Cypriote Greek | Cyprus | Greek, English
Hebrew | Israel | Hebrew
Turkish | Turkey, Germany for German Turkish | German

Cocosda 2001 ELRA/ELDA KC/15

SpeechDat Family: SALA

Phase I: Fixed Network

• Mexico

• Argentina

• Chile

• Brazil

• Colombia

• Venezuela

Cocosda 2001 ELRA/ELDA KC/16

SpeechDat Family: SALA-II

Phase II: Cellular/Mobile Network

Latin America: Mexico, Argentina, Chile*, Brazil, Colombia, Venezuela, Costa Rica*, Peru*

US and Canada:

• US English North West

• US English South West

• US English North East

• US English South East

• US Spanish East (Caribbean variant)

• US Spanish West (Mexican variant)

• Canadian British English

• Canadian American English

• Canadian French

Cocosda 2001 ELRA/ELDA KC/17

Other resource-oriented projects

• C-Oral-Rom: Conversational Speech

• Romance languages: French, Italian, Spanish, Portuguese

• "Comparable" data

Cocosda 2001 ELRA/ELDA KC/18

Brief Overview of recent activities in Europe
European Union Level

A major project within MLIS … related to Cocosda concerns

NETWORK-DC: Network of international & regional Data Centers

Partners: ELRA, SPEX & LDC

Others (GSK, …) welcome to join

Cocosda 2001 ELRA/ELDA KC/19

Brief Overview of recent activities in Europe
National Level

Cocosda 2001 ELRA/ELDA KC/20

Brief Overview of recent activities in Europe
National Projects/programs

Netherlands & Belgium:

• Dutch Spoken Corpus (coming presentation): data available via ELRA, release of April 2001

Over nine national projects:

Germany:

• From Verbmobil (data available via ELRA) to SmartKom

France:

• Réseau National de Recherche en Télécommunications (RNRT)

• Others: RIAM, RNTL; coming: evaluation program …

Italy, Greece, Czech Republic, …

Cocosda 2001 ELRA/ELDA KC/21

Brief Overview of recent activities in Europe
National Projects/programs … Dutch & Flemish

Release 1 (March 2000)

• 62 hours of speech samples, orthographically transcribed (615,000 words); 90,000 words enriched with Part-of-Speech tags;

• annotation CD with the first version of PRAAT (annotation tool) and the first version of the documentation (in Dutch), including relevant information on the speakers (e.g. gender, age, socio-economic class) and samples (e.g. recording conditions, the equipment); information on the speakers is in anonymous form.

Release 2 (October 2000)

• over 150 hours of speech samples, orthographically transcribed (over 1,500,000 words); approximately 750,000 words enriched with Part-of-Speech tags;

• annotation CD with annotation protocols and relevant information on the speakers (e.g. gender, age, socio-economic class) and samples (e.g. recording conditions, the equipment); information on the speakers is in anonymous form.

Release 3 (April 2001)

• more orthographically transcribed data enriched with Part-of-Speech tags; the first broad phonetic transcriptions, word alignments, syntactic annotations and lexicon link-up will be available;

• annotation CD with documentation, including relevant information on the speakers (e.g. gender, age, socio-economic class) and samples (e.g. recording conditions, the equipment);

• this release encompasses the first version of Corex, the exploitation tool.

Cocosda 2001 ELRA/ELDA KC/22

MARKET ANALYSIS

First objective:

• To get hard facts about the needs/requirements

• To get reliable figures about the market

Second objective:

• To reinforce/confirm our knowledge/assessments

Cocosda 2001 ELRA/ELDA KC/23

MARKET ANALYSIS (Worldwide Market of LR - Commercial Use)

[Bar chart: market size in million EUR (axis 0 to 60) for the total, telephony, office and consumer segments, 1998 vs. 2003. Courtesy of Siemens]

Cocosda 2001 ELRA/ELDA KC/24

Speech Recognition -- Market Segmentation
Implications for the Language Resource Market 1998-2003

Market Segment | Office | Telephony | Consumer | Total Market
# of Customers | 4 - 8 | 10 - 20 | 10 - 30 | 24 - 58
# of databases* | 200 - 400 | 1000 - 2000 | 1000 - 3000 | 2200 - 5400
Market Size (M€)** | 6 - 12 | 30 - 60 | 30 - 90 | 66 - 162

50 languages, 30K€ per LR

Telephony: 2 databases/language (fixed and mobile network)
Consumer: 2 databases/language (car and public environment)

* all databases needed by all providers of speech recognition technology

** Estimated accumulated market from 1998 until 2003 (in M€)
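The figures in the table appear to follow from simple multiplication: customers × languages × databases per language gives the number of databases, and databases × price per LR gives the market size. The Python sketch below reproduces that arithmetic as a sanity check. The function names `estimate` and `total_market` are hypothetical, and the single database per language for the Office segment is an assumption inferred from the table, since the slide states the factor only for Telephony and Consumer.

```python
LANGUAGES = 50          # from the slide
PRICE_PER_LR_KEUR = 30  # from the slide: 30 K EUR per language resource

# segment -> (min customers, max customers, databases per language)
# The Office factor of 1 is an assumption inferred from the table; the slide
# gives "2 databases/language" only for Telephony and Consumer.
SEGMENTS = {
    "Office": (4, 8, 1),
    "Telephony": (10, 20, 2),
    "Consumer": (10, 30, 2),
}


def estimate(min_customers, max_customers, db_per_language):
    """Return ((min_db, max_db), (min_meur, max_meur)) for one segment."""
    databases = tuple(
        customers * LANGUAGES * db_per_language
        for customers in (min_customers, max_customers)
    )
    # Convert K EUR to M EUR; the slide's inputs give whole numbers here.
    market_meur = tuple(d * PRICE_PER_LR_KEUR // 1000 for d in databases)
    return databases, market_meur


def total_market(segments=SEGMENTS):
    """Sum the per-segment market ranges into a (min, max) total in M EUR."""
    low = high = 0
    for args in segments.values():
        _, (seg_low, seg_high) = estimate(*args)
        low += seg_low
        high += seg_high
    return low, high


if __name__ == "__main__":
    for name, args in SEGMENTS.items():
        print(name, estimate(*args))
    print("Total market (M EUR):", total_market())
```

Under these assumptions the computed ranges match the table: 6 to 12, 30 to 60 and 30 to 90 M€ per segment, and 66 to 162 M€ in total.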

Cocosda 2001 ELRA/ELDA KC/27

EVALUATION ACTIVITIES

Distribution Activities of Language Resources for Evaluation (via ELRA)

Cocosda 2001 ELRA/ELDA KC/28

Distribution Activities: Language Resources for Evaluation

• AURORA (Distributed speech recognition)

• AMARYLLIS (Multilingual/Parallel corpora)

• CLEF (Cross-Language Evaluation Forum)

• ARCADE/ROMANSEVAL

Cocosda 2001 ELRA/ELDA KC/29

Distribution Activities: Language Resources for Evaluation

AURORA (Distributed speech recognition)

Set up to establish a worldwide standard for the feature extraction software in a DSR (Distributed Speech Recognition) system:

(i) Evaluation of front-end feature extraction algorithms in background noise

(ii) Evaluation and comparison of the performance of noise-robust speech recognition algorithms

Cocosda 2001 ELRA/ELDA KC/33

European Language Resources Association & Evaluation

• Language Resources for Evaluation (production/commissioning; distribution)

• Methodologies for Evaluation

• Management of Evaluation Campaigns

• Evaluation of Language Resources (validation)

Cocosda 2001 ELRA/ELDA KC/34

ENABLER: European National Activities for Basic Language Engineering & Resources -- A new initiative

• Survey of existing national activities

• Fostering common research and compatibility of LR

• Suggestion for and contribution to international cooperation

• Identification of existing resources (Universal Catalogue)

• The basics (e.g. standards, tools, evaluation procedures, …)

Extension foreseen/planned

Cocosda 2001 ELRA/ELDA KC/35

EUROMAP - HOPE

HLT CENTRAL

http://www.hltcentral.org

Cocosda 2001 ELRA/ELDA KC/36

EUROMAP - HOPE

HOPE is a knowledge building and dissemination project whose main goal is to raise awareness about the market readiness and potential benefits of Human Language Technologies (HLT) among appropriate market players in the information society.

Cocosda 2001 ELRA/ELDA KC/37

EUROMAP - HOPE

Center for Sprogteknologi | CST | DK
VDI/VDE Technologiezentrum Informationstechnik GmbH | VDI/VDE-IT | DE
VIKOP Verein für Internationale Forschungs-Technologie und Bildungskooperation | BIT | AT
Instituto Cervantes | IC | ES
Scientific Computing Ltd. | CSC | FI
Consorzio Pisa Ricerche | CPR | IT
Arax Limited | Arax | UK
European Language Resources Distribution Agency | ELDA | FR
University of Brighton | ITRI | UK
Institute for Language and Speech | ILSP | GR
Nederlandse Taalunie | NTU | NL
Central Laboratory for Parallel Processing | CLPP | BG

Cocosda 2001 ELRA/ELDA KC/38

LREC-2002

LAS PALMAS DE GRAN CANARIA, CANARY ISLANDS, SPAIN

Cocosda 2001 ELRA/ELDA KC/39

LREC-2002

• Issues in the design, construction and use of Language Resources (LR)

• Issues in Human Language Technologies evaluation

• General issues (national and international activities and projects, cooperations, …)

Conference: 29-30-31 MAY 2002

Pre-Conference Workshops: 27-28 MAY 2002

Post-Conference Workshop: 1-2 JUNE 2002

Cocosda 2001 ELRA/ELDA KC/40

LREC-2002 … IMPORTANT DATES

Submission of proposals for oral and poster papers, referenced demos, panels and workshops: 20 NOVEMBER 2001

Notification of acceptance of workshop and panel proposals: 10 DECEMBER 2001

Notification of acceptance of papers, posters, referenced demos: 2 FEBRUARY 2002

Final versions: 2 APRIL 2002