Tanja Schultz
Carnegie Mellon University
Cairo, Egypt, May 21, 2001
Data Recording, Transcription, and Speech Recognition for Egypt
Outline
Requirements for Speech Recognition
Data Requirements: Audio data, Pronunciation dictionary, Text corpus data
Recording of Audio data
Transcription of Audio data
Initialization of an Egyptian Speech Recognition Engine
Multilingual Speech Recognition
Rapid Adaptation to new Languages
Part 1
Requirements for Speech Recognition
Data Requirements: Audio data, Pronunciation dictionary, Text corpus data
Recording of Audio data
Transcription of Audio data
Thanks to Celine Morel and Susanne Burger
Speech Recognition
[Figure: processing chain Speech Input - Preprocessing - Decoding/Search - Postprocessing - Synthesis (TTS). For the spoken input "hello", the search considers hypotheses such as "Hello", "Hale Bob", "Hallo", ...]
Fundamental Equation of SR
P(W|x) = [ P(x|W) * P(W) ] / P(x)
[Figure: the spoken input "hello" is decoded with three knowledge sources]
Acoustic Model: A-b, A-m, A-e
Pronunciation: Am = AE M, Are = A R, I = AI, you = J U, we = VE
Language Model: "I am", "you are", "we are", ...
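The decision rule behind this equation can be sketched in a few lines. This is a minimal illustration, not the recognizer's actual search: since P(x) is the same for every hypothesis W, it is dropped, and the best W maximizes log P(x|W) + log P(W). The function name `decode` and all scores below are hypothetical values chosen for illustration; only the three candidate hypotheses come from the slides.

```python
import math

def decode(acoustic_log_likelihood, language_model_prob):
    """Return the hypothesis W that maximizes log P(x|W) + log P(W).

    P(x) is constant over all hypotheses, so it does not affect the argmax.
    """
    best, best_score = None, -math.inf
    for words, log_p_x_given_w in acoustic_log_likelihood.items():
        p_w = language_model_prob.get(words, 0.0)
        if p_w <= 0.0:
            continue  # hypotheses the language model rules out are skipped
        score = log_p_x_given_w + math.log(p_w)
        if score > best_score:
            best, best_score = words, score
    return best, best_score

# Hypothetical scores for the three candidate hypotheses of the input "hello".
acoustic = {"hello": -10.0, "hale bob": -9.5, "hallo": -10.2}   # log P(x|W)
language = {"hello": 0.6, "hale bob": 0.01, "hallo": 0.05}      # P(W)

best, score = decode(acoustic, language)
print(best)
```

In this made-up example "hale bob" fits the acoustics slightly better, but the language model prior makes "hello" the overall winner.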
SR: Data Requirements
Audio Data, Phoneme Set: needed for the Acoustic Model (A-b, A-m, A-e)
Pronunciation Dictionary: needed for the Pronunciation component (Am = AE M, Are = A R, I = AI, you = J U, we = VE)
Text Data: needed for the Language Model ("I am", "you are", "we are", ...)
Audio Data
For training and testing the SR engine, a large amount of high-quality data in the target language should be collected
What kind of data is needed
  Scenario and task
How to collect these data
  Recording setup
  Preparation of information
Quality of data
  Sampling rate, resolution
Amount of data
  Number of dialogs and speakers
Transcription of audio data
What kind of Audio Data
C-Star Scenario: Travel arrangement
(planning a vacation trip, booking a hotel room, ...)
Scenario is realistic and attractive to the people
Dialog between two people: One Agent: Travel assistant
One Client: Traveler, pretends to visit a specific site
Speakers get instructions about what task they have to accomplish, but not HOW to do it
Role playing setup
How to collect Audio Data
Recording setup
  The dialog partners can NOT see each other, i.e. no face-to-face contact (in preparation for telephone and web applications)
  No non-verbal communication
  Spontaneous speech (noise effects, disfluencies, ... may occur)
  No push-to-talk; try to avoid crosstalk
  Balanced dialogs
Dialog structure, Task
  Greetings and formalities between dialog partners
  Client gives information such as number of persons traveling, date of travel (arrival/departure), interests
  Client asks questions about means of transportation (train, flight), hotel or apartment modalities, visits to sights or cultural events
  Agent provides information according to the client's questions
Prepare Information for Client and Agent
A: Hotel list (3-4 hotels per dialog)
A: Transportation list (3-4 flights, train, bus schedules)
A: List of 3-4 cultural events per dialog
C: Information about the specific task:
  who is traveling (e.g. client travels with partner + two kids)
  when is s/he traveling (e.g. 2-week vacation trip in July)
  where (e.g. trip to Pennsylvania, US)
  how (e.g. direct flight to Pittsburgh, rental car)
  what are the places of interest (CMU - Pittsburgh, Liberty Bell in Philadelphia, ...)
Date and time of recording might be faked
Dialog takes place at recording place
Example sheets: Celine Morel
Quality and Quantity of Audio Data
Quality of data
  High-quality clean speech
  Close-speaking microphone, like Sennheiser H-420
  16 kHz sampling rate, 16 bit resolution
Amount of data
  Minimum of 10 hours of spoken speech
  Average length of dialogs: 10 - 20 minutes
  10 hours = 30 - 60 dialogs
Number of speakers
  As many speakers as possible (speaker-independent AM)
  30 - 60 dialogs = maximum of 120 different speakers
  Split the speakers/dialogs into three disjoint subsets: training set, development test set, evaluation test set
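The three-way split could be done along the following lines. This is only a sketch under stated assumptions: the function name `split_dialogs`, the 10% fractions for the two test sets, and the dialog/speaker identifiers are all hypothetical, and each speaker is assumed to take part in only one dialog (otherwise the disjointness check raises an error and the split has to be arranged by hand).

```python
import random

def split_dialogs(dialogs, dev_fraction=0.1, eval_fraction=0.1, seed=0):
    """Split dialogs into train/dev/eval and verify that no speaker is shared.

    dialogs maps a dialog id to the pair (agent_speaker, client_speaker).
    """
    ids = sorted(dialogs)
    random.Random(seed).shuffle(ids)  # fixed seed keeps the split reproducible
    n_dev = max(1, round(len(ids) * dev_fraction))
    n_eval = max(1, round(len(ids) * eval_fraction))
    subsets = {
        "dev": ids[:n_dev],
        "eval": ids[n_dev:n_dev + n_eval],
        "train": ids[n_dev + n_eval:],
    }
    # Collect the speakers of each subset and check pairwise disjointness.
    speakers = {name: {s for d in members for s in dialogs[d]}
                for name, members in subsets.items()}
    names = list(speakers)
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            overlap = speakers[first] & speakers[second]
            if overlap:
                raise ValueError(
                    f"speakers {sorted(overlap)} appear in both {first} and {second}")
    return subsets

# Hypothetical corpus: 40 dialogs, each with its own agent and client speaker.
dialogs = {f"dlg{i:03d}": (f"agent{i:03d}", f"client{i:03d}") for i in range(40)}
subsets = split_dialogs(dialogs)
print({name: len(members) for name, members in subsets.items()})
```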
Recording
Tool: Total Recorder
http://www.highcriteria.com/download/totrec.exe
Registration fee: $11.95
IBM-compatible PC, sound card (e.g. Soundblaster)
Close-speaking microphone (e.g. Sennheiser H-420)
Win95, Win98, Win2000, WinNT
[Figure: Sound board - Sound board driver - Total Recorder]
Transcription of Audio Data
For training the SR engine, the spoken data must be transcribed manually
Very time consuming (10-20 times real time)
The more accurate the transcription, the more valuable it is
Since we do have the pronunciations, only word-based transcriptions are needed
Transcription convention from Susanne Burger
  Download from http://www.cs.cmu.edu/~tanja
  Describes notation
Transcription tool: transEdit (Burger & Meier)
Transliteration conventions
Example:
tanja_0001: this sentence +uhm+ was spoken +pause+ by ~Tanja and +/cos/+ contains one restart
Parsability - one turn per line: Tanja_0001
Consistency
Filter programs
tagging of proper names: ~Tanja
tagging of numbers
special noise markers: +uhm+
no capitalization at the beginning of turns
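A filter program for such turns might look like the sketch below. It relies only on what the example line shows: a turn id followed by a colon, `~` in front of proper names, and markers enclosed in `+...+`. The function name `parse_turn`, the regular expression for the turn id, and the choice to treat every `+...+` token (including the restart fragment) as a marker are assumptions; the full convention document defines the real notation.

```python
import re

# Assumed turn format: "<speaker>_<4-digit turn number>: <text>"
TURN = re.compile(r"^(?P<turn_id>\w+_\d{4}):\s*(?P<text>.*)$")

def parse_turn(line):
    """Split one transcribed turn into its id, words, proper names, and markers."""
    match = TURN.match(line.strip())
    if match is None:
        raise ValueError(f"not a valid turn line: {line!r}")
    words, names, markers = [], [], []
    for token in match["text"].split():
        if len(token) > 2 and token.startswith("+") and token.endswith("+"):
            markers.append(token.strip("+"))   # e.g. +uhm+, +pause+, +/cos/+
        elif token.startswith("~"):
            names.append(token[1:])            # tagged proper name
            words.append(token[1:])
        else:
            words.append(token)
    return match["turn_id"], words, names, markers

line = ("tanja_0001: this sentence +uhm+ was spoken +pause+ by ~Tanja "
        "and +/cos/+ contains one restart")
turn_id, words, names, markers = parse_turn(line)
print(turn_id, words, names, markers)
```

Keeping one turn per line is what makes this kind of line-by-line filtering possible.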
Pronunciation Dictionary
For each word seen in the training set, a pronunciation has to be defined in terms of the phoneme set
Define an appropriate phoneme set: the atomic sounds of the language
Describe each word to be recognized in terms of this phoneme set
Example in English:
  I    AI
  you  J U
Strong grapheme-to-phoneme relation in Egyptian/Arabic IF the vocalization is transcribed (romanized transcription)
Grapheme-to-phoneme tool for Standard Arabic (collected in Tunisia and Palestine) already developed at CMU (master's student Jamal Abu-Alwan)
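As a rough illustration of how the dictionary is used, the sketch below stores the English example entries from the slides and performs the two operations the text implies: finding training words that still lack a pronunciation, and expanding a word sequence into phonemes. The names `PRONUNCIATIONS`, `missing_words`, and `to_phonemes` are hypothetical, and lowercasing the words is an assumption made for this example.

```python
# Example entries taken from the slides (word -> phoneme sequence).
PRONUNCIATIONS = {
    "i": ["AI"],
    "you": ["J", "U"],
    "am": ["AE", "M"],
    "are": ["A", "R"],
}

def missing_words(transcripts, dictionary):
    """Return all transcribed words that have no pronunciation yet."""
    seen = {word.lower() for line in transcripts for word in line.split()}
    return sorted(seen - set(dictionary))

def to_phonemes(sentence, dictionary):
    """Expand a sentence into its phoneme sequence; raises KeyError for unknown words."""
    phonemes = []
    for word in sentence.lower().split():
        phonemes.extend(dictionary[word])
    return phonemes

print(missing_words(["I am", "we are"], PRONUNCIATIONS))
print(to_phonemes("you are", PRONUNCIATIONS))
```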
Phoneme Set (i.e. Standard Arabic)
Phon. Symbol  Trans.  Name           Arabic Symbol
E             E       hamza          ء
AA            A~      wasla          آ
AE            Ae      hamza          أ, إ
O             O       hamza          ؤ
I             I       hamza          ئ
A             A       alif           ا
U             U       alif maksura   ى
B             B       ba             ب
TE            Te      ta marbuta     ة
T             T       ta             ت
TH            Th      sa             ث
J             J       jeem           ج
H7            7       ha             ح
KH            Kh      khaf           خ
D             D       daal           د
DH            Dh      thal           ذ
R             R       ra             ر
Z             Z       za             ز
S             S       seen           س
SH            Sh      sha            ش
SD            Sd      saad           ص
DD            Dd      daad           ض
TT            Tt      tta            ط
DS            D~      tha            ظ
E3            3       ain            ع
GH            Gh      gin            غ
F             F       fa             ف
Q             Q       qaaf           ق
K             K       kaaf           ك
L             L       lam            ل
M             M       mim            م
N             N       noon           ن
W             W       waw            و
Y             Y       yaa            ي
H             H       ha             ه
a             a       fatha          ـَ
u             u       damma          ـُ
i             i       kasra          ـِ
an            an      tanwin fatha   ـً
un            un      tanwin damma   ـٌ
in            in      tanwin kasra   ـٍ
Text Data
For training the language model we need a huge corpus of text data from the same domain
The language model helps guide the search
Compute probabilities of words, word pairs, and word triples
Millions of words are needed to calculate these probabilities
Text corpus should be as close as possible to the given domain
Writing systems must be the same
Other text might be useful as background information
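Counting words, word pairs, and word triples can be sketched as follows. This is a minimal, unsmoothed maximum-likelihood illustration rather than a full language modeling toolkit; the function names and the sentence boundary symbols `<s>` and `</s>` are assumptions, and the three-sentence corpus reuses the example phrases from the slides. It also hints at why millions of words are needed: with so little text, almost every n-gram is seen at most once.

```python
from collections import Counter

def ngram_counts(sentences, n):
    """Count all n-grams, padding each sentence with <s> and </s> markers."""
    counts = Counter()
    for sentence in sentences:
        tokens = ["<s>"] * (n - 1) + sentence.split() + ["</s>"]
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def bigram_probability(word, previous, sentences):
    """Unsmoothed estimate of P(word | previous) from relative frequencies."""
    bigrams = ngram_counts(sentences, 2)
    history = sum(c for (first, _), c in bigrams.items() if first == previous)
    return bigrams[(previous, word)] / history if history else 0.0

corpus = ["i am", "you are", "we are"]
print(ngram_counts(corpus, 1)[("are",)])        # how often "are" occurs
print(bigram_probability("are", "you", corpus))  # P(are | you)
print(bigram_probability("i", "<s>", corpus))    # P(i | sentence start)
```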
Computer Requirements
Data collection
  IBM-compatible PC
  High-quality sound card like Soundblaster
  Close-speaking microphone like Sennheiser H-420
  Operating system Win95
  Large hard disk: 16000 samples x 2 bytes per sec = approx. 30 kBytes/sec = approx. 2 MB/min = approx. 120 MB/hr, i.e. about 1.2 GBytes for 10 hours of spoken speech
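The disk space estimate can be checked with a short calculation. The sketch assumes uncompressed single-channel PCM audio at the stated 16 kHz and 16 bit (2 bytes per sample); the function name `pcm_storage_bytes` is hypothetical. The exact figures come out slightly different from the rounded numbers on the slide.

```python
def pcm_storage_bytes(seconds, sample_rate_hz=16000, bytes_per_sample=2):
    """Bytes needed for uncompressed mono PCM audio of the given duration."""
    return seconds * sample_rate_hz * bytes_per_sample

per_second = pcm_storage_bytes(1)
per_minute = pcm_storage_bytes(60)
per_hour = pcm_storage_bytes(3600)
ten_hours = pcm_storage_bytes(10 * 3600)

print(f"{per_second / 1e3:.0f} kB/sec")    # 32 kB/sec
print(f"{per_minute / 1e6:.2f} MB/min")    # 1.92 MB/min
print(f"{per_hour / 1e6:.1f} MB/hr")       # 115.2 MB/hr
print(f"{ten_hours / 1e9:.3f} GB / 10 hr") # 1.152 GB / 10 hr
```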
Speech Recognition
  Fast processor - as fast as possible
  RAM: 512 MB
  Additional 2-4 GBytes for temporary files during training and testing
Translation
  Donna, Lori?
Discussion
Speech recognizer for the Egyptian or the Standard Arabic language?
Egyptian
  Spoken (actually used) language more interesting for a human-to-human speech-to-speech translation system?
  Standardized pronunciation?
  Large text resources available in Egypt?
  Parser output follows Standard Arabic vocalization?
  Use Egyptian CallHome data and pronunciation dictionaries (LDC)?
Standard Arabic
  Useful to a larger community?
  Canonical pronunciation?
  Preliminary speech recognizer and data already available at CMU
  Larger text resources available?
Do we want monolingual dialogs (agent & client) or multilingual recordings?
Part 2
Initialization of an Egyptian Speech Recognition Engine
Multilingual Speech Recognition
Rapid Adaptation to new Languages
Initialization of Egyptian SR Engine
Rapid initialization of an Egyptian/Arabic speech recognizer?
Pronunciation dictionary: grapheme-to-phoneme tool available if vocalization and romanization are provided by the transliteration (trl)
Language model: text corpora, if vocalized
  Apply Egyptian parser for vocalization?
Acoustic models: initialization or adaptation according to our fast adaptation approach PDTS (Polyphone Decision Tree Specialization)
GlobalPhone
Multilingual Database
  Widespread languages
  Native speakers
  Uniformity
  Broad domain
  Huge text resources: Internet newspapers
Total sum of resources
  15 languages so far
  300 hours of speech data
  1400 native speakers
Languages: Arabic, Ch-Mandarin, Ch-Shanghai, English, French, German, Japanese, Korean, Croatian, Portuguese, Russian, Spanish, Swedish, Tamil, Turkish
Speech Recognition in Multiple Languages
Goal: Speech recognition in many different languages
Problem: Only little or no training data available (costs, time)
[Figure: for each language, sound system and speech data (10 hours) feed the acoustic model (AM), pronunciation rules feed the lexicon (Lex), and text data feed the language model (LM). Portuguese example: Lex entries ela /e/l/a/, eu /e/u/, sou /s/u/; LM word sequences "eu sou", "você é", "ela é"]
Multilingual Acoustic Modeling
Step 1:
  Combine acoustic models
  Share data across languages
Multilingual Acoustic Modeling
Sound production is human, not language specific: International Phonetic Alphabet (IPA)
1) Universal sound inventory based on IPA
   485 sounds are reduced to 162 IPA sound classes
2) Each sound class is represented by one "phoneme", which is trained through data sharing across languages
   m, n, s, l occur in all languages
   p, b, t, d, k, g, f and i, u, e, a, o occur in almost all languages
   no sharing of triphthongs and palatal consonants
Rapid Language Adaptation
Step 2:
  Use ML acoustic models, borrow data
  Adapt ML acoustic models to target language
[Figure: AM, Lex, and LM of the Portuguese example (ela /e/l/a/, eu /e/u/, sou /s/u/; "eu sou", "você é", "ela é")]
Rapid Language Adaptation
Model mapping to the target language
1) Map the multilingual phonemes to Portuguese ones based on the IPA scheme
2) Copy the corresponding acoustic models in order to initialize the Portuguese models
Problem: Contexts are language specific; how can context-dependent models be applied to a new target language?
Solution: Adaptation of multilingual contexts to the target language based on limited training data
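The two mapping steps can be pictured with a small sketch. It covers only the context-independent initialization (map via IPA, then copy), not the context adaptation itself. Everything here is hypothetical: the function name, the representation of an acoustic model as a small dictionary of parameters, the parameter values, and the Portuguese phoneme labels; the palatal "lh" is included to show a phoneme that finds no shared multilingual model.

```python
import copy

def initialize_target_models(target_to_ipa, multilingual_models):
    """Copy the multilingual model of the matching IPA class for each target phoneme.

    Returns the initialized models and the list of phonemes without a match.
    """
    initialized, unmapped = {}, []
    for phoneme, ipa_symbol in target_to_ipa.items():
        if ipa_symbol in multilingual_models:
            # Deep copy, so later adaptation does not change the shared model.
            initialized[phoneme] = copy.deepcopy(multilingual_models[ipa_symbol])
        else:
            unmapped.append(phoneme)
    return initialized, unmapped

# Hypothetical multilingual models, keyed by IPA symbol.
multilingual_models = {
    "e": {"mean": [0.1, 0.2]}, "l": {"mean": [0.3, 0.1]},
    "a": {"mean": [0.5, 0.4]}, "u": {"mean": [0.2, 0.6]},
    "s": {"mean": [0.9, 0.7]},
}
# Hypothetical Portuguese phoneme labels and their IPA classes.
portuguese_to_ipa = {"e": "e", "l": "l", "a": "a", "u": "u", "s": "s", "lh": "ʎ"}

models, unmapped = initialize_target_models(portuguese_to_ipa, multilingual_models)
print(sorted(models), unmapped)
```

Phonemes left in `unmapped` would need another starting point, for example a model trained directly on the limited target-language data.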
Language Adaptation Experiments
[Figure: bar chart of word error rate [%] (y-axis 0 to 100). Values: 69.1, 57.1, 49.9, 40.6, 32.8, 28.9, 19.6, 19. x-axis labels: 0, 0:15, 0:15, 0:25, 0:25, 0:25, 1:30, 16:30. Legend: Ø Tree, ML-Tree, Po-Tree, PDTS]
Summary
Multilingual database suitable for MLVCSR
  Covers the most widespread languages
  Language dependent recognition in 10 languages
Language independent acoustic modeling
  Global phoneme set that covers 10 languages
  Data sharing through multilingual models
Language adaptive speech recognition
  Limited amount of language specific data
Create speech engines in new target languages using only limited data, save time and money
Selected Publications
Tanja Schultz and Alex Waibel: Language Independent and Language Adaptive Acoustic Modeling. In: Speech Communication, to appear 2001.
Alex Waibel, Petra Geutner, Laura Mayfield-Tomokiyo, Tanja Schultz, and Monika Woszczyna: Multilinguality in Speech and Spoken Language Systems. In: Proceedings of the IEEE, Special Issue on Spoken Language Processing, Volume 88(8), pp. 1297-1313, August 2000.
Tanja Schultz and Alex Waibel: Polyphone Decision Tree Specialization for Language Adaptation. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP-2000), Istanbul, Turkey, June 2000.
Download from http://www.cs.cmu.edu/~tanja