AUTOMATIC PHONETIC ANNOTATION OF AN ORTHOGRAPHICALLY TRANSCRIBED
SPEECH CORPUS
Rui Amaral, Pedro Carvalho, Diamantino Caseiro, Isabel Trancoso, Luís Oliveira
IST, Instituto Superior Técnico
INESC, Instituto de Engenharia de Sistemas e Computadores
Summary
• Motivation
• System Architecture
– Module 1: Grapheme-to-phone converter (G2P)
– Module 2: Alternative transcriptions generator (ATG)
– Module 3: Acoustic signal processor
– Module 4: Phonetic decoder and aligner
• Training and Test Corpora
• Results
– Transcription and alignment (Development phase)
– Test corpus annotation (Evaluation phase)
• Conclusions and Future Work
Motivation
• Time-consuming, repetitive task (over 60 x real time)
• Large corpora processing
• No expert intervention
– No widely adopted standard procedures
– Error prone
– Inconsistencies among human annotators
System Architecture
[Block diagram: the orthographically transcribed speech corpus feeds the Grapheme-to-Phone Converter (driven by rules and a lexicon); its output passes through the Alternative Transcriptions Generator to the Phonetic Decoder/Aligner, which also receives the parameters computed by the Acoustic Signal Processor and produces the phonetically annotated speech corpus.]
- Module 1 -
Grapheme-to-Phone Converter
Modules of the Portuguese TTS system (DIXI)
• Text normalisation
– Special symbols, numerals, abbreviations and acronyms
• Broad Phonetic Transcription
– Careful pronunciation of each word
– Set of 200 rules
– Small exceptions dictionary (364 entries)
– SAMPA phonetic alphabet
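A converter of this kind can be sketched as an exceptions-dictionary lookup followed by ordered, longest-match-first rewrite rules. The rules, the exception entry and the SAMPA outputs below are illustrative toys, not the actual DIXI rule set:

```python
# Minimal sketch of a rule-based grapheme-to-phone converter: an
# exceptions dictionary is consulted first; otherwise ordered,
# longest-match-first rewrite rules map letter sequences to SAMPA
# phones. Rules and entries are illustrative only (no stress marks).

EXCEPTIONS = {'muito': 'm"u~j~tu'}  # irregular word handled by the dictionary

RULES = [  # (grapheme sequence, SAMPA phones), longest match first
    ("nh", "J"),
    ("ch", "S"),
    ("s", "s"),
    ("e", "@"),
    ("m", "m"),
    ("a", "6"),
    ("n", "n"),
    ("o", "u"),
]

def g2p(word: str) -> str:
    """Transcribe one lowercase word into a SAMPA phone string."""
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    phones, i = [], 0
    while i < len(word):
        for graph, phone in RULES:
            if word.startswith(graph, i):
                phones.append(phone)
                i += len(graph)
                break
        else:
            i += 1  # skip letters not covered by any rule
    return "".join(phones)

print(g2p("semana"))
```

With this toy rule set, "semana" comes out as s@m6n6, matching the broad transcription in the examples below except for the stress marker, which a real rule set would also place.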
- Module 2 -
Alternative Transcriptions Generator
Transformation of phone sequences into lattices
• Based on optional rules:
– Which account for:
» Sandhi
» Vowel reduction
– Specified using finite-state grammars and simple transduction operators, e.g. A (B C) D
Examples:
Type                              Text         Broad P.T.        Alternative P.T.
sandhi with vowel quality change  de uma       [d@ um6]          [djum6]
                                  mesmo assim  [m"eZmu 6s"i~]    [m"eZmw6s"I~]
sandhi with vowel reduction       de uma       [d@ um6]          [dum6]
                                  mesmo assim  [m"eZmu 6s"i~]    [m"eZm6s"i~]
vowel reduction                   semana       [s@m"6n6]         [sm"6n6]
                                  oito         ["ojtu]           ["ojt]
alternative pronunciations        restaurante  [R@Stawr"6~t]     [R@StOr"6~t]
                                  viagens      [vj"aZ6~j~S]      [vj"aZe~S]
Phrase “vou para a praia.”
Canonical P.T. [v"o p6r6 6 pr"aj6]
Narrow P. T. (most freq.) [v"o pr"a pr"ai6]
= sandhi + vowel reduction
Example (rules application):
Rules:
DEF_RULE 6a, ( (6 NULL) (sil NULL) (6 a) )
DEF_RULE pra, ( p ("6 NULL) r 6 )
[Lattice figure: the canonical phone sequence with optional arcs added by the 6a and pra rules, covering paths from the canonical to the most frequent narrow transcription.]
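The effect of optional rules can be sketched by enumerating every combination of applying or skipping each rule occurrence, a flat list standing in for the pronunciation lattice. The single vowel-reduction rule below, deleting unstressed [@], is a simplified stand-in for the real rule set:

```python
from itertools import product

def alternatives(phones, rules):
    """Enumerate the pronunciations obtained by optionally applying each
    rule occurrence; a flat stand-in for the pronunciation lattice.
    rules: list of (pattern, replacement); "" as replacement = deletion."""
    # collect non-overlapping (start, end, replacement) spans
    spans, covered = [], set()
    for pat, alt in rules:
        start = 0
        while (i := phones.find(pat, start)) != -1:
            if not covered & set(range(i, i + len(pat))):
                spans.append((i, i + len(pat), alt))
                covered |= set(range(i, i + len(pat)))
            start = i + 1
    spans.sort()
    # try every apply / don't-apply combination of the matched spans
    results = set()
    for flags in product([False, True], repeat=len(spans)):
        out, pos = [], 0
        for (b, e, alt), apply_rule in zip(spans, flags):
            out.append(phones[pos:b])
            out.append(alt if apply_rule else phones[b:e])
            pos = e
        out.append(phones[pos:])
        results.add("".join(out))
    return sorted(results)

# vowel reduction as optional deletion of [@] (stress marks omitted)
print(alternatives("s@m6n6", [("@", "")]))
```

For "semana" this yields both the broad form s@m6n6 and the reduced form sm6n6, mirroring the vowel-reduction row in the examples table.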
- Module 3 -
Acoustic Signal Processor
Extraction of acoustical signal characteristics
• Sampling: 16 kHz, 16 bits
• Parameterisation: MFCC (Mel-Frequency Cepstral Coefficients)
– Decoding: 14 coefficients, energy, 1st and 2nd order differences, 25 ms Hamming windows, updated every 10 ms.
– Alignment: 14 coefficients, energy, 1st and 2nd order differences, 16 ms Hamming windows, updated every 5 ms.
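The frame counts implied by these settings follow directly from the window and step sizes; the sample counts at 16 kHz below are derived from the figures above:

```python
# Frame/hop arithmetic for the two MFCC configurations at 16 kHz.
SR = 16000  # sampling rate in Hz

def frames(duration_s, win_ms, hop_ms):
    """Number of full analysis windows fitting in a signal."""
    n = int(duration_s * SR)
    win = SR * win_ms // 1000
    hop = SR * hop_ms // 1000
    return 1 + (n - win) // hop

# decoding: 25 ms Hamming window, 10 ms hop -> 400 / 160 samples
print(frames(1.0, 25, 10))
# alignment: 16 ms Hamming window, 5 ms hop -> 256 / 80 samples
print(frames(1.0, 16, 5))
```

The finer 5 ms hop used for alignment roughly doubles the temporal resolution of the boundaries, which is why the two configurations differ.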
- Module 4 -
Phonetic Decoder and Aligner
Selection of the phonetic transcription which is closest to the utterance
• Viterbi algorithm
• 2 x 60 HMM models
– Architecture:
» left-to-right
» 3-state
» 3-mixture
NOTE: modules 3 and 4 use Hidden Markov Model Toolkit (Entropic Research Labs)
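The Viterbi pass over a 3-state left-to-right model can be sketched as follows; the transition and observation log-probabilities are made-up toy values, and the real system uses HTK rather than this code:

```python
import math

# Toy Viterbi decoding over a 3-state left-to-right HMM, the topology
# named above. All probabilities are illustrative, not trained values.

NEG_INF = float("-inf")

def viterbi(obs_loglik, log_trans):
    """obs_loglik[t][s]: log P(o_t | state s); log_trans[p][s]: log transition.
    Starts in state 0, ends in the last state; returns the best state path."""
    n_states = len(obs_loglik[0])
    delta = [obs_loglik[0][s] if s == 0 else NEG_INF for s in range(n_states)]
    back = []
    for t in range(1, len(obs_loglik)):
        new, ptr = [], []
        for s in range(n_states):
            best_prev = max(range(n_states),
                            key=lambda p: delta[p] + log_trans[p][s])
            new.append(delta[best_prev] + log_trans[best_prev][s]
                       + obs_loglik[t][s])
            ptr.append(best_prev)
        delta = new
        back.append(ptr)
    # backtrace from the final state
    path = [n_states - 1]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

lt = math.log
# left-to-right transitions: stay in a state or advance by one
T = [[lt(0.5), lt(0.5), NEG_INF],
     [NEG_INF, lt(0.5), lt(0.5)],
     [NEG_INF, NEG_INF, lt(1.0)]]
# four observations favouring states 0, 0, 1, 2 in turn
O = [[lt(0.8), lt(0.1), lt(0.1)],
     [lt(0.8), lt(0.1), lt(0.1)],
     [lt(0.1), lt(0.8), lt(0.1)],
     [lt(0.1), lt(0.1), lt(0.8)]]
print(viterbi(O, T))
```

The same recursion, run over the pronunciation lattice rather than a single model, is what selects the transcription closest to the utterance.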
Training and Test Corpora
• Subset of the EUROM 1 multilingual corpus
– European Portuguese
– Collected in an anechoic room, 16 kHz, 16 bits.
– 5 male + 5 female speakers (few talkers)
– Prompt texts
» Passages:
• Paragraphs of 5 related sentences
• Free translations of the English version of EUROM 1
• Adapted from books and newspaper text
» Filler sentences:
• 50 sentences grouped in blocks of 5 sentences each
• Built to increase the number of different diphones in the corpus
– Manually annotated.
Training and Test Corpora (cont.)
Speaker  Passages               Filler sentences
1        O0-O4  O5-O9  P0-P4    F5-F9
2        O0-O4  O5-O9  P0-P4    F0-F4
3        P5-P9  Q0-Q4  Q5-Q9    F5-F9
4        P0-P4  P5-P9  Q0-Q4    F5-F9
5        O5-O9  P0-P4  P5-P9    F0-F4
6        P5-P9  Q0-Q4  Q5-Q9    F5-F9
7        O0-O4  O5-O9  P0-P4    F0-F4
8        Q0-Q4  Q5-Q9  R0-R4    F0-F4
9        R5-R9  O0-O4  O5-O9    F5-F9
10       Q5-Q9  R0-R4  R5-R9    F5-F9
(The slide marks which cells form the Training Corpus, Test Corpus 1 and Test Corpus 2.)
Passages: O0-O9, P0-P9: English translations; Q0-Q9, R0-R9: books and newspaper text.
Filler sentences: F0-F9.
Models               Transcription   Alignment
                     Precision       < 10 ms   90th percentile
HMM (transcription)  52.8 %          66.9 %    20 ms
HMM (alignment)      43 %            78.9 %    18 ms
Transcription and alignment results
• Transcription:
– Precision = ((correct - inserted) / total) x 100 %
• Alignment:
– % of cases in which the absolute error is < 10 ms
– absolute error below which 90 % of the cases fall
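The transcription precision formula can be reproduced directly; the counts below are illustrative, not taken from the results tables:

```python
def transcription_precision(correct: int, inserted: int, total: int) -> float:
    """Precision = ((correct - inserted) / total) x 100 %, as defined above:
    insertions are penalised by subtracting them from the correct count."""
    return (correct - inserted) * 100.0 / total

# illustrative counts: 880 correct phones, 30 insertions, 1000 reference phones
print(transcription_precision(correct=880, inserted=30, total=1000))
```

Subtracting insertions means a decoder cannot inflate its score by hypothesising extra phones, which matters when alternative transcriptions enlarge the search space.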
Annotation strategies and Results
Models      Transcription   Alignment
            Precision       < 10 ms   90th percentile
Strategy 1  85.3 %          77.4 %    20 ms
Strategy 2  85.8 %          44 %      29 ms
Strategy 3  85.8 %          78 %      19 ms
NOTE: Alignment evaluated only in places where the decoded sequence matched the manual sequence
            Transcription    Alignment
Strategy 1  HMM alignment    HMM alignment
Strategy 2  HMM recognition  HMM recognition
Strategy 3  HMM recognition  HMM alignment
Annotation results - Transcription -
• Comments
– Better precision achieved for canonical transcriptions of Test 2
– Highest global precision achieved in Test 1
– Successive application of the rules leads to better precision
Rules                                       Precision
                                            Test 1   Test 2
Canonical                                   74 %     76.9 %
Sandhi                                      77.1 %   79.4 %
Vowel reduction and alt. pronunciations     85.1 %   84.5 %
Annotation results - Alignment -
• Comments
– Better alignment obtained with the best decoder
– Some problematic transitions: vowels, nasal vowels and liquids
Rules                                     Test 1              Test 2
                                          < 10 ms   90 %      < 10 ms   90 %
Canonical                                 74.68 %   24 ms     75.18 %   25 ms
Sandhi                                    75.04 %   23 ms     75.41 %   24 ms
Vowel reduction and alt. pronunciations   78.76 %   19 ms     77.27 %   22 ms
Conclusions
• Better annotation results with:
– Alternative transcriptions (compared to canonical)
– Different models for alignment and recognition
• About 84 % precision in transcription and at most 22 ms alignment error for 90 % of the cases
Future Work
• Automatic rule inference
– 1st phase: comparison and selection of rules
– 2nd Phase: validation or phonetic-linguistic interpretation
• Annotation of other speech corpora to build better acoustic models
• Assignment of probabilistic information to the alternative pronunciations generated by rule
TOPIC ANNOTATION IN BROADCAST NEWS
Rui Amaral, Isabel Trancoso
IST, Instituto Superior Técnico
INESC, Instituto de Engenharia de Sistemas e Computadores
Preliminary work
• System Architecture
– Two-stage unsupervised clustering algorithm
» nearest-neighbour search method
» Kullback-Leibler distance measure
– Topic language models
» smoothed unigrams statistics
– Topic Decoder
» based on Hidden Markov Models (HMM)
NOTE: topic models created with CMU Cambridge Statistical Language Modelling Toolkit
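The clustering distance and the smoothed unigram models can be sketched as follows; add-one smoothing and the toy Portuguese word lists are assumptions, not the toolkit's actual configuration:

```python
import math
from collections import Counter

def unigram_model(tokens, vocab):
    """Smoothed unigram probabilities (add-one smoothing assumed here)."""
    counts = Counter(tokens)
    total = len(tokens) + len(vocab)
    return {w: (counts[w] + 1) / total for w in vocab}

def kl_divergence(p, q):
    """Kullback-Leibler distance D(p || q) over a shared vocabulary."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)

# toy topic "clusters" and a new story to assign (illustrative words)
docs = {
    "sports": "jogo golo equipa jogo golo".split(),
    "economy": "banco juros mercado banco".split(),
}
query = "golo jogo equipa".split()

vocab = sorted(set(sum(docs.values(), [])) | set(query))
q_model = unigram_model(query, vocab)
# nearest-neighbour search: pick the topic whose model is closest in KL
nearest = min(docs,
              key=lambda t: kl_divergence(q_model,
                                          unigram_model(docs[t], vocab)))
print(nearest)
```

Smoothing keeps every vocabulary word at non-zero probability, which the KL distance requires: an unseen word with zero probability in either model would make the divergence undefined.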
System Architecture
[Block diagram. Training phase: texts from the newspaper text corpus (topic unlabeled) pass through selection & filtering and clustering into clusters C1 … Ci … Ck, from which topic model generation produces TM1 … TMi … TMk and the topic HMM. Decoding phase: the topic HMM performs topic segmentation and labelling (topics T1 … Ti … Tk), producing topic-annotated texts; the newspaper text corpus (topic labeled) serves as reference.]
Training and Test Corpora
• Subset of the BD_PUBLICO newspaper text corpus
– 20,000 stories
– 6-month period (September 95 - February 96)
– topic annotated
– story size between 100 and 2000 words
– normalised text