LREC 2008 Ingunn Amdal, Ole Morten Strand, Jørn Almberg, and Torbjørn Svendsen
www.ntnu.no 2008-05-29 Rundkast at LREC 2008, Marrakech
LREC 2008
Ingunn Amdal, Ole Morten Strand, Jørn Almberg, and Torbjørn Svendsen
RUNDKAST: An Annotated Norwegian Broadcast News Speech Corpus
Overview
• Purpose of Rundkast
• An overview of the database Rundkast
• Structure of annotation
• Orthographic transcription
• Broad phonetic annotation
Purpose of Rundkast
Databases of broadcast news can be used for a number of research topics in speech technology such as:
• Supplement to existing databases of read speech for training and testing automatic speech recognition and speaker adaptation.
• Research on recognition of spontaneous speech.
• Research on automatic indexing of audio data.
• Research on topic and/or speaker segmentation.
• Research on speech/non-speech detection (e.g. background music).
• International research cooperation involving speech technology for broadcast news applications.
A corpus of this kind is necessary for language technology research, but none has been available for Norwegian.
Overview of Rundkast
http://www.iet.ntnu.no/projects/rundkast/
Database of 77 hours of radio broadcast news from the Norwegian Broadcasting Corporation (NRK):
• Read and spontaneous speech, as well as spontaneous dialogs and multi-party discussions
• Large variation between speakers, speaking styles, and topics
• Speaker turns may be rapid, and several speakers may talk simultaneously
• Recording quality includes both studio and telephone (mobile, satellite, etc.)
• Frequent occurrences of background noise, jingles, music, and audio illustrations
Funded by the Norwegian University of Science and Technology (NTNU).
Structure of annotation
Rundkast is hierarchically organized and orthographically annotated:
• Name of programme, type, and date
• Name of speaker (if known) and dialect (5 regions)
• Type of speech: spontaneity, channel, recording quality
• Segmented into speaker turns of approx. 2-5 seconds
• Orthographic transcription (standard Norwegian)
• Labels for noise (speaker noise, background noise, etc.)
• Labels for pronunciation mistakes, foreign words, unintelligible speech, etc.
• ~70 hours of work per hour of recording
Annotation tool: Transcriber (a ”standard” tool)
Hierarchy of annotation levels
[Figure: one episode file annotated at three levels]
annotation level 1 (section): report | filler | nontrans | • • • | report
annotation level 2 (speaker turn): speaker 1 | speaker 2 | no speaker | • • • | speaker 1
annotation level 3 (segment): [i] blah blah ... more blah ... [lp] | [b-]noisy blah[-b] ... | • • •
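The three annotation levels can be pictured as nested records. The following is only a minimal sketch of that nesting: the class names, field names, and time values are assumptions made for illustration and do not reflect the actual Transcriber file format.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Segment:
    """Level 3: a short stretch of speech with its orthographic transcription."""
    start: float  # seconds from the start of the episode (made-up values below)
    end: float
    text: str     # may contain bracketed codes such as [i] or [lp]


@dataclass
class SpeakerTurn:
    """Level 2: consecutive segments attributed to one speaker (or none)."""
    speaker: Optional[str]  # None models a "no speaker" turn
    segments: List[Segment] = field(default_factory=list)


@dataclass
class Section:
    """Level 1: a section of the episode, e.g. report, filler, or nontrans."""
    kind: str
    turns: List[SpeakerTurn] = field(default_factory=list)


@dataclass
class Episode:
    """One episode file holding the full three-level hierarchy."""
    name: str
    sections: List[Section] = field(default_factory=list)

    def segments(self):
        """Yield (section kind, speaker, segment) for every segment."""
        for section in self.sections:
            for turn in section.turns:
                for seg in turn.segments:
                    yield section.kind, turn.speaker, seg


episode = Episode(
    name="example-episode",  # hypothetical name
    sections=[
        Section("report", [
            SpeakerTurn("speaker 1", [Segment(0.0, 3.2, "[i] blah blah")]),
            SpeakerTurn("speaker 2", [Segment(3.2, 6.0, "more blah [lp]")]),
        ]),
        Section("filler", [SpeakerTurn(None, [])]),
    ],
)

# Flatten the hierarchy into one row per segment.
flat = [(kind, spk, seg.text) for kind, spk, seg in episode.segments()]
```

Flattening the hierarchy this way gives one row per segment while keeping the section type and speaker attached, which is a convenient shape for selecting training data by speech type.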
Orthographic transcription
• Segments, the lowest level in the annotation hierarchy, are transcribed orthographically.
• Orthographic transcription of spoken language is a challenge, especially for Norwegian. Using dialect is increasingly accepted, even in official settings.
• Most of the speech in RUNDKAST does not conform to any standard pronunciation.
• The aim of the conventions for the orthographic transcription in RUNDKAST is to minimize uncertainty about pronunciations and facilitate consistency.
Orthographic transcription: Main conventions
• Words are transcribed with the written forms closest to actual pronunciations. A limited number of interjections are allowed.
• Text codes are used to mark mispronunciations, truncations, and unknown words.
• Numbers and symbols are written out as words.
• Abbreviations are not used.
• Punctuation marks are restricted to comma, period, and question mark.
• Space is used between spelled letters, also when acronyms have spelled pronunciation.
• Capital letters are used in proper names, spellings, and acronyms, but not at the start of sentences.
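Some of these conventions lend themselves to automatic checking. The sketch below is a hypothetical checker, not a tool from the project. It covers only two rules (no digits, punctuation limited to comma, period, and question mark) and assumes that text codes are written in square brackets, as in the hierarchy figure. Hyphens and apostrophes are treated as disallowed here, which may be stricter than the actual conventions.

```python
import re

ALLOWED_PUNCTUATION = {",", ".", "?"}


def check_segment(text):
    """Return a list of convention violations found in one transcribed segment."""
    problems = []
    # Strip bracketed codes such as [lp] before checking (assumed code syntax).
    plain = re.sub(r"\[[^\]]*\]", " ", text)
    # Numbers must be written out as words, so any digit is a violation.
    if re.search(r"\d", plain):
        problems.append("digits found: numbers should be written out as words")
    # Anything that is not a letter, digit, space, or allowed mark is reported.
    bad = sorted({ch for ch in plain
                  if not ch.isalnum() and not ch.isspace()
                  and ch not in ALLOWED_PUNCTUATION})
    if bad:
        problems.append("disallowed characters: " + " ".join(bad))
    return problems


print(check_segment("han kom klokka tre, ikke sant?"))
print(check_segment("ja; det er 50%"))
```

The first call reports no problems; the second reports both the digits and the characters `%` and `;`.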
Example annotation in Transcriber
Broad phonetic annotation
• Part of the data were to be phonetically annotated
– Use for low-level experiments in ASR (new methods), smaller Norwegian counterpart to TIMIT
– Auto-segmentation for e.g. unit selection TTS
• Annotation to be based on existing standards
– with necessary adjustments
• Exploit experience and specifications from development of Norwegian speech synthesis databases
• ”Suitable” level of detail: Acoustic boundaries should be labeled, but more phonemic than phonetic
• Consistency of utmost importance!
Broad phonetic annotation: Selected data
• 10 speakers (5 male and 5 female)
• Amount of speech per speaker:
– approx. 5 min ”planned” speech and 1 min spontaneous speech
– discard noisy parts (as far as possible)
– from more than one programme
– use turn segmentation from orthographic annotation
• All in all 1 hour of speech
• Approximately 1000 hours of work
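The totals on this slide follow from the per-speaker figures. As a quick check, using only the approximate numbers given above (the variable names are illustrative):

```python
speakers = 10
planned_min = 5        # approx. minutes of "planned" speech per speaker
spontaneous_min = 1    # approx. minutes of spontaneous speech per speaker
work_hours = 1000      # approximate annotation effort stated on the slide

# Total annotated speech, converted from minutes to hours.
total_speech_hours = speakers * (planned_min + spontaneous_min) / 60
# Hours of manual work spent per hour of annotated speech.
effort_ratio = work_hours / total_speech_hours

print(f"{total_speech_hours:.1f} h of speech, "
      f"about {effort_ratio:.0f} h of work per hour of speech")
```

Taken at face value, that is roughly 14 times the ~70 hours of work per recorded hour quoted for the orthographic annotation, although both figures are approximate.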
Broad phonetic annotation: Main principles
• The annotation is mainly phonemic using the phoneme symbols closest to the perceived sound
• Acoustic boundaries should be marked; some acoustically motivated symbols are included
• A transcription as close as possible to the citation form is preferred
• Norwegian standard SAMPA is preferred
– Some English phonemes included as well as dialect variants
– Example: 3 variants of the /r/-sound: /r/ (tap/trill), /R/ (uvular fricative), /r\/ (approximant)
Broad phonetic annotation: Annotation procedure
1. Conversion of orthographic transcription to a format suitable for automatic transcription.
2. Automatic segmentation with a phonotypical transcription using a speech recognizer.
3. Manual correction of both segments and labels by four phonetics students using Praat.
4. Format check.
5. Review of all annotation by one supervisor.
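The slides do not say what the format check in step 4 covers. A plausible minimal version, sketched here under that caveat, verifies that the intervals on a tier are contiguous, have positive duration, and carry only symbols from the agreed inventory. Tiers are represented as plain (start, end, label) tuples rather than parsed Praat files, and the small inventory is hypothetical.

```python
def check_tier(intervals, inventory, tolerance=1e-6):
    """Return a list of format errors for one tier of (start, end, label) intervals."""
    errors = []
    previous_end = None
    for index, (start, end, label) in enumerate(intervals):
        if end <= start:
            errors.append(f"interval {index}: non-positive duration")
        # Each interval should begin exactly where the previous one ended.
        if previous_end is not None and abs(start - previous_end) > tolerance:
            errors.append(f"interval {index}: gap or overlap with previous interval")
        if label not in inventory:
            errors.append(f"interval {index}: unknown symbol {label!r}")
        previous_end = end
    return errors


# Hypothetical mini inventory; the real symbol set is not listed on the slides.
INVENTORY = {"r", "R", "r\\", "A:", "n", "sil"}

good = [(0.00, 0.12, "sil"), (0.12, 0.20, "r"), (0.20, 0.35, "A:")]
bad = [(0.00, 0.12, "sil"), (0.15, 0.20, "x"), (0.20, 0.20, "n")]

print(check_tier(good, INVENTORY))
for error in check_tier(bad, INVENTORY):
    print(error)
```

Running a check of this kind before the supervisor's pass means that step 5 can concentrate on phonetic judgement rather than on mechanical errors.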
Broad phonetic annotation: Comments on deviations
There are always cases of uncertainty, and these need to be logged.
Problem: will the log be read?
Solution: Codes for deviations!
• Additional Praat tier for deviations
• Synchronous with the phoneme tier
• Easy to utilize automatically
• Examples:
– creaky voice
– unexpected voiced/unvoiced
– uncertain boundary or symbol
• ... plus a log file for any remaining deviations
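A deviation tier that is synchronous with the phoneme tier can be used automatically by pairing the two tiers interval by interval. The sketch below assumes that "synchronous" means identical interval boundaries on both tiers. The code names ("creaky", "uncertain") and all time values are made up for illustration, since the actual codes are not given on the slide.

```python
def flagged_phonemes(phoneme_tier, deviation_tier, code):
    """Return phoneme intervals whose synchronous deviation interval carries `code`."""
    if len(phoneme_tier) != len(deviation_tier):
        raise ValueError("tiers are not synchronous: different number of intervals")
    hits = []
    for (p_start, p_end, symbol), (d_start, d_end, dev) in zip(phoneme_tier, deviation_tier):
        if (p_start, p_end) != (d_start, d_end):
            raise ValueError("tiers are not synchronous: boundaries differ")
        if dev == code:
            hits.append((p_start, p_end, symbol))
    return hits


# Made-up intervals; an empty deviation label means "no deviation".
phonemes = [(0.00, 0.08, "r"), (0.08, 0.21, "A:"), (0.21, 0.30, "n")]
deviations = [(0.00, 0.08, ""), (0.08, 0.21, "creaky"), (0.21, 0.30, "uncertain")]

creaky = flagged_phonemes(phonemes, deviations, "creaky")
print(creaky)
```

The same pairing could be used to exclude uncertain intervals from acoustic model training, which a free-text log would not allow without manual reading.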
Example annotation in Praat
Concluding remarks
• Availability:
– Planned to be included for non-commercial use in a future Norwegian language bank
– Will complement other corpora also intended to be included
• To be validated by Spex
• Planned use at NTNU: SIRKUS project
– Investigation in new paradigms for ASR
– Low-level phone recognition experiments initially
• multi-linguality aspects
– Spoken information retrieval