Linking transcriptions to spoken audio

John Coleman and Sergio Grau

Oxford University Phonetics Laboratory

http://www.phon.ox.ac.uk/SpokenBNC

Many thanks to

• Lou Burnard (re XML)

• Jiahong Yuan, UPenn (for P2FA aligner)

• Dave de Roure & Kevin Page (for discussions re linked data)

• John Pybus & Amir Nettler (for experiments with streamed audio fragments)

• for £££

Outline of our talk:

• Large audio corpora and their challenges

• Mining a Year of Speech

• Random access to audio snippets

Multimedia dominates the internet

• 2005: YouTube launched

• 2008: YouTube surpasses Yahoo as world’s No. 2 search engine

• 2011: video/audio dominates peak-time bandwidth in North America

Some browsable audio corpora

• www.oyez.org (US Supreme Court recordings)

• whitehousetapes.net (1940-1973)

• www.scottishcorpus.ac.uk (Scottish Corpus of Texts and Speech)

• http://sounds.bl.uk/ (British Library Archival Sound Recordings)

Challenges of very large audio collections of spoken language

How does a researcher find audio segments of interest?

How do audio corpus providers mark them up to facilitate searching and browsing?

How to make very large scale audio collections accessible?

Server-side challenges

Amount of material

Storage

– CD quality audio: 635 MB/hour

– Uncompressed .wav files: 115 MB/hour, 1.02 TB/year

– Library/archive .wav files: 1 GB/hr, 9 TB/yr
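
The storage rates quoted above follow from the sampling parameters. A minimal sketch, assuming 16-bit PCM, 16 kHz mono for the speech .wav files, 44.1 kHz stereo for CD quality, and decimal units; the slide does not state these parameters, they are inferred because they reproduce the quoted hourly rates:

```python
# Assumed parameters (not stated on the slide): 16-bit PCM, decimal megabytes.
SECONDS_PER_HOUR = 3600
HOURS_PER_YEAR = 24 * 365


def mb_per_hour(sample_rate_hz: int, bytes_per_sample: int, channels: int) -> float:
    """Uncompressed PCM data rate in megabytes (10**6 bytes) per hour of audio."""
    return sample_rate_hz * bytes_per_sample * channels * SECONDS_PER_HOUR / 1e6


cd_quality = mb_per_hour(44_100, 2, 2)   # 44.1 kHz, 16-bit, stereo
speech_wav = mb_per_hour(16_000, 2, 1)   # 16 kHz, 16-bit, mono
speech_tb_per_year = speech_wav * HOURS_PER_YEAR / 1e6  # MB -> TB

print(f"CD quality:  {cd_quality:.0f} MB/hour")
print(f"Speech .wav: {speech_wav:.1f} MB/hour, {speech_tb_per_year:.2f} TB/year")
```

Under these assumptions a year of continuous speech comes to roughly 1 TB, in line with the figure on the slide.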

1 TB (1000 GB) hard drive: c. £65 Now £39.95!

Spoken audio = 250 times XML

---

Server-side challenges

Audio format issues

– Uncompressed .wav files: 115 MB/hour

– Temptation to use compressed formats

– For speech analysis, low bitrate compression (40 kbit/s) is pretty disastrous

– Spectral centre-of-gravity measures are unreliable even at higher compression rates, but pitch and formant estimation is OK

van Son (2005) Acta Acustica with Acustica 91: 771-778

Challenges

• Amount of material

• Computing

– distance measures, etc.

– alignment of labels

– searching and browsing

– Just reading or copying 9 TB takes >1 day

– Download time: days or weeks
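
A rough calculation behind the copy and download times. This is a sketch only: the sustained disk throughput of 100 MB/s and the 100 Mbit/s network link are illustrative assumptions, not figures from the talk.

```python
def transfer_days(size_tb: float, rate_bytes_per_s: float) -> float:
    """Days needed to move size_tb terabytes (10**12 bytes) at a sustained rate."""
    return size_tb * 1e12 / rate_bytes_per_s / 86_400


ARCHIVE_TB = 9                 # one year of library/archive .wav audio
DISK_RATE = 100e6              # assumed: 100 MB/s sustained read
NETWORK_RATE = 100e6 / 8       # assumed: 100 Mbit/s link, expressed in bytes/s

copy_days = transfer_days(ARCHIVE_TB, DISK_RATE)
download_days = transfer_days(ARCHIVE_TB, NETWORK_RATE)
print(f"copy: {copy_days:.1f} days, download: {download_days:.1f} days")
```

With these assumed rates a single pass over the data takes just over a day, and a download takes more than a week.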

How large?

Some biggish transcribed corpora:

• Switchboard corpus: 13 days (included in MYS)

• Spoken Dutch: 1 month, only a fraction transcribed

• Spoken Spanish: 110 hours

• OSU Buckeye Corpus: 2 days

• Wellington Corpus, NZ: 3 days

• Mining a Year of Speech: 218 days so far, on track towards 3.6 years (>1200 days)

The “Year of Speech”

A grove of corpora, held at various sites with a common indexing scheme and search tools:

US English:

• 2,240 hours of telephone conversations

• 1,255 hours of broadcast news

• Talk show conversations (1,000 hrs), Supreme Court oral arguments (5,000 hrs), political speeches and debates

British English: Spoken audio part of the British National Corpus

• >7.4 million words of transcribed speech

• 1,400 hours

• Digitized in collaboration with the British Library

Analogue audio in libraries

British Library: >1m disks and tapes, 5% digitized

Library of Congress Recorded Sound Reference Center: >2m items, including …

International Storytelling Foundation: >8000 hrs of audio and video

European broadcast archives: >20m hrs (2,283 years) cf. Large Hadron Collider

• 74% on ¼” tape

• 19% shellac and vinyl

• 7% digital

Analogue audio in libraries

World wide: ~100m hours (11,415 yrs) analogue, i.e. 4-5 Large Hadron Colliders!

Cost of professional digitization and cataloguing: ~£20/$32 per tape (e.g. C-90 cassette)

Using speech recognition and natural language technologies (e.g. summarization) could provide more detailed cataloguing/indexing without time-consuming human listening

Why so large? Lopsided sparsity

Top ten words each occur 58,000 times: I, You, it, the, 's, and, n't, a, That, Yeah

12,400 words (23%) only occur once
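
The two statistics on this slide (the most frequent words, and the share of word types that occur only once) can be computed from any tokenized transcript. A minimal sketch on a toy token list; the function name `sparsity_summary` and the tokens are illustrative, not BNC data:

```python
from collections import Counter


def sparsity_summary(tokens, top_n=10):
    """Return the top_n most frequent word types and the share of types seen once."""
    counts = Counter(tokens)
    if not counts:
        return [], 0.0
    top = counts.most_common(top_n)
    hapaxes = [word for word, count in counts.items() if count == 1]
    return top, len(hapaxes) / len(counts)


# Toy token list (illustrative only).
tokens = "yeah I know you know it 's the one that I saw and it was n't a big one".split()
top, hapax_share = sparsity_summary(tokens, top_n=3)
print(top)
print(f"{hapax_share:.0%} of word types occur once")
```

Even in this tiny sample most word types occur only once, which is the lopsidedness that drives the need for very large corpora.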

Why so large? Lopsided sparsity

A rule of thumb

To catch most

• English sounds, you need minutes of audio

• common words of English … a few hours

• a typical person's vocabulary … >100 hrs

• pairs of common words … >1000 hrs

• arbitrary word-pairs … >100 years

Main problem in large corpora

Finding needles in the haystack

To address that challenge, we think there are two “killer apps”

• Forced alignment

• Data linking, or at least open exposure of digital material, coupled with cross-searching

Practicalities

• In order to be of much practical use, such very large corpora must be indexed at word and segment level

• All included speech corpora must therefore have associated text transcriptions

• We’re using P2FA, the Penn Phonetics Laboratory Forced Aligner, to associate each word and segment with the corresponding start and end points in the sound files

Mining (indexing by forced alignment)

x 21 million

Mining (indexing by forced alignment)

Mining (a needle in a haystack)

Mining (a diamond in the rough)

Challenges for alignments

Problems with documentation and records

• Transcription errors

• Long untranscribed portions

• Some transcribed regions with no audio (lost in copying)


• Broadcast recordings may include untranscribed commercials

• Transcripts generally edit out dysfluencies

• Political speeches may be extemporized, departing from the published script


• Overlapping speakers

• Background noise/music/babble

• Variable signal loudness

• Reverberation

• Distortion

• Poor speaker vocal health/voice quality

• Unexpected accents: need multidialect pronouncing dictionary

Issues we’re still grappling with

• No standards for adding phonemic transcriptions and timing information to XML transcriptions

• Many different possible schemes

• How to decide?

Enabling other corpora to be brought in in future

Promoting common standards for audio with linked transcription

?<w c5="AV0" hw="well" pos="ADV" >Well </w>

Automatic Speech-to-Phoneme alignment

Aligner output to extended XML

• HTK example:

0.5625 0.6125 "IH1"
0.6125 0.8225 "T"

0.5625 0.8225 "IT"

• HTK output + XML -> extended XML

• How to represent the obtained time information within the existing TEI-XML structure?
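
One way to picture the step from aligner output towards XML is to read each aligned interval into a record and attach the phone intervals to the word interval that spans them. A minimal sketch, assuming the output has been reduced to whitespace-separated `start end "label"` lines; this layout and the names `Interval`, `parse_intervals` and `phones_within` are illustrative assumptions, not the actual HTK or P2FA file format:

```python
import re
from typing import NamedTuple


class Interval(NamedTuple):
    start: float   # seconds
    end: float     # seconds
    label: str


# Assumed line layout: start end "label"
LINE = re.compile(r'^\s*([\d.]+)\s+([\d.]+)\s+"([^"]*)"\s*$')


def parse_intervals(text: str) -> list[Interval]:
    """Parse lines of the assumed form into Interval records, skipping other lines."""
    intervals = []
    for line in text.splitlines():
        match = LINE.match(line)
        if match:
            start, end, label = match.groups()
            intervals.append(Interval(float(start), float(end), label))
    return intervals


def phones_within(word: Interval, phones: list[Interval]) -> list[Interval]:
    """Return the phone intervals lying inside the word's time span."""
    return [p for p in phones if p.start >= word.start and p.end <= word.end]


phones = parse_intervals('0.5625 0.6125 "IH1"\n0.6125 0.8225 "T"')
word = parse_intervals('0.5625 0.8225 "IT"')[0]
print(word.label, [p.label for p in phones_within(word, phones)])
```

Keeping words and phones as separate interval lists mirrors the two tiers an aligner typically produces, and leaves the choice of XML encoding open.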

Integrating alignment information in the TEI-XML structure

• Time information

– Word level

– Phoneme level

• Phonemic representation of each word

• Timeline

Other representations: EXMARaLDA

EXMARaLDA: “Extensible Markup Language for Discourse Annotation” http://www.exmaralda.org/

<common-timeline>
  <tli id="T0" time="0.0"/>
  <tli id="T1" time="1.309974117691172"/>
  <tli id="T2" time="1.899962460773455"/>
  <tli id="T3" time="2.3399537674788866"/>
  ....
<tier id="TIE0" speaker="SPK0" category="v" type="t" display-name="PRE [v]">
  <event start="T2" end="T3">Good evening. </event>
  <event start="T5" end="T6">I have with me tonight Ann Elk Mistress Ann Elk. </event>
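
In a timeline representation like this one, events carry no times of their own: each `start` and `end` attribute points at a `tli` entry. A minimal sketch of resolving those references with Python's standard `xml.etree.ElementTree`, assuming a cut-down fragment wrapped in a hypothetical `<transcription>` root, with the closing tags that the slide excerpt omits and with times rounded for readability (a real EXMARaLDA file has more structure around these elements):

```python
import xml.etree.ElementTree as ET

# Cut-down fragment. The <transcription> root and the closing tags are
# assumptions added so that the excerpt parses; times are rounded.
DOC = """
<transcription>
  <common-timeline>
    <tli id="T0" time="0.0"/>
    <tli id="T1" time="1.31"/>
    <tli id="T2" time="1.90"/>
    <tli id="T3" time="2.34"/>
  </common-timeline>
  <tier id="TIE0" speaker="SPK0" category="v" type="t" display-name="PRE [v]">
    <event start="T2" end="T3">Good evening. </event>
  </tier>
</transcription>
"""


def resolve_events(xml_text):
    """Yield (start_seconds, end_seconds, text) for every event in the fragment."""
    root = ET.fromstring(xml_text)
    times = {tli.get("id"): float(tli.get("time")) for tli in root.iter("tli")}
    for event in root.iter("event"):
        text = (event.text or "").strip()
        yield times[event.get("start")], times[event.get("end")], text


events = list(resolve_events(DOC))
print(events)
```

The indirection costs one lookup per event, but it lets several tiers share the same time points.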

Other representations: Voices of the Holocaust

http://voices.iit.edu/xml/voth_project_tei_example.xml

<div corresp="#transcription_id">
  <div xml:lang="en">
    <u who="#interviewer_id" start="1.631">This is the first utterance of the interviewer.</u>
    <u who="#interviewee_id" start="2.465">This is the first utterance of the interviewee.</u>
  </div>

Other representations: IFA Dialog Video corpus, Phonetic Sciences, University of Amsterdam

van Son, R., Wesseling, W., Sanders, E., and van den Heuvel, H., The IFADV corpus: A free dialog video corpus, LREC’08, Marrakech, 2008

<TIME_ORDER>
  <TIME_SLOT TIME_SLOT_ID="ts1" TIME_VALUE="0"/>
  <TIME_SLOT TIME_SLOT_ID="ts2" TIME_VALUE="10"/>
  <TIME_SLOT TIME_SLOT_ID="ts3" TIME_VALUE="462"/>
  <TIME_SLOT TIME_SLOT_ID="ts4" TIME_VALUE="840"/>
  ...
<ANNOTATION>
  <ALIGNABLE_ANNOTATION ANNOTATION_ID="a1" TIME_SLOT_REF1="ts4" TIME_SLOT_REF2="ts7">
    <ANNOTATION_VALUE>beginnen we weer opnieuw?</ANNOTATION_VALUE>
  </ALIGNABLE_ANNOTATION>
</ANNOTATION>

Other representations: Labb-Cat (ONZE Miner)

http://onzeminer.sourceforge.net

Transcriber or Praat representation

Other representations: Transcriber

http://trans.sourceforge.net

<Turn speaker="spk2" startTime="0.557" endTime="5.851">
  <Sync time="0.557"/>
  so what do you know of your family ’s
  <Sync time="2.255"/>
  history like
  <Sync time="3.410"/>
  do you know when and why they came to Oxford
</Turn>

Other representations: COLT Corpus

http://www.hd.uib.no/colt/

– Sentence level

<u who=5 id=1 time=0.112> But I must see Mr <name> [smile again.]
<u who=1 id=2 time=2.016> [<unclear> spoiled again?]
...

– Word level

<u who=5 id=1 time=0.112>
<Audio word=BUT time=0.112 durn=0.176>But</Audio>
<Audio word=I time=0.288 durn=0.064>I</Audio>
<Audio word=MUST time=0.352 durn=0.304>must</Audio>
<Audio word=SEE time=0.816 durn=0.352>see</Audio>
<Audio word=MR time=1.168 durn=0.160>Mr</Audio>
...

Other representations: Summary

• Mostly sentence/word level time information representation

• No phoneme analysis

• No phoneme time information

• Timeline representation

• TEI standard?

• Extended TEI-XML with time and phoneme information

<u who="D94PSUNK">
  <s n="3">
    <w c5="VVD" hw="want" pos="VERB">Wanted </w>
    <w c5="PNP" hw="i" pos="PRON">me </w>
    <w c5="TO0" hw="to" pos="PREP">to</w>
    <c c5="PUN">.</c>
  </s>
</u>

<u who="D94PSUNK">
  <s n="3">
    <w ana="#D94:0083:11" c5="VVD" hw="want" pos="VERB">Wanted </w>
    <w ana="#D94:0083:12" c5="PNP" hw="i" pos="PRON">me </w>
    <w ana="#D94:0083:13" c5="TO0" hw="to" pos="PREP">to</w>
    <c c5="PUN">.</c>

<fs xml:id="D94:0083:11">
  <f name="orth">wanted</f>
  <f name="phon_ana">
    <vcoll type="lst">
      <symbol synch="#D94:0083:11:0" value="W"/>
      <symbol synch="#D94:0083:11:1" value="AO1"/>
      <symbol synch="#D94:0083:11:2" value="N"/>
      <symbol synch="#D94:0083:11:3" value="AH0"/>
      <symbol synch="#D94:0083:11:4" value="D"/>
    </vcoll>
  </f>
</fs>
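
A feature structure of this shape can be generated mechanically from the aligner's phone list. A minimal sketch using Python's standard `xml.etree.ElementTree`, assuming each phone's `synch` pointer is the word identifier plus a running index, as in the example; the function name `phone_fs` is hypothetical, and the element and attribute names are copied from the slide rather than checked against a TEI schema:

```python
import xml.etree.ElementTree as ET

# Namespaced attribute name; ElementTree serializes it as xml:id.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def phone_fs(word_id: str, orth: str, phones: list[str]) -> ET.Element:
    """Build an <fs> element listing the phones of one aligned word."""
    fs = ET.Element("fs", {XML_ID: word_id})
    ET.SubElement(fs, "f", name="orth").text = orth
    phon = ET.SubElement(fs, "f", name="phon_ana")
    vcoll = ET.SubElement(phon, "vcoll", type="lst")
    for index, phone in enumerate(phones):
        # Assumed pointer scheme: word identifier plus the phone's position.
        ET.SubElement(vcoll, "symbol", synch=f"#{word_id}:{index}", value=phone)
    return fs


fs = phone_fs("D94:0083:11", "wanted", ["W", "AO1", "N", "AH0", "D"])
print(ET.tostring(fs, encoding="unicode"))
```

Each `<w>` element then only needs an `ana` pointer to the generated `<fs>`, so the original BNC markup is extended rather than rewritten.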

Q. When you have an indexing scheme and a big database, what do you want to do with it?

A. Random access to audio snippets

Random access to audio snippets

• Timing of fragments in URL

• e.g. Gaudi (Google Labs), everyzing.com (ramp.com)

• http://audio.weei.com/search?q=something

• http://audio.weei.com/a/42828235/red-sox-pregame-show.htm#q=something&seek=311.989

Random access to audio snippets

• Audio objects in HTML5 (in the browser), e.g. http://www.phon.ox.ac.uk/jcoleman/useful_test.html

• W3C media fragments protocol, e.g. http://www.w3.org/2008/WebVideo/Fragments/

Demo: http://ninsuna.elis.ugent.be/MediaFragmentsPlayer
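
With word-level timings in hand, a snippet can be addressed by appending a temporal fragment to the audio URL. A minimal sketch of the W3C Media Fragments temporal syntax (`#t=start,end`, in seconds), assuming a hypothetical audio URL and a word interval taken from alignment output; the function name `snippet_url` is illustrative:

```python
from urllib.parse import urldefrag


def snippet_url(audio_url: str, start: float, end: float) -> str:
    """Return audio_url with a temporal media fragment for start..end seconds."""
    if not 0 <= start < end:
        raise ValueError("need 0 <= start < end")
    base, _ = urldefrag(audio_url)   # drop any existing fragment
    return f"{base}#t={start},{end}"


# Hypothetical file location; the interval is illustrative.
url = snippet_url("http://example.org/audio/D94.wav", 0.5625, 0.8225)
print(url)
```

Whether playback actually starts and stops at those times depends on the client and server supporting media fragments.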

URNs for audio snippets

• Linked data/semantic web approach: refer to each specific word, phoneme etc. as a specific audio object, not just a time range inside an audio file

• Challenge: need for an ontology for sounds and sound timelines in audio recordings

• Some progress in music ontologies

Conclusion

• Sound and multimedia corpora/collections are getting very big

• In fact multimedia, not text, dominates the internet

• So, we need some standard ways for representing audio structure and accessing its parts

• Forced alignment allows us to map transcriptions to audio, reasonably accurately

• For searching, there are several “demonstration” possibilities, but this is still work in progress

Thank you very much!
