![Page 1: Linking transcriptions to spoken audio John Coleman and Sergio Grau Oxford University Phonetics Laboratory .](https://reader034.fdocuments.in/reader034/viewer/2022051819/551b30ee550346d41a8b4e65/html5/thumbnails/1.jpg)
Linking transcriptions to spoken audio
John Coleman and Sergio Grau
Oxford University Phonetics Laboratory
http://www.phon.ox.ac.uk/SpokenBNC
![Page 2: Linking transcriptions to spoken audio John Coleman and Sergio Grau Oxford University Phonetics Laboratory .](https://reader034.fdocuments.in/reader034/viewer/2022051819/551b30ee550346d41a8b4e65/html5/thumbnails/2.jpg)
Many thanks to
• Lou Burnard (re XML)
• Jiahong Yuan, UPenn (for P2FA aligner)
• Dave de Roure & Kevin Page (for discussions re linked data)
• John Pybus & Amir Nettler (for experiments with streamed audio fragments)
• for £££
![Page 3: Linking transcriptions to spoken audio John Coleman and Sergio Grau Oxford University Phonetics Laboratory .](https://reader034.fdocuments.in/reader034/viewer/2022051819/551b30ee550346d41a8b4e65/html5/thumbnails/3.jpg)
Outline of our talk:
• Large audio corpora and their challenges
• Mining a Year of Speech
• Random access to audio snippets
![Page 4: Linking transcriptions to spoken audio John Coleman and Sergio Grau Oxford University Phonetics Laboratory .](https://reader034.fdocuments.in/reader034/viewer/2022051819/551b30ee550346d41a8b4e65/html5/thumbnails/4.jpg)
Multimedia dominates the internet
• 2005: YouTube launched
• 2008: YouTube surpasses Yahoo as world’s No. 2 search engine
• 2011: video/audio dominates peak-time bandwidth in North America
![Page 5: Linking transcriptions to spoken audio John Coleman and Sergio Grau Oxford University Phonetics Laboratory .](https://reader034.fdocuments.in/reader034/viewer/2022051819/551b30ee550346d41a8b4e65/html5/thumbnails/5.jpg)
Some browsable audio corpora
• www.oyez.org (US Supreme Court recordings)
• whitehousetapes.net (1940-1973)
• www.scottishcorpus.ac.uk (Scottish Corpus of Texts and Speech)
• http://sounds.bl.uk/ (British Library Archival Sound Recordings)
![Page 6: Linking transcriptions to spoken audio John Coleman and Sergio Grau Oxford University Phonetics Laboratory .](https://reader034.fdocuments.in/reader034/viewer/2022051819/551b30ee550346d41a8b4e65/html5/thumbnails/6.jpg)
![Page 7: Linking transcriptions to spoken audio John Coleman and Sergio Grau Oxford University Phonetics Laboratory .](https://reader034.fdocuments.in/reader034/viewer/2022051819/551b30ee550346d41a8b4e65/html5/thumbnails/7.jpg)
Challenges of very large audio collections of spoken language
How does a researcher find audio segments of interest?
How do audio corpus providers mark them up to facilitate searching and browsing?
How to make very large scale audio collections accessible?
![Page 8: Linking transcriptions to spoken audio John Coleman and Sergio Grau Oxford University Phonetics Laboratory .](https://reader034.fdocuments.in/reader034/viewer/2022051819/551b30ee550346d41a8b4e65/html5/thumbnails/8.jpg)
Server-side challenges
Amount of material
Storage
– CD quality audio: 635 MB/hour
– Uncompressed .wav files: 115 MB/hour
– 1.02 TB/year
– Library/archive .wav files: 1 GB/hr, 9 TB/yr
1 TB (1000 GB) hard drive: c. £65 Now £39.95!
Spoken audio = 250 times XML
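The storage figures on this slide can be checked with a little arithmetic. The sketch below assumes uncompressed PCM audio; the CD parameters (44.1 kHz, 16-bit, stereo) are standard, while the 16 kHz, 16-bit, mono setting is only an assumption that happens to reproduce the 115 MB/hour figure and gives roughly 1 TB per year.

```python
def mb_per_hour(sample_rate_hz, bits_per_sample, channels):
    """Size of one hour of uncompressed PCM audio, in megabytes (10**6 bytes)."""
    bytes_per_second = sample_rate_hz * (bits_per_sample // 8) * channels
    return bytes_per_second * 3600 / 1e6


cd = mb_per_hour(44_100, 16, 2)      # CD quality: 44.1 kHz, 16-bit, stereo
speech = mb_per_hour(16_000, 16, 1)  # assumed speech format: 16 kHz, 16-bit, mono

print(f"CD quality:  {cd:.0f} MB/hour")
print(f"16 kHz mono: {speech:.0f} MB/hour")
# A year of continuous audio, in terabytes (10**12 bytes).
print(f"16 kHz mono, one year: {speech * 24 * 365 / 1e6:.2f} TB")
```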
---
![Page 9: Linking transcriptions to spoken audio John Coleman and Sergio Grau Oxford University Phonetics Laboratory .](https://reader034.fdocuments.in/reader034/viewer/2022051819/551b30ee550346d41a8b4e65/html5/thumbnails/9.jpg)
Server-side challenges
Audio format issues
– Uncompressed .wav files: 115 MB/hour
– Temptation to use compressed formats
– For speech analysis, low bitrate compression (40 kbit/s) is pretty disastrous
– Spectral centre-of-gravity measures are unreliable even at higher compression rates, but pitch and formant estimation is OK
van Son (2005) Acta Acustica with Acustica 91: 771-778
![Page 10: Linking transcriptions to spoken audio John Coleman and Sergio Grau Oxford University Phonetics Laboratory .](https://reader034.fdocuments.in/reader034/viewer/2022051819/551b30ee550346d41a8b4e65/html5/thumbnails/10.jpg)
Challenges
• Amount of material
• Computing
– distance measures, etc.
– alignment of labels
– searching and browsing
– Just reading or copying 9 TB takes >1 day
– Download time: days or weeks
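A rough feel for the copy and download times: the sketch below divides the 9 TB collection size by a sustained transfer rate. Both rates (100 MB/s for a disk read, 100 Mbit/s for a network link) are illustrative assumptions, not measurements from the project.

```python
SECONDS_PER_DAY = 86_400


def transfer_days(size_bytes, bytes_per_second):
    """Days needed to move size_bytes at a constant rate."""
    return size_bytes / bytes_per_second / SECONDS_PER_DAY


NINE_TB = 9e12  # 9 TB, using 1 TB = 10**12 bytes

disk_copy = transfer_days(NINE_TB, 100e6)     # assumed 100 MB/s sustained read
download = transfer_days(NINE_TB, 100e6 / 8)  # assumed 100 Mbit/s link

print(f"disk copy: {disk_copy:.1f} days")
print(f"download:  {download:.1f} days")
```

With these assumed rates the copy takes just over a day and the download over a week; slower links push the download into weeks.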
![Page 11: Linking transcriptions to spoken audio John Coleman and Sergio Grau Oxford University Phonetics Laboratory .](https://reader034.fdocuments.in/reader034/viewer/2022051819/551b30ee550346d41a8b4e65/html5/thumbnails/11.jpg)
How large?
Some biggish transcribed corpora:
• Switchboard corpus: 13 days (included in MYS)
• Spoken Dutch: 1 month, only a fraction transcribed
• Spoken Spanish: 110 hours
• OSU Buckeye Corpus: 2 days
• Wellington Corpus, NZ: 3 days
• Mining a Year of Speech: 218 days so far, on track towards 3.6 years (>1200 days)
![Page 12: Linking transcriptions to spoken audio John Coleman and Sergio Grau Oxford University Phonetics Laboratory .](https://reader034.fdocuments.in/reader034/viewer/2022051819/551b30ee550346d41a8b4e65/html5/thumbnails/12.jpg)
The “Year of Speech”
A grove of corpora, held at various sites with a common indexing scheme and search tools:
US English:
• 2,240 hours of telephone conversations
• 1,255 hours of broadcast news
• Talk show conversations (1,000 hrs), Supreme Court oral arguments (5,000 hrs), political speeches and debates
British English: Spoken audio part of the British National Corpus
• >7.4 million words of transcribed speech
• 1,400 hours
• Digitized by collaboration with British Library
![Page 13: Linking transcriptions to spoken audio John Coleman and Sergio Grau Oxford University Phonetics Laboratory .](https://reader034.fdocuments.in/reader034/viewer/2022051819/551b30ee550346d41a8b4e65/html5/thumbnails/13.jpg)
Analogue audio in libraries
British Library: >1m disks and tapes, 5% digitized
Library of Congress Recorded Sound Reference Center: >2m items, including …
International Storytelling Foundation: >8000 hrs of audio and video
European broadcast archives: >20m hrs (2,283 years) cf. Large Hadron Collider
• 74% on ¼” tape
• 19% shellac and vinyl
• 7% digital
![Page 14: Linking transcriptions to spoken audio John Coleman and Sergio Grau Oxford University Phonetics Laboratory .](https://reader034.fdocuments.in/reader034/viewer/2022051819/551b30ee550346d41a8b4e65/html5/thumbnails/14.jpg)
Analogue audio in libraries
World wide: ~100m hours (11,415 yrs) analogue, i.e. 4-5 Large Hadron Colliders!
Cost of professional digitization and cataloguing: ~£20/$32 per tape (e.g. C-90 cassette)
Using speech recognition and natural language technologies (e.g. summarization) could provide more detailed cataloguing/indexing without time-consuming human listening
![Page 15: Linking transcriptions to spoken audio John Coleman and Sergio Grau Oxford University Phonetics Laboratory .](https://reader034.fdocuments.in/reader034/viewer/2022051819/551b30ee550346d41a8b4e65/html5/thumbnails/15.jpg)
Why so large? Lopsided sparsity
Top ten words (I, You, it, the, 's, and, n't, a, That, Yeah) each occur 58,000 times
12,400 words (23%) only occur once
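The kind of count behind this slide can be sketched in a few lines: tally word types, list the most frequent, and measure what share of types occurs only once. The token list here is a made-up toy example, not corpus data.

```python
from collections import Counter


def sparsity_summary(tokens, top_n=10):
    """Count word types, the most frequent ones, and the share occurring once."""
    counts = Counter(token.lower() for token in tokens)
    once = [word for word, n in counts.items() if n == 1]
    return {
        "types": len(counts),
        "top": counts.most_common(top_n),
        "hapax_share": len(once) / len(counts) if counts else 0.0,
    }


# Toy input for illustration only.
tokens = "yeah I know you know it 's the thing that I said".split()
summary = sparsity_summary(tokens, top_n=2)
print(summary["types"], summary["top"], summary["hapax_share"])
```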
![Page 16: Linking transcriptions to spoken audio John Coleman and Sergio Grau Oxford University Phonetics Laboratory .](https://reader034.fdocuments.in/reader034/viewer/2022051819/551b30ee550346d41a8b4e65/html5/thumbnails/16.jpg)
Why so large? Lopsided sparsity
![Page 17: Linking transcriptions to spoken audio John Coleman and Sergio Grau Oxford University Phonetics Laboratory .](https://reader034.fdocuments.in/reader034/viewer/2022051819/551b30ee550346d41a8b4e65/html5/thumbnails/17.jpg)
A rule of thumb
To catch most
• English sounds, you need minutes of audio
• common words of English … a few hours
• a typical person's vocabulary … >100 hrs
• pairs of common words … >1000 hrs
• arbitrary word-pairs … >100 years
![Page 18: Linking transcriptions to spoken audio John Coleman and Sergio Grau Oxford University Phonetics Laboratory .](https://reader034.fdocuments.in/reader034/viewer/2022051819/551b30ee550346d41a8b4e65/html5/thumbnails/18.jpg)
Main problem in large corpora
Finding needles in the haystack
To address that challenge, we think there are two “killer apps”
• Forced alignment
• Data linking, or at least open exposure of digital material, coupled with cross-searching
![Page 19: Linking transcriptions to spoken audio John Coleman and Sergio Grau Oxford University Phonetics Laboratory .](https://reader034.fdocuments.in/reader034/viewer/2022051819/551b30ee550346d41a8b4e65/html5/thumbnails/19.jpg)
Practicalities
• In order to be of much practical use, such very large corpora must be indexed at word and segment level
• All included speech corpora must therefore have associated text transcriptions
• We’re using P2FA, the Penn Phonetics Laboratory Forced Aligner, to associate each word and segment with the corresponding start and end points in the sound files
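As a minimal sketch of what a word-level index might look like once alignment has produced start and end times: each word maps to the list of places it occurs. The file names and the tuple layout are hypothetical and are not the project's actual index format.

```python
from collections import defaultdict


def build_index(aligned_words):
    """Map each lower-cased word to its (file, start, end) occurrences."""
    index = defaultdict(list)
    for filename, word, start, end in aligned_words:
        index[word.lower()].append((filename, start, end))
    return index


# Hypothetical aligner results: (audio file, word, start in s, end in s).
aligned = [
    ("D94.wav", "Wanted", 1.6925, 2.3125),
    ("D94.wav", "me", 2.3125, 2.45),
    ("KB1.wav", "wanted", 10.0, 10.4),
]
index = build_index(aligned)
for filename, start, end in index["wanted"]:
    print(f"{filename}: {start:.3f}-{end:.3f} s")
```

A lookup then returns every audio snippet for a word without scanning the transcriptions again.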
![Page 20: Linking transcriptions to spoken audio John Coleman and Sergio Grau Oxford University Phonetics Laboratory .](https://reader034.fdocuments.in/reader034/viewer/2022051819/551b30ee550346d41a8b4e65/html5/thumbnails/20.jpg)
Mining (indexing by forced alignment)
x 21 million
![Page 21: Linking transcriptions to spoken audio John Coleman and Sergio Grau Oxford University Phonetics Laboratory .](https://reader034.fdocuments.in/reader034/viewer/2022051819/551b30ee550346d41a8b4e65/html5/thumbnails/21.jpg)
Mining (indexing by forced alignment)
![Page 22: Linking transcriptions to spoken audio John Coleman and Sergio Grau Oxford University Phonetics Laboratory .](https://reader034.fdocuments.in/reader034/viewer/2022051819/551b30ee550346d41a8b4e65/html5/thumbnails/22.jpg)
Mining (a needle in a haystack)
![Page 23: Linking transcriptions to spoken audio John Coleman and Sergio Grau Oxford University Phonetics Laboratory .](https://reader034.fdocuments.in/reader034/viewer/2022051819/551b30ee550346d41a8b4e65/html5/thumbnails/23.jpg)
Mining (a diamond in the rough)
![Page 24: Linking transcriptions to spoken audio John Coleman and Sergio Grau Oxford University Phonetics Laboratory .](https://reader034.fdocuments.in/reader034/viewer/2022051819/551b30ee550346d41a8b4e65/html5/thumbnails/24.jpg)
Challenges for alignments
Problems with documentation and records
• Transcription errors
• Long untranscribed portions
• Some transcribed regions with no audio (lost in copying)
![Page 25: Linking transcriptions to spoken audio John Coleman and Sergio Grau Oxford University Phonetics Laboratory .](https://reader034.fdocuments.in/reader034/viewer/2022051819/551b30ee550346d41a8b4e65/html5/thumbnails/25.jpg)
Challenges for alignments
Broadcast recordings may include untranscribed commercials
Transcripts generally edit out dysfluencies
Political speakers may extemporize, departing from the published script
![Page 26: Linking transcriptions to spoken audio John Coleman and Sergio Grau Oxford University Phonetics Laboratory .](https://reader034.fdocuments.in/reader034/viewer/2022051819/551b30ee550346d41a8b4e65/html5/thumbnails/26.jpg)
Challenges for alignments
• Overlapping speakers
• Background noise/music/babble
• Variable signal loudness
• Reverberation
• Distortion
• Poor speaker vocal health/voice quality
• Unexpected accents: need multidialect pronouncing dictionary
![Page 27: Linking transcriptions to spoken audio John Coleman and Sergio Grau Oxford University Phonetics Laboratory .](https://reader034.fdocuments.in/reader034/viewer/2022051819/551b30ee550346d41a8b4e65/html5/thumbnails/27.jpg)
Issues we’re still grappling with
• No standards for adding phonemic transcriptions and timing information to XML transcriptions
• Many different possible schemes
• How to decide?
![Page 28: Linking transcriptions to spoken audio John Coleman and Sergio Grau Oxford University Phonetics Laboratory .](https://reader034.fdocuments.in/reader034/viewer/2022051819/551b30ee550346d41a8b4e65/html5/thumbnails/28.jpg)
Enabling other corpora to be brought in in future
Promoting common standards for audio with linked transcription
?<w c5="AV0" hw="well" pos="ADV" >Well </w>
![Page 29: Linking transcriptions to spoken audio John Coleman and Sergio Grau Oxford University Phonetics Laboratory .](https://reader034.fdocuments.in/reader034/viewer/2022051819/551b30ee550346d41a8b4e65/html5/thumbnails/29.jpg)
Automatic Speech-to-Phoneme alignment
![Page 30: Linking transcriptions to spoken audio John Coleman and Sergio Grau Oxford University Phonetics Laboratory .](https://reader034.fdocuments.in/reader034/viewer/2022051819/551b30ee550346d41a8b4e65/html5/thumbnails/30.jpg)
Aligner output to extended XML
• HTK example:
0.5625 0.6125 "IH1"
0.6125 0.8225 "T"
0.5625 0.8225 "IT"
• HTK output + XML -> extended XML
• How to represent the obtained time information within the existing TEI-XML structure?
![Page 31: Linking transcriptions to spoken audio John Coleman and Sergio Grau Oxford University Phonetics Laboratory .](https://reader034.fdocuments.in/reader034/viewer/2022051819/551b30ee550346d41a8b4e65/html5/thumbnails/31.jpg)
Integrating alignment information in the TEI-XML structure
• Time information
– Word level
– Phoneme level
• Phonemic representation of each word
• Timeline
![Page 32: Linking transcriptions to spoken audio John Coleman and Sergio Grau Oxford University Phonetics Laboratory .](https://reader034.fdocuments.in/reader034/viewer/2022051819/551b30ee550346d41a8b4e65/html5/thumbnails/32.jpg)
Other representations: EXMARaLDA
EXMARaLDA: “Extensible Markup Language for Discourse Annotation” http://www.exmaralda.org/
<common-timeline>
  <tli id="T0" time="0.0"/>
  <tli id="T1" time="1.309974117691172"/>
  <tli id="T2" time="1.899962460773455"/>
  <tli id="T3" time="2.3399537674788866"/>
  ....
<tier id="TIE0" speaker="SPK0" category="v" type="t" display-name="PRE [v]">
  <event start="T2" end="T3">Good evening. </event>
  <event start="T5" end="T6">I have with me tonight Ann Elk Mistress Ann Elk. </event>
![Page 33: Linking transcriptions to spoken audio John Coleman and Sergio Grau Oxford University Phonetics Laboratory .](https://reader034.fdocuments.in/reader034/viewer/2022051819/551b30ee550346d41a8b4e65/html5/thumbnails/33.jpg)
Other representations: Voices of the Holocaust
http://voices.iit.edu/xml/voth_project_tei_example.xml
<div corresp="#transcription_id">
  <!-- begin Spool XXX -->
  <div xml:lang="en">
    <u who="#interviewer_id" start="1.631">This is the first utterance of the interviewer.</u>
    <u who="#interviewee_id" start="2.465">This is the first utterance of the interviewee.</u>
  </div>
![Page 34: Linking transcriptions to spoken audio John Coleman and Sergio Grau Oxford University Phonetics Laboratory .](https://reader034.fdocuments.in/reader034/viewer/2022051819/551b30ee550346d41a8b4e65/html5/thumbnails/34.jpg)
Other representations: IFA Dialog Video corpus, Phonetic Sciences, University of Amsterdam
van Son, R., Wesseling, W., Sanders, E., and van den Heuvel, H., The IFADV corpus: A free dialog video corpus, LREC’08, Marrakech, 2008
<TIME_ORDER>
  <TIME_SLOT TIME_SLOT_ID="ts1" TIME_VALUE="0"/>
  <TIME_SLOT TIME_SLOT_ID="ts2" TIME_VALUE="10"/>
  <TIME_SLOT TIME_SLOT_ID="ts3" TIME_VALUE="462"/>
  <TIME_SLOT TIME_SLOT_ID="ts4" TIME_VALUE="840"/>
  ...
<ANNOTATION>
  <ALIGNABLE_ANNOTATION ANNOTATION_ID="a1" TIME_SLOT_REF1="ts4" TIME_SLOT_REF2="ts7">
    <ANNOTATION_VALUE>beginnen we weer opnieuw?</ANNOTATION_VALUE>
  </ALIGNABLE_ANNOTATION>
</ANNOTATION>
![Page 35: Linking transcriptions to spoken audio John Coleman and Sergio Grau Oxford University Phonetics Laboratory .](https://reader034.fdocuments.in/reader034/viewer/2022051819/551b30ee550346d41a8b4e65/html5/thumbnails/35.jpg)
Other representations: Labb-Cat (ONZE Miner)
http://onzeminer.sourceforge.net
Transcriber or Praat representation
![Page 36: Linking transcriptions to spoken audio John Coleman and Sergio Grau Oxford University Phonetics Laboratory .](https://reader034.fdocuments.in/reader034/viewer/2022051819/551b30ee550346d41a8b4e65/html5/thumbnails/36.jpg)
Other representations: Transcriber
http://trans.sourceforge.net
<Turn speaker="spk2" startTime="0.557" endTime="5.851">
  <Sync time="0.557"/>
  so what do you know of your family ’s
  <Sync time="2.255"/>
  history like
  <Sync time="3.410"/>
  do you know when and why they came to Oxford
</Turn>
![Page 37: Linking transcriptions to spoken audio John Coleman and Sergio Grau Oxford University Phonetics Laboratory .](https://reader034.fdocuments.in/reader034/viewer/2022051819/551b30ee550346d41a8b4e65/html5/thumbnails/37.jpg)
Other representations: COLT Corpus
http://www.hd.uib.no/colt/
– Sentence level
<u who=5 id=1 time=0.112> But I must see Mr <name> [smile again.]
<u who=1 id=2 time=2.016> [<unclear> spoiled again?]
...
– Word level
<u who=5 id=1 time=0.112>
<Audio word=BUT time=0.112 durn=0.176>But</Audio>
<Audio word=I time=0.288 durn=0.064>I</Audio>
<Audio word=MUST time=0.352 durn=0.304>must</Audio>
<Audio word=SEE time=0.816 durn=0.352>see</Audio>
<Audio word=MR time=1.168 durn=0.160>Mr</Audio>
...
![Page 38: Linking transcriptions to spoken audio John Coleman and Sergio Grau Oxford University Phonetics Laboratory .](https://reader034.fdocuments.in/reader034/viewer/2022051819/551b30ee550346d41a8b4e65/html5/thumbnails/38.jpg)
Other representations: Summary
• Mostly sentence/word level time information representation
• No phoneme analysis
• No phoneme time information
• Timeline representation
• TEI standard?
![Page 39: Linking transcriptions to spoken audio John Coleman and Sergio Grau Oxford University Phonetics Laboratory .](https://reader034.fdocuments.in/reader034/viewer/2022051819/551b30ee550346d41a8b4e65/html5/thumbnails/39.jpg)
• Extended TEI-XML with time and phoneme information
![Page 40: Linking transcriptions to spoken audio John Coleman and Sergio Grau Oxford University Phonetics Laboratory .](https://reader034.fdocuments.in/reader034/viewer/2022051819/551b30ee550346d41a8b4e65/html5/thumbnails/40.jpg)
<u who="D94PSUNK">
  <s n="3">
    <w c5="VVD" hw="want" pos="VERB">Wanted </w>
    <w c5="PNP" hw="i" pos="PRON">me </w>
    <w c5="TO0" hw="to" pos="PREP">to</w>
    <c c5="PUN">.</c>
  </s>
  <!-- ... -->
</u>
![Page 41: Linking transcriptions to spoken audio John Coleman and Sergio Grau Oxford University Phonetics Laboratory .](https://reader034.fdocuments.in/reader034/viewer/2022051819/551b30ee550346d41a8b4e65/html5/thumbnails/41.jpg)
<u who="D94PSUNK">
  <s n="3">
    <w ana="#D94:0083:11" c5="VVD" hw="want" pos="VERB">Wanted </w>
    <w ana="#D94:0083:12" c5="PNP" hw="i" pos="PRON">me </w>
    <w ana="#D94:0083:13" c5="TO0" hw="to" pos="PREP">to</w>
    <c c5="PUN">.</c>
![Page 42: Linking transcriptions to spoken audio John Coleman and Sergio Grau Oxford University Phonetics Laboratory .](https://reader034.fdocuments.in/reader034/viewer/2022051819/551b30ee550346d41a8b4e65/html5/thumbnails/42.jpg)
<fs xml:id="D94:0083:11">
  <f name="orth">wanted</f>
  <f name="phon_ana">
    <vcoll type="lst">
      <symbol synch="#D94:0083:11:0" value="W"/>
      <symbol synch="#D94:0083:11:1" value="AO1"/>
      <symbol synch="#D94:0083:11:2" value="N"/>
      <symbol synch="#D94:0083:11:3" value="AH0"/>
      <symbol synch="#D94:0083:11:4" value="D"/>
    </vcoll>
  </f>
</fs>
![Page 43: Linking transcriptions to spoken audio John Coleman and Sergio Grau Oxford University Phonetics Laboratory .](https://reader034.fdocuments.in/reader034/viewer/2022051819/551b30ee550346d41a8b4e65/html5/thumbnails/43.jpg)
<timeline origin="0" unit="s" xml:id="TL0">
  ...
  <when xml:id="D94:0083:11:0" from="1.6925" to="1.8225"/>
  <when xml:id="D94:0083:11:1" from="1.8225" to="1.9225"/>
  <when xml:id="D94:0083:11:2" from="1.9225" to="2.1125"/>
  <when xml:id="D94:0083:11:3" from="2.1125" to="2.1825"/>
  <when xml:id="D94:0083:11:4" from="2.1825" to="2.3125"/>
  ...
</timeline>
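To show how the three pieces fit together (word, feature structure, timeline), here is a minimal sketch that follows the pointers from a <w> element to its phonemes and their times. The <doc> wrapper and the shortened sample are assumptions made so the fragment parses on its own; the lookup strips any leading "#" from identifiers, so it works whether or not the xml:id values carry one.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Shortened sample; the <doc> wrapper is hypothetical.
SAMPLE = """<doc>
  <w ana="#D94:0083:11" c5="VVD" hw="want" pos="VERB">Wanted </w>
  <fs xml:id="D94:0083:11">
    <f name="orth">wanted</f>
    <f name="phon_ana">
      <vcoll type="lst">
        <symbol synch="#D94:0083:11:0" value="W"/>
        <symbol synch="#D94:0083:11:1" value="AO1"/>
      </vcoll>
    </f>
  </fs>
  <timeline origin="0" unit="s" xml:id="TL0">
    <when xml:id="D94:0083:11:0" from="1.6925" to="1.8225"/>
    <when xml:id="D94:0083:11:1" from="1.8225" to="1.9225"/>
  </timeline>
</doc>"""


def phoneme_times(root, word):
    """Return (phoneme, start, end) tuples for one <w> element."""
    by_id = {}
    for element in root.iter():
        ident = element.get(XML_ID)
        if ident:
            by_id[ident.lstrip("#")] = element
    fs = by_id[word.get("ana").lstrip("#")]
    result = []
    for symbol in fs.iter("symbol"):
        when = by_id[symbol.get("synch").lstrip("#")]
        result.append(
            (symbol.get("value"), float(when.get("from")), float(when.get("to")))
        )
    return result


root = ET.fromstring(SAMPLE)
for phoneme, start, end in phoneme_times(root, root.find("w")):
    print(phoneme, start, end)
```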
![Page 44: Linking transcriptions to spoken audio John Coleman and Sergio Grau Oxford University Phonetics Laboratory .](https://reader034.fdocuments.in/reader034/viewer/2022051819/551b30ee550346d41a8b4e65/html5/thumbnails/44.jpg)
Q. When you have an indexing scheme and a big database, what do you want to do with it?
A. Random access to audio snippets
![Page 45: Linking transcriptions to spoken audio John Coleman and Sergio Grau Oxford University Phonetics Laboratory .](https://reader034.fdocuments.in/reader034/viewer/2022051819/551b30ee550346d41a8b4e65/html5/thumbnails/45.jpg)
Random access to audio snippets
• Timing of fragments in URL
• e.g. Gaudi (Google Labs) everyzing.com (ramp.com)
• http://audio.weei.com/search?q=something
• http://audio.weei.com/a/42828235/red-sox-pregame-show.htm#q=something&seek=311.989
![Page 46: Linking transcriptions to spoken audio John Coleman and Sergio Grau Oxford University Phonetics Laboratory .](https://reader034.fdocuments.in/reader034/viewer/2022051819/551b30ee550346d41a8b4e65/html5/thumbnails/46.jpg)
![Page 47: Linking transcriptions to spoken audio John Coleman and Sergio Grau Oxford University Phonetics Laboratory .](https://reader034.fdocuments.in/reader034/viewer/2022051819/551b30ee550346d41a8b4e65/html5/thumbnails/47.jpg)
![Page 48: Linking transcriptions to spoken audio John Coleman and Sergio Grau Oxford University Phonetics Laboratory .](https://reader034.fdocuments.in/reader034/viewer/2022051819/551b30ee550346d41a8b4e65/html5/thumbnails/48.jpg)
Random access to audio snippets
• Audio objects in HTML5 (in the browser), e.g. http://www.phon.ox.ac.uk/jcoleman/useful_test.html
• W3C media fragments protocol, e.g. http://www.w3.org/2008/WebVideo/Fragments/
Demo: http://ninsuna.elis.ugent.be/MediaFragmentsPlayer
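In the W3C Media Fragments syntax, a time range is addressed by appending `#t=start,end` (in seconds) to the media URL. A minimal sketch of building such a URL from alignment times; the example.org address is a placeholder, and whether a given player or server honours the fragment is not guaranteed.

```python
def media_fragment(url, start, end):
    """Append a temporal media fragment (seconds) to an audio URL."""
    if end <= start:
        raise ValueError("end must be later than start")
    return f"{url}#t={start:.3f},{end:.3f}"


# Placeholder URL; times would come from the forced alignment.
print(media_fragment("http://example.org/audio/D94.wav", 1.5, 2.25))
```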
![Page 49: Linking transcriptions to spoken audio John Coleman and Sergio Grau Oxford University Phonetics Laboratory .](https://reader034.fdocuments.in/reader034/viewer/2022051819/551b30ee550346d41a8b4e65/html5/thumbnails/49.jpg)
URNs for audio snippets
• Linked data/semantic web approach: refer to each specific word, phoneme etc. as a specific audio object, not just a time range inside an audio file
• Challenge: need for an ontology for sounds and sound timelines in audio recordings
• Some progress in music ontologies
![Page 50: Linking transcriptions to spoken audio John Coleman and Sergio Grau Oxford University Phonetics Laboratory .](https://reader034.fdocuments.in/reader034/viewer/2022051819/551b30ee550346d41a8b4e65/html5/thumbnails/50.jpg)
Conclusion
• Sound and multimedia corpora/collections are getting very big
• In fact multimedia, not text, dominates the internet
• So, we need some standard ways for representing audio structure and accessing its parts
• Forced alignment allows us to map transcriptions to audio, reasonably accurately
• For searching, there are several “demonstration” possibilities, but this is still work in progress
![Page 51: Linking transcriptions to spoken audio John Coleman and Sergio Grau Oxford University Phonetics Laboratory .](https://reader034.fdocuments.in/reader034/viewer/2022051819/551b30ee550346d41a8b4e65/html5/thumbnails/51.jpg)
Thank you very much!