Mining a Year of Speech: a “Digging into Data” project


Mining (a) Year(s) of Speech: a Digging into Data project

http://www.phon.ox.ac.uk/mining/

John Coleman, Greg Kochanski, Ladan Ravary, Sergio Grau
Oxford University Phonetics Laboratory

Lou Burnard

Jonathan Robinson
The British Library

Mark Liberman, Jiahong Yuan, Chris Cieri
Phonetics Laboratory and Linguistic Data Consortium, University of Pennsylvania

with support from our Digging into Data competition funders

and with thanks for pump-priming support from the Oxford University John Fell Fund, and from the British Library

The Digging into Data challenge

The creation of vast quantities of Internet accessible digital data and the development of techniques for large-scale data analysis and visualization have led to remarkable new discoveries in genetics, astronomy and other fields ...

With books, newspapers, journals, films, artworks, and sound recordings being digitized on a massive scale, it is possible to apply data analysis techniques to large collections of diverse cultural heritage resources as well as scientific data.

In Mining a Year of Speech we addressed the challenges of working with very large audio collections of spoken language.

Challenges of very large audio collections of spoken language

How does a researcher find audio segments of interest?

How do audio corpus providers mark them up to facilitate searching and browsing?

How to make very large scale audio collections accessible?

Challenges

Amount of material

Storage
CD quality audio: 635 MB/hour
Uncompressed .wav files: 115 MB/hour
2.8 GB/day
85 GB/month
1.02 TB/year
Library/archive .wav files: 1 GB/hr, 9 TB/yr
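The storage figures above follow from simple PCM arithmetic. The sketch below is illustrative only. It assumes (the slide does not say so) that "CD quality" means 44.1 kHz, 16-bit stereo and that the 115 MB/hour figure corresponds to 16 kHz, 16-bit mono; the function name is hypothetical. Under those assumptions the results land close to the slide's numbers.

```python
# Assumed recording formats (not stated on the slide):
#   CD quality       = 44.1 kHz, 16-bit, stereo
#   uncompressed wav = 16 kHz, 16-bit, mono

MB = 10**6
GB = 10**9
TB = 10**12


def bytes_per_hour(sample_rate_hz, bytes_per_sample, channels):
    """Size in bytes of one hour of uncompressed PCM audio."""
    return sample_rate_hz * bytes_per_sample * channels * 3600


cd_hour = bytes_per_hour(44_100, 2, 2)
wav_hour = bytes_per_hour(16_000, 2, 1)

print(f"CD quality : {cd_hour / MB:.0f} MB/hour")
print(f"16 kHz mono: {wav_hour / MB:.0f} MB/hour")
print(f"per day    : {wav_hour * 24 / GB:.1f} GB")
print(f"per year   : {wav_hour * 24 * 365 / TB:.2f} TB")
```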

Spoken audio = 250 times XML

Challenges

Storing 1.02 TB/year: not really a problem in 21st century

1 TB (1000 GB) hard drive: c. £65 Now £39.95!

Computing (distance measures, alignments, labels etc): multiprocessor cluster

Challenges

Amount of material
Computing distance measures, etc.
Alignment of labels
Searching and browsing
Just reading or copying 9 TB takes >1 day
Download time: days or weeks

Challenges

To make large corpora practical, you need:

A detailed index, so users can find the parts they need
A way of using the index to access slices of the corpus
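As a rough illustration of those two requirements, the sketch below builds a word index and turns an index hit into a byte range, so that only a slice of a recording has to be fetched. Everything in it is an assumption made for illustration: the file names and times are invented, the audio is taken to be 16 kHz, 16-bit mono, and the .wav header is assumed to be the canonical 44 bytes (real files may differ).

```python
from collections import defaultdict

# Hypothetical index rows: (word, recording, start in s, end in s).
ALIGNED_WORDS = [
    ("read", "tape_001.wav", 12.40, 12.62),
    ("my", "tape_001.wav", 12.62, 12.80),
    ("lips", "tape_001.wav", 12.80, 13.25),
    ("lips", "tape_002.wav", 301.10, 301.52),
]


def build_index(rows):
    """Map each word to every (recording, start, end) where it occurs."""
    index = defaultdict(list)
    for word, recording, start, end in rows:
        index[word.lower()].append((recording, start, end))
    return index


def byte_range(start_s, end_s, sample_rate=16_000, bytes_per_sample=2,
               header_bytes=44):
    """Byte offsets of a time slice in a mono PCM .wav file.

    Assumes a canonical 44-byte header; this is an illustrative
    simplification, not a general .wav parser.
    """
    first = header_bytes + round(start_s * sample_rate) * bytes_per_sample
    last = header_bytes + round(end_s * sample_rate) * bytes_per_sample
    return first, last


index = build_index(ALIGNED_WORDS)
for recording, start, end in index["lips"]:
    print(recording, byte_range(start, end))
```

With byte offsets in hand, a client could request just that range of the file instead of downloading the whole recording.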

Potential users

Members of public interested in specific bits of content

Scientists with broader interests, e.g. law scholars, political scientists: text searches

Phoneticians and speech engineers: retrieval based on pronunciation and sound

Searching audio: some kinds of questions you might ask

1. When did X say Y? For example, "find the video clip where George Bush said 'read my lips'."
2. Are there changes in dialects, or in their social status, that are tied to the new social media?
3. How do arguments work? For example, how do different people handle interruptions?
4. How frequent are linguistic features such as phrase-final rising intonation ("uptalk") across different age groups, genders, social classes, and regions?

Some large(ish) speech corpora

SwitchBoard corpus: 13 days of audio.

Spoken Dutch Corpus: 1 month, but only a fraction is phonetically transcribed.

Spoken Spanish: 4.6 days, orthographically transcribed.

Buckeye Corpus (OSU): c. 2 days.

Wellington Corpus of Spoken New Zealand English, c. 3 days transcribed

Digital Archive of Southern Speech (American)

The Year of Speech

A grove of corpora, held at various sites with a common indexing scheme and search tools

US English material: 2,240 hrs of telephone conversations

1,255 hrs of broadcast news

As-yet unpublished talk show conversations (1000 hrs), Supreme Court oral arguments (5000 hrs), political speeches and debates

British English: Spoken part of the British National Corpus, >7.4 million words of transcribed speech

Recently digitized by collaboration with British Library

How big is big science?

Human genome: 3 GB
DASS audio sampler: 350 GB
Hubble space telescope: 0.5 TB/year
Year of Speech: >1 TB
Sloan digital sky survey: 16 TB
Beazley Archive of ancient Artifacts: 20 TB
Large Hadron Collider: 15 PB/year = 2500 x Year of Speech

Analogue audio in libraries

British Library: >1m disks and tapes, 5% digitized
Library of Congress Recorded Sound Reference Center: >2m items, including International Storytelling Foundation: >8000 hrs of audio and video
European broadcast archives: >20m hrs (2,283 years) cf. Large Hadron Collider

75% on tape
20% shellac and vinyl
7% digital

Analogue audio in libraries

World wide: ~100m hours (11,415 yrs) analogue, i.e. 4-5 Large Hadron Colliders!
Cost of professional digitization: ~£20/$32 per tape (e.g. C-90 cassette)
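The hours-to-years conversions above can be checked with a few lines of arithmetic. This is only a sketch: it assumes a 365-day year, and the comparison with the Large Hadron Collider assumes the audio is stored at the CD-quality rate of 635 MB/hour and that "one LHC" means 15 PB/year. The slides do not state which rate was used, so that part is a guess; the function names are hypothetical.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours; leap days ignored
LHC_PB_PER_YEAR = 15       # figure quoted in the talk


def hours_to_years(hours):
    """Continuous listening time in years."""
    return hours / HOURS_PER_YEAR


def hours_to_petabytes(hours, mb_per_hour):
    """Storage needed for `hours` of audio at `mb_per_hour`, in PB."""
    return hours * mb_per_hour * 10**6 / 10**15


european_years = hours_to_years(20_000_000)
worldwide_years = hours_to_years(100_000_000)

# Assumption: CD-quality digitization at 635 MB/hour.
worldwide_pb = hours_to_petabytes(100_000_000, 635)

print(f"European broadcast archives: {int(european_years):,} years")
print(f"Worldwide analogue audio   : {int(worldwide_years):,} years")
print(f"Worldwide at CD quality    : {worldwide_pb:.1f} PB "
      f"= {worldwide_pb / LHC_PB_PER_YEAR:.1f} LHC-years")
```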

Using speech recognition and natural language technologies (e.g. summarization) could provide more detailed cataloguing/indexing without time-consuming human listening

Why so large? Lopsided sparsity

Top ten words each occur 58,000 times: I, you, it, the, 's, and, n't, a, that, yeah

12,400 words (23%) only occur once
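The two ends of this distribution, a handful of very frequent words and a long tail of words seen only once, can be measured on any tokenized transcript. The sketch below is a minimal illustration on a toy sentence; the function name and the toy data are invented here and are not the counts from the corpus.

```python
from collections import Counter


def sparsity_summary(tokens, top_n=10):
    """Count word types, the most frequent words, and words seen once."""
    counts = Counter(t.lower() for t in tokens)
    hapaxes = [w for w, c in counts.items() if c == 1]
    return {
        "types": len(counts),
        "top": counts.most_common(top_n),
        "hapax_count": len(hapaxes),
        "hapax_share": len(hapaxes) / len(counts) if counts else 0.0,
    }


# Toy input, already split into tokens (clitics such as 's kept separate).
toy = "I know you know that it 's the one that I said".split()
summary = sparsity_summary(toy, top_n=3)

print(summary["top"])
print(f"{summary['hapax_count']} of {summary['types']} word types occur once")
```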

Lopsided sparsity and size

Fox and Robles (2010): 22 examples of It's like-enactments [e.g. it's like 'mmmmmm'] in 10 hours of data

Lopsided sparsity and size

Rare phonetic word-joins (tokens per ~10 million)

I'm trying: 60
seem/n to: 310
alarng clock: 18
swimmim pool: 44
gettim paid: 19
weddim present: 15

7 of the 'swimming pool' examples are from one family on one occasion

Lopsided sparsity and size

Final -t/-d deletion:

just: 19563 tokens
want: 5221
left: 432
slammed: 6

A rule of thumb

To catch most of these, you need this much audio:

English sounds: minutes
common words of English: a few hours
a typical person's vocabulary: >100 hrs
pairs of common words: >1000 hrs
arbitrary word-pairs: >100 years

Rare and unique wonders

aqualunging boringest chambermaiding

de-grandfathered europeaney gronnies

hoptastic lawnmowing mellies noseless

punny regurgitate-arianism scunny

smackerooney tooked weppings

yak-chucker zombieness

Not just repositories of words

Specific phrases or constructions

Particularities of people's voices and speaking habits

Dog-directed speech

Parrot-directed speech

Language(?) in the wild

A parrot

Talking to a dog

Try transcribing this!

There's gronnies lurking about

Unusual voices

Circumstances of use
How is the 'voice' selected?
Do men do it more than women?
Young more than old?
How do the speaker's and listener's brains produce, interpret or store odd voice pronunciations and strange intonations?

Main problem in large corpora

Finding needles in the haystack

To address that challenge, we think there are two killer apps

Forced alignment
Data linking: open exposure of digital material, coupled with cross-searching

Collaboration, not collection

Search interface 1 (e.g. Oxford)
Search interface 2 (e.g. BL)
Search interface 3 (e.g. Penn)
Search interface 4 (e.g. Lancaster ?)
BNC-XML database - retrieve time stamps
Spoken BNC recordings - BL sound server(s)
LDC database - retrieve time stamps
Spoken LDC recordings - various locations

Corpora in the Year of Speech

Spontaneous speech
Spoken BNC ~1400 hrs
Conversational telephone speech
Read text
LibriVox audio books
Broadcast news
US Supreme Court oral arguments
Political discourse
Oral history interviews
US vernacular dialects/Sociolinguistic interviews

Practicalities

In order to be of much practical use, such very large corpora must be indexed at word and segment level

All included speech corpora must therefore have associated text transcriptions

We're using the Penn Phonetics Laboratory Forced Aligner to associate each word and segment with the corresponding start and end points in the sound files
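Once every word carries start and end times, a question such as "when did X say Y?" reduces to a search over the aligned word list. The sketch below assumes the aligner output has already been read into (word, start, end) tuples; that layout, the function name and the times are illustrative assumptions, not the aligner's actual file format.

```python
def find_phrase(aligned, phrase):
    """Return (start, end) times for each occurrence of `phrase`.

    `aligned` is a time-ordered list of (word, start_s, end_s), the kind
    of word-level record that forced alignment makes available.
    """
    target = phrase.lower().split()
    if not target:
        return []
    words = [w.lower() for w, _, _ in aligned]
    hits = []
    for i in range(len(words) - len(target) + 1):
        if words[i:i + len(target)] == target:
            first_start = aligned[i][1]
            last_end = aligned[i + len(target) - 1][2]
            hits.append((first_start, last_end))
    return hits


# Invented example times, for illustration only.
aligned = [
    ("read", 12.40, 12.62), ("my", 12.62, 12.80), ("lips", 12.80, 13.25),
    ("no", 13.60, 13.81), ("new", 13.81, 14.02), ("taxes", 14.02, 14.60),
]
print(find_phrase(aligned, "read my lips"))
```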

Mining (indexing by forced alignment)

x 21 million

Mining (a needle in a haystack)

Mining (a diamond in the rough)

American and English

Same set of acoustic models
e.g. same [] for US Bob and UK Ba(r)b

Pronunciation differences between different varieties were dealt with by listing multiple phonetic transcriptions

Building a multi-dialect dictionary
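One simple way to hold several pronunciations per word is a mapping from the word to a list of phone strings, so that the aligner can try each variant. The entries below are copied from the dictionary excerpt at the end of this transcript; the data layout and the lookup function are illustrative assumptions, not the project's actual dictionary format.

```python
# Word -> list of alternative pronunciations (phone strings as they
# appear in the dictionary excerpt). The layout is an assumption.
LEXICON = {
    "rivermead": [
        "R IH1 V ER0 M IY2 D",   # rhotic
        "R IH1 V AH0 M IY2 D",   # non-rhotic
    ],
    "roadrunners": [
        "R OW1 D R AH2 N ER0 Z",
        "R OW1 D R AH2 N AH0 Z",
        "R OW1 D R UH2 N ER0 Z",
        "R OW1 D R UH2 N AH0 Z",
    ],
    "ritzi": ["R IH1 T S IY0"],
}


def pronunciations(word):
    """All listed variants for `word`, each as a list of phones."""
    return [variant.split() for variant in LEXICON.get(word.lower(), [])]


for variant in pronunciations("roadrunners"):
    print(" ".join(variant))
```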


Tools/user interfaces

Issues we grappled with

Funding logistics: US funding did not come through for 9 months
Quality of transcriptions: errors
Long untranscribed portions
Large transcribed regions with no audio (lost in copying)
Problems with documentation and records

Broadcast recordings may include untranscribed commercials
Transcripts generally edit out dysfluencies
Political speeches may extemporize, departing from the published script

Some causes of difficulty in forced alignment:
Overlapping speakers
Background noise/music/babble
Transcription errors
Variable signal loudness
Reverberation
Distortion
Poor speaker vocal health/voice quality
Unexpected accents

Issues we're still grappling with

No standards for adding phonemic transcriptions and timing information to XML transcriptions

Many different possible schemes

How to decide?

Anonymization

The text transcriptions in the published BNC have already been anonymized
Some parts of the audio (e.g. COLT) have also been published
Full names, personal addresses and telephone numbers were replaced by tags
We use the location of all such tags to mute (silence) the corresponding portions of audio

Intellectual property responsibilities

All gaps must be checked to ensure accuracy
This is a much bigger job than we had anticipated (>13,000 anonymization 'gaps')
Checking the alignment of gaps is labour-intensive/slow
Compounded by poor automatic alignments

Rare and unique wonders

aeriated bolshiness canoodling drownded

even-stevens gakky kiddy-fied mindblank

noggin pythonish re-snogged sameyness

stripesey tuitioning watermanship

yukkified

Publication/release plans: LDC

Publication/release plans: BNC

Until we complete the checking of the alignment of anonymization gaps, we cannot yet release the full BNC Spoken Audio corpus
In the meantime, we have prepared a BNC Spoken Audio Sampler (release imminent!), including the audio, alignments, and our proposed XML extensions, plus the multi-dialect pronouncing dictionary
Later: full release as linked data via the British Library Archival Sound Recordings server
Enabling other corpora to be brought in in future
Promoting common standards for audio with linked transcription
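The muting step described under Anonymization (silencing the audio wherever the transcript carries an anonymization tag) might look roughly like the sketch below. It assumes the tag locations have already been converted to (start, end) times by alignment and that the audio is available as a list of PCM sample values; both the function name and the data layout are hypothetical.

```python
def mute_spans(samples, spans, sample_rate):
    """Return a copy of `samples` with each (start_s, end_s) span zeroed.

    `spans` holds the aligned times of anonymization gaps. Spans that
    run past the end of the audio are clipped.
    """
    out = list(samples)
    for start_s, end_s in spans:
        first = max(0, round(start_s * sample_rate))
        last = min(len(out), round(end_s * sample_rate))
        for i in range(first, last):
            out[i] = 0
    return out


# Toy signal: 1 second of audio at 10 samples per second.
audio = [100] * 10
muted = mute_spans(audio, [(0.2, 0.5)], sample_rate=10)
print(muted)
```

Because the muted region is only as accurate as the alignment, a badly aligned gap either leaves part of a name audible or silences neighbouring words, which is why each gap has to be checked.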


Enabling other corpora to be brought in in future

Negotiating with other speech corpus providers to join the federation
Accumulating transcriptions (in ordinary spelling) by crowdsourcing

New Insights

How does the large scale change our research?

What did we learn at the large scale that we couldn't see before?

- We present an illustrative selection of pilot studies

1. Sex differences in conversational speaking rates

Do women talk faster than men?

Method: Words and speaking times in the Fisher 2003 corpus
Conversational telephone speech
5,850 ten-minute conversations
2,368 between two women
1,910 one woman, one man
1,572 between two men

Do women talk faster than men?

Answer: No.

11,700 conversational sides; mean speaking rate = 173 wpm, sd = 27
Male mean 174.3, female 172.6, diff 1.7, effect size = 0.06
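The effect size quoted above is consistent with a standardized mean difference (the difference in means divided by the standard deviation). The sketch below reproduces the arithmetic from the figures on the slide; treating the overall sd of 27 as the pooled sd is an assumption made here, since the slide does not say how the effect size was computed.

```python
conversations = {"two women": 2368, "one woman, one man": 1910, "two men": 1572}
sides = 2 * sum(conversations.values())  # each conversation has two sides

male_mean = 174.3    # words per minute
female_mean = 172.6
pooled_sd = 27.0     # assumption: the overall sd stands in for the pooled sd

difference = male_mean - female_mean
effect_size = difference / pooled_sd

print(f"{sides:,} conversational sides")
print(f"difference = {difference:.1f} wpm, effect size = {effect_size:.2f}")
```

An effect size this small means the two distributions overlap almost completely, which is what the "No" answer rests on.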

2. Phrasal modulation of speaking rate

Phrase final lengthening (rallentando)
What is a spoken phrase?

Method: Word duration by position between pauses
Data: Switchboard (conversational telephone speech)

Result: Amazingly regular pattern
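The method, word duration by position between pauses, can be sketched as follows. This is an illustration under stated assumptions, not the analysis actually run on Switchboard: the input is a time-ordered list of (word, start, end) tuples, a silence of at least `min_pause` seconds (0.2 s here, an arbitrary choice) is taken to end a phrase, and positions are counted backwards from the pause.

```python
from collections import defaultdict


def duration_by_final_position(aligned, min_pause=0.2):
    """Mean word duration keyed by position counted back from a pause.

    Position 1 is the last word before a pause, 2 the word before
    that, and so on. `min_pause` is an assumed threshold.
    """
    phrases, current = [], []
    for i, (_word, start, end) in enumerate(aligned):
        current.append(end - start)
        is_last = i == len(aligned) - 1
        if is_last or aligned[i + 1][1] - end >= min_pause:
            phrases.append(current)
            current = []

    by_position = defaultdict(list)
    for durations in phrases:
        for pos, d in enumerate(reversed(durations), start=1):
            by_position[pos].append(d)
    return {pos: sum(ds) / len(ds) for pos, ds in sorted(by_position.items())}


# Invented toy alignment: two phrases separated by a 0.5 s pause.
toy = [
    ("a", 0.0, 0.25), ("b", 0.25, 0.5), ("c", 0.5, 1.0),
    ("d", 1.5, 1.75), ("e", 1.75, 2.5),
]
result = duration_by_final_position(toy)
print(result)
```

In the toy data the phrase-final words are the longest, which is the shape of pattern that phrase-final lengthening predicts.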

3. How does speaking rate reflect the ebb and flow of conversations?

Method: Word- or syllable-count in moving window over time-aligned transcripts

Ebb and flow of conversation
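A moving-window word count over a time-aligned transcript could be computed along these lines. The window and step sizes, the rule that a word belongs to the window containing its start time, and the function name are all assumptions chosen for illustration.

```python
def windowed_word_rate(aligned, window_s=60.0, step_s=10.0):
    """Words per minute in a sliding window over an aligned transcript.

    `aligned` is a list of (word, start_s, end_s). A word is counted in
    a window if its start time falls inside it. Returns a list of
    (window_start_s, words_per_minute).
    """
    if not aligned:
        return []
    starts = [start for _, start, _ in aligned]
    total = max(end for _, _, end in aligned)
    rates = []
    t = 0.0
    while t < total:
        n = sum(1 for s in starts if t <= s < t + window_s)
        rates.append((t, n * 60.0 / window_s))
        t += step_s
    return rates


# Toy transcript: one word starting every second for ten seconds.
demo = [("w", float(i), i + 0.5) for i in range(10)]
rates = windowed_word_rate(demo, window_s=5.0, step_s=5.0)
print(rates)
```

Plotting the resulting rate against window start time gives a curve of the conversation's faster and slower stretches.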

4. Previously unattested (impossible) assimilations of word-final consonants

Rare phonetic word-joins

I'm trying
seem/n to
alarng clock
swimmim pool
gettim paid
weddim present

5. Integration of language and other behavior

Speech in the wild

Listen they were going [belch] that ain't a burp he said

Like I'd be talking like this and suddenly it'll go [mimics microphone noises]

He simply went [sound effect] through his nose

Come on then shitbox

Not just for linguists

Tausczik and Pennebaker (2010):

We [social psychologists] are in the midst of a technological revolution whereby, for the first time, researchers can link daily word use to a broad array of real-world behaviors ... results using LIWC [text analysis software] demonstrate its ability to detect meaning in a wide variety of settings, to show attentional focus, emotionality, social relationships, thinking styles, and individual differences.

Not just for linguists

Ireland et al. (2011):

Similarity in how people talk with one another on speed dates (measured by their usage of function words) predicts increased likelihood of mutual romantic interest, mutually desired future contact and relationship stability at a 3-month follow-up.

Not just for linguists

Black et al. (forthcoming):

when the justices focus more unpleasant language toward one attorney, the side he represents is more likely to lose.

Finally

Some parts of Year of Speech corpora are extremely challenging to automatically align with transcriptions

Other parts are relatively easy

For cleaner recordings, forced alignment of far larger corpora (e.g. 20-year C-SPAN archive) is already possible, in principle

As more and more audiovisual material becomes digitized, automatic alignment of transcriptions is essential for users to navigate through these libraries of the future.

Thank you very much!

Row | Entry | Problem | corrected form | American prons | Variants
9766 | id=375s1 | ? | | |
5984 | upa? | ? | | |
4550 | mm.mm | (Two filled hesitations) | | M M |
979 | banw | abbreviation or word? | | B AE1 N UW0 |
1765 | ceau?escu | Accented letters | Ceausescu | CH AW0 CH EH1 S K Y UW0 |
13219 | | Accented letters | | AA1 | AE1
13220 | cole | Accented letters | | EH2 K OW1 L |
13221 | poque | Accented letters | | EH0 P OW1 K |
13222 | lite | Accented letters | | EY0 L IY1 T |
13223 | migr | Accented letters | | EH1 M IH0 G R EY2 |
6867 | verus | deliberate mispronunciation | | V IH1 ER0 AH0 S | V EH1 R AH0 S
10656 | oia_011207.tmp | Filename? | | |
691 | aldate's. | Final . | | AO1 L D EY2 T S | AO1 L D EY2 T S
792 | irish. | Final . | Irish | AY1 R IH2 SH |
926 | attaboy. | Final . | | AE1 T AH0 B OY2 |
934 | aubergines. | Final . | | OW1 B ER0 ZH IY0 N Z | OW1 B AH0 ZH IY0 N Z


Entry | S Rhotic British | S Nonrhotic British | N Rhotic British | N Nonrhotic British
rioch | R IY1 OH2 K | R IY1 OH2 K | R IY1 OH2 K | R IY1 OH2 K
risecote | R AY1 Z K OW2 T | R AY1 Z K OW2 T | R AY1 Z K OW2 T | R AY1 Z K OW2 T
ritto | R IH1 T OW0 | R IH1 T OW0 | R IH1 T OW0 | R IH1 T OW0
ritu | R IH1 T UW0 | R IH1 T UW0 | R IH1 T UW0 | R IH1 T UW0
ritzi | R IH1 T S IY0 | R IH1 T S IY0 | R IH1 T S IY0 | R IH1 T S IY0
rivermead | R IH1 V ER0 M IY2 D | R IH1 V AH0 M IY2 D | R IH1 V ER0 M IY2 D | R IH1 V AH0 M IY2 D
rivetus | R IH1 V AH0 T AH0 S | R IH1 V AH0 T AH0 S | R IH1 V AH0 T AH0 S | R IH1 V AH0 T AH0 S
roadrunners | R OW1 D R AH2 N ER0 Z | R OW1 D R AH2 N AH0 Z | R OW1 D R UH2 N ER0 Z | R OW1 D R UH2 N AH0 Z