Dan Jurafsky CS 424P/ LINGUIST 287 Extracting Social Meaning and Sentiment 1/5/07 Lecture 4: Speech...

Dan Jurafsky CS 424P/ LINGUIST 287 Extracting Social Meaning and Sentiment 1/5/07 Lecture 4: Speech and Prosody
Slide 3 Outline: Articulatory Phonetics; Acoustic Phonetics; Prosody; Pitch Accents; Disfluencies.
Slide 4 Phonetics. Articulatory Phonetics: how speech sounds are made by articulators (moving organs) in the mouth. Acoustic Phonetics: acoustic properties of speech sounds.
Slide 5 Speech Production Process. Respiration: we (normally) speak while breathing out; respiration provides airflow (pulmonic egressive airstream). Phonation: the airstream sets the vocal folds in motion; vibration of the vocal folds produces sound. The sound is then modulated by Articulation and Resonance: the shape of the vocal tract, characterized by the oral tract (teeth, soft palate (velum), hard palate, tongue, lips, uvula) and the nasal tract. Text adapted from Sharon Rose.
Slide 6 [Figure: sagittal section of the vocal tract (Techmer 1880); labels: Nasal Cavity, Pharynx, Vocal Folds (within the Larynx), Trachea, Lungs.] Text copyright J. J. Ohala, Sept 2001, from Sharon Rose slide.
Slide 7 From Mark Liberman's website, from Ultimate Visual Dictionary.
Slide 8 From Mark Liberman's web site, from Language Files (7th ed).
Slide 9 Vocal tract. Figure thanks to John Coleman.
Slide 10 Vocal tract movie (high speed x-ray). Figure of Ken Stevens, from Peter Ladefoged's web site.
Slide 11 Figure of Ken Stevens, labels from Peter Ladefoged's web site.
Slide 12 USC's SAIL Lab, Shri Narayanan.
Slide 13 Larynx and Vocal Folds. The larynx (voice box): a structure made of cartilage and muscle; located above the trachea (windpipe) and below the pharynx (throat); contains the vocal folds (adjective for larynx: laryngeal). Vocal folds (older term: vocal cords): two bands of muscle and tissue in the larynx; can be set in motion to produce sound (voicing). Text from slides by Sharon Rose, UCSD LING 111 handout.
Slide 14 The larynx, external structure, from front. Figure thanks to John Coleman.
Slide 15 Vertical slice through larynx, as seen from back. Figure thanks to John Coleman.
Slide 16 Voicing. Air comes up from the lungs and forces its way through the vocal cords, pushing them open (2, 3, 4). This causes air pressure in the glottis to fall, since: when gas runs through a constricted passage, its velocity increases (Venturi tube effect); this increase in velocity results in a drop in pressure (Bernoulli principle). Because of the drop in pressure, the vocal cords snap together again (6-10). Single cycle: ~1/100 of a second. Figure & text from John Coleman's web site.
Slide 17 Voicelessness. When the vocal cords are open, air passes through unobstructed. Voiceless sounds: p/t/k/s/f/sh/th/ch. If the air moves very quickly, the turbulence causes a different kind of phonation: whisper.
Slide 18 Vocal folds open during breathing. From Mark Liberman's web site, from Ultimate Visual Dictionary.
Slide 19 Vocal Fold Vibration. UCLA Phonetics Lab Demo.
Slide 20 Consonants and Vowels. Consonants: phonetically, sounds with audible noise produced by a constriction. Vowels: phonetically, sounds with no audible noise produced by a constriction. (It's more complicated than this, since we have to consider syllabic function, but this will do for now.) Text adapted from John Coleman.
Slide 21 Oral vs. Nasal Sounds. Thanks to Jong-bok Kim for this figure!
Slide 22 Tongue position for vowels.
Slide 23 Vowels. [Figure labels: IY, AA, UW.] Fig. from Eric Keller.
Slide 24 American English Vowel Space. [Figure: vowel chart with FRONT/BACK and HIGH/LOW axes; vowels: ey ow aw oy ay iy ih eh ae aa ao uw uh ah ax ix ux.] Figure from Jennifer Venditti.
Slide 25 [iy] vs. [uw]. Figure from Jennifer Venditti, from a lecture given by Rochelle Newman.
Slide 26 [ae] vs. [aa]. Figure from Jennifer Venditti, from a lecture given by Rochelle Newman.
Slide 27 2.
Acoustic Phonetics. Sound Waves: http://www.kettering.edu/~drussell/Demos/waves-intro/waves-intro.html
Slide 28 Simple Periodic Waves (sine waves). Characterized by: period $T$, amplitude $A$, phase. Fundamental frequency in cycles per second, or Hz: $F_0 = 1/T$. [Figure label: 1 cycle.]
Slide 29 Simple periodic waves. Computing the frequency of a wave: 5 cycles in 0.5 seconds = 10 cycles/second = 10 Hz. Amplitude: 1. Equation: $Y = A \sin(2\pi f t)$.
Slide 30 Speech sound waves. A little piece from the waveform of the vowel [iy]. Y axis: amplitude = amount of air pressure at that time point; positive is compression, zero is normal air pressure, negative is rarefaction. X axis: time.
Slide 31 Digitizing Speech.
Slide 32 Digitizing Speech. Analog-to-digital conversion, or A-D conversion. Two steps: sampling and quantization.
Slide 33 Sampling. Measuring the amplitude of a signal at time t. The sample rate needs to provide at least two samples for each cycle: one for the positive and one for the negative half of each cycle. More than two samples per cycle is OK. Fewer than two samples per cycle will cause frequencies to be missed. So the maximum frequency that can be measured is half the sampling rate. The maximum frequency for a given sampling rate is called the Nyquist frequency.
Slide 34 Sampling. Original signal in red. If we measure at the green dots, we will see a lower-frequency wave and miss the correct higher-frequency one!
Slide 35 Sampling. In practice we use the following sample rates: 16,000 Hz (samples/sec) for microphones, wideband; 8,000 Hz (samples/sec) for telephone. Why? We need at least 2 samples per cycle, so the maximum measurable frequency is half the sampling rate. Human speech < 10 kHz, so need max 20K. Telephone is filtered at 4K, so 8K is enough.
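The sampling slides above can be illustrated with a small sketch. It samples the sine wave $Y = A \sin(2\pi f t)$, computes the Nyquist frequency, and shows what happens to a tone above it. The function names and the 5000 Hz test tone are illustrative assumptions, not part of the lecture.

```python
import math

def nyquist_frequency(sample_rate_hz):
    """Highest frequency that can be measured at this sampling rate (half the rate)."""
    return sample_rate_hz / 2.0

def sample_sine(freq_hz, sample_rate_hz, n_samples, amplitude=1.0):
    """Sample Y = A sin(2*pi*f*t) at times t = 0, 1/rate, 2/rate, ..."""
    return [amplitude * math.sin(2 * math.pi * freq_hz * n / sample_rate_hz)
            for n in range(n_samples)]

def apparent_frequency(freq_hz, sample_rate_hz):
    """Frequency a pure tone appears to have after sampling (folding around Nyquist)."""
    folded = freq_hz % sample_rate_hz
    return min(folded, sample_rate_hz - folded)

print(nyquist_frequency(8000))         # 4000.0: telephone-rate sampling
print(apparent_frequency(5000, 8000))  # 3000: a 5000 Hz tone shows up as 3000 Hz

# The samples of the 5000 Hz tone are indistinguishable from a (phase-flipped)
# 3000 Hz tone when both are sampled at 8000 Hz.
high = sample_sine(5000, 8000, 16)
low = sample_sine(3000, 8000, 16)
print(all(abs(h + l) < 1e-9 for h, l in zip(high, low)))
```

This is the "green dots" situation from Slide 34: with fewer than two samples per cycle, the sampled points fit a lower-frequency wave equally well.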
Slide 36 Quantization. Representing the real value of each amplitude as an integer: 8-bit (-128 to 127) or 16-bit (-32768 to 32767). Formats: 16 bit PCM; 8 bit mu-law (log compression). Byte order: LSB (Intel) vs. MSB (Sun, Apple). Headers: raw (no header); Microsoft wav; Sun .au. [Figure label: 40 byte header.]
Slide 37 WAV format.
Slide 38 Fundamental frequency. Waveform of the vowel [iy]. Frequency: repetitions/second of a wave. The vowel above has 10 reps in 0.03875 secs, so the frequency is 10/0.03875 = 258 Hz. This is the speed at which the vocal folds move, hence voicing. Each peak corresponds to an opening of the vocal folds. The frequency of the complex wave is called the fundamental frequency of the wave, or F0.
Slide 39 Pitch track.
Slide 40 Amplitude. We need a way to talk about the amplitude of a region of a signal over time. We can't just average all the values. Why not? So we often talk about RMS amplitude.
Slide 41 Power and Intensity. Power: related to the square of amplitude. Intensity in air: power normalized to the auditory threshold, given in dB. $P_0$ is the auditory threshold pressure = $2 \times 10^{-5}$ Pa.
Slide 42 Plot of Intensity.
Slide 43 Pitch and Loudness. Pitch is the mental sensation or perceptual correlate of F0. The relationship between pitch and F0 is not linear; human pitch perception is most accurate between 100 Hz and 1000 Hz: linear in this range, logarithmic above 1000 Hz. The mel scale is one model of this F0-pitch mapping. A mel is a unit of pitch defined so that pairs of sounds which are perceptually equidistant in pitch are separated by an equal number of mels. Frequency in mels = $1127 \ln(1 + f/700)$.
Slide 44 "She just had a baby". Note that vowels all have regular amplitude peaks. Stop consonant: closure followed by release. Notice the silence followed by slight bursts of emphasis: very clear for [b] of "baby". Fricative: noisy.
[sh] of "she" at beginning.
Slide 45 Fricative.
Slide 46 Waves have different frequencies. [Figure: a 100 Hz wave and a 1000 Hz wave.]
Slide 47 Complex waves: adding a 100 Hz and a 1000 Hz wave together.
Slide 48 Spectrum. [Figure axes: Frequency in Hz (ticks at 100 and 1000) vs. Amplitude.] Frequency components (100 and 1000 Hz) on x-axis.
Slide 49 Spectra continued. Fourier analysis: any wave can be represented as the (infinite) sum of sine waves of different frequencies (amplitude, phase).
Slide 50 Spectrum of one instant in an actual soundwave: many components across the frequency range.
Slide 51 Part of [ae] waveform from "had". Note the complex wave repeating nine times in the figure, plus smaller waves which repeat 4 times for every large pattern. The large wave has a frequency of 250 Hz (9 times in 0.036 seconds). The small wave is roughly 4 times this, or roughly 1000 Hz. Two little tiny waves on top of each peak of the 1000 Hz waves.
Slide 52 Back to spectrum. The spectrum represents these frequency components. Computed by Fourier transform, an algorithm which separates out each frequency component of a wave. x-axis shows frequency, y-axis shows magnitude (in decibels, a log measure of amplitude). Peaks at 930 Hz, 1860 Hz, and 3020 Hz.
Slide 53 Seeing formants: the spectrogram.
Slide 54 Formants. Vowels are largely distinguished by 2 characteristic pitches. One of them (the higher of the two) goes downward throughout the series iy ih eh ae aa ao ou u. The other goes up for the first four vowels and then down for the next four. These are called "formants" of the vowels; the lower is the 1st formant, the higher is the 2nd formant.
Slide 55 Spectrogram: spectrum + time dimension.
Slide 56 Different vowels have different formants. Vocal tract as "amplifier"; amplifies different frequencies. Formants are the result of different shapes of the vocal tract. Any body of air will vibrate in a way that depends on its size and shape. Air in the vocal tract is set in vibration by the action of the vocal cords.
Every time the vocal cords open and close, a pulse of air from the lungs acts like a sharp tap on the air in the vocal tract, setting the resonating cavities into vibration so that they produce a number of different frequencies.
Slide 57 Again: why is a speech sound wave composed of these peaks? Articulatory facts: the vocal cord vibrations create harmonics; the mouth is an amplifier; depending on the shape of the mouth, some harmonics are amplified more than others.
Slide 58 From Mark Liberman's web site.
Slide 59 How formants are produced. Q: Why do different vowels have different pitches if the vocal cords are vibrating at the same rate? A: This is a confusion of frequencies of SOURCE and frequencies of FILTER!
Slide 60 Source-filter model of speech production. [Figure labels: Input, Filter, Output; Glottal spectrum, Vocal tract frequency response function.] Figures and text from Ratree Wayland slide from his website. Source and filter are independent, so: different vowels can have the same pitch; the same vowel can have different pitch.
Slide 61
Slide 62 Deriving schwa: how the shape of the mouth (filter function) creates peaks! Reminder of basic facts about sound waves: $f = c/\lambda$, where $c$ = speed of sound (approx. 35,000 cm/sec). A sound with $\lambda$ = 10 meters has low frequency: $f$ = 35 Hz (35,000/1000). A sound with $\lambda$ = 2 centimeters has high frequency: $f$ = 17,500 Hz (35,000/2).
Slide 63 Resonances of the vocal tract. The human vocal tract as an open tube. Air in a tube of a given length will tend to vibrate at the resonance frequency of the tube. [Figure labels: Closed end, Open end, Length 17.5 cm.] Figure from Ladefoged (1996) p 117.
Slide 64 Resonances of the vocal tract. The human vocal tract as an open tube. Air in a tube of a given length will tend to vibrate at the resonance frequency of the tube. [Figure labels: Closed end, Open end, Length 17.5 cm.] Figure from W.
Barry Speech Science slides.
Slide 65 Resonances of the vocal tract. If the vocal tract is a cylindrical tube open at one end: standing waves form in tubes; waves will resonate if their wavelength corresponds to the dimensions of the tube. Constraint: the pressure differential should be maximal at the (closed) glottal end and minimal at the (open) lip end. The next slide shows what lengths of wave can fit into a tube with this constraint.
Slide 66 From Sundberg.
Slide 67 Computing the 3 formants of schwa. Let the length of the tube be $L$.
$F_1 = c/\lambda_1 = c/(4L) = 35{,}000/(4 \times 17.5) = 500$ Hz
$F_2 = c/\lambda_2 = c/(\tfrac{4}{3}L) = 3c/(4L) = 3 \times 35{,}000/(4 \times 17.5) = 1500$ Hz
$F_3 = c/\lambda_3 = c/(\tfrac{4}{5}L) = 5c/(4L) = 5 \times 35{,}000/(4 \times 17.5) = 2500$ Hz
So we expect a neutral vowel to have 3 resonances at 500, 1500, and 2500 Hz. These vowel resonances are called formants.
Slide 68 Figures from Ratree Wayland slides from his website. Vowel [i] sung at successively higher pitch. [Figure panels 1-7.]
Slide 69 Summary: Acoustic Phonetics. Waves, sound waves, and spectra. Speech waveforms. F0, pitch, intensity. Spectra. Spectrograms. Formants. Reading spectrograms. Deriving schwa: why are formants where they are. PRAAT. Resources: dictionaries and phonetically-labeled corpora.
Slide 70 3. Prosody
Slide 71 Defining Intonation. Ladd (1996), Intonational Phonology: the use of suprasegmental phonetic features to convey sentence-level pragmatic meanings. Suprasegmental = above and beyond the segment/phone: F0, intensity (energy), duration. Sentence-level pragmatic meanings: i.e. meanings that apply to phrases or utterances as a whole, not lexical stress, not lexical tone.
Slide 72 Three aspects of prosody. Prominence: some syllables/words are more prominent than others. Structure/boundaries: sentences have prosodic structure; some words group naturally together, others have a noticeable break or disjuncture between them. Tune: the intonational melody of an utterance. From Ladd (1996).
Slide 73 Prosodic Prominence: Pitch Accents. A: What types of foods are a good source of vitamins?
B1: Legumes are a good source of VITAMINS. B2: LEGUMES are a good source of vitamins. Prominent syllables are: louder; longer; have higher F0 and/or sharper changes in F0 (higher F0 velocity). Slide from Jennifer Venditti.
Slide 74 Prosodic Boundaries. I met Mary and Elena's mother at the mall yesterday. French [bread and cheese]; [French bread] and [cheese]. Slide from Jennifer Venditti.
Slide 75 Prosodic Tunes. Legumes are a good source of vitamins. Are legumes a good source of vitamins? Slide from Jennifer Venditti.
Slide 76 Prosody Part I: Thinking about F0
Slide 77 Graphic representation of F0. [Figure: F0 (in Hertz) over time for "legumes are a good source of VITAMINS".] Slide from Jennifer Venditti.
Slide 78 The "ripples". [Figure: "legumes are a good source of VITAMINS", with [t] and [s] marked.] F0 is not defined for consonants without vocal fold vibration. Slide from Jennifer Venditti.
Slide 79 The "ripples". [Figure: "legumes are a good source of VITAMINS", with [v], [g], [z] marked.] ... and F0 can be perturbed by consonants with an extreme constriction in the vocal tract. Slide from Jennifer Venditti.
Slide 80 Abstraction of the F0 contour. [Figure: "legumes are a good source of VITAMINS".] Our perception of the intonation contour abstracts away from these perturbations. Slide from Jennifer Venditti.
Slide 81 The "waves" and the "swells". [Figure: "legumes are a good source of VITAMINS"; "wave" = accent, "swell" = phrase.] Slide from Jennifer Venditti.
Slide 82 Prosody Part II: Prominence: Placement of Pitch Accents
Slide 83 Stress vs. accent. Stress is a structural property of a word: it marks a potential (arbitrary) location for an accent to occur, if there is one. Accent is a property of a word in context: it is a way to mark intonational prominence in order to "highlight" important words in the discourse. [Figure: metrical grid for "vitamins" and "California", with rows (x) accented syll; x x stressed syll; x x x full vowels; x x x x x x x syllables.] Slide from Jennifer Venditti.
Slide 84 Stress vs. accent (2). The speaker decides to make the word "vitamin" more prominent by accenting it.
Lexical stress tells us that this prominence will appear on the first syllable, hence VItamin. So we will have to look at both the lexicon and the context to predict the details of prominence. I'm a little surPRISED to hear it CHARacterized as upBEAT.
Slide 85 Which word receives an accent? It depends on the context. The "new" information in the answer to a question is often accented, while the "old" information is usually not. Q1: What types of foods are a good source of vitamins? A1: LEGUMES are a good source of vitamins. Q2: Are legumes a source of vitamins? A2: Legumes are a GOOD source of vitamins. Q3: I've heard that legumes are healthy, but what are they a good source of? A3: Legumes are a good source of VITAMINS. Slide from Jennifer Venditti.
Slide 86 Same "tune", different alignment. LEGUMES are a good source of vitamins. The main rise-fall accent (= "I assert this") shifts locations. Slide from Jennifer Venditti.
Slide 87 Same "tune", different alignment. Legumes are a GOOD source of vitamins. The main rise-fall accent (= "I assert this") shifts locations. Slide from Jennifer Venditti.
Slide 88 Same "tune", different alignment. legumes are a good source of VITAMINS. The main rise-fall accent (= "I assert this") shifts locations. Slide from Jennifer Venditti.
Slide 89 Levels of prominence. Most phrases have more than one accent. The last accent in a phrase is perceived as more prominent; it is called the Nuclear Accent. Emphatic accents like the nuclear accent are often used for semantic purposes, such as indicating that a word is contrastive, or the semantic focus: the kind of thing you use ***s for in IM, or capitalized letters. "I know SOMETHING interesting is sure to happen," she said to herself. Can also have words that are less prominent than usual: reduced words, especially function words.
Often use 4 classes of prominence: emphatic accent, pitch accent, unaccented, reduced.
Slide 90 Prosody Part III: Intonational phrasing/boundaries
Slide 91 A single intonation phrase. legumes are a good source of vitamins. Broad focus statement consisting of one intonation phrase (that is, one intonation tune spans the whole unit). Slide from Jennifer Venditti.
Slide 92 Multiple phrases. legumes are a good source of vitamins. Utterances can be "chunked" up into smaller phrases in order to signal the importance of information in each unit. Slide from Jennifer Venditti.
Slide 93 I wanted to go to London, but could only get tickets for France.
Slide 94 Phrasing can disambiguate. Global ambiguity: The old men and women stayed home. The old men % and women % stayed home. Sally saw % the man with the binoculars. Sally saw the man % with the binoculars. John doesn't drink because he's unhappy. John doesn't drink % because he's unhappy. Slide from Jennifer Venditti.
Slide 95 Phrasing sometimes helps disambiguate. I met Mary and Elena's mother at the mall yesterday. [Figure labels: Mary & Elena's mother; mall.] One intonation phrase with relatively flat overall pitch range. Slide from Jennifer Venditti.
Slide 96 Phrasing sometimes helps disambiguate. I met Mary and Elena's mother at the mall yesterday. [Figure labels: Mary; Elena's mother; mall.] Separate phrases, with expanded pitch movements. Slide from Jennifer Venditti.
Slide 97 Intonational tunes
Slide 98 Yes-No question tune. are LEGUMES a good source of vitamins. Rise from the main accent to the end of the sentence. Slide from Jennifer Venditti.
Slide 99 Yes-No question tune. are legumes a GOOD source of vitamins. Rise from the main accent to the end of the sentence. Slide from Jennifer Venditti.
Slide 100 Yes-No question tune. are legumes a good source of VITAMINS. Rise from the main accent to the end of the sentence. Slide from Jennifer Venditti.
Slide 101 WH-questions. WHAT are a good source of vitamins. WH-questions typically have falling contours, like statements.
[I know that many natural foods are healthy, but...] Slide from Jennifer Venditti.
Slide 102 Broad focus. legumes are a good source of vitamins. "Tell me something about the world." In the absence of narrow focus, English tends to mark the first and last "content" words with perceptually prominent accents. Slide from Jennifer Venditti.
Slide 103 Rising statements. legumes are a good source of vitamins. High-rising statements can signal that the speaker is seeking approval. "Tell me something I didn't already know." [... does this statement qualify?] Slide from Jennifer Venditti.
Slide 104 Yes-No question. are legumes a good source of VITAMINS. Rise from the main accent to the end of the sentence. Slide from Jennifer Venditti.
Slide 105 "Surprise-redundancy" tune. legumes are a good source of vitamins. Low beginning followed by a gradual rise to a high at the end. [How many times do I have to tell you...] Slide from Jennifer Venditti.
Slide 106 "Contradiction" tune. linguini isn't a good source of vitamins. Sharp fall at the beginning, flat and low, then rising at the end. "I've heard that linguini is a good source of vitamins." [... how could you think that?] Slide from Jennifer Venditti.
Slide 107 4. Pitch Accent. Prediction from text. Detection from text and speech. Advanced linguistic models that capture tune.
Slide 108 Study 1: Pitch accent prediction. Which words in an utterance should bear accent? Accent in the sense of simplified ToBI (Ostendorf et al. 2001, Shattuck-Hufnagel and Ostendorf 1999). i believe at ibm they make you wear a blue suit. i BELIEVE at IBM they MAKE you WEAR a BLUE SUIT. 2001 was a good movie, if you had read the book. 2001 was a good MOVIE, if you had read the BOOK.
Slide 109 Factors in accent prediction. Part of speech: content words are usually accented; function words are rarely accented (of, for, in, on, that, the, a, an, no, to, and, but, or, will, may, would, can, her, is, their, its, our, there, is, am, are, was, were, etc.).
Slide 110 But not just function/content: a Broadcast News example from Hirschberg (1993). SUN MICROSYSTEMS INC, the UPSTART COMPANY that HELPED LAUNCH the DESKTOP COMPUTER industry TREND TOWARD HIGH powered WORKSTATIONS, was UNVEILING an ENTIRE OVERHAUL of its PRODUCT LINE TODAY. SOME of the new MACHINES, PRICED from FIVE THOUSAND NINE hundred NINETY five DOLLARS to seventy THREE thousand nine HUNDRED dollars, BOAST SOPHISTICATED new graphics and DIGITAL SOUND TECHNOLOGIES, HIGHER SPEEDS AND a CIRCUIT board that allows FULL motion VIDEO on a COMPUTER SCREEN.
Slide 111 Factors in accent prediction. Contrast: Legumes are a poor source of VITAMINS. No, legumes are a GOOD source of vitamins. I think JOHN or MARY should go. No, I think JOHN AND MARY should go.
Slide 112 List intonation. I went and saw ANNA, LENNY, MARY, and NORA.
Slide 113 Word order. Preposed items are accented more frequently. TODAY we will BEGIN to LOOK at FROG anatomy. We will BEGIN to LOOK at FROG anatomy today.
Slide 114 Information Status. New versus old information: old information is deaccented. There are LAWYERS, and there are GOOD lawyers. EACH NATION DEFINES its OWN national INTEREST. I LIKE GOLDEN RETRIEVERS, but MOST dogs LEAVE me COLD.
Slide 115 Complex Noun Phrase Structure. Sproat, R. 1994. English noun-phrase accent prediction for text-to-speech. Computer Speech and Language 8:79-94. Proper names: stress on right-most word (New York CITY; Paris, FRANCE). Adjective-noun combinations: stress on noun (large HOUSE, red PEN, new NOTEBOOK). Noun-noun compounds: stress left noun. HOTdog (food) versus HOT DOG (overheated animal). WHITE house (place) versus WHITE HOUSE (made of stucco). Examples: MEDICAL building, APPLE cake, cherry PIE.
What about: Madison avenue, park street? Some rules: Furniture + Room -> RIGHT (e.g., kitchen TABLE); Proper-name + Street -> LEFT (e.g., PARK street).
Slide 116 Other features. POS; POS of previous word; POS of next word; stress of current, previous, next syllable; unigram probability of word; bigram probability of word; position of word in sentence.
Slide 117 Advanced features. Accent is often deflected away from a word due to focus on a neighboring word. Could use syntactic parallelism to detect this kind of contrastive focus: ... driving [FIFTY miles] an hour in a [THIRTY mile] zone. [WELD] [APPLAUDS] mandatory recycling. [SILBER] [DISMISSES] recycling goals as meaningless. ... but while Weld may be [LONG] on people skills, he may be [SHORT] on money.
Slide 118 Previous state of the art on accent prediction. Useful features include (starting with Hirschberg 1993): lexical class (function words, clitics not accented); word frequency; word identity (promising but problematic); Given/New, Theme/Rheme; focus; word bigram predictability; position in phrase; complex nominal rules (Sproat). Combined in a machine learning classifier: decision trees (Hirschberg 1993), bagging/boosting (Sun 2002), hidden Markov models (Hasegawa-Johnson et al 2005), conditional random fields (Gregory and Altun 2004).
Slide 119 Study 1: Nenkova et al 2007, 2008. What features are the best predictors of pitch accent? How much do sophisticated linguistic features help over simple features? Given/New; Focus/contrast. How can we make use of word identity? Can we rely on word identity across genres?
Slide 120 Corpus and approach. 12 Switchboard conversations; 14,555 tokens; annotated for binary prominence. M. Ostendorf, I. Shafran, S. Shattuck-Hufnagel, L. Carmichael, and W. Byrne. 2001. A prosodically labeled database of spontaneous speech. ISCA Workshop on Prosody, pp 119-121.
Majority class: 58% of words not accented. Decision tree classifier. Supervised machine learning: training on labeled tokens, trying to predict +/- accent. Exhaustive comparison of predictors with different feature sets: combinations of 1 to 5 features.
Slide 121 Features. Information status: given/new/mediated. Nissim, Dingare, Carletta, and Steedman 2004. An annotation scheme for information status in dialogue. LREC 2004. they[old] have all the WATER[new] they[old] WANT. They[old] can ACTUALLY PUMP water[old]. Kontrast (contrast, focus, focus-sensitive adverb, etc.). Calhoun, Nissim, Steedman, Brenier 2005. A framework for annotating information structure in discourse. Proceedings of the ACL Pie in the Sky Workshop, 45-25. YOU take this subject much more personally than I do. (How much does a nanny cost?) I THINK it's about SIXTY DOLLARS a WEEK for TWO children. Position in utterance (# of words from beginning/end): Shattuck-Hufnagel, Ostendorf, Ross 1994; Lisa Selkirk this morning. Animacy (Zaenen et al 2004). Dialog act (Jurafsky et al 1998). POS, n-gram prob, tf.idf, word len, stopword, distance from pause.
Slide 122 Accent ratio feature. Is prominence a property of the word itself? Of all the times this word occurred, what percentage were accented?
Memorized from a labeled corpus: 60 Switchboard conversations (Ostendorf et al 2001). Detailed computation of accent ratio: k = number of times a word is prominent; n = all occurrences of the word.
Slide 123 Single feature prediction accuracy. Accent ratio significantly outperforms word frequency. Accent ratio: 75.59%. Word frequency: 73.77%. Part of speech: 70.28%. Accent ratio classifier: a word is labeled not-prominent if AR < 0.38; words not in the accent ratio dictionary are assigned to the prominent class. Linguistic features are not powerful by themselves. Focus/contrast (kontrast): 67.57%. Given/new: 64.13%.
Slide 124 Combining features. Accent ratio is always in the best classifiers. Kontrast (focus/contrast) is most helpful in combination with accent ratio. Small or no effect of givenness, distance. Best overall result: 76.7% accuracy (AR + kontrast + tf.idf + givenness + distance); 75.6% accuracy (AR alone).
Slide 125 Why didn't given/new help? Restricted applicability: nouns/pronouns. Influences the form of referring expression: most old information is expressed with a pronoun; pronouns are usually not accented. Mediated category: more than half of the annotated nouns; different accents appropriate depending on the semantic relation.
Slide 126 Information status. New information is more likely to be accented. But most nouns are accented anyway: 1215/1864 = 65% accenting of nouns overall.
Slide 127 Why is accent ratio useful?
Covers all part-of-speech categories. Low accent ratio words: a, uh, thing, some, been, up, out, its, them, him; stayed, supposed, said, say, wanna; minutes, little, thing, anything. High accent ratio words: actually, anyway, yeah, wow, gosh, yes, no, exactly; one intonation phrase + exclamatives; too, also, else; focus-inducing words; rather, major, great, poor; intensifiers; especially, definitely.
Slide 128 Accent Ratio as analytic tool. Disfluencies and their effect on accent in spontaneous speech: Christodoulou, Babwah and Arnold (2008). Discourse markers ("and" has higher AR in spontaneous speech). Genre differences: spontaneous speech (conversation, interview) vs read speech (storybooks, broadcast). Read speech: words don't vary. Spontaneous speech: more variation. Other differences: Yuan, Brenier, Jurafsky. 2005. Pitch Accent Prediction: Effects of Genre and Speaker. Eurospeech. Brenier. 2008 ms. The automatic prediction of prosodic prominence from text. PhD Dissertation.
Slide 129 Is accent ratio robust to genre? Cross-genre experiment: broadcast news. Cross: accent ratio from Switchboard, 82% accuracy. Within: unigram, bigram, backwards bigram probability, trained on broadcasts, 83.67% accuracy. So even across genre, accent ratio is still a very useful predictor of accent.
Slide 130 Summary for Study 1. The best predictor of a word being accented is whether the word tends to be accented in general. Kontrast is also useful. Given/New status is not very helpful: this information is already carried by NP form (pronoun), and we don't have good theoretical predictions about "mediated" entities.
Slide 131 Study 2: Pitch accent detection. Sridhar, Nenkova, Narayanan, Jurafsky. Speech Prosody 2008. Nenkova and Jurafsky 2007. ASRU 2007. Can we detect pitch accent from speech and text? How best to combine acoustic and lexical cues? How useful is contextual information (from neighboring words)?
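The accent ratio feature from Study 1, which is also the main text feature carried into Study 2, can be sketched in a few lines. This is a minimal sketch under stated assumptions: it uses the plain ratio k/n described on the slides, the 0.38 threshold and the "unseen words are prominent" rule from Slide 123, and omits any significance testing or smoothing the published feature may apply. The function names and the toy corpus are hypothetical.

```python
from collections import Counter

def build_accent_ratios(labeled_tokens):
    """Build a word -> accent ratio dictionary from (word, is_accented) pairs.

    For each word, k = number of accented occurrences and n = all occurrences;
    the accent ratio is k / n.
    """
    accented = Counter()
    total = Counter()
    for word, is_accented in labeled_tokens:
        w = word.lower()
        total[w] += 1
        if is_accented:
            accented[w] += 1
    return {w: accented[w] / total[w] for w in total}

def predict_accent(word, accent_ratios, threshold=0.38):
    """Label a word prominent (True) unless its accent ratio is below the threshold.

    Words missing from the dictionary are assigned to the prominent class.
    """
    w = word.lower()
    if w not in accent_ratios:
        return True
    return accent_ratios[w] >= threshold

# Hypothetical toy corpus: "the" accented 1 time in 10, "wow" accented every time.
toy_corpus = [("the", False)] * 9 + [("the", True)] + [("wow", True)] * 3
ratios = build_accent_ratios(toy_corpus)
print(ratios["the"], predict_accent("the", ratios))  # 0.1 False
print(ratios["wow"], predict_accent("wow", ratios))  # 1.0 True
print(predict_accent("legumes", ratios))             # True (unseen word)
```

Because the dictionary is just memorized counts, it can be built on one labeled corpus and applied to another, which is what the cross-genre experiment on Slide 129 tests.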
Slide 132 Our experiment. The same 12 Switchboard conversations; 14,555 word tokens. The task is again predicting whether a word is accented, using: text features (in this experiment, accent ratio + POS); acoustic features. Combined in a CRF (Conditional Random Field) classifier. A CRF is sort of a version of logistic regression that deals well with sequences. Evaluated by how well we match human accent labels.
Slide 133 Acoustic features tested. Duration of word. Pitch: F0 mean of word (normalized by side); F0 std dev (normalized by side); max F0 in word; min F0 in word; F0 slope. Energy: mean RMS energy in word; energy std dev; energy slope across word; RMS energy in first half of word; RMS energy in second half of word.
Slide 134 Final set of acoustic features: duration of word; std dev of F0 values in word; linear regression F0 slope over all points of word; ratio of F0 mean in second and first half; difference in F0 mean of second and first half of word, normalized by mean and std dev of F0 in convside; std dev of RMS energy in word.
Slide 135 Log-linear classifier.
Features          Accuracy in %
Current word      68.1
Current POS       70.2
Accent Ratio      75.3
Word+POS+AR       75.4
Acoustics only    73.1
All Features      77.4
Accent ratio by itself is better than acoustic features. Combining them helps. But would it help also to use contextual information?
Slide 136 Better results using context.
Features in CRF   Current word   ±1 word   ±2 words
Words             68.1%          75.7%     75.1%
POS tags          70.2%          72.7%     73.2%
Accent Ratio      75.3%          75.2%     75.1%
Word+POS+AR       75.4%          76.0%     75.9%
Acoustics only    73.1%          74.0%     73.9%
All Features      77.4%          78.3%     78.2%
Best results (78.3%) use one previous and one following word. Again accent ratio is better than acoustics, but the combination is better still. We think these are the best published numbers on this task.
Slide 137 How well do humans do?
Compare human to human. Read news (Boston University Radio): Ostendorf, Price, Shattuck-Hufnagel (1995). One identical portion (1662 words) was read by 6 different speakers. We computed pairwise agreement between humans. Accuracy: average 82%, min 79%, max 85%. This suggests that current performance of 78.3% might be approaching human performance.
Slide 138 Summary from Study 2. Machine learning: sequence models (CRF) give best performance on accent detection; confirms CRF results of Gregory and Altun (2004) and HMM results of Hasegawa-Johnson et al (2005). Linguistic features: single preceding and following word most useful (probably captures stress clash/lapse). Accent ratio is still the best single feature. Acoustic features give further improvement. Surprisingly, the most useful acoustic features are raw: not normalized by speaker!
Slide 139 Advanced: Intonational Transcription Theories: ToBI
Slide 140 ToBI: Tones and Break Indices. Pitch accent tones: H* "peak accent"; L* "low accent"; L+H* "rising peak accent" (contrastive); L*+H "scooped accent"; H+!H* downstepped high. Boundary tones: L-L% (final low; Am. Eng. declarative contour); L-H% (continuation rise); H-H% (yes-no question). Break indices: 0: clitics; 1: word boundaries; 2: short pause; 3: intermediate intonation phrase; 4: full intonation phrase/final boundary.
Slide 141 Examples of the ToBI system. I don't eat beef. L* L* L*L-L%. Marianna made the marmalade. H* L-L% L* H-H%. "I" means insert. H* H* H*L-L% 1 H*L- H*L-L% 3. Slide from Lavoie and Podesva.
Slide 142 5.
Disfluencies

Slide 143 Disfluencies

Slide 144 Disfluencies: standard terminology (Levelt)
Reparandum: thing repaired
Interruption point (IP): where speaker breaks off
Editing phase (edit terms): uh, I mean, you know
Repair: fluent continuation

Slide 145 Counts (from Shriberg, Heeman)
Sentence disfluency rate
  ATIS: 6% of sentences disfluent (10% of long sentences)
  Levelt human dialogs: 34% of sentences disfluent
  Swbd: ~50% of multiword sentences disfluent
  TRAINS: 10% of words are in reparandum or editing phrase
Word disfluency rate
  SWBD: 6%
  ATIS: 0.4%
  AMEX: 13% (human-human air travel)

Slide 146 Prosodic characteristics of disfluencies
Nakatani and Hirschberg (1994)
Fragments are good cues to disfluencies
Prosody:
  Pause duration is shorter in disfluent silence than in fluent silence
  F0 increases from end of reparandum to beginning of repair, but the change is only minor
  Repair interval offsets have a minor prosodic phrase boundary, even in the middle of an NP:
    Show me all n- | round-trip flights | from Pittsburgh | to Atlanta

Slide 147 Syntactic Characteristics of Disfluencies
Hindle (1983)
The repair often has the same structure as the reparandum
Both are Noun Phrases (NPs) in this example:
So if we could automatically find the IP, we could find and correct the reparandum!

Slide 148 Use of different disfluencies
Clark and Fox Tree
Looked at um and uh
  uh includes er (er is just the British non-rhotic dialect spelling for uh)
Different meanings
  Uh: used to announce minor delays
    Preceded and followed by shorter pauses
  Um: used to announce major delays
    Preceded and followed by longer pauses

Slide 149 Um versus uh: delays (Clark and Fox Tree)

Slide 150 Utterance Planning
The more difficulty speakers have in planning, the more delays
Consider 3 locations:
  I: before intonation phrase: hardest
  II: after first word of intonation phrase: easier
  III: later: easiest
And then uh somebody said, . [I] but um -- [II] don't you think there's evidence of this, in the twelfth - [III] and thirteenth centuries?
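
The standard terminology (reparandum, interruption point, edit terms, repair) and the observation that finding the IP lets us correct the reparandum can be made concrete with a small sketch. Everything named here is an illustrative assumption rather than something from the lecture: the `Disfluency` class, its field names, the `clean` helper, and the example utterance are all made up for the illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Disfluency:
    """Hypothetical container; indices are token positions in the utterance."""
    reparandum_start: int    # first token of the reparandum
    interruption_point: int  # index just past the last reparandum token (the IP)
    edit_end: int            # index just past the edit terms; the repair starts here


def clean(tokens: List[str], d: Disfluency) -> List[str]:
    """Drop the reparandum and the edit terms, keeping the repair."""
    return tokens[:d.reparandum_start] + tokens[d.edit_end:]


# Made-up utterance: reparandum "from Boston", edit terms "uh I mean",
# repair "from Denver".
tokens = "show me flights from Boston uh I mean from Denver".split()
d = Disfluency(reparandum_start=3, interruption_point=5, edit_end=8)

reparandum = tokens[d.reparandum_start:d.interruption_point]
edit_terms = tokens[d.interruption_point:d.edit_end]
print(" ".join(clean(tokens, d)))  # show me flights from Denver
```

The sketch assumes the span boundaries are already known; locating them automatically is the detection problem taken up later in the lecture.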
Slide 151 Delays at different points in phrase

Slide 152 More on location of FPs
Peters: medical dictation task
  Monologue rather than dialogue
  In this data, FPs occurred INSIDE clauses
  Trigram PP (perplexity) after FP: 367
  Trigram PP (perplexity) after word: 51
Stolcke and Shriberg (1996b)
  $w_k$ FP $w_{k+1}$: looked at $P(w_{k+1} \mid w_k)$
  Transition probabilities are lower for these transitions than for normal ones
Conclusion:
  People use FPs when they are planning difficult things, so the following words are likely to be unexpected/rare/difficult

Slide 153 Repeaters and Deleters
Fast speakers get ahead of themselves and restart
Slow speakers wait and use filled pauses but don't restart

Slide 154 Detecting Disfluencies

Slide 155 Recent work: EARS Metadata Evaluation (MDE)
A recent multiyear DARPA bakeoff
Sentence-like Unit (SU) detection:
  Find end points of SU
  Detect subtype (question, statement, backchannel)
Edit word detection:
  Find all words in reparandum (words that will be removed)
Filler word detection
  Filled pauses (uh, um)
  Discourse markers (you know, like, so)
  Editing terms (I mean)
Interruption point detection
Liu et al. 2003

Slide 156 Kinds of disfluencies
Repetitions
  I * I like it
Revisions
  We * I like it
Restarts (false starts)
  It's also * I like it

Slide 157 MDE transcription
Conventions: ./ for statement SU boundaries, for fillers, [] for edit words, * for IP (interruption point) inside edits
And wash your clothes wherever you are./ and [ you ] * you really get used to the outdoors./

Slide 158 MDE Labeled Corpora
                       CTS     BN
Training set (words)   484K    182K
Test set (words)       35K     45K
STT WER (%)            14.9    11.7
SU %                   13.6    8.1
Edit word %            7.4     1.8
Filler word %          6.8     1.8

Slide 159 MDE Algorithms
Use both text and prosodic features
At each interword boundary
  Extract prosodic features (pause length, durations, pitch contours, energy contours)
  Use N-gram language model
  Combine via HMM, Maxent, CRF, or other classifier

Slide 160 State of the art: Edit word detection
Multi-stage model
  HMM combining LM and decision tree finds IP
  Heuristic rules find onset of reparandum
  Separate repetition detector for repeated words
One-stage model
  CRF jointly finds edit region and IP
  BIO tagging (each word is tagged as beginning of edit, inside edit, or outside edit)
Error rates:
  43-50% using transcripts
  80-90% using ASR

Slide 161 Fragments
Incomplete or cut-off words:
  Leaving at seven fif- eight thirty
  uh, I, I d-, don't feel comfortable
  You know the fam-, well, the families
  I need to know, uh, how- how do you feel
  Uh yeah, yeah, well, it- it- that's right. And it-
SWBD: around 0.7% of words are fragments (Liu 2003)
ATIS: 60.2% of repairs contain fragments (6% of corpus sentences had at least 1 repair) (Bear et al. 1992)
Another ATIS corpus: 74% of all reparanda end in word fragments (Nakatani and Hirschberg 1994)

Slide 162 Fragment glottalization
Uh yeah, yeah, well, it- it- that's right. And it-

Slide 163 Why fragments are important
Frequent enough to be a problem:
  Only 1% of words / 3% of sentences
  But if we miss a fragment, we tend to get the surrounding words wrong (word segmentation error)
  Goldwater et al.: 14% absolute increase in word error rate (from 18% to 32%) for words before fragments!
Useful for finding other repairs
  In 40% of SRI-ATIS sentences containing fragments, the fragment occurred at the right edge of a long repair
  74% of ATT-ATIS reparanda ended in fragments
Sometimes they are the only cue to a repair
  leaving at eight thirty

Slide 164 Cues for fragment detection
49/50 cases examined ended in silence > 60 ms; average 282 ms (Bear et al.)
24 of 25 vowel-final fragments glottalized (Bear et al.)
  Glottalization: increased time between glottal pulses
75% don't even finish the vowel in the first syllable (i.e., speaker stopped after first consonant) (O'Shaughnessy)

Slide 165 Cues for fragment detection
Nakatani and Hirschberg (1994)
Word fragments tend to be content words:
Lexical Class    Tokens   Percent
Content          121      42%
Function         12       4%
Untranscribed    155      54%

Slide 166 Cues for fragment detection
Nakatani and Hirschberg (1994)
91% are one syllable or less
Syllables   Tokens   Percent
0           113      39%
1           149      52%
2           25       9%
3           1        0.3%

Slide 167 Cues for fragment detection
Nakatani and Hirschberg (1994)
Fricative-initial common; not vowel-initial
Class    % words   % frags   % 1-C frags
Stop     23%                 11%
Vowel    25%       13%       0%
Fric     33%       45%       73%

Slide 168 Liu (2003): Acoustic-Prosodic detection of fragments
Prosodic features
  Duration (from alignments)
    Of word, pause, last rhyme in word
    Normalized in various ways
  F0 (from pitch tracker)
    Modified to compute stylized speaker-specific contours
  Energy
    Frame-level, modified in various ways

Slide 169 Liu (2003): Acoustic-Prosodic detection of fragments
Voice Quality Features
  Jitter
    A measure of perturbation in pitch period
    Praat computes this
  Spectral tilt
    Overall slope of spectrum
    Speakers modify this when they stress a word
  Open Quotient
    Ratio of the time the vocal folds are open to the total length of the glottal cycle
    Can be estimated from first and second harmonics
    Creaky voice (laryngealization): vocal folds held together, so short open quotient

Slide 170 The larynx
main function of vocal folds: block objects from falling into trachea
Slide from Ulrike Gut

Slide 171 Inside the larynx
Slide from
Ulrike Gut

Slide 172 Phonation
phonation: vibration (= opening and closing) of the vocal folds
vocal folds closed; air from the lungs pushes them apart; they are sucked back together (Bernoulli effect)
Slide from Ulrike Gut

Slide 173 Voice Quality and the Larynx
Adductive tension (interarytenoid muscles adduct the arytenoid cartilages)
Medial compression (adductive force on vocal processes; adjustment of ligamental glottis)
Longitudinal tension (tension of vocal folds)
Slide from K. Marasek, J. Wilcox

Slide 174 Modulation of vocal fold vibration
vocal folds are moved (adducted) by muscles
can be tensed
the shorter the vocal folds, the faster they vibrate
Figure labels: 200 times/sec, 120 times/sec
Slide from Ulrike Gut

Slide 175 Modes of phonation
voicelessness = no vocal fold vibration
modal (normal) voicing
whisper
breathy voice
creaky voice
Slide from Ulrike Gut

Slide 176 Modal voice
neutral mode
muscular adjustments are moderate
vibration of the vocal folds is periodic with full closing of the glottis, so no audible friction noises are produced when air flows through the glottis
frequency of vibration and loudness are in the low to mid range for conversational speech
Slide from K. Marasek, J. Wilcox

Slide 177 Breathy voice
arytenoid cartilages remain slightly apart
continuous airflow during vocal fold vibration
Slide from Ulrike Gut

Slide 178 Creaky voice
arytenoid cartilages tightly together so that vocal folds can only vibrate at the other end
Figure labels: normal, creaky voice
Slide from Ulrike Gut

Slide 179 Creaky voice
voiced phonation
vocal folds vibrate at a very low frequency
vibration is somewhat irregular; vibrating mass is heavier because of low tension (only the ligamental part of the glottis vibrates)
The vocal folds are strongly adducted
Longitudinal tension is weak
Moderately high medial compression
Vocal folds thicken and create an unusually thick and slack structure
Slide from K. Marasek, J. Wilcox
Slide 180 Whisper
in whisper there is no true vibration of the vocal folds; adduction of vocal folds while maintaining an opening between the arytenoid cartilages
Slide from Ulrike Gut

Slide 181 Whispery voice
voiceless phonation
Very low adductive tension
Medial compression moderately high
Longitudinal tension moderately high
Little or no vocal fold vibration (sound is produced through turbulence generated by the friction of the air in and above the larynx, which produces frication)
Slide from K. Marasek, J. Wilcox

Slide 182 Liu (2003)
Use Switchboard 80%/20%
Downsampled to 50% frags, 50% words
Generated forced alignments with gold transcripts
Extract prosodic and voice quality features
Train decision tree

Slide 183 Liu (2003) results
Precision 74.3%, Recall 70.1%
                        hypothesis
                        complete   fragment
reference complete      109        35
reference fragment      43         101

Slide 184 Liu (2003) features
Features most queried by DT
Feature                                                          Usage (fraction of queries)
Jitter                                                           0.272
Energy slope difference between current and following word       0.241
Ratio between F0 before and after boundary                       0.238
Average OQ                                                       0.147
Position of current turn                                         0.084
Pause duration                                                   0.018
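
The precision and recall on the Liu (2003) results slide can be recomputed from the four confusion counts, treating "fragment" as the positive class. A minimal sketch follows; the dictionary layout and the `precision_recall` helper are illustrative choices, and only the four counts come from the slide.

```python
# Confusion counts from the Liu (2003) results slide.
# Keys are (reference label, hypothesized label).
confusion = {
    ("complete", "complete"): 109,
    ("complete", "fragment"): 35,
    ("fragment", "complete"): 43,
    ("fragment", "fragment"): 101,
}


def precision_recall(counts, positive="fragment"):
    """Return (precision, recall) for the given positive class."""
    tp = counts[(positive, positive)]
    # False positives: hypothesized positive, reference says otherwise.
    fp = sum(n for (ref, hyp), n in counts.items()
             if hyp == positive and ref != positive)
    # False negatives: reference positive, hypothesis says otherwise.
    fn = sum(n for (ref, hyp), n in counts.items()
             if ref == positive and hyp != positive)
    return tp / (tp + fp), tp / (tp + fn)


precision, recall = precision_recall(confusion)
print(f"precision = {precision:.1%}, recall = {recall:.1%}")
# precision = 74.3%, recall = 70.1%
```

Precision is 101 / (101 + 35) and recall is 101 / (101 + 43), which reproduces the 74.3% and 70.1% reported on the slide.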
