Automatic Lip-Synchronization Using Linear Prediction of Speech
Christopher Kohnert, S. K. Semwal
University of Colorado, Colorado Springs
Topics of Presentation
- Introduction and Background
- Linear Prediction Theory
- Sound Signatures
- Viseme Scoring
- Rendering System
- Results
- Conclusions
Justification
Need:
- Existing methods are labor intensive
- Poor results
- Expensive
Solution:
- Automatic method
- "Decent" results
Applications of Automatic System
Typical applications benefiting from an automatic method:
- Real-time video communication
- Synthetic computer agents
- Low-budget animation scenarios (e.g. the video game industry)
Automatic Is Possible
- Spoken words can be broken into phonemes
- Phonemes are comprehensive
- Visemes are their visual correlates
- Used in lip-reading and traditional animation
Existing Methods of Synchronization
Text based:
- Analyze text to extract phonemes
Speech based:
- Volume tracking
- Speech recognition front-end
- Linear Prediction
Hybrids:
- Text & speech
- Image & speech
Speech-Based Is Best
- Doesn't need a script
- Fully automatic
- Can use the original sound sample (best quality)
- Can use the source-filter model
Source-Filter Model
- Models a sound signal as a source passed through a filter
- Source: lungs & vocal cords
- Filter: vocal tract
- Implemented using Linear Prediction
Speech-Related Topics
- Phoneme recognition: how many phonemes to use?
- Mapping phonemes to visemes: use visually distinctive ones (e.g. vowel sounds)
- The coarticulation effect
The Coarticulation Effect
- The blending of sounds based on adjacent phonemes (common in everyday speech)
- An artifact of discrete phoneme recognition
- Causes poor visual synchronization (transitions are jerky and unnatural)
Speech Encoding Methods
- Pulse Code Modulation (PCM)
- Vocoding
- Linear Prediction
Pulse Code Modulation
- Raw digital sampling
- High quality sound
- Very high bandwidth requirements
Vocoding
- Stands for VOice-enCODing
- Origins in military applications
- Models physical entities (tongue, vocal cords, jaw, etc.)
- Poor sound quality ("tin can" voices)
- Very low bandwidth requirements
Linear Prediction
- A hybrid of PCM and vocoding
- Models the sound source and filter separately
- Uses the original sound sample to calculate recreation parameters (minimum error)
- Low bandwidth requirements
- Pitch and intonation independence
Linear Prediction Theory
- Based on the source-filter model
- P coefficients are calculated
- [Diagram: Source passed through Filter]
Linear Prediction Theory (cont.)
- The a_k coefficients are found by minimizing the error between the original sound (s_t) and the reconstructed sound (ŝ_t).
- Can be solved using Levinson-Durbin recursion.
Linear Prediction Theory (cont.)
- The coefficients represent the filter part
- The filter is assumed constant over small "windows" of the original sample (10-30 ms windows)
- Each window has its own coefficients
- The sound source is either a pulse train (voiced) or white noise (unvoiced)
Linear Prediction for Recognition
- Recognition on the raw coefficients is poor
- Better to FFT the values
- Take only the first "half" of the FFT'd values
- This is the "signature" of the sound
Sound Signatures
- 16 values represent the sound
- Speaker independent
- Unique for each phoneme
- Easily recognized by machine
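A sketch of the signature computation under these assumptions: a 32-point transform whose first half yields the 16 values (the transform size actually used in the project is not stated on the slides). A naive O(n²) DFT is used here for clarity; a real system would use an FFT:

```python
import cmath, math

def sound_signature(lpc_coeffs, n_fft=32):
    """Zero-pad the LPC coefficients, take the DFT, and keep the
    magnitudes of the first half (real input => symmetric spectrum)."""
    padded = list(lpc_coeffs) + [0.0] * (n_fft - len(lpc_coeffs))
    return [abs(sum(padded[n] * cmath.exp(-2j * math.pi * k * n / n_fft)
                    for n in range(n_fft)))
            for k in range(n_fft // 2)]
```

Two signatures can then be compared with a simple distance measure (e.g. Euclidean) to pick the nearest reference phoneme.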
Viseme Scoring
- Phonemes were chosen judiciously
- Map one-to-one to visemes
- Visemes are scored independently using history:
  V_i = 0.9 * V_{i-1} + 0.1 * (1 if matched at i, else 0)
- Scores ramp up and down with successive matches/mismatches
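The scoring rule above is an exponential moving average. A minimal sketch with the 0.9/0.1 split from the slide (the dictionary layout and names are hypothetical):

```python
def update_viseme_scores(scores, matched_viseme, decay=0.9):
    """V_i = decay * V_{i-1} + (1 - decay) * (1 if matched, else 0).

    Each viseme's score ramps toward 1.0 under successive matches
    and decays toward 0.0 under successive mismatches."""
    return {v: decay * s + (1.0 - decay) * (1.0 if v == matched_viseme else 0.0)
            for v, s in scores.items()}
```

Three successive matches of one viseme take its score 0 → 0.1 → 0.19 → 0.271, giving the smooth ramp-up described above.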
Rendering System
- Uses Alias|Wavefront's Maya package
- Built-in support for "blend shapes"
- Mapped directly to viseme scores
- Very expressive and flexible
- A script is generated and later read in
- Rendered to a movie; QuickTime is used to add the original sound and produce the final movie
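One way such a generated script could look: emitting MEL `setKeyframe` commands that drive blend-shape weights from the per-frame viseme scores. The node and attribute names are hypothetical, and this is a sketch of the idea rather than the presenters' actual generator:

```python
def emit_blendshape_script(frames, node="blendShape1"):
    """frames: one dict per animation frame mapping viseme name -> score.

    Returns MEL setKeyframe commands keying each blend-shape weight,
    one keyframe per viseme per frame."""
    lines = []
    for t, scores in enumerate(frames):
        for viseme, weight in sorted(scores.items()):
            lines.append(f'setKeyframe -t {t} -v {weight:.3f} "{node}.{viseme}";')
    return "\n".join(lines)
```

Reading the resulting script into Maya keys the blend-shape weights so the mouth pose tracks the viseme scores over time.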
Results (Timing)
- Precise timing can be achieved
- Smoothing introduces "lag"
Results (Other Examples)
- A female speaker using the male phoneme set
- Slower speech, male speaker
Results (Other Examples) (cont.)
- Accented speech with a fast pace
Results (Summary)
- Good with basic speech
- Good speaker independence (for normal speech)
- Poor performance when speech is too fast, is accented, or contains phonemes not in the reference set (e.g. "w" and "th")
Conclusion
- Linear Prediction provides several benefits: speaker independence and easy automatic recognition
- Results are reasonable, but can be improved
Future Work
- Identify the best set of phonemes and visemes
- Phoneme classification could be improved with a better matching algorithm (neural net?)
- A larger phoneme reference set for more robust matching
Results
- Simple cases work very well
- Timing is good and very responsive
- Robust with respect to speaker (cross-gender, multiple male speakers)
- Fails on: accents, speed, unknown phonemes
- Problems with noisy samples
- Can be smoothed, but this introduces "lag"
End
Automatic Is Possible
- Spoken words can be broken into phonemes
- Phonemes are comprehensive
- Visemes are their visual correlates
- Used in lip-reading and traditional animation
- Physical speech (vocal cords, vocal tract) can be modeled
- Source-filter model
Sound Signatures (Speaker Independence)

Sound Signatures (For Phonemes)
Results (Normal Speech)
- Normal speech, moderate pace