Towards Machine Comprehension of Spoken Content
Click here to load reader
-
Upload
nvidia-taiwan -
Category
Technology
-
view
93 -
download
0
Transcript of Towards Machine Comprehension of Spoken Content
![Page 1: Towards Machine Comprehension of Spoken Content](https://reader038.fdocuments.in/reader038/viewer/2022100802/587319621a28ab673e8b5d5b/html5/thumbnails/1.jpg)
TAIPEI | SEP. 21-22, 2016
李宏毅 Hung-yi Lee
TOWARDS MACHINE COMPREHENSION OF SPOKEN CONTENT
![Page 2: Towards Machine Comprehension of Spoken Content](https://reader038.fdocuments.in/reader038/viewer/2022100802/587319621a28ab673e8b5d5b/html5/thumbnails/2.jpg)
2
MULTIMEDIA INTERNET CONTENT
300 hrs multimedia is uploaded per minute.(2015.01)
1874 courses on coursera(2016.04)
Ø We need machine to listen to the audio data, understand it, and extract useful information for humans.
Ø In these multimedia, the spoken part carries very important information about the content.
Ø Nobody is able to go through the data.
Ø Overview the technology developed at NTU Speech Lab
![Page 3: Towards Machine Comprehension of Spoken Content](https://reader038.fdocuments.in/reader038/viewer/2022100802/587319621a28ab673e8b5d5b/html5/thumbnails/3.jpg)
3
OVERVIEW
2016/9/26
SpokenContent
Text
SpeechRecognition
DeepLearning
![Page 4: Towards Machine Comprehension of Spoken Content](https://reader038.fdocuments.in/reader038/viewer/2022100802/587319621a28ab673e8b5d5b/html5/thumbnails/4.jpg)
DeepLearningforSpeechRecognition• AcousticModel(聲學模型)
• DNN+HMM• Widelyused
• CTC• Sequencetosequencelearning
• DNN+structuredSVM[Meng &Lee,ICASSP10]
• DNN+structuredDNN[Liao&Lee,ASRU15]
hidden layer h1
hidden layer h2
W1
W2
F2(x, y; θ2)
WL
speech signal
F1(x, y; θ1)
y (phoneme label sequence)
(a) use DNN phone posterior as acoustic vector
(b) structured SVM (c) structured DNN
Ψ(x,y)
hidden layer hL-1
hidden layer h1
hidden layer hL
W0,0
output layer
input layer
W0,L
feature extraction
a c b a
x (acoustic vector sequence)
![Page 5: Towards Machine Comprehension of Spoken Content](https://reader038.fdocuments.in/reader038/viewer/2022100802/587319621a28ab673e8b5d5b/html5/thumbnails/5.jpg)
DeepLearningforSpeechRecognition• LanguageModel(語言模型)
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
RNN
LSTM
Neural Turing
Machine
Attention-based Model
[Ko &Lee,submittedtoICASSP17]
[Liu&Lee,submittedtoICASSP17]
![Page 6: Towards Machine Comprehension of Spoken Content](https://reader038.fdocuments.in/reader038/viewer/2022100802/587319621a28ab673e8b5d5b/html5/thumbnails/6.jpg)
6
OVERVIEW
SpokenContent
Text
SpeechRecognition
Summarization
![Page 7: Towards Machine Comprehension of Spoken Content](https://reader038.fdocuments.in/reader038/viewer/2022100802/587319621a28ab673e8b5d5b/html5/thumbnails/7.jpg)
7
SPEECH SUMMARIZATION
RetrievedAudio File
Summary Select the most informative segments to form a compact version
1 hour long
10 minutes
ExtractiveSummariesRef:http://speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015/Structured%20Lecture/Summarization%20Hidden_2.ecm.mp4/index.html
![Page 8: Towards Machine Comprehension of Spoken Content](https://reader038.fdocuments.in/reader038/viewer/2022100802/587319621a28ab673e8b5d5b/html5/thumbnails/8.jpg)
8
SPEECH SUMMARIZATIONAbstractiveSummaries
x1 x2 x3 xN
……
……
Input document (long word sequence)
Summary (short word sequence)
y1 y2 y3 y4
機器先看懂文章
機器用自己的話來寫摘要
[Yu & Lee, SLT 16]
![Page 9: Towards Machine Comprehension of Spoken Content](https://reader038.fdocuments.in/reader038/viewer/2022100802/587319621a28ab673e8b5d5b/html5/thumbnails/9.jpg)
9
OVERVIEW
2016/9/26
SpokenContent
Text
SpeechRecognition
KeyTermExtraction
Summarization
![Page 10: Towards Machine Comprehension of Spoken Content](https://reader038.fdocuments.in/reader038/viewer/2022100802/587319621a28ab673e8b5d5b/html5/thumbnails/10.jpg)
Key Term Extraction [Shen & Lee, Interspeech 16]
α1 α2 α3 α4 … αT
ΣαiVi
x4x3x2x1 xT…
…V3V2V1 V4 VT
Embedding Layer
…V3V2V1 V4 VT
OT
…
document
Output Layer
Hidden Layer
Embedding Layer
Key Terms: DNN, LSTN
機器先大略讀過整篇文章
機器擷取文章中的重點
回頭把重點畫起來
![Page 11: Towards Machine Comprehension of Spoken Content](https://reader038.fdocuments.in/reader038/viewer/2022100802/587319621a28ab673e8b5d5b/html5/thumbnails/11.jpg)
11
OVERVIEW
2016/9/26
SpokenContent
Text
SpeechRecognition
SpokenContentRetrieval
KeyTermExtraction
Summarization
![Page 12: Towards Machine Comprehension of Spoken Content](https://reader038.fdocuments.in/reader038/viewer/2022100802/587319621a28ab673e8b5d5b/html5/thumbnails/12.jpg)
SpokenContentRetrieval
l Transcribe spoken content into text by speech recognition
SpeechRecognition Models
Text
Retrieval Result
TextRetrieval Query learner
l Use text retrieval approach to search the transcriptions
Spoken Content
Black Box
![Page 13: Towards Machine Comprehension of Spoken Content](https://reader038.fdocuments.in/reader038/viewer/2022100802/587319621a28ab673e8b5d5b/html5/thumbnails/13.jpg)
OverviewPaper
• Lin-shan Lee,JamesGlass,Hung-yiLee,Chun-anChan,"SpokenContentRetrieval—BeyondCascadingSpeechRecognitionwithTextRetrieval,"IEEE/ACMTransactionsonAudio,Speech,andLanguageProcessing,vol.23,no.9,pp.1389-1420,Sept.2015
• http://speech.ee.ntu.edu.tw/~tlkagk/paper/Overview.pdf• 3hourstutorialatINTERSPEECH2016
• Slide:http://speech.ee.ntu.edu.tw/~tlkagk/slide/spoken_content_retrieval_IS16.pdf
![Page 14: Towards Machine Comprehension of Spoken Content](https://reader038.fdocuments.in/reader038/viewer/2022100802/587319621a28ab673e8b5d5b/html5/thumbnails/14.jpg)
Audioisdifficulttobrowse
• Retrievalresultsofspokencontentisusuallynoisy• Whenthesystemreturnstheretrievalresults,userdoesn’tknowwhathe/shegetatthefirstglance
Retrieval Result
![Page 15: Towards Machine Comprehension of Spoken Content](https://reader038.fdocuments.in/reader038/viewer/2022100802/587319621a28ab673e8b5d5b/html5/thumbnails/15.jpg)
15
OVERVIEW
2016/9/26
SpokenContent
Text
SpeechRecognition
SpokenContentRetrieval
KeyTermExtraction
Summarization
Interaction
user
“Deep Learning”
Related to Machine Learning or Education?
![Page 16: Towards Machine Comprehension of Spoken Content](https://reader038.fdocuments.in/reader038/viewer/2022100802/587319621a28ab673e8b5d5b/html5/thumbnails/16.jpg)
Challenges
• Given the information entered by the users, which action should be taken?
“Give me an example.”“Is it relevant to XXX?”
“More precisely, please.”
“Show the results.”
Theretrievalsystemlearnstotakethemosteffectiveactionsfromhistoricalinteractionexperiences.
![Page 17: Towards Machine Comprehension of Spoken Content](https://reader038.fdocuments.in/reader038/viewer/2022100802/587319621a28ab673e8b5d5b/html5/thumbnails/17.jpg)
DeepReinforcementLearning
![Page 18: Towards Machine Comprehension of Spoken Content](https://reader038.fdocuments.in/reader038/viewer/2022100802/587319621a28ab673e8b5d5b/html5/thumbnails/18.jpg)
DeepReinforcementLearning
• Theactionsaredeterminedbyaneuralnetwork• Input:informationtohelptomakethedecision• Output:whichactionshouldbetaken
• Takingtheactionwiththehighestscore
…
…
DNN
… InformationAction Z
Action B
Action A
Max
The network parameters can be optimized by historical interaction.
![Page 19: Towards Machine Comprehension of Spoken Content](https://reader038.fdocuments.in/reader038/viewer/2022100802/587319621a28ab673e8b5d5b/html5/thumbnails/19.jpg)
DeepReinforcementLearning
• Differentnetworkdepth
The task cannot be addressed by linear model.
Some depth is needed.
More Interaction
Better retrieval performance,Less user labor
![Page 20: Towards Machine Comprehension of Spoken Content](https://reader038.fdocuments.in/reader038/viewer/2022100802/587319621a28ab673e8b5d5b/html5/thumbnails/20.jpg)
20
OVERVIEW
2016/9/26
SpokenContent
Text
SpeechRecognition
SpokenContentRetrieval
KeyTermExtraction
Summarization
Interaction
Organization
![Page 21: Towards Machine Comprehension of Spoken Content](https://reader038.fdocuments.in/reader038/viewer/2022100802/587319621a28ab673e8b5d5b/html5/thumbnails/21.jpg)
Today’sRetrievalTechniques
752matches
![Page 22: Towards Machine Comprehension of Spoken Content](https://reader038.fdocuments.in/reader038/viewer/2022100802/587319621a28ab673e8b5d5b/html5/thumbnails/22.jpg)
More is less …...
• Given all the related lectures from different courses
Which lecture should I go first?
Learning MapØ Nodes: lectures in the
same topicsØ Edges: suggested learning
order
learner
[Shen & Lee, Interspeech 15]
![Page 23: Towards Machine Comprehension of Spoken Content](https://reader038.fdocuments.in/reader038/viewer/2022100802/587319621a28ab673e8b5d5b/html5/thumbnails/23.jpg)
Demo
![Page 24: Towards Machine Comprehension of Spoken Content](https://reader038.fdocuments.in/reader038/viewer/2022100802/587319621a28ab673e8b5d5b/html5/thumbnails/24.jpg)
24
OVERVIEW
SpokenContent
Text
SpeechRecognition
SpokenContentRetrieval
KeyTermExtraction
Summarization
QuestionAnswering
Interaction
Organization
![Page 25: Towards Machine Comprehension of Spoken Content](https://reader038.fdocuments.in/reader038/viewer/2022100802/587319621a28ab673e8b5d5b/html5/thumbnails/25.jpg)
SpokenQuestionAnswering
WhatisapossibleoriginofVenus’clouds?
Spoken Question Answering: Machine answers questions based on the information in spoken content
Gasesreleasedasaresultofvolcanicactivity
![Page 26: Towards Machine Comprehension of Spoken Content](https://reader038.fdocuments.in/reader038/viewer/2022100802/587319621a28ab673e8b5d5b/html5/thumbnails/26.jpg)
SpokenQuestionAnswering
• TOEFLListeningComprehensionTestbyMachine
Question: “ What is a possible origin of Venus’ clouds? ” Audio Story:
Choices:(A) gases released as a result of volcanic activity(B) chemical reactions caused by high surface temperatures(C) bursts of radio energy from the plane's surface(D) strong winds that blow dust into the atmosphere
(The original story is 5 min long.)
[Tseng & Lee, Interspeech 16]
![Page 27: Towards Machine Comprehension of Spoken Content](https://reader038.fdocuments.in/reader038/viewer/2022100802/587319621a28ab673e8b5d5b/html5/thumbnails/27.jpg)
Results
Accu
racy
(%)
(1) (2) (3) (4) (5) (6) (7)
MemoryNetwork:39.2%(proposed by FB AI group)
NaiveApproaches
![Page 28: Towards Machine Comprehension of Spoken Content](https://reader038.fdocuments.in/reader038/viewer/2022100802/587319621a28ab673e8b5d5b/html5/thumbnails/28.jpg)
ModelArchitecture
(A)
(A) (A) (A) (A)
(B) (B) (B)
![Page 29: Towards Machine Comprehension of Spoken Content](https://reader038.fdocuments.in/reader038/viewer/2022100802/587319621a28ab673e8b5d5b/html5/thumbnails/29.jpg)
ModelArchitecture
“what is a possible origin of Venus‘ clouds?"
Question:
QuestionSemantics
…… It be quite possible that this be due to volcanic eruption because volcanic eruption often emit gas. If that be the case volcanism could very well be the root cause of Venus 's thick cloud cover. And also we have observe burst of radio energy from the planet 's surface. These burst be similar to what we see when volcano ……
Audio Story:
SpeechRecognition
SemanticAnalysis
SemanticAnalysis
Attention
Answer
Select the choice most similar to the answer
Attention
Themodelislearnedend-to-end.
![Page 30: Towards Machine Comprehension of Spoken Content](https://reader038.fdocuments.in/reader038/viewer/2022100802/587319621a28ab673e8b5d5b/html5/thumbnails/30.jpg)
Results
Accu
racy
(%)
(1) (2) (3) (4) (5) (6) (7)
MemoryNetwork:39.2%
NaiveApproaches
ProposedApproach:48.8%
(proposed by FB AI group)
[Fang & Hsu & Lee, SLT 16][Tseng & Lee, Interspeech 16]
![Page 31: Towards Machine Comprehension of Spoken Content](https://reader038.fdocuments.in/reader038/viewer/2022100802/587319621a28ab673e8b5d5b/html5/thumbnails/31.jpg)
31
OVERVIEW
2016/9/26
SpokenContent
Text
SpeechRecognition
SpokenContentRetrieval
KeyTermExtraction
Summarization
QuestionAnswering
Interaction
Organization
Speech recognition is essential?
![Page 32: Towards Machine Comprehension of Spoken Content](https://reader038.fdocuments.in/reader038/viewer/2022100802/587319621a28ab673e8b5d5b/html5/thumbnails/32.jpg)
32
CHALLENGES IN SPEECH RECOGNITION?
Lots of audio files in different languages on the Internet
Most languages have little annotated data for training speech recognition systems.
Some audio files are produced in several different of languages
Some languages even do not have written form
Out-of-vocabulary (OOV) problem
![Page 33: Towards Machine Comprehension of Spoken Content](https://reader038.fdocuments.in/reader038/viewer/2022100802/587319621a28ab673e8b5d5b/html5/thumbnails/33.jpg)
33
OVERVIEW
2016/9/26
SpokenContent
Text
SpeechRecognition
SpokenContentRetrieval
KeyTermExtraction
Summarization
QuestionAnswering
Interaction
Organization
Speech recognition is essential?
Is it possible to directly understand spoken content?
![Page 34: Towards Machine Comprehension of Spoken Content](https://reader038.fdocuments.in/reader038/viewer/2022100802/587319621a28ab673e8b5d5b/html5/thumbnails/34.jpg)
PreliminaryStudy:LearningfromAudioBook
Machinelistenstolotsofaudiobook
[Chung,Interspeech 16)
Machinedoesnothaveanypriorknowledge
Likeaninfant
![Page 35: Towards Machine Comprehension of Spoken Content](https://reader038.fdocuments.in/reader038/viewer/2022100802/587319621a28ab673e8b5d5b/html5/thumbnails/35.jpg)
PreliminaryStudy:AudioWordtoVector
• AudiosegmentcorrespondingtoanunknownwordFixed-length vector
![Page 36: Towards Machine Comprehension of Spoken Content](https://reader038.fdocuments.in/reader038/viewer/2022100802/587319621a28ab673e8b5d5b/html5/thumbnails/36.jpg)
PreliminaryStudy:AudioWordtoVector
• Theaudiosegmentscorrespondingtowordswithsimilarpronunciationsareclosetoeachother.
ever ever
never
never
never
dog
dog
dogs
Unsupervised
![Page 37: Towards Machine Comprehension of Spoken Content](https://reader038.fdocuments.in/reader038/viewer/2022100802/587319621a28ab673e8b5d5b/html5/thumbnails/37.jpg)
Sequence-to-sequenceAuto-encoder
audio segment
acoustic features
The values in the memory represent the whole audio segment
x1 x2 x3 x4
RNN Encoder
audio segmentvector
The vector we want
How to train RNN Encoder?
![Page 38: Towards Machine Comprehension of Spoken Content](https://reader038.fdocuments.in/reader038/viewer/2022100802/587319621a28ab673e8b5d5b/html5/thumbnails/38.jpg)
Sequence-to-sequenceAuto-encoder
RNN Decoderx1 x2 x3 x4
y1 y2 y3 y4
x1 x2 x3 x4
RNN Encoder
audio segment
acoustic features
The RNN encoder and decoder are jointly trained.
Input acoustic features
![Page 39: Towards Machine Comprehension of Spoken Content](https://reader038.fdocuments.in/reader038/viewer/2022100802/587319621a28ab673e8b5d5b/html5/thumbnails/39.jpg)
ExperimentalResults
neverever
CosineSimilarity
Edit Distance between Phoneme sequences
RNNEncoder
RNNEncoder
![Page 40: Towards Machine Comprehension of Spoken Content](https://reader038.fdocuments.in/reader038/viewer/2022100802/587319621a28ab673e8b5d5b/html5/thumbnails/40.jpg)
ExperimentalResults
More similar pronunciation
Higher cosine similarity.
![Page 41: Towards Machine Comprehension of Spoken Content](https://reader038.fdocuments.in/reader038/viewer/2022100802/587319621a28ab673e8b5d5b/html5/thumbnails/41.jpg)
Observation
• Visualizingembeddingvectorsofthewords
fear
nearname
fame
audio segmentvector
Project on 2-D
![Page 42: Towards Machine Comprehension of Spoken Content](https://reader038.fdocuments.in/reader038/viewer/2022100802/587319621a28ab673e8b5d5b/html5/thumbnails/42.jpg)
NextStep……
• Includingsemantics?
flower tree
dog
cat
cats
walk
walked
run
![Page 43: Towards Machine Comprehension of Spoken Content](https://reader038.fdocuments.in/reader038/viewer/2022100802/587319621a28ab673e8b5d5b/html5/thumbnails/43.jpg)
43
CONCLUDINGREMARKS
![Page 44: Towards Machine Comprehension of Spoken Content](https://reader038.fdocuments.in/reader038/viewer/2022100802/587319621a28ab673e8b5d5b/html5/thumbnails/44.jpg)
44
CONCLUDING REMARKS
2016/9/26
SpokenContent
Text
SpeechRecognition
SpokenContentRetrieval
KeyTermExtraction
Summarization
Question-answering
Interaction
Organization
With Deep Learning,machine will understand spoken content, and extract useful information for humans.
![Page 45: Towards Machine Comprehension of Spoken Content](https://reader038.fdocuments.in/reader038/viewer/2022100802/587319621a28ab673e8b5d5b/html5/thumbnails/45.jpg)
45
如果你想 “深度學習深度學習”My Course: Machine learning and having it deep and structured
http://speech.ee.ntu.edu.tw/~tlkagk/courses_MLSD15_2.html
6 hour version: http://www.slideshare.net/tw_dsconf/ss-62245351
“Neural Networks and Deep Learning”
written by Michael Nielsen
http://neuralnetworksanddeeplearning.com/
“Deep Learning”
Written by Yoshua Bengio, Ian J. Goodfellow and Aaron Courville
http://www.deeplearningbook.org
![Page 46: Towards Machine Comprehension of Spoken Content](https://reader038.fdocuments.in/reader038/viewer/2022100802/587319621a28ab673e8b5d5b/html5/thumbnails/46.jpg)
TAIPEI | SEP. 21-22, 2016
THANK YOU