Post on 16-Apr-2017
TAIPEI | SEP. 21-22, 2016
李宏毅 Hung-yi Lee
TOWARDS MACHINE COMPREHENSION OF SPOKEN CONTENT
2
MULTIMEDIA INTERNET CONTENT
300 hrs multimedia is uploaded per minute.(2015.01)
1874 courses on coursera(2016.04)
Ø We need machine to listen to the audio data, understand it, and extract useful information for humans.
Ø In these multimedia, the spoken part carries very important information about the content.
Ø Nobody is able to go through the data.
Ø Overview the technology developed at NTU Speech Lab
3
OVERVIEW
2016/9/26
SpokenContent
Text
SpeechRecognition
DeepLearning
DeepLearningforSpeechRecognition• AcousticModel(聲學模型)
• DNN+HMM• Widelyused
• CTC• Sequencetosequencelearning
• DNN+structuredSVM[Meng &Lee,ICASSP10]
• DNN+structuredDNN[Liao&Lee,ASRU15]
hidden layer h1
hidden layer h2
W1
W2
F2(x, y; θ2)
WL
speech signal
F1(x, y; θ1)
y (phoneme label sequence)
(a) use DNN phone posterior as acoustic vector
(b) structured SVM (c) structured DNN
Ψ(x,y)
hidden layer hL-1
hidden layer h1
hidden layer hL
W0,0
output layer
input layer
W0,L
feature extraction
a c b a
x (acoustic vector sequence)
DeepLearningforSpeechRecognition• LanguageModel(語言模型)
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
RNN
LSTM
Neural Turing
Machine
Attention-based Model
[Ko &Lee,submittedtoICASSP17]
[Liu&Lee,submittedtoICASSP17]
6
OVERVIEW
SpokenContent
Text
SpeechRecognition
Summarization
7
SPEECH SUMMARIZATION
RetrievedAudio File
Summary Select the most informative segments to form a compact version
1 hour long
10 minutes
ExtractiveSummariesRef:http://speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015/Structured%20Lecture/Summarization%20Hidden_2.ecm.mp4/index.html
8
SPEECH SUMMARIZATIONAbstractiveSummaries
x1 x2 x3 xN
……
……
Input document (long word sequence)
Summary (short word sequence)
y1 y2 y3 y4
機器先看懂文章
機器用自己的話來寫摘要
[Yu & Lee, SLT 16]
9
OVERVIEW
2016/9/26
SpokenContent
Text
SpeechRecognition
KeyTermExtraction
Summarization
Key Term Extraction [Shen & Lee, Interspeech 16]
α1 α2 α3 α4 … αT
ΣαiVi
x4x3x2x1 xT…
…V3V2V1 V4 VT
Embedding Layer
…V3V2V1 V4 VT
OT
…
document
Output Layer
Hidden Layer
Embedding Layer
Key Terms: DNN, LSTN
機器先大略讀過整篇文章
機器擷取文章中的重點
回頭把重點畫起來
11
OVERVIEW
2016/9/26
SpokenContent
Text
SpeechRecognition
SpokenContentRetrieval
KeyTermExtraction
Summarization
SpokenContentRetrieval
l Transcribe spoken content into text by speech recognition
SpeechRecognition Models
Text
Retrieval Result
TextRetrieval Query learner
l Use text retrieval approach to search the transcriptions
Spoken Content
Black Box
OverviewPaper
• Lin-shan Lee,JamesGlass,Hung-yiLee,Chun-anChan,"SpokenContentRetrieval—BeyondCascadingSpeechRecognitionwithTextRetrieval,"IEEE/ACMTransactionsonAudio,Speech,andLanguageProcessing,vol.23,no.9,pp.1389-1420,Sept.2015
• http://speech.ee.ntu.edu.tw/~tlkagk/paper/Overview.pdf• 3hourstutorialatINTERSPEECH2016
• Slide:http://speech.ee.ntu.edu.tw/~tlkagk/slide/spoken_content_retrieval_IS16.pdf
Audioisdifficulttobrowse
• Retrievalresultsofspokencontentisusuallynoisy• Whenthesystemreturnstheretrievalresults,userdoesn’tknowwhathe/shegetatthefirstglance
Retrieval Result
15
OVERVIEW
2016/9/26
SpokenContent
Text
SpeechRecognition
SpokenContentRetrieval
KeyTermExtraction
Summarization
Interaction
user
“Deep Learning”
Related to Machine Learning or Education?
Challenges
• Given the information entered by the users, which action should be taken?
“Give me an example.”“Is it relevant to XXX?”
“More precisely, please.”
“Show the results.”
Theretrievalsystemlearnstotakethemosteffectiveactionsfromhistoricalinteractionexperiences.
DeepReinforcementLearning
DeepReinforcementLearning
• Theactionsaredeterminedbyaneuralnetwork• Input:informationtohelptomakethedecision• Output:whichactionshouldbetaken
• Takingtheactionwiththehighestscore
…
…
DNN
… InformationAction Z
Action B
Action A
Max
The network parameters can be optimized by historical interaction.
DeepReinforcementLearning
• Differentnetworkdepth
The task cannot be addressed by linear model.
Some depth is needed.
More Interaction
Better retrieval performance,Less user labor
20
OVERVIEW
2016/9/26
SpokenContent
Text
SpeechRecognition
SpokenContentRetrieval
KeyTermExtraction
Summarization
Interaction
Organization
Today’sRetrievalTechniques
752matches
More is less …...
• Given all the related lectures from different courses
Which lecture should I go first?
Learning MapØ Nodes: lectures in the
same topicsØ Edges: suggested learning
order
learner
[Shen & Lee, Interspeech 15]
Demo
24
OVERVIEW
SpokenContent
Text
SpeechRecognition
SpokenContentRetrieval
KeyTermExtraction
Summarization
QuestionAnswering
Interaction
Organization
SpokenQuestionAnswering
WhatisapossibleoriginofVenus’clouds?
Spoken Question Answering: Machine answers questions based on the information in spoken content
Gasesreleasedasaresultofvolcanicactivity
SpokenQuestionAnswering
• TOEFLListeningComprehensionTestbyMachine
Question: “ What is a possible origin of Venus’ clouds? ” Audio Story:
Choices:(A) gases released as a result of volcanic activity(B) chemical reactions caused by high surface temperatures(C) bursts of radio energy from the plane's surface(D) strong winds that blow dust into the atmosphere
(The original story is 5 min long.)
[Tseng & Lee, Interspeech 16]
Results
Accu
racy
(%)
(1) (2) (3) (4) (5) (6) (7)
MemoryNetwork:39.2%(proposed by FB AI group)
NaiveApproaches
ModelArchitecture
(A)
(A) (A) (A) (A)
(B) (B) (B)
ModelArchitecture
“what is a possible origin of Venus‘ clouds?"
Question:
QuestionSemantics
…… It be quite possible that this be due to volcanic eruption because volcanic eruption often emit gas. If that be the case volcanism could very well be the root cause of Venus 's thick cloud cover. And also we have observe burst of radio energy from the planet 's surface. These burst be similar to what we see when volcano ……
Audio Story:
SpeechRecognition
SemanticAnalysis
SemanticAnalysis
Attention
Answer
Select the choice most similar to the answer
Attention
Themodelislearnedend-to-end.
Results
Accu
racy
(%)
(1) (2) (3) (4) (5) (6) (7)
MemoryNetwork:39.2%
NaiveApproaches
ProposedApproach:48.8%
(proposed by FB AI group)
[Fang & Hsu & Lee, SLT 16][Tseng & Lee, Interspeech 16]
31
OVERVIEW
2016/9/26
SpokenContent
Text
SpeechRecognition
SpokenContentRetrieval
KeyTermExtraction
Summarization
QuestionAnswering
Interaction
Organization
Speech recognition is essential?
32
CHALLENGES IN SPEECH RECOGNITION?
Lots of audio files in different languages on the Internet
Most languages have little annotated data for training speech recognition systems.
Some audio files are produced in several different of languages
Some languages even do not have written form
Out-of-vocabulary (OOV) problem
33
OVERVIEW
2016/9/26
SpokenContent
Text
SpeechRecognition
SpokenContentRetrieval
KeyTermExtraction
Summarization
QuestionAnswering
Interaction
Organization
Speech recognition is essential?
Is it possible to directly understand spoken content?
PreliminaryStudy:LearningfromAudioBook
Machinelistenstolotsofaudiobook
[Chung,Interspeech 16)
Machinedoesnothaveanypriorknowledge
Likeaninfant
PreliminaryStudy:AudioWordtoVector
• AudiosegmentcorrespondingtoanunknownwordFixed-length vector
PreliminaryStudy:AudioWordtoVector
• Theaudiosegmentscorrespondingtowordswithsimilarpronunciationsareclosetoeachother.
ever ever
never
never
never
dog
dog
dogs
Unsupervised
Sequence-to-sequenceAuto-encoder
audio segment
acoustic features
The values in the memory represent the whole audio segment
x1 x2 x3 x4
RNN Encoder
audio segmentvector
The vector we want
How to train RNN Encoder?
Sequence-to-sequenceAuto-encoder
RNN Decoderx1 x2 x3 x4
y1 y2 y3 y4
x1 x2 x3 x4
RNN Encoder
audio segment
acoustic features
The RNN encoder and decoder are jointly trained.
Input acoustic features
ExperimentalResults
neverever
CosineSimilarity
Edit Distance between Phoneme sequences
RNNEncoder
RNNEncoder
ExperimentalResults
More similar pronunciation
Higher cosine similarity.
Observation
• Visualizingembeddingvectorsofthewords
fear
nearname
fame
audio segmentvector
Project on 2-D
NextStep……
• Includingsemantics?
flower tree
dog
cat
cats
walk
walked
run
43
CONCLUDINGREMARKS
44
CONCLUDING REMARKS
2016/9/26
SpokenContent
Text
SpeechRecognition
SpokenContentRetrieval
KeyTermExtraction
Summarization
Question-answering
Interaction
Organization
With Deep Learning,machine will understand spoken content, and extract useful information for humans.
45
如果你想 “深度學習深度學習”My Course: Machine learning and having it deep and structured
http://speech.ee.ntu.edu.tw/~tlkagk/courses_MLSD15_2.html
6 hour version: http://www.slideshare.net/tw_dsconf/ss-62245351
“Neural Networks and Deep Learning”
written by Michael Nielsen
http://neuralnetworksanddeeplearning.com/
“Deep Learning”
Written by Yoshua Bengio, Ian J. Goodfellow and Aaron Courville
http://www.deeplearningbook.org
TAIPEI | SEP. 21-22, 2016
THANK YOU