Towards Machine Comprehension of Spoken Content

46

Click here to load reader

Transcript of Towards Machine Comprehension of Spoken Content

Page 1: Towards Machine Comprehension of Spoken Content

TAIPEI | SEP. 21-22, 2016

李宏毅 Hung-yi Lee

TOWARDS MACHINE COMPREHENSION OF SPOKEN CONTENT

Page 2: Towards Machine Comprehension of Spoken Content

2

MULTIMEDIA INTERNET CONTENT

300 hrs multimedia is uploaded per minute.(2015.01)

1874 courses on coursera(2016.04)

Ø We need machine to listen to the audio data, understand it, and extract useful information for humans.

Ø In these multimedia, the spoken part carries very important information about the content.

Ø Nobody is able to go through the data.

Ø Overview the technology developed at NTU Speech Lab

Page 3: Towards Machine Comprehension of Spoken Content

3

OVERVIEW

2016/9/26

SpokenContent

Text

SpeechRecognition

DeepLearning

Page 4: Towards Machine Comprehension of Spoken Content

DeepLearningforSpeechRecognition• AcousticModel(聲學模型)

• DNN+HMM• Widelyused

• CTC• Sequencetosequencelearning

• DNN+structuredSVM[Meng &Lee,ICASSP10]

• DNN+structuredDNN[Liao&Lee,ASRU15]

hidden layer h1

hidden layer h2

W1

W2

F2(x, y; θ2)

WL

speech signal

F1(x, y; θ1)

y (phoneme label sequence)

(a) use DNN phone posterior as acoustic vector

(b) structured SVM (c) structured DNN

Ψ(x,y)

hidden layer hL-1

hidden layer h1

hidden layer hL

W0,0

output layer

input layer

W0,L

feature extraction

a c b a

x (acoustic vector sequence)

Page 5: Towards Machine Comprehension of Spoken Content

DeepLearningforSpeechRecognition• LanguageModel(語言模型)

http://colah.github.io/posts/2015-08-Understanding-LSTMs/

RNN

LSTM

Neural Turing

Machine

Attention-based Model

[Ko &Lee,submittedtoICASSP17]

[Liu&Lee,submittedtoICASSP17]

Page 6: Towards Machine Comprehension of Spoken Content

6

OVERVIEW

SpokenContent

Text

SpeechRecognition

Summarization

Page 7: Towards Machine Comprehension of Spoken Content

7

SPEECH SUMMARIZATION

RetrievedAudio File

Summary Select the most informative segments to form a compact version

1 hour long

10 minutes

ExtractiveSummariesRef:http://speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015/Structured%20Lecture/Summarization%20Hidden_2.ecm.mp4/index.html

Page 8: Towards Machine Comprehension of Spoken Content

8

SPEECH SUMMARIZATIONAbstractiveSummaries

x1 x2 x3 xN

……

……

Input document (long word sequence)

Summary (short word sequence)

y1 y2 y3 y4

機器先看懂文章

機器用自己的話來寫摘要

[Yu & Lee, SLT 16]

Page 9: Towards Machine Comprehension of Spoken Content

9

OVERVIEW

2016/9/26

SpokenContent

Text

SpeechRecognition

KeyTermExtraction

Summarization

Page 10: Towards Machine Comprehension of Spoken Content

Key Term Extraction [Shen & Lee, Interspeech 16]

α1 α2 α3 α4 … αT

ΣαiVi

x4x3x2x1 xT…

…V3V2V1 V4 VT

Embedding Layer

…V3V2V1 V4 VT

OT

document

Output Layer

Hidden Layer

Embedding Layer

Key Terms: DNN, LSTN

機器先大略讀過整篇文章

機器擷取文章中的重點

回頭把重點畫起來

Page 11: Towards Machine Comprehension of Spoken Content

11

OVERVIEW

2016/9/26

SpokenContent

Text

SpeechRecognition

SpokenContentRetrieval

KeyTermExtraction

Summarization

Page 12: Towards Machine Comprehension of Spoken Content

SpokenContentRetrieval

l Transcribe spoken content into text by speech recognition

SpeechRecognition Models

Text

Retrieval Result

TextRetrieval Query learner

l Use text retrieval approach to search the transcriptions

Spoken Content

Black Box

Page 13: Towards Machine Comprehension of Spoken Content

OverviewPaper

• Lin-shan Lee,JamesGlass,Hung-yiLee,Chun-anChan,"SpokenContentRetrieval—BeyondCascadingSpeechRecognitionwithTextRetrieval,"IEEE/ACMTransactionsonAudio,Speech,andLanguageProcessing,vol.23,no.9,pp.1389-1420,Sept.2015

• http://speech.ee.ntu.edu.tw/~tlkagk/paper/Overview.pdf• 3hourstutorialatINTERSPEECH2016

• Slide:http://speech.ee.ntu.edu.tw/~tlkagk/slide/spoken_content_retrieval_IS16.pdf

Page 14: Towards Machine Comprehension of Spoken Content

Audioisdifficulttobrowse

• Retrievalresultsofspokencontentisusuallynoisy• Whenthesystemreturnstheretrievalresults,userdoesn’tknowwhathe/shegetatthefirstglance

Retrieval Result

Page 15: Towards Machine Comprehension of Spoken Content

15

OVERVIEW

2016/9/26

SpokenContent

Text

SpeechRecognition

SpokenContentRetrieval

KeyTermExtraction

Summarization

Interaction

user

“Deep Learning”

Related to Machine Learning or Education?

Page 16: Towards Machine Comprehension of Spoken Content

Challenges

• Given the information entered by the users, which action should be taken?

“Give me an example.”“Is it relevant to XXX?”

“More precisely, please.”

“Show the results.”

Theretrievalsystemlearnstotakethemosteffectiveactionsfromhistoricalinteractionexperiences.

Page 17: Towards Machine Comprehension of Spoken Content

DeepReinforcementLearning

Page 18: Towards Machine Comprehension of Spoken Content

DeepReinforcementLearning

• Theactionsaredeterminedbyaneuralnetwork• Input:informationtohelptomakethedecision• Output:whichactionshouldbetaken

• Takingtheactionwiththehighestscore

DNN

… InformationAction Z

Action B

Action A

Max

The network parameters can be optimized by historical interaction.

Page 19: Towards Machine Comprehension of Spoken Content

DeepReinforcementLearning

• Differentnetworkdepth

The task cannot be addressed by linear model.

Some depth is needed.

More Interaction

Better retrieval performance,Less user labor

Page 20: Towards Machine Comprehension of Spoken Content

20

OVERVIEW

2016/9/26

SpokenContent

Text

SpeechRecognition

SpokenContentRetrieval

KeyTermExtraction

Summarization

Interaction

Organization

Page 21: Towards Machine Comprehension of Spoken Content

Today’sRetrievalTechniques

752matches

Page 22: Towards Machine Comprehension of Spoken Content

More is less …...

• Given all the related lectures from different courses

Which lecture should I go first?

Learning MapØ Nodes: lectures in the

same topicsØ Edges: suggested learning

order

learner

[Shen & Lee, Interspeech 15]

Page 23: Towards Machine Comprehension of Spoken Content

Demo

Page 24: Towards Machine Comprehension of Spoken Content

24

OVERVIEW

SpokenContent

Text

SpeechRecognition

SpokenContentRetrieval

KeyTermExtraction

Summarization

QuestionAnswering

Interaction

Organization

Page 25: Towards Machine Comprehension of Spoken Content

SpokenQuestionAnswering

WhatisapossibleoriginofVenus’clouds?

Spoken Question Answering: Machine answers questions based on the information in spoken content

Gasesreleasedasaresultofvolcanicactivity

Page 26: Towards Machine Comprehension of Spoken Content

SpokenQuestionAnswering

• TOEFLListeningComprehensionTestbyMachine

Question: “ What is a possible origin of Venus’ clouds? ” Audio Story:

Choices:(A) gases released as a result of volcanic activity(B) chemical reactions caused by high surface temperatures(C) bursts of radio energy from the plane's surface(D) strong winds that blow dust into the atmosphere

(The original story is 5 min long.)

[Tseng & Lee, Interspeech 16]

Page 27: Towards Machine Comprehension of Spoken Content

Results

Accu

racy

(%)

(1) (2) (3) (4) (5) (6) (7)

MemoryNetwork:39.2%(proposed by FB AI group)

NaiveApproaches

Page 28: Towards Machine Comprehension of Spoken Content

ModelArchitecture

(A)

(A) (A) (A) (A)

(B) (B) (B)

Page 29: Towards Machine Comprehension of Spoken Content

ModelArchitecture

“what is a possible origin of Venus‘ clouds?"

Question:

QuestionSemantics

…… It be quite possible that this be due to volcanic eruption because volcanic eruption often emit gas. If that be the case volcanism could very well be the root cause of Venus 's thick cloud cover. And also we have observe burst of radio energy from the planet 's surface. These burst be similar to what we see when volcano ……

Audio Story:

SpeechRecognition

SemanticAnalysis

SemanticAnalysis

Attention

Answer

Select the choice most similar to the answer

Attention

Themodelislearnedend-to-end.

Page 30: Towards Machine Comprehension of Spoken Content

Results

Accu

racy

(%)

(1) (2) (3) (4) (5) (6) (7)

MemoryNetwork:39.2%

NaiveApproaches

ProposedApproach:48.8%

(proposed by FB AI group)

[Fang & Hsu & Lee, SLT 16][Tseng & Lee, Interspeech 16]

Page 31: Towards Machine Comprehension of Spoken Content

31

OVERVIEW

2016/9/26

SpokenContent

Text

SpeechRecognition

SpokenContentRetrieval

KeyTermExtraction

Summarization

QuestionAnswering

Interaction

Organization

Speech recognition is essential?

Page 32: Towards Machine Comprehension of Spoken Content

32

CHALLENGES IN SPEECH RECOGNITION?

Lots of audio files in different languages on the Internet

Most languages have little annotated data for training speech recognition systems.

Some audio files are produced in several different of languages

Some languages even do not have written form

Out-of-vocabulary (OOV) problem

Page 33: Towards Machine Comprehension of Spoken Content

33

OVERVIEW

2016/9/26

SpokenContent

Text

SpeechRecognition

SpokenContentRetrieval

KeyTermExtraction

Summarization

QuestionAnswering

Interaction

Organization

Speech recognition is essential?

Is it possible to directly understand spoken content?

Page 34: Towards Machine Comprehension of Spoken Content

PreliminaryStudy:LearningfromAudioBook

Machinelistenstolotsofaudiobook

[Chung,Interspeech 16)

Machinedoesnothaveanypriorknowledge

Likeaninfant

Page 35: Towards Machine Comprehension of Spoken Content

PreliminaryStudy:AudioWordtoVector

• AudiosegmentcorrespondingtoanunknownwordFixed-length vector

Page 36: Towards Machine Comprehension of Spoken Content

PreliminaryStudy:AudioWordtoVector

• Theaudiosegmentscorrespondingtowordswithsimilarpronunciationsareclosetoeachother.

ever ever

never

never

never

dog

dog

dogs

Unsupervised

Page 37: Towards Machine Comprehension of Spoken Content

Sequence-to-sequenceAuto-encoder

audio segment

acoustic features

The values in the memory represent the whole audio segment

x1 x2 x3 x4

RNN Encoder

audio segmentvector

The vector we want

How to train RNN Encoder?

Page 38: Towards Machine Comprehension of Spoken Content

Sequence-to-sequenceAuto-encoder

RNN Decoderx1 x2 x3 x4

y1 y2 y3 y4

x1 x2 x3 x4

RNN Encoder

audio segment

acoustic features

The RNN encoder and decoder are jointly trained.

Input acoustic features

Page 39: Towards Machine Comprehension of Spoken Content

ExperimentalResults

neverever

CosineSimilarity

Edit Distance between Phoneme sequences

RNNEncoder

RNNEncoder

Page 40: Towards Machine Comprehension of Spoken Content

ExperimentalResults

More similar pronunciation

Higher cosine similarity.

Page 41: Towards Machine Comprehension of Spoken Content

Observation

• Visualizingembeddingvectorsofthewords

fear

nearname

fame

audio segmentvector

Project on 2-D

Page 42: Towards Machine Comprehension of Spoken Content

NextStep……

• Includingsemantics?

flower tree

dog

cat

cats

walk

walked

run

Page 43: Towards Machine Comprehension of Spoken Content

43

CONCLUDINGREMARKS

Page 44: Towards Machine Comprehension of Spoken Content

44

CONCLUDING REMARKS

2016/9/26

SpokenContent

Text

SpeechRecognition

SpokenContentRetrieval

KeyTermExtraction

Summarization

Question-answering

Interaction

Organization

With Deep Learning,machine will understand spoken content, and extract useful information for humans.

Page 45: Towards Machine Comprehension of Spoken Content

45

如果你想 “深度學習深度學習”My Course: Machine learning and having it deep and structured

http://speech.ee.ntu.edu.tw/~tlkagk/courses_MLSD15_2.html

6 hour version: http://www.slideshare.net/tw_dsconf/ss-62245351

“Neural Networks and Deep Learning”

written by Michael Nielsen

http://neuralnetworksanddeeplearning.com/

“Deep Learning”

Written by Yoshua Bengio, Ian J. Goodfellow and Aaron Courville

http://www.deeplearningbook.org

Page 46: Towards Machine Comprehension of Spoken Content

TAIPEI | SEP. 21-22, 2016

THANK YOU