Towards Machine Comprehension of Spoken Content

TAIPEI | SEP. 21-22, 2016

李宏毅 Hung-yi Lee


MULTIMEDIA INTERNET CONTENT

300 hours of multimedia are uploaded per minute (2015.01).

1,874 courses on Coursera (2016.04).

• We need machines to listen to the audio data, understand it, and extract useful information for humans.

• In this multimedia, the spoken part carries very important information about the content.

• Nobody is able to go through all the data.

• Overview of the technology developed at NTU Speech Lab


OVERVIEW

2016/9/26

Spoken Content

Text

Speech Recognition

Deep Learning

Deep Learning for Speech Recognition

• Acoustic Model

• DNN + HMM

• Widely used

• CTC

• Sequence-to-sequence learning

• DNN + structured SVM [Meng & Lee, ICASSP 10]

• DNN + structured DNN [Liao & Lee, ASRU 15]

[Figure: (a) use DNN phone posterior as acoustic vector; (b) structured SVM; (c) structured DNN. Labels: speech signal, feature extraction, x (acoustic vector sequence), y (phoneme label sequence, e.g. a c b a), Ψ(x, y), F1(x, y; θ1), F2(x, y; θ2), input layer, hidden layers h1 … hL, weights W1 … WL and W0,0 … W0,L, output layer.]

Deep Learning for Speech Recognition

• Language Model

http://colah.github.io/posts/2015-08-Understanding-LSTMs/

RNN

LSTM

Neural Turing Machine

Attention-based Model

[Ko & Lee, submitted to ICASSP 17]

[Liu & Lee, submitted to ICASSP 17]


OVERVIEW

Spoken Content

Text

Speech Recognition

Summarization


SPEECH SUMMARIZATION

Retrieved Audio File

Summary: select the most informative segments to form a compact version

1 hour long

10 minutes

Extractive Summaries

Ref: http://speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015/Structured%20Lecture/Summarization%20Hidden_2.ecm.mp4/index.html
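
To make the idea of extractive summarization concrete, here is a minimal sketch of selecting the most informative segments under a time budget. The greedy rule, the `(start, duration, score)` tuple layout, and the scores themselves are assumptions made for illustration; the talk does not say how segments are scored or selected.

```python
def extractive_summary(segments, budget_sec):
    """Greedy sketch: pick high-value segments until the time budget is used.

    segments: list of (start_sec, duration_sec, score) tuples. The score is a
    hypothetical informativeness value; durations are assumed to be positive.
    """
    # Rank by score per second so that short, informative segments win.
    ranked = sorted(segments, key=lambda s: s[2] / s[1], reverse=True)
    chosen, used = [], 0.0
    for seg in ranked:
        if used + seg[1] <= budget_sec:
            chosen.append(seg)
            used += seg[1]
    # Restore chronological order so the summary plays back coherently.
    return sorted(chosen, key=lambda s: s[0])


# Toy data: four segments of a recording, summarized into at most 600 seconds.
segments = [(0, 300, 3.0), (300, 200, 4.0), (500, 400, 2.0), (900, 100, 1.5)]
summary = extractive_summary(segments, budget_sec=600)
```

Greedy selection is only one possible choice here; any method that trades informativeness against length would fit the same slot.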


SPEECH SUMMARIZATION

Abstractive Summaries

Input document (long word sequence): x1 x2 x3 …… xN

Summary (short word sequence): y1 y2 y3 y4 ……

The machine first understands the document

The machine writes the summary in its own words

[Yu & Lee, SLT 16]


OVERVIEW


Spoken Content

Text

Speech Recognition

Key Term Extraction

Summarization

Key Term Extraction [Shen & Lee, Interspeech 16]

[Figure: the words x1 … xT of the document pass through an Embedding Layer to give vectors V1 … VT; weights α1 … αT combine them as Σ αi Vi; further labels: OT, Hidden Layer, Output Layer.]

Key Terms: DNN, LSTM

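The machine first skims the whole document

The machine extracts the key points of the document

It then goes back and highlights the key points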
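
The weighted sum Σ αi Vi on this slide can be illustrated with a small sketch. It assumes the weights α come from a softmax over per-word relevance scores; the scores, the two-dimensional vectors, and the softmax normalization are all illustrative assumptions, not details given in the talk.

```python
import math


def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(scores, vectors):
    """Return the weights alpha_i and the weighted sum of the vectors V_i."""
    alphas = softmax(scores)
    dim = len(vectors[0])
    weighted = [sum(a * v[d] for a, v in zip(alphas, vectors)) for d in range(dim)]
    return alphas, weighted


# Hypothetical relevance scores and embeddings for a four-word document.
scores = [0.1, 2.0, 0.1, 1.5]
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
alphas, doc_vector = attend(scores, vectors)

# The word with the largest weight is the one the model "highlights".
top_word = max(range(len(alphas)), key=alphas.__getitem__)
```

In a trained model the scores would be produced by the network itself; here they are fixed by hand so the arithmetic is easy to follow.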


OVERVIEW


Spoken Content

Text

Speech Recognition

Spoken Content Retrieval

Key Term Extraction

Summarization


• Transcribe spoken content into text by speech recognition

• Use a text retrieval approach to search the transcriptions

[Figure labels: Spoken Content, Speech Recognition Models, Text, Text Retrieval, Query (from the learner), Retrieval Result, Black Box]
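
The cascade on this slide (recognize first, then run text retrieval over the transcriptions) can be sketched with a tiny inverted index. The transcripts below are hand-written stand-ins for speech recognition output, and the word-overlap ranking is an assumption chosen for brevity; the talk does not specify a retrieval model.

```python
from collections import defaultdict


def build_index(transcripts):
    """Map each word to the set of documents whose transcript contains it."""
    index = defaultdict(set)
    for doc_id, text in transcripts.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(index, query):
    """Rank documents by how many query words appear in their transcript."""
    hits = defaultdict(int)
    for word in query.lower().split():
        for doc_id in index.get(word, ()):
            hits[doc_id] += 1
    # Most matches first; ties are broken alphabetically by document id.
    return sorted(hits, key=lambda d: (-hits[d], d))


# Hypothetical transcripts, as if produced by a speech recognizer.
transcripts = {
    "lecture1": "deep learning for speech recognition",
    "lecture2": "structured learning and summarization",
    "lecture3": "speech summarization",
}
index = build_index(transcripts)
results = search(index, "speech summarization")
```

Because the index only sees the recognizer's text output, any recognition error propagates directly into the retrieval result.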

Overview Paper

• Lin-shan Lee, James Glass, Hung-yi Lee, Chun-an Chan, "Spoken Content Retrieval—Beyond Cascading Speech Recognition with Text Retrieval," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 9, pp. 1389-1420, Sept. 2015

• http://speech.ee.ntu.edu.tw/~tlkagk/paper/Overview.pdf

• 3-hour tutorial at INTERSPEECH 2016

• Slide: http://speech.ee.ntu.edu.tw/~tlkagk/slide/spoken_content_retrieval_IS16.pdf

Audio is difficult to browse

• Retrieval results of spoken content are usually noisy

• When the system returns the retrieval results, the user doesn’t know what he/she gets at first glance

Retrieval Result


OVERVIEW


Spoken Content

Text

Speech Recognition

Key Term Extraction

Summarization

Interaction

user

“Deep Learning”

Related to Machine Learning or Education?

Challenges

• Given the information entered by the users, which action should be taken?

“Give me an example.”

“Is it relevant to XXX?”

“More precisely, please.”

“Show the results.”

The retrieval system learns to take the most effective actions from historical interaction experiences.

Deep Reinforcement Learning

• The actions are determined by a neural network

• Input: information to help make the decision

• Output: which action should be taken

• Taking the action with the highest score

[Figure: Information is fed into a DNN, which outputs scores for Action A, Action B, …, Action Z; Max selects the action.]

The network parameters can be optimized by historical interaction.
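
As an illustration of "taking the action with the highest score", the sketch below runs a one-hidden-layer network over a feature vector and picks the best-scoring action. The four action names paraphrase the example responses on the earlier slide; the features, layer sizes, and hand-picked weights are assumptions, and the reinforcement-learning training step is not shown.

```python
import math

# Action names paraphrasing the system responses listed on the slides.
ACTIONS = ["ask_for_example", "ask_relevance", "ask_more_precise", "show_results"]


def score_actions(features, w_hidden, w_out):
    """Forward pass: one tanh hidden layer, then one linear score per action."""
    hidden = [math.tanh(sum(w * f for w, f in zip(row, features))) for row in w_hidden]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w_out]


def choose_action(features, w_hidden, w_out):
    """Pick the action whose score is highest."""
    scores = score_actions(features, w_hidden, w_out)
    best = max(range(len(scores)), key=scores.__getitem__)
    return ACTIONS[best]


# Hand-picked (untrained) weights: 2 input features, 2 hidden units, 4 actions.
w_hidden = [[1.0, 0.0], [0.0, 1.0]]
w_out = [[0.1, 0.9], [0.2, 0.1], [0.5, 0.2], [1.0, 0.0]]

confident = choose_action([1.0, 0.0], w_hidden, w_out)  # feature 0 active
uncertain = choose_action([0.0, 1.0], w_hidden, w_out)  # feature 1 active
```

In practice the weights would be learned from historical interactions, so that the chosen actions lead to better retrieval with less user effort.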


• Different network depth

The task cannot be addressed by a linear model.

Some depth is needed.

More Interaction

Better retrieval performance, less user labor


OVERVIEW


Spoken Content

Text

Speech Recognition

Key Term Extraction

Summarization

Interaction

Organization

Today’s Retrieval Techniques

752 matches

More is less ……

• Given all the related lectures from different courses

Which lecture should I go to first?

Learning Map

• Nodes: lectures on the same topics

• Edges: suggested learning order

learner

[Shen & Lee, Interspeech 15]
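
A learning map of this kind is a directed graph, and one simple way to read a study order off it is a topological sort. The sketch below uses Python's standard `graphlib` module (Python 3.9+); the lecture names and prerequisite edges are hypothetical, and the talk does not say how its learning order is actually computed.

```python
from graphlib import TopologicalSorter

# Hypothetical learning map: each lecture maps to the lectures
# that are suggested to be studied before it.
prerequisites = {
    "Neural Networks": {"Linear Models"},
    "Recurrent Networks": {"Neural Networks"},
    "Attention Models": {"Recurrent Networks"},
}

# static_order() yields every node after all of its predecessors.
order = list(TopologicalSorter(prerequisites).static_order())
```

With a simple chain like this one the order is unique; for a richer map several valid orders may exist, and the learner (or the system) would pick among them.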


OVERVIEW

Spoken Content

Text

Speech Recognition

Key Term Extraction

Summarization

Question Answering

Interaction

Organization

Spoken Question Answering

What is a possible origin of Venus’ clouds?

Spoken Question Answering: Machine answers questions based on the information in spoken content

Gases released as a result of volcanic activity

Spoken Question Answering

• TOEFL Listening Comprehension Test by Machine

Question: “What is a possible origin of Venus’ clouds?”

Audio Story:

Choices:
(A) gases released as a result of volcanic activity
(B) chemical reactions caused by high surface temperatures
(C) bursts of radio energy from the planet's surface
(D) strong winds that blow dust into the atmosphere

(The original story is 5 min long.)

[Tseng & Lee, Interspeech 16]

Results

[Figure: bar chart of Accuracy (%) for systems (1) to (7), with the label Naive Approaches]

Memory Network: 39.2% (proposed by FB AI group)


Model Architecture

Question: “What is a possible origin of Venus’ clouds?”

Question Semantics

…… It be quite possible that this be due to volcanic eruption because volcanic eruption often emit gas. If that be the case volcanism could very well be the root cause of Venus 's thick cloud cover. And also we have observe burst of radio energy from the planet 's surface. These burst be similar to what we see when volcano ……

Audio Story:

Speech Recognition

Semantic Analysis

Semantic Analysis

Attention

Answer

Select the choice most similar to the answer

Attention

The model is learned end-to-end.
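
The last two steps of this architecture (attend over the story, then select the choice most similar to the answer) can be sketched with plain vectors. Everything numeric here is an assumption: the three-dimensional "semantic" vectors are made up, cosine similarity stands in for whatever similarity the model learns, and no training is shown.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def answer_vector(question, sentences):
    """Attend over story sentences, weighting each by similarity to the question."""
    sims = [cosine(question, s) for s in sentences]
    exps = [math.exp(s) for s in sims]
    total = sum(exps)
    alphas = [e / total for e in exps]  # softmax attention weights
    dim = len(question)
    return [sum(a * s[d] for a, s in zip(alphas, sentences)) for d in range(dim)]


def pick_choice(question, sentences, choices):
    """Return the index of the choice most similar to the attended answer."""
    answer = answer_vector(question, sentences)
    sims = [cosine(answer, c) for c in choices]
    return max(range(len(choices)), key=sims.__getitem__)


# Hypothetical semantic vectors for a question, story sentences, and choices A-D.
question = [1.0, 0.0, 0.0]
sentences = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
choices = [[1.0, 0.2, 0.1], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.1, 0.1, 1.0]]
best = pick_choice(question, sentences, choices)  # index 0 corresponds to (A)
```

In the real system the vectors come from the semantic analysis modules, and all parts are trained jointly rather than fixed by hand.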

Results

[Figure: bar chart of Accuracy (%) for systems (1) to (7), with the label Naive Approaches]

Memory Network: 39.2% (proposed by FB AI group)

Proposed Approach: 48.8%

[Fang & Hsu & Lee, SLT 16] [Tseng & Lee, Interspeech 16]


OVERVIEW


Spoken Content

Text

Speech Recognition

Key Term Extraction

Summarization

Question Answering

Interaction

Organization

Is speech recognition essential?


CHALLENGES IN SPEECH RECOGNITION?

Lots of audio files in different languages on the Internet

Most languages have little annotated data for training speech recognition systems.

Some audio files are produced in several different languages

Some languages do not even have a written form

Out-of-vocabulary (OOV) problem


OVERVIEW


Spoken Content

Text

Speech Recognition

Key Term Extraction

Summarization

Question Answering

Interaction

Organization

Is speech recognition essential?

Is it possible to directly understand spoken content?

Preliminary Study: Learning from Audio Book

Machine listens to lots of audio books

[Chung, Interspeech 16]

Machine does not have any prior knowledge

Like an infant

Preliminary Study: Audio Word to Vector

• Audio segment corresponding to an unknown word

Fixed-length vector

Preliminary Study: Audio Word to Vector

• The audio segments corresponding to words with similar pronunciations are close to each other.

[Figure: audio segments plotted as points, labeled ever, ever, never, never, never, dog, dog, dogs]

Unsupervised

Sequence-to-sequence Auto-encoder

audio segment

acoustic features

The values in the memory represent the whole audio segment

x1 x2 x3 x4

RNN Encoder

audio segment vector

The vector we want

How to train RNN Encoder?

Sequence-to-sequence Auto-encoder

RNN Decoder

x1 x2 x3 x4

y1 y2 y3 y4

x1 x2 x3 x4

RNN Encoder

audio segment

acoustic features

The RNN encoder and decoder are jointly trained.
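
The encoder/decoder pairing can be sketched as a forward pass with a reconstruction loss. This is a minimal illustration with a plain tanh RNN, random untrained weights, and made-up feature and hidden sizes; the actual network type, dimensions, and training procedure are not given on the slides, and the gradient-based training that would minimize the loss is omitted.

```python
import math
import random


def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def init_params(feat, hidden, seed=0):
    """Random, untrained weights (sizes are hypothetical)."""
    rng = random.Random(seed)
    return {
        "feat": feat, "hidden": hidden,
        "enc_in": rand_matrix(hidden, feat, rng),
        "enc_rec": rand_matrix(hidden, hidden, rng),
        "dec_in": rand_matrix(hidden, feat, rng),
        "dec_rec": rand_matrix(hidden, hidden, rng),
        "dec_out": rand_matrix(feat, hidden, rng),
    }


def rnn_step(w_in, w_rec, x, h):
    """One recurrent step: new state from the current input and previous state."""
    a, b = matvec(w_in, x), matvec(w_rec, h)
    return [math.tanh(p + q) for p, q in zip(a, b)]


def encode(frames, p):
    """Read the frames one by one; the final state is the fixed-length vector."""
    h = [0.0] * p["hidden"]
    for x in frames:
        h = rnn_step(p["enc_in"], p["enc_rec"], x, h)
    return h


def decode(z, steps, p):
    """Start from the encoder's vector and emit one output frame per step."""
    h, y = z, [0.0] * p["feat"]
    outputs = []
    for _ in range(steps):
        h = rnn_step(p["dec_in"], p["dec_rec"], y, h)
        y = matvec(p["dec_out"], h)
        outputs.append(y)
    return outputs


def reconstruction_loss(frames, p):
    """Mean squared error between decoder outputs y_t and inputs x_t."""
    ys = decode(encode(frames, p), len(frames), p)
    se = sum((y - x) ** 2 for yf, xf in zip(ys, frames) for y, x in zip(yf, xf))
    return se / (len(frames) * p["feat"])


params = init_params(feat=3, hidden=4)
short_seg = [[0.1, 0.2, 0.3]] * 2   # a 2-frame audio segment
long_seg = [[0.1, 0.2, 0.3]] * 5    # a 5-frame audio segment
vec_short, vec_long = encode(short_seg, params), encode(long_seg, params)
loss = reconstruction_loss(long_seg, params)
```

Whatever the segment length, the encoder returns a vector of the same size, which is the property the slides rely on; joint training would adjust all the weight matrices so that `loss` goes down.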

Input acoustic features

Experimental Results

never, ever

Cosine Similarity

Edit Distance between Phoneme sequences

RNN Encoder

RNN Encoder

Experimental Results

More similar pronunciation

Higher cosine similarity.
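
The two quantities compared in this experiment can be computed as follows. The phoneme transcriptions use an ARPAbet-style notation chosen for illustration, and the two-dimensional "audio segment vectors" are invented; only the relationship (smaller edit distance alongside higher cosine similarity) mirrors the slide.

```python
import math


def edit_distance(a, b):
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute or match
        prev = cur
    return prev[-1]


def cosine(u, v):
    dot = sum(p * q for p, q in zip(u, v))
    return dot / (math.sqrt(sum(p * p for p in u)) * math.sqrt(sum(q * q for q in v)))


# Assumed phoneme sequences (ARPAbet-style) for three words.
never = ["N", "EH", "V", "ER"]
ever = ["EH", "V", "ER"]
dog = ["D", "AO", "G"]

# Hypothetical audio segment vectors, not real encoder outputs.
vec = {"never": [0.9, 0.1], "ever": [0.8, 0.2], "dog": [-0.2, 0.9]}

d_close = edit_distance(never, ever)
d_far = edit_distance(never, dog)
sim_close = cosine(vec["never"], vec["ever"])
sim_far = cosine(vec["never"], vec["dog"])
```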

Observation

• Visualizing embedding vectors of the words

fear

near

name

fame

audio segment vector

Project on 2-D

Next Step ……

• Including semantics?

flower tree

dog

cat

cats

walk

walked

run


CONCLUDING REMARKS


Spoken Content

Text

Speech Recognition

Key Term Extraction

Summarization

Question-answering

Interaction

Organization

With deep learning, machines will understand spoken content and extract useful information for humans.


If you want to “deeply learn deep learning”

My Course: Machine Learning and Having It Deep and Structured

http://speech.ee.ntu.edu.tw/~tlkagk/courses_MLSD15_2.html

6 hour version: http://www.slideshare.net/tw_dsconf/ss-62245351

“Neural Networks and Deep Learning”

written by Michael Nielsen

http://neuralnetworksanddeeplearning.com/

“Deep Learning”

Written by Yoshua Bengio, Ian J. Goodfellow and Aaron Courville

http://www.deeplearningbook.org


THANK YOU
