Towards Machine Comprehension of Spoken Content

TAIPEI | SEP. 21-22, 2016

李宏毅 Hung-yi Lee

TOWARDS MACHINE COMPREHENSION OF SPOKEN CONTENT

MULTIMEDIA INTERNET CONTENT

300 hrs multimedia is uploaded per minute.(2015.01)

1874 courses on coursera(2016.04)

Ø We need machine to listen to the audio data, understand it, and extract useful information for humans.

Ø In these multimedia, the spoken part carries very important information about the content.

Ø Nobody is able to go through the data.

Ø Overview the technology developed at NTU Speech Lab

OVERVIEW

2016/9/26

SpokenContent

SpeechRecognition

DeepLearning

DeepLearningforSpeechRecognition• AcousticModel(聲學模型)

• DNN+HMM• Widelyused

• CTC• Sequencetosequencelearning

• DNN+structuredSVM[Meng &Lee,ICASSP10]

• DNN+structuredDNN[Liao&Lee,ASRU15]

hidden layer h1

hidden layer h2

F2(x, y; θ2)

speech signal

F1(x, y; θ1)

y (phoneme label sequence)

(a) use DNN phone posterior as acoustic vector

(b) structured SVM (c) structured DNN

Ψ(x,y)

hidden layer hL-1

hidden layer h1

hidden layer hL

output layer

input layer

feature extraction

a c b a

x (acoustic vector sequence)

DeepLearningforSpeechRecognition• LanguageModel(語言模型)

http://colah.github.io/posts/2015-08-Understanding-LSTMs/

Neural Turing

Machine

Attention-based Model

[Ko &Lee,submittedtoICASSP17]

[Liu&Lee,submittedtoICASSP17]

OVERVIEW

SpokenContent

SpeechRecognition

Summarization

SPEECH SUMMARIZATION

RetrievedAudio File

Summary Select the most informative segments to form a compact version

1 hour long

10 minutes

ExtractiveSummariesRef:http://speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015/Structured%20Lecture/Summarization%20Hidden_2.ecm.mp4/index.html

SPEECH SUMMARIZATIONAbstractiveSummaries

x1 x2 x3 xN

……

Input document (long word sequence)

Summary (short word sequence)

y1 y2 y3 y4

機器先看懂文章

機器用自己的話來寫摘要

[Yu & Lee, SLT 16]

OVERVIEW

2016/9/26

SpokenContent

SpeechRecognition

KeyTermExtraction

Summarization

Key Term Extraction [Shen & Lee, Interspeech 16]

α1 α2 α3 α4 … αT

ΣαiVi

x4x3x2x1 xT…

…V3V2V1 V4 VT

Embedding Layer

…V3V2V1 V4 VT

document

Output Layer

Hidden Layer

Embedding Layer

Key Terms: DNN, LSTN

機器先大略讀過整篇文章

機器擷取文章中的重點

回頭把重點畫起來

OVERVIEW

2016/9/26

SpokenContent

SpeechRecognition

SpokenContentRetrieval

KeyTermExtraction

Summarization

l Transcribe spoken content into text by speech recognition

SpeechRecognition Models

Retrieval Result

TextRetrieval Query learner

l Use text retrieval approach to search the transcriptions

Spoken Content

Black Box

OverviewPaper

• Lin-shan Lee,JamesGlass,Hung-yiLee,Chun-anChan,"SpokenContentRetrieval—BeyondCascadingSpeechRecognitionwithTextRetrieval,"IEEE/ACMTransactionsonAudio,Speech,andLanguageProcessing,vol.23,no.9,pp.1389-1420,Sept.2015

• http://speech.ee.ntu.edu.tw/~tlkagk/paper/Overview.pdf• 3hourstutorialatINTERSPEECH2016

• Slide:http://speech.ee.ntu.edu.tw/~tlkagk/slide/spoken_content_retrieval_IS16.pdf

Audioisdifficulttobrowse

• Retrievalresultsofspokencontentisusuallynoisy• Whenthesystemreturnstheretrievalresults,userdoesn’tknowwhathe/shegetatthefirstglance

Retrieval Result

OVERVIEW

2016/9/26

SpokenContent

SpeechRecognition

KeyTermExtraction

Summarization

Interaction

“Deep Learning”

Related to Machine Learning or Education?

Challenges

• Given the information entered by the users, which action should be taken?

“Give me an example.”“Is it relevant to XXX?”

“More precisely, please.”

“Show the results.”

Theretrievalsystemlearnstotakethemosteffectiveactionsfromhistoricalinteractionexperiences.

DeepReinforcementLearning

• Theactionsaredeterminedbyaneuralnetwork• Input:informationtohelptomakethedecision• Output:whichactionshouldbetaken

• Takingtheactionwiththehighestscore

… InformationAction Z

Action B

Action A

The network parameters can be optimized by historical interaction.

DeepReinforcementLearning

• Differentnetworkdepth

The task cannot be addressed by linear model.

Some depth is needed.

More Interaction

Better retrieval performance,Less user labor

OVERVIEW

2016/9/26

SpokenContent

SpeechRecognition

KeyTermExtraction

Summarization

Interaction

Organization

Today’sRetrievalTechniques

752matches

More is less …...

• Given all the related lectures from different courses

Which lecture should I go first?

Learning MapØ Nodes: lectures in the

same topicsØ Edges: suggested learning

learner

[Shen & Lee, Interspeech 15]

OVERVIEW

SpokenContent

SpeechRecognition

KeyTermExtraction

Summarization

QuestionAnswering

Interaction

Organization

SpokenQuestionAnswering

WhatisapossibleoriginofVenus’clouds?

Spoken Question Answering: Machine answers questions based on the information in spoken content

Gasesreleasedasaresultofvolcanicactivity

SpokenQuestionAnswering

• TOEFLListeningComprehensionTestbyMachine

Question: “ What is a possible origin of Venus’ clouds? ” Audio Story:

Choices:(A) gases released as a result of volcanic activity(B) chemical reactions caused by high surface temperatures(C) bursts of radio energy from the plane's surface(D) strong winds that blow dust into the atmosphere

(The original story is 5 min long.)

[Tseng & Lee, Interspeech 16]

Results

(1) (2) (3) (4) (5) (6) (7)

MemoryNetwork:39.2%(proposed by FB AI group)

NaiveApproaches

ModelArchitecture

(A) (A) (A) (A)

(B) (B) (B)

ModelArchitecture

“what is a possible origin of Venus‘ clouds?"

Question:

QuestionSemantics

…… It be quite possible that this be due to volcanic eruption because volcanic eruption often emit gas. If that be the case volcanism could very well be the root cause of Venus 's thick cloud cover. And also we have observe burst of radio energy from the planet 's surface. These burst be similar to what we see when volcano ……

Audio Story:

SpeechRecognition

SemanticAnalysis

Attention

Answer

Select the choice most similar to the answer

Attention

Themodelislearnedend-to-end.

Results

(1) (2) (3) (4) (5) (6) (7)

MemoryNetwork:39.2%

NaiveApproaches

ProposedApproach:48.8%

(proposed by FB AI group)

[Fang & Hsu & Lee, SLT 16][Tseng & Lee, Interspeech 16]

OVERVIEW

2016/9/26

SpokenContent

SpeechRecognition

KeyTermExtraction

Summarization

QuestionAnswering

Interaction

Organization

Speech recognition is essential?

CHALLENGES IN SPEECH RECOGNITION?

Lots of audio files in different languages on the Internet

Most languages have little annotated data for training speech recognition systems.

Some audio files are produced in several different of languages

Some languages even do not have written form

Out-of-vocabulary (OOV) problem

OVERVIEW

2016/9/26

SpokenContent

SpeechRecognition

KeyTermExtraction

Summarization

QuestionAnswering

Interaction

Organization

Speech recognition is essential?

Is it possible to directly understand spoken content?

PreliminaryStudy:LearningfromAudioBook

Machinelistenstolotsofaudiobook

[Chung,Interspeech 16)

Machinedoesnothaveanypriorknowledge

Likeaninfant

PreliminaryStudy:AudioWordtoVector

• AudiosegmentcorrespondingtoanunknownwordFixed-length vector

PreliminaryStudy:AudioWordtoVector

• Theaudiosegmentscorrespondingtowordswithsimilarpronunciationsareclosetoeachother.

ever ever

Unsupervised

Sequence-to-sequenceAuto-encoder

audio segment

acoustic features

The values in the memory represent the whole audio segment

x1 x2 x3 x4

RNN Encoder

audio segmentvector

The vector we want

How to train RNN Encoder?

Sequence-to-sequenceAuto-encoder

RNN Decoderx1 x2 x3 x4

y1 y2 y3 y4

x1 x2 x3 x4

RNN Encoder

audio segment

acoustic features

The RNN encoder and decoder are jointly trained.

Input acoustic features

ExperimentalResults

neverever

CosineSimilarity

Edit Distance between Phoneme sequences

RNNEncoder

ExperimentalResults

More similar pronunciation

Higher cosine similarity.

Observation

• Visualizingembeddingvectorsofthewords

nearname

audio segmentvector

Project on 2-D

NextStep……

• Includingsemantics?

flower tree

walked

CONCLUDINGREMARKS

CONCLUDING REMARKS

2016/9/26

SpokenContent

SpeechRecognition

KeyTermExtraction

Summarization

Question-answering

Interaction

Organization

With Deep Learning,machine will understand spoken content, and extract useful information for humans.

如果你想 “深度學習深度學習”My Course: Machine learning and having it deep and structured

http://speech.ee.ntu.edu.tw/~tlkagk/courses_MLSD15_2.html

6 hour version: http://www.slideshare.net/tw_dsconf/ss-62245351

“Neural Networks and Deep Learning”

written by Michael Nielsen

http://neuralnetworksanddeeplearning.com/

“Deep Learning”

Written by Yoshua Bengio, Ian J. Goodfellow and Aaron Courville

http://www.deeplearningbook.org

TAIPEI | SEP. 21-22, 2016

THANK YOU

Towards Machine Comprehension of Spoken Content

Technology

Transcript of Towards Machine Comprehension of Spoken Content

Towards Spoken Knowledge Structuring and Organizationspeech.ee.ntu.edu.tw/~tlkagk/slide/MyTalk_NTUCS_v8.pdf · Spoken Language Processing techniques can be very helpful. ... Easy

mrshemby.files.wordpress.com€¦ · Web viewFirst words are usually spoken, speed and accuracy of word comprehension increase rapidly . Sorts objects into categories. Joins in

2017 MYE Topics Subject: English Language Level/Stream MYE … · 2017. 4. 7. · Language Use (Editing, Lang. in Spoken Context, Modified Cloze 1&2) Comprehension (Comprehension

Teachers' Beliefs and Attitudes towards Teaching Reading Comprehension to EFL Students

ELT Methods and Practices Unit 6. Listening comprehension: Learning to understand the spoken language Bessie Dendrinos School of Philosophy Faculty of.

Lexical representations in spoken language comprehensioncsl.psychol.cam.ac.uk/publications/pdf/88_Marslen-Wilson_LCP.pdf · Language Comprehension William Marslen-Wilson Max-Planck

A Trialogue-Based Spoken Dialogue ... - … comprehension and synthesis skills via written text or one-turn spoken ... reading passage, ... volves communication with other students

Towards a Description of Modern Spoken Indonesian...'Towards a Description of Modern Spoken Indonesian' Part A: Textbook Language vs Authentic Speech 'Many sentences occurred in textbooks,

Recent efforts towards Machine Reading … towards... · 2018-03-27 · Recent efforts towards Machine Reading Comprehension: A Literature Review Mar´ıa Fernanda Mora Alba Computer

West Bengal University of Technology Conversation. Listening Comprehension Achieving ability to comprehend material delivered at relatively fast speed; comprehending spoken material

Transfer effects on spoken sentence comprehension and ...

Game-based spoken dialog language learning applications ...vikramr.com/pubs/interspeech-2018-evanini-show-and-tell.pdf · game-based spoken dialog applications targeted towards young

Useful reading: Reading Rockets Comprehension ... · Use spoken text: text read by adult or buddy or text read by audio books or computers, when teaching comprehension strategies.

Towards human-like spoken dialogue systems Introduction · example a web form. The interface metaphor lies close at hand because many of the spoken dialogue systems that users are

Towards Automatic Assessment of Spontaneous Spoken Englishmi.eng.cam.ac.uk/~mjfg/ALTA/publications/ALTA_SpComm2017.pdf · Keywords: Automatic assessment of Spoken English, Spontaneous

Spoken sentence comprehension in children with … · Spoken sentence comprehension in children with dyslexia and language ... is a distinct disorder from dyslexia in which oral ...

SPOKEN LANGUAGE COMPREHENSION Anne Cutler Addendum: How to study issues in spoken language comprehension.

On-Line Spoken Language Comprehension CA 1

Eye movements and spoken language comprehension.

SPOKEN LANGUAGE COMPREHENSION