Soutenance de Stage de L3 - behaviors.ai · Soutenance de Stage de L3 Multimodal Learning: A case...
L3 Internship Defense (Soutenance de Stage de L3)
Multimodal Learning: A Case Study for Gesture and Audio-Visual Speech Recognition
Hsieh Yu-Guan (Info 2016)
Supervised by Amélie Cordier & Mathieu Lefort
Internship period: 14th June 2017 – 11th August 2017
Yu-Guan Hsieh Multimodal Learning September 2017 1 / 19
Introduction
BEHAVIORS.AI
Hoomano & LIRIS
TensorFlow
Introduction
Artificial Intelligence
> Robotics (embodied paradigm)
> Developmental robotics
> Multimodal learning
> Gesture and Audio-Visual recognition
Deep Network Architectures – Convolutional Neural Networks (CNNs)
Deep Network Architectures – Convolutional Neural Networks (CNNs) – Convolution
Source: https://cambridgespark.com/content/tutorials/convolutional-neural-networks-with-keras/index.html
Also see https://github.com/vdumoulin/conv_arithmetic
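Since the slide's figure is not reproduced here, the convolution operation can be sketched in a few lines of NumPy. This is a minimal illustration, assuming a single-channel input, no padding and stride 1; the function name `conv2d_valid` and the toy image and kernel are hypothetical and are not taken from the presentation. As in most deep learning libraries, the kernel is not flipped, so strictly speaking this computes a cross-correlation.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and sum the products."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Element-wise product of the kernel with the patch under it.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Toy 4x4 image and a 2x2 kernel that subtracts the lower-right pixel
# of each patch from its upper-left pixel.
image = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.array([[1.0, 0.0],
                   [0.0, -1.0]])
feature_map = conv2d_valid(image, kernel)
print(feature_map.shape)  # a 4x4 input and a 2x2 kernel give a 3x3 feature map
```

A 2 × 2 kernel on a 4 × 4 input yields a 3 × 3 feature map, which is the size reduction that the conv_arithmetic animations linked above illustrate.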
Deep Network Architectures – Convolutional Neural Networks (CNNs) – Max-pooling
Source: https://cambridgespark.com/content/tutorials/convolutional-neural-networks-with-keras/index.html
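Max-pooling keeps only the largest value in each window of a feature map. The sketch below assumes non-overlapping square windows on a single-channel array; the helper name `max_pool2d` and the example values are illustrative only and do not come from the slides.

```python
import numpy as np

def max_pool2d(x, size=2):
    """Non-overlapping max-pooling; rows/columns that do not fill a window are dropped."""
    h, w = x.shape[0] // size, x.shape[1] // size
    trimmed = x[:h * size, :w * size]
    # Split each axis into (window index, position inside window), then take
    # the maximum over the two "inside window" axes.
    return trimmed.reshape(h, size, w, size).max(axis=(1, 3))

x = np.array([[1, 3, 2, 4],
              [5, 6, 7, 8],
              [3, 2, 1, 0],
              [1, 2, 3, 4]])
pooled = max_pool2d(x)
print(pooled)  # 2x2 array holding the maximum of each 2x2 window
```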
Deep Network Architectures – Autoencoder
Source: https://hackernoon.com/autoencoders-deep-learning-bits-1-11731e200694
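An autoencoder compresses its input into a smaller code and is trained to reconstruct the input from that code. The following is a deliberately simple sketch, assuming a linear encoder and decoder trained by plain gradient descent on a mean squared reconstruction error; the data, layer sizes and learning rate are made up for illustration and are unrelated to the networks used in the internship.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))                  # toy data: 200 samples, 8 features
W_enc = rng.normal(scale=0.1, size=(8, 3))     # encoder: 8 -> 3 (bottleneck)
W_dec = rng.normal(scale=0.1, size=(3, 8))     # decoder: 3 -> 8

def reconstruction_loss(X, W_enc, W_dec):
    """Mean squared error between the input and its reconstruction."""
    code = X @ W_enc
    return np.mean((code @ W_dec - X) ** 2)

initial_loss = reconstruction_loss(X, W_enc, W_dec)
lr = 0.05
for _ in range(1000):
    code = X @ W_enc                           # encode
    err = code @ W_dec - X                     # decode and compare with the input
    # Gradients of the mean squared error with respect to both weight matrices.
    grad_dec = code.T @ err * (2 / err.size)
    grad_enc = X.T @ (err @ W_dec.T) * (2 / err.size)
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc
final_loss = reconstruction_loss(X, W_enc, W_dec)
print(initial_loss, final_loss)
```

Because the code has only 3 dimensions for 8 input features, the reconstruction cannot be perfect; the loss decreases but does not reach zero.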
Deep Network Architectures – Convolutional Autoencoder
Source: https://hackernoon.com/autoencoders-deep-learning-bits-1-11731e200694
Training a Machine Learning Model
Loss function: cross-entropy, L2-distance
Stochastic gradient descent (SGD)
Backpropagation
Variants of SGD: AdaGrad, Adam
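The items above can be made concrete with a small NumPy sketch of the two loss functions and of the SGD and Adam update rules. It is an illustration under stated assumptions: the toy objective, the learning rate and the helper names are hypothetical, the Adam constants are the commonly used defaults, and the gradient is computed exactly rather than on random mini-batches, so the "stochastic" part of SGD is not exercised. AdaGrad is not shown.

```python
import numpy as np

def cross_entropy(probs, labels):
    """Mean negative log-probability assigned to the true class."""
    return -np.mean(np.log(probs[np.arange(len(labels)), labels]))

def l2_distance(pred, target):
    """Mean squared L2 distance between predictions and targets."""
    return np.mean(np.sum((pred - target) ** 2, axis=1))

def sgd_step(theta, grad, lr=0.1):
    """Plain gradient step: move against the gradient."""
    return theta - lr * grad

def adam_step(theta, grad, state, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; `state` holds the step count and both moment estimates."""
    t, m, v = state
    t += 1
    m = beta1 * m + (1 - beta1) * grad          # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2     # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps), (t, m, v)

# Toy objective f(theta) = ||theta - target||^2 with gradient 2 * (theta - target).
target = np.array([1.0, -2.0])
theta_sgd = np.zeros(2)
theta_adam = np.zeros(2)
state = (0, np.zeros(2), np.zeros(2))
for _ in range(200):
    theta_sgd = sgd_step(theta_sgd, 2 * (theta_sgd - target))
    theta_adam, state = adam_step(theta_adam, 2 * (theta_adam - target), state)
print(theta_sgd, theta_adam)
```

In a neural network the gradient passed to these update functions would be obtained by backpropagation instead of the closed-form expression used here.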
Datasets – ASL Finger Spelling
RGB and depth
More than 60000 images for each modality
24 static signs in American Sign Language
5 subjects
Only one channel in input
Resized to 83 × 83 and z-normalized
(a) (b) (c) (d)
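Z-normalization, mentioned in the preprocessing above, shifts the data to zero mean and rescales it to unit standard deviation. The sketch below assumes the statistics are computed per image, which the slide does not specify (they could equally be computed over the whole dataset); the random array is only a stand-in for one resized single-channel image, and resizing itself is not shown.

```python
import numpy as np

def z_normalize(image, eps=1e-8):
    """Shift to zero mean and scale to unit standard deviation."""
    image = image.astype(np.float64)
    # eps guards against division by zero for a constant image.
    return (image - image.mean()) / (image.std() + eps)

rng = np.random.default_rng(0)
fake_image = rng.integers(0, 256, size=(83, 83))   # stand-in for one 83 x 83 image
normalized = z_normalize(fake_image)
print(normalized.mean(), normalized.std())
```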
Datasets – AVLetters
Audio-visual
26 letters from A to Z
10 speakers, each letter repeated 3 times
Audio: 24 frames in input, 26 MFCCs for each frame
Video: 12 frames in input, z-normalized
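The figures above fix the size of the dataset and of one audio input. A quick check of the arithmetic follows; the variable names are chosen here for illustration, and the video input size is left out because the slide does not give the frame resolution.

```python
# Dataset size: every speaker utters every letter three times.
n_letters, n_speakers, n_repetitions = 26, 10, 3
n_utterances = n_letters * n_speakers * n_repetitions

# Audio input: 24 frames, each described by 26 MFCCs.
audio_frames, mfccs_per_frame = 24, 26
audio_input_size = audio_frames * mfccs_per_frame

print(n_utterances, audio_input_size)  # 780 utterances, 624 audio values per input
```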
Results – Unsupervised Learning with CAE
Raw: Perceptron that reads raw input data.
CAE features: Perceptron stacked on the middle layer of the CAE.
CAE architecture: Perceptron stacked on the middle layer of the CAE, but the whole network is trained in a supervised way as a CNN.

| Modality | Split | Raw | CAE features | CAE architecture |
| --- | --- | --- | --- | --- |
| Intensity | train | 69.47 % | 78.87 % | 91.29 % |
| Intensity | test | 32.64 % | 50.24 % | 65.44 % |
| Depth | train | 63.64 % | 79.61 % | 88.80 % |
| Depth | test | 29.93 % | 41.64 % | 55.62 % |
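The difference between the "CAE features" and "CAE architecture" setups lies in which weights are updated during supervised training. The sketch below illustrates the "CAE features" case under strong simplifying assumptions: the encoder is a fixed random tanh layer standing in for the pretrained CAE middle layer, the data is synthetic, and the perceptron is a softmax classifier trained by gradient descent. None of the sizes or values come from the internship.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_inputs, n_features, n_classes = 300, 20, 8, 3

# Toy data; labels depend on the inputs so that there is something to learn.
X = rng.normal(size=(n_samples, n_inputs))
y = np.argmax(X[:, :n_classes], axis=1)

# Stand-in for the pretrained CAE encoder (hypothetical: a fixed random tanh layer).
W_enc = rng.normal(scale=0.5, size=(n_inputs, n_features))
W_enc_before = W_enc.copy()

def encode(X):
    return np.tanh(X @ W_enc)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)    # subtract the row maximum for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def loss_and_grad(F, y, W, b):
    """Cross-entropy of a softmax perceptron and its gradients w.r.t. W and b."""
    p = softmax(F @ W + b)
    loss = -np.mean(np.log(p[np.arange(len(y)), y]))
    p[np.arange(len(y)), y] -= 1.0          # p - one_hot(y)
    return loss, F.T @ p / len(y), p.mean(axis=0)

# "CAE features" setup: only the perceptron (W, b) is trained; the encoder is frozen.
F = encode(X)
W = np.zeros((n_features, n_classes))
b = np.zeros(n_classes)
initial_loss, _, _ = loss_and_grad(F, y, W, b)
for _ in range(500):
    _, grad_W, grad_b = loss_and_grad(F, y, W, b)
    W -= 0.2 * grad_W
    b -= 0.2 * grad_b
final_loss, _, _ = loss_and_grad(F, y, W, b)
print(initial_loss, final_loss)
```

In the "CAE architecture" setup the encoder weights would also receive gradient updates through backpropagation, which is consistent with the higher accuracies reported in that column.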
Results – Shared Representation Learning
Shared: Perceptron that exploits the shared representation learned by a bimodal CAE.

| Modality | Split | Raw | CAE features | CAE architecture | Shared |
| --- | --- | --- | --- | --- | --- |
| Intensity | train | 69.47 % | 78.87 % | 91.29 % | 85.85 % |
| Intensity | test | 32.64 % | 50.24 % | 65.44 % | 53.38 % |
| Depth | train | 63.64 % | 79.61 % | 88.80 % | 81.83 % |
| Depth | test | 29.93 % | 41.64 % | 55.62 % | 42.85 % |
AVSR Knowledge Transfer
Fine-tuned for 160 steps.
For Exp1, Exp2 and Exp3, we have respectively α0 = 0.001, 0.005, 0.001 and pa = 0.85, 0.85, 1.
Notice that since pa = 1 for Exp3, no video data is given in input during fine-tuning.

| Setting | Trv | Trv (A∼T) | Trv (U∼Z) | Tev | Tev (A∼T) | Tev (U∼Z) |
| --- | --- | --- | --- | --- | --- | --- |
| No transfer | 77.67 % | 100 % | 0 % | 40.56 % | 54.48 % | 0 % |
| Exp1 | 81.17 % | 98.28 % | 21.64 % | 39.44 % | 47.76 % | 15.22 % |
| Exp2 | 40.83 % | 51.07 % | 5.22 % | 23.89 % | 30.60 % | 4.35 % |
| Exp3 | 19.67 % | 12.23 % | 45.52 % | 12.22 % | 2.24 % | 41.34 % |
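The remark that pa = 1 leaves no video data in the input suggests reading pa as the probability that a fine-tuning sample is presented with audio only. That reading is an interpretation of the slide, not a definition given there. Under it, and assuming further that a withheld video input is replaced by zeros, the mechanism could look like the sketch below; the function name and the batch shape are hypothetical.

```python
import numpy as np

def withhold_video(video_batch, p_a, rng):
    """With probability p_a per sample, replace the video input by zeros."""
    audio_only = rng.random(len(video_batch)) < p_a   # True where video is withheld
    masked = video_batch.copy()
    masked[audio_only] = 0.0
    return masked, audio_only

rng = np.random.default_rng(0)
video = np.ones((1000, 12, 4, 4))   # hypothetical batch: 12 frames of 4x4 pixels each
masked, audio_only = withhold_video(video, p_a=0.85, rng=rng)
print(audio_only.mean())            # fraction of samples that lost their video input

# With p_a = 1, as in Exp3, every sample is audio-only.
all_masked, _ = withhold_video(video, p_a=1.0, rng=rng)
# With p_a = 0 the video input is left untouched.
untouched, _ = withhold_video(video, p_a=0.0, rng=rng)
```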
Conclusion
Datasets, hyperparameters
Applications in robotics
Other approaches