
DEEP MULTIMODAL SPEAKER NAMING

YONGTAO HU, JIMMY REN, JINGWEN DAI, CHANG YUAN, LI XU, AND WENPING WANG

Project page, code and dataset: http://herohuyongtao.github.io/projects/deepsn/

OVERVIEW

Automatic speaker naming is the problem of localizing as well as identifying each speaking character in a video. Previous multimodal approaches to this problem usually process the data of different modalities individually and merge them using handcrafted heuristics. In this work, we propose a novel CNN-based learning framework to automatically learn the fusion function of both face and audio cues. Without using face tracking, facial landmark localization or subtitles/transcripts, our system with robust multimodal feature extraction is able to achieve state-of-the-art speaker naming performance evaluated on two diverse TV series.
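The fusion idea can be made concrete with a small two-branch network. The following is only an illustrative sketch, assuming PyTorch; the layer sizes, the concatenation-based fusion and the name FaceAudioFusionNet are placeholders for illustration, not the exact architecture used in the paper.

```python
import torch
import torch.nn as nn

class FaceAudioFusionNet(nn.Module):
    """Illustrative two-branch face+audio network (assumed layer sizes)."""

    def __init__(self, num_speakers: int, audio_dim: int = 75):
        super().__init__()
        # Face branch: raw 50x40 RGB face crops -> convolutional features.
        self.face_branch = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
            nn.Flatten(),
            nn.Linear(64 * 9 * 7, 256), nn.ReLU(),
        )
        # Audio branch: 75-D MFCC statistics -> dense features.
        self.audio_branch = nn.Sequential(
            nn.Linear(audio_dim, 128), nn.ReLU(),
        )
        # Fusion head: learned jointly instead of handcrafted heuristics.
        self.classifier = nn.Sequential(
            nn.Linear(256 + 128, 128), nn.ReLU(),
            nn.Linear(128, num_speakers),
        )

    def forward(self, face, audio):
        f = self.face_branch(face)    # face: (B, 3, 50, 40)
        a = self.audio_branch(audio)  # audio: (B, 75)
        return self.classifier(torch.cat([f, a], dim=1))
```

Concatenating the two branch outputs and classifying them jointly lets the fusion weights be learned from data rather than tuned by hand per modality.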

MULTIMODAL LEARNING FRAMEWORK

MULTIMODAL CNN

EXPERIMENTAL SETUP

Dataset. We evaluate on over three hours of videos from nine episodes of two TV series, i.e. “Friends” and “The Big Bang Theory” (“BBT”).

• “Friends”: S01E03 (Season 01, Episode 03), S04E04, S07E07 and S10E15 for training, and S05E05 for testing.

• “BBT”: as in [3], S01E04, S01E05 and S01E06 for training, and S01E03 for testing.
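For convenience, these splits can be captured in a small configuration. The dictionary below is only a bookkeeping sketch; the name SPLITS is arbitrary and not taken from the released code.

```python
# Train/test episode splits as listed on the poster.
SPLITS = {
    "Friends": {
        "train": ["S01E03", "S04E04", "S07E07", "S10E15"],
        "test":  ["S05E05"],
    },
    "BBT": {
        "train": ["S01E04", "S01E05", "S01E06"],
        "test":  ["S01E03"],
    },
}
```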

Features.

• Face: all face images are resized to 50 × 40 and the raw RGB pixels are sent to the learning framework.

• Audio: a window size of 20 ms and a frame shift of 10 ms are used. We then select the mean and standard deviation of the 25-D MFCCs, and the standard deviation of the 2-∆ MFCCs, resulting in a total of 75 features per audio sample.
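A minimal sketch of this feature extraction is shown below, assuming OpenCV, librosa and a 16 kHz sampling rate; the library choices and sampling rate are assumptions, while the 50 × 40 face size, 20 ms window, 10 ms shift and 75-D audio vector follow the setup above.

```python
import cv2
import librosa
import numpy as np

def face_feature(image_path: str) -> np.ndarray:
    """Resize a face crop to 50x40 and return the raw RGB pixels."""
    bgr = cv2.imread(image_path)
    rgb = cv2.cvtColor(bgr, cv2.COLOR_BGR2RGB)
    # cv2.resize takes (width, height), so (40, 50) yields a 50x40 image.
    return cv2.resize(rgb, (40, 50)).astype(np.float32) / 255.0

def audio_feature(wav_path: str, sr: int = 16000) -> np.ndarray:
    """Mean/std of 25-D MFCCs plus std of their deltas -> 75-D vector."""
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(
        y=y, sr=sr, n_mfcc=25,
        n_fft=int(0.020 * sr),        # 20 ms window
        hop_length=int(0.010 * sr),   # 10 ms frame shift
    )
    # Interpreting "2-∆ MFCCs" as second-order deltas; use order=1 if
    # first-order deltas are intended.
    delta = librosa.feature.delta(mfcc, order=2)
    return np.concatenate([
        mfcc.mean(axis=1),   # 25 values
        mfcc.std(axis=1),    # 25 values
        delta.std(axis=1),   # 25 values -> 75 in total
    ])
```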

RESULTS

Face model.

• Previous work: Eigenface 60.7%, Fisherface 64.5%, LBP 65.6% and OpenBR/4SF 66.1%.
• Ours: face-alone 86.7% and face-audio 88.5%.

Identifying non-matched pairs. Note that, in the above experiments on face models, all the face-audio samples for training and evaluation are matched pairs, i.e. they belong to the same person. Instead of using the final output label of our face models, we explore the effectiveness of the features returned from the last CNN layer of the model. We evaluated three SVM models on three different features:

• fused feature: 82.2%.
• face feature + MFCC: 82.9%.
• fused feature + MFCC: 84.1%.
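As a sketch of how the best-performing variant (fused feature + MFCC) could be trained with scikit-learn; the variable names, the linear kernel and the binary matched/non-matched labels are illustrative assumptions, not details given on the poster.

```python
import numpy as np
from sklearn.svm import SVC

def train_pair_svm(cnn_features: np.ndarray,
                   mfcc_features: np.ndarray,
                   labels: np.ndarray) -> SVC:
    """Classify matched vs. non-matched face-audio pairs (labels in {0, 1})."""
    # "fused feature + MFCC": concatenate the last-layer CNN feature with
    # the 75-D audio statistics for each sample.
    X = np.hstack([cnn_features, mfcc_features])
    clf = SVC(kernel="linear")
    clf.fit(X, labels)
    return clf
```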

Speaker naming. It can be viewed as an extension of the previous experiments.

Voting window size N    1     10    30    50    60    70
Friends                 75.3  79.6  84.4  88.2  90.0  90.5
The Big Bang Theory     73.6  74.7  78.2  82.5  83.0  83.4

Compared with previous works. Previous works [1, 3] achieved speaker naming accuracy of 77.8% and 80.8% respectively (evaluated on “BBT” S01E03) by incorporating face, facial landmarks, cloth features, character tracking and associated subtitles/transcripts (equal to our method when N = 58). In comparison, we can achieve a speaker naming (SN) accuracy of 82.9% without introducing face/person tracking, facial landmark localization or subtitles/transcripts.
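One straightforward reading of the voting window is a sliding majority vote over the last N per-frame predictions. The sketch below assumes exactly that rule, which is an interpretation rather than a detail spelled out on the poster.

```python
from collections import Counter
from typing import List

def vote(frame_predictions: List[str], n: int) -> List[str]:
    """Smooth per-frame speaker labels with a sliding majority vote of size n."""
    smoothed = []
    for i in range(len(frame_predictions)):
        window = frame_predictions[max(0, i - n + 1): i + 1]
        smoothed.append(Counter(window).most_common(1)[0][0])
    return smoothed
```

Larger windows average out per-frame errors, which is consistent with accuracy rising with N in the table above.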

APPLICATIONS

1. Content accessibility enhancement. Using speaker-following subtitles [2] to enhance the overall viewing experience for videos, and automatic speech balloons for games or other VR/AR applications.

2. Video retrieval and summarization. With detailed speaking activities available, many such high-level tasks can be achieved.

DEMO

REFERENCES

[1] M. Bauml, M. Tapaswi, and R. Stiefelhagen. Semi-supervised learning with constraints for person identification in multimedia data. In CVPR, 2013.

[2] Y. Hu, J. Kautz, Y. Yu, and W. Wang. Speaker-following video subtitles. TOMM, 2014.

[3] M. Tapaswi, M. Bauml, and R. Stiefelhagen. “Knock! Knock! Who is it?” Probabilistic person identification in TV-series. In CVPR, 2012.