Posted on 17-Mar-2018
State-of-the-art in video understanding
• Classification: 4,800 categories, 15.2% Top-5 error (Abu-El-Haija et al. 2016)
• Detection: tens of categories, ~10-20 mAP at 0.5 overlap (Idrees et al. 2017; Sigurdsson et al. 2016)
• Captioning: just getting started — short clips, niche domains (Yu et al. 2016)
Comparing video with image understanding

                 Videos                                             Images
Classification   4,800 categories, 15.2% Top-5 error                1,000 categories*, 3.1% Top-5 error
Detection        Tens of categories, ~10-20 mAP at 0.5 overlap      Hundreds of categories*, ~60 mAP at 0.5 overlap; pixel-level segmentation
Captioning       Just getting started: short clips, niche domains   Dense captioning; coherent paragraphs
Beyond           —                                                  Significant work on question-answering

*Transfer learning widespread
Krizhevsky 2012; Xie 2016; He 2017; Johnson 2016; Krause 2017; Yang 2016
The challenge of scale
• Training labels: video annotation is labor-intensive
• Models: the temporal dimension adds complexity
• Inference: video processing is computationally expensive
Task: Temporal action detection
Input: video frames from t = 0 to t = T
Output: temporal intervals of each action (e.g., Running, Talking)

Yeung, Russakovsky, Mori, Fei-Fei. End-to-end Learning of Action Detection from Frame Glimpses in Videos. CVPR 2016.
Efficient video processing
[Diagram: video frames from t = 0 to t = T → output]
Our model for efficient action detection
• A frame model processes one glimpsed frame at a time: a convolutional neural network extracts frame information, and a recurrent neural network carries time information across glimpses.
• At each glimpse, the model outputs an optional detection instance [start, end] and the next frame to glimpse.
• Repeating this from t = 0 to t = T, the model observes only a sparse subset of the video's frames.
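The glimpse loop above can be sketched as a toy numpy simulation. This is a hypothetical simplification for illustration only: the weights, the emit/where heads, and the skip rule are made up here, not taken from the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H, T = 32, 64, 200                         # feature dim, hidden dim, video length
W_cnn = rng.standard_normal((D, D)) * 0.1     # stand-in "CNN" weights
W_xh = rng.standard_normal((D, H)) * 0.1      # input-to-hidden weights
W_hh = rng.standard_normal((H, H)) * 0.1      # hidden-to-hidden weights
w_where = rng.standard_normal(H) * 0.1        # head: where to glimpse next
w_emit = rng.standard_normal(H) * 0.1         # head: whether to emit a detection
W_bounds = rng.standard_normal((H, 2)) * 0.1  # head: [start, end] of a detection

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

video = rng.standard_normal((T, D))           # toy "video": one feature vector per frame

h = np.zeros(H)
t, glimpsed, detections = 0, 0, []
while t < T:
    x = np.tanh(video[t] @ W_cnn)             # CNN: frame information
    h = np.tanh(x @ W_xh + h @ W_hh)          # RNN: time information across glimpses
    glimpsed += 1
    if h @ w_emit > 0:                        # optionally emit a detection instance
        start, end = np.sort(T * sigmoid(h @ W_bounds))
        detections.append((float(start), float(end)))
    t += 1 + int(19 * sigmoid(h @ w_where))   # next frame to glimpse (skip ahead)

print(f"observed {glimpsed}/{T} frames; {len(detections)} candidate detections")
```

Because each step can skip many frames ahead, the loop touches far fewer than T frames, which is the source of the efficiency gain.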
Our model for efficient action detection
• Train differentiable outputs (detection class and bounds) using standard backpropagation
• Train non-differentiable outputs (where to look next, when to emit a prediction) using reinforcement learning (the REINFORCE algorithm)
• Achieves detection performance on par with dense sliding-window approaches while observing only 2% of frames
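The REINFORCE idea behind the non-differentiable decisions can be shown on a toy two-action problem. This is a generic policy-gradient sketch, not the paper's training code; all names and numbers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(2)   # logits over two actions
lr = 0.1

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

for step in range(2000):
    p = softmax(theta)
    a = rng.choice(2, p=p)            # sample an action from the current policy
    reward = 1.0 if a == 1 else 0.0   # action 1 is the "good" decision here
    # REINFORCE update: reward-weighted gradient of log pi(a | theta),
    # which for a softmax policy is (one-hot of a) minus the probabilities
    grad_logp = np.eye(2)[a] - p
    theta += lr * reward * grad_logp

print(softmax(theta))   # probability mass concentrates on the rewarded action
```

The same reward-weighted log-probability gradient lets a model learn discrete choices (which frame to glimpse, when to emit) that backpropagation cannot reach directly.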
Learned policy in action
Yeung, Russakovsky, Jin, Andriluka, Mori, Fei-Fei. Every Moment Counts: Dense Detailed Labeling of Actions in Complex Videos. IJCV 2017.
Dense action labeling: MultiTHUMOS
• Extends the THUMOS'14 action detection dataset with dense, multilevel, frame-level action annotations for 30 hours across 400 videos

                           THUMOS    MultiTHUMOS
Annotations                 6,365         38,690
Classes                        20             65
Density (labels / frame)      0.3            1.5
Classes per video             1.1           10.5
Max actions per frame           2              9
Max actions per video           3             25
Modeling dense, multilabel actions
• Need to reason about multiple potential actions simultaneously
• High degree of temporal dependency
• In standard recurrent models for action recognition, all state lives in the hidden-layer representation
• At each time step, the model predicts the current frame's labels from the current frame and the previous hidden representation
MultiLSTM
• An extension of the LSTM that expands the temporal receptive field of the input and output connections
• Key idea: giving the model more freedom in both reading input and writing output reduces the burden placed on the hidden-layer representation
MultiLSTM vs. standard LSTM
[Diagram: input video frames → frame class predictions over time t]
• Standard LSTM: single input, single output per time step (Donahue 2014)
• MultiLSTM: multiple inputs, multiple outputs per time step
MultiLSTM
• Multiple inputs (soft attention)
• Multiple outputs (weighted average)
• Multilabel loss
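The three ingredients above can be sketched in a few lines of numpy. This is a hypothetical simplification, not the paper's model: the cell is a plain tanh recurrence rather than an LSTM, the output averaging uses uniform weights, and all dimensions and weights are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D, H, C, W = 12, 16, 32, 5, 3   # frames, feature dim, hidden dim, classes, window

frames = rng.standard_normal((T, D))
W_att = rng.standard_normal(D) * 0.1       # attention scoring vector
W_xh = rng.standard_normal((D, H)) * 0.1
W_hh = rng.standard_normal((H, H)) * 0.1
W_out = rng.standard_normal((H, C)) * 0.1

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

h = np.zeros(H)
preds = np.zeros((T, C))
for t in range(T):
    window = frames[max(0, t - W + 1): t + 1]   # multiple inputs: last W frames
    scores = window @ W_att
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                        # soft attention weights over the window
    x = alpha @ window                          # attended input fed to the recurrence
    h = np.tanh(x @ W_xh + h @ W_hh)            # simplified recurrent cell
    preds[t] = sigmoid(h @ W_out)               # independent per-class scores

# Multiple outputs: average each frame's predictions over a window of steps
# (uniform weights here as a simplification of the learned weighted average)
smoothed = np.stack([preds[max(0, t - W + 1): t + 1].mean(0) for t in range(T)])

# Multilabel loss: binary cross-entropy over classes and frames
labels = rng.integers(0, 2, size=(T, C))
bce = -(labels * np.log(preds) + (1 - labels) * np.log(1 - preds)).mean()
print(preds.shape, float(bce))
```

Sigmoid outputs (rather than a softmax) are what make the loss multilabel: each class fires independently, so several actions can be active in the same frame.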
MultiLSTM qualitative results
• Retrieving sequential actions
• Retrieving co-occurring actions
Yeung, Ramanathan, Russakovsky, Shen, Mori, Fei-Fei. Learning to learn from noisy web videos. CVPR 2017.
Labeling videos is expensive
• It takes significantly longer to label a video than an image
• If spatial or temporal bounds are desired, the cost is even worse
• How can we practically learn about new concepts in video?
Can we effectively learn from noisy web queries?
• Our approach: learn how to select positive training examples from noisy queries in order to train classifiers for new classes
• Use a reinforcement learning formulation to learn a data-labeling policy that achieves strong performance on a small, manually labeled dataset of classes
• Then use this policy to automatically label noisy web data for new classes
Balancing diversity vs. semantic drift
• Want diverse training examples to improve the classifier
• But too much diversity can lead to semantic drift
• Our approach: balance diversity and drift by training labeling policies with an annotated reward set that the policy must successfully classify
Overview of approach
• Candidate web queries from YouTube autocomplete (e.g. "Boomerang", "Boomerang on a beach", "Boomerang music video", …)
• An agent chooses a query and labels its videos as new positives (e.g. + "Boomerang on a beach"), adding them to the current positive set
• The classifier is updated from the current positive set and a fixed negative set, and the agent's state is updated
• Training reward: evaluation on the reward set
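The select-label-retrain loop above can be sketched with toy data. Everything here is a hypothetical stand-in: the "videos" are 2-D points, the "classifier" is a nearest-centroid rule, and the agent is replaced by a greedy choice scored on the reward set (the paper learns this policy with reinforcement learning instead).

```python
import numpy as np

rng = np.random.default_rng(0)

def make_videos(mean, n=30):
    # toy "videos": 2-D feature vectors drawn around a query-specific mean
    return rng.normal(mean, 1.0, size=(n, 2))

queries = {                                    # candidate web queries -> retrieved videos
    "boomerang": make_videos([2.0, 2.0]),
    "boomerang on a beach": make_videos([2.2, 1.8]),
    "boomerang music video": make_videos([-2.0, -2.0]),  # off-topic results
}
negatives = make_videos([-2.0, -2.0], n=60)              # fixed negative set
reward_set = np.vstack([make_videos([2.0, 2.0], 20), make_videos([-2.0, -2.0], 20)])
reward_labels = np.array([1] * 20 + [0] * 20)            # annotated reward set

def train_and_eval(positives):
    # nearest-centroid stand-in for "update classifier", scored on the reward set
    mu_pos, mu_neg = positives.mean(0), negatives.mean(0)
    preds = (np.linalg.norm(reward_set - mu_pos, axis=1)
             < np.linalg.norm(reward_set - mu_neg, axis=1)).astype(int)
    return (preds == reward_labels).mean()

positive_set = np.empty((0, 2))
selected = []
for step in range(2):
    # greedy stand-in for the agent: pick the query whose videos help most
    best = max(queries, key=lambda q: train_and_eval(np.vstack([positive_set, queries[q]])))
    positive_set = np.vstack([positive_set, queries.pop(best)])
    selected.append(best)
    print(step, best, round(float(train_and_eval(positive_set)), 2))
```

The reward set is what keeps diversity from turning into drift: a query whose videos hurt reward-set accuracy (like the off-topic one here) scores poorly and tends not to be selected.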
Results: greedy classifier vs. ours
• Sports1M classes
• Novel classes
Future directions
• Learning to learn
• Unsupervised learning
Towards Knowledge
Videos → knowledge of the dynamic visual world
Collaborators
Olga Russakovsky Mykhaylo Andriluka Ning Jin Vignesh Ramanathan Liyue Shen
Greg Mori Fei-Fei Li