Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)

Modeling Images, Videos and Text

Using the Caffe Deep Learning Library

(Part 1)

Kate Saenko

Microsoft Summer Machine Learning School, St Petersburg 2015

about me

BOSTON, Massachusetts

PART I

INTRODUCTION

THE VISUAL DESCRIPTION PROBLEM

MODELING IMAGES

MODELING LANGUAGE

INTRO TO NEURAL NETWORKS

VIDEO-TO-TEXT NEURAL NETWORK

PART II

INTRO TO CAFFE

CAFFE IMAGE AND LANGUAGE MODELS

PART I

INTRODUCTION

MODELING IMAGES

MODELING LANGUAGE

PART II

INTRO TO CAFFE

Machine Learning: What is it?

• Program a computer to learn from experience

• Learn from “big data”

Machine Learning: It is used in more ways than you think!

Computer Vision: Teach Machine to “See” Like a Human

Terminator 2

Hollywood version…

terminator 2, enemy of the state (from UCSD “Fact or Fiction” DVD)

Computer Vision: Teach Machine to “See” Like a Human

Computer Vision in Real Life:

Face Tagging in Social Media

Computer Vision in Real Life: Surveillance and Security

• Stanford/Google one of the first to develop self-driving cars

• Cars “see” using many sensors: radar, laser, cameras

Smart Cars

Scientific Images

Image guided surgery

Grimson et al., MIT 3D imaging

MRI, CT

slide by S. Seitz

Medical Imaging

http://www.robocup.org/

NASA’s Mars Spirit Rover

http://en.wikipedia.org/wiki/Spirit_rover

slide by S. Seitz

Robot Vision

many other applications! 3D Shape Analysis

3D Face Reconstruction

http://grail.cs.washington.edu/projects/totalmoving/

Handwriting Recognition

3D Panoramas

How Do We Do It?

Computer Vision: Machine Learning from Big Data

Artificial Neural Network

Support Vector Machine

Machine Learning from Big Data: Achievements

PART I

INTRODUCTION

MODELING IMAGES

MODELING LANGUAGE

PART II

INTRO TO CAFFE

http://arxiv.org/abs/1411.4389

Image Description

Video Description

Output: A woman shredding chicken in a kitchen

video:

Social media analysis

A person dancing in a studio Machine sharpening a pencil

Ballerina dancing on stage Man playing guitar Woman chopping onion

Train passing by Mt. Fuji

Petabytes of video, very

little text

Social media: retrieval

Show me all video clips of a person playing guitar and singing

Social media: summarization

A car is driving

down a road A man is riding a

bike through the

A skateboarder

jumps and falls

Surveillance and Security

Smart camera alerts:

A woman wearing a

red coat walked past

A woman carrying a

large bag entered a

building

Question answering

How many times did Darth Vader use a light saber?

Assistive technology

Descriptive Video Service (DVS)

Object/action detection not enough

• Does not model interaction

between entities and scene

• Does not model what is

important to say

• Natural language is much

richer

Challenges

Object detection • YouTube dataset has 900+ objects

• Most test objects do NOT appear in training

PART I

INTRODUCTION

MODELING IMAGES

MODELING LANGUAGE

PART II

INTRO TO CAFFE

Dealing with uncertainty in

“in-the-wild” YouTube video

Guadarrama, Krishnamoorthy, Malkarnenkar, Venugopalan, Mooney, Darrell,

and Saenko. 2013. Youtube2text: Recognizing and describing arbitrary

activities using semantic hierarchies and zero-shot recognition. In IEEE

International Conference on Computer Vision (ICCV).

A is a .

SUBJECT VERB OBJECT person ride motorbike

<SUBJECT> <VERB> <OBJECT> A person is riding a motorbike .

A template model

OBJECT DETECTIONS

cow 0.11 person 0.42

table 0.07

aeroplane 0.05 dog 0.15

motorbike 0.51 train 0.17

car 0.29

SORTED OBJECT DETECTIONS

motorbike 0.51

person 0.42

car 0.29

aeroplane 0.05

… …

VERB DETECTIONS

hold 0.23 drink 0.11

move 0.34

dance 0.05 slice 0.13

climb 0.17 shoot 0.07

ride 0.19

SORTED VERB DETECTIONS

move 0.34

hold 0.23

ride 0.19

dance 0.05

… …

move 0.34

hold 0.23

ride 0.19

dance 0.05

… …

motorbike 0.51

person 0.42

car 0.29

aeroplane 0.05

… …

Vision pipeline

Video:

Output sentence: “Woman sharpens baby”

Problem: detection mistakes

sharpen 0.34

cut 0.23

… …

woman 0.51

baby 0.42

… …

Idea: trade off accuracy and specificity

Learn hierarchies from S, V, O co-occurrence

person animal

most specific prediction

“Woman sharpens baby”

work play

baby …

knife … … …

sharpen clamp woman man … …

entity

person tool

… …

“Man clamps knife”, “Person clamps knife, “Man working” Humans:

Subject Verb Object

person animal

work play

baby …

knife … … …

sharpen clamp woman man … …

entity

person tool

… …

“Man clamps knife”, “Person clamps knife, “Man working” Humans:

Subject Verb Object

our prediction

“Person working with a tool”

collected for paraphrase and machine translation

2,089 YouTube videos with 122K multi-lingual descriptions.

Available at: http://www.cs.utexas.edu/users/ml/clamp/videoDescription/

Microsoft YouTube Dataset (Chen & Dolan, ACL 2011)

Microsoft YouTube Dataset (Chen & Dolan, ACL 2011) (a) Hollywood (8 actions) (b) TRECVID MED (6 actions)

(c) YouTube (218 actions)

A train is rolling by.A train passes by Mount Fuji.A bullet train zooms through the countryside.A train is coming down the tracks.

A man is sitting and playing a guitarA man is playing guitarStreet artists play guitar.A man is playing a guitar.a lady is playing the guitar.

A woman is cooking onions.Someone is cooking in a pan.someone preparing somethinga person coking.racipe for katsu curry

A girl is ballet dancing.A girl is dancing on a stage.A girl is performing as a ballerina.A woman dances.

We cluster words to obtain about 200 verbs and 300 nouns.

Video Collection Task

• Asked Amazon Mechanical Turk workers to submit video clips from YouTube

• Single, unambiguous action/event

• Short (4-10 seconds)

• Generally accessible

• No dialogue

• No words (subtitles, overlaid text, titles)

Generalization results on MSFT Youtube

Challenges

Need to model language • Syntax, semantics, common sense

• can a squirrel drive a car? an onion play guitar?

PART I

INTRODUCTION

MODELING IMAGES

MODELING LANGUAGE

PART II

INTRO TO CAFFE

Modeling common

sense J. Thomason, S. Venugopalan, S. Guadarrama, K. Saenko, and R. Mooney.

Integrating language and vision to gener- ate natural language descriptions of

videos in the wild. In Proceedings of the 25th International Conference on

Computational Linguistics (COLING), August 2014.

Video:

• Output sentence: “Woman sharpens baby”

• Common sense: babies cannot be

sharpened

• Idea: learn common SVO statistics from

very large text-only corpora

Problem: no common sense

Adding a linguistic prior using a

Factor Graph Model (FGM)

Visual confidence values are observed (gray potentials) and inform sentence components.

Language potentials (dashed) connect latent words between sentence components.

Mining Text Corpora Corpora Size of text parsed

British National Corpus (BNC) 1.5GB

GigaWord 26GB

ukWaC 5.5GB

WaCkypedia_EN 2.6GB

GoogleNgrams 1012 words

Stanford dependency parses from first 4 corpora used to build

SVO language model.

Full language model used for surface realization trained on

GoogleNgrams using BerkeleyLM

Evaluation

● Subject, Verb, Object accuracy o Compare to SVO extracted from ground truth

sentences

o “A woman shredding chicken in a kitchen”

o Most Common SVO, or Any Valid SVO

Results of SVO prediction

HVC: highest vision confidence

GFM: factor graph with language prior

Binary accuracy of “Most Common” SVO

Challenges

Need to model language • Syntax, semantics, common sense

• can a squirrel drive a car? an onion play guitar?

Sequence-to-sequence • Both input AND output are sequences

• So far we have assumed video and sentence are both fixed length

• What are good features? Can we learn them?

PART I

INTRODUCTION

MODELING IMAGES

MODELING LANGUAGE

PART II

INTRO TO CAFFE

Neurons in the Brain

Neuron in the brain

“Input wire”

“Output wire”

Multiply by

weights

Sum Threshold

Output

Artificial Neuron

Artificial Neuron: Activation

Multiply by

weights

Sum Threshold

Artificial Neuron: Activation

Multiply by

weights

Sum Threshold

Neurons learn

patterns!

Artificial Neuron: Pattern Classification

lues d

• Classify input into

class 0 or 1

• Teach neuron to

predict correct

class label

Example

Artificial Neuron: Learning

Multiply by

weights

Sum Threshold

activation class

Adjust

weights

Multiply by

weights

Sum Threshold

activation class

Adjust

weights

Multiply by

weights

Sum Threshold

activation class

Adjust

weights

Output

Weights

Simplify

Output

Artificial Neural Network Hidden Layer Input Layer Output Layer

Deep Network: many

hidden layers!

Single Neuron Neural Network

Artificial Neural Network Hidden Layer Input Layer Output Layer

𝑥 =

𝑥1…𝑥5

Θ(1) =𝜃11 ⋯ 𝜃15⋮ ⋱ ⋮𝜃31 ⋯ 𝜃35

ℎΘ(x) = 𝑔(Θ(2)𝑎)

a = 𝑔(Θ(1)𝑥)

Θ(2) =𝜃11 ⋯ 𝜃13⋮ ⋱ ⋮𝜃31 ⋯ 𝜃33

weights

hidden layer activations

𝑔 𝑧 =1

1 + exp(−𝑧)

output

𝑥 =

𝑥1…𝑥5

Θ(1) =𝜃11 ⋯ 𝜃15⋮ ⋱ ⋮𝜃31 ⋯ 𝜃35

ℎΘ(x) = 𝑔(Θ(2)𝑎)

a = 𝑔(Θ(1)𝑥)

Θ(2) =𝜃11 ⋯ 𝜃13⋮ ⋱ ⋮𝜃31 ⋯ 𝜃33

weights

hidden layer activations

𝑔 𝑧 =1

1 + exp(−𝑧)

output

ℎΘ(x) = estimated probability that class=1

for input x

𝑔 𝑧

ℎΘ x = 0.2

ℎΘ x = 0.8

predict class=0

predict class=1

Layer 3 Layer 1 Layer 2 Layer 4

Network architectures

Recurrent

Convolutional

t time

Representing Images

Input Layer

Reshape into

a vector

1 0 -1

Convolve with Threshold

Convolutional Neural Network

Input Layer

Output Layer

slide by Abi-Roozgard

w13 w12 w11

w23 w22 w21

w33 w32 w31

1 0 -1

Convolve with Threshold

w13 w12 w11

w23 w22 w21

w33 w32 w31

slide by Abi-Roozgard

Input Layer

Output Layer

Why Deep Learning? The Unreasonable Effectiveness of Deep Features

Rich visual structure of features deep in hierarchy.

[R-CNN]

[Zeiler-Fergus]

Maximal activations of pool5 units

conv5 DeConv visualization

PART I

INTRODUCTION

MODELING IMAGES

MODELING LANGUAGE

PART II

INTRO TO CAFFE

Deep Convolutional-

Recurrent Network Translating Videos to Natural Language Using Deep Recurrent Neural

Networks. Subhashini Venugopalan, Huijun Xu, Jeff Donahue, Marcus

Rohrbach, Raymond Mooney, Kate Saenko. NAACL 2015

Long-term Recurrent Convolutional Networks for Visual Recognition and

Description. Jeff Donahue, Lisa Hendricks, Sergio Guadarrama, Marcus

Rohrbach, Subhashini Venugopalan, Kate Saenko, Trevor Darrell. CVPR 2015.

Deep Convolutional Neural Networks

• Learn robust high-level image representations

• achieve state-of-the-art results on image tasks

• But, do not handle sequences

• Idea: combine with Recurrent Neural Network

Recurrent

Convolutional

Contributions

● End-to-end deep video description o deep image and language model

● Leverage still image caption data

Recurrent neural networks

• Distributed hidden state stores

past information efficiently

• Non-linear dynamics

• with enough neurons and time,

can compute any function

based on slide by Geoff Hinton

1. Extract deep features from each frame

2. Create a fixed length vector representation of the video

input frames

Sentence

3. Decode the vector to a sentence

● Easier to train than RNN

● Impressive results for

○ speech recognition

○ handwriting

recognition

○ translation

Background - LSTM Unit

● Our model o 2 layers of LSTM units (hidden state

of first is input to second)

o Output- Softmax - probability

distribution over vocabulary of words

Input Video Convolutional Net Recurrent Net Output

playing

Evaluation

● Subject, Verb, Object accuracy o SVO extracted from the generated sentence

o “A woman shredding chicken in a kitchen”

o Most Common or Any Valid

● BLEU

● METEOR

● Human Evaluation

Results - SVO (Binary, Most Common)

Results - Generation

Model BLEU METEOR

FGM 13.68 23.9

LSTMFlickr 10.29 19.52

LSTMCOCO 12.66 20.96

LSTM-YT 31.19 26.87

LSTM-YTFlickr 32.03 27.87

LSTM-YTCOCO 33.29 29.07

LSTM-YTCOCO+Flickr 33.29 28.88 More fluent, but not enough training data in Youtube dataset

to train good language model

Idea: pre-learn on still image captions

Dataset Train Validation Test

Flickr30k ~28000 1000 1000

COCO2014 82783 40504 -

YouTube 1200 100 670

Pre-Train on Still Images (Flickr30k and COCO2014)

scaling

Input Image Convolutional Net Recurrent Net Output

Input Video Convolutional Net Recurrent Net Output

playing

Fine-Tune on Videos

Results - SVO (Binary, Most Common)

Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)

Science

Transcript of Modeling Images, Videos and Text Using the Caffe Deep Learning Library, part 1 (by Kate Saenko)

Caffe & Bevande

Dolci Caffe’ Liquori - Tamburino

PEARL: Perceptual Adaptive Representation Learning in the ...dimacs.rutgers.edu/Workshops/FieldPIMeeting/Slides/... · • Eric Tzeng, Judy Hoffman, Trevor Darrell, Kate Saenko, Simultaneous

Caffe Latte

Adaptive Deep Learning for Visual Understanding › wp-content › uploads › 2018 › 06 › Ada… · Eric Tzeng, Judy Hoffman, Trevor Darrell, Kate Saenko. “Simultaneous Deep

Visual Sense Disambiguation: A Multimodal Approach PhD thesis by Kate Saenko Computer Science and AI Lab Massachusetts Institute of Technology Advisor:

Catalog Caffe Stolova HULSTA

Delonghi Caffe Venenzia Manual

Caffe Europa No. 35

CAFFE LADRO - Nexternal · increasingly important component of Caffe Ladro’s business model. According to Adrienne Kerrigan, Wholesale Marketing Coordinator at Caffe Ladro, improving

arXiv:1609.04356v1 [cs.CV] 14 Sep 2016cs-people.bu.edu/xpeng/pdfs/combining_accv2016.pdf · arXiv:1609.04356v1 [cs.CV] 14 Sep 2016. 2 Xingchao Peng and Kate Saenko Fig.1.Two-Stream

CAFFE LAVAZZA - eataly.com · DRIP COFFEE 3 ESPRESSO 3.50 CAPPUCCINO 4.50 CAFFE LATTE 4.50 CAFFE AMERICANO 4 CAFFE CORRETTO 8 LA VIA DEL TE TEA 4 Ask about today’s selection. CAFFE

Carravaggio caffe

caffe ck · PDF fileTitle: caffe ck Created Date: 1/9/2014 4:20:04 PM

Каталог Machiavelli Caffe

Nonnas Caffe & Colazione

Learning Deep Object Detectors From 3D Models · 2015-10-24 · Learning Deep Object Detectors from 3D Models Xingchao Peng, Baochen Sun, Karim Ali, Kate Saenko University of Massachusetts

CUDA & CAFFE

Caffe Vita Website Redesign

Dictionary-based Visual Sense Models Kate Saenko and Trevor Darrell.