Tamara Berg Machine Learning


Transcript of Tamara Berg Machine Learning

Tamara Berg: Machine Learning

790-133: Recognizing People, Objects, & Actions

Slide 1 of 113

Announcements

• Topic presentation groups posted. Anyone not have a group yet?

• Last day of background material

• For Monday - Object recognition papers will be posted online. Please read!

Slide 2 of 113

What is machine learning?

• Computer programs that can learn from data

• Two key components
– Representation: how should we represent the data?
– Generalization: the system should generalize from its past experience (observed data items) to perform well on unseen data items.

Slide 3 of 113

Types of ML algorithms

• Unsupervised– Algorithms operate on unlabeled examples

• Supervised– Algorithms operate on labeled examples

• Semi/Partially-supervised– Algorithms combine both labeled and unlabeled examples

Slide 4 of 113

Unsupervised Learning

Slide 5 of 113

Slide 6 of 113

K-means clustering

• Want to minimize the sum of squared Euclidean distances between points xi and their nearest cluster centers mk:

D(X, M) = Σ_k Σ_{xi in cluster k} ||xi − mk||²

Algorithm:
• Randomly initialize K cluster centers
• Iterate until convergence:
   • Assign each data point to the nearest center
   • Recompute each cluster center as the mean of all points assigned to it

source: Svetlana Lazebnik. Slide 7 of 113
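Since the algorithm above is just an alternation of a nearest-center assignment step and a mean update, a minimal NumPy sketch (not from the slides; the function name and defaults are illustrative) looks like this:

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Minimal K-means: alternate nearest-center assignment and mean update."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), K, replace=False)]  # random initialization
    for _ in range(n_iters):
        # Assign each data point to the nearest center (squared Euclidean distance)
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each cluster center as the mean of the points assigned to it
        new_centers = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                                else centers[k] for k in range(K)])
        if np.allclose(new_centers, centers):
            break  # converged
        centers = new_centers
    return centers, labels
```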

Slides 8-18 of 113 (figures only)

Different clustering strategies

• Agglomerative clustering
   • Start with each point in a separate cluster
   • At each iteration, merge two of the “closest” clusters

• Divisive clustering
   • Start with all points grouped into a single cluster
   • At each iteration, split the “largest” cluster

• K-means clustering
   • Iterate: assign points to clusters, compute means

• K-medoids
   • Same as k-means, only the cluster center cannot be computed by averaging
   • The “medoid” of each cluster is the most centrally located point in that cluster (i.e., the point with the lowest average distance to the other points)

source: Svetlana Lazebnik. Slide 19 of 113

Supervised Learning

Slide 20 of 113

Slides from Dan Klein (figures only). Slides 21-24 of 113

Example: Image classification

[Figure: input images and their desired output labels: apple, pear, tomato, cow, dog, horse]

Slide credit: Svetlana Lazebnik. Slide 25 of 113

Slide from Dan Klein: http://yann.lecun.com/exdb/mnist/index.html. Slide 26 of 113

Example: Seismic data

[Figure: body wave magnitude vs. surface wave magnitude, with nuclear explosions and earthquakes plotted as two classes]

Slide credit: Svetlana Lazebnik. Slide 27 of 113

Slide from Dan Klein. Slide 28 of 113

The basic classification framework

y = f(x)   (y: output, f: classification function, x: input)

• Learning: given a training set of labeled examples {(x1,y1), …, (xN,yN)}, estimate the parameters of the prediction function f

• Inference: apply f to a never-before-seen test example x and output the predicted value y = f(x)

Slide credit: Svetlana Lazebnik. Slide 29 of 113

Some ML classification methods (10^6 examples)

• Nearest neighbor
   Shakhnarovich, Viola, Darrell 2003; Berg, Berg, Malik 2005; …

• Neural networks
   LeCun, Bottou, Bengio, Haffner 1998; Rowley, Baluja, Kanade 1998; …

• Support Vector Machines and Kernels
   Guyon, Vapnik; Heisele, Serre, Poggio 2001; …

• Conditional Random Fields
   McCallum, Freitag, Pereira 2000; Kumar, Hebert 2003; …

Slide credit: Antonio Torralba. Slide 30 of 113

Example: Training and testing

• Key challenge: generalization to unseen examples

Training set (labels known) Test set (labels unknown)

Slide credit: Svetlana Lazebnik. Slide 31 of 113

Slide credit: Dan Klein. Slide 32 of 113

Slide from Min-Yen Kan

Classification by Nearest Neighbor

Word vector document classification – here the vector space is illustrated as having 2 dimensions. How many dimensions would the data actually live in?

Slide 33 of 113

Slide from Min-Yen Kan

Classification by Nearest Neighbor

Slide 34 of 113

Classification by Nearest Neighbor

Classify the test document as the class of the document “nearest” to the query document (use vector similarity to find most similar doc)

Slide from Min-Yen Kan

Slide 35 of 113

Classification by kNN

Classify the test document as the majority class of the k documents “nearest” to the query document.

Slide from Min-Yen Kan

Slide 36 of 113
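To make the kNN rule concrete, here is a small sketch that classifies a query document by the majority label of its k nearest training documents under cosine similarity between word-count vectors. The toy corpus, labels, and k = 3 are made up for illustration:

```python
from collections import Counter
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Toy labeled corpus (hypothetical example data)
train_docs = ["the striker scored a goal", "the keeper saved the penalty",
              "stocks fell sharply today", "the market rallied on earnings"]
train_labels = ["sports", "sports", "finance", "finance"]

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_docs).toarray().astype(float)

def knn_classify(query, k=3):
    """Classify a document as the majority class of its k nearest training docs."""
    q = vectorizer.transform([query]).toarray().astype(float)[0]
    # Cosine similarity between the query vector and each training vector
    sims = X_train @ q / (np.linalg.norm(X_train, axis=1) * (np.linalg.norm(q) + 1e-12) + 1e-12)
    nearest = np.argsort(-sims)[:k]
    return Counter(train_labels[i] for i in nearest).most_common(1)[0][0]

print(knn_classify("the goal was scored late"))  # expected: "sports"
```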

Classification by kNN

What are the features? What’s the training data? Testing data? Parameters?

Slide from Min-Yen Kan. Slide 37 of 113

Slides from Min-Yen Kan (figures only). Slides 38-42 of 113

Classification by kNN

What are the features? What’s the training data? Testing data? Parameters?

Slide from Min-Yen Kan. Slide 43 of 113

NN for vision

• Fast Pose Estimation with Parameter Sensitive Hashing. Shakhnarovich, Viola, Darrell

• J. Hays and A. Efros, Scene Completion using Millions of Photographs, SIGGRAPH 2007

• J. Hays and A. Efros, IM2GPS: estimating geographic information from a single image, CVPR 2008

Slides 44-46 of 113

Decision tree classifier

Example problem: decide whether to wait for a table at a restaurant, based on the following attributes:
1. Alternate: is there an alternative restaurant nearby?
2. Bar: is there a comfortable bar area to wait in?
3. Fri/Sat: is today Friday or Saturday?
4. Hungry: are we hungry?
5. Patrons: number of people in the restaurant (None, Some, Full)
6. Price: price range ($, $$, $$$)
7. Raining: is it raining outside?
8. Reservation: have we made a reservation?
9. Type: kind of restaurant (French, Italian, Thai, Burger)
10. WaitEstimate: estimated waiting time (0-10, 10-30, 30-60, >60)

Slide credit: Svetlana Lazebnik. Slide 47 of 113
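As a hedged illustration of fitting such a classifier, the sketch below trains scikit-learn's DecisionTreeClassifier on a few made-up rows inspired by the restaurant attributes above (this is not the dataset from the slides, and the column names and values are hypothetical):

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# A few made-up rows in the spirit of the restaurant example (not the actual dataset)
data = pd.DataFrame({
    "Patrons":      ["Full", "Some", "None", "Full", "Some"],
    "Hungry":       [True,   True,   False,  False,  True],
    "FriSat":       [True,   False,  False,  True,   True],
    "WaitEstimate": ["30-60", "0-10", "0-10", ">60",  "10-30"],
    "WillWait":     [True,    True,   False,  False,  True],
})

X = pd.get_dummies(data.drop(columns="WillWait"))  # one-hot encode categorical attributes
y = data["WillWait"]

tree = DecisionTreeClassifier(criterion="entropy", max_depth=3).fit(X, y)

# Predict for a new situation (aligned to the same one-hot columns)
query = pd.get_dummies(pd.DataFrame({
    "Patrons": ["Some"], "Hungry": [True], "FriSat": [False], "WaitEstimate": ["0-10"],
})).reindex(columns=X.columns, fill_value=0)
print(tree.predict(query))
```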

Decision tree classifier (example trees shown as figures)

Slide credit: Svetlana Lazebnik. Slides 48-49 of 113

Linear classifier

• Find a linear function to separate the classes:

f(x) = sgn(w1x1 + w2x2 + … + wDxD) = sgn(w · x)

Slide credit: Svetlana Lazebnik. Slide 50 of 113

Discriminant Function

• It can be an arbitrary function of x, such as:
   • Nearest Neighbor
   • Decision Tree
   • Linear Functions: g(x) = wT x + b

Slide credit: Jinwei Gu. Slide 51 of 113

Linear Discriminant Function

• g(x) is a linear function:

g(x) = wT x + b

• A hyper-plane in the feature space

[Figure: the hyperplane wT x + b = 0 in the (x1, x2) feature space, with wT x + b > 0 on one side and wT x + b < 0 on the other; data points denote classes +1 and -1]

Slide credit: Jinwei Gu. Slide 52 of 113
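To make the sign rule above concrete, here is a tiny NumPy sketch with a made-up weight vector w and bias b, classifying points by which side of the hyperplane wT x + b = 0 they fall on:

```python
import numpy as np

w = np.array([2.0, -1.0])   # hypothetical weight vector
b = -0.5                    # hypothetical bias

def classify(x):
    """Return +1 if x lies on the positive side of the hyperplane, else -1."""
    return 1 if w @ x + b > 0 else -1

points = np.array([[1.0, 0.0], [0.0, 2.0], [0.3, 0.2]])
print([classify(p) for p in points])   # sides of the hyperplane: [1, -1, -1]
```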

Linear Discriminant Function

• How would you classify these points (classes +1 and -1 in the (x1, x2) plane) using a linear discriminant function in order to minimize the error rate?

• Infinite number of answers!

• Which one is the best?

Slide credit: Jinwei Gu. Slides 53-56 of 113

Large Margin Linear Classifier

• The linear discriminant function (classifier) with the maximum margin is the best

• Margin is defined as the width that the boundary could be increased by before hitting a data point (a “safe zone” around the boundary)

• Why is it the best? Strong generalization ability

• This classifier is the Linear SVM

Slide credit: Jinwei Gu. Slide 57 of 113

Large Margin Linear Classifier

[Figure: decision boundary wT x + b = 0 in the (x1, x2) plane, with margin boundaries wT x + b = 1 and wT x + b = -1; the points x+, x+, x- lying on the margin boundaries are the Support Vectors]

Slide credit: Jinwei Gu. Slide 58 of 113

Large Margin Linear Classifier

• Formulation:

minimize (1/2) ||w||²

such that

   for yi = +1:  wT xi + b ≥ 1
   for yi = -1:  wT xi + b ≤ -1

Slide credit: Jinwei Gu. Slide 61 of 113

Large Margin Linear Classifier

• Formulation (combining both constraints):

minimize (1/2) ||w||²

such that

   yi (wT xi + b) ≥ 1

Slide credit: Jinwei Gu. Slide 62 of 113

Solving the Optimization Problem

minimize (1/2) ||w||²

s.t.  yi (wT xi + b) ≥ 1

• Quadratic programming with linear constraints

Slide credit: Jinwei Gu. Slide 63 of 113
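As a rough illustration that this really is a small quadratic program, the sketch below solves the hard-margin primal for a made-up separable toy set using a generic constrained optimizer (SciPy's SLSQP) rather than a dedicated QP or SVM solver:

```python
import numpy as np
from scipy.optimize import minimize

# Tiny linearly separable toy set (made up for illustration)
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])

def objective(wb):
    w = wb[:2]
    return 0.5 * w @ w                     # (1/2) ||w||^2

# Constraints: yi (w . xi + b) - 1 >= 0 for every training point
constraints = [{"type": "ineq",
                "fun": (lambda wb, i=i: y[i] * (wb[:2] @ X[i] + wb[2]) - 1.0)}
               for i in range(len(X))]

res = minimize(objective, x0=np.zeros(3), constraints=constraints)
w, b = res.x[:2], res.x[2]
print(w, b, np.sign(X @ w + b))            # learned w, b and predicted signs
```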

Solving the Optimization Problem

• The linear discriminant function is:

g(x) = wT x + b = Σi∈SV αi yi xiT x + b

• Notice it relies on a dot product between the test point x and the support vectors xi

Slide credit: Jinwei Gu. Slide 66 of 113

Linear separability (figures)

Slide credit: Svetlana Lazebnik. Slide 67 of 113

Non-linear SVMs: Feature Space

• General idea: the original input space can be mapped to some higher-dimensional feature space where the training set is separable:

Φ: x → φ(x)

Slide courtesy of www.iro.umontreal.ca/~pift6080/documents/papers/svm_tutorial.ppt. Slide 68 of 113

Nonlinear SVMs: The Kernel Trick

• With this mapping, our discriminant function becomes:

g(x) = wT φ(x) + b = Σi∈SV αi φ(xi)T φ(x) + b

• No need to know this mapping explicitly, because we only use the dot product of feature vectors in both training and testing.

• A kernel function is defined as a function that corresponds to a dot product of two feature vectors in some expanded feature space:

K(xi, xj) = φ(xi)T φ(xj)

Slide credit: Jinwei Gu. Slide 69 of 113

Nonlinear SVMs: The Kernel Trick

• Examples of commonly-used kernel functions:

   • Linear kernel: K(xi, xj) = xiT xj

   • Polynomial kernel: K(xi, xj) = (1 + xiT xj)^p

   • Gaussian (Radial-Basis Function, RBF) kernel: K(xi, xj) = exp(-||xi - xj||² / (2σ²))

   • Sigmoid: K(xi, xj) = tanh(β0 xiT xj + β1)

Slide credit: Jinwei Gu. Slide 71 of 113
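A minimal NumPy sketch of these four kernels; the parameter names p, sigma, beta0, beta1 mirror the formulas above, and the example vectors and parameter values are arbitrary:

```python
import numpy as np

def linear_kernel(xi, xj):
    return xi @ xj

def polynomial_kernel(xi, xj, p=3):
    return (1.0 + xi @ xj) ** p

def rbf_kernel(xi, xj, sigma=1.0):
    return np.exp(-np.sum((xi - xj) ** 2) / (2.0 * sigma ** 2))

def sigmoid_kernel(xi, xj, beta0=1.0, beta1=0.0):
    return np.tanh(beta0 * (xi @ xj) + beta1)

xi, xj = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(linear_kernel(xi, xj), polynomial_kernel(xi, xj),
      rbf_kernel(xi, xj), sigmoid_kernel(xi, xj))
```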

Support Vector Machine: Algorithm

1. Choose a kernel function

2. Choose a value for C and any other parameters (e.g. σ)

3. Solve the quadratic programming problem (many software packages available)

4. Classify held-out validation instances using the learned model

5. Select the best learned model based on validation accuracy

6. Classify test instances using the final selected model

Slide 72 of 113
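A sketch of this recipe using scikit-learn's SVC, which solves the QP internally; the synthetic data, the train/validation/test split, and the small grid over C and gamma are made up for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Made-up synthetic data standing in for real features/labels
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.25, random_state=0)

best_model, best_acc = None, -1.0
for C in [0.1, 1.0, 10.0]:               # step 2: choose C (and gamma, related to sigma)
    for gamma in [0.01, 0.1, 1.0]:
        model = SVC(kernel="rbf", C=C, gamma=gamma).fit(X_train, y_train)  # steps 1 & 3
        acc = model.score(X_val, y_val)                                    # step 4
        if acc > best_acc:
            best_model, best_acc = model, acc                              # step 5

print("test accuracy:", best_model.score(X_test, y_test))                  # step 6
```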

Some Issues

• Choice of kernel
   - Gaussian or polynomial kernel is the default
   - If ineffective, more elaborate kernels are needed
   - Domain experts can give assistance in formulating appropriate similarity measures

• Choice of kernel parameters
   - e.g. σ in the Gaussian kernel
   - In the absence of reliable criteria, applications rely on the use of a validation set or cross-validation to set such parameters.

This slide is courtesy of www.iro.umontreal.ca/~pift6080/documents/papers/svm_tutorial.ppt. Slide 73 of 113

Summary: Support Vector Machine

• 1. Large Margin Classifier
   – Better generalization ability & less over-fitting

• 2. The Kernel Trick
   – Map data points to a higher-dimensional space in order to make them linearly separable.
   – Since only the dot product is used, we do not need to represent the mapping explicitly.

Slide credit: Jinwei Gu. Slide 74 of 113

Boosting

• A simple algorithm for learning robust classifiers
   – Freund & Schapire, 1995
   – Friedman, Hastie, Tibshirani, 1998

• Provides an efficient algorithm for sparse visual feature selection
   – Tieu & Viola, 2000
   – Viola & Jones, 2003

• Easy to implement, doesn’t require external optimization tools.

Slide credit: Antonio Torralba. Slide 75 of 113

Boosting

• Defines a classifier using an additive model:

F(x) = α1 f1(x) + α2 f2(x) + α3 f3(x) + …

(F: strong classifier; ft: weak classifiers; αt: weights; x: features vector)

Slide credit: Antonio Torralba. Slide 76 of 113

Boosting

• Defines a classifier using an additive model:

F(x) = α1 f1(x) + α2 f2(x) + α3 f3(x) + …

(F: strong classifier; ft: weak classifiers; αt: weights; x: features vector)

• We need to define a family of weak classifiers from which each ft is chosen.

Slide credit: Antonio Torralba. Slide 77 of 113

Adaboost

Slide credit: Antonio Torralba. Slide 78 of 113

Boosting

• It is a sequential procedure. Each data point xt has a class label yt ∈ {+1, -1} and a weight wt = 1.

Slide credit: Antonio Torralba. Slide 79 of 113

Toy example

• Weak learners are chosen from the family of lines.

• Each data point has a class label yt ∈ {+1, -1} and a weight wt = 1.

• h => p(error) = 0.5: it is at chance.

Slide credit: Antonio Torralba. Slide 80 of 113

Toy example

• This one seems to be the best.

• Each data point has a class label yt ∈ {+1, -1} and a weight wt = 1.

• This is a ‘weak classifier’: it performs slightly better than chance.

Slide credit: Antonio Torralba. Slide 81 of 113

Toy example (repeated over successive rounds)

• Each data point has a class label yt ∈ {+1, -1} and a weight.

• We update the weights: wt ← wt exp{-yt Ht}

Slide credit: Antonio Torralba. Slides 82-85 of 113

Toy example

• The strong (non-linear) classifier is built as the combination of all the weak (linear) classifiers f1, f2, f3, f4.

Slide credit: Antonio Torralba. Slide 86 of 113

Adaboost

Slide credit: Antonio Torralba. Slide 87 of 113
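Putting the toy example together, here is a compact AdaBoost sketch with decision-stump weak learners (a common choice, though not specified on the slides); the synthetic data and the number of rounds are illustrative:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
y = 2 * y - 1                      # relabel to {-1, +1}

n_rounds = 20
w = np.ones(len(X)) / len(X)       # each data point starts with equal weight
stumps, alphas = [], []

for _ in range(n_rounds):
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)  # weak learner
    pred = stump.predict(X)
    err = np.sum(w * (pred != y)) / np.sum(w)          # weighted error
    alpha = 0.5 * np.log((1 - err) / (err + 1e-12))    # weight of this weak classifier
    w *= np.exp(-alpha * y * pred)                     # re-weight misclassified points upward
    w /= w.sum()
    stumps.append(stump)
    alphas.append(alpha)

# Strong classifier: sign of the weighted sum of the weak classifiers
F = np.sign(sum(a * s.predict(X) for a, s in zip(alphas, stumps)))
print("training accuracy:", (F == y).mean())
```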

Semi-Supervised Learning

Slide 88 of 113

• However, for many problems, labeled data can be rare or expensive (need to pay someone to do it, requires special testing, …), while unlabeled data is much cheaper.

• Examples: speech, images, medical outcomes, customer modeling, protein sequences, web pages. [From Jerry Zhu]

• Can we make use of cheap unlabeled data?

Slide Credit: Avrim Blum. Slides 90-93 of 113

Semi-Supervised Learning

• Can we use unlabeled data to augment a small labeled sample to improve learning?

• But unlabeled data is missing the most important info!

• But maybe it still has useful regularities that we can use.

Slide Credit: Avrim Blum. Slide 94 of 113

Method 1: EM

Slide 95 of 113

How to use unlabeled data

• One way is to use the EM algorithm
   – EM: Expectation Maximization

• The EM algorithm is a popular iterative algorithm for maximum likelihood estimation in problems with missing data.

• The EM algorithm consists of two steps:
   – Expectation step, i.e., filling in the missing data
   – Maximization step, i.e., calculating a new maximum a posteriori estimate for the parameters.

Slide 96 of 113

Algorithm Outline

1. Train a classifier with only the labeled documents.
2. Use it to probabilistically classify the unlabeled documents.
3. Use ALL the documents to train a new classifier.
4. Iterate steps 2 and 3 to convergence.

Slide 97 of 113
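A minimal sketch of this outline as a self-training loop, a simplified stand-in for the full EM procedure: Naive Bayes, the tiny made-up corpus, and the use of posterior confidences as sample weights are all illustrative choices, not the method from the slides:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny made-up corpus: a few labeled docs plus unlabeled ones
labeled_docs   = ["great goal and win", "stocks and market rally"]
labels         = np.array([0, 1])                       # 0 = sports, 1 = finance
unlabeled_docs = ["the team scored a late goal", "shares fell as the market dipped",
                  "the keeper made a great save"]

vec = CountVectorizer()
X_all_docs = vec.fit_transform(labeled_docs + unlabeled_docs)   # shared vocabulary
X_lab, X_unl = X_all_docs[:len(labeled_docs)], X_all_docs[len(labeled_docs):]

# Step 1: train on labeled documents only
clf = MultinomialNB().fit(X_lab, labels)

for _ in range(10):                                      # step 4: iterate to convergence
    proba = clf.predict_proba(X_unl)                     # step 2: probabilistic labels
    hard, conf = proba.argmax(axis=1), proba.max(axis=1)
    # Step 3: retrain on ALL documents, weighting unlabeled ones by confidence
    X_all = np.vstack([X_lab.toarray(), X_unl.toarray()])
    y_all = np.concatenate([labels, hard])
    w_all = np.concatenate([np.ones(len(labels)), conf])
    clf = MultinomialNB().fit(X_all, y_all, sample_weight=w_all)

print(clf.predict(vec.transform(["a stunning goal in the match"])))  # expected: 0 (sports)
```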

Method 2: Co-Training

Slide 98 of 113

Co-training [Blum & Mitchell ’98]

• Many problems have two different sources of info (“features/views”) you can use to determine the label.

• E.g., classifying faculty webpages: can use words on the page or words on links pointing to the page.

[Figure: a link labeled “My Advisor” pointing to Prof. Avrim Blum’s page; x1 = link info, x2 = text info, x = link info & text info]

Slide Credit: Avrim Blum. Slide 99 of 113

Co-training

• Idea: Use the small labeled sample to learn initial rules.
   – E.g., “my advisor” pointing to a page is a good indicator it is a faculty home page.
   – E.g., “I am teaching” on a page is a good indicator it is a faculty home page.

Slide Credit: Avrim Blum. Slide 100 of 113

Co-training

• Idea: Use the small labeled sample to learn initial rules.
   – E.g., “my advisor” pointing to a page is a good indicator it is a faculty home page.
   – E.g., “I am teaching” on a page is a good indicator it is a faculty home page.

• Then look for unlabeled examples ⟨x1, x2⟩ where one view is confident and the other is not. Have it label the example for the other.

• Train 2 classifiers, one on each type of info, using each to help train the other.

Slide Credit: Avrim Blum. Slide 101 of 113

Co-training Algorithm [Blum and Mitchell, 1998]

Given: labeled data L, unlabeled data U

Loop:
   Train h1 (e.g., hyperlink classifier) using L
   Train h2 (e.g., page classifier) using L
   Allow h1 to label p positive, n negative examples from U
   Allow h2 to label p positive, n negative examples from U
   Add these most confident self-labeled examples to L

Slide 102 of 113
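A schematic sketch of this loop, with two synthetic feature blocks standing in for the hyperlink and page views and logistic regression standing in for h1 and h2; the values of p, n, and the data are made up:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data with two "views": first 5 features = view 1, last 5 = view 2
X, y = make_classification(n_samples=400, n_features=10, n_informative=8, random_state=0)
X1, X2 = X[:, :5], X[:, 5:]

rng = np.random.default_rng(0)
labeled = rng.choice(len(X), size=20, replace=False)          # small labeled set L
unlabeled = np.setdiff1d(np.arange(len(X)), labeled)          # unlabeled pool U
L_idx, y_L = list(labeled), list(y[labeled])
p, n = 2, 2                                                   # examples each view labels per round

for _ in range(10):
    h1 = LogisticRegression(max_iter=1000).fit(X1[L_idx], y_L)   # view-1 classifier
    h2 = LogisticRegression(max_iter=1000).fit(X2[L_idx], y_L)   # view-2 classifier
    for h, Xv in [(h1, X1), (h2, X2)]:
        proba = h.predict_proba(Xv[unlabeled])[:, 1]
        # Most confident positive and negative unlabeled examples under this view
        picks = np.concatenate([np.argsort(-proba)[:p], np.argsort(proba)[:n]])
        chosen = unlabeled[picks]
        L_idx.extend(chosen)
        y_L.extend(h.predict(Xv[chosen]))                        # self-labeled by this view
        unlabeled = np.setdiff1d(unlabeled, chosen)

print("accuracy of h1 on all data:", (h1.predict(X1) == y).mean())
```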

Watch, Listen & Learn: Co-training on Captioned Images and Videos

Sonal Gupta, Joohyun Kim, Kristen Grauman, Raymond Mooney. The University of Texas at Austin, U.S.A.

Slide 103 of 113

Goals

• Classify images and videos with the help of visual information and associated text captions

• Use unlabeled image and video examples

Slide 104 of 113

Image Examples

[Figures with captions: “Cultivating farming at Nabataean Ruins of the Ancient Avdat”, “Bedouin Leads His Donkey That Carries Load Of Straw”, “Ibex Eating In The Nature”, “Entrance To Mikveh Israel Agricultural School”; class labels: Desert, Trees]

Slide 105 of 113

Approach

• Combining two views of images and videos using the Co-training (Blum and Mitchell ’98) learning algorithm

• Views: Text and Visual

• Text View
   – Caption of image or video
   – Readily available

• Visual View
   – Color, texture, temporal information in image/video

Slide 106 of 113

Co-training (illustrated over slides 107-111)

• Initially labeled instances (each with a Text View and a Visual View) are used to train a Text Classifier and a Visual Classifier via supervised learning.

• Unlabeled instances are then classified by both classifiers.

• The most confidently classified instances are labeled and added to the labeled set.

• Both classifiers are retrained, and the process repeats.

Slides 107-111 of 113

Video Features [Laptev, IJCV ’05]

• Detect interest points: Harris-Förstner corner detector over both spatial and temporal dimensions

• Describe interest points: Histogram of Oriented Gradients (HoG)

• Create a spatio-temporal vocabulary: quantize interest points into a dictionary of 200 visual words

• Represent each video as a histogram of visual words

Slide 112 of 113
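A sketch of the vocabulary-building and histogram step; the random arrays stand in for HoG descriptors, and the 72-dimensional descriptor size is only an assumption (the 200-word vocabulary matches the slide):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Stand-in HoG descriptors pooled from many training videos (dimensionality is illustrative)
all_descriptors = rng.normal(size=(5000, 72))

# Build a 200-word spatio-temporal vocabulary by quantizing the descriptors
vocab = KMeans(n_clusters=200, n_init=4, random_state=0).fit(all_descriptors)

def video_histogram(descriptors):
    """Represent one video as a normalized histogram of visual-word counts."""
    words = vocab.predict(descriptors)
    hist = np.bincount(words, minlength=200).astype(float)
    return hist / hist.sum()

one_video = rng.normal(size=(300, 72))   # descriptors from a single video
print(video_histogram(one_video).shape)  # (200,)
```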

Textual Features

Example raw text commentary:
• That was a very nice forward camel.
• Well I remember her performance last time.
• He has some delicate hand movement.
• She gave a small jump while gliding.
• He runs in to chip the ball with his right foot.
• He runs in to take the instep drive and executes it well.
• The small kid pushes the ball ahead with his tiny kicks.

Pipeline: raw text commentary → Porter stemmer → remove stop words → standard bag-of-words representation

Slide 113 of 113
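A sketch of that pipeline; NLTK's PorterStemmer and English stop-word list are plausible stand-ins for the tools named on the slide (NLTK and its stopwords corpus must be installed):

```python
import re
from collections import Counter
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords   # requires: nltk.download("stopwords")

stemmer = PorterStemmer()
stop = set(stopwords.words("english"))

def bag_of_words(commentary):
    """Raw commentary -> Porter-stemmed, stop-word-filtered bag-of-words counts."""
    tokens = re.findall(r"[a-z]+", commentary.lower())
    stems = [stemmer.stem(t) for t in tokens if t not in stop]
    return Counter(stems)

print(bag_of_words("He runs in to chip the ball with his right foot."))
# e.g. Counter({'run': 1, 'chip': 1, 'ball': 1, 'right': 1, 'foot': 1})
```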

Conclusion

• Combining textual and visual features can help improve accuracy

• Co-training can be useful to combine textual and visual features to classify images and videos

• Co-training helps in reducing the labeling of images and videos

[More information on http://www.cs.utexas.edu/users/ml/co-training]

Slide 114 of 113

Co-training vs. EM

• Co-training splits features, EM does not.

• Co-training incrementally uses the unlabeled data.

• EM probabilistically labels all the data at each round; EM iteratively uses the unlabeled data.

Slide 115 of 113