CS598 VISUAL INFORMATION RETRIEVAL
Lecture VIII: LSH Recap and Case Study on Face Retrieval
LECTURE VIII
Part I: Locality Sensitive Hashing Recap
Slides credit: Y. Gong
THE PROBLEM
Large scale image search: given a query image, we want to search a large database to find similar images, e.g., search the Internet for visually similar images. We need fast turn-around time and accurate search results.
LARGE SCALE IMAGE SEARCH Find similar images in a large database
Kristen Grauman et al
INTERNET LARGE SCALE IMAGE SEARCH
The Internet contains billions of images. The challenge:
- Need a way of measuring similarity between images (distance metric learning)
- Needs to scale to the Internet (how?)
LARGE SCALE IMAGE SEARCH
The representation must fit in memory (disk is too slow):
- Facebook has ~10 billion images (10^10)
- A PC has ~10 GB of memory (~10^11 bits)
- That leaves a budget of 10^11 / 10^10 = ~10 bits per image
Fergus et al
REQUIREMENTS FOR IMAGE SEARCH
Search must be fast, accurate, and scalable.
Fast: kd-trees (a tree data structure to improve search speed); Locality Sensitive Hashing (hash tables to improve search speed); small codes (binary small codes, e.g., 010101101).
Scalable: requires very little memory.
Accurate: learned distance metric.
SUMMARY OF EXISTING ALGORITHMS
- Tree based indexing: spatial partitions (e.g., kd-tree) and recursive hyperplane decomposition provide an efficient means to search low-dimensional vector data exactly.
- Hashing: locality-sensitive hashing offers sub-linear time search by hashing highly similar examples together.
- Binary small code: compact binary codes, with a few hundred bits per image.
TREE BASED STRUCTURE: KD-TREE
The kd-tree is a binary tree in which every node is a k-dimensional point. Kd-trees are known to break down in practice for high dimensional data, and cannot provide better than a worst-case linear query time guarantee.
LOCALITY SENSITIVE HASHING
Hashing methods for fast Nearest Neighbor (NN) search:
- Sub-linear time search by hashing highly similar examples together in a hash table
- Take random projections of the data
- Quantize each projection with a few bits
- Strong theoretical guarantees
More detail later.
BINARY SMALL CODE
Binary? Only use a binary code (0/1), e.g., 0101010010101010101.
Small? A small number of bits to code each image, e.g., 32 bits or 256 bits.
How could this kind of small code improve image search speed? More detail later.
DETAIL OF THESE ALGORITHMS
1. Locality sensitive hashing: basic LSH; LSH for learned metrics
2. Small binary code: basic small code idea; spectral hashing
1. LOCALITY SENSITIVE HASHING
The basic idea behind LSH is to project the data into a low-dimensional binary (Hamming) space; i.e., each data point is mapped to a b-bit vector, called the hash key. Each hash function h must satisfy the locality sensitive hashing property:
Pr[ h(x_i) = h(x_j) ] = sim(x_i, x_j),
where sim(x_i, x_j) ∈ [0, 1] is the similarity function of interest.
M. Datar, N. Immorlica, P. Indyk, and V. Mirrokni. Locality-Sensitive Hashing Scheme Based on p-Stable Distributions. In SOCG, 2004.
LSH FUNCTIONS FOR DOT PRODUCTS
The LSH hashing function that produces the hash code is a hyperplane separating the space: for a random direction r, h_r(x) = 1 if r^T x >= 0, and 0 otherwise.
1. LOCALITY SENSITIVE HASHING
• Take random projections of the data
• Quantize each projection with a few bits
No learning is involved.
[Figure: a feature vector is projected onto random directions; each projection contributes one bit (0 or 1), yielding a code such as 101.]
Slide credit: Fergus et al.
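A minimal sketch of this random-projection hashing in Python (numpy assumed; the function name and the 32-bit width are illustrative, not from the lecture):

```python
import numpy as np

def lsh_codes(X, num_bits=32, seed=0):
    """Map each row of X (n x d) to a num_bits-bit binary code.

    Each bit records on which side of a random hyperplane the point falls.
    """
    rng = np.random.default_rng(seed)
    R = rng.standard_normal((X.shape[1], num_bits))  # random projections
    return (X @ R >= 0).astype(np.uint8)             # quantize to 1 bit each
```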
HOW TO SEARCH FROM HASH TABLE?
[Figure: N data points x_i are hashed by the function h_{r1…rk} into a hash table. A new query Q (e.g., with code 11101) is hashed with the same function, and only the small set of images in its bucket (<< N) is searched for results.]
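A sketch of the table lookup, assuming the full k-bit code indexes the bucket (in practice several tables over different random keys are built to boost recall; all names here are hypothetical):

```python
from collections import defaultdict

def build_table(codes):
    """codes: (N, k) array of 0/1 codes; returns bucket -> list of indices."""
    table = defaultdict(list)
    for i, c in enumerate(codes):
        table[c.tobytes()].append(i)
    return table

def query(table, query_code):
    """Return the candidate set (typically << N) sharing the query's bucket."""
    return table.get(query_code.tobytes(), [])
```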
COULD WE IMPROVE LSH?
Could we utilize a learned metric to improve LSH? How? Assume we have already learned a distance metric A from domain knowledge; the learned distance d_A(x, y) = (x - y)^T A (x - y) has better quality than simple metrics such as the Euclidean distance.
HOW TO LEARN A DISTANCE METRIC?
First assume we have a set of domain knowledge, and use the methods described in the last lecture to learn the distance metric A. As discussed before, A can be decomposed as A = G^T G, so that d_A(x, y) = ||Gx - Gy||^2. Thus G is a linear embedding function that embeds the data into a lower dimensional space.
LSH FUNCTIONS FOR LEARNED METRICS
Given the learned metric, G can be viewed as a linear parametric function, or a linear embedding function, for the data x. Thus the LSH function can be h_r(x) = sign(r^T G x): the key idea is to first embed the data into a lower-dimensional space by G, and then do LSH in that lower dimensional space.
P. Jain, B. Kulis, and K. Grauman. Fast Image Search for Learned Metrics. In CVPR, 2008.
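A sketch of this two-step recipe, assuming A is positive definite so a Cholesky factor yields G with A = G^T G (illustrative names, not the authors' code):

```python
import numpy as np

def metric_lsh_codes(X, A, num_bits=32, seed=0):
    """Embed x -> Gx with A = G^T G, then hash with random hyperplanes."""
    L = np.linalg.cholesky(A)      # A = L L^T; assumes A positive definite
    G = L.T                        # so that A = G^T G
    rng = np.random.default_rng(seed)
    R = rng.standard_normal((G.shape[0], num_bits))
    return ((X @ G.T) @ R >= 0).astype(np.uint8)
```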
SOME RESULTS FOR LSH
Caltech-101 data set. Goal: exemplar-based object categorization; given some exemplars, categorize the whole data set.

RESULTS: OBJECT CATEGORIZATION
Caltech-101 database (ML = metric learning). Kristen Grauman et al.
QUESTION: IS HASHING FAST ENOUGH?
Is sub-linear search time fast enough? For retrieving (1 + ε)-near neighbors, the query time is bounded by O(n^{1/(1+ε)}). Is that fast enough? Is it scalable enough (does it fit in the memory of a PC)?
NO! Small binary codes can do better: cast an image to a compact binary code, with a few hundred bits per image. Small codes make it possible to perform real-time searches over millions of Internet images using a single large PC: within 1 second (0.146 sec for 80 million images), with 80 million data points (~300 GB) reduced to about 120 MB of codes.
BINARY SMALL CODE
First introduced in text search/retrieval: Salakhutdinov and Hinton [3] introduced it for text document retrieval, and it was introduced to computer vision by Torralba et al. [4].
[3] R. Salakhutdinov and G. Hinton. "Semantic Hashing." International Journal of Approximate Reasoning, 2009.
[4] A. Torralba, R. Fergus, and Y. Weiss. "Small Codes and Large Databases for Recognition." In CVPR, 2008.
SEMANTIC HASHING
[Figure: a semantic hash function maps a query image to a binary code, its address in the address space; semantically similar database images live at nearby addresses. Quite different from a (conventional) randomizing hash. Fergus et al.]
R. Salakhutdinov and G. Hinton. Semantic Hashing. International Journal of Approximate Reasoning, 2009.
Similar points are mapped to similar small codes; the codes are then stored in memory and compared by Hamming distance (very fast, carried out by hardware).
OVERALL QUERY SCHEME
[Figure: query image → compute feature vector (~1 ms in Matlab) → generate small binary code (<10 μs) → store codes in memory and use hardware to compute Hamming distances → retrieved images (<1 ms). Fergus et al.]
SEARCH FRAMEWORK
1. Produce binary codes (e.g., 01010011010)
2. Store these binary codes in memory
3. Use hardware to compute the Hamming distances (very fast)
4. Sort the Hamming distances to get the final ranking results
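A sketch of this pipeline's core step, exhaustive Hamming ranking over bit-packed codes; the XOR-plus-popcount inner loop is what the hardware support refers to (numpy assumed; names illustrative):

```python
import numpy as np

def hamming_rank(db_codes, query_code, k=12):
    """db_codes: (N, num_bytes) uint8 packed codes; query_code: (num_bytes,).

    Returns the indices of the k database items nearest in Hamming distance.
    """
    xor = np.bitwise_xor(db_codes, query_code)       # differing bits
    dists = np.unpackbits(xor, axis=1).sum(axis=1)   # popcount per row
    return np.argsort(dists)[:k]
```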
HOW TO LEARN SMALL BINARY CODES
1. Simplest method: use the median as a threshold
2. LSH is already able to produce binary codes
3. Restricted Boltzmann Machines (RBM)
4. Optimal small binary codes by spectral hashing
1. SIMPLE BINARIZATION STRATEGY
Set a threshold per dimension (unsupervised), e.g., use the median: values below the threshold map to 0, values above to 1. Fergus et al.
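A sketch of this thresholding in Python (numpy assumed); the median threshold makes each bit split the data roughly in half:

```python
import numpy as np

def median_binarize(X):
    """Threshold every dimension of X (n x d) at its median over the data."""
    thresholds = np.median(X, axis=0)
    return (X > thresholds).astype(np.uint8)
```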
2. LOCALITY SENSITIVE HASHING
LSH is ready to generate binary codes (unsupervised):
• Take random projections of the data
• Quantize each projection with a few bits
No learning is involved.
[Figure: a feature vector is projected and quantized to bits, e.g., 101. Fergus et al.]
3. RBM [3] TO GENERATE CODES
Not going into detail; see Salakhutdinov and Hinton for details. Use a deep neural network to train small codes (a supervised method).
R. R. Salakhutdinov and G. E. Hinton. Learning a Nonlinear Embedding by Preserving Class Neighbourhood Structure. In AISTATS, 2007.
LABELME RETRIEVAL
LabelMe is a large database of human-annotated images. The goal of this experiment: first generate small codes, use Hamming distance to search for similar images, then sort the results to produce the final ranking. Ground truth is given by distances between Gist descriptors.
EXAMPLES OF LABELME RETRIEVAL
12 closest neighbors under different distance metrics.
TEST SET 2: WEB IMAGES
12.9 million images from the tiny images data set, collected from the Internet. There are no labels, so the Euclidean distance between Gist vectors is used as the ground truth distance. (Note: the Gist descriptor is a feature widely used in computer vision for holistic scene representation.)
EXAMPLES OF WEB RETRIEVAL 12 neighbors using different distance metrics
Fergus et al
WEB IMAGES RETRIEVAL
Observation: codes with more bits give better performance.
RETRIEVAL TIMINGS
Fergus et al
SUMMARY
Image search should be fast, accurate, and scalable. Approaches: tree based methods, Locality Sensitive Hashing, and binary small codes (the state of the art).
QUESTIONS?
LECTURE VIII
Part II: Face Retrieval
Challenges
• Poses, lighting, and facial expressions complicate recognition
• Efficiently matching against a large gallery dataset is nontrivial
• A large number of subjects matters
[Figure: a query face must be matched against many rows of gallery faces.]
Contextual face recognition
• Design a robust face similarity metric: leverage local feature context to deal with visual variations
• Enable large-scale face recognition tasks: design an efficient and scalable matching metric; employ image-level context and active learning
• Utilize social network context to scope recognition: design social network priors to improve recognition
Preprocessing
• Face detection: boosted cascade [Viola-Jones '01]
• Eye detection: neural network
• Face alignment: similarity transform to a canonical frame
• Illumination normalization: self-quotient image [Wang et al. '04]
The normalized face is the input to our algorithm.
Feature extraction
• Filtering: convolution with 4 oriented fourth-derivative-of-Gaussian quadrature pairs
• Patches: 8×8, extracted on a regular grid at each scale
• Gaussian pyramid: dense sampling in scale
• Spatial aggregation: log-polar arrangement of 25 Gaussian-weighted regions (DAISY shape)
One feature descriptor per patch: f_i ∈ R^400, with n ≈ 500 descriptors { f_1 … f_n } per face.
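A sketch of the dense multi-scale sampling step, assuming a grayscale image, three pyramid levels, a sampling stride of 4, and sigma 1.0 for smoothing (the stride and sigma are assumptions; scipy supplies the Gaussian filter, and the oriented filtering and log-polar aggregation are omitted):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def dense_patches(img, levels=3, patch=8, stride=4):
    """Extract 8x8 patches on a regular grid at each Gaussian pyramid level."""
    feats = []
    for _ in range(levels):
        h, w = img.shape
        for y in range(0, h - patch + 1, stride):
            for x in range(0, w - patch + 1, stride):
                feats.append(img[y:y + patch, x:x + patch].ravel())
        img = gaussian_filter(img, sigma=1.0)[::2, ::2]  # next coarser level
    return np.array(feats)
```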
Face representation & matching
Adjoin spatial information: each descriptor is augmented with its image position, g_i = [f_i, x_i, y_i], giving { g_1, g_2 … g_n }. Quantize by a forest of randomized trees T_1 … T_k in Feature Space × Image Space. Each feature g_i contributes to k bins of the combined histogram vector h. Distances use an IDF-weighted L1 norm: w_i = log( #training / #{ training h : h(i) > 0 } ), and d(h, h') = Σ_i w_i |h(i) - h'(i)|.
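A sketch of the weighted histogram distance, assuming the standard inverse-document-frequency form of the weights given above (illustrative names):

```python
import numpy as np

def idf_weights(H):
    """H: (num_training, num_bins) matrix of training histograms."""
    doc_freq = (H > 0).sum(axis=0)          # bins hit by any training face
    return np.log(H.shape[0] / np.maximum(doc_freq, 1))

def weighted_l1(h, h_prime, w):
    """IDF-weighted L1 distance d(h, h') = sum_i w_i |h(i) - h'(i)|."""
    return np.sum(w * np.abs(h - h_prime))
```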
Randomized projection trees
Linear decision at each node: ⟨w, [f x y]'⟩ ≷ τ, with w a random projection, w ~ N(0, Σ).
Why random projections?
• Simple
• Interact well with high-dimensional sparse data (feature descriptors!)
• Generalize trees previously used for vision tasks (kd-trees, Extremely Randomized Forests) [Dasgupta & Freund, Wakin et al., ...]
Additional data-dependence can be introduced through multiple trials: select the (w, τ) pair that minimizes a cost function (e.g., MSE or conditional entropy) [Geurts, Lepetit & Fua, ...].
The threshold τ = median{ ⟨w, [f x y]'⟩ } normalizes the spatial and feature parts; it can also be randomized.
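A sketch of a single tree-node split under the assumption Σ = I for the random direction (illustrative names):

```python
import numpy as np

def split_node(G, seed=0):
    """G: (m, D) augmented features [f, x, y] reaching this node.

    Returns the split (w, tau) and the two child subsets.
    """
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(G.shape[1])  # random projection, w ~ N(0, I)
    proj = G @ w
    tau = np.median(proj)                # balances feature and spatial parts
    left = proj < tau
    return w, tau, G[left], G[~left]
```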
Implicit elastic matching
Even with good descriptors, dense correspondence between two feature sets { g_1 … g_n } and { g'_1 … g'_n } is difficult. Instead, quantize the features such that corresponding features likely map to the same bin. The number of "soft matches" is then given by comparing the histograms h and h' of joint quantizations:
Similarity(I, I') = 1 - ||h - h'||_{w,1}
This enables efficient inverted file indexing!
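A sketch of such an inverted file: every nonzero histogram bin lists the gallery faces with mass in it, so a query only needs to score faces that share at least one bin with it (hypothetical data layout):

```python
from collections import defaultdict

def build_inverted_index(histograms):
    """histograms: {face_id: {bin_id: count}} sparse gallery histograms."""
    index = defaultdict(list)
    for face_id, h in histograms.items():
        for bin_id, count in h.items():
            index[bin_id].append((face_id, count))
    return index

def candidate_faces(index, query_hist):
    """Gallery faces worth scoring: those sharing a bin with the query."""
    return {face_id for b in query_hist for face_id, _ in index.get(b, [])}
```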
[Figure: a query face is matched against the gallery faces via the inverted file, and the top-ranked identity ("Ross") is returned.]
Exploring the optimal settings
A subset of PIE is used for exploration (11554 faces / 68 users); 30 faces per person are used for inducing the trees. Three settings are explored: the histogram distance metric, the tree depth, and the number of trees.
Distance metric      Reco. rate
L2 un-weighted       86.3%
L2 IDF-weighted      86.7%
L1 un-weighted       89.3%
L1 IDF-weighted      89.4%

Forest size          1       5       10      15
Reco. rate           89.4%   92.4%   93.1%   93.6%
Recognition accuracy (1)
Datasets: ORL (40 subjects, unconstrained); Ext. Yale B (38 subjects, extreme illumination); PIE (68 subjects, pose and illumination); Multi-PIE (250 subjects, pose, illumination, expression, time).

Method           ORL     Ext. Yale B   PIE     Multi-PIE
Baseline (PCA)   88.1%   65.4%         62.1%   32.1%
LDA              93.9%   81.3%         89.1%   37.0%
LPP              93.7%   86.4%         89.2%   21.9%
This work        96.5%   91.4%         94.3%   67.6%

Gallery faces: ORL, 5 faces/subject; YaleB, 20 faces/subject; PIE, 30 faces/subject; Multi-PIE, faces in the 1st session.
Recognition accuracy (2)
Cross-dataset generalization: the first dataset is used for inducing the forest, which is then tested on the second dataset (within-dataset results in parentheses).

Method           PIE→ORL (ORL→ORL)   ORL→PIE (PIE→PIE)   PIE→Multi-PIE (Multi-PIE→Multi-PIE)
Baseline (PCA)   85.0% (88.1%)       55.7% (62.1%)       26.5% (32.6%)
LDA              58.5% (93.9%)       72.8% (89.1%)       8.5% (37.0%)
LPP              17.0% (93.7%)       69.1% (89.2%)       17.1% (21.9%)
This work        92.5% (96.5%)       89.7% (94.3%)       67.2% (67.6%)
Scalability challenges
• Given N labeled faces, how do we choose at most M < N faces to form the gallery such that recognition accuracy is maximized?
• How do we handle the problem of recognizing a large number of subjects in a social network?
An active learning framework
A discriminative model (GP + MRF):
- Gaussian process prior: P(Y | X) = N(0, K)
- Unary compatibility between latent score and label: ψ(t_i, y_i) ∝ exp( -(t_i - y_i)^2 / (2σ^2) )
- Pairwise MRF terms: ψ(t_i, t_j) for compatible pairs and 1 - ψ(t_i, t_j) for incompatible pairs
Active learning: query the unlabeled sample whose label is most uncertain,
x* = argmax_{i ∈ U} H(t_i) = argmax_{i ∈ U} ( -Σ_{c ∈ C} q_i(c) log q_i(c) )
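A sketch of the maximum-entropy selection rule, assuming the model's class posteriors q_i(c) are available as a matrix (illustrative names):

```python
import numpy as np

def select_query(Q, unlabeled):
    """Q: (n, num_classes) posterior probabilities; unlabeled: index iterable.

    Returns the unlabeled sample with maximum posterior entropy H(t_i).
    """
    entropy = -(Q * np.log(Q + 1e-12)).sum(axis=1)
    return max(unlabeled, key=lambda i: entropy[i])
```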
Experiments
• An 80-minute "Friends" video
• 6 characters
• 1282 tracks, 16720 faces
• 500 samples are used for testing
Social network scope and priors
• Scope the recognition by the social network
• Build the prior probability of whom Rachel would like to tag

Effects of social priors
[Figure: recognition performance compared across three settings: perfect recognition, recognition with priors, and recognition without priors.]