CS598 VISUAL INFORMATION RETRIEVAL
Lecture VIII: LSH Recap and Case Study on Face Retrieval
LECTURE VIII
Part I: Locality Sensitive Hashing Recap
Slides credit: Y. Gong
THE PROBLEM
Large scale image search: given a query image, we want to search a large database to find similar images, e.g., search the Internet for visually similar images. We need fast turn-around time and accurate search results.
LARGE SCALE IMAGE SEARCH Find similar images in a large database
Kristen Grauman et al
INTERNET LARGE SCALE IMAGE SEARCH
The Internet contains billions of images. The challenge:
- Need a way of measuring similarity between images (distance metric learning)
- Needs to scale to the Internet (how?)
LARGE SCALE IMAGE SEARCH
The representation must fit in memory (disk is too slow):
- Facebook has ~10 billion images (10^10)
- A PC has ~10 GB of memory (~10^11 bits)
- That leaves a budget of 10^11 / 10^10 = ~10 bits per image
Fergus et al
REQUIREMENTS FOR IMAGE SEARCH
Search must be fast, accurate, and scalable.
Fast: kd-trees (a tree data structure to improve search speed); Locality Sensitive Hashing (hash tables to improve search speed); small codes (binary small codes, e.g., 010101101).
Scalable: requires very little memory.
Accurate: learned distance metric.
SUMMARY OF EXISTING ALGORITHMS
- Tree based indexing: spatial partitions (e.g., kd-tree) and recursive hyperplane decomposition provide an efficient means to search low-dimensional vector data exactly.
- Hashing: locality-sensitive hashing offers sub-linear time search by hashing highly similar examples together.
- Binary small code: compact binary codes, with a few hundred bits per image.
TREE BASED STRUCTURE: KD-TREE
The kd-tree is a binary tree in which every node is a k-dimensional point. Kd-trees are known to break down in practice for high dimensional data, and cannot provide better than a worst-case linear query time guarantee.
LOCALITY SENSITIVE HASHING
Hashing methods for fast Nearest Neighbor (NN) search:
- Sub-linear time search by hashing highly similar examples together in a hash table
- Take random projections of the data
- Quantize each projection with a few bits
- Strong theoretical guarantees
More detail later.
BINARY SMALL CODE
Binary? Only use a binary code (0/1), e.g., 0101010010101010101.
Small? A small number of bits to code each image, e.g., 32 bits or 256 bits.
How could this kind of small code improve image search speed? More detail later.
DETAIL OF THESE ALGORITHMS
1. Locality sensitive hashing: basic LSH; LSH for learned metrics
2. Small binary code: basic small code idea; spectral hashing
1. LOCALITY SENSITIVE HASHING
The basic idea behind LSH is to project the data into a low-dimensional binary (Hamming) space; i.e., each data point is mapped to a b-bit vector, called the hash key. Each hash function h must satisfy the locality sensitive hashing property:
Pr[ h(x_i) = h(x_j) ] = sim(x_i, x_j),
where sim(x_i, x_j) ∈ [0, 1] is the similarity function of interest.
M. Datar, N. Immorlica, P. Indyk, and V. Mirrokni. Locality-Sensitive Hashing Scheme Based on p-Stable Distributions. In SOCG, 2004.
LSH FUNCTIONS FOR DOT PRODUCTS
The LSH hashing function that produces the hash code is a hyperplane separating the space: for a random direction r, h_r(x) = 1 if r^T x >= 0, and 0 otherwise.
1. LOCALITY SENSITIVE HASHING
• Take random projections of the data
• Quantize each projection with a few bits
No learning is involved.
[Figure: a feature vector is projected onto random directions; each projection contributes one bit (0 or 1), yielding a code such as 101.]
Slide credit: Fergus et al.
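A minimal sketch of this random-projection hashing in Python (numpy assumed; the function name and the 32-bit width are illustrative, not from the lecture):

```python
import numpy as np

def lsh_codes(X, num_bits=32, seed=0):
    """Map each row of X (n x d) to a num_bits-bit binary code.

    Each bit records on which side of a random hyperplane the point falls.
    """
    rng = np.random.default_rng(seed)
    R = rng.standard_normal((X.shape[1], num_bits))  # random projections
    return (X @ R >= 0).astype(np.uint8)             # quantize to 1 bit each
```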
HOW TO SEARCH FROM HASH TABLE?
[Figure: N data points x_i are hashed by the function h_{r1…rk} into a hash table. A new query Q (e.g., with code 11101) is hashed with the same function, and only the small set of images in its bucket (<< N) is searched for results.]
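A sketch of the table lookup, assuming the full k-bit code indexes the bucket (in practice several tables over different random keys are built to boost recall; all names here are hypothetical):

```python
from collections import defaultdict

def build_table(codes):
    """codes: (N, k) array of 0/1 codes; returns bucket -> list of indices."""
    table = defaultdict(list)
    for i, c in enumerate(codes):
        table[c.tobytes()].append(i)
    return table

def query(table, query_code):
    """Return the candidate set (typically << N) sharing the query's bucket."""
    return table.get(query_code.tobytes(), [])
```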
COULD WE IMPROVE LSH?
Could we utilize a learned metric to improve LSH? How? Assume we have already learned a distance metric A from domain knowledge; the learned distance d_A(x, y) = (x - y)^T A (x - y) has better quality than simple metrics such as the Euclidean distance.
HOW TO LEARN A DISTANCE METRIC?
First assume we have a set of domain knowledge, and use the methods described in the last lecture to learn the distance metric A. As discussed before, A can be decomposed as A = G^T G, so that d_A(x, y) = ||Gx - Gy||^2. Thus G is a linear embedding function that embeds the data into a lower dimensional space.
LSH FUNCTIONS FOR LEARNED METRICS
Given the learned metric, G can be viewed as a linear parametric function, or a linear embedding function, for the data x. Thus the LSH function can be h_r(x) = sign(r^T G x): the key idea is to first embed the data into a lower-dimensional space by G, and then do LSH in that lower dimensional space.
P. Jain, B. Kulis, and K. Grauman. Fast Image Search for Learned Metrics. In CVPR, 2008.
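A sketch of this two-step recipe, assuming A is positive definite so a Cholesky factor yields G with A = G^T G (illustrative names, not the authors' code):

```python
import numpy as np

def metric_lsh_codes(X, A, num_bits=32, seed=0):
    """Embed x -> Gx with A = G^T G, then hash with random hyperplanes."""
    L = np.linalg.cholesky(A)      # A = L L^T; assumes A positive definite
    G = L.T                        # so that A = G^T G
    rng = np.random.default_rng(seed)
    R = rng.standard_normal((G.shape[0], num_bits))
    return ((X @ G.T) @ R >= 0).astype(np.uint8)
```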
SOME RESULTS FOR LSH
Caltech-101 data set. Goal: exemplar-based object categorization; given some exemplars, categorize the whole data set.

RESULTS: OBJECT CATEGORIZATION
Caltech-101 database (ML = metric learning). Kristen Grauman et al.
QUESTION: IS HASHING FAST ENOUGH?
Is sub-linear search time fast enough? For retrieving (1 + ε)-near neighbors, the query time is bounded by O(n^{1/(1+ε)}). Is that fast enough? Is it scalable enough (does it fit in the memory of a PC)?
NO! Small binary codes can do better: cast an image to a compact binary code, with a few hundred bits per image. Small codes make it possible to perform real-time searches over millions of Internet images using a single large PC: within 1 second (0.146 sec for 80 million images), with 80 million data points (~300 GB) reduced to about 120 MB of codes.
BINARY SMALL CODE
First introduced in text search/retrieval: Salakhutdinov and Hinton [3] introduced it for text document retrieval, and it was introduced to computer vision by Torralba et al. [4].
[3] R. Salakhutdinov and G. Hinton. "Semantic Hashing." International Journal of Approximate Reasoning, 2009.
[4] A. Torralba, R. Fergus, and Y. Weiss. "Small Codes and Large Databases for Recognition." In CVPR, 2008.
SEMANTIC HASHING
[Figure: a semantic hash function maps a query image to a binary code, its address in the address space; semantically similar database images live at nearby addresses. Quite different from a (conventional) randomizing hash. Fergus et al.]
R. Salakhutdinov and G. Hinton. Semantic Hashing. International Journal of Approximate Reasoning, 2009.
Similar points are mapped to similar small codes; the codes are then stored in memory and compared by Hamming distance (very fast, carried out by hardware).
OVERALL QUERY SCHEME
[Figure: query image → compute feature vector (~1 ms in Matlab) → generate small binary code (<10 μs) → store codes in memory and use hardware to compute Hamming distances → retrieved images (<1 ms). Fergus et al.]
SEARCH FRAMEWORK
1. Produce binary codes (e.g., 01010011010)
2. Store these binary codes in memory
3. Use hardware to compute the Hamming distances (very fast)
4. Sort the Hamming distances to get the final ranking results
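A sketch of this pipeline's core step, exhaustive Hamming ranking over bit-packed codes; the XOR-plus-popcount inner loop is what the hardware support refers to (numpy assumed; names illustrative):

```python
import numpy as np

def hamming_rank(db_codes, query_code, k=12):
    """db_codes: (N, num_bytes) uint8 packed codes; query_code: (num_bytes,).

    Returns the indices of the k database items nearest in Hamming distance.
    """
    xor = np.bitwise_xor(db_codes, query_code)       # differing bits
    dists = np.unpackbits(xor, axis=1).sum(axis=1)   # popcount per row
    return np.argsort(dists)[:k]
```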
HOW TO LEARN SMALL BINARY CODES
1. Simplest method: use the median as a threshold
2. LSH is already able to produce binary codes
3. Restricted Boltzmann Machines (RBM)
4. Optimal small binary codes by spectral hashing
1. SIMPLE BINARIZATION STRATEGY
Set a threshold per dimension (unsupervised), e.g., use the median: values below the threshold map to 0, values above to 1. Fergus et al.
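A sketch of this thresholding in Python (numpy assumed); the median threshold makes each bit split the data roughly in half:

```python
import numpy as np

def median_binarize(X):
    """Threshold every dimension of X (n x d) at its median over the data."""
    thresholds = np.median(X, axis=0)
    return (X > thresholds).astype(np.uint8)
```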
2. LOCALITY SENSITIVE HASHING
LSH is ready to generate binary codes (unsupervised):
• Take random projections of the data
• Quantize each projection with a few bits
No learning is involved.
[Figure: a feature vector is projected and quantized to bits, e.g., 101. Fergus et al.]
3. RBM [3] TO GENERATE CODES
Not going into detail; see Salakhutdinov and Hinton for details. Use a deep neural network to train small codes (a supervised method).
R. R. Salakhutdinov and G. E. Hinton. Learning a Nonlinear Embedding by Preserving Class Neighbourhood Structure. In AISTATS, 2007.
LABELME RETRIEVAL
LabelMe is a large database of human-annotated images. The goal of this experiment: first generate small codes, use Hamming distance to search for similar images, then sort the results to produce the final ranking. Ground truth is given by distances between Gist descriptors.
EXAMPLES OF LABELME RETRIEVAL
12 closest neighbors under different distance metrics.
TEST SET 2: WEB IMAGES
12.9 million images from the tiny images data set, collected from the Internet. There are no labels, so the Euclidean distance between Gist vectors is used as the ground truth distance. (Note: the Gist descriptor is a feature widely used in computer vision for holistic scene representation.)
EXAMPLES OF WEB RETRIEVAL 12 neighbors using different distance metrics
Fergus et al
WEB IMAGES RETRIEVAL
Observation: codes with more bits give better performance.
RETRIEVAL TIMINGS
Fergus et al
SUMMARY
Image search should be fast, accurate, and scalable. Approaches: tree based methods, Locality Sensitive Hashing, and binary small codes (the state of the art).
QUESTIONS?
LECTURE VIII
Part II: Face Retrieval
Challenges
• Poses, lighting, and facial expressions complicate recognition
• Efficiently matching against a large gallery dataset is nontrivial
• A large number of subjects matters
[Figure: a query face must be matched against many rows of gallery faces.]
Contextual face recognition
• Design a robust face similarity metric: leverage local feature context to deal with visual variations
• Enable large-scale face recognition tasks: design an efficient and scalable matching metric; employ image-level context and active learning
• Utilize social network context to scope recognition: design social network priors to improve recognition
Preprocessing
• Face detection: boosted cascade [Viola-Jones '01]
• Eye detection: neural network
• Face alignment: similarity transform to a canonical frame
• Illumination normalization: self-quotient image [Wang et al. '04]
The normalized face is the input to our algorithm.
Feature extraction
• Filtering: convolution with 4 oriented fourth-derivative-of-Gaussian quadrature pairs
• Patches: 8×8, extracted on a regular grid at each scale
• Gaussian pyramid: dense sampling in scale
• Spatial aggregation: log-polar arrangement of 25 Gaussian-weighted regions (DAISY shape)
One feature descriptor per patch: f_i ∈ R^400, with n ≈ 500 descriptors { f_1 … f_n } per face.
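A sketch of the dense multi-scale sampling step, assuming a grayscale image, three pyramid levels, a sampling stride of 4, and sigma 1.0 for smoothing (the stride and sigma are assumptions; scipy supplies the Gaussian filter, and the oriented filtering and log-polar aggregation are omitted):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def dense_patches(img, levels=3, patch=8, stride=4):
    """Extract 8x8 patches on a regular grid at each Gaussian pyramid level."""
    feats = []
    for _ in range(levels):
        h, w = img.shape
        for y in range(0, h - patch + 1, stride):
            for x in range(0, w - patch + 1, stride):
                feats.append(img[y:y + patch, x:x + patch].ravel())
        img = gaussian_filter(img, sigma=1.0)[::2, ::2]  # next coarser level
    return np.array(feats)
```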
Face representation & matching
Adjoin spatial information: each descriptor is augmented with its image position, g_i = [f_i, x_i, y_i], giving { g_1, g_2 … g_n }. Quantize by a forest of randomized trees T_1 … T_k in Feature Space × Image Space. Each feature g_i contributes to k bins of the combined histogram vector h. Distances use an IDF-weighted L1 norm: w_i = log( #training / #{ training h : h(i) > 0 } ), and d(h, h') = Σ_i w_i |h(i) - h'(i)|.
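A sketch of the weighted histogram distance, assuming the standard inverse-document-frequency form of the weights given above (illustrative names):

```python
import numpy as np

def idf_weights(H):
    """H: (num_training, num_bins) matrix of training histograms."""
    doc_freq = (H > 0).sum(axis=0)          # bins hit by any training face
    return np.log(H.shape[0] / np.maximum(doc_freq, 1))

def weighted_l1(h, h_prime, w):
    """IDF-weighted L1 distance d(h, h') = sum_i w_i |h(i) - h'(i)|."""
    return np.sum(w * np.abs(h - h_prime))
```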
Randomized projection trees
Linear decision at each node: ⟨w, [f x y]'⟩ ≷ τ, with w a random projection, w ~ N(0, Σ).
Why random projections?
• Simple
• Interact well with high-dimensional sparse data (feature descriptors!)
• Generalize trees previously used for vision tasks (kd-trees, Extremely Randomized Forests) [Dasgupta & Freund, Wakin et al., ...]
Additional data-dependence can be introduced through multiple trials: select the (w, τ) pair that minimizes a cost function (e.g., MSE or conditional entropy) [Geurts, Lepetit & Fua, ...].
The threshold τ = median{ ⟨w, [f x y]'⟩ } normalizes the spatial and feature parts; it can also be randomized.
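A sketch of a single tree-node split under the assumption Σ = I for the random direction (illustrative names):

```python
import numpy as np

def split_node(G, seed=0):
    """G: (m, D) augmented features [f, x, y] reaching this node.

    Returns the split (w, tau) and the two child subsets.
    """
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(G.shape[1])  # random projection, w ~ N(0, I)
    proj = G @ w
    tau = np.median(proj)                # balances feature and spatial parts
    left = proj < tau
    return w, tau, G[left], G[~left]
```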
Implicit elastic matching
Even with good descriptors, dense correspondence between two feature sets { g_1 … g_n } and { g'_1 … g'_n } is difficult. Instead, quantize the features such that corresponding features likely map to the same bin. The number of "soft matches" is then given by comparing the histograms h and h' of joint quantizations:
Similarity(I, I') = 1 - ||h - h'||_{w,1}
This enables efficient inverted file indexing!
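A sketch of such an inverted file: every nonzero histogram bin lists the gallery faces with mass in it, so a query only needs to score faces that share at least one bin with it (hypothetical data layout):

```python
from collections import defaultdict

def build_inverted_index(histograms):
    """histograms: {face_id: {bin_id: count}} sparse gallery histograms."""
    index = defaultdict(list)
    for face_id, h in histograms.items():
        for bin_id, count in h.items():
            index[bin_id].append((face_id, count))
    return index

def candidate_faces(index, query_hist):
    """Gallery faces worth scoring: those sharing a bin with the query."""
    return {face_id for b in query_hist for face_id, _ in index.get(b, [])}
```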
[Figure: a query face is matched against the gallery faces via the inverted file, and the top-ranked identity ("Ross") is returned.]
Exploring the optimal settings
A subset of PIE is used for exploration (11554 faces / 68 users); 30 faces per person are used for inducing the trees. Three settings are explored: the histogram distance metric, the tree depth, and the number of trees.
Distance metric      Reco. rate
L2 un-weighted       86.3%
L2 IDF-weighted      86.7%
L1 un-weighted       89.3%
L1 IDF-weighted      89.4%

Forest size          1       5       10      15
Reco. rate           89.4%   92.4%   93.1%   93.6%
Recognition accuracy (1)
Datasets: ORL (40 subjects, unconstrained); Ext. Yale B (38 subjects, extreme illumination); PIE (68 subjects, pose and illumination); Multi-PIE (250 subjects, pose, illumination, expression, time).

Method           ORL     Ext. Yale B   PIE     Multi-PIE
Baseline (PCA)   88.1%   65.4%         62.1%   32.1%
LDA              93.9%   81.3%         89.1%   37.0%
LPP              93.7%   86.4%         89.2%   21.9%
This work        96.5%   91.4%         94.3%   67.6%

Gallery faces: ORL, 5 faces/subject; YaleB, 20 faces/subject; PIE, 30 faces/subject; Multi-PIE, faces in the 1st session.
Recognition accuracy (2)
Cross-dataset generalization: the first dataset is used for inducing the forest, which is then tested on the second dataset (within-dataset results in parentheses).

Method           PIE→ORL (ORL→ORL)   ORL→PIE (PIE→PIE)   PIE→Multi-PIE (Multi-PIE→Multi-PIE)
Baseline (PCA)   85.0% (88.1%)       55.7% (62.1%)       26.5% (32.6%)
LDA              58.5% (93.9%)       72.8% (89.1%)       8.5% (37.0%)
LPP              17.0% (93.7%)       69.1% (89.2%)       17.1% (21.9%)
This work        92.5% (96.5%)       89.7% (94.3%)       67.2% (67.6%)
Scalability challenges
• Given N labeled faces, how do we choose at most M < N faces to form the gallery such that recognition accuracy is maximized?
• How do we handle the problem of recognizing a large number of subjects in a social network?
An active learning framework
A discriminative model (GP + MRF):
- Gaussian process prior: P(Y | X) = N(0, K)
- Unary compatibility between latent score and label: ψ(t_i, y_i) ∝ exp( -(t_i - y_i)^2 / (2σ^2) )
- Pairwise MRF terms: ψ(t_i, t_j) for compatible pairs and 1 - ψ(t_i, t_j) for incompatible pairs
Active learning: query the unlabeled sample whose label is most uncertain,
x* = argmax_{i ∈ U} H(t_i) = argmax_{i ∈ U} ( -Σ_{c ∈ C} q_i(c) log q_i(c) )
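A sketch of the maximum-entropy selection rule, assuming the model's class posteriors q_i(c) are available as a matrix (illustrative names):

```python
import numpy as np

def select_query(Q, unlabeled):
    """Q: (n, num_classes) posterior probabilities; unlabeled: index iterable.

    Returns the unlabeled sample with maximum posterior entropy H(t_i).
    """
    entropy = -(Q * np.log(Q + 1e-12)).sum(axis=1)
    return max(unlabeled, key=lambda i: entropy[i])
```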
Experiments
• An 80-minute "Friends" video
• 6 characters
• 1282 tracks, 16720 faces
• 500 samples are used for testing
Social network scope and priors
• Scope the recognition by the social network
• Build the prior probability of whom Rachel would like to tag

Effects of social priors
[Figure: recognition performance compared across three settings: perfect recognition, recognition with priors, and recognition without priors.]