-
Princeton University 1
Ferret: A Toolkit for Content-Based Similarity Search of Feature-Rich Data
Qin Lv
Joint work with: William Josephson, Zhe Wang,
Moses Charikar, and Kai Li
-
Motivations
Digital data is everywhere
  Increases exponentially
Feature-rich digital data dominates data volume
  Audio, video, digital photos, scientific sensor data
Systems support for managing feature-rich data
  Techniques for text data do not apply
  Feature-rich data are noisy and high-dimensional
  Domain efforts limited to small datasets
-
Current Search Techniques
Search capability is becoming an integral part of modern operating systems
  Mac OS X Tiger: Spotlight
  Windows Vista
Limited to text-based search
  Web search engines: Google, Yahoo, Microsoft, …
  Desktop search: Google, Yahoo, MSN, …
  Text-based documents
    • Emails, Word documents, PDF files, instant messages, …
  Text-based annotations and attributes
    • Image annotations, music (title, artist, lyrics), …
-
Ferret Toolkit: Design Goals
Works with multiple feature-rich data types
  Image, audio, 3D shape model, gene expression data
High performance
  Search quality, search speed, memory usage
[Figure: storage layer, showing a basic file system, attribute-based search, content-based search (text), and content-based search (feature-rich)]
-
Outline
Motivations
Ferret toolkit architecture design
Similarity search problem
Core similarity search engine
Using the Ferret toolkit
Evaluation results
Conclusion & future work
-
Ferret Toolkit Architecture Design
-
Similarity Search Problem
Similarity search
  Given a query object, find similar objects (i.e., containing similar features)
Distance function d(X, Y)
  Between objects
Nearest neighbor search
  K-nearest neighbor (KNN)
  Approximate nearest neighbor (ANN)
Hard problem for high-dimensional search
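As a point of reference, exact K-nearest-neighbor search can be written as a linear scan that evaluates the distance function d on every object. The sketch below assumes objects are plain feature vectors and uses ℓ1 distance as d; the names `l1_distance` and `knn` are illustrative and are not part of the Ferret toolkit.

```python
import heapq
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def l1_distance(x: Vector, y: Vector) -> float:
    """Sum of absolute coordinate differences between x and y."""
    return sum(abs(a - b) for a, b in zip(x, y))


def knn(query: Vector, dataset: List[Vector], k: int,
        d: Callable[[Vector, Vector], float] = l1_distance
        ) -> List[Tuple[float, int]]:
    """Return the k (distance, index) pairs closest to the query, nearest first."""
    scored = ((d(query, obj), i) for i, obj in enumerate(dataset))
    return heapq.nsmallest(k, scored)


# Toy data (assumed for illustration only).
dataset = [(0.0, 0.0), (1.0, 1.0), (0.2, 0.1), (5.0, 5.0)]
nearest = knn((0.0, 0.1), dataset, k=2)
print([index for _, index in nearest])  # indices of the two nearest objects
```

Every query touches every object, so the cost grows with both dataset size and dimensionality, which is what motivates approximate methods.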
-
Object Representation & Distance Function
Multi-feature representation
Distance function
  E.g., Earth Mover’s Distance (EMD)
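For reference, the standard formulation of EMD between two multi-feature objects can be written as follows. The notation is assumed here, not taken from the slides: X has m segments with feature vectors x_i and weights w_i, Y has n segments with feature vectors y_j and weights u_j, both weight sets are normalized to sum to 1, and d is the segment distance. The toolkit's own variant may differ in details.

```latex
\[
\mathrm{EMD}(X, Y) = \min_{f_{ij} \ge 0} \sum_{i=1}^{m} \sum_{j=1}^{n} f_{ij}\, d(x_i, y_j)
\quad \text{subject to} \quad
\sum_{j=1}^{n} f_{ij} = w_i, \qquad \sum_{i=1}^{m} f_{ij} = u_j .
\]
```

Each f_ij is the amount of weight moved from segment i of X to segment j of Y, so the distance is the cheapest way to transform one weighted set of segments into the other.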
-
Core Similarity Search Engine
[Figure: core similarity search engine. Input path: input data objects, segmentation & feature extraction, sketch construction, sketch database. Query path: query data object, segmentation & feature extraction, sketch construction, filtering, similarity ranking, results.]
-
Sketch Construction
Sketches
  Compact data structures, estimate properties of original data
Sketch distance
  Hamming distance between bit vectors
  Estimate distance between objects
[Figure: a complex object with feature vectors x = (x1, x2, x3, x4) and y = (y1, y2, y3, y4), each mapped to a bit-vector sketch]
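One simple way to obtain bit-vector sketches whose Hamming distance tracks ℓ1 distance is to compare randomly chosen coordinates against random thresholds. The sketch below is an illustration under stated assumptions (features scaled to [0, 1], one coordinate and one threshold per bit); it is not necessarily the construction the toolkit uses, and `make_sketcher` and `hamming` are hypothetical names.

```python
import random
from typing import Callable, Sequence


def make_sketcher(dim: int, n_bits: int, lo: float = 0.0, hi: float = 1.0,
                  seed: int = 0) -> Callable[[Sequence[float]], int]:
    """Build a function mapping a dim-dimensional vector to an n_bits sketch."""
    rng = random.Random(seed)
    # One (coordinate index, threshold) probe per sketch bit.
    probes = [(rng.randrange(dim), rng.uniform(lo, hi)) for _ in range(n_bits)]

    def sketch(x: Sequence[float]) -> int:
        bits = 0
        for i, t in probes:
            bits = (bits << 1) | (1 if x[i] > t else 0)
        return bits

    return sketch


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two sketches differ."""
    return bin(a ^ b).count("1")


sketch = make_sketcher(dim=4, n_bits=256)
x = (0.1, 0.2, 0.3, 0.4)
near = (0.1, 0.2, 0.3, 0.45)   # small l1 distance from x
far = (0.9, 0.8, 0.7, 0.6)     # large l1 distance from x
print(hamming(sketch(x), sketch(near)), hamming(sketch(x), sketch(far)))
```

A bit differs between two vectors only when its threshold falls between their values in the probed coordinate, so the expected Hamming distance is proportional to their ℓ1 distance.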
-
Filtering for Similarity Search
Multi-feature representation
  Computing object distance is expensive
Filtering
  Scans through the entire dataset
  Uses a much faster distance function to filter out “bad” answers
    • Hamming distance of sketches
  Computes object distance for a much smaller candidate set
Criteria in picking candidate objects
  Has at least one segment that is close enough to one of the major segments of the query object
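The filter-then-rank idea above can be sketched as follows. The assumptions are loud: each object is a list of (sketch, weight) segments, "major" segments are the highest-weight ones, "close enough" is a Hamming threshold, and `demo_distance` is a cheap stand-in for a real object distance such as EMD. All names and parameters are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Segment = Tuple[int, float]  # (sketch bits, segment weight)


def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")


def filter_candidates(query: List[Segment], db: Dict[str, List[Segment]],
                      top_segments: int, max_hamming: int) -> List[str]:
    """Keep objects with a segment within max_hamming of a major query segment."""
    major = sorted(query, key=lambda s: s[1], reverse=True)[:top_segments]
    return [oid for oid, segs in db.items()
            if any(hamming(q, s) <= max_hamming
                   for q, _ in major for s, _ in segs)]


def search(query: List[Segment], db: Dict[str, List[Segment]],
           object_distance: Callable[[List[Segment], List[Segment]], float],
           top_segments: int = 2, max_hamming: int = 2, k: int = 10) -> List[str]:
    """Filter with sketches, then rank only the candidates by object distance."""
    candidates = filter_candidates(query, db, top_segments, max_hamming)
    ranked = sorted(candidates, key=lambda oid: object_distance(query, db[oid]))
    return ranked[:k]


def demo_distance(a: List[Segment], b: List[Segment]) -> float:
    """Stand-in object distance (not EMD): weighted nearest-segment Hamming."""
    return sum(w * min(hamming(s, t) for t, _ in b) for s, w in a)


query = [(0b11110000, 0.7), (0b00001111, 0.3)]
db = {
    "a": [(0b11110001, 1.0)],
    "b": [(0b00000000, 0.5), (0b10101010, 0.5)],
    "c": [(0b11100000, 0.6), (0b00001111, 0.4)],
}
print(search(query, db, demo_distance, top_segments=1, max_hamming=1))
```

The expensive distance is evaluated only for objects that survive the filter, so the threshold and the number of major segments trade search speed against the risk of dropping a good answer.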
-
Using the Ferret Toolkit
Can the Ferret toolkit be applied to multiple data types?
  Image data?
  Audio data?
  3D shape models?
  Gene expression data?
-
Image Similarity Search
Segmentation
  JSEG segmentation tool from UCSB
Feature extraction
  14-d features: 9-d color moments and 5-d bounding box
  Segment weight: square root of segment size
Distance functions
  Segment distance: weighted ℓ1 distance
  Object distance: EMD
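A weighted ℓ1 segment distance and square-root-of-size segment weights can be sketched as below. The per-dimension weights are not given on the slide, so the values used here are placeholders, and normalizing the segment weights to sum to 1 is an assumption made so that they can serve as EMD weights.

```python
import math
from typing import List, Sequence


def weighted_l1(x: Sequence[float], y: Sequence[float],
                dim_weights: Sequence[float]) -> float:
    """Weighted l1 distance: sum of dim_weights[i] * |x[i] - y[i]|."""
    if not (len(x) == len(y) == len(dim_weights)):
        raise ValueError("x, y and dim_weights must have the same length")
    return sum(w * abs(a - b) for w, a, b in zip(dim_weights, x, y))


def segment_weights(sizes: Sequence[float]) -> List[float]:
    """Weights proportional to sqrt(segment size), normalized to sum to 1."""
    roots = [math.sqrt(s) for s in sizes]
    total = sum(roots)
    return [r / total for r in roots]


# Placeholder values: two 2-d features and two segments of 400 and 100 pixels.
print(weighted_l1([1.0, 2.0], [2.0, 4.0], dim_weights=[1.0, 0.5]))
print(segment_weights([400, 100]))
```

Taking the square root dampens the influence of very large segments, such as backgrounds, relative to weighting by raw size.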
-
Audio Similarity Search
Segmentation
  Utterance-level segmenter, human-marked word boundaries
Feature extraction
  32 windows x 6 MFCC parameters = 192 features
  Segment weight: proportional to segment length
Distance functions
  Segment distance: ℓ1 distance
  Object distance: EMD
-
3D Shape Similarity Search
Segmentation
  32 decomposing spheres
Feature extraction
  Spherical harmonic descriptor (SHD)
  32 x 17 = 544 dimensions
Distance functions
  Segment distance: ℓ1 distance
  Object distance: same as segment distance
-
Gene Expression Similarity Search
Segmentation
  Gene expression microarray data: one gene per row
Feature extraction
  Gene expression values
Distance function
  Pearson correlation, Spearman correlation, ℓ1 distance
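The two correlation measures named above can be computed as follows for a pair of expression rows. This is a minimal sketch using the textbook definitions, with Spearman computed as Pearson on average ranks; turning a correlation r into a distance as 1 - r is a common convention assumed here, not something the slide specifies.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length rows."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0.0 or sy == 0.0:
        raise ValueError("correlation is undefined for a constant row")
    return cov / (sx * sy)


def ranks(v: Sequence[float]) -> List[float]:
    """1-based ranks of v, with tied values sharing their average rank."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0.0] * len(v)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and v[order[j + 1]] == v[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for t in range(i, j + 1):
            r[order[t]] = avg
        i = j + 1
    return r


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [1.0, 4.0, 9.0, 16.0]  # monotone in gene_a but not linear
print(1 - pearson(gene_a, gene_b), 1 - spearman(gene_a, gene_b))
```

Spearman treats any monotone relationship as perfect correlation, whereas Pearson rewards only linear ones, which is why the two can rank the same pair of genes differently.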
-
Evaluations
Can systems built with the Ferret toolkit achieve high-quality similarity search results at high speed?
How small can the sketches be when used as the metadata of the similarity search systems?
How much benefit can we get by using sketching and filtering?
-
Benchmarks
Search quality benchmark suite
  VARY image: 10,000 images, 32 sets
  TIMIT audio: 6,300 sentences, 450 sets
  PSB shape: 1,814 3D shape models, 92 sets
Search speed benchmark suite
  Mixed image dataset: 660,000 images
  TIMIT audio: 6,300 sentences
  Mixed shape dataset: 40,000 3D shape models
-
Search Quality Metrics
Given a query q with k similar objects:
  1st-tier recall
    Percentage of similar objects returned within rank k
  2nd-tier recall
    Percentage of similar objects returned within rank 2k
  Average precision
    \[ \text{Average Precision} = \frac{1}{k} \sum_{i=1}^{k} \frac{i}{\mathrm{rank}_i} \]
  Example: k = 5, return 4 good results ranked at 1, 2, 5, 10
    Average precision = (1/1 + 2/2 + 3/5 + 4/10) / 5 = 0.6
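The three metrics can be computed directly from the ranks at which the similar objects were returned. The sketch below reproduces the slide's example; the function names are illustrative, and similar objects that are never returned simply contribute nothing to the sums.

```python
from typing import Sequence


def tier_recall(good_ranks: Sequence[int], k: int, tier: int = 1) -> float:
    """Fraction of the k similar objects returned within rank tier * k."""
    return sum(1 for r in good_ranks if r <= tier * k) / k


def average_precision(good_ranks: Sequence[int], k: int) -> float:
    """(1/k) * sum of i / rank_i over the returned similar objects."""
    return sum(i / r for i, r in enumerate(sorted(good_ranks), start=1)) / k


# Slide example: k = 5 similar objects, 4 of them returned at ranks 1, 2, 5, 10.
good_ranks = [1, 2, 5, 10]
print(f"{tier_recall(good_ranks, 5):.2f} "
      f"{tier_recall(good_ranks, 5, tier=2):.2f} "
      f"{average_precision(good_ranks, 5):.2f}")  # 0.60 0.80 0.60
```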
-
Search Quality & Search Speed
Dataset       Method      Average Precision  1st-tier  2nd-tier  Size Ratio  Vector Size (bits)
VARY Image    Ferret      0.59               0.54      0.63      5:1         96
VARY Image    SIMPLIcity  0.41               0.41      0.47      -           264
TIMIT Audio   Ferret      0.44               0.42      0.49      10:1        600
PSB 3D Shape  Ferret      0.32               0.30      0.41      22:1        800
PSB 3D Shape  SHD         0.33               0.32      0.43      -           17472
Dataset         #Data Objects  #Vectors / Object  Search Time (s)
Mixed Image     660,000        10.8               2.0
TIMIT Audio     6,300          8.6                0.09
Mixed 3D Shape  40,000         1                  0.01
-
Search Quality vs. Sketch Size
Dataset       Sketch Size      Sketch Size
VARY Image    64 bits (7:1)    88 bits (5:1)
TIMIT Audio   250 bits (6:1)   450 bits (3:1)
PSB 3D Shape  200 bits (87:1)  600 bits (29:1)
-
Brute-Force, Sketching, Filtering
BruteForceOriginal
  Linear scan using original feature vectors
BruteForceSketch
  Linear scan using segment sketches
Filtering
  Filtering using segment sketches
-
Conclusion & Future Work
Ferret toolkit for content-based similarity search
  Used for image, audio, 3D shape, genomic data
Achieves high search quality at reasonably high search speed
Using sketches greatly reduces metadata size with minimal quality degradation
Future work
  Integrate with attribute-based search
  Indexing techniques
  More effective and efficient distance functions and corresponding sketching techniques
  More data types: video, sensor data
-
Thanks!
CASS: Content-Aware Search Systems
  http://www.cs.princeton.edu/cass
Try our image similarity search tool for Windows
  • http://www.cs.princeton.edu/cass/software