![Page 1: A Quantitative Analysis and Performance Study For Similar- Search Methods In High- Dimensional Space Presented By Umang Shah Koushik.](https://reader030.fdocuments.in/reader030/viewer/2022032709/56649ee05503460f94bf0557/html5/thumbnails/1.jpg)
A Quantitative Analysis and Performance Study For Similar-Search Methods In High-Dimensional Space
Presented By Umang Shah Koushik
![Page 2: A Quantitative Analysis and Performance Study For Similar- Search Methods In High- Dimensional Space Presented By Umang Shah Koushik.](https://reader030.fdocuments.in/reader030/viewer/2022032709/56649ee05503460f94bf0557/html5/thumbnails/2.jpg)
Introduction
A sequential scan outperforms partitioning-based index methods whenever the dimensionality exceeds about 10.
Any clustering or data-space partitioning method fails to handle high-dimensional vector spaces (HDVSs) beyond a certain dimensionality.
The VA-File is proposed to perform the inevitable sequential scan more efficiently. Its performance improves as the dimensionality increases.
![Page 3: A Quantitative Analysis and Performance Study For Similar- Search Methods In High- Dimensional Space Presented By Umang Shah Koushik.](https://reader030.fdocuments.in/reader030/viewer/2022032709/56649ee05503460f94bf0557/html5/thumbnails/3.jpg)
Assumptions and Notation
Assumption 1: Data and Metric
• Data lies in the unit hypercube [0,1]^d
• Distances are measured with the Euclidean (L2) metric
Assumption 2: Uniformity and Independence
• Data and query points are uniformly distributed
• Dimensions are independent
![Page 4: A Quantitative Analysis and Performance Study For Similar- Search Methods In High- Dimensional Space Presented By Umang Shah Koushik.](https://reader030.fdocuments.in/reader030/viewer/2022032709/56649ee05503460f94bf0557/html5/thumbnails/4.jpg)
NN, NN-distance, NN-sphere
![Page 5: A Quantitative Analysis and Performance Study For Similar- Search Methods In High- Dimensional Space Presented By Umang Shah Koushik.](https://reader030.fdocuments.in/reader030/viewer/2022032709/56649ee05503460f94bf0557/html5/thumbnails/5.jpg)
Probability and Volume Computations
![Page 6: A Quantitative Analysis and Performance Study For Similar- Search Methods In High- Dimensional Space Presented By Umang Shah Koushik.](https://reader030.fdocuments.in/reader030/viewer/2022032709/56649ee05503460f94bf0557/html5/thumbnails/6.jpg)
The Difficulties of High Dimensionality
• Number of partitions
• Data space is sparsely populated
• Spherical range queries
• Exponentially growing DB size
• Expected NN-distance
![Page 7: A Quantitative Analysis and Performance Study For Similar- Search Methods In High- Dimensional Space Presented By Umang Shah Koushik.](https://reader030.fdocuments.in/reader030/viewer/2022032709/56649ee05503460f94bf0557/html5/thumbnails/7.jpg)
Number of partitions
• Splitting every dimension once yields 2^d partitions
• Assume N = 10^6 points
• For d = 100, there are 2^100 ≈ 10^30 partitions
• Almost all partitions are empty
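The arithmetic on this slide can be checked with a short sketch. The function name `partition_stats` is hypothetical; it assumes each dimension is split exactly once and uses the fact that N points can occupy at most N partitions.

```python
def partition_stats(n_points: int, d: int) -> tuple[int, float]:
    """Return (number of partitions, upper bound on the occupied fraction)."""
    partitions = 2 ** d  # one binary split per dimension
    # N points can fill at most N partitions, so this bounds the occupied share.
    max_occupied_fraction = min(1.0, n_points / partitions)
    return partitions, max_occupied_fraction


partitions, occupied = partition_stats(10**6, 100)
print(f"{partitions:.3e} partitions, at most {occupied:.1e} of them occupied")
```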
![Page 8: A Quantitative Analysis and Performance Study For Similar- Search Methods In High- Dimensional Space Presented By Umang Shah Koushik.](https://reader030.fdocuments.in/reader030/viewer/2022032709/56649ee05503460f94bf0557/html5/thumbnails/8.jpg)
Data space is sparsely populated
0.95^100 = 0.0059
At d = 100, even a hypercube of side 0.95 covers only 0.59% of the data space.
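The coverage figure is simply the side length raised to the power d. A minimal sketch (the helper name `hypercube_coverage` is an assumption made for illustration):

```python
def hypercube_coverage(side: float, d: int) -> float:
    """Fraction of the unit hypercube [0,1]^d covered by a cube of the given side."""
    return side ** d


for d in (2, 10, 100):
    print(d, hypercube_coverage(0.95, d))
```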
![Page 9: A Quantitative Analysis and Performance Study For Similar- Search Methods In High- Dimensional Space Presented By Umang Shah Koushik.](https://reader030.fdocuments.in/reader030/viewer/2022032709/56649ee05503460f94bf0557/html5/thumbnails/9.jpg)
Spherical range queries
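The largest spherical query that fits entirely within the data space is the hypersphere of radius 0.5 centered at the middle of the unit hypercube.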
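To see how small such a query becomes, the sketch below evaluates the standard volume formula for a d-dimensional ball, V = π^(d/2) · r^d / Γ(d/2 + 1). It assumes the largest query is a sphere of radius 0.5 inside the unit hypercube; the function name is hypothetical.

```python
import math


def sphere_volume(radius: float, d: int) -> float:
    """Volume of a d-dimensional ball, computed in log space for stability."""
    log_v = (d / 2) * math.log(math.pi) + d * math.log(radius) - math.lgamma(d / 2 + 1)
    return math.exp(log_v)


# The unit hypercube has volume 1, so this is also the fraction of space covered.
for d in (2, 10, 40, 100):
    print(d, sphere_volume(0.5, d))
```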
![Page 10: A Quantitative Analysis and Performance Study For Similar- Search Methods In High- Dimensional Space Presented By Umang Shah Koushik.](https://reader030.fdocuments.in/reader030/viewer/2022032709/56649ee05503460f94bf0557/html5/thumbnails/10.jpg)
Exponentially growing DB size
For at least one point to be expected to fall into the largest possible sphere, the database size must grow exponentially with d.
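Under the uniformity assumption, the expected number of points in a region equals N times its volume, so the required database size is roughly 1 / Vol(sphere). The sketch below reports log10 of that size; it assumes a sphere of radius 0.5 in the unit hypercube, and the function name is hypothetical.

```python
import math


def log10_required_db_size(d: int, radius: float = 0.5) -> float:
    """log10 of the N needed so that one point is expected inside the sphere."""
    # log of the ball volume: (d/2) ln(pi) + d ln(r) - ln Gamma(d/2 + 1)
    log_volume = (d / 2) * math.log(math.pi) + d * math.log(radius) - math.lgamma(d / 2 + 1)
    return -log_volume / math.log(10)  # N = 1 / volume


for d in (2, 10, 40, 100):
    print(d, round(log10_required_db_size(d), 1))
```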
![Page 11: A Quantitative Analysis and Performance Study For Similar- Search Methods In High- Dimensional Space Presented By Umang Shah Koushik.](https://reader030.fdocuments.in/reader030/viewer/2022032709/56649ee05503460f94bf0557/html5/thumbnails/11.jpg)
Expected NN-Distance
The NN distance grows steadily with d.
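The growth of the NN distance can be observed with a small Monte Carlo experiment. This is only an illustrative sketch under the uniformity assumption: the sample sizes, seed, and function name are arbitrary choices, not values from the presentation.

```python
import math
import random


def mean_nn_distance(n_points: int, d: int, n_queries: int = 20, seed: int = 0) -> float:
    """Average Euclidean distance from random queries to their nearest data point."""
    rng = random.Random(seed)
    data = [[rng.random() for _ in range(d)] for _ in range(n_points)]
    total = 0.0
    for _ in range(n_queries):
        query = [rng.random() for _ in range(d)]
        total += min(math.dist(query, point) for point in data)
    return total / n_queries


for d in (2, 10, 50):
    print(d, round(mean_nn_distance(500, d), 3))
```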
![Page 12: A Quantitative Analysis and Performance Study For Similar- Search Methods In High- Dimensional Space Presented By Umang Shah Koushik.](https://reader030.fdocuments.in/reader030/viewer/2022032709/56649ee05503460f94bf0557/html5/thumbnails/12.jpg)
General Cost Model
The probability that the i-th block is visited,
The expected number of blocks visited,
If we assume m objects per block,
Is M_visit > 20%?
![Page 13: A Quantitative Analysis and Performance Study For Similar- Search Methods In High- Dimensional Space Presented By Umang Shah Koushik.](https://reader030.fdocuments.in/reader030/viewer/2022032709/56649ee05503460f94bf0557/html5/thumbnails/13.jpg)
Space-Partitioning Methods
is independent of d.
Space consumption is 2^d, so splits are made in only d' dimensions.
E[nndist] increases with d.
When E[nndist] is greater than l_max, the entire database is accessed.
![Page 14: A Quantitative Analysis and Performance Study For Similar- Search Methods In High- Dimensional Space Presented By Umang Shah Koushik.](https://reader030.fdocuments.in/reader030/viewer/2022032709/56649ee05503460f94bf0557/html5/thumbnails/14.jpg)
Data-Partitioning Methods
Rectangular MBRs
• R*-tree, X-tree, SR-tree
Spherical MBRs
• TV-tree, M-tree, SR-tree
General partitioning and Clustering schemes
![Page 15: A Quantitative Analysis and Performance Study For Similar- Search Methods In High- Dimensional Space Presented By Umang Shah Koushik.](https://reader030.fdocuments.in/reader030/viewer/2022032709/56649ee05503460f94bf0557/html5/thumbnails/15.jpg)
Rectangular MBRs
![Page 16: A Quantitative Analysis and Performance Study For Similar- Search Methods In High- Dimensional Space Presented By Umang Shah Koushik.](https://reader030.fdocuments.in/reader030/viewer/2022032709/56649ee05503460f94bf0557/html5/thumbnails/16.jpg)
Spherical MBRs
![Page 17: A Quantitative Analysis and Performance Study For Similar- Search Methods In High- Dimensional Space Presented By Umang Shah Koushik.](https://reader030.fdocuments.in/reader030/viewer/2022032709/56649ee05503460f94bf0557/html5/thumbnails/17.jpg)
General Partitioning and Clustering Schemes
Assumptions
• A cluster is characterized by a geometrical form (MBR) that covers all cluster points
• Each cluster contains at least 2 points
• The MBR of a cluster is convex
![Page 18: A Quantitative Analysis and Performance Study For Similar- Search Methods In High- Dimensional Space Presented By Umang Shah Koushik.](https://reader030.fdocuments.in/reader030/viewer/2022032709/56649ee05503460f94bf0557/html5/thumbnails/18.jpg)
Vector Approximation File
Basic Idea:
• Technique specially designed for similarity search
• Object approximation
• Vector data compression
![Page 19: A Quantitative Analysis and Performance Study For Similar- Search Methods In High- Dimensional Space Presented By Umang Shah Koushik.](https://reader030.fdocuments.in/reader030/viewer/2022032709/56649ee05503460f94bf0557/html5/thumbnails/19.jpg)
Notations
![Page 20: A Quantitative Analysis and Performance Study For Similar- Search Methods In High- Dimensional Space Presented By Umang Shah Koushik.](https://reader030.fdocuments.in/reader030/viewer/2022032709/56649ee05503460f94bf0557/html5/thumbnails/20.jpg)
Lower bound, upper bound
![Page 21: A Quantitative Analysis and Performance Study For Similar- Search Methods In High- Dimensional Space Presented By Umang Shah Koushik.](https://reader030.fdocuments.in/reader030/viewer/2022032709/56649ee05503460f94bf0557/html5/thumbnails/21.jpg)
How it is done
• The data space is divided into 2^b rectangular cells
• Cells are arranged in the form of a grid
• The entire file is scanned at query time
![Page 22: A Quantitative Analysis and Performance Study For Similar- Search Methods In High- Dimensional Space Presented By Umang Shah Koushik.](https://reader030.fdocuments.in/reader030/viewer/2022032709/56649ee05503460f94bf0557/html5/thumbnails/22.jpg)
Compression Vector
• For each dimension i, a small number of bits b[i] is assigned
• The sum of all b[i] is b
• The data space is divided into 2^b hyper-rectangles
• Each data point is approximated by the bit string of the cell that contains it
• Only the boundary points along each dimension need to be stored
![Page 23: A Quantitative Analysis and Performance Study For Similar- Search Methods In High- Dimensional Space Presented By Umang Shah Koushik.](https://reader030.fdocuments.in/reader030/viewer/2022032709/56649ee05503460f94bf0557/html5/thumbnails/23.jpg)
Compression Vector
The number of bits chosen for each dimension normally varies from 4 to 8.
Typically
b_i = l, b = d * l, l = 4...8
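A minimal sketch of how such an approximation might be built and used to bound distances. It assumes equal-width slices over [0,1] in every dimension and the same number of bits l per dimension; the function names are hypothetical and the slicing rule is an assumption, not necessarily the one used in the original work.

```python
import math


def approximate(vector, bits_per_dim):
    """Map each coordinate in [0,1] to the index of its slice (0 .. 2^l - 1)."""
    slices = 2 ** bits_per_dim
    return [min(int(x * slices), slices - 1) for x in vector]


def to_bit_string(cell, bits_per_dim):
    """Concatenate the per-dimension slice indices into one bit string of d*l bits."""
    return "".join(format(c, f"0{bits_per_dim}b") for c in cell)


def distance_bounds(query, cell, bits_per_dim):
    """Lower and upper bounds on the Euclidean distance from query to any point in cell."""
    slices = 2 ** bits_per_dim
    lower_sq = upper_sq = 0.0
    for q, c in zip(query, cell):
        lo, hi = c / slices, (c + 1) / slices  # slice boundaries in this dimension
        lower_sq += max(lo - q, 0.0, q - hi) ** 2
        upper_sq += max(abs(q - lo), abs(q - hi)) ** 2
    return math.sqrt(lower_sq), math.sqrt(upper_sq)


vec = [0.1, 0.5, 0.93]
cell = approximate(vec, bits_per_dim=2)
print(cell, to_bit_string(cell, 2))
print(distance_bounds([0.9, 0.1, 0.2], cell, 2))
```

With l = 2 and d = 3, the approximation takes b = 6 bits, compared with three full floating-point coordinates for the original vector.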
![Page 24: A Quantitative Analysis and Performance Study For Similar- Search Methods In High- Dimensional Space Presented By Umang Shah Koushik.](https://reader030.fdocuments.in/reader030/viewer/2022032709/56649ee05503460f94bf0557/html5/thumbnails/24.jpg)
Example:
![Page 25: A Quantitative Analysis and Performance Study For Similar- Search Methods In High- Dimensional Space Presented By Umang Shah Koushik.](https://reader030.fdocuments.in/reader030/viewer/2022032709/56649ee05503460f94bf0557/html5/thumbnails/25.jpg)
Two probabilities associated with the VA-File
![Page 26: A Quantitative Analysis and Performance Study For Similar- Search Methods In High- Dimensional Space Presented By Umang Shah Koushik.](https://reader030.fdocuments.in/reader030/viewer/2022032709/56649ee05503460f94bf0557/html5/thumbnails/26.jpg)
Filtering Step
Simple Search Algorithm
• An array of k elements is maintained
• This array is kept in sorted order
• The file is scanned sequentially
• If an element's lower bound is less than the distance of the k-th element in the array, the actual distance is calculated
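The simple search can be sketched as follows. This is an illustrative reading, not the presenters' code: it assumes the lower bounds have already been computed from the approximations, that the pruning threshold is the k-th smallest actual distance found so far, and that `distance_of(i)` is a hypothetical callback that fetches vector i and returns its exact distance to the query.

```python
import bisect


def simple_search(lower_bounds, distance_of, k):
    """Return the k nearest (distance, index) pairs and the number of exact distances computed."""
    best = []      # sorted list of (distance, index), never longer than k
    computed = 0
    for i, lb in enumerate(lower_bounds):
        # Until k candidates exist, nothing can be pruned.
        threshold = best[-1][0] if len(best) == k else float("inf")
        if lb < threshold:
            bisect.insort(best, (distance_of(i), i))
            computed += 1
            del best[k:]  # keep only the k smallest distances
    return best, computed


dists = [0.9, 0.2, 0.5, 0.1, 0.8]           # exact distances (normally read from disk)
lower = [0.8, 0.1, 0.4, 0.05, 0.7]          # lower bounds from the approximations
print(simple_search(lower, lambda i: dists[i], k=2))
```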
![Page 27: A Quantitative Analysis and Performance Study For Similar- Search Methods In High- Dimensional Space Presented By Umang Shah Koushik.](https://reader030.fdocuments.in/reader030/viewer/2022032709/56649ee05503460f94bf0557/html5/thumbnails/27.jpg)
Filtering Step
Near-Optimal Search Algorithm
• Done in two steps
• Step 1: While scanning through the file, keep track of the k-th smallest upper bound encountered so far
• If a new element's lower bound is greater than this bound, discard it
![Page 28: A Quantitative Analysis and Performance Study For Similar- Search Methods In High- Dimensional Space Presented By Umang Shah Koushik.](https://reader030.fdocuments.in/reader030/viewer/2022032709/56649ee05503460f94bf0557/html5/thumbnails/28.jpg)
Filtering Step
• Step 2: The elements remaining after step 1 are collected
• These elements are visited in increasing order of lower bound until a lower bound is >= the k-th smallest actual distance found so far
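The two steps can be sketched as below. This is a hedged illustration rather than the presenters' implementation: it assumes precomputed lower and upper bounds, that step 1 prunes against the k-th smallest upper bound seen so far, and that step 2 stops once a lower bound reaches the k-th smallest exact distance found so far. `distance_of(i)` is a hypothetical callback returning the exact distance of vector i.

```python
import heapq


def near_optimal_search(lower_bounds, upper_bounds, distance_of, k):
    """Two-phase filtering; returns k nearest (distance, index) pairs and exact-distance count."""
    # Step 1: scan the approximations and keep candidates that cannot be ruled out.
    smallest_ubs = []   # max-heap (negated values) of the k smallest upper bounds
    candidates = []
    for i, (lb, ub) in enumerate(zip(lower_bounds, upper_bounds)):
        delta = -smallest_ubs[0] if len(smallest_ubs) == k else float("inf")
        if lb > delta:
            continue    # at least k vectors are certainly closer: discard
        candidates.append((lb, i))
        if len(smallest_ubs) < k:
            heapq.heappush(smallest_ubs, -ub)
        elif ub < delta:
            heapq.heapreplace(smallest_ubs, -ub)

    # Step 2: visit candidates by increasing lower bound, computing exact distances.
    candidates.sort()
    best = []           # max-heap of (negated distance, index), at most k entries
    computed = 0
    for lb, i in candidates:
        if len(best) == k and lb >= -best[0][0]:
            break       # no remaining candidate can improve the answer
        dist = distance_of(i)
        computed += 1
        if len(best) < k:
            heapq.heappush(best, (-dist, i))
        elif dist < -best[0][0]:
            heapq.heapreplace(best, (-dist, i))
    return sorted((-nd, i) for nd, i in best), computed


dists = [0.9, 0.2, 0.5, 0.1, 0.8]
lower = [0.8, 0.1, 0.4, 0.05, 0.7]
upper = [1.0, 0.3, 0.6, 0.15, 0.9]
print(near_optimal_search(lower, upper, lambda i: dists[i], k=2))
```

On this toy input the two-phase variant computes fewer exact distances than a single-pass scan would, because step 2 visits the most promising candidates first.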
![Page 29: A Quantitative Analysis and Performance Study For Similar- Search Methods In High- Dimensional Space Presented By Umang Shah Koushik.](https://reader030.fdocuments.in/reader030/viewer/2022032709/56649ee05503460f94bf0557/html5/thumbnails/29.jpg)
Performance
Add Two Graphs Of Performance
![Page 30: A Quantitative Analysis and Performance Study For Similar- Search Methods In High- Dimensional Space Presented By Umang Shah Koushik.](https://reader030.fdocuments.in/reader030/viewer/2022032709/56649ee05503460f94bf0557/html5/thumbnails/30.jpg)
Performance
![Page 31: A Quantitative Analysis and Performance Study For Similar- Search Methods In High- Dimensional Space Presented By Umang Shah Koushik.](https://reader030.fdocuments.in/reader030/viewer/2022032709/56649ee05503460f94bf0557/html5/thumbnails/31.jpg)
Conclusion
All approaches to nearest-neighbor search in HDVSs ultimately become linear at high dimensionality.
The VA-File method can outperform any other method known to the authors.