Indexing Multi-Dimensional Uncertain Data with Arbitrary Probability Density Functions


Yufei Tao, Reynold Cheng, Xiaokui Xiao, Wang Kai Ngai, Ben Kao, Sunil Prabhakar

City University of Hong Kong

Hong Kong Polytechnic University

University of Hong Kong

Purdue University

Multi-dimensional Uncertain Data

Moving objects: An object sends its location to a server whenever its distance from the previously reported location exceeds a certain threshold.

Sensor readings: Each sensor periodically reports the temperature, humidity, UV index, …, in its neighborhood.

Querying the (uncertain) data stored in the server directly is meaningless.

Uncertainty Modeling

[Figure: Client 1's recorded location in the database and the surrounding uncertainty region, whose extent is given by the distance threshold.]

An object’s location is described by a probability density function.

Probabilistic Range Search

[Figure: Clients 1 to 6 and the query region rq (the area of CityU).]

Probabilistic range query: find the clients that are currently in CityU with at least 50% probability (the probability threshold).

Appearance Probability

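appearance probability: $P_{app}(o, q) = \int_{x \in o.ur \cap rq} o.pdf(x)\,dx$

[Figure: Client 1's uncertainty region ur, the query region rq, and their intersection rq ∩ ur; x is a point of the intersection.]

E.g., uniform pdf: $P_{app}(o, q) = \mathrm{area}(o.ur \cap rq) / \mathrm{area}(o.ur)$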

Appearance Probability

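[Figure: object o with its uncertainty region o.ur, the query region rq, and the intersection o.ur ∩ rq.]

The appearance probability must be calculated numerically.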

Calculation time of an appearance probability in 2D space: 1.3ms

Time for a random access: 10ms
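One common way to calculate an appearance probability numerically is Monte Carlo sampling. The sketch below is only an illustration under assumptions, not the method used in the talk: the helper names (appearance_probability, uniform_in_circle), the rectangle encoding of rq, and the sample count are all hypothetical.

```python
import math
import random

def appearance_probability(sample_pdf, rq, n_samples=20000, seed=0):
    """Estimate the probability that an object lies inside rq.

    sample_pdf(rng) returns one (x, y) point drawn from the object's pdf;
    rq is an axis-parallel rectangle (x_lo, y_lo, x_hi, y_hi).
    """
    rng = random.Random(seed)
    x_lo, y_lo, x_hi, y_hi = rq
    hits = 0
    for _ in range(n_samples):
        x, y = sample_pdf(rng)
        if x_lo <= x <= x_hi and y_lo <= y <= y_hi:
            hits += 1
    return hits / n_samples

def uniform_in_circle(cx, cy, radius):
    """Return a sampler for a uniform pdf over a circular uncertainty region."""
    def sample(rng):
        # sqrt() makes the points uniform over the disk area, not over the radius.
        r = radius * math.sqrt(rng.random())
        theta = 2.0 * math.pi * rng.random()
        return cx + r * math.cos(theta), cy + r * math.sin(theta)
    return sample

sampler = uniform_in_circle(0.0, 0.0, 250.0)
right_half = (0.0, -250.0, 250.0, 250.0)    # covers the right half of the circle
p_half = appearance_probability(sampler, right_half)
p_none = appearance_probability(sampler, (300.0, 300.0, 400.0, 400.0))
```

Every estimate costs thousands of pdf samples, which is why avoiding these calculations matters as much as avoiding page accesses.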

A good solution should…

Support any pdf.

Minimize the number of page accesses.

Minimize the number of appearance probability calculations.

Minimize the total cost (I/O + CPU).

Main Idea

Pre-compute some “auxiliary information” that can be used to efficiently decide whether an object appears in a region with at least a certain probability, without calculating its actual appearance probability.

Quick Examples

[Figure: two examples of an uncertainty region o.ur and a query region rq, with pq = 20%.]

Probabilistically Constrained Regions (PCR)

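[Figure: o.ur is cut by four lines l1-, l1+, l2- and l2+. The part of o.ur on the left of l1- has appearance probability 0.2, and so do the part on the right of l1+, the part below l2-, and the part above l2+. The rectangle bounded by the four lines is o.pcr(0.2).]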
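A PCR can be approximated from samples of the pdf. The sketch below assumes that o.pcr(c) is bounded, in each dimension, by the two lines that leave appearance probability c on their outer side, so each bound is a quantile of the corresponding coordinate. The function name pcr_from_samples and the list-of-(lo, hi) box encoding are hypothetical.

```python
def pcr_from_samples(points, c):
    """Approximate o.pcr(c) for 0 <= c <= 0.5 from points drawn from o's pdf.

    Returns one (lo, hi) pair per dimension: about a fraction c of the
    points lies below lo, and about a fraction c lies above hi.
    """
    n = len(points)
    dims = len(points[0])
    k = int(round(c * n))                 # number of points cut off on each side
    box = []
    for i in range(dims):
        coords = sorted(p[i] for p in points)
        lo = coords[min(k, n - 1)]
        hi = coords[max(n - 1 - k, 0)]
        if lo > hi:                       # c = 0.5: the region collapses to a point
            lo = hi = (lo + hi) / 2.0
        box.append((lo, hi))
    return box

# A 100 x 100 grid of points stands in for a uniform pdf over a square.
grid = [(x, y) for x in range(100) for y in range(100)]
pcr_02 = pcr_from_samples(grid, 0.2)
```

On this grid, 20 of the 100 columns lie on the left of the lower bound and 20 lie on the right of the upper bound, which is the 0.2 mass that each of l1- and l1+ cuts off.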

Probabilistically Constrained Regions (PCR)

For a query q with search region rq and probability pq = 0.2

Observation 1.1 (pruning)

An object o cannot satisfy q if rq does not intersect o.pcr(0.2).

[Figure: o.pcr(0.2) and query regions rq that miss it, lying on the left of l1-, on the right of l1+, or below l2-; each of these parts of o.ur has appearance probability 0.2.]

Probabilistically Constrained Regions (PCR)

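For a query q with search region rq and probability pq = 0.8 (= 1 - 0.2)

Observation 1.2 (pruning)

An object o cannot satisfy q if rq does not fully contain o.pcr(0.2).

[Figure: o.pcr(0.2) and a query region rq that does not fully contain it. The part of o.ur on the right of l1+ has appearance probability 0.2; the part on the left of l1+ has appearance probability 0.8.]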

Probabilistically Constrained Regions (PCR)

For a query q with search region rq and probability pq = 0.2

Observation 1.3 (validating)

An object o definitely satisfies q if rq fully contains the part of o.MBR on the left of l1- (or on the right of l1+, or below l2-, or above l2+).

[Figure: o.MBR, o.pcr(0.2) and rq; the part of o.MBR on the left of l1- has appearance probability 0.2.]

Probabilistically Constrained Regions (PCR)

For a query q with search region rq and probability pq = 0.8

Observation 1.4 (validating)

An object o definitely satisfies q if rq fully contains the part of o.MBR on the left of l1+ (or on the right of l1-, or below l2+, or above l2-).

[Figure: o.MBR and rq; the part of o.MBR on the right of l1+ has appearance probability 0.2, and the part on the left of l1+ has appearance probability 0.8.]

Probabilistically Constrained Regions (PCR)

For a query q with search region rq and probability pq = 0.6 (= 1 - 2 * 0.2)

Observation 1.5 (validating)

An object o must satisfy q if rq fully contains the part of o.MBR between l1- and l1+ (or between l2- and l2+).

[Figure: o.MBR and rq; the parts of o.MBR on the left of l1- and on the right of l1+ each have appearance probability 0.2, and the part between l1- and l1+ has appearance probability 0.6.]

Probabilistically Constrained Regions (PCR)

o.pcr(0.2) provides 5 heuristics to reduce CPU cost.

In general, for a prob-range query with probability threshold pq:

if pq <= 0.5

o may be pruned using o.pcr( pq ) (Observation 1.1)

o may be validated using o.pcr( pq ) (Observation 1.3)

o may be validated using o.pcr( (1 - pq)/2 ) (Observation 1.5)

if pq > 0.5

o may be pruned using o.pcr( 1 - pq ) (Observation 1.2)

o may be validated using o.pcr( 1 - pq ) (Observation 1.4)

o may be validated using o.pcr( (1 - pq)/2 ) (Observation 1.5)

pq in [0, 1] → infinite number of pq → infinite number of PCRs. Impractical!

It is possible to use a finite number of PCRs to achieve pruning and validating.
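The five observations can be combined into one filter that runs before any appearance probability is calculated. The sketch below is a minimal illustration, assuming a hypothetical callable pcr(c) that returns o.pcr(c) exactly for any c, and boxes encoded as lists of (lo, hi) pairs per dimension. For Observation 1.5 it uses c = (1 - pq)/2, the value for which the middle part has appearance probability 1 - 2c = pq, as in the pq = 0.6 example.

```python
def intersects(a, b):
    """True if boxes a and b overlap in every dimension."""
    return all(al <= bh and bl <= ah for (al, ah), (bl, bh) in zip(a, b))

def contains(outer, inner):
    """True if box outer fully contains box inner."""
    return all(ol <= il and ih <= oh for (ol, oh), (il, ih) in zip(outer, inner))

def slab(mbr, i, lo, hi):
    """Part of the MBR whose i-th coordinate lies in [lo, hi]."""
    return [(lo, hi) if j == i else side for j, side in enumerate(mbr)]

def classify(pcr, mbr, rq, pq):
    """Return 'pruned', 'validated', or 'unknown' for object o and query (rq, pq)."""
    d = len(mbr)
    if pq <= 0.5:
        box = pcr(pq)
        if not intersects(rq, box):
            return "pruned"                                   # Observation 1.1
        # Observation 1.3: parts on the low side of l_i- or the high side of l_i+.
        parts = [slab(mbr, i, mbr[i][0], box[i][0]) for i in range(d)]
        parts += [slab(mbr, i, box[i][1], mbr[i][1]) for i in range(d)]
    else:
        box = pcr(1.0 - pq)
        if not contains(rq, box):
            return "pruned"                                   # Observation 1.2
        # Observation 1.4: parts on the low side of l_i+ or the high side of l_i-.
        parts = [slab(mbr, i, mbr[i][0], box[i][1]) for i in range(d)]
        parts += [slab(mbr, i, box[i][0], mbr[i][1]) for i in range(d)]
    mid = pcr((1.0 - pq) / 2.0)
    # Observation 1.5: part between l_i- and l_i+ of o.pcr((1 - pq)/2).
    parts += [slab(mbr, i, mid[i][0], mid[i][1]) for i in range(d)]
    if any(contains(rq, part) for part in parts):
        return "validated"
    return "unknown"       # the appearance probability must be calculated

# Uniform pdf over the square [0, 10] x [0, 10]: pcr(c) has a closed form.
square_pcr = lambda c: [(10 * c, 10 - 10 * c)] * 2
square_mbr = [(0, 10), (0, 10)]
```

Only objects classified as "unknown" need the expensive numerical calculation.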

Using PCRs in a Conservative Way

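E.g., U-catalog: { 0, 0.1, 0.2, 0.3, 0.4, 0.5 }

For a query q with search region rq and probability pq = 0.25

Observation 1.1: an object o cannot satisfy q if rq does not intersect o.pcr(0.25).

Observation 2.1: an object o cannot satisfy q if rq does not intersect o.pcr(0.2).

Since 0.25 is not in the U-catalog, o.pcr(0.25) is not available. o.pcr(0.2) contains o.pcr(0.25), so a region rq that misses o.pcr(0.2) also misses o.pcr(0.25).

[Figure: o.pcr(0.2), o.pcr(0.25) and o.pcr(0.3), with a query region rq.]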

Using PCRs in a Conservative Way

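U-catalog: { 0, 0.1, 0.2, 0.3, 0.4, 0.5 }

For a query q with search region rq and probability pq = 0.75

Observation 1.2: an object o cannot satisfy q if rq does not fully contain o.pcr(0.25).

Observation 2.2: an object o cannot satisfy q if rq does not fully contain o.pcr(0.3).

Since 0.25 (= 1 - 0.75) is not in the U-catalog, o.pcr(0.25) is not available. o.pcr(0.3) lies inside o.pcr(0.25), so a region rq that fails to contain o.pcr(0.3) also fails to contain o.pcr(0.25).

[Figure: o.pcr(0.2), o.pcr(0.25) and o.pcr(0.3), with a query region rq.]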
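Observations 2.1 and 2.2 amount to rounding the desired PCR value to a catalog value in the conservative direction. A minimal sketch of that choice follows. The function name pruning_catalog_value and the small tolerance eps, which absorbs floating-point error in 1 - pq, are assumptions made for illustration.

```python
import bisect

U_CATALOG = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]

def pruning_catalog_value(pq, catalog=U_CATALOG, eps=1e-9):
    """Pick the catalog value whose PCR is used to prune for threshold pq.

    catalog must be sorted and must contain 0 and 0.5.
    """
    if pq <= 0.5:
        # Observation 2.1: largest catalog value not exceeding pq,
        # whose PCR contains o.pcr(pq).
        idx = bisect.bisect_right(catalog, pq + eps) - 1
    else:
        # Observation 2.2: smallest catalog value not below 1 - pq,
        # whose PCR lies inside o.pcr(1 - pq).
        idx = bisect.bisect_left(catalog, 1.0 - pq - eps)
    return catalog[idx]

low = pruning_catalog_value(0.25)    # the pq = 0.25 example
high = pruning_catalog_value(0.75)   # the pq = 0.75 example
```

With the six-value catalog, the two calls select the catalog values 0.2 and 0.3 used in the examples for pq = 0.25 and pq = 0.75.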

U-catalog Size m

{0, 0.5}, m = 2

{0, 0.25, 0.5}, m = 3

{0, 0.1, 0.2, 0.3, 0.4, 0.5}, m = 6

larger m → more PCRs → greater pruning/validating power → less CPU cost

larger m → higher space consumption → larger I/O cost

[Figure: axes p (ticks from 0 to 0.5) and x, labeled m = 9.]

Conservative Functional Boxes (CFB)

U-catalog: { 0, 0.1, 0.2, 0.3, 0.4, 0.5 }

o.pcr: 2m values for each dimension

o.cfbout: 4 values for each dimension

o.cfbin: 4 values for each dimension

total: 8 values for each dimension

m = 9 → 8 : 18

[Figure: o.cfbout and o.cfbin drawn against o.pcr(…) along dimension x.]

Conservative Functional Boxes (CFB)

U-catalog: { 0, 0.1, 0.2, 0.3, 0.4, 0.5 }

For a query q with search region rq and probability pq = 0.25

Observation 1.1: an object o cannot satisfy q if rq does not intersect o.pcr(0.25).

Observation 2.1: an object o cannot satisfy q if rq does not intersect o.pcr(0.2).

Observation 3.1: an object o cannot satisfy q if rq does not intersect o.cfbout(0.2).

[Figure: o.cfbout plotted against p (0 to 0.5) and x; o.cfbout(0.2) contains o.pcr(0.2); rq is shown alongside.]

Conservative Functional Boxes (CFB)

U-catalog: { 0, 0.1, 0.2, 0.3, 0.4, 0.5 }

For a query q with search region rq and probability pq = 0.75

Observation 1.2: an object o cannot satisfy q if rq does not fully contain o.pcr(0.25).

Observation 2.2: an object o cannot satisfy q if rq does not fully contain o.pcr(0.3).

Observation 3.2: an object o cannot satisfy q if rq does not fully contain o.cfbin(0.3).

[Figure: o.cfbin plotted against p (0 to 0.5) and x; o.cfbin(0.3) lies inside o.pcr(0.3); rq is shown alongside.]

Comparing CFBs with PCRs

CFBs have weaker pruning/validating power than PCRs.

But CFBs require less space than PCRs.

Using PCRs (PCR1, PCR2, ……, PCRm): 2·m·d values

Using CFBs (CFBout, CFBin): 8·d values

[Figure: o.cfbout, o.cfbin and o.pcr plotted with p (0 to 0.5) on one axis and x on the other.]
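The space comparison is simple arithmetic, shown below for the catalog sizes mentioned earlier. The function names are hypothetical, and the counts cover only the stored bounds, not any other per-object data.

```python
def pcr_values(m, d):
    """Values stored per object with m PCRs in d dimensions (2 bounds per dimension per PCR)."""
    return 2 * m * d

def cfb_values(d):
    """Values stored per object with CFBout and CFBin (4 + 4 values per dimension)."""
    return 8 * d

# Per-dimension comparison for several U-catalog sizes.
for m in (2, 3, 6, 9):
    print(f"m = {m}: PCRs need {pcr_values(m, 1)} values, CFBs need {cfb_values(1)}")
```

The CFB cost does not depend on m, so CFBs use less space than PCRs whenever m > 4.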

Finding Conservative Functional Boxes

goal: minimize

for the i-th dimension, minimize

with the following constraints:

Linear Programming: Simplex Method

[Figure: o.cfbi-out and o.cfbi+out drawn as straight lines against p (0 to 0.5) and x, with intercepts αi-out and αi+out and slope angles arctan(-βi-out) and arctan(βi+out).]
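The per-dimension fitting problem can be illustrated on a small scale. The sketch below rests on assumptions that this transcript does not confirm: each face of o.cfbout in dimension i is taken to be a linear function alpha - beta * p, the objective is taken to be the total gap to the PCR bounds at the catalog values, and the constraint is that the face stays outside every PCR bound. Instead of the Simplex method named on the slide, it swaps in brute-force enumeration of the lines through two catalog points, which is enough for a two-variable linear program.

```python
from itertools import combinations

def fit_lower_face(ps, ys, tol=1e-9):
    """Fit x = alpha - beta * p on or below every point (p, y).

    Among all such lines, return (alpha, beta) with the largest sum of
    values at the catalog values ps, i.e. the smallest total gap to ys.
    """
    best = None
    for (p1, y1), (p2, y2) in combinations(zip(ps, ys), 2):
        beta = -(y2 - y1) / (p2 - p1)      # line through the two points
        alpha = y1 + beta * p1
        if all(alpha - beta * p <= y + tol for p, y in zip(ps, ys)):
            total = sum(alpha - beta * p for p in ps)
            if best is None or total > best[0]:
                best = (total, alpha, beta)
    return best[1], best[2]

def fit_upper_face(ps, ys):
    """Mirror image: the line on or above every point with the smallest total gap."""
    alpha, beta = fit_lower_face(ps, [-y for y in ys])
    return -alpha, -beta

catalog = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]
pcr_lo = [0.0, 2.0, 3.0, 3.5, 3.8, 4.0]    # made-up lower PCR bounds per catalog value
pcr_hi = [10.0, 8.0, 7.0, 6.5, 6.2, 6.0]   # made-up upper PCR bounds per catalog value
a_lo, b_lo = fit_lower_face(catalog, pcr_lo)
a_hi, b_hi = fit_upper_face(catalog, pcr_hi)
```

For these made-up bounds the fitted faces are 8p and 10 - 8p, so the outer box at p = 0.2 is [1.6, 8.4], which contains the PCR bounds [3.0, 7.0] as a conservative box should.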

More in Our Paper

The U-tree: a dynamic index designed to accelerate prob-range queries.

Experimental Results

data space: [0, 10000]^d

uncertainty region shape: circle (sphere)

uncertainty region radius: 250

data sets:

Long Beach County (LB): 53k 2D objects, uniform pdf

California (CA): 62k 2D objects, Gaussian pdf

Aircraft: 100k 3D objects, uniform pdf

query set: 100 queries for each data set with various sizes of rq and different pq

Experimental Results

[Figure: Query performance vs. search region size (LB, pq = 0.6)]

[Figure: Query performance vs. search region size (CA, pq = 0.6)]

[Figure: Query performance vs. search region size (Aircraft, pq = 0.6)]

[Figure: Query performance vs. probability threshold (LB, qs = 1500)]

[Figure: Query performance vs. probability threshold (CA, qs = 1500)]

[Figure: Query performance vs. probability threshold (Aircraft, qs = 1500)]

Summary

A fast method for answering probabilistic range search queries.