Monitoring Threshold Functions in systems …...Monitoring Threshold Functions in systems comprised...

72
Monitoring Monitoring Threshold Functions Threshold Functions in in systems comprised of systems comprised of 1/18/2012 1 Distributed Data Streams Distributed Data Streams Daniel Keren, Haifa U Daniel Keren, Haifa U Tsachi Tsachi Sharfman Sharfman, Technion , Technion Assaf Schuster, Technion Assaf Schuster, Technion

Transcript of Monitoring Threshold Functions in systems …...Monitoring Threshold Functions in systems comprised...

Monitoring Monitoring

Threshold Functions Threshold Functions

in in

systems comprised ofsystems comprised of

1/18/2012 1

systems comprised ofsystems comprised of

Distributed Data StreamsDistributed Data Streams

Daniel Keren, Haifa UDaniel Keren, Haifa UTsachiTsachi SharfmanSharfman, Technion, TechnionAssaf Schuster, TechnionAssaf Schuster, Technion

Web Page Frequency CountsWeb Page Frequency Counts

�� Mirrored web siteMirrored web site

�� Mirrors record the frequency of requests for pagesMirrors record the frequency of requests for pages�� Detect when the global frequency of requests for a page Detect when the global frequency of requests for a page exceeds a predetermined thresholdexceeds a predetermined threshold

1/18/2012 2

Req #1

Req #2

Req #3

Air Quality MonitoringAir Quality Monitoring

�� Sensors monitoring the Sensors monitoring the concentration of airconcentration of airpollutants.pollutants.

Each sensor holds a data vector comprising Each sensor holds a data vector comprising

Quiescence1/18/2012 3

�� Each sensor holds a data vector comprising Each sensor holds a data vector comprising measured concentration of various pollutants measured concentration of various pollutants (CO(CO22, SO, SO22, O, O33, etc.)., etc.).

�� A function on the A function on the average average readings determines readings determines the Air Quality Index (AQI)the Air Quality Index (AQI)

�� Issue an alert in case the AQI exceeds a given Issue an alert in case the AQI exceeds a given threshold.threshold.

Sensor NetworksSensor Networks�� Sensors monitoring the temperature in a server Sensors monitoring the temperature in a server room (machine room, conference room, etc.)room (machine room, conference room, etc.)�� Ensure uniform temp.: monitor variance of readingsEnsure uniform temp.: monitor variance of readings

�� Alert in case variance exceeds a thresholdAlert in case variance exceeds a threshold

�� Temperature readings by Temperature readings by nn sensors sensors xx11, …, , …, xxnn

�� Each sensor holds a data vector Each sensor holds a data vector vvii = (= (xxii22, , xxii ))

TT

1/18/2012 4

�� Each sensor holds a data vector Each sensor holds a data vector vvii = (= (xxii , , xxii ))

�� The The averageaverage data vector is data vector is v v ==

�� VarVar((all sensors) = all sensors) = 2

1 1

1 1Tn n

i i

i i

x xn n

= =

∑ ∑

2

2

1 1

1 1n n

i i

i i

x xn n

= =

∑ ∑

Search Engine Search Engine

�� Distributed datacenter/warehouseDistributed datacenter/warehouse�� 1010Ks horizontal partitionsKs horizontal partitions

�� ““Our logs are larger than any other data by orders of Our logs are larger than any other data by orders of magnitude. They are our source of truthmagnitude. They are our source of truth..”” Sridhar Sridhar RamaswamyRamaswamy. . SIGMOD’SIGMOD’08 08 keynote on “Extreme Data Mining”keynote on “Extreme Data Mining”

Quiescence1/18/2012 5

RamaswamyRamaswamy. . SIGMOD’SIGMOD’08 08 keynote on “Extreme Data Mining”keynote on “Extreme Data Mining”

�� Mining the logs: Compute pairs of keywords for Mining the logs: Compute pairs of keywords for which the correlation index is highwhich the correlation index is high

�� Thousands simultaneous tasksThousands simultaneous tasks�� ““Network bandwidth is a relatively scarce resource in Network bandwidth is a relatively scarce resource in our computing environmentour computing environment”. ”. Dean and Dean and GhemawatGhemawat..MapReduceMapReduce paper,paper, OSDIOSDI’’0404

Cloud ComputingCloud Computing

�� Amazon’s Elastic Compute Cloud Amazon’s Elastic Compute Cloud –– ECEC22

�� Amazon’s Simple Storage Service Amazon’s Simple Storage Service –– SS33

Amazon Web Services » Service Health DashboardAmazon S3 Availability Event: July 20, 2008

Quiescence1/18/2012 6

Amazon S3 Availability Event: July 20, 2008Amazon S3 Availability Event: July 20, 2008

“At 8:40am PDT, error rates in all Amazon S3 datacenters began to quickly climb and our alarms went off. By 8:50am PDT, error rates were significantly elevated and very few requests were completing successfully. By 8:55am PDT, we had multiple engineers engaged and investigating the issue. Our alarms pointed at problems processing customer requests in multiple places within the system and across multiple data centers. While we began investigating several possible causes, we tried to restore system health... At 9:41am PDT, we determined that servers within Amazon S3 were having problems… By 11:05am PDT, all server-to-server communication was stopped, request processing components shut down, and the system's state cleared…. “

AdAd--Hoc Mobile PHoc Mobile P22P NetworksP Networks

Peer-to-peer network invites drivers to get connectedCarTorrent could smarten up our daily commute, reducing accidents and bringing multimedia journey data to our fingertips

•Laura Parker •The Guardian, •Thursday January 17 2008

Quiescence1/18/2012 7

“The name BitTorrent has become part of most people's day-to-day vernacular, synonymous with downloading every kind of content via the internet's peer-to-peer networks. But if a team of US researchers have their way, we may all be talking about CarTorrent in the not too distant future…..

Researchers from the University of California Los Angeles are working on a wireless communication network that will allow cars to talk to each other, simultaneously downloading information in the shape of road safety warnings, entertainment content and navigational tools….”

Quiescence1/18/2012 8

Social NetworksSocial Networks

�� Can you predict a global event?Can you predict a global event?

�� FB collecting FB collecting 100100TB/day…TB/day…

�� IBM SystemIBM System--S guys talk about millions S guys talk about millions

continuous queries…continuous queries…

Quiescence

continuous queries…continuous queries…

1/18/2012 9

Centralized AlgorithmsCentralized Algorithms

�� All data is moved to a central locationAll data is moved to a central location�� Communication overheadCommunication overhead

�� BandwidthBandwidth�� PowerPower

Quiescence1/18/2012 10

�� PowerPower�� CyclesCycles

�� Centralized resourcesCentralized resources�� Privacy issuesPrivacy issues

Distributed AlgorithmsDistributed Algorithms

�� Local filteringLocal filtering

�� Local constraintsLocal constraints

�� Local dataLocal data

Local decisionsLocal decisions

Quiescence

�� Local decisionsLocal decisions

�� Etc.Etc.

�� But do not forget to But do not forget to

maintain correctness!!!maintain correctness!!!

1/18/2012 11

Mining HUGE Mining HUGE

NetworksNetworksNetworksNetworks

•• PeerPeer--toto--PeerPeer

•• SocialSocial

•• GridGrid

•• InernetInernet--scale routersscale routers

1/18/2012 12

DataData--Centric view of Centric view of

PeerPeer--toto--Peer NetworksPeer Networks�� Opportunity for peers: Current technology trend allows Opportunity for peers: Current technology trend allows

peers to collect huge amounts of data and share itpeers to collect huge amounts of data and share it

�� Once the territory of companies onlyOnce the territory of companies only

�� Opportunity for companies: Mine the logs of POpportunity for companies: Mine the logs of P22P networks P networks to improve system operation and customer experienceto improve system operation and customer experience

�� Skype, VOD, distribution networks, etc.Skype, VOD, distribution networks, etc.

Quiescence1/18/2012

�� Skype, VOD, distribution networks, etc.Skype, VOD, distribution networks, etc.

Customer data mining Mirroring corporates'

Recommendations (unbiased) – e-Mule

Product Lifetime Cost (as opposed to CLV)

Data Mining HUGEData Mining HUGE DatabaseDatabase

�� Impossible to collect the dataImpossible to collect the data�� Internet scaleInternet scale

�� NO global operatorsNO global operators�� NO synchronizationsNO synchronizations�� NO global communicationNO global communicationNO coordinationNO coordination

Quiescence1/18/2012

NO global communicationNO global communication�� NO coordinationNO coordination

�� Ever changing data and systemEver changing data and system�� Failures, crashes, joins, departuresFailures, crashes, joins, departures�� Data modified faster than propagatedData modified faster than propagated�� Data streamsData streams

What you should not tryWhat you should not try�� DDecomposable statistics (ecomposable statistics (AvgAvg, , VarVar, cross, cross--tables, etc.) tables, etc.) ccan be calculated using an be calculated using (distributed) sum reduction(distributed) sum reduction

�� SynchronizationSynchronization

�� Bandwidth requirementsBandwidth requirements

Quiescence1/18/2012

Bandwidth requirementsBandwidth requirements

�� FailuresFailures

�� ConsistencyConsistency

�� Gossip (aka sampling)Gossip (aka sampling)

�� Rigid schemesRigid schemes

�� Hard Hard to to analyseanalyse

�� Slow to respondSlow to respond

Local AlgorithmsLocal Algorithms

�� Every peer's result Every peer's result depends on the data depends on the data gathered from a (small) gathered from a (small) environmentenvironment of peersof peers

�� SizeSize of environment of environment

1/18/2012

�� SizeSize of environment of environment may depend on the may depend on the problem/instance at problem/instance at handhand

�� Eventual Eventual correctnesscorrectnessguaranteed (assuming guaranteed (assuming stabilization)stabilization)

Local vs. CentralizedLocal vs. Centralized

Local Centralized

1/18/2012

2 link delays,doesn’t depend on the network size

16 link delays,equals to the

network diameter

Properties of Local AlgorithmsProperties of Local Algorithms

Quiescence1/18/2012

• � Scalability• � Asynchronousity• � Incrementality• � Robustness• � Energy-efficient

VotingVoting

Generalized (weighted)Generalized (weighted) majority voting: majority voting:

∑∑pp11s(p)/votes(p) > s(p)/votes(p) > λλ ((11>>λλ>>00, p, pЄЄpeers)peers)

Depends on the significance of the vote: Depends on the significance of the vote:

Quiescence1/18/2012

Depends on the significance of the vote: Depends on the significance of the vote: in a tie all votes must be countedin a tie all votes must be counted

For many applications a tie is not important: For many applications a tie is not important: Inconclusive votingInconclusive voting

Local Majority Voting (spanning tree)

X

Y

ZW

1/18/2012

�� Information propagation Information propagation –– correctnesscorrectness�� No propagation No propagation –– localitylocality

Correctness vs. LocalityCorrectness vs. Locality

X

Y

ZW

Quiescence1/18/2012

�� Rule of thumb:Rule of thumb: propagate when eitherpropagate when either1.1. neighbor needs to be persuaded, or neighbor needs to be persuaded, or 2.2. previous message to neighbor turns out to be previous message to neighbor turns out to be

potentially misleading.potentially misleading.

Quiescence1/18/2012 Intel DTTC, Haifa

LocalityLocality

1,600 nodes

All initiated at once

Quiescence1/18/2012

Local DB of 10K transactions

Locked step

Run until there are no further messages

Dynamic Behavior Dynamic Behavior –– 11MM--peerspeers

•1% noise

•At every simulator step

Quiescence1/18/2012 Intel DTTC, Haifa

48% set input bits 0.1% noise

A Decomposition A Decomposition MethodologyMethodology

1.1. Decompose data mining process Decompose data mining process into primitivesinto primitives�� Primitives are simplerPrimitives are simpler

2.2. Find local distributed algorithms Find local distributed algorithms for primitivesfor primitives

Quiescence1/18/2012

2.2. Find local distributed algorithms Find local distributed algorithms for primitivesfor primitives�� Efficient in the “common case”Efficient in the “common case”

3.3. Recompose data mining process Recompose data mining process from primitivesfrom primitives�� Maintain correctnessMaintain correctness�� Maintain localityMaintain locality�� Maintain asynchronousity Maintain asynchronousity

Distributed Shopping Store Distributed Shopping Store

�� Each store maintains statistics on last day performancesEach store maintains statistics on last day performances�� Every day the region manager would like to know the Every day the region manager would like to know the toptop--

kk sold product pairssold product pairs

1/18/2012 26Dozens Kmart stores in NY state

Product PurchasesMilk 100,000

Bread 90,000

Cheese 75,000

Beer 50,000

Diapers 40,000

Product1 Product2 PurchasesCheese Milk 85,000

Beer Diapers 30,000

Beer Milk 10,000

Product Purchases

Bread 40,000

Chocolate 35,000

Milk 30,000

Beer 20,000

Diapers 18,000

Product1 Product2 PurchasesChocolate Milk 20000

Beer Diapers 15,000

Beer Milk 8,,000

�� is is frequentfrequent if if

more than more than MinFreqMinFreq

of the transactions of the transactions

contain X and Ycontain X and Y

AssociationsAssociations

X Y

1/18/2012

contain X and Ycontain X and Y

�� is is confidentconfident if if

more than more than MinConfMinConf

of the transactions of the transactions

that contain X also that contain X also

contain Ycontain Y

X Y

•• Find that Find that

is frequentis frequent

AssociationsAssociations

1/18/2012

Apples Apples 8080%%

AssociationsAssociations

● Find that I s

frequent

● And that is

s frequent

1/18/2012

Bananas Bananas 8080%%

s frequent

AssociationsAssociations

● Find that is

frequent

● And that

Apples Bananas Apples Bananas 7575%%

is frequent

● Then compare the

frequencies of

and

● Three votings!

Activation Flow Activation Flow in in ARMARM

R R R R R R

F1

I

output

DBu

Individual

frequent

itemset

Frequent

itemsets

of size 2F2 F3

Frequent

itemsets

of size 3

Confident

rules

Quiescence

R � R� R� R� R � R�

F1

I

output

DBu

Individual

frequent

itemset

Frequent

itemsets

of size 2F2 F3

Frequent

itemsets

of size 3

Confident

rules

I1 I2 I3

I1 I2 I3

I1 I2 I3

I1 I2 I3

Network

Topology

Graph

Instance of iteration 1

Node A

Node B

Node C

Node D

Instance of iteration 3

communicates with

instance of iteration 3

on other node

Output of iteration

2 is an input

of iteration 3

Distributed IterationsDistributed Iterations

Quiescence

I1 I2 I3

I1 I2 I3

I1 I2 I3

I1 I2 I3

Network

Topology

Graph

Instance of iteration 1

Node A

Node B

Node C

Node D

Instance of iteration 3

communicates with

instance of iteration 3

on other node

Output of iteration

2 is an input

of iteration 3

Avoiding Synchronization viaAvoiding Synchronization via

SpeculationSpeculation

Node A

I� �

II

I

Su

sp

en

de

d s

ub

-pa

thNode B

I� �

I

Iteration

1

Out-Of-Context

Queue

Out-Of-Context

Queue

IX

Instance of iteration 3 in node B

does not have a peer in node A,

thus its messages go to out-of-context queue.

Each iteration instance

communicates with its peer.

2

3

Active

instance

a b

c

b

d

Quiescence10 Nov 2007

Node A

I

I� I� �

I � � �

Su

sp

en

de

d s

ub

-pa

thNode B

I

I�

Iteration

1

Out-Of-Context

Queue

Out-Of-Context

Queue

I � �

X

Instance of iteration 3 in node B

does not have a peer in node A,

thus its messages go to out-of-context queue.

Each iteration instance

communicates with its peer.

2

3

Active

instance

a b

c

b

d

Speculative executionSpeculative execution

�� If we never finish to If we never finish to compute the first step, compute the first step, how can we start the how can we start the second?second?

• We make a guess and

Quiescence1/18/2012

• We make a guess and base on it the next step.

• The guess is based on the current known data.

• Improves over time.• If at a certain point the guess turns out wrong, we backtrack and recompute.

Every node speculatesEvery node speculates1. Eventually, the

first iteration will converge to the correct result, and will be the same in all nodes.

2. Then, the second

Quiescence1/18/2012

2. Then, the second iteration will converge, and so on until all iterations are correct.

3. When all iterations are correct, every node will output the correct solution.

Properties of Properties of Speculative AlgorithmsSpeculative Algorithms

Quiescence1/18/2012

�� �� Anytime algorithmsAnytime algorithms�� �� Output stabilizationOutput stabilization�� �� Incremental adIncremental ad--hoc hoc

view of dataview of data�� �� Dynamic changesDynamic changes

The Big PictureThe Big Picture

1/18/2012

Associations Associations PerformancePerformance

Quiescence1/18/2012

44K peers, InternetK peers, Internet--like (like (BriteBrite), ), 1010K trans./peer, K trans./peer, 150 150 trans trans read/step.read/step.

By the time the database is scanned once, in parallel:By the time the database is scanned once, in parallel:•• the average peer has discovered the average peer has discovered 9595% of the rules;% of the rules;•• has less than has less than 1010% false rules.% false rules.

Association OverheadAssociation Overhead

Quiescence1/18/2012

The Geometric MethodThe Geometric Method

Central coordinationCentral coordination

1/18/2012 40

( )f x

Threshold T

( ) , ( ) , 2

x yf x T f y T f T

+ > > <

2

x y+ yx

1/18/2012 41

More interesting example: two nodes hold (local) contingency tables, More interesting example: two nodes hold (local) contingency tables, and the global contingency table is the average of the local ones. and the global contingency table is the average of the local ones. Given a table , the mutual information is defined as Given a table , the mutual information is defined as

11 1211 12

11 12 11 21 11 12 12 22

21 2221 22

21 22 11 21 21 22 12 22

( ) log log( )( ) ( )( )

log log( )( ) ( )( )

MIπ π

π π ππ π π π π π π π

π ππ π

π π π π π π π π

= + ++ + + +

++ + + +

π

21 22 11 21 21 22 12 22

1 2

0.9 0.03 0.04 0.02,

0.02 0.05 0.03 0.91π π

= =

1 21 2( ) 0.104, ( ) 0.082, 0.494

2MI MI MI

π ππ π + = = =

The mutual information of the global table is much larger than the The mutual information of the global table is much larger than the local values. As in the parabola case, there’s no way to infer about the local values. As in the parabola case, there’s no way to infer about the global MI given the local ones.global MI given the local ones.

1/18/2012 42

Mining NonMining Non--Linear FunctionsLinear Functions

“…“…

The link function is, of course, The link function is, of course, nonlinearnonlinear. So we . So we agonize over trading off optimization agonize over trading off optimization performance with ability to use the massiveperformance with ability to use the massiveinfrastructure.infrastructure.

1/18/2012 43

infrastructure.infrastructure.…”…”

Sridhar RamaswamySridhar Ramaswamy. .

SIGMOD’SIGMOD’08 08 Keynote talk on “Keynote talk on “Extreme Data Mining”Extreme Data Mining”

Slide title: “Slide title: “10 10 top reasons why googlers do not sleep at nighttop reasons why googlers do not sleep at night””(Coffee is reason #(Coffee is reason #55))

SummarySummary

�� As in As in Srikanta’sSrikanta’s talktalk…. Monitoring is:…. Monitoring is:

�� Business as usual: Business as usual: inputinput streams + streams + coordinatorcoordinator

A A global function global function over the over the average average of of �� A A global function global function over the over the average average of of the inputsthe inputs

�� A continuous query: does the function A continuous query: does the function cross a cross a given threshold???given threshold???

�� Goal: Goal: minimize communication.minimize communication.

1/18/2012 44

Discussion Discussion

�� Looking at the *domain* of the global Looking at the *domain* of the global function may be easier than looking at its function may be easier than looking at its valuevalue

�� Local inputs commonly stationary; rarely: Local inputs commonly stationary; rarely: “phase change”“phase change”

Quiescence

“phase change”“phase change”

�� ��Compile LOCAL constraints as indicators Compile LOCAL constraints as indicators to function output getting too close to to function output getting too close to thresholdthreshold�� Not timeNot time--based,based, because of the tradeoff because of the tradeoff ��

�� Not count base, because of realNot count base, because of real--time time ��

�� (Sorry, (Sorry, SrikantaSrikanta))1/18/2012 45

Geometric ApproachGeometric Approach

�� Geometric Interpretation:Geometric Interpretation:�� Each node holds a statistics vectorEach node holds a statistics vector

�� Coloring the vector space Coloring the vector space �� Grey:: function > thresholdGrey:: function > threshold

White:: function <= thresholdWhite:: function <= threshold

1/18/2012 46

�� White:: function <= thresholdWhite:: function <= threshold

�� Goal: determine color of global data vector Goal: determine color of global data vector (average).(average).

Bounding the Convex HullBounding the Convex Hull

1/18/2012 47

�� Observation: average is in the convex hullObservation: average is in the convex hull

�� If convex hull monochromatic then average tooIf convex hull monochromatic then average too

�� Problem Problem –– convex hull may become largeconvex hull may become large

Drift VectorsDrift Vectors

�� Periodically calculate an Periodically calculate an estimate vectorestimate vector -- the the current global current global

1 1 1

1 1

( )

( )

n n nknown

i i ii i i

i

n n

i ii i

v v vAvg v

n n n

v e ve

n n

= = =

= =

∆= = + =

∆ + ∆= + =

∑ ∑ ∑

∑ ∑

r r r

r r r

r

1/18/2012 48

current global current global

�� Each node maintains a Each node maintains a drift vectordrift vector –– the change in the change in the local statistics vector since the last time the the local statistics vector since the last time the estimate vector was calculatedestimate vector was calculated

�� Global average statistics vector is also the Global average statistics vector is also the average of the drift vectorsaverage of the drift vectors

The Bounding TheoremThe Bounding Theorem

�� A reference point is A reference point is known to all nodesknown to all nodes

�� Each vertex constructs a Each vertex constructs a spheresphere

1/18/2012 49

spheresphere

�� Theorem: convex hull is Theorem: convex hull is bounded by the union of bounded by the union of spheresspheres

�� �� Local constraints!Local constraints!

Proofs of the bounding theorem:Proofs of the bounding theorem:

SIGMODSIGMOD06 06 –– induction on the dimension.induction on the dimension.

MichaMicha SharirSharir –– induction on number of points.induction on number of points.

Yuri Rabinovich Yuri Rabinovich –– uses the following observation: uses the following observation:

zz is not in the sphere supported by is not in the sphere supported by x,yx,y iffiff ((zz--xx,,zz--yy)>)>00..

50

z

y

x

( )

ion.contradict a ),( )(But

0))(),(( Hence

0),( ,,(if

0)()(

),(conv

zyzx

zyzx

zyzxiyxBz

zyzx

zyxyxz

ii

ii

ii

ii

iii

−−=−

>−−

>−−∀∉

=−+−

=+⇒∈

λλ

λλ

λλ

λλ

U

Basic AlgorithmBasic Algorithm

�� An initial estimate vector An initial estimate vector is calculatedis calculated

�� Nodes check color of Nodes check color of drift spheresdrift spheres�� Drift vector is the Drift vector is the diameter of the drift diameter of the drift

1/18/2012 51

Drift vector is the Drift vector is the diameter of the drift diameter of the drift spheresphere

�� If any sphere non If any sphere non monochromatic: node monochromatic: node triggers retriggers re--calculation of calculation of estimate vectorestimate vector

Reuters Corpus (RCVReuters Corpus (RCV11--vv22))

Information Gain vs. Document Index Broadcast Messages vs. Threshold

�� 800800,,000000+ news stories + news stories

�� Aug Aug 20 1996 20 1996 ---- Aug Aug 19 199719 1997

�� Corporate/Industrial tagging simulates spamCorporate/Industrial tagging simulates spam

1/18/2012 52

Information Gain vs. Document Index

0

0.001

0.002

0.003

0.004

0.005

0.006

0.007

0.008

0 200000 400000 600000 800000Document Index

Info

rmat

ion

Gai

n

bosnia

ipo

febru

Broadcast Messages vs. Threshold

0

100

200

300

400

500

600

700

800

0 0.001 0.002 0.003 0.004 0.005 0.006Threshold

Bro

adca

st M

essa

ges

(x10

00)

bosnia

ipo

febru

Naive Alg.

n=10

TradeTrade--off: Accuracy vs. Performanceoff: Accuracy vs. Performance

�� Inefficiency: value of Inefficiency: value of function on average function on average is close to the is close to the thresholdthreshold

�� Performance can be Performance can be enhanced at the cost enhanced at the cost

1/18/2012 53

�� Performance can be Performance can be enhanced at the cost enhanced at the cost of less accurate of less accurate result: result:

�� Set error margin Set error margin around the threshold around the threshold valuevalue

Broadcast Messages vs. Error Margin

0

50

100

150

200

250

0% 10% 20% 30% 40% 50%Error Margin

Bro

adca

st M

essa

ges

(x10

00) bosnia

ipo

febru

Performance AnalysisPerformance Analysis

1/18/2012 54

Performance Analysis (cntd.)Performance Analysis (cntd.)

[ ]E vr

ivr

[ ]E vr

( [ ( )], ( ))iB E v t v tr r

1/18/2012 55

globa

l

D

( [ ( )], ( ))iB E v t v t

BalancingBalancing�� Globally calculating Globally calculating

average is costly average is costly

�� Often possible to Often possible to average only average only somesome of of the data vectors.the data vectors.

n

1/18/2012 56

the data vectors.the data vectors.

1

1

1

( )( )

0

( )( )

n

ii

i

n

i

n

i ii

i

e vAvg v

n

e vAvg v

n

δ

δ

=

=

+ ∆=

=

+ ∆ +=

r r

r

c

rr r

Shape SensitivityShape Sensitivity

�� Fitting cover to DataFitting cover to Data

�� Fitting cover to threshold surfaceFitting cover to threshold surface

�� Specific function classes Specific function classes

Quiescence1/18/2012 57

Fitting Cover to DataFitting Cover to Data(using the covariance matrix)(using the covariance matrix)

Quiescence1/18/2012 58

Fitting Cover to Threshold Surface Fitting Cover to Threshold Surface ----

Reference Vector SelectionReference Vector Selection

1/18/2012 59

Distance FieldsDistance Fields

Skeleton, Medial Axis

r

1/18/2012 60

})(|min{)(, ryfxyxF rf =−= rrrr

−1

−0.8

−0.6

−0.4

−0.2

0

0.2

0.4

0.6

0.8

1

−0.5

0

0.5

1

1.5

0

0.2

0.4

0.6

0.8

1

maxr

*e*en

e

1/18/2012 61

Results Results ––

Shape SensitivityShape Sensitivity

Chi-Square vs. Document Index

0

100

200

300

400

500

600

700

0 100000 200000 300000 400000 500000 600000 700000Document Index

Ch

i-S

quar

e

ipo

bosnia

febru

Messages vs. Threshold - bosnia

1.0E+6

1.0E+7M

essa

ges

(log

scal

e)

1/18/2012 62

1.0E+1

1.0E+2

1.0E+3

1.0E+4

1.0E+5

30 80 130 180 230 280 330 380 430 480

Threshold

Mes

sage

s (lo

g sc

ale)

Spheres

Ellispoids

Spheres-Internal

Ellipsoids-Internal

Theoretic Optimal

Distributed TopDistributed Top--KK

Distributed Search Engine ExampleDistributed Search Engine Example

�� Each node maintains local statistics on last day queriesEach node maintains local statistics on last day queries�� Every day the search engine manager would like to mine Every day the search engine manager would like to mine

the logs for the topthe logs for the top--k appearing keyword pairsk appearing keyword pairs

Quiescence1/18/2012 63

Word Count

Total 5000

Sell 3500

Price 2000

Day 1750

Election 1200

Word1 Word2 CountTotal Sell 2900

Day Election 1000

Price Total 900

Word Count

Total 1000

DVD 850

Sell 800

Insurance 620

Price 500

Word1 Word2 CountTotal Sell 640

DVD Price 500

Insurance Price 450

�� Value in the range [Value in the range [--11,,11]]

� >0 - indicates that the terms tend to appear in the same queries

� =0 - indicates that there is no correlation between terms

� <0 - indicates that the terms tend to exclude each other

Restrict to

Correlation Coefficient FunctionCorrelation Coefficient Function

))(( 22BBAA

BAABAB

ffff

fff

−−

−=ρ

1/18/2012 64

� Restrict to

� Monotonically decrease with

� Monotonically increase with

),min( BAAB fff ≤

BA ff ,

ABf

Queries A B A & B

Node1 1000 100 100 19 0.1 0.1 0.019 0.1

Node2 1000 400 400 184 0.4 0.4 0.184 0.1

Global 2000 500 500 203 0.25 0.25 0.1015 0.208

iABf ,

Af

iBf ,iAf ,

ABfBf

iAB,ρ

ABρ

A FourA Four--Phase ApproachPhase Approach

�� Phase I Phase I –– Determine a lower bound on the Determine a lower bound on the score of the topscore of the top--kk objectsobjects�� By collecting local contendersBy collecting local contenders

�� Phase II Phase II –– Improve the lower boundImprove the lower bound�� By collecting all local contendersBy collecting all local contenders

Quiescence1/18/2012 65

�� By collecting all local contendersBy collecting all local contenders

�� Phase III Phase III –– Determine candidates by locally Determine candidates by locally prune out objects according to lower boundprune out objects according to lower bound�� Use geometry to derive local constraints and local Use geometry to derive local constraints and local

upper boundsupper bounds�� Use domination relations to avoid access whole DBUse domination relations to avoid access whole DB

�� Phase IV Phase IV –– Determine topDetermine top--kk objects among objects among candidatescandidates

Determining Local Upper BoundsDetermining Local Upper Bounds

�� The local upper bound The local upper bound uuj,ij,i at the node at the node ppii for for the object the object oojj is is determined as follows:determined as follows:�� The local statistics The local statistics

1,jxr

2,jxr

3,jxr

1/18/2012 66

�� The local statistics The local statistics vector for each object vector for each object creates a sphere whose creates a sphere whose diameter is distance to diameter is distance to the originthe origin

�� uuj,ij,i is the maximum score is the maximum score received within the received within the spheresphere

4,jxr

refxr

Reducing I/O Costs Reducing I/O Costs --

Domination RelationsDomination Relations

�� If dominates , then for If dominates , then for

any monotone function any monotone function f f

�� Theorem: Let Theorem: Let

xr

yr

)()( yfxfrr ≥

yr

2

3

4

5

6

7

8

9

10

11

ab

c

d

e

h

fg

l

j

k

ib dom a

g dom c, e, f, h

k dom c, h

iax ,r

1/18/2012 67

�� Theorem: Let Theorem: Let

dominate . Then dominate . Then

uua,ia,i≥≥uub,i b,i ..

0

1

2

0 2 4 6 8 10

liax ,

ibx ,r

iax ,r

ibx ,r

refxr

'zr

zr

)(zrδ

)(zrδ

ResultsResults

� Queries collected from a search engine over a period of 3 months.(http://gregsadetsky.com/aol-data/)

KeywordsKeywords ScoreScore

Quiescence1/18/2012 68

KeywordsKeywords ScoreScore

donnie, mcclurkindonnie, mcclurkin 0.923280.92328

duff, hilaryduff, hilary 0.858930.85893

las, vegaslas, vegas 0.854880.85488

estate, realestate, real 0.839950.83995

CommunicationCommunication

Communication Cost

1.0E+05

1.0E+06

1.0E+07

Co

mm

. co

st (

log

. sca

le) For LARGE datasets

the saving reaches four orders of

1/18/2012 69

1.0E+02

1.0E+03

1.0E+04

0 20 40 60 80 100

Num. of Nodes

Co

mm

. co

st (

log

. sca

le)

NaiveBasic (Top-100)Scaled (Top-100)Basic (Top-10)Scaled (Top-10)

the saving reaches four orders of magnitude for any K

ConclusionsConclusions

�� Local filtering as a first choice in largeLocal filtering as a first choice in large--scale distributed data stream systemsscale distributed data stream systems

�� Not necessary to sacrifice precisionNot necessary to sacrifice precision

�� Saving is unlimitedSaving is unlimitedBounded only by the size of the data over system Bounded only by the size of the data over system

Quiescence

�� Bounded only by the size of the data over system Bounded only by the size of the data over system lifetimelifetime

�� Facilitates high scalability, Autonomous Facilitates high scalability, Autonomous operation, and Privacyoperation, and Privacy

�� Threshold is a building block for many Threshold is a building block for many data mining (machine learning) algorithmsdata mining (machine learning) algorithms

1/18/2012 70

IssuesIssues

�� Heuristic, not a theoryHeuristic, not a theory

�� Computation complexity in nodesComputation complexity in nodes

�� Convex Hull a lower bound to efficiency, Convex Hull a lower bound to efficiency, is it really necessary?is it really necessary?

Quiescence

is it really necessary?is it really necessary?

1/18/2012 71

Questions?Questions?

Quiescence1/18/2012 72