FAUST Oblique Analytics (based on the dot product, o). Given a table, X(X 1..X n ), |X|=N and...

Given a table, X(X_1..X_n), |X|=N and vectors, D=(D_1..D_n), FAUST Oblique employs the ScalarPTreeSets (SPTS) of the valueTrees, $X\circ D \equiv \sum_{k=1..n} X_k D_k$ (the dot product "o" is written $\circ$ in formulas).

FC (FAUST Count Change clusterer): Choose Density (DT), DensityUniformity (DUT) and PrecipitousCountChange (PCCT) thresholds.

If DT (and DUT) are not exceeded at a cluster C, partition C by cutting at each gap and/or PCC in $C\circ D$ using nextD.

FCG cuts in the middle of $C\circ D$ gaps (only). (This is the old version. It might be faster, but it usually chokes on big data.)

FCP cuts at PCCs (gaps are PCC-cuts, of course).
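The two cutting rules can be sketched on an uncompressed NumPy table. This is only an illustration under assumptions: the function names, the fixed-width binning, and the relative-change test used for a "precipitous" count change are hypothetical choices, not definitions taken from the slides.

```python
import numpy as np

def fcg_cuts(X, D, min_gap):
    """FCG-style cuts: midpoints of gaps of width >= min_gap in the values of X o D."""
    proj = np.unique(X @ D)                  # sorted distinct projection values
    widths = np.diff(proj)
    return [float((proj[i] + proj[i + 1]) / 2) for i in np.nonzero(widths >= min_gap)[0]]

def fcp_cuts(X, D, bin_width, pcct):
    """FCP-style cuts (assumed rule): cut at a bin boundary when the count changes
    by at least the fraction pcct of the larger of the two adjacent bin counts."""
    proj = X @ D
    lo = proj.min()
    counts = np.bincount(((proj - lo) // bin_width).astype(int))
    cuts = []
    for i in range(len(counts) - 1):
        a, b = int(counts[i]), int(counts[i + 1])
        if max(a, b) > 0 and abs(a - b) >= pcct * max(a, b):
            cuts.append(float(lo + (i + 1) * bin_width))
    return cuts

def partition(X, D, cuts):
    """Sub-cluster index of every row, given the cut points along D."""
    return np.searchsorted(np.sort(cuts), X @ D, side="right")

X = np.array([[0, 0], [1, 0], [0, 1], [10, 10], [11, 10]])
D = np.array([1, 1])
cuts = fcg_cuts(X, D, min_gap=5)
labels = partition(X, D, cuts)
```

With the binned rule a gap shows up as a count drop to zero followed by a count rise, so it produces cuts at both edges of the gap rather than in its middle.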

Outlier Mining: Find the top k objects by dissimilarity from the rest of the objects. This might mean:

1.a Find {x_h | h=1..k} such that x_h maximizes distance(x_h, X-{x_j | j≤h})

1.b Find the top set of k objects, S_k, that maximizes distance(X-S_k, S_k)

2. Given a Training Set, X, identify outliers in each class (correctly classified but noticeably dissimilar from classmates), or fuzzy cluster X, i.e., assign a weight for each (object, cluster) pair. Then x is an outlier iff w(x,k) < OutlierThreshold for all k.

3. Examine individual new samples for outlierhood, assuming they come in after normalcy has been established by 1 or 2.

FDO (FAUST Distance-based Outlier Miner) uses $D^2NN$ = SquareDistance(x, X-{x}) = rankN$\big((x-X)\circ(x-X)\big)$. $D^2NN$ provides an instantaneous k-slider for 1.a (useful for the others too).

Instantaneous? UDR on $D^2NN$ takes $\log_2 n$ time (and is a 1-time calculation), then a k-slider works instantaneously off that distribution. There is no need to sort $D^2NN$.
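As a point of reference, the quantity itself can be computed by brute force on a small table. The sketch below is not the pTree/UDR method described above: it uses a dense NumPy distance matrix and an explicit sort, and the names `d2nn` and `k_slider` are hypothetical.

```python
import numpy as np

def d2nn(X):
    """Squared distance from each row x to its nearest neighbour in X - {x}."""
    X = np.asarray(X, dtype=float)
    diff = X[:, None, :] - X[None, :, :]
    sq = np.einsum("ijk,ijk->ij", diff, diff)   # (x - X) o (x - X) for every x
    np.fill_diagonal(sq, np.inf)                # exclude x itself
    return sq.min(axis=1)

X = [[0, 0], [1, 0], [0, 1], [10, 10]]
d2 = d2nn(X)                                    # one-time construction
order = np.argsort(-d2, kind="stable")          # rows by decreasing D2NN

def k_slider(k):
    """Top-k outlier candidates; any k is answered from the precomputed order."""
    return order[:k].tolist()
```

Once `d2` and `order` exist, moving k only changes the slice, which is the sense in which the slider costs nothing extra.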

NextD is a sequence of D's, used when recursively partitioning X into clusters (constructing a Cluster Dendrogram for X), e.g.:

a. recursively, take the diagonal maximizing Standard Deviation, $STD(C\circ D)$ [or maximizing $STD(C\circ D)/Spread(C\circ D)$].

b. recursively, take the $AM(C\circ D)$ = Avg-to-Median; $AFFA(C\circ D)$ = Avg-FurthestFromAvg; $FFAFFFFA(C\circ D)$ = FFA-FurthFromFFA.

c. recursively cycle thru diagonals: e_1,...,e_n, e_1e_2, .. or cycle thru AM, AFFA, FFAFFFFA or cycle through both sets.
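The candidate directions named above can be illustrated as follows. This is a sketch under assumptions: the median is taken coordinate-wise, "furthest" means largest Euclidean distance, every direction is normalised to unit length, and all function names are hypothetical.

```python
import numpy as np

def unit(v):
    """Scale v to unit length (zero vectors are returned unchanged)."""
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

def furthest_from(C, p):
    """The row of C with the largest Euclidean distance from the point p."""
    return C[np.argmax(((C - p) ** 2).sum(axis=1))]

def am(C):
    """Avg-to-Median direction (coordinate-wise median assumed)."""
    return unit(np.median(C, axis=0) - C.mean(axis=0))

def affa(C):
    """Avg to Furthest-From-Avg direction."""
    avg = C.mean(axis=0)
    return unit(furthest_from(C, avg) - avg)

def ffa_fffa(C):
    """Furthest-From-Avg to the point furthest from it."""
    ffa = furthest_from(C, C.mean(axis=0))
    return unit(furthest_from(C, ffa) - ffa)

def max_std_axis(C):
    """Coordinate direction e_k with the largest standard deviation (option a, axes only)."""
    e = np.zeros(C.shape[1])
    e[np.argmax(C.std(axis=0))] = 1.0
    return e

C = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [9.0, 0.0]])
```

A cycling strategy as in option c would simply rotate through a list such as `[am, affa, ffa_fffa]` at successive recursion levels.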

FP (FAUST Polygon) for k-class classification, k=1.. ; X_{n+1} = ClassLabel. For each D in Dset, $l_{D,k} \equiv \min(C_k\circ D)$ (1st PCI?); $h_{D,k} \equiv \max(C_k\circ D)$ (last PCD?)

y is declared to be class=k iff $y \in Hull_k$, where $Hull_k = \{z \mid l_{D,k} \le D\circ z \le h_{D,k}\ \text{for all } D\}$.

(If y is in multiple hulls, $H_{i_1}..H_{i_h}$, y isa $C_k$ for the k maximizing OneCount{$P_{C_k}$ & $P_{H_{i_1}}$ .. & $P_{H_{i_h}}$}, or fuzzy classify using those OneCounts as k-weights.)
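A minimal sketch of the hull model, assuming an uncompressed NumPy table and a Dset given as the rows of a matrix. The names `fit_hulls` and `classify` are hypothetical, and the OneCount tie-break for points lying in several hulls is left out.

```python
import numpy as np

def fit_hulls(X, y, Dset):
    """For every class k and every D in Dset, record l_{D,k} = min(C_k o D)
    and h_{D,k} = max(C_k o D)."""
    proj = X @ Dset.T                    # column d holds X o D_d
    hulls = {}
    for k in np.unique(y):
        Pk = proj[y == k]
        hulls[k.item()] = (Pk.min(axis=0), Pk.max(axis=0))
    return hulls

def classify(sample, hulls, Dset):
    """Return every class k whose hull contains the sample."""
    p = Dset @ sample
    return [k for k, (lo, hi) in hulls.items() if np.all((lo <= p) & (p <= hi))]

X = np.array([[0, 0], [1, 1], [5, 5], [6, 6]])
y = np.array([0, 0, 1, 1])
Dset = np.array([[1, 0], [0, 1], [1, 1]])
hulls = fit_hulls(X, y, Dset)
```

An empty result means the sample falls outside every hull, which is how the same model can serve as a 1-class (outlier) test.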

FCO uses FC as an outlier miner. It identifies and removes large clusters using FCP, so outliers reveal themselves.

Dset is a set of Ds used to build a model for fast classification (1-class or k-class) by circumscribing each class with a hull. The larger the Dset the better (for accuracy). For each D there is, however, the 1-time construction cost of $l_{D,k}$ and $h_{D,k}$.

Dset should include $DA_{i,j}$, the vector connecting Avg(C_i) and Avg(C_j), i>j=1..k [and also the Median connectors?]. Should Dset include all D in nextD?

(Note: The old version used Dset = {$DA_{i,j}$ | i>j=1..k} only.)

$$
\begin{aligned}
X\circ D=\sum_{k=1..n}X_k D_k
={}&2^{2B}\sum_{k=1..n}\big(D_{k,B}\,p_{k,B}\big)\\
+{}&2^{2B-1}\sum_{k=1..n}\big(D_{k,B}\,p_{k,B-1}+D_{k,B-1}\,p_{k,B}\big)\\
+{}&2^{2B-2}\sum_{k=1..n}\big(D_{k,B}\,p_{k,B-2}+D_{k,B-1}\,p_{k,B-1}+D_{k,B-2}\,p_{k,B}\big)\\
+{}&2^{2B-3}\sum_{k=1..n}\big(D_{k,B}\,p_{k,B-3}+D_{k,B-1}\,p_{k,B-2}+D_{k,B-2}\,p_{k,B-1}+D_{k,B-3}\,p_{k,B}\big)\\
+{}&\cdots\\
+{}&2^{3}\sum_{k=1..n}\big(D_{k,3}\,p_{k,0}+D_{k,2}\,p_{k,1}+D_{k,1}\,p_{k,2}+D_{k,0}\,p_{k,3}\big)\\
+{}&2^{2}\sum_{k=1..n}\big(D_{k,2}\,p_{k,0}+D_{k,1}\,p_{k,1}+D_{k,0}\,p_{k,2}\big)\\
+{}&2^{1}\sum_{k=1..n}\big(D_{k,1}\,p_{k,0}+D_{k,0}\,p_{k,1}\big)\\
+{}&2^{0}\sum_{k=1..n}\big(D_{k,0}\,p_{k,0}\big)
\end{aligned}
$$

Worked example (B=1, n=2): $X\circ D=\sum_{k=1,2}X_k D_k$ with result pTrees $q_N..q_0$, where $N \le 2B+1+\lceil\log_2 n\rceil$.

X1 X2 | p1,1 p1,0 p2,1 p2,0
 1  1 |  0    1    0    1
 3  0 |  1    1    0    0
 2  1 |  1    0    0    1

$$
X\circ D = 2^{2}\big(D_{1,1}p_{1,1}+D_{2,1}p_{2,1}\big)
+2^{1}\big(D_{1,1}p_{1,0}+D_{1,0}p_{1,1}+D_{2,1}p_{2,0}+D_{2,0}p_{2,1}\big)
+2^{0}\big(D_{1,0}p_{1,0}+D_{2,0}p_{2,0}\big)
$$

Columns below are written top row first, e.g. p1,0 = 110.

D = (1, 2), so D1,1 D1,0 = 0 1 and D2,1 D2,0 = 1 0:
q0 = p1,0 = 110 (no carry)
raw1 = p1,1 + p2,0 = 011 + 101 = (1,1,2), so q1 = 110 and carry1 = 001
q2 = carry1 = 001 (no carry)
Result: X o D = (3, 3, 4).

D = (3, 3), so D1,1 D1,0 = 1 1 and D2,1 D2,0 = 1 1:
raw0 = p1,0 + p2,0 = 110 + 101 = (2,1,1), so q0 = 011 and carry0 = 100
raw1 = p1,0 + p1,1 + p2,0 + p2,1 = (2,2,2); carry0 + raw1 = (3,2,2), so q1 = 100 and carry1 = 111
raw2 = p1,1 + p2,1 = (0,1,1); carry1 + raw2 = (1,2,2), so q2 = 100 and carry2 = 011
q3 = carry2 = 011 (no carry)
Result: X o D = (6, 9, 9).

A carryTree is a valueTree (vTree), as is the rawTree at each level (rawTree = valueTree before the carry is included). In what form is it best to carry the carryTree over (for speediest processing)?

1. As multiple pTrees added at the next level? (since the pTrees at the next level are in that form and need to be added)

2. As a SPTS, s1? (next level rawTree = SPTS, s2, then s1,0 & s2,0 = q_next_level and carry_next_level?)
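The level-by-level scheme (raw sum at each bit level, plus the carry from the level below) can be sketched as follows. Assumptions: pTrees are stood in for by uncompressed 0/1 NumPy arrays, the carry is kept as a plain value column (option 2 in spirit), and the 3-row table with D = (3, 3) is a small illustrative input. The function names are hypothetical.

```python
import numpy as np

def bit_slices(col, nbits):
    """[p_0, ..., p_{nbits-1}]: one 0/1 array per bit position of an integer column."""
    col = np.asarray(col)
    return [(col >> b) & 1 for b in range(nbits)]

def dot_product_slices(X, D, nbits):
    """Bit slices q_0, q_1, ... of X o D, using only additions of selected slices."""
    X = np.asarray(X)
    n_rows, n_cols = X.shape
    p = [bit_slices(X[:, k], nbits) for k in range(n_cols)]
    d = [[(int(D[k]) >> b) & 1 for b in range(nbits)] for k in range(n_cols)]
    q, carry, j = [], np.zeros(n_rows, dtype=np.int64), 0
    while j <= 2 * (nbits - 1) or carry.any():
        raw = np.zeros(n_rows, dtype=np.int64)
        for k in range(n_cols):
            for i in range(nbits):
                b = j - i
                if 0 <= b < nbits and d[k][i]:
                    raw += p[k][b]       # the D bit is 1, so the slice is simply added
        total = raw + carry
        q.append(total & 1)              # result slice at level j
        carry = total >> 1               # value column carried to level j + 1
        j += 1
    return q

X = [[1, 1], [3, 0], [2, 1]]
q = dot_product_slices(X, D=[3, 3], nbits=2)
values = sum(q[j] << j for j in range(len(q)))
```

Keeping the carry as a value column is only the simplest option for a sketch; whether it is the fastest form on compressed pTrees is exactly the open question raised above.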

FC Clusterer: If DT (and/or DUT) are not exceeded at C, partition C further by cutting at each gap and PCC in $C\circ D$.

For a table X(X_1...X_n), the SPTS, X_k*D_k is the column of numbers, x_k*D_k. $X\circ D$ is the sum of those SPTSs, $\sum_{k=1..n} X_k D_k$.

$$
\begin{aligned}
X_k D_k &= D_k\sum_{b}2^{b}p_{k,b} = 2^{B}D_k\,p_{k,B}+\dots+2^{0}D_k\,p_{k,0}\\
&= D_k\big(2^{B}p_{k,B}+\dots+2^{0}p_{k,0}\big)
 = \big(2^{B}p_{k,B}+\dots+2^{0}p_{k,0}\big)\big(2^{B}D_{k,B}+\dots+2^{0}D_{k,0}\big)\\
&= 2^{2B}\big(D_{k,B}\,p_{k,B}\big)
 + 2^{2B-1}\big(D_{k,B-1}\,p_{k,B}+D_{k,B}\,p_{k,B-1}\big)
 + \dots + 2^{0}D_{k,0}\,p_{k,0}
\end{aligned}
$$

So, DotProduct involves just multi-operand pTree addition (no SPTSs and no multiplications): each $D_{k,i}$ is a single bit, so each term $D_{k,i}\,p_{k,b}$ is either $p_{k,b}$ or 0. Engineering shortcut tricks would be huge!!!

FO: For a table X(X_1...X_n), $D^2NN$ yields a 1.a-type outlier detector (top k objects, x, by dissimilarity from X-{x}).

We install in $D^2NN$ each min[$D^2NN(x)$]. (It's a one-time construction but for a trillion x's it's slow. Parallelization?)

$$
\begin{aligned}
D^2NN(x) &= \sum_{k=1..n}(x_k-X_k)(x_k-X_k)
 = \sum_{k=1..n}\Big(\sum_{b=B..0}2^{b}x_{k,b}-2^{b}p_{k,b}\Big)\Big(\sum_{b=B..0}2^{b}x_{k,b}-2^{b}p_{k,b}\Big)\\
&= \sum_{k=1..n}\Big(\sum_{b=B..0}2^{b}(x_{k,b}-p_{k,b})\Big)\Big(\sum_{b=B..0}2^{b}(x_{k,b}-p_{k,b})\Big),
 \qquad a_{k,b}\equiv x_{k,b}-p_{k,b}\\
&= \sum_{k}\big(2^{B}a_{k,B}+2^{B-1}a_{k,B-1}+\dots+2^{1}a_{k,1}+2^{0}a_{k,0}\big)\big(2^{B}a_{k,B}+2^{B-1}a_{k,B-1}+\dots+2^{1}a_{k,1}+2^{0}a_{k,0}\big)\\
&= \sum_{k}\Big(2^{2B}a_{k,B}a_{k,B}\\
&\qquad+2^{2B-1}\big(a_{k,B}a_{k,B-1}+a_{k,B-1}a_{k,B}\big)
 &&\{\text{which is } 2^{2B}a_{k,B}a_{k,B-1}\}\\
&\qquad+2^{2B-2}\big(a_{k,B}a_{k,B-2}+a_{k,B-1}a_{k,B-1}+a_{k,B-2}a_{k,B}\big)
 &&\{\text{which is } 2^{2B-1}a_{k,B}a_{k,B-2}+2^{2B-2}a_{k,B-1}^{2}\}\\
&\qquad+2^{2B-3}\big(a_{k,B}a_{k,B-3}+a_{k,B-1}a_{k,B-2}+a_{k,B-2}a_{k,B-1}+a_{k,B-3}a_{k,B}\big)
 &&\{2^{2B-2}(a_{k,B}a_{k,B-3}+a_{k,B-1}a_{k,B-2})\}\\
&\qquad+2^{2B-4}\big(a_{k,B}a_{k,B-4}+a_{k,B-1}a_{k,B-3}+a_{k,B-2}a_{k,B-2}+a_{k,B-3}a_{k,B-1}+a_{k,B-4}a_{k,B}\big)
 &&\{2^{2B-3}(a_{k,B}a_{k,B-4}+a_{k,B-1}a_{k,B-3})+2^{2B-4}a_{k,B-2}^{2}\}\\
&\qquad+\dots\Big)\\
&= \sum_{k}\Big(2^{2B}\big(a_{k,B}^{2}+a_{k,B}a_{k,B-1}\big)
 +2^{2B-1}\big(a_{k,B}a_{k,B-2}\big)
 +2^{2B-2}\big(a_{k,B-1}^{2}+a_{k,B}a_{k,B-3}+a_{k,B-1}a_{k,B-2}\big)\\
&\qquad+2^{2B-3}\big(a_{k,B}a_{k,B-4}+a_{k,B-1}a_{k,B-3}\big)
 +2^{2B-4}a_{k,B-2}^{2}+\dots\Big)
\end{aligned}
$$

Does D2NN involve just multi-operand pTree addition? (or SPTSs, multiplication)

Notes: When $x_{k,b}=1$, $a_{k,b}=p'_{k,b}$, and when $x_{k,b}=0$, $a_{k,b}=-p_{k,b}$. So D2NN has just multi-op pTree multiplications/additions/subtractions! Of course, each entry in D2NN (each x in X) is a separate [parallelizable] calculation.

Should we pre-compute all $p_{k,i}*p_{k,j}$, $p'_{k,i}*p'_{k,j}$, $p_{k,i}*p'_{k,j}$?

Is subtraction just a matter of flipping sign bit and adding, Md?
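The substitution in the notes (a = p' when the bit of x is 1, a = -p when it is 0) can be checked numerically. The sketch below assumes non-negative integer data and uncompressed 0/1 NumPy arrays in place of pTrees; `sq_dist_column` is a hypothetical name and the 3-row table is illustrative input.

```python
import numpy as np

def sq_dist_column(X, x, nbits):
    """Column of squared distances (x - X) o (x - X), built from bit slices only."""
    X = np.asarray(X)
    x = np.asarray(x)
    total = np.zeros(X.shape[0], dtype=np.int64)
    for k in range(X.shape[1]):
        diff = np.zeros(X.shape[0], dtype=np.int64)
        for b in range(nbits):
            p = (X[:, k] >> b) & 1                       # bit slice p_{k,b}
            a = (1 - p) if (x[k] >> b) & 1 else -p       # p' if the x bit is 1, else -p
            diff += a * (1 << b)
        total += diff * diff
    return total

X = np.array([[1, 1], [3, 0], [2, 1]])
col = sq_dist_column(X, X[0], nbits=2)
d2nn_x = int(np.sort(col)[1])      # second smallest value: skips the zero for x itself
```

Each x gives an independent column, so the loop over x is the part that parallelises trivially.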

The U.S. Library of Congress is archiving all tweets sent since 2006. USLOCTweetTable may have 1 million trillion rows and 50 columns. Volume: 172 billion tweets in 2013 alone (~300 each from 500 million tweeters). Currently > 20 million tweets/hour, 24 hours/day, seven days/week. A tweet is 140 characters. There are 50 fields (Who wrote it, Where, When, To Whom, ...)

Enron Dataset: Volume 16GB. 1,000,000 rows. 100,000 columns (terms).

Drone data? Maybe just RGB (3 columns) and trillions of rows (one for each pixel each hour for 10 years). Each pixel is GPS located (would we sort by location before pTree-izing?).

pTree Rank(K) computation: (Rank(n-1) gives the 2nd smallest value, which is very useful in outlier analysis.)

RankKval=0; p=K; c=0; P=Pure1; /* RankPts are also returned as the resulting pTree, P */
For i=n to 0 { c=Count(P&P_i); If (c>=p) { RankKval=RankKval+2^i; P=P&P_i } else { p=p-c; P=P&P'_i } };
return RankKval, P;
/* In the loop, i runs over bit positions, high to low. Below, there are 7 rows and K=7-1=6 (looking for the 6th highest = 2nd lowest value). */

 X   P4,3 P4,2 P4,1 P4,0
10    1    0    1    0
 5    0    1    0    1
 6    0    1    1    0
 7    0    1    1    1
11    1    0    1    1
 9    1    0    0    1
 3    0    0    1    1

Cross out the 0-positions of P each step.

(n=3) c=Count(P&P4,3)=3 < 6; p=6-3=3; P=P&P'4,3 masks off the highest 3 (val >= 8). Result bit {0}.

(n=2) c=Count(P&P4,2)=3 >= 3; P=P&P4,2 masks off the lowest 1 (val < 4). Result bit {1}.

(n=1) c=Count(P&P4,1)=2 < 3; p=3-2=1; P=P&P'4,1 masks off the highest 2 (val >= 8-2=6). Result bit {0}.

(n=0) c=Count(P&P4,0)=1 >= 1; P=P&P4,0. Result bit {1}.

RankKval = 2^3*{0} + 2^2*{1} + 2^1*{0} + 2^0*{1} = 5. P = MapRankKPts = 0100000, ListRankKPts = {2}.
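The Rank(K) loop can be tried out with Python integers standing in for pTrees (one bit per row). That representation, the function name `rank_k`, and the 1-based row numbering are assumptions made for this sketch; real pTrees are compressed and the AND/Count operations work on the compressed form.

```python
def rank_k(values, K, nbits):
    """Return (K-th largest value, 1-based rows holding it) using only
    AND, complement and 1-bit counts on bit-slice masks."""
    n = len(values)
    # P[i] is the bit slice for bit position i: bit r is set when row r has bit i set.
    P = [sum(((v >> i) & 1) << r for r, v in enumerate(values)) for i in range(nbits)]
    full = (1 << n) - 1                  # Pure1: every row still a candidate
    mask, p, rank_val = full, K, 0
    for i in reversed(range(nbits)):     # high-order bit first
        hit = mask & P[i]
        c = bin(hit).count("1")          # Count(P & P_i)
        if c >= p:                       # the answer has a 1 in bit i
            rank_val += 1 << i
            mask = hit
        else:                            # the answer has a 0 in bit i
            p -= c
            mask &= full & ~P[i]
    rows = [r + 1 for r in range(n) if (mask >> r) & 1]
    return rank_val, rows

single = rank_k([10, 5, 6, 7, 11, 9, 3], K=6, nbits=4)
tied = rank_k([10, 5, 6, 3, 11, 9, 3], K=6, nbits=4)
```

Because the final mask keeps every row that matched all the chosen bits, ties come back as a set of rows rather than a single point.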

Suppose MinVal is duplicated (occurs at two points). What does the algorithm return?

 X   P4,3 P4,2 P4,1 P4,0
10    1    0    1    0
 5    0    1    0    1
 6    0    1    1    0
 3    0    0    1    1
11    1    0    1    1
 9    1    0    0    1
 3    0    0    1    1

1. Ct(P&P4,3)=3 < 6; p=6-3=3; P=P'4,3 masks off the highest 3 (val >= 8). Result bit {0}.

2. Ct(P&P4,2)=2 < 3; p=3-2=1; P=P&P'4,2 masks off the highest 2 (val >= 4). Result bit {0}.

3. Ct(P&P4,1)=2 >= 1; P=P&P4,1. Result bit {1}.

4. Ct(P&P4,0)=2 >= 1; P=P&P4,0. Result bit {1}.

2^3*{0} + 2^2*{0} + 2^1*{1} + 2^0*{1} = 3 = MinVal = rank(n-1)Val. P is the mask of MinPts = rank(n-1)Pts = {#4, #7}.


Suppose MinVal is triplicated (occurs at three points). What does the algorithm return?

 X   P4,3 P4,2 P4,1 P4,0
10    1    0    1    0
 3    0    0    1    1
 6    0    1    1    0
 3    0    0    1    1
11    1    0    1    1
 9    1    0    0    1
 3    0    0    1    1

1. Ct(P&P4,3)=3 < 6; p=6-3=3; P=P'4,3 masks off the highest 3 (val >= 8). Result bit {0}.

2. Ct(P&P4,2)=1 < 3; p=3-1=2; P=P&P'4,2 masks off the highest 1 (val >= 4). Result bit {0}.

3. Ct(P&P4,1)=3 >= 2; P=P&P4,1. Result bit {1}.

4. Ct(P&P4,0)=3 >= 2; P=P&P4,0. Result bit {1}.

2^3*{0} + 2^2*{0} + 2^1*{1} + 2^0*{1} = 3 = MinVal. P is the mask of MinPts = {#2, #4, #7}.
