Visualization of TV Space
Transcript of Visualization of TV Space
[Figure: graph of the TVX functional over the XY plane, with TVX(µ) = TVX(x33) and TVX(x15) marked on the TV axis.]
Proof that the graph of TVX is a steep hyper-parabola centered at the mean, µ = (Σx∈X x)/|X|
Let f(c) = TVX(c) = Σx∈X (x−c)∘(x−c)
         = Σx∈X Σi=1..n (xi − ci)²
         = Σx∈X Σi=1..n xi² − 2 Σx∈X Σi=1..n xi·ci + |X| Σi=1..n ci²
This is clearly parabolic in each dimension, ci (fixing all other dimensions).
∂f/∂ck = Σx∈X −2(xk − ck) = 0  iff  Σx∈X xk = Σx∈X ck = |X|·ck  iff  ck = (Σx∈X xk)/|X| = µk.
We can say more about the shape of the hyper-parabolic graph of f.
Since ∂f/∂ck = Σx∈X −2(xk − ck) = −2 Σx∈X xk + 2 Σx∈X ck = 2|X|·ck − 2|X|·µk = 2|X|(ck − µk),
we see that on each dimensional slice the parabola has the same shape: a parabola in the xy-plane centered at (x0, y0) has the form y − y0 = a(x − x0)², and a = f(x0+1) − f(x0).
We note y' = 2a(x − x0), so in our case a = |X|, a very large number (steep parabola), and x0 = µk.
Since the steepness coefficient of the graph is |X|, if one wants, roughly, an ε-radius contour (hyper-circular) centered at a, one needs to take the pre-image of the |X|ε interval about TVX(a): f⁻¹( (TVX(a) − |X|ε, TVX(a) + |X|ε) ).
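The identity behind this proof can be checked numerically; the following sketch (the data and all names are illustrative, not from the source) verifies that TVX(c) − TVX(µ) = |X|·|c−µ|², which is exactly the steep-paraboloid claim above.

```python
# Sketch: numerically verify TVX(c) - TVX(mu) = |X| * ||c - mu||^2,
# i.e., the graph of TVX is a steep hyper-parabola centered at the mean mu.
import random

def tv(X, c):
    # Total Variation functional: sum over x in X of (x-c).(x-c)
    return sum(sum((xi - ci) ** 2 for xi, ci in zip(x, c)) for x in X)

random.seed(0)
n, N = 3, 1000                                   # dimension and |X|
X = [[random.random() for _ in range(n)] for _ in range(N)]
mu = [sum(x[i] for x in X) / N for i in range(n)]

c = [0.9, 0.1, 0.5]                              # arbitrary test point
lhs = tv(X, c) - tv(X, mu)
rhs = N * sum((ci - mi) ** 2 for ci, mi in zip(c, mu))
print(abs(lhs - rhs) < 1e-6)                     # True
```

Note the slope factor |X| appearing as N in rhs: doubling the distance from µ quadruples the TV excess, scaled by the (large) cardinality of X.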
[Figure: the resulting ε-contour of TVX about a point a in the x1-x2 plane: a thin, roughly hyper-circular ring centered at µ and passing through a.]
Proof that the graph of IPX is a steep hyper-plane ⊥ to µ = (Σx∈X x)/|X|
Inner Product functional:
IPX(c) = Σx∈X x∘c = Σx∈X Σi=1..n xi·ci = Σi=1..n ci Σx∈X xi = Σi=1..n ci·|X|·µi = |X| Σi=1..n ci·µi = |X|·(c∘µ), so
IPX(c) = |X|·|µ|·|c|·cosθ, where θ is the angle between c and µ.
We can use any of these equivalent formulas, depending upon which one sheds the most light on the issue we are concerned with. The cosine form tells us that the graph is extremely steep vertically (a slight change in the length of c causes a tremendous change in IPX(c)), that the contour(IPX, a, r) about a point, a, is a linear slice perpendicular to µ, and also how to choose the interval radius on the IPX axis so that the contour has radius r. The |X|·(c∘µ) form guides efficient preprocessing.
The steepness of the graph is evident from ∂f/∂ck = |X|·µk, or from the gradient, ∇f = |X|·µ.
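The linearity claimed in this derivation is easy to check numerically; this sketch (random illustrative data, not from the source) verifies IPX(c) = |X|·(c∘µ).

```python
# Sketch: numerically verify the derivation above, IPX(c) = |X| * (c o mu),
# with illustrative random data.
import random

def ip(X, c):
    # Inner Product functional: sum over x in X of x o c
    return sum(sum(xi * ci for xi, ci in zip(x, c)) for x in X)

random.seed(1)
n, N = 4, 500
X = [[random.random() for _ in range(n)] for _ in range(N)]
mu = [sum(x[i] for x in X) / N for i in range(n)]

c = [1.0, -2.0, 0.5, 3.0]
lhs = ip(X, c)
rhs = N * sum(ci * mi for ci, mi in zip(c, mu))
print(abs(lhs - rhs) < 1e-6)                     # True
```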
Proof that graph of Xa is a 45o hyper-plane nearly to µ = (xX xi)/|X|
aDomainA1..DomainAn, projection onto a, Xa(x)=xa =i=1..nxi*ai is a functional whose
graph is a hyperplane at a 45 angle with a. Contour(Xa,X,b,r) is a linear (n-1)-dimensional hyper-bar through b perpendicular to a.
Xi(x) = xi is just Xei which also has planar graphs and have linear hyper-slice (n-1 dimensional) contours
perpendicular to their coordinate basis vector, ei.
Xa is just as easily calculated as TV (easier!), but which ones? All of them? That's impractical! One could process each Xi though.
To classify all s in S, we could first cluster S based on some notion of closeness (isotropic clusters), then take the cluster means as representatives of the entire cluster and classify those cluster means individually (giving the same class assignment to all other points in that cluster, addressing the curse of cardinality of S); or we can classify each s in S individually. In either case we classify s in S as follows:
1. For an unclassified sample (a cluster mean or just any sample), s, find a set of epsilon contours (from TVX, IPX, the Xi's) that reduce the candidate near-neighbor set to a manageable size.
2. Select the candidate near neighbors that are Euclidean-close enough (or the closest k of them).
3. Let those selected near neighbors vote with Gaussian Radial Basis Function (RBF) weighted votes. Done!
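The three steps above can be sketched as follows. All names (classify, eps_tv, eps1, ...) and the tiny dataset are hypothetical, and for brevity the contour pruning here uses only TVX and the first coordinate projection; in practice the TVX values would be precomputed rather than recomputed per point.

```python
# Sketch of the classification steps above: prune with epsilon-contours,
# keep Euclidean-close candidates, then take a Gaussian-RBF-weighted vote.
import math

def tv(X, c):
    return sum(sum((xi - ci) ** 2 for xi, ci in zip(x, c)) for x in X)

def classify(s, X, labels, eps_tv, eps1, radius, sigma=1.0):
    tvs = tv(X, s)
    # step 1: epsilon-contours of TVX and of the coordinate projection X1
    cand = [i for i, x in enumerate(X)
            if abs(tv(X, x) - tvs) < eps_tv and abs(x[0] - s[0]) < eps1]
    votes = {}
    for i in cand:
        d2 = sum((xi - si) ** 2 for xi, si in zip(X[i], s))
        if d2 < radius ** 2:                      # step 2: Euclidean-close enough
            w = math.exp(-d2 / (2 * sigma ** 2))  # step 3: Gaussian RBF weight
            votes[labels[i]] = votes.get(labels[i], 0.0) + w
    return max(votes, key=votes.get) if votes else None

X = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.0, 0.9)]
labels = ["normal", "normal", "exception", "exception"]
print(classify((0.05, 0.05), X, labels, eps_tv=100, eps1=0.5, radius=0.5))  # normal
```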
[Figure: ε-contours about a: the bands TVX(a)±ε, IPX(a)±ε, and X2(a)±ε intersect near a, with µ shown.]
To prune halos with a small number of contours (fewer than the n+1 contours {TV; Xi, i=1..n}): e.g., with 2 contours, use the Xa−µ-contour, or use just a few Xi-contours corresponding to the largest coordinates of a−µ.
[Figure: pruning alternatives about a: the Xa−µ(a)±ε slice (halos, but small), the X1(a)±ε slice (no halo, not too large), and the TVX ring alone (no halo, but very large).]
Contours of TVX, IPX, Xa, Xa−µ
[Figures: in the XY plane with mean µ: ContourTVX(a,r), a ring about µ through a; ContourXy(a,r) and ContourXx(a,r), bars through a perpendicular to the y- and x-axes; ContourIPX(c,r), a slice through c perpendicular to µ; ContourXb(a,r) and ContourXa−µ(a,r), slices through a perpendicular to b and to a−µ.]
Note: ContourTVX(a,r) = Contour(TV,X,a,r), etc.
How about higher dimensions?
ContourXy(a,r) cuts the outside surface of ContourTVX(a,r); two (n−1)-D hypersurfaces (isobars) bound ContourXa(a,r), and the candidates lie on the outside (n−1)-D surface of ContourTVX(a,r).
Type-0 P-trees
101..0
101..1
100..1
100..0
100..1
101..1
101..1
100..1
101..1
100..1
101..1
101..1
111..1
100..1
101..1
101..1
100..1
101..1
001..1
Ex.: dimension = 1, fanout = 2^dimension = 2, depth = 5, NumberOfPotentialLeaves = NOPL = 32 = fanout^depth
Leaf Existence Array, LA: 31 30 29 28 23 22 21 20 15 14 13 12 11 9 8 3 2 1 0
Type 0: pure0 leaves omitted; pure1 leaves tagged in the Purity Field.
Purity Field (PF): 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
The upper inode levels are not productive. Indicate which leaves are either mixed or pure1 in a Leaf Existence Array (LA) with a Purity Field (PF), or a Leaf Existence Map (LM) and a Purity Map (PM).
Only the mixed leaves get stored.
Leaves are bit vectors, any or all of which could be compressed the same way (i.e., this 2-level structure can be nested to more levels).
The LM is an "existential smoothing" of the Ptree (it tells us precisely which leaves contain at least one 1-bit).
If we nest enough levels, the LMs give us multiple smoothing levels. Needless to say, I prefer the LM/PM approach. The LM/PF may be clearer.
0 1 1 1 1 0 0 0 0 1 1 0 1 1 1 1 1 0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1
Leaf Existence Map, LM (size=NOPL)
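A minimal sketch of deriving the LM and PM from raw leaves (the 8-bit leaves here are made up for illustration); only the mixed leaves, LM = 1 and PM = 0, would be stored:

```python
# Sketch: derive Leaf Map (leaf has at least one 1-bit, the "existential
# smoothing") and Purity Map (leaf is pure-1) from leaf bit strings;
# store only the mixed leaves.
leaves = ["00000000", "10110001", "11111111", "00000000", "01000000"]

LM = [int("1" in leaf) for leaf in leaves]
PM = [int(all(b == "1" for b in leaf)) for leaf in leaves]
stored = [lf for lf, lm, pm in zip(leaves, LM, PM) if lm and not pm]

print(LM)       # [0, 1, 1, 0, 1]
print(PM)       # [0, 0, 1, 0, 0]
print(stored)   # ['10110001', '01000000']
```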
Purity Map (PM): 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
Leaf Map (LM) positions: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
Type-1 P-trees
1 1 1 1 1 0 0 0 0 1 1 0 1 1 1 1 1 0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1
Physical Structure: Leaf Exists Map or LM (size=NOPL=32)
Purity Map (PM): 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
Leaf Map (LM) positions: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
Physical Structure: Leaf Existence Array, LA: 31 30 29 28 23 22 21 20 15 14 13 12 11 9 8 3 2 1 0
Type 0 means pure0 leaves are omitted and pure1 leaves are switched on in the Purity Field:
Purity Field: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
The Type-0 Ptree above can be expressed as Type-1, using either a Leaf Existence Array (LA) with a Purity Field (PF) or a Leaf Existence Map (LM) and a Purity Map (PM).
Physical Structure: Leaf Existence Array (LA): 31 30 29 28 [24,25,26,27] 23 22 21 20 [16,17,18,19] 15 14 13 12 10 9 8 [4,5,6,7] 3 2 1 0 (bracketed runs are the omitted pure1 leaves)
Type 1 means pure1 leaves are omitted and pure0 leaves are tagged in the Purity Field.
Purity Field (PF): 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
How should the Ptree be stored?
[Figure: six Ptrees, 0p1, 0p2, 1p3, 0p4, 1p5, 0p6, each with its Leaf Map and Pure Map, drawn as the rows of a cube of leaves.]
The tempting way to store these structures is to cluster by Ptree (horizontally, across the rows of the cube above).
But since those leaves almost never get ANDed with one another (except in 1-time preprocessing), it is better to cluster by Leaf Offset or LM position (vertically, down the cube), since these are precisely the bit vectors that get ANDed together.
If there is good compression (not too many mixed leaves per Ptree), then storing each Leaf Offset (vertical slice of the cube) on a page (or extent) would mean that only that page needs to be brought in when an actual AND is called for (and prefetching is straightforward).
The collection of LMs and PMs could be stored separately on one extent, since they're processed separately, before the leaves (or processed as smoothings).
[Leaf record layout: Type bit | impure leaves | leaf length; e.g., 1010 0010 1000 011]
P-tree operation: COMPLEMENT
<-- COMPLEMENTing a P-tree -->
Flip the Type Bit and complement the Leaves. (That's all!)
If the structure is nested, complement a leaf by flipping its Type Bit and complementing its leaves (the leaves of the leaves), etc.
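The rule above can be sketched on a hypothetical flat encoding of a Ptree as (type_bit, {leaf_offset: leaf bit string}); this is an assumed representation for illustration, not the authors' exact storage format.

```python
# Sketch of COMPLEMENT: flip the type bit (pure0-omitted <-> pure1-omitted)
# and complement each stored leaf.
def complement(ptree):
    type_bit, leaves = ptree
    flipped = {off: "".join("1" if b == "0" else "0" for b in leaf)
               for off, leaf in leaves.items()}
    return (1 - type_bit, flipped)

p = (0, {3: "10110001"})
print(complement(p))                    # (1, {3: '01001110'})
print(complement(complement(p)) == p)   # True: complementing twice restores p
```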
[Worked example: a Ptree with stored leaves 111..1, 101..1, 100..1, 111..1, 000..1, 100..0 (leaf lengths 18, 17, 1, 16, 7, 6, 1, 19); its complement has the type bit flipped and leaves 000..0, 010..0, 011..0, 000..0, 111..0, 011..1.]
P-tree operation: AND
[Figure: the six Ptrees 0p1, 0p2, 1p3, 0p4, 1p5, 0p6 again, each with its 0-Leaf Map (or 1-Leaf Map) and Pure Map.]
Assume mixed leaves are clustered by Leaf Offset (vertically, down the cube), and the collection of LMs (and PMs) is stored separately on one additional extent.
1. AND all 0-LMs --> A.
2. Scan left-to-right across A for the next 1-bit; if that position in any 1-PM is 1, then GOTO 2, else fetch & AND the nonpure leaves --> B; GOTO 2.
3. A forms the LM of the result and the Bs are the nonpure leaves.
E.g., p1 ^ p3 ^ p6:
1.  1001100001 ^ 1000001001 ^ 1000010000 = 1000000000
    0 1 2 3 4 5 6 7 8 9  positions
[leaf bit vectors at each offset:
101000101000011  101000111110011  101000000000011
101000101000000  111100101000011  111000000000011
000000000011011  000011000000001  000011000010101]
2. pos=1, PM3(1)=0, so fetch & AND:
   p1  101000101000011
   p3  111100101000011
   p6  000011000010101
   res 000000000000001
3. Result Ptree: 0-Leaf_Map: 1000000000; 0-Pure_Map: 0000000000; impure leaves: 000000000000001; root-count = 1
In ASM, is there an AND-and-COUNT operation, to count 1-bits as they are produced?
Even better: ∧{0LMs} ∧ {1PM's} --> 0LM (the result is always type 0). Fetch & AND the leaves corresponding to 1-bits in that 0LM. Set the Purity Map.
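The fetch-and-AND step can be sketched with Python integers as bit vectors; counting the 1-bits of the result plays the role of the "AND and COUNT" popcount operation asked about (helper names are illustrative).

```python
# Sketch: AND equal-width bit-string leaves and return the result plus its
# root count (number of 1-bits).
def and_rootcount(bitmaps, width):
    out = (1 << width) - 1
    for b in bitmaps:
        out &= int(b, 2)
    res = bin(out)[2:].zfill(width)
    return res, res.count("1")

# the three impure leaves fetched in the worked example above
res, rc = and_rootcount(["101000101000011",
                         "111100101000011",
                         "000011000010101"], 15)
print(res, rc)   # 000000000000001 1
```

This reproduces the worked example's impure-leaf result and root-count = 1.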
K
1 2 3 4 5 6 7 8 9 a b c d e f g h i j k l m n o p q r s t u v w x y z A B C D E F G H I J K L M N O P Q R S T U 5 6
120000000000000000000000001011110100111010101001111111111123
1300000000000000000000000001111111000000000000000000000000
7
110011001100110011010101010111111111001111111110010100101034
100101010101010101110011000101010001010000000010100001100122
230000000000000000000000000111110011111111111111111000000021
210000000011111111000011111111111011111110000111110111100035
200000111100001111001100111000111111111001100111111100011033
220000000000000000111111111111111011111111111000001111111134
LeafSize=8, NOPL=7
LeafOff=0; 11 10 20 001100110101010100001111
0-p13
01111111
0-p12
101111010011101010100111
1-p11
00110011001100110101010101111111110011111111100101001010
0-p10
01010101010101011100110001010100010100000000101000011001
0-p23
01111100
10000000
1-p22
11111110
11100000
1-p21
0000111111111110111111100001111101111000
0-p20
00001111000011110011001110001111111110011001111111000110
LeafOff=1; 11 10 20 001100110101010100001111
LeafOff=2; 11 10 21 20 01010101110011000000111100110011
LeafOff=3; 13 12 11 10 23 22 21 20 0111111110111101
011111110101010001111100111111101111111010001111
LeafOff=4; 12 11 10 21 20 001110101100111101010000
1111111011111001
LeafOff=5; 12 11 10 22 21 20 101001111111100100001010
111000000001111110011111
LeafOff=6; 11 10 23 21 20 010010100001100110000000
0111100011000110
Leaf Maps (red are type-1):
13 0001000
12 0001111
11 1111111
10 1111111
23 0001111
22 1101010
21 1011111
20 1111111
Purity Maps:
12 0000001
23 0000110
22 1100000
21 1000000
1300000000000000000000000001111111000000000000000000000000
1200000000000000000000000010111101001110101010011111111111
1100110011001100110101010101111111110011111111100101001010
1001010101010101011100110001010100010100000000101000011001
2300000000000000000000000001111100111111111111111110000000
2200000000000000001111111111111110111111111110000011111111
2100000000111111110000111111111110111111100001111101111000
2000001111000011110011001110001111111110011001111111000110
P13 ^ p12:
13 0001000
12 0001111
0LM 0001000
lo3; 13 12
  01111111
∧ 10111101
= 00111101   rc=5
P11 ^ p10: ∧{0LM} ∧{1PM'} --> 0LM. Fetch & AND lo=0,1,2,3,4,5,6.
11 1111111
10 1111111
LM 1111111
PureMaps:
12 0000001
23 0000110
22 1100000
21 1000000
lo=0; 11 10:  00110011 ∧ 01010101 = 00010001  rc=2
lo=1; 11 10:  00110011 ∧ 01010101 = 00010001  rc=2
lo=2; 11 10:  01010101 ∧ 11001100 = 01000100  rc=2
lo=3; 11 10:  01111111 ∧ 01010100 = 01010100  rc=3
lo=4; 11 10:  11001111 ∧ 01010000 = 01000000  rc=1
lo=5; 11 10:  11111001 ∧ 00001010 = 00001000  rc=1
lo=6; 11 10:  01001010 ∧ 00011001 = 00001000  rc=1
Total rc=12
^{0LM} ^{1PM'} 0LM
Fetch & AND lo=3
P22 ^ p21: ∧{0LM} ∧{1PM'} --> 0LM. Fetch & AND the lo's of the 0LM 1-bit positions (i.e., 2,3,4,5,6) for p21, p22 (those that exist).
LM 0011111
PureMaps:
12 0000001
23 0000110
22 1100000
21 1000000
22' 0011111
21' 0111111
lo2; 21: 00001111                  --> 00001111  rc=4
lo3; 22 ∧ 21: 11111110 ∧ 11111110 --> 11111110  rc=7
lo4; 21: 11111110                  --> 11111110  rc=7
lo5; 22 ∧ 21: 11100000 ∧ 00011111 --> 00000000  rc=0
lo6; 21: 01111000                  --> 01111000  rc=4
Total rc=22
P22 ^ p21 ^ p13 ^ p12: ∧{0LM} ∧{1PM'} --> 0LM. Fetch & AND the lo's of the 0LM 1-bit positions (i.e., lo3) for 21, 22, 13, 12 (those that exist).
LM 0001000
PureMaps:
12 0000001
23 0000110
22 1100000
21 1000000
22' 0011111
21' 0111111
13 0001000
12 0001111
lo3; 13 12 22 21:
  01111111
∧ 10111101
∧ 11111110
∧ 11111110
= 00111100   rc=4
Vertical Data Assistant (VDA) is a Windows (or Windows CE) application that can data mine massive datasets efficiently. (Note that a separate application can be built to convert and store the data properly.)
I am changing my thinking a bit on DataMIME and the whole idea of competing in the "big iron" community with a data mining system. There is no way to win there (too many players with too much money: Google, Microsoft, all the bioinformatics and drug development companies...). It occurred to me, as I was revising Masum's slides, that, other than the size of the datasets, our methods and modules are getting surprisingly simple and compact.
Developing DataMIME on big hardware may be a mistake, if no one uses it! Maybe it is wrong-headed? Maybe the killer app is a completely portable, tiny client system (on a desktop/laptop/PDA) that can do scalable data mining for "anyone, anytime, anywhere", provided their data is captured in a consistent, universal format. If this works, it could eventually be scaled up to supercomputers and as a Grid app, but first, let's do a tiny version (tiny in terms of system requirements, but not in terms of DM power).
DataSURG software development efforts (e.g., SMILEY, DataMIME) weren't too successful. The one software system that has been used is TM-Lab, a small app on small PCs, requiring data to be in a simple format (BSQ).
I now favor a Windows and Windows CE Utility Suite approach. Each utility does one thing and can be invoked in a GUI drag-and-drop mode. (E.g., an AND operation pulls the specified Ptrees from the specified folder and returns either just the rootcount, or drops the resulting "derived Ptree" back in the folder and returns its name.) Each platform (PDA to supercomputer) will get an appropriate subset of the suite.
Body Area Network (BAN) and Network Is The Computer (NITC) technologies may eventually converge to produce the next "sea change" technology.
Imagine nano-computers, massive storage (e.g., 4G thumb drives), wireless networks, and applications such as:
real-time health monitors (blood enzymes / body temperature / blood sugar / coronary flow...); PDA apps (schedulers / reminders / auto thought recorders; DNA / fingerprint / face-geometry / iris-scan real-time name recall / restraining-order enforcement / homeland security); environmental sensors (eyes in the back of the head / sensing nearby explosives / virus sensing...).
All that data will have to be data mined for exceptions.
Exception mining (i.e., classification into 2 classes, exceptional_situation or normal_situation) is what DataSURG does better than anyone in the world (add "ego monitoring" to the VDA apps ;-).
Our approach can be implemented on tiny processing platforms. Of course the datasets are large, so they must be broken up, compressed, and stored in an extremely simple and universal (no variations!) format.
So simple that Windows CE can handle them and that Joe Public can understand them.
After all, simplicity and consistency made the Relational Model successful!
K
1 2 3 4 5 6 7 8 9 a b c d e f g h i j k l m n o p q r s t u v w x y z A B C D E F G H I J K L M N O P Q R S T U V 5 7
1200000000000000000000000010111101001110101010011111111111124
13000000000000000000000000011111110000000000000000000000000
7
1100110011001100110101010101111111110011111111100101001010135
1001010101010101011100110001010100010100000000101000011001123
2300000000000000000000000001111100111111111111111110000000022
2100000000111111110000111111111110111111100001111101111000035
2000001111000011110011001110001111111110011001111111000110033
2200000000000000001111111111111110111111111110000011111111135
[Figure: the points of X, keyed 1-9, a-z, A-V, plotted on a 16×16 grid; both axes carry the 4-bit coordinate values 0000 through 1111, and µ is marked.]
µi = (Σx∈X xi) / |X|
   = (1/57) Σx∈X Σk 2^k xi,k
   = (1/57) Σk 2^k Σx∈X xi,k
   = (1/57) Σk 2^k rc(Pi,k)
for i=1:  (1/57)·(7·2³ + 24·2² + 35·2¹ + 23·2⁰) = 4.3
for i=2:  (1/57)·(22·2³ + 35·2² + 35·2¹ + 33·2⁰) = 7.35
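The derivation above turns mean computation into root counts of the bit-slice Ptrees; a sketch, using the root counts read off the example dataset:

```python
# Sketch: mu_i = (1/|X|) * sum_k 2^k * rc(P_{i,k}), with the root counts
# listed high bit to low bit (bits 3, 2, 1, 0).
def mean_from_rootcounts(rcs, N):
    width = len(rcs)
    return sum(rc * 2 ** (width - 1 - k) for k, rc in enumerate(rcs)) / N

mu1 = mean_from_rootcounts([7, 24, 35, 23], 57)   # attribute 1
mu2 = mean_from_rootcounts([22, 35, 35, 33], 57)  # attribute 2
print(round(mu1, 1), round(mu2, 2))               # 4.3 7.35
```

No horizontal scan of the data is needed; the means come entirely from the precomputed root counts.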
If one created a bit map for each {ring(f,Y,k)}k=1..K, for each Y={x}, one would have all reverse-kNN information and much more! The ring(f,Y,k)'s give up distance information, so {ring(f,Y,(kr,(k+1)r])}k=0..R holds more info, but makes it more difficult to compute reverse-kNN sets (though more useful sets can easily be calculated to do a better job of whatever the reverse-kNN sets were to be used for anyway).
ring(a, d·2^(k−1), d·2^k), k=0,..., tell a great story on neighbors (here, d=1).
We note that the ring(a, d·(k−1), d·k)'s tell an even greater story, but they may have higher cost?
ra,0 = skin(a,1);  ra,k = ring(a, 2^(k−1), 2^k)
rf,a,k = ring(f, f⁻¹(a), 2^(k−1), 2^k)
10010001000100010010001000000000000000000000000000000000000
6
1100100010001000100001000100000000100001010101000000000000011
1200010001000100010100010010000000011000000000110010100100014
2200000000000000000000000010000000001100000000011010110101010
2010101010101010100011001100000000100001010101000000000000017
21000100010001000101000100000000000100000000001000000000000 8
rx1,1,k
3200000000000000000000000000000000000110101010001101010011011
3001010101100101011100110000000000010000000000100000000000014
3110001000100010000010001010000000001000000000010010100100012
4000100010001000100001000110000000101001010101010010100100017
4101000100010001001000100000000000000100000000001000010001010
4210001000100010000010001000000000000010101010000101001010115
S211001100110011001010101000000000000000000000000000001000114
S000010001000100010100010000000000010100000000101000010001012
S100100010001000100001000100000000100011111111000101000010018
U200000000000000000000000000000000000110101010001101010011011
U000000000000000000000000010000000001010101010010111100110013
U100010001000100010100010000000000010000000000100000001000110
T0000000000000000000000000000000000001000000000010000110011 6
T1000000000000000000000000100000000010000000000100101001000 6
T200110011001100110101010100000010110001010101100000000000020
V200010001000100010100010011000010011000000000110010100100016
V0000000000000000000000000000000000000101010100001010000100 7
V1000000000000000000000000000000000001000000000010000100010 4
v200000000000000000000000000101001000010101010000101001010112
v0000000000000000000000000010000000000000000000000000000000 1
v1000000000000000000000000000000000000000000000000000000000 0
q0000000000000000000000000000000100000000000000000000000000 1
q1000000000000000000000000000000000000000000000000000000000 0
q2000000000000000000000000001111010000000000000000000010001 7
w2000000000000000000000000010000100000000000000000000000000 2
w0000000000000000000000000000101000000000000000000000000000 2
w1000000000000000000000000000000000000000000000000000000000 0
s2000000000000000000000000010000000000000000000000000000000 1
s0000000000000000000000000001010010000000000000000000000000 3
s1000000000000000000000000000000000000000000000000000000000 0
10000011110000000000000000000000010000000000000000000000000 5
11000000001111000000000000000000000000000000000000000000000 4
12000000000000111111000000000000000000000000000000000000011 8
52000000000000000011110000000000000000000000000000000001111 8
50111100001111000000000000000000000000000000000000000000000 8
51000000000000111100000000000000000000000000000000000000000 4
92000000000000000000111100000000000000000000000000001111100 9
90000011110000111100000000000000010000000000000000000000000 9
91111100000000000011000000000000000000000000000000000000011 8
d0000000001111000011000000000000000000000000000000000000011 8
d1000011110000000000110000000000010000000000000000000001100 9
d211110000000000000000111110000010000000000000000001111000014
h211111111000000000000001110000011000000000000000001000000014
h0000000000000111100110000000000000000000000000000000001100 8
h1000000001111000000001100000000000000000000000000001110000 9
j2000011111111000000000000000000010000000000000000000000000 9
j0000000000000000011001100000000000000000000000000001110011 9
j1000000000000111100000011100000100000000000000000010000000 9
l0000000000000000000110011100000100000000000000000010001100 9
l1000000000000000011000000000000000000000000000000000000011 4
l2000000001111111100000000000000000000000000000000000000000 8
n200000000000011111100000000000000000000000001111100000001113
n0000000000000000000001100000000000000000000000000001110000 5
n1000000000000000000110000000000000000000000000000000001100 4
I0000000000000000000000000000000000000000001100000000000000 2
I1000000000000000000000000000000000000000110000000100000000 3
I200000000000000000000001111111110111111100000000001000000017
G2000000000000000000000000000011001111100000000000000000000 7
G0000000000000000000000000000000000000000110011111100000000 8
G1000000000000000000000000011100000000011000000000000000000 5
E2000000000000000000000000000000000000000000000000000000000 0
E0000000000000000000000000011100000000011001100000000000000 7
E100000000000000000000000000001100111110000001111100000000012
C000000000000000000000000000001100111110011000000010000000010
C1000000000000000000000000000000000000000001100000000000000 2
C2000000000000000000000000000000000000000000011111000000000 5
x2000000000000000000000000000000000000000001111111000000000 7
x0000000000000000000000000011100000000011000000000000000000 5
x1000000000000000000000000000000000000000110000000100000000 3
x, TVX(x), sorted ascending, with the gap from the previous value in parentheses; TVX(µ) = 2415:
p 2428 (13), P 2525 (97), Q 2546 (21), O 2584 (38), m 2618 (34), o 2728 (110), S 2736 (8), T 2892 (156), R 2930 (38), k 3036 (106), U 3082 (46), l 3146 (64), i 3154 (8), K 3180 (26), L 3201 (21), J 3273 (72), M 3336 (63), V 3466 (130), I 3480 (14), n 3484 (4), g 3593 (109), h 3682 (89), j 3792 (110), f 3800 (8), H 3809 (9), G 3953 (144), e 4121 (168), c 4146 (25), N 4240 (94), v 4264 (24), b 4353 (89), F 4396 (43), E 4540 (144), d 4556 (16), a 4674 (118), 8 4813 (139), 7 5020 (207), D 5097 (77), 9 5109 (12), C 5241 (132), 6 5341 (100), 4 5594 (253), z 5756 (162), 5 5776 (20), A 5777 (1), 3 5801 (24), y 5849 (48), B 5912 (63), x 6056 (144), 2 6122 (66), 1 6557 (435), q 7482 (925), r 10281 (2799), t 11096 (815), w 11221 (125), s 11442 (221), u 12257 (815)
000000000000000000000000100000000000000000000000000000000 1
TV-contours bounded by isobar gaps of at least 57·r, for radial distance r from µ=(7.4, 4.3)
m
000000000000000000000100000000000000000000000000011100000 4
T
000000000000000000000001000000000000000000000000000001100 3
M
000000000000000001011000000000000000000000001111000010010 9
57*1
57*2
57*3
57*4
H
000000000000001110100010000000000000000000110000000000001 8
57*5
F
000000000011010000000000000000100000000011000000100000000 7
57*6
C
000000111100100000000000000000000000011100000000000000000 8
57*7
x
001111000000000000000000000000001111100000000000000000000 9
57*7
1
110000000000000000000000000000000000000000000000000000000 2
57*8
000000000000000000000000000000000000000000000000000000000 0
57*9
q
000000000000000000000000000000000000000000000000000000000 1
57*11
s
000000000000000000000000001110010000000000000000000000000 4
57*14
u
000000000000000000000000000001000000000000000000000000000 1
57*15
µ=(7.4, 4.3). We note that, taking a = (1,0) and a = (0,2), the 4 resulting TV-Xa-contours nearly partition a thick ring.
Thickening the Xa-contours even more gives better coverage without increasing the neighborhood much.
Does this hold true in higher dimensions?
Do we need to consider other diagonals, e.g., (0,0,..,i,0,..,j,0,..,0), etc.?
FAML Vector Space Classification using sorting (ala SMART-TV). Given T=R(A1,…,An,C) and X=R[A1,…,An]:
1. Create and store the n 2-column tables, Xi(x, fi(x)), sorted on fi(x).
   a. Equivalently [ala 4th NF], decompose R(A1..An, C, Af1..Afm) into T, X1, ..., Xm, sorted on fi(x).
   b. Equivalently, create secondary indexes, Xi, for each derived attribute, Afi, on R(A1..An, C, Af1..Afm).
   c. Or store the 3-column Xi(x, fi(x), x.C) so votes are handy (coding classes so that bitwidth = log2|C|)?
2. For an unclassified sample, a, calculate {fi(a)} and retrieve the {cskin(fi(a), r)} [or cskin(fi(a), k)].
3. Form contour(fi, cskin(fi(a), r)) and intersect them to get a candidate near-neighbor set.
4. For every candidate, x, in that set, if d(x,a) < threshold, tally the RBF-weighted vote of x.
Sorting (creating indexes) is expensive; even though it is a one-time activity whose cost can be amortized over all classifications, it may be too expensive for very large data sets. (SMART-TV is essentially "creating indexes on derived attributes".)
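The sorted-table scheme above can be sketched as follows (a toy stand-in, not the SMART-TV implementation; the data, functionals, radii and vote threshold are all invented): each functional gets a 2-column table sorted on fi(x), cskin is a binary-search range lookup into that table, and candidates surviving the intersection get an exact distance check and an RBF-weighted vote.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
X = rng.normal(size=(n, 3))                  # feature space
C = (X[:, 0] > 0).astype(int)                # toy class labels (X.C)
mu = X.mean(axis=0)

# Two functionals: total variation about c, and projection onto mu.
fs = [lambda c: float(((X - c) ** 2).sum()),     # TV_X
      lambda c: float(c @ mu)]                   # X_mu

# Step 1: one sorted 2-column table (f_i(x), row id) per functional.
tables = []
for f in fs:
    vals = np.array([f(x) for x in X])
    order = np.argsort(vals)
    tables.append((vals[order], order))

def cskin(table, v, r):
    """Row ids whose functional value lies within r of v (range lookup)."""
    vals, ids = table
    lo = np.searchsorted(vals, v - r, side="left")
    hi = np.searchsorted(vals, v + r, side="right")
    return set(ids[lo:hi].tolist())

# Steps 2-3: intersect the contour skins to get candidate near neighbors.
a = rng.normal(size=3)
radii = [2.0 * n, 1.0]          # wide interval for the steep TV functional
cands = set(range(n))
for f, table, r in zip(fs, tables, radii):
    cands &= cskin(table, f(a), r)

# Step 4: exact distance check plus RBF-weighted voting.
votes = {}
for i in cands:
    d = float(np.linalg.norm(X[i] - a))
    if d < 2.0:
        votes[C[i]] = votes.get(C[i], 0.0) + float(np.exp(-d * d))
```

The one-time sort is the expensive part; each classification afterwards costs only a few binary searches plus the candidate-set intersection.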
FAML Vector Space Classification using P-trees (includes PINE)
To address the curse of cardinality (mostly the sorting), we use P-tree technology on the derived attributes:
1. Create basic P-trees for each derived attribute, Afi.
2. Using EIN technology, create a P-tree mask for each contour(fi, cskin(a, r)), and AND them.
3. For every x in the resulting mask, if d(x,a) < threshold, tally the RBF-weighted vote of x.
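A rough sketch of steps 1–2, using numpy boolean arrays as stand-ins for P-tree masks (the data and the functional are invented, and real EIN technology builds interval masks hierarchically from the bit-slices rather than by the naive value-by-value enumeration shown here):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.integers(0, 16, size=(400, 2))         # two 4-bit attributes

# Derived attribute A_f for the projection functional f(x) = x . (1,2).
a_dir = np.array([1, 2])
Af = X @ a_dir

# Step 1: "basic P-trees" of A_f: one bit-slice mask per bit position.
width = int(Af.max()).bit_length()
slices = [((Af >> b) & 1).astype(bool) for b in range(width)]

def value_mask(v):
    """Mask of rows with A_f == v, ANDing bit-slices (or their complements)."""
    m = np.ones(len(Af), dtype=bool)
    for b in range(width):
        m &= slices[b] if (v >> b) & 1 else ~slices[b]
    return m

def interval_mask(lo, hi):
    """Mask for lo <= A_f <= hi: the contour of f over that value skin."""
    m = np.zeros(len(Af), dtype=bool)
    for v in range(lo, hi + 1):
        m |= value_mask(v)
    return m

# Step 2: contour mask about an unclassified sample a, radius r.
a = np.array([7, 7])
r = 3
fa = int(a @ a_dir)
contour = interval_mask(fa - r, fa + r)
```

Everything after the bit-slice construction is pure bitwise AND/OR/NOT, which is the point of the P-tree approach: no sorting of the derived attribute is needed.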
Next we examine some specific functionals (dually, derived attributes). First some notation:
A functional, f: X → R, where Dom(R) = {0,1} (binary), is called a predicate.
The derived attribute, Af, will have bitwidth = 1 and is called a derived map.
When starting with a functional, f: X → R, the dual derived attribute will be denoted Af.
When starting with a derived attribute, A, the dual functional will be denoted fA: X → R.
Coordinate Projection: Xi(x) = xi is just Xei, projection onto the standard basis vector ei (the Xi's define L∞-nbrhds in EIN-PINE technology).
IPX(a) = Σx∈X x∘a = Σx∈X Xa(x) is TAPP, with ContourIPX(a,r) a linear hyper-bar ⊥ to µ.
FAML Classification
Given
R(A1..An,C)=Training Space
X=R[A1..An]=Feature Space (since R and X share the same key, we'll use X.C)
Functionals, {fi: dom(A1)×...×dom(An) → Reals} (e.g., TV; IP; projection onto a, Xa)
Define:
Contour(fi, X, a, r) = {x∈X : |fi(x) − fi(a)| < r} (polar).
Isobar(fi, X, a, r) = {x∈X : |fi(x) − fi(a)| = r}.
ContourNbrhd(X, a, r) = ∩i Contour(fi, X, a, r).
Basic Contour classification algorithm:
∀x ∈ ContourNbrhd, if dist(a,x) < ε,
then tally x.C-vote as a [weighted] vote for a.C.
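These definitions can be made concrete with a small sketch (toy points; the functionals used are the coordinate projections, so the contour neighborhood should come out as the open L∞-ball about a):

```python
import numpy as np

X = np.array([[1., 1.], [2., 3.], [4., 1.], [5., 5.], [3., 3.], [4., 4.]])
a = np.array([3., 3.])
r = 2.0

def contour(f, X, a, r):
    """Contour(f, X, a, r) = {x in X : |f(x) - f(a)| < r}, as row indices."""
    return {i for i, x in enumerate(X) if abs(f(x) - f(a)) < r}

def isobar(f, X, a, r):
    """Isobar(f, X, a, r) = {x in X : |f(x) - f(a)| = r}."""
    return {i for i, x in enumerate(X) if abs(f(x) - f(a)) == r}

def contour_nbrhd(fs, X, a, r):
    """Intersection over i of Contour(f_i, X, a, r)."""
    s = set(range(len(X)))
    for f in fs:
        s &= contour(f, X, a, r)
    return s

# With the coordinate projections X1, X2 as the functionals, the contour
# neighborhood is exactly the open L-infinity ball of radius r about a.
projections = [lambda x: x[0], lambda x: x[1]]
nbrhd = contour_nbrhd(projections, X, a, r)
```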
[Figure: graph of TVX over the XY plane (axes 1–5), with the value TVX(a) marked; Contour(TV,X,a,r) appears as a ring about µ through a, while Contour(Xa,X,b,r) is a linear bar through b.]
FAML Classification (early version)
∀x∈X and ∀f, compute f(x).
Store Table(fi), consisting of (x, fi(x))-tuples sorted on fi (or basic P-trees of the derived attributes?).
∀ unclassified a:
1. Calculate {fi(a)} for a subset of the i's (no calculation is required for Xi(a) = ai).
2. Retrieve the portion of Table(fi) at fi(a), NBR(fi(a)) (count-based or radius-based? a different k/r for each fi?).
3. Construct the contours, fi-1(NBR(fi(a))).
4. Prune halos by intersecting other contours until |Contour(X, fi, a, r)| < threshold.
5. Prune to the VoterSet by checking d(a,x).
6. ∀x close enough, cast x's vote as a weighted (Gaussian Radial Basis) vote.
(Note that the per-sample part (shown in purple on the slide) could be done in bulk on the set of all unclassified samples, S, by clustering S by distance and then applying it to each component.)
FAML Clustering? One pervasive use of clustering is class identification or class generation, that is, identifying sets of highly similar objects that might form the classes of a training set, against which subsequent samples can then be classified. These generated classes can be isotropic or density-based (in either case, subsequent classification can correctly be based on near neighbors).
In fact, if class generation is done exclusively to create a training set for subsequent classification, then isotropic clustering always suffices: the round pre-clusters do not have to be joined into arbitrarily shaped clusters, since the near-neighbor set determines the class assignment anyway (independently of which other round pre-clusters might or might not get connected to that one).
Assume a [large] data set of historical tuples and that we want to use them to classify subsequent tuples we may receive from, say, a data stream.
First we want to generate classes from the historical data set through isotropic clustering (using a recent window of the stream, or the whole stream so far).
Then we want to use that "classed" dataset as a training set for near neighbor classification.
We can partition into core, core-neighbor, border, and noise points (ala Rana's method), using functional near-neighbor sets (intersecting all, or just some, of the n+2 functional contours) instead of epsilon-neighborhoods.
Then the border points can be attached to the best (closest / most overlapping) core cluster. Then classify according to overlap with these clusters?
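A minimal sketch of the core/border/noise partition (using plain ε-neighborhoods in place of the functional-contour intersections; Rana's method itself is not reproduced here, and the data, ε, and MinPts are invented):

```python
import numpy as np

rng = np.random.default_rng(3)
# Two round blobs plus scattered background noise.
X = np.vstack([rng.normal(0.0, 0.5, (50, 2)),
               rng.normal(5.0, 0.5, (50, 2)),
               rng.uniform(-3.0, 8.0, (10, 2))])
eps, min_pts = 1.0, 5

# Pairwise distances and epsilon-neighborhood sizes (self included).
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
nbr_counts = (D < eps).sum(axis=1)

core = nbr_counts >= min_pts                                 # dense points
border = ~core & ((D < eps) & core[None, :]).any(axis=1)     # near some core
noise = ~core & ~border                                      # everything else
```

Replacing the `D < eps` test with an intersection of functional contour masks gives the P-tree-friendly version described in the text.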
POLAR Neighborhoods. Construct contours, f-1(NN), to prune halos on NN until |NN| < threshold (revisited).
First, cluster the unclassified sample set, S:
Identify dense angular contours in S−μ, Acontour(a−μ, θ).
Identify dense radial contours in S−μ, Rcontour(a−μ, r) (i.e., TVX-contour(a, r)).
Intersect them and pick out the dense Polar Neighborhoods, PN(a−μ, r, θ) (no halos).
For dense PNs in S−μ, roughly classify the entire PN (pruned by Euclidean distance to be isotropic) by classifying one representative! Then classify the rest of S one point at a time.
The Polar (r,θ)-Neighborhood at a, PN(a, r, θ) = AContour(a, θ) ∩ RContour(a, r), is an interesting gridding for grid-based clustering. Can we easily construct this partition?
Acontour(a, θ) = {x∈X | ∠(a, x) < θ} = {x∈X | (x/|x|)∘(a/|a|) > cosθ} = {x∈X | x∘a > |x||a|cosθ}.
Acontour(a−μ, θ) = {x∈X | ∠(x, a−μ) < θ} = {x∈X | (x/|x|)∘((a−μ)/|a−μ|) > cosθ}.
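The AContour ∩ RContour construction can be sketched in numpy (toy Gaussian sample set; the angle θ and the radial band are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
S = rng.normal(size=(500, 2)) * 3.0          # unclassified sample set
mu = S.mean(axis=0)
a = mu + np.array([2.0, 1.0])

Y = S - mu                                   # work in S - mu coordinates
v = a - mu
theta = np.pi / 8
r_lo, r_hi = 0.5 * np.linalg.norm(v), 1.5 * np.linalg.norm(v)

norms = np.linalg.norm(Y, axis=1)
# Radial contour: a ring of radii (r_lo, r_hi) about mu.
rcontour = (norms > r_lo) & (norms < r_hi)
# Angular contour: angle to a-mu below theta, i.e. cosine above cos(theta).
cos_ang = (Y @ v) / (norms * np.linalg.norm(v))
acontour = cos_ang > np.cos(theta)
pn = rcontour & acontour                     # PN(a-mu, r, theta): no halos
```

Note that the angular test never computes an angle: comparing the cosine against cos(θ) keeps it a single dot product per point.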
Non-uniform Polar Griddings. ∀a ∈ DomA1×...×DomAn, projection onto a: Xa(x) = x∘a = Σi=1..n xi·ai.
Contour(Xa, X, b, r) is a linear, (n−1)-dimensional hyper-bar through b, perpendicular to a.
Contour(TV, X, a, r) ∩ Contour(Xa, X, a, θ) is an approximation of PN(a, r, θ). Xa(x) values are easier to calculate than TVX(x).
Consider a very non-uniform gridding (for finding Dense Cores in clustering or for finding Near Neighbor Sets):
Determine the densest rings (about μ)
Within dense rings determine directions, a, such that the green nbrhds are dense and extract them.
Treat the remainder of X one point at a time.
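The dense-ring / dense-direction extraction above might be sketched like this (a synthetic ring-shaped cluster plus background noise; the slice width r, the sector width, and the density threshold are all invented):

```python
import numpy as np

rng = np.random.default_rng(5)
# A dense ring of points at radius ~3 about the origin, plus sparse noise.
ang = rng.uniform(0.0, 2.0 * np.pi, 300)
ring = np.stack([3.0 * np.cos(ang), 3.0 * np.sin(ang)], axis=1)
X = np.vstack([ring + rng.normal(0.0, 0.2, (300, 2)),
               rng.uniform(-6.0, 6.0, (60, 2))])
mu = X.mean(axis=0)

# Step 1: count points in each r-ring about mu; keep the dense rings.
Y = X - mu
radii = np.linalg.norm(Y, axis=1)
r = 1.0
ring_id = (radii // r).astype(int)
counts = np.bincount(ring_id)
dense_rings = np.flatnonzero(counts > 2.0 * counts.mean())

# Step 2: within the dense rings, histogram directions to find dense sectors.
in_dense = np.isin(ring_id, dense_rings)
sectors = ((np.arctan2(Y[in_dense, 1], Y[in_dense, 0]) + np.pi)
           // (np.pi / 8)).astype(int)
sector_counts = np.bincount(sectors)
# Points outside every dense ring/sector would be treated one at a time.
```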
Non-uniform Parallel Griddings
∀ dimension, ei, partition into r-slices (actually, we should use j-low griddings?).
Starting on one side, determine the count in each r-slice by determining the count in the first one (one < inequality). P-tree mask it (for j-low griddings this is already determined). Do the same for the next one and then AND off the first....
As in previous slide, determine the dense cells (j-low cells) and treat the remaining points as one large (but sparse) partition.
Each partition is P-tree masked.
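The slice-count trick (one < inequality per boundary, then AND off the previous cumulative mask) can be sketched with boolean arrays standing in for P-tree masks (the data and the slice width r are invented):

```python
import numpy as np

rng = np.random.default_rng(6)
x0 = rng.uniform(0.0, 10.0, 1000)    # values along one dimension, e_i
r = 2.0
edges = np.arange(r, 10.0 + r, r)    # right edge of each r-slice: 2, 4, ..., 10

prev = np.zeros(x0.size, dtype=bool)
slice_masks, counts = [], []
for e in edges:
    cum = x0 < e                     # one < inequality: all points left of e
    m = cum & ~prev                  # AND off earlier slices: just this slice
    slice_masks.append(m)
    counts.append(int(m.sum()))
    prev = cum                       # the cumulative mask to AND off next time
```

Each slice thus costs one comparison mask plus one AND, which is exactly the access pattern P-trees make cheap.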
APPENDIX (slides that may be unnecessary)
No halos! But what is the functional, Xa−µ?
The functional is clearly projection onto the a−µ direction, or fa−µ(x) = x ∘ (a−µ)/|a−µ|. Is this PreProcessing Compliant (Taufik Abidin PreProcessing Compliant)? I think (from the chapter in the book) we can conclude that it is quickly computed.
[Figure: the XY plane (axes 1–5) showing ContourTVX(a, r), a ring about µ, intersected with ContourXa−µ(a−µ, r), a linear bar through a; the intersection about a has no halos.]
Xa = X ∪ {g(x,a) | x∈X} has mean(Xa) = a, so the IPXa-contours are ⊥ a.
[Figure: construction of g(x,a) from x and a, showing x−a, its projection (x−a)∘a/|a| onto the a direction, and the difference (x−a)∘a/|a| − (x−a).]
g(x,a) = x + 2[(x−a)∘a/|a| − (x−a)], where (x−a)∘a/|a| is read as the (vector) projection of x−a onto the a direction.