CS 521 Data Mining Techniques Instructor: Abdullah Mueen LECTURE 8: TIME SERIES AND GRAPH MINING.
Definition of Time Series Motifs
A motif definition has four ingredients:
1. Length of the motif
2. Support of the motif
3. Similarity of the pattern
4. Relative position of the pattern
Definition used here: given a length, the motif is the most similar (least distant) pair of non-overlapping subsequences.
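[Figure: an example time series of length 200 with values between -2 and 2.]
Distance is measured between z-normalized subsequences:
$$\hat{x}_i = \frac{x_i - \mu_x}{\sigma_x}, \qquad \hat{y}_i = \frac{y_i - \mu_y}{\sigma_y}, \qquad d(\hat{x}, \hat{y}) = \sqrt{\sum_i (\hat{x}_i - \hat{y}_i)^2}$$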
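A minimal Python sketch of the distance on this slide, read as Euclidean distance between z-normalized subsequences. The function names and the use of the population standard deviation are assumptions made for illustration, not part of the slides.

```python
import math

def z_normalize(x):
    """Rescale x to zero mean and unit (population) standard deviation."""
    n = len(x)
    mean = sum(x) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in x) / n)
    if std == 0:
        # A constant subsequence has no shape to compare.
        raise ValueError("constant subsequence cannot be z-normalized")
    return [(v - mean) / std for v in x]

def znorm_distance(x, y):
    """Euclidean distance between the z-normalized versions of x and y."""
    xh, yh = z_normalize(x), z_normalize(y)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(xh, yh)))
```

Because of the normalization, two subsequences with the same shape but different offset and scale, such as `[1, 2, 3]` and `[10, 20, 30]`, are at distance zero.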
Problem Formulation
The most similar pair of non-overlapping subsequences
[Figure: a time series over time 1 to 1000, with its subsequences numbered 1, 2, ..., 873.]
This is the closest pair of points problem in high-dimensional space.
In two dimensions, the optimal algorithm runs in Θ(n log n).
For large dimensionality d, the best algorithm is effectively Θ(n²d).
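The quadratic baseline can be sketched as follows: treat every length-m subsequence as a point and compare all non-overlapping pairs. This is an illustrative sketch, not the lecture's code; `znorm` and `brute_force_motif` are hypothetical names, and mapping constant subsequences to all zeros is an assumption.

```python
import math

def znorm(x):
    """Z-normalize x; a constant subsequence maps to all zeros (assumption)."""
    n = len(x)
    mean = sum(x) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in x) / n)
    return [(v - mean) / std if std > 0 else 0.0 for v in x]

def brute_force_motif(series, m):
    """Return (distance, i, j) for the closest pair of non-overlapping
    length-m subsequences, where i and j are start positions."""
    subs = [znorm(series[i:i + m]) for i in range(len(series) - m + 1)]
    best = (math.inf, None, None)
    for i in range(len(subs)):
        # j >= i + m guarantees the two subsequences do not overlap.
        for j in range(i + m, len(subs)):
            d = math.dist(subs[i], subs[j])
            if d < best[0]:
                best = (d, i, j)
    return best
```

With n subsequences of length m, this computes on the order of n² distances at cost m each, which is the Θ(n²d) behaviour that the lower-bounding ideas on the next slides try to avoid.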
Lower Bound
If P, Q and R are three points in a d-dimensional space, the triangle inequality gives
d(P,Q) + d(Q,R) ≥ d(P,R)
and therefore
d(P,Q) ≥ |d(Q,R) - d(P,R)|
A third point R provides a very inexpensive lower bound on the true distance.
If the lower bound is at least as large as the existing best, skip computing d(P,Q):
d(P,Q) ≥ |d(Q,R) - d(P,R)| ≥ BestPairDistance
[Figure: three points P, Q and R.]
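A small Python sketch of this pruning test. The helper names are hypothetical, and the distances to R are assumed to have been computed once and stored in advance.

```python
import math

def lower_bound(d_q_r, d_p_r):
    """Triangle-inequality lower bound on d(P, Q) from distances to R."""
    return abs(d_q_r - d_p_r)

def can_skip(d_q_r, d_p_r, best_pair_distance):
    """True if d(P, Q) cannot beat the current best, so it need not be computed."""
    return lower_bound(d_q_r, d_p_r) >= best_pair_distance

# Example points chosen for illustration only.
P, Q, R = (0.0, 0.0), (10.0, 0.0), (1.0, 0.0)
d_p_r = math.dist(P, R)   # 1.0
d_q_r = math.dist(Q, R)   # 9.0
skip = can_skip(d_q_r, d_p_r, best_pair_distance=5.0)
```

Here the bound is 8 while the true distance is 10, so with a current best of 5 the pair (P, Q) is skipped using only two stored numbers.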
Circular Projection
Pick a reference point r.
Circularly project all points onto a line passing through the reference point.
This is equivalent to computing each point's distance from r and then sorting the points by that distance.
[Figure: 24 numbered points circularly projected onto a line through the reference point r.]
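The projection amounts to a sort by distance from r, which can be sketched in a few lines of Python. The name `order_line` and the tuple layout are assumptions for illustration.

```python
import math

def order_line(points, ref):
    """Return (distance_from_ref, original_index) pairs, nearest first."""
    return sorted((math.dist(p, ref), i) for i, p in enumerate(points))

line = order_line([(3, 4), (1, 0), (0, 2)], ref=(0, 0))
```

Keeping the original index next to each distance lets later steps map a position on the line back to the point it came from.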
The Order Line
[Figure: the order line. Points are sorted by distance from the reference point r, which sits at 0; the gap between P and Q on the line is |d(Q, r) - d(P, r)|, and scans for k = 1, 2, 3 are compared against BestPairDistance.]
For k = 1 to n-1:
• Compare every pair having k-1 points in between.
• Do k scans of the order line, starting at the 1st through the kth point.
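Correctness
If we search all offsets k = 1, 2, …, n-1, then all possible pairs are considered.
◦ n(n-1)/2 pairs
For any offset k, if none of the k scans needs an actual distance computation, then no distance computation will be needed for the remaining offsets k+1, …, n-1. The reason is that two points further apart on the order line have a lower bound at least as large as that of any pair lying between them.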
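The order-line search and its early stop can be sketched in Python as below. This is a simplified illustration with a single reference point on generic points, not the lecture's implementation: the function name is hypothetical, the non-overlap constraint for subsequences is left out, and the k separate scans are merged into one loop over start positions, which visits the same pairs.

```python
import math

def closest_pair_order_line(points, ref_index=0):
    """Return (best_distance, (i, j), distances_computed)."""
    n = len(points)
    ref = points[ref_index]
    # Order line: points sorted by their distance to the reference point.
    line = sorted((math.dist(p, ref), i) for i, p in enumerate(points))
    best, best_pair, computed = math.inf, None, 0
    for k in range(1, n):                  # offset on the order line
        any_computed = False
        for a in range(n - k):             # pairs with k-1 points in between
            (da, i), (db, j) = line[a], line[a + k]
            if db - da >= best:            # lower bound |d(Q,r) - d(P,r)|
                continue                   # cannot beat the best: skip
            d = math.dist(points[i], points[j])
            computed += 1
            any_computed = True
            if d < best:
                best, best_pair = d, (min(i, j), max(i, j))
        if not any_computed:
            break                          # larger offsets are pruned too
    return best, best_pair, computed

pts = [(0, 0), (5, 5), (1, 1), (9, 0), (5, 6), (2, 8)]
best, pair, computed = closest_pair_order_line(pts)
```

On this small example the closest pair is points 1 and 4 at distance 1, found with far fewer than the 15 distance computations a full pairwise search would need.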
Graph Similarity
Edit distance / graph isomorphism:
◦ Tree edit distance
Feature extraction:
◦ In/out degree
◦ Diameter
Iterative methods:
◦ SimRank
Diameter
The diameter is the largest shortest-path distance in the graph. All-pairs shortest paths can be computed with the Floyd-Warshall algorithm:
let dist be a |V| × |V| array of minimum distances initialized to ∞ (infinity)
for each vertex v
    dist[v][v] ← 0
for each edge (u,v)
    dist[u][v] ← w(u,v) // the weight of the edge (u,v)
for k from 1 to |V|
    for i from 1 to |V|
        for j from 1 to |V|
            if dist[i][j] > dist[i][k] + dist[k][j]
                dist[i][j] ← dist[i][k] + dist[k][j]
            end if
http://en.wikipedia.org/wiki/Floyd%E2%80%93Warshall_algorithm
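A runnable Python version of the pseudocode above, followed by a diameter helper. Vertices are assumed to be numbered 0 to n-1 and edges given as (u, v, weight) triples; ignoring unreachable pairs when taking the maximum is a choice made for this sketch.

```python
import math

def floyd_warshall(n, edges):
    """All-pairs shortest-path distances for a directed weighted graph."""
    dist = [[math.inf] * n for _ in range(n)]
    for v in range(n):
        dist[v][v] = 0
    for u, v, w in edges:
        dist[u][v] = min(dist[u][v], w)    # keep the lightest parallel edge
    for k in range(n):                     # allow k as an intermediate vertex
        for i in range(n):
            for j in range(n):
                if dist[i][j] > dist[i][k] + dist[k][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist

def diameter(dist):
    """Largest finite shortest-path distance; unreachable pairs are ignored."""
    return max(d for row in dist for d in row if d != math.inf)

# Undirected path 0 - 1 - 2 - 3, written as edges in both directions.
path_edges = [(0, 1, 1), (1, 0, 1), (1, 2, 1), (2, 1, 1), (2, 3, 1), (3, 2, 1)]
path_diameter = diameter(floyd_warshall(4, path_edges))
```

The triple loop costs Θ(|V|³) time, so this approach suits small or dense graphs.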
SimRank
For a node v in a graph, we denote by I(v) and O(v) the set of in-neighbors and out-neighbors of v, respectively.
http://www-cs-students.stanford.edu/~glenj/simrank.pdf
1. A solution s(∗, ∗) ∈ [0, 1] to the n² SimRank equations always exists and is unique.
2. Symmetric: s(a, b) = s(b, a)
3. Reflexive: s(a, a) = 1
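The SimRank equations themselves do not survive in this transcript. In the standard formulation, s(a, a) = 1 and, for a ≠ b, s(a, b) = C / (|I(a)| |I(b)|) times the sum of s(i, j) over all i in I(a) and j in I(b), with s(a, b) = 0 when either node has no in-neighbors. A naive fixed-point iteration can be sketched as follows; the decay constant C = 0.8, the iteration count and the dictionary-based graph format are assumptions for illustration.

```python
def simrank(in_neighbors, C=0.8, iterations=10):
    """Naive SimRank iteration. in_neighbors maps each node to its in-neighbor list."""
    nodes = list(in_neighbors)
    # Start from s(a, a) = 1 and s(a, b) = 0 for a != b.
    sim = {a: {b: 1.0 if a == b else 0.0 for b in nodes} for a in nodes}
    for _ in range(iterations):
        new = {a: {} for a in nodes}
        for a in nodes:
            for b in nodes:
                if a == b:
                    new[a][b] = 1.0        # reflexive
                    continue
                Ia, Ib = in_neighbors[a], in_neighbors[b]
                if not Ia or not Ib:
                    new[a][b] = 0.0        # no in-neighbors: no evidence
                    continue
                total = sum(sim[i][j] for i in Ia for j in Ib)
                new[a][b] = C * total / (len(Ia) * len(Ib))
        sim = new
    return sim

# Hypothetical example: node "u" points to both "a" and "b".
scores = simrank({"u": [], "a": ["u"], "b": ["u"]})
```

In this example "a" and "b" share their only in-neighbor, so their similarity is C = 0.8, while "u" has no in-neighbors and scores 0 against both.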
Tree Edit Distance
http://grfia.dlsi.ua.es/ml/algorithms/references/editsurvey_bille.pdf
Applications
Find the most frequent tree structure in a phylogenetic tree.
Match a query subtree against a set of XML documents.
Ranking Nodes: PageRank
PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
where
PR(A) is the PageRank of page A,
PR(Ti) is the PageRank of each page Ti that links to page A,
C(Ti) is the number of outbound links on page Ti, and
d is a damping factor that can be set between 0 and 1.
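Example
Three pages with damping factor d = 0.5. The equations correspond to A linking to B and C, B linking to C, and C linking to A:
PR(A) = 0.5 + 0.5 PR(C)
PR(B) = 0.5 + 0.5 (PR(A) / 2)
PR(C) = 0.5 + 0.5 (PR(A) / 2 + PR(B))
These equations can easily be solved. We get the following PageRank values for the individual pages:
PR(A) = 14/13 = 1.07692308
PR(B) = 10/13 = 0.76923077
PR(C) = 15/13 = 1.15384615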
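Instead of solving the linear system directly, the same values can be approached by repeatedly applying the PageRank formula. The sketch below assumes the link structure implied by the example equations (A links to B and C, B links to C, C links to A); the function name, the starting value of 1.0 and the iteration count are illustrative choices.

```python
def pagerank(out_links, d=0.5, iterations=100):
    """Iterate PR(A) = (1-d) + d * sum(PR(T)/C(T)) over pages T linking to A."""
    pages = list(out_links)
    pr = {p: 1.0 for p in pages}           # arbitrary starting values
    for _ in range(iterations):
        new = {}
        for p in pages:
            # Sum the contributions of every page t that links to p.
            incoming = sum(pr[t] / len(out_links[t])
                           for t in pages if p in out_links[t])
            new[p] = (1 - d) + d * incoming
        pr = new
    return pr

links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(links)
```

With d = 0.5 the iteration contracts quickly, and `ranks` agrees with 14/13, 10/13 and 15/13 to within floating-point precision.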
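Matlab Script
Matlab script for the example on the previous slide (uses the Symbolic Math Toolbox):
% x, y and z stand for PR(A), PR(B) and PR(C)
syms x y z
eqn1 = x == 0.5 + 0.5*z;
eqn2 = y == 0.5 + 0.25*x;
eqn3 = z == 0.5 + 0.25*x + 0.5*y;
% Rewrite the linear system as A*X = B and solve it
[A, B] = equationsToMatrix([eqn1, eqn2, eqn3], [x, y, z]);
X = linsolve(A, B)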
HITS: Hyperlink-Induced Topic Search
http://www.cs.cornell.edu/home/kleinber/auth.pdf