Data Mining: Concepts and Techniques — Course Review (transcript of lecture slides)
What we learned?

1. Frequent pattern & association
   frequent itemsets (Apriori, FP-growth)
   max and closed itemsets
   association rules
   essential rules
   generalized itemsets
   sequential patterns
2. Clustering
   k-means
   BIRCH (based on a CF-tree)
   DBSCAN
   CURE
3. Classification
   decision tree
   naïve Bayesian classifier
   Bayesian network
   neural net and SVM
4. Data warehousing
   concept, schema
   data cube & operations (rollup, …)
   cube computation: multi-way array aggregation
   iceberg cube
   dynamic data cube
5. Additional
   lattice (of itemsets, g-itemsets, rules, cuboids)
   distance-based indexing
1. Frequent pattern & association

frequent itemsets (Apriori, FP-growth)
max and closed itemsets
association rules
essential rules
generalized itemsets
sequential patterns
Basic Concepts: Frequent Patterns and Association Rules

Itemset: X = {x1, …, xk}. Goal: find all rules X → Y with minimum support and confidence, where
  support s = the probability that a transaction contains X ∪ Y
  confidence c = the conditional probability that a transaction containing X also contains Y

Transaction-id   Items bought
10               A, B, C
20               A, C
30               A, D
40               B, E, F

With min_support = 50% and min_conf = 50%:
  A → C (support 50%, confidence 66.7%)
  C → A (support 50%, confidence 100%)

[Venn diagram: customers buying beer, customers buying diapers, and customers buying both]
From Mining Association Rules to Mining Frequent Patterns (i.e., Frequent Itemsets)

Given a frequent itemset X, how do we find association rules?
  Examine every non-empty proper subset S of X:
    confidence(S → X − S) = support(X) / support(S)
  Compare with min_conf.
  An optimization is possible (refer to exercises 6.1, 6.2).
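To make the procedure concrete, here is a minimal Python sketch of rule generation from one frequent itemset. It assumes a hypothetical `support` lookup mapping itemsets to their support counts; the names are illustrative, not from the slides.

```python
from itertools import combinations

def rules_from_itemset(X, support, min_conf):
    """Generate rules S -> X-S from a frequent itemset X.

    `support` is assumed to be a dict from frozensets to support counts."""
    X = frozenset(X)
    rules = []
    for r in range(1, len(X)):                 # every non-empty proper subset S
        for S in map(frozenset, combinations(X, r)):
            conf = support[X] / support[S]     # confidence(S -> X-S)
            if conf >= min_conf:
                rules.append((set(S), set(X - S), conf))
    return rules

# Toy supports taken from the 4-transaction example above (counts out of 4):
support = {frozenset('A'): 3, frozenset('C'): 2, frozenset('AC'): 2}
print(rules_from_itemset('AC', support, min_conf=0.5))
# e.g. [({'A'}, {'C'}, 0.667), ({'C'}, {'A'}, 1.0)]  (order may vary)
```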
The Apriori Algorithm: An Example

Database TDB:
Tid  Items
10   A, C, D
20   B, C, E
30   A, B, C, E
40   B, E

1st scan: C1 = {A}:2, {B}:3, {C}:3, {D}:1, {E}:3
L1 = {A}:2, {B}:3, {C}:3, {E}:3

C2 = {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}
2nd scan: C2 = {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2
L2 = {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2

C3 = {B,C,E}
3rd scan: L3 = {B,C,E}:2
Important Details of Apriori

How to generate candidates?
  Step 1: self-joining Lk
  Step 2: pruning
How to count the supports of candidates?

Example of candidate generation:
  L3 = {abc, abd, acd, ace, bcd}
  Self-joining L3*L3: abcd from abc and abd; acde from acd and ace
  Pruning: acde is removed because ade is not in L3
  C4 = {abcd}
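The self-join and prune steps translate almost directly into code. A small Python sketch (itemsets as sorted tuples; `apriori_gen` is an illustrative name):

```python
from itertools import combinations

def apriori_gen(Lk):
    """Candidate generation: self-join L_k, then prune any candidate
    that has an infrequent k-subset."""
    k = len(next(iter(Lk)))
    Lk = set(Lk)
    candidates = set()
    for a in Lk:
        for b in Lk:
            if a[:-1] == b[:-1] and a[-1] < b[-1]:          # self-join step
                c = a + (b[-1],)
                # prune step: all k-subsets of c must be in L_k
                if all(s in Lk for s in combinations(c, k)):
                    candidates.add(c)
    return candidates

L3 = {('a','b','c'), ('a','b','d'), ('a','c','d'), ('a','c','e'), ('b','c','d')}
print(apriori_gen(L3))   # {('a','b','c','d')}: acde is pruned since ade is not in L3
```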
Construct FP-tree from a Transaction Database

min_support = 3

TID  Items bought               (ordered) frequent items
100  {f, a, c, d, g, i, m, p}   {f, c, a, m, p}
200  {a, b, c, f, l, m, o}      {f, c, a, b, m}
300  {b, f, h, j, o, w}         {f, b}
400  {b, c, k, s, p}            {c, b, p}
500  {a, f, c, e, l, p, m, n}   {f, c, a, m, p}

Header table (item : frequency): f:4, c:4, a:3, b:3, m:3, p:3

1. Scan the DB once, find frequent 1-itemsets (single-item patterns)
2. Sort frequent items in frequency-descending order: F-list = f-c-a-b-m-p
3. Scan the DB again, construct the FP-tree
Construct FP-tree from a Transaction Database (after TID 100)

{}
└ f:1
  └ c:1
    └ a:1
      └ m:1
        └ p:1
Construct FP-tree from a Transaction Database (after TID 200)

{}
└ f:2
  └ c:2
    └ a:2
      ├ m:1
      │ └ p:1
      └ b:1
        └ m:1
Construct FP-tree from a Transaction Database (after TID 300)

{}
└ f:3
  ├ c:2
  │ └ a:2
  │   ├ m:1
  │   │ └ p:1
  │   └ b:1
  │     └ m:1
  └ b:1
Construct FP-tree from a Transaction Database (after TID 400)

{}
├ f:3
│ ├ c:2
│ │ └ a:2
│ │   ├ m:1
│ │   │ └ p:1
│ │   └ b:1
│ │     └ m:1
│ └ b:1
└ c:1
  └ b:1
    └ p:1
Construct FP-tree from a Transaction Database (after TID 500: the final tree)

{}
├ f:4
│ ├ c:3
│ │ └ a:3
│ │   ├ m:2
│ │   │ └ p:2
│ │   └ b:1
│ │     └ m:1
│ └ b:1
└ c:1
  └ b:1
    └ p:1

Header table (item : frequency): f:4, c:4, a:3, b:3, m:3, p:3
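A compact Python sketch of the two-scan construction above. Class and function names are illustrative; note that ties in the frequency order may differ from the slides' f-list.

```python
from collections import Counter, defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent, self.count = item, parent, 0
        self.children = {}

def build_fptree(transactions, min_support):
    # Scan 1: count items, keep the frequent ones, order by descending frequency
    freq = Counter(i for t in transactions for i in t)
    flist = [i for i, c in freq.most_common() if c >= min_support]
    rank = {i: r for r, i in enumerate(flist)}
    root = FPNode(None, None)
    header = defaultdict(list)            # item -> list of nodes (node links)
    # Scan 2: insert each transaction's frequent items in f-list order
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in rank), key=rank.get):
            if item not in node.children:
                node.children[item] = FPNode(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
            node.count += 1
    return root, header, flist

db = [list("facdgimp"), list("abcflmo"), list("bfhjow"), list("bcksp"),
      list("afcelpmn")]
root, header, flist = build_fptree(db, 3)
print(flist)   # ['f', 'c', 'a', ...]; tie order may differ from f-c-a-b-m-p
```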
Find Patterns Having P From P-conditional Database

Starting at the frequent-item header table of the FP-tree, traverse the tree by following the node links of each frequent item p, and accumulate all of the transformed prefix paths of p to form p's conditional pattern base.

Conditional pattern bases:
item  cond. pattern base
c     f:3
a     fc:3
b     fca:1, f:1, c:1
m     fca:2, fcab:1
p     fcam:2, cb:1
Max-patterns

A frequent pattern {a1, …, a100} contains
  C(100,1) + C(100,2) + … + C(100,100) = 2^100 − 1 ≈ 1.27 × 10^30
frequent sub-patterns!

Max-pattern: a frequent pattern that has no proper frequent super-pattern.
With min_sup = 2 in the TDB below, BCDE and ACD are max-patterns; BCD is not a max-pattern.

Tid  Items
10   A, B, C, D, E
20   B, C, D, E
30   A, C, D, F
Example: the root

Tid  Items
10   A, B, C, D, E
20   B, C, D, E
30   A, C, D, F

Min_sup = 2

Items   Frequency
A       2
B       2
C       3
D       3
E       2
F       1
ABCDE   0

Set-enumeration tree (root ABCDEF): children A (BCDE), B (CDE), C (DE), D (E), E ().
F is infrequent, so it is dropped from all child nodes.
Max patterns so far: (none)
Example: node A

Items  Frequency
AB     1
AC     2
AD     2
AE     1
ACD    2

AB and AE are infrequent, so only C and D can extend A; ACD is frequent and cannot be extended further, so it is recorded.
Max patterns so far: ACD
Example: node B

Items  Frequency
BCDE   2

BCDE itself is frequent, so the entire subtree under B collapses into one max pattern.
Max patterns so far: ACD, BCDE
Example: result

The remaining nodes C (DE), D (E) and E () add nothing new: every frequent itemset there is a subset of BCDE.
Max patterns: ACD, BCDE
A Critical Observation

Rule      Support     Confidence
A → BC    sup(ABC)    sup(ABC)/sup(A)
AB → C    sup(ABC)    sup(ABC)/sup(AB)
AC → B    sup(ABC)    sup(ABC)/sup(AC)
A → B     sup(AB)     sup(AB)/sup(A)
A → C     sup(AC)     sup(AC)/sup(A)

A → BC has support and confidence no larger than any of the other rules, independent of the TDB.
So the rules AB → C, AC → B, A → B and A → C are redundant with regard to A → BC.
While mining association rules, a large percentage of the rules found may be redundant.
Formal Definition of Essential Rule

Definition 1: Rule r1 implies rule r2 (denoted r1 ⇒ r2) if support(r1) ≤ support(r2) and confidence(r1) ≤ confidence(r2), independent of the TDB.

Definition 2: Rule r1 is an essential rule if r1 is strong and there is no rule r2 such that r2 ⇒ r1.
Example of a Lattice of Rules

Lattice for itemset ABC (antecedent → consequent):
  Level 0:  ∅ → ABC
  Level 1:  A → BC,  B → AC,  C → AB
  Level 2:  AB → C,  AC → B,  BC → A,  A → B,  A → C,  B → A,  B → C,  C → A,  C → B

• Generate the child nodes: move an item from the consequent to the antecedent, or delete it from the consequent.
• To find essential rules: start from each max itemset; browse top-down; prune a sub-tree whenever a rule is confident.
Frequent Generalized Itemsets

Given a taxonomy of items, the TDB involves only leaf items of the taxonomy. A g-itemset may contain g-items (generalized items), but cannot contain both an ancestor and a descendant at the same time.

!! A descendant g-item acts as a "superset" !!
Anyone who bought {milk, bread} also bought {milk}. Anyone who bought {A} also bought {W}.

?? How to find frequent g-itemsets? Browse (and prune) a lattice of g-itemsets!
To get a node's children, replace one item by its ancestor (if this creates a conflict, remove the item instead).
What Is Sequential Pattern Mining?

Given a set of sequences, find the complete set of frequent subsequences.

A sequence database:
SID  sequence
10   <a(abc)(ac)d(cf)>
20   <(ad)c(bc)(ae)>
30   <(ef)(ab)(df)cb>
40   <eg(af)cbc>

Given support threshold min_sup = 2, <(ab)c> is a sequential pattern.
Mining Sequential Patterns by Prefix Projections

Step 1: find the length-1 sequential patterns: <a>, <b>, <c>, <d>, <e>, <f>.
Step 2: divide the search space. The complete set of sequential patterns can be partitioned into 6 subsets: those having prefix <a>, those having prefix <b>, …, those having prefix <f>.
Finding Sequential Patterns with Prefix <a>

Only need to consider projections w.r.t. <a>.
<a>-projected database: <(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>

Find all length-2 sequential patterns having prefix <a> (namely <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>) by checking the frequency of items like a and _a.
Further partition into 6 subsets: those having prefix <aa>, …, those having prefix <af>.
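A simplified sketch of the <a>-projection in Python. It assumes each sequence element is a string of alphabetically sorted items, and uses '_' to mark the remainder of a partially matched element:

```python
def project(seq, item):
    """Project one sequence w.r.t. a single-item prefix (simplified
    PrefixSpan-style projection). A sequence is a list of element strings."""
    for i, elem in enumerate(seq):
        if item in elem:
            rest = elem[elem.index(item) + 1:]   # items after `item` in the element
            suffix = (['_' + rest] if rest else []) + seq[i + 1:]
            return suffix or None
    return None

db = [['a', 'abc', 'ac', 'd', 'cf'],
      ['ad', 'c', 'bc', 'ae'],
      ['ef', 'ab', 'df', 'c', 'b'],
      ['eg', 'af', 'c', 'b', 'c']]
print([project(s, 'a') for s in db])
# [['abc','ac','d','cf'], ['_d','c','bc','ae'], ['_b','df','c','b'], ['_f','c','b','c']]
```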
2. Clustering

k-means
BIRCH (based on a CF-tree)
DBSCAN
CURE
The K-Means Clustering Method

1. Pick k objects as initial seed points.
2. Assign each object to the cluster with the nearest seed point.
3. Re-compute each seed point as the centroid (mean point) of its cluster.
4. Go back to Step 2; stop when there are no more new assignments.

Not optimal. A counter-example?
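The four steps map directly onto a few lines of NumPy; a minimal sketch (not optimized, with a fixed iteration cap as a safeguard):

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]   # step 1: pick k seeds
    for _ in range(iters):
        # step 2: assign each object to the nearest seed point
        labels = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
        # step 3: recompute each seed as the centroid of its cluster
        new = np.array([X[labels == j].mean(0) if (labels == j).any() else centers[j]
                        for j in range(k)])
        if np.allclose(new, centers):                   # step 4: stop on no change
            break
        centers = new
    return labels, centers

X = np.array([[1., 1.], [1.2, 0.8], [5., 5.], [5.1, 4.9]])
print(kmeans(X, 2)[0])   # e.g. [0 0 1 1]: two well-separated clusters
```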
BIRCH (1996)

BIRCH: Balanced Iterative Reducing and Clustering using Hierarchies, by Zhang, Ramakrishnan, and Livny (SIGMOD'96).

Incrementally construct a CF (Clustering Feature) tree, a hierarchical data structure for multiphase clustering:
  Phase 1: scan the DB to build an initial in-memory CF-tree (a multi-level compression of the data that tries to preserve its inherent clustering structure)
  Phase 2: use an arbitrary clustering algorithm to cluster the leaf nodes of the CF-tree

Scales linearly: finds a good clustering with a single scan and improves the quality with a few additional scans.
Weakness: handles only numeric data, and is sensitive to the order of the data records.
Clustering Feature Vector

Clustering Feature: CF = (N, LS, SS)
  N:  the number of data points
  LS: the linear sum, Σ_{i=1..N} X_i
  SS: the square sum, Σ_{i=1..N} (X_i)²

Example: the five points (3,4), (2,6), (4,5), (4,7), (3,8) give
  CF = (5, (16,30), 244)

[Scatter plot of the five points omitted]
Some Characteristics of CF

Two CFs can be aggregated: given CF1 = (N1, LS1, SS1) and CF2 = (N2, LS2, SS2), if the two clusters are combined into one, CF = (N1+N2, LS1+LS2, SS1+SS2).

The centroid and the radius can both be computed from the CF:
  centroid (the center of the cluster):  x0 = (Σ_{i=1..N} x_i) / N
  radius (the average distance between an object and the centroid):  R = sqrt( (Σ_{i=1..N} (x_i − x0)²) / N )
How?
Some Characteristics of CF

x0 = LS / N

Σ (x_i − x0)² = Σ x_i² − 2·x0·Σ x_i + N·x0²
              = SS − 2·(LS/N)·LS + N·(LS/N)²
              = SS − LS²/N

R = sqrt( (SS − LS²/N) / N ) = sqrt( (N·SS − LS²) / N² )
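A small NumPy sketch of CF aggregation and of computing the centroid and radius via the identity above (here SS is kept as the scalar sum of squared norms, matching the example CF = (5, (16,30), 244)):

```python
import numpy as np

def cf_merge(cf1, cf2):
    """Aggregate two clustering features (N, LS, SS)."""
    return (cf1[0] + cf2[0], cf1[1] + cf2[1], cf1[2] + cf2[2])

def cf_centroid_radius(cf):
    """Centroid x0 = LS/N and radius R = sqrt(SS/N - |x0|^2)."""
    N, LS, SS = cf
    x0 = LS / N
    R = np.sqrt(SS / N - x0 @ x0)
    return x0, R

pts = np.array([[3, 4], [2, 6], [4, 5], [4, 7], [3, 8]], dtype=float)
cf = (len(pts), pts.sum(0), (pts ** 2).sum())   # (5, [16, 30], 244)
print(cf_centroid_radius(cf))                    # centroid [3.2, 6.0], R = 1.6
```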
CF-Tree in BIRCH

Clustering feature:
  a summary of the statistics for a given subcluster: the 0th, 1st and 2nd moments of the subcluster from the statistical point of view
  registers crucial measurements for computing clusters and utilizes storage efficiently

A CF-tree is a height-balanced tree that stores the clustering features for a hierarchical clustering:
  a non-leaf node has descendants ("children")
  a non-leaf node stores the sums of the CFs of its children

A CF-tree has two parameters:
  branching factor: the maximum number of children per node
  threshold T: the maximum radius of the sub-clusters stored at the leaf nodes
Insertion in a CF-Tree

To insert an object o into a CF-tree, insert it at the root node and descend:
  To insert o into an index node, insert it into the child whose centroid is closest to o.
  To insert o into a leaf node:
    if an existing leaf entry can "absorb" it (i.e. the new radius ≤ T), add o to that entry;
    otherwise, create a new leaf entry.
  Split (when a node overflows):
    choose the two entries whose centroids are farthest apart;
    assign them to two different groups;
    assign each remaining entry to one of these groups.
Density-Based Clustering: Background (II)

Density-reachable: a point p is density-reachable from a point q w.r.t. Eps, MinPts if there is a chain of points p1, …, pn with p1 = q and pn = p such that each p_{i+1} is directly density-reachable from p_i.

Density-connected: a point p is density-connected to a point q w.r.t. Eps, MinPts if there is a point o such that both p and q are density-reachable from o w.r.t. Eps and MinPts.

[Figures: a chain q = p1 → … → p; and points p, q both density-reachable from o]
DBSCAN: Density-Based Spatial Clustering of Applications with Noise

Relies on a density-based notion of cluster: a cluster is defined as a maximal set of density-connected points.
Discovers clusters of arbitrary shape in spatial databases with noise.

[Figure: core, border and outlier points for Eps = 1cm, MinPts = 5]
DBSCAN: The Algorithm

Arbitrarily select a point p.
Retrieve all points density-reachable from p w.r.t. Eps and MinPts.
If p is a core point, a cluster is formed.
If p is a border point, no points are density-reachable from p, and DBSCAN visits the next point of the database.
Continue the process until all of the points have been processed.
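A compact (quadratic-memory) sketch of this loop in Python; it is meant to mirror the expansion logic, not to be an efficient implementation:

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """Labels: -1 = noise, 0..k = cluster ids."""
    n = len(X)
    dist = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    neighbors = [np.flatnonzero(dist[i] <= eps) for i in range(n)]
    labels, cluster = np.full(n, -1), 0
    for p in range(n):
        if labels[p] != -1 or len(neighbors[p]) < min_pts:
            continue                      # already assigned, or not a core point
        seeds = list(neighbors[p])        # expand a new cluster from core point p
        labels[p] = cluster
        while seeds:
            q = seeds.pop()
            if labels[q] == -1:
                labels[q] = cluster
                if len(neighbors[q]) >= min_pts:   # q is core: keep growing
                    seeds.extend(neighbors[q])
            # border points receive the label but are not expanded
        cluster += 1
    return labels

X = np.array([[0,0],[0,.5],[.5,0],[10,10],[10,10.5],[10.5,10],[5,5]])
print(dbscan(X, eps=1.0, min_pts=3))   # [0 0 0 1 1 1 -1]: two clusters, one noise point
```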
Motivation for CURE

[Figure: a dataset on which k-means performs poorly]
k-means does not perform well on this; AGNES + d_min has the single-link effect!
CURE: The Basic Version

Initially, insert every object into a priority queue (PQ) as its own cluster.
Every cluster in the PQ has:
  (up to) C representative points
  a pointer to the closest cluster, where dist between two clusters = min{dist(rep1, rep2)}
While the PQ has more than k clusters:
  merge the top cluster with its closest cluster.
Representative Points

Step 1: choose up to C points. If a cluster has no more than C points, take all of them. Otherwise, choose the first point as the farthest from the mean, and choose each subsequent point as the farthest from the already-chosen ones.

Step 2: shrink each point towards the mean: p' = p + α·(mean − p), α ∈ [0,1]. A larger α means shrinking more.
Reason for shrinking: it tames outliers, as faraway points are shrunk more.
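A sketch of both steps in NumPy; `c` and `alpha` correspond to C and the shrink factor α, and the parameter defaults are illustrative:

```python
import numpy as np

def cure_representatives(points, c=4, alpha=0.3):
    """Pick up to c scattered points, then shrink them toward the mean."""
    mean = points.mean(axis=0)
    if len(points) <= c:
        reps = points.copy()
    else:
        # first rep: the point farthest from the mean
        reps = [points[np.argmax(np.linalg.norm(points - mean, axis=1))]]
        while len(reps) < c:
            # next rep: the point farthest from its nearest already-chosen rep
            d = np.min([np.linalg.norm(points - r, axis=1) for r in reps], axis=0)
            reps.append(points[np.argmax(d)])
        reps = np.array(reps)
    return reps + alpha * (mean - reps)    # shrink: p' = p + alpha*(mean - p)
```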
3. Classification

decision tree
naïve Bayesian classifier
Bayesian network
neural net and SVM
Training Dataset

age    income  student  credit_rating  buys_computer
<=30   high    no       fair           no
<=30   high    no       excellent      no
31…40  high    no       fair           yes
>40    medium  no       fair           yes
>40    low     yes      fair           yes
>40    low     yes      excellent      no
31…40  low     yes      excellent      yes
<=30   medium  no       fair           no
<=30   low     yes      fair           yes
>40    medium  yes      fair           yes
<=30   medium  yes      excellent      yes
31…40  medium  no       excellent      yes
31…40  high    yes      fair           yes
>40    medium  no       excellent      no

This follows an example from Quinlan's ID3.
Output: A Decision Tree for "buys_computer"

age?
├ <=30  → student?
│         ├ no  → no
│         └ yes → yes
├ 31…40 → yes
└ >40   → credit_rating?
          ├ excellent → no
          └ fair      → yes
Algorithm for Decision Tree Induction

Basic algorithm (a greedy algorithm):
  The tree is constructed in a top-down, recursive, divide-and-conquer manner.
  At the start, all the training examples are at the root.
  Attributes are categorical (continuous-valued attributes are discretized in advance).
  Examples are partitioned recursively based on selected attributes.
  Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain).

Conditions for stopping partitioning:
  All samples for a given node belong to the same class.
  There are no remaining attributes for further partitioning (majority voting is then employed for classifying the leaf).
  There are no samples left.
General Case

Suppose X can have one of m values V1, V2, …, Vm, with P(X=V1) = p1, P(X=V2) = p2, …, P(X=Vm) = pm.

What's the smallest possible number of bits, on average, per symbol, needed to transmit a stream of symbols drawn from X's distribution? It is H(X), the entropy of X:

  H(X) = −p1·log2 p1 − p2·log2 p2 − … − pm·log2 pm = −Σ_{j=1..m} p_j·log2 p_j

"High entropy" means X is from a uniform (boring) distribution; "low entropy" means X is from a varied distribution with peaks and valleys.
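The definition in code, with the usual convention that 0·log 0 = 0:

```python
import numpy as np

def entropy(ps):
    """H(X) = -sum_j p_j * log2(p_j), skipping zero probabilities."""
    ps = np.asarray(ps, dtype=float)
    ps = ps[ps > 0]
    return float(-(ps * np.log2(ps)).sum())

print(entropy([0.5, 0.5]))          # 1.0: fair coin, maximal for m = 2
print(entropy([0.25] * 4))          # 2.0: uniform over 4 values
print(entropy([0.9, 0.05, 0.05]))   # ~0.57: low entropy, peaked distribution
```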
Specific Conditional Entropy

Definition: H(Y|X=v) = the entropy of Y among only those records in which X has value v.

X = College Major, Y = Likes "Gladiator"

X        Y
Math     Yes
History  No
CS       Yes
Math     No
Math     No
CS       Yes
History  No
Math     Yes

Example:
  H(Y|X=Math) = 1
  H(Y|X=History) = 0
  H(Y|X=CS) = 0
Conditional Entropy

Definition of general conditional entropy: H(Y|X) = the average conditional entropy of Y = Σ_j P(X=v_j) · H(Y|X=v_j)

X = College Major, Y = Likes "Gladiator" (same 8 records as above)

v_j      P(X=v_j)  H(Y|X=v_j)
Math     0.5       1
History  0.25      0
CS       0.25      0

H(Y|X) = 0.5·1 + 0.25·0 + 0.25·0 = 0.5
Conditional Entropy H(C|age)

age     buy  no
<=30    2    3
30…40   4    0
>40     3    2

H(C|age<=30) = (2/5)·lg(5/2) + (3/5)·lg(5/3) = 0.971
H(C|age in 30..40) = 1·lg 1 + 0 = 0
H(C|age>40) = (3/5)·lg(5/3) + (2/5)·lg(5/2) = 0.971

H(C|age) = (5/14)·H(C|age<=30) + (4/14)·H(C|age in 30..40) + (5/14)·H(C|age>40) = 0.694
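The same computation as a small function. The table layout is an assumption for illustration: one row per attribute value, one column per class count.

```python
import numpy as np

def cond_entropy(table):
    """H(C|A) from a contingency table: rows = attribute values,
    columns = class counts (e.g. row 'age<=30' -> [2, 3])."""
    table = np.asarray(table, dtype=float)
    n = table.sum()
    h = 0.0
    for row in table:
        p = row[row > 0] / row.sum()
        h += (row.sum() / n) * -(p * np.log2(p)).sum()
    return h

age = [[2, 3], [4, 0], [3, 2]]   # (buy, no) for <=30, 30..40, >40
print(cond_entropy(age))          # ~0.694, matching H(C|age) above
```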
Select the Attribute with the Lowest Conditional Entropy

H(C|age) = 0.694
H(C|income) = 0.911
H(C|student) = 0.789
H(C|credit_rating) = 0.892

Select "age" to be the tree root!

age?
├ <=30  → student? (no → no, yes → yes)
├ 31…40 → yes
└ >40   → credit_rating? (excellent → no, fair → yes)
Bayesian Classification

X: a data sample whose class label is unknown, e.g. X = (Income=medium, Credit_rating=fair, Age=40).
Hi: the hypothesis that a record belongs to class Ci, e.g. Hi = "the record belongs to the buys-computer class".
P(Hi), P(X): prior probabilities.
P(Hi|X): a conditional probability: among all records with medium income and fair credit rating, what is the probability of buying a computer?

This is what we need for classification! Given X, P(Hi|X) tells us the probability that it belongs to class Ci.
What if we need to determine a single class for X?
Bayes' Theorem

Another concept, P(X|Hi): the probability of observing the sample X given that the hypothesis holds, e.g. among all people who buy a computer, what percentage has the same attribute values as X.

We know P(X ∧ Hi) = P(Hi|X)·P(X) = P(X|Hi)·P(Hi), so

  P(Hi|X) = P(X|Hi)·P(Hi) / P(X)

We should assign X to the class Ci where P(Hi|X) is maximized; since P(X) is the same for every class, this is equivalent to maximizing P(X|Hi)·P(Hi).
Naïve Bayes Classifier

A simplifying assumption: attributes are conditionally independent given the class. The probability of observing, say, two attribute values y1 and y2 together, given class C, is the product of the probabilities of each value taken separately given that class:

  P([y1, y2] | C) = P(y1|C) · P(y2|C)

In general: P(X|Ci) = Π_{k=1..n} P(x_k|Ci)

No dependence relation between attributes is modeled; this greatly reduces the number of probabilities to maintain.
Sample Quiz Questions

(Using the buys_computer training table above.)

1. What data does a naïve Bayesian classifier maintain?
2. Given X = (age<=30, Income=medium, Student=yes, Credit_rating=fair): buy or not buy?
Naïve Bayesian Classifier: Example

Compute P(X|Ci) for each class:
  P(age<=30 | buys_computer=yes) = 2/9 = 0.222
  P(age<=30 | buys_computer=no) = 3/5 = 0.6
  P(income=medium | buys_computer=yes) = 4/9 = 0.444
  P(income=medium | buys_computer=no) = 2/5 = 0.4
  P(student=yes | buys_computer=yes) = 6/9 = 0.667
  P(student=yes | buys_computer=no) = 1/5 = 0.2
  P(credit_rating=fair | buys_computer=yes) = 6/9 = 0.667
  P(credit_rating=fair | buys_computer=no) = 2/5 = 0.4

X = (age<=30, income=medium, student=yes, credit_rating=fair)

P(X|Ci):
  P(X | buys_computer=yes) = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
  P(X | buys_computer=no) = 0.6 × 0.4 × 0.2 × 0.4 = 0.019
P(X|Ci)·P(Ci):
  P(X | yes) · P(yes) = 0.044 × 9/14 = 0.028
  P(X | no) · P(no) = 0.019 × 5/14 = 0.007

X belongs to class buys_computer=yes.
Pitfall: forgetting to multiply by P(Ci).
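A quick numeric check of this example (probabilities copied from the slide; the dictionary keys are illustrative):

```python
priors = {'yes': 9/14, 'no': 5/14}
cond = {  # P(attribute=value | class), read off the slide above
    'yes': {'age<=30': 2/9, 'income=medium': 4/9, 'student=yes': 6/9, 'credit=fair': 6/9},
    'no':  {'age<=30': 3/5, 'income=medium': 2/5, 'student=yes': 1/5, 'credit=fair': 2/5},
}
for c in ('yes', 'no'):
    score = priors[c]
    for v in cond[c].values():
        score *= v             # naive independence: multiply per-attribute terms
    print(c, round(score, 3))  # yes 0.028, no 0.007 -> predict "yes"
```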
Assume Five Variables

T: the lecture started by 10:35
L: the lecturer arrives late
R: the lecture concerns robots
M: the lecturer is Manuela
S: it is sunny

T is only directly influenced by L (i.e. T is conditionally independent of R, M, S given L).
L is only directly influenced by M and S (i.e. L is conditionally independent of R given M and S).
R is only directly influenced by M (i.e. R is conditionally independent of L, S given M).
M and S are independent.
Making a Bayes Net

Step One: add variables. Just choose the variables you'd like to be included in the net (here: T, L, R, M, S as above).

[Net so far: the isolated nodes S, M, R, L, T]
Making a Bayes Net

Step Two: add links.
• The link structure must be acyclic.
• If node X is given parents Q1, Q2, …, Qn, you are promising that any variable that is a non-descendant of X is conditionally independent of X given {Q1, Q2, …, Qn}.

[Net: S → L, M → L, M → R, L → T]
Making a Bayes Net

Step Three: add a probability table for each node.
• The table for node X must list P(X | parent values) for each possible combination of parent values.

P(S) = 0.3          P(M) = 0.6
P(R|M) = 0.3        P(R|~M) = 0.6
P(T|L) = 0.3        P(T|~L) = 0.8
P(L|M∧S) = 0.05     P(L|M∧~S) = 0.1
P(L|~M∧S) = 0.1     P(L|~M∧~S) = 0.2
Computing with a Bayes Net

P(T ∧ ~R ∧ L ∧ ~M ∧ S)
= P(T | ~R∧L∧~M∧S) · P(~R∧L∧~M∧S)
= P(T|L) · P(~R∧L∧~M∧S)
= P(T|L) · P(~R | L∧~M∧S) · P(L∧~M∧S)
= P(T|L) · P(~R|~M) · P(L∧~M∧S)
= P(T|L) · P(~R|~M) · P(L|~M∧S) · P(~M∧S)
= P(T|L) · P(~R|~M) · P(L|~M∧S) · P(~M|S) · P(S)
= P(T|L) · P(~R|~M) · P(L|~M∧S) · P(~M) · P(S)

(Same net and probability tables as on the previous slide.)
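Plugging in the table values gives a concrete number (the variable names below are just mnemonics):

```python
# Values from the probability tables above:
P_T_given_L, P_R_given_notM = 0.3, 0.6
P_L_given_notM_S, P_M, P_S = 0.1, 0.6, 0.3

# P(T ^ ~R ^ L ^ ~M ^ S) via the factored joint:
p = P_T_given_L * (1 - P_R_given_notM) * P_L_given_notM_S * (1 - P_M) * P_S
print(p)   # 0.3 * 0.4 * 0.1 * 0.4 * 0.3 = 0.00144
```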
What we learned?

4. Data warehousing
concept, schema
data cube & operations (rollup, …)
cube computation: multi-way array aggregation
iceberg cube
dynamic data cube
What is a Data Warehouse?

Defined in many different ways, but not rigorously:
  A decision-support database that is maintained separately from the organization's operational database.
  Supports information processing by providing a solid platform of consolidated, historical data for analysis.

"A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process." (W. H. Inmon)

Data warehousing: the process of constructing and using data warehouses.
Conceptual Modeling of Data Warehouses

Modeling data warehouses: dimensions & measures
  Star schema: a fact table in the middle connected to a set of dimension tables.
  Snowflake schema: a refinement of the star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to a snowflake.
  Fact constellation: multiple fact tables share dimension tables; viewed as a collection of stars, therefore also called a galaxy schema.
A Data Cube

0-D (apex) cuboid:  all
1-D cuboids:        product;  quarter;  country
2-D cuboids:        product, quarter;  product, country;  quarter, country
3-D (base) cuboid:  product, quarter, country
Multidimensional Data

Sales volume as a function of product, month, and region.
[Figure: a 3-D cube with axes Product, Region, Month]

Dimensions: Product, Location, Time. Hierarchical summarization paths:

Product    Location   Time
Industry   Region     Year
Category   Country    Quarter
Product    City       Month / Week
           Office     Day

Pick one node from each dimension hierarchy and you get a data cube!
How many cubes? How many distinct cuboids?
Typical OLAP Operations

Roll up (drill-up): summarize data by climbing up a hierarchy or by dimension reduction.
Drill down (roll down): the reverse of roll-up; from a higher-level summary to a lower-level summary or detailed data, or introducing new dimensions.
Slice and dice: project and select.
Pivot (rotate): reorient the cube; visualization; 3-D to a series of 2-D planes.
Typical OLAP Operations

(Hierarchies as above: Industry / Category / Product; Region / Country / City / Office; Year / Quarter / Month, Week / Day.)

?? Starting from [product, city, week], what OLAP operations can produce the total sales for every month and every category in the "automobile" industry?
OLAP Server Architectures

Relational OLAP (ROLAP):
  uses a relational or extended-relational DBMS to store and manage warehouse data, plus OLAP middleware to support the missing pieces
  includes optimization of the DBMS backend, implementation of aggregation navigation logic, and additional tools and services
  greater scalability
Multidimensional OLAP (MOLAP):
  array-based multidimensional storage engine (sparse-matrix techniques)
  fast indexing to pre-computed summarized data
Hybrid OLAP (HOLAP):
  user flexibility, e.g., low level: relational; high level: array
Specialized SQL servers:
  specialized support for SQL queries over star/snowflake schemas
Multi-way Array Aggregation for Cube Computation

[Figure: a 3-D array with dimensions A (a0..a3), B (b0..b3), C (c0..c3), partitioned into 64 chunks numbered 1..64 and scanned in that order]

Order: ABC.  AB: plane;  AC: line;  BC: point.
Multi-Way Array Aggregation for Cube Computation (cont.)

Let A have 40 values, B 400 values, and C 4000 values; each dimension is split into 4 chunks, so one chunk contains 10 × 100 × 1000 = 1,000,000 cells.

Scanning order ABC needs how much memory?
  AB plane: 40 × 400 = 16,000
  AC line:  40 × (4000/4) = 40,000
  BC point: (400/4) × (4000/4) = 100,000
  total: 156,000

Scanning order CBA needs how much memory?
  CB plane: 4000 × 400 = 1,600,000
  CA line:  4000 × (40/4) = 40,000
  BA point: (400/4) × (40/4) = 1,000
  total: 1,641,000 (more than 10 times as much!)
Computing Iceberg Cubes Using BUC

BUC (Beyer & Ramakrishnan, SIGMOD'99).
Bottom-up vs. top-down? It depends on how you view it!
Apriori property: aggregate the data, then move to the next level; if minsup is not met, stop, since no finer-grained partition can meet it either.
The Dynamic Data Cube [EDBT'00]

[Figure: an 8×8 data cube of 1s split into four 4×4 quadrants (index ranges 1..4 and 5..8); each quadrant carries row and column borders storing the prefix sums 4, 8, 12, 16]

E.g. 16 + 12 + 8 + 6 = 42. Query cost = update cost = O(log² n).
Dynamic Data Cube Summary

A balanced tree with fanout 4; the leaf nodes contain the original data cube.
Each index entry stores an X-border and a Y-border.
Each border is stored as a binary tree, which supports a 1-dimensional prefix-sum query and an update in O(log n) time.
Overall, the DDC supports a range-sum query and an update both in O(log² n) time.
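The border structure itself is not spelled out in these slides; as a stand-in with the same O(log n) bounds, a standard binary indexed (Fenwick) tree supports 1-dimensional prefix sums and point updates:

```python
class Fenwick:
    """Binary indexed tree: O(log n) prefix sums and point updates,
    the same bound the DDC's 1-dim borders provide (a standard stand-in,
    not the paper's exact structure)."""
    def __init__(self, n):
        self.t = [0] * (n + 1)
    def update(self, i, delta):          # add delta at position i (1-based)
        while i < len(self.t):
            self.t[i] += delta
            i += i & -i
    def prefix_sum(self, i):             # sum of positions 1..i
        s = 0
        while i > 0:
            s += self.t[i]
            i -= i & -i
        return s

f = Fenwick(8)
for i in range(1, 9):
    f.update(i, 1)                       # an all-ones row, as in the figure
print(f.prefix_sum(4))                   # 4
```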
5. Additional

lattice (of itemsets, g-itemsets, rules, cuboids)
distance-based indexing
Problem Statement

Given a set S of objects and a metric distance function d(), the similarity search problem is: for an arbitrary query object q and a threshold ε, find { o | o ∈ S ∧ d(o, q) < ε }.

Solution without an index: for every o ∈ S, compute d(q, o). Not efficient!
An Example of the VP-tree

S = {o1, …, o10}. Randomly pick o1 as the root (vantage point). Compute the distance between o1 and each oi, and sort in increasing order of distance:

object:    o3  o7  o6  o9  o10  o2   o8   o5   o4
distance:  5   6   18  34  96   102  111  300  401

Split in the middle and build the tree recursively:
  root o1 (maxDL = 34, minDR = 96)
  left subtree:  o3, o7, o6, o9
  right subtree: o10, o2, o8, o5, o4
Query Processing

Given query object q, compute d(q, root). Intuitively, if it is small, search the left subtree; otherwise, search the right subtree.

In each index node, store:
  maxDL = max{ d(root, oi) | oi ∈ left subtree }
  minDR = min{ d(root, oi) | oi ∈ right subtree }

Pruning conditions:
  prune the left subtree if d(q, root) − maxDL ≥ ε
  prune the right subtree if minDR − d(q, root) ≥ ε

?? maxDL = 10, minDR = 20, d(q, root) = 10, ε = 10. Which subtree(s) do we check?
?? maxDL = 10, minDR = 20, d(q, root) = 10: for what ε do we have to check both subtrees?
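A sketch of the resulting range search. The node fields `point`, `left`, `right`, `maxDL`, `minDR` are assumed from the description above; a missing child is None, and leaves can use maxDL = 0 and minDR = infinity.

```python
def range_search(node, q, eps, d, results):
    """Collect every object o with d(q, o) < eps from a VP-tree."""
    if node is None:
        return
    dq = d(q, node.point)
    if dq < eps:
        results.append(node.point)
    if not (dq - node.maxDL >= eps):      # left pruning condition fails: descend
        range_search(node.left, q, eps, d, results)
    if not (node.minDR - dq >= eps):      # right pruning condition fails: descend
        range_search(node.right, q, eps, d, results)
```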