Chapt. 7 Multidimensional Hierarchical Clustering

29
Prof. Bayer, DWH, Ch.7, S S2002 1 Chapt. 7 Multidimensional Hierarchical Clustering Fig. 3.1 Hierarchies in the `Juice and More´ schema Year (3) Month (12) TIME Region (8) Nation (7) Trade Type (2) Business Type (7) CUSTOMER Type (5) Brand (8) Category (19) Container (10) PRODUCT Sales Organization (5) Distribution Channel (3) DISTRIBUTION All Products All Distributions All Customer All Time

description

PRODUCT. CUSTOMER. DISTRIBUTION. TIME. All Products. All Customer. All Distributions. All Time. Type (5). Sales. Region (8). Year (3). Organization (5). Month (12). Brand (8). Distribution. Nation (7). Channel (3). Category (19). Trade. Type (2). Container (10). Business. - PowerPoint PPT Presentation

Transcript of Chapt. 7 Multidimensional Hierarchical Clustering

Page 1: Chapt. 7 Multidimensional Hierarchical Clustering

Prof. Bayer, DWH, Ch.7, SS2002 1

Chapt. 7 Multidimensional Hierarchical Clustering

Fig. 3.1 Hierarchies in the `Juice and More´ schema

Year (3)

Month (12)

TIME

Region (8)

Nation (7)

Trade Type (2)

Business Type (7)

CUSTOMER

Type (5)

Brand (8)

Category (19)

Container (10)

PRODUCT

Sales Organization (5)

DistributionChannel (3)

DISTRIBUTION

All Products All DistributionsAll Customer All Time

Page 2: Chapt. 7 Multidimensional Hierarchical Clustering

Prof. Bayer, DWH, Ch.7, SS2002 2

(b)

PRODKEY

CUSTKEY

DISTKEY

TIMEKEY

SALES

DISTCOST

PRODKEY

PRODUCT2180 rows

TYPE

BRAND

CATEGORY

CONTAINER

...

CUSTKEY

CUSTOMER7064 rows

REGION

NATION

TRADE-TYPE

BUSINESS-TYPE

...

DISTKEY

DISTRIBUTION12 rows

SALESORG

CHANNEL

...

TIMEKEY

TIME36 rows

YEAR

MONTH

FACT26M rows

...

Page 3: Chapt. 7 Multidimensional Hierarchical Clustering

Prof. Bayer, DWH, Ch.7, SS2002 3

Size of completely aggregated Cube

(6*9*20*11)*(9*8*3*8)*(6*4)*(4*13)------------------------------------------------ = (5*8*19*10)*(8*7*2*7)*(5*3)*(3*12)

4*6*6*9*11*13 185.328-------------------- = ----------- = 7.96 larger than base cube 5*5*7*7*19 23.275

Base Cube has 2.245.024.000 cells * 4 B ~ 9 GB

Number of available facts: 26 million

Page 4: Chapt. 7 Multidimensional Hierarchical Clustering

Prof. Bayer, DWH, Ch.7, SS2002 4

Sparsity:

26*106

-------------- = 0,01162,245* 109

100 - 1.16 = 98.84 % sparsity

Page 5: Chapt. 7 Multidimensional Hierarchical Clustering

Prof. Bayer, DWH, Ch.7, SS2002 5

Hierarchically aggregated Cube

(1+5+40+760+7600) = 8406

(1+8+56+112+784) = 961

(1+5+15) = 21

(1+3+24) = 28

= 4.749.961.608

Size of base cube 2.145.024.000

Number of aggregate cells 2.504.937.608

==> Juice and More database has 96 times more hierarchically aggregated cells than occupied base cells!

Page 6: Chapt. 7 Multidimensional Hierarchical Clustering

Prof. Bayer, DWH, Ch.7, SS2002 6

Star-Joins

Restrictions on several dimension tables, which are then joined with fact table

In addition: grouping, computation of aggregates, sorting of results.

Example:

Select <MEASURE AGGREGATION>From Fact F, Customer C, DISTRIBUTION D,

Product P, Time TWhere F. ProdKey = P. ProdKey AND

F. CustKey = C. CustKey ANDF.TIMEKEY = T.TIMEKEY ANDF.DISTKEY = D.DISTKEY AND<CUSTOMER RESTRICTION> AND<DISTRIBUTION RESTRICTION> AND<PRODUCT RESTRICTION> AND<TIME RESTRICTION>

Page 7: Chapt. 7 Multidimensional Hierarchical Clustering

Prof. Bayer, DWH, Ch.7, SS2002 7

Select <MEASURE AGGREGATION>

From Fact F

Where F. ProdKey BETWEEN Pkey1 AND Pkey2 AND

F. DistKey BETWEEN Dkey1 AND Dkey2 AND

F. CustKey BETWEEN Ckey1 AND Ckey2 AND

F. TimeKey BETWEEN Tkey1 AND Tkey2

Page 8: Chapt. 7 Multidimensional Hierarchical Clustering

Prof. Bayer, DWH, Ch.7, SS2002 8

Key Question:

How to compute star-joins efficiently?

• Secondary indexes on foreign keys of fact table (standard B-trees), see chapter 5 for details

- intersect result lists

-retrieve tuples from fact table randomly

• Bitmaps

Page 9: Chapt. 7 Multidimensional Hierarchical Clustering

Prof. Bayer, DWH, Ch.7, SS2002 9

Bitmap Index Intersection

bitmap for organization = „TM“

bitmap for region = „

Asia “

1.....1.11 1.1...1.1. 1.1...1.1. ...1.1.... ..1.1...1.

11.1...... 1.11.....1 .1.1..1... 1.1.1..... .1..1.1...

1......... 1.1....... ......1... .......... ....1.....

Page 1 Page 2 Page 3 Page 4 Page 5

result of bitmap intersection

accessed disk pages (shaded)

34 % of tuples

32 % of tuples

10 % of tuples

80 % of

pages

Page 10: Chapt. 7 Multidimensional Hierarchical Clustering

Prof. Bayer, DWH, Ch.7, SS2002 10

Problem: for small result sets of a few %, almost all pages of the facts table must be fetched from disk, if the hits in the result set are not clustered on disk.

Ex: with 8 KB pages 20 to 400 tuples per page, i.e. at 0.25% to 5% hits in the result almost all pages must be fetched.

At least tuple clustering, preferably page clustering, are desirable, but how??

Goal: Code hierarchies in such a way, that for star-joins with the Fact table we have to join only with a query box on the Fact table

Page 11: Chapt. 7 Multidimensional Hierarchical Clustering

Prof. Bayer, DWH, Ch.7, SS2002 11

Basic Idea for Multidimensional Clustering

1L} 0.5L; Juice Apple 1L; OJ 0.7L; OJ 0.33L; {OJ

0

1 m

1L} OJ 0.7L; OJ 0.33L; {OJ1

1 m

0.5L} {A-Juice2 4 m

1L} Juice Apple 0.5L Juice Apple {1

2 m

0.33L} {OJ2 1 m 0.7L} {OJ

22 m 1L} OJ {

23 m 1L} {A-Juice

2 5m

Orange Juice Apple Juice

0,33L 0,7L 1L0,5L

Product Category

All Products

All

0 1

10 0 2

Level Label Member Ordinal (e.g.,1) Member Label (e.g., 0.7L)Legend:

Example Hierarchy in Member Set Representation

AppleJuice

1 1L

Page 12: Chapt. 7 Multidimensional Hierarchical Clustering

Prof. Bayer, DWH, Ch.7, SS2002 12

Dimension D consists of

Value Set V = [[ v1, v2, ... vn ]]

Hierarchy H of height h consisting of h+1 hierarchy levels H = [[L0 , L1

,..., Lh ]]

Level Li is a set of sets = [[m1i, ..., mj

i ]] with mki V

mki get names, e.g. „Orange Juice“ as label(m1

1), in general label(mk

i)

Constraint: every mli+1 must be a subset of some mk

i

Page 13: Chapt. 7 Multidimensional Hierarchical Clustering

Prof. Bayer, DWH, Ch.7, SS2002 13

Hierarchic Relationships

The children of mki are all those sets ml

i+1 of the lower level i+1 with the property:

mli+1 mk

i , formally:

children(mki ) := [[ml

i+1 Li+1 : mli+1 mk

i ]]

parent(mki ) := [[ml

i-1 Li-1 : mli-1 mk

i ]]

Principle: the children of m are numbered by the bijective function ordm starting at 1 or 0

Page 14: Chapt. 7 Multidimensional Hierarchical Clustering

Prof. Bayer, DWH, Ch.7, SS2002 14

Hierarchic Relationships

The children of mki are all those sets ml

i+1 of the lower level i+1 with the property:

mli+1 mk

i , formally:

children(mki ) := [[ml

i+1 Li+1 : mli+1 mk

i ]]

parent(mki ) := [[ml

i-1 Li-1 : mli-1 mk

i ]]

Principle: the children of m are numbered by the bijective function ordm starting at 1 or 0

Page 15: Chapt. 7 Multidimensional Hierarchical Clustering

Prof. Bayer, DWH, Ch.7, SS2002 15

Enumeration and Surrogate Functions

Let A be an enumeration type

A = [[ a0, a1, ... ak ]]

f : A --> (0, 1 ,..., k ) defined as

f (ai ) = i

then i is called the surrogate of ai

Page 16: Chapt. 7 Multidimensional Hierarchical Clustering

Prof. Bayer, DWH, Ch.7, SS2002 16

Hierarchies and composite Surrogates

Basic Idea: concatenate the surogates of successive hierarchy levels (compound surrogates cs)

Note: the root ALL of the hierarchy is not encoded

Def: compound surrogate cs for hierarchy H

ordm : children (m) --> [[0, 1, ..., |children(m)| -1]]

cs (H, mi) := ord father (mi) (mi) if i=1

:=cs (H, father ( mi)) ord father (mi) (mi) otherwise

Page 17: Chapt. 7 Multidimensional Hierarchical Clustering

Prof. Bayer, DWH, Ch.7, SS2002 17

Example:

REGION f(REGION)

South Europe 0

Middle Europe 1

Northern Europe 2

Western Europe 3

North America 4

Latin America 5

Asia 6

Australia 7

(a)

Page 18: Chapt. 7 Multidimensional Hierarchical Clustering

Prof. Bayer, DWH, Ch.7, SS2002 18

0

CUSTOMER

South Europe North America Asia

RetailWholesale Kana´s Sushi Bar

Joe‘s Sports Bar

... ...

Bar

4 6

2

1

10

RetailUSACanada 10

... ...

... ...

... ...

Australia7

Wholesale0

Surrogates for Region and the entire Costumer Hierarchy

Page 19: Chapt. 7 Multidimensional Hierarchical Clustering

Prof. Bayer, DWH, Ch.7, SS2002 19

Example: the path

North America --> USA --> Retail --> Bar

has the compound surrogate 4112

Next Idea: for every hierarchy level determine the higest branching degree (plus a safety margin for future extensions) and code by fixed number of bits.

surrogates (H,i) := max [[ cardinality (children (H,m)) : m level (H, i-1) ]]

Page 20: Chapt. 7 Multidimensional Hierarchical Clustering

Prof. Bayer, DWH, Ch.7, SS2002 20

let li := log2 surrogates (H,i)

then li bits are needed for the surrogates of level i

let be a path = m0 m1 m2 ... mh

to a leaf mh of hierarchy H:

Page 21: Chapt. 7 Multidimensional Hierarchical Clustering

Prof. Bayer, DWH, Ch.7, SS2002 21

cs (H,) = cs (H,mh)

hlll

mfathermord 32

1 21

hlll

mfathermord 43

2 22

h

mfathermord h

:=

... +

+

+

Page 22: Chapt. 7 Multidimensional Hierarchical Clustering

Prof. Bayer, DWH, Ch.7, SS2002 22

Example:

cs (H, Bar) = 100 001 1 010 = 538

l1=3 l2=3 l3=1 l4=3

number of bits needed at certain level

Page 23: Chapt. 7 Multidimensional Hierarchical Clustering

Prof. Bayer, DWH, Ch.7, SS2002 23

Properties of MHC Encoding

• very compact coding of fixed length

• lexicographic order of composite keys remains, i.e. isomorphic to integer ordering

• point restrictions on arbitrary hierarchy levels lead to interval restrictions on the compound surrogates

Page 24: Chapt. 7 Multidimensional Hierarchical Clustering

Prof. Bayer, DWH, Ch.7, SS2002 24

Example: path to USA is:

North America --> USA

4 = 1002 1 = 0012

leads to range on cs:100 001 0 0002 to 100 001 1 1112

and to the decimal range:528 to 543 or [528 : 543]

==> star join with restriction North America.USA leads to an interval restriction on the fact table

==> point restrictions on arbitrary hierarchy levels of several dimensions lead to Query Boxes on the fact table.

Page 25: Chapt. 7 Multidimensional Hierarchical Clustering

Prof. Bayer, DWH, Ch.7, SS2002 25

Complex Hierarchies

• time with months and weeks, both restrictions lead to intervals on the level of days

• Example of Fig. 4-4

• proposal for multiple hierarchies: choose the most useful (depending on the query profile) or consider multiple hierarchies as several independent hierarchies. Caution, this increases the number of dimensions !!!

• Time variant hierarchies: extend by time interval of validity , see Example Fig. 4-5,

Page 26: Chapt. 7 Multidimensional Hierarchical Clustering

Prof. Bayer, DWH, Ch.7, SS2002 26

(a) (b)

YEAR

MONTH WEEK

DAY

REGION

NATION

TRADE TYPE

CUSTOMER TYPE

CUSTOMER SIZE

CUSTOMER

Fig. 4-4 Complex Hierarchy Graphs

Page 27: Chapt. 7 Multidimensional Hierarchical Clustering

Prof. Bayer, DWH, Ch.7, SS2002 27

CUSTOMER

South Europe North America ...

USACanada

Retail Wholesale

Bar Restaurant

Joe ‘s Sports BarYear <= 1997 Year > 1997

Fig. 4-5 Change of a hierarchy over the time

Page 28: Chapt. 7 Multidimensional Hierarchical Clustering

Prof. Bayer, DWH, Ch.7, SS2002 28

Orange

Juice

Asia

Page 29: Chapt. 7 Multidimensional Hierarchical Clustering

Prof. Bayer, DWH, Ch.7, SS2002 29

Apple

Juice

AsiaProcessing a query box in sort order with the Tetris algorithm