Data Mining


Page 1: Data Mining

made by Radmilo Pesic & Branko Golubovic 1/74

Tutorial based on the book:

Data Mining: Concepts and Techniques

by Jiawei Han and Micheline Kamber

This material was developed with financial help of the WUSA fund of Austria.

Page 2: Data Mining

1. Introduction

Page 3: Data Mining

What motivated data mining?

Necessity is the mother of invention.

• Data Collection and Database Creation (1960s and earlier)
• Database Management Systems (1970s-early 1980s)
• Advanced Database Systems (mid-1980s-present)
• Web-based Database Systems (1990s-present)
• Data Warehousing and Data Mining (mid-1980s-present)
• New Generation of Integrated Information Systems (2000-…)

Page 4: Data Mining

What Is Data Mining?

Extracting or “mining” knowledge from large amounts of data.

Diagram: Databases and flat files → Cleaning and Integration → Data warehouse → Selection and Transformation → Data Mining → Patterns → Evaluation and Presentation → Knowledge

1. Data cleaning
2. Data integration
3. Data selection
4. Data transformation
5. Data mining
6. Pattern evaluation
7. Knowledge presentation

Page 5: Data Mining

Components of a typical data mining system:

• Database, data warehouse, or other information repository
• Database or data warehouse server
• Knowledge base
• Data mining engine
• Pattern evaluation module
• Graphical user interface

Diagram (layers, top to bottom): Graphical user interface; Pattern evaluation; Data mining engine; Database or data warehouse server; Database and Data warehouse. The Knowledge base is drawn alongside these layers.

Page 6: Data Mining

Data Mining – On What Kind of Data?

• Relational Databases
• Data Warehouses
• Transactional Databases
• Advanced Database Systems and Advanced Database Applications (object-oriented, object-relational, spatial, temporal, time-series, text, multimedia, heterogeneous, legacy databases and the World Wide Web)

Page 7: Data Mining

Relational Databases

customer

cust_ID | name | address | age | income | credit_info | …
C1 | Smith, Sandy | 5463 E Hastings, Burnaby, BC V5A 4S9, Canada | 21 | $27000 | 1 | …

item

item_ID | name | brand | category | type | price | place_made | supplier | cost
I3 | high_res_TV | Toshiba | high resolution | TV | $988.00 | Japan | NikoX | $600.00
I8 | multidisc-CDplay | Sanyo | multidisc | CD player | $369.00 | Japan | Music Front | $120.00

employee

empl_ID | name | category | group | salary | commission
E55 | Jones, Jane | home entertainment | manager | $18,000 | 2%

branch

branch_ID | name | address
B1 | City Square | 369 Cambie St., Vancouver, BC V5L 3A2, Canada

purchases

trans_ID | cust_ID | empl_ID | date | time | method_paid | amount
T100 | C1 | E55 | 09/21/98 | 15:45 | Visa | $1357.00

item_sold

trans_ID | item_ID | qty
T100 | I3 | 1
T100 | I8 | 2

works_at

empl_ID | branch_ID
E55 | B1

Page 8: Data Mining

Data Warehouses

Diagram: data sources in Chicago, New York, Toronto, and Vancouver → Clean, Transform, Integrate, Load → Data warehouse → Query and analysis tools → Clients

Typical architecture of a data warehouse for AllElectronics

Page 9: Data Mining

Figure: a 3-D data cube of sales with the dimensions address (cities: Chicago, New York, Toronto, Vancouver), time (quarters: Q1 to Q4), and item (types: home entertainment, computer, phone, security). One highlighted cell is <Vancouver,Q1,security>. Two derived cubes are shown: a roll-up on address (cities to countries: Canada, USA) and a drill-down on time data for Q1 (quarters to months: Jan, Feb, March).

Page 10: Data Mining

Text Databases and Multimedia Databases

• Text databases can be highly unstructured, semistructured, or well structured
• Multimedia databases store image, audio, and video data
• Such data require a lot of storage space; they are continuous-media data

Heterogeneous Databases and Legacy Databases

The World Wide Web

• mining path traversal patterns

Page 11: Data Mining

Data Mining Functionalities: What Kinds of Patterns Can Be Mined?

• Concept/Class Description: Characterization and Discrimination
• Association Analysis
• Classification and Prediction
• Cluster Analysis
• Outlier Analysis
• Evolution Analysis

Page 12: Data Mining

Are All of the Patterns Interesting?

A pattern is interesting if it is:

• easily understood

• valid

• (potentially) useful

• novel

or if it

• confirms user’s hypothesis

An interesting pattern represents knowledge!

Page 13: Data Mining

Objective measures of pattern interestingness:

• support
• confidence
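The two objective measures can be illustrated with a small sketch. It assumes the usual definitions for an association rule A ⇒ B: support is the fraction of transactions containing both A and B, and confidence is the fraction of transactions containing A that also contain B. The transactions and item names below are made up for illustration.

```python
# Hypothetical market-basket transactions; the item names are illustrative only.
transactions = [
    {"computer", "software"},
    {"computer", "printer"},
    {"computer", "software", "printer"},
    {"phone"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent) for the rule antecedent => consequent."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

# Rule: computer => software
rule_support = support({"computer", "software"}, transactions)
rule_confidence = confidence({"computer"}, {"software"}, transactions)
print(rule_support, rule_confidence)
```

A rule is usually kept only if both values exceed user-chosen thresholds; the thresholds themselves are a design choice and are not fixed by the slides.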

Subjective measures of pattern interestingness:

• the pattern is unexpected
• the pattern is actionable
• the pattern is expected (it confirms the user’s hypothesis)

Can a data mining system generate all of the interesting patterns?

Can a data mining system generate only interesting patterns?

Page 14: Data Mining


Classification of Data Mining Systems

• according to kinds of databases mined (relational, data warehouse, object-oriented…)

• according to kinds of knowledge mined (association, classification, clustering…; generalized, primitive-level or knowledge at multiple levels; regularities or irregularities)

• according to the kinds of techniques utilized (autonomous, interactive exploratory or query-driven systems; data warehouse oriented, statistics…)

• according to the applications adapted (for finance, DNA, etc.)

Diagram: data mining at the confluence of several disciplines: database technology, statistics, machine learning, information science, visualization, and other disciplines.

Page 15: Data Mining

Major Issues in Data Mining

Mining methodology and user interaction issues:

• Mining different kinds of knowledge in databases

• Interactive mining of knowledge at multiple levels of abstraction

• Incorporation of background knowledge

• Data mining query languages and ad hoc data mining

• Presentation and visualization of data mining results

• Handling noisy or incomplete data

• Pattern evaluation – the interestingness problem

Performance issues:

• Efficiency and scalability of data mining algorithms

• Parallel, distributed, and incremental mining algorithms

Issues relating to the diversity of database types:

• Handling of relational and complex types of data

• Mining information from heterogeneous databases and global information systems

Page 16: Data Mining

2. Data Warehouse and OLAP Technology for Data Mining

Page 17: Data Mining

What Is a Data Warehouse?

“A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision making process.”

W.H. Inmon

• Subject-oriented
• Integrated
• Time-variant
• Nonvolatile

Page 18: Data Mining

How are organizations using the information from data warehouses?

• Increasing customer focus
• Repositioning products and managing product portfolios
• Analyzing operations and looking for sources of profit
• Managing customer relationships, making environmental corrections, and managing the cost of corporate assets

Different approaches to heterogeneous database integration:

• Query-driven approach (wrappers and integrators)
• Update-driven approach

Page 19: Data Mining

Differences Between Operational Database Systems and Data Warehouses

• Users and system orientation
• Data contents
• Database design
• View
• Access patterns

Why have a separate data warehouse?

Page 20: Data Mining


A Multidimensional Data Model

From Tables and Spreadsheets to Data Cubes

• A data cube is defined by dimensions and facts

• Dimension table

• Fact table

Page 21: Data Mining

Sales by time (quarter) and item, for each location. Each cell lists the four item types in the order: home ent., comp., phone, sec.

time | location = “Chicago” | location = “New York” | location = “Toronto” | location = “Vancouver”
Q1 | 854, 882, 89, 623 | 1087, 968, 38, 872 | 818, 746, 43, 591 | 605, 825, 14, 400
Q2 | 943, 890, 64, 698 | 1130, 1024, 41, 925 | 894, 769, 52, 682 | 680, 952, 31, 512
Q3 | 1032, 924, 59, 789 | 1034, 1048, 45, 1002 | 940, 795, 58, 728 | 812, 1023, 30, 501
Q4 | 1129, 992, 63, 870 | 1142, 1091, 54, 984 | 978, 864, 59, 784 | 927, 1038, 38, 580

Figure: the same data drawn as a 3-D data cube with the dimensions location (cities), time (quarters), and item (types).

A 2-D view of sales data for AllElectronics, and its 3-D data cube representation

Page 22: Data Mining

Figure: a series of 3-D cubes with the dimensions location (cities), time (quarters), and item (types), one cube per value of the fourth dimension, supplier (supplier=“SUP1”, supplier=“SUP2”, …).

A 4-D data cube representation of sales data for AllElectronics

Page 23: Data Mining

0-D (apex) cuboid: all

1-D cuboids: time; item; location; supplier

2-D cuboids: (time, item); (time, location); (time, supplier); (item, location); (item, supplier); (location, supplier)

3-D cuboids: (time, item, location); (time, item, supplier); (time, location, supplier); (item, location, supplier)

4-D (base) cuboid: (time, item, location, supplier)

Lattice of cuboids, making up a 4-D data cube

Page 24: Data Mining

Stars, Snowflakes, and Fact Constellations: Schemas for Multidimensional Databases

Star schema:

• a large central table (fact table)
• a set of smaller attendant tables (dimension tables), one for each dimension

sales fact table: time_key, item_key, branch_key, location_key, dollars_sold, units_sold

time dimension table: time_key, day, day_of_week, month, quarter, year

item dimension table: item_key, item_name, brand, type, supplier_type

branch dimension table: branch_key, branch_name, branch_type

location dimension table: location_key, street, city, province_or_state, country
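To make the star schema concrete, the following sketch builds a cut-down version of it in an in-memory SQLite database and answers one typical query by joining the fact table to two dimension tables. Only a few of the columns are kept, the table names `time_dim`, `location_dim`, and `sales_fact` are chosen here for illustration, and the rows are made up.

```python
import sqlite3

# In-memory database holding a reduced star schema: one fact table, two dimension tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, quarter TEXT, year INTEGER);
CREATE TABLE location_dim (location_key INTEGER PRIMARY KEY, city TEXT, country TEXT);
CREATE TABLE sales_fact (time_key INTEGER, location_key INTEGER,
                         dollars_sold REAL, units_sold INTEGER);
INSERT INTO time_dim VALUES (1, 'Q1', 1998), (2, 'Q2', 1998);
INSERT INTO location_dim VALUES (1, 'Vancouver', 'Canada'), (2, 'Chicago', 'USA');
INSERT INTO sales_fact VALUES (1, 1, 605.0, 3), (1, 2, 854.0, 4), (2, 1, 680.0, 2);
""")

# Total dollars sold per city and quarter: the fact table supplies the measure,
# the dimension tables supply the descriptive attributes used for grouping.
rows = conn.execute("""
SELECT l.city, t.quarter, SUM(f.dollars_sold)
FROM sales_fact AS f
JOIN time_dim AS t ON f.time_key = t.time_key
JOIN location_dim AS l ON f.location_key = l.location_key
GROUP BY l.city, t.quarter
ORDER BY l.city, t.quarter
""").fetchall()
print(rows)
```

Every dimension is reached from the fact table with a single join, which is the property that distinguishes a star schema from its normalized variants.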

Page 25: Data Mining

Snowflake schema:

• a variant of the star schema, where some dimension tables are normalized
• reduces redundancies, but reduces the effectiveness of browsing

sales fact table: time_key, item_key, branch_key, location_key, dollars_sold, units_sold

time dimension table: time_key, day, day_of_week, month, quarter, year

item dimension table: item_key, item_name, brand, type, supplier_key

supplier dimension table: supplier_key, supplier_type

branch dimension table: branch_key, branch_name, branch_type

location dimension table: location_key, street, city_key

city dimension table: city_key, city, province_or_state, country

Page 26: Data Mining

Fact constellation:

• multiple fact tables share dimension tables

sales fact table: time_key, item_key, branch_key, location_key, dollars_sold, units_sold

shipping fact table: item_key, time_key, shipper_key, from_location, to_location, dollars_cost, units_shipped

time dimension table: time_key, day, day_of_week, month, quarter, year

item dimension table: item_key, item_name, brand, type, supplier_type

branch dimension table: branch_key, branch_name, branch_type

location dimension table: location_key, street, city, province_or_state, country

shipper dimension table: shipper_key, shipper_name, location_key, shipper_type

Page 27: Data Mining

Defining multidimensional schema

• DMQL – data mining query language
• Syntax:

cube definition:
define cube <cube_name> [<dimension_list>]: <measure_list>

dimension definition:
define dimension <dimension_name> as (<attribute_or_subdimension_list>)

Page 28: Data Mining


Example:

• Constellation schema defined in DMQL:

define cube sales [time, item, branch, location]:

dollars_sold=sum(sales_in_dollars), units_sold=count(*)

define dimension time as (time_key, day, day_of_week, month, quarter, year)

define dimension item as (item_key, item_name, brand, type, supplier_type)

define dimension branch as (branch_key, branch_name, branch_type)

define dimension location as (location_key, street, city, province_or_state, country)

define cube shipping [time, item, shipper, from_location, to_location]:

dollars_cost=sum(cost_in_dollars), units_shipped=count(*)

define dimension time as time in cube sales

define dimension item as item in cube sales

define dimension shipper as (shipper_key, shipper_name, location as location in cube sales, shipper_type)

define dimension from_location as location in cube sales

define dimension to_location as location in cube sales

Page 29: Data Mining


Measures: Their Categorization and Computation

Measures, based on the aggregate function:

• Distributive

• Algebraic

• Holistic
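The three categories differ in whether a measure can be computed from partial results. The sketch below uses sum as a distributive measure, average as an algebraic one (derived from sum and count), and median as a holistic one. The two partitions and their values are hypothetical.

```python
from statistics import median

# Hypothetical sales values split into two partitions (for example, two data chunks).
partitions = [[4, 8, 15], [21, 21, 24, 25]]
all_values = [v for part in partitions for v in part]

# Distributive: the sum of the partition sums equals the sum over all the data.
total = sum(sum(part) for part in partitions)

# Algebraic: avg() is derived from a fixed number of distributive results (sum and count).
count = sum(len(part) for part in partitions)
average = total / count

# Holistic: median() cannot in general be derived from per-partition medians.
median_of_medians = median(median(part) for part in partitions)
true_median = median(all_values)

print(total, average, median_of_medians, true_median)
```

Here the median of the partition medians differs from the true median, which is why holistic measures are harder to precompute in a data cube.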

Page 30: Data Mining

Introducing Concept Hierarchies

• A concept hierarchy defines a sequence of mappings from a set of low-level concepts to higher-level concepts.

Example: a concept hierarchy for the dimension location

all: all

country: Canada; USA

province_or_state: British Columbia, Ontario (Canada); New York, Illinois (USA)

city: Vancouver, Victoria (British Columbia); Toronto, Ottawa (Ontario); New York, Buffalo (New York); Chicago (Illinois)

Page 31: Data Mining

• Hierarchical and lattice structures of attributes in warehouse dimensions:

Hierarchy for location: street < city < province_or_state < country

Lattice for time: day < {month < quarter; week} < year

Page 32: Data Mining

OLAP Operations in the Multidimensional Data Model

• Roll-up
• Drill-down
• Slice and dice
• Pivot (rotate)
• Other (drill-across, drill-through)
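The main operations can be sketched on a tiny cube stored as a dictionary. The cell values are a subset of the AllElectronics sales table shown earlier; the dictionary layout and the helper names (`roll_up_location`, `slice_time`, `dice`, `pivot`) are assumptions made for this illustration, not part of any OLAP product.

```python
from collections import defaultdict

# (city, quarter, item) -> sales. A subset of the AllElectronics sales table.
cube = {
    ("Chicago", "Q1", "home entertainment"): 854,
    ("Chicago", "Q1", "computer"): 882,
    ("New York", "Q1", "home entertainment"): 1087,
    ("New York", "Q1", "computer"): 968,
    ("Toronto", "Q1", "home entertainment"): 818,
    ("Toronto", "Q1", "computer"): 746,
    ("Vancouver", "Q1", "home entertainment"): 605,
    ("Vancouver", "Q1", "computer"): 825,
    ("Toronto", "Q2", "home entertainment"): 894,
    ("Vancouver", "Q2", "computer"): 952,
}
# One step of the location concept hierarchy: city -> country.
country_of = {"Chicago": "USA", "New York": "USA", "Toronto": "Canada", "Vancouver": "Canada"}

def roll_up_location(cube):
    """Roll-up: aggregate from cities to countries by climbing the hierarchy."""
    out = defaultdict(int)
    for (city, quarter, item), sales in cube.items():
        out[(country_of[city], quarter, item)] += sales
    return dict(out)

def slice_time(cube, quarter):
    """Slice: fix one value of the time dimension, leaving a 2-D (city, item) view."""
    return {(c, i): s for (c, q, i), s in cube.items() if q == quarter}

def dice(cube, cities, quarters, items):
    """Dice: select a subcube by restricting two or more dimensions."""
    return {k: s for k, s in cube.items()
            if k[0] in cities and k[1] in quarters and k[2] in items}

def pivot(view):
    """Pivot: rotate a 2-D view by swapping its axes."""
    return {(b, a): s for (a, b), s in view.items()}

by_country = roll_up_location(cube)
q1 = slice_time(cube, "Q1")
sub = dice(cube, {"Toronto", "Vancouver"}, {"Q1", "Q2"}, {"home entertainment", "computer"})
rotated = pivot(q1)
```

Drill-down is the inverse of roll-up and needs the more detailed data to be available, so it is not shown here.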

Page 33: Data Mining

Figure: starting from the central cube with the dimensions location (cities), time (quarters), and item (types):

• roll-up on location (from cities to countries): the location axis becomes Canada and USA
• drill-down on time (from quarters to months): the time axis becomes January to December

Page 34: Data Mining

Figure: starting from the same central cube with the dimensions location (cities), time (quarters), and item (types):

• dice for (location=“Toronto” or “Vancouver”) and (time=“Q1” or “Q2”) and (item=“home entertainment” or “computer”)
• slice for time=“Q1”, giving a 2-D view of item (types) by location (cities)
• pivot: the axes of the 2-D slice are rotated, so that item (types) and location (cities) change places

Page 35: Data Mining

A Starnet Query Model for Querying Multidimensional Databases

Figure: a starnet with one radial line per dimension; each point on a line is an abstraction level:

• location: continent, country, province_or_state, city, street
• time: day, month, quarter, year
• item: name, brand, category, type
• customer: name, category, group

Page 36: Data Mining


Data Warehouse Architecture

Steps for the Design and Construction of Data Warehouse

The Design of a Data Warehouse: A Business Analysis Framework

• top-down view

• data source view

• data warehouse view

• business query view

Page 37: Data Mining


The Process of Data Warehouse Design

• top-down approach

• bottom-up approach

• combined approach

• waterfall method

• spiral method

Steps of the warehouse design:

1) Choosing a business process to model;

2) Choosing the grain of the business process;

3) Choosing the dimensions;

4) Choosing the measures.

Page 38: Data Mining

A Three-Tier Data Warehouse Architecture

• Bottom tier: data warehouse server. Data from operational databases and external sources is extracted, cleaned, transformed, loaded, and refreshed into the data warehouse and data marts; the tier also includes monitoring, administration, and a metadata repository.
• Middle tier: OLAP server
• Top tier: front-end tools (query/report, analysis, data mining), which produce the output

Page 39: Data Mining


There are three data warehouse models:

• Enterprise warehouse

• Data mart

• Virtual warehouse

Page 40: Data Mining

Types of OLAP Servers: ROLAP versus MOLAP versus HOLAP

Relational OLAP (ROLAP) servers:

• use of relational or extended-relational DBMS
• greater scalability

Multidimensional OLAP (MOLAP) servers:

• use of data cube – fast indexing
• possible low storage utilization – use of compression

Hybrid OLAP (HOLAP) servers:

• scalability of ROLAP and faster computation of MOLAP
• Microsoft SQL Server 7.0 OLAP Services supports a HOLAP server

Page 41: Data Mining

Data Warehouse Implementation

• SQL group by
• Data cube computation extends SQL with compute cube
• Example: “Compute the sum of sales, grouping by item and city.” “Compute the sum of sales, grouping by item.” “Compute the sum of sales, grouping by city.”
• The possible group by’s are the following: {(city, item, year), (city, item), (city, year), (item, year), (city), (item), (year), ()}
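The list of possible group by’s is simply the set of all subsets of the dimensions. A short sketch that enumerates them for the three dimensions used in the example (the function name `all_group_bys` is chosen here for illustration):

```python
from itertools import combinations

dimensions = ("city", "item", "year")

def all_group_bys(dims):
    """Return every subset of `dims`, from the full set (base cuboid) down to () (apex)."""
    return [combo
            for size in range(len(dims), -1, -1)
            for combo in combinations(dims, size)]

group_bys = all_group_bys(dimensions)
for g in group_bys:
    print(g)
```

Each tuple corresponds to one cuboid, so three dimensions give eight cuboids.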

Page 42: Data Mining

0-D (apex) cuboid: ( )

1-D cuboids: (city), (item), (year)

2-D cuboids: (city, item), (city, year), (item, year)

3-D (base) cuboid: (city, item, year)

Lattice of cuboids

define cube sales [item, city, year]: sum(sales_in_dollars)

compute cube sales

Page 43: Data Mining

• Number of cuboids in an n-dimensional data cube is $2^n$

• Number of cuboids in an n-dimensional data cube where the dimensions have concept hierarchies (for example, day<week<month<quarter<year) is:

$$T = \prod_{i=1}^{n} (L_i + 1)$$

where $L_i$ is the number of levels associated with dimension $i$

• Example: if the cube has 10 dimensions and each dimension has 4 levels, the total number of cuboids that can be generated will be $5^{10} \approx 9.8 \times 10^6$
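The counting formula is easy to check numerically. In this sketch each dimension is described only by its number of hierarchy levels; the extra 1 per dimension accounts for the virtual top level all. The function name `cuboid_count` is an assumption of this example.

```python
from math import prod

def cuboid_count(levels_per_dimension):
    """Total number of cuboids: the product of (L_i + 1) over all dimensions."""
    return prod(levels + 1 for levels in levels_per_dimension)

# Without hierarchies every dimension has one level, which gives 2^n.
no_hierarchy = cuboid_count([1] * 3)

# The slide's example: 10 dimensions with 4 levels each.
with_hierarchy = cuboid_count([4] * 10)

print(no_hierarchy, with_hierarchy)
```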

Page 44: Data Mining


Partial Materialization: Selected Computation of Cuboids

There are three choices for data cube materialization given a base cuboid:

(1) do not precompute any of the “nonbase” cuboids (no materialization)

(2) precompute all of the cuboids (full materialization)

(3) selectively compute a proper subset of the whole set of possible cuboids (partial materialization); the partial materialization of cuboids should consider three factors:

• identify the subset of cuboids to materialize,
• exploit the materialized cuboids during query processing, and
• efficiently update the materialized cuboids during load and refresh.

Page 45: Data Mining

Multiway Array Aggregation in the Computation of Data Cubes

ROLAP:

• Sorting, hashing, and grouping operations are applied to the dimension attributes in order to reorder and cluster related tuples.
• Grouping is performed on some subaggregates as a “partial grouping step”. These “partial groupings” may be used to speed up the computation of other subaggregates.
• Aggregates may be computed from previously computed aggregates, rather than from the base fact tables.

MOLAP:

• Partition the array into chunks.
• Compute aggregates by visiting cube cells.

Page 46: Data Mining

Figure: a 3-D array with the dimensions A (a0 to a3), B (b0 to b3), and C (c0 to c3); each dimension is split into four partitions, and the resulting chunks are numbered 1 to 64.

A 3-D array for the dimensions A, B, and C, organized into 64 chunks

Page 47: Data Mining


Indexing OLAP Data

• Bitmap indexing

• Join indexing

Page 48: Data Mining

Bitmap Indexing

Base table

RID | item | city
R1 | H | V
R2 | C | V
R3 | P | V
R4 | S | V
R5 | H | T
R6 | C | T
R7 | P | T
R8 | S | T

Item bitmap index table

RID | H | C | P | S
R1 | 1 | 0 | 0 | 0
R2 | 0 | 1 | 0 | 0
R3 | 0 | 0 | 1 | 0
R4 | 0 | 0 | 0 | 1
R5 | 1 | 0 | 0 | 0
R6 | 0 | 1 | 0 | 0
R7 | 0 | 0 | 1 | 0
R8 | 0 | 0 | 0 | 1

City bitmap index table

RID | V | T
R1 | 1 | 0
R2 | 1 | 0
R3 | 1 | 0
R4 | 1 | 0
R5 | 0 | 1
R6 | 0 | 1
R7 | 0 | 1
R8 | 0 | 1

Indexing OLAP data using bitmap indices
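The bitmap index tables can be derived mechanically from the base table. The sketch below rebuilds them for the eight rows of the example and then answers a conjunctive query with a bitwise AND. Plain Python lists stand in for real bit vectors, and the function name `build_bitmap_index` is an assumption of this example.

```python
# Base table from the slide: (RID, item, city).
base_table = [
    ("R1", "H", "V"), ("R2", "C", "V"), ("R3", "P", "V"), ("R4", "S", "V"),
    ("R5", "H", "T"), ("R6", "C", "T"), ("R7", "P", "T"), ("R8", "S", "T"),
]

def build_bitmap_index(rows, column):
    """Map each distinct value in `column` to a list of 0/1 bits, one bit per row."""
    index = {}
    for position, row in enumerate(rows):
        bits = index.setdefault(row[column], [0] * len(rows))
        bits[position] = 1
    return index

item_index = build_bitmap_index(base_table, 1)
city_index = build_bitmap_index(base_table, 2)

# Query: item = "H" AND city = "T", answered by ANDing two bit vectors.
matches = [a & b for a, b in zip(item_index["H"], city_index["T"])]
matching_rids = [row[0] for row, bit in zip(base_table, matches) if bit]
print(matching_rids)
```

Comparison, join, and aggregation conditions reduce to bit arithmetic in this representation, which is why bitmap indices suit low-cardinality attributes.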

Page 49: Data Mining

Join Indexing

Linkages between a sales fact table and dimension tables for location and item:

• location “Main Street” is linked to the sales tuples T57, T238, and T884
• item “Sony-TV” is linked to the sales tuples T57 and T459

Join index table for location/sales

location | sales_key
Main Street | T57
Main Street | T238
Main Street | T884

Join index table for item/sales

item | sales_key
Sony-TV | T57
Sony-TV | T459

Join index table linking two dimensions location/item/sales

location | item | sales_key
Main Street | Sony-TV | T57

Join index tables based on the linkages between the sales fact table and dimension tables for location and item

Page 50: Data Mining


Efficient Processing of OLAP Queries

1. Determine which operations should be performed on the available cuboids

2. Determine to which materialized cuboid(s) the relevant operations should be applied

Page 51: Data Mining

Metadata Repository

• A description of the structure of the data warehouse
• Operational metadata
• The algorithms used for summarization
• The mapping from the operational environment to the data warehouse
• Data related to system performance
• Business metadata

Page 52: Data Mining

Data Warehouse Back-End Tools and Utilities

• Data extraction
• Data cleaning
• Data transformation
• Load
• Refresh

Page 53: Data Mining

Further Development of Data Cube Technology

Discovery-Driven Exploration of Data Cubes

• SelfExp
• InExp
• PathExp

Page 54: Data Mining


Sum of sales Month

Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec

Total 1% -1% 0% 1% 3% -1% -9% -1% 2% -4% 3%

Avg. sales Month

Item Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec

Sony b/w printer 9% -8% 2% -5% 14% -4% 0% 41% -13% -15% -11%

Sony color printer 0% 0% 3% 2% 4% -10% -13% 0% 4% -6% 4%

HP b/w printer -2% 1% 2% 3% 8% 0% -12% -9% 3% -3% 6%

HP color printer 0% 0% -2% 1% 0% -1% -7% -2% 1% -4% 1%

IBM desktop computer 1% -2% -1% -1% 3% 3% -10% 4% 1% -4% -1%

IBM laptop computer 0% 0% -1% 3% 4% 2% -10% -2% 0% -9% 3%

Toshiba desktop comp. -2% -5% 1% 1% -1% 1% 5% -3% -5% -1% -1%

Toshiba laptop comp. 1% 0% 3% 0% -2% -2% -5% 3% 2% -1% 0%

Logitech mouse 3% -2% -1% 0% 4% 6% -11% 2% 1% -4% 0%

Ergo-way mouse 0% 0% 2% 3% 1% -2% -2% -5% 0% -5% 8%

Change in sales over time

Change in sales for each item-time combination

Page 55: Data Mining


Avg. sales Month

Region Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec

North -1% -3% -1% 0% 3% 4% -7% 1% 0% -3% -3%

South -1% 1% -9% 6% -1% -39% 9% -34% 4% 1% 7%

East -1% -2% 2% -3% 1% 18% -2% 11% -3% -2% -1%

West 4% 0% -1% -3% 5% 1% -18% 8% 5% -8% 1%

Change in sales for the item IBM desktop computer per region

Page 56: Data Mining

Complex Aggregation at Multiple Granularities: Multifeature Cubes

• Example 1: Query 1: A simple data cube query. Find the total sales in 2000, broken down by item, region, and month, with subtotals for each dimension.

• Example 2: Query 2: A complex query. Grouping by all subsets of {item, region, month}, find the maximum price in 2000 for each group, and the total sales among all maximum price tuples.

select item, region, month, MAX(price), SUM(R.sales)
from Purchases
where year=2000
cube by item, region, month: R
such that R.price=MAX(price)
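One way to read Query 2 is as a two-pass computation per group: first find the maximum price, then sum the sales of the tuples that reach it. The sketch below follows that reading in plain Python. The Purchases rows are hypothetical, and the code illustrates the query's semantics only; it does not show how a cube engine would evaluate it.

```python
from itertools import combinations

# Hypothetical Purchases tuples; the where clause keeps only the rows for year 2000.
purchases = [
    {"item": "TV", "region": "North", "month": "Jan", "year": 2000, "price": 900, "sales": 10},
    {"item": "TV", "region": "North", "month": "Jan", "year": 2000, "price": 988, "sales": 4},
    {"item": "TV", "region": "South", "month": "Feb", "year": 2000, "price": 988, "sales": 6},
    {"item": "TV", "region": "South", "month": "Feb", "year": 1999, "price": 999, "sales": 1},
]
dims = ("item", "region", "month")

def multifeature_cube(rows, dims):
    """For every subset of dims and every group: (MAX(price), SUM(sales) at that price)."""
    rows = [r for r in rows if r["year"] == 2000]
    result = {}
    for size in range(len(dims) + 1):
        for group_dims in combinations(dims, size):
            groups = {}
            for r in rows:
                key = tuple(r[d] for d in group_dims)
                groups.setdefault(key, []).append(r)
            for key, members in groups.items():
                max_price = max(r["price"] for r in members)
                sales_at_max = sum(r["sales"] for r in members if r["price"] == max_price)
                result[(group_dims, key)] = (max_price, sales_at_max)
    return result

cube = multifeature_cube(purchases, dims)
print(cube[((), ())])
```

The second aggregate depends on the first one within the same group, which is what makes this a multifeature cube rather than a simple one.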

Page 57: Data Mining

• Example 3: Query 3: An even more complex query. Grouping by all subsets of {item, region, month}, find the maximum price in 2000 for each group. Among the maximum price tuples, find the minimum and maximum item shelf life. Also find the fraction of the total sales due to tuples that have minimum shelf life within the set of all maximum price tuples, and the fraction of the total sales due to tuples that have maximum shelf life within the set of all maximum price tuples.

select item, region, month, MAX(price), MIN(R1.shelf), MAX(R1.shelf), SUM(R1.sales), SUM(R2.sales), SUM(R3.sales)
from Purchases
where year=2000
cube by item, region, month: R1, R2, R3
such that R1.price=MAX(price) and
R2 in R1 and R2.shelf=MIN(R1.shelf) and
R3 in R1 and R3.shelf=MAX(R1.shelf)

Page 58: Data Mining

From Data Warehousing to Data Mining

Data Warehouse Usage

• Information processing
• Analytical processing
• Data mining

Page 59: Data Mining

From On-Line Analytical Processing to On-Line Analytical Mining

• High quality of data in data warehouses
• Available information processing infrastructure surrounding data warehouses
• OLAP-based exploratory data analysis
• On-line selection of data mining functions

Page 60: Data Mining

Architecture for On-Line Analytical Mining

Diagram (layers, top to bottom):

• Layer 4, user interface: graphical user interface API; a constraint-based mining query goes in and the mining result comes back
• Layer 3, OLAP/OLAM: OLAM engine and OLAP engine, on top of the cube API
• Layer 2, multidimensional database: MDDB and metadata, on top of the database API (filtering and integration)
• Layer 1, data repository: databases and data warehouse, fed through data cleaning, data integration, and filtering

An integrated OLAM and OLAP architecture

Page 61: Data Mining

3. Data Preprocessing

Page 62: Data Mining

Figure, with one panel per task:

• Data cleaning
• Data integration
• Data transformation: for example, -2, 32, 100, 59, 48 becomes -0.02, 0.32, 1.00, 0.59, 0.48
• Data reduction: for example, a table of transactions T1, T2, T3, T4, …, T2000 with attributes A1, A2, A3, …, A126 is reduced to transactions T1, T4, …, T1456 with attributes A1, A3, …, A115

Forms of data preprocessing

Page 63: Data Mining

Data Cleaning

Missing values

1. Ignore the tuple

2. Fill in the missing value manually

3. Use a global constant to fill in the missing value

4. Use the attribute mean to fill in the missing value

5. Use the attribute mean for all samples belonging to the same class as the given tuple

6. Use the most probable value to fill in the missing value
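Strategies 3 to 5 are simple enough to sketch directly. The tuples below are hypothetical (a class label and an income, with None marking a missing value), and the function names are chosen for this illustration.

```python
from statistics import mean

# Hypothetical tuples: (class label, income); None marks a missing value.
rows = [("low_risk", 30000), ("low_risk", None), ("low_risk", 50000),
        ("high_risk", 10000), ("high_risk", None)]

def fill_with_constant(rows, constant):
    """Strategy 3: replace every missing value with one global constant."""
    return [(c, constant if v is None else v) for c, v in rows]

def fill_with_mean(rows):
    """Strategy 4: replace missing values with the mean of the known values."""
    overall = mean(v for _, v in rows if v is not None)
    return [(c, overall if v is None else v) for c, v in rows]

def fill_with_class_mean(rows):
    """Strategy 5: replace missing values with the mean of the tuple's own class."""
    by_class = {}
    for c, v in rows:
        if v is not None:
            by_class.setdefault(c, []).append(v)
    class_mean = {c: mean(vs) for c, vs in by_class.items()}
    return [(c, class_mean[c] if v is None else v) for c, v in rows]

print(fill_with_mean(rows))
print(fill_with_class_mean(rows))
```

Strategy 6 would replace the mean with a prediction from a model such as regression or a decision tree, which is beyond this sketch.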

Page 64: Data Mining

Inconsistent data

Noisy data

• Binning

Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34

Partition into (equidepth) bins:
Bin 1: 4, 8, 15
Bin 2: 21, 21, 24
Bin 3: 25, 28, 34

Smoothing by bin means:
Bin 1: 9, 9, 9
Bin 2: 22, 22, 22
Bin 3: 29, 29, 29

Smoothing by bin boundaries:
Bin 1: 4, 4, 15
Bin 2: 21, 21, 24
Bin 3: 25, 25, 34

• Clustering
• Combined computer and human inspection
• Regression
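The binning example can be reproduced with a few lines of code. This sketch assumes that the number of values is divisible by the number of bins and that a value equally close to both boundaries goes to the lower one; neither point is settled by the slide.

```python
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # already sorted, as on the slide

def equidepth_bins(sorted_values, n_bins):
    """Split sorted values into n_bins bins holding the same number of values."""
    size = len(sorted_values) // n_bins  # assumes an exact division
    return [sorted_values[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    """Replace every value in a bin with the (rounded) bin mean."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value with the closer of the bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed

bins = equidepth_bins(prices, 3)
print(smooth_by_means(bins))
print(smooth_by_boundaries(bins))
```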

Page 65: Data Mining


Data Integration and Transformation

Data Integration

Data Transformation

• Smoothing

• Aggregation

• Generalization

• Normalization

• Attribute construction
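As an illustration of normalization, the sketch below applies two common scalings to the values -2, 32, 100, 59, 48 from the preprocessing figure. Dividing by the largest absolute value reproduces the figure's output, but which method the figure intends is an assumption here. Min-max normalization is shown as a second standard option.

```python
values = [-2, 32, 100, 59, 48]

def max_abs_scale(values):
    """Divide every value by the largest absolute value, mapping into [-1, 1]."""
    largest = max(abs(v) for v in values)
    return [v / largest for v in values]

def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly map [min, max] of the data onto [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

scaled = max_abs_scale(values)
min_max = min_max_normalize(values)
print(scaled)
print(min_max)
```

Both functions assume the values are not all identical (and not all zero), since that would cause a division by zero.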

Page 66: Data Mining

Data Reduction

• Data cube aggregation
• Dimension reduction
• Data compression
• Numerosity reduction
• Discretization and concept hierarchy generation

Page 67: Data Mining

Dimensionality reduction

1. Stepwise forward selection
2. Stepwise backward elimination
3. Combination of forward selection and backward elimination
4. Decision tree induction

Example, with the initial attribute set {A1,A2,A3,A4,A5,A6}:

• Forward selection: initial reduced set {} → {A1} → {A1,A4} → reduced attribute set {A1,A4,A6}
• Backward elimination: {A1,A2,A3,A4,A5,A6} → {A1,A3,A4,A5,A6} → {A1,A4,A5,A6} → reduced attribute set {A1,A4,A6}
• Decision tree induction: the induced tree tests A4 at the root and A1 and A6 below it (Y/N branches), with leaves Class1 and Class2; the attributes that appear in the tree form the reduced attribute set {A1,A4,A6}

Greedy (heuristic) methods for attribute subset selection.
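Stepwise forward selection can be sketched as a greedy loop. The scoring function is the part that varies in practice (for example, a statistical significance test or information gain); the toy weights below are hypothetical and were chosen only so that the run mirrors the slide's example.

```python
def forward_selection(attributes, score, max_size):
    """Greedily add the attribute that most improves `score`; stop when nothing helps."""
    selected = []
    best = score(selected)
    while len(selected) < max_size:
        candidates = [a for a in attributes if a not in selected]
        if not candidates:
            break
        top = max(candidates, key=lambda a: score(selected + [a]))
        top_score = score(selected + [top])
        if top_score <= best:
            break  # no remaining attribute improves the score
        selected.append(top)
        best = top_score
    return selected

# Hypothetical usefulness weights, standing in for a real evaluation measure.
weights = {"A1": 5, "A2": 0, "A3": 0, "A4": 3, "A5": 0, "A6": 2}

def toy_score(subset):
    return sum(weights[a] for a in subset)

reduced = forward_selection(list(weights), toy_score, max_size=6)
print(reduced)
```

Backward elimination is the mirror image: start from the full set and repeatedly drop the attribute whose removal hurts the score least.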

Page 68: Data Mining


Data Compression

• Wavelet transforms

• Principal components analysis

Page 69: Data Mining


Numerosity Reduction

• Regression and log-linear models

• Histograms

• Clustering

• Sampling

Page 70: Data Mining

Histogram Examples

Figure (x-axis: price ($), y-axis: count): a histogram for price using singleton buckets; each bucket represents one price-value/frequency pair.

Figure (x-axis: price ($) in the buckets 1-10, 11-20, 21-30; y-axis: count): an equiwidth histogram for price, where values are aggregated so that each bucket has a uniform width of $10.
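Both kinds of histogram can be computed with a counter. The price list below is hypothetical (it is not the data behind the figures), and the bucket labels assume integer prices starting at 1.

```python
from collections import Counter

# Hypothetical list of prices in dollars (illustrative values only).
prices = [1, 1, 5, 5, 5, 8, 10, 12, 14, 15, 15, 18, 20, 21, 25, 25, 30]

# Singleton buckets: one bucket per distinct price value.
singleton = Counter(prices)

def equiwidth_histogram(values, width):
    """Count integer values per bucket; bucket k covers k*width+1 .. (k+1)*width."""
    counts = Counter((v - 1) // width for v in values)
    return {f"{k * width + 1}-{(k + 1) * width}": counts[k] for k in sorted(counts)}

equiwidth = equiwidth_histogram(prices, 10)
print(singleton)
print(equiwidth)
```

The equiwidth version stores three numbers instead of one per distinct price, which is the data reduction the slide refers to.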

Page 71: Data Mining

Discretization and Concept Hierarchy Generation

Top level: ($0…$1000]

Second level: ($0…$200], ($200…$400], ($400…$600], ($600…$800], ($800…$1000]

Third level: ($0…$100], ($100…$200], ($200…$300], ($300…$400], ($400…$500], ($500…$600], ($600…$700], ($700…$800], ($800…$900], ($900…$1000]

A concept hierarchy for the attribute price.

Page 72: Data Mining

Discretization and Concept Hierarchy Generation for Numeric Data

• Binning
• Histogram analysis
• Cluster analysis
• Entropy-based discretization
• Segmentation by natural partitioning

Page 73: Data Mining

Concept Hierarchy Generation for Categorical Data

• Specification of a partial ordering of attributes explicitly at the schema level by users or experts
• Specification of a portion of a hierarchy by explicit data grouping
• Specification of a set of attributes, but not of their partial ordering
• Specification of only a partial set of attributes

country: 15 distinct values

province_or_state: 365 distinct values

city: 3,567 distinct values

street: 674,339 distinct values

Automatic generation of a schema concept hierarchy based on the number of distinct attribute values.
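The heuristic behind this automatic generation is to place the attribute with the fewest distinct values at the top of the hierarchy. A minimal sketch using the counts from the example (the function name `generate_hierarchy` is an assumption of this illustration):

```python
# Distinct-value counts from the example, deliberately listed out of order.
distinct_counts = {
    "street": 674339,
    "country": 15,
    "city": 3567,
    "province_or_state": 365,
}

def generate_hierarchy(distinct_counts):
    """Order attributes from most general (fewest distinct values) to most specific."""
    return sorted(distinct_counts, key=distinct_counts.get)

hierarchy = generate_hierarchy(distinct_counts)
print(" < ".join(reversed(hierarchy)))
```

This is only a heuristic, so the generated ordering should be reviewed by a user or expert before it is relied on.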

Page 74: Data Mining


Credits:

Radmilo Pešić [email protected]

Branko Golubović [email protected]

Milutinović [email protected]