Data Mining

made by Radmilo Pesic & Branko Golubovic

Data Mining: Concepts and Techniques

tutorial based on the book:

by Jiawei Han and Micheline Kamber

This material was developed with the financial help of the WUSA fund of Austria.


1 Introduction


What motivated data mining?

Necessity is the mother of invention.

Data Collection and Database Creation (1960s and earlier)

Database Management Systems (1970s-early 1980s)

Advanced Database Systems (mid-1980s-present)

Web-based Database Systems (1990s-present)

Data Warehousing and Data Mining (mid-1980s-present)

New Generation of Integrated Information Systems (2000-…)


What Is Data Mining?

[Figure: data mining as a step in the process of knowledge discovery. Databases and flat files pass through Cleaning and Integration into a Data warehouse; Selection and Transformation, Data Mining, and Evaluation and Presentation then turn the data into Patterns and finally Knowledge.]

Extracting or “mining” knowledge from large amounts of data.

1. Data cleaning
2. Data integration
3. Data selection
4. Data transformation
5. Data mining
6. Pattern evaluation
7. Knowledge presentation


Components of a typical data mining system:

• Database, data warehouse, or other information repository

• Database or data warehouse server

• Knowledge base

• Data mining engine

• Pattern evaluation module

• Graphical user interface

[Figure: layered diagram of the components listed above]


Data mining – On What Kind of Data?

• Relational Databases
• Data Warehouses
• Transactional Databases
• Advanced Database Systems and Advanced Database Applications (object-oriented, object-relational, spatial, temporal, time-series, text, multimedia, heterogeneous, legacy databases and the World Wide Web)


Relational Databases

customer
cust_ID | name | address | age | income | credit_info | …
C1 | Smith, Sandy | 5463 E Hastings, Burnaby, BC V5A 4S9, Canada | 21 | $27000 | 1 | …
… | … | … | … | … | … | …

item
item_ID | name | brand | category | type | price | place_made | supplier | cost
I3 | high_res_TV | Toshiba | high resolution | TV | $988.00 | Japan | NikoX | $600.00
I8 | multidisc-CDplay | Sanyo | multidisc | CD player | $369.00 | Japan | Music Front | $120.00
… | … | … | … | … | … | … | … | …

employee
empl_ID | name | category | group | salary | commission
E55 | Jones, Jane | home entertainment | manager | $18,000 | 2%
… | … | … | … | … | …

branch
branch_ID | name | address
B1 | City Square | 369 Cambie St., Vancouver, BC V5L 3A2, Canada
… | … | …

purchases
trans_ID | cust_ID | empl_ID | date | time | method_paid | amount
T100 | C1 | E55 | 09/21/98 | 15:45 | Visa | $1357.00
… | … | … | … | … | … | …

item_sold
trans_ID | item_ID | qty
T100 | I3 | 1
T100 | I8 | 2

works_at
empl_ID | branch_ID
E55 | B1
… | …


Data Warehouses

[Figure: data sources in Chicago, New York, Toronto, and Vancouver are cleaned, transformed, integrated, and loaded into the data warehouse, which clients reach through query and analysis tools]

Typical architecture of a data warehouse for AllElectronics


[Figure: a data cube of sales with the dimensions address (cities: Chicago, New York, Toronto, Vancouver), time (quarters: Q1 to Q4), and item (types: home entertainment, computer, phone, security), with the example cell <Vancouver,Q1,security>. A roll-up on address aggregates the cities into countries (Canada, USA); a drill-down on time data for Q1 shows the months Jan, Feb, and March.]


Text Databases and Multimedia Databases

• Text databases can be highly unstructured, semistructured, or well structured

• Multimedia databases store image, audio, and video data

• Such data require a lot of storage space; audio and video are continuous-media data

Heterogeneous Databases and Legacy Databases

The World Wide Web

• mining path traversal patterns


Data Mining Functionalities: What Kinds of Patterns Can Be Mined?

• Concept/Class Description: Characterization and Discrimination
• Association Analysis
• Classification and Prediction
• Cluster Analysis
• Outlier Analysis
• Evolution Analysis


Are All of the Patterns Interesting?

A pattern is interesting if it is:

• easily understood

• valid

• (potentially) useful

• novel

or if it

• confirms user’s hypothesis

An interesting pattern represents knowledge!


Objective measures of pattern interestingness:

• support

• confidence
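The two objective measures can be made concrete with a minimal sketch. It assumes the usual definitions for a rule A => B: support is the fraction of transactions containing both A and B, and confidence is the fraction of transactions containing A that also contain B. The transactions and item names below are made up for illustration.

```python
# Toy transactions; the item names are illustrative only.
transactions = [
    {"computer", "software"},
    {"computer", "printer"},
    {"computer", "software", "printer"},
    {"phone"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent) for the rule antecedent => consequent."""
    both = support(set(antecedent) | set(consequent), transactions)
    return both / support(antecedent, transactions)

# Rule: computer => software
rule_support = support({"computer", "software"}, transactions)
rule_confidence = confidence({"computer"}, {"software"}, transactions)
print(rule_support, rule_confidence)
```

A mining system would typically keep only the rules whose support and confidence exceed user-chosen thresholds.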

Subjective measures of pattern interestingness:

• the pattern is unexpected

• the pattern is actionable

• the pattern is expected (it confirms a hypothesis)

Can a data mining system generate all of the interesting patterns?

Can a data mining system generate only interesting patterns?


Classification of Data Mining Systems

• according to kinds of databases mined (relational, data warehouse, object-oriented…)

• according to kinds of knowledge mined (association, classification, clustering…; generalized, primitive-level or knowledge at multiple levels; regularities or irregularities)

• according to the kinds of techniques utilized (autonomous, interactive exploratory or query-driven systems; data warehouse oriented, statistics…)

• according to the applications adapted (for finance, DNA, etc.)

[Figure: data mining at the center of several contributing disciplines: database technology, information science, machine learning, statistics, visualization, and other disciplines]


Major Issues in Data Mining

Mining methodology and user interaction issues:

• Mining different kinds of knowledge in databases

• Interactive mining of knowledge at multiple levels of abstraction

• Incorporation of background knowledge

• Data mining query languages and ad hoc data mining

• Presentation and visualization of data mining results

• Handling noisy or incomplete data

• Pattern evaluation – the interestingness problem

Performance issues:

• Efficiency and scalability of data mining algorithms

• Parallel, distributed, and incremental mining algorithms

Issues relating to the diversity of database types:

• Handling of relational and complex types of data

• Mining information from heterogeneous databases and global information systems


2 Data Warehouse and OLAP Technology for Data Mining


What Is a Data Warehouse?

“A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision making process.”

W.H. Inmon

• Subject-oriented
• Integrated
• Time-variant
• Nonvolatile


How are organizations using the information from data warehouses?

• Increasing customer focus

• Repositioning products and managing product portfolios

• Analyzing operations and looking for sources of profit

• Managing the customer relationships, making environmental corrections, and managing the cost of corporate assets

Different approaches to heterogeneous database integration:

• Query-driven approach (wrappers and integrators)

• Update-driven approach


Differences Between Operational Database Systems and Data Warehouses

• Users and system orientation
• Data contents
• Database design
• View
• Access patterns

Why have a separate data warehouse?


A Multidimensional Data Model

From Tables and Spreadsheets to Data Cubes

• A data cube is defined by dimensions and facts

• Dimension table

• Fact table


Sales by time (quarters) and item (types), one block per location:

location = “Chicago”
time | home ent. | comp. | phone | sec.
Q1 | 854 | 882 | 89 | 623
Q2 | 943 | 890 | 64 | 698
Q3 | 1032 | 924 | 59 | 789
Q4 | 1129 | 992 | 63 | 870

location = “New York”
time | home ent. | comp. | phone | sec.
Q1 | 1087 | 968 | 38 | 872
Q2 | 1130 | 1024 | 41 | 925
Q3 | 1034 | 1048 | 45 | 1002
Q4 | 1142 | 1091 | 54 | 984

location = “Toronto”
time | home ent. | comp. | phone | sec.
Q1 | 818 | 746 | 43 | 591
Q2 | 894 | 769 | 52 | 682
Q3 | 940 | 795 | 58 | 728
Q4 | 978 | 864 | 59 | 784

location = “Vancouver”
time | home ent. | comp. | phone | sec.
Q1 | 605 | 825 | 14 | 400
Q2 | 680 | 952 | 31 | 512
Q3 | 812 | 1023 | 30 | 501
Q4 | 927 | 1038 | 38 | 580

[Figure: the same data drawn as a 3-D cube with the dimensions location (cities), time (quarters), and item (types)]

A 2-D view of sales data for AllElectronics, and its 3-D data cube representation


[Figure: a series of 3-D cubes with the dimensions location (cities), time (quarters), and item (types), one cube for each supplier (for example, supplier=“SUP1” and supplier=“SUP2”)]

A 4-D data cube representation of sales data for AllElectronics


0-D (apex) cuboid: all

1-D cuboids: time; item; location; supplier

2-D cuboids: time, item; time, location; time, supplier; item, location; item, supplier; location, supplier

3-D cuboids: time, item, location; time, item, supplier; time, location, supplier; item, location, supplier

4-D (base) cuboid: time, item, location, supplier

Lattice of cuboids, making up a 4-D data cube


Stars, Snowflakes, and Fact Constellations: Schemas for Multidimensional Databases

Star schema:

• a large central table (fact table)
• a set of smaller attendant tables (dimension tables), one for each dimension

sales fact table: time_key, item_key, branch_key, location_key, dollars_sold, units_sold

time dimension table: time_key, day, day_of_week, month, quarter, year

item dimension table: item_key, item_name, brand, type, supplier_type

branch dimension table: branch_key, branch_name, branch_type

location dimension table: location_key, street, city, province_or_state, country


Snowflake schema:

• a variant of the star schema, where some dimension tables are normalized

• reduces redundancies, but also reduces the effectiveness of browsing

sales fact table: time_key, item_key, branch_key, location_key, dollars_sold, units_sold

time dimension table: time_key, day, day_of_week, month, quarter, year

item dimension table: item_key, item_name, brand, type, supplier_key

supplier dimension table: supplier_key, supplier_type

branch dimension table: branch_key, branch_name, branch_type

location dimension table: location_key, street, city_key

city dimension table: city_key, city, province_or_state, country


Fact constellation:

• multiple fact tables share dimension tables

sales fact table: time_key, item_key, branch_key, location_key, dollars_sold, units_sold

shipping fact table: item_key, time_key, shipper_key, from_location, to_location, dollars_cost, units_shipped

time dimension table: time_key, day, day_of_week, month, quarter, year

item dimension table: item_key, item_name, brand, type, supplier_type

branch dimension table: branch_key, branch_name, branch_type

location dimension table: location_key, street, city, province_or_state, country

shipper dimension table: shipper_key, shipper_name, location_key, shipper_type


Defining multidimensional schema

• DMQL – data mining query language

• Syntax:

cube definition:
define cube <cube_name> [<dimension_list>]: <measure_list>

dimension definition:
define dimension <dimension_name> as (<attribute_or_subdimension_list>)


Example:

• Constellation schema defined in DMQL:

define cube sales [time, item, branch, location]:
    dollars_sold=sum(sales_in_dollars), units_sold=count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier_type)
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city, province_or_state, country)

define cube shipping [time, item, shipper, from_location, to_location]:
    dollars_cost=sum(cost_in_dollars), units_shipped=count(*)
define dimension time as time in cube sales
define dimension item as item in cube sales
define dimension shipper as (shipper_key, shipper_name, location as location in cube sales, shipper_type)
define dimension from_location as location in cube sales
define dimension to_location as location in cube sales


Measures: Their Categorization and Computation

Measures, based on the aggregate function:

• Distributive

• Algebraic

• Holistic
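The three categories differ in whether a measure can be computed from partial results. The sketch below is an illustration, not part of the original slides: it uses sum as the distributive example, average (derived from sum and count) as the algebraic example, and median as the holistic example. The numbers are made up.

```python
from statistics import median

# Two partitions of the same measure values; the numbers are illustrative only.
part_a = [4, 8, 15]
part_b = [16, 23, 42, 50]
whole = part_a + part_b

# Distributive: applying sum() to the partial sums gives the sum of the whole.
total = sum([sum(part_a), sum(part_b)])

# Algebraic: avg() is derived from a fixed number of distributive results
# (here two of them: sum and count).
partials = [(sum(p), len(p)) for p in (part_a, part_b)]
average = sum(s for s, _ in partials) / sum(n for _, n in partials)

# Holistic: median() has no such constant-size summary; combining the
# partition medians does not, in general, give the true median.
median_of_medians = median([median(part_a), median(part_b)])
true_median = median(whole)

print(total, average, median_of_medians, true_median)
```

Distributive and algebraic measures are the ones that can be aggregated cheaply up a lattice of cuboids, because each cuboid can be derived from an already computed one.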


Introducing Concept Hierarchies

• A concept hierarchy defines a sequence of mappings from a set of low-level to higher-level concepts.

all: all

country: Canada, USA

province_or_state: British Columbia, Ontario (Canada); New York, Illinois (USA)

city: Vancouver, Victoria (British Columbia); Toronto, Ottawa (Ontario); Buffalo, New York (New York); Chicago (Illinois)

[Figure: the concept hierarchy above for the dimension location, drawn as a tree with the levels all, country, province_or_state, and city]


• Hierarchical and lattice structures of attributes in warehouse dimensions:

Hierarchy for location: street < city < province_or_state < country

Lattice for time: day < {month < quarter; week} < year


OLAP Operations in the Multidimensional Data Model

• Roll-up
• Drill-down
• Slice and dice
• Pivot (rotate)
• Other (drill-across, drill-through)
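A minimal sketch of these operations on a tiny cube held in a Python dictionary. The cell values are taken from the AllElectronics sales table in this tutorial; the function names and the dictionary layout are assumptions made for illustration, not the interface of any OLAP product.

```python
from collections import defaultdict

# (city, quarter, item) -> sales, a small part of the AllElectronics cube.
cube = {
    ("Chicago", "Q1", "home entertainment"): 854, ("Chicago", "Q1", "computer"): 882,
    ("New York", "Q1", "home entertainment"): 1087, ("New York", "Q1", "computer"): 968,
    ("Toronto", "Q1", "home entertainment"): 818, ("Toronto", "Q1", "computer"): 746,
    ("Vancouver", "Q1", "home entertainment"): 605, ("Vancouver", "Q1", "computer"): 825,
    ("Toronto", "Q2", "home entertainment"): 894, ("Toronto", "Q2", "computer"): 769,
    ("Vancouver", "Q2", "home entertainment"): 680, ("Vancouver", "Q2", "computer"): 952,
}
# One step of the location concept hierarchy: city -> country.
country_of = {"Chicago": "USA", "New York": "USA", "Toronto": "Canada", "Vancouver": "Canada"}

def roll_up_location(cube):
    """Roll-up: climb the location hierarchy from cities to countries."""
    out = defaultdict(int)
    for (city, quarter, item), sales in cube.items():
        out[(country_of[city], quarter, item)] += sales
    return dict(out)

def slice_time(cube, quarter):
    """Slice: fix one value of the time dimension, leaving a 2-D table."""
    return {(city, item): s for (city, q, item), s in cube.items() if q == quarter}

def dice(cube, cities, quarters, items):
    """Dice: select a subcube by restricting two or more dimensions."""
    return {k: s for k, s in cube.items()
            if k[0] in cities and k[1] in quarters and k[2] in items}

def pivot(table):
    """Pivot: swap the two axes of a 2-D table."""
    return {(b, a): s for (a, b), s in table.items()}

by_country = roll_up_location(cube)
q1 = slice_time(cube, "Q1")
sub = dice(cube, {"Toronto", "Vancouver"}, {"Q1", "Q2"}, {"computer"})
rotated = pivot(q1)
print(by_country[("Canada", "Q1", "computer")])
```

Drill-down is the reverse of roll-up and needs the more detailed data (for example, monthly figures), so it cannot be derived from this quarterly dictionary alone.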


[Figure: the sales cube with the dimensions location (cities), time (quarters), and item (types), together with the result of a roll-up on location (from cities to countries: Canada, USA) and the result of a drill-down on time (from quarters to months: January to December)]


[Figure: the sales cube with the dimensions location (cities), time (quarters), and item (types), together with the results of three operations: dice for (location=“Toronto” or “Vancouver”) and (time=“Q1” or “Q2”) and (item=“home entertainment” or “computer”); slice for time=“Q1”; and pivot, which swaps the item (types) and location (cities) axes of the 2-D slice]


A Starnet Query Model for Querying Multidimensional Databases

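[Figure: a starnet model with one radial line per dimension, each listing its abstraction levels:
location: street, city, province_or_state, country, continent
time: day, month, quarter, year
item: name, brand, category, type
customer: name, category, group]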


Data Warehouse Architecture

Steps for the Design and Construction of Data Warehouse

The Design of a Data Warehouse: A Business Analysis Framework

• top-down view

• data source view

• data warehouse view

• business query view


The Process of Data Warehouse Design

• top-down approach

• bottom-up approach

• combined approach

• waterfall method

• spiral method

Steps of the warehouse design:

1) Choosing a business process to model;

2) Choosing the grain of the business process;

3) Choosing the dimensions;

4) Choosing the measures.


A Three-Tier Data Warehouse Architecture

Top tier: front-end tools (query/report, analysis, data mining), which produce the output

Middle tier: OLAP server

Bottom tier: data warehouse server (data warehouse, data marts, and metadata repository, with monitoring and administration)

Data: operational databases and external sources, brought in by extract, clean, transform, load, and refresh


There are three data warehouse models:

• Enterprise warehouse

• Data mart

• Virtual warehouse


Types of OLAP Servers: ROLAP versus MOLAP versus HOLAP

Relational OLAP (ROLAP) servers:
• use of relational or extended-relational DBMS
• greater scalability

Multidimensional OLAP (MOLAP) servers:
• use of data cube – fast indexing
• possible low storage utilization – use of compression

Hybrid OLAP (HOLAP) servers:
• scalability of ROLAP and faster computation of MOLAP
• Microsoft SQL Server 7.0 OLAP Services supports HOLAP server


Data Warehouse Implementation

• SQL group by

• Data cube computation extends SQL with a compute cube operator

• Example: “Compute the sum of sales, grouping by item and city.” “Compute the sum of sales, grouping by item.” “Compute the sum of sales, grouping by city.”

• For the dimensions city, item, and year, the possible group by’s are the following: {(city, item, year), (city, item), (city, year), (item, year), (city), (item), (year), ()}


0-D (apex) cuboid: ()

1-D cuboids: (city), (item), (year)

2-D cuboids: (city, item), (city, year), (item, year)

3-D (base) cuboid: (city, item, year)

Lattice of cuboids

define cube sales [item, city, year]: sum(sales_in_dollars)

compute cube sales


• The number of cuboids in an n-dimensional data cube is $2^n$

• The number of cuboids in an n-dimensional data cube where the dimensions have concept hierarchies (for example, day<week<month<quarter<year) is:

$$T = \prod_{i=1}^{n} (L_i + 1)$$

where $L_i$ is the number of levels associated with dimension $i$

• Example: if the cube has 10 dimensions and each dimension has 4 levels, the total number of cuboids that can be generated will be $5^{10} \approx 9.8 \times 10^6$
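Both the lattice of group-by's and the counting formula are easy to check with a short sketch. The function names are hypothetical; the formula is the product of (L_i + 1) over all dimensions, where the added 1 stands for the top level "all" of each dimension.

```python
from itertools import combinations
from math import prod

def group_bys(dimensions):
    """All subsets of the dimensions, from the base cuboid down to the apex ()."""
    return [combo
            for k in range(len(dimensions), -1, -1)
            for combo in combinations(dimensions, k)]

def total_cuboids(levels_per_dimension):
    """T = product of (L_i + 1) over all dimensions."""
    return prod(levels + 1 for levels in levels_per_dimension)

cuboids = group_bys(("city", "item", "year"))
print(cuboids)                      # the 2**3 = 8 group-by's of the lattice
print(total_cuboids([1, 1, 1]))     # no hierarchies: one level per dimension
print(total_cuboids([4] * 10))      # 10 dimensions with 4 levels each
```

With one level per dimension the formula reduces to 2^n, which matches the number of subsets enumerated by `group_bys`.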


Partial Materialization: Selected Computation of Cuboids

There are three choices for data cube materialization given a base cuboid:

(1) do not precompute any of the “nonbase” cuboids (no materialization)

(2) precompute all of the cuboids (full materialization)

(3) selectively compute a proper subset of the whole set of possible cuboids (partial materialization)

The partial materialization of cuboids should consider three factors:

• identify the subset of cuboids to materialize,

• exploit the materialized cuboids during query processing, and

• efficiently update the materialized cuboids during load and refresh.


Multiway Array Aggregation in the Computation of Data Cubes

ROLAP:

• Sorting, hashing, and grouping operations are applied to the dimension attributes in order to reorder and cluster related tuples.

• Grouping is performed on some subaggregates as a “partial grouping step”. These “partial groupings” may be used to speed up the computation of other subaggregates.

• Aggregates may be computed from previously computed aggregates, rather than from the base fact tables.

MOLAP:

• Partition the array into chunks.

• Compute aggregates by visiting cube cells.


[Figure: a 3-D array with dimension A partitioned into a0 to a3, B into b0 to b3, and C into c0 to c3; the resulting chunks are numbered 1 to 64]

A 3-D array for the dimensions A, B, and C, organized into 64 chunks


Indexing OLAP Data

• Bitmap indexing

• Join indexing


Bitmap Indexing

Base table
RID | item | city
R1 | H | V
R2 | C | V
R3 | P | V
R4 | S | V
R5 | H | T
R6 | C | T
R7 | P | T
R8 | S | T

Item bitmap index table
RID | H | C | P | S
R1 | 1 | 0 | 0 | 0
R2 | 0 | 1 | 0 | 0
R3 | 0 | 0 | 1 | 0
R4 | 0 | 0 | 0 | 1
R5 | 1 | 0 | 0 | 0
R6 | 0 | 1 | 0 | 0
R7 | 0 | 0 | 1 | 0
R8 | 0 | 0 | 0 | 1

City bitmap index table
RID | V | T
R1 | 1 | 0
R2 | 1 | 0
R3 | 1 | 0
R4 | 1 | 0
R5 | 0 | 1
R6 | 0 | 1
R7 | 0 | 1
R8 | 0 | 1

Indexing OLAP data using bitmap indices
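The bitmap tables can be rebuilt from the base table with a few lines of code. This is a minimal sketch with hypothetical function names; bit vectors are plain Python lists here, whereas a real system would pack them into machine words.

```python
# Base table from the example: (RID, item, city).
base_table = [
    ("R1", "H", "V"), ("R2", "C", "V"), ("R3", "P", "V"), ("R4", "S", "V"),
    ("R5", "H", "T"), ("R6", "C", "T"), ("R7", "P", "T"), ("R8", "S", "T"),
]

def bitmap_index(rows, column):
    """Map each distinct value of a column to a bit vector with one bit per row."""
    values = sorted({row[column] for row in rows})
    return {v: [1 if row[column] == v else 0 for row in rows] for v in values}

item_index = bitmap_index(base_table, 1)
city_index = bitmap_index(base_table, 2)

# A selection such as item = H and city = T becomes a bitwise AND of two vectors.
matches = [rid
           for (rid, _, _), h, t in zip(base_table, item_index["H"], city_index["T"])
           if h & t]
print(item_index["H"], city_index["T"], matches)
```

Because comparisons turn into bit operations, this kind of index works best for attributes with few distinct values.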


Join Indexing

Linkages between a sales fact table and dimension tables for location and item:

location “Main Street” is linked to the sales tuples T57, T238, and T884
item “Sony-TV” is linked to the sales tuples T57 and T459

Join index table for location/sales
location | sales_key
… | …
Main Street | T57
Main Street | T238
Main Street | T884
… | …

Join index table for item/sales
item | sales_key
… | …
Sony-TV | T57
Sony-TV | T459
… | …

Join index table linking two dimensions location/item/sales
location | item | sales_key
… | … | …
Main Street | Sony-TV | T57
… | … | …

Join index tables based on the linkages between the sales fact table and dimension tables for location and item
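A join index can be sketched as a precomputed mapping from a dimension value to the keys of the fact tuples it joins with. The fact rows below reproduce the linkages of the example; the values "other-item" and "other-street" are placeholders invented for illustration, since the slides do not show them.

```python
from collections import defaultdict

# Sales fact rows: (sales_key, location, item).
sales = [
    ("T57", "Main Street", "Sony-TV"),
    ("T238", "Main Street", "other-item"),    # placeholder item
    ("T459", "other-street", "Sony-TV"),      # placeholder location
    ("T884", "Main Street", "other-item"),    # placeholder item
]

def join_index(rows, column):
    """Precompute dimension value -> list of joinable sales keys."""
    index = defaultdict(list)
    for row in rows:
        index[row[column]].append(row[0])
    return dict(index)

location_sales = join_index(sales, 1)
item_sales = join_index(sales, 2)

# Linking two dimensions: sales tuples that join with both values.
both = sorted(set(location_sales["Main Street"]) & set(item_sales["Sony-TV"]))
print(location_sales["Main Street"], item_sales["Sony-TV"], both)
```

The point of the structure is that the join between the fact table and a dimension table is looked up rather than recomputed at query time.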


Efficient Processing of OLAP Queries

1. Determine which operations should be performed on the available cuboids

2. Determine to which materialized cuboid(s) the relevant operations should be applied


Metadata Repository

• A description of the structure of the data warehouse
• Operational metadata
• The algorithms used for summarization
• The mapping from the operational environment to the data warehouse
• Data related to system performance
• Business metadata


Data Warehouse Back-End Tools and Utilities

• Data extraction
• Data cleaning
• Data transformation
• Load
• Refresh


Further Development of Data Cube Technology

Discovery-Driven Exploration of Data Cubes

• SelfExp
• InExp
• PathExp


Sum of sales Month

Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec

Total 1% -1% 0% 1% 3% -1% -9% -1% 2% -4% 3%

Avg. sales Month

Item Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec

Sony b/w printer 9% -8% 2% -5% 14% -4% 0% 41% -13% -15% -11%

Sony color printer 0% 0% 3% 2% 4% -10% -13% 0% 4% -6% 4%

HP b/w printer -2% 1% 2% 3% 8% 0% -12% -9% 3% -3% 6%

HP color printer 0% 0% -2% 1% 0% -1% -7% -2% 1% -4% 1%

IBM desktop computer 1% -2% -1% -1% 3% 3% -10% 4% 1% -4% -1%

IBM laptop computer 0% 0% -1% 3% 4% 2% -10% -2% 0% -9% 3%

Toshiba desktop comp. -2% -5% 1% 1% -1% 1% 5% -3% -5% -1% -1%

Toshiba laptop comp. 1% 0% 3% 0% -2% -2% -5% 3% 2% -1% 0%

Logitech mouse 3% -2% -1% 0% 4% 6% -11% 2% 1% -4% 0%

Ergo-way mouse 0% 0% 2% 3% 1% -2% -2% -5% 0% -5% 8%

Change in sales over time

Change in sales for each item-time combination


Avg. sales Month

Region Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec

North -1% -3% -1% 0% 3% 4% -7% 1% 0% -3% -3%

South -1% 1% -9% 6% -1% -39% 9% -34% 4% 1% 7%

East -1% -2% 2% -3% 1% 18% -2% 11% -3% -2% -1%

West 4% 0% -1% -3% 5% 1% -18% 8% 5% -8% 1%

Change in sales for the item IBM desktop computer per region


Complex Aggregation at Multiple Granularities: Multifeature Cubes

• Example 1: Query 1: A simple data cube query. Find the total sales in 2000, broken down by item, region, and month, with subtotals for each dimension.

• Example 2: Query 2: A complex query. Grouping by all subsets of {item, region, month}, find the maximum price in 2000 for each group, and the total sales among all maximum price tuples.

select item, region, month, MAX(price), SUM(R.sales)
from Purchases
where year=2000
cube by item, region, month: R
such that R.price=MAX(price)

• Example 3: Query 3: An even more complex query. Grouping by all subsets of {item, region, month}, find the maximum price in 2000 for each group. Among the maximum price tuples, find the minimum and maximum item shelf life. Also find the fraction of the total sales due to tuples that have minimum shelf life within the set of all maximum price tuples, and the fraction of the total sales due to tuples that have maximum shelf life within the set of all maximum price tuples.

select item, region, month, MAX(price), MIN(R1.shelf), MAX(R1.shelf), SUM(R1.sales), SUM(R2.sales), SUM(R3.sales)
from Purchases
where year=2000
cube by item, region, month: R1, R2, R3
such that R1.price=MAX(price) and
    R2 in R1 and R2.shelf=MIN(R1.shelf) and
    R3 in R1 and R3.shelf=MAX(R1.shelf)


From Data Warehousing to Data Mining

Data Warehouse Usage

• Information processing
• Analytical processing
• Data mining


From On-Line Analytical Processing to On-Line Analytical Mining

• High quality of data in data warehouses
• Available information processing infrastructure surrounding data warehouses
• OLAP-based exploratory data analysis
• On-line selection of data mining functions


Architecture for On-Line Analytical Mining

[Figure: a four-layer architecture:
Layer 4, user interface: graphical user interface API, which accepts constraint-based mining queries and returns mining results
Layer 3, OLAP/OLAM: OLAM engine and OLAP engine, accessed through the cube API
Layer 2, multidimensional database: MDDB and meta data, accessed through the database API (filtering)
Layer 1, data repository: databases and data warehouse, fed by data cleaning, data integration, and data filtering]

An integrated OLAM and OLAP architecture


3 Data Preprocessing


[Figure: four panels:
Data cleaning
Data integration
Data transformation (for example, -2, 32, 100, 59, 48 becomes -0.02, 0.32, 1.00, 0.59, 0.48)
Data reduction (for example, a table with attributes A1, A2, A3, …, A126 and transactions T1, T2, T3, T4, …, T2000 is reduced to attributes A1, A3, …, A115 and transactions T1, T4, …, T1456)]

Forms of data preprocessing


Data Cleaning

Missing values

1. Ignore the tuple

2. Fill in the missing value manually

3. Use a global constant to fill in the missing value

4. Use the attribute mean to fill in the missing value

5. Use the attribute mean for all samples belonging to the same class as the given tuple

6. Use the most probable value to fill in the missing value
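Strategies 4 and 5 can be illustrated with a minimal sketch. The attribute (income), the class labels, and the values are made up; `None` marks a missing value.

```python
from statistics import mean

# (income, class label); the data is illustrative only.
rows = [
    (30000, "low"), (None, "low"), (50000, "low"),
    (90000, "high"), (None, "high"), (110000, "high"),
]

def fill_with_mean(rows):
    """Strategy 4: replace a missing value by the mean of the attribute."""
    overall = mean(v for v, _ in rows if v is not None)
    return [(overall if v is None else v, c) for v, c in rows]

def fill_with_class_mean(rows):
    """Strategy 5: replace a missing value by the mean within the tuple's class."""
    by_class = {}
    for v, c in rows:
        if v is not None:
            by_class.setdefault(c, []).append(v)
    means = {c: mean(vs) for c, vs in by_class.items()}
    return [(means[c] if v is None else v, c) for v, c in rows]

print(fill_with_mean(rows))
print(fill_with_class_mean(rows))
```

The class-conditional version keeps the filled-in values closer to what is typical for each group, at the cost of needing a class label for every tuple.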


Inconsistent data

Noisy data

• Binning

Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34

Partition into (equidepth) bins:
Bin 1: 4, 8, 15
Bin 2: 21, 21, 24
Bin 3: 25, 28, 34

Smoothing by bin means:
Bin 1: 9, 9, 9
Bin 2: 22, 22, 22
Bin 3: 29, 29, 29

Smoothing by bin boundaries:
Bin 1: 4, 4, 15
Bin 2: 21, 21, 24
Bin 3: 25, 25, 34

• Clustering
• Combined computer and human inspection
• Regression
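The binning example above can be reproduced with a short sketch. The function names are hypothetical, and the partitioning assumes the number of values is a multiple of the number of bins, as it is for the nine prices here.

```python
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # already sorted

def equidepth_bins(sorted_values, n_bins):
    """Split sorted values into bins holding the same number of values."""
    size = len(sorted_values) // n_bins
    return [sorted_values[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    """Replace every value in a bin by the (rounded) bin mean."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by the closer of the bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed

bins = equidepth_bins(prices, 3)
print(bins)
print(smooth_by_means(bins))
print(smooth_by_boundaries(bins))
```

Larger bins smooth more aggressively, because more distinct values collapse onto the same mean or boundary.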


Data Integration and Transformation

Data Integration

Data Transformation

• Smoothing

• Aggregation

• Generalization

• Normalization

• Attribute construction
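As an illustration of normalization, the sketch below shows two common variants, min-max normalization and z-score normalization. The slides do not say which variants are meant, so both the choice and the function names are assumptions, and the income values are made up. The code also assumes the values are not all equal, since the divisors would otherwise be zero.

```python
from statistics import mean, pstdev

def min_max(values, new_min=0.0, new_max=1.0):
    """Linearly map [min(values), max(values)] onto [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def z_score(values):
    """Center on the mean and scale by the (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

incomes = [12000, 73600, 98000]  # illustrative values
scaled = min_max(incomes)
standardized = z_score(incomes)
print(scaled)
print(standardized)
```

Min-max normalization preserves the relative spacing of the values but is sensitive to outliers that stretch the range; z-score normalization is useful when the actual minimum and maximum are unknown.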


Data Reduction

• Data cube aggregation
• Dimension reduction
• Data compression
• Numerosity reduction
• Discretization and concept hierarchy generation


Dimensionality reduction

1. Stepwise forward selection

2. Stepwise backward elimination

3. Combination of forward selection and backward elimination

4. Decision tree induction

• Example:

Forward selection
Initial attribute set: {A1,A2,A3,A4,A5,A6}
Initial reduced set: {}
→ {A1}
→ {A1,A4}
→ Reduced attribute set: {A1,A4,A6}

Backward elimination
Initial attribute set: {A1,A2,A3,A4,A5,A6}
→ {A1,A3,A4,A5,A6}
→ {A1,A4,A5,A6}
→ Reduced attribute set: {A1,A4,A6}

Decision tree induction
Initial attribute set: {A1,A2,A3,A4,A5,A6}
[Figure: a decision tree with the root test A4?, the child tests A1? and A6? on its Y and N branches, and the leaves Class1 and Class2]
→ Reduced attribute set: {A1,A4,A6}

Greedy (heuristic) methods for attribute subset selection.
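Stepwise forward selection can be sketched as a greedy loop. The scoring function below is purely hypothetical: it pretends that A1, A4, and A6 are the only relevant attributes, so that the run mirrors the example above. In practice the score would come from something like a statistical significance test or the accuracy of a classifier trained on the candidate subset.

```python
def forward_selection(attributes, score, max_size):
    """Greedily add the attribute that most improves score(subset)."""
    selected = []
    remaining = list(attributes)
    best_score = score(selected)
    while remaining and len(selected) < max_size:
        candidate = max(remaining, key=lambda a: score(selected + [a]))
        candidate_score = score(selected + [candidate])
        if candidate_score <= best_score:
            break  # no remaining attribute improves the subset
        selected.append(candidate)
        remaining.remove(candidate)
        best_score = candidate_score
    return selected

# Hypothetical relevance weights, used only to drive the toy score.
RELEVANCE = {"A1": 3, "A4": 2, "A6": 1}

def toy_score(subset):
    return sum(RELEVANCE.get(a, 0) for a in subset)

reduced = forward_selection(["A1", "A2", "A3", "A4", "A5", "A6"], toy_score, max_size=6)
print(reduced)
```

Backward elimination follows the same pattern in reverse: start from the full set and repeatedly drop the attribute whose removal hurts the score least.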


Data Compression

• Wavelet transforms

• Principal components analysis


Numerosity Reduction

• Regression and log-linear models

• Histograms

• Clustering

• Sampling


Histogram Examples

[Figure: histogram with singleton buckets; x-axis: price ($), from 5 to 30; y-axis: count, from 1 to 10]

A histogram for price using singleton buckets – each bucket represents one price-value/frequency pair.

[Figure: histogram with the buckets 1-10, 11-20, and 21-30; x-axis: price ($); y-axis: count, from 5 to 25]

An equiwidth histogram for price, where values are aggregated so that each bucket has a uniform width of $10.
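Both kinds of histogram can be computed with a counter. This is a minimal sketch: the price list is made up (it is not the data behind the figures), and the function names are hypothetical.

```python
from collections import Counter

def singleton_histogram(values):
    """One bucket per distinct value: value -> frequency."""
    return dict(sorted(Counter(values).items()))

def equiwidth_histogram(values, width, start=1):
    """Buckets of uniform width: (low, high) -> number of values in that range."""
    counts = Counter((v - start) // width for v in values)
    return {(start + b * width, start + (b + 1) * width - 1): n
            for b, n in sorted(counts.items())}

# Illustrative prices in dollars.
prices = [1, 1, 5, 5, 5, 8, 10, 12, 14, 14, 15, 18, 20, 21, 21, 25, 28, 30]

print(singleton_histogram(prices))
print(equiwidth_histogram(prices, 10))
```

The equiwidth version stores three numbers instead of one count per distinct price, which is the sense in which a histogram reduces numerosity.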


Discretization And Concept Hierarchy Generation

($0…$1000]
    ($0…$200]: ($0…$100], ($100…$200]
    ($200…$400]: ($200…$300], ($300…$400]
    ($400…$600]: ($400…$500], ($500…$600]
    ($600…$800]: ($600…$700], ($700…$800]
    ($800…$1000]: ($800…$900], ($900…$1000]

A concept hierarchy for the attribute price.


Discretization And Concept Hierarchy Generation for Numeric Data

• Binning
• Histogram analysis
• Cluster analysis
• Entropy-based Discretization
• Segmentation by natural partitioning


Concept Hierarchy Generation for Categorical Data

• Specification of a partial ordering of attributes explicitly at the schema level by users or experts

• Specification of a portion of a hierarchy by explicit data grouping

• Specification of a set of attributes, but not of their partial ordering

• Specification of only a partial set of attributes

country: 15 distinct values
province_or_state: 365 distinct values
city: 3,567 distinct values
street: 674,339 distinct values

Automatic generation of a schema concept hierarchy based on the number of distinct attribute values.
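The heuristic behind this automatic generation can be sketched in a few lines: the attribute with the fewest distinct values goes to the top of the hierarchy and the one with the most goes to the bottom. The function name is hypothetical; the counts are the ones given in the example.

```python
def order_by_distinct_values(distinct_counts):
    """Order attributes from the top of the hierarchy (fewest values) to the bottom."""
    return [name for name, _ in sorted(distinct_counts.items(), key=lambda kv: kv[1])]

counts = {
    "street": 674339,
    "country": 15,
    "city": 3567,
    "province_or_state": 365,
}
hierarchy = order_by_distinct_values(counts)
print(" > ".join(hierarchy))
```

This is only a heuristic, so the generated ordering should be reviewed: an attribute such as year can have more distinct values than month even though it sits higher in the time hierarchy.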


Credits:

Radmilo Pešić
Branko Golubović
Milutinović
