Data Mining
made by Radmilo Pesic & Branko Golubovic 1/74
Data Mining: Concepts and Techniques
Tutorial based on the book "Data Mining: Concepts and Techniques" by Jiawei Han and Micheline Kamber
This material was developed with financial help of the WUSA fund of Austria.
1 Introduction
What motivated data mining?
Necessity is the mother of invention.
Evolution of database technology:
• Data Collection and Database Creation (1960s and earlier)
• Database Management Systems (1970s-early 1980s)
• Advanced Database Systems (mid-1980s-present)
• Web-based Database Systems (1990s-present)
• Data Warehousing and Data Mining (mid-1980s-present)
• New Generation of Integrated Information Systems (2000-…)
What Is Data Mining?
Extracting or “mining” knowledge from large amounts of data.
[Figure: data mining as a step in the process of knowledge discovery: databases and flat files → cleaning and integration → data warehouse → selection and transformation → data mining → patterns → evaluation and presentation → knowledge]
1. Data cleaning
2. Data integration
3. Data selection
4. Data transformation
5. Data mining
6. Pattern evaluation
7. Knowledge presentation
Components of a typical data mining system:
• Database, data warehouse, or other information repository
• Database or data warehouse server
• Knowledge base
• Data mining engine
• Pattern evaluation module
• Graphical user interface
[Figure: architecture of a typical data mining system, from top to bottom: graphical user interface, pattern evaluation, data mining engine, database or data warehouse server, database and data warehouse; the knowledge base supports the data mining engine and pattern evaluation]
Data mining – On What Kind of Data?
• Relational Databases
• Data Warehouses
• Transactional Databases
• Advanced Database Systems and Advanced Database Applications (object-oriented, object-relational, spatial, temporal, time-series, text, multimedia, heterogeneous, and legacy databases, and the World Wide Web)
Relational Databases

customer
cust_ID  name          address                                         age  income  credit_info  …
C1       Smith, Sandy  5463 E Hastings, Burnaby, BC V5A 4S9, Canada    21   $27000  1            …
…        …             …                                               …    …       …            …

item
item_ID  name              brand    category         type       price    place_made  supplier     cost
I3       high_res_TV       Toshiba  high resolution  TV         $988.00  Japan       NikoX        $600.00
I8       multidisc-CDplay  Sanyo    multidisc        CD player  $369.00  Japan       Music Front  $120.00
…        …                 …        …                …          …        …           …            …

employee
empl_ID  name         category            group    salary   commission
E55      Jones, Jane  home entertainment  manager  $18,000  2%
…        …            …                   …        …        …

branch
branch_ID  name         address
B1         City Square  369 Cambie St., Vancouver, BC V5L 3A2, Canada
…          …            …

purchases
trans_ID  cust_ID  empl_ID  date      time   method_paid  amount
T100      C1       E55      09/21/98  15:45  Visa         $1357.00
…         …        …        …         …      …            …

item_sold
trans_ID  item_ID  qty
T100      I3       1
T100      I8       2

works_at
empl_ID  branch_ID
E55      B1
…        …
Data Warehouses
[Figure: data sources in Chicago, New York, Toronto, and Vancouver are cleaned, transformed, integrated, and loaded into the data warehouse; clients access the warehouse through query and analysis tools]
Typical architecture of a data warehouse for AllElectronics
[Figure: a data cube of sales with the dimensions address (cities: Chicago, New York, Toronto, Vancouver), time (quarters: Q1 to Q4), and item (types: home entertainment, computer, phone, security); one cell is <Vancouver, Q1, security>. Two derived cubes are shown: a roll-up on address (cities aggregated to the countries Canada and USA) and a drill-down on time data for Q1 (quarter Q1 expanded to the months Jan, Feb, March)]
Text Databases and Multimedia Databases
• Text databases can be highly unstructured, semistructured, or well structured
• Multimedia databases store image, audio, and video data
• Such data require a lot of storage space; audio and video are continuous-media data
Heterogeneous Databases and Legacy Databases
The World Wide Web
• mining path traversal patterns
Data Mining Functionalities: What Kinds of Patterns Can Be Mined?
• Concept/Class Description: Characterization and Discrimination
• Association Analysis
• Classification and Prediction
• Cluster Analysis
• Outlier Analysis
• Evolution Analysis
Are All of the Patterns Interesting?
A pattern is interesting if it is:
• easily understood
• valid
• (potentially) useful
• novel
or if it
• confirms user’s hypothesis
An interesting pattern represents knowledge!
Objective measures of pattern interestingness:
• support
• confidence
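The two objective measures can be made concrete for an association rule A ⇒ B: support is the fraction of transactions that contain both A and B, and confidence is the fraction of transactions containing A that also contain B. The sketch below is only an illustration; the transaction data and the function names are assumptions, not part of the tutorial.

```python
def count_containing(transactions, itemset):
    """Number of transactions that contain every item of `itemset`."""
    items = frozenset(itemset)
    return sum(1 for t in transactions if items <= frozenset(t))


def support(transactions, itemset):
    """Fraction of all transactions that contain `itemset`."""
    return count_containing(transactions, itemset) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Fraction of transactions with `antecedent` that also contain `consequent`."""
    base = count_containing(transactions, antecedent)
    if base == 0:
        raise ValueError("antecedent never occurs in the transactions")
    both = frozenset(antecedent) | frozenset(consequent)
    return count_containing(transactions, both) / base


# Hypothetical market-basket data, invented for this example.
TRANSACTIONS = [
    {"computer", "software"},
    {"computer", "software", "printer"},
    {"computer"},
    {"printer"},
]

rule_support = support(TRANSACTIONS, {"computer", "software"})          # 2 of 4 transactions
rule_confidence = confidence(TRANSACTIONS, {"computer"}, {"software"})  # 2 of 3 transactions
```

A rule is usually reported only if both values exceed user-chosen thresholds.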
Subjective measures of pattern interestingness:
• the pattern is unexpected
• the pattern is actionable
• the pattern is expected (it confirms a hypothesis the user wanted to validate)
Can a data mining system generate all of the interesting patterns?
Can a data mining system generate only interesting patterns?
Classification of Data Mining Systems
• according to kinds of databases mined (relational, data warehouse, object-oriented…)
• according to kinds of knowledge mined (association, classification, clustering…; generalized, primitive-level or knowledge at multiple levels; regularities or irregularities)
• according to the kinds of techniques utilized (autonomous, interactive exploratory or query-driven systems; data warehouse oriented, statistics…)
• according to the applications adapted (for finance, DNA, etc.)
[Figure: data mining as a confluence of multiple disciplines: database technology, statistics, machine learning, information science, visualization, and other disciplines]
Major Issues in Data Mining
Mining methodology and user interaction issues:
• Mining different kinds of knowledge in databases
• Interactive mining of knowledge at multiple levels of abstraction
• Incorporation of background knowledge
• Data mining query languages and ad hoc data mining
• Presentation and visualization of data mining results
• Handling noisy or incomplete data
• Pattern evaluation – the interestingness problem
Performance issues:
• Efficiency and scalability of data mining algorithms
• Parallel, distributed, and incremental mining algorithms
Issues relating to the diversity of database types:
• Handling of relational and complex types of data
• Mining information from heterogeneous databases and global information systems
2 Data Warehouse and OLAP Technology for Data Mining
What Is a Data Warehouse?
“A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision making process.”
W.H. Inmon
• Subject-oriented
• Integrated
• Time-variant
• Nonvolatile
How are organizations using the information from data warehouses?
• Increasing customer focus
• Repositioning products and managing product portfolios
• Analyzing operations and looking for sources of profit
• Managing customer relationships, making environmental corrections, and managing the cost of corporate assets
Different approaches to heterogeneous database integration:
• Query-driven approach (wrappers and integrators)
• Update-driven approach
Differences Between Operational Database Systems and Data Warehouses
• Users and system orientation
• Data contents
• Database design
• View
• Access patterns
Why have a separate data warehouse?
A Multidimensional Data Model
From Tables and Spreadsheets to Data Cubes
• A data cube is defined by dimensions and facts
• Dimension table
• Fact table
Sales data for AllElectronics by time (quarters) and item (types), one table per location:

location = “Chicago”
time  home ent.  comp.  phone  sec.
Q1    854        882    89     623
Q2    943        890    64     698
Q3    1032       924    59     789
Q4    1129       992    63     870

location = “New York”
time  home ent.  comp.  phone  sec.
Q1    1087       968    38     872
Q2    1130       1024   41     925
Q3    1034       1048   45     1002
Q4    1142       1091   54     984

location = “Toronto”
time  home ent.  comp.  phone  sec.
Q1    818        746    43     591
Q2    894        769    52     682
Q3    940        795    58     728
Q4    978        864    59     784

location = “Vancouver”
time  home ent.  comp.  phone  sec.
Q1    605        825    14     400
Q2    680        952    31     512
Q3    812        1023   30     501
Q4    927        1038   38     580

[Figure: the same data as a 3-D data cube with the dimensions location (cities), time (quarters), and item (types)]
A 2-D view of sales data for AllElectronics, and its 3-D data cube representation
[Figure: a series of 3-D data cubes with the dimensions location (cities), time (quarters), and item (types), one cube per supplier (supplier = “SUP1”, supplier = “SUP2”, …)]
A 4-D data cube representation of sales data for AllElectronics
Lattice of cuboids, making up a 4-D data cube:
0-D (apex) cuboid: all
1-D cuboids: (time), (item), (location), (supplier)
2-D cuboids: (time, item), (time, location), (time, supplier), (item, location), (item, supplier), (location, supplier)
3-D cuboids: (time, item, location), (time, item, supplier), (time, location, supplier), (item, location, supplier)
4-D (base) cuboid: (time, item, location, supplier)
Stars, Snowflakes, and Fact Constellations: Schemas for Multidimensional Databases
Star schema:
• a large central table (fact table)
• a set of smaller attendant tables (dimension tables), one for each dimension
sales fact table: time_key, item_key, branch_key, location_key, dollars_sold, units_sold
time dimension table: time_key, day, day_of_week, month, quarter, year
item dimension table: item_key, item_name, brand, type, supplier_type
branch dimension table: branch_key, branch_name, branch_type
location dimension table: location_key, street, city, province_or_state, country
Snowflake schema:
• a variant of the star schema, where some dimension tables are normalized
• normalization reduces redundancy, but also reduces the effectiveness of browsing
sales fact table: time_key, item_key, branch_key, location_key, dollars_sold, units_sold
time dimension table: time_key, day, day_of_week, month, quarter, year
item dimension table: item_key, item_name, brand, type, supplier_key
supplier dimension table: supplier_key, supplier_type
branch dimension table: branch_key, branch_name, branch_type
location dimension table: location_key, street, city_key
city dimension table: city_key, city, province_or_state, country
Fact constellation:
• multiple fact tables share dimension tables
sales fact table: time_key, item_key, branch_key, location_key, dollars_sold, units_sold
shipping fact table: item_key, time_key, shipper_key, from_location, to_location, dollars_cost, units_shipped
time dimension table: time_key, day, day_of_week, month, quarter, year
item dimension table: item_key, item_name, brand, type, supplier_type
branch dimension table: branch_key, branch_name, branch_type
location dimension table: location_key, street, city, province_or_state, country
shipper dimension table: shipper_key, shipper_name, location_key, shipper_type
Defining multidimensional schema
• DMQL – data mining query language
• Syntax:
cube definition:
define cube <cube_name> [<dimension_list>]: <measure_list>
dimension definition:
define dimension <dimension_name> as (<attribute_or_subdimension_list>)
Example:
• Constellation schema defined in DMQL:
define cube sales [time, item, branch, location]:
dollars_sold=sum(sales_in_dollars), units_sold=count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier_type)
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city, province_or_state, country)
define cube shipping [time, item, shipper, from_location, to_location]:
dollars_cost=sum(cost_in_dollars), units_shipped=count(*)
define dimension time as time in cube sales
define dimension item as item in cube sales
define dimension shipper as (shipper_key, shipper_name,
location as location in cube sales, shipper_type)
define dimension from_location as location in cube sales
define dimension to_location as location in cube sales
Measures: Their Categorization and Computation
Measures, based on the aggregate function:
• Distributive
• Algebraic
• Holistic
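The three categories differ in how a measure behaves when the data is split into partitions. A distributive measure (such as sum or count) can be obtained by combining the same measure computed on each partition; an algebraic measure (such as average) can be derived from a fixed number of distributive measures; a holistic measure (such as median) cannot in general be derived from a bounded summary of the partitions. The sketch below illustrates this on invented numbers; the partitioning and variable names are assumptions made for the example.

```python
from statistics import median

# Hypothetical data, already split into three partitions.
partitions = [[1, 2, 90], [3, 4, 5], [6, 7, 8]]
all_values = [v for part in partitions for v in part]

# Distributive: combining per-partition sums and counts gives the global result.
partial_sums = [sum(part) for part in partitions]      # [93, 12, 21]
partial_counts = [len(part) for part in partitions]    # [3, 3, 3]
total_sum = sum(partial_sums)
total_count = sum(partial_counts)

# Algebraic: the average is a function of two distributive measures.
average = total_sum / total_count

# Holistic: combining per-partition medians does not give the global median.
median_of_medians = median([median(part) for part in partitions])
true_median = median(all_values)
```

Here the median of the partition medians is 4, while the median of all nine values is 5, so the median has to be computed from the full data.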
Introducing Concept Hierarchies
• A concept hierarchy defines a sequence of mappings from a set of low-level to higher-level concepts.
A concept hierarchy for the dimension location (levels from highest to lowest):
all: all
country: Canada, USA
province_or_state: British Columbia, Ontario (Canada); New York, Illinois (USA)
city: Vancouver, Victoria (British Columbia); Toronto, Ottawa (Ontario); New York, Buffalo (New York); Chicago (Illinois)
• Hierarchical and lattice structures of attributes in warehouse dimensions:
Hierarchy for location: street < city < province_or_state < country
Lattice for time: day < month < quarter < year, and day < week < year
OLAP Operations in the Multidimensional Data Model
• Roll-up
• Drill-down
• Slice and dice
• Pivot (rotate)
• Other (drill-across, drill-through)
[Figure: the central sales cube with the dimensions location (cities), time (quarters), and item (types), and two cubes derived from it:
• roll-up on location (from cities to countries): the cities are aggregated to USA and Canada
• drill-down on time (from quarters to months): the quarters are expanded to January through December]
[Figure: further operations on the central sales cube:
• dice for (location = “Toronto” or “Vancouver”) and (time = “Q1” or “Q2”) and (item = “home entertainment” or “computer”)
• slice for time = “Q1”: a 2-D table of location (cities) by item (types)
• pivot: the item and location axes of the 2-D slice are rotated]
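The OLAP operations named in this section can be sketched on a tiny cube held in a Python dictionary. This is only an illustration under stated assumptions: the sales figures are the Q1 and Q2 values for home entertainment and computer from the AllElectronics sales data in this tutorial, while the dictionary layout, the city-to-country mapping, and all function names are invented for the example and are not how an OLAP server stores or processes a cube.

```python
from collections import defaultdict

# Assumed concept hierarchy for the roll-up (city -> country).
CITY_TO_COUNTRY = {"Chicago": "USA", "New York": "USA",
                   "Toronto": "Canada", "Vancouver": "Canada"}

ITEMS = ("home entertainment", "computer")
# (city, quarter) -> sales for the two items above.
_ROWS = {
    ("Chicago", "Q1"): (854, 882), ("Chicago", "Q2"): (943, 890),
    ("New York", "Q1"): (1087, 968), ("New York", "Q2"): (1130, 1024),
    ("Toronto", "Q1"): (818, 746), ("Toronto", "Q2"): (894, 769),
    ("Vancouver", "Q1"): (605, 825), ("Vancouver", "Q2"): (680, 952),
}
# The cube: (city, quarter, item) -> sales.
cube = {(city, quarter, item): value
        for (city, quarter), values in _ROWS.items()
        for item, value in zip(ITEMS, values)}


def roll_up_location(cells):
    """Roll-up on location: aggregate cities into countries."""
    result = defaultdict(int)
    for (city, quarter, item), sales in cells.items():
        result[(CITY_TO_COUNTRY[city], quarter, item)] += sales
    return dict(result)


def slice_time(cells, quarter):
    """Slice: fix one value of the time dimension, leaving a 2-D table."""
    return {(city, item): sales
            for (city, q, item), sales in cells.items() if q == quarter}


def dice(cells, cities, quarters, items):
    """Dice: select a sub-cube by restricting several dimensions at once."""
    return {key: sales for key, sales in cells.items()
            if key[0] in cities and key[1] in quarters and key[2] in items}


def pivot(table):
    """Pivot: swap the two axes of a 2-D table."""
    return {(b, a): sales for (a, b), sales in table.items()}


by_country = roll_up_location(cube)
q1 = slice_time(cube, "Q1")
sub_cube = dice(cube, {"Toronto", "Vancouver"}, {"Q1", "Q2"}, set(ITEMS))
q1_rotated = pivot(q1)
```

Drill-down is the reverse of roll-up and needs the more detailed data (for example monthly sales), so it cannot be derived from this quarterly cube alone.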
A Starnet Query Model for Querying Multidimensional Databases
[Figure: a starnet with one radial line per dimension; each point on a line is an abstraction level of that dimension]
location: street, city, province_or_state, country, continent
time: day, month, quarter, year
item: name, brand, category, type
customer: name, category, group
Data Warehouse Architecture
Steps for the Design and Construction of Data Warehouse
The Design of a Data Warehouse: A Business Analysis Framework
• top-down view
• data source view
• data warehouse view
• business query view
The Process of Data Warehouse Design
• top-down approach
• bottom-up approach
• combined approach
• waterfall method
• spiral method
Steps of the warehouse design:
1) Choosing a business process to model;
2) Choosing the grain of the business process;
3) Choosing the dimensions;
4) Choosing the measures.
A Three-Tier Data Warehouse Architecture
Top tier: front-end tools (query/report, analysis, data mining), which produce the output
Middle tier: OLAP server
Bottom tier: data warehouse server (data warehouse, data marts, metadata repository, monitoring, administration)
Data: operational databases and external sources, brought into the warehouse by extract, clean, transform, load, and refresh
There are three data warehouse models:
• Enterprise warehouse
• Data mart
• Virtual warehouse
Types of OLAP Servers: ROLAP versus MOLAP versus HOLAP
Relational OLAP (ROLAP) servers:
• use a relational or extended-relational DBMS
• offer greater scalability
Multidimensional OLAP (MOLAP) servers:
• use a data cube (multidimensional array) structure, which allows fast indexing
• may have low storage utilization when the data are sparse, so compression is used
Hybrid OLAP (HOLAP) servers:
• combine the scalability of ROLAP with the faster computation of MOLAP
• Microsoft SQL Server 7.0 OLAP Services supports a HOLAP server
Data Warehouse Implementation
• A data cube can be viewed as a set of SQL group by's; data cube computation extends SQL with a compute cube operator
• Example, for the dimensions city, item, and year: “Compute the sum of sales, grouping by item and city.” “Compute the sum of sales, grouping by item.” “Compute the sum of sales, grouping by city.”
• The possible group by’s are the following: {(city, item, year), (city, item), (city, year), (item, year), (city), (item), (year), ()}
Lattice of cuboids:
0-D (apex) cuboid: ()
1-D cuboids: (city), (item), (year)
2-D cuboids: (city, item), (city, year), (item, year)
3-D (base) cuboid: (city, item, year)
define cube sales [item, city, year]: sum(sales_in_dollars)
compute cube sales
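The set of group by's for the dimensions city, item, and year is simply the set of all subsets of the dimensions. A minimal sketch that enumerates them from the base cuboid down to the apex cuboid, using a hypothetical helper name:

```python
from itertools import combinations


def all_group_bys(dimensions):
    """All subsets of the dimensions, from the base cuboid down to the apex ()."""
    return [combo
            for size in range(len(dimensions), -1, -1)
            for combo in combinations(dimensions, size)]


group_bys = all_group_bys(("city", "item", "year"))
for group_by in group_bys:
    print(group_by)
```

For three dimensions this yields 2^3 = 8 group by's, one per cuboid in the lattice.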
• The number of cuboids in an n-dimensional data cube is $2^n$
• The number of cuboids in an n-dimensional data cube where dimension $i$ has a concept hierarchy with $L_i$ levels (for example, day < month < quarter < year for time) is:
$$T = \prod_{i=1}^{n} (L_i + 1)$$
• Example: if the cube has 10 dimensions and each dimension has 4 levels, the total number of cuboids that can be generated is $5^{10} \approx 9.8 \times 10^6$
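The cuboid count is a product over the dimensions of (number of hierarchy levels + 1), where the extra 1 stands for the virtual top level "all". A short sketch with a hypothetical function name, checking the 10-dimension example:

```python
from math import prod


def total_cuboids(levels_per_dimension):
    """Product of (L_i + 1) over all dimensions; the +1 is the level 'all'."""
    return prod(levels + 1 for levels in levels_per_dimension)


# 10 dimensions with 4 levels each, as in the example.
example = total_cuboids([4] * 10)

# With a single level per dimension the count reduces to 2^n.
no_hierarchy = total_cuboids([1] * 3)
```

Here `example` is 5^10 = 9,765,625, which rounds to the 9.8 × 10^6 quoted in the example.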
Partial Materialization: Selected Computation of Cuboids
There are three choices for data cube materialization given a base cuboid:
(1) do not precompute any of the “nonbase” cuboids (no materialization)
(2) precompute all of the cuboids (full materialization)
(3) selectively compute a proper subset of the whole set of possible cuboids (partial materialization)
The partial materialization of cuboids should consider three factors:
• identify the subset of cuboids to materialize,
• exploit the materialized cuboids during query processing, and
• efficiently update the materialized cuboids during load and refresh.
Multiway Array Aggregation in the Computation of Data Cubes
ROLAP:
• Sorting, hashing, and grouping operations are applied to the dimension attributes in order to reorder and cluster related tuples.
• Grouping is performed on some subaggregates as a “partial grouping step”. These “partial groupings” may be used to speed up the computation of other subaggregates.
• Aggregates may be computed from previously computed aggregates, rather than from the base fact tables.
MOLAP:
• Partition the array into chunks.
• Compute aggregates by visiting cube cells.
[Figure: dimension A is partitioned into a0 to a3, B into b0 to b3, and C into c0 to c3; the resulting chunks are numbered 1 to 64]
A 3-D array for the dimensions A, B, and C, organized into 64 chunks
Indexing OLAP Data
• Bitmap indexing
• Join indexing
Bitmap Indexing
Base table
RID  item  city
R1   H     V
R2   C     V
R3   P     V
R4   S     V
R5   H     T
R6   C     T
R7   P     T
R8   S     T

Item bitmap index table
RID  H  C  P  S
R1   1  0  0  0
R2   0  1  0  0
R3   0  0  1  0
R4   0  0  0  1
R5   1  0  0  0
R6   0  1  0  0
R7   0  0  1  0
R8   0  0  0  1

City bitmap index table
RID  V  T
R1   1  0
R2   1  0
R3   1  0
R4   1  0
R5   0  1
R6   0  1
R7   0  1
R8   0  1

Indexing OLAP data using bitmap indices
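A bitmap index keeps one bit vector per distinct attribute value, with one bit per row of the base table. The sketch below rebuilds the item and city bitmaps for the eight-row base table of this example and answers a combined selection with a bitwise AND; the function name and the list-of-dicts representation are assumptions made for illustration.

```python
def build_bitmap_index(rows, attribute):
    """Map each distinct value of `attribute` to a bit vector over the rows."""
    index = {}
    for position, row in enumerate(rows):
        bits = index.setdefault(row[attribute], [0] * len(rows))
        bits[position] = 1
    return index


# Base table from the example: items H, C, P, S and cities V, T.
rows = [
    {"RID": "R1", "item": "H", "city": "V"},
    {"RID": "R2", "item": "C", "city": "V"},
    {"RID": "R3", "item": "P", "city": "V"},
    {"RID": "R4", "item": "S", "city": "V"},
    {"RID": "R5", "item": "H", "city": "T"},
    {"RID": "R6", "item": "C", "city": "T"},
    {"RID": "R7", "item": "P", "city": "T"},
    {"RID": "R8", "item": "S", "city": "T"},
]

item_index = build_bitmap_index(rows, "item")
city_index = build_bitmap_index(rows, "city")

# Rows with item = H and city = T: AND the two bit vectors position by position.
matches = [rows[i]["RID"]
           for i, (a, b) in enumerate(zip(item_index["H"], city_index["T"]))
           if a & b]
```

The selection is answered from the bit vectors alone; the base table is touched only to fetch the matching row identifiers.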
Join Indexing
[Figure: linkages between a sales fact table and dimension tables for location and item: the location “Main Street” is linked to the sales tuples T57, T238, and T884; the item “Sony-TV” is linked to the sales tuples T57 and T459]
Linkages between a sales fact table and dimension tables for location and item

Join index table for location/sales
location     sales_key
…            …
Main Street  T57
Main Street  T238
Main Street  T884
…            …

Join index table for item/sales
item     sales_key
…        …
Sony-TV  T57
Sony-TV  T459
…        …

Join index table linking two dimensions location/item/sales
location     item     sales_key
…            …        …
Main Street  Sony-TV  T57
…            …        …

Join index tables based on the linkages between the sales fact table and dimension tables for location and item
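A join index records, for each value of a dimension attribute, which fact-table tuples join with it. The following sketch builds the location/sales and item/sales join indices for the sales keys of this example. The fact rows are partly hypothetical: the example only states which sales tuples link to “Main Street” and to “Sony-TV”, so the other location and item values below are placeholders invented for the illustration.

```python
from collections import defaultdict


def build_join_index(fact_rows, column):
    """Map each value of a dimension column to the sales keys that join with it."""
    index = defaultdict(list)
    for row in fact_rows:
        index[row[column]].append(row["sales_key"])
    return dict(index)


# "Other Street" and "Other-Item" are placeholders, not values from the tutorial.
sales = [
    {"sales_key": "T57", "location": "Main Street", "item": "Sony-TV"},
    {"sales_key": "T238", "location": "Main Street", "item": "Other-Item"},
    {"sales_key": "T459", "location": "Other Street", "item": "Sony-TV"},
    {"sales_key": "T884", "location": "Main Street", "item": "Other-Item"},
]

location_index = build_join_index(sales, "location")
item_index = build_join_index(sales, "item")

# Join index linking the two dimensions: sales made at Main Street for Sony-TV.
main_street_sony = [key for key in location_index["Main Street"]
                    if key in set(item_index["Sony-TV"])]
```

Because the joinable tuples are precomputed, a query on location or item does not have to scan the fact table to find them.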
Efficient Processing of OLAP Queries
1. Determine which operations should be performed on the available cuboids
2. Determine to which materialized cuboid(s) the relevant operations should be applied
Metadata Repository
• A description of the structure of the data warehouse
• Operational metadata
• The algorithms used for summarization
• The mapping from the operational environment to the data warehouse
• Data related to system performance
• Business metadata
Data Warehouse Back-End Tools and Utilities
• Data extraction
• Data cleaning
• Data transformation
• Load
• Refresh
Further Development of Data Cube Technology
Discovery-Driven Exploration of Data Cubes
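• SelfExp: the degree of surprise of a cell value relative to other cells at the same level of aggregation
• InExp: the degree of surprise somewhere beneath the cell, if we were to drill down from it
• PathExp: the degree of surprise for each drill-down path from the cell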
Change in sales over time (sum of sales, by month). The source gives 11 values for 12 months; they are placed under Feb to Dec here, on the reading that each value is the change from the previous month:
                        Jan   Feb   Mar   Apr   May   Jun   Jul   Aug   Sep   Oct   Nov   Dec
Total                         1%    -1%   0%    1%    3%    -1%   -9%   -1%   2%    -4%   3%

Change in sales for each item-time combination (avg. sales, by month):
Item                    Jan   Feb   Mar   Apr   May   Jun   Jul   Aug   Sep   Oct   Nov   Dec
Sony b/w printer              9%    -8%   2%    -5%   14%   -4%   0%    41%   -13%  -15%  -11%
Sony color printer            0%    0%    3%    2%    4%    -10%  -13%  0%    4%    -6%   4%
HP b/w printer                -2%   1%    2%    3%    8%    0%    -12%  -9%   3%    -3%   6%
HP color printer              0%    0%    -2%   1%    0%    -1%   -7%   -2%   1%    -4%   1%
IBM desktop computer          1%    -2%   -1%   -1%   3%    3%    -10%  4%    1%    -4%   -1%
IBM laptop computer           0%    0%    -1%   3%    4%    2%    -10%  -2%   0%    -9%   3%
Toshiba desktop comp.         -2%   -5%   1%    1%    -1%   1%    5%    -3%   -5%   -1%   -1%
Toshiba laptop comp.          1%    0%    3%    0%    -2%   -2%   -5%   3%    2%    -1%   0%
Logitech mouse                3%    -2%   -1%   0%    4%    6%    -11%  2%    1%    -4%   0%
Ergo-way mouse                0%    0%    2%    3%    1%    -2%   -2%   -5%   0%    -5%   8%
Change in sales for the item IBM desktop computer per region (avg. sales, by month; 11 values per row, placed under Feb to Dec):
Region                  Jan   Feb   Mar   Apr   May   Jun   Jul   Aug   Sep   Oct   Nov   Dec
North                         -1%   -3%   -1%   0%    3%    4%    -7%   1%    0%    -3%   -3%
South                         -1%   1%    -9%   6%    -1%   -39%  9%    -34%  4%    1%    7%
East                          -1%   -2%   2%    -3%   1%    18%   -2%   11%   -3%   -2%   -1%
West                          4%    0%    -1%   -3%   5%    1%    -18%  8%    5%    -8%   1%
Complex Aggregation at Multiple Granularities: Multifeature Cubes
• Example 1:
Query 1: A simple data cube query. Find the total sales in 2000, broken down by item, region, and month, with subtotals for each dimension.
• Example 2:
Query 2: A complex query. Grouping by all subsets of {item, region, month}, find the maximum price in 2000 for each group, and the total sales among all maximum price tuples.
select item, region, month, MAX(price), SUM(R.sales)
from Purchases
where year = 2000
cube by item, region, month: R
such that R.price = MAX(price)
• Example 3:
Query 3: An even more complex query. Grouping by all subsets of {item, region, month}, find the maximum price in 2000 for each group. Among the maximum price tuples, find the minimum and maximum item shelf life. Also find the fraction of the total sales due to tuples that have minimum shelf life within the set of all maximum price tuples, and the fraction of the total sales due to tuples that have maximum shelf life within the set of all maximum price tuples.
select item, region, month, MAX(price), MIN(R1.shelf), MAX(R1.shelf),
       SUM(R1.sales), SUM(R2.sales), SUM(R3.sales)
from Purchases
where year = 2000
cube by item, region, month: R1, R2, R3
such that R1.price = MAX(price) and
          R2 in R1 and R2.shelf = MIN(R1.shelf) and
          R3 in R1 and R3.shelf = MAX(R1.shelf)
From Data Warehousing to Data Mining
Data Warehouse Usage
• Information processing
• Analytical processing
• Data mining
From On-Line Analytical Processing to On-Line Analytical Mining
• High quality of data in data warehouses
• Available information processing infrastructure surrounding data warehouses
• OLAP-based exploratory data analysis
• On-line selection of data mining functions
Architecture for On-Line Analytical Mining
Layer 4, user interface: graphical user interface API; accepts constraint-based mining queries and returns mining results
Layer 3, OLAP/OLAM: OLAM engine and OLAP engine, working on top of the cube API
Layer 2, multidimensional database: MDDB and metadata, reached through the database API (filtering)
Layer 1, data repository: databases and data warehouse, prepared by data cleaning, data integration, and data filtering
An integrated OLAM and OLAP architecture
3 Data Preprocessing
• Data cleaning
• Data integration
• Data transformation, for example -2, 32, 100, 59, 48 → -0.02, 0.32, 1.00, 0.59, 0.48
• Data reduction, for example a table with attributes A1, A2, A3, …, A126 and transactions T1, T2, T3, T4, …, T2000 reduced to attributes A1, A3, …, A115 and transactions T1, T4, …, T1456
Forms of data preprocessing
Data Cleaning
Missing values
1. Ignore the tuple
2. Fill in the missing value manually
3. Use a global constant to fill in the missing value
4. Use the attribute mean to fill in the missing value
5. Use the attribute mean for all samples belonging to the same class as the given tuple
6. Use the most probable value to fill in the missing value
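Two of the strategies for missing values, filling with the attribute mean and filling with the mean of the tuple's own class, can be sketched as follows. The income values, the class labels, and the function names are invented for this illustration; `None` stands for a missing value.

```python
from statistics import mean


def fill_with_attribute_mean(values):
    """Replace every missing value by the mean of the known values."""
    known_mean = mean(v for v in values if v is not None)
    return [known_mean if v is None else v for v in values]


def fill_with_class_mean(values, classes):
    """Replace a missing value by the mean of the known values of its class."""
    known_by_class = {}
    for value, label in zip(values, classes):
        if value is not None:
            known_by_class.setdefault(label, []).append(value)
    class_means = {label: mean(vs) for label, vs in known_by_class.items()}
    # Assumes every class has at least one known value; otherwise a KeyError is raised.
    return [class_means[label] if value is None else value
            for value, label in zip(values, classes)]


# Hypothetical attribute "income" (in thousands) with two missing entries.
income = [30, None, 50, 100, None, 120]
credit_class = ["low", "low", "low", "high", "high", "high"]

by_attribute_mean = fill_with_attribute_mean(income)
by_class_mean = fill_with_class_mean(income, credit_class)
```

The class-based variant fills the two gaps with 40 and 110 instead of the global mean 75, which keeps the filled values closer to the other tuples of the same class.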
Inconsistent data
Noisy data
• Binning
Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34
Partition into (equidepth) bins:
Bin 1: 4, 8, 15
Bin 2: 21, 21, 24
Bin 3: 25, 28, 34
Smoothing by bin means:
Bin 1: 9, 9, 9
Bin 2: 22, 22, 22
Bin 3: 29, 29, 29
Smoothing by bin boundaries:
Bin 1: 4, 4, 15
Bin 2: 21, 21, 24
Bin 3: 25, 25, 34
• Clustering
• Combined computer and human inspection
• Regression
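The binning example can be reproduced with a few lines of code. The sketch below partitions the sorted prices into equidepth bins of three values, then smooths each bin by its mean and by its closest boundary. The function names are assumptions, and breaking ties toward the lower boundary is a choice made for this sketch.

```python
def equidepth_bins(sorted_values, depth):
    """Split sorted values into consecutive bins holding `depth` values each."""
    return [sorted_values[i:i + depth]
            for i in range(0, len(sorted_values), depth)]


def smooth_by_bin_means(bins):
    """Replace every value in a bin by the mean of that bin."""
    return [[sum(b) / len(b)] * len(b) for b in bins]


def smooth_by_bin_boundaries(bins):
    """Replace every value by the closer of the bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]  # bins are sorted, so these are the boundaries
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed


prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equidepth_bins(prices, 3)
by_means = smooth_by_bin_means(bins)
by_boundaries = smooth_by_bin_boundaries(bins)
```

Both results match the smoothed bins listed in the example.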
Data Integration and Transformation
Data Integration
Data Transformation
• Smoothing
• Aggregation
• Generalization
• Normalization
• Attribute construction
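Normalization rescales attribute values into a small, common range. The sketch below shows three common variants on the values -2, 32, 100, 59, 48 used in the preprocessing overview of this tutorial: min-max normalization, z-score normalization, and scaling by the largest absolute value. That the overview example corresponds to dividing by 100 (the largest absolute value) is an interpretation made here, and the function names are assumptions.

```python
from statistics import mean, pstdev


def min_max(values, new_min=0.0, new_max=1.0):
    """Linearly map [min(values), max(values)] onto [new_min, new_max]."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) * (new_max - new_min) + new_min
            for v in values]


def z_score(values):
    """Center on the mean and divide by the (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def scale_by_max_abs(values):
    """Divide by the largest absolute value, so results fall in [-1, 1]."""
    largest = max(abs(v) for v in values)
    return [v / largest for v in values]


values = [-2, 32, 100, 59, 48]
scaled = scale_by_max_abs(values)   # -0.02, 0.32, 1.0, 0.59, 0.48
unit_range = min_max(values)        # smallest value -> 0.0, largest -> 1.0
standardized = z_score(values)      # mean 0, standard deviation 1
```

Min-max normalization requires that the attribute has at least two distinct values, and z-score normalization requires a nonzero standard deviation.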
Data Reduction
• Data cube aggregation
• Dimensionality reduction
• Data compression
• Numerosity reduction
• Discretization and concept hierarchy generation
Dimensionality reduction
1. Stepwise forward selection
2. Stepwise backward elimination
3. Combination of forward selection and backward elimination
4. Decision tree induction
• Example: Forward selection
Initial attribute set: {A1, A2, A3, A4, A5, A6}
Initial reduced set: {} → {A1} → {A1, A4}
Reduced attribute set: {A1, A4, A6}
• Example: Backward elimination
Initial attribute set: {A1, A2, A3, A4, A5, A6} → {A1, A3, A4, A5, A6} → {A1, A4, A5, A6}
Reduced attribute set: {A1, A4, A6}
• Example: Decision tree induction
Initial attribute set: {A1, A2, A3, A4, A5, A6}
Tree: A4? (Y → A1?, N → A6?); A1? (Y → Class1, N → Class2); A6? (Y → Class1, N → Class2)
Reduced attribute set: {A1, A4, A6}
Greedy (heuristic) methods for attribute subset selection.
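Stepwise forward selection can be sketched as a greedy loop: start from the empty set and repeatedly add the attribute that improves a quality score the most, stopping when no attribute improves it. The scoring function below is a stand-in invented for this illustration (a sum of made-up per-attribute weights); in practice the score would come from a statistical test or from the accuracy of a model built on the candidate subset.

```python
def forward_selection(attributes, score):
    """Greedily add the attribute that raises `score` most; stop when none does."""
    selected = []
    remaining = list(attributes)
    while remaining:
        best = max(remaining, key=lambda a: score(selected + [a]))
        if score(selected + [best]) <= score(selected):
            break  # no remaining attribute improves the score
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical usefulness weights, chosen so that A1, A4, and A6 matter.
WEIGHTS = {"A1": 5, "A2": 0, "A3": 0, "A4": 3, "A5": 0, "A6": 2}


def toy_score(subset):
    """Stand-in score: the sum of the invented weights of the chosen attributes."""
    return sum(WEIGHTS[a] for a in subset)


reduced = forward_selection(["A1", "A2", "A3", "A4", "A5", "A6"], toy_score)
```

With these weights the loop selects A1, then A4, then A6, and stops, mirroring the forward selection example. Backward elimination works the same way in reverse, starting from the full set and removing the least useful attribute at each step.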
Data Compression
• Wavelet transforms
• Principal components analysis
Numerosity Reduction
• Regression and log-linear models
• Histograms
• Clustering
• Sampling
Histogram Examples
[Figure: histogram with price ($) from 5 to 30 on the x-axis and count from 1 to 10 on the y-axis]
A histogram for price using singleton buckets: each bucket represents one price-value/frequency pair.
[Figure: histogram with the price ($) buckets 1-10, 11-20, and 21-30 on the x-axis and count from 5 to 25 on the y-axis]
An equiwidth histogram for price, where values are aggregated so that each bucket has a uniform width of $10.
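Both kinds of histogram can be computed with a counter. The sketch below builds a singleton-bucket histogram and an equiwidth histogram with buckets of width $10 (1-10, 11-20, 21-30). The price list is invented for this illustration and is not the data behind the figures; the function names are assumptions.

```python
from collections import Counter


def singleton_histogram(values):
    """One bucket per distinct value: value -> frequency."""
    return dict(sorted(Counter(values).items()))


def equiwidth_histogram(values, width, start=1):
    """Buckets of equal width: (low, high) -> number of values in that range."""
    counts = Counter((v - start) // width for v in values)
    return {(start + b * width, start + (b + 1) * width - 1): n
            for b, n in sorted(counts.items())}


# Hypothetical prices in dollars.
prices = [1, 1, 5, 5, 5, 8, 10, 12, 14, 14, 15, 18, 20, 21, 25, 25, 28, 30]

singleton = singleton_histogram(prices)
equiwidth = equiwidth_histogram(prices, width=10)
```

The equiwidth histogram needs only three counts instead of one per distinct price, which is the point of using it for numerosity reduction.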
Discretization and Concept Hierarchy Generation
($0…$1000]
  ($0…$200]: ($0…$100], ($100…$200]
  ($200…$400]: ($200…$300], ($300…$400]
  ($400…$600]: ($400…$500], ($500…$600]
  ($600…$800]: ($600…$700], ($700…$800]
  ($800…$1000]: ($800…$900], ($900…$1000]
A concept hierarchy for the attribute price.
Discretization and Concept Hierarchy Generation for Numeric Data
• Binning
• Histogram analysis
• Cluster analysis
• Entropy-based discretization
• Segmentation by natural partitioning
Concept Hierarchy Generation for Categorical Data
• Specification of a partial ordering of attributes explicitly at the schema level by users or experts
• Specification of a portion of a hierarchy by explicit data grouping
• Specification of a set of attributes, but not of their partial ordering
• Specification of only a partial set of attributes
country: 15 distinct values
province_or_state: 365 distinct values
city: 3,567 distinct values
street: 674,339 distinct values
Automatic generation of a schema concept hierarchy based on the number of distinct attribute values.
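The automatic generation shown here rests on a simple heuristic: an attribute with fewer distinct values is placed higher in the hierarchy. A minimal sketch, using the distinct-value counts from the example and a hypothetical function name:

```python
def order_by_distinct_values(distinct_counts):
    """Order attributes from the top of the hierarchy (fewest values) downward."""
    return sorted(distinct_counts, key=distinct_counts.get)


# Distinct-value counts from the example, given in arbitrary order.
counts = {
    "street": 674_339,
    "country": 15,
    "city": 3_567,
    "province_or_state": 365,
}

hierarchy = order_by_distinct_values(counts)
print(" < ".join(reversed(hierarchy)))
```

The heuristic is not reliable for every dimension. For time, for example, there may be 20 distinct years, 12 months, and 7 days of the week, yet "year < month < days_of_the_week" is not a sensible hierarchy, so the generated ordering should be reviewed by a user or expert.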
Credits:
Radmilo Pešić
Branko Golubović
… Milutinović