Data Warehouse and OLAP - mzym.susu.ru
17.04.2013
DATA WAREHOUSE AND OLAP TECHNOLOGIES
Data Mining
Keep order, and the order shall save thee. Latin maxim
Outline
Data Mining
Data Warehouse
Definition
Architecture
OLAP
Multidimensional data model
OLAP cube computing
© Mikhail Zymbler
Data Warehouse
A data warehouse is a:
subject-oriented,
integrated,
time-variant, and
nonvolatile
collection of data in support of management’s
decision-making process.
Data warehousing is a process of constructing and using
data warehouses.
W. Inmon
Data Warehouse: subject-oriented
Organized around major subjects, such as customer, product, sales
Focuses on the modeling and analysis of data for decision makers, not on daily operations or transaction processing
Provides a simple and concise view of particular subject issues by excluding data that are not useful in the decision support process
Data Warehouse: integrated
Constructed by integrating multiple, heterogeneous data sources: relational databases, flat files, on-line transaction records
Data cleaning and data integration techniques are applied.
Ensures consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources
E.g., Hotel price: currency, tax, breakfast covered, etc.
When data is moved to the warehouse, it is converted.
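The hotel price example can be made concrete with a minimal Python sketch. Everything in it is an assumption for illustration: the `HotelPrice` record, the exchange rates, the tax rate, and the breakfast surcharge are hypothetical and are not taken from the slides. It only shows the idea that records quoted under different conventions are converted to one convention when they are moved to the warehouse.

```python
from dataclasses import dataclass

# Hypothetical exchange rates to USD; real values would come from a reference table.
RATES_TO_USD = {"USD": 1.0, "EUR": 1.25}


@dataclass
class HotelPrice:
    hotel: str
    amount: float            # price as quoted by the source
    currency: str
    tax_included: bool
    breakfast_included: bool


def to_warehouse(p: HotelPrice, tax_rate: float = 0.10, breakfast_usd: float = 10.0) -> dict:
    """Convert a source record to one convention: USD, tax and breakfast included."""
    usd = p.amount * RATES_TO_USD[p.currency]
    if not p.tax_included:
        usd *= 1.0 + tax_rate      # assumed flat tax rate
    if not p.breakfast_included:
        usd += breakfast_usd       # assumed flat breakfast surcharge
    return {"hotel": p.hotel, "price_usd": round(usd, 2)}


# Two sources that describe comparable offers under different conventions.
source_a = HotelPrice("Hotel A", 100.0, "USD", tax_included=False, breakfast_included=False)
source_b = HotelPrice("Hotel B", 96.0, "EUR", tax_included=True, breakfast_included=True)
rows = [to_warehouse(source_a), to_warehouse(source_b)]
```

After conversion both rows are expressed in the same unit and under the same inclusions, so they can be compared directly in a query.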
Data Warehouse: time variant
The time horizon for the data warehouse is significantly
longer than that of operational systems
Operational database: current value data
Data warehouse data: provide information from a historical
perspective (e.g., past 5-10 years)
Every key structure in the data warehouse contains an element of time, explicitly or implicitly
The key of operational data, by contrast, may or may not contain a time element
Data Warehouse: nonvolatile
A physically separate store of data transformed from the
operational environment
Operational update of data does not occur in the data
warehouse environment
Does not require transaction processing, recovery, and
concurrency control mechanisms
Requires only two operations in data accessing:
initial loading of data and access of data
Data Warehouse
Information from heterogeneous sources is integrated in advance and stored in a physically separate warehouse for direct query and analysis
[Figure: several Sources feed one Warehouse, which several Clients query.]
Data Warehouse vs. Operational DBMS
OLTP (On-Line Transaction Processing)
Major task of traditional relational DBMS
Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc.
OLAP (On-Line Analytical Processing)
Major task of data warehouse system
Data analysis and decision making
Distinct features (OLTP vs. OLAP)
User and system orientation: customer vs. market
Data contents: current, detailed vs. historical, consolidated
Database design: ER + application vs. star + subject
View: current, local vs. evolutionary, integrated
Access patterns: update vs. read-only but complex queries
Multidimensional data model
Data in a data warehouse is modeled and viewed in multiple dimensions
A dimension is a set of attribute values
Suppliers={MEXX, Bvlgari, Versace, Ecco, …}
Products={Clothes, Shoes, Cosmetic, Haberdashery, …}
Locations={Chelyabinsk, Moscow, Yekaterinburg, …}
A measure is a numerical function of dimensions
Cost: Suppliers × Products × Locations → R+
Cost(Ecco, Shoes, Chelyabinsk)=50300.75 (USD)
Amount: Suppliers × Products × Locations → Z+
Amount(Versace, Clothes, Moscow)=10 (items)
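One way to read "a measure is a function of dimensions" is as a lookup keyed by one value from each dimension. The Python sketch below is illustrative only: the sparse dictionary representation and the helper `measure` are assumptions, while the dimension values and the two example facts come from the slide.

```python
from typing import Dict, Optional, Tuple

# Dimensions as sets of attribute values (only the values listed on the slide).
Suppliers = {"MEXX", "Bvlgari", "Versace", "Ecco"}
Products = {"Clothes", "Shoes", "Cosmetic", "Haberdashery"}
Locations = {"Chelyabinsk", "Moscow", "Yekaterinburg"}

Key = Tuple[str, str, str]  # (supplier, product, location)

# Measures stored sparsely: only combinations with a recorded fact have a value.
cost: Dict[Key, float] = {("Ecco", "Shoes", "Chelyabinsk"): 50300.75}  # USD
amount: Dict[Key, int] = {("Versace", "Clothes", "Moscow"): 10}        # items


def measure(table: dict, supplier: str, product: str, location: str) -> Optional[float]:
    """Evaluate a measure at one point; reject coordinates outside the dimensions."""
    if supplier not in Suppliers or product not in Products or location not in Locations:
        raise KeyError("coordinate outside the dimension domains")
    return table.get((supplier, product, location))  # None if no fact is recorded
```

For instance, `measure(cost, "Ecco", "Shoes", "Chelyabinsk")` returns the cost recorded on the slide, while a valid point without a fact returns `None`.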
Data Warehouse: tables
A dimension table stores the attributes of one dimension
Dimension(ID, Attr1, Attr2, …)
Suppliers(SID, Name, Rating, …)
Products(PID, Name, Price, Color, …)
Locations(LID, Name, Address, …)
The fact table stores the data cube: one row per combination of dimension keys, with its measures
Fact(ID_Dim1, ID_Dim2, …, Measure1, Measure2, …)
Sales(SID, PID, LID, Cost, Amount)
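The relationship between the fact table and the dimension tables can be sketched in Python as follows. The row values, the surrogate keys, and the helper `describe` are hypothetical; only the table and column names follow the slide.

```python
# Dimension tables: surrogate ID -> attributes (all values are illustrative).
suppliers = {1: {"Name": "Ecco", "Rating": 5}}
products = {10: {"Name": "Shoes", "Price": 120.0, "Color": "black"}}
locations = {100: {"Name": "Chelyabinsk", "Address": "hypothetical address"}}

# Fact table: one row per (SID, PID, LID) with the measures Cost and Amount.
sales = [{"SID": 1, "PID": 10, "LID": 100, "Cost": 50300.75, "Amount": 420}]


def describe(fact: dict) -> dict:
    """Resolve the foreign keys of a fact row against the dimension tables."""
    return {
        "Supplier": suppliers[fact["SID"]]["Name"],
        "Product": products[fact["PID"]]["Name"],
        "Location": locations[fact["LID"]]["Name"],
        "Cost": fact["Cost"],
        "Amount": fact["Amount"],
    }


report = [describe(f) for f in sales]
```

The fact row itself holds only keys and numbers; descriptive attributes are looked up in the dimension tables when a report is produced.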
Data Warehouse: typical schemas
Star schema: a fact table in the middle connected to a
set of dimension tables.
Snowflake schema: a refinement of star schema where
some dimensional hierarchy is normalized into a set of
smaller dimension tables
Constellation (galaxy) schema: multiple fact tables
share dimension tables, viewed as a collection of stars.
Star schema
Fact table:
Supplements(ProductID, LocationID, SupplierID, Cost, Amount, …)
Dimension tables:
Products(ID, Name, Brand, …)
Locations(ID, Address, City, …)
Suppliers(ID, Name, …)
Snowflake schema
Fact table:
Supplements(ProductID, LocationID, SupplierID, Cost, Amount, …)
Dimension tables:
Products(ID, Name, BrandID, …)
Locations(ID, Address, CityID)
Suppliers(ID, Name, …)
Normalized sub-dimension tables:
Cities(ID, City, Country, Population, …)
Brands(ID, Name, Country, …)
Constellation (galaxy) schema
Fact tables:
Supplements(ProductID, LocationID, DeliverymanID, Cost, Amount)
Deliveries(ProductID, FromID, ToID, DeliverymanID, Cost, Amount)
Shared dimension tables:
Products(ID, Name, Brand, …)
Locations(ID, Address, City, Country)
Suppliers(ID, Name, …)
Deliverymen(ID, Name, LocaleID)
Time dimension
A day (DD-MMM-YYYY) is unambiguously identified by
number of the day in the month
number of the month in the year
number of the year
number of the week in the year
number of the day in the week
number of quarter
…
25-JAN-2010=(ID, 25, 1, 2010, 4, 2, 1, …)
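A row of the time dimension can be derived from a calendar date. The Python sketch below is one possible derivation, not the author's: it assumes the ISO week number and a Sunday-first day-of-week numbering, chosen because together they reproduce the 25-JAN-2010 example above. The surrogate ID is left out, and the function name is hypothetical.

```python
from datetime import date


def time_dimension_row(d: date) -> tuple:
    """Return (day, month, year, week of year, day of week, quarter) for a date."""
    week_of_year = d.isocalendar()[1]        # ISO 8601 week number (assumption)
    day_of_week = d.isoweekday() % 7 + 1     # Sunday=1, Monday=2, ..., Saturday=7 (assumption)
    quarter = (d.month - 1) // 3 + 1
    return (d.day, d.month, d.year, week_of_year, day_of_week, quarter)


row = time_dimension_row(date(2010, 1, 25))
```

Other week and weekday conventions are equally valid; the point is that each attribute is precomputed once so that queries can group by week, quarter, or weekday without date arithmetic.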
Hierarchy in dimensions
[Figure: concept hierarchy of the Location dimension with levels ALL, Region, Country, City, Shop. Regions: West Europe, East Europe. Countries: Portugal, Poland, …, Belarus, Russia, …. Cities: Lisbon, Krakow, …, Minsk, …, Moscow, …. Shops (leaf level): Tys ciekawostek, Тысяча мелочей, Milhares de trivialidades, Адна тысяча драбніц, Всё для вас.]
Multidimensional data model
The facts of a domain are points in an n-dimensional space.
[Figure: a fact drawn as a point in a three-dimensional space with axes Location, Time, and Product; the point carries the measure value Cost = 280.7.]
Data cube
[Figure: a three-dimensional data cube with axes Location (Yekaterinburg, Moscow, Chelyabinsk), Time, and Product (Footwear, Cosmetics, Clothing). Each cell holds a measure value such as 280, 70, or 310; cells marked "?" have no value shown.]
OLAP cube
An OLAP cube is a data cube in which every dimension has an additional ALL value; the points of the data space that involve ALL are computed by one or more aggregate functions.
Aggregate functions
Distributive
count(), sum(), min(), max(), etc.
Algebraic
avg(), stddev(), etc.
Holistic
median(), mode(), etc.
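The three classes differ in whether an aggregate over the whole data set can be assembled from aggregates over its partitions. The Python sketch below illustrates this on two small partitions; the split into `part1` and `part2` is arbitrary and only serves the illustration.

```python
from statistics import median

# Two partitions of one data set (e.g. cells computed on different nodes).
part1, part2 = [280, 70, 310], [330, 180]
whole = part1 + part2

# Distributive: applying the function to the partial results gives the full result.
total = sum([sum(part1), sum(part2)])
largest = max(max(part1), max(part2))

# Algebraic: avg is computed from a fixed number of distributive results (sum and count).
avg = (sum(part1) + sum(part2)) / (len(part1) + len(part2))

# Holistic: partial medians are not enough; the median of medians is generally wrong.
median_of_medians = median([median(part1), median(part2)])
true_median = median(whole)
```

Here `total` and `avg` agree with the values computed over `whole`, while `median_of_medians` (267.5) differs from `true_median` (280), which is why holistic functions are the expensive case when filling ALL cells.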
[Figure sequence: the data cube (Location: Yekaterinburg, Moscow, Chelyabinsk; Product: Shoes, Cosmetics, Clothing; Time) is extended step by step with an ALL value on each dimension. ALL cells hold aggregates over the corresponding rows, columns, and layers; the last step adds one overall total (29780 in the figure).]
Manipulating the OLAP cube
Slice and dice
projection and/or restriction
Roll-up (drill-up)
recomputing the measure while moving up a dimension hierarchy
Drill-down (roll-down)
recomputing the measure while moving down a dimension hierarchy
Pivot
changing the order of dimensions
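These operations can be sketched on a cube stored as a Python dictionary keyed by (time, location, product). This is an illustrative sketch under assumptions: the 2000 values follow the pivot slide, the 2001 values are made up (each 2000 value plus 10), and the city-to-region mapping is inferred from the roll-up slide. Drill-down is omitted because it needs data at a finer level than the cube holds.

```python
from collections import defaultdict

# Time=2000 values as on the pivot slide: product -> location -> cost.
year_2000 = {
    "Cosmetics": {"Yekaterinburg": 100, "Moscow": 240, "Chelyabinsk": 210},
    "Shoes": {"Yekaterinburg": 320, "Moscow": 170, "Chelyabinsk": 320},
    "Clothing": {"Yekaterinburg": 180, "Moscow": 250, "Chelyabinsk": 120},
}
cube = {(2000, loc, prod): v for prod, row in year_2000.items() for loc, v in row.items()}
# Made-up 2001 values, only to give the Time axis a second point.
cube.update({(2001, loc, prod): v + 10 for (_, loc, prod), v in list(cube.items())})

# Assumed hierarchy City -> Region (inferred from the roll-up slide).
REGION = {"Moscow": "Center", "Yekaterinburg": "Ural", "Chelyabinsk": "Ural"}


def slice_time(cube: dict, time: int) -> dict:
    """Slice: fix Time and drop it from the key, leaving a 2-D (location, product) view."""
    return {(loc, prod): v for (t, loc, prod), v in cube.items() if t == time}


def dice(cube: dict, times: set, locations: set, products: set) -> dict:
    """Dice: restrict every dimension to a subset of its values."""
    return {k: v for k, v in cube.items()
            if k[0] in times and k[1] in locations and k[2] in products}


def roll_up_location(cube: dict) -> dict:
    """Roll-up: re-aggregate (here with sum) from City to Region."""
    out = defaultdict(int)
    for (t, loc, prod), v in cube.items():
        out[(t, REGION[loc], prod)] += v
    return dict(out)


def pivot(view: dict) -> dict:
    """Pivot: swap the two axes of a 2-D view."""
    return {(b, a): v for (a, b), v in view.items()}
```

For example, `roll_up_location(cube)[(2000, "Ural", "Clothing")]` sums the Yekaterinburg and Chelyabinsk cells (180 + 120), and `pivot(slice_time(cube, 2000))` turns a location-by-product view into a product-by-location view without changing any value.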
Slice
slice for Time=2000
[Figure: fixing Time=2000 cuts a two-dimensional Product × Location face out of the cube (Products: Shoes, Cosmetics, Clothing; Locations: Yekaterinburg, Moscow, Chelyabinsk).]
Slice
Result of the slice for Time=2000:
2000 Yekaterinburg Moscow Chelyabinsk
Cosmetics 100 240 210
Shoes 320 170 320
Clothing 180 250 120
Pivot
Pivoting the Time=2000 slice exchanges its rows and columns:
2000 Yekaterinburg Moscow Chelyabinsk
Cosmetics 100 240 210
Shoes 320 170 320
Clothing 180 250 120
becomes
2000 Clothing Shoes Cosmetics
Yekaterinburg 180 320 100
Moscow 250 170 240
Chelyabinsk 120 320 210
Dice
dice for (Time=2000 or Time=2001) and (Location=Chelyabinsk or Location=Moscow) and (Product=Shoes or Product=Clothing)
[Figure: the dice cuts a 2 × 2 × 2 sub-cube out of the full cube, with Time values 2000 and 2001, Location values Moscow and Chelyabinsk, and Product values Shoes and Clothing.]
Roll-up
roll-up on Location (from City to Region)
[Figure: the city-level cube (Yekaterinburg, Moscow, Chelyabinsk) is aggregated into a region-level cube (Center, Ural). Moscow cells carry over to Center unchanged; Yekaterinburg and Chelyabinsk cells are summed into Ural, e.g. 180 + 120 = 300 and 270 + 50 = 320.]
Drill-down
drill-down on Time (from Year to Quarter)
[Figure: the Time=2000 slice is split into four quarterly slices. The quarterly values shown add up to the yearly values of the slice:]
Time Yekaterinburg Moscow Chelyabinsk
1 qrt. 80 50 60
2 qrt. 40 150 30
3 qrt. 20 25 20
4 qrt. 40 25 10
2000 180 250 120
Model of OLAP queries
[Figure: each dimension is drawn as an axis whose points are levels of its hierarchy.]
Location: Seller, Shop, City, Region
Time: Day, Week, Month, Quarter, Year
Product: Name, Brand, Category
Customer: Name, Category, TargetGroup
A data warehouse is the result of ETL
[Figure: data preprocessing pipeline. "Raw" data → Extraction → Target data → Transforming → Cleansed data → Loading → Data Warehouse. OLAP and Data Mining then turn the warehouse contents into Reports and Diagrams for Data Interpretation.]
Data Warehouse: a multi-tiered architecture
[Figure: tiers from left to right. Data Sources: operational databases and other sources. Data Storage: the Data Warehouse and Data Marts, filled by Extract, Transform, Load, and Refresh, with a Monitor & Integrator and a Metadata repository. OLAP Engine: OLAP Servers. Front-End Tools: Analysis, Queries, Reports, Data mining.]
Metadata Repository
Description of the structure of the data warehouse
schema, view, dimensions, hierarchies, derived data definition, data mart locations and contents
Operational meta-data
data lineage (history of migrated data and transformation path)
currency of data (active, archived, or purged)
monitoring information (warehouse usage statistics, error reports, audit trails)
Algorithms used for summarization
measure and dimension definition algorithms
data granularity, partitions, subject areas, aggregation, summarization, and predefined queries and reports
The mapping from operational environment to the data warehouse
source databases and their contents, gateway descriptions, data partitions, data extraction, cleaning, transformation rules, and defaults, data refresh and purge rules, security
Data related to system performance
indices, profiles
timing and scheduling of refresh
Business metadata
business terms and definitions
ownership of data
charging policies
OLAP Server Architectures
Relational OLAP (ROLAP)
Uses a relational or extended-relational DBMS to store and manage warehouse data, together with OLAP middleware
Includes optimization of the DBMS backend, implementation of aggregation navigation logic, and additional tools and services
Multidimensional OLAP (MOLAP)
Sparse array-based multidimensional storage engine
Fast indexing to pre-computed summarized data
Hybrid OLAP (HOLAP)
Low level: relational; high level: multidimensional arrays
Computing OLAP cube in SQL
ROLLUP BY (in standard SQL: GROUP BY ROLLUP (…))
creates subtotals that roll up from the most detailed level to a grand total, following a specified grouping list
takes as its argument an ordered list of grouping columns
calculates the standard aggregate values specified in the GROUP BY clause
creates progressively higher-level subtotals, moving from right to left through the list of grouping columns
creates a grand total
CUBE BY (in standard SQL: GROUP BY CUBE (…))
generates all the subtotals that could be calculated for a data cube with the specified dimensions
ROLLUP BY
select Time, Location, Product, sum(Cost) as Profit
from Sales
group by rollup (Time, Location, Product)

is equivalent to

select Time, Location, Product, sum(Cost) as Profit
from Sales
group by Time, Location, Product
union all
select Time, Location, NULL, sum(Cost) as Profit
from Sales
group by Time, Location
union all
select Time, NULL, NULL, sum(Cost) as Profit
from Sales
group by Time
union all
select NULL, NULL, NULL, sum(Cost) as Profit
from Sales
ROLLUP BY
Time Location Product Cost
2000 Chelyabinsk Clothing 100
2000 Chelyabinsk Cosmetics 120
2000 Moscow Clothing 250
2000 Moscow Cosmetics 75
2001 Chelyabinsk Clothing 230
2001 Chelyabinsk Cosmetics 310
2001 Moscow Clothing 170
2001 Moscow Cosmetics 350
ROLLUP BY
select Time, Location, Product, sum(Cost) as Profit
from Sales
group by rollup (Time, Location, Product)
Time Location Product Profit
2000 Chelyabinsk Clothing 100
2000 Chelyabinsk Cosmetics 120
2000 Chelyabinsk [NULL] 220
2000 Moscow Clothing 250
2000 Moscow Cosmetics 75
2000 Moscow [NULL] 325
2000 [NULL] [NULL] 545
2001 Chelyabinsk Clothing 230
2001 Chelyabinsk Cosmetics 310
2001 Chelyabinsk [NULL] 540
2001 Moscow Clothing 170
2001 Moscow Cosmetics 350
2001 Moscow [NULL] 520
2001 [NULL] [NULL] 1060
[NULL] [NULL] [NULL] 1605
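The subtotals in this result can be reproduced outside SQL. The Python sketch below is an illustration of the roll-up idea, not of how a DBMS implements it: for every row of the Sales table it adds the cost to each prefix of the grouping columns, with `None` standing in for NULL. The function name `rollup` is hypothetical.

```python
from collections import defaultdict

# The Sales table from the slides: (Time, Location, Product, Cost).
SALES = [
    (2000, "Chelyabinsk", "Clothing", 100),
    (2000, "Chelyabinsk", "Cosmetics", 120),
    (2000, "Moscow", "Clothing", 250),
    (2000, "Moscow", "Cosmetics", 75),
    (2001, "Chelyabinsk", "Clothing", 230),
    (2001, "Chelyabinsk", "Cosmetics", 310),
    (2001, "Moscow", "Clothing", 170),
    (2001, "Moscow", "Cosmetics", 350),
]


def rollup(rows, n_dims=3):
    """Sum the measure for every prefix of the grouping columns; None plays the role of NULL."""
    totals = defaultdict(int)
    for *dims, cost in rows:
        # Keep the first `keep` columns and replace the rest, right to left, by None.
        for keep in range(n_dims, -1, -1):
            key = tuple(dims[:keep]) + (None,) * (n_dims - keep)
            totals[key] += cost
    return dict(totals)


profit = rollup(SALES)
```

With three grouping columns this yields four grouping levels and 15 result rows for the table above, matching the query result.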
CUBE BY
select Time, Location, Product, sum(Cost) as Profit
from Sales
group by cube (Time, Location, Product)
Time Location Product Profit
2000 Chelyabinsk Clothing 100
2000 Chelyabinsk Cosmetics 120
2000 Chelyabinsk [NULL] 220
2000 Moscow Clothing 250
2000 Moscow Cosmetics 75
2000 Moscow [NULL] 325
2000 [NULL] Clothing 350
2000 [NULL] Cosmetics 195
2000 [NULL] [NULL] 545
2001 Chelyabinsk Clothing 230
2001 Chelyabinsk Cosmetics 310
2001 Chelyabinsk [NULL] 540
2001 Moscow Clothing 170
2001 Moscow Cosmetics 350
2001 Moscow [NULL] 520
2001 [NULL] Clothing 400
2001 [NULL] Cosmetics 660
2001 [NULL] [NULL] 1060
[NULL] Chelyabinsk Clothing 330
[NULL] Chelyabinsk Cosmetics 430
[NULL] Chelyabinsk [NULL] 760
[NULL] Moscow Clothing 420
[NULL] Moscow Cosmetics 425
[NULL] Moscow [NULL] 845
[NULL] [NULL] Clothing 750
[NULL] [NULL] Cosmetics 855
[NULL] [NULL] [NULL] 1605
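The full cube can be computed the same way as the roll-up, except that every subset of the grouping columns is aggregated rather than only the prefixes. The Python sketch below is illustrative: `cube_by` is a hypothetical name, `None` stands in for NULL, and the input is the Sales table from the slides.

```python
from collections import defaultdict
from itertools import product

# The Sales table from the slides: (Time, Location, Product, Cost).
SALES = [
    (2000, "Chelyabinsk", "Clothing", 100),
    (2000, "Chelyabinsk", "Cosmetics", 120),
    (2000, "Moscow", "Clothing", 250),
    (2000, "Moscow", "Cosmetics", 75),
    (2001, "Chelyabinsk", "Clothing", 230),
    (2001, "Chelyabinsk", "Cosmetics", 310),
    (2001, "Moscow", "Clothing", 170),
    (2001, "Moscow", "Cosmetics", 350),
]


def cube_by(rows, n_dims=3):
    """Sum the measure for every combination of kept and aggregated (None) columns."""
    totals = defaultdict(int)
    for *dims, cost in rows:
        # Each column is either kept or replaced by None: 2**n_dims keys per input row.
        for mask in product((True, False), repeat=n_dims):
            key = tuple(d if keep else None for d, keep in zip(dims, mask))
            totals[key] += cost
    return dict(totals)


profit = cube_by(SALES)
```

Each dimension here has two values plus ALL, so the complete cube has 3 × 3 × 3 = 27 cells, of which 8 are the detailed rows and 19 are subtotals.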
Conclusion
A data warehouse is a subject-oriented, integrated, time-variant, nonvolatile, and physically separate collection of data in support of management’s decision-making process.
A data warehouse is based on the multidimensional model.
There are three basic data warehouse schemas: star, snowflake, and constellation.
An OLAP cube is a data cube in which every dimension has an additional ALL value and the corresponding points of the data space are computed by one or more aggregate functions.
OLAP operations: slice, dice, roll-up, drill-down, pivot.
OLAP cube computing using SQL: ROLLUP BY and CUBE BY (GROUP BY ROLLUP and GROUP BY CUBE in standard SQL).