Data Warehouse Design


Page 1: Data Warehouse Design

Data Warehouse Design

Enrico Franconi

CS 636

Page 2: Data Warehouse Design

CS 336 2

Implementing a Warehouse

Monitoring: sending data from sources
Integrating: loading, cleansing, ...
Processing: query processing, indexing, ...
Managing: metadata, design, ...

Page 3: Data Warehouse Design

Monitoring

Source types: relational, flat file, IMS, VSAM, IDMS, WWW, news-wire, …

How to get data out?
  Replication tool
  Dump file
  Create report
  ODBC or third-party “wrappers”

Page 4: Data Warehouse Design

Monitoring Techniques

Periodic snapshots
Database triggers
Log shipping
Data shipping (replication service)
Transaction shipping
Polling (queries to source)
Screen scraping
Application-level monitoring
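
A minimal sketch of the first technique, periodic snapshots, assuming each snapshot is available as a mapping from primary key to row. The function name snapshot_diff and the sample rows are illustrative, not from the slides. Comparing two consecutive snapshots yields the changes a monitor would report to the warehouse.

```python
def snapshot_diff(old, new):
    """Compare two snapshots keyed by primary key.

    Returns (inserts, deletes, updates); updates map key -> (old_row, new_row).
    """
    inserts = {k: v for k, v in new.items() if k not in old}
    deletes = {k: v for k, v in old.items() if k not in new}
    updates = {k: (old[k], v) for k, v in new.items() if k in old and old[k] != v}
    return inserts, deletes, updates


# Hypothetical snapshots of a product table taken on two consecutive days.
yesterday = {"p1": ("bolt", 10), "p2": ("nut", 5)}
today = {"p1": ("bolt", 12), "p3": ("washer", 2)}

inserts, deletes, updates = snapshot_diff(yesterday, today)
```

Snapshot comparison needs no cooperation from the source beyond a dump, which is why it suits legacy systems, but its cost grows with the size of the snapshot rather than with the number of changes.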

Page 5: Data Warehouse Design

Monitoring Issues

Frequency
  periodic: daily, weekly, …
  triggered: on “big” change, lots of changes, ...
Data transformation
  convert data to uniform format
  remove & add fields (e.g., add date to get history)
Standards (e.g., ODBC)
Gateways

Page 6: Data Warehouse Design

Wrapper

Converts data and queries from one data model to another
Extends query capabilities for sources with limited capabilities

[Figure: a wrapper placed in front of a source, translating queries and data between data model A and data model B]
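
A minimal sketch of both wrapper roles, assuming a flat-file (CSV) source whose only native operation is a full scan. The class FlatFileWrapper, its select method, and the sample data are hypothetical. The wrapper converts lines of text into relational-style rows and adds selection and projection, which the source cannot do itself.

```python
import csv
import io


class FlatFileWrapper:
    """Hypothetical wrapper: presents a CSV flat file as a queryable relation."""

    def __init__(self, text):
        self._text = text

    def _scan(self):
        # The only operation the source supports natively: a full scan.
        return csv.DictReader(io.StringIO(self._text))

    def select(self, columns, where=None):
        """Answer a selection/projection query on behalf of the source."""
        where = where or {}
        rows = []
        for row in self._scan():
            if all(row[col] == val for col, val in where.items()):
                rows.append({col: row[col] for col in columns})
        return rows


source = "prodId,storeId,date,amt\np1,c1,1,12\np2,c1,1,11\np1,c3,1,50\n"
wrapper = FlatFileWrapper(source)
result = wrapper.select(["storeId", "amt"], where={"prodId": "p1"})
```

All values come back as strings here; a fuller wrapper would also map source types onto the warehouse's types.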

Page 7: Data Warehouse Design

Wrapper Generation

Solution 1: hard-code a wrapper for each source
Solution 2: automatic wrapper generation

[Figure: a wrapper generator takes a definition and produces a wrapper]

Page 8: Data Warehouse Design

Integration

Data Cleaning
Data Loading
Derived Data

[Figure: warehouse architecture with Sources, Integration, Warehouse, Query & Analysis, Clients, and Metadata]

Page 9: Data Warehouse Design

Data Integration

Receive data (changes) from multiple wrappers/monitors and integrate into warehouse

Rule-based

Actions
  Resolve inconsistencies
  Eliminate duplicates
  Integrate into warehouse (may not be empty)
  Summarize data
  Fetch more data from sources (warehouse updates)
  etc.

Page 10: Data Warehouse Design

Data Cleaning

Find (& remove) duplicate tuples
  e.g., Jane Doe vs. Jane Q. Doe
Detect inconsistent, wrong data
  Attribute values that don’t match
Patch missing, unreadable data
  Insert default values
Notify sources of errors found
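
A minimal sketch of duplicate detection for the Jane Doe vs. Jane Q. Doe case, assuming that two names match when they agree after lowercasing, stripping punctuation, and dropping single-letter middle initials. The helpers name_key and find_duplicates are illustrative; real cleaning tools use richer, domain-specific matching.

```python
import re
from collections import defaultdict


def name_key(name):
    """Normalize a name: lowercase, no punctuation, no single-letter middle initials."""
    tokens = re.sub(r"[^\w\s]", "", name.lower()).split()
    if len(tokens) > 2:
        middle = [t for t in tokens[1:-1] if len(t) > 1]
        tokens = [tokens[0]] + middle + [tokens[-1]]
    return " ".join(tokens)


def find_duplicates(names):
    """Group names that share a normalized key; return groups with more than one member."""
    groups = defaultdict(list)
    for n in names:
        groups[name_key(n)].append(n)
    return [g for g in groups.values() if len(g) > 1]


dupes = find_duplicates(["Jane Doe", "Jane Q. Doe", "John Smith"])
```

A rule this loose can also merge distinct people who share a first and last name, so candidate groups are usually reviewed or checked against further attributes before tuples are removed.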

Page 11: Data Warehouse Design

Data Cleaning

Migration (e.g., yen to dollars)
Scrubbing: use domain-specific knowledge (e.g., social security numbers)
Fusion (e.g., mail list, customer merging)

[Figure: customer1(Joe) and customer2(Joe), held in the billing DB and the service DB, are fused into merged_customer(Joe)]

Page 12: Data Warehouse Design

Loading Data in the Warehouse

Incremental vs. refresh
Off-line vs. on-line
Frequency of loading
  At night, 1x a week/month, continuously
Parallel/partitioned load

Page 13: Data Warehouse Design

Warehouse Maintenance

Warehouse data as a materialized view
  Initial loading
  View maintenance
Derived warehouse data
  indexes
  aggregates
  materialized views
View maintenance

Page 14: Data Warehouse Design

Materialized Views

Define new warehouse relations using SQL expressions

sale
prodId  storeId  date  amt
p1      c1       1     12
p2      c1       1     11
p1      c3       1     50
p2      c2       1     8
p1      c1       2     44
p1      c2       2     4

product
id  name  price
p1  bolt  10
p2  nut   5

joinTb (does not exist at any source)
prodId  name  price  storeId  date  amt
p1      bolt  10     c1       1     12
p2      nut   5      c1       1     11
p1      bolt  10     c3       1     50
p2      nut   5      c2       1     8
p1      bolt  10     c1       2     44
p1      bolt  10     c2       2     4
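
A minimal sketch of how joinTb could be defined with an SQL expression, using Python's built-in sqlite3 module and an in-memory database. SQLite has no materialized views, so the sketch stores the join result with CREATE TABLE ... AS SELECT; that substitution is an assumption of this example, not something the slides prescribe.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sale (prodId TEXT, storeId TEXT, date INTEGER, amt INTEGER);
    CREATE TABLE product (id TEXT PRIMARY KEY, name TEXT, price INTEGER);
    INSERT INTO sale VALUES
        ('p1','c1',1,12), ('p2','c1',1,11), ('p1','c3',1,50),
        ('p2','c2',1,8),  ('p1','c1',2,44), ('p1','c2',2,4);
    INSERT INTO product VALUES ('p1','bolt',10), ('p2','nut',5);
    -- Store the join result as a table: a stand-in for a materialized view.
    CREATE TABLE joinTb AS
        SELECT s.prodId, p.name, p.price, s.storeId, s.date, s.amt
        FROM sale AS s JOIN product AS p ON s.prodId = p.id;
""")
rows = conn.execute(
    "SELECT * FROM joinTb ORDER BY date, storeId, prodId"
).fetchall()
```

Because joinTb is a stored copy, it goes stale whenever sale or product changes, which is what makes view maintenance necessary.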

Page 15: Data Warehouse Design

Differs from Conventional View Maintenance...

Warehouses may be highly aggregated and summarized
Warehouse views may be over history of base data
Process large batch updates
Schema may evolve

Page 16: Data Warehouse Design

Differs from Conventional View Maintenance...

Base data doesn’t participate in view maintenance
  Simply reports changes
  Loosely coupled
  Absence of locking, global transactions
  May not be queryable

Page 17: Data Warehouse Design

Warehouse Maintenance Anomalies

Materialized view maintenance in loosely coupled, non-transactional environment

Simple example

[Figure: the Sales Comp. source holds Sale(item,clerk) and Emp(clerk,age); an Integrator maintains Sold(item,clerk,age) in the Data Warehouse]

Sold = Sale ⋈ Emp

Page 18: Data Warehouse Design

Warehouse Maintenance Anomalies

1. Insert into Emp(Mary,25), notify integrator
2. Insert into Sale(Computer,Mary), notify integrator
3. (1) integrator adds Sale ⋈ (Mary,25)
4. (2) integrator adds (Computer,Mary) ⋈ Emp
5. View incorrect (duplicate tuple)

[Figure: the Sales Comp. source holds Sale(item,clerk) and Emp(clerk,age); an Integrator maintains Sold(item,clerk,age) in the Data Warehouse]
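
The five steps can be replayed in a small simulation. This is a sketch under stated assumptions: relations are Python lists (bags), the integrator is a naive one that answers each notification by joining the reported tuple with the source's current state, and all function names are illustrative.

```python
# Source relations at the Sales Comp. and the warehouse view, kept as bags.
sale, emp, sold = [], [], []
pending = []  # notifications queued for the integrator


def insert_emp(clerk, age):
    emp.append((clerk, age))
    pending.append(("emp", (clerk, age)))


def insert_sale(item, clerk):
    sale.append((item, clerk))
    pending.append(("sale", (item, clerk)))


def integrate(note):
    """Naive integrator: joins the reported tuple with the source's CURRENT state."""
    kind, tup = note
    if kind == "emp":
        clerk, age = tup
        sold.extend((i, c, age) for (i, c) in sale if c == clerk)
    else:
        item, clerk = tup
        sold.extend((item, clerk, a) for (c, a) in emp if c == clerk)


# Steps 1-2: both source updates happen before the integrator runs.
insert_emp("Mary", 25)
insert_sale("Computer", "Mary")

# Steps 3-4: each notification is evaluated against the already-updated source.
for note in pending:
    integrate(note)

# Step 5: sold now holds (Computer, Mary, 25) twice; the correct view has it once.
```

The duplicate arises because the query for notification 1 already sees the effect of update 2, and the query for notification 2 then adds the same tuple again.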

Page 19: Data Warehouse Design

Maintenance Anomaly - Solutions

Incremental update algorithms (ECA, Strobe, etc.)
Research issues: self-maintainable views
  What views are self-maintainable
  Store auxiliary views so original + auxiliary views are self-maintainable

Page 20: Data Warehouse Design

Self-Maintainability: Examples

Sold(item,clerk,age) = Sale(item,clerk) ⋈ Emp(clerk,age)

Inserts into Emp
  If Emp.clerk is key and Sale.clerk is foreign key (with ref. int.) then no effect
Inserts into Sale
  Maintain auxiliary view: Emp - π_{clerk,age}(Sold)
Deletes from Emp
  Delete from Sold based on clerk

Page 21: Data Warehouse Design

Self-Maintainability: Examples

Deletes from Sale
  Delete from Sold based on {item,clerk}
  Unless age at time of sale is relevant
Auxiliary views for self-maintainability
  Must themselves be self-maintainable
  One solution: all source data
  But want minimal set
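
A minimal sketch of the four cases for Sold = Sale ⋈ Emp, maintained at the warehouse without querying the sources. Assumptions: Emp.clerk is a key, Sale.clerk is a foreign key with referential integrity, Sale is a set, and the auxiliary view holds the clerks of Emp that do not yet appear in Sold. The class SoldView and its method names are hypothetical. Moving a clerk back into the auxiliary view after their last sale is deleted is this sketch's own choice for keeping the auxiliary view self-maintainable.

```python
class SoldView:
    """Sold = Sale join Emp, maintained from reported changes only."""

    def __init__(self):
        self.sold = set()  # tuples (item, clerk, age)
        self.aux = {}      # clerk -> age, for clerks with no tuple in Sold

    def _age_in_sold(self, clerk):
        return next((a for (_, c, a) in self.sold if c == clerk), None)

    def insert_emp(self, clerk, age):
        # No effect on Sold: a new clerk has no sales yet. Only aux grows.
        self.aux[clerk] = age

    def insert_sale(self, item, clerk):
        age = self._age_in_sold(clerk)
        if age is None:
            # Referential integrity guarantees the clerk is in aux.
            age = self.aux.pop(clerk)
        self.sold.add((item, clerk, age))

    def delete_emp(self, clerk):
        self.sold = {t for t in self.sold if t[1] != clerk}
        self.aux.pop(clerk, None)

    def delete_sale(self, item, clerk):
        age = self._age_in_sold(clerk)
        self.sold = {t for t in self.sold if (t[0], t[1]) != (item, clerk)}
        # Assumption: once a clerk's last sale is gone, remember the age in aux.
        if age is not None and self._age_in_sold(clerk) is None:
            self.aux[clerk] = age


view = SoldView()
view.insert_emp("Mary", 25)
view.insert_sale("Computer", "Mary")
view.insert_sale("Printer", "Mary")
view.delete_sale("Computer", "Mary")
```

After these calls view.sold contains only (Printer, Mary, 25) and view.aux is empty; deleting the Printer sale as well would return Mary to the auxiliary view.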

Page 22: Data Warehouse Design

Partial Self-Maintainability

Avoid (but don’t prohibit) going to sources

Sold = Sale(item,clerk) ⋈ Emp(clerk,age)

Inserts into Sale
  Check if clerk already in Sold, go to source if not
  Or replicate all clerks over age 30
  Or ...

Page 23: Data Warehouse Design

Warehouse Specification (ideally)

[Figure: Extractor/Monitors, Integrator, Warehouse, and Metadata, together with a Warehouse Configuration Module; labels on the diagram: View Definitions, Integration rules, Change Detection Requirements]

Page 24: Data Warehouse Design

Processing

ROLAP servers vs. MOLAP servers
Index Structures
What to Materialize?
Algorithms

[Figure: warehouse architecture with Sources, Integration, Warehouse, Query & Analysis, Clients, and Metadata]

Page 25: Data Warehouse Design

ROLAP Server

Relational OLAP Server

[Figure: tools and utilities on top of a ROLAP server, which sits on a relational DBMS]

sale
prodId  date  sum
p1      1     62
p2      1     19
p1      2     48

Special indices, tuning; schema is “denormalized”
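
A minimal sketch of how the summary table above can be derived relationally, assuming the six detail rows of the sale relation from the Materialized Views slide and using Python's built-in sqlite3 module. A ROLAP server would issue a GROUP BY of this kind against the relational DBMS.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sale (prodId TEXT, storeId TEXT, date INTEGER, amt INTEGER)"
)
conn.executemany(
    "INSERT INTO sale VALUES (?, ?, ?, ?)",
    [("p1", "c1", 1, 12), ("p2", "c1", 1, 11), ("p1", "c3", 1, 50),
     ("p2", "c2", 1, 8), ("p1", "c1", 2, 44), ("p1", "c2", 2, 4)],
)

# Aggregate away the store dimension: total amount per product per day.
summary = conn.execute(
    "SELECT prodId, date, SUM(amt) FROM sale "
    "GROUP BY prodId, date ORDER BY date, prodId"
).fetchall()
```

The result reproduces the three rows shown on the slide: (p1, 1, 62), (p2, 1, 19), (p1, 2, 48).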

Page 26: Data Warehouse Design

MOLAP Server

Multi-Dimensional OLAP Server

[Figure: M.D. tools and utilities on top of a multi-dimensional server, which could also sit on a relational DBMS]

[Figure: Sales data cube with dimensions Product (milk, soda, eggs, soap), City (A, B), and Date (1, 2, 3, 4)]

Page 27: Data Warehouse Design

Index Structures (sketch)

Traditional access methods
  B-trees, hash tables, R-trees, grids, …
Popular in warehouses
  inverted lists
  bit map indexes
  join indexes
  text indexes
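
A minimal sketch of a bit map index, assuming rows are addressed by position and each bitmap is held in a Python integer. The helpers build_bitmap_index and matching_rows are illustrative, and the rows mirror the prodId and storeId columns of the sale relation. A conjunctive selection becomes a bitwise AND of two bitmaps.

```python
def build_bitmap_index(rows, column):
    """Map each distinct value of `column` to an int with bit i set when row i has it."""
    index = {}
    for i, row in enumerate(rows):
        index[row[column]] = index.get(row[column], 0) | (1 << i)
    return index


def matching_rows(bitmap):
    """Decode a bitmap into the sorted list of row positions whose bit is set."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


rows = [
    {"prodId": "p1", "storeId": "c1"}, {"prodId": "p2", "storeId": "c1"},
    {"prodId": "p1", "storeId": "c3"}, {"prodId": "p2", "storeId": "c2"},
    {"prodId": "p1", "storeId": "c1"}, {"prodId": "p1", "storeId": "c2"},
]
by_prod = build_bitmap_index(rows, "prodId")
by_store = build_bitmap_index(rows, "storeId")

# prodId = 'p1' AND storeId = 'c1' is a bitwise AND of the two bitmaps.
hits = matching_rows(by_prod["p1"] & by_store["c1"])
```

Bit map indexes work best on columns with few distinct values, which is common for warehouse dimensions.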

Page 28: Data Warehouse Design

What to Materialize?

Store in warehouse results useful for common queries

Example:

day 1
      c1   c2   c3
p1    12        50
p2    11    8

day 2
      c1   c2   c3
p1    44    4
p2

summed over days
      c1   c2   c3
p1    56    4   50
p2    11    8

summed over days and products
      c1   c2   c3
      67   12   50

summed over days and cities
p1   110
p2    19

total sales: 129

Each of these aggregates is a candidate to materialize.
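
A minimal sketch of how the aggregates in this example follow from the day 1 and day 2 tables, assuming the cube is stored as a mapping from (product, city, day) to amount. The function roll_up and the dimension names are illustrative.

```python
from collections import defaultdict

# (product, city, day) -> amount, taken from the day 1 and day 2 tables.
cube = {
    ("p1", "c1", 1): 12, ("p1", "c3", 1): 50, ("p2", "c1", 1): 11,
    ("p2", "c2", 1): 8, ("p1", "c1", 2): 44, ("p1", "c2", 2): 4,
}
DIMS = ("product", "city", "day")


def roll_up(cells, keep):
    """Sum the cube over every dimension not listed in `keep`."""
    positions = [DIMS.index(d) for d in keep]
    out = defaultdict(int)
    for key, amt in cells.items():
        out[tuple(key[p] for p in positions)] += amt
    return dict(out)


by_product_city = roll_up(cube, ("product", "city"))
by_city = roll_up(cube, ("city",))
by_product = roll_up(cube, ("product",))
total = roll_up(cube, ())
```

Any of these results could be stored in the warehouse; coarser ones such as by_city can also be computed from a stored by_product_city instead of from the full cube.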

Page 29: Data Warehouse Design

Materialization Factors

Type/frequency of queries
Query response time
Storage cost
Update cost

Page 30: Data Warehouse Design

Cube Aggregates Lattice

city, product, date
city, product    city, date    product, date
city    product    date
all

[Figure: the lattice is annotated with example aggregates: the day 1 and day 2 tables at (city, product, date), the table summed over days at (city, product), the row 67 12 50 at (city), and 129 at (all)]

Use greedy algorithm to decide what to materialize
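
The slides do not say which greedy algorithm is meant. The sketch below assumes a common benefit-based formulation: the top view is always materialized, the cost of answering a view is the size of the smallest materialized view that contains its dimensions, and each round adds the view that lowers total cost the most. The row counts in sizes are invented for illustration.

```python
from itertools import combinations

DIMS = ("city", "product", "date")
# Every node of the lattice, as a frozenset of dimension names.
VIEWS = [frozenset(c) for r in range(len(DIMS) + 1) for c in combinations(DIMS, r)]


def greedy_select(sizes, k):
    """Pick up to k views to materialize besides the top view, by greatest benefit."""
    top = frozenset(DIMS)
    chosen = [top]

    def cost(w):
        # Cheapest materialized view that can answer w (any superset of w).
        return min(sizes[m] for m in chosen if w <= m)

    def benefit(v):
        # Total saving over all views that v could answer.
        return sum(max(0, cost(w) - sizes[v]) for w in VIEWS if w <= v)

    for _ in range(k):
        candidates = [v for v in VIEWS if v not in chosen]
        best = max(candidates, key=benefit)
        if benefit(best) <= 0:
            break
        chosen.append(best)
    return chosen[1:]


# Hypothetical row counts for each aggregate.
sizes = {
    frozenset(DIMS): 6000,
    frozenset({"city", "product"}): 5000,
    frozenset({"city", "date"}): 800,
    frozenset({"product", "date"}): 2000,
    frozenset({"city"}): 50,
    frozenset({"product"}): 100,
    frozenset({"date"}): 30,
    frozenset(): 1,
}
picked = greedy_select(sizes, 2)
```

With these sizes the first pick is (city, date) and the second is (product, date): (city, product) is almost as large as the top view, so materializing it saves little.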

Page 31: Data Warehouse Design

Dimension Hierarchies

all
state
city

cities
city  state
c1    CA
c2    NY
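
A minimal sketch of rolling city-level totals up to the state level through the cities dimension table. The per-city totals 67 and 12 are the c1 and c2 figures from the What to Materialize? example; c3 is left out because the cities table lists no state for it. The function name is illustrative.

```python
from collections import defaultdict

cities = {"c1": "CA", "c2": "NY"}     # cities dimension table: city -> state
sales_by_city = {"c1": 67, "c2": 12}  # city-level aggregate


def roll_up_to_state(by_city, city_to_state):
    """Re-aggregate a city-level measure one level up the hierarchy."""
    by_state = defaultdict(int)
    for city, amt in by_city.items():
        by_state[city_to_state[city]] += amt
    return dict(by_state)


sales_by_state = roll_up_to_state(sales_by_city, cities)
```

Because every city belongs to exactly one state, the state-level aggregate can be computed from the city-level one without going back to the detail data.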

Page 32: Data Warehouse Design

Dimension Hierarchies

[Figure: the cube lattice extended with the state level. Nodes: (city, product, date), (city, product), (city, date), (product, date), (city), (product), (date), (all), plus (state, product, date), (state, product), (state, date), (state). Not all arcs shown.]

Page 33: Data Warehouse Design

Interesting Hierarchy

[Figure: time hierarchy over the levels all, years, quarters, months, weeks, days]

time (conceptual dimension table)
day  week  month  quarter  year
1    1     1      1        2000
2    1     1      1        2000
3    1     1      1        2000
4    1     1      1        2000
5    1     1      1        2000
6    1     1      1        2000
7    1     1      1        2000
8    2     1      1        2000

Page 34: Data Warehouse Design

Managing

Metadata
Warehouse Design
Tools

[Figure: warehouse architecture with Sources, Integration, Warehouse, Query & Analysis, Clients, and Metadata]

Page 35: Data Warehouse Design

Metadata

Administrative
  definition of sources, tools, ...
  schemas, dimension hierarchies, …
  rules for extraction, cleaning, …
  refresh, purging policies
  user profiles, access control, ...

Page 36: Data Warehouse Design

Metadata

Business
  business terms & definition
  data ownership, charging
Operational
  data lineage
  data currency (e.g., active, archived, purged)
  use stats, error reports, audit trails

Page 37: Data Warehouse Design

Design Summary

What data is needed?
Where does it come from?
How to clean data?
How to represent in warehouse (schema)?
What to summarize?
What to materialize?
What to index?

Page 38: Data Warehouse Design

Tools

Development
  design & edit: schemas, views, scripts, rules, queries, reports
Planning & Analysis
  what-if scenarios (schema changes, refresh rates), capacity planning
Warehouse Management
  performance monitoring, usage patterns, exception reporting
System & Network Management
  measure traffic (sources, warehouse, clients)
Workflow Management
  “reliable scripts” for cleaning & analyzing data

Page 39: Data Warehouse Design

Current State of Industry

Extraction and integration done off-line
  Usually in large, time-consuming batches
Everything copied at warehouse
  Not selective about what is stored
  Query benefit vs storage & update cost
Query optimization aimed at OLTP
  High throughput instead of fast response
  Process whole query before displaying anything

Page 40: Data Warehouse Design

State of Commercial Practice ...

Connectivity to sources
  Apertus
  Information Builders
  Informix Enterprise Gateway
  Oracle Open Connect
  CA-Ingres gateway
  MS ODBC
  Platinum InfoHub
Data extract, clean, transform, refresh
  CA-Ingres Replicator
  ETI-Extract
  IBM Data Joiner, Data Propagator
  Prism Warehouse manager
  SAS Access
  Sybase Replication Server
  Trinzic InfoPump

Page 41: Data Warehouse Design

… State of Commercial Practice ...

Multidimensional Database Engines
  Arbor Essbase
  Oracle IRI Express
  Comshare Commander
  SAS System
Warehouse Data Servers
  CA-Ingres
  Oracle 8
  RedBrick
  Sybase IQ
  Informix Dynamic Server
  IBM DB2
ROLAP Servers
  HP Intelligent Warehouse
  Informix Metacube
  MicroStrategy DSS Server
  Information Advantage Axsys

Page 42: Data Warehouse Design

… State of Commercial Practice

Query/Reporting Environments
  IBM DataGuide
  SAS Access
  CA Visual Express
  Platinum Forest&Trees
  Informix ViewPoint
Multidimensional Analysis
  Kenan Systems Acumate
  Microsoft Excel
  Arbor Essbase Analysis server
  Cognos PowerPlay
  IQ Software IQ/Vision
  Lotus 123
  SAS OLAP++
  Business Objects

Lots and lots of consulting!!

Page 43: Data Warehouse Design

Future Directions

Better performance
Larger warehouses
Easier to use
What are companies & research labs working on?

Page 44: Data Warehouse Design

Research (1)

Incremental Maintenance
Data Consistency
Data Expiration
Recovery
Data Quality
Error Handling (Back Flush)

Page 45: Data Warehouse Design

Research (2)

Rapid Monitor Construction
Temporal Warehouses
Materialization & Index Selection
Data Fusion
Data Mining
Integration of Text & Relational Data
Conceptual Modelling

Page 46: Data Warehouse Design

Conclusions

Massive amounts of data and complexity of queries will push limits of current warehouses
Need better systems:
  easier to use
  provide quality information