Date warehousing concepts

Data Warehousing Concepts and Design


Transcript of Date warehousing concepts

Page 1: Date warehousing concepts

Data Warehousing Concepts and Design

Page 3: Date warehousing concepts

Objectives

Data Warehousing Concepts

• What is Business Intelligence (BI)?
• Evolution of BI
• Characteristics of an OLTP system
• Why OLTP is not suitable for complex analysis?
• Characteristics of a Data Warehouse
• Define DWH and its properties:
  – Subject Oriented, Integrated, Time variant, Non-Volatile
• Define Grain/Granularity
• Differentiate between OLTP and Data Warehouse
• User expectations and User community
• Enterprise Data Warehouse
• Data Warehouse versus Data marts
• Dependent Data marts
• Independent Data marts
• Data Warehouse components:
  – Source systems, Staging area, Presentation area, Access tools

Page 4: Date warehousing concepts

Objectives

Data Warehousing Concepts

• Goals of a Data Warehouse
• Data Warehouse development approaches:
  – Top-down, Bottom-up, Hybrid, Federated
• Incremental approach to warehouse development
• Dimensional Modeling
• Star Schema – Fact and Dimension tables
• Dimensions and Measure objects
• Snowflake Schema
• Types of Fact tables
• Factless Fact table
• OLAP storage modes – MOLAP, ROLAP, HOLAP, DOLAP
• Slowly and Rapidly changing Dimensions – Type I, II, III
• Degenerate Dimension
• Junk Dimension
• CASE-STUDIES

Page 5: Date warehousing concepts

What is Business Intelligence (BI)?

“Business Intelligence (BI) is the process of transforming data into information, information into knowledge and through iterative discoveries turning knowledge into Intelligence.”

– Gartner Group

Page 6: Date warehousing concepts

Objective of Business Intelligence

[Figure: pyramid with the levels Data, Information, Knowledge, and Intelligence, labeled with Volume and Value]

BI can be defined as taking ‘Decisions based on Data’. The objective of BI is to transform large volumes of data into useful information.

Page 7: Date warehousing concepts

Evolution of BI

– Executive Information Systems (EIS)
– Management Information Systems (MIS)
– Decision Support Systems (DSS)
– Business Intelligence (BI)

Page 8: Date warehousing concepts

Information

Information in an organization can exist in two different types of systems:

– Online Transaction Processing (OLTP) systems (Operational Systems)

– Data Warehouse (DWH) systems

OLTP and DWH systems have different purposes, business needs, and users.

Page 9: Date warehousing concepts

Features of OLTP Systems

OLTP systems handle day-to-day transactions and operations of the business. They are high performance, high throughput systems. They run mission critical applications.

OLTP systems store, update and retrieve Operational Data. Operational Data is the data that runs the business.

Some of the Operational systems that we interact with are Net Banking system, Tax Accounting system, Payroll package, Order-processing system, SAP, Airline reservation system etc.

Page 10: Date warehousing concepts

Why OLTP systems are not suitable for analysis?

OLTP | Analytical Reporting
Supports day-to-day operations | Historical information to analyze
Data stored at transaction level | Data required at summary level
Islands of operational systems | Data needs to be integrated
Database design: Normalized | Database design: Dimensional

Page 11: Date warehousing concepts

OLTP Versus Data Warehouse

Property | OLTP | Data Warehouse
Response Time | Sub-seconds to seconds | Seconds to hours
Operations | DML; data goes in | Primarily read-only; data goes out
Age of Data | Current; 30 – 60 days or 1 year – 2 years | Historical; snapshots over time (Quarter, Month, etc.)
Data Organization | Application | Subject, time
Size | Small to large; few MB to GB | Large to very large; few GB to TB

Page 12: Date warehousing concepts

OLTP Versus Data Warehouse

Property | OLTP | Data Warehouse
Data Sources | Operational, Internal | Operational, Internal, External
Activities | Processes | Analysis
No. of records | One record at a time | Thousands to millions of records
Grain | Atomic (Detail), transactional level, highest granularity | Atomic and/or Summarized (aggregate), less granularity
Database Design | Normalized | De-Normalized, Star schema

Page 13: Date warehousing concepts

Data Extract Processing

A logical progression towards a data warehouse – Data Extracts

– End user computing offloaded from the operational environment
– User’s own data

[Figure: operational systems, extracts, decision makers]

Page 14: Date warehousing concepts

Issues with Data Extract Programs

[Figure: "Extract Explosion" between operational systems, extracts, and decision makers]

Page 15: Date warehousing concepts

Data Quality Issues with Extract Processing

– No common time basis
– Different calculation algorithms
– Different levels of extraction
– Different levels of granularity
– Different data field names
– Different data field meanings
– Missing information
– No data correction rules
– No Metadata
– No drill-down capability

Page 16: Date warehousing concepts

Data Warehousing and Business Intelligence

Page 17: Date warehousing concepts

Advances Enabling Data Warehousing

Technology

– Hardware
– Operating system
– Database
– BI Tools & Applications

Business

– Competition

Page 18: Date warehousing concepts

Definition of a Data Warehouse

“A data warehouse is a subject oriented, integrated, non-volatile, and time-variant collection of data to support management decisions.”

– Bill Inmon

Page 19: Date warehousing concepts

Data Warehouse Properties

[Figure: Data Warehouse surrounded by its four properties: Subject-oriented, Integrated, Time-variant, Nonvolatile]

Page 20: Date warehousing concepts

Subject-Oriented

• Data is categorized and stored by business subject rather than by application.

OLTP Applications: Equity Plans, Shares, Insurance, Loans, Savings

Data Warehouse Subject: Customer financial information

Page 21: Date warehousing concepts

Integrated

• Data on a given subject is collected from various sources and stored once.

OLTP Applications: Savings, Current Accounts, Loans

Data Warehouse: Customer

Page 22: Date warehousing concepts

Data Warehouse

Time-Variant

• Data is stored as a series of snapshots, each representing a period of time.

Page 23: Date warehousing concepts

Non-volatile

• Typically data in the data warehouse is not updated or deleted.

Operational: Insert, Update, Delete, or Read

Warehouse: Load, Read

Page 24: Date warehousing concepts

Changing Warehouse Data

[Figure: operational databases and warehouse database; a first-time load, followed by repeated refreshes, with purge or archive]

Page 25: Date warehousing concepts

Goals of a Data Warehouse

• The Data Warehouse must assist in the decision-making process

• The Data Warehouse must meet the requirements of the business community

• The Data Warehouse must provide easy access to information

• The Data Warehouse must present information consistently and accurately

• The Data Warehouse must be adaptive and resilient to change

• The Data Warehouse must provide secure access to information

Page 26: Date warehousing concepts

Usage Curves

– Operational system is predictable

– Data warehouse:
  • Variable
  • Random

Page 27: Date warehousing concepts

User Expectations

– Control expectations
– Set achievable targets for query response
– Set SLAs
– Educate business and end users
– Growth and use is exponential

Page 28: Date warehousing concepts

Enterprisewide Data Warehouse

– Large scale implementation
– Scopes the entire business
– Data from all subject areas
– Developed incrementally
– Single source of enterprisewide data
– Synchronized enterprisewide data
– Single distribution point to dependent data marts

Page 29: Date warehousing concepts

Data Warehouse Vocabulary

– Grain of Data - Granularity

Grain is the level of detail of the data captured in the data warehouse. The more detail, the higher the granularity, and vice versa.

– Fact table

It is similar to a transaction table in an OLTP system. It stores the facts or measures of the business. E.g., SALES, ORDERS

– Dimension table

It is similar to a master table in an OLTP system. It stores the textual descriptors of the business. E.g., CUSTOMER, PRODUCT
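
A minimal sketch of what grain means in practice, assuming a few hypothetical transaction-level sales rows (the dates, products, and amounts are invented for illustration). Rolling the rows up to one row per day and product gives a coarser grain with fewer rows and less detail:

```python
from collections import defaultdict

# Hypothetical transaction-grain rows: (sale_date, product, amount).
transactions = [
    ("2001-03-01", "Widget", 10.0),
    ("2001-03-01", "Widget", 15.0),
    ("2001-03-01", "Gadget", 7.5),
    ("2001-03-02", "Widget", 20.0),
]


def roll_up_to_daily(rows):
    """Summarize transaction-grain rows to one row per (date, product)."""
    totals = defaultdict(float)
    for sale_date, product, amount in rows:
        totals[(sale_date, product)] += amount
    return dict(totals)


daily = roll_up_to_daily(transactions)
for (sale_date, product), amount in sorted(daily.items()):
    print(sale_date, product, amount)
```

Once data is stored only at the daily grain, the individual transactions can no longer be recovered, which is why the grain has to be decided from business needs before loading.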

Page 30: Date warehousing concepts

Data Marts

• A Data mart is a subset of data warehouse.

• A data mart is designed for a single line of business (LOB) or functional area such as sales, finance, or marketing.

Page 31: Date warehousing concepts

Data Warehouses Versus Data Marts

Property | Data Warehouse | Data Mart
Scope | Enterprise | Department
Subjects | Multiple | Single-subject, LOB
Data Source | Many | Few
Implementation time | Months to years | Months
Size | 100 GB to > 1 TB | < 100 GB
Initial effort, cost, risk | Higher | Lower
Next level of migration | Data Mart | Data Warehouse
Approach | Top-Down | Bottom-up

Page 32: Date warehousing concepts

Dependent Data Mart

[Figure: operational systems, flat files, and external data (operations data, legacy data, external data) feed the data warehouse, which feeds the Marketing, Sales, Finance, and HR data marts]

Page 33: Date warehousing concepts

Independent Data Mart

[Figure: operational systems, flat files, and external data (operations data, legacy data, external data) feed a Sales or Marketing data mart directly]

Page 34: Date warehousing concepts

Warehouse Development Approaches

• Top-down approach (Big-Bang)

• Bottom-up approach

• Hybrid approach (Combination)

• Federated approach

Page 35: Date warehousing concepts

Top-Down Approach

Build the Data Warehouse

Build the Data Marts

Page 36: Date warehousing concepts

Top-Down Approach

[Figure: operational systems, flat files, and external data (operations data, legacy data, external data) feed the data warehouse, which feeds the Marketing, Sales, Finance, and HR data marts]

Page 37: Date warehousing concepts

Bottom-Up Approach

Build Data Marts

Build the Data Warehouse

Page 38: Date warehousing concepts

Bottom-Up Approach

[Figure: operational systems and external data (operations data, legacy data) feed the Marketing, Sales, and Finance data marts, together with the data warehouse]

Page 39: Date warehousing concepts

Hybrid Approach

The hybrid approach tries to blend the best of both the “top-down” and “bottom-up” approaches:

– Start by designing the DW and DM models synchronously
– Build out the first 2-3 DMs that are mutually exclusive and critical
– Backfill a DW behind the DMs
– Build the enterprise model and move atomic data to the DW

Page 40: Date warehousing concepts

Federated Approach

This approach is referred to as “an architecture of architectures”.

Emphasizes the need to integrate new and existing heterogeneous BI environments.

Page 41: Date warehousing concepts

Data Warehouse Components

[Figure: Source Systems (Operational, External, Legacy), Staging Area, ODS, Presentation Area (Data Warehouse, Data Marts), Access Tools, and Metadata Repository]

Page 42: Date warehousing concepts

Examining Data Sources

– Production
– Archive
– Internal
– External

Page 43: Date warehousing concepts

Production Data

– Operating system platforms
– File systems
– Database systems
– Vertical applications

Examples: IMS, DB2, Oracle, Sybase, Informix, VSAM, SAP, Dun and Bradstreet Financials, Oracle Financials, Baan, PeopleSoft

Page 44: Date warehousing concepts

Archive Data

– Historical data
– Useful for analysis over long periods of time
– Useful for first-time load

[Figure: operational databases and warehouse database]

Page 45: Date warehousing concepts

Internal Data

– Planning, sales, and marketing organization data
– Maintained in the form of:
  • Spreadsheets (structured)
  • Documents (unstructured)
– Treated like any other source data

[Figure: Planning, Accounting, and Marketing data feed the warehouse database]

Page 46: Date warehousing concepts

External Data

– Information from outside the organization
– Issues of frequency, format, and predictability
– Described and tracked using metadata

Examples: A.C. Nielsen, IRI, IMRB, ORG-MARG; Barron's; Dun and Bradstreet; purchased databases; Wall Street Journal; economic forecasts; competitive information

[Figure: external sources feed the warehousing databases]

Page 47: Date warehousing concepts

Extraction, Transformation and Loading (ETL)

Page 48: Date warehousing concepts

Extraction, Transformation and Loading (ETL)

• “Effective data extract, transform and load (ETL) processes represent the number one success factor for your data warehouse project and can absorb up to 70 percent of the time spent on a typical data warehousing project.”

– DM Review, March 2001

[Figure: Source, Staging Area, Target]

Page 49: Date warehousing concepts

Staging Models

• Remote staging model

• Onsite staging model

Page 50: Date warehousing concepts

Remote Staging Model

– Data staging area within the warehouse environment
– Data staging area in its own independent environment

[Figure: in both variants, data is extracted from the operational system, transformed in the staging area, and loaded into the warehouse]

Page 51: Date warehousing concepts

On-site Staging Model

• Data staging area within the operational environment, possibly affecting the operational system

[Figure: data is extracted from the operational system, transformed in the staging area, and loaded into the warehouse]

Page 52: Date warehousing concepts

Extraction Methods

– Logical Extraction methods:
  • Full Extraction
  • Incremental Extraction
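
The difference between the two logical methods can be sketched as follows. This is a minimal illustration, not a definitive implementation: the `last_modified` column and the sample rows are assumptions, since the slide does not say how changes are detected in the source.

```python
from datetime import datetime

# Hypothetical source rows; last_modified is an assumed change-tracking column.
source_rows = [
    {"id": 1, "last_modified": datetime(2001, 3, 1, 9, 0)},
    {"id": 2, "last_modified": datetime(2001, 3, 2, 14, 30)},
    {"id": 3, "last_modified": datetime(2001, 3, 3, 8, 15)},
]


def full_extract(rows):
    """Full extraction: take every row, regardless of when it changed."""
    return list(rows)


def incremental_extract(rows, last_extract_time):
    """Incremental extraction: take only rows changed since the last run."""
    return [row for row in rows if row["last_modified"] > last_extract_time]


everything = full_extract(source_rows)
changed = incremental_extract(source_rows, datetime(2001, 3, 2, 0, 0))
print(len(everything), [row["id"] for row in changed])
```

Incremental extraction moves less data but depends on the source exposing a reliable way to identify changed rows.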

Page 53: Date warehousing concepts

Extraction Methods

– Physical Extraction methods:
  • Online Extraction
  • Offline Extraction

Page 54: Date warehousing concepts

ETL Techniques

– Programs: C, C++, COBOL, PL/SQL, Java

– Gateways: Transparent Database Access

– Tools:
  • In-house developed tools
  • Vendor’s ETL tools (Ideal technique)

Page 55: Date warehousing concepts

Mapping Data

• Mapping data defines:
  – Which operational attributes to use
  – How to transform the attributes for the warehouse
  – Where the attributes exist in the warehouse

Metadata mapping (File A to Staging File One):
F1 → Number
F2 → Name
F3 → DOB

Example:
File A: F1 = 123, F2 = Bloggs, F3 = 10/12/56
Staging File One: Number = USA123, Name = Mr. Bloggs, DOB = 10-Dec-56
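
The File A to Staging File One mapping can be sketched in a few lines. The transformation rules here are assumptions read off the single example on the slide (a "USA" prefix on the number, a "Mr. " prefix on the name, and a DD/MM/YY source date); the slide does not state the actual rules.

```python
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def map_record(file_a_row):
    """Map one File A record (F1, F2, F3) to the staging layout."""
    # Assumption: the source date is DD/MM/YY, as the example suggests.
    day, month, year = file_a_row["F3"].split("/")
    return {
        "Number": "USA" + file_a_row["F1"],   # assumed country prefix
        "Name": "Mr. " + file_a_row["F2"],    # assumed title prefix
        "DOB": f"{day}-{MONTHS[int(month) - 1]}-{year}",
    }


staged = map_record({"F1": "123", "F2": "Bloggs", "F3": "10/12/56"})
print(staged)
```

In a real warehouse these rules would live in the metadata rather than in code, so that the mapping can be documented and changed in one place.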

Page 56: Date warehousing concepts

Transformation Routines

– Cleaning data
– Eliminating inconsistencies
– Adding elements
– Merging data
– Integrating data
– Transforming data before load

Page 57: Date warehousing concepts

Transforming Data: Problems and Solutions

– Data Anomalies
– Multipart keys
– Multiple local standards
– Multiple files
– Missing values
– Duplicate values
– Element names
– Element meanings
– Input formats
– Referential Integrity constraints
– Name and address

Page 58: Date warehousing concepts

Data Anomalies

– No unique key
– Data naming and coding anomalies
– Data meaning anomalies between groups
– Spelling and text inconsistencies

CUSNUM | NAME | ADDRESS
90233479 | Oracle Limited | 100 N.E. 1st St.
90233489 | Oracle Computing | 15 Main Road, Ft. Lauderdale
90234889 | Oracle Corp. UK | 15 Main Road, Ft. Lauderdale, FLA
90345672 | Oracle Corp UK Ltd | 181 North Street, Key West, FLA

Page 59: Date warehousing concepts

Multipart Keys Problem

• Multipart keys

Product code = 12 M 654313 45

Parts of the key: Country code, Sales territory, Product number, Salesperson code
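
A minimal sketch of splitting a multipart key into separate attributes during transformation. It assumes the four space-separated parts appear in the order the labels are listed (country code, sales territory, product number, salesperson code); the slide does not confirm that order.

```python
from typing import NamedTuple


class ProductCode(NamedTuple):
    country_code: str
    sales_territory: str
    product_number: str
    salesperson_code: str


def split_product_code(code: str) -> ProductCode:
    """Split a space-separated multipart key into its four parts."""
    parts = code.split()
    if len(parts) != 4:
        raise ValueError(f"expected 4 parts, got {len(parts)}: {code!r}")
    return ProductCode(*parts)


parsed = split_product_code("12 M 654313 45")
print(parsed)
```

Storing each part in its own column lets the warehouse group and filter on any one of them without string parsing at query time.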

Page 60: Date warehousing concepts

Multiple Local Standards Problem

– Multiple local standards
– Tools or filters to preprocess

Examples:
Units: cm, inches
Currency: USD 600; 1,000 GBP; FF 9,990
Dates: DD/MM/YY, MM/DD/YY, DD-Mon-YY
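
A preprocessing filter for local standards might look like the sketch below. The source names ("uk", "us") and the choice of centimetres as the warehouse unit are assumptions made for illustration; the point is that each source declares its own format and the filter converts everything to one standard.

```python
from datetime import date, datetime

# Assumed per-source date formats (hypothetical source names).
SOURCE_DATE_FORMATS = {
    "uk": "%d/%m/%y",  # DD/MM/YY
    "us": "%m/%d/%y",  # MM/DD/YY
}
CM_PER_INCH = 2.54


def normalize_date(text: str, source: str) -> date:
    """Parse a date string using the format declared for its source."""
    return datetime.strptime(text, SOURCE_DATE_FORMATS[source]).date()


def to_cm(value: float, unit: str) -> float:
    """Convert a length to centimetres, the assumed warehouse standard."""
    if unit == "cm":
        return value
    if unit == "inches":
        return value * CM_PER_INCH
    raise ValueError(f"unknown unit: {unit}")


uk_date = normalize_date("10/12/01", "uk")
us_date = normalize_date("10/12/01", "us")
print(uk_date.isoformat(), us_date.isoformat(), to_cm(10, "inches"))
```

The same input string yields two different dates depending on the source, which is exactly why the format has to be recorded per source rather than guessed.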

Page 61: Date warehousing concepts

Multiple Source Files Problem

– Added complexity of multiple source files

[Figure: multiple source files pass through logic to detect the correct source, producing transformed data]

Page 62: Date warehousing concepts

Missing Values Problem

• Solution:
  – Ignore
  – Wait
  – Mark rows
  – Extract when time-stamped

Example of marking rows: if NULL, then field = ‘A’
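
The "mark rows" option can be sketched as below: a missing value is replaced with a marker so the row can still be loaded and found later. The marker 'A' follows the slide; the field and row names are hypothetical.

```python
def mark_missing(rows, field, marker="A"):
    """Return copies of rows with a missing (None) field set to a marker."""
    marked = []
    for row in rows:
        fixed = dict(row)  # copy so the source rows are left untouched
        if fixed.get(field) is None:
            fixed[field] = marker
        marked.append(fixed)
    return marked


rows = [
    {"id": 1, "status": "B"},
    {"id": 2, "status": None},
    {"id": 3},
]
marked_rows = mark_missing(rows, "status")
print(marked_rows)
```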

Page 63: Date warehousing concepts

Duplicate Values Problem

• Solution:
  – SQL self-join techniques
  – RDBMS constraint utilities

Example duplicates: ACME Inc, ACME Inc, ACME Inc

    SELECT ...
    FROM   table_a, table_b
    WHERE  table_a.key (+) = table_b.key
    UNION
    SELECT ...
    FROM   table_a, table_b
    WHERE  table_a.key = table_b.key (+);
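
Outside the database, the same duplicate check can be sketched in a few lines. This is an illustration under assumptions rather than the technique on the slide: it treats two rows as duplicates when the chosen fields match after trimming whitespace and ignoring case, and the sample rows are hypothetical.

```python
def find_duplicates(rows, key_fields):
    """Split rows into first occurrences and later duplicates."""
    seen = {}
    duplicates = []
    for row in rows:
        # Assumed matching rule: trim whitespace and ignore case.
        key = tuple(str(row[f]).strip().lower() for f in key_fields)
        if key in seen:
            duplicates.append(row)
        else:
            seen[key] = row
    return list(seen.values()), duplicates


customers = [
    {"id": 1, "name": "ACME Inc"},
    {"id": 2, "name": "ACME Inc"},
    {"id": 3, "name": "acme inc "},
]
unique_rows, duplicate_rows = find_duplicates(customers, ["name"])
print(len(unique_rows), [r["id"] for r in duplicate_rows])
```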

Page 64: Date warehousing concepts

Element Names Problem

• Solution:
  – Common naming conventions

[Figure: element names Customer, Client, Contact, and Name from different sources, alongside the common name Customer]

Page 65: Date warehousing concepts

Element Meaning Problem

– Avoid misinterpretation
– Complex solution
– Document meaning in metadata

Example: p_no could mean product number, purchase order number, or policy number.

Page 66: Date warehousing concepts

Input Format Problem

• Different character sets or data-types

Examples:
ASCII versus EBCDIC
12373 versus “123-73”
ACME Co.
ברוכים הבאים
Beer (Pack of 8)

Page 67: Date warehousing concepts

Referential Integrity Problem

• Solution:
  – SQL anti-join (outer join)
  – Server constraints
  – Dedicated tools

Department
10
20
30
40

Emp | Name | Department
1099 | Smith | 10
1289 | Jones | 20
1234 | Doe | 50
6786 | Harris | 60
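
The anti-join solution can be tried on the sample data above. The sketch below uses Python's built-in `sqlite3` module with an in-memory database; the table and column names are hypothetical, and the query uses standard outer-join syntax rather than any vendor-specific form.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE department (dept_id INTEGER PRIMARY KEY);
    CREATE TABLE employee (emp_id INTEGER, name TEXT, dept_id INTEGER);
    INSERT INTO department VALUES (10), (20), (30), (40);
    INSERT INTO employee VALUES
        (1099, 'Smith', 10), (1289, 'Jones', 20),
        (1234, 'Doe', 50), (6786, 'Harris', 60);
""")

# Anti-join: keep employees whose department has no match in department.
orphans = conn.execute("""
    SELECT e.emp_id, e.name, e.dept_id
    FROM employee AS e
    LEFT OUTER JOIN department AS d ON d.dept_id = e.dept_id
    WHERE d.dept_id IS NULL
    ORDER BY e.emp_id
""").fetchall()
print(orphans)
```

Doe and Harris reference departments 50 and 60, which do not exist, so they are the rows that would violate referential integrity if loaded as they are.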

Page 68: Date warehousing concepts

Name and Address Problem

– Single-field format: Mr. J. Smith, 100 Main St., Bigtown, County Luth, 23565
– Multiple-field format:
  Name: Mr. J. Smith
  Street: 100 Main St.
  Town: Bigtown
  Country: County Luth
  Code: 23565

Database 1
NAME | LOCATION
DIANNE ZIEFELD | N100
HARRY H. ENFIELD | M300

Database 2
NAME | LOCATION
ZIEFELD, DIANNE | 100
ENFIELD, HARRY H | 300

Page 69: Date warehousing concepts

Transformation Timing and Location

– Transformation is performed:
  • Before load
  • In parallel while loading

– Can be initiated at different points:
  • On the operational platform
  • In a separate staging area

Page 70: Date warehousing concepts

Adding a Date Stamp: Fact Tables and Dimensions

Item Table: Item_id, Dept_id, Time_key

Store Table: Store_id, District_id, Time_key

Sales Fact Table: Item_id, Store_id, Time_key, Sales_dollars, Sales_units

Time Table: Week_id, Period_id, Year_id, Time_key

Product Table: Product_id, Time_key, Product_desc

Page 71: Date warehousing concepts

Summarizing Data

1. During extraction on staging area

2. After loading to the warehouse server

[Figure: operational databases, staging area, warehouse database]

Page 72: Date warehousing concepts

Loading Data into the Warehouse

– Loading moves the data into the warehouse
– Loading can be time-consuming:
  • Consider the load window
  • Schedule and automate the loading
– Initial load moves large volumes of data
– Subsequent refresh moves smaller volumes of data

[Figure: data is extracted from the operational databases, transformed in the staging area, then transported and loaded into the warehouse database]

Page 73: Date warehousing concepts

Load Window Requirements

– Time available for entire ETL process
– Plan
– Test
– Prove
– Monitor

[Figure: 24-hour timeline (0, 3 am, 6, 9, 12 pm, 3, 6, 9, 12) showing a user access period with a load window on either side]

Page 74: Date warehousing concepts


Planning the Load Window

– Plan and build processes according to a strategy.
– Consider volumes of data.
– Identify technical infrastructure.
– Ensure currency of data.
– Consider user access requirements first.
– High availability requirements may mean a small load window.

Page 75: Date warehousing concepts

Initial Load and Refresh

• Initial Load:
  – Single event that populates the database with historical data
  – Involves large volumes of data
  – Employs distinct ETL tasks
  – Involves large amounts of processing after load

• Refresh:
  – Performed according to a business cycle
  – Less data to load than first-time load
  – Less complex ETL tasks
  – Smaller amounts of post-load processing

Page 76: Date warehousing concepts

Data Refresh Models

Extract Processing Environment
– After each time interval, build a new snapshot of the database.
– Purge old snapshots.

T1 T2 T3

Operationaldatabases

Page 77: Date warehousing concepts

Data Refresh Models

Warehouse Environment
– Build a new database the first time.
– After each time interval, add delta changes to database.
– Archive or purge oldest data.

T1 T2 T3

Operationaldatabases

Page 78: Date warehousing concepts

Post-Processing of Loaded Data

Post-processing steps: create indexes, generate keys, summarize, filter

[Figure: extract, transform in the staging area, load into the warehouse, then post-processing of loaded data]

Page 79: Date warehousing concepts

Unique Indexes

– Disable constraints before load.
– Enable constraints after load.
– Re-create index if necessary.

[Figure: disable constraints, load data, enable constraints, create index; catch errors and reprocess]

Page 80: Date warehousing concepts

Creating Derived Keys

• The use of a derived key (sometimes referred to as a generalized, artificial, synthetic, surrogate, or warehouse key) is recommended to maintain the uniqueness of a row.

• Method
  – Concatenate key: 109908 → 01109908
  – Assign a number sequentially from a list: 109908 → 100
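
Both methods can be sketched briefly. The meaning of the "01" prefix is an assumption (here it is treated as a source-system code), and the starting value 100 simply follows the example on the slide.

```python
from itertools import count


def concatenated_key(prefix: str, natural_key: str) -> str:
    """Method 1: build the derived key by concatenating a prefix.

    The prefix is assumed to identify the source system.
    """
    return f"{prefix}{natural_key}"


class SequentialKeyGenerator:
    """Method 2: assign the next number in sequence to each new natural key."""

    def __init__(self, start: int = 100):
        self._next_value = count(start)
        self._assigned = {}

    def key_for(self, natural_key: str) -> int:
        # Reuse the key if this natural key has been seen before.
        if natural_key not in self._assigned:
            self._assigned[natural_key] = next(self._next_value)
        return self._assigned[natural_key]


generator = SequentialKeyGenerator()
print(concatenated_key("01", "109908"))
print(generator.key_for("109908"), generator.key_for("200417"))
```

A sequential key carries no meaning of its own, which keeps it stable when the source system's keys are reused or reformatted.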

Page 81: Date warehousing concepts

Metadata repository

Metadata Users

End users

Developers IT Professionals

Page 82: Date warehousing concepts

Metadata Documentation Approaches

– Automated
  • Data modeling tools
  • ETL tools

– Manual

Page 83: Date warehousing concepts

Data Warehouse Design

Dimensional Modeling

I. Identify the ‘Business Process’

II. Determine the ‘Grain’

III. Identify the ‘Facts’

IV. Identify the ‘Dimensions’

Page 84: Date warehousing concepts

Business Requirements Drive the Design Process

– Primary input: business requirements

– Secondary input: existing metadata, production ERD model, research

Page 85: Date warehousing concepts

Perform Strategic Analysis

– Identify crucial business processes
– Understand business processes
– Prioritize and select the business processes to implement

[Figure: grid plotting Business Benefit (low to high) against Feasibility (low to high)]

Page 86: Date warehousing concepts

Using a Business Process Matrix

DW Bus Architecture

Business Processes: Sales, Returns, Inventory

Business Dimensions: Customer, Date, Product, Channel, Promotion

Page 87: Date warehousing concepts

Conformed Dimensions

• Dimensions are conformed when they are exactly the same including the keys or one is a perfect subset of the other.

• DW bus architecture provides a standard set of conformed dimensions

Page 88: Date warehousing concepts

Determine the Grain

YEAR?

QUARTER?

MONTH?

WEEK?

DAY?

Page 89: Date warehousing concepts


Documenting the Granularity

• Is an important design consideration

• Determines the level of detail

• Is determined by business needs

Low-level grain (Transaction-level data)

High-level grain (Summary data)

Page 90: Date warehousing concepts

Defining Time Granularity

Fiscal Time Hierarchy

Current dimension grain

Fiscal Year

Fiscal Quarter

Fiscal Month

Fiscal Week

Day (future dimension grain)

Page 91: Date warehousing concepts

Identify the Facts and Dimensions

• Dimensions: the attribute is perceived as constant or discrete:
  – Product
  – Location
  – Time
  – Size

• Facts (Measures): the attribute varies continuously:
  – Balance
  – Units Sold
  – Cost
  – Sales

Page 92: Date warehousing concepts

Data Warehouse Environment Data Structures

The data structures that are commonly found in a data warehouse environment:

– Third normal form (3NF)
– Star schema
– Snowflake schema

Page 93: Date warehousing concepts

Star Schema

[Figure: Sales at the center, surrounded by Customer, Location, Supplier, and Product]

Page 94: Date warehousing concepts

Star Schema Model

Central fact table:
Sales Fact Table: Product_id, Store_id, Item_id, Day_id, Sales_amount, Sales_units, ...

Denormalized dimensions:
Product Table: Product_id, Product_desc, ...
Time Table: Day_id, Month_id, Year_id, ...
Item Table: Item_id, Item_desc, ...
Store Table: Store_id, District_id, ...
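
A star schema query joins the central fact table to each dimension it needs and then aggregates. The sketch below builds a cut-down version of the model in an in-memory SQLite database (only the Product and Store dimensions, with invented rows) to show the shape of such a query; the table names and data are illustrative assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product (product_id INTEGER PRIMARY KEY, product_desc TEXT);
    CREATE TABLE store (store_id INTEGER PRIMARY KEY, district_id INTEGER);
    CREATE TABLE sales_fact (
        product_id   INTEGER REFERENCES product(product_id),
        store_id     INTEGER REFERENCES store(store_id),
        sales_amount REAL,
        sales_units  INTEGER
    );
    INSERT INTO product VALUES (1, 'Widget'), (2, 'Gadget');
    INSERT INTO store VALUES (10, 100), (11, 100), (12, 200);
    INSERT INTO sales_fact VALUES
        (1, 10, 50.0, 5), (1, 11, 30.0, 3),
        (2, 12, 20.0, 1), (1, 12, 10.0, 1);
""")

# One join per dimension, then aggregate the additive measures.
rows = conn.execute("""
    SELECT p.product_desc, s.district_id,
           SUM(f.sales_amount), SUM(f.sales_units)
    FROM sales_fact AS f
    JOIN product AS p ON p.product_id = f.product_id
    JOIN store   AS s ON s.store_id   = f.store_id
    GROUP BY p.product_desc, s.district_id
    ORDER BY p.product_desc, s.district_id
""").fetchall()
for row in rows:
    print(row)
```

Because every dimension is one join away from the fact table, adding another dimension to the analysis adds exactly one more join.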

Page 95: Date warehousing concepts

Fact Table Characteristics

– Contain numerical metrics of the business
– Can hold large volumes of data
– Can grow quickly
– Can contain base, derived, and summarized data
– Are typically additive
– Are joined to dimension tables through foreign keys that reference primary keys in the dimension tables

Sales Fact Table: Product_id, Store_id, Item_id, Day_id, Sales_amount, Sales_units, ...

Page 96: Date warehousing concepts

Dimension Table Characteristics

– Contain descriptors of the business / textual information that represents the attributes of the business
– Contain relatively static data
– Are usually smaller than fact tables
– Are joined to a fact table through a foreign key reference

Item Table: Item_id, Item_desc, ...

Page 97: Date warehousing concepts

Advantages of Using a Star Dimensional Model

– Design improves performance by reducing table joins.

– The model is easy for users to understand.
– Supports multidimensional analysis.
– Provides an extensible design.

– Primary keys represent a dimension.

– Non-foreign key columns are values.

– Facts are usually highly normalized.

– Dimensions are completely de-normalized.

– End users can express complex queries.

Page 98: Date warehousing concepts

Base and Derived Data

Payroll table

Emp_FK | Month_FK | Salary | Comm | Comp
101 | 05 | 1,000 | 0 | 1,000
102 | 05 | 1,500 | 100 | 1,600
103 | 05 | 1,000 | 200 | 1,200
104 | 05 | 1,500 | 1,000 | 2,500

Base data: Salary, Comm
Derived data: Comp
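
In the payroll table, Comp in every row equals Salary plus Comm, so it can be derived at load time rather than extracted. A minimal sketch using the figures from the table (the lowercase field names are illustrative):

```python
# Base data from the payroll table.
payroll = [
    {"emp_fk": 101, "month_fk": 5, "salary": 1000, "comm": 0},
    {"emp_fk": 102, "month_fk": 5, "salary": 1500, "comm": 100},
    {"emp_fk": 103, "month_fk": 5, "salary": 1000, "comm": 200},
    {"emp_fk": 104, "month_fk": 5, "salary": 1500, "comm": 1000},
]


def add_compensation(rows):
    """Add the derived column comp = salary + comm to each row."""
    return [{**row, "comp": row["salary"] + row["comm"]} for row in rows]


with_comp = add_compensation(payroll)
print([row["comp"] for row in with_comp])
```

Storing the derived value trades a little space for simpler and faster queries, since users no longer need to know the formula.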

Page 99: Date warehousing concepts

Translating Business Measures into a Fact Table

Business Measure | Fact | Type
Number of Items | Number of Items | Base
Amount | Item Amount | Base
Cost | Item Cost | Base
Profit | Profit | Derived

Page 100: Date warehousing concepts

Snowflake Schema Model

Sales Fact Table: Item_id, Store_id, Product_id, Week_id, Sales_amount, Sales_units

Time Table: Week_id, Period_id, Year_id

Product Table: Product_id, Product_desc

Item Table: Item_id, Item_desc, Dept_id

Dept Table: Dept_id, Dept_desc, Mgr_id

Mgr Table: Dept_id, Mgr_id, Mgr_name

Store Table: Store_id, Store_desc, District_id

District Table: District_id, District_desc

Page 101: Date warehousing concepts


Snowflake Model

[Figure: snowflake model with the tables Order, History, Customer, Product, Channel, and Web. Primary keys shown: History_PK, Customer_PK, Product_PK, Channel_PK, Web_PK. Foreign keys shown: History_FK, Customer_FK, Product_FK, Channel_FK. Other columns shown: Item_nbr, Item_desc, Quantity, Discnt_price, Unit-price, Order_amt, Channel_desc, Web_url]

Page 102: Date warehousing concepts

Snowflake Schema Model

– Provides for speedier data loading
– Can become large and unmanageable
– Degrades query performance
– More complex metadata
– Facts are usually highly normalized
– Dimensions are also normalized

Example of a normalized dimension: Country, State, County, City

Page 103: Date warehousing concepts

Constellation Configuration

Atomic fact

Page 104: Date warehousing concepts

Fact Table Measures

– Additive: Added across all dimensions
– Semiadditive: Added along some dimensions
– Nonadditive: Cannot be added along any dimension
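
An account balance is a common example of a semiadditive measure: it can be summed across accounts for one point in time, but summing it across time periods gives a meaningless number. The sketch below uses invented month-end balances to show both cases; averaging over time is one usual alternative, chosen here for illustration.

```python
# Hypothetical month-end balances: (month, account, balance).
balances = [
    ("2001-01", "A", 100),
    ("2001-01", "B", 50),
    ("2001-02", "A", 120),
    ("2001-02", "B", 70),
]


def total_for_month(rows, month):
    """Valid: sum balances across accounts for a single month."""
    return sum(balance for m, _, balance in rows if m == month)


def average_over_time(rows, account):
    """Across time, average the balance instead of summing it."""
    values = [balance for _, a, balance in rows if a == account]
    return sum(values) / len(values)


print(total_for_month(balances, "2001-02"))
print(average_over_time(balances, "A"))
```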

Page 105: Date warehousing concepts


More on Factless Fact Tables

Factless fact table: Emp_FK, Sal_FK, Age_FK, Ed_FK, Grade_FK

Employee dimension: Emp_PK
Salary dimension: Sal_PK
Age dimension: Age_PK
Education dimension: Ed_PK
Grade dimension: Grade_PK

PK = Primary Key & FK = Foreign Key

Page 106: Date warehousing concepts

Factless Fact Tables

– Event tracking

– Coverage

Page 107: Date warehousing concepts


Bracketed Dimensions

– Enhance performance and analytical capabilities

– Create groups of values for attributes with many unique values, such as income ranges and age brackets

– Minimize the need for full table scans by pre-aggregating data

Page 108: Date warehousing concepts


Bracketing Dimensions

[Figure: Income fact (Customer_PK, Bracket_FK) linked to the Customer dimension (Customer_PK, Bracket_FK) and the Bracket dimension (Bracket_PK)]

Bracket_PK | Income (10Ks) | Marital Status | Gender | Age
1 | 60-90 | Single | Male | <21
2 | 60-90 | Single | Male | 21-35
3 | 60-90 | Single | Male | 35-55
4 | 60-90 | Single | Male | >55
5 | 60-90 | Single | Female | <21
6 | 60-90 | Single | Female | 21-35
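
Assigning a bracket is a simple range lookup at load time. The sketch below covers only the age attribute, using the bracket labels from the table. The labels overlap at 35, so the boundary rule (35 falls in "21-35") is an assumption.

```python
def age_bracket(age: int) -> str:
    """Map an exact age to one of the bracket labels from the table.

    Assumption: an age of exactly 35 belongs to "21-35", and an age of
    exactly 55 belongs to "35-55".
    """
    if age < 21:
        return "<21"
    if age <= 35:
        return "21-35"
    if age <= 55:
        return "35-55"
    return ">55"


for age in (18, 21, 35, 36, 60):
    print(age, age_bracket(age))
```

With brackets stored in the dimension, a query can group on a handful of labels instead of scanning many distinct ages.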

Page 109: Date warehousing concepts


Identifying Analytical Hierarchies

Store dimension

Store ID, Store Desc, Location, Size, Type, District ID, District Desc, Region ID, Region Desc

Business hierarchies describe organizational structure and logical parent-child relationships within the data.

Region

District

Store

Organization hierarchy

Page 110: Date warehousing concepts


Multiple Hierarchies

Store ID, Store Desc, Location, Size, Type, District ID, District Desc, Region ID, Region Desc, City ID, City Desc, County ID, County Desc, State ID, State Desc

Region

District

Store

Organization hierarchy

Store dimension

Region

District

Store

Geography hierarchy

Page 111: Date warehousing concepts


Multiple Time Hierarchies

Fiscal time hierarchy: Fiscal year, Fiscal quarter, Fiscal month, Fiscal week

Calendar time hierarchy: Calendar year, Calendar quarter, Calendar month, Calendar week

Page 112: Date warehousing concepts


Drilling Up and Drilling Down

[Figure: market hierarchy with the levels Group, Region (Region 1, Region 2), District (District 1 to District 4), and Store (Store 1 to Store 6)]

Page 113: Date warehousing concepts

Drilling Across

[Figure: market hierarchy (Group, Region, District, Store) and city hierarchy (City, Store), linked through stores > 20,000 sq. ft.]

Page 114: Date warehousing concepts

Using Time in the Data Warehouse

– Defining standards for time is critical.

– Aggregation based on time is complex.

– Time is critical to the data warehouse. A consistent representation of time is required for extensibility.

Where should the element of time be stored?

Time dimension

Sales fact

Page 115: Date warehousing concepts

Date Dimension

– Should Date Dimension be modeled?

Page 116: Date warehousing concepts

Applying the Changes to Data

• You have a choice of techniques:
  – Overwrite a record
  – Add a record
  – Add a field
  – Maintain history
  – Add version numbers

Page 117: Date warehousing concepts

OLAP Models

– Relational (ROLAP)

– Multidimensional (MOLAP)

– Hybrid (HOLAP)

– Desktop (DOLAP)

Page 118: Date warehousing concepts

Slowly Changing Dimensions (SCDs)

What is a SCD?

It is a dimension whose attribute data changes, and therefore needs to be updated, slowly over time.

There are 3 standard ways outlined by Kimball (and others) to handle this situation:
– Type-I
– Type-II
– Type-III

Page 119: Date warehousing concepts

Type I - Overwriting a Record

– Easy to implement
– Loses all history
– Not recommended

Before: 42135 | John Doe | Single
After: 42135 | John Doe | Married

Page 120: Date warehousing concepts

Type II - Adding a New Record

– History is preserved; dimensions grow.
– Generalized key is created.

Existing record: 42135 | John Doe | Single
New record: 42135_01 | John Doe | Married
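
A Type II change can be sketched as appending a new row instead of updating the old one. The key format (natural key plus a two-digit suffix) follows the example above; the field names and helper function are hypothetical.

```python
def apply_type2_change(dimension_rows, natural_key, changes):
    """Return a new row list with an added version for natural_key.

    The existing rows are kept unchanged, so history is preserved.
    """
    versions = [r for r in dimension_rows if r["natural_key"] == natural_key]
    if not versions:
        raise KeyError(f"no existing row for {natural_key}")
    latest = versions[-1]
    new_row = {
        **latest,
        **changes,
        # Generalized key: natural key plus a version suffix.
        "key": f"{natural_key}_{len(versions):02d}",
    }
    return dimension_rows + [new_row]


customers = [
    {"key": "42135", "natural_key": "42135",
     "name": "John Doe", "status": "Single"},
]
updated = apply_type2_change(customers, "42135", {"status": "Married"})
for row in updated:
    print(row["key"], row["name"], row["status"])
```

Facts loaded before the change keep pointing at key 42135, and facts loaded afterwards point at 42135_01, so each fact is reported against the status that applied at the time.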

Page 121: Date warehousing concepts

Type III - Adding a Current Field

– Maintains some history
– Loses intermediate values
– Is enhanced by adding an Effective Date field

Before: 42135 | John Doe | Single
After: 42135 | John Doe | Single | Married | 1-Jan-01

Page 122: Date warehousing concepts

Maintain History

History tables:
– One-to-many relationships
– One current record and many history records

Product

Time

Sales

HIST_CUST

CUSTOMER

Page 123: Date warehousing concepts

Versioning

– Avoid double counting
– Facts hold version number

Time

Product

Customer

Customer.CustId | Version | Customer Name
1234 | 1 | Comer
1234 | 2 | Comer

Sales.CustId | Version | Sales Facts
1234 | 1 | $11,000
1234 | 2 | $12,000

Sales

Page 124: Date warehousing concepts

Rapidly Changing Dimensions (RCDs)

It is a dimension whose attribute data changes, and therefore needs to be updated, rapidly over time.

It is also referred to as a rapidly changing monster dimension.

To handle it, create a separate dimension, referred to as a mini dimension.

Mini Dimension

Demographics Key | Age | Children | Income
1 | 20-24 | 0 | <20000
2 | 20-24 | 1-2 | 20000 – 30000
3 | 20-24 | >2 | >30000
4 | 25-30 | 0 | <20000
5 | 25-30 | 1-2 | 20000 – 30000
... | ... | ... | ...

Page 125: Date warehousing concepts

Junk Dimension

A junk dimension is an abstract dimension with the decodes for a group of low cardinality flags and indicators, thereby removing them from the fact table.

Junk Dimension

Junk Key | Payment Type | Order Type | Order Mode
1 | Cash | Normal | Web
2 | Cash | Urgent | Web
3 | Credit | Normal | Fax
4 | Credit | Urgent | Fax
... | ... | ... | ...
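
One common way to populate a junk dimension is to generate every combination of the flag values up front and give each a key. This is a sketch under assumptions: the value lists are taken from the sample rows, and the resulting key numbering is illustrative and does not match the sample table, which shows only some combinations.

```python
from itertools import product

# Low-cardinality flag values taken from the sample rows.
PAYMENT_TYPES = ["Cash", "Credit"]
ORDER_TYPES = ["Normal", "Urgent"]
ORDER_MODES = ["Web", "Fax"]


def build_junk_dimension():
    """Map every (payment type, order type, order mode) combination to a key."""
    combinations = product(PAYMENT_TYPES, ORDER_TYPES, ORDER_MODES)
    return {combo: key for key, combo in enumerate(combinations, start=1)}


junk = build_junk_dimension()
# The fact row stores one junk key instead of three separate flag columns.
print(len(junk), junk[("Credit", "Urgent", "Fax")])
```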

Page 126: Date warehousing concepts

Secret of Success

Think big, start small!

Page 127: Date warehousing concepts

References

Useful web sites:

http://www.dmreview.com
http://www.rkimball.com
http://www.billinmon.com
http://www.dmforum.org
http://www.freedatawarehouse.com

Page 128: Date warehousing concepts

Thank-you