An Introduction to Data Warehousing


Transcript of An Introduction to Data Warehousing

Page 1: An Introduction  to  Data Warehousing

An Introduction to

Data Warehousing

1

Page 2: An Introduction  to  Data Warehousing

Business Intelligence

Now, if the estimates made before a battle indicate victory, it is because careful calculations show that your conditions are more favorable than those of your enemy; if they indicate defeat, it is because careful calculations show that the favorable conditions for a battle are fewer. With more careful calculations one can win; with less one cannot. How much chance of victory has one who makes no calculations at all!

- Sun Tzu, The Art of War

Business these days is war minus shooting.

- Anonymous

Page 3: An Introduction  to  Data Warehousing

Course Roadmap

• Introduction to Data Warehousing
• Difference between an Operational System and a Data Warehouse
• Emergence of Decision Support Systems
• Data Warehouse Theoretical Architecture
• Data Warehouse Technical Architecture
• Data Warehouse Bus Architecture
• Data Modelling concepts
• E-R Modelling for an OLTP System
• Dimensional Modelling for a Data Warehouse
• Schema generation for a Data Warehouse
• Star Schema Design
• Snowflake Schema Design
• Key aspects in designing the Dimensional Model
• Granularity with respect to the Fact Table in the Schemas
• Conformed Facts and Dimensions

Page 4: An Introduction  to  Data Warehousing

Course Roadmap

• Factless Fact Tables, Aggregate Fact Tables
• Outrigger Entities in the Schemas
• Types of Relationships to be maintained between Facts and Dimensions
• Dependencies while generating the Physical Schema for a Data Warehouse
• Case Study: design of a Data Warehouse for an existing ER model

Page 5: An Introduction  to  Data Warehousing

Objectives

At the end of this session, you will know:

– What Data Warehousing is
– The evolution of Data Warehousing
– Need for Data Warehousing
– OLTP Vs Warehouse Applications
– Data Marts Vs Data Warehouses
– Operational Data Stores
– Overview of Warehouse Architecture

Page 6: An Introduction  to  Data Warehousing

Objectives

At the end of this lesson, you will know:

– Data Warehouse Architectures
– Components of Data Warehousing Architecture
– An overview of each of the components
– Considerations for Data Warehouse Design
– Common mistakes in Warehouse designs
– An overview of Warehouse on the web

Page 7: An Introduction  to  Data Warehousing

What is a DataWarehouse ?

Page 8: An Introduction  to  Data Warehousing

What is a Data Warehouse ?

A data warehouse is a subject-oriented, integrated, nonvolatile, time-variant collection of data in support of management's decisions.

- W. H. Inmon

W. H. Inmon is regarded as the father of data warehousing.

Page 9: An Introduction  to  Data Warehousing

Subject-Oriented - Characteristics of a Data Warehouse

Operational: Quotes, Orders, Prospects, Leads

Data Warehouse: Customers, Products, Regions, Time

Focus is on Subject Areas rather than Applications

Page 10: An Introduction  to  Data Warehousing

Integrated - Characteristics of a Data Warehouse

Attribute | Appl A | Appl B | Appl C | Data Warehouse
Gender encoding | m,f | 1,0 | male,female | m,f
Balance format | balance dec fixed (13,2) | balance pic 9(9)V99 | balance pic S9(7)V99 comp-3 | balance dec fixed (13,2)
Field name | bal-on-hand | current-balance | cash-on-hand | Current balance
Date format | date (julian) | date (yymmdd) | date (absolute) | date (julian)

Integrated View Is The Essence Of A Data Warehouse
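
The integration idea on this slide can be sketched in Python. This is only an illustrative sketch, not the behaviour of any particular tool: the names `GENDER_MAPS`, `BALANCE_FIELDS` and `integrate` are hypothetical, and the slide does not say which of Appl B's codes means male, so the mapping 1 = male, 0 = female is an assumption. Date conversion is left out because the slide does not define the source calendars precisely.

```python
from decimal import Decimal

# Per-application gender codes mapped onto the warehouse convention (m, f).
# Assumption: in Appl B, 1 = male and 0 = female; the slide does not say.
GENDER_MAPS = {
    "A": {"m": "m", "f": "f"},
    "B": {"1": "m", "0": "f"},
    "C": {"male": "m", "female": "f"},
}

# Per-application name of the balance field, unified as "current_balance".
BALANCE_FIELDS = {
    "A": "bal-on-hand",
    "B": "current-balance",
    "C": "cash-on-hand",
}


def integrate(app: str, record: dict) -> dict:
    """Convert one source record into the warehouse's common representation."""
    gender = GENDER_MAPS[app][str(record["gender"]).lower()]
    # Going through str() avoids binary floating-point artifacts;
    # quantize() forces two decimal places, as in dec fixed (13,2).
    balance = Decimal(str(record[BALANCE_FIELDS[app]])).quantize(Decimal("0.01"))
    return {"gender": gender, "current_balance": balance}


rows = [
    integrate("A", {"gender": "f", "bal-on-hand": "120.50"}),
    integrate("B", {"gender": 1, "current-balance": "75"}),
    integrate("C", {"gender": "Female", "cash-on-hand": "9.999"}),
]
```

After this step every row carries the same field names and encodings regardless of which application it came from, which is what allows a single query to span all three sources.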

Page 11: An Introduction  to  Data Warehousing

Non-volatile - Characteristics of a Data Warehouse

Operational: insert, change, replace, delete

Data Warehouse: load, read only access

Data Warehouse Is Relatively Static In Nature

Page 12: An Introduction  to  Data Warehousing

Time Variant - Characteristics of a Data Warehouse

Operational: Current Value data
• time horizon: 60-90 days

Data Warehouse: Snapshot data
• time horizon: 5-10 years
• data warehouse stores historical data

Data Warehouse Typically Spans Across Time
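
The contrast between current-value data and snapshot data can be sketched as follows. This is a minimal illustration under assumed names (`operational`, `warehouse`, `update_balance`, `load_snapshot`, `balance_as_of` and the account id are all hypothetical): the operational store overwrites a value in place, while the warehouse only appends dated copies, so earlier states remain queryable.

```python
from datetime import date

# Operational store: one current value per account, overwritten in place.
operational = {}

# Warehouse: append-only list of (snapshot_date, account, balance) rows.
warehouse = []


def update_balance(account: str, balance: float) -> None:
    """Operational update: the previous value is lost."""
    operational[account] = balance


def load_snapshot(snapshot_date: date) -> None:
    """Periodic load: copy the current values into the warehouse with a date."""
    for account, balance in operational.items():
        warehouse.append((snapshot_date, account, balance))


def balance_as_of(account: str, as_of: date):
    """Return the latest snapshot balance on or before as_of, or None."""
    history = [(d, b) for d, a, b in warehouse if a == account and d <= as_of]
    if not history:
        return None
    return max(history, key=lambda row: row[0])[1]


update_balance("ACC-1", 100.0)
load_snapshot(date(2024, 1, 31))
update_balance("ACC-1", 250.0)
load_snapshot(date(2024, 2, 29))
```

The operational store now knows only the latest balance, whereas `balance_as_of` can still answer what the balance was in mid-February.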

Page 13: An Introduction  to  Data Warehousing

Alternate Definitions

A collection of integrated, subject-oriented databases designed to support the DSS function, where each unit of data is relevant to some moment of time.

- Imhoff

Page 14: An Introduction  to  Data Warehousing

Alternate Definitions

A Data Warehouse is a repository of data summarized or aggregated in simplified form from operational systems. End-user-oriented data access and reporting tools let users get at the data for decision support.

- Babcock

Page 15: An Introduction  to  Data Warehousing

Evolution of Data Warehousing

1960 - 1985 : MIS Era

Focus on Reporting

• Unfriendly

• Slow

• Dependent on IS programmers

• Inflexible

• Analysis limited to defined reports

Page 16: An Introduction  to  Data Warehousing

Evolution of Data Warehousing

1985 - 1990 : Querying Era

Focus on Online Querying

• Ad hoc, unstructured access to corporate data

• SQL as an interface is not scalable

• Cannot handle complex analysis

Ad hoc queries: queries that are formulated by the user on the spur of the moment

Page 17: An Introduction  to  Data Warehousing

Evolution of Data Warehousing

1990 - 20xx : Analysis Era

Focus on Online Analysis

• Trend Analysis

• What If ?

• Cross Dimensional Comparisons

• Statistical profiles

• Automated pattern and rule discovery

Page 18: An Introduction  to  Data Warehousing

Need for Data Warehousing

Better business intelligence for end-users

Reduction in time to locate, access, and analyze information

Consolidation of disparate information sources

Strategic advantage over competitors

Faster time-to-market for products and services

Replacement of older, less-responsive decision support systems

Reduction in demand on IS to generate reports

Page 19: An Introduction  to  Data Warehousing

Typical Business Queries

Which product generated maximum revenue over the last two quarters in a chosen geographical region, city-wise, relative to the previous version of the product, compared with the plan?

What percentage of customers procure product A with B in a chosen region, broken down by city, season, and income group?
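
A query of the second kind can be sketched with Python's built-in `sqlite3` module. Everything here is an assumption made for illustration: the `sales` table, its columns, and the sample rows are invented, and "A with B" is read as "of the customers who bought A, the share who also bought B". Season and income group are omitted to keep the sketch short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (customer TEXT, product TEXT, region TEXT, city TEXT);
INSERT INTO sales VALUES
  ('c1', 'A', 'South', 'Chennai'),
  ('c1', 'B', 'South', 'Chennai'),
  ('c2', 'A', 'South', 'Chennai'),
  ('c3', 'A', 'South', 'Kochi'),
  ('c3', 'B', 'South', 'Kochi'),
  ('c4', 'A', 'North', 'Delhi');
""")

# Step 1: one row per customer and city with 0/1 flags for each product.
# Step 2: per city, divide buyers of both products by buyers of A.
QUERY = """
WITH per_customer AS (
    SELECT city, customer,
           MAX(product = 'A') AS bought_a,
           MAX(product = 'B') AS bought_b
    FROM sales
    WHERE region = ?
    GROUP BY city, customer
)
SELECT city,
       100.0 * SUM(bought_a AND bought_b) / SUM(bought_a) AS pct_a_with_b
FROM per_customer
GROUP BY city
HAVING SUM(bought_a) > 0
ORDER BY city
"""

result = conn.execute(QUERY, ("South",)).fetchall()
print(result)  # [('Chennai', 50.0), ('Kochi', 100.0)]
```

Queries like this scan and aggregate many rows at once, which is the access pattern a warehouse is designed for and a transaction system is not.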

Page 20: An Introduction  to  Data Warehousing

OLTP Systems Vs Data Warehouse

Remember

Between OLTP and Data Warehouse systems:

• users are different

• data content is different

• data structures are different

• hardware is different

Understanding The Differences Is The Key

Page 21: An Introduction  to  Data Warehousing

OLTP Vs Warehouse

Operational System | Data Warehouse
Transaction Processing | Query Processing
Predictable CPU Usage | Random CPU Usage
Time Sensitive | History Oriented
Operator View | Managerial View
Normalized, Efficient Design for TP | Denormalized Design for Query Processing

Page 22: An Introduction  to  Data Warehousing

OLTP Vs Warehouse

Operational System | Data Warehouse
Designed for Atomicity, Consistency, Isolation and Durability | Designed for a quiet or static database
Organized by transactions (Order, Input, Inventory) | Organized by subject (Customer, Product)
Relatively smaller database | Large database size
Many concurrent users | Relatively few concurrent users
Volatile Data | Non-Volatile Data

Page 23: An Introduction  to  Data Warehousing

OLTP Vs Warehouse

Operational System | Data Warehouse
Stores all data | Stores relevant data
Performance Sensitive | Less Sensitive to performance
Not Flexible | Flexible
Efficiency | Effectiveness

Page 24: An Introduction  to  Data Warehousing

Capacity Planning

[Figure: Processing Power plotted against Time of day]

Processing Load Peaks During the Beginning and End of Day

Page 25: An Introduction  to  Data Warehousing

Examples Of Some Applications

Target Marketing

Market Segmentation

Budgeting

Credit Rating Agencies

Financial Reporting and Consolidation

Market Basket Analysis - POS Analysis

Fraud Management

Profitability Management

Event tracking

[Figure labels: Manufacturers, Customers, Retailers]

Page 26: An Introduction  to  Data Warehousing

Do we need a separate database ?

OLTP and data warehousing require two very differently configured systems

Isolation of Production System from Business Intelligence System

Significant and highly variable resource demands of the data warehouse

Cost of disk space no longer a concern

Production systems not designed for query processing

Page 27: An Introduction  to  Data Warehousing

Data Marts

Enterprise-wide data warehousing projects have a very long cycle time

Getting consensus between multiple parties may also be difficult

Departments may not be satisfied with the priority accorded to them

Sometimes individual departmental needs may be strong enough to warrant a local implementation

Application/database distribution is also an important factor

Page 28: An Introduction  to  Data Warehousing

Data Marts

Subject or Application Oriented Business View of Warehouse

» Finance, Manufacturing, Sales etc.

» Smaller amount of data used for Analytic Processing

» Address a single business process

A Logical Subset of The Complete Data Warehouse

Page 29: An Introduction  to  Data Warehousing

Data Warehouse and Data Mart

Characteristic | Data Warehouse | Data Marts
Scope | Application neutral; centralized, shared; cross LOB/enterprise | Specific application requirement; LOB, department; business process oriented
Data Perspective | Historical, detailed data; some summary | Detailed (some history); summarized
Subjects | Multiple subject areas | Single partial subject; multiple partial subjects; OLTP snapshots

Page 30: An Introduction  to  Data Warehousing

Data Warehouse and Data Mart

Characteristic | Data Warehouse | Data Marts
Data Sources | Many; operational/external data | Few; operational, external data; OLTP snapshots
Implementation Time Frame | 9-18 months for first stage; multiple stage implementation | 4-12 months
Characteristics | Flexible, extensible; durable/strategic; data orientation | Restrictive, nonextensible; short life/tactical; project orientation

Page 31: An Introduction  to  Data Warehousing

Warehouse or Mart First ?

Data Warehouse First | Data Mart First
Expensive | Relatively cheap
Long development cycle | Delivered in < 6 months
Change management is difficult | Easy to manage change
Difficult to obtain continuous corporate support | Can lead to independent and incompatible marts
Technical challenges in building large databases | Cleansing, transformation, modeling techniques may be incompatible

Page 32: An Introduction  to  Data Warehousing

Different kinds of Information Needs

Current: Is this medicine available in stock?

Recent: What are the tests this patient has completed so far?

Historical: Has the incidence of tuberculosis increased in the last 5 years in the Southern region?

Page 33: An Introduction  to  Data Warehousing

Operational Data Store - Definition

A subject-oriented, integrated, volatile, current-valued data store containing only corporate detailed data

• Data from multiple sources is integrated for a subject

• Example: Can I see the credit report from Accounts, sales from Marketing and the open order report from Order Entry for this customer?

• Identical queries may give different results at different times. Supports analysis requiring current data

• Data is stored only for the current period. Old data is either archived or moved to the Data Warehouse

Page 34: An Introduction  to  Data Warehousing

Operational Data Store

Increasingly becoming integrated with the data warehouse

Are nothing but more responsive real time data warehouses

Data Mining has anyway forced Data Warehouses to store transactional level data

Page 35: An Introduction  to  Data Warehousing

OLTP Vs ODS Vs DWH

Characteristic | OLTP | ODS | Data Warehouse
Audience | Operating personnel | Analysts | Managers and analysts
Data access | Individual records, transaction driven | Individual records, transaction or analysis driven | Set of records, analysis driven
Data content | Current, real-time | Current and near-current | Historical
Data granularity | Detailed | Detailed and lightly summarized | Summarized and derived
Data organization | Functional | Subject-oriented | Subject-oriented
Data quality | All application-specific detailed data needed to support a business activity | All integrated data needed to support a business activity | Data relevant to management information needs

Page 36: An Introduction  to  Data Warehousing

OLTP Vs ODS Vs DWH

Characteristic | OLTP | ODS | Data Warehouse
Data redundancy | Non-redundant within system; unmanaged redundancy among systems | Somewhat redundant with operational databases | Managed redundancy
Data stability | Dynamic | Somewhat dynamic | Static
Data update | Field by field | Field by field | Controlled batch
Data usage | Highly structured, repetitive | Somewhat structured, some analytical | Highly unstructured, heuristic or analytical
Database size | Moderate | Moderate | Large to very large
Database structure stability | Stable | Somewhat stable | Dynamic

Page 37: An Introduction  to  Data Warehousing

OLTP Vs ODS Vs DWH

Characteristic | OLTP | ODS | Data Warehouse
Development methodology | Requirements driven, structured | Data driven, somewhat evolutionary | Data driven, evolutionary
Operational priorities | Performance and availability | Availability | Access flexibility and end user autonomy
Philosophy | Support day-to-day operation | Support day-to-day decisions & operational activities | Support managing the enterprise
Predictability | Stable | Mostly stable, some unpredictability | Unpredictable
Response time | Sub-second | Seconds to minutes | Seconds to minutes
Return set | Small amount of data | Small to medium amount of data | Small to large amount of data

Page 38: An Introduction  to  Data Warehousing

Typical Data Warehouse Architecture

[Figure: Multi-tiered Data Warehouse without ODS. Components shown: Operational Systems/Data; Data Preparation (Select, Extract, Transform, Integrate, Maintain); Data Warehouse with Metadata; Data Marts; Middleware/API; access tools (EIS/DSS, Query Tools, OLAP/ROLAP, Web Browsers, Data Mining).]

Page 39: An Introduction  to  Data Warehousing

Typical Data Warehouse Architecture

[Figure: Multi-tiered Data Warehouse with ODS. Components shown: Operational Systems/Data; Data Preparation (Select, Extract, Transform, Integrate, Maintain); ODS with Metadata; Data Preparation (Select, Extract, Transform, Load); Data Warehouse with Metadata; Data Marts.]

Page 40: An Introduction  to  Data Warehousing

Benefits of DWH

To formulate effective business, marketing

and sales strategies.

To precisely target promotional activity.

To discover and penetrate new markets.

To successfully compete in the marketplace

from a position of informed strength.

To build predictive rather than retrospective models.

These capabilities empower the corporate...

Page 41: An Introduction  to  Data Warehousing

Warehouse Architecture - 1

[Figure: Enterprise Data Warehouse. Components shown: Operational Systems/Data; Data Preparation (Select, Extract, Transform, Integrate, Maintain); Data Warehouse with Metadata; Middleware/API; access tools (EIS/DSS, Query Tools, OLAP/ROLAP, Web Browsers, Data Mining).]

Page 42: An Introduction  to  Data Warehousing

Warehouse Architecture - 2

[Figure: Single Department Data Mart. Components shown: Operational Systems/Data; Data Preparation (Select, Extract, Transform, Integrate, Maintain); three Data Marts, each with its own Metadata; Middleware/API; access tools (EIS/DSS, Query Tools, OLAP/ROLAP, Web Browsers, Data Mining).]

Page 43: An Introduction  to  Data Warehousing

Warehouse Architecture - 3

[Figure: Multi-tiered Data Warehouse. Components shown: Operational Systems/Data; Data Preparation (Select, Extract, Transform, Integrate, Maintain); Operational Data Store; Data Warehouse with Metadata; Data Marts; Middleware/API; access tools (EIS/DSS, Query Tools, OLAP/ROLAP, Web Browsers, Data Mining).]

Page 44: An Introduction  to  Data Warehousing

Data Warehouse Architectures

There are three schools of thought about DW architectures

– One supports Dimensional Modeling all through (Ralph Kimball)

– Second supports ER for Data Warehouse and Star Schemas for Data Marts

– Third supports ER model for DW (NCR)

Page 45: An Introduction  to  Data Warehousing

Kimball’s View

[Figure: Multiple Data Marts With Conformed Dimensions. Components shown: Operational Systems; Staging Area; Data Warehouse Server; LAN; Presentation Server.]

Data Warehouse Server Processes

• Extract
• Scrubbing
• Transformation
• Load Jobs
• Aggregation Jobs
• Replication
• Monitoring
• Management
• Meta Data Repository
• Meta Data Population
• Meta Data Maintenance

DW is the sum total of all Data Marts

DW Bus using Conformed Dimensions

Each Star is a Data Mart and has both summary and detail data
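
The idea of data marts tied together by conformed dimensions can be sketched with Python's built-in `sqlite3` module. The table names, columns and rows below are invented for illustration and are not taken from the slides: two star-schema fact tables (a sales mart and an inventory mart) share one `dim_product` dimension, so their measures can be combined by joining through it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Conformed dimension shared by both marts.
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT);

-- Sales mart: one star centred on fact_sales.
CREATE TABLE fact_sales (
    product_key INTEGER REFERENCES dim_product(product_key),
    units_sold INTEGER
);

-- Inventory mart: a second star reusing the same dimension.
CREATE TABLE fact_inventory (
    product_key INTEGER REFERENCES dim_product(product_key),
    units_on_hand INTEGER
);

INSERT INTO dim_product VALUES (1, 'Widget'), (2, 'Gadget');
INSERT INTO fact_sales VALUES (1, 10), (1, 5), (2, 7);
INSERT INTO fact_inventory VALUES (1, 40), (2, 3);
""")

# Aggregate each fact table separately, then join the results on the
# shared dimension key so rows from one mart never multiply the other's.
DRILL_ACROSS = """
SELECT p.name, s.sold, i.on_hand
FROM dim_product AS p
JOIN (SELECT product_key, SUM(units_sold) AS sold
      FROM fact_sales GROUP BY product_key) AS s
  ON s.product_key = p.product_key
JOIN (SELECT product_key, SUM(units_on_hand) AS on_hand
      FROM fact_inventory GROUP BY product_key) AS i
  ON i.product_key = p.product_key
ORDER BY p.name
"""

rows = conn.execute(DRILL_ACROSS).fetchall()
print(rows)  # [('Gadget', 7, 3), ('Widget', 15, 40)]
```

If each mart kept its own incompatible product table, this combined report would need a mapping step first; sharing the dimension is what makes the set of marts behave as one warehouse.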

Page 46: An Introduction  to  Data Warehousing

Inmon’s View

[Figure: Data Warehouse (ER) Feeding Multiple Data Marts (Star Schema). Components shown: Operational Systems; Staging Area; Data Warehouse (Detail Data in ER format); LAN; Data Marts (Summarized Data in Star formats).]

Data Warehouse Server Processes

• Extract
• Scrubbing
• Transformation
• Load Jobs
• Aggregation Jobs
• Replication
• Monitoring
• Management
• Meta Data Repository
• Meta Data Population
• Meta Data Maintenance

Page 47: An Introduction  to  Data Warehousing

Components of a Data Warehouse Architecture

Source Databases

Data extraction/transformation/load (ETL) tool

Data warehouse maintenance and administration tools

Data modeling tool or interface to external data models

Warehouse databases

End-user data access and analysis tools

Page 48: An Introduction  to  Data Warehousing

Components of a Data Warehouse Architecture

[Figure: components shown: Source Databases; Data Cleansing Tools; ETL Tool; Data Modeling Tool; Central Metadata; Central Warehouse (RDBMS); Warehouse Admin Tool; Local meta data; RDBMS; ROLAP Engine; MDDB; Architected Datamarts; Warehouse Databases; Data Access and Analysis Tools (Managed Query, Desktop OLAP, ROLAP, MOLAP, Data Mining).]

Data Warehouse Is Not Just About Data... But Tools Too

Page 49: An Introduction  to  Data Warehousing

Source Databases - Characteristics

Legacy, relational, text or external sources

Designed for high-speed transaction processing

Real-time, current, volatile data

Fast response for larger numbers of concurrent users

Many short transactions

Update-intensive; modifications by row

Inquiry-oriented; access by keys

High integrity, security, recoverability

Source data is often inconsistent and poorly modeled

Page 50: An Introduction  to  Data Warehousing

Data Cleaning Tools

To clean data at the source

Clean up source data in-place on the host

Business rule discovery tools which analyse the source data and write cleaning rules based on lexical analysis and AI techniques

Poorly integrated with data warehousing tools

ETL tools have limited yet adequate data cleansing functionality
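
A very small sketch of what rule discovery by lexical analysis might look like: reduce each value to its "shape", take the most common shape in a column as the rule, and flag values that deviate. This is an illustrative simplification, not how any specific cleansing product works; the function names and the sample postcode values are hypothetical.

```python
import re
from collections import Counter


def lexical_pattern(value: str) -> str:
    """Reduce a value to its shape: letters become 'A', digits become '9'."""
    shape = re.sub(r"[A-Za-z]", "A", value)
    return re.sub(r"[0-9]", "9", shape)


def discover_rule(values):
    """Return the most common lexical pattern in a column."""
    counts = Counter(lexical_pattern(v) for v in values)
    return counts.most_common(1)[0][0]


def violations(values, rule):
    """List the values whose shape differs from the discovered rule."""
    return [v for v in values if lexical_pattern(v) != rule]


# Sample column with a letter O typed for a zero and a stray space.
postcodes = ["560001", "560034", "56O012", "600 017", "110001"]
rule = discover_rule(postcodes)
bad = violations(postcodes, rule)
print(rule, bad)  # 999999 ['56O012', '600 017']
```

The flagged values would then be corrected at the source or routed to a reject file before loading.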