DATA WAREHOUSING


Page 1: 8659164-DataWarehousing

DATA WAREHOUSING

Page 2: 8659164-DataWarehousing

Agenda

• Introduction
• Process
• DSS
• Information Processing
• Dimensions
• OLAP
• Architecture Types
• Best Practice
• Case

Page 3: 8659164-DataWarehousing

Introduction

Shilpa Surve

Page 4: 8659164-DataWarehousing

04/17/2023 4

Definition

A data warehouse is a
• Subject-Oriented
• Integrated
• Time-Variant
• Non-volatile
collection of data in support of management’s decisions.

Page 5: 8659164-DataWarehousing

What are Data Warehouses?

Data warehouses store large volumes of data that are frequently used by decision support systems (DSS).

A data warehouse is maintained separately from the organization’s operational databases.

Data warehouses are relatively static, with only infrequent updates.

A data warehouse is a stand-alone repository of information, integrated from several, possibly heterogeneous, operational databases.

Page 6: 8659164-DataWarehousing

Steps in Building a Warehouse

Identify key business drivers, sponsorship, risks, and ROI.

Survey information needs, identify desired functionality, and define functional requirements for the initial subject area.

Design the long-term data warehousing architecture.

Evaluate and finalize DW tools and technology.

Conduct a proof of concept.

Page 7: 8659164-DataWarehousing

Steps in Building a Data Warehouse

Design the target database schema.

Build data mapping, extraction, transformation, cleansing, and aggregation/summarization rules.

Build the initial data mart using an exact subset of the enterprise data warehousing architecture, and expand to the enterprise architecture over subsequent phases.

Maintain and administer the data warehouse.

Page 8: 8659164-DataWarehousing

The Three Views of Data Warehousing

Strategic or Business view
• Define key business drivers of the data warehouse
• How can a business-driven approach achieve a high ROI?

Architectural or Technology view
• Alternative data warehousing architectures
• How can the right architecture achieve a high ROI?

Methodology or Implementation view
• Development and implementation methodology
• How can the right methodology achieve a rapid ROI?

Page 9: 8659164-DataWarehousing

Process

Swathi Velisetty

Page 10: 8659164-DataWarehousing

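DW Components

[Figure: DW components. Sources: legacy systems and feed systems FS1, FS2, ..., FSn. Processes: extraction, transmission (network), staging area, cleansing, transformation, aggregation/summarization, data mart population. Stores: ODS, DW, and data marts DM1, DM2, ..., DMn. Consumers: OLAP analysis and knowledge discovery. A metadata layer is also shown.]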

Page 11: 8659164-DataWarehousing

Cleansing Process

[Figure: raw data (staging area) enters the cleansing process, which is driven by process metadata (cleansing rules) and control metadata. Outputs: clean data marked good or bad, and cleansing reports.]

• Clean the raw data
• Mark it Good/Bad
• Generate the cleansing reports and mail them to the DWA and feed system representatives
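The cleansing step above can be sketched in a few lines of Python. This is only an illustration under assumptions: the rule names, the record fields (`customer_id`, `amount`), and the report format are hypothetical and do not come from the slides.

```python
# Hypothetical cleansing rules (process metadata): rule name -> predicate on a raw record.
CLEANSING_RULES = {
    "customer_id present": lambda r: bool(r.get("customer_id")),
    "amount is non-negative": lambda r: isinstance(r.get("amount"), (int, float))
    and r["amount"] >= 0,
}


def cleanse(raw_records, rules=CLEANSING_RULES):
    """Mark each staged record good or bad and build a cleansing report."""
    good, bad, report = [], [], []
    for index, record in enumerate(raw_records):
        failed = [name for name, check in rules.items() if not check(record)]
        if failed:
            bad.append(record)
            report.append(f"record {index}: BAD ({', '.join(failed)})")
        else:
            good.append(record)
    return good, bad, report


# Hypothetical staging-area data.
staging = [
    {"customer_id": "C1", "amount": 10.0},
    {"customer_id": "", "amount": 5},
    {"customer_id": "C3", "amount": -1},
]
good, bad, report = cleanse(staging)
```

Mailing the report to the DWA and feed system representatives is left out of the sketch; `report` is simply the list of lines that would go into that mail.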

Page 12: 8659164-DataWarehousing

Transformation Process

[Figure: clean operational data passes through the transformation process, driven by control metadata and process metadata (mapping detail, transformation rule), into the Operational Data Store.]

• Transform the cleaned operational data into DSS data
• Load the DSS data into the ODS
• The ODS contains the current DSS data at the lowest level of granularity

Page 13: 8659164-DataWarehousing

Summarization Process

[Figure: the summarization process, driven by control metadata, reads the ODS and populates weekly, monthly, and yearly summaries in the DW.]

• Summarize and aggregate ODS data and populate the warehouse
• The periodicity of the summarization process depends on the level of summarization in the warehouse (weekly, monthly, daily)
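As a rough illustration of the summarization step, the sketch below aggregates lowest-grain ODS rows into weekly, monthly, or yearly totals. The row layout `(sale_date, product, amount)` and the sample values are assumptions made for the example.

```python
from collections import defaultdict
from datetime import date

# Hypothetical ODS rows at the lowest grain: (sale_date, product, amount).
ods_rows = [
    (date(2023, 1, 2), "bond", 100.0),
    (date(2023, 1, 3), "bond", 50.0),
    (date(2023, 1, 10), "bond", 25.0),
    (date(2023, 2, 1), "bond", 40.0),
]


def summarize(rows, period):
    """Aggregate ODS rows to the requested level of summarization."""
    totals = defaultdict(float)
    for day, product, amount in rows:
        if period == "weekly":
            iso = day.isocalendar()  # (ISO year, ISO week, weekday)
            key = (iso[0], iso[1], product)
        elif period == "monthly":
            key = (day.year, day.month, product)
        elif period == "yearly":
            key = (day.year, product)
        else:
            raise ValueError(f"unknown period: {period}")
        totals[key] += amount
    return dict(totals)


weekly = summarize(ods_rows, "weekly")
monthly = summarize(ods_rows, "monthly")
yearly = summarize(ods_rows, "yearly")
```

Each call corresponds to one run of the summarization process; how often it runs would follow the periodicity chosen for the warehouse.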

Page 14: 8659164-DataWarehousing

Enterprise Data Warehouse

[Figure: operational systems data (legacy, client/server, OLTP, external) goes through data preparation (select, extract, transform, integrate, maintain) into the enterprise-wide data warehouse, which has a metadata repository. Users access the warehouse through an API.]

Page 15: 8659164-DataWarehousing

Distributed Data Marts

[Figure: operational systems data (legacy, client/server, OLTP, external) goes through data preparation (select, extract, transform, integrate, maintain) into several separate data marts. Users access the data marts through an API.]

Page 16: 8659164-DataWarehousing

Multi-tiered Data Warehouse

[Figure: operational systems data (legacy, client/server, OLTP, external) goes through select, extract, transform, integrate, and maintain steps into the enterprise-wide data warehouse, which has a metadata repository and feeds several data marts. Users access the data through an API.]

Page 17: 8659164-DataWarehousing

Example

[Figure: example warehouse contents]
• Sales detail for 1985-90
• Sales detail for 1991-94
• Weekly sales by product/sub-product for 1991-94
• Weekly sales by region for 1991-94
• Monthly sales by product for 1991-94
• Monthly sales by region for 1991-94
• Metadata

Page 18: 8659164-DataWarehousing

Decision support system

Atul zade

Page 19: 8659164-DataWarehousing

What is DSS?

Enable users to get a “Business View” of the data

Facilitate data-based decision making that drives and improves the business

Discover “Hidden Trends”

Decision Support Systems (DSS) are interactive computer-based systems intended to help decision makers use data and models to identify and solve problems and make decisions. The data warehouse is the foundation of the DSS process. It is a strategy and a process for staging corporate data.

Page 20: 8659164-DataWarehousing

Driving Forces for DSS

[Figure: changes in the business environment (customers, reform, technology, business speed). Result: COMPETITION.]

Contd.

Page 21: 8659164-DataWarehousing

How to answer these Business Queries?

What is the region-wise sales distribution?

What is the defaulter’s profile?

What are the slow movers in my product line?

How did my revenue improve in the past 5 years?

Which of my sales agents are doing better?

Who are my profitable customers?

Currency Risk, Interest Rate Risk, Liquidity Risk

Strategic Planning / Budgeting

Which channel costs me more and pays less?
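To make one of these queries concrete, here is a minimal sketch of the region-wise sales distribution using Python's built-in `sqlite3` module. The `sales` table, its columns, and the figures are hypothetical; a real warehouse would have its own schema.

```python
import sqlite3

# Hypothetical, in-memory stand-in for a warehouse sales table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("East", 60.0), ("West", 30.0), ("North", 10.0), ("East", 100.0)],
)


def sales_distribution(connection):
    """Return each region's share of total sales as a fraction."""
    total = connection.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
    rows = connection.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
    ).fetchall()
    return {region: amount / total for region, amount in rows}


distribution = sales_distribution(conn)
```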

Page 22: 8659164-DataWarehousing

OLTP v/s DSS Environment

OLTP Environment                                 DSS Environment
• get data IN                                    • get information OUT
• large volumes of simple transaction queries    • small number of diverse queries
• continuous data changes                        • periodic updates only
• low processing time                            • high processing time
• mode of processing                             • mode of discovery
• transaction details                            • subject oriented - summaries
• data inconsistency                             • data consistency
• mostly current data                            • historical data is relevant
• high concurrent usage                          • low concurrent usage
• highly normalized data structure               • fewer tables, but more columns per table
• static applications                            • dynamic applications
• automates routines                             • facilitates creativity

Page 23: 8659164-DataWarehousing

Benefits for Business User

• Flexible Information Access
• High Availability
• Ease of Use
• Quality & Completeness of Data
• Focus on Information Processing
• Information Base for Knowledge Discovery

Page 24: 8659164-DataWarehousing

Classification of Business Users

• Executives/Managers
– Multi-dimensional analysis, reporting tools
• Knowledge Worker
– Ad hoc queries, detail & summary data, application focus
• Power-Analyst
– Ad hoc queries, Data Analysis & Data Mining
• Customer Contacts
– Detail Data at specific levels

Page 25: 8659164-DataWarehousing

Information processing

Prem Sequera

Page 26: 8659164-DataWarehousing

Data Processing to Information Processing

[Figure: management decision value chain, from data processing through information processing to knowledge processing. Data elements: heterogeneous data sources, feed systems, and external sources. These feed the Operational Data Store (ODS) and the Enterprise Data Warehouse, with Data Mart A, Data Mart B, ..., Data Mart N, each supporting application-specific analysis. OLAP/query tools and OLAP applications provide query processing and report generation, leading to knowledge discovery (data mining applications) and knowledge management. Business elements: business objectives and goals, application domains and business functions. Traceability is shown on both sides of the diagram.]

Page 27: 8659164-DataWarehousing

Subject Oriented Analysis

[Figure: transactional storage is process oriented (entry of sales rep, quantity sold, part number, date, customer name, product description, unit price, mail address). Data warehouse storage is subject oriented (sales, customers, products).]

Page 28: 8659164-DataWarehousing

Integration of Data

[Figure: integration from transactional storage into data warehouse storage]

Encoding
– Appl. A - M, F; Appl. B - 1, 0; Appl. C - X, Y
– Data warehouse: M, F

Unit of Attributes
– Appl. A - pipeline cm; Appl. B - pipeline inches; Appl. C - pipeline mcf
– Data warehouse: pipeline cm

Physical Attributes
– Appl. A - balance dec(13,2); Appl. B - balance PIC 9(9)V99; Appl. C - balance float
– Data warehouse: balance dec(13,2)

Naming Conventions
– Appl. A - bal-on-hand; Appl. B - current_balance; Appl. C - balance
– Data warehouse: balance

Data Consistency
– Appl. A - date (Julian); Appl. B - date (yymmdd); Appl. C - date (absolute)
– Data warehouse: date (Julian)
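The integration rules on this slide can be sketched as a small mapping function. Several details are assumptions for illustration only: which source code corresponds to M or F in applications B and C, the record field names `gender`, `pipeline`, and `pipeline_unit`, and the decision to skip the mcf unit (a volume, which has no direct conversion to cm).

```python
from decimal import Decimal

# Assumed encodings: the slide lists the codes but not which one means M or F.
GENDER_CODES = {
    "A": {"M": "M", "F": "F"},
    "B": {"1": "M", "0": "F"},
    "C": {"X": "M", "Y": "F"},
}
# Naming conventions from the slide, mapped to the warehouse name "balance".
BALANCE_FIELD = {"A": "bal-on-hand", "B": "current_balance", "C": "balance"}
# Length units only; 1 inch = 2.54 cm exactly.
CM_PER_UNIT = {"cm": Decimal("1"), "inches": Decimal("2.54")}


def integrate(app, record):
    """Map one source record to the warehouse conventions."""
    return {
        "gender": GENDER_CODES[app][str(record["gender"])],
        "pipeline_cm": Decimal(str(record["pipeline"]))
        * CM_PER_UNIT[record["pipeline_unit"]],
        # Two decimal places, in the spirit of dec(13,2).
        "balance": Decimal(str(record[BALANCE_FIELD[app]])).quantize(Decimal("0.01")),
    }


row = integrate(
    "B",
    {"gender": 1, "pipeline": 10, "pipeline_unit": "inches", "current_balance": 1234.5},
)
```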

Page 29: 8659164-DataWarehousing

Volatility of Data

[Figure: transactional storage is volatile, with record-by-record data manipulation (insert, change, delete, access). Data warehouse storage is non-volatile, with mass load and access of data (load, access).]

Page 30: 8659164-DataWarehousing

Time Variant Data Analysis

[Figure: transactional storage holds current data; data warehouse storage holds historical data. Example chart: Sales (in lakhs) by region (East, West, North) for January, February, and March, Year 97, 1st quarter; vertical axis from 0 to 20.]

Page 31: 8659164-DataWarehousing

Dimension

Kairav Parikh

Page 32: 8659164-DataWarehousing

What is a Dimension?

A data warehouse is a
• Subject-Oriented
• Integrated
• Time-Variant
• Non-volatile
collection of data in support of management’s decisions.

Subject = Dimension (e.g. Customer, Geography, Time)

Page 33: 8659164-DataWarehousing

Dimensional Hierarchy

[Figure: Geography dimension hierarchy. Each dimension member (business entity) is linked to its parent by a parent relation.]

World Level: World
Continent Level: America, Asia, Europe
Country Level: USA, Canada, Argentina (under America)
State Level: FL, GA, VA, CA, WA (under USA)
City Level: Tampa, Miami, Orlando, Naples (under FL)

Attributes: Population, Tourist’s Place
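A dimensional hierarchy of this kind can be represented with a simple parent relation. The sketch below uses the Geography members from the slide; the function names `path_to_root` and `level_of` are made up for the example.

```python
# Parent relation for the Geography dimension (child -> parent).
PARENT = {
    "America": "World", "Asia": "World", "Europe": "World",
    "USA": "America", "Canada": "America", "Argentina": "America",
    "FL": "USA", "GA": "USA", "VA": "USA", "CA": "USA", "WA": "USA",
    "Tampa": "FL", "Miami": "FL", "Orlando": "FL", "Naples": "FL",
}
# Level names, from the root of the hierarchy downwards.
LEVELS = ["World", "Continent", "Country", "State", "City"]


def path_to_root(member):
    """Follow the parent relation from a member up to the root."""
    path = [member]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def level_of(member):
    """The level of a member is its depth below the root."""
    return LEVELS[len(path_to_root(member)) - 1]


miami_path = path_to_root("Miami")
```

Attributes such as population would hang off each member, for example in a separate dictionary keyed by member name.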

Page 34: 8659164-DataWarehousing


Types of Dimensions

• Simple Dimensions (e.g. Time)

• Related Dimensions (e.g. Gender of a Customer)

• Spool Dimensions (e.g. Account as an interaction between Customer and Product)

• Bucket Dimensions (e.g. Income Ranges of a Customer)

• Slowly Changing Dimensions (e.g. changes in Organization)

• Fast Varying Dimensions (e.g. changes in Retail Customers’ attributes)

• Unused Dimensions (e.g. Order No., Invoice No.)
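As one illustration from the list above, a bucket dimension assigns a continuous value to a range. The income boundaries and labels below are hypothetical; the slides name income ranges only as an example.

```python
from bisect import bisect_right

# Hypothetical bucket boundaries (lower bounds of the second and later buckets).
BOUNDS = [25_000, 50_000, 100_000]
LABELS = ["under 25,000", "25,000 to 49,999", "50,000 to 99,999", "100,000 and over"]


def income_bucket(income):
    """Return the bucket-dimension member for a customer's income."""
    return LABELS[bisect_right(BOUNDS, income)]


example_bucket = income_bucket(62_000)
```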

Page 35: 8659164-DataWarehousing

Dimensional Modeling
STEP 1

• Identify Subjects (Dimensions)
• Identify Hierarchies of a Dimension
• Identify Attributes of levels in Hierarchies
• Define Grain

[Figure: Customer dimension with Industry Segment, Industry Type, Fin. Class, City, State, and Country]

Contd.

Page 36: 8659164-DataWarehousing

Dimensional Modeling
STEP 2

• Use KPIs to identify the Facts
• Group the Facts in a logical set

Financial Transactions: Trans. Amount, No. of Bonds, No. of Transactions, Service Cost, ...

Non-Financial Transactions: No. of Cheques Cleared, No. of Visits to a Branch, No. of DEMAT Transactions, ...

Contd.

Page 37: 8659164-DataWarehousing

Dimensional Modeling
STEP 3

• Link the Group of Facts to the Dimensions that participate in the Facts

[Figure: the Financial Transactions fact group linked to the Customer, Organization, Time, Product, and Channel dimensions]

Page 38: 8659164-DataWarehousing

Dimensional Modeling
STEP 4

• Define Granularity for each Group of Facts

Financial Transactions, with the grain of each dimension in parentheses:
• Customer (Customer)
• Organization (Branch)
• Product (Scheme)
• Channel (Channel)
• Time (Day-Hour)

Page 39: 8659164-DataWarehousing

Data Warehouse Schemas

Star Schema

• A Group of Facts connected to Multiple Dimensions

[Figure: the Financial Transactions fact group at the center, connected to the Customer, Organization, Time, Product, and Channel dimensions]

Contd.
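A star schema like this one can be sketched with Python's built-in `sqlite3` module. The table and column names, the sample rows, and the choice to include only three of the five dimensions (Organization and Time are omitted for brevity) are assumptions for illustration.

```python
import sqlite3


def build_star(connection):
    """Create a tiny star schema: one fact table keyed to dimension tables."""
    connection.executescript(
        """
        CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE product  (product_id  INTEGER PRIMARY KEY, scheme TEXT);
        CREATE TABLE channel  (channel_id  INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE financial_transactions (
            customer_id  INTEGER REFERENCES customer(customer_id),
            product_id   INTEGER REFERENCES product(product_id),
            channel_id   INTEGER REFERENCES channel(channel_id),
            trans_amount REAL
        );
        INSERT INTO customer VALUES (1, 'Asha'), (2, 'Ravi');
        INSERT INTO product  VALUES (1, 'Savings');
        INSERT INTO channel  VALUES (1, 'Branch'), (2, 'ATM');
        INSERT INTO financial_transactions VALUES
            (1, 1, 1, 100.0), (2, 1, 2, 50.0), (1, 1, 2, 25.0);
        """
    )


def amount_by_channel(connection):
    """Join the fact table to one dimension and aggregate a fact."""
    return connection.execute(
        """
        SELECT ch.name, SUM(f.trans_amount)
        FROM financial_transactions AS f
        JOIN channel AS ch ON ch.channel_id = f.channel_id
        GROUP BY ch.name
        ORDER BY ch.name
        """
    ).fetchall()


conn = sqlite3.connect(":memory:")
build_star(conn)
totals = amount_by_channel(conn)
```

Every analytical query follows the same shape: join the fact table to the dimensions of interest, then group by dimension attributes.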

Page 40: 8659164-DataWarehousing

Data Warehouse Schemas

Snow-flake Schema (= Extended Star Schema)

• A Group of Facts connected to Dimensions, which are split across multiple hierarchies and attributes

[Figure: the Financial Transactions fact group connected to the Customer, Organization, Time, Product, and Channel dimensions, with Segment and Geography split out as separate tables]

Contd.

Page 41: 8659164-DataWarehousing

Data Warehouse Schemas

Galaxy Schema

• Multiple Groups of Facts linked by a few common dimensions

[Figure: Fact1, Fact2, and Fact3 linked to Dimension1 through Dimension7, some of which are shared between facts]

Page 42: 8659164-DataWarehousing

OLAP

Akshay Shiveshwarkar

Page 43: 8659164-DataWarehousing

On-Line Analytical Processing

OLAP can be defined as a technology that lets users view aggregate data across measurements (such as Maturity Amount and Interest Rate) along with a set of related parameters called dimensions (such as Product, Organization, and Customer).

Page 44: 8659164-DataWarehousing

What is MDDB?

A multidimensional database is a computer software system designed to allow for efficient and convenient storage and retrieval of data that is
• intimately related and
• stored, viewed and analyzed from different perspectives (Dimensions).

Page 45: 8659164-DataWarehousing

RDBMS v/s MDDB

Relational table (9 x 3 = 27 cells):

MODEL           COLOR    SALES VOL.
MINI VAN        BLUE     6
MINI VAN        RED      5
MINI VAN        WHITE    4
SPORTS COUPE    BLUE     3
SPORTS COUPE    RED      5
SPORTS COUPE    WHITE    5
SEDAN           BLUE     4
SEDAN           RED      3
SEDAN           WHITE    2

Multidimensional array of Sales Volumes, MODEL by COLOR (3 x 3 = 9 cells):

            Blue    Red    White
Mini Van    6       5      4
Coupe       3       5      5
Sedan       4       3      2
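The difference between the two layouts can be shown by pivoting the relational rows into a two-dimensional array. This is a minimal sketch; the function name `to_cube` is made up for the example, and the data is the sales table from the slide.

```python
# Relational form: one row per (model, color) combination.
rows = [
    ("MINI VAN", "BLUE", 6), ("MINI VAN", "RED", 5), ("MINI VAN", "WHITE", 4),
    ("SPORTS COUPE", "BLUE", 3), ("SPORTS COUPE", "RED", 5), ("SPORTS COUPE", "WHITE", 5),
    ("SEDAN", "BLUE", 4), ("SEDAN", "RED", 3), ("SEDAN", "WHITE", 2),
]


def to_cube(relational_rows):
    """Pivot (model, color, sales) rows into a model-by-color array."""
    models = list(dict.fromkeys(model for model, _, _ in relational_rows))
    colors = list(dict.fromkeys(color for _, color, _ in relational_rows))
    cube = [[None] * len(colors) for _ in models]
    for model, color, sales in relational_rows:
        cube[models.index(model)][colors.index(color)] = sales
    return models, colors, cube


models, colors, cube = to_cube(rows)
relational_cells = len(rows) * len(rows[0])  # 9 rows x 3 columns
cube_cells = len(models) * len(colors)       # 3 models x 3 colors
```

In the array form the dimension values are stored once as axis labels, so a lookup such as sedan and red is a direct index rather than a scan over rows.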

Page 46: 8659164-DataWarehousing

Benefits of MDDB over RDBMS

Ease of Data Presentation & Navigation
– Intuitive, spreadsheet/crosstab-like data views

Storage Space
– Very low space consumption compared to a relational DB

Performance
– Gives much better performance. A relational DB may give comparable results only through database tuning (indexing, keys, etc.), which may not be possible for ad hoc queries.

Ease of Maintenance
– No overhead, as data is stored in the same way it is viewed. In a relational DB, indexes, sophisticated joins, etc. are used, which require considerable storage and maintenance.

Page 47: 8659164-DataWarehousing

Issues with MDDB

• Sparsity
– Controlled Sparsity
– Random Sparsity
• Data Explosion
– Due to Sparsity
– Due to Summarization
• Performance
– Doesn’t perform better than RDBMS at high data volumes (>20-30 GB)
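Sparsity can be quantified as the fraction of cube cells that hold no data. The sketch below is a simple way to estimate it, assuming a cube allocates one cell per combination of dimension members; the function names are made up for the example.

```python
import math


def cube_cells(dimension_sizes):
    """Number of cells in a cube: the product of its dimension sizes."""
    return math.prod(dimension_sizes)


def sparsity(populated_cells, dimension_sizes):
    """Fraction of cells that hold no data."""
    return 1 - populated_cells / cube_cells(dimension_sizes)


# Hypothetical cube: 1,000 customers x 50 products x 365 days, 200,000 facts.
total_cells = cube_cells([1000, 50, 365])
empty_fraction = sparsity(200_000, [1000, 50, 365])
```

Adding a dimension multiplies the cell count while the number of populated cells usually grows far more slowly, which is one way to see how sparsity drives data explosion.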

Page 48: 8659164-DataWarehousing

OLAP Features

• Subject oriented approach to Decision Support
• Calculations applied across dimensions, through hierarchies and/or across members
• Trend analysis over sequential time periods
• What-if scenarios
• Slicing / Dicing subsets for on-screen viewing
• Rotation to new dimensional comparisons in the viewing area
• Drill-down/up along the hierarchy
• Reach-through / Drill-through to underlying detail data

Page 49: 8659164-DataWarehousing

Features of OLAP - Drill Down / Up

[Figure: Organization dimension; sales at region/district/dealership level]
REGION: Midwest
DISTRICT: Chicago, St. Louis, Gary
DEALERSHIP: Clyde, Gleason, Carr, Levi, Lucas, Bolton

• Moving up and moving down in a hierarchy are referred to as “drill-up” / “roll-up” and “drill-down”
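Roll-up and drill-down along the Organization dimension can be sketched as follows. The member names come from the slide, but the assignment of dealerships to districts and all sales figures are hypothetical, since the transcript does not preserve them.

```python
# Hypothetical parent relations; the slide does not say which dealership
# belongs to which district.
DISTRICT_OF = {
    "Clyde": "Chicago", "Gleason": "Chicago",
    "Carr": "St. Louis", "Levi": "St. Louis",
    "Lucas": "Gary", "Bolton": "Gary",
}
REGION_OF = {"Chicago": "Midwest", "St. Louis": "Midwest", "Gary": "Midwest"}

# Hypothetical sales at the lowest (dealership) level.
dealership_sales = {"Clyde": 6, "Gleason": 5, "Carr": 4, "Levi": 3, "Lucas": 5, "Bolton": 5}


def roll_up(values, parent_of):
    """Aggregate values one level up the hierarchy."""
    totals = {}
    for child, value in values.items():
        parent = parent_of[child]
        totals[parent] = totals.get(parent, 0) + value
    return totals


def drill_down(parent, values, parent_of):
    """Show the children of one member with their values."""
    return {child: value for child, value in values.items() if parent_of[child] == parent}


district_sales = roll_up(dealership_sales, DISTRICT_OF)
region_sales = roll_up(district_sales, REGION_OF)
gary_detail = drill_down("Gary", dealership_sales, DISTRICT_OF)
```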

Page 50: 8659164-DataWarehousing

Architecture Types

Ritesh Raushan

Page 51: 8659164-DataWarehousing

Implementation Techniques - OLAP Architectures

• MOLAP - Multidimensional OLAP
– Multidimensional databases for the database and application logic layer
• ROLAP - Relational OLAP
– Accesses data stored in a relational data warehouse for OLAP analysis
– Database and application logic provided as separate layers
• HOLAP - Hybrid OLAP
– OLAP server routes queries first to the MDDB, then to the RDBMS, and the result is processed on-the-fly in the server
• DOLAP - Desk OLAP
– Personal MDDB server and application on the desktop
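The HOLAP routing idea (try the MDDB first, then fall back to the RDBMS) can be sketched with plain Python stand-ins. The class `HolapServer` is hypothetical: a dictionary plays the role of the pre-summarized cube and a list of rows plays the role of the relational warehouse, so this illustrates only the routing, not a real OLAP server.

```python
class HolapServer:
    """Toy HOLAP router: summaries come from a cube, details from rows."""

    def __init__(self, cube, detail_rows):
        self.cube = cube                # stand-in for the MDDB: (model, color) -> sales
        self.detail_rows = detail_rows  # stand-in for the RDBMS: (model, color, sales) rows

    def query(self, model, color):
        """Return (sales, source), trying the cube before the detail rows."""
        key = (model, color)
        if key in self.cube:
            return self.cube[key], "MDDB"
        # Not pre-summarized: aggregate on the fly from the detail rows.
        total = sum(s for m, c, s in self.detail_rows if m == model and c == color)
        return total, "RDBMS"


server = HolapServer(
    cube={("SEDAN", "RED"): 3},
    detail_rows=[("MINI VAN", "BLUE", 4), ("MINI VAN", "BLUE", 2)],
)
from_cube = server.query("SEDAN", "RED")
from_rows = server.query("MINI VAN", "BLUE")
```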

Page 52: 8659164-DataWarehousing

MOLAP - MDDB storage

[Figure: a web browser, OLAP tools, and OLAP applications access an OLAP calculation engine backed by an OLAP cube.]

Page 53: 8659164-DataWarehousing

ROLAP - Standard SQL storage

[Figure: a web browser, OLAP tools, and OLAP applications access an OLAP calculation engine, which uses an MDDB-relational mapping and SQL to read the relational DW.]

Page 54: 8659164-DataWarehousing

HOLAP - Combination of RDBMS and MDDB

[Figure: any client (web browser, OLAP tools, OLAP applications) accesses an OLAP calculation engine, which reads the OLAP cube and, through SQL, the relational DW.]

Page 55: 8659164-DataWarehousing

Architecture Comparison

Definition
– MOLAP: MDDB OLAP = Transaction level data + summary in MDDB
– ROLAP: Relational OLAP = Transaction level data + summary in RDBMS
– HOLAP: Hybrid OLAP = ROLAP + summary in MDDB

Data explosion due to Sparsity
– MOLAP: With good design, 3 to 10 times
– ROLAP: No sparsity
– HOLAP: Sparsity exists only in the MDDB part

Data explosion due to Summarization
– MOLAP: High (may go beyond control; estimation is very important)
– ROLAP: To the necessary extent
– HOLAP: To the necessary extent

Query Execution Speed
– MOLAP: Fast (depends upon the size of the MDDB)
– ROLAP: Slow
– HOLAP: Optimum. If the data is fetched from the RDBMS it behaves like ROLAP, otherwise like MOLAP.

Cost
– MOLAP: Medium: MDDB server + large disk space cost
– ROLAP: Low: only RDBMS + disk space cost
– HOLAP: High: RDBMS + disk space + MDDB server cost

Where to apply?
– MOLAP: Small transactional data + complex model + frequent summary analysis
– ROLAP: Very large transactional data that needs to be viewed / sorted
– HOLAP: Large transactional data + frequent summary analysis

Page 56: 8659164-DataWarehousing

Case

Kiran Naik

Page 57: 8659164-DataWarehousing

Thank you