DWH Concepts Summ

- Vamshi Myana Data Warehouse Concepts & Terminology


Transcript of DWH Concepts Summ

Page 1: DWH Concepts Summ

- Vamshi Myana

Data Warehouse Concepts & Terminology

Page 2: DWH Concepts Summ

Contents

- What is a Data Warehouse?
- Why Separate Data Warehouse?
- Data Granularity
- Difference between OLTP & DW
- Data Warehouse Architecture
- Top-Down Versus Bottom-Up Approach
- Data Warehouses Versus Data Marts
- Dimensional Modeling Fundamentals
- Extraction, Transformation and Load
- ETL (Extract, Transform, Load) & OLAP

Page 3: DWH Concepts Summ

What is a Data Warehouse?

A data warehouse is a relational database that is designed for query and analysis rather than for transaction processing. It usually contains historical data derived from transaction data, but it can include data from other sources. It separates analysis workload from transaction workload and enables an organization to consolidate data from several sources.

In addition to a relational database, a data warehouse environment includes an extraction, transformation, and loading (ETL) solution, an online analytical processing (OLAP) engine, client analysis tools, and other applications that manage the process of gathering data and delivering it to business users.

Page 4: DWH Concepts Summ

Data Warehouse Properties

- Subject oriented
- Integrated
- Time variant
- Nonvolatile

-- Bill Inmon, Building the Data Warehouse, 1996

Page 5: DWH Concepts Summ

Subject-Oriented

Data is categorized and stored by business subject rather than by application.

OLTP applications: Equity Plans, Shares, Insurance, Savings, Loans
Data warehouse subject: Customer financial information

Page 6: DWH Concepts Summ

Integrated

Constructed by integrating multiple, heterogeneous data sources

Relational databases, flat files, on-line transaction records

Data cleaning and data integration techniques are applied.

Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources

• E.g. Hotel price: currency, tax, breakfast covered, etc.
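
The hotel price example can be sketched in a few lines. This is a minimal illustration only: the record fields, the exchange rates, the tax rate, and the breakfast surcharge are all assumed values, not taken from any real source system.

```python
from dataclasses import dataclass

# Hypothetical conversion rates to one common currency (illustrative values only).
RATES_TO_USD = {"USD": 1.0, "EUR": 1.10}

@dataclass
class HotelPrice:
    source: str
    amount: float
    currency: str
    tax_included: bool
    breakfast_included: bool

def normalize(p, tax_rate=0.10, breakfast_usd=15.0):
    """Return the nightly price in USD, tax included, breakfast excluded."""
    usd = p.amount * RATES_TO_USD[p.currency]   # same currency everywhere
    if not p.tax_included:
        usd *= 1 + tax_rate                     # make every quote tax-inclusive
    if p.breakfast_included:
        usd -= breakfast_usd                    # strip the (assumed) breakfast surcharge
    return round(usd, 2)

# Four sources quoting the same room under different conventions.
quotes = [
    HotelPrice("site_a", 110.0, "USD", True, False),
    HotelPrice("site_b", 100.0, "USD", False, False),
    HotelPrice("site_c", 125.0, "USD", True, True),
    HotelPrice("site_d", 100.0, "EUR", True, False),
]
normalized = [normalize(q) for q in quotes]
```

Once every source is reduced to the same currency, tax, and breakfast convention, the four quotes become directly comparable.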

Page 7: DWH Concepts Summ

Time-Variant

Data is stored as a series of snapshots, each representing a period of time.

Time      Data
Jan-97    January
Feb-97    February
Mar-97    March

Page 8: DWH Concepts Summ

Nonvolatile

Typically, data in the data warehouse is not updated or deleted.

Operational: Insert, Update, Delete, Read
Warehouse: Load, Read
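
A small sketch of the load/read pattern on the warehouse side, combined with the period snapshots from the time-variant slide. The class name, the account identifiers, and the balances are hypothetical; the point is only that the store exposes no update or delete path.

```python
from datetime import date

class SnapshotStore:
    """Append-only store: rows are loaded and read, never updated or deleted."""

    def __init__(self):
        self._rows = []  # each row: (snapshot_date, account_id, balance)

    def load(self, snapshot_date, records):
        # Each load adds a new snapshot; earlier snapshots stay untouched.
        for account_id, balance in records:
            self._rows.append((snapshot_date, account_id, balance))

    def read(self, account_id):
        # History of one account, in load order.
        return [(d, b) for d, a, b in self._rows if a == account_id]

store = SnapshotStore()
store.load(date(1997, 1, 31), [("A1", 100.0)])
store.load(date(1997, 2, 28), [("A1", 250.0)])
history = store.read("A1")
```

Because the February load does not overwrite January, the history of the account can still be analyzed later.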

Page 9: DWH Concepts Summ

Why Separate Data Warehouse?

High performance for both systems:

- DBMS, tuned for OLTP: access methods, indexing, concurrency control, recovery
- Warehouse, tuned for OLAP: complex OLAP queries, multidimensional view, consolidation

Different functions and different data:

- Missing data: decision support requires historical data, which operational DBs do not typically maintain
- Data consolidation: decision support requires consolidation (aggregation, summarization) of data from heterogeneous sources
- Data quality: different sources typically use inconsistent data representations, codes and formats, which have to be reconciled

Page 10: DWH Concepts Summ

Data Warehouse Terminology

Enterprise Data Warehouse: Collects all information about subjects (customers, products, sales, assets, personnel) that span the entire organization.

Data Mart: Departmental subsets that focus on selected subjects.

Decision Support System (DSS): Not a product but an environment in which information technology is used to help the knowledge worker (executive, manager, analyst) make faster and better decisions.

Operational Data Store (ODS): Stores tactical data from production systems that is subject-oriented and integrated to address operational needs.

Online Analytical Processing (OLAP): An element of decision support systems (DSS) that provides analysis of data stored in a database. OLAP tools enable users to analyze different dimensions of multidimensional data.

Page 11: DWH Concepts Summ

Data Granularity

What is the granularity of your DW?

Granularity is the level of detail we want to store in the data warehouse.

For a retail store, the individual Point of Sale (POS) transaction is the finest-grained information available.

For banking, it is account-level detail based on daily transactions.
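
To illustrate the trade-off, the sketch below takes POS-level rows and stores them at a coarser grain (one row per store per day). The row layout and values are hypothetical.

```python
from collections import defaultdict

# Hypothetical POS line items: (store, day, item, amount)
pos_rows = [
    ("S1", "2024-01-01", "tea", 2.0),
    ("S1", "2024-01-01", "coffee", 3.0),
    ("S1", "2024-01-02", "tea", 2.0),
    ("S2", "2024-01-01", "tea", 2.5),
]

def daily_store_totals(rows):
    """Coarsen the grain from one row per line item to one row per store per day."""
    totals = defaultdict(float)
    for store, day, _item, amount in rows:
        totals[(store, day)] += amount
    return dict(totals)

daily = daily_store_totals(pos_rows)
```

The coarser table is smaller, but item-level questions can no longer be answered from it, which is why the grain has to be chosen before the warehouse is loaded.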

Page 12: DWH Concepts Summ

Data Warehouse Versus OLTP

Property            Operational               Data Warehouse
Response time       Sub seconds to seconds    Seconds to hours
Operations          DML                       Primarily read only
Nature of data      30-60 days                Snapshots over time
Data organization   Applications              Subject, time
Size                Small to large            Large to very large
Data source         Operational, internal     Operational, internal, external
Activities          Processes                 Analysis

Page 13: DWH Concepts Summ

Data warehouse Architectures

Page 14: DWH Concepts Summ

Data warehouse Architectures

Page 15: DWH Concepts Summ

Top-Down Versus Bottom-Up Approach

Here are the two basic approaches:

- Top-down: an overall data warehouse feeding dependent data marts
- Bottom-up: several departmental or local data marts combining into a data warehouse

In the first approach, you extract data from the operational systems; you then transform, clean, integrate, and keep the data in the data warehouse.

So, which approach is best in your case, the top-down or the bottom-up approach?

Page 16: DWH Concepts Summ

Top-Down Approach

The advantages of this approach are:

- A truly corporate effort, an enterprise view of data
- Inherently architected, not a union of disparate data marts
- Single, central storage of data about the content
- Centralized rules and control

Page 17: DWH Concepts Summ

Top-Down Approach

The disadvantages are:

- Takes longer to build
- High exposure/risk to failure
- Needs high level of cross-functional skills
- High outlay without proof of concept

Page 18: DWH Concepts Summ

Bottom-Up Approach

The advantages of this approach are:

- Faster and easier implementation of manageable pieces
- Favorable return on investment and proof of concept
- Less risk of failure
- Inherently incremental; can schedule important data marts first

Page 19: DWH Concepts Summ

Bottom-Up Approach

The disadvantages are:

- Each data mart has its own narrow view of data
- Permeates redundant data in every data mart
- Perpetuates inconsistent and irreconcilable data

Page 20: DWH Concepts Summ

Data Warehouses Versus Data Marts

Property              Data Warehouse     Data Mart
Scope                 Enterprise         Department
Subject               Multiple           Single-subject
Data source           Many               Few
Size (typical)        100 GB to >1 TB    <100 GB
Implementation time   Months to years    Months

Page 21: DWH Concepts Summ

Dimensional Model

A dimensional model is a model in which the data is structurally classified as fact or dimension.

General characteristics:

- Query oriented
- Structured around data usage, not business rules
- Organized roughly into base facts and dimensions of those facts
- Based on identification of key grains of data and on characteristics of those grains
- Usually consists of snapshot business data
- Looks to reduce the number and depth of joins

General patterns:

- Star schema: a fact table in the middle connected to a set of dimension tables
- Snowflake schema: a refinement of the star schema in which some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to a snowflake
- Fact constellation: multiple fact tables share dimension tables; viewed as a collection of stars, hence also called a galaxy schema

Page 22: DWH Concepts Summ

Example of Star Schema

Sales Fact Table

- Keys: time_key, item_key, branch_key, location_key
- Measures: units_sold, dollars_sold, avg_sales

Dimension tables

- time: time_key, day, day_of_the_week, month, quarter, year
- item: item_key, item_name, brand, type, supplier_type
- branch: branch_key, branch_name, branch_type
- location: location_key, street, city, province_or_state, country
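
The star schema above can be tried out with an in-memory SQLite database. This is a reduced sketch: only the time and item dimensions are built, the table names `time_dim` and `item_dim` and all rows are made up for illustration, and the other dimensions would join in the same way.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, day INTEGER, month INTEGER,
                       quarter TEXT, year INTEGER);
CREATE TABLE item_dim (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT, type TEXT);
-- The fact table holds one foreign key per dimension plus the measures.
CREATE TABLE sales_fact (
    time_key INTEGER REFERENCES time_dim(time_key),
    item_key INTEGER REFERENCES item_dim(item_key),
    units_sold INTEGER,
    dollars_sold REAL
);
INSERT INTO time_dim VALUES (1, 15, 1, 'Q1', 1997), (2, 20, 4, 'Q2', 1997);
INSERT INTO item_dim VALUES (10, 'Tea', 'BrandA', 'beverage'), (11, 'Coffee', 'BrandB', 'beverage');
INSERT INTO sales_fact VALUES (1, 10, 5, 10.0), (1, 11, 2, 8.0), (2, 10, 3, 6.0);
""")

# A typical star join: one hop from the fact table to each dimension.
rows = conn.execute("""
SELECT t.quarter, i.item_name, SUM(f.dollars_sold)
FROM sales_fact f
JOIN time_dim t ON t.time_key = f.time_key
JOIN item_dim i ON i.item_key = f.item_key
GROUP BY t.quarter, i.item_name
ORDER BY t.quarter, i.item_name
""").fetchall()
```

Every dimension is exactly one join away from the fact table, which is what keeps the number and depth of joins low.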

Page 23: DWH Concepts Summ

Example of Snowflake Schema

Store Fact Table

- Keys: STORE KEY, PRODUCT KEY, PERIOD KEY
- Measures: Dollars, Units, Price

Store Dimension: STORE KEY, Store Description, City, State, District ID, Region_ID, Regional Mgr.
District table: District_ID, District Desc., Region_ID
Region table: Region_ID, Region Desc., Regional Mgr.
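
A reduced sketch of the snowflake layout, again using in-memory SQLite. The table and column names are lower-cased variants of the slide's labels, the rows are invented, and the store table here keeps only the district key (a fully normalized variant, assumed for simplicity).

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE region (region_id INTEGER PRIMARY KEY, region_desc TEXT);
CREATE TABLE district (district_id INTEGER PRIMARY KEY, district_desc TEXT,
                       region_id INTEGER REFERENCES region(region_id));
CREATE TABLE store (store_key INTEGER PRIMARY KEY, store_desc TEXT,
                    district_id INTEGER REFERENCES district(district_id));
CREATE TABLE store_fact (store_key INTEGER REFERENCES store(store_key),
                         dollars REAL, units INTEGER);
INSERT INTO region VALUES (1, 'North'), (2, 'South');
INSERT INTO district VALUES (10, 'D10', 1), (11, 'D11', 1), (20, 'D20', 2);
INSERT INTO store VALUES (100, 'Store A', 10), (101, 'Store B', 11), (200, 'Store C', 20);
INSERT INTO store_fact VALUES (100, 50.0, 5), (101, 30.0, 3), (200, 20.0, 2);
""")

# Reaching the region level walks the normalized hierarchy: store -> district -> region.
by_region = db.execute("""
SELECT r.region_desc, SUM(f.dollars)
FROM store_fact f
JOIN store s ON s.store_key = f.store_key
JOIN district d ON d.district_id = s.district_id
JOIN region r ON r.region_id = d.region_id
GROUP BY r.region_desc
ORDER BY r.region_desc
""").fetchall()
```

Compared with a star schema, the region description is stored once instead of once per store, at the cost of two extra joins.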

Page 24: DWH Concepts Summ

Dimensional Modeling Terminology

A fact table stores measures as well as keys representing relationships to various dimensions.

- Additive: measures that can be added across all dimensions.
- Semi-additive: measures that can be added across some dimensions but not others.
- Non-additive: measures that cannot be added across any dimension.

Dimensions are perspectives with respect to which an organization wants to keep records. They contain textual attributes that describe the facts.
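
The three kinds of measure can be told apart with a small example. The banking rows below are hypothetical; deposits stand in for an additive measure, the account balance for a semi-additive one, and a ratio for a non-additive one.

```python
# Hypothetical daily account snapshots: (day, account, deposits, balance)
rows = [
    ("Mon", "A", 100.0, 100.0),
    ("Mon", "B", 50.0, 50.0),
    ("Tue", "A", 20.0, 120.0),
    ("Tue", "B", 0.0, 50.0),
]

# Additive: deposits can be summed across accounts and across days.
total_deposits = sum(r[2] for r in rows)

# Semi-additive: balances can be summed across accounts for a single day ...
balance_tue = sum(r[3] for r in rows if r[0] == "Tue")
# ... but not across days; over time one takes, for example, the closing value.
closing_balance_a = [r[3] for r in rows if r[1] == "A"][-1]

# Non-additive: a ratio is recomputed from additive parts, never summed.
share_a_tue = 120.0 / balance_tue
```

Summing the balance column over both days would give 320.0, a figure that corresponds to no real amount of money, which is the practical meaning of "semi-additive".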

Page 25: DWH Concepts Summ

In the example, the sales fact table is connected to the dimensions location, product, time and organization. The measure "Sales Dollar" in the sales fact table can be added across all dimensions, independently or in combination, for example:

- Sales Dollar value for a particular product
- Sales Dollar value for a product in a location
- Sales Dollar value for a product in a year within a location
- Sales Dollar value for a product in a year within a location, sold or serviced by an employee

Page 26: DWH Concepts Summ

Conformed Dimension

Dimension tables that adhere to a common structure, and therefore allow queries to be executed across star schemas.

Sales Schema

- Item dimension: Item Key, Item Desc., Brand Desc., Category, ..
- Sales Fact: DATE KEY, ITEM KEY, STORE KEY, PROMO KEY

Inventory Schema

- Item dimension: Item Key, Item Desc., Brand Desc., Category, ..
- Inventory Fact: DATE KEY, ITEM KEY, STORE KEY
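
A sketch of a query across the two schemas, made possible by the shared item dimension. The measures `units_sold` and `units_on_hand` and all rows are assumptions for illustration; the slide does not list the measures of either fact table.

```python
# One shared (conformed) item dimension, used by both fact tables.
item_dim = {
    1: {"item_desc": "Tea", "category": "Beverage"},
    2: {"item_desc": "Apples", "category": "Produce"},
}

sales_fact = [
    {"item_key": 1, "units_sold": 5},
    {"item_key": 2, "units_sold": 7},
    {"item_key": 1, "units_sold": 3},
]
inventory_fact = [
    {"item_key": 1, "units_on_hand": 20},
    {"item_key": 2, "units_on_hand": 4},
]

def total_by_category(fact_rows, measure):
    """Aggregate one fact table by the category attribute of the shared dimension."""
    totals = {}
    for row in fact_rows:
        category = item_dim[row["item_key"]]["category"]
        totals[category] = totals.get(category, 0) + row[measure]
    return totals

sold = total_by_category(sales_fact, "units_sold")
on_hand = total_by_category(inventory_fact, "units_on_hand")

# Line up both results on the shared attribute: (units sold, units on hand).
report = {c: (sold.get(c, 0), on_hand.get(c, 0)) for c in sorted(set(sold) | set(on_hand))}
```

The two results can be placed side by side only because "category" means the same thing, with the same keys, in both schemas.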

Page 27: DWH Concepts Summ

Extraction, Transformation and Load

Purchase specialist tools, or develop programs.

- Extraction: mapping the data between source systems and the target database
- Transformation: validate, clean, integrate, and time stamp data
- Load: loading the transformed data into the target system

OLTP Databases -> Staging File -> Warehouse Database
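
For the "develop programs" option, the three steps might look roughly like this. The staging file contents, the table name `orders_dw`, and the cleaning rules are all hypothetical; a real pipeline would read from actual source systems.

```python
import csv
import io
import sqlite3
from datetime import datetime, timezone

# Hypothetical staging file extracted from an OLTP system.
STAGING_CSV = "order_id,customer,amount\n1, alice ,10.50\n2,BOB,\n3,carol,7.25\n"

def extract(text):
    """Read the staging file into dictionaries keyed by source column name."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows, load_time):
    """Validate, clean, and time stamp the rows."""
    clean = []
    for row in rows:
        if not row["amount"].strip():        # validate: skip rows with no amount
            continue
        clean.append((
            int(row["order_id"]),
            row["customer"].strip().title(),  # clean: consistent name formatting
            float(row["amount"]),
            load_time,                        # time stamp
        ))
    return clean

def load(conn, rows):
    """Append the transformed rows to the target table."""
    conn.execute("CREATE TABLE IF NOT EXISTS orders_dw "
                 "(order_id INTEGER, customer TEXT, amount REAL, loaded_at TEXT)")
    conn.executemany("INSERT INTO orders_dw VALUES (?, ?, ?, ?)", rows)

warehouse = sqlite3.connect(":memory:")
stamp = datetime(2024, 1, 1, tzinfo=timezone.utc).isoformat()
load(warehouse, transform(extract(STAGING_CSV), stamp))
loaded = warehouse.execute(
    "SELECT order_id, customer, amount FROM orders_dw ORDER BY order_id").fetchall()
```

Order 2 is rejected during transformation because it has no amount, so only two rows reach the warehouse table.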

Page 28: DWH Concepts Summ

What is OLAP?

Online Analytical Processing: viewing data in a multidimensional way.

Why OLAP?

- "Slice and dice" for the data warehouse.
- An RDBMS is a two-dimensional way of storing / viewing the data.
- OLAP is a multidimensional way of storing / viewing the data.

Page 29: DWH Concepts Summ

OLAP operations

- Roll up (drill-up): summarize data by climbing up a hierarchy or by dimension reduction
- Drill down (roll down): the reverse of roll-up; go from a higher-level summary to a lower-level summary or detailed data, or introduce new dimensions
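
Roll-up and drill-down along a time hierarchy can be sketched as follows. The year > quarter > month hierarchy and the sales figures are assumed for illustration.

```python
from collections import defaultdict

# Hypothetical detail rows: (year, quarter, month, sales)
detail = [
    (1997, "Q1", "Jan", 10), (1997, "Q1", "Feb", 20), (1997, "Q1", "Mar", 30),
    (1997, "Q2", "Apr", 40),
]
# Number of leading key columns that identify a row at each hierarchy level.
LEVELS = {"year": 1, "quarter": 2, "month": 3}

def summarize(rows, level):
    """Aggregate sales at the requested level of the time hierarchy."""
    depth = LEVELS[level]
    out = defaultdict(int)
    for row in rows:
        out[row[:depth]] += row[-1]
    return dict(out)

by_month = summarize(detail, "month")
by_quarter = summarize(detail, "quarter")   # roll up: month -> quarter
by_year = summarize(detail, "year")         # roll up: quarter -> year
```

Drilling down is the same call in the other direction (for example from `by_year` to `by_quarter`), which works only because the detail rows are still available.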

Page 30: DWH Concepts Summ

OLAP operations

- Slicing: selecting the dimensions of the cube to be viewed. Example: view "Sales volume" as a function of "Product" by "Country" by "Quarter".
- Dicing: specifying the values along one or more dimensions. Example: view "Sales volume" for "Product=PC" by "Country" by "Quarter".
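
The two examples can be reproduced on a tiny in-memory cube. The sketch follows the wording of this slide; other references use "slice" for fixing one dimension to a single value and "dice" for selecting a sub-cube, so the terms should be checked against the tool in use. The fourth dimension ("channel") and all values are hypothetical.

```python
from collections import defaultdict

# Hypothetical cells: (product, country, quarter, channel) -> sales volume
cells = {
    ("PC", "US", "Q1", "web"): 10, ("PC", "US", "Q1", "store"): 5,
    ("PC", "DE", "Q1", "web"): 7,  ("TV", "US", "Q1", "web"): 4,
    ("TV", "DE", "Q2", "store"): 6,
}
DIMS = ("product", "country", "quarter", "channel")

def view(cells, keep, where=None):
    """Aggregate sales volume over the `keep` dimensions, optionally filtered by `where`."""
    where = where or {}
    idx = [DIMS.index(d) for d in keep]
    out = defaultdict(int)
    for key, volume in cells.items():
        if all(key[DIMS.index(d)] == v for d, v in where.items()):
            out[tuple(key[i] for i in idx)] += volume
    return dict(out)

# First example: choose the dimensions to view (channel is aggregated away).
by_product_country_quarter = view(cells, ("product", "country", "quarter"))
# Second example: additionally pin Product = PC.
pc_only = view(cells, ("country", "quarter"), where={"product": "PC"})
```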

Page 31: DWH Concepts Summ

Types of OLAP

Three types of OLAP are used in the industry:

1. MOLAP – Multidimensional OLAP (e.g. MSOLAP, Essbase, Cognos).
2. ROLAP – Relational OLAP (e.g. Business Objects, MicroStrategy).
3. HOLAP – Hybrid OLAP

Page 32: DWH Concepts Summ

Architecture diagram of ROLAP

[Diagram: Data Warehouse or Data Mart -> App Server running ROLAP tools (BO, Cognos, MicroStrategy, etc.) with BI Metadata -> OLAP Report 1, OLAP Report 2, ..., OLAP Report n]

When a report is executed by the end user, the actual SQL is issued to the RDBMS to get the data. Some BI tools can also store the result set in the application server and periodically refresh that report based on the data refreshes that happen in the DW.
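
The ROLAP flow can be imitated with SQLite standing in for the RDBMS. This is a toy model, not how any particular BI product works: `run_report`, `refresh_cache`, and the `ALLOWED_COLUMNS` set (a stand-in for BI metadata) are hypothetical names.

```python
import sqlite3

dw = sqlite3.connect(":memory:")
dw.executescript("""
CREATE TABLE sales (country TEXT, quarter TEXT, dollars REAL);
INSERT INTO sales VALUES ('US', 'Q1', 10.0), ('US', 'Q2', 5.0), ('DE', 'Q1', 7.0);
""")

ALLOWED_COLUMNS = {"country", "quarter"}   # stand-in for BI metadata
_cache = {}                                # result sets kept on the "app server"

def run_report(group_by):
    """Generate SQL from the report definition and run it, unless a cached result exists."""
    if group_by not in ALLOWED_COLUMNS:
        raise ValueError(f"unknown column: {group_by}")
    if group_by in _cache:
        return _cache[group_by]
    sql = (f"SELECT {group_by}, SUM(dollars) FROM sales "
           f"GROUP BY {group_by} ORDER BY {group_by}")
    _cache[group_by] = dw.execute(sql).fetchall()
    return _cache[group_by]

def refresh_cache():
    """Called after a DW refresh so that reports are recomputed."""
    _cache.clear()

by_country = run_report("country")
dw.execute("INSERT INTO sales VALUES ('DE', 'Q2', 3.0)")   # a DW refresh adds data
stale = run_report("country")    # served from the stored result set
refresh_cache()
fresh = run_report("country")    # SQL is issued to the database again
```

The stored result set stays unchanged until the refresh, which is why such caches are tied to the warehouse load schedule.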

Page 33: DWH Concepts Summ

Architecture diagram of MOLAP

[Diagram: Data Warehouse or Data Mart -> Microsoft Analysis Services with BI Metadata (cube definitions, etc.) -> MOLAP cubes -> OLAP Report 1, OLAP Report 2, ..., OLAP Report n]

When a report is executed by the end user, the data is retrieved from the MOLAP cubes using MDX queries generated from the report. MDX stands for Multidimensional Expressions. SQL is used to get data from an RDBMS; MDX is used to get data from MOLAP. The MOLAP cubes are refreshed periodically based on the data refreshes that happen in the DW.
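
The idea behind MOLAP, answering reports from pre-aggregated cubes instead of the base tables, can be sketched in plain Python. This is not MDX and not the Analysis Services storage format; `build_cube` and `query` are hypothetical helpers and the fact rows are invented.

```python
from collections import defaultdict
from itertools import combinations

DIMS = ("product", "country")

def build_cube(fact_rows):
    """Precompute SUM(sales) for every subset of dimensions, including the grand total."""
    cube = defaultdict(float)
    for row in fact_rows:
        for n in range(len(DIMS) + 1):
            for dims in combinations(DIMS, n):
                key = (dims, tuple(row[d] for d in dims))
                cube[key] += row["sales"]
    return dict(cube)

def query(cube, **members):
    """Answer a report by lookup only; the fact rows are not touched."""
    dims = tuple(d for d in DIMS if d in members)
    return cube.get((dims, tuple(members[d] for d in dims)), 0.0)

facts = [
    {"product": "PC", "country": "US", "sales": 10.0},
    {"product": "PC", "country": "DE", "sales": 7.0},
    {"product": "TV", "country": "US", "sales": 4.0},
]
cube = build_cube(facts)   # done at refresh time, not at report time

pc_total = query(cube, product="PC")
us_total = query(cube, country="US")
grand_total = query(cube)
```

Rebuilding the cube after each warehouse load corresponds to the periodic refresh described above; between refreshes, reports see the precomputed values.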

Page 34: DWH Concepts Summ

Terminology

Cube – A cube is a multidimensional structure of data. Cubes are defined by a set of dimensions and measures.

Page 35: DWH Concepts Summ

Terminology

Dimension – A structural attribute of a cube that acts as an index for identifying values within a multidimensional array. If all dimensions have a single member selected, then a single cell is defined.

[Diagram: cube with axes Time, Products, Location]
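
A minimal sketch of dimensions acting as an index: the cube is modeled as a mapping from one member per dimension to a measure value. Members and values are hypothetical.

```python
# Hypothetical dimensions and their members.
DIMENSIONS = {
    "time": ["January", "February"],
    "product": ["Coffee", "Tea"],
    "location": ["China", "Peru"],
}
# One cell per (time, product, location) combination that holds data.
cube = {
    ("January", "Tea", "Peru"): 1.95,
    ("January", "Coffee", "China"): 2.50,
    ("February", "Tea", "Peru"): 2.10,
}

def cell(time, product, location):
    """One member per dimension identifies exactly one cell (None if it is empty)."""
    for name, member in (("time", time), ("product", product), ("location", location)):
        if member not in DIMENSIONS[name]:
            raise KeyError(f"{member!r} is not a member of dimension {name!r}")
    return cube.get((time, product, location))

value = cell("January", "Tea", "Peru")
empty = cell("February", "Coffee", "China")
```

Leaving one dimension unspecified would select a row of cells rather than a single cell, which is where the slicing and roll-up operations come in.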

Page 36: DWH Concepts Summ

Terminology

Measures – Numeric data of interest, e.g. Revenue per Sale, Quantity.

[Diagram: cube with axes Time (January, February, March, April), Products (Coffee, Apples, Tea, Onions) and Location (China, Peru, Japan, Italy); one cell shows the value €1.95]

Page 37: DWH Concepts Summ

Summary

This session covered the following topics:

- What is a Data Warehouse?
- Difference between OLTP & DW
- Data warehouse architecture and approach
- Dimensional Modeling
- What is OLAP?

Page 38: DWH Concepts Summ

Questions ?

Page 39: DWH Concepts Summ

Thank You.