Data Warehousing Concepts By Sathish Yellanki

Sunday, August 31, 2014 Data Warehouse Concepts By Sathish Yellanki Slide No : 1 Data Warehousing Concepts

Dimensional Data Model
• The Dimensional Data Model is Commonly Used in Data Warehousing Systems.
• The Two Common Schema Types Are
  • Star Schema
  • Snowflake Schema

Slowly Changing Dimension
• Slowly Changing Dimensions Are A Common Issue in The Data Warehousing Development Process.

Conceptual Data Model
• A Data Warehouse Specialist Should Be Very Familiar With The Concept of A Conceptual Data Model.

Logical Data Model
• A Data Warehouse Specialist Should Be Very Clear About The Concepts And Process of A Logical Data Model.

Physical Data Model
• A Data Warehouse Specialist Should Be Very Clear About The Concept And Process of Developing The Physical Data Model.

Compare Conceptual, Logical, And Physical Data Model
• A Data Warehouse Specialist Should Be Familiar With The Different Levels of Abstraction of A Data Model.

Data Integrity
• A Data Warehouse Specialist Should Be Clear About "What is Data Integrity" And How it is Enforced in Data Warehousing.

What is OLAP?
• All Data Warehousing Experts Should Be Familiar With The Definition of OLAP.

MOLAP, ROLAP, And HOLAP
• A Data Warehousing Specialist Should Clearly Understand The Different Types of OLAP Technology.

Bill Inmon Vs. Ralph Kimball Process
• A Data Warehousing Specialist Should Know The Difference of Opinion About The Roles of The DWH And The Data Mart.
• The Question is Whether The Direction of Development Should Be DWH To Data Mart OR Vice Versa.

Factless Fact Table
• A Data Warehousing Specialist Should Understand The Use of A Fact Table Without Any Facts.

Junk Dimension
• A Data Warehousing Specialist Should Be Clear About The Concept of A Junk Dimension, When To Use it, And Why And Where it is Useful.

Conformed Dimension
• A Data Warehouse Specialist Should Be Clear About The Concept of A Conformed Dimension.
• The Specialist Should Understand in Detail What A Conformed Dimension is And Why it is Important.

Dimensional Data Model

• The Dimensional Data Model is The Model Most Often Used in Building Data Warehousing Systems.
• Dimensional Modeling is Different From Third Normal Form (3NF), The Standard Commonly Used For Transactional (OLTP) Systems.

Jargon For Dimensional Modeling

Dimension
• A Dimension Represents A Specific Category of Information As A Single Collection.
• The Dimension is Planned According To The Subject of Analysis For Which it is Chosen.

Attribute
• An Attribute is A Unique Level Within The Dimension.
• An Attribute in A Dimensional Model Always Belongs To A Hierarchy OR Level.

Hierarchy
• A Hierarchy is The Specification of Levels That Represents The Relationship Between Different Attributes Within A Dimension.

Fact Table
• A Fact Table is A Table That Contains The Measures of Interest Specific To The Subject of Analysis.
• A Fact Table May Hold Measures Aggregated To A Chosen Level of Granularity.
• A Fact Table is A Collection of Measures of The Subject of Analysis, Integrated With The Associated Dimensions.

Lookup Table
• The Lookup Table Provides The Detailed Information About The Attributes.
• The Lookup Table Keeps The Information That is Essential To Describe The Attribute More Fully, As Required By The Analysis.
• The Lookup Table For Any Attribute Includes A List of All The Descriptive Details Available in The Data Warehouse.

Common Points To Consider
• A Dimensional Model Includes A Collection of Fact Tables And Lookup Tables.
• Fact Tables Connect To One OR More Lookup Tables, But Fact Tables Do Not Have Direct Relationships To One Another.
• Dimensions And Hierarchies Are Represented By Lookup Tables.
• Attributes Are The Non-Key Columns in The Lookup Tables.

Data Models For Data Warehouses / Data Marts
• The Most Commonly Used Schema Types in Data Warehousing Are
  • Star Schema
  • Snowflake Schema
• Using A Star OR A Snowflake Schema Largely Depends on Personal Preference And Business Needs.

• A Snowflake Schema is Usually The Better Choice When There is A Business Case To Analyze The Information At A Particular Level of The Hierarchy.

What is Meant By Granularity?
• Granularity Refers To The Level of Detail of The Data Stored in The Fact Tables in A Data Warehouse.
• High Granularity Refers To Data That is At OR Near The Transaction Level.
• Data That is At The Transaction Level is Usually Called Atomic Level Data.
• Low Granularity Refers To Data That is Summarized OR Aggregated, Usually From The Atomic Level Data.
• Summarized Data Can Be Lightly Summarized, As in Daily OR Weekly Summaries, OR Highly Summarized, As in Yearly Averages And Totals.

What is Meant By Fact Table Granularity?
• The First Step in Designing A Fact Table is To Determine The Granularity of The Fact Table.
• Fact Table Granularity Decides The Lowest Level of Information That Will Be Stored in The Fact Table.
• Fact Table Granularity Depends on The Type And The Number of Dimensions That Are Included in The Schema.
• Fact Table Granularity Also Determines The Density of The Measures.
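The difference between atomic and summarized data can be sketched in a few lines of Python. This is only an illustration: the sample rows, the field names and the `summarize` helper are assumptions, not part of the original slides.

```python
from collections import defaultdict
from datetime import date

# Hypothetical atomic-level rows: one row per sale (transaction level).
atomic_sales = [
    {"sale_date": date(2014, 8, 30), "store": "S1", "amount": 120.0},
    {"sale_date": date(2014, 8, 30), "store": "S1", "amount": 80.0},
    {"sale_date": date(2014, 8, 31), "store": "S1", "amount": 50.0},
    {"sale_date": date(2014, 8, 31), "store": "S2", "amount": 200.0},
]

def summarize(rows, key_func):
    """Roll atomic rows up to the grain defined by key_func."""
    totals = defaultdict(float)
    for row in rows:
        totals[key_func(row)] += row["amount"]
    return dict(totals)

# Lightly summarized: one row per day and store.
daily_by_store = summarize(atomic_sales, lambda r: (r["sale_date"], r["store"]))
# Highly summarized: one row per year.
yearly = summarize(atomic_sales, lambda r: r["sale_date"].year)
```

Note that the yearly totals can always be rebuilt from the atomic rows, but not the other way round, which is why the grain decision comes first.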

What Constitutes The Fact Table Granularity?
• Determining Fact Table Granularity Involves Two Steps
  • Determine Which Dimensions Will Be Included.
  • Determine Where Along The Hierarchy of Each Dimension The Information Will Be Kept.
• The Determining Factors of Fact Table Granularity Usually Go Back To The Requirements Phase.

Which Dimensions Should We Include?
• Determining Which Dimensions To Include in The Data Warehouse is Usually A Straightforward Process, As Business Processes Will Often Dictate Clearly What The Relevant Dimensions Are.

What Level Should Be Included Within Each Dimension?
• Determining Where Along The Hierarchy of Each Dimension The Information is Stored is Not An Exact Science.
• The Level of The Dimension is Dictated Primarily By User Requirements.
• Sometimes The Users Will Not Specify Certain Requirements, But Based on Industry Knowledge, The Data Warehousing Team Must Foresee Those Requirements And Include Them.
• It is Prudent For The Data Warehousing Team To Design The Fact Table Such That Lower-Level Information is Included, To Avoid Re-Design of The Fact Table in The Future.
• Choosing The Level of A Dimension is More of An Art Than A Science.

What is A Fact Table?
• A Fact Table Consists of The Measurements, Metrics OR Facts of A Business Process. It is Located At The Center of A Star Schema OR A Snowflake Schema, Surrounded By Dimension Tables.
• A Fact Table Stores Quantitative Information For Analysis And is Often De-Normalized.
• A Fact Table Typically Has Two Types of Columns
  • Columns Containing Facts
  • Foreign Keys To Dimension Tables
• The Primary Key of A Fact Table is Usually A Composite Key That is Made Up of All of its Foreign Keys.
• Fact Tables Store Different Types of Measures
  • Additive Measures
  • Semi-Additive Measures
  • Non-Additive Measures

Types of Facts

Additive Facts
• Additive Facts Are Facts That Can Be Summed Up Through All of The Dimensions in The Fact Table.

Semi-Additive Facts
• Semi-Additive Facts Are Facts That Can Be Summed Up For Some of The Dimensions in The Fact Table, But Not The Others.

Non-Additive Facts
• Non-Additive Facts Are Facts That Cannot Be Summed Up For Any of The Dimensions Present in The Fact Table.
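A small Python sketch may help to separate the three kinds of facts. The bank-style rows and the revenue/cost figures below are invented for illustration; deposits stand in for an additive fact, end-of-day balances for a semi-additive one, and a margin ratio for a non-additive one.

```python
# Hypothetical snapshot rows: (day, account, deposit_amount, end_of_day_balance)
rows = [
    ("Mon", "A", 100.0, 100.0),
    ("Mon", "B", 50.0, 50.0),
    ("Tue", "A", 20.0, 120.0),
    ("Tue", "B", 0.0, 50.0),
]

# Additive: deposits can be summed across every dimension (day and account).
total_deposits = sum(r[2] for r in rows)

# Semi-additive: balances can be summed across accounts for a single day ...
tuesday_balance = sum(r[3] for r in rows if r[0] == "Tue")
# ... but summing the same balances across days counts the same money twice.
misleading_total = sum(r[3] for r in rows)

# Non-additive: a ratio has to be recomputed from its additive parts, not summed.
revenue, cost = [200.0, 100.0], [150.0, 50.0]
margins = [(rv - c) / rv for rv, c in zip(revenue, cost)]
overall_margin = (sum(revenue) - sum(cost)) / sum(revenue)
```

Here the sum of the individual margins differs from the overall margin, which is the practical sign of a non-additive fact.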

Types of Fact Tables

Cumulative Fact Table
• A Cumulative Fact Table Describes What Has Happened Over A Period of Time.
• The Facts in This Type of Fact Table Are Mostly Additive Facts.

Snapshot Fact Table
• A Snapshot Fact Table Describes The State of Things At A Particular Instant in Time, And Usually Includes More Semi-Additive And Non-Additive Facts.

What is Star Schema?
• In The Star Schema Design, A Single Object, The Fact Table, Sits in The Middle And is Radially Connected To The Surrounding Objects, Which Are Dimension Lookup Tables, Like A Star.
• Each Dimension is Represented As A Single Table.
• The Primary Key in Each Dimension Table is Related To A Foreign Key in The Fact Table.
• All Measures in The Fact Table Are Related To All The Dimensions With Which The Fact Table is Related.
• All The Measures Will Have The Same Level of Granularity.
• A Star Schema Can Be Simple OR Complex. A Simple Star Consists of One Fact Table, And A Complex Star Can Have More Than One Fact Table, With Measures Integrated on The Dimensions.
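A simple star can be sketched with Python's built-in sqlite3 module. The table and column names (`dim_date`, `dim_product`, `fact_sales`) and the sample rows are assumptions made up for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, calendar_date TEXT, year INTEGER);
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, product_name TEXT);
-- The fact table holds the measure plus one foreign key per dimension;
-- its primary key is the composite of those foreign keys.
CREATE TABLE fact_sales (
    date_key     INTEGER NOT NULL REFERENCES dim_date(date_key),
    product_key  INTEGER NOT NULL REFERENCES dim_product(product_key),
    sales_amount REAL NOT NULL,
    PRIMARY KEY (date_key, product_key)
);
INSERT INTO dim_date VALUES (1, '2014-08-30', 2014), (2, '2014-08-31', 2014);
INSERT INTO dim_product VALUES (10, 'Widget'), (11, 'Gadget');
INSERT INTO fact_sales VALUES (1, 10, 100.0), (2, 10, 50.0), (2, 11, 75.0);
""")

# A typical star query: join the fact table to a dimension and aggregate.
rows = conn.execute("""
    SELECT p.product_name, SUM(f.sales_amount)
    FROM fact_sales f
    JOIN dim_product p ON p.product_key = f.product_key
    GROUP BY p.product_name
    ORDER BY p.product_name
""").fetchall()
```

Each dimension is one table and every fact row sits at the same grain, one row per date and product.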

What is A Snowflake Schema?
• The Snowflake Schema is An Extension of The Star Schema, Where Each Point of The Star Explodes into More Points.
• In A Snowflake Schema, The Dimension Table is Normalized into Multiple Lookup Tables, Each Representing A Level in The Dimensional Hierarchy.

Advantage
• Possible Improvement in Query Performance Due To Minimized Disk Storage Requirements And Joining Smaller Lookup Tables.

Disadvantage
• Additional Maintenance Effort is Needed Due To The Increased Number of Lookup Tables.

What is A Slowly Changing Dimension?
• Slowly Changing Dimensions Are Dimensions That Change Slowly Over Time, Rather Than Changing on A Regular, Time-Based Schedule.
• In A Data Warehouse There is A Need To Track Changes in Dimensional Attributes in Order To Report Historical Data.
• Slowly Changing Dimensions Can Be Implemented in Multiple Ways. Implementing One of The SCD Types Should Enable Users To Assign The Proper Dimensional Attribute Value For A Given Date.
• Not All Dimensions Are Suitable For SCD Standards.

• Slowly Changing Dimension Applies To Cases Where The Attribute For A Record Varies Over Time.

Types of Slowly Changing Dimensions

SCD Type 0
• The SCD Type 0 Method is A Passive Method.
• When Dimensional Values Change, No Action is Performed.
• The Values in The Dimension Remain As They Were At The Time The Dimension Record Was First Inserted.
• In Certain Circumstances History is Preserved With Type 0, But Type 0 Provides The Least OR No Control Over The History.

SCD Type 1
• SCD Type 1 Methodology Overwrites Old Data With New Data, And Therefore Does Not Track Historical Data.
• SCD Type 1 Methodology is Used When There is No Need To Store Historical Data in The Dimension Table.
• SCD Type 1 is Used To Correct Data Errors in The Dimension.
• Usage is 50% in The Development of Data Warehouses.

Advantage
• SCD Type 1 is Easy To Maintain.

Disadvantage
• There is No History in The Data Warehouse.

Illustrative Example Original Dimension

Supplier_Key  Supplier_Code  Supplier_Name   Supplier_State
123           ABC            Acme Supply Co  CA

Illustrative Example Changed Dimension SCD Type 1

Supplier_Key  Supplier_Code  Supplier_Name   Supplier_State
123           ABC            Acme Supply Co  IL
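The Type 1 overwrite can be sketched as an in-place update. The dictionary layout and the `apply_scd_type1` helper are hypothetical; only the supplier values come from the example.

```python
# Dimension keyed by surrogate key; the row mirrors the supplier example.
supplier_dim = {
    123: {"supplier_code": "ABC", "supplier_name": "Acme Supply Co", "supplier_state": "CA"},
}

def apply_scd_type1(dim, surrogate_key, changes):
    """Overwrite attributes in place; the old values are lost."""
    dim[surrogate_key].update(changes)

apply_scd_type1(supplier_dim, 123, {"supplier_state": "IL"})
```

After the call there is still exactly one row for key 123, and nothing records that the state was ever CA.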

SCD Type 2
• The SCD Type 2 Method Tracks Historical Data By Creating Multiple Records For A Given Natural Key in The Dimension Table, With Separate Surrogate Keys AND/OR Different Version Numbers.
• With SCD Type 2, Unlimited History is Preserved, Because Each Change is Stored As A New Insert.
• Usage is 50% in The Development of Data Warehouses.

Methods of Implementing SCD Type 2

Method 1
• Add One Extra Column For Managing Version Numbers.
• The Version Column is Incremented Sequentially For Each Change That Takes Place on The Dimensional Value.

Illustrative Example Original Dimension

Supplier_Key  Supplier_Code  Supplier_Name   Supplier_State
123           ABC            Acme Supply Co  CA

Illustrative Example Changed Dimension SCD Type 2 (Method 1)

Supplier_Key  Supplier_Code  Supplier_Name   Supplier_State  Version
123           ABC            Acme Supply Co  CA              0
124           ABC            Acme Supply Co  IL              1

Method 2
• Add Extra Columns That Manage The 'Effective Date' (Start_Date And End_Date in The Example).
• The 'Effective Date' Column Registers The Date on Which The New Change Takes Effect.

Illustrative Example Changed Dimension SCD Type 2 (Method 2)

Supplier_Key  Supplier_Code  Supplier_Name   Supplier_State  Start_Date   End_Date
123           ABC            Acme Supply Co  CA              01-Jan-2000  21-Dec-2004
124           ABC            Acme Supply Co  IL              22-Dec-2004

Advantage
• SCD Type 2 Accurately Keeps All The Historical Information.

Disadvantage
• SCD Type 2 Will Cause The Size of The Table To Grow Fast.
• For A Table With Many Rows, Storage And Performance Can Become A Concern.
• SCD Type 2 Necessarily Complicates The ETL Process.
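The date-based variant of Type 2 can be sketched as "close the current row, then append a new one". The list-of-dicts layout, the `apply_scd_type2` helper and the rule of ending the old row one day before the new start date are assumptions chosen to match the supplier example.

```python
from datetime import date, timedelta

supplier_dim = [
    {"supplier_key": 123, "supplier_code": "ABC", "supplier_name": "Acme Supply Co",
     "supplier_state": "CA", "start_date": date(2000, 1, 1), "end_date": None},
]

def apply_scd_type2(dim, natural_key, changes, effective):
    """Close the current row for natural_key and append a new version."""
    current = next(r for r in dim
                   if r["supplier_code"] == natural_key and r["end_date"] is None)
    current["end_date"] = effective - timedelta(days=1)
    new_row = {**current, **changes,
               "supplier_key": max(r["supplier_key"] for r in dim) + 1,
               "start_date": effective, "end_date": None}
    dim.append(new_row)
    return new_row

apply_scd_type2(supplier_dim, "ABC", {"supplier_state": "IL"}, date(2004, 12, 22))
```

The extra bookkeeping (finding the open row, issuing a new surrogate key) is a small taste of why Type 2 complicates the ETL process.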

SCD Type 3
• The SCD Type 3 Method Tracks Changes Using Separate Columns And Preserves Limited History.
• The History is Limited To The Number of Columns Designated For Storing Historical Data.
• The Original Table Structure in Type 1 And Type 2 is The Same, But Type 3 Adds Additional Columns.
• We Can Have One Additional Column That Specifies When The Change Took Effect.
• SCD Type 3 is Rarely Used in Actual Practice.

Illustrative Example Original Dimension

Supplier_Key  Supplier_Code  Supplier_Name   Supplier_State
123           ABC            Acme Supply Co  CA

Illustrative Example Changed Dimension SCD Type 3

Supplier_Key  Supplier_Code  Supplier_Name   Original_Supplier_State  Effective_Date  Current_Supplier_State
123           ABC            Acme Supply Co  CA                       22-Dec-2004     IL

Advantage
• SCD Type 3 Does Not Increase The Size of The Table, Since New Information is Updated in Place.
• SCD Type 3 Allows Us To Keep Some Part of The History.

Disadvantage
• SCD Type 3 Will Not Be Able To Keep All History Where An Attribute is Changed More Than Once.
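The Type 3 limitation shows up as soon as a second change arrives. In this sketch the row layout follows the changed-dimension example, while the `apply_scd_type3` helper and the second change to "NY" are invented for illustration.

```python
from datetime import date

supplier = {"supplier_key": 123, "supplier_code": "ABC",
            "supplier_name": "Acme Supply Co",
            "original_supplier_state": "CA",
            "effective_date": None,
            "current_supplier_state": "CA"}

def apply_scd_type3(row, new_state, effective):
    """Update the 'current' column only; the 'original' column keeps the first value."""
    row["current_supplier_state"] = new_state
    row["effective_date"] = effective

apply_scd_type3(supplier, "IL", date(2004, 12, 22))
# Hypothetical second change: the intermediate value "IL" is now lost.
apply_scd_type3(supplier, "NY", date(2010, 1, 1))
```

The row still knows the original state and the current state, but no column remembers the value in between.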

SCD Type 4
• The SCD Type 4 Method Uses "History Tables", Where One Table Keeps The Current Data, And An Additional Table is Used To Keep A Record of Some OR All Changes.
• In SCD Type 4 Both The Surrogate Keys Are Referenced in The Fact Table To Enhance Query Performance.
• The SCD Type 4 Method Resembles How Database Audit Tables And Change Data Capture Techniques Function.

Illustrative Example Original Dimension

Supplier_Key  Supplier_Code  Supplier_Name   Supplier_State
123           ABC            Acme Supply Co  CA

Illustrative Example Supplier Current Table

Supplier_Key  Supplier_Code  Supplier_Name   Supplier_State
123           ABC            Acme Supply Co  IL

Illustrative Example Supplier History Table

Supplier_Key  Supplier_Code  Supplier_Name   Supplier_State  Create_Date
123           ABC            Acme Supply Co  CA              22-Dec-2004
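The current-table-plus-history-table idea can be sketched as follows. The two in-memory structures and the `apply_scd_type4` helper are assumptions; the values follow the supplier example.

```python
from datetime import date

current_table = {123: {"supplier_code": "ABC", "supplier_name": "Acme Supply Co",
                       "supplier_state": "CA"}}
history_table = []  # one row per superseded version

def apply_scd_type4(current, history, key, changes, change_date):
    """Copy the outgoing row to the history table, then overwrite the current row."""
    history.append({"supplier_key": key, **current[key], "create_date": change_date})
    current[key] = {**current[key], **changes}

apply_scd_type4(current_table, history_table, 123,
                {"supplier_state": "IL"}, date(2004, 12, 22))
```

Queries that only need the latest values read the small current table, while audits read the history table.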

Data Modeling (Conceptual, Logical, And Physical Data Models)

• There Are Three Levels of Data Modeling Standards
  • Conceptual Data Model
  • Logical Data Model
  • Physical Data Model

Conceptual Data Model
• A Conceptual Data Model Identifies The Highest-Level Relationships Between The Different Entities.

Features of Conceptual Data Model
• Enterprise-Wide Coverage of The Business Concepts
  • Customer
  • Product
  • Store
  • Location
  • Asset
• Designed And Developed Primarily For A Business Audience.
• Contains Around 20-50 Entities OR Concepts With No OR Extremely Limited Number of Attributes Described.
• Contains Relationships Between Entities, But May OR May Not Include Cardinality And Nullability.
• Entities Will Have Definitions.
• Designed And Developed To Be Independent of DBMS, Data Storage Locations OR Technologies.
• The Model Addresses Digital And Non-Digital Concepts.
• Includes The Important Entities And The Relationships Among Them.
• No Attribute is Specified, And No Primary Key is Specified.

Logical Data Model
• A Logical Data Model is A Fully-Attributed Data Model That is Independent of
  • DBMS
  • Technology
  • Data Storage
  • Organizational Constraints

• A Logical Data Model Typically Describes Data Requirements From The Business Point of View.
• It Has No Requirement That The Resulting Data Implementations Must Be Created Using Relational Technologies.

Features of A Logical Data Model
• Typically Describes Data Requirements For A Single Project OR Major Subject Area.
• May Be Integrated With Other Logical Data Models Via A Repository of Shared Entities.
• Typically Contains 100-1000 Entities, Although These Numbers Are Highly Variable Depending on The Scope of The Data Model.
• Contains Relationships Between Entities That Address Cardinality And Nullability of The Relationships.
• Designed And Developed To Be Independent of DBMS, Data Storage Locations OR Technologies.
• Data Attributes Will Typically Have Datatypes With Precisions And Lengths Assigned, And Nullability (Optionality) Assigned.
• Entities And Attributes Will Have Definitions.
• All Kinds of Other Meta Data May Be Included, Such As Retention Rules, Privacy Indicators, Volumetrics And Data Lineage.
• The Diagram of A Logical Data Model May Show Only A Tiny Percentage of The Meta Data Contained Within The Model.
• A Logical Data Model Will Normally Be Derived From AND/OR Linked Back To Objects in A Conceptual Data Model.

Steps For Designing The Logical Data Model
• Specify Primary Keys For All Entities.
• Find The Relationships Between Different Entities.
• Find All Attributes For Each Entity.
• Resolve Many-To-Many Relationships.
• Normalization.

Differences Between Conceptual And Logical Data Models
• In A Logical Data Model, Primary Keys Are Present, Whereas in A Conceptual Data Model, Primary Keys Are Not Present.
• In A Logical Data Model, All Attributes Are Specified Within An Entity. A Conceptual Data Model Does Not Specify Attributes.
• Relationships Between Entities Are Specified Using Primary Keys And Foreign Keys in A Logical Data Model. In A Conceptual Data Model, The Relationships Are Simply Stated, Not Specified, So We Know That Two Entities Are Related, But We Do Not Specify Which Attributes Are Used For The Relationship.

Physical Data Model
• A Physical Data Model is A Fully-Attributed Data Model That is Dependent Upon A Specific Version of A Data Persistence Technology.
• The Target Implementation Technology May Be
  • A Relational DBMS
  • An XML Document
  • A NoSQL Data Storage Component
  • A Spreadsheet
  • Another Data Implementation Option

Features of A Physical Data Model
• A Physical Data Model Typically Describes Data Requirements For A Single Project OR Application, OR Sometimes Even A Portion of An Application.

• May Be Integrated With Other Physical Data Models Via A Repository of Shared Entities

• Typically Contains 10-1000 Tables, Although These Numbers Are Highly Variable Depending on The Scope of The Data Model.

• Contains Relationships Between Tables That Address Cardinality And Nullability of The Relationships.

• Designed And Developed To Be Dependent on A Specific Version of A DBMS, Data Storage Location OR Technology.

• Columns Will Have Datatypes With Precisions And Lengths Assigned.

• Columns Will Have Nullability Assigned.
• Tables And Columns Will Have Definitions.
• Denormalization May Occur Based on User Requirements.
• A Physical Data Model Includes Other Physical Objects Such As
  • Views
  • Primary Key Constraints
  • Foreign Key Constraints
  • Indexes
  • Security Roles
  • Stored Procedures
  • XML Extensions
  • File Stores

• The Diagram of A Physical Data Model May Show Only A Tiny Percentage of The Meta Data Contained Within The Model.

The Steps For Physical Data Model
• Convert Entities into Tables.
• Convert Relationships into Foreign Keys.
• Convert Attributes into Columns.
• Modify The Physical Data Model Based on Physical Constraints OR Requirements.

Differences Between Logical And Physical Data Models
• Entity Names Are Now Table Names.
• Attributes Are Now Column Names.
• The Data Type For Each Column is Specified.
• Data Types Can Be Different Depending on The Actual Database Being Used.

Cross Comparison of All Models

Feature               Conceptual  Logical  Physical
Entity Names          ✓           ✓
Entity Relationships  ✓           ✓
Attributes                        ✓
Primary Keys                      ✓        ✓
Foreign Keys                      ✓        ✓
Table Names                                ✓
Column Names                               ✓
Column Data Types                          ✓

A View on Data Integrity

• Data Integrity Refers To The Validity of Data, Which Concentrates on Data Consistency And Correctness.
• In A Data Warehouse OR A Data Mart, There Are Three Areas Where Data Integrity Needs To Be Enforced
  • Database Level
  • ETL Process Level
  • Access Level

Database Level Integrity

Referential Integrity
• The Relationship Between The Primary Key of One Table And The Foreign Key of Another Table Must Always Be Maintained.

Primary Key / Unique Constraint
• Primary Keys And Unique Constraints Are Used To Make Sure Every Row in A Table Can Be Uniquely Identified.

Not Null Versus Nullable
• Columns Identified As Not Null Cannot Have A Null Value.

Valid Values
• Only Allowed Values Are Permitted in The Database.

ETL Process Level Integrity
• For Each Step of The ETL Process, Data Integrity Checks Should Be Put in Place To Ensure That The Source Data is The Same As The Data in The Destination.
• The Most Common Checks Include Record Counts OR Record Sums.
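A record-count and record-sum check for one ETL step might look like the sketch below. The `reconcile` helper, the field names and the sample rows are hypothetical, and a real pipeline would normally run such checks against the source and target databases rather than in-memory lists.

```python
def reconcile(source_rows, target_rows, amount_field):
    """Compare record counts and a record sum between source and destination."""
    checks = {
        "count_matches": len(source_rows) == len(target_rows),
        "sum_matches": abs(
            sum(r[amount_field] for r in source_rows)
            - sum(r[amount_field] for r in target_rows)
        ) < 1e-9,
    }
    checks["ok"] = all(checks.values())
    return checks

source = [{"id": 1, "amount": 10.0}, {"id": 2, "amount": 15.5}]
loaded = [{"id": 1, "amount": 10.0}, {"id": 2, "amount": 15.5}]
dropped = loaded[:1]  # simulates a row lost during the load

good = reconcile(source, loaded, "amount")
bad = reconcile(source, dropped, "amount")
```

Counts catch lost or duplicated rows, and sums catch values that were altered while the row count stayed the same.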

Access Level
• We Need To Ensure That Data is Not Altered By Any Unauthorized Means, Either During The ETL Process OR in The Data Warehouse.
• Design Safeguards Against Unauthorized Access To Data, Including Physical Access To The Servers, As Well As Logging of All Data Access History.
• Data Integrity Can Only Be Ensured if There is No Unauthorized Access To The Data.

What is Meant By OLAP?
• OLAP is An Abbreviation For On-Line Analytical Processing.
• The First Attempt To Provide A Definition of OLAP Was By Dr. Codd, Who Proposed 12 Rules For OLAP.
• The Key Feature of The OLAP Environment is its "Multidimensional" Environment OR Architecture.
• Depending on The Underlying Technology Used, OLAP Can Be Broadly Divided into Three Different Flavors
  • MOLAP (Multidimensional On-Line Analytical Processing)
  • ROLAP (Relational On-Line Analytical Processing)
  • HOLAP (Hybrid On-Line Analytical Processing)
• OLAP is A Field of Data Analysis That Considers Samples Collected Over Time, Related To The Business Process.

MOLAP
• MOLAP is The More Traditional Way of OLAP Analysis, in Which Data is Stored in A Multidimensional Cube.
• The Storage Need Not Be in A Relational Database, But Can Be in Proprietary Formats.
• MOLAP Processes Data That is Already Stored in A Multidimensional Array in Which All Possible Combinations of Data Are Reflected, Each in A Cell That Can Be Accessed Directly.

Advantages
• Excellent Performance Due To Optimized Storage, Multidimensional Indexing And Caching. Optimal For Slicing And Dicing Operations.
• MOLAP Can Perform Complex Calculations. All Calculations Are Pre-Generated When The Cube is Created, So Complex Calculations Are Not Only Possible But Are Returned Quickly.

Disadvantages
• MOLAP is Limited in The Amount of Data it Can Handle. Because All Calculations Are Performed When The Cube is Built, it is Not Possible To Include A Large Amount of Data in The Cube Itself.
• Only Summary-Level Information Will Be Included in The Cube.
• Requires Additional Investment. Cube Technology is Often Proprietary And May Not Already Exist in The Organization. Therefore, To Adopt MOLAP Technology, Chances Are That Additional Investments in Human And Capital Resources Are Needed.
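The pre-computation idea behind MOLAP can be imitated with a plain dictionary. This is a toy sketch, not how any particular cube product stores data: the fact rows, the `ALL` marker and the `build_cube` helper are assumptions, and only cells reachable from existing facts are materialized.

```python
from itertools import product

# Hypothetical facts: (year, region, product, amount)
facts = [("2014", "East", "Widget", 100.0),
         ("2014", "West", "Widget", 40.0),
         ("2013", "East", "Gadget", 60.0)]
ALL = "*"  # marker meaning "aggregated over this dimension"

def build_cube(rows):
    """Pre-compute every roll-up cell at build time."""
    cube = {}
    for year, region, prod, amount in rows:
        # Each fact contributes to every cell it rolls up into (2**3 cells).
        for cell in product((year, ALL), (region, ALL), (prod, ALL)):
            cube[cell] = cube.get(cell, 0.0) + amount
    return cube

cube = build_cube(facts)
east_total = cube[(ALL, "East", ALL)]   # direct cell lookup, no scan at query time
grand_total = cube[(ALL, ALL, ALL)]
```

Three facts already produce many more cells than rows, which hints at why cube size becomes the limiting factor as data and dimensions grow.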

ROLAP
• ROLAP Methodology Relies on Manipulating The Data Stored in The Relational Database To Give The Appearance of Traditional OLAP's Slicing And Dicing Functionality.
• Each Action of Slicing And Dicing is Equivalent To Adding A "WHERE" Clause To The SQL Statement.
• ROLAP Differs Significantly From MOLAP in That it Does Not Require The Pre-Computation And Storage of Information.

Advantages
• ROLAP Can Handle Large Amounts of Data. The Data Size Limitation of ROLAP Technology is The Data Size Limitation of The Underlying Relational Database.
• Can Leverage Functionalities Inherent in The Relational Database.
• ROLAP is Considered To Be More Scalable in Handling Large Data Volumes, Especially Models With Dimensions of Very High Cardinality.

Disadvantages
• Performance Can Be Slow, Because Each ROLAP Report is Essentially An SQL Query OR Multiple SQL Queries, And The Query Time Can Be Long if The Underlying Data Size is Large.
• Limited By SQL Functionalities, As ROLAP Technology Mainly Relies on Generating SQL Statements To Query The Relational Database, And SQL Statements Do Not Fit All Needs.
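The "slice equals WHERE clause" idea can be sketched with sqlite3. The `sales` table, its columns and the `slice_total` helper are made up for this example; a real ROLAP tool generates far more elaborate SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (year INTEGER, region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                 [(2014, "East", 100.0), (2014, "West", 40.0), (2013, "East", 60.0)])

def slice_total(conn, **filters):
    """Add one WHERE condition per requested slice and sum the measure."""
    allowed = {"year", "region"}  # whitelist of sliceable columns
    unknown = set(filters) - allowed
    if unknown:
        raise ValueError(f"cannot slice on: {sorted(unknown)}")
    cols = list(filters)
    where = " AND ".join(f"{c} = ?" for c in cols) or "1 = 1"
    sql = f"SELECT COALESCE(SUM(amount), 0) FROM sales WHERE {where}"
    return conn.execute(sql, [filters[c] for c in cols]).fetchone()[0]

east = slice_total(conn, region="East")                   # slice on one dimension
east_2014 = slice_total(conn, region="East", year=2014)   # dice on two dimensions
```

Nothing is pre-computed here, so every call scans the table, which matches both the scalability advantage and the performance disadvantage listed for ROLAP.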

HOLAP
• HOLAP Technologies Attempt To Combine The Advantages of MOLAP And ROLAP.
• For Summary-Type Information, HOLAP Leverages Cube Technology For Faster Performance. When Detail Information is Needed, HOLAP Can "Drill Through" From The Cube into The Underlying Relational Data.
• HOLAP Stores Data in Both A Relational Database (RDB) And A Multidimensional Database (MDDB) And Uses Whichever One is Best Suited To The Type of Processing Desired.

Factless Fact Table
• A Factless Fact Table is A Fact Table That Does Not Have Any Measures.
• A Factless Fact Table is Essentially An Intersection of Dimensions.
• Factless Fact Tables Offer The Most Flexibility in Data Warehouse Design. In Certain Situations, if A Factless Fact Table is Not Designed, We May End Up With Multiple Fact Tables.
• A Factless Fact Table Captures Events That Are Recorded At The Information Level But Carry No Measures For Calculations.
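A class-attendance log is a common way to picture a factless fact table. The keys and rows below are invented; the point is only that each row holds dimension keys and no measure, so analysis is done by counting rows.

```python
from collections import Counter

# Hypothetical attendance events: (date_key, student_key, class_key), no measure column.
attendance_fact = [
    (20140829, 1, 101),
    (20140829, 2, 101),
    (20140830, 1, 101),
    (20140830, 1, 102),
]

# There is nothing to sum, so questions are answered by counting rows.
attendance_per_class = Counter(class_key for _, _, class_key in attendance_fact)
days_attended_by_student_1 = len({d for d, s, _ in attendance_fact if s == 1})
```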

Junk Dimension
• In Data Warehouse Design, We Frequently Run into A Situation Where There Are Yes/No Indicator Fields in The Source System.
• As Per The Business Analysis, We Have To Keep This Boolean Information in The Fact Table.
• Keeping All The Boolean Indicator Fields in The Fact Table Requires Many Small Dimension Tables, And The Amount of Information Stored in The Fact Table Also Increases Tremendously, Leading To Possible Performance And Management Issues.
• A Junk Dimension is A Dimension in Which We Combine The Boolean Indicator Fields into A Single Dimension, Leading To A Single Dimension Table.
• The Content of The Junk Dimension Table is The Combination of All Possible Values of The Individual Indicator Fields.

Advantage of Junk Dimension
• It Provides A Recognizable Location For Related Codes, Indicators And Their Descriptors in A Dimensional Framework.
• Avoids The Creation of Multiple Dimension Tables.
• Provides A Smaller, Quicker Point of Entry For Queries Compared To Keeping The Attributes Directly in The Fact Table.
• An Interesting Use For A Junk Dimension is To Capture The Context of A Specific Transaction.
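Building a junk dimension from all combinations of indicator values can be sketched with itertools. The three indicator names and the `junk_key` helper are hypothetical.

```python
from itertools import product

# Hypothetical yes/no indicator fields from a source system.
indicators = ["is_gift", "is_online", "is_returned"]

# One junk-dimension row per combination of indicator values (2**3 = 8 rows).
junk_dim = {
    key: dict(zip(indicators, combo))
    for key, combo in enumerate(product([False, True], repeat=len(indicators)), start=1)
}
# Reverse lookup from a combination of flags to its surrogate key.
lookup = {tuple(row.values()): key for key, row in junk_dim.items()}

def junk_key(is_gift, is_online, is_returned):
    """Return the single surrogate key a fact row stores instead of three flags."""
    return lookup[(is_gift, is_online, is_returned)]
```

A fact row then carries one foreign key, for example `junk_key(True, False, False)`, in place of three separate indicator columns or three tiny dimensions.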

Conformed Dimension
• A Conformed Dimension is A Dimension That Has Exactly The Same Meaning And Content When Referred To From Different Fact Tables.
• A Conformed Dimension Can Refer To Multiple Tables in Multiple Data Marts Within The Same Organization.
• For Two Dimension Tables To Be Considered Conformed, They Must Either Be Identical OR One Must Be A Subset of The Other.
• Two Dimension Tables That Are Exactly The Same Except For The Primary Key Are Not Considered Conformed Dimensions.

Rapidly Changing Dimensions
• A Dimension Attribute That Changes Frequently is A Rapidly Changing Attribute.
• If We Move The Rapidly Changing Attribute To its Own Dimension, With A Separate Foreign Key in The Fact Table, Then That New Dimension is Called A Rapidly Changing Dimension.

Degenerate Dimension
• A Degenerate Dimension is A Dimension Which is Derived From The Fact Table And Doesn't Have its Own Dimension Table.
• These Are Essentially Dimension Keys For Which There Are No Other Attributes.

Inferred Dimension
• While Loading Fact Records, A Dimension Record May Not Yet Be Ready.
• One Solution is To Generate A Surrogate Key With Null For All The Other Attributes.
• The Generated Surrogate Key is Called An Inferred Member, But is Often Called An Inferred Dimension.

Role Playing Dimension
• A Role-Playing Dimension is One Where The Same Dimension Key, Along With its Associated Attributes, Can Be Joined To More Than One Foreign Key in The Fact Table.

Shrunken Dimension
• A Shrunken Dimension is A Subset of Another Dimension.

Static Dimension
• Static Dimensions Are Not Extracted From The Original Data Source, But Are Created Within The Context of The DWH.
• A Static Dimension Can Be Loaded Manually.

Data Warehouse Vs. Data Mart
• A DWH Holds Multiple Subject Areas, A DM Holds A Single Subject Area.
• A DWH Holds Very Detailed Information, A DM Holds Summaries.
• A DWH Works To Integrate All Data Sources, A DM Integrates A Given Subject Only.
• A DWH May Operate on A Dimensional Model, A DM Works Only on A Dimensional Model.