Dimensional Modeling

Agenda
  - DW Project Lifecycle
  - Eliciting Business Requirements
  - Dimensional Model Components
  - Dimensional Model Schemas
  - Additional Modeling Concepts

DW Development Approach: Kimball Methodology
  - DW Project Lifecycle
  - Business requirements
  - Business Requirements Documentation
  - Bus Matrix
  - Design, build and deliver in increments
  - DW Architecture
  - DW Design
  - ETL system
  - Cube, Reports, query tools, …

Data Warehouse Project Lifecycle
  [Figure: lifecycle diagram; phases: Planning, Analysis, Design, Implementation]
  Source: Mundy, Thornthwaite, and Kimball (2006). The Microsoft Data Warehouse Toolkit, Wiley Publishing Inc., Indianapolis, IN.

Project Planning
  - Determine:
      - Initial project scope
      - Project cost
  - Define:
      - Team roles
      - Team members
      - Project schedule

Example Initial Project Scope
  [Figure]
  Source: Mundy, Thornthwaite, and Kimball (2006). The Microsoft Data Warehouse Toolkit, Wiley Publishing Inc., Indianapolis, IN.

Requirements Elicitation
  - Identify who to interview
      - May include more levels of management
  - Conduct interviews
      - Business challenges
      - Definition of success
      - Info needed to track success, detect problems
      - Ways to view/break down info
  - Other discovery methods
      - Existing systems
      - Reports
  - Document & prioritize

Documenting Requirements
  - Interview Summaries: prose summarizing interviews (Kimball format)
  - Analytic Themes: analysis requirements grouped into categories (Kimball format, pg 35)
  - DW Bus Matrix: business processes mapped to data needed (Kimball format, pg 37)
  - DM Information Package: prioritized processes (Ponniah format, pg 104)

Kimball Example: Interview Summaries
Kimball Example: Analytic Themes
Kimball Example: Bus Matrix
Class Example: University Dept. Requirements
Class Example: University Dept. Bus Matrix
Class Example: University Dept. Information Package
In-Class Example: Newspaper Information Package

BI Architecture, cont
  [Figure]
  Source: Oracle Corporation. Information Management and Big Data: A Reference Architecture, Oracle White Paper, February 2013, p. 12.

ERD
  [Figure]

Reporting Challenges with ERD/OLTP
  - Model designed for efficient record processing, not "subject" processing
  - External data often excluded
  - Analyses require multiple joins
  - Indexes not optimized for reporting
  - History not stored

Pre-Computing Aggregates

  Month  Product  City     TOTAL Sales Quantity
  Oct    Prod1    Abilene   9556
         Prod1    Austin     799
         Prod1    Dallas    1356
         Prod1    Waco     36678
         Prod2    Abilene   7869
         Prod2    Austin    2967
         Prod2    Dallas     568
         Prod2    Waco
         Prod3    Abilene     43
         Prod3    Austin    6588
         Prod3    Dallas    8434
         Prod3    Waco      3756
  Nov    Prod1    Abilene  77977
         Prod1    Austin     234
         Prod1    Dallas    4378
         Prod1    Waco     20349
         Prod2    Abilene    210
         Prod2    Austin     789
         Prod2    Dallas     888
         Prod2    Waco      4566
         Prod3    Abilene   2078
         Prod3    Austin     292
         Prod3    Dallas    1111
         Prod3    Waco        36
  Dec    Prod1    Abilene  34657
         Prod1    Austin    2999
         Prod1    Dallas    5888
         Prod1    Waco      9999
         Prod2    Abilene   1580
         Prod2    Austin    2940
         Prod2    Dallas     975
         Prod2    Waco      5748
         Prod3    Abilene   6140
         Prod3    Austin     211
         Prod3    Dallas    1357
         Prod3    Waco      1000

  Queries:
    1. Total Sales
    2. Total Sales by Month
    3. Total Sales by Month and Product Line
    4. Total Sales by Month, Product Line, and City
    5. Total Sales by City
    ..
  ORDERED_QUANTITY

Pre-Computing Aggregates, cont

  1. Total Sales (1 "fact", 0 dimensions)

       SELECT sum(ordered_quantity) AS "total"
       FROM order_line_t;

  2. Total Sales by Month (1 "fact", 1 "dimension" with 3 values: Oct, Nov, Dec)

       SELECT month(order_date) AS "month",
              sum(ordered_quantity) AS "total"
       FROM order_line_t ol, order_t o
       WHERE ol.order_id = o.order_id
       GROUP BY month(order_date);

  3. Total Sales by Month and Product (1 "fact", 2 "dimensions" each with 3 values: Oct, Nov, Dec by P1, P2, P3)

       SELECT month(order_date) AS "month",
              p.product_line_id AS "product",
              sum(ordered_quantity) AS "total"
       FROM order_line_t ol, order_t o, product_t p
       WHERE ol.order_id = o.order_id
         AND ol.product_id = p.product_id
       GROUP BY month(order_date), p.product_line_id;

Pre-Computing Aggregates, cont

  4. Total Sales by Month, Product, & City (1 "fact", 3 "dimensions" each with 3 values)
     Months: Oct, Nov, Dec. Products: P1, P2, P3. Cities: AB, AU, DA, WA.

       SELECT month(order_date) AS "month",
              p.product_line_id AS "product",
              c.city,
              sum(ordered_quantity) AS "total"
       FROM order_line_t ol, order_t o, product_t p, customer_t c
       WHERE ol.order_id = o.order_id
         AND ol.product_id = p.product_id
         AND o.customer_id = c.customer_id
       GROUP BY month(order_date), p.product_line_id, c.city;

OLAP Review
  - Short: Class of applications or tools that support ad-hoc analysis of multidimensional data
  - Longer: "technology that enables [users] to gain insight into data through fast, consistent, interactive access [to] information that has been transformed to reflect the real dimensionality of the enterprise"
    OLAP Council (www.olapcouncil.org)

OLAP Cubes
  - Improve reporting performance
      - Pre-processed aggregates
      - Data in-memory
      - Index structures
      - Bye Bye Locks!
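
The idea behind the Pre-Computing Aggregates slides can be sketched in a few lines: compute the finest-grained aggregate (query 4) once, store it, and answer the coarser queries from the stored result instead of re-joining the order tables. This is a minimal sketch, not the course's implementation. It assumes SQLite through Python's `sqlite3` module, so `strftime('%m', ...)` stands in for the slides' `month()` function; the table layouts and sample rows are hypothetical, and `sales_agg` is an invented name for the stored aggregate.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical miniature of the tables named in the slide queries.
CREATE TABLE product_t    (product_id INTEGER PRIMARY KEY, product_line_id TEXT);
CREATE TABLE customer_t   (customer_id INTEGER PRIMARY KEY, city TEXT);
CREATE TABLE order_t      (order_id INTEGER PRIMARY KEY, order_date TEXT, customer_id INTEGER);
CREATE TABLE order_line_t (order_id INTEGER, product_id INTEGER, ordered_quantity INTEGER);

INSERT INTO product_t  VALUES (1, 'P1'), (2, 'P2');
INSERT INTO customer_t VALUES (1, 'Austin'), (2, 'Dallas');
INSERT INTO order_t    VALUES (100, '2006-10-05', 1), (101, '2006-10-20', 2), (102, '2006-11-03', 1);
INSERT INTO order_line_t VALUES (100, 1, 5), (100, 2, 3), (101, 1, 7), (102, 2, 4);

-- Query 4 (month, product, city), computed once and stored.
CREATE TABLE sales_agg AS
SELECT strftime('%m', o.order_date) AS month,
       p.product_line_id            AS product,
       c.city                       AS city,
       SUM(ol.ordered_quantity)     AS total
FROM order_line_t ol
JOIN order_t o    ON ol.order_id = o.order_id
JOIN product_t p  ON ol.product_id = p.product_id
JOIN customer_t c ON o.customer_id = c.customer_id
GROUP BY strftime('%m', o.order_date), p.product_line_id, c.city;
""")

# Coarser queries (1, 2 and 5 on the slide) read only the stored aggregate.
total = conn.execute("SELECT SUM(total) FROM sales_agg").fetchone()[0]
by_month = dict(conn.execute("SELECT month, SUM(total) FROM sales_agg GROUP BY month"))
by_city = dict(conn.execute("SELECT city, SUM(total) FROM sales_agg GROUP BY city"))

# Cross-check against the base table: the rolled-up total must match.
base_total = conn.execute("SELECT SUM(ordered_quantity) FROM order_line_t").fetchone()[0]

print(total, by_month, by_city)
```

Because SUM is additive, every coarser total can be derived from the stored rows. Measures such as averages or distinct counts would need extra stored columns to roll up correctly.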

OLAP Cubes, cont
  - Flexible, interactive information delivery to DW
  - Multidimensional data representation and operations
      - Rollup
      - Drill-down
      - Slice/Dice
      - Pivot (or Rotate)
  * See

Dimensional Modeling
  - Data Model
      - Logical view of a multi-dimensional cube
  - Key structures and components
      - Fact table(s)
          - Key business process
          - Facts/Measurements/metrics
          - Foreign Keys
      - Dimension tables
          - Ways to view measures
          - Attributes
          - Often denormalized
          - Surrogate Key vs. Business Key
          - Hierarchies

Dimensional Model Example
  [Figure labels: Fact Table, Dimension Tables, Foreign Keys, Attributes, Measures, Business Key (include it!), Surrogate Key, Hierarchy]

Dimensional Model Characteristics
  [Table with columns "Dim Tables" and "Fact Tables"]

Star Schema
  - At least one fact table and (typically) two or more dimension tables
  - Fact table has a direct relationship with each of the dimension tables
  - Single-table dimensions
  - Arrangement resembles a "star"

Star Schema Example
  [Figure]

Snowflake Schema
  - Fact table has a direct relationship with some dimension tables, and an indirect relationship with other(s)
  - Multi-table dimensions, i.e., "normalized" dimensions

Snowflake Example
  [Figure]

Comparison of Schemas
  - Star
      - The much-preferred approach
      - Adv:
          - Faster load/query/analysis performance
          - Potentially more intuitive to users
  - Snowflake
      - Adv:
          - Potentially faster setup
          - Avoids data redundancy
          - Reduces size of dimension table
          - Ease of maintenance

Common Dims, Facts, Measures
  [Table with columns "Dims", "Facts", "Measures"]

In-Class Example: Newspaper Dim Model

Additional Modeling Concepts
  - Surrogate Keys
  - Attribute Hierarchies
  - Time Dimensions
  - Junk Dimensions
  - Degenerate Dimensions
  - Slowly-Changing Dimensions

Surrogate Keys
  Problem:
  - Potential for PK to change in source systems
      - e.g., PKs with built-in meaning
  - Data spread across multiple systems
      - PKs exist?
      - PKs consistent?
      - PKs mean the same thing?
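
The star schema and the OLAP operations listed earlier (rollup, drill-down, slice) can be illustrated with one fact table joined directly to three single-table dimensions. This is a minimal sketch assuming SQLite through Python's `sqlite3` module; the table names (`fact_sales`, `dim_date`, `dim_product`, `dim_city`), columns, and rows are all hypothetical and chosen only to mirror the month/product/city example used in the aggregate slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: one table per dimension, each with a surrogate key.
CREATE TABLE dim_date    (date_sk INTEGER PRIMARY KEY, month TEXT, year INTEGER);
CREATE TABLE dim_product (product_sk INTEGER PRIMARY KEY, product_name TEXT, product_line TEXT);
CREATE TABLE dim_city    (city_sk INTEGER PRIMARY KEY, city TEXT, state TEXT);

-- Fact table: foreign keys to every dimension plus one measure.
CREATE TABLE fact_sales (
    date_sk        INTEGER REFERENCES dim_date(date_sk),
    product_sk     INTEGER REFERENCES dim_product(product_sk),
    city_sk        INTEGER REFERENCES dim_city(city_sk),
    sales_quantity INTEGER
);

INSERT INTO dim_date    VALUES (1, 'Oct', 2006), (2, 'Nov', 2006);
INSERT INTO dim_product VALUES (1, 'Prod1', 'LineA'), (2, 'Prod2', 'LineA');
INSERT INTO dim_city    VALUES (1, 'Austin', 'TX'), (2, 'Dallas', 'TX');
INSERT INTO fact_sales  VALUES (1, 1, 1, 10), (1, 2, 1, 20), (1, 1, 2, 30),
                               (2, 1, 1, 40), (2, 2, 2, 50);
""")

def query(sql, params=()):
    """Run a query and return all rows as a list of tuples."""
    return conn.execute(sql, params).fetchall()

# Drill-down: totals at the month + product level.
by_month_product = query("""
    SELECT d.month, p.product_name, SUM(f.sales_quantity)
    FROM fact_sales f
    JOIN dim_date d    ON f.date_sk = d.date_sk
    JOIN dim_product p ON f.product_sk = p.product_sk
    GROUP BY d.month, p.product_name
    ORDER BY d.month, p.product_name
""")

# Rollup: drop the product dimension and aggregate to month only.
by_month = query("""
    SELECT d.month, SUM(f.sales_quantity)
    FROM fact_sales f
    JOIN dim_date d ON f.date_sk = d.date_sk
    GROUP BY d.month
    ORDER BY d.month
""")

# Slice: fix one member of the city dimension, then aggregate by month.
austin_by_month = query("""
    SELECT d.month, SUM(f.sales_quantity)
    FROM fact_sales f
    JOIN dim_date d ON f.date_sk = d.date_sk
    JOIN dim_city c ON f.city_sk = c.city_sk
    WHERE c.city = ?
    GROUP BY d.month
    ORDER BY d.month
""", ("Austin",))

print(by_month_product, by_month, austin_by_month)
```

Each query needs at most one join per dimension, which is the practical meaning of "fact table has a direct relationship with each of the dimension tables". A snowflake variant would split, say, `state` into its own table and add a second join on the city dimension.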

Surrogate Keys, cont
  Surrogate Key:
  - Newly-generated PK for dimension rows in DW
  - System-generated sequence numbers
  - Mapped to source/application key(s)
  - Fact rows reference SKs

Surrogate Keys Example
  [Figure]

Attribute Hierarchies
  - 1:M relationships between attributes
  - Supports user navigation
      - drill-downs, drill-ups
  - Improves performance
      - Assists SSAS in aggregation selection
      - Storage improvement

Attribute Hierarchy Examples
  - State -> City
  - Year -> Month
  - Year -> Semester

Date / Time Dimension
  - Common feature of every data warehouse
  - Minimum attributes:
      - Date key (e.g , , 12345)
      - Date name (e.g. Monday, January )
  - Common additional attributes
      - Month, Year, Quarter, Holiday Name, …

Time Dimension Example
  [Figure]

Junk Dimensions
  - Stores one or more "lookup" codes, flags, indicators that describe or categorize transactions/events
  - Usually low cardinality
  - May include all valid combinations of codes OR valid combinations that exist

Junk Dimension Example

  Enrollment_Status_ID_SK  Registration_Status  Permit_Issued  Class_Fee_Status
  1                        Wait List            Y              Paid
  2                        Wait List            Y              Unpaid
  3                        Wait List            N              Paid
  4                        Wait List            N              Unpaid
  5                        Confirmed            Y              Paid
  6                        Confirmed            Y              Unpaid
  7                        Confirmed            N              Paid
  8                        Confirmed            N              Unpaid
  9                        Awaiting Approval    Y              Paid
  10                       Awaiting Approval    Y              Unpaid
  11                       Awaiting Approval    N              Paid
  12                       Awaiting Approval    N              Unpaid

Degenerate Dimensions
  - An attribute (dimension) stored in fact table
  - Typically a high-cardinality attribute
  - Attribute does NOT link to a dimension table
  - Often used for drill-downs and/or data mining (e.g. Market Basket Analysis)

Degenerate Dimension Example
  [Figure]

Slowly-Changing Dimensions
  What you want to do when a value in a dimension record changes:
    0. Do Nothing
    1. Overwrite Record
    2. Retain All History (add new rows)
    3. Retain Some History (add new columns)
  Impacts ETL

Type 0 (Fixed Attribute)
  DimCustomer Table
    CustomerSK  10
    CustomerID
    LastName    Harris
    FirstName   Miles
    Gender      M
  Source Extract
    CustomerID
    LastName    Harris
    FirstName   Miles
    Gender      F
  Update -> Update Ignored or Failure
  2006 Microsoft Corporation.
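
The "all valid combinations of codes" option from the Junk Dimensions slide amounts to a cross product of the individual code lists, with a surrogate key assigned to each combination. The sketch below is one possible way to generate such a table, assuming every combination is valid; the column names and values are taken from the Junk Dimension Example table, while `junk_dim` and `lookup` are hypothetical names introduced here.

```python
from itertools import product

# Code lists taken from the Junk Dimension Example table.
registration_status = ["Wait List", "Confirmed", "Awaiting Approval"]
permit_issued = ["Y", "N"]
class_fee_status = ["Paid", "Unpaid"]

# Cross product of the code lists; the surrogate key is a simple
# sequence number starting at 1.
junk_dim = [
    {
        "Enrollment_Status_ID_SK": sk,
        "Registration_Status": reg,
        "Permit_Issued": permit,
        "Class_Fee_Status": fee,
    }
    for sk, (reg, permit, fee) in enumerate(
        product(registration_status, permit_issued, class_fee_status), start=1
    )
]

# ETL-side lookup: map a transaction's code combination to the key
# that the fact row should reference.
lookup = {
    (row["Registration_Status"], row["Permit_Issued"], row["Class_Fee_Status"]):
        row["Enrollment_Status_ID_SK"]
    for row in junk_dim
}

for row in junk_dim:
    print(row)
```

With 3 x 2 x 2 code values the table has 12 rows, and the generated keys follow the same ordering as the example table. If only combinations that actually occur should be stored, the list would instead be built from the distinct combinations found in the source data.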

Type 1 (Changing Attribute)
  DimCustomer Table
    CustomerSK    10
    CustomerID
    LastName      Harris
    FirstName     Miles
    AddressLine1  5363 Blackshire Street
    ZipCode
  Source Extract
    CustomerID
    LastName      Harris
    FirstName     Miles
    AddressLine1  123 Main St.
    ZipCode       54276
  Update
  Updated DimCustomer Table
    CustomerSK    10
    CustomerID
    LastName      Harris
    FirstName     Miles
    AddressLine1  123 Main St.
    ZipCode       54276
  Simple UPDATE statement applied:
    UPDATE DimCustomer
    SET AddressLine1 = '123 Main St.', ZipCode = 54276
    WHERE CustomerID =

Type 2 (Changing Attribute)
  DimCustomer Table
    CustomerSK    10
    CustomerID
    LastName      Harris
    FirstName     Miles
    AddressLine1  5363 Blackshire Street
    ZipCode       54271
    StartDate     1/1/2007
    EndDate       NULL
  Customer Source Extract
    CustomerID
    LastName      Harris
    FirstName     Miles
    AddressLine1  123 Main St.
    ZipCode       54276
  Update
  Updated DimCustomer Table

    CustomerSK  CustomerID  LastName  FirstName  AddressLine1            ZipCode  StartDate  EndDate
    10                      Harris    Miles      5363 Blackshire Street  54271    1/1/2007   2/18/2007
    108                     Harris    Miles      123 Main St.            54276    2/18/2007  NULL

  Simple UPDATE statement applied:
    UPDATE DimCustomer
    SET EndDate = '2/18/2007'
    WHERE CustomerID =
  Then INSERT statement applied:
    INSERT INTO DimCustomer
      (CustomerID, LastName, FirstName, AddressLine1, ZipCode, StartDate, EndDate)
    VALUES ( , 'Harris', 'Miles', '123 Main St.', 54276, '2/18/2007', NULL)

Type 3 (Changing Attribute)
  DimCustomer Table
    CustomerSK    10
    CustomerID
    LastName      Harris
    FirstName     Miles
    AddressLine1  5363 Blackshire Street
    ZipCode       54271
    StartDate     1/1/2007
    EndDate       NULL
  Customer Source Extract
    CustomerID
    LastName      Harris
    FirstName     Miles
    AddressLine1  123 Main St.
    ZipCode       54276
  Update
  Updated DimCustomer Table
    CustomerSK            10
    CustomerID
    LastName              Harris
    FirstName             Miles
    AddressLine1          5363 Blackshire Street
    ZipCode               54271
    Updated AddressLine1  123 Main St.
    Updated ZipCode       54276
  Simple UPDATE statement applied:
    UPDATE DimCustomer
    SET UpdatedAddressLine1 = '123 Main St.', UpdatedZipCode = 54276
    WHERE CustomerID =

DW Physical Design
  [Figure]

Summary
  - Dimensional Model Basic Components
      - Facts
      - Measures
      - Dimensions
      - Attributes
      - Keys: Primary, Surrogate, Business, Foreign
  - Schemas
  - Hierarchies
  - Slowly-Changing Dimensions
  - Junk Dimensions
  - Degenerate Dimensions
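
As a closing illustration of the Slowly-Changing Dimensions slides, the Type 2 steps (close the current row with an UPDATE, then INSERT a new row with a new surrogate key) can be sketched as follows. This is a minimal sketch under stated assumptions, not the deck's implementation: it uses SQLite through Python's `sqlite3` module, ISO-formatted dates, and an auto-generated `CustomerSK`. The business key `'C-1001'` and the function `apply_type2_change` are hypothetical, since the slides' CustomerID value did not survive in the transcript.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE DimCustomer (
        CustomerSK   INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
        CustomerID   TEXT NOT NULL,                      -- business key
        LastName     TEXT,
        FirstName    TEXT,
        AddressLine1 TEXT,
        ZipCode      TEXT,
        StartDate    TEXT NOT NULL,
        EndDate      TEXT                                -- NULL marks the current row
    )
""")

INSERT_SQL = """
    INSERT INTO DimCustomer
        (CustomerID, LastName, FirstName, AddressLine1, ZipCode, StartDate, EndDate)
    VALUES (?, ?, ?, ?, ?, ?, NULL)
"""

# Initial load; 'C-1001' is a hypothetical business key.
conn.execute(INSERT_SQL, ("C-1001", "Harris", "Miles",
                          "5363 Blackshire Street", "54271", "2007-01-01"))

def apply_type2_change(conn, customer_id, address, zip_code, change_date):
    """Close the current row and add a new one if the address changed.

    Returns True when a new version row was inserted, False otherwise.
    """
    current = conn.execute(
        "SELECT LastName, FirstName, AddressLine1, ZipCode FROM DimCustomer "
        "WHERE CustomerID = ? AND EndDate IS NULL",
        (customer_id,),
    ).fetchone()
    if current is None:
        raise LookupError(f"no current row for {customer_id}")
    last_name, first_name, old_address, old_zip = current
    if (old_address, old_zip) == (address, zip_code):
        return False  # nothing changed, keep the current row open
    with conn:  # both statements commit together or not at all
        conn.execute(
            "UPDATE DimCustomer SET EndDate = ? "
            "WHERE CustomerID = ? AND EndDate IS NULL",
            (change_date, customer_id),
        )
        conn.execute(INSERT_SQL, (customer_id, last_name, first_name,
                                  address, zip_code, change_date))
    return True

changed = apply_type2_change(conn, "C-1001", "123 Main St.", "54276", "2007-02-18")
repeated = apply_type2_change(conn, "C-1001", "123 Main St.", "54276", "2007-03-01")

rows = conn.execute(
    "SELECT CustomerSK, AddressLine1, ZipCode, StartDate, EndDate "
    "FROM DimCustomer ORDER BY CustomerSK"
).fetchall()
for row in rows:
    print(row)
```

Restricting the UPDATE to `EndDate IS NULL` keeps already-closed history rows untouched when the same customer changes again. A Type 1 change would instead overwrite `AddressLine1` and `ZipCode` in place, and a Type 3 change would write the new values into separate "updated" columns on the same row.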