TechnoLogica Ltd. 3, Sofiisko Pole Str. tel : ( + 359 2) 91 91 2 (ten lines)
description
Transcript of TechnoLogica Ltd. 3, Sofiisko Pole Str. tel : ( + 359 2) 91 91 2 (ten lines)
1
TechnoLogica Ltd.3, Sofiisko Pole Str. tel: (+ 3592) 91 91 2 (ten lines)e-mail: [email protected], http:// www.technologica.com
DW Concepts
Dimension Modeling Techniques
Milena GerovaProject Manager
2
www.technologica.com
TechnoLogica DW Projects
Business Management System National Health Insurance Fund (10.2004 – current)
Customer Data Integration Allianz Bulgaria Holding (10.2004 – current)
Regulatory Reporting System BULBANK (2002 - 2003)
Information System Monetary StatisticsBulgarian National Bank (April 2003 – August 2004)
Management Information System BULBANK (January 2001 - June 2002)
3
www.technologica.com
Agenda
DW Terminology Overview
Dimensional Modeling
Dimension Types
History and Dimensions
Hierarchy in Dimensions
4
www.technologica.com
The data warehouse must
Make an organization’s information easily accessible.
Present the organization’s information consistently.
Be adaptive and resilient to change
Be a secure bastion that protects our information assets.
Serve as the foundation for improved decision making
The business community must accept the data warehouse if it is to be deemed successful.
5
www.technologica.com
Components of a Data Warehouse
6
www.technologica.com
Dimensional Modeling
Dimensional modeling is a new name for an old technique for making databases simple and understandable
Dimensional modeling is quite different from third-normal-form (3NF) modeling
ERM ->The Transaction Processing Model
o One table per entity
o Minimize data redundancy
o Optimize update
DM -> The data warehousing model
o One fact table for a process in the organization
o Maximize understandability
o Optimized for retrieval
o Resilient to change
7
www.technologica.com
Star Dimensional Modeling
History(Dimension
table)
Customer(Dimension
table)
Product(Dimension
table)
Channel(Dimension
table)
Item_nbrItem_descQuantityDiscnt_priceUnit_priceOrder_amount…
(Fact table)
OrderHistory
(Dimensiontable)
Customer(Dimension
table)
Product(Dimension
table)
Channel(Dimension
table)
Item_nbrItem_descQuantityDiscnt_priceUnit_priceOrder_amount…
(Fact table)
Order
8
www.technologica.com
Four-Step Dimensional Design Process
1. Select the business process to model.
2. Declare the grain of the business process.
3. Choose the dimensions that apply to each fact table row.
4. Identify the numeric facts that will populate each fact table row.
9
www.technologica.com
Dimensions
Determine these by the ways you want to slice and dice the data
Small number of rows compared to facts
Usually 5-10 dimensions surrounding a fact table
Time is almost always a dimension used by every fact
Track history
Uses Surrogate Keys
Hierarchies are usually built into them if possible
10
www.technologica.com
Date Dimension
The date dimension is the one dimension nearly guaranteed to be in every data mart
Date Dimension = Time Dimension before
We can build the date dimension table in advance (5-10 years -> only 3,650 rows)
11
www.technologica.com
Date Dimension
12
www.technologica.com
Date Dimension
13
www.technologica.com
Date Dimension
Data warehouses always need an explicit date dimension table. There are many date attributes not supported by the SQL date function, including fiscal periods, seasons, holidays, and weekends. Rather than attempting to determine these nonstandard calendar calculations in a query, we should look them up in a date dimension table.
select sum(f.amount_sold)from DATE_DIM d, FACT fwhere d.Calendar_Month = ‘January’
and d.id = f.date_dim_id;
14
www.technologica.com
Dimension Normalization(Denormalized dimension)
15
www.technologica.com
Dimension Normalization(Denormalized dimension)
16
www.technologica.com
Dimension Normalization(Snowflaking)
17
www.technologica.com
Dimension Normalization(Snowflaking)
The dimension tables should remain as flat tables physically.
Normalized, snowflaked dimension tables penalize cross-attribute browsing and prohibit the use of bit-mapped indexes.
Disk space savings gained by normalizing the dimension tables typically are less than 1 percent of the total disk space needed for the overall schema
18
www.technologica.com
Too Many Dimensions
19
www.technologica.com
Too Many Dimensions
A very large number of dimensions typically is a sign that several dimensions are not completely independent and should be combined into a single dimension.
If our design has 25 or more dimensions, we should look for ways to combine correlated dimensions into a single dimension
It is a dimensional modeling mistake to represent elements of a hierarchy as separate dimensions in the fact table.
20
www.technologica.com
Surrogate Keys
Every join between dimension and fact tables in the data warehouse should be based on meaningless integer surrogate keys.
You should avoid using the natural operational production codes. None of the data warehouse keys should be smart, where you can tell something about the row just by looking at the key.
21
www.technologica.com
Surrogate Keys
Surrogate keys are like an immunization for the data warehouse
Buffer the data warehouse environment from operational changes
Performance advantages The smaller surrogate key translates into smaller fact tables, smaller fact table indices, and more fact table rows per block input-output operation
Surrogate keys are used to record dimension conditions that may not have an operational code“No Promotion in Effect”, “Date Not Applicable.”
22
www.technologica.com
Surrogate Keys
The date dimension is the one dimension where surrogate keys should be assigned in a meaningful, sequential order
Surrogate keys are needed to support one of the primary techniques for handling changes to dimension table attributes
Don’t use concatenated or compound keys for dimension tables
23
www.technologica.com
Data Warehouse Bus Architecture
24
www.technologica.com
Data Warehouse Bus Matrix
25
www.technologica.com
Conformed Dimensions
Most dimensions are defined naturally at the most granular level possible
Conformed dimensions are either identical or strict mathematical subsets of the most granular, detailed dimension
They have consistent dimension keys, consistent attribute column names, consistent attribute definitions, and consistent attribute values
The conformed dimension may be the same physical table within the database or may be duplicated synchronously in each data mart
26
www.technologica.com
Conformed Dimensions
Roll-up dimensions conform to the base-level atomic dimension if they are a strict subset of that atomic dimension.
27
www.technologica.com
Conformed Dimensions
They should be built once in the staging area
They must be published prior to staging of the fact data
The dimension authority has responsibility for defining, maintaining, and publishing a particular dimension or its subsets to all the data mart clients who need it
28
www.technologica.com
Tracking History in Dimensions
Unchanging Dimensions
Changing, but Original Values are Irrelevant A phone number in a customer record
Slowly Changing Dimensions (SCD) A customer address, manager
Rapidly Changing Dimensions Income range of a customer
Continuously Changing Dimensions Customer age
29
www.technologica.com
Type 1: Overwrite the Value
The type 1 response is easy to implement, but: it does not maintain any history of prior attribute values
any preexisting aggregations based on the department value will need to be rebuilt
30
www.technologica.com
The type 2 response is the primary technique for accurately tracking slowly changing dimension attributes. It is extremely powerful because the new dimension row automatically partitions history in the fact table.
It’s not suitable for dimension tables that already exceed a million rows
Type 2: Add a Dimension Row
31
www.technologica.com
Type 2: Add a Dimension Row
Product Key
Product Description Department
SKU Number
Effective Date
Expiration Date
12345 IntelliKidz 1.0 Education ABC922-Z 01.1.1900 22.4.200525984 IntelliKidz 1.0 Strategy ABC922-Z 23.4.2005 01.1.2500
Product Key
Product Description Department
SKU Number
Effective Date
Most Resent Flag
12345 IntelliKidz 1.0 Education ABC922-Z 01.1.1900 N25984 IntelliKidz 1.0 Strategy ABC922-Z 23.4.2005 Y
Product Key Date Key
Amount Sold
12345 200 100 <--- 20.04.200525984 203 200 <--- 23.04.2005
32
www.technologica.com
Type 3: Add a Dimension Column
The type 3 slowly changing dimension technique allows us to see new and historical fact data by either the new or prior attribute values.
33
www.technologica.com
Hybrid SCD TechniquesSeries of Type 3 Attributes
Predictable Changes with Multiple Version Overlays
Report each year’s sales using the district map for that year.
Report each year’s sales using a district map from an arbitrary different year.
Report an arbitrary span of years’ sales using a single district map from any chosen year. The most common version of this requirement would be to report the complete span of fact data using the current district map.
34
www.technologica.com
Hybrid SCD TechniquesType 2 with "Current" Overwrite
Unpredictable Changes with Single-Version Overlay preserves historical accuracy while supporting the ability to report historical data according to the current values
35
www.technologica.com
Dimension Table Staging
36
www.technologica.com
Dimension Table Staging
38
www.technologica.com
Junk Dimensions
What to do with flags and indicators Leave the flags and indicators unchanged in the fact
table row.
Make each flag and indicator into its own separate dimension
Strip out all the flags and indicators from the design.
A junk dimension is a convenient grouping of typically low-cardinality flags and indicators
39
www.technologica.com
Junk Dimensions
Whether to use junk dimension5 indicators, each has 3 values -> 243 (35) rows5 indicators, each has 100 values -> 100 million (1005) rows
When to insert rows in the dimension
40
www.technologica.com
Multiple Currencies
41
www.technologica.com
Customer Dimension
Critical element for effective CRM
The most challenging dimension for any data warehouse
extremely deep (with millions of rows)
extremely wide (with dozens or even hundreds of attributes)
sometimes subject to rather rapid change
42
www.technologica.com
Customer Dimension Name and Address Parsing
43
www.technologica.com
Customer Dimension Other Common Customer Attributes
Gender
Ethnicity
Age or other life-stage classifications
Income or other lifestyle classifications
Status (for example, new, active, inactive, closed)
Referring source
Business-specific market segment
Scores characterizing the customer, such as purchase behavior, payment behavior, product preferences
44
www.technologica.com
Customer Dimension Aggregated Facts as Attributes
These attributes are to be used for constraining and labeling; they are not to be used in numeric calculations
Focus on those which will be used frequently
Minimize the frequency with which these attributes need to be updated
Replace metrics with more meaningful descriptive values, such as “High Spender”
45
www.technologica.com
Dimension Outriggers for aLow-Cardinality Attribute Set
46
www.technologica.com
Rapidly Changing CustomerDimensions
Challenges It generally takes too long to constrain or browse
among the relationships in such a big table
It is difficult to use previously described techniques for tracking changes in these large dimensions
One solution is to break off frequently analyzed or frequently changing attributes into a separate dimension, referred to as a minidimension
47
www.technologica.com
Rapidly Changing CustomerDimensions
The Mini Dimension with "Current" Overwrite
48
www.technologica.com
Rapidly Changing CustomerDimensions
The minidimension terminology refers to when the demographics key is part of the fact table composite key
If the demographics key is a foreign key in the customer dimension, we refer to it as an outrigger
49
www.technologica.com
Rapidly Changing CustomerDimensions
Type 2 with Natural Keys in Fact Table
Customer Dimension - Current Attributes (SCD1) Fact Table
Customer ID (Natural Key) Customer Key (FK) Customer Name Customer Demographics Key (FK) Customer Address More Foreign Keys … Customer Date of Birth Facts … Customer Date of 1st Order … Age
Gender Customer Dimension - "As was" Attributes (SCD2)
Annual Income Customer Key (PK) Number of Children Customer ID (Natural Key) Marital Status Customer Name
Customer Address Customer Date of Birth Customer Date of 1st Order … Age Gender Annual Income Number of Children Marital Status
50
www.technologica.com
Implications of Type 2 CustomerDimension Changes
Be careful to avoid overcounting because we may have multiple rows in the customer dimension for the same individual
COUNT DISTINCT
A most recent row indicator
The comparison operators depend on the business rules used to set our effective/expiration dates.
51
www.technologica.com
Capture the keys of the customers or products whose behavior you are tracking
Customer Behavior Study Groups
52
www.technologica.com
Commercial Customer Hierarchies
53
www.technologica.com
Commercial Customer Hierarchies
Bridge tables
54
www.technologica.com
Commercial Customer Hierarchies
55
www.technologica.com
Commercial Customer Hierarchies
Be aware of risk of double counting
SELECT 'San Francisco', SUM(F.REVENUE)FROM FACT F, DATE DWHERE F.CUSTOMER_KEY IN
(SELECT B.SUBSIDIARY_KEYFROM CUSTOMER C, BRIDGE BWHERE C.CUSTOMER_KEY =
B.PARENT_KEYAND C.CUSTOMER_CITY = 'San
Francisco') //to sum all SF parentsAND F.DATE_KEY = D.DATE_KEYAND D.MONTH = 'January 2002‘GROUP BY 'San Francisco'
56
www.technologica.com
Heterogeneous Product Schemas
57
www.technologica.com
Heterogeneous Product Schemas
58
www.technologica.com
Common Dimensional Modeling Mistakes to Avoid
Mistake 10: Place text attributes used for constraining and grouping in a fact table
Mistake 9: Limit verbose descriptive attributes in dimensions to save space
Mistake 8: Split hierarchies and hierarchy levels into multiple dimensions
Mistake 7: Ignore the need to track dimension attribute changes
Mistake 6: Solve all query performance problems by adding more hardware
59
www.technologica.com
Common Dimensional Modeling Mistakes to Avoid
Mistake 5: Use operational or smart keys to join dimension tables to a fact table
Mistake 4: Neglect to declare and then comply with the fact table’s grain
Mistake 3: Design the dimensional model based on a specific report
Mistake 2: Expect users to query the lowest-level atomic data in a normalized forma
Mistake 1: Fail to conform facts and dimensions across separate fact tables
60
www.technologica.com
Answers
Questionsand