NIC Exposure Level Training Vijayendra Gururao Business Intelligence Consultant.
NIC Exposure Level Training
Vijayendra Gururao, Business Intelligence Consultant
Agenda
Data Warehousing Concepts - Day 1
Govt Case Study - Day 2
Defense HR Case Study - Day 3
Manufacturing Case Study - Day 4
Data Mining - Day 5
The Evolution of Business Intelligence
[Chart: from roughly 1986 to 2001, analysis evolves from passive and human-driven toward active and technology-driven]
What's important: Report, Summarize (MIS), Focus (EIS)
What it means: Analyze (OLAP)
What to do about it: Recommend (Data Mining), Act (Intelligent Agents)
Data Warehousing
Introduction:
Definitions, Legacy Systems, Dimensions, Data Dependencies, ER Model, Dimensional Model
An ER Model
[Diagram: entities include ShipType, Shipper, District, Credit, OrderItem, ShipTo, Product, ContactLocation, ProductLine, SalesOrder, CustomerLocation, ProductGroup, Contract, ContractType, Customer, SalesRep, SalesDistrict, SalesRegion, SalesDivision, and Contact]
Why Data Warehouses?
To meet the long-sought goal of providing the user with more flexible databases containing data that can be accessed "every which way."
OLTP vs. OLAP
OLTP (online transaction processing) has been the mainstay of IS and DP for the last thirty years. Most legacy systems are quite good at capturing data but do not facilitate data access.
OLAP (online analytical processing) is a set of procedures for defining and using a dimensional framework for decision support.
The Goals for and Characteristics of a DW
Make organizational data accessible
Facilitate consistency
Adaptable and yet resilient to change
Secure and reliable
Designed with a focus on supporting decision making
The Goals for and Characteristics of a DW
Generate an environment in which data can be sliced and diced in multiple ways
It is more than data, it is a set of tools to query, analyze, and present information
The DW is the place where operational data is published (cleaned up, assembled, etc.)
Data Warehousing is Changing!
[Diagram: the warehouse now feeds ERP, Campaign Management, Supply Chain, Customer Relationship Mgmt., E-commerce, Target Marketing, Knowledge Management, and Call Center applications]
Application requirements--not just data requirements--are now driving need.
Organization of data in the presentation area of the data warehouse
Data in the warehouse are dimensional, not normalized relations
However, data that are ultimately presented in the data warehouse will often be derived directly from relational DBs
Data should be atomic someplace in the warehouse, even if the presentation is aggregate
Uses the bus architecture to support a decentralized set of data marts
Updates to a data warehouse
For many years, the dogma stated that data warehouses are never updated.
This is unrealistic, since labels, titles, etc. change.
Some components will, therefore, be changed, albeit via a managed load (as opposed to transactional updates)
Basic elements of the data warehouse
[Diagram: Operational Source Systems -(extract)-> Data Staging Area -(load)-> Data Presentation Area -(access)-> Data Access Tools]
Data Staging Area
Services: clean, combine, and standardize; conform dimensions; no user query services
Data store: flat files and relational tables
Processing: sorting and sequential processing
Data Presentation Area
Data Mart #1: dimensional; atomic and summary data; based on a single business process
Data Mart #2: similar design
DW Bus: conformed facts and dimensions
Data Access Tools
Ad hoc query tools, report writers, analytical applications
Modeling: forecasting, scoring, data mining
Data Staging Area
Extract-Transformation-Load
Extract: reading the source data and copying the data to the staging area
Transformation: cleaning, combining, de-duplicating, assigning keys
Load: present the data to the bulk loading facilities of the data mart
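The three steps above can be sketched in Python; a minimal illustration only, with hypothetical field names, not the deck's actual tooling:

```python
# Minimal ETL sketch: extract -> transform -> load (all names are hypothetical).

def extract(source_rows):
    """Extract: copy the source rows into a staging list."""
    return [dict(row) for row in source_rows]

def transform(staged_rows):
    """Transform: clean values, de-duplicate, assign keys."""
    seen, cleaned = set(), []
    for row in staged_rows:
        name = row["customer"].strip().title()   # cleaning
        if name in seen:                          # de-duplicating
            continue
        seen.add(name)
        row["customer"] = name
        row["customer_key"] = len(cleaned) + 1    # key assignment
        cleaned.append(row)
    return cleaned

def load(rows, warehouse):
    """Load: hand the prepared rows to the data mart's bulk loader."""
    warehouse.extend(rows)

warehouse = []
source = [{"customer": "  acme corp "}, {"customer": "ACME CORP"}, {"customer": "Beta Ltd"}]
load(transform(extract(source)), warehouse)
```

The two spellings of "acme corp" collapse into a single cleaned row, so only two rows reach the warehouse.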
Dimensional Modeling Terms and Concepts
Fact table
Dimension tables
Fact Tables
Fact table: a table in the data warehouse that contains
Numerical performance measures
Foreign keys that tie the fact table to the dimension tables
Fact Tables
Each row records a measurement describing a transaction
Where? When? Who? How much? How many?
The level of detail represented by this data is referred to as the grain of the data warehouse
Questions can only be asked down to a level corresponding with the grain of the data warehouse
Dimension tables
Tables containing textual descriptors of the business
Dimension tables are usually wide (e.g., 100 columns)
Dimension tables are usually shallow (hundreds of thousands to a few million rows)
Values in the dimensions usually provide
Constraints on queries (e.g., view customer by region)
Report headings
Dimension tables
The quality of the dimensions will determine the quality of the data warehouse; that is, the DW is only as good as its dimension attributes
Dimensions are often split into hierarchical branches (i.e., snowflakes) because of the hierarchical nature of organizations (e.g., product part, product, brand)
Dimensions are usually highly denormalized
Dimension tables
The dimension attributes define the constraints for the DW. Without good dimensions, it becomes difficult to narrow down on a solution when the DW is used for decision support
Bringing together facts and dimensions - building the dimensional model
Start with the normalized ER model
Group the ER diagram components into segments based on common business processes and model each as a unit
Find M:M relationships in the model with numeric and additive non-key facts and include them in a fact table
Denormalize the other tables as needed and designate one field as a primary key
A Dimensional Model
[Star schema diagram]
Time Dimension: time_key, day_of_week, month, quarter, year, holiday_flag
Sales Fact: time_key, product_key, store_key, dollars_sold, units_sold, dollars_cost
Product Dimension: product_key, description, brand, category
Store Dimension: store_key, store_name, address, floor_plan_type
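The star schema above can be sketched as SQL tables; here is a minimal, illustrative version in Python's built-in sqlite3 (the column names come from the slide, the sample rows are invented):

```python
# Sketch of the retail star schema: one fact table, three dimension tables.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE time_dim    (time_key INTEGER PRIMARY KEY, day_of_week TEXT, month TEXT,
                          quarter TEXT, year INTEGER, holiday_flag INTEGER);
CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, description TEXT,
                          brand TEXT, category TEXT);
CREATE TABLE store_dim   (store_key INTEGER PRIMARY KEY, store_name TEXT,
                          address TEXT, floor_plan_type TEXT);
CREATE TABLE sales_fact  (time_key INTEGER REFERENCES time_dim,
                          product_key INTEGER REFERENCES product_dim,
                          store_key INTEGER REFERENCES store_dim,
                          dollars_sold REAL, units_sold INTEGER, dollars_cost REAL);
""")
con.execute("INSERT INTO time_dim VALUES (1, 'Mon', 'Jan', 'Q1', 2004, 0)")
con.execute("INSERT INTO product_dim VALUES (1, 'Soup', 'Acme', 'Food')")
con.execute("INSERT INTO store_dim VALUES (1, 'Downtown', '1 Main St', 'A')")
con.execute("INSERT INTO sales_fact VALUES (1, 1, 1, 100.0, 10, 60.0)")

# Typical ad hoc query: constrain by dimension attributes, aggregate the facts.
row = con.execute("""
    SELECT p.brand, SUM(f.dollars_sold)
    FROM sales_fact f
    JOIN product_dim p ON p.product_key = f.product_key
    JOIN time_dim t ON t.time_key = f.time_key
    WHERE t.year = 2004
    GROUP BY p.brand
""").fetchone()
```

Note how the query constrains on dimension attributes (year, brand) and sums the numeric facts, exactly the fact/dimension split described above.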
Kimball Methodology
Conformed Dimensions
Review: A Private Data Mart
A data mart containing one fact table and three dimension tables. We delivered all the tables by executing a fact build.
What if we want to add another fact table called F_Sales that will reference the three existing dimension tables?
Understand Conformed Dimensions
[Diagram: the Sales, Distribution, and Order fact tables share conformed dimensions: Location, Customer, Product, Time, Distributor, and Promotion]
A conformed dimension is a dimension that is standardized across all data marts.
Advantages of Conformed Dimensions
Deliver incremental data marts in a short period of time.
Independent data marts become part of a fully integrated data warehouse.
Deliver a consistent view across your business.
Conformed Dimensions Within Bus Architecture
[Bus matrix: rows are the fact tables (Sales, Distribution, Order); columns are the dimensions (Location, Customer, Product, Time, Distributor, Promotion); an X marks each dimension a fact table uses]
Identifying and designing the conformed dimensions is a critical step in the architecture phase of data warehouse design.
Design of Conformed Dimensions
A commitment to using conformed dimensions is more than just a technical consideration. It must be a business mandate.
Lay out a broad dimensional map for the enterprise.
Define conformed dimensions at the most granular (atomic) level possible.
Conformed dimensions should always use surrogate keys.
Define standard definitions for dimension and fact attributes.
Granularity in Conformed Dimensions
[Diagram]
D_Product: Product Id, Description, Product Type, Type Description, Product Line, Line Description
Order Fact: Day Id, Product Id, Customer Id, Cost, Number Ordered
D_Time: Day Id, Day, Month, Year
D_Customer: Customer Id, Last Name, First Name, Address
Conformed dimensions should be defined at the most granular (atomic) level so that each record in these tables corresponds to a single record in the base-level fact table.
Flexibility of Conformed Dimensions
[Diagram]
Product: Product Id, Description, Product Type, Product Line
Customer: Customer Id, Last Name, First Name, Address
Time (Day): Day Id, Day, Month Id, Period
Time (Month), as a view or snowflake table: Month Id, Month, Period
Order Fact: Day Id, Product Id, Customer Id, Cost, Number Ordered
Sales Fact: Month Id, Product Id, Customer Id, Amount Sold, Revenue
Conformed dimensions are usually designed within star schema data marts. For multiple-granularity fact tables, higher-level views of dimensions can be used (or a snowflake table).
So, What is a DW?
A data warehouse is a subject-oriented, integrated, non-volatile, and time-variant collection of data in support of management’s decisions
W.H. Inmon (the father of DW)
Subject Oriented
Data in a data warehouse are organized around the major subjects of the organization
Integrated
Data from multiple sources are standardized (scrubbed, cleansed, etc.) and brought into one environment
Non-Volatile
Once added to the DW, data are not changed (barring the existence of major errors)
Time Variant
The DW captures data at a specific moment, thus, it is a snap-shot view of the organization at that moment in time. As these snap-shots accumulate, the analyst is able to examine the organization over time (a time series!)
The snap-shot is called a production data extract
Need for Data Warehousing
Integrated, company-wide view of high-quality information (from disparate databases)
Separation of operational and informational systems and data (for improved performance)
[Table: comparison of operational and informational systems]
Data Warehouse Architectures
Generic two-level architecture
Independent data mart
Dependent data mart and operational data store
Logical data mart and @ctive warehouse
Three-layer architecture
All involve some form of extraction, transformation and loading (ETL)
Generic two-level architecture
[Diagram: ETL into one, company-wide warehouse]
Periodic extraction - data is not completely current in the warehouse
Independent Data Mart
Data marts: mini-warehouses, limited in scope
[Diagram]
Separate ETL for each independent data mart
Data access complexity due to multiple data marts
Dependent data mart with operational data store
[Diagram]
Single ETL for the enterprise data warehouse (EDW)
Simpler data access
ODS provides an option for obtaining current data
Dependent data marts loaded from the EDW
Logical data mart and @ctive data warehouse
[Diagram]
Near-real-time ETL for the @ctive data warehouse
ODS and data warehouse are one and the same
Data marts are NOT separate databases, but logical views of the data warehouse - easier to create new data marts
Three-layer architecture
DW Design
Mainly consists of Logical Design Physical Design
Logical Design of DW
Identification of entities, relationships, attributes, and unique identifiers
Conceptual and abstract; results in fact and dimension tables
Created using pen and paper, or modeling tools
Physical Design of DW
Conversion of the structures gathered in logical design into a physical database structure
Mainly driven by query performance
Logical -> Physical
Data Characteristics
Status vs. Event Data
[Example of a DBMS log entry: the status before the event, the event itself, and the status after]
Event = a database action (create/update/delete) that results from a transaction
Data Characteristics
Transient vs. Periodic Data
Figure 11-8: Transient operational data
Changes to existing records are written over previous records, thus destroying the previous data content
Data Characteristics
Transient vs. Periodic Data
Periodic warehouse data
Data are never physically altered or deleted once they have been added to the store
Data Reconciliation
Typical operational data is:
Transient - not historical
Not normalized (perhaps due to denormalization for performance)
Restricted in scope - not comprehensive
Sometimes poor quality - inconsistencies and errors
After ETL, data should be:
Detailed - not summarized yet
Historical - periodic
Denormalized
Comprehensive - enterprise-wide perspective
Quality controlled - accurate with full integrity
Extract Transform Load
Extract data from operational system, transform and load into data warehouse
Why ETL? Will your warehouse produce correct information with the current data? How can I ensure warehouse credibility?
Excuses for NOT Transforming Legacy Data
Old data works fine; new will work as well.
Data will be fixed at the point of entry through the GUI.
If needed, data will be cleaned after the new system is populated, after a proof-of-concept pilot.
Keys join the data most of the time.
Users will not agree to modifying or standardizing their data.
Levels of Migration Problem
Existing metadata is insufficient and unreliable
Metadata must hold for all occurrences
Metadata must represent business and technical attributes
Data values incorrectly typed and accessed
A value's form extracted from storage
A value's meaning inferred from its content
Entity keys unreliable or unavailable
Inferred from related values
Metadata Challenge
Metadata gets out of sync with the details it summarizes
Business grows faster than the systems designed to capture business info
Not at the right level of detail
Multiple values in a single field
Multiple meanings for a single field
No fixed format for values
Expressed in awkward or limited terms - a program/compiler view rather than a business view
Character-level Challenge
Value instance level: spelling, aliases, abbreviations, truncations, transpositions, inconsistent storage formats
Named type level: multiple meanings, contextual meanings, synonyms, homonyms
Entity level: no common keys or representation; no integrated view across records, files, systems
The ETL Process
Capture Scrub or data cleansing Transform Load and Index
ETL = Extract, transform, and load
The ETL Process
[Diagram: Source Systems -> Extract -> Transform (Staging Area) -> Load -> Presentation System]
Source Data
Record the name, location, and data that exist in the TPS environment:
File names and location
Layout
Attribute meaning
[Source metadata fields: Source, Business Owner, IS Owner, Platform, Location, Data Source Description]
Extraction
Copy specific data directly from the source tables into a working dataset in the staging area.
[Extraction mapping fields: Target Table, Target Column, Data Type, Len, Target Column Description, Source System, Source Table/File, Source Col/Field, Data Transform Notes]
Transformation (Dimension Tables)
Generate the surrogate key in a primary-surrogate table. Make this permanent.
Insert the surrogate key into the working dimension tables.
Conduct any editing/cleaning operations you need (usually on the working table).
Generate any derived attributes you need.
Generate and retain process logs.
Transformation (Fact Tables)
Join all dimensions to the fact table (using original primary keys).
Insert surrogate keys.
Generate derived facts.
Generate indicator flags.
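The fact-table steps above can be sketched in Python; a toy illustration with invented column names (prod_code, margin, low_margin_flag are assumptions, not from the deck):

```python
# Sketch of a fact-row transform: swap the business key for the surrogate key
# assigned when the dimension was loaded, then add derived facts and flags.

product_surrogates = {"PR X 002 39": 1, "PR X 003 40": 2}  # business key -> surrogate

def load_fact_row(raw):
    row = dict(raw)
    # Join to the dimension on the original primary key, keep only the surrogate.
    row["prod_sur"] = product_surrogates[row.pop("prod_code")]
    # Derived fact.
    row["margin"] = row["revenue"] - row["cost"]
    # Indicator flag.
    row["low_margin_flag"] = 1 if row["margin"] < 10 else 0
    return row

fact = load_fact_row({"prod_code": "PR X 002 39", "revenue": 25.0, "cost": 20.0})
```

The loaded row carries only the surrogate key, a derived margin, and a flag, mirroring the four bullets above.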
[Derived-fact metadata fields: Chg Flag, Fact Group, Derived Fact Name, Derived Fact Description, Type, Agg Rule, Formula, Constraints, Transformations]
Target Data
Describe the presentation data structure.
Model, metadata, usage and constraints
[Target metadata fields: Table Name, Column Name, Data Type, Len, Nulls?, Column Description, PK, PK Order, FK]
Flow Documentation
DFD for the ETL process
ERDs for the Source, Staging, and Target databases
Metadata usage notes
Steps in data reconciliation
Static extract = capturing a snapshot of the source data at a point in time
Incremental extract = capturing changes that have occurred since the last static extract
Capture = extract: obtaining a snapshot of a chosen subset of the source data for loading into the data warehouse
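The two capture modes can be sketched in Python; a minimal illustration assuming the source rows carry a last-modified timestamp (the schema and data are invented):

```python
# Static vs. incremental extract, driven by a last-modified timestamp.

source = [
    {"id": 1, "name": "Acme", "modified": "2004-01-10"},
    {"id": 2, "name": "Beta", "modified": "2004-02-05"},
]

def static_extract(rows):
    """Snapshot of the chosen subset at a point in time."""
    return [dict(r) for r in rows]

def incremental_extract(rows, last_extract):
    """Only the changes since the previous extract."""
    return [dict(r) for r in rows if r["modified"] > last_extract]

full = static_extract(source)
delta = incremental_extract(source, last_extract="2004-01-31")
```

ISO-formatted dates compare correctly as strings, which is why the timestamp filter works without date parsing.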
Steps in data reconciliation (continued)
Scrub = cleanse: uses pattern recognition and AI techniques to upgrade data quality
Fixing errors: misspellings, erroneous dates, incorrect field usage, mismatched addresses, missing data, duplicate data, inconsistencies
Also: decoding, reformatting, time stamping, conversion, key generation, merging, error detection/logging, locating missing data
Steps in data reconciliation (continued)
Transform = convert data from the format of the operational system to the format of the data warehouse
Record-level: selection (data partitioning), joining (data combining), aggregation (data summarization)
Field-level: single-field (from one field to one field), multi-field (from many fields to one, or one field to many)
Steps in data reconciliation (continued)
Load/Index = place transformed data into the warehouse and create indexes
Refresh mode: bulk rewriting of target data at periodic intervals
Update mode: only changes in source data are written to the data warehouse
Single-field transformation
In general, some transformation function translates data from the old form to the new form
Algorithmic transformation uses a formula or logical expression
Table lookup is another approach
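Both single-field approaches can be sketched in Python; the conversions shown (cents to dollars, state codes) are invented examples, not from the deck:

```python
# Two single-field transformations: formula-based and lookup-based.

def algorithmic(value_in_cents):
    """Algorithmic transform: apply a formula (here, cents -> dollars)."""
    return value_in_cents / 100

STATE_LOOKUP = {"TX": "Texas", "NY": "New York"}  # the lookup table

def table_lookup(code):
    """Table-lookup transform: translate a code into its full description."""
    return STATE_LOOKUP[code]
```

The lookup variant is preferred when the mapping is arbitrary (codes, abbreviations) rather than computable.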
Multi-field transformation
M:1 - from many source fields to one target field
1:M - from one source field to many target fields
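Both directions can be sketched in Python; the name and phone-number fields are hypothetical examples chosen for illustration:

```python
# Multi-field transformations in both directions.

def many_to_one(first_name, last_name):
    """M:1 - combine several source fields into one target field."""
    return f"{last_name}, {first_name}"

def one_to_many(full_phone):
    """1:M - split one source field into several target fields."""
    area, number = full_phone.split("-", 1)
    return {"area_code": area, "number": number}
```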
Derived Data
Objectives:
Ease of use for decision support applications
Fast response to predefined user queries
Customized data for particular target audiences
Ad hoc query support
Data mining capabilities
Characteristics:
Detailed (mostly periodic) data
Aggregate (for summary)
Distributed (to departmental servers)
Most common data model = star schema (also called "dimensional model")
Components of a star schema
Fact tables contain factual or quantitative data
Dimension tables contain descriptions about the subjects of the business
1:N relationship between dimension tables and fact tables
Excellent for ad hoc queries, but bad for online transaction processing
Dimension tables are denormalized to maximize performance
Star schema example
Fact table provides statistics for sales broken down by product, period and store dimensions
Star schema with sample data
Advanced Concepts
Slowly Changing Dimensions
Ragged Hierarchies
What if Our Data is not Static?
Small occasional changes in dimension data are normal in business. Examples of these changes include:
Addition of new members (a new product is launched)
Changing of relationships within the dimension (a sales rep moves to another branch)
Properties of members changed (a product is reformulated or renamed)
Deletion of members (this is rare in data warehousing)
Understand Surrogates
4-byte integer key (can hold more than 2 billion positive integers)
Internally assigned and meaningless
Ensures uniqueness
Always known
Used in conjunction with business keys
The business key is often mnemonic; for example, OTA used for the Ottawa office
The surrogate key is numeric; for example, 000128
Surrogate keys are never used in reports. They are used to link dimension tables to fact tables.
Understand Surrogate Keys Used In Operational Systems
Operational databases also sometimes use surrogate keys (for example, Employee_No). These keys typically cannot be used as the data mart surrogate keys.
A single member in a data mart (for example, a particular employee) may have several data mart surrogate keys assigned over time to deal with slowly changing dimensions.
You may have to merge entities from separate operational systems, each with its own operational surrogate key (for example, customers from separate banking and insurance applications).
Operational surrogate keys are usually considered business keys in the data mart.
Understand Natural Keys: Example
Fact Table (natural keys):
Prod Code / Cust Code / Measures
PR X 002 39 / SA 1 11 / ...
PR X 003 40 / LO 2 22 / ...
PR Y 003 40 / SE 5 55 / ...
Product Dimension:
Prod Code / Name
PR X 002 39 / Soup
PR X 003 40 / Beans
PR Y 003 40 / Peas
Customer Dimension:
Cust Code / Name
SA 1 11 / Safeway
LO 2 22 / Loblaws
SE 5 55 / 7-11
Understand Surrogate Keys: Example
Fact Table (surrogate keys):
Prod Sur / Cust Sur / Measures
1 / 10 / ...
2 / 20 / ...
3 / 30 / ...
Product Dimension:
Prod Sur / Prod Code / Name
1 / PR X 002 39 / Soup
2 / PR X 003 40 / Beans
3 / PR Y 003 40 / Peas
Customer Dimension:
Cust Sur / Cust Code / Name
10 / SA 1 11 / Safeway
20 / LO 2 22 / Loblaws
30 / SE 5 55 / 7-11
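The surrogate-key assignment shown above can be sketched in Python; a toy generator, not production ETL code:

```python
# Sketch: assign meaningless integer surrogates while loading a dimension.
# The business key stays on the row; the surrogate becomes the join key.

class SurrogateGenerator:
    def __init__(self, start=1):
        self.next_key = start
        self.by_business_key = {}

    def key_for(self, business_key):
        """Return the existing surrogate for a business key, or assign a new one."""
        if business_key not in self.by_business_key:
            self.by_business_key[business_key] = self.next_key
            self.next_key += 1
        return self.by_business_key[business_key]

gen = SurrogateGenerator()
dim = [{"prod_sur": gen.key_for(code), "prod_code": code, "name": name}
       for code, name in [("PR X 002 39", "Soup"),
                          ("PR X 003 40", "Beans"),
                          ("PR X 002 39", "Soup")]]
```

A repeated business key gets the same surrogate back, which is what makes the primary-surrogate table "permanent" in the transformation steps described earlier.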
Track Dimensional Changes Over Time
Operational systems tend to contain data about the current state of the business.
A data warehouse is expected to hold data for five to 10 years.
Users may need to query data as of any particular date (for example, at which office(s) did Mary Smith work between January/1999 and December/1999?).
If Mary Smith changes offices, to which office do her sales apply, the old one or the new one?
Understand Slowly Changing Dimensions (SCD)
Operational dimensional data may often be thought of as static. It may only need to reflect the current state.
Data warehouse dimensional data often must show how the dimensional data changes over time. It is not static.
The term Slowly Changing Dimension (SCD) refers to the tracking of changes to dimensional data over time.
Understand Issues With Slowly Changing Dimensions
Maintaining SCDs can be complex without surrogates. Imagine the effect of having such a large natural key in the fact table:
* Emp. No + Branch
** Emp. No + Branch + Position
*** Emp. No + Branch + Position + Salary
[Example: employee dimension]
Surrogate / Emp. No / Name / Branch / Position / Hire Date / Salary
1 / 10001 / Jack / OTA / VP / Jan88 / 50K
2 / 10002 / Jane / ARL / MK / Jan92 / 40K
3 / 10003 / Tom / NY / SS / Jan93 / 35K
4 / 10001 / Jack / SJ / VP / Jan88 / 50K
5 / 10001 / Jack / SJ / S-VP / Jan88 / 50K
6 / 10001 / Jack / SJ / S-VP / Jan88 / 60K
Emp. No is the business key; Name and Hire Date are handled as normal columns, while Branch, Position, and Salary are Type 2 columns.
Use Different Methods of Handling Dimensional Changes
Two most commonly used types of SCDs (according to Kimball):
Type 1. Overwrite the old value with the new value (do not track the changes).
Type 2. Add a new dimension record with a new surrogate key (track changes over time).
A single row may have a combination of columns of different types.
Type 1: Overwrite the Original Value
Sales Rep Dimension Table (before):
Rep Key / Name / Marital Status / Office
00128 / Mary Smith / Single / Dallas
Sales Rep Dimension Table (after overwrite):
Rep Key / Name / Marital Status / Office
00128 / Mary Jones / Married / Dallas
Sales Fact Table:
Rep Key / Order Date / Cust Key
00128 / 1/1/1999 / 12345
00128 / 2/1/1999 / 12345
The organization may not choose to track certain data changes because:
The original data may have been incorrect
The change is not considered relevant for tracking
When a tracked change is detected, a new surrogate key is assigned and a new row is added to the dimension table.
Usually, an effective begin/end date is also updated on the new and old rows.
Multiple rows may have the same business key, but they will always have unique surrogate keys.
Type 2: Add a New Dimension Record
Sales Rep Dimension Table:
RepSur Key / Rep Key / Name / Office / Eff Date
11111 / 00128 / Mary Smith / Dallas / 9901
11112 / 00128 / Mary Smith / NYC / 9903
Sales Fact Table:
RepSur Key / Order Date / Cust Key
11111 / 01/01/1999 / 12345
11111 / 02/01/1999 / 12345
11112 / 03/01/1999 / 12345
11112 / 04/01/1999 / 12345
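The Type 2 mechanics can be sketched in Python; a simplified illustration using the Mary Smith rows, with an end_date column added as an assumption (the slides only show effective dates):

```python
# Sketch of SCD Type 2: on a tracked change, close the current dimension row
# and append a changed copy under a fresh surrogate key.

def apply_scd2(dim_rows, business_key, changes, eff_date, next_surrogate):
    """Expire the current row for business_key and append the changed version."""
    current = next(r for r in dim_rows
                   if r["rep_key"] == business_key and r["end_date"] is None)
    current["end_date"] = eff_date                      # close the old row
    new_row = {**current, **changes,
               "rep_sur": next_surrogate, "eff_date": eff_date, "end_date": None}
    dim_rows.append(new_row)
    return dim_rows

dim = [{"rep_sur": 11111, "rep_key": "00128", "name": "Mary Smith",
        "office": "Dallas", "eff_date": "9901", "end_date": None}]
apply_scd2(dim, "00128", {"office": "NYC"}, eff_date="9903", next_surrogate=11112)
```

After the call, both rows share the business key 00128 but have unique surrogates, so old facts keep pointing at the Dallas row while new facts reference the NYC row.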
Balanced and Ragged Hierarchies
Dimensional data are usually structured as hierarchies, either balanced or ragged (unbalanced).
Balanced hierarchies (those with a fixed number of levels) are most common and are the easiest to understand and analyze.
In ragged hierarchies, each branch does not break down into the same number of levels. They are harder to analyze and report against.
Also, PowerPlay requires that all leaf (lowest-level) nodes be at the same level to aggregate properly.
Parent-Child Relationships
[Diagram: an Employees table with a recursive "reports to" relationship, and an Orders table linked to Employees]
Parent-child relationships are recursive relationships.
The levels of the hierarchy are determined by rows of the same table.
Ragged Hierarchies
[Employee hierarchy: Andrew Fuller at the top; Nancy Davolio, Janet Leverling, Margaret Peacock, Laura Callahan, and Steven Buchanan report to him; Michael Suyama, Robert King, and Anne Dodsworth report to Steven Buchanan]
Leaf nodes have no children.
DecisionStream fact builds only look for leaf nodes at the lowest level.
Resolve Ragged Hierarchies: Step 1
Create an auto-level hierarchy to obtain the number of levels.
Create a dimension build to create a physical table that identifies, for each row, the level it belongs to.
Resolve Ragged Hierarchies: Step 1 (cont'd) - Use Auto-Level Hierarchies
[Example levels: top level (Level 1) - Andrew Fuller; reports to Andrew Fuller (Level 2); reports to Steven Buchanan (Level 3)]
The purpose of auto-level hierarchies in DecisionStream is to determine the number of levels in a hierarchy.
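What an auto-level hierarchy computes can be sketched in Python; a toy walk over a parent-child table using a few of the Northwind-style names from the example, not DecisionStream itself:

```python
# Sketch: find the level of each node in a parent-child hierarchy by walking
# each chain up to the top node.

reports_to = {                      # child -> parent (None = top of hierarchy)
    "Andrew Fuller": None,
    "Steven Buchanan": "Andrew Fuller",
    "Nancy Davolio": "Andrew Fuller",
    "Michael Suyama": "Steven Buchanan",
}

def level_of(employee):
    """Level 1 is the top node; each step down the hierarchy adds one."""
    level = 1
    while reports_to[employee] is not None:
        employee = reports_to[employee]
        level += 1
    return level

levels = {name: level_of(name) for name in reports_to}
max_depth = max(levels.values())   # the number of levels in the hierarchy
```

The levels dictionary is the "physical table identifying the level each row belongs to" described in Step 1, and max_depth is the level count the auto-level hierarchy exists to determine.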
Issues Regarding Star Schema
Dimension table keys must be surrogate (non-intelligent and non-business related), because:
Keys may change over time
Length/format consistency
Granularity of the fact table - what level of detail do you want?
Transactional grain - finest level
Aggregated grain - more summarized
Finer grain -> better market basket analysis capability
Finer grain -> more dimension tables, more rows in the fact table
The User Interface
Metadata (data catalog)
Identify subjects of the data mart
Identify dimensions and facts
Indicate how data is derived from enterprise data warehouses, including derivation rules
Indicate how data is derived from the operational data store, including derivation rules
Identify available reports and predefined queries
Identify data analysis techniques (e.g., drill-down)
Identify responsible people
Q & A
Multi-dimensional data
On-Line Analytical Processing (OLAP)
The use of a set of graphical tools that provides users with multidimensional views of their data and allows them to analyze the data using simple windowing techniques
Relational OLAP (ROLAP): traditional relational representation
Multidimensional OLAP (MOLAP): cube structure
OLAP Operations
Cube slicing - come up with a 2-D view of the data
Drill-down - going from summary to more detailed views
Overall Plan
We need fast answers to analytical questions
Relational model may not be the answer
We can restructure data specifically for analysis
First we need to find out how people analyse data
Overall Plan
Analysing how people analyse reveals the importance of measures and dimensions
So we structure the data with that in mind
The star schema is the physical structure that emerges
We can implement this as ROLAP, MOLAP and HOLAP
We achieve our objective – rapid analytical processing
How do we make databases faster?
Lots of ways: indexing, query design, application design, care with locking, data structuring
Data structuring - the relational model
Pros: data integrity in the face of user updates; small data volume; good for transactional queries
Cons: poor analytical query performance
Why are relational databases slow? Joins, functions, aggregations
Poor analytical query performance
So, there is a tension between transactions and analytical querying.
Solution: split them up. Take a copy of the transactional database and structure it in a totally different way that is optimised for analytical querying.
Structure for analytical querying
Great idea, but first we need to find out how people analyse their data
How people analyse their data
People analyse their data in terms of: graphs, grids, and reports
Do these have anything in common?
How people analyse their data
Do these have anything in common?
Measures: numerical values, typically plotted on the Y axis
Dimensions: discontinuous variables that slice the measures into aggregated groups, typically plotted on the X axis
How people analyse their data
Dimensions are often hierarchical. People want to analyse:
Time by Year, Quarter, Month, Day
Product by Warehouse, Type, Product
Customer by Country, County, Person
How people analyse their data
So, we need to summarise all of this….. Measures Dimensions Hierarchies
Squashed Octopus
[Diagram: the measures (Delay, Profit, Quantity) sit at the centre, with dimension hierarchies radiating outward - Time (Year, Quarter, Month, Week, Day), Product (Warehouse, Class, Item), Customer (Country, Region, County, Name), Employee (Region, Employee)]
The SO (squashed octopus) is a logical model. What about a physical model?
(Recap: the relational model is slow for analytical querying because of joins, functions, and aggregations)
Star Schema
Star Schema
What is in the fact table? Facts
Star Schema
What is in a dimension table? Dimensional information Hierarchical information
Star Schema
How do dimension and fact tables work together?
Star Schema
Is it faster?
[Query times, relational vs. star schema]
Monthly totals: Relational 70, Star schema 60
Sales in March by Product: Relational 18, Star schema 6
Sales in March 2004 by Product: Relational 12, Star schema 2
Star Schema
Is it faster? Yes
How can we make it even faster? Aggregation
[Query times, relational vs. star schema vs. aggregated star schema]
Monthly totals: Relational 70, Star schema 60, Aggregated <1
Sales in March by Product: Relational 18, Star schema 6, Aggregated <1
Sales in March 2004 by Product: Relational 12, Star schema 2, Aggregated <1
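The aggregation idea can be sketched in Python; a toy pre-computed "monthly totals" table with invented sample data:

```python
# Sketch: pre-compute an aggregate table so summary queries read a few rows
# instead of scanning the whole fact table.
from collections import defaultdict

sales_fact = [                      # grain: one row per day/product
    {"month": "2004-03", "product": "Soup", "units": 10},
    {"month": "2004-03", "product": "Soup", "units": 5},
    {"month": "2004-03", "product": "Beans", "units": 7},
    {"month": "2004-04", "product": "Soup", "units": 2},
]

monthly_totals = defaultdict(int)   # the aggregate "table", keyed by month
for row in sales_fact:
    monthly_totals[row["month"]] += row["units"]

# A "monthly totals" query now reads the aggregate directly.
march_total = monthly_totals["2004-03"]
```

This is why aggregated queries drop to sub-second times in the table above, and also why all those aggregate tables become "a pain to manage": each one must be rebuilt whenever the fact table changes.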
Star Schema
If we leave this as a set of tables then it is ROLAP - Relational OLAP (OLAP = On-Line Analytical Processing).
But it is a pain to manage: all those aggregation tables.
Star Schema
So, the answer is MOLAP (Multi-dimensional OLAP)
Star Schema
Finally HOLAP (Hybrid OLAP)
On-Line Analytical Processing (OLAP)
OLAP Operations
Cube slicing - come up with a 2-D view of the data
Drill-down - going from summary to more detailed views
Figure 11-22: Slicing a data cube
Example of drill-down
Summary report
Drill-down with color added
Summary
We need fast answers to analytical questions
Relational model may not be the answer
We can restructure data specifically for analysis
First we need to find out how people analyse data
Summary
Analysing how people analyse reveals the importance of measures and dimensions
So we structure the data with that in mind
The star schema is the physical structure that emerges
We can implement this as ROLAP, MOLAP and HOLAP
We achieve our objective – rapid analytical processing