OLAP CS 543 – Data Warehousing. CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS2...

57
OLAP CS 543 – Data Warehousing
  • date post

    21-Dec-2015
  • Category

    Documents

  • view

    221
  • download

    2

Transcript of OLAP CS 543 – Data Warehousing. CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS2...

OLAP

CS 543 – Data Warehousing

CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS 2

Where Does OLAP Fit In? (1)

OLAP = On-line analytical processing. OLAP is a characterization of applications, not a database

design technique. Idea is to provide very fast response time in order to

facilitate iterative decision-making. Analytical processing requires access to complex

aggregations (as opposed to record-level access).

CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS 3

Where Does OLAP Fit In? (2)

Information is conceptually viewed as “cubes” for simplifying the way in which users access, view, and analyze data.

Quantitative values are known as “facts” or “measures.”e.g., sales $, units sold, etc.

Descriptive categories are known as “dimensions.” e.g., geography, time, product, scenario (budget or actual),

etc. Dimensions are often organized in hierarchies that represent levels

of detail in the data (e.g., UPC, SKU, product subcategory, product category, etc.).

CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS 4

OLAP FASMI TestFast: Delivers information to the user at a fairly constant rate.

Most queries should be delivered to the user in five seconds or less.

Analysis: Performs basic numerical and statistical analysis of the data, pre-defined by an application developer or defined ad hoc by the user.

Shared: Implements the security requirements necessary for sharing potentially confidential data across a large user population.

Multi-dimensional: The essential characteristic of OLAP.Information: Accesses all the data and information necessary

and relevant for the application, wherever it may reside and not limited by volume.

...from the OLAP Report by Pendse and Creeth.

CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS 5

Need for Multidimensional Analysis

A simple analysis How many units of product A did we sell in the store in

DHA, Lahore

Typically, decision support requires more complex analyses

How much revenue did the new product X generate during the last three months, broken down by individual months, in the Southern Region, by individual stores, broken down by the promotions, compared to estimates, and compared to the previous version of the the product?

CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS 6

Kinds of Analyses

Roll-ups to provide summaries and aggregates along the hierarchies of the dimensions

Drill-downs from the top level to the lowest along the hierarchies of the dimensions

Calculations involving facts and metrics Algebraic equations involving key performance

indicators Moving averages and growth percentages Trend analyses using statistical methods

CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS 7

CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS 8

CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS 9

OLAP?

The name On-Line Analytical Processing was coined in a paper by E.F. Codd in 1993 (“Providing On-Line Analytical Processing for User Analysts”)

A definition OLAP is a category of software technology that enables

analysts, managers, and executives to gain insight into data through fast, consistent, interactive access in a a wide variety of possible views of information that has been transformed from raw data to reflect the real dimensionality of the enterprise as understood by the user

CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS 10

OLAP Features

CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS 11

Dimensional Analysis (1)

CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS 12

Dimensional Analysis (2)

CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS 13

Some Queries

Display the total sales of all products for past five years in all stores

Compare total sales for all stores, product by product, between years 2000 and 1999.

Show comparison of sales by individual stores, product by product, between years 2000 and 1999 only for those products with reduced sales.

Show the results of the previous queries, but rotating the columns with rows

CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS 14

Hypercubes

Multi-dimension cubes Hard to visualize and display beyond three dimensions

Multi-dimensional domain structure (MDS) Represents each dimension as a line showing the values

CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS 15

MDS

CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS 16

Display of Hypercubes

CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS 17

CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS 18

CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS 19

Drill-Down and Roll-Up

CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS 20

Slice-and-Dice or Rotation

CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS 21

OLAP Models/Implementations

MOLAP: OLAP implemented with a multi-dimensional database.

ROLAP: OLAP implemented with a relational database.

HOLAP: OLAP implemented with a hybrid of multi-dimensional and relational database technologies.

DOLAP: OLAP implemented for desktop decision support environments.

CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS 22

ROLAP and MOLAP

CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS 23

MOLAP Implementations

OLAP has historically been implemented through use of multi-dimensional databases (MDDs).

Dimensions are key business factors for analysis: geographies (zip, state, region,...) products (item, product category, product department,...) dates (day, week, month, quarter, year,...)

Very high performance via fast look-up into “cube” data structure to retrieve pre-calculated results.

“Cube” data structures allow pre-calculation of aggregate results for each possible combination of dimensional values.

Use of application programming interface (API) for access via front-end tools.

CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS 24

CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS 25

MOLAP Implementations

Need to consider both maintenance and storage implications when designing strategy for when to build cubes.

Maintenance Considerations: Every data item received into MDD must be aggregated into every cube (assuming “to-date” summaries are maintained).

Storage Considerations: Although cubes get much smaller (e.g., more dense) as dimensions get less detailed (e.g., year vs. day), storage implications for building hundreds of cubes can be significant.

CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS 26

MOLAP Implementations

Typically outperform relational database technology because all answers are pre-computed into cubes (and overhead for accessing cubes is very low).

Difficult to scale because of combinatorial explosion in the number and size of cubes when dimensions of significant cardinality are required.

Beyond tens (sometimes small hundreds) of thousands of entries in a single dimension will break the MOLAP model because the pre-computed cube model does not work well when the cubes are very sparse in the population of individual cells.

See www.olapreport.com/DataExplosion.htm

CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS 27

Virtual CubesVirtual cubes are used when there is a need to join

information from two dissimilar cubes that share one or more common dimensions.

Similar to a relational view; two (or more) cubes are linked along common dimension(s).

Often used to save space by eliminating redundant storage of information.

CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS 28

Partitioned Cubes

One logical cube of data can be spread across multiple physical cubes on separate (or same) servers.

The divide-and-conquer approach of partitioned cubes helps to mitigate the scalability limitations of a MOLAP environment.

Ideal cube partitioning is completely invisible to end users.

CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS 29

ROLAP Implementations

Advances in database technologies and front-end tools have begun to allow deployment of OLAP using ANSI SQL RDBMS implementations.

ROLAP facilitates deployment of much larger dimension tables than MOLAP implementations.

Front-end tools to facilitate GUI access to multi-dimensional analysis capabilities.

Aggregate awareness allows exploitation of pre-built summary tables for some front-end tools.

Star schema designs are often used to facilitate OLAP against relational databases.

CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS 30

CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS 31

Simplified Third Normal Form (Retail)

ZONE REGION

zip _x_SMSA1

ZIP ZONE ZIP SMSA ZIP ADI QTR YR

STORE # ADDRESS ZIP ...WEEK QTR

DATE WEEK

RECEIPT # STORE # DATE ...

DATE WEATHER

RECEIPT #ITEM # ... $

ITEM # CATEGORYITEM # MFCTR

DEPTCATEGORY

zip _x_adi year

quarter

week

sale_detail

item_x_categoryitem_x_mfctr

category_x_dept

M

1M

1 1 1M

1M11

M M

1 M 1

M

M M1 1

M1

STORE #1

M

M M

ZONE REGION

zip _x_SMSA1

ZIP ZONE ZIP SMSA ZIP ADI QTR YR

STORE # ADDRESS ZIP ...WEEK QTR

DATE WEEK

RECEIPT # STORE # DATE ...

DATE WEATHER

RECEIPT #ITEM # ... $

ITEM # CATEGORYITEM # MFCTR

DEPTCATEGORY

zip _x_adi year

quarter

week

sale_detail

item_x_categoryitem_x_mfctr

category_x_dept

M

1M

1 1 1M

1M11

M M

1 M 1

M

M M1 1

M1

STORE #1

M

M M

CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS 32

Simplified Star Schema

ITEM# CATEGORY DEPT MFCTR ...

ITEM# STORE# DATERECEIPT# ...

M

1

Fact Table

Product Dimension Table

STORE# ADDRESS ZIP ADI SMSA ZONE

1

M

Geography Dimension Table

REGION

$

DATE WEEK QUARTER YEAR ...

Calendar Dimension Table

1

M

A vastly simplified model ... may even summarize out receipt # .....

STORE# DATE WEATHER

1

M

1

M

ITEM# CATEGORY DEPT MFCTR ...

ITEM# STORE# DATERECEIPT# ...

M

1

Fact Table

Product Dimension Table

STORE# ADDRESS ZIP ADI SMSA ZONE

1

M

Geography Dimension Table

REGION

$

DATE WEEK QUARTER YEAR ...

Calendar Dimension Table

1

M

A vastly simplified model ... may even summarize out receipt # .....

STORE# DATE WEATHER

1

M

1

M

CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS 33

Simplified Star Schema

A vastly simplified physical data model!

Collapse dimensional hierarchies into a single table for each dimension and create a single fact table from the header and detail records:

Fewer tables. Fewer joins to get results.

CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS 34

Star Schema for High Performance

Business question: How many $ in raincoats did I sell in the first week of January through stores in Boston?

Assume: 4 Billion rows in fact table. 20 different kinds (size, color, style) of raincoats

(product category) out of 50,000 UPCs in store. 8 stores out of 400 are in BOSTON SMSA. 2 years of POS history in DBMS.

CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS 35

Star Schema for High PerformanceSimple (poor performance) approach to query

execution:

1. Join item table with filtering on raincoat product category (very selective) to fact table.

2. Join date table with filtering by week (next most selective) to result table.

3. Join store table with filtering on store to result table from step 2.

4. Aggregate.

CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS 36

Star Schema for High Performance

Advanced (better performance) approach to query execution:

1. Cartesian product join between dimensional tables.* Result is 20 x 8 x 7 = 1,120 rows.

2. Use composite index on item:store:day into fact table for very selective access.* Access less than 0.00000008 percent of data in fact table!

Sophisticated cost-based optimizers will figure this out.

CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS 37

Forcing a Cartesian Product Join

Add an addition “join_value” column in each dimensional table.

Set join_value to same value in all rows of the dimensional tables.

Add additional where clause predicates joining on this column between dimensional tables.

NOTE: This shouldn't be necessary with a “smart” optimizer.

CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS 38

Forcing a Cartesian Product Join

Sample code:

select sum(sales.sales_amt)from d_sales_detail ,store ,item ,periodwhere d_sales_detail.store_id = store.store_id and d_sales_detail.item_id = item.item_id and d_sales_detail.day_dt = period.day_dt and period.day_dt between '23-NOV-2000' and '24-DEC-2000' and item.trade_style_cd = 'BARBIE' and store.state_cd = 'CA' and store.join_value = period.join_value and store.join_value = item.join_value and period.join_value = item.join_value;

CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS 39

Star Schema for High PerformanceProblem: What if I want to know raincoat sales in first

week of January regardless of store?

Answer: Performance advantage of composite index in traditional RDBMS is severely impaired!

B-tree indexing techniques do not allow for flexibility in the use of dimensions for query purposes.

Bit indexing (and variations thereof) often allows much more generality in achieving high performance from a star schema.

CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS 40

Star Schema for High Performance

Bottom Line:

It is not at all unusual to obtain an order of magnitude (or more) in performance advantage using a star schema with advanced indexing versus a more traditional relational database implementation.

Despite what vendors may tell you, star schemas cannot be effectively implemented for all DSS business applications and/or data models.

CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS 41

ROLAP

Relational OLAP often makes heavy use of summary tables to provide near instantaneous access for multi-dimensional queries.

Foundation is usually star schema or snowflake database design.

Allows OLAP with much larger data sets than multi-dimensional database (MDD) products using cube structures (MOLAP).

CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS 42

ROLAP

Number of summary tables can get very large if discipline is not enforced...

Assume a retail database with the following two dimensions on the fact table...

Calendar: Day, Week, Period, Quarter, Year, All Days

Geography: Store, Zone, District, Region, All Stores

CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS 43

ROLAP

All Days 13 19 24 28 30

Year 9 15 22 27 29

Quarter 6 11 18 23 26

Period 4 8 14 21 25

Week 2 5 10 17 20

Day 1 3 7 12 16

Store Zone District Region All Stores

Summary tables in a naive implementation require all combinations of the dimensions at each aggregation level...

30 summary tables! ... Add in item, SKU, subcategory, category, and all items...now we are up to 150 pre-aggregates!

CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS 44

ROLAP

Summary tables are more of a maintenance issue than a storage issue in most production implementations.

Notice that summary tables get much smaller as dimensions get less detailed (e.g., year vs. day).

Should plan for double the size of the unsummarized data for ROLAP summaries in most environments.

Every detail record that is received into warehouse must aggregate into EVERY summary table (assuming "to-date" summaries are maintained).

CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS 45

ROLAP

Warning: Do not assume that dimensions are always simple hierarchies.

Example: Items are not just category, subcategory, SKU, and atomic item.... what about trade styles or manufacturer?

Now we need summary tables along these lines as well...another 120 summary tables!

Calendar vs. accounting period vs. billing cycle can be even worse...

CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS 46

ROLAP

Many ROLAP products have devised ways to reduce the number of summary tables:

Ability to build summaries on-the-fly as demanded by end-user applications.

Ability to aggregate efficiently from subset of the summary tables.

Tools exist in some products to assist in DBAs in selecting the "best” aggregations to build.

HOLAP (Hybrid OLAP) tools allow co-existence of pre-built cubes alongside relational OLAP structures.

CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS 47

Intelligent Aggregation Selection Maximum performance boost implies lots of disk

for every pre-calculation. Minimum performance boost implies no disk with

zero pre-calculation. Strategy is to use meta data to heuristically

determine optimum set of aggregates from which all other aggregates can be derived.

CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS 48

Aggregate Wizards

CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS 49

Fact Table Aggregates

Enhance performance on common queries at coarser granularities.

Save space to permit storing more history than possible with finer granularities.

Take advantage of need to store other facts (with similar samples) at a particular granularity.

CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS 50

Aggregate Advice

Coarser granularity decreases potential cardinality, but usually increases density (e.g., daily summary table is typically twice the size of weekly summary table - not seven times).

Strongly consider omitting candidate aggregates where expected cardinality is more than 10% that of next finer granularity stored.

Keep the detail for drill down, even if you deploy aggregates for performance.

CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS 51

Bottom Line

There are many implementation techniques for delivery of an OLAP environment.

Must fully consider the performance, scalability, complexity, and flexibility characteristics when deciding between MOLAP, ROLAP, and HOLAP.

Understand your tools and RDBMS!

CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS 52

MOLAP Vs. ROLAP

CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS 53

Implementation Issues

Data design and preparation Administration and performance OLAP platforms OLAP tools and products

CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS 54

Data Design and Preparation

Characteristics of data Stores and uses much less data compared to a DW Data is summarized. You will rarely find data at the lowest

levels of detail as in the DW Data is more flexible for processing and analysis partly

because there is much less data to work with Every instance of the OLAP system in your environment is

customized for the purpose that instance serves

OLAP data is generally customized Types and levels

Static and dynamic summary data Permanent and transient detailed data

CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS 55

Administration

Administering and managing OLAP systems should be handled with that of the DW environment

Some considerations Expectations on what data would be accessed and how Selection of the right business dimensions Selection of the right filters in loading data from the DW Choosing the aggregation, summarization, and precalculation Size of the multidimensional database Access and security privileges Backup and restore facilities Drill-through to the data warehouse; drill-through to another

OLAP instance

CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS 56

Performance

OLAP takes most of the queries that normalling would run against the DW

OLAP is designed for complex queries; so it should enhance overall query performance of the DW environment

OLAP can precalculate and pre-aggregate data for quick response

CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS 57

OLAP Platforms

Usually, the data warehouse and OLAP systems reside on the same platform in the start. Later when the data warehouse becmes large and OLAP is a common task, OLAP system is moved to another platform

A separate platform is needed when The size and usage of the DW consumes all resources Many departmental users desire OLAP capabilities The stability and performance of OLAP degrades OLAP tools require a different platform configuration than

the DW