25. Data Warehouses and Decision Support

14
08/25/22 PSU’s CS 587-3 25. Data Warehouses and Decision Support Len Shapiro, for CS386, 11/2-3/05. Some slides taken from Ramakrishnan and Gherke, with permission.

description

25. Data Warehouses and Decision Support. Len Shapiro, for CS386, 11/2-3/05. Some slides taken from Ramakrishnan and Gherke, with permission. 25. Whs. Overview. Increasingly, organizations are analyzing current and historical data to identify useful patterns and support business strategies. - PowerPoint PPT Presentation

Transcript of 25. Data Warehouses and Decision Support

Page 1: 25. Data Warehouses and Decision Support

04/19/23 1PSU’s CS 587-3

25. Data Warehouses and Decision Support

Len Shapiro, for CS386, 11/2-3/05.Some slides taken from Ramakrishnan

and Gherke, with permission.

Page 2: 25. Data Warehouses and Decision Support

04/19/23 2PSU’s CS 587-3

Overview

Increasingly, organizations are analyzing current and historical data to identify useful patterns and support business strategies.

Emphasis is on complex, interactive, exploratory analysis of very large datasets created by integrating data from across all parts of an enterprise; data is fairly static. Contrast such On-Line Analytic Processing

(OLAP) with traditional On-line Transaction Processing (OLTP): mostly long queries, instead of short update Xacts.

25. Whs.

Page 3: 25. Data Warehouses and Decision Support

04/19/23 3PSU’s CS 587-3

Enterprise Applications

OLTP / Operational / Production

DSS / Warehouse / DataMart

Operate the business / Clerks Diagnose the business / Managers

Short queries, small amts of data

opposite

Queries change data opposite

Customer inquiry, Order Entry, etc.

Statistics, Visualization, Data Mining, etc.

Legacy Applications, Heterogeneous databases

Opposite

Often Distributed Often Centralized (Warehouse)

Current data Current and Historical data

25. Whs.

Page 4: 25. Data Warehouses and Decision Support

04/19/23 4PSU’s CS 587-3

Data Warehouse Requirements

Operational Warehouse

General E-R Diagrams Multidimensional data model common

Locks necessary No Locks necessary

Crash recovery required Crash recovery optional

Smaller volume of data Huge volume of data

Need indexes designed to access small amounts of data

Need indexes designed to access large volumes of data

Page 5: 25. Data Warehouses and Decision Support

04/19/23 5PSU’s CS 587-3

Data Warehousing

Integrated data spanning long time periods, often augmented with summary information.

Several gigabytes to terabytes common.

Interactive response times expected for complex queries; ad-hoc updates uncommon.

EXTERNAL DATA SOURCES

EXTRACTTRANSFORM LOAD REFRESH

DATAWAREHOUSE Metadata

Repository

SUPPORTS

OLAPDATAMINING

25. Whs.

Page 6: 25. Data Warehouses and Decision Support

04/19/23 6PSU’s CS 587-3

25.2 Multidimensional Data Model

Collection of numeric measures, which depend on a set of dimensions. E.g., measure Sales, dimensions

Product (key: pid), Location (locid), and Time (timeid).

8 10 10

30 20 50

25 8 15

1 2 3 timeid

p

id11

12

13

11 1 1 25

11 2 1 8

11 3 1 15

12 1 1 30

12 2 1 20

12 3 1 50

13 1 1 8

13 2 1 10

13 3 1 10

11 1 2 35

pid

tim

eid

locid

sale

s

locid

Slice locid=1is shown:

Page 7: 25. Data Warehouses and Decision Support

04/19/23 7PSU’s CS 587-3

Examples of Dimensional Data

Products(ProductID, StoreID, DateID, Sale) Product(ID, SKU, size, brand) Store(ID, Address, Sales District, Region, Manager) Date (Date, Week, Month, Holiday, Promotion)

Claims(ProvID, MembID, Procedure, DateID, Cost) Providers(ID, Practice, Address, ZIP, City, State) Members(ID, Contract, Name, Address) Procedure (ID, Name, Type)

Telecomm (CustID, SalesRepID, ServiceID, DateID) SalesRep(ID, Address, Sales District, Region, Manager) Service(ID, Name, Category)

Page 8: 25. Data Warehouses and Decision Support

04/19/23 8PSU’s CS 587-3

MOLAP vs ROLAP

Multidimensional data can be stored physically in a (disk-resident, persistent) array; called MOLAP systems. Alternatively, can store as a relation; called ROLAP systems.

The main relation, which relates dimensions to a measure, is called the fact table. Each dimension can have additional attributes and an associated dimension table. E.g., Products(pid, pname, category, price) Fact tables are much larger than dimensional tables.

25. Whs.

Page 9: 25. Data Warehouses and Decision Support

04/19/23 9PSU’s CS 587-3

Dimension Hierarchies

For each dimension, some of the attributes may be organized in a hierarchy:

PRODUCT TIME LOCATION

pname week city

PID date ZIP

year

category quarter state

25. Whs.

Page 10: 25. Data Warehouses and Decision Support

04/19/23 10PSU’s CS 587-3

25.3 OLAP Queries

Influenced by SQL and by spreadsheets. A common operation is to aggregate a

measure over one or more dimensions. Find total sales. Find total sales for each city, or for each state. Find top five products ranked by total sales.

Roll-up: Aggregating at different levels of a dimension hierarchy. E.g., Given total sales by city, we can roll-up to

get sales by state.

25. Whs.

Page 11: 25. Data Warehouses and Decision Support

04/19/23 11PSU’s CS 587-3

OLAP Queries Drill-down: The inverse of roll-up.

E.g., Given total sales by state, can drill-down to get total sales by city.

E.g., Can also drill-down on different dimension to get total sales by product for each state.

Pivoting: Aggregation on selected dimensions. E.g., Pivoting on Location and Time

yields this cross-tabulation:63 81 144

38 107 145

75 35 110

WI CA Total

2002

2003

2004

176 223 339Total

Slicing and Dicing: Equality and range selections on one or more dimensions.

25. Whs.

Page 12: 25. Data Warehouses and Decision Support

04/19/23 12PSU’s CS 587-3

Comparison with SQL Queries The cross-tabulation obtained by pivoting can also be

computed using a collection of SQLqueries:

SELECT SUM(S.sales)FROM Sales S, Times T, Locations LWHERE S.timeid=T.timeid AND S.timeid=L.timeidGROUP BY T.year, L.state

SELECT SUM(S.sales)FROM Sales S, Times TWHERE S.timeid=T.timeidGROUP BY T.year

SELECT SUM(S.sales)FROM Sales S, Location LWHERE S.timeid=L.timeidGROUP BY L.state

25. Whs.

Page 13: 25. Data Warehouses and Decision Support

04/19/23 13PSU’s CS 587-3

The CUBE Operator

Generalizing the previous example, if there are k dimensions, we have 2^k possible SQL GROUP BY queries that can be generated through pivoting on a subset of dimensions.

CUBE pid, locid, timeid BY SUM Sales Equivalent to rolling up Sales on all eight subsets

of the set {pid, locid, timeid}; each roll-up corresponds to an SQL query of the form:

SELECT SUM(S.sales)FROM Sales SGROUP BY grouping-list

Lots of work on optimizing the CUBE operator!

25. Whs.

Page 14: 25. Data Warehouses and Decision Support

04/19/23 14PSU’s CS 587-3

Summary Decision support is an emerging, rapidly growing

subarea of databases. Involves the creation of large, consolidated data

repositories called data warehouses. Warehouses exploited using sophisticated analysis

techniques: complex SQL queries and OLAP “multidimensional” queries (influenced by both SQL and spreadsheets).

New techniques for database design, indexing, view maintenance, and interactive querying need to be supported (CS587).

25. Whs.