OLAP1
Transcript of OLAP1
-
8/17/2019 OLAP1
1/31
Data Warehousing and OLAP
dr. Toon [email protected]
Motivation• « Traditional » relational databases are
geared towards online transactionprocessing: – bank terminal – flight reservations – student administration
• Decision support systems have differentrequirements
-
8/17/2019 OLAP1
2/31
Transaction Processing
Transaction processing
• Operational setting• Up-to-date = critical
• Simple data
• Simple queries; only« touch » a small partof the database
Flight reservations
• ticket sales• do not sell a seat
twice• reservation, date,
name
• Give flight details of XList flights to Y
Transaction Processing• Database must support
– simple data• tables
– simple queries• select from where …
– consistency & integrity CRITICAL – concurrency
• Relational databases, Object-Oriented,Object-Relational
-
8/17/2019 OLAP1
3/31
Decision Support
Decision support
• Off-line setting• « Historical » data• Summarized data
• Integrate different
databases• Statistical queries
Flight company
• Evaluate ROI flights• Flights of last year• # passengers per
carrier for destination X• Passengers, fuel costs,
maintenance info• Average % of seats
sold/month/destination
Outline of this lectureOnline Analytical Processing• Data Warehouses• Conceptual model: Data Cubes• Query languages for supporting OLAP
– SQL extensions
– MDX• Database Explosion Problem
-
8/17/2019 OLAP1
4/31
A decision support DB maintained separatelyfrom the operational databases.
Why Separate Data Warehouse?• Different functions
– DBMS— tuned for OLTP – Warehouse—tuned for OLAP
• Different data – Decision support requires historical data
• Integration of data from heterogeneoussources
Data Warehouse
Three-Tier Architecture
DataWarehouse
ExtractTransformLoadRefresh
OLAP Engine
Monitor&
IntegratorMetadata
Data Sources Front-End Tools
Serve
Data Marts
Operational
DBs
othersources
Data Storage
OLAPServer
Analysis
Query/Reporting
Data Mining
ROLAPServer
-
8/17/2019 OLAP1
5/31
Three-Tier Architecture
• Extract-Transform-Load – Get data from different sources; data
integration – Cleaning the data – Takes significant part of the effort (up to 80%!)
• Refresh – Keep the data warehouse up to date when
source data changes
Three-Tier Architecture• Data storage
– Optimized for OLAP – Specialized data structures – Specialized indexing structures
• Data marts – common term to refer to “less ambitious data
warehouses” – Task oriented, departmental level
-
8/17/2019 OLAP1
6/31
OLAP
• OLAP = OnLine Analytical Processing – Online = no waiting for answers
• OLAP system = system that supportsanalytical queries that are dimensional in nature.
Outline of the lectureOnline Analytical Processing• Data Warehouses• Conceptual model: Data cubes• Query languages for supporting OLAP
– SQL extensions
– MDX• Database Explosion Problem
-
8/17/2019 OLAP1
7/31
Examples of Queries
• Flight company: evaluate ticket sales – give total, average, minimal, maximal amount – per date: week, month, year – by destination/source port/country/continent – by ticket type – by # of connections – …
Common Characteristics• One (or few) special attribute(s): amount
measure
• Other attributes: select relevant regions dimensions
• Different levels of generality (month, year)
hierarchies• Measurement data is summarized: sum,
min, max, averageaggregations
-
8/17/2019 OLAP1
8/31
Supermarket Example
• Evaluate the sales of products – Product cost in $ – Customer: ID, city, state, country, – Store: chain, size, location, – Product: brand, type, … – …
• What are the measure and dimensionalattributes, where are the hierarchies?
measure
Dim.hierarchies
Why Dimensions?
customer
store
product
Cost in $
• Multidimensional view on the data
-
8/17/2019 OLAP1
9/31
Cross Tabulation
• Cross-tabulations are highly useful – Sales of clothes June August ‘06
138512265August
57032967174Total
1981202058July
2341582551June
TotalOrangeRedBlue
Product: color
Date:month,June August2006
Data Cubes• Extension of Cross-Tables to multiple
dimensions• Conceptual notion
138512265August
57032967174Total
1981202058July
2341582551June
TotalOrangeRedBlue
Dimensions
Data Points/ 1st level of aggregation
Aggregatedw.r.t. X-dim
Aggregatedw.r.t. Y-dim
Aggregatedw.r.t. X and Y
-
8/17/2019 OLAP1
10/31
Data CubesDate
P r o d
u c t
C o u n t r y
sum
sum TV
VCRPC
1Qtr 2Qtr 3Qtr 4Qtr
Ireland
France
Germany
sum
Data Cubes
1) #TV’s sold in the 2Qtr?2) Total sales in 3Qtr?3) #products sold in France
this year4) #PC’s in Ireland in 2Qtr?
sum 1Qtr 2Qtr 3Qtr 4Qtr
Ireland
France
Germany
sum
sum
TV
VCRPC
-
8/17/2019 OLAP1
11/31
Data Cubes
sum
sum
4
2
3
TV
VCRPC
1Qtr 2Qtr 3Qtr 4Qtr
Ireland
France
Germany
sum
1) #TV’s sold in the 2Qtr?
2) Total sales in 3Qtr?3) #products sold in France
this year4) #PC’s in Ireland in 2Qtr?
In the back, at thebottom, second column
Outline of the lectureOnline Analytical Processing• Data Warehouses• Conceptual model: Data cubes• Query languages for supporting OLAP
– SQL extensions
– MDX• Database Explosion Problem
-
8/17/2019 OLAP1
12/31
Operations with Data Cubes
• What operations can you think of that ananalyst might find useful? (e.g., store)
Operations with Data Cubes• What operations can you think of that an
analyst might find useful? (e.g., store) – only look at stores in the Netherlands – look at cities instead of individual stores – look at the cross-table for product-date – restrict analysis to 2006, product O1 – go back to a finer granularity at the store level
-
8/17/2019 OLAP1
13/31
Roll-Up
• Move in one dimension from a lowergranularity to a higher one – store city – cities country – product product type – Quarter Semester
Roll-Up
P r o d
u c t
C o u n t r y
sum
sum TV
VCRPC
1Qtr 2Qtr 3Qtr 4Qtr
Ireland
France
Germany
sum
Date
-
8/17/2019 OLAP1
14/31
Roll-Up
P r o d
u c t
C o u n t r y
sum
sum TV
VCRPC
1st semester
Ireland
France
Germany
sum
Date2nd semester
Drill-down• Inverse operation• Move in one dimension from a higher
granularity to a lower one – city store – country cities – product type product
• Drill-through: – go back to the original, individual data records
-
8/17/2019 OLAP1
15/31
Pivoting
• Change the dimensions that are“displayed”; select a cross-tab. – look at the cross-table for product-date – display cross-table for date-customer
Pivoting• Change the dimensions that are
“displayed”; select a cross-tab. – look at the cross-table for product-date – display cross-table for date-customer
Sales Date
1st sem 2 nd sem Total
Ireland 20 23 43France 126 138 264
Country Germany 56 48 104
Total 202 209 411
-
8/17/2019 OLAP1
16/31
Slice & dice
• Select a part of the cube by restricting oneor more dimensions – restrict analysis to Ireland and VCR
Slice & dice
P r o d
u c t
C o u n t r y
sum
sum TV
VCRPC
1Qtr 2Qtr 3Qtr 4Qtr
Ireland
France
Germany
sum
-
8/17/2019 OLAP1
17/31
Slice & dice
• Select a part of the cube by restricting oneor more dimensions – restrict analysis to Ireland and VCR
sum 1Qtr 2Qtr 3Qtr 4Qtr
Outline of the lectureOnline Analytical Processing• Data Warehouses• Conceptual model: Data cubes• Query languages for supporting OLAP
– SQL extensions
– MDX• Database Explosion Problem
-
8/17/2019 OLAP1
18/31
Extended Aggregation
• SQL-92 aggregation quite limited – Many useful aggregates are either very hard
or impossible to specify• Data cube• Complex aggregates (median, variance)• binary aggregates (correlation, regression curves)• ranking queries (“assign each student a rank
based on the total marks”)
• SQL:1999 OLAP extensions
Representing the Cube• Special value « null » is used:
Sales Date1st sem 2 nd sem Total
Ireland 20 23 43France 126 138 264
Country Germany 56 48 104
Total 202 209 411
-
8/17/2019 OLAP1
19/31
Representing the Cube
• Special value « null » is used:
411nullnull
104Germanynull
264Francenull
43Irelandnull
209null2nd semester
48Germany2nd semester
138France2nd semester
23Ireland2nd semester
202null1st semester
56Germany1st semester
126France1st semester
20Ireland1st semester
SalesCountryDate
Group by Cube• group by cube :
select item-name, color, size, sum (number )from sales group by cube (item-name, color, size )
Computes the union of eight differentgroupings of the sales relation:{ (item-name, color, size ), (item-name, color ),
(item-name, size ),(color, size ),(item-name ),(color ),(size ), ( ) }
-
8/17/2019 OLAP1
20/31
Group by Cube
• Relational representation of the date-country-sales cube can be computed asfollows:
select semester as date , country, sum (sales)from sales group by cube (semester,country )
– grouping() and decode() can be applied to
replace “null” by other constant:• decode (grouping (semester), 1, ‘all’, semester )
Group by Rollup• rollup construct generates union on every
prefix of specified list of attributesselect item-name , color , size , sum (number )from sales group by rollup (item-name, color, size )
Generates union of four groupings:
{ (item-name, color, size ), (item-name,color ), (item-name ), ( ) }
-
8/17/2019 OLAP1
21/31
Group by Rollup
• Rollup can be used to generateaggregates at multiple levels.
• E.g., suppose itemcategory (item-name,category ) gives category of each item.
select category, item-name , sum (number )from sales, itemcategory where sales.item-name = itemcategory.item-name
group by rollup (category, item-name )gives a hierarchical summary by item- name and by category.
Group by Cube & Rollup• Multiple rollups and cubes can be used in
a single group by clause – Each generates set of group by lists, cross
product of sets gives overall set of group bylists
-
8/17/2019 OLAP1
22/31
Example
select item-name, color, size , sum (number )from sales group by rollup (item-name ), rollup (color, size )
generates the groupings{item-name, ()} X {(color, size), (color), ()}
={ (item-name, color, size ), (item-name, color ),(item-name ),(color, size ), (color ), ( ) }
Ranking• Ranking is done in conjunction with an
order by specification.• Given a relation student-marks(student-id,
marks) find the rank of each student.select student-id , rank ( ) over (order by marks desc) as s-rank from student-marks
-
8/17/2019 OLAP1
23/31
Ranking
• An extra order by clause is needed to getthem in sorted orderselect student-id , rank ( ) over (order by marks desc )as s-rank from student-marksorder by s-rank
• Ranking may leave gaps: e.g. 2 studentshave top mark, then both have rank 1, andthe next rank is 3 – dense_rank does not leave gaps, so next
dense rank would be 2
Ranking (Cont.)• Ranking can be done within partitions.• “Find the rank of students within each
section.”select student-id, section , rank ( ) over (partition by sectionorder by marks desc ) as sec-rank from student-marks, student-section where student-marks.student-id = student-section.student-id order by section, sec-rank
• Multiple rank clauses can occur in a singleselect clause
-
8/17/2019 OLAP1
24/31
Exercises
• Find students with top n ranks
• Rank students by sum of their marks indifferent courses
given relation student-course-
marks(student-id, course, marks)
Ranking (Cont.)• Other ranking functions:
– percent_rank – cume_dist (cumulative distribution) – row_number
• SQL:1999 permits the user to specifynulls first or nulls last
-
8/17/2019 OLAP1
25/31
-
8/17/2019 OLAP1
26/31
Outline of the lecture
Online Analytical Processing• Data Warehouses• Conceptual model: Data cubes• Query languages for supporting OLAP
– SQL extensions – MDX
• Database Explosion Problem
Implementation• To make query answering more efficient:
consolidate (materialize) all aggregations
• Early implementations used amultidimensional array.
– Fast lookup: cell(prod. p, date d, prom. pr):• look up index of p1, index of d, index of pr:index = (p x D x PR) + (d x PR) + pr
-
8/17/2019 OLAP1
27/31
Implementation
• Multidimensional array – obvious problem: sparse data
can easily be solved, though.
Example:binary search tree, key on indexhash table.
Implementation• However: very quickly people were
confronted with the Data ExplosionProblem
Consolidating the summaries blows up the dataenormously !
Reasons are often misunderstood and confusing.
-
8/17/2019 OLAP1
28/31
Data Explosion Problem
• Why?Suppose:
– n dimensions, every dimension has d values – d n possible tuples.
– Number of cells in the cube: (d+1) n
– So, this is not the problem
Data Explosion Problem• Why?Suppose
– n dimensions, every dimension has d values – every dimension has a hierarchy
– most extreme case: binary tree2d possibilities/dimension
-
8/17/2019 OLAP1
29/31
Data Explosion Problem
• Why?Suppose
– n dimensions, every dimension has d values – every dimension has a hierarchy – most extreme case: binary tree
2d possibilities/dimension 2n x dn cells
Only partial explanation (factor 2 n comesfrom an extremely pathological case)
Data Explosion Problem• Why?
– The problem is that most data is not dense,but sparse .
– Hence, not all d n combinations are possible.
Example : 10 dimensions with 10 values – 10 000 000 000 possibilities
Suppose « only » 1 000 000 are present
-
8/17/2019 OLAP1
30/31
Data Explosion Problem
Example : 10 dimensions with 10 values – 10 000 000 000 possibilities
Suppose « only » 1 000 000 are presentEvery tuple increases count of 2 10 cells !
With hierarchies: effect even worse!
If every hierarchy has 5 items:510 = 9 765 625 cells!
Data Explosion• Conclusion
– Straightforward implementation as a multi-dimensional cube won’t work for large numberof dimensions
-
8/17/2019 OLAP1
31/31
Summary
• Datawarehouses supporting OLAP fordecision support
• Data Cubes as a conceptual model – Measurement, dimensions, hierarchy,
aggregation
• Queries – Roll-up, Drill-down, Slice and dice, pivoting… – SQL:1999 extensions for supporting OLAP
• Straightforward implementation isproblematic
Next Lecture• How can we implement this?
– ROLAP – MOLAP – (HOLAP)
• How can we do this efficiently?
– Multicubes – Indexing• bitmap, join, …