CUBE: A Relational Aggregate Operator Generalizing Group By
By Ata İsmet Özçelik
[Figure: The Data Cube and the Sub-Space Aggregates. Panels: Aggregate (Sum); Group By (with total), by Color; Cross Tab of Make (Chevy, Ford) by Color (red, white, blue); Data Cube over Make, Year (1990 to 1993), and Color, with sub-aggregates By Make, By Year, By Color, By Make & Year, By Make & Color, By Color & Year, and Sum.]
2
The Data Analysis Cycle
• User extracts data from the database with a query
• Then visualizes and analyzes the data with desktop tools
[Figure: the extract, visualize, analyze cycle, feeding spreadsheet and table tools. Two panels, "Size vs Speed" (size in bytes against access time in seconds) and "Price vs Speed" ($/MB against access time in seconds), covering Cache, Main, Secondary, Disc, Online Tape, Nearline Tape, and Offline Tape.]
3
N-Dimensional data
• What exactly is N-dimensional data?
– A relation with N attribute domains.
– Could have domain tables for each dimension in the main table.
• Why is just this not enough?
– We need aggregation of various kinds to make the data representation humanly readable.
4
Relational Representation of a 3-D Data Model
[Figure: a Sales Fact Table with the keys model_key, year_key, and color_key and the measure sales, linked to the Model, Year, and Color dimensions.]
5
Aggregate Functions
• Aggregation functions:
– SQL standard: SUM(), COUNT(), MIN(), MAX(), and AVG().
– Many systems provide their own custom aggregate functions, and some even let users define custom functions.
• The basic idea is: combine all values in a column into a single scalar value.
[Figure: SUM() collapsing a column into one value.]
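The idea of collapsing a column into one scalar can be tried out with Python's built-in sqlite3 module. This is a minimal sketch: the `sales` table, its data, and the `value_range` aggregate are hypothetical names made up for illustration. The custom aggregate follows the usual step/finalize pattern that user-defined aggregates tend to share.

```python
import sqlite3

# Hypothetical one-column table; the data is illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (units INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?)", [(5,), (3,), (8,), (4,)])

# The five standard aggregates, each reducing the column to one scalar.
standard = conn.execute(
    "SELECT SUM(units), COUNT(units), MIN(units), MAX(units), AVG(units) FROM sales"
).fetchone()


class ValueRange:
    """User-defined aggregate (assumed name): MAX minus MIN of a column."""

    def __init__(self):
        self.lo = None
        self.hi = None

    def step(self, value):
        # Called once per row; NULLs are skipped, as the built-ins do.
        if value is None:
            return
        self.lo = value if self.lo is None else min(self.lo, value)
        self.hi = value if self.hi is None else max(self.hi, value)

    def finalize(self):
        # Called once at the end to produce the scalar result.
        return None if self.lo is None else self.hi - self.lo


conn.create_aggregate("value_range", 1, ValueRange)
value_range = conn.execute("SELECT value_range(units) FROM sales").fetchone()[0]
```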
6
Relational Group By Operator
• Group By allows aggregates over table sub-groups
• Result is a new table
• Syntax:
select location, sum(units)
from inventory
where nation = 'USA'
group by location;
[Figure: a table partitioned by its grouping values, with Sum() applied to each partition to produce the aggregate values.]
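A small runnable sketch of the same kind of query, using Python's sqlite3 module. The `inventory` rows are invented for illustration, and the filter on `nation` is applied with WHERE before grouping, which is an assumption of this sketch about the intended meaning.

```python
import sqlite3

# Hypothetical inventory data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE inventory (nation TEXT, location TEXT, units INTEGER)")
conn.executemany(
    "INSERT INTO inventory VALUES (?, ?, ?)",
    [
        ("USA", "Boston", 10),
        ("USA", "Boston", 5),
        ("USA", "Seattle", 7),
        ("Canada", "Toronto", 4),
    ],
)

# Rows are filtered, partitioned by location, and each partition is summed.
grouped = conn.execute(
    "SELECT location, SUM(units) FROM inventory "
    "WHERE nation = 'USA' GROUP BY location ORDER BY location"
).fetchall()
```

The result is itself a table, one row per grouping value.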
7
Problems with GROUP BY
• Histograms
– In standard SQL, histograms are computed indirectly from a table-valued expression, which is then aggregated.
• Roll-up totals and sub-totals for drill-downs
– Reports commonly aggregate data at a coarse level, and then at successively finer levels.
– Roll-up: going up levels.
– Drill-down: going down levels.
• Cross-tabulation (cross-tab for short)
– Symmetric aggregation table.
• The problem, hence, is that GROUP BY needs a union of up to $2^N$ queries to build a roll-up or cross-tab over N dimensions.
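To make the size of that union concrete, the sketch below enumerates every subset of the grouping columns and stitches together the corresponding UNION ALL query text. The helper names `cube_groupings` and `union_query` are hypothetical, and the generated SQL is only meant to show the shape of the query, not a tuned or portable statement.

```python
from itertools import combinations


def cube_groupings(dims):
    """All subsets of dims, from the full grouping down to the empty one."""
    return [
        subset
        for k in range(len(dims), -1, -1)
        for subset in combinations(dims, k)
    ]


def union_query(dims, table, measure):
    """Build one SELECT per grouping and join them with UNION ALL."""
    parts = []
    for subset in cube_groupings(dims):
        # Columns that are aggregated away are filled with the dummy 'ALL'.
        # (A real system would also have to reconcile column types.)
        cols = [d if d in subset else "'ALL'" for d in dims]
        sql = f"SELECT {', '.join(cols)}, SUM({measure}) FROM {table}"
        if subset:
            sql += " GROUP BY " + ", ".join(subset)
        parts.append(sql)
    return "\nUNION ALL\n".join(parts)


dims = ["Model", "Year", "Color"]
groupings = cube_groupings(dims)
query = union_query(dims, "Sales", "sales")
```

For three dimensions this yields eight SELECT statements, and each extra dimension doubles the count.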
8
An example approach
• Not relational
• Not convenient
9
‘ALL’
• Dummy value that fills all the super-aggregation items.
• It is actually a set representing all the values that are present for the corresponding dimension.
• There are two ways of dealing with it.
– Define a new keyword ALL in SQL.
• The ALL() function is defined to enumerate the set that ALL represents.
• ALL [NOT] ALLOWED is added to the column definition syntax.
• The set interpretation guides the relational operators {=, IN} for ALL.
– Avoid the ALL keyword.
• NULL is used instead of ALL.
• The GROUPING() function discriminates between ALL and NULL.
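The second option can be sketched in plain Python: NULL (here `None`) stands in for ALL, and an extra flag plays the role that GROUPING() plays in SQL. The row data and the function name `group_with_flags` are assumptions made for illustration.

```python
from collections import defaultdict

# (model, color, units); a color of None is a genuinely missing value.
rows = [("Chevy", "red", 5), ("Chevy", None, 2), ("Ford", "red", 3)]


def group_with_flags(rows):
    """Sum units by (model, color) and by model alone.

    The third key field mimics GROUPING(color):
    0 means the color is a real value (possibly NULL),
    1 means the color was aggregated away, so None stands for ALL.
    """
    out = defaultdict(int)
    for model, color, units in rows:
        out[(model, color, 0)] += units
        out[(model, None, 1)] += units
    return dict(out)


result = group_with_flags(rows)
```

Without the flag, the rows `("Chevy", None)` for the missing color and `("Chevy", None)` for the sub-total would be indistinguishable.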
10
3D Roll-Up
• This is a simple 3-dimensional roll-up. Aggregating over N dimensions requires N such unions.
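A three-dimensional roll-up written with plain GROUP BY and three unions might look like the sketch below, run through Python's sqlite3 module. The `sales` table and its rows are hypothetical, and the year is cast to text only so that the 'ALL' dummy value fits in the same column.

```python
import sqlite3

# Hypothetical sales data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (model TEXT, year INTEGER, color TEXT, units INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        ("Chevy", 1990, "red", 5),
        ("Chevy", 1990, "blue", 3),
        ("Chevy", 1991, "red", 4),
        ("Ford", 1990, "red", 2),
    ],
)

# One GROUP BY per roll-up level, glued together with three unions.
ROLLUP_SQL = """
SELECT model, CAST(year AS TEXT), color, SUM(units) FROM sales GROUP BY model, year, color
UNION ALL
SELECT model, CAST(year AS TEXT), 'ALL', SUM(units) FROM sales GROUP BY model, year
UNION ALL
SELECT model, 'ALL', 'ALL', SUM(units) FROM sales GROUP BY model
UNION ALL
SELECT 'ALL', 'ALL', 'ALL', SUM(units) FROM sales
"""
rollup_rows = set(conn.execute(ROLLUP_SQL).fetchall())
```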
11
Cross Tabs
• The result of symmetric aggregation is a table called a cross-tabulation.
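A cross-tab over two dimensions can be sketched in a few lines of Python. Every input row contributes to its own cell, to its row total, to its column total, and to the grand total. The data and the name `cross_tab` are illustrative assumptions.

```python
from collections import defaultdict

# (make, color, units); hypothetical data.
sales = [
    ("Chevy", "red", 5),
    ("Chevy", "blue", 3),
    ("Ford", "red", 2),
    ("Ford", "white", 4),
]


def cross_tab(rows):
    """Return cell sums keyed by (make, color), with 'ALL' marking totals."""
    cells = defaultdict(int)
    for make, color, units in rows:
        for key in ((make, color), (make, "ALL"), ("ALL", color), ("ALL", "ALL")):
            cells[key] += units
    return dict(cells)


tab = cross_tab(sales)
```

The symmetry shows in the keys: neither dimension is privileged, and both get their own totals.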
12
Data Cube Relational Operator
[Figure: The Data Cube and the Sub-Space Aggregates. Panels: Aggregate (Sum); Group By (with total), by Color; Cross Tab of Make (Chevy, Ford) by Color (red, white, blue); Data Cube over Make, Year (1990 to 1993), and Color, with sub-aggregates By Make, By Year, By Color, By Make & Year, By Make & Color, By Color & Year, and Sum.]
13
N-dimensional Cube: Each Attribute is a Dimension
• N-dimensional Aggregate (sum(), max(),...)
– fits relational model exactly:
• a1, a2, ...., aN, f()
• Super-aggregate over N-1 Dimensional sub-cubes
• ALL, a2, ...., aN , f()
• a1 , ALL, a3, ...., aN , f()
• ...
• a1, a2, ...., ALL, f()
– this is the N-1 Dimensional cross-tab.
• Super-aggregate over N-2 Dimensional sub-cubes
• ALL, ALL, a3, ...., aN , f()
• ...
• a1, a2 ,...., ALL, ALL, f()
14
CUBE Operator
• Syntax:
SELECT Model, Year, Color, SUM(sales) AS Sales
FROM Sales
WHERE Model IN ('Ford', 'Chevy')
AND Year BETWEEN 1990 AND 1992
GROUP BY CUBE (Model, Year, Color)
• Semantics:
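One way to read the operator is that every input row contributes to each combination of "real value or ALL" across the grouping columns. The sketch below follows that reading for SUM in plain Python. The function name `cube` and the sample rows are assumptions, and this is an illustration of the idea rather than how a database engine would evaluate it.

```python
from collections import defaultdict
from itertools import product

ALL = "ALL"  # dummy value marking a dimension that has been aggregated away


def cube(rows, n_dims):
    """SUM the last field over every real-value/ALL combination of the dims."""
    totals = defaultdict(int)
    for row in rows:
        dims, measure = row[:n_dims], row[n_dims]
        # Each mask picks which dimensions are replaced by ALL.
        for mask in product((False, True), repeat=n_dims):
            key = tuple(ALL if hide else d for d, hide in zip(dims, mask))
            totals[key] += measure
    return dict(totals)


# (Model, Year, Color, sales); hypothetical data.
sales = [
    ("Chevy", 1990, "red", 5),
    ("Chevy", 1990, "blue", 3),
    ("Ford", 1991, "red", 2),
]
result = cube(sales, 3)
```

The core GROUP BY rows, every sub-total, and the grand total all end up in the same relation.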
15
CUBE
Result of a Cube Operator
16
ROLL UP Operator
• Syntax:
SELECT Manufacturer, Year, Month, Day, Color, Model, SUM(price) AS Revenue
FROM Sales
GROUP BY Manufacturer,
ROLLUP Year(Time) AS Year,
Month(Time) AS Month,
Day(Time) AS Day,
CUBE Color, Model
• Semantics:
[Figure: for each Manufacturer, a roll-up over Year, Month, and Day, combined with Model x Color cubes.]
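Unlike a full cube, a roll-up only produces the aggregates along one ordered hierarchy: the finest level, then each coarser prefix, then the grand total. A minimal Python sketch for SUM follows. The name `rollup` and the (year, month, day, revenue) rows are assumptions for illustration.

```python
from collections import defaultdict

ALL = "ALL"  # dummy value for a level that has been rolled up


def rollup(rows, n_dims):
    """SUM the last field for each prefix of the dimension list."""
    totals = defaultdict(int)
    for row in rows:
        dims, measure = row[:n_dims], row[n_dims]
        # Keep the first `keep` dimensions, replace the rest with ALL.
        for keep in range(n_dims, -1, -1):
            key = dims[:keep] + (ALL,) * (n_dims - keep)
            totals[key] += measure
    return dict(totals)


# (year, month, day, revenue); hypothetical data.
daily = [
    (1994, 1, 5, 10),
    (1994, 1, 6, 20),
    (1994, 2, 1, 5),
]
result = rollup(daily, 3)
```

With N dimensions this yields N + 1 levels rather than the $2^N$ groupings of a cube, so a key such as (ALL, 1, ALL) never appears.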
17
Snowflake Schema
[Figure: a snowflake schema showing the core fact table and some of the many aggregation granularities of the core dimensions. Fact table columns: Product, Seller, Buyer, Units, Price, Office, Date. Granularity labels around it, each hierarchy topped by ALL: Division, Group, Unit; Channel; Discount; Cust Type; District, Region, Geography; Week, Month, Quarter, Year.]
18
Addressing Data Cube
• SQL3 defines a Turing-complete procedural programming language.
SELECT Year, Color, Model, SUM(sales) AS total,
SUM(Sales) / total(ALL, ALL, ALL)
FROM Sales
WHERE Model IN ('Ford', 'Chevy')
AND Year BETWEEN 1990 AND 1992
GROUP BY CUBE Model, Year, Color
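The point of addressing the cube is that a cell such as (ALL, ALL, ALL) can be looked up and used inside another expression, for example to turn every cell into a share of the grand total. The sketch below does this in plain Python. The helper `cube_sum` and the sample rows are hypothetical.

```python
from collections import defaultdict
from itertools import product

ALL = "ALL"


def cube_sum(rows, n_dims):
    """SUM the last field over every real-value/ALL combination of the dims."""
    cells = defaultdict(int)
    for row in rows:
        dims, measure = row[:n_dims], row[n_dims]
        for mask in product((False, True), repeat=n_dims):
            cells[tuple(ALL if hide else d for d, hide in zip(dims, mask))] += measure
    return dict(cells)


# (Model, Year, Color, sales); hypothetical data.
sales = [
    ("Ford", 1990, "red", 6),
    ("Chevy", 1991, "blue", 2),
    ("Ford", 1992, "red", 2),
]
cells = cube_sum(sales, 3)

# Address the grand-total cell directly, then express each cell as a share of it.
grand_total = cells[(ALL, ALL, ALL)]
share = {key: value / grand_total for key, value in cells.items()}
```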
19
Computing Data Cubes
• If each attribute has $N_i$ values, the CUBE has $\prod_i (N_i + 1)$ values
• Compute the N-D cube with hashing if it fits in RAM
• Compute the N-D cube with sorting if it overflows RAM
• Same comments apply to subcubes:
– compute the (N-1)-D subcube from the N-D cube.
– Aggregate on “biggest” domain first when >1 deep
– Aggregate functions need hidden variables:
• e.g. average needs sum and count.
• Use standard techniques from query processing
– arrays, hashing, hybrid hashing
– fall back on sorting.
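The size formula is easy to check numerically: each dimension gains one extra value, ALL, and the sizes multiply. The function name `cube_size` is an assumption, and the result is an upper bound that is reached only when every combination of values actually occurs in the data.

```python
from math import prod


def cube_size(cardinalities):
    """Upper bound on the number of cube cells: product of (N_i + 1)."""
    return prod(n + 1 for n in cardinalities)


# Two models, three years, three colors: the core has 2 * 3 * 3 = 18 cells,
# and the full cube has 3 * 4 * 4 cells.
core_cells = prod([2, 3, 3])
cube_cells = cube_size([2, 3, 3])
```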
20
Computing Data Cubes
• $2^N$ algorithm for cube computation.
– The simplest algorithm to compute the cube is to allocate a handle for each cube cell.
• Categorization of aggregation functions.
– Distributive
• The function can be calculated in the following distributed manner:
– Partition the data into n sets.
– Compute the aggregation function on each partition to get an aggregate value.
– Apply a function g() to the n aggregates to get a final aggregate.
– This aggregate is the same as if the whole data set had been aggregated at once.
• COUNT(), SUM(), MIN(), MAX().
• Can be calculated more efficiently than with the $2^N$ algorithm.
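The distributive property can be checked directly: aggregate each partition, combine the partial results with g(), and compare with aggregating everything at once. In this sketch the partitioning scheme and the data are arbitrary choices. Note that g() is the function itself for SUM, MIN, and MAX, but it is SUM for COUNT.

```python
def partition(values, n):
    """Split values into n interleaved partitions (an arbitrary choice)."""
    return [values[i::n] for i in range(n)]


values = [7, 2, 9, 4, 1, 8]
parts = partition(values, 3)

# Partial aggregates per partition, then g() applied to the partials.
sum_of_sums = sum(sum(p) for p in parts)        # g = SUM for SUM
sum_of_counts = sum(len(p) for p in parts)      # g = SUM for COUNT
min_of_mins = min(min(p) for p in parts)        # g = MIN for MIN
max_of_maxes = max(max(p) for p in parts)       # g = MAX for MAX
```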
21
Computing Data Cubes continued..
– Algebraic
• The function can be calculated by an algebraic function with M arguments (M a bounded positive integer), each the result of a distributive function.
• min_N(), max_N(), standard_deviation(), avg()
• Can also be calculated in a more efficient way.
– Holistic
• There is no constant bound on the storage size needed to describe a subaggregate.
• rank(), median(), mode() (need base data)
• The $2^N$ algorithm is the fastest for an exact result, but better algorithms exist for approximate results.
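The difference between the last two categories shows up in a small experiment. An average can be rebuilt from a fixed-size pair (sum, count) per partition, while a median cannot in general be rebuilt from per-partition medians. The partitions below are made-up data chosen so that the mismatch is visible.

```python
from statistics import median

parts = [[1, 1, 9], [2, 8], [3]]
all_values = [x for p in parts for x in p]

# Algebraic: avg() needs M = 2 distributive sub-aggregates per partition.
partials = [(sum(p), len(p)) for p in parts]
total, count = map(sum, zip(*partials))
avg_from_partials = total / count

# Holistic: combining per-partition medians does not give the true median,
# so the base data is needed.
median_of_medians = median(median(p) for p in parts)
true_median = median(all_values)
```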
22
Example
• Compute the 2D core of a 2 x 3 cube
• Then compute the 1D edges
• Then compute the 0D point
• Works for algebraic and distributive functions
• Saves “lots” of calls
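The steps above can be sketched for SUM on a 2 x 3 cube of make by color. The base rows are scanned only once, to build the core. The edges are then computed from the core cells, and the grand total from one edge. The data is hypothetical.

```python
from collections import defaultdict

# (make, color, units); hypothetical base data with several rows per cell.
rows = [
    ("Chevy", "red", 5),
    ("Chevy", "red", 1),
    ("Chevy", "white", 3),
    ("Ford", "blue", 2),
    ("Ford", "red", 4),
]

# 2D core: the only pass over the base rows.
core = defaultdict(int)
for make, color, units in rows:
    core[(make, color)] += units

# 1D edges: computed from the core cells, not from the base rows.
by_make = defaultdict(int)
by_color = defaultdict(int)
for (make, color), subtotal in core.items():
    by_make[make] += subtotal
    by_color[color] += subtotal

# 0D point: computed from one of the edges.
grand_total = sum(by_make.values())
```

Each level touches fewer values than the one below it, which is where the saved calls come from.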
23
Maintaining a Data Cube
– Up until now we have been discussing only SELECT statements.
– Now we have to accommodate INSERT, DELETE, and UPDATE.
– Example: the max() function
• Distributive for SELECT and INSERT, but holistic for DELETE.
– If a function is algebraic for INSERT, UPDATE, and DELETE, it is easy to maintain the cube.
– If it is distributive, it is fairly inexpensive (using scratchpads).
– If it is holistic, it is expensive to maintain the cube.
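The max() example can be sketched with a single cube cell. An insert only needs a comparison against the stored maximum, but deleting the current maximum forces a rescan of the underlying values. The class `MaxCell` is hypothetical, and keeping the values inside the cell is just a stand-in for going back to the base table.

```python
class MaxCell:
    """One cube cell holding MAX over its base values (illustrative only)."""

    def __init__(self, values):
        self.values = list(values)  # stand-in for the base data
        self.current = max(self.values, default=None)
        self.rescans = 0

    def insert(self, value):
        # Cheap: compare the new value with the stored maximum.
        self.values.append(value)
        if self.current is None or value > self.current:
            self.current = value

    def delete(self, value):
        # Expensive only when the deleted value is the current maximum.
        self.values.remove(value)
        if value == self.current:
            self.current = max(self.values, default=None)
            self.rescans += 1


cell = MaxCell([3, 9, 5])
cell.insert(7)
max_after_insert = cell.current
cell.delete(5)
rescans_after_small_delete = cell.rescans
cell.delete(9)
max_after_big_delete = cell.current
```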
24
Summary
• CUBE operator generalizes relational aggregates
• Needs ALL value to denote sub-cubes
– ALL values represent aggregation sets
• Needs generalization of user-defined aggregates
• Decorations and abstractions are interesting
• Computation has interesting optimizations
• Relationship to “rest of SQL” not fully worked out