J. Gamper, Free University of Bolzano, DWDM 2011-12
DW Performance Optimizations
• Pre-aggregation
  – Using aggregates
  – Choosing aggregates
• View maintenance
• Bitmap indices
Acknowledgements: I am indebted to M. Böhlen for providing me with the lecture notes.
J. Gamper, Free University of Bolzano, DWDM 2012-13 2
Aggregates/1
• Observations
  – DW queries are simple and follow the same "schema": "Aggregate measure per dimattr1, dimattr2"
• Idea
  – Compute and store query results in advance (pre-aggregation)
  – Example: store "total sales per month and product"
  – Yields large performance improvements (factor 100, 1000, ...)
  – No need to store everything; re-use is possible, e.g., a quarterly total can be computed from monthly totals
Aggregates/2
• Prerequisites
  – Tree-structured dimensions
  – Many-to-one relationships from facts to dimensions
  – Facts mapped to the bottom level in all dimensions
  – Otherwise, re-use is not possible
Aggregate Example
• Imagine 1 billion sales rows, 1000 products, 100 locations
• CREATE VIEW TotalSales (pid, locid, total) AS
    SELECT s.pid, s.locid, SUM(s.sales)
    FROM Sales s
    GROUP BY s.pid, s.locid
• The materialized view has 100,000 rows
• Queries are rewritten to use the view:
    SELECT p.category, SUM(s.sales)
    FROM Products p, Sales s
    WHERE p.pid = s.pid
    GROUP BY p.category
  is rewritten to
    SELECT p.category, SUM(t.total)
    FROM Products p, TotalSales t
    WHERE p.pid = t.pid
    GROUP BY p.category
• The query becomes 10,000 times faster!
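The rewrite above can be checked end to end on a toy instance (table and column names are taken from the slide; the data values are made up for illustration):

```python
import sqlite3

# Toy version of the slide's example: a Sales fact table, a Products
# dimension, and the pre-aggregated TotalSales view (materialized as a table).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Products (pid INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE Sales (pid INTEGER, locid INTEGER, sales REAL);
    INSERT INTO Products VALUES (1, 'food'), (2, 'food'), (3, 'toys');
    INSERT INTO Sales VALUES
        (1, 10, 5.0), (1, 20, 7.0), (2, 10, 3.0), (3, 20, 4.0), (3, 20, 6.0);

    -- Materialize the aggregate: one row per (pid, locid) instead of one per sale.
    CREATE TABLE TotalSales AS
        SELECT pid, locid, SUM(sales) AS total FROM Sales GROUP BY pid, locid;
""")

# Original query against the detail data ...
detail = conn.execute("""
    SELECT p.category, SUM(s.sales) FROM Products p, Sales s
    WHERE p.pid = s.pid GROUP BY p.category ORDER BY p.category
""").fetchall()

# ... and the rewrite against the (much smaller) aggregate.
rewritten = conn.execute("""
    SELECT p.category, SUM(t.total) FROM Products p, TotalSales t
    WHERE p.pid = t.pid GROUP BY p.category ORDER BY p.category
""").fetchall()

assert detail == rewritten == [('food', 15.0), ('toys', 10.0)]
```

Both queries return the same answer; only the number of rows scanned differs.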
Pre-Aggregation Choices
• Full pre-aggregation (all combinations of levels)
  – Fast query response
  – Takes a lot of space/update time (200-500 times the raw data)
• No pre-aggregation
  – Slow query response (for terabytes ...)
• Practical pre-aggregation: chosen combinations
  – A good compromise between response time and space use
• Most (R)OLAP tools now support practical pre-aggregation
  – IBM DB2 UDB, Oracle 9iR2, MS Analysis Services, Hyperion Essbase (DB2 OLAP Services)
Using Aggregates
• Given a query, the best aggregate must be found
  – Should be done by the system, not by the user
• The four design goals for aggregate usage
  – Aggregates stored separately from detail data
  – "Shrunk" dimensions (i.e., the subset of a dimension's attributes that apply to the aggregation) mapped to aggregate facts
  – Connection between aggregates and detail data known by the system
  – All queries (SQL) refer to detail data only
• Aggregates used via an aggregate navigator
  – Given a query, the best aggregate is found, and the query is rewritten to use it
  – Traditionally done in middleware, e.g., ODBC
  – Can now (most often) be performed directly by the DBMS
• SUM, MIN, MAX, COUNT, AVG can all be handled
Choosing Aggregates
• With practical pre-aggregation, it must be decided which aggregates to store
• This is a non-trivial (NP-complete) optimization problem
• Many influencing factors
  – Space use
  – Update speed
  – Response time demands
  – Actual queries
  – Prioritization of queries
  – Indexes and/or aggregates
• Only choose an aggregate if it is considerably smaller than the available, usable aggregates (factor 3-5-10)
• Often supported (semi-)automatically by the tool/DBMS
  – Oracle, DB2, MS SQL Server
MS Analysis Aggregate Choice
• Can also log and use knowledge of actual queries
Implementing Data Cubes Efficiently
• Classic SIGMOD 1996 paper by Harinarayan, Rajaraman, and Ullman: "Implementing Data Cubes Efficiently"
• Simple but effective approach
• Almost all DBMSs (ROLAP + MOLAP) now use similar, but more advanced, techniques for determining the best aggregates to materialize
Data Cube
• The data cube stores multidimensional GROUP BY relations of tables in data warehouses
• [Figure: car sales (make: Chevy, Ford; color: red, white, blue; year: 1991-1994) aggregated at increasing levels – a simple GROUP BY with total (by color, "database capable"), a cross tab by make and color, and the full data cube with faces "by make and color", "by color and year", and "by make and year" ("data warehouse capable"); illustrative only]
A Data Cube Example
• With 3 dimensions there are 8 possible groupings of attributes (views); each grouping gives the total sales per that grouping:
  1. part, supp, cust (6M rows)
  2. part, cust (6M)
  3. part, supp (0.8M)
  4. supp, cust (6M)
  5. part (0.2M)
  6. supp (0.01M)
  7. cust (0.1M)
  8. none (1)
• The 8 views are organized into a lattice (19M rows in total):
    psc 6M
    pc 6M    ps 0.8M    sc 6M
    p 0.2M   s 0.01M    c 0.1M
    none 1
• Scenario: a query asks for the sales of a part
  a) If view pc is available, it will need to process about 6M rows
  b) If view p is available, it will need to process about 0.2M rows
• Questions:
  a) How many views should be materialized to get good performance?
  b) Given that we have space S, which views should be materialized to minimize average query cost?
• Picking the right views to materialize improves performance
  – Views pc and sc are not needed; this reduces the effective rows needed from 19M to 7M, a reduction of about 60%
Lattice Framework
• We denote a lattice with a set of queries L and dependence relation ≤ by (L, ≤)
• The ≤ operator: Q1 ≤ Q2 if Q1 can be answered using only the results of Q2
  – In other words, Q1 is dependent on Q2
• The ≤ operator imposes a partial ordering on the queries
• Essentially, the lattice models dependencies among queries/views and can be represented by a lattice graph
• The partial ordering imposes strict requirements as to what constitutes a lattice
• However, in practice we only need to assume there is a top view on which every view depends
Hierarchies & the Lattice Framework
• Hierarchies are important, as they underlie two commonly used query operations: drill-down and roll-up
• A common hierarchy: Day at the bottom; Week and Month above Day; Year above Month; none at the top
• Its dependency relations:
  – Year ≤ Month ≤ Day
  – Week ≤ Day; but
  – Month ≰ Week and Week ≰ Month
• Hierarchies introduce query dependencies that must be accounted for when determining which queries to materialize, and this can be complex
Composite Lattices
• Dependencies caused by different dimensions and attribute hierarchies can be combined into a direct product lattice, assuming views can be created by independently grouping on any (or no) member of the hierarchy for each of the n dimensions
• Example of combining two hierarchical dimensions:
  – customer: customer (0.1M) → nation (25) → none (1)
  – product: product (0.2M) → size (50), type (150) → none (1)
• The resulting direct product lattice, with view sizes:
    cp 6M
    cs 5M    ct 5.99M    np 5M
    c 0.1M   ns 1250     nt 3750    p 0.2M
    n 25     s 50        t 150
    none 1
Applicability of Lattice Framework
The lattice framework is advantageous for several reasons:
• Clean framework: it provides a clean framework to reason about dimensional hierarchies, since hierarchies are themselves lattices
• Easy to model dependencies: it models common queries well, as users don't jump between unconnected elements in the lattice; instead, they move along the edges of the lattice
• Order of materialization: a simple descending-order topological sort on the ≤ operator gives the required order of materialization
Cost Model
A framework to calculate the cost of answering a query based on other queries. It has important assumptions, chosen to keep the model simple and realistic; this enabled the authors to design and analyze powerful algorithms:
  1. The time to answer a query is equal to the space occupied by the query (view) from which the query is answered
  2. All queries are identical to some query in the given lattice
  3. The clustering of the materialized query and indexes are not considered
An illustration:
• To answer query Q, we choose a materialized ancestor of Q, say Qa
• We then need to process the table of Qa
• The cost of answering Q is a function of the size of the table of Qa
• Thus, the cost of answering Q is the number of rows present in the table for the query Qa used to answer Q
Cost Model/2
• An experimental validation of the cost model found an almost linear relationship between size and running time
• This relationship can be expressed by T = m * S + c, where c is the fixed cost and m is the ratio of the query time to the size of the view, i.e., m = (T - c)/S
• Query: total sales for a supplier, using different views:

    Source            Size S    Time T    Ratio m
    From cell itself       1      2.07    -
    From view s       10,000      2.38    .000031
    From view ps        0.8M     20.77    .000023
    From view psc         6M    226.23    .000037

• Assumption: the number of rows present in each view is known (not simple, but many ways of estimating the size are available, e.g., sampling or a statistically representative subset)
Greedy Algorithm/1
The Greedy algorithm optimizes the space-time trade-off when implementing a lattice of views.
• Given a data cube lattice with space costs associated with each view
• The top view must always be included, because it cannot be generated from other views
• Suppose we may select only k views in addition to the top view
• After selecting a set S of views, the benefit B(v, S) of view v relative to S measures how v can improve the cost of evaluating views, including itself
• The total benefit of v is the sum, over all views w, of the benefit of using v to evaluate w, provided that benefit is positive

The Greedy algorithm:
    S = {top view};
    for i = 1 to k do begin
      select the view v not in S such that B(v, S) is maximized;
      S = S ∪ {v};
    end;
    the resulting S is the solution of the Greedy algorithm
Greedy Algorithm/2
• The benefit B(v, S) of view v relative to S is defined as follows:
  – For each w ≤ v, define the quantity Bw:
    · Let u be the view of least cost in S such that w ≤ u
    · If C(v) < C(u), then Bw = C(u) - C(v); otherwise Bw = 0
  – B(v, S) = Σ_{w ≤ v} Bw
• [Figure: lattice sketch illustrating u ∈ S, a candidate view v, and the views w below v]
Greedy Algorithm: Example
When calculating the benefit, we assume the space costs indicated in the lattice. View a is used to evaluate all views and must be chosen. We want to choose three other views.
• Lattice with space costs: a 100 at the top; b 50 and c 75 below a; d 20 and e 30 below b; e 30 and f 40 below c; g 1 below d and e; h 10 below e and f
• Benefits of possible choices at each round:

    View   Choice 1        Choice 2        Choice 3
    b      50 x 5 = 250
    c      25 x 5 = 125    25 x 2 = 50     25 x 1 = 25
    d      80 x 2 = 160    30 x 2 = 60     30 x 2 = 60
    e      70 x 3 = 210    20 x 3 = 60     2 x 20 + 10 = 50
    f      60 x 2 = 120    60 + 10 = 70
    g      99 x 1 = 99     49 x 1 = 49     49 x 1 = 49
    h      90 x 1 = 90     40 x 1 = 40     30 x 1 = 30

• At each round, we pick the view that yields the biggest benefit, accounting for the results of previous rounds
• In round 1, view b can answer 5 queries (d, e, g, h and itself) at a cost of 50 each; this is a cost reduction of 250 compared to answering b, d, e, g and h from view a at a cost of 100 each. Thus, view b yields the biggest benefit of 250.
• In round 2, the cost of 100 for view a applies only to certain views; b, d, e, g and h now have a cost of 50. Thus, the benefit of view f w.r.t. view h is the difference between 50 and 40; f wins the round with a total benefit of 70.
• After 3 rounds (picking b, f, d), the total cost of evaluating all views is reduced to 420 from the initial 800.
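The greedy selection can be sketched on this exact lattice (view names and space costs are from the slides; the `parents` encoding of the lattice edges is inferred from the benefit table):

```python
# HRU greedy view selection on the example lattice
# (space costs: a=100, b=50, c=75, d=20, e=30, f=40, g=1, h=10).
cost = {'a': 100, 'b': 50, 'c': 75, 'd': 20, 'e': 30, 'f': 40, 'g': 1, 'h': 10}
parents = {'b': ['a'], 'c': ['a'], 'd': ['b'], 'e': ['b', 'c'],
           'f': ['c'], 'g': ['d', 'e'], 'h': ['e', 'f']}
views = list(cost)

def ancestors(v):
    """All views (including v itself) from which v can be answered."""
    result = {v}
    for p in parents.get(v, []):
        result |= ancestors(p)
    return result

def eval_cost(w, S):
    """Cost of answering w: the cheapest materialized ancestor of w."""
    return min(cost[u] for u in ancestors(w) if u in S)

def benefit(v, S):
    """B(v, S): total cost improvement from materializing v on top of S."""
    S2 = S | {v}
    return sum(max(eval_cost(w, S) - eval_cost(w, S2), 0) for w in views)

S = {'a'}                        # the top view must always be materialized
picks = []
for _ in range(3):               # choose three more views
    best = max((v for v in views if v not in S), key=lambda v: benefit(v, S))
    picks.append(best)
    S.add(best)

total = sum(eval_cost(w, S) for w in views)
# picks == ['b', 'f', 'd'] and total == 420, matching the worked example
```

Running it reproduces the rounds of the table: b (250), f (70), d (60), with the total evaluation cost dropping from 800 to 420.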
Greedy Algorithm vs Optimal Choice
There are situations where the algorithm does poorly, but its benefit is at least 63% of the benefit of the optimal choice, as reasoned by the authors.
• A lattice where the greedy does poorly: a 200 at the top; b 100, c 99 and d 100 below a; four groups of 20 nodes (each group totaling 1000) at the bottom, with b above the first two groups, c above the two middle groups, and d above the last two groups
• Round 1: picks c, whose benefit is 4141
• Round 2: can pick b or d, with benefits of 2100 each
• Greedy results in a benefit of 4141 + 2100 = 6241
• But the optimal choice is to pick b and d
• b and d would each improve the cost by 100 for themselves and for all 80 bottom nodes, resulting in a total benefit of 8200
• Ratio of greedy/optimal = 6241/8200 = 76%
Greedy Algorithm – Space vs. Time
• Greedy order of view selection for the TPC-D-based example:

    #   Selection   Benefit     TotSpace    TotTime
    1   cp          infinite    6M rows     72M rows
    2   ns          24M rows    6M          48M
    3   nt          12M         6M          36M
    4   c           5.9M        6.1M        30.1M
    5   p           5.8M        6.3M        24.3M
    6   cs          1M          11.3M       23.3M
    7-12  np, ct, t, n, s, none – benefits of 1M, 0.01M and smaller; total time stays around 23.3M rows while total space keeps growing

• The experiment with the composite lattice shows that it is important to materialize some views, but not all: performance increases at first ...
• ... but after about 5 views, the increase in performance gets small even as more space is used
• [Chart: total time and total space vs. number of materialized views (2-12); total time drops steeply for the first ~5 views and then flattens, while total space keeps growing]
Optimal Cases and Anomalies
Two situations where the algorithm is optimal …
If the benfit of the first view is
much larger than the other
benefits, the greedy is close
to optimal
If all the benefits are equal
then greedy is optimal
… but there are also two situations where the algorithm is not realistic
Views in a lattice are unlikely
to have the same probability
of being requested in a
query; hence, probabilities
should be associated to
each view
Instead of asking for some
fixed number of views to
materialize, should instead
allocate a fixed amount of
space to views
Hypercube Lattices – Observations
• The size of views grows exponentially with the number of grouped attributes, until it reaches the size of the raw data at rank ⌈log_r m⌉, i.e., the cliff
• Assumptions and basis of reasoning:
  – Each domain has size r
  – The top element has m cells appearing in the raw data
  – If we group on i attributes, the cube has r^i cells
  – If r^i ≥ m, then each cell will have at most one data point; the space cost is m
  – If r^i < m, then almost all of the r^i cells will have at least one data point, as several data points can be collapsed into one aggregate; the space cost is r^i
• This explains why the groupings on 2 attributes (p,c) and (s,c) have the same size as (p,s,c) at 6M rows (see the data cube example)*
  * (p,s) does not have 6M rows because the benchmark made it so
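The size argument can be sketched numerically; the domain size r and raw-data size m below are arbitrary illustrative choices:

```python
# Sketch of the hypercube size argument: a view grouping on i attributes,
# each with domain size r, has r**i possible cells, but never more than the
# m cells that actually appear in the raw data.
def view_size(i, r, m):
    return min(r ** i, m)

r, m = 10, 10_000            # hypothetical domain size and raw-data size

# The "cliff": the smallest rank i at which r**i reaches m
cliff = next(i for i in range(20) if r ** i >= m)

sizes = [view_size(i, r, m) for i in range(8)]
# sizes == [1, 10, 100, 1000, 10000, 10000, 10000, 10000]; cliff == 4
```

Views grow exponentially up to the cliff (here rank 4) and then stay at the raw-data size, just as in the slide's argument.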
Space- and Time-optimal Solutions
Inevitably, questions will be raised about the space and time optimality of hypercubes.
• What is the average time for a query when the space is optimal?
  – Space is minimized when only the top view is materialized
  – Every query would then take time m
  – The total time cost for all 2^n queries is m * 2^n
• Does it make sense to minimize time by materializing all views?
  – There is no gain past the cliff, so there is no point in doing so
  – The nature of a time-optimal solution is to get as close to the cliff as possible
Summary on Choosing Aggregates
• The problem: deciding which set of views to materialize to improve performance
• Lattice framework: views are organized in a lattice
• Notion of linear cost in query processing
• Greedy algorithm that picks the right views
• Some observations about hypercubes and the time-space trade-off
View Maintenance
• How and when should we refresh materialized views?
• Total re-computation
  – Most often too expensive
• Incremental view maintenance
  – Apply only the changes since the last refresh to the view
  – Ri = inserted rows, Rd = deleted rows
• Additional info must be stored in the views to make them self-maintainable
  – Store the "number of derivations" cv (count) along with each row v in V
View Maintenance/2
• Projection views with DISTINCT, V = Π(R)
  – Insert: if (r, ci) ∈ Π(Ri) and (r, cr) ∈ V, then cr = cr + ci; otherwise insert (r, ci) into V
  – Delete: if (r, cd) ∈ Π(Rd) and (r, cr) ∈ V, then cr = cr - cd; delete r from V if cr = 0
• Join views V = R ⋈ S
  – Compute Ri ⋈ S and add it to V, updating the counts
  – Compute Rd ⋈ S and subtract it from V, updating the counts
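The insert/delete rules for a DISTINCT projection view can be sketched with a derivation counter per row; this is a simplified in-memory model, not a DBMS implementation:

```python
from collections import Counter

# Sketch of count-based maintenance for V = DISTINCT projection of R:
# the view keeps, for each distinct row, the number of base rows deriving it.
class DistinctProjection:
    def __init__(self, rows=()):
        self.counts = Counter(rows)          # row -> number of derivations

    def insert(self, rows):                  # apply the inserted rows Ri
        for r in rows:
            self.counts[r] += 1

    def delete(self, rows):                  # apply the deleted rows Rd
        for r in rows:
            self.counts[r] -= 1
            if self.counts[r] == 0:          # last derivation gone: drop row
                del self.counts[r]

    def view(self):
        return set(self.counts)

v = DistinctProjection(['x', 'x', 'y'])
v.delete(['x'])                  # 'x' is still derived by one base row
assert v.view() == {'x', 'y'}
v.delete(['x'])                  # now 'x' disappears from the view
assert v.view() == {'y'}
```

Without the counts, the first delete could not tell whether 'x' should stay in the view; that is exactly why the view stores the number of derivations.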
Aggregation View Maintenance
• COUNT
  – Maintain <g, count>
  – Update count based on inserts/deletes
  – Insert <g, 1> rows for new group values; delete rows from V with count = 0
• SUM
  – Maintain <g, sum, count>
  – Update count and sum based on inserts/deletes
  – Insert <g, a, 1> rows for new group values; delete rows from V with count = 0
• AVG is computed as sum/count
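A matching sketch for per-group SUM, with AVG derived from the stored sum and count (again a simplified in-memory model):

```python
# Sketch of self-maintainable SUM per group: store <g, sum, count> so that
# deletions can be applied without rescanning the base data.
class SumView:
    def __init__(self):
        self.state = {}                      # g -> [sum, count]

    def insert(self, g, a):
        s = self.state.setdefault(g, [0, 0])
        s[0] += a
        s[1] += 1

    def delete(self, g, a):
        s = self.state[g]
        s[0] -= a
        s[1] -= 1
        if s[1] == 0:                        # no rows left in the group
            del self.state[g]

    def sum(self, g):
        return self.state[g][0]

    def avg(self, g):                        # AVG derived as sum/count
        s = self.state[g]
        return s[0] / s[1]

v = SumView()
for g, a in [('jan', 10), ('jan', 20), ('feb', 5)]:
    v.insert(g, a)
v.delete('jan', 10)
assert v.sum('jan') == 20 and v.avg('jan') == 20.0
v.delete('feb', 5)
assert 'feb' not in v.state      # group row deleted when count reaches 0
```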
Aggregation View Maintenance/2
• MIN/MAX
  – Maintain x = <g, min, cnt>
  – Update min and cnt based on how m compares to min (=, <, >)
  – Insert (g, m):
    · If m < min, then x = <g, m, 1>
    · Else if m = min, then x = <g, min, cnt+1>
  – Delete (g, m):
    · If m = min, then x = <g, m, cnt-1>
    · If cnt = 0, then scan the table for the new min and cnt (expensive!)
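The MIN rule, including the expensive rescan case, can be sketched the same way (an in-memory model where `base` stands in for the base table):

```python
# Sketch of MIN maintenance with a count of minimal rows; when the last
# minimal row is deleted, the base table must be rescanned (expensive case).
class MinView:
    def __init__(self, base):
        self.base = list(base)               # list of (g, m) base rows
        self.state = {}                      # g -> [min, cnt]
        for g, m in self.base:
            self._apply_insert(g, m)

    def _apply_insert(self, g, m):
        if g not in self.state or m < self.state[g][0]:
            self.state[g] = [m, 1]           # new minimum
        elif m == self.state[g][0]:
            self.state[g][1] += 1            # another row with the minimum

    def insert(self, g, m):
        self.base.append((g, m))
        self._apply_insert(g, m)

    def delete(self, g, m):
        self.base.remove((g, m))
        if m == self.state[g][0]:
            self.state[g][1] -= 1
            if self.state[g][1] == 0:        # min gone: rescan the base table
                rest = [m2 for g2, m2 in self.base if g2 == g]
                if rest:
                    self.state[g] = [min(rest), rest.count(min(rest))]
                else:
                    del self.state[g]

v = MinView([('a', 2), ('a', 2), ('a', 3)])
v.delete('a', 2)
assert v.state['a'] == [2, 1]    # one row with the min remains, no rescan
v.delete('a', 2)                 # last min deleted: rescan finds 3
assert v.state['a'] == [3, 1]
```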
Aggregation View Maintenance/3
• Determine min and count using SQL
• In: q = R(a, b); out: <a, MIN(b), count of MIN(b)>
• Solution 1:

    SELECT t.*, (SELECT COUNT(*) FROM q WHERE a = t.a AND b = t.b)
    FROM (SELECT a, MIN(b) b FROM q GROUP BY a) t;

    q:  a b      t:  a b      result:  a b cnt
        1 2          1 2               1 2 2
        1 2          2 3               2 3 1
        1 3
        2 3
Aggregation View Maintenance/4
• Solution 2:

    SELECT a, b, COUNT(*)
    FROM q
    GROUP BY a, b
    HAVING (a, b) IN (SELECT a, MIN(b) FROM q GROUP BY a);

• Solution 3:

    SELECT a, b, COUNT(*)
    FROM q AS t
    WHERE b = (SELECT MIN(b) FROM q WHERE a = t.a)
    GROUP BY a, b;
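Solutions 1 and 3 can be verified on the sample q(a, b) with SQLite (Solution 2 relies on row-value IN, which only newer SQLite versions support, so it is omitted from this check):

```python
import sqlite3

# Run two of the slide's solutions on the sample q(a, b) and check they agree.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE q (a INTEGER, b INTEGER);
    INSERT INTO q VALUES (1, 2), (1, 2), (1, 3), (2, 3);
""")

sol1 = """
    SELECT t.*, (SELECT COUNT(*) FROM q WHERE a = t.a AND b = t.b)
    FROM (SELECT a, MIN(b) AS b FROM q GROUP BY a) t ORDER BY t.a
"""
sol3 = """
    SELECT a, b, COUNT(*) FROM q AS t
    WHERE b = (SELECT MIN(b) FROM q WHERE a = t.a)
    GROUP BY a, b ORDER BY a
"""
r1 = conn.execute(sol1).fetchall()
r3 = conn.execute(sol3).fetchall()
assert r1 == r3 == [(1, 2, 2), (2, 3, 1)]   # <a, MIN(b), count of MIN(b)>
```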
Aggregation View Maintenance/5
• Using the MD-join:

    X = MD( R/B, R, ( (MIN(b) -> min), (COUNT(*) -> cnt) ),
            ( (R.a = B.a), (R.a = B.a AND R.b = B.b) ) )
    Y = σ[b=min](X)

    R:  a b      X:  a b min cnt      Y:  a b min cnt
        1 2          1 2  2   2           1 2  2   2
        1 2          1 2  2   2           2 3  3   1
        1 3          1 3  2   1
        2 3          2 3  3   1
Practical View Maintenance
• When to synchronize?
  – Immediate: in the same transaction as the base changes
  – Lazy: when V is used for the first time after base updates
  – Periodic: e.g., once a day, often together with the base load
  – Forced: after a certain number of changes
• Updating aggregates
  – Computation outside the DBMS in flat files (no longer very relevant!), built by the loader
  – Computation in the DBMS using SQL; can be expensive: the DBMS must be tuned for this
• Supported by tools/DBMSs
  – Oracle, SQL Server, DB2
Indexing
• Indexes are used in combination with aggregates
  – Indexes on dimension tables and on materialized views
• Fact table
  – Build a primary B-tree index on the dimension keys (primary key)?
  – Build indexes on each dimension key separately (index intersection)
  – Indexes on combinations of dimension keys? (many!)
• Sort order is important (index-organized tables)
  – Compressing the data becomes possible (values not repeated)
  – Can save aggregates due to fast sequential scans
  – The best sort order is (almost) always time
• Dimension tables
  – Build indexes on many/all individual columns
  – Build indexes on common combinations
• Hash indexes
  – Efficient for un-sorted data
Bitmap Indexes
• A B-tree index stores a list of RowIDs for each value
  – A RowID takes ~8 bytes
  – Large space use for columns with low cardinality (gender, color)
  – Example: an index on gender for 1 billion rows takes 8 GB
  – Not efficient for "index intersection" on these columns
Bitmap Indexes/2
• Idea: make a "position bitmap" for each value (here only two)
  – Female: 01110010101010...
  – Male: 10001101010101...
  – Takes only (no. of values) x (no. of rows) x 1 bit
  – Example: a bitmap index on gender (as before) takes only 256 MB
  – Very efficient for "index intersection" (AND/OR) on bitmaps
  – Can be improved for higher cardinality using compression
• Supported by some RDBMSs (DB2, Oracle)
Using Bitmap Indexes
• Query example: find male customers in South Tyrol with blond hair and blue eyes
  – Male:         01010101010
  – South Tyrol:  00000011111
  – Blond:        10110110110
  – Blue:         01101101111
  – Result (AND): 00000000010 – only one such customer
• Range queries can also be handled
  – ... AND Salary BETWEEN 200,000 AND 300,000
  – 200-250,000: 001001001
  – 250-300,000: 010010010
  – OR together: 011011011, then use as a regular bitmap
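The AND-based evaluation can be sketched with Python integers as bitmaps (the four bitmaps are taken verbatim from the example):

```python
# Bitmap selection sketch: each predicate's bitmap is held as a Python int
# (one bit per row), so index intersection is a single bitwise AND.
def bitmap(bits):
    """Parse a '0101...' string (row 0 leftmost) into (int, length)."""
    return int(bits, 2), len(bits)

def rows(bm, n):
    """Positions of the 1-bits, i.e., the qualifying row numbers."""
    return [i for i in range(n) if bm >> (n - 1 - i) & 1]

male, n  = bitmap('01010101010')
tyrol, _ = bitmap('00000011111')
blond, _ = bitmap('10110110110')
blue, _  = bitmap('01101101111')

result = male & tyrol & blond & blue     # one AND per additional predicate
assert rows(result, n) == [9]            # exactly one matching customer
```

OR works the same way for range predicates: OR the per-interval bitmaps first, then AND the combined bitmap with the rest.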
Compressed Bitmaps
• Problem: space use
  – With m possible values and n records, n x m bits are required
  – However, the probability of a 1 is 1/m, so there are very few 1's
• Solution: compressed bitmaps via run-length encoding
  – A run is i 0's followed by a 1
  – Simply concatenating the binary numbers i won't work, since the decoding would not be unique
  – Instead, determine j, the number of bits in the binary representation of i
  – Encode the run as: <j-1 1's> + "0" + <i in binary>
  – If j > 1, the first bit of i is 1 and can be saved (omitted) in the binary representation of i
  – Encode the next run similarly; trailing 0's are not encoded
Compressed Bitmaps/2
• Example: 000000010000 (a run of i = 7 zeros before the 1, so j = 3) is encoded as 110111
  – With saving of the first bit, it is encoded as 11011
• Decoding: scan the bits to find j (count the 1's until the first 0 and add 1), then scan the next j-1 bits to find i, find the next 0, etc.
  – Example: 110111 gives j = 3 and i = 7, so the bitmap is 00000001 + 0000 (trailing 0's)
Compressed Bitmaps/3
• Example: 0000001 01 1 00001 000...0 (n = 40)
• Encode:
  – 0000001 => 11010 (i=6='110', j=3)
  – 01 => 01 (i=1='1', j=1)
  – 1 => 00 (i=0='0', j=1)
  – 00001 => 11000 (i=4='100', j=3)
  – Final encoding: 11010010011000
• Decode:
  – 11010 => 0000001 (j=3, i=6='(1)10')
  – 01 => 01 (j=1, i=1='1')
  – 00 => 1 (j=1, i=0='0')
  – 11000 => 00001 (j=3, i=4='(1)00')
  – Fill up the remaining 0's
  – Final bitmap: 0000001 01 1 00001 000...0
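The encoding and decoding rules can be turned into a small sketch that reproduces the example above:

```python
# Sketch of the run-length scheme from the last two slides: each run of
# i zeros followed by a 1 is coded as (j-1) ones, a 0, then i in binary,
# with the leading bit dropped when j > 1 (it is always 1 in that case).
def encode_run(i):
    b = bin(i)[2:]                    # binary representation of i
    j = len(b)                        # number of bits of i
    saved = b[1:] if j > 1 else b     # first bit is implicit when j > 1
    return '1' * (j - 1) + '0' + saved

def decode(code):
    """Recover the list of run lengths i from an encoded bit string."""
    runs, pos = [], 0
    while pos < len(code):
        j = 1
        while code[pos] == '1':       # count leading ones -> j - 1 of them
            j += 1
            pos += 1
        pos += 1                      # skip the separating 0
        if j > 1:                     # re-attach the implicit leading 1
            i = int('1' + code[pos:pos + j - 1], 2)
            pos += j - 1
        else:                         # j = 1: i is a single explicit bit
            i = int(code[pos])
            pos += 1
        runs.append(i)
    return runs

# The slide's example: runs of 6, 1, 0 and 4 zeros (trailing 0's dropped)
code = ''.join(encode_run(i) for i in [6, 1, 0, 4])
assert code == '11010010011000'
assert decode(code) == [6, 1, 0, 4]
```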
Managing Bitmaps
• Compression factor
  – Assume m = n (unique values)
  – Each value then has just one run of length i < n
  – Each run takes at most 2 log2 n bits (since j ≤ log2 n)
  – Total space consumption: 2n log2 n bits (compared to n² uncompressed)
• Operations on compressed bitmaps
  – Decompress one run at a time and produce the relevant 1's in the output
• Finding and storing bit vectors
  – Index with B-trees + store in blocks/block chains
• Finding records
  – Use a secondary or primary index (hashing or B-tree)
Managing Bitmaps/2
• Handling modifications
  – Deletion: "retire" the record number + update the bitmaps' 1's
  – Insertion: add the new record to the file + update the bitmaps with 1's (trailing 0's)
  – Update: update the bitmaps with the old and new 1's
Bit-Sliced Index/1
• A bit-sliced index for a numeric attribute c of a relation R consists of a bit matrix B with n columns B0, ..., Bn-1 and as many rows as tuples in R
  – Row i is the binary representation of the value of the c-attribute of tuple i
  – n is the number of bits needed for the binary representation of the maximum value of c, i.e., log2(MaxVal)
• Each column (slice) is stored separately
• Example: a bit-sliced index for values ranging from 1-100; ⌈log2(100)⌉ = 7 bits are needed
Bit-Sliced Index/2
• Bit-sliced indexes are possible for attributes with large domains
• Standard bitmap indexes grow linearly with the number of distinct attribute values (one column for each value)
• Bit-sliced indexes grow only logarithmically in the size of the domain
• Boolean operators can still be applied
• To get all tuples with quantity > 63, retrieve all RIDs with B6 = 1
Bit-Sliced Index/3
• Bit-sliced indexes can be used to compute some aggregates without accessing the data, e.g., SUM, AVG
• Compute the sum of values, where count(Bi) returns the number of 1's in slice Bi:

    Function SUM(B0, ..., Bn)
    Input: bit-sliced index B consisting of n slices built on the integer key c
      Sum := 0;
      For i = 0, ..., n do
        Sum += 2^i * count(Bi);
      Return Sum;
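A runnable sketch of the SUM function (the values are made up; `slices` builds the index from them only for the sake of the example):

```python
# Sketch of the bit-sliced SUM: slice B_i holds bit i of every value, and
# SUM = sum over i of 2^i * count(B_i), without touching the data itself.
def slices(values, n):
    """Build n bit-slices; slice i contains bit i of each value."""
    return [[(v >> i) & 1 for v in values] for i in range(n)]

def bitsliced_sum(B):
    # count(B_i) is just the number of 1's in slice i
    return sum((2 ** i) * sum(slice_i) for i, slice_i in enumerate(B))

values = [5, 3, 12]                  # binary: 0101, 0011, 1100
B = slices(values, 4)
assert bitsliced_sum(B) == sum(values) == 20
```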
Bitmap-Encoded Index/1
• The idea of storing a binary encoding of numeric values has been applied to non-numeric domains
• A bitmap-encoded index on an attribute c with k distinct values of a relation R consists of a bit matrix B and a conversion table T
  – B contains ⌈log2(k)⌉ columns and has as many rows as tuples in R
  – T contains k rows; the i-th row shows the binary coding of value ci
• Example: a bitmap-encoded index for a position attribute
Bitmap-Encoded Index/2
• Though an additional conversion table T is needed to translate the values encoded in the index, the index size can be considerably reduced (compared to a bitmap index)
• A bitmap-encoded index grows logarithmically in the size of the domain, while a bitmap index grows linearly
• Boolean operators can be applied to bitmap-encoded indexes
• Any selection predicate on key values can be represented by a Boolean expression, which selects intervals of valid binary values
• To minimize the number of bitmap vectors that need to be accessed, a "good" encoding is crucial
Bitmap-Encoded Index/3
• A bitmap-encoded index coding function is well-defined for a set of selection predicates if it minimizes the number of bit vectors that must be accessed to check the selection predicates
• Example: an attribute with values a, b, ..., h; assume key ∈ {a,b,c,d} and key ∈ {c,d,e,f} are the most frequent predicates
• A well-defined encoding exists:
  – The first predicate is true iff the B1 vector is 0
  – The second predicate is true iff the B0 vector is 1
Bitmap-Encoded Index/4
• Example (contd.): Boolean expressions for the predicates key ∈ {a,b,c,d} and key ∈ {c,d,e,f}
• Using the rules of Boolean algebra, these expressions can be simplified to NOT B1 (i.e., B1 = 0) and B0 (i.e., B0 = 1), respectively
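One hypothetical coding that satisfies the example can be checked directly (the slide's actual coding table is not reproduced in this transcript, so the 3-bit codes below are an assumption with the stated property):

```python
# Hypothetical well-defined coding for values a..h: codes chosen so that
# key in {a,b,c,d} is exactly B1 = 0 and key in {c,d,e,f} is exactly B0 = 1,
# i.e., each frequent predicate needs only a single bit vector.
code = {                 # value -> (B2, B1, B0); all 8 codes are distinct
    'a': (0, 0, 0), 'b': (1, 0, 0), 'c': (0, 0, 1), 'd': (1, 0, 1),
    'e': (0, 1, 1), 'f': (1, 1, 1), 'g': (0, 1, 0), 'h': (1, 1, 0),
}

pred1 = {v for v, (b2, b1, b0) in code.items() if b1 == 0}   # NOT B1
pred2 = {v for v, (b2, b1, b0) in code.items() if b0 == 1}   # B0

assert pred1 == {'a', 'b', 'c', 'd'}
assert pred2 == {'c', 'd', 'e', 'f'}
```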
Bitmap-Encoded Index/5
• The main OLAP operators are based on functional dependencies between dimensional attributes in hierarchies
• The coding function of bitmap-encoded indexes makes it possible to encode hierarchies
• In general, the coding function can encode both many-to-one and many-to-many associations
Bitmap-Encoded Index/6
• Example: a product dimension with the hierarchy category --> type --> product
  – Coding table on the product attribute
  – Only B2 is needed to retrieve all products of a specific category
  – Likewise, B2 and B1 are needed to retrieve a specific type
• [Figure: dimension table and coding table]
Bitmapped Join Index/1
• A bitmapped join index built on an attribute cR of a relation R and cS of a relation S is a bit matrix B with |R| rows and |S| columns; bit Bi,j is 1 if the corresponding tuples satisfy the join predicate
• Example: a bitmapped join index for the fact table SALES and the dimension table STORE
  – e.g., tuple 2 in SALES joins tuple 2 in STORE
Bitmapped Join Index/2
• Bitmapped join indexes can also be used to execute queries with multiple joins:
  1. Access the bitmap indexes to identify the tuples (RIDs) that fulfill the predicates on the dimensional attributes
  2. For every bitmapped join index, load only the bit vectors corresponding to the RIDs identified in step 1; a bitwise OR yields the RID vector that fulfills all predicates on one dimension table
  3. Perform a bitwise AND between the n vectors obtained for the dimensions
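Steps 2 and 3 can be sketched with integers as RID vectors (all join-index contents below are made up for illustration):

```python
# Sketch of steps 2-3: for each dimension, OR the join-index columns of the
# qualifying dimension tuples to get the fact rows joining them, then AND
# across the dimensions. Fact-row sets are ints (leftmost bit = fact row 0).
n_facts = 6

# join_index[d][j] = bitmap of fact rows joining tuple j of dimension d
join_index = {
    'store': [0b110000, 0b001100, 0b000011],     # 3 store tuples
    'time':  [0b101010, 0b010101],               # 2 time tuples
}

def facts_for(dim, rids):
    bm = 0
    for rid in rids:                 # step 2: OR within one dimension
        bm |= join_index[dim][rid]
    return bm

# Suppose step 1's predicates selected stores {0, 1} and time tuple {0}
result = facts_for('store', [0, 1]) & facts_for('time', [0])  # step 3: AND
rows = [i for i in range(n_facts) if result >> (n_facts - 1 - i) & 1]
assert rows == [0, 2]                # fact rows satisfying all predicates
```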
Bitmapped Join Index/3
• Example:

    SELECT DISTINCT FT.m, DT1.a1, ..., DTn.an
    FROM FT, DT1, ..., DTn
    WHERE FT.a1 = DT1.a1 AND
          ...
          FT.an = DTn.an AND
          DT1.b1 = 'val1' AND
          ...
          DTn.bn = 'valn'
Physical Storage
• Partitioning
  – Data stored in large "lumps" (partitions), e.g., one partition per quarter
  – Queries need only read the relevant partitions
  – Can yield large performance improvements
• Operations on partitions are independent
  – Creation, deletion, update, indexing
  – The aggregation level can differ among partitions
• Column storage
  – Data stored in columns, not in rows: a "reverse" kind of partitioning
  – Works well for typical DW queries (only few columns accessed)
  – Supports good compression of the data
Physical Configuration
• RAID
  – Gives (depending on the level) error tolerance and improved read speed
  – A DW is optimized for reads, not writes, and is thus well suited for, e.g., RAID5 (20% redundancy)
• Disk type
  – Small drives (many controllers) are more expensive, but faster
  – Large drives are cheaper and store more aggregates for the same price
• Block size
  – Large sequential reads are faster with large blocks (32K)
  – Scattered index reads are faster with small blocks (4K)
• Memory
  – RAM is cheap: buy a lot
  – RAM caching must be per user session
• Monitoring user activity
  – Can give feedback for, e.g., the choice of aggregates
DBMS Functionalities
• Aggregate navigation/use: Oracle 9iR2, DB2 UDB, MS Analysis Services
• Aggregate choice: Oracle 9iR2, DB2 UDB, MS Analysis Services
• Aggregate maintenance: Oracle 9iR2, DB2 UDB, MS Analysis Services
• Using ordinary indexes: Oracle 9iR2, DB2 UDB, MS SQL Server can do "star joins"
• Bitmap indexes: Oracle 9iR2, DB2 UDB – not yet in MS SQL Server
• Partitioning: Oracle 9iR2, DB2 UDB, MS SQL Server + Analysis Services
• Column storage: Redbrick (Informix -> IBM -> ?)
• MOLAP/ROLAP/HOLAP: Oracle 9iR2, DB2 UDB, MS SQL Server
Conclusions
• Pre-aggregation is a key technique for boosting performance
• Data warehouses automatically determine which views to materialize and when to use them
• Views have to be maintained incrementally (incorporating counts) when the data changes
• Bitmap indexes are widely used by data warehouses