Speeding Up Warehouse Physical DesignUsing A Randomized Algorithm
Minsoo Lee
Joachim Hammer
Dept. of Computer & Information Science & Engineering University of Florida
View Selection Problem
• What’s a Data Warehouse?– stores info. collected from multiple, heterogeneous info.
sources to support complex querying and analysis
• Materialized Views in a DW– pre-computed portions of frequently asked queries– maintenance : incremental, periodic refresh
• View Selection Problem– decide which views to materialize in DW– considers query response time, maintenance cost(?), and
storage cost
g h i k
b c d
a
Views
Basetables
OR
OROROR
Overview of Our Problem
• Maintenance-cost View Selection Problem [GM99]– decide which views to materialize in DW
– minimize query response time, given an upper bound on maintenance cost (storage space is not considered)
• DW Configuration based
on OR view graphs– Any view can be computed
from any of its related views
Problems with existing Solutions
• Existing Solutions to the View Selection Problem– Heuristics-based Search
• Greedy Heuristics [Gup97,GM99]
• A* [Rou82, LQA97, GM99]
– Exhaustive Search [TS97]
• Problems– Does not scale up well for more than 20 views
– Time complexity is polynomial
– DW evolution requires efficient re-computation of a configuration
Outline of Our Approach
• Use Randomized Algorithms– Randomized algorithms provide good solution within a
small amount of time (time/quality tradeoff)
– Specifically, use Genetic Algorithms (GA)
• Advantages of Our Approach– Near linear scaleup with a solution within 90% of
optimal
– Support DW evolution with fast reconfiguration
Genetic Algorithms
Loop until termination condition = true
t=t+1
Select P(t) from P(t-1)
Recombine P(t)
Evaluate P(t)
genomes population
generation t generation t+1 generation t+2
GA : Representation of Solution
• Genome– Candidate Solution of the problem to be solved
– Represented as a String
– Ordering problems : Alphanumeric String
Selection problems : Binary String
• Binary String Representation– ex) v1 v2 v3 v4 v5
0 1 0 0 1 views v2 and v5 are selected
GA : Initialization of Population
• Initial Population in our experiments– Pool of randomly generated bit strings
– population size is 300
• Future experiments – generate more favorable initial population
– use external knowledge of problem
GA : Selection, Crossover, Mutation, Termination
• Selection– Select superior genomes among previous population
– Roulette Wheel Method [Mic94]
• Crossover– applied to two genomes by exchanging information
• Mutation– applied to a single genome : ex) flip a bit in the genome
• Termination– termination condition : 400 generations
GA : Evaluation Process
• Fitness Function– measures how good a genome is as a solution
– high : close to optimal, low : further from optimal
• Use Penalty Function in Fitness Function– similar solution to 0/1 knapsack solution[Mic94]
– Evaluate query benefit.
If maintenance limit is exceeded, apply penalty.
GA : Evaluation Process
• Penalty Functions– Logarithmic (LG)
– Linear (LN)
– Exponential (EX)
• Penalty Application Methods– Subtract (S)
– Divide (D)
– Subtract&Divide (SD)
Evaluation of the Algorithm
• Environment– Pentium II 450 MHz PC, Windows NT 4.0
• OR-view graphs– number of base tables : 10 tables
– number of views : 5-20 views
– edge density of graph : 15%, 30%, 50%, 75%
– parameters for node (view) & parameters for edgeRC : 100 - 10000 for base tables QC : 10 - 80% of RC of
QF : 0.1- 0.9 source view
UF : 0.1 - 0.9 MC : 10 - 150% of QC
Results : Quality of Solutions
Average ratios of (optimal total query cost/GA total query cost)
9092949698
100102104106108110
1
14
27
40
53
66
79
92
10
5
11
8
13
1
14
4
15
7
17
0
18
3
19
6
20
9
22
2
23
5
24
8
(density,# of views, maint. constraint)
tota
l qu
ery
co
st
rati
o
(op
tim
al/G
A)
%
LN-D
LN-SD
EX-D
EX-SD
Average ratios of (GA total maintenance cost/maintenance constraint)
050
100
150200250300
350400450
1
14
27
40
53
66
79
92
10
5
11
8
13
1
14
4
15
7
17
0
18
3
19
6
20
9
22
2
23
5
24
8
(density,# of views, maint.constraint)
tota
l mai
nte
nan
ce c
ost
ra
tio
(G
A/o
pti
mal
) %
LN-D
LN-SD
EX-D
EX-SD
Results : Execution Time
< Execution time for exhaustive search algorithm >
0
20
40
60
80
100
120
1 15 29 43 57 71 85 99 113
127
141
155
169
183
197
211
225
239
253
(density, # of views, maint. constraint)
tim
e (
se
c)
Exhaustive Search
< Execution time for genetic algorithm >
0
0.2
0.4
0.6
0.8
1
1.2
1.4
1.6
1 15 29 43 57 71 85 99 113
127
141
155
169
183
197
211
225
239
253
(density, # of views, maint. cost)
tim
e (
se
c) LN-D
LN-SD
EX-D
EX-SD
Prototype Development
• Used version 2.4.3 of Galib from MIT• Microsoft Visual C++• Encoded our own Fitness Function
– strategy for penalty is controlled by a control variable
• Encoded OR-view graph cost evaluation functions – total query cost, total maintenance cost
• OR-view graph costs– Node: Read Cost, Query Frequency, Update Frequency– Edge: Query Cost, Maintenance Cost
Conclusion
• Use of Genetic Algorithm for Maintenance-cost View Selection Problem– yields a solution within 10% of optimal solution
– linear scale up for execution time w.r.t number of views
– EX-D and EX-SD strategy produce best results
– Suitable for use in DW evolution
• Future work– experiments with better initial population
– various crossover and mutation operators, termination condition
– AND-OR views, indexes
– parallel version of GA
Top Related