Predicting Multiple Metrics for Queries: Better Decision Enabled by Machine Learning
By: Archana Ganapathi, Harumi Kuno, Umeshwar Dayal, Janet L. Wiener, Armando Fox, Michael Jordan, David Patterson
Problem:
• Predicting the performance (running time, resource usage) of a query before executing it helps with:
– Workload management: query scheduling
– System sizing: what system is required to answer a query within a time constraint
– Capacity planning: given an expected workload, does the system require an upgrade?
Why it is a hard problem
• Sources of uncertainty
– Skewed data distribution
– Inaccurate cardinality prediction
• Complex query plans
• Huge amount of data
• Different schemas for different databases make using ML a big challenge
Solution
• It should be able to simultaneously predict all performance metrics, using only information available prior to query execution, for both short- and long-running queries.
• Potential candidates:
• Cost models
– Manually model the performance output of each operator for each configuration setting, then estimate the final value from the query plan.
– Suffers from estimation error propagation
• Machine learning
– Build a model from training data
– Not sensitive to estimation error, since it works on similarity.
Experiment set up (data)
• Machine used to gather training and test query performance metrics
– HP Neoview database system.
– Machines with 4, 8, 16 and 32 processors.
– Fixed memory allocated per CPU.
– Each CPU has its own disk, and data is partitioned roughly equally across all 4 disks.
Experiment set up (query)
• Categorize queries by runtime:
– 0 min < feather < 3 min
– 3 min < golfball < 30 min
– 30 min < bowlingball < 2 h
• Standard decision support benchmark (TPC-DS) templates are used to generate the feather queries.
• For longer queries, new templates were written from real queries that took at least 4 hours to compute.
• Some feather queries from another database with a different schema are included in the train and test sets.
• Producing queries with the appropriate performance was a hard and time-consuming task, since changing a constant might turn a feather into a bowling ball or vice versa.
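The runtime categories above can be written down as a small lookup. This is only a sketch: the function name `categorize` is hypothetical, and because the slide uses strict inequalities on both sides, the handling of exact boundary values (3 min, 30 min) and of anything at or beyond 2 h is an assumption.

```python
def categorize(runtime_minutes: float) -> str:
    """Map an elapsed time in minutes to one of the slide's query categories."""
    if runtime_minutes < 0:
        raise ValueError("runtime must be non-negative")
    if runtime_minutes < 3:
        return "feather"
    if runtime_minutes < 30:       # assumption: exactly 3 min counts as a golfball
        return "golfball"
    if runtime_minutes < 120:      # assumption: exactly 30 min counts as a bowlingball
        return "bowlingball"
    return "uncategorized"         # assumption: the slides name no class beyond 2 h


# A small change in a query constant can move a query across these boundaries.
labels = [categorize(t) for t in (0.5, 12.0, 95.0)]
```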
Independent modelling of performance metrics
• Regression
– Individually model each performance metric as y = A1X1 + A2X2 + … + AnXn
– Regression uses a different set of features for each performance metric, which makes it hard to unify all performance metrics in one model.
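To make the "one regression per metric" idea concrete, here is a minimal least-squares sketch in plain Python. It is not the authors' code: the helper names, the toy feature rows and the two metric names are made up for illustration, and for brevity every metric is fitted on the same feature columns, whereas the slide notes that in practice each metric ends up with its own feature set.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_linear(X, y):
    """Least-squares coefficients A for y ~ A1*X1 + ... + An*Xn (no intercept)."""
    n = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(n)] for i in range(n)]
    xty = [sum(row[i] * t for row, t in zip(X, y)) for i in range(n)]
    return solve(xtx, xty)


def predict(coeffs, x):
    """Evaluate the fitted linear model on one feature row."""
    return sum(a * v for a, v in zip(coeffs, x))


# Toy data (hypothetical): two query features, two performance metrics.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
metrics = {
    "elapsed_time": [2.0, 3.0, 5.0, 7.0],
    "records_used": [1.0, 0.0, 1.0, 2.0],
}
# One independent model per metric; nothing ties the models together.
models = {name: fit_linear(X, y) for name, y in metrics.items()}
```

Each entry in `models` is fitted in isolation, which is exactly why this approach gives no single model covering all metrics.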
Joint modelling of performance metrics
• Clustering: groups the entries of a single dataset based on their similarity.
• PCA: projects a dataset onto the dimensions of maximal variance, for clustering.
• (K)CCA: finds the dimensions of maximal correlation between a pair of datasets and maps each dataset onto those dimensions. The notion of similarity can be defined by the user in a kernel function.
[Figure: query features (before running) and performance features (after running) as the two inputs to KCCA]
KCCA
• We are given N queries.
• We produce two N×N similarity matrices: one over the query features and one over the query performance features.
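The two N×N similarity matrices can be sketched as follows. The slide does not say which kernel is used, so the Gaussian kernel, the scale parameter `tau` and the toy vectors here are all assumptions chosen for illustration; only the matrix-building step is shown, not the KCCA solve itself.

```python
import math


def gaussian_kernel_matrix(vectors, tau):
    """Return the N x N matrix K with K[i][j] = exp(-||v_i - v_j||^2 / tau)."""
    n = len(vectors)
    k = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            sq_dist = sum((a - b) ** 2 for a, b in zip(vectors[i], vectors[j]))
            k[i][j] = math.exp(-sq_dist / tau)
    return k


# Hypothetical toy data for N = 3 queries.
query_features = [[1.0, 0.0, 2.0], [1.0, 1.0, 2.0], [5.0, 3.0, 0.0]]
perf_features = [[10.0, 200.0], [12.0, 210.0], [900.0, 5000.0]]

K_query = gaussian_kernel_matrix(query_features, tau=4.0)     # similarity of queries
K_perf = gaussian_kernel_matrix(perf_features, tau=1000.0)    # similarity of outcomes
```

Both matrices are symmetric with ones on the diagonal, since every query is maximally similar to itself under this kernel.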
Prediction using KCCA
Evaluation
• Predictive risk
• A predictive risk close to 1 means near-perfect prediction.
• This metric is very sensitive to outliers, and removing the top outliers can significantly improve predictive risk.
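The formula for predictive risk did not survive in this transcript. The sketch below assumes the common R²-style definition, 1 minus the ratio of squared prediction error to the variance of the actual values around their mean; treat that definition, the function name and the toy numbers as assumptions rather than the slide's exact content.

```python
def predictive_risk(predicted, actual):
    """1 - sum((pred - act)^2) / sum((act - mean(act))^2); assumed R^2-style definition."""
    mean_actual = sum(actual) / len(actual)
    squared_error = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    total_variation = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - squared_error / total_variation


# Hypothetical elapsed times: four well-predicted queries and one badly missed outlier.
actual = [1.0, 2.0, 3.0, 4.0, 100.0]
predicted = [1.0, 2.0, 3.0, 4.0, 50.0]

risk_with_outlier = predictive_risk(predicted, actual)
risk_without_outlier = predictive_risk(predicted[:-1], actual[:-1])
```

A single large miss dominates the squared-error sum, which is why dropping the top outliers can move the score so much.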
Performance feature vector
• Performance features: 6 measures computed by the DBMS after running a query.
– Elapsed time
– Disk I/O
– Message count
– Message bytes
– Records accessed
– Records used
Query feature vector
• Information available prior to query execution
1. SQL text of query
– Number of nested sub-queries
– Total number of selection predicates
– Number of equality selection predicates
– Total number of join predicates
– Number of equi-join predicates
– Number of non-equi-join predicates
– Number of sort columns
– Number of aggregation columns
Query feature vector
2. Query execution plan (a tree of query operators with estimated cardinalities)
– Instance count and cardinality sum for each operator.
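As an illustration of "instance count and cardinality sum for each operator", the sketch below flattens a plan into a fixed-length vector. The plan representation (a list of `(operator, estimated_cardinality)` pairs) and the operator names are hypothetical; a real optimizer exposes a tree with its own operator vocabulary.

```python
def plan_features(plan_ops, operator_names):
    """Build [count_op1, cardsum_op1, count_op2, cardsum_op2, ...] from a flattened plan."""
    counts = {name: 0 for name in operator_names}
    card_sums = {name: 0.0 for name in operator_names}
    for op, estimated_cardinality in plan_ops:
        if op not in counts:
            raise ValueError(f"unknown operator: {op}")
        counts[op] += 1
        card_sums[op] += estimated_cardinality
    vector = []
    for name in operator_names:       # fixed order keeps vectors comparable across queries
        vector.extend([counts[name], card_sums[name]])
    return vector


# Hypothetical operator vocabulary and plan.
OPERATORS = ["file_scan", "hash_join", "sort"]
plan = [("file_scan", 1000), ("file_scan", 500), ("hash_join", 200)]
features = plan_features(plan, OPERATORS)
```

Operators that do not occur in a plan contribute zeros, so every query maps to a vector of the same length.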
Prediction based on neighbours
• How to find the 'nearest' neighbour?
– Euclidean distance captures the closest neighbour by magnitude.
– Cosine distance captures the closest neighbour by direction.
– Experiments suggest that Euclidean distance provides better prediction.
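The difference between the two distance measures shows up on a tiny example. This is a sketch with made-up 2-D vectors; in the approach described here the vectors would be projected query features.

```python
import math


def euclidean_distance(u, v):
    """Straight-line distance: sensitive to differences in magnitude."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_distance(u, v):
    """1 - cos(angle between u and v): sensitive only to direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


# Hypothetical vectors: `far_same_direction` points the same way as the query
# but is much larger; `near_other_direction` is close by but points elsewhere.
query = [1.0, 1.0]
far_same_direction = [10.0, 10.0]
near_other_direction = [2.0, 0.0]

candidates = [far_same_direction, near_other_direction]
nearest_euclidean = min(candidates, key=lambda c: euclidean_distance(query, c))
nearest_cosine = min(candidates, key=lambda c: cosine_distance(query, c))
```

Here the two measures disagree on which candidate is nearest, which is why the choice has to be settled experimentally.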
Prediction based on neighbours
• How many neighbours to consider when calculating the prediction?
• According to the experiments, using the 3 nearest neighbours provides a good trade-off.
Prediction based on neighbours
• How to map from the neighbours' performance metrics to the test query's performance metrics? Combine the neighbours' performance feature vectors:
– Equally weighted
– 1:2:3 weighted based on distance ranking
– Weighting proportional to distance from the test query feature vector
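The three weighting schemes can be compared in a short sketch. Several details are assumptions: neighbours are searched in the raw feature space rather than in a KCCA projection, the ranking scheme gives the nearest neighbour the largest weight, and the distance scheme uses inverse distance so that closer neighbours count more (the slide only says "proportional to distance"). Function names and data are hypothetical.

```python
import math


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def predict_from_neighbours(query, train_features, train_perf, k=3, scheme="equal"):
    """Combine the performance vectors of the k nearest training queries."""
    dists = [euclidean(query, f) for f in train_features]
    nearest = sorted(range(len(dists)), key=lambda i: dists[i])[:k]
    if scheme == "equal":
        weights = [1.0] * len(nearest)
    elif scheme == "rank":
        # Assumption: nearest neighbour gets the largest weight (3:2:1 when k = 3).
        weights = [float(len(nearest) - r) for r in range(len(nearest))]
    elif scheme == "distance":
        # Assumption: inverse distance; the small constant avoids division by zero.
        weights = [1.0 / (dists[i] + 1e-12) for i in nearest]
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    total = sum(weights)
    n_metrics = len(train_perf[0])
    return [
        sum(w * train_perf[i][m] for w, i in zip(weights, nearest)) / total
        for m in range(n_metrics)
    ]


# Hypothetical training set: feature vectors and [elapsed_time, message_count] outcomes.
train_features = [[1.0, 0.0], [0.0, 2.0], [4.0, 0.0], [10.0, 10.0]]
train_perf = [[10.0, 1.0], [20.0, 2.0], [40.0, 4.0], [1000.0, 100.0]]
query = [0.0, 0.0]

by_equal = predict_from_neighbours(query, train_features, train_perf, scheme="equal")
by_rank = predict_from_neighbours(query, train_features, train_perf, scheme="rank")
by_distance = predict_from_neighbours(query, train_features, train_perf, scheme="distance")
```

All metrics of the test query are predicted in one step from the same neighbours, which is what distinguishes this from fitting a separate model per metric.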
Experiment design
• Experiment 1: Train model with a realistic mix of query types: 1027 queries (30 bowling balls + 230 golf balls + 767 feathers)
• Experiment 2: Train model with 30 queries of each type: 90 queries (30 bowling balls + 30 golf balls + 30 feathers)
• Experiment 3: 2-step prediction with query type-specific models
• Experiment 4: Training and testing on queries using different data tables and schemas.
Experiment 1- Time
Experiment 1- Record usage
Experiment 1- Message count
Experiment 2- Time
Experiment 3- Time
Experiment 4- Time
Conclusions
• ML can predict performance metrics using only information available before executing the query.
• Prediction can greatly improve system sizing, capacity planning and workload management.
• I want to predict the percentage of up-to-date results for a query answered from cache, based on the statistics of similar queries.