Online aggregation


Page 1: Online aggregation

Online aggregation

Joseph M. Hellerstein, University of California, Berkeley

Peter J. Haas, IBM Research Division

Helen J. Wang, University of California, Berkeley

Presented by: Arjav Dave

Page 2: Online aggregation

Overview

• Introduction
• Goals
• Implementation
• Performance Evaluation
• Future Work

Page 3: Online aggregation

Problems with Aggregation

• Aggregation is performed in batch mode: a query is submitted, the system processes a large volume of data over a long period of time, and only then is the final answer returned.
• Users are forced to wait without feedback while the query is being processed.
• Aggregate queries are generally used to get a rough picture of a large body of data, yet they are computed with painstaking precision.

Page 4: Online aggregation

What is Online Aggregation and Why?

• Permits users to observe the progress of their aggregation queries.
• Lets users control the execution on the fly.

Page 5: Online aggregation

Example

• Consider the query:

select avg(final_grade) from grades where course_name = 'cse186';

• Without an index, the query scans all records before returning the answer.

Page 6: Online aggregation

Example using online aggregation

• The user can stop query processing as soon as the running result is within an acceptable confidence interval.

Page 7: Online aggregation

Interface with groups

• Consider a query with a GROUP BY clause that produces 6 groups in its output.
• There are 6 stop signs, one for each group. Each can be used to stop the processing of its group.
• Such an interface is easy to use for users without a statistical background.

Page 8: Online aggregation
Page 9: Online aggregation

Usability Goals

• Continuous Observation: Statistical, graphical and other intuitive interfaces. Interfaces must be extensible for each aggregate function.
• Control of Time/Precision: Users should be able to terminate processing at any time, controlling the trade-off between time and precision.
• Control of Fairness/Partiality: Users can control the relative rate at which different running aggregates are updated.

Page 10: Online aggregation

Performance Goals

• Minimum Time to Accuracy: Minimize the time required to produce a useful estimate of the final answer.
• Minimum Time to Completion: Minimize the time required to produce the final answer.
• Pacing: The running aggregates should be updated at a regular rate, but not so frequently that they overburden the user or the user interface.

Page 11: Online aggregation
Page 12: Online aggregation

Building a system for Online Aggregation

Two approaches:
• Naïve Approach
• Modifying a DBMS

Page 13: Online aggregation

Naïve Approach

• Consider the query:

select running_avg(final_grade), running_confidence(final_grade), running_interval(final_grade) from grades;

• POSTGRES supports user-defined functions, which are sufficient for simple running aggregates such as these. For complex aggregates, performance and functional issues arise.
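
The running aggregates named in the query above can be sketched outside the database. The following Python is a minimal illustration, not the POSTGRES implementation: it assumes the rows arrive in random order and uses a normal-approximation (large-sample) interval. The function names and the `min_rows` guard are hypothetical choices made for this sketch.

```python
import math
from statistics import NormalDist


def running_avg(values, confidence=0.95):
    """Yield (n, running average, interval half-width) after each value.

    Assumes `values` arrive in random order. The half-width uses the
    normal approximation z * s / sqrt(n), an assumption of this sketch.
    """
    z = NormalDist().inv_cdf((1 + confidence) / 2)
    n, mean, m2 = 0, 0.0, 0.0
    for x in values:
        # Welford's update keeps the mean and the sum of squared deviations.
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
        if n >= 2:
            half_width = z * math.sqrt(m2 / (n - 1) / n)
        else:
            half_width = math.inf  # no variance estimate from a single row
        yield n, mean, half_width


def stop_when_within(values, epsilon, confidence=0.95, min_rows=30):
    """Return the first (n, avg, half-width) whose half-width is <= epsilon.

    `min_rows` (hypothetical) guards against stopping on a tiny sample.
    If the input runs out first, the final estimate is returned.
    """
    result = None
    for result in running_avg(values, confidence):
        n, _, half_width = result
        if n >= min_rows and half_width <= epsilon:
            break
    return result
```

A caller would pass the grades in random order together with the largest half-width it is willing to accept, which mirrors the user pressing stop once the interval is tight enough.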

Page 14: Online aggregation

Modifying a DBMS

• Modify the database engine to support online aggregation.
• Random access to data:
▫ Necessary to get statistically meaningful estimates of the precision of running aggregates.
• Access methods:
▫ Heap Scan
▫ Index Scan
▫ Sampling from Indices

Page 15: Online aggregation

Types of Access Methods

• Heap Scan:
▫ Records are generally stored in no particular order, so a heap scan is an easy way to fetch them in effectively random order.
▫ In clustered files there may be some logical order, so choose another access method for queries over the clustered attributes.
• Index Scan:
▫ Returns tuples ordered by some attribute, or in groups based on some attribute.
• Sampling from Indices:
▫ Ideal for producing meaningful confidence intervals.
▫ Less efficient than the other two.

Page 16: Online aggregation

Fair, Non-Blocking Group By

• The traditional technique sorts and then groups, but sorting blocks.
• Instead, hash the input relation on its grouping columns. However, this does not perform well as the number of groups increases.
• Hybrid hashing, or its optimized version Hybrid Cache, can be used.
• The same technique can be used for DISTINCT queries.
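
The hash-based grouping described above can be sketched as follows. This is a minimal in-memory illustration in Python, assuming the hash table fits in memory. It does not implement Hybrid hashing or Hybrid Cache, and the function name and the `report_every` parameter are hypothetical.

```python
from collections import defaultdict


def online_group_avg(rows, report_every=2):
    """Yield a snapshot {group: running average} every `report_every` rows.

    `rows` is an iterable of (group_key, value) pairs. No sort is needed:
    each row updates its group's entry in a hash table as it arrives, so
    estimates for every group seen so far are available at any time.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    seen = 0
    for key, value in rows:
        sums[key] += value
        counts[key] += 1
        seen += 1
        if seen % report_every == 0:
            yield {k: sums[k] / counts[k] for k in sums}
    # Emit a final snapshot if the last rows were not yet reported.
    if seen % report_every != 0:
        yield {k: sums[k] / counts[k] for k in sums}
```

Because nothing waits for the input to end, the first snapshot appears after `report_every` rows, whereas a sort-based plan produces no output until the whole relation has been sorted.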

Page 17: Online aggregation

Index Striding

• Updates for a group with few members will be very infrequent.
• Index Striding: Uses a round-robin technique to fetch tuples from the different groups fairly.
• Advantages:
▫ Output is updated according to default or user settings.
▫ Delivery of tuples for a group can be stopped, so delivery of tuples for the other groups becomes much faster.
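
The round-robin idea behind index striding can be sketched in Python. The sketch assumes each group already has its own cursor, as an index on the grouping column might provide. The function name, the dictionary of cursors and the mutable `stopped` set are hypothetical stand-ins for the real index and user controls.

```python
from collections import deque


def index_stride(group_cursors, stopped=None):
    """Yield (group_key, value) pairs, one group at a time, round-robin.

    `group_cursors` maps each group key to an iterable over that group's
    values. `stopped` is a set the caller may add keys to while iterating;
    a stopped group is dropped the next time its turn comes up.
    """
    stopped = stopped if stopped is not None else set()
    active = deque((key, iter(cur)) for key, cur in group_cursors.items())
    while active:
        key, cursor = active.popleft()
        if key in stopped:
            continue  # the user stopped this group; never revisit it
        try:
            value = next(cursor)
        except StopIteration:
            continue  # group exhausted; the remaining groups get its turns
        yield key, value
        active.append((key, cursor))
```

With cursors `{"big": [1, 2, 3, 4], "small": [10]}` the small group gets its tuple in the second position instead of waiting behind the large group, and once a group is stopped or exhausted every remaining turn goes to the other groups.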

Page 18: Online aggregation

Non-Blocking Join Algorithms

• Sort-Merge Join: Not acceptable, because sorting blocks.
• Merge Join: Not acceptable for access methods that fetch tuples in sorted order.
• Hybrid Hash Join: Acceptable if the inner relation is small, as it blocks to hash the inner relation.
• Pipelined Hash Join: Less efficient than hybrid hash join, but efficient if both relations are large.
• Nested-Loops Join: Not useful for joins with a large, non-indexed inner relation.
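
A pipelined hash join avoids the blocking build phase by keeping a hash table on both inputs and probing as rows arrive. The Python below is a minimal in-memory sketch of that idea, assuming both hash tables fit in memory. The function and parameter names are hypothetical, and alternating strictly between the two inputs is a simplification.

```python
from collections import defaultdict
from itertools import zip_longest

_MISSING = object()  # marks the shorter input once it is exhausted


def pipelined_hash_join(left, right, left_key, right_key):
    """Yield matching (left_row, right_row) pairs without a blocking build.

    Rows are read alternately from both inputs. Each new row is inserted
    into its own side's hash table and immediately probes the other
    side's table, so every matching pair is emitted exactly once, when
    the later of its two rows arrives.
    """
    left_table = defaultdict(list)
    right_table = defaultdict(list)
    for l_row, r_row in zip_longest(left, right, fillvalue=_MISSING):
        if l_row is not _MISSING:
            k = left_key(l_row)
            left_table[k].append(l_row)
            for match in right_table.get(k, ()):
                yield l_row, match
        if r_row is not _MISSING:
            k = right_key(r_row)
            right_table[k].append(r_row)
            for match in left_table.get(k, ()):
                yield match, r_row
```

The price of the early output is a second hash table in memory, which fits the point above that this join is less efficient than hybrid hash join.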

Page 19: Online aggregation

Optimization

• Avoid sorting.
• The cost function for blocking sub-operations (e.g. hashing the inner relation in a hybrid hash join) should grow exponentially with their processing time.
• Maximize user control.
• The trade-off between the output rate and the time to completion needs to be evaluated.

Page 20: Online aggregation

Aggregate Functions

• New aggregate functions must be defined to return running confidence intervals.
• The query executor must be modified to provide running aggregate values for display.
• An API must be provided to control the rate, e.g. stopGroup, speedupGroup, slowDownGroup, setSkipFactor.

Page 21: Online aggregation

Running Confidence Intervals

• Precision of running aggregates is given by running confidence intervals.
• Three types:
▫ Conservative Confidence Intervals
▫ Large-Sample Confidence Intervals
▫ Deterministic Confidence Intervals
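
The slides name the three interval types without giving formulas. The Python sketch below uses standard constructions for a running average as an assumption: Hoeffding's inequality for the conservative interval, the central limit theorem for the large-sample interval, and known bounds on the values for the deterministic one. All function and parameter names are hypothetical, and `lo`/`hi` are assumed a priori bounds on every value in the column.

```python
import math
from statistics import NormalDist


def conservative_half_width(n, lo, hi, confidence=0.95):
    """Hoeffding-based half-width: valid for any n >= 1, but wide.

    Requires every value to lie in [lo, hi].
    """
    return (hi - lo) * math.sqrt(math.log(2 / (1 - confidence)) / (2 * n))


def large_sample_half_width(n, sample_std, confidence=0.95):
    """CLT-based half-width z * s / sqrt(n): tighter, approximate for small n."""
    z = NormalDist().inv_cdf((1 + confidence) / 2)
    return z * sample_std / math.sqrt(n)


def deterministic_interval(partial_sum, n, total_rows, lo, hi):
    """Interval that contains the final average with certainty.

    After n of total_rows rows, each unseen value is at least lo and at
    most hi, which bounds the final sum from both sides.
    """
    remaining = total_rows - n
    return ((partial_sum + remaining * lo) / total_rows,
            (partial_sum + remaining * hi) / total_rows)
```

For grades bounded by 0 and 100, the conservative half-width after 100 rows is about 13.6 points at 95% confidence, while the deterministic interval only becomes narrow once most of the rows have been processed.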

Page 22: Online aggregation
Page 23: Online aggregation

Future Work

• Enhancing User Interface
• Support for Nested Queries
• Checkpointing and Continuation
• Tracking online queries

Page 24: Online aggregation

QUESTIONS?