Autonomics and Data Management

Autonomics and Data Management Norman Paton University of Manchester

Page 1: Autonomics and Data Management

Autonomics and Data Management

Norman Paton

University of Manchester

Page 2: Autonomics and Data Management

Hypothesis

If database management systems are to be effective in an increasing range of challenging environments, such as grids, then automation will have to follow them into these new settings.

Page 3: Autonomics and Data Management

Outline

• Existing examples of automation.

• Limitations in current practice.

• Opportunities presented by ubiquitous automation.

Page 4: Autonomics and Data Management

Outline

• Existing examples of automation:

– Database administration.

– Query processing.

– Data integration.

• Limitations in current practice.

• Opportunities presented by ubiquitous automation.

Page 5: Autonomics and Data Management

Example: Database Administration

• Database administration involves setting values for a lot of controls:

– Where to put indexes.

– What views to materialise.

– How to allocate memory.

– Maximum number of concurrent transactions.

– Which disks to place data on.

– Which statistics to maintain.

– How often to refresh statistics.

– Which transaction isolation level to use.

• Autonomic database administration may set any of these automatically.

Page 6: Autonomics and Data Management

Multiprogramming Level

• The multiprogramming level (MPL) indicates the maximum number of concurrent transactions that may be run.

• Problem: excessive lock conflicts may lead to thrashing, either through deadlocks or significant amounts of blocking.

• Setting the MPL:

– If too high, then risk of thrashing.

– If too low, then too many jobs waiting in queue.

• The risk of thrashing at a given MPL depends on the update intensity of the transactions.

• G. Weikum, A. Mönkeberg, C. Hasse, P. Zabback: Self-tuning Database Technology and Information Services: from Wishful Thinking to Viable Engineering. VLDB 2002: 20-3.

Page 7: Autonomics and Data Management

Automating the Setting of MPL – 1

• Observation:

– Want to set the MPL as high as possible, but not too high!

– Identify a property that indicates that there is a high risk of conflicts.

– Conflict ratio:

• (# locks held by all transactions / # locks held by non-blocked transactions)

• Experimental and analytical studies indicated that a level of 1.3 or more means there is a high risk of thrashing.

Page 8: Autonomics and Data Management

Automating the Setting of MPL – 2

• Monitoring:

– Number of active transactions.

– Number of blocked transactions.

• Assessment:

– Conflict ratio exceeds 1.3.

• Response:

– Transaction admission policy:

• Block admission of new transactions from queue.

– Transaction cancellation policy:

• Cancel one or more blocking transactions.
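
The monitor, assess and respond steps above can be sketched as follows. This is a minimal illustration only, not the implementation from the cited work: the `Txn` record, its field names and the `admission_decision` function are assumptions introduced here, and only the 1.3 threshold and the conflict ratio definition come from the slides.

```python
from dataclasses import dataclass

CRITICAL_CONFLICT_RATIO = 1.3  # threshold quoted on the slides


@dataclass
class Txn:
    """Hypothetical monitoring record for one active transaction."""
    txn_id: int
    locks_held: int
    blocked: bool


def conflict_ratio(txns):
    """Locks held by all transactions / locks held by non-blocked ones."""
    total = sum(t.locks_held for t in txns)
    running = sum(t.locks_held for t in txns if not t.blocked)
    if total == 0:
        return 1.0           # no locks held, so no conflicts
    if running == 0:
        return float("inf")  # every lock holder is blocked
    return total / running


def admission_decision(txns):
    """Return 'hold' when the ratio signals thrashing risk, else 'admit'."""
    if conflict_ratio(txns) >= CRITICAL_CONFLICT_RATIO:
        return "hold"
    return "admit"
```

A cancellation policy would hook in at the same point, choosing a blocking transaction to abort instead of, or in addition to, holding the queue.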

Page 9: Autonomics and Data Management

Example: Query Evaluation

• Query optimization involves making lots of decisions:

– Which operators to use.

– What order to evaluate the operators in.

– What parallelism level to use.

– How to allocate work to parallel nodes.

• Adaptive query processing may revise any of the decisions made by a query optimizer during query evaluation.

Page 10: Autonomics and Data Management

Adaptation for Load Balancing

• In partitioned parallelism, a task is divided into subtasks that are run in parallel on different nodes.

• For a join, $A \bowtie B$ is represented as the union of the results of plan fragments $F_i = A_i \bowtie B_i$, for $i = 1..P$, where $P$ is the level of parallelism.

• The time taken to evaluate the join is $\max_{i = 1..P} \mathit{evaluation\_time}(F_i)$.

• As a result, any delay in completing a fragment Fi delays the completion of the operator, so it is crucial to match fragment size to node capabilities.

• Many join algorithms have state; as such, changing the size of a fragment allocated to a machine involves replicating or relocating operator state.
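
To make the effect of the max concrete, the sketch below compares an even split with a split proportional to node speed. It assumes a deliberately simple cost model in which a fragment's evaluation time is its tuple count divided by the node's processing rate; the function names and the rates are hypothetical.

```python
def join_time(fragment_sizes, node_speeds):
    """Elapsed time of a partitioned join: the slowest fragment decides."""
    return max(size / speed for size, speed in zip(fragment_sizes, node_speeds))


def proportional_split(total_tuples, node_speeds):
    """Size each fragment in proportion to its node's processing rate."""
    total_speed = sum(node_speeds)
    return [total_tuples * speed / total_speed for speed in node_speeds]


speeds = [100, 100, 50]             # tuples per second, assumed values
even = [100_000, 100_000, 100_000]  # same size on every node
balanced = proportional_split(300_000, speeds)

print(join_time(even, speeds))      # the slow node dominates
print(join_time(balanced, speeds))  # all fragments finish together
```

Under these assumptions the even split is held back by the slow node, while the proportional split lets every fragment finish at the same time.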

Page 11: Autonomics and Data Management

Load Balancing: Flux

• When load imbalance is detected:

– Halt query execution.

– Compute new distribution policy (dp).

– Update hash tables by transferring data between nodes.

– Update dp in parent exchange nodes.

– Resume query execution.

• M. Shah, J.M. Hellerstein, S. Chandrasekaran, M.J. Franklin, Flux: An Adaptive Partitioning Operator for Continuous Query Systems, ICDE, 25-36, 2003.

[Figure: query plan with Scan(A), the distribution policy (dp), and Join(A1,B1) and Join(A2,B2) with hash tables A1 and A2.]
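
The five steps can be sketched over in-memory hash tables as below. This is a simplified illustration, not Flux's actual protocol: the `Exchange` class and `rebalance` function are hypothetical, join keys are assumed to be integers, a key's partition is assumed to be `key % number_of_partitions`, and the new distribution policy is assumed to be computed by the caller.

```python
class Exchange:
    """Hypothetical parent exchange node: routes join keys to nodes via dp."""

    def __init__(self, dp):
        self.dp = dict(dp)   # partition id -> node id
        self.halted = False

    def route(self, key):
        if self.halted:
            raise RuntimeError("query execution is halted")
        return self.dp[key % len(self.dp)]


def rebalance(exchange, hash_tables, new_dp):
    """Apply the steps from the slide; new_dp is supplied by the caller."""
    exchange.halted = True                         # halt query execution
    n = len(new_dp)
    for node, table in list(hash_tables.items()):  # transfer hash table state
        misplaced = [k for k in table if new_dp[k % n] != node]
        for key in misplaced:
            hash_tables.setdefault(new_dp[key % n], {})[key] = table.pop(key)
    exchange.dp = dict(new_dp)                     # update dp in the exchange
    exchange.halted = False                        # resume query execution
```

Halting before the transfer keeps the exchange from routing tuples to a node whose state is in flight, which is the reason the slide lists the steps in this order.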

Page 12: Autonomics and Data Management

Example: Data Integration

• Data integration involves assembling information about the relationships between sources:

– What sources there are.

– The services provided by the source.

– The concepts represented in each source.

– How the data is represented.

– What relationships there are between extents.

– What mappings exist between source data types.

• Autonomic data integration involves inferring some of the above data.

Page 13: Autonomics and Data Management

Inferring Web Service Annotations

• Web service annotations are useful for:

– discovering services.

– composing workflows.

– characterising and identifying mismatches.

• However, service annotation is expensive, as it requires:

– knowledge of the ontology used for annotation.

– knowledge of the web services to be annotated.

• (Semi)automatic annotation can be carried out using:

– schema matching and text classification techniques.

– workflow specifications.

K. Belhajjame, S.M. Embury, N.W. Paton, R. Stevens and C.A. Goble, Automatic Annotation of Web Services Based on Workflow Definitions, Proc. 5th Intl. Semantic Web Conference, Springer, 116-129, 2006.

Page 14: Autonomics and Data Management

Inferring Web Service Annotations

• Use workflows to infer information about the semantics of linked parameters.
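
One simple reading of this idea is that an output and an input joined by a workflow link must carry compatible semantics, so a known annotation at one end is a candidate annotation for the other. The sketch below illustrates only that reading and is not the algorithm of the cited paper; the function name, the parameter names and the concept name are all hypothetical.

```python
def infer_annotations(links, known):
    """Propagate known semantic types across workflow links.

    links: iterable of (output_param, input_param) pairs taken from workflows.
    known: {param: concept} for parameters that are already annotated.
    Returns {param: set of candidate concepts} for unannotated parameters.
    """
    candidates = {}
    for out_param, in_param in links:
        # A link constrains both ends, so propagate in either direction.
        for src, dst in ((out_param, in_param), (in_param, out_param)):
            if src in known and dst not in known:
                candidates.setdefault(dst, set()).add(known[src])
    return candidates


links = [("getSequence.result", "align.query"),
         ("loadFile.content", "align.query")]
known = {"getSequence.result": "ProteinSequence"}
print(infer_annotations(links, known))
```

A parameter that collects several candidates would still need a human or an ontology-based check to settle on one.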

Page 15: Autonomics and Data Management

Summary on Examples of Automation

• Data management and integration are complex, with many possibilities to benefit from automation.

• Automation has been applied in many different settings, with many worthwhile results.

• The diversity in approaches to and technologies associated with automation is great.

Page 16: Autonomics and Data Management

Outline

• Existing examples of automation.

• Limitations in current practice.

• Opportunities presented by ubiquitous automation.

Page 17: Autonomics and Data Management

Outline

• Existing examples of automation.

• Limitations in current practice:

– Predictability.

– Methodology.

– Composability.

– Semantics.

• Opportunities presented by ubiquitous automation.

Page 18: Autonomics and Data Management

Limitations: Predictability

• Adaptive systems change system behaviour in response to runtime feedback. Risks include:

– Reacting too quickly in response to temporary effects.

– Reacting too slowly to be effective.

– Reacting in a way that makes things worse.

• It can be difficult for developers of adaptive systems to predict how effective their proposals might be.

• It sometimes takes several attempts to refine an adaptive strategy.

Page 19: Autonomics and Data Management

Adaptive Load Balancing: Comparison

• Several existing strategies were compared, across a range of environmental conditions.

• Conditions could be identified in which all of the proposals were worse than not adapting.

• Published evaluations of the existing proposals gave no indication of problematic cases.

• Several of the developers did not know under which circumstances their approaches performed poorly.

• N.W. Paton, V. Raman, G. Swart, I. Narang, Autonomic Query Parallelization using Non-dedicated Computers: An Evaluation of Adaptivity Options, Proc. ICAC, 221-230, 2006.

Page 20: Autonomics and Data Management

Adaptive Load Balancing: Experiment

• Query:

– P ⋈ PS (P has 200,000 tuples, PS has 800,000 tuples).

– Simulation of parallel run on three nodes.

• Types of imbalance:

– Constant: A consistent external load exists on one of the nodes throughout the experiment. The level of the external load represents the number of external tasks that are seeking to make full-time use of the machine.

– Periodic: The load on one of the machines comes and goes during the experiment. The duration of the load indicates for how long each load spike lasts; and the repeat duration represents the gap between load spikes.

Page 21: Autonomics and Data Management

Results: Constant Imbalance

Page 22: Autonomics and Data Management

Periodic Imbalance (1s)

Page 23: Autonomics and Data Management

Designing Adaptive Strategies

• Overheads: pessimistic strategies carry out additional work on the assumption that things will go wrong (e.g. replicating data).

• Adaptation costs: optimistic strategies evaluate queries as normal, but may pay a high price to carry out specific adaptations when required.

[Figure: strategies Adapt-1 to Adapt-5 positioned by Overheads and Adaptation Cost.]

Page 24: Autonomics and Data Management

Limitations: Methodology

• Adaptive data management proposals are generally described as specific algorithms or techniques:

– It is often not clear what methodology has been followed in their development.

– It is not necessarily clear if there are well established techniques that could have been used to direct their design.

• Approaches that have been applied in the design of adaptive systems include:

– Systematic functional decomposition.

– Control theory.

Page 25: Autonomics and Data Management

Autonomic Computing Architecture

• Autonomic systems typically involve a control loop, with monitoring information driving planning and decision making.

• IBM’s Autonomic Computing Toolkit provides components that implement a functional decomposition known as MAPE (Monitor, Analyze, Plan and Execute).

• The toolkit provides implementations for several of the components (in particular Monitor and Analyze).

J.O. Kephart, D.M. Chess, The Vision of Autonomic Computing, IEEE Computer, 36(1), 41-50, 2003.

Page 26: Autonomics and Data Management

Data Management and MAPE

• Sensors: what monitoring information should a database platform expose to enable effective decision making?

• Effectors: what hooks should a database platform expose to enable effective runtime modification?

• It is not straightforward:

– to retrofit sensing and effecting functionality.

– to predict what may be required.

• Monitor, Analyze, Plan and Execute components may also be implemented in different ways.

• Generic monitoring components have been proposed for tracking query progress and for adaptation:

– A. Gounaris, N. Paton, A. Fernandes, R. Sakellariou, Self-Monitoring Query Execution for Adaptive Query Processing, Data and Knowledge Eng., 51(3), 325-348, 2004.

– L. Luo, J. Naughton, C. Ellmann, M. Watzke, Towards a progress indicator for database queries, SIGMOD, 791-802, 2004.

Page 27: Autonomics and Data Management

Monitoring Query Progress

• Progress monitoring predicts properties of an operator incrementally from monitored data.

• Raw monitoring data may count the number of tuples returned by an operator, the average tuple size, etc.

• From such information, operator selectivity, result size and runtime can be estimated.

• Unnest:

– selectivity: $\sigma = n_{out} / n_{in}$

– $\mathit{cardinality} = \mathit{cardinality}_{operand} \cdot \sigma$

– $\mathit{size} = \mathit{cardinality}_{operand} \cdot \sigma \cdot \mathrm{avg}(\mathit{size}_{result\_tuple})$

– $\mathit{time} = \mathit{cardinality}_{operand} \cdot \sigma \cdot \mathit{tuple\_build\_cost}$
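
The unnest estimates can be computed directly from the monitored counters, as in the sketch below. The function and argument names are assumptions made for illustration; the arithmetic follows the formulas above, with selectivity taken as tuples out over tuples in so far.

```python
def estimate_unnest(n_in, n_out, operand_cardinality,
                    avg_result_tuple_size, tuple_build_cost):
    """Estimate final result properties from counters observed so far."""
    selectivity = n_out / n_in  # may exceed 1 for unnest
    cardinality = operand_cardinality * selectivity
    return {
        "selectivity": selectivity,
        "cardinality": cardinality,
        "size": cardinality * avg_result_tuple_size,
        "time": cardinality * tuple_build_cost,
    }


# After 1,000 of an expected 10,000 input tuples, 2,500 results were produced.
print(estimate_unnest(1000, 2500, 10_000, 40, 0.002))
```

Because the counters are refreshed as the operator runs, the estimates sharpen incrementally, which is what makes them usable for adaptation decisions mid-query.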

Page 28: Autonomics and Data Management

Building Adaptive Databases

• Most adaptive database extensions involve hard coding changes to the existing code base.

– Complex core infrastructure subject to intrusive changes.

– Steep learning curve for developers of adaptive extensions.

– Incremental changes result in reduced reuse.

• With respect to MAPE:

– Growing experience with generic monitoring.

– Considerable diversity in Analyze, Plan and Execute.

– Control theory provides some insights into decision making.

Page 29: Autonomics and Data Management

Control Theory

• Provides a systematic framework for computing a change to an input given a measured output.

• Designs seek to exhibit SASO properties:

– Stable: bounded input gives bounded output.

– Accurate: measured output converges on desired value.

– Short Settling: converges to stable value quickly.

– No Overshoot: achieves objectives in a steady manner.

• Either find a control engineer, study the textbook, or apply a well established model.

– J.L. Hellerstein, Y. Diao, S. Parakh, D.M. Tilbury, Feedback Control of Computing Systems, Wiley, 2004.

Page 30: Autonomics and Data Management

Control Theory: PID Controllers

Source: http://en.wikipedia.org/wiki/PID_control

Page 31: Autonomics and Data Management

PID Controllers Example

• Task: evaluating queries from a queue over a server.

• Objective: keep all query evaluation in memory to avoid use of multi-pass algorithms.

• Goal for controller: keep the amount of free memory at 512 MB to ensure that this condition is met.

• Control parameter: multiprogramming level.

Page 32: Autonomics and Data Management

Proportional Controller

• Terminology:

– m: output signal.

– Kp: proportional gain.

– e: error.

• Definition: $m = K_p \cdot e$.

• Query processing example:

– m: multiprogramming level.

– e: (amount of free memory – 512 MB).

– Kp: 1/(job size in MB): assumed 0.01, as jobs are 100 MB.

Page 33: Autonomics and Data Management

Proportional Controller: Example

e: Error (MB)    m: Multiprogramming Level Change
-1024            -10.24
-512             -5.12
-256             -2.56
0                0
256              2.56
512              5.12
1024             10.24
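
The table follows directly from m = Kp · e with Kp = 0.01. A minimal sketch, in which the constant and function names are assumptions chosen for illustration:

```python
TARGET_FREE_MB = 512  # goal for the controller, from the example
KP = 0.01             # 1 / (job size in MB), with 100 MB jobs


def mpl_change(free_memory_mb):
    """Proportional controller: change in MPL for the observed free memory."""
    error = free_memory_mb - TARGET_FREE_MB
    return KP * error


for error in (-1024, -512, -256, 0, 256, 512, 1024):
    print(error, round(mpl_change(TARGET_FREE_MB + error), 2))
```

In practice the fractional result would have to be rounded to a whole number of transactions before being applied.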

Page 34: Autonomics and Data Management

Integrative and Derivative Controllers

• Integrative Controller:

– Controller output depends on level and duration of error.

– Ki: integral gain.

– Ti: integral time.

– Definition: $m(t) = K_i \int_0^t e(\tau)\,d\tau$, with $K_i = K_p / T_i$.

• Differential Controller:

– Controller output depends on rate of change of error.

– Kd: differential gain.

– Td: derivative time.

– Definition: $m(t) = K_d \, \frac{de(t)}{dt}$, with $K_d = K_p T_d$.
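
Combining the proportional, integral and derivative terms gives a PID controller. The sketch below is a standard discrete-time approximation rather than anything taken from the slides: the integral is approximated by a running sum and the derivative by a difference between successive samples, and the class name and gains are hypothetical.

```python
class PID:
    """Discrete-time PID controller: m = Kp*e + Ki*sum(e*dt) + Kd*de/dt."""

    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = None

    def update(self, error, dt):
        """Return the control output for one new error sample."""
        self.integral += error * dt
        if self.prev_error is None:
            derivative = 0.0  # no earlier sample to difference against
        else:
            derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return (self.kp * error
                + self.ki * self.integral
                + self.kd * derivative)


controller = PID(kp=0.01, ki=0.001, kd=0.0)  # assumed gains
for free_mb in (768, 700, 600, 520):
    print(controller.update(free_mb - 512, dt=1.0))
```

Setting `ki` and `kd` to zero recovers the proportional controller from the earlier example; choosing the gains so that the SASO properties hold is the part that needs control-theoretic analysis.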

Page 35: Autonomics and Data Management

Control Theory for Data Management

• There are currently rather few examples of control theory being used in data management. Recent example in grid query processing:

– Anastasios Gounaris, Christos Yfoulis, Rizos Sakellariou and Marios Dikaiakos, Self-optimizing Block Transfer in Web Service Grids, WIDM, 2007.

• Modelling the relationship between measured values and controlled inputs can be challenging.

• Many adaptive data management techniques change more than an input parameter. For example:

– A query may be reoptimized by an adaptive query processor.

Page 36: Autonomics and Data Management

Limitations: Composability

• Many proposals for autonomic data management focus on specific adaptations:

– Selecting views for materialization.

– Selecting data for replication.

– Selecting fields for indexing.

– Allocation of memory to functions.

• … however, such decisions are often inter-related, and modelling the inter-relationships between such strategies is challenging.

Page 37: Autonomics and Data Management

Query Processing Inter-Dependency

• Load imbalance results from inappropriate allocation of work to resources in partitioned parallelism.

• Bottlenecks result from inappropriate allocation of work to resources in pipelined parallelism.

• There is no benefit from resolving load imbalance if the bottleneck is elsewhere in the plan.

• Resolving load imbalance may change the location of the bottleneck.

[Figure: query plan with joins over A, B and C and a coordinator, illustrating the Change Allocation and Remove Bottleneck adaptations.]

Page 38: Autonomics and Data Management

Limitations: Semantics

• Property guarantees:

– Autonomic systems change behaviour mid-task.

– Non-trivial adaptations may leave uncertainty as to whether an adaptation is meaning-preserving.

– Few adaptations have had their meaning-preserving properties proved:

• K. Eurviriyanukul, A. Fernandes, N. Paton, A Foundation for the Replacement of Pipelined Physical Join Operators in Adaptive Query Processing, EDBT Workshops, 589-600, 2006.

Page 39: Autonomics and Data Management

Limitations: Semantics

• Performance guarantees:

– Autonomic behaviour may take certain risks with performance.

– Some proposals may redo work, leading to the need for thresholds to remove the risk of continuous reoptimization:

• V. Markl, V. Raman, D. Simmen, G. Lohman, H. Pirahesh: Robust Query Processing through Progressive Optimization. SIGMOD Conference 2004: 659-67.

– Some algorithms provide bounded worst case performance:

• Daniel M. Yellin: Competitive algorithms for the dynamic selection of component implementations. IBM Systems Journal 42(1): 85-97 (2003).

Page 40: Autonomics and Data Management

Summary on Limitations of Automation

• Automation is currently partial in scope and often ad hoc in development.

• Automation is a second class citizen in data management; there is interest in the benefits it can bring but not so much in automation per se.

• As a result, automation in data management can be seen as immature, with considerable scope for improving the predictability, composability and clarity of proposals through enhanced methodologies.

Page 41: Autonomics and Data Management

Outline

• Existing examples of automation.

• Limitations in current practice.

• Opportunities presented by ubiquitous automation.

Page 42: Autonomics and Data Management

Outline

• Existing examples of automation.

• Limitations in current practice.

• Opportunities presented by ubiquitous automation:

– Increasing manageability of database technologies.

– Extending the reach of database technologies.

Page 43: Autonomics and Data Management

Increasing Manageability - 1

• Database products:

– Commercial database systems are typically associated with high total cost of ownership, resulting in significant measure from high administrative costs.

– Vendors are seeking to improve competitiveness by automating or supporting management of their intrinsically complex products.

• Data management components:

– It has been suggested that current database products are too complex, and that more data should be managed by lighter weight components.

– As yet, there is little evidence that light-weight data management components are being designed with automation in mind, but this is perhaps a practical proposition.

Page 44: Autonomics and Data Management

Increasing Manageability - 2

• There are increasing needs to manage personal data, and data management within workgroups or laboratories is often hindered by the complexity of current data management platforms.

• Personal and workgroup data management often has evolving requirements, but rarely needs the full range of capabilities of current database products.

• Proposals in this space:

– Data services: I. Subasu, P. Ziegler, K. Dittrich: Towards Service-Based Database Management Systems. BTW Workshops 2007: 296-30.

– Data components: S. Chaudhuri, G. Weikum: Rethinking Database System Architecture: Towards a Self-Tuning RISC-Style Database System. VLDB 2000: 1-1.

Page 45: Autonomics and Data Management

Increasing Reach - 1

• Most automation in data management has sought to ask the question:

– Which current requirements can be met better by increasing the ranges of tasks that are carried out automatically?

• An alternative view gives rise to a different question:

– If we assume that there is to be no manual administration, what sorts of data management system can be developed?

Page 46: Autonomics and Data Management

Increasing Reach - 2

• The vision of dataspaces is to support database style access over diverse sources with minimal manual integration.

– A. Halevy, M. Franklin, D. Maier: Principles of dataspace systems. PODS 2006: 1-9.

• Preliminary proposals match schemas automatically but partially, thus giving approximate answers that can be ranked.

– J-P. Dittrich, M. Salles: iDM: A Unified and Versatile Data Model for Personal Dataspace Management. VLDB 2006: 367-378.

– S. Abiteboul, N. Polyzotis: The Data Ring: Community Content Sharing. CIDR 2007: 154-16.

• The challenge is to enable querying over structured data in a personal file store, within an organisation or at internet scale, with no manual integration.

Page 47: Autonomics and Data Management

Conclusions

• Automation is already in lots of places:

– Database administration.

– Query evaluation.

– Data integration.

• Automation in data management is not mature:

– Predictability.

– Methodology.

– Composability.

– Semantics.

• If automation becomes a more central focus:

– Understanding of automation per se should improve.

– The nature of data management systems will change.