Distributed Databases

Transcript of lecture slides on parallel and distributed databases.


Outline

- Parallel databases: architecture, query evaluation, query optimization
- Distributed databases: architectures, data storage, catalog management, query processing, transactions

Introduction

- A parallel database system is designed to improve performance through parallelism
  ▪ Loading data, building indexes, evaluating queries
- Data may be stored in a distributed way, but solely for performance reasons
- A distributed database system is physically stored across several sites
  ▪ Each site is managed by an independent DBMS
- Distribution is driven by local ownership and availability as well as performance


Parallel Databases

Motivation

- How long does it take to scan a 1 terabyte table at 10 MB/s?
  ▪ 1 terabyte = 1,099,511,627,776 bytes (1024^4, or 2^40, bytes)
  ▪ 10 MB = 10,485,760 bytes
  ▪ 1,099,511,627,776 / 10,485,760 = 104,858 seconds
  ▪ 104,858 / (60 * 60 * 24) ≈ 1.2 days!
- Using 1,000 processors in parallel, the time can be reduced to roughly 1.75 minutes (104,858 / 1,000 ≈ 105 seconds)
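A quick check of this arithmetic (a minimal sketch; only the figures above are from the slides):

```python
TB = 1024 ** 4                    # 1 terabyte in bytes
rate = 10 * 1024 ** 2             # 10 MB/s in bytes per second

seconds = TB / rate
print(seconds)                    # 104857.6 seconds on one processor
print(seconds / (60 * 60 * 24))   # ~1.21 days
print(seconds / 1000 / 60)        # ~1.75 minutes on 1,000 processors
```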

Coarse-Grain and Fine-Grain

- A coarse-grain parallel machine consists of a small number of processors
  ▪ Most current high-end computers
- A fine-grain parallel machine uses thousands of smaller processors
  ▪ Also referred to as a massively parallel machine

Performance

- Both throughput and response time can be improved by parallelism
- Throughput – the number of tasks completed in a given time
  ▪ Processing many small tasks in parallel increases throughput
- Response time – the time it takes to complete a single task
  ▪ Subtasks of large transactions can be performed in parallel, reducing response time

Speed-Up, Scale-Up

- Speed-up: more resources means less time for a given amount of data
- Scale-up: if resources increase in proportion to the increase in data size, time is constant

[Figures: throughput vs. degree of parallelism, and response time vs. degree of parallelism, each shown against the ideal curve]
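These notions are commonly expressed as ratios (a standard formulation given here for reference, not taken from the slides):

```latex
% Speed-up: fixed problem size with n times the resources (ideal value: n)
\mathrm{speedup}(n) = \frac{T_{\mathrm{1\ processor}}}{T_{n\ \mathrm{processors}}}

% Scale-up: n times the problem on n times the resources (ideal value: 1)
\mathrm{scaleup}(n) = \frac{T_{\mathrm{small\ problem\ on\ small\ system}}}{T_{\mathrm{large\ problem\ on\ large\ system}}}
```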

Parallel Database Architecture

- Where possible, a parallel database should carry out evaluation steps in parallel
- There are many opportunities for parallelism in a relational database
- There are three main parallel DBMS architectures
  ▪ Shared nothing
  ▪ Shared memory
  ▪ Shared disk

Shared Memory Architecture

- Multiple CPUs attached to an interconnection network, accessing a common region of main memory
  ▪ Similar to a conventional system
- Good for moderate parallelism
  ▪ Communication overhead is low
  ▪ OS services control the CPUs
- Interference increases with size
  ▪ As CPUs are added, memory contention becomes a bottleneck
  ▪ Adding more CPUs eventually slows the system down

[Diagram: CPUs (P) and disks (D) attached to an interconnection network, all sharing global memory]

Shared Disk Architecture

- Each CPU has private memory and direct access to all data through the interconnection network
- Good for moderate parallelism
- Suffers from interference in the interconnection network, which acts as a bottleneck
- Not a good solution for a large-scale parallel system

[Diagram: CPUs (P), each with private memory (M), connected to shared disks (D) through an interconnection network]

Shared Nothing Architecture

- Each CPU has local memory and disk space
  ▪ No two CPUs access the same storage area
- All CPU communication is through the network
  ▪ Increases complexity
- Linear speed-up
  ▪ Operation time decreases in proportion to the increase in CPUs
- Linear scale-up
  ▪ Performance is maintained if the CPU increase is proportional to the data increase

[Diagram: CPUs (P), each with its own memory (M) and disk (D), communicating only through an interconnection network]

Parallel Query Evaluation

- A relational query execution plan is a tree, or graph, of relational algebra operators
- Operators in a query tree can be executed in parallel
  ▪ If one operator consumes the output of another, there is pipelined parallelism
  ▪ Otherwise the operators can be evaluated independently
- An operator blocks if it does not produce any output until it has consumed all its inputs
  ▪ Pipelined parallelism is limited by blocking operators

Single Operator Evaluation

- Individual operators can be evaluated in parallel by partitioning the input data
- In data-partitioned parallel evaluation, the input data is partitioned and worked on in parallel
  ▪ The results are then combined
- Tables are horizontally partitioned
  ▪ Different rows are assigned to different processors

Data Partitioning

- Partition using a round-robin algorithm
- Partition using hashing
- Partition using ranges of field values

Round-Robin

- Partition using a round-robin algorithm: assign record i to processor i mod n
  ▪ Similar to RAID systems
- Suitable for evaluating queries that access the entire table
- Less efficient for queries that access ranges of values, and for queries on equality
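A minimal sketch of round-robin assignment (the record list and partition count are illustrative):

```python
def round_robin_partition(records, n):
    """Assign record i to partition i mod n."""
    partitions = [[] for _ in range(n)]
    for i, record in enumerate(records):
        partitions[i % n].append(record)
    return partitions

# Seven records spread evenly across three processors
print(round_robin_partition(list(range(7)), 3))  # [[0, 3, 6], [1, 4], [2, 5]]
```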

Hash Partitioning

- Partition using hashing: a hash function based on selected attributes is applied to each record to determine its processor
- The data remains evenly distributed as the table grows, or shrinks, over time
- Good for equality selections
  ▪ Only one disk is used, leaving the others free
- Also useful for sequential scans where the partitioning attributes are a candidate key
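A minimal sketch of hash partitioning on one attribute (the table and key choice are illustrative; md5 stands in for whatever hash function the system uses):

```python
import hashlib

def hash_partition(records, key, n):
    """Route each record to a partition by hashing its partitioning attribute."""
    partitions = [[] for _ in range(n)]
    for record in records:
        h = int(hashlib.md5(str(record[key]).encode()).hexdigest(), 16)
        partitions[h % n].append(record)
    return partitions

employees = [{"empID": 111, "city": "Chicago"}, {"empID": 222, "city": "Surrey"}]
parts = hash_partition(employees, "empID", 4)
# An equality selection such as empID = 222 now touches exactly one partition
```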

Range Partitioning

- Partition using ranges of field values
  ▪ Ranges are chosen from the sort key values; each range should contain the same number of records
  ▪ Each disk contains one range
- If a range is too large, it can lead to data skew
  ▪ Skew can lead to the processors with large partitions becoming bottlenecks
- Good for equality selections and range selections
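A minimal sketch of range partitioning (the boundary values are illustrative; each boundary is an exclusive upper bound):

```python
import bisect

def range_partition(records, key, boundaries):
    """boundaries[i] is the upper bound (exclusive) of partition i."""
    partitions = [[] for _ in range(len(boundaries) + 1)]
    for record in records:
        # Index of the first partition whose range contains the key value
        p = bisect.bisect_right(boundaries, record[key])
        partitions[p].append(record)
    return partitions

ages = [{"age": a} for a in (29, 35, 43, 51)]
parts = range_partition(ages, "age", [30, 45])  # ranges: <30, 30-44, >=45
```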

Data Skew

- Both hash and range partitioning may result in data skew, where some partitions are larger or smaller than others
- Skew can dramatically reduce the speed-up obtained from parallelism
- In range partitioning, skew can be reduced by using histograms
  ▪ The histograms record the frequency of attribute values and are used to derive even partitions
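One way to derive balanced range boundaries is to take quantiles of a sorted sample, sketched below (an illustrative method in the spirit of the histogram idea, not the slides' specific algorithm):

```python
def balanced_boundaries(sample, n_partitions):
    """Pick boundaries so each range holds ~len(sample)/n_partitions records."""
    ordered = sorted(sample)
    step = len(ordered) // n_partitions
    # One boundary between each pair of adjacent partitions
    return [ordered[i * step] for i in range(1, n_partitions)]

ages = [21, 22, 23, 24, 30, 31, 32, 33, 60, 61, 62, 63]
print(balanced_boundaries(ages, 3))  # [30, 60] -> three ranges of 4 records each
```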

Parallel Evaluation Code

- Parallel data streams are used to provide data for relational operators
  ▪ The streams can come from different disks, or from the output of other operators
- Streams are merged or split
  ▪ Merged to provide the inputs for a relational operator
  ▪ Split as needed to parallelize processing
- These operations can buffer data, and should be able to halt operators that provide their input data
- A parallel evaluation consists of a network of relational, merge, and split operators

Types of Parallelism

- Inter-query parallelism
  ▪ Different queries or transactions execute in parallel
  ▪ Throughput is increased but response time is not
  ▪ Easy to support in a shared-memory system
- Intra-query parallelism
  ▪ Executing a single query in parallel to speed up large queries
  ▪ Which in turn can entail either intra-operation or inter-operation parallelism, or both

Parallel Operations

- Scanning and loading
  ▪ Pages can be read in parallel while scanning a relation
  ▪ The results can be merged
  ▪ If hash or range partitioning is used, selections can be directed to the relevant processors
- Sorting
- Joins

Sorting

- The simplest sort method is for each processor to sort its portion of the table, then merge the sorted records
  ▪ The merging phase may limit the amount of parallelism
- A better method is to first redistribute the records over the processors using range partitioning on the sort attributes
  ▪ Each processor sorts its set of records
  ▪ The sets of sorted records are then retrieved in order
- To make the partitions even, the data in the processors can be sampled
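A minimal sketch of the redistribute-then-sort method, reusing the range-partitioning idea above (in a real system each partition is sorted on its own processor; here the sorts simply run one after another):

```python
import bisect

def parallel_range_sort(records, boundaries):
    """Redistribute by range, sort each partition locally, read back in order."""
    # Step 1: redistribute records over "processors" by range on the sort key
    partitions = [[] for _ in range(len(boundaries) + 1)]
    for r in records:
        partitions[bisect.bisect_right(boundaries, r)].append(r)
    # Step 2: each processor sorts its own set of records independently
    sorted_parts = [sorted(p) for p in partitions]
    # Step 3: retrieving the partitions in range order yields the full sort
    return [r for part in sorted_parts for r in part]

print(parallel_range_sort([5, 1, 9, 3, 7], boundaries=[4, 8]))  # [1, 3, 5, 7, 9]
```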

Joins

- Join algorithms can be parallelized
  ▪ Parallelization is most effective for hash or sort-merge joins
  ▪ Parallel hash join is widely used
- The process for parallel hash join is:
  ▪ First partition the two tables across the processors using the same hash function
  ▪ Join the records locally, using any join algorithm
  ▪ Merge the results of the local joins; the union of these results is the join of the two tables
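A minimal sketch of a parallel hash join on one attribute (illustrative data; the local join at each partition is a simple nested loop):

```python
def parallel_hash_join(r_records, s_records, key, n):
    """Partition both tables with the same hash function, join locally, union."""
    r_parts = [[] for _ in range(n)]
    s_parts = [[] for _ in range(n)]
    for rec in r_records:
        r_parts[hash(rec[key]) % n].append(rec)
    for rec in s_records:
        s_parts[hash(rec[key]) % n].append(rec)
    # Matching records always land in the same partition, so each local
    # join is independent and could run on its own processor
    result = []
    for i in range(n):
        result += [{**r, **s} for r in r_parts[i] for s in s_parts[i]
                   if r[key] == s[key]]
    return result

emps = [{"empID": 111, "name": "Sam"}, {"empID": 222, "name": "Peter"}]
depts = [{"empID": 111, "dept": "Sales"}]
print(parallel_hash_join(emps, depts, "empID", 4))
```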

Improved Parallel Hash Join

- If tables are very large, parallel hash join may have a high cost at each processor
  ▪ If each partition is large, multiple passes will be required for the local joins
- An alternative approach is to use all processors for each partition
  ▪ Partition the tables using h1
  ▪ Each partition of the smaller relation should fit into the combined memory of the processors
  ▪ Process each partition using all processors, using h2 to determine which processor to send records to

Joins on Inequalities

- Partitioning is not suitable for joins on inequalities, such as R ⋈_{R.a < S.b} S
  ▪ Since any record in R could join with a record in S
- Fragment and replicate joins can be used instead
- In an asymmetric fragment and replicate join
  ▪ One of the relations is partitioned
  ▪ The other relation is replicated across all partitions
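A minimal sketch of the asymmetric variant (illustrative data; R is partitioned, while every processor receives a full copy of S):

```python
def asymmetric_fragment_replicate_join(R, S, n, condition):
    """Partition R across n processors; replicate S to each of them."""
    r_parts = [R[i::n] for i in range(n)]  # any partitioning of R works
    result = []
    for part in r_parts:          # each iteration stands in for one processor
        for r in part:
            for s in S:           # the processor's replicated copy of S
                if condition(r, s):
                    result.append((r, s))
    return result

R = [{"a": 1}, {"a": 5}, {"a": 9}]
S = [{"b": 4}, {"b": 6}]
# The inequality join R.a < S.b
print(asymmetric_fragment_replicate_join(R, S, 2, lambda r, s: r["a"] < s["b"]))
```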

Fragment and Replicate

- Each relation can be both fragmented and replicated
  ▪ Into m fragments of R and n fragments of S
  ▪ However, m * n processors are required
- This works with any join condition
  ▪ Used when partitioning is not possible

[Diagram: fragments R0 … Rm-1 and S0 … Sn-1 arranged in a grid of processors P0,0 … Pm-1,n-1; processor Pi,j joins fragment Ri with fragment Sj]

Other Operations

- Selection – the table may already be partitioned on the selection attribute
  ▪ If not, it can be scanned in parallel
- Duplicate elimination – use parallel sorting
- Projection – can be performed by scanning
- Aggregation – partition by the grouping attribute
  ▪ If records do have to be transferred between processors, it may be possible to send just partial results
  ▪ The final result can then be calculated from the partial results, e.g. a sum
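A minimal sketch of the partial-results idea for sum (illustrative: each processor ships one partial sum instead of all of its records):

```python
def parallel_sum(partitions):
    """Each processor sends a single partial sum to the coordinating site."""
    partial_sums = [sum(part) for part in partitions]  # computed locally
    return sum(partial_sums)                           # combined centrally

# Three processors, each holding one partition of the column being summed
print(parallel_sum([[1, 2, 3], [10, 20], [100]]))  # 136
```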

Costs of Parallelism

- Using parallel processors reduces the time to perform an operation
  ▪ Possibly to as little as 1/n of the original cost, where n is the number of processors
- However, there are also additional costs
  ▪ Start-up costs for initiating the operation
  ▪ Skew, which may reduce the speed-up
  ▪ Contention for resources, resulting in delays
  ▪ Cost of assembling the final result

Parallel Query Optimization

- As well as parallelizing individual operators, different operators can be processed in parallel
  ▪ Different processors perform different operations
- The result of one operator can be pipelined into another
  ▪ Note that sorting and the partitioning phase of hash join block pipelines
- Multiple independent operations can be executed concurrently
  ▪ Using bushy, rather than left-deep, join trees

Parallel vs. Serial Plans

- The best serial plan may not be the best parallel plan
  ▪ Note also that parallelization introduces further complexity into query optimization
- Consider a table partitioned across two nodes, each with a local secondary index
  ▪ Node 1 contains names between A and M
  ▪ Node 2 contains names between N and Z
- Consider the selection: name < "Noober"
  ▪ Node 1 should scan its partition, but Node 2 should use the name index

Parallel System Design

- In a large-scale parallel system the chances of failure increase
- Such systems should be designed to operate even if a processor or disk fails
  ▪ Data can be replicated across multiple processors
  ▪ Failed processors or disks are tracked, and requests are re-routed to the backup

Summary

- Architecture
  ▪ Shared-memory is easy, but costly and does not scale well
  ▪ Shared-nothing is cheap and scales well, but is harder to implement
- Both intra-operation and inter-operation parallelism are possible
- Most relational algebra operations can be performed in parallel
- How the data is partitioned across processors is very important


Distributed Databases

Introduction

- A distributed database is motivated by a number of factors
- Increased availability
  ▪ If a site containing a table goes down, the table may still be available if a copy is maintained at another site
- Distributed access to data
  ▪ An organization may have branches in several cities
  ▪ Access patterns are typically affected by locality
- Analysis of distributed data
  ▪ Distributed systems must support integrated access

Ideal

- Data is stored at several sites, each managed by an independent DBMS
- The system should make the fact that data is distributed transparent to the user
- Distributed data independence
  ▪ Users should not need to know where the data is located
  ▪ Queries that access several sites should be optimized
- Distributed transaction atomicity
  ▪ Users should be able to write transactions that access several sites in the same way as local transactions

Reality

- Users may have to be aware of where data is located
  ▪ Distributed data independence and distributed transaction atomicity may not be supported
- These properties may be hard to support efficiently
  ▪ Sites may be connected by a slow long-distance network
- Consider a global system
  ▪ Administrative overheads for viewing data as a single unified collection may be prohibitively expensive

Distributed vs. Parallel

- Distributed and shared-nothing parallel systems appear similar
- In practice they are often very different, since distributed DBs are typically
  ▪ Geographically separated
  ▪ Separately administered
  ▪ Connected by slower interconnections
  ▪ Running both local and global transactions

Types of Distributed Database

- Homogeneous
  ▪ Data is distributed, but every site runs the same DBMS software
- Heterogeneous, or multidatabase
  ▪ Different sites run different DBMSs, and the sites are connected to enable access to data
  ▪ Requires standards for gateway protocols
- A gateway protocol is an API that allows external applications access to the database
  ▪ e.g. ODBC and JDBC
  ▪ Gateways add a layer of processing, and may not be able to entirely mask differences between servers

Distributed DBMS Architecture

- Client-Server
- Collaborating Server
- Middleware

Client-Server Systems

- One or more client processes and one or more server processes
  ▪ A client process sends a query to any one server process
  ▪ Clients are responsible for the UI
  ▪ Servers manage data and execute transactions
- A popular architecture
  ▪ Relatively simple to implement
  ▪ Servers do not have to deal with user interactions
  ▪ Users can run a GUI on clients
- Communication between client and server should be as set-oriented as possible
  ▪ e.g. stored procedures vs. cursors

Collaborating Server Systems

- Client-server systems do not allow a single query to access multiple servers, as this would require
  ▪ Breaking the query into sub-queries to be executed at different sites, and merging the answers to the sub-queries
  ▪ To do this, the client would have to be overly complex
- In a collaborating server system the distinction between clients and servers is eliminated
  ▪ A collection of DB servers, each able to run transactions against local data
  ▪ When a query is received that requires data from other servers, the server generates appropriate sub-queries

Middleware Systems

- Designed to allow a single query to access multiple servers, but without requiring all servers to be capable of managing multi-site query execution
  ▪ Often used to integrate legacy systems
- Requires one database server (the middleware) capable of managing multi-server queries
  ▪ Other servers only handle local queries and transactions
  ▪ The special server coordinates queries and transactions
  ▪ The middleware server typically doesn't maintain any data


Distributed Data

Storing Distributed Data

- In a distributed system, tables are stored across several sites
  ▪ Accessing a table stored elsewhere incurs message-passing costs
- A single table may be replicated or fragmented across several sites
  ▪ Fragments are stored at the sites where they are most often accessed
  ▪ Several replicas of a table may be stored at different sites
  ▪ Fragmentation and replication can be combined

Fragmentation

- Fragmentation consists of breaking a table into smaller tables, or fragments
  ▪ The fragments are stored instead of the original table, possibly at different sites
- Fragmentation can be either horizontal (a subset of the rows) or vertical (a subset of the columns)

TID | empID | fName    | lName   | age | city
----+-------+----------+---------+-----+--------
 1  | 111   | Sam      | Spade   | 43  | Chicago
 2  | 222   | Peter    | Whimsey | 51  | Surrey
 3  | 333   | Sherlock | Holmes  | 35  | Surrey
 4  | 444   | Anita    | Blake   | 29  | Boston

Managing Fragmentation

- Records that belong to a horizontal fragment are usually identified by a selection query
  ▪ e.g. all the records that relate to a particular city, achieving locality and reducing communication costs
  ▪ A horizontally fragmented table can be recreated by computing the union of the fragments
  ▪ Fragments are usually required to be disjoint
- Records belonging to a vertical fragment are identified by a projection query
  ▪ The collection of vertical fragments must be a lossless-join decomposition
  ▪ A unique tuple ID is often assigned to each record

Replication

- Replication entails storing several copies of a table, or of table fragments, for
  ▪ Increased availability of data, which protects against failure of individual sites and failure of communication links
  ▪ Faster query evaluation: queries can execute faster by using a local copy of a table
- There are two kinds of replication, synchronous and asynchronous
  ▪ These differ in how replicas are kept current when the table is modified

Managing Distributed Catalogs

- Distributing data across sites adds complexity
  ▪ It is important to track where replicated or fragmented tables are stored
- Each replica or fragment must be uniquely named
  ▪ Naming should be performed locally
  ▪ A global relation name consists of {birth site, local name}
  ▪ The birth site is the site where the table was created
- A site catalog records the fragments and replicas at a site, and tracks replicas of tables created at the site
  ▪ To locate a table, look up its birth site catalog
  ▪ The birth site never changes, even if the table is moved


Distributed Queries

Distributed Query Processing

- Estimating the cost of an evaluation plan must include communication costs
  ▪ Evaluate the number of page reads or writes, and the number of pages that must be sent from one site to another
- Pages may need to be shipped between a number of sites
  ▪ The sites where the data is located and where the result is computed, and
  ▪ The site that initiated the query

Single Table Queries

- Simple, one-table queries are affected by fragmentation and replication
- If a table is horizontally fragmented, a query may have to be evaluated at multiple sites, and the union of the results computed
  ▪ Selections that only require data at one site can be executed just at that site
- If a table is vertically fragmented, the fragments have to be joined on the common attribute
- If a table is replicated, the shipping costs have to be considered to determine which site to use

Joins in a Distributed DBMS

- Joins of tables at different sites can be very expensive
- There are a number of strategies for computing joins
  ▪ Fetch as needed
  ▪ Ship to one site
  ▪ Semi-joins and Bloom-joins

Fetch As Needed Joins

- Designate one table as the outer relation, and compute the join at that site
- Fetch records of the inner relation as needed; the cost depends on
  ▪ The size of the relations
  ▪ Whether the inner relation is cached at the outer relation's site; if not, communication costs are incurred each time the inner relation is read
  ▪ The size of the result relation
- If the size of the result (R ⋈ S) is greater than R + S, it is cheaper to ship both relations to the query site

Ship to One Site

- In this strategy, relations are shipped to one site and the join is carried out at that site
  ▪ The site can be one of the sites involved in the join
  ▪ The result then has to be shipped from where it was computed to the site where the query was posed
- Alternatively, both input relations can be shipped to the site where the query was originally posed
  ▪ The join is then computed at that site

Semi-joins and Bloom-joins

- Consider a join between two relations, R and S, at different sites, London and Vancouver
  ▪ Assume that S (the inner relation) is to be shipped to London, where the join will be computed
  ▪ Note that some S records may not join with any R record
- Shipping costs can be reduced by only shipping those S records that will actually join with R records
- There are two techniques that reduce the number of S records to be shipped: semi-joins and Bloom-joins

Semi-joins

- At the first site (London), compute the projection of R on the join columns, πa(R)
  ▪ Ship this relation to the second site (Vancouver)
- At Vancouver, compute the join of πa(R) and S
  ▪ The result of this join is the reduction of S with respect to R
  ▪ Ship the reduction of S to London
- At London, compute the join of the reduction of S and R
- The effectiveness of this technique depends on how much smaller the reduction of S is than S
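A minimal sketch of the three semi-join steps on illustrative data (in practice steps 1 and 3 run at London and step 2 at Vancouver; "shipping" is just a variable handoff here):

```python
def semi_join(R, S, key):
    # Step 1 (London): project R onto the join column and ship it
    join_values = {r[key] for r in R}
    # Step 2 (Vancouver): the reduction of S - only records that will join
    reduction = [s for s in S if s[key] in join_values]
    # Step 3 (London): join R with the shipped, smaller reduction of S
    return [{**r, **s} for r in R for s in reduction if r[key] == s[key]]

R = [{"a": 1, "x": "r1"}, {"a": 2, "x": "r2"}]
S = [{"a": 2, "y": "s1"}, {"a": 9, "y": "s2"}]  # the a=9 record is never shipped
print(semi_join(R, S, "a"))  # [{'a': 2, 'x': 'r2', 'y': 's1'}]
```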

Bloom-joins

- Bloom-joins are similar to semi-joins, except that a bit vector is sent to the second site
  ▪ The vector has size k, and each record in R is hashed to it
  ▪ A bit is set to 1 if a record hashes to it
  ▪ The hash function is on the join attribute
- The reduction of S is then computed in step 2
  ▪ By hashing the records of S to the bit vector
  ▪ Only those records that hash to a bit with the value 1 are included in the reduction
- The cost of sending the bit vector is less than the cost of sending the projection of R on the join attribute
  ▪ But some unwanted records of S may end up in the reduction
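A minimal sketch of the bit-vector exchange (k and the data are illustrative; as noted above, false positives can survive the reduction):

```python
def build_bit_vector(R, key, k):
    """London: hash each R record's join attribute into a k-bit vector."""
    bits = [0] * k
    for r in R:
        bits[hash(r[key]) % k] = 1
    return bits

def reduce_with_bits(S, key, bits):
    """Vancouver: keep only S records whose hash hits a set bit."""
    k = len(bits)
    return [s for s in S if bits[hash(s[key]) % k] == 1]

R = [{"a": 1}, {"a": 2}]
S = [{"a": 2}, {"a": 9}, {"a": 42}]
bits = build_bit_vector(R, "a", k=16)   # 16 bits shipped instead of a projection
print(reduce_with_bits(S, "a", bits))   # the reduction shipped back to London
```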

Cost Based Optimization

- The basic cost-based approach is to consider a set of plans and pick the cheapest
  ▪ Communication costs must be considered
  ▪ Local autonomy must be respected
  ▪ Some operations can be carried out in parallel
- The query site generates a global plan, with suggested local plans
  ▪ Local sites are allowed to change their suggested plans if they can improve them


Distributed Transactions

Updating Distributed Data

- If data is distributed, it should be transparent to users
  ▪ Users should be able to ask queries without having to worry about where tables are stored
  ▪ Transactions should be atomic actions, regardless of data fragmentation or replication
- If so, all copies of a replicated relation must be modified before the transaction commits
  ▪ Referred to as synchronous replication
- Another approach, asynchronous replication, allows copies of a relation to differ
  ▪ More efficient, but compromises data independence

Synchronous Replication

- There are two techniques for ensuring that a transaction sees the same values regardless of which copy of an object it accesses
- In voting, a transaction must write a majority of copies to modify an object, and
  ▪ Must read enough copies to ensure that it sees at least one most recent copy
  ▪ e.g. with 10 copies of an object, at least 6 copies must be written, and at least 5 read
- Note that the copies include a version number, so that it is possible to tell which copy is the latest
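The numbers in this example satisfy the standard quorum overlap condition, checked below (R + W > N guarantees that every read set intersects every write set; the rule itself is textbook quorum reasoning rather than something stated on the slide):

```python
def quorums_overlap(n_copies, write_quorum, read_quorum):
    """Any read set must intersect any write set: R + W > N."""
    return read_quorum + write_quorum > n_copies

print(quorums_overlap(10, 6, 5))  # True: any 5 copies read include a current one
print(quorums_overlap(10, 6, 4))  # False: a read of 4 could miss all 6 written
```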

Synchronous Replication …

- Voting is not generally an efficient technique
  ▪ Reading an object requires that multiple copies of the object be read
  ▪ Typically, objects are read more often than they are written
- The read-any write-all policy allows any single copy to be read, but
  ▪ All copies must be written when an object is written
  ▪ Writes are slower, relative to voting, but reads are fast, particularly if a local copy is available
- Read-any write-all is usually used for synchronous replication

Synchronous Replication Cost

- Synchronous replication is expensive
  ▪ Before an update transaction is committed, it must obtain X locks on all copies of the data
  ▪ This may entail sending lock requests to remote sites and waiting for the locks to be confirmed, while holding its other locks
  ▪ If sites or communication links fail, the transaction cannot commit until they are back up
  ▪ Committing the transaction requires sending multiple messages as part of a commit protocol
- An alternative is to use asynchronous replication

Asynchronous Replication

- A transaction is allowed to commit before all the copies have been changed
  ▪ Readers still only look at a single copy
  ▪ Users must be aware of which copy they are reading, and that copies may be out of sync
- There are two approaches to asynchronous replication
  ▪ Peer-to-peer, and
  ▪ Primary site

Peer-to-Peer Replication

- More than one copy can be designated as updatable, i.e. a master
  ▪ Changes to the master(s) must be propagated to the other copies
  ▪ If two masters are changed, a conflict resolution strategy must be used
- Peer-to-peer replication is best used when conflicts do not arise
  ▪ Where each master site owns a disjoint fragment (usually a horizontal fragment)
  ▪ Or where update rights are only held by one master at a time; a backup site may gain update rights if the main site fails

Primary Site Replication

- One copy of a table is designated as the primary, or master, copy
  ▪ Users register or publish the primary copies
  ▪ Other sites subscribe to the table (or fragments of it) by creating secondary copies
  ▪ Secondary copies cannot be directly updated
- Changes to the primary copy must be propagated to the secondary copies
  ▪ First, capture the changes made by committed transactions
  ▪ Then, apply the changes to the secondary copies

Capture Step

- Log-based capture creates an update record from the recovery log when it is written to stable storage
  ▪ Log changes that affect replicated tables are written to a change data table (CDT)
  ▪ Note that aborted transactions must, at some point, be removed from the CDT
- Another approach is to use procedural capture
  ▪ A trigger invokes a procedure which takes a snapshot of the primary copy
- Log-based capture is cheaper and has less delay, but relies on proprietary log details

Apply Step

- The apply step takes the captured changes and propagates them to the secondary copies
  ▪ Changes can be continuously pushed from the master whenever a CDT is generated, or
  ▪ Periodically requested (pulled) by the copies; a timer or application controls the frequency of the requests
- Log-based capture with continuous apply minimizes delay
  ▪ A cheaper substitute for synchronous replication
- Procedural capture with application-driven apply gives the most flexibility

Data Warehousing

- Complex decision-support queries that require data from multiple sites are popular
  ▪ To improve query efficiency, all the data can be copied to one site, which is then queried
  ▪ These data collections are called data warehouses
- Warehouses use asynchronous replication
  ▪ The source data is typically controlled by different DBMSs
  ▪ Source data often has to be cleaned when creating the replicas
- Procedural capture and application-driven apply are best suited to this environment

Distributed Transactions

- Transactions may be submitted at one site but can access data at other sites
  ▪ The transaction manager breaks the transaction into sub-transactions that execute at different sites
  ▪ The sub-transactions are submitted to the other sites
  ▪ The transaction manager at the initial site must coordinate the activity of the sub-transactions
- Distributed concurrency control
  ▪ Locks and deadlocks must be managed across sites
- Distributed recovery
  ▪ Transaction atomicity must be ensured across sites

Distributed Locking Schemes

- In centralized locking, a single site is in charge of handling lock and unlock requests
  ▪ This is vulnerable to single-site failure and bottlenecks
- In primary copy locking, all locking for an object is done at its primary copy site
  ▪ Reading a copy of an object usually requires communication with two sites
- In fully distributed locking, lock requests are handled by the lock manager at the local site
  ▪ X locks must be set at all sites when copies are modified
  ▪ S locks are only set at the local site
- There are other protocols for locking replicated data

Distributed Deadlock

- If deadlock detection is being used (rather than prevention), the scheme must be modified
  ▪ Centralized – send all local waits-for graphs to a central site
  ▪ Hierarchical – organize sites into a hierarchy and send local graphs to the parent site
  ▪ Timeout – abort a transaction if it waits too long
- Communication delays can cause phantom deadlocks

[Diagram: local waits-for graphs involving T1 and T2 at site A and site B; neither local graph has a cycle, but their union, the global graph, does]
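A minimal sketch of centralized detection matching the figure: union the local waits-for graphs at a central site and test for a cycle (illustrative code, not a DBMS's actual detector; the edge directions in the example are assumed):

```python
def has_cycle(graph):
    """Depth-first search for a cycle in {txn: set of txns it waits for}."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {t: WHITE for t in graph}

    def visit(t):
        color[t] = GRAY
        for u in graph.get(t, set()):
            if color.get(u, BLACK) == GRAY:      # back edge: a cycle
                return True
            if color.get(u, BLACK) == WHITE and visit(u):
                return True
        color[t] = BLACK
        return False

    return any(color[t] == WHITE and visit(t) for t in graph)

site_a = {"T1": {"T2"}}                 # at site A, T1 waits for T2
site_b = {"T2": {"T1"}}                 # at site B, T2 waits for T1
global_graph = {}
for local in (site_a, site_b):          # the central site unions the local graphs
    for t, waits in local.items():
        global_graph.setdefault(t, set()).update(waits)
print(has_cycle(global_graph))          # True: T1 -> T2 -> T1 is a deadlock
```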

Distributed Recovery

- Recovery in a distributed system is more complex
- New kinds of failure can occur
  ▪ Communication failures, and
  ▪ Failures at remote sites where sub-transactions are executing
- To ensure atomicity, either all or none of a transaction's sub-transactions must commit
  ▪ This property must be guaranteed regardless of site or communication failures
  ▪ This is achieved using a commit protocol

Normal Execution

- During normal execution each site maintains a log
  ▪ Transactions are logged where they execute
- The transaction manager at the originating site is called the coordinator
  ▪ Transaction managers at sub-transaction sites are referred to as subordinates
- The most widely used commit protocol is two-phase commit (2PC)
  ▪ The 2PC protocol for normal execution starts when the user commits a transaction

Two Phase Commit (2PC)

- The coordinator sends prepare messages
- Subordinates decide whether to abort or commit
  ▪ Force-write an abort or prepare log record
  ▪ Send a no or yes message to the coordinator
- If the coordinator receives unanimous yes votes, it force-writes a commit record and sends commit messages
  ▪ Otherwise, it force-writes an abort record and sends abort messages
- Subordinates force-write abort or commit log records and send acknowledge messages to the coordinator
- When all acknowledge messages have been received, the coordinator writes an end log record
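A minimal sketch of these steps from the coordinator's point of view (the Log and Subordinate classes are illustrative stand-ins for a real DBMS's logging and messaging, not an actual API):

```python
class Log:
    def force_write(self, rec): print(f"coordinator: force-write {rec}")
    def write(self, rec): print(f"coordinator: write {rec}")

class Subordinate:
    def __init__(self, name, vote): self.name, self.vote = name, vote
    def prepare(self):
        # A subordinate force-writes prepare (yes) or abort (no), then replies
        rec = "prepare" if self.vote == "yes" else "abort"
        print(f"{self.name}: force-write {rec}, vote {self.vote}")
        return self.vote
    def receive_decision(self, decision):
        print(f"{self.name}: force-write {decision}, send ack")

def two_phase_commit(log, subordinates):
    votes = [s.prepare() for s in subordinates]        # phase 1: voting
    decision = "commit" if all(v == "yes" for v in votes) else "abort"
    log.force_write(decision)                          # log before sending
    for s in subordinates:                             # phase 2: termination
        s.receive_decision(decision)                   # and collect the acks
    log.write("end")                                   # safe to forget T
    return decision

two_phase_commit(Log(), [Subordinate("site1", "yes"), Subordinate("site2", "yes")])
```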

2PC Notes

- 2PC requires two rounds of messages
  ▪ Voting phase
  ▪ Termination phase
- Any site's transaction manager can unilaterally abort a transaction
- Log records describing decisions are always forced to stable storage before the corresponding message is sent
- Log records include the record type, transaction ID, and coordinator ID
  ▪ The coordinator's commit or abort log record also includes the IDs of all subordinates

2PC Recovery

- If there is a commit or abort log record for a transaction T, but no end record, T must be redone (for commit) or undone (for abort)
  ▪ If the site is the coordinator, it keeps sending commit or abort messages until all acknowledge messages are received
- If there is a prepare log record for T, but no commit or abort record, the site is a subordinate
  ▪ The coordinator is repeatedly contacted to determine T's status, until a commit or abort message is received
- If there is no prepare log record for T, the transaction is unilaterally aborted
  ▪ And an abort message is sent if the site is contacted by a subordinate

Blocking and Site Failures

- If a coordinator fails, the subordinates may be unable to determine whether to commit or abort
  ▪ The transaction is blocked until the coordinator recovers
- What happens if a remote site does not respond during the commit protocol?
  ▪ If the site is the coordinator, the transaction should be aborted
  ▪ If the site is a subordinate that has not voted yes, the transaction should be aborted
  ▪ If the site is a subordinate that has voted yes, it is blocked until the coordinator responds

2PC Comments

- The acknowledge messages are used to tell the coordinator that it can forget a transaction
  ▪ Until all acknowledge messages are received, it must keep T in the transaction table
- The coordinator may fail after sending prepare messages, but before sending commit or abort messages
  ▪ It then has no information about the transaction's status from before the crash, so it subsequently aborts the transaction
  ▪ If another site enquires about T, the recovery process responds with an abort message
- If a sub-transaction performs no updates, its commit or abort status is irrelevant

2PC with Presumed Abort

- When a coordinator aborts T, it can undo T and remove it from the transaction table immediately
  ▪ If there is no information about T, it is presumed to be aborted
- Similarly, subordinates do not need to send ack messages for aborts
  ▪ As the coordinator does not have to wait for acks to abort a transaction
- Abort log records do not have to be force-written
  ▪ As the default decision is to abort a transaction

2PC with Presumed Abort …

- If a sub-transaction does not perform updates, it responds to prepare with a reader message
  ▪ And writes no log records
- If the coordinator receives a reader message, it is treated as a yes vote
  ▪ But no further messages are sent to that subordinate
- If all sub-transactions are readers, the second phase of the protocol is not required
  ▪ The transaction can be removed from the transaction table


Cloud Databases

Cloud Computing

- In cloud computing a vendor supplies computing resources as a service
  ▪ A large number of computers are connected through a communication network, such as the internet
- The client runs applications and stores data using these resources
  ▪ And can access the resources with little effort

Web Databases

- Web applications have to be highly scalable
  ▪ Applications may have hundreds of millions of users
  ▪ Requiring data to be partitioned across thousands of processors
- There are a number of systems for data storage on the cloud
  ▪ Such as Bigtable (from Google)
  ▪ They do not necessarily guarantee the ACID properties; they drop ACID …

Data Representation

- Many web data storage systems are not built around an SQL data model
  ▪ Such as NoSQL DBs or Bigtable
  ▪ Some support semi-structured data
- Many web applications manage without extensive query language support
- Data storage systems often allow multiple versions of data items to be stored
  ▪ Versions can be identified by timestamp

Partitioning Data

- Data is often partitioned using hash or range partitioning
  ▪ Such partitions are referred to as tablets
  ▪ Partitioning is performed dynamically as required
- It is necessary to know which site contains a particular tablet
  ▪ A tablet controller site tracks the partitioning function, and can map a request to the appropriate site
  ▪ The mapping information can be replicated to a set of router sites, so that the controller does not become a bottleneck

Traditional DBs on the Cloud

- A cloud DB introduces a number of challenges to making a DB ACID compliant
  ▪ Locking
  ▪ Ensuring transactions are atomic
  ▪ Frequent communication between sites
- In addition, there are a number of issues that relate to both DBs and data storage
  ▪ Replication is controlled by the cloud vendor
  ▪ Security and legal issues