Distributed Database: Part 2

Distributed DBMS

• A distributed database requires a distributed DBMS
• Functions of a distributed DBMS:
  – Locate data with a distributed data dictionary
  – Determine the location from which to retrieve data and process query components
  – DBMS translation between nodes with different local DBMSs
  – Data management functions: security, concurrency, deadlock control, query optimization, failure recovery
  – Provide consistency among copies of data across the remote sites
  – Global primary key control
  – Scalability
  – Data and stored procedure replication
  – Allowing for different DBMSs and application code at different nodes

Distributed DBMS architecture

Local & Global Transaction

• Local transaction: a transaction that requires reference only to data that are stored at the site where the transaction originates.
• Global transaction: a transaction that requires reference to data at one or more non-local sites to satisfy the request.

Local Transaction Steps

1. Application makes request to distributed DBMS
2. Distributed DBMS checks distributed data repository for location of data. Finds that it is local
3. Distributed DBMS sends request to local DBMS
4. Local DBMS processes request
5. Local DBMS sends results to application

Distributed DBMS Architecture: Local Transaction

[Figure: local transaction, all data stored locally; the diagram marks steps 1 to 5]

Global Transaction Steps

1. Application makes request to distributed DBMS
2. Distributed DBMS checks distributed data repository for location of data. Finds that it is remote
3. Distributed DBMS routes request to remote site
4. Distributed DBMS at remote site translates request for its local DBMS if necessary, and sends request to local DBMS
5. Local DBMS at remote site processes request
6. Local DBMS at remote site sends results to distributed DBMS at remote site
7. Remote distributed DBMS sends results back to originating site
8. Distributed DBMS at originating site sends results to application
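The local and global step lists can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Site` class, the `handle_request` function and the sample rows are hypothetical names and data, the data directory is a plain dictionary, and request translation between different local DBMSs is not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class Site:
    """One node: a local DBMS plus a copy of the data directory (hypothetical)."""
    name: str
    tables: dict = field(default_factory=dict)     # table name -> list of rows
    directory: dict = field(default_factory=dict)  # table name -> site holding it

    def local_dbms(self, table):
        # The local DBMS processes the request against its own data.
        return list(self.tables[table])


def handle_request(origin, sites, table):
    """Return (kind, rows) for a read of `table` issued at site `origin`."""
    home = origin.directory[table]                 # step 2: look up the location
    if home == origin.name:
        return "local", origin.local_dbms(table)   # local steps 3 to 5
    remote = sites[home]                           # global step 3: route to remote site
    rows = remote.local_dbms(table)                # global steps 4 to 6 at the remote site
    return "global", rows                          # global steps 7 and 8: results travel back


directory = {"CUSTOMER": "San Mateo", "PART": "Tulsa"}
san_mateo = Site("San Mateo", {"CUSTOMER": [("C1", 250000)]}, directory)
tulsa = Site("Tulsa", {"PART": [(12345, "Bolt", "Orange")]}, directory)
sites = {s.name: s for s in (san_mateo, tulsa)}

print(handle_request(san_mateo, sites, "CUSTOMER"))
print(handle_request(san_mateo, sites, "PART"))
```

The application calls `handle_request` the same way in both cases; only the directory lookup decides whether the transaction is local or global.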

Distributed DBMS Architecture: Global Transaction

[Figure: global transaction, some data is at remote site(s); the diagram marks steps 1 to 8]

Distributed DBMS Transparency Objectives

• Location Transparency
• Replication Transparency
• Failure Transparency
• Concurrency Transparency

Location Transparency

• User/application does not need to know where data resides
• To achieve location transparency, the distributed DBMS must have access to an accurate and current data dictionary/directory that indicates the location(s) of all data in the network.
• Directories must be synchronized: each copy of the directory reflects the same information concerning the location of data.

Location Transparency

• San Mateo: list of all company customers whose total purchases exceed 100,000.

SELECT *
FROM CUSTOMER
WHERE TOTAL_SALES > 100000;

[Figure labels: San Mateo, California; Tulsa, Oklahoma]

Location Transparency

• Tulsa: list of all orange-colored parts (regardless of location)

SELECT DISTINCT PART_NUMBER, PART_NAME
FROM PART
WHERE COLOR = 'Orange'
ORDER BY PART_NUMBER;

[Figure labels: San Mateo, California; Tulsa, Oklahoma]

Replication Transparency

• Sometimes called fragmentation transparency
• User/application does not need to know about duplication

Replication Transparency

• An identical copy of the Standard Price List is maintained at all 3 nodes
• Reading the Standard Price List: the distributed DBMS consults the data dictionary and determines that this is a local transaction. The user need not be aware that the same data are stored at other sites.

[Figure labels: San Mateo, California; Tulsa, Oklahoma]

Failure Transparency

• Either all or none of the actions of a transaction are committed
• Each site has a Transaction Manager (TM)
  – Logs transactions and before and after images
  – Concurrency control scheme to ensure data integrity
• For a global transaction: the TMs at the participating sites cooperate to ensure that all update operations are synchronized. If they are not, data integrity can be lost when a failure happens

Failure Transparency

• New York: change the price of a part in the Standard Price List file
• Global transaction: every copy of the record for that part must be updated. The price list records in New York and Tulsa are successfully updated, but a transmission failure occurs and the price list record in San Mateo is not updated.
• Failure transparency: either all the actions of a transaction are committed or none of them are committed.

[Figure labels: San Mateo, California; Tulsa, Oklahoma]

Failure Transparency

• To ensure data integrity for real-time, distributed update operations, the cooperating TMs execute a commit protocol
  – An algorithm to ensure that a transaction is successfully completed or else it is aborted
• Most widely used: two-phase commit
  – An algorithm for coordinating updates in a distributed database
  – Ensures that concurrent transactions at multiple sites are processed as though they were executed in the same, serial order at all sites
  – Something like arranging a meeting between many people

Failure Transparency

• The site originating the global transaction, or an overall coordinating site, sends a request to each of the sites that will process some portion of the transaction.
• Each site processes the subtransaction, but does not immediately commit/store the result to the local database
• The result is stored in a temporary file
• Each site locks its portion of the database being updated
• Each site notifies the originating site when it has completed its subtransaction
• When all sites have responded, the originating site initiates the two-phase commit protocol

Failure Transparency: Two-Phase Commit, Prepare Phase

– Coordinator receives a commit request
– Coordinator instructs all resource managers to get ready to “go either way” on the transaction. Each resource manager writes all updates from that transaction to its own physical log
– Coordinator receives replies from all resource managers. If all are OK, it writes commit to its own log; if not, it writes rollback to its log

Two-Phase Commit, Commit Phase

– Coordinator then informs each resource manager of its decision and broadcasts a message to either commit or rollback (abort). If the message is commit, then each resource manager transfers the update from its log to its database
– A failure during the commit phase puts a transaction “in limbo.”
– A limbo transaction can be identified by a timeout or polling
– Timeout
  • No confirmation of commit for a specified time period
  • Not possible to distinguish between a busy and a failed site
– Polling
  • Expensive in terms of network load and processing time
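The two phases can be sketched as a small in-memory simulation. The `ResourceManager` class and `two_phase_commit` function are hypothetical names introduced for illustration; the sketch assumes a failed site simply votes "no" during the prepare phase, and it does not model failures during the commit phase (limbo transactions), timeouts or polling.

```python
class ResourceManager:
    """Hypothetical participant holding one copy of the price list."""

    def __init__(self, name, reachable=True):
        self.name = name
        self.reachable = reachable
        self.database = {}  # committed data
        self.log = {}       # updates written to the log, not yet in the database

    def prepare(self, updates):
        # Prepare phase: write the updates to the local log and vote.
        if not self.reachable:
            return False
        self.log = dict(updates)
        return True

    def commit(self):
        # Commit phase: transfer the logged updates to the database.
        self.database.update(self.log)
        self.log = {}

    def rollback(self):
        # Discard the logged updates; the database is left unchanged.
        self.log = {}


def two_phase_commit(managers, updates):
    votes = [rm.prepare(updates) for rm in managers]   # prepare phase
    decision = "commit" if all(votes) else "rollback"  # coordinator records its decision
    for rm in managers:                                # commit phase: broadcast the decision
        if decision == "commit":
            rm.commit()
        else:
            rm.rollback()
    return decision


# Price change from the New York example, with San Mateo unreachable.
failing = [ResourceManager("New York"), ResourceManager("Tulsa"),
           ResourceManager("San Mateo", reachable=False)]
print(two_phase_commit(failing, {12345: 127.49}))
print([rm.database for rm in failing])
```

Because one site cannot prepare, no copy of the price list changes, which is the all-or-nothing behavior that failure transparency asks for.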

Concurrency Control

• Design goal for a distributed database: although a distributed system runs many transactions, it appears that a given transaction is the only activity in the system.
• The TMs at each site must cooperate to provide concurrency control in a distributed database
• 3 basic approaches may be used:
  – Locking
  – Versioning
  – Time stamping

Concurrency Control

• Time stamping
  – A concurrency control mechanism that assigns a globally unique time stamp to each transaction
  – Alternative to locks in distributed databases
  – Ensures that transactions are processed in serial order, avoiding the use of locks.
  – Every record in the database carries the time stamp of the transaction that last updated it.
  – If a new transaction attempts to update that record and its time stamp is earlier than that carried in the record, the transaction is assigned a new time stamp and restarted.
  – A transaction cannot process a record until its time stamp is later than that carried in the record; therefore it cannot interfere with another transaction.
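A minimal sketch of the time stamping rule for updates, under stated assumptions: the `Record` class, the `update` function and the `_clock` counter are hypothetical, a single in-process counter stands in for a globally unique time stamp source, and "restart" is reduced to retrying the update with a fresh time stamp.

```python
import itertools

_clock = itertools.count(1)  # stand-in for a globally unique time stamp source


class Record:
    def __init__(self, value):
        self.value = value
        self.stamp = 0  # time stamp of the transaction that last updated the record


def update(record, new_value, txn_stamp):
    """Apply the update and return the time stamp under which it succeeded."""
    while txn_stamp <= record.stamp:
        # The transaction's stamp is not later than the record's stamp:
        # assign a new time stamp and restart the transaction.
        txn_stamp = next(_clock)
    record.value = new_value
    record.stamp = txn_stamp
    return txn_stamp


price = Record(100.00)
t1 = next(_clock)                  # older transaction
t2 = next(_clock)                  # newer transaction
update(price, 127.49, t2)          # newer transaction updates first
final = update(price, 119.00, t1)  # older transaction is restarted with a later stamp
print(final, price.value)
```

No locks are taken at any point; ordering is enforced only by comparing time stamps.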

Concurrency Control

• Advantage:
  – Locking and deadlock detection are avoided.
• Disadvantage:
  – Conservative approach: transactions are sometimes restarted even when there is no conflict with other transactions.

Query Optimization

• In a query involving a multi-site join and, possibly, a distributed database with replicated files, the distributed DBMS must decide where to access the data and how to proceed with the join.
• Three steps to develop a query processing plan:
  – Query decomposition: the query is simplified and rewritten into a structured, relational algebra form
  – Data localization: the query is fragmented so that each fragment references data at only one site
  – Global optimization
    • Order in which to execute query fragments
    • Data movement between sites
    • Where parts of the query will be executed

Query Optimization

• One technique used to make processing a distributed query more efficient: semijoin
  – Semijoin operation: only the joining attribute of the query is sent from one site to another, and only the required rows are returned (rather than all selected attributes)

SITE 1: Customer table (10,000 rows)
  Cust_No        10 bytes
  Cust_Name      50 bytes
  Zip_Code       10 bytes
  SIC             5 bytes

SITE 2: Order table (400,000 rows)
  Order_No       10 bytes
  Cust_No        10 bytes
  Order_Date      4 bytes
  Order_Amount    6 bytes

Query Optimization

• Query at Site 1: display the Cust_Name, SIC, and Order_Date for all customers in a particular Zip_Code range and with an Order_Amount above a specified limit.
• Assume that 10% of the customers fall in the Zip_Code range and 2% of the orders are above the amount limit.

Query Optimization

A semijoin would work as follows:
– A query is executed at Site 1 to create a list of the Cust_No values in the desired Zip_Code range.
  • 10% of the 10,000 customers, or 1,000 rows, satisfy the Zip_Code condition
  • 1,000 rows of 10 bytes each for the Cust_No attribute, or 10,000 (1,000 * 10) bytes, will be sent to Site 2
– A query is executed at Site 2 to create a list of the Cust_No and Order_Date values to be sent back to Site 1 to compose the final result.
  • Assume the same number of orders for each customer: 40,000 rows of the Order table will match the customer numbers sent from Site 1
  • 2% of these orders are above the amount limit: 800 rows
  • Cust_No and Order_Date: 14 bytes * 800 = 11,200 bytes
  • Total data transferred: 10,000 + 11,200 = 21,200 bytes

Query Optimization

If not using a semijoin:
– To send data from Site 1 to Site 2: need to send Cust_No, Cust_Name and SIC (10 + 50 + 5 = 65 bytes) for 10,000 * 10% = 1,000 rows of the Customer table, or 65 * 1,000 = 65,000 bytes
– To send data from Site 2 to Site 1: need to send Cust_No and Order_Date (10 + 4 = 14 bytes) for 400,000 * 2% = 8,000 rows of the Order table, or 14 * 8,000 = 112,000 bytes
– Total data transferred: 65,000 + 112,000 = 177,000 bytes
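The two transfer totals can be recomputed from the table sizes and selectivities given in the slides. The function and constant names below are hypothetical; the sketch only reproduces the slide arithmetic (uniform orders per customer, integer percentages) and is not a general cost model.

```python
CUSTOMER_ROWS = 10_000
ORDER_ROWS = 400_000
BYTES = {"Cust_No": 10, "Cust_Name": 50, "Zip_Code": 10, "SIC": 5,
         "Order_No": 10, "Order_Date": 4, "Order_Amount": 6}
ZIP_PERCENT = 10    # share of customers in the Zip_Code range
AMOUNT_PERCENT = 2  # share of orders above the amount limit


def row_bytes(*columns):
    """Width in bytes of a row made of the given columns."""
    return sum(BYTES[c] for c in columns)


def semijoin_bytes():
    customers = CUSTOMER_ROWS * ZIP_PERCENT // 100        # rows selected at Site 1
    to_site2 = customers * row_bytes("Cust_No")           # only the joining attribute is sent
    matching_orders = ORDER_ROWS * ZIP_PERCENT // 100     # orders of those customers
    returned = matching_orders * AMOUNT_PERCENT // 100    # orders above the amount limit
    to_site1 = returned * row_bytes("Cust_No", "Order_Date")
    return to_site2 + to_site1


def plain_join_bytes():
    customers = CUSTOMER_ROWS * ZIP_PERCENT // 100
    to_site2 = customers * row_bytes("Cust_No", "Cust_Name", "SIC")
    orders = ORDER_ROWS * AMOUNT_PERCENT // 100           # all orders above the amount limit
    to_site1 = orders * row_bytes("Cust_No", "Order_Date")
    return to_site2 + to_site1


print(semijoin_bytes())    # 21200
print(plain_join_bytes())  # 177000
```

With these figures the semijoin moves roughly one eighth of the data that the plain approach moves.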

Evolution of Distributed DBMS

• Distributed DBMS is still an emerging rather than established technology
• 3 stages in the evolution:
  – Remote Unit of Work
  – Distributed Unit of Work
  – Distributed Request
• “Unit of Work”: the sequence of instructions required to process a transaction.

Evolution of Distributed DBMS: Remote Unit of Work – Remote Transaction

• Allows multiple SQL statements to be originated at one location and executed as a single unit of work on a single remote DBMS
• The originating computer does not consult the data directory to locate the site containing the selected tables in the remote unit of work
• The originating application must know where the data reside and connect to the remote DBMS prior to each remote unit of work
• The Remote Unit of Work concept does not support location transparency
• Allows updates at the single remote computer
• All updates within a unit of work are tentative until a commit operation makes them permanent or a rollback undoes them

Evolution of Distributed DBMS: Remote Unit of Work – Remote Transaction

• Transaction integrity is maintained for a single remote site
• An application cannot assure transaction integrity when more than one remote location is involved.
• Example:
  – An application in San Mateo could update the Part file in Tulsa and transaction integrity would be maintained
  – The application could not simultaneously update the Part file in two or more locations
  – Remote Unit of Work does not provide failure transparency

Evolution of Distributed DBMS: Distributed Unit of Work

• Allows various statements within a unit of work to refer to multiple remote DBMS locations
• Supports some location transparency
• All tables in a single SQL statement must be at a single site

Evolution of Distributed DBMS

[Figure labels: San Mateo, California; Tulsa, Oklahoma]

Evolution of Distributed DBMS: Distributed Unit of Work

Distributed Unit of Work would not allow:
– Assembling parts information from all three sites

SELECT DISTINCT PART_NUMBER, PART_NAME
FROM PART
WHERE COLOR = 'Orange'
ORDER BY PART_NUMBER;

– A single SQL statement that attempts to update data at more than one location

UPDATE PART
SET UNIT_PRICE = 127.49
WHERE PART_NUMBER = 12345;

Evolution of Distributed DBMS: Distributed Request

• Allows a single SQL statement to refer to tables in more than one remote site
  – Overcomes a major limitation of the distributed unit of work
• Supports true location transparency
• May not support replication transparency or failure transparency
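The difference between the three stages can be summarized as a rule about which sites each SQL statement touches. The `required_stage` function below is a hypothetical helper, not part of any DBMS; it assumes each statement is described only by the set of site names it references and that every named site is remote from the application.

```python
def required_stage(statements):
    """Return the least capable stage able to run the unit of work.

    `statements` is a list of sets; each set names the sites that
    one SQL statement in the unit of work refers to.
    """
    if any(len(sites) > 1 for sites in statements):
        # A single statement spans several sites.
        return "distributed request"
    all_sites = set().union(*statements)
    if len(all_sites) > 1:
        # Different statements go to different sites, one site per statement.
        return "distributed unit of work"
    return "remote unit of work"


print(required_stage([{"Tulsa"}, {"Tulsa"}]))
print(required_stage([{"Tulsa"}, {"New York"}]))
print(required_stage([{"San Mateo", "Tulsa", "New York"}]))
```

Under this reading, the orange-parts query over all three sites falls into the last case, which is why a distributed unit of work cannot run it.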

Information in these slides was taken from Modern Database Management by Jeffrey A. Hoffer, Mary B. Prescott, and Heikki Topi.