Distributed Databases Based on material provided by: Jim Gray (Microsoft), Heinz Stockinger (CERN),...

132
Distributed Distributed Databases Databases Based on material provided by: Based on material provided by: Jim Gray (Microsoft), Heinz Stockinger (CERN), Jim Gray (Microsoft), Heinz Stockinger (CERN), Raghu Ramakrishnan (Wisconsin) Raghu Ramakrishnan (Wisconsin) Dr. Julian Bunn Dr. Julian Bunn Center for Advanced Computing Research Center for Advanced Computing Research Caltech Caltech

Transcript of Distributed Databases Based on material provided by: Jim Gray (Microsoft), Heinz Stockinger (CERN),...

Distributed Distributed DatabasesDatabases

Based on material provided by:Based on material provided by:Jim Gray (Microsoft), Heinz Stockinger (CERN), Raghu Jim Gray (Microsoft), Heinz Stockinger (CERN), Raghu

Ramakrishnan (Wisconsin)Ramakrishnan (Wisconsin)

Dr. Julian BunnDr. Julian BunnCenter for Advanced Computing ResearchCenter for Advanced Computing Research

CaltechCaltech

J.J.Bunn, Distributed Databases, 2001 2

OutlineOutline

Introduction to Introduction to Database SystemsDatabase Systems

Distributed Databases Distributed Databases Distributed SystemsDistributed Systems Distributed Databases Distributed Databases

for Physicsfor Physics

Part IPart I

Introduction to Introduction to Database SystemsDatabase Systems

..

Julian Bunn

California Institute of Technology

J.J.Bunn, Distributed Databases, 2001 4

What is a Database?What is a Database? A large, integrated collection A large, integrated collection

of dataof data Entities (things) and Entities (things) and

Relationships (connections)Relationships (connections) Objects and Objects and

Associations/ReferencesAssociations/References A Database Management A Database Management

System (DBMS) is a software System (DBMS) is a software package designed to store package designed to store and manage Databasesand manage Databases

““Traditional” (ER) Databases Traditional” (ER) Databases and “Object” Databasesand “Object” Databases

J.J.Bunn, Distributed Databases, 2001 5

Why Use a DBMS?Why Use a DBMS?

Data IndependenceData Independence Efficient AccessEfficient Access Reduced Application Development Reduced Application Development

TimeTime Data IntegrityData Integrity Data SecurityData Security Data Analysis ToolsData Analysis Tools Uniform Data AdministrationUniform Data Administration Concurrent AccessConcurrent Access Automatic ParallelismAutomatic Parallelism Recovery from crashesRecovery from crashes

J.J.Bunn, Distributed Databases, 2001 6

Cutting Edge Cutting Edge DatabasesDatabases

Scientific ApplicationsScientific Applications Digital Libraries, Interactive Digital Libraries, Interactive

Video, Human Genome Video, Human Genome project, Particle Physics project, Particle Physics Experiments, National Digital Experiments, National Digital Observatories, Earth ImagesObservatories, Earth Images

Commercial Web Systems Commercial Web Systems Data Mining / Data WarehouseData Mining / Data Warehouse Simple data but very high Simple data but very high

transaction rate and transaction rate and enormous volume (e.g. click enormous volume (e.g. click through)through)

J.J.Bunn, Distributed Databases, 2001 7

Data ModelsData Models Data Model: A Collection of Data Model: A Collection of

Concepts for Describing DataConcepts for Describing Data Schema: A Set of Descriptions Schema: A Set of Descriptions

of a Particular Collection of of a Particular Collection of Data, in the context of the Data, in the context of the Data ModelData Model

Relational Model:Relational Model: E.g. A Lecture E.g. A Lecture is attended byis attended by

zero or more Studentszero or more Students Object Model:Object Model:

E.g. A Database Lecture E.g. A Database Lecture inherits inherits attributesattributes from a general from a general LectureLecture

J.J.Bunn, Distributed Databases, 2001 8

Data IndependenceData Independence

Applications insulated from Applications insulated from how data in the Database is how data in the Database is structured and storedstructured and stored Logical Data Independence: Logical Data Independence:

Protection from changes in the Protection from changes in the logical structure of the datalogical structure of the data

Physical Data Independence: Physical Data Independence: Protection from changes in the Protection from changes in the physical structure of the dataphysical structure of the data

J.J.Bunn, Distributed Databases, 2001 9

Concurrency ControlConcurrency Control

Good DBMS performance Good DBMS performance relies on allowing concurrent relies on allowing concurrent access to the data by more access to the data by more than one client than one client

DBMS ensures that DBMS ensures that interleaved actions coming interleaved actions coming from different clients do not from different clients do not cause inconsistency in the cause inconsistency in the data data E.g. two simultaneous bookings E.g. two simultaneous bookings

for the same airplane seatfor the same airplane seat Each client is unaware of how Each client is unaware of how

many other clients are using many other clients are using the DBMSthe DBMS

J.J.Bunn, Distributed Databases, 2001 10

TransactionsTransactions

A Transaction is an atomic A Transaction is an atomic sequence of actions in the sequence of actions in the Database (reads and writes)Database (reads and writes)

Each Transaction has to be Each Transaction has to be executed executed completelycompletely, and , and must leave the Database in a must leave the Database in a consistent stateconsistent state The definition of “consistent” is ultimately the client’s The definition of “consistent” is ultimately the client’s

responsibility!responsibility!

If the Transaction fails or If the Transaction fails or aborts midway, then the aborts midway, then the Database is “rolled back” to Database is “rolled back” to its initial consistent state its initial consistent state (when the Transaction (when the Transaction began).began).

J.J.Bunn, Distributed Databases, 2001 11

What Is A What Is A Transaction?Transaction?

Programmer’s view: Programmer’s view: Bracket a collection of actionsBracket a collection of actions

A A simplesimple failure model failure model Only two outcomes:Only two outcomes:

Begin()Begin() actionaction actionaction actionaction actionactionCommit()Commit()

Success!Success!

Begin()Begin()action action actionactionactionactionRollback()Rollback()

Begin()Begin()action action actionactionactionaction

Rollback()Rollback()

Failure!Failure!

Fail !Fail !Fail !Fail !

J.J.Bunn, Distributed Databases, 2001 12

ACIDACID

AtomicAtomic: all or nothing: all or nothing ConsistentConsistent: state : state

transformationtransformation IsolatedIsolated: no concurrency : no concurrency

anomaliesanomalies DurableDurable: committed : committed

transaction effects persisttransaction effects persist

J.J.Bunn, Distributed Databases, 2001 13

Why Bother: Why Bother: Atomicity?Atomicity?

RPC semantics:RPC semantics: At most once: try one time At most once: try one time

At least once: keep trying At least once: keep trying ’till acknowledged’till acknowledged

Exactly once: keep trying Exactly once: keep trying ’till acknowledged and ’till acknowledged and serverserverdiscards duplicate requestsdiscards duplicate requests

???

J.J.Bunn, Distributed Databases, 2001 14

Why Bother: Why Bother: Atomicity?Atomicity? Example: insert record in fileExample: insert record in file

At most onceAt most once: time-out means “maybe”: time-out means “maybe” At least onceAt least once: retry may get “duplicate” : retry may get “duplicate”

error error or retry may do or retry may do second insertsecond insert

Exactly onceExactly once: you do not have to worry: you do not have to worry What if operation involvesWhat if operation involves

Insert several records? Insert several records? Send several messages?Send several messages?

Want ALL or NOTHING for group of Want ALL or NOTHING for group of actionsactions

J.J.Bunn, Distributed Databases, 2001 15

Why Bother: Why Bother: ConsistencyConsistency

Begin-Commit brackets a set of Begin-Commit brackets a set of operationsoperations

You can violate consistency inside You can violate consistency inside bracketsbrackets Debit but not credit (destroys money)Debit but not credit (destroys money) Delete old file before create new file in a Delete old file before create new file in a

copycopy Print document before delete from spool Print document before delete from spool

queuequeue Begin and commit are points of Begin and commit are points of

consistencyconsistencyState transformationsState transformations

new state under constructionnew state under construction

Be

gin

Be

gin

Co

mm

itC

om

mit

J.J.Bunn, Distributed Databases, 2001 16

Why Bother: Why Bother: Isolation Isolation

Running programs concurrentlyRunning programs concurrentlyon same data can createon same data can createconcurrency anomaliesconcurrency anomalies The shared checking account The shared checking account

exampleexample

Programming is hard enough Programming is hard enough without having to worry about without having to worry about concurrencyconcurrency

Begin()Begin() read BALread BAL add 10add 10 write BALwrite BALCommit()Commit()

Bal = 100Bal = 100

Bal = 70Bal = 70

Bal = 110Bal = 110

Bal = 100Bal = 100

Begin()Begin() read BALread BAL Subtract 30Subtract 30 write BALwrite BALCommit()Commit()

J.J.Bunn, Distributed Databases, 2001 17

IsolationIsolation It is as though programs run one at a It is as though programs run one at a

timetime No concurrency anomaliesNo concurrency anomalies

System automatically protects System automatically protects applicationsapplications Locking (DB2, Informix, MicrosoftLocking (DB2, Informix, Microsoft® ®

SQL ServerSQL Server™™, Sybase…), Sybase…) Versioned databases (Oracle, Interbase…)Versioned databases (Oracle, Interbase…)

Begin()Begin() read BALread BAL add 10add 10 write BALwrite BALCommit()Commit()

Bal = 100Bal = 100

Bal = 110Bal = 110

Bal = 80Bal = 80

Bal = 110Bal = 110

Begin()Begin() read BALread BAL Subtract 30Subtract 30 write BALwrite BALCommit()Commit()

J.J.Bunn, Distributed Databases, 2001 18

Why Bother: Why Bother: DurabilityDurability Once a transaction commits,Once a transaction commits,

want effects to survive failureswant effects to survive failures Fault tolerance:Fault tolerance:

old master-new master won’t work: old master-new master won’t work: Can’t do daily dumps: Can’t do daily dumps:

would lose recent workwould lose recent work Want “continuous” dumpsWant “continuous” dumps

Redo “lost” transactions Redo “lost” transactions in case of failurein case of failure

Resend unacknowledged messages Resend unacknowledged messages

J.J.Bunn, Distributed Databases, 2001 19

Why ACID For Why ACID For Client/Server And Client/Server And

DistributedDistributed ACID is important for ACID is important for centralized systemscentralized systems

Failures in centralized systems Failures in centralized systems are simplerare simpler

In distributed systems:In distributed systems: More and more-independent More and more-independent

failuresfailures ACID is harder to implementACID is harder to implement

That makes it even MORE That makes it even MORE IMPORTANTIMPORTANT Simple failure modelSimple failure model Simple repair modelSimple repair model

J.J.Bunn, Distributed Databases, 2001 20

ACID GeneralizationsACID Generalizations Taxonomy of actions Taxonomy of actions

Unprotected: not undone or redoneUnprotected: not undone or redone Temp filesTemp files

Transactional: can be undone before Transactional: can be undone before commitcommit Database and message operationsDatabase and message operations

Real: cannot be undoneReal: cannot be undone Drill a hole in a piece of metal,Drill a hole in a piece of metal,

print a checkprint a check Nested transactions: Nested transactions:

subtransactionssubtransactions Work flow: long-lived transactionsWork flow: long-lived transactions

J.J.Bunn, Distributed Databases, 2001 21

Scheduling Scheduling TransactionsTransactions

The DBMS has to take care of The DBMS has to take care of a set of Transactions that a set of Transactions that arrive concurrentlyarrive concurrently

It converts the concurrent It converts the concurrent Transaction set into a new set Transaction set into a new set that can be executed that can be executed sequentiallysequentially

It ensures that, before It ensures that, before reading or writing an Object, reading or writing an Object, each Transaction waits for a each Transaction waits for a LockLock on the Object on the Object

Each Transaction releases all Each Transaction releases all its Locks when finishedits Locks when finished (Strict Two-Phase-Locking Protocol)(Strict Two-Phase-Locking Protocol)

J.J.Bunn, Distributed Databases, 2001 22

Concurrency ControlConcurrency ControlLockingLocking

How to automatically preventHow to automatically preventconcurrency bugs?concurrency bugs?

Serialization theorem:Serialization theorem: If you lock all you touch and hold to If you lock all you touch and hold to

commit: commit: no bugsno bugs

If you do not follow these rules, you may If you do not follow these rules, you may see bugssee bugs

Automatic Locking:Automatic Locking: Set automatically (well-formed)Set automatically (well-formed) Released at commit/rollback (two-phase Released at commit/rollback (two-phase

locking)locking)

Greater concurrency for locks:Greater concurrency for locks: Granularity: objects or containers or serverGranularity: objects or containers or server Mode: shared or exclusive or…Mode: shared or exclusive or…

J.J.Bunn, Distributed Databases, 2001 23

Reduced Isolation Reduced Isolation LevelsLevels

It is possible to lock less and risk fuzzy It is possible to lock less and risk fuzzy datadata

Example: want statistical summary of DB Example: want statistical summary of DB But do not want to lock whole databaseBut do not want to lock whole database

Reduced levels:Reduced levels: Repeatable Read: may see fuzzy Repeatable Read: may see fuzzy

inserts/deleteinserts/delete But will serialize all updatesBut will serialize all updates

Read Committed: see only committed dataRead Committed: see only committed data Read Uncommitted: may see uncommitted Read Uncommitted: may see uncommitted

updatesupdates

J.J.Bunn, Distributed Databases, 2001 24

Ensuring AtomicityEnsuring Atomicity

The DBMS ensures the The DBMS ensures the atomicityatomicity of a Transaction, even if the of a Transaction, even if the system crashes in the middle of itsystem crashes in the middle of it

In other words In other words allall of the of the Transaction is applied to the Transaction is applied to the Database, or Database, or nonenone of it is of it is

How?How? Keep a Keep a log/historylog/history of all actions of all actions

carried out on the Databasecarried out on the Database Before making a change, put the log Before making a change, put the log

for the change somewhere “safe”for the change somewhere “safe” After a crash, effects of partially After a crash, effects of partially

executed transactions are undone executed transactions are undone using the logusing the log

J.J.Bunn, Distributed Databases, 2001 25

Each action generates a log Each action generates a log recordrecord

Has an UNDO action Has an UNDO action

Has a REDO actionHas a REDO action

DO/UNDO/REDODO/UNDO/REDO

New stateNew stateOld stateOld state

DODO

Log Log

New stateNew state Old stateOld state

UNDOUNDO

Log Log

New stateNew stateOld stateOld state

REDOREDO

Log Log

J.J.Bunn, Distributed Databases, 2001 26

What Does A Log What Does A Log Record Look Like?Record Look Like?

Log record has Log record has Header (transaction ID, timestamp… )Header (transaction ID, timestamp… ) Item IDItem ID Old valueOld value New valueNew value

For messages: just message textFor messages: just message textand sequence #and sequence #

For records: old and new valueFor records: old and new valueon updateon update

Keep records small Keep records small

? Log ? ? Log ?

J.J.Bunn, Distributed Databases, 2001 27

Transaction Is A Transaction Is A Sequence Of ActionsSequence Of Actions Each action changes stateEach action changes state

Changes database Changes database Sends messagesSends messages Operates a display/printer/drill Operates a display/printer/drill

presspress Leaves a log trailLeaves a log trail

New stateNew stateOld stateOld state

DODO

Log Log

Log Log

New stateNew stateOld stateOld state

DODO

DODO

Log Log

New stateNew stateOld stateOld state

Old stateOld state

DODO

Log Log

New stateNew state

J.J.Bunn, Distributed Databases, 2001 28

Transaction UNDO Is EasyTransaction UNDO Is Easy

Read log backwardsRead log backwards UNDO one step at a timeUNDO one step at a time Can go half-way back toCan go half-way back to

get nested transactionsget nested transactions

New stateNew stateOld stateOld state

UNDOUNDO

Log Log

Log Log

Old stateOld state

UNDOUNDO

New stateNew state

UNDOUNDO

Log Log

Old stateOld state New stateNew state

Old stateOld state

UNDOUNDO

Log Log

New stateNew state

New stateNew stateOld stateOld state

UNDOUNDO

Log Log

Log Log

Old stateOld state

UNDOUNDO

New stateNew state

UNDOUNDO

Log Log

Old stateOld state New stateNew state

New stateNew stateOld stateOld state

UNDOUNDO

Log Log

Log Log

Old stateOld state

UNDOUNDO

New stateNew state

New stateNew stateOld stateOld state

UNDOUNDO

Log Log

J.J.Bunn, Distributed Databases, 2001 29

Durability: Protecting Durability: Protecting The LogThe Log

When transaction commitsWhen transaction commits Put its log in a durable place Put its log in a durable place

(duplexed disk)(duplexed disk) Need log to redo transaction Need log to redo transaction

in case of failurein case of failure System failure: lostSystem failure: lost

in-memory updatesin-memory updates Media failure (lost disk)Media failure (lost disk)

This makes transaction durableThis makes transaction durable Log is sequential fileLog is sequential file

Converts random IO to single Converts random IO to single sequential IOsequential IO

See NTFS or newer UNIX file systemsSee NTFS or newer UNIX file systems

Log Log Log Log Log Log Log Log Log Log Log Log Log Log Log Log

Writ

eW

rite

J.J.Bunn, Distributed Databases, 2001 30

Recovery After System Recovery After System FailureFailure During normal processing, During normal processing,

write checkpoints on non-volatile write checkpoints on non-volatile storagestorage

When recovering from a system When recovering from a system failure…failure… return to the checkpoint statereturn to the checkpoint state Reapply log of all committed transactionsReapply log of all committed transactions Force-at-commit insures log will survive Force-at-commit insures log will survive

restartrestart Then UNDO all uncommitted Then UNDO all uncommitted

transactionstransactions

New stateNew stateOld stateOld state

REDOREDO

Log Log

New stateNew stateOld stateOld state

REDOREDO

Log Log

REDOREDO

New stateNew stateOld stateOld state

Log Log Log Log

Old stateOld state

REDOREDO

New stateNew state

J.J.Bunn, Distributed Databases, 2001 31

IdempotenceIdempotenceDealing with failureDealing with failure

What if fail during restart? What if fail during restart? REDO many timesREDO many times

What if new state not around at What if new state not around at restart?restart? UNDO something not doneUNDO something not done

Old stateOld state

REDOREDO

New stateNew state

Log Log Log Log

REDOREDO

New stateNew state New stateNew state

UNDOUNDO

Old stateOld state

Log Log Log Log

UNDOUNDO

Old stateOld state

J.J.Bunn, Distributed Databases, 2001 32

IdempotenceIdempotenceDealing with failureDealing with failure

Solution: make F(F(x))=F(x) Solution: make F(F(x))=F(x) (idempotence)(idempotence) Discard duplicates Discard duplicates

Message sequence numbers Message sequence numbers to discard duplicatesto discard duplicates

Use sequence numbers on pages to Use sequence numbers on pages to detect statedetect state

(Or) make operations idempotent (Or) make operations idempotent Move to position x, write value V to byte Move to position x, write value V to byte

B…B…Old stateOld state

REDOREDO

New stateNew state

Log Log Log Log

REDOREDO

New stateNew state New stateNew state

UNDOUNDO

Old stateOld state

Log Log Log Log

UNDOUNDO

Old stateOld state

J.J.Bunn, Distributed Databases, 2001 33

The Log: More DetailThe Log: More Detail Actions recorded in the LogActions recorded in the Log

Transaction writes an ObjectTransaction writes an Object Store in the Log: Transaction Store in the Log: Transaction

Identifier, Object Identifier, Identifier, Object Identifier, new value and old valuenew value and old value

This must happen before This must happen before actually writing the Object!actually writing the Object!

Transaction commits or abortsTransaction commits or aborts Duplicate Log on “stable” Duplicate Log on “stable”

storagestorage Log records chained by Log records chained by

Transaction Identifier: easy to Transaction Identifier: easy to undo a Transactionundo a Transaction

J.J.Bunn, Distributed Databases, 2001 34

Structure of a Structure of a DatabaseDatabase

Typical DBMS has a layered Typical DBMS has a layered architecturearchitecture

Query Optimisation & Execution

Relational Operators

Files and Access Methods

Buffer Management

Disk Space Management

Disk

J.J.Bunn, Distributed Databases, 2001 35

Database Database AdministrationAdministration

Design Logical/Physical Design Logical/Physical SchemaSchema

Handle Security and Handle Security and AuthenticationAuthentication

Ensure Data Availability, Ensure Data Availability, Crash RecoveryCrash Recovery

Tune Database as needs and Tune Database as needs and workload evolvesworkload evolves

J.J.Bunn, Distributed Databases, 2001 36

SummarySummary Databases are used to Databases are used to

maintain and query large maintain and query large datasetsdatasets

DBMS benefits include DBMS benefits include recovery from crashes, recovery from crashes, concurrent access, data concurrent access, data integrity and security, quick integrity and security, quick application developmentapplication development

Abstraction ensures Abstraction ensures independenceindependence

ACIDACID Increasingly Important (and Increasingly Important (and

Big) in Scientific and Big) in Scientific and Commercial EnterprisesCommercial Enterprises

Part 2Part 2

Distributed Distributed DatabasesDatabases

..

Julian Bunn

California Institute of Technology

J.J.Bunn, Distributed Databases, 2001 38

Distributed Distributed DatabasesDatabases

Data are stored at several Data are stored at several locationslocations Each managed by a DBMS that Each managed by a DBMS that

can run autonomouslycan run autonomously Ideally, location of data is Ideally, location of data is

unknown to clientunknown to client Distributed Data IndependenceDistributed Data Independence

Distributed Transactions are Distributed Transactions are supportedsupported Clients can write Transactions Clients can write Transactions

regardless of where the regardless of where the affected data are locatedaffected data are located

Distributed Transaction Distributed Transaction AtomicityAtomicity

Hard, and in some cases Hard, and in some cases undesirableundesirable E.g. need to avoid overhead of ensuring location E.g. need to avoid overhead of ensuring location

transparencytransparency

J.J.Bunn, Distributed Databases, 2001 39

Types of Distributed Types of Distributed DatabaseDatabase

Homogeneous: Every site runs Homogeneous: Every site runs the same type of DBMSthe same type of DBMS

Heterogeneous: Different Heterogeneous: Different sites run different DBMS sites run different DBMS (maybe even RDBMS and (maybe even RDBMS and ODBMS)ODBMS)

J.J.Bunn, Distributed Databases, 2001 40

Distributed DBMS Distributed DBMS ArchitecturesArchitectures

Client-ServersClient-Servers Client sends query to each Client sends query to each

database server in the database server in the distributed systemdistributed system

Client caches and accumulates Client caches and accumulates responsesresponses

Collaborating ServerCollaborating Server Client sends query to “nearest” Client sends query to “nearest”

ServerServer Server executes query locallyServer executes query locally Server sends query to other Server sends query to other

Servers, as requiredServers, as required Server sends response to ClientServer sends response to Client

J.J.Bunn, Distributed Databases, 2001 41

Storing the Storing the Distributed DataDistributed Data

In fragments at each siteIn fragments at each site Split the data upSplit the data up Each site stores one or more Each site stores one or more

fragmentsfragments In complete replicas at each In complete replicas at each

sitesite Each site stores a replica of the Each site stores a replica of the

complete datacomplete data A mixture of fragments and A mixture of fragments and

replicasreplicas Each site stores some replicas Each site stores some replicas

and/or fragments or the dataand/or fragments or the data

J.J.Bunn, Distributed Databases, 2001 42

Partitioned Data Partitioned Data Break file into disjoint Break file into disjoint

groupsgroups Exploit data access Exploit data access

localitylocality Put data near consumerPut data near consumer Less network trafficLess network traffic Better response timeBetter response time Better availabilityBetter availability Owner controls data Owner controls data

autonomy autonomy Spread LoadSpread Load

data or traffic may exceed data or traffic may exceed single storesingle store

OrdersN.A. S.A. Europe Asia

J.J.Bunn, Distributed Databases, 2001 43

How to Partition Data?How to Partition Data? How to PartitionHow to Partition

by attribute or by attribute or random or random or by source or by source or by useby use

Problem: to find it must Problem: to find it must havehave Directory (replicated) orDirectory (replicated) or AlgorithmAlgorithm

Encourages Encourages attribute-based attribute-based partitioningpartitioning

N.A. S.A. Europe Asia

J.J.Bunn, Distributed Databases, 2001 44

Replicated DataReplicated DataPlace fragment at many Place fragment at many

sitessites Pros:Pros:+ Improves availabilityImproves availability+ Disconnected (mobile) Disconnected (mobile)

operationoperation+ Distributes loadDistributes load+ Reads are cheaperReads are cheaper

Cons:Cons: N times more updates N times more updates N times more storageN times more storage

Placement strategies:Placement strategies: Dynamic: cache on demandDynamic: cache on demand Static: place specific Static: place specific

Catalog

J.J.Bunn, Distributed Databases, 2001 45

ID #Particles Energy Event# Run# Date Time… … … … … … …

10001 3 121.5 111 13120 3/1406 13:30:55.000110002 3 202.2 112 13120 3/1406 13:30:55.000110003 4 99.3 113 13120 3/1406 13:30:55.000110004 5 231.9 120 13120 3/1406 13:30:55.000110005 6 287.1 125 13120 3/1406 13:30:55.000110006 6 107.7 126 13120 3/1406 13:30:55.000110007 6 98.9 127 13120 3/1406 13:30:55.000110008 9 100.1 128 13120 3/1406 13:30:55.0001

… … … … … … …

FragmentationFragmentation

Horizontal – Horizontal – “Row-wise”“Row-wise” E.g. rows of the table make up one E.g. rows of the table make up one

fragmentfragment Vertical – Vertical – “Column-Wise”“Column-Wise”

E.g. columns of the table make up E.g. columns of the table make up one fragmentone fragment

J.J.Bunn, Distributed Databases, 2001 46

ReplicationReplication

Make synchronised or Make synchronised or unsynchronised copies of data unsynchronised copies of data at serversat servers Synchronised: data are always Synchronised: data are always

current, updates are constantly current, updates are constantly shipped between replicasshipped between replicas

Unsynchronised: good for read-Unsynchronised: good for read-only dataonly data

Increases availability of dataIncreases availability of data Makes query execution fasterMakes query execution faster

J.J.Bunn, Distributed Databases, 2001 47

Distributed Catalogue Distributed Catalogue ManagementManagement

Need to know where data are Need to know where data are distributed in the systemdistributed in the system

At each site, need to name each At each site, need to name each replica of each data fragmentreplica of each data fragment ““Local name”, “Birth Place”Local name”, “Birth Place”

Site Catalogue: Site Catalogue: Describes all fragments and replicas Describes all fragments and replicas

at the siteat the site Keeps track of replicas of relations at Keeps track of replicas of relations at

the sitethe site To find a relation, look up Birth site’s To find a relation, look up Birth site’s

catalogue: “Birth Place” site never catalogue: “Birth Place” site never changes, even if relation is movedchanges, even if relation is moved

J.J.Bunn, Distributed Databases, 2001 48

Replication CatalogueReplication Catalogue Which objects are being Which objects are being

replicatedreplicated Where objects are being Where objects are being

replicated toreplicated to How updates are propagatedHow updates are propagated

Catalogue is a set of tables Catalogue is a set of tables that can be backed up, and that can be backed up, and recovered (as any other table)recovered (as any other table)

These tables are themselves These tables are themselves replicated to each replication replicated to each replication sitesite No single point of failure in the No single point of failure in the

Distributed DatabaseDistributed Database

J.J.Bunn, Distributed Databases, 2001 49

ConfigurationsConfigurations Single Master with multiple read-only Single Master with multiple read-only

snapshot sitessnapshot sites Multiple MastersMultiple Masters Single Master with multiple updatable Single Master with multiple updatable

snapshot sitessnapshot sites Master at record-level granularityMaster at record-level granularity Hybrids of the aboveHybrids of the above

J.J.Bunn, Distributed Databases, 2001 50

Distributed QueriesDistributed Queries

SELECT AVG(E.Energy) FROM SELECT AVG(E.Energy) FROM Events E WHERE E.particles > 3 Events E WHERE E.particles > 3 AND E.particles < 7AND E.particles < 7

Replicated: Copies of the complete Replicated: Copies of the complete Event table at Geneva and at Event table at Geneva and at IslamabadIslamabad

Choice of where to execute queryChoice of where to execute query Based on local costs, network costs, Based on local costs, network costs,

remote capacity, etc.remote capacity, etc.

ID #Particles Energy Event# Run# Date Time… … … … … … …

10001 3 121.5 111 13120 3/1406 13:30:55.000110002 3 202.2 112 13120 3/1406 13:30:55.000110003 4 99.3 113 13120 3/1406 13:30:55.000110004 5 231.9 120 13120 3/1406 13:30:55.000110005 6 287.1 125 13120 3/1406 13:30:55.000110006 6 107.7 126 13120 3/1406 13:30:55.000110007 6 98.9 127 13120 3/1406 13:30:55.000110008 9 100.1 128 13120 3/1406 13:30:55.0001

… … … … … … …

ID #Particles Energy Event# Run# Date Time… … … … … … …

10001 3 121.5 111 13120 3/1406 13:30:55.000110002 3 202.2 112 13120 3/1406 13:30:55.000110003 4 99.3 113 13120 3/1406 13:30:55.000110004 5 231.9 120 13120 3/1406 13:30:55.000110005 6 287.1 125 13120 3/1406 13:30:55.000110006 6 107.7 126 13120 3/1406 13:30:55.000110007 6 98.9 127 13120 3/1406 13:30:55.000110008 9 100.1 128 13120 3/1406 13:30:55.0001

… … … … … … …

IslamabadID #Particles Energy Event# Run# Date Time… … … … … … …

10001 3 121.5 111 13120 3/1406 13:30:55.000110002 3 202.2 112 13120 3/1406 13:30:55.000110003 4 99.3 113 13120 3/1406 13:30:55.000110004 5 231.9 120 13120 3/1406 13:30:55.000110005 6 287.1 125 13120 3/1406 13:30:55.000110006 6 107.7 126 13120 3/1406 13:30:55.000110007 6 98.9 127 13120 3/1406 13:30:55.000110008 9 100.1 128 13120 3/1406 13:30:55.0001

… … … … … … …

Geneva

J.J.Bunn, Distributed Databases, 2001 51

Distributed Queries Distributed Queries (contd.)(contd.) SELECT AVG(E.Energy) FROM SELECT AVG(E.Energy) FROM

Events E WHERE E.particles > Events E WHERE E.particles > 3 AND 3 AND E.particles < 7E.particles < 7

Row-wise fragmented:Row-wise fragmented:

Particles < 5 at Geneva, Particles < 5 at Geneva, Particles > 4 at IslamabadParticles > 4 at Islamabad Need to compute SUM(E.Energy) and Need to compute SUM(E.Energy) and

COUNT(E.Energy) at COUNT(E.Energy) at bothboth sites sites If WHERE clause had E.particles > 4 If WHERE clause had E.particles > 4

then only need to compute at then only need to compute at IslamabadIslamabad

ID #Particles Energy Event# Run# Date Time… … … … … … …

10001 3 121.5 111 13120 3/1406 13:30:55.000110002 3 202.2 112 13120 3/1406 13:30:55.000110003 4 99.3 113 13120 3/1406 13:30:55.000110004 5 231.9 120 13120 3/1406 13:30:55.000110005 6 287.1 125 13120 3/1406 13:30:55.000110006 6 107.7 126 13120 3/1406 13:30:55.000110007 6 98.9 127 13120 3/1406 13:30:55.000110008 9 100.1 128 13120 3/1406 13:30:55.0001

… … … … … … …

J.J.Bunn, Distributed Databases, 2001 52

Distributed Queries Distributed Queries (contd.)(contd.)

SELECT AVG(E.Energy) FROM Events E SELECT AVG(E.Energy) FROM Events E WHERE E.particles > 3 AND E.particles WHERE E.particles > 3 AND E.particles < 7< 7

Column-wise Fragmented: Column-wise Fragmented:

ID, Energy and Event# Columns at ID, Energy and Event# Columns at Geneva, ID and remaining Columns at Geneva, ID and remaining Columns at Islamabad:Islamabad: Need to join on IDNeed to join on ID Select IDs satisfying Particles constraint at Select IDs satisfying Particles constraint at

IslamabadIslamabad SUM(Energy) and Count(Energy) for those SUM(Energy) and Count(Energy) for those

IDs at GenevaIDs at Geneva

ID #Particles Energy Event# Run# Date Time… … … … … … …

10001 3 121.5 111 13120 3/1406 13:30:55.000110002 3 202.2 112 13120 3/1406 13:30:55.000110003 4 99.3 113 13120 3/1406 13:30:55.000110004 5 231.9 120 13120 3/1406 13:30:55.000110005 6 287.1 125 13120 3/1406 13:30:55.000110006 6 107.7 126 13120 3/1406 13:30:55.000110007 6 98.9 127 13120 3/1406 13:30:55.000110008 9 100.1 128 13120 3/1406 13:30:55.0001

… … … … … … …

J.J.Bunn, Distributed Databases, 2001 53

JoinsJoins Joins are used to compare or Joins are used to compare or

combine relations (rows) from combine relations (rows) from two or more tables, when the two or more tables, when the relations share a common relations share a common attribute valueattribute value

Simple approach: for every Simple approach: for every relation in the first table “S”, relation in the first table “S”, loop over all relations in the loop over all relations in the other table “R”, and see if the other table “R”, and see if the attributes matchattributes match

N-way joins are evaluated as N-way joins are evaluated as a series of 2-way joinsa series of 2-way joins

Join Algorithms are a Join Algorithms are a continuing topic of intense continuing topic of intense research in Computer Scienceresearch in Computer Science

J.J.Bunn, Distributed Databases, 2001 54

Join AlgorithmsJoin Algorithms Need to run in memory for best Need to run in memory for best

performanceperformance Nested-Loops: efficient only if “R” Nested-Loops: efficient only if “R”

very small (can be stored in very small (can be stored in memory)memory)

Hash-Join: Build an in-memory Hash-Join: Build an in-memory hash table of “R”, then loop over hash table of “R”, then loop over “S” hashing to check for match“S” hashing to check for match

Hybrid Hash-Join: When “R” hash Hybrid Hash-Join: When “R” hash is too big to fit in memory, split is too big to fit in memory, split join into partitionsjoin into partitions

Merge-Join: Used when “R” and Merge-Join: Used when “R” and “S” are already sorted on the join “S” are already sorted on the join attribute, simply merging them in attribute, simply merging them in parallelparallel

Special versions of Join Algorithms Special versions of Join Algorithms needed for Distributed Database needed for Distributed Database query execution!query execution!

J.J.Bunn, Distributed Databases, 2001 55

Distributed Query Distributed Query OptimisationOptimisation

Cost-based:Cost-based: Consider all “plans”Consider all “plans” Pick cheapest: include Pick cheapest: include

communication costscommunication costs Need to use distributed join Need to use distributed join

methodsmethods Site that receives query Site that receives query

constructs Global Plan, hints constructs Global Plan, hints for local plansfor local plans Local plans may be changed at Local plans may be changed at

each siteeach site

J.J.Bunn, Distributed Databases, 2001 56

ReplicationReplication

Synchronous: All data that Synchronous: All data that have been changed must be have been changed must be propagated before the propagated before the Transaction commitsTransaction commits

Asynchronous: Changed data Asynchronous: Changed data are periodically sentare periodically sent Replicas may go out of sync.Replicas may go out of sync. Clients must be aware of thisClients must be aware of this

J.J.Bunn, Distributed Databases, 2001 57

Synchronous Synchronous Replication CostsReplication Costs

Before an update Transaction Before an update Transaction can commit, it obtains locks can commit, it obtains locks on all modified copieson all modified copies Sends lock requests to remote Sends lock requests to remote

sites, holds lockssites, holds locks If links or remote sites fail, If links or remote sites fail,

Transaction cannot commit until Transaction cannot commit until links/sites restoredlinks/sites restored

Even without failure, Even without failure, commit commit protocolprotocol is complex, and is complex, and involves many messagesinvolves many messages

J.J.Bunn, Distributed Databases, 2001 58

Asynchronous Asynchronous ReplicationReplication

Allows Transaction to commit Allows Transaction to commit before all copies have been before all copies have been modifiedmodified

Two methods:Two methods: Primary SitePrimary Site

Peer-to-PeerPeer-to-Peer

J.J.Bunn, Distributed Databases, 2001 59

Primary Site Primary Site ReplicationReplication

One copy designated as One copy designated as “Master” “Master”

Published to other sites who Published to other sites who subscribe to “Secondary” subscribe to “Secondary” copiescopies

Changes propagated to Changes propagated to “Secondary” copies“Secondary” copies

Done in two steps:Done in two steps: Capture changes made by Capture changes made by

committed Transactionscommitted Transactions Apply these changesApply these changes

J.J.Bunn, Distributed Databases, 2001 60

The Capture StepThe Capture Step

Procedural: A procedure, Procedural: A procedure, automatically invoked, does automatically invoked, does the capture (takes a the capture (takes a snapshot)snapshot)

Log-based: the log is used to Log-based: the log is used to generate a Change Data Tablegenerate a Change Data Table Better (cheaper and faster) but Better (cheaper and faster) but

relies on proprietary log detailsrelies on proprietary log details

J.J.Bunn, Distributed Databases, 2001 61

The Apply StepThe Apply Step

The Secondary site The Secondary site periodically obtains from the periodically obtains from the Primary site a snapshot or Primary site a snapshot or changes to the Change Data changes to the Change Data Table Table Updates its copyUpdates its copy Period can be timer-based or Period can be timer-based or

defined by the user/applicationdefined by the user/application Log-based capture with Log-based capture with

continuous Apply minimises continuous Apply minimises delays in propagating delays in propagating changeschanges

J.J.Bunn, Distributed Databases, 2001 62

Peer-to-Peer Peer-to-Peer ReplicationReplication

More than one copy can be More than one copy can be “Master”“Master”

Changes are somehow Changes are somehow propagated to other copiespropagated to other copies

Conflicting changes must be Conflicting changes must be resolvedresolved

So best when conflicts do not So best when conflicts do not or cannot arise:or cannot arise: Each “Master” owns a disjoint Each “Master” owns a disjoint

fragment or copyfragment or copy Update permission only granted Update permission only granted

to one “Master” at a timeto one “Master” at a time

J.J.Bunn, Distributed Databases, 2001 63

Replication ExamplesReplication Examples Master copy, many slave copies (SQL Master copy, many slave copies (SQL

Server)Server) always know the correct value (master)always know the correct value (master) change propagation can be change propagation can be

transactionaltransactional as soon as possibleas soon as possible periodicperiodic on demandon demand

Symmetric, and anytime (Access)Symmetric, and anytime (Access) allows mobile (disconnected) updatesallows mobile (disconnected) updates updates propagated ASAP, periodic, on updates propagated ASAP, periodic, on

demanddemand non-serializablenon-serializable colliding updates must be reconciled.colliding updates must be reconciled. hard to know “real” valuehard to know “real” value

J.J.Bunn, Distributed Databases, 2001 64

Data Warehousing Data Warehousing and Replicationand Replication

Build giant “warehouses” of data Build giant “warehouses” of data from many sitesfrom many sites Enable complex decision support Enable complex decision support

queries over data from across an queries over data from across an organisationorganisation

Warehouses can be seen as an Warehouses can be seen as an instance of asynchronous instance of asynchronous replicationreplication Source data is typically controlled by Source data is typically controlled by

different DBMS: emphasis on different DBMS: emphasis on “cleaning” data by removing “cleaning” data by removing mismatches while creating replicasmismatches while creating replicas

Procedural Capture and Procedural Capture and application Apply work best for application Apply work best for this environmentthis environment

J.J.Bunn, Distributed Databases, 2001 65

Distributed LockingDistributed Locking How to manage Locks across How to manage Locks across

many sites?many sites? Centrally: one site does all Centrally: one site does all

lockinglocking Vulnerable to single site Vulnerable to single site

failurefailure Primary Copy: all locking for an Primary Copy: all locking for an

object done at the primary copy object done at the primary copy site for the objectsite for the object Reading requires access to Reading requires access to

locking site as well as site locking site as well as site which stores objectwhich stores object

Fully Distributed: locking for a Fully Distributed: locking for a copy done at site where the copy done at site where the copy is storedcopy is stored Locks at all sites while writing Locks at all sites while writing

an objectan object

J.J.Bunn, Distributed Databases, 2001 66

Distributed Deadlock Distributed Deadlock DetectionDetection

Each site maintains a local “waits-Each site maintains a local “waits-for” graphfor” graph

Global deadlock might occur even Global deadlock might occur even if local graphs contain no cyclesif local graphs contain no cycles E.g. Site A holds lock on X, waits for E.g. Site A holds lock on X, waits for

lock on Ylock on Y Site B holds lock on Y, waits for lock Site B holds lock on Y, waits for lock

on Xon X Three solutions:Three solutions:

Centralised (send all local graphs to Centralised (send all local graphs to one site)one site)

Hierarchical (organise sites into Hierarchical (organise sites into hierarchy and send local graphs to hierarchy and send local graphs to parent)parent)

Timeout (abort Transaction if it waits Timeout (abort Transaction if it waits too long)too long)

J.J.Bunn, Distributed Databases, 2001 67

Distributed RecoveryDistributed Recovery

Links and Remote Sites may Links and Remote Sites may crash/failcrash/fail

If sub-transactions of a If sub-transactions of a Transaction execute at Transaction execute at different sites, all or none different sites, all or none must commitmust commit

Need a commit protocol to Need a commit protocol to achieve thisachieve this

Solution: Maintain a Log at Solution: Maintain a Log at each site of commit protocol each site of commit protocol actionsactions Two-Phase CommitTwo-Phase Commit

J.J.Bunn, Distributed Databases, 2001 68

Two-Phase CommitTwo-Phase Commit Site which originates Transaction is Site which originates Transaction is

coordinator, other sites involved in coordinator, other sites involved in Transaction are subordinatesTransaction are subordinates

When the Transaction needs to Commit:When the Transaction needs to Commit: Coordinator sends “prepare” message to Coordinator sends “prepare” message to

subordinatessubordinates Subordinates each force-writes an Subordinates each force-writes an abort abort or or

prepareprepare Log record, and sends “yes” or “no” Log record, and sends “yes” or “no” message to Coordinatormessage to Coordinator

If Coordinator gets unanimous “yes” If Coordinator gets unanimous “yes” messages, force-writes a messages, force-writes a commitcommit Log record, Log record, and sends “commit” message to all and sends “commit” message to all subordinatessubordinates

Otherwise, force-writes an Otherwise, force-writes an abortabort Log record, Log record, and sends “abort” message to all and sends “abort” message to all subordinatessubordinates

Subordinates force-write abort/commit Log Subordinates force-write abort/commit Log record accordingly, then send an “ack” record accordingly, then send an “ack” message to Coordinatormessage to Coordinator

Coordinator writes Coordinator writes endend Log record after Log record after receiving all acksreceiving all acks

J.J.Bunn, Distributed Databases, 2001 69

Notes on Two-Phase Notes on Two-Phase Commit (2PC)Commit (2PC)

First: voting, Second: termination First: voting, Second: termination – both initiated by Coordinator– both initiated by Coordinator

Any site can decide to abort the Any site can decide to abort the TransactionTransaction

Every message is recorded in the Every message is recorded in the local Log by the sender to ensure local Log by the sender to ensure it survives failuresit survives failures

All Commit Protocol log records All Commit Protocol log records for a Transaction contain the for a Transaction contain the Transaction ID and Coordinator ID. Transaction ID and Coordinator ID. The Coordinator’s abort/commit The Coordinator’s abort/commit record also includes the Site IDs of record also includes the Site IDs of all subordinatesall subordinates

J.J.Bunn, Distributed Databases, 2001 70

Restart after Site Restart after Site FailureFailure If there is a commit or abort Log If there is a commit or abort Log

record for Transaction T, but no record for Transaction T, but no end record, then must undo/redo Tend record, then must undo/redo T If the site is Coordinator for T, then If the site is Coordinator for T, then

keep sending keep sending commit/abortcommit/abort messages messages to Subordinates until to Subordinates until acksacks received received

If there is a prepare Log record, If there is a prepare Log record, but no commit or abort:but no commit or abort: This site is a Subordinate for TThis site is a Subordinate for T Contact Coordinator to find status of Contact Coordinator to find status of

T, then T, then write write commit/abortcommit/abort Log record Log record Redo/undoRedo/undo T T Write Write endend Log record Log record

J.J.Bunn, Distributed Databases, 2001 71

BlockingBlocking

If Coordinator for Transaction If Coordinator for Transaction T fails, then Subordinates T fails, then Subordinates who have voted “yes” cannot who have voted “yes” cannot decide whether to decide whether to commitcommit or or abort abort until Coordinator until Coordinator recovers!recovers!

T is T is blockedblocked Even if all Subordinates are Even if all Subordinates are

aware of one another (e.g. via aware of one another (e.g. via extra information in extra information in “prepare” message) they are “prepare” message) they are blocked blocked Unless one of them voted “no”Unless one of them voted “no”

J.J.Bunn, Distributed Databases, 2001 72

Link and Remote Site Link and Remote Site FailuresFailures

IfIf a Remote Site does not a Remote Site does not respond during the Commit respond during the Commit Protocol for TProtocol for T E.g. it crashed or the link is E.g. it crashed or the link is

downdown ThenThen

If current Site is Coordinator for If current Site is Coordinator for T: abortT: abort

If Subordinate and not yet voted If Subordinate and not yet voted “yes”: abort“yes”: abort

If Subordinate and has voted If Subordinate and has voted “yes”, it is blocked until “yes”, it is blocked until Coordinator back onlineCoordinator back online

J.J.Bunn, Distributed Databases, 2001 73

Observations on 2PCObservations on 2PC

Ack messages used to let Ack messages used to let Coordinator know when it can Coordinator know when it can “forget” a Transaction“forget” a Transaction Until it receives all acks, it must Until it receives all acks, it must

keep T in the Transaction Tablekeep T in the Transaction Table If Coordinator fails after If Coordinator fails after

sending “prepare” messages, sending “prepare” messages, but before writing but before writing commit/abort Log record, commit/abort Log record, when it comes back up, it when it comes back up, it aborts Taborts T

If a subtransaction does no If a subtransaction does no updates, its commit or abort updates, its commit or abort status is irrelevantstatus is irrelevant

J.J.Bunn, Distributed Databases, 2001 74

2PC with Presumed 2PC with Presumed AbortAbort When Coordinator aborts T, it When Coordinator aborts T, it

undoes T and removes it from the undoes T and removes it from the Transaction Table immediatelyTransaction Table immediately Doesn’t wait for “acks”Doesn’t wait for “acks” ““Presumes Abort” if T not in Presumes Abort” if T not in

Transaction TableTransaction Table Names of Subordinates not recorded Names of Subordinates not recorded

in in abortabort Log record Log record Subordinates do not send “ack” Subordinates do not send “ack”

on aborton abort If subtransaction does no updates, If subtransaction does no updates,

it responds to “prepare” message it responds to “prepare” message with “reader” (instead of with “reader” (instead of “yes”/”no”)“yes”/”no”)

Coordinator subsequently ignores Coordinator subsequently ignores “reader”s“reader”s

If all Subordinates are “reader”s, If all Subordinates are “reader”s, then 2then 2ndnd. Phase not required. Phase not required

J.J.Bunn, Distributed Databases, 2001 75

Replication and Replication and Partitioning ComparedPartitioning Compared

CentralCentralScaleupScaleup2x2x

more workmore work

Partition Partition ScaleupScaleup2x2x

more workmore work

ReplicationReplicationScaleupScaleup4x4x

more workmore work

ReplicationPartitioningTwo 1 TPS systems

1 TPS server100 Users

1 TPS server100 Users

O t

ps

O t

ps

1 TPS server100 Users

Base casea 1 TPS system

2 TPS server200 Users

Scaleupto a 2 TPS centralized system

Two 2 TPS systems

2 TPS server100 Users

2 TPS server100 Users

1 t

ps

1 t

ps

J.J.Bunn, Distributed Databases, 2001 76

““Porter” Agent-based Porter” Agent-based Distributed DatabaseDistributed Database

Charles Univ, PragueCharles Univ, Prague Based on “Aglets” SDK from IBMBased on “Aglets” SDK from IBM

Part 3Part 3

Distributed SystemsDistributed Systems..

Julian Bunn

California Institute of Technology

J.J.Bunn, Distributed Databases, 2001 78

What’s a Distributed What’s a Distributed System?System?

Centralized: Centralized: everything in one placeeverything in one place stand-alone PC or stand-alone PC or

MainframeMainframe

Distributed: Distributed: some parts remotesome parts remote

distributed usersdistributed users distributed executiondistributed execution distributed datadistributed data

J.J.Bunn, Distributed Databases, 2001 79

Why Distribute?Why Distribute?

No best organizationNo best organization

Organisations constantly swing Organisations constantly swing betweenbetween Centralized: focus, control, economyCentralized: focus, control, economy Decentralized: adaptive, responsive, Decentralized: adaptive, responsive,

competitivecompetitive

Why distribute?Why distribute? reflect organisation or application reflect organisation or application

structure structure empower users / producersempower users / producers improve service (response / improve service (response /

availability)availability) distribute loaddistribute load use PC technology (economics)use PC technology (economics)

J.J.Bunn, Distributed Databases, 2001 80

What What Should Be Should Be

Distributed? Distributed? Users and User Users and User InterfaceInterface Thin client Thin client

ProcessingProcessing Trim clientTrim client

DataData Fat clientFat client

Will discuss tradeoffs Will discuss tradeoffs later later

Database

Business Objects

workflow

Presentation

J.J.Bunn, Distributed Databases, 2001 81

Transparency Transparency in Distributed in Distributed

SystemsSystems Make distributed system as easy to Make distributed system as easy to use and manage as a centralized use and manage as a centralized systemsystem

Give a Single-System ImageGive a Single-System Image

Location transparency:Location transparency: hide fact that object is remotehide fact that object is remote hide fact that object has movedhide fact that object has moved hide fact that object is partitioned or hide fact that object is partitioned or

replicatedreplicated Name doesn’t change if object is Name doesn’t change if object is

replicated, partitioned or moved.replicated, partitioned or moved.

J.J.Bunn, Distributed Databases, 2001 82

Naming- The basicsNaming- The basics Objects haveObjects have

Globally Unique Identifier (GUIDs)Globally Unique Identifier (GUIDs) location(s) = address(es) location(s) = address(es) name(s) name(s) addresses can changeaddresses can change objects can have many namesobjects can have many names

Names are context dependent:Names are context dependent: (Jim @ KGB (Jim @ KGB Jim @ CIA)Jim @ CIA)

Many naming systemsMany naming systems UNC: \\node\device\dir\dir\dir\objectUNC: \\node\device\dir\dir\dir\object Internet: Internet:

http://node.domain.root/dir/dir/dir/objecthttp://node.domain.root/dir/dir/dir/object LDAP: LDAP:

ldap://ldap.domain.root/o=org,c=US,cn=dirldap://ldap.domain.root/o=org,c=US,cn=dir

guid

Jim

Address

James

J.J.Bunn, Distributed Databases, 2001 83

Name ServersName Serversin Distributed in Distributed

SystemsSystems Name servers translate Name servers translate names + context names + context

to address (+ GUID)to address (+ GUID) Name servers are partitioned Name servers are partitioned

(subtrees of name space)(subtrees of name space) Name servers replicate root Name servers replicate root

of name treeof name tree Name servers form a hierarchyName servers form a hierarchy Distributed data from hell: Distributed data from hell:

high read traffic high read traffic high reliability & availabilityhigh reliability & availability

autonomyautonomy

J.J.Bunn, Distributed Databases, 2001 84

Autonomy Autonomy in Distributed in Distributed

SystemsSystems Owner of site Owner of site (or node, or application, or (or node, or application, or database)database)Wants to control itWants to control it

If my part is working, If my part is working, must be able to access & manage itmust be able to access & manage it

(reorganize, upgrade, add user,(reorganize, upgrade, add user,…)…)

Autonomy isAutonomy is EssentialEssential Difficult to implement. Difficult to implement. Conflicts with global consistencyConflicts with global consistency

examples: naming, authentication, examples: naming, authentication, admin…admin…

J.J.Bunn, Distributed Databases, 2001 85

Security Security The BasicsThe Basics

Authentication server Authentication server subject + Authenticator => subject + Authenticator =>

(Yes + (Yes + token) | Notoken) | No

Security matrix:Security matrix: who can do what to who can do what to

whomwhom Access control list is Access control list is

column of matrixcolumn of matrix ““who” is who” is

authenticated IDauthenticated ID In a distributed system, In a distributed system,

“who” and “what” and “who” and “what” and “whom” are distributed “whom” are distributed objectsobjects

subject

Object

Permissions

J.J.Bunn, Distributed Databases, 2001 86

Security Security in Distributed in Distributed

SystemsSystems Security domainSecurity domain: :

nodes with a shared security server.nodes with a shared security server. Security domains can have trust relationships:Security domains can have trust relationships:

A trusts B: A “believes” B when it says this is Jim@BA trusts B: A “believes” B when it says this is Jim@B

Security domains form a hierarchy.Security domains form a hierarchy. Delegation: Delegation: passing authority to a server passing authority to a server

when A asks B to do something when A asks B to do something (e.g. print a file, read a (e.g. print a file, read a database)database)B may need A’s authorityB may need A’s authority

Autonomy requires:Autonomy requires: each node is an authenticatoreach node is an authenticator each node does own security checkseach node does own security checks

Internet Today: Internet Today: no trust among domains no trust among domains (fire walls, many (fire walls, many

passwords)passwords) trust based on digital signaturestrust based on digital signatures

J.J.Bunn, Distributed Databases, 2001 87

Clusters Clusters The Ideal Distributed System.The Ideal Distributed System.

Cluster is Cluster is distributed distributed system BUT singlesystem BUT single locationlocation managermanager security policysecurity policy

relatively relatively homogeneoushomogeneous

communications iscommunications is high bandwidthhigh bandwidth low latencylow latency low error ratelow error rate

Clusters use Clusters use distributed distributed

system system techniques fortechniques for load distributionload distribution

storage storage executionexecution

growthgrowth fault tolerancefault tolerance

J.J.Bunn, Distributed Databases, 2001 88

Cluster: Shared Cluster: Shared What?What? Shared Memory Shared Memory

MultiprocessorMultiprocessor Multiple processors, one Multiple processors, one

memorymemory all devices are localall devices are local HP V-classHP V-class

Shared Disk ClusterShared Disk Cluster an array of nodesan array of nodes all shared common disksall shared common disks VAXcluster + OracleVAXcluster + Oracle

Shared Nothing ClusterShared Nothing Cluster each device local to a nodeeach device local to a node ownership may changeownership may change Beowulf,Tandem, SP2, Beowulf,Tandem, SP2,

WolfpackWolfpack

J.J.Bunn, Distributed Databases, 2001 89

Distributed ExecutionDistributed ExecutionThreads and MessagesThreads and Messages

Thread is Execution unitThread is Execution unit(software analog of (software analog of

cpu+memory)cpu+memory)

Threads execute at a Threads execute at a nodenode

Threads communicate Threads communicate viavia Shared memory (local)Shared memory (local) Messages (local and Messages (local and

remote)remote)

threads

shared memory

messages

J.J.Bunn, Distributed Databases, 2001 90

Peer-to-Peer or Client-Peer-to-Peer or Client-ServerServer

Peer-to-Peer is Peer-to-Peer is symmetric:symmetric: Either side can sendEither side can send

Client-serverClient-server client sends requestsclient sends requests server sends responsesserver sends responses simple subset of peer-to-simple subset of peer-to-

peerpeer

requestresponse

J.J.Bunn, Distributed Databases, 2001 91

Connection-less Connection-less oror ConnectedConnected Connection-less Connection-less

request containsrequest contains client idclient id client contextclient context work requestwork request

client authenticated on client authenticated on each messageeach message

only a single response only a single response messagemessage

e.g. HTTP, NFS v1 e.g. HTTP, NFS v1

Connected Connected (sessions)(sessions)open - request/reply - closeopen - request/reply - closeclient authenticated onceclient authenticated onceMessages arrive in orderMessages arrive in orderCan send many replies Can send many replies (e.g. FTP)(e.g. FTP)

Server has client contextServer has client context (context sensitive) (context sensitive) e.g. Winsock and ODBC e.g. Winsock and ODBC HTTP adding connectionsHTTP adding connections

J.J.Bunn, Distributed Databases, 2001 92

Remote Procedure Remote Procedure Call: The key to Call: The key to transparencytransparency

Object may Object may be local or be local or remoteremote

Methods on Methods on object work object work wherever it wherever it is.is.

Local Local invocationinvocation

y = pObj->f(x);

f()

x

valy = val;

return val;

J.J.Bunn, Distributed Databases, 2001 93

Remote Procedure Remote Procedure Call: Call: The key to transparencyThe key to transparency

Remote invocationRemote invocation

Obj Local?x

valy = val;

f()

return val;

y = pObj->f(x);

marshal

unmarshal

marshal

unmarshal

x

proxy

unmarshal

pObj->f(x)

marshal

xstub

Obj Local?Obj Local?x

val

f()

return val;

val

val

Gee!! Nice pictures!

J.J.Bunn, Distributed Databases, 2001 94

Transaction

Object Request Broker Object Request Broker (ORB)(ORB)

Orchestrates RPCOrchestrates RPC Registers ServersRegisters Servers Manages pools of serversManages pools of servers Connects clients to serversConnects clients to servers Does Naming, request-level Does Naming, request-level

authorization,authorization, Provides transaction coordination Provides transaction coordination (new (new

feature)feature) Old names: Old names:

Transaction Processing Monitor, Transaction Processing Monitor, Web server, Web server, NetWareNetWare Object-Request Broker

J.J.Bunn, Distributed Databases, 2001 95

Send updates to correct Send updates to correct partitionpartition

Using RPC for Using RPC for TransparencyTransparency

Partition TransparencyPartition Transparency

part Local?xy = pfile->write(x);

sendto

correct partition

sendto

correct partition

x

unmarshal

pObj->write(x)

marshal

x

x

val

write()

return val;

val

val

J.J.Bunn, Distributed Databases, 2001 96

Send updates to EACH Send updates to EACH nodenode

Using RPC for Using RPC for TransparencyTransparency

Replication TransparencyReplication Transparency

xy = pfile->write(x);

Sendto

eachreplica

Sendto

eachreplica

x

val

J.J.Bunn, Distributed Databases, 2001 97

Client/Server Client/Server InteractionsInteractions

All can be done with RPCAll can be done with RPC Request-ResponseRequest-Responseresponse may be many response may be many

messagesmessages

ConversationalConversationalserver keeps client contextserver keeps client context

DispatcherDispatcherthree-tier: complex operation at three-tier: complex operation at

serverserver

QueuedQueuedde-couples client from serverde-couples client from serverallows disconnected operationallows disconnected operation

C S

C S

C S SS

C SS

J.J.Bunn, Distributed Databases, 2001 98

Queued Queued Request/ResponseRequest/Response Time-decouples client and serverTime-decouples client and server

Three TransactionsThree Transactions

Almost real time, ASAP processingAlmost real time, ASAP processing

Communicate at each other’s Communicate at each other’s convenienceconvenienceAllows mobile (disconnected) operationAllows mobile (disconnected) operation

Disk queues survive client & server Disk queues survive client & server failuresfailures

Client Server

SubmitPerform

Response

J.J.Bunn, Distributed Databases, 2001 99

Why Queued Why Queued Processing?Processing? Prioritize requestsPrioritize requests

ambulance dispatcher favors high-priority callsambulance dispatcher favors high-priority calls

Manage WorkflowsManage Workflows

Deferred processing in mobile appsDeferred processing in mobile apps

Interface heterogeneous systemsInterface heterogeneous systemsEDI, EDI, MOM: Message-Oriented-Middleware MOM: Message-Oriented-Middleware DAD: Direct Access to Data DAD: Direct Access to Data

Order Build Ship Invoice Pay

J.J.Bunn, Distributed Databases, 2001 100

Work Distribution Work Distribution SpectrumSpectrum

Presentation Presentation

and plug-insand plug-ins Workflow Workflow

manages manages session & session & invokes invokes objectsobjects

Business Business objectsobjects

DatabaseDatabase

Fat

ThinFat

Thin

Database

Business Objects

workflow

Presentation

J.J.Bunn, Distributed Databases, 2001 101

Transaction Processing Transaction Processing Evolution to Three TierEvolution to Three TierIntelligence migrated to clients Intelligence migrated to clients

Mainframe Batch Mainframe Batch processing (centralized)processing (centralized)

Dumb terminals &Dumb terminals & Remote Job Entry Remote Job Entry

Intelligent terminals Intelligent terminals database backendsdatabase backends

Workflow SystemsWorkflow SystemsObject Request BrokersObject Request BrokersApplication GeneratorsApplication Generators

Mainframe

cards

Active

green screen3270

Server

TP Monitor

ORB

J.J.Bunn, Distributed Databases, 2001 102

Web Evolution to Three Web Evolution to Three TierTier

Intelligence migrated to clients (like Intelligence migrated to clients (like TP)TP)

Character-mode clients, Character-mode clients, smart serverssmart servers

GUI Browsers - Web file GUI Browsers - Web file serversservers

GUI Plugins - Web dispatchers GUI Plugins - Web dispatchers - CGI- CGI

Smart clients - Web dispatcher Smart clients - Web dispatcher (ORB)(ORB)pools of app servers (ISAPI, pools of app servers (ISAPI, Viper)Viper)workflow scripts at client & workflow scripts at client & serverserver

archie ghophergreen screen

WebServer

Mosaic

WAIS

NS & IE

Active

J.J.Bunn, Distributed Databases, 2001 103

PC Evolution to Three PC Evolution to Three TierTier

Intelligence migrated to serverIntelligence migrated to server Stand-alone PC Stand-alone PC

(centralized)(centralized)

PC + File & print PC + File & print serverserver

message per I/Omessage per I/O

PC + Database PC + Database server server

message per SQL message per SQL statementstatement

PC + App server PC + App server message per message per

transactiontransaction

ActiveX Client, ORB ActiveX Client, ORB ActiveX server, ActiveX server, Xscript Xscript

disk I/OIO request

reply

SQL Statement

Transaction

J.J.Bunn, Distributed Databases, 2001 104

The Pattern: The Pattern: Three Tier ComputingThree Tier Computing

Clients do presentation, gather Clients do presentation, gather inputinput

Clients do some workflow Clients do some workflow (Xscript)(Xscript)

Clients send high-level requests Clients send high-level requests to ORB (Object Request Broker)to ORB (Object Request Broker)

ORB dispatches workflows and ORB dispatches workflows and business objects -- proxies for business objects -- proxies for client, orchestrate flows & client, orchestrate flows & queuesqueues

Server-side workflow scripts call Server-side workflow scripts call on distributed business objects on distributed business objects to execute taskto execute task

Database

Business Objects

workflow

Presentation

J.J.Bunn, Distributed Databases, 2001 105

The Three The Three TiersTiers

Web Client

HTML

VB or Java Script Engine

VB or Java Virt Machine

VBscritptJavaScrpt

VB Javaplug-ins

InternetORB

HTTP+DCOM

ObjectserverPool

MiddlewareORB

TP MonitorWeb Server...

DCOM (oleDB, ODBC,...)

Object & Dataserver.

LU6.2

IBMLegacy Gateways

J.J.Bunn, Distributed Databases, 2001 106

Why Did Everyone Go To Why Did Everyone Go To Three-Tier?Three-Tier?

ManageabilityManageability Business rules must be with dataBusiness rules must be with data Middleware operations toolsMiddleware operations tools

Performance (scaleability)Performance (scaleability) Server resources are preciousServer resources are precious ORB dispatches requests to server ORB dispatches requests to server

poolspools

Technology & PhysicsTechnology & Physics Put UI processing near userPut UI processing near user Put shared data processing near Put shared data processing near

shared datashared data Database

Business Objects

workflow

Presentation

J.J.Bunn, Distributed Databases, 2001 107

DAD’sRaw Data

Customer comes to storeTakes what he wantsFills out invoiceLeaves money for goods

Easy to buildNo clerks

Why Put Business Why Put Business Objects Objects

at Server?at Server?

Customer comes to store with list Gives list to clerk Clerk gets goods, makes invoiceCustomer pays clerk, gets goods

Easy to manageClerks controls accessEncapsulation

MOM’s Business Objects

J.J.Bunn, Distributed Databases, 2001 108

Why Server Pools?Why Server Pools? Server resources are precious.Server resources are precious.

Clients have 100x more power than Clients have 100x more power than server. server.

Pre-allocate everything on serverPre-allocate everything on server preallocate memorypreallocate memory pre-open filespre-open files pre-allocate threadspre-allocate threads pre-open and authenticate clientspre-open and authenticate clients

Keep high duty-cycle on objectsKeep high duty-cycle on objects (re-use them)(re-use them) Pool threads, not one per clientPool threads, not one per client

Classic example: Classic example: TPC-C benchmarkTPC-C benchmark 2 processes2 processes

everything pre-allocatedeverything pre-allocated

7,000 clients

IIS SQL

Pool ofDBC linksHTTP

N clients x N Servers x F files =N x N x F file opens!!!

IE

J.J.Bunn, Distributed Databases, 2001 109

Classic MistakesClassic Mistakes Thread per terminalThread per terminal

fix: DB server thread poolsfix: DB server thread poolsfix: server poolsfix: server pools

Process per request (CGI)Process per request (CGI)fix: ISAPI & NSAPI DLLs fix: ISAPI & NSAPI DLLs fix: connection poolsfix: connection pools

Many messages per Many messages per operationoperationfix: stored proceduresfix: stored proceduresfix: server-side objectsfix: server-side objects

File open per requestFile open per requestfix: cache hot filesfix: cache hot files

J.J.Bunn, Distributed Databases, 2001 110

Distributed Distributed Applications need Applications need

Transactions!Transactions! Transactions are key to Transactions are key to structuring distributed applications structuring distributed applications

ACID properties easeACID properties easeexception handlingexception handling AtomicAtomic: all or nothing: all or nothing ConsistentConsistent: state transformation: state transformation IsolatedIsolated: no concurrency anomalies: no concurrency anomalies DurableDurable: committed transaction effects : committed transaction effects

persistpersist

J.J.Bunn, Distributed Databases, 2001 111

Programming & Programming & TransactionsTransactions

The Application ViewThe Application View You Start You Start (e.g. in TransactSQL)(e.g. in TransactSQL):: Begin [Distributed] Transaction Begin [Distributed] Transaction

<name><name> Perform actionsPerform actions Optional Save Transaction Optional Save Transaction

<name><name> Commit or RollbackCommit or Rollback

You Inherit a XIDYou Inherit a XID Caller passes you a transactionCaller passes you a transaction You return or Rollback.You return or Rollback. You can Begin / Commit sub-You can Begin / Commit sub-

trans.trans. You can use save pointsYou can use save points

Begin

Commit

Begin

RollBack

Return

RollBackReturn

XID

J.J.Bunn, Distributed Databases, 2001 112

Transaction Save Transaction Save PointsPoints

Backtracking within a Backtracking within a transactiontransaction

Allows app to Allows app to cancel parts cancel parts of a of a transaction transaction prior to prior to commitcommit

This is in This is in most SQL most SQL productsproducts

BEGIN WORK:1

SAVE WORK:2

action

action

action

action

action

action

action

SAVE WORK:4

SAVE WORK:3

ROLLBACK WORK(2)

SAVE WORK:5

action

action

SAVE WORK:6

action

SAVE WORK:7

action

action

action

action

ROLLBACK WORK(7)

SAVE WORK:8

action

action

action

COMMIT WORK

J.J.Bunn, Distributed Databases, 2001 113

Chained TransactionsChained Transactions Commit of T1 implicitly begins T2. Carries context forward to next

transaction cursorscursors lockslocks other stateother state

Transaction #1 Transaction #2Commit

Begin

Processingcontext

establishedestablished

Processingcontextusedused

J.J.Bunn, Distributed Databases, 2001 114

Nested TransactionsNested TransactionsGoing Beyond Flat TransactionsGoing Beyond Flat Transactions

Need transactions within Need transactions within transactionstransactions

Sub-transactions commit only if Sub-transactions commit only if root doesroot does

Only root commit is durable.Only root commit is durable. Subtransactions may rollbackSubtransactions may rollback

if so, all its subtransactions if so, all its subtransactions rollbackrollback

Parallel version of nested Parallel version of nested transactionstransactions

T1

T114

T113

T112T111

T11

T123T122T121T12

T131 T132 T133T13

J.J.Bunn, Distributed Databases, 2001 115

Workflow: Workflow: A Sequence of TransactionsA Sequence of Transactions

Application transactions are Application transactions are multi-stepmulti-step order, build, ship & invoice, order, build, ship & invoice,

reconcilereconcile Each step is an ACID unitEach step is an ACID unit Workflow is a script describing Workflow is a script describing

stepssteps Workflow systems Workflow systems

Instantiate the scriptsInstantiate the scripts Drive the scriptsDrive the scripts Allow query against scriptsAllow query against scripts

Examples Examples Manufacturing Work In Process Manufacturing Work In Process

(WIP)(WIP)Queued processingQueued processingLoan application & approval,Loan application & approval,Hospital admissions…Hospital admissions…

Database

Business Objects

workflow

Presentation

J.J.Bunn, Distributed Databases, 2001 116

Workflow ScriptsWorkflow Scripts Workflow scripts are programsWorkflow scripts are programs

(could use VBScript or JavaScript)(could use VBScript or JavaScript)

If step fails, compensation action If step fails, compensation action handles errorhandles error

Events, messages, time, other steps Events, messages, time, other steps cause step.cause step.

Workflow controller drives flowsWorkflow controller drives flows

Source

Step

branchcase

fork

loopCompensationAction

join

J.J.Bunn, Distributed Databases, 2001 117

Workflow and ACIDWorkflow and ACID Workflow is not Atomic or IsolatedWorkflow is not Atomic or Isolated Results of a step visible to allResults of a step visible to all Workflow is Consistent and Workflow is Consistent and

DurableDurable Each flow may take hours, weeks, Each flow may take hours, weeks,

monthsmonths Workflow controller Workflow controller

keeps flows movingkeeps flows moving maintains context (state) for each flowmaintains context (state) for each flow provides a query and operator provides a query and operator

interfaceinterfacee.g.: “what is the status of Job # e.g.: “what is the status of Job # 72149?”72149?”

J.J.Bunn, Distributed Databases, 2001 118

ACID Objects Using ACID ACID Objects Using ACID DBsDBs

The easy way to build transactional The easy way to build transactional objectsobjects

Application uses transactional Application uses transactional objectsobjects(objects have ACID properties)(objects have ACID properties)

If object built on top of ACID If object built on top of ACID objects, objects, then object is ACID.then object is ACID. Example: New, EnQueue, DeQueue Example: New, EnQueue, DeQueue

on top of SQLon top of SQL SQL provides ACIDSQL provides ACID

SQL

Business Object: Customer

Business Object Mgr: CustomerMgr

SQL

dim c as Customerdim CM as CustomerMgr...set C = CM.get(CustID)...C.credit_limit = 1000...CM.update(C, CustID)..Persistent Programming languages automate this.

J.J.Bunn, Distributed Databases, 2001 119

ACID Objects From Bare ACID Objects From Bare Metal Metal The Hard Way to Build The Hard Way to Build

Transactional ObjectsTransactional Objects Object Class is a Object Class is a Resource Manager Resource Manager (RM)(RM) Provides ACID objects from persistent Provides ACID objects from persistent

storagestorage Provides Undo (on rollback)Provides Undo (on rollback) Provides Redo (on restart or media Provides Redo (on restart or media

failure)failure) Provides Isolation for concurrent opsProvides Isolation for concurrent ops

Microsoft SQL Server, IBM DB2, Microsoft SQL Server, IBM DB2, Oracle,…Oracle,…are Resource managers.are Resource managers.

Many more coming.Many more coming. RM implementation techniques described RM implementation techniques described

laterlater

J.J.Bunn, Distributed Databases, 2001 120

Transaction ManagerTransaction Manager Transaction Manager (TM): Transaction Manager (TM):

manages transaction manages transaction objects.objects. XID factoryXID factory tracks themtracks them coordinates themcoordinates them

App gets XID from TMApp gets XID from TM Transactional RPC Transactional RPC

passes XID on all callspasses XID on all calls manages XID inheritancemanages XID inheritance

TM manages commit & TM manages commit & rollbackrollback

AppAppRMRM

TMTM

begin

XID

call(..XID)

enlist

J.J.Bunn, Distributed Databases, 2001 121

TM Two-Phase TM Two-Phase CommitCommit

Dealing with multiple RMsDealing with multiple RMs If all use one RM, then all or none If all use one RM, then all or none commitcommit

If multiple RMs, then need If multiple RMs, then need coordinationcoordination

Standard technique:Standard technique: Marriage: Do you? I do. I Marriage: Do you? I do. I

pronounce…Kisspronounce…Kiss Theater: Ready on the set? Ready! Theater: Ready on the set? Ready!

Action! ActAction! Act Sailing: Ready about? Ready! Helm’s Sailing: Ready about? Ready! Helm’s

a-lee! Tacka-lee! Tack Contract law: Escrow agentContract law: Escrow agent

Two-phase commit:Two-phase commit: 1. Voting phase: can you do it?1. Voting phase: can you do it? 2. If all vote yes, then commit phase: 2. If all vote yes, then commit phase:

do it! do it!

J.J.Bunn, Distributed Databases, 2001 122

Two-Phase Commit In Two-Phase Commit In PicturesPictures Transactions managed by TM Transactions managed by TM

App gets unique ID (XID) from App gets unique ID (XID) from TM at Begin()TM at Begin()

XID passed on Transactional XID passed on Transactional RPCRPC

RMs Enlist when first do work RMs Enlist when first do work on XIDon XID

AppApp RM1RM1

TMTM

RM2RM2

Begin

XID

Call(..XID..)

Enlist

Call(..XID..)

Enlist

J.J.Bunn, Distributed Databases, 2001 123

When App Requests When App Requests CommitCommit

Two Phase Commit in PicturesTwo Phase Commit in Pictures TM tracks all RMs enlisted on an TM tracks all RMs enlisted on an XIDXID

TM calls enlisted RM’s Prepared() TM calls enlisted RM’s Prepared() callbackcallback

If all vote yes, TM calls RM’s Commit() If all vote yes, TM calls RM’s Commit() If any vote no, TM calls RM’s Rollback()If any vote no, TM calls RM’s Rollback()

AppApp RM1RM1

TMTM

RM2RM2

1. Application requests Commit1. Application requests Commit

Commit1

Prepare

Prepare2

2

2. TM broadcasts prepared?2. TM broadcasts prepared?

YesYes

3

3

3. RMs all vote Yes3. RMs all vote Yes

4. TM decides Yes, 4. TM decides Yes, broadcastsbroadcasts

4

Comm

it

Comm

it4

5. RMs 5. RMs acknowledgeacknowledge

5

Yes5

Yesyes

6. TM says6. TM saysyesyes

J.J.Bunn, Distributed Databases, 2001 124

Implementing Implementing TransactionsTransactions

AtomicityAtomicity The DO/UNDO/REDO protocolThe DO/UNDO/REDO protocol IdempotenceIdempotence Two-phase commitTwo-phase commit

DurabilityDurability Durable logsDurable logs Force at commitForce at commit

IsolationIsolation Locking or versioningLocking or versioning

Part 4Part 4

Distributed Distributed Databases for Databases for

PhysicsPhysics..

Julian Bunn

California Institute of Technology

J.J.Bunn, Distributed Databases, 2001 126

Distributed Distributed Databases in PhysicsDatabases in Physics

Virtual Observatories (e.g. Virtual Observatories (e.g. NVO)NVO)

Gravity Wave Data (e.g. LIGO)Gravity Wave Data (e.g. LIGO) Particle Physics (e.g. LHC Particle Physics (e.g. LHC

Experiments)Experiments)

J.J.Bunn, Distributed Databases, 2001 127

Distributed Particle Distributed Particle Physics DataPhysics Data

Next Generation of particle Next Generation of particle physics experiments are data physics experiments are data intensiveintensive Acquisition rates of 100 Acquisition rates of 100

MBytes/secondMBytes/second At least One PetaByte (10At least One PetaByte (101515

Bytes) of raw data per year, per Bytes) of raw data per year, per experimentexperiment

Another PetaByte of Another PetaByte of reconstructed datareconstructed data

More PetaBytes of simulated More PetaBytes of simulated datadata

Many TeraBytes of MetaDataMany TeraBytes of MetaData To be accessed by ~2000 To be accessed by ~2000

physicists sitting around the physicists sitting around the globeglobe

J.J.Bunn, Distributed Databases, 2001 128

An Ocean of ObjectsAn Ocean of Objects Access from anywhere to any Access from anywhere to any

object in an Ocean of many object in an Ocean of many PetaBytes of objectsPetaBytes of objects

Approach:Approach: Distribute collections of useful Distribute collections of useful

objects to where they will be objects to where they will be most usedmost used

Move applications Move applications toto the the collection locationscollection locations

Maintain an up-to-date Maintain an up-to-date catalogue of collection locationscatalogue of collection locations

Try to balance the global Try to balance the global compute resources with the compute resources with the task load from the global clientstask load from the global clients

J.J.Bunn, Distributed Databases, 2001 129

RDBMS vs. Object RDBMS vs. Object DatabaseDatabase

•DBMS functionality split between the client and server

•allowing computing resources to be used

•allowing scalability.

•clients added without slowing down others,

•ODBMS automatically establishes direct, independent, parallel communication paths between clients and servers

•servers added to incrementally increase performance without limit.

•Users send requests into the server queue

•all requests must first be serialized through this queue.

•to achieve serialization and avoid conflicts, all requests must go through the server queue.

•Once through the queue, the server may be able to spawn off multiple threads

J.J.Bunn, Distributed Databases, 2001 130

Designing the Designing the Distributed DatabaseDistributed Database

Problem is: how to handle Problem is: how to handle distributed clients and distributed distributed clients and distributed data whilst maximising client task data whilst maximising client task throughput and use of resourcesthroughput and use of resources

Distributed Databases for:Distributed Databases for: The physics dataThe physics data The metadataThe metadata

Use middleware that is conscious Use middleware that is conscious of the global state of the system:of the global state of the system: Where are the clients?Where are the clients? What data are they asking for?What data are they asking for? Where are the CPU resources?Where are the CPU resources? Where are the Storage resources?Where are the Storage resources? How does the global system measure How does the global system measure

up to it workload, in the past, now up to it workload, in the past, now and in the future?and in the future?

J.J.Bunn, Distributed Databases, 2001 131

Distributed Distributed Databases for HEPDatabases for HEP

Replica synchronisation usually based Replica synchronisation usually based on small transactionson small transactions But HEP transactions are large (and long-But HEP transactions are large (and long-

lived)lived) Replication at the Object level desiredReplication at the Object level desired

Objectivity DRO requires dynamic quorumObjectivity DRO requires dynamic quorum bad for unstable WAN linksbad for unstable WAN links

So too difficult – use file replication So too difficult – use file replication E.g. GDMP Subscription methodE.g. GDMP Subscription method

Which Replica to Select?Which Replica to Select? Complex decision tree, involvingComplex decision tree, involving

Prevailing WAN and Systems conditionsPrevailing WAN and Systems conditions Objects that the Query “touches” and Objects that the Query “touches” and

“needs”“needs” Where the compute power isWhere the compute power is Where the replicas areWhere the replicas are Existence of previously cached datasetsExistence of previously cached datasets

J.J.Bunn, Distributed Databases, 2001 132

Distributed LHC Distributed LHC Databases TodayDatabases Today

Architecture is Architecture is loosely loosely coupled, coupled, autonomous, autonomous, Object Object DatabasesDatabases

File-based File-based replication replication withwith

Globus Globus middlewaremiddleware

Efficient WAN Efficient WAN transporttransport