CS377: Database Systems Distributed Databaseslxiong/cs377_f11/share/slides/24_ddb.pdf · CS377:...

CS377: Database Systems

Distributed Databases


Distributed Databases

Li Xiong

Department of Mathematics and Computer Science

Emory University

Centralized DBMS on a Network

[Figure: Sites 1 through 5 connected by a communication network.]

Distributed DBMS Environment

[Figure: Sites 1 through 5 connected by a communication network.]

Distributed Database System

• A distributed database (DDB) is a collection of multiple, logically interrelated databases distributed over a computer network.

• A distributed database management system (D-DBMS) is the software that manages the DDB and provides an access mechanism that makes this distribution transparent to the users.

• Distributed database system (DDBS) = DDB + D-DBMS

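Distributed Database System

The EMPLOYEE, PROJECT, and WORKS_ON tables may be fragmented horizontally and stored with possible replication as shown below.

[Figure: horizontal fragments of EMPLOYEE, PROJECT, and WORKS_ON stored across sites, with possible replication.]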

Distributed DBMS Promises

• Transparent management of distributed, fragmented, and replicated data
• Improved reliability/availability through distributed transactions
• Improved performance
• Easier and more economical system expansion

Distributed DBMS Issues

• Distributed Database Design
  • How to distribute the database
• Query Processing
  • Optimize cost = data transmission + local processing
• Concurrency Control
  • Synchronization of concurrent accesses
  • Consistency and isolation of transactions' effects
  • Deadlock management
• Reliability
  • How to make the system resilient to failures
  • Atomicity and durability

Distributed database design

• Data distribution
  • Top-down: mostly used when designing systems from scratch
  • Bottom-up: used when the databases already exist at a number of sites
• Unit of distribution
  • Relation
  • Fragments of relations (sub-relations)
    • Data are often inherently fragmented, e.g. by locality
    • Allows concurrent execution of a number of transactions that access different portions of a relation

Example

Employee relation E (#, name, loc, sal, …)

40% of queries:
  Qa: select *
      from E
      where loc=Sa
      and ...

40% of queries:
  Qb: select *
      from E
      where loc=Sb
      and ...

Motivation: two sites, Sa and Sb. Queries Qa arrive at Sa and queries Qb arrive at Sb (Qa → Sa, Sb ← Qb).

Fragmentation Alternatives – Horizontal

PROJ1: projects with budgets less than $200,000
PROJ2: projects with budgets greater than or equal to $200,000

PROJ
PNO  PNAME              BUDGET  LOC
P1   Instrumentation    150000  Montreal
P2   Database Develop.  135000  New York
P3   CAD/CAM            250000  New York
P4   Maintenance        310000  Paris
P5   CAD/CAM            500000  Boston

PROJ1
PNO  PNAME              BUDGET  LOC
P1   Instrumentation    150000  Montreal
P2   Database Develop.  135000  New York

PROJ2
PNO  PNAME              BUDGET  LOC
P3   CAD/CAM            250000  New York
P4   Maintenance        310000  Paris
P5   CAD/CAM            500000  Boston

Fragmentation Alternatives – Vertical

PROJ1: information about project budgets
PROJ2: information about project names and locations

PROJ
PNO  PNAME              BUDGET  LOC
P1   Instrumentation    150000  Montreal
P2   Database Develop.  135000  New York
P3   CAD/CAM            250000  New York
P4   Maintenance        310000  Paris
P5   CAD/CAM            500000  Boston

PROJ1
PNO  BUDGET
P1   150000
P2   135000
P3   250000
P4   310000
P5   500000

PROJ2
PNO  PNAME              LOC
P1   Instrumentation    Montreal
P2   Database Develop.  New York
P3   CAD/CAM            New York
P4   Maintenance        Paris
P5   CAD/CAM            Boston


Data Fragmentation, Replication and Allocation

• Horizontal fragmentation
  • A horizontal subset of a relation that contains the tuples satisfying a selection condition.
  • E.g. Employee relation with selection condition (DNO = 5)
  • Can be specified by a σ_Ci(R) operation in the relational algebra.
• Complete horizontal fragmentation
  • A set of horizontal fragments whose conditions C1, C2, …, Cn include all the tuples in R: every tuple in R satisfies (C1 OR C2 OR … OR Cn).
  • Disjoint complete horizontal fragmentation: no tuple in R satisfies (Ci AND Cj) where i ≠ j.
  • How to reconstruct R from complete horizontal fragments?
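One way to read the question above: if the fragmentation is complete, the union of the fragments gives back R. The following is a minimal sketch under that reading, using the PROJ rows and the $200,000 budget split from the earlier slide. The helper names (`select`, `is_complete`, `is_disjoint`, `reconstruct`) are hypothetical and chosen only for illustration.

```python
# PROJ rows from the slides: (PNO, PNAME, BUDGET, LOC)
PROJ = [
    ("P1", "Instrumentation", 150000, "Montreal"),
    ("P2", "Database Develop.", 135000, "New York"),
    ("P3", "CAD/CAM", 250000, "New York"),
    ("P4", "Maintenance", 310000, "Paris"),
    ("P5", "CAD/CAM", 500000, "Boston"),
]

# Selection conditions C1 and C2 (budget below / at or above $200,000)
conditions = [
    lambda t: t[2] < 200000,
    lambda t: t[2] >= 200000,
]


def select(relation, condition):
    """sigma_C(R): keep the tuples that satisfy the condition."""
    return [t for t in relation if condition(t)]


def is_complete(relation, conds):
    """Every tuple satisfies at least one condition (C1 OR ... OR Cn)."""
    return all(any(c(t) for c in conds) for t in relation)


def is_disjoint(relation, conds):
    """No tuple satisfies two different conditions (Ci AND Cj, i != j)."""
    return all(sum(1 for c in conds if c(t)) <= 1 for t in relation)


def reconstruct(fragments):
    """Union of the fragments; a set drops any duplicated tuples."""
    return set().union(*map(set, fragments))


fragments = [select(PROJ, c) for c in conditions]
```

With a disjoint fragmentation the union never has duplicates to remove; the set only matters when fragments overlap.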

Three common horizontal partitioning techniques

• Round robin
• Hash partitioning
• Range partitioning

• Round robin

R     D0   D1   D2
t1    t1
t2         t2
t3              t3
t4    t4
...        t5

• Hash partitioning

R             D0   D1   D2
t1→h(k1)=2              t1
t2→h(k2)=0    t2
t3→h(k3)=0    t3
t4→h(k4)=1         t4
...

• Range partitioning

Partitioning vector: V0 = 4, V1 = 7

R          D0   D1   D2
t1: A=5         t1
t2: A=8              t2
t3: A=2    t3
t4: A=3    t4
...
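The three partitioning techniques can be sketched as small functions that assign tuples to disks D0, D1, D2. This is an illustrative sketch, not a definitive implementation. The function names are hypothetical, the hash function is passed in by the caller, and the range version assumes that a value equal to a boundary goes to the lower-numbered disk, which the slides do not specify.

```python
from bisect import bisect_left


def round_robin(tuples, n):
    """Send the i-th tuple (counting from 0) to disk i mod n."""
    disks = [[] for _ in range(n)]
    for i, t in enumerate(tuples):
        disks[i % n].append(t)
    return disks


def hash_partition(tuples, n, key, h):
    """Send each tuple to disk h(key(t)) mod n."""
    disks = [[] for _ in range(n)]
    for t in tuples:
        disks[h(key(t)) % n].append(t)
    return disks


def range_partition(tuples, vector, key):
    """Send each tuple to the disk whose range contains key(t).

    With vector [4, 7]: D0 holds keys <= 4, D1 holds 4 < key <= 7,
    D2 holds keys > 7 (boundary handling is an assumption).
    """
    disks = [[] for _ in range(len(vector) + 1)]
    for t in tuples:
        disks[bisect_left(vector, key(t))].append(t)
    return disks


# Range example from the slide: tuples as (name, A), vector (4, 7)
ranged = range_partition(
    [("t1", 5), ("t2", 8), ("t3", 2), ("t4", 3)], [4, 7], key=lambda t: t[1]
)
```

Round robin balances load evenly but gives no help in locating a tuple, while hash and range partitioning let a lookup on the partitioning key go to a single disk.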


Data Fragmentation, Replication and Allocation

• Vertical fragmentation
  • A vertical subset of a relation that contains a subset of columns.
  • E.g. Employee relation: a vertical fragment of Name, Bdate, Sex
  • Can be specified by a Π_Li(R) operation in the relational algebra.
  • Each fragment must include the primary key attribute of the parent relation Employee
• Complete vertical fragmentation
  • A set of vertical fragments whose projection lists L1, L2, …, Ln include all the attributes in R but share only the primary key of R.
  • L1 ∪ L2 ∪ ... ∪ Ln = ATTRS(R)
  • Li ∩ Lj = PK(R) for any i ≠ j
  • How to reconstruct R from complete vertical fragments?
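Because every vertical fragment carries the primary key, one way to answer the question above is to join the fragments on that key. The sketch below assumes the PROJ relation from the earlier slides with PNO as its primary key. The helper names (`project`, `is_complete_vertical`, `join_on_key`) are hypothetical.

```python
PROJ_ATTRS = ("PNO", "PNAME", "BUDGET", "LOC")
PROJ = [
    dict(zip(PROJ_ATTRS, row))
    for row in [
        ("P1", "Instrumentation", 150000, "Montreal"),
        ("P2", "Database Develop.", 135000, "New York"),
        ("P3", "CAD/CAM", 250000, "New York"),
        ("P4", "Maintenance", 310000, "Paris"),
        ("P5", "CAD/CAM", 500000, "Boston"),
    ]
]
PK = ("PNO",)
L1 = ("PNO", "BUDGET")         # projection list of PROJ1
L2 = ("PNO", "PNAME", "LOC")   # projection list of PROJ2


def project(relation, attrs):
    """Pi_L(R): keep only the listed attributes of each tuple."""
    return [{a: t[a] for a in attrs} for t in relation]


def is_complete_vertical(lists, attrs, pk):
    """Lists cover all attributes and pairwise share only the primary key."""
    covers = set().union(*lists) == set(attrs)
    share_only_pk = all(
        set(a) & set(b) == set(pk)
        for i, a in enumerate(lists)
        for b in lists[i + 1:]
    )
    return covers and share_only_pk


def join_on_key(left, right, pk):
    """Join two fragments on the primary key using a lookup table."""
    index = {tuple(t[k] for k in pk): t for t in right}
    joined = []
    for l in left:
        key = tuple(l[k] for k in pk)
        if key in index:
            joined.append({**l, **index[key]})
    return joined


PROJ1 = project(PROJ, L1)
PROJ2 = project(PROJ, L2)
rebuilt = join_on_key(PROJ1, PROJ2, PK)
```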


Data Fragmentation, Replication and Allocation

• Mixed (hybrid) fragmentation
  • A combination of vertical fragmentation and horizontal fragmentation.
  • This is achieved by SELECT-PROJECT operations, which are represented by Π_Li(σ_Ci(R)).
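As a small illustration of the SELECT-PROJECT composition, the sketch below builds one mixed fragment of PROJ: the names and locations of projects with a budget of at least $200,000. The fragment itself and the function names are hypothetical, picked only to show a selection followed by a projection.

```python
PROJ = [
    {"PNO": "P1", "PNAME": "Instrumentation", "BUDGET": 150000, "LOC": "Montreal"},
    {"PNO": "P2", "PNAME": "Database Develop.", "BUDGET": 135000, "LOC": "New York"},
    {"PNO": "P3", "PNAME": "CAD/CAM", "BUDGET": 250000, "LOC": "New York"},
    {"PNO": "P4", "PNAME": "Maintenance", "BUDGET": 310000, "LOC": "Paris"},
    {"PNO": "P5", "PNAME": "CAD/CAM", "BUDGET": 500000, "LOC": "Boston"},
]


def select(relation, condition):
    """sigma_C(R): horizontal cut."""
    return [t for t in relation if condition(t)]


def project(relation, attrs):
    """Pi_L(R): vertical cut."""
    return [{a: t[a] for a in attrs} for t in relation]


def mixed_fragment(relation, condition, attrs):
    """Pi_L(sigma_C(R)): select first, then project."""
    return project(select(relation, condition), attrs)


fragment = mixed_fragment(
    PROJ,
    condition=lambda t: t["BUDGET"] >= 200000,
    attrs=("PNO", "PNAME", "LOC"),  # the primary key PNO is kept in the fragment
)
```

The selection runs before the projection here because the condition uses BUDGET, which the projection list drops.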


Data Fragmentation, Replication and Allocation

• Fragmentation schema
  • A definition of a set of fragments (horizontal, vertical, or mixed) that can reconstruct the original database
• Allocation schema
  • Distribution of fragments to sites of the distributed database. The database can be fully replicated, partially replicated, or partitioned
• Data replication
  • Full replication: the database is replicated to all sites.
  • Partial replication: some selected part is replicated
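An allocation schema can be pictured as a mapping from each fragment to the sites that store a copy of it. The sketch below is one hypothetical representation, not something the slides prescribe. The site names, the sample allocation, and the `classify` function are all assumptions made for illustration.

```python
SITES = {"site1", "site2", "site3"}

# Hypothetical allocation schema: fragment name -> sites holding a copy
allocation = {
    "PROJ1": {"site1"},
    "PROJ2": {"site2", "site3"},
}


def classify(allocation, sites):
    """Label an allocation as fully replicated, partitioned, or partially replicated."""
    copies = list(allocation.values())
    if all(placed == set(sites) for placed in copies):
        return "full replication"      # every fragment is stored at every site
    if all(len(placed) == 1 for placed in copies):
        return "partitioned"           # every fragment is stored exactly once
    return "partial replication"       # some fragments have extra copies


kind = classify(allocation, SITES)
```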




Query Processing in Distributed Databases

• Cost of transferring data (files and results) over the network is usually high
• Example:
  • Employee at site 1 and Department at site 2
    • Employee at site 1: 10,000 rows. Row size = 100 bytes. Table size = 10^6 bytes.
    • Department at site 2: 100 rows. Row size = 35 bytes. Table size = 3,500 bytes.
  • Q submitted at site 3: retrieve the employee name and the name of the department where the employee works.
    • Π_Fname,Lname,Dname (Employee ⋈_Dno=Dnumber Department)
    • Result has 10,000 tuples and each result tuple is 40 bytes

Employee(Fname, Minit, Lname, SSN, Bdate, Address, Sex, Salary, Superssn, Dno)
Department(Dname, Dnumber, Mgrssn, Mgrstartdate)

Query Processing in Distributed Databases

• Strategies:
  1. Transfer Employee and Department to site 3.
     • Total transfer size?
  2. Transfer Employee to site 2, execute the join at site 2, and send the result to site 3.
     • Total transfer size?
  3. Transfer Department to site 1, execute the join at site 1, and send the result to site 3.
     • Total transfer size?
• Optimization criterion: minimizing data transfer.
• Which strategy?

Query Processing in Distributed Databases

• Strategies:
  1. Transfer Employee and Department to site 3.
     • Total transfer size = 1,000,000 + 3,500 = 1,003,500 bytes.
  2. Transfer Employee to site 2, execute the join at site 2, and send the result to site 3.
     • Query result size = 40 * 10,000 = 400,000 bytes. Total transfer size = 400,000 + 1,000,000 = 1,400,000 bytes.
  3. Transfer Department to site 1, execute the join at site 1, and send the result to site 3.
     • Total transfer size = 400,000 + 3,500 = 403,500 bytes.
• Optimization criterion: minimizing data transfer.
• Preferred approach: strategy 3.
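The arithmetic behind the three strategies can be checked with a few lines of code. This sketch uses only the table and result sizes given in the example; the variable names are illustrative.

```python
# Sizes from the example (bytes)
EMPLOYEE_BYTES = 10_000 * 100    # 10,000 rows of 100 bytes at site 1
DEPARTMENT_BYTES = 100 * 35      # 100 rows of 35 bytes at site 2
RESULT_BYTES = 10_000 * 40       # 10,000 result tuples of 40 bytes

# Bytes shipped over the network by each strategy when Q is submitted at site 3
strategies = {
    1: EMPLOYEE_BYTES + DEPARTMENT_BYTES,  # ship both tables to site 3
    2: EMPLOYEE_BYTES + RESULT_BYTES,      # Employee to site 2, result to site 3
    3: DEPARTMENT_BYTES + RESULT_BYTES,    # Department to site 1, result to site 3
}

# Pick the strategy that transfers the fewest bytes
best = min(strategies, key=strategies.get)
```

Strategy 3 wins because it ships the small Department table instead of the large Employee table.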


Query Processing in Distributed Databases

• What if Q is submitted at site 2?
• Example:
  • Employee at site 1 and Department at site 2
    • Employee at site 1: 10,000 rows. Row size = 100 bytes. Table size = 10^6 bytes.
    • Department at site 2: 100 rows. Row size = 35 bytes. Table size = 3,500 bytes.
  • Q submitted at site 2: retrieve the employee name and the name of the department where the employee works.
    • Π_Fname,Lname,Dname (Employee ⋈_Dno=Dnumber Department)
    • Result has 10,000 tuples and each result tuple is 40 bytes

Employee(Fname, Minit, Lname, SSN, Bdate, Address, Sex, Salary, Superssn, Dno)
Department(Dname, Dnumber, Mgrssn, Mgrstartdate)


Query Processing in Distributed Databases

• Semijoin:
  • Objective is to reduce the number of tuples in a relation before transferring it to another site.
• Example execution of Q:
  1. Project the join attributes of Department at site 2, and transfer them to site 1. For Q, 4 * 100 = 400 bytes are transferred.
  2. Join the transferred file with the Employee relation at site 1, and transfer the required attributes from the resulting file to site 2. For Q, 32 * 10,000 = 320,000 bytes are transferred.
  3. Execute the query by joining the transferred file with Department and present the result to the user at site 2.
• Semi-join
  • Left semi-join: R ⋉ S = Π_R(R ⋈ S).
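The three semijoin steps can be traced on a miniature data set. The rows below are hypothetical and far smaller than the 10,000-row example, and only the attributes that Q needs are shown. The byte counts at the end use the sizes stated in the steps above.

```python
# Hypothetical miniature data; attribute names follow the slide's schema
employee = [  # stored at site 1
    {"Fname": "Ann", "Lname": "Lee", "Salary": 50000, "Dno": 5},
    {"Fname": "Bob", "Lname": "Ray", "Salary": 42000, "Dno": 4},
    {"Fname": "Cy", "Lname": "Orr", "Salary": 38000, "Dno": 9},  # no matching department
]
department = [  # stored at site 2
    {"Dname": "Research", "Dnumber": 5},
    {"Dname": "Administration", "Dnumber": 4},
]

# Step 1 (site 2 -> site 1): ship only the join attribute of Department
dnumbers = {d["Dnumber"] for d in department}

# Step 2 (site 1 -> site 2): keep matching employees, ship only the needed attributes
reduced = [
    {"Fname": e["Fname"], "Lname": e["Lname"], "Dno": e["Dno"]}
    for e in employee
    if e["Dno"] in dnumbers
]

# Step 3 (site 2): finish the join locally against Department
dname_by_number = {d["Dnumber"]: d["Dname"] for d in department}
result = [(e["Fname"], e["Lname"], dname_by_number[e["Dno"]]) for e in reduced]

# Transfer cost for Q with the sizes given on the slide
step1_bytes = 4 * 100        # join attribute of 100 Department rows
step2_bytes = 32 * 10_000    # required attributes of 10,000 Employee rows
semijoin_bytes = step1_bytes + step2_bytes
```

The total of 320,400 bytes is well below the 1,000,000 bytes needed to ship the whole Employee table to site 2. In the slide's example every employee matches a department, so the saving comes from shipping fewer attributes rather than fewer tuples.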

Parallel Databases

• Parallel database
  • Using parallel processors
• Architectures
  • Shared memory
  • Shared disk
  • Shared nothing
    • Data partitioning (shard)
