CS377: Database Systems
Distributed Databases
Li Xiong
Department of Mathematics and Computer Science
Emory University
Centralized DBMS on a Network
[Figure: Sites 1 through 5 connected by a communication network]
Distributed DBMS Environment
[Figure: Sites 1 through 5 connected by a communication network]
Distributed Database System
• A distributed database (DDB) is a collection of multiple, logically interrelated databases distributed over a computer network.
• A distributed database management system (D-DBMS) is the software that manages the DDB and provides an access mechanism that makes this distribution transparent to the users.
• Distributed database system (DDBS) = DDB + D-DBMS
Distributed Database System
The EMPLOYEE, PROJECT, and WORKS_ON tables may be fragmented horizontally and stored with possible replication, as shown in the slide figure (not reproduced in this transcript).
Distributed DBMS Promises
• Transparent management of distributed, fragmented, and replicated data
• Improved reliability/availability through distributed transactions
• Improved performance
• Easier and more economical system expansion
Distributed DBMS Issues
• Distributed Database Design
  • How to distribute the database
• Query Processing
  • Optimize cost = data transmission + local processing
• Concurrency Control
  • Synchronization of concurrent accesses
  • Consistency and isolation of transactions' effects
  • Deadlock management
• Reliability
  • How to make the system resilient to failures
  • Atomicity and durability
Distributed database design
• Data distribution
  • Top-down: mostly used when designing systems from scratch
  • Bottom-up: used when the databases already exist at a number of sites
• Unit of distribution
  • Relation
  • Fragments of relations (sub-relations)
    • Data are inherently fragmented, e.g. by locality
    • Allows concurrent execution of a number of transactions that access different portions of a relation
Example
Employee relation E(#, name, loc, sal, …)
• 40% of queries: Qa: select * from E where loc = Sa and ...
• 40% of queries: Qb: select * from E where loc = Sb and ...
Motivation: two sites, Sa and Sb (Qa → Sa, Qb → Sb)
Fragmentation Alternatives: Horizontal
PROJ1: projects with budgets less than $200,000
PROJ2: projects with budgets greater than or equal to $200,000

PROJ
PNO  PNAME              BUDGET  LOC
P1   Instrumentation    150000  Montreal
P2   Database Develop.  135000  New York
P3   CAD/CAM            250000  New York
P4   Maintenance        310000  Paris
P5   CAD/CAM            500000  Boston

PROJ1
PNO  PNAME              BUDGET  LOC
P1   Instrumentation    150000  Montreal
P2   Database Develop.  135000  New York

PROJ2
PNO  PNAME              BUDGET  LOC
P3   CAD/CAM            250000  New York
P4   Maintenance        310000  Paris
P5   CAD/CAM            500000  Boston
Fragmentation Alternatives: Vertical
PROJ1: information about project budgets
PROJ2: information about project names and locations

PROJ
PNO  PNAME              BUDGET  LOC
P1   Instrumentation    150000  Montreal
P2   Database Develop.  135000  New York
P3   CAD/CAM            250000  New York
P4   Maintenance        310000  Paris
P5   CAD/CAM            500000  Boston

PROJ1
PNO  BUDGET
P1   150000
P2   135000
P3   250000
P4   310000
P5   500000

PROJ2
PNO  PNAME              LOC
P1   Instrumentation    Montreal
P2   Database Develop.  New York
P3   CAD/CAM            New York
P4   Maintenance        Paris
P5   CAD/CAM            Boston
Data Fragmentation, Replication and Allocation
• Horizontal fragmentation
  • A horizontal subset of a relation, containing the tuples that satisfy a selection condition.
  • E.g. Employee relation with selection condition (DNO = 5)
  • Can be specified by a σ_Ci(R) operation in the relational algebra.
• Complete horizontal fragmentation
  • A set of horizontal fragments whose conditions C1, C2, …, Cn include all the tuples in R: every tuple in R satisfies (C1 OR C2 OR … OR Cn).
  • Disjoint complete horizontal fragmentation: no tuple in R satisfies (Ci AND Cj) where i ≠ j.
  • How to reconstruct R from complete horizontal fragments?
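One way to answer the reconstruction question is with a small sketch. It reuses the PROJ tuples and the $200,000 budget split from the earlier fragmentation slide. The helper names (`fragment`, `is_complete`, `is_disjoint`, `reconstruct`) are illustrative assumptions, not part of the lecture. The sketch rebuilds R as the union of its fragments, which works when the fragmentation is complete.

```python
# PROJ tuples as listed on the horizontal fragmentation slide.
PROJ = [
    {"PNO": "P1", "PNAME": "Instrumentation", "BUDGET": 150000, "LOC": "Montreal"},
    {"PNO": "P2", "PNAME": "Database Develop.", "BUDGET": 135000, "LOC": "New York"},
    {"PNO": "P3", "PNAME": "CAD/CAM", "BUDGET": 250000, "LOC": "New York"},
    {"PNO": "P4", "PNAME": "Maintenance", "BUDGET": 310000, "LOC": "Paris"},
    {"PNO": "P5", "PNAME": "CAD/CAM", "BUDGET": 500000, "LOC": "Boston"},
]

# Selection conditions C1 and C2, one per fragment.
conditions = {
    "PROJ1": lambda t: t["BUDGET"] < 200000,
    "PROJ2": lambda t: t["BUDGET"] >= 200000,
}

def fragment(relation, conditions):
    """Apply sigma_Ci(R) for every condition Ci."""
    return {name: [t for t in relation if cond(t)]
            for name, cond in conditions.items()}

def is_complete(relation, conditions):
    """Every tuple satisfies C1 OR C2 OR ... OR Cn."""
    return all(any(c(t) for c in conditions.values()) for t in relation)

def is_disjoint(relation, conditions):
    """No tuple satisfies Ci AND Cj for i != j."""
    return all(sum(c(t) for c in conditions.values()) <= 1 for t in relation)

def reconstruct(fragments, key="PNO"):
    """Union of the fragments, with duplicates removed by primary key."""
    by_key = {}
    for tuples in fragments.values():
        for t in tuples:
            by_key[t[key]] = t
    return [by_key[k] for k in sorted(by_key)]

fragments = fragment(PROJ, conditions)
rebuilt = reconstruct(fragments)
```

Removing duplicates by key only matters when fragments overlap. For a disjoint fragmentation, plain concatenation already gives R.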
Three common horizontal partitioning techniques
• Round robin
• Hash partitioning
• Range partitioning
• Round robin
R     D0   D1   D2
t1    t1
t2         t2
t3              t3
t4    t4
...        t5
• Hash partitioning
R             D0   D1   D2
t1→h(k1)=2              t1
t2→h(k2)=0    t2
t3→h(k3)=0    t3
t4→h(k4)=1         t4
...
• Range partitioning
Partitioning vector: V0 = 4, V1 = 7
R          D0   D1   D2
t1: A=5         t1
t2: A=8              t2
t3: A=2    t3
t4: A=3    t4
...
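The three partitioning techniques above can be sketched in a few lines each. This is a minimal illustration, not a production partitioner. The function names are assumptions. The toy hash table copies the h(k) values from the slide. The slide does not say which disk receives a value equal to a vector entry, so the sketch assumes it goes to the lower partition.

```python
import bisect

def round_robin(tuples, n):
    """Send the i-th tuple to disk i mod n."""
    parts = [[] for _ in range(n)]
    for i, t in enumerate(tuples):
        parts[i % n].append(t)
    return parts

def hash_partition(tuples, n, key, h):
    """Send each tuple to disk h(key) mod n."""
    parts = [[] for _ in range(n)]
    for t in tuples:
        parts[h(key(t)) % n].append(t)
    return parts

def range_partition(tuples, vector, key):
    """Send each tuple to the range its key falls in.

    Assumption: a key equal to a vector entry goes to the lower partition.
    """
    parts = [[] for _ in range(len(vector) + 1)]
    for t in tuples:
        parts[bisect.bisect_left(vector, key(t))].append(t)
    return parts

# Round robin over three disks D0, D1, D2.
rr = round_robin(["t1", "t2", "t3", "t4", "t5"], 3)

# Hash partitioning with the slide's hash values as a toy lookup table.
toy_hash = {"k1": 2, "k2": 0, "k3": 0, "k4": 1}
hp = hash_partition(
    [("t1", "k1"), ("t2", "k2"), ("t3", "k3"), ("t4", "k4")],
    3, key=lambda t: t[1], h=toy_hash.get,
)

# Range partitioning on attribute A with partitioning vector (4, 7).
rp = range_partition(
    [("t1", 5), ("t2", 8), ("t3", 2), ("t4", 3)],
    [4, 7], key=lambda t: t[1],
)
```

Round robin balances load but gives no help in locating a tuple by value. Hash partitioning supports exact-match lookups on the key, and range partitioning also supports range scans.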
Data Fragmentation, Replication and Allocation
• Vertical fragmentation
  • A vertical subset of a relation that contains a subset of its columns.
  • E.g. Employee relation: a vertical fragment of Name, Bdate, Sex
  • Can be specified by a Π_Li(R) operation in the relational algebra.
  • Each fragment must include the primary key attribute of the parent relation (here, Employee).
• Complete vertical fragmentation
  • A set of vertical fragments whose projection lists L1, L2, …, Ln include all the attributes in R but share only the primary key of R.
  • L1 ∪ L2 ∪ ... ∪ Ln = ATTRS(R)
  • Li ∩ Lj = PK(R) for any i ≠ j
  • How to reconstruct R from complete vertical fragments?
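A common answer to the reconstruction question is to join the fragments on the shared primary key. The sketch below uses the PROJ tuples and the two projection lists from the vertical fragmentation slide. The helpers `project` and `join_on_key` are hypothetical names introduced for illustration.

```python
# PROJ tuples as listed on the vertical fragmentation slide.
PROJ = [
    {"PNO": "P1", "PNAME": "Instrumentation", "BUDGET": 150000, "LOC": "Montreal"},
    {"PNO": "P2", "PNAME": "Database Develop.", "BUDGET": 135000, "LOC": "New York"},
    {"PNO": "P3", "PNAME": "CAD/CAM", "BUDGET": 250000, "LOC": "New York"},
    {"PNO": "P4", "PNAME": "Maintenance", "BUDGET": 310000, "LOC": "Paris"},
    {"PNO": "P5", "PNAME": "CAD/CAM", "BUDGET": 500000, "LOC": "Boston"},
]

L1 = ["PNO", "BUDGET"]          # projection list of PROJ1
L2 = ["PNO", "PNAME", "LOC"]    # projection list of PROJ2

def project(relation, attrs):
    """Pi_L(R): keep only the listed attributes of every tuple."""
    return [{a: t[a] for a in attrs} for t in relation]

def join_on_key(left, right, key):
    """Equi-join two fragments on their shared primary key."""
    index = {t[key]: t for t in right}
    return [{**l, **index[l[key]]} for l in left if l[key] in index]

PROJ1 = project(PROJ, L1)
PROJ2 = project(PROJ, L2)

# Completeness conditions from the slide.
covers_all_attrs = set(L1) | set(L2) == set(PROJ[0])
shares_only_pk = set(L1) & set(L2) == {"PNO"}

rebuilt = join_on_key(PROJ1, PROJ2, "PNO")
```

The join is lossless here because every fragment carries the primary key, which is why the slide requires it in each projection list.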
Data Fragmentation, Replication and Allocation
• Mixed (hybrid) fragmentation
  • A combination of vertical fragmentation and horizontal fragmentation.
  • Achieved by a SELECT-PROJECT operation, represented by Π_Li(σ_Ci(R)).
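As a small illustration of a mixed fragment, the sketch below applies a selection and then a projection to the PROJ tuples from the earlier slides. The particular choice of Ci (BUDGET ≥ 200000) and Li (PNO, PNAME, LOC) is an assumption made for this example and does not appear as a mixed fragment in the lecture.

```python
# PROJ tuples as listed on the fragmentation slides.
PROJ = [
    {"PNO": "P1", "PNAME": "Instrumentation", "BUDGET": 150000, "LOC": "Montreal"},
    {"PNO": "P2", "PNAME": "Database Develop.", "BUDGET": 135000, "LOC": "New York"},
    {"PNO": "P3", "PNAME": "CAD/CAM", "BUDGET": 250000, "LOC": "New York"},
    {"PNO": "P4", "PNAME": "Maintenance", "BUDGET": 310000, "LOC": "Paris"},
    {"PNO": "P5", "PNAME": "CAD/CAM", "BUDGET": 500000, "LOC": "Boston"},
]

def select(relation, condition):
    """sigma_C(R): keep the tuples that satisfy the condition."""
    return [t for t in relation if condition(t)]

def project(relation, attrs):
    """Pi_L(R): keep only the listed attributes."""
    return [{a: t[a] for a in attrs} for t in relation]

# Pi_Li(sigma_Ci(PROJ)) with an assumed Ci and Li.
mixed = project(
    select(PROJ, lambda t: t["BUDGET"] >= 200000),
    ["PNO", "PNAME", "LOC"],
)
```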
Data Fragmentation, Replication and Allocation
• Fragmentation schema
  • A definition of a set of fragments (horizontal, vertical, or mixed) from which the original database can be reconstructed.
• Allocation schema
  • Describes the distribution of fragments to the sites of the distributed database. Data can be fully replicated, partially replicated, or partitioned.
• Data replication
  • Full replication: the database is replicated to all sites.
  • Partial replication: some selected part is replicated.
Query Processing in Distributed Databases
• The cost of transferring data (files and results) over the network is usually high.
• Example:
  • Employee at site 1: 10,000 rows, row size = 100 bytes, table size = 10^6 bytes.
  • Department at site 2: 100 rows, row size = 35 bytes, table size = 3,500 bytes.
  • Q submitted at site 3: retrieve the employee name and the name of the department where the employee works.
    • Π_{Fname,Lname,Dname}(Employee ⋈_{Dno=Dnumber} Department)
    • The result has 10,000 tuples, and each result tuple is 40 bytes.
Employee(Fname, Minit, Lname, SSN, Bdate, Address, Sex, Salary, Superssn, Dno)
Department(Dname, Dnumber, Mgrssn, Mgrstartdate)
Query Processing in Distributed Databases
• Strategies:
  1. Transfer Employee and Department to site 3.
     • Total transfer size?
  2. Transfer Employee to site 2, execute the join at site 2, and send the result to site 3.
     • Total transfer size?
  3. Transfer Department to site 1, execute the join at site 1, and send the result to site 3.
     • Total transfer size?
• Optimization criterion: minimizing data transfer.
  • Which strategy?
Query Processing in Distributed Databases
• Strategies:
  1. Transfer Employee and Department to site 3.
     • Total transfer size = 1,000,000 + 3,500 = 1,003,500 bytes.
  2. Transfer Employee to site 2, execute the join at site 2, and send the result to site 3.
     • Query result size = 40 * 10,000 = 400,000 bytes. Total transfer size = 400,000 + 1,000,000 = 1,400,000 bytes.
  3. Transfer Department to site 1, execute the join at site 1, and send the result to site 3.
     • Total transfer size = 400,000 + 3,500 = 403,500 bytes.
• Optimization criterion: minimizing data transfer.
  • Preferred approach: strategy 3.
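The arithmetic behind the three strategies can be checked with a short sketch. The row counts and sizes come from the example. The variable names and the dictionary layout are illustrative assumptions. Only bytes shipped over the network are counted, matching the stated optimization criterion.

```python
# Sizes taken from the example (bytes).
employee_bytes = 10_000 * 100      # Employee at site 1
department_bytes = 100 * 35        # Department at site 2
result_bytes = 10_000 * 40         # join result, 40 bytes per tuple

# Bytes shipped over the network when Q is submitted at site 3.
transfer = {
    1: employee_bytes + department_bytes,   # ship both relations to site 3
    2: employee_bytes + result_bytes,       # Employee to site 2, result to site 3
    3: department_bytes + result_bytes,     # Department to site 1, result to site 3
}

# Pick the strategy with the least data transfer.
best = min(transfer, key=transfer.get)

for strategy, size in transfer.items():
    print(f"strategy {strategy}: {size:,} bytes")
print("preferred strategy:", best)
```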
Query Processing in Distributed Databases
• What if Q is submitted at site 2?
• Example:
  • Employee at site 1: 10,000 rows, row size = 100 bytes, table size = 10^6 bytes.
  • Department at site 2: 100 rows, row size = 35 bytes, table size = 3,500 bytes.
  • Q submitted at site 2: retrieve the employee name and the name of the department where the employee works.
    • Π_{Fname,Lname,Dname}(Employee ⋈_{Dno=Dnumber} Department)
    • The result has 10,000 tuples, and each result tuple is 40 bytes.
Employee(Fname, Minit, Lname, SSN, Bdate, Address, Sex, Salary, Superssn, Dno)
Department(Dname, Dnumber, Mgrssn, Mgrstartdate)
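One way to explore the question is to cost the two whole-relation strategies when the result is needed at site 2. The slide does not spell these strategies out, so the two options below are assumptions that follow the same reasoning as the site 3 case. The sizes are the ones given in the example.

```python
# Sizes taken from the example (bytes).
employee_bytes = 10_000 * 100
department_bytes = 100 * 35
result_bytes = 10_000 * 40

# Assumed options when Q is submitted at site 2, where Department lives.
transfer_at_site2 = {
    # Ship Employee to site 2 and join there; the result is already local.
    "ship Employee to site 2": employee_bytes,
    # Ship Department to site 1, join there, ship the result back to site 2.
    "ship Department to site 1, result to site 2": department_bytes + result_bytes,
}

best_at_site2 = min(transfer_at_site2, key=transfer_at_site2.get)
```

Under these assumptions, shipping the small relation and returning the result still moves fewer bytes than shipping Employee whole.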
Query Processing in Distributed Databases
• Semijoin:
  • The objective is to reduce the number of tuples in a relation before transferring it to another site.
• Example execution of Q:
  1. Project the join attributes of Department at site 2, and transfer them to site 1. For Q, 4 * 100 = 400 bytes are transferred.
  2. Join the transferred file with the Employee relation at site 1, and transfer the required attributes from the resulting file to site 2. For Q, 32 * 10,000 = 320,000 bytes are transferred.
  3. Execute the query by joining the transferred file with Department, and present the result to the user at site 2.
• Semi-join
  • Left semi-join: R ⋉ S = Π_R(R ⋈ S), where Π_R projects onto the attributes of R.
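The three semijoin steps can be traced on a tiny data set. The employee and department rows below are made-up sample data, and the variable names are hypothetical. The byte counts at the end simply repeat the slide's figures for Q.

```python
# Made-up sample data; only the attributes that Q needs are shown.
department = [  # stored at site 2
    {"Dnumber": 1, "Dname": "Research"},
    {"Dnumber": 2, "Dname": "Sales"},
]
employee = [    # stored at site 1
    {"Fname": "Ann", "Lname": "Lee", "Dno": 1},
    {"Fname": "Bob", "Lname": "Ray", "Dno": 2},
    {"Fname": "Cy", "Lname": "Orr", "Dno": 9},   # matches no department
]

# Step 1 (site 2 -> site 1): ship only the join attribute of Department.
join_values = {d["Dnumber"] for d in department}

# Step 2 (site 1 -> site 2): keep matching employees and only the
# attributes Q needs. This is the semijoin Employee ⋉ Department.
reduced = [
    {"Fname": e["Fname"], "Lname": e["Lname"], "Dno": e["Dno"]}
    for e in employee
    if e["Dno"] in join_values
]

# Step 3 (at site 2): finish the join against the local Department.
dept_by_number = {d["Dnumber"]: d for d in department}
result = [
    {"Fname": e["Fname"], "Lname": e["Lname"],
     "Dname": dept_by_number[e["Dno"]]["Dname"]}
    for e in reduced
]

# Transfer sizes for Q as given on the slide (bytes).
semijoin_bytes = 4 * 100 + 32 * 10_000
```

With the slide's numbers the semijoin plan ships 320,400 bytes. The saving is larger when few tuples of the big relation take part in the join.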
Parallel Databases
• Parallel database
  • Uses parallel processors
• Architectures
  • Shared memory
  • Shared disk
  • Shared nothing
    • Data partitioning (sharding)