Date posted: 19-Dec-2015

Distributed Database Systems

COP5711

What is a Distributed Database System ?

A distributed database is a collection of databases that are distributed over different computers of a computer network.

• Each site has autonomous processing capability and can perform local applications.

• Each site also participates in the execution of at least one global application, which requires accessing data at several sites.

[Figure: Server 1, Server 2, and Server 3, each with its own database (Database 1, Database 2, Database 3), connected by a communication network.]

Multiprocessor Database Computers

[Figure: an application (front-end) computer connected through an interface processor (IFP) to several access processors (ACPs).]

What is missing here is the existence of local applications: the system is so tightly integrated that none of the computers (i.e., the IFPs and ACPs) can execute an application by itself.

Why Distributed Databases ?

1. Local Autonomy: permits setting and enforcing local policies regarding the use of local data (suitable for organizations that are inherently decentralized).

2. Improved Performance: regularly used data is kept close to its users, and queries can exploit the parallelism inherent in distributed systems.

3. Improved Reliability/Availability: Data replication can be used to obtain higher reliability and availability. The autonomous processing capability of the different sites ensures graceful degradation.

4. Incremental Growth: supports a smooth incremental growth with a minimum degree of impact on the already existing sites.

5. Shareability: allows preexisting sites to share data.

6. Reduced Communication Overhead: The fact that many applications are local clearly reduces the communication overhead with respect to centralized databases.

Disadvantages of DDBSs

Cost: replication of effort (manpower).

Security: more difficult to control.

Complexity:

• Data may be duplicated, mainly for reliability and efficiency reasons. Data redundancy, however, complicates update operations.

• If some sites fail while an update is being executed, the system must make sure that the effects of the update are reflected on the data residing at the failed sites as soon as the system recovers from the failure.

• The synchronization of transactions on multiple sites is considerably harder than for a centralized system.

Distributed DBMS Architecture

Network Transparency

• The user should be protected from the operational details of the network.

• It is desirable to hide even the existence of the network, if possible.

Location transparency: the command used is independent of the system on which the data is stored.

Naming transparency: a unique name is provided for each object in the database.

Replication & Fragmentation Transparency

• The user is unaware of the replication of fragments.

• Queries are specified on the relations (rather than the fragments).

[Figure: relation R is divided into fragments R1, R2, R3, and R4; fragments R1 and R2 each have two copies (Copy 1 and Copy 2), and the fragments are stored across Site A, Site B, and Site C.]

ANSI/SPARC Architecture

[Figure: three levels. External schema: several external views. Conceptual schema: the conceptual view. Internal schema: the internal view.]

Internal view: deals with the physical definition and organization of data.

Conceptual view: abstract definition of the database. It is the “real world” view of the enterprise being modeled in the database.

External view: individual user’s view of the database.

A Taxonomy of Distributed Data Systems

Distributed data systems
  • Homogeneous
  • Heterogeneous (Multidatabase)
      • Unfederated (no local users)
      • Federated
          • Loosely coupled (interoperable DB systems using export schema)
          • Tightly coupled (with global schema)

A distributed database can be defined as
• a logically integrated collection of shared data which is
• physically distributed across the nodes of a computer network.

Architecture of a Homogeneous DDBMS

[Figure: global user views 1 to n are defined on the global schema; the fragmentation schema and the allocation schema map the global schema onto local conceptual schemas 1 to n, each with its own local internal schema and local DB.]

A homogeneous DDBMS resembles a centralized DB, but instead of storing all the data at one site, the data is distributed across a number of sites in a network.

Fragmentation Schema & Allocation Schema

Fragmentation Schema: describes how the global relations are divided into fragments.

Allocation Schema: specifies at which sites each fragment is stored.

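Example: Fragmentation of global relation R into fragments A, B, C, D, and E.

[Figure: R is laid out as three rows of fragments: A beside B, C beside D, and E alone.]

To materialize R, the following operations are required: $R = (A \bowtie B) \cup (C \bowtie D) \cup E$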

Homogeneous vs. Heterogeneous

• Homogeneous DDBMS
  – No local users
  – Most systems do not have local schemas (i.e., every user uses the same schema)

• Heterogeneous DDBMS
  – There are both local and global users
  – Multidatabase systems are split into:
      • Tightly Coupled Systems: have a global schema
      • Loosely Coupled Systems: do not have a global schema.

[Figure: a multidatabase management system sits on top of four DBMSs, which manage Database 1 through Database 4. A global user accesses the multidatabase management system; local users access an individual DBMS directly.]

Schema Architecture of a Tightly-Coupled System

[Figure: global user views 1 to n are defined on the global conceptual schema. At each site, a local participation schema and an auxiliary schema connect the global conceptual schema to the local conceptual schema, which also supports local user views and is mapped through the local internal schema to the local DB.]

An individual node’s participation in the MDB is defined by means of a participation schema.

Auxiliary Schema (1)

Rules for unit conversion: may be required when one site expresses distance in kilometers and another in miles, …

Rules for handling null values: may be necessary where one site stores additional information which is not stored at another site.
– Example: One site stores the name, home address and telephone number of its employees, whereas another just stores names and addresses.

The auxiliary schema describes the rules which govern the mappings between the local and global levels.

Auxiliary Schema (2)

Rules for naming conflicts: naming conflicts occur when:

– semantically identical data items are named differently
  • DNAME: Department name (at Site 1)
  • DEPTNAME: Department name (at Site 2)

– semantically different data items are named identically
  • NAME: Department name (at Site 1)
  • NAME: Manager name (at Site 2)

Rules for handling data representation conflicts: such conflicts occur when semantically identical data items are represented differently in different data sources.
– Example: Data represented as a character string in one database may be represented as a real number in the other database.

Auxiliary Schema (3)

Rules for handling data scaling conflicts: such conflicts occur when semantically identical data items are stored in different databases using different units of measure.
– Example: “Large”, “New”, “Good”, etc.

These problems are called domain mismatch problems.
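The auxiliary-schema rules above can be pictured as small mapping tables. The following is only a sketch under stated assumptions: the site names, the attribute names other than DNAME and DEPTNAME, and the function `to_global` are hypothetical and chosen for illustration.

```python
# Hypothetical auxiliary-schema rules for two sites; all names are illustrative.
KM_PER_MILE = 1.609344

# Naming rules: local attribute name -> global attribute name, per site.
RENAME = {
    "site1": {"DNAME": "DEPT_NAME", "DIST_KM": "DISTANCE_KM"},
    "site2": {"DEPTNAME": "DEPT_NAME", "DIST_MILES": "DISTANCE_KM"},
}

# Unit-conversion rules: (site, local attribute) -> conversion function.
CONVERT = {("site2", "DIST_MILES"): lambda miles: miles * KM_PER_MILE}

# Attributes of the (assumed) global schema.
GLOBAL_ATTRS = ("DEPT_NAME", "DISTANCE_KM", "PHONE")


def to_global(site, local_row):
    """Map one local tuple (a dict) to the global schema."""
    # Null-value rule: attributes a site does not store become None.
    out = {attr: None for attr in GLOBAL_ATTRS}
    for name, value in local_row.items():
        convert = CONVERT.get((site, name), lambda v: v)
        out[RENAME[site][name]] = convert(value)
    return out


row1 = to_global("site1", {"DNAME": "Sales", "DIST_KM": 10.0})
row2 = to_global("site2", {"DEPTNAME": "Sales", "DIST_MILES": 10.0})
print(row1)
print(row2)
```

Keeping the rules as data rather than code mirrors the idea that the auxiliary schema is a description of mappings, separate from the local and global schemas themselves.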

Loosely-Coupled Systems(Interoperable Database Systems)

[Figure: global user views 1 to 3 are defined over the local conceptual schemas 1 to n; each local conceptual schema maps to a local internal schema and a local DB (Local DB 1 to Local DB n), and local user views are defined on the local conceptual schemas.]

Loosely-Coupled Systems

[Figure: global user views 1 to m are defined over export schemas (1, 2, 3, …, n), which are derived from the local conceptual schemas 1 to n; each local conceptual schema maps to a local internal schema and a local DB (Local DB 1 to Local DB n), and local user views are defined on the local conceptual schemas.]

Integration of Heterogeneous Data Models

• Provide bidirectional translators between all pairs of models
  – Advantage: supports multiple models at the global level. No need to learn another data model and language.
  – Disadvantage: requires n(n-1) translators, where n is the number of different models.

• Adopt a single model (called the canonical model) at the global level and map all the local models onto this model
  – Advantage: requires only 2n translators
  – Disadvantage: translations must go through the global model.

(The 2nd approach is more widely used.)
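A quick way to see how the two translator counts diverge is to tabulate them. This is a minimal sketch; the function names are hypothetical.

```python
def pairwise_translators(n):
    """One translator per ordered pair of distinct models: n(n-1)."""
    return n * (n - 1)


def canonical_translators(n):
    """One translator to and one from the canonical model per local model: 2n."""
    return 2 * n


for n in (2, 3, 5, 10):
    print(n, pairwise_translators(n), canonical_translators(n))
```

The two counts coincide at n = 3; beyond that, the pairwise approach grows quadratically while the canonical-model approach grows linearly.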

Distributed Database Design

•Top-Down Approach: The database system is being designed from scratch.

• Issues: fragmentation & allocation

•Bottom-up Approach: Integrating existing databases into one database

• Issues: Design of the export and global schemas.

TOP-DOWN DESIGN PROCESS

1. Requirements Analysis: produces the system requirements (objectives).

2. Conceptual Design (entity analysis + functional analysis) and View Design (defining the interfaces for end users), linked by view integration: produce the global conceptual schema, the access information, and the external schema definitions.

3. Distribution Design (fragmentation & allocation): produces the local conceptual schemas.

4. Physical Design (maps the local conceptual schemas to physical storage devices): produces the physical schema.

Design Consideration (1)

The organization of distributed systems can be investigated along three dimensions:

Level of sharing

1. No sharing: Each application and its data execute at one site.

2. Data sharing: Programs are replicated at all sites, but data files are not.

3. Data + Program Sharing: Both data and programs may be shared.

Access Pattern

1. Static: Access patterns do not change.

2. Dynamic: Access patterns change over time.

Level of Knowledge

1. No information

2. Partial information: Access patterns may deviate from the predictions.

3. Complete information: Access patterns can reasonably be predicted.

Design Consideration (2)

Fragmentation Alternatives

Horizontal Partitioning

Relation J:

JNO  JNAME          BUDGET   LOC
J1   Instrumental   150,000  Montreal
J2   Database Dev.  135,000  New York
J3   CAD/CAM        250,000  New York
J4   Maintenance    350,000  Paris

Fragment J1:

JNO  JNAME          BUDGET   LOC
J1   Instrumental   150,000  Montreal
J2   Database Dev.  135,000  New York

Fragment J2:

JNO  JNAME          BUDGET   LOC
J3   CAD/CAM        250,000  New York
J4   Maintenance    350,000  Paris

Vertical Partitioning

Fragment 1:

JNO  BUDGET
J1   150,000
J2   135,000
J3   250,000
J4   350,000

Fragment 2:

JNO  JNAME          LOC
J1   Instrumental   Montreal
J2   Database Dev.  New York
J3   CAD/CAM        New York
J4   Maintenance    Paris

Why fragment at all?

Reasons:
• Interquery concurrency
• Intraquery concurrency

Disadvantages:
• Vertical fragmentation may incur overhead.
• Attributes participating in a dependency may be allocated to different sites. Integrity checking is then more costly.

Degree of Fragmentation

• Application views are usually subsets of relations. Hence, it is only natural to consider subsets of relations as distribution units.

• The appropriate degree of fragmentation is dependent on the applications.

Correctness Rules

• Vertical Partitioning
  • Lossless decomposition
  • Dependency preservation

• Horizontal Partitioning
  • Disjoint fragments
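The correctness rules can be checked mechanically on a small relation. The sketch below uses the tuples of J from the fragmentation example; the budget threshold used for the horizontal split and all helper names are assumptions made for illustration.

```python
# Relation J from the fragmentation example, one dict per tuple.
J = [
    {"JNO": "J1", "JNAME": "Instrumental", "BUDGET": 150_000, "LOC": "Montreal"},
    {"JNO": "J2", "JNAME": "Database Dev.", "BUDGET": 135_000, "LOC": "New York"},
    {"JNO": "J3", "JNAME": "CAD/CAM", "BUDGET": 250_000, "LOC": "New York"},
    {"JNO": "J4", "JNAME": "Maintenance", "BUDGET": 350_000, "LOC": "Paris"},
]


def horizontal(relation, predicate):
    """Horizontal fragment: the tuples that satisfy the predicate."""
    return [t for t in relation if predicate(t)]


def vertical(relation, attrs):
    """Vertical fragment: the projection onto attrs (key included by the caller)."""
    return [{a: t[a] for a in attrs} for t in relation]


def join_on_key(left, right, key):
    """Natural join of two vertical fragments on their shared key."""
    index = {t[key]: t for t in right}
    return [{**t, **index[t[key]]} for t in left if t[key] in index]


def same_relation(a, b):
    """Compare two relations as sets of tuples, ignoring order."""
    by_key = lambda t: t["JNO"]
    return sorted(a, key=by_key) == sorted(b, key=by_key)


# Assumed split point: 200,000 (it separates J1, J2 from J3, J4).
h1 = horizontal(J, lambda t: t["BUDGET"] < 200_000)
h2 = horizontal(J, lambda t: t["BUDGET"] >= 200_000)
v1 = vertical(J, ["JNO", "BUDGET"])
v2 = vertical(J, ["JNO", "JNAME", "LOC"])

print(same_relation(h1 + h2, J))                    # union reconstructs J
print(not any(t in h2 for t in h1))                 # fragments are disjoint
print(same_relation(join_on_key(v1, v2, "JNO"), J)) # lossless join on the key
```

Both vertical fragments carry the key JNO, which is what makes the join lossless.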

Allocation Alternatives

•Partitioning: No replication

•Partial Replication: Some fragments are replicated

•Full Replication: Database exists in its entirety at each site

Notations

S (TITLE, SAL)
E (ENO, ENAME, TITLE)
J (JNO, JNAME, BUDGET, LOC)
G (ENO, JNO, RESP, DUR)

Links: L1 from S to E; L2 from E to G; L3 from J to G.

L1: 1-to-many relationship

S: Owner(L1), Source relation

E: Member(L1), Target relation

Simple Predicates

Given a relation R(A1, A2, …, An) where Ai has domain Di, a simple predicate pj defined on R has the form

pj: Ai θ Value

where θ ∈ {=, <, ≠, ≤, >, ≥} and Value ∈ Di.

Example:

Relation J:

JNO  JNAME          BUDGET   LOC
J1   Instrumental   150,000  Montreal
J2   Database Dev.  135,000  New York
J3   CAD/CAM        250,000  New York
J4   Maintenance    350,000  Orlando

Simple predicates:
p1: JNAME = “Maintenance”
p2: BUDGET < 200,000

Note: A simple predicate defines a data fragment.

MINTERM PREDICATE

Given a set of simple predicates for relation R,

P = {p1, p2, …, pm},

the set of minterm predicates M = {m1, m2, …, mn} is defined as

$M = \{\, m_i \mid m_i = \bigwedge_{p_j \in P} p_j^{*} \,\}$

where $p_j^{*} = p_j$ or $p_j^{*} = \neg p_j$.

TITLE         SAL
Elect. Eng.   40,000
Syst. Analy.  54,000
Mech. Eng.    32,000
Programmer    42,000

Possible simple predicates:

p1: TITLE = “Elect. Eng.”
p2: TITLE = “Syst. Analy.”
p3: TITLE = “Mech. Eng.”
p4: TITLE = “Programmer”
p5: SAL ≤ 35,000
p6: SAL > 35,000

Some corresponding minterm predicates:

m1: TITLE = “Elect. Eng.” ∧ SAL ≤ 30,000
m2: TITLE = “Elect. Eng.” ∧ SAL > 30,000

A minterm predicate defines a data fragment.

Primary Horizontal Fragmentation

A primary horizontal fragmentation is defined by a selection operation on the owner relations of a database schema.

E (ENO, ENAME, TITLE)
J (JNO, JNAME, BUDGET, LOC)
G (ENO, JNO, RESP, DUR)

Links: L2 from E to G; L3 from J to G.

Owner(L3) = J

A possible fragmentation of J is defined as follows:

$J_1 = \sigma_{BUDGET \le 200{,}000}(J)$

$J_2 = \sigma_{BUDGET > 200{,}000}(J)$

Horizontal Fragments

Thus, a horizontal fragment Ri of relation R consists of all the tuples of R that satisfy a minterm predicate mi.

There are as many horizontal fragments (also called minterm fragments) as there are minterm predicates.
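As an illustration of how minterm predicates induce horizontal fragments, the sketch below enumerates every minterm over the two simple predicates from the earlier example (p1: JNAME = “Maintenance”, p2: BUDGET < 200,000) and collects the tuples of J that satisfy each one. The function name and the label syntax (`~` for negation, `&` for conjunction) are assumptions.

```python
from itertools import product

# Relation J from the simple-predicate example: (JNO, JNAME, BUDGET, LOC).
J = [
    ("J1", "Instrumental", 150_000, "Montreal"),
    ("J2", "Database Dev.", 135_000, "New York"),
    ("J3", "CAD/CAM", 250_000, "New York"),
    ("J4", "Maintenance", 350_000, "Orlando"),
]

simple_predicates = {
    "p1": lambda t: t[1] == "Maintenance",
    "p2": lambda t: t[2] < 200_000,
}


def minterm_fragments(relation, predicates):
    """Return {minterm label: tuples satisfying that minterm}."""
    names = list(predicates)
    fragments = {}
    # Each minterm takes every simple predicate either as-is or negated.
    for signs in product((True, False), repeat=len(names)):
        label = " & ".join(n if s else "~" + n for n, s in zip(names, signs))
        fragments[label] = [
            t for t in relation
            if all(predicates[n](t) == s for n, s in zip(names, signs))
        ]
    return fragments


fragments = minterm_fragments(J, simple_predicates)
for label, tuples in fragments.items():
    print(label, [t[0] for t in tuples])
```

With two simple predicates there are four minterms; here one of them selects no tuples of the current instance, and every tuple falls into exactly one fragment.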

Completeness (1)

A set of simple predicates Pr is said to be complete if and only if there is an equal probability of access by every application to any two tuples belonging to any minterm fragment that is defined according to Pr.

[Figure: the simple predicates A1 ≥ k1, A2 = k2, A3 ≤ k3, and A4 = k4 define minterm fragments F1, F2, and F3, which are accessed by applications A1 to A4 with probabilities p1 and p3. The set is complete: the fragments look homogeneous.]

Completeness (2)

[Figure: the same simple predicates, minterm fragments F1 to F3, and applications A1 to A4, but two parts of F3 (F31 and F32) are accessed with different probabilities, p4 and p5. The set of simple predicates is incomplete.]

Completeness (3)

[Figure: an additional simple predicate, A5 > k5, is added to A1 ≥ k1, A2 = k2, A3 ≤ k3, and A4 = k4, so that the parts of F3 accessed with probabilities p4 and p5 become separate minterm fragments. The set is now complete.]

Completeness (4)

A set of simple predicates Pr is said to be complete if and only if there is an equal probability of access by every application to any two tuples belonging to any minterm fragment that is defined according to Pr.

Case 1: The only application that accesses J wants to access the tuples according to the location.

The set of simple predicates

Pr = {LOC = “Montreal”, LOC = “New York”, LOC = “Orlando”}

is complete because each tuple of each fragment has the same probability of being accessed. The corresponding fragments are:

$J_1 = \sigma_{LOC = \text{"Montreal"}}(J)$

$J_2 = \sigma_{LOC = \text{"New York"}}(J)$

$J_3 = \sigma_{LOC = \text{"Orlando"}}(J)$

Completeness (5)

Example:

Fragment J1 (LOC = “Montreal”):

JNO  JNAME          BUDGET   LOC
001  Instrumental   150,000  Montreal

Fragment J2 (LOC = “New York”):

JNO  JNAME          BUDGET   LOC
004  GUI            135,000  New York
007  CAD/CAM        250,000  New York

Fragment J3 (LOC = “Orlando”):

JNO  JNAME          BUDGET   LOC
003  Database Dev.  310,000  Orlando

Case 2: There is a second application which accesses only those project tuples where the budget is less than $200,000.

Since tuple “004” is accessed more frequently than tuple “007”, Pr = {LOC = “Montreal”, LOC = “New York”, LOC = “Orlando”} is not complete.

To make the set complete, we need to add (BUDGET < 200,000) to Pr.

[Figure: J is split by location into J1 (LOC = “Montreal”), J2 (LOC = “New York”), and J3 (LOC = “Orlando”); each of these is split by budget (BUDGET <= 200,000 and BUDGET > 200,000) into J11 and J12, J21 and J22, and J31 and J32.]

Completeness (6)

[Figure label: small-budget applications.]

Note: Completeness is a desirable property because a complete set defines fragments that are not only logically uniform, in that they all satisfy the minterm predicate, but also statistically homogeneous.

Redundant Fragmentation

• Fragments 1 and 2 have the same characteristics.

• The fragmentation is unnecessary.

[Figure: a logically uniform and statistically homogeneous fragment split into Fragment 1 and Fragment 2.]

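Minimality

Relevant:

Let mi and mj be two almost identical minterm predicates:

mi = p1 ∧ p2 ∧ p3, defining fragment fi

mj = p1 ∧ ¬p2 ∧ p3, defining fragment fj

p2 is relevant if and only if

$\frac{acc(m_i)}{card(f_i)} \neq \frac{acc(m_j)}{card(f_j)}$

where acc is the access frequency and card is the cardinality.

[Figure: a fragment f defined by p1 and p3 is split by p2 and ¬p2 into fi and fj; an application A accesses fi with probability Prob1 and fj with probability Prob2, where Prob1 ≠ Prob2.]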


That is, there should be at least one application that accesses fi and fj differently.

i.e., The simple predicate pi should be relevant in determining a fragmentation.

Minimal: If all the predicates of a set Pr are relevant, Pr is minimal.
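The relevance test compares access frequency per tuple in the two fragments. The following is a small sketch of that comparison; the function name and the numbers in the usage lines are hypothetical.

```python
from fractions import Fraction


def is_relevant(acc_mi, card_fi, acc_mj, card_fj):
    """p is relevant iff acc(mi)/card(fi) != acc(mj)/card(fj).

    Arguments are integers; cardinalities must be positive.
    Fractions avoid floating-point rounding in the comparison.
    """
    return Fraction(acc_mi, card_fi) != Fraction(acc_mj, card_fj)


# Hypothetical numbers: same per-tuple access rate, so the split adds nothing.
print(is_relevant(30, 10, 60, 20))
# Hypothetical numbers: fi is accessed far more often per tuple than fj.
print(is_relevant(30, 10, 5, 20))
```

A set Pr is then minimal when this test succeeds for every predicate in it.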

A Complete and Minimal Example

Two applications:

1. One application accesses the tuples according to location.

2. Another application accesses only those project tuples where the budget is less than $200,000.

Case 1: Pr = {LOC = “Montreal”, LOC = “New York”, LOC = “Orlando”, BUDGET <= 200,000, BUDGET > 200,000} is complete and minimal.

Case 2: If, however, we were to add the predicate JNAME = “Instrumentation” to Pr, the resulting set would not be minimal since the new predicate is not relevant with respect to the applications.

[Figure: J is split by location into J1 (LOC = “Montreal”), J2 (LOC = “New York”), and J3 (LOC = “Orlando”), and each of these by budget (BUDGET <= 200,000 and BUDGET > 200,000) into J11 and J12, J21 and J22, and J31 and J32; these splits are relevant. Splitting J12 further into J121 (JNAME = “Instrumentation”) and J122 (JNAME ≠ “Instrumentation”) is irrelevant.]

[JNAME = “Instrumentation”] is not relevant.

Application Information

• Qualitative Information
  – The fundamental qualitative information consists of the predicates used in user queries (i.e., “where” clauses in SQL).
  – 80/20 rule: 20% of user queries account for 80% of the total data access. One should investigate the more important queries.

• Quantitative Information
  – Minterm selectivity sel(mi): number of tuples that would be accessed by a query specified according to a given minterm predicate.
  – Access frequency acc(qi): the access frequency of queries in a given period.

Qualitative information guides the fragmentation activity.

Quantitative information guides the allocation activity.

Determine the set of meaningful minterm predicates

Applications:
• Take the salary and determine a raise accordingly.
• The employee records are managed in two places, one handling the records of those with salary less than or equal to $30,000 and the other handling the records of those who earn more than $30,000.

Pr = {p1: SAL <= 30,000, p2: SAL > 30,000} is complete and minimal.

The minterm predicates:

m1: (SAL ≤ 30,000) ∧ (SAL > 30,000)
m2: (SAL ≤ 30,000) ∧ ¬(SAL > 30,000)
m3: ¬(SAL ≤ 30,000) ∧ (SAL > 30,000)
m4: ¬(SAL ≤ 30,000) ∧ ¬(SAL > 30,000)

Implications:

i1: (SAL ≤ 30,000) ⇒ ¬(SAL > 30,000)
i2: ¬(SAL ≤ 30,000) ⇒ (SAL > 30,000)
i3: (SAL > 30,000) ⇒ ¬(SAL ≤ 30,000)
i4: ¬(SAL > 30,000) ⇒ (SAL ≤ 30,000)

i1 ⇒ m1 is contradictory
i2 ⇒ m4 is contradictory

Therefore, we are left with M = {m2, m3}.
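The elimination of contradictory minterms can be sketched symbolically, without looking at any data. In the sketch below a literal is a (predicate name, polarity) pair and an implication is a pair of literals; this encoding and the function names are assumptions made for illustration, not a prescribed representation.

```python
from itertools import product

# p1: SAL <= 30,000; p2: SAL > 30,000
predicates = ["p1", "p2"]

# Each implication is (antecedent literal, consequent literal),
# where a literal is (predicate name, polarity).
implications = [
    (("p1", True), ("p2", False)),   # i1: p1 => not p2
    (("p1", False), ("p2", True)),   # i2: not p1 => p2
    (("p2", True), ("p1", False)),   # i3: p2 => not p1
    (("p2", False), ("p1", True)),   # i4: not p2 => p1
]


def minterms(names):
    """All sign assignments, in the order m1, m2, m3, m4 for two predicates."""
    return [dict(zip(names, signs))
            for signs in product((True, False), repeat=len(names))]


def is_contradictory(minterm, rules):
    """A minterm is contradictory if it asserts the antecedent of some
    implication together with the negation of its consequent."""
    for (a_name, a_sign), (c_name, c_sign) in rules:
        if minterm[a_name] == a_sign and minterm[c_name] != c_sign:
            return True
    return False


all_minterms = minterms(predicates)
kept = [m for m in all_minterms if not is_contradictory(m, implications)]
print(kept)
```

Because the implications are supplied explicitly, they can encode the semantics of the database rather than whatever values happen to be stored at the moment.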

Invalid Implications

Relation J:

JNO  JNAME          BUDGET   LOC
J1   Instrumental   150,000  Montreal
J2   Database Dev.  135,000  New York
J3   CAD/CAM        250,000  New York
J4   Maintenance    350,000  Orlando

Simple predicates:
p1: LOC = “Montreal”
p2: LOC = “New York”
p3: LOC = “Orlando”
p4: BUDGET ≤ 200,000
p5: BUDGET > 200,000

VALID Implications

i1: p1 ⇒ ¬p2 ∧ ¬p3
i2: p2 ⇒ ¬p1 ∧ ¬p3
i3: p3 ⇒ ¬p1 ∧ ¬p2
i4: p4 ⇒ ¬p5
i5: p5 ⇒ ¬p4
i6: ¬p4 ⇒ p5
i7: ¬p5 ⇒ p4

INVALID Implications

i8: LOC = “Montreal” ⇒ ¬(BUDGET > 200,000)
i9: LOC = “Orlando” ⇒ ¬(BUDGET ≤ 200,000)

Implications should be defined according to the semantics of the database, not according to the current values.

Compute Complete & Minimal Set

• Repeat until the predicate set is complete:
  – Find a simple predicate pi that is relevant
  – Determine minterm fragments fi and fj according to pi
  – Accept pi, fi, and fj
  – Remove any pk and fk from the acceptance list if pk becomes irrelevant /* the list is minimal */

• Determine the set of minterm predicates M (using the acceptance list)

• Determine the set of implications I (among the acceptance list)

• For each mi in M, remove mi if it is contradictory according to I

Rule: a relation or fragment is partitioned into at least two parts which are accessed differently by at least one application.

Relevant: a simple predicate which satisfies the above rule is relevant.

Derived Horizontal Fragmentation

Derived fragmentation is used to facilitate the join between fragments.

In some cases, the horizontal fragmentation of a relation cannot be based on a property of its own attributes, but is derived from the horizontal fragmentation of another relation.

PAY (TITLE, SAL)

EMP (ENO, ENAME, TITLE)

$PAY_1 = \sigma_{TITLE = \text{"Assistant Professor"}}(PAY)$

$PAY_2 = \sigma_{TITLE = \text{"Associate Professor"}}(PAY)$

$PAY_3 = \sigma_{TITLE = \text{"Full Professor"}}(PAY)$

Not using derived fragmentation: one can divide EMP into EMP1 and EMP2 based on TITLE and divide PAY into PAY1, PAY2, PAY3 based on SAL. To join EMP and PAY, we have the following scenarios.

[Figure: each EMP fragment may have to be joined with each of PAY1, PAY2, and PAY3. More communication overhead!]

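Benefits of Derived Fragmentation

Primary fragmentation: PAY is divided into PAY1, PAY2, and PAY3.

Using derived fragmentation:

EMP1 = EMP SJ PAY1

EMP2 = EMP SJ PAY2

EMP3 = EMP SJ PAY3

where SJ denotes the semijoin.

EMPi and PAYi can be allocated to the same site.

[Figure: EMP1 is joined only with PAY1, EMP2 only with PAY2, and EMP3 only with PAY3.]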
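Derived fragmentation by semijoin can be sketched on toy data. The PAY titles come from the example; the salaries, the EMP tuples, and the helper name `semijoin` are hypothetical.

```python
# PAY (TITLE, SAL): salaries are hypothetical.
PAY = [
    ("Assistant Professor", 60_000),
    ("Associate Professor", 75_000),
    ("Full Professor", 95_000),
]

# EMP (ENO, ENAME, TITLE): hypothetical tuples.
EMP = [
    ("E1", "Lee", "Assistant Professor"),
    ("E2", "Kim", "Full Professor"),
    ("E3", "Diaz", "Associate Professor"),
    ("E4", "Roy", "Assistant Professor"),
]


def semijoin(left, right, left_col, right_col):
    """Tuples of left that have a matching tuple in right (left SJ right)."""
    keys = {t[right_col] for t in right}
    return [t for t in left if t[left_col] in keys]


# Primary fragmentation of PAY by TITLE (the owner relation).
pay_fragments = {title: [t for t in PAY if t[0] == title] for title, _ in PAY}

# Derived fragmentation of EMP: EMPi = EMP SJ PAYi.
emp_fragments = {
    title: semijoin(EMP, frag, left_col=2, right_col=0)
    for title, frag in pay_fragments.items()
}

for title, frag in emp_fragments.items():
    print(title, [e[0] for e in frag])
```

Each EMP fragment only ever needs the PAY fragment with the same title, which is why the pair can be placed at one site.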

Chain Relationships

• Design the primary fragmentation for R1.

• Derive the derived fragmentation for Rk as follows:

$R_k = R_k \ltimes_{R_k.FK \,=\, R_{k-1}.PK} R_{k-1}$, for $2 \le k \le n$, in that order,

where $\ltimes$ is the semijoin.

R1 (R1PK, …)

R2 (R2PK, R1FK, …)

R3 (R3PK, R2FK, …)

. . .

Derived Fragmentation

EMP (ENO, ENAME, TITLE)

PROJ (PNO, PNAME, BUDGET)

EMP_PROJ (ENO, PNO, RESP, DUR)

A join of EMP_PROJ with either EMP or PROJ might be required.

• How do we fragment EMP_PROJ?
  – Semijoin with EMP, or
  – Semijoin with PROJ

• Criterion: Support the more frequent join operation.

VERTICAL FRAGMENTATION

Purpose: Identify fragments Ri such that many applications can be executed using just one fragment.

Advantage: When many applications which use R1 and many applications which use R2 are issued at different sites, fragmenting R avoids communication overhead.

Vertical partitioning is more complicated than horizontal partitioning:

• Vertical Partitioning: The number of possible fragments is approximately $m^m$ for large m, where m is the number of nonprimary key attributes.

• Horizontal Partitioning: $2^n$ possible minterm predicates can be defined, where n is the number of simple predicates in the complete and minimal set Pr.

[Figure: relation R with attributes A1 to A7 is split into vertical fragments R1 and R2, placed at Site 1 and Site 2.]

Greedy Heuristic Approaches:

Split Approach: Global relations are progressively split into fragments.

Grouping Approach: Attributes are progressively aggregated to constitute fragments.

Correctness:

Each attribute of R belongs to at least one fragment.

Each fragment includes either a key of R or a “tuple identifier”.

Vertical Fragmentation Approaches

Vertical Clustering - Replication

Example: EMP(ENUM,NAME,SAL,TAX,MGRNUM,DNUM)

Bad Fragmentation: NAME is not available in EMP2.
1. EMP1(ENUM, NAME, TAX, SAL)
2. EMP2(ENUM, MGRNUM, DNUM)

Good Fragmentation:
1. EMP1(ENUM, NAME, TAX, SAL)
2. EMP2(ENUM, NAME, MGRNUM, DNUM)

[Figure: administrative applications at Site 1 use EMP1; applications at all sites use EMP2.]

In evaluating the convenience of vertical clustering, it is important that overlapping attributes are not heavily updated. Here, NAME is relatively stable.

Split Approach

• Splitting is considered only for attributes that do not participate in the primary key.

• The split approach involves three steps:

1. Obtain the attribute affinity matrix.

2. Use a clustering algorithm to group some attributes together based on the attribute affinity matrix. This algorithm produces a clustered affinity matrix.

3. Use a partitioning algorithm to partition the attributes such that sets of attributes are accessed solely, or for the most part, by distinct sets of applications.

PROJ (PNO, PNAME, BUDGET, LOC), with A1 = PNO, A2 = PNAME, A3 = BUDGET, A4 = LOC

q1: SELECT BUDGET FROM PROJ WHERE PNO=Value;

q2: SELECT PNAME, BUDGET FROM PROJ;

q3: SELECT PNAME FROM PROJ WHERE LOC=Value;

q4: SELECT SUM(BUDGET) FROM PROJ WHERE LOC=Value;

Attribute Usage Matrix

use(qi, Aj) = 1 if Aj is referenced by qi, 0 otherwise

      A1  A2  A3  A4
q1     1   0   1   0
q2     0   1   1   0
q3     0   1   0   1
q4     0   0   1   1
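The attribute usage matrix follows directly from the attributes each query references. In this sketch the per-query attribute sets were read off the four SQL statements by hand; the variable and function names are assumptions.

```python
ATTRS = ["PNO", "PNAME", "BUDGET", "LOC"]   # A1, A2, A3, A4

# Attributes referenced by each query (SELECT list plus WHERE clause).
QUERY_ATTRS = {
    "q1": {"BUDGET", "PNO"},
    "q2": {"PNAME", "BUDGET"},
    "q3": {"PNAME", "LOC"},
    "q4": {"BUDGET", "LOC"},
}


def usage_matrix(queries, attrs):
    """use(qi, Aj) = 1 if Aj is referenced by qi, 0 otherwise."""
    return {q: [1 if a in used else 0 for a in attrs]
            for q, used in queries.items()}


use = usage_matrix(QUERY_ATTRS, ATTRS)
for q, row in use.items():
    print(q, row)
```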

Attribute Affinity Measure

[Figure: relation R with attributes Ai, Aj, and Ak; queries qi and qk are issued from sites m, s, and n.]

$ref_s(q_k)$: number of accesses to attributes (Ai, Aj) for each execution of qk at site s

$acc_s(q_k)$: application access frequency of qk at site s

$aff(A_i, A_j) = \sum_{k \,:\, use(q_k, A_i) = 1 \,\wedge\, use(q_k, A_j) = 1} \;\; \sum_{s} ref_s(q_k) \cdot acc_s(q_k)$

The outer sum ranges over each query qk that uses both Ai and Aj; the inner sum measures the popularity of such an Ai-Aj pair at all sites. Together they give the popularity of using Ai and Aj together.

Attribute Affinity Matrix

[Figure: a matrix with attributes A1 to A4 on both rows and columns; the entry in row A2, column A3 holds aff(A2, A3).]

Attribute Affinity Matrix Example

Attribute Usage Matrix

      A1  A2  A3  A4
q1     1   0   1   0
q2     0   1   1   0
q3     0   1   0   1
q4     0   0   1   1

Attribute Affinity Matrix (AA)

      A1  A2  A3  A4
A1    45   0  45   0
A2     0  80   5  75
A3    45   5  53   3
A4     0  75   3  78
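The affinity matrix can be recomputed from the usage matrix once a total access weight per query is known. The transcript does not list the per-site values of ref and acc, so the weights below (45, 5, 75, 3) are an assumption, chosen because they reproduce the AA matrix of this example; the function name is also hypothetical.

```python
# Attribute usage matrix: rows q1..q4, columns A1..A4.
USE = [
    [1, 0, 1, 0],
    [0, 1, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 1],
]

# ASSUMED totals per query: sum over sites s of ref_s(q_k) * acc_s(q_k).
TOTAL_ACCESS = [45, 5, 75, 3]


def affinity_matrix(use, total_access):
    """aff(Ai, Aj) = sum of the access weight of every query using both."""
    n = len(use[0])
    aa = [[0] * n for _ in range(n)]
    for row, weight in zip(use, total_access):
        for i in range(n):
            for j in range(n):
                if row[i] and row[j]:
                    aa[i][j] += weight
    return aa


AA = affinity_matrix(USE, TOTAL_ACCESS)
for row in AA:
    print(row)
```

The diagonal entry aff(Ai, Ai) is simply the total weight of all queries that touch Ai.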

Next Step - Determine clustered affinity (CA) matrix

Clustered Affinity Matrix
Step 1: Initialize CA

Copy the first 2 columns of AA into the clustered affinity matrix (CA):

      A1  A2
A1    45   0
A2     0  80
A3    45   5
A4     0  75

Clustered Affinity Matrix
Step 2: Determine Location for A3

There are 3 possible positions for A3 (A0 and A5 denote the empty boundary columns):

• A3 A1 A2: A3 to the left of A1, evaluated as (A0, A3, A1)
• A1 A3 A2: A3 between A1 and A2, evaluated as (A1, A3, A2)
• A1 A2 A3: A3 to the right of A2, evaluated as (A2, A3, A4)

Clustered Affinity Matrix
Step 2: Determine the order for A3

$bond(A_x, A_y) = \sum_{z=1}^{n} aff(A_z, A_x) \, aff(A_z, A_y)$

$cont(A_i, A_k, A_j) = 2\,bond(A_i, A_k) + 2\,bond(A_k, A_j) - 2\,bond(A_i, A_j)$

Contribution:

Cont(A0,A3,A1) = 8820
Cont(A1,A3,A2) = 10150
Cont(A2,A3,A4) = 1780

Note: aff(A0,Ai)=aff(Ai,A0)=aff(A5,Ai)=aff(Ai,A5)=0 by definition. A4 has not been placed in CA yet, so it also counts as an empty column when computing Cont(A2,A3,A4).

Since Cont(A1,A3,A2) is the greatest, [A1,A3,A2] is the best order.

Clustered Affinity Matrix (CA) after placing A3:

      A1  A3  A2
A1    45  45   0
A2     0   5  80
A3    45  53   5
A4     0   3  75
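The contribution values can be reproduced with a few lines of code. The sketch below assumes the bond and contribution measures of the bond energy algorithm as they are usually defined, uses 0-based attribute indices (A1 is index 0), and represents an empty boundary column by `None`; the function names are hypothetical.

```python
# Attribute affinity matrix AA from the example (rows/columns A1..A4).
AA = [
    [45, 0, 45, 0],
    [0, 80, 5, 75],
    [45, 5, 53, 3],
    [0, 75, 3, 78],
]


def bond(aa, x, y):
    """bond(Ax, Ay) = sum over z of aff(Az, Ax) * aff(Az, Ay).

    None stands for an empty boundary column, whose affinities are all 0.
    """
    if x is None or y is None:
        return 0
    return sum(aa[z][x] * aa[z][y] for z in range(len(aa)))


def cont(aa, left, mid, right):
    """Net contribution of placing column mid between left and right."""
    return (2 * bond(aa, left, mid) + 2 * bond(aa, mid, right)
            - 2 * bond(aa, left, right))


def best_position(aa, order, new):
    """Try every slot for column `new` in `order`; return the best ordering."""
    padded = [None] + order + [None]
    scores = [cont(aa, padded[i], new, padded[i + 1])
              for i in range(len(padded) - 1)]
    best = max(range(len(scores)), key=scores.__getitem__)
    return order[:best] + [new] + order[best:], scores


order, scores_a3 = best_position(AA, [0, 1], 2)   # place A3 among [A1, A2]
print(scores_a3, order)
order, scores_a4 = best_position(AA, order, 3)    # then place A4
print(scores_a4, order)
```

Placing A3 yields the three contributions listed above, and placing A4 afterwards puts it at the right end, giving the column order A1, A3, A2, A4.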

Clustered Affinity Matrix
Step 2: Determine the order for A4

Since the contribution is greatest when A4 is placed to the right of A2, [A1,A3,A2,A4] is the best order.

Clustered Affinity Matrix (CA):

      A1  A3  A2  A4
A1    45  45   0   0
A2     0   5  80  75
A3    45  53   5   3
A4     0   3  75  78

Clustered Affinity Matrix
Step 3: Re-order the Rows

The rows are organized in the same order as the columns.

Clustered Affinity Matrix (CA):

      A1  A3  A2  A4
A1    45  45   0   0
A3    45  53   5   3
A2     0   5  80  75
A4     0   3  75  78

Partitioning

Find the sets of attributes that are accessed, for the most part, by distinct sets of applications.

We look for good dividing points along the diagonal of the clustered affinity matrix.

Cluster 1: A1 & A3

Cluster 2: A2 & A4

Two vertical fragments: PROJ1(A1, A3) and PROJ2(A2, A4)

• A4 and A3 are usually not accessed together.
• A4 and A2 are often accessed together.
• Grouping A1 with A2 would be a bad grouping, since A1 and A2 are never accessed together.

MIXED FRAGMENTATION

• Apply horizontal fragmentation to vertical fragments.

• Apply vertical fragmentation to horizontal fragments.

Example: Applications about work at each department reference, with 80% probability, tuples of employees in the departments located around their own site.

EMP(ENUM, NAME, SAL, TAX, MGRNUM, DNUM)

• Vertical fragmentation: (ENUM, NAME, TAX, SAL), not related to work, and (ENUM, NAME, MGRNUM, DNUM), work related.

• Horizontal fragmentation (local work): the work-related fragment is further split by site into Jacksonville, Orlando, and Miami.

ALLOCATION – Notations

i: fragment index

j: site index

k: application index

fkj: the frequency of application k at site j

rki: the number of retrieval references of application k to fragment i.

uki: the number of update references of application k to fragment i.

nki = rki + uki

[Figure: application k runs at site j with frequency fkj and makes rki retrieval references and uki update references to fragment i.]

Allocation of Horizontal Fragments (1)

No replication: Best Fit Strategy

• The number of local references of Ri at site j is

$B_{ij} = \sum_{k} f_{kj} \, n_{ki}$

where the sum ranges over all applications k at site j, fkj is the frequency of application k, nki is the number of accesses by k, and Bij is the benefit to site j.

• Ri is allocated at site j* such that Bij* is maximum.

Advantage: A fragment is allocated to a site that needs it most.

Disadvantage: It disregards the “mutual” effect of placing a fragment at a given site if a related fragment is also at that site.
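The best-fit rule amounts to a per-fragment argmax over sites. A minimal sketch, with hypothetical frequencies and reference counts and an assumed function name:

```python
def best_fit(f, n):
    """Best-fit allocation without replication.

    f[k][j]: frequency of application k at site j
    n[k][i]: number of references (retrieval + update) of application k
             to fragment i
    Returns (site chosen per fragment, benefit matrix B[i][j]).
    """
    apps = range(len(f))
    sites = range(len(f[0]))
    frags = range(len(n[0]))
    benefit = [[sum(f[k][j] * n[k][i] for k in apps) for j in sites]
               for i in frags]
    allocation = [max(sites, key=lambda j: benefit[i][j]) for i in frags]
    return allocation, benefit


# Hypothetical workload: 2 applications, 2 sites, 2 fragments.
f = [[10, 0],
     [2, 8]]
n = [[3, 0],
     [1, 4]]
allocation, benefit = best_fit(f, n)
print(benefit)      # B[i][j]
print(allocation)   # site index chosen for each fragment
```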

Allocation of Horizontal Fragments (2)

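All beneficial sites approach (replication)

$B_{ij} = \sum_{k} f_{kj} \, r_{ki} \;-\; c \sum_{k} \sum_{j' \neq j} f_{kj'} \, u_{ki}$

The first term is the savings due to retrieval references; the second term is the cost of update references from other sites, where c is a constant that weighs the cost of an update access relative to a retrieval access.

• Ri is allocated at all sites j* such that Bij* > 0.

• When all Bij’s are negative, a single copy of Ri is placed at the site such that Bij* is maximum.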
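The all-beneficial-sites rule can be sketched as follows. The workload numbers are hypothetical, the weight `c` defaults to 1 as an assumption, and the fallback to a single copy is applied whenever no site has a positive benefit.

```python
def all_beneficial_sites(f, r, u, i, c=1):
    """Sites that should hold a copy of fragment i.

    f[k][j]: frequency of application k at site j
    r[k][i], u[k][i]: retrieval / update references of application k
                      to fragment i
    c: assumed weight of an update access relative to a retrieval access
    """
    apps = range(len(f))
    sites = range(len(f[0]))
    benefit = []
    for j in sites:
        # Retrievals issued at site j become local if a copy is stored there.
        saving = sum(f[k][j] * r[k][i] for k in apps)
        # Updates issued at every other site must be propagated to site j.
        update_cost = sum(f[k][jp] * u[k][i]
                          for k in apps for jp in sites if jp != j)
        benefit.append(saving - c * update_cost)
    chosen = [j for j in sites if benefit[j] > 0]
    if not chosen:  # no beneficial site: keep one copy at the best site
        chosen = [max(sites, key=lambda j: benefit[j])]
    return chosen, benefit


# Hypothetical workload: 2 applications, 3 sites, 1 fragment.
f = [[10, 0, 4],
     [0, 6, 4]]
r = [[5], [2]]
u = [[1], [0]]
chosen, benefit = all_beneficial_sites(f, r, u, i=0)
print(benefit, chosen)
```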

Allocation of Horizontal Fragments (3)

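Another Replication Approach:

$d_i$: the degree of redundancy of Ri

$F_i$: the reliability and availability benefit of having Ri fully replicated

$\beta(d_i)$: the reliability and availability benefit when the fragment has di copies

$\beta(d_i) = (1 - 2^{\,1 - d_i}) \, F_i$, so that $\beta(1) = 0$, $\beta(2) = F_i/2$, $\beta(3) = 3F_i/4$, …

The benefit of introducing a new copy of Ri at site j:

$B_{ij} = \sum_{k} f_{kj} \, r_{ki} \;-\; c \sum_{k} \sum_{j' \neq j} f_{kj'} \, u_{ki} \;+\; \beta(d_i)$

The first two terms are the same as in the all beneficial sites approach; the last term also takes into account the benefit of availability.

[Figure: β plotted against di, starting at 0 for di = 1 and approaching Fi.]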
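The availability term has diminishing returns: each extra copy closes half of the remaining gap to the full-replication benefit. A short sketch of that function, with a hypothetical value for Fi:

```python
def beta(d, F):
    """Reliability/availability benefit of having d copies of a fragment,
    where F is the benefit of full replication: (1 - 2**(1 - d)) * F."""
    return (1 - 2 ** (1 - d)) * F


F = 8  # hypothetical full-replication benefit
for d in range(1, 6):
    print(d, beta(d, F))
```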

Allocation of Vertical Fragments

[Equation not recoverable from the transcript: the benefit $B_{ist}$ of splitting fragment Ri into Rs, allocated at site s, and Rt, allocated at site t, expressed as sums of $f_{kj} \, n_{ki}$ terms over several groups of applications.]

This formula can be used within an exhaustive “splitting” algorithm by trying all possible combinations of sites s and t.

[Figure: fragment Ri at site PSr is to be split into Rs and Rt. Applications of type As run at PSs and applications of type At at PSt; application types A1, A2, and A3 run at PSr (for example, application type A1 at site PSr accesses only Rs); application types A4, …, An run at sites PS4, …, PSn.]

Should we allocate fragment Rs to site PSs, and fragment Rt to site PSt?

SUMMARY

Design of a distributed DB consists of four phases:

– Phase 1: Global schema design (same as in centralized DB design)

– Phase 2: Fragmentation
  • Horizontal Fragmentation
    – Primary: Determine a complete and minimal set of predicates
    – Derived: Use semijoin
  • Vertical Fragmentation: Identify fragments such that many applications can be executed using just one fragment.

– Phase 3: Allocation: The primary goal is to minimize the number of remote accesses.

– Phase 4: Physical schema design (same as in centralized DB design).

Database Integration: Bottom-up Design

Overview

• The design process in multidatabase systems is bottom-up.
  – The individual databases actually exist.
  – Designing the global conceptual schema (GCS) involves integrating these local databases into a multidatabase.

• Database integration can occur in two steps: Schema Translation and Schema Integration.

[Figure: Database 1, Database 2, and Database 3 are each passed through a translator (Translator 1 to 3), producing the intermediate schemas InS1, InS2, and InS3 in canonical representation; the integrator combines them into the GCS.]

Network Data Model (Review)

• There are two basic data structures in the network model: records and sets.
  – Record type: a group of records of the same type.
  – Set type: indicates a many-to-one relationship in the direction of the arrow.

[Figure: the set type Employs links the owner record type DEPARTMENT (DEPT-NAME, BUDGET, MANAGER) to the member record type EMPLOYEE (E#, NAME, ADDRESS, TITLE, SALARY).]

• Implementation of set instances:

[Figure: the DEPARTMENT owner record “Database” is linked to its EMPLOYEE member records “Jones, L.”, “Patel, J.”, and “Vu, K.”.]

Example: Three Local Databases

Database 1 (Relational Model):

S (TITLE, SAL)

E (ENO, ENAME, TITLE) J (JNO, JNAME, BUDGET, LOC, CNAME)

G (ENO, JNO, RESP, DUR)

Database 2 (Network Model):

DEPARTMENT (DEPT_NAME, BUDGET, MANAGER)

Work

EMPLOYEE (E#, NAME, ADDRESS, TITLE, SALARY)

Employs

Worksin

Dummy Record Type

Example: Three Local Databases

Database 3 (ER Model):

EngineerNo.

EngineerName

Title Salary

ProjectNo.

ProjectName

Budget

Location

Duration

Responsibility

ContractDate

AddressClientName

ENGINEER WORKSIN

PROJECT

CONTRACTEDBY

CLIENT

1N

N

1

Schema Translation: Relational to ER

S (TITLE, SAL)

E (ENO, ENAME, TITLE) J (JNO, JNAME, BUDGET, LOC, CNAME)

G (ENO, JNO, RESP, DUR)

ENO ENAME

TITLESAL

E

PAY

S

G

CNAME

LOC

J

BUDGET

JNO JNAME

DUR

RESP

N M

1

N

ENO ENAME

TITLE SAL

E G

CNAME

LOC

J

BUDGET

JNO JNAME

DUR

RESP

N M

• E & J have a many-to-many relationship

• E & S have a 1-to-many relationship

Treat salary as an attribute of an engineer entity

Relationships may be identified from the foreign keys defined for each relation.

Schema Translation: Network to ER

• Map each record type in the network schema to an entity and each set type to a relationship.

• Network model uses dummy records in its representation of many-to-many relationships that need to be recognized during mapping.

DEPARTMENT EMPLOYEE

WORK

Employs Works-in

WORK

DEPARTMENT EMPLOYEE

EMPLOYS WORKS-IN

N M

11

DEPARTMENT EMPLOYS EMPLOYEE (N:M)

Dummy record type
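The network-to-ER mapping can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `SetType` class and the `network_to_er` function are hypothetical names, each dummy record type is assumed to be the member of exactly two set types, and the merged relationship takes the name of the first set type.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SetType:
    name: str    # set type name, e.g. "Employs"
    owner: str   # owner record type
    member: str  # member record type


def network_to_er(record_types, set_types, dummy_types):
    """Map record types to entities and set types to relationships.

    A dummy record type that is the member of two set types is collapsed
    into one many-to-many relationship between the two owners.
    """
    entities = [r for r in record_types if r not in dummy_types]
    relationships = []
    for dummy in dummy_types:
        links = [s for s in set_types if s.member == dummy]
        if len(links) != 2:
            raise ValueError(f"dummy record type {dummy!r} must link exactly two owners")
        first, second = links
        relationships.append((first.name.upper(), first.owner, second.owner, "N:M"))
    # Set types that do not involve a dummy record stay one-to-many.
    for s in set_types:
        if s.member not in dummy_types:
            relationships.append((s.name.upper(), s.owner, s.member, "1:N"))
    return entities, relationships


records = ["DEPARTMENT", "EMPLOYEE", "WORK"]
sets = [
    SetType("Employs", "DEPARTMENT", "WORK"),
    SetType("Works-in", "EMPLOYEE", "WORK"),
]
entities, relationships = network_to_er(records, sets, dummy_types={"WORK"})
```

With the DEPARTMENT/EMPLOYEE example, the dummy record type WORK disappears and a single N:M relationship between DEPARTMENT and EMPLOYEE remains.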

Schema Integration

Schema integration follows the translation process and generates the GCS by integrating the intermediate schemas.

– Identify the components of a database which are related to one another.

• Two components can be related as (1) equivalent, (2) one contained in the other one, (3) overlapped, or (4) disjoint.

– Select the best representation for the GCS.

– Integrate the components of each intermediate schema.

Integration Methodologies

Integration Process
– Binary: Ladder, Balanced
– N-ary: One-shot, Iterative

Binary: Decreases the potential integration complexity and leads toward automation techniques.

One-shot: There is no implied priority for integration order of schemas, and the trade-off can be made among all schemas rather than among a few.

Integration Process

Schema integration occurs in a sequence of four steps:

• Preintegration: establish the “rules” of the integration process before actual integration occurs.

• Comparison: naming and structural conflicts are identified.

• Conformation: naming and structural conflicts are resolved.

• Merging and restructuring: all schemas must be merged into a single database schema and then restructured to create the “best” integrated schema.

Schema Integration: Preintegration

1. An integration method (binary or n-ary) must be selected and the schema integration order defined.
– The order implicitly defines priorities.

2. Candidate keys in each schema are identified to enable the integrator to determine dependencies implied by the schemas.

3. The mapping or transformation rules should be described before integration begins.
– e.g., mapping from degrees Celsius in one schema to degrees Fahrenheit in another.

Preintegration Example: InS1

EngineerNo.

EngineerName

Title Salary

ProjectNo.

ProjectName

Budget

Location

Duration

Responsibility

ContractDate

AddressClientName

ENGINEER WORKSIN

PROJECT

CONTRACTEDBY

CLIENT

1N

N

1

Preintegration Example: InS2 & InS3

E#

Name

Address Salary

Dept-name Budget

Manager

EMPLOYEE DEPARTMENTEMPLOYS1N

InS2

Eno Ename

Title Sal

JNO Jname

Budget

LocDur

Resp

Cname

ENGINEER J MN EMPLOYS

InS3

Title

Keys & Integration Order

KEYS

InS1: Engineer No. in ENGINEER, Project No. in PROJECT, Client name in CLIENT

InS2: E# in EMPLOYEE, Dept-name in DEPARTMENT

InS3: Eno in E, Jno in J

Integration method: InS1 InS2 InS3

Schema Comparison: Naming Conflict (1)

Synonyms: two identical entities that have different names.

InS1:
ENGINEER: Engineer No, Engineer Name, Salary
WORKSIN: Responsibility, Duration
PROJECT: Project No, Project Name, Location

InS3:
E: Eno, Ename, Sal
G: Resp, Dur
J: Jno, Jname, Loc

Schema Comparison: Naming Conflict (2)

• In InS1, ENGINEER.Title refers to the title of engineers.

• In InS2, EMPLOYEE.Title refers to the title of all employees.

Homonyms: Two different entities that have identical names.

domain (EMPLOYEE.Title) >> domain (ENGINEER.Title)

Schema Comparison – Relation between Schemas

• Two schemas can be related in four possible ways:
– They can be identical to one another.
– One can be a subset of the other.
– Some components from one may occur in the other while retaining some unique features.
– They could be completely different with no overlap.

• An attribute in one schema may represent the same information as an entity in another one.

Schema Comparison Example

• InS3 is a subset of InS2

• Some parts of InS1 (about engineers) and InS3 (about engineers) occur in InS2 (about employees)

ENGINEER

EMPLOYS

E#

Name

Title

Salary

Address

IS-A relationship

DEPARTMENT

EMPLOYEE

Schema Comparison – Structural Conflicts (1)

• Type conflicts: occur when the same object is represented by an attribute in one schema and by an entity in another schema.

– The client of a project is modeled as an entity in InS1, however

– the client is included as an attribute of the J entity in InS3

JNO Jname

Budget

LocDur

Resp

Cname

J M

EMPLOYS

InS3

ContractDate

AddressClientName

PROJECTCONTRACTEDBY

CLIENT

N

1

InS1

Schema Comparison – Structural Conflicts (2)

Dependency conflicts: occur when different relationship modes are used to represent the same thing in different schemas.

EngineerNo.

EngineerName

Title Salary

ProjectNo.

ENGINEER WORKSIN

PROJECT1N

InS1

Eno Ename

Title Sal Dur

Resp

ENGINEER JMN EMPLOYS

InS3

This is 1-to-many

This is many-to-many

Schema Comparison: Structural Conflicts (3)

• Key conflicts: occur when different candidate keys are available and different primary keys are selected in different schemas

• Behavioral conflicts: are implied by the modeling mechanism,

– e.g., deletion of the last employee causes the dissolution of the department.

Conformation: Naming Conflicts

Naming conflicts are resolved simply by renaming the conflicting ones.

InS3 → InS1

E → ENGINEER: Eno → Engineer No, Ename → Engineer Name, Sal → Salary
G → WORKSIN: Resp → Responsibility, Dur → Duration
J → PROJECT: Jno → Project No, Jname → Project Name, Loc → Location

Synonyms: rename the schema of InS3 to conform to the naming of InS1.

Homonyms:

• Prefix each attribute by the name of the entity to which it belongs,
e.g., ENGINEER.Title, EMPLOYEE.Title

• and prefix each entity by the name of the schema to which it belongs,
e.g., InS1.ENGINEER, InS2.EMPLOYEE
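Both naming-conflict fixes are mechanical once the correspondences are known. The sketch below is a minimal illustration, assuming a schema is represented as a dictionary from entity name to attribute list; `rename_synonyms` and `qualify` are hypothetical helper names, and the attribute spellings follow the InS1 diagram.

```python
def rename_synonyms(schema, entity_map, attribute_map):
    """Rename entities and attributes of one schema to the names used in another."""
    return {
        entity_map.get(entity, entity): [attribute_map.get(a, a) for a in attrs]
        for entity, attrs in schema.items()
    }


def qualify(schema_name, schema):
    """Resolve homonyms by prefixing each attribute with its schema and entity."""
    return [
        f"{schema_name}.{entity}.{attr}"
        for entity, attrs in schema.items()
        for attr in attrs
    ]


# InS3 uses short names; InS1 uses long names for the same things (synonyms).
ins3 = {"E": ["Eno", "Ename", "Sal"], "G": ["Resp", "Dur"], "J": ["Jno", "Jname", "Loc"]}
entity_map = {"E": "ENGINEER", "G": "WORKSIN", "J": "PROJECT"}
attribute_map = {
    "Eno": "Engineer No", "Ename": "Engineer Name", "Sal": "Salary",
    "Resp": "Responsibility", "Dur": "Duration",
    "Jno": "Project No", "Jname": "Project Name", "Loc": "Location",
}
new_ins3 = rename_synonyms(ins3, entity_map, attribute_map)

# Title means different things in InS1 and InS2 (homonyms), so qualify it.
title_in_ins1 = qualify("InS1", {"ENGINEER": ["Title"]})
title_in_ins2 = qualify("InS2", {"EMPLOYEE": ["Title"]})
```

After qualification the two Title attributes carry distinct names, so they can coexist in one integrated schema.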

EngineerNo.

EngineerName

Title Salary

Budget

Location

Duration

Responsibility

ENGINEER WORKSIN

PROJECT

ClientName

N

Resolving Structural Conflicts: Transforming entities/attributes/relationships among one another

Transform the attribute Client name in InS3 to an entity C to make InS3 conform to the representation of InS1.

M

EngineerNo.

EngineerName

Title Salary

ProjectNo.

ProjectName

Budget

Location

Duration

Responsibility

ENGINEER WORKSIN

PROJECTM

N

Example:

ProjectNo.

ProjectName

C-P

C

N

M

ClientName

InS3

NewInS3

Schema Integration: Merging & Restructuring

Merging requires that the information contained in the participating schemas be retained in the integrated schema.

InS1InS2 InS3

Merging using the IS-A relationship

Use InS3 as the final schema since it is more general in terms of the C-P relationship (i.e., many-to-many) (next page)

(Employees) (Engineers) (Engineers)

Integrate InS1 & InS3

EngineerNo.

EngineerName

Title Salary

ProjectNo.

ProjectName

Budget

Location

Duration

Responsibility

ENGINEER WORKSIN

PROJECT

CONTRACTEDBY

C

MN

N

MClientName

EngineerNo.

EngineerName

Title Salary

ProjectNo.

ProjectName

Budget

Location

Duration

Responsibility

ContractDate

AddressClientName

ENGINEER WORKSIN

PROJECT

CONTRACTEDBY

CLIENT

1N

N

1

InS1

InS3

InS3 is more general

Merging & Restructuring Example

ProjectNo.

ProjectName

Budget

Location

Duration

AddressClientname

ENGINEER WORKSIN

CONTRACTEDBY

CLIENT

MN

N

1

Final Result:

EMPLOYEE

EMPLOYS

E#

Name

Title

SAL

Address

Dept-nameBudget Manager

DEPARTMENTInS2

InS1/InS3

Unfortunately, the conformation and restructuring stages are an art rather than a science.

Responsibility

PROJECT

Query Processing in Multidatabase Systems

Query Processing in Three Steps

1. Global query is decomposed into local queries

Local Schema 1 Local Schema 2 Local Schema 3

Translator 1 Translator 2 Translator 3

InS1

INTEGRATOR

GCS

InS3InS2

Schema Integration

Q1

Q1,1 Q1,2 Q1,3

Query Processing in Three Steps

2. Each local query is translated into queries over the corresponding local database system

Local Schema 1 Local Schema 2 Local Schema 3

Translator 1 Translator 2 Translator 3

InS1

INTEGRATOR

GCS

InS3InS2

Schema Integration

Q1

Q1,1 Q1,2 Q1,3

Q’1,1 Q’1,2 Q’1,3

Query Processing in Three Steps

3. Results of the local queries are combined into the answer

Local Schema 1 Local Schema 2 Local Schema 3

Translator 1 Translator 2 Translator 3

InS1

INTEGRATOR

GCS

InS3InS2

Schema Integration

Q1

Q1,1 Q1,2 Q1,3

Q’1,1 Q’1,2 Q’1,3

Combine

Final answer

Query Processing in Three Steps

1. Global query is decomposed into local queries

2. Each local query is translated into queries over the corresponding local database system

3. Results of the local queries are combined into the answer

Local Schema 1 Local Schema 2 Local Schema 3

Translator 1 Translator 2 Translator 3

InS1

INTEGRATOR

GCS

InS3InS2

Schema Integration

Outline

• Overview of major query processing components in multidatabase systems:
– Query Decomposition
– Query Translation
– Global Query Optimization

• Techniques for each of the above components

Query Decomposition

Query Decomposition Overview

[Figure: the Global Query enters "Query decomposition & global optimization", which produces subqueries SQ1, SQ2, ..., SQn and postprocessing queries PQ1, ..., PQn. Query translator i turns SQi into TQi, which runs on DBi.]

SQi: export-schema subquery in global query language

TQi: target query (local subquery) in local query language

PQi: postprocessing query used to combine results returned by subqueries to form the answer

Assumptions

• We use the object-oriented data model to present a query decomposition algorithm.

• To simplify the discussion, we assume that there are only two export schemas:

ES1: Emp1 (SSN, Name, Salary, Age)
ES2: Emp2 (SSN, Name, Salary, Rank)

Definitions

• type: Given a class C, the type of C, denoted by type(C), is the set of attributes defined for C and their corresponding domains.

• world: the world of C, denoted by world(C), is the set of real-world objects described by C.

• extension: the extension of C, denoted by extension(C), is the set of instances contained in C.

Extension

Type

A Class

World

Schema Integration

• Integration through outerjoin

• Integration through outerunion (generalization)

Review: Outerjoin

The outerjoin of relation R1 and R2 (R1 ⋈o R2 ) is the union of three components:

– the join of R1 and R2,

– dangling tuples of R1 padded with null values, and

– dangling tuples of R2 padded with null values.

Outerjoin Example

Emp1
OID  SSN   Name   Salary  Age
3    6789  Smith  90,000  40
4    4321  Chang  62,000  30
5    8642  Patel  75,000  35

Emp2
OID  SSN   Name   Salary  Rank
1    2222  Ahad   98,000  S. Mgr.
2    7531  Wang   95,000  S. Mgr.
3    6789  Smith  25,000  Mgr.

EmpO = Emp1 ⋈o Emp2
OID  SSN   Name   Salary        Age   Rank
1    2222  Ahad   98,000        null  S. Mgr.
2    7531  Wang   95,000        null  S. Mgr.
3    6789  Smith  Inconsistent  40    Mgr.
4    4321  Chang  62,000        30    null
5    8642  Patel  75,000        35    null

Dangling tuples: OIDs 4 and 5 of Emp1; OIDs 1 and 2 of Emp2.
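The three components of the outerjoin can be reproduced with a short Python sketch. It is an illustration under assumptions, not the slides' algorithm: relations are lists of dictionaries, `outerjoin` is a hypothetical helper, nulls are `None`, and an inconsistent shared attribute is flagged by storing both values as a tuple.

```python
def outerjoin(r1, r2, key="OID"):
    """Full outerjoin of two relations (lists of dicts) on `key`."""
    # Collect the attribute names of both relations, preserving order.
    attrs = []
    for t in r1 + r2:
        for a in t:
            if a not in attrs:
                attrs.append(a)
    index2 = {t[key]: t for t in r2}
    matched = set()
    result = []
    for t1 in r1:
        row = dict.fromkeys(attrs)  # every attribute starts as None (null)
        row.update(t1)
        t2 = index2.get(t1[key])
        if t2 is not None:  # joined tuple
            matched.add(t1[key])
            for a, v in t2.items():
                # Keep both values when the two sources disagree.
                row[a] = (t1[a], v) if a in t1 and t1[a] != v else v
        result.append(row)  # unmatched t1 is a dangling tuple padded with None
    for t2 in r2:
        if t2[key] not in matched:  # dangling tuple of r2
            row = dict.fromkeys(attrs)
            row.update(t2)
            result.append(row)
    return sorted(result, key=lambda t: t[key])


emp1 = [
    {"OID": 3, "SSN": 6789, "Name": "Smith", "Salary": 90000, "Age": 40},
    {"OID": 4, "SSN": 4321, "Name": "Chang", "Salary": 62000, "Age": 30},
    {"OID": 5, "SSN": 8642, "Name": "Patel", "Salary": 75000, "Age": 35},
]
emp2 = [
    {"OID": 1, "SSN": 2222, "Name": "Ahad", "Salary": 98000, "Rank": "S. Mgr."},
    {"OID": 2, "SSN": 7531, "Name": "Wang", "Salary": 95000, "Rank": "S. Mgr."},
    {"OID": 3, "SSN": 6789, "Name": "Smith", "Salary": 25000, "Rank": "Mgr."},
]
emp_o = outerjoin(emp1, emp2)
```

Only Smith appears in both relations, so only that row carries a joined tuple, and its Salary holds the conflicting pair.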

Outerunion

Emp1
OID  SSN   Name   Salary  Age
3    6789  Smith  90,000  40
4    4321  Chang  62,000  30
5    8642  Patel  75,000  35

Emp2
OID  SSN   Name   Salary  Rank
1    2222  Ahad   98,000  S. Mgr.
2    7531  Wang   95,000  S. Mgr.
3    6789  Smith  25,000  Mgr.

EmpG = Emp1 ⋃o Emp2
OID  SSN   Name   Salary    Age   Rank
1    2222  Ahad   98,000    null  S. Mgr.
2    7531  Wang   95,000    null  S. Mgr.
3    6789  Smith  Conflict  null  Mgr.
3    6789  Smith  Conflict  40    null
4    4321  Chang  62,000    30    null
5    8642  Patel  75,000    35    null

Schema Integration Using Outerjoin

Two classes C1 and C2 can be integrated by equi-outerjoining the two classes on the OID to form a new class C.

– extension(C) = extension(C1) ⋈o extension(C2)

– type(C) = type(C1) ⋃ type(C2)

– world(C) = world(C1) ⋃ world(C2)

C1 C2 C

Schema Integration thru Generalization

Two classes C1 and C2 can be integrated by generalizing the two classes to form the superclass C.

type(C) = type(C1) ⋂ type(C2)

extension(C) = π type(C) [extension(C1) ⋃o extension(C2)], where ⋃o is the outer union and π type(C) is the projection on the attributes of type(C)

world(C) = world(C1) ⋃ world(C2)
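Generalization as "project the outer union onto the common attributes" can be sketched as follows. This is a minimal illustration with hypothetical helper names (`outer_union`, `generalize`); it assumes non-empty relations stored as lists of dictionaries in which every tuple of a relation has the same attributes.

```python
def outer_union(r1, r2):
    """Union of two relations over all attributes, padding missing ones with None."""
    attrs = []
    for t in r1 + r2:
        for a in t:
            if a not in attrs:
                attrs.append(a)
    return [{a: t.get(a) for a in attrs} for t in r1 + r2]


def generalize(r1, r2):
    """Return (type, extension) of the superclass of two classes."""
    common = [a for a in r1[0] if a in r2[0]]  # type(C) = type(C1) ∩ type(C2)
    seen, extension = set(), []
    for t in outer_union(r1, r2):
        row = tuple(t[a] for a in common)  # projection on type(C)
        if row not in seen:                # a projection removes duplicates
            seen.add(row)
            extension.append(dict(zip(common, row)))
    return common, extension


emp1 = [
    {"OID": 3, "SSN": 6789, "Name": "Smith", "Salary": 90000, "Age": 40},
    {"OID": 4, "SSN": 4321, "Name": "Chang", "Salary": 62000, "Age": 30},
    {"OID": 5, "SSN": 8642, "Name": "Patel", "Salary": 75000, "Age": 35},
]
emp2 = [
    {"OID": 1, "SSN": 2222, "Name": "Ahad", "Salary": 98000, "Rank": "S. Mgr."},
    {"OID": 2, "SSN": 7531, "Name": "Wang", "Salary": 95000, "Rank": "S. Mgr."},
    {"OID": 3, "SSN": 6789, "Name": "Smith", "Salary": 25000, "Rank": "Mgr."},
]
emp_g_type, emp_g = generalize(emp1, emp2)
```

Smith still shows up twice in the result because the two sources disagree on Salary; resolving that disagreement is a separate step.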

Generalization Example

Emp1: SSN, Name, Salary, Age
Emp2: SSN, Name, Salary, Rank
EmpG: SSN, Name, Salary

• Emp1 and Emp2 will also appear in the global schema since not all information in Emp1 and Emp2 is retained in EmpG.

[Figure: EmpG (SSN, Name, Salary) is the generalization of the more specific classes Emp1 (Age) and Emp2 (Rank).]

Inconsistency Resolution

• The schema integration techniques work as long as there is no data inconsistency

• If data inconsistency occurs, aggregate functions may be used to resolve the problem.

Export Schemas:
Emp1: SSN, Name, Salary, Age
Emp2: SSN, Name, Salary, Rank

Integrated Schema:
EmpG: SSN, Name, Salary
or
EmpO: SSN, Name, Salary, Age, Rank

Aggregate Functions - Examples:

EmpG.Name = Emp1.Name, if EmpG is in world(Emp1)
          = Emp2.Name, if EmpG is in world(Emp2) – world(Emp1)

EmpG.Salary = Emp1.Salary, if EmpG is in world(Emp1) – world(Emp2)
            = Emp2.Salary, if EmpG is in world(Emp2) – world(Emp1)
            = Sum(Emp1.Salary, Emp2.Salary), if EmpG is in world(Emp1) ⋂ world(Emp2)

EmpO.Age = Emp1.Age, if EmpO is in world(Emp1)
         = Null, if EmpO is in world(Emp2) – world(Emp1)

EmpO.Rank = Emp2.Rank, if EmpO is in world(Emp2)
          = Null, if EmpO is in world(Emp1) – world(Emp2)
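The piecewise definitions above translate directly into code. The sketch below is a hedged illustration: `integrate` is a hypothetical function, membership in world(Emp1) and world(Emp2) is approximated by the presence of a tuple with the same SSN, and null is represented by `None`.

```python
def integrate(emp1, emp2):
    """Build EmpO tuples keyed by SSN using the aggregate functions above."""
    by_ssn1 = {e["SSN"]: e for e in emp1}
    by_ssn2 = {e["SSN"]: e for e in emp2}
    emp_o = {}
    for ssn in sorted(by_ssn1.keys() | by_ssn2.keys()):
        e1, e2 = by_ssn1.get(ssn), by_ssn2.get(ssn)
        if e1 is not None and e2 is not None:
            salary = e1["Salary"] + e2["Salary"]  # world(Emp1) ∩ world(Emp2): Sum
        elif e1 is not None:
            salary = e1["Salary"]                 # world(Emp1) - world(Emp2)
        else:
            salary = e2["Salary"]                 # world(Emp2) - world(Emp1)
        emp_o[ssn] = {
            "Name": e1["Name"] if e1 is not None else e2["Name"],
            "Salary": salary,
            "Age": e1["Age"] if e1 is not None else None,
            "Rank": e2["Rank"] if e2 is not None else None,
        }
    return emp_o


emp1 = [
    {"SSN": 6789, "Name": "Smith", "Salary": 90000, "Age": 40},
    {"SSN": 4321, "Name": "Chang", "Salary": 62000, "Age": 30},
    {"SSN": 8642, "Name": "Patel", "Salary": 75000, "Age": 35},
]
emp2 = [
    {"SSN": 2222, "Name": "Ahad", "Salary": 98000, "Rank": "S. Mgr."},
    {"SSN": 7531, "Name": "Wang", "Salary": 95000, "Rank": "S. Mgr."},
    {"SSN": 6789, "Name": "Smith", "Salary": 25000, "Rank": "Mgr."},
]
emp_o = integrate(emp1, emp2)
```

Smith is the only employee in the overlap, so Smith's integrated salary is the sum of the two source salaries.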

Inconsistency Resolution Example

[Figure: Venn diagram of world(Emp1) and world(Emp2) with the regions world(Emp1) – world(Emp2), world(Emp1) ⋂ world(Emp2), and world(Emp2) – world(Emp1); labels: Generalization, Outer join.]

Query Decomposition Step 1: Determine Number of Subqueries

Global Query:
Select EmpO.Name, EmpO.Rank
From EmpO
Where EmpO.Salary > 80,000 AND EmpO.Age > 35

Assume outerjoin is used for schema integration.

Obtain a partition of world(EmpO) based on the aggregate function used to resolve the data inconsistency.

Option 1 (based on Salary)

part. 1: world(Emp1) – world(Emp2)
part. 2: world(Emp2) – world(Emp1)
part. 3: world(Emp1) ⋂ world(Emp2)

[Figure: Venn diagram of world(Emp1) and world(Emp2); regions 1, 3, 2 from left to right.]

Inconsistency Function:

EmpO.Salary = Emp1.Salary, if EmpO is in world(Emp1) – world(Emp2)
            = Emp2.Salary, if EmpO is in world(Emp2) – world(Emp1)
            = Sum(Emp1.Salary, Emp2.Salary), if EmpO is in world(Emp1) ⋂ world(Emp2)

Query Decomposition Step 1: Determine Number of Subqueries

Global Query:
Select EmpO.Name, EmpO.Rank
From EmpO
Where EmpO.Salary > 80,000 AND EmpO.Age > 35

Obtain a partition of world(EmpO) based on the aggregate function used to resolve the data inconsistency.

Option 2 (based on Age)

part. 1: world(Emp1)
part. 2: world(Emp2) – world(Emp1)

[Figure: Venn diagram of world(Emp1) and world(Emp2); region 1 is all of world(Emp1), region 2 is the rest of world(Emp2).]

Inconsistency Function:

EmpO.Age = Emp1.Age, if EmpO is in world(Emp1)
         = Null, if EmpO is in world(Emp2) – world(Emp1)

Query Decomposition Step 1: Determine Number of Subqueries

Global Query:
Select EmpO.Name, EmpO.Rank
From EmpO
Where EmpO.Salary > 80,000 AND EmpO.Age > 35

Obtain a partition of world(EmpO) based on the aggregate function used to resolve the data inconsistency.

Option 1 (based on Salary)

part. 1: world(Emp1) – world(Emp2)
part. 2: world(Emp2) – world(Emp1)
part. 3: world(Emp1) ⋂ world(Emp2)

Option 2 (based on Age)

part. 1: world(Emp1)
part. 2: world(Emp2) – world(Emp1)

We use Option 1 since it is the finest partition among all the partitions.

[Figure: the Venn diagrams of Option 1 (regions 1, 3, 2) and Option 2 (regions 1, 2) side by side.]
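Choosing the finest partition can be viewed as computing the common refinement of the candidate partitions. The following sketch is an assumption-laden illustration: each partition is modeled as a set of blocks over three atomic Venn regions, and `refine` is a hypothetical helper that intersects blocks pairwise.

```python
from itertools import product


def refine(p, q):
    """Common refinement of two partitions: all non-empty pairwise intersections."""
    return {a & b for a, b in product(p, q) if a & b}


# The three atomic regions of the Venn diagram (labels are illustrative).
ONLY1 = "world(Emp1) - world(Emp2)"
BOTH = "world(Emp1) ∩ world(Emp2)"
ONLY2 = "world(Emp2) - world(Emp1)"

# Option 1 (based on Salary): three parts.
option1 = {frozenset({ONLY1}), frozenset({ONLY2}), frozenset({BOTH})}
# Option 2 (based on Age): world(Emp1) versus the rest of world(Emp2).
option2 = {frozenset({ONLY1, BOTH}), frozenset({ONLY2})}

finest = refine(option1, option2)
```

Here the refinement equals Option 1, which is why Option 1 is used. When neither candidate refines the other, the refinement is a new, finer partition.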

Query Decomposition: Another Example

[Figure: two alternative two-part partitions of world(Emp1) and world(Emp2), Option 1 and Option 2, and the finer three-part partition (Option 3) that is used instead.]

Query Decomposition Step 2: Query Decomposition

Global Query:
Select EmpO.Name, EmpO.Rank
From EmpO
Where EmpO.Salary > 80,000 AND EmpO.Age > 35

Partition: part. 1 = world(Emp1) – world(Emp2), part. 2 = world(Emp2) – world(Emp1), part. 3 = world(Emp1) ⋂ world(Emp2)

Query Decomposition: Obtain a query for each subset in the chosen partition.

part. 1:
Select Emp1.Name
From Emp1
Where Emp1.Salary > 80,000 AND Emp1.Age > 35 AND Emp1.SSN NOT IN (Select Emp2.SSN From Emp2)

part. 2: This subquery is discarded because EmpO.Age is Null.

part. 3:
Select Emp1.Name, Emp2.Rank
From Emp1, Emp2
Where Sum(Emp1.Salary, Emp2.Salary) > 80,000 AND Emp1.Age > 35 AND Emp1.SSN = Emp2.SSN

Inconsistency functions:

EmpO.Age = Emp1.Age, if EmpO is in world(Emp1)
         = Null, if EmpO is in world(Emp2) – world(Emp1)

EmpO.Salary = Emp1.Salary, if EmpO is in world(Emp1) – world(Emp2)
            = Emp2.Salary, if EmpO is in world(Emp2) – world(Emp1)
            = Sum(Emp1.Salary, Emp2.Salary), if EmpO is in world(Emp1) ⋂ world(Emp2)
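The decomposed subqueries can be tried out with Python's built-in `sqlite3` module. This is a sketch under simplifying assumptions: both export schemas live in one in-memory database rather than at two sites, the sample rows are those of the Emp1 and Emp2 tables used earlier, and the two-argument Sum is written as `+`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Emp1 (OID INTEGER, SSN INTEGER, Name TEXT, Salary INTEGER, Age INTEGER);
CREATE TABLE Emp2 (OID INTEGER, SSN INTEGER, Name TEXT, Salary INTEGER, "Rank" TEXT);
INSERT INTO Emp1 VALUES (3, 6789, 'Smith', 90000, 40);
INSERT INTO Emp1 VALUES (4, 4321, 'Chang', 62000, 30);
INSERT INTO Emp1 VALUES (5, 8642, 'Patel', 75000, 35);
INSERT INTO Emp2 VALUES (1, 2222, 'Ahad', 98000, 'S. Mgr.');
INSERT INTO Emp2 VALUES (2, 7531, 'Wang', 95000, 'S. Mgr.');
INSERT INTO Emp2 VALUES (3, 6789, 'Smith', 25000, 'Mgr.');
""")

# part. 1: employees known only to Emp1, so Rank is null.
part1 = conn.execute("""
    SELECT Emp1.Name, NULL
    FROM Emp1
    WHERE Emp1.Salary > 80000 AND Emp1.Age > 35
      AND Emp1.SSN NOT IN (SELECT Emp2.SSN FROM Emp2)
""").fetchall()

# part. 2 is discarded: Age is null there, so Age > 35 can never hold.

# part. 3: employees in both worlds, salary resolved by summation.
part3 = conn.execute("""
    SELECT Emp1.Name, Emp2."Rank"
    FROM Emp1, Emp2
    WHERE Emp1.Salary + Emp2.Salary > 80000 AND Emp1.Age > 35
      AND Emp1.SSN = Emp2.SSN
""").fetchall()

answer = part1 + part3  # postprocessing: union of the partial results
conn.close()
```

On this data part. 1 is empty and part. 3 contributes Smith, whose combined salary passes the threshold.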

Query Decomposition Step 2: Query Decomposition (Query Modification)

[Figure: Venn diagram of world(Emp1) and world(Emp2) showing which source values each subquery uses.]

part. 1 (world(Emp1) – world(Emp2)): Emp1.Age, Emp1.Salary

part. 3 (world(Emp1) ⋂ world(Emp2)): Emp1.Age, Emp1.Salary + Emp2.Salary

part. 2 (world(Emp2) – world(Emp1)): Age = null, Emp2.Salary

Query Decomposition Step 3: Further Decomposition

STEP 3: Some resulting queries may still reference data from more than one database. They need to be further decomposed into subqueries and possibly also postprocessing queries.

Before STEP 3:
Select Emp1.Name
From Emp1
Where Emp1.Salary > 80,000 AND Emp1.Age > 35 AND Emp1.SSN NOT IN (Select Emp2.SSN From Emp2)

After STEP 3:
Insert INTO X
Select Emp2.SSN
From Emp2

Select Emp1.Name
From Emp1
Where Emp1.Salary > 80,000 AND Emp1.Age > 35 AND Emp1.SSN NOT IN X

Query Decomposition Step 4: Query Optimization

STEP 4: It may be desirable to reduce the number of subqueries by combining subqueries for the same database.

Query Translation

Query Translation (1)

IF Global Query Language ≠ Local Query Language

THEN a translator is needed: Export Schema Subquery → Translator → Local Query Language

Query Translation (2)

IF the source query language has a higher expressive power THEN EITHER

– some source queries cannot be translated; or

– they must be translated using both
• the syntax of the target query language, and
• some facilities of a high-level programming language.

Example: A recursive OODB query may not be translated into a relational query using SQL alone.

Relational-to-OO Translation

Equivalent Relational Schema:

Auto (Auto-OID, Color, Company-OID)
Company (Company-OID, Name, Profit, City-OID, People-OID)
People (People-OID, Name, Age, City-OID, Auto-OID)
City (City-OID, Name, State)

Attributes such as Company-OID in Auto are foreign keys.

OODB Schema:

Auto (OID, Color, Manufacturer)
Company (OID, Name, Profit, Headquarter, President)
People (OID, Name, Hometown, Automobile, Age)
City (OID, Name, State)

Relational-to-OO Example (1)

Global Query:
Select Auto1.*
From Auto Auto1, Auto Auto2, Company, People, City City1, City City2
Where Auto1.Company-OID = Company.Company-OID
AND Company.People-OID = People.People-OID
AND People.Age = 52
AND People.Auto-OID = Auto2.Auto-OID
AND Auto2.Color = “red”
AND People.City-OID = City1.City-OID
AND City1.Name = City2.Name
AND Company.City-OID = City2.City-OID

Relational Predicate Graph:

Auto1 Company

City2

City1

PeopleAge=52

Auto2Color=red

1) Company-OID

4) City

-OID

2) People-OID

3) Auto-OID

Find all cars whose manufacturer's President is 52 years old, owns a red car, and lives in a city with the same name as the manufacturer's headquarters city

1

2

3

4

5

6

5) Name

1+2+3

4+5+6

6) City

-OID

(Join)

Relational-to-OO Example (2)

OO Predicate Graph:

Auto1 Company

City2

PeopleAge=52

Auto2Color=red

Company-OID

City-O

ID

People-OID

Auto-OID

City1

City-OID

(Headquarte

r)

(Hometo

wn)

NameRelational Predicate Graph:

Auto1 Company

City2

City1

PeopleAge=52

Auto2Color=red

1) Company-OID

4) City

-OID

2) People-OID

3) Auto-OID5) Name

6) City

-OID

(Join)

Relational-to-OO Example (3)

OO Query:
Where Auto.Manufacturer.President.Age = 52 AND
Auto.Manufacturer.President.Automobile.Color = red AND
Auto.Manufacturer.Headquarter.Name = Auto.Manufacturer.President.Hometown.Name
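The path expressions of the OO query replace the explicit joins of the relational query with reference navigation. A minimal Python sketch, assuming plain object references stand in for OIDs; the class layout follows the OODB schema, while the `label` field and all sample objects are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


# eq=False and repr=False avoid recursion through the cyclic references.
@dataclass(eq=False, repr=False)
class City:
    name: str
    state: str


@dataclass(eq=False, repr=False)
class Auto:
    label: str  # illustrative identifier, not part of the slide schema
    color: str
    manufacturer: Optional["Company"] = None


@dataclass(eq=False, repr=False)
class People:
    name: str
    age: int
    hometown: City
    automobile: Auto


@dataclass(eq=False, repr=False)
class Company:
    name: str
    profit: float
    headquarter: City
    president: People


def matches(auto):
    """The three predicates of the OO query, written as path expressions."""
    m = auto.manufacturer
    return (
        m.president.age == 52
        and m.president.automobile.color == "red"
        and m.headquarter.name == m.president.hometown.name
    )


miami, orlando = City("Miami", "FL"), City("Orlando", "FL")
red, blue, grey = Auto("red coupe", "red"), Auto("blue sedan", "blue"), Auto("grey van", "grey")
ann, bob = People("Ann", 52, miami, red), People("Bob", 40, orlando, grey)
acme, zeta = Company("Acme", 1.0, miami, ann), Company("Zeta", 2.0, orlando, bob)
red.manufacturer = blue.manufacturer = acme
grey.manufacturer = zeta

result = [a.label for a in (red, blue, grey) if matches(a)]
```

Both Acme cars qualify because every predicate constrains the manufacturer and its president rather than the selected car itself.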

OO Predicate Graph:

Auto1 Company

City2

PeopleAge=52

Auto2Color=red

Company-OID

City-O

ID

People-OID

Auto-OID

City1

Predicate 3

Predicate 1

Predicate 2

City-OID

(Headquarte

r)

(Hometo

wn)

Name

Global Query Optimization

Query Optimization (1)

CASE 1: A single target query is generated

IF the target database system has a query optimizer

THEN the query optimizer can be used to optimize the translated query

ELSE the translator has to consider the performance issues

Query Optimization (2)

CASE 2: A set of target queries is needed.

• It might pay to have the minimum number of queries
– It minimizes the number of invocations of the target system
– It may also reduce the cost of combining the partial results

• It might pay for a set to contain target queries that can be well coordinated
– The results or intermediate results of the queries processed earlier can be used to reduce the cost of processing the remaining queries

Global Query Optimization (1)

• A query obtained by the query modification process may still reference data from more than one database.

Example: part. 3 (i.e., world(Emp1) ⋂ world(Emp2)) on page 126

Select Emp1.Name, Emp2.Rank
From Emp1, Emp2 /* access two databases */
Where sum(Emp1.Salary, Emp2.Salary) > 80,000 AND Emp1.Age > 35 AND Emp1.SSN = Emp2.SSN

→ Some global strategy is needed to process such queries

Global Query Optimization (2)

• Select Emp1.Name, Emp2.Rank
From Emp1, Emp2 /* access two databases */
Where sum(Emp1.Salary, Emp2.Salary) > 80,000 AND Emp1.Age > 35 AND Emp1.SSN = Emp2.SSN

→ Some global strategy is needed to process such queries

[Figure: three strategies, with Emp1 at Site 1 and Emp2 at Site 2: form the result at Site 1, form the result at Site 2, or form the result at a third site, Site 3.]

1+2

Data Inconsistency

• If C is integrated from C1 and C2 with no data inconsistency on attribute A, then

σA op a (C) = σA op a (C1) ⋃ σA op a (C2)

• If A has data inconsistency, then the above equality may no longer hold.

Example: Consider the select operation σEmpO.Salary > 100,000 (EmpO)

EmpO
OID  SSN   Name   Salary        Age   Rank
1    2222  Ahad   98,000        null  S. Mgr.
2    7531  Wang   95,000        null  S. Mgr.
3    6789  Smith  Inconsistent  40    Mgr.
4    4321  Chang  62,000        30    null
5    8642  Patel  75,000        35    null

The correct answer should have the record for Smith, since Smith does have a combined salary greater than 100,000. However, applying the selection to Emp1 and Emp2 separately returns an empty set.
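A few lines of Python make the failure concrete. This is a sketch with assumed data structures: each source is a dictionary from SSN to salary, holding the Emp1 and Emp2 salaries from the earlier tables, and the inconsistency is resolved by summation.

```python
# SSN -> salary in each source.
emp1 = {6789: 90000, 4321: 62000, 8642: 75000}
emp2 = {2222: 98000, 7531: 95000, 6789: 25000}

# Integrated salary: sum where both sources know the employee.
integrated = {
    ssn: emp1.get(ssn, 0) + emp2.get(ssn, 0)
    for ssn in emp1.keys() | emp2.keys()
}

threshold = 100000

# Selection applied to the integrated class.
correct = {ssn for ssn, s in integrated.items() if s > threshold}

# Selection applied to each source separately, then unioned.
distributed = (
    {ssn for ssn, s in emp1.items() if s > threshold}
    | {ssn for ssn, s in emp2.items() if s > threshold}
)
```

Neither source salary of Smith exceeds the threshold on its own, so the distributed form misses the only qualifying employee.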

Data Inconsistency - Optimization

Express an outerjoin (or a generalization) as outer-unions as follows:

C1 ⋈o C2 = C1-O ⋃o C2-O ⋃o (C1-C ⋈OID C2-C)

C1-O: Those tuples of C1 that have no matching tuples in C2 (private part)

C1-C: Those tuples of C1 that have matching tuples in C2 (overlap part)

σA op a (C1 ⋈o C2) = σA op a (C1-O) ⋃o σA op a (C2-O) ⋃o σA op a (C1-C ⋈ C2-C)

Can we improve the last term?

Distribution of Selections (1)

σA op a (C1 ⋈o C2) = σA op a (C1-O) ⋃o σA op a (C2-O) ⋃o σA op a (C1-C ⋈ C2-C)

The last term involves an expensive operation. When can we distribute σ over ⋈?

Attribute A is defined by an aggregate function (see page 124).

Distribution of Selection (2)

Four cases were identified when all arguments of the aggregate function f (for resolving conflicts) are non-negative:

1. f(A1,A2) op a ≡ A1 op a AND A2 op a:

σA op a (C1-C ⋈ C2-C) = σA1 op a (C1-C) ⋈ σA2 op a (C2-C)

Example: max(Emp1-C.Salary, Emp2-C.Salary) < 30K ≡ Emp1-C.Salary < 30K AND Emp2-C.Salary < 30K

2. f(A1,A2) op a ≡ f(A1 op a, A2 op a) op a:

σA op a (C1-C ⋈ C2-C) = σA op a (σA1 op a (C1-C) ⋈ σA2 op a (C2-C))

Example: sum(Emp1-C.Salary, Emp2-C.Salary) < 30K ≡ sum(Emp1-C.Salary < 30K, Emp2-C.Salary < 30K) < 30K

3. f(A1,A2) op a ≡ f(A1 op’ a, A2 op’ a) op a:

σA op a (C1-C ⋈ C2-C) = σA op a (σA1 op’ a (C1-C) ⋈ σA2 op’ a (C2-C))

Example: sum(Emp1-C.Salary, Emp2-C.Salary) = 30K ≡ sum(Emp1-C.Salary ≤ 30K, Emp2-C.Salary ≤ 30K) = 30K

4. No improvement is possible:

Example: sum(Emp1-C.Salary, Emp2-C.Salary) > 30K
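The four cases can be checked on a small overlap part. The sketch below is illustrative only: `join_select` and `pushdown` are hypothetical helpers, each side of the overlap is a dictionary from OID to a non-negative value, and the sample numbers are made up.

```python
import operator


def join_select(c1, c2, f, op, a):
    """Reference plan: join first, then select on f(A1, A2) op a."""
    return {k for k in c1.keys() & c2.keys() if op(f(c1[k], c2[k]), a)}


def pushdown(c1, c2, f, op, a, local_op=None, recheck=True):
    """Select on each operand first (with local_op), join, optionally re-select."""
    local_op = local_op or op
    s1 = {k: v for k, v in c1.items() if local_op(v, a)}
    s2 = {k: v for k, v in c2.items() if local_op(v, a)}
    keys = s1.keys() & s2.keys()
    if not recheck:  # case 1: the local selections are already sufficient
        return set(keys)
    return {k for k in keys if op(f(s1[k], s2[k]), a)}  # cases 2 and 3


def total(x, y):
    return x + y


c1 = {1: 10, 2: 20, 3: 35, 4: 5}   # OID -> A1 in C1-C
c2 = {1: 15, 2: 5, 3: 1, 4: 25}    # OID -> A2 in C2-C
a = 30

case1 = pushdown(c1, c2, max, operator.lt, a, recheck=False)          # max < a
case2 = pushdown(c1, c2, total, operator.lt, a)                       # sum < a
case3 = pushdown(c1, c2, total, operator.eq, a, local_op=operator.le) # sum = a
case4 = pushdown(c1, c2, total, operator.gt, a)                       # sum > a: unsafe
```

The first three plans agree with the join-then-select reference, while pushing `>` below the join for sum loses a qualifying tuple, which is why case 4 allows no improvement.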

Distribution Rules for σ over ⋈

σA op a (C1-C ⋈ C2-C), where A = f(A1, A2); each entry is the case number (4 = no improvement possible)

f \ op       >  ≥  ≤  <  =  ≠  in  not in
sum(A1, A2)  4  4  2  2  3  4  4   4
avg(A1, A2)  4  4  2  2  3  4  4   4
max(A1, A2)  4  4  1  1  3  4  4   4
min(A1, A2)  1  1  4  4  3  4  4   4

Problem in Global Query Optimization (1)

Important information about local entity sets that is needed to determine global query processing plans may not be provided by the local database systems.

– Examples: cardinalities, availability of fast access paths

– Techniques:

• Sampling queries may be designed to collect statistics about the local databases.

• A monitoring system can be used to collect the completion time for subqueries. This can be used to better estimate the cost of subsequent subqueries.

Problems in Global Query Optimization (2)

• Different query processing algorithms may have been used in different local database systems.
→ Cooperation across different systems is difficult.
Example: Semijoin may not be supported on some local systems.

• Data transmission between different local database systems may not be fully supported.
Examples:
– A local database system may not allow update operations.
– For many nonrelational systems, the instances of one entity set are more likely to be clustered with the instances of other entity sets. Such clustering makes it very expensive to extract data for one entity set.

→ Need more sophisticated decomposition algorithms.