DB Korth 6th Edition

8/19/2019 DB Korth 6th Edition


CHAPTER 18

Parallel Databases

This chapter is suitable for an advanced course, but can also be used for independent study projects by students of a first course. The chapter covers several aspects of the design of parallel database systems: partitioning of data, parallelization of individual relational operations, and parallelization of relational expressions. The chapter also briefly covers some systems issues, such as cache coherency and failure resiliency.

The most important applications of parallel databases today are for warehousing and analyzing large amounts of data. Therefore partitioning of data and parallel query processing are covered in significant detail. Query optimization is also of importance, for the same reason. However, parallel query optimization is still not a fully solved problem; exhaustive search, as is used for sequential query optimization, is too expensive in a parallel system, forcing the use of heuristics.

The description of parallel query processing algorithms is based on the shared-nothing model. Students may be asked to study how the algorithms can be improved if shared-memory machines are used instead.

18.9 For each of the three partitioning techniques, namely round-robin, hash partitioning, and range partitioning, give an example of a query for which that partitioning technique would provide the fastest response.

Answer:

Round-robin partitioning: When relations are large and queries read entire relations, round-robin gives good speed-up and fast response time.

Hash partitioning: For point queries on the partitioning attributes, this gives the fastest response, as each disk can process a different query simultaneously. If the hash partitioning is uniform, entire relation scans can be performed efficiently.

Range partitioning: For range queries on the partitioning attributes, which access a few tuples, range partitioning gives the fastest response.
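
The three techniques can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: disks are modeled as in-memory lists, Python's built-in hash stands in for the hash function, and the helper names (disks_for_point_query, disks_for_range_query) are hypothetical, not taken from the text.

```python
from bisect import bisect_right

def round_robin_partition(tuples, n):
    """Send the i-th tuple to disk i mod n."""
    disks = [[] for _ in range(n)]
    for i, t in enumerate(tuples):
        disks[i % n].append(t)
    return disks

def hash_partition(tuples, n, key):
    """Send each tuple to disk hash(partitioning attribute) mod n."""
    disks = [[] for _ in range(n)]
    for t in tuples:
        disks[hash(key(t)) % n].append(t)
    return disks

def range_partition(tuples, vector, key):
    """vector = [v0, v1, ...]: key < v0 goes to disk 0,
    v0 <= key < v1 goes to disk 1, and so on."""
    disks = [[] for _ in range(len(vector) + 1)]
    for t in tuples:
        disks[bisect_right(vector, key(t))].append(t)
    return disks

def disks_for_point_query(value, n):
    """Under hash partitioning a point query touches exactly one disk."""
    return {hash(value) % n}

def disks_for_range_query(lo, hi, vector):
    """Under range partitioning, the disks that may hold keys in [lo, hi]."""
    return set(range(bisect_right(vector, lo), bisect_right(vector, hi) + 1))
```

With round-robin, neither a point query nor a range query can be narrowed down, so every disk must be scanned; that is why it suits full-relation scans best. With the range vector [25, 50, 75], a query for keys 30 to 40 touches only disk 1.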

18.10 What factors could result in skew when a relation is partitioned on one of its attributes by:
a. Hash partitioning?
b. Range partitioning?
In each case, what can be done to reduce the skew?

Answer:

a. Hash partitioning: Too many records with the same value for the hashing attribute, or a poorly chosen hash function without the properties of randomness and uniformity, can result in a skewed partition. To improve the situation, we should experiment with better hashing functions for that relation.

b. Range partitioning: Non-uniform distribution of values for the partitioning attribute (including duplicate values for the partitioning attribute) which are not taken into account by a bad partitioning vector is the main reason for skewed partitions. Sorting the relation on the partitioning attribute and then dividing it into n ranges with equal number of tuples per range will give a good partitioning vector with very low skew.
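
The sort-and-divide idea in part (b) can be illustrated with a short Python sketch. The function names are hypothetical, and the cut-point convention (a value equal to a cut point goes to the higher partition) is an assumption made for this example.

```python
from bisect import bisect_right
from collections import Counter

def balanced_partition_vector(values, n):
    """Sort the partitioning-attribute values and cut them into n ranges
    holding (nearly) equal numbers of tuples; return the n-1 cut points."""
    ordered = sorted(values)
    size = len(ordered)
    return [ordered[(i * size) // n] for i in range(1, n)]

def partition_sizes(values, vector):
    """Number of tuples that land in each partition for a given vector."""
    counts = Counter(bisect_right(vector, v) for v in values)
    return [counts.get(i, 0) for i in range(len(vector) + 1)]

# A non-uniform distribution: the squares 0, 1, 4, ..., 9801.
values = [i * i for i in range(100)]

# Equal-width cut points ignore the distribution and produce skew.
naive_sizes = partition_sizes(values, [2450, 4900, 7350])

# Cut points taken from the sorted data give equal-sized partitions.
balanced_vector = balanced_partition_vector(values, 4)
balanced_sizes = partition_sizes(values, balanced_vector)
```

Here the equal-width vector puts half of the tuples in the first partition, while the vector derived from the sorted data gives 25 tuples per partition. Note that when a single value has very many duplicates, no partitioning vector can split that value across partitions, so some skew may remain.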

18.11 Give an example of a join that is not a simple equi-join for which partitioned parallelism can be used. What attributes should be used for partitioning?

Answer: We give two examples of such joins.

a. $r \bowtie_{(r.A = s.B) \wedge (r.A < s.C)} s$
Here we have an equi-join condition which can be executed first, and the extra conditions can be checked independently on each tuple in the join result. Partitioned parallelism is useful to execute the equi-join.

b. $r \bowtie_{(r.A \geq \lfloor s.B/20 \rfloor \cdot 20) \wedge (r.A < (\lfloor s.B/20 \rfloor + 1) \cdot 20)} s$
This is a query in which an r tuple and an s tuple join with each other if they fall into the same range of values. Hence partitioned parallelism applies naturally to this scenario, even though the join is not an equi-join.

For both the queries, r should be partitioned on attribute A and s on attribute B. For the second query, the partitioning of s should actually be done on $\lfloor s.B/20 \rfloor \cdot 20$.
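
Example (b) can be sketched in Python as follows. The sketch assumes that r is partitioned on A with the same width-20 range boundaries that are used for s, so that matching tuples meet at the same processor; the relations are reduced to lists of A and B values, and the function names are hypothetical.

```python
from collections import defaultdict

BAND = 20  # width of each range, as in example (b)

def bucket(x):
    """Lower bound of the width-20 range containing x: floor(x/20)*20."""
    return (x // BAND) * BAND

def partitioned_band_join(r, s, n):
    """r: list of A values, s: list of B values, n: number of processors.
    Joins r and s tuples whose values fall in the same width-20 range."""
    r_parts = defaultdict(list)
    s_parts = defaultdict(list)
    # Partition both relations on the range they fall into.
    for a in r:
        r_parts[(bucket(a) // BAND) % n].append(a)
    for b in s:
        s_parts[(bucket(b) // BAND) % n].append(b)
    result = []
    # Each processor p joins only its own pair of partitions.
    for p in range(n):
        for a in r_parts[p]:
            for b in s_parts[p]:
                if bucket(b) <= a < bucket(b) + BAND:
                    result.append((a, b))
    return sorted(result)

def centralized_band_join(r, s):
    """Reference implementation: compare every r tuple with every s tuple."""
    return sorted((a, b) for a in r for b in s
                  if bucket(b) <= a < bucket(b) + BAND)
```

Because two values can only join when they share a range, no pair that belongs in the result is ever split across processors, and the partitioned version returns the same tuples as the centralized one.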



18.12 Describe a good way to parallelize each of the following:
a. The difference operation
b. Aggregation by the count operation
c. Aggregation by the count distinct operation
d. Aggregation by the avg operation
e. Left outer join, if the join condition involves only equality
f. Left outer join, if the join condition involves comparisons other than equality
g. Full outer join, if the join condition involves comparisons other than equality

Answer:

a. We can parallelize the difference operation by partitioning the relations on all the attributes, and then computing differences locally at each processor. As in aggregation, the cost of transferring tuples during partitioning can be reduced by partially computing differences at each processor, before partitioning.

b. Let us refer to the group-by attribute as attribute A, and the attribute on which the aggregation function operates as attribute B. count is performed just like sum (mentioned in the book) except that a count of the number of values of attribute B for each value of attribute A is transferred to the correct destination processor, instead of a sum. After partitioning, the partial counts from all the processors are added up locally at each processor to get the final result.

c. For this, partial counts cannot be computed locally before partitioning. Each processor instead transfers all unique B values for each A value to the correct destination processor. After partitioning, each processor locally counts the number of unique tuples for each value of A, and then outputs the final result.

d. This can again be implemented like sum, except that for each value of A, a sum of the B values as well as a count of the number of tuples in the group is transferred during partitioning. Then each processor outputs its local result, by dividing the total sum by the total number of tuples for each A value assigned to its partition.

e. This can be performed just like partitioned natural join. After partitioning, each processor computes the left outer join locally using any of the strategies of Chapter 12.

f. The left outer join can be computed using an extension of the fragment-and-replicate scheme to compute non equi-joins. Consider the left outer join of r and s on a condition $\theta$. The relations are partitioned, and $r \bowtie_\theta s$ is computed at each site. We also collect tuples from r that did not match any tuples from s; call the set of these dangling tuples at site i as $d_i$. After the above step is done at each site, for each fragment of r, we take the intersection of the $d_i$'s from every processor in which the fragment of r was replicated. The intersections give the real set of dangling tuples; these tuples are padded with nulls and added to the result. The intersections themselves, followed by addition of padded tuples to the result, can be done in parallel by partitioning.

g. The algorithm is basically the same as above, except that when combining results, the processing of dangling tuples must be done for both relations.
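
The aggregation answers (b), (c) and (d) share one pattern: aggregate locally, partition the partial results on A, then combine. A minimal Python sketch is given below. It assumes each processor's fragment is a list of (A, B) pairs, simulates the processors with a loop, and uses hypothetical function names; it is an illustration of the pattern, not the book's algorithm verbatim.

```python
from collections import defaultdict

def dest(a, n):
    """Destination processor for group-by value a (hash partitioning on A)."""
    return hash(a) % n

def local_partials(fragment):
    """fragment: list of (A, B). Returns {A: [sum of B, count, set of B]}."""
    partial = defaultdict(lambda: [0, 0, set()])
    for a, b in fragment:
        entry = partial[a]
        entry[0] += b      # partial sum, needed for avg
        entry[1] += 1      # partial count, for count and avg
        entry[2].add(b)    # unique B values, for count distinct
    return partial

def parallel_aggregate(fragments):
    n = len(fragments)
    # Step 1: each processor aggregates its own fragment.
    partials = [local_partials(f) for f in fragments]
    # Step 2: partial results are partitioned on A and merged on arrival.
    inbox = [defaultdict(lambda: [0, 0, set()]) for _ in range(n)]
    for partial in partials:
        for a, (s, c, distinct) in partial.items():
            merged = inbox[dest(a, n)][a]
            merged[0] += s
            merged[1] += c
            merged[2] |= distinct
    # Step 3: each processor outputs the groups assigned to it.
    result = {}
    for box in inbox:
        for a, (s, c, distinct) in box.items():
            result[a] = {"count": c, "avg": s / c,
                         "count_distinct": len(distinct)}
    return result
```

Note the difference in what is shipped: count and avg send only one or two numbers per group from each processor, whereas count distinct has to send the set of unique B values, since duplicates across processors can only be detected after partitioning.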

18.13 Describe the benefits and drawbacks of pipelined parallelism.

Answer:

• Benefits: No need to write intermediate relations to disk only to read them back immediately.

• Drawbacks:
a. Cannot take advantage of high degrees of parallelism, as queries do not have a large number of operations.
b. Not possible to pipeline operators which need to look at all the input before producing any output.
c. Since each operation executes on a single processor, the most expensive ones take a long time to finish. Thus speed-up will be low despite the use of parallelism.
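
The benefit and drawback (b) can be made concrete with Python generators. This is a single-threaded sketch of the dataflow only: each generator stands in for an operator that would run on its own processor, and the operator and helper names are hypothetical.

```python
def scan(relation, log):
    """Leaf operator: reads tuples one at a time."""
    for t in relation:
        log.append(("scan", t))
        yield t

def select(tuples, predicate):
    """Pipelined operator: passes each qualifying tuple on immediately."""
    for t in tuples:
        if predicate(t):
            yield t

def sort_op(tuples):
    """Blocking operator: must consume all input before emitting anything."""
    yield from sorted(tuples)

def output(tuples, log):
    """Root operator: records when each result tuple is produced."""
    for t in tuples:
        log.append(("output", t))
        yield t

def run_pipelined(relation):
    log = []
    list(output(select(scan(relation, log), lambda t: t > 1), log))
    return log

def run_with_sort(relation):
    log = []
    list(output(sort_op(scan(relation, log)), log))
    return log
```

In run_pipelined the first result is produced right after the first tuple is scanned, and no intermediate relation is ever stored. In run_with_sort every scan event precedes the first output event, because sorting cannot emit a tuple until it has seen all of its input.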

18.14 Suppose you wish to handle a workload consisting of a large number of small transactions by using shared-nothing parallelism.
a. Is intraquery parallelism required in such a situation? If not, why, and what form of parallelism is appropriate?
b. What form of skew would be of significance with such a workload?
c. Suppose most transactions accessed one account record, which includes an account_type attribute, and an associated account_type_master record, which provides information about the account type. How would you partition and/or replicate data to speed up transactions? You may assume that the account_type_master relation is rarely updated.

Answer:

a. Intraquery parallelism is probably not appropriate for this situation. Since each individual transaction is small, the overhead of parallelizing each query may exceed the potential benefits. Interquery parallelism would be a better choice, allowing many transactions to run in parallel.

b. Partition skew can be a performance issue in this type of system, especially with the use of shared-nothing parallelism. A load imbalance amongst the processors of the distributed system can significantly reduce the speedup gained by parallel execution. For example, if all transactions happen to involve only the data in a single partition, the processors not associated with that partition will not be used at all.

c. Since account_type_master is rarely updated, it can be replicated in entirety across all nodes. If the account relation is updated frequently and accesses are well-distributed, it should be partitioned across nodes.
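
The layout proposed in part (c) can be sketched as follows. The class names, the account_number and balance fields, and the sample account types are hypothetical; the sketch assumes hash partitioning of account on the account number and models each node as a pair of in-memory dictionaries.

```python
class Node:
    def __init__(self, type_master):
        self.accounts = {}                    # this node's partition of account
        self.type_master = dict(type_master)  # full replica of account_type_master

class Cluster:
    def __init__(self, n, type_master):
        self.nodes = [Node(type_master) for _ in range(n)]

    def node_for(self, account_number):
        """Hash partitioning of account on the account number."""
        return hash(account_number) % len(self.nodes)

    def insert_account(self, account_number, account_type, balance):
        node = self.nodes[self.node_for(account_number)]
        node.accounts[account_number] = (account_type, balance)

    def lookup(self, account_number):
        """A small transaction: runs entirely on one node, because both the
        account record and the account type information are local to it."""
        idx = self.node_for(account_number)
        node = self.nodes[idx]
        account_type, balance = node.accounts[account_number]
        return idx, balance, node.type_master[account_type]
```

Each transaction is routed to exactly one node, so many transactions can proceed in parallel on different nodes (interquery parallelism). The price of replication is that an update to account_type_master must be applied at every node, which is acceptable only because such updates are rare.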

18.15 The attribute on which a relation is partitioned can have a significant impact on the cost of a query.
a. Given a workload of SQL queries on a single relation, what attributes would be candidates for partitioning?
b. How would you choose between the alternative partitioning techniques, based on the workload?
c. Is it possible to partition a relation on more than one attribute? Explain your answer.

Answer:

a. The candidate attributes would be:
i. Attributes on which one or more queries have a selection condition. The corresponding selection condition can then be evaluated at a single processor, instead of being evaluated at all processors.
ii. Attributes involved in join conditions. If such an attribute is used for partitioning, it is possible to perform the join without repartitioning the relation. This effect is particularly beneficial for very large relations, for which repartitioning can be very expensive.
iii. Attributes involved in group by clauses; similar to joins, it is possible to perform aggregation without repartitioning the corresponding relation.

b. A cost-based approach works best in choosing between alternatives. In this approach, candidate partitioning choices are generated, and for each candidate the cost of executing all the queries/updates in a workload is estimated. The choice leading to the least cost is picked.

One issue is that the number of candidate choices is generally very large. Algorithms and heuristics designed to limit the number of candidates for which costs need to be estimated are widely used in practice.

Another issue is that the workload may have a very large number of queries/updates. Techniques to reduce this number include the following: (a) combining repeated occurrences of a query that only differ in constants, replacing them by one parametrized query along with a count of the number of occurrences, and (b) dropping queries which are very cheap anyway, or not likely to be affected by the partitioning choice.

c. It is possible to partition a relation on more than one attribute, in two ways. One is to involve multiple attributes in a single composite partitioning key. The other way is to keep more than one copy of the relation, partitioned in different ways. The latter approach increases update costs, but can speed up some queries significantly.
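
The cost-based selection described in part (b) can be sketched as below. The cost model is a deliberately crude assumption made for this example (a query costs 1 when it uses the partitioning attribute and n, the number of processors, otherwise); a real system would use the optimizer's cost estimates. Queries are represented as hypothetical (kind, attribute) templates with constants already removed.

```python
from collections import Counter

def collapse_workload(queries):
    """Combine repeated occurrences of the same parametrized query into
    one template together with a count of its occurrences."""
    return Counter(queries)

def estimate_cost(partition_attr, query, n):
    """Toy cost model (an assumption for this sketch): a query touches one
    processor if it uses the partitioning attribute, else all n processors."""
    kind, attr = query
    return 1 if attr == partition_attr else n

def choose_partitioning(candidates, queries, n):
    """Estimate the workload cost under each candidate and pick the cheapest."""
    workload = collapse_workload(queries)
    costs = {
        c: sum(count * estimate_cost(c, q, n) for q, count in workload.items())
        for c in candidates
    }
    return min(costs, key=costs.get), costs

# Hypothetical workload: five selections on id, two joins and one group by on dept.
workload = [("select", "id")] * 5 + [("join", "dept")] * 2 + [("group_by", "dept")]
best, costs = choose_partitioning(["id", "dept"], workload, n=4)
```

Collapsing the workload first means each distinct template is costed once per candidate, no matter how many times it occurs, which is the point of technique (a) in the answer.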
