Transcript of "Parallel Database Systems: Analyzing LOTS of Data", Jim Gray, Microsoft Research

1

Parallel Database Systems

Analyzing LOTS of Data

Jim Gray, Microsoft Research

3

Main Message

Technology trends give
– many processors and storage units
– inexpensively

To analyze large quantities of data
– sequential (regular) access patterns are 100x faster
– parallelism is 1000x faster (trades time for money)
– Relational systems show many parallel algorithms.

5

Moore’s Law

Each of these doubles every 18 months, i.e. grows about 60% per year (2^(12/18) ≈ 1.59):
• micro processor speeds
• chip density
• magnetic disk density
• communications bandwidth

WAN bandwidth is approaching LANs.

[Figure: memory chip density, 1970–2000, growing from 1 Kbit to 256 Mbit per chip; one-chip memory size reaching 2 MB to 32 MB.]

6

Magnetic Storage Cheaper than Paper

• File cabinet: cabinet (4 drawer) $250
  paper (24,000 sheets) $250
  space (2x3 ft @ $10/ft2) $180
  total $700, about 3 ¢/sheet
• Disk: disk (4 GB) $500
  ASCII: 2 million pages, 0.025 ¢/sheet (100x cheaper)
• Image: 200 k pages, 0.25 ¢/sheet (10x cheaper)
• Store everything on disk

9

Today’s Storage Hierarchy: Speed & Capacity vs Cost Tradeoffs

[Figure: two charts over the storage hierarchy (cache, main memory, secondary storage / disc, online tape, nearline tape, offline tape). Size vs Speed: typical size grows from 10^3 to 10^15 bytes as access time grows from 10^-9 to 10^3 seconds. Price vs Speed: price falls from about 10^4 to 10^-4 $/MB over the same access-time range.]

10

Storage Latency: How Far Away is the Data?

Measured in clock ticks, with a physical-distance analogy:
  Registers: 1 (my head, 1 min)
  On-chip cache: 2 (this room)
  On-board cache: 10 (this campus, 10 min)
  Memory: 100 (Sacramento, 1.5 hr)
  Disk: 10^6 (Pluto, 2 years)
  Tape / optical robot: 10^9 (Andromeda, 2,000 years)

11

Thesis: Many Little will Win over Few Big

[Figure: the same computing power packaged ever smaller and cheaper, from mainframe (1 M$) to mini (100 K$) to micro (10 K$) to nano, as disc form factors shrink from 14" and 9" to 5.25", 3.5", 2.5", and 1.8".]

12

Implications of Hardware Trends

Large disc farms will be inexpensive (10 k$/TB).
Large RAM databases will be inexpensive (1 k$/GB).
Processors will be inexpensive.

So the building block will be a processor with large RAM, lots of disc, and lots of network bandwidth: a CyberBrick™ (e.g. a 1k SPECint CPU with 2 GB RAM and 500 GB of disc).

13

Implication of Hardware Trends: Clusters

Future servers are CLUSTERS of processors and discs.
Distributed database techniques make clusters work.

[Figure: a cluster of nodes, each with a CPU, 5 GB RAM, and 50 GB of disc.]

15

Summary

Tech trends => pipeline & partition parallelism
– Lots of bytes & bandwidth per dollar
– Lots of latency
– Lots of MIPS per dollar
– Lots of processors

Putting it together: scaleable networks and platforms
– Build clusters of commodity processors & storage
– Commodity interconnect is key (the S of PMS)
  • Traditional interconnects give 100 k$/slice.
– Commodity cluster operating system is key
– Fault isolation and tolerance is key
– Automatic parallel programming is key

18

Parallelism: Performance is the Goal

Goal is to get 'good' performance.
Law 1: a parallel system should be faster than the serial system.
Law 2: a parallel system should give near-linear scaleup, near-linear speedup, or both.

19

Kinds of Parallel Execution

Pipeline: any sequential program feeds its output to the next sequential program.
Partition: outputs split N ways, inputs merge M ways; each partition runs any sequential program.

[Figure: a pipeline of sequential operators, and partitioned execution with split and merge.]

20

The Perils of Parallelism

Speedup = OldTime / NewTime

[Figure: speedup vs. processors & discs. The good speedup curve is linear. One bad curve shows no parallelism benefit; another is limited by three factors: startup, interference, and skew.]

Startup: creating processes, opening files, optimization.
Interference: device (cpu, disc, bus) and logical (lock, hotspot, server, log, ...).
Skew: if tasks get very small, variance > service time.
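A toy model (my own illustration, not from the talk) of how startup, interference, and skew bend the speedup curve; the cost formula and all constants are assumptions chosen only to show the shape:

```python
def speedup(n, work=1000.0, startup_per_proc=0.5, interference=0.02, skew=0.1):
    """Illustrative only: OldTime / NewTime for n processors under three overheads."""
    old_time = work
    slowest_partition = (work / n) * (1.0 + (skew if n > 1 else 0.0))  # skew: wait for the slowest
    startup = startup_per_proc * n                                     # startup grows with n
    contention = work * interference * (n - 1) / n                     # interference grows with n
    return old_time / (slowest_partition + startup + contention)

for n in (1, 10, 100, 1000):
    print(n, round(speedup(n), 1))    # near-linear at first, then flattens, then declines
```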

21

Why Parallel Access To Data?

At 10 MB/s it takes about 1.2 days to scan 1 terabyte.
With 1,000-way parallelism the same terabyte is scanned in about 1.5 minutes.
The win is bandwidth.

Parallelism: divide a big problem into many smaller ones to be solved in parallel.
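The arithmetic behind those two numbers (plain unit conversion, assuming only the 10 MB/s per-disc rate from the slide):

```python
terabyte = 10**12            # bytes
rate = 10 * 10**6            # 10 MB/s from one disc

serial = terabyte / rate     # 100,000 seconds
print(serial / 86400)        # ~1.2 days for a single sequential scan

parallel = serial / 1000     # 1,000 discs scanning their partitions at once
print(parallel / 60)         # ~1.7 minutes (the slide says 1.5)
```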

22

Data Flow Programming: Prefetch & Postwrite Hide Latency

Can't wait for the data to arrive (2,000 years!).
Need memory that gets the data in advance (~100 MB/s).

Solution:
Pipeline from storage (tape, disc, ...) to cpu cache.
Pipeline results to destination.
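A minimal prefetch sketch (an assumed illustration, not from the talk): a background reader keeps a bounded buffer of blocks ahead of the consumer, so storage latency overlaps with processing.

```python
import queue, threading

def prefetch(read_block, n_blocks, depth=8):
    """Yield blocks while a reader thread stays up to `depth` blocks ahead."""
    buf = queue.Queue(maxsize=depth)          # bounded buffer gives flow control
    DONE = object()

    def reader():
        for i in range(n_blocks):
            buf.put(read_block(i))            # blocks when the consumer falls behind
        buf.put(DONE)

    threading.Thread(target=reader, daemon=True).start()
    while (block := buf.get()) is not DONE:
        yield block

# Usage: read_block would be a real disc or tape read in practice.
for block in prefetch(lambda i: f"block-{i}", n_blocks=5):
    print(block)
```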

23

Parallelism: Speedup & Scaleup

Speedup: same job (100 GB), more hardware, less time.
Scaleup: bigger job (100 GB grows to 1 TB), more hardware, same time.
Transaction scaleup: more clients and more servers (1 k clients grow to 10 k clients), same response time.

24

Benchmark Buyer's Guide: The Whole Story (for any system)

[Figure: throughput vs. processors & discs, with "the benchmark report" marked as one point on the curve.]

Things to ask:
– When does it stop scaling?
– Throughput numbers, not ratios.
– Standard benchmarks allow comparison to others and comparison to sequential systems.
– Ratios & non-standard benchmarks are red flags.

25

Why are Relational Operators So Successful for Parallelism?

The relational data model gives uniform operators on uniform data streams, closed under composition.
Each operator consumes 1 or 2 input streams.
Each stream is a uniform collection of data.
Sequential data in and out: pure dataflow.

Partitioning some operators (e.g. aggregates, non-equi-join, sort, ...) requires innovation.

The payoff: AUTOMATIC PARALLELISM.

26

Database Systems “Hide” Parallelism

• Automate system management via tools
  • data placement
  • data organization (indexing)
  • periodic tasks (dump / recover / reorganize)
• Automatic fault tolerance
  • duplex & failover
  • transactions
• Automatic parallelism
  • among transactions (locking)
  • within a transaction (parallel execution)

27

Automatic Parallel OR DB

Select image
from landsat
where date between 1970 and 1990
  and overlaps(location, :Rockies)
  and snow_cover(image) > .7;

The landsat table has temporal (date), spatial (location), and image attributes.

Assign one process per processor/disk:
– find the images with the right date & location;
– analyze each image and, if it is 70% snow, return it.

[Figure: landsat rows (date, loc, image) ranging from 1/2/72 to 4/8/95 and 33N120W to 34N120W; the date, location, & image tests run in parallel and feed the answer.]

28

Automatic Data Partitioning

Split a SQL table across a subset of nodes & disks. Partition within the set by:
– Range: good for equi-joins, range queries, and group-by.
– Hash: good for equi-joins.
– Round robin: good to spread load.

Shared-disk and shared-memory systems are less sensitive to partitioning; shared-nothing benefits from "good" partitioning.

[Figure: the same table split A...E, F...J, K...N, O...S, T...Z across five nodes under each scheme.]
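A sketch of the three schemes (illustrative; the node count, splitter letters, and function names are assumptions, not from the talk):

```python
import zlib

NODES = 5
SPLITTERS = ["E", "J", "N", "S"]        # ranges A..E, F..J, K..N, O..S, T..Z

def range_partition(key: str) -> int:
    """Range: preserves key order, so range queries touch few nodes."""
    for node, upper in enumerate(SPLITTERS):
        if key[0].upper() <= upper:
            return node
    return NODES - 1

def hash_partition(key: str) -> int:
    """Hash: spreads keys evenly, good for equi-joins on the key."""
    return zlib.crc32(key.encode()) % NODES

_next = 0
def round_robin_partition(key: str) -> int:
    """Round robin: ignores the key, purely spreads load."""
    global _next
    _next = (_next + 1) % NODES
    return _next

for name in ("Adams", "Gray", "Stonebraker"):
    print(name, range_partition(name), hash_partition(name), round_robin_partition(name))
```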

29

Index Partitioning

Hash indices partition by hash.
B-tree indices partition as a forest of trees, one tree per range (e.g. 0...9, 10..19, 20..29, 30..39, 40... or A..C, D..F, G..M, N..R, S..Z).
The primary index clusters the data.

30

Secondary Index Partitioning

In shared nothing, secondary indices are problematic.

Partition by base table key ranges:
– Insert: completely local (but what about uniqueness?)
– Lookup: examines ALL the index trees (see figure)
– A unique index then requires a lookup on every insert.

Partition by secondary key ranges (the Teradata solution):
– Insert: two nodes (base and index)
– Lookup: two nodes (index -> base)
– Uniqueness is easy.

[Figure: the two schemes over a base table A..Z, one with index partitions A..C, D..F, G..M, N..R, S..Z by secondary key and one where each index partition spans A..Z.]

31

Data Rivers: Split + Merge Streams

Producers add records to the river and consumers take records from the river, so both sides are purely sequential programs.
The river does flow control and buffering, and it does the partitioning and merging of data records.
River = Split/Merge in Gamma = the Exchange operator in Volcano.

[Figure: N producers and M consumers connected by a river of N x M data streams.]
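A minimal river sketch (an assumed illustration in the spirit of Gamma's split/merge and Volcano's exchange, not their code): producers hash-split records into one bounded queue per consumer, and each producer and consumer remains a plain sequential loop.

```python
import queue, threading, zlib

def run_river(producers, n_consumers=2, depth=16):
    """Connect N producer generators to M consumer lanes via hash-partitioned queues."""
    lanes = [queue.Queue(maxsize=depth) for _ in range(n_consumers)]   # buffering + flow control
    DONE = object()

    def produce(gen):
        for key, value in gen:
            lanes[zlib.crc32(str(key).encode()) % n_consumers].put((key, value))  # split

    workers = [threading.Thread(target=produce, args=(g,)) for g in producers]
    for w in workers: w.start()
    for w in workers: w.join()
    for lane in lanes: lane.put(DONE)

    out = []
    for lane in lanes:                     # each consumer drains only its own lane
        while (rec := lane.get()) is not DONE:
            out.append(rec)
    return out

gen1 = ((w, 1) for w in ["ant", "bee", "cat"])
gen2 = ((w, 1) for w in ["dog", "emu", "fox"])
print(run_river([gen1, gen2]))
```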

32

Partitioned Execution

Spreads computation and IO among processors.
Partitioned data gives NATURAL parallelism.

[Figure: a table partitioned A...E, F...J, K...N, O...S, T...Z, with a Count operator on each partition feeding one final Count.]

33

N x M way Parallelism

[Figure: the five partitions A...E through T...Z each feed a Sort and a Join; the join outputs flow through Merge operators to M output streams.]

N inputs, M outputs, no bottlenecks.
Partitioned data; partitioned and pipelined data flows.

34

Picking Data Ranges

Disk partitioning:
– For range partitioning, sample the load on the disks; cool hot disks by making their ranges smaller.
– For hash partitioning, cool hot disks by mapping some buckets to other disks.

River partitioning:
– Use hashing and assume uniformity.
– If range partitioning, sample the data and use a histogram to level the bulk.

Teradata, Tandem, and Oracle use these tricks.
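A sketch of the histogram trick for range-partitioned rivers (illustrative assumption: splitters are taken at the quantiles of a data sample, so each range receives roughly the same volume):

```python
import random

def pick_splitters(sample, n_ranges):
    """Splitter keys at sample quantiles: each range gets ~equal data, even with skew."""
    s = sorted(sample)
    return [s[len(s) * i // n_ranges] for i in range(1, n_ranges)]

def route(key, splitters):
    """Send a record to the first range whose upper splitter covers its key."""
    for i, upper in enumerate(splitters):
        if key <= upper:
            return i
    return len(splitters)

random.seed(0)
sample = [int(random.expovariate(0.02)) for _ in range(1000)]   # skewed toward small keys
splitters = pick_splitters(sample, 4)
print(splitters, route(90, splitters))   # small keys get narrow ranges, large keys wide ones
```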

35

Blocking Operators = Short Pipelines

An operator is blocking if it does not produce any output until it has consumed all of its input.
Examples: sort, aggregates, hash-join (reads all of one operand before producing output).

Blocking operators kill pipeline parallelism and make partition parallelism all the more important.

[Figure: a database-load template with three blocked phases: scan the tape file and sort runs; merge the runs and insert into the SQL table; then build index 1, index 2, and index 3.]
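A small sketch (my illustration, not from the talk) of why a blocking operator shortens the pipeline: the sort below cannot yield its first row until it has drained its whole input, while the filter streams.

```python
def streaming_filter(rows, pred):
    """Non-blocking: each qualifying row is emitted as soon as it arrives."""
    for row in rows:
        if pred(row):
            yield row

def blocking_sort(rows):
    """Blocking: buffers the entire input before producing the first output row."""
    buffered = list(rows)      # consumes (and stalls) the whole upstream pipeline
    buffered.sort()
    yield from buffered

source = iter([5, 1, 4, 2, 3])
print(list(blocking_sort(streaming_filter(source, lambda r: r != 4))))   # [1, 2, 3, 5]
```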

36

Simple Aggregates (sort or hash?)

Simple aggregates (count, min, max, ...) can use indices: indices are more compact and sometimes already hold the aggregate info.

GROUP BY aggregates:
– scan in category order if possible (use indices);
– else, if the categories fit in RAM, use a RAM category hash table;
– else, make a temp file of <category, item>, sort it by category, and do the math in the merge step.
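A minimal sketch of the RAM hash-table case (assumed example): one pass over the rows, one hash-table entry per category.

```python
from collections import defaultdict

def group_by_sum(rows):
    """One-pass GROUP BY ... SUM using an in-memory category hash table."""
    totals = defaultdict(float)
    for category, value in rows:
        totals[category] += value      # valid only while all categories fit in RAM
    return dict(totals)

rows = [("CA", 10.0), ("WA", 5.0), ("CA", 2.5)]
print(group_by_sum(rows))              # {'CA': 12.5, 'WA': 5.0}
```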

37

Parallel Aggregates

For each aggregate function we need a decomposition strategy:
  count(S) = Σ count(s(i)), and ditto for sum()
  avg(S) = (Σ sum(s(i))) / (Σ count(s(i)))
  and so on...

For groups:
– sub-aggregate the groups close to the source;
– drop the sub-aggregates into a hash river.

[Figure: a table partitioned A...E through T...Z, a Count on each partition, and one final Count that combines the sub-aggregates.]
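A sketch of the decomposition (illustrative; the partition contents are made up): each partition returns a (count, sum) sub-aggregate and the combiner derives the global count, sum, and avg.

```python
def local_agg(partition):
    """Sub-aggregate one partition close to its data."""
    return len(partition), sum(partition)            # (count, sum)

def combine(sub_aggregates):
    """count(S) = sum of counts; avg(S) = sum of sums / sum of counts."""
    total_count = sum(c for c, _ in sub_aggregates)
    total_sum = sum(s for _, s in sub_aggregates)
    return {"count": total_count, "sum": total_sum, "avg": total_sum / total_count}

partitions = [[1.0, 2.0, 3.0], [10.0], [4.0, 5.0]]    # data spread over three nodes
print(combine([local_agg(p) for p in partitions]))    # count 6, sum 25.0, avg ~4.17
```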

38

Parallel Sort: M inputs, N outputs

Sub-sorts generate runs from the scan (or other source); the runs are merged downstream.
The river between the two phases is range or hash partitioned.
The disk pass and the merge are not needed if the sort fits in memory.

Scales "linearly" because the per-record cost grows only with the log of the data size: log(10^12) / log(10^6) = 12/6 = 2, so sorting 10^12 records instead of 10^6 is only about 2x slower per record.

Sort is the benchmark from hell for shared-nothing machines: net traffic equals disk bandwidth, and there is no data filtering at the source.
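A sketch of the partitioned sort (an assumed illustration): sample splitters, range-partition the input (the "river"), sort each partition independently, then concatenate the already-ordered partitions.

```python
import bisect

def parallel_sort(records, n_nodes=4):
    """Range-partition, sort each partition locally, concatenate in range order."""
    sample = sorted(records[:: max(1, len(records) // 64)])             # small sample
    splitters = [sample[len(sample) * i // n_nodes] for i in range(1, n_nodes)]

    partitions = [[] for _ in range(n_nodes)]
    for r in records:
        partitions[bisect.bisect_left(splitters, r)].append(r)         # the river split

    for p in partitions:
        p.sort()                       # each node sorts its own range (in parallel, ideally)
    return [r for p in partitions for r in p]                          # ranges are disjoint

data = [9, 3, 7, 1, 8, 2, 6, 4, 5, 0]
assert parallel_sort(data) == sorted(data)
```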

39

Hash Join: Combining Two Tables

Hash the smaller table into N buckets (hope N = 1).
If N = 1: read the larger table and probe the in-memory hash of the smaller one.
Else: hash the outer table to disk, then do a bucket-by-bucket hash join.

Purely sequential data behavior.
Always beats sort-merge and nested-loops joins unless the data is clustered.
Good for equi-, outer-, and exclusion joins.
Lots of papers; products just appearing (what went wrong?).
Hashing reduces skew.

[Figure: one table hashed into buckets, probed by the other table.]
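A minimal in-memory sketch of the N = 1 case (assumed example): build a hash table on the smaller input, stream the larger input past it.

```python
from collections import defaultdict

def hash_join(small, large, small_key, large_key):
    """Equi-join: build on the smaller table, probe with the larger."""
    buckets = defaultdict(list)
    for row in small:
        buckets[small_key(row)].append(row)              # build phase (fits in RAM)
    for row in large:
        for match in buckets.get(large_key(row), ()):    # probe phase, one sequential pass
            yield match, row

depts = [("d1", "Research"), ("d2", "Sales")]
emps = [("alice", "d1"), ("bob", "d2"), ("carol", "d1")]
print(list(hash_join(depts, emps, small_key=lambda d: d[0], large_key=lambda e: e[1])))
```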

40

Parallel Hash Join

ICL implemented hash join with bitmaps in the CAFS machine (1976)!
Kitsuregawa pointed out the parallelism benefits of hash join in the early 1980s (it partitions beautifully).
We ignored them! (why?) But now everybody's doing it (or promises to do it).
Hashing minimizes skew and requires little thinking for redistribution.
Hashing uses massive main memory.

42

What Systems Work This Way

Shared nothing:
– Teradata: 400 nodes (80x12 nodes)
– Tandem: 110 nodes
– IBM SP2 / DB2: 128 nodes
– Informix / SP2: 100 nodes
– ATT & Sybase: 8x14 nodes

Shared disk:
– Oracle: 170 nodes
– Rdb: 24 nodes

Shared memory:
– Informix: 9 nodes
– RedBrick: ? nodes

[Figure: clients attached to shared-nothing, shared-disk, and shared-memory processor/memory configurations.]

43

Outline

Why parallelism:
– technology push
– application pull

Parallel database techniques:
– partitioned data
– partitioned and pipelined execution
– parallel relational operators

44

Main Message

Technology trends give
– many processors and storage units
– inexpensively

To analyze large quantities of data
– sequential (regular) access patterns are 100x faster
– parallelism is 1000x faster (trades time for money)
– Relational systems show many parallel algorithms.