Parallel Database Systems Instructor: Dr. Yingshu Li Student: Chunyu Ai.
-
Upload
marshall-shelton -
Category
Documents
-
view
223 -
download
1
Transcript of Parallel Database Systems Instructor: Dr. Yingshu Li Student: Chunyu Ai.
![Page 1: Parallel Database Systems Instructor: Dr. Yingshu Li Student: Chunyu Ai.](https://reader035.fdocuments.in/reader035/viewer/2022062517/56649efb5503460f94c0d5eb/html5/thumbnails/1.jpg)
Parallel Database Systems
Instructor: Dr. Yingshu Li Student: Chunyu Ai
![Page 2: Parallel Database Systems Instructor: Dr. Yingshu Li Student: Chunyu Ai.](https://reader035.fdocuments.in/reader035/viewer/2022062517/56649efb5503460f94c0d5eb/html5/thumbnails/2.jpg)
Main Message Technology trends give
– many processors and storage units– inexpensively
To analyze large quantities of data– sequential (regular) access patterns are 100x faster– parallelism is 1000x faster (trades time for money)– Relational systems show many parallel algorithms.
![Page 3: Parallel Database Systems Instructor: Dr. Yingshu Li Student: Chunyu Ai.](https://reader035.fdocuments.in/reader035/viewer/2022062517/56649efb5503460f94c0d5eb/html5/thumbnails/3.jpg)
5
1k SPECintCPU
500 GB Disc
2 GB RAM
Implications of Hardware Trends
Large Disc Farms will be inexpensive (10k$/TB)
Large RAM databases will be inexpensive (1K$/GB)
Processors will be inexpensive
So building block will be a processor with large RAM lots of Disc
lots of network bandwidth
![Page 4: Parallel Database Systems Instructor: Dr. Yingshu Li Student: Chunyu Ai.](https://reader035.fdocuments.in/reader035/viewer/2022062517/56649efb5503460f94c0d5eb/html5/thumbnails/4.jpg)
Why Parallel Access To Data?
1 Terabyte
10 MB/s
At 10 MB/s1.2 days to scan
1 Terabyte
1,000 x parallel1.5 minute SCAN.
Parallelism: divide a big problem into many smaller ones to be solved in parallel.
Bandwidth
![Page 5: Parallel Database Systems Instructor: Dr. Yingshu Li Student: Chunyu Ai.](https://reader035.fdocuments.in/reader035/viewer/2022062517/56649efb5503460f94c0d5eb/html5/thumbnails/5.jpg)
Implication of Hardware Trends: Clusters
Future Servers are CLUSTERSof processors, discs
CPU
50 GB Disc
5 GB RAM
ThesisMany Little will Win over Few Big
![Page 6: Parallel Database Systems Instructor: Dr. Yingshu Li Student: Chunyu Ai.](https://reader035.fdocuments.in/reader035/viewer/2022062517/56649efb5503460f94c0d5eb/html5/thumbnails/6.jpg)
Parallel Database Architectures
![Page 7: Parallel Database Systems Instructor: Dr. Yingshu Li Student: Chunyu Ai.](https://reader035.fdocuments.in/reader035/viewer/2022062517/56649efb5503460f94c0d5eb/html5/thumbnails/7.jpg)
Parallelism: Performance is the GoalGoal is to get 'good' performance.
Law 1: parallel system should be faster than serial system
Law 2: parallel system should give near-linear scaleup or
near-linear speedup orboth.
![Page 8: Parallel Database Systems Instructor: Dr. Yingshu Li Student: Chunyu Ai.](https://reader035.fdocuments.in/reader035/viewer/2022062517/56649efb5503460f94c0d5eb/html5/thumbnails/8.jpg)
Speed-Up and Scale-Up Speedup: a fixed-sized problem executing on a small system is given to a
system which is N-times larger.
– Measured by:
speedup = small system elapsed time
large system elapsed time
– Speedup is linear if equation equals N. Scaleup: increase the size of both the problem and the system
– N-times larger system used to perform N-times larger job
– Measured by:
scaleup = small system small problem elapsed time
big system big problem elapsed time
– Scale up is linear if equation equals 1.
![Page 9: Parallel Database Systems Instructor: Dr. Yingshu Li Student: Chunyu Ai.](https://reader035.fdocuments.in/reader035/viewer/2022062517/56649efb5503460f94c0d5eb/html5/thumbnails/9.jpg)
Kinds of Parallel Execution
Pipeline
Partition outputs split N ways inputs merge M ways
Any Sequential Program
Any Sequential Program
SequentialSequential
SequentialSequential Any Sequential Program
Any Sequential Program
![Page 10: Parallel Database Systems Instructor: Dr. Yingshu Li Student: Chunyu Ai.](https://reader035.fdocuments.in/reader035/viewer/2022062517/56649efb5503460f94c0d5eb/html5/thumbnails/10.jpg)
The Drawbacks of ParallelismO
ldTi
me
New
Tim
eS
peed
up =
Processors & Discs
The Good Speedup Curve
Processors & Discs
A Bad Speedup Curve
Linearity
No Parallelism Benefit
Processors & Discs
A Bad Speedup Curve3-Factors
Sta
rtu
p
Inte
rfe
ren
ce
Ske
w
Startup: Creating processesOpening filesOptimization
Interference: Device (cpu, disc, bus)logical (lock, hotspot, server, log,...)
Communication: send data among nodesSkew: If tasks get very small, variance > service time
![Page 11: Parallel Database Systems Instructor: Dr. Yingshu Li Student: Chunyu Ai.](https://reader035.fdocuments.in/reader035/viewer/2022062517/56649efb5503460f94c0d5eb/html5/thumbnails/11.jpg)
Parallelism: Speedup & Scaleup100GB 100GB
Speedup: Same Job, More Hardware Less time
Scaleup: Bigger Job, More Hardware Same time
100GB 1 TB
100GB 1 TB
Server Server
1 k clients 10 k clientsTransactionScaleup: more clients/servers Same response time
![Page 12: Parallel Database Systems Instructor: Dr. Yingshu Li Student: Chunyu Ai.](https://reader035.fdocuments.in/reader035/viewer/2022062517/56649efb5503460f94c0d5eb/html5/thumbnails/12.jpg)
14
Database Systems “Hide” Parallelism Automate system management via tools
• data placement• data organization (indexing)• periodic tasks (dump / recover / reorganize)
Automatic fault tolerance• duplex & failover• transactions
Automatic parallelism• among transactions (locking)• within a transaction (parallel execution)
![Page 13: Parallel Database Systems Instructor: Dr. Yingshu Li Student: Chunyu Ai.](https://reader035.fdocuments.in/reader035/viewer/2022062517/56649efb5503460f94c0d5eb/html5/thumbnails/13.jpg)
Automatic Data PartitioningSplit a SQL table to subset of nodes & disks
Partition within set:Range Hash Round Robin
Shared disk and memory less sensitive to partitioning, Shared nothing benefits from "good" partitioning
A...E F...J K...N O...S T...Z A...E F...J K...N O...S T...Z A...E F...J K...N O...S T...Z
Good for equi-joins, range queriesgroup-by
Good for equi-joins Good to spread load
![Page 14: Parallel Database Systems Instructor: Dr. Yingshu Li Student: Chunyu Ai.](https://reader035.fdocuments.in/reader035/viewer/2022062517/56649efb5503460f94c0d5eb/html5/thumbnails/14.jpg)
Index PartitioningHash indices partition by hash
B-tree indices partition as a forest of trees.One tree per range
Primary index clusters data
0...9 10..19 20..29 30..39 40..
A..C D..F G...M N...R S..Z
![Page 15: Parallel Database Systems Instructor: Dr. Yingshu Li Student: Chunyu Ai.](https://reader035.fdocuments.in/reader035/viewer/2022062517/56649efb5503460f94c0d5eb/html5/thumbnails/15.jpg)
Partitioned Execution
A...E F...J K...N O...S T...Z
A Table
Count Count Count Count Count
Count
Spreads computation and IO among processors
Partitioned data gives NATURAL parallelism
![Page 16: Parallel Database Systems Instructor: Dr. Yingshu Li Student: Chunyu Ai.](https://reader035.fdocuments.in/reader035/viewer/2022062517/56649efb5503460f94c0d5eb/html5/thumbnails/16.jpg)
N x M way Parallelism
A...E F...J K...N O...S T...Z
Merge
Join
Sort
Join
Sort
Join
Sort
Join
Sort
Join
Sort
Merge Merge
N inputs, M outputs, no bottlenecks.
Partitioned DataPartitioned and Pipelined Data Flows
![Page 17: Parallel Database Systems Instructor: Dr. Yingshu Li Student: Chunyu Ai.](https://reader035.fdocuments.in/reader035/viewer/2022062517/56649efb5503460f94c0d5eb/html5/thumbnails/17.jpg)
Blocking Operators = Short PipelinesAn operator is blocking,
if it does not produce any output, until it has consumed all its input
Examples:Sort, Aggregates, Hash-Join (reads all of one operand)
Blocking operators kill pipeline parallelismMake partition parallelism all the more important.
Sort RunsScan
Sort Runs
Sort Runs
Sort Runs
Tape File SQL Table Process
Merge Runs
Merge Runs
Merge Runs
Merge Runs
Table Insert
Index Insert
Index Insert
Index Insert
SQL Table
Index 1
Index 2
Index 3
Database LoadTemplate hasthree blocked phases
![Page 18: Parallel Database Systems Instructor: Dr. Yingshu Li Student: Chunyu Ai.](https://reader035.fdocuments.in/reader035/viewer/2022062517/56649efb5503460f94c0d5eb/html5/thumbnails/18.jpg)
Parallel AggregatesFor aggregate function, need a decomposition strategy:
count(S) = count(s(i)), ditto for sum()avg(S) = ( sum(s(i))) / count(s(i))and so on...
For groups,sub-aggregate groups close to the sourcedrop sub-aggregates into a hash river.
A...E F...J K...N O...S T...Z
A Table
Count Count Count Count Count
Count
![Page 19: Parallel Database Systems Instructor: Dr. Yingshu Li Student: Chunyu Ai.](https://reader035.fdocuments.in/reader035/viewer/2022062517/56649efb5503460f94c0d5eb/html5/thumbnails/19.jpg)
Sub-sortsgenerateruns
Mergeruns
Range or Hash Partition River
River is range or hash partitioned
Scan or other source
Parallel Sort
M inputs N outputs
Disk and mergenot needed if sort fits in Memory
Sort is benchmark from hell for shared nothing machinesnet traffic = disk bandwidth, no data filtering at the source
![Page 20: Parallel Database Systems Instructor: Dr. Yingshu Li Student: Chunyu Ai.](https://reader035.fdocuments.in/reader035/viewer/2022062517/56649efb5503460f94c0d5eb/html5/thumbnails/20.jpg)
Hash Join: Combining Two Tables
Hash smaller table into N buckets (hope N=1)
If N=1 read larger table, hash to smallerElse, hash outer to disk then
bucket-by-bucket hash join.
Purely sequential data behavior
Always beats sort-merge and nestedunless data is clustered.
Good for equi, outer, exclusion joinLots of papers,
Hash reduces skew
Right Table
LeftTable
HashBuckets
![Page 21: Parallel Database Systems Instructor: Dr. Yingshu Li Student: Chunyu Ai.](https://reader035.fdocuments.in/reader035/viewer/2022062517/56649efb5503460f94c0d5eb/html5/thumbnails/21.jpg)
Q&A
Thank you!
23