Lectures 23: Parallel Databases

27
Introduction to Data Management CSE 344 Lectures 23: Parallel Databases CSE 344 - Fall 2016 1

Transcript of Lectures 23: Parallel Databases

Page 1: Lectures 23: Parallel Databases

Introduction to Data ManagementCSE 344

Lectures 23: Parallel Databases

CSE 344 - Fall 2016 1

Page 2: Lectures 23: Parallel Databases

Announcements

• WQ7 due tomorrow• HW7 due on Wednesday• HW8 (last one!) will be out on Wednesday

• Final on 12/12 (Monday), 2:30-4:20pm– Location TBD– You can bring 2 sheets of notes

CSE 344 - Fall 2016 2

Page 3: Lectures 23: Parallel Databases

HW8 Preview

• HW8 will require using Amazon EC2 compute cloud– If you do not already have an Amazon account,

go to http://aws.amazon.com/ and sign up. • Amazon will ask you for your credit card information.

– Apply for credits onhttps://aws.amazon.com/education/awseducate/apply/ (choose students)

• Use your @uw.edu email address

CSE 344 - Fall 2016 3

Page 4: Lectures 23: Parallel Databases

Outline

• Review of transactions

• Parallel databases

CSE 344 - Fall 2016 4

Page 5: Lectures 23: Parallel Databases

Transactions Recap

• Why we use transactions?– ACID properties

• Serializability– Conflict-serializable / non-serializable

• Implementation using locks– 2PL and strict 2PL– Serialization levels

CSE 344 - Fall 2016 5

Page 6: Lectures 23: Parallel Databases

In-Class Exercise• Given these 3 transactions: (Co = commit)

– T1 : R1(A), R1(B), W1(A), W1(B), Co1– T2 : R2(B), W2(B), R2(C), W2(C), Co2– T3 : R3(C), W3(C), R3(A), W3(A), Co3

• And this schedule:– R1(A), R1(B), W1(A), R3(C), W3(C), R3(A), W3(A),

Co3,W1(B), R2(B), W2(B), Co1, R2(C), W2(C), Co2

• Determine: – If the schedule conflict-serializable? If yes, indicate

a serialization order.– Is this schedule possible under the strict 2PL

protocol?CSE 344 - Fall 2016 6

Page 7: Lectures 23: Parallel Databases

Parallel DBMS

CSE 344 - Fall 2016 7

Page 8: Lectures 23: Parallel Databases

Why compute in parallel?

• Multi-cores:– Most processors have multiple cores– This trend will increase in the future

• Big data: too large to fit in main memory– Distributed query processing on 100x-1000x

servers– Widely available now using cloud services

CSE 344 - Fall 2016 8

Page 9: Lectures 23: Parallel Databases

Big Data• Companies, organizations, scientists have

data that is too big, too fast, and too complex to be managed without changing tools and processes

• Complex data processing:– Decision support queries (SQL w/ aggregates)– Machine learning (adds linear algebra and

iteration)

CSE 344 - Fall 2016 9

Page 10: Lectures 23: Parallel Databases

Two Kinds of Parallel Data Processing

• Parallel (relational) databases, developed starting with the 80s (this lecture)– OLTP (Online Transaction Processing) – OLAP (Online Analytic Processing, or Decision

Support)– Will only cover parallel query execution in 344– Parallel transactions and recovery (444)– Schema design for parallel DBMS (544)

• General purpose distributed processing: MapReduce, Spark (next lectures)– Mostly for Decision Support Queries

CSE 344 - Fall 2016 10

Page 11: Lectures 23: Parallel Databases

Performance Metrics for Parallel DBMSs

P = the number of nodes (processors, computers)• Speedup:

– More nodes, same data è higher speed• Scaleup:

– More nodes, more data è same speed

• OLTP: “Speed” = transactions per second (TPS)• Decision Support: “Speed” = query time

CSE 344 - Fall 2016 11

Page 12: Lectures 23: Parallel Databases

Linear v.s. Non-linear Speedup

CSE 344 - Fall 2016

# nodes (=P)

Speedup

12

×1 ×5 ×10 ×15

Page 13: Lectures 23: Parallel Databases

Linear v.s. Non-linear Scaleup

# nodes (=P) AND data size

BatchScaleup

×1 ×5 ×10 ×15

CSE 344 - Fall 2016 13

Ideal

Page 14: Lectures 23: Parallel Databases

Challenges to Linear Speedup and Scaleup

• Startup cost– Cost of starting an operation on many nodes

• Interference– Contention for resources between nodes

• Skew– Slowest node becomes the bottleneck

CSE 344 - Fall 2016 14

Page 15: Lectures 23: Parallel Databases

Architectures for Parallel Databases

• Shared memory

• Shared disk

• Shared nothing

CSE 344 - Fall 2016 15

Page 16: Lectures 23: Parallel Databases

Shared Memory

Interconnection Network

P P P

Global Shared Memory

D D D16CSE 344 - Fall 2016

Page 17: Lectures 23: Parallel Databases

Shared Disk

Interconnection Network

P P P

M M M

D D D17CSE 344 - Fall 2016

Page 18: Lectures 23: Parallel Databases

Shared Nothing

Interconnection Network

P P P

M M M

D D D18CSE 344 - Fall 2016

Page 19: Lectures 23: Parallel Databases

A Professional Picture…

19

From: Greenplum (now EMC) Database Whitepaper

SAN = “Storage Area Network”

CSE 344 - Fall 2016

Page 20: Lectures 23: Parallel Databases

Shared Memory• Nodes share both RAM and disk• Dozens to hundreds of processors

Example: SQL Server runs on a single machine and can leverage many threads to get a query to run faster (check your HW3 query plans)

• Easy to use and program• But very expensive to scale: last remaining

cash cows in the hardware industry

CSE 344 - Fall 2016 20

Page 21: Lectures 23: Parallel Databases

Shared Disk• All nodes access the same disks• Found in the largest "single-box" (non-

cluster) multiprocessors

Oracle dominates this class of systems.

Characteristics:• Also hard to scale past a certain point:

existing deployments typically have fewer than 10 machines

CSE 344 - Fall 2016 21

Page 22: Lectures 23: Parallel Databases

Shared Nothing• Cluster of machines on high-speed network• Called "clusters" or "blade servers”• Each machine has its own memory and disk: lowest

contention.

NOTE: Because all machines today have many cores and many disks, then shared-nothing systems typically run many "nodes” on a single physical machine.

Characteristics:• Today, this is the most scalable architecture.• Most difficult to administer and tune.

22CSE 344 - Fall 2016We discuss only Shared Nothing in class

Page 23: Lectures 23: Parallel Databases

Purchase

pid=pid

cid=cid

Customer

ProductPurchase

pid=pid

cid=cid

Customer

Product

Approaches toParallel Query Evaluation

• Inter-query parallelism– Transaction per node– OLTP

• Inter-operator parallelism– Operator per node– Both OLTP and Decision Support

• Intra-operator parallelism– Operator on multiple nodes– Decision Support

CSE 344 - Fall 2016We study only intra-operator parallelism: most scalable

Purchase

pid=pid

cid=cid

Customer

Product

Purchase

pid=pid

cid=cid

Customer

Product

Purchase

pid=pid

cid=cid

Customer

Product

23

Page 24: Lectures 23: Parallel Databases

Single Node Query Processing (Review)

Given relations R(A,B) and S(B, C), no indexes:

• Selection: σA=123(R)– Scan file R, select records with A=123

• Group-by: γA,sum(B)(R)– Scan file R, insert into a hash table using A as key– When a new key is equal to an existing one, add B to the value

• Join: R ⋈ S– Scan file S, insert into a hash table using B as key– Scan file R, probe the hash table using B

CSE 344 - Fall 2016 24

Page 25: Lectures 23: Parallel Databases

Distributed Query Processing

• Data is horizontally partitioned on many servers

• Operators may require data reshuffling

CSE 344 - Fall 2016 25

Page 26: Lectures 23: Parallel Databases

Horizontal Data Partitioning

CSE 344 - Fall 2016 26

1 2 P . . .

Data: Servers:

K A B… …

Page 27: Lectures 23: Parallel Databases

Horizontal Data Partitioning

CSE 344 - Fall 2016 27

K A B… …

1 2 P . . .

Data: Servers:

K A B

… …

K A B

… …

K A B

… …

Which tuplesgo to what server?