
Transcript of Low Cost, High Performance, and Scalability: A New Approach to User-Level Distributed Shared Memory

Page 1: Low Cost, High Performance, and Scalability:
Page 2: Low Cost, High Performance, and Scalability:

Low Cost, High Performance, and Scalability: A New Approach to User-Level Distributed Shared Memory

Patrick Anthony La Fratta
WORTS 2005
15 December 2005

GUARANTEED! OR YOUR MONEY BACK!!!

Page 3: Low Cost, High Performance, and Scalability:

Programming Models: Message-Passing

Page 4: Low Cost, High Performance, and Scalability:

Programming Models: Message-Passing

Page 5: Low Cost, High Performance, and Scalability:

Programming Models: Shared Memory
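
The two models can be contrasted with a small example. The sketch below is not from the talk; it is a minimal illustration, assuming a POSIX system, of the same value being handed from a child process to its parent, first through an explicit message (a pipe) and then through a shared mapping that both processes read and write directly.

/* Minimal illustration (not from the talk) of the two programming models,
 * assuming a POSIX system: the same integer is handed from a child process
 * to its parent, first by an explicit message through a pipe, then by a
 * plain store into a shared mapping. */
#include <stdio.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    /* Message-passing: data moves only through an explicit send/receive. */
    int fds[2];
    pipe(fds);
    if (fork() == 0) {                 /* child acts as the sender        */
        int value = 42;
        write(fds[1], &value, sizeof value);
        _exit(0);
    }
    int received = 0;
    read(fds[0], &received, sizeof received);
    wait(NULL);
    printf("message-passing: received %d\n", received);

    /* Shared memory: data is visible through ordinary loads and stores.  */
    int *shared = mmap(NULL, sizeof(int), PROT_READ | PROT_WRITE,
                       MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (fork() == 0) {                 /* child acts as the writer        */
        *shared = 42;                  /* a plain store, no explicit send */
        _exit(0);
    }
    wait(NULL);                        /* crude synchronization for demo  */
    printf("shared memory: read %d\n", *shared);
    return 0;
}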

Page 6: Low Cost, High Performance, and Scalability:

Implementing a DSM System at the User Level

Page 7: Low Cost, High Performance, and Scalability:

Implementing the DSM Client

Initialization, Step 1: Get the size of the shared memory segment.

Page 8: Low Cost, High Performance, and Scalability:

Implementing the DSM Client

Initialization, Step 2: Map n pages into local memory.

Page 9: Low Cost, High Performance, and Scalability:

Implementing the DSM Client

Initialization, Step 3: Take away all access privileges from the shared segments.

Page 10: Low Cost, High Performance, and Scalability:

Implementing the DSM Client

Initialization, Step 4: Set up the segmentation fault handler.
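
Putting the four initialization steps together, one plausible realization on a POSIX system is sketched below. The names dsm_init and segv_handler, and the choice of an anonymous private mapping, are assumptions for illustration; the slides do not show the actual code.

/* Minimal sketch of client initialization, Steps 1-4, assuming POSIX.
 * dsm_init() and segv_handler() are illustrative names, not the system's
 * actual interface. */
#include <signal.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

static void  *shared_base;
static size_t shared_size;

/* The handler body is filled in by the read/write paths shown later. */
static void segv_handler(int sig, siginfo_t *info, void *ctx) {
    (void)sig; (void)info; (void)ctx;
}

int dsm_init(size_t npages) {
    /* Step 1: get the size of the shared memory segment (in the real
     * client this value would come from the server). */
    shared_size = npages * (size_t)sysconf(_SC_PAGESIZE);

    /* Step 2: map n pages into local memory. */
    shared_base = mmap(NULL, shared_size, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (shared_base == MAP_FAILED)
        return -1;

    /* Step 3: take away all access privileges from the shared segment,
     * so that every first touch traps. */
    if (mprotect(shared_base, shared_size, PROT_NONE) != 0)
        return -1;

    /* Step 4: set up the segmentation fault handler. */
    struct sigaction sa;
    sigemptyset(&sa.sa_mask);
    sa.sa_sigaction = segv_handler;
    sa.sa_flags = SA_SIGINFO;
    return sigaction(SIGSEGV, &sa, NULL);
}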

Page 11: Low Cost, High Performance, and Scalability:

Implementing the DSM System

Application Reads Shared Address: Preview

Page 12: Low Cost, High Performance, and Scalability:

Implementing the DSM System

Shared address read, Step 1: Application reads shared address.

Page 13: Low Cost, High Performance, and Scalability:

Implementing the DSM System

Shared address read, Step 2: Control transferred to seg-fault handler.

Page 14: Low Cost, High Performance, and Scalability:

Implementing the DSM System

Shared address read, Step 3: Client contacts the server to get the page's data.

Page 15: Low Cost, High Performance, and Scalability:

Implementing the DSM System

Shared address read, Step 4: Client grants read access privileges to application.
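
A plausible handler for the read path, continuing the initialization sketch above, is shown below. dsm_fetch_page() is a hypothetical stand-in for the client/server request described in Step 3; the actual message format is not given in the slides, and distinguishing read faults from write faults (needed for the write path on the following slides) is platform-specific and omitted here.

/* Minimal sketch of the read path, Steps 1-4, filling in the handler stub
 * from the initialization sketch.  dsm_fetch_page() is a hypothetical
 * stand-in for the client's request to the server. */
#include <signal.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

extern void  *shared_base;     /* set up during initialization            */
extern size_t shared_size;

void dsm_fetch_page(size_t page_index, void *dest);   /* hypothetical     */

static void segv_handler(int sig, siginfo_t *info, void *ctx) {
    (void)sig; (void)ctx;
    size_t    psz   = (size_t)sysconf(_SC_PAGESIZE);
    uintptr_t fault = (uintptr_t)info->si_addr;
    uintptr_t base  = (uintptr_t)shared_base;

    /* Step 2: control arrives here because the page had no access rights. */
    if (fault < base || fault >= base + shared_size)
        abort();                               /* a real fault, not DSM    */

    size_t page      = (fault - base) / psz;
    void  *page_addr = (char *)shared_base + page * psz;

    /* Step 3: contact the server to get the page's data; the page must be
     * writable while the data is copied in. */
    mprotect(page_addr, psz, PROT_READ | PROT_WRITE);
    dsm_fetch_page(page, page_addr);

    /* Step 4: leave the application with read access and return; the
     * faulting load is re-executed and now succeeds. */
    mprotect(page_addr, psz, PROT_READ);
}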

Page 16: Low Cost, High Performance, and Scalability:

Implementing the DSM System

Application Writes Shared Address: Preview

Page 17: Low Cost, High Performance, and Scalability:

Implementing the DSM System

Shared address write, Step 1: Application writes shared address.

Page 18: Low Cost, High Performance, and Scalability:

Implementing the DSM System

Shared address write, Step 2: Control transferred to seg-fault handler.

Page 19: Low Cost, High Performance, and Scalability:

Implementing the DSM System

Shared address write, Step 3: Client contacts the server with a write notification.

Page 20: Low Cost, High Performance, and Scalability:

Implementing the DSM System

Shared address write, Step 4: Server calls back all other copies of pages being written.

Page 21: Low Cost, High Performance, and Scalability:

Implementing the DSM System

Shared address write, Step 5: Server signals the client to proceed.

Page 22: Low Cost, High Performance, and Scalability:

Implementing the DSM System

Shared address write, Step 6: Client grants write privileges to application.

Page 23: Low Cost, High Performance, and Scalability:

Implementing the DSM System

Shared address write, Step 7: Later, the app detaches pages so others may use them.
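
The write path can be sketched the same way. The three dsm_* helpers below are hypothetical names for the client/server messages in Steps 3-5 and 7; the slides do not specify the actual protocol.

/* Minimal sketch of the write path, Steps 3-7.  The dsm_* message functions
 * are hypothetical placeholders for the client/server protocol. */
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

void dsm_send_write_notice(size_t page_index);                  /* Step 3    */
void dsm_wait_for_go_ahead(size_t page_index);                  /* Steps 4-5 */
void dsm_flush_and_release(size_t page_index, const void *data, size_t len);

/* Called from the fault handler when the faulting access was a write. */
void dsm_handle_write_fault(void *page_addr, size_t page_index) {
    size_t psz = (size_t)sysconf(_SC_PAGESIZE);

    /* Step 3: notify the server that this client intends to write the page. */
    dsm_send_write_notice(page_index);

    /* Steps 4-5: the server calls back all other copies of the page, then
     * tells this client to proceed; block until that happens. */
    dsm_wait_for_go_ahead(page_index);

    /* Step 6: grant the application write privileges; the faulting store is
     * re-executed when the handler returns. */
    mprotect(page_addr, psz, PROT_READ | PROT_WRITE);
}

/* Step 7: later, the application detaches the page so others may use it. */
void dsm_detach_page(void *page_addr, size_t page_index) {
    size_t psz = (size_t)sysconf(_SC_PAGESIZE);
    dsm_flush_and_release(page_index, page_addr, psz);
    mprotect(page_addr, psz, PROT_NONE);     /* the next touch faults again */
}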

Page 24: Low Cost, High Performance, and Scalability:

Preliminary Results: All Pairs Shortest Paths

Note: Results matched for all test cases, and all runs completed successfully.

[Chart: Execution Time vs. Problem Size for Sequential and Parallel (Row-wise Decomposition) Implementations of Floyd's Algorithm. X-axis: problem size, n, # of vertices (8 to 256); y-axis: execution time, t, sec (log scale, 0.00001 to 100); series: Sequential and Parallel with 2 PEs.]
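
For reference, the benchmark kernel is Floyd's all-pairs-shortest-paths algorithm. The sketch below shows the sequential kernel and notes where the row-wise decomposition named in the chart would split the work; it is an illustration, not the benchmark code from the talk.

/* Sequential kernel of Floyd's algorithm for all-pairs shortest paths.
 * This is an illustrative sketch, not the benchmark code from the talk. */
#include <stddef.h>

void floyd(double *dist, size_t n) {
    /* dist is an n x n matrix; dist[i*n + j] holds the current shortest
     * distance from vertex i to vertex j. */
    for (size_t k = 0; k < n; k++) {
        /* Row-wise decomposition: in the parallel version each PE owns a
         * contiguous block of rows i and updates only those rows; row k is
         * read by every PE, so it is the data the DSM layer must move. */
        for (size_t i = 0; i < n; i++) {
            for (size_t j = 0; j < n; j++) {
                double via_k = dist[i * n + k] + dist[k * n + j];
                if (via_k < dist[i * n + j])
                    dist[i * n + j] = via_k;
            }
        }
    }
}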

Page 25: Low Cost, High Performance, and Scalability:

System Profiling: All Pairs Shortest Paths

[Chart: Total execution time vs. problem size, n, # of vertices (8 to 2048); time, t, sec (log scale, 0.010 to 1000).]

Page 26: Low Cost, High Performance, and Scalability:

System Profiling: All Pairs Shortest Paths

[Chart: Time breakdown vs. problem size, n, # of vertices (8 to 2048); time, t, sec (log scale, 0.001 to 1000); series: T, C1DT, C2DT, BWT.]

Page 27: Low Cost, High Performance, and Scalability:

System Modifications and Extensions

System profiling resulted in:

• Better understanding of the trade-offs in the design of the interface.

• Efficient synchronization primitives through extended memory semantics with full/empty bits.

• Server-side per-page locking and client-side full-page flushing.

* Speedups > 1! *
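
The full/empty-bit extension mentioned above pairs each shared word with a one-bit tag: a read blocks until the word is full and marks it empty, and a write blocks until it is empty and marks it full. The sketch below only illustrates those semantics for a single producer and a single consumer within one address space; how the bits are carried through the DSM protocol is not described in the slides.

/* Illustration of full/empty-bit semantics on one word (single producer,
 * single consumer).  The DSM system's actual mechanism is not shown here. */
#include <stdatomic.h>

typedef struct {
    _Atomic int full;   /* 0 = empty, 1 = full */
    int         value;
} fe_word;

/* Write-when-empty: wait until the word is empty, store, mark it full. */
void fe_write(fe_word *w, int v) {
    while (atomic_load(&w->full))
        ;                               /* spin while full  */
    w->value = v;
    atomic_store(&w->full, 1);
}

/* Read-when-full: wait until the word is full, load, mark it empty. */
int fe_read(fe_word *w) {
    while (!atomic_load(&w->full))
        ;                               /* spin while empty */
    int v = w->value;
    atomic_store(&w->full, 0);
    return v;
}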

Page 28: Low Cost, High Performance, and Scalability:

Performance Results: Speedup for Various Configurations

[Chart: Speedup vs. sequential implementation by problem size, n, # of vertices (256 to 8192), for 2, 4, 8, and 16 PEs; speedup range 0 to 7.]

Page 29: Low Cost, High Performance, and Scalability:

Performance Results: Trends

[Chart: Speedup vs. sequential implementation by problem size, n, # of vertices (256 to 8192), for 2, 4, 8, and 16 PEs; speedup range 0 to 7.]

Page 30: Low Cost, High Performance, and Scalability:

Future Work

• Scalability: Enable clients to use more than one server.

• Peer-to-peer: Merge the server and client modules.

• Fault-tolerance: Checkpointing and migration?

• Further testing: Implement and evaluate performance of other parallel applications.

Questions?