
Low Cost, High Performance, and Scalability: A New Approach to User-Level Distributed Shared Memory

Patrick Anthony La Fratta

WORTS 2005

15 December 2005

GUARANTEED!

OR YOUR MONEY BACK!!!

Programming Models: Message-Passing

Programming Models: Shared Memory

Implementing a DSM System at the User Level

Implementing the DSM Client

Initialization, Step 1: Get size of shared memory segment.

Initialization, Step 2: Map n pages into local memory.

Initialization, Step 3: Take away all access privileges from the shared segments.

Initialization, Step 4: Set up the segmentation fault handler.
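The four initialization steps can be modeled in a few lines. The sketch below is a pure-Python simulation, not the system presented in the talk: a real user-level client would presumably map pages, revoke access, and install a fault handler through operating-system calls, whereas here a per-page protection table stands in for the hardware. The names `DsmClient`, `Prot`, `on_fault`, and the 4096-byte `PAGE_SIZE` are all assumptions made for illustration.

```python
from enum import Enum

PAGE_SIZE = 4096  # assumed page size; a real client would query the OS


class Prot(Enum):
    NONE = 0
    READ = 1
    WRITE = 2


class DsmClient:
    """Simulated client state (hypothetical names, not from the talk)."""

    def __init__(self, segment_bytes, fault_handler):
        # Step 1: get the size of the shared memory segment.
        self.segment_bytes = segment_bytes
        # Step 2: map n pages into local memory (round up to whole pages).
        self.num_pages = -(-segment_bytes // PAGE_SIZE)
        self.pages = [bytearray(PAGE_SIZE) for _ in range(self.num_pages)]
        # Step 3: take away all access privileges from the shared pages.
        self.prot = [Prot.NONE] * self.num_pages
        # Step 4: set up the handler that runs on an access violation.
        self.fault_handler = fault_handler


def on_fault(client, page, is_write):
    # Placeholder handler: grants the requested access without contacting
    # any server, which a real handler would have to do first.
    client.prot[page] = Prot.WRITE if is_write else Prot.READ


client = DsmClient(segment_bytes=10_000, fault_handler=on_fault)
```

Starting every page with no access is what forces the first touch of each page through the handler, which is where the client gets a chance to fetch data or negotiate ownership.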


Implementing the DSM System

Application Reads Shared Address: Preview


Shared address read, Step 1: Application reads shared address.

Shared address read, Step 2: Control transferred to seg-fault handler.

Shared address read, Step 3: Client contacts the server to get the page’s data.

Shared address read, Step 4: Client grants read access privileges to application.
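The read path above can be traced with a small simulation. This is a sketch under stated assumptions rather than the presented implementation: the fault is detected by an explicit check in `read` instead of a hardware trap, the server is an in-process object instead of a remote machine, and the names `Server`, `Client`, `fetch_page`, and `_read_fault` are hypothetical.

```python
PAGE_SIZE = 4096  # assumed page size
NONE, READ, WRITE = "none", "read", "write"


class Server:
    """Holds the master copy of every page (hypothetical stand-in)."""

    def __init__(self, num_pages):
        self.pages = [bytes(PAGE_SIZE) for _ in range(num_pages)]

    def fetch_page(self, page):
        return self.pages[page]


class Client:
    def __init__(self, server, num_pages):
        self.server = server
        self.pages = [bytearray(PAGE_SIZE) for _ in range(num_pages)]
        self.prot = [NONE] * num_pages
        self.faults = 0

    def read(self, addr):
        page, offset = divmod(addr, PAGE_SIZE)
        # Step 1: the application reads a shared address.
        if self.prot[page] == NONE:
            # Step 2: control transfers to the fault handler.
            self._read_fault(page)
        return self.pages[page][offset]

    def _read_fault(self, page):
        self.faults += 1
        # Step 3: contact the server to get the page's data.
        self.pages[page][:] = self.server.fetch_page(page)
        # Step 4: grant read access so the faulting read can complete.
        self.prot[page] = READ


server = Server(num_pages=4)
server.pages[1] = bytes([7]) * PAGE_SIZE
client = Client(server, num_pages=4)
first = client.read(PAGE_SIZE + 10)   # faults, then fetches page 1
second = client.read(PAGE_SIZE + 20)  # page 1 already readable, no fault
```

Only the first access to a page pays the cost of a round trip to the server; later reads of the same page are served from the local copy.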


Application Writes Shared Address: Preview


Shared address write, Step 1: Application writes shared address.

Shared address write, Step 2: Control transferred to seg-fault handler.

Shared address write, Step 3: Client contacts server with write notification.

Shared address write, Step 4: Server calls back all other copies of pages being written.

Shared address write, Step 5: Server indicates to client to proceed.

Shared address write, Step 6: Client grants write privileges to application.

Later, the app detaches pages so others may use them.
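The write path is a write-invalidate exchange between one client and the server, with the server calling back every other holder of the page. The simulation below is a sketch, not the presented code: clients and server are in-process objects, the "proceed" signal is modeled as `write_notify` returning, and the method names (`write_notify`, `invalidate`, `detach`, `release`) are assumptions. It also assumes that detaching drops the client's own access to the page, which the slides do not state.

```python
NONE, READ, WRITE = "none", "read", "write"


class Server:
    def __init__(self, num_pages):
        # For each page, the set of clients currently holding a copy.
        self.holders = [set() for _ in range(num_pages)]

    def register_reader(self, client, page):
        self.holders[page].add(client)

    def write_notify(self, writer, page):
        # Call back every other copy of the page being written.
        for other in self.holders[page] - {writer}:
            other.invalidate(page)
        self.holders[page] = {writer}
        # Returning from this call tells the writer to proceed.

    def release(self, client, page):
        self.holders[page].discard(client)


class Client:
    def __init__(self, server, num_pages):
        self.server = server
        self.prot = [NONE] * num_pages

    def read_fault(self, page):
        self.server.register_reader(self, page)
        self.prot[page] = READ

    def write(self, page):
        # The write traps because the page is not yet writable.
        if self.prot[page] != WRITE:
            # Send the write notification and wait for the go-ahead.
            self.server.write_notify(self, page)
            # Grant write privileges to the application.
            self.prot[page] = WRITE

    def invalidate(self, page):
        # Server callback: the local copy is stale, so revoke access.
        self.prot[page] = NONE

    def detach(self, page):
        # Give the page up so that other clients may use it (assumed
        # here to revoke local access as well).
        self.server.release(self, page)
        self.prot[page] = NONE


server = Server(num_pages=2)
a, b = Client(server, 2), Client(server, 2)
a.read_fault(0)
b.read_fault(0)
b.write(0)  # a's copy of page 0 is called back
```

After `b.write(0)`, client `a` has no access to page 0 and will fault on its next read, at which point it fetches the updated data.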

Preliminary Results: All Pairs Shortest Paths

Note: Results matched for all test cases, and all runs completed successfully.

[Figure: Exec. Time vs. Problem Size for Seq. and Parallel (with Row-wise Decomposition) Implementations of Floyd's Algorithm. x-axis: problem size, n, # of vertices (8 to 256); y-axis: execution time, t, sec (log scale, 0.00001 to 100.00000); series: Sequential; Parallel, 2 PEs.]
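For reference, Floyd's algorithm and a row-wise decomposition of it can be sketched as follows. This is an illustration, not the benchmark code behind the figure: the processing elements (PEs) are simulated one after another in a single process, the row partitioning into contiguous blocks is an assumption, and the function names are hypothetical. The point it shows is that in iteration k every PE updates only its own rows but must read row k, which is the data a DSM system has to share.

```python
import math

INF = math.inf


def floyd_sequential(dist):
    n = len(dist)
    d = [row[:] for row in dist]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    return d


def floyd_rowwise(dist, num_pes):
    n = len(dist)
    d = [row[:] for row in dist]
    # Assumed partitioning: each PE owns one contiguous block of rows.
    bounds = [(pe * n // num_pes, (pe + 1) * n // num_pes)
              for pe in range(num_pes)]
    for k in range(n):
        # Row k is the only remote data a PE needs in iteration k.
        row_k = d[k][:]
        for lo, hi in bounds:  # each PE's share, run in turn here
            for i in range(lo, hi):
                for j in range(n):
                    if d[i][k] + row_k[j] < d[i][j]:
                        d[i][j] = d[i][k] + row_k[j]
    return d


graph = [
    [0, 3, INF, 7],
    [8, 0, 2, INF],
    [5, INF, 0, 1],
    [2, INF, INF, 0],
]
seq_result = floyd_sequential(graph)
par_result = floyd_rowwise(graph, num_pes=2)
```

Comparing `seq_result` with `par_result` is the same kind of check as the note above that results matched for all test cases.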

System Profiling: All Pairs Shortest Paths

[Figure, panel 1: Total Execution Time. x-axis: problem size, n, # of vertices (8 to 2048); y-axis: time, t, sec (log scale, 0.010 to 1000.000).]

[Figure, panel 2: series T, C1DT, C2DT, BWT. x-axis ticks: 8 to 2048; y-axis: time, t, sec (log scale, 0.001 to 1000.000).]

System Modifications and Extensions

System profiles resulted in:

• Better understanding of the trade-offs in the design of the interface.

• Efficient synchronization primitives through extended memory semantics with full/empty bits.

• Server-side per-page locking and client-side full-page flushing.

* Speedups > 1! *
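The full/empty-bit semantics mentioned in this slide can be illustrated with a single synchronized cell. This is a generic sketch of the idea, not the DSM system's primitive: it uses Python threads within one process, and the names `FullEmptyCell`, `write_when_empty`, and `read_when_full` are assumptions. A write blocks until the cell is empty and then marks it full; a read blocks until the cell is full and then marks it empty.

```python
import threading


class FullEmptyCell:
    """One memory word with a full/empty bit (illustrative sketch)."""

    def __init__(self):
        self._value = None
        self._full = False
        self._cond = threading.Condition()

    def write_when_empty(self, value):
        # Block until the cell is empty, store the value, mark it full.
        with self._cond:
            self._cond.wait_for(lambda: not self._full)
            self._value = value
            self._full = True
            self._cond.notify_all()

    def read_when_full(self):
        # Block until the cell is full, take the value, mark it empty.
        with self._cond:
            self._cond.wait_for(lambda: self._full)
            self._full = False
            self._cond.notify_all()
            return self._value


cell = FullEmptyCell()
received = []


def consumer():
    for _ in range(3):
        received.append(cell.read_when_full())


worker = threading.Thread(target=consumer)
worker.start()
for item in (10, 20, 30):
    cell.write_when_empty(item)
worker.join()
```

Because each word carries its own synchronization state, a producer and consumer can hand values over without a separate lock or barrier.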

Performance Results: Speedup for Various Configurations

[Figure, two panels: speedup vs. sequential implementation (y-axis, 0 to 7) plotted against x-axis ticks 256 to 8192; series: 2 PEs, 4, 8, 16.]
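For reading the speedup results, the usual definition of speedup relative to a sequential implementation is given below. The symbols are assumptions introduced here: T_seq is the sequential execution time and T_p is the execution time on p PEs for the same problem size.

```latex
S_p = \frac{T_{\mathrm{seq}}}{T_p},
\qquad
S_p > 1 \iff T_p < T_{\mathrm{seq}}
```

A speedup above 1 therefore means the parallel run finished sooner than the sequential one.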

Performance Results: Trends

Future Work

• Scalability: Enable clients to use more than one server.

• Peer-to-peer: Merge the server and client modules.

• Fault-tolerance: Checkpoint and Migration?

• Further testing: Implement and evaluate performance of other parallel applications.

Questions?