Design of a Framework for Data-Intensive Wide-Area Applications
Michael D. Beynon, Tahsin Kurc, Alan Sussman, Joel Saltz
High Performance Systems Lab, Department of Computer Science
University of Maryland, College Park, MD 20742
http://www.cs.umd.edu/projects/hpsl/
Michael D. Beynon ([email protected])
Targeted Applications
– Pathology
– Volume Rendering
– Surface/Groundwater Modeling
– Satellite Data Analysis
A Motivating Scenario
Sample Application:
– generate a 3D reconstructed view from a new set of sensor readings (computationally intensive)
– compare features with a reference db
Configuration:
– remote data server holds the reference db
– sensor host holds the large raw readings
– parallel computation farm available
[Diagram: sensor (raw dataset of sensor readings), data server (reference db with feature list), computation farm, and client PC, all connected over a WAN]
How to design the application?
A Motivating Scenario (2)
Application:
// process relevant raw readings
// generate 3D view
// compute features of 3D view
// find similar features in reference db
// display new view and similar cases
[Diagram: Extract raw (on the sensor, over the raw dataset of sensor readings) → 3D reconstruction (on the computation farm) → Extract ref (on the data server, over the reference db feature list) → View result (on the client PC), connected over a WAN]
In Summary …
Application Characteristics
– subset through range queries
– carry out data transformations
Issues
– Can target applications be decomposed into tasks?
– Where to execute application components? (heterogeneity, shared resources)
– Common support for developing such applications?
Application: Virtual Microscope
Client-server system for interactively visualizing digital slides
– image dataset (100MB to 5GB per focal plane)
– rectangular region queries, multiple data chunk reply
Setup:
– UMD 10-node IBM SP (1, 2, 4 processor nodes)
– HPSS (10TB tape, 500GB disk cache)
– dataset size: 4GB (90GB), 250GB (5.7TB)
– queries: q1–q5, q6; 4x size (q7); 16x size (q8)
– {server}–{client} as SP node – Ultra Sparc over dept Ethernet
Pipeline-style processing: read data → decompress → clip → zoom → view
Experimental Setup (2)

Filter       Volume / Chunk   Total Volume
read_data       102.52 KB        3.60 MB
decompress     2373.04 KB       83.42 MB
clip           1645.02 KB       57.83 MB
zoom (no)      1645.02 KB       57.83 MB
zoom (8)         25.70 KB        0.90 MB

[Figure: layout of query regions q1–q5 over the 4GB dataset (region sizes 90k, 90k, 180k, 45k); sample volume shown for q5]
Experiment: Filter Overhead
Setup: 4GB dataset, zoom 8
Point: overhead is small for the unoptimized filter prototype
[Chart: response time (sec, 0–50) for queries q1–q5, Original Server vs. Filter Server; 6% – 30% overhead]
Experiment: Computation / Volume
Setup: 4GB dataset, warm cache
Point: the amount of computation and communication makes a significant difference for efficient placement.
[Charts: response time (sec, 0–500) vs. filter placement (RDCZ--V, RDC--ZV, RD--CZV, R--DCZV) for queries q1–q5; left panel without sub-sampling, right panel with sub-sampling factor 8]
Experiment: Server Load
Setup: 4GB dataset, zoom 8, warm cache
Point: available server capacity and network volume have a dramatic effect on efficient placement.
[Chart: response time (sec, q1–q5 avg, 0–160) vs. filter placement (RDCZ--V, RDC--ZV, RD--CZV, R--DCZV) under 1x, 2x, and 4x server load]
Experiment: Varying Query Size
Setup: 250GB dataset, zoom 8, cold cache
Point: q1–q5 are dominated by HPSS access, dependent on tape storage location; q6–q8 are larger, but not dominated as much by HPSS access; network volume is still important.
[Charts: response time (sec, 0–1200) vs. filter placement (RDCZ--V, RDC--ZV, RD--CZV, R--DCZV), broken down into HPSS index, HPSS data, and other time; left panel queries q1–q5, right panel queries q6–q8]
Application: External Sort
Based on NowSort [Arpaci-Dusseau97]
– partitioned input/output files; 100-byte tuples, 10-byte integer key
– 128MB file of unsorted tuples per node
Partitioner: read local unsorted blocks, divide tuples into P buckets; send full bucket_i to the Sorter_i filter.
Sorter: [Sort] receive buckets and append to a sort buffer; sort full buffers and write to disk (run). [Merge] read sorted runs, merge, write to sorted local output file.
SPMD parallel program style: partitioner_1 → sorter_1, partitioner_2 → sorter_2, …, partitioner_P → sorter_P
External Sort: Algorithm
[Diagram: Phase 1 – Partitioning: partitioner_2 sends tuples to Sorter_1 … Sorter_4; each sorter writes sorted runs to disk. Phase 2 – Merge: sorter_2 merges its sorted runs into the local output file]
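The two-phase algorithm can be sketched in a few lines; this is a minimal single-process sketch under assumed parameters (the key-range bucketing, run size, and in-memory "files" are illustrative, not the NowSort implementation):

```python
import heapq

def partition(tuples, P, key_max=100):
    """Phase 1a (Partitioner): divide tuples into P buckets by key range."""
    buckets = [[] for _ in range(P)]
    width = (key_max + P - 1) // P
    for t in tuples:
        buckets[min(t[0] // width, P - 1)].append(t)
    return buckets

def make_runs(bucket, run_size):
    """Phase 1b (Sorter): sort fixed-size buffers into sorted runs."""
    return [sorted(bucket[i:i + run_size])
            for i in range(0, len(bucket), run_size)]

def merge_runs(runs):
    """Phase 2 (Merge): merge the sorted runs into one sorted output."""
    return list(heapq.merge(*runs))

def external_sort(tuples, P=4, run_size=8):
    """Because buckets cover disjoint, increasing key ranges, concatenating
    each bucket's merged output yields a globally sorted result."""
    out = []
    for bucket in partition(tuples, P):   # each bucket goes to one sorter
        out.extend(merge_runs(make_runs(bucket, run_size)))
    return out
```

In the real system each bucket travels over a stream to the sorter filter on another node, and runs are written to local disk rather than kept in memory.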
Experiment: Memory Usage
Setup: UMD 16-node Linux cluster (2 × 400MHz CPUs, 256MB, Gbit)
Point: no gain by allowing half the nodes to use full memory (ex: in the 1/8 case, 4.5x the memory)
[Chart: sort time (sec, 0–300), split into partition/sort and merge phases, for filter memory usage in 4-node groups: Full+Full, Full+1/2, Full+1/4, Full+1/8]
Experiment: Memory Usage (2)
Setup: consider the amount of memory and processors per node
Point: tuning work based on memory consumption is important
[Charts: sort time (sec, 0–300), split into partition/sort and merge phases, for filter memory usage in 4-node groups (Full+Full, Full+1/2, Full+1/4, Full+1/8); left panel with irregular partitioning, right panel with 2 sets on Full nodes]
Observations & Implications
Applications can be decomposed into tasks
Placement matters
– heterogeneity, shared resources
– application dependent (e.g., volume of data movement)
Resource availability matters
– memory capacity
Tailored support is needed for developing such applications
DataCutter Project
Targets multi-dimensional datasets, distributed storage
Subsetting through Range Queries
– a range defines a hyperbox in the multi-dimensional attribute space underlying the dataset
– retrieve items whose coordinates fall within the box
Manually divide application processing into Filters
– intended to execute near (LAN) the storage system
– to reduce the amount of data transmitted to the client
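A range query of this kind reduces to a per-item hyperbox containment test. A minimal sketch (the item and query representations are illustrative assumptions, not DataCutter's actual index structures):

```python
def in_hyperbox(point, lo, hi):
    """True if every coordinate of point lies within [lo_i, hi_i]."""
    return all(l <= x <= h for x, l, h in zip(point, lo, hi))

def range_query(items, lo, hi):
    """Retrieve items whose multi-dimensional coordinates fall within
    the hyperbox whose opposite corners are lo and hi."""
    return [item for coord, item in items if in_hyperbox(coord, lo, hi)]
```

For example, a 2-D rectangular region query over (x, y)-indexed image chunks is the special case where the attribute space has two dimensions.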
Application Filters
Purpose: specialized user code for processing data segments before returning them to the client
– inspired by research in Active Disks [Acharya, Uysal, Saltz: ASPLOS’98], dataflow, and functional parallelism
Filters are the unit of computation
– high-level tasks
– init, process, finalize interface
Streams are how filters communicate
– unidirectional buffer pipes
Filter connectivity is pre-specified
[Diagram: Extract raw (Raw Dataset) → 3D reconstruction → Extract ref (Reference DB) → View result]
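The init/process/finalize interface and unidirectional streams might look like the following sketch (the class and method names are assumptions for illustration; DataCutter's actual API differs):

```python
from collections import deque

class Stream:
    """Unidirectional buffer pipe between two filters."""
    def __init__(self):
        self.buf = deque()
    def write(self, chunk):
        self.buf.append(chunk)
    def read(self):
        # None signals that no more data is buffered
        return self.buf.popleft() if self.buf else None

class Filter:
    """Unit of computation: init, process, finalize interface."""
    def init(self):
        pass
    def process(self, ins, outs):
        raise NotImplementedError
    def finalize(self):
        pass

class Decompress(Filter):
    """Example filter: transform each chunk read from its input stream."""
    def process(self, ins, outs):
        while (chunk := ins[0].read()) is not None:
            outs[0].write(chunk.upper())   # stand-in for real decompression

def run(filt, ins, outs):
    """Drive one filter over pre-connected input and output streams."""
    filt.init()
    filt.process(ins, outs)
    filt.finalize()
```

The runtime, not the filter, decides where each filter is placed and wires the streams, which is what makes placement experiments like the ones above possible.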
Filter Environment
Want to constrain filters for
– location independence
– knowledge of the filter communication pattern
– easier scheduling of resources
– explicitly defined filter stop and restart
– a style that encourages minimized resource usage
Filter constraints
– communicate with other filters only using streams
– cannot change stream endpoints
– may pre-disclose dynamic memory/scratch space needs
What will not work well?
– fine-grained tasks
– fine coordination between filters
– long request-response chains with dependencies between queries
Related Work
[Diagram: related systems arranged by layer (Application Level, Programming Models, Infrastructure Services, Resource Level) and by resource model (Grid available resources, user-specified resources, idle resources): SRB, Legion, Client/Server Sockets, Condor Pool, Java RMI, DCOM, CORBA, NetSolve, Ninf, AppLeS, HPC++, NWS, DataCutter, Harmony DSM, MPI, RPC, DPSS, Globus]
Programming Model Semantics
RPC / RMI (Java) / CORBA
– function call semantics of imperative languages (caller blocks until the function has returned)
– simple, well-understood programming model
– drawback: mandatory RTT penalty on every call
[Diagram: timeline showing the caller blocked while the remote function executes]
Partial Solution [ABACUS]
Chain RPC calls if possible
– A() calls B(), which calls C(), which calls D(), etc.
– reduces some of the RTT penalties
– drawback: only works when the computation is structured as a chain of calls
Batch multiple calls into a single call
– if A(int) was original, change to use A’(int *array, int sz)
– drawbacks:
– how to choose the batch size?
– are earlier function calls delayed to achieve a full batch?
– mandatory overhead for the first call
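The batching transformation can be sketched as follows. A and A_batched are hypothetical stand-ins for the per-call RPC and its array variant; the batcher makes the listed drawbacks concrete, since earlier calls sit in the pending list until the batch fills:

```python
def A(x):
    """Original per-call RPC: one round trip per invocation."""
    return x * x               # stand-in for the remote work

def A_batched(xs):
    """Batched variant: one round trip amortized over len(xs) calls."""
    return [x * x for x in xs]

class Batcher:
    """Client-side batcher that flushes when the batch is full;
    choosing batch_size is exactly the open question noted above."""
    def __init__(self, batch_size):
        self.batch_size = batch_size
        self.pending = []
    def call(self, x):
        self.pending.append(x)
        if len(self.pending) >= self.batch_size:
            return self.flush()
        return []              # earlier calls are delayed until the batch fills
    def flush(self):
        out = A_batched(self.pending)
        self.pending = []
        return out
```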
Alternate Solution [DataCutter]
Change the programming model
– mandate use of the filter-stream programming model (message passing style)
– avoid the drawbacks of retro-fitting an older programming model
– specifically target data-intensive wide-area applications for best gain
– return results to the caller, or
– detached mode, with no return of results
– unit-of-work notion for well-defined adaptivity
Filter Instance
Define: a logically related collection of filters
– the application starts a filter instance at any time
– multiple concurrent filter instances are allowed
Unit-of-work handling
– “work” is defined by the application (ex: for vmscope, a query)
– appended to any running instance
– adaptivity occurs at the unit-of-work boundary
– similar to the ABACUS set of records in calls, with the same drawback of choosing the size
Allow flexible optimizations by the application
– ex: pre-create two instances, each placed w.r.t. the zoom factor
Asynchronous Operation
Nothing blocks and there are no barriers
– creating new filter instances
– connecting streams
– appending work to filter instances
– sending application data on streams
– stopping filter instances
The common case is cheap
– essential for any adaptivity policy
– client-server applications repeatedly use existing filter instances
– RPC style is still possible
– create instance, append work, collect result, stop instance
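One way to picture the non-blocking instance lifecycle is a worker thread fed by a work queue. The API names here (FilterInstance, append_work, result, stop) are illustrative assumptions, not DataCutter's interface:

```python
import queue
import threading

class FilterInstance:
    """Runs a filter function on appended units of work; every call
    returns immediately except result(), which waits for one output."""
    def __init__(self, fn):
        self.work, self.out = queue.Queue(), queue.Queue()
        self._t = threading.Thread(target=self._loop, args=(fn,), daemon=True)
        self._t.start()                    # creating an instance: non-blocking
    def _loop(self, fn):
        # a None unit of work is the stop sentinel
        while (w := self.work.get()) is not None:
            self.out.put(fn(w))
    def append_work(self, w):
        self.work.put(w)                   # appending work: non-blocking
    def result(self):
        return self.out.get()              # blocks only here, by choice
    def stop(self):
        self.work.put(None)                # stop request: non-blocking

# RPC style remains possible: create, append work, collect result, stop.
inst = FilterInstance(lambda q: q * 2)
inst.append_work(21)
answer = inst.result()
inst.stop()
```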
Pipeline Dataflow
Benefits
– idea: overlap communication cost with previous computation
– only schedule filters to execute when there is stream data to process
– no RTT required for the result, as in RPC systems [ABACUS]
– works best if uniform comm cost == comp cost (not always true)
[Diagram: chunks 1–3 each flowing through filter1 → filter2 → filter3 → filter4, with successive chunks overlapping in time]
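The overlap in that timeline can be demonstrated with one thread per filter stage connected by bounded queues, so a stage only runs when stream data is available. This is a minimal sketch; the stage bodies are placeholders for real filter work:

```python
import queue
import threading

def stage(fn, inq, outq):
    """Run fn on each chunk from inq; a None sentinel ends the stream."""
    while (chunk := inq.get()) is not None:
        outq.put(fn(chunk))
    outq.put(None)

def pipeline(chunks, fns):
    """Chain len(fns) filter stages. Successive chunks overlap in time,
    so no per-chunk round trip is paid as in RPC."""
    qs = [queue.Queue(maxsize=2) for _ in range(len(fns) + 1)]
    threads = [threading.Thread(target=stage, args=(f, qs[i], qs[i + 1]))
               for i, f in enumerate(fns)]
    for t in threads:
        t.start()
    for c in chunks:           # feed chunks; bounded queues give backpressure
        qs[0].put(c)
    qs[0].put(None)
    out = []
    while (r := qs[-1].get()) is not None:
        out.append(r)
    for t in threads:
        t.join()
    return out
```

While filter2 processes chunk1, filter1 is already consuming chunk2, which is the communication/computation overlap the slide describes.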