Design of a Framework for Data-Intensive Wide-Area Applications
Michael D. Beynon, Tahsin Kurc, Alan Sussman, Joel Saltz
High Performance Systems Lab, Department of Computer Science
University of Maryland, College Park, MD 20742
http://www.cs.umd.edu/projects/hpsl/
Michael D. Beynon ([email protected])
Targeted Applications
– Pathology
– Volume Rendering
– Surface/Groundwater Modeling
– Satellite Data Analysis
A Motivating Scenario
Sample Application:
– generate a 3D reconstructed view from a new set of sensor readings (computationally intensive)
– compare features with a reference db
Configuration:
– remote data server holds the reference db
– sensor host holds the large raw readings
– parallel computation farm available
[Diagram: sensor (raw dataset of sensor readings), data server (reference db with feature list), computation farm, and client PC, all connected over a WAN]
How to design the application?
A Motivating Scenario (2)
Application:
// process relevant raw readings
// generate 3D view
// compute features of 3D view
// find similar features in reference db
// display new view and similar cases
[Diagram: Extract raw (on the sensor, over the raw dataset of sensor readings) → 3D reconstruction (on the computation farm) → Extract ref (on the data server, over the reference db feature list) → View result (on the client PC), connected over a WAN]
In Summary …
Application Characteristics
– subset through range queries
– carry out data transformations
Issues
– Can target applications be decomposed into tasks?
– Where to execute application components? (heterogeneity, shared resources)
– Common support for developing such applications?
Application: Virtual Microscope
Client-server system for interactively visualizing digital slides
– image dataset (100MB to 5GB per focal plane)
– rectangular region queries, multiple data chunk reply
Setup:
– UMD 10-node IBM SP (1, 2, 4 processor nodes)
– HPSS (10TB tape, 500GB disk cache)
– dataset size: 4GB (90GB), 250GB (5.7TB)
– queries: q1–q5, q6; 4x size (q7); 16x size (q8)
– {server}–{client} as SP node – Ultra Sparc over dept Ethernet
Pipeline-style processing: read data → decompress → clip → zoom → view
Experimental Setup (2)

Filter       Volume / Chunk   Total Volume
read_data       102.52 KB        3.60 MB
decompress     2373.04 KB       83.42 MB
clip           1645.02 KB       57.83 MB
zoom (no)      1645.02 KB       57.83 MB
zoom (8)         25.70 KB        0.90 MB

[Figure: layout of query regions q1–q5 over the 4GB dataset (region sizes 90k, 90k, 180k, 45k); sample volume shown for q5]
Experiment: Filter Overhead
Setup: 4GB dataset, zoom 8
Point: overhead is small for the unoptimized filter prototype
[Chart: response time (sec, 0–50) for queries q1–q5, Original Server vs. Filter Server; 6% – 30% overhead]
Experiment: Computation / Volume
Setup: 4GB dataset, warm cache
Point: the amount of computation and communication makes a significant difference for efficient placement.
[Charts: response time (sec, 0–500) vs. filter placement (RDCZ--V, RDC--ZV, RD--CZV, R--DCZV) for queries q1–q5; left panel without sub-sampling, right panel with sub-sampling factor 8]
Experiment: Server Load
Setup: 4GB dataset, zoom 8, warm cache
Point: available server capacity and network volume have a dramatic effect on efficient placement.
[Chart: response time (sec, q1–q5 avg, 0–160) vs. filter placement (RDCZ--V, RDC--ZV, RD--CZV, R--DCZV) under 1x, 2x, and 4x server load]
Experiment: Varying Query Size
Setup: 250GB dataset, zoom 8, cold cache
Point: q1–q5 are dominated by HPSS access, dependent on tape storage location; q6–q8 are larger, but not dominated as much by HPSS access; network volume is still important.
[Charts: response time (sec, 0–1200) vs. filter placement (RDCZ--V, RDC--ZV, RD--CZV, R--DCZV), broken down into HPSS index, HPSS data, and other time; left panel queries q1–q5, right panel queries q6–q8]
Application: External Sort
Based on NowSort [Arpaci-Dusseau97]
– partitioned input/output files; 100-byte tuples, 10-byte integer key
– 128MB file of unsorted tuples per node
Partitioner: read local unsorted blocks, divide tuples into P buckets; send full bucket_i to the Sorter_i filter.
Sorter: [Sort] receive buckets and append to a sort buffer; sort full buffers and write to disk (run). [Merge] read sorted runs, merge, write to sorted local output file.
SPMD parallel program style: partitioner_1 → sorter_1, partitioner_2 → sorter_2, …, partitioner_P → sorter_P
External Sort: Algorithm
[Diagram: Phase 1 – Partitioning: partitioner_2 sends tuples to Sorter_1 … Sorter_4; each sorter writes sorted runs to disk. Phase 2 – Merge: sorter_2 merges its sorted runs into the local output file]
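The two-phase algorithm can be sketched in a few lines; this is a minimal single-process sketch under assumed parameters (the key-range bucketing, run size, and in-memory "files" are illustrative, not the NowSort implementation):

```python
import heapq

def partition(tuples, P, key_max=100):
    """Phase 1a (Partitioner): divide tuples into P buckets by key range."""
    buckets = [[] for _ in range(P)]
    width = (key_max + P - 1) // P
    for t in tuples:
        buckets[min(t[0] // width, P - 1)].append(t)
    return buckets

def make_runs(bucket, run_size):
    """Phase 1b (Sorter): sort fixed-size buffers into sorted runs."""
    return [sorted(bucket[i:i + run_size])
            for i in range(0, len(bucket), run_size)]

def merge_runs(runs):
    """Phase 2 (Merge): merge the sorted runs into one sorted output."""
    return list(heapq.merge(*runs))

def external_sort(tuples, P=4, run_size=8):
    """Because buckets cover disjoint, increasing key ranges, concatenating
    each bucket's merged output yields a globally sorted result."""
    out = []
    for bucket in partition(tuples, P):   # each bucket goes to one sorter
        out.extend(merge_runs(make_runs(bucket, run_size)))
    return out
```

In the real system each bucket travels over a stream to the sorter filter on another node, and runs are written to local disk rather than kept in memory.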
Experiment: Memory Usage
Setup: UMD 16-node Linux cluster (2 × 400MHz CPUs, 256MB, Gbit)
Point: no gain by allowing half the nodes to use full memory (ex: in the 1/8 case, 4.5x the memory)
[Chart: sort time (sec, 0–300), split into partition/sort and merge phases, for filter memory usage in 4-node groups: Full+Full, Full+1/2, Full+1/4, Full+1/8]
Experiment: Memory Usage (2)
Setup: consider the amount of memory and processors per node
Point: tuning work based on memory consumption is important
[Charts: sort time (sec, 0–300), split into partition/sort and merge phases, for filter memory usage in 4-node groups (Full+Full, Full+1/2, Full+1/4, Full+1/8); left panel with irregular partitioning, right panel with 2 sets on Full nodes]
Observations & Implications
Applications can be decomposed into tasks
Placement matters
– heterogeneity, shared resources
– application dependent (e.g., volume of data movement)
Resource availability matters
– memory capacity
Tailored support is needed for developing such applications
DataCutter Project
Targets multi-dimensional datasets, distributed storage
Subsetting through Range Queries
– a range defines a hyperbox in the multi-dimensional attribute space underlying the dataset
– retrieve items whose coordinates fall within the box
Manually divide application processing into Filters
– intended to execute near (LAN) the storage system
– to reduce the amount of data transmitted to the client
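A range query of this kind reduces to a per-item hyperbox containment test. A minimal sketch (the item and query representations are illustrative assumptions, not DataCutter's actual index structures):

```python
def in_hyperbox(point, lo, hi):
    """True if every coordinate of point lies within [lo_i, hi_i]."""
    return all(l <= x <= h for x, l, h in zip(point, lo, hi))

def range_query(items, lo, hi):
    """Retrieve items whose multi-dimensional coordinates fall within
    the hyperbox whose opposite corners are lo and hi."""
    return [item for coord, item in items if in_hyperbox(coord, lo, hi)]
```

For example, a 2-D rectangular region query over (x, y)-indexed image chunks is the special case where the attribute space has two dimensions.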
Application Filters
Purpose: specialized user code for processing data segments before returning them to the client
– inspired by research in Active Disks [Acharya, Uysal, Saltz: ASPLOS’98], dataflow, and functional parallelism
Filters are the unit of computation
– high-level tasks
– init, process, finalize interface
Streams are how filters communicate
– unidirectional buffer pipes
Filter connectivity is pre-specified
[Diagram: Extract raw (Raw Dataset) → 3D reconstruction → Extract ref (Reference DB) → View result]
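The init/process/finalize interface and unidirectional streams might look like the following sketch (the class and method names are assumptions for illustration; DataCutter's actual API differs):

```python
from collections import deque

class Stream:
    """Unidirectional buffer pipe between two filters."""
    def __init__(self):
        self.buf = deque()
    def write(self, chunk):
        self.buf.append(chunk)
    def read(self):
        # None signals that no more data is buffered
        return self.buf.popleft() if self.buf else None

class Filter:
    """Unit of computation: init, process, finalize interface."""
    def init(self):
        pass
    def process(self, ins, outs):
        raise NotImplementedError
    def finalize(self):
        pass

class Decompress(Filter):
    """Example filter: transform each chunk read from its input stream."""
    def process(self, ins, outs):
        while (chunk := ins[0].read()) is not None:
            outs[0].write(chunk.upper())   # stand-in for real decompression

def run(filt, ins, outs):
    """Drive one filter over pre-connected input and output streams."""
    filt.init()
    filt.process(ins, outs)
    filt.finalize()
```

The runtime, not the filter, decides where each filter is placed and wires the streams, which is what makes placement experiments like the ones above possible.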
Filter Environment
Want to constrain filters for
– location independence
– knowledge of the filter communication pattern
– easier scheduling of resources
– explicitly defined filter stop and restart
– a style that encourages minimized resource usage
Filter constraints
– communicate with other filters only using streams
– cannot change stream endpoints
– may pre-disclose dynamic memory/scratch space needs
What will not work well?
– fine-grained tasks
– fine coordination between filters
– long request-response chains with dependencies between queries
Related Work
[Diagram: related systems arranged by layer (Application Level, Programming Models, Infrastructure Services, Resource Level) and by resource model (Grid available resources, user-specified resources, idle resources): SRB, Legion, Client/Server Sockets, Condor Pool, Java RMI, DCOM, CORBA, NetSolve, Ninf, AppLeS, HPC++, NWS, DataCutter, Harmony DSM, MPI, RPC, DPSS, Globus]
Programming Model Semantics
RPC / RMI (Java) / CORBA
– function call semantics of imperative languages (caller blocks until the function has returned)
– simple, well-understood programming model
– drawback: mandatory RTT penalty on every call
[Diagram: timeline showing the caller blocked while the remote function executes]
Partial Solution [ABACUS]
Chain RPC calls if possible
– A() calls B(), which calls C(), which calls D(), etc.
– reduces some of the RTT penalties
– drawback: only works when the computation is structured as a chain of calls
Batch multiple calls into a single call
– if A(int) was original, change to use A’(int *array, int sz)
– drawbacks:
– how to choose the batch size?
– are earlier function calls delayed to achieve a full batch?
– mandatory overhead for the first call
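The batching transformation can be sketched as follows. A and A_batched are hypothetical stand-ins for the per-call RPC and its array variant; the batcher makes the listed drawbacks concrete, since earlier calls sit in the pending list until the batch fills:

```python
def A(x):
    """Original per-call RPC: one round trip per invocation."""
    return x * x               # stand-in for the remote work

def A_batched(xs):
    """Batched variant: one round trip amortized over len(xs) calls."""
    return [x * x for x in xs]

class Batcher:
    """Client-side batcher that flushes when the batch is full;
    choosing batch_size is exactly the open question noted above."""
    def __init__(self, batch_size):
        self.batch_size = batch_size
        self.pending = []
    def call(self, x):
        self.pending.append(x)
        if len(self.pending) >= self.batch_size:
            return self.flush()
        return []              # earlier calls are delayed until the batch fills
    def flush(self):
        out = A_batched(self.pending)
        self.pending = []
        return out
```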
Alternate Solution [DataCutter]
Change the programming model
– mandate use of the filter-stream programming model (message passing style)
– avoid the drawbacks of retro-fitting an older programming model
– specifically target data-intensive wide-area applications for best gain
– return results to the caller, or
– detached mode, with no return of results
– unit-of-work notion for well-defined adaptivity
Filter Instance
Define: a logically related collection of filters
– the application starts a filter instance at any time
– multiple concurrent filter instances are allowed
Unit-of-work handling
– “work” is defined by the application (ex: for vmscope, a query)
– appended to any running instance
– adaptivity occurs at the unit-of-work boundary
– similar to the ABACUS set of records in calls, with the same drawback of choosing the size
Allow flexible optimizations by the application
– ex: pre-create two instances, each placed w.r.t. the zoom factor
Asynchronous Operation
Nothing blocks and there are no barriers
– creating new filter instances
– connecting streams
– appending work to filter instances
– sending application data on streams
– stopping filter instances
The common case is cheap
– essential for any adaptivity policy
– client-server applications repeatedly use existing filter instances
– RPC style is still possible
– create instance, append work, collect result, stop instance
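One way to picture the non-blocking instance lifecycle is a worker thread fed by a work queue. The API names here (FilterInstance, append_work, result, stop) are illustrative assumptions, not DataCutter's interface:

```python
import queue
import threading

class FilterInstance:
    """Runs a filter function on appended units of work; every call
    returns immediately except result(), which waits for one output."""
    def __init__(self, fn):
        self.work, self.out = queue.Queue(), queue.Queue()
        self._t = threading.Thread(target=self._loop, args=(fn,), daemon=True)
        self._t.start()                    # creating an instance: non-blocking
    def _loop(self, fn):
        # a None unit of work is the stop sentinel
        while (w := self.work.get()) is not None:
            self.out.put(fn(w))
    def append_work(self, w):
        self.work.put(w)                   # appending work: non-blocking
    def result(self):
        return self.out.get()              # blocks only here, by choice
    def stop(self):
        self.work.put(None)                # stop request: non-blocking

# RPC style remains possible: create, append work, collect result, stop.
inst = FilterInstance(lambda q: q * 2)
inst.append_work(21)
answer = inst.result()
inst.stop()
```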
Pipeline Dataflow
Benefits
– idea: overlap communication cost with previous computation
– only schedule filters to execute when there is stream data to process
– no RTT required for the result, as in RPC systems [ABACUS]
– works best if uniform comm cost == comp cost (not always true)
[Diagram: chunks 1–3 each flowing through filter1 → filter2 → filter3 → filter4, with successive chunks overlapping in time]
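The overlap in that timeline can be demonstrated with one thread per filter stage connected by bounded queues, so a stage only runs when stream data is available. This is a minimal sketch; the stage bodies are placeholders for real filter work:

```python
import queue
import threading

def stage(fn, inq, outq):
    """Run fn on each chunk from inq; a None sentinel ends the stream."""
    while (chunk := inq.get()) is not None:
        outq.put(fn(chunk))
    outq.put(None)

def pipeline(chunks, fns):
    """Chain len(fns) filter stages. Successive chunks overlap in time,
    so no per-chunk round trip is paid as in RPC."""
    qs = [queue.Queue(maxsize=2) for _ in range(len(fns) + 1)]
    threads = [threading.Thread(target=stage, args=(f, qs[i], qs[i + 1]))
               for i, f in enumerate(fns)]
    for t in threads:
        t.start()
    for c in chunks:           # feed chunks; bounded queues give backpressure
        qs[0].put(c)
    qs[0].put(None)
    out = []
    while (r := qs[-1].get()) is not None:
        out.append(r)
    for t in threads:
        t.join()
    return out
```

While filter2 processes chunk1, filter1 is already consuming chunk2, which is the communication/computation overlap the slide describes.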