Post on 18-Jan-2016
On-line Parallel Tomography
Shava Smallen
UCSD
Talk Outline
I) Introduction to On-line Parallel Tomography
II) Tunable On-line Parallel Tomography
III) User-directed application-level scheduler
IV) Experiments
V) Conclusion
What is tomography?
• A method for reconstructing the interior of an object from its projections
• At the National Center for Microscopy and Imaging Research (NCMIR), tomography is applied to electron microscopy to study specimens at the cellular and subcellular level
Example
Tomogram of spiny dendrite (images courtesy of Steve Lamont)
Parallel Tomography at NCMIR
• Embarrassingly parallel
[Figure: specimen with X, Y, and Z axes; labels: projection, scanline, slice]
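Assuming, as the figure labels suggest, that each slice depends only on its own scanline from every projection, the slices can be reconstructed independently. The sketch below shows that structure only: `reconstruct_slice` is a hypothetical placeholder that sums values rather than backprojecting, and the thread pool stands in for the real set of machines.

```python
from concurrent.futures import ThreadPoolExecutor


def reconstruct_slice(scanlines):
    """Hypothetical stand-in for per-slice reconstruction.

    `scanlines` holds one scanline per projection, all for the same slice.
    A real reconstruction would backproject them; summing keeps this short.
    """
    return sum(sum(line) for line in scanlines)


def reconstruct_all(projections, workers=4):
    """Reconstruct every slice independently of the others.

    `projections[p][y]` is scanline y of projection p, so slice y needs
    scanline y from every projection and nothing else.
    """
    n_slices = len(projections[0])
    per_slice = [[proj[y] for proj in projections] for y in range(n_slices)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # No slice reads another slice's data, so no coordination is needed.
        return list(pool.map(reconstruct_slice, per_slice))
```

Because no task shares data with another, adding workers needs no change to the per-slice code.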
NCMIR Usage Scenarios
Off-line parallel tomography (off-line PT)
– Data resides somewhere on secondary storage
– Single, high quality tomogram
– Reduce turnaround time
– Previous work (HCW ’00)
On-line parallel tomography (on-line PT)
– Data streamed from the electron microscope
• long makespan, configuration errors, etc.
– Iteratively computed tomogram
– Soft real-time execution
On-line PT
• Real-time feedback on quality of data acquisition
1) First projection acquired from microscope
2) Generate coarse tomogram
3) Iteratively refine tomogram using subsequent projections (refresh)
• Update each voxel value
• Size of tomogram is constant
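The refinement loop can be sketched for a single slice as follows. The sketch uses plain unfiltered backprojection purely for illustration; the slides do not say which reconstruction method is used, and `make_slice` and `refine` are hypothetical names. What it demonstrates is that each new projection updates every voxel while the slice keeps the same size.

```python
import math


def make_slice(size):
    """Allocate one size x size slice; its size never changes between refreshes."""
    return [[0.0] * size for _ in range(size)]


def refine(slice_, scanline, angle_deg):
    """Add one projection's scanline into every voxel of the slice.

    Unfiltered backprojection, chosen only for illustration: each voxel
    picks up the scanline sample that its position projects onto.
    """
    size = len(slice_)
    center = (size - 1) / 2.0
    theta = math.radians(angle_deg)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    for z in range(size):
        for x in range(size):
            # Detector coordinate of voxel (x, z) at this tilt angle.
            t = (x - center) * cos_t + (z - center) * sin_t + center
            idx = int(round(t))
            if 0 <= idx < len(scanline):
                slice_[z][x] += scanline[idx]
    return slice_


# Coarse tomogram from the first projection, then refined by the next one.
tomo = make_slice(4)
for angle, line in [(0, [1, 2, 3, 4]), (90, [10, 20, 30, 40])]:
    refine(tomo, line, angle)
```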
NCMIR Target Platform
• Multi-user, heterogeneous resources
– NCMIR cluster
• SGI Indigo2, SGI Octane, SUN ULTRA, SUN Enterprise
• IRIX, Solaris
– Meteor cluster
• Pentium III dual proc
• Linux, PBS
– Blue Horizon
• AIX, LoadLeveler, Maui Scheduler
On-line PT Architecture
[Figure: data flow from the preprocessor (projection in, scanlines out) across the network to several ptomo processes, which send slices to the writer, which produces the tomogram]
On-line PT Design
1) Frame on-line parallel tomography as a tunable application
– Resource limitations / dynamic
– Availability of alternate configurations [Chang, et al.]
• each configuration corresponds to different output quality and resource usage
2) Coupled with user-directed application-level scheduler (AppLeS)
– adaptive scheduler
– promote application performance
On-line PT Configuration
• Triple: (f, r, su)
• Reduction factor (f)
– Reduce resolution of data → reduce both computation and communication
• Projections per refresh (r)
– Reduce refinement frequency → reduce communication
• Service units (su)
– Increase cost of execution → increase computational power
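A configuration triple can be written down as a small record. The sketch below rests on two assumptions that the slides do not state outright: that f shrinks each projection dimension by that factor, and that one projection arrives per acquisition period, so a refresh happens every r periods. `Config`, `reduced_shape` and `refresh_period` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    f: int   # reduction factor
    r: int   # projections per refresh
    su: int  # service units


def reduced_shape(width, height, cfg):
    """Projection size after reduction, assuming f shrinks each dimension."""
    return (width // cfg.f, height // cfg.f)


def refresh_period(acquisition_period_s, cfg):
    """Seconds between refreshes, assuming one projection per acquisition period."""
    return cfg.r * acquisition_period_s
```

For instance, under these assumptions `reduced_shape(1024, 1024, Config(f=4, r=1, su=0))` gives a 256x256 projection.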
User Preferences
• Best configuration (f, r, su) = (1, 1, 0)
• Several possible configurations → user specifies bounds
– projections should be at least size 256x256
• 1 ≤ f ≤ 4 or 1 ≤ f ≤ 8
– user could tolerate up to a 10 minute wait
• 1 ≤ r ≤ 13
– reasonable upper bound
• 0 ≤ su ≤ (50 x acquisition period x c)
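One plausible reading of where these bounds come from is sketched below. The projection sizes 1024 and 2048 are inferred from the f bounds and the 256x256 minimum, not stated on the slide. The acquisition period is not given either, so the 45 seconds used in the usage lines is only an illustrative value that happens to reproduce r ≤ 13. The constant c is left undefined on the slide and is passed through unchanged.

```python
def max_reduction_factor(projection_size, min_size=256):
    """Largest f that keeps a square projection at least min_size wide."""
    return projection_size // min_size


def max_projections_per_refresh(max_wait_s, acquisition_period_s):
    """Largest r whose refresh period still fits within the tolerated wait."""
    return int(max_wait_s // acquisition_period_s)


def max_service_units(acquisition_period_s, c, multiplier=50):
    """Upper bound on su as written on the slide: 50 x acquisition period x c."""
    return multiplier * acquisition_period_s * c


# Illustrative values only; the slides do not state the acquisition period.
f_max_small = max_reduction_factor(1024)
f_max_large = max_reduction_factor(2048)
r_max = max_projections_per_refresh(10 * 60, 45)
```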
User-directed
• Feasible?
– Use dynamic load information
– if work allocation found
• Better?
– e.g.
1. (1, 6, 4) - best f
2. (2, 2, 8) - good su/r
3. (2, 1, 20) - best r
[Figure labels: reduction factor, projections per refresh, service units]
User-directed AppLeS
[Figure: flowchart with two lanes. User: generate request, review triples, adjust request (accepts one / rejects all). User-directed AppLeS: process request, find work allocation (feasible / infeasible), display triples, execute on-line PT]
Triple Search
• Search parameter space
– If triple satisfies constraints → feasible
• Constrained optimization problem based on soft real-time execution
– compute constraint
– transfer constraint
• Heuristics to reduce search space
– e.g. assume user will always choose (1,2,1) over (1,2,4)
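One way to read the heuristic is as dominance pruning: drop any triple that another triple matches or beats in every field. The slide gives only the single example, so the generalization below is an assumption, as is treating smaller f, r, and su as preferable (consistent with (1, 1, 0) being the best configuration). `dominates` and `prune` are hypothetical names.

```python
def dominates(a, b):
    """True if triple a is no worse than b in every field and differs from it.

    Assumes smaller f, r, and su are all preferable.
    """
    return all(x <= y for x, y in zip(a, b)) and a != b


def prune(triples):
    """Keep only triples that no other triple dominates."""
    return [t for t in triples if not any(dominates(o, t) for o in triples)]


candidates = [(1, 2, 1), (1, 2, 4), (2, 1, 20)]
remaining = prune(candidates)  # (1, 2, 4) is dropped in favor of (1, 2, 1)
```

Triples that trade one parameter against another, such as (1, 2, 1) and (2, 1, 20), both survive, which leaves the final choice to the user.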
Work Allocation
[Figure: the work allocation is determined by user constraints, cost, compute constraints (cpu availability, processor availability), and transfer constraints (ptomo-to-writer bandwidth, subnet-to-writer bandwidth)]
• Multiple mixed-integer programs → approximate solution
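The shape of the problem can be illustrated with a much simpler proportional heuristic, swapped in here for the mixed-integer programs named on the slide. The forms of the two constraints are assumptions: each machine must finish its share of a projection within one acquisition period, and must send its slices to the writer within one refresh period (r acquisition periods). The machine fields `sec_per_slice`, `cpu`, and `bw` are hypothetical.

```python
def allocate(n_slices, machines, acquisition_period, r, slice_bytes):
    """Split slices in proportion to effective speed, then check feasibility.

    Each machine is a dict with hypothetical fields:
      'sec_per_slice': dedicated seconds to process one slice's scanline
      'cpu': fraction of the CPU currently available, in (0, 1]
      'bw': bytes per second to the writer
    Returns a list of slice counts, or None if a constraint is violated.
    """
    speeds = [m["cpu"] / m["sec_per_slice"] for m in machines]
    total = sum(speeds)
    counts = [int(n_slices * s / total) for s in speeds]
    # Hand any leftover slices to the fastest machines first.
    leftover = n_slices - sum(counts)
    order = sorted(range(len(machines)), key=lambda i: speeds[i], reverse=True)
    for i in order[:leftover]:
        counts[i] += 1
    for m, w, s in zip(machines, counts, speeds):
        compute_time = w / s                       # assumed compute constraint
        transfer_time = w * slice_bytes / m["bw"]  # assumed transfer constraint
        if compute_time > acquisition_period:
            return None
        if transfer_time > r * acquisition_period:
            return None
    return counts
```

A triple would count as feasible in this sketch when `allocate` returns a list rather than `None`.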
Experiments
• Impact of dynamic information on scheduler performance
• Usefulness of tunability in Grid environments
• Scheduling latency
Dynamic Information
• We fix the triple and let schedulers determine work allocation
                 Infinite bandwidth   Dynamic bandwidth
Dedicated cpu    wwa                  wwa+bw
Dynamic cpu      wwa+cpu              AppLeS
Simulation
• Evaluate schedulers
– Repeatability
– Long makespan
– several resource environments
• Simgrid (Casanova [CCGrid’2001])
– API for evaluating scheduling algorithms
• tasks
• resources modeled using traces
– E.g. Parameter sweep applications [HCW’00]
• Simtomo
Performance Metric
• Relative refresh lateness
[Figure: relative refresh lateness shown in relation to the expected refresh period and the actual refresh period]
NCMIR experiments
• Traces (8 machines)
– 8 hour work day on March 8th, 2001
• Ran simulations throughout day at 10 minute intervals
[Figure: timeline from 8:00 am to 4:00 pm]
Perfect Load Predictions
[Plot: mean relative refresh lateness (log scale, 10^0 to 10^4) vs. hours since 3/8/2001 - 8:00 PST (0 to 8); series: wwa, wwa+cpu, wwa+bw, AppLeS]
Imperfect Load Predictions
[Plot: mean relative refresh lateness (log scale, 10^0 to 10^4) vs. hours since 3/8/2001 - 8:00 PST (0 to 8); series: wwa, wwa+cpu, wwa+bw, AppLeS]
Synthetic Grids
• Bandwidth predictability
– Average prediction error
– pi ∈ {L, M, H}
– p1 p2 p3
• e.g. LMH
– 27 types
– 2510 Grids x 4 schedulers
– 10,040 simulations
[Figure: cluster1, cluster2, and cluster3 connected to the writer by links labeled p1, p2, and p3]
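The counts on this slide follow from simple enumeration: three links, each labeled L, M, or H, give 3^3 = 27 types, and 2510 Grids run under 4 schedulers give 10,040 simulations. A small sketch, with `grid_types` as a hypothetical helper name:

```python
from itertools import product

LEVELS = "LMH"  # the three labels used on the slide


def grid_types(n_links=3):
    """All label strings such as 'LMH', one letter per link."""
    return ["".join(p) for p in product(LEVELS, repeat=n_links)]


n_types = len(grid_types())  # 3 ** 3
n_simulations = 2510 * 4     # Grids x schedulers
```

The same enumeration with five labels yields the 243 Grid types used in the tunability experiments.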
Relative Scheduler Performance
[Bar chart: number of runs (0 to 3000) in which each scheduler (wwa, wwa+cpu, wwa+bw, AppLeS) ranked 1st, 2nd, 3rd, or 4th]
Values shown with the chart: 705.89 658.91 127.10 1.07
Partial Ordering
• Performance vs. bandwidth predictability
• Grid predictability
– Partial orders using p1 p2 p3
– Comparable/Not Comparable
• e.g. HML is comparable to HLL
• e.g. HLM is not comparable to LHM
• HHH, HHM, HMM, HLM, MLM, LLM, LLL
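The comparability rule can be checked mechanically. The sketch assumes the usual componentwise order with L < M < H in each position, which agrees with both examples on the slide; the function names are hypothetical.

```python
RANK = {"L": 0, "M": 1, "H": 2}


def at_least(a, b):
    """True if a is greater than or equal to b in every position."""
    return all(RANK[x] >= RANK[y] for x, y in zip(a, b))


def comparable(a, b):
    """Two Grid types are comparable if one is at least the other everywhere."""
    return at_least(a, b) or at_least(b, a)


def is_chain(types):
    """True if each type is at least the next one, i.e. the list is totally ordered."""
    return all(at_least(a, b) for a, b in zip(types, types[1:]))


chain = ["HHH", "HHM", "HMM", "HLM", "MLM", "LLM", "LLL"]
```

Under this order, the seven types listed on the slide form a chain in which each step lowers exactly one link by one level.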
Example Partial Order
[Plot: relative refresh lateness (seconds), log scale 10^0 to 10^4, for Grid types HHH, HHM, HMM, HLM, MLM, LLM, LLL; series: wwa, wwa+cpu, wwa+bw, AppLeS]
Tunability Experiments
• How useful is tunability?
– variability
• Fixed topology
– categorized traces
• L, M, H
– v1 v2 v3 v4 v5
– 243 Grid types
[Figure: cluster1, cluster2, and a supercomputer connected to the writer; labels v1 through v5]
Tunability Experiments
• Run over a 2 day period
– back-to-back
– assume single user model
• f, r, su
• Set of triples chosen
– T = {1,…,61}
[3D plot: axes f, r, and su]
Tunability Results
• Count how many times a triple changed per 2-day simulation
• e.g.
– 12.9%
– 25.7%
[Bar chart: fraction of changes (0 to 1) for each parameter: f, r, su]
Scheduling Latency
• Time to search for feasible triples
• e.g.
– 88% under 1 sec
– 63% under 1 sec
[Histogram: number of experiments (0 to 7000) vs. seconds (0 to 10)]
Conclusions and Future Work
• Grid-enabled version of on-line parallel tomography
– Tunable application
• Tunability is useful in Grid environments
– User-directed AppLeS
• Importance of bandwidth predictability
– e.g. rescheduling
• Scheduling latency is nominal
• Production use