An Implementation and Performance Evaluation of a Language with Fine-Grain Thread Creation on a Shared-Memory Parallel Computer
Yoshihiro Oyama, Kenjiro Taura,
Toshio Endo, Akinori Yonezawa
Department of Information Science, Faculty of Science,
University of Tokyo
Background
“Irregular” parallel applications
• Tasks are not identified until runtime
• Synchronization structure is complicated
Languages with fine-grain threads
• A promising approach to handle the complexity
Motivation
Q: Are fine-grain threads really effective?
• Easy to describe irregular parallelism?
• Scalable?
• Fast?
Many sophisticated designs and implementation techniques have been proposed so far, but case studies that answer the Q are few.
Goal
A case study to better understand the effectiveness of fine-grain threads:
C + Solaris threads (approach without fine-grain threads)
vs.
our language Schematic (approach with fine-grain threads)
in terms of
• program description cost
• speed on 1 PE
• scalability on a 64-PE SMP
Overview
• Applications (RNA & CKY)
• Solutions without fine-grain threads
• Solutions with fine-grain threads
• Performance evaluation
Case Study 1: RNA (protein secondary structure prediction)
Algorithm: simple node traversal + pruning
• finding a path satisfying a certain condition with the largest weight
• unbalanced tree
Case Study 2: CKY (context-free grammar parser)
Example input: “She is a girl whose mother is a teacher.”
Calculation of matrix elements: each element depends on all shorter spans in its row and column
• calculation time varies significantly from element to element
• actual matrix size ≒ 100
Solution without Fine-grain Threads (RNA)
Creating a thread for each node → large overhead (communication with memory)
→ Task pool shared among the PEs
Decision strategy?
• trial & error
• prediction
Solution without Fine-grain Threads (CKY)
Calculating 1 element → 0–200 synchronizations
How to implement the waits?
• small delay → simple spin
• large delay → block wait
Schematic [Taura et al. 96] = Scheme + future + touch [Halstead 85]

(define (fib x)
  (if (< x 2)
      1
      (let ((r1 (future (fib (- x 1))))   ; thread creation
            (r2 (future (fib (- x 2)))))
        (+ (touch r1) (touch r2)))))      ; synchronization

future creates a thread and immediately returns a channel; touch waits for the value sent to that channel.
Language with Fine-grain Threads
Thread Management in Schematic
• Lazy Task Creation [Mohr et al. 91]: a future runs as an ordinary sequential call by default; it becomes a real thread only when another PE steals it.
[Figure: each future becomes a frame on PE A's stack; an idle PE B steals work from PE A's stack.]
Synchronization on Register
• StackThreads [Taura 97]
[Figure: on PE A and PE B, synchronization values are passed in registers where possible, falling back to memory.]
Synchronization by Code Duplication
A thread runs work A, then (touch r), then work B. The code after the touch (work B) is duplicated, plus heuristics to decide which parts to duplicate: ver. 1 runs inline when the channel already has a value (like a simple spin), ver. 2 runs later as a continuation closure (like a block wait).

if (r has value) {
    work B ver. 1;
} else {
    c = closure(cont, fv1, ...);
    put_closure(r, c);
    /* switch to another work */
}

cont(c, v) {
    work B ver. 2;
}
What description can be omitted in Schematic?
Management of fine-grain tasks:
• future ⇔ manipulation of the task pool + load balancing
Synchronization details:
• touch ⇔ manipulation of the communication medium + aggressive optimizations
Codes for Parallel Execution
C:

int search_node(...)
{
    if (condition) {
        ...
    } else {
        child = ...;
        ...
        search_node(...);
        ...
    }
}

Schematic:

(define (search_node)
  (if condition
      'done
      (let ((child ...))
        ...
        (search_node)
        ...)))

Lines of code (RNA):
• C: whole 1566 lines, for parallel execution 537 lines (34 %)
• Schematic: whole 453 lines, for parallel execution 29 lines (6.4 %)
Performance Evaluation (Conditions)
• Sun Ultra Enterprise 10000 (UltraSparc 250 MHz × 64)
• Solaris 2.5.1, Solaris threads (user-level threads)
• GC time not included
• Runtime type checks omitted
Performance Evaluation (Sequential)
[Bar chart: normalized elapsed time (0–3) of C vs. Schematic on RNA and CKY.]
Performance Evaluation (Parallel)
[Graph: speedup (0–50) vs. # of PEs (0–60) for C (RNA), Schematic (RNA), C (CKY), Schematic (CKY).]
Related Work
ICC++ [Chien et al. 97]
• Similar study using 7 apps
• Experiments on distributed-memory machines
• Focus on namespace management, data locality, and the object-consistency model
Conclusion
We demonstrated the usefulness of fine-grain multithreaded languages:
• task pool-like execution with a simple description
• aggressive optimizations for synchronization
We showed the experimental results:
• a factor of 2.8 slower than C
• scalability comparable to C
Performance Evaluation (Other Applications 1/2)
[Bar chart: normalized elapsed time (0–4) of C vs. Schematic on Fib, Tak, Qsort, Knapsack, Grobner, SPLASH2; one bar is labeled 14.7 (off scale).]
Performance Evaluation (Other Applications 2/2)
[Graph: speedup (0–50) vs. # of PEs (0–60) for Fib, Tak, Nqueen, Qsort, Knapsack, Puzzle, QAP, SPLASH2.]
Identifying Overheads
[Bar chart: normalized elapsed time (0–1000) under successive configurations: normal, no poll, no GC check, stolen tag opt., flag check, use small tag, global var opt., C.]