Hydra VM: Extracting Parallelization From Legacy Code Using STM
Mohamed M. Saad, Mohamed A. Mohamedin & Prof. Binoy Ravindran
VT-MENA Program, Electrical & Computer Engineering Department, Virginia Polytechnic Institute and State University
Outline
Motivation & Objectives
Background
▪ Transactional Memory
▪ Jikes RVM
Program Reconstruction
Architecture
▪ Profiler, Builder & Runtime
Future Work
Motivation
Why Multicores?
Difficult to make single-core clock frequencies even higher
Deeply pipelined circuits
▪ heat problems
▪ speed of light problems
▪ difficult design and verification
▪ large design teams necessary
▪ server farms need expensive air-conditioning
Motivation
No fast CPUs any more, just more cores!
The trend is toward multi-core & hyper-threading
Motivation
In 2005: Sun Niagara (8 cores with HT, running 32 hardware threads (HWT))
In 2010: Supermicro (48-core AMD Opteron)
Now: Sun makes boxes with between 128 and 512 hardware threads (16 HWT/core, 8 cores/CPU)
What about software?
Are we ready for this hardware?
Objective
Many applications are designed to use only a few threads
Legacy systems were designed to run on a single processor
Multi-threaded programming is a headache for developers (race conditions, concurrent access, …)
HydraVM: a Java Virtual Machine prototype based on Jikes RVM that targets utilizing a large number of cores by automatically detecting portions of code that can run in parallel
Background
Transactional Memory
Jikes RVM (Adaptive Online Architecture)
Atomicity
Atomicity: An operation (or set of operations) appears to the rest of the system to occur instantaneously
Example (Money Transfer):
    …
    synchronized {
        from = from - amount
        to = to + amount
    }
    …
≈
Example (Money Transfer):
    …
    account1.lock()
    account2.lock()
    from = from - amount
    to = to + amount
    account1.unlock()
    account2.unlock()
    …
[Figure: account1 and account2, with X and Y]
Mutual Locks “Classical Approach”
Drawbacks
▪ Deadlock
▪ Livelock
▪ Starvation
▪ Priority Inversion
▪ Non-composable
▪ Cost of managing the lock
▪ Non-scalable on multiprocessors
[Figure: A and B, with X and Y]
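The lock-based money transfer and its deadlock drawback can be sketched in runnable Java. This is only an illustration: the `Account` class, the `id` field and the rule of always locking the lower `id` first are assumptions added here, not something the slides specify. Without a fixed locking order, two opposing transfers could each hold one lock and wait forever for the other.

```java
import java.util.concurrent.locks.ReentrantLock;

class LockTransferSketch {
    // Hypothetical account type; the slides only name account1 and account2.
    static final class Account {
        final int id;
        long balance;
        final ReentrantLock lock = new ReentrantLock();
        Account(int id, long balance) { this.id = id; this.balance = balance; }
    }

    // Lock both accounts in id order (an added assumption that avoids deadlock),
    // then move the money while both locks are held.
    static void transfer(Account from, Account to, long amount) {
        Account first = from.id < to.id ? from : to;
        Account second = first == from ? to : from;
        first.lock.lock();
        try {
            second.lock.lock();
            try {
                from.balance -= amount;
                to.balance += amount;
            } finally {
                second.lock.unlock();
            }
        } finally {
            first.lock.unlock();
        }
    }

    // Runs opposing transfers on two threads and returns the combined balance.
    static long demo() {
        Account a = new Account(1, 1000);
        Account b = new Account(2, 1000);
        Thread t1 = new Thread(() -> { for (int i = 0; i < 10_000; i++) transfer(a, b, 1); });
        Thread t2 = new Thread(() -> { for (int i = 0; i < 10_000; i++) transfer(b, a, 1); });
        t1.start();
        t2.start();
        try {
            t1.join();
            t2.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return a.balance + b.balance;
    }

    public static void main(String[] args) {
        System.out.println("total = " + demo());
    }
}
```

The combined balance is unchanged because every transfer holds both locks, but the ordering rule has to be followed by every caller, which is one face of the "non-composable" drawback.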
Transactional Memory
Simplifies parallel programming by allowing a group of load and store instructions to execute in an atomic way using additional primitives
Example (Money Transfer):
    …
    START-TRANSACTION
    from = from - amount
    to = to + amount
    END-TRANSACTION
    …
Commit, or Rollback & Retry
[Figure: X and Y operating on account1 and account2 (labels: account1x, account2x, account1y, account2y)]
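The "commit, or rollback & retry" behaviour can be imitated in plain Java with a single compare-and-set. This is not an STM and not HydraVM's mechanism; it is a minimal sketch that assumes both balances live in one immutable snapshot (the hypothetical `Balances` class), so a commit is one atomic pointer swap.

```java
import java.util.concurrent.atomic.AtomicReference;

class OptimisticTransferSketch {
    // Immutable snapshot of both balances; a commit replaces the whole snapshot.
    static final class Balances {
        final long from;
        final long to;
        Balances(long from, long to) { this.from = from; this.to = to; }
    }

    // Returns how many attempts were needed before the update committed.
    static int transfer(AtomicReference<Balances> state, long amount) {
        int attempts = 0;
        while (true) {
            attempts++;
            Balances seen = state.get();                       // START-TRANSACTION: read a snapshot
            Balances next = new Balances(seen.from - amount,   // work on a private copy
                                         seen.to + amount);
            if (state.compareAndSet(seen, next)) {             // END-TRANSACTION: try to commit
                return attempts;
            }
            // Another thread committed first: drop the private copy and retry.
        }
    }

    // Four threads move 5,000 units each; returns {from, to} at the end.
    static long[] demo() {
        AtomicReference<Balances> state = new AtomicReference<>(new Balances(100_000, 0));
        Thread[] workers = new Thread[4];
        for (int t = 0; t < workers.length; t++) {
            workers[t] = new Thread(() -> { for (int i = 0; i < 5_000; i++) transfer(state, 1); });
            workers[t].start();
        }
        try {
            for (Thread w : workers) w.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        Balances end = state.get();
        return new long[] { end.from, end.to };
    }

    public static void main(String[] args) {
        long[] r = demo();
        System.out.println("from = " + r[0] + ", to = " + r[1]);
    }
}
```

No thread ever blocks on a lock here; a losing thread simply repeats its work, which is the trade-off the transactional model makes.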
Transactional Memory
Each transaction has a ReadSet & a WriteSet
Transactions conflict if they have the same variable(s) in their ReadSet / WriteSet
Conflicts are resolved by a Contention Manager that employs different policies (Aggressive, Polite, Back-Off, Random, …)
Aborted code undoes its changes (if required) and retries
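A conflict check over ReadSets and WriteSets might look like the sketch below. It assumes the usual rule that two transactions conflict only when at least one of them writes a shared variable (two readers do not conflict); the slides do not spell this out, and the `Tx` class and `setOf` helper are hypothetical names. Contention-manager policies are not modelled.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

class ConflictSketch {
    // Hypothetical transaction descriptor holding only the two sets.
    static final class Tx {
        final Set<String> readSet;
        final Set<String> writeSet;
        Tx(Set<String> readSet, Set<String> writeSet) {
            this.readSet = readSet;
            this.writeSet = writeSet;
        }
    }

    static Set<String> setOf(String... names) {
        return new HashSet<>(Arrays.asList(names));
    }

    // Assumed rule: write/write, write/read or read/write overlap is a conflict.
    static boolean conflicts(Tx a, Tx b) {
        return !Collections.disjoint(a.writeSet, b.writeSet)
            || !Collections.disjoint(a.writeSet, b.readSet)
            || !Collections.disjoint(a.readSet, b.writeSet);
    }

    public static void main(String[] args) {
        Tx updatesC = new Tx(setOf("C"), setOf("C"));
        Tx updatesJ = new Tx(setOf("J"), setOf("J"));
        Tx readsC = new Tx(setOf("C"), setOf());
        System.out.println(conflicts(updatesC, updatesJ)); // disjoint variables
        System.out.println(conflicts(updatesC, readsC));   // writer vs reader of C
        System.out.println(conflicts(readsC, readsC));     // two readers
    }
}
```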
Transactional Memory
Transactions may be nested (multiple levels)
Inner transactions share the ReadSet/WriteSet of the parent
Inner transactions conflict with each other and with other higher-level transactions
Aborting a parent transaction forces its children to abort
An inner transaction's changes are visible to its parents once it commits successfully, but hidden from the outside world until the highest-level transaction commits
Transactional Memory
Hardware Transactional Memory
▪ Modifications in processors, cache and bus protocol
▪ e.g. unbounded HTM, TCC, …
Software Transactional Memory
▪ Software runtime library or programming language support
▪ Minimal hardware support: CAS, LL/SC
▪ e.g. RSTM, DSTM, ESTM, …
Hybrid Transactional Memory
▪ Exploits HTM support to achieve hardware performance for transactions that do not exceed the HTM's limitations, and STM otherwise
▪ e.g. LogTM, HyTM, …
Distributed Transactional Memory
▪ Extends transaction primitives to a distributed environment (network of multiple machines)
▪ e.g. HyFlow, DecentSTM, GenSTM, …
Jikes RVM
Mature modular open source Java virtual machine designed for research purposes. Unlike most other JVMs it is written in Java!
Adaptive Online System
Program Reconstruction “The Main Idea”
We view a program as a set of basic building blocks
▪ Each block is a set of instructions
▪ A block has a single entry and multiple exits
▪ Blocks may access the same memory (variables)
It is possible to reconstruct the program from these blocks by rearranging them differently, with some changes to the control instructions
It is even possible to assign each set of blocks to a different thread
Example
int counter = 0;
for (int i = 0; i < 12; i++)
    if (Math.random() > 0.3) counter++;
    else counter--;
 0: iconst_0
 1: istore_1
 2: iconst_0
 3: istore_2
 4: goto 29
 7: invokestatic #13;
10: ldc2_w #19;
13: dcmpl
14: ifle 23
17: iinc 1, 1
20: goto 26
23: iinc 1, -1
26: iinc 2, 1
29: iload_2
30: bipush 12
32: if_icmplt 7
35: return
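One way to find block boundaries in a listing like this is the textbook "leader" rule: the first instruction, every branch target, and every instruction that follows a branch each start a new block. The sketch below applies that rule to the offsets and branch targets of the listing. It is an assumption that HydraVM splits code this way; the slides define a block as having a single entry and multiple exits, which may differ from the classical definition.

```java
import java.util.SortedSet;
import java.util.TreeSet;

class BasicBlockSketch {
    // offsets[i] is the bytecode offset of instruction i;
    // targets[i] is its branch target, or -1 if it is not a branch.
    static SortedSet<Integer> leaders(int[] offsets, int[] targets) {
        SortedSet<Integer> leaders = new TreeSet<>();
        leaders.add(offsets[0]);                          // first instruction
        for (int i = 0; i < offsets.length; i++) {
            if (targets[i] >= 0) {
                leaders.add(targets[i]);                  // branch target
                if (i + 1 < offsets.length) {
                    leaders.add(offsets[i + 1]);          // instruction after a branch
                }
            }
        }
        return leaders;
    }

    // Offsets and branch targets transcribed from the listing above.
    static String demo() {
        int[] offsets = {0, 1, 2, 3, 4, 7, 10, 13, 14, 17, 20, 23, 26, 29, 30, 32, 35};
        int[] targets = {-1, -1, -1, -1, 29, -1, -1, -1, 23, -1, 26, -1, -1, -1, -1, 7, -1};
        return leaders(offsets, targets).toString();
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [0, 7, 17, 23, 26, 29, 35]
    }
}
```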
Example
public class Test {
    public static void foo() {
        int counter = 0;
        for (int i = 0; i < 12; i++)
            if (Math.random() > 0.3) counter++;
            else counter--;
    }

    public static void zoo() {
        System.out.println("hi");
    }

    public static void main(String[] args) {
        int i = 6;
        if (i < 10) foo();
        else zoo();
    }
}
Architecture
Architecture Profiler
Split code into Basic Blocks
Inject loaded classes with additional instructions to monitor:
▪ Program flow (which Basic Blocks are accessed, and in what order?)
▪ Memory accessed by each Basic Block
▪ Which Basic Block is doing I/O?
Architecture Profiler
 0: iconst_0
 1: istore_1
 2: iconst_0
 3: istore_2
    write J
    write C
    visit B1
 4: goto 29
 7: invokestatic #13;
10: ldc2_w #19;
13: dcmpl
    read K
    write K
    visit B2
14: ifle 23
17: iinc 1, 1
    read C
    write C
    visit B3
20: goto 26
23: iinc 1, -1
    read C
    write C
    visit B4
26: iinc 2, 1
    read J
    write J
    visit B5
29: iload_2
30: bipush 12
    visit B6
    read J
32: if_icmplt 7
35: return
Example:
int C = 0;
for (int J = 0; J < 12; J++)
    if (Math.random() > 0.3) C++;
    else C--;
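What the injected `read`, `write` and `visit` probes could feed into is sketched below as a small in-memory repository. Everything here is hypothetical: the class and method names are invented for illustration, the block name is passed explicitly to each probe, and the loop is replayed by hand with caller-supplied branch outcomes instead of being driven by real instrumented bytecode.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class KnowledgeRepositorySketch {
    final List<String> flow = new ArrayList<>();               // blocks in visit order
    final Map<String, Set<String>> reads = new TreeMap<>();    // block -> variables read
    final Map<String, Set<String>> writes = new TreeMap<>();   // block -> variables written

    void visit(String block) { flow.add(block); }

    void read(String block, String variable) {
        reads.computeIfAbsent(block, k -> new TreeSet<>()).add(variable);
    }

    void write(String block, String variable) {
        writes.computeIfAbsent(block, k -> new TreeSet<>()).add(variable);
    }

    // Replays the instrumented loop; each flag says whether that iteration takes the C++ branch.
    static KnowledgeRepositorySketch replay(boolean... increments) {
        KnowledgeRepositorySketch repo = new KnowledgeRepositorySketch();
        repo.write("B1", "J"); repo.write("B1", "C"); repo.visit("B1");
        repo.visit("B6"); repo.read("B6", "J");
        for (boolean increment : increments) {
            repo.read("B2", "K"); repo.write("B2", "K"); repo.visit("B2");
            String branch = increment ? "B3" : "B4";
            repo.read(branch, "C"); repo.write(branch, "C"); repo.visit(branch);
            repo.read("B5", "J"); repo.write("B5", "J"); repo.visit("B5");
            repo.visit("B6"); repo.read("B6", "J");
        }
        return repo;
    }

    public static void main(String[] args) {
        KnowledgeRepositorySketch repo = replay(true, false);
        System.out.println(String.join(" ", repo.flow));
        System.out.println("reads  = " + repo.reads);
        System.out.println("writes = " + repo.writes);
    }
}
```

The recorded flow is the raw material for grouping blocks, and the per-block read and write sets show which blocks touch the same variables.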
Architecture Recompilation
Recompile the Java class bytecode into machine code
Replace and reload the class definition in memory
Architecture Code Execution
Run the profiled code
Collect flow & memory access information and store it in the knowledge repository
Architecture Builder
Analyze the knowledge repository information to determine:
▪ Which blocks can be grouped together
▪ Which groups of blocks can be parallelized
Architecture Builder
Program can be represented as a string (each character is a basic block).
Example:
for (Integer i = 0; i < DIMx; i++) {
    for (Integer j = 0; j < DIMx; j++) {
        for (Integer k = 0; k < DIMy; k++) {
            C[i][j] += A[i][k] * B[k][j];
        }
    }
}
abjbhcfefghcfefghijbhcfefghcfefghijk
ab(jb(hcfefg)2hi)2jk
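The relationship between the two strings can be checked mechanically. The sketch below expands the compact notation back into the flat block sequence, assuming that each block is a single non-digit character and that `(…)n` means "repeat the group n times". It only decodes the notation; how the Builder discovers the repeated patterns is not shown in the slides and is not attempted here.

```java
class BlockPatternSketch {
    private final String pattern;
    private int pos = 0;

    private BlockPatternSketch(String pattern) { this.pattern = pattern; }

    // Expands e.g. "a(bc)2" into "abcbc".
    static String expand(String pattern) {
        return new BlockPatternSketch(pattern).sequence();
    }

    // Reads blocks and groups until the end of input or an unmatched ')'.
    private String sequence() {
        StringBuilder out = new StringBuilder();
        while (pos < pattern.length() && pattern.charAt(pos) != ')') {
            char c = pattern.charAt(pos);
            if (c == '(') {
                pos++;                                   // skip '('
                String body = sequence();                // nested groups recurse
                pos++;                                   // skip ')'
                int start = pos;
                while (pos < pattern.length() && Character.isDigit(pattern.charAt(pos))) {
                    pos++;
                }
                int count = Integer.parseInt(pattern.substring(start, pos));
                for (int i = 0; i < count; i++) {
                    out.append(body);
                }
            } else {
                out.append(c);                           // a single basic block
                pos++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints abjbhcfefghcfefghijbhcfefghcfefghijk
        System.out.println(expand("ab(jb(hcfefg)2hi)2jk"));
    }
}
```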
Architecture Builder
ab(jb(hcfefg)2hi)2jk
Externalize common block patterns as methods
Generated methods may be nested
Reconstruct the program as a producer-consumer pattern
Collector
▪ Provides the Executor with suitable blocks as Tasks to execute, according to the flow up to that time
Executor
▪ Allocates core threads
▪ Assigns tasks to threads
▪ Requests more blocks from the Collector, based on program flow, after all threads complete
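The Collector/Executor split can be sketched with the standard `java.util.concurrent` thread pool. The `Collector` class, its `nextBatch` method and the fixed list of batches are hypothetical stand-ins: here a task only records its name, and the plan is supplied up front instead of being derived from program flow.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class CollectorExecutorSketch {
    // Hypothetical Collector: hands out the next batch of tasks in program-flow order.
    static final class Collector {
        private final Deque<List<String>> batches = new ArrayDeque<>();
        Collector(List<List<String>> plan) { batches.addAll(plan); }
        List<String> nextBatch() { return batches.pollFirst(); } // null when nothing is left
    }

    // Executor side: run one batch on the pool, wait for all of it, then ask for more.
    static List<String> run(Collector collector, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<String> log = Collections.synchronizedList(new ArrayList<>());
        try {
            List<String> batch;
            while ((batch = collector.nextBatch()) != null) {
                List<Callable<Void>> tasks = new ArrayList<>();
                for (String name : batch) {
                    tasks.add(() -> { log.add(name); return null; });
                }
                pool.invokeAll(tasks); // returns only after every task in the batch has completed
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
        return log;
    }

    static List<String> demo() {
        Collector collector = new Collector(Arrays.asList(
                Arrays.asList("a", "b"),   // first batch: may run in either order
                Arrays.asList("c")));      // second batch: starts only after the first finishes
        return run(collector, 2);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Tasks inside one batch may finish in any order, which is exactly the situation the transactional treatment on the next slide is meant to handle.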
Architecture Builder
Problems
▪ Threads may conflict when accessing the same variables
▪ Threads may finish out of normal order
▪ The Collector may generate invalid tasks
Let's represent each thread as a transaction
▪ When two transactions conflict, abort the one that has newer blocks relative to normal execution
▪ A transaction will not commit until the one preceding it in the timeline has finished
▪ A transaction times out if it is not reachable
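The two ordering rules can be sketched with a single counter, assuming each transaction carries a sequence number that reflects its position in normal (sequential) execution. The sequence numbers, the `victim` helper and the `tryCommit` gate are illustrative assumptions; timeouts and the actual rollback are left out.

```java
import java.util.concurrent.atomic.AtomicInteger;

class OrderedCommitSketch {
    // Sequence number of the next transaction that is allowed to commit.
    private final AtomicInteger nextToCommit = new AtomicInteger(0);

    // Conflict rule: the transaction that comes later in normal execution is aborted.
    static int victim(int seqA, int seqB) {
        return Math.max(seqA, seqB);
    }

    // Commit rule: succeeds only if every earlier transaction has already committed.
    boolean tryCommit(int seq) {
        return nextToCommit.compareAndSet(seq, seq + 1);
    }

    // Transaction 2 finishes first, has to wait, and commits after 0 and 1.
    static String demo() {
        OrderedCommitSketch gate = new OrderedCommitSketch();
        StringBuilder out = new StringBuilder();
        int[] finishOrder = {2, 0, 1, 2};
        for (int seq : finishOrder) {
            out.append(seq).append(gate.tryCommit(seq) ? ":commit " : ":wait ");
        }
        return out.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(demo());       // prints 2:wait 0:commit 1:commit 2:commit
        System.out.println(victim(3, 5)); // prints 5
    }
}
```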
Architecture Code Execution – revisit
Collect which transactions conflict, and the commit rate
We can refine the constructed program
The Builder reorganizes the generated blocks and recompiles the code again
Ongoing & Future Work
Complete the implementation of HydraVM
Profile by monitoring memory instead of generating new instructions
Automatically use Java NIO to handle I/O operations and generate callbacks to process them
Use thread scheduling techniques instead of TM
Formally verify that reconstructed programs match the desired semantics