1
Fast and Efficient Partial Code Reordering
Xianglong Huang (UT Austin, Adverplex), Stephen M. Blackburn (Intel), David Grove (IBM), Kathryn McKinley (UT Austin)
2
Software Trends
• By 2008, 80% of software will be written in Java or C#. [Gartner report]
• Java and C# are coming to your OS soon: JNode, Singularity
• Advantages of modern programming languages:
  – Productivity, security, reliability…
• Performance?
3
Hardware Trends
[Figure: CPU vs. DRAM performance, 1980 to 2005. CPU performance doubles every 1.5 years; DRAM performance doubles every 10 years. The processor-memory performance gap grows about 50% per year; caches bridge it.]
4
Fully Associative IL1
[Figure: Improvement over the base case with a fully associative IL1, per benchmark (antlr, bloat, fop, jython, pmd, xalan, _201_compress, _202_jess, _209_db, _213_javac, _222_mpegaudio, _228_jack, pseudojbb) and geometric mean. Improvements range from about -2% to 18%, showing the improvement potential.]
Base case: Jikes RVM default with a separate code space. Cache configuration: 32K IL1 direct-mapped, 512K L2 (small programs on a big cache).
5
New and Better Opportunities
• Virtual machine monitors application behavior at runtime
• Dynamic recompilation
  – With dynamic feedback
  – Allocates instructions at runtime
6
Previous Work on Instruction Locality
• Static schemes
  – Profile-guided call correlation: reorder code at compile and link time [Pettis and Hansen 90]
  – Cache coloring [Hashemi et al. 97]
  – Profile-guided procedure interleaving [Gloy et al. 99]
  – Static schemes are not flexible
• Dynamic scheme
  – JIT code reordering [Chen et al. 97]
  – Used as our base case
7
Optimizations in Virtual Machine
• Static instruction allocation used at runtime
  – e.g., just-in-time compilation
  – Invocation order
[Diagram: Runtime with Compiler and Memory Manager; static optimizations.]
8
Optimizations in Virtual Machine
• Dynamic instruction allocation/reordering adapts to program behavior with low overhead
[Diagram: Runtime with Compiler and Memory Manager; static optimizations.]
9
Opportunity for Instruction Locality
• Dynamic detection of hot methods, hot basic blocks
• Dynamic recompilation relocates methods at runtime
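The bullets above can be sketched in code. Below is a minimal, hypothetical model of sampling-based hot-method detection in the spirit of an adaptive optimization system; the class name, threshold, and API are illustrative assumptions, not the actual Jikes RVM interfaces.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: a timer-based sampler counts which method is on
// top of the stack at each tick; methods whose count crosses a threshold
// are declared hot and handed to the optimizing compiler.
public class AdaptiveSampler {
    private final Map<String, Integer> sampleCounts = new HashMap<>();
    private final int hotThreshold;  // illustrative tuning parameter

    public AdaptiveSampler(int hotThreshold) {
        this.hotThreshold = hotThreshold;
    }

    /** Called on each timer tick with the currently executing method. */
    public void sample(String methodName) {
        sampleCounts.merge(methodName, 1, Integer::sum);
    }

    /** A method is "hot" once its sample count reaches the threshold. */
    public boolean isHot(String methodName) {
        return sampleCounts.getOrDefault(methodName, 0) >= hotThreshold;
    }

    public static void main(String[] args) {
        AdaptiveSampler sampler = new AdaptiveSampler(3);
        for (int i = 0; i < 5; i++) sampler.sample("Foo.bar");
        sampler.sample("Foo.baz");
        System.out.println(sampler.isHot("Foo.bar")); // true: 5 >= 3
        System.out.println(sampler.isHot("Foo.baz")); // false: 1 < 3
    }
}
```

Sampling keeps profiling overhead low because only a tiny fraction of executed instructions trigger any bookkeeping, which is what makes runtime relocation of methods affordable.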
10
PCR Optimizations
• Reduce instruction capacity misses
  – Code space
  – Method separation
  – Code splitting
• Reduce instruction conflict misses
  – Code padding
11
PCR System
[Diagram: PCR system. Source code → Baseline Compiler → baseline methods in the executing code; the Adaptive Sampler identifies hot methods, which the Optimizing Compiler recompiles into optimized methods. Legend: Jikes RVM component, input/output, data.]
12
PCR System: Method Separation
[Diagram: Method separation. Hot methods (optimized code) and cold methods (baseline code) are placed in separate code regions, apart from data.]
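A minimal sketch of the separation idea: optimized (hot) and baseline (cold) code are bump-allocated into disjoint regions so hot code stays contiguous. The class and region model are illustrative assumptions, not the Jikes RVM allocator.

```java
// Hypothetical sketch of method separation: two bump-pointer code
// regions, one for optimized (hot) methods and one for baseline (cold)
// methods, so hot code packs densely into few cache lines.
public class CodeSpaces {
    private int hotCursor = 0;   // next free offset in the hot region
    private int coldCursor = 0;  // next free offset in the cold region

    /** Place a method's code; returns its offset within its region. */
    public int allocate(int codeSize, boolean optimized) {
        if (optimized) {
            int at = hotCursor;
            hotCursor += codeSize;
            return at;
        } else {
            int at = coldCursor;
            coldCursor += codeSize;
            return at;
        }
    }

    public int hotBytes()  { return hotCursor; }
    public int coldBytes() { return coldCursor; }
}
```

Because only hot methods occupy the hot region, the working set of frequently executed code covers fewer cache lines than when hot and cold methods are interleaved.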
13
PCR System: Code Splitting
• An online edge profile identifies hot basic blocks in a method
• Code reordering moves hot basic blocks to the beginning of the method
• Code splitting separates hot and cold basic blocks in the heap
[Diagram: Method A split into hot and cold basic blocks.]
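The steps above can be sketched as a simple partition over profiled blocks. The `Block` record and threshold are illustrative assumptions; a real splitter works on compiler IR, not strings.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of edge-profile-driven code splitting: blocks whose
// execution count reaches a threshold are emitted first (the hot region);
// the rest move to a separate cold region.
public class CodeSplitter {
    record Block(String name, long execCount) {}

    /**
     * Returns blocks reordered hot-first. Relative order is preserved
     * within each partition so likely fall-through chains stay intact.
     */
    static List<Block> split(List<Block> blocks, long hotThreshold) {
        List<Block> hot = new ArrayList<>();
        List<Block> cold = new ArrayList<>();
        for (Block b : blocks) {
            (b.execCount() >= hotThreshold ? hot : cold).add(b);
        }
        hot.addAll(cold);  // hot blocks first, cold blocks after
        return hot;
    }
}
```

Placing the hot blocks contiguously means the common paths through a method touch fewer instruction cache lines, even though the cold blocks still exist elsewhere.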
14
PCR System: Code Splitting
[Diagram: After splitting, hot basic blocks of hot methods (optimized code) are grouped in the hot code region; cold basic blocks and cold methods (baseline code) go to the cold region, separate from data.]
15
PCR Optimizations
• Reduce instruction capacity misses
  – Code space
  – Method separation
  – Code splitting
• Reduce instruction conflict misses
  – Code padding
16
PCR System: Code Padding
[Diagram: Code padding. Source code → Baseline Compiler → binary code; the Adaptive Sampler and a dynamic call graph identify hot methods for the Optimizing Compiler, which pads code placement. Legend: Jikes RVM component, input/output.]
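Why padding helps: in a direct-mapped cache the set index is fully determined by the address, so two hot methods whose code maps to the same sets evict each other on every alternation. The sketch below is a hypothetical illustration; the 32K size matches the talk's IL1 configuration, but the 64-byte line size is an assumption.

```java
// Hypothetical sketch of code padding against a direct-mapped I-cache.
// Set index = (address / lineSize) % numSets, so shifting a method's
// start address by whole cache lines moves it to different sets.
public class CodePadding {
    static final int LINE = 64;                  // assumed line size
    static final int SETS = 32 * 1024 / LINE;    // 512 sets, 32K direct-mapped

    static int setIndex(int address) {
        return (address / LINE) % SETS;
    }

    /** Pad 'start' forward one line at a time until its first line no
     *  longer maps to the set occupied by a conflicting hot method. */
    static int padToAvoid(int start, int avoidSet) {
        while (setIndex(start) == avoidSet) {
            start += LINE;
        }
        return start;
    }
}
```

The dynamic call graph tells the compiler which method pairs actually alternate at runtime, so padding is spent only where a conflict would hurt.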
18
Methodology
• Java virtual machine: Jikes RVM
• Various architectures
  – x86 (Pentium 4)
  – PowerPC
  – Simulator: Dynamic SimpleScalar
• Use a direct-mapped I-cache
  – Shorter latency
  – More conflict misses
21
Impact of Code Padding

[Figure: Improvement over the code-space configuration for code splitting, code padding + splitting, and a fully associative IL1. Improvements range from about -10% to 20%.]

Base case: Jikes RVM default + a separate code space. Cache configuration: 32K IL1 direct-mapped, 512K L2.
22
Conclusion
• A separate code space improves program performance by 6% on average (up to 30%) on the Pentium 4
• PCR has negligible overhead
• PCR shows no clear performance improvement
  – On the Pentium 4, no improvement on average
  – In simulation:
    • PCR improves one program by 14%
    • Not consistent; no improvement on average
• Potential opportunities remain for dynamic optimizations
24
Cache: Small vs. Large
       IL1                      DL1                      L2
Size   Assoc   Latency   Size   Assoc   Latency   Size   Assoc   Latency
8K     1       2         8K     2       2         128K   2       5
16K    1       2         16K    2       3         256K   2       8
64K    1       4         64K    2       4         512K   2       10
Cacti, 90nm technology, 3GHz frequency
25
Cache-Size Comparison
[Figure: Two panels plotting total cycles (in billions, 0 to 50) for _213_javac and _202_jess across cache configurations.]
26
Direct-Mapped vs. Two-Way

[Figure: Cycles (x10^6, 0 to 8) comparing a direct-mapped IL1 with 2-cycle hit latency against a two-way IL1 with 3-cycle hit latency.]

Cacti, 90nm technology, 3GHz
27
Improving Performance
• Classic optimizations are not sufficient!
• Different programming styles
  – Automatic memory management
  – Pointer-based data structures
  – Many small methods
• Optimization costs are incurred at runtime
• The virtual machine (VM) adds complexity
  – Class loading, memory management, just-in-time compiler…
28
Instruction Locality
• Do instructions have better locality?
  – More instruction accesses
  – About the same number of data cache misses
• Penalty in a pipelined processor
  – Instruction cache misses create bubbles in the pipeline
• Instruction locality can be more critical