HiDISC: A Decoupled Architecture for Applications in Data Intensive Computing PIs: Alvin M. Despain...
-
date post
20-Dec-2015 -
Category
Documents
-
view
219 -
download
0
Transcript of HiDISC: A Decoupled Architecture for Applications in Data Intensive Computing PIs: Alvin M. Despain...
HiDISC: A Decoupled Architecture for Applications in Data Intensive Computing
PIs: Alvin M. Despain and Jean-Luc Gaudiot
USCUNIVERSITY
OF SOUTHERN
CALIFORNIA
University of Southern Californiahttp://www-pdpc.usc.edu
October 12th, 2000
2USC Parallel and Distributed Processing Center
HiDISC: Hierarchical Decoupled Instruction Set
ComputerNew Ideas• A dedicated processor for each level of the memory hierarchy
• Explicitly manage each level of the memory hierarchy using instructions generated by the compiler
• Hide memory latency by converting data access predictability to data access locality
• Exploit instruction-level parallelism without extensive scheduling hardware
• Zero overhead prefetches for maximal computation throughput
Impact• 2x speedup for scientific benchmarks with large data sets over an in-order superscalar processor
• 7.4x speedup for matrix multiply over an in-order issue superscalar processor
• 2.6x speedup for matrix decomposition/substitution over an in-order issue superscalar processor
• Reduced memory latency for systems that have high memory bandwidths (e.g. PIMs, RAMBUS)
• Allows the compiler to solve indexing functions for irregular applications
• Reduced system cost for high-throughput scientific codes
Schedule
April 98Start
April 99 April 00
• Defined benchmarks• Completed simulator• Performed instruction-level simulations on hand-compiled benchmarks
• Continue simulations of more benchmarks (SAR)•Define HiDISC architecture•Benchmark result
• Develop and test a full decoupling compiler• Update Simulator•Generate performance statistics and evaluate design
Dynamic Database
Memory
Cache
Registers
Application(FLIR SAR VIDEO ATR /SLD Scientific )Application
(FLIR SAR VIDEO ATR /SLD Scientific )
Processor
Decoupling CompilerDecoupling Compiler
HiDISC Processor
SensorInputs
Processor Processor
Situational Awareness
3USC Parallel and Distributed Processing Center
HiDISC: Hierarchical Decoupled Instruction Set Computer
Dynamic Database
Memory
Cache
Registers
Application(FLIR SAR VIDEO ATR /SLD Scientific )Application
(FLIR SAR VIDEO ATR /SLD Scientific )
Processor
Decoupling CompilerDecoupling Compiler
HiDISC Processor
SensorInputs
Processor Processor
Situational Awareness
Technological Trend: Memory latency is getting longer relative to microprocessor speed (40% per year)
Problem: Some SPEC benchmarks spend more than half of their time stalling [Lebeck and Wood 1994]
Domain: benchmarks with large data sets: symbolic, signal processing and scientific programs
Present Solutions: Multithreading (Homogenous), Larger Caches, Prefetching, Software Multithreading
4USC Parallel and Distributed Processing Center
Present Solutions
SolutionLarger Caches
Hardware Prefetching
Software Prefetching
Multithreading
Limitations— Slow
— Works well only if working set fits cache and there is temporal locality.
— Cannot be tailored for each application
— Behavior based on past and present execution-time behavior
— Ensure overheads of prefetching do not outweigh the benefits > conservative prefetching
— Adaptive software prefetching is required to change prefetch distance during run-time
— Hard to insert prefetches for irregular access patterns
— Solves the throughput problem, not the memory latency problem
5USC Parallel and Distributed Processing Center
Observation:
• Software prefetching impacts compute performance
• PIMs and RAMBUS offer a high-bandwidth memory system - useful for speculative prefetching
The HiDISC Approach
Approach:
• Add a processor to manage prefetching -> hide overhead
• Compiler explicitly manages the memory hierarchy
• Prefetch distance adapts to the program runtime behavior
6USC Parallel and Distributed Processing Center
Cache
What is HiDISC?
A dedicated processor for each level of the memory hierarchy
Explicitly manage each level of the memory hierarchy using instructions generated by the compiler
Hide memory latency by converting data access predictability to data access locality (Just in Time Fetch)
Exploit instruction-level parallelism without extensive scheduling hardware
Zero overhead prefetches for maximal computation throughput
Access Instructions
ComputationInstructions
Access Processor(AP)
Access Processor(AP)
ComputationProcessor (CP)
ComputationProcessor (CP)
Registers
Cache Mgmt. Processor
(CMP)
Cache Mgmt. Processor
(CMP)
Cache Mgmt.Instructions
CompilerProgram
2nd-Level Cache and Main Memory
7USC Parallel and Distributed Processing Center
MIPS DEAP CAPP HiDISC(Conventional) (Decoupled) (New Decoupled)
2nd-Level Cache and Main Memory
Registers
5-issue
Cache
Access Processor(AP) - (3-issue)
Access Processor(AP) - (3-issue)
2-issue
Cache Mgmt. Processor (CMP)
Registers
Access Processor(AP) - (5-issue)
Access Processor(AP) - (5-issue)
ComputationProcessor (CP)Computation
Processor (CP)
3-issue
2nd-Level Cache and Main Memory
Registers
8-issue
Registers
3-issue 3-issueCache Mgmt. Processor (CMP)
DEAP: [Kurian, Hulina, & Caraor ‘94]
PIPE: [Goodman ‘85]
Other Decoupled Processors: ACRI, ZS-1, WA
Cache
2nd-Level Cache and Main Memory
2nd-Level Cache and Main Memory
2nd-Level Cache and Main Memory
ComputationProcessor (CP)Computation
Processor (CP)
ComputationProcessor (CP)Computation
Processor (CP)
ComputationProcessor (CP)Computation
Processor (CP)
CacheCache
Decoupled Architectures
8USC Parallel and Distributed Processing Center
Slip Control Queue The Slip Control Queue (SCQ) adapts
dynamically
• Late prefetches = prefetched data arrived after load had been issued
• Useful prefetches = prefetched data arrived before load had been issued
if (prefetch_buffer_full ())Don’t change size of SCQ;
else if ((2*late_prefetches) > useful_prefetches)
Increase size of SCQ;else
Decrease size of SCQ;
9USC Parallel and Distributed Processing Center
Decoupling Programs for HiDISC
(Discrete Convolution - Inner Loop)
for (j = 0; j < i; ++j)y[i]=y[i]+(x[j]*h[i-j-1]);
while (not EOD)y = y + (x * h);
send y to SDQ
for (j = 0; j < i; ++j) {load (x[j]);load (h[i-j-1]);GET_SCQ;
}send (EOD token)send address of y[i] to SAQ
for (j = 0; j < i; ++j) {prefetch (x[j]);prefetch (h[i-j-1];PUT_SCQ;
}
Inner Loop Convolution
Computation Processor Code
Access Processor Code
Cache Management Code
SAQ: Store Address QueueSDQ: Store Data QueueSCQ: Slip Control QueueEOD: End of Data
10USC Parallel and Distributed Processing Center
BenchmarksBenchmarks Source of
Benchmark Lines of Source Code
Description Data Set Size
LLL1 Livermore Loops [45]
20 1024-element arrays, 100 iterations
24 KB
LLL2 Livermore Loops
24 1024-element arrays, 100 iterations
16 KB
LLL3 Livermore Loops
18 1024-element arrays, 100 iterations
16 KB
LLL4 Livermore Loops
25 1024-element arrays, 100 iterations
16 KB
LLL5 Livermore Loops
17 1024-element arrays, 100 iterations
24 KB
Tomcatv SPECfp95 [68] 190 33x33-element matrices, 5 iterations
<64 KB
MXM NAS kernels [5] 113 Unrolled matrix multiply, 2 iterations
448 KB
CHOLSKY NAS kernels 156 Cholsky matrix decomposition
724 KB
VPENTA NAS kernels 199 Invert three pentadiagonals simultaneously
128 KB
Qsort Quicksort sorting algorithm [14]
58 Quicksort 128 KB
11USC Parallel and Distributed Processing Center
Parameter Value Parameter Value
L1 cache size 4 KB L2 cache size 16 KB
L1 cache associativity 2 L2 cache associativity 2
L1 cache block size 32 B L2 cache block size 32 B
Memory Latency Variable, (0-200 cycles)
Memory contention time
Variable
Victim cache size 32 entries Prefetch buffer size 8 entries
Load queue size 128 Store address queue size
128
Store data queue size 128 Total issue width 8
Simulation
12USC Parallel and Distributed Processing Center
Simulation Results
0
1
2
3
4
5
0 40 80 120 160 200Main Memory Latency
LLL3
MIPSDEAPCAPP
HiDISC
0
0.5
1
1.5
2
2.5
3
0 40 80 120 160 200Main Memory Latency
Tomcatv
MIPSDEAPCAPP
HiDISC
02468
10121416
0 40 80 120 160 200Main Memory Latency
Cholsky
MIPSDEAPCAPP
HiDISC
0
2
4
6
8
10
12
0 40 80 120 160 200Main Memory Latency
Vpenta
MIPSDEAP
CAPP
HiDISC
13USC Parallel and Distributed Processing Center
Accomplishments
2x speedup for scientific benchmarks with large data sets over an in-order superscalar processor
7.4x speedup for matrix multiply (MXM) over an in-order issue superscalar processor - (similar operations are used in ATR/SLD)
2.6x speedup for matrix decomposition/substitution (Cholsky) over an in-order issue superscalar processor
Reduced memory latency for systems that have high memory bandwidths (e.g. PIMs, RAMBUS)
Allows the compiler to solve indexing functions for irregular applications
Reduced system cost for high-throughput scientific codes
14USC Parallel and Distributed Processing Center
Work in Progress
• Compiler design• Data Intensive Systems (DIS) benchmarks
analysis • Simulator update• Parameterization of silicon space for VLSI
implementation
15USC Parallel and Distributed Processing Center
Compiler Requirements
Source language flexibility Sequential assembly code for
streaming• Ease of implementation• Optimality of sequential code• Source level language flexibility• Portability
Ease of implementation Portability and upgradability
16USC Parallel and Distributed Processing Center
Gcc-2.95 Features
Localized register spilling, global common sub expression elimination using lazy code motion algorithms
There is also an enhancement made in the control flow graph analysis.The new framework simplifies control dependence analysis, which is used by aggressive dead code elimination algorithms+ Provision to add modules for instruction scheduling and delayed branch execution+ Front-ends for C, C++ and Fortran available+ Support for different environments and platforms + Cross compilation
17USC Parallel and Distributed Processing Center
Compiler Organization
HiDISC Compilation Overview
GCCGCC
Source Program
Assembly Code
Stream Separator
Computational Assembly CodeComputational Assembly Code
Access Assembly Code
Access Assembly Code
Cache Management
Assembly Code
Computation Assembly CodeComputation
Assembly CodeAccess Assembly
Code
Cache Management Object Code
Cache Management Object Code
AssemblerAssembler Assembler
18USC Parallel and Distributed Processing Center
HiDISC Stream SeparatorSequential Source Sequential Source
Program Flow GraphProgram Flow Graph
Classify Address RegistersClassify Address Registers
Allocate Instruction to streamsAllocate Instruction to streams
Access Stream
Access Stream
Computation Stream
Computation Stream
Fix Conditional StatementsFix Conditional Statements
Move Queue Access into InstructionsMove Queue Access into Instructions
Move Loop Invariants out of the loopMove Loop Invariants out of the loop
Add Slip Control Queue InstructionsAdd Slip Control Queue Instructions
Substitute Prefetches for Loads, Remove
global Stores, andReverse SCQ Direction
Substitute Prefetches for Loads, Remove
global Stores, andReverse SCQ Direction
Add global data Communication and SynchronizationAdd global data Communication and Synchronization
Produce Assembly codeProduce Assembly code
ComputationAssembly CodeComputation
Assembly Code
AccessAssembly Code
AccessAssembly Code
Cache ManagementAssembly Code
Cache ManagementAssembly Code
Current Work
Future Work
19USC Parallel and Distributed Processing Center
Compiler Front End Optimizations
Jump Optimization: simplify jumps to the following instruction, jumps across jumps and jumps to jumps
Jump Threading: detect a conditional jump that branches to an identical or inverse test
Delayed Branch Execution: find instructions that can go into the delay slots of other instructions
Constant Propagation: Propagate constants into a conditional loop
20USC Parallel and Distributed Processing Center
Compiler Front End Optimizations (contd.)
Instruction Combination: combine groups of two or three instructions that are related by data
flow into a single instruction Instruction Scheduling: looks for instructions
whose output will not be available by the time that it is used in subsequent instructions
Loop Optimizations: move constant expressions out of loops, and do strength-reduction
21USC Parallel and Distributed Processing Center
Example of Stressmarks
Pointer Stressmark• Basic idea: repeatedly follow pointers to randomized
locations in memory• Memory access pattern is unpredictable• Randomized memory access pattern:
– Not sufficient temporal and spatial locality for conventional cache architectures
• HiDISC architecture provides lower memory access latency
22USC Parallel and Distributed Processing Center
Decoupling of Pointer Stressmarks
for (i=j+1;i<w;i++) { if (field[index+i] > partition) balance++;}if (balance+high == w/2) break;else if (balance+high > w/2) { min = partition;}else { max = partition; high++;}
while (not EOD)if (field > partition) balance++; if (balance+high == w/2) break;else if (balance+high > w/2) { min = partition;}else { max = partition; high++;}
for (i=j+1; i<w; i++) {load (field[index+i]);GET_SCQ;
}send (EOD token)
for (i=j+1; i<w; i++) {prefetch (field[index+i]);PUT_SCQ;
}
Inner loop for the next indexing
Computation Processor Code
Access Processor Code
Cache Management Code
23USC Parallel and Distributed Processing Center
Stressmarks
Hand-compile the 7 individual benchmarks• Use gcc as front-end• Manually partition each of the three instruction
streams and insert synchronizing instructions Evaluate architectural trade-offs
• Updated simulator characteristics such as out-of-order issue
• Large L2 cache and enhanced main memory system such as Rambus and DDR
24USC Parallel and Distributed Processing Center
Simulator Update
Survey the current processor architecture• Focus on commercial leading edge technology
for implementation
Analyze the current simulator and previous benchmark results• Enhance memory hierarchy configurations• Add Out-of-Order issue
25USC Parallel and Distributed Processing Center
Memory Hierarchy
Current modern processors have increasingly large L2 on-chip caches• E.g., 256 KB L-2 cache on Pentium and Athlon
processor• reduces L1 cache miss penalty
Also, development of new mechanisms in the architecture of the main memory (e.g., RAMBUS) reduces the L2 cache miss penalty
26USC Parallel and Distributed Processing Center
Out-of-Order multiple issue
Most of the current advanced processors are based on the Superscalar and Multiple Issue paradigm. • MIPS-R10000, Power-PC, Ultra-Sparc, Alpha
and Pentium family Compare HiDISC architecture and modern
superscalar processors• Out-of-Order instruction issue• For precision exception handling, include in-
order completion New access decoupling paradigm for out-of-order
issue
27USC Parallel and Distributed Processing Center
HiDISC with Modern DRAM Architecture
RAMBUS and DDR DRAM improve the memory bandwidth• Latency does not improve significantly
Decoupled access processor can fully utilize the enhanced memory bandwidth• More requests caused by access processor • Pre-fetching mechanism hide memory access
latency
28USC Parallel and Distributed Processing Center
HiDISC / SMT
Reduced memory latency of HiDISC can • decrease the number of threads for SMT
architecture• relieve memory burden of SMT architecture• lessen complex issue logic of multithreading
The functional unit utilization can increase with multithreading features on HiDISC• More instruction level parallelism is possible
29USC Parallel and Distributed Processing Center
The McDISC System: Memory-Centered Distributed Instruction Set Computer
Registers
Cache
Access
Main Memory
Processor (AP)
ComputationProcessor (CP)
Cache ManagementProcessor (CMP)
Program CompilerAccess Instructions
Computation Instructions
Cache ManagementInstructions
DiscProcessor (DP)
Adaptive SignalPIM (ASP)
Adaptive GraphicsPIM (AGP)
Network InterfaceProcessor (NIP)
Disc Cache
SituationAwareness
SensorInputs
SAR Video
DynamicDatabase
X Y Z
3-D Torusof Pipelined Rings
RAID
Understanding
Inference Analysis
Decision ProcessTargeting
NetworkManagement
Register Linksto CP Neighbors
Instructions
FLIR SAR VIDEO ESS SES
SensorData
Disc Farm
to Displaysand Network
30USC Parallel and Distributed Processing Center
Summary
Designing a compiler• Porting gcc to HiDISC
Benchmark simulation with new parameters and updated simulator
Analysis of architectural trade-offs for equal silicon area
Hand-compilation of Stressmarks suites and simulation
DIS benchmarks simulation