Amalgam: a Reconfigurable Processor for Future Fabrication Processes
Nicholas P. Carter
University of Illinois at Urbana-Champaign
Performance = f(architecture, implementation)
[Figure: 1-D IDCT computations laid out over time, drawn as sequences of LD, ADD, MUL, and ST operations]
Efficient Implementation
• Everything you give up in clock rate you have to make back in architectural efficiency
• Wire delay is the big limiting factor in system architectures today
– Wires get slower relative to transistors as fab. process improves
• Programmable processors moving to deeper pipelines
– Not good enough to just prevent wires from making reconf. logic slower
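The first bullet can be read as a simple product: throughput is work per cycle times cycles per second. The sketch below is only an illustration of that trade-off; the function names and the ops/cycle model are assumptions made here, not something the slides define.

```python
def throughput(ops_per_cycle: float, clock_hz: float) -> float:
    """Operations per second under a simple ops/cycle * cycles/second model.

    This model is an assumption for illustration; it ignores memory stalls.
    """
    return ops_per_cycle * clock_hz


def required_efficiency_gain(clock_ratio: float) -> float:
    """Factor by which ops/cycle must grow to break even when the clock
    runs at `clock_ratio` (0 < ratio <= 1) of the baseline clock rate."""
    if not 0.0 < clock_ratio <= 1.0:
        raise ValueError("clock_ratio must be in (0, 1]")
    return 1.0 / clock_ratio
```

For instance, under this model a unit clocked at one quarter of the baseline rate has to do four times as much work per cycle just to match the baseline.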
Amalgam
[Figure: block diagram showing DRAM, a multi-banked cache, a network, four PClusters, and four RClusters]
Reconfigurable Cluster Design
• 4 Register banks
– 8 registers/bank
• 4 Reconfigurable logic segments
– 8 Rows x 32 LBs per segment
• Array control unit
• Network interface
• Counter-clockwise flow of computation through cluster
[Figure: cluster floorplan showing the network interface, the ACU, four segments, and four banks]
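The resource counts on this slide multiply out to per-cluster totals. The sketch below does that arithmetic. The class and field names are hypothetical, and it assumes "8 Rows x 32 LBs" means 8 rows of 32 logic blocks each.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RClusterConfig:
    """Per-cluster resource counts taken from the slide (names are hypothetical)."""
    register_banks: int = 4
    registers_per_bank: int = 8
    segments: int = 4
    rows_per_segment: int = 8
    lbs_per_row: int = 32  # assumption: 32 logic blocks in each row

    @property
    def total_registers(self) -> int:
        return self.register_banks * self.registers_per_bank

    @property
    def lbs_per_segment(self) -> int:
        return self.rows_per_segment * self.lbs_per_row

    @property
    def total_logic_blocks(self) -> int:
        return self.segments * self.lbs_per_segment


cfg = RClusterConfig()
print(cfg.total_registers, cfg.lbs_per_segment, cfg.total_logic_blocks)
```

Under that reading, one cluster holds 32 registers and 1024 logic blocks (256 per segment).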
Reconfigurable Clock Rates
[Figure: chart with x-axis "Fabrication Process (nm)" at 180, 130, 90, 65, 45, 32, and 22, and a y-axis running from 0 to 14000 (units not recoverable); series: Programmable Cluster, IDCT, DNA, Rijndael]
Unpipelined Critical Path
• Latches in logic blocks only resource for pipelining
• Vertical and horizontal wires carry data between logic blocks
– Wires have heavy loads, making them slower than their length would indicate
• Effect on clock rate varies significantly with fabrication process
[Figure: critical path between two logic blocks (LB) with flip-flops (FF), crossing horizontal and vertical wires (HWIRE, VWIRE) and a register bank]
Supporting Pipelining
• Goal: make logic block delay the limiting factor on clock rate
• Add configurable latches at each wire intersection
– Problem: different paths may have different latencies
• Add retiming buffers at logic block inputs/outputs
• Add network queues to reduce synchronization overhead
Pipelined Critical Path
• Delay of individual wires < logic block delay in all processes studied
• Add configurable pipeline latches at junctions between wires
• Pipeline latches also added on carry chains within rows
[Figure: critical path between two logic blocks (LB) across HWIRE and VWIRE segments and a register bank, with additional flip-flops (FF) along the path]
Retiming Buffers
• 5-deep chain of latches added to each logic block input
– Similar structure added to LB output
• Can “borrow” up to two cycles of additional delay from adjacent input
• Total pipeline register overhead = 17%
[Figure: retiming buffer drawn as chains of flip-flops (FF)]
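Retiming buffers exist because pipelined paths of different lengths deliver operands to a logic block in different cycles. The sketch below shows one way the per-input delay could be chosen: hold each early input until the latest one arrives. The function name and the arrival-cycle representation are assumptions, and the two-cycle "borrow" from an adjacent input is not modeled.

```python
from typing import List

BUFFER_DEPTH = 5  # from the slide: 5-deep chain of latches per logic block input


def retiming_delays(arrival_cycles: List[int], depth: int = BUFFER_DEPTH) -> List[int]:
    """Return the extra cycles each input must be held so that all inputs
    of one logic block line up with the latest-arriving input.

    Raises ValueError if any input would need more delay than the buffer
    depth. Borrowing cycles from an adjacent input is not modeled here.
    """
    if not arrival_cycles:
        return []
    latest = max(arrival_cycles)
    delays = [latest - arrival for arrival in arrival_cycles]
    if any(delay > depth for delay in delays):
        raise ValueError("path latencies differ by more than the buffer depth")
    return delays
```

With arrival cycles of 3, 5, and 1, the inputs would be held for 2, 0, and 4 cycles respectively.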
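Register Queues
[Figure: two panels. First panel, labeled "Original Architecture": WRITE R8, Val1 and WRITE R8, Val2 sent over the network to a register file, along with a sync. message. Second panel (also labeled "Original Architecture" in the source): WRITE R8, Val1 and WRITE R8, Val2 sent back to back over the network to a register queue and register file, with an EMPTY R8 operation.]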
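The slides do not spell out the queue's semantics, so the following is only a guess at the behavior the figure suggests: writes to the same register are buffered in order instead of waiting for a synchronization message, and an EMPTY operation discards the current value so the next one becomes visible. The class name, method names, and the capacity are hypothetical.

```python
from collections import deque


class RegisterQueue:
    """Toy model of a queued register (assumed semantics, not from the slides)."""

    def __init__(self, capacity: int = 4):  # capacity is a made-up value
        self._values = deque()
        self.capacity = capacity

    def write(self, value) -> bool:
        """Buffer a value; return False if the queue is full and the sender must retry."""
        if len(self._values) >= self.capacity:
            return False
        self._values.append(value)
        return True

    def read(self):
        """Return the oldest buffered value without removing it."""
        if not self._values:
            raise LookupError("register is empty")
        return self._values[0]

    def empty(self) -> None:
        """Discard the oldest value so the next write becomes visible."""
        if not self._values:
            raise LookupError("register is empty")
        self._values.popleft()


r8 = RegisterQueue()
r8.write("Val1")
r8.write("Val2")   # no synchronization needed before the second write
first = r8.read()  # consumer sees Val1
r8.empty()         # EMPTY R8: advance to the next value
second = r8.read() # consumer now sees Val2
```

In this model the sender can issue both writes without waiting, which is the decoupling the later slides rely on.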
Implementing Pipelined Apps.
• Logical vs. Physical pipelining
– Logical: Program-visible, uses array and registers
– Physical: Only visible to ACU, uses pipeline registers on wires, retiming buffers
• Take advantage of decoupling provided by queues
• Applications use same reconfigurable logic configurations in different fab. processes
– Only FSM in ACU changes
– Applications to portability, managing intra-die variation
Experimental Methodology
• Programs simulated using Amalsim
– Set each cluster’s clock rate independently
• Benchmarks: IDCT, Rijndael, DNA comparison
– Fine-grained version of each benchmark does one computation
– Medium-grained version performs four independent computations
• Programmable cluster clock rates based on ITRS
– Limit stages to 7 FO4 delay, slightly more aggressive than ITRS
• Logic block latencies, wire lengths taken from circuit-level design of reconf. cluster in 180nm CMOS
– Convert logic block delay to FO4, scale by FO4 delay of each fabrication process
– Scale wire length based on fabrication process, simulate wire delay in SPICE
– Pipeline such that reconf. cluster cycle time is determined by logic block delay
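The FO4-based scaling in the methodology can be sketched as below. The slides do not give FO4 delays per process, so this uses the common rule of thumb that an FO4 delay is roughly 360 ps per micron of gate length; that constant, the linear model, and all function names are assumptions, not values from the study.

```python
FO4_PS_PER_NM = 0.36  # assumption: rule of thumb, about 360 ps per micron of gate length


def fo4_delay_ps(process_nm: float) -> float:
    """Approximate FO4 inverter delay in picoseconds for a given process."""
    return FO4_PS_PER_NM * process_nm


def programmable_clock_ghz(process_nm: float, fo4_per_stage: float = 7.0) -> float:
    """Clock rate of a pipeline limited to `fo4_per_stage` FO4 delays per stage."""
    cycle_ps = fo4_per_stage * fo4_delay_ps(process_nm)
    return 1000.0 / cycle_ps


def scale_logic_delay_ps(delay_ps: float, ref_nm: float, target_nm: float) -> float:
    """Express a delay measured at `ref_nm` in FO4 units, then convert it
    back to picoseconds using the FO4 delay of `target_nm`."""
    delay_in_fo4 = delay_ps / fo4_delay_ps(ref_nm)
    return delay_in_fo4 * fo4_delay_ps(target_nm)
```

Only logic block delay is scaled this way; per the slide, wire delay was obtained separately by scaling wire lengths and simulating in SPICE, which this sketch does not attempt.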
Fine-Grained Benchmark Perf.
• Reconfigurable version maintains about 20% perf. improvement over programmable in all fab. processes
• Pipelining only small benefit
• Majority of speedup comes from reduction in memory references
Medium-Grain Benchmark Perf.
• Pipelined architecture sees 2.6x perf. improvement over programmable
• Unpipelined architecture only minor improvement over programmable
– Greater parallelism means more ability to tolerate memory delays
Limit Studies
• Believe that memory operations are much of the benefit for small tasks
– Study limit where memory latency = 1
– Also test theory that streaming benchmarks have enough parallelism to cover latency
• Understand how much clock rate of reconfigurable unit affects performance
– Model reconfigurable unit at same clock rate as programmable clusters
– Completely unreasonable for unpipelined
– Might be indicator of what industry could do with pipelined
Unpipelined Fine-Grained
• Removing memory latencies makes programmable performance similar to reconfigurable
• Latency of reconfig. clusters has large impact on performance -- no parallelism to cover latency
Pipelined Fine-Grained
• Results similar to unpipelined
– Benefit still mostly from memory reduction
Unpipelined Medium-Grain
• Eliminating memory latencies really helps programmable
• Latency of reconf. logic an even bigger problem
– Programmable clusters can exploit parallelism through pipelines
Pipelined Medium-Grain
• Impact of memory system on reconfigurable performance very small
• Less benefit from increasing reconfigurable cluster clock rate
– With even small amounts of parallelism, throughput becomes more important than latency.
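The throughput-versus-latency point can be illustrated with a textbook pipeline timing model. This is not the simulator's model: the function names and the numbers in the usage note are made up for illustration.

```python
def unpipelined_time(n_items: int, latency: float) -> float:
    """Total time when each computation must finish before the next starts."""
    return n_items * latency


def pipelined_time(n_items: int, stages: int, cycle_time: float) -> float:
    """Total time for n independent computations on a `stages`-deep pipeline:
    the first result takes `stages` cycles, each later one adds one cycle."""
    if n_items <= 0:
        return 0.0
    return (stages + n_items - 1) * cycle_time
```

With a hypothetical 5-stage pipeline at 2 time units per cycle versus an unpipelined latency of 10 units, a single computation takes 10 units either way, while four independent computations take 16 units pipelined against 40 unpipelined. That matches the qualitative contrast between the fine-grained (one computation) and medium-grained (four independent computations) benchmarks.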
Future Directions
• ASIC-like performance with programmable systems
– ASICs typically get 100x better performance per unit area than microprocessors
• Application-specific memory systems in a programmable chip
– Transform memory references into communication
– Create natural division of programs into regular and irregular blocks
Conclusion
• Reconfigurable computing must provide both speedup from custom logic and high clock rates to succeed
• Amalgam does this by limiting and tolerating wire delay at multiple levels
– Clustered architecture
– Segmented reconfigurable unit
– Pipeline wire delays
• Result: 2.6x speedup over 8-way CMP in current and future fabrication processes