ECE 697F Reconfigurable Computing Lecture 19 Reconfigurable Coprocessors

35
Lecture 19: Reconfigurable Coprocessors November 15, 2004 ECE 697F Reconfigurable Computing Lecture 19 Reconfigurable Coprocessors

description

ECE 697F Reconfigurable Computing Lecture 19 Reconfigurable Coprocessors. Overview. Focus on Processor and Array hybrids. Motivation Compute Models: how to fit into computation Examples: Garp, Prism, Remarc, OneChip, Prisc - PowerPoint PPT Presentation

Transcript of ECE 697F Reconfigurable Computing Lecture 19 Reconfigurable Coprocessors

Page 1: ECE 697F Reconfigurable Computing Lecture 19 Reconfigurable Coprocessors

Lecture 19: Reconfigurable Coprocessors November 15, 2004

ECE 697F

Reconfigurable Computing

Lecture 19

Reconfigurable Coprocessors

Page 2: ECE 697F Reconfigurable Computing Lecture 19 Reconfigurable Coprocessors

Lecture 19: Reconfigurable Coprocessors November 15, 2004

Overview

• Focus on Processor and Array hybrids.

• Motivation

• Compute Models: how to fit into computation

• Examples: Garp, Prism, Remarc, OneChip, Prisc

• Some lecture material taken with permission from Dehon lecture on reconfigurable computing.

Page 3: ECE 697F Reconfigurable Computing Lecture 19 Reconfigurable Coprocessors

Lecture 19: Reconfigurable Coprocessors November 15, 2004

Compression Techniques

• Processors efficient at sequential codes, regular arithmetic operations.

• FPGA efficient at fine-grained parallelism, unusual bit-level operations.

• Tight-coupling important: allows sharing of data/control

• Converging technologies: SRAM being migrated to same die as processor anyway. Why not integrate?

Page 4: ECE 697F Reconfigurable Computing Lecture 19 Reconfigurable Coprocessors

Lecture 19: Reconfigurable Coprocessors November 15, 2004

Motivational: Other Viewpoints

• Replace interface glue logic.

• I/O pre/post processing

• Handle real-time responsiveness

• Provide powerful, application specific operation.

• Allow migration of function /performance over time.

Page 5: ECE 697F Reconfigurable Computing Lecture 19 Reconfigurable Coprocessors

Lecture 19: Reconfigurable Coprocessors November 15, 2004

Compute Models

• Glue logic for buses, adapters.

• Dedicated I/O processor

• Instruction augmentation

- Special instructions/coprocessor ops

- VLIW/microcoded extension to processor

- Configurable vector unit

• Autonomous co/stream processor

Page 6: ECE 697F Reconfigurable Computing Lecture 19 Reconfigurable Coprocessors

Lecture 19: Reconfigurable Coprocessors November 15, 2004

Interfacing• Logic replaces:

- ASIC customization

- External FPGA/CPLD

• Example

- Bus protocols

- Peripherals

- Sensors, actuators

• Argument

- Need customization

- Modern chips have capacity

- Reduce part count

- Migrate to system-on-a-chip

- Performance/power

Page 7: ECE 697F Reconfigurable Computing Lecture 19 Reconfigurable Coprocessors

Lecture 19: Reconfigurable Coprocessors November 15, 2004

Triscend E5 Architecture

Page 8: ECE 697F Reconfigurable Computing Lecture 19 Reconfigurable Coprocessors

Lecture 19: Reconfigurable Coprocessors November 15, 2004

Triscend E5 Architecture

Page 9: ECE 697F Reconfigurable Computing Lecture 19 Reconfigurable Coprocessors

Lecture 19: Reconfigurable Coprocessors November 15, 2004

I/O Processor• Array dedicated to servicing to I/O channel

- Sensor, LAN, WAN, peripheral

- Many protocols, services

• Provides protocol handling

- Stream computation

- Compression, encrypt

• Effectively looks like I/O peripheral to processor.

- Don’t need all at same time

- Offload function from processor.

Page 10: ECE 697F Reconfigurable Computing Lecture 19 Reconfigurable Coprocessors

Lecture 19: Reconfigurable Coprocessors November 15, 2004

I/O Processing

• Single threaded processor created in reconfigurable logic.

• No support for multiple data pipes or multiple contexts.

• Need some minimal, local control to handle events.

• For performance or real-time guarantees, may need to service rapidly.

• Checksum and acknowledge packets, for example

Page 11: ECE 697F Reconfigurable Computing Lecture 19 Reconfigurable Coprocessors

Lecture 19: Reconfigurable Coprocessors November 15, 2004

Instruction Augmentation

• Processor can only describe a small number of basic computations in a cycle

- I bits -> 2I operations

• Recall that for Boolean function a total of ______ operations could be performed on 2 W-bit words.

• ALU implementations restrict execution of some simple operations.

- e. g. bit reversal

a31 a30………. a0

b31 b0

Swap bitpositions

Page 12: ECE 697F Reconfigurable Computing Lecture 19 Reconfigurable Coprocessors

Lecture 19: Reconfigurable Coprocessors November 15, 2004

Instruction Augmentation

• Provide a way to augment the processor instruction set for an application.

• Avoid mismatch between hardware/software

•Fit augmented instructions into data and and

control stream.

•Create a functional unit for augmented instructions.

•Compiler techniques to identify/use new functional unit.

What’s Required?

Page 13: ECE 697F Reconfigurable Computing Lecture 19 Reconfigurable Coprocessors

Lecture 19: Reconfigurable Coprocessors November 15, 2004

Chimaera

• Start from Prisc idea.

- Integrate as a functional unit

- No state

- RFU Ops (like expfu)

- Stall processor on instruction miss

• Add

- Multiple instructions at a time

- More than 2 inputs possible

• Hauck: University of Washington

Page 14: ECE 697F Reconfigurable Computing Lecture 19 Reconfigurable Coprocessors

Lecture 19: Reconfigurable Coprocessors November 15, 2004

Chimaera Architecture

• Live copy of register file values feed into array

• Each row of array may compute from register of intermediates

• Tag on array to indicate RFUOP

Page 15: ECE 697F Reconfigurable Computing Lecture 19 Reconfigurable Coprocessors

Lecture 19: Reconfigurable Coprocessors November 15, 2004

Chimaera Architecture

• Array can operate on values as soon as placed in register file.

• Logic is combinational

• When RFUOP matches

- Stall until result ready

- Drive result from matching row

Page 16: ECE 697F Reconfigurable Computing Lecture 19 Reconfigurable Coprocessors

Lecture 19: Reconfigurable Coprocessors November 15, 2004

Chimaera Timing

• If R1 presented last then stall

• Might be helped by instruction reordering

• Physical implementation an issue.

R5 R3 R2 R1

Page 17: ECE 697F Reconfigurable Computing Lecture 19 Reconfigurable Coprocessors

Lecture 19: Reconfigurable Coprocessors November 15, 2004

Chimaera Results

• Three Spec92 benchmarks

- Compress 1.11 speedup

- Eqntott 1.8

- Life 2.06

• Small arrays with limited state

• Small speedup

• Perhaps focus on global router rather than local optimization.

Page 18: ECE 697F Reconfigurable Computing Lecture 19 Reconfigurable Coprocessors

Lecture 19: Reconfigurable Coprocessors November 15, 2004

Garp

• Integrate as coprocessor

- Similar bandwidth to processor as functional unit

- Own access to memory

• Support multi-cycle operation

- Allow state

- Cycle counter to track operation

• Configuration cache, path to memory

Page 19: ECE 697F Reconfigurable Computing Lecture 19 Reconfigurable Coprocessors

Lecture 19: Reconfigurable Coprocessors November 15, 2004

Garp – UC Berkeley

• ISA – coprocessor operations

- Issue gaconfig to make particular configuration present.

- Explicitly move data to/from array

- Processor suspension during coproc operation

- Use cycle counter to track progress

• Array may directly access memory

- Processor and array share memory

- Exploits streaming data operations

- Cache/MMU maintains data consistency

Page 20: ECE 697F Reconfigurable Computing Lecture 19 Reconfigurable Coprocessors

Lecture 19: Reconfigurable Coprocessors November 15, 2004

Garp Instructions

• Interlock indicates if processor waits for array to count to zero.

• Last three instructions useful for context swap

• Processor decode hardware augmented to recognize new instructions.

Page 21: ECE 697F Reconfigurable Computing Lecture 19 Reconfigurable Coprocessors

Lecture 19: Reconfigurable Coprocessors November 15, 2004

Garp Array

• Row-oriented logic

• Dedicated path for processor/memory

• Processor does not have to be involved in array-memory path

Page 22: ECE 697F Reconfigurable Computing Lecture 19 Reconfigurable Coprocessors

Lecture 19: Reconfigurable Coprocessors November 15, 2004

Garp Results

• General results- 10-20X

improvement on stream, feed-forward operation

- 2-3x when data dependencies limit pipelining

- [Hauser-FCCM97]

Page 23: ECE 697F Reconfigurable Computing Lecture 19 Reconfigurable Coprocessors

Lecture 19: Reconfigurable Coprocessors November 15, 2004

PRISC/Chimaera vs. Garp

• Prisc/Chimaera

- Basic op is single cycle: expfu

- No state

- Could have multiple PFUs

- Fine grained parallelism

- Not effective for deep pipelines

• Garp

- Basic op is multi-cycle – gaconfig

- Effective for deep pipelining

- Single array

- Requires state swapping consideration

Page 24: ECE 697F Reconfigurable Computing Lecture 19 Reconfigurable Coprocessors

Lecture 19: Reconfigurable Coprocessors November 15, 2004

Common Theme

• To overcome instruction expression limits:

- Define new array instructions. Make decode hardware slower / more complicated.

- Many bits of configuration… swap time. An issue -> recall tips for dynamic reconfiguration.

• Give array configuration short “name” which processor can call out.

• Store multiple configurations in array. Access as needed (DPGA)

Page 25: ECE 697F Reconfigurable Computing Lecture 19 Reconfigurable Coprocessors

Lecture 19: Reconfigurable Coprocessors November 15, 2004

ReMarc

• Miyamori/Olukotun – Stanford

• Array of “nano-processors”

- 16b, 32 instructions each

- VLIW –like instruction

• Coprocessor interface (similar to Garp)

- No direct array -> memory

Page 26: ECE 697F Reconfigurable Computing Lecture 19 Reconfigurable Coprocessors

Lecture 19: Reconfigurable Coprocessors November 15, 2004

ReMarc Architecture

• 8x8 array of nanoprocessor

• Reminiscent of DPGA except that processing element is ALU

Page 27: ECE 697F Reconfigurable Computing Lecture 19 Reconfigurable Coprocessors

Lecture 19: Reconfigurable Coprocessors November 15, 2004

Nanoprocessor Tile

• Each tile has own instruction RAM

• Communication with near-neighbor tiles

• Global sequence specifies non-PC

• 16 bit output.

Page 28: ECE 697F Reconfigurable Computing Lecture 19 Reconfigurable Coprocessors

Lecture 19: Reconfigurable Coprocessors November 15, 2004

ReMarc Results

• ReMarc 60X smaller than FPGA

• Performance comparable

Page 29: ECE 697F Reconfigurable Computing Lecture 19 Reconfigurable Coprocessors

Lecture 19: Reconfigurable Coprocessors November 15, 2004

Observation

• All coprocessors have been single-threaded

- Performance improvement limited by application parallelism

• Potential for task/thread parallelism

- DPGA

- Fast context switch

• Concurrent threads seen in discussion of IO/stream processor

• Added complexity needs to be addressed in software.

Page 30: ECE 697F Reconfigurable Computing Lecture 19 Reconfigurable Coprocessors

Lecture 19: Reconfigurable Coprocessors November 15, 2004

Scalability?

• Can scale….

- Number of inactive contexts.

- Similar to cache model

- Number of PFUs in PRISC/Chimaera– Still limited by single execution thread.

– Exacerbate pressure/complexity of reconfigurable logic/interconnect

• Cannot scale?

- Amount of active resources.

- Perhaps take coarser-grain focus to parallel processing.

Page 31: ECE 697F Reconfigurable Computing Lecture 19 Reconfigurable Coprocessors

Lecture 19: Reconfigurable Coprocessors November 15, 2004

Parallel Computation: Processor and FPGA

• What would it take to let the processor and FPGA run in parallel?

Modern Processors

Deal with:

• Variable data delays

• Dependencies with data

• Multiple heterogeneous functional units

Via:

• Register scoreboarding

• Runtime data flow (Tomasulo)

Page 32: ECE 697F Reconfigurable Computing Lecture 19 Reconfigurable Coprocessors

Lecture 19: Reconfigurable Coprocessors November 15, 2004

OneChip -> Toronto

• Allow array to have more memory-memory operations

• Want to fit into programming model/ISA without forcing exclusive processor/FPGA operation.

• Also allow decoupled processor/array execution.

• Allow interlocking of data in special “scoreboard” area.

Page 33: ECE 697F Reconfigurable Computing Lecture 19 Reconfigurable Coprocessors

Lecture 19: Reconfigurable Coprocessors November 15, 2004

OneChip Innovations

• FPGA operates on certain memory regions only

• Makes regions explicit to processor issue.

• Scoreboard memory blocks

0x00x1000

0x10000

FPGA

Proc

Indicates usage of data pages like virtual memory system!

Page 34: ECE 697F Reconfigurable Computing Lecture 19 Reconfigurable Coprocessors

Lecture 19: Reconfigurable Coprocessors November 15, 2004

OneChip

• Basic Op is FPGAMem -> Mem

• No state between ops

• Ops must appear sequential

• Could have multiple/parallel FPGA compute units

- Scoreboard between all

• Multiprocessing?

Page 35: ECE 697F Reconfigurable Computing Lecture 19 Reconfigurable Coprocessors

Lecture 19: Reconfigurable Coprocessors November 15, 2004

Summary

• Several different models and uses for “reconfigurable processor”

• Some move towards parallel computing. Others towards single processors

• Exploit density and expressiveness of fine-grained, parallel operations.

• Number of ways to integrate. Need to work around limitations.