Mihai Budiu, Microsoft Research – Silicon Valley

Girish Venkataramani, Tiberiu Chelcea, Seth Copen Goldstein

Carnegie Mellon University

Spatial Computation: Computing without General-Purpose Processors

Outline

• Intro: Problems of current architectures

• Compiling Application-Specific Hardware

• ASH Evaluation

• Conclusions

Resources

• We do not worry about not having hardware resources

• We worry about being able to use hardware resources

[Intel]

Complexity

Designer productivity

Chip size

Cannot rely on global signals (clock is a global signal)

Gate delay: 5 ps; wire delay: 20 ps

Automatic translation, C → HW

Simple, short, unidirectional interconnect

No interpretation

Distributed control, asynchronous

Simple hw, mostly idle

Our Proposal: Application-Specific Hardware

• ASH addresses these problems

• ASH is not a panacea

• ASH is “complementary” to the CPU

[Diagram: the CPU runs low-ILP computation + OS + VM; ASH runs high-ILP computation; both connect to Memory]

Paper Content

• Automatic translation of C to hardware dataflow machines

• High-level comparison of dataflow and superscalar

• Circuit-level evaluation -- power, performance, area

Outline

• Problems of current architectures

• CASH: Compiling Application-Specific Hardware

• ASH Evaluation

• Conclusions

Application-Specific Hardware

C program → Compiler → Dataflow IR → HW backend → Reconfigurable/custom hw

Computation Dataflow

x = a & 7;

...

y = x >> 2;

Program

Circuits

No interpretation

Operations → Nodes → Pipeline stages

Variables → Def-use edges → Channels (wires)
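The mapping above (operations become nodes, variables become def-use edges) can be sketched in C for the two statements on this slide. This is only an illustrative model, not the CASH intermediate representation: the `df_node` type, the `op_kind` names, and `df_example` are hypothetical.

```c
/* Hypothetical node kinds; the slide only says that operations become nodes. */
typedef enum { OP_INPUT, OP_AND_CONST, OP_SHR_CONST } op_kind;

typedef struct {
    op_kind  kind;
    int      src;    /* def-use edge: index of the producing node, -1 for inputs */
    unsigned imm;    /* constant operand of the operation */
    unsigned value;  /* value held at this node's output */
} df_node;

/* Evaluate the nodes in topological order; each node reads only
 * the output of its producer, the way a wire would carry it. */
static void df_eval(df_node *g, int n) {
    for (int i = 0; i < n; i++) {
        switch (g[i].kind) {
        case OP_INPUT:
            break;  /* value was supplied by the caller */
        case OP_AND_CONST:
            g[i].value = g[g[i].src].value & g[i].imm;
            break;
        case OP_SHR_CONST:
            g[i].value = g[g[i].src].value >> g[i].imm;
            break;
        }
    }
}

/* Graph for:  x = a & 7;  y = x >> 2;  Returns y. */
unsigned df_example(unsigned a) {
    df_node g[3] = {
        { OP_INPUT,     -1, 0, a },  /* a */
        { OP_AND_CONST,  0, 7, 0 },  /* x = a & 7, edge a -> x */
        { OP_SHR_CONST,  1, 2, 0 },  /* y = x >> 2, edge x -> y */
    };
    df_eval(g, 3);
    return g[2].value;
}
```

In hardware each node would be its own pipeline stage and no loop would walk the graph; the loop in `df_eval` exists only because this sketch runs sequentially.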

Basic Computation = Pipeline Stage

[Figure labels: +, latch]

Distributed Control Logic

ack, rdy

global

short, local wires
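The rdy and ack labels suggest a local handshake between neighboring stages. A minimal sketch in C, assuming rdy means "data is valid" and ack means "data was consumed"; the `channel` type and every function name are hypothetical, and a sequential program only approximates how an asynchronous circuit behaves.

```c
#include <stdbool.h>

/* One-place channel between two stages (hypothetical model). */
typedef struct {
    int  data;
    bool rdy;   /* set by the producer, cleared when the consumer acks */
} channel;

/* Producer side: stalls (returns false) until the previous value was acked. */
bool chan_send(channel *c, int v) {
    if (c->rdy) return false;
    c->data = v;
    c->rdy = true;
    return true;
}

/* Consumer side: takes the value; clearing rdy plays the role of ack. */
bool chan_recv(channel *c, int *out) {
    if (!c->rdy) return false;
    *out = c->data;
    c->rdy = false;
    return true;
}

/* A "+1" pipeline stage: fires only when its input is ready
 * and its output channel is free. No global signal is involved. */
bool stage_step(channel *in, channel *out) {
    int v;
    if (!in->rdy || out->rdy) return false;
    chan_recv(in, &v);
    chan_send(out, v + 1);
    return true;
}

/* Pushes 41 through one stage; returns the result, or -1 on a protocol error. */
int handshake_demo(void) {
    channel in = {0, false}, out = {0, false};
    int result;
    if (!chan_send(&in, 41)) return -1;
    if (chan_send(&in, 5)) return -1;       /* must stall: 41 not yet consumed */
    if (!stage_step(&in, &out)) return -1;
    if (stage_step(&in, &out)) return -1;   /* nothing new on the input */
    if (!chan_recv(&out, &result)) return -1;
    return result;
}
```

Each decision in `stage_step` looks only at the two channels attached to the stage, which is the property the slide summarizes as short, local wires.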

MUX: Forward Branches

if (x > 0)
    y = -x;
else
    y = b*x;

Conditionals ⇒ Speculation

SSA = no arbitration
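One way to read "Conditionals ⇒ Speculation" is that both sides of the branch are computed unconditionally and a multiplexer picks one result. A sketch in C under that reading; the function names and the SSA-style temporaries `y_then` and `y_else` are illustrative, not taken from the slides.

```c
/* Reference: the conditional as written on the slide. */
int cond_ref(int x, int b) {
    int y;
    if (x > 0)
        y = -x;
    else
        y = b * x;
    return y;
}

/* Speculative form: both sides are evaluated, then a mux selects.
 * Each temporary has exactly one definition, as in SSA. */
int cond_mux(int x, int b) {
    int p      = x > 0;   /* predicate */
    int y_then = -x;      /* computed regardless of p */
    int y_else = b * x;   /* computed regardless of p */
    return p ? y_then : y_else;   /* MUX driven by the predicate */
}
```

Because every value has a single producer, the mux never has to arbitrate between competing writers of `y`; it only forwards the input that the predicate selects.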

Memory Access

Monolithic memory

local communication, global structures

pipelined arbitrated network

Future work: fragment this!

Outline

• Problems of current architectures

• Compiling ASH

• ASH Evaluation

• Conclusions

Evaluating ASH

C → CASH core → Verilog back-end → Synopsys, Cadence P/R

180nm std. cell library, 2V

~1999 technology

Mediabench kernels (1 hot function/benchmark)

ModelSim (Verilog simulation)

performance numbers

commercial tools

Compile Time

C: 200 lines

CASH core: 20 seconds

Verilog back-end: 10 seconds

Synopsys, Cadence P/R: 20 minutes, 1 hour

ASH Area

P4: 217

minimal RISC core

[Chart legend: Mem access, Datapath]

ASH vs 600MHz CPU [.18 μm]

[Bar chart value labels: 0.60, 0.77, 0.53, 0.48, 1.35, 1.52, 1.55, 3.65, 3.57]

Bottleneck: Memory Protocol

ST Memory

• Enabling dependent operations requires round-trip to memory.

• Limit study: round trip zero time ⇒ up to 5x speed-up.

• Exploring novel memory access protocols.

Power

DSP: 110

μP: 4000

Xeon [+cache]: 67000

[Chart value labels: 9.3, 9.3, 23.6, 22.5, 25.2, 25.2]

Energy-delay vs. Wattch

Energy Efficiency

[Chart: energy efficiency in operations/nJ, log scale from 0.01 to 1000; categories: general-purpose DSP, dedicated hardware, ASH media kernels, microprocessors, asynchronous μP]

Outline

Problems of current architectures

+ Compiling ASH

+ Evaluation

= Related work, Conclusions

Related Work

• Optimizing compilers

• High-level synthesis

• Reconfigurable computing

• Dataflow machines

• Asynchronous circuits

• Spatial computation

We target an extreme point in the design space: no interpretation, fully distributed computation and control

ASH Design Point

• Design an ASIC in a day

• Fully automatic synthesis to layout

• Fully distributed control and computation (spatial computation)
– Replicate computation to simplify wires

• Energy/op rivals custom ASIC

• Performance rivals superscalar

• E×t (energy-delay) 100 times better than any processor

Conclusions

Feature                  Advantages
No interpretation        Energy efficiency, speed
Spatial layout           Short wires, no contention
Asynchronous             Low power, scalable
Distributed              No global signals
Automatic compilation    Designer productivity

Spatial computation strengths

Backup Slides

• Absolute performance
• Control logic
• Exceptions
• Leniency
• Normalized area
• Loops
• ASH weaknesses
• Splitting memory
• Recursive calls
• Leakage
• Why not compare to…
• Targeting FPGAs

Absolute Performance

[Chart legend: MOPS all, MOPS spec]

[Figure signal labels: data in, data out, rdy out, ack in, ack out]

Pipeline Stage

Exceptions

• Strictly speaking, C has no exceptions

• In practice hard to accommodate exceptions in hardware implementations

• An advantage of software flexibility: PC is single point of execution control

[Diagram: the CPU runs low-ILP computation + OS + VM + exceptions; ASH runs high-ILP computation; both connect to Memory]

Critical Paths

if (x > 0)
    y = -x;
else
    y = b*x;

Lenient Operations

if (x > 0)
    y = -x;
else
    y = b*x;

Solves the problem of unbalanced paths
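A small timing model may help show why leniency matters for unbalanced paths. It assumes that a lenient multiplexer can produce its output as soon as the predicate and the selected input have arrived, while a strict one waits for all inputs. The function names and the arrival times used in the usage note are made up for illustration.

```c
static int max2(int a, int b) { return a > b ? a : b; }

/* Strict mux: output time is the latest arrival among all three inputs. */
int strict_mux_time(int t_pred, int t_then, int t_else) {
    return max2(t_pred, max2(t_then, t_else));
}

/* Lenient mux (assumed semantics): waits only for the predicate
 * and for the input that the predicate actually selects. */
int lenient_mux_time(int pred, int t_pred, int t_then, int t_else) {
    return max2(t_pred, pred ? t_then : t_else);
}
```

With hypothetical arrival times of 1 for the predicate, 2 for the negation `-x`, and 10 for the multiplication `b*x`, the strict mux finishes at time 10 on every execution, while the lenient mux finishes at time 2 whenever `x > 0` selects the negation.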

Normalized Area

[Chart axes: lines/sq mm and sq mm/kbyte; value label: 2.5]

Control Flow ⇒ Data Flow

[Figure labels: data, predicate; Merge (label); Gateway; Split (branch), p]

[Loop figure labels: +1, < 100]

int sum=0, i;
for (i=0; i < 100; i++)
    sum += i*i;
return sum;

[Figure labels: ret, back]
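The loop above can be rewritten so that the merge, predicate, and split steps named on the previous slide are visible. This is a sequential C sketch, not compiler output: `sum_squares_dataflow` and `sum_squares_ref` are hypothetical names, and the comments map each step to a node only informally.

```c
/* Reference: the loop exactly as on the slide. */
int sum_squares_ref(void) {
    int sum = 0, i;
    for (i = 0; i < 100; i++)
        sum += i * i;
    return sum;
}

/* Same loop with the control flow spelled out as value steering. */
int sum_squares_dataflow(int n) {
    /* Merge (label): on entry the initial values flow in;
     * afterwards the values arriving on the back edge do. */
    int i = 0, sum = 0;
    for (;;) {
        int p = i < n;            /* predicate node (i < 100 on the slide) */
        if (!p)
            return sum;           /* Split (branch): steer sum to the ret edge */
        /* Split steers i and sum into the loop body when p holds. */
        int sum_next = sum + i * i;
        int i_next   = i + 1;     /* the "+1" node */
        /* Back edge: feed the new values to the merge. */
        i   = i_next;
        sum = sum_next;
    }
}
```

For `n = 100` both functions return 328350, the sum of the squares of 0 through 99.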

ASH Weaknesses

• Both branch and join not free

• Static dataflow (no re-issue of same instr)

• Memory is “far”

• Fully static
– No branch prediction
– No dynamic unrolling
– No register renaming

• Calls/returns not lenient

Branch Prediction

for (i=0; i < N; i++) {
    if (exception) break;
}

Loop branch: predicted taken.

Exception test: predicted not taken. Effectively a noop for CPU!

[Figure labels: exception; result available before inputs; ASH crit path; CPU crit path]

Memory Partitioning

• MIT RAW project: Babb FCCM ’99, Barua HiPC ’00, Lee ASPLOS ’00

• Stanford SpC: Semeria DAC ’01, TVLSI ’02

• Illinois FlexRAM: Fraguela PPoPP ’03

• Hand-annotations #pragma

Recursion

recursive call

save live values

restore live values

stack

Leakage Power

$P_s = k \cdot \mathrm{Area} \cdot e^{-V_T}$

• Employ circuit-level techniques

• Cut power supply of idle circuit portions
– most of the circuit is idle most of the time
– strong locality of activity

Why Not Compare To…

• In-order processor
– Worse in all metrics than superscalar, except power
– We beat it in all metrics, including performance

• DSP
– We expect roughly the same results as for superscalar (Wattch maintains high IPC for these kernels)

• ASIC
– No available tool-flow supports C to the same degree

• Asynchronous ASIC
– We compared with a Balsa synthesis system
– We are 15 times better in E×t compared to resulting ASIC

• Async processor
– We are 350 times better in E×t than Amulet (scaled to .18 μm)

Compared to Next Talk

Engine [180nm]    Performance [MIPS]    E/instruction [pJ]
SNAP/LE           28                    24
SNAP/LE           240                   218
ASH               1100                  20
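The two columns above can be combined with simple arithmetic: millions of instructions per second times picojoules per instruction gives microwatts, and energy divided by rate gives an energy-delay figure per instruction. The helpers below are a sketch of that arithmetic only. They assume the three rows count comparable units of work, which the slide does not establish, and the function names are hypothetical.

```c
/* Average power in microwatts: (1e6 instr/s) * (1e-12 J/instr) = 1e-6 W. */
double power_uW(double mips, double pj_per_instr) {
    return mips * pj_per_instr;
}

/* Energy-delay per instruction in pJ·µs:
 * energy [pJ] times time per instruction [µs] = pJ / MIPS. */
double energy_delay(double mips, double pj_per_instr) {
    return pj_per_instr / mips;
}
```

Applied to the table, `power_uW` gives 672 µW and 52320 µW for the two SNAP/LE rows and 22000 µW for ASH. Under the stated assumption, the ASH row has the smallest energy-delay per instruction of the three.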

Why not target FPGA

• Do not support asynchronous circuits

• Very inefficient in area, power, delay

• Too fine-grained for datapath circuits

• We are designing an async FPGA
