CS184b: Computer Architecture (Abstractions and Optimizations)

31
Caltech CS184 Spring2005 -- DeHon 1 CS184b: Computer Architecture (Abstractions and Optimizations) Day 25: May 27, 2005 Transactional Computing

description

CS184b: Computer Architecture (Abstractions and Optimizations). Day 25: May 27, 2005 Transactional Computing. Today. Transactional Model Hardware Support Course Wrapup. Transactional Compute Model. Computation is a collection of atomic transactions - PowerPoint PPT Presentation

Transcript of CS184b: Computer Architecture (Abstractions and Optimizations)

Page 1: CS184b: Computer Architecture (Abstractions and Optimizations)

Caltech CS184 Spring2005 -- DeHon1

CS184b:Computer Architecture

(Abstractions and Optimizations)

Day 25: May 27, 2005

Transactional Computing

Page 2: CS184b: Computer Architecture (Abstractions and Optimizations)

Caltech CS184 Spring2005 -- DeHon2

Today

• Transactional Model

• Hardware Support

• Course Wrapup

Page 3: CS184b: Computer Architecture (Abstractions and Optimizations)

Caltech CS184 Spring2005 -- DeHon3

Transactional Compute Model

• Computation is a collection of atomic transactions

• Transactions are preformed in some ordering– Specified sequential ordering– Some consistent sequential interleaving

• Looks like sequential (multithreaded) computation

Page 4: CS184b: Computer Architecture (Abstractions and Optimizations)

Caltech CS184 Spring2005 -- DeHon4

Atomic Blocks

• Indivisible

• Cannot get interleaving of operations among blocks

• Each block – reads state at some point in time; – provides all its updates (writes) at that

point

Page 5: CS184b: Computer Architecture (Abstractions and Optimizations)

Caltech CS184 Spring2005 -- DeHon5

How Get?

• How do we get atomic blocks in our current models?

• Get parallelism with this model?

Page 6: CS184b: Computer Architecture (Abstractions and Optimizations)

Caltech CS184 Spring2005 -- DeHon6

Simple/Sequential Version

• Acquire a mutex

• Perform transaction

• Release mutex

Page 7: CS184b: Computer Architecture (Abstractions and Optimizations)

Caltech CS184 Spring2005 -- DeHon7

Parallel/Locking Version

• Read inputs, locking each as encounter• Lock all outputs• Perform operations• Write outputs• Unlock inputs/outputs

• OneChip Vector case is a single source/sink version of this

Page 8: CS184b: Computer Architecture (Abstractions and Optimizations)

Caltech CS184 Spring2005 -- DeHon8

Parallel Locking Dangers

What do we have to worry about in this version?

• Lock acquisition order deadlock

• Parallelism?

Page 9: CS184b: Computer Architecture (Abstractions and Optimizations)

Caltech CS184 Spring2005 -- DeHon9

Optimistic (Speculative) Transactions

• Pattern: speculate/rollback– Common case: non-interacting

• Read inputs• Perform computations• Check no one changed inputs

– Got correct inputs? write outputs atomicly

– Inputs changed rollback and try again

Page 10: CS184b: Computer Architecture (Abstractions and Optimizations)

Caltech CS184 Spring2005 -- DeHon10

Optimistic Transactions

• Compare – advanced load– ll/sc– branch prediction

• Doesn’t need to be sequentiallized

• Just need to look sequential

• It is adequate to – detect sequentialization violation

Page 11: CS184b: Computer Architecture (Abstractions and Optimizations)

Caltech CS184 Spring2005 -- DeHon11

Hardware Microarchitecture

Page 12: CS184b: Computer Architecture (Abstractions and Optimizations)

Caltech CS184 Spring2005 -- DeHon12

How support in hardware?

• What need?– Keep/mark inputs– Hold on to outputs– Check inputs– Commit outputs

Page 13: CS184b: Computer Architecture (Abstractions and Optimizations)

Caltech CS184 Spring2005 -- DeHon13

Microarchitecture

• Mark inputs in processor-local cache• Writes into processor-local cache• Write buffer for outputs

– Hold until end of transaction

• Commit bus• Snoop on bus for invalidations• Dump write buffer to commit bus when

turn to commit

Page 14: CS184b: Computer Architecture (Abstractions and Optimizations)

Caltech CS184 Spring2005 -- DeHon14

Stanford TCC uArch

[Hammond et al./ISCA2004]

Page 15: CS184b: Computer Architecture (Abstractions and Optimizations)

Caltech CS184 Spring2005 -- DeHon15

uArch Concerns?

• Impact of commit sequentialization?

• Time spent rolling back?

• Size of cache/write-buffer needed?

Page 16: CS184b: Computer Architecture (Abstractions and Optimizations)

Caltech CS184 Spring2005 -- DeHon16

Commit Traffic

[Hammond et al./ISCA2004]

Page 17: CS184b: Computer Architecture (Abstractions and Optimizations)

Caltech CS184 Spring2005 -- DeHon17

Where time go?

[Hammond et al./ISCA2004]

Page 18: CS184b: Computer Architecture (Abstractions and Optimizations)

Caltech CS184 Spring2005 -- DeHon18

Read State

[Hammond et al./ISCA2004]

Page 19: CS184b: Computer Architecture (Abstractions and Optimizations)

Caltech CS184 Spring2005 -- DeHon19

Write State

• Fig 7

[Hammond et al./ISCA2004]

Page 20: CS184b: Computer Architecture (Abstractions and Optimizations)

Caltech CS184 Spring2005 -- DeHon20

Execution Time

[Hammond et al./ASPLOS2004]

Page 21: CS184b: Computer Architecture (Abstractions and Optimizations)

Caltech CS184 Spring2005 -- DeHon21

Other Patterns?

• Double Buffer

• Parallel Prefix for Commits

Page 22: CS184b: Computer Architecture (Abstractions and Optimizations)

Caltech CS184 Spring2005 -- DeHon22

Programming

• Explicit– Parallel/transactional looping constructs– forks

• Add like futures– Force ordering to match sequential

language spec and is “safe”

• Automatically add– Support “Mostly Functional” programming– “Make the fast case common”

Page 23: CS184b: Computer Architecture (Abstractions and Optimizations)

Caltech CS184 Spring2005 -- DeHon23

Performance Issues

• Too many violations?– Force some transactions to wait on others

• Explicit/pragma/program construct• Feedback-based

– Emprical dataflow discovery?

Page 24: CS184b: Computer Architecture (Abstractions and Optimizations)

Caltech CS184 Spring2005 -- DeHon24

Stanford TCC Prototyping

(from Workshop on Architecture Research using FPGA Platforms

at HPCA 2004)

Page 25: CS184b: Computer Architecture (Abstractions and Optimizations)

C. Kozyrakis, WARFP, Feb. 2005 2525

ATLAS OverviewATLAS Overview

A multi-board emulator for transactional A multi-board emulator for transactional parallel systemsparallel systems

GoalsGoals 16 to 64 CPUs (8 to 32 boards)16 to 64 CPUs (8 to 32 boards) 50 to 100MHz50 to 100MHz Stand-alone full-feature systemStand-alone full-feature system

OS, IDE disks, 100Mb Ethernet, … OS, IDE disks, 100Mb Ethernet, …

CPU

Transaction Cache

CPU

Transaction Cache

DR

AMIO/DRAM

Control

DISK

Net

PCI

CMP NETWORK

ATLAS architecture spaceATLAS architecture space Small, medium, and large-scale CMPs and SMPsSmall, medium, and large-scale CMPs and SMPs UMA and NUMAUMA and NUMA Flexible transactional memory hierarchy & protocolFlexible transactional memory hierarchy & protocol Flexible network model Flexible network model Flexible clocking, latency, and bandwidth settingsFlexible clocking, latency, and bandwidth settings

Page 26: CS184b: Computer Architecture (Abstractions and Optimizations)

C. Kozyrakis, WARFP, Feb. 2005 2626

Building Block: Xilinx ML310 BoardBuilding Block: Xilinx ML310 Board

XC2VP30 FPGA featuresXC2VP30 FPGA features 2 PowerPC 405 cores2 PowerPC 405 cores 2.4Mb dual-ported SRAM2.4Mb dual-ported SRAM 30K logic cells30K logic cells 8 RocketIO 3.125Gbps transceivers 8 RocketIO 3.125Gbps transceivers

System featuresSystem features 256MB DDR, 512MB CompactFlash256MB DDR, 512MB CompactFlash Ethernet, PCI, USB, IDE, … Ethernet, PCI, USB, IDE, …

Design and development toolsDesign and development tools Foundation ISE for design entry, synthesis, … Foundation ISE for design entry, synthesis, …

For the transactional memory hierarchy and networkFor the transactional memory hierarchy and network Chipscope Pro logic analyzer for debuggingChipscope Pro logic analyzer for debugging EDK for system simulation, system SW development, configuration, … EDK for system simulation, system SW development, configuration, … Montavista Linux 3.1 ProMontavista Linux 3.1 Pro

Page 27: CS184b: Computer Architecture (Abstractions and Optimizations)

C. Kozyrakis, WARFP, Feb. 2005 2727

Example: 2-way bus-based transactional CMPExample: 2-way bus-based transactional CMP

I-Cache

RegisterCheckpoint

CPU

D-CacheTags

D-CacheData

Tra

nsa

ctio

n S

tate

Sto

re

Que

ue

CommitControl

SnoopControl

CommitArbiter

I-Cache

RegisterCheckpoint

CPU

D-CacheTags

D-CacheData

Tra

nsaction

State

Store

Q

ueue

CommitControl

SnoopControl

L2 Cache

DRAM Controller

PowerPC 405 PowerPC 405

BRAM

OCM

BRAM

BRAM BRAM

BR

AM

BR

AM

OCM

Logic LogicLogic

BRAM

PLB

Logic Macro

PLBPLB

PLB PLB

Page 28: CS184b: Computer Architecture (Abstractions and Optimizations)

Caltech CS184 Spring2005 -- DeHon28

Big Ideas [Transactional]

• Model/Implementation– Only needs to appear sequentialized

• Speculation/Rollback

• Common Case

Page 29: CS184b: Computer Architecture (Abstractions and Optimizations)

Caltech CS184 Spring2005 -- DeHon29

Model Review

Page 30: CS184b: Computer Architecture (Abstractions and Optimizations)

Caltech CS184 Spring2005 -- DeHon30

ModelsThreads

single multiple

Singledata

Conv.Processors

Multipledata

DataParallel(SIMD,Vector)

No SM SharedMemory

SideEffects

No SideEffects

(S)MT Dataflow

SM MP

DSMFine-grainedthreading

PureMP

StreamingDataflow(e.g. SCORE)

Trans-actional

Page 31: CS184b: Computer Architecture (Abstractions and Optimizations)

Caltech CS184 Spring2005 -- DeHon31

Big Ideas

• Model

• Implementation support model semantics– But may be very different, optimized

• Scaling and change

• Common case fast; fast case common– Measure to understand