CS184b: Computer Architecture (Abstractions and Optimizations)
description
Transcript of CS184b: Computer Architecture (Abstractions and Optimizations)
Caltech CS184 Spring2005 -- DeHon1
CS184b:Computer Architecture
(Abstractions and Optimizations)
Day 25: May 27, 2005
Transactional Computing
Caltech CS184 Spring2005 -- DeHon2
Today
• Transactional Model
• Hardware Support
• Course Wrapup
Caltech CS184 Spring2005 -- DeHon3
Transactional Compute Model
• Computation is a collection of atomic transactions
• Transactions are preformed in some ordering– Specified sequential ordering– Some consistent sequential interleaving
• Looks like sequential (multithreaded) computation
Caltech CS184 Spring2005 -- DeHon4
Atomic Blocks
• Indivisible
• Cannot get interleaving of operations among blocks
• Each block – reads state at some point in time; – provides all its updates (writes) at that
point
Caltech CS184 Spring2005 -- DeHon5
How Get?
• How do we get atomic blocks in our current models?
• Get parallelism with this model?
Caltech CS184 Spring2005 -- DeHon6
Simple/Sequential Version
• Acquire a mutex
• Perform transaction
• Release mutex
Caltech CS184 Spring2005 -- DeHon7
Parallel/Locking Version
• Read inputs, locking each as encounter• Lock all outputs• Perform operations• Write outputs• Unlock inputs/outputs
• OneChip Vector case is a single source/sink version of this
Caltech CS184 Spring2005 -- DeHon8
Parallel Locking Dangers
What do we have to worry about in this version?
• Lock acquisition order deadlock
• Parallelism?
Caltech CS184 Spring2005 -- DeHon9
Optimistic (Speculative) Transactions
• Pattern: speculate/rollback– Common case: non-interacting
• Read inputs• Perform computations• Check no one changed inputs
– Got correct inputs? write outputs atomicly
– Inputs changed rollback and try again
Caltech CS184 Spring2005 -- DeHon10
Optimistic Transactions
• Compare – advanced load– ll/sc– branch prediction
• Doesn’t need to be sequentiallized
• Just need to look sequential
• It is adequate to – detect sequentialization violation
Caltech CS184 Spring2005 -- DeHon11
Hardware Microarchitecture
Caltech CS184 Spring2005 -- DeHon12
How support in hardware?
• What need?– Keep/mark inputs– Hold on to outputs– Check inputs– Commit outputs
Caltech CS184 Spring2005 -- DeHon13
Microarchitecture
• Mark inputs in processor-local cache• Writes into processor-local cache• Write buffer for outputs
– Hold until end of transaction
• Commit bus• Snoop on bus for invalidations• Dump write buffer to commit bus when
turn to commit
Caltech CS184 Spring2005 -- DeHon14
Stanford TCC uArch
[Hammond et al./ISCA2004]
Caltech CS184 Spring2005 -- DeHon15
uArch Concerns?
• Impact of commit sequentialization?
• Time spent rolling back?
• Size of cache/write-buffer needed?
Caltech CS184 Spring2005 -- DeHon16
Commit Traffic
[Hammond et al./ISCA2004]
Caltech CS184 Spring2005 -- DeHon17
Where time go?
[Hammond et al./ISCA2004]
Caltech CS184 Spring2005 -- DeHon18
Read State
[Hammond et al./ISCA2004]
Caltech CS184 Spring2005 -- DeHon19
Write State
• Fig 7
[Hammond et al./ISCA2004]
Caltech CS184 Spring2005 -- DeHon20
Execution Time
[Hammond et al./ASPLOS2004]
Caltech CS184 Spring2005 -- DeHon21
Other Patterns?
• Double Buffer
• Parallel Prefix for Commits
Caltech CS184 Spring2005 -- DeHon22
Programming
• Explicit– Parallel/transactional looping constructs– forks
• Add like futures– Force ordering to match sequential
language spec and is “safe”
• Automatically add– Support “Mostly Functional” programming– “Make the fast case common”
Caltech CS184 Spring2005 -- DeHon23
Performance Issues
• Too many violations?– Force some transactions to wait on others
• Explicit/pragma/program construct• Feedback-based
– Emprical dataflow discovery?
Caltech CS184 Spring2005 -- DeHon24
Stanford TCC Prototyping
(from Workshop on Architecture Research using FPGA Platforms
at HPCA 2004)
C. Kozyrakis, WARFP, Feb. 2005 2525
ATLAS OverviewATLAS Overview
A multi-board emulator for transactional A multi-board emulator for transactional parallel systemsparallel systems
GoalsGoals 16 to 64 CPUs (8 to 32 boards)16 to 64 CPUs (8 to 32 boards) 50 to 100MHz50 to 100MHz Stand-alone full-feature systemStand-alone full-feature system
OS, IDE disks, 100Mb Ethernet, … OS, IDE disks, 100Mb Ethernet, …
CPU
Transaction Cache
CPU
Transaction Cache
DR
AMIO/DRAM
Control
DISK
Net
PCI
CMP NETWORK
ATLAS architecture spaceATLAS architecture space Small, medium, and large-scale CMPs and SMPsSmall, medium, and large-scale CMPs and SMPs UMA and NUMAUMA and NUMA Flexible transactional memory hierarchy & protocolFlexible transactional memory hierarchy & protocol Flexible network model Flexible network model Flexible clocking, latency, and bandwidth settingsFlexible clocking, latency, and bandwidth settings
C. Kozyrakis, WARFP, Feb. 2005 2626
Building Block: Xilinx ML310 BoardBuilding Block: Xilinx ML310 Board
XC2VP30 FPGA featuresXC2VP30 FPGA features 2 PowerPC 405 cores2 PowerPC 405 cores 2.4Mb dual-ported SRAM2.4Mb dual-ported SRAM 30K logic cells30K logic cells 8 RocketIO 3.125Gbps transceivers 8 RocketIO 3.125Gbps transceivers
System featuresSystem features 256MB DDR, 512MB CompactFlash256MB DDR, 512MB CompactFlash Ethernet, PCI, USB, IDE, … Ethernet, PCI, USB, IDE, …
Design and development toolsDesign and development tools Foundation ISE for design entry, synthesis, … Foundation ISE for design entry, synthesis, …
For the transactional memory hierarchy and networkFor the transactional memory hierarchy and network Chipscope Pro logic analyzer for debuggingChipscope Pro logic analyzer for debugging EDK for system simulation, system SW development, configuration, … EDK for system simulation, system SW development, configuration, … Montavista Linux 3.1 ProMontavista Linux 3.1 Pro
C. Kozyrakis, WARFP, Feb. 2005 2727
Example: 2-way bus-based transactional CMPExample: 2-way bus-based transactional CMP
I-Cache
RegisterCheckpoint
CPU
D-CacheTags
D-CacheData
Tra
nsa
ctio
n S
tate
Sto
re
Que
ue
CommitControl
SnoopControl
CommitArbiter
I-Cache
RegisterCheckpoint
CPU
D-CacheTags
D-CacheData
Tra
nsaction
State
Store
Q
ueue
CommitControl
SnoopControl
L2 Cache
DRAM Controller
PowerPC 405 PowerPC 405
BRAM
OCM
BRAM
BRAM BRAM
BR
AM
BR
AM
OCM
Logic LogicLogic
BRAM
PLB
Logic Macro
PLBPLB
PLB PLB
Caltech CS184 Spring2005 -- DeHon28
Big Ideas [Transactional]
• Model/Implementation– Only needs to appear sequentialized
• Speculation/Rollback
• Common Case
Caltech CS184 Spring2005 -- DeHon29
Model Review
Caltech CS184 Spring2005 -- DeHon30
ModelsThreads
single multiple
Singledata
Conv.Processors
Multipledata
DataParallel(SIMD,Vector)
No SM SharedMemory
SideEffects
No SideEffects
(S)MT Dataflow
SM MP
DSMFine-grainedthreading
PureMP
StreamingDataflow(e.g. SCORE)
Trans-actional
Caltech CS184 Spring2005 -- DeHon31
Big Ideas
• Model
• Implementation support model semantics– But may be very different, optimized
• Scaling and change
• Common case fast; fast case common– Measure to understand