Rebound: Scalable Checkpointing for Coherent Shared Memory
Rishi Agarwal, Pranav Garg, and Josep Torrellas
Department of Computer Science
University of Illinois at Urbana-Champaign
http://iacoma.cs.uiuc.edu
R. Agarwal, P. Garg, J. Torrellas. Rebound: Scalable Checkpointing
Checkpointing in Shared-Memory MPs
• HW-based schemes for small CMPs use global checkpointing
– All procs participate in system-wide checkpoints
• Global checkpointing is not scalable
– Synchronization, bursty movement of data, loss in rollback…
[Figure: timeline of P1 P2 P3 P4 under global checkpointing; labels: checkpoint, save chkpt, Fault, rollback]
Alternative: Coordinated Local Checkpointing
• Idea: threads coordinate their checkpointing in groups
• Rationale:
– Faults propagate only through communication
– Interleaving between non-communicating threads is irrelevant
+ Scalable: checkpoint and rollback in processor groups
– Complexity: record inter-thread dependences dynamically
[Figure: a global checkpoint across P1 P2 P3 P4 P5 compared with local checkpoints taken by groups of those processors]
Contributions
• Leverages directory protocol to track inter-thread dependences
• Optimizations to boost checkpointing efficiency:
  • Delaying write-back of data to safe memory at checkpoints
  • Supporting multiple checkpoints
  • Optimizing checkpointing at barrier synchronization
• Avg. performance overhead for 64 procs: 2%
  • Compared to 15% for global checkpointing
Rebound: First HW-based scheme for scalable, coordinated local checkpointing in coherent shared memory
Background: In-Memory Checkpt with ReVive
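[Figure: ReVive in-memory checkpointing [Prvulovic-02]. P1, P2, P3 with caches sit above memory and a log. During execution, writes (W) create dirty cache lines, and a displacement writeback (WB) logs the old memory value. At a checkpoint (CHK) the application stalls while registers are dumped and dirty cache lines are written back, with logging.]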
Background: In-Memory Checkpt with ReVive
[Figure: ReVive rollback after a fault [Prvulovic-02], shown for P1, P2, P3 with caches, memory and log: old registers are restored, caches are invalidated, and memory lines are reverted from the log. Labels contrast a global broadcast protocol with a local, coordinated, scalable protocol.]
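The logging and rollback steps named on the two ReVive slides can be illustrated with a toy undo log. This is a simplified software model and not ReVive's hardware: the class `UndoLogMemory` and its method names are hypothetical, memory is a plain dictionary, and the model assumes the old value of a line is logged on its first write-back in each interval.

```python
class UndoLogMemory:
    """Toy model of in-memory checkpointing with an undo log (hypothetical names)."""

    def __init__(self):
        self.memory = {}      # line address -> value held in safe memory
        self.log = []         # (address, old value) pairs since the last checkpoint
        self.logged = set()   # addresses already logged in this interval

    def write_back(self, addr, value):
        # Log the old value only the first time a line is written back in an interval.
        if addr not in self.logged:
            self.log.append((addr, self.memory.get(addr)))
            self.logged.add(addr)
        self.memory[addr] = value

    def checkpoint(self, dirty_lines):
        # Write back every dirty cache line, then start a new interval with an empty log.
        for addr, value in dirty_lines.items():
            self.write_back(addr, value)
        self.log.clear()
        self.logged.clear()

    def rollback(self):
        # Revert memory lines in reverse log order; lines with no old value are removed.
        for addr, old in reversed(self.log):
            if old is None:
                self.memory.pop(addr, None)
            else:
                self.memory[addr] = old
        self.log.clear()
        self.logged.clear()


mem = UndoLogMemory()
mem.checkpoint({0x40: 1, 0x80: 2})   # checkpoint: both lines reach safe memory
mem.write_back(0x40, 10)             # displacement after the checkpoint logs old value 1
mem.write_back(0x40, 11)             # second write-back of the same line is not logged again
mem.rollback()                       # fault: memory reverts to the checkpointed state
```

After the rollback, `mem.memory` again holds the values saved at the checkpoint. Register restore and cache invalidation are left out of this sketch.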
Coordinated Local Checkpointing Rules
• Banatre et al. used Coordinated Local checkpointing for bus-based machines [Banatre96]
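[Figure: P1 writes x (wr x) and P2 reads x (rd x), so P1 is the producer and P2 the consumer. Panels show a consumer checkpoint paired with a producer checkpoint, and a producer rollback paired with a consumer rollback.]
• P checkpoints ⇒ P’s producers checkpoint
• P rolls back ⇒ P’s consumers roll back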
Rebound Fault Model
• Any part of the chip can suffer transient or permanent faults
• A fault can occur even during checkpointing
• Off-chip memory and logs suffer no fault on their own (e.g., NVM)
• Fault detection is outside our scope:
  • Fault detection latency has an upper bound of L cycles
[Figure: chip multiprocessor connected to main memory and a log (in SW)]
Rebound Architecture
• Dependence (Dep) registers in the L2 cache controller:
  • MyProducers: bitmap of processors that produced data consumed by the local processor
  • MyConsumers: bitmap of processors that consumed data produced by the local processor
• Processor ID in each directory entry:
  • LW-ID: last writer to the line in the current checkpoint interval
[Figure: chip multiprocessor with P+L1 tiles, each with an L2 cache holding the Dep registers (MyProducers, MyConsumers) and a directory whose entries carry LW-ID; main memory is off chip]
Recording Inter-Thread Dependences
Assume MESI protocol
[Figure sequence: P1 and P2 access one line; memory and log are shown below the caches]
• P1 writes: the line is Dirty (D) in P1 and the directory entry records LW-ID = P1
• P2 reads: the line is written back (with logging) and becomes Shared (S); P1's MyConsumers records P2 and P2's MyProducers records P1
• P1 writes: the line is Dirty in P1 again; LW-ID is P1, and the Dep registers still hold P2 (in P1's MyConsumers) and P1 (in P2's MyProducers)
• P1 checkpoints: writebacks with logging; clear Dep registers; clear LW-ID
• LW-ID should remain set till the line is checkpointed
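The bookkeeping on the recording slides can be sketched in software as follows. This is a toy model under stated assumptions: `Processor` and `Directory` are hypothetical names, only the LW-ID field of a directory entry is modeled, only the read-after-write dependence shown on the slides is recorded, and LW-IDs are cleared eagerly at a checkpoint.

```python
class Processor:
    def __init__(self, pid):
        self.pid = pid
        self.my_producers = 0  # bitmap: bit q set if processor q produced data we consumed
        self.my_consumers = 0  # bitmap: bit q set if processor q consumed data we produced


class Directory:
    """Toy directory that keeps only the LW-ID field per line (hypothetical model)."""

    def __init__(self, procs):
        self.procs = procs   # pid -> Processor
        self.lw_id = {}      # line address -> pid of the last writer in this interval

    def write(self, pid, addr):
        # A write makes pid the last writer of the line.
        self.lw_id[addr] = pid

    def read(self, pid, addr):
        # A read of a line last written by another processor records a dependence
        # in both processors' Dep registers.
        writer = self.lw_id.get(addr)
        if writer is not None and writer != pid:
            self.procs[pid].my_producers |= 1 << writer
            self.procs[writer].my_consumers |= 1 << pid

    def checkpoint(self, pid):
        # Eager variant: clear the Dep registers and every LW-ID naming this processor.
        self.procs[pid].my_producers = 0
        self.procs[pid].my_consumers = 0
        for addr in [a for a, w in self.lw_id.items() if w == pid]:
            del self.lw_id[addr]


procs = {1: Processor(1), 2: Processor(2)}
directory = Directory(procs)
directory.write(1, 0x100)        # P1 writes: LW-ID = P1
directory.read(2, 0x100)         # P2 reads: P1 -> P2 dependence recorded
deps_after_read = (procs[1].my_consumers, procs[2].my_producers)
directory.checkpoint(1)          # P1 checkpoints: its Dep registers and LW-IDs are cleared
```

In this sketch P2's MyProducers still names P1 after P1 checkpoints, because only the checkpointing processor's own registers are cleared.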
Lazily clearing Last Writers
• Clearing LW-IDs at a checkpoint is an expensive process
• Write Signature: encodes all line addresses that the processor has written to (or read exclusively) in the current interval
• At checkpoint, the processors clear their Write Signature
– Potentially stale LW-ID
Lazily clearing Last Writers
[Figure: P2 reads a line whose directory entry still names P1 as LW-ID (stale LW-ID). The address is checked against P1's Write Signature (WSig): Addr? NO! The LW-ID is cleared, and the MyProducers and MyConsumers registers are left unchanged.]
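The stale LW-ID check can be sketched with the Write Signature modeled as a small Bloom filter. That modeling choice, the filter size, the hash function, and the names `WriteSignature` and `resolve_last_writer` are all assumptions made for illustration; the slides only state that the signature encodes the written line addresses.

```python
import hashlib


class WriteSignature:
    """Bloom-filter style signature of written line addresses (sizes are assumptions)."""

    def __init__(self, bits=1024, hashes=2):
        self.bits = bits
        self.hashes = hashes
        self.field = 0

    def _positions(self, addr):
        # Derive `hashes` bit positions from the address.
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{addr}".encode()).digest()
            yield int.from_bytes(digest[:4], "big") % self.bits

    def insert(self, addr):
        for pos in self._positions(addr):
            self.field |= 1 << pos

    def may_contain(self, addr):
        # True for every inserted address; may also be true for others (false positive).
        return all((self.field >> pos) & 1 for pos in self._positions(addr))

    def clear(self):
        self.field = 0


def resolve_last_writer(lw_id, addr, signatures):
    """Return the LW-ID for addr if its writer's signature still covers the address.

    Otherwise the entry is stale: clear it and return None.
    """
    writer = lw_id.get(addr)
    if writer is None:
        return None
    if signatures[writer].may_contain(addr):
        return writer
    del lw_id[addr]
    return None


signatures = {1: WriteSignature()}
lw_id = {}
signatures[1].insert(0x100)      # P1 writes the line in the current interval
lw_id[0x100] = 1
before = resolve_last_writer(lw_id, 0x100, signatures)
signatures[1].clear()            # P1 checkpoints; the directory LW-ID is left in place
after = resolve_last_writer(lw_id, 0x100, signatures)
```

With a Bloom filter, a false positive would only make a stale LW-ID look valid, which in this model adds a spurious dependence rather than dropping a real one.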
Distributed Checkpointing Protocol in SW
• Interaction Set [Pi]: set of producer processors (transitively) for Pi
– Built using MyProducers
• Checkpointing is a 2-phase commit protocol
[Figure sequence over P1 P2 P3 P4:]
• P1 initiates a checkpoint; InteractionSet: P1
• P1 sends Ck? requests to P2 and P3
• P2 and P3 reply Accept, and a further Ck? is sent to P4; InteractionSet: P1, P2, P3
• P4 replies Decline, which is answered with an Ack; P1, P2 and P3 checkpoint (chk)
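The first phase of the protocol, building the interaction set transitively from MyProducers, can be sketched as below. The function name, the use of Python sets in place of bitmaps, and the decline rule (a processor declines when it no longer lists the requester as a consumer) are assumptions for illustration. The commit phase is not modeled, and which processor queries P4 in the slide figure is also assumed.

```python
def build_interaction_set(initiator, my_producers, my_consumers):
    """Collect the processors that would checkpoint together with `initiator`.

    my_producers / my_consumers map pid -> set of pids (set form of the bitmaps).
    A processor accepts a Ck? request only if it still lists the requester as a
    consumer; otherwise it declines.
    """
    interaction = {initiator}
    pending = [initiator]
    declined = set()
    while pending:
        requester = pending.pop()
        for producer in sorted(my_producers.get(requester, set())):
            if producer in interaction:
                continue
            if requester in my_consumers.get(producer, set()):
                interaction.add(producer)             # Accept
                pending.append(producer)              # it queries its own producers next
            else:
                declined.add((requester, producer))   # Decline
    return interaction, declined


# Assumed scenario: P1 consumed from P2 and P3; P3 has a stale producer entry for P4.
my_producers = {1: {2, 3}, 2: set(), 3: {4}, 4: set()}
my_consumers = {1: set(), 2: {1}, 3: {1}, 4: set()}
interaction, declined = build_interaction_set(1, my_producers, my_consumers)
```

Here `interaction` ends up as P1, P2 and P3, while the request from P3 to P4 is declined.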
Distributed Rollback Protocol in SW
• Rollback is handled similarly to the checkpointing protocol:
– Interaction set is built transitively using MyConsumers
• Rollback involves
– Clearing the Dep registers and Write Signature
– Invalidating the processor caches
– Restoring the data and register context from the logs up to the latest checkpoint
• No Domino Effect
Optimization 1: Delayed Writebacks
• Checkpointing overhead is dominated by data writebacks
• Delayed Writeback optimization
  • Processors synchronize and resume execution
  • Hardware automatically writes back dirty lines in the background
  • Checkpoint is only completed when all delayed data is written back
  • Still need to record inter-thread dependences on delayed data
[Figure: two timelines of Interval I1, Checkpoint, Interval I2. In the first, the processor stalls between two syncs while dirty lines are written back (WB dirty lines). In the second, with delayed writebacks, the stall covers only the sync and the dirty lines are written back during Interval I2.]
Delayed Writeback Pros/Cons
+ Significant reduction in checkpoint overhead
- Additional support:
  Each processor has two sets of Dep registers and Write Signature
  Each cache line has a delayed bit
- Increased vulnerability:
  A rollback event forces both intervals to roll back
Delayed Writeback protocol
[Figure: P2 reads a line that is Dirty in P1 (LW-ID = P1); the line is written back with logging and becomes Shared. Each processor holds two register sets (MyProducers0/MyConsumers0 with WSig0, and MyProducers1/MyConsumers1 with WSig1). The address is checked against both Write Signatures (Addr? NO! and Addr? YES!), and the labels show P2 entered in MyConsumers0 and P1 entered in MyProducers1.]
Optimization 2: Multiple Checkpoints
• Problem: Fault detection is not instantaneous
– Checkpoint is safe only after max fault-detection latency (L)
• Solution: Keep multiple checkpoints
– On fault, roll back interacting processors to safe checkpoints
• No Domino Effect
[Figure: timeline with Ckpt 1 and Ckpt 2, each with its own set of Dep registers (Dep registers 1, Dep registers 2); labels: Fault, Detection Latency, tf, Rollback]
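The safety condition can be made concrete with a small calculation. The function name and the time values below are hypothetical; the only input taken from the slides is that a fault is detected at most L cycles after it occurs, so a fault detected at time tf happened no earlier than tf - L.

```python
def safe_checkpoint(checkpoint_times, detect_time, max_latency):
    """Return the time of the most recent checkpoint known to precede the fault.

    The fault occurred no earlier than detect_time - max_latency, so a
    checkpoint taken strictly before that bound cannot hold corrupted state.
    """
    bound = detect_time - max_latency
    safe = [t for t in checkpoint_times if t < bound]
    if not safe:
        raise ValueError("no safe checkpoint retained")
    return max(safe)


# Assumed numbers: checkpoints at cycles 0, 100 and 200, detection latency L = 50.
target_early_detection = safe_checkpoint([0, 100, 200], detect_time=230, max_latency=50)
target_late_detection = safe_checkpoint([0, 100, 200], detect_time=260, max_latency=50)
```

With detection at cycle 230 the checkpoint at 200 is not yet safe, so the rollback target is the checkpoint at 100. With detection at cycle 260 the checkpoint at 200 is safe and becomes the target.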
Multiple Checkpoints: Pros/Cons
+ Realistic system: supports non-instantaneous fault detection
- Additional support:
  Each checkpoint has Dep registers
  Dep registers can be recycled only after the fault detection latency
- Need to track communication across checkpoints
- Combination with Delayed Writebacks: one more Dep register set
Optimization 3: Hiding Chkpt behind Global Barrier
• Global barriers require that all processors communicate
– Leads to global checkpoints
• Optimization:
– Proactively trigger a global checkpoint at a global barrier
– Hide checkpoint overhead behind barrier imbalance spins
Hiding Checkpoint behind Global Barrier
• First arriving processor initiates the checkpoint
• Others: HW writes back data as execution proceeds to the barrier
• Commit checkpoint as the last processor arrives
• After the barrier: few interacting processors

    Lock
    count++
    if (count == numProc)
        I_am_last = TRUE    /* local var */
    Unlock
    if (I_am_last) {
        count = 0
        flag = TRUE
        …
    } else
        while (!flag) {}

[Figure: Processor P1, Processor P2 and Processor P3 executing the barrier. Labels: Update (on the barrier variables), BarCK? and Notify messages between processors, ICHK = {P3}, ICHK = {P2, P3}, ICHK = {P1, P3}, processors spinning on while(!flag), and flag = TRUE set by the last arrival.]
Evaluation Setup
• Analysis tool using Pin + SESC cycle-accurate simulator + DRAMsim
• Applications: SPLASH-2, some PARSEC, Apache
• Simulated CMP architecture with up to 64 threads
• Checkpoint interval: 5–8 ms
• Modeled several environments:
  • Global: baseline global checkpointing
  • Rebound: local checkpointing scheme with delayed writebacks
  • Rebound_NoDWB: Rebound without the delayed writebacks
Avg. Interaction Set: Set of Producer Processors
• Most apps: interaction set is a small set
– Justifies coordinated local checkpointing
– Averages brought up by global barriers
[Figure: average interaction set size per application; value labels 64 and 38]
Checkpoint Execution Overhead
• Rebound’s avg checkpoint execution overhead is 2%
– Compared to 15% for Global
• Delayed Writebacks complement local checkpointing
[Figure: bar chart of % checkpoint overhead (y-axis 0 to 40) for Global, Rebound_NoDWB and Rebound across Barnes, Cholesky, Fft, Fmm, Radix, Lu-C, Lu-NC, Volrend, Water-Sp, Water-Nsq, Radiosity, Ocean, Raytrace and SP2; value labels 2 and 15]
Rebound Scalability
• Rebound is scalable in checkpoint overhead
• Delayed Writebacks help scalability
[Figure: scalability results, constant problem size]
Also in the Paper
• Delayed writebacks are also useful in Global
• Barrier optimization is effective but not universally applicable
• Power increase due to hardware additions < 2%
• Rebound leads to only a 4% increase in coherence traffic
Conclusions
• Leverages directory protocol
• Boosts checkpointing efficiency:
  • Delayed write-backs
  • Multiple checkpoints
  • Barrier optimization
• Avg. execution overhead for 64 procs: 2%
Rebound: First HW-based scheme for scalable, coordinated local checkpointing in coherent shared memory
• Future work:
  • Apply Rebound to non-hardware-coherent machines
  • Scalability to hierarchical directories