@spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages...
spcl.inf.ethz.ch
@spcl_eth
MACIEJ BESTA, TORSTEN HOEFLER
Fault Tolerance for Remote Memory Access Programming Models
REMOTE MEMORY ACCESS (RMA) PROGRAMMING
One-sided communication
[Figure: processes p and q, each with its own memory, holding buffers A and B (system pictured: Cray, BlueWaters). put, get, and flush operations move A and B between the two memories, with no active participation of the remote process.]
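The put, get, and flush operations on this slide can be mimicked with a small toy model. The sketch below is only an illustration under stated assumptions: the names `Process`, `Window`, `put`, `get`, `flush`, and `demo` are hypothetical and do not belong to any real RMA library, and the model defers every operation until `flush`, which is a simplification (a real network may complete operations earlier; flush only guarantees that they are complete). The point it shows is that process q never calls anything while its memory is read and written.

```python
# Toy, single-threaded model of one-sided (RMA) communication.
# All names are illustrative assumptions, not a real library API.

class Window:
    """Memory that a process exposes for remote access."""

    def __init__(self, size):
        self.memory = [0] * size


class Process:
    def __init__(self, name, size):
        self.name = name
        self.window = Window(size)
        self._pending = []  # operations issued but not yet completed

    def put(self, target, offset, values):
        """Queue a write of `values` into the target's memory (non-blocking)."""
        self._pending.append(("put", target, offset, list(values), None))

    def get(self, target, offset, count, dest):
        """Queue a read of `count` cells from the target into the list `dest`."""
        self._pending.append(("get", target, offset, count, dest))

    def flush(self):
        """Complete all pending operations, in the order they were issued."""
        for kind, target, offset, arg, dest in self._pending:
            mem = target.window.memory
            if kind == "put":
                mem[offset:offset + len(arg)] = arg
            else:
                dest[:] = mem[offset:offset + arg]
        self._pending.clear()


def demo():
    p = Process("p", 4)
    q = Process("q", 4)
    q.window.memory[2:4] = [7, 8]      # buffer B lives in q's memory

    p.put(q, 0, [1, 2])                # write buffer A into q's memory
    buf = []
    p.get(q, 2, 2, buf)                # read buffer B from q's memory

    before = list(q.window.memory)     # nothing is guaranteed complete yet
    p.flush()                          # now both operations are complete
    after = list(q.window.memory)
    return before, after, buf


if __name__ == "__main__":
    print(demo())
```

Only p issues calls in `demo`; q's memory changes and is read without q executing any matching receive or send.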
REMOTE MEMORY ACCESS PROGRAMMING
Implemented in hardware: NICs in the majority of HPC networks support RDMA
Supported by many HPC libraries and languages
Enables significant speedups over message passing in many types of applications, e.g.:
Speedup of ~1.5 for communication patterns in graph analytics
Speedup of ~1.4-2 in physics computations
[1] R. Gerstenberger, M. Besta, T. Hoefler, Enabling Highly-Scalable Remote Memory Access Programming with MPI-3 One-Sided. ACM/IEEE Supercomputing 2013, SC13, Best Paper Award
[2] D. Petrovic et al., High-performance RMA-based broadcast on the Intel SCC. SPAA’12
FAULT TOLERANCE + RMA
15.8 h MTBF (mean time between failures, for nodes) for the TSUBAME2
Fault tolerance is well studied for message passing
Little research exists on fault tolerance for RMA
Message Passing: coordinated checkpointing (CC); uncoordinated checkpointing and message logging (UC)
RMA: logging memory accesses vs. messages; checkpointing in RMA-based applications; fault tolerance mechanisms and schemes; performance
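To make "logging memory accesses" combined with checkpointing concrete, here is a minimal sketch, assuming a single process whose memory is modified only by remote puts. It is not the scheme presented in this talk: the class `LoggedMemory`, its methods, and `demo` are hypothetical, and both the checkpoint and the access log are assumed to live in storage that survives the failure. Recovery restores the last checkpoint and replays the puts logged since then.

```python
# Toy model: checkpoint plus a log of remote puts, replayed on recovery.
# All names are illustrative assumptions, not the authors' protocol.

class LoggedMemory:
    """Memory of one process, with a checkpoint and a log of remote puts."""

    def __init__(self, size):
        self.cells = [0] * size
        # Assumption: checkpoint and log are kept where a failure of this
        # process cannot destroy them (e.g. on other nodes).
        self.checkpoint = list(self.cells)
        self.log = []  # (offset, values) for puts since the last checkpoint

    def remote_put(self, offset, values):
        """Apply a remote write and record it in the access log."""
        self.cells[offset:offset + len(values)] = values
        self.log.append((offset, list(values)))

    def take_checkpoint(self):
        """Save the current memory; older log entries are no longer needed."""
        self.checkpoint = list(self.cells)
        self.log.clear()

    def fail(self):
        """Simulate a crash: the live memory contents are lost."""
        self.cells = None

    def recover(self):
        """Restore the checkpoint, then replay logged puts in order."""
        self.cells = list(self.checkpoint)
        for offset, values in self.log:
            self.cells[offset:offset + len(values)] = values


def demo():
    m = LoggedMemory(4)
    m.remote_put(0, [1, 2])
    m.take_checkpoint()            # checkpoint holds [1, 2, 0, 0]
    m.remote_put(2, [3])           # logged, not yet checkpointed
    m.remote_put(0, [9])           # logged, not yet checkpointed
    expected = list(m.cells)       # state just before the failure
    m.fail()
    m.recover()
    return expected, m.cells


if __name__ == "__main__":
    print(demo())
```

The sketch covers puts only. Gets, concurrent writers, and the ordering of accesses from several sources are left out and would need extra bookkeeping.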
![Page 42: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/42.jpg)
spcl.inf.ethz.ch
@spcl_eth
8
OVERVIEW OF OUR RESEARCH
![Page 43: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/43.jpg)
spcl.inf.ethz.ch
@spcl_eth
Generic model
8
OVERVIEW OF OUR RESEARCH
![Page 44: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/44.jpg)
spcl.inf.ethz.ch
@spcl_eth
Generic model
8
OVERVIEW OF OUR RESEARCH
![Page 45: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/45.jpg)
spcl.inf.ethz.ch
@spcl_eth
Generic model
8
OVERVIEW OF OUR RESEARCH
![Page 46: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/46.jpg)
spcl.inf.ethz.ch
@spcl_eth
Generic model
8
OVERVIEW OF OUR RESEARCH
![Page 47: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/47.jpg)
spcl.inf.ethz.ch
@spcl_eth
CC in RMAGeneric model
8
OVERVIEW OF OUR RESEARCH
![Page 48: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/48.jpg)
spcl.inf.ethz.ch
@spcl_eth
CC in RMAGeneric model
8
OVERVIEW OF OUR RESEARCH
MP vs. RMA
![Page 49: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/49.jpg)
spcl.inf.ethz.ch
@spcl_eth
CC in RMAGeneric model
8
OVERVIEW OF OUR RESEARCH
MP vs. RMA Schemes
![Page 50: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/50.jpg)
spcl.inf.ethz.ch
@spcl_eth
CC in RMAGeneric model
UC in RMA
8
OVERVIEW OF OUR RESEARCH
MP vs. RMA Schemes
![Page 51: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/51.jpg)
spcl.inf.ethz.ch
@spcl_eth
CC in RMAGeneric model
UC in RMA
8
OVERVIEW OF OUR RESEARCH
MP vs. RMA
MP vs. RMA
Schemes
![Page 52: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/52.jpg)
spcl.inf.ethz.ch
@spcl_eth
CC in RMAGeneric model
UC in RMA
8
OVERVIEW OF OUR RESEARCH
MP vs. RMA
MP vs. RMA
Checkpointing
schemesSchemes
![Page 53: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/53.jpg)
spcl.inf.ethz.ch
@spcl_eth
CC in RMAGeneric model
UC in RMA
8
OVERVIEW OF OUR RESEARCH
MP vs. RMA
MP vs. RMA
Checkpointing
schemesSchemes
Logging accesses
![Page 54: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/54.jpg)
spcl.inf.ethz.ch
@spcl_eth
CC in RMAGeneric model
UC in RMA
Recovery in RMA
8
OVERVIEW OF OUR RESEARCH
MP vs. RMA
MP vs. RMA
Checkpointing
schemesSchemes
Logging accesses
![Page 55: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/55.jpg)
spcl.inf.ethz.ch
@spcl_eth
CC in RMAGeneric model
UC in RMA
Recovery in RMA
8
OVERVIEW OF OUR RESEARCH
MP vs. RMA
MP vs. RMA
Checkpointing
schemesSchemes
Basic scheme
Logging accesses
![Page 56: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/56.jpg)
spcl.inf.ethz.ch
@spcl_eth
CC in RMAGeneric model
UC in RMA
Recovery in RMA
8
OVERVIEW OF OUR RESEARCH
MP vs. RMA
MP vs. RMA
Checkpointing
schemesSchemes
Basic scheme
Extended RMA
semantics
Logging accesses
![Page 57: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/57.jpg)
spcl.inf.ethz.ch
@spcl_eth
Topology-awareness
CC in RMAGeneric model
UC in RMA
Recovery in RMA
8
OVERVIEW OF OUR RESEARCH
MP vs. RMA
MP vs. RMA
Checkpointing
schemesSchemes
Basic scheme
Extended RMA
semantics
Logging accesses
![Page 58: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/58.jpg)
spcl.inf.ethz.ch
@spcl_eth
Topology-awareness
CC in RMAGeneric model
UC in RMA
Recovery in RMA
8
OVERVIEW OF OUR RESEARCH
MP vs. RMA
MP vs. RMA
Model extensions
Checkpointing
schemesSchemes
Basic scheme
Extended RMA
semantics
Logging accesses
![Page 59: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/59.jpg)
spcl.inf.ethz.ch
@spcl_eth
Topology-awareness
CC in RMAGeneric model
UC in RMA
Recovery in RMA
8
OVERVIEW OF OUR RESEARCH
Distribution of
processes
MP vs. RMA
MP vs. RMA
Model extensions
Checkpointing
schemesSchemes
Basic scheme
Extended RMA
semantics
Logging accesses
![Page 60: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/60.jpg)
spcl.inf.ethz.ch
@spcl_eth
Topology-awareness
CC in RMAGeneric model
UC in RMA
Recovery in RMA
8
OVERVIEW OF OUR RESEARCH
Distribution of
processes
MP vs. RMA
MP vs. RMA
Model extensions
Checkpointing
schemesSchemes
Basic scheme
Extended RMA
semantics
Decreasing
failure prob.
Logging accesses
![Page 61: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/61.jpg)
spcl.inf.ethz.ch
@spcl_eth
Holistic fault-tolerance library
Topology-awareness
CC in RMAGeneric model
UC in RMA
Recovery in RMA
8
OVERVIEW OF OUR RESEARCH
Distribution of
processes
MP vs. RMA
MP vs. RMA
Model extensions
Checkpointing
schemesSchemes
Basic scheme
Extended RMA
semantics
Decreasing
failure prob.
Logging accesses
![Page 62: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/62.jpg)
spcl.inf.ethz.ch
@spcl_eth
Holistic fault-tolerance library
Topology-awareness
CC in RMAGeneric model
UC in RMA
Recovery in RMA
8
OVERVIEW OF OUR RESEARCH
Distribution of
processesDesign
MP vs. RMA
MP vs. RMA
Model extensions
Checkpointing
schemesSchemes
Basic scheme
Extended RMA
semantics
Decreasing
failure prob.
Logging accesses
![Page 63: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/63.jpg)
spcl.inf.ethz.ch
@spcl_eth
Holistic fault-tolerance library
Topology-awareness
CC in RMAGeneric model
UC in RMA
Recovery in RMA
8
OVERVIEW OF OUR RESEARCH
Distribution of
processesDesign
MP vs. RMA
MP vs. RMA
Model extensions
Checkpointing
schemes
Optimizations
Checkpoints
on demand
Schemes
Basic scheme
Extended RMA
semantics
Decreasing
failure prob.
Logging accesses
![Page 64: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/64.jpg)
spcl.inf.ethz.ch
@spcl_eth
Holistic fault-tolerance library
Topology-awareness
CC in RMAGeneric model
UC in RMA
Recovery in RMA
8
OVERVIEW OF OUR RESEARCH
PerformanceDistribution of
processesDesign
MP vs. RMA
MP vs. RMA
Model extensions
Checkpointing
schemes
Optimizations
Checkpoints
on demand
Schemes
Basic scheme
Extended RMA
semantics
Decreasing
failure prob.
Logging accesses
![Page 65: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/65.jpg)
![Page 66: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/66.jpg)
OVERVIEW OF OUR RESEARCH
[Diagram: Holistic fault-tolerance library. Labels: Topology-awareness; CC in RMA; Generic model; UC in RMA; Recovery in RMA; Proofs; Performance; Distribution of processes; Design; MP vs. RMA (twice); Model extensions; Deadlock freedom; Correct recovery; Checkpointing schemes; Optimizations; Checkpoints on demand; Schemes; Basic scheme; Extended RMA semantics; Decreasing failure prob.; Logging accesses.]
![Page 67: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/67.jpg)
![Page 68: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/68.jpg)
OVERVIEW OF OUR RESEARCH
![Page 69: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/69.jpg)
![Page 70: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/70.jpg)
![Page 71: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/71.jpg)
![Page 72: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/72.jpg)
![Page 73: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/73.jpg)
![Page 74: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/74.jpg)
![Page 75: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/75.jpg)
COORDINATED CHECKPOINTING (MP)
[Figure: Node 1 ... Node N, each running Proc 1 ... Proc k. Labels: barrier, compute (one per process), barrier.]
![Page 76: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/76.jpg)
![Page 77: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/77.jpg)
COORDINATED CHECKPOINTING (MP)
[Figure: Node 1 ... Node N, each running Proc 1 ... Proc k. Labels: barrier, compute, global rollback, barrier.]
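The barrier, compute, and global-rollback cycle pictured on this slide can be sketched as a toy simulation. This is only an illustrative model, not the authors' implementation: the `Proc` class, its integer `state`, and the helper names `coordinated_checkpoint` and `global_rollback` are assumptions introduced here.

```python
import copy


class Proc:
    """Toy process whose whole state is a single counter (an assumption)."""

    def __init__(self, rank):
        self.rank = rank
        self.state = 0

    def compute(self):
        # Stand-in for one compute phase between two barriers.
        self.state += self.rank + 1


def coordinated_checkpoint(procs):
    """Called at a barrier: every process saves its state at the same point."""
    return [copy.deepcopy(p.state) for p in procs]


def global_rollback(procs, checkpoint):
    """After a failure, all processes return to the last checkpoint."""
    for p, saved in zip(procs, checkpoint):
        p.state = saved


procs = [Proc(rank) for rank in range(4)]

for p in procs:
    p.compute()
ckpt = coordinated_checkpoint(procs)  # barrier + checkpoint

for p in procs:
    p.compute()
procs[2].state = None                 # simulate the failure of one process

global_rollback(procs, ckpt)          # every process rolls back, not only the failed one
states_after_rollback = [p.state for p in procs]

for p in procs:
    p.compute()                       # recompute the lost phase
final_states = [p.state for p in procs]
```

In this sketch a single failure costs every process one compute phase, which is the price of the global rollback.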
![Page 78: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/78.jpg)
![Page 79: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/79.jpg)
![Page 80: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/80.jpg)
![Page 81: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/81.jpg)
![Page 82: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/82.jpg)
![Page 83: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/83.jpg)
![Page 84: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/84.jpg)
![Page 85: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/85.jpg)
![Page 86: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/86.jpg)
![Page 87: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/87.jpg)
![Page 88: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/88.jpg)
![Page 89: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/89.jpg)
COORDINATED CHECKPOINTING (MP)
[Figure: Node 1 ... Node N, each running Proc 1 ... Proc k. Labels: send, coordinated checkpoint, recv.]
![Page 90: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/90.jpg)
![Page 91: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/91.jpg)
![Page 92: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/92.jpg)
![Page 93: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/93.jpg)
![Page 94: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/94.jpg)
![Page 95: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/95.jpg)
![Page 96: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/96.jpg)
COORDINATED CHECKPOINTING (RMA)
[Figure: Node 1 ... Node N, each running Proc 1 ... Proc k. Labels: put, coordinated checkpoint, flush.]
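One plausible reading of the put and flush labels on this slide is that a put is non-blocking, so a checkpoint taken before the flush could miss data that is still in flight. The sketch below models that reading with a hypothetical `Window` class; the class, its `in_flight` queue, and the `checkpoint` method are assumptions made for illustration and do not correspond to any real RMA library API.

```python
class Window:
    """Toy RMA window: target memory plus a queue of in-flight puts (assumed model)."""

    def __init__(self, size):
        self.memory = [0] * size
        self.in_flight = []

    def put(self, index, value):
        # Non-blocking: the value is queued and not yet visible in target memory.
        self.in_flight.append((index, value))

    def flush(self):
        # Completes all outstanding puts at the target.
        for index, value in self.in_flight:
            self.memory[index] = value
        self.in_flight.clear()

    def checkpoint(self):
        # Saves only what is currently visible in memory.
        return list(self.memory)


win = Window(2)
win.put(0, 42)
unsafe = win.checkpoint()  # taken while the put is still in flight
win.flush()
safe = win.checkpoint()    # taken after the flush
```

Here `unsafe` does not contain the value 42 while `safe` does, which is why this toy model flushes before it checkpoints.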
![Page 97: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/97.jpg)
![Page 98: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/98.jpg)
![Page 99: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/99.jpg)
OVERVIEW OF OUR RESEARCH
![Page 100: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/100.jpg)
![Page 101: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/101.jpg)
![Page 102: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/102.jpg)
![Page 103: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/103.jpg)
![Page 104: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/104.jpg)
![Page 105: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/105.jpg)
![Page 106: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/106.jpg)
![Page 107: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/107.jpg)
![Page 108: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/108.jpg)
![Page 109: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/109.jpg)
![Page 110: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/110.jpg)
![Page 111: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/111.jpg)
![Page 112: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/112.jpg)
![Page 113: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/113.jpg)
![Page 114: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/114.jpg)
![Page 115: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/115.jpg)
![Page 116: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/116.jpg)
RMA: EPOCHS
[Figure: Proc p and Proc q, each with its memory. Labels: actions A, B, C, D, E, F; Epoch 0, Epoch 1, Epoch 2.]
Actions are non-blocking.
Data will be valid upon synchronizing memories.
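The two statements on this slide (actions are non-blocking, and data becomes valid only upon synchronization) can be illustrated with a small model of epochs. This is a sketch under stated assumptions: the `RmaProc` class, its `pending` list, and the `synchronize` method are hypothetical names chosen here, and closing an epoch is modelled simply as applying all pending actions at once.

```python
class RmaProc:
    """Toy RMA process; names and structure are illustrative assumptions."""

    def __init__(self, name, memory):
        self.name = name
        self.memory = dict(memory)
        self.epoch = 0
        self.pending = []  # actions issued in the current epoch

    def put(self, target, key, value):
        # Returns immediately; nothing is written at the target yet.
        self.pending.append(("put", target, key, value))

    def get(self, target, key):
        # Returns immediately; the local copy is not valid yet.
        self.pending.append(("get", target, key, None))

    def synchronize(self):
        """Close the epoch: complete every pending action, then start a new epoch."""
        for kind, target, key, value in self.pending:
            if kind == "put":
                target.memory[key] = value
            else:
                self.memory[key] = target.memory[key]
        self.pending = []
        self.epoch += 1


p = RmaProc("p", {})
q = RmaProc("q", {"E": 5})

p.put(q, "A", 1)
visible_before_sync = "A" in q.memory  # the put has been issued but not completed
p.synchronize()                        # end of epoch 0
visible_after_sync = q.memory["A"]

p.get(q, "E")
p.synchronize()                        # end of epoch 1
fetched = p.memory["E"]
```

After the two synchronizations the process is in epoch 2, and both the put and the get have taken effect.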
![Page 117: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/117.jpg)
![Page 118: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/118.jpg)
![Page 119: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/119.jpg)
![Page 120: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/120.jpg)
RMA: EPOCHS
[Figure: Proc p and Proc q. Labels: Epoch 0, Epoch 1; X, Y, Z.]
![Page 121: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/121.jpg)
![Page 122: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/122.jpg)
![Page 123: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/123.jpg)
![Page 124: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/124.jpg)
![Page 125: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/125.jpg)
![Page 126: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/126.jpg)
![Page 127: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/127.jpg)
![Page 128: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/128.jpg)
![Page 129: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/129.jpg)
![Page 130: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/130.jpg)
![Page 131: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/131.jpg)
![Page 132: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/132.jpg)
spcl.inf.ethz.ch
@spcl_eth
16
RMA: THE CONSISTENCY ORDER
Proc p Proc q
A
D
Epoch 0
Epoch 1
Epoch 2
co
co
co
For recovery, a
process has to
replay actions
in the correct
order! co
For this,
we use epoch
counters (EC)
memory
EC
B
C
B Cput put
E Fget getE
F
![Page 133: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/133.jpg)
RMA: THE CONSISTENCY ORDER (slide 16)
[Diagram: timelines of Proc p and Proc q, divided into Epoch 0, Epoch 1, and Epoch 2. The accesses shown are puts of B and C and gets of E and F; A and D are the other data items in the picture. Edges between accesses are labelled "co" (consistency order).]
For recovery, a process has to replay actions in the correct order!
For this, we use epoch counters (EC). The EC is kept in memory; the animation shows it as 0, then 1, then 2.
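A minimal sketch of how an epoch counter can order a replay, not the authors' implementation. Each logged action is tagged with the EC value that was current when it was issued, and the log is sorted by that value before replay. The names `LoggedAction` and `replay_order` are hypothetical, the epoch values are made up, and the sketch assumes that actions sharing one EC value may keep their log order.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoggedAction:
    kind: str    # "put" or "get"
    name: str    # label of the transferred data, e.g. "B"
    epoch: int   # value of the epoch counter (EC) when the action was issued


def replay_order(log):
    """Return the logged actions ordered by epoch counter.

    sorted() is stable, so actions that share an EC value keep the order
    in which they were appended to the log.
    """
    return sorted(log, key=lambda action: action.epoch)


# A log whose entries are stored out of epoch order (illustrative values).
log = [
    LoggedAction("get", "E", epoch=2),
    LoggedAction("put", "B", epoch=0),
    LoggedAction("put", "C", epoch=1),
    LoggedAction("get", "F", epoch=2),
]
ordered = replay_order(log)
print([(a.name, a.epoch) for a in ordered])
```

Sorting on the counter alone is enough here because the sketch never needs to order two actions inside the same epoch.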
![Page 136: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/136.jpg)
OVERVIEW OF OUR RESEARCH (slide 17)
[Overview diagram. Topic boxes: Generic model; CC in RMA; UC in RMA; Recovery in RMA; Proofs; Topology-awareness; Holistic fault-tolerance library.
Labels attached to the boxes: Performance; Distribution of processes; Design; MP vs. RMA (twice); Model extensions; Deadlock freedom; Correct recovery; Checkpointing schemes; Optimizations; Checkpoints on demand; Schemes; Basic scheme; Extended RMA semantics; Decreasing failure prob.; Logging accesses.]
![Page 139: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/139.jpg)
UNCOORDINATED CHECKPOINTING (MP) (slide 18)
[Animation: Node 1 through Node N each host processes Proc 1 ... Proc k, and the processes take uncoordinated checkpoints. One process rolls back to its checkpoint; a dependency then forces a second rollback, and a further dependency forces a third rollback.]
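The cascade in this animation can be sketched as a reachability computation. The snippet is an illustration under assumptions, not something shown on the slide: `dependents` is a hypothetical map from a process to the processes that depend on what it did after its last checkpoint, and `must_roll_back` collects every process that a single rollback drags along.

```python
from collections import deque


def must_roll_back(failed, dependents):
    """Return all processes forced to roll back when `failed` rolls back.

    `dependents[p]` lists the processes whose state depends on what p did
    after its last checkpoint, so a rollback of p invalidates them too.
    """
    rolled_back = {failed}
    queue = deque([failed])
    while queue:
        proc = queue.popleft()
        for other in dependents.get(proc, []):
            if other not in rolled_back:
                rolled_back.add(other)
                queue.append(other)
    return rolled_back


# Hypothetical dependency chain: p1 -> p2 -> p3; p4 is independent.
dependents = {"p1": ["p2"], "p2": ["p3"]}
print(sorted(must_roll_back("p1", dependents)))
```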
![Page 148: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/148.jpg)
UNCOORDINATED CHECKPOINTING (MP) (slide 18, continued)
[Animation: the same nodes and processes, again with uncoordinated checkpoints, but now a message is logged. The recovering process performs only a local rollback, then gets the log and replays the message.]
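A minimal sketch of the logging variant, under stated assumptions: the sender keeps a copy of every message it sends (the slide does not say where the log lives), payloads are unique, and a failed receiver restores its own checkpoint and replays the logged messages. `Process`, `send`, and `recover` are hypothetical names.

```python
class Process:
    def __init__(self, name):
        self.name = name
        self.state = []        # messages applied so far
        self.checkpoint = []   # last uncoordinated checkpoint of `state`
        self.sent_log = []     # (receiver name, payload) for every send

    def take_checkpoint(self):
        self.checkpoint = list(self.state)

    def send(self, receiver, payload):
        self.sent_log.append((receiver.name, payload))  # log the message
        receiver.state.append(payload)

    def recover(self, peers):
        """Local rollback, then replay the messages logged by the peers."""
        self.state = list(self.checkpoint)
        for peer in peers:
            for receiver_name, payload in peer.sent_log:
                # Assumes unique payloads: skip what the checkpoint holds.
                if receiver_name == self.name and payload not in self.state:
                    self.state.append(payload)


p, q = Process("p"), Process("q")
p.send(q, "m1")
q.take_checkpoint()
p.send(q, "m2")
q.state = []              # simulate the failure of q: its state is lost
q.recover(peers=[p])      # p itself never rolls back
print(q.state)
```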
![Page 155: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/155.jpg)
RMA: LOGGING PUTS (slide 19)
Notation: 𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔, …, 𝑑𝑎𝑡𝑎 (an access with its data) and #𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔, … (the same fields without the data).
[Animation: Proc p puts C, and a copy of C appears in memory. The step is labelled "record the put". The data is valid, so the entry (C + #put + EC) is added to the logs of puts kept in memory; the epoch counter is shown as EC 1.]
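The logging step can be sketched as follows. The slide elides the remaining fields of the access with an ellipsis, so the sketch models only the named ones (`src`, `trg`, `data`) plus the epoch counter. `Determinant`, `PutLog`, and `record_put` are hypothetical names, and the sketch does not model which process's memory holds the log.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Determinant:
    """#a: the access without its data (only the fields named on the slide)."""
    src: str
    trg: str


@dataclass
class PutLog:
    entries: list = field(default_factory=list)

    def record_put(self, det, data, epoch_counter):
        # The data of the put is valid when it is recorded, so one entry
        # can hold the data, the determinant, and the EC together.
        self.entries.append((data, det, epoch_counter))


log = PutLog()
log.record_put(Determinant(src="p", trg="q"), data="C", epoch_counter=1)
print(log.entries)
```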
![Page 169: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/169.jpg)
RMA: LOGGING GETS (slide 20)
Notation: 𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔, …, 𝑑𝑎𝑡𝑎 and #𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔, …
[Diagram: Proc p and Proc q with the data items A, B, C, D, E, and F; individual frames show only C or only D.]
![Page 172: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/172.jpg)
RMA: LOGGING GETS (slide 21)
Notation: 𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔, …, 𝑑𝑎𝑡𝑎 and #𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔, …
[Animation: Proc p and Proc q each have a memory; D resides in one of them, and the diagram shows two log areas, "Logs of #gets" and "Logs of gets". A get of D is issued. The data is not yet valid, so the step log(#get + EC) stores only (#get + EC) in the logs of #gets; the epoch counter is shown as EC 1. Once a second copy of D appears, the slide marks "Data is valid", and a later frame adds the note "Now this data may be invalid". The entry (D + #get + EC) is then added to the logs of gets, and the final step is "delete #get and EC", which removes the earlier entry from the logs of #gets.]
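The two-step logging of a get can be sketched as follows. This is an illustration under assumptions rather than the authors' code: `GetLogs`, `issue`, and `complete_get` are hypothetical names, the determinant is reduced to a `(src, trg)` pair, and a made-up `get_id` is used to match the provisional entry with the completed one.

```python
class GetLogs:
    def __init__(self):
        self.pending = {}     # "Logs of #gets": get id -> (determinant, EC)
        self.completed = []   # "Logs of gets": (data, determinant, EC)

    def issue(self, get_id, determinant, epoch_counter):
        # The data is not yet valid, so only #get and EC are logged.
        self.pending[get_id] = (determinant, epoch_counter)

    def complete_get(self, get_id, data):
        # The data is now valid: store it with #get and EC, and delete
        # the provisional entry from the logs of #gets.
        determinant, epoch_counter = self.pending.pop(get_id)
        self.completed.append((data, determinant, epoch_counter))


logs = GetLogs()
logs.issue(get_id=0, determinant=("p", "q"), epoch_counter=1)
in_flight = dict(logs.pending)          # snapshot while the get is pending
logs.complete_get(get_id=0, data="D")
print(in_flight, logs.pending, logs.completed)
```

Keeping the provisional entry separate means the access is recorded even while its data is still unknown.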
![Page 194: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/194.jpg)
OVERVIEW OF OUR RESEARCH (slide 22)
[Overview diagram. Topic boxes: Generic model; CC in RMA; UC in RMA; Recovery in RMA; Proofs; Topology-awareness; Holistic fault-tolerance library.
Labels attached to the boxes: Performance; Distribution of processes; Design; MP vs. RMA (twice); Model extensions; Deadlock freedom; Correct recovery; Checkpointing schemes; Optimizations; Checkpoints on demand; Schemes; Basic scheme; Extended RMA semantics; Decreasing failure prob.; Logging accesses.]
![Page 197: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/197.jpg)
RMA: RECOVERY
Proc p Proc q
memory memory
App. data
Chpt. data
DataP
![Page 198: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/198.jpg)
RMA: RECOVERY
Proc p Proc q
memory memory
App. data
DataP
![Page 199: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/199.jpg)
RMA: RECOVERY
Stage 2: replay actions beyond the checkpoint
Proc p Proc q
memory memory
App. data
DataP
![Page 200: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/200.jpg)
RMA: RECOVERY
Stage 2: replay actions beyond the checkpoint
Proc p Proc q
memory memory
X
Y
Z
App. data
DataP
![Page 202: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/202.jpg)
RMA: RECOVERY
Stage 2: replay actions beyond the checkpoint
Proc p Proc q
memory memory
X
Y
App. data
DataP
![Page 204: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/204.jpg)
RMA: RECOVERY
Stage 2: replay actions beyond the checkpoint
Proc p Proc q
memory memory
memory
Logs of puts
X
Y
App. data
DataP
X + #put + EC( )0
Y + #put + EC( )0
![Page 205: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/205.jpg)
RMA: RECOVERY
Stage 2: replay actions beyond the checkpoint
Proc p Proc q
memory memory
memory
Logs of puts
X
Y
App. data
DataP
Epoch 0
X + #put + EC( )0
Y + #put + EC( )0
![Page 206: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/206.jpg)
RMA: RECOVERY
Stage 2: replay actions beyond the checkpoint
Proc p Proc q
memory memory
memory
Logs of puts
X
Y
X + #put + EC( )0
Y + #put + EC( )0
App. data
DataP
Epoch 0
X + #put + EC( )0
Y + #put + EC( )0
![Page 208: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/208.jpg)
RMA: RECOVERY
Stage 2: replay actions beyond the checkpoint
Proc p Proc q
memory memory
memory
Logs of puts
X + #put + EC( )0
Y + #put + EC( )0
App. data
DataP
X + #put + EC( )0
Y + #put + EC( )0
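The put log on this slide can be read as a list of records, each carrying the written data, an operation tag (`#put`) and an epoch counter (EC) value. A minimal Python sketch of that idea follows. The names `LogEntry` and `puts_to_replay`, the `target` field and the extra entry `W` are hypothetical and not taken from the slides; the sketch also assumes that the surviving process hands over only the entries that targeted the failed process.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogEntry:
    data: str    # payload label of the access, e.g. "X"
    op: str      # "put" or "get"
    target: str  # process whose memory was accessed (assumed field)
    epoch: int   # epoch counter (EC) value stored with the access


def puts_to_replay(log, failed_proc):
    """Return logged puts that targeted failed_proc, oldest epoch first."""
    hits = [e for e in log if e.op == "put" and e.target == failed_proc]
    # sorted() is stable, so entries of the same epoch keep their logged order.
    return sorted(hits, key=lambda e: e.epoch)


put_log = [
    LogEntry("X", "put", "q", 0),
    LogEntry("Y", "put", "q", 0),
    LogEntry("W", "put", "r", 0),  # hypothetical put to a third process
]
selected = puts_to_replay(put_log, "q")
print([e.data for e in selected])
```

Only the entries for the failed process are selected, so a log shared by several targets does not have to be replayed in full.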
![Page 209: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/209.jpg)
RMA: RECOVERY
Stage 2: replay actions beyond the checkpoint
Proc p Proc q
memory memory
Logs of puts
X + #put + EC( )0
Y + #put + EC( )0
App. data
X + #put + EC( )0
Y + #put + EC( )0
DataP
![Page 210: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/210.jpg)
RMA: RECOVERY
Stage 2: replay actions beyond the checkpoint
Proc p Proc q
memory memory
Logs of puts
X + #put + EC( )0
Y + #put + EC( )0
App. data
X + #put + EC( )0
Y + #put + EC( )0
A
B
C
D
E
F
DataP
![Page 212: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/212.jpg)
RMA: RECOVERY
Stage 2: replay actions beyond the checkpoint
Proc p Proc q
memory memory
Logs of puts
X + #put + EC( )0
Y + #put + EC( )0
App. data
X + #put + EC( )0
Y + #put + EC( )0
D
E
F
DataP
![Page 214: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/214.jpg)
RMA: RECOVERY
Stage 2: replay actions beyond the checkpoint
Proc p Proc q
memory memory
Logs of puts
X + #put + EC( )0
Y + #put + EC( )0
App. data
X + #put + EC( )0
Y + #put + EC( )0
D
E
F
Logs of gets
D + #get + EC( )1
E + #get + EC( )2
F + #get + EC( )2
DataP
![Page 217: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/217.jpg)
RMA: RECOVERY
Stage 2: replay actions beyond the checkpoint
Proc p Proc q
memory memory
Logs of puts
X + #put + EC( )0
Y + #put + EC( )0
App. data
X + #put + EC( )0
Y + #put + EC( )0
D
E
F
Logs of gets
E + #get + EC( )2
F + #get + EC( )2
D + #get + EC( )1
DataP
D + #get + EC( )1
D
![Page 219: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/219.jpg)
RMA: RECOVERY
Stage 2: replay actions beyond the checkpoint
Proc p Proc q
memory memory
Logs of puts
X + #put + EC( )0
Y + #put + EC( )0
App. data
X + #put + EC( )0
Y + #put + EC( )0
D
E
F
Logs of gets
D + #get + EC( )1
E + #get + EC( )2
F + #get + EC( )2
DataP
D + #get + EC( )1
D
E + #get + EC( )2
F + #get + EC( )2
![Page 220: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/220.jpg)
RMA: RECOVERY
Stage 2: replay actions beyond the checkpoint
Proc p Proc q
memory memory
updating state...
Logs of puts
X + #put + EC( )0
Y + #put + EC( )0
App. data
X + #put + EC( )0
Y + #put + EC( )0
D
E
F
Logs of gets
D + #get + EC( )1
E + #get + EC( )2
F + #get + EC( )2
DataP
D + #get + EC( )1
D
E + #get + EC( )2
F + #get + EC( )2
![Page 222: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/222.jpg)
RMA: RECOVERY
Stage 2: replay actions beyond the checkpoint
Proc p Proc q
memory memory
updating state...
Logs of puts
X + #put + EC( )0
Y + #put + EC( )0
App. data
D
E
F
Logs of gets
D + #get + EC( )1
E + #get + EC( )2
F + #get + EC( )2
DataP
D + #get + EC( )1
D
E + #get + EC( )2
F + #get + EC( )2
![Page 224: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/224.jpg)
RMA: RECOVERY
Stage 2: replay actions beyond the checkpoint
Proc p Proc q
memory memory
updating state...
Logs of puts
X + #put + EC( )0
Y + #put + EC( )0
App. data
D
E
F
Logs of gets
E + #get + EC( )2
F + #get + EC( )2
DataP
D + #get + EC( )1
D
E + #get + EC( )2
F + #get + EC( )2
![Page 226: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/226.jpg)
RMA: RECOVERY
Stage 2: replay actions beyond the checkpoint
Proc p Proc q
memory memory
updating state...
Logs of puts
X + #put + EC( )0
Y + #put + EC( )0
App. data
D
E
F
Logs of gets
DataP
D + #get + EC( )1
D
E + #get + EC( )2
F + #get + EC( )2
![Page 227: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/227.jpg)
RMA: RECOVERY
Stage 2: replay actions beyond the checkpoint
Proc p Proc q
memory memory
state restored!
Logs of puts
X + #put + EC( )0
Y + #put + EC( )0
App. data
D
E
F
Logs of gets
DataP
D + #get + EC( )1
D
E + #get + EC( )2
F + #get + EC( )2
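The replay stage walked through on these slides can be sketched as merging the logged puts and gets and applying them in epoch-counter order on top of the checkpointed state. The Python sketch below is an illustration under stated assumptions, not the authors' implementation: the function name `replay`, the tuple layout `(cell, value, epoch)` and the cell names `a0` to `a4` are hypothetical, and replaying a get is assumed to mean delivering the logged value into the recovering process's memory again.

```python
def replay(checkpoint, put_log, get_log):
    """Rebuild memory from a checkpoint plus logged accesses.

    Each log is a list of (cell, value, epoch) tuples. Entries are applied in
    non-decreasing epoch order; entries of equal epoch keep their logged order.
    """
    memory = dict(checkpoint)  # Stage 1: start from the checkpointed state
    merged = sorted(put_log + get_log, key=lambda rec: rec[2])
    applied = []
    for cell, value, _epoch in merged:  # Stage 2: replay beyond the checkpoint
        memory[cell] = value
        applied.append(value)
    return memory, applied


checkpoint = {"a0": "old", "a1": "old"}
put_log = [("a0", "X", 0), ("a1", "Y", 0)]
# The gets arrive out of epoch order, as on one of the slides (E, F, then D).
get_log = [("a3", "E", 2), ("a4", "F", 2), ("a2", "D", 1)]

memory, applied = replay(checkpoint, put_log, get_log)
print(applied)
print(memory)
```

Sorting by epoch is what lets the log entries be fetched in any order while still being applied in the order in which the original accesses took effect.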
![Page 228: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/228.jpg)
OVERVIEW OF OUR RESEARCH
Holistic fault-tolerance library
Topology-awareness
CC in RMA
Generic model
UC in RMA
Recovery in RMA
Proofs
Performance
Distribution of processes
Design
MP vs. RMA
MP vs. RMA
Model extensions
Deadlock freedom
Correct recovery
Checkpointing schemes
Optimizations
Checkpoints on demand
Schemes
Basic scheme
Extended RMA semantics
Decreasing failure prob.
Logging accesses
![Page 241: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/241.jpg)
EXTENDING THE PROTOCOLS FOR MORE RESILIENCE
Today’s supercomputers have a hierarchical layout
A single hardware crash may kill multiple processes
The protocols introduced so far usually cannot handle more than one process crash
A Cray XE/XT supercomputer
...4 cabinets:
3 chassis: ...
...8 blades:
4 nodes: ...
32 cores: ...
Up to 128 process failures (assuming 1 process per core)
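The figure of 128 process failures is consistent with losing one blade: 4 nodes times 32 cores, with one process per core. The short Python sketch below works through that arithmetic for each hardware level. The per-level counts are read off the slide, but the nesting (each cabinet holds 3 chassis, each chassis 8 blades, each blade 4 nodes, each node 32 cores) is an assumption about how the diagram is meant to be read, and the names `LEVELS` and `processes_per_unit` are hypothetical.

```python
# (level name, number of units of the next-lower level it contains)
# Assumed nesting: node -> 32 cores, blade -> 4 nodes,
# chassis -> 8 blades, cabinet -> 3 chassis.
LEVELS = [("node", 32), ("blade", 4), ("chassis", 8), ("cabinet", 3)]


def processes_per_unit(procs_per_core=1):
    """Return how many processes are lost if one unit of each level fails."""
    counts = {}
    total = procs_per_core
    for name, fanout in LEVELS:
        total *= fanout
        counts[name] = total
    return counts


counts = processes_per_unit()
for level, lost in counts.items():
    print(f"{level}: {lost} processes")
```

Under these assumptions a single failed blade already takes down far more processes than a protocol that tolerates one crash can handle.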
![Page 246: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/246.jpg)
EXTENDING THE PROTOCOLS FOR MORE RESILIENCE
Step 1: groups of processes: Divide processes into groups of size G each
Proc 1 Proc 2 Proc 3 Proc 4 Proc 5 Proc 6
Proc 7 Proc 8 Proc 9 Proc 10 Proc 11 Proc 12
Proc 13 Proc 14 Proc 15 Proc 16 Proc 17 Proc 18
G
Application data: A B C D E F G H I J K L M N O P Q R
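Dividing the processes into groups of size G can be sketched in a few lines of Python. The slide does not say which processes end up in the same group, so the sketch simply groups consecutive ranks; that choice, the value G = 3 and the name `make_groups` are all assumptions made for illustration.

```python
def make_groups(procs, group_size):
    """Split a list of process ranks into consecutive groups of group_size."""
    if group_size <= 0:
        raise ValueError("group_size must be positive")
    return [procs[i:i + group_size] for i in range(0, len(procs), group_size)]


procs = list(range(1, 19))   # Proc 1 .. Proc 18, as on the slide
G = 3                        # hypothetical group size
groups = make_groups(procs, G)
print(groups)
```

If the number of processes is not a multiple of G, the last group is simply smaller; a real scheme would have to decide how to handle that case.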
![Page 253: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/253.jpg)
spcl.inf.ethz.ch
@spcl_eth
Proc 5 Proc 6
Proc 7 Proc 8 Proc 11 Proc 12
Proc 13 Proc 14 Proc 17 Proc 18
Proc 19 Proc 20 Proc 23 Proc 24
Proc 19 Proc 20 Proc 21 Proc 22 Proc 23 Proc 24
Proc 1 Proc 2 Proc 3 Proc 4
Proc 9 Proc 10
Proc 15 Proc 16
Proc 21 Proc 22
27
EXTENDING THE PROTOCOLS FOR MORE RESILIENCE
Step 1: groups of processes: Divide processes into groups of size G each
Add m parity processes to each group to store the parity data
m
G
A
B
C
X
Y
D
E
F
G
H
I
J
K
L
M
N
O
P
Q
R
![Page 254: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/254.jpg)
spcl.inf.ethz.ch
@spcl_eth
Proc 5 Proc 6
Proc 7 Proc 8 Proc 11 Proc 12
Proc 13 Proc 14 Proc 17 Proc 18
Proc 19 Proc 20 Proc 23 Proc 24
Proc 19 Proc 20 Proc 21 Proc 22 Proc 23 Proc 24
Proc 1 Proc 2 Proc 3 Proc 4
Proc 9 Proc 10
Proc 15 Proc 16
Proc 21 Proc 22
27
EXTENDING THE PROTOCOLS FOR MORE RESILIENCE
Step 1: groups of processes: Divide processes into groups of size G each
Add m parity processes to each group to store the parity data
m
G
A
B
C
X
Y
D
E
F
G
H
I
J
K
L
M
N
O
P
Q
R
Parity data
![Page 255: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/255.jpg)
spcl.inf.ethz.ch
@spcl_eth
Proc 5 Proc 6
Proc 7 Proc 8 Proc 11 Proc 12
Proc 13 Proc 14 Proc 17 Proc 18
Proc 19 Proc 20 Proc 23 Proc 24
Proc 19 Proc 20 Proc 21 Proc 22 Proc 23 Proc 24
Proc 1 Proc 2 Proc 3 Proc 4
Proc 9 Proc 10
Proc 15 Proc 16
Proc 21 Proc 22
27
EXTENDING THE PROTOCOLS FOR MORE RESILIENCE
Step 1: groups of processes: Divide processes into groups of size G each
Add m parity processes to each group to store the parity data
m
G
A
B
C
X
Y
D
E
F
G
H
I
J
K
L
M
N
O
P
Q
R
![Page 256: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/256.jpg)
spcl.inf.ethz.ch
@spcl_eth
Proc 5 Proc 6
Proc 7 Proc 8 Proc 11 Proc 12
Proc 13 Proc 14 Proc 17 Proc 18
Proc 19 Proc 20 Proc 23 Proc 24
Proc 19 Proc 20 Proc 21 Proc 22 Proc 23 Proc 24
Proc 1 Proc 2 Proc 3 Proc 4
Proc 9 Proc 10
Proc 15 Proc 16
Proc 21 Proc 22
27
EXTENDING THE PROTOCOLS FOR MORE RESILIENCE
Step 1: groups of processes: Divide processes into groups of size G each
Add m parity processes to each group to store the parity data
m
G
A
B
C
X
Y
D
E
F
S
T
G
H
I
U
W
J
K
L
Z
V
M
N
O
1
2
P
Q
R
3
4
![Page 257: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/257.jpg)
spcl.inf.ethz.ch
@spcl_eth
Proc 5 Proc 6
Proc 7 Proc 8 Proc 11 Proc 12
Proc 13 Proc 14 Proc 17 Proc 18
Proc 19 Proc 20 Proc 23 Proc 24
Proc 19 Proc 20 Proc 21 Proc 22 Proc 23 Proc 24
Proc 1 Proc 2 Proc 3 Proc 4
Proc 9 Proc 10
Proc 15 Proc 16
Proc 21 Proc 22
27
EXTENDING THE PROTOCOLS FOR MORE RESILIENCE
Step 1: groups of processes: Divide processes into groups of size G each
Add m parity processes to each group to store the parity data
m
G
A
X
Y
D
E
F
S
T
G
H
I
U
W
J
K
L
Z
V
M
N
O
1
2
P
Q
R
3
4
![Page 258: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/258.jpg)
spcl.inf.ethz.ch
@spcl_eth
Proc 5 Proc 6
Proc 7 Proc 8 Proc 11 Proc 12
Proc 13 Proc 14 Proc 17 Proc 18
Proc 19 Proc 20 Proc 23 Proc 24
Proc 19 Proc 20 Proc 21 Proc 22 Proc 23 Proc 24
Proc 1 Proc 2 Proc 3 Proc 4
Proc 9 Proc 10
Proc 15 Proc 16
Proc 21 Proc 22
27
EXTENDING THE PROTOCOLS FOR MORE RESILIENCE
Step 1: groups of processes: Divide processes into groups of size G each
Add m parity processes to each group to store the parity data
m
G
A
X
Y
D
E
F
S
T
G
H
I
U
W
J
K
L
Z
V
M
N
O
1
2
P
Q
R
3
4
![Page 259: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/259.jpg)
spcl.inf.ethz.ch
@spcl_eth
Proc 5 Proc 6
Proc 7 Proc 8 Proc 11 Proc 12
Proc 13 Proc 14 Proc 17 Proc 18
Proc 19 Proc 20 Proc 23 Proc 24
Proc 19 Proc 20 Proc 21 Proc 22 Proc 23 Proc 24
Proc 1 Proc 2 Proc 3 Proc 4
Proc 9 Proc 10
Proc 15 Proc 16
Proc 21 Proc 22
27
EXTENDING THE PROTOCOLS FOR MORE RESILIENCE
Step 1: groups of processes: Divide processes into groups of size G each
Add m parity processes to each group to store the parity data
m
G
A
B
X
Y
D
E
F
S
T
G
H
I
U
W
J
K
L
Z
V
M
N
O
1
2
P
Q
R
3
4
![Page 260: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/260.jpg)
spcl.inf.ethz.ch
@spcl_eth
Proc 5 Proc 6
Proc 7 Proc 8 Proc 11 Proc 12
Proc 13 Proc 14 Proc 17 Proc 18
Proc 19 Proc 20 Proc 23 Proc 24
Proc 19 Proc 20 Proc 21 Proc 22 Proc 23 Proc 24
Proc 1 Proc 2 Proc 3 Proc 4
Proc 9 Proc 10
Proc 15 Proc 16
Proc 21 Proc 22
27
EXTENDING THE PROTOCOLS FOR MORE RESILIENCE
Step 1: groups of processes: Divide processes into groups of size G each
Add m parity processes to each group to store the parity data
m
G
A
B
X
Y
D
E
F
S
T
G
H
I
U
W
J
K
L
Z
V
M
N
O
1
2
P
Q
R
3
4
![Page 261: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/261.jpg)
spcl.inf.ethz.ch
@spcl_eth
Proc 5 Proc 6
Proc 7 Proc 8 Proc 11 Proc 12
Proc 13 Proc 14 Proc 17 Proc 18
Proc 19 Proc 20 Proc 23 Proc 24
Proc 19 Proc 20 Proc 21 Proc 22 Proc 23 Proc 24
Proc 1 Proc 2 Proc 3 Proc 4
Proc 9 Proc 10
Proc 15 Proc 16
Proc 21 Proc 22
27
EXTENDING THE PROTOCOLS FOR MORE RESILIENCE
Step 1: groups of processes: Divide processes into groups of size G each
Add m parity processes to each group to store the parity data
m
G
A
B
C
X
Y
D
E
F
S
T
G
H
I
U
W
J
K
L
Z
V
M
N
O
1
2
P
Q
R
3
4
![Page 262: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/262.jpg)
spcl.inf.ethz.ch
@spcl_eth
Proc 5 Proc 6
Proc 7 Proc 8 Proc 11 Proc 12
Proc 13 Proc 14 Proc 17 Proc 18
Proc 19 Proc 20 Proc 23 Proc 24
Proc 19 Proc 20 Proc 21 Proc 22 Proc 23 Proc 24
Proc 1 Proc 2 Proc 3 Proc 4
Proc 9 Proc 10
Proc 15 Proc 16
Proc 21 Proc 22
27
EXTENDING THE PROTOCOLS FOR MORE RESILIENCE
Step 1: groups of processes: Divide processes into groups of size G each
Add m parity processes to each group to store the parity data
m
G
A
B
C
X
Y
D
E
F
S
T
G
H
I
U
W
J
K
L
Z
V
M
N
O
1
2
P
Q
R
3
4
![Page 263: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/263.jpg)
spcl.inf.ethz.ch
@spcl_eth
Proc 5 Proc 6
Proc 7 Proc 8 Proc 11 Proc 12
Proc 13 Proc 14 Proc 17 Proc 18
Proc 19 Proc 20 Proc 23 Proc 24
Proc 19 Proc 20 Proc 21 Proc 22 Proc 23 Proc 24
Proc 1 Proc 2 Proc 3 Proc 4
Proc 9 Proc 10
Proc 15 Proc 16
Proc 21 Proc 22
27
EXTENDING THE PROTOCOLS FOR MORE RESILIENCE
Step 1: groups of processes: Divide processes into groups of size G each
Add m parity processes to each group to store the parity data
m
G
A
B
C
X
Y
D
E
F
S
T
G
H
I
U
W
J
K
L
Z
V
M
N
O
1
2
P
Q
R
3
4
![Page 264: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/264.jpg)
EXTENDING THE PROTOCOLS FOR MORE RESILIENCE
Step 1: groups of processes
Divide the processes into groups of size G each.
Add m parity processes to each group to store the parity data.
[Figure: Proc 1 to Proc 24 with the labels G and m marking the group size and the number of parity processes. Block labels appear in the order A B C X Y, D E F S T, G H I U W, J K L Z V, M N O 1 2, P Q R 3 4. Intermediate animation frames drop blocks B and C and later show them again.]
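The grouping step can be illustrated with a minimal sketch. It assumes the simplest case of m = 1 with bytewise XOR parity; the slides do not state which code is used, and tolerating m > 1 failures per group would need a stronger erasure code. The names `make_groups`, `xor_parity`, and `recover` are hypothetical helpers introduced only for this illustration.

```python
from functools import reduce

def make_groups(ranks, G):
    """Split a list of process ranks into consecutive groups of size G."""
    return [ranks[i:i + G] for i in range(0, len(ranks), G)]

def xor_parity(blocks):
    """Bytewise XOR of equally sized byte blocks (assumed m = 1 parity scheme)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def recover(surviving_blocks, parity):
    """Rebuild the single missing block of a group from the survivors and the parity."""
    return xor_parity(surviving_blocks + [parity])

# Toy setup: 18 compute processes in groups of G = 3, each holding 4 bytes of data.
groups = make_groups(list(range(1, 19)), G=3)
data = {rank: bytes([rank] * 4) for rank in range(1, 19)}
parity = [xor_parity([data[r] for r in group]) for group in groups]

# Simulate the loss of one process in the first group and rebuild its data.
lost = groups[0][1]
rebuilt = recover([data[r] for r in groups[0] if r != lost], parity[0])
assert rebuilt == data[lost]
```

With XOR parity a group survives exactly one lost member; a second loss in the same group would already be unrecoverable under this assumption.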
![Page 278: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/278.jpg)
EXTENDING THE PROTOCOLS FOR MORE RESILIENCE
Step 2: topology-aware distribution of groups
For example, apply topology-awareness at the level of blades and nodes.
[Figure: Proc 1 to Proc 24 with the data and parity blocks redistributed across the machine. Two rows, A D G J M P and Y T W V 2 4, each hold one block from every group. Intermediate animation frames remove the row A D G J M P and later show it again.]
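One way to picture topology-aware distribution is a round-robin deal over processes sorted by failure domain, sketched below. This is an illustrative assumption, not the placement algorithm from the talk; `topology_aware_groups` and the `domain_of` mapping (process id to node or blade id) are hypothetical names.

```python
from collections import defaultdict

def topology_aware_groups(domain_of, group_size):
    """Deal processes into groups so that no group has two members in one failure domain.

    domain_of: dict mapping process id -> failure domain id (for example a node or blade).
    Raises ValueError if the process count is not a multiple of group_size, or if
    some domain hosts more processes than there are groups.
    """
    procs = sorted(domain_of, key=lambda p: (domain_of[p], p))
    if len(procs) % group_size:
        raise ValueError("process count must be a multiple of the group size")
    num_groups = len(procs) // group_size

    per_domain = defaultdict(int)
    for p in procs:
        per_domain[domain_of[p]] += 1
    if max(per_domain.values(), default=0) > num_groups:
        raise ValueError("a failure domain hosts more processes than there are groups")

    # Processes of one domain are contiguous after sorting, so dealing them
    # round-robin sends each of them to a different group.
    groups = [[] for _ in range(num_groups)]
    for k, p in enumerate(procs):
        groups[k % num_groups].append(p)
    return groups

# Toy machine: 24 processes, 4 per node (6 nodes), groups of 4 processes.
domain_of = {p: (p - 1) // 4 for p in range(1, 25)}
groups = topology_aware_groups(domain_of, group_size=4)
for group in groups:
    assert len({domain_of[p] for p in group}) == len(group)
```

Under this placement the loss of a whole node removes at most one member from any group, which is the property the blade-level and node-level distribution aims for.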
![Page 294: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/294.jpg)
spcl.inf.ethz.ch
@spcl_eth
29
EXTENDING THE PROTOCOLS FOR MORE RESILIENCE
The probability of a catastrophic failure in a multi-level
computing machine A catastrophic failure: a failure that takes place when more than m processes in
the same group die
𝑃𝑐𝑓 =
𝑗=1
ℎ
𝑥𝑗=1
𝐻𝑗
𝑃𝑗(𝑥𝑗 ∩ 𝑥𝑗,𝑐𝑓) =
𝑗=1
ℎ
𝑥𝑗=1
𝐻𝑗
𝑃𝑗 𝑥𝑗 𝑃𝑗(𝑥𝑗,𝑐𝑓|𝑥𝑗)
Probability that xj elements
of level j will fail and cause
a catastrophic failureProbability
that xj elements
of level j will fail
Probability that xj
given failures at level j
are catastrophic
Every number
of xj elements
is considered...
...
...
...
...
...
![Page 295: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/295.jpg)
spcl.inf.ethz.ch
@spcl_eth
29
EXTENDING THE PROTOCOLS FOR MORE RESILIENCE
The probability of a catastrophic failure in a multi-level
computing machine A catastrophic failure: a failure that takes place when more than m processes in
the same group die
𝑃𝑐𝑓 =
𝑗=1
ℎ
𝑥𝑗=1
𝐻𝑗
𝑃𝑗(𝑥𝑗 ∩ 𝑥𝑗,𝑐𝑓) =
𝑗=1
ℎ
𝑥𝑗=1
𝐻𝑗
𝑃𝑗 𝑥𝑗 𝑃𝑗(𝑥𝑗,𝑐𝑓|𝑥𝑗)
Probability that xj elements
of level j will fail and cause
a catastrophic failureProbability
that xj elements
of level j will fail
Probability that xj
given failures at level j
are catastrophic
Every number
of xj elements
is considered...
...at every
hierarchy level
...
...
...
...
...
![Page 296: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/296.jpg)
EXTENDING THE PROTOCOLS FOR MORE RESILIENCE
The probability of a catastrophic failure in a multi-level computing machine
A catastrophic failure: a failure that takes place when more than m processes in the same group die
$$P_{cf} = \sum_{j=1}^{h} \sum_{x_j=1}^{H_j} P_j(x_j \cap x_{j,cf}) = \sum_{j=1}^{h} \sum_{x_j=1}^{H_j} P_j(x_j)\, P_j(x_{j,cf} \mid x_j)$$
$P_j(x_j \cap x_{j,cf})$: probability that $x_j$ elements of level $j$ will fail and cause a catastrophic failure
$P_j(x_j)$: probability that $x_j$ elements of level $j$ will fail
$P_j(x_{j,cf} \mid x_j)$: probability that $x_j$ given failures at level $j$ are catastrophic
Every number of $x_j$ elements is considered, at every hierarchy level
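The double sum on this slide can be evaluated numerically once the two factors are known. The sketch below is only an illustration: it reads the scattered limits as summation bounds (levels j = 1..h, failure counts x_j = 1..H_j), and the function names `binomial_failures` and `p_catastrophic` are hypothetical. The slide does not say how P_j(x_j) or P_j(x_{j,cf} | x_j) are modelled, so the binomial model and the constant conditional probability used in the example are assumptions.

```python
from math import comb


def binomial_failures(H, p):
    """Hypothetical model for P_j(x_j): each of the H elements of a level
    fails independently with probability p (an assumption, not from the slides)."""
    return lambda x: comb(H, x) * p**x * (1 - p) ** (H - x)


def p_catastrophic(levels):
    """Evaluate the double sum over hierarchy levels and failure counts.

    levels: list of (H_j, P_j, P_cf_given) tuples, one per level j = 1..h, where
      P_j(x)        -> probability that x elements of level j fail
      P_cf_given(x) -> probability that x given failures are catastrophic
    """
    total = 0.0
    for H_j, P_j, P_cf_given in levels:      # outer sum over levels j
        for x in range(1, H_j + 1):          # inner sum over x_j = 1..H_j
            total += P_j(x) * P_cf_given(x)
    return total


# One level with two elements; every failure is treated as catastrophic.
example_levels = [(2, binomial_failures(2, 0.5), lambda x: 1.0)]
example_result = p_catastrophic(example_levels)
print(example_result)
```

With the conditional probability fixed at 1, the inner sum reduces to the probability that at least one element fails, which gives a quick sanity check for any P_j model plugged in.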
![Page 298: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/298.jpg)
EXTENDING THE PROTOCOLS FOR MORE RESILIENCE
The probability of a catastrophic failure in a multi-level computing machine
A catastrophic failure: a failure that takes place when more than m processes in the same group die
[Figure labels: P_cf per day; number of parity processes]
![Page 301: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/301.jpg)
OVERVIEW OF OUR RESEARCH
Holistic fault-tolerance library
Topology-awareness
CC in RMA
Generic model
UC in RMA
Recovery in RMA
Proofs
Performance
Distribution of processes
Design
MP vs. RMA
MP vs. RMA
Model extensions
Deadlock freedom
Correct recovery
Checkpointing schemes
Optimizations
Checkpoints on demand
Schemes
Basic scheme
Extended RMA semantics
Decreasing failure prob.
Logging accesses
![Page 310: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/310.jpg)
HOLISTIC RESILIENCE PROTOCOL FOR RMA: THE OVERVIEW
Implemented as FTRMA: a portable fault-tolerance library
Based on C and foMPI, an available MPI-3 RMA implementation [1]
The layered protocol:
transparent logging
uncoordinated checkpointing
coordinated checkpointing
topology-awareness
Diagram labels ("uses" arrows between the layers): demand checkpoints; Daly’s [2] formula, checkpoint interval ≈ √(2MT)
[1] R. Gerstenberger, M. Besta, T. Hoefler, Enabling Highly-Scalable Remote Memory Access Programming with MPI-3 One-Sided. ACM/IEEE Supercomputing 2013, SC13, Best Paper Award
[2] J. T. Daly, A higher order estimate of the optimum checkpoint interval for restart dumps. Future Generation Computer Systems, Volume 22, Issue 3, February 2006, Pages 303-312
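As a rough illustration of the interval rule cited from [2], the sketch below evaluates a square-root estimate of the form sqrt(2 * M * T). Several things here are assumptions: the slide text shows only "≈ 2MT", which is read as having lost its radical sign in extraction; the slide does not define M and T, so one is taken to be the mean time between failures and the other the time to write a checkpoint (the product is symmetric, so the assignment does not change the result); and the function name `checkpoint_interval` is hypothetical. The higher-order corrections from [2] are not included.

```python
from math import sqrt


def checkpoint_interval(mtbf_seconds, checkpoint_cost_seconds):
    """Square-root approximation of the checkpoint interval.

    Assumed reading of the slide: interval ~ sqrt(2 * M * T), with M the mean
    time between failures and T the time needed to write one checkpoint.
    """
    if mtbf_seconds <= 0 or checkpoint_cost_seconds <= 0:
        raise ValueError("both arguments must be positive")
    return sqrt(2.0 * mtbf_seconds * checkpoint_cost_seconds)


# Illustrative numbers only: one failure per day, one minute per checkpoint.
interval = checkpoint_interval(mtbf_seconds=86400.0, checkpoint_cost_seconds=60.0)
print(f"checkpoint roughly every {interval:.0f} s")
```

The estimate grows only with the square root of either input, so halving the checkpoint cost shortens the suggested interval by a factor of about 1.4 rather than 2.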
![Page 311: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/311.jpg)
PERFORMANCE
Evaluation on CSCS Monte Rosa
1,496 computing Cray XE6 nodes
47,872 schedulable cores
46 TB memory
4 protocols
2 applications
![Page 317: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/317.jpg)
PERFORMANCE: COORDINATED CHECKPOINTING
NAS 3D FFT
[Plot: NAS 3D FFT [1] performance]
no-FT: no fault tolerance
f-daly: using Daly’s interval; 1-5% slower than no-FT
f-no-daly: no Daly’s interval; 1-15% slower than no-FT
SCR: a popular checkpoint/restart library; 21-67% slower than no-FT
[1] Nishtala et al., Scaling communication-intensive applications on BlueGene/P using one-sided communication and overlap. IPDPS’09
![Page 321: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/321.jpg)
PERFORMANCE: UNCOORDINATED CHECKPOINTING
NAS 3D FFT
[Plot: NAS 3D FFT [1] performance]
How does the log size impact the number of uncoordinated checkpoints and the performance of the code?
[1] Nishtala et al., Scaling communication-intensive applications on BlueGene/P using one-sided communication and overlap. IPDPS’09
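One way to build intuition for the log-size question is a toy model in which a process takes a checkpoint on demand whenever its access log fills up, and the log is cleared afterwards. This trigger rule, the function name `demand_checkpoints`, and all numbers below are assumptions made for illustration; they are not taken from the slides or from the FTRMA implementation.

```python
def demand_checkpoints(bytes_logged_per_step, steps, log_capacity):
    """Count checkpoints in a toy model: when the log reaches its capacity,
    take one checkpoint and clear the log (hypothetical trigger rule)."""
    used = 0
    checkpoints = 0
    for _ in range(steps):
        used += bytes_logged_per_step
        if used >= log_capacity:
            checkpoints += 1
            used = 0
    return checkpoints


# Same workload, three hypothetical log capacities (in bytes).
for capacity in (50, 100, 2000):
    print(capacity, demand_checkpoints(bytes_logged_per_step=10, steps=100, log_capacity=capacity))
```

In this model a smaller log forces more checkpoints for the same workload, which is the trade-off the question on the slide points at.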
![Page 326: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/326.jpg)
PERFORMANCE: ACCESS LOGGING
NAS 3D FFT
[Plot: NAS 3D FFT [1] performance]
no-FT: no fault tolerance
FTRMA: logging puts (FFT code does not use gets); 8-9% slower than no-FT
ML: a simple protocol based on message logging; 18% slower than no-FT
[1] Nishtala et al., Scaling communication-intensive applications on BlueGene/P using one-sided communication and overlap. IPDPS’09
![Page 333: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/333.jpg)
PERFORMANCE: ACCESS LOGGING
DISTRIBUTED HASHTABLE
[Plot: distributed hashtable performance]
An insert: 1 get and 1 put are logged
A collision: 4 gets and 6 puts are logged
no-FT: no fault tolerance
f-puts: logging puts; 12% slower than no-FT
f-puts-gets: logging puts and gets; 33% slower than no-FT
ML: logging puts and gets; 40% slower than no-FT
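The per-operation counts quoted on this slide make it easy to estimate how many accesses each logging variant records for a given workload. The sketch below is a back-of-the-envelope helper with a hypothetical name, `logged_ops`. It assumes that a colliding insert is charged the collision figures (4 gets, 6 puts) instead of, not in addition to, the plain insert figures (1 get, 1 put); the slide does not state which reading is intended.

```python
def logged_ops(plain_inserts, colliding_inserts, log_gets):
    """Count logged accesses for a hashtable workload.

    log_gets=False corresponds to logging puts only (the f-puts variant),
    log_gets=True to logging puts and gets (the f-puts-gets variant).
    """
    puts = plain_inserts * 1 + colliding_inserts * 6
    gets = (plain_inserts * 1 + colliding_inserts * 4) if log_gets else 0
    return {"puts": puts, "gets": gets, "total": puts + gets}


# Illustrative workload: 100 collision-free inserts and 10 colliding ones.
print(logged_ops(100, 10, log_gets=False))
print(logged_ops(100, 10, log_gets=True))
```

Under these assumptions the put-only variant records roughly half as many accesses as the variant that also logs gets, which is in line with its lower overhead on the slide.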
![Page 341: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/341.jpg)
spcl.inf.ethz.ch
@spcl_eth
Holistic fault-tolerance library
Topology-awareness
CC in RMAGeneric model
UC in RMA
Recovery in RMAProofs
38
OVERVIEW OF OUR RESEARCH
PerformanceDistribution of
processesDesign
MP vs. RMA
MP vs. RMA
Model extensions
Deadlock freedom
Correct recovery
Checkpointing
schemes
Optimizations
Checkpoints
on demand
Schemes
Basic scheme
Extended RMA
semantics
Decreasing
failure prob.
Logging accesses
![Page 342: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/342.jpg)
ACKNOWLEDGMENTS
Thanks to:
Paul Hargrove (and the whole UPC team)
and the MPI Forum RMA WG …
… and the institutions:
![Page 343: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/343.jpg)
Thank you for your attention
![Page 355: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/355.jpg)
RMA: LOGGING PUTS
Proc p, Proc q
C
record the put
memory: Logs of puts
C + #put + EC( )
Data is valid
$a = (\mathit{src}, \mathit{trg}, \ldots, \mathit{data})$
$\#a = (\mathit{src}, \mathit{trg}, \ldots)$
To replay the actions preserving co, we also record epoch counters
EC 2
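The put log on this slide could be sketched roughly as follows. This is only an illustration, not the authors' implementation: the names `PutRecord`, `PutLog`, `record_put`, and `replay`, and the `offset` field, are hypothetical. The sketch also assumes that ordering replayed puts by epoch counter is enough to reproduce the original order.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class PutRecord:
    """Hypothetical log entry: put metadata, the data, and an epoch counter."""
    src: int
    trg: int
    offset: int      # assumed field, stands in for the elided tuple members
    data: bytes
    epoch: int


@dataclass
class PutLog:
    """Hypothetical in-memory log of the puts a process has issued."""
    records: list = field(default_factory=list)

    def record_put(self, src, trg, offset, data, epoch):
        # Copy the data so later changes to the source buffer do not alter the log.
        self.records.append(PutRecord(src, trg, offset, bytes(data), epoch))

    def replay(self, trg, memory):
        """Re-apply the logged puts aimed at `trg` onto `memory`, in epoch order."""
        targeted = (r for r in self.records if r.trg == trg)
        # sorted() is stable, so puts within one epoch keep their logged order.
        for rec in sorted(targeted, key=lambda r: r.epoch):
            memory[rec.offset:rec.offset + len(rec.data)] = rec.data
        return memory

    def clear(self):
        self.records.clear()


log = PutLog()
log.record_put(src=0, trg=1, offset=0, data=b"AB", epoch=1)
log.record_put(src=0, trg=1, offset=1, data=b"CD", epoch=2)
restored = log.replay(trg=1, memory=bytearray(4))
```

Here the second put overlaps the first, so replaying them out of epoch order would leave different bytes in the target memory.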
![Page 379: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/379.jpg)
RMA: CHECKPOINTING
Proc p, Proc q: memory with App. data and Chpt. data
Proc C: a process that stores parity data (memory with Parity data, DataC)
Application state before the checkpoint: DataP in App. data
take a local checkpoint: DataP in Chpt. data
Integrate the data with the checksum
can clear the logs
The Epoch Condition
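The checkpointing steps above could be sketched as below. The slide does not say which code produces the parity data, so this example assumes a plain byte-wise XOR. The helper names `xor_bytes` and `take_checkpoint` are hypothetical.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length buffers."""
    if len(a) != len(b):
        raise ValueError("buffers must have equal length")
    return bytes(x ^ y for x, y in zip(a, b))


def take_checkpoint(app_data: dict, parity: bytes) -> tuple:
    """Copy each process's application data into a local checkpoint and
    fold every copy into the parity buffer held by the parity process."""
    local_checkpoints = {}
    for proc, data in app_data.items():
        local_checkpoints[proc] = bytes(data)                 # take a local checkpoint
        parity = xor_bytes(parity, local_checkpoints[proc])   # integrate with the checksum
    return local_checkpoints, parity


# Assumed example data: two processes with three bytes of application data each.
app_data = {"p": b"\x01\x02\x03", "q": b"\x10\x20\x30"}
checkpoints, parity = take_checkpoint(app_data, parity=bytes(3))
```

The sketch starts from an all-zero parity buffer. A scheme that retakes checkpoints would first have to remove the old contributions from the parity, which is not shown here.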
![Page 381: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/381.jpg)
RMA: CHECKPOINTING
Application state some time later, after some communication...
![Page 384: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/384.jpg)
RMA: CHECKPOINTING
State lost!
![Page 395: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/395.jpg)
RMA: RECOVERY
State lost!
Stage 1: restore the state upon checkpointing
Fetch the data
Decode the data
DataP
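Stage 1 (fetch, then decode) could look like the sketch below. It again assumes XOR parity, which the slide does not confirm. The names `xor_bytes` and `recover` are hypothetical, and the choice of q as the process that lost its state is arbitrary.

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length buffers."""
    if len(a) != len(b):
        raise ValueError("buffers must have equal length")
    return bytes(x ^ y for x, y in zip(a, b))


def recover(parity: bytes, surviving: list) -> bytes:
    """Decode the lost checkpoint by XOR-ing the fetched parity data
    with the checkpoint of every surviving process."""
    return reduce(xor_bytes, surviving, parity)


# Assumed example data: the checkpoints of p and q as they were when the parity was built.
chk_p = b"\x01\x02\x03"
chk_q = b"\x10\x20\x30"
parity = xor_bytes(chk_p, chk_q)

# q loses its state; fetch the parity and p's checkpoint, then decode.
restored_q = recover(parity, [chk_p])
```

With a single XOR parity buffer only one lost checkpoint per group can be decoded.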
![Page 397: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch](https://reader034.fdocuments.in/reader034/viewer/2022050513/5f9d5204818ea11ed647652e/html5/thumbnails/397.jpg)
RMA: RECOVERY
Proc p, Proc q
memory: App. data, Chpt. data
DataP