2016 Storage Developer Conference. © Osamu Tatebe. All Rights Reserved.
Data Integrity Support for Silent Data Corruption in Gfarm File System
Osamu Tatebe University of Tsukuba
Silent Data Corruption
Data may be corrupted silently:
• Transient soft errors caused by cosmic rays, ...
• RAID firmware bugs, storage software bugs
• RHEL 6.2 & 6.3 – XFS regularly truncating files after crash/reboot [RH Bug 845233]
Data-Intensive Science
Big Data Science:
• High energy physics experiments – LHC, Belle (PB/year)
• Wide-field imaging – Subaru HSC survey, LSST, SDSS (100 TB/year)
• Next-generation DNA sequencers
• Data assimilation in climate science
Not only FLOPS but bytes/sec (I/O bandwidth) is critical
Scalable performance requirement for Parallel File System
Performance targets:

Year  FLOPS  #cores  IO BW     IOPS     Systems
2008  1P     100K    100 GB/s  O(1K)    Jaguar, BG/P
2011  10P    1M      1 TB/s    O(10K)   K, BG/Q
2017  100P   10M     10 TB/s   O(100K)
2022  1E     100M    100 TB/s  O(1M)

IO bandwidth and IOPS are expected to scale out in terms of the number of cores or nodes
Convergence of HPC and Big Data Computing
• Scale-out system R&D such as MapReduce in data centers: exploit local storage due to poor network bandwidth, and tolerate faults since failure is the norm
• On the other hand, MPI and MPI-IO system R&D in HPC relies on high-performance, high-bisection networks
• Complex data analysis requirements in data-intensive science impose the convergence
→ Extreme Big Data
Converged Architecture
[Figure: converged architecture – MDS with compute nodes (clients) that also provide storage]
R&D for scale-out system software is required
Storage System of Converged Architecture
Three tiers and more:
• O(10,000) local storage – more than staging
• O(1,000) file cache system – more than a burst buffer
• O(100) parallel file system
Data locality should be considered for local storage
SNIA-J Extreme Storage Society of Science Study
• Established in May 2016
• Discusses next-generation high-performance storage (extreme storage) beyond HPC, Big Data, and Cloud technologies
• Monthly meetings
• Chair: Osamu Tatebe (University of Tsukuba)
• Members: Fujitsu, Hitachi, NEC, Toshiba, HGST Japan, TÜV Rheinland Japan, CTCSP, SCSK, DDN, KEL, TIS Solution Link, ...
Gfarm file system
• Open source software: http://sf.net/projects/gfarm – 18,000 downloads since March 2007
• Major installations:
  • JLDG (7.8 PB, 7 sites, 39 file servers)
  • HPCI Storage (21.9 PB, 3 sites, 97 file servers)
• Research toward much more scalability and non-volatile storage
• Winner – Large Systems, HPC Storage Challenge at SC06
• Most Innovative Use of Storage in Support of Science Award at SC05
Gfarm file system
• Runtime systems to exploit data locality: Pwrake workflow system, MapReduce, MPI-IO, batch queuing system
• NPO Tsukuba OSS Technology Support Center: http://oss-tsukuba.org/ – support for the Gfarm file system, Gfarm symposium, Gfarm workshop
Two Extremes
[Figure: MDS with compute nodes (clients) spread over sites A, B, and C]
• O(1,000) local storages for data-intensive applications
• O(1,000) file servers in distant sites for global data sharing and archives
Software Component of Gfarm File System
• Master-slave MDS: synchronous replication for fault tolerance, asynchronous replication for disaster recovery
• I/O server for each local storage
• Client software:
  • gfarm2fs to mount the Gfarm file system
  • Gfarm commands for replica management, parallel file copy, and POSIX equivalents
  • Gfarm library API in C, C++, Java
Workflow
• Often used by data-intensive applications
• A collection of tasks with data dependencies
https://confluence.pegasus.isi.edu/display/pegasus/WorkflowGenerator
Pwrake: Data-Intensive Workflow System [Tanaka]
• IO-aware task scheduling:
  • Locality-aware scheduling (CCGrid 2012) – minimize data transfer by multi-constraint graph partitioning
  • Disk-cache-aware scheduling (Cluster 2014) – maximize disk cache hit ratio and avoid the trailing-task problem
• Workflow system based on Rake (Ruby make): http://github.com/masa16/Pwrake/
[Figure: Gfarm processes with local disk caches, and file read performance in Gfarm (HDD, GbE) in MB/s – access from the local disk cache reaches 592 MB/s, versus 39–71 MB/s for the other local/remote cases]
• Rakefile as a workflow language
• Dynamic workflows such as Montage astronomical data analysis are possible
Maximize Locality using Multi-Constraint Graph Partitioning [Tanaka, IEEE CCGrid 2012]
• Data movement reduced by 86%
• Execution time improved by 31%
[Figure: simple graph partitioning leaves parallel tasks unbalanced among nodes; multi-constraint graph partitioning balances them]
Disk Cache Aware Scheduling [Tanaka IEEE Cluster 2014]
[Figure: dispatch order of tasks A1..An and B1..Bn under FIFO, LIFO, LIFO+HRF, and rank-overlap+HRF scheduling, for a workflow DAG with stages A → B → C; idle cores appear under FIFO]

Scheduling policy  FIFO (HRF)  LIFO  LIFO+HRF  Rank Equalization+HRF
Disk cache         ×           ◎     ◎         ○
Trailing task      ○           ×     ○         ○
Task overlap       ×           ×     ×         ○

HRF: Highest Rank First
LIFO+HRF switches between LIFO and HRF depending on the number of remaining tasks (see the sketch below)
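As a rough illustration of this switching idea, the following minimal sketch dispatches tasks LIFO while plenty of work remains and falls back to highest-rank-first near the end. It is not the Pwrake implementation; the task/queue structures and the switching threshold are invented for the example.

/* Hypothetical sketch of LIFO+HRF dispatch (not the Pwrake code).
 * "rank" is the task's depth in the workflow DAG.  While plenty of
 * tasks remain we pop LIFO (most recently enabled task, likely still
 * in the disk cache); when few tasks remain we pop the highest-rank
 * task first (HRF) so deep critical-path tasks do not trail at the end. */
#include <stdio.h>

struct task { const char *name; int rank; };

static struct task ready[1024];   /* ready queue in enqueue order */
static int nready = 0;

void enqueue(const char *name, int rank)
{
    ready[nready].name = name;
    ready[nready].rank = rank;
    nready++;
}

struct task dequeue(int idle_cores)
{
    int pick;
    if (nready > 2 * idle_cores) {        /* enough work left: LIFO (threshold is illustrative) */
        pick = nready - 1;
    } else {                              /* tail of the workflow: HRF */
        pick = 0;
        for (int i = 1; i < nready; i++)
            if (ready[i].rank > ready[pick].rank)
                pick = i;
    }
    struct task t = ready[pick];
    ready[pick] = ready[--nready];        /* remove the picked task by swapping with the last */
    return t;
}

int main(void)
{
    enqueue("A1", 1); enqueue("B1", 2); enqueue("A2", 1); enqueue("B2", 2);
    while (nready > 0) {
        struct task t = dequeue(2);
        printf("run %s (rank %d)\n", t.name, t.rank);
    }
    return 0;
}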
Parallel Execution Tasks over Time
[Figure: number of tasks running in parallel over time under FIFO, LIFO, and LIFO+HRF scheduling]
• LIFO utilizes the disk cache but suffers from the trailing-task problem
• LIFO+HRF utilizes the disk cache and solves the trailing-task problem
HPCI Storage
• HPCI – High Performance Computing Infrastructure: RIKEN AICS ("K"), NII, Hokkaido, Tohoku, Tsukuba, Tokyo, Titech, Nagoya, Kyoto, Osaka, Kyushu, JAMSTEC, ISM, AIST
• A 20 PB single distributed file system consisting of East and West sites
• Single sign-on by Grid Security Infrastructure (GSI) and user identification by Subject DN
• Parallel file replication among sites
• Parallel file staging to/from each supercomputer center
[Figure: West site (AICS), 13.0 PB on 68 servers, and East site (U Tokyo, Tokyo Tech), 9.5 PB on 30 servers, each with MDSs, connected by 10 (~40) Gbps links. Picture courtesy of Hiroshi Harada (U Tokyo)]
How to Use HPCI Storage
% mount.hpci                          # mount
Update proxy certificate for gfarm2fs
timeleft : 167:50:40 (7.0 days)
Mount GfarmFS on /gfarm/hp120273/tatebe
% cd /gfarm/hp120273/tatebe
% gfpcopy -P /work/CSI/tatebe/data .  # parallel copy
....
total_throughput: 70.233735 MB/s
total_time: 93.311284 sec.
% gfncopy -s 2 data                   # specify the number of file replicas
(file replication starts in the background)
IOPS for Directory Creation
[Figure: directory creation performance (op/s) vs. number of clients (0–30) at Tokyo and AICS; peaks of 3,693 op/s and 2,495 op/s; one curve is unstable because it is close to the performance limit, the other is not yet saturated]
• 10,000 op/s and more is achieved without a synchronous slave MDS
I/O bandwidth of HPCI Storage
File copy performance of 300 x 1 GB files to HPCI Storage [MB/s]:

Hokkaido    898
Kyoto       847
Tokyo     1,107
AICS      1,073
End-to-end Data Integrity
• Silent data corruption in large-scale storage: no error is reported, but the data is damaged
→ End-to-end data integrity by the Gfarm file system:
• Checksumming by the client and the storage node
• Checksumming when reading and replicating data
• Write verify
• Corrupted files are moved to /lost+found for automatic recovery
Data Integrity in Gfarm (1)
Writing data:
1. Digest calculation by the client
2. Digest calculation by the storage node
3. Registration in (or comparison against) the metadata
4. Comparison of the digests

Write verify, after six hours by default: the storage node recalculates the digest by sequential access and compares it against the checksum in the metadata.
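As a rough illustration of the client side of this write path, the sketch below keeps a running digest while data is written, so the final checksum can be registered in the metadata (or compared against the digest the storage node computes independently). It uses the OpenSSL EVP API (assumes OpenSSL 1.1 or later); the struct and function names are invented for the example and are not part of the Gfarm API.

/* Sketch of client-side digest calculation during write (step 1 above).
 * The storage node computes its own digest the same way (step 2);
 * on close the digests are registered in, or compared against, the
 * metadata.  Not actual Gfarm code. */
#include <openssl/evp.h>
#include <unistd.h>

struct checksummed_file {
    int fd;                  /* descriptor toward the storage node   */
    EVP_MD_CTX *md;          /* running digest of everything written */
};

int cf_open(struct checksummed_file *f, int fd)
{
    f->fd = fd;
    f->md = EVP_MD_CTX_new();
    return EVP_DigestInit_ex(f->md, EVP_md5(), NULL) == 1 ? 0 : -1;
}

ssize_t cf_write(struct checksummed_file *f, const void *buf, size_t len)
{
    ssize_t n = write(f->fd, buf, len);
    if (n > 0)
        EVP_DigestUpdate(f->md, buf, (size_t)n);  /* step 1: digest calc. */
    return n;
}

int cf_close(struct checksummed_file *f, unsigned char *digest, unsigned int *dlen)
{
    EVP_DigestFinal_ex(f->md, digest, dlen);      /* final client-side checksum */
    EVP_MD_CTX_free(f->md);
    /* step 3: register the digest in the metadata, or compare it against
     * the digest the storage node computed independently (step 2).       */
    return close(f->fd);
}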
Data Integrity in Gfarm (2)
Replicating data (just after writing):
1. Digest calculation on the source storage node
2. Digest calculation on the destination storage node
3. Comparison against (or registration in) the metadata
4. Comparison of the digests

Write verify runs after six hours. When a checksum mismatch is detected, the file is moved to /lost+found.
Data Integrity in Gfarm (3)
Reading data:
1. Digest calculation by the client
2. Digest calculation by the storage node
3. Comparison against (or registration in) the metadata
4. Comparison of the digests

When a checksum mismatch happens, the read returns an I/O error and the file is moved to /lost+found.
• Prevents reading corrupted data
• Automatic repair from file replicas
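The read-side behavior can be sketched as follows: the digest computed while streaming the file is compared against the registered checksum, and on mismatch the read fails with EIO so the corrupted replica can be set aside and repaired from another replica. This is a minimal sketch under those assumptions, not the Gfarm implementation.

/* Sketch of the read-path check: compare the digest computed while
 * reading against the checksum stored in the metadata.  On mismatch,
 * fail with EIO; in Gfarm the replica is then moved to /lost+found and
 * repaired from another replica.  Hypothetical code, not Gfarm's. */
#include <openssl/evp.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

int verify_on_read(FILE *fp, const unsigned char *expected, unsigned int explen)
{
    unsigned char buf[8192], digest[EVP_MAX_MD_SIZE];
    unsigned int dlen;
    size_t n;

    EVP_MD_CTX *md = EVP_MD_CTX_new();
    EVP_DigestInit_ex(md, EVP_md5(), NULL);
    while ((n = fread(buf, 1, sizeof(buf), fp)) > 0)
        EVP_DigestUpdate(md, buf, n);             /* digest calc. while reading */
    EVP_DigestFinal_ex(md, digest, &dlen);
    EVP_MD_CTX_free(md);

    if (dlen != explen || memcmp(digest, expected, dlen) != 0) {
        /* checksum mismatch: the caller returns EIO to the application
         * and moves this replica aside; a good replica serves the read */
        errno = EIO;
        return -1;
    }
    return 0;
}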
OpenSSL digest evaluation

Throughput at 8 KB block size, MB/s:

                                         md5   sha1   sha256   sha512
2.4GHz Xeon E5-2695 v2 (Ivy Bridge-EP)   541   584    218      337
2.4GHz Xeon E5-2665 (Sandy Bridge-EP)    564   585    176      274
2.4GHz Xeon E5620 (Westmere-EP)          483   417    150      238
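Per-core digest rates of this kind can be measured with a small OpenSSL micro-benchmark like the sketch below (assumes OpenSSL 1.1 or later; compile with -lcrypto). It is an illustrative harness, not the exact program used for the numbers above.

/* Micro-benchmark sketch: OpenSSL digest throughput on 8 KB blocks. */
#include <openssl/evp.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

int main(void)
{
    const EVP_MD *algos[] = { EVP_md5(), EVP_sha1(), EVP_sha256(), EVP_sha512() };
    const char  *names[] = { "md5", "sha1", "sha256", "sha512" };
    unsigned char block[8192];               /* 8 KB block, as in the table above */
    unsigned char md[EVP_MAX_MD_SIZE];
    unsigned int mdlen;
    memset(block, 0xa5, sizeof(block));

    for (int i = 0; i < 4; i++) {
        EVP_MD_CTX *ctx = EVP_MD_CTX_new();
        const long iters = 100000;           /* ~800 MB of input per algorithm */
        clock_t t0 = clock();
        EVP_DigestInit_ex(ctx, algos[i], NULL);
        for (long n = 0; n < iters; n++)
            EVP_DigestUpdate(ctx, block, sizeof(block));
        EVP_DigestFinal_ex(ctx, md, &mdlen);
        double sec = (double)(clock() - t0) / CLOCKS_PER_SEC;
        printf("%-7s %8.1f MB/s\n", names[i], iters * sizeof(block) / sec / 1e6);
        EVP_MD_CTX_free(ctx);
    }
    return 0;
}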
Gfarm Performance
[Figure: write bandwidth (MB/s) vs. number of processes (1–16), writing 1 GB x 16 files from a client (2 x 12-core 2.4 GHz Ivy Bridge-EP) to a storage node (2 x 12-core 2.4 GHz Ivy Bridge-EP, tmpfs) over IB FDR IPoIB (6,600 MB/s), with digest calculation on the client and/or storage node]
• No digest: 4,474 MB/s
• md5: 4,070 MB/s
• End-to-end: 3,475 MB/s
Case Study in JLDG
• 7.8 PB, 7 sites, 39 file servers; nation-wide storage for the physics community; 7.2 PB used, 109 M files
• End-to-end data integrity by md5, with write verify enabled
• Aug 19–22, 2016: six damaged files were found by digest mismatch during write verify and replica creation
• Still, no I/O error had been reported
Related work
• ZFS: checksumming for each block; RAID-Z, not only replication, to recover data
Conclusion
• Silent data corruption is not rare but frequent in petascale storage; checksumming is a promising countermeasure
• The Gfarm file system detects silent data corruption by write verify, replica creation, file read, and (partial) scrubbing
• File replicas in distant sites, a native and required feature, can correct corruption without wasting any additional storage capacity
• SNIA-J Extreme Storage Society of Science Study
Contact
Osamu Tatebe
University of Tsukuba
[email protected]