2016 Storage Developer Conference. © Osamu Tatebe. All Rights Reserved.
Data Integrity Support for Silent Data Corruption in Gfarm File System
Osamu Tatebe University of Tsukuba
Silent Data Corruption
Data may be corrupted silently:
• Transient soft errors caused by cosmic rays, ...
• RAID firmware bugs, storage software bugs
• RHEL 6.2 & 6.3 – XFS regularly truncating files after crash/reboot [RH Bug 845233]
Data-Intensive Science
Big Data Science:
• High energy physics experiments – LHC, Belle (PB/year)
• Wide-field imaging – Subaru HSC survey, LSST, SDSS (100 TB/year)
• Next-generation DNA sequencers
• Data assimilation in climate science
Not only FLOPS but bytes/sec (I/O bandwidth) is critical
Scalable performance requirement for Parallel File System
Performance targets:

Year  FLOPS  #cores  IO BW     IOPS     Systems
2008  1P     100K    100 GB/s  O(1K)    Jaguar, BG/P
2011  10P    1M      1 TB/s    O(10K)   K, BG/Q
2017  100P   10M     10 TB/s   O(100K)
2022  1E     100M    100 TB/s  O(1M)

IO bandwidth and IOPS are expected to scale out in terms of the number of cores or nodes
Convergence of HPC and Big Data Computing
• Scale-out system R&D such as MapReduce in data centers: exploit local storage due to poor network bandwidth, and tolerate faults since failure is the norm
• On the other hand, MPI and MPI-IO system R&D in HPC relies on high-performance, high-bisection networks
• Complex data analysis requirements in data-intensive science impose the convergence
→ Extreme Big Data
Converged Architecture
[Figure: converged architecture – MDS with compute nodes (clients) that also provide storage]
R&D for scale-out system software is required
Storage System of Converged Architecture
Three tiers and more:
• O(10,000) local storage – more than staging
• O(1,000) file cache system – more than a burst buffer
• O(100) parallel file system
Data locality should be considered for local storage
SNIA-J Extreme Storage Society of Science Study
• Established in May 2016
• Discusses next-generation high-performance storage (extreme storage) beyond HPC, Big Data, and Cloud technologies
• Monthly meetings
• Chair: Osamu Tatebe (University of Tsukuba)
• Members: Fujitsu, Hitachi, NEC, Toshiba, HGST Japan, TÜV Rheinland Japan, CTCSP, SCSK, DDN, KEL, TIS Solution Link, ...
Gfarm file system
• Open source software: http://sf.net/projects/gfarm – 18,000 downloads since March 2007
• Major installations:
  • JLDG (7.8 PB, 7 sites, 39 file servers)
  • HPCI Storage (21.9 PB, 3 sites, 97 file servers)
• Research toward much more scalability and non-volatile storage
• Winner – Large Systems, HPC Storage Challenge at SC06
• Most Innovative Use of Storage in Support of Science Award at SC05
Gfarm file system
• Runtime systems to exploit data locality: Pwrake workflow system, MapReduce, MPI-IO, batch queuing system
• NPO Tsukuba OSS Technology Support Center: http://oss-tsukuba.org/ – support for the Gfarm file system, Gfarm symposium, Gfarm workshop
Two Extremes
[Figure: MDS with compute nodes (clients) spread over sites A, B, and C]
• O(1,000) local storages for data-intensive applications
• O(1,000) file servers in distant sites for global data sharing and archives
Software Component of Gfarm File System
• Master-slave MDS: synchronous replication for fault tolerance, asynchronous replication for disaster recovery
• I/O server for each local storage
• Client software:
  • gfarm2fs to mount the Gfarm file system
  • Gfarm commands for replica management, parallel file copy, and POSIX equivalents
  • Gfarm library API in C, C++, Java
Workflow
• Often used by data-intensive applications
• A collection of tasks with data dependencies
https://confluence.pegasus.isi.edu/display/pegasus/WorkflowGenerator
Pwrake: Data-Intensive Workflow System [Tanaka]
• IO-aware task scheduling:
  • Locality-aware scheduling (CCGrid 2012) – minimize data transfer by multi-constraint graph partitioning
  • Disk-cache-aware scheduling (Cluster 2014) – maximize disk cache hit ratio and avoid the trailing-task problem
• Workflow system based on Rake (Ruby make): http://github.com/masa16/Pwrake/
[Figure: Gfarm processes with local disk caches, and file read performance in Gfarm (HDD, GbE) in MB/s – access from the local disk cache reaches 592 MB/s, versus 39–71 MB/s for the other local/remote cases]
• Rakefile as a workflow language
• Dynamic workflows such as Montage astronomical data analysis are possible
Maximize Locality using Multi-Constraint Graph Partitioning [Tanaka, IEEE CCGrid 2012]
• Data movement reduced by 86%
• Execution time improved by 31%
[Figure: simple graph partitioning leaves parallel tasks unbalanced among nodes; multi-constraint graph partitioning balances them]
Disk Cache Aware Scheduling [Tanaka IEEE Cluster 2014]
[Figure: dispatch order of tasks A1..An and B1..Bn under FIFO, LIFO, LIFO+HRF, and rank-overlap+HRF scheduling, for a workflow DAG with stages A → B → C; idle cores appear under FIFO]

Scheduling policy  FIFO (HRF)  LIFO  LIFO+HRF  Rank Equalization+HRF
Disk cache         ×           ◎     ◎         ○
Trailing task      ○           ×     ○         ○
Task overlap       ×           ×     ×         ○

HRF: Highest Rank First
LIFO+HRF switches between LIFO and HRF depending on the number of remaining tasks (see the sketch below)
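As a rough illustration of this switching idea, the following minimal sketch dispatches tasks LIFO while plenty of work remains and falls back to highest-rank-first near the end. It is not the Pwrake implementation; the task/queue structures and the switching threshold are invented for the example.

/* Hypothetical sketch of LIFO+HRF dispatch (not the Pwrake code).
 * "rank" is the task's depth in the workflow DAG.  While plenty of
 * tasks remain we pop LIFO (most recently enabled task, likely still
 * in the disk cache); when few tasks remain we pop the highest-rank
 * task first (HRF) so deep critical-path tasks do not trail at the end. */
#include <stdio.h>

struct task { const char *name; int rank; };

static struct task ready[1024];   /* ready queue in enqueue order */
static int nready = 0;

void enqueue(const char *name, int rank)
{
    ready[nready].name = name;
    ready[nready].rank = rank;
    nready++;
}

struct task dequeue(int idle_cores)
{
    int pick;
    if (nready > 2 * idle_cores) {        /* enough work left: LIFO (threshold is illustrative) */
        pick = nready - 1;
    } else {                              /* tail of the workflow: HRF */
        pick = 0;
        for (int i = 1; i < nready; i++)
            if (ready[i].rank > ready[pick].rank)
                pick = i;
    }
    struct task t = ready[pick];
    ready[pick] = ready[--nready];        /* remove the picked task by swapping with the last */
    return t;
}

int main(void)
{
    enqueue("A1", 1); enqueue("B1", 2); enqueue("A2", 1); enqueue("B2", 2);
    while (nready > 0) {
        struct task t = dequeue(2);
        printf("run %s (rank %d)\n", t.name, t.rank);
    }
    return 0;
}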
Parallel Execution Tasks over Time
[Figure: number of tasks running in parallel over time under FIFO, LIFO, and LIFO+HRF scheduling]
• LIFO utilizes the disk cache but suffers from the trailing-task problem
• LIFO+HRF utilizes the disk cache and solves the trailing-task problem
HPCI Storage
• HPCI – High Performance Computing Infrastructure: RIKEN AICS ("K"), NII, Hokkaido, Tohoku, Tsukuba, Tokyo, Titech, Nagoya, Kyoto, Osaka, Kyushu, JAMSTEC, ISM, AIST
• A 20 PB single distributed file system consisting of East and West sites
• Single sign-on by Grid Security Infrastructure (GSI) and user identification by Subject DN
• Parallel file replication among sites
• Parallel file staging to/from each supercomputer center
[Figure: West site (AICS), 13.0 PB on 68 servers, and East site (U Tokyo, Tokyo Tech), 9.5 PB on 30 servers, each with MDSs, connected by 10 (~40) Gbps links. Picture courtesy of Hiroshi Harada (U Tokyo)]
How to Use HPCI Storage
% mount.hpci                          # mount
Update proxy certificate for gfarm2fs
timeleft : 167:50:40 (7.0 days)
Mount GfarmFS on /gfarm/hp120273/tatebe
% cd /gfarm/hp120273/tatebe
% gfpcopy -P /work/CSI/tatebe/data .  # parallel copy
....
total_throughput: 70.233735 MB/s
total_time: 93.311284 sec.
% gfncopy -s 2 data                   # specify the number of file replicas
(file replication starts in the background)
IOPS for Directory Creation
[Figure: directory creation performance (op/s) vs. number of clients (0–30) at Tokyo and AICS; peaks of 3,693 op/s and 2,495 op/s; one curve is unstable because it is close to the performance limit, the other is not yet saturated]
• 10,000 op/s and more is achieved without a synchronous slave MDS
I/O bandwidth of HPCI Storage
File copy performance of 300 x 1 GB files to HPCI Storage [MB/s]:

Hokkaido    898
Kyoto       847
Tokyo     1,107
AICS      1,073
End-to-end Data Integrity
• Silent data corruption in large-scale storage: no error is reported, but the data is damaged
→ End-to-end data integrity by the Gfarm file system:
• Checksumming by the client and the storage node
• Checksumming when reading and replicating data
• Write verify
• Corrupted files are moved to /lost+found for automatic recovery
Data Integrity in Gfarm (1)
Writing data:
1. Digest calculation by the client
2. Digest calculation by the storage node
3. Registration in (or comparison against) the metadata
4. Comparison of the digests

Write verify, after six hours by default: the storage node recalculates the digest by sequential access and compares it against the checksum in the metadata.
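As a rough illustration of the client side of this write path, the sketch below keeps a running digest while data is written, so the final checksum can be registered in the metadata (or compared against the digest the storage node computes independently). It uses the OpenSSL EVP API (assumes OpenSSL 1.1 or later); the struct and function names are invented for the example and are not part of the Gfarm API.

/* Sketch of client-side digest calculation during write (step 1 above).
 * The storage node computes its own digest the same way (step 2);
 * on close the digests are registered in, or compared against, the
 * metadata.  Not actual Gfarm code. */
#include <openssl/evp.h>
#include <unistd.h>

struct checksummed_file {
    int fd;                  /* descriptor toward the storage node   */
    EVP_MD_CTX *md;          /* running digest of everything written */
};

int cf_open(struct checksummed_file *f, int fd)
{
    f->fd = fd;
    f->md = EVP_MD_CTX_new();
    return EVP_DigestInit_ex(f->md, EVP_md5(), NULL) == 1 ? 0 : -1;
}

ssize_t cf_write(struct checksummed_file *f, const void *buf, size_t len)
{
    ssize_t n = write(f->fd, buf, len);
    if (n > 0)
        EVP_DigestUpdate(f->md, buf, (size_t)n);  /* step 1: digest calc. */
    return n;
}

int cf_close(struct checksummed_file *f, unsigned char *digest, unsigned int *dlen)
{
    EVP_DigestFinal_ex(f->md, digest, dlen);      /* final client-side checksum */
    EVP_MD_CTX_free(f->md);
    /* step 3: register the digest in the metadata, or compare it against
     * the digest the storage node computed independently (step 2).       */
    return close(f->fd);
}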
Data Integrity in Gfarm (2)
Replicating data (just after writing):
1. Digest calculation on the source storage node
2. Digest calculation on the destination storage node
3. Comparison against (or registration in) the metadata
4. Comparison of the digests

Write verify runs after six hours. When a checksum mismatch is detected, the file is moved to /lost+found.
Data Integrity in Gfarm (3)
Reading data:
1. Digest calculation by the client
2. Digest calculation by the storage node
3. Comparison against (or registration in) the metadata
4. Comparison of the digests

When a checksum mismatch happens, the read returns an I/O error and the file is moved to /lost+found.
• Prevents reading corrupted data
• Automatic repair from file replicas
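The read-side behavior can be sketched as follows: the digest computed while streaming the file is compared against the registered checksum, and on mismatch the read fails with EIO so the corrupted replica can be set aside and repaired from another replica. This is a minimal sketch under those assumptions, not the Gfarm implementation.

/* Sketch of the read-path check: compare the digest computed while
 * reading against the checksum stored in the metadata.  On mismatch,
 * fail with EIO; in Gfarm the replica is then moved to /lost+found and
 * repaired from another replica.  Hypothetical code, not Gfarm's. */
#include <openssl/evp.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

int verify_on_read(FILE *fp, const unsigned char *expected, unsigned int explen)
{
    unsigned char buf[8192], digest[EVP_MAX_MD_SIZE];
    unsigned int dlen;
    size_t n;

    EVP_MD_CTX *md = EVP_MD_CTX_new();
    EVP_DigestInit_ex(md, EVP_md5(), NULL);
    while ((n = fread(buf, 1, sizeof(buf), fp)) > 0)
        EVP_DigestUpdate(md, buf, n);             /* digest calc. while reading */
    EVP_DigestFinal_ex(md, digest, &dlen);
    EVP_MD_CTX_free(md);

    if (dlen != explen || memcmp(digest, expected, dlen) != 0) {
        /* checksum mismatch: the caller returns EIO to the application
         * and moves this replica aside; a good replica serves the read */
        errno = EIO;
        return -1;
    }
    return 0;
}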
OpenSSL digest evaluation

Throughput at 8 KB block size, MB/s:

                                         md5   sha1   sha256   sha512
2.4GHz Xeon E5-2695 v2 (Ivy Bridge-EP)   541   584    218      337
2.4GHz Xeon E5-2665 (Sandy Bridge-EP)    564   585    176      274
2.4GHz Xeon E5620 (Westmere-EP)          483   417    150      238
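Per-core digest rates of this kind can be measured with a small OpenSSL micro-benchmark like the sketch below (assumes OpenSSL 1.1 or later; compile with -lcrypto). It is an illustrative harness, not the exact program used for the numbers above.

/* Micro-benchmark sketch: OpenSSL digest throughput on 8 KB blocks. */
#include <openssl/evp.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

int main(void)
{
    const EVP_MD *algos[] = { EVP_md5(), EVP_sha1(), EVP_sha256(), EVP_sha512() };
    const char  *names[] = { "md5", "sha1", "sha256", "sha512" };
    unsigned char block[8192];               /* 8 KB block, as in the table above */
    unsigned char md[EVP_MAX_MD_SIZE];
    unsigned int mdlen;
    memset(block, 0xa5, sizeof(block));

    for (int i = 0; i < 4; i++) {
        EVP_MD_CTX *ctx = EVP_MD_CTX_new();
        const long iters = 100000;           /* ~800 MB of input per algorithm */
        clock_t t0 = clock();
        EVP_DigestInit_ex(ctx, algos[i], NULL);
        for (long n = 0; n < iters; n++)
            EVP_DigestUpdate(ctx, block, sizeof(block));
        EVP_DigestFinal_ex(ctx, md, &mdlen);
        double sec = (double)(clock() - t0) / CLOCKS_PER_SEC;
        printf("%-7s %8.1f MB/s\n", names[i], iters * sizeof(block) / sec / 1e6);
        EVP_MD_CTX_free(ctx);
    }
    return 0;
}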
Gfarm Performance
[Figure: write bandwidth (MB/s) vs. number of processes (1–16), writing 1 GB x 16 files from a client (2 x 12-core 2.4 GHz Ivy Bridge-EP) to a storage node (2 x 12-core 2.4 GHz Ivy Bridge-EP, tmpfs) over IB FDR IPoIB (6,600 MB/s), with digest calculation on the client and/or storage node]
• No digest: 4,474 MB/s
• md5: 4,070 MB/s
• End-to-end: 3,475 MB/s
Case Study in JLDG
• 7.8 PB, 7 sites, 39 file servers; nation-wide storage for the physics community; 7.2 PB used, 109 M files
• End-to-end data integrity by md5, with write verify enabled
• Aug 19–22, 2016: six damaged files were found by digest mismatch during write verify and replica creation
• Still, no I/O error had been reported
Related work
• ZFS: checksumming for each block; RAID-Z, not only replication, to recover data
Conclusion
• Silent data corruption is not rare but frequent in petascale storage; checksumming is a promising countermeasure
• The Gfarm file system detects silent data corruption by write verify, replica creation, file read, and (partial) scrubbing
• File replicas in distant sites, a native and required feature, can correct corruption without wasting any additional storage capacity
• SNIA-J Extreme Storage Society of Science Study
Contact
Osamu Tatebe
University of Tsukuba
[email protected]