Inconsistent Regions - Colorado State Universitycs435/slides/week13-A-6.pdf · 2019-11-25 ·...


CS435 Introduction to Big Data, Fall 2019, Colorado State University

11/18/2019, Week 13-A, Sangmi Lee Pallickara

11/18/2019 CS435 Introduction to Big Data – Fall 2019 W13.A.0

CS435 Introduction to Big Data

PART 2. LARGE SCALE DATA STORAGE SYSTEMS
DISTRIBUTED FILE SYSTEMS

Sangmi Lee Pallickara

Computer Science, Colorado State University
http://www.cs.colostate.edu/~cs435

11/18/2019 CS435 Introduction to Big Data – Fall 2019 W13.A.1

FAQs
• Term project presentations
  • 12/9 (team 1-6), 12/11 (team 6-12), 12/13 (team 13-16)
  • Please attend at least 2 presentation sessions and ask questions or provide comments
    • Participation score (attendance + question)
  • 12 minutes (including transition time)/team
  • Submit your slides (No PDF!) on Canvas

11/18/2019 CS435 Introduction to Big Data – Fall 2019 W13.A.2

Today’s topics

• Google File System

11/18/2019 CS435 Introduction to Big Data – Fall 2019 W13.A.3

Inconsistent Regions

[Figure: three replicas of a chunk, each holding Data 1 and Data 2. The append of Data 3 succeeds on two replicas and fails on the third ("Failed"), leaving that region "Empty". User will re-try to store Data 3, after which every replica holds Data 3.]

11/18/2019 CS435 Introduction to Big Data – Fall 2019 W13.A.4

What if record append fails at one of the replicas?

• Client must retry the operation
• Replicas of same chunk may contain
  • Different data
  • Duplicates of the same record
    • In whole or in part
• Replicas of chunks are not bit-wise identical!
  • In most systems, replicas are identical

11/18/2019 CS435 Introduction to Big Data – Fall 2019 W13.A.5

GFS only guarantees that the data will be written at least once as an atomic unit

• For an operation to return success
  • Data must be written at the same offset on all the replicas

• After the write, all replicas are at least as long as the end of the record

• Any future record will be assigned a higher offset or a different chunk
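
The retry behaviour described above can be sketched as a small simulation. This is a simplified model, not GFS code: `Replica`, `record_append`, the slot-per-record layout, and the `failures` counter are all assumptions made for illustration.

```python
class Replica:
    """One replica of a chunk, modeled as a list of record-sized slots."""

    def __init__(self, slots, failures=0):
        self.slots = list(slots)
        self.failures = failures  # hypothetical: how many upcoming writes fail

    def write_at(self, offset, record):
        if self.failures > 0:
            self.failures -= 1
            return False             # nothing is written on this replica
        while len(self.slots) < offset:
            self.slots.append(None)  # padding: an empty (inconsistent) region
        self.slots.append(record)
        return True


def record_append(replicas, record, max_attempts=3):
    """Retry until the record lands at the same offset on every replica."""
    for _ in range(max_attempts):
        # The chosen offset is past the end of the longest replica, so any
        # retry is assigned a higher offset than the failed attempt.
        offset = max(len(r.slots) for r in replicas)
        results = [r.write_at(offset, record) for r in replicas]
        if all(results):
            return offset
    raise IOError("record append failed on every attempt")


if __name__ == "__main__":
    base = ["Data 1", "Data 2"]
    replicas = [Replica(base), Replica(base), Replica(base, failures=1)]
    print(record_append(replicas, "Data 3"))
    for r in replicas:
        print(r.slots)
```

In this model the two healthy replicas end up with a duplicate of Data 3, the replica that failed once ends up with an empty slot, and all three agree only at the offset that was finally returned to the client.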


11/18/2019 CS435 Introduction to Big Data – Fall 2019 W13.A.6

GFS client code implements the file system API

• Communications with master and chunk servers done transparently
  • On behalf of apps that read or write data

• Interact with master for metadata

• Data-bearing communications directly to chunk servers

11/18/2019 CS435 Introduction to Big Data – Fall 2019 W13.A.7

Handling failed “write” in Hadoop HDFS [1/2]

• Different from GFS

1. Pipeline is closed
  • Any packets in the ack queue are added to the front of the data queue
  • Datanodes that are downstream from the failed node will not miss any packets

2. The current block on the good datanodes is given a new identity
  • The new identity is reported to the namenode
  • This lets the partial block on the failed datanode be detected and deleted later on

11/18/2019 CS435 Introduction to Big Data – Fall 2019 W13.A.8

Handling failed “write” in Hadoop HDFS [2/2]

3. Remainder of the block’s data is written to the other good datanodes in the pipeline

4. Namenode notices the block is under-replicated
  • It arranges for a further replica to be created on another node
  • Write quorum
    • dfs.replication.min (defaults to 1)
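
The queue handling in steps 1 to 4 can be sketched as follows. This is an illustrative model only, not the HDFS client implementation: `recover_pipeline`, the packet names, and the datanode names are hypothetical, and `min_replication` stands in for dfs.replication.min.

```python
from collections import deque


def recover_pipeline(data_queue, ack_queue, pipeline, failed_node,
                     min_replication=1):
    """Requeue unacknowledged packets and drop the failed datanode."""
    # Step 1: packets still waiting for an ack go back to the FRONT of the
    # data queue, in their original order, so no downstream node misses any.
    data_queue.extendleft(reversed(ack_queue))
    ack_queue.clear()

    # Step 3: the rest of the block is written to the remaining good nodes.
    survivors = [node for node in pipeline if node != failed_node]

    # Step 4: the write still succeeds as long as the write quorum is met;
    # the namenode is expected to re-replicate the block afterwards.
    if len(survivors) < min_replication:
        raise IOError("fewer good datanodes than the write quorum")
    return survivors


if __name__ == "__main__":
    data_q = deque(["p4", "p5"])   # not yet sent
    ack_q = deque(["p2", "p3"])    # sent, not yet acknowledged
    print(recover_pipeline(data_q, ack_q, ["dn1", "dn2", "dn3"], "dn2"))
    print(list(data_q))
```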

11/18/2019 CS435 Introduction to Big Data – Fall 2019 W13.A.9

Part 2. Large scale data storage system

Distributed File System
Google File System (GFS): Creating Snapshot

11/18/2019 CS435 Introduction to Big Data – Fall 2019 W13.A.10

Snapshots

• Copying file or directory tree almost instantaneously

• Minimizing any interruptions of ongoing mutations

• Providing checkpoint mechanism
  • Users can commit later
  • Rollback

11/18/2019 CS435 Introduction to Big Data – Fall 2019 W13.A.11

Snapshots allow you to make a copy of a file very fast

1. Master revokes outstanding leases for any chunks of the file (source) to be snapshotted

2. Log the operation to disk

3. Update in-memory state
   1. Duplicate metadata of the source file

4. Newly created file points to the “same chunks” as the source


11/18/2019 CS435 Introduction to Big Data – Fall 2019 W13.A.12

When a client wants to write to a chunk C after the snapshot operation

• Master sees “the reference count to C” > 1
  • Pick new chunk-handle C’
• Ask chunk-server with current replica of C
  • Create new chunk C’
  • Data is copied locally, not over the network
• From this point chunk handling of C’ is no different
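
A minimal copy-on-write sketch of the snapshot and first-write steps, assuming a single in-memory table that plays the role of both master metadata and chunk-server storage. The class `SnapshotMaster`, its method names, and the handle format are hypothetical.

```python
import itertools


class SnapshotMaster:
    def __init__(self):
        self.files = {}       # path -> list of chunk handles
        self.refcount = {}    # chunk handle -> number of files pointing at it
        self.chunk_data = {}  # chunk handle -> bytes held by a chunk server
        self._ids = itertools.count(1)

    def _new_handle(self):
        return f"C{next(self._ids)}"

    def create(self, path, chunks):
        handles = []
        for data in chunks:
            handle = self._new_handle()
            self.chunk_data[handle] = bytearray(data)
            self.refcount[handle] = 1
            handles.append(handle)
        self.files[path] = handles

    def snapshot(self, src, dst):
        # Only metadata is duplicated; both files point to the same chunks.
        self.files[dst] = list(self.files[src])
        for handle in self.files[dst]:
            self.refcount[handle] += 1

    def write(self, path, index, data):
        handle = self.files[path][index]
        if self.refcount[handle] > 1:
            # Reference count > 1: pick a new handle C' and copy C locally.
            new_handle = self._new_handle()
            self.chunk_data[new_handle] = bytearray(self.chunk_data[handle])
            self.refcount[handle] -= 1
            self.refcount[new_handle] = 1
            self.files[path][index] = new_handle
            handle = new_handle
        self.chunk_data[handle][:len(data)] = data
        return handle


if __name__ == "__main__":
    m = SnapshotMaster()
    m.create("/a", [b"hello"])
    m.snapshot("/a", "/a.snap")
    m.write("/a", 0, b"J")
    print(m.files, {h: bytes(d) for h, d in m.chunk_data.items()})
```

The snapshot itself touches no chunk data, which is why it is close to instantaneous; the copy is deferred until the first write to a shared chunk.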

11/18/2019 CS435 Introduction to Big Data – Fall 2019 W13.A.13

GFS does not have a per-directory structure that lists files in the directory

• Name spaces represented as a lookup table
  • Maps full pathnames to metadata

• Each node has an associated read/write lock

• File creation does not require a lock on the directory structure
  • No inode needs to be protected from modification

11/18/2019 CS435 Introduction to Big Data – Fall 2019 W13.A.14

Each master operation acquires a set of locks before it runs

• Read lock prevents a directory/file from being deleted, renamed, or snapshotted
  • Others can still read it
  • Others CANNOT mutate (write/append/delete) this file

• Write lock on directory/file names serializes attempts to create a file with the same name twice
  • Others CANNOT mutate (write/append/delete) this file
  • Others CANNOT read this file

• If operation involves /d1/d2/…/dn/leaf
  • Acquire read locks on directory names
    • /d1, /d1/d2, …, /d1/d2/…/dn
  • Read or write lock on full pathname
    • /d1/d2/…/dn/leaf
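
The locking rule above can be sketched in a few lines. The helper names (`lock_set`, `merge`, `conflicts`) and the example paths are assumptions for illustration; the sketch only computes which locks an operation would request and where two operations would collide, it does not implement real locking.

```python
def lock_set(path, leaf_mode):
    """Read locks on every ancestor, `leaf_mode` lock on the full pathname."""
    parts = path.strip("/").split("/")
    locks = {}
    for i in range(1, len(parts)):
        locks["/" + "/".join(parts[:i])] = "read"
    locks["/" + "/".join(parts)] = leaf_mode
    return locks


def merge(*lock_sets):
    """Combine lock sets for an operation on several paths; write wins."""
    merged = {}
    for locks in lock_sets:
        for path, mode in locks.items():
            if merged.get(path) != "write":
                merged[path] = mode
    return merged


def conflicts(a, b):
    """Paths on which two operations cannot hold their locks together."""
    return sorted(p for p in a.keys() & b.keys()
                  if "write" in (a[p], b[p]))


if __name__ == "__main__":
    create_foo = lock_set("/home/user/foo", "write")
    snapshot = merge(lock_set("/home/user", "write"),
                     lock_set("/save/user", "write"))
    print(create_foo)
    print(conflicts(create_foo, snapshot))  # the operations serialize here
```

Two read locks on the same name are compatible, so the conflict list is empty whenever neither operation needs a write lock on a shared name.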

11/18/2019 CS435 Introduction to Big Data – Fall 2019 W13.A.15

Each master operation acquires a set of locks before it runs: Example

• Person A is reading a file mybooks/les_miserable

• Person B is trying to read a file mybooks/les_miserable

11/18/2019 CS435 Introduction to Big Data – Fall 2019 W13.A.16

Each master operation acquires a set of locks before it runs: Example

• Person A is reading a file mybooks/les_miserable
  • mybooks/: read lock
  • mybooks/les_miserable: read lock

• Person B is trying to read a file mybooks/les_miserable
  • mybooks/: read lock
  • mybooks/les_miserable: read lock

11/18/2019 CS435 Introduction to Big Data – Fall 2019 W13.A.17

Each master operation acquires a set of locks before it runs: Example

• Person A is reading a file mybooks/les_miserable

• Person B is trying to create a file mybooks/pride_and_prejudice


11/18/2019 CS435 Introduction to Big Data – Fall 2019 W13.A.18

Each master operation acquires a set of locks before it runs: Example

• Person A is reading a file mybooks/les_miserable
  • mybooks/: read lock
  • mybooks/les_miserable: read lock

• Person B is trying to create a file mybooks/pride_and_prejudice
  • mybooks/: read lock
  • mybooks/pride_and_prejudice: write lock

11/18/2019 CS435 Introduction to Big Data – Fall 2019 W13.A.19

Each master operation acquires a set of locks before it runs: Example

• Person A is trying to create a file mybooks/les_miserable

• Person B is trying to create a file mybooks/les_miserable

11/18/2019 CS435 Introduction to Big Data – Fall 2019 W13.A.20

Each master operation acquires a set of locks before it runs: Example

• Person A is trying to create a file mybooks/les_miserable
  • mybooks/: read lock
  • mybooks/les_miserable: write lock

• Person B is trying to create a file mybooks/les_miserable
  • mybooks/: read lock
  • mybooks/les_miserable: write lock

Write locks will serialize the actions

11/18/2019 CS435 Introduction to Big Data – Fall 2019 W13.A.21

Each master operation acquires a set of locks before it runs: Example

• Person A is trying to create a file mybooks/les_miserable

• Person B is trying to delete a directory mybooks/

11/18/2019 CS435 Introduction to Big Data – Fall 2019 W13.A.22

Each master operation acquires a set of locks before it runs: Example

• Person A is trying to create a file mybooks/les_miserable
  • mybooks/: read lock
  • mybooks/les_miserable: write lock

• Person B is trying to delete a directory mybooks/
  • mybooks/: write lock

Write locks will serialize the actions

11/18/2019 CS435 Introduction to Big Data – Fall 2019 W13.A.23

Locks are used to prevent operations during snapshots

Q: How do we prevent creating a file /home/user/foo while a directory /home/user/ is being snapshotted to /save/user/?

• /home/user is being snapshotted to /save/user
  • Read locks on /home and /save
  • Write lock on /home/user and /save/user

• To create file /home/user/foo
  • Read lock on /home and /home/user
  • Write lock on /home/user/foo

• The two operations will be serialized
  • because they try to obtain conflicting locks on /home/user

• File creation does not require write lock on parent directory … there is no “directory”
  • Read locks on /home and /home/user
  • Write lock on /home/user/foo


11/18/2019 CS435 Introduction to Big Data – Fall 2019 W13.A.24

Part 2. Large scale data storage system

Distributed File System
Google File System (GFS): Deletion of files and garbage collection

11/18/2019 CS435 Introduction to Big Data – Fall 2019 W13.A.25

Garbage collection in GFS

• After a file is deleted, GFS does not reclaim space immediately

• Done lazily during garbage collection at
  • File and chunk levels

11/18/2019 CS435 Introduction to Big Data – Fall 2019 W13.A.26

Master logs a file’s deletion immediately

• File is renamed to a hidden name
  • Includes deletion timestamp

• Master scans the file system namespace
  • Delete if hidden file existed for more than 3 days

• When file removed from namespace
  • In memory metadata is also removed
  • Severs links to all its chunks!

11/18/2019 CS435 Introduction to Big Data – Fall 2019 W13.A.27

Garbage collection: When Master scans its chunk namespace

• Identifies orphaned chunks
  • Not reachable from any file

• Erase metadata for these chunks
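
The lazy deletion and the two scans can be sketched as below. The class `Namespace`, the hidden-name format, and the explicit `now` arguments are assumptions made to keep the example deterministic; only the 3-day grace period comes from the slides.

```python
GRACE_SECONDS = 3 * 24 * 3600  # hidden files older than 3 days are removed


class Namespace:
    def __init__(self, files):
        self.files = {name: list(h) for name, h in files.items()}
        self.deleted_at = {}  # hidden name -> deletion timestamp
        self.chunks = {h for handles in files.values() for h in handles}

    def delete(self, name, now):
        # Deletion is logged and the file is only renamed to a hidden name.
        hidden = f".deleted.{int(now)}.{name}"
        self.files[hidden] = self.files.pop(name)
        self.deleted_at[hidden] = now
        return hidden

    def scan_namespace(self, now):
        # Remove hidden files that outlived the grace period; this drops
        # their metadata and severs the links to their chunks.
        for hidden, stamp in list(self.deleted_at.items()):
            if now - stamp > GRACE_SECONDS:
                del self.files[hidden]
                del self.deleted_at[hidden]

    def scan_chunks(self):
        # Orphaned chunks are those not reachable from any file.
        reachable = {h for handles in self.files.values() for h in handles}
        orphans = self.chunks - reachable
        self.chunks -= orphans  # erase metadata for these chunks
        return orphans


if __name__ == "__main__":
    ns = Namespace({"/logs/a": ["c1", "c2"], "/logs/b": ["c3"]})
    ns.delete("/logs/a", now=0)
    ns.scan_namespace(now=24 * 3600)
    print(ns.scan_chunks())       # nothing reclaimed yet
    ns.scan_namespace(now=4 * 24 * 3600)
    print(ns.scan_chunks())       # chunks of /logs/a are now orphaned
```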

11/18/2019 CS435 Introduction to Big Data – Fall 2019 W13.A.28

The role of heart-beats in garbage collection

• Chunk server reports subset of chunks it currently has

• Master replies with identity of chunks no longer present in the master's metadata
  • Chunk server free to delete its replica of such chunks
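
The heartbeat exchange amounts to a set difference. A minimal sketch, with the hypothetical function name `heartbeat_reply` and made-up chunk handles:

```python
def heartbeat_reply(reported_chunks, master_chunks):
    """Chunks the server reported that the master no longer tracks."""
    return set(reported_chunks) - set(master_chunks)


if __name__ == "__main__":
    master_chunks = {"c1", "c2", "c3"}   # master's chunk metadata
    on_disk = {"c1", "c7", "c9"}         # replicas held by one chunk server
    reported = {"c1", "c7"}              # subset sent in this heartbeat
    deletable = heartbeat_reply(reported, master_chunks)
    on_disk -= deletable                 # server is free to delete these
    print(deletable, on_disk)
```

Because each heartbeat carries only a subset, a chunk such as c9 in this example is simply picked up by a later heartbeat.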

11/18/2019 CS435 Introduction to Big Data – Fall 2019 W13.A.29

Stale chunks and issues

• If a chunk server fails
  • AND misses mutations to the chunk
  • The chunk replica becomes stale

• Working with a stale replica causes problems with:
  • Correctness
  • Consistency


11/18/2019 CS435 Introduction to Big Data – Fall 2019 W13.A.30

Aiding the detection of stale chunks

• Master maintains a chunk version number for each chunk
  • Distinguish between stale and up-to-date chunks

• When master grants a new lease on chunk
  • Increase version number
  • Inform replicas
  • Record new version persistently

Occurs BEFORE any client can write to chunk

11/18/2019 CS435 Introduction to Big Data – Fall 2019 W13.A.31

If a replica is unavailable its version number will not be advanced

• When a chunk server restarts, it reports to the Master with the following:
  • Set of Chunks
  • Corresponding version numbers

• Used to detect stale replicas

• Remove stale replicas in regular garbage collection
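
A minimal sketch of version-based stale detection, assuming in-memory dictionaries in place of the persistent records GFS keeps. `ChunkServer`, `VersionMaster`, and their methods are hypothetical names.

```python
class ChunkServer:
    def __init__(self, name):
        self.name = name
        self.alive = True
        self.versions = {}  # chunk handle -> version held by this replica


class VersionMaster:
    def __init__(self):
        self.versions = {}  # chunk handle -> latest version

    def grant_lease(self, handle, replicas):
        # Bump the version and inform the reachable replicas BEFORE any
        # client can write to the chunk.
        new_version = self.versions.get(handle, 0) + 1
        self.versions[handle] = new_version
        for server in replicas:
            if server.alive:
                server.versions[handle] = new_version
        return new_version

    def stale_chunks(self, server):
        # Called with the (chunk, version) report a restarting server sends.
        return {handle for handle, version in server.versions.items()
                if version < self.versions.get(handle, 0)}


if __name__ == "__main__":
    master, a, b = VersionMaster(), ChunkServer("a"), ChunkServer("b")
    master.grant_lease("c1", [a, b])
    b.alive = False                   # b misses the next mutation
    master.grant_lease("c1", [a, b])
    b.alive = True                    # b restarts and reports its versions
    print(master.stale_chunks(a), master.stale_chunks(b))
```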

11/18/2019 CS435 Introduction to Big Data – Fall 2019 W13.A.32

Additional safeguards against stale replicas

• Include chunk version number
  • When client requests chunk information
  • Client/Chunk server verify version to make sure things are up-to-date

• During cloning operations
  • Clone the most up-to-date chunk

• Clients and chunk servers expected to verify versioning information

11/18/2019 CS435 Introduction to Big Data – Fall 2019 W13.A.33

Part 2. Large scale data storage system

Distributed File System
Google File System (GFS): Data Integrity

11/18/2019 CS435 Introduction to Big Data – Fall 2019 W13.A.34

Data Integrity

• Impractical to detect chunk corruptions across replicas
  • Not bytewise identical in any case!

• Detection of corruption should be self-contained

11/18/2019 CS435 Introduction to Big Data – Fall 2019 W13.A.35

Data Integrity

• Break chunks into 64 KB data blocks

• Compute 32-bit checksum for each block
  • Keep in chunk server memory
  • Store persistently, separate from the data

• Verify checksums of data blocks that overlap read range
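
A sketch of per-block checksumming and a verified read. CRC-32 via Python's `zlib.crc32` is an assumption chosen because it yields a 32-bit value; the slides do not say which checksum function GFS uses. `block_checksums` and `verified_read` are hypothetical helper names.

```python
import zlib

BLOCK_SIZE = 64 * 1024  # 64 KB data blocks


def block_checksums(chunk):
    """One 32-bit checksum per 64 KB block, stored apart from the data."""
    return [zlib.crc32(chunk[i:i + BLOCK_SIZE])
            for i in range(0, len(chunk), BLOCK_SIZE)]


def verified_read(chunk, checksums, offset, length):
    """Verify only the blocks that overlap the read range, then return it."""
    if length <= 0:
        return b""
    first = offset // BLOCK_SIZE
    last = (offset + length - 1) // BLOCK_SIZE
    for block in range(first, last + 1):
        data = chunk[block * BLOCK_SIZE:(block + 1) * BLOCK_SIZE]
        if zlib.crc32(data) != checksums[block]:
            raise IOError(f"checksum mismatch in block {block}")
    return bytes(chunk[offset:offset + length])


if __name__ == "__main__":
    chunk = bytes(range(256)) * 1024          # 256 KB, i.e. 4 blocks
    sums = block_checksums(chunk)
    print(len(sums), verified_read(chunk, sums, 70000, 4))
    corrupted = bytearray(chunk)
    corrupted[70000] ^= 0xFF                  # flip bits inside block 1
    try:
        verified_read(corrupted, sums, 65000, 2000)
    except IOError as err:
        print(err)
```

Detection is self-contained: the chunk server checks its own replica against its own checksums and never compares against other replicas.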


11/18/2019 CS435 Introduction to Big Data – Fall 2019 W13.A.36

Part 2. Large scale data storage system

Distributed File System
Google File System (GFS): Inefficiencies

11/18/2019 CS435 Introduction to Big Data – Fall 2019 W13.A.37

The master server is a single point of failure

• Master server restart takes several seconds

• Shadow servers exist
  • Can handle reads of files
    • In place of the master
  • But not writes

• Requires a massive main memory

11/18/2019 CS435 Introduction to Big Data – Fall 2019 W13.A.38

The system is optimized for large files

• But not for a very large number of very small files

• Primary operation on files
  • Long, sequential reads/writes
  • Large number of random overwrites will clog things up quite a bit

11/18/2019 CS435 Introduction to Big Data – Fall 2019 W13.A.39

Consistency Issues: GFS expects clients to resolve inconsistencies

• File chunks may have gaps or duplicates of some records
  • The client has to be able to deal with this

• Imagine doing this for a scientific application
  • Portions of a massive array are corrupted

• Clients would have to detect this

• HDFS does NOT have this problem

11/18/2019 CS435 Introduction to Big Data – Fall 2019 W13.A.40

Security model

• None
  • Operation is expected to be in a trusted environment

11/18/2019 CS435 Introduction to Big Data – Fall 2019 W13.A.41

Part 2. Large scale data storage system

Distributed File System
Google File System II: Colossus


11/18/2019 CS435 Introduction to Big Data – Fall 2019 W13.A.42

Storage Software: Colossus (GFS2)

• Next-generation cluster-level file system

• Automatically sharded metadata layer
• Distributed Masters (64MB block size → 1MB)
• Data typically written using Reed-Solomon (1.5x)
• Client-driven replication, encoding and replication
• Metadata space has enabled availability

• Why Reed-Solomon?
  • Cost
    • Especially with cross cluster replication
  • More flexible cost vs. availability choices

• Google File System II: Dawn of the Multiplying Master Nodes, http://www.theregister.co.uk/2009/08/12/google_file_system_part_deux/?page=1

11/18/2019 CS435 Introduction to Big Data – Fall 2019 W13.A.43

Reed-Solomon Codes

• Block-based error correcting codes
• Digital communication and storage
  • Storage devices (including tape, CD, DVD, barcodes, etc.)
  • Wireless or mobile communications
  • Satellite communications
  • Digital TV
  • High-speed modems

SOURCE: https://en.wikiversity.org/wiki/Reed–Solomon_codes_for_coders

11/18/2019 CS435 Introduction to Big Data – Fall 2019 W13.A.44

What does the R-S code do?

• Takes a block of digital data

• Adds extra “redundant” bits

• If an error happens, the R-S decoder processes each block and recovers original data

[Figure: Data source → Reed-Solomon Encoder → Communication channel or storage devices (Noise, Errors) → Reed-Solomon Decoder → Data sink]

11/18/2019 CS435 Introduction to Big Data – Fall 2019 W13.A.45

A Quick Example of the R-S encoding

• 4+2 coding

• Original files are broken into 4 pieces

• 2 parity pieces are added

• First piece of data “ABCD”, second piece of data “EFGH”…

Original Data:

  A B C D
  E F G H
  I J K L
  M N O P

11/18/2019 CS435 Introduction to Big Data – Fall 2019 W13.A.46

A Quick Example of the R-S encoding

• Applying coding matrix

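Coding matrix x Original data = Encoded data (4 data rows + 2 parity rows)

  [01 00 00 00]                 [A  B  C  D ]
  [00 01 00 00]     [A B C D]   [E  F  G  H ]
  [00 00 01 00]  x  [E F G H] = [I  J  K  L ]
  [00 00 00 01]     [I J K L]   [M  N  O  P ]
  [1b 1c 12 14]     [M N O P]   [51 52 53 49]
  [1c 1b 14 12]                 [55 56 57 25]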

11/18/2019 CS435 Introduction to Big Data – Fall 2019 W13.A.47

A Quick Example of the R-S encoding

• Data loss
  • 2 of 6 rows are lost

  [01 00 00 00]                 [A  B  C  D ]
  [00 01 00 00]     [A B C D]   [E  F  G  H ]
  [00 00 01 00]  x  [E F G H] = [I  J  K  L ]  (lost)
  [00 00 00 01]     [I J K L]   [M  N  O  P ]  (lost)
  [1b 1c 12 14]     [M N O P]   [51 52 53 49]
  [1c 1b 14 12]                 [55 56 57 25]


11/18/2019 CS435 Introduction to Big Data – Fall 2019 W13.A.48

A Quick Example of the R-S encoding

• Without 2 rows

The rows of the coding matrix that produced the two lost rows are removed, leaving a square matrix:

  [01 00 00 00]     [A B C D]   [A  B  C  D ]
  [00 01 00 00]  x  [E F G H] = [E  F  G  H ]
  [1b 1c 12 14]     [I J K L]   [51 52 53 49]
  [1c 1b 14 12]     [M N O P]   [55 56 57 25]

11/18/2019 CS435 Introduction to Big Data – Fall 2019 W13.A.49

A Quick Example of the R-S encoding

• Multiplying each side with the inverted matrix

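  [01 00 00 00]     [01 00 00 00]     [A B C D]   [01 00 00 00]     [A  B  C  D ]
  [00 01 00 00]  x  [00 01 00 00]  x  [E F G H] = [00 01 00 00]  x  [E  F  G  H ]
  [8d f6 7b 01]     [1b 1c 12 14]     [I J K L]   [8d f6 7b 01]     [51 52 53 49]
  [f6 8d 01 7b]     [1c 1b 14 12]     [M N O P]   [f6 8d 01 7b]     [55 56 57 25]

  (inverted matrix x reduced coding matrix x original data = inverted matrix x surviving rows)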

11/18/2019 CS435 Introduction to Big Data – Fall 2019 W13.A.50

A Quick Example of the R-S encoding

• The Inverse Matrix and the Coding Matrix Cancel Out

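On the left-hand side, the inverted matrix times the reduced coding matrix is the identity, so the two cancel:

  [01 00 00 00]     [01 00 00 00]     [A B C D]   [01 00 00 00]     [A  B  C  D ]
  [00 01 00 00]  x  [00 01 00 00]  x  [E F G H] = [00 01 00 00]  x  [E  F  G  H ]
  [8d f6 7b 01]     [1b 1c 12 14]     [I J K L]   [8d f6 7b 01]     [51 52 53 49]
  [f6 8d 01 7b]     [1c 1b 14 12]     [M N O P]   [f6 8d 01 7b]     [55 56 57 25]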

11/18/2019 CS435 Introduction to Big Data – Fall 2019 W13.A.51

A Quick Example of the R-S encoding

• Reconstructing the Original Data

  [A B C D]   [01 00 00 00]     [A  B  C  D ]
  [E F G H] = [00 01 00 00]  x  [E  F  G  H ]
  [I J K L]   [8d f6 7b 01]     [51 52 53 49]
  [M N O P]   [f6 8d 01 7b]     [55 56 57 25]
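
The encode, lose, invert, and reconstruct steps of the 4+2 example can be sketched end to end. This is a sketch under stated assumptions, not the coding used by Colossus or by the slides: it assumes GF(2^8) with the reduction polynomial 0x11d, and it builds its two parity rows from a Cauchy matrix, so the parity values differ from the ones shown in the slides. All function names are hypothetical.

```python
# GF(2^8) log/antilog tables, assuming polynomial 0x11d and generator 2.
EXP = [0] * 512
LOG = [0] * 256
value = 1
for power in range(255):
    EXP[power] = value
    LOG[value] = power
    value <<= 1
    if value & 0x100:
        value ^= 0x11D
for power in range(255, 512):
    EXP[power] = EXP[power - 255]


def gf_mul(a, b):
    return 0 if a == 0 or b == 0 else EXP[LOG[a] + LOG[b]]


def gf_inv(a):
    if a == 0:
        raise ZeroDivisionError("0 has no inverse in GF(2^8)")
    return EXP[255 - LOG[a]]


def mat_mul(a, b):
    # Addition in GF(2^8) is XOR.
    out = [[0] * len(b[0]) for _ in a]
    for i, row in enumerate(a):
        for j in range(len(b[0])):
            for k, coeff in enumerate(row):
                out[i][j] ^= gf_mul(coeff, b[k][j])
    return out


def mat_inv(m):
    """Gauss-Jordan elimination over GF(2^8)."""
    n = len(m)
    aug = [list(r) + [int(i == j) for j in range(n)] for i, r in enumerate(m)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = gf_inv(aug[col][col])
        aug[col] = [gf_mul(v, scale) for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col]:
                f = aug[r][col]
                aug[r] = [v ^ gf_mul(f, p) for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def coding_matrix(k, m):
    """Identity on top (data rows kept as-is), Cauchy parity rows below."""
    identity = [[int(i == j) for j in range(k)] for i in range(k)]
    parity = [[gf_inv((k + i) ^ j) for j in range(k)] for i in range(m)]
    return identity + parity


def decode(coding, encoded, keep):
    """Rebuild the original rows from any k surviving row indices."""
    sub = [coding[i] for i in keep]
    return mat_mul(mat_inv(sub), [encoded[i] for i in keep])


if __name__ == "__main__":
    data = [list(b"ABCD"), list(b"EFGH"), list(b"IJKL"), list(b"MNOP")]
    coding = coding_matrix(4, 2)
    encoded = mat_mul(coding, data)          # 6 rows: 4 data + 2 parity
    recovered = decode(coding, encoded, keep=[0, 1, 4, 5])  # rows 2, 3 lost
    print([bytes(row).decode() for row in recovered])
```

The Cauchy construction is used here because every square submatrix of a Cauchy matrix is invertible, which guarantees that any 4 of the 6 rows are enough to rebuild the data.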

11/18/2019 CS435 Introduction to Big Data – Fall 2019 W13.A.52

Properties of Reed-Solomon codes

• RS(n,k) with s-bit symbols
  • Encoder takes k data symbols (blocks) of s bits each
  • Adds parity symbols to make n symbol code word
  • There are n-k parity symbols of s bits each

• A Reed-Solomon decoder can correct up to t symbols that contain errors in a code word, where 2t = n-k.
  • t = (n-k)/2 for n-k even
  • t = (n-k-1)/2 for n-k odd

[Figure: a code word of n symbols, made up of k data symbols followed by 2t parity symbols]

11/18/2019 CS435 Introduction to Big Data – Fall 2019 W13.A.53

Example

• RS(255,223) with 8-bit symbols
  • Each code word contains 255 code word bytes
  • 223 bytes are data and 32 bytes are parity
  • n=255, k=223, s=8, 2t = 32, t=16

• The decoder can correct any 16 symbol errors in the code word
  • Errors in up to 16 bytes anywhere in the codeword can be automatically corrected.

[Figure: a code word of n symbols, made up of k data symbols followed by 2t parity symbols]
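
The parameter arithmetic can be checked with a small helper. `rs_parameters` is a hypothetical name. The erasure count and the length bound n ≤ 2^s - 1 are standard Reed-Solomon properties rather than statements from the slides: when the positions of the missing symbols are known (erasures, as with lost storage pieces), up to n-k of them can be recovered, twice the number of correctable errors at unknown positions.

```python
def rs_parameters(n, k, s=8):
    """Basic properties of an RS(n, k) code with s-bit symbols."""
    if not 0 < k < n <= 2 ** s - 1:
        raise ValueError("need 0 < k < n <= 2**s - 1")
    parity = n - k
    return {
        "parity_symbols": parity,
        "correctable_errors": parity // 2,   # t: error positions unknown
        "correctable_erasures": parity,      # positions known (lost pieces)
        "storage_overhead": n / k,
    }


if __name__ == "__main__":
    print(rs_parameters(255, 223))  # the RS(255,223) example
    print(rs_parameters(6, 4))      # a 4+2 layout, 1.5x storage overhead
```

The integer division covers both the even and odd cases of n-k in one expression.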