Nicholas:hdfs what is new in hadoop 2


BDTC 2013 Beijing China


Page 1: Nicholas:hdfs what is new in hadoop 2

© Hortonworks Inc. 2013

HDFS: What is New in Hadoop 2

Sze Tsz-Wo Nicholas

施子和

December 6, 2013

Page 2: Nicholas:hdfs what is new in hadoop 2

About Me

• 施子和 Sze Tsz-Wo Nicholas, Ph.D.

– Software Engineer at Hortonworks

– PMC Member at Apache Hadoop

– One of the most active contributors/committers of HDFS

• Started in 2007

– Used Hadoop to compute Pi at the two-quadrillionth (2×10^15th) bit

• It is the current World Record.

– Received Ph.D. from the University of Maryland, College Park

• Discovered a novel square root algorithm over finite fields.

π = 3.141592654…

Page 3: Nicholas:hdfs what is new in hadoop 2


Agenda

• New HDFS features in Hadoop-2

– New appendable write-pipeline

– Multiple Namenode Federation

– Namenode HA

– File System Snapshots

Page 4: Nicholas:hdfs what is new in hadoop 2

We have been hard at work…

• Progress is being made in many areas

– Scalability

– Performance

– Enterprise features

– Ongoing operability improvements

– Enhancements for other projects in the ecosystem

– Expand Hadoop ecosystem to more platforms and use cases

• 2192 commits in Hadoop in the last year

– Almost a million lines of changes

– ~150 contributors

– Many new contributors: ~80 with fewer than 3 patches

• 350K lines of changes in HDFS and common

Page 5: Nicholas:hdfs what is new in hadoop 2

Building on Rock-solid Foundation

• Original design choices - simple and robust

– Single Namenode metadata server – all state in memory

– Fault Tolerance: multiple replicas, active monitoring

– Storage: Relies on the OS’s file system, not raw disk

• Reliability

– Over 7 9’s of data reliability, less than 0.38 failures across 25 clusters

• Operability

– Small teams can manage large clusters

• An operator per 3K node cluster

– Fast time to repair on node or disk failure

• Minutes to an hour vs. RAID array repairs taking many long hours

• Scalable - proven by large scale deployments not bits

– > 100 PB storage, > 400 million files, > 4500 nodes in a single cluster

– ~ 100 K nodes of HDFS in deployment and use

Page 6: Nicholas:hdfs what is new in hadoop 2

New Appendable Write-Pipeline

Page 7: Nicholas:hdfs what is new in hadoop 2

HDFS Write Pipeline

[Figure: write pipeline Writer → DN1 → DN2 → DN3; data flows down the pipeline and acks flow back to the Writer.]

• The write pipeline has been improved dramatically

– Better durability

– Better visibility

– Consistency guarantees

– Appendable

Page 8: Nicholas:hdfs what is new in hadoop 2


New Feature in Write Pipeline

• Earlier versions of HDFS

– Files were immutable

– Write-once-read-many model

• New features in Hadoop 2

– Files can be reopened for append

– New primitives: hflush and hsync

– Read consistency

– Replace datanode on failure

Page 9: Nicholas:hdfs what is new in hadoop 2

HDFS hflush and hsync

• Java flush (or C fflush)

– forces any buffered output bytes to be written out.

• HDFS hflush

– Flush data to all the datanodes in the write pipeline

– Guarantees the data is visible for reading

– The data may be in datanodes’ memory

• HDFS hsync

– hflush with local file system sync

– May also update the file length in Namenode
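The two levels have a rough local-file analogue, sketched below in Python. Here flush() plays the role of Java flush / C fflush, and os.fsync() is the kind of local file system sync that hsync adds on each datanode. This is only an illustration on a local file, not HDFS client code; the helper name write_then_sync is made up for this sketch.

```python
import os
import tempfile

def write_then_sync(path, payload):
    """Write payload, flush it to the OS, then ask the OS to persist it."""
    with open(path, "wb") as f:
        f.write(payload)        # bytes may still sit in the process's own buffer
        f.flush()               # like Java flush / C fflush: hand the bytes to the OS
        size_after_flush = os.path.getsize(path)  # other readers can now see them
        os.fsync(f.fileno())    # like the local sync behind hsync: force them to storage
    return size_after_flush

with tempfile.TemporaryDirectory() as tmp:
    visible = write_then_sync(os.path.join(tmp, "demo.bin"), b"x" * 1000)
    print(visible)
```

The split mirrors the slide: visibility to readers comes from the flush step, durability from the sync step.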

Page 10: Nicholas:hdfs what is new in hadoop 2

Read Consistency

• A reader may read data during write

– It can read from any datanode in the pipeline

– and then failover to any other datanode to read the same data

[Figure: a Reader reads from datanodes in the Writer → DN1 → DN2 → DN3 write pipeline while the Writer is still writing.]

Page 11: Nicholas:hdfs what is new in hadoop 2

In the past …

• When a datanode fails, the pipeline is reconstructed with the remaining datanodes

• When another datanode fails, only one datanode remains!

[Figure: the write pipeline Writer → DN1 → DN2 → DN3 before and after datanode failures.]

Page 12: Nicholas:hdfs what is new in hadoop 2


Replace Datanode on Failure

• Add new datanodes to the pipeline

• User clients may choose the replacement policy

– Performance vs data reliability

[Figure: write pipeline in which a new datanode, DN4, is added after a datanode failure.]
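A minimal sketch of how such a replacement policy might be evaluated on the client. The policy names NEVER, ALWAYS and DEFAULT, and the DEFAULT rule itself, follow the description of dfs.client.block.write.replace-datanode-on-failure.policy in hdfs-default.xml as best recalled here; treat the exact rule as an assumption and check it against your Hadoop version. The function name is hypothetical.

```python
def should_add_datanode(policy, replication, existing, hflushed_or_appended=False):
    """Decide whether to request a replacement datanode after a pipeline failure.

    policy: "NEVER", "ALWAYS" or "DEFAULT" (assumed policy names).
    replication: target replication factor r.
    existing: number of datanodes n still alive in the pipeline.
    """
    if policy == "NEVER":
        return False    # favour performance: keep writing to the survivors
    if policy == "ALWAYS":
        return True     # favour reliability: always restore the pipeline
    if policy == "DEFAULT":
        if replication < 3:
            return False
        # Replace when at most half of the replicas are left, or when data
        # already promised to readers (hflushed/appended) is under-replicated.
        return (replication // 2 >= existing) or (
            replication > existing and hflushed_or_appended
        )
    raise ValueError(f"unknown policy: {policy}")

print(should_add_datanode("DEFAULT", replication=3, existing=1))
```

NEVER and ALWAYS are the two ends of the performance vs data reliability trade-off; DEFAULT sits between them.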

Page 13: Nicholas:hdfs what is new in hadoop 2

Multiple Namenode Federation

Page 14: Nicholas:hdfs what is new in hadoop 2

HDFS Architecture

[Figure: HDFS architecture in two layers. Namespace layer: the Namenode holds the persistent namespace metadata and journal, the in-memory namespace state (hierarchical namespace: file name → block IDs) and the block map (block ID → block locations). Block storage layer: datanodes store blocks (block ID → data) on JBOD disks, with blocks b1, b2, b3 and b5 each replicated across several datanodes, and send heartbeats and block reports to the Namenode. IO and storage scale horizontally.]

Page 15: Nicholas:hdfs what is new in hadoop 2


Single Namenode Limitations

• Namespace size is limited by the namenode memory size

– 64GB memory can support ~100 million files and blocks

– Solution: Federation

• Single point of failure (SPOF)

– The service is down when the namenode is down

– Solution: HA
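As a back-of-the-envelope check of the memory limit, the slide's two figures imply a per-object heap budget. The sketch below only divides those figures; it assumes 1 GB = 2^30 bytes and that memory use scales linearly with the number of files and blocks, which is a simplification.

```python
GIB = 1024 ** 3

heap_bytes = 64 * GIB      # namenode memory quoted on the slide
objects = 100_000_000      # ~100 million files and blocks quoted on the slide

# Implied heap budget per file or block (includes all JVM overhead and headroom).
bytes_per_object = heap_bytes / objects

def objects_supported(heap_gib, per_object=bytes_per_object):
    """Rough number of files + blocks a heap of heap_gib GiB could hold (linear scaling assumed)."""
    return round(heap_gib * GIB / per_object)

print(round(bytes_per_object))
print(objects_supported(128))
```

Doubling the heap only doubles the namespace, which is why the slide points to federation rather than ever-larger namenodes.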

Page 16: Nicholas:hdfs what is new in hadoop 2

Federation Cluster

• Multiple namenodes and namespace volumes in a cluster

– The namenodes/namespaces are independent

– Scalability by adding more namenodes/namespaces

– Isolation – separating applications into their own namespaces

– Client side mount tables/ViewFS for integrated views

• Block Storage as generic storage service

– Datanodes store blocks in block pools for all namespaces
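The client-side mount table idea can be sketched as a longest-prefix lookup from a path to the namenode that owns it. This is a toy model, not the ViewFS implementation; the mount points and host names below are invented for illustration.

```python
def resolve(mount_table, path):
    """Return the target URI for path, using the longest matching mount point."""
    for mount_point in sorted(mount_table, key=len, reverse=True):
        # Match whole path components only, so /data does not capture /database.
        if path == mount_point or path.startswith(mount_point + "/"):
            return mount_table[mount_point] + path
    raise KeyError(f"no mount point for {path}")

# Hypothetical mount table: each entry maps a directory to an independent namenode.
MOUNTS = {
    "/user": "hdfs://nn1.example.com",
    "/data": "hdfs://nn2.example.com",
    "/data/logs": "hdfs://nn3.example.com",
}

print(resolve(MOUNTS, "/data/logs/2013/12"))
```

Because the table lives on the client, the namenodes stay independent and need no knowledge of each other.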

Page 17: Nicholas:hdfs what is new in hadoop 2

Multiple Namenode Federation

[Figure: federation in two layers. Namespace layer: namenodes NN-1 … NN-k … NN-n serve namespaces NS1 … NS k … foreign NS n. Block storage layer: each namespace has its own block pool (Pool 1 … Pool k … Pool n), and datanodes DN 1 … DN m provide common storage for all block pools.]

Page 18: Nicholas:hdfs what is new in hadoop 2

Namenode HA

Page 19: Nicholas:hdfs what is new in hadoop 2

High Availability – No SPOF

• Support standby namenode and failover

– Planned downtime

– Unplanned downtime

• Release 1.1

– Cold standby

• Requires reconstructing in-memory data structures during failover

– Uses NFS as shared storage

– Standard HA frameworks as failover controller

• Linux HA and VMware vSphere

– Suitable for small clusters up to 500 nodes

Page 20: Nicholas:hdfs what is new in hadoop 2

Hadoop Full Stack HA

[Figure: Hadoop full-stack HA. An HA cluster of servers hosts the master daemons (NN and JT); on NN failover the JT goes into safemode. Jobs run on the slave nodes of the Hadoop cluster, and apps run outside the cluster.]

Page 21: Nicholas:hdfs what is new in hadoop 2


High Availability – Release 2.0

• Support for Hot Standby

– The standby namenode maintains in-memory data structures

• Supports manual and automatic failover

• Automatic failover with Failover Controller

– Active NN election and failure detection using ZooKeeper

– Periodic NN health check

– Failover on NN failure

• Removed shared storage dependency

– Quorum Journal Manager

• 3 to 5 Journal Nodes for storing the edit log

• Each edit must be written to a quorum of Journal Nodes

• Replay cache for correctness & transparent failovers
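The quorum rule for edits can be sketched as follows: an edit counts as logged only when a majority of Journal Nodes acknowledge it. This is a toy model under that assumption, not the Quorum Journal Manager's actual protocol; the class and function names are invented for illustration.

```python
class JournalNode:
    """Toy Journal Node that stores edits while it is up."""

    def __init__(self, up=True):
        self.up = up
        self.edits = []

    def write(self, txid, edit):
        if not self.up:
            return False            # an unreachable node never acknowledges
        self.edits.append((txid, edit))
        return True

def log_edit(journal_nodes, txid, edit):
    """Send the edit to every Journal Node; succeed only if a majority acknowledges."""
    acks = sum(jn.write(txid, edit) for jn in journal_nodes)
    quorum = len(journal_nodes) // 2 + 1
    return acks >= quorum

nodes = [JournalNode(), JournalNode(), JournalNode(up=False)]
print(log_edit(nodes, 1, "mkdir /tmp"))
```

With 3 Journal Nodes one failure is tolerated; with 5, two failures are tolerated.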

Page 22: Nicholas:hdfs what is new in hadoop 2

Namenode HA in Hadoop 2

[Figure: Namenode HA in Hadoop 2. The active and standby NNs share NN state through a quorum of JournalNodes (JN). Each NN has a FailoverController that monitors the health of the NN, OS and HW, sends commands to its NN, and heartbeats to ZooKeeper (ZK). Datanodes send block reports to both the active and the standby NN. DN fencing: datanodes only obey commands from the active NN.]

Namenode HA has no external dependency

Page 23: Nicholas:hdfs what is new in hadoop 2

File System Snapshots

Page 24: Nicholas:hdfs what is new in hadoop 2

Before Snapshots…

• Deleted files cannot be restored

– Trash is buggy and not well understood

– Trash works only for CLI based deletion

• No point-in-time recovery

• No periodic snapshots to restore from

– No admin/user managed snapshots

Page 25: Nicholas:hdfs what is new in hadoop 2

HDFS Snapshot

Point-in-time image of the file system

Read-only

Copy-on-write

Page 26: Nicholas:hdfs what is new in hadoop 2

Use Cases

Protection against user errors

Backup

Experimental/Test setups

Page 27: Nicholas:hdfs what is new in hadoop 2

Example: Periodic Snapshots for Backup

• A typical snapshot policy: take a snapshot

– every 15 mins, keep 24 hrs

– every 1 hr, keep 2 days

– every 1 day, keep 14 days

– every 1 week, keep 3 months

– every 1 month, keep 1 year
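The number of snapshots this policy keeps at steady state can be worked out directly. The sketch below assumes "3 months" means 13 weeks and "1 month" / "1 year" mean 30 and 360 days; those roundings are assumptions, not part of the slide.

```python
from datetime import timedelta

# (interval between snapshots, how long each snapshot is kept)
POLICY = [
    (timedelta(minutes=15), timedelta(hours=24)),
    (timedelta(hours=1), timedelta(days=2)),
    (timedelta(days=1), timedelta(days=14)),
    (timedelta(weeks=1), timedelta(weeks=13)),    # "3 months" taken as 13 weeks
    (timedelta(days=30), timedelta(days=360)),    # "1 month" / "1 year" taken as 30 / 360 days
]

def retained(interval, keep):
    """Snapshots alive at any time for one tier: retention window / interval."""
    return keep // interval

counts = [retained(interval, keep) for interval, keep in POLICY]
total = sum(counts)
print(counts, total)
```

A policy like this keeps under two hundred snapshots per directory, so creation and storage must be cheap for it to be practical.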

Page 28: Nicholas:hdfs what is new in hadoop 2

Design Goal: Efficiency

• Storage efficiency

– No block data copying

– No metadata copying for unmodified files

• Processing efficiency

– No additional costs for processing current data

• Cheap snapshot creation

– Must be fast and lightweight

– Must support a very large number of snapshots

Page 29: Nicholas:hdfs what is new in hadoop 2

Design Goal: Features

• Read-only

– Files and directories in a snapshot are immutable

– Nothing can be added to or removed from directories

• Hierarchical snapshots

– Snapshots of the entire namespace

– Snapshots of subtrees

• User operation

– Users can take snapshots for their data

– Admins manage where users can take snapshots

Page 30: Nicholas:hdfs what is new in hadoop 2

HDFS-2802: Snapshot Development

• Available in Hadoop 2 GA release (v2.2.0)

• Community-driven

– Special thanks to everyone who provided valuable discussion and feedback on the feature requirements and the open questions

• 136 subtask JIRAs

– Mainly contributed by Hortonworks

• The merge patch has about 28k lines

• ~8 months of development

Page 31: Nicholas:hdfs what is new in hadoop 2

Namenode Only Operation

• No complicated distributed mechanism

• Snapshot metadata stored in Namenode

• Datanodes have no knowledge of snapshots

• The block management layer does not know about snapshots either

Page 32: Nicholas:hdfs what is new in hadoop 2

Fast Snapshot Creation

• Snapshot Creation: O(1)

– It just adds a record to an inode

[Figure: directory tree with root /, directories d1 and d2, files f1, f2 and f3, and a snapshot S1.]

Page 33: Nicholas:hdfs what is new in hadoop 2

Low Memory Overhead

• NameNode memory usage: O(M)

– M is the number of modified files/directories

– Additional memory is used only when modifications are made relative to a snapshot

[Figure: directory tree with root /, directories d1 and d2, files f1, f2, f3 and a new file f4, and snapshot S1. Modifications after S1: 1. rm f3, 2. add f4.]

Page 34: Nicholas:hdfs what is new in hadoop 2


File Blocks Sharing

• Blocks in datanodes are not copied

– The snapshot files record the block list and the file size

– No data copying

[Figure: directory /d with snapshots S1 and S2; the file versions f, f' and f'' share the blocks blk0 to blk3.]
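Block sharing can be sketched with a file whose snapshots record only the block list and the file size at that moment. This is a toy model with invented names (SnapshottedFile, append, snapshot) and made-up block sizes, not the Namenode's inode classes.

```python
class SnapshottedFile:
    """Toy file inode: block IDs only; the block data stays on the datanodes."""

    def __init__(self):
        self.blocks = []        # current block list
        self.length = 0         # current file size in bytes
        self.snapshots = {}     # snapshot name -> (block IDs, file size)

    def append(self, block_id, nbytes):
        self.blocks.append(block_id)
        self.length += nbytes

    def snapshot(self, name):
        # Metadata only: no block data is read or copied.
        self.snapshots[name] = (tuple(self.blocks), self.length)

f = SnapshottedFile()
f.append("blk0", 128)
f.append("blk1", 128)
f.snapshot("S1")
f.append("blk2", 128)
f.snapshot("S2")
f.append("blk3", 64)
print(f.snapshots["S1"], f.snapshots["S2"], f.length)
```

Each snapshot refers to a prefix of the same block IDs, so later appends never change what an older snapshot sees.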

Page 35: Nicholas:hdfs what is new in hadoop 2


Persistent Data Structures

• A well-known data structure for “time travel”

– Supports querying previous versions of the data

• Access slowdown

– The additional access time the data structure requires

• In traditional persistent data structures

– Accessing both current data and snapshot data is slowed down

• In our implementation

– No slowdown on accessing current data

– Slowdown happens only on accessing snapshot data

Page 36: Nicholas:hdfs what is new in hadoop 2

No Slow Down on Accessing Current Data

• The current data can be accessed directly

– Modifications are recorded in reverse chronological order

Snapshot data = Current data – Modifications

[Figure: directory tree with snapshot S1 and the modifications made after it (1. rm f3, 2. add f4); the snapshot view of d2, containing f2 and f3, is derived from the current data and the recorded modifications.]
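The rule "Snapshot data = Current data – Modifications" can be sketched with a directory modeled as a set of child names. The current state is stored directly, and each snapshot keeps only what was created or deleted after it. All names here (Directory, snapshot, view) are invented for illustration and are not the Namenode's actual data structures.

```python
class Directory:
    """Toy directory: current children plus one diff record per snapshot."""

    def __init__(self, children=()):
        self.children = set(children)   # current data, read without touching diffs
        self.diffs = []                 # [(snapshot name, created, deleted)], oldest first

    def snapshot(self, name):
        self.diffs.append((name, set(), set()))   # O(1): just add an empty record

    def add(self, child):
        self.children.add(child)
        if self.diffs:
            self.diffs[-1][1].add(child)

    def remove(self, child):
        self.children.remove(child)
        if self.diffs:
            _, created, deleted = self.diffs[-1]
            if child in created:
                created.discard(child)  # created after the snapshot: leaves no trace
            else:
                deleted.add(child)

    def view(self, name):
        """Rebuild a snapshot view by undoing diffs from the newest back to name."""
        state = set(self.children)
        for snap, created, deleted in reversed(self.diffs):
            state = (state - created) | deleted
            if snap == name:
                return state
        raise KeyError(name)

d2 = Directory({"f2", "f3"})
d2.snapshot("S1")
d2.remove("f3")
d2.add("f4")
print(sorted(d2.children), sorted(d2.view("S1")))
```

Reading the current directory never touches the diff list, while a snapshot read pays for walking back through the recorded modifications.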

Page 37: Nicholas:hdfs what is new in hadoop 2


Easy Management

• Snapshots can be taken on any directory

– Set the directory to be snapshottable

• Support 65,536 simultaneous snapshots

• No limit on the number of snapshottable directories

– Nested snapshottable directories are currently NOT allowed

Page 38: Nicholas:hdfs what is new in hadoop 2

Admin Ops

• Allow snapshots on a directory

– hdfs dfsadmin -allowSnapshot <path>

• Reset a snapshottable directory

– hdfs dfsadmin -disallowSnapshot <path>

Page 39: Nicholas:hdfs what is new in hadoop 2

User Ops

• Create/delete/rename snapshots

– hdfs dfs -createSnapshot <path> [<snapshotName>]

– hdfs dfs -deleteSnapshot <path> <snapshotName>

– hdfs dfs -renameSnapshot <path> <oldName> <newName>

• Get snapshottable directory listing

– hdfs lsSnapshottableDir

• Get snapshots difference report

– hdfs snapshotDiff <path> <from> <to>

Page 40: Nicholas:hdfs what is new in hadoop 2

Use snapshot paths in CLI

• All regular commands and APIs can be used against snapshot paths

– /<snapshottableDir>/.snapshot/<snapshotName>/foo/bar

• List all the files in a snapshot

– ls /test/.snapshot/s4

• List all the snapshots under that path

– ls <path>/.snapshot
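Since a snapshot path is an ordinary path with a .snapshot component, a client can build one with plain string handling. A small sketch; the helper name snapshot_path is made up, and only the path layout comes from the slide.

```python
import posixpath

def snapshot_path(snapshottable_dir, snapshot_name, relative=""):
    """Build /<snapshottableDir>/.snapshot/<snapshotName>/<relative>."""
    joined = posixpath.join(
        snapshottable_dir, ".snapshot", snapshot_name, relative.lstrip("/")
    )
    return joined.rstrip("/")   # drop the trailing slash left by an empty relative part

print(snapshot_path("/test", "s4"))
print(snapshot_path("/test", "s4", "foo/bar"))
```

The resulting string can be passed to any regular command or API call that accepts a path.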

Page 41: Nicholas:hdfs what is new in hadoop 2

Test Snapshot Functionalities

• ~100 unit tests

• ~1.4 million generated system tests

– Covering most combinations of (snapshot + rename) operations

• Automated long-running tests for months

Page 42: Nicholas:hdfs what is new in hadoop 2

NFS Support and Other Features

Page 43: Nicholas:hdfs what is new in hadoop 2

NFS Support

• NFS Gateway provides NFS access to HDFS

– File browsing, Data download/upload, Data streaming

– No client-side library

– Better alternative to Hadoop + Fuse based solution

• Better consistency guarantees

• Supports NFSv3

• Stateless Gateway

– Simpler design, easy to handle failures

• Future work

– High Availability for NFS Gateway

– NFSv4 support?

Page 44: Nicholas:hdfs what is new in hadoop 2

Other Features

• Protobuf, wire compatibility

– Post 2.0 GA stronger wire compatibility

• Rolling upgrades

– With relaxed version checks

• Improvements for other projects

– Stale node to improve HBase MTTR

• Block placement enhancements

– Better support for other topologies such as VMs and Cloud

• On the wire encryption

– Both data and RPC

• Expanding ecosystem, platforms and applicability

– Native support for Windows

Page 45: Nicholas:hdfs what is new in hadoop 2

Enterprise Readiness

• Storage fault-tolerance – built into HDFS

– 100% data reliability

• High Availability

• Standard Interfaces

– WebHDFS(REST), Fuse, NFS, HttpFs, libwebhdfs and libhdfs

• Wire protocol compatibility

– Protocol buffers

• Rolling upgrades

• Snapshots

• Disaster Recovery

– Distcp for parallel and incremental copies across clusters

– Apache Ambari and HDP for automated management

Page 46: Nicholas:hdfs what is new in hadoop 2

Work in Progress

• HDFS-2832: Heterogeneous storages

– Datanode abstraction from single storage to collection of storages

– Support different storage types: Disk and SSD

• HDFS-5535: Zero download rolling upgrade

– Namenodes and Datanodes can be upgraded independently

– No upgrade downtime

• HDFS-4685: ACLs

– More flexible than user-group-permission

Page 47: Nicholas:hdfs what is new in hadoop 2

Future Work

• HDFS-5477: Block manager as a service

– Move block management out from Namenode

– Support different name service, e.g. key-value store

• HDFS-3154: Immutable files

– Write-once and then read-only

• HDFS-4704: Transient files

– Tmp files will not be recorded in snapshots

Page 48: Nicholas:hdfs what is new in hadoop 2

Q & A

• Myths and misinformation of HDFS

– Not reliable (was never true)

– Namenode dies, all state is lost (was never true)

– Does not support disaster recovery (distcp in Hadoop 0.15)

– Hard to operate for newcomers

– Performance improvements (always ongoing)

• Major improvements in 1.2 and 2.x

– Namenode is a single point of failure

– Needs shared NFS storage for HA

– Does not have point in time recovery

Thank You!
