Design for a Distributed Name Node

A proposed design for a distributed HDFS NameNode.

Page 1: Design for a Distributed Name Node

Reaching 10,000
Aaron Cordova, Booz Allen Hamilton | Hadoop Meetup DC | Sep 7 2010

[email protected]

Page 2

Lots of Applications Require Scalability

Intelligence

Bio-Metrics

Bio-Informatics

Defense

Video

Images

Text

Structured Data

Graph Analytics

Machine Learning

Network Security

Page 3

Hadoop Scales

Page 4

Linear Scalability

[Chart: Cost vs. Data Size, showing shared-disk cost climbing steeply as data grows while shared-nothing architectures scale linearly]

Page 5

Massive Parallelism

Page 6

MapReduce

Simplified Distributed Programming Model

Fault Tolerant

Designed to Scale to Thousands of Servers

Many Algorithms Easily Expressed as Map and Reduce

Page 7

HDFS

Distributed File System

Optimized for High-Throughput

Fault Tolerant Through Replication, Checksumming

Designed to Scale to 10,000 servers

Page 8

Hadoop is a Platform

Page 9

MapReduce

HDFS

HBase

Mahout

Hive

Pig

Flume

Cascading

Nutch

Page 10

HBase

Scalable Structured store

Fast Lookups

Durable, Consistent Writes

Automatic Partitioning

Page 11

Mahout

Scalable Machine Learning Algorithms

Clustering

Classification

Page 12

Fuzzy Table

Low-Latency Parallel Search

Generalized Fuzzy Matching

Images, Biometrics, Audio

Page 13

One Major Problem

Page 14

HDFS Single NameNode

Single NameSpace - easy to serialize operations

NameSpace stored entirely in memory

Changes written to transaction log first

Single Point of Failure

Performance Bottleneck?

Page 15

NameNode Scalability

By software evolution standards Hadoop is a young project. In 2005, inspired by two Google papers, Doug Cutting and Mike Cafarella implemented the core of Hadoop. Its wide acceptance and growth started in 2006 when Yahoo! began investing in its development and committed to use Hadoop as its internal distributed platform. During the past several years Hadoop installations have grown from a handful of nodes to thousands. It is now used in many organizations around the world.

In 2006, when the buzzword for storage was Exabyte, the Hadoop group at Yahoo! formulated long-term target requirements [7] for the Hadoop Distributed File System and outlined a list of projects intended to bring the requirements to life. What was clear then has now become a reality: the need for large distributed storage systems backed by distributed computational frameworks like Hadoop MapReduce is imminent.

Today, when we are on the verge of the Zettabyte Era, it is time to take a retrospective view of the targets and analyze what has been achieved, how aggressive our views on the evolution and needs of the storage world have been, how the achievements compare to competing systems, and what our limits to growth may be.

The main four-dimensional scale requirement targets for HDFS were formulated [7] as follows:

10PB capacity x 10,000 nodes x 100,000,000 files x 100,000 clients

The biggest Hadoop clusters [8, 5], such as the one recently used at Yahoo! to set sorting records, consist of 4000 nodes and have a total space capacity ...

“100,000 HDFS clients on a 10,000-node HDFS cluster will exceed the throughput capacity of a single name-node.

... any solution intended for single namespace server optimization lacks scalability.

... the most promising solutions seem to be based on distributing the namespace server ...”

Konstantin Shvachko

;login:, Apr 2010
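A back-of-envelope sketch makes the in-memory limit concrete. It uses the commonly cited heuristic of roughly 150 bytes of NameNode heap per namespace object (file, directory, or block); that constant and the blocks-per-file ratio below are assumptions for illustration, not figures from the talk.

```python
# Rough NameNode heap estimate for the stated 100-million-file target.
# ~150 bytes per namespace object is a heuristic; real costs vary by version.

BYTES_PER_OBJECT = 150          # assumed heuristic, not an exact figure
FILES = 100_000_000             # target: 100 million files
BLOCKS_PER_FILE = 2             # assumption: average blocks per file

objects = FILES * (1 + BLOCKS_PER_FILE)      # file objects plus block objects
heap_gb = objects * BYTES_PER_OBJECT / 1e9

print(f"~{heap_gb:.0f} GB of heap for {FILES:,} files")  # → ~45 GB
```

Even under these mild assumptions the namespace alone approaches the practical heap ceiling of a single JVM, before any throughput concerns.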

Page 16

Goal

[Chart: writes per second (thousands), scale 0 to 50, comparing single NameNode throughput to the target]

Page 17

HDFS Single NameNode

Server grade machine

Lots of memory

Reliable components

RAID

Hot-Failover

Page 18

Needs Parallelism

Page 19

Scaling NameNode

Grow memory

Read-only Replicas of NameNode

Multiple static namespace partitions

Distributed name server, partition namespace dynamically

Page 20

Distributed NameNode Features

Fast Lookups

Durable, Consistent writes

Automatic Partitioning

Page 21

Can we use HBase?

Page 22

NameSpace

filename : blocks DataNodes

node : blocks Blocks

block : nodes

Mappings as HBase Tables
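A minimal sketch of the three mappings as plain key-value tables. The function, block, and node names here are illustrative, not the actual HBase schema from the talk; in HBase each mapping would be a table with the left-hand key as the row key.

```python
# Illustrative model of the three NameNode mappings as simple tables.

namespace = {}   # filename -> list of block ids
datanodes = {}   # datanode -> set of block ids it holds
blocks = {}      # block id -> set of datanodes holding a replica

def create_file(path, block_ids, replica_nodes):
    """Record a new file: its blocks, and where each block's replicas live."""
    namespace[path] = list(block_ids)
    for b in block_ids:
        blocks[b] = set(replica_nodes)
        for n in replica_nodes:
            datanodes.setdefault(n, set()).add(b)

create_file("/dir1/subdir/file", ["blk_1", "blk_2"], ["dn1", "dn2", "dn3"])

print(namespace["/dir1/subdir/file"])   # ['blk_1', 'blk_2']
print(sorted(blocks["blk_1"]))          # ['dn1', 'dn2', 'dn3']
```

Keeping block-to-node and node-to-block as separate tables trades storage for fast lookups in both directions, the same trade a single NameNode makes in memory.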

Page 23

How to order namespace?

Page 24

Depth First Search Order

/

/dir1

/dir1/subdir

/dir1/subdir/file

/dir2/file1

/dir2/file2

Page 25

Depth First Operations

Delete (Recursive)

Move / Rename
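With full paths as row keys in depth-first (lexicographic) order, a subtree occupies one contiguous key range, so recursive delete and rename reduce to prefix scans. A sketch of that idea under those assumptions, with illustrative names rather than the actual code:

```python
# Namespace keyed by full path; sorted keys put each subtree in one range.
ns = {
    "/": None,
    "/dir1": None,
    "/dir1/subdir": None,
    "/dir1/subdir/file": None,
    "/dir2/file1": None,
    "/dir2/file2": None,
}

def subtree(keys, prefix):
    """All keys in the subtree rooted at prefix (a contiguous scan in HBase)."""
    return [k for k in sorted(keys) if k == prefix or k.startswith(prefix + "/")]

def delete_recursive(ns, prefix):
    for k in subtree(ns, prefix):
        del ns[k]

def rename(ns, src, dst):
    for k in subtree(ns, src):
        ns[dst + k[len(src):]] = ns.pop(k)

delete_recursive(ns, "/dir1/subdir")
rename(ns, "/dir2", "/data")
print(sorted(ns))   # ['/', '/data/file1', '/data/file2', '/dir1']
```

The cost of both operations is proportional to the subtree size, and each touches only one key range, which matters once the namespace is partitioned across servers.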

Page 26

Breadth First Search Order

0/

1/dir1

2/dir2/file1

2/dir2/file2

2/dir1/subdir

3/dir1/subdir/file

Page 27

Breadth First Operations

List
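Breadth-first order prefixes each key with its depth, so all direct children of a directory sort adjacently and a listing becomes one short range scan. A sketch under that assumption (a real scheme would need fixed-width depth encoding so depth 10 does not sort before depth 2; names here are illustrative):

```python
def bfs_key(path):
    """Prefix a path with its depth: '/' -> '0/', '/dir1' -> '1/dir1'."""
    depth = 0 if path == "/" else path.count("/")
    return f"{depth}{path}"

paths = ["/", "/dir1", "/dir2/file1", "/dir2/file2", "/dir1/subdir",
         "/dir1/subdir/file"]
keys = sorted(bfs_key(p) for p in paths)

def list_dir(keys, path):
    """Direct children of path: one contiguous key range at depth + 1."""
    depth = (0 if path == "/" else path.count("/")) + 1
    prefix = f"{depth}{path.rstrip('/')}/"
    return [k for k in keys if k.startswith(prefix)]

print(list_dir(keys, "/dir2"))  # ['2/dir2/file1', '2/dir2/file2']
```

The trade-off against depth-first order is visible here: listing is a single range scan, but a subtree is now scattered across one range per depth level, so recursive delete and rename need multiple scans.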

Page 28

Current Architecture

[Diagram: a single NameNode serving DFSClients and DataNodes directly]

Page 29

Proposed Architecture

[Diagram: each DFSClient and DataNode talks to a local DNNProxy, and the proxies route namespace operations to a set of RServers hosting the namespace tables]
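One way the proxy's routing could work, since the namespace tables are split into contiguous key ranges across region servers: pick the server owning the range containing the key. The server names and split points below are hypothetical, not from the actual code.

```python
import bisect

# Region start keys in sorted order; each range is owned by one server.
starts = ["", "/dir1", "/dir2"]            # hypothetical split points
servers = ["rserver1", "rserver2", "rserver3"]

def route(path):
    """Pick the server whose region start key is the greatest key <= path."""
    i = bisect.bisect_right(starts, path) - 1
    return servers[i]

print(route("/dir1/subdir/file"))  # rserver2
print(route("/dir2/file1"))        # rserver3
print(route("/a"))                 # rserver1
```

This is the same range-partitioning scheme HBase itself uses for regions, which is why the proxy can lean on HBase's existing region location mechanism rather than maintaining its own routing table.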

Page 30

100k clients -> 41k writes/s

Page 31

Anticipated Performance

[Chart: writes per second (thousands, 0 to 50) vs. number of machines hosting the namespace (100 to 250), comparing Single NN, Distributed NN, and Target]

Page 32

Issues

Synchronization - multiple writers, changes

Name distribution hotspots

Page 33

Current Status

Working code exists that uses HBase with slightly modified DFSClient and DataNode for create, write, close, open, read, mkdirs, delete.

New component: HealthServer monitors DataNodes and performs garbage collection. Like the BigTable master, it can die and restart without affecting clients.

Page 34

Code

Will be at http://code.google.com/p/hdfs-dnn

Available under the Apache license (whichever version is compatible with Hadoop)

Page 35

Doesn’t HBase run on HDFS?

Page 36

Self-Hosted HBase

May be possible to have HBase use the same HDFS instance it’s supporting

Some recursion and self-reference already exists: HBase's META table is itself stored as a table in HBase

Have to work out bootstrapping and failure recovery to resolve any potential circular dependencies