Storage Systems for Big Data

Sameer Tiwari, Hadoop Storage Architect, Pivotal
[email protected], @sameertech
Storage Hierarchy

● POSIX filesystem (*nix): general-purpose FS
● HDFS: large distributed storage, high aggregate throughput
● HBase: large indexed tables, fast random access, consistent
● Redis: in-memory KV store, extremely fast access
● Other KV store(s)
Hadoop Distributed File System (HDFS)

● History
  ○ Based on the Google File System paper (2003)
  ○ Built at Yahoo by a small team
● Goals
  ○ Tolerance to hardware failure
  ○ Sequential access as opposed to random
  ○ High aggregate throughput for large data sets
  ○ "Write Once, Read Many" paradigm
HDFS - Key Components

[Diagram: Client1 (FileA) and Client2 (FileB) talk to a NameNode and to DataNodes 1-4, spread across Rack 1 and Rack 2. File.create() and other metadata ops go to the NameNode; File.write() streams data blocks to the DataNodes; replicas (AB1, AB2, BB1) are then copied DataNode-to-DataNode via replication pipelining.]

● NameNode metadata
  ○ FileA: metadata (e.g. size, owner...) plus block locations AB1:D1, AB1:D3, AB1:D4 and AB2:D1, AB2:D3, AB2:D4
  ○ FileB: metadata (e.g. size, owner...) plus block locations BB1:D1, BB1:D2, BB1:D4
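To make the create/write flow concrete, here is a minimal client sketch using the standard org.apache.hadoop.fs.FileSystem API (the path and payload are illustrative). The client only sees streams; the NameNode and DataNode traffic described above happens underneath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

Configuration conf = new Configuration(); // picks up core-site.xml/hdfs-site.xml
FileSystem fs = FileSystem.get(conf);

// create() is a NameNode metadata op; writes on the stream go to DataNodes
Path file = new Path("/data/fileA");
FSDataOutputStream out = fs.create(file);
out.write("hello hdfs".getBytes("UTF-8")); // buffered and streamed as blocks
out.close(); // block completion is reported back to the NameNode

// open() fetches block locations from the NameNode, then streams from DataNodes
FSDataInputStream in = fs.open(file);
byte[] buf = new byte[10];
in.readFully(0, buf);
in.close();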
HDFS - Communication

[Diagram: Client1 (FileA), the NameNode, and DataNodes holding blocks AB1, AB2, BB1.]

● Client to NameNode: HDFS client API, RPC (ClientProtocol)
● Client to DataNode: HDFS client API
  ○ DataTransferProtocol
  ○ Non-RPC, streaming
  ○ Heavy buffering
● DataNode to NameNode: RPC (DataNodeProtocol)
  ○ DN registration: at init time
  ○ Heartbeat: stats about activity and capacity (every few secs)
  ○ Block report: list of blocks (hourly)
  ○ Block received: triggered by client upload
● DataNode to DataNode: replication pipelining, streaming (e.g. replica AB2)
HDFS - NameNode 1 of 4

● Heart of HDFS; typically lots of memory, ~128 GB
● Hosts two important tables
  ○ The HDFS namespace: File -> Block mapping
    ■ Persisted for backup
  ○ The inode table: Block -> DataNode mapping
    ■ Not persisted
    ■ Re-built from block reports
● HDFS is a journaled file system
  ○ Maintains a WAL called the edit log
  ○ The edit log is merged into the fsimage at a preset log size
HDFS - NameNode 2 of 4

● Can take on 3 roles
● Regular mode: hosts the HDFS namespace
● Backup mode: Secondary NN
  ○ Downloads the fsimage regularly
  ○ Merges changes to the namespace
  ○ "Secondary" is a misnomer; it is more of a checkpointing server
● Safemode: at startup time
  ○ A read-only (R/O) mode
  ○ Collects data from active DNs
HDFS - NameNode 3 of 4
HA using Quorum Journal Manager (Hadoop 2.0+)

[Diagram: Clients talk to an Active NN and a Standby NN. Both NNs share the edit log through a quorum of JournalNodes, DataNodes send block reports to both NNs, and a ZK cluster coordinates automatic failover.]
HDFS - NameNode 4 of 4

● Replication monitor: fixes over/under-replicated blocks
  ○ Replica modes: corrupt, current, out-of-date, under-construction
● Lease management: during file creation
  ○ Ensures a single writer (multiple readers are OK)
  ○ Synchronously checks the active lease
  ○ Asynchronously checks the entire tree of leases
● Heartbeat monitor: collects DN stats and marks a DN down if no heartbeat is received for ~10 mins
HDFS - DataNode

● Typical machine: ~4 TB x 12 disks, JBOD
● Has no idea about HDFS; only knows about blocks
● Serves 2 types of requests
  ○ NN requests for block create/delete/replicate
  ○ Block R/W requests from clients
● Maintains only one table
  ○ Block -> real bytes on the local FS
  ○ Stored locally and not backed up
  ○ The DN can re-build this table by scanning its local dirs
● Creates a checksum file for each block
● Runs blockScanner() to find corrupt blocks
● DataNode to NameNode communication
  ○ Init: registration
  ○ Sends a heartbeat to the NN every few secs
  ○ Block completion: blockReceived()
  ○ Lets the NN respond with block commands
  ○ Sends a full block report every hour
HDFS - Typical Deployment

[Diagram: Racks 1..N (10-20 machines each), every rack with a top-of-rack (TOR) switch. TOR switches uplink to aggregator switches (1, 2, 3...), which uplink to a master switch.]
HDFS - Limitations

● The NN holds the namespace in a single Java process
  ○ 64 GB heap == ~250 million files + blocks
  ○ Federation sort of solves the problem
  ○ Moving the namespace to a KV store is one solution
● Enterprise features are slowly being added
  ○ Snapshots
  ○ NFS access
  ○ Geo-replication
  ○ Erasure coding to reduce 3x copies to ~1.3x

HDFS - Advanced Concepts

● Support for fadvise readahead and drop-behind
● HDFS takes advantage of multiple disks
  ○ Individual disk failures do not cause DN failures
  ○ Spills are parallelized
● Replica and task placement (see the sketch below)
  ○ Done by DNSToSwitchMapping.resolve()
  ○ User-supplied rack topology
  ○ IP address -> rack id mapping
  ○ net.topology.* settings in core-site.xml
● A couple of tools for perf monitoring
  ○ Ganglia for HDFS
  ○ Nagios for general machine health
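For illustration, a minimal sketch of a custom rack resolver, assuming the Hadoop 2.x-era DNSToSwitchMapping interface; the IP-to-rack rule is made up, and the class would be wired in via the net.topology.node.switch.mapping.impl setting:

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.net.DNSToSwitchMapping;

// Hypothetical resolver: derive the rack id from the third octet of the IP
public class ThirdOctetRackMapping implements DNSToSwitchMapping {
  @Override
  public List<String> resolve(List<String> names) {
    List<String> racks = new ArrayList<String>(names.size());
    for (String name : names) {
      String[] octets = name.split("\\.");
      // e.g. 10.1.42.7 -> /rack-42; anything unparseable gets the default rack
      racks.add(octets.length == 4 ? "/rack-" + octets[2] : "/default-rack");
    }
    return racks;
  }

  @Override
  public void reloadCachedMappings() { } // this sketch keeps no cache

  @Override
  public void reloadCachedMappings(List<String> names) { }
}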
Storage Hierarchy

● POSIX filesystem (*nix): general-purpose FS
● HDFS: large distributed storage, high aggregate throughput
● HBase: large indexed tables, fast random access, consistent
● Redis: in-memory KV store, extremely fast access
● Other KV store(s)
HBase

● History
  ○ Based on Google's Bigtable paper (2006)
  ○ Built at Powerset (later acquired by Microsoft)
  ○ Facebook and Yahoo use it extensively (~1000 machines)
● Goals
  ○ Random R/W access
  ○ Tables with billions of rows x millions of columns
  ○ Often referred to as a "NoSQL" data store
  ○ High-speed ingest rate, e.g. FB == ~1 billion msgs+chats per day
  ○ Good consistency model
HBase - Key Components

[Diagram: Master(s), active and backup, run the NameNode, JobTracker and HMaster; many slaves each run a DataNode, TaskTracker and HRegionServer; a ZK cluster coordinates the cluster, and clients consult it to find their way in.]
HBase - Data Model

● The Google Bigtable paper (section 2) says:

"A Bigtable is a sparse, distributed, persistent multidimensional sorted map. The map is indexed by a row key, column key, and a timestamp; each value in the map is an uninterpreted array of bytes."

Let's break that down over the next few slides...
HBase - Data Model

● Data is stored in tables
● Tables have rows and columns
● That's where the similarity (to an RDBMS) ends
  ○ Columns are grouped into column families
● Rows are stored in sorted (increasing) order
  ○ Implies there is only one primary key
● Rows can be sparsely populated
  ○ Variable-length rows are common
● The same row can be updated multiple times
  ○ Each update is stored as a versioned entry
HBase - Data Model: Conceptual View

Row Key         Time Stamp   ColumnFamily contents         ColumnFamily anchor
"com.cnn.www"   t9                                         anchor:cnnsi.com = "CNN"
"com.cnn.www"   t8                                         anchor:my.look.ca = "CNN.com"
"com.cnn.www"   t5           contents:html = "<html>..."
"com.cnn.www"   t3           contents:html = "<html>..."

● Row key: a byte array, sorted by byte order
● Column => ColumnFamily:Qualifier, e.g. two columns in the "anchor" family and a single column in "contents"; values are byte arrays
● Versions: timestamps, e.g. System.currentTimeMillis()
HBase - Data Model: Physical View

Row Key         Time Stamp   ColumnFamily anchor
"com.cnn.www"   t9           anchor:cnnsi.com = "CNN"
"com.cnn.www"   t8           anchor:my.look.ca = "CNN.com"

Row Key         Time Stamp   ColumnFamily contents
"com.cnn.www"   t5           contents:html = "<html>..."
"com.cnn.www"   t3           contents:html = "<html>..."
HBase - Table Objects

[Diagram: A logical table (data R1-R40) is sharded into regions, e.g. Region1 holds rows R1-R10 and Region2 holds R11-R20. Regions are hosted by region servers (~200 regions per server). Each region consists of a MemStore plus HFiles, and each region server keeps an HLog/WAL; the HFiles and the WAL are stored as HDFS blocks.]
HBase - Data Model Operations

○ The HTable class offers 4 operations: get, put, delete and scan
○ The first 3 have single and batch modes available

// Scan example
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public static final byte[] CF1 = "empData1".getBytes();
public static final byte[] ATTR1 = "empId".getBytes();

Configuration conf = HBaseConfiguration.create();
HTable htable = new HTable(conf, "employees"); // create an instance of HTable; table name is illustrative

Scan scan = new Scan();
scan.addColumn(CF1, ATTR1);
scan.setStartRow(Bytes.toBytes("200"));
scan.setStopRow(Bytes.toBytes("500"));
ResultScanner rs = htable.getScanner(scan);
try {
  for (Result r = rs.next(); r != null; r = rs.next()) {
    // do something with it...
  }
} finally {
  rs.close();
}
HBase - Data Versioning

○ By default a put() uses the current timestamp, but you can override it
○ Use Get.setMaxVersions() or Get.setTimeRange() to select versions
○ By default a get() returns the latest version, but you can ask for any
○ All data model operations return cells in sorted order: Row:CF:Col:Version (versions newest-first)
○ Delete flavors: delete col+version, delete col, delete column family, delete row
○ Deletes work by creating tombstone markers
○ LIMITATIONS:
  ■ A delete() masks a put() until a major compaction takes place
  ■ Major compactions can change get() results
○ All operations are ATOMIC within a row
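A minimal sketch of versioned puts and gets, reusing the htable instance from the scan example; the column family, qualifier and timestamps are illustrative:

import java.util.List;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;

// Write the same cell twice with explicit timestamps
Put p1 = new Put(Bytes.toBytes("row-1"));
p1.add(Bytes.toBytes("myColFam1"), Bytes.toBytes("col1"), 100L, Bytes.toBytes("v1"));
htable.put(p1);
Put p2 = new Put(Bytes.toBytes("row-1"));
p2.add(Bytes.toBytes("myColFam1"), Bytes.toBytes("col1"), 200L, Bytes.toBytes("v2"));
htable.put(p2);

// A plain get() would return only "v2"; ask for older versions explicitly
Get g = new Get(Bytes.toBytes("row-1"));
g.setMaxVersions(3);
Result r = htable.get(g);
List<KeyValue> versions = r.getColumn(Bytes.toBytes("myColFam1"), Bytes.toBytes("col1"));
// versions is sorted newest-first: [v2 @ t=200, v1 @ t=100]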
HBase - Read Path

[Diagram: client, ZK cluster, RegionServer1 hosting -ROOT-, RegionServer2 hosting .META. plus a user region (MemStore, HFile-1, HFile-2).]

1. Client asks the ZK cluster: where is -ROOT-? A: RegionServer1
2. Client asks -ROOT- (the table that keeps track of the .META. table; .META.,region,key -> regionInfo, server): where is .META.? A: RegionServer2
3. Client looks up the row's region in .META. (the table for all regions in the system, which never splits; table,startKey,id -> regionInfo, server)
4. Client issues HTable.get() to that region's server
5. The region server checks the MemStore, then the HFiles
6. A: the row is returned to the client
HBase - Write Path

[Diagram: client, ZK cluster, RegionServer1 hosting -ROOT-, RegionServer2 hosting .META. plus a user region (MemStore, HLog/WAL on HDFS blocks).]

1. Client asks the ZK cluster: where is -ROOT-? A: RegionServer1
2. Client asks -ROOT-: where is .META.? A: RegionServer2
3. Client looks up the row's region in .META.
4. Client issues HTable.put() to that region's server
5. The region server appends the edit to the HLog/WAL (HDFS blocks) and writes it into the MemStore
6. A return code goes back to the client; the MemStore is flushed to HFiles offline
HBase - Shell

○ Table metadata: e.g. create/alter/drop/describe table
○ Table data: e.g. put/scan/delete/count row(s)
○ Admin: e.g. flush/rebalance/compact regions, split tables
○ Replication tools: e.g. add/enable/list/start/stop replication
○ Security: e.g. grant/revoke/list user permissions

Shell interaction example:

hbase(main):001:0> create 'myTable', 'myColFam1'
0 row(s) in 3.8890 seconds

hbase(main):002:0> put 'myTable', 'row-1', 'myColFam1:col1', 'value-1'
0 row(s) in 0.1840 seconds

hbase(main):003:0> scan 'myTable'
ROW                COLUMN+CELL
 row-1             column=myColFam1:col1, timestamp=1457381922312, value=value-1
1 row(s) in 0.1160 seconds
HBase - Advanced Topics

○ Bulk loading
○ Cluster replication
○ Merging and splitting of regions
○ Predicate pushdown using server-side filters
○ Bloom filters
○ Co-processors
○ Snapshots
○ Performance tuning
HBase - What it's not

○ HBase is not for everyone
○ Has no support for
  ■ SQL
  ■ Joins
  ■ Secondary indexes
  ■ Transactions
  ■ A JDBC driver
○ Works well with large deployments
○ Requires good working knowledge of the Hadoop eco-system
HBase - What it's good at

● Strongly consistent reads/writes
● Automatic sharding
● Automatic RegionServer failover
● Supports MapReduce, with HBase as both source and sink (see the sketch below)
● Works on top of HDFS
● Provides a Java client API and REST/Thrift APIs
● Block cache and Bloom filter support
● Web UI and JMX support, for operational management
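A hedged sketch of HBase as a MapReduce source: a map-only row counter built with TableMapReduceUtil (the table name is illustrative, and the API shown assumes the 0.9x-era org.apache.hadoop.hbase.mapreduce package):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class SimpleRowCounter {
  static class CountMapper extends TableMapper<NullWritable, NullWritable> {
    @Override
    protected void map(ImmutableBytesWritable row, Result value, Context ctx)
        throws IOException, InterruptedException {
      ctx.getCounter("hbase", "rows").increment(1); // one tick per row
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = Job.getInstance(conf, "simple-row-counter");
    job.setJarByClass(SimpleRowCounter.class);
    TableMapReduceUtil.initTableMapperJob(
        "myTable", new Scan(), CountMapper.class,
        NullWritable.class, NullWritable.class, job);
    job.setNumReduceTasks(0); // map-only; the counter carries the result
    job.setOutputFormatClass(NullOutputFormat.class);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}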
Storage Hierarchy

● POSIX filesystem (*nix): general-purpose FS
● HDFS: large distributed storage, high aggregate throughput
● HBase: large indexed tables, fast random access, consistent
● Redis: in-memory KV store, extremely fast access
● Other KV store(s)
Redis

● Redis is an open source, in-memory key-value store with disk persistence
● Originally written at LLOGG by Salvatore Sanfilippo, ~2009
● Written in ANSI C; works on most Linux systems
● No external dependencies
● Very small: ~1 MB memory per instance
● Values can be data structures: String, Hash, Set, Sorted Set
● Compressed in-memory representation of data
● Clients are available in lots of languages: C, C#, Clojure, Scala, Lua...
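To make those data types concrete, a minimal sketch using the Jedis Java client (host, port and key names are illustrative):

import redis.clients.jedis.Jedis;

Jedis jedis = new Jedis("localhost", 6379);

jedis.set("user:1:name", "Ada");                   // String
jedis.hset("user:1", "email", "ada@example.com");  // Hash field
jedis.sadd("user:1:tags", "admin", "beta");        // Set members
jedis.zadd("leaderboard", 42.0, "user:1");         // Sorted Set with score

System.out.println(jedis.get("user:1:name"));              // "Ada"
System.out.println(jedis.zscore("leaderboard", "user:1")); // 42.0
jedis.close();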
Redis Key Components

[Diagram: Each Redis instance is a single-threaded server combining a highly optimized network layer with highly optimized memory storage. To use N CPUs, run N independent instances, one per CPU, sharing the machine's network and memory.]
Redis Network Layer

[Diagram: client and TCP server; pipelined requests 1,2,3,4...10000 flow to the server and replies come back through a response queue.]

● Typical request/response system: 10K requests cost 20K network calls
  ○ If each call takes 1 ms, 20 secs are lost on the network
● Use batching, called pipelining
  ○ Send one response for 10K requests, saving 10 seconds per 10K calls
● Bypasses the OS socket-layer abstraction
  ○ Uses low-level epoll(), kqueue(), select() calls
● Low overhead from waiting threads
● Allows handling of close to 10K concurrent clients
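A minimal sketch of that batching from the client side, using the Jedis client's pipelining support (names are illustrative); requests are buffered locally and flushed in one batch instead of paying a round trip each:

import redis.clients.jedis.Jedis;
import redis.clients.jedis.Pipeline;
import redis.clients.jedis.Response;

Jedis jedis = new Jedis("localhost", 6379);
Pipeline p = jedis.pipelined();

// Queue 10K writes locally; nothing waits on the network yet
for (int i = 0; i < 10000; i++) {
    p.set("key:" + i, "value-" + i);
}
Response<String> last = p.get("key:9999"); // reads can be queued too

p.sync(); // flush the batch and read all replies in one pass
System.out.println(last.get()); // "value-9999"
jedis.close();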
Redis Memory Optimizations

● Integer encoding for small values
● Small hashes are converted to arrays
  ○ Leverages CPU caching
● Uses a 32-bit build when possible
● Leads to 5x to 10x memory savings
Redis Enterprise Features

[Diagram: A client talks to a Redis master, which asynchronously replicates to Slave1 and Slave2. Data can be partitioned across shards (Shard 1, Shard 2), each shard being its own master-plus-slaves cluster (Cluster 1, Cluster 2).]
Redis Wrap-Up

● Super-fast in-memory KV store
● Provides a CLI
● Typical apps will require client-side coding
● Spills to disk for large data sets, with reduced performance
● The upcoming "cluster" feature will keep 3 copies for HA
Storage Hierarchy

● POSIX filesystem (*nix): general-purpose FS
● HDFS: large distributed storage, high aggregate throughput
● HBase: large indexed tables, fast random access, consistent
● Redis: in-memory KV store, extremely fast access
● Other KV store(s)
Questions?
Storage Systems for Big Data

Sameer Tiwari, Hadoop Storage Architect, Pivotal
[email protected], @sameertech