Storage Systems for Big Data

Sameer Tiwari, Hadoop Storage Architect, Pivotal
[email protected], @sameertech
Storage Hierarchy

● POSIX filesystem (*nix): general-purpose FS
● HDFS: large distributed storage, high aggregate throughput
● HBase: large indexed tables, fast random access, consistent
● Redis: in-memory KV store, extremely fast access
● Other KV store(s)
Hadoop Distributed File System (HDFS)

● History
  ○ Based on the Google File System paper (2003)
  ○ Built at Yahoo by a small team
● Goals
  ○ Tolerance to hardware failure
  ○ Sequential access as opposed to random
  ○ High aggregate throughput for large data sets
  ○ "Write Once, Read Many" paradigm
HDFS - Key Components

[Diagram: Client1 (FileA) and Client2 (FileB) talk to a NameNode and to DataNodes 1-4, spread across Rack 1 and Rack 2. File.create() and other metadata ops go to the NameNode; File.write() streams data blocks to the DataNodes; replicas (AB1, AB2, BB1) are then copied DataNode-to-DataNode via replication pipelining.]

● NameNode metadata
  ○ FileA: metadata (e.g. size, owner...) plus block locations AB1:D1, AB1:D3, AB1:D4 and AB2:D1, AB2:D3, AB2:D4
  ○ FileB: metadata (e.g. size, owner...) plus block locations BB1:D1, BB1:D2, BB1:D4
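To make the create/write flow concrete, here is a minimal client sketch using the standard org.apache.hadoop.fs.FileSystem API (the path and payload are illustrative). The client only sees streams; the NameNode and DataNode traffic described above happens underneath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

Configuration conf = new Configuration(); // picks up core-site.xml/hdfs-site.xml
FileSystem fs = FileSystem.get(conf);

// create() is a NameNode metadata op; writes on the stream go to DataNodes
Path file = new Path("/data/fileA");
FSDataOutputStream out = fs.create(file);
out.write("hello hdfs".getBytes("UTF-8")); // buffered and streamed as blocks
out.close(); // block completion is reported back to the NameNode

// open() fetches block locations from the NameNode, then streams from DataNodes
FSDataInputStream in = fs.open(file);
byte[] buf = new byte[10];
in.readFully(0, buf);
in.close();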
HDFS - Communication

[Diagram: Client1 (FileA), the NameNode, and DataNodes holding blocks AB1, AB2, BB1.]

● Client to NameNode: HDFS client API, RPC (ClientProtocol)
● Client to DataNode: HDFS client API
  ○ DataTransferProtocol
  ○ Non-RPC, streaming
  ○ Heavy buffering
● DataNode to NameNode: RPC (DataNodeProtocol)
  ○ DN registration: at init time
  ○ Heartbeat: stats about activity and capacity (every few secs)
  ○ Block report: list of blocks (hourly)
  ○ Block received: triggered by client upload
● DataNode to DataNode: replication pipelining, streaming (e.g. replica AB2)
HDFS - NameNode 1 of 4

● Heart of HDFS; typically lots of memory, ~128 GB
● Hosts two important tables
  ○ The HDFS namespace: File -> Block mapping
    ■ Persisted for backup
  ○ The inode table: Block -> DataNode mapping
    ■ Not persisted
    ■ Re-built from block reports
● HDFS is a journaled file system
  ○ Maintains a WAL called the edit log
  ○ The edit log is merged into the fsimage at a preset log size
HDFS - NameNode 2 of 4

● Can take on 3 roles
● Regular mode: hosts the HDFS namespace
● Backup mode: Secondary NN
  ○ Downloads the fsimage regularly
  ○ Merges changes to the namespace
  ○ "Secondary" is a misnomer; it is more of a checkpointing server
● Safemode: at startup time
  ○ A read-only (R/O) mode
  ○ Collects data from active DNs
HDFS - NameNode 3 of 4
HA using Quorum Journal Manager (Hadoop 2.0+)

[Diagram: Clients talk to an Active NN and a Standby NN. Both NNs share the edit log through a quorum of JournalNodes, DataNodes send block reports to both NNs, and a ZK cluster coordinates automatic failover.]
HDFS - NameNode 4 of 4

● Replication monitor: fixes over/under-replicated blocks
  ○ Replica modes: corrupt, current, out-of-date, under-construction
● Lease management: during file creation
  ○ Ensures a single writer (multiple readers are OK)
  ○ Synchronously checks the active lease
  ○ Asynchronously checks the entire tree of leases
● Heartbeat monitor: collects DN stats and marks a DN down if no heartbeat is received for ~10 mins
HDFS - DataNode

● Typical machine: ~4 TB x 12 disks, JBOD
● Has no idea about HDFS; only knows about blocks
● Serves 2 types of requests
  ○ NN requests for block create/delete/replicate
  ○ Block R/W requests from clients
● Maintains only one table
  ○ Block -> real bytes on the local FS
  ○ Stored locally and not backed up
  ○ The DN can re-build this table by scanning its local dirs
● Creates a checksum file for each block
● Runs blockScanner() to find corrupt blocks
● DataNode to NameNode communication
  ○ Init: registration
  ○ Sends a heartbeat to the NN every few secs
  ○ Block completion: blockReceived()
  ○ Lets the NN respond with block commands
  ○ Sends a full block report every hour
HDFS - Typical Deployment

[Diagram: Racks 1..N (10-20 machines each), every rack with a top-of-rack (TOR) switch. TOR switches uplink to aggregator switches (1, 2, 3...), which uplink to a master switch.]
HDFS - Limitations

● The NN holds the namespace in a single Java process
  ○ 64 GB heap == ~250 million files + blocks
  ○ Federation sort of solves the problem
  ○ Moving the namespace to a KV store is one solution
● Enterprise features are slowly being added
  ○ Snapshots
  ○ NFS access
  ○ Geo-replication
  ○ Erasure coding to reduce 3x copies to ~1.3x

HDFS - Advanced Concepts

● Support for fadvise readahead and drop-behind
● HDFS takes advantage of multiple disks
  ○ Individual disk failures do not cause DN failures
  ○ Spills are parallelized
● Replica and task placement (see the sketch below)
  ○ Done by DNSToSwitchMapping.resolve()
  ○ User-supplied rack topology
  ○ IP address -> rack id mapping
  ○ net.topology.* settings in core-site.xml
● A couple of tools for perf monitoring
  ○ Ganglia for HDFS
  ○ Nagios for general machine health
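For illustration, a minimal sketch of a custom rack resolver, assuming the Hadoop 2.x-era DNSToSwitchMapping interface; the IP-to-rack rule is made up, and the class would be wired in via the net.topology.node.switch.mapping.impl setting:

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.net.DNSToSwitchMapping;

// Hypothetical resolver: derive the rack id from the third octet of the IP
public class ThirdOctetRackMapping implements DNSToSwitchMapping {
  @Override
  public List<String> resolve(List<String> names) {
    List<String> racks = new ArrayList<String>(names.size());
    for (String name : names) {
      String[] octets = name.split("\\.");
      // e.g. 10.1.42.7 -> /rack-42; anything unparseable gets the default rack
      racks.add(octets.length == 4 ? "/rack-" + octets[2] : "/default-rack");
    }
    return racks;
  }

  @Override
  public void reloadCachedMappings() { } // this sketch keeps no cache

  @Override
  public void reloadCachedMappings(List<String> names) { }
}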
Storage Hierarchy

● POSIX filesystem (*nix): general-purpose FS
● HDFS: large distributed storage, high aggregate throughput
● HBase: large indexed tables, fast random access, consistent
● Redis: in-memory KV store, extremely fast access
● Other KV store(s)
HBase

● History
  ○ Based on Google's Bigtable paper (2006)
  ○ Built at Powerset (later acquired by Microsoft)
  ○ Facebook and Yahoo use it extensively (~1000 machines)
● Goals
  ○ Random R/W access
  ○ Tables with billions of rows x millions of columns
  ○ Often referred to as a "NoSQL" data store
  ○ High-speed ingest rate, e.g. FB == ~1 billion msgs+chats per day
  ○ Good consistency model
HBase - Key Components

[Diagram: Master(s), active and backup, run the NameNode, JobTracker and HMaster; many slaves each run a DataNode, TaskTracker and HRegionServer; a ZK cluster coordinates the cluster, and clients consult it to find their way in.]
HBase - Data Model

● The Google Bigtable paper (section 2) says:

"A Bigtable is a sparse, distributed, persistent multidimensional sorted map. The map is indexed by a row key, column key, and a timestamp; each value in the map is an uninterpreted array of bytes."

Let's break that down over the next few slides...
HBase - Data Model

● Data is stored in tables
● Tables have rows and columns
● That's where the similarity (to an RDBMS) ends
  ○ Columns are grouped into column families
● Rows are stored in sorted (increasing) order
  ○ Implies there is only one primary key
● Rows can be sparsely populated
  ○ Variable-length rows are common
● The same row can be updated multiple times
  ○ Each update is stored as a versioned entry
HBase - Data Model: Conceptual View

Row Key         Time Stamp   ColumnFamily contents         ColumnFamily anchor
"com.cnn.www"   t9                                         anchor:cnnsi.com = "CNN"
"com.cnn.www"   t8                                         anchor:my.look.ca = "CNN.com"
"com.cnn.www"   t5           contents:html = "<html>..."
"com.cnn.www"   t3           contents:html = "<html>..."

● Row key: a byte array, sorted by byte order
● Column => ColumnFamily:Qualifier, e.g. two columns in the "anchor" family and a single column in "contents"; values are byte arrays
● Versions: timestamps, e.g. System.currentTimeMillis()
HBase - Data Model: Physical View

Row Key         Time Stamp   ColumnFamily anchor
"com.cnn.www"   t9           anchor:cnnsi.com = "CNN"
"com.cnn.www"   t8           anchor:my.look.ca = "CNN.com"

Row Key         Time Stamp   ColumnFamily contents
"com.cnn.www"   t5           contents:html = "<html>..."
"com.cnn.www"   t3           contents:html = "<html>..."
HBase - Table Objects

[Diagram: A logical table (data R1-R40) is sharded into regions, e.g. Region1 holds rows R1-R10 and Region2 holds R11-R20. Regions are hosted by region servers (~200 regions per server). Each region consists of a MemStore plus HFiles, and each region server keeps an HLog/WAL; the HFiles and the WAL are stored as HDFS blocks.]
HBase - Data Model Operations

○ The HTable class offers 4 operations: get, put, delete and scan
○ The first 3 have single and batch modes available

// Scan example
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public static final byte[] CF1 = "empData1".getBytes();
public static final byte[] ATTR1 = "empId".getBytes();

Configuration conf = HBaseConfiguration.create();
HTable htable = new HTable(conf, "employees"); // create an instance of HTable; table name is illustrative

Scan scan = new Scan();
scan.addColumn(CF1, ATTR1);
scan.setStartRow(Bytes.toBytes("200"));
scan.setStopRow(Bytes.toBytes("500"));
ResultScanner rs = htable.getScanner(scan);
try {
  for (Result r = rs.next(); r != null; r = rs.next()) {
    // do something with it...
  }
} finally {
  rs.close();
}
HBase - Data Versioning

○ By default a put() uses the current timestamp, but you can override it
○ Use Get.setMaxVersions() or Get.setTimeRange() to select versions
○ By default a get() returns the latest version, but you can ask for any
○ All data model operations return cells in sorted order: Row:CF:Col:Version (versions newest-first)
○ Delete flavors: delete col+version, delete col, delete column family, delete row
○ Deletes work by creating tombstone markers
○ LIMITATIONS:
  ■ A delete() masks a put() until a major compaction takes place
  ■ Major compactions can change get() results
○ All operations are ATOMIC within a row
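A minimal sketch of versioned puts and gets, reusing the htable instance from the scan example; the column family, qualifier and timestamps are illustrative:

import java.util.List;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;

// Write the same cell twice with explicit timestamps
Put p1 = new Put(Bytes.toBytes("row-1"));
p1.add(Bytes.toBytes("myColFam1"), Bytes.toBytes("col1"), 100L, Bytes.toBytes("v1"));
htable.put(p1);
Put p2 = new Put(Bytes.toBytes("row-1"));
p2.add(Bytes.toBytes("myColFam1"), Bytes.toBytes("col1"), 200L, Bytes.toBytes("v2"));
htable.put(p2);

// A plain get() would return only "v2"; ask for older versions explicitly
Get g = new Get(Bytes.toBytes("row-1"));
g.setMaxVersions(3);
Result r = htable.get(g);
List<KeyValue> versions = r.getColumn(Bytes.toBytes("myColFam1"), Bytes.toBytes("col1"));
// versions is sorted newest-first: [v2 @ t=200, v1 @ t=100]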
HBase - Read Path

[Diagram: client, ZK cluster, RegionServer1 hosting -ROOT-, RegionServer2 hosting .META. plus a user region (MemStore, HFile-1, HFile-2).]

1. Client asks the ZK cluster: where is -ROOT-? A: RegionServer1
2. Client asks -ROOT- (the table that keeps track of the .META. table; .META.,region,key -> regionInfo, server): where is .META.? A: RegionServer2
3. Client looks up the row's region in .META. (the table for all regions in the system, which never splits; table,startKey,id -> regionInfo, server)
4. Client issues HTable.get() to that region's server
5. The region server checks the MemStore, then the HFiles
6. A: the row is returned to the client
HBase - Write Path

[Diagram: client, ZK cluster, RegionServer1 hosting -ROOT-, RegionServer2 hosting .META. plus a user region (MemStore, HLog/WAL on HDFS blocks).]

1. Client asks the ZK cluster: where is -ROOT-? A: RegionServer1
2. Client asks -ROOT-: where is .META.? A: RegionServer2
3. Client looks up the row's region in .META.
4. Client issues HTable.put() to that region's server
5. The region server appends the edit to the HLog/WAL (HDFS blocks) and writes it into the MemStore
6. A return code goes back to the client; the MemStore is flushed to HFiles offline
HBase - Shell

○ Table metadata: e.g. create/alter/drop/describe table
○ Table data: e.g. put/scan/delete/count row(s)
○ Admin: e.g. flush/rebalance/compact regions, split tables
○ Replication tools: e.g. add/enable/list/start/stop replication
○ Security: e.g. grant/revoke/list user permissions

Shell interaction example:

hbase(main):001:0> create 'myTable', 'myColFam1'
0 row(s) in 3.8890 seconds

hbase(main):002:0> put 'myTable', 'row-1', 'myColFam1:col1', 'value-1'
0 row(s) in 0.1840 seconds

hbase(main):003:0> scan 'myTable'
ROW                COLUMN+CELL
 row-1             column=myColFam1:col1, timestamp=1457381922312, value=value-1
1 row(s) in 0.1160 seconds
HBase - Advanced Topics

○ Bulk loading
○ Cluster replication
○ Merging and splitting of regions
○ Predicate pushdown using server-side filters
○ Bloom filters
○ Co-processors
○ Snapshots
○ Performance tuning
HBase - What it's not

○ HBase is not for everyone
○ Has no support for
  ■ SQL
  ■ Joins
  ■ Secondary indexes
  ■ Transactions
  ■ A JDBC driver
○ Works well with large deployments
○ Requires good working knowledge of the Hadoop eco-system
HBase - What it's good at

● Strongly consistent reads/writes
● Automatic sharding
● Automatic RegionServer failover
● Supports MapReduce, with HBase as both source and sink (see the sketch below)
● Works on top of HDFS
● Provides a Java client API and REST/Thrift APIs
● Block cache and Bloom filter support
● Web UI and JMX support, for operational management
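A hedged sketch of HBase as a MapReduce source: a map-only row counter built with TableMapReduceUtil (the table name is illustrative, and the API shown assumes the 0.9x-era org.apache.hadoop.hbase.mapreduce package):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class SimpleRowCounter {
  static class CountMapper extends TableMapper<NullWritable, NullWritable> {
    @Override
    protected void map(ImmutableBytesWritable row, Result value, Context ctx)
        throws IOException, InterruptedException {
      ctx.getCounter("hbase", "rows").increment(1); // one tick per row
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = Job.getInstance(conf, "simple-row-counter");
    job.setJarByClass(SimpleRowCounter.class);
    TableMapReduceUtil.initTableMapperJob(
        "myTable", new Scan(), CountMapper.class,
        NullWritable.class, NullWritable.class, job);
    job.setNumReduceTasks(0); // map-only; the counter carries the result
    job.setOutputFormatClass(NullOutputFormat.class);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}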
Storage Hierarchy

● POSIX filesystem (*nix): general-purpose FS
● HDFS: large distributed storage, high aggregate throughput
● HBase: large indexed tables, fast random access, consistent
● Redis: in-memory KV store, extremely fast access
● Other KV store(s)
Redis

● Redis is an open source, in-memory key-value store with disk persistence
● Originally written at LLOGG by Salvatore Sanfilippo, ~2009
● Written in ANSI C; works on most Linux systems
● No external dependencies
● Very small: ~1 MB memory per instance
● Values can be data structures: String, Hash, Set, Sorted Set
● Compressed in-memory representation of data
● Clients are available in lots of languages: C, C#, Clojure, Scala, Lua...
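To make those data types concrete, a minimal sketch using the Jedis Java client (host, port and key names are illustrative):

import redis.clients.jedis.Jedis;

Jedis jedis = new Jedis("localhost", 6379);

jedis.set("user:1:name", "Ada");                   // String
jedis.hset("user:1", "email", "ada@example.com");  // Hash field
jedis.sadd("user:1:tags", "admin", "beta");        // Set members
jedis.zadd("leaderboard", 42.0, "user:1");         // Sorted Set with score

System.out.println(jedis.get("user:1:name"));              // "Ada"
System.out.println(jedis.zscore("leaderboard", "user:1")); // 42.0
jedis.close();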
Redis Key Components

[Diagram: Each Redis instance is a single-threaded server combining a highly optimized network layer with highly optimized memory storage. To use N CPUs, run N independent instances, one per CPU, sharing the machine's network and memory.]
Redis Network Layer

[Diagram: client and TCP server; pipelined requests 1,2,3,4...10000 flow to the server and replies come back through a response queue.]

● Typical request/response system: 10K requests cost 20K network calls
  ○ If each call takes 1 ms, 20 secs are lost on the network
● Use batching, called pipelining
  ○ Send one response for 10K requests, saving 10 seconds per 10K calls
● Bypasses the OS socket-layer abstraction
  ○ Uses low-level epoll(), kqueue(), select() calls
● Low overhead from waiting threads
● Allows handling of close to 10K concurrent clients
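A minimal sketch of that batching from the client side, using the Jedis client's pipelining support (names are illustrative); requests are buffered locally and flushed in one batch instead of paying a round trip each:

import redis.clients.jedis.Jedis;
import redis.clients.jedis.Pipeline;
import redis.clients.jedis.Response;

Jedis jedis = new Jedis("localhost", 6379);
Pipeline p = jedis.pipelined();

// Queue 10K writes locally; nothing waits on the network yet
for (int i = 0; i < 10000; i++) {
    p.set("key:" + i, "value-" + i);
}
Response<String> last = p.get("key:9999"); // reads can be queued too

p.sync(); // flush the batch and read all replies in one pass
System.out.println(last.get()); // "value-9999"
jedis.close();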
Redis Memory Optimizations

● Integer encoding for small values
● Small hashes are converted to arrays
  ○ Leverages CPU caching
● Uses a 32-bit build when possible
● Leads to 5x to 10x memory savings
Redis Enterprise Features

[Diagram: A client talks to a Redis master, which asynchronously replicates to Slave1 and Slave2. Data can be partitioned across shards (Shard 1, Shard 2), each shard being its own master-plus-slaves cluster (Cluster 1, Cluster 2).]
Redis Wrap-Up

● Super-fast in-memory KV store
● Provides a CLI
● Typical apps will require client-side coding
● Spills to disk for large data sets, with reduced performance
● The upcoming "cluster" feature will keep 3 copies for HA
Storage Hierarchy

● POSIX filesystem (*nix): general-purpose FS
● HDFS: large distributed storage, high aggregate throughput
● HBase: large indexed tables, fast random access, consistent
● Redis: in-memory KV store, extremely fast access
● Other KV store(s)
Questions?
Storage Systems for Big Data

Sameer Tiwari, Hadoop Storage Architect, Pivotal
[email protected], @sameertech