CFS: Cassandra backed storage for Hadoop
CFS: Cassandra-backed storage for Hadoop
Nick Bailey
@[email protected]
©2012 DataStax
Motivation
Help me Cassandra, you’re my only hope
Cassandra
• Distributed architecture
• No SPOF
• Scalable
• Real time data
• No ad-hoc query support
Cassandra, why can’t you...
...do the things Hadoop was built for.
Cassandra + Hadoop = <3
The Solution
• InputFormat/OutputFormat
• Unfortunately, still need a DFS
• Run tasktrackers/datanodes locally
  • Data Locality FTW!
• Run namenode/jobtracker somewhere
• Since Cassandra 0.6 (the dark ages)
Ok, but what about these parts that suck...
Do not want...
• Multiple Hadoop stacks?
• SPOF?
• 3 JVMs?
CFS
Cassandra Data model in 1 minute
Column Families
• Column Family ~= Table
• Row Key + columns
• Columns are sparse
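The sparse-column model can be sketched in plain Python (an illustration only, not Cassandra's API): a column family as a mapping from row key to a per-row dict of columns, using the Users rows from this deck.

```python
# Illustrative sketch of a column family: row key -> sparse column dict.
# Rows do not share a fixed schema; each stores only the columns it has.
users = {
    "nickmbailey": {"password": "*", "name": "Nick"},
    "zznate": {"password": "*", "name": "Nate", "phone": "512-7777"},
}

# Columns are sparse: one row has a 'phone' column, the other does not.
assert "phone" not in users["nickmbailey"]
assert users["zznate"]["phone"] == "512-7777"
```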
Static - Users Column Family

Row Key     | Columns
nickmbailey | password: *  name: Nick
zznate      | password: *  name: Nate  phone: 512-7777
SELECT * FROM Users WHERE name = 'Nick';
Secondary Indexes
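Conceptually, a secondary index is a reverse mapping from a column value back to the row keys that hold it, which is what makes the WHERE-clause query above answerable. A minimal sketch (illustrative names, not Cassandra internals):

```python
# Same Users column family as above: row key -> sparse column dict.
users = {
    "nickmbailey": {"password": "*", "name": "Nick"},
    "zznate": {"password": "*", "name": "Nate", "phone": "512-7777"},
}

# A secondary index on 'name': column value -> list of row keys.
name_index = {}
for row_key, columns in users.items():
    name_index.setdefault(columns.get("name"), []).append(row_key)

# Answering "WHERE name = 'Nick'" becomes an index lookup.
assert name_index["Nick"] == ["nickmbailey"]
```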
Dynamic - Friends

Row Key     | Columns
nickmbailey | zznate:  thobbs:
zznate      | jbeiber:  thobbs:  steve_watt:
So what about CFS...
Simple...
CF: inode
• Essentially, namenode replacement
• File metadata
CF: inode
• Row Key = UUID
• Allows for file renames
• Secondary indexes for file browsing
• Columns:

Column      | Value
filename    | /home/nick/data.txt
parent_path | /home/nick/
attributes  | nick:nick:777
TimeUUID1   | <block metadata>
TimeUUID2   | <block metadata>
TimeUUID3   | <block metadata>
...
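A sketch of why the UUID row key matters for renames (hypothetical structure, illustrating the slide rather than the actual CFS code): because the key is a UUID rather than the path, a rename only rewrites metadata columns.

```python
import uuid

# Hypothetical inode row: the row key is a UUID, and the path lives in
# ordinary columns, so renaming never requires moving the row.
inode_key = uuid.uuid4()
inode_row = {
    "filename": "/home/nick/data.txt",
    "parent_path": "/home/nick/",
    "attributes": "nick:nick:777",
    # TimeUUID-named columns would carry per-block metadata.
}

def rename(row, new_filename):
    # Only the metadata columns change; the UUID row key stays put.
    row["filename"] = new_filename

rename(inode_row, "/home/nick/renamed.txt")
assert inode_row["filename"] == "/home/nick/renamed.txt"
assert inode_row["parent_path"] == "/home/nick/"
```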
CF: sblocks
• Essentially, datanode replacement
• Stores actual contents of files
• Each row is an HDFS block
• Row Key = Block ID

Column    | Value
TimeUUID1 | <compressed file data>
TimeUUID2 | <compressed file data>
TimeUUID3 | <compressed file data>
...
Writes
• Write file metadata
• Split into blocks
  • Still controlled by ‘dfs.block.size’
  • Also ‘cfs.local.subblock.size’
• Read in a block
  • Split into sub-blocks
• Update inode, sblocks
• Rinse, repeat
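The two-level split on the write path can be sketched as follows. Toy sizes are used here for illustration; in practice the block and sub-block sizes come from the ‘dfs.block.size’ and ‘cfs.local.subblock.size’ settings.

```python
DFS_BLOCK_SIZE = 8     # toy value; really 'dfs.block.size'
SUBBLOCK_SIZE = 4      # toy value; really 'cfs.local.subblock.size'

def split_file(data: bytes):
    """Split file data into blocks, then each block into sub-blocks."""
    blocks = [data[i:i + DFS_BLOCK_SIZE]
              for i in range(0, len(data), DFS_BLOCK_SIZE)]
    return [[block[j:j + SUBBLOCK_SIZE]
             for j in range(0, len(block), SUBBLOCK_SIZE)]
            for block in blocks]

# 16 bytes -> two 8-byte blocks, each holding two 4-byte sub-blocks.
blocks = split_file(b"hello cassandra!")
assert len(blocks) == 2
assert all(len(subs) == 2 for subs in blocks)
```

Each block then becomes one row in sblocks, with one column per sub-block, and the inode row gains a column of metadata for that block.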
Reads
• Check for file in inode
• Determine appropriate blocks
• Request blocks via Thrift
• If data is local...
  • ...get location on local filesystem
• If data is remote...
  • ...get actual file content via Thrift
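The local-vs-remote branch on the read path can be sketched like this (function and variable names are illustrative, not the CFS API): a local block is answered with a filesystem location, a remote one with the bytes themselves.

```python
def read_block(block_id, local_blocks, fetch_remote):
    """Return ('local-path', path) for local data, else fetch over the wire."""
    if block_id in local_blocks:
        # Data locality win: hand back a location on the local filesystem.
        return ("local-path", local_blocks[block_id])
    # Otherwise stream the actual content back (via Thrift in CFS).
    return ("remote-data", fetch_remote(block_id))

local = {"b1": "/var/lib/cassandra/sblocks/b1"}
kind, value = read_block("b1", local, lambda b: b"...")
assert kind == "local-path"
kind, value = read_block("b2", local, lambda b: b"remote bytes")
assert (kind, value) == ("remote-data", b"remote bytes")
```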
What Else?
• Current implementation: 1.0.4
• Configure Hadoop to use CFS:

  <property>
    <name>fs.cfs.impl</name>
    <value>com.datastax.bdp.hadoop.cfs.CassandraFileSystem</value>
  </property>

• Supports HDFS append()
• Immutability makes things easy
• See the first incarnation:
  • https://github.com/riptano/brisk
Questions?