CFS: Cassandra backed storage for Hadoop
CFS: Cassandra-backed storage for Hadoop
Nick Bailey
@[email protected]
©2012 DataStax
Motivation
Help me Cassandra, you’re my only hope
Cassandra
• Distributed architecture
• No SPOF
• Scalable
• Real time data
• No ad-hoc query support
Cassandra, why can’t you...
...do the things Hadoop was built for.
Cassandra + Hadoop = <3
The Solution
• InputFormat/OutputFormat
• Unfortunately, still need a DFS
• Run tasktrackers/datanodes locally
  • Data Locality FTW!
• Run namenode/jobtracker somewhere
• Since Cassandra 0.6 (the dark ages)
Ok, but what about these parts that suck...
Do not want...
• Multiple Hadoop stacks?
• SPOF?
• 3 JVMs?
CFS
Cassandra Data model in 1 minute
Column Families
• Column Family ~= Table
• Row Key + columns
• Columns are sparse
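The sparse-column model can be sketched in plain Python (an illustration only, not Cassandra's API): a column family as a mapping from row key to a per-row dict of columns, using the Users rows from this deck.

```python
# Illustrative sketch of a column family: row key -> sparse column dict.
# Rows do not share a fixed schema; each stores only the columns it has.
users = {
    "nickmbailey": {"password": "*", "name": "Nick"},
    "zznate": {"password": "*", "name": "Nate", "phone": "512-7777"},
}

# Columns are sparse: one row has a 'phone' column, the other does not.
assert "phone" not in users["nickmbailey"]
assert users["zznate"]["phone"] == "512-7777"
```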
Static - Users Column Family

Row Key     | Columns
nickmbailey | password: *  name: Nick
zznate      | password: *  name: Nate  phone: 512-7777
SELECT * FROM Users WHERE name = 'Nick';
Secondary Indexes
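Conceptually, a secondary index is a reverse mapping from a column value back to the row keys that hold it, which is what makes the WHERE-clause query above answerable. A minimal sketch (illustrative names, not Cassandra internals):

```python
# Same Users column family as above: row key -> sparse column dict.
users = {
    "nickmbailey": {"password": "*", "name": "Nick"},
    "zznate": {"password": "*", "name": "Nate", "phone": "512-7777"},
}

# A secondary index on 'name': column value -> list of row keys.
name_index = {}
for row_key, columns in users.items():
    name_index.setdefault(columns.get("name"), []).append(row_key)

# Answering "WHERE name = 'Nick'" becomes an index lookup.
assert name_index["Nick"] == ["nickmbailey"]
```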
Dynamic - Friends

Row Key     | Columns
nickmbailey | zznate:  thobbs:
zznate      | jbeiber:  thobbs:  steve_watt:
So what about CFS...
Simple...
CF: inode
• Essentially, namenode replacement
• File metadata
CF: inode
• Row Key = UUID
• Allows for file renames
• Secondary indexes for file browsing
• Columns:

Column      | Value
filename    | /home/nick/data.txt
parent_path | /home/nick/
attributes  | nick:nick:777
TimeUUID1   | <block metadata>
TimeUUID2   | <block metadata>
TimeUUID3   | <block metadata>
...
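A sketch of why the UUID row key matters for renames (hypothetical structure, illustrating the slide rather than the actual CFS code): because the key is a UUID rather than the path, a rename only rewrites metadata columns.

```python
import uuid

# Hypothetical inode row: the row key is a UUID, and the path lives in
# ordinary columns, so renaming never requires moving the row.
inode_key = uuid.uuid4()
inode_row = {
    "filename": "/home/nick/data.txt",
    "parent_path": "/home/nick/",
    "attributes": "nick:nick:777",
    # TimeUUID-named columns would carry per-block metadata.
}

def rename(row, new_filename):
    # Only the metadata columns change; the UUID row key stays put.
    row["filename"] = new_filename

rename(inode_row, "/home/nick/renamed.txt")
assert inode_row["filename"] == "/home/nick/renamed.txt"
assert inode_row["parent_path"] == "/home/nick/"
```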
CF: sblocks
• Essentially, datanode replacement
• Stores actual contents of files
• Each row is an HDFS block
• Row Key = Block ID

Column    | Value
TimeUUID1 | <compressed file data>
TimeUUID2 | <compressed file data>
TimeUUID3 | <compressed file data>
...
Writes
• Write file metadata
• Split into blocks
  • Still controlled by ‘dfs.block.size’
  • Also ‘cfs.local.subblock.size’
• Read in a block
  • Split into sub-blocks
• Update inode, sblocks
• Rinse, repeat
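The two-level split on the write path can be sketched as follows. Toy sizes are used here for illustration; in practice the block and sub-block sizes come from the ‘dfs.block.size’ and ‘cfs.local.subblock.size’ settings.

```python
DFS_BLOCK_SIZE = 8     # toy value; really 'dfs.block.size'
SUBBLOCK_SIZE = 4      # toy value; really 'cfs.local.subblock.size'

def split_file(data: bytes):
    """Split file data into blocks, then each block into sub-blocks."""
    blocks = [data[i:i + DFS_BLOCK_SIZE]
              for i in range(0, len(data), DFS_BLOCK_SIZE)]
    return [[block[j:j + SUBBLOCK_SIZE]
             for j in range(0, len(block), SUBBLOCK_SIZE)]
            for block in blocks]

# 16 bytes -> two 8-byte blocks, each holding two 4-byte sub-blocks.
blocks = split_file(b"hello cassandra!")
assert len(blocks) == 2
assert all(len(subs) == 2 for subs in blocks)
```

Each block then becomes one row in sblocks, with one column per sub-block, and the inode row gains a column of metadata for that block.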
Reads
• Check for file in inode
• Determine appropriate blocks
• Request blocks via Thrift
• If data is local...
  • ...get location on local filesystem
• If data is remote...
  • ...get actual file content via Thrift
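The local-vs-remote branch on the read path can be sketched like this (function and variable names are illustrative, not the CFS API): a local block is answered with a filesystem location, a remote one with the bytes themselves.

```python
def read_block(block_id, local_blocks, fetch_remote):
    """Return ('local-path', path) for local data, else fetch over the wire."""
    if block_id in local_blocks:
        # Data locality win: hand back a location on the local filesystem.
        return ("local-path", local_blocks[block_id])
    # Otherwise stream the actual content back (via Thrift in CFS).
    return ("remote-data", fetch_remote(block_id))

local = {"b1": "/var/lib/cassandra/sblocks/b1"}
kind, value = read_block("b1", local, lambda b: b"...")
assert kind == "local-path"
kind, value = read_block("b2", local, lambda b: b"remote bytes")
assert (kind, value) == ("remote-data", b"remote bytes")
```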
What Else?
• Current implementation: 1.0.4
• Configure Hadoop to use CFS:

  <property>
    <name>fs.cfs.impl</name>
    <value>com.datastax.bdp.hadoop.cfs.CassandraFileSystem</value>
  </property>

• Supports HDFS append()
• Immutability makes things easy
• See the first incarnation:
  • https://github.com/riptano/brisk
Questions?