High order bits from cassandra & hadoop

Post on 15-Jan-2015

High-order bits from Cassandra & Hadoop

Srisatish Ambati
@srisatish

NoSQL - Know your queries.

points

• Usecases
• Why NoSQL?
• Why Cassandra?
• Usecase: Hadoop, Brisk
• FUD: Consistency
• Why Facebook is not using Cassandra?
• Community, Code, Tools
• Q&A

Users. Netflix.
Key by Customer, read-heavy
Key by Customer:Movie, write-heavy

TimeSeries: (several customers)
periodic readings: dev0, dev1…
deviceID:metric:timestamp -> value

Metrics typically way larger dataset than users.
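The deviceID:metric:timestamp -> value layout can be sketched in a few lines. This is a toy in-memory stand-in, not Cassandra client code: the class name `TimeSeriesStore`, its methods, and the choice to group readings per (device, metric) row are all assumptions made for illustration, and values are assumed to be numeric.

```python
from bisect import bisect_left, bisect_right, insort


def make_key(device_id, metric, timestamp):
    """Build the flat key shown on the slide: deviceID:metric:timestamp."""
    return f"{device_id}:{metric}:{timestamp}"


class TimeSeriesStore:
    """Hypothetical toy store: one row per (device, metric), sorted by timestamp."""

    def __init__(self):
        # (device_id, metric) -> sorted list of (timestamp, value) pairs
        self._rows = {}

    def write(self, device_id, metric, timestamp, value):
        row = self._rows.setdefault((device_id, metric), [])
        insort(row, (timestamp, value))  # keep readings ordered by time

    def read_range(self, device_id, metric, start, end):
        """Return readings with start <= timestamp <= end, oldest first."""
        row = self._rows.get((device_id, metric), [])
        lo = bisect_left(row, (start,))
        hi = bisect_right(row, (end, float("inf")))
        return row[lo:hi]
```

Knowing the query up front (a time slice for one device and metric) is what drives the key design: readings that are read together sit next to each other, already sorted.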

Why Cassandra?

Operational simplicity
peer-to-peer

Replication:
Multi-datacenter
Multi-region ec2, aws
Multi-availability zones

dc1 dc2

reads local

“Movie marathons on Netflix awaiting AWS to come back up.” #ec2disabled

4.21.2011, Amazon Web Services outage:

Netflix was running on AWS.

fast durable writes. fast reads.

Writes
Sequential, append-only.
~1-5 ms

On cloud: ephemeral disks rock!
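Why sequential, append-only writes are both fast and durable can be shown with a toy sketch: every write is appended to a log on disk and then applied to an in-memory table, so nothing is ever updated in place and state can be rebuilt by replaying the log. The class `AppendOnlyStore`, the JSON-lines log format, and the fsync-on-every-write policy are assumptions for illustration; Cassandra's real commit log and memtable are considerably more involved.

```python
import json
import os
import tempfile


class AppendOnlyStore:
    """Hypothetical toy write path: append to a log, then update memory."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.memtable = {}  # in-memory view of the latest value per key
        self._log = open(log_path, "a", encoding="utf-8")

    def write(self, key, value):
        # 1. Sequential append: no seek, no read-before-write.
        self._log.write(json.dumps({"k": key, "v": value}) + "\n")
        self._log.flush()
        os.fsync(self._log.fileno())  # assumption: sync on every write
        # 2. Only then apply the write in memory.
        self.memtable[key] = value

    def close(self):
        self._log.close()

    @classmethod
    def recover(cls, log_path):
        """Rebuild the in-memory table by replaying the log in order."""
        store = cls(log_path)
        with open(log_path, encoding="utf-8") as log:
            for line in log:
                record = json.loads(line)
                store.memtable[record["k"]] = record["v"]
        return store
```

Later writes to the same key simply win during replay, which is why no in-place update is needed.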

Reads
Local
Key & row caches (also, JNA-based 0xffheap)
indexes, materialized

ssds, improved read performance!

Clients:
CQL, Thrift
pycassa, phpcassa
Hector, Pelops
(Scala, Ruby, Clojure)

Usecase #3: Hadoop
HDFS, Cassandra, Hive
Logs, stats, analytics

Brisk
Truly peer-to-peer Hadoop.

mv computation
not data

Parallel Execution View

jobtracker, tasktracker
hdfs: namenode, datanode

Cloudera
Amazon: Elastic MapReduce
Hortonworks
MapR
Brisk

Namenode decomposition, explained.

Use column families (tables)
inode
sblock

Near-real-time Hadoop
Low latency: cassandra_dc nodes
Batch analytics: brisk_dc nodes

FUD, acronym: fear, uncertainty, doubt.

Consistency: R + W > N
ORACLE, 2-node: R=1, W=2, N=2, (T=2)
DNS

* N is the replication factor, not to be confused with T = total # of nodes.

Tunable, flexible.
For High Consistency: read: quorum, write: quorum
For High Availability: high W, low R.
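The R + W > N rule can be checked by brute force: if every possible set of R replicas read overlaps every possible set of W replicas written, a read always touches at least one replica holding the latest write. The sketch below is illustrative only; the function names are made up for this example and it models replica sets abstractly rather than calling any Cassandra API.

```python
from itertools import combinations


def quorum(n):
    """Majority of n replicas, e.g. 2 of 3."""
    return n // 2 + 1


def satisfies_rule(r, w, n):
    """The slide's condition for consistent reads: R + W > N."""
    return r + w > n


def reads_always_see_latest_write(r, w, n):
    """Brute force: does every R-replica read set overlap every W-replica write set?"""
    replicas = range(n)
    return all(
        set(read_set) & set(write_set)
        for write_set in combinations(replicas, w)
        for read_set in combinations(replicas, r)
    )


if __name__ == "__main__":
    n = 3
    q = quorum(n)
    print("quorum/quorum, N=3:", satisfies_rule(q, q, n))  # 2 + 2 > 3
    print("R=1, W=2, N=2:", satisfies_rule(1, 2, 2))       # the 2-node example
    print("R=1, W=1, N=3:", satisfies_rule(1, 1, n))       # overlap not guaranteed
```

For small N the brute-force check and the inequality agree on every combination of R and W, which is the pigeonhole argument behind the rule.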

Inbox Search: 600+ cores, 120+ TB (2008)
Went from 100m to 500m users.

Average NoSQL deployment size: ~6-12 nodes.

Usecase #5: search
Apache Solr + Cassandra = Solandra

Other inbox/file searches: Xobni, c3

github.com/tjake/solandra

“Eventual consistency is harder to program.”
Mostly immutable data.
Complex systems at scale.

Miscellaneous
Myth: data loss, partial rows.
Writes are durable.

Three good reasons for Cassandra...

Tools
AMIs, OpsCenter, DataStax
AppDynamics

B e a u t i f u l C 0 d e

= new code(); //less is more
~90k.
java.concurrent.
@annotate.
bloomfilters, merkletrees.
non-blocking, staged-event-driven.
bigtable, dynamo.

Current & Future Focus:
Distributed Counters, CQL.
Simple client.
Operational smoothing.

compaction.

Community
Robust. Rapid. #
Professional support from DataStax.
Filesystem innovation from Acunu

Engineers: independent, startups, large companies, Rackspace, Twitter, Netflix...

Come join the efforts!

Usecase #4: first NoSQL, then scale!
SimpleDB -> Cassandra
MongoDB -> Cassandra

Copyright: xkcd

Copyright: plantoys

… more than one way to do it!

Summary - high-scale peer-to-peer datastore

best friend for multi-region, multi-zone availability.

Hadoop – HDFS engulfing the DataWorld

Q&A
@srisatish

NoSQL - Know your queries.