High-order bits from Cassandra & Hadoop

Srisatish Ambati, @srisatish

NoSQL: Know your queries.

points

• Usecases
• Why NoSQL?
• Why Cassandra?
• Usecase: Hadoop, Brisk
• FUD: Consistency
• Why Facebook is not using Cassandra?
• Community, Code, Tools
• Q&A

Users. Netflix.
Key by Customer, read-heavy
Key by Customer:Movie, write-heavy

TimeSeries: (several customers)
periodic readings: dev0, dev1…
deviceID:metric:timestamp -> value
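The deviceID:metric:timestamp key layout can be sketched in a few lines of Python. This is only an illustration: `make_key`, `parse_key`, and the choice of zero-padded epoch seconds are assumptions, not something the slides specify.

```python
from datetime import datetime, timezone


def make_key(device_id: str, metric: str, ts: datetime) -> str:
    """Build a 'deviceID:metric:timestamp' key (hypothetical helper).

    The timestamp is written as zero-padded epoch seconds so that keys
    for one device and metric sort chronologically as plain strings.
    """
    epoch = int(ts.timestamp())
    return f"{device_id}:{metric}:{epoch:010d}"


def parse_key(key: str) -> tuple:
    """Split a key back into (device_id, metric, epoch_seconds)."""
    device_id, metric, epoch = key.rsplit(":", 2)
    return device_id, metric, int(epoch)


# A plain dict stands in for the datastore in this sketch.
readings = {}
t0 = datetime.fromtimestamp(1303344000, tz=timezone.utc)
readings[make_key("dev0", "temp", t0)] = 21.5
```

Keeping the timestamp last means all readings for one device and metric share a prefix, which is what makes range scans over time straightforward.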

Metrics are typically a much larger dataset than users.

Why Cassandra?

Operational simplicity: peer-to-peer

Replication:
Multi-datacenter
Multi-region EC2, AWS
Multi-availability zones

dc1 dc2

reads local

“Movie marathons on Netflix awaiting AWS to come back up.” #ec2disabled

4.21.2011, Amazon Web Services outage:

Netflix was running on AWS.

fast durable writes. fast reads.

Writes: sequential, append-only. ~1-5 ms
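Why sequential, append-only writes are fast and durable can be illustrated with a toy log. This is a minimal sketch, not Cassandra's actual commit log: the `AppendOnlyLog` class, the JSON-lines record format, and the in-memory dict are all assumptions made for illustration.

```python
import json
import os


class AppendOnlyLog:
    """Toy append-only log with an in-memory table (illustrative only)."""

    def __init__(self, path):
        self.path = path
        self.memtable = {}

    def write(self, key, value):
        # Every write is appended to the end of the file: no seeks,
        # no in-place updates. fsync makes the record durable.
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps({"k": key, "v": value}) + "\n")
            f.flush()
            os.fsync(f.fileno())
        self.memtable[key] = value

    def replay(self):
        # After a restart, rebuild the in-memory table from the log.
        # Later records win over earlier ones for the same key.
        self.memtable = {}
        if os.path.exists(self.path):
            with open(self.path, encoding="utf-8") as f:
                for line in f:
                    record = json.loads(line)
                    self.memtable[record["k"]] = record["v"]
        return self.memtable
```

An overwrite is just another appended record, so the write path never has to find and modify old data on disk.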

On cloud: ephemeral disks rock!

Reads: local.
Key & row caches (also, JNA-based 0xffheap), indexes, materialized

SSDs: improved read performance!

Clients: CQL, Thrift
pycassa, phpcassa
Hector, Pelops
(Scala, Ruby, Clojure)

Usecase #3: Hadoop
HDFS, Cassandra, Hive
Logs, stats, analytics

Brisk
Truly peer-to-peer Hadoop.

mv computation
not data

Parallel Execution View

jobtracker, tasktracker
hdfs: namenode, datanode

Cloudera
Amazon: Elastic MapReduce
Hortonworks
MapR
Brisk

Namenode decomposition, explained.

Use column families (tables)
inodes
block

Near-real-time Hadoop
Low latency: cassandra_dc nodes
Batch analytics: brisk_dc nodes

FUD, acronym: fear, uncertainty, doubt.

Consistency: R + W > N
Oracle, 2-node: R=1, W=2, N=2 (T=2)
DNS

* N is the replication factor, not to be confused with T = total # of nodes.
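The R + W > N rule is easy to check mechanically. A minimal sketch, assuming R and W are the number of replicas consulted on read and write and N is the replication factor; the function names `quorum` and `read_sees_latest_write` are made up for this example.

```python
def quorum(n: int) -> int:
    """Smallest majority of n replicas."""
    return n // 2 + 1


def read_sees_latest_write(r: int, w: int, n: int) -> bool:
    """True when every read set must overlap every write set (R + W > N)."""
    if not (1 <= r <= n and 1 <= w <= n):
        raise ValueError("r and w must each be between 1 and n")
    return r + w > n


# The 2-node example from the slide: R=1, W=2, N=2.
two_node = read_sees_latest_write(r=1, w=2, n=2)

# Quorum reads and quorum writes at replication factor 3.
quorum_rw = read_sees_latest_write(quorum(3), quorum(3), 3)

# R=1, W=1 at replication factor 3: a read may miss the latest write.
one_one = read_sees_latest_write(1, 1, 3)
```

With R + W > N, at least one replica in any read set has also acknowledged the most recent write, which is why the overlap is enough for a read to observe it.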

Tunable, flexible.

For high consistency:
read: quorum, write: quorum

For high availability:
high W, low R.

Inbox Search: 600+ cores, 120+ TB (2008)
Went from 100M to 500M users.

Average NoSQL deployment size: ~6-12 nodes.

Usecase #5: search
Apache Solr + Cassandra = Solandra

Other inbox/file searches:
xobni, c3

github.com/tjake/solandra

“Eventual consistency is harder to program.”
mostly immutable data.
complex systems at scale.

Miscellaneous, Myth: data-loss, partial rows.
writes are durable.

Three good reasons for Cassandra...

Tools
AMIs, OpsCenter, DataStax
AppDynamics

B e a u t i f u l C 0 d e

= new code(); // less is more
~90k.
java.concurrent.
@annotate.
bloomfilters, merkletrees.
non-blocking, staged-event-driven.
bigtable, dynamo.
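One of the ingredients listed above, the Bloom filter, fits in a few lines. This is a generic sketch of the data structure, not Cassandra's implementation: the `BloomFilter` class, its sizes, and the SHA-256-based hashing are assumptions chosen for readability.

```python
import hashlib


class BloomFilter:
    """Generic Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key: str):
        # Derive num_hashes bit positions by salting the key with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: str) -> bool:
        # False means the key was definitely never added.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(key)
        )
```

A store can consult such a filter before touching disk: a negative answer rules out a file for a given key, and a positive answer only means the file is worth checking.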

Current & Future Focus:
Distributed Counters, CQL.
Simple client.
Operational smoothing.

compaction.

Community
Robust. Rapid. #
Professional support from DataStax.
Filesystem innovation from Acunu.

engineers: independent, startups, large companies, Rackspace, Twitter, Netflix..

Come join the efforts!

Usecase #4: first NoSQL, then scale!
simpledb Cassandra
mongodb Cassandra

Copyright: xkcd

Copyright: plantoys

… more than one way to do it!

Summary: high-scale peer-to-peer datastore

best friend for multi-region, multi-zone availability.

Hadoop – HDFS engulfing the DataWorld

Q&A
@srisatish
