Austin Cassandra Users 6/19: Apache Cassandra at Vast
June 19, 2014
Cassandra at Vast
Graham Sanderson - CTO; David Pratt - Director of Applications
Introduction
• Don’t want this to be a data modeling talk
• We aren't experts - we are learning as we go
• Hopefully this will be useful to both you and us
• Informal, questions as we go
• We will share our experiences so far moving to Cassandra
• We are working on a bunch of existing and new projects
• We'll talk about 2 1/2 of them
• Some dev stuff, some ops stuff
• Some thoughts for the future
• Athena Scala Driver
Who is Vast?
• Vast operates white-label, performance-based marketplaces for publishers, and delivers big data mobile applications for automotive and real estate sales professionals
• “Big Data for Big Purchases”
• Marketplaces
• Large partner sites, including AOL, CARFAX, TrueCar, Realogy, USAA, Yahoo
• Hundreds of smaller partner sites
• Analytics
• Strong team of scarily smart data scientists
• Integrating analytics everywhere
Big Data
• HDFS - 1100TB
• Amazon S3 - 275TB
• Amazon Glacier - 150TB
• DynamoDB - 12TB
• Vertica - 2TB
• Cassandra - 1.5TB
• SOLR/Lucene - 400GB
• Zookeeper
• MySQL
• Postgres
• Redis
• CouchDB
Data Flow
• Flows between different data store types (many include historical data too)
• Systems of Record (SOR)
• Both root nodes and leaf nodes
• Derived data stores (mostly MVCC) for:
• Real time customer facing queries
• Real time analytics
• Alerting
• Offline analytics
• Reporting
• Debugging
• Mixture of dumps and deltas
• We have derived SORs
• Cached smaller subset of records/fields for a specific purpose
• SORs in multiple data centers - some derived SORs shared
• Data flow is a graph, not a tree - there is feedback
Goals
• Reduce latency to <15 minutes for customer-facing data
• Reduce copying and duplication of data
• Network/storage/time costs
• More streaming & deltas, fewer dumps and derived SORs
• Want a multi-purpose, multi-tenant central store
• Something rock solid
• Something that can handle lots of data fast
• Something that can do random access and bulk operations
• Use for all data store types on previous slide
• (Over?)build it; they will come
• Consolidate the rest on:
• HDFS, Vertica, Postgres, S3, Glacier, SOLR/Lucene
Why Cassandra?
• Regarded as rock solid
• No single point of failure
• Active development & open source Java
• Good fit for the type of data we wanted to store
• Ease of configuration; all nodes are the same
• Easily tunable consistency at application level
• Easy control of sharding at application level
• Drivers for all our languages (we're mostly JVM but also node)
• Data locality with other tools
• Good cross data center support
Evolution
• July 2013 (alpha on C* 1.1)
• September 2013 (MTC-1 on C* 2.0.0)
• First use case (a nasty one) - talk about it later
• Stress/Destructive testing
• Found and helped fix a few bugs along the way
• Learned a lot about tuning and operations
• Half the nodes down at one point
• Corrupted SSTables on one node
• We’ve been cautious
• Started with internal-facing use only (doesn’t need 100% uptime)
• Moved to external-facing use, but with the ability to fall back off C* in minutes
• Getting braver
• C* is the only SOR and real-time customer-facing store for some cases now
• We have on occasion custom-built C* with cherry-picked patches
HW Specs MTC-1
• Remember we want to build for the C* future
• 6 nodes
• 16x cores (Sandy Bridge)
• 256G RAM
• Lots of disk cache and mem-mapped NIO buffers
• 7x 1.2TB 10K RPM JBOD (4.2ms latency, 200MB/sec sequential each)
• 1x SSD commit volume (~100K IOPS, 550MB/sec sequential)
• RAID1 OS drives
• 4x gigabit ethernet
SW Specs MTC-1
• CentOS 6.5
• Cassandra 2.0.5
• JDK 1.7.0_60-b19
• 8 gig young generation / 6.4 gig eden
• 16 gig old generation
• Parallel new collector
• CMS collector
• Sounds like overkill but we are multi-tenant and have spiky loads
General
• LOCAL_QUORUM for reads and writes
• Use LZ4 compression
• Use key cache (not row cache)
• Some SizeTiered, some Leveled CompactionStrategy
• Drivers
• Athena (Scala / binary)
• Astyanax 1.56.48 (Java / thrift)
• node-cassandra-cql (Node / binary)
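To make those defaults concrete, here is a rough sketch of how they look in CQL on 2.0.x. The table and columns are hypothetical, purely to show the options; consistency level (LOCAL_QUORUM) is set per request in the drivers, not in the schema.
-- Hypothetical table, for illustration only: LZ4 on disk,
-- leveled compaction, and key cache only (no row cache)
CREATE TABLE example_records (
  key text,
  value blob,
  PRIMARY KEY (key)
) WITH compression = {'sstable_compression': 'LZ4Compressor'}
  AND compaction = {'class': 'LeveledCompactionStrategy'}
  AND caching = 'keys_only';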
Use Case 1 - Search API - Problem
• 40 million records (including duplicates per VIN) in HDFS
• Map/Reduce to 7 million SOLR XML updates in HDFS
• Not delta today because of map/reduce-like business rules
• Export to SOLR XML from HDFS to local FS
• Re-index via SOLR
• 40 gig SOLR index - at least 3 slaves
• OKish every few hours, but not every 15 minutes
• Even though we made a very fast parallel indexer
• The % of stored data read per indexing run is getting smaller
Use Case 1 - Search API - Solution
• Indexing in Hadoop
• SOLR (Lucene) segments created (no stored fields)
• Job option for fallback to stored fields in SOLR index
• Stored fields go to C* as JSON directly from hadoop
• Astyanax - 1MB batches - LOCAL_QUORUM
• Periodically create a new table (CF) with a full-data baseline (clustering) column
• 200MB/s 3 replicas continuously for one to two minutes
• 40000 partition keys/s (one per record)
• Periodically add new (clustering) column to table with deltas from latest dump
• Delta data size is 100x smaller and hits many fewer partition keys
• Keep multiple recent tables for rollback (bad data more than recovery)
• 2 gig SOLR index (20x smaller)
Use Case 1 - Search API - Solution
• Very bare bones - not even any metadata :-(
• Thrift style
• Note we use blob
• Everything is UTF-8
• Avro - Utf8
• Hadoop - Text
• Astyanax - ByteBuffer
• Most JVM drivers try to convert text to String
CREATE TABLE "20140618084015_20140618_081920_1403072360" (! key text,! column1 blob,! value blob,! PRIMARY KEY (key, column1)!) WITH COMPACT STORAGE;
Use Case 1 - Search API - Solution
• Stored fields cached in SOLR JVM (verification/warm up tests)
• MVCC to prevent read-from-future (query sketch at the end of this slide)
• Single clustering key limit for the SOLR core
• Reads fall back from LOCAL_QUORUM to LOCAL_ONE
• Better to return something, even a subset of results
• Never happened in production though
• Issues
• Don’t recreate table/CF until C* 2.1
• Early 2.0.x and Astyanax don’t like schema changes
• Create new tables via CQL3 via Astyanax
• Monitoring harder since we now use UUID for table name
• Full (non-delta) index write rate strains GC and causes some hinting
• C* remains rock solid
• We can constrain by mapper/reducer count, and will probably add a ZooKeeper mutex
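The MVCC-bounded read mentioned above looks roughly like this, assuming the clustering value encodes the dump/delta the SOLR core was built against (the literal values are made up). The LOCAL_QUORUM to LOCAL_ONE fallback lives in the driver, not in the query.
-- Never read a delta "from the future" relative to the SOLR index:
-- cap the clustering key at the baseline/delta the core was built from
SELECT column1, value
FROM "20140618084015_20140618_081920_1403072360"
WHERE key = 'VIN12345'
  AND column1 <= textAsBlob('delta_20140618_101500');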
Use Case 1.5 - RESA
• Newer version of the real estate pipeline
• Fully streaming delta pipeline (RabbitMQ)
• Field level SOLR index updates (include latest timestamp)
• C* row with JSON delta for that timestamp
• History is used in customer facing features
• Note this is really the same table shape as the thrift one
CREATE TABLE for_sale (
  id text,
  created_date timestamp,
  delta_json text,
  PRIMARY KEY (id, created_date)
);
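Reading the history back for the customer-facing features is then a single-partition query, roughly like this (the id value is illustrative):
-- All deltas for one listing, newest first; LIMIT 1 would give just the latest
SELECT created_date, delta_json
FROM for_sale
WHERE id = 'listing-42'
ORDER BY created_date DESC;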
Use Case 2 - Feed Management - Problem
• Thousands of feeds of different size and frequency
• Incoming feeds must be “polished”
• Geocoding must be done
• Images must be made available in S3
• Need to reprocess individual feeds
• Full output records are munged from asynchronously updated parts
• Previously a huge HDFS job
• 300M inputs for 70M full output records
• Records need all data to be “ready” for full output
• Silly because most work is redundant from previous run
• The only partitioning help is brittle HDFS directory structures
Use Case 2 - Feed Management - Solution
• Scala & Akka & Athena (large throughput - high parallelism)
• Compound partition key (2^n shards per feed)
• Spreads data - limits partition “row” length
• Read entire feed without key scan - small IN clause
• Random access writes
• Any sub-field may be updated asynchronously
• Munged record emitted to HDFS whenever “ready”
CREATE TABLE feed_state (
  feed_name text,
  feed_record_id_shard int,
  record_id uuid,
  raw_record text,
  polished_data text,
  geocode_data text,
  image_status text,
  ...
  PRIMARY KEY ((feed_name, feed_record_id_shard), record_id)
)
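A sketch of the two access patterns this key design enables, assuming 8 (2^n, n=3) shards per feed; the feed name, shard count, and values are illustrative.
-- Read an entire feed without a key scan: one small IN clause over the shards
SELECT record_id, raw_record, polished_data, geocode_data, image_status
FROM feed_state
WHERE feed_name = 'dealer_feed_tx'
  AND feed_record_id_shard IN (0, 1, 2, 3, 4, 5, 6, 7);

-- Each sub-field is written independently as its async worker finishes
UPDATE feed_state
SET geocode_data = '{"lat":30.27,"lon":-97.74}'
WHERE feed_name = 'dealer_feed_tx'
  AND feed_record_id_shard = 3
  AND record_id = 6ab09bec-e68e-48d9-a5f8-97e6fb4c9b47;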
Monitoring
• OpsCenter
• log4j/syslog/graylog
• Email alerts
• nagios/zabbix
• Graphite (autogenerated graph pages)
• Machine stats via collectl, JVM stats via Codahale
• Cassandra stats from Codahale
• Suspect a possible issue with Hadoop using the same coordinator nodes
• GC logs
• Visual VM
General Issues / Lessons Learned
• GC issues
• Old generation fragmentation causes eventual promotion failure
• Usually of 1MB memtable “slabs” - these can be off-heap in C* 2.1 :-)
• Thrift API with bulk load probably isn’t helping, but fragmentation is inevitable
• Some slow initial mark and remark STW pauses
• We do have a big young gen - New -XX:+ flags in 1.7.0_60 :-)
• As said, we aim to be multi-tenant
• Avoid client stupidity, but otherwise accommodate any client behavior
• GC now well tuned
• One compacting GC per day at off-peak times, very rare 1 sec pauses, a handful of >0.5 sec pauses per day
• Cassandra and its own dog food
• Can’t wait for hints to be a commit-log-style regular file (C* 3.0)
• Compactions-in-progress table
• OpsCenter rollups - turned off for the search API tables
General Issues / Lessons Learned
• Don’t repair things that don’t need them
• We also run -pr -par repair on each node
• Beware when not following the rules
• We were knowingly running on potentially buggy minor versions
• If you don’t know what you’re doing you will likely screw up
• Fortunately for us C* has always kept running fine
• It is usually pretty easy to fix with some googling
• Deleting data is counter-intuitively often a good fix!
Future
• Upgrade to a newer 2.0.x to use static columns
• User defined types :-)
• De-duplicate data into shared storage in C*
• Analytics via data locality
• Hadoop, Pig, Spark/Scalding, R
• More cross data center
• More tuning
• Full streaming pipeline with C* as side state store
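For reference, the two schema features called out above look roughly like this; the tables and types are made up just to show the syntax (static columns need 2.0.6+, user defined types need 2.1).
-- A static column is stored once per partition, shared by all its rows
CREATE TABLE feed_status (
  feed_name text,
  record_id uuid,
  last_polled timestamp static,
  record_json text,
  PRIMARY KEY (feed_name, record_id)
);

-- User defined types (C* 2.1)
CREATE TYPE geocode (
  lat double,
  lon double,
  accuracy text
);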
Athena
• Why would we do such an obviously crazy thing?
• Need to support async, reactive applications across different problem domains
• Real-time API used by several disparate clients (iOS, Node.js, …)
• Ground-up implementation of the CQL binary (native) protocol v2
• Scala 2.10/2.11
• Akka 2.3.x
• Fully async, non-blocking API
• Has obvious advantages but requires a different paradigm
• Implemented as an extension for Akka-IO
• Low-level actor based abstraction
• Cluster, Host and Connection actors
• Reasonably stable
• High-level streaming Session API
Athena
• Next steps
• Move off of Play Iteratees and onto Akka Reactive Streams
• Token based routing
• Client API very much in flux - suggestions are welcome!
• https://github.com/vast-engineering/athena
• Release of the first beta milestone to the Sonatype Maven repository is imminent
• Pull requests welcome!
Appendix
GC Settings
-Xms24576M -Xmx24576M -Xmn8192M -Xss228k
-XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled
-XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=1
-XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly
-XX:+UseTLAB -XX:+UseCondCardMark
-XX:+CMSParallelInitialMarkEnabled -XX:+CMSEdenChunksRecordAlways
-XX:+HeapDumpOnOutOfMemoryError
-XX:+CMSPrintEdenSurvivorChunks -XX:+PrintGCDetails -XX:+PrintGCDateStamps
-XX:+PrintHeapAtGC -XX:+PrintTenuringDistribution
-XX:+PrintGCApplicationStoppedTime -XX:+PrintPromotionFailure -XX:PrintFLSStatistics=1