Data: How Does It Work? An overview of the Castle Global Analytics Pipeline (CGAP™)

Transcript of "Data: How Does It Work?"

Page 1

Data: How Does It Work?

An overview of the Castle Global Analytics Pipeline (CGAP™)

Page 2

1. General overview

2. Our current solution

3. Numbers and comparisons

Page 3

Raw data aggregated from multiple sources

Extract, transform, and load (ETL)

Persist ingestible data to warehouse

Analyze ingested data for insights

Exhibit A: Generic Data Pipeline
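As a rough sketch, the four stages of the generic pipeline above can be expressed as plain Python functions. Every name and data shape here is illustrative, not part of CGAP™:

```python
import gzip
import json

def extract(sources):
    """Aggregate raw records from multiple sources into one stream."""
    for source in sources:
        yield from source

def transform(record):
    """Normalize a raw record into a flat, ingestible row."""
    return {"event": record.get("event", "unknown"),
            "value": record.get("value")}

def load(rows):
    """Persist rows for the warehouse (here: gzipped JSON lines)."""
    payload = "\n".join(json.dumps(r) for r in rows).encode()
    return gzip.compress(payload)

def analyze(rows):
    """Derive a simple insight: count events by type."""
    counts = {}
    for r in rows:
        counts[r["event"]] = counts.get(r["event"], 0) + 1
    return counts

# Two hypothetical sources feeding the same pipeline.
sources = [[{"event": "click", "value": 1}],
           [{"event": "click"}, {"event": "push"}]]
rows = [transform(r) for r in extract(sources)]
blob = load(rows)
print(analyze(rows))  # event counts by type
```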

Page 4

Raw data aggregated from multiple sources

Extract, transform, and load (ETL)

Persist ingestible data to warehouse

Analyze ingested data for insights

Page 5

CGAP™ - How We Obtain Raw Data

● Two main types of real-time events: client and server
● Challenges: client call site consistency, batching, QA
● Multiple persistent databases
● Real-time events temporarily stored in Kafka (revisited later)
● Databases always up to date, copied over at intervals
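As a hedged illustration of the buffering model (partitioned append-only logs, per-consumer-group saved offsets), here is a toy in-memory stand-in. None of this is the real Kafka API; all names are made up:

```python
from collections import defaultdict

class MiniLog:
    """Toy stand-in for a Kafka topic: append-only partitions plus
    per-consumer-group committed offsets."""

    def __init__(self, num_partitions=3):
        self.partitions = [[] for _ in range(num_partitions)]
        # One saved offset per partition, per consumer group.
        self.offsets = defaultdict(lambda: [0] * num_partitions)

    def produce(self, key, event):
        # Same key -> same partition, preserving per-key ordering.
        p = hash(key) % len(self.partitions)
        self.partitions[p].append(event)

    def consume(self, group):
        # Read everything past the saved offsets, then commit them,
        # so a batch job can resume exactly where it left off.
        batch = []
        for p, log in enumerate(self.partitions):
            batch.extend(log[self.offsets[group][p]:])
            self.offsets[group][p] = len(log)
        return batch

topic = MiniLog()
topic.produce("user-1", {"type": "client", "event": "click"})
topic.produce("server", {"type": "server", "event": "push"})
first = topic.consume("etl")   # both events
second = topic.consume("etl")  # empty: offsets were committed
```

Real-time events live only temporarily in the log; once every interested consumer group has advanced its offset past them, they can be expired.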

Page 6

Kafka Diagram

Page 7

Raw data aggregated from multiple sources

Extract, transform, and load (ETL)

Persist ingestible data to warehouse

Analyze ingested data for insights

Page 8

The ETL Job (Spark)

● Multiple batch jobs, one per Kafka topic, run at intervals

● Concurrently read out of multiple Kafka partitions

● Save offset state per job

● Parse, flatten, write JSON

● Takes care of compression and uploading to cloud storage

● Extra batch load jobs run for persistent relational DBs
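The real job runs on Spark; the parse/flatten/save-offset bookkeeping described above can be sketched in plain Python. The function and field names are hypothetical, and the gzip step stands in for "compression and uploading to cloud storage":

```python
import gzip
import json

def flatten(obj, prefix=""):
    """Flatten nested JSON into dotted keys ({"ctx": {"os": ...}} -> "ctx.os")."""
    flat = {}
    for k, v in obj.items():
        key = f"{prefix}{k}"
        if isinstance(v, dict):
            flat.update(flatten(v, key + "."))
        else:
            flat[key] = v
    return flat

def run_batch(messages, state, topic):
    """One batch: resume from the saved offset, parse and flatten each
    raw JSON message, gzip the output, and commit the new offset."""
    start = state.get(topic, 0)
    rows = [flatten(json.loads(m)) for m in messages[start:]]
    state[topic] = len(messages)  # save offset state per job
    out = "\n".join(json.dumps(r) for r in rows).encode()
    return gzip.compress(out), rows

raw = ['{"event": "click", "ctx": {"os": "ios", "v": 2}}',
       '{"event": "push", "ctx": {"os": "android"}}']
state = {}
blob, rows = run_batch(raw, state, "events")
# rows[0] == {"event": "click", "ctx.os": "ios", "ctx.v": 2}
```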

Page 9

Spark Diagram

Page 10

Spark Diagram (cont.)

Page 11

Raw data aggregated from multiple sources

Extract, transform, and load (ETL)

Persist ingestible data to warehouse

Analyze ingested data for insights

Exhibit A: Generic Data Pipeline

Page 12

Data Warehousing (AWS Redshift)

● Load in data from AWS S3 to Redshift

● Two clusters, Kiwi/Plaza, 10 nodes total

● Scaling/upgrading requires some write downtime

● Support for full SQL joins, based on PostgreSQL 8.0

● Specialized proprietary hardware/software layers

● Scales up to petabytes
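The S3-to-Redshift load described above might look like the following COPY statement; the bucket, table, and IAM role names are placeholders, not the actual CGAP™ configuration:

```sql
-- Load gzipped JSON produced by the ETL job from S3 into Redshift.
-- Bucket, schema/table, and IAM role are illustrative placeholders.
COPY analytics.events
FROM 's3://example-bucket/events/2016/01/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
FORMAT AS JSON 'auto'
GZIP;
```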

Page 13

Raw data aggregated from multiple sources

Extract, transform, and load (ETL)

Persist ingestible data to warehouse

Analyze ingested data for insights

Exhibit A: Generic Data Pipeline

Page 14

Business Intelligence

● Primarily using Mode as the UI to Redshift; so, so cheap
● Allows for some visualizations from queries
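A Mode report is ultimately just SQL run against Redshift. As a self-contained stand-in, sqlite3 plays the warehouse role below; the schema and query are illustrative, not the actual reports:

```python
import sqlite3

# sqlite3 stands in for Redshift so the query is runnable here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (event TEXT, ts TEXT)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("click", "2016-01-01"), ("click", "2016-01-01"),
                  ("push", "2016-01-02")])

# A typical BI query: daily event counts, ready to chart.
result = conn.execute("""
    SELECT ts AS day, event, COUNT(*) AS n
    FROM events
    GROUP BY day, event
    ORDER BY day, n DESC
""").fetchall()
```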

Page 15

Our Numbers

● Kiwi: half a billion events ingested daily
  ○ 200MM from clients, 280MM from server (170MM pushes)
  ○ Uncompressed size of JSON: 500GB/day
  ○ Compressed size, Spark output: 8GB/day
  ○ Compressed and indexed size, Redshift: 15GB/day
● Plaza: 14MM events per day, mostly server
  ○ Plaza loads 4 hours of data in 53s vs. Kiwi's 4 hours in 8 minutes
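A quick back-of-the-envelope pass over the Kiwi figures above (taking "half a billion" as the ~480MM implied by the client/server breakdown):

```python
# Derived figures only; the inputs come from the slide above.
daily_events = 480e6   # ~200MM client + ~280MM server
raw_gb = 500           # uncompressed JSON per day
spark_gb = 8           # compressed Spark output per day

events_per_sec = daily_events / 86400
bytes_per_event = raw_gb * 1e9 / daily_events
compression_ratio = raw_gb / spark_gb

print(f"{events_per_sec:,.0f} events/s sustained")   # ~5,500/s
print(f"{bytes_per_event:.0f} bytes/event raw")      # ~1KB per event
print(f"{compression_ratio:.0f}x Spark compression") # ~62x vs. raw JSON
```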

Page 16

Our Numbers (cont.)

● Currently storing 1.2TB of compressed Redshift data
  ○ Roughly 2 months of full Kiwi data (14 billion rows)
  ○ Past 2 months, data is archived and coalesced to save space
● At Plaza's rate, can store over two years on just 256GB
● Costs:
  ○ S3 storage long-term: $5,000/year
  ○ Redshift, current status: $16,000/year
  ○ Machines for batch load: 1 MBP
  ○ Mode: $40/month

Page 17

Thanks to...

● The Apache Software Foundation, for making open source systems

● AWS for providing us their economies of scale

● You, for listening