Data: How Does It Work? An overview of the Castle Global Analytics Pipeline (CGAP™)

Transcript of "Data: How Does It Work?"

Page 1

Data: How Does It Work?

An overview of the Castle Global Analytics Pipeline (CGAP™)

Page 2

1. General overview

2. Our current solution

3. Numbers and comparisons

Page 3

Raw data aggregated from multiple sources

Extract, transform, and load (ETL)

Persist ingestible data to warehouse

Analyze ingested data for insights

Exhibit A: Generic Data Pipeline
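As a rough sketch, the four stages of the generic pipeline above can be expressed as plain Python functions. Every name and data shape here is illustrative, not part of CGAP™:

```python
import gzip
import json

def extract(sources):
    """Aggregate raw records from multiple sources into one stream."""
    for source in sources:
        yield from source

def transform(record):
    """Normalize a raw record into a flat, ingestible row."""
    return {"event": record.get("event", "unknown"),
            "value": record.get("value")}

def load(rows):
    """Persist rows for the warehouse (here: gzipped JSON lines)."""
    payload = "\n".join(json.dumps(r) for r in rows).encode()
    return gzip.compress(payload)

def analyze(rows):
    """Derive a simple insight: count events by type."""
    counts = {}
    for r in rows:
        counts[r["event"]] = counts.get(r["event"], 0) + 1
    return counts

# Two hypothetical sources feeding the same pipeline.
sources = [[{"event": "click", "value": 1}],
           [{"event": "click"}, {"event": "push"}]]
rows = [transform(r) for r in extract(sources)]
blob = load(rows)
print(analyze(rows))  # event counts by type
```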

Page 4

Raw data aggregated from multiple sources

Extract, transform, and load (ETL)

Persist ingestible data to warehouse

Analyze ingested data for insights

Page 5

CGAP™ - How We Obtain Raw Data

● Two main types of real-time events: client and server
● Challenges: client call site consistency, batching, QA
● Multiple persistent databases
● Real-time events temporarily stored in Kafka (revisited later)
● Databases always up to date, copied over at intervals
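As a hedged illustration of the buffering model (partitioned append-only logs, per-consumer-group saved offsets), here is a toy in-memory stand-in. None of this is the real Kafka API; all names are made up:

```python
from collections import defaultdict

class MiniLog:
    """Toy stand-in for a Kafka topic: append-only partitions plus
    per-consumer-group committed offsets."""

    def __init__(self, num_partitions=3):
        self.partitions = [[] for _ in range(num_partitions)]
        # One saved offset per partition, per consumer group.
        self.offsets = defaultdict(lambda: [0] * num_partitions)

    def produce(self, key, event):
        # Same key -> same partition, preserving per-key ordering.
        p = hash(key) % len(self.partitions)
        self.partitions[p].append(event)

    def consume(self, group):
        # Read everything past the saved offsets, then commit them,
        # so a batch job can resume exactly where it left off.
        batch = []
        for p, log in enumerate(self.partitions):
            batch.extend(log[self.offsets[group][p]:])
            self.offsets[group][p] = len(log)
        return batch

topic = MiniLog()
topic.produce("user-1", {"type": "client", "event": "click"})
topic.produce("server", {"type": "server", "event": "push"})
first = topic.consume("etl")   # both events
second = topic.consume("etl")  # empty: offsets were committed
```

Real-time events live only temporarily in the log; once every interested consumer group has advanced its offset past them, they can be expired.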

Page 6

Kafka Diagram

Page 7

Raw data aggregated from multiple sources

Extract, transform, and load (ETL)

Persist ingestible data to warehouse

Analyze ingested data for insights

Page 8

The ETL Job (Spark)

● Multiple batch jobs, one per Kafka topic, run at intervals

● Concurrently read out of multiple Kafka partitions

● Save offset state per job

● Parse, flatten, write JSON

● Takes care of compression and uploading to cloud storage

● Extra batch load jobs run for persistent relational DBs
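The real job runs on Spark; the parse/flatten/save-offset bookkeeping described above can be sketched in plain Python. The function and field names are hypothetical, and the gzip step stands in for "compression and uploading to cloud storage":

```python
import gzip
import json

def flatten(obj, prefix=""):
    """Flatten nested JSON into dotted keys ({"ctx": {"os": ...}} -> "ctx.os")."""
    flat = {}
    for k, v in obj.items():
        key = f"{prefix}{k}"
        if isinstance(v, dict):
            flat.update(flatten(v, key + "."))
        else:
            flat[key] = v
    return flat

def run_batch(messages, state, topic):
    """One batch: resume from the saved offset, parse and flatten each
    raw JSON message, gzip the output, and commit the new offset."""
    start = state.get(topic, 0)
    rows = [flatten(json.loads(m)) for m in messages[start:]]
    state[topic] = len(messages)  # save offset state per job
    out = "\n".join(json.dumps(r) for r in rows).encode()
    return gzip.compress(out), rows

raw = ['{"event": "click", "ctx": {"os": "ios", "v": 2}}',
       '{"event": "push", "ctx": {"os": "android"}}']
state = {}
blob, rows = run_batch(raw, state, "events")
# rows[0] == {"event": "click", "ctx.os": "ios", "ctx.v": 2}
```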

Page 9

Spark Diagram

Page 10

Spark Diagram (cont.)

Page 11

Raw data aggregated from multiple sources

Extract, transform, and load (ETL)

Persist ingestible data to warehouse

Analyze ingested data for insights

Exhibit A: Generic Data Pipeline

Page 12

Data Warehousing (AWS Redshift)

● Load in data from AWS S3 to Redshift

● Two clusters, Kiwi/Plaza, 10 nodes total

● Scaling/upgrading requires some write downtime

● Support for full SQL joins, based on PostgreSQL 8.0

● Specialized proprietary hardware/software layers

● Scales up to petabytes
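The S3-to-Redshift load described above might look like the following COPY statement; the bucket, table, and IAM role names are placeholders, not the actual CGAP™ configuration:

```sql
-- Load gzipped JSON produced by the ETL job from S3 into Redshift.
-- Bucket, schema/table, and IAM role are illustrative placeholders.
COPY analytics.events
FROM 's3://example-bucket/events/2016/01/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
FORMAT AS JSON 'auto'
GZIP;
```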

Page 13

Raw data aggregated from multiple sources

Extract, transform, and load (ETL)

Persist ingestible data to warehouse

Analyze ingested data for insights

Exhibit A: Generic Data Pipeline

Page 14

Business Intelligence

● Primarily using Mode as the UI to Redshift; so, so cheap
● Allows for some visualizations from queries
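A Mode report is ultimately just SQL run against Redshift. As a self-contained stand-in, sqlite3 plays the warehouse role below; the schema and query are illustrative, not the actual reports:

```python
import sqlite3

# sqlite3 stands in for Redshift so the query is runnable here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (event TEXT, ts TEXT)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("click", "2016-01-01"), ("click", "2016-01-01"),
                  ("push", "2016-01-02")])

# A typical BI query: daily event counts, ready to chart.
result = conn.execute("""
    SELECT ts AS day, event, COUNT(*) AS n
    FROM events
    GROUP BY day, event
    ORDER BY day, n DESC
""").fetchall()
```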

Page 15

Our Numbers

● Kiwi: half a billion events ingested daily
  ○ 200MM from clients, 280MM from server (170MM pushes)
  ○ Uncompressed size of JSON: 500GB/day
  ○ Compressed size, Spark output: 8GB/day
  ○ Compressed and indexed size, Redshift: 15GB/day
● Plaza: 14MM events per day, mostly server
  ○ Plaza loads 4 hours of data in 53s vs. Kiwi's 4 hours in 8 minutes
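A quick back-of-the-envelope pass over the Kiwi figures above (taking "half a billion" as the ~480MM implied by the client/server breakdown):

```python
# Derived figures only; the inputs come from the slide above.
daily_events = 480e6   # ~200MM client + ~280MM server
raw_gb = 500           # uncompressed JSON per day
spark_gb = 8           # compressed Spark output per day

events_per_sec = daily_events / 86400
bytes_per_event = raw_gb * 1e9 / daily_events
compression_ratio = raw_gb / spark_gb

print(f"{events_per_sec:,.0f} events/s sustained")   # ~5,500/s
print(f"{bytes_per_event:.0f} bytes/event raw")      # ~1KB per event
print(f"{compression_ratio:.0f}x Spark compression") # ~62x vs. raw JSON
```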

Page 16

Our Numbers (cont.)

● Currently storing 1.2TB of compressed Redshift data
  ○ Roughly 2 months of full Kiwi data (14 billion rows)
  ○ Past 2 months, data is archived and coalesced to save space
● At Plaza's rate, can store over two years on just 256GB
● Costs:
  ○ S3 storage long-term: $5,000/year
  ○ Redshift, current status: $16,000/year
  ○ Machines for batch load: 1 MBP
  ○ Mode: $40/month

Page 17

Thanks to...

● The Apache Software Foundation, for making open source systems

● AWS for providing us their economies of scale

● You, for listening