Post on 01-Jul-2015
Real… Big… Data… and its constant evolution
Scott MacGregor
Who is this guy?
Akamai Big Data Infrastructure
• 150,000 collector nodes
• 5,000 map/reduce nodes
• Billions of jobs per day
What is Big Data?
The V’s
Data that is Big
From Hortonworks
What’s it really about?
From the beginning…
• Akamai needed a billing system and scalable monitoring
• The Open Source community wanted a search engine
• Yahoo needed better product analytics for page views
• Google needed more scalable computation for ad management
• Facebook needed real-time updates to the social graph
• LinkedIn needed a real-time activity data pipeline
• Twitter needed hashtag and topic streams
• Amazon needed durable shopping carts
• Netflix needed a recommendation engine
Big Data timeline
Timeline: 1998, 2001, 2003, 2005, 2006, 2007, 2008, 2010, 2011, 2012, 2013, 2014

Akamai:
• Generalized map/reduce on 1 machine
• Decentralized job scheduling across multiple machines; file system DB
• Wide-area, real-time, in-memory system monitoring
• Geographical redundancy
• Real-time reporting; columnar DB
• Distributed file system DB
• Wide-area MapReduce; exabyte query

Industry:
• Google MapReduce and Google FS
• Nutch; Yahoo spins off Hadoop
• Amazon Dynamo
• NoSQL
• HBASE; Neo4j
• Facebook Cassandra; LinkedIn Kafka
• Twitter Storm; Facebook Presto
How it works…
Big Data modes
• Batch
  – Computation over a large, static data set
  – Results are complete
• Online
  – Computation on data as it's generated
  – Localized results that must be aggregated downstream
Big Data primitives
• Collection
• Parsing
• Partitioning
• Filtering
• Throttling
• Aggregation
• Tracking
• Validation
• Analysis
Collection
• What
  – Logs
  – Metadata
  – System stats
  – Application events
  – Application stats
  – Network data
• How
  – Email
  – SPDY
  – HTTP POST
  – SCP
  – Scribe
  – Avro
  – Custom
Parsing
• Read lines or blocks and split into fields
• Transform, e.g. protobuf
• Map keys to values
S 1359487051.701 4200191766 2097152 2097918 61.252.169.21 GET 440 1423 itunesus011.download.akamai.com 200 - iPeV image/jpeg - - 44 3031 - - - - - W - us/r1000/011/Purple/53/e6/0f/mzl.slohufby.320x480-75.jpg - a440.phobos.apple.com
1359486900 1423 a440.phobos.apple.com 1 3158
1359486900 1423 200 1 30128
1359486900 1423 1 209158
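The parsing step can be sketched in Python (a minimal illustration, not Akamai's parser; the field positions and meanings are assumptions inferred from the sample line, whose timestamp reduces to the 5-minute bucket 1359486900 seen in the parsed rows):

```python
# Minimal parsing sketch: split a log line into fields and map assumed
# positions to named keys. Field meanings are guesses from the sample.
def parse(line):
    f = line.split()
    ts = float(f[1])                 # epoch timestamp with milliseconds
    bucket = int(ts) // 300 * 300    # 5-minute time bucket
    key = f[8]                       # assumed pipeline/customer key
    status = f[10]                   # assumed HTTP status field
    return {"bucket": bucket, "key": key, "status": status}

line = ("S 1359487051.701 4200191766 2097152 2097918 61.252.169.21 GET "
        "440 1423 itunesus011.download.akamai.com 200 - iPeV image/jpeg "
        "- - 44 3031 - - - - - W - "
        "us/r1000/011/Purple/53/e6/0f/mzl.slohufby.320x480-75.jpg "
        "- a440.phobos.apple.com")
print(parse(line))  # bucket 1359486900 matches the parsed rows above
```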
Partitioning
• Bucketing
  – Reduce to a single record per bucket
  – e.g. 5 minutes, /24, etc.
• Hashing
  – Bucket blocks or records of data by a hash function
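Both partitioning modes can be sketched in a few lines of Python (a minimal illustration; the bucket width and partition count are arbitrary choices, and any stable hash would do in place of MD5):

```python
import hashlib

def time_bucket(ts, width=300):
    """Bucketing: reduce a timestamp to its 5-minute bucket."""
    return int(ts) // width * width

def partition(record_key, n_partitions):
    """Hashing: assign a record to a partition via a stable hash of its key."""
    digest = hashlib.md5(record_key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % n_partitions

print(time_bucket(1359487051.701))  # 1359486900
print(partition("1423", 16))        # stable partition id in [0, 16)
```

A stable (non-randomized) hash matters here: the same key must land on the same partition on every node, which is why Python's built-in `hash()` (salted per process) is avoided.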
Filtering
• Statistical methods
  – Top-k (HierarchicalCountSketch)
  – Set membership (Bloom filters)
  – Cardinality counting (HyperLogLog)
  – Frequency estimates (CountSketch)
  – Change detection (Deltoid)
• Sampling
  – Random
  – Reservoir
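Of the filters above, reservoir sampling is the simplest to show concretely: it keeps a uniform sample of k items from a stream whose length is unknown in advance. A minimal sketch (classic Algorithm R, not Akamai's implementation):

```python
import random

def reservoir_sample(stream, k, rng=None):
    """Uniform random sample of k items from a stream of unknown length."""
    rng = rng or random.Random()
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)          # fill the reservoir first
        else:
            j = rng.randint(0, i)        # item i survives with prob k/(i+1)
            if j < k:
                sample[j] = item
    return sample

print(reservoir_sample(range(10**6), 5, random.Random(42)))
```

The appeal for log pipelines is that memory is O(k) regardless of how many records stream past.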
Throttling
• Limit on cardinality per partition
  – Requires central management
  – Drop records over the max
• Remove or trim large fields, e.g. replace the hostname and URL with "~":
S 1359487051.701 4200191766 2097152 2097918 61.252.169.21 GET 440 1423 itunesus011.download.akamai.com 200 - iPeV image/jpeg - - 44 3031 - - - - - W - us/r1000/011/Purple/53/e6/0f/mzl.slohufby.320x480-75.jpg - a440.phobos.apple.com
S 1359487051.701 4200191766 2097152 2097918 61.252.169.21 GET 440 1423 ~ 200 - iPeV image/jpeg - - 44 3031 - - - - - W - ~
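Both throttling rules can be sketched in Python (a minimal illustration; the cap of 2 keys and the 16-character field limit are deliberately tiny for demonstration, and a real deployment would manage the cap centrally as the slide notes):

```python
def throttle(records, seen=None, max_keys=2):
    """Drop records once a partition's key cardinality exceeds the cap."""
    seen = set() if seen is None else seen
    for key, value in records:
        if key not in seen and len(seen) >= max_keys:
            continue                     # over the cap: drop the record
        seen.add(key)
        yield key, value

def trim(fields, limit=16):
    """Replace oversized fields with '~', as in the example lines above."""
    return ["~" if len(f) > limit else f for f in fields]

print(list(throttle([("a", 1), ("b", 2), ("c", 3), ("a", 4)])))
print(trim(["GET", "itunesus011.download.akamai.com"]))
```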
Aggregation
• Merge
  – Merge-sort blocks in a partition
• Reduce
  – Combine values for like keys
  – Sum, Min, Max, Mask, etc.
• Shuffle
  – Move the data to where it's needed, or closer to like data
Example (Aggregate, then Shuffle): the records
1359486900 1423 1 209158
1359486900 1423 1 209158
1359529800 1423 1 209158
reduce by key to
{1423, 1359486900} → 2 418316
{1423, 1359529800} → 1 209158
and the shuffle then routes each key to its destination partition.
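The Reduce step can be sketched in Python using the sample records (a minimal illustration; the field order bucket, key, hits, bytes is an assumption from the example):

```python
from collections import defaultdict

def reduce_records(records):
    """Combine values for like keys: sum hit counts and byte totals."""
    totals = defaultdict(lambda: [0, 0])
    for bucket, key, hits, nbytes in records:
        t = totals[(key, bucket)]
        t[0] += hits
        t[1] += nbytes
    return dict(totals)

records = [
    (1359486900, 1423, 1, 209158),
    (1359486900, 1423, 1, 209158),
    (1359529800, 1423, 1, 209158),
]
print(reduce_records(records))
# {(1423, 1359486900): [2, 418316], (1423, 1359529800): [1, 209158]}
```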
Tracking
• Tracking
  – Embed a GUID in each data unit sent
  – Publish GUIDs independent of the data flow
  – Completeness is expected (published GUIDs) vs. actual (embedded GUIDs)
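The completeness check reduces to set arithmetic over GUIDs, which a short Python sketch makes concrete (an illustration of the idea, not Akamai's tracker):

```python
import uuid

def completeness(published, received):
    """Compare published GUIDs (expected) against GUIDs embedded in data (actual)."""
    missing = published - received
    ratio = 1.0 if not published else len(published & received) / len(published)
    return ratio, missing

# Four units published, three arrive: completeness is 75%.
published = {uuid.UUID(int=i) for i in range(4)}
received = {uuid.UUID(int=i) for i in range(3)}
ratio, missing = completeness(published, received)
print(ratio)  # 0.75: one published unit never arrived
```

Publishing the GUID list out of band is the key design point: a failure that drops data cannot also hide the evidence that the data existed.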
Data integrity
• Watermark
  – Producer watermarks every n lines with a crypto key
  – Receiver checks watermarks
• Checksum
  – Block checksums
  – Line CRC
  – Etc.
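The line-CRC variant can be sketched in Python (a minimal illustration, not Akamai's scheme; the keyed-watermark step is omitted since CRC32 alone detects corruption, not tampering):

```python
import zlib

def append_crc(line):
    """Producer: stamp a line with its CRC32 so the receiver can detect corruption."""
    return f"{line} {zlib.crc32(line.encode()):08x}"

def check_crc(stamped):
    """Receiver: recompute the CRC and compare against the stamp."""
    line, crc = stamped.rsplit(" ", 1)
    return zlib.crc32(line.encode()) == int(crc, 16)

stamped = append_crc("1359486900 1423 1 209158")
print(check_crc(stamped))                    # True
print(check_crc("X" + stamped[1:]))          # False: one corrupted byte
```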
Analysis
• Online
  – Precomputed reports
• Batch
  – Spark programs
  – Map/Reduce
  – Hive: HQL
  – SQL
Big Data at Akamai
• Billing and Reporting • System monitoring • Media Analytics • Security • Log archive
Billing and reporting
Flow: Akamai Edge Networks and Products → Logs → Q → Parse → Pipelines → Shuffle/Split → Billing DB and Reporting
Parsing:
• splits lines into fields
• maps keys to values per pipeline
• each log generates many pipelines
• each pipeline represents a streaming table
Evolution:
• Logs were emailed (up to 1 PB/day)
• Now delivered via SPDY (3 PB/day)
Scale: 3 PB/day, doubling every year. Reporting is consumed by Customers and Internal Apps.
System monitoring
Flow: Akamai Networks and Products → Parser → Aggregators → TLA; clients query via SQL, and results drive Alerting and Trending.
TLA: top-level aggregator. At request time the TLA pulls data from aggregators, which in turn pull data from producers; producers rewrite data locally.
50M jobs/day
Evolution: single-machine memory for table joins today; future: distributed memory for table joins.
Media analytics
Flow: Akamai Products → Pipelines → Front end → Column Store → Index → Reporting → API/UI → Customers
• Indexes are recreated for each update
• Supports insert and update
• Reads are flexible and fast

Evolution:
• The index is now a fingerprint, to lower cost
• HyperLogLog for uniqueness counting
Security products
Flow: Events from the Akamai Edge Networks and the Akamai Web Firewall enter via Pipelines and a Front end into HDFS; Map/Reduce, HBASE, and Hive (on Cloudera, monitored with Graphite) feed the Operations Center.

Consumers: Reputation Scoring, Threat Analysis, Intelligence Reports, Risk-Based Authentication, and Payment Fraud, combined with External Data sources.
Evolution:
• Replacing HBASE with a custom aggregator
• Replacing Hive with a custom SQL processor
20 TB/day
Log archive
Flow: Logs → Q → Parse → Archive
180 PB, 450 trillion records; doubles every year
Components: Archive Index (10 TB), Pipelines, Log cache (10%), Client IP Sketch, Spark, Spark SQL, HDFS, Archive Front End.
The archive is 90 data centers distributed over a wide area, projected to reach 1.2 EB in 3 years.
Evolution: the index was a flat file; it now lives on HDFS/Spark.
Client requests fetch the index and/or the client-IP sketch from the Archive Front End; the log cache is checked first, then the archive. The archive runs on HDFS and Hadoop/YARN.
The Ecosystem
• Script: Pig
• SQL: Hive
• NoSQL: HBASE
• Stream: Kafka, Storm
• Search: Solr
• In-Mem: Spark
• Integration: Flume, Avro
• Operations: Ambari, Zookeeper, Oozie
• Monitoring: Graphite
• Sharing: Mesos

All of the above run on HDFS and Hadoop/YARN.
Building a system
If you need fast access to massive amounts of data where queries are constrained to an index (read-optimized):
• Start with HDFS or Cassandra
• Add the HBASE column store
• Add Hive for SQL-like access
• Add Pig for scripting

Interfaces: HBASE (Get, Put), Hive (Select *), Pig ({ … })
Building a system
If you need to search logs:
• Start with HDFS
• Add Flume for log data integration
• Add Avro for data serialization
• Add Solr for search

Flow: Flume Agent (Avro sink) → Flume Collector (Avro source) → HDFS (Hadoop/YARN); Solr serves searches, e.g. Ip = 1.1.1.1
Building a system
If you need flexible and shared access to unlimited amounts of data:
• Start with HDFS or Cassandra
• Add Hadoop for Map/Reduce, or
• Add Hive for SQL-like access, or
• Add Pig for scripting
• Add Mesos for resource sharing
• Add Ambari for cluster management and provisioning
• Add map/reduce programs for business logic

Interfaces: Pig ({…}), Hive (Select *), Java map/reduce programs ({ … })
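The "map/reduce programs for business logic" step can be sketched as a Hadoop-streaming-style mapper/reducer pair (a pure-Python illustration run on a list; a real job would read stdin, and the keyed field is an assumption):

```python
from itertools import groupby
from operator import itemgetter

def mapper(lines):
    """Emit (key, 1) per log line, keyed by an assumed customer-code field."""
    for line in lines:
        fields = line.split()
        yield fields[1], 1           # field 1 as the key is an assumption

def reducer(pairs):
    """Sum counts for like keys. We sort locally here; in Hadoop the
    shuffle phase delivers pairs to the reducer already grouped by key."""
    for key, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield key, sum(v for _, v in group)

logs = ["1359486900 1423 1 209158",
        "1359486900 1423 1 209158",
        "1359529800 7001 1 1024"]
print(dict(reducer(mapper(logs))))   # {'1423': 2, '7001': 1}
```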
Building a system
If you need fast, flexible access to in-memory data:
• Start with HDFS
• Add Spark
• Add Spark SQL for SQL-like access, or
• Create Spark programs for other business logic

Interfaces: Spark SQL (Select * from), Spark programs in Java ({ … })
Building a system
If you need real-time stream event processing:
• Start with HDFS
• Add Kafka for messaging and pub/sub
• Add Storm for event processing
• Develop Java Bolts for processing logic

Stack: HDFS (Hadoop/YARN) → Kafka → Storm Bolts ({ … })
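A bolt consumes tuples from a stream and emits derived tuples. The hashtag-counting case from the earlier Twitter example can be sketched as bolt-style logic (a pure-Python analogue; the slides build real bolts in Java against Storm's API, and the class and method names here are illustrative only):

```python
from collections import Counter

class HashtagCountBolt:
    """Bolt-style logic: keep running hashtag counts across a stream of tweets."""

    def __init__(self):
        self.counts = Counter()

    def execute(self, tweet):
        """Process one tuple: count its hashtags, emit (tag, running total)."""
        emitted = []
        for word in tweet.split():
            if word.startswith("#"):
                self.counts[word] += 1
                emitted.append((word, self.counts[word]))
        return emitted

bolt = HashtagCountBolt()
bolt.execute("big data at #akamai")
print(bolt.execute("more #akamai #bigdata"))  # [('#akamai', 2), ('#bigdata', 1)]
```

The per-instance state (`self.counts`) illustrates why Storm partitions a stream by field: all tuples for a given hashtag must reach the same bolt instance for its local count to be correct.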
Future at Akamai
• 100x
  – Everything bigger and faster
  – Requires new R&D across many Big Data components
• Scaling the Big Data ecosystem across the wide area
• Internet Security
  – Positive reputation scoring
  – Automatic DDoS mitigation
• Low-latency data collection
  – 2^53 unique keys, <1 minute latency
• Support DevOps
  – Near real-time monitoring and control
Thank You