facebook&& for&the( data) masses& · Simple&Hadoop&Cluster& Master& NameNode& JobTracker& Slave&...

Lukas Go)er, M.Sc. | lukas.go)[email protected] BEDCON 2014

facebook

for the (data) masses

Presto

„an interacKve SQL-‐on-‐hadoop engine“

Presto – InteracKve sql-‐on-‐hadoop Engine

Data scienKst‘s mindset

EMC Data ScienKst Study, 2011

How we manage and analyze data today

By David Rodma (Own work) [Public domain], via Wikimedia Commons

By Infopedian (Own work) [CC-‐BY-‐SA-‐3.0 (h)p://creaKvecommons.org/licenses/by-‐sa/3.0)], via Wikimedia Commons

OperaKonal Systems Data warehouse

Big Data Presto – InteracKve sql-‐on-‐hadoop Engine

Milliseconds Seconds

Minutes -‐ hours

Simple Hadoop Cluster

Master NameNode JobTracker

Slave DataNode TaskTracker



Client Data, Jobs

HIVE Metastore

Hive Que

ry

MapReduce Job


Designed for Batch processing

„We are x Kmes faster than Hive“-‐Engines

„10x -‐ 100x faster!“

Hortonworks Stinger „35x-‐45x faster. Goal: 100x!“

„10x-‐100x faster!“

„10x faster!“

HAWK / GemFire XD


...

h)p://blog.scoutapp.com/arKcles/2011/02/08/how-‐much-‐slower-‐is-‐disk-‐vs-‐ram-‐latency

h)p://queue.acm.org/detail.cfm?id=1563874

Where to improve performance?

§  Disk Vs. Memory


§  In-‐Memory Data

§  In-‐Memory Processing

§  OpKmize Code

•  Target HDFS (on-‐disk data) •  Not based on hadoop‘s mapreduce

•  Intermediate Results in-‐memory vs on-‐hdfs •  Pipelined execuKon

•  OpKmizaKons: –  Queries are compiled –  flat-‐memory data structures (low GC acKvity)

„InteracKve SQL-‐on-‐Hadoop Engine“


HIVE Presto Coordinator

Metastore

HDFS

YARN

Hadoop

App

App Ap

p Cassandra Scribe Hbase* MySQL*

Presto Architecture

Presto Presto Worker

MapReduce Jobs

Presto Jo

bs


DEMO


Presto – interacKve sql-‐on-‐hadoop engine

DEMO


2x-‐8x improvement

By Sivaramakrishnan Narayanan, Quobole

One last thing ;)


„200x faster!“

> Queries with Bounded Errors and Bounded Response Times on Very Large Data

SELECT AVG(hunger) FROM Table WHERE event=„bedcon2014“ WITHIN 2 SECONDS

Thank You! Lukas.Go)[email protected] @lukesolar blog.codecentric.de

Sources SQL is what’s next for Hadoop: Here’s who’s doing it h)p://gigaom.com/2013/02/21/sql-‐is-‐whats-‐next-‐for-‐hadoop-‐heres-‐whos-‐doing-‐it/ Introducing Presto -‐ AnalyKcs @ Scale 2013 h)ps://www.youtube.com/watch?v=qZsfpK9bafY The patologies of big data h)p://queue.acm.org/detail.cfm?id=1563874 Facebook's New RealKme AnalyKcs System: HBase To Process 20 Billion Events Per Day h)p://highscalability.com/blog/2011/3/22/facebooks-‐new-‐realKme-‐analyKcs-‐system-‐hbase-‐to-‐process-‐20.html Big data benchmark h)ps://amplab.cs.berkeley.edu/benchmark/ Understanding Hadoop Clusters and the Network h)p://bradhedlund.com/2011/09/10/understanding-‐hadoop-‐clusters-‐and-‐the-‐network/#comment-‐11853 Shark: SQL and Rich AnalyKcs at Scale h)ps://amplab.cs.berkeley.edu/wp-‐content/uploads/2013/02/shark_sigmod2013.pdf Quora: What are the main differences between Facebook Presto and Amplab Shark? h)p://www.quora.com/Facebook-‐Engineering/What-‐are-‐the-‐main-‐differences-‐between-‐Facebook-‐Presto-‐and-‐Amplab-‐Shark Qubole – Big data as a service (company website) h)p://www.qubole.com/ Presto official website h)p://prestodb.io/


facebook&& for&the( data) masses& · Simple&Hadoop&Cluster& Master& NameNode& JobTracker& Slave&...

Documents

Transcript of facebook&& for&the( data) masses& · Simple&Hadoop&Cluster& Master& NameNode& JobTracker& Slave&...