facebook&& for&the( data) masses& · Simple&Hadoop&Cluster& Master& NameNode& JobTracker& Slave&...
Transcript of facebook&& for&the( data) masses& · Simple&Hadoop&Cluster& Master& NameNode& JobTracker& Slave&...
Presto
„an interacKve SQL-‐on-‐hadoop engine“
Presto – InteracKve sql-‐on-‐hadoop Engine
Data scienKst‘s mindset
EMC Data ScienKst Study, 2011
How we manage and analyze data today
By David Rodma (Own work) [Public domain], via Wikimedia Commons
By Infopedian (Own work) [CC-‐BY-‐SA-‐3.0 (h)p://creaKvecommons.org/licenses/by-‐sa/3.0)], via Wikimedia Commons
OperaKonal Systems Data warehouse
Big Data Presto – InteracKve sql-‐on-‐hadoop Engine
Milliseconds Seconds
Minutes -‐ hours
Simple Hadoop Cluster
Master NameNode JobTracker
Slave DataNode TaskTracker
Slave DataNode TaskTracker
Slave DataNode TaskTracker
Client Data, Jobs
HIVE Metastore
Hive Que
ry
MapReduce Job
Presto – InteracKve sql-‐on-‐hadoop Engine
Designed for Batch processing
„We are x Kmes faster than Hive“-‐Engines
„10x -‐ 100x faster!“
Hortonworks Stinger „35x-‐45x faster. Goal: 100x!“
„10x-‐100x faster!“
„10x faster!“
HAWK / GemFire XD
Presto – InteracKve sql-‐on-‐hadoop Engine
...
h)p://blog.scoutapp.com/arKcles/2011/02/08/how-‐much-‐slower-‐is-‐disk-‐vs-‐ram-‐latency
h)p://queue.acm.org/detail.cfm?id=1563874
Where to improve performance?
§ Disk Vs. Memory
Presto – InteracKve sql-‐on-‐hadoop Engine
§ In-‐Memory Data
§ In-‐Memory Processing
§ OpKmize Code
• Target HDFS (on-‐disk data) • Not based on hadoop‘s mapreduce
• Intermediate Results in-‐memory vs on-‐hdfs • Pipelined execuKon
• OpKmizaKons: – Queries are compiled – flat-‐memory data structures (low GC acKvity)
„InteracKve SQL-‐on-‐Hadoop Engine“
Presto – InteracKve sql-‐on-‐hadoop Engine
HIVE Presto Coordinator
Metastore
HDFS
YARN
Hadoop
App
App Ap
p Cassandra Scribe Hbase* MySQL*
Presto Architecture
Presto Presto Worker
MapReduce Jobs
Presto Jo
bs
Presto – InteracKve sql-‐on-‐hadoop Engine
DEMO
Presto – InteracKve sql-‐on-‐hadoop Engine
Presto – interacKve sql-‐on-‐hadoop engine
DEMO
Presto – InteracKve sql-‐on-‐hadoop Engine
2x-‐8x improvement
By Sivaramakrishnan Narayanan, Quobole
One last thing ;)
Presto – InteracKve sql-‐on-‐hadoop Engine
„200x faster!“
> Queries with Bounded Errors and Bounded Response Times on Very Large Data
SELECT AVG(hunger) FROM Table WHERE event=„bedcon2014“ WITHIN 2 SECONDS
Thank You! Lukas.Go)[email protected] @lukesolar blog.codecentric.de
Sources SQL is what’s next for Hadoop: Here’s who’s doing it h)p://gigaom.com/2013/02/21/sql-‐is-‐whats-‐next-‐for-‐hadoop-‐heres-‐whos-‐doing-‐it/ Introducing Presto -‐ AnalyKcs @ Scale 2013 h)ps://www.youtube.com/watch?v=qZsfpK9bafY The patologies of big data h)p://queue.acm.org/detail.cfm?id=1563874 Facebook's New RealKme AnalyKcs System: HBase To Process 20 Billion Events Per Day h)p://highscalability.com/blog/2011/3/22/facebooks-‐new-‐realKme-‐analyKcs-‐system-‐hbase-‐to-‐process-‐20.html Big data benchmark h)ps://amplab.cs.berkeley.edu/benchmark/ Understanding Hadoop Clusters and the Network h)p://bradhedlund.com/2011/09/10/understanding-‐hadoop-‐clusters-‐and-‐the-‐network/#comment-‐11853 Shark: SQL and Rich AnalyKcs at Scale h)ps://amplab.cs.berkeley.edu/wp-‐content/uploads/2013/02/shark_sigmod2013.pdf Quora: What are the main differences between Facebook Presto and Amplab Shark? h)p://www.quora.com/Facebook-‐Engineering/What-‐are-‐the-‐main-‐differences-‐between-‐Facebook-‐Presto-‐and-‐Amplab-‐Shark Qubole – Big data as a service (company website) h)p://www.qubole.com/ Presto official website h)p://prestodb.io/
Presto – InteracKve sql-‐on-‐hadoop Engine