facebook&& for&the( data) masses& · Simple&Hadoop&Cluster& Master& NameNode& JobTracker& Slave&...

15
Lukas Go)er, M.Sc. | lukas.go)[email protected] BEDCON 2014 facebook for the (data) masses

Transcript of facebook&& for&the( data) masses& · Simple&Hadoop&Cluster& Master& NameNode& JobTracker& Slave&...

Page 1: facebook&& for&the( data) masses& · Simple&Hadoop&Cluster& Master& NameNode& JobTracker& Slave& DataNode& TaskTracker& Slave& DataNode& TaskTracker& Slave& DataNode& TaskTracker&

Lukas  Go)er,  M.Sc.  |  lukas.go)[email protected]  BEDCON  2014  

facebook    

for  the  (data)  masses    

Page 2: facebook&& for&the( data) masses& · Simple&Hadoop&Cluster& Master& NameNode& JobTracker& Slave& DataNode& TaskTracker& Slave& DataNode& TaskTracker& Slave& DataNode& TaskTracker&

Presto    

„an  interacKve  SQL-­‐on-­‐hadoop  engine“  

Page 3: facebook&& for&the( data) masses& · Simple&Hadoop&Cluster& Master& NameNode& JobTracker& Slave& DataNode& TaskTracker& Slave& DataNode& TaskTracker& Slave& DataNode& TaskTracker&

Presto  –  InteracKve  sql-­‐on-­‐hadoop  Engine  

Data  scienKst‘s  mindset  

EMC  Data  ScienKst  Study,  2011  

Page 4: facebook&& for&the( data) masses& · Simple&Hadoop&Cluster& Master& NameNode& JobTracker& Slave& DataNode& TaskTracker& Slave& DataNode& TaskTracker& Slave& DataNode& TaskTracker&

How  we  manage  and  analyze  data  today  

By  David  Rodma  (Own  work)  [Public  domain],  via  Wikimedia  Commons  

By  Infopedian  (Own  work)  [CC-­‐BY-­‐SA-­‐3.0  (h)p://creaKvecommons.org/licenses/by-­‐sa/3.0)],  via  Wikimedia  Commons  

OperaKonal  Systems   Data  warehouse  

Big  Data  Presto  –  InteracKve  sql-­‐on-­‐hadoop  Engine  

Milliseconds   Seconds  

Minutes  -­‐  hours  

Page 5: facebook&& for&the( data) masses& · Simple&Hadoop&Cluster& Master& NameNode& JobTracker& Slave& DataNode& TaskTracker& Slave& DataNode& TaskTracker& Slave& DataNode& TaskTracker&

Simple  Hadoop  Cluster  

Master  NameNode  JobTracker  

Slave  DataNode  TaskTracker  

Slave  DataNode  TaskTracker  

Slave  DataNode  TaskTracker  

Client  Data,  Jobs  

HIVE  Metastore  

Hive  Que

ry  

MapReduce  Job  

Presto  –  InteracKve  sql-­‐on-­‐hadoop  Engine  

Designed  for  Batch  processing  

Page 6: facebook&& for&the( data) masses& · Simple&Hadoop&Cluster& Master& NameNode& JobTracker& Slave& DataNode& TaskTracker& Slave& DataNode& TaskTracker& Slave& DataNode& TaskTracker&

„We  are  x  Kmes  faster  than  Hive“-­‐Engines  

„10x  -­‐  100x  faster!“  

Hortonworks Stinger „35x-­‐45x  faster.  Goal:  100x!“  

„10x-­‐100x  faster!“  

„10x  faster!“  

HAWK / GemFire XD

Presto  –  InteracKve  sql-­‐on-­‐hadoop  Engine  

...  

Page 7: facebook&& for&the( data) masses& · Simple&Hadoop&Cluster& Master& NameNode& JobTracker& Slave& DataNode& TaskTracker& Slave& DataNode& TaskTracker& Slave& DataNode& TaskTracker&

h)p://blog.scoutapp.com/arKcles/2011/02/08/how-­‐much-­‐slower-­‐is-­‐disk-­‐vs-­‐ram-­‐latency  

h)p://queue.acm.org/detail.cfm?id=1563874  

Where  to  improve  performance?  

§  Disk  Vs.  Memory  

Presto  –  InteracKve  sql-­‐on-­‐hadoop  Engine  

§  In-­‐Memory  Data    

§  In-­‐Memory  Processing    

§  OpKmize  Code    

Page 8: facebook&& for&the( data) masses& · Simple&Hadoop&Cluster& Master& NameNode& JobTracker& Slave& DataNode& TaskTracker& Slave& DataNode& TaskTracker& Slave& DataNode& TaskTracker&

•  Target  HDFS  (on-­‐disk  data)  •  Not  based  on  hadoop‘s  mapreduce  

•  Intermediate  Results  in-­‐memory  vs  on-­‐hdfs  •  Pipelined  execuKon    

•  OpKmizaKons:  –  Queries  are  compiled    –  flat-­‐memory  data  structures  (low  GC  acKvity)  

„InteracKve  SQL-­‐on-­‐Hadoop  Engine“  

Presto  –  InteracKve  sql-­‐on-­‐hadoop  Engine  

Page 9: facebook&& for&the( data) masses& · Simple&Hadoop&Cluster& Master& NameNode& JobTracker& Slave& DataNode& TaskTracker& Slave& DataNode& TaskTracker& Slave& DataNode& TaskTracker&

HIVE  Presto  Coordinator  

Metastore  

HDFS  

YARN

 

Hadoop  

App  

App  Ap

p  Cassandra  Scribe  Hbase*  MySQL*  

Presto  Architecture  

Presto  Presto  Worker  

MapReduce  Jobs  

Presto  Jo

bs  

Presto  –  InteracKve  sql-­‐on-­‐hadoop  Engine  

Page 10: facebook&& for&the( data) masses& · Simple&Hadoop&Cluster& Master& NameNode& JobTracker& Slave& DataNode& TaskTracker& Slave& DataNode& TaskTracker& Slave& DataNode& TaskTracker&

DEMO  

Presto  –  InteracKve  sql-­‐on-­‐hadoop  Engine  

Page 11: facebook&& for&the( data) masses& · Simple&Hadoop&Cluster& Master& NameNode& JobTracker& Slave& DataNode& TaskTracker& Slave& DataNode& TaskTracker& Slave& DataNode& TaskTracker&

Presto  –  interacKve  sql-­‐on-­‐hadoop  engine  

Page 12: facebook&& for&the( data) masses& · Simple&Hadoop&Cluster& Master& NameNode& JobTracker& Slave& DataNode& TaskTracker& Slave& DataNode& TaskTracker& Slave& DataNode& TaskTracker&

DEMO  

Presto  –  InteracKve  sql-­‐on-­‐hadoop  Engine  

2x-­‐8x  improvement  

By  Sivaramakrishnan  Narayanan,  Quobole  

Page 13: facebook&& for&the( data) masses& · Simple&Hadoop&Cluster& Master& NameNode& JobTracker& Slave& DataNode& TaskTracker& Slave& DataNode& TaskTracker& Slave& DataNode& TaskTracker&

One  last  thing  ;)  

Presto  –  InteracKve  sql-­‐on-­‐hadoop  Engine  

„200x  faster!“  

>  Queries  with  Bounded  Errors  and  Bounded  Response  Times  on  Very  Large  Data  

SELECT  AVG(hunger)  FROM  Table  WHERE  event=„bedcon2014“  WITHIN  2  SECONDS  

Page 14: facebook&& for&the( data) masses& · Simple&Hadoop&Cluster& Master& NameNode& JobTracker& Slave& DataNode& TaskTracker& Slave& DataNode& TaskTracker& Slave& DataNode& TaskTracker&

Thank  You!  Lukas.Go)[email protected]  @lukesolar    blog.codecentric.de  

Page 15: facebook&& for&the( data) masses& · Simple&Hadoop&Cluster& Master& NameNode& JobTracker& Slave& DataNode& TaskTracker& Slave& DataNode& TaskTracker& Slave& DataNode& TaskTracker&

Sources  SQL  is  what’s  next  for  Hadoop:  Here’s  who’s  doing  it  h)p://gigaom.com/2013/02/21/sql-­‐is-­‐whats-­‐next-­‐for-­‐hadoop-­‐heres-­‐whos-­‐doing-­‐it/      Introducing  Presto  -­‐  AnalyKcs  @  Scale  2013  h)ps://www.youtube.com/watch?v=qZsfpK9bafY      The  patologies  of  big  data    h)p://queue.acm.org/detail.cfm?id=1563874    Facebook's  New  RealKme  AnalyKcs  System:  HBase  To  Process  20  Billion  Events  Per  Day  h)p://highscalability.com/blog/2011/3/22/facebooks-­‐new-­‐realKme-­‐analyKcs-­‐system-­‐hbase-­‐to-­‐process-­‐20.html    Big  data  benchmark  h)ps://amplab.cs.berkeley.edu/benchmark/      Understanding  Hadoop  Clusters  and  the  Network  h)p://bradhedlund.com/2011/09/10/understanding-­‐hadoop-­‐clusters-­‐and-­‐the-­‐network/#comment-­‐11853      Shark:  SQL  and  Rich  AnalyKcs  at  Scale  h)ps://amplab.cs.berkeley.edu/wp-­‐content/uploads/2013/02/shark_sigmod2013.pdf    Quora:  What  are  the  main  differences  between  Facebook  Presto  and  Amplab  Shark?  h)p://www.quora.com/Facebook-­‐Engineering/What-­‐are-­‐the-­‐main-­‐differences-­‐between-­‐Facebook-­‐Presto-­‐and-­‐Amplab-­‐Shark    Qubole  –  Big  data  as  a  service  (company  website)  h)p://www.qubole.com/    Presto  official  website  h)p://prestodb.io/        

Presto  –  InteracKve  sql-­‐on-­‐hadoop  Engine