Apache Drill: An Active, Ad-hoc Query System for large-scale Data Sets

36
Apache Drill a interac.ve, adhoc query system for largescale datasets Michael Hausenblas, Chief Data Engineer EMEA, MapR Big Data User Group Stu>gart, 20130516

description

Apache Drill: An Active, Ad-hoc Query System for large-scale Data Sets given by MapR Chief Data Engineer EMEA . Big Data User Group in Stuttgart 2013-05-16

Transcript of Apache Drill: An Active, Ad-hoc Query System for large-scale Data Sets

Page 1: Apache Drill: An Active, Ad-hoc Query System for large-scale Data Sets

Apache  Drill  a  interac.ve,  ad-­‐hoc  query  system  for  large-­‐scale  datasets  

Michael  Hausenblas,  Chief  Data  Engineer  EMEA,  MapR  Big  Data  User  Group  Stu>gart,  2013-­‐05-­‐16  

Page 2: Apache Drill: An Active, Ad-hoc Query System for large-scale Data Sets

Which  workloads  do  you  encounter  in  your  environment?  

h>p://w

ww.flickr.com

/photos/kevinomara/2866648330/  licensed  under  CC  BY-­‐N

C-­‐ND  2.0  

Page 3: Apache Drill: An Active, Ad-hoc Query System for large-scale Data Sets

Batch  processing  

…  for  recurring  tasks  such  as  large-­‐scale  data  mining,  ETL  offloading/data-­‐warehousing  à  for  the  batch  layer  in  Lambda  architecture  

Page 4: Apache Drill: An Active, Ad-hoc Query System for large-scale Data Sets

OLTP  

…  user-­‐facing  eCommerce  transac[ons,  real-­‐[me  messaging  at  scale  (FB),  [me-­‐series  processing,  etc.  à  for  the  serving  layer  in  Lambda  architecture  

Page 5: Apache Drill: An Active, Ad-hoc Query System for large-scale Data Sets

Stream  processing  

…  in  order  to  handle  stream  sources  such  as  social  media  feeds  or  sensor  data  (mobile  phones,  RFID,  weather  sta[ons,  etc.)  à  for  the  speed  layer  in  Lambda  architecture    

Page 6: Apache Drill: An Active, Ad-hoc Query System for large-scale Data Sets

Search/Informa[on  Retrieval  

…  retrieval  of  items  from  unstructured  documents  (plain  text,  etc.),  semi-­‐structured  data  formats  (JSON,  etc.),  as  well  as  data  stores  (MongoDB,  CouchDB,  etc.)  

Page 7: Apache Drill: An Active, Ad-hoc Query System for large-scale Data Sets

h>p://www.flickr.com/photos/9479603@N02/4144121838/    licensed  under  CC  BY-­‐NC-­‐ND  2.0  

But  what  about  interac.ve  ad-­‐hoc  query    at  scale?  

Page 8: Apache Drill: An Active, Ad-hoc Query System for large-scale Data Sets

Impala

Interac[ve  Query  (?)  

low-­‐latency  

Page 9: Apache Drill: An Active, Ad-hoc Query System for large-scale Data Sets

Use  Case:  Marke[ng  Campaign  

•  Jane,  a  marke[ng  analyst  •  Determine  target  segments  •  Data  from  different  sources    

Page 10: Apache Drill: An Active, Ad-hoc Query System for large-scale Data Sets

Use  Case:  Logis[cs  

•  Supplier  tracking  and  performance  •  Queries  – Shipments  from  supplier  ‘ACM’  in  last  24h  – Shipments  in  region  ‘US’  not  from  ‘ACM’  

SUPPLIER_ID   NAME   REGION  

ACM   ACME  Corp   US  

GAL   GotALot  Inc   US  

BAP   Bits  and  Pieces  Ltd   Europe  

ZUP   Zu  Pli   Asia  

{ "shipment": 100123, "supplier": "ACM", “timestamp": "2013-02-01", "description": ”first delivery today” }, { "shipment": 100124, "supplier": "BAP", "timestamp": "2013-02-02", "description": "hope you enjoy it” } …

Page 11: Apache Drill: An Active, Ad-hoc Query System for large-scale Data Sets

Use  Case:  Crime  Detec[on  

•  Online  purchases  •  Fraud,  bilking,  etc.  •  Batch-­‐generated  overview  •  Modes  – Explora[ve  – Alerts  

Page 12: Apache Drill: An Active, Ad-hoc Query System for large-scale Data Sets

Requirements  

•  Support  for  different  data  sources  •  Support  for  different  query  interfaces  •  Low-­‐latency/real-­‐[me  •  Ad-­‐hoc  queries  •  Scalable,  reliable  

Page 13: Apache Drill: An Active, Ad-hoc Query System for large-scale Data Sets

And now for something completely different …

Page 14: Apache Drill: An Active, Ad-hoc Query System for large-scale Data Sets

Google’s  Dremel  

h>p://research.google.com/pubs/pub36632.html      

Sergey  Melnik,  Andrey  Gubarev,  Jing  Jing  Long,  Geoffrey  Romer,  Shiva  Shivakumar,  Ma@  Tolton,  Theo  Vassilakis,  Proc.  of  the  36th  Int'l  Conf  on  Very  Large  Data  Bases  (2010),  pp.  330-­‐339  

Dremel is a scalable, interactive ad-hoc query system for analysis of read-only nested data. By combining multi-level execution trees and columnar data layout, it is capable of running aggregation queries over trillion-row tables in seconds. The system scales to thousands of CPUs and petabytes of data, and has thousands of users at Google. …

“ “ Dremel is a scalable, interactive ad-hoc query system for analysis of read-only nested data. By combining multi-level execution trees and columnar data layout, it is capable of running aggregation queries over trillion-row tables in seconds. The system scales to thousands of CPUs and petabytes of data, and has thousands of users at Google. …

Page 15: Apache Drill: An Active, Ad-hoc Query System for large-scale Data Sets

Google’s  Dremel  

multi-level execution trees  

columnar data layout  

Page 16: Apache Drill: An Active, Ad-hoc Query System for large-scale Data Sets

Google’s  Dremel  

nested data + schema   column-striped representation  

map nested data to tables  

Page 17: Apache Drill: An Active, Ad-hoc Query System for large-scale Data Sets

Google’s  Dremel  

experiments: datasets & query performance  

Page 18: Apache Drill: An Active, Ad-hoc Query System for large-scale Data Sets

Back to Apache Drill …

Page 19: Apache Drill: An Active, Ad-hoc Query System for large-scale Data Sets

Apache  Drill–key  facts  

•  Inspired  by  Google’s  Dremel  •  Standard  SQL  2003  support  •  Plug-­‐able  data  sources  •  Nested  data  is  a  first-­‐class  ci[zen  •  Schema  is  op.onal  •  Community  driven,  open,  100’s  involved  

Page 20: Apache Drill: An Active, Ad-hoc Query System for large-scale Data Sets

High-­‐level  Architecture  

Page 21: Apache Drill: An Active, Ad-hoc Query System for large-scale Data Sets

Principled  Query  Execu[on  

Source  Query   Parser  

Logical  Plan   Op[mizer  

Physical  Plan   Execu[on  

SQL  2003    DrQL  MongoQL  DSL  

scanner  API  Topology  CF  etc.  

query: [ { @id: "log", op: "sequence", do: [ { op: "scan", source: “logs” }, { op: "filter", condition: "x > 3” },

parser  API  

Page 22: Apache Drill: An Active, Ad-hoc Query System for large-scale Data Sets

Wire-­‐level  Architecture  •  Each  node:  Drillbit  -­‐  maximize  data  locality  •  Co-­‐ordina[on,  query  planning,  execu[on,  etc,  are  distributed  •  Any  node  can  act  as  endpoint  for  a  query—foreman  

Storage  Process  

Drillbit  

node  

Storage  Process  

Drillbit  

node  

Storage  Process  

Drillbit  

node  

Storage  Process  

Drillbit  

node  

Page 23: Apache Drill: An Active, Ad-hoc Query System for large-scale Data Sets

Wire-­‐level  Architecture  •  Curator/Zookeeper  for  ephemeral  cluster  membership  info  •  Distributed  cache  (Hazelcast)  for  metadata,  locality  

informa[on,  etc.  

Curator/Zk  

Distributed  Cache  

Storage  Process  

Drillbit  

node  

Storage  Process  

Drillbit  

node  

Storage  Process  

Drillbit  

node  

Storage  Process  

Drillbit  

node  

Distributed  Cache   Distributed  Cache   Distributed  Cache  

Page 24: Apache Drill: An Active, Ad-hoc Query System for large-scale Data Sets

Wire-­‐level  Architecture  •  Origina[ng  Drillbit  acts  as  foreman:  manages  query  execu[on,  

scheduling,  locality  informa[on,  etc.  •  Streaming  data  communica.on  avoiding  SerDe  

Curator/Zk  

Distributed  Cache  

Storage  Process  

Drillbit  

node  

Storage  Process  

Drillbit  

node  

Storage  Process  

Drillbit  

node  

Storage  Process  

Drillbit  

node  

Distributed  Cache   Distributed  Cache   Distributed  Cache  

Page 25: Apache Drill: An Active, Ad-hoc Query System for large-scale Data Sets

Wire-­‐level  Architecture  Foreman  turns  into  root  of  the  mul[-­‐level  execu[on  tree,  leafs  ac[vate  their  storage  engine  interface.   node  

node   node  

Curator/Zk  

Page 26: Apache Drill: An Active, Ad-hoc Query System for large-scale Data Sets

Key  features  

•  Full  SQL  –  ANSI  SQL  2003  •  Nested  Data  as  first  class  ci[zen  •  Op[onal  Schema  •  Extensibility  Points  …  

Page 27: Apache Drill: An Active, Ad-hoc Query System for large-scale Data Sets

Extensibility  Points  

•  Source  query  à  parser  API  •  Custom  operators,  UDF  à  logical  plan  •  Serving  tree,  CF,  topology  à  physical  plan/op[mizer  •  Data  sources  &formats  à  scanner  API  

Source  Query   Parser  

Logical  Plan   Op[mizer  

Physical  Plan   Execu[on  

Page 28: Apache Drill: An Active, Ad-hoc Query System for large-scale Data Sets

…  and  Hadoop?  •  HDFS  can  be  a  data  source  

•  Complementary  use  cases*  

•  …  use  Apache  Drill  –  Find  record  with  specified  condi[on  –  Aggrega[on  under  dynamic  condi[ons  

•  …  use  MapReduce  –  Data  mining  with  mul[ple  itera[ons  –  ETL  

*)  h>ps://cloud.google.com

/files/BigQueryTechnicalW

P.pdf    

Page 29: Apache Drill: An Active, Ad-hoc Query System for large-scale Data Sets

Basic  Demo  

h>ps://cwiki.apache.org/confluence/display/DRILL/Demo+HowTo    

{ "id": "0001", "type": "donut", ”ppu": 0.55, "batters": { "batter”: [

{ "id": "1001", "type": "Regular" }, { "id": "1002", "type": "Chocolate" },

data  source:  donuts.json   query:[ { op:"sequence", do:[

{ op: "scan", ref: "donuts", source: "local-logs", selection: {data: "activity"} }, { op: "filter", expr: "donuts.ppu < 2.00" },

logical  plan:  simple_plan.json  

result:  out.json  

{ "sales" : 700.0, "typeCount" : 1, "quantity" : 700, "ppu" : 1.0 } { "sales" : 109.71, "typeCount" : 2, "quantity" : 159, "ppu" : 0.69 } { "sales" : 184.25, "typeCount" : 2, "quantity" : 335, "ppu" : 0.55 }

Page 30: Apache Drill: An Active, Ad-hoc Query System for large-scale Data Sets

BE  A  PART  OF  IT!  

Page 31: Apache Drill: An Active, Ad-hoc Query System for large-scale Data Sets

Status  

•  Heavy  development  by  mul[ple  organiza[ons  

•  Available  – Logical  plan  (ADSP)  – Reference  interpreter  – Basic  SQL  parser    – Basic  demo  

Page 32: Apache Drill: An Active, Ad-hoc Query System for large-scale Data Sets

Status  

May  2013    •  Full  SQL  support  (+JDBC)  •  Physical  plan  •  In-­‐memory  compressed  data  interfaces  •  Distributed  execu[on  

Page 33: Apache Drill: An Active, Ad-hoc Query System for large-scale Data Sets

Status  

May  2013    •  HBase  and  MySQL  storage  engine  •  WebUI  client  

Page 34: Apache Drill: An Active, Ad-hoc Query System for large-scale Data Sets

Contribu[ng  Contribu[ons  appreciated  (not  only  code  drops)  …    

•  Test  data  &  test  queries  •  Use  case  scenarios  (textual/SQL  queries)  •  Documenta[on  •  Further  schedule  –  Alpha  Q2  –  Beta  Q3  

Page 35: Apache Drill: An Active, Ad-hoc Query System for large-scale Data Sets

Kudos  to  …  

•  Julian  Hyde,  Pentaho    •  Lisen  Mu,  XingCloud  •  Tim  Chen,  Microsow  •  Chris  Merrick,  RJMetrics    •  David  Alves,  UT  Aus[n  •  Sree  Vaadi,  SSS/NGData  •  Jacques  Nadeau,  MapR  •  Ted  Dunning,  MapR  

Page 36: Apache Drill: An Active, Ad-hoc Query System for large-scale Data Sets

Engage!  •  Follow  @ApacheDrill  on  Twi>er  

•  Sign  up  at  mailing  lists  (user  |  dev)    h>p://incubator.apache.org/drill/mailing-­‐lists.html    

 •  Standing  G+  hangouts  every  Tuesday  at  5pm  GMT  

h>p://j.mp/apache-­‐drill-­‐hangouts    

•  Keep  an  eye  on  h>p://drill-­‐user.org/