MADlib’and’MLbase’ · MAD’Methodology’ •...

37
MADlib and MLbase Quan Fang Xiaofeng Tao

Transcript of MADlib’and’MLbase’ · MAD’Methodology’ •...

Page 1: MADlib’and’MLbase’ · MAD’Methodology’ • Cohen,’Jeffrey,’etal.’"MAD’skills:’new’analysis’ prac7ces’for’big’data."Proceedings+of+the+ VLDB+Endowment2.2(2009

MADlib  and  MLbase    

Quan  Fang  Xiaofeng  Tao  

Page 2: MADlib’and’MLbase’ · MAD’Methodology’ • Cohen,’Jeffrey,’etal.’"MAD’skills:’new’analysis’ prac7ces’for’big’data."Proceedings+of+the+ VLDB+Endowment2.2(2009

MADlib  Analy7cs  Library    

Page 3: MADlib’and’MLbase’ · MAD’Methodology’ • Cohen,’Jeffrey,’etal.’"MAD’skills:’new’analysis’ prac7ces’for’big’data."Proceedings+of+the+ VLDB+Endowment2.2(2009

What  is  a  mad  lib?  

-­‐  From  bigdatablog.emc.com  

Page 4: MADlib’and’MLbase’ · MAD’Methodology’ • Cohen,’Jeffrey,’etal.’"MAD’skills:’new’analysis’ prac7ces’for’big’data."Proceedings+of+the+ VLDB+Endowment2.2(2009

What  is  a  mad  lib?  

-­‐  From  bigdatablog.emc.com  

Page 5: MADlib’and’MLbase’ · MAD’Methodology’ • Cohen,’Jeffrey,’etal.’"MAD’skills:’new’analysis’ prac7ces’for’big’data."Proceedings+of+the+ VLDB+Endowment2.2(2009

MAD  Methodology  

•  Cohen,  Jeffrey,  et  al.  "MAD  skills:  new  analysis  prac7ces  for  big  data."Proceedings  of  the  VLDB  Endowment  2.2  (2009):  1481-­‐1492.  

•  What  does  MAD  stand  for?    

Page 6: MADlib’and’MLbase’ · MAD’Methodology’ • Cohen,’Jeffrey,’etal.’"MAD’skills:’new’analysis’ prac7ces’for’big’data."Proceedings+of+the+ VLDB+Endowment2.2(2009

Magne7c  

•  Use  all  data  sources  regardless  of  format  or  quality  

Page 7: MADlib’and’MLbase’ · MAD’Methodology’ • Cohen,’Jeffrey,’etal.’"MAD’skills:’new’analysis’ prac7ces’for’big’data."Proceedings+of+the+ VLDB+Endowment2.2(2009

Magne7c  Agile  

•  Use  all  data  sources  regardless  of  format  or  quality  

•  Support  dynamic  workflow  and  whatever  data  analysts  need  

Page 8: MADlib’and’MLbase’ · MAD’Methodology’ • Cohen,’Jeffrey,’etal.’"MAD’skills:’new’analysis’ prac7ces’for’big’data."Proceedings+of+the+ VLDB+Endowment2.2(2009

Magne7c  Agile  Deep  

•  Use  all  data  sources  regardless  of  format  or  quality  

•  Support  dynamic  workflow  and  whatever  data  analysts  need  

•  Analyze  large  datasets  without  sampling  

Page 9: MADlib’and’MLbase’ · MAD’Methodology’ • Cohen,’Jeffrey,’etal.’"MAD’skills:’new’analysis’ prac7ces’for’big’data."Proceedings+of+the+ VLDB+Endowment2.2(2009

Magne7c  Agile  Deep  

•  Goal:  Give  analysts  access  to  familiar  math  concepts,  sta7s7cal  methods,  and  algorithms  for  database  

•  Use  tradi7onal  SQL  and  a  wide  range  of  extension  languages  

Page 10: MADlib’and’MLbase’ · MAD’Methodology’ • Cohen,’Jeffrey,’etal.’"MAD’skills:’new’analysis’ prac7ces’for’big’data."Proceedings+of+the+ VLDB+Endowment2.2(2009

Data  Parallel  Sta7s7cs  

•  Give  analysts  access  to  familiar  math  concepts  and  sta7s7cal  methods  for  database  

•  Database  methods  are  data  parallel  and  have  an  SQL  like  interface  

•  Database  methods  fall  into  one  of  the  following  layers  of  abstrac7on:  –  Scalar  –  Vector  – Matrix  –  Func7on  

Page 11: MADlib’and’MLbase’ · MAD’Methodology’ • Cohen,’Jeffrey,’etal.’"MAD’skills:’new’analysis’ prac7ces’for’big’data."Proceedings+of+the+ VLDB+Endowment2.2(2009

MAD  DBMS  

•  Loading  and  Unloading  – Sca]er  /  Gather  streaming  for  fully  parallel  access  – External  tables  

 •  Extract-­‐Transform-­‐Load  (ETL)  and  Extract-­‐Load-­‐Transform  (ELT)  

Page 12: MADlib’and’MLbase’ · MAD’Methodology’ • Cohen,’Jeffrey,’etal.’"MAD’skills:’new’analysis’ prac7ces’for’big’data."Proceedings+of+the+ VLDB+Endowment2.2(2009

MAD  DBMS  

•  Mul7ple  storage  formats  – External  tables  – Heap  format  – Append-­‐mostly  format  

•  Different  storage  formats  per  table  par77on  •  Atomic  par77on  exchange  

Page 13: MADlib’and’MLbase’ · MAD’Methodology’ • Cohen,’Jeffrey,’etal.’"MAD’skills:’new’analysis’ prac7ces’for’big’data."Proceedings+of+the+ VLDB+Endowment2.2(2009

MAD  Programming  

•  Tradi7onal  SQL  queries  with  extensions  •  Map/Reduce  func7ons  wri]en  in  Python,  Perl,  or  R  

•  Wide  range  of  extension  languages  

 

Page 14: MADlib’and’MLbase’ · MAD’Methodology’ • Cohen,’Jeffrey,’etal.’"MAD’skills:’new’analysis’ prac7ces’for’big’data."Proceedings+of+the+ VLDB+Endowment2.2(2009

MAD  Implementa7ons  

•  MADlib  – Library  for  data  analy7cs  that  can  run  on  Greenplum  or  PostgreSQL  

•  Greenplum  database    – Massively  parallel  processing  database  

Page 15: MADlib’and’MLbase’ · MAD’Methodology’ • Cohen,’Jeffrey,’etal.’"MAD’skills:’new’analysis’ prac7ces’for’big’data."Proceedings+of+the+ VLDB+Endowment2.2(2009

MADlib  

•  Library  for  widely  used  sta7s7cs,  data  mining,  and  machine  learning  that  runs  on  top  of  a  SQL  database  – Regression  models  – Machine  learning  – Linear  systems  – Topic  modelling  – Descrip7ve  sta7s7cs  – And  more….  (see  doc.madlib.net)  

 

Page 16: MADlib’and’MLbase’ · MAD’Methodology’ • Cohen,’Jeffrey,’etal.’"MAD’skills:’new’analysis’ prac7ces’for’big’data."Proceedings+of+the+ VLDB+Endowment2.2(2009

MADlib  History  

•  S7ll  in  early  stages  of  development  •  In  use  by  some  research  universi7es  and  companies  

•  Heavily  sponsored  by  Greenplum  

Page 17: MADlib’and’MLbase’ · MAD’Methodology’ • Cohen,’Jeffrey,’etal.’"MAD’skills:’new’analysis’ prac7ces’for’big’data."Proceedings+of+the+ VLDB+Endowment2.2(2009

MADlib  Interface  

•  Interface  consists  of  extensible  SQL  like  scripts  that  call  MADlib  func7ons  

•  Wide  range  of  extension  languages  possible  but  C++  /  Python  recommended  

•  Designed  for  portability  

Page 18: MADlib’and’MLbase’ · MAD’Methodology’ • Cohen,’Jeffrey,’etal.’"MAD’skills:’new’analysis’ prac7ces’for’big’data."Proceedings+of+the+ VLDB+Endowment2.2(2009

MADlib  User  Extensions  

•  User  defined  aggregates  and  func7ons  in  C++  •  Driver  func7ons  use  Python  to  wrap  mul7ple  MADlib  SQL  calls  

•  Templated  queries  are  supported  with  Python  wrappers    

 

Page 19: MADlib’and’MLbase’ · MAD’Methodology’ • Cohen,’Jeffrey,’etal.’"MAD’skills:’new’analysis’ prac7ces’for’big’data."Proceedings+of+the+ VLDB+Endowment2.2(2009

MADlib:  Logis7c  regression  

Page 20: MADlib’and’MLbase’ · MAD’Methodology’ • Cohen,’Jeffrey,’etal.’"MAD’skills:’new’analysis’ prac7ces’for’big’data."Proceedings+of+the+ VLDB+Endowment2.2(2009

MADlib  Example:  K-­‐means  

•  Run  K-­‐means    SELECT * FROM madlib.kmeanspp( 'km_sample', 'points', 2, 'madlib.squared_dist_norm2', 'madlib.avg', 20, 0.001 );

Page 21: MADlib’and’MLbase’ · MAD’Methodology’ • Cohen,’Jeffrey,’etal.’"MAD’skills:’new’analysis’ prac7ces’for’big’data."Proceedings+of+the+ VLDB+Endowment2.2(2009

MADlib  Example:  K-­‐means  

•  Calculate  silhoue]e  coefficient   SELECT * FROM madlib.simple_silhouette( 'km_sample', 'points',

(SELECT centroids FROM 'km_centroids'), 'madlib.squared_dist_norm2' );

Page 22: MADlib’and’MLbase’ · MAD’Methodology’ • Cohen,’Jeffrey,’etal.’"MAD’skills:’new’analysis’ prac7ces’for’big’data."Proceedings+of+the+ VLDB+Endowment2.2(2009

Live  demo  

•  Postgres  command  line  •  Principal  Component  Analysis  – pca_train()  – pca_project()  

Page 23: MADlib’and’MLbase’ · MAD’Methodology’ • Cohen,’Jeffrey,’etal.’"MAD’skills:’new’analysis’ prac7ces’for’big’data."Proceedings+of+the+ VLDB+Endowment2.2(2009

Conclusion  

•  Due  to  the  economics  of  data  (cheap,  high  volume),  data  warehouses  in  the  future  need  to  focus  on  data  analy7cs  – Magne7cally  a]ract  all  data  sources  – Agile  analysis  workflow  – Deeply  analyze  large  datasets  without  sampling  

•  MADlib  provides  analysts  with  such  a  toolbox  that  is  portable  across  SQL  databases  

Page 24: MADlib’and’MLbase’ · MAD’Methodology’ • Cohen,’Jeffrey,’etal.’"MAD’skills:’new’analysis’ prac7ces’for’big’data."Proceedings+of+the+ VLDB+Endowment2.2(2009

MLbase:  A  Distributed  Machine-­‐Learning  System    

Page 25: MADlib’and’MLbase’ · MAD’Methodology’ • Cohen,’Jeffrey,’etal.’"MAD’skills:’new’analysis’ prac7ces’for’big’data."Proceedings+of+the+ VLDB+Endowment2.2(2009

Four  foci    

•  (1)  A  new  declara7ve  language  to  specify  ML  tasks  

•  (2)  A  novel  op7mizer  to  select  ML  algorithms  •  (3)  Provide  answers  early  and  con7nuously  refine.  

•  (4)  Design  a  distributed  run-­‐7me  op7mized  for  the  data  access  pa]erns  of  ML  

Page 26: MADlib’and’MLbase’ · MAD’Methodology’ • Cohen,’Jeffrey,’etal.’"MAD’skills:’new’analysis’ prac7ces’for’big’data."Proceedings+of+the+ VLDB+Endowment2.2(2009

A  Declara7ve  Approach  to  ML  

SQL   Result   MQL   Model  

Page 27: MADlib’and’MLbase’ · MAD’Methodology’ • Cohen,’Jeffrey,’etal.’"MAD’skills:’new’analysis’ prac7ces’for’big’data."Proceedings+of+the+ VLDB+Endowment2.2(2009
Page 28: MADlib’and’MLbase’ · MAD’Methodology’ • Cohen,’Jeffrey,’etal.’"MAD’skills:’new’analysis’ prac7ces’for’big’data."Proceedings+of+the+ VLDB+Endowment2.2(2009

ML Developer

Meta-Data

Statistics

User

Declarative ML Task

ML Contract + Code

Master Server

….

result (e.g., fn-model & summary)

COML(Optimizer)

Parser

Executor/Monitoring

Binders of Algorithms

Runtime Runtime Runtime Runtime

LLP

PLP

Master

MLbase  Architecture  

Algorithms,  Data,  Sta7s7cs  

Adap%ve  Op%mizer    

MLI  Interface  between  MLBase  and  Distributed  Infra  

1  

2  

3  

Page 29: MADlib’and’MLbase’ · MAD’Methodology’ • Cohen,’Jeffrey,’etal.’"MAD’skills:’new’analysis’ prac7ces’for’big’data."Proceedings+of+the+ VLDB+Endowment2.2(2009

ML Developer

Meta-Data

Statistics

User

Declarative ML Task

ML Contract + Code

Master Server

….

result (e.g., fn-model & summary)

COML(Optimizer)

Parser

Executor/Monitoring

Binders of Algorithms

Runtime Runtime Runtime Runtime

LLP

PLP

Master

MLI  Interface  between  MLBase  and  Distributed  Infra  

1  

MLI:  Machine  Learning  Interface  1  

Page 30: MADlib’and’MLbase’ · MAD’Methodology’ • Cohen,’Jeffrey,’etal.’"MAD’skills:’new’analysis’ prac7ces’for’big’data."Proceedings+of+the+ VLDB+Endowment2.2(2009

MLI:  Machine  Learning  Interface  

•  Shield  ML  Developers  from  low-­‐level-­‐details.  •  Independence  between  ML  algorithm  and  run-­‐7me  

•  Current  supported  run-­‐7mes:  

1  

TupleWare  

Page 31: MADlib’and’MLbase’ · MAD’Methodology’ • Cohen,’Jeffrey,’etal.’"MAD’skills:’new’analysis’ prac7ces’for’big’data."Proceedings+of+the+ VLDB+Endowment2.2(2009

ML Developer

Meta-Data

Statistics

User

Declarative ML Task

ML Contract + Code

Master Server

….

result (e.g., fn-model & summary)

COML(Optimizer)

Parser

Executor/Monitoring

Binders of Algorithms

Runtime Runtime Runtime Runtime

LLP

PLP

Master

Algorithms,  Data,  Sta7s7cs  

2  

Algorithms  Pool  2  

Page 32: MADlib’and’MLbase’ · MAD’Methodology’ • Cohen,’Jeffrey,’etal.’"MAD’skills:’new’analysis’ prac7ces’for’big’data."Proceedings+of+the+ VLDB+Endowment2.2(2009

Algorithms  Pool  

Implementa%on  Algorithms  pool  

MLBase  defined  Contracts  •  Type  (e.g.,  classifica7on)  •  Parameters  •  Run7me  (e.g.,  O(n))  •  Input-­‐Specifica7on  •  Output-­‐Specifica7on  •  …  

ML Developer

+  

2  

Extensibility  

Page 33: MADlib’and’MLbase’ · MAD’Methodology’ • Cohen,’Jeffrey,’etal.’"MAD’skills:’new’analysis’ prac7ces’for’big’data."Proceedings+of+the+ VLDB+Endowment2.2(2009

ML Developer

Meta-Data

Statistics

User

Declarative ML Task

ML Contract + Code

Master Server

….

result (e.g., fn-model & summary)

COML(Optimizer)

Parser

Executor/Monitoring

Binders of Algorithms

Runtime Runtime Runtime Runtime

LLP

PLP

Master

MLbase  Architecture  

Op%mizer  LLP  &  PLP  

3  

Page 34: MADlib’and’MLbase’ · MAD’Methodology’ • Cohen,’Jeffrey,’etal.’"MAD’skills:’new’analysis’ prac7ces’for’big’data."Proceedings+of+the+ VLDB+Endowment2.2(2009

Query  Op7miza7on(Logical  Learning  Plan)  3  

Page 35: MADlib’and’MLbase’ · MAD’Methodology’ • Cohen,’Jeffrey,’etal.’"MAD’skills:’new’analysis’ prac7ces’for’big’data."Proceedings+of+the+ VLDB+Endowment2.2(2009

Run  Time  (Physical  Learning  Plan)  

•  (1)  The  master  distributes  ML  tasks  to  workers  •  (2)  Monitor  progress  •  (3)  Handle  failures  

3  

Page 36: MADlib’and’MLbase’ · MAD’Methodology’ • Cohen,’Jeffrey,’etal.’"MAD’skills:’new’analysis’ prac7ces’for’big’data."Proceedings+of+the+ VLDB+Endowment2.2(2009

Conclusion    

•  MLbase’s  core  is  its  op7mizer  •  Transforms  a  declara7ve  ML  task  to  a  sophis7cated  learning  plan  

•  Quickly  returns  a  first  quality  answer,  improves  in  the  background  

•  To  be  fully  distributed  and  offers  a  run-­‐7me  able  to  exploit  ML  algorithms    

Page 37: MADlib’and’MLbase’ · MAD’Methodology’ • Cohen,’Jeffrey,’etal.’"MAD’skills:’new’analysis’ prac7ces’for’big’data."Proceedings+of+the+ VLDB+Endowment2.2(2009

ML Developer