MADlib and MLbase
Quan Fang Xiaofeng Tao
MADlib Analytics Library
What is a mad lib?
- From bigdatablog.emc.com
MAD Methodology
• Cohen, Jeffrey, et al. "MAD skills: new analysis practices for big data." Proceedings of the VLDB Endowment 2.2 (2009): 1481-1492.
• What does MAD stand for?
Magnetic, Agile, Deep
• Magnetic: use all data sources regardless of format or quality
• Agile: support dynamic workflows and whatever data analysts need
• Deep: analyze large datasets without sampling
Magnetic Agile Deep
• Goal: give analysts access to familiar math concepts, statistical methods, and algorithms inside the database
• Use traditional SQL and a wide range of extension languages
Data Parallel Statistics
• Give analysts access to familiar math concepts and statistical methods inside the database
• Database methods are data parallel and have an SQL-like interface
• Database methods fall into one of the following layers of abstraction:
  – Scalar
  – Vector
  – Matrix
  – Function
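To make "data parallel" concrete, here is a minimal Python sketch of how a statistical method can be written as an aggregate: each data partition accumulates its own small state, the states are merged, and a final step turns the merged state into the answer. The example fits a one-variable linear regression; the names RegState, transition, merge, and final are illustrative assumptions, not MADlib's API.

```python
from dataclasses import dataclass


@dataclass
class RegState:
    """Sufficient statistics for fitting y = intercept + slope * x."""
    n: int = 0
    sx: float = 0.0
    sy: float = 0.0
    sxx: float = 0.0
    sxy: float = 0.0


def transition(state, row):
    """Fold one (x, y) row into a partition-local state."""
    x, y = row
    return RegState(state.n + 1, state.sx + x, state.sy + y,
                    state.sxx + x * x, state.sxy + x * y)


def merge(a, b):
    """Combine the states of two partitions; order does not matter."""
    return RegState(a.n + b.n, a.sx + b.sx, a.sy + b.sy,
                    a.sxx + b.sxx, a.sxy + b.sxy)


def final(s):
    """Turn the merged state into (intercept, slope)."""
    slope = (s.n * s.sxy - s.sx * s.sy) / (s.n * s.sxx - s.sx ** 2)
    intercept = (s.sy - slope * s.sx) / s.n
    return intercept, slope


def fit(partitions):
    """Each partition could be processed on a different node."""
    total = RegState()
    for part in partitions:
        local = RegState()
        for row in part:
            local = transition(local, row)
        total = merge(total, local)
    return final(total)


# Rows of y = 1 + 2x, split across three hypothetical partitions.
partitions = [[(0.0, 1.0), (1.0, 3.0)], [(2.0, 5.0)], [(3.0, 7.0), (4.0, 9.0)]]
intercept, slope = fit(partitions)
```

Because the state is a handful of sums, the amount of data exchanged between partitions is constant no matter how many rows each one holds.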
MAD DBMS
• Loading and unloading
  – Scatter/Gather streaming for fully parallel access
  – External tables
• Extract-Transform-Load (ETL) and Extract-Load-Transform (ELT)
MAD DBMS
• Multiple storage formats
  – External tables
  – Heap format
  – Append-mostly format
• Different storage formats per table partition
• Atomic partition exchange
MAD Programming
• Traditional SQL queries with extensions
• Map/Reduce functions written in Python, Perl, or R
• Wide range of extension languages
MAD Implementations
• MADlib
  – Library for data analytics that can run on Greenplum or PostgreSQL
• Greenplum database
  – Massively parallel processing database
MADlib
• Library for widely used statistics, data mining, and machine learning that runs on top of a SQL database
  – Regression models
  – Machine learning
  – Linear systems
  – Topic modelling
  – Descriptive statistics
  – And more… (see doc.madlib.net)
MADlib History
• Still in early stages of development
• In use by some research universities and companies
• Heavily sponsored by Greenplum
MADlib Interface
• Interface consists of extensible SQL-like scripts that call MADlib functions
• Wide range of extension languages possible, but C++ / Python recommended
• Designed for portability
MADlib User Extensions
• User-defined aggregates and functions in C++
• Driver functions use Python to wrap multiple MADlib SQL calls
• Templated queries are supported with Python wrappers
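As a rough illustration of a Python driver wrapping templated SQL, the sketch below fills a query template for several values of k and hands each query to a caller-supplied execute function. The helper names (kmeans_query, run_for_many_k) and the fake executor are assumptions made for this example; only the madlib.kmeanspp call shape is taken from the K-means slide.

```python
KMEANS_TEMPLATE = (
    "SELECT * FROM madlib.kmeanspp("
    "'{table}', '{column}', {k}, "
    "'madlib.squared_dist_norm2', 'madlib.avg', {max_iter}, {tol});"
)


def kmeans_query(table, column, k, max_iter=20, tol=0.001):
    """Fill the template; reject identifiers that are not plain names."""
    for ident in (table, column):
        if not ident.replace("_", "").isalnum():
            raise ValueError(f"unsafe identifier: {ident!r}")
    return KMEANS_TEMPLATE.format(
        table=table, column=column, k=int(k),
        max_iter=int(max_iter), tol=float(tol),
    )


def run_for_many_k(execute, table, column, ks):
    """Driver: issue one templated query per k and collect the results."""
    return {k: execute(kmeans_query(table, column, k)) for k in ks}


# Stand-in for a database cursor: records each query, returns a counter.
issued = []


def fake_execute(sql):
    issued.append(sql)
    return len(issued)


results = run_for_many_k(fake_execute, "km_sample", "points", [2, 3])
```

In a real deployment the execute argument would be bound to a database connection; keeping it as a parameter lets the driver be exercised without one.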
MADlib: Logistic regression
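MADlib Example: K-means
• Run K-means
    SELECT * FROM madlib.kmeanspp(
        'km_sample',                   -- source table
        'points',                      -- column holding the point coordinates
        2,                             -- number of clusters k
        'madlib.squared_dist_norm2',   -- distance function
        'madlib.avg',                  -- aggregate used to recompute centroids
        20,                            -- maximum number of iterations
        0.001                          -- stop once fewer than this fraction of points is reassigned
    );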
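For readers who want to see what such a call does conceptually, here is a small pure-Python sketch of k-means with k-means++ seeding. It mirrors the seven arguments above (squared Euclidean distance, mean as the centroid aggregate, 20 iterations, 0.001 reassignment threshold) but is an independent illustration, not MADlib's implementation; all function names are assumptions.

```python
import random


def squared_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def seed_kmeanspp(points, k, rng):
    """k-means++ seeding: later seeds are drawn with probability
    proportional to the squared distance to the nearest chosen seed.
    Assumes at least k distinct points."""
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        weights = [min(squared_dist(p, c) for c in centroids) for p in points]
        centroids.append(rng.choices(points, weights=weights, k=1)[0])
    return centroids


def kmeanspp(points, k, max_iter=20, min_frac_reassigned=0.001, seed=0):
    rng = random.Random(seed)
    centroids = seed_kmeanspp(points, k, rng)
    assignment = [None] * len(points)
    for _ in range(max_iter):
        # Assignment step: each point goes to its closest centroid.
        new_assignment = [
            min(range(k), key=lambda j: squared_dist(p, centroids[j]))
            for p in points
        ]
        changed = sum(a != b for a, b in zip(assignment, new_assignment))
        assignment = new_assignment
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
        if changed / len(points) < min_frac_reassigned:
            break
    return centroids, assignment


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, assignment = kmeanspp(points, 2)
```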
MADlib Example: K-means
• Calculate silhouette coefficient
    SELECT * FROM madlib.simple_silhouette(
        'km_sample',
        'points',
        (SELECT centroids FROM km_centroids),
        'madlib.squared_dist_norm2'
    );
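The sketch below shows one common definition of a simplified silhouette coefficient: for each point, compare the distance a to its closest centroid with the distance b to its second-closest centroid, score the point as (b - a) / max(a, b), and average the scores. This definition is an assumption used for illustration; the slides do not spell out the formula that simple_silhouette applies.

```python
def squared_dist(p, q):
    """Squared Euclidean distance, matching the distance used above."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def simple_silhouette(points, centroids, dist=squared_dist):
    """Average of (b - a) / max(a, b) over all points, where a and b are
    the distances to the closest and second-closest centroid."""
    if len(centroids) < 2:
        raise ValueError("need at least two centroids")
    total = 0.0
    for p in points:
        a, b = sorted(dist(p, c) for c in centroids)[:2]
        total += 0.0 if b == 0 else (b - a) / b  # b >= a, so max(a, b) == b
    return total / len(points)


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centroids = [(0.0, 1.0), (10.0, 1.0)]
score = simple_silhouette(points, centroids)
```

Values near 1 indicate points that sit much closer to their own centroid than to any other; values near 0 indicate points on a boundary between clusters.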
Live demo
• Postgres command line
• Principal Component Analysis
  – pca_train()
  – pca_project()
Conclusion
• Due to the economics of data (cheap, high volume), future data warehouses need to focus on data analytics
  – Magnetically attract all data sources
  – Agile analysis workflow
  – Deeply analyze large datasets without sampling
• MADlib provides analysts with such a toolbox, portable across SQL databases
MLbase: A Distributed Machine-Learning System
Four foci
• (1) A new declarative language to specify ML tasks
• (2) A novel optimizer to select ML algorithms
• (3) Provide answers early and continuously refine them
• (4) A distributed run-time optimized for the data access patterns of ML
A Declarative Approach to ML
• SQL → Result
• MQL → Model
MLbase Architecture
[Figure: MLbase architecture diagram. Labels: User, Declarative ML Task, result (e.g., fn-model & summary), ML Developer, ML Contract + Code, Master Server, Parser, COML (Optimizer), LLP, PLP, Executor/Monitoring, Binders of Algorithms, Meta-Data, Statistics, Master, Runtime (×4)]
• (1) MLI: interface between MLbase and the distributed infrastructure
• (2) Algorithms, data, statistics
• (3) Adaptive optimizer
MLI: Machine Learning Interface
• Shield ML developers from low-level details
• Independence between ML algorithm and run-time
• Currently supported run-times include TupleWare
Algorithms Pool
• ML developers contribute an implementation plus a contract to the algorithms pool
• MLbase-defined contracts:
  – Type (e.g., classification)
  – Parameters
  – Runtime (e.g., O(n))
  – Input specification
  – Output specification
  – …
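One way to picture a contract is as a small record stored next to the implementation it describes. The sketch below models the fields listed above and a pool that can be queried by task type; the class and field names (Contract, AlgorithmPool, task_type, and so on) are hypothetical and not taken from MLbase.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Contract:
    """Hypothetical record mirroring the contract fields on the slide."""
    name: str
    task_type: str              # e.g. "classification"
    parameters: Dict[str, object]
    runtime: str                # e.g. "O(n)"
    input_spec: str
    output_spec: str


class AlgorithmPool:
    """Holds (contract, implementation) pairs contributed by ML developers."""

    def __init__(self) -> None:
        self._entries: List[Tuple[Contract, Callable]] = []

    def register(self, contract: Contract, implementation: Callable) -> None:
        self._entries.append((contract, implementation))

    def candidates(self, task_type: str) -> List[Contract]:
        """Contracts an optimizer could consider for a given task type."""
        return [c for c, _ in self._entries if c.task_type == task_type]


def majority_class(labels):
    """Toy classifier: always predict the most frequent label."""
    return Counter(labels).most_common(1)[0][0]


pool = AlgorithmPool()
pool.register(
    Contract("majority_class", "classification", {}, "O(n)",
             "list of labels", "single label"),
    majority_class,
)
pool.register(
    Contract("mean_value", "regression", {}, "O(n)",
             "list of floats", "float"),
    lambda ys: sum(ys) / len(ys),
)
names = [c.name for c in pool.candidates("classification")]
```

Because the contract carries the type and runtime cost, something like an optimizer can shortlist algorithms without inspecting their code.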
Extensibility
Query Optimization (Logical Learning Plan)
Run Time (Physical Learning Plan)
• (1) The master distributes ML tasks to workers
• (2) The master monitors progress
• (3) The master handles failures
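The three responsibilities above can be sketched with Python's standard thread pool standing in for remote workers: the master submits tasks, records each outcome as it arrives, and resubmits tasks that failed. This is a toy illustration with assumed names (run_master, flaky_square), not MLbase's run-time.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_master(tasks, worker, max_workers=4, max_retries=2):
    """Distribute tasks, log progress, and retry failed tasks."""
    results, log = {}, []
    attempts = {t: 0 for t in tasks}
    pending = list(tasks)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while pending:
            # (1) Distribute: one future per pending task.
            futures = {pool.submit(worker, t, attempts[t]): t for t in pending}
            pending = []
            for fut in as_completed(futures):
                t = futures[fut]
                try:
                    results[t] = fut.result()
                    log.append((t, "done"))       # (2) Monitor progress.
                except Exception as exc:
                    attempts[t] += 1
                    log.append((t, "failed"))
                    if attempts[t] > max_retries:
                        raise RuntimeError(f"task {t!r} kept failing") from exc
                    pending.append(t)             # (3) Handle failure: retry.
    return results, log


def flaky_square(task, attempt):
    """Simulated worker: task 3 fails on its first attempt only."""
    if task == 3 and attempt == 0:
        raise RuntimeError("simulated worker failure")
    return task * task


results, log = run_master([1, 2, 3, 4], flaky_square)
```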
Conclusion
• MLbase’s core is its optimizer
• Transforms a declarative ML task into a sophisticated learning plan
• Quickly returns a first answer of reasonable quality, then improves it in the background
• Designed to be fully distributed, with a run-time able to exploit the characteristics of ML algorithms
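The "first answer quickly, better answers later" behaviour can be pictured as a generator that yields the best candidate found so far and keeps searching. The sketch below is only an illustration of that anytime pattern under simple assumptions (a fixed list of candidate slopes and a squared-error score); it is not MLbase's optimizer.

```python
def refine(candidates, score):
    """Yield (candidate, score) each time a strictly better one is found."""
    best_score = float("inf")
    for cand in candidates:
        s = score(cand)
        if s < best_score:
            best_score = s
            yield cand, s


# Toy task: find the slope that best fits y = 3x on three points.
xs, ys = [1.0, 2.0, 3.0], [3.0, 6.0, 9.0]


def squared_error(slope):
    return sum((slope * x - y) ** 2 for x, y in zip(xs, ys))


stream = refine([1.0, 2.0, 4.0, 3.0], squared_error)
first_answer = next(stream)      # available after scoring one candidate
later_answers = list(stream)     # improvements found by continued search
```

A caller can act on first_answer immediately and swap in each later answer as it arrives.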