Open Data Science on Hadoop in the Enterprise

© 2016 Continuum Analytics - Confidential & Proprietary
Open Data Science on Hadoop in the Enterprise: From Sandbox to Production
Peter Wang, CTO, Co-founder

Transcript of Open Data Science on Hadoop in the Enterprise

Page 1: Open Data Science on Hadoop in the Enterprise


Open Data Science on Hadoop in the Enterprise: From Sandbox to Production

Peter Wang CTO, Co-founder

Page 2: Open Data Science on Hadoop in the Enterprise

Overview
• Open Data Science and Anaconda
• Architecture & challenges of real-world Hadoop
• Anaconda for Open Data Science in the Enterprise

Page 3: Open Data Science on Hadoop in the Enterprise

Open Data Science is…

an inclusive movement that makes the open source tools of data science (data, analytics, & computation) easily work together as a connected ecosystem

Page 4: Open Data Science on Hadoop in the Enterprise

Open Data Science means…

Availability | Innovation | Interoperability | Transparency
For everyone in the data science team

OPEN DATA SCIENCE IS THE FOUNDATION TO MODERNIZATION

Page 5: Open Data Science on Hadoop in the Enterprise


Data Science is not just Machine Learning…

Distributed Systems

Business Intelligence

Machine Learning / Statistics

Web

Scientific Computing / HPC

Page 6: Open Data Science on Hadoop in the Enterprise

Data Science is Interdisciplinary…

• Distributed Systems: Hadoop, Spark
• Business Intelligence: Data warehouse, querying, reporting
• Machine Learning / Statistics: Classification, deep learning, regression, PCA
• Web: Web crawling, scraping, 3rd party data & API providers, predictive services & APIs
• Scientific Computing / HPC: GPUs, multi-cores

Page 7: Open Data Science on Hadoop in the Enterprise

Open Source Communities Create Powerful Technology for Data Science

Numba | dask | xlwings | Airflow | Blaze

Distributed Systems

Business Intelligence

Web

Scientific Computing / HPC

Machine Learning / Statistics

Page 8: Open Data Science on Hadoop in the Enterprise

Python is the common language

Numba | dask | xlwings | Airflow | Blaze

Distributed Systems

Business Intelligence

Web

Scientific Computing / HPC

Machine Learning / Statistics

Page 9: Open Data Science on Hadoop in the Enterprise


Python’s Not the Only One…

Distributed Systems

Business Intelligence

Web

Scientific Computing / HPC

SQL

Machine Learning / Statistics

Page 10: Open Data Science on Hadoop in the Enterprise


But it’s also a Great Glue Language

Distributed Systems

Business Intelligence

Machine Learning / Statistics

Web

Scientific Computing / HPC

SQL

Page 11: Open Data Science on Hadoop in the Enterprise

Anaconda is the Open Data Science Platform bringing technology together…

Numba | dask | xlwings | Airflow | Blaze

Distributed Systems

Business Intelligence

Web

Scientific Computing / HPC

Machine Learning / Statistics

Page 12: Open Data Science on Hadoop in the Enterprise

Open Data Science: Vibrant and Growing Community

• Python Community: 30M+
• Packages in Anaconda: 720+
• R Community: 16M+
• Spark Python Usage: 60%+
• ANACONDA Downloads: 8M+

Page 13: Open Data Science on Hadoop in the Enterprise

Open Data Science Platform: ACCELERATE. CONNECT. EMPOWER.

Page 14: Open Data Science on Hadoop in the Enterprise

ACCELERATE Time-to-Value
• INNOVATE faster through managed agile experimentation
• MOVE from analysis to deployment immediately
• DELIVER powerful results backed by a high performance open data science platform

CONNECT Data, Analytics & Compute
• LEVERAGE innovative open source analytics to extract value from data
• MAXIMIZE your computational power to easily analyze all data
• CONNECT and integrate all your data sources for predictive models

EMPOWER Data Science Teams
• ITERATE quickly to create powerful analysis and predictive models
• COLLABORATE and share with your data science team
• PUBLISH interactive results to the business

Page 15: Open Data Science on Hadoop in the Enterprise

Common Architectures of Real-world Hadoop Environments


Page 16: Open Data Science on Hadoop in the Enterprise

Major Components

Hadoop Infrastructure:
• Hadoop Manager
• HDFS NameNode, DataNodes
• Hive, Impala servers
• YARN Resource Manager
• Spark: History server, Gateway server, compute nodes

DW / Analytics Env:
• SQL DB
• ETL systems
• Data Marts

Data Science Sandbox:
• Notebook server
• Big memory nodes
• GPU nodes

Page 17: Open Data Science on Hadoop in the Enterprise


Anaconda Scale System Architecture

Page 18: Open Data Science on Hadoop in the Enterprise

Hadoop / Spark (& existing DW)

App 1

HTTP API

Legacy ETL

App 2

Data marts

XLS, CSV

Viz servers

Page 19: Open Data Science on Hadoop in the Enterprise


source: Master Data Management and Data Governance, 2e

Page 20: Open Data Science on Hadoop in the Enterprise


source: Master Data Management and Data Governance, 2e

Data Science “Sandbox”

Page 21: Open Data Science on Hadoop in the Enterprise

Common Problems

• Data Science Sandbox is on an isolated network, outside of “GRC reservation”
  • Provides freedom to data scientists
  • Protects production ETL, DW, event processing
  • … but moving anything from Sandbox to Production is a huge pain
• Multiple orgs / LOBs interface with the Data Science team in the mixed sandbox environment
• Compliance, audit, & risk control?

Page 22: Open Data Science on Hadoop in the Enterprise

Contrasting Concerns: Exploration vs. Production

Data
  Exploration:
  • Fast, unfettered access
  • Ease of introducing new, varied, messy datasets
  • Reproducibility
  Production:
  • Strict, governed access
  • Well-defined schema
  • Provenance & auditability

Compute Infrastructure
  Exploration:
  • High performance
  • Low latency, interactive
  • Individualized & specialized
  Production:
  • Scalable, high-availability
  • Manageable at scale
  • Cost amortization over many machines and users

Organization
  Exploration:
  • Individual high-achievers with lots of context & capability
  • Agile, able to quickly learn new skills and approaches
  Production:
  • Sustain operations at lowest possible cost
  • Robustness against unintended change

Page 23: Open Data Science on Hadoop in the Enterprise

Core Challenges

• Data exploration generates insight & is required to respond to business challenges
• Production data processing & analytics have different operational concerns
• Over-engineering for either leads to structural deficiencies
• Modern & future needs will require more agile exploration

Page 24: Open Data Science on Hadoop in the Enterprise

The Core Challenge of Open Data Science in the Enterprise


Page 25: Open Data Science on Hadoop in the Enterprise


Conway’s Law


The design of any piece of software reflects the communications structure of the organization that produced it.

Page 26: Open Data Science on Hadoop in the Enterprise

Peter’s Corollary to Conway’s Law

The architecture of any business data system evolves to reflect the budget structure of the IT groups that maintain it.

… not strategic or operational needs
… not ensuring future analytical agility
… not optimizing for rapid insights

Page 27: Open Data Science on Hadoop in the Enterprise

“Don’t Starve the Unicorns”

• How businesses are used to buying can actually push power away from exploratory data science capabilities
• Information systems have ossified into “software & hardware”, which is fine for straightforward data processing
• Not suited for human-in-the-loop production of inference, insight, knowledge

Page 28: Open Data Science on Hadoop in the Enterprise

Data Science != Software Development

• VERY common misconception
• Python is probably the most misunderstood language
• There are “tribes” and ecosystems in Python: web dev, scipy, pydata, embedded, scripting, 3D graphics, etc.
• But businesses tend to pigeonhole it:
  • IT/software/data engineering view: competes with Java, C#, Ruby…
  • Analytics, stats, data science view: competes with R, SAS, Matlab, SPSS, BI systems

Python done right can be a powerful, unifying force across the business.

Page 29: Open Data Science on Hadoop in the Enterprise

Anaconda for Open Data Science in Hadoop


Page 30: Open Data Science on Hadoop in the Enterprise

Modern Data Science Teams Love ANACONDA

• Data Scientist: Hadoop / Spark, Programming Languages, Analytic Libraries, IDE, Notebooks, Visualization
• Biz Analyst: Spreadsheets, Visualization, Notebooks, Analytic Development Environment
• Data Engineer: Database / Data Warehouse, ETL
• Developer: Programming Languages, Analytic Libraries, IDE, Notebooks, Visualization
• DevOps: Database / Data Warehouse, Middleware, Programming Languages

Page 31: Open Data Science on Hadoop in the Enterprise

Anaconda Powers Teams

EXPLORE & ANALYZE
• Explore & prepare data
• Build, test, validate data science models with Python & R
• Build simulations & optimizations
• View data lineage & reuse transformations
• Leverage & explore metadata

COLLABORATE & PUBLISH
• Create & share data science notebooks with interactive visualizations
• Identify reusable data science assets easily
• Authorize access to data science projects
• Manage & control data science asset versions

DEPLOY & OPERATE
• Build & share data science packages & environments
• Launch & provision distributed environments

Page 32: Open Data Science on Hadoop in the Enterprise

Write Once, Deploy Anywhere

OPEN DATA SCIENCE: Explore & Analyze | Collaborate & Publish | Deploy & Operate

• Servers: Linux, Windows, OSX
• GPUs & High End Workstations: Linux & Windows; NVIDIA, AMD, X86/ARM
• Clusters: Yarn, Mesos, MPI, Power8, LSF, Sun Grid Engine
• NoSQL: MongoDB, Cassandra / DataStax
• Hadoop: Cloudera, Hortonworks, Apache Hadoop & Spark
• Files: Microsoft Excel, Trifacta, Import.io
• DW & SQL: Any SQL DB, Any SQL DW, Impala

Page 33: Open Data Science on Hadoop in the Enterprise

Anaconda Architectures

ON-PREMISE | PRIVATE CLOUD | ANACONDA CLOUD

Page 34: Open Data Science on Hadoop in the Enterprise

Anaconda Repository: Multi-Step Process

Public Anaconda Repository (Cloud) → Mirror → Gateway (Test) Repo Server → Mirror → Production Repo Server

• “Tester” End User: has access to Gateway Repo
• End User: has access to Prod Repo
• Authentication: Active Directory / LDAP (optional)

If an Anaconda repo can function as a gateway

– Mirror packages from Anaconda’s public Repository to a ‘Gateway’ Repo
– Testers (with authorization to access Gateway) evaluate new packages
– Approved packages are mirrored to the Production Repo Server
– Standard end users now have access to updated, approved packages
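The multi-step process on this slide can be sketched as a toy model in Python. This is only an illustration of the public → gateway → production promotion idea, not how Anaconda Repository implements mirroring; the `Repo` class, the `mirror` function, and the package versions are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Repo:
    """Toy stand-in for a package repository: package name -> set of versions."""
    name: str
    packages: dict = field(default_factory=dict)

    def add(self, package: str, version: str) -> None:
        self.packages.setdefault(package, set()).add(version)

    def has(self, package: str, version: str) -> bool:
        return version in self.packages.get(package, set())


def mirror(source: Repo, target: Repo, approved=None) -> list:
    """Copy packages from source to target.

    If `approved` is given, only (package, version) pairs in that set are copied.
    Returns the sorted list of pairs that were newly copied.
    """
    copied = []
    for package, versions in source.packages.items():
        for version in versions:
            pair = (package, version)
            if approved is not None and pair not in approved:
                continue  # not yet signed off by a tester
            if not target.has(package, version):
                target.add(package, version)
                copied.append(pair)
    return sorted(copied)


# Illustrative contents only; the versions are made up for the example.
public = Repo("public")
public.add("numpy", "1.11.0")
public.add("pandas", "0.18.1")

gateway = Repo("gateway")
production = Repo("production")

# Step 1: mirror everything from the public repository to the gateway.
mirror(public, gateway)

# Step 2: testers evaluate packages on the gateway and record approvals.
approved = {("numpy", "1.11.0")}

# Step 3: only approved packages are mirrored on to production.
promoted = mirror(gateway, production, approved=approved)
```

After these steps, standard end users pointed at `production` would see only the approved package, while testers on `gateway` still see everything that was mirrored.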

Page 35: Open Data Science on Hadoop in the Enterprise

Anaconda Repository: Air-Gapped Install

Public Anaconda Repository (Cloud) | Firewall | Internal Anaconda Repository (pre-loaded from disk)

Authentication: Active Directory / LDAP (optional)

Users: Analyst 1 | Analyst 2 | Developer

conda install numpy ipython
conda update ipython
conda create -n env1 ipython pandas

conda env upload environment.yml project1
anaconda notebook upload project1.ipynb

conda build project2
anaconda upload project2.bz2

On-site Package Repo and Sharing platform
– Mirror public repository of packages
– Analysts consume packages from local repo
– Analysts upload and share notebooks & pre-configured computing environments
– Developers create, deploy & share custom packages
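For analysts to consume packages from the local repo, each client's conda configuration has to point at it. A minimal sketch, assuming a hypothetical internal URL and channel name (neither appears in the slides), that renders a `.condarc` using conda's `channel_alias` and `channels` settings:

```python
def render_condarc(repo_url: str, channels: list) -> str:
    """Render a minimal .condarc that resolves channel names against repo_url.

    `channel_alias` and `channels` are standard conda settings; the URL and
    channel names passed in are supplied by the caller.
    """
    lines = [f"channel_alias: {repo_url}", "channels:"]
    lines.extend(f"  - {name}" for name in channels)
    return "\n".join(lines) + "\n"


# Hypothetical values for illustration only.
condarc_text = render_condarc(
    "http://repo.example.internal:8080/conda",
    ["approved-packages"],
)
print(condarc_text)
```

Leaving `defaults` out of the `channels` list is the design choice here: it keeps conda from trying to reach the public default channels, which an air-gapped client could not contact anyway.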

Page 36: Open Data Science on Hadoop in the Enterprise

Anaconda Scale: Cluster Management

Client Machine | Internal Anaconda Repository | Cluster
Head Node | Edge Node | Worker Nodes
Cluster Provisioning | Job Submission | State Management | Job Control | Package Control

Page 37: Open Data Science on Hadoop in the Enterprise

Anaconda Enterprise Notebook Computing

Anaconda Enterprise Notebook Server | Web Interface | Authentication | Active Directory / LDAP (optional)
Gateway & Project Nodes, running IPython kernels | Computation
Internal Anaconda Repository | Package Control
User 1 | User 2 | User 3

Workflow:
– The analyst logs into the Enterprise Notebook server, authenticating against LDAP/AD
– Based on the project they select, they are redirected to the appropriate project node
– All notebooks/Python code run on project nodes; any needed packages are pulled down from your local repository
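The login-then-redirect workflow on this slide can be sketched as follows. This is a toy model, not the notebook server's actual logic: a dictionary stands in for LDAP/AD, and the user, password, project, and node names are all hypothetical.

```python
# Toy stand-ins; a real deployment would query LDAP/AD rather than a dict.
DIRECTORY = {"analyst1": "correct-horse"}  # hypothetical user -> password
PROJECT_NODES = {                          # hypothetical project -> node
    "churn-model": "project-node-1",
    "fraud-detection": "project-node-2",
}


def authenticate(username: str, password: str) -> bool:
    """Step 1: check credentials against the (stand-in) directory."""
    return DIRECTORY.get(username) == password


def route(username: str, password: str, project: str) -> str:
    """Steps 1-2: authenticate, then return the node hosting the project."""
    if not authenticate(username, password):
        raise PermissionError(f"authentication failed for {username!r}")
    if project not in PROJECT_NODES:
        raise KeyError(f"unknown project: {project!r}")
    return PROJECT_NODES[project]


node = route("analyst1", "correct-horse", "churn-model")
```

The third step in the workflow (code runs on the project node and pulls packages from the local repository) happens after this redirect and is outside the sketch.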

Page 38: Open Data Science on Hadoop in the Enterprise

Integrated Environment

Client Machine | Internal Anaconda Repository | Anaconda Enterprise Notebook Server | Teradata
Cluster | Head Node | Edge Node | Hadoop Worker Nodes
Cluster Provisioning | Job Submission | State Management | Package Control | Authentication | Web Interface | Computation

• LDAP: TCP 389/636
• HTTP: TCP 8080
• HTTP: TCP 5002
• SSH: TCP 22
• SALT: TCP 4505, 4506
• HTTP/HTTPS: TCP 80/44 TCP 8080
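Before installing, it can help to confirm that the listed ports are reachable through the firewall. A minimal sketch, assuming a plain TCP connect is an adequate check; the slide does not say which host each port belongs to, so the host is left to the caller, and the HTTP/HTTPS entry is omitted from the table below. The names `PORTS_FROM_SLIDE`, `tcp_probe`, and `unreachable` are hypothetical.

```python
import socket

# Ports as listed on the slide, grouped by service.
PORTS_FROM_SLIDE = {
    "ldap": [389, 636],
    "http": [8080, 5002],
    "ssh": [22],
    "salt": [4505, 4506],
}


def tcp_probe(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def unreachable(host: str, ports=PORTS_FROM_SLIDE, probe=tcp_probe) -> list:
    """Return the sorted (service, port) pairs that the probe could not reach."""
    return sorted(
        (service, port)
        for service, port_list in ports.items()
        for port in port_list
        if not probe(host, port)
    )
```

Calling `unreachable("some-host")` would return the pairs that still need a firewall rule. The `probe` argument is injectable so the logic can be exercised without touching the network.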

User 1 | User 2 | User 3 | Analyst 1 | Analyst 2 | Developer

Page 39: Open Data Science on Hadoop in the Enterprise

Anaconda Accelerates Adoption of Open Data Science for Enterprises

Across all Data, Operating Systems, & Hardware Platforms

Explore & Visualize complex data easily

Harness Open Source Python & R Analytics

Write Once, Deploy Anywhere for Scalable High Performance

Data Engineering Simplified for All Data

Collaborate with Your Team anywhere in the World

Integrate Data from Anywhere