Distributed Computing on your Cluster with Anaconda - Webinar 2015
Transcript of Distributed Computing on your Cluster with Anaconda - Webinar 2015
Presenter Bio
Kristopher Overholt received his Ph.D. in Civil Engineering from The University of Texas at Austin. Prior to joining Continuum, he worked at the National Institute of Standards and Technology (NIST), Southwest Research Institute (SwRI), and The University of Texas at Austin. Kristopher has 10+ years of experience in areas including applied research, scientific and parallel computing, system administration, open-source software development, and computational modeling.
Kristopher Overholt, Software Engineer, Continuum Analytics
Overview
• Overview of Anaconda
• Cluster Functionality of Anaconda
• Demo: Distributed Natural Language Processing
• Demo: Distributed Image Processing with GPUs
• Demo: Distributed SQL Queries on 1 TB of Data
• Anaconda Use Cases for your Enterprise
Anaconda is… the modern open source analytics platform powered by Python, the fastest growing open data science language.
• Easy to Build, Maintain & Deploy Analytics
• Talks with Everything, Runs Anywhere
• High Performance, Scalable Analytics
Anaconda: Accelerating Adoption of Python for Enterprises
• Collaborative notebooks with publication, authentication, & search (Jupyter/IPython)
• Python & package management for the Hadoop & Apache Spark stack
• Performance with compiled Python for lightning-fast execution (Numba)
• Visual apps for interactivity, streaming, & big data (Bokeh)
• Secure & robust repository of data science libraries, scripts, & notebooks (Conda)
• Enterprise data integration with optimized connectors & out-of-core processing (NumPy & Pandas)
Anaconda for Data Science: Empowering Everyone on the Team

Data Scientist
• Advanced analytics with Python & R
• Simplified library management
• Easily share data science notebooks & packages

Developer
• Support for common APIs & data formats
• Common language with data scientists
• Python extensibility with C, C++, etc.

Business Analyst
• Collaborative interactive analytics with notebooks
• Rich browser-based visualizations
• Powerful MS Excel integration

Data Engineer
• Powerful & efficient libraries for data transformations
• Robust processing for noisy, dirty data
• Support for common APIs & data formats

Ops
• Validated source of up-to-date packages, including indemnification
• Agile enterprise package management
• Supported across platforms

Computational Scientist
• Rich set of advanced analytics
• Trusted & production-ready libraries for numerics
• Simplified scale-up & scale-out on clusters & GPUs
Write Once, Deploy Anywhere: Managed Python
Explore & Visualize
Python & R Advanced Analytics
High Performance & Scalability
Data Engineering & Analysis
Collaboration & Integration
Servers: Linux, Windows, OS X
GPUs & High-End Workstations: Linux & Windows; NVIDIA, AMD, x86/ARM
Clusters: YARN, Mesos, MPI, Power8, LSF, Sun Grid Engine
NoSQL: MongoDB, Cassandra/DataStax
Hadoop: Cloudera, Hortonworks, Apache Hadoop & Spark
Files: Microsoft Excel, Trifacta, Import.io
DW & SQL: Any SQL DB, Any SQL DW, Impala
Anaconda: Scaled up Python for your Enterprise
• Analysts, domain experts, quants, statisticians, data scientists, etc. want to leverage Python and existing libraries
• Newer analytics engines leverage existing runtimes, including Python and R (PySpark, SparkR)

Python & R open source analytics with conda: NumPy, SciPy, Pandas, Scikit-learn, Jupyter/IPython, Numba, Matplotlib, Spyder, Numexpr, Cython, Theano, Scikit-image, NLTK, NetworkX, IRKernel, dplyr, shiny, ggplot2, tidyr, caret, nnet, and 330+ more packages
For data scientists:
• Scaled-up analytics: develop and deploy the same code/environment on your local machine and the cluster
• Cluster management: easily provision and manage your cluster stack and data analysis tools/environments

For system administrators and DevOps:
• Environment management: provide the tools your data scientists need at enterprise scale
• Remote packaging: easily deploy Python (or R, or Julia, or…) applications to your Spark/Hadoop cluster
Anaconda-Powered Cluster for your Enterprise
Distributed Systems
Databases Stats/Machine Learning
Scientific Computing
Modern Analytics Ecosystem
Remote Conda and Cluster Management
New Spark/Hadoop clusters
• Create and provision a Spark/Hadoop cluster with a few simple steps
• Work on the cloud or with your existing in-house servers
Existing Spark/Hadoop clusters
• Deploy and manage conda packages/environments on cluster nodes
• Solves the remote packaging problem
• Empower data scientists without sacrificing control of your cluster
Defining a Cloud-Based Cluster

Provider:
aws_east:
  cloud_provider: ec2
  keyname: anaconda-cluster
  location: us-east-1
  private_key: ~/.ssh/anaconda-cluster.pem
  secret_id: **********
  secret_key: **********

Profile:
name: spark-cluster
node_id: ami-d05e75b8
node_type: m3.xlarge
num_nodes: 4
provider: aws_east
user: ubuntu
Defining a Bare-Metal Cluster

Provider:
bare_metal:
  cloud_provider: none
  private_key: ~/.ssh/my-private-key

Profile:
name: spark-cluster
provider: bare_metal
num_nodes: 4
machines:
  head:
    - 192.168.1.1
  compute:
    - 192.168.1.2
    - 192.168.1.3
    - 192.168.1.4
Creating and Using a Cluster
• Define provider: ~/.acluster/providers.yaml
• Define profile: ~/.acluster/profiles.d/profile.yaml
• Create cluster: acluster create cluster_name -p profile
• Install plugins: acluster install spark-yarn notebook
• Remote conda: acluster conda install numpy scipy
Remote Conda Commands
• Install packages: acluster conda install numpy scipy pandas numba
• Create environment: acluster conda create -n py34 python=3.4 numpy scipy pandas
• List packages: acluster conda list
• Conda information: acluster conda info
• Push environment: acluster conda push environment.yml
• Set default environment: acluster conda setenv py34
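The push command above ships a standard conda environment file to every node. A minimal example of such a file (the environment name matches the py34 environment used in these commands; the package choices are illustrative):

```yaml
name: py34
dependencies:
  - python=3.4
  - numpy
  - scipy
  - pandas
```

Pushing one file that describes the whole environment is what keeps the local machine and the cluster in sync for the "write once, deploy anywhere" workflow.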
Cluster Management Commands
• Create cluster: acluster create spark-cluster -p spark-profile
• List active clusters: acluster list
• Install plugins: acluster install spark-yarn notebook
• SSH to nodes: acluster ssh
• Put/get files: acluster put data.hdf5 /home/ubuntu/data.hdf5
• Run command: acluster cmd 'apt-get install ...'
• Submit script: acluster submit spark_script.py
Demo Overview
This demo shows a simple PySpark job that uses the NLTK library, a popular Python package for processing human language data.
This demo will show the installation of Python packages on the cluster, the use of Spark and the YARN resource manager, and remote execution of the Spark job on the cluster.
Application: Jupyter/IPython Notebook
Analytics: Spark, NLTK
Data: Local files on each node
Server: Bare-metal or cloud-based cluster
Demo Step-by-Step
• Create cluster
– 4 nodes, m3.large, 2 vCPUs, 7.5 GB RAM
• Install Spark, YARN, and Notebook plugins
• Remotely install conda packages
• Parallel download of data onto cluster nodes
• Use Spark and NLTK to tokenize words and tag parts of speech
– Remotely submitting the script to the Spark cluster
– Interactively in a notebook on the Spark cluster
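The tokenize-and-tag step above can be sketched as a small PySpark job. This is a hedged sketch, not the webinar's actual script: the data path is hypothetical, and `simple_tokenize` is a stand-in added only to show the shape of the map step. NLTK is imported inside the partition function so the import resolves on the worker nodes, where `acluster conda install nltk` placed it.

```python
import re


def simple_tokenize(text):
    """Cheap regex stand-in tokenizer, shown only to illustrate the
    per-line map step; the demo itself uses nltk.word_tokenize."""
    return re.findall(r"[A-Za-z']+", text.lower())


def tag_partition(lines):
    """Tokenize each line and tag parts of speech with NLTK.
    Importing inside the function defers the import to the workers."""
    import nltk
    for line in lines:
        yield nltk.pos_tag(nltk.word_tokenize(line))


def run_job(sc, path="/home/ubuntu/data.txt"):  # hypothetical path
    """Run on a SparkContext; submit via `acluster submit` or a notebook."""
    rdd = sc.textFile(path)
    return rdd.mapPartitions(tag_partition).take(10)
```

On the cluster, calling `run_job` with a live SparkContext runs the tagging in parallel across the worker nodes, one NLTK call per partition rather than per line.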
Demo Overview
To demonstrate the capability of running a distributed job in PySpark using GPUs, this demo uses Numba and the CUDA platform to perform image processing.
This demo executes two-dimensional FFT convolution on images in grayscale and compares the execution time of CPU-based and GPU-based calculations.
Application: Jupyter/IPython Notebook
Visualization: matplotlib
Analytics: Spark, Numba, SciPy, PIL
Data: HDFS
Server: GPU-enabled bare-metal or cloud-based cluster
Demo Step-by-Step
• Create cluster
– 4 nodes, g2.2xlarge, 8 vCPUs, 15 GB RAM, 1 GPU
• Install Spark, YARN, HDFS, and Notebook plugins
• Bootstrap CUDA drivers on all nodes
• Remotely install conda packages
• High-performance parallel download of data into HDFS
• Use Spark, Numba, and GPU to perform FFT convolution on images
– Interactively in a notebook on the Spark cluster
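The CPU side of the comparison can be sketched in NumPy. This is a hedged illustration of 2-D FFT convolution rather than the webinar's notebook; the GPU variant would replace the NumPy FFT calls with CUDA FFT kernels driven through Numba, leaving the surrounding logic the same.

```python
import numpy as np


def fft_convolve2d(image, kernel):
    """Full 2-D convolution in the frequency domain: pad both inputs
    to the full output size, multiply their FFTs, and transform back.
    Costs O(N log N) versus O(N^2) for direct convolution."""
    s0 = image.shape[0] + kernel.shape[0] - 1
    s1 = image.shape[1] + kernel.shape[1] - 1
    F = np.fft.rfft2(image, (s0, s1))
    G = np.fft.rfft2(kernel, (s0, s1))
    return np.fft.irfft2(F * G, (s0, s1))


# Convolving a 2x2 box of ones with itself yields the
# [[1, 2, 1], [2, 4, 2], [1, 2, 1]] overlap pattern.
out = fft_convolve2d(np.ones((2, 2)), np.ones((2, 2)))
```

In the demo, each Spark task would apply a function like this to one image per record, which is what makes the per-image CPU-versus-GPU timing comparison straightforward to distribute.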
Demo Overview
In this demo, we’ll interactively query, explore, and visualize a data set of approximately 1.8 billion comments (~1 TB).
Application: Jupyter/IPython Notebook
Visualization: Bokeh
Analytics: Blaze/pandas, Hive, Impala
Data: HDFS
Server: Bare-metal or cloud-based cluster
Demo Step-by-Step
• Create cluster
– 8 nodes, m3.2xlarge, 8 vCPUs, 30 GB RAM, 1 TB storage
• Install HDFS, Hive, Impala, and Notebook plugins
• High-performance parallel download of data into HDFS
• Move, convert, and load data into distributed SQL databases
• Run interactive queries from notebook using Blaze
• Interactively plot and explore results using Bokeh
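The interactive-query step can be illustrated locally. In the demo, Blaze expressions (something like `by(data.author, n=data.id.count())`) compile to SQL that Impala executes over the full table; the sketch below issues an equivalent aggregation against an in-memory SQLite table so it runs anywhere. The table and column names are made up for illustration, not taken from the demo's dataset.

```python
import sqlite3


def top_authors(conn, limit=3):
    """Count comments per author -- the kind of group-by aggregation
    Blaze would express symbolically and push down to Impala."""
    cur = conn.execute(
        "SELECT author, COUNT(*) AS n FROM comments "
        "GROUP BY author ORDER BY n DESC LIMIT ?", (limit,))
    return cur.fetchall()


# Tiny stand-in for the ~1.8-billion-row comments table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE comments (id INTEGER, author TEXT)")
conn.executemany("INSERT INTO comments VALUES (?, ?)",
                 [(1, "alice"), (2, "alice"), (3, "bob")])
rows = top_authors(conn)  # [('alice', 2), ('bob', 1)]
```

The point of Blaze in the demo is that the same expression runs unchanged whether the backend is pandas, Hive, or Impala; only the data URI changes.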
Moving and Loading 1 TB of Data

Source: Amazon S3 (JSON)

Destination       | Time Moving Data | Time Querying Data
HDFS (JSON)       | 2 hours          | -
Hive (JSON)       | < 1 minute       | 30 minutes
Hive (Parquet)    | 1 hour           | 5 minutes
Impala (Parquet)  | < 1 minute       | 5 seconds
Data Science Use Case
(Diagram: client machine, head node, and compute nodes)
1. Analyst creates a sandbox Spark/Hadoop cluster and installs plugins
2. Analyst deploys packages, environments, and data to cluster nodes
3. Analyst submits jobs to Spark/Hadoop cluster
Enterprise Use Case
(Diagram: analyst machine, admin, head node, and compute nodes)
1. Analyst ships packages, environments, and data to on-premises repository
2. Admin deploys packages, environments, and data to cluster nodes
3. Analyst submits jobs to Spark/Hadoop cluster
Anaconda Cluster Plugins
Conda, Spark, YARN, HDFS, Hive, Impala, Storm, Zookeeper, Elasticsearch, Logstash, Kibana, Ganglia, Jupyter/IPython Notebook, IPython Parallel, Dask
System Architecture Diagram
(Diagram of services across the head node, secondary head node, edge nodes, and compute nodes, including: Anaconda Cluster Head (ACH), Anaconda Cluster Compute (ACC), Anaconda Server, Zookeeper Server, Hadoop Manager, Impala StateStore, Impala Daemon, Impala Catalog Server, History Server (Spark), Spark Gateway, Resource Manager (YARN), JobHistory Server, Hue, NameNode (HDFS), Secondary NameNode, DataNode, HttpFS, Hive Metastore, Gateway, WebHCat Server, HiveServer2, YARN Gateway, NodeManager, and other services.)
Network Architecture Diagram
(Diagram: the client machine connects to the head node over port 22 (SSH); the head node communicates with the compute nodes over ports 4505 and 4506 (Salt); the Anaconda Server is reached over ports 8080 (HTTP) and 8443 (HTTPS).)
Test-Drive Anaconda on a Cluster
1. Register for an Anaconda Cloud account at Anaconda.org
2. Download Anaconda Cluster using conda:
   $ conda install anaconda-client
   $ anaconda login
   $ conda install anaconda-cluster -c anaconda-cluster
3. Create a sandbox/demo cluster
Anaconda Subscriptions

Anaconda (free forever, DOWNLOAD): Open source modern analytics platform powered by Python; community support.
Anaconda Pro (starting at $10,000 per year + $1,000 per year for additional users, CONTACT US): Anaconda with support & indemnification; Priority 1 support.
Anaconda Workgroup (starting at $30,000 per year + $3,000 per year for additional users, CONTACT US): Anaconda with high performance and team collaboration; Priority 1 support.
Anaconda Enterprise (starting at $60,000 per year + $6,000 per year for additional users, CONTACT US): Anaconda with scalable high performance and team collaboration; Priority 1 support with dedicated customer support rep.
Contact Information and Additional Details
• Contact [email protected] for more information about Anaconda subscriptions, consulting, or training
• View documentation and examples at docs.continuum.io/anaconda-cluster
• View demo notebooks on Anaconda Cloud at notebooks.anaconda.org/anaconda-cluster