Distributed Computing on your Cluster with Anaconda - Webinar 2015
Transcript of Distributed Computing on your Cluster with Anaconda - Webinar 2015
Presenter Bio
Kristopher Overholt received his Ph.D. in Civil Engineering from The University of Texas at Austin. Prior to joining Continuum, he worked at the National Institute of Standards and Technology (NIST), Southwest Research Institute (SwRI), and The University of Texas at Austin. Kristopher has 10+ years of experience in areas including applied research, scientific and parallel computing, system administration, open-source software development, and computational modeling.
Kristopher Overholt, Software Engineer, Continuum Analytics
Overview
• Overview of Anaconda
• Cluster Functionality of Anaconda
• Demo: Distributed Natural Language Processing
• Demo: Distributed Image Processing with GPUs
• Demo: Distributed SQL Queries on 1 TB of Data
• Anaconda Use Cases for your Enterprise
Anaconda is… the modern open source analytics platform powered by Python, the fastest growing open data science language.
• Easy to Build, Maintain & Deploy Analytics
• Talks with Everything, Runs Anywhere
• High Performance, Scalable Analytics
Anaconda: Accelerating Adoption of Python for Enterprises
• Collaborative notebooks with publication, authentication, & search (Jupyter/IPython)
• Python & package management for the Hadoop & Apache Spark stack
• Performance with compiled Python for lightning-fast execution (Numba)
• Visual apps for interactivity, streaming, & big data (Bokeh)
• Secure & robust repository of data science libraries, scripts, & notebooks (Conda)
• Enterprise data integration with optimized connectors & out-of-core processing (NumPy & Pandas)
Anaconda for Data Science: Empowering Everyone on the Team

Data Scientist
• Advanced analytics with Python & R
• Simplified library management
• Easily share data science notebooks & packages

Developer
• Support for common APIs & data formats
• Common language with data scientists
• Python extensibility with C, C++, etc.

Business Analyst
• Collaborative interactive analytics with notebooks
• Rich browser-based visualizations
• Powerful MS Excel integration

Data Engineer
• Powerful & efficient libraries for data transformations
• Robust processing for noisy, dirty data
• Support for common APIs & data formats

Ops
• Validated source of up-to-date packages, including indemnification
• Agile enterprise package management
• Supported across platforms

Computational Scientist
• Rich set of advanced analytics
• Trusted & production-ready libraries for numerics
• Simplified scale-up & scale-out on clusters & GPUs
Write Once, Deploy Anywhere: Managed Python
Explore & Visualize
Python & R Advanced Analytics
High Performance & Scalability
Data Engineering & Analysis
Collaboration & Integration
Servers: Linux, Windows, OS X
GPUs & High-End Workstations: Linux & Windows; NVIDIA, AMD, x86/ARM
Clusters: YARN, Mesos, MPI, Power8, LSF, Sun Grid Engine
NoSQL: MongoDB, Cassandra/DataStax
Hadoop: Cloudera, Hortonworks, Apache Hadoop & Spark
Files: Microsoft Excel, Trifacta, Import.io
DW & SQL: Any SQL DB, Any SQL DW, Impala
Anaconda: Scaled up Python for your Enterprise
• Analysts, domain experts, quants, statisticians, data scientists, etc. want to leverage Python and existing libraries
• Newer analytics engines leverage existing runtimes, including Python and R (PySpark, SparkR)

Python & R open source analytics with conda: NumPy, SciPy, Pandas, Scikit-learn, Jupyter/IPython, Numba, Matplotlib, Spyder, Numexpr, Cython, Theano, Scikit-image, NLTK, NetworkX, IRKernel, dplyr, shiny, ggplot2, tidyr, caret, nnet, and 330+ more packages
For data scientists:
• Scaled-up analytics: develop and deploy the same code/environment on your local machine and the cluster
• Cluster management: easily provision and manage your cluster stack and data analysis tools/environments

For system administrators and DevOps:
• Environment management: provide the tools your data scientists need at enterprise scale
• Remote packaging: easily deploy Python (or R, or Julia, or…) applications to your Spark/Hadoop cluster
Anaconda-Powered Cluster for your Enterprise
Distributed Systems
Databases Stats/Machine Learning
Scientific Computing
Modern Analytics Ecosystem
Remote Conda and Cluster Management
New Spark/Hadoop clusters
• Create and provision a Spark/Hadoop cluster with a few simple steps
• Work on the cloud or with your existing in-house servers
Existing Spark/Hadoop clusters
• Deploy and manage conda packages/environments on cluster nodes
• Solves the remote packaging problem
• Empower data scientists without sacrificing control of your cluster
Defining a Cloud-Based Cluster

Provider:
aws_east:
  cloud_provider: ec2
  keyname: anaconda-cluster
  location: us-east-1
  private_key: ~/.ssh/anaconda-cluster.pem
  secret_id: **********
  secret_key: **********

Profile:
name: spark-cluster
node_id: ami-d05e75b8
node_type: m3.xlarge
num_nodes: 4
provider: aws_east
user: ubuntu
Defining a Bare-Metal Cluster

Provider:
bare_metal:
  cloud_provider: none
  private_key: ~/.ssh/my-private-key

Profile:
name: spark-cluster
provider: bare_metal
num_nodes: 4
machines:
  head:
    - 192.168.1.1
  compute:
    - 192.168.1.2
    - 192.168.1.3
    - 192.168.1.4
Creating and Using a Cluster
• Define provider: ~/.acluster/providers.yaml
• Define profile: ~/.acluster/profiles.d/profile.yaml
• Create cluster: acluster create cluster_name -p profile
• Install plugins: acluster install spark-yarn notebook
• Remote conda: acluster conda install numpy scipy
Remote Conda Commands
• Install packages: acluster conda install numpy scipy pandas numba
• Create environment: acluster conda create -n py34 python=3.4 numpy scipy pandas
• List packages: acluster conda list
• Conda information: acluster conda info
• Push environment: acluster conda push environment.yml
• Set default environment: acluster conda setenv py34
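The push command above ships a standard conda environment file to every node. A minimal example of such a file (the environment name matches the py34 environment used in these commands; the package choices are illustrative):

```yaml
name: py34
dependencies:
  - python=3.4
  - numpy
  - scipy
  - pandas
```

Pushing one file that describes the whole environment is what keeps the local machine and the cluster in sync for the "write once, deploy anywhere" workflow.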
Cluster Management Commands
• Create cluster: acluster create spark-cluster -p spark-profile
• List active clusters: acluster list
• Install plugins: acluster install spark-yarn notebook
• SSH to nodes: acluster ssh
• Put/get files: acluster put data.hdf5 /home/ubuntu/data.hdf5
• Run command: acluster cmd 'apt-get install ...'
• Submit script: acluster submit spark_script.py
Demo Overview
This demo shows a simple PySpark job that uses the NLTK library, a popular Python package for processing human language data.
This demo will show the installation of Python packages on the cluster, the use of Spark and the YARN resource manager, and remote execution of the Spark job on the cluster.
Application: Jupyter/IPython Notebook
Analytics: Spark, NLTK
Data: Local files on each node
Server: Bare-metal or cloud-based cluster
Demo Step-by-Step
• Create cluster
– 4 nodes, m3.large, 2 vCPUs, 7.5 GB RAM
• Install Spark, YARN, and Notebook plugins
• Remotely install conda packages
• Parallel download of data onto cluster nodes
• Use Spark and NLTK to tokenize words and tag parts of speech
– Remotely submitting the script to the Spark cluster
– Interactively in a notebook on the Spark cluster
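The tokenize-and-tag step above can be sketched as a small PySpark job. This is a hedged sketch, not the webinar's actual script: the data path is hypothetical, and `simple_tokenize` is a stand-in added only to show the shape of the map step. NLTK is imported inside the partition function so the import resolves on the worker nodes, where `acluster conda install nltk` placed it.

```python
import re


def simple_tokenize(text):
    """Cheap regex stand-in tokenizer, shown only to illustrate the
    per-line map step; the demo itself uses nltk.word_tokenize."""
    return re.findall(r"[A-Za-z']+", text.lower())


def tag_partition(lines):
    """Tokenize each line and tag parts of speech with NLTK.
    Importing inside the function defers the import to the workers."""
    import nltk
    for line in lines:
        yield nltk.pos_tag(nltk.word_tokenize(line))


def run_job(sc, path="/home/ubuntu/data.txt"):  # hypothetical path
    """Run on a SparkContext; submit via `acluster submit` or a notebook."""
    rdd = sc.textFile(path)
    return rdd.mapPartitions(tag_partition).take(10)
```

On the cluster, calling `run_job` with a live SparkContext runs the tagging in parallel across the worker nodes, one NLTK call per partition rather than per line.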
Demo Overview
To demonstrate the capability of running a distributed job in PySpark using GPUs, this demo uses Numba and the CUDA platform to perform image processing.
This demo executes two-dimensional FFT convolution on images in grayscale and compares the execution time of CPU-based and GPU-based calculations.
Application: Jupyter/IPython Notebook
Visualization: matplotlib
Analytics: Spark, Numba, SciPy, PIL
Data: HDFS
Server: GPU-enabled bare-metal or cloud-based cluster
Demo Step-by-Step
• Create cluster
– 4 nodes, g2.2xlarge, 8 vCPUs, 15 GB RAM, 1 GPU
• Install Spark, YARN, HDFS, and Notebook plugins
• Bootstrap CUDA drivers on all nodes
• Remotely install conda packages
• High-performance parallel download of data into HDFS
• Use Spark, Numba, and GPU to perform FFT convolution on images
– Interactively in a notebook on the Spark cluster
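The CPU side of the comparison can be sketched in NumPy. This is a hedged illustration of 2-D FFT convolution rather than the webinar's notebook; the GPU variant would replace the NumPy FFT calls with CUDA FFT kernels driven through Numba, leaving the surrounding logic the same.

```python
import numpy as np


def fft_convolve2d(image, kernel):
    """Full 2-D convolution in the frequency domain: pad both inputs
    to the full output size, multiply their FFTs, and transform back.
    Costs O(N log N) versus O(N^2) for direct convolution."""
    s0 = image.shape[0] + kernel.shape[0] - 1
    s1 = image.shape[1] + kernel.shape[1] - 1
    F = np.fft.rfft2(image, (s0, s1))
    G = np.fft.rfft2(kernel, (s0, s1))
    return np.fft.irfft2(F * G, (s0, s1))


# Convolving a 2x2 box of ones with itself yields the
# [[1, 2, 1], [2, 4, 2], [1, 2, 1]] overlap pattern.
out = fft_convolve2d(np.ones((2, 2)), np.ones((2, 2)))
```

In the demo, each Spark task would apply a function like this to one image per record, which is what makes the per-image CPU-versus-GPU timing comparison straightforward to distribute.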
Demo Overview
In this demo, we’ll interactively query, explore, and visualize a data set of approximately 1.8 billion comments (~1 TB).
Application: Jupyter/IPython Notebook
Visualization: Bokeh
Analytics: Blaze/pandas, Hive, Impala
Data: HDFS
Server: Bare-metal or cloud-based cluster
Demo Step-by-Step
• Create cluster
– 8 nodes, m3.2xlarge, 8 vCPUs, 30 GB RAM, 1 TB storage
• Install HDFS, Hive, Impala, and Notebook plugins
• High-performance parallel download of data into HDFS
• Move, convert, and load data into distributed SQL databases
• Run interactive queries from notebook using Blaze
• Interactively plot and explore results using Bokeh
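The interactive-query step can be illustrated locally. In the demo, Blaze expressions (something like `by(data.author, n=data.id.count())`) compile to SQL that Impala executes over the full table; the sketch below issues an equivalent aggregation against an in-memory SQLite table so it runs anywhere. The table and column names are made up for illustration, not taken from the demo's dataset.

```python
import sqlite3


def top_authors(conn, limit=3):
    """Count comments per author -- the kind of group-by aggregation
    Blaze would express symbolically and push down to Impala."""
    cur = conn.execute(
        "SELECT author, COUNT(*) AS n FROM comments "
        "GROUP BY author ORDER BY n DESC LIMIT ?", (limit,))
    return cur.fetchall()


# Tiny stand-in for the ~1.8-billion-row comments table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE comments (id INTEGER, author TEXT)")
conn.executemany("INSERT INTO comments VALUES (?, ?)",
                 [(1, "alice"), (2, "alice"), (3, "bob")])
rows = top_authors(conn)  # [('alice', 2), ('bob', 1)]
```

The point of Blaze in the demo is that the same expression runs unchanged whether the backend is pandas, Hive, or Impala; only the data URI changes.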
Moving and Loading 1 TB of Data

Source: Amazon S3 (JSON)

Destination       | Time Moving Data | Time Querying Data
HDFS (JSON)       | 2 hours          | -
Hive (JSON)       | < 1 minute       | 30 minutes
Hive (Parquet)    | 1 hour           | 5 minutes
Impala (Parquet)  | < 1 minute       | 5 seconds
Data Science Use Case
(Diagram: client machine, head node, and compute nodes)
1. Analyst creates a sandbox Spark/Hadoop cluster and installs plugins
2. Analyst deploys packages, environments, and data to cluster nodes
3. Analyst submits jobs to Spark/Hadoop cluster
Enterprise Use Case
(Diagram: analyst machine, admin, head node, and compute nodes)
1. Analyst ships packages, environments, and data to on-premises repository
2. Admin deploys packages, environments, and data to cluster nodes
3. Analyst submits jobs to Spark/Hadoop cluster
Anaconda Cluster Plugins
Conda, Spark, YARN, HDFS, Hive, Impala, Storm, Zookeeper, Elasticsearch, Logstash, Kibana, Ganglia, Jupyter/IPython Notebook, IPython Parallel, Dask
System Architecture Diagram
(Diagram of services across the head node, secondary head node, edge nodes, and compute nodes, including: Anaconda Cluster Head (ACH), Anaconda Cluster Compute (ACC), Anaconda Server, Zookeeper Server, Hadoop Manager, Impala StateStore, Impala Daemon, Impala Catalog Server, History Server (Spark), Spark Gateway, Resource Manager (YARN), JobHistory Server, Hue, NameNode (HDFS), Secondary NameNode, DataNode, HttpFS, Hive Metastore, Gateway, WebHCat Server, HiveServer2, YARN Gateway, NodeManager, and other services.)
Network Architecture Diagram
(Diagram: the client machine connects to the head node over port 22 (SSH); the head node communicates with the compute nodes over ports 4505 and 4506 (Salt); the Anaconda Server is reached over ports 8080 (HTTP) and 8443 (HTTPS).)
Test-Drive Anaconda on a Cluster
1. Register for an Anaconda Cloud account at Anaconda.org
2. Download Anaconda Cluster using conda:
   $ conda install anaconda-client
   $ anaconda login
   $ conda install anaconda-cluster -c anaconda-cluster
3. Create a sandbox/demo cluster
Anaconda Subscriptions

Anaconda (free forever, DOWNLOAD): Open source modern analytics platform powered by Python; community support.
Anaconda Pro (starting at $10,000 per year + $1,000 per year for additional users, CONTACT US): Anaconda with support & indemnification; Priority 1 support.
Anaconda Workgroup (starting at $30,000 per year + $3,000 per year for additional users, CONTACT US): Anaconda with high performance and team collaboration; Priority 1 support.
Anaconda Enterprise (starting at $60,000 per year + $6,000 per year for additional users, CONTACT US): Anaconda with scalable high performance and team collaboration; Priority 1 support with dedicated customer support rep.
Contact Information and Additional Details
• Contact [email protected] for more information about Anaconda subscriptions, consulting, or training
• View documentation and examples at docs.continuum.io/anaconda-cluster
• View demo notebooks on Anaconda Cloud at notebooks.anaconda.org/anaconda-cluster