Open Data Science with R and Anaconda
OPEN DATA SCIENCE WITH R
Make Life Easier & More Powerful with Anaconda
Christine Doig, Senior Data Scientist
Christine Doig is a Senior Data Scientist at Continuum Analytics, where she worked on MEMEX, a DARPA-funded project helping stop human trafficking. She has 5+ years of experience in analytics, operations research, and machine learning in a variety of industries, including energy, manufacturing, and banking. Christine holds an M.S. in Industrial Engineering from the Polytechnic University of Catalonia in Barcelona. She is an open source advocate and has spoken at many conferences, including PyData, EuroPython, SciPy and PyCon.
About me
Christine Doig, Senior Data Scientist
Continuum Analytics
Agenda - Open Data Science with R
• Introduction to Open Data Science
• Introduction to Anaconda, the leading Open Data Science platform
• Package and environment management for R
  – conda, R-Essentials and MRO
• Data Science Collaboration in R
  – Jupyter notebooks for R and Anaconda Enterprise Notebooks
• Scaling R
  – Anaconda for cluster management and SparkR
INTRODUCTION TO OPEN DATA SCIENCE
Data Science is …
“An interdisciplinary field about processes and systems to extract knowledge or insights from data in various forms” (Wikipedia)
© 2015 Continuum Analytics - Confidential & Proprietary
Open Data Science is …
an inclusive movement that makes the open source tools of data science (data, analytics, and computation) work together easily as a connected ecosystem
Open Source ecosystems for Data Science
• Python: NumPy, SciPy, Pandas, Scikit-learn, Jupyter/IPython
• R: dplyr, shiny, tidyr, ggplot
• Spark
INTRODUCTION TO ANACONDA
Anaconda is the leading Open Data Science platform, powered by Python, the fastest growing Open Data Science language.
• Accelerate Time-to-Value
• Connect Data, Analytics, & Compute
• Empower Data Science Teams
Why Anaconda?
• Easy to install on all platforms
• Trusted by industry leaders, e.g. Microsoft Azure ML
• Large user base: 3M+ downloads
• BSD license
• Extensible: easily build, share and install proprietary libraries with Anaconda Cloud
• Language agnostic: Python, R, Scala…
• Allows isolated custom sandboxes with different versions of packages
Anaconda Glossary
[Diagram: Anaconda = Python + conda + NumPy, SciPy, Pandas, Scikit-learn, Jupyter/IPython, Numba, Matplotlib, Spyder, Numexpr, Cython, Theano, Scikit-image, NLTK, NetworkX and 150+ packages; Miniconda = Python + conda]
• Anaconda distribution: Python distribution that includes 150+ packages for data science
• conda: Cross-platform and language agnostic package and environment manager
• Miniconda: Lightweight version of Anaconda, with just Python and conda.
• Anaconda Cloud: Cloud service to host and share public and private packages, environments and notebooks
• conda environments: custom isolated sandboxes to easily reproduce and share data science projects
PACKAGE AND ENVIRONMENT MANAGEMENT FOR R
From http://www.slideshare.net/RevolutionAnalytics/r-at-microsoft
An R Reproducibility Problem
Reproducibility
• Programming language (R, Python, Scala…)
• Packages (OSS libraries or internally developed)
• Data or access to data
• Configuration of services: DBs, keys…
• Your analysis: script, notebook
Reproducibility solutions
• Bare metal
• Virtual machines
• Docker containers
• Conda environments
[Diagram: on your laptop, server or EC2 instance, conda environments Env 1, Env 2 and Env 3 each hold one analysis or application (Analysis 1, Analysis 2, Analysis 3)]
Conda Environments
• Programming language (R, Python, Scala…)
• Packages (OSS libraries or internally developed)
• Data or access to data
• Configuration of services: DBs, keys…
• Your analysis: script, notebook
Conda Environments
A lightweight, isolated sandbox that manages your dependencies and makes your project reproducible.
environment.yml
$ conda env create
$ source activate ENV_NAME
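To make the "isolated sandbox" idea concrete, here is a minimal Python sketch that reports which Python installation and which conda environment a script is running in. It assumes that conda sets the `CONDA_DEFAULT_ENV` and `CONDA_PREFIX` variables on activation (older conda releases may use different variable names); the function name `describe_environment` is made up for this illustration.

```python
import os
import sys


def describe_environment(environ=None):
    """Summarize which Python installation and conda environment are in use."""
    environ = os.environ if environ is None else environ
    return {
        # Directory of the Python installation running this script.
        "python_prefix": sys.prefix,
        # Assumed to be set by conda on activation; None outside an environment.
        "conda_env_name": environ.get("CONDA_DEFAULT_ENV"),
        "conda_env_path": environ.get("CONDA_PREFIX"),
    }


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

Running the script before and after `source activate ENV_NAME` should show a different `python_prefix` for each environment, since every environment carries its own interpreter and packages.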
Anaconda Cloud: where packages, notebooks, and environments are shared. Powerful collaboration and package management for open source and private projects.
Public projects and notebooks are always free. Register today: anaconda.org
Anaconda for R
https://www.continuum.io/blog/developer/jupyter-and-conda-r
• R-Essentials: a conda metapackage with 80+ R packages for data science
$ conda config --add channels r
$ conda install r-essentials
• MRO: Microsoft R Open distribution with MKL
$ conda config --add channels mro
$ conda install r
Conda
• Package and environment manager
• Language agnostic (Python, R, Java…)
• Cross-platform (Windows, OS X, Linux)
$ conda install python=2.7
$ conda install pandas
$ conda install -c r r
$ conda install mongodb
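The `conda install` lines mix bare package names (`pandas`) with pinned versions (`python=2.7`). As a rough sketch of how such a string breaks down, the Python snippet below splits it into a name and an optional version. It only handles the two forms shown on the slide, not conda's full specification syntax, and `PackageSpec` and `parse_spec` are hypothetical names, not part of conda.

```python
from typing import NamedTuple, Optional


class PackageSpec(NamedTuple):
    name: str
    version: Optional[str]  # None means "no version requested"


def parse_spec(spec: str) -> PackageSpec:
    """Split a 'name' or 'name=version' string into its parts."""
    name, _, version = spec.strip().partition("=")
    if not name:
        raise ValueError(f"missing package name in {spec!r}")
    return PackageSpec(name.lower(), version or None)


if __name__ == "__main__":
    for text in ["python=2.7", "pandas", "mongodb"]:
        print(parse_spec(text))
```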
Conda environments flow example
environment.yml
name: myenv
channels:
  - chdoig
  - r
  - foo
dependencies:
  - python=2.7
  - r
  - r-ldavis
  - pandas
  - mongodb
  - spark=1.5
  - pip
  - pip:
    - flask-migrate
    - bar==1.4
Create and activate
$ conda env create
$ source activate myenv
Freeze versions
$ conda env export -n myenv > freeze.yml
Upload to anaconda.org
$ conda server upload my_foo_env.yml
$ conda env create chdoig/my_foo_env.yml
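The layout of `environment.yml` is easier to see when it is generated. The following Python sketch renders a file of that shape using only the standard library. The function `render_environment` is invented for this example, and it covers only the keys shown on the slide (`name`, `channels`, `dependencies` and a nested `pip` list).

```python
def render_environment(name, channels, dependencies, pip_dependencies=()):
    """Return environment.yml text for the given name, channels and packages."""
    lines = [f"name: {name}", "channels:"]
    lines += [f"  - {channel}" for channel in channels]
    lines.append("dependencies:")
    lines += [f"  - {dep}" for dep in dependencies]
    if pip_dependencies:
        # pip packages sit in a nested list under a "pip:" entry; pip itself
        # is listed as a conda dependency so that it gets installed.
        if "pip" not in dependencies:
            lines.append("  - pip")
        lines.append("  - pip:")
        lines += [f"    - {dep}" for dep in pip_dependencies]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_environment(
        name="myenv",
        channels=["r"],
        dependencies=["python=2.7", "r", "pandas"],
        pip_dependencies=["flask-migrate"],
    ))
```

Saving the printed text as `environment.yml` gives a file that `conda env create` can read from the current directory.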
FAQ
• R-Essentials has too many / too few / not the packages I want. How can I create my own “R-Essentials”?
$ conda metapackage custom-r-bundle 0.1.0 --dependencies r-irkernel jupyter r-ggplot2 r-dplyr --summary "My custom R bundle"
• I need an R package that is not in R-Essentials or the R channel but is available through CRAN. How do I get it?
$ conda skeleton cran ldavis
$ conda build r-ldavis/
$ conda server upload r-ldavis
$ conda install -c chdoig r-ldavis
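In the commands above, the CRAN package `ldavis` becomes the conda package `r-ldavis`: R packages in conda carry an `r-` prefix and a lowercase name. A small Python sketch of that naming rule follows, together with a helper that lists the four FAQ commands for any CRAN package. Both function names are hypothetical, and the command strings simply mirror the slide rather than a verified workflow.

```python
def conda_name_for_cran(cran_package: str) -> str:
    """Map a CRAN package name to the 'r-' prefixed, lowercase conda name."""
    return "r-" + cran_package.strip().lower()


def packaging_commands(cran_package: str, channel: str) -> list:
    """List the FAQ commands for one CRAN package, in the order shown."""
    conda_name = conda_name_for_cran(cran_package)
    return [
        f"conda skeleton cran {cran_package.strip().lower()}",
        f"conda build {conda_name}/",
        f"conda server upload {conda_name}",
        f"conda install -c {channel} {conda_name}",
    ]


if __name__ == "__main__":
    for command in packaging_commands("LDAvis", channel="chdoig"):
        print("$", command)
```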
Anaconda: Navigator
A desktop graphical user interface included in Anaconda
• Launch applications and easily manage conda packages, environments and channels.
• No need to use the command line.
• Available for Windows, OS X and Linux.
• Anaconda Navigator has replaced Launcher.
• Integration with Anaconda Cloud.
Anaconda Repository
• Centralized internal repository to share packages, environments and notebooks
• Control user or team access to packages, environments and notebooks
• Blacklist packages in your organization (e.g. GPL licenses)
• Internal mirror of Anaconda
• Build and easily share internally developed software
DATA SCIENCE COLLABORATION WITH R
Data Science Development Environments
• IDEs: PyCharm, Spyder, RStudio, Eclipse
• Text editors: Sublime, vim, emacs…
Jupyter
The Jupyter Notebook is a web application that allows you to create and share documents that contain live code, equations, visualizations and explanatory text.
http://jupyter.org/
https://try.jupyter.org/
Jupyter
[Diagram: IPython, IPython notebook, Jupyter, nbviewer, tmpnb, binder]
https://try.jupyter.org/
http://mybinder.org/
Jupyter: IRkernel
https://www.continuum.io/blog/developer/jupyter-and-conda-r
$ conda config --add channels r
$ conda install r-essentials
$ jupyter notebook
It is trivial to get started writing R notebooks the same way you write Python ones.
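An R notebook is an ordinary `.ipynb` file whose metadata points at the R kernel. The Python sketch below builds a minimal notebook of that kind as a plain dictionary and prints it as JSON. It assumes the nbformat 4 layout and the kernel name `ir`, which is the name IRkernel registers by default; the function `new_r_notebook` is made up for the example.

```python
import json


def new_r_notebook(code_cells):
    """Build a minimal nbformat-4 notebook dict that targets the R kernel."""
    cells = [
        {
            "cell_type": "code",
            "execution_count": None,
            "metadata": {},
            "outputs": [],
            "source": source,
        }
        for source in code_cells
    ]
    return {
        "cells": cells,
        "metadata": {
            # Assumption: "ir" is the kernel name IRkernel installs by default.
            "kernelspec": {"name": "ir", "display_name": "R", "language": "R"}
        },
        "nbformat": 4,
        "nbformat_minor": 0,
    }


if __name__ == "__main__":
    notebook = new_r_notebook(["summary(cars)"])
    print(json.dumps(notebook, indent=1))
```

Saving the printed JSON to a file ending in `.ipynb` should give a notebook that opens with the R kernel once `r-essentials` is installed.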
Jupyter
To start Jupyter notebooks, simply run the following command:
$ jupyter notebook
http://nbviewer.ipython.org/github/chdoig/conda-jupyter-irkernel/blob/master/Jupyter%20and%20conda%20for%20R.ipynb
$ jupyter nbconvert my_r_notebook.ipynb --to slides --post serve
Jupyter
DEMO 1: ENVIRONMENTS & REPOSITORY
Moving your team to collaborate with Anaconda Enterprise Notebooks
[Diagram: several data scientists sharing interactive notebooks, models, and data apps & visualizations]
Anaconda Enterprise Notebooks
• Collaborate with your team on the same project
• Enterprise notebook extensions: diff, collaborative locking
• Manage collaborators and access to projects
• Search and tag notebooks
DEMO 2: NOTEBOOKS AND AEN
SCALING R
Scalability
Data Scientists want:
• Easy cluster setup and provisioning -> Anaconda for cluster management
• Distributed framework to scale analysis -> SparkR
Anaconda for cluster management
• Dynamically manage conda environments across a cluster
• Works with enterprise Hadoop distributions and HPC clusters
• Integrates with on-premises Anaconda repository
• Cluster management features are available with Anaconda subscriptions
[Diagram: a client machine, a head node and three compute nodes]
Anaconda for cluster management
Before Anaconda for cluster management
• Head node: 1. Manually install Python, packages & dependencies. 2. Manually install R, packages & dependencies.
• Compute nodes: 1. Manually install Python, packages & dependencies. 2. Manually install R, packages & dependencies.
After Anaconda for cluster management
• Head node and compute nodes: easily install conda environments and packages (including Python and R) across cluster nodes
• Empower IT with scalable and supported Anaconda deployments
• Fast, secure and scalable Python & R package management on tens or thousands of nodes
• Backed by an enterprise configuration management system
• Scalable Anaconda deployments tested in enterprise Hadoop and HPC environments
SparkR
• Spark: a distributed framework for large-scale processing
• Provides an R interface through SparkR
DEMO 3: ANACONDA FOR CLUSTER MANAGEMENT AND SPARKR
https://www.continuum.io/anaconda-subscriptions
Enterprise Product Solutions
• Need a centralized repository to publish and share notebooks, environments and packages (OSS and private)? Get Anaconda Repository! (Available in Anaconda Workgroups and Enterprise)
• Need a centralized server to help your data science team interactively collaborate on projects? Get Anaconda Enterprise Notebooks! (Available in Anaconda Enterprise)
• Need a “data scientist friendly” cluster manager? Get Anaconda for cluster management! (Available in Anaconda Workgroups and Enterprise)
Contact Information and Additional Details
• Download Anaconda: https://www.continuum.io/downloads
• Sign up for Anaconda Cloud: https://anaconda.org
• Contact [email protected] for more information about Anaconda subscriptions, consulting, or training