Cloud Platform Data Science on Google September 2017 ...lvs2124/assets/Data_Science_GCP.pdf · Data...

Data Science on Google Cloud Platform
Lucas Schuermann (lvs.io)
Columbia University Data Science Society Hackathon
September 2017

http://lvs.io

Outline

● Introduction
● Definitions
● Creating a new project
● Provisioning a VM
● Interacting with a VM
● Easy environment setup with Docker
● Interacting with Jupyter
● Advanced features

Introduction

By the end of this tutorial, you will have set up a Jupyter notebook environment on a virtual machine on Google Cloud Platform. You will understand how to configure new virtual machines and other environments for data analysis.

Please be sure to read all instructions carefully and thoroughly; you will save tremendous amounts of time by taking care to follow the correct steps the first time!

Definitions

What is Google Cloud Platform?

● One of many cloud computing services (AWS, DigitalOcean, Azure, etc.)
● Offers comprehensive services for data storage, networking, server provisioning, and more
● Cloud computing services are used by startups and big companies alike for managing:
○ Quickly deployable/replicable instances (our use case)
○ Huge amounts of storage (Snapchat--AWS)
○ High network traffic (Pokemon Go--Google Cloud)
○ Advanced capabilities (computing clusters, always-online redundant systems)

What is a common data science stack?

● Python: a well supported and common programming language
● Conda: a package management system with useful bundles
● Jupyter: a simple GUI for exploration and evaluation of data
● Matplotlib: visualization
● Pandas: fast data aggregation (tables, database/file interfaces, etc.)
● Scipy: useful fast math libraries
● Scikit-learn: the kitchen sink of ML algorithms

Getting credit

During the hackathon, you’ll receive a code. Redeem it at https://console.cloud.google.com/billing/redeem

If you’re following along during the tutorial, use the free trial, if available. Follow the instructions at https://cloud.google.com/free/.

A new project with a “billing account” should be automatically created following either of the paths above.

Provisioning a VM

Dashboard -> Lefthand Bar -> Compute Engine -> VM Instances

VM Instances -> Create

Create Instance -> Enter Name

Create Instance -> Select Machine Type

Create Instance -> Boot Disk

Create Instance -> Allow HTTP/HTTPS

Create Instance -> Advanced (Management, Disks, SSH Keys)

Generate SSH Key

Create Instance -> Advanced (Management, Disks, SSH Keys) -> Paste

Create Instance [Finish] -> VM Instances Dashboard -> Copy IP

Dashboard -> Lefthand Bar -> VPC Network -> External IP addresses -> Select “Static”

VPC Network -> Firewall rules -> Select “default-allow-http” -> Edit

VPC Network -> Firewall rules -> Select “default-allow-http” -> Edit [Protocols] -> Save
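The console steps above can also be scripted. The following is a minimal sketch using the gcloud command-line tool, which the slides do not cover. The instance name, zone, machine type, image family, and the list of opened ports (8888 for Jupyter, 8787 for RStudio) are assumptions chosen for illustration, not values taken from the slides. The functions are only defined here; nothing runs until you call them.

```shell
# Placeholder values (assumptions, adjust to your project).
INSTANCE="cdss-notebook"
ZONE="us-central1-a"
MACHINE_TYPE="n1-standard-1"
IMAGE_FAMILY="ubuntu-2204-lts"   # any current Ubuntu LTS image family
IMAGE_PROJECT="ubuntu-os-cloud"

# Override with GCLOUD=echo to print the arguments instead of running them.
GCLOUD=${GCLOUD:-gcloud}

# Create a VM tagged for HTTP/HTTPS traffic, which mirrors the
# "Allow HTTP/HTTPS" checkboxes in the Create Instance form.
create_vm() {
  $GCLOUD compute instances create "$INSTANCE" \
    --zone="$ZONE" \
    --machine-type="$MACHINE_TYPE" \
    --image-family="$IMAGE_FAMILY" \
    --image-project="$IMAGE_PROJECT" \
    --tags=http-server,https-server
}

# Replace the allowed protocol list of the default HTTP rule so the
# notebook ports are reachable. The port list is an assumption.
open_notebook_ports() {
  $GCLOUD compute firewall-rules update default-allow-http \
    --allow=tcp:80,tcp:8888,tcp:8787
}
```

To use it, call create_vm and then open_notebook_ports from a shell where gcloud is installed and authenticated against your project.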

SSH Into New VM!
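For readers connecting from a terminal, the key generation and SSH steps might look like the sketch below. The key path and username are hypothetical placeholders. It assumes the OpenSSH tools (ssh-keygen, ssh) are installed locally, and that the username in the key comment is the account name to log in with.

```shell
# Hypothetical placeholders, not values from the slides.
KEY_FILE="$HOME/.ssh/gcp_cdss"
VM_USER="your_username"

# Override with SSH=echo to print the arguments instead of connecting.
SSH=${SSH:-ssh}

# Generate a key pair; the comment (-C) carries the username.
# The contents of "$KEY_FILE.pub" are what gets pasted into the SSH Keys form.
generate_key() {
  ssh-keygen -t rsa -b 4096 -f "$KEY_FILE" -C "$VM_USER"
}

# Connect to the VM; $1 is the external IP copied from the VM Instances page.
connect_vm() {
  $SSH -i "$KEY_FILE" "$VM_USER@$1"
}

# Usage:
#   generate_key && cat "$KEY_FILE.pub"
#   connect_vm 104.197.127.17
```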

Interacting with a VM

Lefthand Menu -> Compute Engine -> VM Instances -> Click [Name]

SSH + Bash

Package management

Install and manage Debian packages on our Ubuntu virtual machine with these useful commands:

$ sudo apt-get update
$ sudo apt-get install <package>
$ sudo apt-get remove <package>

A favorite package of mine is an interactive performance monitor. Let’s practice installation:

$ sudo apt-get update
$ sudo apt-get install -y htop
$ htop

HTOP monitoring (optional)

Easy Env. Setup with Docker

What is Docker?

● A way to deploy uniform software environments
● Commonly used to provision/deploy new servers on cloud platforms
● Consists of an engine which runs containers specified by images/Dockerfiles

We will use an open source Dockerfile which standardizes a common powerful data science environment.

Installing Docker on the VM

$ curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -

$ sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"

$ sudo apt-get update
$ sudo apt-get install -y docker-ce
$ sudo systemctl status docker
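Beyond checking the service status, one way to confirm the installation is to run a throwaway container. The sketch below is an illustration rather than part of the original tutorial: the helper names are hypothetical, and it assumes the VM can pull Docker's public hello-world test image.

```shell
# Return 0 if the named command is on PATH, otherwise print a message and return 1.
require_cmd() {
  if command -v "$1" >/dev/null 2>&1; then
    return 0
  fi
  echo "missing command: $1" >&2
  return 1
}

# Print the Docker version, then run a test container that is removed on exit.
verify_docker() {
  require_cmd docker || return 1
  sudo docker --version && sudo docker run --rm hello-world
}

# Usage (on the VM):
#   verify_docker
```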

Successful Docker installation

Jupyter Data Science Stack

The Jupyter Data Science Stack is one of the easiest ways to quickly deploy a comprehensive data science environment on a fresh VM. It is distributed as a Docker image, built from an open source Dockerfile.

$ sudo docker run -it --rm -p 8888:8888 jupyter/datascience-notebook start-notebook.sh --ip=0.0.0.0 --port=8888 --no-browser

https://github.com/jupyter/docker-stacks/tree/master/datascience-notebook

Downloading new Docker image

Jupyter stack running -> Quit with [Ctrl+C]

Interacting with Jupyter

Connecting with a browser

Starting the Jupyter Docker image as before will give you a URL similar to: http://0.0.0.0:8888/?token=4396756268c013a5cf88277aa75fdaa5bc64250557c1d628

To connect from a computer other than the VM (browsing on the VM itself would require X-Windows forwarding), replace 0.0.0.0 (or localhost) in that URL with the external IP of your VM, which is the same address we used for SSH:
http://104.197.127.17:8888/?token=4396756268c013a5cf88277aa75fdaa5bc64250557c1d628
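The host substitution described above can be done by hand, or with a small helper like the following sketch. The function name jupyter_url is hypothetical, and it assumes the printed URL has the form scheme://host:port/path.

```shell
# Rewrite the host part of a Jupyter URL.
#   $1: URL printed by Jupyter, e.g. http://0.0.0.0:8888/?token=...
#   $2: external IP of the VM
jupyter_url() {
  url=$1
  ip=$2
  scheme=${url%%://*}       # text before "://", e.g. "http"
  rest=${url#*://}          # text after "://", e.g. "0.0.0.0:8888/?token=..."
  port_and_path=${rest#*:}  # text after the host, e.g. "8888/?token=..."
  printf '%s://%s:%s\n' "$scheme" "$ip" "$port_and_path"
}

# Usage:
#   jupyter_url 'http://0.0.0.0:8888/?token=abc123' 104.197.127.17
#   prints: http://104.197.127.17:8888/?token=abc123
```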

Jupyter landing page

Jupyter -> New -> Python 3

Enter code in cell -> [Shift+Enter] -> Hello World!

Jupyter stack status

Recommended usage

If you wish to use this Jupyter notebook system for most of your hacking, please use the following command to preserve the state of the notebooks on the local filesystem. When prompted, the password is cdss2017. You won’t need to use a token. All files to be saved must be in the work directory.

$ cd ~ && mkdir -p work && sudo chown 1000 work && sudo docker run -d --rm -p 8888:8888 -v ~/work:/home/jovyan/work --user root -e NB_UID=1000 jupyter/datascience-notebook start-notebook.sh --ip=0.0.0.0 --port=8888 --no-browser --NotebookApp.password='sha1:e072c2ec444e:c0302545ca6a0be2723291f2a1f83dfa86f5a1c5'

http://<external ip>:8888/
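The --NotebookApp.password value in the command above has the form sha1:<salt>:<digest>. If you would rather not use the shared hackathon password, the sketch below shows one way such a value could be produced. It assumes the classic Jupyter Notebook scheme, in which the digest is the SHA-1 of the passphrase followed by the salt, and it assumes the sha1sum utility from GNU coreutils is available. The function names are hypothetical, and Jupyter's own password helper remains the safer route.

```shell
# Produce 12 random hex characters to use as a salt.
random_salt() {
  od -An -N6 -tx1 /dev/urandom | tr -d ' \n'
}

# Build a "sha1:<salt>:<digest>" string.
#   $1: passphrase, $2: salt
# Assumption: digest = SHA-1 of the passphrase immediately followed by the salt.
notebook_sha1_hash() {
  digest=$(printf '%s%s' "$1" "$2" | sha1sum | cut -d' ' -f1)
  printf 'sha1:%s:%s\n' "$2" "$digest"
}

# Usage:
#   notebook_sha1_hash 'my-passphrase' "$(random_salt)"
```

The resulting string would replace the quoted value after --NotebookApp.password= in the docker run command.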

Helpful commands for managing Docker

If you’ve deployed a docker container in the background (as a daemon with the -d flag), you can see running containers with the command:

● sudo docker container list

A running container can be stopped as follows. Because the containers in this tutorial are started with --rm, stopping one removes the container along with its internal files/state, but files in the work directory are preserved on the VM’s local filesystem and can be recovered after redeployment.

● sudo docker container stop <container id>
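To avoid copying container IDs by hand, the two commands above can be combined. The helper names in this sketch are hypothetical; it relies on docker ps with the -q flag (print IDs only) and the ancestor filter (match containers started from a given image).

```shell
# Override with DOCKER=echo to print the arguments instead of running them.
DOCKER=${DOCKER:-sudo docker}

# Print the IDs of running containers started from image $1.
running_ids() {
  $DOCKER ps -q --filter "ancestor=$1"
}

# Stop every running container started from image $1.
stop_image_containers() {
  ids=$(running_ids "$1")
  if [ -z "$ids" ]; then
    echo "no running containers from image $1"
    return 0
  fi
  # $ids is intentionally unquoted so each ID becomes its own argument.
  $DOCKER container stop $ids
}

# Usage:
#   stop_image_containers jupyter/datascience-notebook
```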

To learn more about working with Docker, see the command-line documentation:

https://docs.docker.com/engine/reference/commandline/cli/

Using RStudio

Run RStudio Dockerfile

$ cd ~ && mkdir -p R && sudo chown 1000 R && sudo docker run --rm -d -p 8787:8787 -v ~/R:/home/cdss2017 --user root -e USER=cdss2017 -e PASSWORD=cdss2017 -e ROOT=TRUE -e USERID=1000 rocker/rstudio

State is preserved via the local VM filesystem if the container is closed. Further, this container launches and runs persistently in the background. Documentation here:

https://github.com/rocker-org/rocker/wiki/Using-the-RStudio-image

Access RStudio

RStudio will launch in the background and be available at: http://<external ip>:8787 (substitute <external ip> with your instance’s External IP, e.g. 104.197.127.17)

● Username: cdss2017
● Password: cdss2017

Advanced reading

● Creating a computing cluster with Apache Spark to back Jupyter notebooks (link)

https://cloud.google.com/blog/big-data/2017/02/google-cloud-platform-for-data-scientists-using-jupyter-notebooks-with-apache-spark-on-google-cloud

Thank you! Questions?
