Data Science on Google Cloud Platform Lucas Schuermann (lvs.io) Columbia University Data Science Society Hackathon September 2017

Outline

● Introduction
● Definitions
● Creating a new project
● Provisioning a VM
● Interacting with a VM
● Easy environment setup with Docker
● Interacting with Jupyter
● Advanced features

Introduction

By the end of this tutorial, you will have set up a Jupyter notebook environment on a virtual machine on Google Cloud Platform. You will understand how to configure new virtual machines and other environments for data analysis.

Please be sure to read all instructions carefully and thoroughly; you will save tremendous amounts of time by taking care to follow the correct steps the first time!

Definitions

What is Google Cloud Platform?

● One of many cloud computing services (AWS, DigitalOcean, Azure, etc.)
● Offers comprehensive services for data storage, networking, server provisioning, and more
● Cloud computing services are used by startups and big companies alike for managing:
○ Quickly deployable/replicable instances (our use case)
○ Huge amounts of storage (Snapchat--AWS)
○ High network traffic (Pokemon Go--Google Cloud)
○ Advanced capabilities (computing clusters, always-online redundant systems)

What is a common data science stack?

● Python: a well supported and common programming language
● Conda: a package management system with useful bundles
● Jupyter: a simple GUI for exploration and evaluation of data
● Matplotlib: visualization
● Pandas: fast data aggregation (tables, database/file interfaces, etc.)
● Scipy: useful fast math libraries
● Scikit-learn: the kitchen sink of ML algorithms

Getting credit

During the hackathon, you’ll receive a code to redeem at https://console.cloud.google.com/billing/redeem

If you’re following along with the tutorial, use the free trial, if available. Follow the instructions at https://cloud.google.com/free/.

A new project with a “billing account” should be automatically created following either of the paths above.

Provisioning a VM

Dashboard -> Lefthand Bar -> Compute Engine -> VM Instances

VM Instances -> Create

Create Instance -> Enter Name

Create Instance -> Select Machine Type

Create Instance -> Boot Disk

Create Instance -> Allow HTTP/HTTPS

Create Instance -> Advanced (Management, Disks, SSH Keys)

Generate SSH Key
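
The key-generation step can be sketched on your own machine roughly as follows. This is a minimal sketch, not the exact command from the slide: the function name `generate_vm_key`, the default path `~/.ssh/gcp_hackathon`, and the username `hacker` are all hypothetical. It assumes OpenSSH's `ssh-keygen` is installed and that, as I understand it, the Cloud Console takes the login username from the trailing comment of the pasted public key.

```shell
# Sketch: create an SSH key pair whose public half can be pasted into the
# "SSH Keys" box. Function name, default path, and username are hypothetical.
generate_vm_key() {
  key_file="${1:-$HOME/.ssh/gcp_hackathon}"   # private key path (public key gets .pub)
  vm_user="${2:-hacker}"                      # stored as the key comment

  # Refuse to overwrite an existing key instead of prompting.
  if [ -e "$key_file" ]; then
    echo "refusing to overwrite $key_file" >&2
    return 1
  fi

  mkdir -p "$(dirname "$key_file")"
  # -t rsa -b 4096: 4096-bit RSA key    -C: comment    -f: output file
  # -N "": empty passphrase, acceptable for a throwaway hackathon key only
  # -q: suppress progress output
  ssh-keygen -t rsa -b 4096 -C "$vm_user" -f "$key_file" -N "" -q || return 1

  # Print the public key so it can be copied into the console.
  cat "$key_file.pub"
}
```

Calling `generate_vm_key` with no arguments prints a single `ssh-rsa ... hacker` line to copy. Once the VM exists, the matching login would look like `ssh -i ~/.ssh/gcp_hackathon hacker@<external ip>`.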

Create Instance -> Advanced (Management, Disks, SSH Keys) -> Paste

Create Instance [Finish] -> VM Instances Dashboard -> Copy IP

Dashboard -> Lefthand Bar -> VPC Network -> External IP addresses -> Select “Static”

VPC Network -> Firewall rules -> Select “default-allow-http” -> Edit

VPC Network -> Firewall rules -> Select “default-allow-http” -> Edit [Protocols] -> Save

SSH Into New VM!

Interacting with a VM

Lefthand Menu -> Compute Engine -> VM Instances -> Click [Name]

SSH + Bash

SSH + Bash

SSH + Bash

Package management

Install and manage Debian packages on our Ubuntu virtual machine with these useful commands:

$ sudo apt-get update
$ sudo apt-get install <package>
$ sudo apt-get remove <package>

A favorite package of mine is htop, an interactive performance monitor. Let’s practice installation:

$ sudo apt-get update
$ sudo apt-get install -y htop
$ htop
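
If you re-run setup steps often, the installation above can be wrapped so that it only runs when needed. This is a small sketch under one assumption: the package and the command it provides share a name, as htop does. The helper names `needs_install` and `install_if_missing` are hypothetical.

```shell
# Return success (0) when the named command is NOT yet on the PATH.
needs_install() {
  ! command -v "$1" >/dev/null 2>&1
}

# Install a package only when its command is missing.
# Assumes the package name equals the command name (true for htop).
install_if_missing() {
  if needs_install "$1"; then
    sudo apt-get update && sudo apt-get install -y "$1"
  else
    echo "$1 is already installed"
  fi
}
```

For example, `install_if_missing htop` installs htop on a fresh VM and only prints a message on later runs.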

HTOP monitoring (optional)

Easy Env. Setup with Docker

What is Docker?

● A way to deploy uniform software environments
● Commonly used to provision/deploy new servers on cloud platforms
● Consists of an engine which runs containers specified by images/Dockerfiles

We will use an open source Dockerfile which standardizes a common powerful data science environment.

Installing Docker on the VM

$ curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -

$ sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"

$ sudo apt-get update
$ sudo apt-get install -y docker-ce
$ sudo systemctl status docker

Successful Docker installation

Jupyter Data Science Stack

The Jupyter Data Science Stack is one of the easiest ways to quickly deploy a comprehensive data science environment on a fresh VM. It is distributed as a Docker image built from an open source Dockerfile.

$ sudo docker run -it --rm -p 8888:8888 jupyter/datascience-notebook start-notebook.sh --ip=0.0.0.0 --port=8888 --no-browser

Downloading new Docker image

Jupyter stack running -> Quit with [Ctrl+C]

Interacting with Jupyter

Connecting with a browser

Starting the Jupyter Docker image as before will give you a URL similar to: http://0.0.0.0:8888/?token=4396756268c013a5cf88277aa75fdaa5bc64250557c1d628

To connect from a computer other than the VM (opening a browser on the VM itself would require X-Windows forwarding), replace 0.0.0.0 in the URL with the public IP of your VM, which is the same address we used for SSH:
http://104.197.127.17:8888/?token=4396756268c013a5cf88277aa75fdaa5bc64250557c1d628
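
The substitution described above can be sketched as a small shell helper. The function name `jupyter_url_for` is hypothetical; it assumes the printed URL uses either 0.0.0.0 or localhost as its host and leaves the rest of the URL, including the token, untouched.

```shell
# Sketch: swap the host in the URL printed by Jupyter for the VM's external IP.
# $1 = URL as printed in the terminal, $2 = external IP of the VM
jupyter_url_for() {
  printf '%s\n' "$1" | sed -e "s#//0\.0\.0\.0#//$2#" -e "s#//localhost#//$2#"
}

jupyter_url_for \
  "http://0.0.0.0:8888/?token=4396756268c013a5cf88277aa75fdaa5bc64250557c1d628" \
  "104.197.127.17"
# prints http://104.197.127.17:8888/?token=4396756268c013a5cf88277aa75fdaa5bc64250557c1d628
```

Paste the printed address into a browser on your own machine. The token changes every time the container is restarted, so re-run the helper on the new URL.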

Jupyter landing page

Jupyter -> New -> Python 3

Enter code in cell -> [Shift+Enter] -> Hello World!

Jupyter stack status

Recommended usage

If you wish to use this Jupyter notebook system for most of your hacking, please use the following command to preserve the state of the notebooks on the local filesystem. When prompted, the password is cdss2017. You won’t need to use a token. All files to be saved must be in the work directory.

$ cd ~ && mkdir -p work && sudo chown 1000 work && sudo docker run -d --rm -p 8888:8888 -v ~/work:/home/jovyan/work --user root -e NB_UID=1000 jupyter/datascience-notebook start-notebook.sh --ip=0.0.0.0 --port=8888 --no-browser --NotebookApp.password='sha1:e072c2ec444e:c0302545ca6a0be2723291f2a1f83dfa86f5a1c5'

http://<external ip>:8888/
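
The --NotebookApp.password value in the command above has the shape sha1:<salt>:<digest>. If you would rather not share the workshop password, a hash for your own password can be sketched as below. This assumes the digest is the SHA-1 of the password immediately followed by the salt, which is my understanding of how the classic Notebook password helper works; treat it as an illustration and prefer the notebook's own helper for anything long-lived. The function names are hypothetical, and `sha1sum` is assumed to be available (it is on Ubuntu).

```shell
# Sketch: build a hash string of the form sha1:<salt>:<digest>.
# Assumption: digest = SHA-1(password followed by salt).
# $1 = password, $2 = salt
notebook_sha1_hash() {
  digest=$(printf '%s%s' "$1" "$2" | sha1sum | cut -d' ' -f1)
  printf 'sha1:%s:%s\n' "$2" "$digest"
}

# Produce a random 12-character hex salt, the same length as the salt above.
random_salt() {
  head -c 6 /dev/urandom | od -An -tx1 | tr -d ' \n'
}
```

For example, `notebook_sha1_hash 'my-password' "$(random_salt)"` prints a value that can replace the quoted hash in the docker command.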

Helpful commands for managing Docker

If you’ve deployed a docker container in the background (as a daemon with the -d flag), you can see running containers with the command:

● sudo docker container list

A running container can be stopped as follows. Because the container was started with --rm, stopping it removes all associated files/state, but files in the work directory are preserved on the local filesystem and can be recovered after redeployment.

● sudo docker container stop <container id>

If you’re curious about more interaction with docker and looking to learn more, you can find documentation online here.

Using RStudio

Run RStudio Dockerfile

$ cd ~ && mkdir -p R && sudo chown 1000 R && sudo docker run --rm -d -p 8787:8787 -v ~/R:/home/cdss2017 --user root -e USER=cdss2017 -e PASSWORD=cdss2017 -e ROOT=TRUE -e USERID=1000 rocker/rstudio

State is preserved on the local VM filesystem if the container is stopped. Further, this container launches and runs persistently in the background. Documentation here.

Access RStudio

RStudio launches in the background and is reachable at: http://<external ip>:8787 (substitute <external ip> with your instance’s External IP, e.g. 104.197.127.17)

● Username: cdss2017
● Password: cdss2017

Advanced reading

● Creating a computing cluster with Apache Spark to back Jupyter notebooks (link)

Thank you! Questions?