Data Science on Google Cloud Platform
Lucas Schuermann (lvs.io)
Columbia University Data Science Society Hackathon
September 2017
Outline
● Introduction
● Definitions
● Creating a new project
● Provisioning a VM
● Interacting with a VM
● Easy environment setup with Docker
● Interacting with Jupyter
● Advanced features
Introduction
By the end of this tutorial, you will have set up a Jupyter notebook environment on a virtual machine on Google Cloud Platform. You will understand how to configure new virtual machines and other environments for data analysis.
Please be sure to read all instructions carefully and thoroughly; you will save tremendous amounts of time by taking care to follow the correct steps the first time!
Definitions
What is Google Cloud Platform?
● One of many cloud computing services (AWS, DigitalOcean, Azure, etc.)
● Offers comprehensive services for data storage, networking, server provisioning, and more
● Cloud computing services are used by startups and big companies alike for managing:
○ Quickly deployable/replicable instances (our use case)
○ Huge amounts of storage (Snapchat--AWS)
○ High network traffic (Pokemon Go--Google Cloud)
○ Advanced capabilities (computing clusters, always-online redundant systems)
What is a common data science stack?
● Python: a well supported and common programming language
● Conda: a package management system with useful bundles
● Jupyter: a simple GUI for exploration and evaluation of data
● Matplotlib: visualization
● Pandas: fast data aggregation (tables, database/file interfaces, etc.)
● Scipy: useful fast math libraries
● Scikit-learn: the kitchen sink of ML algorithms
Getting credit
During the hackathon, you’ll receive a code. Redeem it at https://console.cloud.google.com/billing/redeem
If you’re following along during the tutorial, use the free trial, if available. Follow the instructions at https://cloud.google.com/free/.
A new project with a “billing account” should be automatically created following either of the paths above.
Provisioning a VM
Dashboard -> Lefthand Bar -> Compute Engine -> VM Instances
VM Instances -> Create
Create Instance -> Enter Name
Create Instance -> Select Machine Type
Create Instance -> Boot Disk
Create Instance -> Allow HTTP/HTTPS
Create Instance -> Advanced (Management, Disks, SSH Keys)
Generate SSH Key
Create Instance -> Advanced (Management, Disks, SSH Keys) -> Paste
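The key for the two steps above can be generated in a local terminal. This is a minimal sketch, assuming OpenSSH's ssh-keygen is installed. The function name, key path, and username are hypothetical placeholders. It also assumes the console takes the login username from the comment at the end of the public key, which is why -C is set.

```shell
# Generate an RSA key pair and print the public half for pasting into
# the "SSH Keys" box. Both arguments are hypothetical placeholders.
make_vm_key() {
  key_file="$1"   # private key path, e.g. "$HOME/.ssh/gcp_hackathon"
  vm_user="$2"    # login name you want on the VM
  # -t rsa -b 4096: key type and size
  # -C: key comment (assumed to be read as the username by the console)
  # -N "": empty passphrase, an assumption for convenience; set one if you prefer
  # -q: quiet; -f: output file (the public key is written to "$key_file.pub")
  ssh-keygen -q -t rsa -b 4096 -C "$vm_user" -N "" -f "$key_file" &&
    cat "$key_file.pub"
}

# Usage (not run automatically):
#   make_vm_key "$HOME/.ssh/gcp_hackathon" cdss
```

Copy the single printed line (it starts with ssh-rsa and ends with the username) into the console, and keep the private key file on your own machine.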
Create Instance [Finish] -> VM Instances Dashboard -> Copy IP
Dashboard -> Lefthand Bar -> VPC Network -> External IP addresses -> Select “Static”
VPC Network -> Firewall rules -> Select “default-allow-http” -> Edit
VPC Network -> Firewall rules -> Select “default-allow-http” -> Edit [Protocols] -> Save
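The same firewall edit can be sketched from the command line. This assumes the Google Cloud SDK (gcloud) is installed and authenticated, which the slides do not cover. It also assumes the ports to open are 8888 and 8787, the ones used for Jupyter and RStudio later in this tutorial. The helper names are hypothetical. The sketch only prints the command so you can review it before running it.

```shell
# Turn a list of TCP ports into the value gcloud expects for --allow,
# e.g. "80 8888" becomes "tcp:80,tcp:8888".
allow_spec() {
  spec=""
  for port in "$@"; do
    spec="${spec:+$spec,}tcp:$port"
  done
  printf '%s\n' "$spec"
}

# Print (do not run) the gcloud command that would update the rule.
firewall_update_cmd() {
  rule="$1"
  shift
  printf 'gcloud compute firewall-rules update %s --allow %s\n' \
    "$rule" "$(allow_spec "$@")"
}

# Port 80 is listed too because --allow replaces the rule's existing list.
firewall_update_cmd default-allow-http 80 8888 8787
```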
SSH Into New VM!
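If you connect often, an entry in ~/.ssh/config saves retyping the address and key path. This is a rough sketch: the function name, host alias, username, and key path are hypothetical placeholders, and the IP is the example address used later in these slides.

```shell
# Print an ~/.ssh/config stanza for the VM.
# Arguments: short alias, external IP, VM username, private key path.
ssh_config_entry() {
  printf 'Host %s\n' "$1"
  printf '    HostName %s\n' "$2"
  printf '    User %s\n' "$3"
  printf '    IdentityFile %s\n' "$4"
}

# Append the stanza to your SSH config, then connect with: ssh gcp-vm
#   ssh_config_entry gcp-vm 104.197.127.17 cdss ~/.ssh/gcp_hackathon >> ~/.ssh/config
# One-off equivalent without a config entry:
#   ssh -i ~/.ssh/gcp_hackathon cdss@104.197.127.17
```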
Interacting with a VM
Lefthand Menu -> Compute Engine -> VM Instances -> Click [Name]
SSH + Bash
Package management
Install and manage Debian packages on our Ubuntu virtual machine with these useful commands:
$ sudo apt-get update
$ sudo apt-get install <package>
$ sudo apt-get remove <package>
A favorite package of mine is htop, an interactive performance monitor. Let’s practice installation:
$ sudo apt-get update
$ sudo apt-get install -y htop
$ htop
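When the same setup is repeated on several VMs, it can help to skip packages that are already present. This is a small sketch with a hypothetical helper name. It assumes the package and the command it installs share a name, which holds for htop but not for every package.

```shell
# Install a package only if its command is not already on the PATH.
ensure_installed() {
  pkg="$1"
  if command -v "$pkg" >/dev/null 2>&1; then
    echo "$pkg is already installed"
  else
    sudo apt-get update && sudo apt-get install -y "$pkg"
  fi
}

# Usage:
#   ensure_installed htop
```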
HTOP monitoring (optional)
Easy Env. Setup with Docker
What is Docker?
● A way to deploy uniform software environments
● Commonly used to provision/deploy new servers on cloud platforms
● Consists of an engine which runs containers specified by images/Dockerfiles
We will use an open source Dockerfile which standardizes a common powerful data science environment.
Installing Docker on the VM
$ curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
$ sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"
$ sudo apt-get update
$ sudo apt-get install -y docker-ce
$ sudo systemctl status docker
Successful Docker installation
Jupyter Data Science Stack
The Jupyter Data Science Stack is one of the easiest ways to quickly deploy a comprehensive data science environment on a fresh VM. It is distributed as a Docker image built from an open source Dockerfile.
$ sudo docker run -it --rm -p 8888:8888 jupyter/datascience-notebook start-notebook.sh --ip=0.0.0.0 --port=8888 --no-browser
Downloading new Docker image
Jupyter stack running -> Quit with [Ctrl+C]
Interacting with Jupyter
Connecting with a browser
Starting the Jupyter Docker image as before will give you a URL similar to:
http://0.0.0.0:8888/?token=4396756268c013a5cf88277aa75fdaa5bc64250557c1d628
To connect from a computer other than the VM (opening a browser on the VM itself would require X-Windows forwarding), replace 0.0.0.0 with the public IP of your VM, which is the same address we used for SSH:
http://104.197.127.17:8888/?token=4396756268c013a5cf88277aa75fdaa5bc64250557c1d628
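The address substitution can be scripted so the token is never retyped by hand. This is a minimal sketch with a hypothetical function name. It handles both 0.0.0.0 and localhost as the printed host, since the exact form may vary between Jupyter versions.

```shell
# Rewrite the URL printed by Jupyter so it points at the VM's external IP.
public_jupyter_url() {
  printed_url="$1"   # URL copied from the container's startup output
  external_ip="$2"   # the VM's external IP
  printf '%s\n' "$printed_url" |
    sed -e "s#//0\.0\.0\.0:#//$external_ip:#" \
        -e "s#//localhost:#//$external_ip:#"
}

public_jupyter_url \
  "http://0.0.0.0:8888/?token=4396756268c013a5cf88277aa75fdaa5bc64250557c1d628" \
  104.197.127.17
# prints http://104.197.127.17:8888/?token=4396756268c013a5cf88277aa75fdaa5bc64250557c1d628
```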
Jupyter landing page
Jupyter -> New -> Python 3
Enter code in cell -> [Shift+Enter] -> Hello World!
Jupyter stack status
Recommended usage
If you wish to use this Jupyter notebook system for most of your hacking, please use the following command to preserve the state of the notebooks on the local filesystem. When prompted, the password is cdss2017. You won’t need to use a token. All files to be saved must be in the work directory.
$ cd ~ && mkdir -p work && sudo chown 1000 work && sudo docker run -d --rm -p 8888:8888 -v ~/work:/home/jovyan/work --user root -e NB_UID=1000 jupyter/datascience-notebook start-notebook.sh --ip=0.0.0.0 --port=8888 --no-browser --NotebookApp.password='sha1:e072c2ec444e:c0302545ca6a0be2723291f2a1f83dfa86f5a1c5'
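The one-line command above is dense, so here it is rebuilt flag by flag as a sketch. The function name start_jupyter and its two arguments are hypothetical. The flags and image name are the ones from the original command, and the comments give my reading of what each one does.

```shell
# Prepare the work directory and start the Jupyter container in the background.
start_jupyter() {
  work_dir="$1"        # host directory for notebooks, e.g. "$HOME/work"
  password_hash="$2"   # the full sha1:... string from the original command

  # Create the host directory and give it to UID 1000, as the original does.
  mkdir -p "$work_dir" && sudo chown 1000 "$work_dir" || return 1

  set -- -d --rm                                 # detach; remove the container once it stops
  set -- "$@" -p 8888:8888                       # publish container port 8888 on the VM
  set -- "$@" -v "$work_dir:/home/jovyan/work"   # bind-mount the host work directory
  set -- "$@" --user root -e NB_UID=1000         # start as root so the notebook user can be set to UID 1000
  set -- "$@" jupyter/datascience-notebook       # image to run
  set -- "$@" start-notebook.sh --ip=0.0.0.0 --port=8888 --no-browser
  set -- "$@" "--NotebookApp.password=$password_hash"

  sudo docker run "$@"
}

# Usage (substitute the hash string from the command above):
#   start_jupyter "$HOME/work" 'sha1:...'
```

The -v flag is what makes the work directory survive container restarts, and NB_UID=1000 matches the chown so the notebook user can write to it.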
http://<external ip>:8888/
Helpful commands for managing Docker
If you’ve deployed a docker container in the background (as a daemon with the -d flag), you can see running containers with the command:
● sudo docker container list
A running container can be stopped as follows. Because the container was started with the --rm flag, stopping it removes the container and all files/state inside it, but files in the work directory are preserved on the VM's local filesystem and can be recovered after redeployment.
● sudo docker container stop <container id>
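To avoid copying the container id by hand, the two commands can be combined. This is a sketch with hypothetical helper names. It uses docker ps with -q (print ids only) and an ancestor filter to find containers started from a given image.

```shell
# Print the ids of running containers started from a given image.
containers_for_image() {
  sudo docker ps -q --filter "ancestor=$1"
}

# Stop every running container started from that image (does nothing if none).
stop_image_containers() {
  for id in $(containers_for_image "$1"); do
    sudo docker container stop "$id"
  done
}

# Usage:
#   stop_image_containers jupyter/datascience-notebook
```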
If you’re curious about more interaction with docker and looking to learn more, you can find documentation online here.
Using RStudio
Run RStudio Dockerfile
$ cd ~ && mkdir -p R && sudo chown 1000 R && sudo docker run --rm -d -p 8787:8787 -v ~/R:/home/cdss2017 --user root -e USER=cdss2017 -e PASSWORD=cdss2017 -e ROOT=TRUE -e USERID=1000 rocker/rstudio
State is preserved on the local VM filesystem if the container is closed. Further, this container launches and runs persistently in the background.
Documentation here.
Access RStudio
RStudio launches in the background and is served at: http://<external ip>:8787 (substitute <external ip> with your instance’s External IP, e.g. 104.197.127.17)
● Username: cdss2017
● Password: cdss2017
Advanced reading
● Creating a computing cluster with Apache Spark to back Jupyter notebooks (link)
Thank you! Questions?