Confidential / 3 August 2017
Data Science ❤ Docker
Dockerize Your Data Science Environment
Outlook
• About Me, Detego and RFID
• Docker Basics
• Docker for Data Science Environments
• Connecting your Data Science Environment to other Services
About Me, Detego and RFID
Florian Geigl
PhD from Institute of Interactive Systems and Data Science, Graz University of Technology
Short-Term Scholar at Information Sciences Institute, University of Southern California
Working as Data Scientist at Detego
Participated in Kaggle competitions
Detego (Latin: reveal, uncover, display)
Located in Graz
~35 employees
Fashion retail industry
International customers
www.detego.com
Docker Basics
https://www.docker.com/what-docker
“…Developers use Docker to eliminate “works on my machine” problems when collaborating on code with co-workers…”
“everything required to make a piece of software run is packaged into isolated containers. Unlike VMs, containers do not bundle a full operating system - only libraries and settings required to make the software work are needed. This makes for efficient, lightweight, self-contained systems and guarantees that software will always run the same, regardless of where it’s deployed.”
Why Docker?
• Consistent Environments
  • Linux, macOS, Windows
  • AWS, Azure & many more
• Native Performance
  • + CUDA version
• Resource Savings
• Easy Configuration
  • Pre-built/official Images
  • or custom Docker Image
• Easy Mounting of Data
How fast can you set up an Apache server?
Switch between Apache versions?
Set up an identical Apache server on Linux, Mac & Windows?
…
Live Demo: Apache
docker run -it --rm -p 8888:80 -v C:\path\to\data:/usr/local/apache2/htdocs/ httpd
Image != Container
Image == Class
Container == Instance
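Switching between Apache versions comes down to choosing a different image tag. A minimal sketch, assuming the official httpd image publishes a 2.4 tag; the function name run_apache and the DOCKER override variable are hypothetical conveniences that are not part of the demo, and the data mount is left out for brevity:

```shell
# Hypothetical helper: start an Apache container for a given image tag.
# Usage: run_apache [tag] [host_port]
# Setting DOCKER=echo (an assumption of this sketch) prints the arguments
# instead of starting a container.
run_apache() {
  tag="${1:-latest}"
  host_port="${2:-8888}"
  ${DOCKER:-docker} run -it --rm -p "${host_port}:80" "httpd:${tag}"
}
```

Calling run_apache 2.4 should serve on http://localhost:8888. Stopping it and calling run_apache latest 8080 starts another version with the same command, because every call creates a fresh container (instance) from the chosen image (class).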
Docker…
• has basically no overhead
• provides native performance
• provides a consistent environment
• allows you to build your own docker image
• runs on any host OS
• allows you to easily mount data into a container
• starts instantly
• …
Building a Docker Data Science Environment
Building your own Docker Images
e.g.: Ubuntu & vim
“Dockerfile”
FROM ubuntu:latest
RUN apt-get update && apt-get install -y vim
docker build .
-> results in a docker image
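Without a tag, docker build only reports an image ID. A minimal sketch of building the same two-line Dockerfile under a readable name; the image name ubuntu-vim, the function name, and the DOCKER override are hypothetical, and -y is used so that apt-get does not wait for confirmation during the build:

```shell
# Hypothetical helper: write the two-line Dockerfile into a scratch
# directory and build it under a readable image name.
# Usage: build_vim_image [image_name]
build_vim_image() {
  image_name="${1:-ubuntu-vim}"
  build_dir="$(mktemp -d)"
  printf '%s\n' \
    'FROM ubuntu:latest' \
    'RUN apt-get update && apt-get install -y vim' \
    > "${build_dir}/Dockerfile"
  # -t names the resulting image; DOCKER=echo only prints the arguments.
  ${DOCKER:-docker} build -t "${image_name}" "${build_dir}"
}
```

After build_vim_image has finished, docker run --rm -it ubuntu-vim vim should open the editor inside a fresh container.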
Docker Data Science Image
Based on Kaggle’s Docker Image: https://hub.docker.com/u/kaggle/
Open-Source: https://github.com/floriangeigl/docker-DataScience
- pull requests are highly welcome
Contained Services:
- Python (2&3)
- R
- Julia
- Jupyter Notebooks
- JupyterLab
- RStudio
“docker pull floriangeigl/datascience”
(-> pulls or updates an image)
Do It – Do It Now!
docker run --rm -it -p 8888:8888 -p 8889:8889 -p 8787:8787 -p 2222:22 -p 9001:9001 -v "${pwd}:/data/" --name dsdocker floriangeigl/datascience /bin/bash
docker run: Creates a container from an image and executes the given command
-it: Attaches an interactive terminal to the container
--rm: Removes the container after it exits
-p: Maps a container port to a port on the host machine
(e.g.: HostPort:ContainerPort)
-v: Mounts a host directory into the container
(e.g.: HostPath:ContainerPath)
pwd = print working directory = current path
--name: Assigns a name to the container (here: dsdocker)
floriangeigl/datascience: Docker image
/bin/bash: Executed command
Image != Container
Image == Class
Container == Instance
Live Demo: Data Science Container
8888: Jupyter Notebook
8889: JupyterLab
8787: RStudio Server
22: SSH (mapped to host port 2222 in the docker run command)
9001: supervisord (status of services; restart services; logs…)
Best Practice #1: Aliases
Windows PowerShell:
run “notepad $PROFILE”
add “function dsdocker {docker run --rm -i -t -p … -v "${pwd}:/data/" …}”
restart PowerShell & use your new “dsdocker” command
Linux & Mac:
add an alias for the command
replace ${pwd} with $(pwd)
-> see: https://github.com/floriangeigl/docker-DataScience
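For Linux and Mac, the alias might look like the sketch below, assuming a bash or zsh startup file such as ~/.bashrc (the file name depends on the shell in use). Single quotes delay $(pwd) until the alias is used, so the mount always points at the directory the command is called from:

```shell
# Same command as on the "Do It" slide, with ${pwd} replaced by $(pwd).
# Single quotes keep $(pwd) unexpanded until the alias is invoked.
alias dsdocker='docker run --rm -it -p 8888:8888 -p 8889:8889 -p 8787:8787 -p 2222:22 -p 9001:9001 -v "$(pwd):/data/" --name dsdocker floriangeigl/datascience /bin/bash'
```

After reloading the shell, typing dsdocker inside any project directory should start the container with that directory mounted at /data/.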
Best Practice #2 – Fixed Project Structure
• Cookiecutter: https://github.com/drivendata/cookiecutter-data-science
Known Bugs
Issues with the German (de) keyboard layout:
Can’t type “\” on a German keyboard in Chrome & IE
https://github.com/jupyter/notebook/issues/2379#issuecomment-301268937
-> workaround: use Firefox
Connecting to other Services
Databases anyone?
Get your hands dirty on various technologies
https://hub.docker.com/u/library/
Docker-Compose
version: "3.1"
services:
  datascience:
    image: floriangeigl/datascience:latest
    ports:
      - "8888:8888"
      - "8889:8889"
      - "8787:8787"
      - "9001:9001"
    volumes:
      - ./:/data/
    links:
      - mongo
      - cassandra
  mongo:
    image: mongo:latest
    # persistent storage
    volumes:
      - ./data/mongo/:/data/db
  cassandra:
    image: cassandra:latest
Accessible ports: 8888, 8889, 8787, …
Hostname: mongo
Hostname: cassandra
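Inside the datascience container, the databases are reached by their service names (mongo, cassandra) rather than localhost. A minimal sketch for checking that, assuming bash and the coreutils timeout command are available in the image and that the database images listen on their default ports (27017 for MongoDB, 9042 for Cassandra's CQL interface); can_reach and check_stack are hypothetical helper names:

```shell
# Hypothetical helper: succeed if a TCP connection to host:port can be
# opened within 2 seconds. Relies on bash's /dev/tcp pseudo-device.
# Usage: can_reach <host> <port>
can_reach() {
  timeout 2 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

# Hypothetical helper: report whether each linked database answers.
check_stack() {
  if can_reach mongo 27017; then
    echo "mongo: reachable"
  else
    echo "mongo: not reachable"
  fi
  if can_reach cassandra 9042; then
    echo "cassandra: reachable"
  else
    echo "cassandra: not reachable"
  fi
}
```

Run check_stack from a shell inside the datascience container once the stack is up. Cassandra may need some time after its container starts before it accepts connections.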
Do It – Do It Now!
Go to /path/to/compose-file
Run “docker-compose(.exe) up”
Use the stack
Shutdown: Ctrl+C
Remove used containers: “docker-compose(.exe) rm”
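The steps above can be wrapped in one small helper. A sketch, assuming it is run from the directory that contains the compose file; the function name stack and the COMPOSE override are hypothetical. It uses detached mode (up -d) and down, which stops and removes the containers in one step, instead of the foreground up and the separate rm:

```shell
# Hypothetical helper around docker-compose for the stack's lifecycle.
# Usage: stack {start|status|logs|stop}
# Setting COMPOSE=echo (an assumption of this sketch) prints the
# arguments instead of calling docker-compose.
stack() {
  case "$1" in
    start)  ${COMPOSE:-docker-compose} up -d ;;
    status) ${COMPOSE:-docker-compose} ps ;;
    logs)   ${COMPOSE:-docker-compose} logs -f datascience ;;
    stop)   ${COMPOSE:-docker-compose} down ;;
    *)      echo "usage: stack {start|status|logs|stop}" >&2; return 1 ;;
  esac
}
```

With the stack running in the background, the terminal stays free; stack logs follows the output of the datascience service, and stack stop cleans everything up.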
Questions?