JOSA TechTalk: Taking Docker to Production

Taking Docker to Production, a JOSA TechTalk by Muayyad Saleh Alsadi (http://muayyad-alsadi.github.io/)

Transcript of JOSA TechTalk: Taking Docker to Production


What is Docker again? (quick review)

Containers

Uses Linux kernel features such as:

● namespaces
● cgroups (control groups)
● capabilities

Platform

Docker is a key component of many PaaS offerings. Docker provides a way to host images, pull them, run them, pause them, snapshot them into new images, view diffs, etc.
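For example, the basic image lifecycle with the standard docker CLI (image and repository names below are only placeholders):

docker pull fedora:22                            # fetch an image from a registry
docker run -d --name demo fedora:22 sleep 1000   # start a container from it
docker pause demo                                # freeze its processes (cgroup freezer)
docker unpause demo
docker diff demo                                 # show filesystem changes vs. the image
docker commit demo myrepo/fedora-snapshot        # snapshot the container into a new image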

Ecosystem

Like GitHub, Docker Hub provides publicly available community images.

Containers vs. VMs

No kernel in the guest OS (it is shared with the host). Containers are more secure and isolated than chroot, but less isolated than VMs.

Why DevOps?

Devs

want change

Ops

wants stability (no change)

DevOps

resolves the conflict.

For devs: the Docker image contains the same OS, same libraries, same versions, same config, etc.

For admins: the host is untouched and stable.

Without DevOps, they blame each other and fight each other.

Devs Heaven (not for production)

docker-compose can bring everything up, connect the containers, and link them with a single command. It can mount a local directory inside the image (so that developers can use their favorite IDE). The command is

docker-compose up

It will read "docker-compose.yml", which might look like:

mywebapp:
  image: mywebapp
  volumes:
    - .:/code
  links:
    - redis
redis:
  image: redis

Operations Heaven

Having a stable host!

CoreOS does not include any package manager, and does not even have Python or common tools installed. It ships a Fedora-based docker image called toolbox instead.

You can mix and match: some containers run Java 6 or Java 7; some use CentOS 6, others CentOS 7, Ubuntu 14.04, or Fedora 22, etc., all on the same host.

Linking Containers

docker run -d --name r1 redis
docker run -d --name web --link r1:redis myweb

r1 is the container name and redis is the link alias. Linking will update /etc/hosts and set environment variables such as (a usage sketch follows the list):

● <alias>_NAME = <THIS>/<THAT> # myweb/r1
● REDIS_PORT=<tcp|udp>://<IP>:<PORT>
● REDIS_PORT_6379_TCP_PROTO=tcp
● REDIS_PORT_6379_TCP_PORT=6379
● REDIS_PORT_6379_TCP_ADDR=172.17.1.15
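Inside the linked container those values can be consumed directly. A minimal entry-point sketch (the variables follow the redis alias above; the myweb command and its flags are placeholders):

#!/bin/bash
# read the variables injected by --link r1:redis (see the list above)
redis_host="${REDIS_PORT_6379_TCP_ADDR:-redis}"   # /etc/hosts also resolves the "redis" alias
redis_port="${REDIS_PORT_6379_TCP_PORT:-6379}"
exec myweb --redis "${redis_host}:${redis_port}"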

Pets vs. Cattle vs. Ants

Pets (virtualization)

The VM has:

● lovely distinct names
● emotions
● many highly coupled roles
● if down, it's a catastrophe

Cattle (cloud)

● no names
● no emotions
● single role
● decoupled (loosely coupled)
● load-balanced
● if down, other VMs take over
● VM failure is planned and part of the process

Ants (docker containers)

Containers are like cloud VMs: no names, no emotions, load-balanced.

A single host (which might itself be a VM) is highly dense. The host is stable. Large groups of containers are designed to fail as part of the process.

What docker is not

● docker is not a hypervisor
  ○ docker is for process containers, not system containers
  ○ examples of system containers: LXD and OpenVZ
● no systemd/upstart/sysvinit in the container
  ○ docker is for process containers, not system containers
  ○ just run apache, nginx, solr, whatever
  ○ TTYs are not needed
  ○ crons are not needed

● Docker is not for multi-tenant

HINT: LXD is a stupid way of winning a meaningless benchmark.

Docker ecosystem

● CoreOS, Project Atomic, Ubuntu Core
● OpenShift (Red Hat's PaaS)
● Cloud Foundry
● Mesos / Mesosphere (popularized by Twitter, now an Apache project)
● Google Kubernetes (schedules containers onto hosts)
● Swarm
● etcd / Fleet
● Drone
● Deis, Flynn, Rancher

Docker golden rules

by @gionn on Twitter:

● only one process per image
● no embedded configuration
● no sshd, no syslog, no tty
● no! you don't touch a running container to adjust things
● no! you will not use a community image

Theory vs. Reality

Docker's imaginary "unicorn" apps:

● statically compiled (no dependencies)
● written in golang
● container ~ 10MB

In the real world:

● interpreted application (python, php)

● system dependencies, config files, log files

● multiple processes (nginx, php-fpm)

● container image >500MB

12 Factor - http://12factor.net/

1. One codebase (in git), many deploys
2. Explicitly declare and isolate dependencies
3. Get config from environment or service discovery
4. Treat backing services as attached resources (database, SMTP, S3, etc.)
5. Strictly separate build and run stages (no minifying css/js at the run stage)
6. Execute the app as one or more stateless processes (data and state are persisted elsewhere, apart from the app; no need for sticky sessions)
7. Export a port (an endpoint to talk to)
8. Scale out via the process model
9. Disposability: maximize robustness with fast startup and graceful shutdown
10. Keep development, staging, and production as similar as possible
11. Logs: a flow of events written to stdout that is captured by the execution environment

12 Factor

The last factor is administrative processes:

● Run admin/management tasks as one-off processes
  ○ in django: manage.py migrate (see the sketch after this list)
● One-off admin processes should be run in an identical environment as the regular long-running processes of the app
● shipped from the same code (same git repo)
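For example, a one-off migration run from the same image and environment as the long-running app might look like this (image name, env file, and command are placeholders):

docker run --rm --env-file=/etc/sysconfig/container/myapp.rc myrepo/myapp python manage.py migrate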

Example of 12 Factor: Bedrock, a 12-factor WordPress: https://roots.io/bedrock/

12 Factor - Factorish

Can be found at https://github.com/factorish/factorish

Example: https://github.com/factorish/factorish-elk

Config

● confd (see the sketch after this list)
  ○ written in go (a statically linked binary)
  ○ input:
    ■ env variables
    ■ service discovery (like etcd and consul)
    ■ redis
  ○ output:
    ■ golang templates with {{something}}
● crudini, jq
● http://gliderlabs.com/registrator/latest/user/quickstart/
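A minimal confd sketch, assuming an etcd backend (the key, file paths, and template contents are illustrative):

mkdir -p /etc/confd/conf.d /etc/confd/templates
cat > /etc/confd/conf.d/myapp.toml <<'EOF'
[template]
src = "myapp.conf.tmpl"
dest = "/etc/myapp/myapp.conf"
keys = ["/myapp/redis/host"]
EOF
cat > /etc/confd/templates/myapp.conf.tmpl <<'EOF'
redis_host = {{getv "/myapp/redis/host"}}
EOF
confd -onetime -backend etcd -node http://127.0.0.1:2379   # render the config once from etcd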

Config

● the container's entry point ("/start.sh") calls a REST API to add itself to haproxy or any other load balancer
● the container's entry point uses a discovery-service client (e.g. etcdctl; a sketch follows this list)
● something listens to docker events and sends each container's ENV and labels to the discovery service
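A minimal "/start.sh" sketch for the etcdctl approach (the key layout, TTL, and the final exec'ed command are illustrative):

#!/bin/bash
# register this container in etcd so a balancer can discover it
ip=$(hostname -i)
etcdctl set "/services/myweb/${HOSTNAME}" "${ip}:8080" --ttl 60
# then hand control to the real service process (placeholder command)
exec myweb --port 8080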

Multiple Process

● supervisord (a minimal sketch follows below)
● runit
● fake systemd
  ○ see the FreeIPA docker image
  ○ https://github.com/adelton/docker-freeipa
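A minimal supervisord sketch running two processes in one container, e.g. nginx plus php-fpm as in the "reality" slide above (paths and program options are illustrative):

cat > /etc/supervisord.conf <<'EOF'
[supervisord]
nodaemon=true                              ; supervisord stays in the foreground (PID 1)

[program:php-fpm]
command=/usr/sbin/php-fpm --nodaemonize
autorestart=true

[program:nginx]
command=/usr/sbin/nginx -g 'daemon off;'
autorestart=true
EOF
# in the Dockerfile: CMD ["/usr/bin/supervisord", "-c", "/etc/supervisord.conf"]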

Logging/Monitoring

● ctop
● cadvisor: https://github.com/google/cadvisor (see the run sketch below)
● logstash
● logspout: https://github.com/gliderlabs/logspout
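For example, cadvisor's quick start runs it as a container with the host mounted read-only (this follows the project's documented invocation at the time; check the linked repo for the current flags):

docker run -d --name cadvisor -p 8080:8080 \
  -v /:/rootfs:ro \
  -v /var/run:/var/run:rw \
  -v /sys:/sys:ro \
  -v /var/lib/docker/:/var/lib/docker:ro \
  google/cadvisor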

Logging/Monitoring

For nginx logging, use "error_log /dev/stderr;" and "access_log /dev/stdout;" with daemon off. For example, in supervisord:

[program:nginx]
directory=/var/lib/nginx
command=/usr/sbin/nginx -g 'daemon off;'
user=root
autostart=true
autorestart=true
redirect_stderr=false
stdout_logfile=/dev/stdout
stderr_logfile=/dev/stderr
stdout_logfile_maxbytes=0
stderr_logfile_maxbytes=0

Logging/Monitoring

Web UI:

● tumtum
● cockpit-project.org
● Shipyard
● FleetUI
● CoreGI
● SUSE/Portus

Web UI - cockpit-project

Web UI - shipyard

Web UI - tumtum

Building Docker Images

● Dockerfile and "docker build -t myrepo/myapp ." (a minimal sketch follows after this list)
  ○ I have a proposal using pivot root inside the Dockerfile (docker build would build the build environment, then use another fresh, small container as the target, copy the build result over, and pivot). The Docker builder is frozen, but details are here.
● Dockramp
  ○ https://github.com/jlhawn/dockramp
  ○ an external builder written in golang
  ○ uses only the docker API (needs the new "cp" API)
  ○ can implement my proposal

● Atomic App / Nulecule / OpenShift have their own way
● use Fabric/Ansible to build
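A minimal sketch of the Dockerfile route (base image, package, and repository names are only placeholders):

cat > Dockerfile <<'EOF'
FROM centos:7
RUN yum install -y nginx && yum clean all
EXPOSE 80
CMD ["/usr/sbin/nginx", "-g", "daemon off;"]
EOF
docker build -t myrepo/myapp .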

Simple Duct tape launching.

Systemd @ magic, e.g. have [email protected]:

# systemctl start container@myweb

[Unit]
Description=Docker Container for %I
After=docker.service
Requires=docker.service

[Service]
Type=simple
# the leading "-" lets a pre-step fail without aborting the start (e.g. on first run)
ExecStartPre=-/usr/bin/bash -c "/usr/bin/mkdir /var/lib/docker/vfs/dir/%i || :"
ExecStartPre=-/usr/bin/docker kill %i
ExecStartPre=-/usr/bin/docker rm %i
ExecStart=/usr/bin/docker run -i \
    --name="%i" \
    --env-file=/etc/sysconfig/container/%i.rc \
    --label-file=/etc/sysconfig/container/%i.labels \
    -v /var/lib/docker/vfs/dir/%i:/data \
    myrepo/%i

Seriously? Docker on production!

“Docker is about running random code downloaded from the Internet and running it as root.”[1][2]

-- a Red Hat engineer

Source 1, source 2

Host your own private registry

● host a private docker registry (so you don't download random code from random people on the internet)
● use HTTPS, be your own certificate authority, and trust it on your docker hosts
● use registry version 2 and apply ACLs on images
  ○ URLs in v2 look like /v2/<name>/blobs/<digest>
● use HTTP Basic Auth (apache/nginx) with whatever back-end you like (e.g. LDAP or just plain files)
● have a read-only user as your "deployer" on servers
● have a build server push the images (not developers)

A minimal setup sketch follows below.
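A rough setup sketch: run registry v2 with TLS and push from the build server (hostname, ports, paths, and certificate files are placeholders; the REGISTRY_HTTP_TLS_* variables follow the registry image's documented configuration):

docker run -d --name registry -p 5000:5000 \
  -v /opt/registry/certs:/certs \
  -e REGISTRY_HTTP_TLS_CERTIFICATE=/certs/registry.example.com.crt \
  -e REGISTRY_HTTP_TLS_KEY=/certs/registry.example.com.key \
  registry:2

# on the build server
docker tag myrepo/myapp registry.example.com:5000/myrepo/myapp
docker push registry.example.com:5000/myrepo/myapp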

“Containers do not contain.”

-- Dan Walsh (Red Hat / SELinux)

Seriously? Docker on production!

In May 2015, a catastrophic vulnerability (VENOM) in KVM/Xen affected almost every datacenter.

Fedora/RHEL/CentOS had been secure because of SELinux/sVirt (since 2009)

AppArmor was a joke that is not funny.

http://www.zdnet.com/article/venom-security-flaw-millions-of-virtual-machines-datacenters/
https://fedoraproject.org/wiki/Features/SVirt_Mandatory_Access_Control

Docker and the next VENOM?

sVirt does support Docker.

What happens in a container stays in the container.

● Drop privileges as quickly as possible (see the sketch after this list)
● Run your services as non-root whenever possible
  ○ apache needs root to open port 80, but you are going to proxy the port anyway, so run it as non-root directly
● Treat root within a container as if it is root outside of the container
● Do not give CAP_SYS_ADMIN to a container (it's equivalent to host root)
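A sketch of how those recommendations map to docker run flags (the uid:gid, capability set, and image name are illustrative):

# run as a non-root uid:gid, drop every capability and add back only what is
# really needed, and keep the container filesystem read-only
docker run -d --name web \
  --user 1000:1000 \
  --cap-drop=ALL \
  --cap-add=NET_BIND_SERVICE \
  --read-only \
  myrepo/myweb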

Recommendations

Setting proper storage backend

● docker info | grep 'Storage Driver'
● possible drivers/backends:
  ○ aufs: a union filesystem of such low quality that it was never part of the official Linux kernel
  ○ overlay: a modern union filesystem that was accepted into kernel 4.0 (too young)
  ○ zfs: a Linux port of the well-established Solaris filesystem; the quality of the port and driver is still questionable
  ○ btrfs: the most featureful Linux filesystem; too early to use in production
  ○ devicemapper (thin provisioning): well-established Red Hat technology, already in production (e.g. LVM)
● do not use the default loopback config in EL (RHEL/CentOS/Fedora):
  ○ WARNING: No --storage-opt dm.thinpooldev specified, using loopback; this configuration is strongly discouraged for production use
● in EL, edit /etc/sysconfig/docker-storage
● http://developerblog.redhat.com/2014/09/30/overview-storage-scalability-docker/
● http://www.projectatomic.io/blog/2015/06/notes-on-fedora-centos-and-docker-storage-drivers/
● http://www.projectatomic.io/docs/docker-storage-recommendation/

Storage backend (using the script):

man docker-storage-setup
vim /etc/sysconfig/docker-storage-setup
docker-storage-setup

● DEVS="/dev/sdb /dev/sdc"
  ○ a list of unpartitioned devices to be used or added
  ○ if you are adding more, remove the old ones
  ○ required if VG is specified and does not exist
● VG="<my-volume-group>"
  ○ set to empty to use unallocated space in root's VG

A sample file follows below.
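A sample /etc/sysconfig/docker-storage-setup sketch (device and volume-group names are placeholders):

cat > /etc/sysconfig/docker-storage-setup <<'EOF'
DEVS="/dev/sdb"
VG="docker-vg"
EOF
docker-storage-setup
systemctl restart docker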

Storage backend (manual):

pvcreate /dev/sdc
vgcreate direct-lvm /dev/sdc
lvcreate --wipesignatures y -n data direct-lvm -l 95%VG
lvcreate --wipesignatures y -n metadata direct-lvm -l 5%VG
dd if=/dev/zero of=/dev/direct-lvm/metadata bs=1M
vim /etc/sysconfig/docker-storage # to add the next line

DOCKER_STORAGE_OPTIONS=--storage-opt dm.metadatadev=/dev/direct-lvm/metadata --storage-opt dm.datadev=/dev/direct-lvm/data

systemctl restart docker

Docker Volumes

Never put data inside the container (logs, database files, etc.). Data should go to mounted volumes.

You can mount folders or files. You can mount RW or RO.

You can have a busybox container with volumes and mount all volumes of that container in another container.

# docker run -d --volumes-from my_vols --name db1 training/postgres
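The my_vols data-volume container referenced above could be created first like this (the volume path is only an example; the container exits immediately, but its volume remains usable via --volumes-from):

# docker run -v /data --name my_vols busybox true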

Everything is a child process of a single daemon. Seriously!

Seriously? Docker on production!

Docker's process model is flawed

The Docker daemon launches containers as attached child processes. If the daemon dies, all of them collapse in a fatal catastrophe. Moreover, the docker daemon has too many moving parts; for example, fetching images is done inside the daemon. A bad network connection while fetching an image, or an evil image, might bring down all containers.

https://github.com/docker/docker/issues/15328

An evil client, an evil request, an evil image, an evil container, or an evil "inspect" template might cause the docker daemon to go crazy and risk all containers.

Docker's process model is flawed

CoreOS introduced a saner process model in rkt (Rocket), an alternative docker-like container runtime. Red Hat contributes to both docker and rkt as both have high potential. rkt is just a container runtime where you can run containers as non-root and without being a child of anything (e.g. relying on systemd/D-Bus). Rocket is not a platform (no layers, no image registry service, etc.).

https://github.com/coreos/rkt/

Docker might evolve to fix this; dockerlite is a shell script that uses LXC and BTRFS.

https://github.com/docker/dockerlite

For now just design your cluster to fail and use anti-affinity

Networking.

Linux bridges, iptables NATing, and exporting ports through a young proxy written in golang. Seriously!

Seriously? Docker on production!

Docker Networking now

Docker uses Linux bridges, which only connect containers within the same host. Containers on host A can't talk to containers on host B! And it uses NAT to talk to the outside world:

# iptables -t nat -A POSTROUTING -s 172.17.0.0/16 -j MASQUERADE

Exported ports in docker are handled by a docker-proxy process (written in go); check "netstat -tulnp".

The now-deprecated geard used to connect multiple hosts using NAT, and configured each container to talk to localhost for everything (e.g. talk to a localhost MySQL port, and NAT takes it to the MySQL container on another host):

# iptables -t nat -A PREROUTING -d ${local_ip}/32 -p tcp -m tcp --dport ${local_port} -j DNAT --to-destination ${remote_ip}:${remote_port}
# iptables -t nat -A OUTPUT -d ${local_ip}/32 -p tcp -m tcp --dport ${local_port} -j DNAT --to-destination ${remote_ip}:${remote_port}
# iptables -t nat -A POSTROUTING -o eth0 -j SNAT --to-source ${container_ip}

Docker Networking now

A similar approach is to manually hard-code and divide the docker bridges on each host as 172.16.X.y, where X is the host and y is the container, and use NAT to deliver packets (or 172.X.y.y, depending on the number of hosts and the number of containers per host).

http://blog.sequenceiq.com/blog/2014/08/12/docker-networking/

Given a remote host with IP 192.168.40.12 whose docker0 bridge is on 172.17.52.0/24, and a local host with docker0 on 172.17.51.0/24, type on the latter host:

route add -net 172.17.52.0 netmask 255.255.255.0 gw 192.168.40.12
iptables -t nat -F POSTROUTING # or pass "--iptables=false" to the docker daemon
iptables -t nat -A POSTROUTING -s 172.17.51.0/24 ! -d 172.17.0.0/16 -j MASQUERADE

Docker Networking Alternatives

● OpenVSwitch (well-established production technology)
● Flannel (young project from CoreOS, written in golang)
● Weave (https://github.com/weaveworks/weave)
● Calico (https://github.com/projectcalico/calico)

Docker Networking Alternatives: OpenVSwitch

Just like a physical switch, this virtual switch connects different hosts.

One setup is to connect each container to OVS without a bridge: "docker run --net=none", then use the ovs-docker script (a sketch follows).
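A rough sketch of that first setup (bridge name, addressing, and container/image names are placeholders; the ovs-docker helper ships with the Open vSwitch sources):

docker run -d --net=none --name web1 myrepo/myweb
ovs-vsctl add-br sw0                                            # create the OVS bridge
ovs-docker add-port sw0 eth0 web1 --ipaddress=172.18.0.2/24     # attach the container to it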

The other setup just replaces the docker0 bridge with one that is connected to OVS (no change needs to be done to each container).

Docker Networking Alternatives

# ovs-vsctl add-br sw0

or via /etc/sysconfig/network-scripts/ifcfg-sw0, then:

# ip link add veth_s type veth peer name veth_c
# brctl addif docker0 veth_c
# ovs-vsctl add-port sw0 veth_s

see /etc/sysconfig/network-scripts/ifup-ovs

http://git.openvswitch.org/cgi-bin/gitweb.cgi?p=openvswitch;a=blob_plain;f=rhel/README.RHEL;hb=HEAD

Networking: the future

In the future, libnetwork will allow docker to use SDN plugins. Docker acquired SocketPlane to implement this.

https://github.com/docker/libnetwork
https://github.com/docker/libnetwork/blob/master/ROADMAP.md

Introducing Docker Glue

● docker-glue: a modular, pluggable daemon that can run handlers and scripts
● docker-balancer: a standalone daemon that just updates haproxy (a special case of glue)

https://github.com/muayyad-alsadi/docker-glue

It autoconfigures haproxy to pass traffic to your containers.

It uses docker labels ("-l") to specify the HTTP host or URL prefix:

# docker run -d --name wp1 -l glue_http_80_host='wp1.example.com' mywordpress/wordpress
# docker run -d --name wp2 -l glue_http_80_host='wp2.example.com' mywordpress/wordpress
# docker run -d --name panel -l glue_http_80_host=example.com -l glue_http_80_prefix=dashboard/ myrepo/control-panel

Introducing Docker Glue

Run anything based on docker events (test.ini):

[handler]
class=DockerGlue.handlers.exec.ScriptHandler
events=all
enabled=1
triggers-none=0

[params]
script=test-handler.sh
demo-option=some value

It will run:

test-handler.sh /path/to/test.ini <EVENT> <CONTAINER_ID>

Introducing Docker Glue

#!/bin/bash

cd `dirname $0`

function error() {
    echo "$@"
    exit -1
}

[ $# -ne 3 ] && error "Usage `basename $0` config.ini status container_id"
ini="$1"
status="$2"
container_id="$3"
ini_demo_option=$( crudini --inplace --get $ini params demo-option 2>/dev/null || : )
echo "`date +%F` container_id=[$container_id] status=[$status] ini_demo_option=[$ini_demo_option]" >> /tmp/docker-glue-test.log

Resources

● http://opensource.com/business/14/7/docker-security-selinux

● http://opensource.com/business/14/9/security-for-docker

● http://www.projectatomic.io/blog/2014/09/yet-another-reason-containers-don-t-contain-kernel-keyrings/

● http://developerblog.redhat.com/2014/11/03/are-docker-containers-really-secure-opensource-com/

● https://www.youtube.com/watch?v=0u9LqGVK-aI
● https://github.com/muayyad-alsadi/docker-glue
● http://blog.sequenceiq.com/blog/2014/08/12/docker-networking/
● https://docs.docker.com/userguide/dockervolumes/
● https://docs.docker.com/userguide/dockerlinks/
● https://docs.docker.com/articles/networking/
● https://github.com/openvswitch/ovs/blob/master/INSTALL.Docker.md

● http://radar.oreilly.com/2015/10/swarm-v-fleet-v-kubernetes-v-mesos.html

Q & A