App container rkt


Transcript of App container rkt

Page 1: App container rkt

github.com/coreos/rkt

App Container
github.com/appc

Page 2: App container rkt

Yifan Gu
github.com/yifan-gu
@yifan7702

Page 3: App container rkt

With containers, what does a "Linux distro" mean?

Page 4: App container rkt

KERNEL, SYSTEMD, SSH
PYTHON, JAVA, NGINX, MYSQL, OPENSSL
APP

(Diagram: the label "distro" is repeated across all of these components.)

Page 5: App container rkt

KERNEL, SYSTEMD, SSH
LXC/DOCKER/RKT
PYTHON, JAVA, NGINX, MYSQL, OPENSSL
APP

(Diagram: the label "distro" is repeated across the components.)

Page 6: App container rkt

The Bad

$ python --version
Python 2.7.6
$ python app-requiring-python3.py

$ python --version
Python 3.4.3
$ python app-requiring-python2.py

package collisions

Page 7: App container rkt

The Bad

$ cat /etc/os-release | grep ^NAME=
NAME=Fedora

$ rpm -i package-from-suse.rpm
file /foo from install of package-from-suse.rpm conflicts with file from package-from-fedora

dependency namespacing

Page 8: App container rkt

The Good

$ gpg --list-only --import \
    /etc/apt/trusted.gpg.d/*
gpg: key 2B90D010: public key "Debian Archive Automatic Signing Key (8/jessie) <[email protected]>" imported
gpg: Total number processed: 1
gpg: imported: 1 (RSA: 1)
gpg: no ultimately trusted keys found

users control trust

Page 9: App container rkt

The Good

$ rsync ftp.us.debian.org::debian \
    /srv/mirrors/debian

$ dpkg -i \
    /srv/mirrors/debian/kernel-image-3.16.0-4-amd64-di_3.16.7-ckt9-2_amd64.udeb

trivial mirroring and hosting

Page 10: App container rkt

Linux Packages 2.0
.deb and .rpm for containers

Page 11: App container rkt

Container vs VM?
● Lightweight (100s vs 10s)
● Easy to deploy
● Less isolation?

Page 12: App container rkt

What is a container?
● Packaging your apps with their deps
● Running in isolation (using namespaces, cgroups)

Page 13: App container rkt

Why would I want to use it?
● Deploy faster
● Run faster, run everywhere
● Run in isolation

Page 14: App container rkt

App Container (appc)
github.com/appc

Page 15: App container rkt

appc != rkt

Page 16: App container rkt

Application Containers
self-contained, portable (decoupled from operating system)
isolated (memory, network, …)

Page 17: App container rkt

appc principles
Why are we doing this?

Page 18: App container rkt

Open
Independent GitHub organisation
Contributions from Cloud Foundry, Mesosphere, Google, Red Hat (and many others!)

Page 19: App container rkt

Simple but efficient
Simple to understand and implement, but with an eye to optimisation (e.g. content-based caching)

Page 20: App container rkt

Secure
Cryptographic image addressing
Image signing and encryption
Container identity

Page 21: App container rkt

Standards-based
Well-known tools (tar, gzip, gpg, http), extensible with modern technologies (bittorrent, xz)

Page 22: App container rkt

Composable
Integrate with existing systems
Non-prescriptive about build workflows
OS/architecture agnostic

Page 23: App container rkt

appc components

Page 24: App container rkt

Image Format
Application Container Image
tarball of rootfs + manifest
uniquely identified by ImageID (hash)
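A minimal sketch of the idea on this slide: pack a manifest and a rootfs into a tarball, then derive an identifier from the hash of its content. The manifest fields, the "sha512-" prefix and the helper names build_image and image_id are assumptions for illustration; the appc spec is the authority on the real layout and hashing rules.

```python
import hashlib
import io
import json
import tarfile


def build_image(manifest: dict, files: dict) -> bytes:
    """Pack a manifest and rootfs files into an uncompressed tar archive."""
    entries = {"manifest": json.dumps(manifest, sort_keys=True).encode()}
    for path, data in files.items():
        entries["rootfs/" + path.lstrip("/")] = data

    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name in sorted(entries):
            info = tarfile.TarInfo(name=name)
            info.size = len(entries[name])
            info.mtime = 0  # fixed timestamp so identical input gives identical bytes
            tar.addfile(info, io.BytesIO(entries[name]))
    return buf.getvalue()


def image_id(tar_bytes: bytes) -> str:
    """Content hash of the archive, written as '<algorithm>-<hex digest>' (assumed format)."""
    return "sha512-" + hashlib.sha512(tar_bytes).hexdigest()


# Hypothetical manifest; field values are illustrative only.
manifest = {"acKind": "ImageManifest", "acVersion": "0.7.0", "name": "example.com/http-server"}
image = build_image(manifest, {"bin/server": b"#!/bin/sh\necho hello\n"})
print(image_id(image)[:23] + "...")
```

Because the identifier depends only on the bytes of the archive, two hosts that hold the same image compute the same ID, which is what makes content-based caching possible.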

Page 25: App container rkt

Image Discovery
App name → artifact

example.com/http-server
coreos.com/etcd

HTTPS + HTML
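A rough sketch of how "App name → artifact" over HTTPS + HTML could work: fetch a page for the app name, read discovery hints from meta tags, and expand a URL template. The meta tag name "ac-discovery", the template placeholders, and the sample page are assumptions made for this example; no network access is performed, and the appc spec defines the real rules.

```python
from html.parser import HTMLParser


class DiscoveryParser(HTMLParser):
    """Collect (prefix, url_template) pairs from <meta name="ac-discovery"> tags."""

    def __init__(self):
        super().__init__()
        self.templates = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta" and attr.get("name") == "ac-discovery":
            parts = (attr.get("content") or "").split()
            if len(parts) == 2:
                self.templates.append((parts[0], parts[1]))


def resolve(app_name: str, html: str, **labels: str) -> list:
    """Expand every template whose prefix matches the app name."""
    parser = DiscoveryParser()
    parser.feed(html)
    urls = []
    for prefix, template in parser.templates:
        if app_name.startswith(prefix):
            url = template.replace("{name}", app_name)
            for key, value in labels.items():
                url = url.replace("{" + key + "}", value)
            urls.append(url)
    return urls


# Hypothetical page that a server at example.com might return.
page = """
<html><head>
<meta name="ac-discovery"
      content="example.com https://storage.example.com/{os}/{arch}/{name}-{version}.{ext}">
</head></html>
"""
print(resolve("example.com/http-server", page,
              version="1.0.0", os="linux", arch="amd64", ext="aci"))
```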

Page 26: App container rkt

Executor (Runtime)
grouped applications
runtime environment
isolators
networking

Page 27: App container rkt

Metadata Service
http://$AC_METADATA_URL/acMetadata
container metadata
container identity (HMAC verification)
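To make "HMAC verification" concrete, here is a generic sketch of signing content with a secret key and checking the signature afterwards. The key, the message, the choice of SHA-512 and the function names are assumptions; the slide does not show how the metadata service exposes signing or verification.

```python
import hashlib
import hmac


def sign(key: bytes, content: bytes) -> str:
    """Return a hex HMAC of the content (SHA-512 is an assumed choice)."""
    return hmac.new(key, content, hashlib.sha512).hexdigest()


def verify(key: bytes, content: bytes, signature: str) -> bool:
    """Check a signature using a constant-time comparison."""
    return hmac.compare_digest(sign(key, content), signature)


# Hypothetical per-pod key and message.
pod_key = b"secret-held-by-the-metadata-service"
message = b"I am pod 1234"
signature = sign(pod_key, message)

print(verify(pod_key, message, signature))
print(verify(pod_key, b"I am pod 9999", signature))
```

Only a party that holds the key can produce a valid signature, so a verified signature ties the content to the identity that owns that key.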

Page 28: App container rkt

ACE validator
is this executor compliant with the spec?

$EXECUTOR run ace_validator.aci

Page 29: App container rkt

appc community

Page 30: App container rkt

github.com/cdaylward/libappc

C++ library for working with app containers

Page 31: App container rkt

github.com/cdaylward/nosecone

C++ executor for running app containers

Page 32: App container rkt

mesos (wip)
https://issues.apache.org/jira/browse/MESOS-2162

Page 33: App container rkt

github.com/3ofcoins/jetpack

FreeBSD Jails/ZFS-based executor (by @mpasternacki)

Page 34: App container rkt

github.com/sgotti/acido

ACI toolkit (build ACIs from ACIs)

Page 35: App container rkt

github.com/appc/docker2aci

docker2aci busybox/latest

docker2aci quay.io/coreos/etcd

Page 36: App container rkt

github.com/appc/goaci

goaci github.com/coreos/etcd

Page 37: App container rkt

appc spec in a nutshell

- Image Format (ACI)
  - what does an application consist of?
- Image Discovery
  - how can an image be located?
- Pods
  - how can applications be grouped and run?
- Executor (runtime)
  - what does the execution environment look like?

Page 38: App container rkt

appc status
Stabilising towards first backwards-compatible release

Page 39: App container rkt

github.com/coreos/rkt

Page 40: App container rkt

rkt
an implementation of appc

Page 41: App container rkt

Open standards. Composability.

rkt

Page 42: App container rkt

rkt
a modern, secure container runtime

Page 43: App container rkt

rkt
simple CLI tool

Page 44: App container rkt

simple CLI tool
golang + Linux
self-contained
init system/distro agnostic

Page 45: App container rkt

simple CLI tool
no daemon
no API*
apps run directly under spawning process

Page 46: App container rkt

bash

rkt

application(s)

Page 47: App container rkt

runit

rkt

application(s)

Page 48: App container rkt

systemd

rkt

application(s)

Page 49: App container rkt

rkt internals
modular architecture
execution divided into stages
stage0 → stage1 → stage2

Page 50: App container rkt

rkt (stage0)

pod (stage1)

bash/runit/systemd/... (invoking process)

app1 (stage2)

app2 (stage2)

Page 51: App container rkt

rkt (stage0)

pod (stage1)

bash/runit/systemd/... (invoking process)

app1 (stage2)

app2 (stage2)

Page 52: App container rkt

stage0 (rkt binary)
discover, fetch, manage application images
set up pod filesystems
commands to manage pod lifecycle

Page 53: App container rkt

stage0 (rkt binary)

- rkt run

- rkt prepare

- rkt run-prepared

- rkt list

- rkt status

- ...

- rkt fetch

- rkt trust

- rkt image list

- rkt image export

- rkt image gc

- ...

Page 54: App container rkt

stage0 (rkt binary)
file-based locking for concurrent operation (e.g. rkt gc, rkt list for pods)
database + reference counting for images

Page 55: App container rkt

rkt (stage0)

pod (stage1)

bash/runit/systemd/... (invoking process)

app1 (stage2)

app2 (stage2)

Page 56: App container rkt

rkt (stage0)

pod (stage1)

bash/runit/systemd/... (invoking process)

app1 (stage2)

app2 (stage2)

Page 57: App container rkt

stage1
execution environment for pods
app process lifecycle management
isolators

Page 58: App container rkt

stage1 (swappable)
binary ABI with stage0
stage0 calls execve(stage1)

Page 59: App container rkt

stage1 (swappable)

● default implementation
  ○ based on systemd-nspawn+systemd
  ○ Linux namespaces + cgroups for isolation
● kvm implementation
  ○ based on lkvm+systemd
  ○ hardware virtualisation for isolation

● others?

Page 60: App container rkt

rkt (stage0)

pod (stage1)

bash/runit/systemd/... (invoking process)

app1 (stage2)

app2 (stage2)

Page 61: App container rkt

rkt (stage0)

pod (stage1)

bash/runit/systemd/... (invoking process)

app1 (stage2)

app2 (stage2)

Page 62: App container rkt

stage2
actual app execution
independent filesystems (chroot)
shared namespaces, volumes, IPC, ...

Page 63: App container rkt

rkt + systemd
The different ways rkt integrates with systemd

Page 64: App container rkt

rkt

Page 65: App container rkt

rkt

systemd (on host) (systemctl)

Page 66: App container rkt

systemd (on host)
optional
"systemctl stop" just works
socket activation

pod-level isolators: CPUShares, MemoryLimit

Page 67: App container rkt

rkt

systemd-nspawn

systemd (on host) (systemctl)

Page 68: App container rkt

systemd-nspawn
default stage1, besides lkvm

taking care of most of the low-level things

Page 69: App container rkt

rkt

systemd-nspawn

systemd

systemd (on host) (systemctl)

container

Page 70: App container rkt

systemd
pid1
service files
socket activation

Page 71: App container rkt

rkt

systemd-nspawn

application

systemd

systemd (on host) (systemctl)

container

Page 72: App container rkt

application
app-level isolators: CPUShares, MemoryLimit

chrooted

Page 73: App container rkt

rkt

systemd-nspawn

application

systemd-journald (journalctl)

logs

systemd

systemd (on host) (systemctl)

container

Page 74: App container rkt

systemd-journald
no changes in apps required
logs in the container
available from the host with journalctl -m / -M

Page 75: App container rkt

rkt

systemd-nspawn

application

systemd-machined (machinectl)

systemd-journald (journalctl)

logs

systemd

register

systemd (on host) (systemctl)

container

Page 76: App container rkt

systemd-machined
register on distros using systemd

machinectl {show,status,poweroff…}

Page 77: App container rkt

rkt

systemd-nspawn

application

systemd-machined (machinectl)

systemd-journald (journalctl)

logs

systemd

register

systemd (on host) (systemctl)

container

Page 78: App container rkt

cgroups

Page 79: App container rkt

What’s a control group? (cgroup)

● group processes together
● organised in trees
● applying limits to them as a group

Page 80: App container rkt

cgroups

Page 81: App container rkt

cgroup API

/sys/fs/cgroup/*/
/proc/cgroups
/proc/$PID/cgroup
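As a small illustration of the /proc/$PID/cgroup part of this API: each line of that file has the form "hierarchy-ID:controller-list:path". The sketch below parses such text into a mapping from controller to cgroup path. The sample content is made up for the example; on a real system the input would be read from /proc/self/cgroup or similar.

```python
def parse_proc_cgroup(text: str) -> dict:
    """Map each controller name to the cgroup path the process belongs to."""
    result = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        # Split on the first two colons only: the path itself may contain ':'.
        _hierarchy_id, controllers, path = line.split(":", 2)
        for controller in controllers.split(","):
            result[controller] = path
    return result


# Made-up sample in the same format as /proc/$PID/cgroup.
sample = """\
4:memory:/machine.slice
3:cpu,cpuacct:/machine.slice
1:name=systemd:/system.slice/NetworkManager.service
"""

for controller, path in parse_proc_cgroup(sample).items():
    print(controller, "->", path)
```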

Page 82: App container rkt

List of cgroup controllers

/sys/fs/cgroup/
├─ cpu
├─ devices
├─ freezer
├─ memory
├─ ...
└─ systemd

Page 83: App container rkt

/sys/fs/cgroup/
├─ systemd
│  ├─ user.slice
│  ├─ system.slice
│  │  ├─ NetworkManager.service
│  │  │  └─ cgroup.procs
│  │  ...
│  └─ machine.slice

How systemd units use cgroups

Page 84: App container rkt

How systemd units use cgroups w/ containers

/sys/fs/cgroup/
├─ systemd
│  ├─ user.slice
│  ├─ system.slice
│  └─ machine.slice
│     └─ machine-rkt….scope
│        └─ system.slice
│           └─ app.service
│...
├─ cpu
│  ├─ user.slice
│  ├─ system.slice
│  └─ machine.slice
│     └─ machine-rkt….scope
│        └─ system.slice
│           └─ app.service
├─ memory
│  ├─ user.slice
│  ├─ system.slice
│  └─ machine.slice
...

Page 85: App container rkt

cgroups mounted in the container

/sys/fs/cgroup/
├─ systemd
│  ├─ user.slice
│  ├─ system.slice
│  └─ machine.slice
│     └─ machine-rkt….scope
│        └─ system.slice
│           └─ app.service
│...
├─ cpu
│  ├─ user.slice
│  ├─ system.slice
│  └─ machine.slice
│     └─ machine-rkt….scope
│        └─ system.slice
│           └─ app.service
├─ memory
│  ├─ user.slice
│  ├─ system.slice
│  └─ machine.slice
...

RW

RO

Page 86: App container rkt

Example: memory isolator

Application Image Manifest:
"limit": "500M"

systemd service file:
[Service]
ExecStart=
MemoryLimit=500M

systemd action:
write to memory.limit_in_bytes
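The chain on this slide (manifest limit → service file → memory.limit_in_bytes) can be sketched as below. It assumes the limit string is copied into the unit unchanged, as the slide shows, and that systemd reads the K/M/G/T suffixes with base 1024. The function names are hypothetical and this is not rkt's actual code.

```python
# systemd reads size suffixes with base 1024.
_UNITS = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3, "T": 1024 ** 4}


def memory_unit_lines(limit: str) -> str:
    """Render the service-file fragment for a memory isolator limit."""
    return "[Service]\nMemoryLimit=" + limit


def systemd_size_to_bytes(value: str) -> int:
    """Number of bytes systemd would write to memory.limit_in_bytes."""
    suffix = value[-1]
    if suffix in _UNITS:
        return int(value[:-1]) * _UNITS[suffix]
    return int(value)


print(memory_unit_lines("500M"))
print(systemd_size_to_bytes("500M"))
```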

Page 87: App container rkt

Example: CPU isolator

Application Image Manifest:
"limit": "500m"

systemd service file:
[Service]
ExecStart=
CPUShares=512

systemd action:
write to cpu.shares
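The numbers on this slide fit a simple conversion: "500m" means 500 millicores (half a CPU), and if one full CPU corresponds to 1024 shares, half a CPU is 512. The sketch below encodes that arithmetic. The 1024-shares-per-CPU scale is an assumption that matches the slide's example; the function name is hypothetical and rkt's own formula may differ.

```python
def cpu_shares(limit: str) -> int:
    """Convert a CPU limit such as '500m' or '1' into a CPUShares value."""
    if limit.endswith("m"):
        millicores = int(limit[:-1])  # '500m' -> 500 thousandths of a CPU
    else:
        millicores = int(float(limit) * 1000)  # '1' -> 1000 thousandths
    # Assumed scale: 1000 millicores (one CPU) == 1024 shares.
    return millicores * 1024 // 1000


print("CPUShares=%d" % cpu_shares("500m"))
```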

Page 88: App container rkt

Unified cgroup hierarchy

● Multiple hierarchies:
  ○ one cgroup mount point for each controller (memory, cpu, etc.)
  ○ flexible but complex
  ○ cannot remount with a different set of controllers
  ○ difficult to give to containers in a safe way
● Unified hierarchy:
  ○ cgroup filesystem mounted only one time
  ○ still in development in Linux: mount with option "__DEVEL__sane_behavior"
  ○ initial implementation in systemd-v226 (September 2015)
  ○ no support in rkt yet

Page 89: App container rkt

rkt: a few other things

- rkt and security
- rkt API service (new!)
- rkt networking
- rkt and user namespaces
- rkt and production

Page 90: App container rkt

rkt and security
"secure by default"

Page 91: App container rkt

rkt security

- image signature verification
- privilege separation
  - e.g. fetch images as non-root user
- SELinux integration
- kernel keyring integration (soon)
- lkvm stage1 for true hardware isolation

Page 92: App container rkt

rkt API service (new!)
optional, gRPC-based API daemon
exposes information on pods and images
runs as unprivileged user

easier integration with other projects

Page 93: App container rkt

rkt networking
plugin-based

Container Networking Interface (CNI)

Page 94: App container rkt

Container Runtime (e.g. rkt)

veth macvlan ipvlan OVS

Container Networking Interface (CNI)

Page 95: App container rkt

Networking, the rkt way

Page 96: App container rkt

Network tooling

● Linux can create pairs of virtual net interfaces

● Can be linked in a bridge

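(Diagram: container1 and container2 each have an eth0, paired with veth1 and veth2 respectively; the veth interfaces are linked in a bridge, and traffic leaves through the host's eth0 with IP masquerading via iptables.)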

Page 97: App container rkt

rkt and user namespaces

Page 98: App container rkt

History of Linux namespaces
✓ 1991: Linux
✓ 2002: namespaces in Linux 2.4.19
✓ 2008: LXC
✓ 2011: systemd-nspawn
✓ 2013: user namespaces in Linux 3.8
✓ 2013: Docker
✓ 2014: rkt

… development still active

Page 99: App container rkt

Why user namespaces?

● Better isolation
● Run applications which would need more capabilities
● Per user limits
● Future?
  ○ Unprivileged containers: possibility to have container without root

Page 100: App container rkt

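User ID ranges

(Diagram: the full 32-bit range runs from 0 to 4,294,967,295, with the host marked at 0 to 65535; container 1 (0 to 65535) and container 2 are drawn as separate ranges, each starting at 0.)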

Page 101: App container rkt

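User ID mapping
/proc/$PID/uid_map: "0 1048576 65536"

(Diagram: the container's 65536 UIDs, starting at 0, map to a block of 65536 host UIDs starting at 1048576; host UIDs on either side of that block are unmapped.)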
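A short sketch of how a uid_map line such as "0 1048576 65536" translates IDs: the three fields are the first UID inside the namespace, the first UID outside it, and the length of the range. The function names are hypothetical; the mapping string is the one from the slide.

```python
def parse_uid_map(text: str) -> list:
    """Parse uid_map text into (inside_start, outside_start, length) tuples."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 3:
            inside, outside, length = (int(f) for f in fields)
            entries.append((inside, outside, length))
    return entries


def to_host_uid(entries: list, container_uid: int):
    """Translate a container UID to a host UID, or None if it is unmapped."""
    for inside, outside, length in entries:
        if inside <= container_uid < inside + length:
            return outside + (container_uid - inside)
    return None


uid_map = parse_uid_map("0 1048576 65536")
print(to_host_uid(uid_map, 0))      # root inside the container
print(to_host_uid(uid_map, 1000))   # an ordinary user inside the container
print(to_host_uid(uid_map, 65536))  # outside the mapped range
```

With this mapping, root in the container (UID 0) is an unprivileged UID on the host, which is the isolation benefit user namespaces are after.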

Page 102: App container rkt

Problems with container images

(Diagram: container 1 and container 2 each have a container filesystem made of an overlayfs "upper" directory on top of an Application Container Image (ACI); the ACI is downloaded from a web server.)

Page 103: App container rkt

Problems with container images

● Files UID / GID
● rkt currently only supports user namespaces without overlayfs
  ○ Performance loss: no COW from overlayfs
  ○ "chown -R" for every file in each container

Page 104: App container rkt

Problems with volumes

(Diagram: a host tree with /, /home, /var and user; a container tree with /, /data and /my-app; /data is a bind mount (rw / ro).)

● mounted in several containers
● No UID translation
● Dynamic UID maps

Page 105: App container rkt

User namespace and filesystem problem

● Possible solution: add options to mount() to apply a UID mapping

● rkt would use it when mounting:
  ○ the overlay rootfs
  ○ volumes

● Idea suggested on kernel mailing lists

Page 106: App container rkt

rkt and production

- still pre-1.0
- unstable (but stabilising) CLI and API
- explicitly not recommended for production
  - although some early adopters

Page 107: App container rkt

rkt v1.0.0
EOY (fingers crossed)
stable API
stable CLI

ready to use!

Page 108: App container rkt

Questions?

github.com/coreos/rkt

coreos.com/careers (soon in Berlin!)

Join us!