Linux containers-namespaces(Dec 2014)

Linux Containers with focus on namespaces. Created December 2014 for SUSE Linux Expert Forum. Ralf Dannert, Systems Engineer.

Page 1: Linux containers-namespaces(Dec 2014)

Linux Containers with focus on namespaces

created December 2014 for SUSE Linux Expert Forum

Ralf Dannert, Systems Engineer

Page 2: Linux containers-namespaces(Dec 2014)

Agenda

• Containers – clean slate approach

• Linux namespaces

Page 3: Linux containers-namespaces(Dec 2014)

Container examples

‒ Non-Linux:

‒ Solaris Containers (Zones), FreeBSD jails, WPAR (AIX)

‒ Linux:

‒ Vserver, OpenVZ, and FreeVPS

‒ Out of tree

‒ Process containers:

‒ OpenAFS PAGs: process authentication group membership

‒ Inheritance through fork()

‒ Cached token used for access control

‒ http://docs.openafs.org/AdminGuide/ch02s10.html

‒ Process containers: http://lwn.net/Articles/236529/

‒ Plan9:

‒ Everything as a filesystem (naming, access, protection methods)

‒ per-process namespaces

Page 4: Linux containers-namespaces(Dec 2014)

Linux containers - a conceptual artifice

‒ Namespaces

‒ Isolation, virtualization

‒ clone() and unshare()

‒ Resource containers

‒ manage the use of resources outside the operating system

‒ disk, network, memory and processor

‒ cgroups

‒ Capability bounding sets

‒ divide the privileges traditionally associated with superuser into distinct units

‒ limit the privileges available to containers, e.g. CAP_SYS_ADMIN

‒ Checkpoint/restart

‒ Requires all of the above

Page 5: Linux containers-namespaces(Dec 2014)

Containers – clean slate approach

Page 6: Linux containers-namespaces(Dec 2014)

Looking forward..

‒ 16 Aug 2006 Andrew Morton

‒ “Generally, I am not very comfortable merging any namespace/containerization/resource management patches into mainline until we have some sort of high-level agreed-to roadmap which will take us to an agreed-to-at-a-high-level destination.

‒ Now, I _am_ OK with merging useless infrastructure as long as all the prime stakeholders are OK with it. ..

‒ That would not be a useful patchset on its own because nothing _uses_ it..

‒ We don't normally merge useless patches, but this is a special case.

‒ So, (policy making on the fly), let's start merging the well-tested, well-isolated, low-overhead generally-agreed-to features into mainline.”

Page 7: Linux containers-namespaces(Dec 2014)

Multiple Instances of the Global Linux Namespaces (2006)

Eric W. Biederman, Linux Networx

‒ By adding additional namespaces .. we can, at a trivial cost, extend the UNIX concept and make novel uses of Linux possible

‒ Multiple instances of a namespace simply means that you can have two things with the same name.

‒ Implementation: an application with capabilities gets full control over a namespace and is still unable to escape it

‒ https://www.kernel.org/doc/ols/2006/ols2006v1-pages-101-112.pdf

Page 8: Linux containers-namespaces(Dec 2014)

History

http://www.golem.de/0610/48351.html

Page 9: Linux containers-namespaces(Dec 2014)

Coordinated Efforts 2007 Companies And Individuals Involved

‒ Arista Networks (Arastra): Eric Biederman - all, initial approach

‒ SGI: Paul Jackson - original cpusets, now part of cgroups

‒ Linux-VServer: Herbert Poetzl - namespaces, containers

‒ OpenVZ: Pavel Emelyanov, Kir Kolyshkin

‒ Google: Paul Menage - task containers, cgroups

‒ Zap project: Oren Laadan - C/R

‒ IBM: Serge E. Hallyn, Dave Hansen, Cedric Le Goater, Daniel Lezcano - ns, C/R; Balbir Singh, Srivatsa Vaddagiri - task containers

‒ Others: NEC, XtreemOS, Kerlabs, Bull, HP, PlanetLab

‒ Source: containers mailing list - containers development plans (Aug 8 2007)

Page 10: Linux containers-namespaces(Dec 2014)

Coordinated Efforts

‒ post anything container-related to the containers mailing list before any attempt to send it upstream

‒ make sure what is in -mm fits OpenVZ, Linux-VServer and other products

‒ make sure the initial framework also fits the requirements of a basic resource management system

Page 11: Linux containers-namespaces(Dec 2014)

Linux Namespaces

Page 12: Linux containers-namespaces(Dec 2014)

Namespaces

• Namespaces - lightweight process virtualization

• Isolation: Enable a process (or several processes) to have different views of the system than other processes

• Currently 6 namespaces:

‒ mnt, pid, net, ipc, uts, user

‒ 4 more planned (2006):

‒ security namespace

‒ security keys namespace

‒ device namespace

‒ time namespace

Page 13: Linux containers-namespaces(Dec 2014)

Mount namespace

‒ Mount namespaces were the first namespace type, by Al Viro, 2002

‒ Kernel 2.4.19

‒ CLONE_NEWNS

‒ 6 CLONE_NEW* flags were added (include/linux/sched.h)

‒ These flags (or a combination of them) can be used in clone() or unshare() syscalls to create a namespace

Page 14: Linux containers-namespaces(Dec 2014)

clone() flags

Flag            Since kernel   Capability required
CLONE_NEWNS     2.4.19         CAP_SYS_ADMIN
CLONE_NEWUTS    2.6.19         CAP_SYS_ADMIN
CLONE_NEWIPC    2.6.19         CAP_SYS_ADMIN
CLONE_NEWPID    2.6.24         CAP_SYS_ADMIN
CLONE_NEWNET    2.6.29         CAP_SYS_ADMIN
CLONE_NEWUSER   3.8            none
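The flags combine as a plain bitmask. A minimal sketch in shell, assuming the constant values from the kernel's include/uapi/linux/sched.h; the helper names `flag_value` and `clone_mask` are made up for illustration and are not part of any tool:

```shell
#!/bin/sh
# Values of the CLONE_NEW* flags as defined in the kernel headers.
flag_value() {
    case "$1" in
        mnt)  echo 0x00020000 ;;  # CLONE_NEWNS
        uts)  echo 0x04000000 ;;  # CLONE_NEWUTS
        ipc)  echo 0x08000000 ;;  # CLONE_NEWIPC
        user) echo 0x10000000 ;;  # CLONE_NEWUSER
        pid)  echo 0x20000000 ;;  # CLONE_NEWPID
        net)  echo 0x40000000 ;;  # CLONE_NEWNET
        *)    return 1 ;;
    esac
}

# clone_mask (hypothetical helper): OR together the flags for the given
# namespace types, the way they would be passed to clone() or unshare().
clone_mask() {
    mask=0
    for ns in "$@"; do
        value=$(flag_value "$ns") || {
            echo "unknown namespace: $ns" >&2
            return 1
        }
        mask=$(( mask | value ))
    done
    printf '0x%08x\n' "$mask"
}
```

For example, `clone_mask uts net` prints 0x44000000, the value of CLONE_NEWUTS | CLONE_NEWNET.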

Page 15: Linux containers-namespaces(Dec 2014)

Namespace: System calls

‒ 3 system calls are used

‒ clone()

‒ creates a new process and a new namespace, and attaches the process to the namespace

‒ unshare()

‒ creates a new namespace and attaches the current process to it

‒ reverses sharing that was set up by the clone(2) system call (2005)

‒ setns(int fd, int nstype)

‒ joins an existing namespace

Page 16: Linux containers-namespaces(Dec 2014)

• namespaces have no names; the system calls take no namespace-name parameter

• 6 entries (inodes) added under /proc/<pid>/ns

‒ Kernel 3.8

• nsproxy

• Kernel config items:

‒ CONFIG_UTS_NS

‒ CONFIG_IPC_NS

‒ CONFIG_USER_NS

‒ CONFIG_PID_NS

‒ CONFIG_NET_NS
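The entries under /proc/<pid>/ns can be inspected with readlink. A small sketch, assuming a kernel where these entries are symbolic links (3.8 or later); `ns_id` and `same_ns` are hypothetical helper names:

```shell
#!/bin/sh
# ns_id (hypothetical helper): print the identifier of one namespace of a
# process, in the form "uts:[<inode number>]".
ns_id() {
    readlink "/proc/$1/ns/$2"
}

# same_ns (hypothetical helper): succeed if two processes are in the same
# namespace of the given type, i.e. both links name the same inode.
same_ns() {
    a=$(ns_id "$1" "$3") && b=$(ns_id "$2" "$3") && [ "$a" = "$b" ]
}

# List all namespaces of the current shell.
for entry in /proc/$$/ns/*; do
    [ -L "$entry" ] || continue
    printf '%s -> %s\n' "${entry##*/}" "$(readlink "$entry")"
done
```

Reading the links of another user's process usually requires root; a shell can always read its own, for example `same_ns $$ $PPID net`.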

Page 17: Linux containers-namespaces(Dec 2014)

Namespace: User space additions

‒ nsenter (util-linux >= 2.23)

‒ wrapper around setns()

‒ allows running a new process in the namespaces of an existing process

‒ iproute2

‒ ip netns

‒ add, del, exec

‒ util-linux

‒ unshare

‒ All 6 namespaces
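As an illustration of how nsenter selects namespaces, the following sketch only assembles and prints an nsenter command line; it does not enter anything itself. The helper name `enter_cmd` is an assumption made for this example, the long options are those of nsenter(1):

```shell
#!/bin/sh
# enter_cmd (hypothetical helper): print the nsenter(1) command line that
# would run a command inside selected namespaces of an existing process.
# Usage: enter_cmd PID "ns-list" command [args...]
enter_cmd() {
    pid=$1
    kinds=$2
    shift 2
    opts=""
    for kind in $kinds; do
        case "$kind" in
            mnt)  opts="$opts --mount" ;;
            uts)  opts="$opts --uts" ;;
            ipc)  opts="$opts --ipc" ;;
            net)  opts="$opts --net" ;;
            pid)  opts="$opts --pid" ;;
            user) opts="$opts --user" ;;
            *)    echo "unknown namespace: $kind" >&2; return 1 ;;
        esac
    done
    echo "nsenter --target $pid$opts $*"
}
```

`enter_cmd 4016 "net uts" ip addr` prints `nsenter --target 4016 --net --uts ip addr`; running the printed command typically requires root.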

Page 18: Linux containers-namespaces(Dec 2014)

UTS namespace

‒ UTS - UNIX Time-sharing System

‒ new_utsname struct:

‒ sysname, nodename, release, version, machine, domainname

‒ CLONE_NEWUTS

‒ Since 2.6.19

‒ Initial use case: vserver/openvz - clone a new uts namespace for each new virtual server

‒ http://lwn.net/Articles/179345/

‒ Demo: unshare -u /bin/bash
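The demo can also be run non-interactively. A sketch under one assumption that is not in the slide: it adds --user --map-root-user so that no root login is needed (this requires unprivileged user namespaces to be enabled); as root, plain `unshare --uts` is enough. The helper name `uts_demo` is made up:

```shell
#!/bin/sh
# uts_demo (hypothetical helper): set a hostname inside a new UTS namespace
# and print the name as seen from inside. The host's name is not changed.
# Assumption: --user --map-root-user is used to avoid needing root.
uts_demo() {
    unshare --user --map-root-user --uts \
        sh -c 'hostname "$1" && uname -n' sh "$1"
}

# Usage:
#   uts_demo sandbox    # name inside the new namespace
#   uname -n            # name on the host, unchanged
```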

Page 19: Linux containers-namespaces(Dec 2014)

IPC namespace

‒ same principle as UTS

‒ a process gets an independent namespace for System V message queues, semaphore sets and shared memory segments

‒ CONFIG_IPC_NS, CONFIG_SYSVIPC

‒ CLONE_NEWIPC flag:

‒ since 2.6.19

Page 20: Linux containers-namespaces(Dec 2014)

Network namespace

‒ A network namespace is logically another copy of the network stack, with its own routes, firewall rules, and network devices

‒ a network device belongs to exactly one network namespace

‒ a socket belongs to exactly one network namespace

‒ a new network namespace only includes the loopback device

‒ communication between namespaces using veth or unix sockets

Page 21: Linux containers-namespaces(Dec 2014)

Network namespace: Use cases

‒ Turn off network inside namespace:

‒ ensure that processes running there will be unable to make connections outside of the namespace

‒ e.g. spam, botnets

‒ Restricted namespace:

‒ Even processes that handle network traffic (a web server worker process or web browser rendering process for example) can be placed into a restricted namespace

‒ Namespace without network devices

‒ make it impossible for child or worker processes to make additional network connections

‒ http://lwn.net/Articles/580893/

Page 22: Linux containers-namespaces(Dec 2014)

Network namespace

‒ man ip-netns

‒ ip netns add <net_ns>

‒ creates /var/run/netns/<net_ns>

‒ ip netns exec NAME cmd ... - Run cmd in the named network namespace

‒ /etc/netns/<net_ns>/resolv.conf overrides /etc/resolv.conf

‒ Communicate between network namespaces by

‒ creating a pair of network devices (veth) and moving one of them to the other network namespace
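The veth approach can be sketched as follows. The interface names veth0/veth1 and the 10.0.0.0/24 addresses are assumptions chosen for this example, and `run` and `veth_pair` are hypothetical helpers; with DRY_RUN set, the commands are only printed:

```shell
#!/bin/sh
# run: execute a command, or only print it when DRY_RUN is set.
run() {
    if [ -n "${DRY_RUN:-}" ]; then echo "$*"; else "$@"; fi
}

# veth_pair (hypothetical helper): connect the current network namespace to
# a named one through a veth pair. Interface names and addresses are
# assumptions for this sketch. Needs root when run for real.
veth_pair() {
    ns=$1
    run ip netns add "$ns" &&
    run ip link add veth0 type veth peer name veth1 &&
    run ip link set veth1 netns "$ns" &&
    run ip addr add 10.0.0.1/24 dev veth0 &&
    run ip link set veth0 up &&
    run ip netns exec "$ns" ip addr add 10.0.0.2/24 dev veth1 &&
    run ip netns exec "$ns" ip link set veth1 up &&
    run ip netns exec "$ns" ip link set lo up
}

# Usage:
#   DRY_RUN=1 veth_pair tns0   # print the commands only
#   veth_pair tns0             # run them as root, then: ping 10.0.0.2
```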

Page 23: Linux containers-namespaces(Dec 2014)

network namespaces demo

Page 24: Linux containers-namespaces(Dec 2014)

Network namespace example

Move a VPN connection to its own namespace

‒ ip netns add tns0

‒ mkdir /etc/netns/tns0

‒ openconnect -s /etc/vpnc/vpnc-script <your-vpn-network>

‒ ip link set dev tun0 netns tns0

‒ #example: VPN_IP_ADDRESS=`ip a|grep 149|sed -e 's/..*149/149/' -e 's#/32.*##'`

‒ ip netns exec tns0 ip addr add $VPN_IP_ADDRESS dev tun0

‒ ip netns exec tns0 ip link set tun0 up

‒ ip netns exec tns0 ip link set lo up

‒ #test: ip netns exec tns0 ping $VPN_IP_ADDRESS

‒ #ip netns exec tns0 ip route restore </tmp/ip-route-save-vpn

‒ ip route|sed -E -e 's/ (scope|proto) .*//' -e 's/^/ip route add /' >/tmp/ip-route-add

‒ chmod 755 /tmp/ip-route-add

‒ ip netns exec tns0 /tmp/ip-route-add

‒ #test: ip netns exec tns0 ip route

‒ echo nameserver <your_VPN_specific_nameserver> >/etc/netns/tns0/resolv.conf

‒ ip netns exec tns0 cat /etc/resolv.conf

‒ ip netns exec tns0 wget <IP_ADDRESS_only_available_via_VPN>
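The route-rewriting step of this example can be isolated as a filter. A sketch of the apparent intent (drop everything from " scope" or " proto" onward, then prefix each line with "ip route add"); the helper name `routes_to_cmds` is hypothetical:

```shell
#!/bin/sh
# routes_to_cmds (hypothetical helper): turn `ip route` output on stdin into
# `ip route add ...` commands. Everything from " scope" or " proto" onward
# is dropped, which also removes trailing attributes such as src and metric.
routes_to_cmds() {
    sed -E -e 's/ (scope|proto) .*//' -e 's/^/ip route add /'
}

# Usage:
#   ip route | routes_to_cmds >/tmp/ip-route-add
```

A line such as `default via 192.168.1.1 dev eth0 proto dhcp metric 100` becomes `ip route add default via 192.168.1.1 dev eth0`.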

Page 25: Linux containers-namespaces(Dec 2014)

User namespace

‒ only namespace which can be created without CAP_SYS_ADMIN capability

‒ A process will have distinct set of UIDs, GIDs and capabilities

‒ User namespaces allow per-namespace mappings of user and group IDs.

‒ users and groups may have privileges for certain operations inside the container without having those privileges outside the container

‒ Capabilities

‒ have root privileges for operations inside the container only

‒ map user IDs on the host system to corresponding user IDs in the namespace

‒ Complete since kernel 3.8

‒ Having a full set of caps in your local user namespace is safe

‒ user namespace root users can create network namespaces

Page 26: Linux containers-namespaces(Dec 2014)

User namespaces demo

Page 27: Linux containers-namespaces(Dec 2014)

User namespaces demo

‒ as demo user:

‒ unshare --net --user /bin/bash

‒ nobody@sles12rc3:~> echo $$

‒ 4016

‒ as root user:

‒ cat /proc/4016/uid_map

‒ #empty

‒ #ID-inside-ns ID-outside-ns length

‒ echo 0 1000 10 > /proc/4016/uid_map

‒ echo 0 100 10 > /proc/4016/gid_map

‒ as demo user:

‒ nobody@sles12rc3:~> id

‒ uid=0(root) gid=0(root) groups=0(root)

‒ nobody@sles12rc3:~> whoami

‒ root

‒ nobody@sles12rc3:~> ls -la /root/

‒ ls: cannot open directory /root/: Permission denied

http://man7.org/linux/man-pages/man7/user_namespaces.7.html
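How a uid_map line such as "0 1000 10" translates IDs can be worked through with a small filter. This is a sketch of the arithmetic only, not of the kernel's implementation; `outside_id` is a hypothetical helper:

```shell
#!/bin/sh
# outside_id (hypothetical helper): translate an ID as seen inside a user
# namespace to the ID outside, given uid_map/gid_map lines on stdin.
# Each line reads: <first-ID-inside> <first-ID-outside> <length>
# Fails (exit status 1) if the ID is not covered by any line.
outside_id() {
    awk -v id="$1" '
        id >= $1 && id < $1 + $3 { print $2 + (id - $1); found = 1; exit }
        END { exit !found }
    '
}

# Usage:
#   echo "0 1000 10" | outside_id 0      # uid 0 inside is uid 1000 outside
#   outside_id 0 </proc/4016/uid_map     # against a live mapping
```

With the mapping from the demo, IDs 0 to 9 inside the namespace correspond to 1000 to 1009 outside; ID 10 is unmapped.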

Page 28: Linux containers-namespaces(Dec 2014)

Appendix

Page 29: Linux containers-namespaces(Dec 2014)

Advanced Container examples

Page 30: Linux containers-namespaces(Dec 2014)

cgroup only container

‒ One of the cgroup only container uses we see at Parallels (so no separate filesystem and no net namespaces) is pure apache load balancer type shared hosting. In this scenario, base apache is effectively brought up in the host environment, but then spawned instances are resource limited using cgroups according to what the customer has paid.

‒ Obviously all apache instances are sharing /var and /run from the host (mostly for logging and pid storage and static pages). The reason some hosters do this is that it allows much higher density simple web serving (either static pages from quota limited chroots or dynamic pages limited by database space constraints) because each "instance" shares so much from the host. The service is obviously much more basic than giving each customer a container running apache, but it's much easier for the hoster to administer and it serves the customer just as well for a large cross section of use cases and for those it doesn't serve, the hoster usually has separate container hosting (for a higher price, of course).

‒ systemd-devel ml: Sun, 25 Aug 13, 19:16 CEST James Bottomley

Page 31: Linux containers-namespaces(Dec 2014)

PaaS SaaS Container

‒ I gave you one example: a really simplistic one. A more sophisticated example is a PaaS or SaaS container where you bring the OS up in the host but spawn a particular application into its own container (this is essentially similar to what Docker does). Often in this case, you do add separate mount and network namespaces to make the application isolated and migrateable with its own IP address. The reason you share init and most of the OS from the host is for elasticity and density, which are fast becoming a holy grail type quest of cloud orchestration systems: if you don't have to bring up the OS from init and you can just start the application from a C/R image (orders of magnitude smaller than a full system image) and slap on the necessary namespaces as you clone it, you have something that comes online in milliseconds which is a feat no hypervisor based virtualisation can match.

‒ systemd-devel ml, Sun, 25 Aug 13, 20:16 CEST James Bottomley

Page 32: Linux containers-namespaces(Dec 2014)

tidbits

‒ mboxgrep namespace systemd-devel201*

‒ It sounds like you're setting up your containers wrongly. If a container can reboot the system it means that host root capabilities have leaked into the container, which is a big security no-no. The upstream way of avoiding this is USER_NS (because root in the container is now not root in the host). The OpenVZ kernel uses a different mechanism to solve the problem, but we think USER_NS is the better way to go on this.

‒ For launching new services in a container simply sending a message to the init process is probably what you want. I think those messages already traverse unix domain sockets so it isn't too shabby.

Page 33: Linux containers-namespaces(Dec 2014)

tidbits

‒ mboxgrep namespace systemd-devel201*

‒ Feb 2014

‒ > FYI I have successfully run Fedora 19 with systemd inside a container

‒ > with libvirt LXC, however, I did *not* enable user namespaces. Every

‒ > time I try user namespaces I find some other bug in either the kernel

‒ > or libvirt, so I wouldn't be surprised if yet more breakage has

‒ > occurred in user namespaces :-(

‒ Those bugs should now be fixed, if you don't enable the option, how are we supposed to know what is left to be done? :)

Page 34: Linux containers-namespaces(Dec 2014)

tidbits

‒ https://lkml.org/lkml/2013/4/25/596

‒ > Final question, is it by design that uid 0 within a namespace is not

‒ > allowed to write to

‒ > /proc/*/oom_score_adj?

‒ Essentially. It is by design that uid 0 within a namespace be mapped to some other uid outside the namespace, and that the permissions on writes should use the permission needed outside of the user namespace.

‒ Which means there are all kinds of things only uid 0 can write to, that you can't touch in a user namespace. Some of those things the policy may need to be reconsidered. A lot of those things the default policy is good. Regardless we are now defaulting to not letting root in a container do risky things which is a good thing.

‒ Eric

Page 35: Linux containers-namespaces(Dec 2014)

Capabilities

‒ http://man7.org/linux/man-pages/man7/user_namespaces.7.html

‒ The child process created by clone(2) with the CLONE_NEWUSER flag starts out with a complete set of capabilities in the new user namespace. Likewise, a process that creates a new user namespace using unshare(2) or joins an existing user namespace using setns(2) gains a full set of capabilities in that namespace. On the other hand, that process has no capabilities in the parent (in the case of clone(2)) or previous (in the case of unshare(2) and setns(2)) user namespace, even if the new namespace is created or joined by the root user (i.e., a process with user ID 0 in the root namespace).

‒ Note that a call to execve(2) will cause a process's capabilities to be recalculated in the usual way (see capabilities(7)), so that usually, unless it has a user ID of 0 within the namespace or the executable file has a nonempty inheritable capabilities mask, it will lose all capabilities.

‒ Having a capability inside a user namespace permits a process to perform operations (that require privilege) only on resources governed by that namespace.
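The effect of execve(2) on capabilities in a new user namespace can be observed from the shell. A sketch, assuming unshare(1) from util-linux and unprivileged user namespaces being enabled; the function names are made up for this example:

```shell
#!/bin/sh
# cap_eff: print the effective capability mask (hex) of the process that
# reads /proc/self/status, here the sed process itself.
cap_eff() {
    sed -n 's/^CapEff:[[:space:]]*//p' /proc/self/status
}

# Effective capabilities of a freshly exec'ed program in a new user
# namespace. With no ID mapping the program does not run as uid 0 in the
# namespace, so the execve(2) drops the capabilities again.
cap_eff_unmapped() {
    unshare --user sed -n 's/^CapEff:[[:space:]]*//p' /proc/self/status
}

# Same, but with the caller mapped to uid 0 (--map-root-user), so the
# capabilities survive the execve(2).
cap_eff_as_root() {
    unshare --user --map-root-user \
        sed -n 's/^CapEff:[[:space:]]*//p' /proc/self/status
}
```

`cap_eff_unmapped` should print an all-zero mask, while `cap_eff_as_root` should print a non-zero one; those capabilities apply only to resources governed by the new namespace.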

Page 36: Linux containers-namespaces(Dec 2014)

Socketat - network namespaces

‒ http://lwn.net/Articles/407615/

‒ The use case is the handful of networking applications that find that it makes sense to listen to sockets from multiple network namespaces at once. Say a home machine that has a vpn into your office network and the vpn into the office network runs in a different network namespace so you don't have to worry about address conflicts between the two networks, the chance of accidentally bridging between them, and so you can use different dns resolvers for the different networks.

‒ In that scenario it would be nice if I could run some services on both networks. Starting two+ copies of the daemons just so they can live in all of the networks is ok, but in the fullness of time I expect that there will be daemons that want to optimize things and have sockets in all of the network namespaces you are connected to.

‒ In a multiple network namespace aware application when it goes to open a socket it will want to specify which network namespace the socket is in. If it is a general listener it will probably be listening to events in /proc/mounts waiting for extra namespaces to be mounted under a standard location say: /var/run/netns/<netnsname>/ns.

‒ Once the application receives the event for a new network namespace showing up it will want to create a new socket listening for connections in the new network namespace.

‒ In that scenario none of those network namespaces are foreign, but one network namespace will be the default and the rest will be non-default network namespaces.

Page 37: Linux containers-namespaces(Dec 2014)

socketat

‒ http://lists.openvz.org/pipermail/devel/2010-October/025720.html

‒ [Devel] Re: [PATCH 8/8] net: Implement socketat.

‒ Just to clarify this point. You enter the namespace, create the socket and go back to the initial namespace (or create a new one). Further operations can be made against this fd because it is the network namespace stored in the sock struct which is used, not the current process network namespace which is used at the socket creation only.

‒ We can actually already do that by unsharing and then create a socket. This socket will pin the namespace and can be used as a control socket for the namespace (assuming the socket domain will be ok for all the operations).

‒ .. if I assume you want to create a process controlling 1024 netns, let's try to identify what happens with setns and with socketat:

‒ With setns:

‒ * open /proc/self/ns/net (1)

‒ * unshare the netns

‒ * open /proc/self/ns/net (2)

‒ * setns (1)

‒ * create a virtual network device

‒ * move the virtual device to (2) (using the set netns by fd)

‒ * unshare the netns

Page 38: Linux containers-namespaces(Dec 2014)

socketat

‒ http://lists.openvz.org/pipermail/devel/2010-October/025736.html

‒ > The app control point is in namespace0. I still want to be able to

‒ > "boot" namespaces first and maybe a few seconds later do a socketat()...

‒ > and create devices, tcp sockets etc. I suspect create_ns(namespace-name)

‒ > would involve:

‒ > * open /proc/self/ns/net (namespace-name)

‒ > * unshare the netns

‒ > Is this correct?

‒ Almost.

‒ create should be:

‒ * verify namespace-name is not already in use

‒ * mkdir -p /var/run/netns/<namespace-name>

‒ * unshare the netns

‒ * mount --bind /proc/self/ns/net /var/run/netns/<namespace-name>

Page 39: Linux containers-namespaces(Dec 2014)

Operating system–level virtualization

As of 30.11.2014: http://en.wikipedia.org/wiki/Operating_system-level_virtualization

Page 40: Linux containers-namespaces(Dec 2014)

References – old

‒ Paul B. Menage. Adding Generic Process Containers to the Linux Kernel. Proceedings of the Ottawa Linux Symposium, 2007.

‒ http://www.kernel.org/doc/ols/2007/ols2007v2-pages-45-58.pdf

‒ Linux-CR: Transparent Application Checkpoint-Restart in Linux

‒ http://www1.cs.columbia.edu/~orenl/papers/ols2010-linuxcr.pdf

‒ Making applications mobile using containers

‒ http://lxc.sourceforge.net/doc/ols2006/lxc-ols2006-slides.pdf

‒ Virtual Servers and Checkpoint/Restart in Mainstream Linux

‒ describes the general namespace support in Linux and its usage

‒ Transparent Checkpoint-Restart of Multiple Processes on Commodity Operating Systems - Oren Laadan

‒ Source: Operating System Virtualization: Practice and Experience, Oren Laadan (systor2010_osvirt.pdf)

Page 41: Linux containers-namespaces(Dec 2014)

References

‒ http://lwn.net/Articles/531114/#series_index

‒ Namespaces in operation, 6 part series by Michael Kerrisk

‒ https://github.com/bigbighd604/C-Notes

‒ Git repository with the demo code from the namespaces series

‒ www.haifux.org/lectures/299/netLec7.pdf (Rami Rosen, 2013)

‒ https://www.kernel.org/doc/ols/2006/ols2006v1-pages-101-112.pdf (Biederman)

‒ http://books.google.de/books?id=RpsQAwAAQBAJ&pg=PA424&lpg=PA423&ots=rAqP4sxMXn&focus=viewport&dq=Rami+Rosen+network+namespaces&hl=de

‒ Linux Kernel Networking(Rami Rosen)

‒ http://www.makelinux.net/kernel_map/

‒ http://en.wikipedia.org/wiki/Operating_system-level_virtualization

‒ /usr/src/linux/Documentation/unshare.txt

‒ How to find namespaces in a Linux system

‒ http://www.opencloudblog.com/?p=251

