
Platform as a Service Under the Hood: Episodes 1-5

    dotcloud.com


    INTRODUCTION

Building a Platform as a Service (PaaS) is rewarding work: we get to make the life of a developer easier.

A PaaS helps developers deploy, scale, and manage their applications, without forcing them to become hardcore systems administrators themselves.

As with many problems, the toughest part about managing applications in the cloud is not building the PaaS itself. The challenge lies in scaling the applications.

To give you a sense of the complexity: each minute, millions of HTTP requests are routed through the platform. Not only does our PaaS collect millions of metrics, we also aggregate, process, and analyze those metrics, looking for abnormal patterns. Apps are constantly deployed and migrated on our PaaS platform.

For economies of scale, virtually all PaaS providers pack as much density as possible onto their physical machines. How does a PaaS provider solve the following issues?

- How is application isolation accomplished?
- How does the platform handle data isolation?
- How does the platform deal with resource contention?
- How does the platform deploy and run apps efficiently?
- How does the platform provide security and resiliency?
- How does the platform handle the load from millions of HTTP requests?

One key element is lightweight virtualization: the use of virtual environments (called containers) to provide isolation characteristics comparable to full-blown virtual machines, but with much less overhead. In this area, the dotCloud platform relies on Linux Containers (LXC).

In the following 5 episodes, we will dive into some of the internals of the dotCloud platform, or more specifically, the Linux kernel features used by dotCloud.


    Episode 1: Kernel Namespaces

Simplifying complexity takes a lot of work. At dotCloud, we take highly complex processes, such as deploying and scaling web applications in the cloud, and make them appear as simple workflows to developers and DevOps.

How do we accomplish such a feat? In this eBook, we will show you how dotCloud works under the hood. We will expose the mechanics behind kernel-level virtualization and high-throughput network routing. We will cover other technologies, such as metrics collection and memory optimization, in later eBooks.

A developer once said, "Diving into the inner workings of a PaaS is like going to Disneyland: you'll uncover a world of wonder."


Each time a new Linux Container (LXC) is created, the name of the container is filed under the /cgroup directory. For example, a new container named sanfrancisco is filed under the directory /cgroup/sanfrancisco. It is easy to think that containers rely on control groups. Although cgroups are useful to Linux Containers (we will cover cgroups more thoroughly in Episode 2), namespaces provide an even more vital function to Linux Containers.

Namespaces isolate the resources of processes. This isolation is the real magic behind Linux Containers! There are five namespaces, each covering a different resource: pid, net, ipc, mnt, and uts.

The pid namespace

The pid namespace is the most useful one for basic isolation. Each pid namespace has its own process numbering. Different pid namespaces form a hierarchy, and the kernel keeps track of all the namespaces. A parent namespace can see and act on its child namespaces; a child namespace cannot perform any actions on its parent.

Some principles of the pid namespace:

- Each pid namespace has its own PID 1, init-like process.
- Processes residing in a namespace cannot affect processes residing in a parent or sibling namespace with system calls like kill or ptrace, because process IDs are only meaningful inside a given namespace.
- If a pseudo-filesystem like proc is mounted by a process within a pid namespace, it will only show the processes belonging to that namespace.
- Numbering is different in each namespace, which means that a process in a child namespace can have multiple PIDs: for example, one in its own namespace and a different PID in its parent namespace. The top-level pid namespace can see all processes running in all namespaces, with different PIDs. A process can have more than two PIDs if there are more than two levels of hierarchy in the namespaces.
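To make this concrete, here is a minimal Go sketch of the underlying mechanism (Go is also the language dotCloud later picked for its health checker; see Episode 5). It passes the CLONE_NEWPID flag to clone(2) through the standard exec package, so the child shell becomes PID 1 of a fresh pid namespace. Linux-only, requires root; error handling kept minimal.

    // pidns.go: run a shell in a new pid namespace.
    package main

    import (
        "os"
        "os/exec"
        "syscall"
    )

    func main() {
        // 'echo $$' prints the shell's PID: inside the new namespace, it is 1.
        cmd := exec.Command("/bin/sh", "-c", "echo $$")
        cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
        // CLONE_NEWPID asks clone(2) to start the child in its own pid namespace.
        cmd.SysProcAttr = &syscall.SysProcAttr{Cloneflags: syscall.CLONE_NEWPID}
        if err := cmd.Run(); err != nil {
            panic(err)
        }
    }

Running this as root prints 1, while the same shell command in the parent namespace prints an ordinary PID.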

The net namespace

With the pid namespace, you can start processes in multiple isolated environments called containers. But what if you need to run a separate instance of the Apache webserver in each container? Generally, only one process can listen on port 80/TCP at a time. Rather than configuring each Apache instance to listen on a different port, you can use the net namespace, which has been designed for networking.

Each net namespace can have different network interfaces. Even lo, the loopback interface supporting 127.0.0.1, can be different in each net namespace. It is even possible to create a pair of special interfaces which appear in two different net namespaces, allowing one of the two namespaces to talk to the outside world.

A typical container will have its own loopback interface (lo), as well as one end of a special pair of interfaces, generally named eth0. The other end of the pair will be in the original namespace, and will bear a poetic name like veth42xyz0. It is then possible to put those special interfaces together in an Ethernet bridge (to achieve switching between containers), or to route packets between them, etc. This is similar to the Xen networking model.

Each net namespace has its own local meaning for INADDR_ANY, a.k.a. 0.0.0.0. When your Apache webserver process binds to INADDR_ANY and port 80 (*:80) within its namespace, it will only receive connections directed to the IP addresses and interfaces of its namespace. That allows you to run multiple Apache instances, each in its own pid and net namespace, with their default configuration listening on port 80, and each will remain individually addressable.

Each net namespace also has its own routing table, and its own iptables chains and rules.
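The same one-flag pattern from the pid sketch demonstrates network isolation. In this minimal sketch, a child process runs ip link show inside a brand-new net namespace; it should list only a loopback interface, in the DOWN state, regardless of the interfaces present in the parent namespace (root required).

    // netns.go: list network interfaces from inside a fresh net namespace.
    package main

    import (
        "os"
        "os/exec"
        "syscall"
    )

    func main() {
        cmd := exec.Command("ip", "link", "show")
        cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
        // CLONE_NEWNET gives the child its own interfaces, routes, and iptables rules.
        cmd.SysProcAttr = &syscall.SysProcAttr{Cloneflags: syscall.CLONE_NEWNET}
        if err := cmd.Run(); err != nil {
            panic(err)
        }
    }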

The ipc namespace

The ipc namespace won't appeal to many of you, unless you took UNIX 101 back when engineering schools still taught classes on IPC (Inter-Process Communication).

IPC provides semaphores, message queues, and shared memory segments. While still supported by virtually every UNIX flavor, those features are considered by many to be obsolete, superseded by POSIX semaphores, POSIX message queues, and mmap. Nonetheless, some programs, such as PostgreSQL, still use IPC.


What's the connection with namespaces? Each IPC resource is accessed through a globally unique 32-bit ID. While IPC implements permissions on the resource itself, an application could be surprised if it failed to access a given resource because it has already been claimed by another process in a different container. The app doesn't know anything about other containers!

Meet the ipc namespace. Processes within a given ipc namespace cannot access (or even see) the IPC resources living in other ipc namespaces. Now you can safely run a PostgreSQL instance in each container, without fear of IPC key collisions.

The mnt namespace

chroot is a mechanism to sandbox a process (and its children) within a given directory. The mnt namespace takes the chroot concept even further.

As its name implies, the mnt namespace deals with mount points.

Processes living in different mnt namespaces can see different sets of mounted file systems, and different root directories. If a file system is mounted in an mnt namespace, it is accessible only to processes within that namespace; it is not visible to processes in other namespaces.

At first glance, chroot alone may sound sufficient, since it sandboxes each container within its own directory, hidden from other containers. If each container is chrooted in a different directory, container C1 won't be able to access or see container C2's file system, right? Unfortunately, there are downsides.

Inspecting /proc/mounts in a merely chrooted container will show the mount points of all containers. Also, those mount points will be relative to the original namespace, which can give away hints about the layout of your system, and seeing paths from the global namespace may confuse applications that rely on the paths listed in /proc/mounts.

The mnt namespace makes the situation much cleaner, allowing each container to have its own mount points, and to see only those mount points, with their paths correctly correlated to the actual root of the namespace.

The uts namespace

Finally, the uts namespace deals with one small but important detail: the hostname that a group of processes can "see". Each uts namespace can hold a different hostname, changed with the sethostname system call, and the change is visible only to processes running in the same uts namespace.
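As a small illustration (a sketch assuming the golang.org/x/sys/unix package and CAP_SYS_ADMIN privileges), a process can unshare its uts namespace and rename itself without affecting the rest of the system:

    // utsns.go: change the hostname inside a private uts namespace.
    package main

    import (
        "bytes"
        "fmt"

        "golang.org/x/sys/unix"
    )

    func main() {
        // Move this process into its own uts namespace.
        if err := unix.Unshare(unix.CLONE_NEWUTS); err != nil {
            panic(err)
        }
        // sethostname(2) now only affects processes in this uts namespace.
        if err := unix.Sethostname([]byte("sanfrancisco")); err != nil {
            panic(err)
        }
        var u unix.Utsname
        if err := unix.Uname(&u); err != nil {
            panic(err)
        }
        name := string(bytes.TrimRight(u.Nodename[:], "\x00"))
        fmt.Println("hostname inside the namespace:", name)
    }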

Creating namespaces

Namespace creation is achieved with the clone system call. This system call supports a number of flags, allowing you to specify whether the new process should run within its own pid, net, ipc, mnt, and uts namespaces.

When a new container is created, the following steps take place: a new process starts, with new namespaces created for it; its network interfaces, including the special pair of interfaces used to talk with the outside world, are configured; and it then executes an init-like process.

When the last process within a namespace exits, the associated resources (IPC, network interfaces...) are automatically reclaimed. If, for some reason, you want those resources to survive the termination of the last process of the namespace, you can use mount --bind to retain the namespace for future use, because each namespace is materialized by a special file in /proc/$PID/ns.

Not all namespaces can be retained this way. Up to kernel 3.4, there is support for the ipc, net, and uts namespaces, but not for the mnt and pid namespaces. This presents a problem that we will address below.
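For the supported namespaces, the retention trick looks roughly like this sketch: bind-mounting the /proc/$PID/ns/net file of a running process onto a persistent path (the destination path here is a made-up example):

    // keepns.go: keep a net namespace alive after its last process exits.
    package main

    import (
        "fmt"
        "os"

        "golang.org/x/sys/unix"
    )

    func main() {
        pid := os.Args[1] // PID of a process inside the namespace to keep
        src := fmt.Sprintf("/proc/%s/ns/net", pid)
        dst := "/var/run/netns-backup" // hypothetical destination path
        // The target of a bind mount must exist; an empty file is enough.
        f, err := os.Create(dst)
        if err != nil {
            panic(err)
        }
        f.Close()
        // Equivalent of: mount --bind /proc/$PID/ns/net /var/run/netns-backup
        if err := unix.Mount(src, dst, "", unix.MS_BIND, ""); err != nil {
            panic(err)
        }
        // The namespace now survives even after its last process exits.
    }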


Attaching to Existing Namespaces

It is also possible to "enter" a namespace, by attaching a process to an existing namespace. Here are some use cases:

- Setting up network interfaces "from the outside", without relying on scripts inside the container
- Running arbitrary commands to retrieve information about the container (for instance, executing netstat)
- Obtaining a shell within a container

Attaching a process to existing namespaces requires two things:

- The setns system call (which exists only since kernel 3.0, or with patches for older kernels)
- The namespace must appear in /proc/$PID/ns

We mentioned above that only the ipc, net, and uts namespaces appear in /proc/$PID/ns; the mnt and pid namespaces do not. Only a patched kernel will allow you to attach to existing mnt and pid namespaces.
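On a kernel that supports it, attaching to a namespace is a two-step dance: open the file in /proc/$PID/ns, then call setns. A minimal Go sketch (net namespace only, error handling trimmed):

    // nsenter.go: attach the current thread to another process's net namespace.
    package main

    import (
        "fmt"
        "os"
        "runtime"

        "golang.org/x/sys/unix"
    )

    func main() {
        // setns(2) affects a single thread: pin this goroutine to its OS thread.
        runtime.LockOSThread()
        defer runtime.UnlockOSThread()

        // os.Args[1] is the PID of any process already inside the container.
        fd, err := unix.Open(fmt.Sprintf("/proc/%s/ns/net", os.Args[1]), unix.O_RDONLY, 0)
        if err != nil {
            panic(err)
        }
        defer unix.Close(fd)

        // Join the namespace; the CLONE_NEWNET argument makes setns verify
        // that the file descriptor really designates a net namespace.
        if err := unix.Setns(fd, unix.CLONE_NEWNET); err != nil {
            panic(err)
        }
        // Any socket opened from here on lives in the container's network stack.
    }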

Combining the necessary patches can be fairly tricky, because it involves resolving conflicts between AUFS and GRSEC. AUFS and GRSEC will be covered in Episodes 3 and 4, respectively.

To avoid running an overly patched kernel, there are three suggested workarounds:

- You can run sshd in your containers, and pre-authorize a special SSH key to execute your commands. This is one of the easiest solutions to implement, but if sshd crashes or is stopped (either intentionally or by accident), you may be locked out of the container. Also, if you want to squeeze the memory footprint of your containers as much as possible, you might want to get rid of sshd. If the latter is your main concern, you can run a low-profile SSH server like dropbear, or start the SSH service from inetd or a similar service.

- If you want something simpler than SSH (or something different, to avoid interfering with custom sshd configurations), you can open a backdoor. An example would be to run socat TCP-LISTEN:222,fork,reuseaddr EXEC:/bin/bash,stderr from init in your containers. Just make sure that port 222/tcp is properly firewalled.

- An even better solution is to embed this control channel within your init process. Before changing its root directory, the init process can set up a UNIX socket on a path located outside the container root directory. When it changes its root directory, it retains its open file descriptors, and therefore the control socket; a minimal sketch follows.
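Here is a rough sketch of that last idea. The socket path, container root, and trivial command handler are all hypothetical, but the key point, that a listening file descriptor survives chroot, is standard UNIX behavior.

    // initctl.go: sketch of an init process keeping a control socket across chroot.
    package main

    import (
        "net"

        "golang.org/x/sys/unix"
    )

    func handle(c net.Conn) { c.Write([]byte("ok\n")); c.Close() } // hypothetical handler

    func main() {
        // Create the control socket on a path OUTSIDE the container root...
        l, err := net.Listen("unix", "/var/run/container-42.sock")
        if err != nil {
            panic(err)
        }
        // ...then enter the container. The listener's file descriptor stays
        // open and usable, even though its path is unreachable from inside.
        if err := unix.Chroot("/srv/containers/42"); err != nil {
            panic(err)
        }
        unix.Chdir("/")
        for {
            conn, err := l.Accept()
            if err != nil {
                continue
            }
            go handle(conn)
        }
    }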

How dotCloud uses namespaces

In previous releases, the dotCloud platform used vanilla LXC (Linux Containers), which made implicit use of namespaces.

From the beginning, we deployed kernel patches that allowed us to attach arbitrary processes to existing namespaces. We found this approach to be the most convenient and reliable way to deploy, control, and orchestrate containers. As the dotCloud platform evolved, we still make use of namespaces to isolate applications from each other, even though we have stripped down the vanilla LXC containers.


    Episode 2: cgroups

Control groups, or cgroups, are a set of mechanisms to measure and limit resource usage for groups of processes.

Conceptually, they work somewhat like the ulimit shell command or the setrlimit system call. ulimit and setrlimit set resource limits for a single process; cgroups allow you to set resource limits for groups of processes.


Pseudo-FS Interface

The easiest way to manipulate control groups is through the cgroup file system. Assuming it has been mounted on /cgroup, creating a new group named polkadot is as easy as mkdir /cgroup/polkadot. When you create this (pseudo) directory, it instantly gets populated with many (pseudo) files to manipulate the control group. You can then move one (or many) processes into the control group by writing their PID to the right control file, for example echo 4242 > /cgroup/polkadot/tasks.

When a process is created, it starts in the same group as its parent. So if the init process of a container has been placed in a control group, all the processes of the container will be in the same control group as well.

Destroying a control group is as easy as rmdir /cgroup/polkadot. However, the processes within the cgroup have to be moved to other groups first; otherwise rmdir will fail, just like when trying to remove a non-empty directory.
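Since the interface is just files and directories, any language can drive it. A minimal Go sketch of the mkdir/echo sequence above (assuming, as in the text, that the cgroup pseudo-fs is mounted on /cgroup):

    // cgroup.go: create a control group and move a process into it.
    package main

    import (
        "fmt"
        "os"
    )

    func main() {
        // mkdir /cgroup/polkadot: the kernel populates it with pseudo-files.
        if err := os.Mkdir("/cgroup/polkadot", 0755); err != nil {
            panic(err)
        }
        // echo 4242 > /cgroup/polkadot/tasks: moves PID 4242 into the group.
        pid := 4242
        err := os.WriteFile("/cgroup/polkadot/tasks",
            []byte(fmt.Sprintf("%d\n", pid)), 0644)
        if err != nil {
            panic(err)
        }
    }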

Technically, control groups are split into many subsystems. Each subsystem is responsible for a set of files in /cgroup/polkadot, and the file names are prefixed with the subsystem name. For instance, the files cpuacct.stat, cpuacct.usage, and cpuacct.usage_percpu are the interface of the cpuacct subsystem.

The subsystems can be used together or independently. In other words, you can decide that each control group will have limits and counters for all the subsystems; alternatively, each subsystem can have its own set of control groups. To illustrate the latter case: a given process can be in the polkadot control group for memory control and in the bluesuedeshoe control group for CPU control, with polkadot and bluesuedeshoe living in two completely independent hierarchies.

What can be Controlled?

Many things! We'll highlight the ones we think are the most useful.

Memory

You can limit the amount of RAM and swap space that can be used by a group of processes. The accounting covers the memory used by the processes for their private use (their Resident Set Size, or RSS), but also the memory used for caching purposes.

This is actually quite powerful, because traditional tools such as ps, or analysis of /proc, have no way to identify the cache memory usage incurred by specific processes. This can make a big difference, for instance, with databases: a database typically consumes very little memory for processing (leaving aside complex queries, which we exclude from this example), but a large chunk of cache.

To perform optimally, your whole database (or at least your active set, the data you refer to most often) should fit into memory.

You can set a memory limit for a cgroup simply by writing to a pseudo-file, for example echo 100000000 > /cgroup/polkadot/memory.limit_in_bytes (the value will be rounded to a page size).

To check the current usage of a cgroup, inspect the pseudo-file memory.usage_in_bytes in the cgroup directory. You can gather very detailed (and very useful) information from memory.stat.
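For example, a small program could apply the limit above and read back the usage. This sketch uses the standard cgroup v1 memory file names, with the polkadot group as the running example from this episode:

    // memlimit.go: cap a cgroup at ~100 MB of RAM and read back its usage.
    package main

    import (
        "fmt"
        "os"
        "strings"
    )

    func main() {
        base := "/cgroup/polkadot/"
        // Equivalent of: echo 100000000 > /cgroup/polkadot/memory.limit_in_bytes
        err := os.WriteFile(base+"memory.limit_in_bytes", []byte("100000000"), 0644)
        if err != nil {
            panic(err)
        }
        usage, err := os.ReadFile(base + "memory.usage_in_bytes")
        if err != nil {
            panic(err)
        }
        fmt.Printf("current usage: %s bytes\n", strings.TrimSpace(string(usage)))
    }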


That's why SSD storage is becoming increasingly popular. SSDs have virtually no seek time, and can therefore sustain random I/O as fast as sequential I/O. The available throughput is therefore predictably good, under any load.

There are, however, some workloads that can cause problems. For instance, writing and rewriting a whole disk will cause performance to drop dramatically. This is because read and write operations are fast, but erase, which must be performed at some point before a write, is slow.

An example of this use case would be using SSDs to store video on demand for hundreds of HD channels simultaneously. The disk will sustain the write throughput until it has written every block once; then, when it needs to erase, performance will drop below acceptable levels.

Going back to dotCloud: what's the purpose of the blkio controller in a PaaS environment?

The blkio controller metrics help detect applications that put an excessive strain on the I/O subsystem. The controller also lets you set limits, which can be expressed in number of operations and/or bytes per second, with separate limits for read and write operations. You can thereby set thresholds so that no single app can significantly degrade performance for other apps; and once an I/O-intensive app has been identified, its quota can be adapted to reduce its impact on the others.
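As a sketch, setting such a threshold boils down to one write into the cgroup's blkio pseudo-files (the cgroup v1 interface; the device numbers are an example, 8:0 being typically /dev/sda):

    // blkio.go: throttle a cgroup's reads on one device to 1 MB/s.
    package main

    import "os"

    func main() {
        // Format is "<major>:<minor> <bytes per second>".
        limit := []byte("8:0 1048576\n")
        err := os.WriteFile("/cgroup/polkadot/blkio.throttle.read_bps_device", limit, 0644)
        if err != nil {
            panic(err)
        }
    }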

It's Not Only for Containers

As we mentioned, cgroups are convenient for containers, since it is very easy to map each container to a cgroup. But there are many other uses for cgroups.

The systemd service manager is able to put each service in a different cgroup. This allows you to keep track of all the subprocesses started by a given service, even when they use the double-fork technique to detach from their parent and re-attach to init. It also allows fine-grained tracking and control of the resources used by each service.

It is also possible to run a system-wide daemon to automatically classify processes into cgroups. This can be particularly useful on multi-user systems, to limit and/or meter appropriately the resources of each user, or to run specific programs in a special cgroup when you know that those programs are prone to excessive resource use.

dotCloud & Control Groups

Thanks to cgroups, we can meter very accurately the resource usage of each container, and therefore of each unit of each service of each application. Our metrics collection system uses collectd, along with our in-house lxc plugin. Metrics are streamed to a custom storage cluster, and can be queried and streamed by the rest of the platform using our ZeroRPC protocol. We will write a more in-depth article on the metrics collection system in the future.

We also use cgroups to allocate resource quotas to each container. For instance, when you use vertical scaling on dotCloud, you are actually setting limits for memory, swap usage, and CPU shares.

    Episode 3: AUFS

AUFS (which initially stood for Another Union File System) provides fast provisioning while retaining full flexibility and ensuring disk and memory savings.


AUFS is a union file system: it merges two directory hierarchies together. On the dotCloud platform, we use AUFS to combine a large, read-only file system containing a ready-to-run system image with a writeable layer on top. The resulting file system looks like the large read-only one, except that you can now write anywhere, and only the changed files get stored. Live CDs and bootable USB sticks are common examples of this use case. AUFS allows us to have a common base image for all applications, plus a separate read-write layer unique to each app.
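For illustration, mounting such a union from a program might look like the following sketch. It assumes an AUFS-enabled kernel and pre-existing directories; the br= branch syntax is the commonly documented AUFS option format, and the paths are placeholders.

    // aufs.go: stack a writable layer over a read-only base image.
    package main

    import "golang.org/x/sys/unix"

    func main() {
        // Branch list: the first (writable) branch catches all changes;
        // the base image underneath stays untouched and shared.
        opts := "br=/var/lib/app42/rw=rw:/var/lib/baseimage=ro"
        if err := unix.Mount("none", "/mnt/app42", "aufs", 0, opts); err != nil {
            panic(err)
        }
    }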

Storage Savings

Let's assume that the base image takes up 1 GB of disk space. In reality, it is actually more than that, since we're talking about a full server file system, containing everything a dotCloud app could potentially need: Python, Ruby, Perl, Java, C compiler and libraries, and so on. If the entire image had to be cloned each time a dotCloud application was deployed, each new deployment would use 1 GB of disk space. With AUFS, a new deployment starts with just its read-write layer, typically using less than 1 MB of disk space, which translates into significant storage savings.

Faster Deployments

Copying the whole base image would not only use up precious disk space; it would also take time, up to a minute or so depending on disk speed, and the copy would put a significant I/O load on the disk. Creating a new pseudo-image with AUFS, on the other hand, takes a fraction of a second and causes virtually no I/O at all. AUFS is a much better solution than copying the entire image every time.

Better Memory Usage

Virtually all operating systems use a feature called the buffer cache to make disk access faster. Without it, your system could run 10x, 100x, even 1000x slower, because it would have to access the disk even to run simple commands, for example when listing your files with ls! As we will see, AUFS also lets us rack up big savings on this buffer cache.

Every single application loads from disk a number of common files and components: the libc standard library, the /bin/sh standard shell, and a lot of common infrastructure like crond, sshd, or the local Mail Transfer Agent, just to name a few. Additionally, all applications of the same type load the same files; for example, each Python application loads a copy of the Python interpreter.

If each app were running from its own copy, identical copies of those common files would be present multiple times in memory, within the buffer cache. With AUFS, those common files live in the base image, and the Linux kernel therefore knows to load them only once in memory. This typically saves tens of MB for each app.

Easier Upgrades

If you are familiar with storage technology, you might argue that snapshots and copy-on-write devices already provide the advantages mentioned above.

That's true. However, with those systems, it is not possible to update the base image and have the changes reflected in the lightweight clones (the snapshots). AUFS, on the other hand, lets you do whatever you want with the base image: the changes are immediately visible in the AUFS mount points using it. This makes it easy to do software upgrades, even while the applications are running, just like on a typical single-server environment, except that you can upgrade thousands of servers at once.

Allows Arbitrary Changes

All those things can also be done without AUFS. For a decade, skilled UNIX systems administrators have been deploying machines (workstations, X terminals, servers...) with a read-only root file system, allowing read-write access through ad hoc mount points. After all, with some clever configuration and tuning, you don't need to write anywhere except places like /tmp, /var/run, /var/lock, and of course /home. The latter can be a traditional read-write file system, and the former can even use tmpfs mounts.


    Episode 4: GRSEC

GRSEC is a security patch for the Linux kernel. Security features in GRSEC help detect and deter malicious code.


GRSEC is a fairly large patch for the Linux kernel, providing strong security features. It prevents many kinds of attacks (or exploits), and detects suspicious activity, such as people looking for new exploits or probing for known system vulnerabilities.

There are many features in GRSEC, so our goal here is to give an overview of the features most relevant to dotCloud.

Randomize Address Space

Many exploits rely on the fact that the base address of the heap or the stack is always the same. Consider the following classic scenario for an attack on a remote service:

- A bug is found in the service: some index is not checked properly and can be used to alter the stack, causing a jump to an arbitrary address when a function returns
- The stack is altered to introduce some malicious code
- A pointer to this malicious code is placed on the stack as well
- The bug is triggered: the service jumps to the malicious code and executes it

If the address space of the stack is randomized, it becomes much more difficult for an attacker to exploit the system: the attacker first has to locate the malicious code in memory before being able to jump to it.

Prevent Execution of Arbitrary Code

There are two steps to making sure that arbitrary code can't make it inside a running program.

First, program code must be loaded in an area that is marked by the memory management unit as read-only. This prevents code from modifying itself. Self-modifying code is sometimes referred to as polymorphic code; there are legitimate use cases for it, but it is more often associated with dubious intentions.

Second, the heap and the stack must be marked as non-executable. After all, they're supposed to contain data structures, function parameters, and return addresses, but no opcode should be in there. On architectures supporting it, the heap and stack regions are marked as non-executable at the hardware level, effectively preventing accidental or intentional execution of code located there.

At this point, there is no memory that is both executable and writable.
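A tiny probe can illustrate the policy (hedged: behavior depends on kernel configuration). It asks mmap(2) for a page that is simultaneously writable and executable; a stock kernel typically grants it, while a kernel enforcing W^X, such as one built with the PaX/GRSEC MPROTECT feature, is expected to refuse it.

    // wxprobe.go: ask the kernel for a page that is both writable and executable.
    package main

    import (
        "fmt"

        "golang.org/x/sys/unix"
    )

    func main() {
        _, err := unix.Mmap(-1, 0, 4096,
            unix.PROT_READ|unix.PROT_WRITE|unix.PROT_EXEC,
            unix.MAP_PRIVATE|unix.MAP_ANONYMOUS)
        if err != nil {
            fmt.Println("W|X mapping refused:", err)
            return
        }
        fmt.Println("W|X mapping allowed: this kernel does not enforce W^X")
    }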

We mentioned that there are some legitimate uses for memory regions with both write and exec permissions. When does that happen, and what can be done about it?

The most common case is on-the-fly code generation for optimization purposes, for example with Java and its JIT (Just-In-Time) compiler.

The good news is that GRSEC lets you flag specific executables, allowing them to write to their code region or execute their data region.

This reduces security for those specific processes, but there are benefits. To exploit a bug, there has to be a bug in, e.g., the JVM itself, not in your program. Bugs in the JVM are likely to be found and fixed much faster than bugs in your own program. This is not a comment about the quality of anyone's code; it's about the number of users in the Java community and their scrutiny of the JVM.


Audit Suspicious Activity

Another interesting security feature of GRSEC is the ability to log specific events. For instance, it is possible to record, in the kernel log, each time a process is terminated by SIGSEGV (a.k.a. Segmentation Fault).

What's the point? Potential attackers will likely run a number of known exploits in an attempt to gain escalated privileges. Hopefully, many of those exploits will fail. Often, the failure causes the process to perform a segmentation violation and be killed by SIGSEGV.

Any C programmer will tell you that there are legitimate cases where programs are terminated by SIGSEGV. But if the system detects many different programs started by the same user all being killed in the same way, it is a telltale sign that someone is trying to break into the system.

If you're not familiar with those concepts, consider an analogy: scratches around a padlock. A few scratches on the surface don't mean anything; but if you see the padlock full of dents, you can bet that someone is trying to pick it!

There are many other similar events logged by GRSEC. The kernel logs can then be analyzed in real time, and suspicious patterns detected. This allows you to lock out malicious users or, alternatively, monitor them closely to see what they're doing. GRSEC can also be useful in forensics, in case someone does successfully breach the system: GRSEC logs will record how they exploited it, and knowing how someone broke in is a valuable tool for whoever has to close the security gap.

Compile-time Security Features

GRSEC also plays a part during kernel compilation. It enables a compiler plugin which "constifies" some kernel structures: it automatically adds the const keyword to all structures containing only function pointers (unless they carry a special non-const marker to evade the process).

In other words, instead of being mutable by default unless marked const, function tables are now const by default, unless specified otherwise. Accordingly, attempts to modify function tables are detected at compile time. The rationale is to make sure that any code manipulating a function table is closely audited before the table is marked non-const.

Why the emphasis on function tables? Because if they can be tampered with, they are a convenient way for an attacker to jump to arbitrary code; recall the technique explained at the beginning of this episode!

Marking those data structures as const helps at compile time, but also later, when the kernel is running, because those data structures will be laid out in a memory region made read-only by the memory management unit.

    This not only reduces exposure to attacks, but can also make it harder for successful attackers to cover up their trackby hijacking existing function tables.

...And Many More

As stated in the introduction, this is just a quick overview. If you want to learn about other features, check GRSEC's website.

If you want to quench your thirst for technical details, you can follow these four steps to get a full listing of GRSEC features, each with its description:

1. Get the kernel sources
2. Apply the GRSEC patch set
3. Run make menuconfig
4. Navigate to the compilation options related to GRSEC

Almost every feature of GRSEC can be enabled or disabled at compilation time, and will therefore be listed there. The Help text provided with each compilation option is fairly informative.


In addition to GRSEC, dotCloud has additional built-in layers of security. Each service runs in its own container; the benefits of container isolation were explained in Episode 1 (namespaces) and Episode 2 (cgroups).

We do not allow dotCloud users to have root access. No root access means that users cannot SSH as root, cannot log in as root, and cannot get a root shell through sudo. All processes run under a regular, non-privileged UID.

Furthermore, SUID binaries are restricted to a set of well-known, well-audited programs, like ping.

Each of those security layers is strong on its own. We believe that combining them provides a more than adequate level of security for massively scaled, multi-tenant platforms.


Episode 5: Distributed Routing

The dotCloud platform is powered by hundreds of servers, some of them running more than one thousand containers. The majority of these containers are HTTP servers, and they handle millions of HTTP requests every day to power the applications hosted on our platform.


HTTP Routing Layer

All HTTP traffic is bound to a group of special machines called the gateways. The gateways parse HTTP requests and route them to the appropriate backends. When there are multiple backends for a single service, the gateways also deal with load balancing and failover. Last but not least, the gateways forward HTTP logs to be processed by the metrics cluster.

This HTTP routing layer, as we call it, runs on an elastic number of dedicated machines. When the load is low, 3 machines are enough to deal with the traffic. When spikes or DoS attacks happen, we scale up to 6, 10, or even more machines, to ensure optimal availability and latency.

All HTTP requests are bound to the HTTP routing layer, which is a cluster of identical HTTP load balancers. Each time we create, update (e.g. scale), or delete an application on dotCloud, the configuration of those load balancers has to be updated.

The master copy of the configuration is stored in a Riak cluster, working in tandem with a Redis cache. The configuration is modified using basic commands:

- Create an HTTP entry
- Add/remove a frontend (virtual host)
- Add/remove a backend (container)

The commands are passed through a ZeroRPC API. Each update done through the API propagates through the platform; in the next sections, we will see which mechanisms are used.

Version 1: Nginx + ZeroRPC

As you probably know, a start-up must be lean, agile, and many other things. It also needs to be pragmatic: the right solution is not always the ideal one, but the one that lets you ship on time. That's why the first iteration of our routing layer had some shortcomings, as we will see; but it held up well enough to support tens of thousands of apps.

Nginx powered the first version of dotCloud's routing layer. Each modification to an app caused the central vhosts service to push the full configuration to all the load balancers, using ZeroRPC.

Obviously, as the number of apps grew, the size of the configuration grew as well. Sending differential updates would have been better. But at least, when a load balancer lost a few configuration messages, there was no special case to handle: the next update would contain the full configuration and provide all the necessary information.

The configuration was transmitted in a compressed, efficient format. Each load balancer would then transform this abstract configuration into an Nginx configuration file, and tell Nginx to reload it. Nginx is well designed: even while loading a new configuration, it can still serve requests using the old one, which means that no HTTP request is lost during a configuration update.

[Diagram: visitors (technically, HTTP clients) reach the HTTP routing layer (load balancers), which forwards requests to the dotCloud app cluster, inside the dotCloud platform.]


Nginx also handles load balancing and failover well. When a backend server dies, Nginx detects it, removes it from the pool, periodically retries it, and re-adds it to the pool once it has recovered.

This setup had two issues:

- Nginx did not support the WebSocket protocol, which was one of the top features requested by our users at the time
- Nginx has no support for dynamic reconfiguration: each configuration update requires the whole configuration file to be regenerated and reloaded

At some point, the load balancers started to spend a significant amount of CPU time reloading Nginx configurations. There was no significant impact on running applications, but it required deploying more and more powerful instances as the number of apps increased. Although Nginx was still fast and efficient, we had to find a more dynamic alternative.

Version 2: Node.js + Redis + WebSocket = Hipache

We spent some time digging through several languages and technologies to solve this issue. We needed the following features:

- The ability to add, update, and remove virtual hosts dynamically, at a very low cost
- Support for the WebSocket protocol
- Great flexibility and control over the routed requests: we want to be able to trigger actions, log events, etc., during different steps of the routing

After looking around, we finally decided to implement our own proxy solution. We did not implement everything from scratch: we based our proxy on the node-http-proxy library developed by NodeJitsu, which included everything needed to route a request efficiently with the appropriate level of control. The new routing layer would therefore be written in JavaScript, using Node.js and leveraging NodeJitsu's library. We added several features, such as the following:

- Use of multi-core machines, by spreading the load across multiple workers
- The ability to store HTTP routes in Redis, allowing live configuration updates
- Passive health checking (when a backend is detected as being down, it is removed from the rotation)
- Efficient logging of requests
- Memory footprint monitoring: if a leak causes the memory usage of a worker to go beyond a given threshold, the worker is gracefully recycled
- Independence from other dotCloud technologies (like ZeroRPC), to make the proxy fully reusable by third parties (the code being, obviously, open source)

After several months of engineering and intensive testing, we released the source code of Hipache, our new distributed proxy solution!

Behind the scenes, integrating Hipache into the dotCloud platform was very straightforward, thanks to our service-oriented architecture. We simply wrote a new adapter which consumed virtual host configurations from the existing ZeroRPC service, and used them to update Hipache's configuration in Redis. No refactoring or modification of the platform was necessary.
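For illustration, here is roughly what such an adapter writes to Redis. This sketch uses the redigo Go client and the list-per-frontend layout documented by Hipache (first element: application identifier; following elements: backend URLs); treat the key names, addresses, and client library as examples, not as the platform's actual code.

    // routes.go: push a virtual-host entry into Redis, Hipache-style.
    package main

    import "github.com/gomodule/redigo/redis"

    func main() {
        c, err := redis.Dial("tcp", "localhost:6379")
        if err != nil {
            panic(err)
        }
        defer c.Close()

        // Create an HTTP entry: the first list element names the application.
        if _, err := c.Do("RPUSH", "frontend:www.example.com", "myapp"); err != nil {
            panic(err)
        }
        // Add a backend (container); removing one would be an LREM on the same key.
        if _, err := c.Do("RPUSH", "frontend:www.example.com", "http://10.0.0.42:8080"); err != nil {
            panic(err)
        }
    }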

Here's a side note about dynamic configuration and latency. Storing the configuration in an external system (like Redis) means that you have to make one of the following trade-offs:

- You can look up the configuration at each request, but that requires a round-trip to Redis for every request, which adds latency
- You can cache the configuration locally, but you will have to wait a bit for your changes to take effect, or implement a complex cache-busting mechanism

We implemented a cache mechanism to avoid hitting Redis at each request. But it wasn't necessary: we realized that requests to a local Redis are very, very fast. The difference between direct lookups and cached lookups was less than 0.1 ms, which was in fact below the error margin of our measurements.


Version 3: Active Health Checks

Hipache has a simple embedded health-check system. When a request fails because of a backend issue (TCP errors, HTTP 5xx responses, etc.), the backend is flagged as dead, and remains in this state for 30 seconds. During those 30 seconds, no request is sent to the backend; then it goes back to the normal state. However, if it is still faulty, it will immediately be re-flagged as dead. This mechanism is simple enough and it works, but it has three caveats:

- If a backend is frozen, we will still send requests to it until it gets marked as dead
- When a backend is repaired, it can take up to 30 seconds before it is marked as live again
- A backend which is permanently dead will still receive a few requests every 30 seconds

To address those three problems, we implemented active health checks. The health checker permanently monitors the state of the backends, by doing the HTTP equivalent of a simple ping. As soon as a backend stops replying correctly to those ping requests, it is marked as dead; as soon as it starts replying again, it is marked as live. The HTTP pings can be sent every few seconds, so a change in a backend's state is detected much faster.

To implement the active health checker, we considered multiple solutions: Node.js, Python + gevent, Twisted... and finally decided to write it in Go. Go was chosen for several reasons (a minimal sketch of such a checker follows the list):

- The health checker is massively concurrent: hundreds, even thousands, of HTTP connections can be in flight at a given time
- Go programs can be compiled and deployed as a single, stand-alone binary
- We have other tools doing massively concurrent queries, and this was an excellent occasion to do some comparative benchmarks (we will publish those benchmarks in future eBooks)
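Here is a minimal sketch of the approach, illustrating why Go fits: one goroutine per backend, each sending a periodic HTTP "ping". The URLs, paths, and intervals are illustrative assumptions, not hchecker's actual configuration; where a real checker would update Redis, this sketch just prints.

    // hcheck.go: sketch of an active health checker.
    package main

    import (
        "fmt"
        "net/http"
        "time"
    )

    func check(backend string, client *http.Client) bool {
        resp, err := client.Get(backend + "/ping")
        if err != nil {
            return false
        }
        resp.Body.Close()
        return resp.StatusCode < 500
    }

    func watch(backend string) {
        client := &http.Client{Timeout: 2 * time.Second}
        alive := true
        for range time.Tick(3 * time.Second) {
            if ok := check(backend, client); ok != alive {
                alive = ok
                fmt.Printf("%s is now alive=%v\n", backend, alive) // update Redis here
            }
        }
    }

    func main() {
        backends := []string{"http://10.0.0.42:8080", "http://10.0.0.43:8080"}
        // One lightweight goroutine per backend: thousands of probes can be
        // in flight at the same time, which is why Go fits this job well.
        for _, b := range backends {
            go watch(b)
        }
        select {} // run forever
    }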

The active health checker is completely optional. You don't need it to run Hipache, and you can plug it on top of an existing Hipache installation without modifying Hipache's configuration: it detects and updates Hipache's configuration directly through the Redis instance used by Hipache itself. In other words, it gets along perfectly fine with Hipache's embedded passive health-checking system; running it simply improves dead-backend detection. And of course, hchecker is open source, just like Hipache.

What's next?

Since this HTTP routing layer is a major part of the dotCloud infrastructure, we're constantly trying to find ways to make it better.

Recently, we did some research and tests to see if there was a way to implement dynamic routing with Nginx. In fact, we aimed for an even higher goal: we wanted to route requests with Nginx, using configuration rules stored in Redis, in the format currently used by Hipache. This would allow us to re-use many components, such as the Redis feeder and the active health checker, which use the same configuration format.

Guess what: we found something! Less than a year ago, when we started to think about the design of Hipache and began its implementation, we had looked at the Nginx Lua module. It has improved a lot since then, and it may now be an ideal candidate.

We started an experimental project which lets Nginx mimic Hipache, using the same Redis configuration format. Nginx deals with the request proxying, while the routing logic is all in Lua. We used the excellent lua-resty-redis module to talk to Redis from Nginx.

This open source project is called hipache-nginx.

Some preliminary benchmarks show that under high load, hipache-nginx can be 10x faster than the original Hipache in Node.js. The benchmarks have to be refined, but it appears that hipache-nginx can deliver the same performance as hipache-nodejs with 10x fewer resources. So, while the code is still experimental, it shows that there is plenty of room for improvement in the current HTTP routing layer; even if it will probably only matter for apps seeing 10,000-100,000 requests per second, it is worth investigating.


    CONCLUSION

As you can see, building a PaaS like dotCloud or Heroku involves specific knowledge about fundamental technologies. Of course, you may not choose to implement any of the specific technologies that we've implemented in dotCloud.

Our aim was to expose the underlying technologies that we've implemented to provide isolation between apps, rapid deployment, protection against security threats, and distributed routing.

In other words, if you are serious about building a robust platform, you may want to become familiar with these types of technologies. Or, alternatively, you could rely on an existing, proven platform like dotCloud.

Join dotCloud's Technical Community

    Sign up for your own account

    Join the technical discussions in our open forums

    Read our blog

    Have a technical question?

    Email us: [email protected]


Authors' Biographies

Jérôme Petazzoni, PaaS Under the Hood, Episodes 1-4

Jérôme is a senior engineer at dotCloud, where he rotates between Ops, Support, and Evangelist duties and has earned the nickname of master Yoda. In a previous life he built and operated large-scale Xen hosting back when EC2 was just the name of a plane, supervised the deployment of fiber interconnects through the French subway, built a specialized GIS to visualize fiber infrastructure, and specialized in commando deployments of large-scale computer systems in bandwidth-constrained environments such as conference centers, among various other feats of technical wizardry. He cares for the servers powering dotCloud, helps our users feel at home on the platform, and documents the many ways to use dotCloud in articles, tutorials, and sample applications. He's also an avid dotCloud power user who has deployed just about anything on dotCloud; look for one of his many custom services on our GitHub repository.

Connect with Jérôme on Twitter! @jpetazzo

    Sam Alba, PaaS Under the Hood, Episode 5

As dotCloud's first engineering hire, Sam was part of the tiny team that shipped our first private beta in 2010. Since then, he has been instrumental in scaling the platform to tens of millions of unique visitors for tens of thousands of developers across the world, leaving his mark on every major feature and component along the way. Today, as dotCloud's first Director of Engineering, he manages our fast-growing engineering team, which is another way to say he sits in meetings so that the other engineers don't have to. When not sitting in a meeting, he maintains several popular open source projects, including Hipache and Cirruxcache and other projects also ending in "-ache". In a previous life, Sam supported Fortune 500s at Akamai, built the web infrastructure at several startups, and wrote software for self-driving cars in a research lab at INRIA.

    Follow Sam on Twitter @sam_alba
