THESIS - Binghamton University
REMORA: AGGRESSIVE POWER MANAGEMENT FOR APACHE HTTPD WEB SERVER
BY
SHANE CASE
BS, SUNY Farmingdale, 2006
THESIS
Submitted in partial fulfillment of the requirements for
the degree of Master of Science in Computer Science
in the Graduate School of
Binghamton University
State University of New York
2009
UMI Number: 1473688
Copyright 2010 by ProQuest LLC. All rights reserved. This edition of the work is protected against
unauthorized copying under Title 17, United States Code.
© Copyright by Shane Case 2009
All Rights Reserved
Accepted in partial fulfillment of the requirements for the degree of Master of Science in Computer Science
in the Graduate School of Binghamton University
State University of New York
2009

November 20, 2009
Kanad Ghose, Department of Computer Science, Binghamton University
Kartik Gopalan, Department of Computer Science, Binghamton University
Abstract
With the release of the Pentium 4 Prescott, the trend of raising clock frequency to boost performance came to an end, because higher clock frequencies bring higher heat dissipation from the CPU. Large data centers and server farms must now account for the cost of cooling machines as well as the energy they consume while operating. In recent years the efficiency of individual server machines has declined because servers rarely run at full capacity: power management software does not account for the recent growth in computing capability, and this is reflected in the gap between utilization and the power consumed by server hardware. The current solution is the Advanced Configuration and Power Interface (ACPI), which gives control of power configuration to the operating system; the operating system then manages power according to the present load on the system. A machine dedicated to a specific purpose, such as a web server, may not benefit from a system-wide power scheme. A web server sees a great deal of repeated traffic, which can be exploited by keeping a record of hardware usage for frequent transactions. Categorizing these transactions as either CPU or I/O intensive leads to a more "intelligent" power management scheme. The goal of this thesis is to show that such a scheme can further reduce energy consumption while minimizing degradation in performance.
Acknowledgments
I would like to thank my parents, for providing me with the resources, support, and
opportunity to further my education in graduate school. I also want to thank my brother
for initially teaching me how to use a computer before a graphical user environment
existed.
Finally, I would like to thank my advisor, Kanad Ghose, for providing such a supportive
work environment, and for his advice and encouragement, without which this work would
not have been possible. Under his tutelage I continue to learn new things and will strive
to move my education ahead even further.
Table of Contents

Table of Figures
Chapter 1 – Introduction
Chapter 2 – Related Work
    Load Leveling and CPU Voltage
    Workload Forecasting and Cluster Load Leveling
    Minimizing Processor Wakeup Iterations
    Full System Power Modeling
Chapter 3 – Hardware Interface
    P-States
    C-States
    ACPI Power Management Schemes
Chapter 4 – Interface with Apache Web Server
    Apache Logging Capabilities
Chapter 5 – Results
Chapter 6 – Conclusions and Future Work
References
Table of Figures

Figure 1 – Data Center Infrastructure Efficiency
Figure 2 – Load Leveling Task Queue
Figure 3 – CPU Power Management Granularity
Figure 4 – Apache Child Listener/Worker Loop
Figure 5 – Httpd Child Process
Figure 6 – Statistic Table
Figure 7 – Queue Implementation
Figure 8 – Watts Consumed Per Frequency
Figure 9 – Power Consumption
Figure 10 – SPECweb Method to Calculate Average "Off" Time
Figure 11 – Power Consumption of SPECweb across 8 Cores
Figure 12 – Web Server Throughput
Chapter 1 – Introduction

In recent years, the movement to processing data electronically has driven a rise
in the number of data centers present throughout the world. Music, movies, and
telephone services are just the beginning of what is being moved to electronic
delivery. Today, almost all levels of government and corporations own and
maintain a data center, and the need for these data centers keeps growing.
The total consumption of energy by data centers in the United States in 2006
was 1.5% of all energy consumed in the country, equivalent to the consumption
of about 10 million typical U.S. households. Data center consumption is just a
small subset of U.S. industry, which as a whole consumes one third of all energy
consumed by the country. Further investigation, however, shows that the majority
of this energy is not consumed by the servers themselves [HOSTING 05]: more
than half of the energy used by a data center goes to cooling the servers
[SYSTEMS 04]. This is an obvious inefficiency, and it will only grow as the size
of the data center grows. The U.S. Department of Energy has called for a 10%
reduction in the total energy consumption of data centers by 2011; a larger
reduction is possible if the methods used for cooling data centers can be made
more efficient.
The metric the Department of Energy uses is called Data Center Infrastructure
Efficiency (DCIE; see Figure 1). This metric currently stands below 0.5 for a
typical data center, and the goal of the Department of Energy is to increase it:
under their proposed best practices, the metric should reach 0.85. If and when
this goal is achieved, the majority of the power consumed by a data center will
be the energy required to operate the servers. Increasing this metric will lead
to a decrease in a data center's total energy consumption.
Figure 1 – Data Center Infrastructure Efficiency: DCIE = Energy for IT Equipment / Total Energy for Data Center [DOE 08]
Implementing complex cooling mechanisms can have a high initial cost and can
require the complete renovation of server farms or replacement of servers.
Cheaper alternatives exist in the form of software mechanisms for a more
aggressive power management scheme [PERFORMANCE 01]. These
mechanisms can be complex to implement and will not yield the same outcome
as a hardware implementation. Conserving energy by modifying a server's
power management can also reduce the performance of each machine so
modified.
If a machine is dedicated to a task like a web server, there is a characteristic that
can be exploited to maximize power savings and minimize the effect on server
throughput [LEVEL 07]. Web servers typically handle transactions that repeat
over a period of time, and it is this repetition that can be used to profile the
resource usage of web servers [PATH 98]. By profiling repeated transactions, it
is possible to learn the common resource usage of handling these transactions.
Taking the information from profiling, transactions can effectively be categorized
by their resource usage. Transactions that typically require a great amount of
computation can be said to be CPU bound. Similarly, transactions that require a
great amount of storage medium activity can be said to be I/O bound.
Using these categorizations, the power management policy can be dynamically
adjusted when transactions of a certain type are encountered
[POLICIES 03]. For instance, if there are several transactions
occurring that are all of the CPU bound category, we can place the hard disks
into a low power state to conserve energy rather than having them sit idle, but
still in a high power state. By the same token, if there are several transactions
that are all I/O bound, the processor can be placed in a low power state to save
energy.
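The policy just described can be condensed into a small sketch. The names and the two-device model are invented for illustration; real state changes would go through the kernel interfaces described in Chapter 3 rather than these placeholder fields:

```c
enum txn_class { CPU_BOUND, IO_BOUND };
enum dev_power { ACTIVE, LOW_POWER };

struct policy { enum dev_power cpu, disk; };

/* If the current mix of transactions is CPU bound, the disks may sleep;
 * if it is I/O bound, the CPU may drop to a low power state. */
struct policy choose_policy(enum txn_class dominant) {
    struct policy p;
    if (dominant == CPU_BOUND) { p.cpu = ACTIVE;    p.disk = LOW_POWER; }
    else                       { p.cpu = LOW_POWER; p.disk = ACTIVE;    }
    return p;
}
```

The point of the sketch is only that the decision is driven by the transaction category, not by system-wide load as in stock ACPI management.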
If the energy consumed by a dedicated task server can be reduced, several
benefits will be observed. First, the power consumed by each server will be
reduced. This reduction in power consumption by a machine will reduce the heat
dissipation of hardware components, thus extending the life of the hardware
itself. Second, since heat dissipation will be reduced, less cooling will be
required. Third, the performance penalty incurred when the temperature rises
in the processing environment may also be reduced.
There have been various studies on implementing software-based energy
conservation mechanisms; these methods are discussed in the related work
section of this thesis. The repetitive nature of a web server's workload,
however, presents a unique opportunity to maximize energy conservation while
imposing a minimal impact on system performance.
From this point forward, this thesis is organized as follows. An overview of
related work is given to show the uniqueness of the web server scenario.
This is followed by a presentation of the existing capabilities of today's
hardware for changing system power settings. Finally, the necessary code
changes are discussed along with the results of benchmarks designed to stress
the web server.
Chapter 2 – Related Work

This section discusses research similar to the focus of this thesis: first, a
study that alters the processor P-state according to the current load across a
cluster of servers; second, a study that uses load forecasting to balance the
workload across a cluster of servers; and third, a study that consolidates
tasks onto a small number of CPUs in a multi-CPU environment in order to
render several CPUs completely idle.
Load Leveling and CPU Voltage
In a cluster environment (See Figure 2), having multiple servers available to
handle web based transactions is not always energy efficient. By measuring the
latency of request response time and the relative load placed on system
resources, a more efficient cluster load leveling scheme can be implemented.

Figure 2 – Load Leveling Task Queue [MULTI-TIER 07]
At a user-defined interval, the latency of response can be measured and the
system's power level modified accordingly. If the latency time exceeds a
user-defined percentage above a user-defined threshold, the P-state on nodes of
the cluster can be increased to attempt to normalize the latency time to the
threshold. Conversely, if the latency time falls below the threshold by a
user-defined percentage, the P-state on the node with the lightest load can be
reduced. This scheme theoretically maximizes the workload-to-energy-consumption
ratio with a minimal performance impact.
The primary drawback of this scheme is the user-defined monitoring interval. In
the worst case, the load on system resources changes drastically immediately
after the interval has elapsed. If the load rises while the system is in a low
power state, performance is impacted; conversely, if the load drops sharply
while the system remains in a high power state, the efficiency of the scheme
is lost.
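The per-interval check can be sketched as follows. The function and field names are ours, not from the cited study, and P-state 0 is taken as the fastest state, following the convention described in Chapter 3:

```c
typedef struct {
    int pstate;      /* current P-state: 0 = fastest, pstate_max = slowest */
    int pstate_max;  /* deepest (slowest) P-state the CPU supports */
} node_t;

/* Called once per monitoring interval with the measured response latency.
 * If latency exceeds the threshold by the given fraction, speed the node up;
 * if it falls below the threshold by that fraction, slow the node down. */
int adjust_pstate(node_t *n, double latency, double threshold, double pct) {
    if (latency > threshold * (1.0 + pct) && n->pstate > 0)
        n->pstate--;                 /* too slow: raise clock frequency */
    else if (latency < threshold * (1.0 - pct) && n->pstate < n->pstate_max)
        n->pstate++;                 /* ample headroom: lower clock frequency */
    return n->pstate;
}
```

With a threshold of 50 ms and pct = 0.2, a measured latency of 70 ms moves the node one P-state faster, 30 ms moves it one state slower, and anything in between leaves it untouched.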
Workload Forecasting and Cluster Load Leveling
A second approach, from a web server standpoint, is to attempt to forecast the
load that will be placed on a system at a given time. As in the first study, in
a cluster environment, having all nodes of a cluster available to service requests
can be energy inefficient. The ability to produce a workload forecast can help to
calculate the appropriate number of cluster nodes that will need to be available.
In a web application environment, two main values must be monitored: the login
rate and the number of connections. In addition to these values, time must be
allotted for cluster nodes to perform a cold start and to pick up a workload
equal to that of the nodes already handling transactions. The resulting equation
returns the number of cluster nodes required to handle the current workload
placed on the cluster. One drawback of this load leveling scheme is the inherent
requirement for an observation period, which must then be followed by a
validation period to check the correctness of the forecasting equation.
A further drawback of forecasting is that a forecast is only valid for a
specific set of hardware, and having to allocate time for observation can impose
a large overhead in an upgrade situation. When hardware such as the CPUs of the
designated cluster is updated, computational power increases, allowing cluster
nodes to handle more transactions; each time this occurs, a new observation
period must be performed [LOAD DISPATCH 08].
Minimizing Processor Wakeup Iterations
In today's enterprise computing environment, servers typically have more than
one socket. These sockets can support multi-core or SMP processors. Having
two processors present in a system provides a boost to computational capacity.
With this boost, however, comes an increase in energy consumption, and the
ability of the OS power manager to keep an idle system in a sleep state
becomes important.
Before kernel version 2.6.21 [TICKLESS 07], the kernel had a scheduler tick
that caused a wake-up at every elapsed tick. Starting with 2.6.21, the option
of a "tickless" kernel became available, an important feature in that it
eliminated the extra wake-ups caused by scheduling timers. Even so, undesirable
processor wake-ups still occur, so the ability to isolate OS daemons, timers,
and interrupts to a specific processor becomes vital.
On today's CPU hardware, SMP processors can be powered down to a "deep
sleep" state (see Chapter 3). There is a restriction, however: individual
cores cannot be set to different power states. It follows that if all causes of
a processor wake-up can be isolated to a single processor and its cores, the
second processor can remain in "deep sleep" for a longer interval
[INTERRUPT 08]. The kernel already has a built-in feature, the multi-core power
saving mode, which can be enabled via the bit in
/sys/devices/system/cpu/sched_mc_power_savings. Daemons and interrupts
already have built-in kernel features restricting them to a specific processor
set; timers, however, do not.
Full System Power Modeling
Full system power estimation has been implemented to model the power
requirements necessary for a server. In Mantis [SYSTEM 06], the power
requirements for primary hardware components are estimated using various
hardware and software counters. Computing the CPU utilization, off-chip memory
access count, hard disk activity, and network activity allows for the estimation of
power consumption on a specific machine.
One of the drawbacks of the Mantis approach is the granularity of power
estimation: power can only be estimated upon expiration of a set interval,
with a minimum of one second. Live power estimation, or power modeling at a
finer granularity, would be much more accurate. Another drawback is that the
system's kernel must be modified heavily. To access the hardware counters
needed for power estimation, additional patches must be applied to the kernel;
the required patch, perfmon2, also needs separate libraries to be installed to
allow counter monitoring [PERFMON]. Even with these modifications, further
counters are required for access to statistics about the I/O and network
subsystems.
After the necessary modifications are implemented, the equation that computes
the current power must be calibrated for the individual hardware profile. To
calculate the coefficient for each hardware component, an application designed
to stress individual components must be used: the emulation suite Gamut is used
to calibrate the coefficients for CPU activity, off-chip memory accesses, I/O
activity, and network activity [GAMUT 05].
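A Mantis-style estimate reduces to a linear combination of utilization counters. A toy version follows; the struct layout and all coefficients are invented for illustration, not taken from the paper:

```c
/* Full-system power model in the style of Mantis: a calibrated baseline plus
 * one calibrated coefficient per utilization counter. All numbers invented. */
struct power_model {
    double base_w;    /* idle/baseline draw in watts */
    double c_cpu;     /* watts per percent CPU utilization */
    double c_mem;     /* watts per million off-chip memory accesses per second */
    double c_disk;    /* watts per MB/s of disk activity */
    double c_net;     /* watts per MB/s of network activity */
};

double estimate_power(const struct power_model *m, double cpu_pct,
                      double mem_maccess, double disk_mbs, double net_mbs) {
    return m->base_w + m->c_cpu * cpu_pct + m->c_mem * mem_maccess
         + m->c_disk * disk_mbs + m->c_net * net_mbs;
}
```

Calibration then amounts to stressing one component at a time, as Gamut does, and fitting each coefficient from the measured wattage.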
Chapter 3 – Hardware Interface
Today's server CPUs (such as the Intel Xeon) have two methods of saving
energy. These two methods have been implemented to comply with the current
ACPI specification (3.0b). The two methods consist of P-states and C-states. P-
states refer to the clock frequency that the processor is currently running at. C-
states refer to processor throttling.
Both types have N states, where N is processor dependent, with 0 being the
highest state (highest energy consumption, highest performance) and N being the
lowest state (lowest energy consumption, lowest performance). The P-state can
be considered a finer granularity of energy saving, as these states apply only
while the processor is in C0, the execution state. When leaving C0 for a deeper
C-state, the P-state must first be set to the level required to leave C0 before
the processor can be throttled to the CN state.
Figure 3 – CPU Power Management Granularity
P-States
Processor P-states are the different clock frequencies at which a processor is
capable of running. P-state capabilities are reported to OS power management by
reading the _PPC (Performance Present Capabilities) object [ACPI 06]. The _PPC
object simply shows which states are available; each state it lists has a
corresponding entry in the _PSS (Performance Supported States) object. Each
_PSS entry carries six fields of extended information about the state: core
frequency, power dissipation in milliwatts, transition latency, bus master
latency, control, and status. The first two fields are straightforward.
Transition latency is the time the CPU will be unavailable due to the P-state
change; bus master latency is the time during which bus masters cannot access
the memory hierarchy; control is the value that must be written to the PERF_CTL
register; and status is the correct return value from the PERF_STATUS register.
In the event of a failure to transition to a particular P-state, the return value from
reading PERF_STATUS will differ from the correct value which is contained in
the corresponding _PSS object. There can also be a _PSD object which
contains any dependencies, whether they be hardware or software, that may
constrain the change of P-state. The OS power management must be aware of
these dependencies and satisfy them before a P-state can be successfully
entered. An example of such a dependency can be the inability of today's multi-
12
core processors (Core 2 Duo / Xeon multi-core) to have different cores on
different P-states.
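The per-state bookkeeping can be pictured as a C struct mirroring the six _PSS fields, together with the status check described above. The struct and function names are ours; real OSPM code reads these values out of the ACPI namespace rather than a hand-filled struct:

```c
#include <stdint.h>

/* One entry of the ACPI _PSS package: six fields per P-state */
struct pss_entry {
    uint32_t core_frequency_mhz;    /* clock frequency in this P-state */
    uint32_t power_mw;              /* power dissipation in milliwatts */
    uint32_t transition_latency_us; /* time the CPU is unavailable on a switch */
    uint32_t bus_master_latency_us; /* time bus masters cannot reach memory */
    uint32_t control;               /* value written to PERF_CTL to request it */
    uint32_t status;                /* expected PERF_STATUS value on success */
};

/* A P-state transition succeeded iff PERF_STATUS reads back the expected
 * status value recorded in the corresponding _PSS entry. */
int pstate_transition_ok(const struct pss_entry *e, uint32_t perf_status) {
    return perf_status == e->status;
}
```

The field values in the test below are placeholders, not real register encodings.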
C-States
C-states are the throttling ("sleep") states of the processor. The number of
C-states supported by a processor varies by implementation, with a larger
number supported on mobile and server processors. C-states run in order from
C0, the execution state, to CN, the deepest sleep state.
C-states can be implemented on the processor using one of two methods. The
first method is the presence of the P_LVLx registers, typically implemented on
platforms that only support up to the C3 state; the registers are P_LVL2 and
P_LVL3. To change states, the OS power manager performs a read on the register
that matches the desired state. The OS assumes that all processors have the
same C-state capability; when this is not the case, management is offloaded to
the BIOS, which chooses the lowest C-state that can be entered.
The second method of implementation is when the _CST object is present. In
the event that both P_LVLX and _CST objects are present on a system the OS
power manager always uses the _CST objects. It is the _CST object that
presents more information to the OS power manager to make more efficient
decisions. The _CST also enables the OS to act when power events can change
the capability of processors, such as on a mobile device. When such an event
happens the _CST object sends a notify event to the OS power management to
reevaluate the capabilities of the processor.
The C0 state is the only state in which execution can take place on the CPU
[CPUIDLE 07]; therefore the processor must be completely idle before it can
exit C0. System cache integrity must also be considered, and the task of
maintaining the context of these caches varies according to the current
C-state.
When entering the C1 state, the latency of returning to C0 is considered
negligible by the OS power manager. The C1 state is the only power throttling
state that must be supported by all processors; this is achievable because C1
is simply the execution of the HLT instruction. System caches are
maintained by the processor itself and there are no software visible effects of
being in this state. Exit from this state can be for any reason, but must always
occur on an interrupt.
Throttling state C2 is the first "sleep" state that requires hardware support
from the chipset. Latency is higher than in the C1 state, with the exact time
stored in the FADT ACPI table. Chipset support is required because the
processor must still maintain the context of the system caches, including
snooping for bus master cache accesses and for cache accesses in a
multi-processor environment. As in C1, there are no software-visible effects of
being in this state; it can be exited for any reason, but exit must always
occur on an interrupt.
The C3 throttling state requires the most hardware support of all throttling
states. System caches must be maintained by the OS power manager, as snooping
by the processor is not supported in this state. The absence of bus master
activity is ensured differently in uni-processor and multi-processor
environments: a uni-processor system sets the bus master arbitration disable
bit, while a multi-processor system flushes the contents of the on-chip caches,
an approach reserved for multi-processor environments because of the high
latency associated with flushing. One way to flush the caches is to read a file
larger than the largest cache present in the system (typically L2, or L3 if
present). The OS power manager can check the bus master status bit (BM_STS)
for bus master activity before deciding to enter the C3 state, and on exit the
BM_RLD bit tells the OS power manager whether the state was exited due to bus
master activity. Because the hardware does not maintain the system caches, the
processor must return to the C0 state when exiting C3 in order to restore the
context of these caches.
ACPI Power Management Schemes
The method by which the C-state and P-state are set for a system is called the
"governor" policy. Governors must be compiled into the kernel, and a default
governing policy is selected at compile time. At boot time the selected
governor is used; this can be changed at run time via the /sys interface.
There are five governors currently available in the Linux kernel
[GOVERNOR 08]. Two of them are "static" governors, called "performance" and
"powersave". No logic accompanies these types of frequency scaling:
performance sets the frequency at the highest possible setting, powersave at
the lowest, and the frequency will not deviate from its setting at boot-up. A
third governor, "userspace", requires a high level of user interaction: it
allows any process running with superuser permission to alter the current
frequency setting of the processor.
The final two governors, "ondemand" and "conservative", are dynamic and differ
greatly from the static governors: based on the current load on the system,
they vary the processor's frequency setting in different ways. Both governors
have a "sampling rate" in common, an interval expressed in microseconds. Each
time the interval elapses, the total usage percentage of each CPU is checked
and the kernel decides whether the current frequency is appropriate or needs to
be increased or decreased; the number checked is the average usage percentage
over the preceding interval. The CPU load threshold that, when exceeded,
causes a frequency change is called the "up_threshold"; it can be set by the
user through the interface in /sys. Where these two governors differ is in the
frequency decision that is applied. With the ondemand governor, when a
processor's frequency is to be changed it is increased straight to the maximum
frequency the processor supports; when the processor returns to an idle state,
the governor drops the frequency back down to the minimum.

The conservative governor changes frequency more "gracefully". Instead of
immediately jumping to the maximum frequency when CPU usage increases, it
raises the CPU frequency gradually: as long as the load on the processor is
above the threshold for increasing the frequency, it moves up one P-state at a
time until that threshold is no longer exceeded. Decreasing the frequency is
governed by a similar threshold of CPU usage, called the "down_threshold".
When CPU usage falls below this threshold, the frequency is decreased one
P-state at a time until the CPU is at its lowest possible frequency.
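The two decision rules can be contrasted in a toy model; P-state 0 is the fastest here, 4 the slowest, and the five-state range and thresholds are illustrative rather than taken from any real CPU:

```c
#define FASTEST 0
#define SLOWEST 4

/* ondemand: any load above up_threshold jumps straight to the fastest
 * P-state; an idle CPU falls straight back to the slowest. */
int ondemand_step(int pstate, int load_pct, int up_threshold) {
    if (load_pct > up_threshold) return FASTEST;
    if (load_pct == 0)           return SLOWEST;
    return pstate;
}

/* conservative: similar thresholds move the frequency one P-state at a
 * time instead of jumping to the extremes. */
int conservative_step(int pstate, int load_pct, int up_threshold, int down_threshold) {
    if (load_pct > up_threshold   && pstate > FASTEST) return pstate - 1;
    if (load_pct < down_threshold && pstate < SLOWEST) return pstate + 1;
    return pstate;
}
```

At run time either governor is selected through the /sys interface mentioned above, for example by writing `conservative` to /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor.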
Chapter 4 – Interface with Apache Web Server
Apache httpd is an open source web server and the most widely used web server
on today's Linux servers; since its initial release in 1995 it has grown to
become the most popular web server software on the web. The attraction of this
software suite may stem from its robustness: the system administrator has the
option of compiling each component of the web server as built-in or as a module
loaded at run time. This even extends to the method by which HTTP requests are
handled by the web server.
The main focus of this thesis is the Linux variant of the Apache httpd web
server, so the "pre-forking" architecture is the prime concern. The pre-forking
architecture is the method by which the web server processes incoming HTTP
requests: upon start-up, the web server forks N children to handle all
requests. This number is set in the configuration file and defaults to 8
children; typically the administrator changes it based on the capabilities of
the server that will perform the web hosting.
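In an httpd.conf for the prefork architecture these knobs look roughly like the following; the directive names are real Apache directives, and the values simply echo the defaults quoted above:

```apache
# Number of children forked at start-up
StartServers          8
# Requests a child serves before dying "gracefully"; 0 would mean unlimited
MaxRequestsPerChild   4000
```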
The children of the parent process (see Figure 4) are referred to as "workers".
Each worker enters a loop that handles a set number of requests before the
child "dies"; this number is also located in the configuration file of the web
server, and by default each child handles 4,000 requests and then dies
"gracefully". Across the entire hierarchy, only one child at a time is in the
"listener" state; all other children sit idle while that single child process
listens. When an incoming request is received, the listener takes the header of
the request and enters the working state, and as soon as it does, the next idle
child becomes the listener. When the child in the working state completes the
given transaction, it returns to the idle queue to again wait its turn to be
the listening process.
Figure 4 – Apache Child Listener/Worker Loop [APACHE 08]

Figure 5 – Httpd Child Process [APACHE 08]

When a request is received, it is usually for an entire web document. This is
not what a request is to the web server. A web document can contain any number of
images, scripts, etc. Each of these parts of a particular web document is
considered a transaction. For instance, let's say a web document consists of
text and three images. This web document would then consist of four
transactions for the web server. One transaction would consist of the text only,
and a transaction for each image.
After a request to the web server has been broken down into transactions
(see Figure 5), each location needs to be translated. When the web server
receives a request for a web site or its contents, the location in the header
is relative to the "DocumentRoot" set in the web server configuration file; it
is the job of the web server to locate the absolute file path of the requested
element.
After the location has been computed, the permissions on that file must be
checked. Permissions matter in two ways: the group that owns the file and the
groups allowed to access it, handled in the same fashion as standard
permissions on a Linux system. The web server can also apply its own
authentication methods if files are set to private; these can range from
restricting access to a set of IP addresses to requiring a user to log in to
view the files. If either method of authentication fails, the user sees a
"Forbidden" message returned by the web server.
After permissions have been verified, the child checks whether there is a
"quick handler" for the specific file requested. Quick handlers are more
lightweight than the standard transaction handling method. If no quick handler
exists for the file or file type, the standard transaction handler is used: it
retrieves the file and sends it to its destination, the requesting IP address
from the HTTP header.
When the network connection has finished, the count of transactions handled by
this child is incremented and checked against the user-defined maximum number
of transactions a child may handle; if it has reached the maximum, the child is
terminated. Otherwise the child returns to the idle queue to await its next
turn to serve as the listening process.
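The counter check condenses to a few lines. This is a toy model with an invented function name; real httpd spreads the equivalent logic across several source files:

```c
/* Returns 1 if the child should terminate after this transaction.
 * max_requests mirrors the configuration value discussed above; as in
 * httpd, a value of 0 means "unlimited". */
int child_should_exit(int handled, int max_requests) {
    return max_requests > 0 && handled >= max_requests;
}
```

With the default of 4,000 requests, the 4,000th completed transaction is the one that triggers the graceful death of the child.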
Apache Logging Capabilities
The Apache web server has a module called the "Forensic Logging Module", which
is modified here to become the communication channel between the web server and
the power manager. The unmodified forensic log module creates two entries in a
specified forensic log file [HTTPD_DOC 09]: the first when the HTTP header is
received, before the child "listener" process picks up the transaction, and the
second when the transaction has been serviced and completed. These two entries
serve as the start and stop points for profiling the system resources used
while processing a particular transaction.
Because each transaction is handled by a different worker process, Apache can
process transactions in parallel, which is a key strength. From a profiling
standpoint, however, this increases the complexity of tracking the system
resources used by individual child workers.
Initial attempts to profile transactions ran into a substantial problem:
overlap between transactions being processed concurrently. The first attempt at
profiling implemented a message queue in the ForensicLog module to serve as the
primary method of communication between Apache and the power manager.
Implementing the message queue was simple, and it already provided the blocking
mechanism needed to let the power manager sleep efficiently: when ForensicLog
sent a message, the power manager would wake up, gather statistics for the
transaction, and return to sleep. Test runs, however, exposed the overhead of
the message queue system calls as a major problem. First, the time taken by the
inter-process communication skewed the recorded resource usage numbers for a
particular transaction. Second, if more than one transaction was being
serviced, all of the resource usage would be attributed to the first
transaction the power manager was monitoring.
The second attempt at a feasible communication method was to use
memory-mapped I/O. Implementing the memory map required ForensicLog to
output to a file and to use a tool called 'rotatelogs'. Rotatelogs is a script
included with the Apache source code. The script is applied to log files and
forces the log file to be rotated when either a set amount of time has elapsed or
the log file exceeds a set size. The name of the log file must be formatted and
passed as input to the script. This attempt is where signal IPC was first
implemented as the primary communication between the web server and the
power manager.
The methodology behind this implementation required ForensicLog to write to a
file that would be memory mapped. Each time a transaction was to be serviced,
a signal would be sent to the power manager to read from the memory-mapped
address returned by the mmap() system call. The power manager would then
read the characteristics of the transaction and retrieve the resource usage
statistics for it.
Several shortcomings of this method arose very quickly. First, the exact format
in which the data was written to the memory-mapped file was unpredictable for
each transaction. This was not due to poor knowledge of the data; rather, the
method in which it was stored would often cause duplicate transactions to
appear in the power manager's statistics table. Second, the mmap() call itself
led to significant memory overhead, as the entire log file had to be locked into
memory. Not locking the file into memory could lead to a page fault needing to
be serviced in the middle of transaction profiling. Third, similar to the first
proposed method, individual transactions could not be profiled in the presence
of concurrent transaction handling.
It was clear that a more advanced interface would be required to determine
exactly which resources a specific worker process was using. This new
interface would have to exploit the entries made per pid in the /proc filesystem
and utilize the information that is stored about each process.
Looking at /proc/pid/stat was the initial attempt to observe the usage statistics
for each worker process, but the numbers reported were not fine-grained
enough. In order to see how much time is spent on the CPU, the scheduler
statistics in /proc/pid/schedstat need to be monitored instead. The statistics
reported by /proc/pid/schedstat are the length of time spent on the CPU, the
length of time spent waiting to return to the CPU, and the number of timeslices
run on the processor. Since there is only one task per child process, we can
assume that the time spent waiting to return to the CPU is spent
performing I/O.
The increased complexity came from synchronizing variables across all of the
web server's child worker processes. A table is required that tracks a
transaction's before and after statistics for each child. In order to synchronize
this data structure across all child processes, the table is mapped into a shared
memory segment and locked into memory. The shared memory is locked so
that only one page fault occurs when accessing this memory, and that page
fault occurs at lock time. This prevents the CPU statistics from being skewed by
page fault handling during the runtime of the web server.
When the initial entry of the forensic log is made, the current scheduler
statistics for the pid of the worker handling the transaction are noted as our
zero values. These values are then passed to the power manager. The power
manager then does a lookup to see if this transaction has already been
processed. If it is a new transaction, Apache will have communicated the zero
values along with the web server command being processed, and these are
placed in the profile table. Upon completion of the transaction, if the transaction
has not already been profiled, the final scheduling statistics are noted and
passed to the power manager. The power manager then categorizes the
transaction as either CPU bound or I/O bound and stores this in
the profile table.
The power manager must be extremely lightweight while executing in order to
avoid skewing the statistics of the hardware being monitored. The overhead of
any type of "busy waiting," such as spin locks or sleep-and-poll loops, is not
tolerable. Busy waiting does not leave the CPU idle and shows up as CPU
usage, which can affect the decision of the power manager to enter a sleep
state. Therefore, the synchronization method used to communicate between the
Apache web server and our power manager is signals. The benefit of signals is
twofold: signals are a lightweight inter-process communication method, and
adding signals to the Apache interface requires very little code change. The
bulk of the code change is required in the power manager itself, by simply
adding a signal mask to the power management process. The power manager
then blocks until a signal is received from Apache.
Only two signals are required to achieve our goal of communication. In this
case the signals SIGUSR1 and SIGUSR2 are used. SIGUSR1 serves as the
notification that a transaction has been received by Apache and processing will
begin. SIGUSR2 serves as notification that the transaction has been completed.
The power manager can also read an integer value passed along with the
signal, which identifies the entry in the shared memory segment that is being
profiled. On the power manager side, SIGUSR1 is the notification to begin
profiling and to search the statistics table to see if the occurring transaction has
already been profiled. If the transaction has already been profiled, profiling can
be terminated, squashing any statistics that have been gathered, and the power
manager can then apply the appropriate "power profile" according to the
transaction's categorization in the statistics table. If the transaction has not
been previously profiled, the counters have already begun before the search
was conducted, and they are stopped when SIGUSR2 is received. Once the
statistics have been gathered, the numbers are analyzed and a category
assigned before being added to the stat table (see Figure 6).
The power manager also implements two work queues for transactions that
have already been profiled. One queue is for transactions categorized as CPU
bound, and the other is for I/O bound transactions. The method used to
implement these queues requires that a transaction be looked up in the
statistics table to find its categorization. When a transaction is found in the
table, that httpd worker child is placed in a blocked state, and the pid of the
child worker is entered into the queue.
struct statistics_table {
    int index;
    int pid;
    char *transaction;
    unsigned long long cpu;
    unsigned long long disk;
    unsigned long long net;
    char classification;
    int complete;
};
Figure 6 -- Statistic Table
While the child transaction is blocked, the power state of the system remains at
its lowest point. The queues are emptied (see Figure 7) when a user-defined
number of transactions has entered a queue. A separate thread in the power
manager is started when the first pid is entered into a queue. This thread polls
the queue until its length reaches the interval, upon which the queue is emptied.
In the worst case, a queue may not reach the interval for an extended period of
time; thus a second method for a child to execute must be implemented.
In situations such as an SSL connection, a transaction waiting inside a queue is
not tolerated by the protocol. In this situation, a child that is blocked for an
extended period of time will cause the SSL connection to fail, sacrificing a great
deal of performance for the purpose of increasing energy efficiency. To avoid
this, each child has a set time that it will wait for a signal from the power
manager. This interval is also user defined, as a maximum amount of wait time.
If the child does not receive a signal to continue and the interval has elapsed,
the child continues on its own. Although this may degrade performance, it
guarantees that the web server remains
functional, in an environment with special requirements.
Figure 7 -- Queue Implementation
Chapter 5 - Results
To simulate a workload similar to the load placed on a typical web server, the
SPECweb 2005 Banking benchmark is used. This benchmark emulates an
online banking website that handles transactions concurrently from a set
number of clients. The simulated workload resembles an online ledger showing
deposits and withdrawals from single or multiple bank
accounts.
The web server that handles the transactions runs Apache httpd-2.2.11 on a
64-bit Linux server using an unmodified 2.6.28 kernel. System load is reduced
to a minimum to simulate a headless server, with minimal remote access
allowed. The hardware profile of this system includes two Intel Xeon 5410
2.33 GHz CPUs, 8 gigabytes of RAM, and two Hitachi 500 GB enterprise-level
hard disks. All non-essential partitions are unmounted to reduce background
file system activity. An interrupt balancing daemon is in place, and the
multi-core scheduler is enabled.
Three experiments are performed, one each for 100, 150, and 200 concurrent
client connections during the SPECweb workload. The two metrics focused on
are the total CPU cycle count and the average power consumed during the
respective workloads. The CPU cycle count is obtained by using the rdtsc
instruction as a counter during the workload execution. Power consumption
was measured by attaching a current probe to the 12V rails and observing the
average current at each CPU frequency that the test machine was capable of.
Utilizing a Fluke Y8100 current probe, the current was measured during each of
the three types of workloads placed on the server. During each workload, the
average current at each of the 2.00 GHz and 2.33 GHz clock frequencies was
taken. Multiplying this current by the voltage gives the average power in watts
for each workload at each frequency. This wattage calculation, presented in
Figure 8, was then used to calculate the total energy consumed at each
frequency.
Concurrent Connections    100      150      200
2.00 GHz                  34.8     38.4     41.4
2.33 GHz                  39.84    45.6     46.8
Figure 8: Watts consumed per frequency
Using the CPU cycle count taken with the rdtsc instruction [XEON 08], the
number of seconds spent at each frequency can be derived. Dividing the cycle
count by the clock frequency, in this case either 2.33 * 10^9 or 2.00 * 10^9
cycles per second, gives the total seconds spent in each frequency state. The
energy consumption of a run can then be calculated by multiplying the time
elapsed in each state by that state's
power consumption characteristic.
Governor      Frequency   100         150         200
Performance   2.00 GHz    N/A         N/A         N/A
              2.33 GHz    93919.01    107497.95   110346.85
Ondemand      2.00 GHz    782.36      1191.27     1747.18
              2.33 GHz    92827.2     106065.6    108248.4
Remora        2.00 GHz    81745.2     89779.2     96876
              2.33 GHz    302.78      793.4       1095.12
Figure 9: Power Consumption
Figure 9 shows the power spent at each frequency. The power consumed at
the low frequency under the performance governor does not apply, because the
processor is permanently set at the highest frequency the hardware is capable
of. Comparing the ondemand governor to the test governor Remora, there is a
noticeable difference in the power consumed at the two frequencies. This is
because the ondemand governor has a set interval at which the average load
on the system is checked. The default value of the interval is 0.01 seconds.
Upon expiration of this interval, the average CPU load since the last interval
expiration is checked; if it exceeds a certain percentage (default 80%), the CPU
frequency is set to the highest frequency. By focusing on a single application,
in this case the Apache web server, Remora changes the frequency based only
on the current load of the web server; all other applications are ignored.
The increase in power consumption across the three governors does not
appear to be proportional to the increase in concurrent connections. There are
two reasons for this. One reason is that the overhead of parsing the HTTP
headers for more concurrent connections can decrease the number of
transactions that the web server can process. When a new client arrives, the
server and client must set up an SSL connection and maintain it throughout the
workload; this adds to the overhead, increasing the network traffic compared to
standard connections.
The second reason can be attributed to the SPECweb benchmark itself. During
the development of the benchmark, emphasis was placed on emulating a
real-world environment. The workload is therefore divided into "bursts." The two
burst types are described as "ON" and "OFF" times, "ON" time being when the
user is actively using the banking website, and "OFF" time being one of three
events: the user is "thinking," has logged off and successfully closed the
connection, or has finished using the banking website but neglected to log off,
thus leaving the connection open. "ON" time is when the time between
transactions by a client is under 5 seconds; "OFF" time is when this gap is
greater than 5 seconds, with an average delay period of about 10 seconds.
Both the "OFF" and "ON" times are generated using a geometric equation and
a random seed. The equation in Figure 10 shows the method used to calculate
the average "OFF" time when the random seed is between 0 and 1:

(T - I/2) * (1 - (M/(T - I/2) + 1) * exp(-M/(T - I/2))) / (1 - exp(-M/(T - I/2)))

where T = 10, I = 2, M = 150

Figure 10 -- SPECweb method to calculate average "OFF" time [SPEC 05]
The workload itself is dynamically generated for each run; therefore these
power consumption results are the average of 10 runs [DESIGNSPEC 06].
There is an aggregate percentage for each type of workload specified in the
banking benchmark, but in the worst case a client may request a majority of its
workload to be I/O intensive. An example of this would be a successful login
with all following transactions being lookup requests; in the case of banking,
this would be the retrieval of check images.
In Figure 11, the Performance and Ondemand governors are almost identical
across all three tests in the total power they consume. The Remora governor
uses approximately 11% less energy.

Figure 11: Power Consumption of SPECweb across 8 cores

There is a noticeable drop in the time spent at the highest frequency compared
to the ondemand governor. This can be attributed to the benchmark's
transactions causing a significant amount of I/O. Under the Remora governor, if
a transaction's I/O time is greater than its CPU time, the governor classifies the
task as I/O bound, thus placing the CPU at the lower frequency. The energy
conserved by the Remora governor showed a negligible effect on the total
throughput of the web server. According to the average reported by the
SPECweb benchmark, all of the governors reported a 0.15 transactions per
second throughput. Maintaining a similar throughput rate is important in order
to meaningfully quantify the decreased power consumption
[PERFORMANCE 01].
Looking closely at the throughput, shown in Figure 12, reveals a less than 1%
total decrease in transactions processed by the web server. The total number
of transactions processed during the workload is a metric kept by the
SPECweb benchmark itself.
              100      150      200
Performance   27682    41271    54870
Ondemand      27676    41181    54759
Remora        27487    40989    54441
Figure 12 -- Web Server Throughput
The percentage decrease in throughput between the Performance and Remora
governors grows by less than one tenth of a percent when the total number of
concurrent clients increases from 100 to 200. The reason for this small
decrease is that the Remora governor focuses only on changing the clock
speed of each core. Because only the frequency is changed, the processor is
not placed into a deep sleep state, which allows it to continue handling web
server transactions, although at a slower pace than at the higher frequency.
Another reason is that at times, even under the Remora governor, the
processor may be at the highest possible clock frequency when transactions
are received, which is always the case under the
Performance governor.
Chapter 6 - Conclusions & Future Work
If a situation calls for a small set of repetitive tasks to be performed by a server,
profiling the system resources can be of great benefit to the energy-conscious
data center. By calculating the hardware bias of a particular task, the system
can be placed in a more energy-efficient state when that task is executed again.
The only drawback to this power scheme is that a more power-efficient bias is
chosen over a performance bias.
Separating tasks onto different machines and tailoring a power management
scheme for each particular task can yield beneficial results. First, there is
negligible loss in performance caused by the power management scheme.
Second, the power consumed by these task-specific machines is decreased
compared to a setup that assigns tasks to non-specific machines. By favoring
energy efficiency over performance in a data center, a minimal decrease in
throughput will be observed, but the long-term total power savings will
outweigh this decrease.
With the hardware available at the present time, it is not possible to perform
aggressive power management on the storage aspect of a web server. The
latency of spinning up a hard disk from a cold start is greater than 10 seconds,
which is not a tolerable interval in a web server environment. If a feasible
interval for power management on a hard disk presents itself, power
management can be extended to the hard disk.
The total time an incoming web transaction spends performing I/O to or from a
storage medium can then be profiled, and this information used to classify the
transaction accordingly. The addition of a queue to handle transactions that are
primarily I/O oriented would allow a disk to be spun up and spun down at
appropriate times, increasing the energy efficiency of the storage subsystem.
The resources that are used can be alternated between the CPU and the disk.
When the queue containing CPU bound transactions is executed, the hard
disks can be set to a low power state. Conversely, when the I/O queue is being
executed, the CPU can be placed in a lower power state.
Another method that can be implemented is the ability to categorize CPU bound
jobs into multiple queues based on the intensity of the task currently executing.
There can be two separate queues for highly intensive jobs and less intensive
jobs. The queue for highly intensive jobs would require the CPU to be at a
frequency at or near the peak, while the queue for less intensive tasks could
execute at a lower frequency and have a minimal performance impact.
The only negative effect of implementing a power-aware scheme for a specific
application is that a system must be completely dedicated to that application.
Performing other tasks on the system will decrease the power conservation that
can be observed. Other applications will also suffer a decrease in throughput,
as the system will not move into a high power state unless they happen to run
alongside the application targeted for aggressive power
management.
This detriment of implementing a userspace power manager can be remedied
by a lower-level approach. Modifying the system at the kernel level, specifically
the scheduler, could yield even greater energy conservation benefits. Replacing
simple CPU usage statistics with targeted statistics for various subsystems
would lead to an improved classification scheme. These subsystems include
off-chip memory, block devices, and network devices.
References
[ACPI 06] Hewlett-Packard/Intel/Microsoft/Phoenix Technologies/Toshiba.
"Advanced Configuration and Power Interface Specification 3.0b," October 10,
2006.

[APACHE 08] Bernhard Gröne, Andreas Knöpfel, Rudolf Kugel, Oliver Schmidt.
"The Apache Modeling Project," January 24, 2008.

[CPUIDLE 07] Venkatesh Pallipadi, Adam Belay. "Cpuidle - Doing Nothing
Efficiently," June 27, 2007.

[DESIGNSPEC 06] Standard Performance Evaluation Corporation.
"SPECweb2005 Banking Workload Design Document," April 5, 2006.

[DOE 08] Paul Scheihing, U.S. Department of Energy. "DOE Data Center
Energy Efficiency Program," May 2008.

[GAMUT 05] J. Moore. "Gamut - Generic Application eMUlaTion," December
2005. http://issg.cs.duke.edu/cod/

[GOVERNOR 08] Dominik Brodowski, Nico Golde. "CPU Frequency and
Voltage Scaling Code in the Linux(TM) Kernel," Linux Kernel Documentation.

[HOSTING 05] Yiyu Chen, Amitayu Das, Wubi Qin, Anand Sivasubramaniam,
Qian Wang, Natarajan Gautam. "Managing Server Energy and Operational
Costs in Hosting Centers," June 2005.

[HTTPD_DOC 09] The Apache Software Foundation. "Apache HTTP Server
Version 2.3 Documentation," 2009.

[INTERRUPT 08] Vaidyanathan Srinivasan, Gautham R. Shenoy, Srivatsa
Vaddagiri, et al. "Energy-aware Task and Interrupt Management in Linux,"
July 23, 2008.

[LEVEL 07] Charles Lefurgy, Xiaorui Wang, Malcolm Ware. "Server-Level
Power Control," 2007.

[LOAD DISPATCH 08] Gong Chen, Wenbo He, Jie Liu, et al. "Energy-Aware
Server Provisioning and Load Dispatching for Connection-Intensive Internet
Services," 2008.

[MULTI-TIER 07] Tibor Horvath, Kevin Skadron, Tarek Abdelzaher. "Enhancing
Energy Efficiency in Multi-Tier Clusters via Prioritization," 2007.

[PATH 98] S. Schechter, M. Krishnan, M. D. Smith. "Using Path Profiles to
Predict HTTP Requests," April 1998.

[PERFMON] Pfmon2 Project. http://perfmon2.sourceforge.net/

[PERFORMANCE 01] Tarek Abdelzaher, Kang Shin, Nina Bhatti. "Performance
Guarantees for Web Server End-Systems: A Control-Theoretical Approach,"
2001.

[POLICIES 03] Mootaz Elnozahy, Michael Kistler, Ramakrishnan Rajamony.
"Energy Conservation Policies for Web Servers," 2003.

[SPEC 05] Standard Performance Evaluation Corporation. SPECweb2005
Benchmark Suite. http://www.spec.org/web2005/

[SYSTEM 06] Dimitris Economou, Suzanne Rivoire, Christos Kozyrakis, Partha
Ranganathan. "Full System Power Analysis and Modeling for Server
Environments," 2006.

[SYSTEMS 04] Ricardo Bianchini, Ram Rajamony. "Power and Energy
Management for Server Systems," 2004.

[TICKLESS 07] "Getting More from Tickless," http://lwn.net/Articles/240253/,
June 30, 2007.

[XEON 08] Intel Corporation. "Dual-Core Intel® Xeon® Processor 5200 Series
Datasheet," October 2008.