THESIS - Binghamton University
REMORA: AGGRESSIVE POWER MANAGEMENT FOR APACHE HTTPD WEB SERVER
BY
SHANE CASE
BS, SUNY Farmingdale, 2006
THESIS
Submitted in partial fulfillment of the requirements for
the degree of Master of Science in Computer Science
in the Graduate School of
Binghamton University
State University of New York
2009
UMI Number: 1473688
Copyright 2010 by ProQuest LLC. All rights reserved. This edition of the work is protected against
unauthorized copying under Title 17, United States Code.
© Copyright by Shane Case 2009
All Rights Reserved
Accepted in partial fulfillment of the requirements for the degree of Master of Science in Computer Science
in the Graduate School of Binghamton University
State University of New York
2009

November 20, 2009
Kanad Ghose, Department of Computer Science, Binghamton University
Kartik Gopalan, Department of Computer Science, Binghamton University
Abstract
With the release of the Pentium 4 Prescott, the trend of raising clock frequency to boost performance came to an end, because higher clock frequencies bring higher heat dissipation from the CPU. Large data centers and server farms must now account for the cost of cooling machines as well as the energy they consume while operating. In recent years the efficiency of individual server machines has declined because servers rarely run at full capacity: power management software does not account for the recent growth in computing capability, and this is reflected in the gap between utilization and the power consumed by server hardware. The current solution is the Advanced Configuration and Power Interface (ACPI), which gives control of power configuration to the operating system; the operating system then manages power according to the present load on the system. A machine dedicated to a specific purpose, such as a web server, may not benefit from a system-wide power scheme. A web server sees a great deal of repeated traffic, which can be exploited by keeping a record of hardware usage for frequent transactions. Categorizing these transactions as either CPU or I/O intensive leads to a more "intelligent" power management scheme. The goal of this thesis is to show that such a scheme can further reduce energy consumption while minimizing degradation in performance.
Acknowledgments
I would like to thank my parents, for providing me with the resources, support, and
opportunity to further my education in graduate school. I also want to thank my brother
for initially teaching me how to use a computer before a graphical user environment
existed.
Finally, I would like to thank my advisor, Kanad Ghose, for providing such a supportive
work environment, and for his advice and encouragement, without which this work would
not have been possible. Under his tutelage I continue to learn new things and will strive
to move my education ahead even further.
Table of Contents

Table of Figures
Chapter 1 – Introduction
Chapter 2 – Related Work
    Load Leveling and CPU Voltage
    Workload Forecasting and Cluster Load Leveling
    Minimizing Processor Wakeup Iterations
    Full System Power Modeling
Chapter 3 – Hardware Interface
    P-States
    C-States
    ACPI Power Management Schemes
Chapter 4 – Interface with Apache Web Server
    Apache Logging Capabilities
Chapter 5 – Results
Chapter 6 – Conclusions and Future Work
References
Table of Figures

Figure 1 – Data Center Infrastructure Efficiency
Figure 2 – Load Leveling Task Queue
Figure 3 – CPU Power Management Granularity
Figure 4 – Apache Child Listener/Worker Loop
Figure 5 – Httpd Child Process
Figure 6 – Statistic Table
Figure 7 – Queue Implementation
Figure 8 – Watts Consumed Per Frequency
Figure 9 – Power Consumption
Figure 10 – SPECweb Method to Calculate Average "Off" Time
Figure 11 – Power Consumption of SPECweb across 8 Cores
Figure 12 – Web Server Throughput
Chapter 1 – Introduction

In recent years, the movement to processing data electronically has driven a rise
in the number of data centers present throughout the world. Music, movies, and
telephone services are just the beginning of what is being moved to electronic
delivery. Today, almost all levels of government and corporations own and
maintain a data center, and the need for these data centers keeps growing.
The total consumption of energy by data centers in the United States in 2006
was 1.5% of all energy consumed in the country, equivalent to the consumption
of about 10 million typical U.S. households. Data center consumption is just a
small subset of U.S. industry, which as a whole consumes one third of all energy
consumed by the country. Further investigation, however, shows that the majority
of this energy is not consumed by the servers themselves [HOSTING 05]: more
than half of the energy used by a data center goes to cooling the servers
[SYSTEMS 04]. This is an obvious inefficiency, and it will only grow as the size
of the data center grows. The U.S. Department of Energy has called for a 10%
reduction in the total energy consumption of data centers by 2011; a larger
reduction is possible if the methods used for cooling data centers can be made
more efficient.
The metric the Department of Energy uses is called Data Center Infrastructure
Efficiency (DCIE; see Figure 1). This metric currently stands below 0.5 for a
typical data center, and the goal of the Department of Energy is to increase it:
under their proposed best practices, the metric should reach 0.85. If and when
this goal is achieved, the majority of the power consumed by a data center will
be the energy required to operate the servers. Increasing this metric will lead
to a decrease in a data center's total energy consumption.
Figure 1 – Data Center Infrastructure Efficiency: DCIE = Energy for IT Equipment / Total Energy for Data Center [DOE 08]
Implementing complex cooling mechanisms can have a high initial cost and can
require the complete renovation of server farms or replacement of servers.
Cheaper alternatives exist in the form of software mechanisms for a more
aggressive power management scheme [PERFORMANCE 01]. These
mechanisms can be complex to implement and will not yield the same outcome
as a hardware implementation. Conserving energy by modifying a server's
power management can also reduce the performance of each machine so
modified.
If a machine is dedicated to a task like a web server, there is a characteristic that
can be exploited to maximize power savings and minimize the effect on server
throughput [LEVEL 07]. Web servers typically handle transactions that repeat
over a period of time, and it is this repetition that can be used to profile the
resource usage of web servers [PATH 98]. By profiling repeated transactions, it
is possible to learn the common resource usage of handling these transactions.
Taking the information from profiling, transactions can effectively be categorized
by their resource usage. Transactions that typically require a great amount of
computation can be said to be CPU bound. Similarly, transactions that require a
great amount of storage medium activity can be said to be I/O bound.
Using these categorizations, the power management policy can be dynamically
adjusted when transactions of a certain type are encountered
[POLICIES 03]. For instance, if there are several transactions
occurring that are all of the CPU bound category, we can place the hard disks
into a low power state to conserve energy rather than having them sit idle, but
still in a high power state. By the same token, if there are several transactions
that are all I/O bound, the processor can be placed in a low power state to save
energy.
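The policy just described can be condensed into a small sketch. The names and the two-device model are invented for illustration; real state changes would go through the kernel interfaces described in Chapter 3 rather than these placeholder fields:

```c
enum txn_class { CPU_BOUND, IO_BOUND };
enum dev_power { ACTIVE, LOW_POWER };

struct policy { enum dev_power cpu, disk; };

/* If the current mix of transactions is CPU bound, the disks may sleep;
 * if it is I/O bound, the CPU may drop to a low power state. */
struct policy choose_policy(enum txn_class dominant) {
    struct policy p;
    if (dominant == CPU_BOUND) { p.cpu = ACTIVE;    p.disk = LOW_POWER; }
    else                       { p.cpu = LOW_POWER; p.disk = ACTIVE;    }
    return p;
}
```

The point of the sketch is only that the decision is driven by the transaction category, not by system-wide load as in stock ACPI management.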
If the energy consumed by a dedicated task server can be reduced, several
benefits will be observed. First, the power consumed by each server will be
reduced. This reduction in power consumption by a machine will reduce the heat
dissipation of hardware components, thus extending the life of the hardware
itself. Second, since heat dissipation will be reduced, less cooling will be
required. Third, the performance penalty incurred when the temperature rises
in the processing environment may also be reduced.
There have been various studies on implementing software-based energy
conservation mechanisms; these methods are discussed in the related work
section of this thesis. The repetitive nature of a web server's workload,
however, presents a unique opportunity to maximize energy conservation while
imposing a minimal impact on system performance.
From this point forward, this thesis is organized as follows. An overview of
related work is given to show the uniqueness of the web server scenario.
This is followed by a presentation of the existing capabilities of today's
hardware for changing system power settings. Finally, the necessary code
changes are discussed along with the results of benchmarks designed to stress
the web server.
Chapter 2 – Related Work

This section discusses research similar to the focus of this thesis: first, a
study that alters the processor P-state according to the current load across a
cluster of servers; second, a study that uses load forecasting to balance the
workload across a cluster of servers; and third, a study that consolidates
tasks onto a small number of CPUs in a multi-CPU environment in order to
render several CPUs completely idle.
Load Leveling and CPU Voltage
In a cluster environment (See Figure 2), having multiple servers available to
handle web based transactions is not always energy efficient. By measuring the
latency of request response time and the relative load placed on system
resources, a more efficient cluster load leveling scheme can be implemented.

Figure 2 – Load Leveling Task Queue [MULTI-TIER 07]
At a user-defined interval, the latency of response can be measured and the
system's power level modified accordingly. If the latency time exceeds a
user-defined percentage above a user-defined threshold, the P-state on nodes of
the cluster can be increased to attempt to normalize the latency time to the
threshold. Conversely, if the latency time falls below the threshold by a
user-defined percentage, the P-state on the node with the lightest load can be
reduced. This scheme theoretically maximizes the workload-to-energy-consumption
ratio with a minimal performance impact.
The primary drawback of this scheme is the user-defined monitoring interval. In
the worst case, the load on system resources changes drastically immediately
after the interval has elapsed. If the load rises while the system is in a low
power state, performance is impacted; conversely, if the load drops sharply
while the system remains in a high power state, the efficiency of the scheme
is lost.
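The per-interval check can be sketched as follows. The function and field names are ours, not from the cited study, and P-state 0 is taken as the fastest state, following the convention described in Chapter 3:

```c
typedef struct {
    int pstate;      /* current P-state: 0 = fastest, pstate_max = slowest */
    int pstate_max;  /* deepest (slowest) P-state the CPU supports */
} node_t;

/* Called once per monitoring interval with the measured response latency.
 * If latency exceeds the threshold by the given fraction, speed the node up;
 * if it falls below the threshold by that fraction, slow the node down. */
int adjust_pstate(node_t *n, double latency, double threshold, double pct) {
    if (latency > threshold * (1.0 + pct) && n->pstate > 0)
        n->pstate--;                 /* too slow: raise clock frequency */
    else if (latency < threshold * (1.0 - pct) && n->pstate < n->pstate_max)
        n->pstate++;                 /* ample headroom: lower clock frequency */
    return n->pstate;
}
```

With a threshold of 50 ms and pct = 0.2, a measured latency of 70 ms moves the node one P-state faster, 30 ms moves it one state slower, and anything in between leaves it untouched.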
Workload Forecasting and Cluster Load Leveling
A second approach, from a web server standpoint, is to attempt to forecast the
load that will be placed on a system at a given time. As in the first study, in
a cluster environment, having all nodes of a cluster available to service requests
can be energy inefficient. The ability to produce a workload forecast can help to
calculate the appropriate number of cluster nodes that will need to be available.
In a web application environment, two main values must be monitored: the login
rate and the number of connections. In addition to these values, time must be
allotted for cluster nodes to perform a cold start and to pick up a workload
equal to that of the nodes already handling transactions. The resulting equation
returns the number of cluster nodes required to handle the current workload
placed on the cluster. One drawback of this load leveling scheme is the inherent
requirement for an observation period, which must then be followed by a
validation period to check the correctness of the forecasting equation.
A further drawback of forecasting is that a forecast is only valid for a
specific set of hardware, and having to allocate time for observation can impose
a large overhead in an upgrade situation. When hardware such as the CPUs of the
designated cluster is updated, computational power increases, allowing cluster
nodes to handle more transactions; each time this occurs, a new observation
period must be performed [LOAD DISPATCH 08].
Minimizing Processor Wakeup Iterations
In today's enterprise computing environment, servers typically have more than
one socket. These sockets can support multi-core or SMP processors. Having
two processors present in a system provides a boost to computational capacity.
With this boost, however, comes an increase in energy consumption, and the
ability of the OS power manager to keep an idle system in a sleep state
becomes important.
Before kernel version 2.6.21 [TICKLESS 07], the kernel had a scheduler tick
that caused a wake-up at every elapsed tick. Starting with 2.6.21, the option
of a "tickless" kernel became available, an important feature in that it
eliminated the extra wake-ups caused by scheduling timers. Even so, undesirable
processor wake-ups still occur, so the ability to isolate OS daemons, timers,
and interrupts to a specific processor becomes vital.
On today's CPU hardware, SMP processors can be powered down to a "deep
sleep" state (see Chapter 3). There is a restriction, however: individual
cores cannot be set to different power states. It follows that if all causes of
a processor wake-up can be isolated to a single processor and its cores, the
second processor can remain in "deep sleep" for a longer interval
[INTERRUPT 08]. The kernel already has a built-in feature, the multi-core power
saving mode, which can be enabled via the bit in
/sys/devices/system/cpu/sched_mc_power_savings. Daemons and interrupts
already have built-in kernel features restricting them to a specific processor
set; timers, however, do not.
Full System Power Modeling
Full system power estimation has been implemented to model the power
requirements necessary for a server. In Mantis [SYSTEM 06], the power
requirements for primary hardware components are estimated using various
hardware and software counters. Computing the CPU utilization, off-chip memory
access count, hard disk activity, and network activity allows for the estimation of
power consumption on a specific machine.
One of the drawbacks of the Mantis approach is the granularity of power
estimation: power can only be estimated upon expiration of a set interval,
with a minimum of one second. Live power estimation, or power modeling at a
finer granularity, would be much more accurate. Another drawback is that the
system's kernel must be modified heavily. To access the hardware counters
needed for power estimation, additional patches must be applied to the kernel;
the required patch, perfmon2, also needs separate libraries to be installed to
allow counter monitoring [PERFMON]. Even with these modifications, further
counters are required for access to statistics about the I/O and network
subsystems.
After the necessary modifications are implemented, the equation that computes
the current power must be calibrated for the individual hardware profile. To
calculate the coefficient for each hardware component, an application designed
to stress individual components must be used: the emulation suite Gamut is used
to calibrate the coefficients for CPU activity, off-chip memory accesses, I/O
activity, and network activity [GAMUT 05].
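A Mantis-style estimate reduces to a linear combination of utilization counters. A toy version follows; the struct layout and all coefficients are invented for illustration, not taken from the paper:

```c
/* Full-system power model in the style of Mantis: a calibrated baseline plus
 * one calibrated coefficient per utilization counter. All numbers invented. */
struct power_model {
    double base_w;    /* idle/baseline draw in watts */
    double c_cpu;     /* watts per percent CPU utilization */
    double c_mem;     /* watts per million off-chip memory accesses per second */
    double c_disk;    /* watts per MB/s of disk activity */
    double c_net;     /* watts per MB/s of network activity */
};

double estimate_power(const struct power_model *m, double cpu_pct,
                      double mem_maccess, double disk_mbs, double net_mbs) {
    return m->base_w + m->c_cpu * cpu_pct + m->c_mem * mem_maccess
         + m->c_disk * disk_mbs + m->c_net * net_mbs;
}
```

Calibration then amounts to stressing one component at a time, as Gamut does, and fitting each coefficient from the measured wattage.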
Chapter 3 – Hardware Interface
Today's server CPUs (such as the Intel Xeon) have two methods of saving
energy. These two methods have been implemented to comply with the current
ACPI specification (3.0b). The two methods consist of P-states and C-states. P-
states refer to the clock frequency that the processor is currently running at. C-
states refer to processor throttling.
Both types have N states, where N is processor dependent, with 0 being the
highest state (highest energy consumption, highest performance) and N being the
lowest state (lowest energy consumption, lowest performance). The P-state can
be considered a finer granularity of energy saving, as these states apply only
while the processor is in C0, the execution state. When leaving C0 for a deeper
C-state, the P-state must first be set to the level required to leave C0 before
the processor can be throttled to the CN state.
Figure 3 – CPU Power Management Granularity
P-States
Processor P-states are the different clock frequencies at which a processor is
capable of running. P-state capabilities are reported to OS power management by
reading the _PPC (Performance Present Capabilities) object [ACPI 06]. The _PPC
object simply shows which states are available; each state it lists has a
corresponding entry in the _PSS (Performance Supported States) object. Each
_PSS entry carries six fields of extended information about the state: core
frequency, power dissipation in milliwatts, transition latency, bus master
latency, control, and status. The first two fields are straightforward.
Transition latency is the time the CPU will be unavailable due to the P-state
change; bus master latency is the time during which bus masters cannot access
the memory hierarchy; control is the value that must be written to the PERF_CTL
register; and status is the correct return value from the PERF_STATUS register.
In the event of a failure to transition to a particular P-state, the return value from
reading PERF_STATUS will differ from the correct value which is contained in
the corresponding _PSS object. There can also be a _PSD object which
contains any dependencies, whether they be hardware or software, that may
constrain the change of P-state. The OS power management must be aware of
these dependencies and satisfy them before a P-state can be successfully
entered. An example of such a dependency can be the inability of today's multi-
12
core processors (Core 2 Duo / Xeon multi-core) to have different cores on
different P-states.
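The per-state bookkeeping can be pictured as a C struct mirroring the six _PSS fields, together with the status check described above. The struct and function names are ours; real OSPM code reads these values out of the ACPI namespace rather than a hand-filled struct:

```c
#include <stdint.h>

/* One entry of the ACPI _PSS package: six fields per P-state */
struct pss_entry {
    uint32_t core_frequency_mhz;    /* clock frequency in this P-state */
    uint32_t power_mw;              /* power dissipation in milliwatts */
    uint32_t transition_latency_us; /* time the CPU is unavailable on a switch */
    uint32_t bus_master_latency_us; /* time bus masters cannot reach memory */
    uint32_t control;               /* value written to PERF_CTL to request it */
    uint32_t status;                /* expected PERF_STATUS value on success */
};

/* A P-state transition succeeded iff PERF_STATUS reads back the expected
 * status value recorded in the corresponding _PSS entry. */
int pstate_transition_ok(const struct pss_entry *e, uint32_t perf_status) {
    return perf_status == e->status;
}
```

The field values in the test below are placeholders, not real register encodings.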
C-States
C-states are the throttling ("sleep") states of the processor. The number of
C-states supported by a processor varies by implementation, with a larger
number supported on mobile and server processors. C-states run in order from
C0, the execution state, to CN, the deepest sleep state.
C-states can be implemented on the processor using one of two methods. The
first method is the presence of the P_LVLx registers, typically implemented on
platforms that only support up to the C3 state; the registers are P_LVL2 and
P_LVL3. To change states, the OS power manager performs a read on the register
that matches the desired state. The OS assumes that all processors have the
same C-state capability; when this is not the case, management is offloaded to
the BIOS, which chooses the lowest C-state that can be entered.
The second method of implementation is when the _CST object is present. In
the event that both P_LVLX and _CST objects are present on a system the OS
power manager always uses the _CST objects. It is the _CST object that
presents more information to the OS power manager to make more efficient
decisions. The _CST also enables the OS to act when power events can change
the capability of processors, such as on a mobile device. When such an event
happens the _CST object sends a notify event to the OS power management to
reevaluate the capabilities of the processor.
The C0 state is the only state in which execution can take place on the CPU
[CPUIDLE 07]; therefore the processor must be completely idle before it can
exit C0. System cache integrity must also be considered, and the task of
maintaining the context of these caches varies according to the current
C-state.
When entering the C1 state, the latency of returning to C0 is considered
negligible by the OS power manager. The C1 state is the only power throttling
state that must be supported by all processors; this is achievable because C1
is simply the execution of the HLT instruction. System caches are
maintained by the processor itself and there are no software visible effects of
being in this state. Exit from this state can be for any reason, but must always
occur on an interrupt.
Throttling state C2 is the first "sleep" state that requires hardware support
from the chipset. Latency is higher than in the C1 state, with the exact time
stored in the FADT ACPI table. Chipset support is required because the
processor must still maintain the context of the system caches, including
snooping for bus master cache accesses and for cache accesses in a
multi-processor environment. As in C1, there are no software-visible effects of
being in this state; it can be exited for any reason, but exit must always
occur on an interrupt.
The C3 throttling state requires the most hardware support of all throttling
states. System caches must be maintained by the OS power manager, as snooping
by the processor is not supported in this state. The absence of bus master
activity is ensured differently in uni-processor and multi-processor
environments: a uni-processor system sets the bus master arbitration disable
bit, while a multi-processor system flushes the contents of the on-chip caches,
an approach reserved for multi-processor environments because of the high
latency associated with flushing. One way to flush the caches is to read a file
larger than the largest cache present in the system (typically L2, or L3 if
present). The OS power manager can check the bus master status bit (BM_STS)
for bus master activity before deciding to enter the C3 state, and on exit the
BM_RLD bit tells the OS power manager whether the state was exited due to bus
master activity. Because the hardware does not maintain the system caches, the
processor must return to the C0 state when exiting C3 in order to restore the
context of these caches.
ACPI Power Management Schemes
The method by which the C-state and P-state are set for a system is called the
"governor" policy. Governors must be compiled into the kernel, and a default
governing policy is selected at compile time. At boot time the selected
governor is used; this can be changed at run time via the /sys interface.
There are five governors currently available in the Linux kernel
[GOVERNOR 08]. Two of them are "static" governors, called "performance" and
"powersave". No logic accompanies these types of frequency scaling:
performance sets the frequency at the highest possible setting, powersave at
the lowest, and the frequency will not deviate from its setting at boot-up. A
third governor, "userspace", requires a high level of user interaction: it
allows any process running with superuser permission to alter the current
frequency setting of the processor.
The final two governors, "ondemand" and "conservative", are dynamic and differ
greatly from the static governors: based on the current load on the system,
they vary the processor's frequency setting in different ways. Both governors
have a "sampling rate" in common, an interval expressed in microseconds. Each
time the interval elapses, the total usage percentage of each CPU is checked
and the kernel decides whether the current frequency is appropriate or needs to
be increased or decreased; the number checked is the average usage percentage
over the preceding interval. The CPU load threshold that, when exceeded,
causes a frequency change is called the "up_threshold"; it can be set by the
user through the interface in /sys. Where these two governors differ is in the
frequency decision that is applied. With the ondemand governor, when a
processor's frequency is to be changed it is increased straight to the maximum
frequency the processor supports; when the processor returns to an idle state,
the governor drops the frequency back down to the minimum.

The conservative governor changes frequency more "gracefully". Instead of
immediately jumping to the maximum frequency when CPU usage increases, it
raises the CPU frequency gradually: as long as the load on the processor is
above the threshold for increasing the frequency, it moves up one P-state at a
time until that threshold is no longer exceeded. Decreasing the frequency is
governed by a similar threshold of CPU usage, called the "down_threshold".
When CPU usage falls below this threshold, the frequency is decreased one
P-state at a time until the CPU is at its lowest possible frequency.
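The two decision rules can be contrasted in a toy model; P-state 0 is the fastest here, 4 the slowest, and the five-state range and thresholds are illustrative rather than taken from any real CPU:

```c
#define FASTEST 0
#define SLOWEST 4

/* ondemand: any load above up_threshold jumps straight to the fastest
 * P-state; an idle CPU falls straight back to the slowest. */
int ondemand_step(int pstate, int load_pct, int up_threshold) {
    if (load_pct > up_threshold) return FASTEST;
    if (load_pct == 0)           return SLOWEST;
    return pstate;
}

/* conservative: similar thresholds move the frequency one P-state at a
 * time instead of jumping to the extremes. */
int conservative_step(int pstate, int load_pct, int up_threshold, int down_threshold) {
    if (load_pct > up_threshold   && pstate > FASTEST) return pstate - 1;
    if (load_pct < down_threshold && pstate < SLOWEST) return pstate + 1;
    return pstate;
}
```

At run time either governor is selected through the /sys interface mentioned above, for example by writing `conservative` to /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor.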
Chapter 4 – Interface with Apache Web Server
Apache httpd is an open source web server and the most widely used web server
on today's Linux servers; since its initial release in 1995 it has grown to
become the most popular web server software on the web. The attraction of this
software suite may stem from its robustness: the system administrator has the
option of compiling each component of the web server as built-in or as a module
loaded at run time. This even extends to the method by which HTTP requests are
handled by the web server.
The main focus of this thesis is the Linux variant of the Apache httpd web
server, so the "pre-forking" architecture is the prime concern. The pre-forking
architecture is the method by which the web server processes incoming HTTP
requests: upon start-up, the web server forks N children to handle all
requests. This number is set in the configuration file and defaults to 8
children; typically the administrator changes it based on the capabilities of
the server that will perform the web hosting.
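In an httpd.conf for the prefork architecture these knobs look roughly like the following; the directive names are real Apache directives, and the values simply echo the defaults quoted above:

```apache
# Number of children forked at start-up
StartServers          8
# Requests a child serves before dying "gracefully"; 0 would mean unlimited
MaxRequestsPerChild   4000
```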
The children of the parent process (see Figure 4) are referred to as "workers".
Each worker enters a loop that handles a set number of requests before the
child "dies"; this number is also located in the configuration file of the web
server, and by default each child handles 4,000 requests and then dies
"gracefully". Across the entire hierarchy, only one child at a time is in the
"listener" state; all other children sit idle while that single child process
listens. When an incoming request is received, the listener takes the header of
the request and enters the working state, and as soon as it does, the next idle
child becomes the listener. When the child in the working state completes the
given transaction, it returns to the idle queue to again wait its turn to be
the listening process.
Figure 4 – Apache Child Listener/Worker Loop [APACHE 08]

Figure 5 – Httpd Child Process [APACHE 08]

When a request is received, it is usually for an entire web document. This is
not what a request is to the web server. A web document can contain any number of
images, scripts, etc. Each of these parts of a particular web document is
considered a transaction. For instance, let's say a web document consists of
text and three images. This web document would then consist of four
transactions for the web server. One transaction would consist of the text only,
and a transaction for each image.
After a request to the web server has been broken down into transactions
(see Figure 5), each location needs to be translated. When the web server
receives a request for a web site or its contents, the location in the header
is relative to the "DocumentRoot" set in the web server configuration file; it
is the job of the web server to locate the absolute file path of the requested
element.
After the location has been computed, the permissions on that file must be
checked. Permissions matter in two ways: the group that owns the file and the
groups allowed to access it, handled in the same fashion as standard
permissions on a Linux system. The web server can also apply its own
authentication methods if files are set to private; these can range from
restricting access to a set of IP addresses to requiring a user to log in to
view the files. If either method of authentication fails, the user sees a
"Forbidden" message returned by the web server.
After permissions have been verified, the child checks whether there is a
"quick handler" for the specific file requested. Quick handlers are more
lightweight than the standard transaction handling method. If no quick handler
exists for the file or file type, the standard transaction handler is used: it
retrieves the file and sends it to its destination, the requesting IP address
from the HTTP header.
When the network connection has finished, the count of transactions handled by
this child is incremented and checked against the user-defined maximum number
of transactions a child may handle; if it has reached the maximum, the child is
terminated. Otherwise the child returns to the idle queue to await its next
turn to serve as the listening process.
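The counter check condenses to a few lines. This is a toy model with an invented function name; real httpd spreads the equivalent logic across several source files:

```c
/* Returns 1 if the child should terminate after this transaction.
 * max_requests mirrors the configuration value discussed above; as in
 * httpd, a value of 0 means "unlimited". */
int child_should_exit(int handled, int max_requests) {
    return max_requests > 0 && handled >= max_requests;
}
```

With the default of 4,000 requests, the 4,000th completed transaction is the one that triggers the graceful death of the child.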
Apache Logging Capabilities
The Apache web server has a module called the "Forensic Logging Module", which
is modified here to become the communication channel between the web server and
the power manager. The unmodified forensic log module creates two entries in a
specified forensic log file [HTTPD_DOC 09]: the first when the HTTP header is
received, before the child "listener" process picks up the transaction, and the
second when the transaction has been serviced and completed. These two entries
serve as the start and stop points for profiling the system resources used
while processing a particular transaction.
Because each transaction is handled by a different worker process, Apache can
process transactions in parallel, which is a key strength. From a profiling
standpoint, however, this increases the complexity of tracking the system
resources used by individual child workers.
Initial attempts to profile transactions ran into a substantial problem:
overlap between transactions being processed concurrently. The first attempt at
profiling implemented a message queue in the ForensicLog module to serve as the
primary method of communication between Apache and the power manager.
Implementing the message queue was simple, and it already provided the blocking
mechanism needed to let the power manager sleep efficiently: when ForensicLog
sent a message, the power manager would wake up, gather statistics for the
transaction, and return to sleep. Test runs, however, exposed the overhead of
the message queue system calls as a major problem. First, the time taken by the
inter-process communication skewed the recorded resource usage numbers for a
particular transaction. Second, if more than one transaction was being
serviced, all of the resource usage would be attributed to the first
transaction the power manager was monitoring.
The second attempt at a feasible communication method was to use
memory-mapped I/O. Implementing the memory map required ForensicLog to
output to a file and to use a tool called 'rotatelogs'. Rotatelogs is a script
included with the Apache source code. The script is applied to log files and
forces the log file to be rotated when either a set amount of time has elapsed or
the log file exceeds a set size. The name of the log file must be formatted and
passed as input to the script. This attempt is where signal IPC was first
implemented as the primary communication between the web server and the
power manager.
The methodology behind this implementation required ForensicLog to write to a
file that would be memory mapped. Each time a transaction was to be serviced,
a signal would be sent to the power manager to read from the memory-mapped
address returned by the mmap() system call. The power manager would then
read the characteristics of the transaction and retrieve the resource usage
statistics for it.
Several shortcomings of this method arose very quickly. First, the exact format
in which the data was written to the memory-mapped file was unpredictable for
each transaction. This was not due to poor knowledge of the data; rather, the
method in which it was stored would often cause duplicate transactions to
appear in the power manager's statistics table. Second, the mmap() call itself
led to significant memory overhead, as the entire log file had to be locked into
memory. Not locking the file into memory could lead to a page fault needing to
be serviced in the middle of transaction profiling. Third, similar to the first
proposed method, individual transactions could not be profiled in the presence
of concurrent transaction handling.
It was clear that a more advanced interface would be required to determine
exactly which resources a specific worker process was using. This new
interface would have to exploit the entries made per pid in the /proc filesystem
and utilize the information that is stored about each process.
Looking at /proc/pid/stat was the initial attempt to observe the usage statistics
for each worker process, but the numbers reported were not fine-grained
enough. In order to see how much time is spent on the CPU, the scheduler
statistics in /proc/pid/schedstat need to be monitored instead. The statistics
reported by /proc/pid/schedstat are the length of time spent on the CPU, the
length of time spent waiting to return to the CPU, and the number of timeslices
run on the processor. Since there is only one task per child process, we can
assume that the time spent waiting to return to the CPU is spent
performing I/O.
The increased complexity came from synchronizing variables across all of the
web server's child worker processes. A table is required that tracks a
transaction's before and after statistics for each child. In order to synchronize
this data structure across all child processes, the table is mapped into a shared
memory segment and locked into memory. The shared memory is locked so
that only one page fault occurs when accessing this memory, and that page
fault occurs at lock time. This prevents the CPU statistics from being skewed by
page fault handling during the runtime of the web server.
When the initial entry of the forensic log is made, the current scheduler
statistics for the pid of the worker handling the transaction are noted as our
zero values. These values are then passed to the power manager. The power
manager then does a lookup to see if this transaction has already been
processed. If it is a new transaction, Apache will have communicated the zero
values along with the web server command being processed, and these are
placed in the profile table. Upon completion of the transaction, if the transaction
has not already been profiled, the final scheduling statistics are noted and
passed to the power manager. The power manager then categorizes the
transaction as either CPU bound or I/O bound and stores this in
the profile table.
The power manager must be extremely lightweight while executing in order to
avoid skewing the statistics of the hardware being monitored. The overhead of
any type of "busy waiting," such as spin locks or sleep-and-poll loops, is not
tolerable. Busy waiting does not leave the CPU idle and shows up as CPU
usage, which can affect the decision of the power manager to enter a sleep
state. Therefore, the synchronization method used to communicate between the
Apache web server and our power manager is signals. The benefit of signals is
twofold: signals are a lightweight inter-process communication method, and
adding signals to the Apache interface requires very little code change. The
bulk of the code change is required in the power manager itself, by simply
adding a signal mask to the power management process. The power manager
then blocks until a signal is received from Apache.
Only two signals are required to achieve our goal of communication. In this
case the signals SIGUSR1 and SIGUSR2 are used. SIGUSR1 serves as the
notification that a transaction has been received by Apache and processing will
begin. SIGUSR2 serves as notification that the transaction has been completed.
The power manager can also read an integer value passed along with the
signal, which identifies the entry in the shared memory segment that is being
profiled. On the power manager side, SIGUSR1 is the notification to begin
profiling and to search the statistics table to see if the occurring transaction has
already been profiled. If the transaction has already been profiled, profiling can
be terminated, squashing any statistics that have been gathered, and the power
manager can then apply the appropriate "power profile" according to the
transaction's categorization in the statistics table. If the transaction has not
been previously profiled, the counters have already begun before the search
was conducted, and they are stopped when SIGUSR2 is received. Once the
statistics have been gathered, the numbers are analyzed and a category
assigned before being added to the stat table (see Figure 6).
The power manager also implements two work queues for transactions that
have already been profiled. One queue is for transactions categorized as CPU
bound, and the other is for I/O bound transactions. The method used to
implement these queues requires that a transaction be looked up in the
statistics table to find its categorization. When a transaction is found in the
table, that httpd worker child is placed in a blocked state, and the pid of the
child worker is entered into the queue.
struct statistics_table {
    int index;
    int pid;
    char *transaction;
    unsigned long long cpu;
    unsigned long long disk;
    unsigned long long net;
    char classification;
    int complete;
};
Figure 6 -- Statistic Table
While the child transaction is blocked, the power state of the system remains at
its lowest point. The queues are emptied (see Figure 7) when a user-defined
number of transactions has entered a queue. A separate thread in the power
manager is started when the first pid is entered into a queue. This thread polls
the queue until its length reaches the interval, upon which the queue is emptied.
In the worst case, a queue may not reach the interval for an extended period of
time; thus a second method for a child to execute must be implemented.
In situations such as an SSL connection, a transaction waiting inside a queue is
not tolerated by the protocol. In this situation, a child that is blocked for an
extended period of time will cause the SSL connection to fail, sacrificing a great
deal of performance for the purpose of increasing energy efficiency. To avoid
this, each child has a set time that it will wait for a signal from the power
manager. This interval is also user defined, as a maximum amount of wait time.
If the child does not receive a signal to continue and the interval has elapsed,
the child continues on its own. Although this may degrade performance, it
guarantees that the web server remains
functional, in an environment with special requirements.
Figure 7 -- Queue Implementation
Chapter 5 - Results
To simulate a workload similar to the load placed on a typical web server, the
SPECweb 2005 Banking benchmark is used. This benchmark emulates an
online banking website that handles transactions concurrently from a set
number of clients. The simulated workload resembles an online ledger showing
deposits and withdrawals from single or multiple bank
accounts.
The web server that handles the transactions runs Apache httpd-2.2.11 on a
64-bit Linux server using an unmodified 2.6.28 kernel. System load is reduced
to a minimum to simulate a headless server, with minimal remote access
allowed. The hardware profile of this system includes two Intel Xeon 5410
2.33 GHz CPUs, 8 gigabytes of RAM, and two Hitachi 500 GB enterprise-level
hard disks. All non-essential partitions are unmounted to reduce background
file system activity. An interrupt balancing daemon is in place, and the
multi-core scheduler is enabled.
Three experiments are performed, one each for 100, 150, and 200 concurrent
client connections during the SPECweb workload. The two metrics focused on
are the total CPU cycle count and the average power consumed during the
respective workloads. The CPU cycle count is obtained by using the rdtsc
instruction as a counter during the workload execution. Power consumption
was measured by attaching a current probe to the 12V rails and observing the
average current at each CPU frequency that the test machine was capable of.
Utilizing a Fluke Y8100 current probe, the current was measured during each of
the three types of workloads placed on the server. During each workload, the
average current at each of the 2.00 GHz and 2.33 GHz clock frequencies was
taken. Multiplying this current by the voltage gives the average power in watts
for each workload at each frequency. This wattage calculation, presented in
Figure 8, was then used to calculate the total energy consumed at each
frequency.
Concurrent Connections    100      150      200
2.00 GHz                  34.8     38.4     41.4
2.33 GHz                  39.84    45.6     46.8
Figure 8: Watts consumed per frequency
Using the CPU cycle count taken with the rdtsc instruction [XEON 08], the
number of seconds spent at each frequency can be derived. Dividing the cycle
count by the clock frequency, in this case either 2.33 * 10^9 or 2.00 * 10^9
cycles per second, gives the total seconds spent in each frequency state. The
energy consumption of a run can then be calculated by multiplying the time
elapsed in each state by that state's
power consumption characteristic.
Governor      Frequency   100         150         200
Performance   2.00 GHz    N/A         N/A         N/A
              2.33 GHz    93919.01    107497.95   110346.85
Ondemand      2.00 GHz    782.36      1191.27     1747.18
              2.33 GHz    92827.2     106065.6    108248.4
Remora        2.00 GHz    81745.2     89779.2     96876
              2.33 GHz    302.78      793.4       1095.12
Figure 9: Power Consumption
Figure 9 shows the power spent at each frequency. The power consumed at
the low frequency under the performance governor does not apply, because the
processor is permanently set at the highest frequency the hardware is capable
of. Comparing the ondemand governor to the test governor Remora, there is a
noticeable difference in the power consumed at the two frequencies. This is
because the ondemand governor has a set interval at which the average load
on the system is checked. The default value of the interval is 0.01 seconds.
Upon expiration of this interval, the average CPU load since the last interval
expiration is checked; if it exceeds a certain percentage (default 80%), the CPU
frequency is set to the highest frequency. By focusing on a single application,
in this case the Apache web server, Remora changes the frequency based only
on the current load of the web server; all other applications are ignored.
The increase in power consumption across the three governors does not
appear to be proportional to the increase in concurrent connections. There are
two reasons for this. One reason is that the overhead of parsing the HTTP
headers for more concurrent connections can decrease the number of
transactions that the web server can process. When a new client arrives, the
server and client must set up an SSL connection and maintain it throughout the
workload; this adds to the overhead, increasing the network traffic compared to
standard connections.
The second reason can be attributed to the SPECweb benchmark itself. During
the development of the benchmark, emphasis was placed on emulating a
real-world environment. The workload is therefore divided into "bursts." The two
burst types are described as "ON" and "OFF" times, "ON" time being when the
user is actively using the banking website, and "OFF" time being one of three
events: the user is "thinking," has logged off and successfully closed the
connection, or has finished using the banking website but neglected to log off,
thus leaving the connection open. "ON" time is when the time between
transactions by a client is under 5 seconds; "OFF" time is when this gap is
greater than 5 seconds, with an average delay period of about 10 seconds.
Both the "OFF" and "ON" times are generated using a geometric equation and
a random seed. The equation in Figure 10 shows the method used to calculate
the average "OFF" time when the random seed is between 0 and 1:

(T - I/2) * (1 - (M/(T - I/2) + 1) * exp(-M/(T - I/2))) / (1 - exp(-M/(T - I/2)))

where T = 10, I = 2, M = 150

Figure 10 -- SPECweb method to calculate average "OFF" time [SPEC 05]
The workload itself is dynamically generated for each run; therefore these
power consumption results are the average of 10 runs [DESIGNSPEC 06].
There is an aggregate percentage for each type of workload specified in the
banking benchmark, but in the worst case a client may request a majority of its
workload to be I/O intensive. An example of this would be a successful login
with all following transactions being lookup requests; in the case of banking,
this would be the retrieval of check images.
In Figure 11, the Performance and Ondemand governors are almost identical
across all three tests in the total power they consume. The Remora governor
uses approximately 11% less energy.

Figure 11: Power Consumption of SPECweb across 8 cores

There is a noticeable drop in the time spent at the highest frequency compared
to the ondemand governor. This can be attributed to the benchmark's
transactions causing a significant amount of I/O. Under the Remora governor, if
a transaction's I/O time is greater than its CPU time, the governor classifies the
task as I/O bound, thus placing the CPU at the lower frequency. The energy
conserved by the Remora governor showed a negligible effect on the total
throughput of the web server. According to the average reported by the
SPECweb benchmark, all of the governors reported a 0.15 transactions per
second throughput. Maintaining a similar throughput rate is important in order
to meaningfully quantify the decreased power consumption
[PERFORMANCE 01].
Looking closely at the throughput, shown in Figure 12, reveals a less than 1%
total decrease in transactions processed by the web server. The total number
of transactions processed during the workload is a metric kept by the
SPECweb benchmark itself.
              100      150      200
Performance   27682    41271    54870
Ondemand      27676    41181    54759
Remora        27487    40989    54441
Figure 12 -- Web Server Throughput
The percentage decrease in throughput between the Performance and Remora
governors grows by less than one tenth of a percent when the total number of
concurrent clients increases from 100 to 200. The reason for this small
decrease is that the Remora governor focuses only on changing the clock
speed of each core. Because only the frequency is changed, the processor is
not placed into a deep sleep state, which allows it to continue handling web
server transactions, although at a slower pace than at the higher frequency.
Another reason is that at times, even under the Remora governor, the
processor may be at the highest possible clock frequency when transactions
are received, which is always the case under the
Performance governor.
Chapter 6 - Conclusions & Future Work
If a situation calls for a small set of repetitive tasks to be performed by a server,
profiling the system resources can be of great benefit to the energy-conscious
data center. By calculating the hardware bias of a particular task, the system
can be placed in a more energy-efficient state when that task is executed again.
The only drawback to this power scheme is that a more power-efficient bias is
chosen over a performance bias.
Separating tasks onto different machines and tailoring a power management
scheme for each particular task can yield beneficial results. First, there is
negligible loss in performance caused by the power management scheme.
Second, the power consumed by these task-specific machines is decreased
compared to a setup that assigns tasks to non-specific machines. By favoring
energy efficiency over performance in a data center, a minimal decrease in
throughput will be observed, but the long-term total power savings will
outweigh this decrease.
With the hardware available at the present time, it is not possible to perform
aggressive power management on the storage aspect of a web server. The
latency of spinning up a hard disk from a cold start is greater than 10 seconds,
which is not a tolerable interval in a web server environment. If a feasible
interval for power management on a hard disk presents itself, power
management can be extended to the hard disk.
The total time an incoming web transaction spends performing I/O to or from a
storage medium can then be profiled, and this information used to classify the
transaction accordingly. The addition of a queue to handle transactions that are
primarily I/O oriented would allow a disk to be spun up and spun down at
appropriate times, increasing the energy efficiency of the storage subsystem.
The resources that are used can be alternated between the CPU and the disk.
When the queue containing CPU bound transactions is executed, the hard
disks can be set to a low power state. Conversely, when the I/O queue is being
executed, the CPU can be placed in a lower power state.
Another method that can be implemented is the ability to categorize CPU bound
jobs into multiple queues based on the intensity of the task currently executing.
There can be two separate queues for highly intensive jobs and less intensive
jobs. The queue for highly intensive jobs would require the CPU to be at a
frequency at or near the peak, while the queue for less intensive tasks could
execute at a lower frequency and have a minimal performance impact.
The only negative effect of implementing a power-aware scheme for a specific
application is that a system must be completely dedicated to that application.
Performing other tasks on the system will decrease the power conservation that
can be observed. Other applications will also suffer a decrease in throughput,
as the system will not move into a high power state unless they happen to run
alongside the application targeted for aggressive power
management.
This detriment of implementing a userspace power manager can be remedied
by a lower-level approach. Modifying the system at the kernel level, specifically
the scheduler, could yield even greater energy conservation benefits. Replacing
simple CPU usage statistics with targeted statistics for various subsystems
would lead to an improved classification scheme. These subsystems include
off-chip memory, block devices, and network devices.
References
[ACPI 06] Hewlett-Packard/Intel/Microsoft/Phoenix Technologies/Toshiba.
"Advanced Configuration and Power Interface Specification 3.0b," October 10,
2006.

[APACHE 08] Bernhard Gröne, Andreas Knöpfel, Rudolf Kugel, Oliver Schmidt.
"The Apache Modeling Project," January 24, 2008.

[CPUIDLE 07] Venkatesh Pallipadi, Adam Belay. "Cpuidle - Doing Nothing
Efficiently," June 27, 2007.

[DESIGNSPEC 06] Standard Performance Evaluation Corporation.
"SPECweb2005 Banking Workload Design Document," April 5, 2006.

[DOE 08] Paul Scheihing, U.S. Department of Energy. "DOE Data Center
Energy Efficiency Program," May 2008.

[GAMUT 05] J. Moore. "Gamut - Generic Application eMUlaTion," December
2005. http://issg.cs.duke.edu/cod/

[GOVERNOR 08] Dominik Brodowski, Nico Golde. "CPU Frequency and
Voltage Scaling Code in the Linux(TM) Kernel," Linux Kernel Documentation.

[HOSTING 05] Yiyu Chen, Amitayu Das, Wubi Qin, Anand Sivasubramaniam,
Qian Wang, Natarajan Gautam. "Managing Server Energy and Operational
Costs in Hosting Centers," June 2005.

[HTTPD_DOC 09] The Apache Software Foundation. "Apache HTTP Server
Version 2.3 Documentation," 2009.

[INTERRUPT 08] Vaidyanathan Srinivasan, Gautham R. Shenoy, Srivatsa
Vaddagiri, et al. "Energy-aware Task and Interrupt Management in Linux,"
July 23, 2008.

[LEVEL 07] Charles Lefurgy, Xiaorui Wang, Malcolm Ware. "Server-Level
Power Control," 2007.

[LOAD DISPATCH 08] Gong Chen, Wenbo He, Jie Liu, et al. "Energy-Aware
Server Provisioning and Load Dispatching for Connection-Intensive Internet
Services," 2008.

[MULTI-TIER 07] Tibor Horvath, Kevin Skadron, Tarek Abdelzaher. "Enhancing
Energy Efficiency in Multi-Tier Clusters via Prioritization," 2007.

[PATH 98] S. Schechter, M. Krishnan, M. D. Smith. "Using Path Profiles to
Predict HTTP Requests," April 1998.

[PERFMON] Pfmon2 Project. http://perfmon2.sourceforge.net/

[PERFORMANCE 01] Tarek Abdelzaher, Kang Shin, Nina Bhatti. "Performance
Guarantees for Web Server End-Systems: A Control-Theoretical Approach,"
2001.

[POLICIES 03] Mootaz Elnozahy, Michael Kistler, Ramakrishnan Rajamony.
"Energy Conservation Policies for Web Servers," 2003.

[SPEC 05] Standard Performance Evaluation Corporation. SPECweb2005
Benchmark Suite. http://www.spec.org/web2005/

[SYSTEM 06] Dimitris Economou, Suzanne Rivoire, Christos Kozyrakis, Partha
Ranganathan. "Full System Power Analysis and Modeling for Server
Environments," 2006.

[SYSTEMS 04] Ricardo Bianchini, Ram Rajamony. "Power and Energy
Management for Server Systems," 2004.

[TICKLESS 07] "Getting More from Tickless," http://lwn.net/Articles/240253/,
June 30, 2007.

[XEON 08] Intel Corporation. "Dual-Core Intel® Xeon® Processor 5200 Series
Datasheet," October 2008.