Exploration and Implementation of Wireless Protocol Platforms
by
Suet-Fei Li
B.S. (University of Wisconsin-Madison) 1995
A dissertation submitted in partial satisfaction of the requirements for the degree of
Doctor of Philosophy
in
Engineering – Electrical Engineering and Computer Science
in the
GRADUATE DIVISION
of the
UNIVERSITY OF CALIFORNIA, BERKELEY
Committee in charge:
Professor Jan Rabaey, Chair
Professor Randy Katz
Professor Paul Wright
Fall 2003
The dissertation of Suet-Fei Li is approved:
University of California, Berkeley Fall 2003
Abstract
Exploration and Implementation of Wireless Protocol Platforms
by
Suet-Fei Li
Doctor of Philosophy in Electrical Engineering and Computer Science
University of California, Berkeley
Professor Jan Rabaey, Chair
The focus of this thesis research is on the implementation of flexible, energy-efficient wireless protocols and the corresponding design methodologies. In the first part of the thesis, we propose a formal top-down, platform-based design methodology targeting complex systems with a high level of integration and heterogeneity. Our methodology relies on a formal Model of Computation (MOC). It supports architecture exploration and meets the application's need for flexibility while achieving energy-efficient solutions. Using PicoRadio as the design driver, the proposed formal top-down design methodology yields superior results compared to traditional bottom-up ad-hoc approaches.
In the second half of the thesis, we focus on energy-efficient management for event-
driven heterogeneous systems. Traditional Operating Systems, acting as the system
manager and scheduler, are not efficient or in many cases not sufficient for the targeted
types of complex real time, power-critical domain specific systems. Our proposed
solution utilizes a system management framework; it exploits the reactive event-driven
nature of the systems, and deploys aggressive power management. The hierarchical
structure of the framework enhances design scalability, supports concurrency, and
enables power control at various granularities. The scope of our power management
algorithm is not limited to individual nodes; instead, it aims to encompass the interest of
the network as a whole. State-space partitioning is deployed to execute our power management algorithm in two phases: network-level power management and node-level power scheduling.
We have studied different power management algorithms for the network level.
Adaptive algorithms seem to be good solutions, since they are able to exploit the temporal correlations in the traffic streams, handle environmental changes, and are relatively simple to implement. However, simple constant-threshold algorithms perform better for critical controller nodes and for systems with high wakeup overhead. Our experimentation with the various adaptive algorithms leads us to speculate that there is a performance limit to any adaptive algorithm that only has knowledge of the recent inter-arrival history. A more "global" approach that incorporates information on the
network neighborhood is needed to achieve major breakthroughs. In the future, we would
like to explore such approaches by appending dedicated power management fields to
existing packets, as well as adjusting the sleep thresholds based on known topology
information.
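As a concrete illustration of this class of adaptive policies (a generic exponentially weighted sketch for exposition, not one of the specific algorithms evaluated in the thesis), a node can estimate the next inter-arrival time from recent history and sleep only when the estimated idle interval exceeds the wakeup break-even time:

```python
class AdaptiveSleepPolicy:
    """Estimate the next packet inter-arrival time with an exponentially
    weighted moving average of recent history; sleep only when the
    estimated idle interval exceeds the wakeup break-even time."""

    def __init__(self, breakeven_s, alpha=0.5):
        self.breakeven_s = breakeven_s  # how long a sleep must last to pay off
        self.alpha = alpha              # weight given to the newest sample
        self.predicted = None           # current inter-arrival estimate (s)

    def observe(self, interarrival_s):
        # Fold the newest inter-arrival sample into the running estimate.
        if self.predicted is None:
            self.predicted = interarrival_s
        else:
            self.predicted = (self.alpha * interarrival_s
                              + (1 - self.alpha) * self.predicted)

    def should_sleep(self):
        # Sleep only if the predicted idle period exceeds break-even.
        return self.predicted is not None and self.predicted > self.breakeven_s


policy = AdaptiveSleepPolicy(breakeven_s=0.5)
for gap in [2.0, 3.0, 2.5]:   # sparse traffic: long gaps between packets
    policy.observe(gap)
print(policy.should_sleep())  # -> True: sleeping pays off under sparse traffic
```

A constant-threshold policy corresponds to freezing the estimate at a fixed value; the adaptive variant tracks non-stationary traffic at the cost of being misled by sudden bursts, which is one reason constant thresholds win for critical controller nodes.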
1 INTRODUCTION ... 1
  1.1 CHALLENGES IN WIRELESS PROTOCOL IMPLEMENTATION ... 1
  1.2 RESEARCH STATEMENT AND MAJOR CONTRIBUTIONS ... 3
  1.3 THESIS ROADMAP ... 5
2. PLATFORM-BASED DESIGN METHODOLOGY FOR WIRELESS PROTOCOL PROCESSOR ... 9
  2.1 MODEL OF COMPUTATION ... 10
  2.2 CONCEPT OF PLATFORM ... 12
  2.3 THREE-PHASE PLATFORM-BASED DESIGN FLOW ... 14
    2.3.1 Phase I – Platform Conception ... 16
      2.3.1.1 Functional Profiling ... 17
        Network Layer Profiling ... 18
        MAC Layer Profiling ... 21
        Summary of Profiling Results ... 21
      2.3.1.2 Architecture Exploration ... 22
        Traditional Reconfigurable Architectures ... 23
        Hybrid Architecture for Protocol Processing [19] ... 24
      2.3.1.3 Architecture Library ... 27
    2.3.2 Phase II – Platform Instantiation ... 28
      2.3.2.1 PicoRadio II Case Study ... 29
    2.3.3 Phase III – Implementation ... 32
  2.4 DESIGN ITERATION ... 33
  2.5 METHODOLOGIES COMPARISON ... 36
3. REACTIVE OPERATING SYSTEMS -- THE SOFTWARE MANAGEMENT LAYER ... 38
  3.1 REACTIVE SYSTEM BEHAVIOR ... 39
  3.2 INADEQUACY OF TRADITIONAL GENERAL-PURPOSE OS'S ... 40
  3.3 EVENT-DRIVEN OS ... 41
  3.4 COMPARISON RESULTS ... 42
  3.5 REQUIRED EXTENSION OF TINYOS ... 44
4. HIERARCHICAL POWER MANAGEMENT FRAMEWORK ... 47
  4.1 THE GLOBAL POWER SCHEDULER AND SYSTEM MANAGER ... 48
  4.2 EXISTING WORKS ON POWER MANAGEMENT POLICIES ... 50
    4.2.1 Stationary Statistical Power Management Policy ... 52
    4.2.2 Adaptive Power Management Policy For Non-Stationary Traffic ... 55
    4.2.3 Dynamic Voltage Scaling (DVS) ... 56
  4.3 PROPOSED POWER MANAGEMENT ALGORITHM FOR SENSOR NETWORKS ... 57
    4.3.1 Formulating Power Control Policy For PicoRadio Network ... 57
    4.3.2 Requirements For PicoRadio Network Power Control Policy ... 58
    4.3.3 Proposed Power Control Algorithms ... 60
5. NODE LEVEL POWER MANAGEMENT ... 63
  5.1 HIERARCHICAL NODE-LEVEL POWER MANAGEMENT ARCHITECTURE ... 63
  5.2 GLOBAL POWER SCHEDULER ... 65
  5.3 POWER STATES TRANSITIONS FOR SYSTEM BLOCKS ... 67
  5.4 NODE LEVEL POWER SCHEDULING ... 69
    5.4.1 Predictive Look-Ahead Scheduling ... 72
    5.4.2 Implementation of Predictive Scheduling ... 74
    5.4.3 Power Scheduling Without Predictive Wakeups ... 78
  5.5 INCORPORATING DYNAMIC VOLTAGE SCHEDULING (DVS) ... 78
  5.6 THE STATEFLOW-SIMULINK ESTIMATION-SIMULATION FRAMEWORK ... 79
6. NETWORK LEVEL POWER MANAGEMENT ... 83
  6.1 TRAFFIC CONSIDERATIONS ... 83
  6.2 CONSTANT THRESHOLD ALGORITHM ... 91
  6.3 SINHA & CHANDRAKASAN ... 101
  6.4 MODIFIED HWANG AND WU'S [32] ... 103
  6.5 ADAPTIVE DYNAMIC THRESHOLD ALGORITHMS ... 105
    6.5.1 Improve Adaptive Algorithms By Exploiting Special Characteristics Of The Sensor Network ... 108
  6.6 EVALUATIONS OF VARIOUS POWER MANAGEMENT ALGORITHMS FOR SENSOR NETWORK APPLICATIONS ... 109
  6.7 SERVICE TIME CONSIDERATION ... 110
  6.8 IMPLEMENTATION COST OF THE PM ... 111
7. CONCLUSIONS AND FUTURE WORKS ... 111
  7.1 SUMMARY OF THESIS RESEARCH AND CONTRIBUTIONS ... 112
  7.2 LESSONS LEARNED AND FUTURE RESEARCH OPPORTUNITIES ... 114
    Platform-based design methodology for protocol processing ... 115
    Node level power management ... 115
    Network level power management ... 116
8. REFERENCES ... 118
1 Introduction
1.1 Challenges in Wireless Protocol Implementation
The implementation of small, mobile, low-cost, energy conscious devices has
created unique challenges for today's designers. Limited battery lifetimes make energy efficiency a most critical design metric, and the real-time nature of the applications imposes strict performance constraints. The drive for miniaturization and inexpensive fabrication calls for an unprecedentedly high level of integration and system heterogeneity. Rapidly shrinking design times, caused by market pressure from fierce competition, combined with the need to support multiple wireless standards, favor a system architecture that is highly reusable and programmable. To meet these conflicting and unforgiving constraints, we must optimize the designs among all the above competing criteria: energy, performance, flexibility and cost, while managing the ever-increasing complexity.
This thesis research is mostly concerned with the trade-off between energy and flexibility in architecture implementation. Performance is viewed more as a design constraint than as an optimization criterion. Figure 2 shows the vast playing field to be explored in the energy-flexibility trade-off game. Four different types of architectures, in ascending order of flexibility, are presented: dedicated direct-mapped ASICs, hardware-reconfigurable DSPs, software-programmable Digital Signal Processors (DSPs) and embedded general-purpose microprocessors. Implementing the same DSP algorithm, energy efficiency decreases by four orders of magnitude as we go from the least flexible ASIC to the most flexible embedded processor.
Figure 1 shows the energy-flexibility trade-off for a particular wireless protocol example. The MAC layer of a wireless protocol is implemented in an ASIC, in an FPGA and on the ARM8 embedded processor. Similar to Figure 2, we observe a four-orders-of-magnitude increase in energy consumption from the ASIC to the embedded microprocessor.
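The roughly four-orders-of-magnitude gap can be checked directly from the per-operation energies in Figure 1 (a quick sketch; the numbers are taken from the figure's table, with n = 13 for the ARM8):

```python
# Per-operation energies from Figure 1 (pJ/op); n = 13 for the ARM8.
n = 13
energy_pj_per_op = {"ASIC": 10.2, "FPGA": 81.4, "ARM8": n * 457}

baseline = energy_pj_per_op["ASIC"]
ratios = {arch: e / baseline for arch, e in energy_pj_per_op.items()}

for arch in ("ASIC", "FPGA", "ARM8"):
    print(f"{arch}: {energy_pj_per_op[arch]:,.1f} pJ/op "
          f"({ratios[arch]:.0f}x the ASIC)")
```

The resulting energy ratios are about 1 : 8 : 580, consistent with the 1 - 8 - >>400 ratio quoted in the figure.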
[Figure 2: Energy efficiency vs. flexibility trade-off for various architectures. The vertical axis is energy efficiency in MOPS/mW (millions of operations per second per milliwatt), or MIPS/mW, spanning 0.1 to 1000. Dedicated HW is direct-mapped ASIC. Pleiades (10-80 MOPS/mW) is a reconfigurable-hardware DSP. ASIPs/DSPs are software-programmable DSPs (2 V DSP: 3 MOPS/mW). SA110 is the StrongARM embedded processor (0.4 MIPS/mW).]

[Figure 1: Energy efficiency vs. flexibility trade-off for a wireless protocol implementation.

          ASIC         FPGA         ARM8
Power     0.26 mW      2.1 mW       114 mW
Energy    10.2 pJ/op   81.4 pJ/op   n*457 pJ/op

ASIC: 1 V, 0.25 um CMOS process. FPGA: 1.5 V, 0.25 um CMOS low-energy FPGA. ARM8: 1 V, 25 MHz processor; n = 13. Energy ratio: 1 - 8 - >>400.]

To address these challenges, we must rethink the different aspects of the
traditional approach to system design, from methodology to architecture. This thesis
research targets the issues of energy-efficiency, flexibility and design complexity by
proposing a novel design methodology and carefully studying the software and
hardware architecture. In particular, we focus on the management layer of reactive
heterogeneous systems. In the context of power-critical systems, its role lies in system
resource scheduling and power management. The main research contributions of the
thesis are highlighted in the following section.
1.2 Research Statement and Major Contributions
A typical wireless system has an analog front-end and a digital back-end. The
digital part consists of base-band and protocol processing units. My research
concentrates on the "protocol" components: the operations that ensure proper delivery of the packets given the underlying network architecture and physical medium. Relevant operations include routing, packet processing, classification and Medium Access Control (MAC). Stream processing and direct manipulation of data in the communication pipeline, such as data compression and decompression, encryption and decryption, and coding, are not considered. The scope of the thesis is the energy-efficient and flexible implementation of wireless protocols and the accompanying novel design methodology.
The major contributions of this thesis can be summarized in two parts.
We propose a formal top-down platform-based design methodology for
protocol implementation. Most protocol design methodologies currently in use
are inadequate, either because they do not rely upon formal techniques and
therefore do not guarantee correctness, or because they do not provide
sufficient support for performance analysis and design exploration and
therefore often lead to sub-optimal implementations. Our methodology relies
on a formal Model of Computation (MOC). It supports architecture
exploration and meets the application's need for flexibility while achieving energy-efficient solutions. The methodology specifically targets complex systems with a high level of integration and heterogeneity. Based on the
concepts of platform-based design, we divide the design process into three
distinct phases: Platform Conception, Platform Instantiation and
Implementation. A complex real time system, the PicoRadio [12] platform, is
used as the case study to help guide the entire design process. Experiments are
conducted at a variety of levels to illustrate how to apply the design
methodology to devise an architecture that is optimized for size, cost, and
most importantly, energy. The proposed formal top-down design methodology
yields superior results compared to traditional bottom-up ad-hoc approaches.
We propose a hierarchical power management framework for low-power
reactive heterogeneous systems. To achieve the ultimate energy efficiency,
power management should be implemented at every level of the design
hierarchy, from device to system levels. Power saving is normally the highest
at the top level, that is, the system level. Traditional Operating Systems, acting
as the system manager and scheduler, do not usually provide power control
services. Furthermore, as they were developed for broad applications, they
are not efficient or in many cases not sufficient for the targeted types of
complex real time, power-critical domain specific systems. Our proposed
solution is developed to exploit the reactive event-driven nature of the domain
and has built-in aggressive power management. The hierarchical structure of
the framework enhances design scalability, supports concurrency in both the
application domain and architecture and enables power control at various
granularities. Our power management algorithm executes in two phases:
the network-level algorithm first treats the whole node as one entity and decides when the whole node should go to sleep; then, once the node is turned on, the node-level algorithm decides the scheduling of the various modules inside the node.
To validate research concepts, I have participated in the development of the
PicoRadio II and PicoRadio III chips. PicoRadio II was designed using our
proposed design methodology. PicoRadio III deploys a power manager to
demonstrate the reactive management concepts discussed in the thesis.
1.3 Thesis Roadmap
Figure 3 shows the general organization of the remainder of the thesis.
[Figure 3: Thesis roadmap. Chapter 2: the three-phase platform-based design methodology (Phase I platform conception, Phase II platform instantiation, Phase III implementation); lesson learned: OS support is crucial -- what is the right OS for reactive systems? Chapter 3: reactive OS (inadequacy of general-purpose OSs; comparing event-driven and general-purpose OSs; TinyOS needs to be extended for heterogeneous systems). Chapter 4: hierarchical power management framework (management framework based on TinyOS; OS as the global power manager and scheduler; existing power control algorithms; proposed two-level power management algorithm for the PicoRadio network). Chapter 5: node-level power management (hierarchical architecture; node-level power scheduling; Stateflow-Simulink simulation framework). Chapter 6: network-level power management (sensor network traffic analysis; evaluating various power control algorithms in OMNeT++). Chapter 7: conclusion (recap of contributions; lessons learned and future work).]
Chapter 2 introduces our formal top-down platform-based design methodology
for protocol implementation. The iterative design process is divided into three distinct
phases: Platform Conception, Platform Instantiation and Implementation. The
PicoRadio platform is used as the case study to help guide the entire design process.
At the end of the implementation phase, however, we discover that our design falls
short of the original goal. It is quite inefficient and far from meeting our application specification of low cost and low power. The inefficiency is due to two major faults in the design process: the modeling, and software synthesis using eCos, a general-purpose OS. The second fault can be greatly remedied by replacing eCos with TinyOS, a reactive OS that better matches the application. After repairing these faults, we are
able to show that the proposed formal top-down design methodology yields superior
results compared to traditional bottom-up ad-hoc approaches.
The design flow experiment has drawn our attention to the vast impact of the OS.
We will devote the rest of the thesis to search for the “right” OS for our targeted
application: the wireless communication systems.
In Chapter 3, we discuss how a general-purpose multi-tasking OS is less suitable for our targeted application than a reactive OS, which is developed to exploit the reactive event-driven nature of the domain. We present a comparison between the two. Our results indicate that the event-driven OS achieves an 8x improvement in performance, 2x and 30x reductions in instruction and data memory requirements, and a 12x reduction in power over its general-purpose counterpart. However, the existing TinyOS has many limitations and has to be extended.
In Chapter 4, based on the attractive concepts of TinyOS, we propose a power
management framework that specifically targets reactive heterogeneous systems. We
discuss some existing works on power management and proceed to propose our
power management algorithm for sensor networks. To handle complexity in system
modeling, the algorithm executes in two phases: the network-level algorithm first treats the whole node as one entity and decides when the whole node should go to sleep; then, once the node is turned on, the node-level algorithm determines the scheduling of the various modules inside the node.
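The two-phase split can be sketched as follows (a hypothetical skeleton for exposition; the module names, events and thresholds are invented, not the PicoRadio implementation):

```python
def network_level_decision(idle_time_s, sleep_threshold_s):
    """Phase 1: treat the whole node as one entity and decide
    whether it should go to sleep."""
    return idle_time_s > sleep_threshold_s


def node_level_schedule(pending_events):
    """Phase 2: once the node is awake, decide which internal
    modules to power on, and in what order, for the pending work."""
    needed = set()
    for event in pending_events:
        if event == "packet_rx":
            needed.update(["radio", "baseband", "protocol_processor"])
        elif event == "sensor_sample":
            needed.update(["adc", "protocol_processor"])
    # Wake modules in dependency order; everything else stays asleep.
    order = ["radio", "baseband", "adc", "protocol_processor"]
    return [m for m in order if m in needed]


if network_level_decision(idle_time_s=4.0, sleep_threshold_s=1.0):
    active_modules = []  # whole node sleeps; no module scheduling needed
else:
    active_modules = node_level_schedule(["sensor_sample"])
```

Separating the two phases keeps the state space tractable: the network-level decision reasons about one entity per node, and the combinatorics of per-module power states are confined to the node-level scheduler.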
Chapter 5 covers the node level power control algorithm. We present the
hierarchical architecture and discuss power scheduling related issues. At the node
level, power scheduling makes decisions on the sequence and exact timing of the
block wakeups and sleeps. The goal is to minimize the overall power consumption
while meeting the performance and resource constraints. We use the Stateflow-Simulink environment as our simulation framework.
Chapter 6 covers the network level power control algorithm. We first try to
understand the nature of the network traffic. Then we simulate the various power
control policies in a typical sensor network setting using the OMNet++ simulator.
From our experiments, adaptive algorithms seem to be good solutions, since they are able to exploit the temporal correlations in the traffic stream, handle environmental changes and are relatively simple to implement. However, simple constant-threshold
algorithms perform better for some “difficult” cases.
Chapter 7 concludes the thesis. We recapitulate the major research results and
contributions. We also discuss the lessons learned and identify the open questions and
opportunities for further research.
2. Platform-based design methodology for wireless
protocol processor
Following a formal design methodology is vital to protocol implementation. Most
existing methodologies are ad-hoc in nature. Without relying upon formal techniques,
they often do not provide sufficient support for performance analysis and design
exploration, and tend to lead to sub-optimal implementations. In our proposed
methodology, we capture the functional behavior of the design with a high-level
abstraction. Specifying the design with a formal Model of Computation (MOC)
enables us to readily apply design exploration and synthesis later in the design flow to
produce flexible and energy-efficient implementations.
Based on the concepts of platform-based design, we divide the design process into
three distinct phases: Platform Conception, Platform Instantiation and
Implementation. A complex real time system, the PicoRadio [12] platform, is used as
the case study to help guide the entire design process. Experiments are conducted at a
variety of levels to illustrate how to apply the design methodology to devise an
architecture that is optimized for size, cost, and most importantly, energy.
In this chapter, we start by introducing the concepts of Models of Computation
and platform. Then we describe in detail our three-phase platform-based design flow,
using PicoRadio as the design driver. To illustrate the effectiveness of our
methodology, we conclude this chapter by presenting a comparison between a design
implemented with the traditional ad-hoc methodology and one implemented with our
formal methodology.
2.1 Model of Computation
The process of system design starts with the correct capturing of system behavior.
Traditionally, the choice of the language used for capturing functional specifications
is often informal and application dependent. Natural languages, Matlab, C and C++
are all popular forms of design capture. However, these languages often lack the
semantic constructs to be able to specify concurrency. We promote a more formal
approach: choosing functional specification languages based on their underlying mathematical model, which is called a model of computation (MOC) [1]. An MOC defines the rules of interaction of components and the semantics of composition, computation and concurrency [2]. In fact, concurrency models are the most important
differentiating factors among models of computation. A popular model of
computation is threads, where a set of sequential processes operates on the same data.
Other examples of MOCs are: Communicating Sequential Processes [3], the pi
calculus [4], dataflow [5], process networks [6], discrete events [7], Finite State
Machines (FSM) and the synchronous/reactive model [8] etc.
The appropriate MOC for protocol processing is Concurrent Extended Finite State Machines (CEFSM) [9]. CEFSM models a network of communicating extended finite state machines (EFSMs), which are finite state machines that can have complex actions on transitions (variable assignments, computations, etc.). An EFSM can effectively express both the control and the computation found in datapath operations. Figure 4
illustrates how CEFSM naturally models protocol processing. Each layer
(component) in the protocol stack is modeled as an EFSM. The communication
between EFSMs is asynchronous to accommodate components working at different
rates: Lower layers of the stack typically run much faster than the higher layers. The
asynchronous communication is supported through connecting queues between the
components.
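The CEFSM structure described above can be sketched in a few lines (an illustrative toy with invented layer names and events, not the actual PicoRadio specification): each EFSM carries extended-state variables and guarded transitions with actions, and an asynchronous queue connects a MAC-layer machine to a network-layer machine, as in Figure 4.

```python
from collections import deque


class EFSM:
    """Minimal extended finite state machine: transitions carry guards
    and actions that may read and update extended-state variables."""

    def __init__(self, initial_state, variables):
        self.state = initial_state
        self.vars = dict(variables)
        self.transitions = {}  # (state, event) -> (guard, action, next_state)

    def add_transition(self, state, event, guard, action, next_state):
        self.transitions[(state, event)] = (guard, action, next_state)

    def fire(self, event, payload=None):
        key = (self.state, event)
        if key not in self.transitions:
            return False  # event ignored in this state
        guard, action, next_state = self.transitions[key]
        if not guard(self.vars, payload):
            return False  # guard failed; no transition
        action(self.vars, payload)  # extended action on the transition
        self.state = next_state
        return True


# Asynchronous queue connecting a MAC-layer EFSM to a network-layer EFSM,
# so the two layers can run at different rates (cf. Figure 4).
mac_to_net = deque()

mac = EFSM("IDLE", {"rx_count": 0})
mac.add_transition(
    "IDLE", "frame_rx",
    guard=lambda v, p: p is not None,
    action=lambda v, p: (v.__setitem__("rx_count", v["rx_count"] + 1),
                         mac_to_net.append(p)),
    next_state="IDLE",
)

net = EFSM("WAIT", {"delivered": 0})
net.add_transition(
    "WAIT", "pkt",
    guard=lambda v, p: True,
    action=lambda v, p: v.__setitem__("delivered", v["delivered"] + 1),
    next_state="WAIT",
)

mac.fire("frame_rx", {"dst": 7})  # MAC receives a frame
while mac_to_net:                 # network layer drains the queue
    net.fire("pkt", mac_to_net.popleft())
```

The connecting queue is what makes the composition asynchronous: the MAC machine can enqueue many frames before the slower network-layer machine consumes them.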
[Figure 4: EFSM as the MOC for protocol processing. Each layer in the protocol stack is modeled as an EFSM; the communications between the EFSMs are through queues.]

The formal capturing of functional behavior enables us to efficiently apply verification and synthesis later in the design process. Verification and synthesis are most effective if complexity is handled by formalization, abstraction and
decomposition [10]. Specifying the design with a high-level abstraction allows the
freedom to explore a wide variety of implementations.
2.2 Concept Of Platform
As mentioned in Chapter 1, the need for shorter design time and greater design
complexity has made it necessary to look to new design methodologies that support
design reuse. Platform-based design [11] facilitates design reuse by abstracting
hardware to a higher level (system platform) that is visible to the application
software. A system platform has three components: hardware platform, software
platform and interconnect.
[Figure 5: Platform-based design methodology. Platform design-space exploration takes an application instance from the application space and, via the platform specification, selects a platform instance from the architectural space, yielding the system platform.]
The hardware platform should comprise a family of flexible (parameterizable)
architectures that adequately support the functions in the application space with
performance/power models. A software platform is needed to abstract the hardware
platform into a programmer’s model to allow effective mapping. It is usually in the
form of a Real Time Operating System (RTOS), which is responsible for the
scheduling of the computational resources and of the communication between them.
In essence, it is a hardware platform “manager”. Inter-communication strategies
designate the interconnection between the architecture modules.
Once a system platform has been identified for the application space and the
architecture space, the final chip design involves design exploration within the system
platform to determine the best mapping of application to architecture. Figure 5
graphically captures the platform-based design concept.
Figure 6 shows an example of a platform. Its heterogeneous architecture combines
programmable (microprocessor), flexible (FPGA), and application-specific modules.
The mixed-mode platform processes both analog and digital signals and is DSP and
control intensive (FSM processing).
2.3 Three-Phase Platform-Based Design Flow
Our platform-based design flow can be split into three phases, as shown in Figure
7. In phase I, the system platform is conceived through consideration of the
application domain and the available architectural modules. Phase II performs the
design exploration to find a suitable platform instance for a given set of target
applications and constraints. Lastly, phase III completes the final implementation
(hardware and software synthesis) of a specific application onto the platform instance.
Note that the design flow is iterative. If the final implementation from Phase III does
not meet the design specification, we need to go back to either Phase I and/or Phase II
to refine the platform or the mapping until a satisfactory implementation is obtained.
[Figure 6: Example of a platform, combining an embedded uP + DSPs, an FPGA, a dedicated DSP, a reconfigurable datapath, and reconfigurable state machines.]
To facilitate further understanding of the different phases of the design methodology,
we will present the design of the PicoRadio platform as case study.
PicoRadio [12] is an ad hoc, sensor-based wireless network that comprises
hundreds of programmable and ultra-low power communicating nodes. PicoRadio
applications have the following characteristics: low-data rate, ultra-low power budget,
and mostly passive event-driven computation. Reactivity is triggered by external
events such as sensor data acquisition, transceiver I/O, timer expiration, and other
environmental occurrences. The chosen MOC for the PicoRadio protocol stack is
Concurrent Extended Finite State Machines (CEFSM).
[Figure 7: Platform-based design flow. Phase I: kernel extraction via functional profiling, together with fabric exploration, yields a configurable platform. Phase II: mapping of the functional specification onto the platform, with performance evaluation. Phase III: implementation.]

Like most major projects, the PicoRadio design has progressed through different versions. The second version of the PicoRadio design (PicoRadio II) is far simpler than the ultimate design; it is meant to provide a learning experience addressing the methodology, tool and integration issues. PicoRadio II has been completed and tested and will be used as the case study for the design flow.
2.3.1 Phase I – Platform Conception
The first step in platform-based design is to conceive a system platform that
identifies a set of architectural modules to support the class of functions in our
application domain.
A typical platform for wireless systems consists of programmable processors,
reconfigurable logic, dedicated logic, memories, and peripherals. Constructing a
hardware platform that supports the key functions of the application domain is a two-
fold process, proceeding in lock-step: we need to
(1) identify the key functions and their constraints,
and (2) explore the available architecture modules
and their performance behavior. The former is
achieved through functional profiling of a suite of
candidate applications, and extracting a set of key
operations (kernels) common to these applications.
The latter requires architecture exploration of
existing implementation fabrics to obtain first-order
performance, energy and area estimations for these
basic modules. Figure 8, excerpted from Figure 7, depicts this duality.

Figure 8 Phase I – Platform Conception (kernel extraction via functional profiling, in lock-step with fabric exploration)

The output of Phase I is a library of architectural modules with corresponding
performance, energy and area prediction models. In the following sections, we present
these two steps in greater detail in the context of our case study.
2.3.1.1 Functional Profiling
Before starting the implementation process, we need to gain an in-depth
understanding of our application space. A highly efficient implementation can
only be realized if the performance-critical operations in the applications are
classified and specially targeted. Functional profiling explores regularity and extracts
common operations (kernel extraction) from the application. The important issues in
functional profiling are profiling granularity and the classification and interpretation of
the collected data. If the granularity is either too coarse or too fine, regularity and
commonality may not be fully exposed. To reach an optimal granularity, some
reorganization of the application code (for example, insertion of some wrapper
functions) is often needed. To classify and interpret profiling data in a meaningful
fashion requires some insight into the class of application algorithms.
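As an illustration of such wrapper functions, the following C sketch (hypothetical; the experiments in this chapter used OPNET's built-in profiler, not this code) tags each kernel invocation with a class so that both the call count and the elapsed time accumulate in one named bucket:

```c
#include <time.h>

/* Kernel classes matching the WNP-style buckets used in the text.
   This instrumentation is illustrative, not the actual OPNET profiler. */
enum kernel_class { K_SEARCH, K_MEMORY_MGMT, K_PKT_PROC, K_OTHER, K_COUNT };

unsigned long call_count[K_COUNT];   /* the "number of operations" view */
double        time_spent[K_COUNT];   /* the "total time" view (seconds) */

/* Wrap any kernel call so its cost is attributed to one named bucket. */
#define PROFILE(cls, call)                                              \
    do {                                                                \
        clock_t t0_ = clock();                                          \
        (call);                                                         \
        time_spent[cls] += (double)(clock() - t0_) / CLOCKS_PER_SEC;    \
        call_count[cls]++;                                              \
    } while (0)

/* A stand-in search kernel, used only to demonstrate the wrapper. */
int table_lookup(int key) { return key % 7; }
```

Inserting such wrappers at kernel boundaries is one way to set the profiling granularity: coarse enough to expose commonality, fine enough to keep the buckets separate.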
We use the list of critical operations identified by the high performance wired
network processor (WNP) community as the initial guideline for profiling wireless
applications. The list consists of parsing, searching (table lookup), packet
modification and re-assembly.
We have conducted experiments to profile both the network and the MAC layers
of the protocol stacks. Network layer profiling is performed on a mobile ad hoc
network application that supports different network protocols. MAC layer profiling is
performed on a distributed multi-channel MAC. For both experiments, the application
programs are written in OPNET Radio Modeler from Millennium 3 Technologies
[13].
Network Layer Profiling
The model has simple MAC and physical layers, but a rather sophisticated
network layer that enables us to explore different types of routing protocols. Node
distribution and mobility can also be specified. Routing protocols have a significant
impact on implementation parameters such as routing and forwarding table sizes.
The distribution of nodes in the network affects the network activity and hence the
protocol performance. We studied four different scenarios with two different routing
protocols and two types of node distributions. The first protocol is the Ad-hoc On-
Demand Distance Vector Routing (AODV) [14], a reactive protocol. The second is
the Dynamic Destination-Sequenced Distance-Vector Routing (DSDV) [15], a
proactive protocol. The first type of network distribution is a uniform grid of nodes in
which the neighbors can hear each other. The second is a randomly generated
distribution. The nodes have fewer neighbors on average in the random distribution
case and generate less network traffic.
OPNET only produces profiling information at the level of the leaf functions. The
generated fine-grained profiling data are grouped and classified using the WNP
guidelines. The results, presented in Figure 9 and Figure 10, look very similar to the
WNP cases. The kernels extracted are: searching (table look-up), packet processing
(parsing, modification, assembly, etc.) and memory (queues, buffers) management.
We are interested in both the total time these operations consume (Figure 9) and
the number of operations performed (Figure 10). Searching consumes 20%-45% of the
total time, and 26%-46% of the total operations performed are searches. Packet
manipulation (parsing, modification, re-assembly, etc.) consumes 18%-28% of the
total time, yet accounts for only 4%-9% of the operations performed. This implies
that a packet manipulation operation takes longer than average.
Figure 9 Functional profiling of the network layer: total time percentage breakdown (search, memory management, packet disassembly, packet assembly, timers, others) across the four scenarios (aodv_random45, dsdv_random35, aodv_uniform40, dsdv_uniform40)
Figure 10 Functional profiling of the network layer: number-of-operations percentage breakdown across the same four scenarios
Figure 11 Profiling a distributed multi-channel MAC: number of activations (search, queue processing, packet disassembly, packet assembly, timer processing)
MAC Layer Profiling
The kernels identified in our MAC profiling experiment are similar to those of the
network layer: searching, queue management, packet disassembly (parsing, pattern
matching), packet assembly and timer processing (see Figure 11). We have made the
following assumptions: the delivery request rate is 10 packets per second and there
are at most three delivery attempts.
Summary of Profiling Results
Table 1 summarizes the profiling experiments conducted in the previous sections. As
we go down the stack from application to physical, the processing speed increases
and the processing granularity decreases. Kernels are classified as either control
dominated or data dominated: a kernel is data dominated if its complexity comes
mostly from data processing, and control dominated if its complexity comes mostly
from control structures.

Table 1 Summary of profiling results

Layer                  Data kernels                        Control kernels                     Source granularity / speed
Application/Transport  Encryption; decryption;             -                                   Source data / sec.
                       compression; decompression
Network                Packet processing (parsing,         Routing/forwarding table            Packets / msec.
                       modification, assembly,             lookup; timers
                       disassembly)
MAC                    Localization algorithm              Queue management; packet            Packets / µsec.
                                                           assembly/disassembly; timers;
                                                           channel assignment; table lookup
Physical               CRC/verification; complex           Synchronization; timers;            Bits / nsec.
                       multipliers; match filters; FIR     segmentation;
                       filters; correlators;               assembly/disassembly
                       magnitude-squarers; CORDICs
The classification of the functional kernels into control and data processing
operations is very important for the realization of an efficient implementation. The
different nature of control and data processing operations intuitively leads to different
“optimal” implementation structures. An efficient architecture for protocol
implementation should contain a mixture of these different implementation structures.
The exact proportion of the different structures depends on the ratio between control
and data in the application. In the following section, we will introduce a “hybrid
architecture” that is based on this concept.
From the profiling results, we can conclude that the implementation of dedicated
engines that perform packet processing, table searching and queue management may
greatly improve the overall system performance. Furthermore, since searching is
mostly routing and forwarding table lookups, a routing protocol that does not involve
the maintenance of large tables (or system states) should lead to a cheaper
implementation.
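As a concrete picture of why small tables matter, here is a hedged C sketch of a forwarding-table lookup; the field names and the fixed table size are our illustrative assumptions, not PicoRadio's actual data structures. At sizes like this, a dedicated search engine can replace the software loop entirely:

```c
#include <stdint.h>

/* Illustrative forwarding table; a routing protocol that keeps this small
   (or avoids it altogether) leads to a cheaper implementation. */
#define TABLE_SIZE 16

struct route_entry {
    uint16_t dest;       /* destination node id (hypothetical field) */
    uint16_t next_hop;   /* neighbor to forward to */
    uint8_t  valid;
};

struct route_entry route_table[TABLE_SIZE];

/* Linear scan: at this scale a dedicated engine can resolve the search
   in a few cycles, which is what motivates hardware search engines. */
int lookup_next_hop(uint16_t dest, uint16_t *next_hop) {
    for (int i = 0; i < TABLE_SIZE; i++) {
        if (route_table[i].valid && route_table[i].dest == dest) {
            *next_hop = route_table[i].next_hop;
            return 1;
        }
    }
    return 0;            /* no route known */
}

void add_route(uint16_t dest, uint16_t next_hop) {
    for (int i = 0; i < TABLE_SIZE; i++) {
        if (!route_table[i].valid) {
            route_table[i] = (struct route_entry){ dest, next_hop, 1 };
            return;
        }
    }
}
```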
2.3.1.2 Architecture Exploration
To ensure the most usable system platform for the remainder of the design process,
the components of the hardware platform should provide sufficient coverage of the
functions of the application domain. Functional profiling identifies the most crucial
functions that we need to address. Architectural exploration generates a set of
architectural modules that best supports these functions.
To enable software reuse and allow application exploration, a wireless platform
must provide some degree of programmability. Traditional computing architectures
range from microprocessors to ASICs, but most fail to meet the energy requirement
or the flexibility requirement. There also exist configurable processors [16] [17], but
too often they are designed to specialize for applications other than communication
protocol processing. This leaves us with the family of reconfigurable logic
architectures, which are sufficiently low-level to allow low-energy circuit techniques,
while providing some degree of flexibility through reconfiguration. Also, they are not
limited to certain application sets.
Traditional Reconfigurable Architectures
Traditional reconfigurable architectures come in two flavors: field-programmable
gate array (FPGA) and programmable logic device (PLD) [18]. FPGA and PLD
architectures differ significantly in granularity. Using look-up table (LUT)
technology, FPGAs can efficiently implement any arbitrary logic with few inputs.
Since the LUTs are easily chained together to implement multilevel logic, this
architecture is well suited for complex operations such as arithmetic and signal
processing. On the other hand, the PLDs use programmable array logic (PAL) blocks
that can each implement sum-of-product logic of many inputs but limited output.
Thus, PLD structures are suitable for control FSMs. We performed experiments
mapping benchmarks from PicoRadio II to commercial FPGA and PLD chips, and
measured their utilization based on equivalent gate count. The results, shown in
Figure 12, are consistent with the theoretical claims.
Hybrid Architecture for Protocol Processing [19]
Since the PicoRadio protocol stack takes the form of EFSMs, we are constructing a
reconfigurable architecture that uses both PAL and LUT blocks, for control and
datapath respectively. By applying each structure to the functions it is best suited
for, we can achieve the best performance in the combined structure.
The architecture uses hybrid cells, each consisting of a small PAL block for
control and a small array of LUTs and flip-flops (FFs) for data processing (see
Figure 13); each cell thus corresponds to a small FSM. Since protocols have many
interacting FSMs, the architecture comprises an array of these hybrid cells.
Figure 13 shows a detailed block diagram of a hybrid cell. Since the data
processing elements are isolated in the LUT portion of the cell, the FSM must
generate control signals that feed into the LUTs. The data inputs go directly to the
LUTs, and the control inputs go directly to the PAL. Similarly, the data outputs come
from the LUTs, and the control outputs come from the PAL.

Figure 12 Implementation results of wireless protocol blocks: normalized utilization on FPGA and PLD fabrics for the PhysSend (FSM), Remote (FSM), GenSync (data) and MergeInteger (data) benchmarks
Since control and data outputs, as well as any internal control signals may be used
in the control plane, these signals must be fed back into the PAL block. Layout of the
block should be considered carefully to minimize the lengths of these feedback
signals.
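The control/data split of a hybrid cell can be mimicked in software. The following C toy is our illustration, not a model of the actual silicon: the FSM next-state logic (the PAL role) is kept separate from the data operation it steers (the LUT/FF role), and the two interact only through control signals.

```c
#include <stdint.h>

/* Toy hybrid cell: a small FSM (control plane, the PAL role) steering a
   data operation (datapath, the LUT/FF role). The three-state "protocol"
   and the checksum operation are purely illustrative. */
enum state { IDLE, HEADER, PAYLOAD };

struct cell {
    enum state s;        /* FSM state, held in the control plane */
    uint8_t    checksum; /* extended variable, held in the datapath FFs */
};

/* One clock tick: control inputs drive the FSM, which raises a control
   signal ("accumulate") that steers the datapath. */
uint8_t cell_step(struct cell *c, int start, int end, uint8_t data_in) {
    int accumulate = 0;                   /* control signal into the LUTs */
    switch (c->s) {                       /* PAL: next-state logic */
    case IDLE:    if (start) c->s = HEADER;             break;
    case HEADER:  c->s = PAYLOAD; accumulate = 1;       break;
    case PAYLOAD: if (end) c->s = IDLE; accumulate = 1; break;
    }
    if (accumulate)                       /* LUTs + FFs: data processing */
        c->checksum ^= data_in;
    return c->checksum;                   /* data output */
}
```

An array of such cells, with outputs fed back into the control planes, is the software analogue of the interacting-FSM structure described above.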
As mentioned in Section 2.3.1, to simplify performance evaluation of the
architectures, an architecture should have estimation models that provide first-order
performance numbers for a given function. Based on empirical data from Spice
simulations, we obtained cost estimates and deduced a set of prediction equations for
the power costs of the FPGA and PAL structures. Figure 14 shows these equations.
The estimates are based on 0.25µm technology at a 1.0V supply.

Figure 13 Basic block diagram of the hybrid architecture: a PAL block (control) beside LUTs and FFs (datapath), with data inputs/outputs on the LUT side, control inputs/outputs on the PAL side, and internal control signals between the two

Estimation of the data portion is based on the energy consumed by LUTs and FFs.
Estimation of the control portion is based on a dynamic logic PAL implementation, which has significantly
lower energy consumption than a traditional sense amp based implementation. Note
that interconnect power is not included, which is a degree of error that we allow for
the sake of simplicity.
With the power models, we can obtain first-order performance results to see how
well the architecture works for our applications. Using the mac_a design from the
TCI project as benchmark, we estimated its power consumption under three different
scenarios: purely FPGA implementation, purely PAL implementation and the
proposed hybrid approach. The results, shown in Figure 15, suggest that the hybrid
architecture outperforms the other two scenarios. We can expect even greater gains
as our research in low-energy PALs matures and the power dissipation of the PAL
decreases.
Figure 14 Power estimation equations for LUT and PAL implementations:
LUT Power = (LUTs * 2.2 µW) + (FFs * 0.5 µW)
PAL Power = (P-terms * Inputs * 0.05 µW) + (Outputs * 0.7 µW)
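For convenience, the Figure 14 models can be restated as code. The helper names are ours; the constants are exactly those given above (results in µW, interconnect power excluded):

```c
/* First-order power models from Figure 14 (0.25um technology, 1.0V). */
double lut_power_uw(int luts, int ffs) {
    return luts * 2.2 + ffs * 0.5;                   /* uW */
}

double pal_power_uw(int pterms, int inputs, int outputs) {
    return pterms * inputs * 0.05 + outputs * 0.7;   /* uW */
}
```

For example, a datapath of 10 LUTs and 4 FFs is estimated at about 24 µW under this model.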
2.3.1.3 Architecture Library

Figure 15 Power comparison of different architectures: power dissipation (µW), broken into P(PAL) and P(FPGA) components, for the pure FPGA, pure PAL and hybrid implementations

The output of phase I is a library of architectural modules with corresponding
performance, energy and area prediction models.

Figure 16 Architecture library: a processor model annotated with per-instruction cycle costs (e.g. inst,LD,2; inst,MUL.i,18; inst,DIV.d,155); a SONICS interconnect model (initiator core/agent and target agent/core connected through OCP, with an arbiter); an ASIC model with delay and power models; an OS model parameterized by scheduler type and task-management delay overhead; and configurable-logic power models PAL = (3.54 + M * 6.75 + (P-M) * 1.02 + I * 4.44) * ToggleRate * Vdd^2 and FPGA = (L * 4.18 + F * 0.95) * ToggleRate * Vdd^2

Figure 16 shows an example library
with hardware, software and interconnect platform components. The hardware
platform components shown are a microprocessor, an ASIC and configurable logic,
each annotated with estimation models. The OS is a typical software platform
component; critical parameters in estimating OS performance include, but are not
limited to, the type of scheduler and the task management overhead. The interconnect
platform shown is a SONICS bus model.
2.3.2 Phase II – Platform Instantiation
Once a system platform has been defined, we
need to explore within the system platform to find
a platform instance that is suitable for a given set
of applications and constraints. To do so, we
employ the Y-chart approach [20], which involves
an iterative process of mapping functions to
parameterized architectural modules, and
evaluating the performance of the resulting
platform under the given set of functional
constraints. This process is illustrated in Figure
17.
To fully explore the design space, we need to re-emphasize the need for separation
of functional and implementation concerns. Doing so provides the designer with the
greatest degree of freedom in choosing the best solution. An important consideration
is the need for a purely functional design specification with an underlying formal
mathematical model, as stated in Section 2.1.

Figure 17 Phase II – Platform Instantiation (mapping the functional specification onto the configurable platform, followed by performance evaluation)
Given that a well-defined system platform is in place, platform analysis becomes a
relatively simple process. In the system platform, we have available a library of
architectural modules with corresponding performance, energy and area prediction
models. From these modules, we can construct different platforms with varying
performance for our target applications. By examining multiple platforms, we can
quickly converge to an optimal platform instance that satisfies the set of design
constraints of our applications.
2.3.2.1 PicoRadio II Case Study
While designing the PicoRadio II platform, we used the Visual Component Co-
design (VCC) tool from Cadence Design Systems [21] to perform architecture
exploration. The behavior description of the protocol design is specified formally
with CEFSM, as shown in the upper portion of Figure 18. The design is described
hierarchically: the top level consists of data (de)compression blocks (MULAW), a
User Interface (UI) and the protocol stack. The protocol stack expands to include the
transport layer, the MAC layers, and the physical layer (Transmit, Receive, Time base
and Synchronization blocks). VCC provides an architecture library containing
characterizable architecture modules such as microprocessors, ASIC blocks,
operating systems and interconnects. An architecture platform constructed using the
library, as shown in Figure 18, consists of an ARM7 Thumb microprocessor running
the eCos operating system [22] and an ASIC block connected with the TDMI
interconnect bus. In this particular implementation, we have mapped the UI and the
transport layer onto software (eCos) and the rest of the protocol stack onto hardware (ASIC).
The quality of a mapping can be evaluated with the VCC performance
evaluation tool. Figure 19 shows an experiment on the software mapping. If we only
map the UI and the transport layer onto software, the processor can run at a low
frequency of 1 MHz and still be under-utilized. If we speed up the processor to 11
MHz, we can also map MULAW onto software and increase the processor utilization
to 32%. However, we cannot map the entire MAC layer onto software even if we
drastically increase the processor speed to 2 GHz. In fact, processor utilization drops,
and event losses and timing violations occur, as we try to accommodate more of the
MAC onto a faster processor. The reason is that OS-related overheads dominate the processor as we
increase the processing speed. The issue of OS overhead will be discussed in greater
detail later in the thesis.

Figure 18 Architecture exploration with the VCC design tool (behavior description mapped onto the architecture platform)
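A back-of-the-envelope model makes the trend plausible. This is our illustration, not VCC's evaluation algorithm, and all the rates and cycle counts below are made up: the point is that OS overhead is paid in cycles per event, so at bit-level event rates it swamps the cycle budget regardless of clock frequency.

```c
/* Utilization = fraction of the processor's cycle budget consumed per
   second. Illustrative first-order model only: a real evaluation must
   account for scheduling, blocking and event losses, as VCC does. */
double utilization(double event_rate_hz,        /* events/s into software  */
                   double app_cycles_per_event, /* application work        */
                   double os_cycles_per_event,  /* context switch, queues  */
                   double clock_hz) {
    return event_rate_hz * (app_cycles_per_event + os_cycles_per_event)
           / clock_hz;
}
```

With a low, packet-level event rate a 1 MHz clock already suffices in this model, while a high, bit-level event rate exhausts even a multi-GHz budget once the per-event OS cost is charged.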
Figure 19 Performance evaluation: processor utilization versus clock frequency for four software mappings (UI + Transport on an ARM at 1 MHz: 5.46%; UI + Mulaw + Transport at 11 MHz: 32.7%; UI + Mulaw + Transport + 0.5 MAC at 200 MHz: 2.7%; UI + Mulaw + Transport + 0.9 MAC at 2 GHz)
2.3.3 Phase III – Implementation
Once we have a platform that meets all of the design constraints, we are ready for
the final implementation of the design. In this phase, we perform hardware and
software synthesis to implement a specific application
onto the platform instance. In the software synthesis
process, generation and compilation of application code
is performed to translate high-level description into
executables. A real-time operating system (RTOS) is
selected to handle the synchronization and
communication in the system. In the hardware synthesis
process, the hardware-mapped application blocks are
transformed from high-level description to synthesizable
HDL.
Currently, for most existing embedded systems, software synthesis typically means
transforming an application with inherent concurrency into a sequential program
running on a uniprocessor. Application code generation is accomplished by turning
each concurrent system component in the specification into a task. Communication
wrapper functions are then generated to connect the tasks to the RTOS, which
manages task communication and scheduling.
Figure 20 Phase III – Implementation (implementation of the design onto the platform instance from Phase II)
Figure 21 shows the software synthesis process for PicoRadio II using VCC. eCos
was chosen as the embedded OS due to its availability and efficiency. The right-hand
side of the figure shows a snapshot of the generated code. This code is invoked after
system initialization to start up the application program. A thread is created for each
component mapped to software and subsequently suspended. After the OS scheduler
starts, the application threads are resumed and the application starts running.
In hardware synthesis, high-level application specifications need to be translated
into synthesizable languages, which are the entry languages of synthesis tools that
generate silicon. Most tools take HDL as the entry description language. In PicoRadio
II, the hardware-mapped MAC and Physical layers are implemented with standard
cells.
2.4 Design Iteration
At the end of the implementation phase, we need to ask ourselves whether what
we get is what we want; in other words, whether the implementation fulfills the
design specification and meets the constraints. If not, we need to go back to Phase I
and/or Phase II to iteratively refine the platform or the mapping until a satisfactory
implementation is obtained.

Figure 21 Software synthesis for PicoRadio II; the generated startup code takes the form:

    void cyg_user_start(void) {
        ...
        cyg_thread_create(0, task_ui_2_, 0, ...);
        cyg_thread_create(0, task_transport_1_transport_bs_, ...);
        cyg_thread_create(0, task_transport_1_transport_remote_, ...);
        ...
        cyg_thread_resume(task_ui_2__handle);
        cyg_thread_resume(task_transport_1_transport_bs__handle);
        cyg_thread_resume(task_transport_1_transport_remote__handle);
        ...
    }
The final chip layout of PicoRadio II is shown in Figure 22. We notice that
the software portion of the architecture, including the processor and its memory blocks,
occupies more than 70% of the total area. This is especially inefficient considering
that the processor is greatly under-utilized (utilization < 7%). The reason is that the
software-implemented UI and transport layers run at a much lower activity and rate
(user-request and packet-level processing) than the hardware-implemented MAC and
physical layers (bit-level processing).
Careful analysis of the software code reveals that, of the total 10K-byte instruction
code size, about 50% is communication overhead. The massive data memory size of
54K bytes is a result of the communication overhead, expensive scheduler overhead,
memory management, and stack allocations.
The above design is quite inefficient and far from meeting our application
specification of low cost and low power. The inefficiency is due to two major faults in
our design process. The first fault lies in Phase I, where we did not include
memory modeling in our processor model. Hence, in Phase II, the software
components are mapped to a processor that has no instruction and program memory,
which results in a gross underestimation of the cost of the design. The second fault
lies in Phase III: the ineffective software code, laden with OS overheads, is the result
of a poorly chosen OS and synthesis process.

Figure 22 PicoRadio II floorplan (3.6 mm x 2.8 mm): 64 KB instruction SRAM, 64 KB data SRAM, Xtensa processor with instruction cache, Sonics interconnect, tci_prot protocol block, audio data and flash interfaces, TAP, 40 I/O, PPI, and wiring channels
The obvious correction to the first fault is to incorporate memory requirements in
processor modeling and performance estimation. The second fault can be greatly
mitigated by replacing eCos, a general-purpose OS, with one that better matches the
application. TinyOS, an OS specifically developed to target event-driven
systems, is a promising candidate. We have re-run the software synthesis process with
TinyOS as the RTOS and were able to reduce the memory requirement drastically.
The application code size has gone from 10K to 5K, with the original 50% overhead
nearly eliminated. The OS now occupies only 3K of memory space, and the data
memory size has gone from 54K to merely 3K. The details of this experiment will
be presented in the next section, Reactive Operating Systems.
The software synthesis experiment of PicoRadio II has drawn our attention to the
vast impact of the OS on the system. We will devote the rest of the thesis to searching
for the "right" OS for our targeted applications: wireless communication systems. The
existing TinyOS [23] will be examined in detail and then extended to construct a
hierarchical system management framework.
2.5 Methodologies Comparison
To demonstrate the effectiveness of our methodology, we will present a
comparison between a design implemented with the traditional ad-hoc methodology
and one implemented with our formal methodology.
PicoRadio I is the earliest prototyping version of PicoRadio. Used to
demonstrate the feasibility of sensor networking, it was an ad-hoc design built out
of off-the-shelf components. The PicoRadio I test board consists of a StrongARM
processor and Xilinx FPGAs, and implements a protocol very similar to that of
PicoRadio. Since PicoRadio I functions as the test-bed for the PicoRadio project,
design methodology and optimization are not its primary concerns.
Obviously, it is unfair to compare the cost and performance of PicoRadio I to that of
PicoRadio II. However, their vast differences in implementation costs demonstrate
what a formal methodology can accomplish. As shown in Table 2, there is a one to two
orders of magnitude improvement in software costs. It is rather difficult to compare
the hardware costs due to the differences in implementation fabrics: an ASIC gate
typically "costs" less than an FPGA gate. Even if the hardware portion of PicoRadio II
is three times more expensive than that of PicoRadio I, the winning edge it has in
software should well compensate for it.
Table 2 Implementation cost comparison between PicoRadio I and PicoRadio II

                                  PicoRadio I     PicoRadio II
Hardware (equivalent gate count)  21200 (FPGA)    72377 (ASIC)
Software: instruction memory      200K            8K
Software: data memory             225K            3K
3. Reactive Operating Systems -- the Software
Management Layer
As we have seen in Section 2.4, OS support is crucial for the design of ultra-low
energy communication systems. These systems, reactive in nature, tend to have a high
level of integration and system heterogeneity. General-purpose operating systems
developed for broad application are increasingly less suitable for these types of
complex real-time, power-critical, domain-specific systems implemented on advanced
heterogeneous architectures. The current practice of developing the OS and the
application independently, in particular the paradigm of blindly treating a task as a
random process, is unlikely to yield an efficient implementation [25].
What is needed is an OS that is intimately coupled to, aware of, and interactive
with its managed applications; specifically, a capable but "lean" OS developed
to target the nature of these reactive, event-driven embedded systems. "Capable" in
the sense that it provides adequate support for concurrency in both the application
and the architecture; "lean" in the sense that it executes with minimal overhead. Since
power is the most critical factor in the design, aggressive power management schemes
should be deployed to drive down the overall system energy expenditure.
Instead of a general-purpose operating system, an OS that more closely "matches"
the application greatly improves the opportunity for an efficient final implementation.
By match we mean an OS whose Model of Computation (MOC) is similar to that of
the application. Since our targeted applications are event-driven reactive systems, we
will first introduce the basic properties of reactive systems. Then we will describe the
characteristics of a traditional general-purpose multi-tasking OS and an event-driven
OS in detail, and present a comparison between them in terms of MOC, generality,
communication, concurrency support, and memory and performance overhead. The
software implementation of PicoRadio II is used as the case study for both.
3.1 Reactive System Behavior
Reactive systems perform tasks in response to input events. A system can
generate events either actively or in response to the environment. A system is purely
reactive if it is invoked only to respond to events [24].
Figure 23 Reactive system example – a PicoRadio node: the protocol stack (App/UI, Transport, Network, DLL (MAC), Baseband, RF (TX/RX), antenna) interacting through events with the sensor/actuator interface, locationing, aggregation/forwarding, user interface, sensors/actuators, power control, ranging, reactive radio and energy train components
PicoRadio is an example of a reactive system. Reactivity is triggered by external
events such as sensor data acquisition, transceiver I/O, timer expiration, and other
environmental occurrences. Both the communications between nodes and inside
nodes are predominantly asynchronous.
Figure 23 is a behavior diagram of a PicoRadio sensor node. It shows the different
components in the system and the interactions between them through event handling
(events are represented as arrows). Communication between components is purely
reactive. External events cause the generation and propagation of internal events.
3.2 Inadequacy Of Traditional General-Purpose OS’s
The general-purpose multi-tasking OS was originally developed for the PC
platform and later adapted for general embedded systems. It is good for supporting
several mostly independent applications running in virtual concurrency. Suspending
and resuming processes when appropriate provides support for multi-tasking and/or
multi-threading. Inter-task communication involves context switching, which can
become an expensive overhead as the switching frequency increases. This
overhead is tolerable for PC applications since the communication and hence
switching frequency is typically low when compared to the computation block
granularity. Moreover, as these overheads grow, the wasted energy expenditures are
of relatively little concern for these virtually-infinite-energy systems. Since general-
purpose OSs do not target low-power applications, they have no built-in energy
management mechanisms; any that are employed are deferred wholly to the
application, with its limited system scope.
It is apparent that the MOC of the general-purpose OS is quite different from that
of the protocol stack. The processes across the layered protocol stack are not
independent: they are coupled, activated and deactivated by events from
neighboring processes. In other words, the communication frequency is high amongst
neighbors and high overheads are far less tolerable. As we have seen in the software
synthesis experiment with eCOS, described in Section 2.4, this MOC “mismatch”
results in major inefficiencies.
3.3 Event-driven OS
An event-driven OS is designed specifically for event-driven communication
systems. Its MOC is the Communicating Extended Finite State Machine (CEFSM),
which matches that of the protocol processing system.
This match drastically reduces the communication overhead as well as other OS
related costs. Because it is not designed to support a broad range of general
applications, it can cut down on expensive OS services such as dynamic memory
allocation, virtual memory, etc. In addition, unnecessary performance-degrading
polling is eliminated and context switching is minimized and very efficiently
implemented.
TinyOS [23] is a rather successful example of event-driven OS. In TinyOS, an
application is written as a graph of components. For the PicoRadio II example,
components would be the layers in the protocol stack. Each component has command
and event handlers that process commands and events from other components, tasks
that provide a mechanism for threaded description, and a static frame that stores
internal state and local variables.
The TinyOS system operation can be briefly described as follows: external
events from the RF transceivers or sensors propagate from the lowest layers up the
component graph until handled by the higher layers. To prevent event loss, the system
must process incoming events faster than their arrival rate. Threaded behavioral
description is supported via tasks, which are operations in the event or command
handlers that require a “significant” number of processor cycles. Tasks are pre-
empted by the arrival of an incoming event and are dispatched from a task queue.
TinyOS uses a simple FIFO task scheduler. Built-in power control is exercised by
shutting down the CPU when no tasks are present in the system after all event
processing.
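The run-to-completion task model and sleep-on-empty power control described above can be sketched in a few lines. This is a hedged illustration in Python, not TinyOS code (TinyOS components are actually written in nesC); the names `Scheduler`, `post`, and `handle_event` are hypothetical.

```python
from collections import deque

class Scheduler:
    """Minimal sketch of a TinyOS-style FIFO task scheduler."""
    def __init__(self):
        self.tasks = deque()     # simple FIFO task queue
        self.sleeping = False    # models the CPU shut-down state

    def post(self, task):
        # Event and command handlers post tasks for deferred execution.
        self.tasks.append(task)
        self.sleeping = False    # pending work keeps the CPU awake

    def handle_event(self, handler, *args):
        # Events run immediately (conceptually pre-empting tasks) and may
        # post follow-up tasks needing a "significant" number of cycles.
        self.sleeping = False
        handler(self, *args)

    def run(self):
        # Each task runs to completion; no task pre-empts another task.
        while self.tasks:
            self.tasks.popleft()(self)
        # Built-in power control: no tasks left, so shut down the CPU.
        self.sleeping = True
```

A radio-packet event handler would, for example, copy the packet out of the receive buffer and post a task that performs the longer protocol processing.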
3.4 Comparison Results
In this section, we will present a comparison between the general-purpose OS
(eCOS) and the event-driven OS (TinyOS) in three important performance metrics:
memory requirement, performance, and power.
Table 3 summarizes the contrast between the two OS’s as presented in Sections 3.2
and 3.3. By trading off generality for performance and code size, TinyOS can better
target event-driven systems. Table 4 shows the memory requirement comparison
between the two OS’s. With the same processor (an ARM7 using the 16-bit Thumb
instruction set), TinyOS needs half the instruction memory and one-thirtieth the data
memory. Studies showed
                          General purpose OS   Event-driven OS
MOC                       Multi-tasking        Communicating EFSMs
Generality                General              Targets event-driven systems
Communication Overhead    Large                Small
Memory Requirement        Large                Small
Communication Frequency   Infrequent           Frequent
Table 3 General comparisons.
OS type          Processor    Application  Total instruction mem  Data mem
General Purpose  ARM7 thumb   10,096       22324                  54988
Event-driven     ARM7 thumb   5312         8000                   2800
Event-driven     8 bit RISC   2740         3176                   709
Table 4 Memory requirements comparison.
that the power consumption of SRAM scales roughly as the square root of the
capacity [26]. This implies that with TinyOS, instruction memory power can be
reduced by 1.6x, and data memory power by 4.2x. Using a simpler processor such as
8-bit RISC could further reduce memory size and power consumption.
Figure 24 presents the performance comparison. The left graph compares the total
processor cycle count: 16365 vs. 2554. TinyOS shows a factor-of-eight
improvement, which translates directly to a factor-of-eight reduction in processor
power consumption. The right graph compares the OS overhead (the lowest portion
of the bars) as a percentage of the total cycles. As an indication of its inefficiency, the
general-purpose OS has an OS overhead of 86% while TinyOS has 10%.
Now let us calculate how much power is actually saved considering both the
processor and its memory blocks. With a 0.18µm technology and a supply voltage of
1.8V, an ARM7 consumes 0.25mW/MHz. For a memory size of 64KB, read per
Figure 24: General-purpose versus event-driven OS. Left: total cycle count at 1MHz;
right: percentage breakdown by component. Key at right identifies system components.
access consumes 0.407mW/MHz and write access consumes 0.447mW/MHz. Assuming
that 10% of the instructions involve memory reads and 10% memory writes, and
applying memory-size as well as processor-cycle-count scaling, the power
consumptions for the two OSs are 0.608mW/MHz and 0.053mW/MHz. That is, TinyOS
demonstrates a factor-of-12 improvement in power.
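The calculation above can be reproduced approximately from the figures given in the text and the memory sizes read from Table 4. The accounting below (one instruction fetch per cycle, square-root capacity scaling of the 64KB memory figures, and scaling TinyOS by its fraction of the total cycle count) is one plausible reading of the method, so its results only roughly match the 0.608 and 0.053 mW/MHz quoted.

```python
from math import sqrt

P_CPU   = 0.25    # ARM7 core power, mW/MHz (0.18um, 1.8V)
P_READ  = 0.407   # 64KB memory read, mW/MHz
P_WRITE = 0.447   # 64KB memory write, mW/MHz

def mem_scale(size_bytes):
    # SRAM power scales roughly as the square root of capacity [26],
    # here normalized to the 64KB reference point.
    return sqrt(size_bytes / (64 * 1024))

def node_power(instr_bytes, data_bytes, cycle_fraction):
    """Average power in mW/MHz, assuming an instruction fetch every cycle
    and 10% of instructions each for data reads and data writes."""
    p = (P_CPU
         + P_READ * mem_scale(instr_bytes)          # instruction fetches
         + 0.1 * P_READ * mem_scale(data_bytes)     # data reads
         + 0.1 * P_WRITE * mem_scale(data_bytes))   # data writes
    return p * cycle_fraction

p_gen  = node_power(22324, 54988, 1.0)            # general-purpose OS
p_tiny = node_power(8000, 2800, 2554 / 16365)     # TinyOS
# Under these assumptions p_gen comes out near 0.57 and p_tiny near 0.06,
# an improvement on the order of the factor of 12 quoted above.
```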
It should be emphasized that TinyOS is only superior to eCOS for a very specific
class of event-driven applications. eCOS, capable of supporting general applications,
is fully equipped with classical OS utilities such as memory management, re-entrant
interrupt services, a multi-threaded preemptive scheduler, etc., while the TinyOS
kernel is just a simple non-preemptive scheduler with no support for memory
management or priority interrupts. By forcing the application to be written in a
“disciplined” manner, it can delegate much of the inter-component communication
and synchronization to the application program itself. Compared to eCOS, TinyOS
is simply more specialized and therefore more suitable for event-driven applications
that may not require all the expensive OS services.
3.5 Required Extension Of TinyOS
Our driving research goal is to design an energy efficient OS for domain specific
heterogeneous architectures. We believe that some basic TinyOS concepts are very
attractive and can be adopted to reach such a goal. However, primarily developed for
a uni-processor architecture, TinyOS has its limitations and is insufficient to fulfill
the ambitious software management role demanded by low power heterogeneous
systems. It has to be properly extended to the system level to include management of
not only computation on the embedded processor, but also computation on the
optimized architecture modules.
TinyOS concepts are very attractive in the following aspects:
- It has a MOC that closely matches that of reactive systems. Its event-driven
asynchronous characteristics can naturally support the interactions between
modules of vastly different behaviors and processing speeds in a
heterogeneous system.
- Its simplicity reduces overheads and leads to a more power-efficient
implementation.
- It provides some support for multiple flows of control (through the usage of
“tasks”).
The inadequacies of TinyOS, which prevent it from fully realizing the role of
managing power-critical heterogeneous systems, can be summarized as follows:
- Its software-centric approach does not allow full exploitation of the integrated,
heterogeneous system architecture. TinyOS is primarily designed for uni-processor
architectures: all components except for the lowest layers of the application
are implemented in software, and low-level hardware components require
a software wrapper to interact with the scheduler and the rest of the
system. All software components must share resources such as the CPU,
whereas in a heterogeneous system, components can be mapped to separate
hardware blocks and run simultaneously. Limited by the sequential
execution model of software, TinyOS does not take advantage of the
concurrency offered by the hardware components.
- The application is described as a flat component graph, which may not scale well
as the number of system components increases with design complexity.
- Concurrency support is limited. In TinyOS, concurrency is supported via
tasks, which reside in the components. A task has to run to completion
and cannot be pre-empted by another task, so there is no real support for
multi-threading: essentially only one thread is active at any given time
in the system. The lack of multi-threading support makes specifying
complex concurrent behavior awkward. In addition, since the system is only
pre-emptive while executing tasks, events will be lost if the event inter-
arrival rate exceeds the processing speed of the processor during any given
time period.
- Its rudimentary power management scheme is inadequate for very power-critical
applications.
The last limitation is due to the fact that TinyOS has no access to customized
power-efficient blocks and can only apply power control at the highest level. To drive
down the overall system energy expenditure, power management should be applied at
all levels of the design hierarchy: the system level, architecture-module level, circuit
level and device level [12]. By carefully incorporating power management into the
individual architectural modules, we can push power management down the design
hierarchy.
In the next chapter, we propose a hierarchical event-driven power management
framework that incorporates the attractive TinyOS concepts and strives to overcome
its limitations.
4. Hierarchical Power Management Framework
In this chapter, we will propose our own power management framework for the
PicoRadio network. Our proposed solution utilizes a system management framework;
it exploits the reactive event-driven nature of the systems, and deploys aggressive
power management. While TinyOS targets a software centric architecture and
schedules only software tasks, our vision of the OS is as the global scheduler that
manages the entire heterogeneous system. Replacing TinyOS’s flat component graph
structure, we adopt a hierarchical composition that enhances design scalability,
supports concurrency, and enables power control at various granularities. In addition,
since we have the luxury of co-designing the OS with the architecture platform, we
can integrate power management into the individual architectural modules. Given that
the OS has the global “view” of the system, it can perform global power management
to optimize the overall system power consumption.
Most state-of-the-art power management systems handle only stand-alone devices.
The scope of our power management algorithm, however, is not limited to individual
nodes; instead, it aims to serve the interest of the network as a whole. Our
power management algorithm executes in two phases: a network-level algorithm first
treats the whole node as one entity and decides when the whole node should go
to sleep; then, once the node is turned on, a node-level algorithm determines the
scheduling of the various modules inside the node.
The rest of this chapter is organized as follows: we will first describe our vision
of the role of the global power scheduler and system manager; then we will discuss
some of the existing work on power management; lastly, we will propose our two-
level power management algorithm. The details of the node-level and network-level
power management algorithms will be presented in Chapters 5 and 6
respectively.
4.1 The Global Power Scheduler And System Manager
In a complex heterogeneous system, the OS acts like a hardware abstraction layer
[11] that manages a variety of system resources. For power critical applications,
simplicity should be the primary design philosophy. The OS should perform a
dedicated set of indispensable duties and only these duties. There are two basic OS
duties: concurrency management in both the application and architecture domains,
and global power management.
Wireless sensor applications typically have multiple flows of control and data. A
sensor node can sense the environment, forward packets and receive commands all at
the same time. The OS needs to support concurrency in the application as well as
explore and utilize the concurrency in the heterogeneous architecture. Since the OS
has the global “view” of the system, it can also perform global power management to
optimize the overall system power consumption. Essentially, the OS becomes a
power scheduler that schedules the power state transitions of the system modules such
that the overall power consumption is minimized, under a set of constraints. Such
intelligent power scheduling is called a power management policy.
Figure 25 illustrates the role of the OS as the global power scheduler in the
reactive system. Note that in a reactive system, all system modules are “off” until
powered-up by the arrival of events. This is quite different from the conventional
power management approaches where all the modules are assumed “on” and can be
put to sleep to conserve energy. The OS is refocused from the microprocessor and
becomes a separate unit that can be implemented in software, hardware,
reconfigurable logic, or some combination. All communications in the system are
event-driven; event flows are denoted by arrows.
The power manager (PM) is connected to the energy train; it has knowledge of
the battery level and the overall energy reserve of the system. Following the TinyOS
convention, the PM issues power-control commands to the system blocks and receives
events from them. Employing a clean and simple approach that avoids mutual-
exclusion problems, the PM has the exclusive right to initiate all power state
transitions and to manage power states. Only control signals go through the power
scheduler; data communications between the blocks do not involve the power
manager. Figure 25 shows a behavior diagram of the system; it should be underlined
that it does not imply an implementation. Individual blocks (including the OS) in the
figure can be implemented in software, hardware, or configurable logic.
4.2 Existing Works On Power Management Policies
The recent surge in the popularity of portable and wireless devices has drawn
tremendous interest in high-level power management. The range of applications
varies from battery-constrained devices such as laptops and cellular phones to
environmentally conscious servers and disk drives. All system-level management
policies seek to reduce energy consumption by selectively placing idle components
into low power states.
The vast majority of previous work on power management consists of studies done
on X servers, hard disk drives or laptop computers, in which the system is treated as a
stand-alone device. As far as the author is aware, no studies have been conducted
Figure 25 Behavior diagram of the PicoNode architecture
in the context of a network of devices, that is, how to manage power in the
interest of the whole network, not just isolated nodes.
The formal definition of power management policy is to predict when to change
power states based on system performance and resource constraints. The system
parameters relevant to policy formulation are: event inter-arrival and service rate
distributions; power consumption at various power states (Active /Idle / Sleep); and
power management overheads. The inter-arrival times between external events are
typically unknown and generally do not fall into well-known probability distributions
[28][29][32]. Service-time modeling is also very challenging for complex systems.
Power management overheads consist of wakeup performance and energy overheads,
associated communication overheads and the actual energy consumption of the power
manager. Wakeup performance and energy overheads are typically inherent to the
system and are normally given to the power manager as fixed parameters. Different
policies may have to be devised for different wakeup overheads. However, care
should be taken to ensure that the actual implementation cost of the power manager is
minimal; ideally it should be negligible compared to the dominating wakeup
overheads. A critical power metric of the system is the break-even time Teven, defined
as the ratio of the wakeup energy overhead to the idle power dissipation. Teven is also
called the shutdown threshold: it is the length of time such that if the system is idle
for time Teven, the energy expended idling equals the energy required to wake up
the system.
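The break-even rule can be stated in a few lines. This is a generic sketch; the unit choices and the example numbers in the test are illustrative, not taken from the PicoRadio design.

```python
def break_even_time(wakeup_energy, idle_power):
    """T_even = wakeup energy overhead / idle power dissipation.
    With energy in uJ and power in uW, the result is in seconds."""
    return wakeup_energy / idle_power

def should_shut_down(predicted_idle, wakeup_energy, idle_power):
    # Shutting down only pays off if the idle period exceeds T_even;
    # otherwise the wakeup overhead costs more than idling would have.
    return predicted_idle > break_even_time(wakeup_energy, idle_power)
```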
The simplest and most common policy is the time-out policy implemented in most
laptop operating systems. More sophisticated approaches fall into two
schools of thought. The first is based on stochastic models, which tend to be stationary
(the same policy applies at any point in time) in nature, except for the one presented in
[29]. The non-stationary (the policy changes over time) model described in [29] only
yields marginal improvement over its stationary counterparts while significantly
increasing the computational complexity and memory requirement.
The second school of thought comprises predictive policies that estimate the next idle
period of the device and, if the period is long enough, transition it into a low-power
state. The estimation is usually based on dynamically measuring event inter-arrival
rates. These policies are adaptive to traffic changes over time and are non-stationary.
In the next section, we will first briefly present the work of Simunic et al. [30],
representative of the stationary stochastic approaches, and then the works of Sinha &
Chandrakasan [31] and Hwang & Wu [32], representative of the dynamic predictive
approaches.
4.2.1 Stationary Statistical Power Management Policy
Paleologo et al. [27] first proposed a statistical algorithm based on stochastic
models. Stochastic models use distributions to describe the inter-arrival times of
requests, the service time of each request by the device, and the time it takes for the
device to transition between its power states. However, their algorithm was devised
based on the assumption that the inter-arrival times of requests are exponentially
distributed. This assumption is hard to justify and may not hold in view of significant
correlation between requests.
Simunic’s [30] work generalizes the above approach and removes the assumption
on exponential distribution. Her approach is based on Time-indexed Semi-Markov
Decision Model (TISMDP). The system model consists of three components: the
user, the device and the queue. The power management policy is formulated as a
constrained optimization problem: Minimize energy under performance constraints.
She has shown that the optimization problem can be solved exactly and in polynomial
time with guaranteed optimal results for general request distributions. This solving
process is both computationally and resource intensive: the total size of the system
state space equals the number of time indexes multiplied by the number of states in
the queue multiplied by the number of states in the device service model. Needless to
say, this computation has to be done off-line, and only the computed policy table is
loaded onto the device.
The policy works like a randomized timeout and is easy to implement on the
device. Table 5 shows a sample policy. This example application is a smart badge
device. It has two decision states: idle and standby. From the idle state, the device can
go into the standby or off state. From standby, it can only transition to the off state.
The optimal policy, as shown in Table 5, gives a table of probabilities determining
when the transitions between the idle, standby, and off states should occur. In this
particular example, it specifies that if the system has been idle for 50ms, the
transition to the standby state occurs with probability 0.4, the transition to the off
state with probability 0.2, and otherwise the device stays idle.
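The randomized-timeout behavior just described can be mimicked directly: at each tabulated idle time the device draws a random number and transitions according to the tabulated probabilities. The sketch below covers only the idle-state decisions and is an illustration of the mechanism, not Simunic's implementation.

```python
import random

# Idle time (ms) -> (P(idle->standby), P(idle->off)), as in Table 5.
IDLE_POLICY = {0: (0.0, 0.0), 50: (0.4, 0.2), 100: (0.1, 0.9)}

def idle_decision(idle_ms, rng=random.random):
    """Roll a die against the policy row for this idle time."""
    p_standby, p_off = IDLE_POLICY[idle_ms]
    r = rng()
    if r < p_standby:
        return "standby"
    if r < p_standby + p_off:
        return "off"
    return "idle"   # stay put until the next tabulated decision point
```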
The major advantage of the stochastic approach is that it guarantees optimality as
long as one can characterize the distribution models correctly. However, since these
models are generated using existing traces, the results are only as good as the traces.
In certain applications, the traces may be hard to characterize and not fall into any
known distributions. The main drawbacks of this approach are its computational
complexity and its inability to adapt to a changing environment. As mentioned in the
previous paragraph, the size of the global state space increases exponentially with the
number of components in the system, and its stationary nature implies that any
significant change in the service-request distribution will require expensive offline re-
characterization to generate a new policy table.
The authors in [29] attempt to extend this approach to accommodate non-stationary
service requests. However, their work yields at best an 8.5% power improvement
under loose performance constraints, and only a 1% improvement with tight
performance constraints, while requiring a 100X larger memory space. Moreover, the
energy saving was calculated without considering the energy overheads of either the
extra memory itself or the associated control logic.
Idle Time (ms)  Idle to Standby Prob  Idle to Off Prob  Standby to Off Prob
0               0                     0                 0
50              0.4                   0.2               0
100             0.1                   0.9               0.8
Table 5 Sample power control policy
4.2.2 Adaptive Power Management Policy For Non-Stationary
Traffic
Sinha & Chandrakasan [31] study power management for a sensor network
similar to the PicoRadio network. They concentrate only on individual sensor
nodes, not on the interest of the connected network as a whole.
Each sensor system has four components: a StrongARM processor, memory, a sensor
A/D converter, and the radio. Five global power states are generated by composing
the power states of these components. Table 6 shows the five global sleep states S0 to
S4 and their corresponding local states. From S4 to S0, the power consumption
increases while the wakeup overhead decreases.
One bold assumption made by the authors is that the inter-arrival time is
exponentially distributed, which is not justified by measured or simulated data. An
exponential distribution implies there is no correlation between past inter-arrival rates
and future ones. This assumption may not hold in view of the significant correlation
between arrivals in typical sensor networks. The algorithm measures the packet inter-
arrival rate and, based on this rate, decides which low-power sleep state to enter.
Sleep State  StrongARM  Memory  Sensor, A/D  Radio
S0           Active     Active  On           Tx, Rx
S1           Idle       Sleep   On           Rx
S2           Sleep      Sleep   On           Rx
S3           Sleep      Sleep   On           Off
S4           Sleep      Sleep   Off          Off
Table 6 Global power states for the sensor node
This is the first known work studying power management for wireless sensor
networks. Its dynamic sampling of the inter-arrival rate makes it adaptive to non-
stationary traffic. It is also one of the very few works that address systems with a
more complex global state space (not just limited to idle/sleep/active). The simplicity
of its implementation makes it feasible for sensor network applications. However, the
drawbacks are also apparent. The assumption of exponential traffic eases the analysis
but is unjustified and unrealistic. Furthermore, there is no analysis of the optimality
of this approach, in other words, no analysis of how close the results are to optimal.
As we will see in the more in-depth analysis presented in the next section, this
approach can produce very poor results under certain circumstances.
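A rate-based selection of this kind can be sketched as follows: measure the packet arrival rate, estimate the expected idle time as its reciprocal, and pick the deepest sleep state whose wakeup overhead is still justified. The threshold values below are illustrative placeholders, not the ones used in [31].

```python
# Deepest state first: (state, minimum expected idle time in ms that
# justifies its wakeup overhead). Threshold values are hypothetical.
SLEEP_THRESHOLDS_MS = [("S4", 500.0), ("S3", 100.0), ("S2", 20.0), ("S1", 5.0)]

def choose_sleep_state(arrival_rate_hz):
    """Pick the deepest sleep state the measured traffic allows."""
    expected_idle_ms = 1000.0 / arrival_rate_hz  # 1/rate under the
    # exponential-traffic assumption made in [31]
    for state, min_idle_ms in SLEEP_THRESHOLDS_MS:
        if expected_idle_ms >= min_idle_ms:
            return state
    return "S0"  # traffic too fast: stay fully active
```

Note that the 1/rate estimate is exactly where the exponential assumption bites: under correlated or bursty traffic it can be a poor predictor of the next idle period, which is the drawback noted above.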
Hwang and Wu’s [32] approach also dynamically samples the inter-arrival rates
and bases decisions on the nature of the data and its dynamic behavior. One
significant difference from the previous approach is that they assume the traffic is
correlated; that is, there is significant correlation between recent inter-arrival rates
and the inter-arrival rates of the near future. In particular, Hwang and Wu adapted the
exponential-average approach used in CPU scheduling to the prediction of the next
inter-arrival time, i.e. the next idle period of the processing unit. A straightforward
implementation of the algorithm, however, has a significant hardware cost. We will
implement this algorithm on the PicoRadio network in the next section, and a detailed
analysis will be presented.
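The exponential-average predictor itself is cheap to state, even though a naive hardware realization is costly: each observed idle period is blended into a running estimate. A sketch, with the smoothing weight `alpha` as a free parameter:

```python
def make_idle_predictor(alpha=0.5, initial_estimate=0.0):
    """Exponential average, as used for CPU-burst prediction in CPU
    scheduling: pred = alpha * last_observed + (1 - alpha) * pred."""
    state = {"pred": initial_estimate}

    def observe(last_idle):
        # Fold the most recent idle period into the prediction. A larger
        # alpha weights recent history more heavily (more adaptive).
        state["pred"] = alpha * last_idle + (1 - alpha) * state["pred"]
        return state["pred"]

    return observe
```

Choosing alpha as a power of two (e.g. 1/2) turns the multiplications into shifts and adds, which is one way to tame the hardware cost mentioned above.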
4.2.3 Dynamic Voltage Scaling (DVS)
In addition to the idle and low-power states, a component can also transition among
several active states. This is accomplished by incorporating DVS into the power
control policy. DVS algorithms adjust the device speed and voltage according to the
workload at run time. Since peak performance is not required at all times, the
device’s processing speed and operating voltage can be dynamically adapted to
increase energy efficiency. DVS is a very effective technique for reducing CPU
energy and is supported by many state-of-the-art embedded processors such as the
StrongARM [33] and Transmeta [35]. Software scheduling techniques for DVS are a
well-published area [30][31][34].
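The idea can be illustrated with a toy voltage-frequency table: per-cycle switching energy scales roughly as V^2, so the scheduler picks the lowest operating point that still fits the workload into its deadline. The operating points below are hypothetical, not taken from any processor datasheet.

```python
# (frequency in MHz, supply voltage in V), sorted slowest/lowest first.
OPERATING_POINTS = [(59, 0.9), (118, 1.2), (206, 1.8)]

def pick_operating_point(cycles, deadline_ms):
    """Lowest-energy point that still completes `cycles` by the deadline."""
    for mhz, volts in OPERATING_POINTS:
        if cycles <= mhz * 1e3 * deadline_ms:  # cycles available in window
            return mhz, volts
    return OPERATING_POINTS[-1]  # run flat out; deadline may be missed

def energy_ratio(volts, v_ref=1.8):
    # Dynamic energy per cycle ~ C * V^2, so relative to v_ref:
    return (volts / v_ref) ** 2
```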
4.3 Proposed Power Management Algorithm For Sensor
Networks
After reviewing the existing power management policies, we now proceed to study
policy formulation for PicoRadio wireless sensor network applications. We will first
address the unique challenges and then propose our power control algorithm.
4.3.1 Formulating Power Control Policy For PicoRadio Network
The existing works discussed in Section 4.2 deal only with the power
management of a single isolated device, whereas the goal of our research is to
perform power management in the interest of the entire network, not just isolated
nodes. In other words, we are not too concerned about the energy consumed in
individual nodes as long as the network is alive and well. This opens a new outlook
on power management itself, one completely unexplored by existing works.
We need to first redefine our goals for power control in a network setting. Clearly,
we should refocus from minimizing the energy of individual nodes and concentrate
instead on the interest of the entire network in terms of quality of service and lifetime.
One simple metric that matches this interest reasonably well is the worst-case node
energy consumption in the network, that is, the energy consumption of the node that
burns the most energy in the network. Intuitively speaking, controller
nodes at the center of the network consume far more energy than sensor nodes at the
edge of the network. Since energy dissipation translates directly to node lifetime, by
choosing a policy that performs well on these nodes, we avoid the “dying out” of
crucial nodes in the network and ensure a good quality of service. We re-
formulate our power control policy as follows:
Given a set of performance and system constraints and network topology,
predict when to change power states such that the worst-case node energy
consumption in the network is minimized.
4.3.2 Requirements For PicoRadio Network Power Control Policy
Most of the existing works model the processing unit as one “black box” that has
only a few low-power states. In the PicoRadio system, by contrast, the processing
unit is partitioned into multiple power domains that can be controlled independently.
The advantage of having a finer granularity of power control is apparent: we can
achieve a higher level of power saving. The disadvantage is the much larger global
power state space, which makes the implementation of the statistical approach
infeasible, as its computational complexity grows exponentially with the number of
components in the system. As we will see later, how to address this complexity is
crucial in policy formulation.
Moreover, a policy for sensor networks has to be adaptive to changes in the
environment and be able to handle non-stationary traffic. Wireless links (a wireless link is
an abstraction of the ensemble of modems, transmitter, receiver, channel) are often
noisy and lossy. Transmission errors occur due to noise and multi-path fading.
Moreover, distance-dependent path loss and adjacent channel interference also affect
the channels. In contrast to the wired connections, the underlying network cannot be
assumed to be reliable, and the time-varying network conditions should be explicitly
taken into account. Moving people and objects, neighborhood changes due to
mobility (neighbor nodes moving in or out of range) and finite node lifetimes, and
interference from other electronic devices all contribute to the frequently changing
environmental conditions. According to measurements done on wireless links
[36][37], both packet loss rates and mean bit error rates are time-varying. In [36],
packet loss rates range from 2% to as high as 80%, and bit error rates vary over
several orders of magnitude. This calls for the inclusion of adaptivity in both
protocol implementations and power control policies.
An additional reason for adaptivity is the desirability of deploying the same
policy across the network. It is obviously cumbersome and often impractical to
characterize the policy for every node in the network; it is therefore very
attractive to have a “self-adaptive” policy that works for different nodes that see
vastly different traffic conditions.
A further requirement is that the policy should be feasible to implement on the
sensor platforms. A sensor node typically has limited computational resources, so
care should be taken to ensure the cost of policy implementation does not offset its
benefit. A good counter-example is the straightforward implementation of Hwang &
Wu: it demands a large number of multiplications and additions and is clearly
infeasible for sensor networks.
Summarizing the discussions in this section, the “right” power policy for sensor
networks has to handle the complexity of system space modeling, be adaptive, and
be implementable on sensor platforms.
4.3.3 Proposed Power Control Algorithms
To handle the complexity of system space modeling and avoid state space explosion,
state space partitioning is a very effective method. The question is how to do the
partitioning properly. We have carefully studied the PicoRadio system and made the
following observation: it is much more costly to wake up a node from “deep sleep”
than to turn on the blocks subsequently. The reason is the following: in the “deep
sleep” mode, the power rails and the clock lines are both turned off. Waking up the
node from “deep sleep” involves re-synchronizing the clock and transitioning the
global memory blocks from “drowsy” to normal operational mode. Re-synchronizing
the clock involves locking the Phase Lock Loop (PLL), which can be very expensive
in both energy and performance. Once the node is started up and the clock is
running, it is relatively cheap to wake up the subsequent blocks. For the current
version of the PicoNode, it can take up to 300 clock cycles to synchronize the
clock, while waking up the MAC only takes 3 cycles. The wakeup
performance hit of the subsequent blocks can also be partially, if not totally, hidden
by deploying predictive look-ahead scheduling (Section 5.4.1).
Based on the above observation, it is sensible to partition the system state
space at the node and network boundary. By doing so, we essentially divide one big
problem into two smaller ones: first, treat the whole node as one entity and
decide when the whole node should go to sleep; then, once the node is on, decide on
the scheduling of the blocks within the node. We call the first problem network level
power management and the second problem node level power management.
In network level power management, the system is modeled as a network of nodes
communicating by sending packets. Each node has an inter-arrival and service model.
At this level, the power policy decides whether and when to turn off the whole node
after it has been idle. The policy uses network traffic information and targets the most
energy-consuming nodes. The detailed implementation of the node is abstracted away
and the only relevant issues are its service rate model and system parameters such as
node idle power consumption, node wakeup overhead, etc. In node level power
management, the system is modeled as Concurrent Extended Finite State Machines.
At this level, power scheduling is to determine the sequence and timing of block
wakeups.
These two problems have different levels of abstraction and require different
modeling and simulation environments. The first requires a network simulator and the
second requires a CFSM modeling environment. We have chosen OMNet++ as the
network simulator and the Stateflow-Simulink simulation environment to model the
node architecture.
OMNet++ is a C++ based object-oriented modular discrete event simulator. Its
source code is freely available to provide the programmers maximum freedom. In
OMNet++, we have modeled the complete PicoRadio protocol stack. Interfaces
between layers are cleanly defined such that one can modify the algorithm
implemented in one layer without affecting the other layers. At the Application layer,
the sensor node is programmed to be a sensor, a controller or both. Controllers
generate interest packets, and sensors generate data packets periodically in response
to the controller’s inquiry. The network layer implements a geographical routing
protocol. The MAC layer implements a simple carrier-sensing collision avoidance
protocol. The physical layer is modeled using a channel interference matrix. There is
no data aggregation. There is a dedicated power manager that allows us to explore
various power control policies. Since OMNet++ is written in C++, and the protocol
components are not modeled as concurrent FSMs, it is very fast. A simulation of
days-long activities in a 100-node network only takes minutes.
The Stateflow-Simulink integrated simulation environment has the capability to
model and implement the complete digital chip. Stateflow, whose model of
computation is CFSM, models the protocol; Simulink, whose model of computation
is data flow, models the signal-processing baseband. Moreover, there is
a direct path from Stateflow-Simulink to implementation. Simulink can be
translated to synthesizable VHDL, and Stateflow to C and VHDL. Using this chip
design flow, we are able to do our design specification, estimation and simulation in
Stateflow-Simulink and generate synthesizable hardware and executable software
code as the end product.
In the next chapter, we will discuss in detail the node level power management
algorithm. In Chapter 6, we will present the experimental results on various network
level power management algorithms.
5. Node Level Power Management
5.1 Hierarchical Node-level Power Management Architecture
As the design complexity of embedded systems increases, so does the complexity
of managing them. The intricacy of the scheduling problem grows considerably with
the number of system components and can soon become intractable. To better
manage complexity, we need to introduce hierarchy. Hierarchy hides low-level details
and enhances modularity and scalability. It also enables us to exploit locality in the
design and apply power control at various granularities. Figure 26 shows the
hierarchical architecture of the node-level power management framework.

Figure 26 Hierarchical node-level power management architecture

At every hierarchy level, the system is partitioned into multiple power domains. Power
domains are the basic units of power control and can be implemented through
separate power supply rails. Each power domain can be further divided into sub-
domains. A power scheduler resides at every hierarchy level to provide a power
management interface and power scheduling.
Recall from the profiling experiments conducted in Section 2.3.1.1 that timer
services were identified as critical operations (kernels) in protocol processing. There
are numerous timers running in the protocol stack: routing table timers in the
network layer and random backoff timers in the MAC layer. It is quite cumbersome to
support multiple timers in the system for mainly two reasons. Firstly, timers tend to
get out-of-sync over time and are very costly to re-sync. Secondly, timers have to be
running even when the rest of the block is idle and put to sleep. Seemingly the
sensible solution is to export all the local timer services to be handled by one global
timer. This global timer then becomes the only timer that has to be running
throughout the lifetime of the node. The global timer resides in and is maintained by the
PM. When a block needs a timer service, it sends a Request_Timer event to the PM to
register, and then it can choose to sleep. The PM will send back a Timer_expiry
command, which potentially can trigger a wake up, once the registered timer expires.
This global timer approach ensures a singular timing reference for the entire sensor
node. By relinquishing the timer functions to the PM, blocks have more opportunities
to go to sleep and save energy.
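The registration protocol just described might look as follows in C. This is a hypothetical sketch, not the thesis implementation: the data structures, the function names `request_timer` and `tick`, and the integer block ids are all assumptions for illustration. A block registers an expiry interval (the Request_Timer event) and may then sleep; on every tick the PM checks for expired alarms and returns the owner to be sent a Timer_expiry command.

```c
/* Hypothetical sketch of the PM's global timer service. */
#include <assert.h>

#define MAX_TIMERS 8

typedef struct {
    int  owner;       /* id of the registering block (assumption) */
    long expires_at;  /* absolute expiry time in ticks */
    int  active;
} Alarm;

static Alarm alarms[MAX_TIMERS];
static long  now = 0;

/* Request_Timer: register an alarm; returns the slot index, or -1 if full. */
int request_timer(int owner, long interval) {
    for (int i = 0; i < MAX_TIMERS; i++) {
        if (!alarms[i].active) {
            alarms[i].owner      = owner;
            alarms[i].expires_at = now + interval;
            alarms[i].active     = 1;
            return i;
        }
    }
    return -1;
}

/* Advance global time by one tick; return the owner id of an expired alarm
 * (i.e., the block to receive a Timer_expiry command), or -1 if none expired. */
int tick(void) {
    now++;
    for (int i = 0; i < MAX_TIMERS; i++) {
        if (alarms[i].active && alarms[i].expires_at <= now) {
            alarms[i].active = 0;
            return alarms[i].owner;
        }
    }
    return -1;
}
```

Because only this one timer structure must keep running, every other block is free to sleep between its registered expirations.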
5.2 Global Power Scheduler
Figure 27 sketches the structure of the top-level power manager. The PM consists
of the timer services unit and the power scheduler. The PM is the owner of and the
sole writer to the system power state table. It also gathers network-wide information
such as past inter-arrival traffic rates and statistics (energy, channel qualities etc)
concerning other nodes. Based on the gathered network information, the power state
table and performance/resource constraints, the PM makes scheduling decisions. The
goal is to minimize the overall power consumption while meeting the performance
and resource constraints.
The behavior of the power scheduler is modeled as a set of concurrent Finite State
Machines, each of which corresponds to a separate power domain. The power state
table lists all the power domains and their operating voltages. The power scheduler
changes the power states of the domains through interfacing with the power state
table. A write to the power state table alters the supply voltage to the power rail of the
domain. This implies the need to have separate supply rails for different power
domains. For example, changing the table entry “Application/Network” from 0 to 3
will transit the domain from the power-off state to active state operating under full
supply voltage of 3. The PM can also write an intermediate voltage between 0 and 3,
e.g. 1.5, to the entry. At 1.5 V, the domain runs at roughly half the speed and
consumes roughly a quarter of the energy per operation.

Figure 27 Top-level power manager: the timer services unit (holding MAC timer1, MAC timer2 and Network timer1), the power scheduler, and the power states table (Application/Network = 3, MAC sub-domain = 0, Physical = 0, Clock & Sys. Init. = 3)
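As a first-order sanity check on these figures, here is a minimal sketch under an assumed simplified model (switching energy per operation scaling as V squared and clock speed scaling linearly with V; real gate delay follows the alpha-power law more closely, so this is illustrative only):

```c
/* Simplified first-order voltage-scaling model (assumption, not the
 * thesis's characterization): E ~ V^2, speed ~ V. */
#include <assert.h>

/* relative energy per operation at voltage v, normalized to full supply */
double rel_energy(double v, double v_full) {
    return (v / v_full) * (v / v_full);
}

/* relative clock speed at voltage v, normalized to full supply */
double rel_speed(double v, double v_full) {
    return v / v_full;
}
```

Halving the supply from 3 V to 1.5 V then gives half the speed and a quarter of the per-operation energy, matching the rough numbers quoted above.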
The top level PM also provides timer services and manages the system time
wheel. If a domain needs a timer service, it sends a Request_Timer_Service event to
the PM, specifying after what time interval the timer should expire. Upon receiving
the request, the PM starts an alarm with the specified expiry time. Once the alarm
expires, the PM notifies the domain with a Timer_Expiry command. Figure 27
shows three timers registered by the MAC and the Network layer respectively.
Before we proceed further, we need to define the various power states of system
blocks.
5.3 Power States Transitions for System Blocks
The power states for system blocks are Awake and Sleep (see Figure 28).

Figure 28 Power states for system blocks

In the Awake state, a block can respond to incoming events. There are two sub-states in
Awake, Active and Idle. When a block is in Active, it is actively processing incoming
events; when it has finished processing all events, it goes into Idle. After being
Idle for a certain amount of time, the block can choose to go to Sleep. In the Sleep state,
the block cannot respond to incoming events; it requires an external wakeup signal
to transit to the Awake state. The external wakeup signals are issued by the power
manager. Sleep states consume very little or no power. Awake states can run at
variable frequencies and voltages by implementing dynamic voltage scaling (DVS).
DVS strives to minimize energy consumption by matching module performance and
energy expenditure with workload.
There are two ways to implement the Sleep states, clock gating or power supply
rail gating. Clock gating gates the clock signal to the block; consequently there is no
dynamic power consumption, only static power resulting from leakage current. For
very low duty cycle applications such as sensor networks, even when the clock is
gated, a considerable amount of leakage energy is dissipated. Gating the supply rail,
on the other hand, completely shuts down the block and causes no power
consumption. Clearly, the latter is much more efficient and should be supported by
all the blocks in the system. Unlike the logic blocks, the memory blocks cannot be
totally turned off, since the state information has to be preserved. Instead, the
memory blocks are put into a state preserving, low power drowsy mode. In this
drowsy mode, neither read nor write operations can be performed.
The power manager has the exclusive right to initiate all state transitions. It turns
on a block by ramping up its supply rail and restoring its memory from drowsy to
active, and shuts down a block by turning off its supply rail and putting its memory
into the drowsy mode. The power state transitions have associated performance and
energy costs, which are called the power control overheads. The magnitude of these
costs is determined by system implementation parameters such as the clock re-sync
overhead and the drowsy memory restoration overhead.
Since all rights to power state changes are relinquished to the PM, a block cannot
change its own power states. When it wants to go to sleep, it has to send a
Request_to_Sleep power control event to the PM. If the PM accepts the request, it
puts the block to Sleep; otherwise the block stays awake.
5.4 Node Level Power Scheduling
At the node level, power scheduling makes decisions on the sequence and exact
timing of the block wakeups and sleeps. The goal is to minimize the overall power
consumption while meeting the performance and resource constraints. Power
scheduling is implemented through power control events and commands. Power
control events are sent by distributed power domain blocks to the centralized power
scheduler: the event Request_BlockA_On for wishing to access BlockA and the event
Request_to_Sleep for wanting to sleep. Power control commands are issued by the
power scheduler to the domain blocks and may result in power state transitions:
Wakeup_BlockA to transit BlockA from sleep to active, Sleep_Request_Denied to
keep the requesting block in idle and Sleep_Request_Granted to transit the block
from idle to sleep. All the power control events and commands can carry a token that
specifies timing information. For example, Request_BlockA_On(t) means block A
will be accessed in time t.
Power scheduling is a rather complex issue. There are both the local and global
aspects to it. Locally, blocks have to decide when to send in the wakeup and sleep
requests. A block may decide to send in a sleep request right after it becomes idle or
wait for a while. Globally, the power scheduler has to decide on what power control
commands to issue and when. In both cases, decisions are made based on the
expected event inter-arrival rates, gathered statistics and implementation specific
parameters such as wakeup overhead and idle state energy dissipation.
We favor a power scheduling approach that is more global in nature: the global
power scheduler makes most of the intelligent decisions. In this scheme, a local block
sends in a wakeup request as soon as it realizes another block is to be accessed, and
requests to sleep as soon as it becomes idle. The power scheduler then determines
when to wake up or put to sleep the requested block. The justification for the global
approach is that the centralized scheduler has a much grander vision than the
individual domain blocks and can potentially make better decisions. The drawback is
that the global scheduler may be overloaded by decisions that could be handled
locally. The study of localized power control is left to future work.
In a reactive system, power scheduling naturally follows the event flow. This is
best illustrated with an example. Yet again, we use the PicoRadio sensor node
platform, as shown in Figure 29. Assume every behavior block in the figure is
mapped to a separate power domain. In the initial state, the sensor node is in deep
sleep, meaning that except for the reactive radio, the top-level PM and the drowsy
memory, the system is completely powered down.
1. The reactive radio senses an incoming packet from the network; the RF (TX/RX) is
turned on to receive the data stream.
2. The digital processor initializes (the digital clock is re-synchronized, global states
are restored).
3. The PM wakes up the Baseband; the Baseband processes data from the RF (TX/RX).
4. The Baseband needs to send data to the MAC; it sends a Request_MAC_On event to
the PM.
5. The PM looks up its power state table and notices the MAC is sleeping; it wakes up
the MAC and modifies the corresponding power state table entry.
6. The Baseband becomes Idle; it sends a Request_to_Sleep event to the PM.
Figure 29 Example of power scheduling: the PicoNode reactive digital network processor with its power domains (App/UI, Transport, Network, DLL (MAC), Baseband, RF (TX/RX), Locationing, sensor/actuator interface), the power scheduler and the reactive radio
7. The PM grants the Request_to_Sleep and puts the Baseband to sleep.
8. The MAC processes data from the Baseband. It needs Locationing data and sends the
PM a Request_Locationing_On event.
9. The PM turns on the Locationing block.
10. The MAC processes the packet. It realizes the packet has to go to the network layer,
and sends a Request_Network_On event to the PM.
11. The PM turns on the Network.
12. The MAC becomes idle and sends a Request_to_Sleep event. At this point, the PM
evaluates the situation and may or may not grant the request. If the MAC will be
accessed later to forward the packet, and it is more expensive to put it to sleep and
then wake it up than to keep it idle, the PM denies the sleep request. This is called
predictive look-ahead scheduling. We introduce the concepts of predictive
scheduling and its implementation in the next section.
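Step 5 above, the PM consulting its power state table on a Request_MAC_On, can be sketched in C. The domain names, the 0-to-3 voltage encoding (following the table of Figure 27) and the function name are assumptions for illustration, not the thesis implementation:

```c
/* Hypothetical sketch: the PM's power state table as an array of voltages,
 * one entry per domain. Writing full supply to an entry models ramping the
 * domain's rail (a wakeup). */
#include <assert.h>

enum { DOM_APPNET, DOM_MAC, DOM_PHY, NUM_DOMAINS };
#define V_FULL 3
#define V_OFF  0

static int voltage[NUM_DOMAINS] = { V_FULL, V_OFF, V_OFF };

/* Handle a Request_<domain>_On event: returns 1 if a Wakeup command was
 * issued (table entry written), 0 if the domain was already on. */
int handle_wakeup_request(int domain) {
    if (voltage[domain] == V_OFF) {
        voltage[domain] = V_FULL;  /* write to the power state table */
        return 1;
    }
    return 0;
}
```

A second request for an already-awake domain is a no-op, mirroring the table-lookup-before-wakeup behavior in the walk-through.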
5.4.1 Predictive Look-Ahead Scheduling
Predictive scheduling determines the power state transitions of a domain
based on predictions of its future accesses. That is, following the global event flow in
the system, blocks are turned on before they are accessed and kept idle if they will be
accessed later. Predictive scheduling decisions are global in nature and may
contradict local interests. For example, in Step 12 of the power scheduling
example, the MAC’s request to sleep is overridden by the PM due to more global
concerns. Predictive scheduling is typically used in scenarios where the performance
and/or energy penalty associated with wake-ups is significant.
1. Performance enhancement through latency hiding. A substantial performance
overhead for wake-ups may cause excessive delays in the system. To reduce this
type of delay, the PM can make predictive decisions and wake up a block
beforehand to make sure it is ready when needed. The exact predictive wakeup
timing, or the “look-ahead” window size, has to be carefully determined. If a
block is woken up too early, it will stay in Idle before needed and waste energy.
On the other hand, if the block is woken up too late, the latency can only be
partially hidden and there will be a performance penalty. Ideally the look-ahead
time window of a block should be equal to its wakeup performance overhead. In
practice, however, this is difficult to accomplish, as the PM may not receive the
relevant control events in time to cover the wakeup latency.
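The look-ahead timing rule just described can be made concrete with a small sketch (hypothetical helper names; times in clock cycles): the PM schedules the wakeup at t minus L, where L is the block's wakeup latency, and any shortfall of t below L remains exposed as delay.

```c
/* Illustrative look-ahead window arithmetic (assumed helper names). */
#include <assert.h>

/* time at which the PM should issue the Wakeup command (never negative) */
long wakeup_time(long t, long latency) {
    return (t > latency) ? (t - latency) : 0;
}

/* portion of the wakeup latency that cannot be hidden */
long exposed_latency(long t, long latency) {
    return (t < latency) ? (latency - t) : 0;
}
```

Using the PicoNode numbers quoted earlier: a 3-cycle MAC wakeup announced 100 cycles ahead is fully hidden, whereas a 300-cycle clock re-synchronization announced 100 cycles ahead still leaves 200 cycles of visible delay.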
Even for systems where performance is not a concern and some amount of
delay is tolerated, this kind of “Just-In-Time” wake-up scheme can be used to
promote power saving. Given the asynchronous event-driven nature of the system,
there are typically a large number of data queues connecting the various system
modules. Queues are turned on and off by the sender and receiver blocks they
connect. When the sender needs to write to a queue, it turns on the queue and
populates it with data. The receiver then pulls data from the queue and turns it
off when the queue is empty. When the receiver has a large wakeup
latency, energy is wasted because the queue has to remain on for an extended
amount of time. Depending on the size of the queue and the wakeup latency of the
receiver, this wasted energy could be substantial.
2. Energy saving by the elimination of excessive sleeps and wakeups. If the
energy penalty of waking up the block is high and the block has a good chance of
being accessed in the future, the PM may decide to keep it alive even when it
wants to sleep. The decision is made based on the evaluation of wakeup energy
overheads, idle power dissipation, the predicted future access time and the future
access probability, etc. If, upon receiving a sleep request from a block, the PM
estimates that the block will be accessed later and should remain idle, the sleep request
is denied. This is the situation where global predictive scheduling decisions
override local requests.
5.4.2 Implementation of Predictive Scheduling
Predictive wakeups are implemented by encapsulating timing information in
power control events. A token t is carried in the power control event to indicate the
look-ahead window size. Event Request_BlockA_On(t) means block A will be
accessed in time t.
Scenario 1 (latency hiding) is relatively straightforward to implement. As soon as
a block (e.g. B) realizes it needs to access another block (e.g. A), it sends out the event
Request_BlockA_On(t) to the PM, where t is the estimated process delay in B after
which the necessary input data to A is ready: the delay it takes for B to generate
proper data and put it into the queue connecting B to A. The value of the process
delay is implementation dependent and can be pre-characterized. Once the PM
receives the event Request_BlockA_On(t), it will schedule a wakeup_A command in
time (t-wakeup_latency_A). Ideally, A should wake up after exactly time t. However,
if t < wakeup_latency_A, there will be a delay in A’s wakeup and its wakeup latency
cannot be fully hidden. To maximize latency hiding, the block (B in this case) that
sends in the predictive scheduling event should do so as soon as it knows another
block will be accessed.
Figure 30 shows an example of the network layer trying to wake up the MAC. Starting
from the Init state, the network receives a packet from the MAC layer. The packet can
be of type DATA or INTEREST. The network then processes the packet and decides
whether the packet is to be forwarded to another sensor node. If forwarding is
necessary, it sends the event Request_MAC_On(process_delay) to the PM, where
the actual value of the parameter process_delay depends on the type of the packet.
Process_delay can be obtained by pre-characterizing the processing speed of the
network module. If an accurate measurement cannot be acquired, an upper bound of
the delay should be used to prevent the MAC from staying in Idle after being woken up.
The penalty of a pessimistic estimate of process_delay is that the queues connecting
network and MAC may have to be on longer than necessary.

Figure 30 Network layer example for predictive scheduling: states Init, Parse, Check Table, InterestProc, Table Update and Data; forwarding transitions emit Request_MAC_On(60) or Request_MAC_On(50) depending on the packet type
Scenario 2 (energy saving) is more complicated to implement, due to the
difficulty of estimating both future access time and activation probability. As shown
in Figure 29, a system block is usually “connected” to and hence can be activated by
multiple neighboring blocks. This kind of intrinsic concurrency makes it non-trivial to
predict the next activation time of the block. For example, to estimate the next access
time of MAC, we will have to consider the cases of receiving events from Network,
Baseband and Locationing blocks.
A simplified, more conservative approach is to consider only the scenarios in
which the block will be woken up for certain. The way to accomplish this is to have
the PM check for any pending predictive wakeups on a block when it receives a
Request_to_Sleep from it. If there is a pending predictive wakeup, the PM makes the
shut-down decision based on an evaluation of the wakeup overheads and timing
information.
The exact algorithm for predictive wakeup scheduling is presented in Figure 31,
described in C pseudo code format.
//Event-handler: PM receives a request to wake up blockB in time t
On_event(Request_BlockB_On(t)) {
    if (t < t_wakeupLatency_blockB)
        WakeUp_blockB();    //wakeup blockB right now
    else
        AddPendingWakeupList_BlockB(t - t_wakeupLatency_blockB);
    /*************************
    The PendingWakeupList is a list of pending wakeup requests.
    It is implemented as a list of alarms that go off at the various pending
    wakeup times. A Wakeup_blockB command is generated whenever an alarm
    goes off.
    ****************************/
}

//Event-handler: PM receives a sleep request from blockB
On_event(Request_to_Sleep) {
    //check if there are any pending wakeup requests
    if (Empty(PendingWakeupList_BlockB))
        //no pending requests, sleep request granted
        Output_command(Sleep_Request_Granted);
    else //handle pending requests
    {
        //t_predicted_blockB is the smallest alarm value in the PendingWakeupList
        t_predicted_blockB = Min(PendingWakeupList_BlockB);
        //if the pending wakeup time is less than the wakeup latency, do not sleep
        if (t_predicted_blockB < t_wakeupLatency_blockB)
            Output_command(Sleep_Request_Denied);
        //if the idle energy consumption is less than the wakeup overhead, do not sleep
        else if (t_predicted_blockB * idle_power_blockB < Energy_wakeup_blockB)
            Output_command(Sleep_Request_Denied);
        else
            Output_command(Sleep_Request_Granted);
    }
}

Figure 31 Predictive wakeup algorithm in C pseudo code

Predictive wakeups do not come free. Even the current simple scheme requires
the maintenance of a pending wakeup alarms list for every domain. In a low activity
type of system like PicoRadio, typically the lengths of these lists are no more than
one; hence the maintenance tasks are rather manageable in general.
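For reference, the sleep-decision test at the heart of Figure 31 can be written as a directly compilable C function; the units and parameter values are illustrative assumptions, not characterized PicoNode numbers:

```c
/* Sleep-decision test of the predictive wakeup scheme: deny the sleep
 * request if the pending wakeup is sooner than the wakeup latency, or if
 * idling until the predicted access costs less energy than a full
 * sleep/wakeup cycle. Returns 1 to grant, 0 to deny. */
#include <assert.h>

int grant_sleep(long t_predicted, long t_wakeup_latency,
                double idle_power, double wakeup_energy) {
    if (t_predicted < t_wakeup_latency)
        return 0;                       /* wakeup already imminent */
    if (t_predicted * idle_power < wakeup_energy)
        return 0;                       /* idling is cheaper than cycling */
    return 1;
}
```

With an assumed idle power of 1 unit and a wakeup overhead of 100 units, a block predicted to be accessed within 100 time units is kept idle, while one with a more distant predicted access is allowed to sleep.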
The implementation of effective predictive scheduling is a non-trivial task. The
inherent concurrency in the system makes the precise prediction of event flow
difficult. We presented a preliminary conservative scheduling scheme that favors
functional correctness and implementation simplicity. Care was taken to ensure the
proper timing sequence of power control signals such that the system does not end up
in deadlock. More refined and aggressive schemes will be studied in the future.
5.4.3 Power Scheduling Without Predictive Wakeups
If the system has negligible wakeup overheads and performance is not an
overwhelming concern, predictive wakeup schemes can be left out in favor of much
simpler wakeup mechanisms. Without predictive wakeups, the sleep request handling
is very straightforward: sleep requests are always granted by the PM. Wakeup
requests are also greatly simplified. The token t in event Request_BlockA_On(t) takes the
constant value of zero. The wakeup request timing now becomes: if a block (e.g. B)
wants to access another block (e.g. A), it issues the event Request_BlockA_On(0) to
the PM right after it has populated the queue connecting to A with the necessary
data. Upon receiving Request_BlockA_On(0), the PM issues the command
Wakeup_BlockA instantaneously. In this robust approach, A is guaranteed to go into
the active state when woken up. Otherwise, A might wake up too early, find the
incoming queue not ready, and be forced to wait in Idle. In this unfortunate scenario,
if A decides to go to sleep after staying Idle for a period of time (sleep requests are
always granted), a deadlock may occur.
5.5 Incorporating Dynamic Voltage Scheduling (DVS)
So far the scheduling algorithm we have discussed only involves turning the
domains “on” and “off”. It should be noted that the “on” state can have multiple
sub-states corresponding to various operating voltages. The PM can control the domains’
operating voltages and perform dynamic voltage scaling. While incorporating
dynamic voltage scaling into the power management framework promises more
performance and energy improvement, it also adds a level of complexity into the
already complicated power scheduling problem.
Now let us evaluate the potential benefits of applying dynamic voltage scaling
(DVS) to PicoRadio type of applications. DVS is widely applied to processing units
(e.g., CPUs) that cannot be turned off completely, either due to the tremendous
wakeup overheads or the need for it to continuously process incoming requests within
a tight time constraint. On the other hand, the PicoNode architecture is specially
designed to have a finer granularity of power control and small wakeup overheads.
The PicoRadio application also has much looser time constraints and is designed to be
robust enough to tolerate a certain degree of packet loss. In addition, for this type of
low-activity, low-duty-cycle application, leakage current is a major concern, and
running at a lower voltage does not reduce the leakage energy consumption.
At first glance, DVS may not be as effective on the PicoRadio system as on
traditional CPU architectures. However, it adds levels of refinement to the
scheduling problem and should not be left unexplored. Our current version of the
power manager supports DVS, but the specific scheduling techniques for DVS will
not be studied in this thesis and are left for future work.
5.6 The Stateflow-Simulink Estimation-Simulation Framework
The entire reactive behavior of the PicoNode, as shown in Figure 25, is modeled in
the Stateflow-Simulink simulation environment. This environment allows us to
explore different power control algorithms under different scenarios, such as system
architectures and inter-arrival and service distributions. Power and performance
estimation is accomplished by back-annotations in the state diagrams.
Figure 32 shows the top-level schematics of the PicoNode Protocol stacks. Figure
33 shows the MAC sub-domain, which has its own PM. Figure 35 shows the network
layer; Figure 34 shows the top level PM.
Figure 32 Top level schematics for PicoNode protocol stacks
Figure 33 Schematics of the MAC domain. There is a second-level PM, which is in charge of the receive and transmit sub-domains
Figure 34 Top-level PM with timer services and the power scheduler.
Figure 35 Network layer consists of four concurrent FSMs: MAC packet processing, Application packet processing, sleep request and Timer functions.
6. Network Level Power Management
6.1 Traffic Considerations
To devise the appropriate network level power management policy, we need to
understand the nature of the network traffic. While there is a large body of
literature on Ethernet and World Wide Web traffic [39][40][41] and a few studies on
high data rate wireless LANs [42][43], the author is not aware of any studies on low
activity, low data rate sensor networks. Most studies on power management for
sensor networks were conducted under the assumption that the inter-arrival traffic
follows some well-characterized and well-studied distribution, such as exponential
distribution. As we will see shortly, this is a gross over-simplification. Very little
has been published on either simulated or measured traffic traces in a typical
wireless sensor network.
We have conducted some experiments in our OMNet++ model to obtain accurate
information on network traffic. The information we obtain should answer the
following questions: Does sensor network traffic fall into any known distributions? Is
the traffic stream correlated? What protocol or system properties affect the network
traffic? How different are the traffic patterns seen by different nodes in the network?
To acquire meaningful and intuitive results from the experiment, we constructed
a uniform 10x10 grid network of 100 nodes (see Figure 36). By controlling the
transmission power, we can alter the number of neighbors a node observes. Twelve
controller nodes are “sprinkled” among the sensor nodes. Intuitively speaking,
Figure 36 10x10 grid of sensor nodes. The green nodes are sensors; the red ones are controllers.
different nodes should see different traffic distributions, depending on the type of the
node and its position in the network. Controller nodes, due to their periodic interest
packet generation and the arrival of the requested data packets, see a higher rate of
traffic, and nodes situated close to the center of the network see more forwarding
traffic. In comparison, sensor nodes located at the edge of the network see much
less traffic.
We use the following parameters for our experiments. A node can only see its
immediate neighbors, so the number of neighbors per node ranges from two at the
edge of the network to four at the center. In the application layer, the sensor data
generation period is 120 seconds; the data generation duration, which is the time
a sensor generates data in response to a particular interest, is 1000 seconds; and
the interest lifetime is 1000 seconds. In the network layer, an interest in the interest
cache (routing table) expires after 1000 seconds and the routing table is updated every
120 seconds. We simulated the grid network for 180000 seconds, or 50
hours, and collected inter-arrival time stamps for various nodes.
As we expected, there is a vast discrepancy among the nodes in the
total number of packets received. Sensor Node 9 sits at the edge of the network
and has only two neighbors. It is the least busy node in the network and receives
merely 46 packets in total, whereas controller Node 72, the busiest node in the network,
receives 2841 packets.
To study the inter-arrival time distribution, the Cumulative Distribution Function
(CDF) of the inter-arrival times is plotted. Figure 37 and Figure 38 (a zoom-in of
Figure 37) show the CDF plots for Node 9. The CDF curve is clearly not exponential
and does not fall into any known distributions. Only 20% of the packet arrivals are
within 0.5 seconds, and nearly 40% of the packet arrivals are longer than 2000 seconds.
The “steps” in the figures occur at multiples of the sensor data generation period
(120 seconds) and the interest generation period (1000 seconds). Inspecting the traffic
stream, we notice it is somewhat correlated: sequences of long inter-arrival times
(> 120 seconds) interleave with several bursts of short inter-arrival times (< 0.5
seconds), while the long inter-arrival times dominate. The bursts of short inter-arrival
times vary in length, ranging from two to five.
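The empirical CDF used in these plots is straightforward to compute from a trace of arrival time stamps. A minimal sketch in Python (the function names and the toy trace below are ours, purely for illustration):

```python
def inter_arrival_times(stamps):
    """Convert sorted packet arrival time stamps into inter-arrival times."""
    return [b - a for a, b in zip(stamps, stamps[1:])]

def empirical_cdf(samples):
    """Return (x, F(x)) pairs: the fraction of samples <= x for each sample."""
    xs = sorted(samples)
    n = len(xs)
    return [(x, (i + 1) / n) for i, x in enumerate(xs)]

# Toy trace: a burst of three closely spaced packets, then a 120 s silence.
gaps = inter_arrival_times([0.0, 0.1, 0.2, 0.3, 120.3])
cdf = empirical_cdf(gaps)
# 75% of the gaps are around 0.1 s; the remaining 25% is the 120 s gap,
# the same burst-then-silence shape observed in the simulated traces.
```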
Figure 37 CDF for sensor Node 9. The curve in red is that of an exponential distribution. It is quite obvious the CDF for Node 9 is NOT exponential.
Figure 38 Zoom in of the CDF plot in Figure 37. The list at the right shows the inter-arrival stream. Short inter-arrival sequences (<0.5 seconds) are shown in either red or blue.
[Inter-arrival time stamps for Node 9, in seconds: long gaps at multiples of 120 seconds interleaved with short bursts.]
Controller Node 72, the busiest node in the network, sees very different traffic
patterns from Node 9. Figure 39 shows the CDF plot and Figure 40 zooms in around
the x-axis (0 < t < 5s). Out of the total 2841 packets it receives, more than 60% have
relatively short inter-arrival times (< 0.5 seconds), including 50% or more with inter-
arrival times less than 0.1 seconds. Very few packets (< 5%) have inter-arrival
times greater than 120 seconds. From Figure 39, we notice the major “jumps”
in the curve occur at 120 second intervals, which is the sensor generation period. In
this particular simulation, sensors generate data at 40 second offsets to each other, so
we also notice minor “jumps” at multiples of 40 seconds. As we zoom in further
on Figure 39 and inspect the CDF curve from t = 0 to 0.2 seconds (Figure 41), we
notice “jumps” at intervals of 0.0176 seconds. It turns out 0.0176 seconds is the packet
transmission duration in our simulation. Since there is no data aggregation
implemented in the current MAC, once the MAC takes control of the shared channel, it
will not release it until all the accumulated packets in the transmission queue are sent.
Two modifications to the MAC protocol would make the “jumps” less prominent.
The first is data aggregation, which combines several packets sent to the same
destination into one. The second is to introduce random backoffs after the
transmission of each packet. Clearly, protocol algorithms have an impact on network
traffic. However, independent of the particular algorithm chosen, there will always
be some periodic components related to application parameters such as the
sensor data generation period and system parameters such as the packet transmission
duration.
Compared to Node 9, Node 72 definitely sees a more correlated incoming packet
stream. There are still sequences of long inter-arrival times (> 120 seconds)
interleaved with sequences of short inter-arrival times (< 0.5 seconds). But in this case,
the short inter-arrival sequences dominate. The lengths of the short inter-
arrival sequences are also much greater on average compared to those of Node 9.
Other nodes have incoming traffic distributions that fall somewhere between
those of Node 9 and Node 72. The general traffic distribution can be described as
long inter-arrival sequences interleaved with short inter-arrival
sequences. The length and the frequency of occurrence of the sequences in the traffic
stream depend on the type and the position of the node. The busier the node, the
longer and the more prevalent the short inter-arrival sequences compared to the
long inter-arrival sequences.
[Inter-arrival time stamps for Node 72, in seconds: short bursts dominate the stream.]
Figure 39 CDF for controller Node 72. The distribution is still not exponential.
Figure 40 Zoom-in of the CDF plot in Figure 39. The list at the right shows the inter-arrival stream. Short inter-arrival sequences (<0.5 seconds) are shown in either red or blue.
6.2 Constant Threshold Algorithm
Let us first evaluate the simplest power management scheme, the constant
threshold algorithm. In this algorithm, the processing device is shut off after its idle
time exceeds a constant threshold. The optimal value of the threshold depends on
system power metrics such as the break-even time Teven and on the incoming traffic
distribution. This scheme is widely used in laptop computers. Most publications on
power management dismiss the constant sleep threshold algorithm as too simplistic
and crude. We would like to find out how well this algorithm works on sensor
networks, that is, how close it comes to the optimal power control algorithm.
Figure 41 Further zoom-in of the CDF plot in Figure 39. Major “jumps” in the curve occur at multiples of 0.0176 seconds, the packet transmission time.
Since different network nodes see different traffic, the constant thresholds should
likewise take on different values. As a result, all the nodes have to be individually
pre-characterized to obtain their “optimal” threshold values. Note that these values
are pre-determined and do not change over time.
To obtain the value of the optimal threshold, we “sweep” the energy dissipation
related to power management over a range of threshold values to generate a
“threshold curve”. The minimum of the curve is the optimal threshold point: the sleep
threshold value at which the minimum amount of energy is consumed. It should be
emphasized that only the energy consumed in the idle and sleep states and during power
state transitions is counted. Energy dissipated actively processing packets is excluded,
as it is irrelevant to power management.
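The sweep can be sketched as follows, under a deliberately simplified energy model: unit idle power, zero sleep power, and a wakeup overhead of Teven times the idle power, matching the definition of the break-even time. The function names and the toy numbers are ours, for illustration only:

```python
def pm_energy(gaps, s_th, t_even, p_idle=1.0, p_sleep=0.0):
    """Non-active energy consumed under a constant sleep threshold s_th.

    For each idle gap: if the next packet arrives before the threshold
    expires, the node stays idle the whole time; otherwise it idles for
    s_th, sleeps for the remainder, then pays the wakeup overhead
    t_even * p_idle on the next arrival.
    """
    energy = 0.0
    for gap in gaps:
        if gap <= s_th:                  # arrival beats the threshold
            energy += p_idle * gap
        else:                            # idle, sleep, then wake up
            energy += p_idle * s_th
            energy += p_sleep * (gap - s_th)
            energy += t_even * p_idle    # wakeup overhead
    return energy

def sweep_thresholds(gaps, t_even, candidates):
    """Return (best_threshold, best_energy) over candidate thresholds."""
    curve = [(s, pm_energy(gaps, s, t_even)) for s in candidates]
    return min(curve, key=lambda p: p[1])

# Toy trace: a burst of nine 0.01 s gaps followed by one 100 s gap.
best_s, best_e = sweep_thresholds([0.01] * 9 + [100.0], 0.5,
                                  [0.0, 0.05, 1.0])
```

On this toy trace a small nonzero threshold beats immediate shutdown, because immediate shutdown pays the wakeup overhead on every packet of the burst, mirroring the behavior observed for the busier nodes.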
Figure 42 shows the “sleep threshold plots” for sensor Node 9. Since we are
interested in how Teven affects the values of the optimal sleep threshold (Sth), three
curves are plotted, corresponding to Teven values of 0.5, 0.2 and 0.05. Recall that
Teven is defined as the ratio of the wakeup energy overhead to the idle power dissipation.
It is the length of time such that if the system is idle for time Teven, the energy
expended equals the energy required for waking up the system. Teven is a
good indication of the wakeup overhead: if the system idle power is fixed, the greater
Teven is, the greater the wakeup overhead and overall energy expenditure. For the low
activity sensor Node 9, the optimal sleep threshold is zero for all three values of
Teven, which means the node should shut down immediately once idle.
Figure 42 Energy consumed versus sleep thresholds for sensor Node 9. The three curves correspond to break-even times of 0.5, 0.2 and 0.05.
Now let us study a more typical sensor node in the network, Node 48. Node 48
has four neighbors, one of which is a controller. As a result, it is much more active
than Node 9 and receives a total of 545 packets during the simulation. Figure 43 shows
its sleep threshold plots. The optimal sleep thresholds are still zero for Teven equal to
0.05 and 0.2. However, as Teven is increased to 0.5, the optimal sleep threshold
increases to 0.0176 seconds.
Because the expected inter-arrival times are shorter for busier nodes, it pays off
for them to stay idle longer compared to the less busy nodes. This in turn implies
Figure 43 Energy consumed versus sleep thresholds for sensor Node 48. The three curves correspond to break-even times of 0.5, 0.2 and 0.05.
that the optimal value of the sleep threshold should increase with node activity.
Examination of the sleep threshold plots of controller Node 72 confirms this
supposition. The optimal sleep threshold is zero for Teven = 0.05, and is 0.036 and 0.09
respectively for Teven = 0.2 and 0.5.
However, our supposition is proven wrong when we examine the sleep threshold
plots of controller Node 55. Node 55 receives only 2295 packets, compared to the
2841 packets received by Node 72; it is less active than Node 72. However, when Teven
= 0.5, its optimal threshold is 0.125s, which is greater than the 0.09s of Node 72. In
fact, the corresponding energy expenditure of Node 55 at this threshold value is
32449, which is also greater than the 32232 of Node 72.
Figure 44 Energy consumed versus sleep thresholds for controller Node 55. The three curves correspond to break-even times of 0.5, 0.2 and 0.05.
The explanation for this discrepancy lies in the traffic distributions. We need to
find out which values of packet inter-arrival time cause the most energy dissipation,
and which nodes see a higher percentage of these “worst” inter-arrival times. For the
constant threshold scheme, the “worst” values fall between the sleep threshold and
Teven, because in these inter-arrival scenarios the system stays awake for the duration
of the sleep threshold and then goes to sleep, only to be woken up again soon after.
The worst-case scenario occurs when the inter-arrival time equals the sleep threshold:
the node goes to sleep and wakes up right after. In fact, inter-arrival times around Teven
are “bad” for any power management algorithm, because either the decision to stay
awake or the decision to go to sleep dissipates energy roughly equal to the wakeup
overhead.
Figure 45 Energy consumed versus sleep thresholds for controller Node 72. The three curves correspond to break-even times of 0.5, 0.2 and 0.05.
Figure 46 and Figure 47 are the CDF plots of Node 55 and Node 72 respectively.
Inspecting the plots, we notice the following facts: for Node 55, about 15% of the
inter-arrival times fall between 0.125 (the optimal sleep threshold) and 0.5 (Teven), while
for Node 72 the number is only 10%. Since Node 55 has 5% more packets falling
into this undesirable scenario than Node 72, it consumes more energy despite
being almost 20% less active than Node 72.
Figure 46 CDF plot for Node 55
Figure 47 Zoom in of the CDF plot in Figure 46.
Our experimental results can be summarized as follows: in general, the
optimal constant sleep threshold increases with wake-up overhead and traffic
activity. However, nodes with a higher percentage of inter-arrival times that fall
between the sleep threshold and Teven may have an exceptionally high constant
threshold.
We shall now proceed to calculate the optimality of the constant threshold
algorithm. This is accomplished by comparing the results of the constant threshold
scheme to those of the optimal algorithm. The optimal algorithm is assumed to know
the sequence of arrival requests in advance; in essence, it can make decisions based
on knowledge of the future. The optimal algorithm shuts down the system
immediately if the next idle period is greater than Teven, and keeps the system idle if it
is less than Teven. Evidently the results of the optimal algorithm are unobtainable in
reality, and only serve as upper bounds for any practical implementation.
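This clairvoyant bound can be sketched in a few lines, under a simplified energy model (unit idle power, zero sleep power, wakeup overhead of Teven times the idle power, which is how Teven is defined; the names are ours):

```python
def oracle_energy(gaps, t_even, p_idle=1.0, p_sleep=0.0):
    """Lower bound on non-active energy: the oracle knows each idle
    period in advance. It sleeps immediately (paying the wakeup
    overhead t_even * p_idle) whenever the gap exceeds Teven, and
    stays idle for the whole gap otherwise.
    """
    energy = 0.0
    for gap in gaps:
        if gap > t_even:
            energy += p_sleep * gap + t_even * p_idle
        else:
            energy += p_idle * gap
    return energy
```

Dividing the oracle's energy by a practical algorithm's energy on the same trace gives the "optimality" percentages reported in the tables of this chapter.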
Node             Teven=0.05   Teven=0.2   Teven=0.5
9  (Sensor)         99.3%        98.8%       98.7%
48 (Sensor)         93.9%        86.3%       85.7%
55 (Controller)     86.8%        72.4%       72.6%
72 (Controller)     75.3%        74.9%       73.5%
Table 7 Optimality of the constant threshold algorithm. The worst-case results are highlighted in red.
Table 7 lists the performance of the constant threshold algorithm as compared to
the optimal algorithm for Nodes 72, 55, 48, and 9. The results are listed for three
different values of Teven. The general observation is that it performs better for nodes
with low activity and small Teven. For Node 9, it is extremely close to optimal for
all three values of Teven. Even for a typical sensor node like Node 48, it is at least 86%
optimal. Nevertheless, we need to keep in mind that the worst performing nodes are
the most vital to the interest of the network, both due to their central locations and
their role as controllers. If these critical nodes run out of energy and die early, the
overall quality of service of the network will be adversely affected. The
worst performer when Teven = 0.05 is Node 72, the busiest node in the network;
as Teven increases to 0.2 and 0.5, the worst performer becomes Node 55.
The reason, as explained in the previous paragraphs, is that Node 55 has a
higher percentage of the “worst” inter-arrival times. As the overhead for power
management (Teven) increases, the performance of the constant threshold algorithm
degrades further on Node 55 than on Node 72. Interestingly, even with
the optimal algorithm, Node 55 consumes more energy than Node 72 for all three
values of Teven. Again, we need to remind ourselves that only the non-active energy
is considered; the overall energy dissipation is the sum of the non-active and active
energy. Exactly which node consumes the most energy overall depends on the active
energy the nodes spend processing packets. Since Node 72 has to process more
packets, it consumes more “active” energy than Node 55, which may or may not offset
the comparably smaller energy consumed during idle and power state transitions.
Let us evaluate the pros and cons of the simple constant threshold algorithm. The
pros are twofold: it is very simple to implement, and it yields reasonably good results
even for the worst performing nodes. The cons of the algorithm are quite apparent: a
straightforward implementation implies that every node has to be
pre-characterized to obtain its individualized threshold value. This is quite
cumbersome considering the large number of nodes in the network. In addition,
since the threshold value is fixed, the algorithm is non-adaptive to changes in channel
quality (time-varying packet loss and bit-error rates) as well as to neighborhood
configuration changes. The former problem can be solved by identifying and
characterizing only the busiest and most critical nodes in the network, and using zero
as the threshold value for all the other nodes. This modified algorithm is suitable
for networks with known topology, a relatively stable environment and little mobility.
For networks with a more dynamic environment, a topology that cannot
be obtained a priori, or significant mobility, we would need an adaptive
algorithm that modifies the threshold based on traffic changes. In the next sections,
we will investigate several known adaptive algorithms and some variations.
6.3 Sinha & Chandrakasan
We briefly introduced Sinha & Chandrakasan’s power management
algorithm for sensor networks in Section 4.1.2. Its obvious flaw is the incorrect
assumption that sensor network traffic follows an uncorrelated exponential distribution. As
we have seen in the traffic analysis section, sensor network traffic is in fact quite
correlated. Even under this flawed assumption, however, the algorithm has some major
drawbacks.
To illustrate the latter point, let us design an experiment to investigate the merit
of this algorithm under its exponential traffic assumption. The CDF of an exponential
distribution with inter-arrival rate λ is:
D(x) = 1 - e^(-λx)   (1)
The value of λ is dynamically measured and updated. The device shutdown
probability Pth is calculated as:
Pth = 1 - e^(-λ·Teven)   (2)
The shutdown decision is based on Pth: if Pth is greater than a pre-set constant a, the
system shuts down immediately; otherwise it stays awake. In our experimental setup,
we generate as input a packet stream with λ = 2, vary the value of Teven from 0.05 to
0.5, and compare the energy consumption of a node implementing this algorithm to
that of the optimal algorithm. The constant a is set to 0.5. Table 8 shows the comparison
results. We can see that the algorithm works well only when Pth is close to either
0 or 1, and quickly degrades as Pth approaches the value of the constant a (set to 0.5).
The disparity between the two results is only 4.9% when Pth is 0.09, but becomes
97% when Pth is 0.51. The reason is that when the shutdown probability Pth is close
to either 1 or 0, the system is quite certain about whether to stay awake or go to sleep,
whereas when the probability approaches 0.5, the system is just making a random
guess.

Teven                  0.05    0.1     0.2     0.35    0.4     0.5
Pth                    0.09    0.2     0.34    0.51    0.55    0.64
Sinha & Chandrakasan   5021    10042   20095   49280   49281   49286
Optimal Algorithm      4789    9092    16431   24960   27303   31349
Error %                4.9%    10%     22%     97%     80.5%   58%

Table 8 Performance of the Sinha & Chandrakasan algorithm compared to the optimal algorithm, assuming an exponential inter-arrival distribution with λ = 2. The disparity quickly increases as Pth goes from 0 to 0.5, and from 1 to 0.5.
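The decision rule of equations (1) and (2) is compact enough to sketch directly; `rate` stands for the dynamically measured λ, and the function name is ours:

```python
import math

def sc_shutdown(rate, t_even, a=0.5):
    """Sinha & Chandrakasan-style decision under the exponential-traffic
    assumption: shut down iff Pth = 1 - exp(-rate * t_even) exceeds the
    pre-set constant a.
    """
    p_th = 1.0 - math.exp(-rate * t_even)
    return p_th > a

# With rate = 2 the decision is confident at the extremes of Teven:
sc_shutdown(2.0, 0.05)   # Pth ~ 0.09, well below a: stay awake
sc_shutdown(2.0, 0.5)    # Pth ~ 0.63, well above a: shut down
# But near Teven ~ 0.35, Pth ~ 0.5 and the rule is close to a coin flip.
```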
It should be evident that Pth has a significant impact on the performance of
this algorithm. According to equation (2), Pth is determined by the system metric Teven
and the traffic parameter λ. These are system specifications and environment parameters
that are given to, not controlled by, the power manager. This heavy
dependence on system and environmental parameters is obviously a considerable flaw.
Combined with the false assumption of an exponential distribution, it makes the
algorithm quite unattractive for sensor networks.
6.4 Modified Hwang and Wu Algorithm [32]
Hwang and Wu’s algorithm also dynamically samples the inter-arrival rates and
bases its decisions on computational history. Unlike Sinha & Chandrakasan, they
assume that the traffic is correlated. Hwang and Wu adapted the exponential-
average approach used in CPU scheduling to predict the next inter-arrival
time, i.e. the next idle period of the processing unit. The prediction formula is shown
in (3):
I(n+1) = a*i(n) + a(1-a)*i(n-1) + a(1-a)^2*i(n-2) + ... + a(1-a)^n*i(0) + (1-a)^(n+1)*I(0)   (3)
where I(n+1) is the new predicted value, the i(k) are the previous idle periods, and a is a
constant attenuation factor between 0 and 1. The formula indicates that the
predicted idle period is the weighted average of previous idle periods. Early idle
periods have less weight, as specified by the exponential attenuation factor. The
parameter a controls the relative weight of recent and past history in the prediction.
Formula (3) contains many multiplication and addition operations. This
translates to a computational complexity that is not feasible for direct
implementation on sensor nodes. To solve this problem, we truncate the formula and
only use the three most recent history terms. The formula becomes:
I(n+1) = a*i(n) + a(1-a)*i(n-1) + a(1-a)^2*i(n-2)   (4)
The earlier idle periods carry very little weight and can be discarded with little
effect on I(n+1), while drastically reducing the computational complexity. Our
modified Hwang and Wu algorithm uses (4) to predict the next idle period I(n+1), where
the i(k) are the previous inter-arrival times instead of the idle periods. If I(n+1) is greater
than Teven, it shuts the system off as soon as the system becomes idle; otherwise the
system stays on until the idle time reaches Teven.
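The truncated predictor (4) and the resulting shutdown rule can be sketched as follows (the variable names are ours, and a = 0.5 is just an example value, not the tuned one):

```python
def predict_next(history, a=0.5):
    """Truncated exponential average (equation (4)) over the three most
    recent inter-arrival times; history[-1] is the most recent one."""
    i_n, i_n1, i_n2 = history[-1], history[-2], history[-3]
    return a * i_n + a * (1 - a) * i_n1 + a * (1 - a) ** 2 * i_n2

def hwang_wu_decision(history, t_even, a=0.5):
    """Shut down as soon as the system is idle if the predicted gap
    exceeds Teven; otherwise stay on until the idle time reaches Teven."""
    return predict_next(history, a) > t_even

# A single long recent gap dominates the prediction and triggers a sleep:
hwang_wu_decision([0.1, 0.1, 10.0], t_even=0.5)   # predicted gap ~ 5.04
```

Note that with a = 0.5 the three weights are 0.5, 0.25 and 0.125, so the truncation discards only 12.5% of the total weight.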
Node             Teven=0.05   Teven=0.2   Teven=0.5
9  (Sensor)         91.2%        90.1%       82.6%
48 (Sensor)         93.2%        85.0%       80.1%
55 (Controller)     87.3%        70.3%       63.2%
72 (Controller)     82.7%        73.6%       69.0%
Table 9 Optimality of the modified Hwang & Wu algorithm. The worst-case nodes are highlighted in red.
Table 9 lists the optimality of the modified Hwang & Wu algorithm. We have
chosen the values of the parameter a that produce the best results. Similar to the constant
threshold algorithm, its performance degrades as the wakeup overhead and traffic
activity increase. As an exception, Node 55 is the worst performing node for
Teven = 0.2 and 0.5. The reason, as explained in Section 6.2, is that Node 55
has a higher percentage of inter-arrival times that incur the most power-management-
related energy. As the wakeup overhead increases, it replaces Node 72 as the
worst-case node.
Let us compare Table 9 to the results of the constant threshold algorithm listed in
Table 7. When Teven = 0.05, the worst-case node in modified Hwang & Wu is 85.8%
optimal, versus 75.3% in the case of the constant threshold. When Teven = 0.2, the two
perform very similarly (71.4% versus 72.4%). When Teven = 0.5, Hwang & Wu
performs 8% worse than the constant threshold algorithm (63.7% versus 72.6%). Notice
that the constant threshold scheme also tends to perform better for the less active nodes
in the network.
6.5 Adaptive Dynamic Threshold Algorithms
We would like to investigate how different ways of executing shutdowns and
weighting the inter-arrival history affect the performance of the adaptive
algorithms. Instead of making the immediate shutdown decision based on past
history, a different approach is to use the predicted idle time to vary the sleep
threshold. In other words, the system has a dynamic sleep threshold determined
by the weighted average of the previous inter-arrival rates. We have simulated two
dynamic threshold algorithms with different weighting functions. In both cases, the
system goes to sleep after remaining idle for Sth.
1. Dynamic threshold with exponential weighting (EXP). The sleep
threshold Sth is defined as:
Sth = C * Teven / I(n+1)   (5)
where C is a constant, and I(n+1) is calculated using equation (4).
2. Dynamic threshold with root mean square (RMS). The sleep threshold is
defined as:
Sth = C * Teven / I(n+1)   (6)
where C is a constant and I(n+1) = sqrt((i(n)^2 + i(n-1)^2 + i(n-2)^2)/3).
Unlike the exponential weighting scheme, the previous three inter-arrival times are
weighted equally.
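Both threshold updates can be sketched together (names and defaults are illustrative; a = 0.5 and C = 1 are example values, not the tuned ones used in the simulations):

```python
import math

def dynamic_threshold_exp(history, t_even, c=1.0, a=0.5):
    """EXP: Sth = C * Teven / I(n+1), where I(n+1) is the truncated
    exponential average of the last three inter-arrival times (eq. (4))."""
    i_pred = (a * history[-1] + a * (1 - a) * history[-2]
              + a * (1 - a) ** 2 * history[-3])
    return c * t_even / i_pred

def dynamic_threshold_rms(history, t_even, c=1.0):
    """RMS: same form, but I(n+1) is the root mean square of the last
    three inter-arrival times, weighting them equally (eq. (6))."""
    i_pred = math.sqrt(sum(x * x for x in history[-3:]) / 3)
    return c * t_even / i_pred
```

In both schemes a large predicted gap shrinks the threshold (sleep sooner), while a burst of short gaps raises it (stay awake for the rest of the burst).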
The simulation results of EXP and RMS are displayed in Table 10 and Table 11
respectively.
Node             Teven=0.05   Teven=0.2   Teven=0.5
9  (Sensor)         87.8%        88.3%       88.5%
48 (Sensor)         90.8%        85.1%       82.0%
55 (Controller)     83.3%        72.3%       64.4%
72 (Controller)     83.0%        76.5%       70.2%
Table 10 Optimality of EXP. The worst-case nodes are highlighted in red.
One would expect that, of the three adaptive algorithms presented, one would
yield results superior to the others. However, comparing Table 9, Table 10 and Table
11, we notice only marginal variations, and there is no clear winner. EXP seems to
perform on average 8% worse than RMS on Node 9 (as we will see shortly, this can
be improved). Remember that we are most interested in the worst performing node,
which can be either Node 72 or Node 55, depending on Teven. The differences in
performance for these nodes among the three algorithms are less than 2%. These
marginal differences suggest that altering the way of executing shutdowns and
weighting the previous inter-arrival history has little effect on the performance of
the adaptive algorithms. This leads us to speculate that any type of predictive
algorithm that relies on the recent inter-arrival history will not be able to significantly
outperform modified Hwang & Wu, EXP or RMS. In other words, there is a limit to
the performance of any algorithm that only has knowledge of the recent inter-
arrival history. A considerably different approach that incorporates information about
the network neighborhood is needed to achieve any major breakthrough. Such
Node             Teven=0.05   Teven=0.2   Teven=0.5
9  (Sensor)         97.1%        96.9%       97.1%
48 (Sensor)         91.6%        85.1%       82.0%
55 (Controller)     84.1%        72.4%       64.6%
72 (Controller)     82.4%        76.6%       70.3%
Table 11 Optimality of RMS. The worst-case nodes are highlighted in red.
information can be passed around in the network by piggybacking dedicated power
management fields onto existing packets.
6.5.1 Improving Adaptive Algorithms By Exploiting Special
Characteristics Of The Sensor Network
Even though major improvements are difficult to obtain, minor ones can be
accomplished by making some plain observations about network attributes. Firstly, sensor
nodes are woken up not only by incoming packets, but also by periodic update
timers. Unlike the wakeups caused by incoming packets, timer-triggered
wakeups tend not to be followed by bursts of traffic. This means that the system can
shut down immediately after a timer-triggered wakeup. The second observation
concerns the controller nodes, which generate interest packets periodically. They often
receive bursts of data packets shortly after an interest generation. Based on this
observation, controller nodes have their dynamic sleep thresholds increased right after
each interest generation.
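These two adjustments can be layered on top of any dynamic threshold. A sketch (the hook names, the boost factor and the boost window are purely illustrative, not the values used in our simulations):

```python
def improved_threshold(base_threshold, wakeup_cause, is_controller,
                       time_since_interest, boost_window=1.0, boost=4.0):
    """Two sensor-network-specific adjustments to a dynamic threshold:
    - a timer-triggered wakeup is not followed by a traffic burst,
      so the node may sleep again immediately (threshold 0);
    - a controller that just generated an interest expects a burst of
      data packets, so its threshold is temporarily raised.
    """
    if wakeup_cause == "timer":
        return 0.0
    if is_controller and time_since_interest < boost_window:
        return base_threshold * boost
    return base_threshold
```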
Node             Teven=0.05   Teven=0.2   Teven=0.5
9  (Sensor)         98.9%        98.9%       98.9%
48 (Sensor)         92.0%        86.6%       83.0%
55 (Controller)     82.7%        74.1%       65.7%
72 (Controller)     83.6%        78.2%       72.6%
Table 12 Optimality of the Improved EXP
After adding the above minor improvements into the EXP algorithm, we obtain
the results listed in Table 12. Although the improvements for the critical nodes are only
marginal (1~2%), there is almost a 10% improvement for Node 9.
6.6 Evaluation Of Various Power Management Algorithms
For Sensor Network Applications
Adaptive algorithms seem to be the more appropriate solution, since they are able to
exploit the temporal correlations in the traffic, handle environmental changes, and are
relatively simple to implement. They are also self-designing, meaning no traces
or pre-characterizations are needed. Nevertheless, we do not want to rule out the
constant threshold algorithm completely, as it achieves better performance in certain
cases.
Node             Algorithm    Teven=0.05   Teven=0.2   Teven=0.5
9  (Sensor)      Constant        99.3%        98.8%       98.7%
                 Dyn. EXP        98.9%        98.9%       98.9%
48 (Sensor)      Constant        93.9%        86.3%       85.7%
                 Dyn. EXP        92.0%        86.6%       83.0%
55 (Controller)  Constant        86.8%        72.4%       72.6%
                 Dyn. EXP        82.7%        74.1%       65.7%
72 (Controller)  Constant        75.3%        74.9%       73.5%
                 Dyn. EXP        83.6%        78.2%       72.6%
Table 13 Performance comparison of the constant threshold versus dynamic EXP
Let us study the comparison results presented in Table 13. For low activity
nodes such as Node 48 and Node 9, the two results are rather comparable. For high activity
nodes such as Node 55 and Node 72, EXP outperforms the constant threshold algorithm for
smaller Teven. However, as Teven and the penalty for mis-prediction increase, the
constant threshold algorithm becomes the winner. Ironically, the simple constant
threshold algorithm seems to perform better than the more sophisticated EXP exactly when
power management becomes difficult and critical. In other words, it does better for
nodes with the “worst” inter-arrival times (Node 55) and for systems with higher Teven.
One approach to address this problem is to use the constant threshold
algorithm for certain critical controller nodes, on the condition that we can properly
identify them. A node may be able to extract information from its location in the
network (the network topology) and “guess” how busy it is. Topology based power
management is an active area of our future research.
As we have seen, the busiest node in the network may not be the “worst”
node in a power management sense. At the same time, we need to keep in mind that
the “worst” node in a power management sense may not be the node with the worst
overall energy consumption. The overall energy consumption is determined by the
active energy plus the energy controlled by the power manager (idle energy and power
state transition overheads).
6.7 Service Time Consideration
We have not considered the packet service time in our discussion so far. It is
definitely a relevant parameter in power management. However, for low activity, low
duty cycle applications like the PicoRadio network, the service time is
significantly smaller than the average inter-arrival time. Our simulations indicate
that even for the busiest node in the network, there is only a 2% chance of more
than one packet being in the queue. It therefore seems reasonable not to include the
service time in the modeling.
6.8 Implementation Cost Of The PM
Care should be taken when implementing the power manager itself to make sure it
does not consume an overwhelming amount of energy. The selected power control
algorithm should have low computational complexity and require modest storage
space. The original Hwang & Wu algorithm clearly violates this requirement and has
to be modified.
The behavior of the power scheduler is modeled as concurrent EFSMs, each of
which controls a power domain (Figure 27). An energy efficient way of implementing
the PM is to map it either directly to an ASIC or to some form of interconnected
reconfigurable FSMs. A microprocessor implementation is not recommended due to
its massive cost compared to the other implementation fabrics.
7. Conclusions and Future Work
In this last chapter of the thesis, we recapitulate the major research
results and contributions. We also discuss the lessons learned and identify
open questions and opportunities for further research. Power management for sensor
networks is still a fledgling area that is very much unexplored. We hope that this thesis
will instigate interest from the research community in tackling one
of the many interesting problems left.
112
7.1 Summary of Thesis Research and Contributions
2. Formal top-down platform-based design methodology for protocol
implementation. Most protocol design methodologies currently in use are
inadequate, either because they do not rely upon formal techniques and
therefore do not guarantee correctness, or because they do not provide
sufficient support for performance analysis and design exploration and
therefore often lead to sub-optimal implementations. Our methodology relies
on a formal Model of Computation (MOC). It supports architecture
exploration and meets the application’s need for flexibility while achieving
energy-efficient solutions. Using PicoRadio as the design driver, the proposed formal
top-down design methodology yields superior results compared to traditional
bottom-up ad-hoc approaches.
3. Reactive systems need reactive management. OS and software support is
crucial for the design of ultra-low energy communication systems. These
systems, reactive in nature, tend to have a high level of integration and system
heterogeneity. General-purpose operating systems developed for broad
application are increasingly less suitable for these types of complex real time,
power-critical, domain specific systems implemented on advanced
heterogeneous architectures. More efficient solutions are obtained with OS’s
that are developed to exploit the reactive event-driven nature of the domain.
As proof, we present a comparison between two OS’s that target this
embedded domain: one that is general-purpose multi-tasking (ECOS) and
another that is event-driven (TinyOS). Preliminary results indicate that the
event-driven OS achieves an 8x improvement in performance, 2x and 30x
improvements in instruction and data memory requirements, respectively, and a 12x
reduction in power over its general-purpose counterpart.
4. Hierarchical power management framework for reactive systems. Based
on the attractive concepts of an existing reactive OS, we proposed a power
management framework that specifically targets reactive heterogeneous
systems. Its hierarchical structure enhances design scalability, supports
concurrency, and enables power control at various granularities. Most state-of-the-
art power management systems handle only stand-alone devices. The scope of
our power management algorithm, however, is not limited to individual nodes;
instead, it aims to encompass the interest of the network as a whole. Our
power management algorithm executes in two phases: the network-level
algorithm first treats the whole node as one entity and decides when the
node should go to sleep; once the node is turned on, the node-level
algorithm then determines the scheduling of the various modules inside the
node.
5. The experimentation of network level power management algorithms on
the PicoRadio network. In the OMNet++ simulator, we have simulated the
performance of various power control policies in a typical sensor network
setting. Each sensor node models the complete PicoRadio protocol stack.
From our experiments, adaptive algorithms appear to be good solutions, since
they are able to exploit the temporal correlations in the traffic stream, handle
environmental changes, and are relatively simple to implement. However,
simple constant-threshold algorithms perform better for nodes with the
“worst” inter-arrival times and for systems with high Teven. Our experimentation
with the various adaptive algorithms leads us to speculate that there is a
performance limit to any adaptive algorithm that only has knowledge of
the recent inter-arrival history. A more “global” approach that incorporates
information on the network neighborhood is needed to achieve major
breakthroughs.
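A minimal sketch of one such adaptive scheme tracks the recent inter-arrival history with an exponentially weighted average and scales the sleep threshold from it (the weighting factor and scale are hypothetical choices, not the exact policies evaluated):

```python
# Adaptive sleep-threshold sketch: an exponentially weighted moving average
# (EWMA) of recent inter-arrival times drives the sleep threshold, so the
# policy follows temporal correlations in the traffic.
def make_adaptive_threshold(alpha=0.5, scale=0.5, initial=10.0):
    avg = initial
    def update(interarrival):
        nonlocal avg
        avg = alpha * interarrival + (1 - alpha) * avg  # EWMA of history
        return scale * avg           # threshold tracks the observed traffic
    return update

update = make_adaptive_threshold()
for t in (10.0, 2.0, 2.0, 2.0):      # traffic suddenly speeds up
    threshold = update(t)
print(threshold)                     # threshold has shrunk toward 0.5 * 2.0
```

When the traffic speeds up, the threshold contracts within a few samples; this is the behavior that lets adaptive policies handle environmental changes, while also exposing their reliance on recent history alone.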
6. PicoRadio prototype development. To validate research concepts and gain
valuable design experience, I have participated in the development of the
PicoRadio II and PicoRadio III chips. For the PicoRadio II chip, I was
responsible for the entire software implementation process, including OS
selection and porting, application code generation, etc. I have been involved in
the architecture development of both PicoRadio II and PicoRadio III.
PicoRadio III deploys a power manager to demonstrate the reactive
management concepts discussed in the thesis.
7.2 Lessons Learned and Future Research Opportunities
In completing the thesis research, we encountered numerous problems
that prevented us from accomplishing more ambitious goals. We would like to itemize
the valuable lessons learned from our efforts in methodology and design flow, as
well as in power management for sensor networks. Future research directions are
also suggested.
Platform-based design methodology for protocol processing
Good modeling is essential for the success of the design flow. The
methodology is only as good as its models. The processor and OS
models provided by the Cadence VCC tool are vastly inaccurate. As a result,
the PicoRadio II system performance was over-estimated and its memory
requirement under-estimated. Needless to say, this is detrimental to the design
process.
In the architecture exploration phase of the design flow, it would be highly
desirable to provide a “confidence metric” that indicates how accurate the
performance estimation is. We are not aware of any existing performance
analysis tools that have this capability. Consequently, designers are often
reluctant to adopt high-level design methodologies and to use such tools.
Node level power management
Localized power control policies should be investigated in the future. In the
thesis, we presented a power scheduling approach that is global in nature:
most of the intelligence is in the global power scheduler. Certain scheduling
decisions, however, can be delegated to the local blocks to avoid overloading
the power scheduler. A block may, for example, decide how long it should wait in
Idle before requesting to sleep, based on its own estimate of the expected event
inter-arrival rates and wakeup overheads.
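Such a local decision can be based on the classic break-even time: sleeping only pays off if the expected idle interval exceeds the transition overhead divided by the idle/sleep power gap. A sketch with hypothetical numbers:

```python
# Break-even sketch for a block's local sleep decision. All power and energy
# values are hypothetical illustrations.
def break_even_time(e_transition, p_idle, p_sleep):
    # Idle interval below which the transition overhead outweighs the savings.
    return e_transition / (p_idle - p_sleep)

def should_request_sleep(expected_idle, e_transition, p_idle, p_sleep):
    return expected_idle > break_even_time(e_transition, p_idle, p_sleep)

t_be = break_even_time(e_transition=2e-3, p_idle=5e-3, p_sleep=1e-3)
print(t_be)                                          # ~0.5 s for these numbers
print(should_request_sleep(2.0, 2e-3, 5e-3, 1e-3))   # long idle: worth sleeping
```

Each block can evaluate this locally from its own wakeup overheads and expected inter-arrival rate, leaving the global scheduler free to arbitrate only the cross-block decisions.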
More refined and aggressive predictive look-ahead scheduling schemes should
be studied and implemented. We have presented a rather simplistic and
conservative predictive look-ahead scheduling algorithm. The study of
predictive power scheduling is still in the conceptual stage. We need more
quantitative analysis of its costs and benefits. The PicoRadio platform, due to
its very low block wakeup overheads, is not a good candidate for
demonstrating predictive scheduling. An implementation platform with
high wakeup overheads should be selected as the proof of concept for
predictive scheduling algorithms. DVS is another open research area. Our
framework has the capability to incorporate DVS, but specific scheduling
techniques for DVS are left for future work.
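One form such a predictive scheme could take is sketched below: the next idle interval is predicted with an exponential average of past ones, and the block is shut down immediately, without waiting out a timeout, whenever the prediction exceeds the break-even time (all parameters are hypothetical):

```python
# Predictive shutdown sketch: predict the next idle interval from an
# exponential average of past ones, and sleep at once if the prediction
# beats the break-even time (no idle timeout is waited out).
def make_predictor(alpha=0.5, initial=1.0):
    pred = initial
    def observe_and_predict(last_idle):
        nonlocal pred
        pred = alpha * last_idle + (1 - alpha) * pred  # exponential average
        return pred
    return observe_and_predict

T_BREAK_EVEN = 0.5  # s, hypothetical break-even time for this block

predict = make_predictor()
predicted = predict(2.0)               # last observed idle interval was 2.0 s
sleep_now = predicted > T_BREAK_EVEN   # shut down immediately if worthwhile
print(predicted, sleep_now)            # 1.5 True
```

The risk, and the reason such schemes need high wakeup overheads to pay off, is a mispredicted short idle interval: the block then eats a full transition overhead for little idle-power savings.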
Network level power management
Larger-scale networks with more realistic topologies should be constructed to
study power management algorithms. We have conducted our experiments on
a simple 100-node grid network, while realistic settings typically do not have a
grid configuration and may include “walls” that block signal transmission. It
would also be very interesting to see how the performance of the various power
control algorithms scales with the size of the network.
Better wireless channel models should be incorporated in the network
simulation. We are currently using an overly simplistic channel-interference-
matrix model. Channel quality has a significant impact on protocol design and
the choice of power management policy. Unfortunately, wireless link
characteristics are difficult to capture, and a realistic channel model is not easily
available. The ultimate plan is to implement the protocol stack along with the
power manager on a prototype and deploy it in a real network setting.
Measurement data on network traffic and energy dissipation can then be
collected and used to aid policy development.
A global paradigm that uses network topology to devise power management
policies should be studied. This paradigm is drastically different from existing
policies that rely only on local traffic flows. Our experiments suggest that
there are limitations to algorithms with such narrow local vision. We believe that
a real breakthrough can be achieved if the power manager uses knowledge
of not only the local node but also the surrounding network. Power control
information can be passed around the network by appending dedicated
power management fields to existing packet formats. A node’s power control
decisions may then be influenced by the energy status of its neighboring nodes.
Very often in a sensor network a node knows its own location and those of
its neighbors, and power management policies can be developed based on this
information. For example, in the constant-threshold algorithm, instead of pre-
characterizing the nodes to obtain each node’s optimal constant sleep threshold,
we could calculate the value from the global network topology.
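As a purely speculative sketch of this idea, a node’s constant sleep threshold could be derived from the aggregate traffic rate implied by its position in the topology rather than from per-node pre-characterization (the rates and scale factor below are hypothetical):

```python
# Speculative sketch: derive a node's constant sleep threshold from topology.
# A node that relays for many neighbors sees shorter inter-arrival times,
# so it is assigned a proportionally shorter sleep threshold.
def threshold_from_topology(neighbor_rates, scale=0.5):
    total_rate = sum(neighbor_rates)          # packets/s funneled through node
    expected_interarrival = 1.0 / total_rate  # crude traffic estimate
    return scale * expected_interarrival

leaf = threshold_from_topology([0.01])         # one quiet neighbor
relay = threshold_from_topology([0.01] * 10)   # relays for ten neighbors
print(leaf, relay)                             # relay gets a 10x shorter threshold
```

No pre-deployment characterization run is needed: the thresholds fall out of the neighbor lists that many routing protocols already maintain.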
8. References
1. E. Lee and A. Sangiovanni-Vincentelli, A Unified Framework for Comparing Models of
Computation, IEEE Trans. on Computer Aided Design of Integrated Circuits and Systems, Vol.
17, N. 12:1217-1229, December 1998.
2. A. Girault, B. Lee and E. Lee, A Preliminary Study of Hierarchical Finite State Machines with
Multiple Concurrency Models, UCB/ERL Technical Report, August 1997.
3. C. Hoare, “Communicating Sequential Processes”, Communications of the ACM, Vol. 21, No. 8,
August 1978.
4. R. Milner, J. Parrow and D. Walker, A Calculus of Mobile Processes, I, Information and
Computation, Vol. 100, No. 1, Sep 1992.
5. J. Dennis, First Version of a Data Flow Procedure Language, Technical Memo MAC TM61, MIT
Lab for Computer Science, May 1975.
6. G. Kahn, The Semantics of a Simple Language for Parallel Programming, Proc. of the IFIP
Congress 74, North-Holland Publishing Co., 1974.
7. C. Cassandras, Discrete Event Systems: Modeling and Performance Analysis, Irwin, Homewood,
IL, 1993.
8. A. Benveniste and G. Berry, The Synchronous Approach to Reactive and Real-Time Systems,
Proceedings of the IEEE, Vol. 79, No. 9, 1991, pp. 1270-1282.
9. L. Lavagno, A. Sangiovanni-Vincentelli & E. Sentovich, Models of Computation for Embedded
System Design, 1998 NATO ASI Proceedings on System Synthesis, Il Ciocco, Italy, 1998
10. A. Sangiovanni-Vincentelli, R. McGeer and A. Saldanha, Verification of Integrated Circuits and
Systems, Proc. of 1996 Design Automation Conference, June 1996.
11. A. Ferrari and A. Sangiovanni-Vincentelli, System Design: Traditional Concepts and New
Paradigms, Proceedings of the 1999 Int. Conf. on Comp. Des., Austin, Oct. 1999.
12. J. Rabaey et al. PicoRadio Supports Ad Hoc Ultra-Low Power Wireless Networking. IEEE
Computer, Vol. 33, No. 7, pp. 42-48, July 2000.
13. OPNET Radio Modeler, OPNET Technologies, Inc., http://www.mil3.com
14. C. Perkins and E. Royer, Ad-hoc On-Demand Distance Vector Routing, Proceedings of the 2nd
IEEE Workshop on Mobile Comp. Sys. and Apps., pp. 90-100, Feb. 1999.
15. C. Perkins and P. Bhagwat, Highly Dynamic Destination Sequenced Distance-Vector Routing
(DSDV) for Mobile Computers, Computer Communications Review, pp. 234-44, Oct. 1994.
16. H. Zhang et al, “1V Heterogeneous Reconfigurable Processor IC for Baseband Wireless
Applications,” Proc. ISSCC Conf., 2000, pp. 68-69.
17. D. C. Cronquist et al., "Architecture Design of Reconfigurable Pipelined Datapaths," Twentieth
Anniversary Conference on Advanced Research in VLSI, 1999.
18. M. Smith, Application Specific Integrated Circuits: Chapter 5.4. Altera MAX, Addison-Wesley,
1997.
19. T. Tuan, S.-F. Li and J. Rabaey, Reconfigurable Platform Design for Wireless Protocol
Processors, Proc. of 2001 ICASSP, May 2001.
20. B. Kienhuis et al, An Approach for Quantitative Analysis of Application-specific Dataflow
Architectures, Proceedings of International Conf. of Application-specific Systems, Architectures
and Processors, pp. 338-349, Zurich, Switzerland 1997.
21. Cadence Design Systems, http://www.cadence.com
22. ECOS Operating System, http://www.redhat.com
23. D. Culler et al., The TinyOS Group, Department of EECS, UC Berkeley.
24. P. Chou and G. Borriello, Software Architecture Synthesis for Retargetable Real-Time Embedded
Systems, Proceedings of the Fifth International Workshop on Hardware/Software Codesign.
25. K. Ramamritham and J. A. Stankovic, Scheduling Algorithms and Operating Systems Support for
Real-Time Systems, Proceedings of the IEEE, January 1994, pp. 55-67.
26. R. Evans and P. Franzon, “Energy Consumption Modeling and Optimization for SRAM’s,” IEEE
Journal of Solid-State Circuits, Vol. 30, No. 5, May 1995.
27. G. Paleologo, L. Benini, A. Bogliolo and G. De Micheli, Policy Optimization for Dynamic Power
Management, Proc. 35th Design Automation Conference, June 1998, pp. 182-187.
28. A. Karlin, M. Manasse, L. McGeoch and S. Owicki, Competitive Randomized Algorithms for
Nonuniform Problems, Algorithmica, Vol. 11, No. 6, pp. 542-571, June 1994.
29. E. Chung, L. Benini and G. De Micheli, Dynamic Power Management for Non-Stationary Service
Requests, Design, Automation and Test in Europe, pp. 77-81, 1999.
30. T. Simunic, Dynamic Management of Power Consumption, Chapter 1 in Power Aware Computing,
Kluwer Academic Publishers, pp. 102-125, 2002.
31. A. Sinha and A. Chandrakasan, Dynamic Power Management in Wireless Sensor Networks, IEEE
Design & Test of Computers, 2001.
32. C.-H. Hwang and A. Wu, A Predictive System Shutdown Method for Energy Saving of Event-
Driven Computation, Int. Conf. on Computer-Aided Design, Nov. 1997, pp. 28-32.
33. Intel StrongARM processors. http://developer.intel.com/design/strong/sa1100.html
34. T. Pering, T. Burd and R. Brodersen, The Simulation and Evaluation of Dynamic Voltage Scaling
Algorithms, Proceedings of the IEEE International Symposium on Low Power Electronics and
Design, 1998.
35. L. Geppert and T. Perry, Transmeta’s Magic Show, IEEE Spectrum, Vol. 37, pp. 26-33, May 2000.
36. A. Willig, M. Kubisch, C. Hoene and A. Wolisz, Measurements of a Wireless Link in an Industrial
Environment Using an IEEE 802.11-Compliant Physical Layer, IEEE Transactions on Industrial
Electronics, 2002.
37. D. Duchamp and N. Reynolds, Measured Performance of a Wireless LAN, Proc. of 17th Conf. on
Local Computer Networks, Minneapolis, 1992.
38. M. Srivastava, A. Chandrakasan and R. W. Brodersen, Predictive Shutdown and Other
Architectural Techniques for Energy-Efficient Programmable Computation, IEEE Trans. Very
Large Scale Integration (VLSI) Syst., Vol. 4, pp. 42-55, Mar. 1996.
39. M. Crovella and A. Bestavros, Self-Similarity in World Wide Web Traffic: Evidence and Possible
Causes, IEEE/ACM Trans. Networking, Vol. 5, pp. 835-846, Dec. 1997.
40. M. Garrett and W. Willinger, Analysis, Modeling and Generation of Self-Similar VBR Video
Traffic, in SIGCOMM’94, London, U.K., August 1994, pp. 269-280.
41. W. Leland et al., On the Self-Similar Nature of Ethernet Traffic, IEEE/ACM Trans. Networking,
Vol. 2, pp. 1-15, Feb. 1994.
42. Q. Liang, Ad Hoc Wireless Network Traffic – Self-Similarity and Forecasting, IEEE
Communications Letters, Vol. 6, No. 7, July 2002.
43. J. Redi and D. Averesky, Performance of Energy-Conserving Access Protocols Under Self-Similar
Traffic, IEEE Wireless Communications and Networking Conference (WCNC’99), September 1999.