Optimize Storage Placement in Sensor Networkswm/J14.pdf · placement design. Instead, we consider...

1

Optimize Storage Placement in SensorNetworks

Bo Sheng, Member, IEEE, Qun Li, Member, IEEE, and Weizhen Mao, Member, IEEE

Abstract—Data storage has become an important issue in sensor networks as a large amount of collected data need to be archived forfuture information retrieval. Storage nodes are introduced in this paper to store the data collected from the sensors in their proximities.The storage nodes alleviate the heavy load of transmitting all data to a central place for archiving and reduce the communication costinduced by the network query. The objective of this paper is to address the storage node placement problem aiming to minimize the totalenergy cost for gathering data to the storage nodes and replying queries. We examine deterministic placement of storage nodes andpresent optimal algorithms based on dynamic programming. Further, we give stochastic analysis for random deployment and conductsimulation evaluation for both deterministic and random placements of storage nodes.

Index Terms—Wireless sensor networks, data storage, data query.

✦

1 INTRODUCTION

S ENSOR networks deployed for pervasive computingapplications, e.g., sensing environmental conditions

and monitoring people’s behaviors, generates a largeamount of data continuously over a long period of time.This large volume of data has to be stored somewherefor future retrieval and data analysis. One of the biggestchallenges in these applications is how to store andsearch the collected data.The collected data can either be stored in the network

sensors, or transmitted to the sink. Several problemsarise when data are stored in sensors. First, a sensor isequipped with only limited memory or storage space,which prohibits the storage of a large amount of dataaccumulated for months or years. Second, since sensorsare battery operated, the stored data will be lost afterthe sensors are depleted of power. Third, searching forthe data of interest in a widely scattered network fieldis a hard problem. The communication generated in anetwork-wide search will be prohibitive. Alternatively,data can be transmitted back to the sink and stored therefor future retrieval. This scheme is ideal since data arestored in a central place for permanent access. However,the sensor network’s per-node communication capability(defined as the number of packets a sensor can transmitto the sink per time unit) is very limited [1], [2]. A largeamount of data cannot be transmitted from the sensornetwork to the sink efficiently. Furthermore, the datacommunication from the sensors to the sink may takelong routes consuming much energy and depleting of thesensor battery power quickly. In particular, the sensors

• Bo Sheng is with the College of Computer and Information Science,Northeastern University, Boston, MA 02115.E-mail: [email protected]

• Qun Li and Weizhen Mao are with the Department of Computer Science,College of William and Mary, VA 23185.Email: {liqun, wm}@cs.wm.edu

around the sink are generally highly used and exhaustedeasily, thus the network may be partitioned rapidly.It is possible that, with marginal increase in cost,

some special nodes with much larger permanent storage(e.g., flash memory) and more battery power can bedeployed in sensor networks. These nodes back up thedata for nearby sensors and reply to the queries. The dataaccumulated on each storage node can be transportedperiodically to a data warehouse by robots or traversingvehicles using physical mobility as Data Mule [3]. Sincethe storage nodes only collect data from the sensors intheir proximity and the data are transmitted throughphysical transportation instead of hop-by-hop relay ofother sensor nodes, the problem of limited storage, com-munication capacity, and battery power is ameliorated.Placing storage nodes is related to the sensor network

applications. We believe query is the most importantapplication for sensor networks since in essence sensornetworks are about providing information of the envi-ronment to the end users. A user query may take variousforms, e.g., “how many nodes detect vehicle traversingevents”, and “what is the average temperature of thesensing field”. In this scenario, each sensor, in addition tosensing nearby environment, is also involved in routingdata for two network services: the raw data transmissionto storage nodes and the transmission for query diffusionand query reply. Two extremes, as mentioned earlier,would be transmitting all the data to the sink or storingthem on each sensor node locally. On the one hand, datasolely stored in the sink is beneficial to the query replyincurring no transmission cost, but the data accumula-tion to the sink is very costly. On the other hand, storingdata locally incurs zero cost for data accumulation,whereas the query cost becomes large because a queryhas to be diffused to the whole network and each sensorhas to respond to the query by transmitting data to thesink. The storage nodes not only provide permanentstorage as described previously, but also serve as a buffer

2

between the sink and the sensor nodes. The positioningof storage nodes, however, is extremely important inthis communication model. A bad placement strategymay waste the storage resources and have an adverseeffect on the performance. Therefore, a good algorithmfor placing storage nodes is needed to strike a balancebetween these two extremes characterizing a tradeoffbetween data accumulation and data query.This paper considers the storage node placement prob-

lem in a sensor network. This problem can be verycomplex with respect to various optimization metrics,such as bandwidth minimization on each link and life-time of the network. However, the bandwidth problemis alleviated because the storage nodes collect the datagenerated within their proximities and avoid the heavyload on each link incurred by the long distance datacommunication. Lifetime of a network depends on therouting scheme and the network structure. Specifically, itis mainly affected by the most loaded nodes, which arelikely the nodes close to the sink. The storage placementdoes not reduce the loads on those nodes since all theresponses have to be sent through the nodes around thesink. Therefore, even though lifetime is an importantmetric, it is not a good metric to guide the storageplacement design. Instead, we consider an approxima-tion of lifetime: energy cost. We believe energy cost ismore crucial as it is the most important performancemetric in sensor network design. Therefore, we aim tominimize the total energy cost in data accumulation anddata query by judiciously placing the storage nodes.We first examine the problem in the fixed tree model.

We assume the sensor network has organized into atree rooted at the sink. The communication routes fromsensors towards the sink are predefined by the tree.Our goal is to select some of the nodes in the tree asstorage nodes, each of which is responsible for storingthe raw data of its descendants that are not covered byother storage nodes. In many applications, for examplea sensor network along the highway, or a drainage or oilpipeline monitoring system, the network communicationtree is fixed due to the constraints of the sensor deploy-ment. Our results for the fixed tree model fit into thosescenarios well. We also consider the dynamic tree model,where the (optimal) communication tree is constructedafter the storage nodes are deployed. Specifically, eachsensor selects a storage node in its proximity for its datastorage with the goal to minimize the energy cost of theresulting communication tree. The sketch of our solutionand preliminary results have been published in [4].We organize this paper as follows. In Section 2, we

review the related work. In Section 3, we define severalproblems for storage node placement in the aforemen-tioned models. In Sections 4 and 5, we present optimalalgorithms for the problems defined within the fixedtree model. In Section 6, we give theoretical analysisof randomized deployment of storage nodes for boththe fixed tree and dynamic tree models. In Section 7,we present our simulation results, which confirm the

theoretical analysis. Finally, in Section 8, we make theconclusion and discuss future research.

2 RELATED WORK

There has been a lot of prior research work on dataquerying models in sensor networks. In early models([5], [6]), query is spread to every sensor by floodingmessages. Sensors return data back to the sink in thereverse direction of query messages. Those methods,however, do not consider the in-network storage.In the literature, several schemes have introduced

an intermediate tier between the sink and sensors.LEACH [7] is a clustering-based routing protocol, wherecluster heads can fuse the data collected from its neigh-bors to reduce communication cost to the sink. However,LEACH does not address storage problem. Data-centricstorage schemes ([8], [9], [10], [11], [12]), as another cat-egory of the related work, store data on different placesaccording to different data types. In [8], [9] and [11], theauthors propose a data-centric storage scheme based onGeographic Hash Table, where the home site of datais obtained by applying a hash function on the datatype. Another practical improvement is proposed in [12]by removing the requirement of point-to-point routing.In [13], Ahn et al. analyze the scaling behavior of data-centric query for both unstructured and structured (e.g.,GHT) networks and derive some key scaling conditions.GEM [10] is another approach that supports data-centricstorage and applies graph embedding technique to mapdata to sensor nodes. In general, the data-centric storageschemes assume some understanding about the collecteddata and extra cost is needed to forward data to thecorresponding keeper nodes. In our paper, we do notassume any prior knowledge about the data: indeed inmany applications, raw data may not be easily catego-rized into different types. To transmit the collected datato a remote location is also considered expensive becausethe total collected data may be in a very large quantity.To facilitate data query, Ganesan et al. [14] propose

a multi-resolution data storage system, DIMENSIONS,where data are stored in a degrading lossy model, i.e.,fresh data are stored completely while long-term data arestored lossily. In comparison, our scheme is more gen-eral without any assumption about the data correlation.PRESTO [15] is a recent research work on storage archi-tecture for sensor networks. A proxy tier is introducedbetween sensor nodes and user terminals and proxynodes can cache previous query responses. Compared tothe storage nodes in this paper, proxy nodes in PRESTOhave no resource constraints in term of power, compu-tation, storage and communication. It is a more generalstorage architecture that does not take the characteristicsof data generation or query into consideration. In theCougar project ([16], [17]), a data dissemination tree isbuilt with data sources as leaves. View nodes introducedin Cougar have similar functionalities as storage nodes inthis paper. Our scheme focuses more on how to optimize

3

the placement of storage nodes, while Cougar mainlyfocuses on how to implement data query with moredetails in a sensor network.In addition, recent research work has shown the fea-

sibility of storage nodes. In [18], Mathur et al. inves-tigate hardware supports to attach large capacity flashmemory to sensors. Their measurement study showsthat access to large local storage is practical for sensors.Furthermore, storage nodes are also supported by recentresearch on software system. Zeinalipour-Yazti et al. pro-pose MicroHash [19], an indexing structure to manageexternal flash memory of sensors in order to efficientlylook up the stored data. In [20], Mathur et al. designCapsule system as a intermediate layer between flashmemory and sensor applications.In [21] and [22], the authors introduce an approach to

analyzing communication networks based on stochas-tic geometry. They consider models built on Poissonprocesses and obtain formulas to express the averagecost in function of the intensity parameters of Poissonprocesses. Baek et al. extend this work specifically tosensor networks in [23]. Our paper also analyzes randomplacement of storage nodes in Section 6. We will usesimilar means with [21], [22] and [23] to derive analyticalformulas for the performance.

3 PROBLEM FORMULATION

In this paper, we consider an application in which sensornetworks provide real-time data services to users. A sen-sor network is given with one special sensor identifiedas the sink (or base station) and many normal sensors,each of which generates (or collects) data from its envi-ronment. Users specify the data they need by submittingqueries to the sink and they are usually interested inthe latest readings generated by the sensors1. To replyto queries, one typical solution is to let the sink haveall the data. Then any query can be satisfied directly bythe sink. This requires each sensor to send its readingsback to the sink immediately every time it generatesnew data. Generally, transferring all raw data could bevery costly and is not always necessary. Alternatively, weallow sensors to hold their data and to be aware of thequeries, then raw data can be processed to contain onlythe readings that users are interested in and the reduced-size reply, instead of the whole raw readings, can betransferred back to the sink. This scheme is illustratedin Fig. 1, where the black nodes, called storage nodes, areallowed to hold data. The sink diffuses queries to thestorage nodes by broadcasting to the sensor network andthese storage sensors reply to the queries by sending theprocessed data back. Compared to the previous solution,this approach reduces the raw data transfer cost (asindicated by the thick arrows in the figures), becausesome raw data transmissions are replaced by query reply

1. Our algorithms also apply to the queries to the historic data. Forthe ease of presentation, we assume all queries are corresponding tothe latest generated data.

(as indicated by the thin arrows). On the other hand, thisscheme incurs an extra query diffusion cost (as indicatedby the dashed arrows). In this paper, we are interested instrategically designing a data access model to minimizeenergy cost associated with raw data transfers, querydiffusion, and query replies.We first formally define two types of sensors (or

nodes). 1) Storage nodes: This type of nodes have muchlarger storage capacity than regular sensors. In the dataaccess model, they store all the data received from othernodes or generated by themselves. They do not sendout anything until queries arrive. Based on the querydescription, they obtain the results needed from the rawdata they are holding and then return the results backto the sink. Note that except enriched storage capacity,other resources on storage nodes are still constrainedas regular sensors. The sink itself is considered as astorage node. 2) Forwarding nodes: This type of nodesare regular sensors and they always forward the datareceived from other nodes or generated by themselvesalong a path towards the sink. The outgoing data arekept intact and the forwarding operation continues untilthe data reach a storage node. The forwarding operationis independent of queries and there is no data processingat forwarding nodes.

Query

SinkStorage NodeForwarding NodeRaw DataReply

Fig. 1: Data Access Model with Storage Nodes

Therefore, our goal is to design a centralized algorithmthat can derive the best locations of the storage nodes toguide the deployment of such a hybrid sensor network.We make the following assumptions about the charac-teristics of data generation, query diffusion, and queryreplies. First, for data generation, we assume that eachnode generates rd readings per time unit and the datasize of each reading is sd. Second, for query diffusion, weassume that rq queries of the same type are submittedfrom users per time unit and the size of the querymessages is sq. Third, for query reply, we assume thatthe size of data needed to reply a query is a fractionα of that of the raw data. Specifically, we define a datareduction function f for query reply. For input x, whichis the size of raw data generated by a set of nodes,function f(x) = αx for α ∈ (0, 1] returns the size ofthe processed data, which is needed to reply the query.We do not restrict the types of queries we impose on thesensor network in this paper, but we assume that wecan obtain an empirical value for α through examining

4

Sink

(a) Step 1: Regular sensors are deployed ina field.

Sink

(b) Step 2: A communication tree is con-structed to relay data.

Sink

(c) Step 3: Some regular sensors are up-graded to or replaced by storage nodes.

Fig. 2: Deploy Storage Nodes in the Fixed Tree Model (storage nodes are black)

the historic queries. The parameter α characterizes manyqueries satisfied by a certain fraction of all the sensingdata, e.g., a range query may be “return all the nodes thatsense a temperature higher than 100 degree” and α canbe estimated based on the data distribution information.In this paper, the communication among all n nodes

is based on a tree topology rooted at the sink. The treeis formed in the initial phase as follows. The sink firstbroadcast a message with a hop counter. The nodesreceiving the message will set the message sender asthe parent node, increase the hop counter by one, andbroadcast it. If a node receives multiple messages, it willselect the one with the minimum hop counter to broad-cast and set the sender of the message as its parent. Dataare transferred along the edges in this communicationtree. To transmit one data unit, the energy cost of thesender and receiver are etr and ere respectively, and etr

is also relevant to the distance between the sender andreceiver. To simplify the problem, we set the length ofeach tree edge to one unit, which means that sensornodes have a fixed transmission range and the energycost of transferring data is only proportional to the datasize. Our algorithms can be easily extended to non-uniform transmission ranges as long as topology infor-mation is available. In our energy model, for simplicityof presentation, the receiving energy cost is assigned tothe sender without changing the total energy cost. Whensensor i sends one data unit to j, the energy cost of j is0, and the energy consumed by i is

{

etr + ere if j is i’s parent;etr + ere · ci if j is one of i’s children,

where ci is the number of i’s children. In the rest of thispaper, we normalize the energy cost by (etr + ere) foreasy presentation. Thus, transferring one data unit fromi to its parent consumes one energy unit and to broadcastone data unit to its children, i will consume bi energyunits, where bi = etr+ere·ci

etr+ere.

Tree structure has been widely used in sensor net-works. Although the communication tree may be brokendue to link failure, this paper considers a commonpractice that only stable links are chosen to build the treeso that errors in transmission due to poor link qualitycan be reduced and the tree topology can be robust fora long time. Since the tree topology changes will be rareincurring small communication costs compared to the

query and data collection cost, we will not consider thecost for tree topology information update in this paper. Ifpacket loss happens, sensors may retransmit the packettill it gets through. We assume that the probability ofretransmission is the same for all links. Thus, the energycost for communication on each link is still proportionalto the data size even though the overhead for a unitpacket is larger than that of a perfect wireless link.

The tree structure is independent of the underlyinglow layer communication protocol: like myriad routingprotocols, tree structure is one of the routing schemes.Our algorithm only assumes that the energy cost isproportional to the transmitted data size, which canbe realized in many communication protocols such asthe duty cycle mechanism. Moreover, we assume thatthe applications considered in this paper can toleratethe delay caused by low layer communication, such asretransmission and duty cycle mechanism. We wouldalso like to mention that the lower layer implementation,such as the duty cycle mechanism, is compatible with thefunctionality of the tree structure based communication.Tree structure is constructed in the initialization phase,in which duty cycle has not been started. Therefore,duty cycle does not affect the tree construction whilethe tree structure has to be considered for selecting theparameters for duty cycle.

Here, we present basic analysis of energy cost. Let i bea node in the communication tree and Ti be the subtreerooted at i. We use |Ti| to denote the number of nodes inTi. We define e(i) to be the energy cost incurred at i pertime unit, which includes, the cost for raw data transferfrom i to its parent if i is a forwarding node, the cost forquery diffusion if i has storage nodes as its descendants,and the cost for query reply if i is a storage node or hasa storage descendant. To define e(i) mathematically weneed to consider several possible cases.

Case A. i is a forwarding node and there are no storagenodes in Ti. All raw data generated by the nodes in Ti

have to be forwarded to the parent of i and there is noquery diffusion cost. So e(i) = |Ti|rdsd.

Case B. i is a storage node and there are no otherstorage nodes in Ti. The latest readings of all raw datagenerated by the nodes in Ti are processed at node i andthe reply size is be α|Ti|sd. Node i sends the reply to itsparent when queries arrive. So e(i) = rqα|Ti|sd.

5

Case C. i is a storage node and there is at least oneother storage node in Ti. In addition to the cost forquery reply as defined in Case B, i also incurs a costfor query diffusion that is implemented by broadcastingto its children. So e(i) = rqα|Ti|sd + birqsq.

Case D. i is a forwarding node and there is at leastone storage node in Ti. This is the case where all threetypes of cost (for raw data transfer, query diffusion, andquery reply) are present. Among the |Ti|−1 descendantsof i, let d1 be the number of forwarding descendantswithout any storage nodes on their paths to i (the rawdata generated at these d1 nodes and at i itself will beforwarded from i to its parent without reduction) andd2 be the number of storage descendant’s or forwardingdescendants with at least one storage node on their pathsto i (the last readings of the raw data generated at thesed2 nodes will have been processed and reduced beforereaching i). Obviously, d1 + d2 = |Ti| − 1. So

e(i) = (d1 + 1)rdsd + birqsq + rqαd2sd.

The communication tree can be formed before or afterstorage node deployment. Accordingly, we will considertwo models in the rest of this paper. In the fixed treemodel, as illustrated in Fig. 2, we first deploy regularsensors and construct a communication tree as usual.Based on the topology information, we select some ofthe regular sensors to be storage nodes. We can attachlarge flash memory to these selected sensors or replacethem by more powerful storage nodes at the samelocations. The other model, the dynamic tree model, isillustrated in Fig. 3. In this model, storage nodes aredeployed before the communication tree is formed andtheir location information is broadcast to nearby regularsensors. After that, all sensors organize themselves intoa communication tree according to the locations of thestorage nodes. In both models, after tree constructionand storage node deployment, each storage node needsto send a notification towards the sink. In this way, everysensor is aware of the existence of storage nodes amongits descendants and when a query arrives, it is able todetermine whether to continue the diffusion or not.

Within the fixed tree model, we will consider twoproblems of storage node placement. Given an undi-rected tree T with nodes labeled with 1, 2, . . . , n. Thelength of each edge is 1. Let e(i) be the energy cost ofnode i in one time unit as defined above. The objec-tive is to place storage sensors (and hence forwardingsensors) on nodes in T such that the total energy cost∑

i∈T e(i) is minimized. In the case when there is nolimit on the number of storage nodes that can be usedto minimize the energy cost, the problem is denotedwith UNLIMITED. In the case when there is a limitednumber of storage nodes, say k, to use, the problem isdenoted with LIMITED. For the dynamic tree model,the storage node placement problem becomes how toplace k given storage nodes to form a communicationtree with the minimum total energy cost. This problem

is NP-hard since it is a general case of the minimum k-median problem. We have proposed a 10-approximationalgorithm for the dynamic tree model in our previouswork [24].

The above problems defined with the fixed tree anddynamic tree models aim to find the optimal locationsfor storage nodes in a deterministic way. In reality,however, the storage nodes may not be deployed in aprecise way. Instead, their deployment may be randomwith a certain density λ. This paper also evaluates theperformance of random deployment of storage nodesin fixed and dynamic trees. Finally, Table 1 lists thenotations, which will be used in the rest of this paper.

rd / sd: rate of data generation / size of each datarq / sq : rate of user queries / size of each query messageα: data reduction rate (query reply size / raw data size)n: total number of sensorsk: total number of storage nodes (for LIMITED problem)Ti: the subtree rooted at node i

di: the depth of node i

ci: the number of node i’s childrene(i): energy cost of node i

E(i): energy cost of all the nodes in Ti

λ / λs: density of regular sensors / storage nodesetr / ere: energy cost for transmitting / receiving a unit data

TABLE 1: Summary of Notations

4 UNLIMITED NUMBER OF STORAGE NODES

We first present a linear-time algorithm for the problemUNLIMITED, where an unlimited number of storagenodes are available to use to minimize the energy costof a communication tree. Recall that e(i) is the energycost at node i. Let Ti be the subtree rooted at i. ThenE(i) is the energy cost of nodes in Ti, defined to beE(i) =

∑

i∈Tie(i).

Our algorithm relies on the following lemma.

Lemma 1: Given a node i and its subtree Ti. If αrq ≥rd, then i must be a forwarding node to minimize E(i).Otherwise, i must be a storage node to minimize E(i).

Proof: We compare the energy cost of two trees,which are identical in every aspect except that the firsttree’s root is a forwarding node and the second tree’sroot is a storage node. Let E1 and E2 be the energy costof these two trees, respectively. Comparing the energycost of individual nodes, one by one, in the two trees,we observe that any two non-root nodes in the sameposition of the trees must have the same energy cost.The only difference is the energy cost of the roots. Let e1

and e2 be the energy cost of the roots in the two trees,respectively. Therefore, E1 − E2 = e1 − e2. To prove thelemma, it suffices to prove that

e1 − e2

{

≤ 0 if αrq ≥ rd;> 0 if αrq < rd.

We consider two cases. First, if both roots have nostorage descendants, then according to the four-casedefinition of energy cost given in the previous section

6

Sink

(a) Step 1: Regular sensors and storagenodes are deployed in a field.

Sink

(b) Step 2: Each storage node (including thesink) broadcasts its location information.

Sink

(c) Step 3: A communication tree is con-structed based on the locations of storagenodes.

Fig. 3: Deploy Storage Nodes in the Dynamic Tree Model (storage nodes are black)

(Cases A and B, specifically), we have

e1 − e2 = |Ti|rdsd − rqα|Ti|sd

= |Ti|sd(rd − αrq)

{


Second, if both roots have at least one storage descen-dant, then according to the four-case definition of energycost given in the previous section (Cases D and C,specifically), we have

e1 − e2 = ((d1 + 1)rdsd + birqsq + rqαd2sd)

−(rqα|Ti|sd + birqsq)

= (d1 + 1)sd(rd − αrq)

{


In the first tree with a forwarding root, recall that d1

is the number of forwarding descendants of the rootwithout any storage nodes on their paths to the rootand that d2 is the number of storage descendants plusthe number of forwarding descendants with at least onestorage node on their paths to the root. Also recall thatd1 + d2 = |Ti| − 1.From the above lemma, we can conclude that if αrq ≥

rd then every node (except for the root/sink, which isalways a storage node) in the sensor network must be aforwarding node to minimize the energy cost. However,if αrq < rd, things are a little tricky. Although the root ofthe tree, say i, must be a storage node, it may not be truethat every node in the sensor network must be a storagenode to minimize the energy cost. One would think thatin order for the tree to incur a minimum energy cost,all of its subtrees should incur a minimum energy cost.However, since αrq < rd, these optimal subtrees all havestorage nodes as their roots. This means that the energycost of root i will have to include the cost for querydiffusion birqsq since it has storage children, i.e., e(i) =rqα|Ti|sd + birqsq. The cost for query diffusion, however,can be eliminated if all subtrees of i has only forwardingnodes, i.e., e(i) = rqα|Ti|sd. (See Cases C and B in thefour-case definition of e(i) in the previous section.) Thus,the minimum energy cost of the tree rooted at i shouldbe derived from the better of these two scenarios.For a tree Ti rooted at i, let Ci be the set of children of i.

Let E∗(i) be the minimum (optimal) energy cost of Ti. IfCi is empty, i.e., i is a leaf, then i must be a storage node

to achieve its minimum energy cost. So E∗(i) = rqαsd.If Ci is not empty, then for any j ∈ Ci, let Ef (j) be theenergy cost of Tj when all nodes in Tj are forwardingnodes. So

E∗(i) = min { rqα|Ti|sd + birqsq +∑

j∈Ci

E∗(j) ,

rqα|Ti|sd +∑

j∈Ci

Ef (j) }.

Algorithm 1 given in pseudo-code finds the optimalplacement of storage nodes in two cases: (1) αrq ≥ rd,(2) αrq < rd, where the first case is trivial and the secondcase is solved by dynamic programming that works fromthe bottom to the top of the tree. Assume that the n nodesin the tree T are labeled using the post-order2. A tableE∗[1..n] is used to hold the minimum energy cost of allsubtrees rooted at node i = 1, . . . , n. At the end of thecomputation, E∗[n] will hold the minimum energy costof T (which is rooted at n according to the post-orderlabeling). We also maintain a second table Ef [1..n] whichrecords the energy cost of all subtrees when all nodesin each subtree are forwarding nodes. In the algorithm,lines 5-9 compute the E∗ and Ef entries for all leavesand lines 10-19 compute the E∗ and Ef entries for theremaining nodes following our post-order.There are only O(n) entries to compute in tables E∗

and Ef and to compute each entry that corresponds toa node, only its children will have to be considered.Furthermore, each node starts as a storage node. Once itis changed to a forwarding node by the algorithm, it willnever be changed back. Therefore, the time complexityof Algorithm 1 is O(n), where n is the number of nodes.Summarizing the discussion above, we have the fol-

lowing theorem.Theorem 1: If αrq ≥ rd, then the optimal tree with the

minimum energy cost contains only forwarding nodes(except for the root). If αrq < rd, then the optimaltree can be constructed by a dynamic programmingalgorithm in O(n) time.We also observe that every node starts as a storage nodeand that once it is changed to a forwarding node, all itsdescendants are changed to forwarding nodes as well.

2. The post-order used in this paper is slightly different from thetextbook definition of post-order in that our post-order requires allleaves to be listed first.

7

Algorithm 1 Place Unlimited Storage Nodes

1: make the root a storage node2: if αrq ≥ rd then3: make all non-root nodes forwarding nodes and return4: end if5: for all leaves i do6: make i a storage node7: E∗[i] = rqαsd

8: Ef [i] = rdsd

9: end for10: for all remaining nodes i, in post-order, do11: make i a storage node12: cost1 = rqα|Ti|sd + birqsq +

P

j∈CiE∗[j]

13: cost2 = rqα|Ti|sd +P

j∈CiEf [j]

14: E∗[i] = min{cost1, cost2}15: Ef [i] = |Ti|rdsd +

P

j∈CiEf [j]

16: if cost1 ≥ cost2 then17: change each descendant of i that is a storage

node to a forwarding node18: end if19: end for

Thus, it is impossible for a forwarding node to havea storage descendant. Likewise, it is impossible for astorage node to have a forwarding ancestor. We thenhave the following corollary.Corollary 1: In the optimal tree, if i is a forwarding

node, all its descendants are forwarding nodes. If i is astorage node, all its ancestors are storage nodes.In summary, this UNLIMITED problem refers to the

scenario that the deployment budget is sufficient toupgrade every sensor to be a storage node. However,simply making all sensors storage nodes may not be thebest strategy. The appropriate deployment still dependson the characteristics of query and data generation.Intuitively, if there are a large volume of queries for acertain set of data and the reduction function yields alarge α, it would be better to transfer these data to thesink. On the other hand, if queries are infrequent andthe reply size is much less than the raw data, it wouldbe more efficient to hold the raw data locally. Accordingto Theorem 1 and Corollary 1, the optimal deploymentof storage nodes has a special property. For any pathfrom a leave to the root, there is a clear boundary thatdistinguish forwarding nodes from storage nodes. Thenodes below the boundary layer to leaves are forwardingnodes and the nodes above the boundary towards thesink are storage nodes.

5 LIMITED NUMBER OF STORAGE NODES

In the problem UNLIMITED discussed in the previ-ous section, we assume that we have enough storagenodes for the need to minimize the energy cost of thenetwork. In reality, however, storage nodes may comewith a hardware cost. Considering a limited budget fordeploying a sensor network, there might be only a smallportion of sensors as storage nodes. This is why we havealso defined the problem LIMITED, which is similar to

UNLIMITED except that we have only k storage nodes todeploy. Since the root (sink) is always a storage node, weassume that k ≥ 1 and that k−1 is the maximum numberof storage nodes that may appear as descendants of theroot. Furthermore, from the discussion in the previoussection, if αrq ≥ rd, the optimal tree has no storagenodes at all except the root. In this case, we just do notdeploy any of the k − 1 storage nodes and we get anoptimal tree. Our discussion in this section on LIMITEDis for the case of αrq < rd. Since the number of storagenodes is limited, where to place them becomes a crucialproblem. A bad placement strategy may hardly improvethe performance. Basically, there is a tradeoff betweentwo trends. On the one hand, if storage nodes are closeto the sink, i.e., at a high level in the tree structure, theycan process more raw data, thus reduce the reply sizefrom storage nodes to the sink. However, the sensornetwork spends much energy in transferring the rawdata from low level forwarding nodes to the storagenodes. On the other hand, if the storage nodes are faraway from the sink, the raw data from their descendantscan be processed earlier along the path towards the sink.However, storage nodes may cover only a few regularsensors, which leads to much raw data transferred to thesink without being processed. Besides this tradeoff, thebenefits from a storage node also depend on the loca-tions of other storage nodes. Therefore, in this section,we propose the optimal placement strategy in order tomaximize the benefits from deploying k storage nodes.

5.1 Arbitrary Trees

Assume that a communication tree T is given withup to k storage nodes already optimally deployed. Bydefinition, the energy cost of T is

∑

i∈T e(i). However,we are going to use a different and unique methodto calculate this cost, which works from the bottom ofthe tree towards the root. Starting from the leaf nodesand following the post-order until the root is eventuallyreached, for each node i, we compute the energy costalready incurred within the subtree Ti rooted at i, whichis E(i) by our notation, plus the energy cost contributedby the nodes in Ti to their ancestors, which includesboth raw data transmission cost and query reply costaccording to the four-case definition of the energy costof an individual node. Specifically, if i is a forwardingnode, it contributes a raw data transmission cost of rdsd

to each of its forwarding ancestors that lie between i andi’s closest storage ancestor (due to Cases A and D) anda query reply cost of rqαsd to each of the other ancestors(due to Cases B and D). If i is a storage node, however,it contributes a query reply cost of rqαsd to each of itsancestors (due to Cases C and D). Fig. 4 depicts the twoscenarios. The top path from node i to the root (sink) iswhen i is a forwarding node and the bottom path from ito the root is when i is a storage node. Above each node(except i) is the contribution from i to the energy costincurred at the node.

8

Storage Node

Forwarding Node

α *sdrq* α *sdrq* α *sdrq* α *sdrq* α *sdrq*

α *sdrq*α *sdrq*α *sdrq*α *sdrq*α *sdrq*α *sdrq*α *sdrq*

Sink

Sink

rd*sd rd*sd

Fig. 4: Computing the contribution to the energy cost of allancestors

Let l be the number of forwarding nodes between iand its closest storage ancestor, not including i. Let mbe the upper bound on the number of storage nodes inTi. Then, we use Ei(m, l) for the energy cost that includesE(i) and the amount contributed by the nodes in Ti tothe energy cost of their ancestors. Note that 0 ≤ m ≤ kand 0 ≤ l ≤ n− 2. In the case that i is a storage node ori’s parent is a storage node, l becomes 0. Furthermore,if m = 0, no storage node is used in Ti and if m ≥ 1, atleast one but no more than m storage nodes are used inTi. Therefore, En(k, 0) is the minimum energy cost of Twith up to k storage nodes to deploy, assuming n is thelabel for the root.When traversing the nodes in post-order in the tree

starting from the leaves, let i be the current node beingtraversed. Let di be the depth of i in the tree, which isthe number of edges on the path from i to the root n. Wecan define Ei(m, l) recursively. For notational simplicity,we first define Q0(m) and Q1(m) as follows.

Qi0(m) =

{

0 if m = 0;birqsq if m ≥ 1.

, Qi1(m) = Qi

0(m − 1).

If i is a leaf node, Ei(m, l) includes the energy cost ofi and the pre-calculated amount contributed by i to allof its di ancestors. Specifically, if i is a forwarding node,its own energy cost is rdsd and its contribution to theenergy cost of its ancestors is lrdsd + (di − l)rqαsd. If iis a storage node, its own energy cost is rqαsd and itscontribution to the energy cost of its ancestors is dirqαsd.Therefore,

Ei(m, l) =

{

(l + 1)rdsd + (di − l)rqαsd if m = 0;(di + 1)rqαsd if m ≥ 1.

If i is a forwarding non-leaf node with a child set ofCi, the upper bound m must be divided among all ofits children. Let P (m) be the set of all permutations p =(mp

j |j ∈ Ci), where∑

j∈Cimp

j = m and mpj denotes the

upper bound on the number of storage nodes for subtreeTj in permutation p. Then Ei(m, l) is defined to be thesum of the amount from all of its subtrees,

min∀p∈P (m)

{∑

j∈Ci

Ej(mpj , l + 1)},

the energy cost of i, rdsd +Qi0(m), and the pre-calculated

amount of energy cost contributed by i to its ancestors,lrdsd + (di − l)rqαsd. So,

Ei(m, l) = min∀p∈P (m)

{∑

j∈Ci

Ej(mpj , l + 1)}

+(l + 1)rdsd + (di − l)rqαsd + Qi0(m).

If i is a storage non-leaf and non-root node, the upperbound m− 1 must be divided among all of its children.Let P (m − 1) be the set of all permutations p = (mp

j |j ∈Ci), where

∑

j∈Cimp

j = m−1 and mpj denotes the upper

bound on the number of storage nodes for subtree Tj inpermutation p. Then Ei(m, l) is defined to be the sum ofthe amount from all of its subtrees,

min∀p∈P (m−1)

{∑

j∈Ci

Ej(mpj , 0},

the energy cost of i, rqαsd + Qi1(m), and the pre-

calculated amount of energy cost contributed by i to itsancestors, dirqαsd. So,

Ei(m, l) = min∀p∈P (m−1)

{∑

j∈Ci

Ej(mpj , 0)}

+(di + 1)rqαsd + Qi1(m).

Algorithm 2 given here maintains a two-dimensional(k + 1) × (n − 1) table, Ei[m, l], at each node i, where0 ≤ m ≤ k and 0 ≤ l ≤ n − 2. Assume that a post-order traversal is done beforehand and that the depthof each node is computed beforehand. Both the post-order and the depths can be obtained in O(n) time.In the algorithm, lines 1-6 computes the Ei tables forall leaves i, lines 7-12 compute the Ei tables for theremaining non-root nodes i, and line 13-14 computethe entry En[k, 0] for the root n. After all tables areconstructed, the minimum energy cost of the tree withup to k storage nodes can be found in the entry En[k, 0].Note that instead of constructing a table for the storageroot n, we compute only the needed entry for n.Assume that every node in the tree has at most c

children. To partition an upper bound m into up to cupper bounds with the sum equal to m, there are atmost

(

m+c−1m

)

=(

m+c−1c−1

)

≤(

k+c−1c−1

)

permutations. Thealgorithm constructs O(n) tables and each table consistsof O(kn) entries. To compute each entry, the time is

O((

k+c−1c−1

)

c) = O((k + c − 1)c−1c/(c − 1)!)

= O((max{k, c})c−1).

Thus, the time complexity of the algorithm isO(kn2(max{k, c})c−1). We summarize the discussionabove in the following theorem.Theorem 2: Given a communication tree with n nodes

and at most c children for each parent. Let k be the max-imum number of storage nodes that may be deployed inthe tree. Then the optimal tree with the minimum energycost can be constructed by a dynamic programmingalgorithm in O(kn2(max{k, c})c−1) time.

5.2 Regular Trees

Now we consider a special case of LIMITED, where thegiven network is a regular tree with exactly c childrenfor each non-leaf node and all leaves at the same level.For such a c-ary regular tree, we can modify Algorithm2 to achieve a faster time complexity by making use ofthe regularity of the tree structure.

9

Algorithm 2 Place Limited Storage Nodes

1: for all leaves i do2: for m = 0 to k do3: for l = 0 to n − 2 do4: if m = 0 then5: Ei[m, l] = (l + 1)rdsd + (di − l)rqαsd

6: if m ≥ 1 then Ei[m, l] = (di + 1)rqαsd

7: for all remaining non-root nodes i, in post-order, do8: for m = 0 to k do9: for l = 0 to n − 2 do10: min1 = min∀p∈P (m){

P

j∈CiEj [m

pj , l + 1]}

+(l + 1)rdsd + (di − l)rqαsd + Qi0(m)

11: min2 = min∀p∈P (m−1){P

j∈CiEj [m

pj , 0]}

+(di + 1)rqαsd + Qi1(m)

12: Ej [m, l] = min{min1, min2}13: En[k, 0] = min∀p∈P (k−1){

P

j∈CnEj [m

pj , 0]}

+rqαsd + Qi1(m)

14: return En[k, 0]

First, any subtree in a regular tree is also a regular treeand nodes at the same level have the subtrees with thesame topology. This suggests that instead of keeping atable for each node as in Algorithm 2, we may keep justone table for each level. We name the levels from bottomto top, with all leaves at level 0, all parents of the leavesat level 1, and finally the root at level ⌊logc n⌋. For eachlevel h, we define a two-dimensional table Eh[m, l] for0 ≤ m ≤ k and 0 ≤ l ≤ ⌊logc n⌋ − 1, which returns theenergy cost incurred within the subtree rooted at levelh plus the contribution from the nodes in the subtreeto their ancestors. As used previously, m is still themaximum number of storage nodes to use in the subtreeand l is the number of forwarding nodes between theroot of the subtree and the storage ancestor closest tothe root of the subtree.

We first define Q0(m) and Q1(m) as follows.

Q0(m)=

8

<

:

0 if m = 0;etr+c·ere

etr+ererqsq if m ≥ 1.

, Q1(m)=Q0(m−1).

Let H = ⌊logc n⌋. Algorithm 3 first computes the tableE0[m, l] for all leaves at level 0, for 0 ≤ m ≤ k and0 ≤ l ≤ H−1 in lines 2-5. Then it works its way up, levelby level, until level H−1 in lines 6-11. The root, which isat level H , is treated in lines 12-13, differently from theother nodes since it must be a storage node. By usingone table for each level, our algorithm will construct⌊logc n⌋ tables. This will result in savings in both spaceand time, compared with our algorithm for arbitrarytrees, which needs to construct n tables. The modifiedAlgorithm 3 for regular trees has a time complexity ofO(k(log n)2(max{k, c})c−1).

The result of the algorithm can be summarized in thefollowing theorem.

Theorem 3: Given a c-ary regular tree with n nodes. Letk be the maximum number of storage nodes that may bedeployed. Then the optimal tree with the minimum en-ergy cost can be constructed by a dynamic programmingalgorithm in O(k(log n)2(max{k, c})c−1) time.

Algorithm 3 c-ary Regular Tree

1: H = ⌊logc n⌋2: for m = 0 to k do3: for l = 0 to H − 1 do4: if m = 0 then E0[m, l] = (l + 1)rdsd + (H − l)rqαsd

5: if m ≥ 1 then E0[m, l] = (H + 1)rqαsd

6: for h = 1 to H − 1 do7: for m = 0 to k do8: for l = 0 to H − 1 do9: min1 = min∀p∈P (m){

Pc

j=1 Eh[mpj , l + 1]}

+(l + 1)rdsd + (H − h − l)rqαsd + Q0(m)10: min2 = min∀p∈P (m−1){

Pc

j=1 Eh[mpj , 0]}

+(H − h + 1)rqαsd + Q1(m)11: Eh[m, l] = min{min1, min2}12: EH [k, 0] = min∀p∈P (k−1){

Pc

j=1 EH−1[mpj , 0]}

+rqαsd + Q1(m)13: return EH [k, 0]

6 STOCHASTIC ANALYSIS

The algorithms in the previous sections aim to findthe optimal locations for storage nodes. In practice,however, sensors including the storage nodes may notbe deployed in a deterministic way. Instead, randomdeployment of sensors is commonly seen in applications.In this section, we analyze the performance of randomdeployment of storage nodes in both fixed tree anddynamic tree models.

6.1 Fixed Tree Model

Assume the forwarding nodes and storage nodes arerandomly distributed to the field with density λ and λs

respectively. For simplicity, we consider a disk networkfield, where the sink is placed at the center and R isthe radius. In the fixed tree model, the network buildsa communication tree in which each node finds theshortest path to the sink by following the tree edges.Each forwarding node sends its data to the first ancestorstorage node on the path to the sink. As our simulationand other previous research show, the radius (ri) of thearea covered by the nodes that are i hops or less from thesink is proportional to i. Let this ratio be c′ = ri

i, Thus,

we can estimate the number of nodes whose distancesto the sink are between (t − 1)c′ and tc′, i.e., the nodeswith depth t. Let num(t) represent the total number ofthe nodes whose depth values are t,

num(t) = λπ(t2c′2 − (t − 1)2c′2) = λπ(2t − 1)c′2.

For a node with depth t, let s(t) be the expected hopdistance to its closest storage ancestor. The cost causedby this node is rdsd ·s(t)+rqαsd ·(t−s(t)). The probabilitythat an individual node is a storage node is p = λs

λ.

Therefore,

s(t) = p · 0 + p(1 − p) · 1 + p(1 − p)2 · 2 + · · ·

+p(1 − p)t−1 · (t − 1) + (1 − p)t · t

= (1

p− 1)(1 − (1 − p)t).

10

The total energy cost in this model can be expressed as

E =

R

c′∑

t=1

num(t)(rdsds(t) + rqαsd(t − s(t))) + Eq,

where Eq is the cost of query diffusion. The value of c′

is related to the communication range and node density.We can obtain the value from simulation.Query messages are diffused from the sink to every

storage node. For each storage node, it incurs an extraquery diffusion cost along the path to its closest stor-age ancestor. If we assume there is no overlap amongthe paths connecting each storage node and its closeststorage ancestor, the total query diffusion cost Eq can beformulated as

Eq =∑

Rc′

t=1 num′(t)rqsqs(t), (1)

where num′(t) = λsπ(2t− 1)c′2 is the number of storagenodes whose depth values are t and recall s(t) is theexpected distance to the closest storage ancestor.

6.2 Dynamic Tree Model

The fixed tree model assumes that the communicationtree does not change according to the placement of thestorage nodes. In the dynamic tree model, after thestorage nodes have been positioned, each sensor nodechooses the best storage node for storage with respectto the minimal communication cost for data forwardingand query diffusion and reply. The storage node place-ment in this model is more complicated than that inthe fixed tree model because we need to consider theinterplay between the storage node placement and theselection of the storage node for each sensor. These twosteps affect each other dynamically.In the optimal solution, a storage node should send

query reply to the sink by following the shortest pathbecause the data from a storage node cannot be furtherreduced according to the definition of data reductionfunction. A forwarding node has to choose a storagenode for data storage to minimize the total communi-cation cost for its data including raw data transfer to thestorage node and query reply from the storage node tothe sink. Therefore, the closest storage node may not bethe best choice if it is further away from the sink yieldingmore cost on query reply. Assume the sink is located atthe origin. Let ~xi represent the location of sensor i, andfd(~xi) indicate the location of the forwarding destination(storage node) assigned to sensor i. If i is a storage node,then fd(~xi) = ~xi. The energy cost of sending raw datafrom ~xi to fd(~xi) is rdsd|~xi−fd(~xi)|. The query reply costfor the data from forwarding node i is rqαsd|fd(~xi)|. Intotal, the cost generated by a single node i in a time unitis rdsd|~xi − fd(~xi)| + rqαsd|fd(~xi)|. To find the optimalsolution, we need to minimize the cost for each sensor.The total energy cost in this model is

E =∑

i(rdsd|~xi − fd(~xi)| + rqαsd|fd(~xi)|) + Eq, (2)

where Eq is the cost of query diffusion. We find that Eq

in this dynamic tree model is the same as that in the fixedtree model. Because in both models, each storage nodeis connected to the sink by the shortest path. Therefore,we can also use Eq. (1) to estimate Eq . In the followingof this section, we will analyze the rest part of Eq. (2),which is denoted by E′.First, let us consider a scenario where a sensor at loca-

tion ~x forward its raw data to a storage node at location~y. We define a function F (~x, ~y) as the energy cost causedby this sensor, F (~x, ~y) = rdsd|~x− ~y|+ rqαsd|~y|. There arerequirements for this scenario to be valid. Essentially, forthe sensor at ~x, no other storage node could yield lesscost than the one at ~y. In the next we will derive thiscondition. We define an area G(~x, U) = {~y|F (~x, ~y) ≤ U},that is, if a sensor at ~x selects any storage node in thatarea, the energy cost for the data of that sensor wouldbe no more than U . Thus, G(~x, F (~x, ~y)) ∩ S = φ isthe requirement for the above scenario. Considering thePoisson deployment of sensors, the minimum reply costcan be expressed as

E′ = λ∫∫

P (~y ∈ S)F (~x, ~y)P (G(~x, F (~x, ~y)) ∩ S = φ)dydx,

where S is the set of all storage nodes including the sink.P (~y ∈ S) is the probability that there is a storage nodeat location ~y,

P (~y ∈ S) =

{

1 if ~y is the origin;λs otherwise.

P (G(~x, F (~x, ~y)) ∩ S = φ) represent the probabilityof G(~x, F (~x, ~y)) ∩ S = φ. As we described earlier, thiscondition means there is no other storage node in thearea G(~x, F (~x, ~y)). According to the Poisson process ofdeploying storage nodes with the density λs,

P (G(~x, F (~x, ~y)) ∩ S = φ)

=

{

e−λs|G(~x,F (~x,~y))| if F (~x, ~y) ≤ F (~x, 0);0 otherwise.

Unlike other nodes, the sink is deterministically fixed inthe network. So if area G covers the sink, there is noneed to compute the probability. The forwarding nodewill definitely send data directly to the sink.However, |G| in the formula above, called the Carte-

sian Oval, cannot be expressed in a closed form. Toapproximate the energy cost, we make each forwardingnode simply choose the closest storage node for datastorage. The network field is then divided into Voronoicells induced by storage nodes. The energy cost of thistopology is very close to the optimal case, especiallywhen λs ≪ λ.Assume there is a forwarding node i at location ~x, the

probability that i sends data through a storage node atlocation ~y becomes

P (~x → ~y) =

0 if |~x − ~y| ≥ |~x|;

e−λsπ|~x|2 if ~y is the origin;

λse−λsπ|~x−~y|2 Otherwise.

11

Thus,

E′ = λ

∫ ∫

F (~x, ~y)P (~x → ~y)dydx

= λ

∫

F (~x, 0)e−λsπ|~x|2dx

+λ

∫ ∫

|~x−~y|<|~x|F (~x, ~y)λse

−λsπ|~x−~y|2dydx. (3)

In the first term, F (~x, 0) = rdsd|~x|. Therefore,

λ∫

F (~x, 0)e−λsπ|~x|2dx = λrdsd

∫

|~x|e−λsπ|~x|2dx

= 2πλrdsd

∫ R

0

ρ2e−λsπρ2

dρ. (4)

y

x θρ

ρ’Sink

Fig. 5: A forwarding node at location ~x sends data via a storagenode at ~y.

For the second term in Eq. (3), Fig. 5 shows thevariables after coordination conversions, where ρ =|~x − ~y| and ρ′ = |~x|. F (~x, ~y) can be expressed byrdsdρ + rqαsd

√

ρ′2 + ρ2 − 2 cos θρ′ρ. Thus, the secondterm becomes:

λ

∫ ∫

|~x−~y|<|~x|F (~x, ~y)λse

−λsπ|~x−~y|2 dydx

= 4π2λλsrdsd

∫ R

0

∫ ρ′

0

e−λsπρ2

ρ2ρ′dρdρ′

+2πλλsrqαsd

∫ R

0

∫ 2π

0

∫ ρ′

0

e−λsπρ2

ρρ′√

ρ′2 + ρ2 − 2 cos θρ′ρ dρdθdρ′. (5)

Combining Eq. (4) and (5), the total energy cost exceptquery diffusion is

E′ = 2πλrdsd

∫ R

0

ρ2e−λsπρ2

dρ

+4π2λλsrdsd

∫ R

0

∫ ρ′

0

e−λsπρ2

ρ2ρ′dρdρ′

+2πλλsrqαsd

∫ R

0

∫ 2π

0

∫ ρ′

0

e−λsπρ2

ρρ′√

ρ′2 + ρ2 − 2 cos θρ′ρ dρdθdρ′.

We can further approximate E′ by examining its twocomponents separately. Let Efs be the cost of trans-ferring raw data between forwarding nodes and theirclosest storage nodes. Let Ess be the cost to relay replyfrom storage nodes to the sink. For a forwarding nodei , the expected distance to the closest storage node is

12√

λs. Thus,

Efs = λπR2rdsd1

2√

λs, (6)

where λπR2 is the total number of forwarding nodesand rdsd

12√

λsis the energy cost of transferring raw data

from an individual forwarding node to its closest storagenode. In addition, Ess = rqαsd

∑

i∈S(Di + 1)Li, whereDi is the total number of descendants and Li is thedistance to the sink. Since each forwarding node choosesthe closest storage node for data storage, the number offorwarding nodes that each storage node is responsiblefor is approximately the same. If we replace Li by themean value L′, then Ess = rqαsdL

′ ∑i∈S(Di + 1), where

∑

i∈S(Di+1) represents the number of nodes which senddata via storage nodes to the sink. Let N ′ be the numberof forwarding nodes that send data directly to the sink,∑

i∈S(Di + 1) = λπR2 − N ′. N ′ can be derived as

N ′ = λ

∫

e−λsπ|~x|2dx = λ

∫ 2π

0

∫ R

0

e−λsπρ2

ρdρdθ

= 2πλ

∫ R

0

ρe−λsπρ2

dρ =λ

λs

(1 − e−λsπR2

).

And L′ can be simply approximated as

L′ =λs

∫ R

02πr · rdr

λsπR2=

2

3R.

Therefore,

Ess = rqαsd(23R(λπR2 − λ

λs(1 − e−λsπR2

))). (7)

Combining Eq.(6) and (7),

E′ = λπR2rdsd1

2√

λs+rqαsd(

23R(λπR2− λ

λs(1−e−λsπR2

))).

7 PERFORMANCE EVALUATION

Our evaluation is based on simulation. In this section,we will present the performance results of the proposedalgorithms.

7.1 Simulation Settings

In our simulation, we consider a network of sensorsdeployed on a disk of radius 5 with the sink placed atthe center. One thousand sensor nodes (n = 1000) aredeployed to the field randomly following 2-dimensionalspatial Poisson process. Node transmission range is set to0.65. After all nodes are deployed, a routing tree rootedat the sink is constructed by flooding a message from thesink to all the nodes in the network. As we mentionedin Section 3, the message carries the number of hops ittravels so that each node chooses among its neighborsthe node that has the minimum number of hops to be itsparent. This tree topology is needed in the simulation ofthe fixed tree model. This step, however, can be skippedfor the dynamic tree model.In the rest of this section, we will present and compare

the following algorithms.

• FT-DD: It represents the fixed tree model with de-terministic deployment. In FT-DD, the storage nodesare deployed by following the dynamic programmingalgorithm according to the known tree topology.

12

• FT-RD: It represents the fixed tree model with randomdeployment. In FT-RD, we randomly select a certainnumber of nodes in the network to be storage nodes.

• DT-RD: It represents the dynamic tree model with ran-dom deployment. In this algorithm, the storage nodesare randomly deployed. After that, each forwardingnode selects the best storage node to deliver data andeach storage node replies to query by following theshortest path to the sink.

• ST-RD: It represents semi-dynamic tree model withrandom deployment which is enhanced version of FT-RD with a local adjustment. When a sensor i is up-graded to storage node in a tree structure, its siblings’children will try to set i as their parents if i is withintheir communication range.

• Greedy: It represents a greedy algorithm where themost heavily loaded sensors will be upgraded tostorage nodes. Usually, those sensors close to the sinkwill become storage nodes in this algorithm.

This list does not include the existing in-networkstorage approaches such as data-centric storage becausethe problem settings are different for a comparison.Approximately, the performance of data-centric storageis close to the random deployment in dynamic treemodel in this paper since the locations of storage sitesare determined by hash values.In addition, we use relative energy cost as perfor-

mance metrics. We measure the cost when no storagenode except the sink is deployed as the baseline. Let theenergy cost in this no storage scenario be Ef . And let theenergy cost after the storage nodes are deployed be E.The relative energy cost is defined as E

Ef, represented as

a percentage. In the rest of this section, we use “energycost” for “relative energy cost”.Due to the randomness of our simulation environ-

ment, results from the same parameter setting mightvary a lot. Therefore, for a certain set of parameters, weconduct 100 independent trials and the average energycost is used in the following analysis and comparison.Unless otherwise stated, we set the following parametersin our simulations: rd = rq = sd = sq = 1 andα = 0.5. We evaluate the energy cost by varying thenumber of storage nodes k and the data reduction rateα. Note that the energy cost is also related to rd, sd, rq

and sq. However, for comparison purpose, changingrq, sq, rd, sd will be equivalent to changing α. To simplifythe description, we fix rq, sq, rd, sd and only vary α toexamine different characteristics of data and queries.

7.2 Random Deployment

Fig. 6 shows the energy cost of random deploymentin the fixed tree model. We compare our theoreticalestimation with simulation results. As we can see fromthe figure, the theoretical estimation and the simulationmatch well. We have examined the simulation carefullyand found that many storage nodes are placed at theleaf nodes or have very few descendants. Therefore, the

0 5 10 15 20 2595

96

97

98

99

100

Number of Storage Nodes

Ene

rgy

Cos

t(%

)

Simulation ResultsTheoretical EstimateStandard Deviation

Fig. 6: FT-RD: Energy costwith varying number of stor-age nodes (k).

0 0.2 0.4 0.6 0.8 196.5

97

97.5

98

98.5

99

99.5

100

100.5

Reduction Rate

Ene

rgy

Cos

t(%

)

Fig. 7: FT-RD: The impact ofdata reduction rate (α).

data reduction for those descendants is negligible andless energy is saved compared to the case that each nodesends all the data to the sink.In Fig. 7, we show the energy cost with respect to

different data reduction rates α. We fix the number ofstorage nodes (k = 10) and change the data reductionrate α from 0.1 to 0.9. In this fixed tree model, decreasingdata reduction rate cannot improve the performance toomuch. Even when α is set to 0.1, we still have more than96% energy cost with 10 storage nodes. The reason isthat data accumulation to the storage nodes from theforwarding nodes consumes most of the energy withrespect to the query diffusion and reply. Moreover, whenα is 0.9, the energy cost is even worse than the originalcost, because the incurred query diffusion cost becomeslarger than the benefits obtained.

0 5 10 15 20 2565

70

75

80

85

90

95

100


Ene

rgy

Cos

t(%

)

Simulation ResultsTheoretical EstimateStandard Deviation

Fig. 8: DT-RD: Energy costwith varying number of stor-age nodes (k).

0 0.2 0.4 0.6 0.8 150

60

70

80

90

100

Reduction Rate

Ene

rgy

Cos

t(%

)

Fig. 9: DT-RD: The impact ofdata reduction rate (α).

The energy cost of random deployment in the dynamictree model is shown in Fig. 8. In this model, the locationof each storage node is broadcast to forwarding nodesso that they can choose the proper storage nodes todeliver data for the energy concern. In this way, wetake full advantage of every storage node and maximizetheir contributions to the whole network. As shownin Fig. 8, random deployment performs much betterin this dynamic tree model. The energy cost decreasesvery fast with increasing number of storage nodes, e.g.,with 10 storage nodes (1% of total nodes), we cansave energy by approximately 20%. Fig. 9 illustrates theimpact of data reduction rate to the energy cost in thedynamic tree model. This time, α becomes an importantparameter, because every storage node is in charge ofmany forwarding nodes in the dynamic tree model.A small decrease of α will reduce energy cost greatly.We also observe that our theoretical estimation matches

13

well with the simulation results. Although our stochasticanalysis uses some expected values and approximations,the maximum difference between the two curves is lessthan 5%.As shown above, with random deployment, the dy-

namic tree model has a significant performance improve-ment over the fixed tree model. However, the locations ofstorage nodes need to be broadcast to all other nodes andthe new tree is completely different from the originallyconstructed tree one. We consider a semi-dynamic treemodel, in which local adjustments are applied to theoriginally constructed tree. For each storage node i,all the forwarding nodes within the transmission rangeof i that have a depth no less than i’s depth selecti as parent. In result, each storage node gains moredescendants and accepts more raw data storage. Fig. 10compares the energy cost of random deployment in threemodels (fixed tree, dynamic tree and semi-dynamic tree),as well as deterministic deployment in the fixed treemodel. As shown in Fig. 10, DT-RD achieves the bestperformance, while FT-RD has the worst performance.Local adjustment in ST-RD improves the performance ofthe fixed tree model. In FT-RD, each storage node has nocontrol about how many descendants it can have. Manystorage nodes are deployed with few descendants, whichexplains why FT-RD delivers the worst performance. ST-RD allows each storage node to have some restrainedflexibility in choosing its descendants, and has a betterperformance than FT-RD. DT-RD has more flexibility inchoosing descendants, and we see a much improvedperformance.

0 5 10 15 20 2570

75

80

85

90

95

100


Ene

rgy

Cos

t(%

)

FT−RDST−RDDT−RD

Fig. 10: Comparison of energy cost with varying number ofstorage nodes (k).

7.3 Deterministic Deployment

Fig. 11 and Fig. 12 illustrate the performance of deter-ministic deployment in the fixed tree model. In the simu-lation, the locations of the storage nodes are obtained byAlgorithm 2. Compared to the random deployment (FT-RD) and greedy algorithm, deterministic deploymentsignificantly improve the performance by precisely com-puting the optimal positions to put the storage nodes.In Fig. 11, we fix the reduction rate α = 0.5, and

vary the number of storage nodes from 2 to 25. With afew storage nodes, the energy cost is sharply reduced inFig. 11. When k becomes larger, the slope gradually be-comes flatter. As shown in the figure, we can save about20% energy cost with 10 storage nodes, and 30% with

25 storage nodes. In addition, Fig. 12 shows the energycost with varying reduction rate, when the number ofstorage nodes is fixed as 10. As we can see, the energycost is nearly linear to the reduction rate α.

0 5 10 15 20 2570

75

80

85

90

95

100


Ene

rgy

Cos

t(%

)

FT−DDFT−RDGreedy

Fig. 11: Comparison of FT-DD, FT-RD and Greedy: En-ergy cost with varying num-ber of storage nodes (k).

0 0.2 0.4 0.6 0.8 160

70

80

90

100

110

Reduction Rate

Ene

rgy

Cos

t(%

)

FT−DDFT−RDGreedy

Fig. 12: Comparison of FT-DD, FT-RD and Greedy: Theimpact of data reduction rate(α).

Intuitively, the performance in the fixed tree modeldepends on which level of the tree the storage nodesare deployed at. When storage nodes are close the theleaves, which often happens in FT-RD, the benefit of datareduction is limited. On the other hand, when storagenodes are deployed close to the sink as in the greedyalgorithm, a large amount of raw data have to traversea long path to reach the storage nodes still yielding highenergy cost. The locations derived from our algorithmusually reside in the middle levels. Here is an exampleof the result. The following Table 2 shows the depthdistribution in a particular tree structure. The depth ofthe sink is 0. When we deploy 10 storage nodes insuch a tree, the derived depths of storage nodes are3,4,4,4,4,5,5,6,6,6.

D 1 2 3 4 5 6 7 8 9 10 11# 10 32 48 74 108 128 176 189 164 65 6

TABLE 2: Depth distribution (D:depth, #:number of sensors)

7.4 Network Life

Finally, load distribution and network lifetime are shownin Fig. 13 and Fig. 14. We show the workloads of the mostheavily-loaded 50 nodes in Fig. 13. In Fig. 14, we definelifetime as the time that 2% nodes are depleted of energy.In our setting, it means 20 sensors are out of operation.According to our simulation, this scenario will usuallycause disconnection in the sensor network. As we can seein Fig. 13, FT-RD almost has no improvement on load-balancing and lifetime. In contrast, FT-DD lengthens thelifetime a lot with a small number of storage nodes,although the objective of our algorithm is to minimizethe total energy cost. For example, with 15 storage nodes,the lifetime is increased by more than 60%. DT-RD doesnot perform well with only a few storage nodes becausethe sensors connecting storage nodes and the sink carrya lot of workloads for both raw data transmission andreply collection. The greedy algorithm is superior to DT-RD by specifically reducing the energy cost of the mostheavy loaded nodes.

14

0 10 20 30 40 500

50

100

150

200

250

Rank of Workload

Wor

kloa

d

No Storage NodesFT−RDDT−RDFT−DDGreedy

Fig. 13: Comparison of load-balancing: Workload is mea-sured as the size of the mes-sages sent by each sensor perunit time.

0 5 10 15 20 250.5

1

1.5

2


Life

tim

e

FT−DDDT−RDGreedy

Fig. 14: Comparison of life-time with varying number ofstorage nodes (k): Values arenormalized by the lifetimewithout storage nodes.

8 CONCLUSION

This paper considers the storage node placement prob-lem in a sensor network. Introducing storage nodesinto the sensor network alleviates the communicationburden of sending all the raw data to a central placefor data archiving and facilitates the data collection bytransporting data from a limited number of storagenodes. In this paper, we examine how to place storagenodes to save energy for data collection and data query.We consider two models: fixed tree model and dynamictree model. In the first model, we give exact solution onhow to place storage nodes to minimize total energy cost.Using stochastic analysis, we also give the performanceestimation for both models under the assumption thatthe storage nodes are deployed randomly.

ACKNOWLEDGMENT

This project was supported in part by US NationalScience Foundation grants CNS-0721443, CNS-0831904,and CAREER Award CNS-0747108.

REFERENCES

[1] P. Gupta and P. R. Kumar, “The capacity of wireless networks,” IEEETransactions on Information Theory, vol. IT-46, no. 2, pp. 388–404, Mar 2000.

[2] E. J. Duarte-Melo and M. Liu, “Data-gathering wireless sensor networks:organization and capacity,” Computer Networks (COMNET), vol. 43, no. 4,pp. 519–537, Nov. 2003.

[3] R. C. Shah, S. Roy, S. Jain, and W. Brunette, “Data MULEs: Modeling athree-tier architecture for sparse sensor networks,” in Proceedings of the1st IEEE International Workshop on Sensor Network Protocols and Applications(SPNA), Anchorage, AK, USA, May 2003.

[4] B. Sheng, Q. Li, and W. Mao, “Data storage placement in sensor networks,”in MobiHoc ’06: Proceedings of the 7th ACM International Symposium on MobileAd Hoc Networking and Computing, Florence, Italy, 2006, pp. 344–355.

[5] S. Madden, M. J. Franklin, J. M. Hellerstein, and W. Hong, “TAG: atiny aggregation service for ad-hoc sensor networks,” SIGOPS OpereratingSystem Review, vol. 36, no. SI, pp. 131–146, 2002.

[6] ——, “The design of an acquisitional query processor for sensor networks,”in Proceedings of the 22th ACM SIGMOD International Conference on Manage-ment of Data (SIGMOD ’03), New York, NY, USA, 2003, pp. 491–502.

[7] W. Heinzelman, A. Chandrakasan, and H. Balakrishnan, “Energy-efficientcommunication protocols for wireless microsensor networks,” in Proceed-ings of International Conference on System Sciences, Maui, HI, USA, Jan 2000.

[8] S. Shenker, S. Ratnasamy, B. Karp, R. Govindan, and D. Estrin, “Data-centricstorage in sensornets,” SIGCOMM Computer Communication Review, vol. 33,no. 1, pp. 137–142, 2003.

[9] S. Ratnasamy, B. Karp, S. Shenker, D. Estrin, R. Govindan, L. Yin, and F. Yu,“Data-centric storage in sensornets with GHT, a geographic hash table,”Mobile Networks and Applications, vol. 8, no. 4, pp. 427–442, 2003.

[10] J. Newsome and D. Song, “GEM: graph embedding for routing and data-centric storage in sensor networks without geographic information,” inProceedings of the 1st International Conference on Embedded Networked SensorSystems, New York, NY, USA, 2003, pp. 76–88.

[11] Y.-J. Kim, R. Govindan, B. Karp, and S. Shenker, “Geographic routing madepractical,” in Proceedings of the 2nd USENIX Symposium on Networked SystemsDesign and Implementation (NSDI ’05), Boston, MA, USA, May 2005.

[12] C. T. Ee, S. Ratnasamy, and S. Shenker, “Practical data-centric storage,” inProceedings of the 3rd USENIX Symposium on Networked Systems Design andImplementation (NSDI ’06), San Jose, CA, USA, May 2006.

[13] J. Ahn and B. Krishnamachari, “Fundamental scaling laws for energy-efficient storage and querying in wireless sensor networks,” in Proceedingsof the 7th ACM International Symposium on Mobile Ad Hoc Networking andComputing, New York, NY, USA, 2006, pp. 334–343.

[14] D. Ganesan, B. Greenstein, D. Estrin, J. Heidemann, and R. Govindan,“Multiresolution storage and search in sensor networks,” ACM Trans.Storage, vol. 1, no. 3, pp. 277–315, 2005.

[15] M. Li, D. Ganesan, and P. Shenoy, “PRESTO: Feedback-driven data man-agement in sensor networks,” in Proceedings of the 3rd USENIX Symposiumon Networked Systems Design and Implementation (NSDI ’06), San Jose, CA,USA, May 2006.

[16] A. Demers, J. Gehrke, R. Rajaraman, N. Trigoni, and Y. Yao, “The cougarproject: a work-in-progress report,” SIGMOD Rec., vol. 32, no. 4, pp. 53–59,2003.

[17] J. Gehrke and S. Madden, “Query processing in sensor networks,” IEEEPervasive Computing, vol. 03, no. 1, pp. 46–55, 2004.

[18] G. Mathur, P. Desnoyers, D. Ganesan, and P. Shenoy, “Ultra-low power datastorage for sensor networks,” in Proceedings of the 5th International Conferenceon Information Processing in Sensor Networks, New York, NY, USA, 2006, pp.374–381.

[19] D. Zeinalipour-Yazti, S. Lin, V. Kalogeraki, D. Gunopulos, and W. A. Najjar,“MicroHash: An efficient index structure for flash-based sensor devices.”in Proceedings of the 4th USENIX Conference on File and Storage Technologies,2005.

[20] G. Mathur, P. Desnoyers, D. Ganesan, and P. Shenoy, “CAPSULE: Anenergy-optimized object storage system for memory-constrained sensordevices.” in Proceedings of the 4th International Conference on EmbeddedNetworked Sensor Systems, Boulder, Colorado, USA, 2006.

[21] F. Baccelli, M. Klein, M. Lebourges, and S. Zuyev, “Stochastic geometryand architecture of communication networks,” J. Telecommunication Systems,vol. 7, pp. 209–227, 1997.

[22] F. Baccelli and S. Zuyev, “Poisson-Voronoi spanning trees with applicationsto the optimization of communication networks,” Operations Research,vol. 47, no. 4, pp. 619–631, 1999.

[23] S. J. Baek, G. de Veciana, and X. Su, “Minimizing energy consumption inlarge-scale sensor networks through distributed data compression and hi-erarchical aggregation,” IEEE JSAC Special Issue on Fundamental PerformanceLimits of Wireless Sensor Networks, vol. 22, no. 6, pp. 1130–1140, August 2004.

[24] B. Sheng, C. C. Tan, Q. Li, and W. Mao, “An approximation algorithmfor data storage placement in sensor networks,” in Proceedings of the2nd International Conference on Wireless Algorithms, Systems and Applications(WASA ’07), Chicago, IL, USA, August 2007.

Bo Sheng is a postdoctoral research associatein the College of Computer and Information Sci-ence at Northeastern University. He received hisPh.D. in computer science from the College ofWilliam and Mary in 2010. His research inter-ests include wireless networks and embeddedsystems with an emphasis on the efficiency andsecurity issues.

Qun Li is an associate professor in the Depart-ment of Computer Science at the College ofWilliam and Mary. He holds a PhD degree incomputer science from Dartmouth College. Hisresearch interests include wireless networks,sensor networks, RFID, and pervasive comput-ing systems. He received the NSF Career awardin 2008.

Weizhen Mao received her Ph.D. degree inComputer Science from Princeton University un-der the supervision of Andrew C.-C. Yao. Hercurrent research interests include the designand analysis of online and approximation algo-rithms for problems arising from application ar-eas, such as multi-core job scheduling, wirelessand sensor networks. She is currently an asso-ciate professor in the Department of ComputerScience at the College of William and Mary.

Optimize Storage Placement in Sensor Networkswm/J14.pdf · placement design. Instead, we consider...

Documents

Transcript of Optimize Storage Placement in Sensor Networkswm/J14.pdf · placement design. Instead, we consider...