Compressed Sensing based Low-Power Multi-view Video Coding and Transmission in Wireless Multi-path Multi-hop Networks

Nan Cen, Member, IEEE, Zhangyu Guan, Senior Member, IEEE, and Tommaso Melodia, Fellow, IEEE


Abstract—Wireless Multimedia Sensor Networks (WMSNs) are increasingly being deployed for surveillance, monitoring and Internet-of-Things (IoT) sensing applications where a set of cameras capture and compress local images and then transmit the data to a remote controller. Such captured local images may also be compressed in a multi-view fashion to reduce the redundancy among overlapping views. In this paper, we present a novel paradigm for compressed-sensing-enabled multi-view coding and streaming in WMSNs. We first propose a new encoding and decoding architecture for multi-view video systems based on Compressed Sensing (CS) principles, composed of cooperative sparsity-aware block-level rate-adaptive encoders, feedback channels and independent decoders. The proposed architecture leverages the properties of CS to overcome many limitations of traditional encoding techniques, specifically massive storage requirements and high computational complexity. Then, we present a modeling framework that exploits the aforementioned coding architecture. The proposed mathematical problem minimizes the power consumption by jointly determining the encoding rate and the multi-path rate allocation subject to distortion and energy constraints. Extensive performance evaluation results show that the proposed framework is able to transmit multi-view streams with guaranteed video quality at lower power consumption.

Index Terms—Compressed Sensing, Multi-view Video Streaming, Network Optimization, Internet of Things.

1 INTRODUCTION

Wireless Multimedia Sensor Networks (WMSNs) are composed of low-cost, battery-operated wireless camera sensors with the ability to acquire, process and transmit visual data. By extending the capabilities of traditional Wireless Sensor Networks (WSNs), WMSNs play a paramount role in the evolution of the Internet-of-Things (IoT) paradigm by enabling multimedia data gathering, processing and analysis, for example, for disaster monitoring, pervasive surveillance, and traffic and infrastructure monitoring [2] in smart cities. However, WMSNs pose additional challenges compared to traditional WSNs because they require intense processing

N. Cen is with the Department of Computer Science, Missouri University of Science and Technology, Rolla, MO 65401 USA. Zhangyu Guan is with the Department of Electrical Engineering, University at Buffalo, The State University of New York, Buffalo, NY 14260 USA. T. Melodia is with the Department of Electrical and Computer Engineering, Northeastern University, Boston, MA 02115 USA. E-mail: [email protected], [email protected], [email protected].
This work is based upon material supported by the Office of Naval Research (ONR) under grant N00014-17-1-2046.
A preliminary shorter version of this paper [1] appeared in the ACM Intl. Symposium on Mobile Ad Hoc Networking and Computing (ACM MobiHoc).

ability and high-bandwidth availability. Given that sensor nodes in WMSNs are characterized by tight energy budgets and limited processing and bandwidth, how to design a low-complexity, low-power framework for data compression, processing and networking is a critical issue.

Recently, compressed sensing (CS) has been proposed as a possible solution to enable video streaming in resource-constrained WMSNs. CS-based imaging systems are able to reconstruct image or video signals from a relatively "small" number of (random or deterministic) linear combinations of the original image pixels, referred to as measurements, without collecting the entire frame [3], [4], thereby offering a promising alternative to traditional video encoders by acquiring and compressing video or images simultaneously at very low encoder-side computational complexity [5]. This attractive feature has motivated a number of works applying CS to video streaming in low-power wireless surveillance scenarios. For example, [6]–[8] mainly concentrate on single-view CS-based video compression, by exploiting temporal correlation among successive video frames [6], [7] or considering energy-efficient rate allocation in WMSNs with traditional CS reconstruction methods [8]. In [9], we showed that CS-based wireless video streaming can deliver surveillance-grade video for a fraction of the energy consumption of traditional systems based on predictive video encoding such as H.264. In addition, [8] illustrated and evaluated the error-resilience property of CS-based video streaming, which results in graceful quality degradation over lossy wireless links. A few recent contributions [10]–[13] have proposed CS-based multi-view video streaming techniques, primarily focusing on an independent-encoder and joint-decoder paradigm, which exploits the implicit correlation among multiple views at the decoder side to improve the resulting video quality using complex joint reconstruction algorithms.

From a system view of multi-view video streaming, besides visual data acquisition and compression, how to achieve power-efficient, quality-assured data transmission over a multi-hop wireless sensor network is another important open problem. Very limited work has been reported in the literature to address this issue, especially for CS-based streaming systems. For example, [14] and [15] have looked at this problem by considering


traditional encoding paradigms, e.g., H.264 or MPEG-4; these contributions focus on video transmission in single-hop wireless networks and provide a framework to improve power efficiency by adjusting encoding parameters, such as the quantization step (QS) size, to adapt the resulting rate.

To the best of our knowledge, we propose for the first time a holistic coding and transmission paradigm for low-complexity, low-power compressed-sensing-enabled multi-view video streaming in multi-hop wireless sensor networks. The objective is to efficiently deliver high-quality video from resource-limited video sensors. To achieve this objective, we first propose a novel CS-based multi-view coding and decoding architecture composed of cooperative encoders and independent decoders. Unlike existing works [10]–[12], the proposed system is based on independent encoding and independent decoding procedures with limited channel feedback information and negligible content sharing among camera sensors. Furthermore, we propose a power-efficient, quality-guaranteed rate allocation algorithm based on a compressive Rate-Distortion (R-D) model for multi-view video streaming in multi-path multi-hop wireless sensor networks with lossy links. Our work makes the following contributions:

- CS-based multi-view video coding architecture with independent encoders and independent decoders. Different from state-of-the-art multi-view coding architectures, which are based either on joint encoding or on joint decoding, we propose a new CS-based sparsity-aware independent encoding and decoding multi-view structure that relies on lightweight feedback and inter-camera cooperation.

- Sparsity estimation. We develop a novel adaptive approach to estimate block sparsity based on the reconstructed frame at the decoder. The estimated sparsity is then used to calculate the block-level measurement rate to be allocated with respect to a given frame-level rate. Next, the resulting block-level rates are transmitted back to the encoder through the feedback channel. The encoder that is selected to receive the feedback information, referred to as the reference view (R-view), shares the content with nearby non-reference views (NR-views).

- Block-level rate-adaptive multi-view encoders. The R-view and NR-views perform block-level CS encoding independently based on the shared block-level measurement rate information. The objective is not only to implicitly leverage the considerable correlation among views, but also to adaptively balance the number of measurements among blocks with different sparsity levels. Our experimental results show that the proposed method outperforms state-of-the-art CS-based encoders with equal block-level measurement rates by up to 5 dB in terms of Peak Signal-to-Noise Ratio (PSNR).

- Modeling framework for CS-based multi-view video streaming in multi-path multi-hop wireless sensor networks. We consider a rate-distortion model of the proposed streaming system that captures packet losses caused by unreliable links and playout deadline violations. Based on this model, we propose a two-fold (frame-level and path-level) rate control algorithm designed to minimize the network power consumption under constraints on the minimum required video quality for multi-path multi-hop multi-view video streaming scenarios.

The rest of the paper is organized as follows. In Section 2, we discuss related work. In Section 3, we review a few preliminary notions. In Section 4, we introduce the proposed CS-based multi-view video encoding/decoding architecture. In Section 5, we present a modeling framework to design optimization problems for multi-view streaming in multi-hop sensor networks and propose a solution algorithm. Finally, simulation results are presented in Section 6, while in Section 7 we draw the main conclusions and discuss future work.

2 RELATED WORKS

CS-based Multi-view Video. More recently, several proposals have appeared for CS-based multi-view video coding based on the Distributed Video Coding (DVC)^1 architecture [10] [13] [18] [19] [20] [21] [22] [23] [24]. In [10], a distributed multi-view video coding scheme based on CS is proposed, which assumes the same measurement rates for different views and can only be applied together with specific structured dictionaries as the sparse representation matrix. A linear operator [20] is proposed to describe the correlations between images of different views in the compressed domain; the authors then use it to develop a novel joint image reconstruction scheme. In [18], the authors propose a novel CS joint multi-view reconstruction method guided by spatial correlation and low-rank background constraints. [19] presents a joint optimization model (JOM) for compressed-sensing-based multi-view image reconstruction, which jointly optimizes an adaptive disparity-compensated residual total variation (ARTV) and a multi-image nonlocal low-rank tensor (MNLRT). The authors of [21] propose a CS-based joint reconstruction method for multi-view images, which uses the two images from the two nearest views with higher measurement rates than the current image (the right and left neighbors) to calculate a prediction frame. The authors then further improve the performance by way of a multi-stage refinement procedure [22] via residual recovery. The reader is referred to [21] [22] and references therein for details. Disparity-based joint reconstruction for multi-view video is also proposed in [23] and [24], where different reconstruction methods, i.e., residual-based and total-variation-based approaches, are adopted, respectively. In our previous work [13], we proposed a motion-aware joint multi-view video reconstruction method based on a newly designed inter-view motion-compensated side information generation approach. Differently, in this article, we propose a novel CS-based independent-encoding and

1. DVC algorithms (aka Wyner-Ziv coding [16] [17]) exploit the source statistics at the decoder, thus shifting the complexity from the encoder side to the decoder side.


independent-decoding architecture for multi-view video systems based on new cooperative sparsity-aware block-level rate-adaptive encoders.

Energy-efficient CS-enabled Video Streaming. Few articles have investigated energy-constrained compressively-sampled video streaming. [25] presents a low-complexity and energy-efficient image compressive transmission scheme for camera sensor networks, where the authors use the residual energy of camera sensor nodes to control the image quality and balance the energy consumption of nodes. In [9], an analytical/empirical rate-energy-distortion model is developed to predict the received video quality when the overall energy available for both encoding and transmission of each frame is fixed and limited and the transmissions are affected by channel errors. The model determines the optimal allocation of encoded video rate and channel coding rate for a given available energy budget. [26] proposes a cooperative relay-assisted compressed video sensing system that takes advantage of the error resilience of compressively-sampled video to maintain good video quality at the receiver side while significantly reducing the required SNR, and hence the required transmission power. Different from these previous works, which mainly aim at single-view, single-path CS-based video streaming, in this article we consider CS-based multi-view video streaming in multi-path multi-hop wireless sensor networks.

3 PRELIMINARIES

3.1 Compressed Sensing Basics

We first briefly review basic concepts of CS for signal acquisition and recovery, especially as applied to CS-based video streaming. We consider an image signal vectorized and represented as x ∈ R^N, where N = H × W is the number of pixels in the image, and H and W represent the dimensions of the captured scene. Each element x_i denotes the i-th pixel in the vectorized image signal. Most natural images are known to be very nearly sparse when represented using some transform basis Ψ ∈ R^{N×N}, e.g., the Discrete Wavelet Transform (DWT) or Discrete Cosine Transform (DCT), written as x = Ψs, where s ∈ R^N is the sparse representation of x. If s has at most K nonzero components, we call x a K-sparse signal with respect to Ψ.

In a CS-based imaging system, sampling and compression are executed simultaneously through a linear measurement matrix Φ ∈ R^{M×N}, with M ≪ N, as

y = Φx = ΦΨs, (1)

with y ∈ R^M representing the resulting sampled and compressed vector.

It was proven in [3] that if A ≜ ΦΨ satisfies the following Restricted Isometry Property (RIP) of order K,

(1 − δ_K) ||s||_2^2 ≤ ||As||_2^2 ≤ (1 + δ_K) ||s||_2^2,   (2)

with 0 < δ_K < 1 being a small "isometry" constant and ||·||_2 denoting the ℓ_2 norm, then we can recover the optimal sparse representation s* of x by solving the following optimization problem

P1: Minimize ||s||_0
    Subject to: y = ΦΨs   (3)

by taking only

M = c · K log(N/K)   (4)

measurements, where c is some predefined constant. Afterwards, x can be obtained by

x = Ψs*.   (5)

However, problem P1 is NP-hard in general, and in most practical cases the measurements y may be corrupted by noise, e.g., channel noise or quantization noise. Most state-of-the-art work therefore relies on ℓ_1 minimization with relaxed constraints, in the form

P2: Minimize ||s||_1
    Subject to: ||y − ΦΨs||_2 ≤ ε   (6)

to recover s. Note that P2 is a convex optimization problem, and researchers in sparse signal reconstruction have developed various solvers [27]–[29]. For example, the Least Absolute Shrinkage and Selection Operator (LASSO) solver [28] can solve problem P2 with computational complexity O(M^2 N). We consider a Gaussian random measurement matrix Φ in this paper.
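As a concrete reference point, the sketch below acquires a DCT-sparse signal with a Gaussian Φ and recovers it with an off-the-shelf LASSO solver, which solves a Lagrangian form of P2. The signal length, sparsity, and regularization weight α are illustrative assumptions, not values from the paper.

```python
import numpy as np
from scipy.fft import idct
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
N, M, K = 256, 96, 8                          # illustrative: signal length, measurements, sparsity

# Synthesis DCT basis: x = Psi @ s, with s the DCT-domain representation
Psi = idct(np.eye(N), axis=0, norm='ortho')
s_true = np.zeros(N)
s_true[rng.choice(N, K, replace=False)] = rng.normal(size=K)
x = Psi @ s_true                              # K-sparse signal with respect to Psi

Phi = rng.normal(size=(M, N)) / np.sqrt(M)    # Gaussian measurement matrix
y = Phi @ x                                   # eq. (1): sample and compress in one step

# LASSO minimizes ||y - Phi Psi s||_2^2 / (2M) + alpha ||s||_1,
# a Lagrangian relaxation of P2 in eq. (6)
lasso = Lasso(alpha=1e-3, fit_intercept=False, max_iter=20000)
lasso.fit(Phi @ Psi, y)
x_hat = Psi @ lasso.coef_                     # eq. (5): back to the pixel domain
print('reconstruction MSE:', np.mean((x - x_hat) ** 2))
```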

3.2 Rate-Distortion Model for Compressive Imaging

Throughout this paper, end-to-end video distortion is measured as the mean squared error (MSE), a widely used performance measure in the field of signal processing, especially for objective image quality assessment, where the quality of images is measured algorithmically [30]. Since Peak Signal-to-Noise Ratio (PSNR) is a more common metric in the video coding community, we use PSNR = 10 log10(255^2 / MSE) to illustrate simulation results. The distortion at the decoder, D_dec, in general includes two terms: D_enc, the distortion introduced by the encoder (e.g., too few measurements and quantization); and D_loss, the distortion caused by packets lost over unreliable wireless links or dropped for violating playout deadlines because of bandwidth fluctuations. Therefore,

D_dec = f(D_enc, D_loss).   (7)

To the best of our knowledge, only a few works [8] have investigated rate-distortion models for compressive video streaming, and without considering losses. For example, [8] extends the distortion model in [31] to CS video transmission as

D(R) = D0 + θ / (R − R0),   (8)

where D0, θ and R0 are image- or video-dependent constants that can be determined by linear least-squares


fitting techniques; R = M/N is the user-controlled measurement rate of each video frame.
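The fitting itself takes a few lines. The sketch below uses a generic nonlinear least-squares routine on assumed (R, MSE) sample pairs, whereas the paper uses linear least squares; the sample values and initial guesses are illustrative.

```python
import numpy as np
from scipy.optimize import curve_fit

# Illustrative (R, MSE) pairs; in practice these come from re-sampling and
# reconstructing a frame at several measurement rates (see Section 5)
R_samples = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6])
D_samples = np.array([310.0, 120.0, 70.0, 48.0, 36.0, 29.0])   # assumed values

def rd_model(R, D0, theta, R0):
    # Parametric rate-distortion model of eq. (8)
    return D0 + theta / (R - R0)

(D0, theta, R0), _ = curve_fit(rd_model, R_samples, D_samples,
                               p0=(10.0, 30.0, 0.0),
                               bounds=([0.0, 0.0, -1.0], [500.0, 100.0, 0.09]))
print(f'D0={D0:.2f}, theta={theta:.3f}, R0={R0:.3f}')
print('predicted MSE at R=0.25:', rd_model(0.25, D0, theta, R0))
```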

4 CS-BASED MULTI-VIEW CODING ARCHITECTURE DESIGN

In this section, we introduce a novel encoding/decoding architecture for CS multi-view video streaming. The proposed framework is based on three main components: (i) cooperative sparsity-aware block-level rate-adaptive encoders, (ii) independent decoders, and (iii) a centralized controller located at the decoder. As illustrated in Fig. 1 for a two-view example, camera sensors acquire a scene of interest with adaptive block-level rates and transmit the sampled measurements to the base station/controller through a multi-path multi-hop wireless sensor network. Then, the centralized controller calculates the relevant information and feeds it back to the selected R-view. The R-view then shares the limited feedback information with the other view, the NR-view. The architecture can be easily extended to V ≥ 2 views.

Different from existing compressive encoders with equal block measurement rates [7], [8], the objective of the proposed framework is to improve the reconstruction quality by leveraging each block's sparsity as a guideline to adapt the block-level measurement rate. We next describe how to implement the proposed paradigm by discussing each component in detail.

4.1 Cooperative Block-level Rate-adaptive Encoder

To reduce the computational burden at encoders embedded in power-constrained devices, most state-of-the-art multi-view proposals focus on developing complex joint reconstruction algorithms to improve the reconstruction quality. Differently, in our architecture we obtain improved quality only through sparsity-aware encoders.

To illustrate the idea, Fig. 2(b) depicts the sparse representation of Fig. 2(a) with respect to a block-based DCT transformation. We can observe that sparsity differs among blocks; e.g., the blocks within the coat area are sparser than others. According to the basic compressed sensing theory in Section 3.1, (4) indicates that the number of required measurements grows with the number K of nonzero coefficients, i.e., sparser blocks need fewer measurements. Therefore, we propose to adapt the measurement rate at the block level according to sparsity information: more measurements are allocated to less-sparse blocks, and vice versa.

In our work, the number of required measurements M_vf^i for block i in frame f of view v, 1 ≤ i ≤ B, is calculated based on the sparsity estimated at the centralized controller and sent back via a feedback channel. Here, B = N/N_b denotes the total number of blocks in one frame, with N and N_b being the total number of pixels in one frame and in one block, respectively. Assume that we have received {M_vf^i}_{i=1}^B. Then, the encoding process is similar to (1), described as

y_vf^i = Φ_vf^i x_vf^i,   (9)

where y_vf^i ∈ R^{M_vf^i} and Φ_vf^i ∈ R^{M_vf^i × N_b} are the measurement vector and measurement matrix for block i in frame f of view v, respectively, and x_vf^i ∈ R^{N_b} represents the original pixel vector of block i. From (9), we can see that M_vf^i varies among blocks from 1 to N_b, thereby implementing block-level rate adaptation. In real applications, the block-level rate-adaptive encoding process can be implemented using a block-wise lensless compressive camera [32]. In Section 6, simulation results will show that this approach can improve the quality by up to 5 dB compared with encoders using an equal block-level measurement rate.

Mean value subtraction. The CS-based imaging system acquires and compresses each frame simultaneously through simple linear operations as in (1). Therefore, it can help reduce the energy consumption compared with traditional signal acquisition and encoding approaches (e.g., H.264, MJPEG), as validated in [33]. In traditional video coding techniques, a frequency transform (e.g., an 8×8 block DCT) is a must-have component, while inter-view prediction (e.g., disparity estimation) and inter-frame prediction (e.g., motion estimation) are additions; together these result in higher computational complexity than the linear operations of CS-based imaging systems, and thus consume more power. In this paper, we exploit the inter-view correlation by using the sparsity of the R-view estimated at, and fed back from, the receiver. However, the compression rate of CS is not as high as that of traditional encoding schemes [9]; there is clearly an energy-consumption trade-off between the compression rate and the bit transmission rate, and [9] analyzes the rate-energy-distortion behavior of a compressive video sensing encoder. To improve the compression rate, we perform mean value subtraction, which further reduces the number of transmitted bits. How to obtain the mean value m is discussed in Section 4.3. Since the original pixels are not available at the compressive encoder, we perform the mean value subtraction in the measurement domain. First, we build a mean value vector m ∈ R^{N_b} with the same dimensions as x_vf^i, where each element is equal to m. Then, we use the same block-level measurement matrix Φ_vf^i to sample m and subtract the result from y_vf^i as

ỹ_vf^i = y_vf^i − Φ_vf^i m = Φ_vf^i (x_vf^i − m).   (10)

After sampling, ỹ_vf^i is transmitted to the decoder. From (10), we can see that the proposed mean value subtraction in the measurement domain is equivalent to subtraction in the pixel domain.
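As a concrete illustration, the following NumPy sketch samples a few blocks with per-block measurement counts as in (9) and verifies the measurement-domain mean subtraction of (10); the block count, measurement counts, and pixel values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
Nb = 256                                    # pixels per 16x16 block
blocks = [rng.integers(0, 256, Nb).astype(float) for _ in range(4)]
M_fb = [40, 120, 80, 60]                    # per-block measurement counts (fed back)
m = np.mean(np.concatenate(blocks))         # frame mean value m, cf. eq. (17)

measurements = []
for x_i, M_i in zip(blocks, M_fb):
    Phi_i = rng.normal(size=(M_i, Nb)) / np.sqrt(M_i)   # block measurement matrix
    y_i = Phi_i @ x_i                       # eq. (9): block-level CS sampling
    y_tilde = y_i - Phi_i @ (m * np.ones(Nb))           # eq. (10), measurement domain
    # Equivalent to sampling the mean-subtracted block directly:
    assert np.allclose(y_tilde, Phi_i @ (x_i - m))
    measurements.append(y_tilde)
```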

Next, to validate the effectiveness of mean value subtraction, we take the Vassar sequence as an example. We select a uniform quantization method, whose forward quantization and reconstruction stages can be expressed as q = sgn(x) · ⌊|x|/∆ + 1/2⌋ and x̂ = ∆ · q, respectively. Here, x, q, x̂ and ∆ represent the original signal, the quantized signal, the de-quantized signal and the quantization step size, respectively.


Fig. 1: Encoding/decoding architecture for multi-hop CS-based multi-view video streaming.


Fig. 2: Block Sparsity: (a) Original image, (b) Block-based DCT coefficients of (a).


Fig. 3: Comparison of (a) PSNR, (b) the number of transmitted bits, and (c) the compression rate between approaches with and without mean subtraction.

Figure 3 compares the PSNR, the number of transmitted bits, and the compression rate with and without mean subtraction, where a measurement rate of 0.2 is used and the total number of bits in the original frame is 320 × 240 × 8 = 614400 bits. Quantization step sizes from the set {1, 2, 3, 4, 8, 16, 32, 64, 128, 256} are selected. From Fig. 3(a), we can observe that mean subtraction has a negligible effect on the reconstruction quality, and there is no significant quality degradation when the quantization step size is less than 32. This is because measurement values can be on the order of thousands or tens of thousands, compared to original pixel values of at most 255. Figures 3(b) and (c) illustrate that with mean subtraction the total number of bits transmitted for one frame is significantly reduced, by up to 30 kbits, compared to not using mean subtraction, which corresponds to an improvement in compression rate (defined as the ratio between the number of transmitted bits and the total number of bits in the original frame) from 0.2391 to 0.1902. Besides mean value subtraction, in future work we can also exploit the temporal correlation [9] between successive frames to further improve the compression ratio.

Cooperation via sparsity pattern sharing. Multi-view video streaming is based on reducing the redundancy among views captured by arrays of camera sensors that are assumed to be close to each other. Most state-of-the-art literature adopts the distributed source coding architecture [16], [17], where a reference view transmits more measurements than the other, non-reference views, and the receiver then decodes jointly by exploiting the implicit correlation among views. Instead, we allow the encoders to explicitly cooperate to a certain extent. For example, the R-view selected by the centralized controller periodically receives feedback information, i.e., {M_i}_{i=1}^B and m, and then shares it with the NR-views in the same group. Since camera sensors in the same group are assumed to be close to each other, the block sparsity is correlated among views. By using the same sparsity information, we can directly exploit multi-view correlation at the encoders, resulting in a clean-slate compressive multi-view coding framework with simple encoders and simple decoders but improved reconstruction quality.

4.2 Independent Decoder

As mentioned above, the proposed framework results in relatively simple decoders. At each decoder, the received measurement vector ŷ_vf^i, a distorted version of ỹ_vf^i because of the joint effects of quantization, transmission errors, and packet drops, is independently decoded. The optimal solution s_vf^{i,*} can be obtained by solving

P3: Minimize ||s_vf^i||_1
    Subject to: ||ŷ_vf^i − Φ_vf^i Ψ_b s_vf^i||_2 ≤ ε,   (11)

where Ψ_b ∈ R^{N_b × N_b} represents the sparsifying matrix (the 2-D DCT in this work). We then use (5) to obtain the reconstructed block x̂_vf^i = Ψ_b s_vf^{i,*}. Afterward, {x̂_vf^i}_{i=1}^B can simply be reorganized to obtain the reconstructed frame x̂_vf.
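For completeness, a minimal sketch of the independent decoder: each block solves P3 in (11) via a Lagrangian LASSO, and the decoded blocks are then reorganized into the frame. A 1-D DCT stands in for the 2-D block DCT used in the paper, and the regularization weight is an assumed stand-in for the noise tolerance ε.

```python
import numpy as np
from scipy.fft import idct
from sklearn.linear_model import Lasso

def decode_block(y_i, Phi_i, Nb=256, alpha=1e-3):
    # Solve a Lagrangian form of P3 (eq. (11)) for one block
    Psi_b = idct(np.eye(Nb), axis=0, norm='ortho')      # sparsifying basis
    lasso = Lasso(alpha=alpha, fit_intercept=False, max_iter=20000)
    lasso.fit(Phi_i @ Psi_b, y_i)
    return Psi_b @ lasso.coef_                           # x_i = Psi_b s_i*, cf. eq. (5)

def decode_frame(ys, Phis, blocks_per_row, Nb=256):
    # Independently decode all blocks, then reorganize them into the frame
    blocks = [decode_block(y, P, Nb) for y, P in zip(ys, Phis)]
    side = int(np.sqrt(Nb))                              # 16 for 16x16 blocks
    rows = [np.hstack([b.reshape(side, side) for b in
                       blocks[r * blocks_per_row:(r + 1) * blocks_per_row]])
            for r in range(len(blocks) // blocks_per_row)]
    return np.vstack(rows)
```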

4.3 Centralized Controller

The centralized controller is the key component at the receiver; it is mainly in charge of selecting the R-view and estimating the sparsity and mean value to be sent back to the transmitter via a feedback link assumed delay-negligible [34] [35] and error-free.^2 How to implement a fast feedback channel in practical scenarios is an important and feasible task that could further improve the performance of the proposed framework, but it is beyond the scope of this paper. Additionally, the controller is responsible for running the power-efficient multi-path rate allocation algorithm discussed in Section 5. Next, we introduce the three key functions executed at the controller in sequence, i.e., R-view selection, sparsity estimation, and mean value estimation.

R-view selection. The controller selects a view to be used as the reference view (R-view) among the views in the same group and then sends feedback information to the selected R-view. For this purpose, the controller first calculates the Pearson correlation coefficient between the measurement vectors of any two views as

ρ_mn = corr(y_mf, y_nf),  ∀ m ≠ n, m, n = 1, . . . , V,   (12)

where y_mf is the simple cascaded version of all y_mf^i and corr(y_mf, y_nf) ≜ cov(y_mf, y_nf) / (σ_mf σ_nf). Then, the view m*, referred to as the R-view, is selected by solving

m* = argmax_{m=1,...,V} ρ_m,   (13)

where ρ_m ≜ (1/(V−1)) Σ_{n≠m} ρ_mn denotes the average Pearson coefficient of view m. From (13), we can see that the view with the highest average Pearson coefficient is selected as the R-view.^3 The reconstructed frame x̂_vf of the R-view is then used to estimate the block sparsity K_i of each block i and the frame mean value m.

Table 1 shows the calculated ρ_m for the Vassar, Exit and Ballroom 5-view sequences at lower resolution (320 × 240, denoted L) and higher resolution (640 × 480, denoted H). We can see that the average Pearson correlation coefficient of view 3 is the largest in all scenarios, while the degree of correlation decreases as resolution increases. Therefore, view 3 is selected as the R-view. Moreover, we take the Vassar 5-view sequences as an example to quantify the quality gain obtained when views other than view 3 are selected as the R-view, for the lower and higher resolutions, as shown in Table 2. We can observe that the improved average PSNR is proportional to ρ_m, and selecting view 3 as the R-view results in the highest average PSNR gain, i.e., 1.6674 dB and 1.3132 dB for the lower and higher resolution scenarios, respectively.
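A minimal sketch of the controller-side R-view selection in (12)-(13), assuming the cascaded per-view measurement vectors have equal length so that pairwise Pearson coefficients are well defined:

```python
import numpy as np

def select_r_view(Y):
    """Return the index of the R-view per eqs. (12)-(13).

    Y: V equal-length 1-D arrays, each the cascaded measurements of one view.
    """
    V = len(Y)
    rho = np.corrcoef(np.vstack(Y))              # V x V Pearson coefficients, eq. (12)
    rho_avg = (rho.sum(axis=1) - 1.0) / (V - 1)  # average over n != m, eq. (13)
    return int(np.argmax(rho_avg))
```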

2. Since we mainly consider a slow block-fading environment, the feedback delay is significantly less than the coherence time of the fading channels concerned; thus, the effect of feedback delay is negligible. Moreover, given the small data rate, efficient error control coding over the feedback link can be used to achieve error-free feedback [36] [37].

3. The adopted Pearson correlation considers only the linear relation among views. We believe that more advanced correlation measures that also account for content features would result in more accurate R-view selection and better performance gains on top of our proposed CS-based multi-view coding/decoding architecture.


We can also see that the quality gain slightly decreases for the higher-resolution Vassar sequences because of the decreased degree of correlation. In this case, because the Vassar multi-view sequences are captured by parallel-deployed cameras with equal spacing, we obtain the same result, i.e., view 3 as the R-view, as if we were simply to choose the most central sensor. However, for scenarios with cameras that are not parallel-deployed or have unequal spacing, selecting the most central sensor is not necessarily a good choice.

Sparsity estimation. Most natural images are characterized by large smooth or textured regions and relatively few sharp edges. Signals with this structure are known to be very nearly sparse when represented in the DWT or DCT domain [38], where the lowest-frequency components provide a coarse-scale approximation of the image, while the higher-frequency components fill in the detail and resolve edges. Moreover, most DWT or DCT coefficients are very small; hence, we can obtain a good approximation of the signal by setting the small coefficients to zero, i.e., by thresholding the coefficients, to obtain a K-sparse representation. Furthermore, in a CS-based imaging system the original frame in the pixel domain is not available; therefore, we propose to estimate sparsity from the reconstructed frame x̂_vf as follows. By solving the optimization problem P3 in (11), we can obtain the block sparse representations s_vf^{i,*} and periodically reorganize {s_vf^{i,*}}_{i=1}^B into the frame sparse representation s_vf^*. The sparsity coefficient K_i is defined as the number of nonzero entries of s_vf^*. However, natural pictures in general are not exactly sparse in the transform domain. Hence, we introduce a predefined percentile p_s^4 and assume that the frame can be perfectly recovered with N · p_s measurements. Based on this, one can adaptively find a threshold T above which transform-domain coefficients are considered nonzero entries. The threshold can be found by solving

||max(|s_vf^*| − T, 0)||_0 / N = p_s,   (14)

which is an ℓ_0 counting-norm problem. Since the sample space of this problem is small and limited, we employ an exhaustive search to solve it. Then, we apply T to each block i to estimate the block sparsity K_i as

K_i = ||max(|s_vf^{i,*}| − T, 0)||_0.   (15)

According to (4), and given the frame measurement rate R, M_vf^i can then be obtained as

M_vf^i = [ K_i log10(N_b / K_i) / Σ_{i=1}^B K_i log10(N_b / K_i) ] · N R.   (16)

4. p_s represents the fraction of the largest original coefficients that are kept for video reconstruction. We can set it to an empirically well-performing value, e.g., 15%, and then slightly tune it during video streaming: if the estimated sparsity is too small and does not yield good reconstruction quality, we gradually increase p_s until a satisfactory point is reached.

Mean value estimation. Finally, the mean value m can be estimated from x̂_vf as

m = (1/N) Σ_{i=1}^N x̂_vf(i).   (17)

With limited feedback and lightweight information sharing, implementing block-level rate adaptation at the encoder without adding computational complexity improves the reconstruction performance of the proposed encoding/decoding paradigm. This claim is validated in Section 6 in terms of Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity (SSIM) [39].


Fig. 4: Rate-distortion curve fitting for the Vassar view 2 sequence.

5 NETWORK MODELING FRAMEWORK

We consider compressive wireless video streaming over multi-path multi-hop WMSNs. We first formulate a video-quality-assured power minimization problem, and then solve the resulting nonlinear nonconvex optimization problem by proposing an online solution algorithm with low computational complexity.

Network model. In the considered WMSN there is a set V of camera sensors at the transmitter side, with each camera capturing a video sequence of the same scene of interest and then sending the sequence to the server side through a set Z of pre-established multi-hop paths. Denote L_z as the set of hops belonging to path z ∈ Z, with d_{z,l} being the hop distance of the l-th hop in L_z. Let V = |V|, Z = |Z|, and L_z = |L_z| represent the cardinalities of the sets V, Z and L_z, respectively. The following three assumptions are made:

- Pre-established routing, i.e., the set of multi-hop paths Z is established in advance through a given routing protocol (e.g., AODV [40]) and does not change during the video streaming session.

- Orthogonal channel access, i.e., there exists a pre-established orthogonal channel access scheme, e.g., based on TDMA, FDMA, or CDMA, and hence concurrent transmissions do not interfere with each other [41].

- Time division duplexing, i.e., each node cannot transmit and receive simultaneously, implying that only half of the total air-time is used for transmission or reception.


TABLE 1: Average Pearson correlation coefficient ρ_m for the Vassar, Exit and Ballroom five views.

            View 1   View 2   View 3   View 4   View 5
Vassar-L    0.8184   0.8988   0.9243   0.8973   0.8435
Vassar-H    0.7655   0.8464   0.8815   0.8551   0.7920
Exit-L      0.5703   0.6787   0.7038   0.6838   0.5078
Exit-H      0.5358   0.6281   0.6643   0.6315   0.4745
Ballroom-L  0.8366   0.8713   0.8812   0.8627   0.8135
Ballroom-H  0.7574   0.7961   0.8099   0.7922   0.7484

TABLE 2: Improved average PSNR (dB) when selecting different Vassar views as the R-view.

R-view      View 1   View 2   View 3   View 4   View 5
Vassar-L    1.2312   1.6241   1.6674   1.6167   1.3833
Vassar-H    0.6686   0.8865   1.3132   1.2138   0.9019


At the receiver side, the video server concurrently and independently decodes each view of the received video sequences; based on the reconstructed video sequences, it then computes the rate control information and sends it back to the camera sensors for actual rate control. For this purpose, we define two types of video frames: the Reference Frame (R-frame) and the Non-Reference Frame (NR-frame). An R-frame is transmitted only periodically by the R-view; it is encoded at a fixed block-level rate and used for sparsity and mean value estimation at the centralized controller. All other frames sent by the R-view and all frames transmitted by the NR-views are categorized as NR-frames, which are encoded at adaptive block-level rates based on the estimated sparsity. Compared to an NR-frame, an R-frame is encoded at an equal or higher sampling rate and then delivered to the receiver side with much lower transmission delay. Hence, an R-frame can be reconstructed with equal or higher video quality and used to estimate the sparsity pattern information, which is then fed back to the video cameras for rate control when encoding the following NR-frames. For the R-view, we consider a periodic frame pattern, meaning that the R-view camera encodes its captured video frames as R-frames periodically, e.g., one every 30 consecutive frames.

In the above setting, our objective is to minimize the average power consumption of all cameras and communication sensors in the network while guaranteeing the reconstructed video quality of each view, by jointly controlling the video encoding rate and allocating the rate among candidate paths.

To handle CS-based multi-view video streaming with guaranteed quality, we need a rate-distortion model of the end-to-end distortion that jointly captures the effects of encoder distortion and transmission distortion, as stated in (7). To this end, we modify the R-D model (8) proposed in [8] by adding a packet loss term to jointly account for compression loss and packet loss^5 in compressive wireless video streaming systems:

D_dec = f(D_enc, D_loss) = D0 + θ / (R − κ p_loss − R0).   (18)

The parameters D0, θ, and R0 can be estimated from empirical rate-distortion curves via linear least-squares curve fitting [42]. Next, we describe how to derive them in compressive multi-view streaming systems.

Since the original pixel values are not available at the receiver end, and not even at the transmitter side in compressive multi-view streaming systems, we let the R-view periodically transmit a frame at a higher measurement rate, e.g., 60%.^6 In this way, after reconstruction at the decoder side, the reconstructed frame is treated as the original image in the pixel domain. We then re-sample it at different measurement rates and run the reconstruction procedure to obtain several rate-distortion sample pairs, which are used in the linear least-squares curve fitting to obtain the model parameters. Figure 4 illustrates the effectiveness of this online rate-distortion estimation approach on the Vassar view 2 sequence. We can observe that the fitted rate-distortion curve (black solid line) matches well the ground-truth distortion values (red pentagrams, blue squares and green pluses for frames 1, 4 and 80, respectively).

Besides the fitting parameters in (18), we next derive the packet loss probability p_loss and the packet-loss-to-measurement-rate-reduction converter κ in (18) to formalize the network optimization problem.

Packet loss probability. According to the modified R-D model (18), packet losses affect the video reconstruction quality because they introduce an effective measurement rate reduction. Therefore, effective estimation of the packet loss probability at the receiver side has a significant impact on frame-level measurement rate control.

5. Different from traditional predictive-encoding-based imaging systems, each packet in a CS-based imaging system has the same importance, i.e., it contributes equally to the reconstruction quality. Therefore, the packet loss probability can be converted into a measurement rate reduction through a conversion parameter and folded into the rate-distortion performance.

6. Based on CS theory [38], an image reconstructed from 60% of the measurements is essentially the original image, i.e., the differences between the reconstructed and original images cannot be perceived by the human eye.



In real-time wireless video streaming systems, a video packet can be lost primarily for two reasons: i) the packet fails to pass a parity check due to transmission errors introduced by unreliable wireless links, or ii) the packet takes too long to arrive at the receiver side, hence violating the maximum playout delay constraint. Denoting the corresponding packet loss probabilities as p_per and p_dly, respectively, the total packet loss rate p_loss can be written as

p_loss = p_per + p_dly.   (19)

In the case of multi-path routing as considered above, p_per and p_dly in (19) can be further expressed as

p_per = Σ_{z∈Z} (b_z / b) p_per^z,   (20)

p_dly = Σ_{z∈Z} (b_z / b) p_dly^z,   (21)

where p_per^z and p_dly^z represent the packet loss rates of path z ∈ Z due to transmission errors and delay constraint violations, respectively; b and b_z represent the total video rate and the rate allocated to path z ∈ Z, respectively.

Since each path z ∈ Z may have one or multiple hops, to derive the expressions for p_per^z and p_dly^z in (20) and (21) we need the resulting packet error rate and delay violation probability at each hop l of path z ∈ Z, denoted as p_per^{z,l} and p_dly^{z,l}, respectively. For this purpose, we first express the feasible transmission rate achievable at each hop. For each hop l ∈ L_z along path z ∈ Z, let G^{z,l} and N^{z,l} represent the channel gain, accounting for both path loss and fading, and the additive white Gaussian noise (AWGN) power currently measured at hop l, respectively. Denoting by P^{z,l} the transmission power of the sender of hop l, the attainable transmission rate of the hop, denoted by C^{z,l}(P^{z,l}), can be expressed as [43]

C^{z,l}(P^{z,l}) = (W/2) log2( 1 + K · P^{z,l} G^{z,l} / N^{z,l} ),   (22)

where W is the channel bandwidth in Hz and the calibration factor K is defined as

K = −φ1 / log(φ2 p_ber),   (23)

with φ1, φ2 being constants that depend on the available set of channel coding and modulation schemes, and p_ber the predefined maximum residual bit error rate (BER). Then, if path z ∈ Z is allocated video rate b_z, for each hop l ∈ L_z the average attainable transmission rate should be equal to or higher than b_z, i.e.,

E[C^{z,l}(P^{z,l})] ≥ b_z,   (24)

with E[C^{z,l}(P^{z,l})] defined by averaging C^{z,l}(P^{z,l}) in (22) over all possible channel gains G^{z,l}.

Based on the above setting, we can now express the single-hop packet error rate p_per^{z,l} for each hop l ∈ L_z of path z ∈ Z as

p_per^{z,l} = 1 − (1 − p_ber)^L,   (25)

where L is the predefined packet length in bits. Further, we characterize the queueing behavior at each wireless hop as in [44] using an M/M/1 model, to capture the effect of the channel-state-dependent transmission rate (22) on the single-hop queueing delay. Denoting by T^{z,l} the delay budget tolerable at each hop l ∈ L_z of path z ∈ Z, the resulting packet drop rate due to delay constraint violation can be given as [45]

p_dly^{z,l} = e^{ −( E[C^{z,l}(P^{z,l})] − b_z ) T^{z,l} / L },   (26)

with E[C^{z,l}(P^{z,l})] defined in (24). For each path z ∈ Z, the maximum tolerable end-to-end delay T_max can be assigned to the hops in different ways, e.g., equal assignment or distance-proportional assignment [46]; we adopt the same delay budget assignment scheme as in [46].

Finally, given p_per^{z,l} and p_dly^{z,l} in (25) and (26), we can express the end-to-end packet error rate p_per^z and delay violation probability p_dly^z in (20) and (21), for each path z ∈ Z, as

p_per^z = Σ_{l∈L_z} p_per^{z,l},  ∀z ∈ Z,   (27)

p_dly^z = Σ_{l∈L_z} p_dly^{z,l},  ∀z ∈ Z,   (28)

obtained by neglecting second- and higher-order products of p_per^{z,l} and of p_dly^{z,l}. The resulting p_per^z and p_dly^z provide an upper bound on the real end-to-end packet error rate and delay constraint violation probability, and the approximation error is negligible if the packet loss rate at each wireless hop is low or moderate. Note that it is also possible to derive a lower bound on the end-to-end packet loss rate, e.g., by applying the Chernoff bound [47].
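To make (19)-(28) concrete, the following sketch evaluates the end-to-end loss probability for a given rate split; the per-hop average rates E[C^{z,l}], delay budgets T^{z,l}, packet length, and BER are assumed inputs.

```python
import numpy as np

def path_loss_terms(b_z, hops, L=8000, p_ber=1e-6):
    # hops: list of (EC, T) pairs, with EC = E[C^{z,l}] in bits/s and T the
    # per-hop delay budget in seconds (assumed given)
    p_per = sum(1.0 - (1.0 - p_ber) ** L for _ in hops)            # eqs. (25), (27)
    p_dly = sum(np.exp(-(EC - b_z) * T / L) for EC, T in hops)     # eqs. (26), (28)
    return p_per, p_dly

def total_loss(b, alloc, paths, L=8000, p_ber=1e-6):
    # Rate-weighted combination over paths, eqs. (19)-(21)
    p_per = p_dly = 0.0
    for b_z, hops in zip(alloc, paths):
        per, dly = path_loss_terms(b_z, hops, L, p_ber)
        p_per += (b_z / b) * per                                   # eq. (20)
        p_dly += (b_z / b) * dly                                   # eq. (21)
    return p_per + p_dly                                           # eq. (19)
```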

Packet loss to measurement rate. Having modeled p_loss, we now concentrate on determining κ, which converts p_loss into a measurement rate reduction (referred to as R_d = κ · p_loss). First, the parameter τ = 1/(QN) is defined to convert the number of transmitted bits of each frame into its measurement rate R used in (18), with Q being the bit-depth of each measurement. We assume that b is equally distributed among the F frames within one second for all V views, i.e., the number of transmitted bits for each frame is b/F/V. Thus, the measurement rate R of each frame of each view is equal and defined as R = τ b/F/V. Then, we can define κ as

κ = τ L ⌈ (b/F/V) / L ⌉,   (29)

and rewrite (18) as

D_dec = D0 + θ / ( τ b/F/V − κ p_loss − R0 ).   (30)


Problem formulation. Based on (30), we formulate, as an example of the applicability of the proposed framework, the problem of power consumption minimization for quality-assured compressive multi-view video streaming over multi-hop wireless sensor networks, by jointly determining the optimal frame-level encoding rate and allocating the transmission rate among multiple paths:

P4: Minimize_{P^{z,l}, b_z, l∈L_z, ∀z∈Z}  Σ_{z∈Z} Σ_{l∈L_z} P^{z,l}   (31)

    Subject to: b = Σ_{z∈Z} b_z   (32)
                D_dec ≤ D_t   (33)
                0 < τ b/F/V − κ p_loss ≤ 1   (34)
                0 ≤ P^{z,l} ≤ P_max,  ∀l ∈ L_z, z ∈ Z,   (35)

where D_t and P_max represent the constraints on distortion and power consumption, respectively. Here, (33) enforces the required video quality level, and (34) keeps the effective measurement rate above 0 and no higher than 1. The optimization problem P4 is non-convex because the distortion constraint is non-convex, and solving it directly would be computationally expensive due to the large search space of b. Therefore, in the following we design an algorithm that finds a solution to the problem in real time.

Solution Algorithm. The core idea of the solution algorithm is to iteratively control the video encoding and transmission strategies at two levels, i.e., adjusting the video encoding rate of each frame (frame level) and allocating the resulting video data rate among different paths (path level). In each iteration, the algorithm first determines, at the frame level, the minimum video encoding rate required to achieve the predefined reconstructed video quality, i.e., b in (33); it then determines, at the path level, the transmission rate allocation with minimal power consumption, i.e., b_z for each path z ∈ Z.

At the frame level, given the current total video encoding rate b and the assigned rate b_z of each path z ∈ Z, the algorithm estimates the video reconstruction distortion D_dec based on (19)-(30). Then, if the video quality constraint of problem P4 is strictly satisfied, i.e., the inequality in (33) holds strictly, power consumption can be further reduced by decreasing the total video encoding rate b, e.g., by a predefined step ∆b, while keeping the distortion constraint (33) satisfied. Otherwise, if constraint (33) is violated, we need to reduce the reconstruction distortion D_dec by increasing the video encoding rate b, and hence the transmission power. Whenever the total encoding rate b changes, rate allocation among paths is triggered at the path level: for example, if b is increased by ∆b, the additional video data rate is allocated to the path that yields the minimum increase in power consumption, and vice versa.

As this procedure goes on, the resulting video distortion D_dec fluctuates around, ideally equal to, the predefined maximum tolerable distortion D_t. Hence, we approximately solve the optimization problem P4 formulated in (31)-(35), and the resulting power consumption provides an upper bound on the real minimum required total power. Next, in Section 6, we validate the effectiveness of the proposed solution algorithm through extensive simulation results.
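One iteration of the two-level loop can be sketched as follows. The callables estimate_Ddec (evaluating (30) from the current rates) and power_for_rate (the minimum power sustaining a rate on a path, obtained by inverting (22)-(24)) are hypothetical helpers, not part of the paper's notation.

```python
def rate_control_step(b, alloc, D_t, delta_b, estimate_Ddec, power_for_rate):
    """One frame-level / path-level iteration of the proposed algorithm (sketch).

    b:     current total encoding rate;  alloc: per-path rates b_z
    D_t:   distortion constraint of eq. (33)
    """
    D = estimate_Ddec(b, alloc)
    # Frame level: shrink b if quality has slack, grow it if (33) is violated
    step = -delta_b if D < D_t else delta_b
    # Path level: apply the change on the path that perturbs power the least
    deltas = [power_for_rate(z, alloc[z] + step) - power_for_rate(z, alloc[z])
              for z in range(len(alloc))]
    z_star = min(range(len(alloc)), key=deltas.__getitem__)
    alloc[z_star] += step
    return b + step, alloc
```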

6 PERFORMANCE EVALUATION

The topology includes a number V of camera sensors and pre-established paths with random numbers of hops between the camera sensors and the receiver. The frame rate is F = 30 fps, and the R-view sends an R-frame once per second. At the sparsity-aware CS independent encoder side, each frame is partitioned into 16 × 16 non-overlapping blocks, implying N_b = 256. A measurement matrix Φ_vf^i with elements drawn from independent and identically distributed (i.i.d.) Gaussian random variables is considered, where the random seed is fixed across all experiments so that Φ_vf^i is drawn from the same matrix. The elements of the measurement vector y_vf^i are quantized individually by an 8-bit uniform scalar quantizer and then transmitted to the decoder. At the independent decoder end, we use a Ψ_b composed of the DCT transform basis as the sparsifying matrix and choose the LASSO algorithm for reconstruction, motivated by its low complexity and excellent recovery performance. We consider four test multi-view sequences, Vassar, Exit, Ballroom, and Balloons, which are publicly available [48] [49] and represent scenarios with slow, moderate and fast movement characteristics. In the sequences considered, the optical axis of each camera is parallel to the ground, and each camera is 19.5 cm away from its left and right neighbors. Three spatial resolutions, (H = 240) × (W = 320), (H = 480) × (W = 640), and (H = 768) × (W = 1024) pixels, are considered.

6.1 Evaluation of the CS-based Multi-view Encoding/Decoding Architecture

We first experimentally study the performance of the proposed CS-based multi-view encoding/decoding architecture by evaluating the PSNR of the reconstructed video sequences. Experiments are carried out on the luminance component only. We compare (i) the traditional Equal-Block-Measurement-Rate Independent Encoding and Independent Decoding approach (EBMR-IEID), (ii) the proposed sparsity-aware Adaptive-Block-Measurement-Rate Independent Encoding and Independent Decoding approach (ABMR-IEID), and (iii) Independent Encoding and Joint Decoding (IEJD) proposed in [12], which selects one view as the reference view, reconstructed by a traditional CS recovery method, while the other views are jointly reconstructed using a reference frame.

Figures 5 and 6 show the PSNR comparisons over 50 frames for views 1, 2, 3 and 4 of the Vassar and Exit multi-view sequences, where a measurement rate of 0.3 is selected for each view of ABMR-IEID^7 and EBMR-IEID.


Fig. 5: PSNR against frame index for (a) view 1, (b) view 2 (R-view), (c) view 3, and (d) view 4 of sequence Vassar.

TABLE 3: PSNR and SSIM comparison for Vassar eight views.

View #   ABMR-IEID            EBMR-IEID            IEJD
         PSNR (dB)   SSIM     PSNR (dB)   SSIM     PSNR (dB)   SSIM
1        33.6675     0.8648   30.0883     0.8215   30.2717     0.7887
2        33.7768     0.8686   30.3459     0.8262   30.3355     0.7902
3        34.1934     0.8771   30.6265     0.8323   30.9214     0.8106
4        33.5766     0.8696   30.4168     0.8294   30.4168     0.8294
5        33.3030     0.8624   30.1011     0.8169   30.3641     0.7909
6        34.2191     0.8846   30.6803     0.8382   30.7265     0.8059
7        32.9924     0.8575   29.8250     0.8162   29.6648     0.7772
8        32.3376     0.8472   29.3713     0.8054   29.5466     0.7742

To ensure a fair comparison, the measurement rate of each view in IEJD is also set to 0.3. In addition, according to the R-view selection algorithm, view 2 is chosen as the R-view in this scenario. Since the R-view periodically transmits the R-frame, which is not encoded based on the sparsity pattern at the encoder, periodic drops can be observed in Fig. 5(b) and Fig. 6(b). For the Vassar sequences, as illustrated in Fig. 5, the proposed ABMR-IEID outperforms the traditional EBMR-IEID and IEJD approaches by up to 3.5 dB and 2.5 dB in PSNR, respectively.

7. Here we consider a worst case for ABMR-IEID, i.e., the measurement rate of the R-frame in the R-view is set to 0.3, the same as for the NR-frames. A higher measurement rate for the R-view would yield a larger performance gain, because the estimated sparsity pattern would be more accurate.

For the Exit sequences, Figure 6 shows that the improvement in reconstruction quality of ABMR-IEID over EBMR-IEID and IEJD fluctuates more than for the Vassar video, with PSNR gains varying from 2 dB to 5 dB and from 1 dB to 4 dB, respectively. This occurs because of content characteristics: the texture of Exit changes faster than that of Vassar. In other words, the proposed scheme is most robust in surveillance scenarios where texture changes are less severe; the fluctuation can be mitigated by transmitting R-frames more frequently. Figures 5 and 6 also show performance improvements on the NR-views (views 1, 3 and 4 here): by sharing sparsity information between the R-view and the NR-views, the correlation among views is implicitly exploited to improve the reconstruction quality.
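The PSNR values reported in these figures follow the standard definition over the luminance component; a short helper of the kind used to produce them (the name and the 8-bit peak value are our assumptions) is:

```python
import numpy as np

def psnr(ref, rec, peak=255.0):
    """PSNR in dB between a reference and a reconstructed 8-bit luminance frame."""
    mse = np.mean((ref.astype(float) - rec.astype(float)) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)
```

The SSIM values reported in Table 3 can be computed analogously, e.g., with skimage.metrics.structural_similarity.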


Fig. 6: PSNR against frame index for (a) view 1, (b) view 2 (R-view), (c) view 3, and (d) view 4 of sequence Exit.


Fig. 7: Rate-distortion comparison of frame 3 of Ballroom sequences: (a) view 1, (b) view 2, and (c) view 3.


We then illustrate the rate-distortion characteristics of ABMR-IEID, EBMR-IEID and IEJD. Figures 7, 8 and 9 show the comparisons for the 3-view Ballroom, 4-view Vassar and 4-view Exit scenarios with resolution 240 × 320, where the 3rd frame of Ballroom, the 75th frame of Vassar and the 9th frame of Exit are taken as examples, respectively. Evidently, ABMR-IEID significantly outperforms EBMR-IEID and IEJD, especially as the number of measurements increases. Since view 2 is selected as the reference view (the K-view in IEJD) and its measurement rate is fixed at 0.6 [12], a plateau is observed in view 2 for the IEJD method. We can observe that at measurement rate 0.4, ABMR-IEID improves PSNR by up to 3.5 dB, 4.4 dB and 2.4 dB, on both the R-view and the NR-views, for all video sequences. To further evaluate the impact of resolution, we take the 3-view Ballroom and Balloons sequences with resolutions 480 × 640 (denoted as Ballroom-H) and 768 × 1024, respectively, as examples, since the textures of both sequences change more frequently than those of Vassar and Exit.


Fig. 8: Rate-distortion comparison for frame 75 of Vassar sequences: (a) view 1, (b) view 2, (c) view 3, and (d) view 4.

As illustrated in Fig. 10, the proposed ABMR-IEID also outperforms the other two methods by up to 4 dB for Ballroom, although the correlation slightly decreases as the resolution increases, as shown in Table 1. In Fig. 11, we also observe up to 5 dB of PSNR improvement by ABMR-IEID over the other two methods for Balloons.

Next, we extend the scenario to 8 views of Vassar, where view 4 is selected as the R-view and the measurement rate is set to 0.35 for all views. Figure 12 shows the reconstructed-image comparison: the left column illustrates the reconstructed frame 25 of views 3 and 7 obtained by ABMR-IEID, the middle column shows the images reconstructed by EBMR-IEID, and the right column shows the results obtained with IEJD. The quality of the images in the left column is clearly better than that in the right two columns (e.g., the curtain on the 2nd floor and the person in the scene). Furthermore, Table 3 reports the detailed PSNR and SSIM comparison among ABMR-IEID, EBMR-IEID and IEJD for frame 25 of all 8 views. From Fig. 12 and Table 3, we can see that ABMR-IEID also works well with 8 views compared to EBMR-IEID and IEJD, with PSNR and SSIM improvements of up to 3.5 dB and 0.05, respectively. In contrast, the IEJD method proposed in [12] does not perform well with 8 views, where its gain is almost negligible.

6.2 Evaluation of Power-efficient Compressive Video Streaming

The following network topologies are considered: i) a 2-path scenario with a 2-hop path 1 and a 1-hop path 2; ii) a 3-path scenario with a 2-hop path 1, a 1-hop path 2 and a 2-hop path 3. We assume a bandwidth W = 1 MHz for each channel. The maximum transmission power at each node is set to 1 W, and the target distortion in MSE is 50. We also assume a maximum end-to-end delay of Tmax = 0.5 s, assigned to each hop in proportion to the hop distance. To evaluate PE-CVS (the power-efficient compressive video streaming algorithm proposed in Sec. 5), we compare it with an algorithm, referred to as ER-CVS, that equally splits the frame-level rate calculated by PE-CVS across the different paths.

Figures 13 and 14 illustrate, for the 2-path and 3-path topologies respectively, the total power consumption of PE-CVS and ER-CVS and the power saved by PE-CVS compared to ER-CVS. From Figs. 13(a) and 14(a), we see that PE-CVS (red line) results in lower power consumption than ER-CVS (black dashed line) in both cases. At some points, the total power consumption of PE-CVS and ER-CVS is almost the same. This occurs when the path-level bit rates calculated by PE-CVS happen to be equal to each other: since ER-CVS takes the frame-level rate obtained from PE-CVS and equally allocates it across paths, the two schemes then consume the same power.


Fig. 9: Rate-distortion comparison for frame 9 of Exit sequences: (a) view 1, (b) view 2, (c) view 3, and (d) view 4.


Fig. 10: Rate-distortion comparison of frame 3 of higher resolution Ballroom sequences: (a) view 1, (b) view 2, and (c) view 3.

As shown in Figs. 13(b) and 14(b), the histograms clearly show that PE-CVS saves up to 170 mW of power compared to ER-CVS.
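The ER-CVS baseline differs from PE-CVS only in the path-level step: instead of the greedy per-path allocation of Sec. 5 (sketched after the algorithm description), it splits the frame-level rate evenly. A one-line sketch, with hypothetical names:

```python
def split_equal(b, paths):
    """ER-CVS baseline: divide the frame-level rate b evenly across all paths."""
    return {z: b / len(paths) for z in paths}
```

Whenever the greedy PE-CVS allocation happens to return equal per-path rates, the two schemes coincide, which explains the points where the two power curves touch.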

7 CONCLUSIONS

We addressed the problem of compressive multi-view coding and power-efficient streaming in multi-hop WMSNs. We first proposed a novel compressed-sensing-based multi-view video coding/decoding architecture composed of cooperative sparsity-aware independent encoders and independent decoders. We also introduced a central controller that performs sparsity pattern estimation, R-view selection and mean value estimation, and that implements the network optimization algorithms. By introducing limited channel feedback and enabling lightweight sparsity information sharing between the R-view and the NR-views, the encoders independently encode the video sequences with sparsity awareness and exploit multi-view correlation to improve the reconstruction quality of the NR-views.


Fig. 11: Rate-distortion comparison of frame 25 of higher resolution Balloons sequences: (a) view 1, (b) view 2, and (c) view 3.

Fig. 12: Reconstructed frame 25 of view 3 by (a) ABMR-IEID, (b) EBMR-IEID, (c) IEJD, and reconstructed frame 25 of view 7 by (d) ABMR-IEID, (e) EBMR-IEID, and (f) IEJD.


Fig. 13: 2-path scenario: (a) total power consumption comparison; (b) power saved by PE-CVS compared to ER-CVS.


Fig. 14: 3-path scenario: (a) total power consumption comparison; (b) power saved by PE-CVS compared to ER-CVS.

Based on the proposed encoding/decoding architecture, we developed a modeling framework that minimizes the multi-view video transmission power while guaranteeing video quality in a multi-hop multi-path sensor network. Extensive simulation results showed that the designed compressive multi-view framework can considerably improve the video reconstruction quality with minimal power consumption.

REFERENCES

[1] N. Cen, Z. Guan, and T. Melodia, "Multi-view Wireless Video Streaming Based on Compressed Sensing: Architecture and Network Optimization," in Proc. of ACM Intl. Symposium on Mobile Ad Hoc Networking and Computing (MobiHoc), Hangzhou, China, June 2015.
[2] A. E. Redondi, L. Baroffio, M. Cesana, and M. Tagliasacchi, "Multi-view Coding and Routing of Local Features in Visual Sensor Networks," in Proc. of IEEE International Conference on Computer Communications (INFOCOM), San Francisco, CA, USA, April 2016.
[3] E. J. Candes and M. B. Wakin, "An Introduction to Compressive Sampling," IEEE Signal Processing Magazine, vol. 25, no. 2, pp. 21–30, March 2008.
[4] D. L. Donoho, "Compressed Sensing," IEEE Transactions on Information Theory, vol. 52, no. 4, pp. 1289–1306, April 2006.
[5] S. Pudlewski and T. Melodia, "A Tutorial on Encoding and Wireless Transmission of Compressively Sampled Videos," IEEE Communications Surveys & Tutorials, vol. 15, no. 2, pp. 754–767, Second Quarter 2013.
[6] H. W. Chen, L. W. Kang, and C. S. Lu, "Dynamic Measurement Rate Allocation for Distributed Compressive Video Sensing," Visual Communications and Image Processing, vol. 7744, pp. 1–10, July 2010.
[7] Y. Liu, M. Li, and D. A. Pados, "Motion-aware Decoding of Compressed-sensed Video," IEEE Transactions on Circuits and Systems for Video Technology, vol. 23, no. 3, pp. 438–444, March 2013.
[8] S. Pudlewski, T. Melodia, and A. Prasanna, "Compressed-sensing Enabled Video Streaming for Wireless Multimedia Sensor Networks," IEEE Transactions on Mobile Computing, vol. 11, no. 6, pp. 1060–1072, June 2012.
[9] S. Pudlewski and T. Melodia, "Compressive Video Streaming: Design and Rate-Energy-Distortion Analysis," IEEE Transactions on Multimedia, vol. 15, no. 8, pp. 2072–2086, Dec. 2013.
[10] X. Chen and P. Frossard, "Joint Reconstruction of Compressed Multi-view Images," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Taipei, Taiwan, April 2009.
[11] M. Trocan, T. Maugey, E. W. Tramel, J. E. Fowler, and B. Pesquet-Popescu, "Compressed Sensing of Multiview Images Using Disparity Compensation," in Proc. International Conference on Image Processing (ICIP), Hong Kong, September 2010, pp. 3345–3348.
[12] N. Cen, Z. Guan, and T. Melodia, "Joint Decoding of Independently Encoded Compressive Multi-view Video Streams," in Proc. of Picture Coding Symposium (PCS), San Jose, CA, December 2013.
[13] ——, "Inter-view Motion Compensated Joint Decoding of Compressive-Sampled Multi-view Video Streaming," IEEE Transactions on Multimedia, vol. 19, no. 6, pp. 1117–1126, June 2017.
[14] C. Li, D. Wu, and H. Xiong, "Delay-Power-Rate-Distortion Model for Wireless Video Communication Under Delay and Energy Constraints," IEEE Transactions on Circuits and Systems for Video Technology, vol. 24, no. 7, pp. 1170–1183, July 2014.
[15] Z. He, Y. Liang, L. Chen, I. Ahmad, and D. Wu, "Power-rate-distortion Analysis for Wireless Video Communication under Energy Constraints," IEEE Transactions on Circuits and Systems for Video Technology, vol. 15, no. 5, pp. 645–658, May 2005.
[16] A. D. Wyner and J. Ziv, "The Rate-distortion Function for Source Coding with Side-information at the Decoder," IEEE Transactions on Information Theory, vol. 22, no. 1, pp. 1–10, Jan. 1976.
[17] D. Slepian and J. K. Wolf, "Noiseless Coding of Correlated Information Sources," IEEE Transactions on Information Theory, vol. 19, no. 4, pp. 471–480, July 1973.
[18] X. Fei, L. Li, H. Cao, J. Miao, and R. Yu, "View's Dependency and Low-rank Background-guided Compressed Sensing for Multi-view Image Joint Reconstruction," IET Image Processing, vol. 13, no. 12, pp. 2294–2303, 2019.
[19] J. Zhu, J. Wang, and Q. Zhu, "Compressively Sensed Multi-View Image Reconstruction Using Joint Optimization Modeling," in Proc. of IEEE Visual Communications and Image Processing (VCIP), Taichung, Taiwan, Dec. 2018, pp. 1–4.
[20] V. Thirumalai and P. Frossard, "Correlation Estimation from Compressed Images," Journal of Visual Communication and Image Representation, vol. 24, no. 6, pp. 649–660, 2013.
[21] M. Trocan, T. Maugey, J. Fowler, and B. Pesquet-Popescu, "Disparity-Compensated Compressed-Sensing Reconstruction for Multiview Images," in Proc. IEEE International Conference on Multimedia and Expo (ICME), Suntec City, Singapore, July 2010, pp. 1225–1229.
[22] M. Trocan, T. Maugey, E. Tramel, J. Fowler, and B. Pesquet-Popescu, "Multistage Compressed-Sensing Reconstruction of Multiview Images," in Proc. IEEE International Workshop on Multimedia Signal Processing (MMSP), Saint Malo, France, October 2010, pp. 111–115.
[23] S. Elsayed, M. Elsabrouty, O. Muta, and H. Furukawa, "Distributed Perceptual Compressed Sensing Framework for Multi-view Images," Electronics Letters, vol. 52, no. 10, pp. 821–823, December 2016.
[24] Y. Liu, C. Zhang, and J. Kim, "Disparity-compensated Total-variation Minimization for Compressed-sensed Multiview Image Reconstruction," in Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brisbane, Australia, April 2015.
[25] Y. Wang, D. Wang, X. Zhang, J. Chen, and Y. Li, "Energy-Efficient Image Compressive Transmission for Wireless Camera Networks," IEEE Sensors Journal, vol. 16, no. 10, pp. 3875–3886, 2016.
[26] S. Pudlewski and T. Melodia, "Cooperating to Stream Compressively Sampled Videos," in Proc. of IEEE International Conference on Communications (ICC), Budapest, Hungary, June 2013.
[27] M. A. T. Figueiredo, R. D. Nowak, and S. J. Wright, "Gradient Projection for Sparse Reconstruction: Application to Compressed Sensing and Other Inverse Problems," IEEE Journal of Selected Topics in Signal Processing, vol. 1, no. 4, pp. 586–598, Dec. 2007.
[28] R. Tibshirani, "Regression Shrinkage and Selection via the Lasso," Journal of the Royal Statistical Society, Series B, vol. 58, pp. 267–288, 1996.
[29] D. Donoho, M. Elad, and V. Temlyakov, "Stable Recovery of Sparse Overcomplete Representations in the Presence of Noise," IEEE Transactions on Information Theory, vol. 52, no. 1, pp. 6–18, January 2006.
[30] P. Mohammadi, A. Ebrahimi-Moghadam, and S. Shirani, "Subjective and Objective Quality Assessment of Image: A Survey," Majlesi Journal of Electrical Engineering, vol. 9, June 2014.
[31] K. Stuhlmuller, N. Farber, M. Link, and B. Girod, "Analysis of Video Transmission over Lossy Channels," IEEE Journal on Selected Areas in Communications, vol. 18, no. 6, pp. 1012–1032, June 2000.
[32] X. Yuan, G. Huang, H. Jiang, and P. A. Wilford, "Block-wise Lensless Compressive Camera," in Proc. of IEEE International Conference on Image Processing (ICIP), Beijing, China, September 2017.
[33] S. Pudlewski and T. Melodia, "A Rate-Energy-Distortion Analysis for Compressed-Sensing-Enabled Wireless Video Streaming on Multimedia Sensors," in Proc. of IEEE Global Telecommunications Conference (GLOBECOM), Houston, TX, USA, Sep. 2011.
[34] F. Mager, D. Baumann, R. Jacob, L. Thiele, S. Trimpe, and M. Zimmerling, "Feedback Control Goes Wireless: Guaranteed Stability over Low-power Multi-hop Networks," in Proc. of ACM/IEEE International Conference on Cyber-Physical Systems (ICCPS), April 2019.
[35] D. Baumann, F. Mager, R. Jacob, L. Thiele, M. Zimmerling, and S. Trimpe, "Fast Feedback Control over Multi-hop Wireless Networks with Mode Changes and Stability Guarantees," ACM Transactions on Cyber-Physical Systems, vol. 4, no. 2, pp. 18:1–18:32, November 2019.
[36] X. Guo, Y. He, S. Atapattu, S. Dey, and J. S. Evans, "Power Allocation for Distributed Detection Systems in Wireless Sensor Networks With Limited Fusion Center Feedback," IEEE Transactions on Communications, vol. 66, no. 10, pp. 4753–4766, Oct. 2018.
[37] Y. He and S. Dey, "Power Allocation for Secondary Outage Minimization in Spectrum Sharing Networks with Limited Feedback," IEEE Transactions on Communications, vol. 61, no. 7, pp. 2648–2663, July 2013.
[38] G. Kutyniok, "Compressed Sensing: Theory and Applications," CoRR, 2012.
[39] Z. Wang, A. Bovik, H. Sheikh, and E. Simoncelli, "Image Quality Assessment: From Error Visibility to Structural Similarity," IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600–612, April 2004.
[40] C. E. Perkins and E. M. Royer, "Ad hoc On-Demand Distance Vector (AODV) Routing," RFC 3561, July 2003.
[41] Y. Shi, S. Sharma, Y. T. Hou, and S. Kompella, "Optimal Relay Assignment for Cooperative Communications," in Proc. ACM Intl. Symposium on Mobile Ad Hoc Networking and Computing (MobiHoc), Hong Kong, China, May 2008.
[42] H. Prautzsch and K. Šalkauskas, "Curve and Surface Fitting: An Introduction," SIAM Review, vol. 31, no. 1, pp. 155–157, 1989.
[43] A. Goldsmith, Wireless Communications. New York, NY, USA: Cambridge University Press, 2005.
[44] X. Zhu, E. Setton, and B. Girod, "Congestion-distortion Optimized Video Transmission over Ad Hoc Networks," EURASIP Journal of Signal Processing: Image Communications, vol. 20, pp. 773–783, 2005.
[45] D. Bertsekas and R. Gallager, Data Networks. USA: Prentice Hall, 2000.
[46] T. Melodia and I. F. Akyildiz, "Cross-layer Quality of Service Support for UWB Wireless Multimedia Sensor Networks," in Proc. of IEEE Conference on Computer Communications (INFOCOM), Phoenix, AZ, April 2008.
[47] H. Chernoff, "A Measure of Asymptotic Efficiency for Tests of a Hypothesis Based on the Sum of Observations," Annals of Mathematical Statistics, vol. 23, no. 4, pp. 493–507, 1952.
[48] Mitsubishi Electric Research Laboratories, MERL Multi-view Video Sequences. [Online]. Available: ftp://ftp.merl.com/pub/avetro/mvc-testseq/
[49] [Online]. Available: http://www.fujii.nuee.nagoya-u.ac.jp/multiview-data/

Nan Cen (M'19) received the Ph.D. degree in Electrical Engineering from Northeastern University, Boston, MA, USA, in 2019. She is an Assistant Professor with the Department of Computer Science, Missouri University of Science and Technology, Rolla, MO, USA. She directs the Wireless Networks and Intelligent Systems (WNIS) Lab at MST, with research interests in system modeling, control and prototyping for next-generation intelligent wireless networks.

Zhangyu Guan (SM'20) received the Ph.D. degree in Communication and Information Systems from Shandong University, China, in 2010. He is currently an Assistant Professor with the Department of Electrical Engineering at the State University of New York at Buffalo, where he directs the Wireless Intelligent Networking and Security (WINGS) Lab, with research interests in network design automation, new spectrum technologies, and wireless network security. He has served as an Area Editor for the Elsevier Journal of Computer Networks since July 2019. He has served as TPC Chair for the IEEE INFOCOM Workshop on Wireless Communications and Networking in Extreme Environments (WCNEE) 2020, Student Travel Grants Chair for IEEE Sensor, Mesh and Ad Hoc Communications and Networks (SECON) 2019-2020, and Information System (EDAS) Chair for the IEEE Consumer Communications and Networking Conference (CCNC) 2021. He has also served as a TPC member for IEEE INFOCOM 2016-2020, IEEE GLOBECOM 2015-2020, IEEE MASS 2017-2019, and IEEE IPCCC 2015-2019, among others.

Tommaso Melodia (M'07, SM'16, F'18) is the William Lincoln Smith Chair Professor with the Department of Electrical and Computer Engineering at Northeastern University. He is also the Founding Director of the Institute for the Wireless Internet of Things and the Director of Research for the PAWR Project Office. He received his Ph.D. in Electrical and Computer Engineering from the Georgia Institute of Technology in 2007. He is a recipient of the National Science Foundation CAREER award. Prof. Melodia has served as Associate Editor for IEEE Transactions on Wireless Communications, IEEE Transactions on Mobile Computing, and IEEE Transactions on Multimedia, and is the Editor in Chief of Computer Networks. He has served as Technical Program Committee Chair for IEEE INFOCOM 2018, and as General Chair for IEEE SECON 2019 and ACM MobiHoc 2020. Prof. Melodia is the Director of Research for the Platforms for Advanced Wireless Research (PAWR) Project Office, a $100M public-private partnership to establish four city-scale platforms for wireless research to advance the US wireless ecosystem in the years to come. Prof. Melodia's research on modeling, optimization, and experimental evaluation of Internet-of-Things and wireless networked systems has been funded by the National Science Foundation, the Air Force Research Laboratory, the Office of Naval Research, DARPA, and the Army Research Laboratory. Prof. Melodia is a Fellow of the IEEE and a Senior Member of the ACM.