Hierarchical resource allocation for robust in-home video streaming

Peter van der Stok1,2, Dmitri Jarnikov1,2, Sergei Kozlov1, Michael van Hartskamp2, Johan Lukkien1

1 Eindhoven University of Technology, Eindhoven, The Netherlands; 2 Philips Research, Eindhoven, The Netherlands

Abstract

High quality video streaming puts high demands on network and processor resources. The bandwidth of the communication medium and the timely arrival of the frames necessitate a tight resource allocation. Given the dynamic environment, where videos are started and stopped and electro-magnetic perturbations affect the bandwidth of the wireless medium, a framework is needed that reacts in a timely manner to changes in network load and network operational conditions. This paper describes a hierarchical framework that handles the dynamic network resource allocation in a timely manner.

1. Introduction

Today, TVs, video recorders and set-top boxes are mostly interconnected with SCART cables. The advent of

broadband and the introduction of the second PC in the home make digital home networks a more realistic

way to interconnect Consumer Electronic devices. This tendency moves video away from dedicated media to

video streaming across an open and shared network connecting multiple types of devices (e.g. phones, PCs,

and CE-devices). This new multimedia environment introduces the problem of sharing limited network

resources between video- and other applications. Both the need for resources, expressed as bit rate, and the

availability of resources, expressed as bandwidth, fluctuate within intervals of tens of milliseconds. In

addition there is a need for a timely delivery of the video at the destination. The source of the video can be

either live (broadband TV or a camera) or taken from a storage medium. For a live transmission, a low

overall delay from generation to displaying is mandatory, which imposes strict timeliness requirements. The

source can be located inside the home, outside, or connected through some gateway (e.g. a broadband

connection). The quality of the source video can vary from relatively poor - in the order of tens of kbits/s - for

use on a small display, to high quality - in the order of 6 to 40 Mbits/s - for use on a large screen flat TV.

The sharing of the network medium among several applications leads to a lower bandwidth available to a

given video. In addition, a significant part of the home network will be based on wireless technology. The

consequences of wireless connections are reduced security and bandwidth, as well as increased fluctuation of

the bandwidth through interference with other transmission sources and the moving of objects.

Not meeting the resource and timeliness requirements leads to non-optimal viewing experiences in the form

of distortion, hiccups, delayed viewing or stalling. To avoid these severe quality changes (leading to people

refusing to buy networks and TVs) we propose a scheme that allocates the resources in such a manner that

under overload a tolerable quality degradation occurs such that recognizable video is provided at all times.

The scheme combines the video-source, the video coding, and the transport protocol and is especially

advantageous for live broadcasts. It distinguishes fast fluctuations at the frame level (≈ 40 ms) from structural

fluctuations. Fast fluctuations are caused by variations in the frame sizes and distortions in the bandwidth.

Structural fluctuations last longer and come for example from the starting/stopping of another application.

It is not sufficient to come up with technologically viable solutions. Because many manufacturers provide

network equipment and CE-devices, inter-operability must be assured. The situation in the home is very

different from telephone- or service provider networks. The networks for the latter are under control of one

operator who decides the resource management procedures and technology. In the home there is no such

authority, and standardization must assure that the provided technology collaborates to support the policies

wished by the users of the home network.

2. Related work

Most CE-devices (TV, DVD) are resource-constrained to keep them in an acceptable price range. For

telephones, the same resource constraints are mostly driven by battery power constraints. The work on video

streaming most related to our work can be distinguished in four areas: (1) scheduling of packets on the link,

(2) adaptation of processing power to video requirements in the renderer, (3) multicasting video with a bit-

rate adapted to the individual capacities of the receivers, and (4) transporting video over a network.

Scheduling packets. The packets of all applications, which share the network, need to be scheduled. It is

assumed that a network authority (as foreseen by UPnP QoS standardization [32][33]) allocates bandwidth to

the individual applications. To prevent buffer overflows and provide a balanced loading of the network,

packets are scheduled such that the consumed bandwidth is not exceeded and the network load is evenly

distributed in time. The unscheduled packets are viewed as interference to the scheduled applications. Leaky

bucket and token bucket techniques are two well-known examples [2]. General Processor Sharing (GPS) is an

ideal scheduling mechanism, which is applied to networks. Two examples show how scheduling techniques

can allocate bandwidth to streams or separate asynchronous traffic from synchronous (video) traffic [22][23].

Processing power in CE-devices. In [7], and [17] the observation is made that the CPU needs for a video

stream fluctuate from frame to frame and from scene to scene. A distinction is made between the fast

fluctuation at the frame level, and the slower fluctuation at the video scene level. The concept of Scalable

Video Algorithm (SVA) is introduced to adapt the quality of the decoding process to control the CPU

requirements. In this framework it is possible to provide the highest quality while still meeting the deadline of

each frame. In [8] the authors describe the allocation of priorities to video frames such that important video

frames, on which other frames depend, have a higher probability of acquiring CPU resources to maintain the

highest possible video quality under processing overload.

Multicasting. Devices have different processor/memory capabilities, thus not every device may be capable of

processing all video data that is streamed by the sender. To ensure that all devices process video according to

their capacity, a sender sends to each destination the amount of data that the device can successfully process.

In [16] the sender adapts the content. There are several strategies for content adaptation. The three foremost

of those strategies are the simulcast model, transcoding and the scalable video coding model. With the simulcast

approach [16] the sender produces several independently encoded copies of the same video content, which

differ in e.g. bit/frame rates, and spatial resolutions. The sender delivers these copies to the destinations, in

agreement with the specifications coming from the destination.

Video transport. Two protocols are generally deployed for the transport of video over a network with the

Internet protocol (IP) facilities defined by the Internet Engineering Task Force (IETF): Transmission Control

Protocol (TCP) [20], and Real-time Transport Protocol (RTP) [19]. RTP is very successful on the Internet to

support live video but the quality displayed on the PC is far below the quality accepted by buyers of digital

TV. RTP promotes timely arrival of packets by allowing loss of packets. Efforts are ongoing to extend RTP

with retransmission facilities as provided by TCP [34]. TCP provides loss-less transport of packets but badly

supports live streaming over the Internet. The TCP-RTM protocol is an effort to provide timeliness to audio

packets by skipping late packets [21].

The fluctuating bandwidth of the wireless medium has as consequence that the quality of the rendered video

fluctuates. In [14] a controller at the receiver side removes the quality fluctuations by selecting the

transmitted video parts such that the quality fluctuations are reduced.

Standardization for the home.

Only a few years ago, IEEE 1394 [25] and HiperLAN [26] were considered as promising candidates for

connectivity in consumer electronics home networking as they offered timely delivery of packets and are

therefore suited for multimedia transport. Now wired and wireless Ethernet are recognized as the

predominant connectivity standards. The advantage is that only one technology is used for all networking

applications in the home (e.g. file transport, chatting, audio, and video). Yet, it offers only best-effort service and no

timeliness guarantees. The IEEE 802.11e standard [27], which provides extensions to wireless Ethernet,

offers prioritized and scheduled access. It also offers several other enhancements. Before the IEEE 802.11e

standard was completed in 2005, the Wi-Fi Alliance had started a certification program for wireless

multimedia based on IEEE 802.11e. The program for prioritized access, called Wi-Fi Multimedia (WMM)

[28], was completed in 2004. The Wi-Fi Alliance is currently working on a certification program for

scheduled access. Several other connectivity technologies have recently been developed that include

scheduled access: WiMedia [29], HomePlug AV [30], etc. Even for wired Ethernet, the Ethernet AV initiative

[31] intends to improve its timeliness properties.

As it is expected that home networks remain heterogeneous in their connectivity technologies, middleware

solutions are developed to deal with this heterogeneity. One of the more popular middleware technologies for

the home is UPnP [32]. The UPnP forum standardizes so-called device control protocols for e.g. AV

applications, Internet Gateway devices, but also for Quality of Service (QoS). The UPnP-QoS v1 and

ongoing v2 [33] specifications define the use of priority-based policies. Currently work in UPnP-QoS is

ongoing to develop version 3 on parameterized QoS and scheduled access. The UPnP-QoS makes it possible

to share bandwidth in accordance with application quality criteria. As such it becomes possible to share the

network between different types of applications. For example, the real-time aspects of audio and video

streaming can be guaranteed at the expense of delays for file transport.

The Digital Living Network Alliance (DLNA) is an industry forum that provides interoperability guidelines

to implement digital media servers, players and renderers [34]. Guidelines are written on the use of wired and

wireless Ethernet and Bluetooth, TCP/IP and UPnP. TCP is the mandatory transport protocol for AV content.

The DLNA also defines various profiles for AV media formats [35].

3. Video transport

This section motivates our choice of TCP and the deployment of scalable video coding. Figure 1 shows the

most important features of the network configurations we consider.

[Diagram, two panels: the network path sender - switch - Access Point (AP) - receivers 1..N, and the sender in detail: real-time or stored video enters the application, video data passes through TCP and a traffic shaper, which puts IP data on the network.]

Figure 1 Example network configuration

The sender contains a video application, which sends stored or live video to a destination on the network. It

invokes a transport protocol, which packs the video frames in packets and the traffic shaper sends the packets

in a regular fashion to the network. The network is composed of a wired (switched Ethernet) part and a

wireless part. Packets are buffered and forwarded by the switch and the Access Point (AP). Losses due to buffer

overflow may occur at the sender, the switch, the AP and the receiver. For today’s wired segment there is

almost no packet loss. In addition, measurements showed that losses over the wireless segment do not occur

when the retry counter is set to 4 or higher [24]. Consequently we may assume that packet loss over the

communication media in the home is negligible, and most of the time losses occur due to buffer overflow.

3.1. Transport protocols

When unreliable transport protocols (such as RTP, [19]) are used for sending multimedia streams over a large

network (the Internet), it is very difficult to control the data losses happening due to congestion in the routers

or due to the low reliability of the medium, e.g. a wireless link. The usual practice to handle such losses is to use

error recovery mechanisms at the receiver and/or redundancy coding at the sender. However, these

mechanisms are often combined with some content adaptation technique, which uses the feedback from the

network or receiver to inform the sender about the losses. This makes the system difficult to implement.

TCP, being a reliable transport protocol, eliminates uncontrollable losses of data. Applications built upon

TCP see the network as reliable transport means with varying throughput. Nevertheless, if at some moment

the application needs to send more data than TCP can deliver (the network bandwidth drops below the bit rate

of a live encoded stream), loss of data can happen due to application/TCP buffer overflows. Introducing

larger buffers may decrease the probability of buffer over- and under-flows, because often the network

throughput drops below the video bit-rate only for a short time, after which the “recovery” takes place. The

larger the buffers, the longer the periods of insufficient bandwidth can go unnoticed by the end-user. The cost

for the large buffers is increased latency (the time needed for a unit of data – e.g. a video frame – to be

transferred from the sender to the receiver). Latency of more than 200 ms, which corresponds to buffering of

5-7 frames, is not acceptable in real-time video applications. Keeping the buffers small limits the amplitude

and duration of bandwidth variations that can be handled. However, these losses are easy to control by

applying buffer management techniques that are different from the default Tail-Drop technique. Such

techniques as Partial Buffer Sharing (PBS) and Triggered Buffer Sharing (TBS) [4], as well as Push out

Buffers (POB) [5] are based on dropping lower priority data to accommodate higher priority data when the

buffer cannot accommodate both. Implementations of buffer management techniques in the video-streaming

domain address frame skipping approaches and scalable video coding methods (see below).

TCP selection. The major drawback of TCP is its stalling behavior and its slow start-up after an end-to-end

packet loss. However, our measurements (section 5.1) indicated that end-to-end packet losses occur rarely in

home networks and are recovered by retransmission within 20 ms.

Consequently, stalling behavior is almost completely eliminated. On the positive side, the flow control of

TCP adapts the bit rates of the packets to the bandwidth availability. For live video, packets are lost in the

sender buffer of the application, but the same application has access to this buffer and can decide which parts

can be removed. In contrast, RTP just goes on sending packets leading to uncontrolled losses.

3.2. Video frames

An MPEG-2 video stream is built up of I, P and B frames. Each frame represents one picture. An I-frame

contains enough information to be decoded independently. A P-frame needs additional information from a

directly preceding I-frame or P-frame. Motion vectors describe how a part of the referenced picture must be

moved for a correct visualization in the frame to decode. B-frames need additional information from two

frames, a succeeding P- or I-frame and another preceding P- or I-frame. The video is structured in Groups of

Pictures (GOP), containing one I-frame followed by a sequence of B- and P-frames. Examples of legal GOPs

are I(I), IPP(I), IBPBPB(I) or IBBPBBPBB(I). The (I) denotes the start of the next GOP.

A scalable video coding scheme describes an encoding of video frames into multiple layers, including a Base

Layer (BL) of basic quality and several Enhancement Layers (EL) containing increasingly more video data to

enhance the quality of the base layer and thus resulting in video of increasingly higher quality [18]. Scalable

video coding is represented by a variety of methods that could be applied to many existing video coding

standards [10][11][12]. These methods are based on principles of temporal, signal-to-noise ratio (SNR),

spatial and data partition scalability [15]. In our framework we use a specific form of temporal scalability that

we call I-Frame Delay (IFD) and a form of SNR scalability that is resistant against packet losses.

I-Frame Delay. IFD represents a temporal scaling technique. When the network bandwidth drops below the

bit rate of the video, temporal scalability decreases the bit rate of the video by dropping video frames without

influencing the quality of the surviving frames. A reasonably low amount of dropped frames might not be

noticeable by the end-user. However, dropping frames arbitrarily (as would happen with Tail-Drop buffer

handling) is not a good idea, because the impact of dropping an MPEG frame on the end-user

perceived quality depends on the frame type (I, P, or B). We use the frame type to guide the frame

skipping process as follows: when the sender buffer gets full, IFD will push the B frames out of the buffer

first, and only then (i.e. when the bandwidth has dropped significantly for a longer period) the P and I frames.

The cumulative weight of B frames in an MPEG-2 stream often amounts to 50% or more. This means that by

only dropping all B frames we can make the resulting example video stream fit into a bandwidth that is half

the bit rate of the original stream, still preserving inter-frame dependencies. The price for this will be a

decreased frame-rate: in the case of an IBBPBB(I) GOP structure, dropping all B frames would lead to 1/3 of

the original frame-rate and 1/2 of the original bit rate.
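To make this arithmetic explicit, the minimal Python sketch below computes the fraction of the frame rate and of the bit rate that survives when all B frames of a GOP are dropped. The GOP string and the 50% B-frame weight are simply the example values from the text; the function is illustrative and not part of the authors' implementation.

    def drop_all_b(gop="IBBPBB", b_weight=0.5):
        """Fractions of frame rate and bit rate that remain after all B frames
        of one GOP are dropped (illustrative; values from the example above)."""
        surviving = sum(1 for f in gop if f != "B")
        frame_rate_fraction = surviving / len(gop)   # IBBPBB -> 2/6 = 1/3
        bit_rate_fraction = 1.0 - b_weight           # B frames assumed to carry ~50%
        return frame_rate_fraction, bit_rate_fraction

    print(drop_all_b())   # (0.333..., 0.5), matching the IBBPBB(I) example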

SNR scalability. In Figure 2 two possible structures for the enhancement layer are shown. The arrows

indicate the dependence of the frames on each other. A cross indicates the loss of a particular frame in a layer.

A horizontal thick arrow indicates the loss of frames in a layer that depend on the lost frame marked with a

cross. The BL has a normal standard GOP structure. Using a GOP structure that has P and B frames in the EL

is dangerous from a reliability point of view. If the network condition is bad, there is a high risk of losing a

frame during the transmission. In this case if the lost frame is I or P, the receiver will not be able to decode

the rest of the GOP, which will lead to a considerable loss of frames in current and upper enhancement layers

(Figure 2a). In our coding scheme, the enhancement layer is formed from the residuals of the frames from the

base layer. This means that there is no dependency between different frames inside the enhancement layer. Therefore the

loss of any frame from an enhancement layer will not influence subsequent frames (Figure 2b).

Figure 2 Two SNR enhancement layer structures
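The residual construction of the enhancement layer can be illustrated with a small sketch. This is only a conceptual illustration under our own assumptions (8-bit luminance arrays, a uniform quantisation step); it is not the coding scheme implemented by the authors.

    import numpy as np

    def encode_el_residual(original, bl_reconstruction, step=8):
        """Form an EL frame as the quantised residual of the co-timed BL frame.
        Because the residual refers only to the BL, EL frames do not depend on
        each other (cf. Figure 2b)."""
        residual = original.astype(np.int16) - bl_reconstruction.astype(np.int16)
        return np.round(residual / step).astype(np.int16)

    def decode_with_el(bl_reconstruction, el_residual, step=8):
        """Add the dequantised residual to the BL picture; when the EL frame is
        lost, the BL picture is simply shown unchanged."""
        enhanced = bl_reconstruction.astype(np.int16) + el_residual * step
        return np.clip(enhanced, 0, 255).astype(np.uint8)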

Choosing TCP together with IFD or SNR scalable video makes it possible to remove a selected part of the

video at the source. The percentage of video to be removed at the source is determined by the bandwidth.

3.3. Example management

Without too much loss of generality, we explain the framework techniques by looking at the transmission of a

continuous stream over the network of Figure 1. The wireless channel retransmits packets until they arrive at the

receiver, simultaneously adjusting the bit rate of the channel depending on the packet loss rate.

Figure 3 Bandwidth fluctuations for a given stream plus control indication

Figure 3 shows an example of the bandwidth fluctuations as perceived by a TCP stream sent at maximum

packet rate. The first 5 seconds the TCP stream is the single user of the link. From 5 to 10 seconds an

additional file transport shares the wireless link, from 10 to 15 seconds interference is added, from 15

to 20 seconds only the interference remains (the file transport stops), and finally, after 20 seconds, the interference stops as well. Every

40 ms the number of arrived bits is measured and divided by 40 ms to obtain the effective bit rate of the TCP

stream. Using Figure 3 some important observations can be made.

1. To exploit the available bandwidth to its fullest, the video bit rate curve should retrace the measured

curve shown in Figure 3. However, this is impossible in practice due to, among others, the variable

bit rate of the video, use of fixed-sized video layers, inertia of the transcoder, lack of calculation

power etc. Even when the video bit rate follows the available bandwidth, the end-user is confronted

with an unpleasant perception of frequent quality changes [3]. Therefore, the notion of quality level is

introduced. In our case, for simplicity, quality level is fully determined by the video bit rate.

2. We could change the quality level based on feedback, which triggers the source to change to another

quality level when appropriate. In Figure 3 we base this triggering on the changes of the two average

bandwidth values denoted with dashed and solid lines. However, answering the question of when the

quality level should change does not answer the question of by what value the quality level should

change. A worst-case (dashed line) or a more optimistic guaranteed level (solid line) are possible.

3. Using the pessimistic dashed line, the video gets through with maximum probability. However, the

drawback is the low effectiveness of the bandwidth usage. In Figure 3 we see that due to the

fluctuations in the intervals [0,5) and [20,25), the worst-case dotted line is 1 Mbit/s below the

measured bit rate, while in interval [9,14) the bandwidth fluctuation becomes so high, that the worst

case scenario brings us no video at all.

The solid line depicts a video quality level close to the measured value. The price for this is an occasional

loss of data. Two techniques are used to keep the effects of losses low: (1) layered scalable video and (2) I-

Frame Delay (IFD).

4. Management framework

The management is hierarchically ordered. At the highest level, bandwidth allocation is done to permit

bandwidth sharing between videos. At the next level, the bit rate of the video is adapted to the available

bandwidth. At the lowest level less important packets are dropped to assure that more important packets

arrive in time at the destination.

Figure 4 Sender refinement

Figure 4 shows the structure of the sender. The original video enters the application. Inside the application a

transcoder transforms the single layer video into a layered scalable video. The bandwidth allocation algorithm

specifies the sum of the sizes of the layers. The size of the individual layers and the number of layers are

determined by the bandwidth fluctuations. The video layers are presented to TCP, which fragments the

frames into packets. A traffic shaper outputs the packets on the link.

4.1. Slow fluctuations

Two types of slow fluctuations are considered: (1) user interaction to increase or decrease the quality of a

video stream or to start/stop a video stream, and (2) adaptation of the sizes of the video layers in response to

changes in bandwidth availability for example coming from physical stimuli.

User interaction. The UPnP QoS working groups prescribe the elements that distribute the bandwidth

allocation decisions over the network. Each device holds a QoSDevice module, which receives from the

UPnP QoS manager instructions on the bandwidth it may use (see Figure 4). The traffic shaper takes time

windows in which it sends packets according to the prescription received from the QoS manager.
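One possible reading of such a traffic shaper is sketched below: packets are released in fixed time windows, and at most a prescribed byte budget is sent per window. The window length, the byte budget and the interface are illustrative assumptions of ours, not values or APIs taken from UPnP-QoS.

    import time
    from collections import deque

    class WindowedTrafficShaper:
        """Send at most budget_bytes per window_s seconds (illustrative sketch)."""

        def __init__(self, send, budget_bytes, window_s=0.040):
            self.send = send              # callable that puts one packet on the link
            self.budget = budget_bytes    # prescription received from the QoS manager
            self.window = window_s
            self.queue = deque()

        def submit(self, packet):
            self.queue.append(packet)

        def run_window(self):
            """Emit packets until the byte budget of this window is used up."""
            sent = 0
            while self.queue and sent + len(self.queue[0]) <= self.budget:
                packet = self.queue.popleft()
                self.send(packet)
                sent += len(packet)
            time.sleep(self.window)       # wait for the start of the next window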

Layer configurations. Two modules in our framework are involved in determining the number of layers and

the size of each layer: (1) scalable transcoder, and (2) layer configurator (see Figure 4).

Scalable transcoder. The transcoder converts non-scalable video into multi-layered scalable video. The layer

configuration may be changed at run-time. The input to the transcoder is provided via a reliable channel, thus

assuming that there are no losses or delays in the incoming stream.

Layer configurator. The layer configurator chooses the number and bit-rates of the layers based on the acquired

information about network conditions and receiver’s decoding capability. The network information is used to

estimate the currently available network bandwidth, fluctuations and errors. The receiver’s decoding

capability is used to define the maximal number of layers that can be handled by the receiver.

4.2. Fast fluctuations

The scalable video solution makes it possible to react to fast changes by dropping enhancement layers for a

given frame. The layers should be chosen such that the bit-rate of BL is less than the available bandwidth to

assure that BL always arrives. Taking this to its logical consequence means a very small BL. A small BL

yields a very low quality that is difficult to repair with larger EL [1]. Therefore, the BL is chosen as high as

possible. To counteract the fast bandwidth fluctuations two components are employed (1) I-Frame Delay

(IFD) algorithm and (2) layered frame scheduler.

IFD. Our experiments show that impressive improvements (compared to the default Tail-Drop technique) can

be achieved with only two buffers, which each accommodate one video frame. Let us mark the frames in the

buffer as follows: S – the frame that is being transmitted (and is partially sent), W – the frame waiting in the

buffer. The frame, which is offered for transmission by the application, will be marked as C. When W is

present, which means that we cannot buffer any more frames, and C is arriving from the application, the

scheduling algorithm decides which of the two frames (C and W) is least important for the end quality and

discards that one. The following algorithm favors I frames and P frames over B frames:

WHILE (TRUE) DO
    WHILE (C is empty) DO Nothing
    IF (W is empty) THEN Store C in W
    ELSE
        IF (C is of type I) THEN Overwrite W with C
        ELSE IF (C is of type B) THEN Discard C
        ELSE IF (W is of type I or P) THEN Discard C
        ELSE Overwrite W with C

A Boolean is added to the algorithm to drop all frames up to the next I-frame when a P-frame is dropped.
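The pseudocode and the added Boolean can be made concrete as follows. This is a minimal Python sketch of the two-slot decision under our own naming; it is not the authors' implementation.

    class IFDBuffer:
        """Two-slot IFD sender buffer: one frame (S) is being transmitted, one
        frame (W) is waiting; offer() decides what happens to an arriving frame C."""

        def __init__(self):
            self.waiting = None           # frame W, or None when the slot is free
            self.drop_until_i = False     # the Boolean: set once a P frame is dropped

        def offer(self, frame):
            """frame is a tuple (ftype, payload) with ftype in {"I", "P", "B"}."""
            ftype = frame[0]
            if self.drop_until_i:
                if ftype != "I":
                    return                # drop the remainder of the GOP
                self.drop_until_i = False
            if self.waiting is None:
                self.waiting = frame      # free slot: store C in W
            elif ftype == "I":
                self.waiting = frame      # an I frame always overwrites W
            elif ftype == "B":
                pass                      # an arriving B frame is dropped first
            elif self.waiting[0] in ("I", "P"):
                self.drop_until_i = True  # C is a P frame and is discarded; frames
                                          # depending on it are useless until the next I
            else:
                self.waiting = frame      # a P frame overwrites a waiting B frame

        def take(self):
            """Called by the transport when the transmission slot becomes free."""
            frame, self.waiting = self.waiting, None
            return frame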

Figure 5 Transmission of packets from sender buffer (left), and filling sender buffer (right)

Frame scheduler. The scheduling combines the layered scalable video with the IFD temporal approach at

the sender. IFD and layered scalable video can be used independently and in isolation. The combination

supports larger fluctuations in bandwidth. However, there is a minimum bandwidth of 1 Mbit/s, associated

with the 802.11 technology. Layers of a scalable video are sent according to a priority scheme. Since BL

information is absolutely necessary, the BL has the highest priority. The priority of each EL decreases with

increasing layer number. When a frame from ELx is being transmitted and a frame from BL arrives, the

sender sends the BL frame after the transmission of the current packet belonging to ELx (if any). When a

frame from ELx arrives, it preempts a frame from ELy (where y>x) in a similar fashion (see Figure 5).

When the channel bandwidth has become lower than the total video bit rate, the sender buffer gets full. To

prevent sending late packets, we introduce a maximum lifetime for EL packets. If the maximum is reached,

the packet is deleted from the buffer. BL packets are removed by IFD independent of their lifetime.
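A minimal sketch of this scheduler is given below: one queue per layer with the BL first, and EL packets carry an enqueue time so that outdated packets are dropped instead of being sent late. The queue layout, the lifetime value and the interface are our own illustrative assumptions.

    import time
    from collections import deque

    class LayeredFrameScheduler:
        """Serve BL packets before EL1, EL2, ...; drop EL packets whose lifetime
        has expired (sketch only; BL packets are managed by IFD, not dropped here)."""

        def __init__(self, num_layers, el_lifetime_s=0.080):
            self.queues = [deque() for _ in range(num_layers)]   # index 0 = BL
            self.el_lifetime = el_lifetime_s

        def enqueue(self, layer, packet):
            self.queues[layer].append((time.monotonic(), packet))

        def next_packet(self):
            """Return the next packet to transmit, honouring layer priority."""
            now = time.monotonic()
            for layer, queue in enumerate(self.queues):
                while queue:
                    enqueued_at, packet = queue.popleft()
                    if layer == 0 or now - enqueued_at <= self.el_lifetime:
                        return packet
                    # outdated EL packet: drop it and inspect the next one
            return None                                          # nothing to send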

4.3. Choosing layers

The layer configurator uses a table that is created off-line to choose the most appropriate layer configuration

as a function of the network conditions. For a predefined set of network conditions we estimate (1) the loss

probability per layer for each layer configuration, and (2) the average quality that can be delivered by this

layer configuration (by looking at loss probabilities and calculating the SNR quality of the video). A fixed

maximum number of layers is used per network condition. If the decoding capacity of the receiver is lower

than the suggested BL-value, part of the bit-rate of BL is reassigned to the first EL. For example, if an

optimal configuration for a given network condition yields a BL of 4 Mbps, an EL1 of 2 Mbps, and an EL2 of

2 Mbps and due to device requirements the BL should be limited to 1 Mbps, then the BL bit-rate is set to 1

Mbps and EL1 bit-rate is set to 5 Mbps.
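The reassignment rule of this example can be written down directly. The sketch below assumes that a configuration is simply a list of layer bit rates with the BL first; it reproduces the rule from the text, with the 4/2/2 Mbps example as a check.

    def cap_base_layer(config_mbps, bl_capacity_mbps):
        """config_mbps: [BL, EL1, EL2, ...] bit rates in Mbps.  If the receiver
        cannot decode the suggested BL, the excess BL bit rate is reassigned
        to EL1, as described in the text."""
        bl, *els = config_mbps
        if bl <= bl_capacity_mbps:
            return config_mbps
        if not els:
            els = [0.0]
        els[0] += bl - bl_capacity_mbps
        return [bl_capacity_mbps] + els

    print(cap_base_layer([4.0, 2.0, 2.0], 1.0))   # -> [1.0, 5.0, 2.0], as in the example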

Offline, a network simulation environment creates strategies for the layer configurator as shown in Figure 6.

The environment consists of five major components: frame size generator, packet generator,

sender/prioritizer, wireless channel simulator, and receiver/quality calculator. The frame size generator

produces normally distributed random values for frame sizes based on the stream bit-rate, assuming that the

mean size of a frame in a stream is equal to the bit-rate divided by frame rate. These values are passed to the

packet generator, which formats an incoming data stream into a set of packets based on video stream syntax

and the network protocol specification. The packets are buffered and sent over in accordance with their

priority by the sender/prioritizer. In accordance with MAC level retransmissions of 802.11-like protocols, we

allow a fixed number of retransmissions for a packet that is lost. The module also uses a maximum lifetime

for packets, so the outdated packets are deleted from the buffer. The Gilbert model [13] was used for the

insertion of errors into the transmission channel of the channel simulation module. The module, based on the

description of the network condition, expressed in average available bandwidth, error rate and burstiness of

errors, calculates the amount of error-free packets, dropped packets and frames. Corrupted packets are

dropped (a packet is considered corrupted if at least one bit of the packet is wrong). A complete frame is

dropped when at least one packet of the frame is dropped.
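A minimal sketch of a two-state Gilbert error model, as used by the channel simulator, is shown below. The transition probabilities and the per-state loss probabilities are illustrative parameters, not the values used in the paper.

    import random

    class GilbertChannel:
        """Two-state Gilbert model: a Good and a Bad state with different packet
        corruption probabilities; burstiness follows from the state transitions."""

        def __init__(self, p_good_to_bad=0.01, p_bad_to_good=0.2,
                     loss_good=0.001, loss_bad=0.3):
            self.state = "good"
            self.p_gb, self.p_bg = p_good_to_bad, p_bad_to_good
            self.loss = {"good": loss_good, "bad": loss_bad}

        def packet_lost(self):
            """Advance the state and report whether the next packet is corrupted
            (a corrupted packet is dropped, as in the simulator)."""
            if self.state == "good" and random.random() < self.p_gb:
                self.state = "bad"
            elif self.state == "bad" and random.random() < self.p_bg:
                self.state = "good"
            return random.random() < self.loss[self.state]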

[Diagram: frame size generator -> packet generator -> sender/prioritizer -> wireless channel simulator -> receiver/quality calculator. Inputs: layer configuration (number of layers, bit-rates of layers), network conditions (average bandwidth, error rate, burstiness), packetization scheme, buffer sizes, packet lifetime and number of retransmissions. Output: average PSNR as a function of network condition and layer configuration.]

Figure 6 Network simulation environment (input in italic is implementation specific)

Finally, the packets of a given frame, transmitted over the channel, are merged together into a single frame in

the receiver/quality calculator module. The receiver computes how many times corresponding frames from

different layers are transmitted successfully. Based on these values and knowing (from predefined data) the

mapping between layer size and quality expressed in average Peak Signal to Noise Ratio (PSNR) the module

calculates average PSNR for the received video. The layer configuration with the highest average PSNR is

considered to be optimal for the given network condition.
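The final selection step amounts to an argmax over the simulated configurations. The sketch below assumes that the simulation results are available as a mapping from network condition and layer configuration to average PSNR; the data structure and the example numbers are hypothetical.

    def build_lookup_table(psnr_results):
        """psnr_results: {network_condition: {layer_configuration: average PSNR}}.
        Return, per network condition, the configuration with the highest
        average PSNR (the off-line table used by the layer configurator)."""
        return {condition: max(configs, key=configs.get)
                for condition, configs in psnr_results.items()}

    # Hypothetical example; configurations are (BL, EL) bit rates in Mbps.
    table = build_lookup_table({
        "low_error":  {(4.0, 1.0): 38.2, (3.75, 1.0): 37.9},
        "high_error": {(1.5, 1.25): 30.1, (1.25, 1.5): 31.4},
    })
    print(table)   # e.g. {'low_error': (4.0, 1.0), 'high_error': (1.25, 1.5)}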

4.4. Interoperability

It is important that the framework not only solves the technical requirement of showing the best possible

video as a function of transmission conditions and receiver capacity, but also provides a high level of

interoperability. The UPnP and DLNA standards and recommendations govern all interaction between

devices on the network. The global problem of sharing network resources between applications is solved

within the context of the standard. The problem of optimizing perceived video quality is solved entirely

within the sender. Consequently, it is possible to apply the framework solutions within an interoperable

framework, still allowing the manufacturers to improve the quality of their own senders.

5. Evaluation

The evaluation is done in two parts. Section 5.1 shows the validity of our choice for TCP. Section 5.2 shows how

the framework handles the variations in bandwidth coming from fluctuating operational conditions.

5.1. Video streaming with TCP over a wireless medium

The measurements presented in this paper are a selection from the measurements described in [24]. The

measurement setup is as follows. A PC with a wireless card is used as sender. The PC sends video over IEEE

802.11b to an Access Point (AP). The AP is connected with switched Ethernet to the receiver PC which

renders the video. The following transmission protocols are compared: (A) Unblocking UDP, (B) blocking

UDP, (C) RFC 2250 [9], which describes packetizing mpeg, over blocking UDP, (D) TCP and (E) TCP with

IFD. An MPEG-2 video with a duration of 60 seconds was streamed from sender to renderer at 4 different bit

rates: 3, 4, 5 and 6 Mbit/s. The MPEG-2 video specifies that 25 frames are sent per second, i.e. 40 ms between

each frame.

For all transmission protocols, frames are sent with the bit rate of the video or with a rate limited by the

bandwidth of the wireless channel. When the bandwidth is smaller than the bit rate of the video, the effective

bit rate was reduced at the sender such that the duration of the video increased beyond the original 60 seconds

for transmission protocols B, C, and D. For transmission protocol A (unblocking UDP), packets were lost

inside the driver of the sender in an uncontrolled fashion, limiting the duration to 60 seconds. Transmission

protocol A is rejected for that reason. For transmission protocol E (TCP with IFD) the rendered video

duration remained equal to 60 seconds while losses were controlled. This is explained in more detail below.

The maximum effective transmission rate is 6 Mbit/s for B, 4.5 Mbit/s for C and 5 Mbit/s for D. The

effective transmission rate of protocol C (RTP, the official standardized video protocol) is lower than that of B and

even D (TCP) because the packet overhead is larger and not all packets are completely filled.

(a) (b)

Figure 7 (a) Throughput for TCP versus wireless retry value and (b) latency of TCP versus video bit rate

One of the properties of the wireless link is that the sending of a packet is immediately acknowledged at the

link layer. When the sender receives no acknowledgement, the wireless frequency rate is lowered and the

packet is resent. The retry value determines the number of times a packet can be resent before it is definitely

lost. Figure 7(a) shows the throughput of TCP with different retry values of the wireless link. On the

horizontal axis the time during video transmission is shown. After 20 seconds the microwave is switched on,

and switched off after 40 seconds. When the retry value of the wireless link is set to one (no retransmissions)

the TCP protocol needs to resend all lost packets and we see that the throughput is below 1 Mbit/s. Not

visible in the figure is that with a retry value of four almost all packets arrive and TCP retransmits only 3 to 4

packets during the full 60 seconds. With blocking UDP we see the same dip between 20 and 40 seconds with

the difference that the maximum transmission rate is 4.5 Mbit/s, no end-to-end retransmission takes place and

frames are lost.

Figure 7(b) shows the latency of the frames with respect to the time they should have arrived at the receiver,

given the arrival time of the first frame. A retry value of 4 is used, meaning that no losses occur during wireless

transmission. A latency of 80 msec is acceptable with a reception buffer of two frames. For bit rates of 3 and

4 Mbit/s the video transmission is stopped after 60 seconds. During the interval [20,40) for video bit rate 3

Mbit/s a delay builds up and disappears once (probably associated with a TCP retransmission) and for video

bit rate 4 Mbit/s two delays build up and are removed later on. For bit rate 6 Mbit/s the situation is dramatic,

an enormous latency builds up during the whole transmission period. For bit rate 5 Mbit/s latency builds up

during the microwave on period. The same type of behavior can be measured with blocking UDP, with the

exception that for video bit-rate of 5 Mbit/s the latency builds up as for the video bit rate of 6 Mbit/s, and

video frames are lost from time to time.

A latency value larger than 80 msec means that the video stalls at several moments. In case of live

video, frames would even be lost at the sender buffer, because video cannot be delayed, contrary to the

conditions in this experiment. The total effect on the viewer would be disastrous.

The IFD protocol is activated on top of TCP to trade latency against frame losses in a controlled fashion.

Figure 8 shows the latency versus time with wireless retry value equal to 4 (TCP retransmissions < 5) for

different video bit rates. For all bit rates the maximum latency remains below 110 msec. This means that live

video is transmitted in time even when overload conditions occur. The 110 msec means that a given frame

can be rendered two to three times consecutively when frames are dropped. This dropping leads to jerky

movements at times. During the 60-second transmission of the 6 Mbit/s video, with a 20-second

microwave perturbation, no I-frames are removed, 3 P-frames are removed and 314 B-frames are removed

(in total 20% of the frames). The consequence for the video is no artifacts and sometimes a jerky movement,

but no rendered frame is delayed by more than 110 msec.

Figure 8 Latency of TCP with IFD for different video bit rates

The IFD protocol can also be activated on top of UDP. However, the disadvantage is that when a second

wireless segment is introduced (e.g. sender to AP and from AP to renderer), IFD must also be applied for the

following segment (e.g. in the Access Point). From an interoperability point of view this is difficult to realize

in practice (manufacturers have to agree on the specifics of the IFD protocol).

Given all these measurements, TCP coupled with controlled frame losses seems the best

solution to obtain timely live video under restricted network resources, because the TCP throughput is the

highest while giving the least chance of artifacts in the rendered video.

5.2. Framework behavior

An example behavior is based on Figure 3. Each interval experiences network conditions that are different

from both predecessor and successor intervals. The wildly moving line represents the bandwidth of the

channel sampled with 40 ms intervals.

Interval  BL (Mbps)  EL (Mbps)  BL loss rate %  EL loss rate %
1         4          1          0               4.4
          4.25       1          0.1             12.5
          3.75       1          0               1.1
2         3.5        1          0.2             9.8
          3.25       1          0               4.2
          3.75       1          0.7             19.5
3         1.5        1.25       6.5             26.7
          1.5        1          6.5             20.8
          1.75       1          8.8             26.9
          1.25       1.5        5               26.6
4         2.25       1          2               17.9
          2          1          0.8             11.7
          2          1.25       0.8             17.6
5         4          1          0               4.4
          4.25       1          0.1             12.8
          3.75       1          0               1

Table 1 Configurations that deliver the highest average objective quality under different network conditions

Two layers are transmitted. The lower dotted line approximates the average bit rate of the BL. The higher

solid line represents the average bit rate of the BL + EL layers. In practice the video bit rate fluctuates around

the average value with a deviation of ± 25%. As soon as a change of network conditions is detected, the layer

configurator changes the bit rates of the layers. The network conditions of the different time intervals are

expressed in error rate and burstiness of errors. Once these parameters are measured the configurator uses a

simple look up table to choose a layer configuration.

The lookup table is created offline as described in section 4.3. For each of the 5 intervals of Figure 3 the best

fit is calculated with one of the network transmission conditions used by the network simulation environment.

Table 1 shows three (four for time interval 3) best layer configurations for every time interval (network

condition) of our example. The uppermost configuration of each time interval is preferred as the best choice

for that time interval. Under good network conditions the second and third configuration in a given interval

differ from the best only by the size of the BL (± 0.25 Mbps). A small change in the bit rate of the BL

produces a more significant difference in quality than a change in an EL.

Under poor network conditions the BL size is very small (segment 3). So, even a small increase in the bit rate of the BL

brings a large rise in objective video quality. However, the penalty for the bit rate increase is a

high loss rate for BL, which influences the subjective quality value. The acceptable loss range for BL is 5%,

so for segment 3 a configuration with BL of 1.25 Mbps and EL of 1.5 Mbps is chosen.
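The choice for segment 3 can be expressed as a constrained selection over the candidates of Table 1: take the highest-quality configuration whose BL loss rate stays within the acceptable 5% bound. The sketch below uses the interval 3 values from Table 1; only the ordering of the candidates (best objective quality first) is assumed.

    def choose_configuration(candidates, max_bl_loss=5.0):
        """candidates: (BL Mbps, EL Mbps, BL loss %, EL loss %) tuples, ordered
        from highest to lowest objective quality as in Table 1.  Return the
        best candidate whose BL loss rate is acceptable."""
        for bl, el, bl_loss, _el_loss in candidates:
            if bl_loss <= max_bl_loss:
                return bl, el
        return candidates[-1][:2]          # fall back to the last candidate

    interval3 = [(1.5, 1.25, 6.5, 26.7), (1.5, 1.0, 6.5, 20.8),
                 (1.75, 1.0, 8.8, 26.9), (1.25, 1.5, 5.0, 26.6)]
    print(choose_configuration(interval3))   # -> (1.25, 1.5), as chosen in the text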

6. Conclusions

Streaming video requires a dynamic adaptation of the video to the available network resources and

destination resources. Even when the bit rate of the video is in agreement with the average network

bandwidth, wireless networks suffer rapid bandwidth fluctuations caused by interference from other electro-

magnetic sources, and the differences in frame sizes imply fluctuating bandwidth requirements. This calls for

a hierarchical approach to handle at the highest level the slow bandwidth changes, and at a lower level the

fast bandwidth changes. At the same time the capabilities of the receiver need to be taken into account to

prevent sending data, which cannot be handled at the destination. When nothing is done, the video is rendered

with artifacts, which are very annoying to the viewer of the video.

A framework is proposed that removes video data to reduce the bit rate of the video in a controlled fashion. A

transcoder adapts the bit rate to the slow fluctuations, while the fast fluctuations are handled by throwing

away layers of SNR scalable streams or by removing entire frames when the bandwidth drop is extremely large

and sudden. Using TCP provides the following advantage: all the intelligence of the system is concentrated

at the sender side. No network protocol adaptation is needed.

Presenting solutions for optimizing transmission of video is not enough. The solutions should be presented in

an interoperability framework, to be accepted by the CE device manufacturers. The paper shows how such a

framework can be integrated within the UPnP and DLNA standardization efforts.

Acknowledgements

We would like to thank Jeffrey Kang and Jan Ouwens for many helpful discussions and valuable input.

References

[1] R. Haakma, D. Jarnikov, P. van der Stok, Perceived quality of wirelessly transported videos, in Dynamic and

Robust Streaming in and between Connected Consumer-Electronic Devices (ed. P. van der Stok), Series: Philips

Research Book Series, Vol. 3, 2005

[2] A. S. Tanenbaum, Computer Networks, 4th ed., Prentice-Hall, 2003.

[3] M. Zink et al, Subjective Impression of Variations in Layer Encoded Videos, KOM Multimedia

Communications, 2003

[4] Pedro Cuenca et al, Performance Evaluation of Cell Discarding Mechanisms for the Distribution of VBR

MPEG-2 Video Over ATM Networks. IEEE Transactions on Broadcasting, 44(2), June 1998

[5] Tao Tian et al, Priority dropping in network transmission of scalable video. International Conference on Image

Processing, 3:400-3, Sept. 2000

[6] Dmitri Jarnikov, Peter van der Stok, Johan Lukkien, Wireless streaming based on a scalability scheme using

legacy MPEG-2 decoders, Ninth IASTED Int. Conference on Internet & Multimedia Systems & Applications, 2005

[7] C.C. Wust, L. Steffens, R.J. Bril, and W.F.J. Verhaegh, “QoS Control Strategies for High Quality Video

Processing”. In Proc. 16th Euromicro Conference on Real-Time Systems (ECRTS), Catania, Italy, 2004.

[8] D. Isovic and G. Fohler, “Quality aware MPEG-2 Stream Adaptation in Resource Constrained Systems”. In

Proc. 16th Euromicro Conference on Real-Time Systems (ECRTS), Catania, Italy, 2004.

[9] D. Hoffman, G. Fernando, V. Goyal and M. Civanlar, “RTP Payload Format for MPEG1/MPEG-2 Video. RFC

2250”, Network Working group, Jan. 1998.

[10] ISO/IEC International Standard 13818-2, “Generic Coding of Moving Pictures and Associated Audio

Information: Video”, Nov., 1994.

[11] ISO/IEC International Standard 14496-2, “Information Technology – Generic Coding of Audio-Visual Objects,

Part 2: Visual”, MPEG98/N2502a, Oct., 1998.

[12] ITU-T International Telecommunication Union, “Draft ITU-T Recommendation H.263 (Video Coding for Low

Bit Rate Communication)”, KPN Research, The Netherlands, Jan., 1995.

[13] J. R. Yee and E. J. Weldon, Jr., “Evaluation of the performance of error-correcting codes on a Gilbert

channel”, IEEE Trans. on Communications, pp. 2316-2323, Aug. 1995.

[14] D. Jarnikov, P. van der Stok, C.C. Wust, “Predictive Control of Video Quality under Fluctuating Bandwidth

Conditions”. ICME '04, Volume: 2 , pp. 1051 – 1054, June 27-30, 2004

[15] McCanne, S., Vetterli, M., Jacobson, V., “Low-complexity video coding for receiver-driven layered multicast”,

IEEE journal on selected areas in communications, vol. 16, no 6, p.983-1001, 1997.

[16] Peter Amon, Jurgen Pandel, “Evaluation of Adaptive and Reliable Video Transmission Technologies”,

available from http://www.polytech.univ-nantes.fr/pv2003/papers/pv/html/main/all_pap.htm

[17] R.J. Bril, C. Hentschel, E.F.M. Steffens, M. Gabrani, G.C. van Loo and J.H.A. Gelissen, “Multimedia QoS in

consumer terminals”, Proc. IEEE Workshop on Signal Processing Systems (SIPS), pp. 332-343, Sep. 2001.

[18] Yao Wang, Joern Ostermann, and Ya-Qin Zhang, “Video Processing and Communications”, Prentice Hall,

2002.

[19] H. Schulzrinne, G.M.D. Fokus, S. Casner, R. Frederick and V. Jacobson. RTP: A Transport Protocol for Real-

Time Applications. Internet Engineering Task Force, A/V Transport Working Group, Jan. 1996.

[20] J. Postel. Transmission Control Protocol. RFC 793, Information Sciences Institute, September 1981.

[21] S. Liang and D. Cheriton., TCP-RTM: Using TCP for Real-Time Multimedia Applications, InfoCom 2001.

[22] L. Lenzini, E. Mingozzi, G. Stea, A unifying service discipline for providing rate-based guaranteed and fair

queuing services based on the Timed Token protocol, IEEE transaction on Computers, Vol 51, Nr 9 2002.

[23] J.C.R. Bennett and H. Zhang, Hierarchical packet fair queuing algorithms, Proc of the ACM SIGCOMM 1996.

[24] J. Ouwens, The Performance of Wireless MPEG-2 Video Streaming, Philips Internal note TN-2005/00735.

[25] IEEE 1394 standard

[26] HiperLAN, http://en.wikipedia.org/wiki/HIPERLAN#HIPERLAN.2F2

[27] IEEE 802.11e standard

[28] Wi-Fi CERTIFIED™ for WMM™ - Support for Multimedia Applications with Quality of Service in Wi-Fi®

Networks, http://www.wifi.org/membersonly/getfile.asp?f=WMM_QoS_whitepaper.pdf

[29] WiMedia, http://www.wimedia.org/en/index.asp

[30] HomePlug AV White Paper, http://www.homeplug.org/en/docs/HPAV-White-Paper_050818.pdf

[31] Residential Ethernet Overview, Michael Johas Teener, CommsDesign,

http://www.teener.com/ResidentialEthernet/Residential%20Ethernet.pdf

[32] UPnP forum, www.upnp.org

[33] UPnP Quality of Service specifications, http://www.upnp.org/standardizeddcps/qualityofservice.asp

[34] DLNA Interoperability Guidelines v1.5, March 2006

[35] DLNA Media Format Guidelines v1.5 - Volume 2, March 2006
