Thesis.doc

Performance Evaluation and Comparison of Middle-Ware based QoS support for

Real-Time/Multimedia applications versus using Modified OS Kernels

By

Saif Raza Hasan

1

University of Massachusetts, Amherst, MA, USA 2001

2

Table of Contents

1. Introduction......................................................................................................................3

2. System Description..........................................................................................................3

2.1 QLinux Features.........................................................................................................3

2.2 TAO / Corba Features................................................................................................4

3. Experimental Setup..........................................................................................................6

3.1 Echo / Null RPC.........................................................................................................6

3.2 Simple HTTP Client-Server.......................................................................................9

3.3 Concurrent HTTP Client-Server..............................................................................10

3.4 Compute Intensive HTTP Client-Server..................................................................10

3.5 Long File/ MPEG Download...................................................................................11

3.6 Video Streaming Without QoS Specification..........................................................13

3.7 Video Streaming With QoS Specification................................................................16

3.8 Video Streaming performance in presence of another application..........................19

3.9 Industrial Control System Simulation......................................................................20

4. Summary & Discussion of Qualitative Comparisons....................................................21

4.1 QLinux......................................................................................................................22

4.2 TAO / CORBA.........................................................................................................22

References..........................................................................................................................24

3

1. Introduction

Most common off-the-shelf operating systems (COTS) such as Windows-NT, Linux and most Unises are not designed to provide applications with QoS guarantees for access to system resources such as network bandwidth or CPU time. This leads to poor performance of Real-Time / Multimedia applications which have deadlines that must be met in a guaranteed fashion on these systems.

In trying to address this problem, two approaches are common. One is to modify the OS itself and incorporate mechanisms for guaranteeing QoS and modifying applications to use these mechanisms. The other alternative is to use a middle-ware which runs on top of the unmodified operating and provides an API for application programmers that allows them to specify QoS requirements, and the middleware attempts to meet those requirements.

In this project we have compare the performance and evaluated some of the trade-offs involved in these two approaches. We have used the QLinux kernel which supports HSFQ for process and packet scheduling as an example of a modified COTS system and the TAO ORB as the example for a Real Time Middleware. TAO is an open-source implementation of the Corba specification and contains optimizations and built in support for providing QoS to it’s applications.

We’ve developed various applications and performed experimental evaluation of their performance to obtain a quantitative and qualitative comparison of the two systems.

2. System Description

The systems that we are evaluating and comparing in this project are QLinux and TAO (a Real Time CORBA implementation). The two systems represent two alternative approaches to providing QoS guarantees to Soft Real Time / Multimedia applications. The mechanisms and features provided by the two systems are different and in this section describe them in detail.

2.1 QLinux Features

QLinux is a Linux kernel that can provide quality of service guarantees. QLinux, based on the Linux 2.2.x kernel, combines some of the latest innovations in operating systems research. It includes the following features:

Hierarchical Start Time Fair Queuing (H-SFQ) CPU scheduler Hierarchical Start Time Fair Queuing (H-SFQ) network packet scheduler

The H-SFQ CPU scheduler enables hierarchical scheduling of applications by fairly allocating cpu bandwidth to individual applications and application classes. The H-SFQ

4

packet scheduler provides rate guarantees and fair allocation of bandwidth to packets from individual flows as well as flow aggregates (classes). The kernel also supports Lazy Receiver Processing (LRP) of packets, however, this feature was turned off in the kernel we used for our experiments.

In the QLinux framework for CPU scheduling[1], the hierarchical partitioning requirements are specified by a tree structure. Each thread in the system belongs to exactly one leaf node. Each non-leaf node represents an aggregation of threads and hence an application class in the system. Each node in the tree has a weight that determines the percentage of its parent node’s bandwidth that should be allocated to it. Also, each node has a scheduler: whereas the scheduler of a leaf node schedules all the threads that belong to the leaf node, the scheduler of an intermediate node schedules it’s child nodes. Given such a scheduling structure, the scheduling of threads occurs hierarchically. The weights associated with the nodes determine the proportion of the CPU bandwidth available to the threads/application class represented by that node.

Similarly for the case of the network packet scheduler[2], it is possible to create a hierarchical structure that represents the distribution of available network bandwidth to the various flows in the system. This is done by attaching queues to leaf nodes in the scheduling hierarchy and associating sockets with the respective queues.

These manipulations are all performed through special system calls made by an application running as root. This application can be the same or different as the one requiring the QoS guarantees.

2.2 TAO / Corba Features

TAO[5] is a freely available, open-source, and standards-compliant real-time implementation of CORBA that provides efficient, predictable, and scalable quality of service (QoS) end-to-end. Unlike conventional implementations of CORBA, which are inefficient, unpredictable, non-scalable, and often non-portable, TAO applies the best software practices and patterns to automate the delivery of high-performance and real-time QoS to distributed applications.

Many types of real-time applications can benefit from the flexibility of the features provided by the TAO ORB and its CORBA services. In general, these applications require predictable timing characteristics and robustness since they are used in mission-critical real-time systems. Other real-time applications require low development cost and fast time to market.

Traditionally, the barrier to viable real-time CORBA has been that many real-time challenges are associated with end-to-end system design aspects that transcend the layering boundaries traditionally associated with CORBA. That's why TAO integrates the network interfaces, OS I/O subsystem, ORB, and middleware services in order to provide an end-to-end solution.

5

http://www.cs.wustl.edu/~schmidt/PDF/Grand_Challenges.pdf

http://www.cs.wustl.edu/~schmidt/PDF/Grand_Challenges.pdf

http://www.cs.wustl.edu/~schmidt/ACE_wrappers/TAO/docs/orbsvcs.html

http://www.cs.wustl.edu/~schmidt/corba-research-realtime.html

http://www.cs.wustl.edu/~schmidt/corba-research-realtime.html

http://www.cs.wustl.edu/~schmidt/corba-research-performance.html

http://www.cs.wustl.edu/~schmidt/PDF/ORB-patterns.pdf

http://www.cs.wustl.edu/~schmidt/ACE_wrappers/docs/ACE-development-process.html

http://www.omg.org/

http://www.cs.wustl.edu/~schmidt/PDF/orc.pdf

For instance, consider the CORBA Event Service[6], which simplifies application software by supporting decoupled suppliers and consumers, asynchronous event delivery, and distributed group communication. TAO enhances the standard CORBA Event Service to provide important features, such as real-time event dispatching and scheduling, periodic event processing, efficient event filtering and correlation mechanisms, and multicast protocols required by real-time applications.

TAO is carefully designed using optimization principle patterns that substantially improve the efficiency, predictability, and scalability of the following characteristics that are crucial to high-performance and real-time applications:

TAO uses active demultiplexing and perfect hashing optimizations to dispatch requests to objects in constant, i.e., O(1), time regardless of the number of objects, IDL interface operations, or nested POAs.

TAO's IDL compiler generate either compiled and/or interpreted stubs and skeletons, which enables applications to make fine-grained time/space tradeoffs in the presentation layer.

TAO's concurrency model is carefully design to minimize context switching, synchronization, dynamic memory allocation, and data movement. In particular, TAO's concurrency model can be configured to incur only 1 context switch, 0 memory allocations, and 0 locks in the ``fast path'' through the ORB.

TAO uses a non-multiplexed connection model that avoids priority inversion and behaves predictably when used with multi-rate real-time applications.

TAO's I/O subsystem is designed to minimize priority inversion interrupt overhead over high-speed ATM networks and real-time interconnects, such as VME. It also runs very efficiently over standard TCP/IP protocols, as well.

TAO provides all these optimizations within the standard CORBA 2.x reference model.

6

http://www.cs.wustl.edu/~schmidt/PDF/Words99.pdf

http://www.cs.wustl.edu/~schmidt/PDF/oopsla.pdf

3. Experimental Setup

To evaluate the two systems, we developed and tested a set of applications for them and performed various performance measurements. For this purpose we used a fire-walled network of identical Linux workstations. The machines were all 300 MHz Pentium – II’s with 192MB RAM connected to each other through a 10Mbps Ethernet link and running Redhat Linux 6.1. Applications were client-server type and were tested in two configurations – client running on the same host as the server and client running on another host in the network (remote client case).

The following applications were developed for both systems - (QLinux as well as Corba) to compare various aspects of performance of these two systems. Since the two systems support slightly different application design frameworks, care was taken to keep the corresponding applications on both systems comparable.

3.1 Echo / Null RPC

This application was designed to measure the response time and the corresponding over head for a single unit of synchronous communication. In this application, the client would send a small (usually empty string) string to the server which would then respond by sending the same string back. The metric in this application was the response time of a single request as observed from the client side.

In the case of the application QLinux the client and server communicate through a simple, connection oriented (TCP) socket. The server launches a new thread whenever a client connects to it, to service the request. The server then reads the string sent by the client from the socket and writes it back to the socket immediately. The client sends 100 sequential requests to the server, records the response times for each of these and then exits. As this application does not involve and QoS specific services, it was run on a plain Linux kernel.

In the Corba based application, we used the concept of a distributed object to implement the Echo server. The server implements the interface of the Echo object (the interface consists of a single function that takes in a string argument and returns the same string).

The server creates an instance of the Echo object, writes it's unique identification string (IOR) to a disk file and then waits for client requests. The client reads the Server IOR string from the disk file to obtain a reference to the Echo object. It then performs the Echo function by means of an RPC type mechanism. The actual sending and receiving of the data between the client and the server takes place through the ORB (Object Request Broker) and the details of this are completely transparent to the programmer.

For a fair evaluation, we configured the server to operate in a 'thread per request' mode. Other modes such as reactive, 'thread per connection' or pool of threads are also available without needing any special coding effort from the developer. Also, this application does

7

not use any QoS related features of the TAO ORB, but it does benefit from any special optimizations built into the ORB.

The following table indicates the average response times for the Echo application for the two systems:

Local Client Remote Client

Non Corba 513.55 7.01 767.45 4.97

Corba withfirst request

648.99 119.42 897.93 119.65

Corba without first

request588.101 4.31 836.899 3.04

As can be seen from the table, on the average, the response time for the Corba based echo application was larger than for the simple socket based ones. This is because of the extra overhead which the middle-ware involves. An strace of the Corba application revealed that although it too uses TCP based socket for it's communication, the marshalling, demarshalling of arguments, extra header information that needs to be sent and the allocation and deallocation of the paramters at the server-side was the cause of the larger response time for the CORBA application.

Figures 1 and 2 also show that in the Corba case, the response time for the very first request made on the remote object takes an order of magnitude time more than subsequent requests to the same object. This difference can become significant in case the application design is such that only one request is made per remote object (i.e. a remote object reference is never reused. In this case, every request would experience the extra overhead of the very first request.

8

Figure 1: Corba Null RPC Response Time

Figure 2: Socket Null RPC Response Times

9

The explanation for this can be seen from the following ‘strace’ of the client application for the first two requests sent to the server:

gettimeofday({989986519, 979772}, NULL) = 0brk(0x80e7000) = 0x80e7000socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 7connect(7, {sin_family=AF_INET, sin_port=htons(4937), sin_addr=inet_addr("192.168.42.3")}, 16) = 0fcntl(7, F_GETFL) = 0x2 (flags O_RDWR)fcntl(7, F_SETFL, O_RDWR) = 0fcntl(7, F_GETFL) = 0x2 (flags O_RDWR)fcntl(7, F_SETFL, O_RDWR) = 0setsockopt(7, SOL_SOCKET, SO_SNDBUF, [65535], 4) = 0setsockopt(7, SOL_SOCKET, SO_RCVBUF, [65535], 4) = 0setsockopt(7, IPPROTO_TCP1, [1], 4) = 0getpid() = 20221fcntl(7, F_SETFD, FD_CLOEXEC) = 0getpeername(7, {sin_family=AF_INET, sin_port=htons(4937), sin_addr=inet_addr("192.168.42.3")}, [16]) = 0rt_sigprocmask(SIG_BLOCK, ~[RT_1], [RT_0], 8) = 0rt_sigprocmask(SIG_SETMASK, [RT_0], NULL, 8) = 0writev(7, [{"GIOP\1\1\1\0F\0\0\0\0\0\0\0\0\0\0\0\1\237\0@\33\0\0\0\24"..., 82}], 1) = 82rt_sigprocmask(SIG_BLOCK, ~[RT_1], [RT_0], 8) = 0rt_sigprocmask(SIG_SETMASK, [RT_0], NULL, 8) = 0select(8, [5 7], NULL, NULL, NULL) = 1 (in [7])read(7, "GIOP\1\1\1\1\26\0\0\0", 12) = 12read(7, "\0\0\0\0\0\0\0\0\0\0\0\0\6\0\0\0Hello\0", 22) = 22gettimeofday({989986519, 987282}, NULL) = 0fstat(1, {st_mode=S_IFREG|0640, st_size=20519, ...}) = 0mmap(0, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x40923000write(1, "7510\n", 57510) = 5gettimeofday({989986519, 988551}, NULL) = 0writev(7, [{"GIOP\1\1\1\0F\0\0\0\0\0\0\0\1\0\0\0\1\237\0@\33\0\0\0\24"..., 82}], 1) = 82rt_sigprocmask(SIG_BLOCK, ~[RT_1], [RT_0], 8) = 0rt_sigprocmask(SIG_SETMASK, [RT_0], NULL, 8) = 0select(8, [5 7], NULL, NULL, NULL) = 1 (in [7])read(7, "GIOP\1\1\1\1\26\0\0\0", 12) = 12read(7, "\0\0\0\0\1\0\0\0\0\0\0\0\6\0\0\0Hello\0", 22) = 22gettimeofday({989986519, 990551}, NULL) = 0

The code between the 2 pairs of gettimeofday system calls shows the system calls made by the client for the first and the second requests to the same object. As can be seen, the client creates a socket and connects to the server only after the first request is made, for subsequent calls the connection is kept open and reused. Hence the first request involves a greater over head. We expect a similar behavior at the server side contributing to the patter, when requests are received, however, this could not be confirmed as strace failed to work correctly on the server (which is a multithreaded application).

10

3.2 Simple HTTP Client-Server

This application models a simple HTTP server that only receives requests for static HTML pages or objects. The client sends a properly formatted HTTP request to the server requesting a file. In our system, the client can request one of 8 possible files varying in size from 20kB to 1.29MB. The server receives the HTTP request, parses it, reads the requested file from it’s local disk, includes it in an HTTP response and sends it back to the client. For this experiment, there is only one client thread that sequentially makes 100 requests for a file of the same size. To avoid disk-caching effects, the server has 100 copies of files of each possible size and reads a different one every time.

The QLinux application again uses simple TCP based server that creates a new thread for every request. The new thread services the HTTP request and also logs the time taken for disk I/O and for processing the HTTP request. This can be used in conjunction with the response time measured at the client to determine the time spent by the request and response in the network.

The Corba version of the server uses a remote object with ‘thread per request’ policy to perform the same operation as the QLinux server.

The following plots indicate that the performance of the two systems in terms of response time are mostly comparable, though the QLinux application performs marginally better.

Local HTTP Response Times

0

100000

200000

300000

400000

500000

600000

700000

0 200 400 600 800 1000 1200 1400

Filesize (kB)

(use

cs)

Corba

Non Corba

Figure 3: Local HTTP Response Times

Remote HTTP Response Time

0

500000

1000000

1500000

2000000

2500000

0 200 400 600 800 1000 1200 1400

File Size (kB)

(use

cs)

Corba

Non Corba

Figure 4: Remote HTTP Response Times

3.3 Concurrent HTTP Client-Server

This experiment differs from the previous one in that, instead of having a single client thread, the client launches a number of threads (1-4), each of which make 100 requests to the same HTTP server concurrently. This forces the server to service 1-4 requests simultaneously. In both applications, the server remains unchanged from the previous experiment, and the clients were modified so that the client launches a certain no of threads on startup each of which would make requests to the server. These experiments were done for varying no. of client threads (1-4), varying file sizes (20 kB to 1.28Mb) and with the client on a host local or remote to the server.

The figures5-8 summarize these results for the following representative cases.

11

No of Client Threads = 4 and varying the filesize

Local HTTP Response Times, Concurrency=4

0

200000

400000

600000

800000

1000000

1200000

0 200 400 600 800 1000 1200 1400

Filesize (kB)

(use

cs)

Corba

Non Corba


Remote HTTP Response Times,Concurrency = 4

0

1000000

2000000

3000000

4000000

5000000

6000000

7000000

0 200 400 600 800 1000 1200 1400

File size (kB)

(use

cs)

Corba

Non Corba


Filesize=640kB, varying the no. of client threads.

Local Concurrent HTTP Response Time, Filesize=640kB

0

100000

200000

300000

400000

500000

600000

0 0.5 1 1.5 2 2.5 3 3.5 4 4.5

No of concurrent clients

(use

cs)

Corba

Non Corba


Remote Concurrent HTTP Response Times, Filesize=640kB

0

500000

1000000

1500000

2000000

2500000

3000000

3500000

0 0.5 1 1.5 2 2.5 3 3.5 4 4.5

No of concurrent clients

(use

cs)

Corba

Non Corba


3.4 Compute Intensive HTTP Client-Server

This again is a variation on the HTTP theme. This is a simulation of cgi-bin processing request. In this case the client sends an HTTP request to the server. In response, the server reads a small file from disk (40kB), performs some computation (a complicated loop) and then sends the file it read from disk to the client. The server is designed so that the time taken for the computation constitutes most of the request processing time. In this case as well the applications were created by simple modifications to HTTP client-server described in the preceding experiments. The figures 9,10 demonstrate the variation of the average response time for a single request as the no. of clients making concurrent requests to the server is increased from 1 – 15. As can be seen from the figures, the response time for these computation-intensive requests increases almost linearly as the number of simultaneous clients is increased. The Corba application because of the extra overheads mentioned earlier performs slightly poorer than the simple socket based one. Also since this application does not involve transfer of large amounts data over the network, the difference in the response times for the local client and the remote client was not too substantial.

12

Local Computation Response Times

0

200000

400000

600000

800000

1000000

1200000

0 2 4 6 8 10 12 14 16

No of Clients

(use

cs)

Corba

Non Corba

Figure 9: Local HTTP Response Times, with concurrency

Remote Comp Intensive Response Time

0

200000

400000

600000

800000

1000000

1200000

1400000

0 2 4 6 8 10 12 14 16

No of Clients

(use

cs)

Corba

Non Corba

Figure 10: Remote HTTP Response Times, with concurrency

To simulate the computation at the server, we initially used a simple counting for-loop which counted to a large no. However since the TAO/ORB by default uses –O3 level of compiler optimization ( g++ compiler) , this simple loop got optmized out. So we had to make the loop more complicated and make it manipulate more dummy variables to prevent the compiler from optimizing it out.

3.5 Long File/ MPEG Download

One of the models for serving a video stream is to assume that the client has sufficient buffers to accommodate the entire video file. In this model the server simply sends the video file to the client as fast as possible and letting the client buffer it and play it back whenever it wants. This model isn’t fundamentally different from the simple download of a large file. Since the video file is not being streamed at a real time rate and there are no deadlines associated with the arrival of data, the only parameter of interest is the time taken to download the entire file (basically throughput).

In this experiment, we’ve used this model and measured the time taken to download an MPEG-1 video file, with the server sending the video frames as fast as possible. We used an MPEG-1 video file consisting of 5962 frames. Instead of actually using the video data, we generated a profile of the MPEG sequence that listed the size of each frame in the video. During initialization, the server reads this profile from disk and stores the data about frame sizes in an array. When transmitting, the server simply sends a blank block of data of the same size as the corresponding frame to the client. This approach eliminates the disk from the service loop and allows us to measure the pure network and application overhead of transmitting data.

In the QLinux application, the server listens for client connections on a TCP socket. When a client connects to the server, the server spawns a new thread which starts sending the client data equivalent to the video frame sizes as soon as possible. The stream is preceded by a 4 byte header indicating the number of frames in the stream. Each frame in the stream is preceded by a 4 byte header specifying the size of the following frame in bytes. These headers are necessary so that client knows how much data to receive and

13

when to stop receiving the data. The client logs the arrival time of each frame when all the data for that frame is arrived.

When using Corba, for this kind of application, the synchronous RPC mechanism of the basic Corba server is not suitable because of the amount of data sent from the server to the client and the manner in which it is sent. TAO includes an Audio/Video Streaming Service[3], which supports the semantics for setting up and tearing down Streaming connections, however, the service focuses itself in the control/signaling involved during streaming and does not deal with the actual transport/QoS related issues. The Real-Time Event Channel Service provided in the TAO implementation of CORBA is an ideal choice for the kind of communication required for this application.

The Event Channel service needs to be started before either client or the server. The server starts up and creates an instance of a video supplier object and listens for requests from clients through an RPC call. When a client wants to receive a video file from a server, it first launches a separate thread that connects to the Event Channel as a consumer. Since the RTES supports filtering of events at the consumer by event types, the consumer thread specifies to the event channel the ID of events it wishes to receive. The consumer thread then waits for events to arrive. Meanwhile the other client thread obtains a reference to the video supplier object and performance an RPC on it, providing it with the event ID it is interested in receiving and telling the supplier to start sending the video frames.

The supplier on receiving the RPC from the client, connects to the event channel as a supplier of events with the ID specified by the client. The first event it sends to the client contains the number of frames it will be sending. Next the supplier pushes events into the Event Channel containing one frame each.

In the general case, the Event Channel can run on any host on the distributed system and still be used as a communication channel between a supplier and consumer on any other hosts. However to allow a fairer comparison with the QLinux application, for our experiments, the event channel was run on the same host as the supplier.

The following plots show the frame arrival times at the consumer for the QLinux and Corba applications. It can be seen that the performance of Corba is poorer than the QLinux application because of the extra overhead involved in doing the data transfer via an event channel. The difference in the two is much more for the case when the client and server are on running on the same machine. The reason for this is that in the case with the remote client, the delay incurred in the network (which is approximately the same for both applications) is much more substantial than the overhead due to the event channel.

14

Local MPEG Download

0

20000000

40000000

60000000

80000000

100000000

120000000

140000000

1

199

397

595

793

991

1189

1387

1585

1783

1981

2179

2377

2575

2773

2971

3169

3367

3565

3763

3961

4159

4357

4555

4753

4951

5149

5347

5545

5743

5941

Fra

me

arri

val t

imes

(u

secs

)

Corba

Non Corba

Figure 11: Frame Arrival Times, Local Download

Remote MPEG Download

0

50000000

100000000

150000000

200000000

250000000

300000000

350000000

400000000

450000000

1

198

395

592

789

986

1183

1380

1577

1774

1971

2168

2365

2562

2759

2956

3153

3350

3547

3744

3941

4138

4335

4532

4729

4926

5123

5320

5517

5714

5911

Fra

me

arri

val t

imes

(u

sec)

Corba

NonCorba

Figure 12: Frame Arrival Times, Remote Download

To estimate the overhead of the event channel, we measured transmission delays for 1000 byte data blocks through direct socket transfers vs through an event channel. In the case of supplier and receiver on the same machine, we get the following numbers:

Corba (via Event Channel) - 3134.22 17.10 usecsSocket based direct - 1279.32 203.33 usecs

These numbers clearly indicate the extra overhead that the Event Channel adds to the data transmission.

3.6 Video Streaming Without QoS Specification

This experiment is a baseline evaluation of the performance of the two systems for streaming a video sequence at a Real-Time rate. The application are mostly similar to the case of the long MPEG download, except that instead of transmitting all frames as soon as possible, the server transmits frames in rounds of 30 frames every second. Since the video sequence is a 30fps one, this transmission corresponds to the frame rate. Although both systems allow specification of QoS parameters (in different ways), these were not enabled for this set of experiments. We experimented by increasing the number of concurrent streams till the server was unable to meet the frame rate.

In the QLinux application apart from the rate at which the server sends the frames, there were no other changes to the application from the previous experiment .

The Corba Real Time Event Service also needs the TAO Scheduling Service to be running for it to work. Also the supplier and consumer need to locate the Event Channel to be able to connect to it. In the previous experiments, this was done by using each service’s unique string which can be used to locate it (the IOR). However in a real-life Corba based distributed system, the system runs a “Naming Service” (similar to a DNS kind of look up). Each CORBA object (i.e. service providing entity) registers itself with the Naming Service and which binds a name to it. Since the Naming Service runs in a well-known location, a client wishing to access a particular object can perform a lookup with the naming service and get a reference to the object.

15

So for this and subsequent applications, we ran a Naming Service on the server machine. Then the Scheduling Service is launched which registers itself as “SchedulingService”. Then the Event Channel Service launches itself, try to locate the Scheduling Service by doing a lookup on the Naming Service for “SchedulingService” and on success register itself as “EventService”. Once this is done, the Supplier and the Consumer can connect to the event channel by first locating it by doing a lookup for “EventService”. This means that server machine is running at least 4 Corba applications at any time (Naming Service, Scheduling Service, Event Service & the Supplier). Since these are large applications, they do put a significant load on the CPU.

Another important difference between this application and the MPEG download is that the supplier instead of pushing a single event for each frame to be transmitted, creates an event set consisting of 30 events corresponding to the 30 frames of that round, and pushes the whole event set at once. It is worth noting that although the application framework allows for a consumer to receive multiple events (an event set) at a time, we found that even though we were pushing an event set of 30 events at supplier end, the consumer was still getting one event at a time.

We also ran these applications for the case of multiple concurrent streams being served to different clients through the same event channel. As described earlier, the event channel has support for filtering event delivery to consumers.

Since the supplier is run in ‘thread per request’ mode, a new supplier thread is created for each client that wants to receive a video. These supplier threads connect to the same event channel and push events into it separately. Since each client (consumer) provides the supplier with a different event ID, the events being pushed into the channel by the various supplier threads are tagged with their respective Ids and are filtered by the event channel so that each consumer only gets the events belonging to its own stream.

The event channel uses a pool of threads to do the dispatching of the events and the size of the pool can be configured when the Event Service is started. For our experiments, we set the number of threads to be the same as the no of concurrent clients we would be using for each experiment.

The Metric: For comparing the performance of the two applications we used a metric based on the time taken to receive a round. When transmitting a stream, the server sends a round of 30 frames every second. If it completes transmitting the round before it’s time to send the next round, the sending thread sleeps for that duration. If, however, it is unable to send a round in one second, it starts sending the next round as soon it finishes with the current one. The client, who is logging the frame arrival time of each frame, is interested in receiving a complete round (30 frames) every second to be able to play the video stream continuously. Therefore, at the client side we measure the amount of time taken to receive an entire round. This is the time from the arrival of the first frame to the arrival of the arrival of the last (30th) frame of the round. If this time is less that 1 second, then we can say that this round was received in-time for playback. If the time is greater than one second, this corresponds to a missed deadline. One of the effects that this metric

16

does not measure is that if one of the earlier rounds is delayed and misses its deadline, it is likely going to make the subsequent rounds also miss their deadlines. By using the time to receive each frame separately, we factor out the cumulative effect of deadline misses.

Therefore the two metrics we are interested in are:

1. Average Round Duration (as measured from Client) with a increasing number of Streams

2. Distribution of Round Duration (around deadline) and effect of increasing number of Streams

Local Streaming No QoS

0

10000

20000

30000

40000

50000

60000

70000

0 1 2 3 4 5 6

No of Streams

Ave

rag

e R

ou

nd

Du

rati

on

(u

secs

)

Corba

Non Corba

Figure 13: Average Round Durations, Local Client, No QoS

Remote Streaming No QoS

0

200000

400000

600000

800000

1000000

1200000

0 1 2 3 4 5 6

No of streams

Ave

rag

e R

ou

nd

du

rati

on

(u

secs

)

Corba

Non Corba

Figure 14: Average Round Durations, Remote Client No QoS

As can be seen from figure 12, when the client and server are running on the same host, the QLinux application not only outperforms the Corba application, but also is not affected significantly by increasing the number of simultaneous streams. The reason for this is that for the local streaming, the overhead due to the network is minimal, but since the QLinux applications are simpler and less memory and CPU intensive, they are not affected as much as Corba applications when a large no. are running on the same system.

However in figure 13, for the remote client, the performance of the two systems is almost comparable. Both applications are unable to meet the frame rate (corresponding to an average round duration of 1 second) for 5 concurrent streams.

The following figures 15,16,17,18 show the distribution of Round Durations for the remote client and the cases of 4 and 5 concurrent streams.

We also experimented by changing the round size from 30 frames to 15 frames. In this case the server is required to send rounds of 15 frames every 500 msecs. We found that because of the tighter deadline and possibly the 10msec granularity of the Linux kernel clock, the server was unable to support even 4 concurrent streams for remote clients. The round durations for these cases are shown in figures 19 and 20

17

Corba Remote 4, No QoS

0

0.02

0.04

0.06

0.08

0.1

0.12

0.14

0.16

0.18

0

1000

00

2000

00

3000

00

4000

00

5000

00

6000

00

7000

00

8000

00

9000

00

1000

000

1100

000

1200

000

1300

000

1400

000

Mor

e

Round Duration

Fra

ctio

n

Frequency

Figure 15: Corba Round Durations 4 Streams

Non Corba Remote 4, No QoS

0

0.02

0.04

0.06

0.08

0.1

0.12

0.14

0.16

0.18

0.2

0

1000

00

2000

00

3000

00

4000

00

5000

00

6000

00

7000

00

8000

00

9000

00

1000

000

1100

000

1200

000

1300

000

1400

000

Mor

e

Round Durations

Fra

ctio

n

Frequency

Figure 17: Non Corba Round Durations 4 Streams

Corba Remote 5, No QoS

0

0.02

0.04

0.06

0.08

0.1

0.12

0.14

0.16

0.18

0

1000

00

2000

00

3000

00

4000

00

5000

00

6000

00

7000

00

8000

00

9000

00

1000

000

1100

000

1200

000

1300

000

1400

000

Mor

e

Round Durations

Fra

ctio

n

Frequency

Figure 16: Corba Round Durations 5 Streams

Non Corba Remote 5, No QoS

0

0.02

0.04

0.06

0.08

0.1

0.12

0.14

0

1000

00

2000

00

3000

00

4000

00

5000

00

6000

00

7000

00

8000

00

9000

00

1000

000

1100

000

1200

000

1300

000

1400

000

Mor

e

Round Durations

Fra

ctio

n

Frequency

Figure 18: Non Corba Round Durations 5 StreamsRoundSize=15 Corba Remote No QoS

0

0.02

0.04

0.06

0.08

0.1

0.12

0.14

0.16

0

5000

0

1000

00

1500

00

2000

00

2500

00

3000

00

3500

00

4000

00

4500

00

5000

00

5500

00

6000

00

6500

00

7000

00

Mor

e

Round Durations

Fra

ctio

n

Frequency

Figure 19: Corba Round Durations 4 Streams, Round size=15 frames

RoundSize=15 Non Corba Remote No QoS

0

0.02

0.04

0.06

0.08

0.1

0.12

0.14

0.16

0.18

0.2

0

5000

0

1000

00

1500

00

2000

00

2500

00

3000

00

3500

00

4000

00

4500

00

5000

00

5500

00

6000

00

6500

00

7000

00

Mor

e

Round Durations

Fra

ctio

n

Frequency

Figure 20: Non Corba Round Durations 4 Streams, Round size=15 frames

3.7 Video Streaming With QoS Specification

After testing the raw performance of streaming applications on both systems features, we repeated the experiments of the previous sections, this time with the QoS features enabled. This also involved modifying the application to specify the QoS guarantees it needs from the system.

In the case of the QLinux application, the server modifies the process scheduling hierarchy to create an SFQ scheduling class with weight 1. A default best effort class with weight 1 is automatically created when the kernel is booted. For each thread that the server creates to send a stream, it creates a new leaf node with weight 1 in the SFQ class

18

and attaches the service thread to that node. Thus, in the server application, the thread that listens for client connections, runs in the best effort class and the threads responsible for streaming the video run in the SFQ class, all with the same weight.

Similarly, the server also creates an SFQ class in the packet scheduling hierarchy and attaches the new socket created for a client connection to a newly created queue with weight 1 in the SFQ class.

The mechanism for providing QoS specification is much simpler and more intuitive in the case of TAO. All the server program needs to do is to associate the following QoS parameters with the a certain type of event when connecting to an Event Channel as a supplier (or consumer):

Worst Case Time (to perform task): set to 1 secTypical Time (to perform task): set to 1 secCached Time: set to 1 secTime Period of task: set to 1 secNo of threads (of event channel to use for this task): set to 1

The Real Time Event Service uses these parameters to get a schedule from the Scheduling Service and schedules the Event Channel’s pool of threads using that schedule.

The following figures show the variation in the Average Round Duration as the number of concurrent clients increases. This is for the round size of 30 frames with a 1 second period.

Local Streaming with QoS

0

10000

20000

30000

40000

50000

60000

0 1 2 3 4 5 6

No of Streams

Avg

Ro

un

d D

ura

tio

ns

Corba

No Corba

Figure 21: Average Round Durations, Local Client with QoS

Remote Streaming with QoS

0

200000

400000

600000

800000

1000000

1200000

0 1 2 3 4 5 6

No of Streams

Avg

Ro

un

d D

ura

tio

ns

Corba

Non Corba

Figure 22: Average Round Duration Remote Client with QoS

As can be seen from the above figures, the relative behavior of the two applications in terms of the average Round Duration does not change too much from the case when the QoS was not enabled. For local performance, QLinux still out-performs CORBA and in the case of remote client, although CORBA does better when the number of clients is low, the QLinux application scales better.

The following plots show the distributions of the Round Durations for these applications with 4 and 5 remote clients. It can be seen that QLinux with QoS enabled does manage to

19

shorten the tail of the distribution noticeably, thus ensuring that a greater percentage of rounds are able to arrive within the deadline.

Corba Remote 4 with QoS

0

0.02

0.04

0.06

0.08

0.1

0.12

0.14

0.16

0.18

0

1000

00

2000

00

3000

00

4000

00

5000

00

6000

00

7000

00

8000

00

9000

00

1000

000

1100

000

1200

000

1300

000

1400

000

Mor

e

Round Durations

Fra

ctio

ns

Frequency

Figure 23: Corba RoundDurations,4 Streams with QoS Non Corba Remote 4 with QoS

0

0.02

0.04

0.06

0.08

0.1

0.12

0.14

0.16

0.18

0.2

0

1000

00

2000

00

3000

00

4000

00

5000

00

6000

00

7000

00

8000

00

9000

00

1000

000

1100

000

1200

000

1300

000

1400

000

Mor

e

Round Durations

Fra

ctio

n

Frequency

Figure 25: Non Corba RoundDurations,4 Streams with QoS

Corba Remote 5 with QoS

0

0.02

0.04

0.06

0.08

0.1

0.12

0.14

0.16

0.18

0.2

0

1000

00

2000

00

3000

00

4000

00

5000

00

6000

00

7000

00

8000

00

9000

00

1000

000

1100

000

1200

000

1300

000

1400

000

Mor

e

Round Durations

Fra

ctio

n

Frequency

Figure 24: Corba RoundDurations,5 Streams with QoS Non Corba Remote 5 with QoS

0

0.02

0.04

0.06

0.08

0.1

0.12

0.14

0.16

0.18

0.2

0

1000

00

2000

00

3000

00

4000

00

5000

00

6000

00

7000

00

8000

00

9000

00

1000

000

1100

000

1200

000

1300

000

1400

000

Mor

e

Round Durations

Fra

ctio

n

Frequency

Figure 26: Non Corba RoundDurations,5 Streams with QoS

We also observed the effect of decreasing the round size to 15 frames transmitted every 500 msecs for 4 concurrent remote clients. Again we see that with QoS, the QLinux application does a better job of ensuring that a greater number rounds are received within their deadlines. The improvement from the case without QoS is also more noticeable with QLinux.

RoundSize=15 Corba Remote QoS

0

0.02

0.04

0.06

0.08

0.1

0.12

0.14

0.16

0.18

0

5000

0

1000

00

1500

00

2000

00

2500

00

3000

00

3500

00

4000

00

4500

00

5000

00

5500

00

6000

00

6500

00

7000

00

Mor

e

Round Duration

Fra

ctio

ns

Frequency

Figure 27: Corba Round Durations 4 Streams, Round size=15 frames

RoundSize=15 Non Corba Remote QoS

0

0.02

0.04

0.06

0.08

0.1

0.12

0.14

0.16

0.18

0

5000

0

1000

00

1500

00

2000

00

2500

00

3000

00

3500

00

4000

00

4500

00

5000

00

5500

00

6000

00

6500

00

7000

00

Mor

e

Round Duration

Fra

ctio

n

Frequency

Figure 28: Non Corba Round Durations 4 Streams, Round size=15 frames

20

3.8 Video Streaming performance in presence of another application

In this experiment, we’ve attempted to measure how the performance of a time critical application is affected by the presence of another non-time critical application. This was done by running the Streaming Server and the HTTP server on the same host. 4 streaming clients and a certain number of HTTP clients make concurrent requests to their respective servers. We observe the effect of increasing the number of HTTP clients on the Average Round Duration measured by the streaming clients.

The results of some preliminary experiments are presented below in figures 29 and 30

21

Application Mix, 4 Streams + HTTP No QoS

0

200000

400000

600000

800000

1000000

1200000

1400000

0 2 4 6 8 10 12

No of HTTP Clients

Avg

Ro

un

d D

ura

tio

ns

Corba

Non Corba

Figure 29: Average Round Durations, 4 Client No QoS, increasing no of HTTP Clients

Application Mix, 4 Streams + HTTP, with QoS

0

200000

400000

600000

800000

1000000

1200000

1400000

0 2 4 6 8 10 12

No of HTTP Clients

Avg

Ro

un

d D

ura

tio

ns

Corba

Non Corba

Figure 30: Average Round Durations, 4 Client with QoS, increasing no of HTTP Clients

The experiments involving these applications are being continued by others and will be included in a later document.

3.9 Industrial Control System Simulation

This application simulates an industrial control console system[4]. The application consists of 2 distinct components. The first one models a supervisor’s command console in which a client program from time to time issues a command to a remote actuator device (simulated by a remote server). The device performs the operation requested by the client (simulated by a brief busy-wait) and then sends an acknowledgement back to the client. The operator is simulated by automatically generated requests at random time intervals (100 – 300 msecs).

The second component is a remote-sensing kind of application, in which a sensor (simulated by a supplier) periodically (every 20 msecs) sends an update of a measurement to a remote monitoring system (simulated by a consumer).

The QLinux version of this application consists of two client-server pairs, one for the Command Console and the other for the remote sensor. Both applications are non-concurrent and use a TCP based socket for communication. In the command control application, the server on receiving a connection waits for the client to send an operation on the open socket, performs a brief busy wait and sends back a single byte acknowledgement. The client issues commands at random time intervals and records the response time for requests. In the remote-sensing application, the server on getting a connection periodically sends a byte of data to the client, which measures the arrival times of these updates.

For TAO, the Command Console application uses the RPC-like mechanism to send a request and receive a response. The remote-sensing application uses the Real Time event channel for communication between the sensor and the monitoring system.

The metrics of interest in this application are the jitter in the arrival of sensor updates and the response times for the Command-Console requests.

These experiments were only performed for the remote client and QoS was enabled for Remote Sensing Application.

The average response times for the Command-Console application requests are summarized below:

Response Time (usecs) CORBA QLinux

Remote Sensor NOT Running 67974.33 96.09 67617.63 12.97

Remote Sensor Running 70960.69 443.86 67853.9 11.53

As can be seen from the above table, the performance of the Command-Console application in the absence of the Remote-Sensing application for both QLinux and CORBA is comparable, however the degradation in performance when the Remote Sensing Application is added to the server load, is more significant in the case of CORBA than QLinux because of the extra overhead associated with CORBA applications.

In the case of the Remote Sensing applications, we measure the Update sending times at the sender and the Update receiving time at the client. The differences between successive update sending times gives an indication of how well the threads of the sending application are scheduled. Ideally these differences should be constant. These experiments were performed with the Command-Console application also running simultaneously on the same hosts.

The differences between update arrival times at the client also incorporate the jitter introduced by the communication medium between the two applications. These measurements are summarized in the following table:

Avg. Update Differences (usecs)

CORBA QLinux

Sending Side 29996.37 9.21 29997.65 4.63

Receiving Side 29992.54 7.80 29997.63 5.70

As can be seen, there is really not much to choose between the performance of the two systems. They both manage to give an almost identical level of performance.

4. Summary & Discussion of Qualitative Comparisons

One of the objectives of this work was to provide, apart from the quantitative comparison of the two systems presented above, also give a qualitative evaluation of the relative ease of use of the two systems and identify any potential stumbling blocks.

In our experience with developing corresponding applications for both QLinux and TAO, we observed the following pros and cons of the two systems.

4.1 QLinux

One of the biggest advantages that QLinux offers it’s users is the ability to use the same, API which most systems programmers are already familiar with (basic socket based programming). Using the additional QoS features available, does not add significantly to the application complexity. This makes the task of porting already existing applications to use these features reasonably straightforward. The developer has to learn only a few new system calls, however, the documentation (man pages), for these leaves some room for improvement. This is a significant advantage in the usability of the system.

On the other hand, in the simplicity of it’s API, lies also a slight drawback, in that it does not provide the user with any high-level interfaces to simplify the task of developing networked/distributed applications. Also since QLinux deals with QoS only in terms of HSFQ scheduling parameters (i.e. bandwidth share), the task of actually specifying the QoS for a given applications is not straightforward. This specially so since it is common to describe QoS parameters in terms of deadlines, end-end delays and periodicity which do not fit directly into the QLinux framework.

The other noticeable drawbacks of the QLinux API is that it requires that all QoS related system calls have to be executed in super-user mode. Thus making large scale in a simple way impractical. A solution to this problem would be to develop run-time services which run as the super-user and provide an interface for other user’s applications to request QoS for themselves. It would also be possible to do some admission control in this way. Also, the error reporting mechanisms for the system calls for CPU HSFQ and Link HSFQ are neither consistent nor convenient. Link HSFQ system calls use system messages to report erroneous or exceptional conditions which makes it hard to write clean and robust applications.

4.2 TAO / CORBA

Since TAO is first a Distributed Object Computing (DOC) middle-ware and then a Real-Time system, it suffers from the problem of being a large, complicated and a very general-purpose application development framework. This implies that any person wishing to use TAO for a Real-Time application has to first overcome the significant hurdle of understanding the various concepts related to DOC and the large amount of details associated with a specific DOC system (such as CORBA). For small, simple applications, this overhead can be prohibitive.

Also since the use of CORBA requires the developer to follow a certain application development framework. It makes the task of porting existing applications, which use conventional socket based communication, to use TAO a very complicated and often impractical task.

However, once this initial hurdle is overcome, the middle-ware provides many features to make the task of developing distributed applications simpler and faster. Since the middle-ware frees the developer of the lower level implementation details (the socket programming), it substantially reduces development time for simple applications because of the time saved in writing hard to debug, error prone networking code.

This freedom, however, comes at the cost of flexibility. Thus while it is extremely easy and quick to develop applications which conform well to the models commonly used in DOC. Developing an application that has a unique/uncommon communication model might involve a lot of unnecessary complications.

Building TAO and TAO based applications is surprisingly memory and CPU intensive. A minimal build of ACE, TAO and TAO OrbServices took close to 4 hours on an otherwise lightly loaded Pentium-II, 300 Mhz machine with 192MB RAM. It takes 5-10 minutes to build from scratch the simplest TAO based application when a normal socket based one takes only a few seconds.

For the purpose of specifying QoS guarantees, TAO provides a rather intuitive and easy to understand interface which involves directly specifying deadlines, periodicity of events, etc. Also the ORB supports various kinds of concurrency models in servers and the services that can be used by simply providing command line parameters or configuration files during startup.

References

[1] P. Goyal and X. Guo and H.M. Vin, A Hierarchical CPU Scheduler for Multimedia Operating Systems, Proceedings of 2nd Symposium on Operating System Design and Implementation (OSDI'96), Seattle, WA, pages 107-122, October 1996.

[2] P. Goyal and H. M. Vin and H. Cheng, Start-time Fair Queuing: A Scheduling Algorithm for Integrated Services Packet Switching Networks, Proceedings of ACM SIGCOMM'96, pages 157-168, August 1996

[3] Sumedh Mungee, Nagarajan Surendran, Douglas C. Schmidt, The Design and Performance of a CORBA Audio/Video Streaming Service

[4] Krithi Ramamritham, Chia Shen, Oscar Gonzalez, Shubo Sen and Shreedar Shirgurkar, Using Windows NT for Real-Time Applications: Experimental Observations and Recommendations, Proceedings of IEEE RTAS'98

[5] Douglas C. Schmidt, TAO Overview http://www.cs.wustl.edu/~schmidt/TAO-intro.html

[6] Douglas C. Schmidt, Using the Real Time Event Service http://www.cs.wustl.edu/~schmidt/events_tutorial.html

Thesis.doc

Documents

Transcript of Thesis.doc