On Workload in an SCA-based System, with Varying Component and Data Packet Sizes

Tore Ulversøy (1)

Jon Olavsson Neset (2)

(1) FFI; UNIK University Graduate Center; University of Oslo (UiO)
(2) Norwegian University of Science and Technology (NTNU)

Outline

• Background and Problem Definition
• Empirical Analysis
• Analysis using Low-Complexity Analytical Models
• Conclusions

Background

• The base code of one of the waveform applications used in the following originates from a member of the group below, and the waveform application is also used for other activities in that group:

• The Regular Task Group on SDR founded under RTO-IST-080

RTO-IST-080 RTG-038 Software Defined Radio: currently, the team consists of experts from government, university and industry from CA, DK, GE, HU, IT, NL, NO, SP, TU, US and the SDR Forum, headed by NL (Chairman: Hans Segers, TNO)

Background: Main Objectives of RTG-038


• Share knowledge & experience of (multi)national SDR/SCA developments

• Report on possibilities of sharing waveforms and waveform components

• Investigations of portability and interoperability:
– SCA-based implementation of STANAG 4285 waveform
– demonstrate portability onto national SDR platforms
– demonstrate interoperability between the different implementations

Problem Definition and Problem Background

• SCA defines an environment that allows applications to be built as compositions of SW components (and devices)

• SCA defines a distributed system, with communication through CORBA for CORBA-capable processors

• There is wide freedom as to how small the components are that the application is split into: many small components make reuse of components easier, but CPU overhead increases

[Figure: Application/Component View (components C1 to C10, and C_tot) and Physical View (Processor 1, Processor 2, Processor 3)]

• What are the CPU overhead effects of a fine structure (many components) relative to a coarse one (few components), and how can we predict this overhead?

Analysis Approach:

CPU workload implied by a task or a group of tasks = the fraction of available processor cycles occupied over a time period
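As a minimal sketch of this definition, the workload can be written as busy cycles divided by the cycles available in the observation period. The function and parameter names below are illustrative assumptions, not taken from the slides.

```python
def workload_percent(busy_cycles: float, cycle_rate_hz: float, period_s: float) -> float:
    """Percentage of the available processor cycles occupied during the period.

    All names are hypothetical; this only restates the definition above.
    """
    available_cycles = cycle_rate_hz * period_s
    if available_cycles <= 0:
        raise ValueError("cycle rate and period must be positive")
    return 100.0 * busy_cycles / available_cycles


# Example: a task occupying 7.44e9 cycles during 40 s on a 1.86 GHz processor.
example_wl = workload_percent(7.44e9, 1.86e9, 40.0)
print(f"workload = {example_wl:.1f} %")
```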

Analysis methods, ordered from increasing clarity / simplicity (top) to increasing accuracy (bottom):

• Analytical model
• Concurrency model, e.g. Petri net
• Simulation model
• Measurements on testbed
• Measurements on actual system

Empirical Analysis

• Using OSSIE (Open Source SCA Implementation Embedded) [2] from Virginia Tech, which uses omniORB [3]
– Advantages: low user threshold, full source code available, Linux-based

• Profiling and monitoring tools:
– OProfile [4]
– SYSSTAT sar [5]
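For orientation, user and system CPU percentages of the kind reported by a utilization monitor can be approximated from two samples of the aggregate `cpu` line in Linux `/proc/stat`. The sketch below is a simplified illustration under that assumption; it is not SYSSTAT's actual code, and the function name and sample strings are hypothetical.

```python
def cpu_percentages(first_sample: str, second_sample: str) -> dict:
    """User and system CPU % between two aggregate 'cpu' lines of /proc/stat."""

    def parse(line: str) -> list:
        fields = line.split()
        if not fields or fields[0] != "cpu":
            raise ValueError("expected the aggregate 'cpu' line from /proc/stat")
        # Keep at most user, nice, system, idle, iowait, irq, softirq, steal.
        # Later guest fields are already counted inside user/nice time.
        return [int(value) for value in fields[1:9]]

    before, after = parse(first_sample), parse(second_sample)
    deltas = [later - earlier for earlier, later in zip(before, after)]
    total = sum(deltas)
    if total <= 0:
        raise ValueError("samples must be taken at different times")
    user, system = deltas[0], deltas[2]
    return {
        "user": 100.0 * user / total,
        "system": 100.0 * system / total,
        "user+system": 100.0 * (user + system) / total,
    }


# Hypothetical counter values (in clock ticks) at the start and end of an interval.
result = cpu_percentages("cpu 1000 0 500 8000 0 0 0", "cpu 1400 0 700 11400 0 0 0")
print(result)
```

On a live Linux system the two strings would be the first line of `/proc/stat` read at the start and end of the measurement interval.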

| | Experiment: Stanag 4285 | Experiment: Synthetic waveform |
|---|---|---|
| OS | Linux 2.6.9-34.EL | Linux 2.6.9-34.EL |
| OSSIE revision | 0.6.0 | 0.6.2 |
| Processor | Pentium M 1.86 GHz | Pentium M 1.86 GHz |
| RAM | 1.5 GByte | 1.5 GByte |
| Cache specifics | L1i = 32 kB, L1d = 32 kB, L2 = 2 MB | L1i = 32 kB, L1d = 32 kB, L2 = 2 MB |

L1i = first-level instruction cache, L1d = first-level data cache, L2 = second-level cache

Empirical Analysis, Simple Waveform Application

• STANAG 4285, TX part. Base code provided by Telefunken Racoms for RTO-IST-080 RTG-038

• Implemented as three different configurations, all performing the same functional processing work

• Non-SCA C version as a reference

[Figure: the three SCA configurations, each driven by a packet rate regulator]

• 2 comp.: Stanag4285 TX, Data Sink
• 7 comp.: Data Source, FEC Encoder, Interleaver, Symbol Mapper & Scrambler, Symbol to I/Q & TX Filter, Float to fixed converter, Data Sink
• 11 comp.: Data Source, FEC Encoder, Interleaver, Symbol Mapper & Scrambler, Symbol to I/Q & TX Filter, Float to fixed converter, four Forwarders, Data Sink

Empirical Analysis, Stanag 4285 TX: Results, User

[Figure: bar chart of user CPU % (axis 0 to 30) versus symbol rate (2395, 25600, 64000) for four implementations: Non-SCA, single component + sink, 6 components + sink, 10 components + sink]

WL measured by SYSSTAT sar (sar -u 40 5)

Empirical Analysis, Stanag 4285 TX: Results, User+System

[Figure: bar chart of user + system CPU % (axis 0 to 45) versus symbol rate (2395, 25600, 64000) for four implementations: Non-SCA, single component + sink, 6 components + sink, 10 components + sink]

WL measured by SYSSTAT sar (sar -u 40 5)

Empirical Analysis, Synthetic Application

• A total of 9 FIR-filters, N taps and packet size B (NxB mult/adds per FIR)

• Both N and B can be varied

• 4 different configurations

• ‘c’ version as a reference
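The workload structure described above can be sketched as follows: one FIR filter with N taps applied to a packet of B samples costs N×B multiply-adds, and the nine filters can be grouped into components in different ways without changing the functional work. The function names, the zero initial filter state, and the grouping lists are illustrative assumptions; the actual OSSIE components are not shown in the slides.

```python
def fir_filter(packet, taps):
    """Direct-form FIR over one packet; returns (output, multiply_add_count).

    Assumes zero filter state before the packet, so every output sample
    costs exactly len(taps) multiply-adds.
    """
    n_taps = len(taps)
    padded = [0.0] * (n_taps - 1) + list(packet)
    output, operations = [], 0
    for i in range(len(packet)):
        acc = 0.0
        for k in range(n_taps):
            acc += taps[k] * padded[i + n_taps - 1 - k]
            operations += 1
        output.append(acc)
    return output, operations


def run_configuration(packet, taps, filters_per_component):
    """Apply 9 FIR filters split over components, e.g. [9], [3, 3, 3] or [1] * 9."""
    if sum(filters_per_component) != 9:
        raise ValueError("the synthetic application uses a total of 9 FIR filters")
    data, total_operations = list(packet), 0
    for filter_count in filters_per_component:  # one iteration per component
        for _ in range(filter_count):
            data, operations = fir_filter(data, taps)
            total_operations += operations
    return data, total_operations


taps = [0.5, 0.5]               # N = 2
packet = [1.0, 2.0, 3.0, 4.0]   # B = 4
for grouping in ([9], [3, 3, 3], [1] * 9):
    _, ops = run_configuration(packet, taps, grouping)
    print(len(grouping), "filter component(s):", ops, "multiply-adds")
```

Every grouping performs the same 9×N×B multiply-adds, so any workload difference between configurations is overhead rather than functional processing.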

[Figure: the four SCA configurations, each driven by a packet rate regulator]

• W2: FTOT, SNK
• W3: SRC, F1TO9, SNK
• W5: SRC, F123, F456, F789, SNK
• W11: SRC, F1, F2, F3, F4, F5, F6, F7, F8, F9, SNK

Synthetic Application: WL versus Configuration and N

WL results measured by SYSSTAT sar (sar -u 40 3)

[Figure: CPU Workload versus Configuration, B=2000, PR=40. Stacked bars of user and system CPU WL [%] (axis 0 to 50) for the configurations FUNC, W2, W3, W5 and W11, at N=10 and at N=50. Annotated ratios: user 1.67, user+system 1.94; user 1.10, user+system 1.17]

Synthetic Application: WL versus Packet Size

[Figure: CPU Workload versus Packet Size. CPU WL [%] (axis 0 to 35) versus packet size (0 to 70000), with user (U) and user+system (U+S) curves for FUNC, W2, W3, W5 and W11]

• Packet rate: 10/sec

• N and B selected such that the C implementation (FUNC) is at 10 ± 0.3 % user CPU WL

• WL overhead is seen to increase significantly with B

The Simple Lower Bound Model (SLBM)

• Ideal, unrealistically optimistic model

• Serves as a lower bound

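• t_i = number of cycles per packet

$$WL\%_i = \frac{100 \cdot PR \cdot \left(9\,t_{CL} + t_{SRC} + t_{SNK} + (M-1)\,t_{packet} + (M-1)\,t_{TS} + (M-2)\,t_{TF}\right)}{P_{CR}}$$

where PR is the packet rate, M is the number of components, and P_CR is the cycle rate of the processor.

[Figure: timeline between timer strobes for successive components C_N and C_N+1, with segments labelled t_SRC, t_packet, t_TF, t_CL, t_TS and t_SNK, followed by idle time until the next timer strobe]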

Parameters in the Simple Lower Bound Model

• For simplicity, we measure the parameters in the model with OProfile and/or SYSSTAT sar, using test applications:
– CORBA test application: CORBATSTSRC, CORBATSTSINK
– C program ('c'-prog) built around a loop of the form for (i=0; i < BLSZ; i++) ......

| | | t_SRC | t_SNK | t_TS(B) | t_TF(B) | t_packet(B) |
|---|---|---|---|---|---|---|
| User | sar estimate | Assumed 0 | Assumed 0 | 14.5*B | 10.6*B | 79200 + 5.5*B |
| User | OProfile estimate | 650 | 200 | 14.6*B | 10.6*B | 80700 + 5.5*B |
| System | sar estimate | Assumed 0 | Assumed 0 | ≈ 0 | ≈ 0 | 160000 + 23.6*B |
| System | OProfile estimate | ≈ 0 | ≈ 0 | ≈ 0 | ≈ 0 | 150000 + 23.0*B |
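As a rough sketch, the Simple Lower Bound Model can be evaluated numerically with the user-mode sar estimates from this slide. The code assumes the model is read as WL% = 100 · PR · (9·t_CL + t_SRC + t_SNK + (M−1)·t_packet + (M−1)·t_TS + (M−2)·t_TF) / P_CR; that reading of the multipliers, the function names, and the example value of t_CL (chosen so the functional work alone is about 10 % at 10 packets/sec) are assumptions, not values stated in the slides.

```python
PROCESSOR_CYCLE_RATE_HZ = 1.86e9  # Pentium M 1.86 GHz, from the platform description


def slbm_overhead_cycles(m: int, b: float) -> float:
    """Per-packet user-mode overhead cycles, using the sar estimates."""
    t_src = 0.0                # assumed 0 (sar estimate)
    t_snk = 0.0                # assumed 0 (sar estimate)
    t_ts = 14.5 * b
    t_tf = 10.6 * b
    t_packet = 79200 + 5.5 * b
    # Assumed multipliers: (M-1) for t_packet and t_TS, (M-2) for t_TF.
    return t_src + t_snk + (m - 1) * (t_packet + t_ts) + (m - 2) * t_tf


def slbm_workload_percent(m: int, b: float, packet_rate: float, t_cl: float) -> float:
    """SLBM user workload in percent; t_cl is the cycle count of one FIR filter."""
    cycles_per_packet = 9 * t_cl + slbm_overhead_cycles(m, b)
    return 100.0 * packet_rate * cycles_per_packet / PROCESSOR_CYCLE_RATE_HZ


# Hypothetical t_CL giving 10 % functional workload at 10 packets/sec.
t_cl_example = 0.10 * PROCESSOR_CYCLE_RATE_HZ / (9 * 10)
for b in (10000, 30000, 60000):
    wl = slbm_workload_percent(m=11, b=b, packet_rate=10, t_cl=t_cl_example)
    print(f"B = {b}: SLBM user WL = {wl:.2f} %")
```

Because every overhead term is linear in B, the modelled overhead grows linearly with packet size on top of the fixed per-packet cost in t_packet.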

Results, SLBM

• The simple model describes the dominating part of the user CPU overhead. Agreement best for small packet sizes

[Figure: Comparison of W11 Measured Data and Simple Lower Bound Model. User WL [%] (axis 10 to 24) versus packet size B (0 to 6 × 10^4), with curves Measured and SLBM. M=11, packet rate = 10/sec. N and B selected such that the C implementation (FUNC) is at 10 ± 0.3 % user CPU WL]

The Context Switch Model (CSM)

$$WL\%_{cs} = WL\%_i + \frac{100 \cdot CSR \cdot (t_{CSD} + t_{CSI})}{P_{CR}}$$

• CSR: Context Switch Rate [switches/second]. Measured ≈ 1300 for the example on the next page
• t_CSD: CS Direct Cost [cycles]. Here: ≈ 5 µsec = 9300 cycles
• t_CSI: CS Indirect Cost [cycles]. Here: using a coarse estimate based on addressed space and memory speed
• P_CR: Cycle rate of the processor
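The context switch contribution can be checked with a short calculation. The sketch below assumes the model is read as WL%_cs = WL%_i + 100 · CSR · (t_CSD + t_CSI) / P_CR, that the ≈ 5 µsec figure is the direct cost, and that the indirect cost is set to zero for illustration; the function name and the example SLBM workload value are hypothetical.

```python
def csm_workload_percent(wl_i_percent: float, cs_rate: float,
                         t_csd_cycles: float, t_csi_cycles: float,
                         cycle_rate_hz: float) -> float:
    """Workload after adding context switch cost to the SLBM workload WL%_i."""
    cs_percent = 100.0 * cs_rate * (t_csd_cycles + t_csi_cycles) / cycle_rate_hz
    return wl_i_percent + cs_percent


cycle_rate_hz = 1.86e9                   # Pentium M 1.86 GHz
t_csd_cycles = 5e-6 * cycle_rate_hz      # 5 microseconds, i.e. about 9300 cycles
cs_rate = 1300                           # context switches per second (measured)

# Direct cost only (t_CSI = 0), added to a hypothetical SLBM workload of 20 %.
wl_cs = csm_workload_percent(20.0, cs_rate, t_csd_cycles, 0.0, cycle_rate_hz)
print(f"direct cost per switch = {t_csd_cycles:.0f} cycles")
print(f"WL with direct CS cost = {wl_cs:.2f} %")
```

With these numbers the direct cost alone adds about 0.65 percentage points, so any remaining gap to the measurements has to come from the indirect cost term.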

Results, CSM

• With the CS model, we better explain the measured WL

[Figure: Comparison of W11 Measured Data and CS Model with Approximate Parameters. User WL [%] (axis 10 to 24) versus packet size B (0 to 6 × 10^4), with curves Measured; CSM, only t_CSD, coarse estimate; and CSM example (coarse parameter estimates). M=11, packet rate = 10/sec. N and B selected such that the C implementation (FUNC) is at 10 ± 0.3 % user CPU WL]

Conclusions:

• We have used empirical analysis and simple analytical models to understand the effects of granularity in an SCA-based system with CORBA-capable processors

• When executing the same total functional processing work, we observe that the processor workload increases as the number of components increases

• This overhead increases with data packet size, and becomes more dominant the less functional work there is per packet

• Contributors: Data conversions, packet communication through CORBA, direct cost of context switches, indirect cost of context switches

• Hence the scalability and reusability benefits that result from implementing the SDR application with a high number of components must be balanced against the processing efficiency loss that occurs when several components have to run on the same processor

• Two simple models are described that help explain the major effects, and may be used to calculate the overhead

References:

[1] P. J. Fortier and H. E. Michel, Computer Systems Performance Evaluation and Prediction. Amsterdam: Digital Press, 2003.

[2] VirginiaTech, OSSIE development site for software-defined radio, http://ossie.wireless.vt.edu/trac as of Dec. 20 2007

[3] omniORB, http://omniorb.sourceforge.net/ as of Feb. 29 2008

[4] OProfile - A System Profiler for Linux, http://oprofile.sourceforge.net as of Feb. 29 2008

[5] SYSSTAT, http://pagesperso-orange.fr/sebastien.godard/ as of Feb. 29 2008

[6] Chuanpeng Li, Chen Ding, and Kai Shen, "Quantifying The Cost of Context Switch," in ExpCS '07: Proceedings of the 2007 workshop on Experimental computer science, San Diego, CA, 13-14 June 2007.

Questions?