
KAUNAS UNIVERSITY OF TECHNOLOGY

Vytautas Deksnys

DIGITAL TV TECHNOLOGIES

Education Course

KAUNAS, 2014


GLOSSARY of ABBREVIATIONS 7

LIST of FIGURES 10

LIST of TABLES 15

INTRODUCTION 16

1. TELEVISION HISTORY 18

1.1. Main Television Principle 18

1.2. Development of Monochrome Television 18

1.3. Development of Color Television 22

1.4. NTSC 24

1.5. SECAM 25

1.6. PAL 25

1.7. MAC (multiplexed analog components) 26

1.8. PALplus 26

1.9. Development of Digital Television 26

2. HUMAN EYE STRUCTURE, COLOR VISION and COLOR MODELS 29

2.1. Parts of the Human Eye 29

2.2. Color Vision 30

2.3. Development of Color Models 34

2.4. RGB Color Model 36

2.5. CMYK Color Model 37

2.6. Gamma Correction 38

2.7. YUV Color Model 38

2.8. YCbCr Color Model 39

2.9. YCoCg Color Model 39

3. HUMAN HEARING, PERCEPTION and SPEECH PRODUCTION 41

3.1 Human Ear Structure 41

3.2 Limits of Hearing 42

3.3 Masking 45

3.3.1 Time Masking 45

3.3.2 Frequency Masking 45

3.4 Ear as spectrum analyzer 46

3.5. Human Voice 49

3.6 Vowels and Consonants 50

3.7 Engineering Model 51

4. DIGITIZATION of SIGNALS 54

4.1 Pulse Code Modulation 54


4.2 Sampling: Nyquist–Shannon Sampling Theorem 54

4.3 Ideal Sampling 55

4.4 Aliasing and Anti-Aliasing 56

4.5 Flat-Top Sampling 58

4.6 Quantizing: Uniform Quantizing 60

4.7 Encoding 61

4.8 Bandwidth of PCM Signals 63

4.9 Linear PCM Advantages and Disadvantages 64

4.10 Digitization of Video Signals 66

4.10.1 Introduction 66

4.10.2 Digitization formats 67

4.10.3 Transport problems 69

5. SPEECH, AUDIO and VIDEO SIGNALS COMPRESSION by QUANTIZATION 70

5.1 Necessity and possibility of compression 70

5.2 Nonuniform quantizing: µ-law and A-law companding 70

5.3 Adaptive Quantization 74

5.3.1 Maximum SNR coding 74

5.3.2 Adaptation principle 74

5.3.3 Types of adaptation 75

5.3.4 Feed-forward adaptation 75

5.3.5 Feed-backward adaptation 82

5.4 Differential quantization 83

5.5 Vector quantization 85

6. VIDEO SIGNALS COMPRESSION 88

6.1 Video Material Characteristics 88

6.1.1 Picture correlation 88

6.1.2 Information quantity in a digital image 89

6.2 Variable Length Coding (VLC) or Entropy Coding 90

6.3 Predictive Coding 93

6.4 Prediction with Motion Compensation 95

6.5 Compression of Images Using Transformations 99

7. LINEAR PREDICTIVE ANALYSIS of SPEECH SIGNALS 107

7.1 Basic Principles of Linear Predictive Analysis 107

7.2 Optimal prediction coefficients 109

7.3 The autocorrelation method 109

7.3.1 Durbin's recursive procedure 110

7.3.2 Prediction effectiveness 112


7.4 The covariance method 112

7.5 The choice of parameters for linear predictive analysis 113

8. AUDIO SIGNALS COMPRESSION 116

8.1 Sub-Band Coding Principle 116

8.2 MPEG Audio Layers 117

8.2.1 MPEG-1 Audio Layer 1 118

8.2.2 MPEG-1 Audio Layer 2 120

8.2.3 MPEG-1 Audio Layer 3 121

8.3 MPEG-2 Audio 126

8.4 MPEG-2 Advanced Audio Coding 127

8.5 Main Characteristics of MPEG-4 Audio 129

9. FORMATION of DVB DIGITAL STREAMS 130

9.1 Characteristics of ISO/IEC 13818 standard 130

9.2 MPEG-2 video compression standard 131

9.3 Construction of MPEG bit streams 131

9.3.1 MPEG-2 elementary stream 132

9.3.2 MPEG-2 packetized elementary stream 133

9.3.3 MPEG-2 program stream 134

9.3.4 MPEG-2 transport stream 134

9.3.5 The MPEG-2 tables 137

10. SCRAMBLING and CONDITIONAL ACCESS 140

10.1 Introduction 140

10.2 Functional Partitions 140

10.2.1 Data-Scrambling 141

10.2.2 Subscriber Authorization System 141

10.2.3 Subscriber Management System 141

10.3 System Architecture 141

10.3.1 Scrambler and Descrambler 141

10.3.2 CA Messages 142

10.3.3 CA Descriptor 143

10.3.4 CA-Host 143

10.3.5 CA-Client 143

10.3.6 CA-Module 144

10.3.7 Subscriber Management System 144

10.4 Network Integration 144

10.5 Main Conditional Access Systems 145

11. RANDOMIZATION and FRAME SYNCHRONIZATION 146


11.1 Randomization 146

11.2 Frame Synchronization 148

12. CODING 150

12.1 Introduction 150

12.2 Block Codes 150

12.3 Burst Error Correction 151

12.3.1 Block Interleaving 151

12.3.2 Convolutional Interleaving 152

12.3.3 Reed–Solomon (RS) Code 154

12.4 Convolutional Coding 155

12.4.1 Code parameters and the structure of encoder 155

12.4.2 Encoder states 156

12.4.3 Punctured codes 156

12.4.4 Structure of an encoder for L>1 157

12.4.5 Coding of an incoming sequence 157

12.4.6 Encoder design 158

12.4.7 State diagram 159

12.4.8 Tree diagram 160

12.4.9 Trellis diagram 161

12.4.10 The basic decoding idea 161

12.4.11 Sequential decoding 162

12.4.12 Maximum likelihood and Viterbi decoding 165

13. MODULATION of DIGITAL SIGNALS 169

13.1 Phasor Representation of Signals 169

13.1.1 Geometric interpretation 170

13.2 Modulation 172

13.3 Multilevel Modulation and Constellation Diagrams 173

13.4 Quadrature Modulators and Demodulators 176

13.5 Inter-Symbol Interference 177

13.6 Coded Orthogonal Frequency Division Multiplexing 178

13.6.1 Introduction 178

13.6.2 Effects of multipath propagation 179

13.6.3 Multiple carriers 179

13.6.4 Orthogonality 181

13.6.5 Guard interval 182

13.6.6 Generation and demodulation of COFDM signal 183

13.6.7 Coding 184


13.6.8 Inserting pilot cells 185

13.6.9 Interleaving 185

13.6.10 Choice of COFDM parameters 185

14. CHARACTERISTICS of DVB STANDARDS 187

14.1 DVB-T system 187

14.1.1 Splitter 188

14.1.2 Transport stream adaptation and randomization 188

14.1.3 Outer coding 189

14.1.4 Outer interleaving 190

14.1.5 Inner coding 191

14.1.6 Inner interleaving 192

14.1.7 Bit-wise interleaving 193

14.1.8 Symbol interleaving 194

14.1.9 Signal constellations and mapping 194

14.1.10 OFDM frame structure 196

14.1.11 Main parameters of DVB-T system 197

14.2 DVB-C system 198

14.2.1 Functional block diagram of the system 198

14.2.2 System blocks and their functions 198

14.3 DVB-S system 201

14.3.1 Functional block diagram of the system 201

14.3.2 System blocks and their functions 202

REFERENCES 204


GLOSSARY of ABBREVIATIONS

AAC – Advanced Audio Coding

AC – Alternating Current

ACE – Advanced Coding Efficiency

ACI – Adjacent Channel Interference

ACK – Acknowledgment

ACM – Adaptive Coding and Modulation

ADC – Analog-to-Digital Converter

ADSL – Asymmetric Digital Subscriber Line

AES – Advanced Encryption Standard

APSK – Amplitude or Asymmetric Phase-Shift Keying

ARQ – Automatic Repeat Request

ASK – Amplitude-Shift Keying

ASPEC – Audio Spectral Perceptual Entropy Coding

ATA/ATAPI – Advanced Technology Attachment with Packet Interface

ATM – Asynchronous Transfer Mode

AVC – Advanced Video Coding

BER – Bit Error Rate

BC – Backward Compatibility

BCH – Bose-Chaudhuri-Hocquenghem Code

BPSK – Binary Phase-Shift Keying

BSAC – Bit-Sliced Arithmetic Coding

BSP – Board Support Package

CAT – Conditional Access Table

CABAC – Context Adaptive Binary Arithmetic Coding

CAVLC – Context Adaptive Variable Length Coding

CCI – Co-Channel Interference

CD – Compact Disc

CELP – Code-Excited Linear Predictive

CEPS – Common Electronic Purse Specification

CIF – Common Intermediate Format

CMAC – Cipher-Based Message Authentication Code

COFDM – Coded Orthogonal Frequency Division Multiplexing

CP – Content Protection

CPE – Common Phase Error

CRC – Cyclic Redundancy Check

CSI – Channel-State Information

DAB – Digital Audio Broadcasting

DAC – Digital-to-Analog Converter

DAT – Digital Audio Tape

DBPSK – Differential Binary Phase Shift Keying

DBS – Direct Broadcast Satellite

DC – Direct Current

DCT – Discrete Cosine Transform

DES – Data Encryption Standard

DFT – Discrete Fourier Transform

DPCM – Differential Pulse Code Modulation

DQPSK – Differential Quadrature Phase-Shift Keying

DRM – Digital Rights Management

DTH – Direct To Home

DTTV or DTT – Digital Terrestrial Television

DVB-DSNG – DVB-Digital Satellite News Gathering

DVB-T – Digital Video Broadcasting Terrestrial

DVB-C – Digital Video Broadcasting Cable

DVB-S – Digital Video Broadcasting Satellite

DVB-H – DVB-Handheld

DVB-SH – DVB-Satellite to Handhelds

DVB-NGH – DVB-Next Generation Handheld

DVB-RCS – Return Channel System for Satellite

DVB-RCT – Return Channel System for Terrestrial TV


DVB-RCC – Return Channel System for Cable TV

DVD – Digital Versatile Disc

DVI – Digital Visual Interface

ECC – Elliptic Curve Cryptography

EDTV – Enhanced Definition Television

EEPROM – Electrically Erasable Programmable Read-Only Memory

ES – Elementary Stream

EPG – Electronic Program Guide

FEC – Forward Error Correction

FFT – Fast Fourier Transform

FIFO – First-In, First-Out Shift Register

FRAM – Ferroelectric Random Access Memory

FSK – Frequency-Shift Keying

GI – Guard Interval

GOP – Group Of Pictures

HD, HDTV – High Definition Television

HDMI – High-Definition Multimedia Interface

HP – High Priority

ICI – Inter-Carrier Interference

IFFT – Inverse Fast Fourier Transform

IIR – Infinite Impulse Response

IP – Internet Protocol

ISI – Inter-Symbol Interference

JND – Just Noticeable Difference

JPEG – Joint Photographic Experts Group

LAN – Local Area Network

LDPC – Low Density Parity Check

LDTV – Low Definition Television

LP – Low Priority

LP – Linear Prediction

LPC – Linear Prediction Coefficients

MAC – Message Authentication Code

MACL – Medium Access Control Layer

MDCT – Modified Discrete Cosine Transform

MMU – Memory Management Unit

MPE – Multiprotocol Encapsulation

MPEG – Moving Picture Experts Group

MPQM – Moving Picture Quality Metrics

MPEG-J – Framework for MPEG Java APIs

MSB – Most Significant Bit

MUX – MultipleX

NACK – Negative Acknowledgment

NICAM – Near-Instantaneous Companded Audio Multiplex

NRZ – Non-Return to Zero

OOK – On-Off Keying

OCF – Optimal Coding in the Frequency Domain

PAM – Pulse Amplitude Modulation

PAL – Phase Alternating Line

PAPR – Peak-to-Average Power Ratio

PARCOR – Partial Correlation Coefficient

Parity bit – a bit added to a string of binary code that indicates whether the number of bits with the value one in the string is even or odd

PAT – Program Association Table

Payload – the actual or body data carried in a transmission, as opposed to headers and other overhead

PCM – Pulse Code Modulation

PCR – Program Clock Reference

PDA – Personal Digital Assistant

PDH – Plesiochronous Digital Hierarchy

PES – Packetized Elementary Stream


PID – Packet Identification Field

PMT – Program Map Table

PPDN – Polyphase Decomposition Network

PRBS – Pseudo-Random Binary Sequence

PS – Program Stream

PSK – Phase-Shift Keying

PCC – Punctured Convolutional Code

QAM – Quadrature Amplitude Modulation

QCIF – Quarter Common Intermediate Format

QEF – Quasi Error Free

QPSK – Quadrature Phase Shift Keying

QoS – Quality of Service

RAM – Random Access Memory

RF – Radio Frequency

ROM – Read-Only Memory

RRC – Root-Raised-Cosine

RS – Reed-Solomon

RSA – a public-key cryptosystem named after Ron Rivest, Adi Shamir and Leonard Adleman, who first publicly described the algorithm in 1977; one of the first practicable public-key cryptosystems

RZ – Return to Zero

SAM – Secure Access Module

SBC – Sub-Band Coding

S&H – Sample-and-Hold

SD – Standard Definition

SDH – Synchronous Digital Hierarchy

SDTV – Standard Definition Television

SECAM – Séquentiel Couleur à Mémoire

SER – Symbol Error Rate

SFN – Single Frequency Network

SIF – Source Intermediate Format

SMPTE – Society of Motion Picture and Television Engineers

SNR – Signal to Noise Ratio

SPL – Sound Pressure Level

SRRC – Square-Root-Raised-Cosine

SRS – Sampling Rate Scalable

STB – Set-Top-Box

TCP – Transmission Control Protocol

TF – Transmission Frame

TNS – Temporal Noise Shaping

TPS – Transmission Parameter Signalling

TS – Transport Stream

TV – Television

UHF – Ultra-High Frequency

VCM – Variable Coding and Modulation

VHF – Very-High Frequency

Vocoder – Voice Coder

VOD – Video on Demand

VQ – Vector Quantization

ZigBee – a specification for a suite of high-level communication protocols used to create personal area networks built from small, low-power and relatively cheap digital radios


LIST of FIGURES

Fig. 1.1. Illustration of the main television principle ------------------------------------------------------------------------ 18

Fig. 1.2. Nipkow disk for a 16-line picture (picture capturer – transmitter) --------------------------------------- 19

Fig. 1.3. Basic mirror drum construction -------------------------------------------------------------------------------------- 19

Fig. 1.4. Representation of progressive scanning (625 lines) -------------------------------------------------------------- 20

Fig. 1.5. Representation of interlaced scanning (625 lines) ---------------------------------------------------------------- 20

Fig. 1.6. Illustration of a composite monochrome video signal ------------------------------------------------------------ 21

Fig. 1.7. Illustration of a composite color video signal PAL or NTSC --------------------------------------------------- 24

Fig. 1.8. Frequency spectrum of the PAL signal ----------------------------------------------------------------------------- 24

Fig. 1.9. Color plan of the NTSC system -------------------------------------------------------------------------------------- 25

Fig. 2.1. Simplified structure of the human eye ------------------------------------------------------------------------------ 29

Fig. 2.2. Simplified structure of retina ----------------------------------------------------------------------------------------- 31

Fig. 2.3. The sensitivity of the different cones (S, M, L) and rods (R) to varying wavelengths ---------------------- 31

Fig. 2.4. Illustration of the encoding of cone signals into opponent color’s signals in the human visual system - 33

Fig. 2.5. Newton‘s color wheel -------------------------------------------------------------------------------------------------- 34

Fig. 2.6. Goethe‘s color triangle ------------------------------------------------------------------------------------------------- 34

Fig. 2.7. The CIE 1931 xy chromaticity space -------------------------------------------------------------------------------- 34

Fig. 2.8. Primary and secondary colors for RGB and CMYK models ---------------------------------------------------- 36

Fig. 2.9. RGB and CMY Color 3-D Models ---------------------------------------------------------------------------------- 38

Fig.3.1. Cross-section of the human ear --------------------------------------------------------------------------------------- 41

Fig.3.2. Functional diagram of the human ear -------------------------------------------------------------------------------- 41

Fig. 3.3. Equal loudness curves corresponding to threshold of the quiet and pain limit ------------------------------- 44

Fig. 3.4. Illustration of time masking ------------------------------------------------------------------------------------------ 45

Fig. 3.5. Illustration of frequency masking ------------------------------------------------------------------------------------ 46

Fig. 3.6. Resonant frequencies versus position along the basilar membrane -------------------------------------------- 46

Fig. 3.7. The critical band versus central frequency ------------------------------------------------------------------------- 47

Fig. 3.8. Relationship between the frequency scale and mel-scale -------------------------------------------------------- 48

Fig. 3.9. Mel scale filterbank ---------------------------------------------------------------------------------------------------- 49

Fig. 3.10. A schematic diagram of the human speech production system ------------------------------------------------ 49

Fig. 3.11. An example of glottal volume velocity ---------------------------------------------------------------------------- 49

Fig. 3.12. A block diagram of human speech production ------------------------------------------------------------------- 50

Fig. 3.13. Rosenberg approximation of glottal pulse ------------------------------------------------------------------------ 51

Fig. 3.14. Generation of the excitation signal for voiced speech ------------------------------------------------------ 51

Fig. 3.15. Generation of the excitation signal for unvoiced speech--------------------------------------------------- 51

Fig. 3.16. The simplest physical model of vocal tract ----------------------------------------------------------------------- 51

Fig. 3.17. Representation of vocal tract as a concatenation of lossless acoustic tubes --------------------------------- 52

Fig. 3.18. Direct form implementation of digital filter system function describing vocal tract ---------------------- 52

Fig. 3.19. General model for speech production ------------------------------------------------------------------------------ 53

Fig. 4.1. Analog-to-digital converter ------------------------------------------------------------------------------------------- 54

Fig. 4.2. Analog signal reconstruction system -------------------------------------------------------------------------------- 54

Fig. 4.3. Illustration of ideal sampling process ------------------------------------------------------------------------------- 55

Fig.4.4. Illustration of undersampling effects --------------------------------------------------------------------------------- 57

Fig.4.5. A complete design of digital (PCM) signal generation system -------------------------------------------------- 58

Fig. 4.6. Illustration of flat-top sampling -------------------------------------------------------------------------------------- 59

Fig. 4.7. Illustration of waveforms in a PCM system ------------------------------------------------------------------------ 60

Fig. 4.8. SNR relation to signal value and PCM code word bit number -------------------------------------------------- 61

Fig. 4.9. Position of samples in the 4:2:2 format ----------------------------------------------------------------------------- 67


Fig. 4.10. Position of samples in the 4:2:0 format --------------------------------------------------------------------------- 68

Fig. 4.11. Position of samples in the SIF format ----------------------------------------------------------------------------- 68

Fig. 5.1. Approximations of voice analog signal probability density function ------------------------------------------ 71

Fig. 5.2. Input-output relations for an A-law characteristic ---------------------------------------------------------------- 72

Fig. 5.3. Distribution of quantization levels for the A-law 3-bit quantizer with the standard A = 87,56 ---------- 72

Fig. 5.4. Output SNR versus input signal level of 8-bit PCM systems with and without companding-------------- 73

Fig. 5.5 Practical approximation of A-law characteristic -------------------------------------------------------------------- 73

Fig. 5.6. PCM code word structure --------------------------------------------------------------------------------------------- 73

Fig. 5.7. Block diagram representation of adaptive quantization with variable step size ------------------------------ 74

Fig. 5.8. Block diagram representation of adaptive quantization with variable gain ----------------------------------- 74

Fig. 5.9. Block diagram of feed-forward adaptive quantizer with time varying step size ----------------------------- 75

Fig. 5.10. Block diagram of feed-forward quantizer with time varying gain -------------------------------------------- 75

Fig. 5.11. Examples of the RMS value estimates and corresponding signal waveforms ------------------------------ 78

Fig. 5.12. Examples of the RMS value estimates and corresponding signal waveforms ------------------------------ 81

Fig. 5.13. Block diagram of feed-backward adaptive quantizer with time varying step size ------------------------- 82

Fig. 5.14. Block diagram of feed-backward adaptive quantizer with time varying gain ------------------------------ 82

Fig. 5.15. Autocorrelation function estimate for speech signal ------------------------------------------------------------ 83

Fig. 5.16. General block diagram of differential quantizer ----------------------------------------------------------------- 84

Fig. 5.17. Illustration of vector quantization with L=25 two-dimensional (m=2) code vectors ---------------------- 86

Fig. 5.18. Illustration of vector quantization principle ---------------------------------------------------------------------- 86

Fig. 6.1. Image example ---------------------------------------------------------------------------------------------------------- 89

Fig. 6.2. Horizontal correlation coefficient for luminance component of the image in Figure 6.1 ------------------ 89

Fig. 6.3. Illustration of Huffman tree building ------------------------------------------------------------------------ 91

Fig. 6.4. Illustration of assignment of codes to the symbols ---------------------------------------------------------------- 91

Fig. 6.5. Illustration of 1-D prediction ----------------------------------------------------------------------------- 93

Fig. 6.6. Illustration of 2-D prediction ----------------------------------------------------------------------------------------- 94

Fig. 6.7. Initial stage of motion-compensated prediction procedure ------------------------------------------------------ 95

Fig. 6.8. Search area in the prediction picture for the selected block in the current picture and generation

of block in motion compensated picture --------------------------------------------------------------------------- 96

Fig. 6.9. Illustration of current block and search area ----------------------------------------------------------------------- 96

Fig. 6.10. Illustration of search procedure ------------------------------------------------------------------------------------- 96

Fig. 6.11. Block diagram of motion compensated encoder ----------------------------------------------------------------- 97

Fig. 6.12. Block diagram of matching decoder ------------------------------------------------------------------------------- 98

Fig. 6.13. Block diagram of lossy motion compensated encoder ---------------------------------------------------------- 98

Fig. 6.14. Typical rate-distortion curve ---------------------------------------------------------------------------------------- 99

Fig. 6.15. Graphical view of 8 basic functions for 1-D DCT ------------------------------------------------------------- 102

Fig. 6.16. Basic vectors for 2-D DCT ---------------------------------------------------------------------------------------- 103

Fig. 6.17. Illustration of zig-zag scanning order ---------------------------------------------------------------------------- 105

Fig. 6.18. Zig-zag scan order of quantized DCT coefficients ------------------------------------------------------------ 105

Fig. 6.19. Motion-compensated prediction difference DCT encoder --------------------------------------------------- 106

Fig. 6.20. Motion-compensated prediction difference DCT decoder --------------------------------------------------- 106

Fig. 7.1. Linear prediction filter ----------------------------------------------------------------------------------------------- 108

Fig. 7.2. Block diagram of prediction error filter --------------------------------------------------------------------------- 108

Fig. 7.3. Final block diagram of speech synthesis filter ------------------------------------------------------------------- 109

Fig. 7.4. Variation of prediction error versus predictor order ------------------------------------------------------------ 113

Fig. 7.5. Illustration of sliding window technique ------------------------------------------------------------------------- 114

Fig. 7.6. Some most popular windows --------------------------------------------------------------------------------------- 114

Fig. 7.7. Two possible average spectra of speech signal ------------------------------------------------------------------ 115


Fig. 7.8. Frequency response of pre-emphasis filter ----------------------------------------------------------------------- 115

Fig. 8.1. Basic sub-band coding scheme ------------------------------------------------------------------------------------- 116

Fig. 8.2. Illustration of the sub-band decomposition and reconstruction principles ---------------------------------- 117

Fig. 8.3. Basic block diagram of a Layer 1 encoder ----------------------------------------------------------------------- 118

Fig. 8.4. Grouping of sub-band samples for Audio Layers 1, 2, 3------------------------------------------------------- 119

Fig. 8.5. ISO/MPEG/Audio Layer 1 frame structure; valid for 384 PCM audio input samples, duration 8 ms

with a sampling rate of 48 kHz ------------------------------------------------------------------------------------- 119

Fig.8.6. Basic block diagram of a Layer 1 decoder ------------------------------------------------------------------------ 120

Fig. 8.7. Basic block diagram of a Layer 2 encoder ----------------------------------------------------------------------- 120

Fig. 8.8. ISO/MPEG/Audio Layer 2 frame structure; valid for 1152 PCM audio input samples, duration 24 ms

with a sampling rate of 48 kHz ------------------------------------------------------------------------------------- 121

Fig. 8.9. Basic block diagram of a Layer 3 encoder ----------------------------------------------------------------------- 122

Fig. 8.10. MPEG Audio Layer 3 filter bank processing ------------------------------------------------------------------- 122

Fig. 8.11. MDCT window types and arrangement of transition between overlapping long and short

window types -------------------------------------------------------------------------------------------------------- 123

Fig. 8.12. Huffman partitions -------------------------------------------------------------------------------------------------- 124

Fig. 8.13. The arrangement of the various bit fields in a Layer 3 bit stream ------------------------------------------ 124

Fig. 8.14. Structure of MPEG-2 advanced audio coder ------------------------------------------------------------------- 127

Fig. 9.1. A typical digital TV transmission setup -------------------------------------------------------------------------- 130

Fig. 9.2. MPEG-2 audio and video systems at transmission side -------------------------------------------------------- 132

Fig. 9.3. Video elementary stream format ----------------------------------------------------------------------------------- 133

Fig. 9.4. Header for MPEG-2 video elementary stream ------------------------------------------------------------------- 133

Fig. 9.5. Packetized elementary stream header ----------------------------------------------------------------------------- 133

Fig. 9.6. Possible data streams in multiple program mode ---------------------------------------------------------------- 134

Fig. 9.7. MPEG-2 audio and video systems at reception side ------------------------------------------------------------ 135

Fig. 9.8. Arrangement of the PESs in an MPEG-2 transport stream ---------------------------------------------------- 135

Fig. 9.9. The structure of the transport packet and its header ------------------------------------------------------------ 136

Fig. 9.10. Multiplexed audio and video packets in MPEG transport stream ------------------------------------------- 136

Fig. 9.11. Illustration of insertion of signaling tables ---------------------------------------------------------------------- 137

Fig. 10.1. Three layers around the protected contents --------------------------------------------------------------------- 140

Fig. 10.2. The major components of the DVB-CA architecture --------------------------------------------------------- 141

Fig. 10.3. Illustration of the ECM and EMM generation process ------------------------------------------------------- 142

Fig. 10.4. Illustration of decryption of the control words from CM and EMM --------------------------------------- 142

Fig. 10.5. Illustration of the process of finding ECM and EMM in the transport stream ---------------------------- 143

Fig. 10.6. The components of the DVB-CA architecture integrated in DVB data network ------------------------- 145

Fig. 11.1. Tapped shift register ------------------------------------------------------------------------------------------------ 147

Fig. 11.2. Illustrative binary scrambler and descrambler ----------------------------------------------------------------- 148

Fig. 11.3. Shift register sequence generator --------------------------------------------------------------------------------- 149

Fig. 11.4. Sync words for frame synchronization -------------------------------------------------------------------------- 149

Fig. 12.1. Illustration of block interleaving and adding of parity bits--------------------------------------------------- 153

Fig. 12.2. Convolutional interleaving – deinterleaving scheme---------------------------------------------------------- 154

Fig. 12.3. The organization of an RS code with m=8, k=223 and r=32 ------------------------------------------------ 156

Fig. 12.4. Convolutional encoder (2, 1, 3) with V=2, L=1, M=3 --------------------------------------------------------- 157

Fig. 12.5. Punctured (3, 2, 3) code encoder composed from two (2, 1, 3) encoders ---------------------------------- 158

Fig. 12.6. Block diagram of an encoder for L=3---------------------------------------------------------------------------- 158

Fig. 12.7. Illustration of encoding of two bit sequence -------------------------------------------------------------------- 158

Fig. 12.8. A state diagram for the (2,1,3) code ----------------------------------------------------------------------------- 160

Fig. 12.9. Tree diagram of (2,1,3) code -------------------------------------------------------------------------------------- 161


Fig. 12.10. Trellis diagram for (2,1,3) code --------------------------------------------------------------------------------- 162

Fig. 12.11. Sequential decoding path search -------------------------------------------------------------------------------- 163

Fig. 12.12. Setting the threshold in sequential decoding ------------------------------------------------------------------ 165

Fig. 12.13. Viterbi decoding; Step 1 ------------------------------------------------------------------------------------------ 166

Fig. 12.14. Viterbi decoding; Step 2 ------------------------------------------------------------------------------------------ 167

Fig. 12.15. Viterbi decoding; Step 3 ------------------------------------------------------------------------------------------ 167

Fig. 12.16. Viterbi decoding; Step 4 ------------------------------------------------------------------------------------------ 167

Fig. 12.17. Viterbi decoding; Step 5 ------------------------------------------------------------------------------------------ 168

Fig. 12.18. Viterbi decoding; Step 6 ------------------------------------------------------------------------------------------ 168

Fig. 12.19. Viterbi decoding; Step 7 ------------------------------------------------------------------------------------------ 168

Fig. 12.20. Viterbi decoding; Step 8 ------------------------------------------------------------------------------------------ 168

Fig. 13.1. A cosinusoidal signal ----------------------------------------------------------------------------------------------- 170

Fig. 13.2. Rotating phasor ------------------------------------------------------------------------------------------------------ 172

Fig. 13.3. Illustration of rotating phasor projections on the real and imaginary axis --------------------------------- 172

Fig. 13.4. Illustration examples of the simplest digitally modulated signals ------------------------------------------ 173

Fig. 13.5. Illustration example of 8-QAM modulated signal ------------------------------------------------------------- 175

Fig. 13.6. Examples of simple signals and their constellation diagrams ----------------------------------------------- 175

Fig. 13.7. Examples of constellation diagrams and corresponding 4-ASK and 4-PSK signals --------------------- 176

Fig. 13.8. Constellation diagrams for 8-APSK, 16-APSK, Gray encoded 8-PSK and square 16-QAM ---------- 176

Fig. 13.9. Illustration of the influence of constellation points coding to signal resistance to noise ---------------- 176

Fig. 13.10. Schematic block diagram of quadrature modulator and demodulator ------------------------------------ 177

Fig. 13.11. Illustration of ISI on received pulses; Ts is the symbol period (in the particular case, the bit period) -- 178

Fig. 13.12. Characteristics of raised-cosine filter for three values of the roll-off factor ----------------------------- 178

Fig. 13.13. Frequency responses of Nyquist filter (RC filter) caused by a series of pulses ------------------------- 179

Fig. 13.14. Typical frequency response of a time-varying channel example ------------------------------------------ 180

Fig. 13.15. Illustration of intersymbol interference formation ----------------------------------------------------------- 182

Fig. 13.16. Illustration of COFDM signal spectrum ----------------------------------------------------------------------- 183

Fig. 13.17. Illustration of the formation of a guard interval -------------------------------------------------------------- 184

Fig. 13.18. Signal processing diagrams; (a) – at the transmitter, (b) – at the receiver ------------------------------- 185

Fig. 13.19. Illustration of inserting pilot cells as is used in DVB-T ----------------------------------------------------- 186

Fig. 14.1. Functional block diagram of the DVB-T system -------------------------------------------------------------- 188

Fig. 14.2. MPEG-2 transport stream packet --------------------------------------------------------------------------------- 189

Fig. 14.3. Scrambler/descrambler schematic diagram --------------------------------------------------------------------- 190

Fig. 14.4. Randomized transport stream packets: Sync bytes and randomized data bytes -------------------------- 190

Fig. 14.5. Reed-Solomon RS(204,188,8) error protected packet -------------------------------------------------------- 190

Fig.14.6. Conceptual diagram of the outer interleaver and deinterleaver; Interleaving depth l=12; --------------- 191

Fig. 14.7. Data structure after outer interleaving --------------------------------------------------------------------------- 191

Fig. 14.8. The sequences of bytes at the different points of interleaver / deinterleaver ------------------------------ 192

Fig. 14.9. Punctured convolutional code encoder for inner coding ----------------------------------------------------- 192

Fig. 14.10. Inner coding and interleaving------------------------------------------------------------------------------------ 193

Fig. 14.11. Mapping of input bits onto output modulation symbols for 16-QAM system -------------------------- 194

Fig. 14.12. 16-QAM constellations for DVB-T system ------------------------------------------------------------------- 196

Fig. 14.13. Transmission frame for the DVB-T signal -------------------------------------------------------------------- 197

Fig. 14.14. Conceptual block diagram of elements at the cable head-end and receiver ----------------------------- 199

Fig. 14.15. Byte to 6-bit symbol conversion for 64-QAM modulation ------------------------------------------------- 200

Fig. 14.16. Example implementation of the byte to m-tuple conversion and the differential encoding

of the two MSBs -------------------------------------------------------------------------------------------------- 200

Fig. 14.17. The DVB-C constellation diagram for 16-QAM ------------------------------------------------------------- 201


Fig. 14.18. Functional block diagram of the DVB-S system ------------------------------------------------------------- 203

Fig. 14.19. QPSK constellation used in DVB-S system------------------------------------------------------------------- 203

Fig. 15.1. Block diagram of an encoder of an MPEG-4 I-frame --------------------------------------------------------- 207

Fig. 15.2. Block diagram of an encoder of an MPEG-4 P-frame ........................................................................ 207

Fig. 15.3. Generic block diagram of the CABAC entropy coding scheme ......................................................... 215

Fig. 15.4. An example for a Tanner graph .......................................................................................................... 217

Fig. 15.5. DVB-H Frame structure ...................................................................................................................... 220

Fig. 15.6. Representation of the physical-layer framing structure of DVB-S2 system ....................................... 225

Fig. 15.7. Rotated 16-QAM ................................................................................................................................ 229

Fig. 15.8. DVB-T2 typical system architecture ................................................................................................... 230


LIST of TABLES

Table 4.1. Various interpretations of 3 bit code words ---------------------------------------------------------------------- 62

Table 4.2. Natural binary and Gray 3 bits code words ---------------------------------------------------------------------- 63

Table 4.3. Parameters of various PCM signals ------------------------------------------------------------------------------- 64

Table 4.4. Modulation type and required bandwidth for transmission rate of 108 Mb/s ------------------------------ 66

Table 5.1. Step size multipliers for adaptive feed-backward quantization of speech signals ------------------------- 83

Table 5.2. Signal-to-noise ratios using various quantizers with B=4 for the same speech material ----------------- 83

Table 6.1. Calculation of the average information quantity for four symbols ------------------------------------------ 90

Table 6.2. Illustration of decoding procedure -------------------------------------------------------------------------------- 92

Table 6.3. The values of the Hadamard functions H_m, m = 1, 2, … --------------------------------------------- 100

Table 6.4. Intermediate and final results of Hadamard inverse transformation calculation ------------------------- 100

Table 8.1. List of the configuration used in MPEG audio coding standards ------------------------------------------ 118

Table 8.2. Coding tools used in MPEG-2 AAC ---------------------------------------------------------------------------- 128

Table 10.1. Main conditional access systems ------------------------------------------------------------------------------- 146

Table 11.1. Illustrative scrambler input, output and register cells contents ------------------------------------------- 148

Table 12.1. Generator Polynomials for good rate 1/2 codes [85] ------------------------------------------------------- 157

Table 12.2. Producing a coded sequence ------------------------------------------------------------------------------------ 159

Table 12.3. Look-up table for the encoder (2,1,3) ------------------------------------------------------------------------- 159

Table 12.4. Bit agreement as a metric for decision between the received sequence and the 8 possible

valid code sequences --------------------------------------------------------------------------------------------- 163

Table 12.5. Each branch has a Hamming metric depending on what was received and the valid code

words at that state ------------------------------------------------------------------------------------------------- 166

Table 13.1. Correspondence between signal amplitudes and phases, and bit values--------------------------------- 174

Table 14.1. Puncturing pattern and transmitted sequence after parallel-to-serial conversion for the possible

code rates----------------------------------------------------------------------------------------------------------- 193

Table 14.2. Demultiplexer’s mapping rules for 16-QAM modulation ------------------------------------------------- 194

Table 14.3. Permutation functions of bit-wise interleavers -------------------------------------------------------------- 195

Table 14.4. Main parameters of the DVB-T terrestrial system ---------------------------------------------------------- 199

Table 14.5. Truth table for differential coding ----------------------------------------------------------------------------- 201

Table 14.6. Conversion of constellation points of quadrant 1 to other quadrants of the constellation diagram - 201

Table 14.7. Available bit rates (Mbit/s) for DVB-C system [88] ------------------------------------------------------- 202

Table 15.1. Exponential Golomb Code (for data elements other than transform coefficients) --------------------- 214

Table 15.2. Overview over messages received and sent by the C-nodes in step 2 of the message passing algorithm

------------------------------------------------------------------------------------------------------------------------------ 218

Table 15.3. Step 3 of the described decoding algorithm ------------------------------------------------------------------ 218

Table 15.4. Modes and features of DVB-S2 in comparison to DVB-S ------------------------------------------------ 224

Table 15.5. Comparison of available modes in DVB-T and DVB-T2 ------------------------------------------------- 230

Table 15.6. Example of MFN mode in the United Kingdom ------------------------------------------------------------ 231


INTRODUCTION "Attention, the Universe! By kingdoms, right wheel!" This phrase is the first telegraph message

sent over a 16-km line by Samuel F. B. Morse in 1838. With this message a new era, the era of

electrical communication, was born. Until our times communication engineering had advanced to the

point that earthbound TV viewers could watch astronauts working in space. Telephone, radio, and

television have become integral parts of modern life. Certainly great steps have been made since the

days of Morse. Equally certain, coming years will bring many new achievements of communication

engineering.

This textbook presents an introduction to digital television technologies. Its purpose is to

describe and explain, as simply and as completely as possible, the various aspects of the very complex

problems and solutions chosen for the European Digital Video Broadcasting (DVB) system.

The textbook is one of the results of the activities carried out under the Leonardo da Vinci program

project „Education Course of Digital TV Technologies for Vocational Educational Schools” (Project

nr. 2013-1-LV1-LEO05-05127).

The aim of writing the textbook was to help vocational students from the project partner countries (Latvia, Lithuania, Estonia and Denmark) acquire new skills, knowledge and qualifications in the area of digital TV technologies and to enhance their competitiveness in the labor market. The textbook is intended for readers with an elementary background in electronics and communication systems. Some basic knowledge of digital communication system principles, signal digitization and compression, error control coding and conventional analog television is presented in this book for those who require it.

We begin here with a descriptive overview that establishes a perspective for the chapters that

follow.

The first chapter provides an overview of the history of television development, beginning from the Nipkow disk and covering monochrome, color and digital television. The main television principle, based on synchronized scanning of the input picture at the transmission side and of the electron beam across the TV screen at the reception side, is also briefly described.

The second chapter deals with the anatomic structure of the human eye and the functions of its component parts, and with the principles of color vision. A short historical overview of the development of color models and some basic modern color models used in television, such as RGB, YUV and YCbCr, are also introduced.

The third chapter introduces the aspects of human hearing that are critical in determining subjective audio quality, and the elements of the speech production system model, from which the methods of speech compression directly follow.

Chapter 4 describes the general principles of signal digitization, paying separate attention to the digitization of video signals. Questions concerning the Nyquist–Shannon sampling theorem, pulse code modulation and video signal digitization formats are discussed.

Chapter 5 introduces the reader, step by step, to the compression of speech, audio and video signals by quantization, beginning from the simplest method, based on non-uniform quantizing, and continuing with adaptive, differential and vector quantization.

Chapter 6 covers the characteristics of video material and describes the signal processing used to reduce the spatial and temporal redundancy of digital video signals, with paragraphs describing predictive coding, prediction with motion compensation and transform coding.

The principles of the linear predictive analysis of speech signals are analyzed in the seventh chapter. The material presented here may seem a little too puzzling, so it can be omitted on a first reading.

Chapter 8 is devoted to audio signal compression based on sub-band filtering. The specific audio compression methods used by the MPEG-1, MPEG-2 and MPEG-4 standards are described as well.

Chapter 9 deals with the formation principles of multimedia programs and of the MPEG transport stream structure, which is later used for demultiplexing and decompression in a digital TV receiver.


To protect a DVB data network, the DVB standard integrates into its broadcasting infrastructure an access control mechanism, commonly known as Conditional Access. This enables the broadcasting industry to offer hundreds of pay-TV channels to consumers. The conditional access system architecture and operation are analyzed in chapter 10.

Chapter 11 discusses the randomization of bit or symbol streams, performed to avoid long runs of identical symbols, which could make the receiver incapable of extracting the timing information needed for correct operation of any communication system. The structure and operation of randomization/de-randomization units (scramblers/descramblers) are analyzed here. Some elementary questions about the generation of pseudo-random sequences and frame synchronization are also discussed.

Chapter 12 provides a view of coding for error correction. General questions of block codes and Reed–Solomon codes are considered, as well as convolutional coding and decoding by the sequential and Viterbi algorithms. In addition, block and convolutional interleaving are discussed.

An understanding of modulation as one of the main processes of signal transmission theory and practice is introduced in chapter 13. An economical representation of modulated signals in the form of constellation diagrams is presented. Various types of digital modulation, from the simplest, such as Binary Phase Shift Keying (BPSK), to Coded Orthogonal Frequency Division Multiplexing (COFDM), are analyzed.

Chapter 14 is devoted to the analysis of the Digital Video Broadcasting (DVB) systems: DVB-T (terrestrial), DVB-C (cable) and DVB-S (satellite). The functional diagrams of the systems are presented, and the functions and operation of the systems' blocks are explained.

Finally, chapter 15 looks at the state of the art and the perspectives of digital TV. Some systems already used in practice, such as High Definition TV (HDTV), and systems just coming into practice, such as DVB-T2, are briefly described.

We are pleased to acknowledge our indebtedness to our project partners prof. Reza Tadayoni, prof. Knud Erik Scouby and researcher/lecturer Romass Pauliks for useful discussions on the contents of the book.


1. TELEVISION HISTORY

1.1. Main Television Principle

Before analyzing the historical stages of television, it is useful to consider a basic outline of how television itself works. To produce a moving black-and-white (monochrome TV) picture, it is necessary to record, transmit and display information about a two-dimensional pattern of brightness. Nominally, this means specifying in parallel how the brightness of every point in the picture varies in time. In practice, a technique called raster scanning is used to convert a series of still pictures into a single serial data stream.

A light sensor scans a predefined path over the picture and reads out how the picture brightness varies along each line in turn (see Figure 1.1). The arrangement of lines and the order, speed and direction in which they are scanned is called the raster pattern of the TV system. This scanning process makes it possible to obtain from the sensor a single time-varying signal which indicates how to reconstruct the picture in a TV receiver. This is normally done by scanning an electron beam across a screen covered with a phosphor (a class of chemicals which fluoresce when illuminated with electrons). The signal coming from the TV transmitter at every instant is used to control the intensity of the electron beam.

In order for the system to work correctly, it must be ensured that the raster patterns at the transmitter (in the video camera) and in the TV receiver are the same and that the two raster scans are correctly synchronized.


Fig. 1.1. Illustration of the main television principle
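To make the raster principle concrete, the following minimal sketch (an illustration written for this text, not part of any TV standard; the small 3x3 picture array and the function names are our own) serializes a two-dimensional brightness pattern line by line and rebuilds it at the receiving side. The final check only passes because both sides use the same raster geometry, which is exactly the synchronization requirement described above.

# Minimal raster-scanning sketch: a 2-D brightness pattern is read out
# line by line into a single serial stream; the receiver rebuilds the
# picture by writing the stream back with the same raster geometry.

def raster_scan(picture):
    """Serialize a 2-D list of brightness values row by row."""
    stream = []
    for line in picture:          # top to bottom: one scan line at a time
        for brightness in line:   # left to right along the line
            stream.append(brightness)
    return stream

def raster_rebuild(stream, lines, pixels_per_line):
    """Receiver side: the same raster geometry recovers the picture."""
    return [stream[row * pixels_per_line:(row + 1) * pixels_per_line]
            for row in range(lines)]

picture = [[0, 128, 255],
           [64, 192, 32],
           [255, 0, 100]]
stream = raster_scan(picture)
# Reconstruction succeeds only when both raster scans agree (are synchronized)
assert raster_rebuild(stream, 3, 3) == picture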

1.2. Development of Monochrome Television

Television came into being based on the inventions and discoveries of many scientists and inventors. At the dawn of television history there were two distinct paths of technology experimented with by researchers: early inventors attempted either to build a mechanical television system based on the technology of Paul Nipkow's rotating disk, or to build an electronic television system using the cathode ray tube (CRT), developed in 1897 by the German inventor F. Braun and proposed for image reconstruction in 1907 by the Russian scientist Boris Rosing and independently in 1910 by the English inventor A. A. Campbell-Swinton [1, 2]. However, the first generation of television sets was not entirely electronic.

In 1884 the German inventor Paul Nipkow developed a rotating-disk technology, called the Nipkow disk, to transmit pictures over wire. Nipkow was the first person to discover television's scanning principle, in which the light intensities of small portions of an image are successively analyzed and transmitted. Figure 1.2 shows a schematic picture of the 1884/85 Nipkow rotating-disk image scanner. Although he never built a working model of the system, his rotating disk became exceedingly common and remained in use until 1939.
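As a simple illustration of Nipkow's idea, the sketch below computes hole positions for a 16-line disk. It is a simplified geometric model written for this text, not a historical design: the disk radius, the line height and the even angular spacing are assumed values. The holes lie on a single-turn spiral, so during one revolution each hole sweeps across the picture aperture one line lower than the previous one.

import math

# Simplified geometric model of a 16-line Nipkow disk: N holes on a
# single-turn spiral; each hole crosses the picture aperture in turn,
# and the decreasing radius shifts each successive scan line inward.

N_LINES = 16         # one hole per scan line (16-line picture assumed)
R_OUTER = 100.0      # radius of the outermost hole, arbitrary units
LINE_HEIGHT = 2.0    # radial step between holes = height of one line

for k in range(N_LINES):
    angle = 2 * math.pi * k / N_LINES    # holes evenly spaced in angle
    radius = R_OUTER - k * LINE_HEIGHT   # spiral: each hole slightly inward
    x, y = radius * math.cos(angle), radius * math.sin(angle)
    print(f"hole {k:2d}: angle = {math.degrees(angle):6.1f} deg, "
          f"radius = {radius:5.1f}, position = ({x:6.1f}, {y:6.1f})")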


In 1889 J. L. Weiller introduced the mirror drum as a scanning device. In its original form, there were as many mirrors as there were lines in the picture, and each mirror was tilted at a different angle with respect to the axis of the drum. For horizontal scanning, the mirror drum axle was supported in a vertical plane; for vertical scanning, the axle was horizontal. As the drum rotated, each mirror caused a line to be scanned below (horizontal scan) or beside (vertical scan) the previous one. The drawing shown in Figure 1.3 is of a basic vertically scanned mirror drum system. In this drawing, the modulated light originates at a small aperture and then passes through a projection lens. This lens is adjusted so that the light at the aperture is focused on the screen. Of course, in order for the light to get there, it must be reflected off the mirrors on the drum, which thereby control the position of the light spot on the screen as the drum rotates. Since each mirror is carefully set at a different but proper angle, as the drum rotates the focused light spot takes on the appearance of a set of parallel horizontal or vertical lines, commonly referred to as a raster. With proper modulation of the light it is possible to form a picture.

Within a few more years, there were numerous other scanners proposed, both lens and aperture

types and some that used slots instead of holes.

Fig. 1.2. Nipkow disk for a 16-line picture (picture capturer – transmitter)

Fig. 1.3. Basic mirror drum construction

At that time the mechanical display (TV screen) had a small motor with a rotating disk and a neon lamp, which worked together to give a blurry reddish-orange picture about half the size of a business card [3]. The period before 1935 is called the "Mechanical Television Era". This type of television is not compatible with today's fully electronic television system. Electronic television systems worked better and eventually replaced mechanical systems.

Already in 1907 B. Rosing proposed the idea of a hybrid system with a Nipkow disk or spinning mirror for image scanning and a CRT for image reconstruction. In 1910 A. A. Campbell-Swinton proposed a system with a CRT for both image scanning and reproduction.

In 1911 B. Rosing, a professor at the St. Petersburg Institute of Technology, and his student V. K. Zworykin, later a Russian-American inventor and engineer, demonstrated a working hybrid TV system with a CRT as the receiver and a mechanical device as the transmitter. This demonstration, based on an improved design, was among the first demonstrations of TV of any kind.

In the 1920s, John Logie Baird patented the idea of using arrays of transparent rods to transmit images for television. Baird's 30-line images were the first demonstrations of television by reflected light rather than back-lit silhouettes. Baird based his technology on Paul Nipkow's scanning disk idea and later developments in electronics. He was the first person in the world to demonstrate a working television system. On January 26th, 1926, a viable television system was demonstrated using mechanical picture scanning with electronic amplification at the transmitter and at the receiver. The raster consisted of 30 lines at 5 images per second, later 12,5 images per second. This low definition resulted in a video bandwidth of less than 10 kHz, allowing these pictures to be broadcast on an ordinary AM/MW or LW transmitter. The signal could be sent by radio or over ordinary telephone lines, leading to the historic trans-Atlantic transmissions of television from London to New York in February 1928.

Charles Jenkins invented a mechanical television system called radio vision and claimed to

have transmitted the earliest moving silhouette images on June 14, 1923.

The first electronic television systems were based on the development of the CRT, which is the

picture tube found in modern TV sets.

V. Zworykin invented an improved CRT called the kinescope in 1929. The kinescope tube was

sorely needed for television. Zworykin was one of the first to demonstrate a television system with

all the features of modern picture tubes.

In 1927, Philo Farnsworth was the first inventor to transmit a television image composed of 60 horizontal lines. The image transmitted was a dollar sign. Farnsworth developed the image dissector tube, the basis of all current electronic televisions.

The resolution soon improved to 90 and then 120 lines, and then stabilized for a while at 180 lines (Germany, France) or 240 lines (England, the United States) around 1935. Scanning was progressive, which means that all lines of the picture were scanned sequentially in one frame, as depicted in Figure 1.4 (numbered here for a 625-line system).

[Figure content: one frame of 625 lines (575 visible) with a 50-line frame retrace; two fields of 312,5 lines each (2×287,5 visible), each followed by a 25-line field retrace]

Fig. 1.4. Representation of progressive scanning (625 lines)

Fig. 1.5. Representation of interlaced scanning (625 lines)

These definitions, used for the first "regular" broadcasts, were the practical limit for the Nipkow

disk used for picture analysis; the CRT started to be used for display at the receiver side. In order to

avoid disturbances due to electromagnetic radiation from transformers or a ripple in the power supply,

the picture rate (or frame rate) was derived from the mains frequency. This resulted in refresh rates

of 25 pictures per second in Europe and 30 pictures per second in the US. The bandwidth required was

of the order of 1 MHz, which implied the use of VHF frequencies (in the order of 40-50 MHz) for

transmission.

Given the presence of the scan lines/raster and the series of 25 distinct pictures every second it

is perhaps surprising that the flickering of the apparently moving picture isn't painfully obvious. The

illusion of a steady moving picture appears for two reasons. Firstly, the human eye/brain has a

property called persistence of vision. The eye takes a finite time to notice sudden changes in

brightness. The brain has had millions of years of evolution to teach it that physical objects don't keep

vanishing and reappearing 25 times a second. Secondly, the phosphors used in TVs keep fluorescing

for a short time after they're hit with electrons. This means they tend to maintain a reasonable

brightness over one or two frame periods. Taken together, these effects help to make flicker go

unnoticed. Despite this, a refresh rate of 25 pictures per second was still too low to give a completely flicker-free illusion.

During the period 1935-1941, electronic television was perfected especially with the invention

of the iconoscope. Several countries began broadcasting, most experimentally, with limited numbers

of TV sets in the hands of the public.


Television broadcasting in the United Kingdom started with Baird’s system in 1932. The British

Broadcasting Corporation (BBC) television service started in 1936 with an on-air “bake-off” between

an improved Baird 240-line mechanical system and a 405-line all-electronic system developed by the

EMI and Marconi companies. In 1937, the 405-line monochrome system, known then as “high

definition,” was selected as the UK standard. Development also occurred in several other European

countries, with a variety of TV systems used for transmissions. The definitions in use attained 405

lines (England) to 441 lines (the United States, Germany) or 455 lines (France) using interlaced

scanning. This method, invented in 1927, consisted of scanning a first field made of the odd lines of

the frame and then a second field made of the even lines (see Figure 1.5), allowing the picture refresh

rate for a given vertical resolution to be doubled (50 or 60 Hz instead of 25 or 30 Hz) without

increasing the bandwidth required for broadcasting.

Regardless of the different numbers of lines in operation, all systems shared common features:

The picture rate was linked to the mains frequency;

Interlaced scanning was used;

The same composite signal combining video, blanking and synchronization information was

used (see Figure 1.6).

Fig. 1.6. Illustration of a composite monochrome video signal: total line duration with visible part; white, black and synchronization levels; horizontal synchronization and horizontal suppression intervals

Soon afterward, due to the increase in the size of the picture tube, and taking into account the

eye's resolution in normal viewing conditions, the spatial resolution of these systems still appeared

insufficient, and most experts proposed a vertical definition of between 500 and 700 lines. The

following characteristics were finally chosen in 1941 for the US monochrome system, which later

became NTSC when it was upgraded to color in 1953:

525 lines, interlaced scanning (two fields of 262,5 lines);

field frequency, 60 Hz (changed to 59,94 Hz upon the introduction of color in order to

minimize the visual effect of beat frequency between sound (4,5 MHz) and color

(3,58 MHz) subcarriers);

line frequency, 15,750 kHz (60×262,5); later changed to 15,734 kHz with development

of color TV (59,94×262,5);

video bandwidth 4,2 MHz;

negative video modulation;

FM sound with carrier 4,5 MHz above the picture carrier.

After World War II most European countries (except France and Great Britain) adopted the German Gerber standard. It was similar to the US system, but the 60 Hz field frequency was changed to 50 Hz while keeping the line frequency as near as possible to 15,750 kHz. This allowed some


advantage to be taken of the American experience with receiver technology. This choice implied an

increased number of lines (approximately in the ratio 60/50) and, consequently, a wider bandwidth

in order to obtain well-balanced horizontal and vertical resolutions. The following characteristics

were defined:

625 lines, interlaced scanning (two fields of 312,5 lines);

field frequency, 50 Hz;

line frequency, 15,625 kHz (50×312,5);

video bandwidth, 5,0 MHz;

negative video modulation;

FM sound carrier 5,5 MHz above the picture carrier.

The standard aspect ratio (width/height) of a standard TV picture was 4:3. Since 625 lines are

transmitted every 25th of a second each line scan must take 64 microseconds. To obtain the same

picture resolution in the vertical and horizontal directions it must be possible to distinguish the

brightness of (4/3)×625=833 distinct pixels (or pels=picture elements) along each line. The highest

signal frequencies will therefore be required if the pixels along the line alternate, light-dark-light-dark-.... Each light-dark alternation is essentially one cycle of the signal, so the highest signal

frequency required to get this horizontal resolution is 833/2 cycles in 64 microseconds, i.e. 6,5 MHz.

Therefore a 6,5 MHz video bandwidth for the picture signal is needed to get the same detail resolution

in the vertical and horizontal directions.
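As a quick check of this arithmetic, here is a minimal Python sketch using the values of the 625-line system described above (variable names are illustrative, not from any standard):

    # Estimate the video bandwidth needed for equal horizontal and
    # vertical resolution in a 625-line, 25-frames-per-second system.
    lines_per_frame = 625
    frame_rate = 25                                   # frames per second
    aspect_ratio = 4 / 3

    line_time = 1 / (lines_per_frame * frame_rate)    # 64 microseconds
    pixels_per_line = aspect_ratio * lines_per_frame  # (4/3) * 625 = 833
    # One light-dark pixel pair corresponds to one signal cycle:
    bandwidth = (pixels_per_line / 2) / line_time     # about 6,5 MHz

    print(line_time * 1e6, "us;", bandwidth / 1e6, "MHz")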

In practice, TV systems allocate a video signal bandwidth of 5,5 MHz or less. This means that the level of detail horizontally is slightly less than vertically, but this usually goes unnoticed. In

practice the video signal doesn't just contain the line-by-line brightness signal as described above. It

also contains some extra patterns which are designed to help the TV receiver synchronize its display

correctly with the transmitted pattern. To do this a synchronization pattern has to be included in the

video signal (see Figure 1.6).

To save on band space, TV uses a Vestigial Sideband system. This is similar to AM, but one of

the sidebands (the lower one) is filtered down to avoid duplicating all the picture information on both

sides of the carrier. This allows us to fit the entire TV signal into an 8 MHz transmission bandwidth

whilst maintaining a 5,5 MHz or less video luminance information bandwidth.

This has formed the basis of all the European color standards defined later (PAL, SECAM, D2-MAC, PALplus), although for many years France used a system with 819 lines.

Louis Parker invented the modern changeable television receiver in 1948. Marvin Middlemark

invented "rabbit ears", the "V" shaped TV antennae.

1.3. Development of Color Television

Color TV was by no means a new idea; a German patent in 1904 contained the earliest proposal,

while in 1925 Zworykin filed a patent disclosure for an all-electronic color television system. A

successful color television system began commercial broadcasting in the US, first authorized by the

Federal Communications Commission (FCC) on December 17, 1953 based on a system invented by

Radio Corporation of America (RCA).

Following on from earlier research, during the 1940s various color television systems were

proposed and demonstrated in the United States. The first all-electronic color television system,

backward compatible with the existing monochrome television system, was developed in the early

1950s and submitted by the second National Television System Committee to the FCC in 1953 [4].

The FCC approved the NTSC color TV standard on 17 December 1953 [5]. This standard was

subsequently adopted by Canada, Mexico, Japan, and many other countries.

The countries of Europe delayed the adoption of a color television system, and in the years

between 1953 and 1967, a number of alternative systems, compatible with the 625-line, 50-field

existing monochrome systems, were devised [6]. These systems had features intended to improve on


NTSC, particularly to eliminate hue errors caused by phase errors of the color subcarrier in the

transmission path.

An early system that received approval was one proposed by Henri de France of the Compagnie

de Television of Paris. He suggested that the two pieces of coloring information (hue and saturation)

could be transmitted as subcarrier modulation that is sequentially transmitted on alternate lines. Such

an approach, designated as SECAM (SEquential Couleur Avec Memoire – for sequential color with

memory) was developed and officially adopted by France and the USSR, and broadcast service began

in France in 1967.

The implementation technique of a one-line delay element for SECAM led to the development,

largely through the efforts of Walter Bruch of the Telefunken Company, of the phase alternation line

(PAL) system. The line-by-line alternation of the phase of one of the color signal components

averages any colorimetric distortions to give the correct value. The PAL system was adopted by

numerous countries in continental Europe, as well as in the United Kingdom, and other countries

around the world. Public broadcasting began in 1967 in Germany and the United Kingdom using two

slightly different variants.

On the way to color television, inventors confronted some problems:

They had to ensure bi-directional compatibility with the existing monochrome standard: a monochrome receiver had to be able to display the new color broadcasts in black and white, and a color receiver had to be able to display the existing black and white broadcasts;

The triple red/green/blue (RGB) signals delivered by the TV camera had to be transformed into a signal which, on the one hand, could be displayed without major artifacts on current black and white receivers, and, on the other hand, could be transmitted in the bandwidth of an existing TV channel.

The basic idea was to transform, by a linear combination, the three (R, G, B) signals into three

other equivalent components, Y, Cb, Cr (or Y, U, V):

Y=0,299R+0,587G+0,114B

is called the luminance signal,

Cb=0,564(B-Y) or U=0,493(B-Y)

is called the blue chrominance or color difference,

Cr=0,713(R-Y) or V=0,877(R-Y)

is called the red chrominance or color difference.

The combination used for the luminance (or "luma") signal has been chosen to be as similar as

possible to the output signal of a monochrome camera, which allows the black and white receiver to

treat it as a normal monochrome signal. The two chrominance (or "chroma") signals represent the

"coloration" of the monochrome picture carried by the Y signal, and allow, by linear recombination

with Y, the retrieval of the original RGB signals in the color receiver.
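As a small illustration of these linear combinations, the following Python sketch (the helper name is ours, not part of any standard) computes the Y, Cb, Cr components from normalized R, G, B values:

    def rgb_to_ycbcr(r, g, b):
        """Transform (R, G, B), each in 0...1, into luminance plus two
        color difference components as defined above."""
        y = 0.299 * r + 0.587 * g + 0.114 * b   # luminance (luma)
        cb = 0.564 * (b - y)                    # blue color difference
        cr = 0.713 * (r - y)                    # red color difference
        return y, cb, cr

    # Pure white (1, 1, 1) gives Y=1 and zero chrominance, which is why a
    # black and white receiver can treat Y as a normal monochrome signal.
    print(rgb_to_ycbcr(1.0, 1.0, 1.0))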

Studies on visual perception have shown that the human eye's resolution is less acute for color

than for luminance transients. This means, for natural pictures at least, that chrominance signals can

tolerate a strongly reduced bandwidth (one-half to one-quarter of the luminance bandwidth), which

will prove very useful for putting the chrominance signals within the existing video spectrum. The Y,

Cb, Cr combination is the common point to all color TV systems, including the newest digital

standards.

In order to be able to transport these three signals in an existing TV channel (6 MHz in the

United States, 7 or 8 MHz in Europe), a subcarrier was added within the video spectrum, modulated

by the reduced bandwidth chrominance signals, thus giving a new composite signal called the Color

Video Baseband Signal - CVBS (see Figure 1.7).

In order not to disturb the luminance and the black and white receivers, this carrier had to be

placed in the highest part of the video spectrum and had to stay within the limits of the existing video

bandwidth (4,2 MHz in the United States, 5-6 MHz in Europe, see Figure 1.8).


Fig. 1.7. Illustration of a composite color video signal (PAL or NTSC): total line duration 64 µs, visible part 52 µs, suppression 12 µs, synchro pulse 4,7 µs; white level 1,0 V, black level 0,3 V, synchronization level 0 V; burst and chrominance subcarrier shown

Fig. 1.8. Frequency spectrum of the PAL signal: amplitude versus frequency, 0-5,5 MHz, with the chrominance subcarrier at 4,43 MHz and the sound carrier at 5,5 MHz

The differences between the three world standards NTSC, PAL and SECAM mainly concern

the method of color subcarrier modulation and its frequency.

1.4. NTSC

This system uses a subcarrier locked to the line frequency at 3,579545 MHz (=455×Fh/2),

amplitude modulated with a suppressed carrier following two orthogonal axes (quadrature amplitude

modulation, or QAM), by two signals, I (in phase) and Q (quadrature), carrying the chrominance

information. (Here Fh=15,734 kHz – line frequency). These signals are two linear combinations of

(R-Y) and (B-Y), corresponding to a 33° rotation of the vectors relative to the (B-Y) axis. This process

gives a vector (see Figure 1.9), the phase of which represents the tint and the amplitude of which

represents color intensity (saturation).

A reference burst at 3,579545 MHz with a 180° phase relative to the (B-Y) axis superimposed

on the back porch allows the receiver to rebuild the subcarrier required to demodulate I and Q signals.

The choice for the subcarrier of an odd multiple of half the line frequency is such that the luminance

spectrum (made up of discrete stripes centered on multiples of the line frequency) and the

chrominance spectrum (discrete stripes centered on odd multiples of half the line frequency) are

interlaced, making an almost perfect separation theoretically possible by the use of comb filters in the

receiver.
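To make this interleaving concrete, here is a small numeric sketch in Python (assuming, as is standard for color NTSC, the exact line frequency Fh = 4,5 MHz/286 ≈ 15 734,27 Hz; variable names are illustrative):

    # NTSC subcarrier choice: an odd multiple (455) of half the line frequency.
    fh = 4.5e6 / 286            # color NTSC line frequency, Hz
    fsc = 455 * fh / 2          # subcarrier, ~3 579 545 Hz
    print(fsc)

    # Luminance energy clusters at n*fh; the subcarrier falls exactly
    # halfway between two luminance stripes, i.e. fh/2 away from each:
    n = round(fsc / fh)         # nearest luminance harmonic index
    print(abs(n * fh - fsc))    # ~7 867 Hz = fh/2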

Practice, however, soon showed that NTSC was very sensitive to phase rotations introduced by the transmission channel, which resulted in significant tint errors, especially in the region of flesh tones. This made a user-accessible tint correction button necessary on receivers, and led Europeans to look for solutions to this problem, which resulted in the SECAM and PAL systems.


Fig. 1.9. Color plan of the NTSC system: I (φ=123°) and Q (φ=33°) axes against the +(B-Y) and +(R-Y) axes, burst at φ=180°; the vector phase α represents tint and its amplitude represents saturation; hues marked are red, yellow, green, cyan, blue and magenta

1.5. SECAM

This standard eliminates the main drawback of the NTSC system by using frequency

modulation for the subcarrier, which is insensitive to phase rotations; however, FM does not allow

simultaneous modulation of the subcarrier by two signals, as does QAM.

The clever means of circumventing this problem consisted of considering that the color

information of two consecutive lines was sufficiently similar to be considered identical. This reduces

chroma resolution by a factor of 2 in the vertical direction, making it more consistent with the

horizontal resolution resulting from bandwidth reduction of the chroma signals. Therefore, it is

possible to transmit alternately one chrominance component, D′b=1,5(B-Y), on one line and the other,

D′r=-1,9(R-Y), on the next line. It is then up to the receiver to recover the two D′b and D′r signals

simultaneously, which can be done by means of a 64 µs delay line (one line duration) and a

permutation circuit. Subcarrier frequencies chosen are 4,250 MHz (=272×Fh) for the line carrying

D′b and 4,406250 MHz (=282×Fh) for D′r.

This system is very robust and gives very accurate tint reproduction, but it has some drawbacks due to the frequency modulation: the subcarrier is always present, even in non-colored parts of the picture, making it more visible than in NTSC or PAL on black and white receivers; the continuous nature of the FM spectrum does not allow efficient comb filtering; and the rendition of sharp transients between highly saturated colors is not optimum, due to the necessary truncation of the maximum FM deviation. In addition, direct mixing of two or more SECAM signals is not possible.

1.6. PAL

This is a close relative of the NTSC system, whose main drawback it corrects. It uses a line-locked subcarrier at 4,433619 MHz (=(1135/4+1/625)×Fh), which is QAM modulated by the two color

difference signals U=0,493(B-Y) and V=0,877(R-Y). In order to avoid drawbacks due to phase

rotations, the phase of the V carrier is inverted every second line, which allows cancellation of phase

rotations in the receiver by adding the V signal from two consecutive lines by means of a 64 µs delay

line (using the same assumption as in SECAM, that two consecutive lines can be considered

identical). In order to synchronize the V demodulator, the phase of the reference burst is alternated

from line to line between +135° and –135° compared to the U vector (0°). Other features of PAL are

very similar to NTSC.
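As a quick check of the subcarrier arithmetic, a sketch (Fh = 15 625 Hz for the 625-line system):

    # PAL subcarrier: (1135/4 + 1/625) times the line frequency.
    fh = 15625.0                     # 625-line system line frequency, Hz
    fsc = (1135 / 4 + 1 / 625) * fh
    print(fsc)                       # 4 433 618.75 Hz, i.e. ~4,433619 MHz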

On the evolutionary path to fully digital TV systems, there were several projects to enhance and

improve analog television using advanced analog and hybrid analog – digital technologies. Some of

the projects worth mentioning are the Japan Broadcasting Corporation (NHK) HDTV [7] project in

Japan, the Eureka EU 95 Project [8] and PALplus [9] in Europe, and Advanced Compatible


Television in the United States [10]. They provided valuable experience for future Digital TV (DTV)

systems development.

1.7. MAC (multiplexed analog components)

During the 1980s, Europeans attempted to define a common standard for satellite broadcasts,

with the goal of improving picture and sound quality by eliminating drawbacks of composite systems

(cross-color, cross-luminance, reduced bandwidth) and by using digital sound. This resulted in the

MAC systems, with a compatible extension toward HDTV (called HD-MAC).

D2-MAC is the most well-known of these hybrid systems, even if it did not achieve its expected success due to its late introduction and the earlier-than-anticipated development of digital TV. It

replaces frequency division multiplexing of luminance, chrominance, and sound (bandwidth sharing)

of composite standards by a time division multiplexing (time sharing). It is designed to be compatible

with normal (4:3) and wide-screen (16:9) formats and can be considered in some aspects an

intermediate step on the route to all-digital TV signal transmission.

1.8. PALplus

The primary objective of this development was to allow terrestrial transmission of improved

definition 16:9 pictures (on appropriate receivers) in a compatible way with existing 4/3 PAL

receivers. To do this, the PALplus encoder transforms the 576 useful lines of a 16:9 picture into a 4:3

picture in letterbox format (a format often used for the transmission of films on TV, with two

horizontal black stripes above and below the picture). The visible part occupies only 432 lines

(576×3/4) on a 4/3 receiver, and additional information for the PALplus receiver is encoded in the

remaining 144 lines.

The 432-line letterbox picture is obtained by vertical low-pass filtering of the original 576 lines,

and the complementary high-pass filtering is transmitted on the 4,43 MHz subcarrier during the 144

black lines, which permits the PALplus receiver to reconstruct a full-screen 16/9 high resolution

picture.

In order to obtain the maximum bandwidth for luminance (5 MHz) and to reduce cross-color

and cross-luminance, the phase of the subcarrier of the two interlaced lines of consecutive fields is

reversed. This process, known as "colorplus," allows (by means of a frame memory in the receiver)

cancellation of cross-luminance by adding the high part of the spectrum of two consecutive frames,

and reduction of cross-color by subtracting them.

Movement compensation is required to avoid artifacts introduced by the colorplus process on

fast moving objects, which, added to the need for a frame memory, contributes to the relatively high

cost of PALplus receivers. The PALplus system results in a subjective quality equivalent to D2-MAC

on a wide-screen (16:9) receiver in good reception conditions (high signal/noise ratio).

1.9. Development of Digital Television

Development of high definition and advanced television systems proceeded in parallel in the

United States, Europe, and Japan. For various technical, organizational, and political reasons, this has

resulted in multiple sets of DTV standards. Currently, there are three main DTV standard groups:

1. The Advanced Television Systems Committee (ATSC), a North America based DTV

standards organization, which developed the ATSC terrestrial DTV series of standards. In

addition, the North American digital cable TV standards now in use were developed

separately, based on work done by Cable Television Laboratories (CableLabs) and largely

codified by the Society of Cable Telecommunications Engineers (SCTE).

2. The Digital Video Broadcasting (DVB) Project, a European based standards organization,

which developed the DVB series of DTV standards, standardized by the European

Telecommunication Standard Institute (ETSI).


3. The ISDB standards, a series of DTV standards developed and standardized by the

Association of Radio Industries and Business (ARIB) and by the Japan Cable Television

Engineering Association (JCTEA).

China has developed another terrestrial DTV standard called Digital Terrestrial Multimedia

Broadcasting (DTMB). The DTMB standard has been adopted in the People's Republic of China,

including Hong Kong and Macau.

The DVB Project is the focal point of the development of DTV for many countries around the

world [11]. The start of DVB in 1993 is a direct consequence of the experience that a number of

European partners had gained in the PALplus project in the late 1980s and the early 1990s.

At the time of the first activities on DVB in Europe, the Moving Picture Experts Group

(MPEG) was already working on a set of specifications for the source coding of video and audio

signals and had already started the design of the respective systems level (MPEG-2). The proposed

system known as MPEG Audio for the source coding of audio signals, in mono and stereo, was

already in the final phase of standardization. The DVB Project decided that its technological solution for DTV should be based on the MPEG standards.

The first complete system specification was the recommendation for satellite transmission

(DVB-S) adopted in November 1993 [11]. Somewhat later it became the European Telecommunications

Standard ETS 300 421. In January 1994, the specification for DVB distribution via cable (DVB-C –

ETS 300 429) followed and since then numerous other specifications like DVB-T and DVB-H have

been developed and adopted.

For many years, the DVB Project developed specifications addressing the technology of

broadcast networks in order to transport digitized audio and video to TV receivers, set-top boxes, and

high-fidelity audio receivers. Later, specifications for interactive channels were added. The

appearance of the multimedia home platform (MHP) indicated another significant event, since with

MHP software applications can be run on all sorts of terminal devices. Among other things, MHP can

be embedded in non-DVB digital broadcasting environments, like US cable (OCAP), US terrestrial

(ACAP) and Japanese terrestrial DTV. Another significant new area for DVB is the development of

specifications for the transport of “DVB content” over telecommunications networks (both fixed and

mobile).

DVB systems distribute data using a variety of approaches, including:

Satellite: DVB-S, DVB-S2 and DVB-SH; DVB-SMATV for distribution via SMATV;

Cable: DVB-C, DVB-C2;

Terrestrial television: DVB-T, DVB-T2;

Digital terrestrial television for handhelds: DVB-H, DVB-SH;

Microwave: using DTT (DVB-MT), the MMDS (DVB-MC), and/or MVDS standards (DVB-MS).

Besides digital audio and digital video transmission, DVB also defines data connections (DVB-

DATA - EN 301 192) with return channels (DVB-RC) for several media (DECT, GSM, PSTN/ISDN,

satellite etc.) and protocols (DVB-IPTV: Internet Protocol; DVB-NPI: network protocol

independent).

The conditional access system (DVB-CA) defines a Common Scrambling Algorithm (DVB-

CSA) and a physical Common Interface (DVB-CI) for accessing scrambled content. DVB-CA

providers develop their wholly proprietary conditional access systems with reference to these

specifications. Multiple simultaneous CA systems can be assigned to a scrambled DVB program

stream providing operational and commercial flexibility for the service provider.

DVB is also developing a Content Protection and Copy Management system for protecting

content after it has been received (DVB-CPCM), which is intended to allow flexible use of recorded

content on a home network or beyond, while preventing unconstrained sharing on the Internet.


Older technologies such as teletext (DVB-TXT) and vertical blanking interval data (DVB-VBI)

are also supported by the standards to ease conversion. However, for many applications more

advanced alternatives like DVB-SUB for subtitling are available.

The DVB Project is still extremely active in developing new solutions for the ever changing

world of the electronic media.


2. HUMAN EYE STRUCTURE, COLOR VISION and COLOR MODELS

2.1. Parts of the Human Eye

The eye is a slightly asymmetrical globe, about 2,5 cm in diameter. Simplified structure of the

eye is depicted in Figure 2.1. The front part of the eye (the part we see in the mirror) includes [12]:

• Iris is pigmented (colored) membrane of the eye. The iris is located between the cornea

and the lens. Its color varies from pale blue to dark brown. Multiple genes inherited from

each parent determine a person’s eye color. It consists of an inner ring of circular muscle

and an outer layer of radial muscle. Its function is to help control the amount of light

entering the eye so that:

too much light does not enter the eye which would damage the retina,

enough light enters to allow a person to see;

• Cornea is transparent "front window" of the eye. It is a thick, nearly circular structure

covering the lens and the iris. The cornea is an important part of the focusing system of

the eye;

• Pupil is round black hole in the center of the iris. The size of the pupil changes

automatically to control the amount of light entering the eye;

• Sclera is an opaque, fibrous, protective outer structure. Part of the white sclera can be seen

in the front of the eye. It is soft connective tissue, and the spherical shape of the eye is

maintained by the pressure of the liquid inside. It provides attachment surfaces for eye

muscles;

• Conjunctiva is a thin layer of tissue covering the front of the eye, except the cornea. It is

a thin protective covering of epithelial cells. It protects the cornea against damage by

friction. Tears from the tear glands help this process by lubricating the surface of the

conjunctiva.

Fig. 2.1. Simplified structure of the human eye: cornea, pupil, iris, lens, sclera, retina, macula, fovea and optic nerve

Other parts of the eye:

• Lens is a transparent biconvex structure located behind the iris. It focuses light rays

entering through the pupil in order to form an image on the retina;

• Retina is a thin multi-layered membrane which lines the inside back two-thirds of the eye.

It is composed of millions of visual cells and it is connected by the optic nerve to the brain.

The retina receives light and sends electrical impulses to the brain that result in sight. The

light sensitive retina consists of four major layers: the outer neural layer, containing nerve

cells and blood vessels; the photoreceptor layer, a single layer that contains the light

sensing rods and cones; the retinal pigment epithelium (RPE) and the choroid, consisting

of connective tissue and capillaries;


• Macula is an area of the eye near the center of the retina where visual perception is most

acute. Macula contains the fovea, a small depression or pit at the center of the macula that

gives the clearest vision. The macula is responsible for the sharp, straight-ahead vision that

is used for seeing fine detail, reading, driving, and recognizing faces. It is one hundred

times more sensitive to detail than the peripheral retina.

• Fovea is the area on the retina where we have the best spatial and color vision. When we

look at, or fixate, an object in our visual field, we move our head and eyes such that the

image of the object falls on the fovea. The fovea covers an area that subtends about two

degrees of visual angle in the central field of vision.

• Optic nerve is cable-like structure composed of thousands of nerve fibers that connect the

macula and retina with the brain. The optic nerve carries electrical impulses from the

macula and retina to the processing center of the brain where they are interpreted into clear,

colorful sight.

In a number of ways, the human eye works much like a digital TV camera:

• Light is focused primarily by the cornea – the clear front surface of the eye, which acts like

a camera lens;

• The iris of the eye functions like the diaphragm of a camera, controlling the amount of light

reaching the back of the eye by automatically adjusting the size of the pupil (aperture);

• The eye's crystalline lens is located directly behind the pupil and further focuses light.

Through a process called accommodation, this lens helps the eye automatically focus on

near and approaching objects, like an autofocus camera lens;

• Light focused by the cornea and crystalline lens (and limited by the iris and pupil) then

reaches the retina – the light-sensitive inner lining of the back of the eye. The retina acts

like an electronic image sensor of a digital camera, converting optical images into

electronic signals. The optic nerve then transmits these signals to the visual cortex – the

part of the brain that controls our sense of sight.

Other parts of the human eye play a supporting role in the main activity of sight:

• Some carry fluids (such as tears and blood) to lubricate or nourish the eye;

• Others are muscles that allow the eye to move;

• Some parts protect the eye from injury (such as the lids and the epithelium of the cornea);

• And some are messengers, sending sensory information to the brain (such as the pain-

sensing nerves in the cornea and the optic nerve behind the retina).

2.2. Color Vision

The eye is often compared to a digital TV camera that is self-focusing, has a self-cleaning lens

and has its images processed by a computer with millions of CPUs [12]. When our eye sees, light

from the outside world is focused by the lens onto the retina, a simplified structure of which is presented in Figure 2.2.

Fig. 2.2. Simplified structure of retina: rods, cones, horizontal cells, bipolar cells, amacrine cells, retinal ganglion cells and the nerve fiber layer, with light entering from the ganglion-cell side

There, it is absorbed by pigments in light-sensitive cells, called rods and cones. Higher primates,

including humans, have three different types of cones. Many animals have two different types, but

some, like birds, have five or more.

There are approximately 6 million cones in our retina, and they are sensitive to a wide range of

brightness [12, 13]. The three different types of cones, namely: S-cones, M-cones and L-cones, are

sensitive to short, medium and long wavelengths, respectively, as shown in Figure 2.3 below (plotted

with individual normalizations). The graph in the Figure 2.3 shows the sensitivity of the different

cones to varying wavelengths. The graph shows how the response varies by wavelength for each kind

of receptor. For example, the medium cone is more sensitive to pure green wavelengths than to red

wavelengths.

Fig. 2.3. The sensitivity of the different cones (S, M, L) and rods (R) to varying wavelengths: relative spectral absorbance (0-1,0) versus wavelength (400-700 nm)

The peak sensitivities are provided by three different photo-pigments. Light at any wavelength

in the visual spectrum (ranging from 400 to 700 nm) will excite one or more of these three types of

sensors. Our mind determines the color by comparing the different signals each cone senses [12, 13].

Our ability to perceive color depends upon comparisons of the outputs of the three cone types, each

with different spectral sensitivity.

The color yellow, for example, is perceived when the L cones are stimulated slightly more than

the M cones, and the color red is perceived when the L cones are stimulated significantly more than

the M cones. Similarly, blue and violet hues are perceived when the S receptor is stimulated more

than the other two.

Additionally, we have approximately 125 million rods on the retina, which are used only in dim

light, and are monochromatic – black and white. Rod cells function as specialized neurons that

convert visual stimuli into chemical and electrical stimuli that can be processed by the central nervous

system. Rod cells are stimulated by light over a wide range of intensities and are responsible for

perceiving the size, shape, and brightness of visual images. They do not perceive color and fine detail; these tasks are performed by the other major type of light-sensitive cell, the cone. Rod cells are much

more sensitive (about 100 times) to light than cones and are also much more numerous. Rods are

most sensitive to wavelengths of light around 498 nm (blue-green), and insensitive to wavelengths

longer than about 640 nm (red). (See the dashed black curve in Figure 2.3). This fact means that as

intensity dims at twilight, the rods take over, and before color disappears completely, peak sensitivity

of vision shifts towards the rods' peak sensitivity (blue-green).

The rods and cones line the back of the retina. In front of them are three types of nerve cells:

bipolar cells, horizontal cells and amacrine cells [12, 13]. Bipolar cells receive input from the cones,

and many feed into the retinal ganglion cells. Horizontal cells link cones and bipolar cells, and

amacrine cells, in turn, link bipolar cells and retinal ganglion cells. There are about 1 million ganglion

cells in the eye. The purpose of the ganglion cells is not fully known, but they are involved in color

vision. Ganglion cells compare signals from many different cones. The ganglion cells add and

subtract signals from many cones and perform other mathematical operations (multiply, divide) in

addition to the amplification, gain control, and nonlinearities that can occur within the neural cells.

Thus the network of cells within the retina can serve as a sophisticated image computer. This is how

the information from 130 million photoreceptors can be reduced to signals in approximately one

million ganglion cells without loss of visually meaningful data. For example, by comparing the

response of the middle-wavelength and long-wavelength cones, a ganglion cell determines the

amount of green-or-red. Moreover, they are excited in the middle of the field, and inhibited in the

surrounding field, which makes them particularly sensitive to edges.

The result of these steps for color vision is a signal that is sent to the brain. There are three

signals, corresponding to the three color attributes. These are:

• the amount of green-or-red;

• the amount of blue-or-yellow;

• the brightness.

Another important feature about the three cone types is their relative distribution in the retina.

It turns out that the S cones are relatively sparsely populated throughout the retina and completely

absent in the most central area of the fovea. There are far more L and M cones than S cones and there

are approximately twice as many L cones as M cones. The relative populations of the L:M:S cones

are approximately 12:6:1 (with reasonable estimates as high as 40:20:1). These relative populations

must be considered when combining the cone responses (see curves in Figure 2.3) to predict higher

level visual responses.

Humans perceive color just as they perceive taste. When they eat, their taste buds sense four attributes: sweet, salty, sour and bitter. Similarly, when humans look at a scene, their visual nerves register color in terms of the attributes of color: the amount of green-or-red; the amount of

blue-or-yellow; and the brightness.

Note that these attributes are opposites, like hot and cold. Color nerves sense green or red – but

never both; and blue or yellow – but never both. Thus, humans never see bluish-yellows or

reddish-greens. The opposition of these colors forms the basis of color vision.

A great deal of research has led to the development of the modern opponent theory of color vision

(sometimes called a stage theory) as illustrated in Figure 2.4. Figure 2.4 illustrates that the first stage

of color vision, the receptors, is trichromatic [12]. However the three ‘color-separation’ images are

not transmitted directly to the brain. Instead the neurons of the retina (and perhaps higher levels)

encode the color into opponent signals. The outputs of all three cone types are summed (L+M+S) to

produce an achromatic response. The summation is taken in proportion to the relative populations of

the three cone types. Differencing of the cone signals allows construction of red-green (L−M+S) and

yellow-blue (L+M−S) opponent signals. The transformation from LMS signals to the opponent signals

serves to decorrelate the color information carried in the three channels, thus allowing more

efficient signal transmission and reducing difficulties with noise. The three opponent pathways also

have distinct spatial and temporal characteristics that are important for predicting color appearance.


The importance of the transformation from trichromatic to opponent signals for color appearance is

reflected in the prominent place that it finds within the formulation of all color appearance models.

Figure 2.4 includes not only a schematic diagram of the neural ‘wiring’ that produces opponent

responses, but also the relative spectral responsivity of these mechanisms both before and after

opponent encoding.
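A minimal sketch of the opponent encoding described above (equal cone weights are used here for simplicity, whereas the text notes that the summation is taken in proportion to the relative cone populations):

    def opponent_encode(l, m, s):
        """Combine L, M, S cone responses into one achromatic and two
        opponent signals, as in the stage theory described above."""
        a = l + m + s     # achromatic (brightness) response
        rg = l - m + s    # red-green opponent signal
        yb = l + m - s    # yellow-blue opponent signal
        return a, rg, yb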

Retinal ganglion cells vary significantly in terms of their size, connections, and responses to

visual stimulation but they all share the defining property of having a long nerve fiber that extends

into the brain and typically conducts electrical impulses away from the neuron's cell body. These nerve

fibers form the optic nerve. A small percentage of retinal ganglion cells contribute little or nothing to

vision, but are photosensitive; their nerve fibers participate in the pupil resizing process.

Fig. 2.4. Illustration of the encoding of cone signals into opponent color signals in the human visual system: A – achromatic (brightness – luminance), R-G – red-green, Y-B – yellow-blue; shown together with the relative spectral absorbances of the S, M and L cones and rods (R) and the resulting channel responses versus wavelength (400-700 nm)


2.3. Development of Color Models

The search for an understanding of exactly what color is and how it functions has been going on

for hundreds of years. A color model is an orderly system for creating a whole range of colors from

a small set of primary colors. Numerous models and systems have been developed.

The first known studies of color were done in ancient Greece by Aristotle. According to him

color exists in the form of rays sent down from the heavens by God [14].

More sophisticated color systems were developed by Aguilonius and Sigfrid Forsius in the

Renaissance time. Aguilonius was the first who attempted to define all colors on the basis of his own

observations of the changing color of the sky from dawn to dusk [14].

In 1660, Sir Isaac Newton developed a more logical color order based on his scientific observations from experiments using a prism [14, 22]. Newton demonstrated that white light could be

broken down into the colors of the rainbow (red, orange, yellow, green, blue, indigo and violet). He

also discovered that when the light from three separate parts of his rainbow, the red, green, and blue regions, was recombined, it would regenerate white light. He called these the primary colors.

When any two of these were combined, secondary colors were formed. When he combined blue and

green light, he observed light the color of cyan. Green and red light mixed to give yellow light. In

both of these cases, Newton apparently regenerated light in another portion of the natural spectrum.

But when he combined red and blue light from his prisms, Newton observed a colored light, magenta,

that was not found in the natural visible spectrum. Newton organized his findings in a color wheel

presented in Figure 2.5, showing the three "primary colors" – red, green, and blue – separated by the three "secondary colors" – yellow, cyan, and magenta.

Fig. 2.5. Newton's color wheel: the primaries red, green and blue separated by the secondaries yellow, cyan and magenta

Fig. 2.6. Goethe's color triangle: primary (I), secondary (II) and tertiary (III) subdivisions

Fig. 2.7. The CIE 1931 xy chromaticity space: y versus x (0,0-0,8), with the R, G and B primaries marked

The next big jump in color theory did not come until the early 1800s, when Johann Wolfgang Goethe challenged Newton's ideas and created his own color system [14, 22]. Newton's and Goethe's approaches were very different. Newton's studies in color were scientifically based, while Goethe's interest was more in the psychological effects of color. He wished to investigate whether rules could

be found to govern the artistic use of color. Originally he planned on creating an improved color

wheel, but later Goethe found his ideas were best expressed within an equilateral triangle presented

in Figure 2.6. In Goethe's original triangle the three primaries red, yellow, and blue are arranged at

the vertices of the triangle. The other subdivisions of the triangle are grouped into secondary and

tertiary triangles, where the secondary triangle colors represent the mix of the two primary triangles

to either side of it, and the tertiary triangle colors represent the mix of the primary triangle adjacent

to it and the secondary triangle directly across from it.

Also around this time Phillip Otto Runge developed the first three dimensional color model in

the form of a sphere [14, 22]. His theory was revolutionary at that time, and it attempted to arrange

colors based on hue (red, cyan, orange, etc.), whiteness, and blackness.

In 1872 a Scottish physicist, Sir James Clerk Maxwell, developed a chart in the form of an

equilateral triangle from his studies of the electromagnetic theory of light [23]. His triangle is very


similar to Goethe's, both are equilateral and both choose three primaries which are combined to

produce the inner colors. Maxwell, however, believed that he could produce all the known colors within his triangle, and he chose red, green, and blue as primaries.

In 1802, Thomas Young, an English physician, first demonstrated that he could generate any

colors that could be seen by mixing differing proportions of the three primary colors of light [24]. For

example, you could mix two parts of red light with one part of green light to get an orange color.

Using more green light than red light, you saw a yellow-green light. Young took his observations a

step further: he hypothesized that the human eye perceives only Newton's three primary colors, red,

green, and blue, and that the eye perceived all of the variations in color by combining these internally.

When both red and blue light but no green light enter your eye, you "see" magenta even though the light is not magenta. A combination of red and green gives the perception of yellow, while our eyes turn blue and green light into cyan.

Young's work had the form of a hypothesis. It remained for the physiological psychologist

Hermann von Helmholtz, half a century later, to postulate the existence of three types of color receptors,

called cones, in the human eye that are stimulated by broad regions of the visible spectrum. Red light

in one of these regions stimulates one type of cone, green light from the middle region stimulates a second type of cone, and blue light in the final region stimulates the remaining cone. The relative degree of stimulation of these cones gives us the perception of all of the colors that we see. We perceive

sunlight as "white" because radiate from each of the three of the parts of the visible spectrum (red,

green, and blue) stimulate the three cones in our eyes. If an object reflects red and green light but not

blue light, my eyes will see it as yellow. If a second object reflects just red and blue light, my eyes

will see it as magenta. Yet another object reflecting just blue and green light appears cyan colored.

In 1931, an attempt was made to establish a world standard for the measurement of color by the

Commission Internationale de l'Eclairage (CIE). They generated a version of Maxwell's triangle,

choosing a particular red, green, and blue from which to generate all the colors. The result became

known as the 1931 CIE Chromaticity Diagram (see [15]) presented in Figure 2.7, also showing the

chromaticities of black-body light sources of various temperatures, and lines of constant correlated

color temperature. The updated versions (1960 CIE uv Chromaticity Diagram, 1976 CIE u'v'

Chromaticity Diagram) of this chart are used to measure and quantify the light produced by computer phosphor guns today. The 1931 CIE xy system characterizes colors by a luminance parameter Y and

two color coordinates x and y which specify the point on the chromaticity diagram. This system offers

high precision in color measurement because the parameters are based on the spectral power

distribution (SPD) of the light emitted from a colored object and are factored by sensitivity curves

which have been measured for the human eye. Based on the fact that the human eye has three different

types of color sensitive cones, the response of the eye is best described in terms of three "tristimulus

values". However, once this is accomplished, it is found that any color can be expressed in terms of

the two color coordinates x and y. The colors which can be matched by combining a given set of three

primary colors (such as the blue, green, and red of a color television screen) are represented on the

chromaticity diagram by a triangle joining the coordinates for the three colors.

According to the current level of understanding of color science there are two types of color

models, those that are subtractive and those that are additive [16, 17]. Additive color models use light

to display color while subtractive models use printing inks. Colors perceived in additive models are

the result of transmitted light. Colors perceived in subtractive models are the result of reflected light.

With the advent of the computer age, many attempts have been made to create an ideal color

space model based on the red, green, and blue primaries of the computer screen. There are several

established color models used in computer graphics, but the two most common are the RGB model

(Red-Green-Blue) for computer display and the CMYK model (Cyan-Magenta-Yellow-Black) for

printing. The simplest model is the RGB cube, with corners of black, the three primaries (red, green,

blue), the three secondary mixes (cyan, magenta, yellow), and white.


Generally speaking the purpose of a color model is to facilitate the specification of colors in

some standard generally accepted way. In essence, a color model is a specification of a 3-D coordinate

system and a subspace within that system where each color is represented by a single point.

Each industry that uses color employs the most suitable color model. For example, the RGB

color model is used in computer graphics, YUV or YCbCr are used in video systems, PhotoYCC is

used in PhotoCD production and so on. Transferring color information from one industry to another

requires transformation from one set of values to another.

Each color model has its own range of colors that can be displayed or printed. Each color model

is limited to only a portion of the visible spectrum. Since a color model has a particular range of

available color, it is referred to as using a "color space". An image or vector graphic is said to use

either the RGB color space or the CMYK color space (or the color space of another color model).

Notice the centers of the two color charts below in Figure 2.8. In the RGB model, the

convergence of the three primary additive colors produces white. In the CMYK model, the

convergence of the three primary subtractive colors produces black. The overlapping of additive

colors (red, green and blue) in the RGB model results in subtractive colors (cyan, magenta and

yellow). The overlapping of subtractive colors (cyan, magenta and yellow) in the CMYK model

results in additive colors (red, green and blue).

The colors in the RGB model are much brighter than the colors in the CMYK model. It is

possible to attain a much larger percentage of the visible spectrum with the RGB model. That is

because the RGB model uses transmitted light while the CMYK model uses reflected light. The muted

appearance of the CMYK model demonstrates the limitation of printing inks and the nature of

reflected light. The colors in this chart appear muted because they are displayed within their printable

range of colors.

Since additive color models display color as a result of light being transmitted (added) the total

absence of light would be perceived as black. Subtractive color models display color as a result of

light being absorbed (subtracted) by the printing inks. As more ink is added, less light is reflected.

Where there is a total absence of ink the resulting light being reflected from a white surface would be

perceived as white.

Fig. 2.8. Primary and secondary colors for RGB and CMYK models; (a) – RGB chart: red, green and blue overlapping to give yellow, cyan, magenta and white, (b) – CMYK chart: cyan, magenta and yellow overlapping to give red, green, blue and black

2.4. RGB Color Model

The RGB color model is an additive color model. The color subspace of interest is a cube shown

in Figure 2.9 (RGB values are normalized to 0…1), in which RGB values are at three corners; cyan,

magenta, and yellow are the three other corners; black is at the origin; and white is at the corner

farthest from the origin. In this case red, green and blue light are added together in various

combinations to reproduce a wide spectrum of colors.

The main purpose of the RGB color model is for the sensing, representation, and display of

images in electronic systems, such as televisions and computers, though it has also been used in


conventional photography. Before the electronic age, the RGB color model already had a solid theory

behind it, based in human perception of colors.

RGB is a device-dependent color model: different devices detect or reproduce a given RGB

value differently, since the color elements (such as phosphors or dyes) and their response to the

individual R, G, and B levels vary from manufacturer to manufacturer, or even in the same device

over time. Thus an RGB value does not define the same color across devices without some kind of

color management.

Typical RGB input devices are color TV and video cameras, image scanners, and digital

cameras. Typical RGB output devices are TV sets of various technologies (CRT, LCD, plasma, etc.),

computer and mobile phone displays, video projectors, multicolor LED displays, and large screens

such as JumboTron. Color printers, on the other hand, are not RGB devices, but subtractive color

devices (typically CMYK color model).

In order to create a color with RGB, three colored light beams (one red, one green, and one

blue) must be superimposed. With no intensity, each of the three colors is perceived as black, while

full intensity leads to a perception of seeing white. Differing intensities produce the hue of a color,

while the difference between the most and least intense of the colors make the resulting color more

or less saturated.

The RGB model forms its range of colors from the primary additive colors of red, green and

blue. When red, green and blue light is combined it forms white. Computers generally display RGB

using 24-bit color. In the 24-bit RGB color model there are 256 variations for each of the additive

colors of red, green and blue. Therefore there are 16777216 possible colors (256 reds x 256 greens x

256 blues) in the 24-bit RGB color model. In the RGB color model, colors are represented by varying

intensities of red, green and blue light. The intensity of each of the red, green and blue components is represented on a scale from 0 to 255, with 0 being the least intensity (no light emitted) and 255

(maximum intensity). For example in the above RGB chart the magenta color would be R=255, G=0,

B=255. Black would be R=0, G=0, B=0 (a total absence of light).
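A short sketch of how 24-bit RGB values are commonly packed, 8 bits per component (the function name is illustrative):

    def pack_rgb(r, g, b):
        """Pack three 0...255 intensities into a single 24-bit integer."""
        return (r << 16) | (g << 8) | b

    # Magenta from the RGB chart above: R=255, G=0, B=255 -> 0xff00ff.
    print(hex(pack_rgb(255, 0, 255)))
    print(256 ** 3)   # 16 777 216 possible colors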

For web-page design the colors used are commonly specified using RGB. Today, with the predominance of 24-bit displays, most users can see the full 16,7 million colors of HTML RGB code. In web page design, there are 216 so-called ‘web-safe’ RGB colors represented by hexadecimal values. Quite simply, the web-safe color palette consists of the 216 combinations (six levels each) of red, green and blue.

2.5. CMYK Color Model

The CMYK color model is a subset of the RGB model and is primarily used in color print

production [17]. CMYK is an acronym for cyan, magenta, and yellow along with black (noted as K).

The CMYK color space is subtractive, meaning that cyan, magenta, yellow, and black pigments or inks are applied to a white surface to subtract some color from the white surface to create the final color.

For example, see Figure 2.8, cyan is white minus red, magenta is white minus green, and yellow is

white minus blue. Subtracting all colors by combining the CMY at full saturation should, in theory,

render black. However, impurities in the existing CMY inks make full and equal saturation

impossible, and some RGB light does filter through, rendering a muddy brown color. Therefore, the

black ink is added to CMY. The CMY cube is shown in Figure 2.9, in which CMY values are at three

corners: red, green, and blue are the three other corners, white is at the origin; and black is at the

corner farthest from the origin.

It is frequently suggested that the ‘K’ in CMYK comes from the last letter in ‘black’ and was

chosen because B already refers to blue. However, this explanation is incorrect. The ‘K’ in CMYK

stands for ‘key’, since in four-color printing the cyan, magenta, and yellow printing plates are carefully keyed, or aligned, with the black key plate. Black is used because the combination of the


three primary colors (CMY) doesn’t produce a fully saturated black. This is evident in the central

black color created by the overlapping circles in the color chart above.

CMYK is able to produce the entire spectrum of visible colors due to the process of half-toning.

In this process, each color is assigned a saturation level and minuscule dots of each of the three colors

are printed in tiny patterns. This enables the human eye to perceive a specific color made from the

combination. In order to improve print quality and reduce moiré patterns, the screen for each color is

set at a different angle.

In the CMYK color model, colors are represented as percentages of cyan, magenta, yellow and

black. For example in the above CMYK chart (see Figure 2.8) the red color is composed of 14% cyan,

100% magenta, 99% yellow and 3% black. White would be 0% cyan, 0% magenta, 0% yellow and

0% black (a total absence of ink on white paper).
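As an illustration, here is a common naive RGB-to-CMYK mapping in Python (ignoring ink impurities and color management, so only a rough sketch):

    def rgb_to_cmyk(r, g, b):
        """Map RGB intensities (0...255) to CMYK percentages (0...100)."""
        c, m, y = 1 - r / 255, 1 - g / 255, 1 - b / 255
        k = min(c, m, y)              # move the common gray into black ink
        if k == 1:                    # pure black: avoid dividing by zero
            return 0.0, 0.0, 0.0, 100.0
        s = 100 / (1 - k)
        return (c - k) * s, (m - k) * s, (y - k) * s, k * 100

    # White paper needs no ink at all:
    print(rgb_to_cmyk(255, 255, 255))   # (0.0, 0.0, 0.0, 0.0)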

Fig. 2.9. RGB and CMY color 3-D models; (a) – RGB model: a cube on R, G, B axes with red (1,0,0), green (0,1,0) and blue (0,0,1) corners, black at the origin, white at the opposite corner, cyan, magenta and yellow at the remaining corners and the gray scale along the diagonal; (b) – CMY model: a cube on C, M, Y axes with cyan (1,0,0), magenta (0,1,0) and yellow (0,0,1) corners, white at the origin and black at the opposite corner

2.6. Gamma Correction

The luminance intensity generated by most displays is not a linear function of the applied signal but is proportional to some power (referred to as gamma) of the signal voltage [17]. As a result, high intensity ranges are expanded and low intensity ranges are compressed. This nonlinearity must be compensated to achieve correct color reproduction. To do this, the luminance of each of the linear red, green, and blue components is reduced to a nonlinear form using an inverse transformation. This process is called "gamma correction".

Usually, to convert an RGB image to a gamma-corrected R'G'B' image, the following basic equations are used:

R' = 4,5·R,  G' = 4,5·G,  B' = 4,5·B,   for R, G, B < 0,018;

R' = 1,099·R^0,45 − 0,099,  G' = 1,099·G^0,45 − 0,099,  B' = 1,099·B^0,45 − 0,099,   for R, G, B ≥ 0,018.

The channel intensity values are normalized to fit in the range [0…1]. The gamma value is equal to 1/0,45 = 2,22, in conformity with the ITU-R Recommendation BT.709 specification [18].
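To make the piecewise definition concrete, here is a minimal Python sketch of the transfer characteristic above (NumPy-based; the function name is ours):

```python
import numpy as np

def gamma_correct(c):
    """Apply the BT.709-style transfer function to linear channel
    values normalized to [0...1]; works on scalars or arrays."""
    c = np.asarray(c, dtype=float)
    return np.where(c < 0.018, 4.5 * c, 1.099 * c ** 0.45 - 0.099)

# Mid-gray 0.5 maps to about 0.71; 1.0 maps exactly to 1.0.
print(gamma_correct([0.0, 0.018, 0.5, 1.0]))
```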

2.7. YUV Color Model


The YUV color model is the basic color model used in analogue color TV broadcasting. YUV is essentially a re-coding of RGB for transmission efficiency (minimizing bandwidth) and for downward compatibility with black-and-white television. The YUV color space is "derived" from the RGB space. It comprises the luminance (Y) and two color-difference (U, V) components. The luminance can be computed as a weighted sum of the red, green and blue components; the color-difference, or chrominance, components are formed by subtracting luminance from blue and from red.

The principal advantage of the YUV model in image processing is decoupling of luminance

and color information. The importance of this decoupling is that the luminance component of an

image can be processed without affecting its color component. For example, the histogram

equalization of the color image in the YUV format may be performed simply by applying histogram

equalization to its Y component.

There are many combinations of YUV values from nominal ranges that result in invalid RGB

values, because the possible RGB colors occupy only part of the YUV space limited by these ranges.

The Y'U'V' notation means that the components are derived from gamma-corrected R'G'B'. A weighted sum of these non-linear components forms a signal representative of luminance that is called luma, Y'. Luma is often loosely referred to as luminance, so you need to be careful to determine whether a particular author assigns a linear or non-linear interpretation to the term luminance.

Conversion between the gamma-corrected R'G'B' and Y'U'V' models is possible using the following basic equations [19]:

Y' = 0,299·R' + 0,587·G' + 0,114·B',
U' = −0,147·R' − 0,289·G' + 0,436·B' = 0,492·(B' − Y'),
V' = 0,615·R' − 0,515·G' − 0,100·B' = 0,877·(R' − Y'),

R' = Y' + 1,140·V',
G' = Y' − 0,394·U' − 0,581·V',
B' = Y' + 2,032·U'.
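Since the forward transform is just a 3×3 matrix applied to the R'G'B' vector, the conversion is conveniently written in code. A minimal Python sketch using the coefficients above, for components normalized to [0…1] (names are illustrative):

```python
import numpy as np

# Rows: the Y', U', V' coefficients from the equations above.
RGB_TO_YUV = np.array([[ 0.299,  0.587,  0.114],
                       [-0.147, -0.289,  0.436],
                       [ 0.615, -0.515, -0.100]])

def rgb_to_yuv(rgb):
    return RGB_TO_YUV @ np.asarray(rgb, dtype=float)

def yuv_to_rgb(y, u, v):
    return np.array([y + 1.140 * v,
                     y - 0.394 * u - 0.581 * v,
                     y + 2.032 * u])

print(rgb_to_yuv([1.0, 1.0, 1.0]))   # white: Y' = 1, U' = V' = 0
```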

2.8. YCbCr Color Model

The YCbCr color space is used for component digital video and was developed as part of ITU-R Recommendation BT.601 [20]. YCbCr is a scaled and offset version of the YUV color space. The following basic equations [19] are used for conversion between R'G'B' in the range 0–255 and Y'Cb'Cr' (this notation means that all components are derived from gamma-corrected R'G'B'):

Y' = 0,257·R' + 0,504·G' + 0,098·B' + 16,
Cb' = −0,148·R' − 0,291·G' + 0,439·B' + 128,
Cr' = 0,439·R' − 0,368·G' − 0,071·B' + 128,

R' = 1,164·(Y' − 16) + 1,596·(Cr' − 128),
G' = 1,164·(Y' − 16) − 0,813·(Cr' − 128) − 0,392·(Cb' − 128),
B' = 1,164·(Y' − 16) + 2,017·(Cb' − 128).
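As a quick numerical check, take full white, R' = G' = B' = 255. The luma coefficients sum to 0,257 + 0,504 + 0,098 = 0,859, so Y' = 0,859·255 + 16 ≈ 235, while the chroma coefficients in the Cb' and Cr' equations each sum to zero, so Cb' = Cr' = 128 – the nominal BT.601 values for white in the 'studio' range.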

2.9. YCoCg Color Model

The YCoCg color model was developed to increase the effectiveness of image compression [21]. It comprises the luminance (Y) and two color-difference components (Co – offset orange, Cg – offset green).

The basic equations [21] for conversion between RGB and YCoCg are:

Y = R/4 + G/2 + B/4,
Co = R/2 − B/2,
Cg = −R/4 + G/2 − B/4,

R = Y + Co − Cg,
G = Y + Cg,
B = Y − Co − Cg.
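Because the inverse needs only additions and subtractions, a round trip through YCoCg is essentially exact, which is one reason the model suits compression. A minimal Python sketch of the equations above:

```python
def rgb_to_ycocg(r, g, b):
    y  =  r / 4 + g / 2 + b / 4
    co =  r / 2 - b / 2
    cg = -r / 4 + g / 2 - b / 4
    return y, co, cg

def ycocg_to_rgb(y, co, cg):
    return y + co - cg, y + cg, y - co - cg

# Round trip recovers the input (up to floating-point rounding):
print(ycocg_to_rgb(*rgb_to_ycocg(0.2, 0.7, 0.4)))   # ~(0.2, 0.7, 0.4)
```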

The possible RGB colors occupy only part of the YUV, YCbCr, or YCoCg color spaces limited by the nominal ranges. Therefore there are many YUV, YCbCr, or YCoCg combinations that result in invalid RGB values.


HUMAN HEARING, PERCEPTION and SPEECH PRODUCTION

3.1 Human Ear Structure

The human ear is a particularly complex organ, especially since the information from the two ears is combined in a sophisticated neural network, the human brain. There are, however, many poorly understood phenomena and astonishing effects related to human hearing.

Figures 3.1 and 3.2 illustrate the major structures and phenomena taking place in the human ear

[25–28].

Fig. 3.1. Cross-section of the human ear

Fig. 3.2. Functional diagram of the human ear

The outer ear is composed of two parts: the visible earlobe and cartilage attached to the side of the head, and the ear canal, a tube about 0,5 cm in diameter extending about 3 cm into the head. These structures direct environmental sounds to the sensitive middle and inner ear organs located safely inside the skull bones. The ear canal ends with a thin sheet of tissue named the tympanic membrane, or ear drum. Sound waves striking the tympanic membrane force it to vibrate. The middle ear is a set of small bones that transfer this vibration to the cochlea (inner ear), where it is converted to neural impulses. The cochlea is a liquid-filled tube roughly 2 mm in diameter and 3 cm in length. Although shown straight in Figure 3.2, the cochlea is really rolled up and looks like a small snail shell, as seen in Figure 3.1. In fact, "cochlea" is derived from the Greek word for snail.

Sound waves enter the auditory canal and the pulsating pressure variations push on the eardrum. Because the other side of the ear drum (the middle ear side) is held at a fairly constant pressure, the sound causes the eardrum to vibrate. The outer ear canal, because of its shape and length, is responsible for the high sensitivity to sounds with frequency components around 4 kHz.

The vibrations of the ear drum are transferred to the sequence of three small bones known as the hammer, anvil, and stirrup. This mechanism is designed to effectively couple the sound vibrations from the air in the outer ear into the fluid-filled cochlea (inner ear). This task is accomplished in part by the lever action of the bones, but more by the increase in pressure due to the fact that the stirrup bone pushes on a much smaller surface area than the surface area of the eardrum, amplifying the pressure of the sound wave.

The middle ear transfers sound energy from the oscillations of air particles in the outer ear to

oscillations in the fluids within the inner ear. The middle ear system acts like a transformer, matching

the acoustic impedance between the air and the fluids at frequencies centered at 1 kHz. The inner ear

transmits the fluid oscillations to the organ of Corti on the basilar membrane, where sensory cells

convert the fluid oscillations to signals that the nervous system can process. The inner ear also can

separate frequencies, because different frequencies produce maximum oscillations at different

positions along the basilar membrane.

When sound waves try to pass from air into liquid, only a small part of the sound energy is transmitted through the interface, while the remainder of the energy is reflected. This is because air has a low mechanical impedance (low acoustic pressure and high particle velocity, resulting from low density and high compressibility), while liquid has a high mechanical impedance (high acoustic pressure and low particle velocity, resulting from high density and low compressibility). Put differently, it requires more effort to wave one's hand in water than it does to wave it in air. This difference in mechanical impedance results in most of the sound energy being reflected at an air/liquid interface.

The middle ear is an impedance matching network that increases the part of the sound energy entering the liquid of the inner ear. (Fish, for example, do not have an ear drum or middle ear, because they have no need to hear in air.) Basically, the impedance conversion is determined by the difference in area between the ear drum (receiving sound from the air) and the oval window (transmitting sound into the liquid). The ear drum has an area of about 60 mm², while the oval window has an area of roughly 4 mm². Since pressure is equal to force divided by area, this difference in area increases the sound wave pressure by about 15 times.

The basilar membrane, found within the cochlea, is the supporting structure for about 12 000 sensory cells forming the cochlear nerve. The basilar membrane is stiffest near the oval window and becomes more flexible toward the opposite end, allowing it to act as a frequency spectrum analyzer. When exposed to a high-frequency signal, the basilar membrane resonates where it is stiff, resulting in the excitation of nerve cells close to the oval window. Likewise, low-frequency sounds excite nerve cells at the far end of the basilar membrane. This causes specific fibers in the cochlear nerve to respond to specific frequencies. This organization of the hearing nerve fibers is called the place principle, and it is preserved throughout the auditory pathway into the brain.

3.2 Limits of Hearing

Human hearing also uses another information-encoding scheme, called the volley principle [25, 28]. Nerve cells transmit information by generating brief electrical pulses called action potentials. A nerve cell on the basilar membrane can encode audio information by producing an action potential in response to each cycle of the vibration. For example, a 200 Hz sound wave can be represented by a neuron producing 200 action potentials per second. However, this only works at frequencies below about 500 Hz, the maximum rate at which neurons can produce action potentials. The human ear overcomes this problem by allowing several nerve cells to take turns performing this single task. For example, a 3000 Hz tone might be represented by ten nerve cells alternately producing potentials at 300 times per second. This extends the range of the volley principle to about 4 kHz. Above this mark only the place principle is used.

The ear is a very sophisticated auditory organ and can be thought of as a complex instrument

for auditory signals. Human hearing can detect small variations in air pressure, ranging from 10 µPa

up to 100 Pa. The detection of these small variations occurs in the presence of atmospheric pressure,

where 1 atm=101,3 kPa.


Humans perceive loudness on a logarithmic scale. Therefore it is common to express sound intensity on a logarithmic scale, called decibel sound pressure level (dB SPL). On this scale, 0 dB SPL corresponds to a sound intensity of 10⁻¹⁶ W/cm² and to the international standard reference sound pressure of 20 µPa. This is about the weakest sound detectable by the human ear and is considered the nominal threshold of hearing (the threshold of quiet), although approximately half of the general population can sense sounds at even lower levels. Normal speech is at about 60 dB SPL, which corresponds to a sound intensity of about 10⁻¹⁰ W/cm² and pressure levels of about 20 mPa. Humans experience discomfort from sounds with sound pressure levels greater than 100 Pa (134 dB SPL, ≈5·10⁻³ W/cm²), while painful damage to the ear occurs at about 140 dB SPL (10⁻² W/cm²). Within this range in level, humans can typically discern changes as small as 1 dB.
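These figures follow directly from the definition of sound pressure level, L = 20·log10(p / p0) with p0 = 20 µPa. A small Python check of the numbers quoted above:

```python
import math

P0 = 20e-6  # reference sound pressure, 20 uPa

def spl_db(pressure_pa):
    """Sound pressure level in dB relative to 20 uPa."""
    return 20 * math.log10(pressure_pa / P0)

print(spl_db(20e-6))   # 0 dB  - threshold of quiet
print(spl_db(20e-3))   # 60 dB - normal speech (20 mPa)
print(spl_db(100.0))   # ~134 dB - discomfort limit
```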

The difference between the loudest and weakest sounds that humans can hear is about 120 dB,

a range of one million in amplitude. Listeners can detect a change in loudness when the signal is

altered by about 1 dB (a 12% change in amplitude). In other words, there are only about 120 levels

of loudness that can be perceived from the weakest whisper to the loudest thunder. The sensitivity of

the ear is amazing; when listening to very weak sounds, the ear drum vibrates less than the diameter

of a single molecule.

The perception of loudness relates roughly to the sound power raised to an exponent of 1/3. For example, if you increase the sound power by a factor of ten, listeners will report that the loudness has increased by a factor of about two (10^(1/3) ≈ 2). This is the main problem in eliminating undesirable environmental sounds, for example the noise from a neighboring apartment. Suppose somebody carefully covered 99% of his apartment walls with a perfect soundproof material, missing only 1% of the surface area (doors, corners, vents, etc.). The sound power has then been reduced to 1% of its former value, but the perceived loudness has only dropped to about 0,01^(1/3) ≈ 0,2, i.e. to 20%.

The range of human hearing is generally considered to be 20 Hz to 20 kHz, but it is far more

sensitive to sounds between 1 kHz and 4 kHz. Frequency components outside this range are not

generally considered to impact the human perception of sound, regardless of the sound pressure level.

As explained in the previous section, there are physical reasons why human hearing is most sensitive

to frequency components around 1 kHz and 4 kHz. For example, listeners can detect sounds as low

as 0 dB SPL at 3 kHz, but require 40 dB SPL at 100 Hz (an amplitude increase of 100). Listeners can

tell that two tones are different if their frequencies differ by more than about 0,3% at 3 kHz. This

increases to 3% at 100 Hz. For comparison, adjacent keys on a piano differ by about 6% in frequency.

Many studies have demonstrated the sensitivity of the ear as a function of frequency, which is

typically plotted in equal loudness curves as in the following Figure 3.3.


Fig. 3.3. Equal loudness curves corresponding to the threshold of quiet and the pain limit

It is easiest to consider the threshold-of-hearing line, the lowest curve in Figure 3.3. The curve is at 0 dB at 1000 Hz, which is the standard reference level for the threshold of hearing. At other frequencies this threshold of hearing deviates from 0 dB. At low frequencies the threshold is much higher; for example, at 100 Hz a tone must be almost 40 dB before the average person can hear it. In contrast, at 4000 Hz sounds that are a few dB below 0 dB are audible. The other curves (not presented in Figure 3.3, except the pain threshold curve) represent equal loudness profiles. Note that for very loud sounds, 80 to 100 dB, the large difference in response between the low and mid frequencies disappears.

Besides the sound-pressure-level-dependent sensitivity of hearing, humans can also differentiate very fine changes in frequency. Below 500 Hz, the ear can differentiate tone bursts with a frequency difference of approximately 1 Hz. Above 500 Hz, the noticeable difference is proportional to the frequency (0,002·f).

Two ears provide humans with the ability to identify the direction of a sound. Human listeners can detect the difference between two sound sources that are placed as little as three degrees apart, about the width of a person at 10 meters. This directional information is obtained in two separate ways. First, frequencies above about 1 kHz are strongly shadowed by the head; in other words, the ear nearest the sound receives a stronger signal than the ear on the opposite side of the head. The second way of obtaining directional information is that the ear on the far side of the head hears the sound slightly later than the near ear, due to its greater distance from the source. Based on a typical head size (about 22 cm) and the speed of sound (about 340 meters per second), an angular discrimination of three degrees requires a timing precision of about 30 µs. Since this timing relies on the volley principle, this way of obtaining directional information is predominantly used for sounds below about 1 kHz.
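The 30 µs figure can be reproduced with a simple path-difference estimate. The sketch below is a rough model that ignores diffraction around the head; it only uses the head size and sound speed quoted above:

```python
import math

HEAD_WIDTH = 0.22       # m, typical distance between the ears
SPEED_OF_SOUND = 340.0  # m/s

def itd_seconds(azimuth_deg):
    """Approximate interaural time difference for a source at the
    given azimuth (simple path-difference model)."""
    return HEAD_WIDTH / SPEED_OF_SOUND * math.sin(math.radians(azimuth_deg))

print(itd_seconds(3) * 1e6)    # ~34 us for a 3-degree offset
print(itd_seconds(90) * 1e6)   # ~647 us maximum delay
```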

Both these sources of directional information are greatly aided by the ability to turn the head

and observe the change in the signals. An interesting sensation occurs when a listener is presented

with exactly the same sounds to both ears, such as listening to monaural sound through headphones.

The brain concludes that the sound is coming from the center of the listener's head.

While human hearing can determine the direction of a sound source, it identifies the distance to the source only poorly. This is because there are few attributes in a sound wave that can provide this information. One weak clue is frequency content: sounds rich in high frequencies are perceived as nearby, while sounds dominated by low frequencies are perceived as distant, because sound waves dissipate their higher frequencies as they propagate over long distances. Another weak clue to distance is echo content, which provides a perception of the room size. For example, sounds in a large auditorium will contain echoes at about 100 millisecond intervals, while 10 milliseconds is typical for a small office. Some species have solved this ranging problem by using active sonar. For example, bats and dolphins produce clicks and squeaks that reflect from nearby objects. By measuring the interval between transmission and echo, these animals can locate objects with about 1 cm resolution. Experiments have shown that some humans, particularly the blind, can also use active echo localization to a small extent.


3.3 Masking

Masking describes the phenomenon in which a sound becomes inaudible due to the presence of another sound [25, 28]. An example of this phenomenon is when loud music masks the sound of emergency sirens, or when background noise partially masks conversational speech. The transition between an unmasked tone and a completely masked tone is continuous. The masked threshold is the sound pressure level of a barely audible test tone in the presence of a masking sound.

3.3.1 Time Masking

Simultaneous masking describes the effect when the masked signal and the masking signal occur at the same time. Human hearing is sensitive to the temporal structure of sound, and masking can also occur between sounds that are not present simultaneously. Pre-masking is when the test tone occurs before the masking sound. Post-masking is when the test tone occurs after the masking sound. Figure 3.4 shows the time regions of pre-masking, simultaneous masking, and post-masking in relation to the masking signal.

Fig. 3.4. Illustration of time masking – the time regions of pre-masking, simultaneous masking, and post-masking in relation to the masking signal

Post-masking is a pronounced phenomenon that corresponds to decay in the effect of the

masking signal. Pre-masking is a more subtle effect caused by the fact that hearing does not occur

instantaneously because sounds require some time to sense. As indicated in Figure 3.4, researchers

typically can measure pre-masking for only about 20 ms. Post-masking is the more dominant temporal

effect and can be measured for 100 ms following the termination of the masking sound. Both the

threshold in quiet and the masked threshold depend on the duration of the test tone. Researchers must

know these dependencies when investigating pre- and post-masking because they use short-duration

test signals to perform these measurements.

3.3.2 Frequency Masking

Broadband white noise can mask test tones. White noise has a spectral density that is independent of frequency. Other types of noise and signals, such as pink noise, narrow-band noise, pure tones, and complex tones, can also mask a test signal. When narrow-band noise is the masking signal, masked thresholds show a very steep rise, greater than 100 dB per decade, as the test tone increases in frequency up to the center frequency of the narrow-band noise. This rise is independent of the level of the masking noise. For frequencies greater than the center frequency of the noise, the masked threshold decreases quickly for low levels of masking noise, but more slowly for high levels of masking noise. When pure tones are the masking signal, the measurement needs additional filters to remove artifacts such as audible beating and difference tones. Figure 3.5 shows the masked threshold for a masking signal at a frequency of 1 kHz.


Fig. 3.5. Illustration of frequency masking – masked thresholds for a masking signal at a frequency of 1 kHz

3.4 Ear as Spectrum Analyzer

The fluid in the cochlea vibrates with the frequencies of the incoming sound wave. The basilar membrane is the element that separates these frequencies into components. The basilar membrane is a long structure that is stiff and hard near the round and oval windows and floppy and loose near its far end. The membrane resonates at different positions along its length depending on what frequency is traveling in the cochlear fluid. High frequencies are resonant near the hard, stiff end, whereas low frequencies are resonant near the far, floppy end. Nerves that run the length of the membrane detect the position of the resonance and hence the frequency (or frequencies) of the incoming sound. The typical length of the basilar membrane is about 35 mm. There are 10 octaves of frequency resolution along the length of the membrane; each octave occupies about 3,5 mm. Figure 3.6 plots the frequency versus position along the membrane.

Fig. 3.6. Resonant frequencies versus position along the basilar membrane

Note that the curve is approximately linear and that for every 3,5 mm of travel along the horizontal axis (corresponding to distance along the basilar membrane) the resonant frequency changes by a factor of 2. For example, in the first 3,5 mm the resonant frequency drops from 20480 to 10240 Hz. Note also that the vertical scale is logarithmic; each equal-sized increment up the frequency axis increases the frequency by a multiplicative factor of 2.

The ear acts as a mechanical spectrum analyzer, splitting up sounds (oscillating pressure variations as a function of time) into their component frequencies. In other words, the signal that is fed to our nerves and subsequently analyzed by our brains is proportional to the strength of each frequency in the sound [29]. As we will appreciate later, this analysis in frequency rather than time is a very useful way to understand much more about a sound – particularly a musical sound – when it comes to concepts such as pitch, timbre, and the scientific basis of consonance.

Unfortunately, a mechanical spectrum analyzer is not perfect at homing in on just one single frequency. The resonance of the basilar membrane is not perfectly isolated to one particular position for a given pure-tone input: the resonance has a width. Imagine listening to a pure tone of 640 Hz. According to the figure above, the resonant position will be 21 mm along the length of the basilar membrane. However, that segment of the membrane cannot oscillate without making a small region on either side of the 21 mm position oscillate as well. It is like grabbing the middle of your top lip and pulling it up and down: the parts of the lip on either side get pulled up and down too, because the tissue is all connected. The net result is that a pure tone excites the nerves along the basilar membrane for a small length on either side of the resonance position. Because position along the membrane corresponds to frequency, this is equivalent to saying that a pure tone excites nerves corresponding to a band of frequencies around the pure-tone resonance. This band of frequencies is called the critical band. The critical band changes from low to high frequencies. Figure 3.7 shows the width of the critical band on the vertical axis versus the center frequency of the pure tone on the horizontal axis.

Fig. 3.7. The critical band width versus center frequency

Even though the critical band is quite wide, we are able to process the information given by the nerves in the basilar membrane so as to discern very small differences between two pure-tone frequencies. The smallest interval between two frequencies that we can resolve with our ears is called the just noticeable difference (JND). Some people can resolve frequency differences at the 1 Hz level; most can typically resolve differences of between 3 and 5 Hz.

When developing spectrum analyzers for audio signals, we have to remember one very important lesson from audio signal processing practice: signals should be processed in a manner consistent with how they are formed or analyzed by the corresponding human organs. Thus the distribution of critical bands along the frequency axis suggests a concept for choosing the frequency bands of the spectrum analyzer filters. However, for choosing the central frequencies of the filters, the JND in frequency does not suggest a practical concept, as the number of filters would be enormous. Some researchers found the solution to this problem by analyzing the perception of the pitch of continuous sounds.

Pitch is our perceptual interpretation of frequency, which allows the ordering of sounds on a frequency-related scale. Pitches are compared as "higher" and "lower" in the sense associated with musical melodies, which require sound whose frequency is clear and stable enough to distinguish from noise. Pitch is a major auditory attribute of musical tones, along with duration, loudness, and timbre. Pitch may be quantified as a frequency, but pitch is not a purely objective physical property; it is a subjective psychoacoustic attribute of sound. Historically, the study of pitch and pitch perception has been a central problem in psychoacoustics, and has been instrumental in forming and testing theories of sound representation, processing, and perception in the auditory system.

In general, we perceive pitch logarithmically in relation to frequency: every doubling in Hz is perceived as an equivalent octave. It is thought that because a doubling of frequency causes a response at an equal distance along the basilar membrane, we hear octaves as related. In fact, because of the logarithmic spacing of pitch placement on the membrane, it can be extrapolated that we perceive differences in pitch not as differences in frequency but as the ratio of the pitches separating them, i.e. as musical intervals, which are ratios of frequencies and not linear differences. So, if one pair of musical sounds lies at 220 and 440 Hz and another pair at 440 and 880 Hz, both pairs are perceived as having the "same" interval (an octave), even though their frequency differences are 220 Hz and 440 Hz respectively.

Hence, the human auditory system does not interpret pitch in a linear manner. The human interpretation of pitch rises with frequency, which in some applications may be an unwanted feature. To compensate for this, the mel scale was developed [30]. The sole purpose of the underlying experiment was to describe the human auditory system on a linear scale. The experiment showed that pitch is perceived linearly in the frequency range 0–1000 Hz; above 1000 Hz, the scale becomes logarithmic. A widely used approximate formula for the mel scale is shown below:

F_mel = (1000 / log(2)) · log(1 + f_Hz / 1000),

where F_mel is the resulting pitch frequency on the mel scale. This is plotted in Figure 3.8; a short code sketch follows the figure.

Fig. 3.8. Relationship between the frequency scale and the mel scale: (a) – logarithmic frequency scale, (b) – linear frequency scale
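The formula is equivalent to F_mel = 1000·log2(1 + f_Hz/1000), which makes it easy to invert. A small Python sketch of the forward and inverse mappings:

```python
import math

def hz_to_mel(f_hz):
    """Pitch in mels from the approximate formula above."""
    return 1000.0 / math.log(2) * math.log(1 + f_hz / 1000.0)

def mel_to_hz(f_mel):
    """Inverse mapping: frequency in Hz for a given mel value."""
    return 1000.0 * (2.0 ** (f_mel / 1000.0) - 1.0)

print(hz_to_mel(1000))   # 1000 mels at 1 kHz, by construction
print(hz_to_mel(4000))   # ~2322 mels
print(mel_to_hz(2000))   # 3000 Hz
```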

With the mel scale applied, coefficients from an LPC analysis will be concentrated in the lower frequencies, around the area perceived by humans as pitch, which may result in a more precise description of a signal as seen from the perception of the human auditory system. This, however, has not been proved; it is only suggested that the mel scale may have this effect. Regardless, the mel scale is a widely used and effective scale within speech recognition, where a speaker need not be identified, only understood. So it is hoped that the mel scale models the sensitivity of the human ear more closely than a purely linear scale, and provides greater discriminatory capability between speech segments.

The mel frequency scale is usually used when developing filter banks. This is illustrated in Figure 3.9, and a short code sketch follows the figure. It has been found that the energy in a critical band around a particular frequency influences the perception of the human auditory system. The critical bandwidth varies with frequency: it is linear below 1 kHz and logarithmic above. Combined with the mel scale, the distribution of these critical bands becomes linear. Below 1 kHz the critical bands are placed linearly around 100, 200, …, 1000 Hz; above 1 kHz the bands are placed according to the mel scale.

Fig. 3.9. Mel scale filter bank: each peak is the center frequency of a critical band
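To make the construction concrete, the sketch below builds a small bank of triangular filters whose center frequencies are spaced uniformly on the mel scale, in the spirit of Figure 3.9. The filter count, FFT size and sampling rate are illustrative assumptions, not values prescribed by the text:

```python
import numpy as np

def mel_filterbank(n_filters=20, n_fft=512, fs=8000.0):
    """Triangular filters with mel-spaced center frequencies,
    returned as rows over the n_fft/2 + 1 positive FFT bins."""
    def hz_to_mel(f):
        return 1000.0 / np.log(2) * np.log1p(f / 1000.0)

    def mel_to_hz(m):
        return 1000.0 * (2.0 ** (m / 1000.0) - 1.0)

    # n_filters + 2 equally spaced points on the mel axis -> FFT bin indices
    mels = np.linspace(0.0, hz_to_mel(fs / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / fs).astype(int)

    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):      # rising slope
            fbank[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):     # falling slope
            fbank[i - 1, k] = (right - k) / max(right - center, 1)
    return fbank

print(mel_filterbank().shape)   # (20, 257)
```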

3.5. Human Voice

The human voice consists of waves of air pressure. A schematic diagram of the human speech production system is shown in Figure 3.10 [31–33]. Voice is produced by airflow pressed out of the lungs and going out through the mouth and nasal cavities. On its way from the lungs through the vocal tract, the air passes through the vocal folds, vibrating them at different frequencies.

Fig. 3.10. A schematic diagram of the human speech production system

The vocal folds are thin, lip-like muscles located in the larynx. At their front end they are permanently connected together; at the other end they can be open or closed.

When they are closed, the vocal folds are stretched next to each other, forming an air block. The air pressure from the lungs forces its way through that block, pushing the vocal folds aside. The air passes through the formed crack and the pressure drops, allowing the vocal folds to close again. This process goes on and on, vibrating the vocal folds to produce the voiced sound. Thus, the glottal flow is automatically chopped up into pulses. An example of the signals generated by such a model is shown in Figure 3.11 [31–33]. This glottal pulse has a rise (opening phase) time of about 5,0 ms and a fall (closing phase) time of about 1,7 ms (an opening-to-closing ratio of about 3:1).

Fig. 3.11. An example of glottal volume velocity

The length of the vocal folds and their tension determine the pitch, i.e. the fundamental frequency of the voiced sound. Pitch is the main acoustic correlate of tone and intonation. The main parameter of pitch (the pitch period) is the repetition period of the glottal pulses.

In males, the vocal folds are usually longer than in females, causing a lower pitch and a deeper voice.


When the vocal folds are open, they form a triangle, allowing the air to reach the mouth cavity

easily. Constriction in the vocal tract, causing air turbulence, is responsible for random noise that

forms the unvoiced sound.

In summary, three major mechanisms of excitation can be identified in speech production. These are:

1. Air flow from the lungs is modulated by the vocal fold vibration, resulting in quasi-periodic, pulse-like excitation.

2. Air flow from the lungs becomes turbulent as the air passes through a constriction, resulting in noise-like excitation.

3. Air flow builds up pressure behind a point of total closure of the vocal tract. The rapid release of this pressure, by removing the constriction, causes a transient excitation.

A detailed model of excitation of sound in the vocal system involves the lungs, bronchi, trachea, the glottis, and the vocal tract. A block diagram of human speech production is shown in Figure 3.12.

Fig. 3.12. A block diagram of human speech production

3.6 Vowels and Consonants

Spoken words are made of vowels and consonants. All vowels are produced with the vocal folds closed (a, i, o); therefore they are voiced.

The consonants, on the other hand, can be voiced, unvoiced or semi-voiced. Among the voiced and unvoiced consonants one can further distinguish voiced and unvoiced stop consonants.

The voiced stop consonants (b, d, g) are transient, noncontinuant sounds which are produced

by building up pressure behind the total constriction somewhere in the oral tract, and suddenly

releasing the pressure. During the period of total closure of the tract the vocal folds do not vibrate.

Following the period of closure there is a brief interval of friction followed by a period of aspiration

before voiced excitation begins.

The unvoiced stop consonants (p, t, k) are similar to their voiced counterparts (b, d, g) with one

exception. The difference between the unvoiced stop phonemes and the voiced stop phonemes is not

just a matter of whether articulatory voicing is present or not. Rather, it includes when voicing starts

(if at all), the presence of aspiration (airflow burst following the release of the closure), and the

duration of the closure and aspiration.

Voiced consonants (m, n) are formed with the vocal folds closed. For unvoiced consonants (s, f) there is no vocal fold vibration, because the folds are open, and the excitation in this case is the turbulent flow of air passing through a constriction in the vocal tract.

The semi-voiced consonants (v, z) are a combination of those two; v, for example, is pronounced like f but with the vocal folds closed.


3.7 Engineering Model

Several approximations of the vocal-fold vibration for voiced sounds have been proposed. For example, on the basis of an analysis of synthetic speech quality, Rosenberg [34] found that a good synthetic pulse waveform has the form

g(n) = (1/2)·[1 − cos(π·n / N1)],   0 ≤ n ≤ N1,
g(n) = cos[π·(n − N1) / (2·N2)],   N1 < n ≤ N1 + N2,
g(n) = 0,   otherwise.   (3.1)

Figure 3.13 shows the pulse waveform and its spectrum for typical values of the parameters N1 and N2; a short code sketch follows the figure.

Fig. 3.13. Rosenberg approximation of the glottal pulse: (a) – pulse waveform, (b) – pulse spectrum
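Equation (3.1) is straightforward to implement. A minimal Python sketch of the pulse (the values of N1 and N2 below are illustrative, not necessarily those used for Figure 3.13):

```python
import numpy as np

def rosenberg_pulse(N1=25, N2=10, length=40):
    """Glottal pulse of equation (3.1): raised-cosine opening phase
    of N1 samples, cosine closing phase of N2 samples, zero elsewhere."""
    n = np.arange(length)
    g = np.zeros(length)
    opening = n <= N1
    closing = (n > N1) & (n <= N1 + N2)
    g[opening] = 0.5 * (1.0 - np.cos(np.pi * n[opening] / N1))
    g[closing] = np.cos(np.pi * (n[closing] - N1) / (2.0 * N2))
    return g

pulse = rosenberg_pulse()
print(pulse.max())   # 1.0, reached at n = N1
```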

It can be seen from the form of the pulse spectrum that a sequence of such pulses can be realised by appropriate low-pass filtering of rectangular pulses (shaped pulses), as depicted in Figure 3.14.

Fig. 3.14. Generation of the excitation signal for voiced speech

The excitation model for voiceless sounds is simpler. Noise-like excitation can be modeled sufficiently accurately by colored noise, realised by appropriate low-pass filtering of white noise, as presented in Figure 3.15.

Fig. 3.15. Generation of the excitation signal for unvoiced speech

The simplest physical model of the vocal tract that has a useful interpretation in terms of the speech production process is depicted in Figure 3.16.

Fig. 3.16. The simplest physical model of the vocal tract

As follows from the acoustical theory of speech production, the vocal tract can be modeled as a tube of nonuniform, time-varying cross-section [31, 33]. Closed-form solutions to the partial differential equation describing this model are not possible. Numerical solutions can be obtained, but they require knowing the vocal tract area function A(x, t), the air pressure and air volume velocity for values of x and t in the region bounded by the glottis and the lips, and the boundary conditions at each end of the tube.

A more practical, widely used model for speech production is based upon the assumption that the vocal tract can be represented as a concatenation of lossless acoustic tubes of different lengths, as depicted in Figure 3.17, a. The constant cross-sectional areas A_k of the tubes are chosen so as to approximate the area function A(x) of the vocal tract. If a large number of tubes of short length are used, one can reasonably expect the resonant frequencies of the concatenated tubes to be close to those of a tube with a continuously varying area function. Moreover, the lossless tube models provide a convenient transition between continuous-time models and discrete-time models, as they have many properties in common with digital filters. A particularly natural transition to a digital filter representation is obtained in the case of a concatenation of lossless tubes of equal length, as depicted in Figure 3.17, b.

Fig. 3.17. Representation of the vocal tract as a concatenation of lossless acoustic tubes: (a) – of different lengths, (b) – of equal length

The lossless tube model of Figure 3.17, b can be mathematically described by a system function H(z) of the form [31, 33]

H(z) = G / (1 − Σ_{k=1}^{P} b_k·z^(−k)),   (3.2)

where the, for the moment unknown, coefficients G and {b_k} depend on the set of area-function A(x) cross-sections A_k. The system function H(z) corresponds to the system function of an infinite impulse response (IIR) digital filter. One way to implement this form of system function is to use one of the standard digital filter implementation structures, for example the direct form, as presented in Figure 3.18 (a short code sketch follows the figure). Here the function u(n) represents the excitation and s(n) the speech signal.

Fig. 3.18. Direct-form implementation of the digital filter system function describing the vocal tract
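Equation (3.2) corresponds to the difference equation s(n) = G·u(n) + Σ_k b_k·s(n − k), which the direct form of Figure 3.18 evaluates sample by sample. A minimal Python sketch (the two-pole coefficients in the example are illustrative assumptions, chosen to keep the poles inside the unit circle):

```python
import numpy as np

def vocal_tract_filter(u, b, G=1.0):
    """All-pole IIR filter H(z) = G / (1 - sum_k b_k z^-k),
    implemented as s(n) = G*u(n) + sum_k b_k * s(n - k)."""
    s = np.zeros(len(u))
    for n in range(len(u)):
        acc = G * u[n]
        for k in range(1, len(b) + 1):
            if n - k >= 0:
                acc += b[k - 1] * s[n - k]
        s[n] = acc
    return s

# Impulse response of an illustrative two-pole resonator:
excitation = np.zeros(50)
excitation[0] = 1.0
speech = vocal_tract_filter(excitation, b=[1.3, -0.9])
print(speech[:4])   # [1.0, 1.3, 0.79, -0.143]
```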

Thus, the speech mechanism can be modeled as a time-varying filter (the vocal tract) excited by an oscillator (the vocal folds) with different outputs. When voiced sound is produced, the filter is excited by an impulse chain with fundamental frequency in the range 60–400 Hz. When unvoiced sound is produced, the filter is excited by random colored noise, without any observed periodicity. These attributes can be observed when the speech signal is examined in the time domain.

Putting all the subsystems described above together, we obtain the model of Figure 3.19 [31, 33].

Fig. 3.19. General model for speech production

Naturally, the model does not cover all aspects of speech production. For example, it does not allow modelling the mixed excitation used in pronouncing some sounds, for example the voiced fricatives (z, v). The model works very well in the case of slowly changing filter parameters, i.e. when dealing with continuant sounds such as vowels. It works adequately, but much worse, in the case of transient sounds such as stops. More sophisticated models for isolated cases, for example voiced fricatives, have been developed [33].


4. DIGITIZATION of SIGNALS

4.1 Pulse Code Modulation

An analog signal is continuous in time and in amplitude. The process of discretization of the analog signal yields the equivalent digital signal, which is discrete in time and in amplitude levels. Usually the digital signal is expressed in pulse code modulation (PCM) form [35–38].

The process of pulse code modulation (PCM) is analog-to-digital conversion and is done by an analog-to-digital converter (ADC), a block diagram of which is depicted in Figure 4.1. A typical ADC contains three components or blocks: sampler, quantizer and encoder. In essence, analog-to-digital conversion is the representation of the values of the instantaneous samples s(nTs) of an analog signal s(t) at the output of the sampler by a series of pulses that correspond to a series of bits c(nTs) at the output of the encoder. Assuming that a group of B bits composes a digital word, there are M = 2^B unique code words. Each code word corresponds to a certain amplitude level. However, each sample value can be any one of an infinite number of levels. Therefore the digital word that represents the amplitude closest to the actual sampled value is used. This procedure is called quantizing. That is, instead of using the exact sample value of the analog waveform s(nTs), it is replaced by the closest allowed value ŝ(nTs). There are M = 2^B allowed values, each corresponding to one of the code words. So the PCM signal is generated by carrying out three basic operations: sampling, quantizing and encoding.

For reconstruction of the analog signal from the PCM signal, a digital-to-analog converter (DAC) and a low-pass reconstruction filter are used. The block diagram of a reconstruction system is presented in Figure 4.2. The DAC's role is to convert the binary code words c'(nTs), received from some communication channel or DSP device, into an intermediate form of the final analog signal. Commonly this is a flat-top pulse amplitude modulation signal representing the quantized instantaneous samples ŝ'(nTs). The reconstruction filter filters out unwanted high-frequency components and thereby performs the final conversion from the digital domain to the analog one. The received code words, the back-converted quantized instantaneous samples and the analog signal can differ from their original versions for a variety of reasons and are therefore denoted by c'(nTs), ŝ'(nTs) and s'(t).

Fig. 4.1. Analog-to-digital converter. Fig. 4.2. Analog signal reconstruction system

4.2 Sampling: Nyquist–Shannon Sampling Theorem

The sampling operation samples the incoming signal at a regular interval called the sampling period or sampling interval (denoted by Ts). The reciprocal of the sampling interval is the sampling rate or sampling frequency (denoted by Fs), which defines the number of samples per unit of time (usually seconds) taken from a continuous signal to make a discrete signal. For time-domain signals, the unit for the sampling rate is hertz (inverse seconds, 1/s, s⁻¹), sometimes noted as Sa/s or S/s (samples per second):

Fs = 1 / Ts.

A correctly selected sampling rate guarantees a faithful representation of the signal in the digital domain and a faithful conversion back to the analog domain. The criterion for correct selection of the sampling rate is described by the Nyquist–Shannon sampling theorem. For baseband signals whose frequency composition is restricted to a maximum frequency of Fmax Hz (extending from 0 Hz up to Fmax Hz) it can be stated as follows [35–38]: "In order for a faithful reproduction and reconstruction of an analog baseband signal, it should be sampled at a sampling frequency Fs that is greater than or equal to twice the maximum frequency of the signal, Fmax", i.e.

Fs ≥ 2·Fmax.   (4.1)

For example, if the maximum frequency of the signal is 5 kHz, then for a faithful representation of the signal in the digital domain the sampling frequency can be chosen as Fs ≥ 10 kHz. The higher the sampling frequency, the higher the accuracy of representation of the signal. A higher sampling frequency also implies more samples per time unit, and hence more storage space and more calculation operations per time unit required from the following DSP units. The minimal correct sampling frequency, Fs min = 2·Fmax, is called the Nyquist frequency and is usually denoted by F_N.

4.3 Ideal Sampling

Consider an analog signal consisting of some components of different frequencies F1, F2, …, Fmax (see Figure 4.3, a and b). Now, to satisfy the sampling theorem stated above, the sampling frequency can be chosen according to equation (4.1).

We start the analysis from the ideal (impulse) sampling process, which definitely differs from the real sampling process in some aspects. However, it helps to explain the effects and results of the sampling theorem in simple terms. The ideal sampling process in the time domain, often called impulse sampling, can be mathematically expressed as the multiplication of the signal s(t) by a series of extremely short pulses (a pulse train) ш(t) at a regular interval Ts, as depicted in Figure 4.3, a, c and e. The mathematical model of the ideal sampler is presented in Figure 4.3, g. In the frequency domain, the output of the ideal sampling process contains components at the following frequencies: F1, F2, …, Fmax (the original frequency content of the signal); Fs ± F1, Fs ± F2, …, Fs ± Fmax; 2Fs ± F1, 2Fs ± F2, …, 2Fs ± Fmax; 3Fs ± F1, …, and so on. This is illustrated in Figure 4.3, f.

Now, comparing the frequency-component composition of the analog and sampled versions of the signal, we see that the sampled signal contains many supplementary frequency components Fs ± F1, …, Fs ± Fmax; 2Fs ± F1, …, 2Fs ± Fmax; etc. It is clear from this that when converting the ideally sampled signal back to the analog domain it is necessary to filter out those supplementary frequency components. This can be done by using a so-called "reconstruction" filter. In our case it is a low-pass filter designed to pass only those frequency components up to the maximum analog signal frequency Fmax and to attenuate the components whose frequencies are higher than Fmax. The frequency response H_LP(f) of such a low-pass filter is depicted in Figure 4.3, f.


Fig. 4.3. Illustration of the ideal sampling process: (a) analog signal, (b) spectrum of analog signal, (c) sampling function, (d) spectrum of sampling function, (e) ideally sampled signal, (f) spectrum of correctly sampled signal and frequency response of low-pass reconstruction filter, (g) mathematical model of ideal sampler, (h) spectrum of undersampled signal

4.4 Aliasing and Anti-Aliasing

The usage of too low a sampling frequency, Fs < 2·Fmax (undersampling mode), leads to an unwanted effect called aliasing [35–38]. Let's take a signal with two frequency components: F1 = 6 kHz, which is our desired signal, and F2 = 12 kHz, which is an unwanted component, e.g. noise. The noisy component's frequency is sufficiently far from the frequency of the desired signal and can easily be filtered out by an appropriate filter. So one can say this situation is not critical for the following signal processing.


Now let's see what happens if we undersample the signal, choosing the sampling frequency at Fs = 18 kHz < 2·Fmax = 24 kHz. The broad-brush view of the frequency-component situation in the undersampling case is presented in Figure 4.3, h; the specific case under consideration is reflected in Figure 4.4. The first frequency component, F1 = 6 kHz, will produce components at the sampler output at the following frequencies: 6 kHz, 12 kHz, 24 kHz, 30 kHz, 42 kHz, 48 kHz, 60 kHz and so on. The second frequency component, F2 = 12 kHz, will produce components at: 12 kHz, 6 kHz, 30 kHz, 24 kHz, 48 kHz and so on. Note that the 6 kHz component produced by F2 = 12 kHz will coincide and interfere with our desired F1 = 6 kHz component and will corrupt it. The main thing to note is that the interfering components now cannot be separated by any filter. This 6 kHz component is called an "alias" of the original noisy component F2 = 12 kHz. Similarly, the 12 kHz component produced by the F1 = 6 kHz component is an "alias" of the F1 = 6 kHz component. It will interfere with the original noisy component F2 = 12 kHz and they will also not be separable. However, there is no need to take any measures to avoid the interference at 12 kHz, since it is noise and has to be eliminated anyway. But it is necessary to do something about the aliasing component at 6 kHz produced by the noisy component F2 = 12 kHz, because it will corrupt our desired signal.

Fig. 4.4. Illustration of undersampling effects: (a) frequency components of the analog signal, (b) spectral contributions of F1 = 6 kHz with Fs = 18 kHz, (c) spectral contributions of F2 = 12 kHz with Fs = 18 kHz

It should be noted that choosing another sampling frequency, e.g. Fs = 19 kHz < 2·Fmax = 24 kHz, gives slightly different results. The 6 kHz component will produce sampler output components at: 6 kHz, 13 kHz, 25 kHz, 32 kHz, 44 kHz and so on. The 12 kHz component will produce components at: 12 kHz, 7 kHz, 31 kHz, 26 kHz, 50 kHz and so on. Now the coincidental components have disappeared. But there are pairs of components which are very close to each other, one can say overlapping pairs: 6 kHz and 7 kHz, 12 kHz and 13 kHz, and so on. The probability of interference between them is very high and the possibility of separating one from another is very small.
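The frequency lists above are just the values |k·Fs ± F| for integer k, so they are easy to tabulate. A small Python check reproducing both cases:

```python
def sampler_output_frequencies(f, fs, k_max=4):
    """Frequencies |k*fs - f| and k*fs + f produced at the sampler
    output by an input component at f, for k = 0 ... k_max (kHz here)."""
    out = set()
    for k in range(k_max + 1):
        out.add(abs(k * fs - f))
        out.add(k * fs + f)
    return sorted(out)

print(sampler_output_frequencies(6, 18))    # 6, 12, 24, 30, 42, 48, 60, ...
print(sampler_output_frequencies(12, 18))   # 6, 12, 24, 30, 42, 48, 60, ...
print(sampler_output_frequencies(6, 19))    # 6, 13, 25, 32, 44, ...
print(sampler_output_frequencies(12, 19))   # 7, 12, 26, 31, 50, ...
```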

It is clear that coincidence and interference of aliasing components occur if the chosen sampling frequency is too low. All the frequency components from Fs/2 to Fs will be aliases of frequency components from 0 to Fs/2, and conversely. The frequency Fs/2 is called the "folding frequency", since the frequency components from Fs/2 to Fs fold back on themselves and can overlap and interfere with the components from 0 to Fs/2, and conversely. In fact, aliasing zones occur on either side of Fs/2, 3Fs/2, 5Fs/2, etc. and cause frequency reversal. Similarly, aliasing also occurs on either side of Fs, 2Fs, 3Fs, …, only without causing frequency reversal. All these frequencies are also called "folding frequencies", without respect to frequency reversal. Figures 4.3, f and h illustrate the concept of aliasing zones, folding frequencies and overlapping and interfering zones.

Note that the spectrum in zone 2 is just a mirror image of the spectrum in zone 1 with frequency reversal. Similarly, the spectrum in zone 2 creates aliases in zone 3, only without frequency reversal; the spectrum in zone 3 creates a mirror image in zone 4 with frequency reversal, and so on.

In the example above, the folding frequency was at Fs/2 = 9 kHz, so all the components from 9 kHz to 18 kHz will be aliases of the components from 0 kHz to 9 kHz. If the aliasing components fall into the band of interest, it is impossible to separate them from the original signal components and, as a result, the original signal will be corrupted. In order to prevent aliasing of unwanted components, it is necessary to remove those frequencies that are above Fs/2 before sampling the signal. This is achieved by using an "anti-aliasing" filter that must be inserted before the analog-to-digital converter.

An anti-aliasing filter, as a rule a low-pass filter, is designed to pass all the frequency components whose frequencies are below or equal to Fmax and to attenuate those frequency components whose frequencies are above Fmax and the folding frequency Fs/2. Hereby aliasing of unwanted components is avoided.

A complete design of a digital (PCM) signal generation system is presented in Figure 4.5. Thus, a complete analog-to-digital conversion system contains an anti-aliasing filter preceding the ADC, and a complete digital-to-analog conversion system contains a reconstruction filter succeeding the DAC (see Figure 4.2).

Fig. 4.5. A complete design of a digital (PCM) signal generation system

4.5 Flat-Top Sampling

It is important to note that the real sampling operation is much closer to flat-top sampling than to ideal sampling [35]. Flat-top sampling generates a flat-top pulse amplitude modulation (PAM) signal (see Figure 4.6, a). This signal can be generated by a sample-and-hold (S&H) type electronic circuit; one possible version of that circuit is depicted in Figure 4.6, b. There are two operations involved in the generation of the flat-top PAM signal:

– instantaneous sampling of the continuous signal s(t) every Ts seconds, as in the case of ideal (impulse) sampling;

– lengthening the duration of each sample to obtain rectangular pulses of constant duration τ (the duration τ of the pulses is called the aperture).

In digital circuit technology these two operations are jointly called "sample and hold". Due to these two operations, the amplitudes of regularly spaced pulses are varied in proportion to the corresponding sample values of the continuous message signal s(t) (see Figure 4.6, a).

However, the values of the flat-top PAM signal frequency components differ from those of the ideally sampled signal (compare Figure 4.3, f and Figure 4.6, c). One can readily see that there is some high-frequency loss due to the filtering effect caused by the flat-top pulse shape. The overall envelope of the frequency components decreases with increasing frequency and reaches zero at the frequency 1/τ (see Figure 4.6, c), where τ is the aperture of the PAM signal pulses (see Figure 4.6, a). This means that a smaller aperture causes less attenuation of the higher frequency components. Therefore the attenuation of the higher frequency components is called aperture distortion. However, in the commonly used mode the aperture is equal to the sampling interval, i.e. τ = Ts. In this case the overall envelope reaches zero at a frequency which coincides with the sampling frequency Fs (see Figure 4.6, c and d) and the aperture distortion can be significant. There is the same high-frequency loss in the recovered analog waveform too. This can be compensated (equalized) at the recovering side by making reverse distortions, i.e. by amplifying the high-frequency components in inverse ratio to their attenuation. This is a very common practice called "equalization" (a short code sketch follows Figure 4.6).

[Figure 4.6: (a) time waveform of a flat-top PAM signal ŝ(t) with sampling interval Ts; (b) sample-and-hold circuit with sampling switch G1, holding capacitor C and discharge switch G2; (c) spectrum of the flat-top PAM signal with the overall envelope falling to zero at 1/τ; (d) flat-top PAM waveform with aperture τ = Ts]

Fig. 4.6. Illustration of flat-top sampling: (a) flat-top PAM signal, (b) sample and hold circuit (1. G1 closes to sample; 2. capacitor C holds the sample value; 3. G2 discharges it), (c) spectrum of the flat-top PAM signal, (d) flat-top PAM signal with τ = Ts
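For illustration, a short Python sketch of the aperture effect (NumPy assumed; the frequencies are illustrative): the flat-top hold multiplies the spectrum by the envelope |sin(πfτ)/(πfτ)|, and the equalizer applies the reciprocal gain.

import numpy as np

def aperture_gain(f, tau):
    """Spectral envelope of a rectangular flat-top pulse of width tau."""
    return np.abs(np.sinc(f * tau))        # np.sinc(x) = sin(pi x)/(pi x)

fs = 8_000.0        # sampling frequency Fs, Hz
tau = 1.0 / fs      # commonly used mode: aperture equal to the sampling interval Ts

for f in (1_000, 2_000, 3_400):
    g = aperture_gain(f, tau)
    print(f"{f:5d} Hz: envelope {g:.3f} ({20 * np.log10(g):+.2f} dB), "
          f"equalizer gain {1 / g:.3f}")
# At f = Fs the envelope is sinc(1) = 0: a component there would be lost entirely.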

All the processes described above refer only to the sampler. The sampled signals are discrete in time, but their amplitude is still continuous. In order to have a full representation of a signal in digital form, the signal has to be made discrete in amplitude too. The amplitude of a sampled signal can take on any of an infinite number of possible values within a particular range. This infinite set of possible amplitudes has to be mapped to a finite number of practically required amplitude levels by mapping different ranges of amplitudes to a set of allowed levels. This requires the next block in the ADC, namely the quantizer.


4.6 Quantizing: Uniform Quantizing

The quantizing operation is illustrated in Figure 4.7 for the case of $M = 2^B = 2^3 = 8$ levels.

[Figure 4.7: (a) staircase output-input characteristic of a 3-bit (B = 3, M = 8) uniform quantizer with decision thresholds at ±0,5, ±1,5, ±2,5, ±3,5 and output code words 000–111; (b) analog signal, flat-top PAM signal and quantized PAM signal waveforms sampled at intervals Ts; (c) the resulting sequence of code words as a line code with clock pulses]

Fig. 4.7. Illustration of waveforms in a PCM system: (a) output-input characteristic of a quantizer, (b) analog signal, flat-top PAM signal and quantized PAM signal waveforms, (c) PCM signal waveform

The quantizer discretizes the continuous amplitude of the sampled signal to discrete amplitude levels. The amplitudes of the samples are quantized by dividing the entire amplitude range into a finite set of sub-ranges and assigning the same amplitude value to all samples falling within a given sub-range. The value assigned to a particular sample is mainly obtained by the mathematical rounding operation. Several types of quantizers exist for this purpose: uniform, non-uniform (A-law quantizer, µ-law quantizer), differential, etc. The quantizer in Figure 4.7 is uniform, because all the quantizing levels on the ŝ axis and quantizing ranges on the s axis are distributed uniformly, i.e., all the quantizing steps are of equal size $\Delta$. PCM with this kind of quantizer is called linear PCM. Since we are approximating the analog sample values by a finite number of levels ($M = 8$ in this illustration), an error is introduced into the recovered analog signal because of the quantizing effect. The quantizing error samples $e(n)$ consist of the difference between the analog signal samples at the sampler input $s(n)$ and the output signal samples of the quantizer $\hat{s}(n)$, i.e.,

$$e(n) = s(n) - \hat{s}(n).$$

The quantizer output signal is actually a quantized PAM signal. If the rounding operation is used, the maximum quantizing error is equal to one half of the quantizer step size $\Delta$, i.e.,

$$-\frac{\Delta}{2} \le e(n) \le \frac{\Delta}{2}.$$

Due to this error the recovered analog signal differs from the analog signal at the sampler input; the recovered output signal can be considered equal to the input signal plus the quantizing error. Even if the sampling frequency is the minimal allowable, i.e. equal to the Nyquist frequency ($F_s = F_N = 2F_{max}$), or higher, and there is no noise in the communication channel between the PCM signal formation and analog signal reconstruction systems, there will still be quantizing noise on the recovered analog waveform due to this error.
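A minimal sketch in Python of the rounding quantizer just described (NumPy assumed; the B = 3, Smax = 4 V design mirrors Figure 4.7, the test signal is illustrative):

import numpy as np

def uniform_quantize(s, b_bits, s_max):
    """Round samples to M = 2**b_bits uniformly spaced levels over [-s_max, s_max]."""
    m_levels = 2 ** b_bits
    delta = 2 * s_max / m_levels                # quantizer step size
    idx = np.round(s / delta)                   # rounding => |e(n)| <= delta/2
    idx = np.clip(idx, -m_levels // 2, m_levels // 2 - 1)   # limiter at overload
    return idx * delta, idx.astype(int)         # quantized PAM values, level codes

rng = np.random.default_rng(0)
s = np.clip(rng.normal(0.0, 1.0, 10_000), -3.5, 3.5)
s_hat, codes = uniform_quantize(s, b_bits=3, s_max=4.0)
e = s - s_hat                                   # quantizing error samples e(n)
print("max |e(n)| =", np.abs(e).max(), " delta/2 =", 4.0 / 8)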


As shown in many literature sources [35–38], under certain assumptions the ratio of the recovered analog average signal power to the average quantizing noise power (the signal-to-noise ratio, SNR) is expressed as follows:

$$\mathrm{SNR}_{out} = 3 M^2 \left( \frac{s_{RMS}}{S_{max}} \right)^{2}, \qquad (4.2)$$

where $s_{RMS}$ is the RMS value of the input analog signal and $S_{max}$ is the peak design level of the quantizer. Recalling that $M = 2^B$, we may express the last equation in decibels as

$$\mathrm{SNR}_{out}\,[\mathrm{dB}] = 6{,}02\,B + 4{,}77 - 20 \log \frac{S_{max}}{s_{RMS}}, \qquad (4.3)$$

where B is the number of bits in the PCM code word. The last equation, sometimes called the 6 dB rule, discloses the essential feature of PCM: adding one bit to the PCM word yields a 6 dB improvement in SNR. However, the SNR depends on the signal level, and when the level decreases the SNR drops. The maximal SNR is reached when the signal reaches the peak design level of the quantizer. When the signal overruns this level, the quantizer begins to work as a limiter; the quantizing noise level then increases drastically and the SNR decreases. This equation is valid for various types of input signal waveforms and quantizer characteristics (uniform, nonuniform, etc.) under the assumption that there are no bit errors in the received bit stream. A graphical illustration of the SNR relation to the signal level and the PCM code word bit number is presented in Figure 4.8.

[Figure 4.8: SNR (dB, 5 to 55) versus input level 20·log(sRMS/Smax) (dB, -20 to 0) for B = 4, 6 and 8; the straight-line curves are spaced 12 dB apart, i.e. 6 dB per bit]

Fig. 4.8. SNR relation to signal value and PCM code word bit number
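For illustration, a short Python sketch of equation (4.3); the input levels chosen are illustrative:

import math

def pcm_snr_db(b_bits, s_rms, s_max):
    """6 dB rule: SNR in dB of a B-bit linear PCM quantizer, equation (4.3)."""
    return 6.02 * b_bits + 4.77 - 20 * math.log10(s_max / s_rms)

for b in (4, 6, 8):
    print(f"B={b}: {pcm_snr_db(b, 1.0, 1.0):5.2f} dB at full level, "
          f"{pcm_snr_db(b, 0.1, 1.0):5.2f} dB at -20 dB input")
# Each extra bit adds about 6 dB; every 20 dB drop of input level costs 20 dB of SNR.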

4.7 Encoding

The quantized signal has to be represented in some numeric form to obtain a digital representation, i.e. the PCM signal. The PCM signal is obtained from the quantized PAM signal by encoding each quantized sample value into a digital word, a code word [39]. The encoder is the next block in the ADC. Code words are just a numeric representation that the encoder assigns to each discretized amplitude level. Convenient methods for one-to-one mapping of amplitude levels to code words include natural binary coding and Gray coding. In natural binary mapping the discrete amplitude levels are coded in binary format. The exact code word that represents a particular quantized level is chosen considering what code system will be employed in the subsequent signal processing stages.

Consider that the sampler has discretized the samples in time and the quantizer maps discretized ranges of voltages to discrete amplitude levels. The binary code words of the encoder output are presented in the first column of Table 4.1. The second, fourth and sixth columns give three possible interpretations of the given code words: signed magnitude (folded binary), 1's complement and 2's complement form. The encoder just gives out the binary pattern, and it is up to the designer to decide how to interpret it in order to represent the quantization values. The interpretation influences which quantizer ranges will be mapped to specific values. The quantizer design and the interpretation of the encoder output always have to be chosen together.

The three representations listed above can accommodate both positive and negative values. A fourth one, the unsigned integer representation, can represent positive values only.

Table 4.1 Various interpretations of 3 bit code words

Code word | Folded binary value, V | Quantization range, V | 1's compl. value, V | Quantization range, V | 2's compl. value, V | Quantization range, V
   011    |  +3  | +2,5 – +3,5 |  +3  | +2,5 – +3,5 |  +3  | +2,5 – +3,5
   010    |  +2  | +1,5 – +2,5 |  +2  | +1,5 – +2,5 |  +2  | +1,5 – +2,5
   001    |  +1  | +0,5 – +1,5 |  +1  | +0,5 – +1,5 |  +1  | +0,5 – +1,5
   000    |  +0  | +0,0 – +0,5 |  +0  | +0,0 – +0,5 |   0  | -0,5 – +0,5
   111    |  -3  | -3,5 – -2,5 |  -0  | -0,5 – -0,0 |  -1  | -1,5 – -0,5
   110    |  -2  | -2,5 – -1,5 |  -1  | -1,5 – -0,5 |  -2  | -2,5 – -1,5
   101    |  -1  | -1,5 – -0,5 |  -2  | -2,5 – -1,5 |  -3  | -3,5 – -2,5
   100    |  -0  | -0,5 – -0,0 |  -3  | -3,5 – -2,5 |  -4  | -4,5 – -3,5

In signed magnitude representation, also called folded binary code, the most significant bit (MSB) of the binary pattern represents the sign of the number (positive/negative) and the remaining bits represent the magnitude. In 1's complement representation, the negative values are simply the 1's complements of the positive values: to convert a binary representation to 1's complement, one flips all the bits (1's to 0's and 0's to 1's). To convert to 2's complement form, convert the binary pattern to 1's complement and then add '1' to the result.

The designer can choose any of the above-mentioned forms to interpret the binary output of the encoder. If the designer interprets it as 1's complement or signed magnitude representation, there are two different representations for the value '0'. All further calculations in the DSP have to be done keeping this fact in mind, which threatens the reliability of the design. To avoid this ambiguity, 2's complement is always a good choice of interpretation; additionally, two's complement interpretation results in faster and simpler hardware. The two's complement encoding, however, represents unequal positive and negative voltage ranges (from -4,5 V to +3,5 V for a 3-bit quantizer), which must likewise be kept in mind in further DSP calculations. In the case of signed magnitude and one's complement encoding, the ranges of representable positive and negative voltages are equal (from -3,5 V to +3,5 V for a 3-bit quantizer), but they result in ambiguity in representing 0 V.
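A minimal sketch in Python of the three signed interpretations of Table 4.1 (the helper names are ours, introduced for illustration):

def signed_magnitude(bits: int, width: int = 3) -> int:
    sign = bits >> (width - 1)                   # MSB is the sign bit
    mag = bits & ((1 << (width - 1)) - 1)
    return -mag if sign else mag

def ones_complement(bits: int, width: int = 3) -> int:
    if bits >> (width - 1):                      # negative: flip all the bits
        return -((~bits) & ((1 << width) - 1))
    return bits

def twos_complement(bits: int, width: int = 3) -> int:
    return bits - (1 << width) if bits >> (width - 1) else bits

for pattern in range(8):
    print(f"{pattern:03b}: folded {signed_magnitude(pattern):+d}, "
          f"1's {ones_complement(pattern):+d}, 2's {twos_complement(pattern):+d}")
# 100 (folded binary) and 111 (1's complement) both give a "negative zero",
# exactly the ambiguity for 0 V discussed above and shown in Table 4.1.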

Gray code is another form of quantized signal representation; it is very popular and widely used. In Gray coding, neighboring code words differ by only one bit. In this case the minimal noise able to convert one symbol into a neighboring one can harm only one bit. By contrast, when neighboring code words differ by more than one bit, the same noise will harm more than one bit, which increases the overall bit error rate. It is even possible to correct single-bit errors when Gray coding is combined with forward error correction coding. Digital modulation techniques like M-PSK and M-QAM use Gray codes to represent the modulation symbols in baseband.

Table 4.2 illustrates the relationship between the natural binary code and the Gray code for 3-bit code words.


Table 4.2 Natural binary and Gray 3 bit code words

Natural binary code words: 000 001 010 011 100 101 110 111
Gray code words:           000 001 011 010 110 111 101 100
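A minimal sketch in Python of the natural binary to Gray mapping (and back), reproducing Table 4.2:

def binary_to_gray(n: int) -> int:
    return n ^ (n >> 1)             # neighboring values differ in exactly one bit

def gray_to_binary(g: int) -> int:
    n = 0
    while g:                        # XOR-fold the shifted copies back together
        n ^= g
        g >>= 1
    return n

print([f"{binary_to_gray(n):03b}" for n in range(8)])
# ['000', '001', '011', '010', '110', '111', '101', '100']  (as in Table 4.2)
assert all(gray_to_binary(binary_to_gray(n)) == n for n in range(8))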

At the physical layer, in electronic circuits, code words are represented in the form of some kind of line code. Often a unipolar non-return-to-zero (NRZ) code with rectangular pulse shaping is used, i.e., binary ones are represented by a high level (+A V) and binary zeros by a zero level (see Figure 4.7, c).

4.8 Bandwidth of PCM Signals

The spectrum of the PCM signal is not directly related to the spectrum of the input signal. The bandwidth of (serial) binary PCM waveforms depends on the bit rate R and on the line code and pulse shape used to represent the data. The bit rate R is

$$R = B F_s. \qquad (4.4)$$

It is shown in many sources [35–38] that for the no-aliasing case ($F_s \ge 2F_{max}$), and only when a unipolar NRZ line code with $\sin x / x$ pulse shaping is used to generate the PCM waveform, the minimal bandwidth of the PCM signal, $F_{PCM\,min}$, is obtained:

$$F_{PCM\,min} = R/2 = B F_s / 2. \qquad (4.5)$$

However, usually a more rectangular type of pulse shape is used, and consequently the bandwidth of the PCM waveform will be larger than this minimum. There are some typical waveforms generated by popular integrated circuits; one of them is based on the NRZ line code with a practically rectangular pulse shape. If rectangular pulse shaping is used, the absolute bandwidth is infinite. Fortunately, the practice of operating PCM communication systems demonstrates that sufficient performance is reached when communicating over a channel with the first-null bandwidth. For a PCM waveform generated by rectangular pulses, the first-null bandwidth is

$$F_{PCM} = R = B F_s. \qquad (4.6)$$

The lower bound of the bandwidth of such a PCM signal is

$$F_{PCM\,LB} = B F_N = 2 B F_{max} = 2 B \Delta F, \qquad (4.7)$$

where $\Delta F = F_{max}$ is the bandwidth of the corresponding baseband analog signal. Thus, for the values of B used in practice, the bandwidth of the serial PCM signal will be significantly larger than the bandwidth of the corresponding analog signal which it represents.

In principle, for transmission of an analog baseband signal over a two-wire transmission line channel, only $\Delta F$ Hz of bandwidth is needed. Using the simplest analog modulation, i.e. amplitude modulation, a transmission radio channel bandwidth of $2\Delta F$ Hz will be sufficient, whereas for transmission of the PCM version of the signal at least $2B\Delta F$ Hz of channel bandwidth is required.

Thus the transmission of a PCM signal requires a much wider communication channel than other types of signal conveying the same information. This means that the PCM signal contains a lot of redundant information.

Examples

The frequency components of an analog audio voice-frequency telephone signal occupy a band approximately from 300 Hz to 3400 Hz. This means the analog telephone signal maximum frequency is $F_{max} = 3{,}4$ kHz and the bandwidth $\Delta F_{An.Sig} = 3{,}4$ kHz. For transmission over a digital telephone system this signal has to be converted to a PCM signal. The minimal sampling frequency (Nyquist frequency)


which could be used for sampling is equal to 2×3,4 = 6,8 kSa/s. In practice this sampling frequency is never used, because some guard band is necessary to accommodate the finite slope of the low pass anti-aliasing filter frequency response; the steeper the slope, the more sophisticated and expensive the filter. Therefore the telephone signal is usually sampled at 8 kSa/s. As a rule, each sample value is represented by 8 bits. The bit rate of this binary PCM signal will be

$$R = F_s \cdot B = 8\ \text{kSa/s} \times 8\ \text{bits/Sa} = 64\ \text{kbit/s}.$$

If rectangular pulse shaping is used, the absolute bandwidth is infinite, but the first spectrum null bandwidth is

$$\Delta F_{PCM} = 1/T_b = R,$$

where $T_b$ is the one-bit transmission time and R the bit rate. In our case $\Delta F_{PCM} = 64$ kHz.

Thus for transmission of a pure PCM telephone signal a channel bandwidth of 64 kHz is needed. On the other hand, transmission of the analog version of the telephone signal, carrying the same information, is possible over a channel with only 3,4 kHz bandwidth. Even using the simplest analog modulation, for example amplitude modulation, a transmission channel bandwidth of 6,8 kHz will be sufficient.
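A minimal sketch in Python of this example's arithmetic (all the values come from the text; the code only automates the calculation):

F_MAX = 3_400    # Hz, highest telephone voice frequency
FS = 8_000       # Sa/s, practical sampling rate (the Nyquist minimum would be 6 800)
B = 8            # bits per sample

bit_rate = FS * B                 # 64 000 bit/s
first_null_bw = bit_rate          # Hz: for rectangular NRZ pulses, 1/Tb = R

print(f"PCM bit rate:       {bit_rate / 1000:.0f} kbit/s")
print(f"PCM first-null BW:  {first_null_bw / 1000:.0f} kHz")
print(f"Analog baseband BW: {F_MAX / 1000:.1f} kHz")
print(f"Analog AM channel:  {2 * F_MAX / 1000:.1f} kHz")
# The PCM version needs roughly 10-20 times the analog bandwidth: redundancy
# that the compression methods of chapter 5 set out to remove.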

In Table 4.3 we present data for several PCM signals.

Table 4.3 Parameters of various PCM signals

Signal type                              | Fmax      | Fs           | B   | R             | S/N, dB | ΔF
Telephone signal                         | 3,4 kHz   | 8 kHz        | 8   | 64 kbit/s     | 38      | (64÷128) kHz
Audio signal (mono)                      | 20,0 kHz  | 48 kHz       | 16  | 768 kbit/s    | >90     | (768÷1536) kHz
PAL TV luminance signal (Y)              | 5,0 MHz   | 13,5 MHz     | 8   | 108 Mbit/s(2) | 48      | (108÷216) MHz
PAL TV color-difference signals (U,V)    | 2×2,5 MHz | 2×6,75 MHz   | 2×8 | 108 Mbit/s(3) |         | (108÷216) MHz
SDTV luminance signal (Y)(1)             |           | 13,5 MHz     | 8   | 108 Mbit/s(2) |         | (108÷216) MHz
SDTV color-difference signals (CB,CR)(1) |           | 2×6,75 MHz   | 2×8 | 108 Mbit/s(3) |         | (108÷216) MHz
HDTV luminance signal (Y)(4)             |           | 74,25 MHz    | 8   | 594 Mbit/s(5) |         | (594÷1188) MHz
HDTV color-difference signals (CB,CR)(4) |           | 2×37,125 MHz | 2×8 | 594 Mbit/s(6) |         | (594÷1188) MHz

(1) In accordance with Recommendation ITU-R BT.601-7 "Studio encoding parameters of digital television for standard 4:3 and wide-screen 16:9 aspect ratios"; (2) for 625 lines, 25 Hz frame rate, 8 bits and 864 pixels per line; (3) for 625 lines, 25 Hz frame rate, 8 bits, 432 pixels per line and 2 color signals; (4) in accordance with Recommendation ITU-R BT.709-5 "Parameter values for the HDTV standards for production and international programme exchange"; (5) for 1125 lines, 25 Hz frame rate, 8 bits and 2640 samples per line; (6) for 1125 lines, 25 Hz frame rate, 8 bits, 1320 samples per line and 2 color signals.

4.9 Linear PCM Advantages and Disadvantages

The linear PCM encoding method is widely used in practice, generally in systems operating with uncompressed audio data (compact discs, digital telephony, HDMI, DVD, etc.), because it has several important advantages:


– Relatively inexpensive digital circuitry may be used. Digital circuits are more reliable and can be produced at lower cost than analog circuits. Digital hardware also lends itself to more flexible implementation than analog hardware.

– PCM signals derived from all types of analog sources may be merged together with data signals and transmitted over a common high-speed digital communication system.

– In long-distance digital telephone systems requiring repeaters, a clean PCM waveform can be regenerated at the output of each repeater, where the input consists of a noisy PCM waveform. The shape of the waveform is affected by two mechanisms: first, all transmission lines and circuits have a somewhat non-ideal transfer function, which distorts the ideal pulse; second, unwanted electrical noise or other interference further distorts the pulse waveform. Both mechanisms degrade the pulse shape as a function of distance. While the transmitted pulse can still be reliably identified, it is regenerated. The circuits that perform this function at regular intervals along a transmission system are called regenerative repeaters.

– The noise performance of a digital system can be superior to that of an analog system. In a digital signal there are two distinct states, on and off, and mistaking one for the other requires a significant amount of noise. In an analog signal there is a continuum of states representing the signal, and noise can shift each state to an adjacent one. Therefore a PCM receiver will work correctly under much higher noise levels than an analog one. This low noise susceptibility allows PCM signals to be transmitted farther than analog signals without signal degradation, information loss and distortion.

– The probability of error at the system output can be reduced even further by the use of appropriate coding techniques. The digital signal also allows the PCM system to perform additional functions, because additional data can be added to the signal. PCM systems send what is known as a "checksum", which adds redundancy by allowing the receiver to check that the data it receives is not corrupt. In fact, the checksum is so effective that it is nearly impossible for a receiver to incorrectly accept invalid data. If the receiver determines that the data is invalid, it can hold the last position, ask the transmitter to repeat the last data portion, or correct the corrupted portion using redundant information added during the encoding process.

– A PCM signal can be modulated in such a way that only a specific decoder can make sense of the underlying data. This is useful when the data being sent requires a level of security. The transmitter and receiver each have circuitry that is analogous to a dictionary, mapping the binary pulse codes to their definitions. When a pulse code is received, the receiver looks up its meaning in the dictionary. Anyone who intercepted the PCM signal would be left with meaningless binary data.

– A PCM waveform may be saved for later reproduction or playback. Since PCM data is digital, it can be stored using a computer or similar device. An example of a consumer device that stores PCM data is the Digital Versatile Disc (DVD). The audio portion of a DVD movie is encoded using PCM with a sampling rate as high as 192 thousand samples per second. This PCM stream can be piped directly to an amplifier over a digital audio cable, where it is decoded into an audible signal.

However, linear PCM also has some significant disadvantages:

– The wideness of the frequency band.

– The sensitivity of the SNR to variations of the ratio $S_{max}/s_{RMS}$ (see Figure 4.8). $s_{RMS}$, and hence this ratio, varies a lot across sounds, speakers and environments; the variations can exceed 40 dB. As a result, if we want to ensure a certain SNR over some dynamic signal range, we have to choose the number of code word bits for the smallest signal level. At even slightly higher signal levels we then get a better SNR than required, which is equivalent to an over-sized number of code word bits. Therefore in most cases uniform quantization is not cost effective in terms of the number of code word bits. The reason for all of this is the uniform distribution of quantization steps irrespective of the probability density of the signal levels. This will be explained in detail in the following sections.

4.10 Digitization of Video Signals

4.10.1 Introduction

For a number of years, video professionals at television studios have been using various digital

formats, such as D1 (components) and D2 (composite), for recording and editing video signals. In

order to ease the interoperability of equipment and international program exchange, the former CCIR

(Comite Consultatif International des Radiocommunications), now called the ITU-R (International

Telecommunications Union Radiocommunication Sector), has standardized conditions of digitization

and interfacing of digital video signals in component form (Y, Cr, Cb, in 4:2:2 format) [20].

The main advantages of these digital formats are that they allow multiple copies to be made without any degradation in quality and the creation of special effects not otherwise possible in analog format; they also simplify editing of all kinds and permit international exchange independent of the broadcast standard used for transmission (NTSC, PAL, SECAM, D2-MAC, MPEG). The drawback, however, is the very high bit rate, which makes these formats unsuitable for transmission to the end user without prior signal compression.

For component video signals from a studio source, which can have a bandwidth of up to 6 MHz,

the CCIR prescribes a sampling frequency of Fs=13,5 MHz locked on the line frequency. This

frequency is independent of the scanning standard, and represents 864×Fh for 625-line systems and

858×Fh for 525-line systems. The number of active samples per line is 720 in both cases. In such a

line locked sampling system, samples are at the same fixed place on all lines in a frame, and also from

frame to frame, and so are situated on a rectangular grid. For this reason, this sampling method is

called orthogonal sampling, as opposed to other sampling schemes used for composite video sampling

(4×Fsc subcarrier locked sampling for instance).

The most economic method in terms of bit rate for video signal digitization seems to be to use

the composite signal as a source; however, the quality will be limited by its composite nature. Taking

into account the fact that 8 bits (corresponding to 256 quantization steps) is the minimum required

for a good signal to quantization noise ratio (Sv/Nq=59 dB), the bit rate required by this composite

digitization is 13,5×8=108 Mb/s, which is evidently too high even for advanced digital data

transmission systems with different multilevel modulation techniques (see Table 4.4).

Table 4.4 Modulation type and required bandwidth for a transmission rate of 108 Mb/s

Modulation type                                       | bits/Hz | Required bandwidth (MHz)
BPSK – binary phase shift keying                      |    1    | 108
QPSK – quadrature phase shift keying                  |    2    | 54
8-ary PSK – 8-ary phase shift keying                  |    3    | 36
256-ary QAM – 256-ary quadrature amplitude modulation |    8    | 13,5

However, digitization of a composite signal has little advantage over its analog form for production purposes (practically the only one being the possibility of multiple copies without degradation). Therefore this is not the preferred method for source signal digitization in broadcast applications, as the composite signal is not very suitable for most signal manipulations (editing, compression) or for international exchanges.

4.10.2 Digitization formats

4.10.2.1 The 4:2:2 format

ITU-R Recommendation BT.601 (formerly CCIR Rec. 601) [20] defines

digitization parameters for video signals in component form based on a Y, Cb, Cr signals in 4:2:2

format (four Y samples for two Cb samples and two Cr samples) with 8 bits per sample (with a

provision for extension to 10 bits per sample). The sampling frequency is 13,5 MHz for luminance

and 6,75 MHz for chrominance, regardless of the standard of the input signal. This results in 720

active video samples per line for luminance, and 360 active samples per line for each chrominance.

The position of the chrominance samples corresponds to the odd samples of the luminance (see Figure

4.9).

Chrominance signals Cr and Cb being simultaneously available at every line, vertical resolution

for chrominance is the same as for luminance (480 lines for 525-line systems, 576 lines for 625-line

systems). The total bit rate resulting from this process is 13,5×8+2×6,75×8=216 Mb/s. With a

quantization of 10 bits, the bit rate becomes 270 Mb/s. However, if one takes into account the

redundancy involved in digitizing the inactive part of the video signal (horizontal and vertical

blanking periods), the useful bit rate goes down to 166 Mb/s with 8 bits per sample. These horizontal

and vertical blanking periods can be filled with other useful data, such as digital sound, sync, and

other information.

[Figure 4.9: sampling grid with luminance samples every 74 ns (13,5 MHz) and chrominance samples every 148 ns (6,75 MHz) on every line]

Fig. 4.9. Position of samples in the 4:2:2 format
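A minimal sketch in Python of the bit-rate arithmetic of this section (the HD raster assumed in the last lines, 1920×1080 at 25 frames/s in 4:4:4 with 8 bits, is one illustrative choice within the range quoted in section 4.10.2.6):

def bitrate_422_sd(bits_per_sample=8):
    f_y, f_c = 13.5e6, 6.75e6            # luma / chroma sampling frequencies, Hz
    return (f_y + 2 * f_c) * bits_per_sample

print(f"4:2:2 SD,  8 bits: {bitrate_422_sd(8) / 1e6:.0f} Mb/s")    # 216 Mb/s
print(f"4:2:2 SD, 10 bits: {bitrate_422_sd(10) / 1e6:.0f} Mb/s")   # 270 Mb/s

hd_444 = 1920 * 1080 * 25 * 3 * 8        # pixels x frames/s x components x bits
print(f"HD 4:4:4 approx:   {hd_444 / 1e9:.2f} Gb/s")  # ~1,24 Gb/s, within 1-1,5 Gb/s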

ITU-R Recommendation BT.656 (formerly CCIR-656) [40] defines standardized electrical

interfacing conditions for 4:2:2 signals digitized according to recommendation CCIR-601. This is the

format used for interfacing D1 digital video recorders, and is therefore sometimes referred to as the

D1 format.

The parallel version of this recommendation provides the signal in a multiplexed form (Cr1, Y1, Cb1, Y2, Cr3, Y3, Cb3, …) on an 8-bit parallel interface, together with a 27 MHz clock (one clock period per sample). Synchronization and other data are included in the data flow. The normalized connector is a DB25 plug.

There is also a serial form of the CCIR-656 interface for transmission on a 75Ω coaxial cable

with BNC connectors, requiring a slightly higher bit rate (243 Mb/s) due to the use of 9 bits per

sample in this mode.

4.10.2.2 The 4:2:0 format

This format is obtained from the 4:2:2 format by using the same chroma samples for two successive lines, in order to reduce the amount of memory required in processing circuitry, while at the same time giving a vertical chrominance resolution of the same order as the horizontal one. Luminance and horizontal chrominance resolutions are the same as for the 4:2:2 format, and thus:
– luminance resolution: 720×576 (625 lines) or 720×480 (525 lines);
– chrominance resolution: 360×288 (625 lines) or 360×240 (525 lines).

Figure 4.10 shows the position of samples in the 4:2:0 format.

[Figure 4.10: luminance samples at 13,5 MHz (74 ns) and chrominance samples at 6,75 MHz (148 ns), one chrominance line shared by two luminance lines]

Fig. 4.10. Position of samples in the 4:2:0 format

In order to avoid the chrominance line flickering observed in SECAM at sharp horizontal transients (due to the fact that one chrominance component comes from the current line and the second from the preceding one), the Cb and Cr samples are obtained by interpolating the 4:2:2 samples of the two successive lines. The colour of a line is thus determined by both the previous and the following chrominance values.

The 4:2:0 format is of special importance, as it is the input format for D2-MAC and MPEG-2 coding.

4.10.2.3 The source intermediate format (SIF)

This format is obtained by halving the spatial resolution in both directions as well as the temporal resolution, which becomes 25 Hz for 625-line systems and 29,97 Hz for 525-line systems. Depending on the originating standard, the spatial resolutions are then:
– luminance resolution: 360×288 (625 lines) or 360×240 (525 lines);
– chrominance resolution: 180×144 (625 lines) or 180×120 (525 lines).

Figure 4.11 illustrates the position of the samples in the SIF format. The horizontal resolution is obtained by filtering and subsampling the input signal. The reduction in temporal and vertical resolution is normally obtained by interpolating samples of the odd and even fields, but is sometimes achieved by simply dropping every second field of the interlaced input format. The resolution obtained is the basis for MPEG-1 encoding, and results in a so-called "VHS-like" quality in terms of resolution. This format is used in the H.261 and H.264 standards. In this case the image is encoded using prediction, transformation and motion compensation.

[Figure 4.11: luminance and chrominance sampling grids derived from the 13,5 MHz (74 ns) and 6,75 MHz (148 ns) input grids, with the discarded samples marked]

Fig. 4.11. Position of samples in the SIF format

4.10.2.4 The common intermediate format (CIF)

This is a compromise between European and American SIF formats: spatial resolution is taken

from the 625-line SIF (360×288) and temporal resolution from the 525-line SIF (29,97 Hz). It is the

basis used for video conferencing.


4.10.2.5 The quarter CIF (QCIF)

Once again, this reduces the spatial resolution by 4 (2 in each direction) and the temporal resolution by 2 or 4 (to 15 or 7,5 Hz). It is the input format used for ISDN video telephony with the H.261 compression algorithm.

4.10.2.6 High definition formats 720p, 1080i

Two standard picture formats have been chosen for broadcast HDTV applications, each existing in two variants (59,94 Hz or 50 Hz depending on the continent):
– The 720p format: a progressive scan format with a horizontal resolution of 1280 pixels and a vertical resolution of 720 lines (or pixels).
– The 1080i format: an interlaced format offering a horizontal resolution of 1920 pixels and a vertical resolution of 1080 lines (or pixels).

For these two formats, the horizontal and vertical resolutions are equivalent (square pixels)

because they have the same ratio as the aspect ratio of the picture (16:9).

A quick calculation of the required bit rate for the digitization in 4:4:4 format of these two HD

formats gives bit rates on the order of 1 to 1,5 Gb/s depending on the frame rate and resolution, which

is 4 to 5 times greater than for standard definition interlaced video.

4.10.3 Transport problems

It is clear that a bit rate of the order of 200 Mb/s, as required by the 4:2:2 format, cannot be

used for direct broadcast to the end user, as it would occupy a bandwidth of the order of 40 MHz with

a 64-QAM modulation (6 bits/symbol) used for cable, or 135 MHz with a QPSK modulation

(2 bits/symbol) used for satellite. This would represent 5-6 times the bandwidth required for

transmission of an analog PAL or SECAM signal, and does not even take into account any error

correction algorithm. It would of course be even more unthinkable with the 4 to 5 times higher bit

rates generated by the digitization of high definition pictures in 720p or 1080i format.

Compression algorithms have, however, been in use for some years for contribution links in the field of professional video, reducing this bit rate to 34 Mb/s; but this is still too high for consumer applications, as it gives no capacity advantage over existing analog transmissions. For a time the compression problem did not seem likely to be solved soon, and hybrid standards such as D2-MAC were therefore developed and proposed. However, the very rapid progress made in compression techniques and IC technology in the second half of the 1980s made these systems obsolete soon after their introduction.

The essential condition for starting digital television broadcast services was the development of technically and economically viable solutions to problems which can be classified into two main categories:
– Source coding. This is the technical term for compression. It encompasses all video and audio compression techniques used to reduce as much as possible the bit rate (in Mb/s) required to transmit moving pictures of a given resolution and the associated sound, with the lowest perceptible degradation in quality.
– Channel coding. This consists of developing powerful error correction algorithms associated with the most spectrally efficient modulation techniques (in terms of Mb/s per MHz), taking into account the available bandwidth and the foreseeable imperfections of the transmission channel.


5. SPEECH, AUDIO and VIDEO SIGNALS COMPRESSION by QUANTIZATION

5.1 Necessity and possibility of compression

The main disadvantage of PCM, the wide frequency band requiring an enormously wideband communication channel, can be mitigated by removing redundant information, and for this we must apply some kind of signal compression. From the other side, bandwidth reduction is needed to transmit a given amount of data in a given time over a communication channel with a given bandwidth.

The main compression idea follows from equation (4.6). It shows that reduction of bandwidth is possible through reduction of the bit rate, and that bit rate reduction comes down to reducing the number of bits in the PCM code word, because reducing the sampling frequency below the Nyquist frequency is not possible due to the Nyquist criterion.

In electronics, computer science and information theory the terms signal compression, data compression, source coding and bit-rate reduction are synonyms and denote encoding information using fewer bits than the original representation. Compression can be either lossy or lossless. Lossless compression reduces bits by identifying and eliminating statistical redundancy; no information is lost. Lossy compression reduces bits by identifying unnecessary information and removing it. The process of reducing the size of a data file is popularly referred to as data compression, although its formal name is source coding (coding done at the source of the data before it is stored or transmitted).

Compression is useful because it helps reduce resource usage, such as data storage space or transmission capacity. Because compressed data must be decompressed before use, this extra processing imposes computational or other costs. Data compression is thus subject to a space-time complexity trade-off. For instance, a compression scheme for video may require expensive hardware for the video to be decompressed fast enough to be viewed as it is being decompressed, while the option to decompress the video in full before watching may be inconvenient or require additional storage. The design of data compression schemes involves trade-offs among various factors, including the degree of compression, the amount of distortion introduced (when using lossy compression), and the computational resources required to compress and decompress the data.

Audio and video data compression has the potential to reduce both the transmission bandwidth and the storage requirements of audio and video data.

5.2 Nonuniform quantizing: µ-law and A-law companding

The first step toward modern audio data compression techniques was taken at an early stage of digital telephony development, when nonuniform quantizing based on µ-law and A-law companding was introduced. Even earlier, µ-law and A-law companding had been used in analog telephony for dynamic range compression [29, 31, 33, 35]. Transferring this technique to digital telephony brought an additional benefit: it made it possible to save some bits in the PCM code words, compared with uniform quantization, without losing speech quality.

The idea of nonuniform quantization is based on the desire to avoid the main disadvantage of uniform quantization: the dependence of the SNR on the input signal level, caused by the uniform distribution of quantization steps. Nonuniform quantizers attempt to match the distribution of quantization levels to the probability density of the input speech signal by allocating more quantization levels in regions of high probability and fewer levels in regions where the probability is low. This idea is directly supported by the fact that analog voice signals are more likely to have amplitude values near zero than at the allowed extreme peak values. This is illustrated in Figure 5.1, which shows some approximations of the analog voice signal probability density function (pdf) [47].


[Figure 5.1: probability density p(s) versus amplitude s (from -6 to 6), comparing the experimental distribution (dots) with a Gamma approximation (telephone voice) and a Laplace approximation (music)]

Fig. 5.1. Approximations of the voice analog signal probability density function (experimental measurement values are labeled by dots):

Gamma distribution: $p(s)=\sqrt{\dfrac{\sqrt{3}}{8\pi\sigma_s |s|}}\,\exp\!\left(-\dfrac{\sqrt{3}\,|s|}{2\sigma_s}\right)$;  Laplace distribution: $p(s)=\dfrac{1}{\sqrt{2}\,\sigma_s}\,\exp\!\left(-\dfrac{\sqrt{2}\,|s|}{\sigma_s}\right)$,

where $p(s)$ is the probability density function and $\sigma_s$ is the standard deviation of the signal s.

Independence of the SNR from the signal level corresponds to a constant percentage quantization error, in contrast to the constant absolute quantization error (independent of signal amplitude) of uniform quantization.

As theory shows [31, 33], to get a constant percentage error we have to space the quantization levels logarithmically. However, an exactly logarithmic spacing is impractical, because $\ln s \to -\infty$ as $s \to 0$, so the characteristic would demand unbounded output levels for very small input levels. Some practical approximations to a logarithmic compression characteristic have therefore been proposed: the µ-law characteristic and the A-law characteristic [31, 33]. The µ-law characteristic was implemented in the USA and Japan, and the A-law characteristic in European telephone systems. The equation describing the A-law characteristic is

$$w(n) = F[s(n)] = \begin{cases} \dfrac{A\,s(n)}{1+\ln A}, & 0 \le \dfrac{|s(n)|}{S_{max}} \le \dfrac{1}{A};\\[1.5ex] \dfrac{1+\ln\!\left(A\,|s(n)|/S_{max}\right)}{1+\ln A}\,S_{max}\,\operatorname{sgn} s(n), & \dfrac{1}{A} \le \dfrac{|s(n)|}{S_{max}} \le 1. \end{cases} \qquad (5.1)$$

Here $s(n)$ are the input signal samples, $w(n)$ the output signal samples, and A a parameter whose standard value is $A_{stand} = 87{,}56$. Figure 5.2 shows a family of curves of $w(n)$ versus $s(n)$ for different values of A. The distribution of quantization levels for an A-law 3-bit quantizer with $A = A_{stand} = 87{,}56$ is shown in Figure 5.3.


[Figure 5.2: compression curves w(n) = F{s(n)} versus s(n), from 0 to Smax, for A = 5, 15, 30, 60 and 87,56; Figure 5.3: staircase input-output characteristic ŝ(n) versus s(n) of the 3-bit A-law quantizer with nonuniformly spaced levels s1…s4]

Fig. 5.2. Input-output relations for an A-law characteristic
Fig. 5.3. Distribution of quantization levels for the A-law 3-bit quantizer with A = A_stand = 87,56
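A minimal sketch in Python of the compression characteristic (5.1), normalized so that Smax = 1 (the test amplitudes are illustrative):

import math

def a_law_compress(s, a=87.56):
    """Map a sample s in [-1, 1] through the A-law characteristic of (5.1)."""
    mag = abs(s)
    if mag <= 1.0 / a:                      # small signals: linear branch
        w = a * mag / (1.0 + math.log(a))
    else:                                   # large signals: logarithmic branch
        w = (1.0 + math.log(a * mag)) / (1.0 + math.log(a))
    return math.copysign(w, s)

for s in (0.001, 0.01, 0.1, 0.5, 1.0):
    print(f"s = {s:5.3f} -> w = {a_law_compress(s):.3f}")
# Small inputs are amplified (slope A/(1+ln A), about 16); large ones are compressed.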

It is clear that using the function of equation (5.1) avoids the problem of small input amplitudes, since $w(n) \to 0$ when $s(n) \to 0$. For small values of $s(n)$ ($0 \le |s(n)|/S_{max} \le 1/A$), equation (5.1) reduces to

$$w(n) = \frac{A}{1+\ln A}\,s(n), \qquad (5.2)$$

i.e., the quantization levels are uniformly spaced. However, for large $s(n)$ ($1/A \le |s(n)|/S_{max} \le 1$) we have

$$w(n) = S_{max}\,\frac{1 + \ln A - \ln S_{max} + \ln|s(n)|}{1+\ln A}\,\operatorname{sgn} s(n),$$

i.e., the quantization levels are logarithmically spaced, which guarantees a constant percentage error. In this range of input signal levels the SNR is expressed as

$$\mathrm{SNR}\,[\mathrm{dB}] = 6{,}02\,B + 4{,}77 - 20\lg(1+\ln A),$$

which with $A = A_{stand} = 87{,}56$ gives

$$\mathrm{SNR}\,[\mathrm{dB}] = 6{,}02\,B - 9{,}99.$$

Notice that the output SNR is relatively insensitive to the input level, as shown in Figure 5.4. Moreover, for weak input signals A-law companding results in a better SNR than uniform quantization, while for strong input signals uniform quantization is superior to nonuniform quantization with A-law companding.

Investigations of speech signal quality [31, 33] show that approximately the same quality is reached with 11-bit uniform quantization and with 8-bit nonuniform quantization (A-law companding), i.e., A-law companding allows us to save approximately 3 bits, and sometimes even more, nearly 4 bits.

Despite the above-mentioned merits, quasi-logarithmic quantization has some negative properties. First, the SNR depends on variations of the signal level when the signals are weak. Second, in the range of strong signals the constancy of the SNR is achieved at the expense of an SNR reduction in comparison with uniform quantization with the same number of bits (see Figure 5.4). Third, quasi-logarithmic quantization presupposes non-linear amplification; even at very small input signal levels the quasi-logarithmic characteristic decreases the relative level of the output signal, which is a completely undesirable phenomenon.


[Figure 5.4: SNR (dB, 5 to 55) versus input level 20·log(sRMS/Smax) (dB, -40 to 0); the A-law (A = 87,56) curve stays nearly flat over a wide range, while the uncompanded uniform quantizer is better near full scale but falls off rapidly at low levels]

Fig. 5.4. Output SNR versus input signal level of 8-bit PCM systems with and without companding

In practice, the smooth nonlinear A-law characteristic is approximated by 7 piecewise linear chords covering 8 segments (one chord for the 1st and 2nd segments), as shown in Figure 5.5. In turn, each chord is approximated by 16 uniformly spaced steps, as in uniform quantization. The input step size of each chord is set by the particular segment number: 16 steps of width Δ are used for segments 1 and 2; for segment 3, 16 steps of width 2Δ; for segment 4, 16 steps of width 4Δ; for segment 5, 16 steps of width 8Δ; and so on, up to the last, 8th segment, with 16 steps of width 64Δ. As shown in Figure 5.6, each 8-bit PCM code word consists of a sign bit denoting a positive or negative input voltage (1 for positive, 0 for negative), three chord bits representing the segment number, and four step bits representing the particular step within the segment.

[Figure 5.5: piecewise linear approximation of the A-law characteristic on normalized axes (ŝ_norm versus s_norm), with 8 segments of 16 steps each and break points at 1/64, 1/32, 1/16, 1/8, 1/4 and 1/2 of full scale; Figure 5.6: 8-bit code word layout – bit 1: sign bit; bits 2–4: chord bits (segment number); bits 5–8: step bits (number of the quantizing interval); e.g. code words 11010000–11011111 cover segment 6]

Fig. 5.5. Practical approximation of the A-law characteristic
Fig. 5.6. PCM code word structure
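A minimal sketch in Python of the sign/chord/step encoding of Figures 5.5 and 5.6 (the normalization to full scale 1,0 and the exact segment edges are assumptions made for illustration):

def a_law_segment_encode(s: float) -> str:
    sign = 1 if s >= 0 else 0                # 1 = positive, 0 = negative voltage
    mag = min(abs(s), 1.0)
    # Segment lower edges on the normalized input axis (segments 1..8).
    edges = [0, 1 / 128, 1 / 64, 1 / 32, 1 / 16, 1 / 8, 1 / 4, 1 / 2]
    seg = max(i for i, e in enumerate(edges) if mag >= e)    # 0..7
    top = edges[seg + 1] if seg < 7 else 1.0
    width = (top - edges[seg]) / 16          # 16 uniform steps per segment
    step = min(int((mag - edges[seg]) / width), 15)
    # chord bits = segment index (segment 6 -> 101, as in the Figure 5.6 example)
    return f"{sign}{seg:03b}{step:04b}"

for s in (0.002, 0.05, -0.3, 0.9):
    print(f"{s:+.3f} -> {a_law_segment_encode(s)}")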


5.3 Adaptive Quantization

5.3.1 Maximum SNR coding

The nonuniform (A-law) quantizer attempts to achieve a constant SNR over a wide range of signal levels, usually expressed in logarithmic power units (dB). However, this is achieved at some sacrifice of the SNR performance that could be reached if the quantizer step size were matched to the signal power. This suggests the idea of choosing the quantizer levels so as to minimize the quantization error power, and thus maximize the SNR, when the signal power is known. Detailed analysis shows that this method additionally leads to matching the quantization intervals to the probability distribution of the signal. Attempts were made to use model probability distributions (Gaussian, Gamma or Laplacian). The method was called maximum (or optimum) SNR coding [47]. Although this kind of quantizer yields minimum mean squared error when matched to the power and amplitude distribution of the signal, the non-stationary nature of speech leads to unsatisfactory results. Therefore this method remains interesting mainly in theory.

5.3.2 Adaptation principle

Hence, the quantization dilemma of speech signals remains. On one side it would be desirable to choose the quantization steps as small as possible, so as to minimize the quantization noise. On the other side the number of steps has to be limited, because it is directly related to the number of code word bits; the steps therefore have to be large enough to accommodate the maximum peak-to-peak range of the signal. The non-stationary nature of speech complicates this problem greatly. Fixed quantizers (uniform and nonuniform) may yield a good SNR, but only under the assumption that the signal is stationary. This assumption does not hold in practical cases, where the signal level fluctuates.

One proposed solution [48] is to adapt the properties of the quantizer to the level of the input signal. The second is to adapt the signal level to the fixed properties of the quantizer. The idea of the first solution is illustrated in Figure 5.7.

[Figure 5.7: encoder with quantizer Q driven by a variable step size Δ(n), producing code words c(n); the decoder uses the same Δ(n) to reconstruct ŝ(n)]

Fig. 5.7. Block diagram representation of adaptive quantization with variable step size

The essence of this solution is to vary the quantization step size Δ(n) to match the input signal power. This can be achieved by keeping the instantaneous step size Δ(n) proportional to the square root of the input signal power, $\sqrt{P_s}$. In the case of a nonuniform quantizer, this would imply that the quantization levels and ranges are linearly scaled to match the signal power $P_s$.

The second solution is illustrated in Figure 5.8.

[Figure 5.8: encoder applies a variable gain G(n), forming w(n) = s(n)·G(n) before a fixed quantizer Q; the decoder restores ŝ(n) = ŵ(n)/G(n)]

Fig. 5.8. Block diagram representation of adaptive quantization with variable gain


Here a variable gain G(n) is used, followed by a fixed quantizer: one with a fixed step size Δ in the case of a uniform quantizer, or with quasi-logarithmically spaced step sizes in the case of a nonuniform quantizer. The gain G(n) has to be varied so as to keep the power of $w(n) = s(n)\,G(n)$ relatively constant. G(n) has to be proportional to $1/\sqrt{P_s}$ (or $1/s_{RMS}$) to give $P_w \approx$ constant. Notice that the square root of the normalized signal power is the signal root-mean-square (RMS) value.

A reliable estimate of the signal power $P_s$ is needed for both types of adaptive quantization. Moreover, besides transmitting the coded useful information c(n), transmission of supplementary information about the quantization step size Δ(n) or the gain G(n) is compulsory. This increases the amount of transmitted information and requires a channel with increased capacity.

5.3.3 Types of adaptation

Time varying properties of speech signals indicate their dual character. The amplitude changes

rapidly from sample to sample or within a few samples. These are instantaneous changes. Meanwhile

the peak amplitude remains essentially unchanged for relatively long time intervals, for example 10

to 20 milliseconds, corresponding to the duration of pronunciation of syllables. These are syllabic

variations.

Speaking about adaptive quantization, it is necessary to distinguish between rapid or instantaneous adaptation and slow or syllabic adaptation.

Another distinction among adaptive quantization schemes concerns the way in which the control signals are estimated. In one case they are estimated from the input signal; such schemes are called feed-forward adaptive quantizers. In the other case the control signals are estimated from the quantizer output signal ŝ(n) or from the output code words c(n); these are called feed-backward quantizers.

5.3.4 Feed-forward adaptation

The general block diagrams of feed-forward adaptation systems with time varying step size and

with time varying gain are presented in Figures 5.9 and 5.10 respectively.

[Figure 5.9: feed-forward structure – the step size adaptation system derives Δ(n) from the input s(n); both the code words c(n) and the step size Δ(n) are transmitted to the decoder]

Fig. 5.9. Block diagram of feed-forward adaptive quantizer with time varying step size

Most systems of this type attempt to obtain an estimate of the time-varying input signal power $P_s$. Then the step size or the quantization levels are made proportional to the RMS value of the input signal, $s_{RMS}$, or the gain applied to the input is made inversely proportional to the RMS value, i.e. proportional to $1/s_{RMS}$. The step size in Figure 5.9 would therefore be of the form

$$\Delta(n) = \Delta_0\, s_{RMS}(n) \qquad (5.3)$$

and the time varying gain in Figure 5.10 would be of the form

$$G(n) = G_0 / s_{RMS}(n), \qquad (5.4)$$

where $\Delta_0$ and $G_0$ are some constants.

[Figure 5.10: feed-forward structure – the gain adaptation system derives G(n) from the input s(n); both the code words c(n) and the gain G(n) are transmitted to the decoder]

Fig. 5.10. Block diagram of feed-forward quantizer with time varying gain

Practically it is impossible to realize a gain or step size of arbitrary value. Therefore it is common to limit the variations of the gain and of the step size in the form

$$G_{min} \le G(n) \le G_{max}, \qquad \Delta_{min} \le \Delta(n) \le \Delta_{max}.$$

The ratio of these limits determines the dynamic range of the system. Obtaining a relatively constant SNR over a dynamic range of 40 dB requires

$$G_{max}/G_{min} = 100, \qquad \Delta_{max}/\Delta_{min} = 100.$$

The RMS value of the input signal can be estimated in several different ways. The strict definition of the RMS value of an analog input signal would be

$$s_{RMS} = \lim_{T\to\infty}\sqrt{\frac{1}{T}\int_{-T/2}^{T/2} s^2(t)\,dt}\,.$$

However, practical estimations differ from this definition, because the operations of taking the limit and of integration can be realized only approximately. A common practical approach is

$$s_{RMS}(n) = \sqrt{\sum_{m} s^2(m)\, h(n-m)}\,, \qquad (5.5)$$

where $h(n)$ is the impulse response of a low pass filter. In one of the simplest cases

$$h(n) = \begin{cases} \alpha^{\,n-1}, & n \ge 1;\\ 0, & n < 1. \end{cases} \qquad (5.6)$$

Here $\alpha$ is a constant whose values have to be in the range $0 < \alpha < 1$. The value of $\alpha$ controls the effective averaging interval, which strongly influences the estimate of $s_{RMS}$ and thereby the ultimate quantization results. In some sense this kind of low pass filtering is equivalent to a weighted addition of squared signal samples in which the more recent samples have larger weights than the older ones. For illustrative purposes, Figure 5.11 presents some audio signal examples at different points of feed-forward adaptive quantization systems; for the calculation of the RMS value, equations (5.5) and (5.6) were employed.
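The estimate of equations (5.5) and (5.6) can be computed with a single recursive accumulator. A minimal sketch in Python (NumPy assumed; the test signal, the α value and the gain limit are illustrative):

import numpy as np

def rms_estimate(s, alpha=0.99):
    """sRMS(n) = sqrt(sum over m <= n-1 of s(m)^2 * alpha^(n-1-m)), run recursively."""
    p = 0.0
    s_rms = np.empty_like(s)
    for n, sample in enumerate(s):
        s_rms[n] = np.sqrt(p)        # estimate from past samples only (h(n), n >= 1)
        p = alpha * p + sample ** 2  # one-pole low pass filtering of s^2
    return s_rms

rng = np.random.default_rng(1)
s = rng.normal(0, 1, 4_000) * np.concatenate([np.ones(2_000), 0.1 * np.ones(2_000)])
g = 1.0 / np.maximum(rms_estimate(s), 1e-3)   # time varying gain G(n) ~ 1/sRMS, limited
w = s * g                                     # near-constant-power input to the quantizer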

[Figure 5.11: twelve panels of audio waveforms (black) with overlaid RMS estimates (red) over the sample index n up to 7·10^4, for α = 0,9 and α = 0,99, with zoomed fragments around n = 1,9·10^4 – 2,35·10^4 and n = 2,25·10^4 – 2,3·10^4]

Fig. 5.11. Examples of the RMS value estimates (red lines) using equations (5.5, 5.6) and corresponding signal waveforms (black lines): (a) waveform s(n) and RMS value estimate sRMS(n) for α = 0,9; (b), (c) detailed fragments of (a); (d) waveform w(n) and RMS value estimate wRMS(n) for α = 0,9; (e), (f) detailed fragments of (d); (g) waveform s(n) and RMS value estimate sRMS(n) for α = 0,99; (h), (i) detailed fragments of (g); (j) waveform w(n) and RMS value estimate wRMS(n) for α = 0,99; (k), (l) detailed fragments of (j)


Figures 5.11 a, b and c show the RMS value estimates (red lines) together with the input signal waveforms (black lines) for the case α = 0,9. Figures 5.11 d, e and f represent the waveforms of the product w(n) = s(n)·G(n) (black lines) and their RMS value estimates (red lines) for the same α. The waveforms s(n), w(n) and their RMS value estimates for α = 0,99 are presented in Figures 5.11 g – l. For α = 0,9 the system reaction to changes of the input signal amplitude is more rapid. Therefore, despite some relatively quick variations of the RMS value of the product w(n) = s(n)·G(n), its average remains nearly constant. All amplitude dips of the signal s(n) are almost entirely compensated by the time varying gain, including the deep drop in the sample region 2,25·10⁴ – 2,3·10⁴. Conversely, the system with α = 0,99 responds much more slowly, and not all signal amplitude dips are properly compensated; this is clearly seen in the region 2,25·10⁴ – 2,3·10⁴.

In the case α = 0,9 the corresponding time constant is about 9 samples (about 1 ms at an 8 kHz sampling frequency), and in the case α = 0,99 about 100 samples (about 12 ms at 8 kHz). This means that the system with α = 0,9 can be assigned to the instantaneous systems and the system with α = 0,99 to the syllabic systems.

It must additionally be noted that the gain G(n) and the step size Δ(n) are slowly varying functions compared with the input signal amplitude variations. Therefore the sampling rate of G(n) and Δ(n) can be much lower than that of the original signal, although it depends on the value of α. This is very important, because the information rate of the digital representation is the sum of the quantizer output information rate and the information rate of the gain or step size function.

The RMS value estimate in the form of the square root of a moving average of the squared input signal has been proposed in [49]. That is

$$s_{RMS}(n) = \sqrt{\frac{1}{M}\sum_{m=0}^{M-1} x^2(n+m)}\,. \qquad (5.7)$$

The main operation in the RMS value determination according to this equation is the addition of the squared future input samples that are about to be quantized. Therefore a buffer of M samples is needed for the system realization. Notice that all the weights of the summed samples are the same and equal to 1/M, contrary to the previous case.

The algorithm of the $s_{RMS}(n)$ calculation can be expressed in terms of digital filtering, namely

$$s_{RMS}(n) = \sqrt{\sum_{m} x^2(m)\, h(n-m)}\,, \qquad (5.8)$$

where the filter impulse response $h(n)$ is

$$h(n) = \begin{cases} 1/M, & -M < n \le 0;\\ 0, & \text{otherwise}. \end{cases} \qquad (5.9)$$

This filter is noncausal since the future input samples would not be available. To implement

this noncausal filter we have to delay the output, i.e., to buffer the input by M samples.

Results of an investigation of this system are presented in Figure 5.12. Two operation modes were used in these experiments: one with M = 8 and the second with M = 128, corresponding to time intervals of 1 ms and 16 ms at an 8 kHz sampling frequency. The system compensates all variations of the input signal level quite effectively, even the deep drop in the region of 2,25·10⁴ – 2,3·10⁴ samples. Since the reaction of the system with M = 8 is more rapid, it compensates rapid changes of signal amplitude better than the system with M = 128. However, the gain G(n) or the step size Δ(n) then changes more quickly as well, which results in a higher side-information rate of the system. Usually the gain or step size is evaluated and transmitted once every M samples.
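A minimal sketch in Python of this block-wise variant (NumPy assumed; the block length, gain constant and limits are illustrative): one gain per M-sample block is computed by equation (5.7) and would be transmitted as side information together with the code words.

import numpy as np

def block_gains(s, m_block=128, g0=1.0, g_min=1e-2, g_max=1e2):
    """One gain per block: G = G0 / sRMS, limited to [g_min, g_max]."""
    n_blocks = len(s) // m_block
    gains = np.empty(n_blocks)
    for k in range(n_blocks):
        block = s[k * m_block:(k + 1) * m_block]
        s_rms = np.sqrt(np.mean(block ** 2))          # equation (5.7)
        gains[k] = np.clip(g0 / max(s_rms, 1e-12), g_min, g_max)
    return gains

rng = np.random.default_rng(2)
s = rng.normal(0, 1, 8_192) * np.repeat([1.0, 0.05, 0.5], [3_000, 3_000, 2_192])
g = block_gains(s, m_block=128)
w = s[: len(g) * 128] * np.repeat(g, 128)   # near-constant-power signal for the quantizer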

[Figure 5.12: twelve panels of audio waveforms (black) with overlaid RMS estimates (red) over the sample index n up to 7·10^4, for block lengths M = 8 and M = 128, with zoomed fragments around n = 1,9·10^4 – 2,3·10^4 and n = 2,25·10^4 – 2,3·10^4]

Fig. 5.12. Examples of the RMS value estimates (red lines) using equations (5.8, 5.9) and corresponding signal waveforms (black lines): (a) waveform s(n) and RMS value estimate sRMS(n) for M = 8; (b), (c) detailed fragments of (a); (d) waveform w(n) and RMS value estimate wRMS(n) for M = 8; (e), (f) detailed fragments of (d); (g) waveform s(n) and RMS value estimate sRMS(n) for M = 128; (h), (i) detailed fragments of (g); (j) waveform w(n) and RMS value estimate wRMS(n) for M = 128; (k), (l) detailed fragments of (j)


Many authors [31, 33, 48] have investigated feed-forward adaptive quantization. A comprehensive comparative study was done by Noll [49], who used the second type of RMS value estimation with M = 128 and M = 1024. The results show that a 3-bit adaptive quantizer achieves up to 8 dB better SNR than a fixed nonuniform quantizer, depending, of course, on the speech material. According to the 6 dB rule, this means that the adaptive quantizer saves approximately up to one bit.

5.3.5 Feed-backward adaptation

Another type of adaptive quantizer is presented in Figures 5.13 and 5.14. Here the RMS value of the input signal is estimated from the quantizer output signal or, equivalently, from the output code words.


Fig. 5.13. Block diagram of feed-backward adaptive quantizer with time varying step size

In these systems, as in the case of feed-forward systems, the step size Δ(n) and the gain G(n) are proportional and inversely proportional, respectively, to an estimate of the RMS value of the input signal, as denoted in equations (5.3), (5.4). There is no need to transmit the information about the step size and the gain: it can be extracted from the code-word sequence. That is an unquestionable advantage of this type of system.

Usually the same equation (5.5) is applied for the RMS value estimation, only with the output samples used instead of the input samples, i.e.,

$$s_{RMS}(n) = \sqrt{\sum_{m} \hat{s}^2(m)\, h(n-m)}. \qquad (5.10)$$

However, the computational situation in this case is different. It is impossible to realize the noncausal filter, since the present and, all the more, the future values of $\hat{s}(n)$ are not available until the quantization has occurred, which in turn can be performed only after the RMS value has been estimated. Therefore it is possible to use a low-pass filter with the impulse response expressed by equation (5.6), or a causal moving average filter with the impulse response

$$h(n) = \begin{cases} 1/M, & 1 \le n \le M; \\ 0, & \text{otherwise}. \end{cases} \qquad (5.11)$$

That leads to the equation

$$s_{RMS}(n) = \sqrt{\frac{1}{M}\sum_{m=1}^{M} \hat{s}^2(n-m)}. \qquad (5.12)$$


Fig. 5.14. Block diagram of feed-backward adaptive quantizer with time varying gain


The results of studies of this system were reported in [49]. It was found that with very short window lengths (e.g., M = 2) only 12 dB SNR was achieved for a 3-bit quantizer. Larger values of M give only slightly better results.

Another approach [50] was applied to the system based on the block diagram presented in Figure 5.13. The step size of a uniform quantizer was adapted at each sample time by the equation

$$\Delta(n) = R\,\Delta(n-1). \qquad (5.13)$$

Here R is the step size multiplier, which depends only on the absolute value of the previous code word c(n−1). The multiplier selection principle can be described as follows. If the previous code word represents the largest positive or negative quantizer value, the quantizer is overloaded and the step size is too small; at the next step it must be increased by choosing a multiplier greater than one. If the previous code word represents the smallest positive or negative quantizer value, the step size must be decreased by choosing a multiplier less than one. Reasoning similarly, multipliers must be chosen for all 2^B possible code words of a B-bit quantizer. Jayant [50] obtained empirical multiplier values for speech by minimizing the mean squared quantization error. They are presented in Table 5.1.

Table 5.1 Step size multipliers for adaptive feed-backward quantization of speech signals

B \ Step number | 1    | 2   | 3   | 4   | 5   | 6   | 7   | 8
2               | 0,6  | 2,2 |     |     |     |     |     |
3               | 0,85 | 1,0 | 1,0 | 1,5 |     |     |     |
4               | 0,8  | 0,8 | 0,8 | 0,8 | 1,2 | 1,6 | 2,0 | 2,4

However, various experiments show that different multipliers can produce comparable results, which means that the multiplier values are not particularly critical.
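The following is a minimal sketch (in Python, not from the original text) of this one-word-memory adaptive scheme for a 2-bit (B = 2) uniform quantizer with the multipliers 0,6 and 2,2 from Table 5.1; the quantizer structure and variable names are illustrative assumptions.

```python
import numpy as np

def jayant_quantize(x, delta0=0.1, multipliers=(0.6, 2.2)):
    """Feed-backward (Jayant) adaptive 2-bit quantizer, eq. (5.13):
    delta(n) = R * delta(n-1), where R depends only on the previous
    code word's magnitude (0 -> inner level, 1 -> outer level)."""
    delta = delta0
    codes, reconstructed = [], []
    prev_mag = 0                        # magnitude of previous code word
    for sample in x:
        delta *= multipliers[prev_mag]  # adapt step size, eq. (5.13)
        mag = 1 if abs(sample) > delta else 0     # inner or outer level
        sign = 1.0 if sample >= 0 else -1.0
        level = (1.5 if mag else 0.5) * delta     # mid-rise output levels
        codes.append((sign < 0, mag))
        reconstructed.append(sign * level)
        prev_mag = mag
    return codes, np.array(reconstructed)

# The decoder repeats the same step-size recursion from the code words,
# so no side information about delta(n) needs to be transmitted.
x = 0.3 * np.sin(2 * np.pi * 0.01 * np.arange(200))
codes, y = jayant_quantize(x)
print(f"SNR = {10*np.log10(np.sum(x**2)/np.sum((x-y)**2)):.1f} dB")
```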

In conclusion we present the quantization results for the German sentence "Das hohe Gras ist schon vergilbt" ("The high grass has already turned yellow") using the various quantizers analyzed above. Table 5.2 shows the average SNR for each quantizer.

Table 5.2 Signal-to-noise ratios using various quantizers with B = 4 for the same speech material

Quantizer | Uniform | A-law | µ-law | Optimum | Feed-forward | Feed-backward
SNR (dB)  | 7,44    | 13,71 | 13,43 | 16,53   | 16,73        | 19,88

The nonuniform quantizers give comparable results, still much better than the uniform quantizer. The best result is achieved with the feed-backward adaptive quantizer.

5.4 Differential quantization

Figure 5.15 shows an example of an autocorrelation function estimate for a speech signal.


Fig. 5.15. Autocorrelation function estimate for speech signal

It is clearly seen that there is considerable correlation between adjacent speech samples, and the correlation is significant even between samples that are several sampling intervals apart. The meaning of this high correlation is that, in an average sense, the signal properties do not change rapidly from sample to sample, so the difference between adjacent samples should have a lower variance than the signal itself. This can easily be verified analytically. This fact provides the motivation for the general differential quantization scheme presented in Figure 5.16 [31, 33, 48, 51].


Fig. 5.16. General block diagram of differential quantizer

In this system the input to the quantizer is the signal

$$d(n) = s(n) - \tilde{s}(n), \qquad (5.14)$$

which is the difference between the unquantized input sample s(n) and an estimate, or prediction, of the input sample, denoted $\tilde{s}(n)$. This predicted value is the output of a predictor system P, whose input, as we will see below, is a quantized version of the input signal s(n). The difference signal may also be called the prediction error signal, since it is the amount by which the predictor fails to exactly predict the input. Temporarily leaving aside the question of how the estimate $\tilde{s}(n)$ is obtained, we note that it is the difference signal that is quantized rather than the input. The quantizer could be either fixed or adaptive, uniform or nonuniform, but in any case its parameters should be adjusted to match the variance of d(n). The quantized difference signal can be represented as

$$\hat{d}(n) = d(n) + e(n), \qquad (5.15)$$

where e(n) is the quantization error. According to Figure 5.16, the quantized difference signal is added to the predicted value $\tilde{s}(n)$ to produce a quantized version of the input, i.e.,

$$\hat{s}(n) = \tilde{s}(n) + \hat{d}(n). \qquad (5.16)$$

Substituting equations (5.14) and (5.15) into equation (5.16) we see that

$$\hat{s}(n) = s(n) + e(n). \qquad (5.17)$$

That is, independent of the properties of the predictor system P, the quantized speech sample differs from the input only by the quantization error of the difference signal. Thus, if the prediction is good, the variance of d(n) will be smaller than the variance of s(n), so a quantizer with a given number of levels can be adjusted to give a smaller quantization error than would be possible when quantizing the input signal directly.


It should be noted that it is the quantized difference signal that is coded for transmission or storage. The system for reconstructing the quantized input from the code words is implicit in Figure 5.16. It involves a decoder to reconstruct the quantized difference signal, from which the quantized input is reconstructed using the same predictor as in the encoder. Clearly, if c′(n) is identical to c(n), then ŝ′(n) = ŝ(n), which differs from s(n) only by the quantization error incurred in quantizing d(n). The signal-to-quantizing-noise ratio of the system is, by definition,

$$SNR = \frac{E[s^2(n)]}{E[e^2(n)]} = \frac{P_s}{P_e}, \qquad (5.18)$$

which can be written as

$$SNR = \frac{P_s}{P_d}\cdot\frac{P_d}{P_e} = G_P\, SNR_Q, \qquad (5.19)$$

where

$$SNR_Q = P_d / P_e \qquad (5.20)$$

is the signal-to-quantizing-noise ratio of the quantizer, and the quantity

$$G_P = P_s / P_d \qquad (5.21)$$

is defined as the gain due to the differential configuration.

The quantity SNR_Q depends upon the particular quantizer that is used, and, given knowledge of the properties of d(n), SNR_Q can be maximized by using the techniques of the previous sections. The quantity G_P, if greater than unity, represents the gain in SNR that is due to the differential scheme. Clearly, our objective should be to maximize G_P by an appropriate choice of the predictor system P. For a given signal, P_s is a fixed quantity, so G_P can be maximized by minimizing the denominator of equation (5.21), i.e., by minimizing the variance of the prediction error. To proceed, we need to specify the nature of the predictor P. One approach that is well motivated by our previous discussion of the model for speech production (see paragraph 3.7), and by the fact that it leads to tractable mathematics, is to use a linear predictor (see chapter 7).

As will be described later in section 7.3, using only the simplest first-order linear predictor with a correctly chosen parameter α₁, it is theoretically possible to achieve a gain G_P|_{P=1} in the range 2,43 dB–7,21 dB. Experimental results obtained with real speech signals show the same [31, 33]. For differential quantization this means that with the simplest first-order predictor it is possible to realize about a 6 dB improvement in SNR, which is of course equivalent to adding an extra bit to the quantizer. The improvement in SNR increases with increasing predictor order. Initially the increase is quite sharp; however, it slows down at the 6-8 dB value, which is reached with a predictor of order 2-4. Further improvement is less effective, and in no case does it reach 12 dB. The maximum effective predictor order is 10-14.

Alternatively, while keeping the same SNR, differential quantization permits a reduction in bit rate. The price paid is the increased complexity of the quantization system.
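As an illustration, here is a minimal sketch (in Python, not from the original text) of differential quantization with a fixed first-order linear predictor; the predictor coefficient 0,9 and the uniform quantizer step are illustrative choices, not values from the text.

```python
import numpy as np

def dpcm_encode_decode(s, alpha1=0.9, step=0.05):
    """Differential quantization (Fig. 5.16) with a first-order predictor:
    the prediction is alpha1 * s_hat(n-1), the difference d(n) is quantized
    uniformly, and the decoder structure is embedded in the encoder."""
    s_hat_prev = 0.0
    s_hat = np.empty_like(s)
    for n, sample in enumerate(s):
        pred = alpha1 * s_hat_prev            # prediction from quantized past
        d = sample - pred                     # prediction error, eq. (5.14)
        d_hat = step * np.round(d / step)     # uniform quantization of d(n)
        s_hat[n] = pred + d_hat               # reconstruction, eq. (5.16)
        s_hat_prev = s_hat[n]
    return s_hat

# For correlated signals d(n) has smaller variance than s(n), so the same
# quantizer yields a higher SNR (the gain G_P of eq. (5.21)).
s = np.sin(2 * np.pi * 0.01 * np.arange(500))
s_hat = dpcm_encode_decode(s)
print(f"SNR = {10*np.log10(np.sum(s**2)/np.sum((s-s_hat)**2)):.1f} dB")
```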

5.5 Vector quantization

Two types of quantization techniques exist: scalar quantization and vector quantization. Scalar quantization deals with quantization on a sample-by-sample basis; this class includes the uniform, nonuniform, optimum and adaptive quantizers analyzed above, among others. Vector quantization deals with quantizing the samples in groups called vectors [52]. Vector quantization increases the optimality of a quantizer at the cost of increased computational complexity and memory requirements.

In all the quantization schemes analyzed above, single sample values of some signal or its parameters were replaced by quantized values. It makes sense to generalize this technique. It is possible to integrate m single values (e.g., successive values) s₁, s₂, s₃, …, s_m into an m-dimensional vector

$$\mathbf{s} = (s_1, s_2, s_3, \ldots, s_m)^T,$$

to assign it to one of L possible m-dimensional "quantization cells", and to replace it by the representative vector belonging to the selected cell,

$$\hat{\mathbf{s}} = (\hat{s}_1, \hat{s}_2, \hat{s}_3, \ldots, \hat{s}_m)^T.$$

This is the essence of vector quantization (VQ).

The cell to which a vector s is assigned (according to the chosen criterion) will be denoted by the index i in what follows. All representative code vectors are located in a so-called codebook. The particular representative vector belonging to cell i is denoted $\hat{\mathbf{s}}_i$ and located in the codebook containing L code vectors. Instead of one-dimensional quantization intervals, as in scalar quantization schemes, so-called m-dimensional Voronoi cells are used. The m-dimensional vector space can be partitioned with uniform or nonuniform resolution. Figure 5.17 illustrates two-dimensional Voronoi cells with uniform and nonuniform resolution.


Fig. 5.17. Illustration of vector quantization with L = 25 two-dimensional (m = 2) code vectors; (a) uniform resolution, (b) nonuniform resolution

If the codebook is known, the task of vector quantization consists in replacing an m-dimensional input vector s by the most similar vector $\hat{\mathbf{s}}_{i_{opt}}$. The selection of the most similar vector is performed in two steps: first, estimation of all distances $d(\mathbf{s}, \hat{\mathbf{s}}_i)$ between the input vector s and all codebook vectors; second, determination of the minimum distance, i.e.,

$$d(\mathbf{s}, \hat{\mathbf{s}}_{i_{opt}}) = \min_i\, d(\mathbf{s}, \hat{\mathbf{s}}_i).$$

The Voronoi cells are implicitly determined by this.

Since the codebook is known to the transmitter as well as to the receiver, it is not necessary to transmit the quantized vector $\hat{\mathbf{s}}_{i_{opt}}$. It is more efficient to transmit only the address $i_{opt}$. This is illustrated in Figure 5.18.


Fig. 5.18. Illustration of vector quantization principle

If the codebook consists of $L = 2^B$ m-dimensional vectors $\hat{\mathbf{s}}_i$, the chosen address $i_{opt}$, and with it the implicitly chosen vector $\hat{\mathbf{s}}_{i_{opt}}$, can be coded using only B bits. Thus, vector quantization enables each m-dimensional input vector to be encoded using $B = \log_2 L$ bits.

There are various possibilities in choosing the distance measure; however, the computational procedures must be as simple as possible. For quantization of signal vectors or signal parameter vectors (e.g., linear prediction coefficient vectors) the mean squared error is usually used as the distance measure [55]. That is

$$d(\mathbf{s}, \hat{\mathbf{s}}_i) = \frac{1}{m}(\mathbf{s}-\hat{\mathbf{s}}_i)^T(\mathbf{s}-\hat{\mathbf{s}}_i) = \frac{1}{m}\sum_{n=1}^{m}\big(s_n - \hat{s}_{i,n}\big)^2, \qquad i = 1, 2, \ldots, L.$$

This form of distance measure is called the Euclidean distance.

An alternative possibility is to use the weighted mean squared error [55]

$$d(\mathbf{s}, \hat{\mathbf{s}}_i) = \frac{1}{m}(\mathbf{s}-\hat{\mathbf{s}}_i)^T \mathbf{W}\,(\mathbf{s}-\hat{\mathbf{s}}_i),$$

where W is a symmetric positive definite m×m matrix. Frequently the inverse covariance matrix C⁻¹ is used; in that case the distance measure is called the Mahalanobis distance.

For determination of the minimum distance the conventional full codebook search method is used. The method requires calculating the distances between s and every $\hat{\mathbf{s}}_i$. However, this is often difficult to achieve due to the extremely high real-time computational complexity involved in the full codebook search. To overcome this difficulty, various fast codebook search algorithms have been suggested, such as partial distance search, mean difference, classified pre-selection and other methods [56].
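A minimal sketch (in Python, not from the original text) of the full codebook search with the Euclidean distance defined above; the random codebook and the test vector are purely illustrative.

```python
import numpy as np

def vq_encode(s, codebook):
    """Full codebook search: return the index i_opt of the code vector
    minimizing the (mean squared) Euclidean distance d(s, s_hat_i)."""
    # distances d(s, s_hat_i) = (1/m) * ||s - s_hat_i||^2 for all i at once
    d = np.mean((codebook - s) ** 2, axis=1)
    return int(np.argmin(d))

def vq_decode(i_opt, codebook):
    """The receiver holds the same codebook, so only i_opt is transmitted."""
    return codebook[i_opt]

m, L = 2, 25                      # vector dimension and codebook size
rng = np.random.default_rng(0)
codebook = rng.uniform(-3, 3, size=(L, m))
s = np.array([0.7, -1.2])
i_opt = vq_encode(s, codebook)    # B = log2(L) bits suffice to send i_opt
print(i_opt, vq_decode(i_opt, codebook))
```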

Vector quantization of speech signals requires the generation of codebooks. The codebooks are designed using iterative algorithms, e.g., the Linde-Buzo-Gray (LBG) algorithm [54], Kekre's Fast Codebook Generation (KFCG) algorithm, Kekre's Median Codebook Generation (KMCG) algorithm, etc. [53]. The input to such an algorithm is a training sequence: the concatenation of a set of signal vectors or signal parameter vectors obtained from people of different groups and different ages. The speech signals used to obtain the training sequence must be free of background noise; they can be recorded in soundproof booths, computer rooms and open environments.

Codebook generation using, for example, the LBG algorithm requires the generation of an initial codebook, which is the centroid (mean) of the training sequence. The obtained centroid is then split into two centroids, or codewords, using the splitting method. The iterative LBG algorithm splits these two into four, four into eight, and the process continues until the required number of codewords in the codebook is obtained [54].
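The following is a compact sketch (in Python, not from the original text) of the LBG splitting procedure just described; the perturbation factor 0,01, the refinement loop and the training data are illustrative assumptions.

```python
import numpy as np

def lbg_codebook(training, L, eps=0.01, iters=20):
    """LBG algorithm: start from the centroid of the training sequence,
    then repeatedly split each codeword into two perturbed copies and
    refine them by nearest-neighbor / centroid (k-means style) passes."""
    codebook = training.mean(axis=0, keepdims=True)   # initial centroid
    while len(codebook) < L:
        codebook = np.vstack([codebook * (1 + eps),   # split every codeword
                              codebook * (1 - eps)])
        for _ in range(iters):                        # refinement passes
            d = ((training[:, None, :] - codebook[None]) ** 2).sum(axis=2)
            nearest = d.argmin(axis=1)                # assign to cells
            for i in range(len(codebook)):            # recompute centroids
                members = training[nearest == i]
                if len(members):
                    codebook[i] = members.mean(axis=0)
    return codebook

rng = np.random.default_rng(1)
training = rng.normal(size=(1000, 2))      # illustrative training vectors
print(lbg_codebook(training, L=8).shape)   # (8, 2)
```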

In conclusion, we present some considerations about the computational complexity of VQ. In the simplest case it can be estimated by the number of mathematical operations (multiplications, additions, etc.) per second, $N_{MathOp/s}$. Some authors [33] suggest the following formula:

$$N_{MathOp/s} = 3\,L\,F_s.$$

In a typical case L = 1024 and F_s = 8 kHz, so we obtain an impressive number of mathematical operations per second: $N_{MathOp/s} = 24{,}6\cdot 10^6$.


6. VIDEO SIGNALS COMPRESSION

6.1 Video Material Characteristics

Let’s begin from some definitions [41]:

Pixel is the smallest single component of a digital image. The greater pixel density is in the

image, the higher image resolution.

Block is a part of an image and is composed of an amount of pixels vertically and

horizontally. The blocks composed of 8×8 pixels are used in MPEG systems. Blocks are

processed individually.

Macro block is an image area in a sequence of digitized images. It is used for motion

compensation starting from the key frame and continuing into the following frames.

Usually macro blocks consist of 16×16 pixels.

Slice is a part of an image joining adjacent macro blocks. They can differ in size and form.

Frame is a separate part of an image taken from the sequence of images. In the television,

frame is comprised of two half-frames, each of which consists of the half of rows (even or

odd). Each frame is divided into macro blocks of 16×16 pixels size, where each macro

block is comprised of 4 blocks of 8×8 pixels size.

We know that conversion of an analog video signal into digital form results in an unacceptable

increase in the bandwidth required for transmitting the signal even when multilevel modulation

schemes that transmit several bits/s/Hz are employed. For digital video to be a practical reality, it is

essential that the number of bits required to represent each picture in a video sequence be significantly

reduced. Fortunately, the characteristics of video material are such that substantial reductions in the

number of bits can be achieved without noticeably affecting the subjective quality of the video

service. Below we describe the characteristics that allow these savings to be made.

6.1.1 Picture correlation

In every picture there are plenty of sufficiently large areas that have similar characteristics: luminance or chrominance levels. This similarity within a picture can be exploited to reduce the amount of data that needs to be transmitted to accurately represent the picture. To completely represent a part of a picture that is similar to another part, all that is needed is the luminance or chrominance level of the first part together with the statement that the other part has the same level. Mathematically, the similarity of picture parts is measured by the autocorrelation function. This function measures how pixel similarity varies as a function of the distance between the pixels. The correlation coefficient r between two blocks of pixels A(i, j) and B(i, j), where i and j are the pixel positions within each block, is defined as [41]

$$r = \frac{\sum_{i}\sum_{j}\big(A(i,j)-\bar{A}\big)\big(B(i,j)-\bar{B}\big)}{\sqrt{\sum_{i}\sum_{j}\big(A(i,j)-\bar{A}\big)^2\,\sum_{i}\sum_{j}\big(B(i,j)-\bar{B}\big)^2}},$$

where $\bar{A}$, $\bar{B}$ are the mean values of A(i, j) and B(i, j) respectively.

The correlation coefficient between two identical blocks is equal to 1. If two blocks are completely different, the correlation coefficient will be zero. In the general case, the correlation coefficient can vary from −1, when A(i, j) = −B(i, j), to 1, when A(i, j) = B(i, j).

Let us consider a two-dimensional block A consisting of 4 pixels,

130 135
126 121

and another block B,

133 131
123 125

Their arithmetic means are

$$\bar{A} = \frac{130 + 135 + 126 + 121}{4} = 128, \qquad \bar{B} = \frac{133 + 131 + 123 + 125}{4} = 128.$$

Calculation of the correlation coefficient gives the result r = 0,0573. It means the two blocks are practically uncorrelated.
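A minimal sketch (in Python, not from the original text) of this block correlation computation; the two illustrative blocks below are arbitrary, not the ones from the example above.

```python
import numpy as np

def block_correlation(A, B):
    """Correlation coefficient r between two equally sized pixel blocks,
    computed from the mean-removed blocks as defined in the text."""
    a = A - A.mean()
    b = B - B.mean()
    return (a * b).sum() / np.sqrt((a ** 2).sum() * (b ** 2).sum())

# Illustrative blocks: a block compared with a slightly noisy copy of itself
rng = np.random.default_rng(2)
A = rng.integers(100, 200, size=(8, 8)).astype(float)
B = A + rng.normal(0, 5, size=(8, 8))
print(f"r = {block_correlation(A, B):.3f}")   # close to 1 for similar blocks
```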

Let’s calculate the correlation coefficient for luminance component of the image (see

Figure 6.1) for horizontal shifts. Now the correlation coefficient is a function of displacement in

pixels. Result is presented in Figure 6.2.

Fig. 6.1. Image example
Fig. 6.2. Horizontal correlation coefficient for the luminance component of the image in Figure 6.1 (correlation coefficient versus displacement in pixels)

We see that the correlation coefficient reaches its highest value, 1, at a displacement of 0. At a displacement of 1 pixel the correlation coefficient falls only to 0,92. At a displacement of 10 pixels it is reduced to 0,79. This clearly shows that neighboring, and even fairly distant, pixels of a sufficiently complex picture are strongly correlated, which can be exploited to reduce the required data rate.

It is also of great significance to calculate a two-dimensional correlation function. In that case the block of interest is moved over the range of pixels in both the horizontal and vertical directions, and the result is a three-dimensional surface. Usually the correlation in all directions is high for several pixel shifts irrespective of picture type.

The correlation is large not only between adjacent blocks of a picture but also between adjacent pictures in a video sequence. This information can be used to reduce the amount of data required to represent pictures.

6.1.2 Information quantity in a digital image

Let us evaluate the information quantity I_a created by a video source. It is defined as [41]

$$I_a = \log_2(1/P_a) = -\log_2 P_a,$$

where P_a is the probability that the video source generates content a. From this equation it follows that:

- I_a = 0 when P_a = 1. The video source generates content a continuously, so no information is provided by its appearance;
- I_a ≥ 0 when 0 ≤ P_a ≤ 1. The appearance of a symbol can provide some amount of information, or no information in the worst case, but information is never negative;
- I_a ≥ I_b when P_a < P_b. The probability of the appearance of symbol b is higher than the probability of appearance of symbol a, so we get more information from the appearance of the less likely symbol.

If events a and b are statistically independent, then the total information quantity is the sum of the quantities created by both symbols:

$$I_{ab} = I_a + I_b.$$

If the information source generates m symbols s₀, s₁, …, s_{m−1} with corresponding probabilities P₀, P₁, …, P_{m−1}, then the average information quantity (called entropy) will be

$$H = -\sum_{i=0}^{m-1} P_i \log_2 P_i.$$

Let’s have four symbols {a, b, c, d} with appropriate appearance probabilities, and let’s

calculate the average information quantity (see Table 6.1 below).

Table 6.1 Calculation of the average information quantity for four symbols

Symbol 𝑷𝒊 2

logi i

I P −𝑷𝒊 𝒍𝒐𝒈𝟐𝑷𝒊

a 0,75 0,415 0,311

b 0,125 3,0 0,375

c 0,0625 4,0 0,25

d 0,0625 4,0 0,25

H=1,186 bits/symbol (after summation)

It is evident that a two-bit (B = 2) code is sufficient for transmitting four symbols, and the average code length L will be equal to 2. It can be calculated formally as

$$L = \sum_i P_i B_i = 0{,}75\cdot 2 + 0{,}125\cdot 2 + 0{,}0625\cdot 2 + 0{,}0625\cdot 2 = 2 \text{ bits/symbol}.$$

The coding efficiency can then be evaluated as

$$E = H/L = 1{,}186/2 = 59{,}3\,\%.$$

However, it is possible to use a variable length code for encoding the four symbols and thus increase the code efficiency. Details and techniques used for designing variable length coding schemes are discussed in the next paragraph.

Summarizing the above discussion, we see that many pictures are characterized by a high level of correlation between neighboring pixels located on the horizontal or vertical axes of a picture. Moreover, the pixel values are distributed non-uniformly. Exploiting these facts, it is possible to reduce the amount of information needed to transmit pictures while still representing them with sufficient quality. The techniques that allow us to solve these tasks are analyzed below. They cover: variable length coding (entropy coding), predictive coding including motion compensation, and transform coding.

6.2 Variable Length Coding (VLC) or Entropy Coding

Variable length coding significantly increases encoding efficiency. The best-known method for variable length coding is the Huffman algorithm, which assumes prior knowledge of the probability of each element. Huffman encoding can be implemented in the following steps [36, 41]:

Step 1. The symbols being encoded are classified in order of decreasing probability of appearance, starting from the highest one.

Step 2. The two symbols of lowest probability are grouped into one element, whose probability is the sum of the two probabilities. Bit 0 is attributed to the element with the lower probability and bit 1 to the other element. This reduces by one the number of elements to be classified. Then all symbols are classified in order of decreasing probability again.

Steps 3, 4, … The above step is repeated until all the elements have been coded. In this way the Huffman coding tree is built.

The last step. The code for each element is obtained by reading the bits encountered in moving along the Huffman tree from right to left.

The Huffman encoding steps are best illustrated by way of an example. Let us take four symbols {a, b, c, d} with appearance probabilities (0,75 , 0,125 , 0,0625 , 0,0625) respectively. Figure 6.3 presents the symbols classified in order of decreasing probability of appearance and the Huffman tree built after several encoding steps.


Fig. 6.3. Illustration of Huffman tree building

The process of reading the Huffman code words for the symbols is illustrated in Figure 6.4.


Fig. 6.4. Illustration of assignment of codes to the symbols: (a) assignment of code 1 to symbol "a"; (b) assignment of code 01 to symbol "b"; (c) assignment of code 001 to symbol "c"; (d) assignment of code 000 to symbol "d"

Let us now illustrate the decoding procedure using the following sequence of transmitted symbols: {a b c d a a a b}. They generate the continuous bit stream 10100100011101.

The decoding procedure is very simple and consists of a few rules:

- Each received bit is analyzed separately;
- If the analyzed bit coincides with the code of some symbol, that symbol is considered decoded;
- If the analyzed bit does not coincide with the code of any symbol, the next bit is appended and the group of two bits is analyzed anew;
- If the analyzed two-bit group coincides with the code of some symbol, that symbol is considered decoded;
- If the two-bit group does not coincide with the code of any symbol, the third bit is added to the previous group and the analysis continues, and so on.

Returning to our example, if the received bit stream is the same as the transmitted one (i.e., 10100100011101), the decoding procedure at the receiving end can be illustrated as in Table 6.2.

Table 6.2 Illustration of decoding procedure

Analyzed bits Other bits Decoded symbols

1 0100100011101 a

1 0 100100011101 a ?

1 01 00100011101 a b

1 01 0 0100011101 a b ?

1 01 00 100011101 a b ?

1 01 001 00011101 a b c

1 01 001 0 0011101 a b c ?

1 01 001 00 011101 a b c ?

1 01 001 000 11101 a b c d

1 01 001 000 1 1101 a b c d a

1 01 001 000 1 1 101 a b c d a a

1 01 001 000 1 1 1 01 a b c d a a a

1 01 001 000 1 1 1 0 1 a b c d a a a ?

1 01 001 000 1 1 1 01 - a b c d a a a b

In this example, the average code length L will be equal to

$$L = \sum_i P_i B_i = 0{,}75\cdot 1 + 0{,}125\cdot 2 + 0{,}0625\cdot 3 + 0{,}0625\cdot 3 = 1{,}375 \text{ bits/symbol}.$$

The entropy is equal to

$$H = -\sum_{k=0}^{m-1} P_k \log_2 P_k = 1{,}186 \text{ bits/symbol},$$

and the code efficiency reaches

$$E = H/L = 1{,}186/1{,}375 = 86{,}2\,\%.$$

It is evident that a pure binary coding of four elements requires 2 bits per element, as we have seen above. So the compression factor achieved with the Huffman coding is 1,375/2 = 68,75 %.
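A minimal sketch (in Python, not from the original text) of Huffman coding for this four-symbol example; it uses the standard heap-based tree construction, which yields the same code lengths (1, 2, 3, 3 bits) as the tree built above, although the individual bit patterns may differ.

```python
import heapq

def huffman_code(probs):
    """Build a Huffman code from {symbol: probability}. Repeatedly merge
    the two least probable elements, prefixing their codes with 0 and 1."""
    heap = [(p, i, {sym: ""}) for i, (sym, p) in enumerate(sorted(probs.items()))]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        p0, _, c0 = heapq.heappop(heap)   # lowest probability -> bit 0
        p1, _, c1 = heapq.heappop(heap)   # next lowest        -> bit 1
        merged = {s: "0" + c for s, c in c0.items()}
        merged.update({s: "1" + c for s, c in c1.items()})
        heapq.heappush(heap, (p0 + p1, count, merged))
        count += 1
    return heap[0][2]

probs = {"a": 0.75, "b": 0.125, "c": 0.0625, "d": 0.0625}
code = huffman_code(probs)
print(code)                               # code words of lengths 1, 2, 3, 3
avg_len = sum(probs[s] * len(code[s]) for s in probs)
print(f"average code length = {avg_len} bits/symbol")   # 1.375
```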

Huffman coding is applied to video signals as a complement to other methods. First, methods generating elements with non-uniform probabilities have to be applied; just such a method is the discrete cosine transform followed by quantization of its coefficients.

However, Huffman coding results in a redundant number of assigned bits per element. For example, if the probability of a symbol is high, say 0,9, then the optimal number of bits is 0,15; if it is lower, say 0,3, then the optimal number of bits is about 1,74. But in the first case 1 bit will be assigned, and in the second case 2 bits. Obviously the number of bits is rather redundant. This redundancy is reduced by using the arithmetic coding method, which is standardized in the MPEG-4 and JPEG-2000 standards.

In the case of a video stream, identical symbols sent sequentially are grouped and encoded using the variable length code as a single word.


6.3 Predictive Coding

The purpose of predictive coding is to use the already received symbols to predict the adjacent future symbols, because high correlation usually exists between them. In this case only the difference between the predicted and the real symbol needs to be transmitted. If the prediction is sufficiently accurate, then a small number of bits suffices to transmit the difference without distortion. For this a method similar to DPCM (Differential Pulse Code Modulation) is used (see paragraph 5.4).

One of the simplest prediction methods is to replace the forthcoming symbol by the previously received one (1-D prediction) [41]. As the first predicted symbol, the average value of the quantized video signal is used as a rule. In the case of 8-bit quantization the average value is equal to 128.

Let us demonstrate the prediction procedure using the following values of sample image pixels:

130 135 141 129 151
124 140 165 200 199
119 132 175 205 203

We perform the prediction expecting that the first predicted pixel will be grey, which corresponds to the average luminance value 128. The results are:

Line 1:           130 135 141 129 151
Predicted pixels: 128 130 135 141 129
Line 2:           124 140 165 200 199
Predicted pixels: 128 124 140 165 200
Line 3:           119 132 175 205 203
Predicted pixels: 128 119 132 175 205

The prediction errors (differences) are:

+2  +5  +6 -12 +22
-4 +16 +25 +35  -1
-9 +13 +43 +30  -2
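A minimal sketch (in Python, not from the original text) reproducing this 1-D prediction on the sample pixels; function and variable names are illustrative.

```python
import numpy as np

def predict_1d(rows, start=128):
    """1-D prediction: each pixel is predicted by its left neighbor;
    the first pixel of each line is predicted by the mid-grey value 128.
    Returns the prediction errors (differences) to be transmitted."""
    rows = np.asarray(rows)
    predicted = np.hstack([np.full((rows.shape[0], 1), start),
                           rows[:, :-1]])          # shift right by one pixel
    return rows - predicted

pixels = [[130, 135, 141, 129, 151],
          [124, 140, 165, 200, 199],
          [119, 132, 175, 205, 203]]
print(predict_1d(pixels))
# [[  2   5   6 -12  22]
#  [ -4  16  25  35  -1]
#  [ -9  13  43  30  -2]]
```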

In Figure 6.5 we present another example with a black and white photo.

Fig. 6.5. Illustration of 1-D prediction: (a) original photo, (b) prediction error found using 1-D prediction


Prediction can be more sophisticated: multidimensional [41]. For example, suppose we have received the pixels A, B, C, D and want to predict the pixel X:

B C D
A X

Using the multidimensional prediction method, X can be defined as

$$X = k_1 A + k_2 B + k_3 C + k_4 D.$$

The prediction coefficients k depend on the image content. Using pixels to the left of and above the pixel of interest means that horizontal and vertical correlations within the picture are exploited, so some improvement in performance would be expected. However, it has been shown that there is little to be gained by using additional pixels from the same picture for prediction. Therefore the most popular is 2-D prediction. In that case

$$X = 0{,}5A + 0{,}5C.$$

Applying 2-D prediction to the photo (see Figure 6.6, a) we get the results presented in Figure 6.6, b and c.

Fig. 6.6. Illustration of 2-D prediction: (a) original photo, (b) predicted photo using 2-D prediction, (c) prediction error

Using the 1-D and 2-D methods for compression of photos, the 2-D method gives approximately a 3 % reduction in the required number of bits compared with the 1-D method.

Encoders where all the pixels used for prediction are in the same picture are called intrapicture encoders. It is also possible to predict from previous pictures; this is called interpicture prediction. If pictures differ only slightly, the prediction can be efficient, but in the case of a large amount of motion one must not expect the prediction to be efficient. Motion-compensated prediction is needed.


6.4 Prediction with Motion Compensation

Interpicture prediction performance can be improved by evaluating the motion which occurs between pictures [41–43]. Instead of coding moving objects several times, it is proposed to transmit only information about how the object has moved between pictures, removing the requirement to code it again at all. It is clear that this would reduce the bit rate, but it is difficult to implement because:

- pictures can be zoomed by changing the camera focal length;
- a panorama can be shot (i.e., the camera is rotated vertically or horizontally about its axis);
- the image can be formed by rotation about the camera axis, by translation along the camera axis, or by translation in the plane normal to the camera axis.

Each of these movements can affect the content of the video stream in real time. Motion compensation algorithms taking all these facts into account become very sophisticated and bulky, and their implementation in real-time hardware would be practically impossible. Fortunately, there is a sufficiently simple approach resulting in a practically realizable algorithm. Its essence is the approximation of all kinds of movements as translational movement of sufficiently small objects in a plane normal to the camera axis. The first step of this process is to estimate the object motion.

Briefly, in motion compensated prediction the encoder searches for a portion of a previous frame which is similar to the part of the new frame to be transmitted. It then sends (as side information) a motion vector telling the decoder what portion of the previous frame it will use to predict the new frame. It also sends the prediction error so that the exact new frame may be reconstituted.

The motion compensated prediction procedure consists of two stages: motion estimation and motion compensation.

The motion estimation stage begins by dividing the current picture to be encoded into a number of blocks. The prediction picture is then searched for the best match to each of the blocks in the current picture. For better clarity of the procedure, let us introduce a third, motion compensated picture, which initially is blank, as shown in Figure 6.7.

Fig. 6.7. Initial stage of the motion-compensated prediction procedure: current picture, prediction picture and motion compensated picture (initially blank)

Each block of the current picture is analyzed in turn; we begin with the block marked by the left circle in the figure. A search area is then defined in the prediction picture, as shown in Figure 6.8. The size of that area is user selectable; however, it must be noted that the larger the area, the larger the motion that can be tracked and the higher the computation requirements. Then follows the search for the block that best matches the analyzed block in the current picture; this is also illustrated in Figure 6.8. After this, the found best matching block is copied into the motion compensated picture in the same position as the block to be predicted in the current picture (see Figure 6.8).

Fig. 6.8. Search area in the prediction picture for the selected block in the current picture (search area, best matching block and motion vector marked) and generation of the block in the motion compensated picture

The location of the best matching block in the prediction picture is transmitted to the decoder. This information is called the motion vector. It is shown in the prediction picture (see Figure 6.8) as a solid arrow. It marks the displacement from the location of the upper left corner of the block in the current picture to the location of the upper left corner of the chosen best matching block in the prediction picture. Together with the motion vector, the prediction difference block has to be transmitted. It is formed as the difference between the copied block in the motion compensated picture and the analyzed block in the current picture. In our example the motion compensated picture completely coincides with the original picture, so the prediction difference picture consists of zeroes; in reality it does not.

This process is repeated for all blocks in the current picture.

The most reliable approach to the search for the best matching block is to compare the block we are trying to match with every block of the same size within the search area. The order of the search is not important; it is only important not to miss any of the possible search positions. One possible search variant is to start from the upper left corner of the search area and to move the analyzed block pixel by pixel towards the upper right corner. After the upper right corner is reached, the search returns to the left side one pixel below the top row and again moves to the right, and so on. The search is terminated when the last row of the search area has been scanned. At each search position the summed absolute difference is calculated between the block at the search position and the block in the current picture. The block with the smallest summed absolute difference is chosen as the motion-compensated prediction block.

The search procedure is illustrated in Figures 6.9 and 6.10. The block from the current picture is 2×2 pixels and the search area is 6×6 pixels. There are a total of 25 possible search positions in this case.

Fig. 6.9. Illustration of the current block and the search area
Fig. 6.10. Illustration of the search procedure

In the general case, for a search area of N×N pixels and a block of M×M pixels the total number of search positions is equal to $(N-M+1)^2$.

Example. A search area is 48×48 pixels and the block size is 16×16 pixels. The picture resolution is 704×480 pixels at a picture rate of 25 pictures per second.

The number of searches per block is equal to (48 − 16 + 1)² = 33² = 1089 searches. For each search, two blocks of size 16×16 pixels are compared, which requires 256 pixel comparisons per search. The number of comparisons per block is therefore 1089 · 256 = 278 784. Each picture consists of (704/16) · (480/16) = 1320 blocks, so the total number of pixel operations per picture is 278 784 · 1320 ≈ 3,68·10⁸. The picture rate is 25 pictures/s, so the total number of pixel operations per second is 3,68·10⁸ · 25 = 9,2·10⁹.

This simple example shows very clearly that the computational requirements for motion estimation are very large, especially since search areas greater than 48×48 pixels are often used in practice. Therefore specialized digital signal processing hardware is used.
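A minimal sketch (in Python, not from the original text) of the exhaustive block-matching search with the summed-absolute-difference criterion; array names and sizes are illustrative assumptions.

```python
import numpy as np

def best_match(current_block, search_area):
    """Exhaustive block matching: slide the current block over every
    position in the search area and return the offset (dy, dx) minimizing
    the summed absolute difference (SAD)."""
    M = current_block.shape[0]                 # block is M x M
    N = search_area.shape[0]                   # search area is N x N
    best, best_sad = (0, 0), np.inf
    for dy in range(N - M + 1):                # (N - M + 1)^2 positions
        for dx in range(N - M + 1):
            candidate = search_area[dy:dy + M, dx:dx + M]
            sad = np.abs(candidate - current_block).sum()
            if sad < best_sad:
                best, best_sad = (dy, dx), sad
    return best, best_sad

rng = np.random.default_rng(3)
area = rng.integers(0, 256, size=(48, 48)).astype(float)
block = area[10:26, 20:36].copy()              # a block hidden in the area
print(best_match(block, area))                 # -> ((10, 20), 0.0)
```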

Now we present some considerations regarding the most appropriate block size. On the one hand, it must be sufficiently small that the likelihood of a block containing two or more different motions is small. On the other hand, it must be sufficiently large that the resources required to represent the motion vectors are not excessive. For example, if the block size is 1×1 pixel, 8 bits per pixel are needed to represent the motion vector (4 bits/pixel for the horizontal and 4 bits/pixel for the vertical component). If the block size is 16×16 pixels, only 0,03125 bits per pixel are needed. Most video coding standards use a block size of 16×16 pixels.

A block diagram of the motion compensated encoder is presented in Figure 6.11, and that of the matching decoder is shown in Figure 6.12.


Fig. 6.11. Block diagram of motion compensated encoder


Fig. 6.12. Block diagram of matching decoder

Up to this point we have assumed that motion estimation is performed to pixel accuracy. In reality this is not the case: motion estimation to subpixel accuracy is possible and is supported in many video encoding standards. Subpixel accuracy can be achieved by estimating the pixel values at subpixel displacements by means of bilinear interpolation from the known pixel values. A maximum of half-pixel accuracy is specified for the MPEG-2 standard.

The motion compensated encoding discussed above is lossless, which means that the decoder output signal is identical to the signal at the input of the encoder. This ensures the highest quality of signal; however, the achieved compression level is modest. To increase the compression level and to decrease the transmitted bit rate, a quantizer is used for quantization of the motion compensated difference before entropy coding, as shown in Figure 6.13. It is necessary to note, though, that quantizing the difference makes the motion compensated prediction lossy.


Fig. 6.13. Block diagram of lossy motion compensated encoder

A uniform quantizer with 511 uniformly spaced quantizing levels is used. An illustration of a uniform quantizer output-input characteristic with 8 quantizing levels is shown in Figure 4.7, a in paragraph 4.6. After quantizing, the motion compensated difference can take values in the range of ±255. In the case of successful motion estimation, many values of the difference are close to zero and after quantizing will be equal to zero. This decreases the transmitted bit rate after entropy coding. Such a coding technique is explained in the next paragraph.

As follows from the above discussion, small quantizer steps result in a high-quality reconstruction of video signals and a high data rate. Conversely, the larger the quantizer steps used, the lower the data rate and the lower the attainable reconstruction quality.

An objective measure of the quality of video signals is the peak signal-to-noise ratio (PSNR). In this case the noise is taken as the mean square error (MSE) between an original and a reconstructed picture of size M×N pixels. The MSE is defined as

$$MSE = \frac{1}{MN}\sum_{i=1}^{M}\sum_{j=1}^{N}\big(x_{i,j} - \hat{x}_{i,j}\big)^2,$$

where $x_{i,j}$ is the value of the original pixel and $\hat{x}_{i,j}$ is the value of the corresponding reconstructed pixel. The PSNR is then defined as

$$PSNR = 10\lg\frac{(\text{peak-to-peak signal})^2}{MSE}.$$

If 8-bit (256 levels) quantization is used for the luminance or chrominance signals, the peak-to-peak signal is equal to 255. In this case

$$PSNR = 10\lg\frac{255^2}{MSE}.$$
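A minimal sketch (in Python, not from the original text) of this PSNR computation for 8-bit pictures; the test arrays are illustrative.

```python
import numpy as np

def psnr(original, reconstructed, peak=255.0):
    """PSNR in dB between two equally sized pictures, using the MSE
    definition above; 'peak' is the peak-to-peak signal (255 for 8 bits)."""
    mse = np.mean((original.astype(float) - reconstructed.astype(float)) ** 2)
    return 10 * np.log10(peak ** 2 / mse)

rng = np.random.default_rng(4)
x = rng.integers(0, 256, size=(480, 704))
x_hat = np.clip(x + rng.normal(0, 2, x.shape), 0, 255)  # mild distortion
print(f"PSNR = {psnr(x, x_hat):.1f} dB")
```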

For comparison of the performance of various coding techniques, rate-distortion curves are used. Such a curve represents the dependence of PSNR on the data rate. A typical rate-distortion curve is shown in Figure 6.14.

Fig. 6.14. Typical rate-distortion curve (PSNR in dB versus data rate in Mbit/s)

6.5 Compression of Images Using Transformations

It was shown before that predictive encoding can reduce the bit rate of the images to be transmitted, but the encoding efficiency is limited because the number of transmitted elements (pixel differences) is the same as in encoding without prediction. The reduction effect is achieved through the reduced entropy when prediction is used.

The number of transmitted elements can be reduced by using mathematical image transformations, in which the original pixel values are multiplied by the values of a collection of base functions and the results are summed to obtain the coefficients of the image expansion over those base functions. The values of the coefficients show how similar the variation of the pixel values is to the corresponding base function. In the case of good coincidence the result is a large number; in the opposite case the result is a small positive or negative number.

The goal of a good signal compression transform is to transfer the information contained in the data into the smallest number of transform coefficients. In the general case, when a data block is described as an N-element 1-D vector

$$\mathbf{x} = [x_1\; x_2\; \ldots\; x_N],$$

the transform is performed by multiplying the data vector by an N×N matrix of base functions:

$$\begin{bmatrix} C_1 \\ C_2 \\ \vdots \\ C_N \end{bmatrix} = \begin{bmatrix} t_{11} & t_{12} & \cdots & t_{1N} \\ t_{21} & t_{22} & \cdots & t_{2N} \\ \vdots & \vdots & \ddots & \vdots \\ t_{N1} & t_{N2} & \cdots & t_{NN} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_N \end{bmatrix}.$$

Each row of the base functions matrix, $\mathbf{t}_1 = [t_{11}\; t_{12}\; \ldots\; t_{1N}]$, $\mathbf{t}_2 = [t_{21}\; t_{22}\; \ldots\; t_{2N}]$, …, $\mathbf{t}_N = [t_{N1}\; t_{N2}\; \ldots\; t_{NN}]$, represents a basis vector of the transform. A successfully chosen transform effectively transfers the data information into a smaller number of transform coefficients C in comparison with the number of data vector elements. Thus, by transmitting only the marked coefficients we can reduce the amount of information processed.

For demonstration purposes let us examine a simple example of a four-point 1-D array expansion using Hadamard base functions. The examined array is

[64 17 110 23].

The values of the Hadamard functions $H_m(n)$, m = 1, 2, …, are presented in Table 6.3.

Table 6.3 The values of the Hadamard functions $H_m(n)$

n | H₁(n) | H₂(n) | H₃(n) | H₄(n)
1 | 0,5   | 0,5   | 0,5   | 0,5
2 | 0,5   | 0,5   | −0,5  | −0,5
3 | 0,5   | −0,5  | −0,5  | 0,5
4 | 0,5   | −0,5  | 0,5   | −0,5

The expansion coefficients are calculated as follows:

K₀ = 64·0,5 + 17·0,5 + 110·0,5 + 23·0,5 = 107,
K₁ = 64·0,5 + 17·0,5 − 110·0,5 − 23·0,5 = −26,
K₂ = 64·0,5 − 17·0,5 − 110·0,5 + 23·0,5 = −20,
K₃ = 64·0,5 − 17·0,5 + 110·0,5 − 23·0,5 = 67.

The coefficient K₀ is proportional to the average of the pixel values and is usually called the DC (Direct Current) coefficient. The other coefficients are called AC (Alternating Current) coefficients. The transformed array will be

[107 −26 −20 67].

The Hadamard transformation, like many other transformations, is reversible: the original pixel values can be reconstructed using the inverse transformation. For this, the obtained values of the expansion coefficients have to be multiplied by the corresponding base function values and the products added together. The calculation results are presented in Table 6.4, where row n gives the products for the n-th sample.

Table 6.4 Intermediate and final results of the inverse Hadamard transformation

Coefficient K_i | K₀H₁(n) | K₁H₂(n) | K₂H₃(n) | K₃H₄(n) | Row sum (reconstructed values)
107             | 53,5    | −13     | −10     | 33,5    | 64
−26             | 53,5    | −13     | 10      | −33,5   | 17
−20             | 53,5    | 13      | 10      | 33,5    | 110
67              | 53,5    | 13      | −10     | −33,5   | 23
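A minimal sketch (in Python, not from the original text) verifying this forward and inverse Hadamard expansion; the matrix H below holds the four base functions of Table 6.3 as rows.

```python
import numpy as np

# Rows are the Hadamard base functions H1..H4 sampled at n = 1..4 (Table 6.3)
H = 0.5 * np.array([[ 1,  1,  1,  1],
                    [ 1,  1, -1, -1],
                    [ 1, -1, -1,  1],
                    [ 1, -1,  1, -1]], dtype=float)

x = np.array([64, 17, 110, 23], dtype=float)
K = H @ x                 # forward transform: [107, -26, -20, 67]
x_rec = H.T @ K           # inverse: multiply coefficients by base functions
print(K, x_rec)           # x_rec reproduces [64, 17, 110, 23]
```

Since the rows of H are orthonormal, the inverse transform is simply multiplication by the transposed matrix, which is what Table 6.4 computes column by column.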

For encoding images and video streams the Discrete Cosine Transform (DCT) is regularly used. DCTs are important to numerous applications in science and engineering, from lossy compression of audio (e.g., MP3) and images (e.g., JPEG), where small high-frequency components can be discarded, to spectral methods for the numerical solution of partial differential equations [44]. The use of cosine rather than sine functions is critical for compression, since it turns out that fewer cosine functions are needed to approximate a typical signal.

In the case when a data block is described as an N-element 1-D vector $\mathbf{x} = [x_1\; x_2\; \ldots\; x_N]$, its discrete cosine transform coefficients are the list of length N given by

$$C(k) = \alpha(k)\sum_{n=0}^{N-1} x(n)\cos\frac{(2n+1)k\pi}{2N}, \qquad k = 0, 1, 2, \ldots, N-1.$$

Similarly, the inverse DCT is defined as

$$x(n) = \sum_{k=0}^{N-1} \alpha(k)\, C(k)\cos\frac{(2n+1)k\pi}{2N}, \qquad n = 0, 1, 2, \ldots, N-1.$$

In both of these equations

$$\alpha(k) = \begin{cases} \sqrt{1/N}, & k = 0, \\ \sqrt{2/N}, & k = 1, 2, \ldots, N-1. \end{cases}$$

In order to reduce the complexity of the circuitry and the processing time required, the elementary data block size chosen is generally 8×8 pixels, which the DCT transforms into a matrix of 8×8 coefficients.

In this case the basis functions $t_k(n)$ are 1-D arrays defined as

$$t_k(n) = \alpha(k)\cos\frac{(2n+1)k\pi}{2N}, \qquad k, n = 0, 1, 2, \ldots, N-1.$$

The basic functions for an 8-point DCT are given in the matrix, shown below.

0,354 0,354 0,354 0,354 0,354 0,354 0,354 0,354

0,49 0,416 0,278 0,098 -0,098 -0,278 -0,416 -0,49

0,462 0,191 -0,191 -0,462 -0,462 -0,191 0,191 0,462

0,416 -0,098 -0,49 -0,278 0,278 0,49 0,098 -0,416

0,354 -0,354 -0,354 0,354 0,354 -0,354 -0,354 0,354

0,278 -0,49 0,098 0,416 -0,416 -0,098 0,49 -0,278

0,191 -0,462 0,462 -0,191 -0,191 0,462 -0,462 0,191

0,098 -0,278 0,416 -0,49 0,49 -0,416 0,278 -0,098

These are just sampled versions of cosine waveforms of increasing frequency, ranging from 0 periods per vector (i.e., constant) in the case of the first vector to 3,5 periods per vector in the case of the last one, with each vector containing 0,5 period more than the one before it. Graphical views of these waveforms are presented in Figure 6.15.

Fig. 6.15. Graphical view of 8 basic functions for 1-D DCT
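A minimal sketch (in Python, not from the original text) of the 1-D DCT defined above; it reproduces the 8-point basis matrix values (0,354, 0,49, …) printed earlier.

```python
import numpy as np

def dct_matrix(N=8):
    """Rows t_k(n) = alpha(k) * cos((2n+1)k*pi / (2N)) of the N-point DCT."""
    n = np.arange(N)
    T = np.array([np.cos((2 * n + 1) * k * np.pi / (2 * N)) for k in range(N)])
    alpha = np.full(N, np.sqrt(2.0 / N)); alpha[0] = np.sqrt(1.0 / N)
    return alpha[:, None] * T

T = dct_matrix(8)
print(np.round(T, 3))          # first row is 0.354..., matching the text
x = np.arange(8, dtype=float)  # illustrative data block
C = T @ x                      # forward DCT coefficients
print(np.allclose(T.T @ C, x)) # True: the transform is orthonormal
```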

For pictures a 2-D DCT is required. The forward and inverse transforms are calculated according to the equations

$$C(k,l) = \alpha(k)\alpha(l)\sum_{m=0}^{N-1}\sum_{n=0}^{N-1} x(m,n)\cos\frac{(2m+1)k\pi}{2N}\cos\frac{(2n+1)l\pi}{2N}, \qquad k, l = 0, 1, \ldots, N-1,$$

and

$$x(m,n) = \sum_{k=0}^{N-1}\sum_{l=0}^{N-1} \alpha(k)\alpha(l)\, C(k,l)\cos\frac{(2m+1)k\pi}{2N}\cos\frac{(2n+1)l\pi}{2N},$$

where

$$\alpha(k) = \begin{cases} \sqrt{1/N}, & k = 0, \\ \sqrt{2/N}, & k = 1, 2, \ldots, N-1. \end{cases}$$

The basis vectors $t_{k,l}(m,n)$ in this case are 2-D arrays defined by

$$t_{k,l}(m,n) = \alpha(k)\alpha(l)\cos\frac{(2m+1)k\pi}{2N}\cos\frac{(2n+1)l\pi}{2N}, \qquad m, n = 0, 1, \ldots, N-1.$$

Pictures of the 64 2-D basic vectors from an 8×8 transform are shown in Figure 6.16.

Fig. 6.16. Basic vectors for 2-D DCT


Transforming integer pixel values into real transform coefficients enables us to transmit those coefficients instead of the pixel values. However, transmitting the DCT coefficients without any further processing would probably lead to an increase in the transmitted bit rate. Fortunately, the DCT packs most of the energy of the picture into a small number of coefficients; quantizing these coefficients and then transmitting only the significant ones can result in a significant saving of bit rate. The question remains how best to do this. As the energy is compacted primarily into the first few (low frequency) coefficients, one approach would be simply not to transmit the other (high frequency) coefficients. Moreover, eye sensitivity is higher at the lower and medium frequencies and significantly lower at high frequencies, which means that the transmitted bit rate can be reduced by eliminating the high frequency coefficients. However, even at these higher frequencies a significant signal may still be observable. Therefore the step size of the quantizer can be varied according to the frequency represented, with the step size increasing as the frequency increases. This ensures that high-frequency coefficients will be quantized to zero unless they are sufficiently large.

To do this, a quantization relationship between the quantized and original coefficient values of the form

$$\hat{C}_{k,l} = \mathrm{round}\!\left(\frac{8\, C_{k,l}}{Q\, W_{k,l}}\right)$$

is proposed. Here $\hat{C}_{k,l}$ is the value of the quantized coefficient, $C_{k,l}$ is the value of the original coefficient, Q is the quantizer step size for a particular block of data, and $W_{k,l}$ is the weighting value for this particular coefficient. The weighting value $W_{k,l}$ increases as the vertical and horizontal frequencies increase. An example of a matrix of weighting values is presented below.

increase. An example of a matrix of weighting values is presented below.

8 16 19 22 26 27 29 34

16 16 22 24 27 29 34 37

19 22 26 27 29 34 34 38

22 22 26 27 29 34 37 40

22 26 27 29 32 35 40 48

26 27 29 32 35 40 48 58

26 27 29 34 38 46 56 69

27 29 35 38 46 56 69 83

For example, if a calculated DCT coefficient is $C_{4,4} = 75$ and the quantization step size is 16, then the quantized coefficient $\hat{C}_{4,4}$, noting that the corresponding weighting value is 32, will be

$$\hat{C}_{4,4} = \mathrm{round}\!\left(\frac{8\cdot 75}{16\cdot 32}\right) = \mathrm{round}(1{,}17) = 1.$$

If the coefficient $C_{1,1}$ had the same value (i.e., 75), the quantized coefficient $\hat{C}_{1,1}$ would have the value

$$\hat{C}_{1,1} = \mathrm{round}\!\left(\frac{8\cdot 75}{16\cdot 16}\right) = \mathrm{round}(2{,}34) = 2.$$
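A minimal sketch (in Python, not from the original text) of this weighted quantization rule; W below is the weighting matrix printed above, and Q = 16 matches the worked example.

```python
import numpy as np

# Weighting matrix W from the text (larger weights at higher frequencies)
W = np.array([[ 8,16,19,22,26,27,29,34],
              [16,16,22,24,27,29,34,37],
              [19,22,26,27,29,34,34,38],
              [22,22,26,27,29,34,37,40],
              [22,26,27,29,32,35,40,48],
              [26,27,29,32,35,40,48,58],
              [26,27,29,34,38,46,56,69],
              [27,29,35,38,46,56,69,83]], dtype=float)

def quantize_dct(C, Q, W=W):
    """Frequency-weighted quantization: C_hat = round(8*C / (Q*W))."""
    return np.round(8.0 * C / (Q * W)).astype(int)

C = np.zeros((8, 8)); C[4, 4] = 75; C[1, 1] = 75
print(quantize_dct(C, Q=16)[4, 4], quantize_dct(C, Q=16)[1, 1])  # 1 and 2
```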

Example. Let us take an 8×8 block of data from a picture, as presented below [41]

91 42 67 72 83 189 245 241

75 74 171 245 240 227 216 221

50 45 75 65 119 228 245 234


72 93 198 246 239 225 214 222

33 58 75 72 155 242 242 229

75 106 215 248 237 223 216 223

The DCT coefficients after a 2-D DCT calculation and rounding to the nearest integer are

1294 -495 -104 0 -22 34 48 7

-66 -13 84 13 -1 30 5 -8

-15 7 26 1 21 24 -1 -4

-35 -7 53 12 -17 -4 -5 1

-10 8 19 -9 6 15 2 -2

-60 -24 71 25 -26 -7 -1 1

-7 19 35 -25 -8 24 11 0

-189 -117 173 99 -76 -31 0 -2

The final result after quantization using a step size of 8 and the weighting factors presented above is

161 -30 -5 0 0 1 1 0

-4 0 3 0 0 1 0 0

0 0 1 0 0 0 0 0

-1 0 2 0 0 0 0 0

0 0 0 0 0 0 0 0

-2 0 2 0 0 0 0 0

0 0 1 0 0 0 0 0

-7 -4 4 2 -1 0 0 0

We see that as many as 45 coefficients are equal to zero.

After quantizing the DCT coefficients, the stage of preparing them for transmission to the receiver follows. The first step of this process is scanning the 2-D array of coefficients into a 1-D array. For this the zig-zag scanning order of coefficients is used, as shown in Figures 6.17 and 6.18.

Fig. 6.17. Illustration of the zig-zag scanning order (horizontal frequency increases to the right, vertical frequency increases downwards, starting from the DC coefficient)
Fig. 6.18. Zig-zag scan order of the quantized DCT coefficients

This method ensures that the DC coefficient is scanned first, then the low-frequency AC coefficients follow, and the highest frequency coefficients are scanned last. This scanning order also ensures that the last nonzero coefficient will be met well before the end of the scan. There is no reason to continue the scan process after the last nonzero coefficient; it can be terminated at that point.

After the scanning, the coefficients have to be reordered into an encoding-friendly form. First the 1-D array of quantized coefficients is formed, as shown below.

{161, −30, −4, 0, 0, −5, 0, 3, 0, −1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 2, 0, −2, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 2, 0, −7, −4, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 2, 0, 0, 0, 0, 0, 0, 0, −1, 0, 0, 0, 0, 0, 0}

Now one can explicitly see that the array of quantized AC coefficients contains series of consecutive zeroes. Therefore a coding advantage can be obtained by using the so-called run-length encoding method.

The essence of the method is that the 1-D array of coefficients is reordered once more, converting each nonzero element into a pair of integers, the first of which indicates the number of consecutive zeroes before the next nonzero coefficient, and the second of which is that coefficient. The resulting run-coefficient pairs of our 1-D array are shown below.

{(0, 161), (0, −30), (0, −4), (2, −5), (1, 3), (1, −1), (2, 1), (2, 1), (0, 1), (2, 2), (1, −2), (5, 1), (0, 1), (4, 2), (1, −7), (0, −4), (0, 1), (10, 4), (0, 2), (7, −1), EOB (End Of Block)}

The 1-D array of coefficients contains 64 numbers. The run-coefficient pairs contain 41 numbers, including the EOB sign. The compression effect is evident. Further compression is achieved by applying a Huffman code to the run-coefficient pairs. A special Huffman code word is used for encoding the EOB sign: since this sign occurs quite commonly in the transmitted stream, it is represented by a short Huffman code word. The Huffman coding tables for the possible run-coefficient pairs have been developed by standards bodies and are universally accepted.
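A minimal sketch (in Python, not from the original text) of this run-length encoding of a scanned coefficient array; it reproduces the pair structure shown above on the head of the array.

```python
def run_length_encode(coeffs):
    """Convert a 1-D array of quantized coefficients into (run, value)
    pairs: 'run' counts the zeros preceding each nonzero value. Trailing
    zeros are replaced by a single end-of-block (EOB) marker."""
    pairs, run = [], 0
    for c in coeffs:
        if c == 0:
            run += 1                 # lengthen the current run of zeros
        else:
            pairs.append((run, c))   # emit the run and the coefficient
            run = 0
    pairs.append("EOB")              # trailing zeros are implied by EOB
    return pairs

scanned = [161, -30, -4, 0, 0, -5, 0, 3, 0, -1, 0, 0, 1]  # head of the array
print(run_length_encode(scanned))
# [(0, 161), (0, -30), (0, -4), (2, -5), (1, 3), (1, -1), (2, 1), 'EOB']
```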

The DCT is also used to code the motion-compensated prediction difference [41]. The block

diagram of a motion-compensated DCT encoder is shown in Figure 6.19.


Fig. 6.19. Motion-compensated prediction difference DCT encoder

The corresponding decoder is shown in Figure 6.20.


Fig. 6.20. Motion-compensated prediction difference DCT decoder

Pixels arrive at the encoder at a regular rate; the output bit stream rate, however, is not constant. The reasons are:

• the number of DCT coefficients to be encoded differs from block to block;
• the number of bits required to encode each DCT coefficient depends on the number of zero quantized DCT coefficients.

To stabilize the encoder output bit rate, a rate control buffer is used. It is inserted at the very output of the encoder. The buffer is simply a block of memory: data enter the encoder's buffer at a variable rate and leave it at a constant rate.

In the case of the decoder, the rate control buffer is inserted at the very input of the decoder. This buffer accepts data at a constant rate from the channel and transfers data to the entropy decoder at the variable rate that it requires. A minimal sketch of such a buffer is given below.
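To make the buffering idea concrete, here is a minimal Python sketch of such a FIFO rate control buffer (a deliberate simplification: the names and the padding policy are assumptions, and a real encoder also feeds the buffer fill level back to the quantizer to prevent overflow):

from collections import deque

class RateControlBuffer:
    """FIFO that absorbs a variable-rate input and drains at a constant rate."""

    def __init__(self, drain_bits_per_tick):
        self.fifo = deque()
        self.drain = drain_bits_per_tick

    def push(self, coded_block_bits):
        # Variable-size coded blocks arrive from the entropy coder.
        self.fifo.extend(coded_block_bits)

    def tick(self):
        # Emit exactly `drain` bits per tick toward the channel, padding
        # with stuffing bits (here zeroes) if the buffer runs empty.
        return [self.fifo.popleft() if self.fifo else 0
                for _ in range(self.drain)]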


7. LINEAR PREDICTIVE ANALYSIS of SPEECH SIGNALS

7.1 Basic Principles of Linear Predictive Analysis

Linear predictive (LP) analysis is one of the most significant and most common procedures, used in almost all of today's speech and audio compression algorithms and standards [46]. It is widely agreed that linear prediction coefficients (LPC) are among the most suitable parameters for representing and compressing audio and speech signals. The merit of LPC is no longer debated; discussion now concerns only the acceleration of their computational procedures.

Hence, referring to the block diagram of the model for speech production (see Figure 3.19 in paragraph 3.7), to the scheme of the digital IIR filter representing the vocal tract (see Figure 3.18 in paragraph 3.7), and to the supporting equation (3.2)

\[ H(z) = \frac{G}{1 - \sum_{k=1}^{P} b_k z^{-k}}, \qquad (7.1) \]

describing the system function of the mentioned filter, we can write a simple difference equation describing the relation between the speech samples s(n) and the excitation function samples u(n) [31, 33, 45]:

\[ s(n) = G\,u(n) + \sum_{k=1}^{P} b_k\, s(n-k) = v(n) + \sum_{k=1}^{P} b_k\, s(n-k). \qquad (7.2) \]

The proper values of the coefficients \(b_k\) are unknown at this point. In the light of the form of difference equation (7.2), we can introduce the concept of an estimate of the speech signal or, in other words, the synthesized speech signal \(\hat{s}(n)\), denoting it as [33]

\[ \hat{s}(n) = \sum_{k=1}^{P} \alpha_k\, s(n-k). \qquad (7.3) \]

This procedure of forming the speech signal estimate is called linear prediction (LP); the unknown coefficients \(\alpha_k\), k = 1, 2, ..., P, are the linear prediction coefficients, and the order P of the difference equation is the predictor order [45]. The signal estimate \(\hat{s}(n)\) is called the synthesized or predicted signal. Thus the LP procedure allows us to predict the signal sample \(\hat{s}(n)\) at the nth time moment by gathering up the P previous signal samples s(n-1), s(n-2), ..., s(n-P), multiplied by the weighting coefficients \(\alpha_1, \alpha_2, \ldots, \alpha_P\). This type of prediction is called short-time prediction, as only a small number of previous neighboring signal samples is used; usually the predictor order P does not exceed 10–15. In effect, the correlation of neighboring speech signal samples is exploited.

The digital filter system function corresponding to the last difference equation (7.3) is

\[ C(z) = \frac{\hat{S}(z)}{S(z)} = \sum_{k=1}^{P} \alpha_k z^{-k}. \qquad (7.4) \]

This is a nonrecursive finite impulse response (FIR) filter. The block diagram of the direct filter form (usually called a transversal filter) is depicted in Figure 7.1.

[Figure: a delay line of z^{-1} elements taps s(n-1), ..., s(n-P); each tap is weighted by \(\alpha_1, \ldots, \alpha_P\) and the weighted taps are summed to give \(\hat{s}(n)\); compactly, s(n) passes through C(z) to give \(\hat{s}(n)\).]

Fig. 7.1. Linear prediction filter

Let us evaluate the difference between the signal and the predicted signal, which is called the prediction error:

\[ d(n) = s(n) - \hat{s}(n). \qquad (7.5) \]

Referring to equation (7.2), we get

\[ d(n) = s(n) - \hat{s}(n) = s(n) - \sum_{k=1}^{P} \alpha_k\, s(n-k). \qquad (7.6) \]

This difference equation corresponds to the filter system function

\[ P(z) = \frac{D(z)}{S(z)} = \frac{S(z) - \hat{S}(z)}{S(z)} = 1 - C(z) = 1 - \sum_{k=1}^{P} \alpha_k z^{-k}. \qquad (7.7) \]

It follows from the last system function (7.7) and the corresponding difference equation (7.6) that this filter is a finite impulse response (FIR) filter as well. Its block diagram is presented in Figure 7.2.

[Figure: the transversal structure of Fig. 7.1 computes \(\hat{s}(n)\), which is subtracted from s(n) to give the prediction error d(n); compactly, s(n) passes through P(z) = 1 − C(z) to give d(n).]

Fig. 7.2. Block diagram of prediction error filter

Comparing equations (7.2) and (7.6), it can be seen that if \(\alpha_k = b_k\), then d(n) = v(n) = G u(n). Thus, the prediction error filter P(z) is an inverse filter for the system H(z) described by equation (7.1) under the condition \(\alpha_k = b_k\), i.e.,

\[ H(z) = \frac{G}{1 - \sum_{k=1}^{P} \alpha_k z^{-k}} = \frac{G}{P(z)}. \qquad (7.8) \]

Now it is possible to correct the block diagram of the speech synthesis filter modelling the vocal tract (see Figure 3.18). Under the assumed condition \(\alpha_k = b_k\) we get the speech synthesis filter depicted in Figure 7.3.

[Figure: the excitation v(n) drives the all-pole synthesis filter H(z) = G/P(z); in the direct form, the predictor C(z) computes \(\hat{s}(n)\) from past output samples and adds it to v(n) to produce s(n).]

Fig. 7.3. Final block diagram of speech synthesis filter

The basic problem of linear prediction analysis is to determine the set of predictor coefficients \(\alpha_k\) directly from the speech signal. However, the speech signal is a time-varying signal. Consequently, the predictor coefficients must be estimated from short segments of the speech signal, in such a way as to minimize the mean-squared prediction error over each short segment of the speech waveform. The resulting parameters are then taken as the parameters of the system function H(z) in the model for speech production. A notable property of the resulting parameters is that they contribute to an accurate and compact representation of the speech signal.

7.2 Optimal prediction coefficients

The short-time average (mean-squared) prediction error over the mth speech signal segment is defined as [45]

\[ E_d = \sum_{n} d^2(n) = \sum_{n} \Bigl( s(n) - \sum_{k=1}^{P} \alpha_k\, s(n-k) \Bigr)^{2}. \qquad (7.9) \]

The values of \(\alpha_k\) that minimize \(E_d\) can be found by setting

\[ \frac{\partial E_d}{\partial \alpha_k} = 0, \qquad k = 1, 2, \ldots, P, \qquad (7.10) \]

thereby obtaining a system of linearly independent equations

\[ \sum_{k=1}^{P} \alpha_k\, \varphi(l,k) = \varphi(l,0), \qquad l = 1, 2, \ldots, P, \qquad (7.11) \]

where

\[ \varphi(l,k) = \sum_{n} s(n-l)\, s(n-k), \qquad l = 1, 2, \ldots, P, \quad k = 0, 1, 2, \ldots, P. \qquad (7.12) \]

To solve equations (7.11) for the optimum LPC, one must first compute the quantities \(\varphi(l,k)\) for l = 1, 2, ..., P and k = 0, 1, ..., P. Once this is done, we only have to solve equations (7.11) to obtain the \(\alpha_k\)'s.

First, it is necessary to determine the limits on the sum in equation (7.12). If we wish to develop a short-time analysis procedure, the limits must span a finite interval. There are two basic approaches to this question, differing in the definition of the mean-squared prediction error, the requirements of short-time analysis, and computational complexity. One of them is the autocorrelation method; the second is the covariance method [45].

Secondly, it is necessary to solve the linear equations in an efficient manner. There are plenty of techniques that can be used to solve a system of P linear equations in P unknowns, but they are not equally efficient. For example, using the well-known Gaussian elimination method, the required number of mathematical operations is approximately proportional to P³. Because of the special properties of the coefficient matrices of these equations, it is possible to solve them much more efficiently.

7.3 The autocorrelation method

Using the autocorrelation method, the mth segment of the speech signal is assumed identically zero outside the interval \(0 \le n \le N-1\). The corresponding prediction error for a Pth-order predictor will then be nonzero over the interval \(0 \le n \le N+P-1\). For this case the exact expression of the prediction error is

\[ E_d = \sum_{n=0}^{N+P-1} d^2(n). \qquad (7.13) \]

It is simple to show that \(\varphi(l,k)\) from (7.12) can be expressed as

\[ \varphi(l,k) = \sum_{n=0}^{N+P-1} s(n-l)\, s(n-k), \qquad l = 1, 2, \ldots, P, \quad k = 0, 1, 2, \ldots, P. \qquad (7.14) \]

After some algebraic transformations of equation (7.14) we get

\[ \varphi(l,k) = R(|l-k|), \qquad l = 1, 2, \ldots, P, \quad k = 0, 1, 2, \ldots, P, \qquad (7.15) \]

where

\[ R(k) = \sum_{n=0}^{N-1-k} s(n)\, s(n+k) \qquad (7.16) \]

is the short-time autocorrelation function of the mth segment of the speech signal. Therefore equation (7.11) can be expressed as

\[ \sum_{k=1}^{P} \alpha_k\, R(|l-k|) = R(l), \qquad 1 \le l \le P, \qquad (7.17) \]

or in matrix form as

\[ \begin{bmatrix} R(0) & R(1) & \cdots & R(P-1) \\ R(1) & R(0) & \cdots & R(P-2) \\ \vdots & \vdots & \ddots & \vdots \\ R(P-1) & R(P-2) & \cdots & R(0) \end{bmatrix} \begin{bmatrix} \alpha_1 \\ \alpha_2 \\ \vdots \\ \alpha_P \end{bmatrix} = \begin{bmatrix} R(1) \\ R(2) \\ \vdots \\ R(P) \end{bmatrix}. \qquad (7.18) \]

7.3.1 The Durbin's recursive procedure

The P×P matrix of autocorrelation values is a Toeplitz matrix: it is symmetric and all the elements along each diagonal are equal. This special property is exploited to obtain an efficient algorithm for the solution of equations (7.17). The most efficient method known for solving this particular system of equations is the Levinson–Durbin recursion procedure, which can be expressed as follows [31, 33]:

1) \( E^{(0)} = R(0); \)

2) \( k_l = \Bigl( R(l) - \sum_{j=1}^{l-1} \alpha_j^{(l-1)} R(l-j) \Bigr) \Big/ E^{(l-1)}, \quad 1 \le l \le P; \)

3) \( \alpha_l^{(l)} = k_l; \)

4) \( \alpha_j^{(l)} = \alpha_j^{(l-1)} - k_l\, \alpha_{l-j}^{(l-1)}, \quad 1 \le j \le l-1; \)

5) \( E^{(l)} = (1 - k_l^2)\, E^{(l-1)}, \qquad (7.19) \)

where \(k_l\) are the so-called partial correlation coefficients, PARCOR's (more on them below). Equations (7.19) are solved recursively for each l = 1, ..., P, and the final result is given as

\[ \alpha_j = \alpha_j^{(P)}, \qquad 1 \le j \le P. \qquad (7.20) \]

For example, suppose we have the 2×2 autocorrelation system

\[ \begin{bmatrix} R(0) & R(1) \\ R(1) & R(0) \end{bmatrix} \begin{bmatrix} \alpha_1 \\ \alpha_2 \end{bmatrix} = \begin{bmatrix} R(1) \\ R(2) \end{bmatrix}. \]

According to the Durbin procedure we will have

\[ E^{(0)} = R(0), \qquad k_1 = \frac{R(1)}{R(0)}, \qquad \alpha_1^{(1)} = \frac{R(1)}{R(0)}, \]

\[ E^{(1)} = R(0) - \frac{R^2(1)}{R(0)}, \qquad k_2 = \frac{R(2)R(0) - R^2(1)}{R^2(0) - R^2(1)}, \qquad \alpha_2^{(2)} = \frac{R(2)R(0) - R^2(1)}{R^2(0) - R^2(1)}, \]

\[ \alpha_1^{(2)} = \frac{R(1)\,[R(0) - R(2)]}{R^2(0) - R^2(1)}, \qquad \alpha_1 = \alpha_1^{(2)}, \qquad \alpha_2 = \alpha_2^{(2)}. \]

In the simplest case, when the predictor is of the first order (P = 1), we have only one equation,

\[ \alpha_1 R(0) = R(1), \]

from which it follows that

\[ \alpha_1 = \frac{R(1)}{R(0)} = r(1). \]

This result means that in the case of a first-order predictor we have only one predictor coefficient, and it is equal to the second sample of the normalized autocorrelation function. Knowing from [31] that this coefficient for a speech signal is equal to 0,8–0,9, and referring to equation (7.3) and the block diagram in Figure 7.1, we can conclude that with only a first-order predictor (P = 1) the best way to predict the current sample of the speech signal is to multiply the previous one by 0,8–0,9, i.e.

\[ \hat{s}(n) = \alpha_1\, s(n-1) \approx (0{,}8 \ldots 0{,}9)\, s(n-1). \]

It should be noted that the quantities \(E^{(l)}\) in equations (7.19) are the prediction errors for a predictor of order l. Thus at each stage of the solution, the change of the prediction error with respect to the change of the predictor order can be observed.

Using this method, the number of required mathematical operations is approximately proportional to P².
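The recursion (7.19) translates directly into code. Below is a minimal Python sketch (the function name and interface are illustrative; numpy is assumed) that returns the LPC vector, the PARCOR coefficients and the final prediction error:

import numpy as np

def levinson_durbin(R, P):
    """Solve eq. (7.17) for alpha_1..alpha_P from autocorrelations R[0..P]."""
    a = np.zeros(P + 1)           # a[j] holds alpha_j of the current order
    parcor = np.zeros(P + 1)
    E = R[0]                      # step 1: E^(0) = R(0)
    for l in range(1, P + 1):
        # step 2: partial correlation coefficient k_l
        k = (R[l] - np.dot(a[1:l], R[l - 1:0:-1])) / E
        parcor[l] = k
        a_new = a.copy()
        a_new[l] = k                              # step 3
        a_new[1:l] = a[1:l] - k * a[l - 1:0:-1]   # step 4
        a = a_new
        E *= (1.0 - k * k)                        # step 5
    return a[1:], parcor[1:], E

For the first-order case discussed above, levinson_durbin(np.array([R0, R1]), 1) returns a single coefficient equal to R1/R0, as expected.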

As one can see from the Durbin recursive equations (7.19), the main role in computing the LPC is played by the PARCOR's \(k_l\), l = 1, ..., P. It is possible to calculate the LPC from the PARCOR's and vice versa. However, in contrast to the LPC, the PARCOR's have a physical meaning. This becomes clear when analyzing the vocal tract model consisting of a concatenation of lossless acoustic tubes of equal length (see Figure 3.17, b). If the lth tube cross-section area is \(A_l\) and the (l+1)th is \(A_{l+1}\), then the sound wave reflection coefficient at the junction of these sections is

\[ \rho_l = \frac{A_{l+1} - A_l}{A_{l+1} + A_l}, \]

and the PARCOR's are related (up to the sign convention used) to the corresponding reflection coefficients as

\[ k_l = \rho_l. \]

From this it follows that the PARCOR's \(k_l\) are in the range

\[ -1 < k_l < 1. \]

This condition on the parameters \(k_l\) is important, since it can be shown to be a necessary and sufficient condition for all of the roots of the polynomial P(z) from equation (7.7) to lie inside the unit circle, thereby guaranteeing the stability of the system H(z).

7.3.2 Prediction effectiveness

The initial expression for the prediction error is given by equation (7.9). Carrying out the operations in that equation and referring to the extended form of equation (7.10) and to equations (7.15), (7.16), we get

\[ E_{d\,\min} = R(0) - \sum_{k=1}^{P} \alpha_k R(k) = E_s - \sum_{k=1}^{P} \alpha_k R(k), \qquad (7.21) \]

where \(E_s = R(0)\) is the energy of the analyzed segment.

Dividing both sides of this equation by the number of samples N of the speech signal segment, we pass from the energy concept to the power concept and get

\[ P_{d\,\min} = P_s \Bigl( 1 - \sum_{k=1}^{P} \alpha_k\, r(k) \Bigr), \qquad (7.22) \]

where \(P_{d\,\min}\) is the minimal average prediction error power, \(P_s\) is the average signal power, and r(k) is the normalized autocorrelation function.

In general, the ratio

\[ G_P = \frac{P_s}{P_d} \qquad (7.23) \]

is called the prediction effectiveness (prediction gain) and expresses the relative decrease of bit rate obtainable with a differential quantization scheme. In the particular case when the power of the segment prediction error is minimal, we get the optimal (maximal) prediction effectiveness

\[ G_{P\,opt} = \frac{P_s}{P_{d\,\min}} = \frac{1}{1 - \sum_{k=1}^{P} \alpha_k\, r(k)}. \qquad (7.24) \]

Using only the first-order predictor and equating the values of \(\alpha_1\) and r(1) to their statistical means, i.e. to 0,8–0,9, we get \(G_{1\,opt}\) = 2,78–5,26, i.e. 4,43–7,21 dB.

7.4 The covariance method

Without going into details, we emphasize that the second approach to defining the mth speech segment and the limits on the sums is to fix the interval over which the mean-squared error is computed. According to this approach, the expression for the mean-squared error takes the form

\[ E_d = \sum_{n=0}^{N-1} d^2(n). \]

The functions \(\varphi(l,k)\) then become

\[ \varphi(l,k) = \sum_{n=0}^{N-1} s(n-l)\, s(n-k), \qquad l = 1, 2, \ldots, P, \quad k = 0, 1, 2, \ldots, P, \]

and the system of linear equations becomes

\[ \sum_{k=1}^{P} \alpha_k\, \varphi(l,k) = \varphi(l,0), \qquad l = 1, 2, \ldots, P, \]

or in matrix form

\[ \begin{bmatrix} \varphi(1,1) & \varphi(1,2) & \cdots & \varphi(1,P) \\ \varphi(2,1) & \varphi(2,2) & \cdots & \varphi(2,P) \\ \vdots & \vdots & \ddots & \vdots \\ \varphi(P,1) & \varphi(P,2) & \cdots & \varphi(P,P) \end{bmatrix} \begin{bmatrix} \alpha_1 \\ \alpha_2 \\ \vdots \\ \alpha_P \end{bmatrix} = \begin{bmatrix} \varphi(1,0) \\ \varphi(2,0) \\ \vdots \\ \varphi(P,0) \end{bmatrix}. \]

We see that this P×P matrix of correlation-like values is a symmetric, positive definite matrix, but not a Toeplitz matrix. Detailed analysis shows that this matrix of \(\varphi(l,k)\) values has the properties of a covariance matrix. Therefore the method of analysis based upon this way of computing the mean-squared prediction error is known as the covariance method. Such a system of equations can be solved in an efficient manner, and the resulting method of solution is called the Cholesky decomposition [31, 33, 45].

Without going deep into the details of the Cholesky decomposition, we note that this method requires somewhat fewer mathematical operations than the Gaussian method, but their number is nevertheless approximately proportional to P³. A small sketch of the covariance method in code is given below.
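As an illustration, here is a minimal Python sketch of the covariance method (numpy and scipy are assumed; the function name and segment handling are illustrative): it builds the matrix of \(\varphi(l,k)\) values and solves the normal equations through a Cholesky factorization:

import numpy as np
from scipy.linalg import cho_factor, cho_solve

def lpc_covariance(s, P):
    """Covariance-method LPC of order P for one segment s(0..N-1)."""
    N = len(s)
    # phi(l, k) = sum_n s(n-l) s(n-k); to stay inside the segment, the sum
    # here starts at n = P (equivalently, the first P samples serve as
    # history for the predictor).
    phi = np.empty((P + 1, P + 1))
    for l in range(P + 1):
        for k in range(P + 1):
            phi[l, k] = np.dot(s[P - l:N - l], s[P - k:N - k])
    A = phi[1:, 1:]              # symmetric, positive definite, not Toeplitz
    b = phi[1:, 0]
    return cho_solve(cho_factor(A), b)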

7.5 The choice of parameters for linear predictive analysis

The main parameter is the predictor order P. A number of studies by different researchers have analyzed the dependence of the mean-squared prediction error upon the predictor order. Some results of these investigations are presented in Figure 7.4.

[Figure: normalized prediction error versus predictor order P (0 to 20), with separate curves for voiced and unvoiced speech; the unvoiced curve lies markedly higher.]

Fig. 7.4. Variation of prediction error versus predictor order

It is seen from these curves that the prediction error decreases very quickly until the predictor order reaches a value of 12–14; thereafter it decreases slowly. Note that the prediction error for unvoiced speech is much higher than for voiced speech, because the LP model describes voiced speech much better than unvoiced speech. The predictor order can thus be chosen on the basis of a predefined value \(\delta\) representing the relative reduction of the prediction error due to increasing the predictor order by one:

\[ P = p, \quad \text{when} \quad \frac{E^{(p+1)}}{E^{(p)}} \ge 1 - \delta, \qquad (7.25) \]

where \(E^{(p)}\) and \(E^{(p+1)}\) are the prediction errors of the pth- and (p+1)th-order predictors, respectively. Typical values of P in many investigations [33] lie between 10 and 14.

As is known, speech is a non-stationary process. Consequently, the LPC's are time-varying functions, and their estimation procedure has to be renewed periodically. Fortunately, the LPC's fluctuate slowly; the quasi-stationary period of their fluctuation is about 10–12 ms. As a result, the duration of the audio signal segment for LP analysis is chosen equal to 10–12 ms (N samples) on average, and the next segment for analysis is picked approximately 5 ms (L samples) later, as shown in Figure 7.5.

This signal analysis technique is called the sliding window technique, or simply windowing. There are many windows used for windowing. The most popular ones used for LP analysis are the Rectangular (no window), Triangular, Hann and Hamming windows. Graphs of some of them are presented in Figure 7.6, and their analytical expressions are given below.

[Figure 7.5: successive analysis windows w(n) of length N samples; the mth segment starts at \(n_m\) and the next at \(n_{m+1}\), L samples later. Figure 7.6: the Rectangular, Triangular, Hamming and Hann windows plotted over 0 ≤ n ≤ N−1.]

Fig. 7.5. Illustration of sliding window technique. Fig. 7.6. Some most popular windows

Rectangular window:

\[ w(n) = \begin{cases} 1, & 0 \le n \le N-1; \\ 0, & \text{otherwise}. \end{cases} \]

Bartlett (triangular) window:

\[ w(n) = \begin{cases} 2n/(N-1), & 0 \le n \le (N-1)/2; \\ 2 - 2n/(N-1), & (N-1)/2 \le n \le N-1; \\ 0, & \text{otherwise}. \end{cases} \]

Hamming window:

\[ w(n) = \begin{cases} 0{,}54 - 0{,}46\cos\bigl(2\pi n/(N-1)\bigr), & 0 \le n \le N-1; \\ 0, & \text{otherwise}. \end{cases} \]

Hann window:

\[ w(n) = \begin{cases} 0{,}5\bigl[1 - \cos\bigl(2\pi n/(N-1)\bigr)\bigr], & 0 \le n \le N-1; \\ 0, & \text{otherwise}. \end{cases} \]
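In code these windows need not be written by hand; numpy provides them directly. A small sketch of extracting and windowing the mth analysis segment (all names are illustrative; the segment length N = 160 and hop L = 80 correspond to 10 ms and 5 ms at a 16 kHz sampling rate):

import numpy as np

WINDOWS = {"rectangular": np.ones,
           "bartlett": np.bartlett,
           "hamming": np.hamming,
           "hann": np.hanning}

def windowed_segment(s, m, N=160, L=80, window="hamming"):
    """Extract the mth segment of length N with hop L and apply the window."""
    seg = s[m * L : m * L + N]
    return seg * WINDOWS[window](N)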

The stability of the signal synthesis filter (see Figure 7.3) with the system function H(z) is very important. Theoretically, the autocorrelation method guarantees filter stability. However, due to the finite accuracy of the practical calculation of the autocorrelation function, the filter can become unstable. The reason lies in the shape of the signal spectrum. For example, the average spectrum of the speech signal falls off at approximately 6 dB per octave (see Figure 7.7). This increases the calculation errors of the autocorrelation function and of the LPC's, especially in the high-frequency range. Therefore the signal synthesis filter reconstructs the speech spectrum, especially its high-frequency part, not sufficiently accurately. To improve the quality of synthesized speech at high frequencies, J. D. Markel and A. H. Gray [45] recommended prior correction of the speech signal spectrum by means of a pre-emphasis filter whose system function is

\[ H_{pre}(z) = 1 - \mu z^{-1}, \qquad (7.26) \]

and whose frequency response is depicted in Figure 7.8. A typical value of the filter parameter \(\mu\) is 0,8. The filter's aim is to compensate for the high-frequency part of the speech signal that is suppressed by the human sound production mechanism. Analogously to the pre-emphasis filter in the FM broadcasting system, this filter raises the spectrum of the initial speech signal in the high-frequency range by approximately 6 dB per octave. The resulting spectrum is much flatter than the initial signal spectrum.

115

-75

-65

-55

-35

-45

-25

0,250,0625 f, kHz1,0 2,0

( ), dBfP

4,0 8,0

1

2

-10

-15

-20

f, kHz

10

5

0

-5

0,5 1,0 1,5 2,0 2,5 3,00

0,7

0,9

3,5

( ), dBpreH f

Fig. 7.7. Two possible average spectra of speech signal Fig. 7.8. Frequency response of pre-emphasis filter

For the inverse correction of the synthesized speech spectrum, a de-emphasis filter is used at the output of the synthesizer. The system function of the de-emphasis filter is

\[ H_{de}(z) = \frac{1}{1 - \mu z^{-1}}. \qquad (7.27) \]

A typical value of the filter parameter \(\mu\) is again 0,8. A short sketch of this pre-emphasis/de-emphasis pair in code is given below.
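The pair (7.26), (7.27) is a one-tap FIR filter and its all-pole inverse, so both are one-liners with scipy.signal.lfilter (μ = 0,8 as in the text; the function names are illustrative):

import numpy as np
from scipy.signal import lfilter

MU = 0.8

def pre_emphasis(s):
    # H_pre(z) = 1 - MU * z^-1, eq. (7.26)
    return lfilter([1.0, -MU], [1.0], s)

def de_emphasis(s):
    # H_de(z) = 1 / (1 - MU * z^-1), eq. (7.27)
    return lfilter([1.0], [1.0, -MU], s)

x = np.random.randn(1000)
# The de-emphasis filter exactly undoes the pre-emphasis filter:
assert np.allclose(de_emphasis(pre_emphasis(x)), x)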

The pre-emphasis/de-emphasis system described above is one of the simplest used before LP analysis. More sophisticated filters, called whitening filters, are also used very often. If the spectrum of the speech signal is described by the frequency function S(f), the frequency response of the whitening filter G(f) must meet the condition

\[ |G(f)|^2\, S(f) = \text{const}. \]

In this case the spectrum of the output signal will be

\[ S_{out}(f) = S(f)\, |G(f)|^2 = \text{const}, \]

i.e. the output process will be white, with a flat spectrum.

The method of linear prediction has been widely used in developing effective speech compression algorithms and the corresponding devices called vocoders (voice coders) [31, 33, 46].

8. AUDIO SIGNALS COMPRESSION

8.1 Sub-Band Coding Principle

Sub-Band Coding (SBC) is a powerful and general method of encoding audio signals

efficiently. Unlike source specific methods (like LPC, which works very well only on speech), SBC

can encode any audio signal from any source, making it ideal for music recordings, movie

soundtracks, and the like. MPEG Audio is the most popular example of SBC.

SBC depends on a phenomenon of the human hearing system called masking. Normal human

ears are sensitive to a wide range of frequencies. However, when a lot of signal energy is present at

one frequency, the ear cannot hear lower energy at nearby frequencies. We say that the louder

frequency masks the softer frequencies. The louder frequency is called the masker.

The basic idea of SBC is to save signal bandwidth by throwing away information about

frequencies which are masked. The result won't be the same as the original signal, but if the

computation is done right, human ears can't hear the difference.

The simplest way to encode audio signals is Pulse Code Modulation (PCM), which is used on

music Compact Discs (CDs), Digital Audio Tape (DAT) recordings, and so on. Like all digitization,

PCM adds quantization noise to the signal, which is generally undesirable. The fewer bits used in

digitization, the more noise gets added. The way to keep this noise from being a problem is to use

enough bits to ensure that the noise is always low enough to be masked either by the signal or by

other sources of noise. This produces a high quality signal, but at a high bit rate (over 700 kbps for

one channel of CD audio). A lot of those bits are encoding masked portions of the signal, and are

being wasted.

There are more clever ways of digitizing an audio signal, which can save some of that wasted bandwidth. A classic method is nonlinear PCM, such as µ-law encoding (named after the parameter µ in its compression law). This is like PCM on a logarithmic scale, and the effect is to add noise that is proportional to the signal strength. Sun's .au format for sound files is a popular example of µ-law encoding. Using 8-bit µ-law encoding would cut our one channel of CD audio down to about 350 kbps, which is better but still pretty high, and is often audibly poorer in quality than the original (this scheme doesn't really model masking effects).

Most SBC encoders use a structure depicted in Figure 8.1, a.

[Figure: (a) encoder: the digital audio signal (PCM) passes through a time-to-frequency mapping; a psychoacoustic model controls the quantizer and encoder; frame packing, together with ancillary data, forms the ISO/MPEG audio bitstream. (b) decoder: frame unpacking, reconstruction, and frequency-to-time mapping recover the digital audio signal (PCM).]

Fig. 8.1. Basic sub-band coding scheme; (a) encoder, (b) decoder

First, a time-to-frequency mapping (a filter bank, an FFT, or something else) decomposes the input signal into sub-bands. The number of sub-bands varies from a few to several tens, depending on the application area. For example, the ITU-T G.722 wideband speech codec [69] uses only 2 sub-bands, while the ISO MPEG-1 Layer 1 and 2 encoders [58] use 32 sub-bands.

It has to be noted that the directly formed sub-band signals, as illustrated in Figure 8.2, a, are sampled at the same sampling frequency as the input signal, although their spectra are much narrower. They are therefore oversampled and have to be subsampled to save computational resources. This can be performed by a bank of decimators, each of which decreases the sampling frequency according to the ratio of the input signal bandwidth to the corresponding sub-band signal bandwidth, as illustrated in Figure 8.2, a. In this case the reconstructing part is composed of interpolators and synthesis filters, as depicted in Figure 8.2, b. The task of the interpolators is to recover the initial sampling frequency by increasing it the appropriate number of times. However, a real-time implementation of this direct construction requires a lot of computational resources. To resolve this problem, some radical solutions were suggested: for example, cosine-modulated filter banks [68], realized with polyphase decomposition networks (PPDN) and the discrete (fast) Fourier transform (DFT/FFT), which significantly accelerated the computation of the filtering and inverse filtering procedures.

[Figure: (a) an analysis filter bank with impulse responses h_0(n), h_1(n), ..., h_{N-1}(n) splits s(n) into sub-band signals, each followed by an M_i:1 decimator; (b) 1:M_i interpolators followed by the synthesis filter bank recover s(n).]

Fig. 8.2. Illustration of the sub-band decomposition and reconstruction principles (h_0(n), h_1(n), ..., h_{N-1}(n) – filter impulse responses, M_0, M_1, ..., M_{N-1} – sampling frequency change factors); (a) decomposition part, (b) reconstruction part

The psychoacoustic model looks at these sub-bands as well as the original signal, and

determines masking thresholds using psychoacoustic information. Using these masking thresholds,

each of the sub-band samples is quantized and encoded so as to keep the quantization noise below

the masking threshold. The final step is to assemble all these quantized samples into frames, so that

the decoder can figure it out without getting lost.

Decoding is easier, since there is no need for a psychoacoustic model (see Figure 8.1, b). The

frames are unpacked, sub-band samples are decoded, and a frequency-to-time mapping turns them

back into a single output audio signal.

This is a basic, generic sketch of how SBC works. Notice that we haven't looked at how much

computation it takes to do this. For practical systems that need to run in real time, computation is a

major issue, and is usually the main constraint on what can be done.

8.2 MPEG Audio Layers

Over the years, SBC systems have been developed by many of the key companies and laboratories in the audio industry. Beginning in the late 1980s, a standardization body of the ISO called the Moving Picture Experts Group (MPEG) developed generic standards for coding of both audio and video for storage and transmission over various digital media. The ISO standard specifies the syntax only for the coded bit streams and the decoding process; sufficient flexibility is allowed for encoder implementation. The MPEG first-phase (MPEG-1) audio coder operates in single-channel or two-channel stereo mode at sampling rates of 32, 44,1 and 48 kHz, and encodes audio at bit rates of 32 to 192 kbps per audio channel (depending on the layer) [58]. In the second phase of development, particular emphasis was placed on multichannel audio support and on an extension of MPEG-1 to lower sampling rates and lower bit rates. MPEG-2 audio consists mainly of two coding standards: MPEG-2 BC (Backward Compatibility) [59] and MPEG-2 AAC (Advanced Audio Coding) [60]. Unlike MPEG-2 BC, which is constrained by its backward compatibility with the MPEG-1 format, MPEG-2 AAC is unconstrained and can therefore provide better coding efficiency. A more recent development is the adoption of MPEG-4 [61] for very-low-bit-rate channels, such as those found in Internet and mobile applications. Table 8.1 lists the configurations used in the MPEG audio coding standards.

Table 8.1. Configurations used in MPEG audio coding standards

Standard          | Audio sampling rate, kHz                                  | Compressed bit rate, kbit/s          | Channels | Approved
MPEG-1 Layer I    | 32, 44,1, 48                                              | 32–448                               | 1–2      | 1992
MPEG-1 Layer II   | 32, 44,1, 48                                              | 32–384                               | 1–2      | 1992
MPEG-1 Layer III  | 32, 44,1, 48                                              | 32–320                               | 1–2      | 1993
MPEG-2 Layer I    | 32, 44,1, 48 / 16, 22,05, 24                              | 32–448 / 32–256 for two BC channels  | 1–5.1    | 1994
MPEG-2 Layer II   | 32, 44,1, 48 / 16, 22,05, 24                              | 32–384 / 8–160 for two BC channels   | 1–5.1    | 1994
MPEG-2 Layer III  | 32, 44,1, 48 / 16, 22,05, 24                              | 32–384 / 8–160 for two BC channels   | 1–5.1    | 1994
MPEG-2 AAC        | 8, 11,025, 12, 16, 22,05, 24, 32, 44,1, 48, 64, 88,2, 96  | indicated by a 23-bit unsigned integer | 1–48   | 1997
MPEG-4 T/F coding | 8, 11,025, 12, 16, 22,05, 24, 32, 44,1, 48, 64, 88,2, 96  | indicated by a 23-bit unsigned integer | 1–48   | 1999

MPEG-1 Audio is really a group of three different SBC schemes, called layers. Each layer is a

self-contained SBC coder with its own time-to-frequency mapping, psychoacoustic model, and

quantizer, as shown in the diagram above. Layer 1 is the simplest, but gives the poorest compression.

Layer 3 is the most complicated and difficult to compute, but gives the best compression.

The idea is that an application of MPEG-1 Audio can use whichever layer gives the best tradeoff

between computational burden and compression performance. Audio can be encoded in any one layer.

A standard MPEG decoder for any layer is also able to decode lower (simpler) layers of encoded

audio.

8.2.1 MPEG-1 Audio Layer 1

Basic block diagram of an MPEG-1 Audio Layer 1 encoder is shown in Figure 8.3 [62, 63, 64].

[Figure: the PCM audio signal feeds a 32-channel PPN analysis filter bank and, in parallel, a 512-point FFT whose power spectral density drives the psychoacoustic model; the resulting signal-to-mask ratios control the dynamic bit allocation; scale factor calculation, normalization, quantization and coding of the sub-band samples, together with the coded bit allocation indices and scale factor indices, feed the bit stream formatting and error check block that outputs the digital bitstream.]

Fig. 8.3. Basic block diagram of a Layer 1 encoder

The Layer 1 time-to-frequency mapping is a polyphase filter bank with 32 sub-bands. Polyphase

filters combine low computational complexity with flexible design and implementation options. The


sub-bands are equally spaced in frequency (unlike critical bands). The Layer 1 algorithm codes audio

in frames of 384 audio samples. It does so by grouping 12 samples from each of the 32 sub-bands, as

shown in Figure 8.4.

[Figure: the outputs of the 32 analysis filters h_0(n) ... h_31(n) are grouped into blocks of 12 samples per sub-band; one such block per sub-band forms a Layer 1 frame, while three successive blocks per sub-band form a Layer 2 or Layer 3 frame. Note: each sub-band filter produces 1 output sample for every 32 input samples.]

Fig. 8.4. Grouping of sub-band samples for Audio Layers 1, 2, 3

The Layer 1 psychoacoustic model uses a 512-point FFT to get detailed spectral information

about the signal. The output of the FFT is used to find both tonal (sinusoidal) and nontonal (noise)

maskers in the signal. Each masker produces a masking threshold depending on its frequency,

intensity, and tonality. For each sub-band, the individual masking thresholds are combined to form a

global masking threshold. The masking threshold is compared to the maximum signal level for the

sub-band, producing a signal-to-masker ratio (SMR) which is the input to the quantizer.

The Layer 1 quantizer/encoder first examines each sub-band's samples, finds the maximum

absolute value of these samples, and quantizes it to 6 bits. This is called the scale factor for the sub-

band. Then it determines the bit allocation for each sub-band by minimizing the total noise-to-mask

ratio with respect to the bits allocated to each sub-band. It's possible for heavily masked sub-bands to

end up with zero bits, so that no samples are encoded. Each group of 12 samples gets a bit allocation

and, if the bit allocation is not zero, a scale factor. The bit allocation tells the decoder the number of

bits used to represent each sample. For Layer 1 this allocation can be 0 to 15 bits per sub-band. The

scale factor is a multiplier that sizes the samples to fully use the range of the quantizer. Each scale

factor has a 6-bit representation. Finally, the sub-band samples are linearly quantized to the bit

allocation for that sub-band.
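The scale-and-quantize step can be sketched as follows in Python (a simplification for illustration only: the real Layer 1 uses a 6-bit coded scale factor table and midtread quantizers defined in the standard, which are not reproduced here):

import numpy as np

def quantize_subband(samples, nbits):
    """Scale 12 sub-band samples by their peak value and quantize uniformly."""
    scale = np.max(np.abs(samples))        # scale factor (coded with 6 bits)
    if nbits == 0 or scale == 0.0:
        return scale, np.zeros(len(samples), dtype=int)  # heavily masked band
    half_levels = (2 ** nbits - 1) // 2
    q = np.round(samples / scale * half_levels).astype(int)
    return scale, q

def dequantize_subband(scale, q, nbits):
    half_levels = (2 ** nbits - 1) // 2
    return q / half_levels * scale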

The Layer 1 frame packer has a fairly easy job (see Figure 8.5). Each frame starts with header information for synchronization and bookkeeping, and a 16-bit cyclic redundancy check (CRC) for error detection. Each of the 32 sub-bands gets 4 bits to describe the bit allocation and 6 bits for the scale factor. The remaining bits in the frame are used for sub-band samples, with an optional trailer for extra information.

[Figure: frame fields in order – Header (32 bits: 12-bit sync, 20-bit system info), CRC (16 bits), Bit Allocation (128–256 bits), Scalefactors (0–384 bits), Sub-band samples, Ancillary Data field (length not specified); 1 sub-band sample corresponds to 32 PCM audio input samples.]

Fig. 8.5. ISO/MPEG/Audio Layer 1 frame structure; valid for 384 PCM audio input samples, duration 8 ms with a sampling rate of 48 kHz

The decoder is much simpler (see Figure 8.6), as there is no psychoacoustic model. After the frame unpacker, the scale factor and bit allocation decoders and the sub-band sample decoder are followed by the inverse quantizer, whose job is to recover the original quantization levels of the sub-band samples. The decoder multiplies the decoded quantizer output by the scale factor to recover the quantized sub-band value. The dynamic range of the scale factors alone exceeds 120 dB; the combination of the bit allocation and the scale factor provides the potential for representing the samples with a dynamic range well over 120 dB. Finally, the PPN synthesis filter bank recovers the output audio signal from the sub-band signals.

[Figure: the digital bitstream is unpacked; the scale factors and bit allocation are decoded; the sub-band samples decoder and inverse quantizer feed the 32-channel PPN synthesis filter bank, which outputs the PCM audio signal.]

Fig. 8.6. Basic block diagram of a Layer 1 decoder

Layer 1 processes the input signal in frames of 384 PCM samples. At 48 kHz, each frame carries 8 ms of sound. The MPEG specification doesn't prescribe the encoded bit rate, allowing implementation flexibility. The highest quality is achieved with a bit rate of 384 kbps. Typical applications of Layer 1 included digital recording on tapes, hard disks, or magneto-optical disks, which can tolerate the high bit rate; these are, however, mostly outdated.

Using Audio Layer 1, good quality sound can be achieved with a compression ratio of 1:4, which corresponds to 384 kbps for a stereo signal.

8.2.2 MPEG-1 Audio Layer 2

Basic block diagram of a Layer 2 encoder is shown in Figure 8.7 [62, 63, 64]. The major blocks that are significantly changed or added to form the Layer 2 encoder are shown highlighted in Figure 8.7.

[Figure: as in Fig. 8.3, but with a 1024-point FFT, a superblock formation stage after the analysis filter bank, scale factor selection, and coding of the scale factors together with the scale factor selection information (scfsi).]

Fig. 8.7. Basic block diagram of a Layer 2 encoder

The Layer 2 algorithm is a straightforward enhancement of Layer 1. It codes the audio data in

larger groups and imposes some restrictions on the possible bit allocations for values from the middle

and higher sub-bands. It also represents the bit allocation, the scale factor values, and the quantized

samples with more compact code. Layer 2 gets better audio quality by saving bits in these areas, so

more code bits are available to represent the quantized sub-band values.

The Layer 2 time-to-frequency mapping is the same as in Layer 1: a polyphase filter bank with 32 sub-bands. However, in Layer 2 three blocks of 12×32 sub-band samples are combined to form a superblock (see Figure 8.4). This means a Layer 2 audio frame is effectively triple the size of a Layer 1 frame and contains the coded data for 1152 PCM samples, compared with 384 samples for Layer 1.

The Layer 2 psychoacoustic model is similar to the Layer 1 model, but it uses a 1024-point FFT for greater frequency resolution. It uses the same procedure as the Layer 1 model to produce signal-to-masker ratios for each of the 32 sub-bands.

The Layer 2 quantizer/encoder is similar to that used in Layer 1, generating 6-bit scale factors for each sub-band. However, Layer 2 frames are three times as long as Layer 1 frames, so each sub-band has a sequence of three successive scale factors. The encoder transmits only one of them when the values of the scale factors are sufficiently close, or one or two when it anticipates that temporal noise masking by the human auditory system will hide the distortion caused by sharing a scale factor; all three are transmitted otherwise. This gives, on average, a factor of 2 reduction in bit rate for the scale factors compared with Layer 1. The scale-factor selection information (SCFSI) field in the Layer 2 bit stream informs the decoder if and how the scale factor values are shared. Generally speaking, a more sophisticated method of scale factor coding is adopted in Layer 2. Bit allocations are computed in a similar way to Layer 1; however, the number of allowable quantizer step sizes is reduced for the mid- and high-frequency sub-bands, resulting in a saving in the bits used for the bit allocation code words.

The Layer 2 frame packer uses the same header and CRC structure as Layer 1 (see Figure 8.8). The number of bits used to describe bit allocations varies with the sub-band: 4 bits for the low sub-bands, 3 bits for the middle sub-bands, and 2 bits for the high sub-bands (this follows critical bandwidths). The scale factors (one, two or three, depending on the data) are encoded along with a 2-bit code describing which combination of scale factors is being used. The sub-band samples are quantized according to the bit allocation and then combined into groups of three (called granules). Each granule is encoded with one code word. This allows Layer 2 to remove much more signal redundancy than Layer 1.

[Figure: frame fields in order – Header (32 bits: 12-bit sync, 20-bit system info), CRC (16 bits), Bit Allocation (26–188 bits: 4 bits linear for low sub-bands, 3 bits for mid sub-bands, 2 bits for high sub-bands), SCFSI (0–60 bits), Scalefactors (0–1080 bits), Sub-band samples as granules Gr0 ... Gr11 (12 granules of 3 sub-band samples each; 3 sub-band samples correspond to 96 PCM audio input samples), Ancillary Data field (length not specified).]

Fig. 8.8. ISO/MPEG/Audio Layer 2 frame structure; valid for 1152 PCM audio input samples, duration 24 ms with a sampling rate of 48 kHz

Layer 2 processes the input signal in frames of 1152 PCM samples. At a 48 kHz sampling frequency, each frame carries 24 ms of sound. The highest quality is achieved with a bit rate of 256 kbps, but quality is often good down to 64 kbps. The compression ratio equals 12 with a 64 kbps output bit rate and a 768 kbps input bit rate (48 kHz sampling frequency and 16-bit code words).

Typical applications of Layer 2 include audio broadcasting, television, consumer and professional recording, and multimedia. Audio files on the World Wide Web with the extension .mpeg2 or .mp2 are encoded with MPEG-1 Layer 2.

Using Audio Layer 2, good quality sound can be achieved with compression ratios of 1:6–1:8, which corresponds to 256–192 kbps for a stereo signal.

8.2.3 MPEG-1 Audio Layer 3

Basic block diagram of a Layer 3 encoder is shown in Figure 8.9 [62, 63, 64].

[Figure: the PCM audio signal feeds the PPN analysis filter bank followed by an MDCT with dynamic windowing; a 1024-point FFT drives the psychoacoustic model; the scaler and quantizer, the Huffman encoder, coding of side information and a bit reservoir feed the bit stream formatting and error check block that outputs the digital bitstream.]

Fig. 8.9. Basic block diagram of a Layer 3 encoder

Layer 3 is substantially more complicated than Layer 2. It is the most aggressive compression

scheme and the most ubiquitous today. It uses both polyphase and discrete cosine transform filter

banks, a polynomial prediction psychoacoustic model, and sophisticated quantization and encoding

schemes allowing variable length frames. The frame packer includes a bit reservoir (buffer) which

allows more bits to be used for portions of the signal that need them.

Layer 3 is intended for applications where a critical need for low bit rate justifies the expensive

and sophisticated encoding system. It allows high quality results at bit rates as low as 64 kbps. Typical

applications are in telecommunication and professional audio, such as commercially published music

and video.

The Layer 3 algorithm is a much more refined approach derived from ASPEC (audio spectral

perceptual entropy coding) and OCF (optimal coding in the frequency domain) algorithms [65, 66].

Although based on the same filter bank found in Layer 1 and Layer 2, Layer 3 compensates for some

filter bank deficiencies by processing the filter outputs with a modified discrete cosine transform

(MDCT) [44]. Figure 8.10 shows a block diagram of this processing for the encoder.

[Figure: each of the 32 analysis filter bank outputs h_0(n) ... h_31(n) is windowed (long, long-to-short, short, or short-to-long window, selected by the long/short block control from the psychoacoustic model) and transformed by an MDCT; alias reduction (only for long blocks) is applied across the resulting 576 spectral lines (32 sub-bands × 18 coefficients).]

Fig. 8.10. MPEG Audio Layer 3 filter bank processing

Unlike the polyphase filter bank, without quantization the MDCT transformation is lossless.

The MDCTs further subdivide the sub-band outputs in frequency to provide better spectral resolution.

Furthermore, once the sub-band components are subdivided in frequency, the Layer 3 encoder can

partially cancel some aliasing caused by the polyphase filter bank. Of course, the Layer 3 decoder has

to undo the alias cancellation so that the inverse MDCT can reconstruct sub-band samples in their

original, aliased form for the synthesis filter bank. Layer 3 specifies two different MDCT block

lengths: a long block of 18 samples or a short block of 6.

Before the MDCT, the filter outputs are windowed by half-period sine-shaped windows (see Figure 8.11). There is a 50-percent overlap between successive transform windows, so the window sizes are 36 and 12 samples, respectively. The long block length allows greater frequency resolution for audio signals with stationary characteristics, while the short block length provides better time resolution for transients. Note that the short block length is one third that of a long block. In the short block mode, three short blocks replace a long block, so that the number of MDCT samples for a frame of audio samples remains unchanged regardless of the block size selection. For a given frame of audio samples, the MDCTs can all have the same block length (long or short) or a mixed block mode. In the mixed block mode, the MDCTs for the two lower-frequency sub-bands have long blocks, and the MDCTs for the 30 upper sub-bands have short blocks. This mode provides better frequency resolution for the lower frequencies, where it is needed the most, without sacrificing time resolution for the higher frequencies. The switch between long and short blocks is not instantaneous: a long block with a specialized long-to-short or short-to-long data window serves to transition between long and short block types. Figure 8.11 shows how the MDCT windows transition between long and short block modes.

[Figure: window amplitude versus time for the four window types – Long (36 samples), Long-to-Short, Short (12 samples, built from 6-sample halves), and Short-to-Long – showing the 18- and 6-sample overlaps between successive windows.]

Fig. 8.11. MDCT window types and arrangement of transition between overlapping long and short window types
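For reference, the MDCT of a windowed block of 2N time samples produces N spectral coefficients. A minimal direct (matrix-style) Python sketch of the forward transform with a half-period sine window (illustrative only; Layer 3 additionally specifies the exact window shapes, the block switching logic and alias reduction):

import numpy as np

def mdct(block):
    """Direct MDCT of one block of 2N samples -> N coefficients."""
    two_n = len(block)
    N = two_n // 2
    n = np.arange(two_n)
    k = np.arange(N)
    w = np.sin(np.pi / two_n * (n + 0.5))     # half-period sine window
    phase = np.pi / N * np.outer(k + 0.5, n + 0.5 + N / 2.0)
    return np.cos(phase) @ (w * block)

coeffs = mdct(np.random.randn(36))   # a long block: 36 samples in,
assert coeffs.shape == (18,)         # 18 MDCT coefficients out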

Because MDCT processing of a sub-band signal provides better frequency resolution, it

consequently has poorer time resolution. The MDCT operates on 12 or 36 polyphase filter samples,

so the effective time window of audio samples involved in this processing is 12 or 36 times larger.

The quantization of MDCT values will cause errors spread over this larger time window, so it is more

likely that this quantization will produce audible distortions. Such distortions usually manifest

themselves as pre-echo because the temporal masking of noise occurring before a given signal is

weaker than the masking of noise after. Layer 3 incorporates several measures to reduce pre-echo.

First, the Layer 3 psychoacoustic model has modifications to detect the conditions for pre-echo.

Second, Layer 3 can borrow code bits from the bit reservoir to reduce quantization noise when pre-

echo conditions exist. Finally, the encoder can switch to a smaller MDCT block size to reduce the

effective time window. Besides the MDCT processing, other enhancements over the Layer 1 and

Layer 2 algorithms include the following.

Alias reduction. Layer 3 specifies a method of processing the MDCT values to remove some

artifacts caused by the overlapping bands of the polyphase filter bank.

Nonuniform quantization. The Layer 3 quantizer raises its input to the 3/4 power before

quantization to provide a more consistent signal-to-noise ratio over the range of quantizer values. The

requantizer in the MPEG/audio decoder relinearizes the values by raising its output to the 4/3 power.
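The 3/4-power companding before quantization, and the matching 4/3-power expansion in the decoder, can be sketched in a few lines of Python (the step size handling and rounding details are simplified relative to the standard):

import numpy as np

def l3_quantize(x, step):
    # Encoder: compress magnitudes with a 3/4 power law, then round.
    return (np.sign(x) * np.round(np.abs(x / step) ** 0.75)).astype(int)

def l3_requantize(q, step):
    # Decoder: relinearize with the inverse 4/3 power law.
    return np.sign(q) * np.abs(q).astype(float) ** (4.0 / 3.0) * step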

Scale factor bands. Unlike Layers 1 and 2, where each sub-band can have a different scale factor, Layer 3 uses scale factor bands. These bands cover several MDCT coefficients and have approximately critical-band widths. In Layer 3, scale factors serve to color the quantization noise to fit the varying frequency contours of the masking threshold. Values for these scale factors are adjusted as part of the noise allocation process.

Entropy coding of data values. To get better data compression, Layer 3 uses variable-length

Huffman codes to encode the quantized samples. After quantization, the encoder orders the 576 (32

sub-bands×18 MDCT coefficients/sub-band) quantized MDCT coefficients in a predetermined order.

The order is by increasing frequency except for the short MDCT block mode. For short blocks there

are three sets of window values for a given frequency, so the ordering is by frequency, then by

window, within each scale factor band. Ordering is advantageous because large values tend to fall at

the lower frequencies and long runs of zero or near-zero values tend to occupy the higher frequencies.

The encoder delimits the ordered coefficients into three distinct regions, as depicted in Figure 8.12.

[Figure: the 576 ordered coefficients split into region0/region1/region2: the "big_values" regions with values in [−8206 ... 8206], the "count1" region with values in [−1 ... 1], and the "rzero" region containing only zeroes.]

Fig. 8.12. Huffman partitions

This enables the encoder to code each region with a different set of Huffman tables specifically tuned for the statistics of that region. There are a total of 32 possible tables given in the standard. Starting at the highest frequency, the encoder identifies the continuous run of all-zero values as one region, "rzero". This region does not have to be coded, because its size can be deduced from the sizes of the other two regions. However, it must contain an even number of zeroes, because the other regions code their values in even-numbered groupings.

The second region, the "count1" region, consists of a continuous run of values made up only of −1, 0, or 1. The Huffman table for this region codes four values at a time, so the number of values in this region must be a multiple of four.

The third region, the "big values" region, covers all remaining values. The Huffman tables for this region code the values in pairs. The "big values" region is further subdivided into three sub-regions, each having its own specific Huffman table.

Besides improving coding efficiency, partitioning the MDCT coefficients into regions and sub-regions helps control error propagation. Within the bit stream, the Huffman codes for the values are ordered from low to high frequency.

Use of a "bit reservoir". The design of the Layer 3 bit stream better fits the encoder's time-varying demand for code bits. As with Layer 2, Layer 3 processes the audio data in frames of 1152 samples. Figure 8.13 shows the arrangement of the various bit fields in a Layer 3 bit stream.

[Figure: frame fields in order – Header (32 bits: 12-bit sync, 20-bit system info), CRC (16 bits), Side information (136–256 bits), Main data.]

Fig. 8.13. The arrangement of the various bit fields in a Layer 3 bit stream

Unlike Layer 2, the coded data representing these samples do not necessarily fit into a fixed

length frame in the code bit stream. The encoder can donate bits to a reservoir when it needs fewer

than the average number of bits to code a frame. Later, when the encoder needs more than the average

number of bits to code a frame, it can borrow bits from the reservoir. The encoder can only borrow

bits donated from past frames; it cannot borrow from future frames.

Despite the conceptually complex enhancements to Layer 3, a Layer 3 decoder has only a

modest increase in computation requirements over a Layer 2 decoder. For example, even a direct

matrix-multiply implementation of the inverse MDCT requires only about 19 multiplies and additions

per sub-band value. The enhancements mainly increase the complexity of the encoder and the

memory requirements of the decoder.

Bit allocation. The bit allocation process determines the number of code bits allocated to each sub-band, based on information from the psychoacoustic model. For Layers 1 and 2, this process starts by computing the mask-to-noise ratio

\[ \mathrm{MNR} = \mathrm{SNR} - \mathrm{SMR} \quad (\mathrm{dB}), \]

where SNR is the signal-to-noise ratio and SMR is the signal-to-mask ratio from the psychoacoustic model. The MPEG/audio standard provides tables that give estimates of the signal-to-noise ratio resulting from quantizing to a given number of quantizer levels. In addition, designers can try other methods of obtaining the signal-to-noise ratios.

Once the bit allocation unit has mask-to-noise ratios for all the sub-bands, it searches for the

sub-band with the lowest mask-to-noise ratio and allocates code bits to that sub-band. When a sub-

band gets allocated more code bits, the bit allocation unit looks up the new estimate for the signal-to-

noise ratio and re-computes that sub-band's mask-to-noise ratio. The process repeats until no more

code bits can be allocated.
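The greedy allocation loop just described can be sketched as follows in Python (the SNR lookup table, the bit budget and the fixed per-step cost are illustrative placeholders, not the standard's tables):

def allocate_bits(smr_db, snr_table_db, budget_bits, bits_per_step):
    """Repeatedly give more bits to the sub-band with the lowest MNR."""
    steps = [0] * len(smr_db)            # quantizer refinement level per band
    mnr = [snr_table_db[0] - s for s in smr_db]
    while budget_bits >= bits_per_step:
        worst = mnr.index(min(mnr))      # sub-band with lowest mask-to-noise
        if steps[worst] + 1 >= len(snr_table_db):
            break   # simplification: stop once the worst band is maxed out
        steps[worst] += 1
        mnr[worst] = snr_table_db[steps[worst]] - smr_db[worst]
        budget_bits -= bits_per_step
    return steps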

The Layer 3 encoder uses noise allocation. The encoder iteratively varies the quantizers in an orderly way, quantizes the spectral values, counts the number of Huffman code bits required to code the audio data, and actually calculates the resulting noise. If, after quantization, some scale factor bands still have more than the allowed distortion, the encoder amplifies the values in those scale factor bands, effectively decreasing the quantizer step size for those bands, and the process repeats. The process stops if any of the following three conditions is true:

• none of the scale factor bands has more than the allowed distortion;
• the next iteration would cause the amplification for any of the bands to exceed the maximum allowed value;
• the next iteration would require all the scale factor bands to be amplified.

Real-time encoders can also include a time-limit exit condition for this process.

Iterative control of quantization and coding. The process of finding the optimum gain and scale factors for a given block, bit rate and perceptual-model output is usually done by two nested iteration loops in an analysis-by-synthesis way, as the sketch after this description also shows.

Inner iteration loop (rate loop). The Huffman code tables assign shorter code words to the (more frequent) smaller quantized values. If the number of bits resulting from the coding operation exceeds the number of bits available to code a given block of data, this can be corrected by adjusting the global gain to produce a larger quantization step size, leading to smaller quantized values. This operation is repeated with different quantization step sizes until the resulting bit demand for Huffman coding is small enough. The loop is called the rate loop because it modifies the overall coder rate until it is small enough.

Outer iteration loop (noise control loop). To shape the quantization noise according to the masking threshold, scale factors are applied to each scale factor band. The system starts with a default factor of 1,0 for each band. If the quantization noise in a given band is found to exceed the masking threshold (allowed noise) supplied by the perceptual model, the scale factor for this band is adjusted to reduce the quantization noise. Since achieving smaller quantization noise requires a larger number of quantization steps and thus a higher bit rate, the rate adjustment loop has to be repeated every time new scale factors are used; in other words, the rate loop is nested within the noise control loop. The outer (noise control) loop is executed until the actual noise (computed from the difference between the original and the quantized spectral values) is below the masking threshold for every scale factor band (i.e., critical band). While the inner iteration loop always converges (if necessary, by setting the quantization step size large enough to zero out all spectral values), this is not true for the combination of both iteration loops: if the perceptual model requires quantization step sizes so small that the rate loop always has to increase them to enable coding at the required bit rate, both loops can go on forever. To avoid this situation, several conditions for stopping the iterations early can be checked. However, for fast encoding and good coding results this situation should be avoided; this is one reason why an MPEG Layer-3 encoder (the same is true for AAC) usually needs tuning of the perceptual model parameter sets for each bit rate.
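In outline, the two loops look as follows (a highly simplified Python sketch; quantize, count_bits and measure_noise stand for machinery defined by the standard and are passed in as assumed callables):

def encode_block(spectrum, thresholds, scf_bands, budget_bits,
                 quantize, count_bits, measure_noise):
    """Outer noise control loop with the rate loop nested inside it."""
    scale_factors = [1.0] * len(scf_bands)
    while True:
        gain = 1.0
        while True:                        # inner (rate) loop
            q = quantize(spectrum, gain, scale_factors)
            if count_bits(q) <= budget_bits:
                break
            gain *= 2.0                    # coarser step -> fewer bits
        noisy = [b for b in range(len(scf_bands))
                 if measure_noise(spectrum, q, scf_bands[b]) > thresholds[b]]
        if not noisy:                      # outer (noise control) loop done
            return q, gain, scale_factors
        for b in noisy:
            scale_factors[b] *= 2.0        # finer quantization in noisy bands
        # As noted above, without additional early-exit conditions this
        # combination of loops is not guaranteed to terminate.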

Stereo redundancy coding. The MPEG/audio compression algorithm supports two types of stereo redundancy coding: intensity stereo coding and middle/side (M/S) stereo coding. All layers support intensity stereo coding; Layer 3 also supports M/S stereo coding.

Both forms of redundancy coding exploit another perceptual property of the human auditory

system. Psychoacoustic results show that above about 2 kHz and within each critical band, the human

auditory system bases its perception of stereo imaging more on the temporal envelope of the audio

signal than on its temporal fine structure.

In intensity stereo mode the encoder codes some upper frequency sub-band outputs with a single

summed signal instead of sending independent left- and right-channel codes for each of the 32 sub-

band outputs. The intensity stereo decoder reconstructs the left and right channels based only on a

single summed signal and independent left and right channel scale factors. With intensity stereo

coding, the spectral shape of the left and right channels is the same within each intensity coded sub-

band, but the magnitude differs.

The M/S stereo mode encodes the left and right channel signals in certain frequency ranges as middle (sum of left and right) and side (difference of left and right) channels, as the sketch below illustrates. In this mode, the encoder uses specially tuned threshold values to compress the side channel signal further.
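The matrixing itself is trivial; a short Python sketch (the 1/2 scaling is one common convention, used here for illustration):

import numpy as np

def ms_encode(left, right):
    mid = 0.5 * (left + right)     # middle channel: sum of left and right
    side = 0.5 * (left - right)    # side channel: difference, usually small
    return mid, side

def ms_decode(mid, side):
    return mid + side, mid - side  # exact reconstruction of left and right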

Using Audio Layer 3, good quality sound can be achieved with compression ratios of 1:10–1:12, which corresponds to 128–112 kbps for a stereo signal.

8.3 MPEG-2 Audio

The MPEG-1 Audio standard includes more developments than are discussed above. For one thing, it allows two channels, which can be encoded separately or as joint stereo, in which both stereo channels are encoded together for further savings.

MPEG-2 denotes the second phase of MPEG. It introduced a lot of new concepts into MPEG

video coding including support for interlaced video signals. The main application area for MPEG-2

is digital television. The original (finalized in 1994) MPEG-2 Audio standard [59] just consists of

two extensions to MPEG-1:

• Backwards compatible (BC) multichannel coding adds the option of forward and backwards

compatible coding of multichannel signals including the 5.1 channel configuration known

from cinema sound (for movies which have left, right, center, and two surround channels)

plus a subwoofer channel. This multichannel extension leads to an improved realism of

auditory ambience not only for audio only applications, but also for high-definition

television (HDTV) and digital versatile disc (DVD).

• Coding at lower sampling frequencies adds sampling frequencies of 16 kHz, 22,05 kHz and 24 kHz to the sampling frequencies supported by MPEG-1. This improves coding efficiency at very low bit rates.

Generally speaking, regarding syntax and semantics, the differences between MPEG-1 and MPEG-2 BC are minor, except, in the latter case, for the new definition of a sampling frequency field, a bit rate index field, and the psychoacoustic model used in the bit allocation tables. In addition, the parameters of MPEG-2 BC have to be changed accordingly. With the extension to lower sampling rates, it is possible to compress two-channel audio signals to bit rates below 64 kbps with good quality. Backward compatibility implies that existing MPEG-1 audio decoders can deliver the two main channels of an MPEG-2 BC coded bit stream. This is achieved by coding the left and right channels as MPEG-1, while the remaining channels are coded as ancillary data in the MPEG-1 bit stream.

Both extensions do not introduce new coding algorithms over MPEG-1 Audio. The

multichannel extension contains some new tools for joint coding techniques.

However in early 1994 it was shown that introducing new coding algorithms and giving up

backwards compatibility to MPEG-1 a significant improvement in coding efficiency (for the five

channel case) could be reached. As a result, a new work item defined as MPEG-2 Advanced Audio

Coding (AAC) [67, 69] was introduced. AAC is a second generation audio coding scheme for generic

coding of stereo and multichannel signals.

8.4 MPEG-2 Advanced Audio Coding

Advanced Audio Coding (AAC) is a standardized, lossy compression and encoding scheme for digital audio. Designed to be the successor of the MP3 format, AAC generally achieves better sound quality than MP3 at similar bit rates. AAC has been standardized by ISO and IEC as part of the MPEG-2 and MPEG-4 specifications.

AAC supports the inclusion of 48 full-bandwidth (up to 96 kHz) audio channels in one stream plus 16 low frequency effects (limited to 120 Hz) channels, up to 16 "coupling" or dialog channels, and up to 16 data streams. The quality for stereo satisfies modest requirements at 96 kbps in joint stereo mode; however, Hi-Fi transparency demands data rates of at least 128 kbps. The MPEG-2 audio tests showed that AAC meets the ITU requirements referred to as "transparent" at 128 kbps for stereo and 320 kbps for 5.1 audio.

MPEG-2 AAC provides the highest quality for applications where backward compatibility with MPEG-1 is not a constraint. While MPEG-2 BC provides good audio quality at data rates of 640-896 kbps for five full-bandwidth channels, MPEG-2 AAC provides very good quality at less than half of that data rate.

AAC is the default or standard audio format for YouTube, iPhone, iPod, iPad, Nintendo DSi, Nintendo 3DS, iTunes, DivX Plus Web Player and PlayStation 3. It is supported on PlayStation Vita, Wii (with the Photo Channel 1.1 update installed), the Sony Walkman MP3 series and later, and on Sony Ericsson, Nokia, Android, BlackBerry, and webOS-based mobile phones, with the use of a converter, and in the MPEG-4 video standard. AAC also continues to enjoy increasing adoption by manufacturers of in-dash car audio systems. AAC can be used on HDTV, DVB, iTunes and iPod, iPhone, iPad, Apple TV, mobile phones, PDAs and so on. It is possible to play AAC in FFDShow, the KM Player, Winamp, VLC, iTunes, Windows Media Player, XBMC, etc.

Figure 8.14 shows a block diagram of an MPEG-2 Advanced Audio Coder (AAC) encoder.

[Block diagram: the PCM audio signal passes through gain control, the filter bank, TNS, intensity coding/coupling, adaptive prediction, M/S, scale factors, the quantizer and noiseless coding; a psychoacoustic model, rate/distortion iterative control, a bit reservoir, coding of side information and bit stream formatting with error check produce the digital bit stream.]

Fig. 8.14. Structure of MPEG-2 advanced audio coder

AAC follows the same basic coding paradigm as Layer-3 (high frequency resolution filter bank, nonuniform quantization, Huffman coding, iteration loop structure using analysis-by-synthesis), but improves on Layer-3 in many details and uses new coding tools for improved quality at low bit rates.

The gain control tool splits the input signal into four equally spaced frequency bands, which are then flexibly encoded to fit a variety of sampling rates. The pre-echo effect can also be alleviated through the use of the gain control tool. The filter bank transforms the signals from the time domain to the frequency domain. The temporal noise shaping (TNS) tool helps to control the temporal shape of the quantization noise. Intensity coding and coupling reduce perceptually irrelevant information by combining multiple channels in high-frequency regions into a single channel. The second-order backward adaptive prediction tool further improves coding efficiency by removing the redundancies between adjacent frames. The predictor reduces the bit rate for coding subsequent sub-band samples in a given sub-band, and it bases its prediction on the quantized spectrum of the previous block, which is also available in the decoder, of course in the absence of channel errors. M/S coding removes stereo redundancy by coding the sum and difference signals instead of the left and right channels. Other units, including quantization, variable length coding, the psychoacoustic model, and bit allocation, are similar to those used in MPEG Layer 3.

In order to serve different needs, MPEG-2 AAC offers flexibility for different quality and complexity tradeoffs by defining three profiles: the main profile, the low-complexity profile, and the sampling rate scalable (SRS) profile. Each profile builds on a combination of different tools, as listed in Table 8.2.

Table 8.2 Coding tools used in MPEG-2 AAC

Tools                    | Main | Low complexity | SRS
Variable-Length Decoding | yes  | yes            | yes
Inverse Quantizer        | yes  | yes            | yes
M/S                      | yes  | yes            | yes
Prediction               | yes  | no             | no
Intensity/Coupling       | yes  | no             | no
TNS                      | yes  | limited        | limited
Filterbank               | yes  | yes            | yes
Gain Control             | no   | no             | yes

The main profile yields the highest coding efficiency by incorporating all the tools with the exception of the gain control tool. For example, in the main profile the filter bank is a 1024-line MDCT with 50 % overlap (block length of 2048 samples). The filter bank is switchable to eight 128-line MDCTs (block lengths of 256 samples). Hence, it allows for a frequency resolution of 23,43 Hz and a time resolution of 2,6 ms (both at a sampling rate of 48 kHz). In the case of the long block length, the window shape can vary dynamically as a function of the signal.
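
These resolution figures can be checked directly: the frequency resolution is the bin spacing of the 1024-line transform, and the time resolution corresponds to the 128-sample hop of the 50 % overlapped short blocks,

$$\Delta f = \frac{f_s/2}{1024} = \frac{24\,000\ \text{Hz}}{1024} \approx 23{,}43\ \text{Hz}, \qquad \Delta t = \frac{128}{48\,000\ \text{Hz}} \approx 2{,}67\ \text{ms},$$

which matches the quoted values up to rounding.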

The low complexity profile is used for applications where memory and computing power are constrained. It does not employ time-domain prediction, one of the most computationally complex procedures, and only a limited form of temporal noise shaping is used.

The SRS profile offers scalable complexity by allowing partial decoding of a reduced audio bandwidth. In this profile a hybrid filter bank is used.

MPEG-2 AAC supports up to 48 channels for various multichannel loudspeaker configurations and other applications. The default loudspeaker configurations are the monophonic channel, the stereophonic channel, and the 5.1 system (five channels plus the LFE channel).

Tools to enhance coding efficiency.

The following changes compared to Layer-3 help to achieve the same quality at lower bit rates:

• Higher frequency resolution. The number of frequency lines in AAC is up to 1024, compared to 576 for Layer-3.

• Prediction. An optional backward prediction, computed line by line, achieves better coding efficiency, especially for very tone-like signals (e.g. a pitch pipe). This feature is only available within the rarely used main profile.

• Improved joint stereo coding. Compared to Layer-3, both the mid/side coding and the intensity coding are more flexible, allowing them to be applied more frequently to reduce the bit rate.

• Improved Huffman coding. In AAC, coding by quadruples of frequency lines is applied more often. In addition, the assignment of Huffman code tables to coder partitions can be much more flexible.

Tools to enhance audio quality.

There are other improvements in AAC which help to retain high quality for classes of very difficult signals.

• Enhanced block switching. Instead of the hybrid (cascaded) filter bank in Layer-3, AAC uses a standard switched MDCT filter bank with an impulse response (for short blocks) of 5,3 ms at 48 kHz sampling frequency. This compares favorably with Layer-3 at 18,6 ms and reduces the amount of pre-echo artifacts.

• Temporal noise shaping, TNS. This technique does noise shaping in the time domain by performing an open-loop prediction in the frequency domain. TNS is a new technique which proves to be especially successful for the improvement of speech quality at low bit rates.

With the sum of many small improvements, AAC reaches on average the same quality as Layer-3 at about 70 % of the bit rate. Using an encoder with good performance, both Layer-3 and MPEG-2 Advanced Audio Coding can compress music while still preserving near-CD or CD quality [70]. Of the two systems, Layer-3 has somewhat lower complexity. AAC is its successor, providing near-CD quality at larger compression rates. This increases the playing time of flash-memory-based devices by nearly 50 % at the same quality compared to Layer-3, and enables higher quality encoding and playback up to high-definition audio at a 96 kHz sampling rate.

8.5 Main Characteristics of MPEG-4 Audio

The MPEG-4 standard [61], which was finalized in 1999, integrates the whole range of audio, from high-fidelity speech and audio coding down to synthetic speech and synthesized audio. The MPEG-2 AAC tool set within the MPEG-4 standard supports the compression of natural audio at bit rates ranging from 2 up to 64 kbps. The MPEG-4 standard defines three types of coders: parametric coding, code-excited linear predictive (CELP) coding, and time/frequency (T/F) coding. For speech signals sampled at 8 kHz, parametric coding is used to achieve target bit rates between about 2 and 6 kbps. For audio signals sampled at 8 and 16 kHz, CELP coding offers good quality at medium bit rates between about 6 and 24 kbps.

T/F coding is typically applied at bit rates starting at about 16 kbps for audio signals with bandwidths above 8 kHz. T/F coding is based on the coding tools used in MPEG-2 AAC with some add-ons. One is referred to as twin-VQ (vector quantization), which makes combined use of an interleaved VQ and LPC (Linear Predictive Coding) spectral estimation. In addition, the introduction of bit-sliced arithmetic coding (BSAC) offers noiseless transcoding of an AAC stream into a fine-granule scalable stream between 16 and 64 kbps per channel. BSAC enables the decoder to stop decoding anywhere between 16 kbps and the full bit rate, in steps of 1 kbps.

9. FORMATION of DVB DIGITAL STREAMS

This chapter deals with the formation principles of multimedia programs and of the MPEG transport stream structure, which will later be used for demultiplexing and decompression in a digital TV receiver.

9.1 Characteristics of ISO/IEC 13818 standard

The ISO/IEC 13818 standard [71] describes the composition principles of a multimedia program, which is comprised of video and audio signals compressed by the MPEG algorithm, and of control data. It also describes how these components have to be multiplexed into a synchronized broadcasting stream (see Figure 9.1). The composed stream can be transmitted over various transmission media, e.g.:

VHF (30-300 MHz) radio frequency band;
UHF (300-3000 MHz) radio frequency band;
Satellite communication channels;
Cable networks;
Standard terrestrial channels using E1 (2 Mbps) or E3 (34 Mbps) formats for digital transmission based on the Plesiochronous Digital Hierarchy (PDH) technology;
Standard terrestrial channels with the Synchronous Digital Hierarchy (SDH) technology;
Wireless communication channels;
ADSL wired and optical communication lines;
Packet switching networks (ATM, IP, IPv6, Ethernet).

The data transmission modes over these media can be: simplex, full duplex (using feedback schemes), unicast, multicast or broadcast to all receivers, according to the packet identifier (PID).

[Block diagram: the TV studio and program creators feed a server; an MPEG2 stream formation processor and an MPEG2 transport and program streams multiplexer feed the FEC block and QAM modulator, which broadcasts to the DVB-T receivers.]

Fig. 9.1. A typical digital TV transmission setup

According to the DVB specification, data can be transmitted using five different modes:

Data pipe mode. In this case data segments are packed into containers and delivered to their destination;

Data stream mode. In this case the data constitute a continuous stream which can be asynchronous (without additional sync signals), synchronous (with sync signals) or synchronizable (with sync symbols inserted and the data packed into packet streams);

Multiprotocol encapsulation (MPE) mode. In this case the Digital Storage Media Command and Control (DSM-CC) technology is applied by emulating Local Area Network (LAN) functions;

Data carousel mode. In this case the content of the transmission stream is provided in a cyclic fashion and placed into a cyclic buffer from which data transmission is periodically performed. Data in the buffer can be of any format or type. This principle is used to realize the Electronic Program Guide (EPG) using fixed-length DSM-CC frames;

Object carousel mode. These carousels are similar to the data carousels but are designed for receivers according to the corresponding packet identification field (PID) (usually for STB decoders). Here the data are formatted according to the DVB specification.

9.2 MPEG-2 video compression standard

MPEG-2 video compression [72] relies on block coding based on the DCT. Specifically, each frame is divided into 8×8 blocks that are transformed using the DCT. Quantization of the transformed blocks is obtained by dividing the transformed pixel values by the corresponding elements of a quantization matrix and rounding the ratio to the nearest integer. The transformed and quantized block values are scanned using a zig-zag pattern to yield a one-dimensional sequence for the entire frame. A hybrid variable length coding scheme that combines Huffman coding and run-length coding is used for symbol encoding.
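
A minimal sketch of the quantization and zig-zag steps just described (illustrative only; the flat quantization matrix is a stand-in for the real MPEG-2 matrices):

```python
# Illustrative quantization and zig-zag scan of a DCT-transformed 8x8 block.
import numpy as np

def quantize(dct_block: np.ndarray, q_matrix: np.ndarray) -> np.ndarray:
    """Divide each coefficient by the matrix entry and round to the nearest integer."""
    return np.rint(dct_block / q_matrix).astype(int)

def zigzag(block: np.ndarray) -> list:
    """Scan the block along anti-diagonals, low frequencies first."""
    idx = sorted(((i, j) for i in range(8) for j in range(8)),
                 key=lambda p: (p[0] + p[1], p[0] if (p[0] + p[1]) % 2 else p[1]))
    return [int(block[i, j]) for i, j in idx]

q = np.full((8, 8), 16)                           # flat stand-in quantization matrix
coeffs = np.zeros((8, 8))
coeffs[0, 0], coeffs[0, 1] = 240.0, 50.0          # a DC and one AC coefficient
print(zigzag(quantize(coeffs, q))[:4])            # -> [15, 3, 0, 0]
```

The long runs of zeros produced by the scan are exactly what the run-length/Huffman stage exploits.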

The procedure outlined is used for both intraframe and interframe coding. Intraframe coding is used to represent an individual frame independently. The scheme used for intraframe coding is essentially identical to JPEG image compression. An intraframe coded picture in the video sequence is referred to as an intra-picture (I-picture).

Interframe coding is used to increase compression by exploiting temporal redundancy. Motion compensation is used to predict the content of the frame, and the predicted frame is represented by coding its residual error. The frame is divided into 16×16 macroblocks (2×2 arrangements of 8×8 blocks). An optimal match of each macroblock in the neighboring frame is determined. A motion vector is used to represent the offset between the macroblock and its best match in the neighboring frame. A residual error is computed by subtracting the macroblock from its best match in the neighboring frame. Coding of the residual error image is carried out in the same manner as intraframe coding. Coding of the motion vector field is performed separately using difference coding and variable length coding. An interframe coded picture in the video sequence that restricts the search to the previous frame is referred to as a predicted picture (P-picture), whereas pictures that allow the use of either previous or subsequent frames are referred to as bidirectional pictures (B-pictures).

Rows of macroblocks in the picture are called slices. The collection of slices forms a picture. Groups of pictures (GOP) refer to sequences of pictures. A GOP is used to specify a group of consecutive frames and their picture types. For example, a typical GOP may consist of 15 frames with the following picture types: IBBPBBPBBPBBPBB. This scheme allows for random access, and error propagation does not exceed intervals of one half second (15 frames), assuming the video is streamed at a rate of 30 frames per second.

9.3 Construction of MPEG bit streams

MPEG-2 has two different multiplexing schemes: the program stream and the transport stream [41]. Program streams are mostly used in error-free environments such as storage applications. Broadcast usage commonly uses the transport stream format. Having one content channel (one program) does not imply that the stream carrying the program is a program stream. In broadcast usage it would be a so-called single program transport stream (as defined by ISO 13818-1): a multiplexed collection of concatenated program streams without beginning or end. The transport stream is superior to the program stream for streaming applications. An MPEG-2 program stream contains one, and only one, content channel. An MPEG-2 transport stream can contain one or more content channels.

Program and transport streams are composed of a combination of separate components, called elementary streams (ES). An elementary stream (ES), as defined by the MPEG communication protocol, is usually the output of an audio or video encoder. An ES contains only one kind of data, e.g. audio, video or closed captions. An elementary stream is often referred to as an "elementary", "data", "audio" or "video" bit stream or stream. The format of the elementary stream depends upon the codec or data carried in the stream, but it will often carry a common header when packetized into a packetized elementary stream (PES). Within the elementary stream the following are stored:

Digital control data;
Compressed audio signals;
Compressed video signals;
Digital synchronous and asynchronous data.

The segments of audio and video signals are grouped into packets, usually called coded video or audio frames. After that the ES is sent to a specific processor which packs it into a higher-hierarchy stream, called the packetized elementary stream (PES).

A typical method of transmitting elementary stream data from a video or audio encoder is to first create PES packets from the elementary stream data and then to encapsulate these PES packets inside transport stream (TS) packets or program stream (PS) packets (see Figure 9.2). The PS mode permits multiplexing PESs that share a common time base, using variable length packets. The TS mode permits multiplexing PESs and PSs that do not necessarily share a common time base, using fixed length (188-byte) packets. The TS packets can then be transmitted using broadcasting techniques, such as those used in DVB.

Transport streams and program streams are each logically constructed from PES packets. PES packets shall be used to convert between transport streams and program streams.

[Block diagram: the video and audio encoders feed the video and audio packetizers, producing a video PES and an audio PES; these feed the transport stream multiplexer (output to the broadcasting system) and the program stream multiplexer (output to storage media), both driven by a clock and the program clock reference.]

Fig. 9.2. MPEG-2 audio and video systems at transmission side

9.3.1 MPEG-2 elementary stream

An elementary stream (ES), as defined by the MPEG communication protocol, is usually the output of an audio or video encoder [75]. An ES contains only one kind of data, e.g. audio, video or closed captions. An elementary stream is often referred to as an "elementary", "data", "audio", or "video" bit stream. The format of the elementary stream depends upon the codec or data carried in the stream, but it will often carry a common header when packetized into a packetized elementary stream. The structure of the video ES format is dictated by the nested MPEG-2 compression standard: video sequence, GOP, pictures, slices, and macroblocks. The video ES is defined as a collection of pictures from one source. An illustration of the video ES format is presented in Figure 9.3 [75]. The shaded segment of the video ES format presented in Figure 9.3 denotes that any permutation of the fields within this segment can be repeated, as specified by the video compression standard.

Sequence header → Sequence extension → Extension and user data 0 → Group of pictures header → Extension and user data 1 → Picture header → Picture coding extension → Extension and user data 2 → Picture data containing slices → Slice data containing macroblocks → Sequence end

Fig. 9.3. Video elementary stream format

The content of the video elementary stream header is shown in Figure 9.4 [75].

Start code, 32 bits; Horizontal size, 12 bits; Vertical size, 12 bits; Aspect ratio, 4 bits; Frame rate code, 4 bits; Bit rate, 18 bits; Marker bit, 1 bit (always 1); Video buffer verifier size, 10 bits; Constraint parameters flag, 1 bit; Load intra quantizer matrix, 1 bit; Intra quantizer matrix, 0 or 64×8 bits; Load non-intra quantizer matrix, 1 bit; Non-intra quantizer matrix, 0 or 64×8 bits

Fig. 9.4. Header for MPEG-2 video elementary stream
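A small sketch of how a decoder might pull the fixed fields of Fig. 9.4 out of the byte stream (names are illustrative; the optional quantizer-matrix fields that follow are omitted):

```python
# Illustrative parser for the fixed part of an MPEG-2 video sequence header.
class BitReader:
    def __init__(self, data: bytes):
        self.data, self.pos = data, 0
    def read(self, n: int) -> int:
        """Return the next n bits as an unsigned integer (MSB first)."""
        val = 0
        for _ in range(n):
            byte = self.data[self.pos // 8]
            val = (val << 1) | ((byte >> (7 - self.pos % 8)) & 1)
            self.pos += 1
        return val

def parse_sequence_header(data: bytes) -> dict:
    r = BitReader(data)
    assert r.read(32) == 0x000001B3        # sequence header start code
    return {
        "horizontal_size": r.read(12),
        "vertical_size":   r.read(12),
        "aspect_ratio":    r.read(4),
        "frame_rate_code": r.read(4),
        "bit_rate":        r.read(18),     # in units of 400 bit/s
        "marker_bit":      r.read(1),      # always 1
        "vbv_buffer_size": r.read(10),
    }
```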

9.3.2 MPEG-2 packetized elementary stream

A PES packet can have a fixed or variable (up to 65536 bytes) length. A PES is usually constructed in such a way that it is composed of an integer number of packets (see Figure 9.5) [73, 75].

Packet start code prefix, 24 bits; Stream ID, 8 bits; PES packet length, 16 bits; Optional PES header, variable length; PES data bytes.

The optional PES header consists of: marker bits "10", 2 bits; PES scrambling control, 2 bits; PES priority, 1 bit; data alignment indicator, 1 bit; copyright, 1 bit; original or copy, 1 bit; 7 flags, 8 bits (PTS/DTS indicator, 2 bits; ESCR flag, 1 bit; ES rate flag, 1 bit; DSM trick mode flag, 1 bit; additional copy info flag, 1 bit; PES CRC flag, 1 bit; PES extension flag, 1 bit); PES header data length, 8 bits; optional fields 1; stuffing bytes, variable length.

Optional fields 1: PTS/DTS, 40 (33) bits; ESCR, 48 (42) bits; ES rate, 24 (22) bits; DSM trick mode, 8 bits; additional copy info, 8 (7) bits; PES CRC, 16 bits; PES extension flags, 5 bits (PES private data flag, pack header field flag, program packet sequence counter flag, P-STD buffer flag, PES extension field length flag, 1 bit each); optional fields 2.

Optional fields 2: PES private data, 128 bits; pack header field, 8 bits; program packet sequence counter, 8 bits; P-STD buffer, 16 bits; PES extension field length, 7 bits; PES extension field data.

Fig. 9.5. Packetized elementary stream header

The format of the PES header is defined by the stream ID, which is used to identify the type of PES. The PES packet length indicates the number of bytes in the PES packet. The scrambling mode is represented by the scrambling control. The PES header data length indicates the number of bytes in the optional PES header fields, as well as the stuffing bytes used to satisfy the communication network requirements.

The PES header contains time stamps to allow synchronization by the decoder. Two different timestamps are used: the presentation timestamp (PTS) and the decoding timestamp (DTS). The PTS specifies the time at which the access unit should be removed from the decoder buffer and presented. The DTS represents the time at which the access unit must be decoded. The DTS is optional and is only used if the decoding time differs from the presentation time.

The elementary stream clock reference (ESCR) indicates the intended time of arrival of the packet at the system target decoder (STD). The rate at which the STD receives the PES is indicated by the elementary stream rate (ES rate). Error checking is provided by the PES cyclic redundancy check (PES CRC).

The pack header field is a PS pack header. The program packet sequence counter indicates the number of system streams. The STD buffer size is specified by the P-STD buffer (PSTDB) field.

9.3.3 MPEG-2 program stream

A PS multiplexes several PESs, which share a common time base, to form a single stream for transmission in an error-free environment (a BER of the order of 10⁻¹⁰, corresponding to 10 erroneous bits per hour at a bit rate of 30 Mb/s). The PS is intended for the storage and retrieval of programs on digital storage media such as CD-ROM or hard disk. The PS uses relatively long variable length packets (e.g., 2048 bytes). The program stream coding layer allows only one program of one or more elementary streams to be packaged into a single stream, in contrast to the transport stream, which allows multiple programs. Each piece of an individual elementary stream in a PS file has timing information, created during the multiplexing process, which tells the decoder its position in time relative to the beginning of the file. That way the decoder can keep all streams synchronized during playback. The stream organization is similar to the MPEG-1 system stream [41].

9.3.4 MPEG-2 transport stream

The TS is designed for broadcasting over communication networks [41, 74, 75]. The TS is intended to be a multiplex of many TV programs with their associated sound and data channels, although a single program TS is possible as well. In the case of a single program in the TS, it is called a single program transport stream (SPTS). Several SPTSs can be joined to form a multiple program transport stream (MPTS). In this case additional program specific information (PSI) appears in the stream, which is needed for DVB coordination (see Figure 9.6).

[Diagram: other data, an MPEG-2 video ES and an MPEG-2 audio ES are packed into MPEG-2 packets forming an MPEG-2 SPTS; several SPTSs together with the MPEG-2 PSI form an MPEG-2 MPTS.]

Fig. 9.6. Possible data streams in multiple program mode

The transport stream is created in such a way that it is possible to:

• Select encoded data of a single program and decode them (see Figure 9.7);

• Extract data of one or several programs from the transport stream and create other transport streams;

• Extract data from the transport stream and create a program stream;

• Create a program stream, convert it to a transport stream, transmit the data over a communication channel and then reconstruct the program stream.

[Block diagram: the channel data enter the channel decoder; the transport stream demultiplexer and decoder feeds the audio decoder, the video decoder and the synchronization block, producing the audio and video signals.]

Fig. 9.7. MPEG-2 audio and video systems at reception side

The TS is based upon packets of constant size, so that adding error correction codes and interleaving is eased. The TS uses small fixed-length packets of 188 bytes, which makes them more resilient to packet loss or damage during transmission. PES packets are inserted into TS packets, as seen in Figure 9.8. TS packets should not be confused with PES packets, which can be larger and vary in size.

[Diagram: PES 1 packet (>184 bytes) and PES 2 packet (=184 bytes) are divided among 188-byte transport packets TP1...TP7; each transport packet consists of a transport packet header followed by 184 bytes of PES header and data (video, audio, etc.), with an adaptation field (AF) filling the packet that carries a remainder shorter than 184 bytes.]

Fig. 9.8. Arrangement of the PESs in an MPEG-2 transport stream

The TS packet is composed of a 4-byte header followed by 184 bytes shared between the optional variable-length adaptation field (AF) and the TS packet payload, which carries data from the PES. For efficiency, the normal header is relatively small, but for special purposes the header may be extended. In this case the payload gets smaller, so that the overall size of the packet is unchanged. The payload consists of the data from the PESs composing the TV programs, to which a certain amount of data is added to allow the decoder to find its way in the MPEG-2 transport stream. A TS packet should carry only data coming from one PES packet. The first byte of each PES packet header is located at the first available payload location of a transport packet, and a PES packet must end at the end of a transport packet. Since transport packets are generally shorter than PES packets, the latter have to be divided into data blocks of 184 bytes. The last transport packet, carrying the part of a PES packet shorter than 184 bytes, will have to start with an adaptation field (AF), the length of which is equal to 184 bytes minus the number of bytes remaining in the PES packet.
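
A rough sketch of this segmentation rule, assuming 188-byte packets with a 4-byte header and an adaptation field used purely as stuffing (all real header fields are omitted):

```python
# Illustrative segmentation of one PES packet into fixed-size TS packets.
TS_SIZE, HDR_SIZE = 188, 4
PAYLOAD = TS_SIZE - HDR_SIZE            # 184 bytes per packet

def packetize(pes: bytes) -> list:
    packets = []
    for i in range(0, len(pes), PAYLOAD):
        chunk = pes[i:i + PAYLOAD]
        header = b"\x47" + b"\x00" * (HDR_SIZE - 1)   # sync byte; other fields omitted
        if len(chunk) < PAYLOAD:
            # Last piece: prepend an adaptation field whose length makes up
            # the difference, so the packet is still exactly 188 bytes.
            af_len = PAYLOAD - len(chunk) - 1         # 1 byte for the AF length field
            af = bytes([af_len]) + b"\xff" * af_len   # stuffing bytes
            packets.append(header + af + chunk)
        else:
            packets.append(header + chunk)
    return packets

assert all(len(p) == TS_SIZE for p in packetize(b"\x00" * 1000))
```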

An illustration of the TS and its header structure is shown in Figure 9.9 [74, 76].

The TS packet consists of the header and the TS payload. Header: sync byte, 8 bits; transport error indicator, 1 bit; payload unit start indicator, 1 bit; transport priority, 1 bit; packet ID, 13 bits; transport scrambling control, 2 bits; adaptation field control, 2 bits; continuity counter, 4 bits; adaptation field.

Adaptation field: adaptation field length, 8 bits; discontinuity indicator, 1 bit; random access indicator, 1 bit; PES priority indicator, 1 bit; 5 flags, 5 bits; optional field (program clock reference, 48 bits; original program clock reference, 48 bits; splice countdown, 8 bits; transport private data length, 8 bits; transport private data, variable length; adaptation field extension length, 8 bits; 3 flags, 3 bits; optional field); stuffing bytes, variable length.

Fig. 9.9. The structure of the transport packet and its header

The TS header includes a synchronization byte (SB) designed for the detection by a demultiplexer of the beginning of each TS packet. The content of the SB is 01000111 (0x47). The transport error indicator (TEI) points to the detection of an uncorrectable bit error in this TS packet at previous stages. The payload unit start indicator (PUSI) flag with a value of one denotes the start of a new PES or PSI; otherwise it is zero. The packet identification (PID) identifies the type and source of the payload in the TS packet via the program specific information (PSI) tables (about the tables see below). It is one of the most important elements in the TS. The 13-bit PID field can identify about 8000 types of packets. A demultiplexer seeking a particular elementary stream simply checks the PID of every packet and accepts those that match, rejecting the rest. Figure 9.10 shows how different video and audio packets (differing in PID) are composed into the MPEG-2 transport stream.

[PID audio | PID video | PID audio | PID video | PID video | ...]

Fig. 9.10. Multiplexed audio and video packets in MPEG transport stream

All audio and video packets have their unique PID, according to which they are allocated by the receiver's demultiplexer. Usually there are more video packets than audio packets, and any packet can be inserted into the stream at any given moment by the multiplexer. If there is no packet at the input of the multiplexer, it inserts an empty packet with the PID 0x1FFF to preserve the bit rate.

The scrambling control flag identifies whether the data is encrypted. The presence or absence of the adaptation field (AF) and of payload data in the packet is indicated by the adaptation field control (AFC) flag.

In a multiplex there may be many packets from other programs in between packets of a given PID. To help the demultiplexer find its way in this jungle, the packet header contains a continuity count. The continuity counter (CC) is a sequence number (0x0 to 0xF) of the payload-carrying TS packets with the same PID, incremented only when a payload is present; the receiver uses it to detect packet loss.
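
A minimal sketch of how a demultiplexer might check these header fields (bit layout as in Fig. 9.9; all names are illustrative):

```python
# Illustrative parsing of a 4-byte TS packet header and PID-based filtering.
def parse_ts_header(pkt: bytes) -> dict:
    assert pkt[0] == 0x47, "lost sync: first byte must be 0x47"
    return {
        "tei":  (pkt[1] >> 7) & 0x1,              # transport error indicator
        "pusi": (pkt[1] >> 6) & 0x1,              # payload unit start indicator
        "pid":  ((pkt[1] & 0x1F) << 8) | pkt[2],  # 13-bit packet identifier
        "tsc":  (pkt[3] >> 6) & 0x3,              # transport scrambling control
        "afc":  (pkt[3] >> 4) & 0x3,              # adaptation field control
        "cc":   pkt[3] & 0xF,                     # continuity counter
    }

def accept(pkt: bytes, wanted_pid: int, last_cc) -> bool:
    """Keep only packets of one elementary stream and flag continuity gaps."""
    h = parse_ts_header(pkt)
    if h["pid"] != wanted_pid or h["pid"] == 0x1FFF:  # reject others and null packets
        return False
    if last_cc is not None and h["cc"] != (last_cc + 1) % 16:
        print("continuity error: packet loss suspected")
    return True
```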

The optional adaptation field (AF) contains additional information that need not be included in every TS packet. The adaptation field is used for sending various types of information, such as the program clock reference (PCR) and the splice countdown used to indicate edit points. It is also used as stuffing (when the final section of a PES packet is placed in a TS packet payload, dummy data must be added to ensure a fixed-length TS packet). One of the most important fields in the AF is the program clock reference (PCR). A transport stream is a multiplex of several TV programs, and these may have originated from widely different locations. It is impractical to expect all the programs in a transport stream to be synchronous. The decoder has to lock to the encoder. Therefore the transport stream has to have a mechanism to allow this to be done independently for each program. The synchronizing mechanism is called the program clock reference. The PCR is a 42-bit field composed of a 9-bit segment incremented at 27 MHz as well as a 33-bit segment incremented at 90 kHz. The PCR is used along with a voltage-controlled oscillator as a time reference for the synchronization of the encoder and decoder clocks. The program map table indicates the PID of the transport packets which carry the PCR for each program. The minimum repetition rate of the PCR is 10 times per second. In some cases the payload of a transport packet can consist solely of an adaptation field of 184 bytes, for example, for the transport of private data.
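
The arithmetic behind these two segments is simple: the 9-bit extension counts 300 ticks of the 27 MHz clock per tick of the 33-bit 90 kHz base, so the full 42-bit value counts 27 MHz ticks. A quick illustrative check:

```python
# Illustrative PCR arithmetic: 33-bit base at 90 kHz, 9-bit extension at 27 MHz.
def pcr_to_seconds(base: int, ext: int) -> float:
    return (base * 300 + ext) / 27_000_000

# Two PCRs 0.1 s apart, i.e. the minimum repetition rate of 10 times per second.
t1 = pcr_to_seconds(base=900_000, ext=0)    # 10.0 s
t2 = pcr_to_seconds(base=909_000, ext=0)    # 10.1 s
assert abs((t2 - t1) - 0.1) < 1e-9
```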

9.3.5 The MPEG-2 tables

In order for the user to be able to receive a transport stream dedicated to him, first a PID has to be identified and then the packets containing that specific PID field have to be filtered. To let the user identify which PID corresponds to a particular program, signaling tables are formed (see Figure 9.11). MPEG-2 has defined four types of tables [76], which together make up the MPEG-2 program specific information (PSI). Each table, depending on its importance, is made up of one or more sections. The maximum number of sections is 256, each of which has at most 1024 bytes, except for private sections, which can be up to 4096 bytes in length.

[Diagram: PAT (program allocation table) and PMT (program map table) sections are inserted among the audio and video packets and other packets (program guide, subtitles, etc.).]

Fig. 9.11. Illustration of insertion of signaling tables

Program allocation table (PAT)

This table is always present in the TS and is carried by the packets with PID=0x0000. Its purpose is to indicate, for each program in the stream, the link between the program number (from 0 to 65535) and the PID of the packets carrying a "map" of the program (the program map table, PMT). The PAT is always broadcast unscrambled, even if all programs of the transport stream are scrambled.
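
A small sketch of this program-number-to-PMT-PID lookup, assuming the standard PAT section layout (an 8-byte section header, 4-byte program entries, and a trailing CRC32); names are illustrative:

```python
# Illustrative extraction of program_number -> PMT PID pairs from a PAT section.
def parse_pat(section: bytes) -> dict:
    assert section[0] == 0x00                        # table_id of the PAT
    section_length = ((section[1] & 0x0F) << 8) | section[2]
    programs = {}
    # 4-byte entries sit between the 8-byte section header and the 4-byte CRC32.
    for i in range(8, 3 + section_length - 4, 4):
        program_number = (section[i] << 8) | section[i + 1]
        pmt_pid = ((section[i + 2] & 0x1F) << 8) | section[i + 3]
        if program_number != 0:                      # 0 points to the network PID
            programs[program_number] = pmt_pid
    return programs
```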

Conditional access table (CAT)

This table must be present as soon as at least one program in the stream has conditional access. It is transported by the packets with PID=0x0001 and indicates the PID of the packets carrying the entitlement management messages (EMM) for one or more conditional access systems. The EMM is one of the two pieces of information required for descrambling conditional access programs.

Program map table (PMT)

There is one PMT for each program present in the stream. It indicates, in unscrambled form, the PIDs of the elementary streams making up the program and, optionally, other private information relating to the program, which can itself be scrambled. This information includes the entitlement control messages (ECM), which are the other piece of information necessary for unscrambling programs with conditional access, the first being the EMM carried by the CAT. The PMT can be transported by packets of arbitrary PID defined by the broadcaster and indicated in the PAT, except 0x0000 and 0x0001, which are reserved for the PAT and CAT respectively.

Transport stream description table (TSDT)

This table describes the contents of the stream. It is transported by packets with PID=0x0002.

Private tables (PT)

These tables carry private data, either in free format or in a format similar to the CAT, except for the section length, which can be as much as 4096 bytes, compared with 1024 bytes for the other tables.

The DVB standard adds complementary tables to the MPEG-2 tables. Some of them are mandatory and some optional. These so-called DVB-SI (service information) tables enable the receiver to configure itself automatically and allow the user to navigate the numerous programs and services available. This information is made up of four mandatory tables and three optional ones.

Mandatory tables of DVB-SI

These tables apply to the actual transport stream.

Network information table (NIT)

This table, as its name implies, carries information specific to a network made up of more than one RF channel (and hence more than one transport stream), such as the frequencies or channel numbers used by the network, which the receiver can use to configure itself, for instance. This table is carried by packets with PID=0x0010.

Service description table (SDT)

This table lists the names and other parameters associated with each service in the transport stream. It is transported by packets with PID=0x0011.

Event information table (EIT)

This table is used to transmit information relating to events occurring or about to occur in the current transport stream. It is transported by packets with PID=0x0012.

Time and date table (TDT)

This table is used to update the internal real-time clock of the set-top box. It is transported by packets with PID=0x0014.

Optional tables of DVB-SI

These tables apply to the actual transport stream.

Bouquet association table (BAT)

This table (present in all transport streams making up the bouquet) is used as a tool for grouping services that the set-top box may use to present the various services to the user (e.g., by way of the electronic program guide, EPG). A given service or program can be part of more than one bouquet. It is transported by packets with PID=0x0011.

Event information table (EIT)

This table is used to transmit information relating to events occurring or about to occur in the current transport stream. It is transported by packets with PID=0x0012.

Running status table (RST)

This table is transmitted only once, for a quick update of the status of one or more events at the time that this status changes, and not repeatedly as with the other tables. It is transported by packets with PID=0x0013.

Time offset table (TOT)

This table indicates the time offset from GMT. It is transported by packets with PID=0x0014.

Stuffing tables (ST)

These tables are used, for example, to replace previously used tables which have become invalid. They are transported by packets with PID=0x0010 to 0x0014.


10. SCRAMBLING and CONDITIONAL ACCESS

10.1 Introduction

Digital Video Broadcasting (DVB) is a standard defining a one-to-many unidirectional data network for sending digital TV programs over satellite, cable, and other topologies. The standard enables the broadcasting industry to offer hundreds of pay-TV channels to consumers in order to recover as quickly as possible the high investments required to launch DVB services. Furthermore, the standard enables consumers to easily choose one of the most commonly offered billing forms: conventional subscription, pay per view or near video on demand, if the system includes a "return channel". The expanded capabilities make the broadcast signals more valuable and attractive to signal theft [78]. To protect a DVB data network, the DVB standard integrates into its broadcasting infrastructure an access control mechanism, commonly known as Conditional Access, or CA for short. The DVB standard provides for the transmission of access control data packets carried by the conditional access table (CAT) and of other private data packets indicated by the program map table (PMT). The standard also defines the common scrambling algorithm (CSA). Conditional access (CA) itself is not defined by the standard, because most operators want to exploit their own system for commercial and security reasons.

It is evident that only the very general principles of the scrambling system in the DVB standard are universally available. Implementation details are accessible only to network operators and manufacturers under nondisclosure agreements.

The DVB conditional access architecture manages end-users' rights to access the contents of a DVB digital TV network. The architecture separates functions into subscriber management, subscriber authorization, and data scrambling. The same broadcast network is used for content delivery and access control. To defend against attacks, the architecture relies on secret designs, cryptography, and tamper-resistant hardware to achieve its objectives. Furthermore, system components are carefully partitioned for damage management, where critical parts can be inexpensively replaced when security is compromised. The architecture is adopted by many broadcasting networks around the world and serves as a reference model for many other deployments of digital TV broadcasting.

10.2 Functional Partitions

The DVB-CA architecture manages end-users' access to protected contents with three elements: data scrambling, a subscriber authorization system (SAS), and a subscriber management system (SMS) [77]. Together, they form three layers around the protected contents (see Figure 10.1).

[Diagram: three nested layers (subscriber management around subscriber authorization around data scrambling) surround the MPEG-2 digital TV contents.]

Fig. 10.1. Three layers around the protected contents

Data scrambling encrypts the digital TV contents at the center. The subscriber authorization system controls the data-scrambling element by handling the secured distribution of descrambling keys to authorized subscribers. Knowing which contents are admissible to which subscribers, the subscriber management system delivers access permissions to the SAS for execution.

The protection scope of the DVB-CA architecture ends at the point where protected contents are legitimately descrambled. Thus, DVB-CA does not care what a legitimate subscriber does with the descrambled content.

10.2.1 Data-Scrambling

The data-scrambling element is the encryption of TV contents. To avoid confusion, the DVB-CA specification uses the terms scrambling and descrambling to mean the encrypting and decrypting of TV contents, differentiating them from other uses of cryptography in the broader DVB infrastructure [79]. The broadcast center does the scrambling, and the receivers perform the descrambling.

10.2.2 Subscriber Authorization System

The SAS element implements the access-control protocol. It enforces end-users' access rights by allowing only authorized subscribers to descramble the contents [77]. The SAS uses cryptography extensively, and the system is designed to be inexpensively renewable as a strategy to contain the damage from being compromised.

10.2.3 Subscriber Management System

The SMS element grants access rights. Operating from the business operation center, the SMS maintains a database of subscribers. For each subscriber, the SMS database records the subscription level, payment standing, and a unique ID inside the subscriber's smart card. The SMS uses this information to decide which TV channels a subscriber is entitled to view, and the access permissions are given to the subscriber authorization system for enforcement [77].

10.3 System Architecture

Figure 10.2 depicts the major components of the DVB-CA architecture and their relations [77].

[Block diagram: at the head end, the SMS passes the user rights and the channel ACL to the CA host, which issues EMMs, ECMs and CA descriptors and supplies control words to the scrambler of the MPEG-2 audio/video; at the receiver, the CA client passes the CA messages to the CA module (smart card), which returns the control words to the descrambler.]

Fig. 10.2. The major components of the DVB-CA architecture

The scrambler and descrambler implement the data-scrambling element, and the control words are the cipher keys. The CA-Host, CA-Client and CA-Module are the three distributed components of the SAS element, and they use CA descriptors and CA messages (EMM and ECM) for communication.

The real implementation of the system can vary from one operator to another. The details of these systems are not publicly known, but their principles are similar.

10.3.1 Scrambler and Descrambler

The data-scrambling cipher is called the DVB Common Scrambling Algorithm (DVB-CSA). The algorithm is a combination of a 64-bit block cipher followed by a stream cipher, with a key size of 64 bits [80], in order to make the pirate's task more difficult. However, the details of the cipher are kept secret and disclosed to equipment manufacturers only under nondisclosure agreements. For performance and obscurity, the algorithm is implemented in hardware.

At the broadcast center, the scrambler generates control words to scramble the contents, and it passes the control words to the CA-Host for secured distribution to the descramblers via ECM CA messages [76]. Control words change about every ten seconds, and the scrambler synchronizes the descrambler to the key switching using a specific bit in the data-packet headers [76]. As a defense strategy, different TV channels are scrambled with different streams of control words.

The DVB standard allows the scrambling of the transport stream and of the PES. However, these cannot be used simultaneously.

It was mentioned in paragraph 9.3.4 that the transport packet header includes a 2-bit field called transport scrambling control. These bits indicate whether the transport packet is scrambled and with which control word. Transport packets are scrambled after multiplexing.

Scrambling of PES packets generally takes place at the source, before multiplexing. The presence of scrambling and the control word in use are indicated by the 2-bit PES scrambling control flags in the PES packet header.

10.3.2 CA Messages

CA messages are encrypted command and control communications from the CA-Host to the CA-Modules. The DVB-CA architecture categorizes CA messages into Entitlement Control Messages (ECM) and Entitlement Management Messages (EMM) [79]. ECMs carry channel-specific access-control lists and control words. EMMs deliver subscriber-specific entitlement parameters [76]. These messages are generated from three different types of input data:

A control word, which is used to initialize the descrambling sequence;
A service key, used to scramble the control word for a group of one or more users;
A user key, used for scrambling the service key.

ECMs are a function of the control word and the service key, and are transmitted approximately every 2 s. EMMs are a function of the service key, and are transmitted approximately every 10 s. The process of ECM and EMM generation is illustrated in Figure 10.3 [76].

[Diagrams: at the transmitter, the control words are encrypted with the service key to form the ECMs, and the service key is encrypted with the user key to form the EMMs; at the receiver, the user key (smart card) decrypts the EMM to recover the service key, which in turn decrypts the ECM to recover the control words.]

Fig. 10.3. Illustration of the ECM and EMM generation process

Fig. 10.4. Illustration of decryption of the control words from ECM and EMM

In the receiver, the principle of decryption consists of recovering the service key from the EMM and the user key contained in a smart card. The service key is then used to decrypt the ECM in order to recover the control word, allowing initialization of the descrambling process. Figure 10.4 illustrates the process of recovering the control words from the ECM and EMM.
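
A toy model of this two-level key hierarchy (the real ciphers are secret, so XOR stands in for the actual encryption; all names and values are illustrative):

```python
# Toy model of the ECM/EMM key hierarchy; XOR is a stand-in for the secret ciphers.
def xor(data: bytes, key: bytes) -> bytes:
    return bytes(d ^ k for d, k in zip(data, key))

# Broadcast center:
control_word = b"\x13\x37\xc0\xde\x13\x37\xc0\xde"   # changes about every 10 s
service_key  = b"\xaa" * 8                           # per group of subscribers
user_key     = b"\x55" * 8                           # stored in the smart card
ecm = xor(control_word, service_key)                 # ECM ~ E(control word, service key)
emm = xor(service_key, user_key)                     # EMM ~ E(service key, user key)

# Receiver (smart card):
recovered_service_key = xor(emm, user_key)
recovered_control_word = xor(ecm, recovered_service_key)
assert recovered_control_word == control_word        # descrambler can now be keyed
```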

Figure 10.5 illustrates the process of finding the ECM and EMM required to descramble a wanted program [76], for example, program No. 5.

[Diagram: the PAT sections (PID 0) point to the PMT sections (PID N) of program 5; the PMT lists the PIDs of the program's video, audio, PCR and ECM packets; the CAT sections reference the EMMs (EMM-1 to EMM-4) of CA systems 1 to 4.]

Fig. 10.5. Illustration of the process of finding ECM and EMM in the transport stream

The process consists of 4 steps [76]:

Step 1. The program allocation table (PAT), rebuilt from sections in packets with PID=0x0000, indicates the PID N of the packets carrying the program map table (PMT) sections;

Step 2. The PMT indicates, in addition to the PIDs of the packets carrying the video and audio PESs and the PCR (program clock reference), the PID of the packets carrying the ECM;

Step 3. The conditional access table (CAT), rebuilt from sections in packets with PID=0x0001, indicates which packets carry the EMM for one (or more) access control systems;

Step 4. From this information and the user key contained in the smart card, the descrambling system can calculate the control word required to descramble the next series of PES or transport packets, depending on the scrambling mode.

As a strategy of defense in depth, a secret cipher different from the data-scrambling cipher is used, and the details of the message formats are closely guarded secrets.

10.3.3 CA Descriptor

CA descriptors are data records associating a protected channel with its ECMs [79]. Since different streams of control words are used to scramble different channels, there is no need to keep the relations secret. Thus, the CA descriptors are sent in the clear via the electronic channel guide, which is transmitted continuously in the broadcast traffic.

10.3.4 CA-Host

The CA-Host is the control center of the access protection [77]. It is responsible for encrypting all CA messages to the CA-Modules and for securely distributing the CA messages' cipher keys to the CA-Modules.

10.3.5 CA-Client

A CA-Client is the access-control coordinator at a receiver [77]. It passes CA messages from the CA-Host to its CA-Module. It delivers the control words from the CA-Module to the descrambler. When the viewer selects a channel, the CA-Client uses the channel's CA descriptor to filter the associated ECMs and passes them to the CA-Module. If the channel is pay-per-view, the CA-Client also walks the viewer through GUI dialogs to confirm purchases.


10.3.6 CA-Module

A CA-Module (CAM) is the access-control guard at a receiver [77]. Each CA-Module has a unique CAM-ID identifying the subscriber. The CA-Module authenticates and decrypts EMMs to establish a subscriber's entitlement parameters, which are stored in the CAM's non-volatile, secured memory and never leave the CAM. The CA-Module also authenticates and decrypts ECMs to receive a channel's control words and access parameters from the CA-Host. If the access parameters in an ECM are consistent with the entitlement parameters stored in the CA-Module, the CA-Module returns the control word to the CA-Client for setting up the descrambler. Since it is important for CA-Modules to be tamper-resistant and easily replaced when damaged or compromised, they are often implemented as smart cards.

10.3.7 Subscriber Management System

The SMS is the business manager determining each subscriber's rights of channel access [77]. It uses CAM-IDs to link subscribers to the subscriber authorization system. As a subscriber's subscription level and payment standing change, the SMS modifies the access rights by instructing the CA-Host to send new EMMs to the CA-Module having the subscriber's CAM-ID [77].

10.4 Network Integration

Figure 10.6 shows where the components of the DVB-CA architecture are integrated into a DVB data network [77].

To illustrate the interaction between the components of the DVB-CA architecture, the following example of system operation under an assumed scenario is presented.

Consider a subscriber who is currently allowed to watch only the basic programs, without access to premium sports channels. However, the subscriber decides to watch an interesting sports program. Tuning to the channel, the subscriber receives an on-screen message instructing him to call the customer service center to upgrade the subscription level. After the conversation about the service upgrade, the customer representative confirms the subscriber's request to upgrade. Within a few seconds, the on-screen message is replaced by the desired program.

At that time the CA-Client receives the channel-tuning request from the graphical user interface (GUI). From the channel number of the desired sports program, the CA-Client looks up the parameters in the channel guide to set up the data receiver and packet filter for receiving the program's digital audio and video streams. More importantly, the CA-Client looks up the CA descriptor and extracts the parameters to set up the packet filter for receiving the ECM packets associated with the desired channel. When the ECMs arrive, the CA-Client passes them to the CA-Module and waits for a response.

[Block diagram: at the head end, subscriber management, the CA host (issuing CA messages and CA descriptors and receiving control words), the channel guide and the MPEG-2 audio/video streams 1...N feed the scrambler, multiplexer and transmitter; at the receiver, the packet filter, descrambler and MPEG-2 decoder produce the TV output, coordinated by the GUI, the CA client and the CA module with its smart card.]

Fig. 10.6. The components of the DVB-CA architecture integrated in a DVB data network

When the CA-Module receives an ECM, it authenticates and decrypts the ECM to extract the control word and the access parameter of the tuned channel. Comparing the access parameter with the stored entitlement parameter, the CA-Module finds that the service belongs to a subscription level higher than that of the subscriber. Thus, the CA-Module returns a status code of "below service level" to the CA-Client. The CA-Module continues to return the same status code for every ECM passed from the CA-Client, since the subscriber's entitlement remains unchanged.

Receiving a response status of "below service level", the CA-Client displays a message on the TV screen, asking the subscriber to call the customer service center for a service upgrade. The message remains on screen for as long as the CA-Module returns the same response to all ECMs passed from the CA-Client.

As instructed by the on-screen message, the subscriber calls the customer service center to request the service upgrade. After confirming the request and obtaining online credit approval, the customer representative enters the new subscription level into the subscriber management system (SMS). Upon receiving the upgrade, the SMS provides the new subscription level and the subscriber's CAM-ID to the CA-Host. The CA-Host encapsulates the new subscription level into an EMM, tags it with the subscriber's CAM-ID, signs and encrypts it, and inserts the EMM into the broadcast traffic.

Back at the receiver site, the CA-Client receives the EMM tagged with the subscriber's CAM-ID and passes it to the CA-Module. After authenticating and decrypting the EMM, the CA-Module stores the new subscription level in its internal secured storage. When a subsequent ECM from the desired sports channel comes along, the CA-Module finds that the subscriber is entitled to the channel. Consequently, the CA-Module returns the control word.

Receiving a valid control word from the CA-Module, the CA-Client sets up the descrambler, and the digital audio and video data are descrambled, decoded, and shown on the TV set. Seeing the wanted sports program, the subscriber ends the conversation with the customer representative.

From this point on, the CA-Client continuously feeds the channel's ECMs to the CA-Module, which returns the new control words as they change.

10.5 Main Conditional Access Systems

Table 10.1 [76] indicates the main conditional access systems used by European digital pay-TV service providers. Most of these systems use the DVB-CSA scrambling standard specified by the DVB. The receiver has an internal descrambler controlled by embedded conditional access software, which calculates the descrambler control words from the ECM messages and the keys contained in a subscriber smart card with valid access rights updated by the EMM.

Table 10.1 Main conditional access systems

System            | Origin                  | Examples of service providers
Betacrypt         | Betaresearch (obsolete) | Premiere World, German cable
Conax             | Conax AS (Norway)       | Scandinavian operators
CryptoWorks       | Philips                 | Viacom, MTV Networks
IrDETO            | Nethold                 | Multichoice
Nagravision 1 & 2 | Kudelski S.A.           | Dish Network, Premiere
Viaccess 1 & 2    | France Telecom          | TPS, AB-Sat, SSR/SRG, Noos
Videoguard/ICAM   | News Datacom (NDS)      | BSkyB, Sky Italia

11. RANDOMIZATION and FRAME SYNCHRONIZATION

11.1 Randomization

Before the demodulator in a receiver can recover a bit stream from the received signal, it must extract timing information. As a minimum, this timing information is used to identify the boundaries between symbols. Long runs of identical symbols may often occur in digital television. Examples include transport stream null packets and start packets used in the video and system layers. In these cases there are no changes in the values of the received symbol waveforms, and the receiver is unable to extract timing information. To prevent such runs of identical symbols, a process of symbol or bit stream randomization is applied.

Randomization is very often referred to as scrambling [38]. As we have seen in chapter 10, the terms scrambling and descrambling were used to denote the encrypting and decrypting of TV contents. It is also common for physical layer standards bodies to refer to physical layer encryption as scrambling [81]. In this chapter the term scrambling is understood as a coding operation applied to the bit stream at the transmitter that randomizes the bit stream, eliminating long strings of like bits that might impair receiver synchronization. Scrambling also eliminates most periodic bit patterns that could produce undesirable discrete frequency components in the power spectrum. Thus, in this chapter scrambling and descrambling denote the randomization and de-randomization of bit streams, specifically of MPEG transport streams [82]. Needless to say, the scrambled sequence must be descrambled at the receiver so as to preserve overall bit sequence transparency.

Simple but effective scramblers and descramblers are built from tapped shift registers having the generic form of Figure 11.1 [38].

[Figure: a clocked K-stage shift register with stage contents $a_{n-1}, a_{n-2}, \ldots, a_{n-K}$, tap gains $\alpha_1, \alpha_2, \ldots, \alpha_K$ and mod-2 adders forming the output $a'_n$; graphics not reproduced]

Fig. 11.1. Tapped shift register

Successive bits from the binary input sequence $a_n$ enter the register and shift from one stage to the next at each tick of the clock. The output $a'_n$ is formed by combining the bits in the register through a set of tap gains and mod-2 adders, yielding

$a'_n = \alpha_1 a_{n-1} \oplus \alpha_2 a_{n-2} \oplus \ldots \oplus \alpha_K a_{n-K}$ . (11.1)

The tap gains themselves are binary digits, so $\alpha_k = 1$ simply means a direct connection while $\alpha_k = 0$ means no connection. The symbol $\oplus$ stands for modulo-2 addition, defined by the properties

$a_1 \oplus a_2 = \begin{cases} 0, & a_1 = a_2; \\ 1, & a_1 \ne a_2; \end{cases}$  and  $(a_1 \oplus a_2) \oplus a_3 = a_1 \oplus (a_2 \oplus a_3)$ , (11.2)

where $a_1$, $a_2$ and $a_3$ are arbitrary binary digits. Mod-2 addition is implemented with exclusive-OR (XOR) digital logic gates, and obeys the rules of ordinary addition except that $1 \oplus 1 = 0$.

Figure 11.2 shows an illustrative scrambler and descrambler, each employing a 4-stage shift register with tap gains $\alpha_1 = \alpha_2 = 0$ and $\alpha_3 = \alpha_4 = 1$.

[Figure: (a) scrambler mod-2 adds the input $a_n$ to the register output $a'_n$ formed from the stored bits $a^*_{n-1}, \ldots, a^*_{n-4}$, producing $a^*_n$, which is fed back into the register; (b) descrambler has the mirror structure, regenerating $a_n$ from the received $a^*_n$; graphics not reproduced]

Fig. 11.2. Illustrative binary scrambler and descrambler; (a) scrambler, (b) descrambler

The binary message sequence $a_n$ at the input to the scrambler is mod-2 added to the register output $a'_n$ to form the scrambled message $a^*_n$, which is also fed back to the register input. Thus,

$a'_n = a^*_{n-3} \oplus a^*_{n-4}$  and  $a^*_n = a_n \oplus a'_n$ . (11.3)

The descrambler has essentially the reverse structure of the scrambler and reproduces the original message sequence, since

$a^*_n \oplus a'_n = (a_n \oplus a'_n) \oplus a'_n = a_n \oplus (a'_n \oplus a'_n) = a_n \oplus 0 = a_n$ . (11.4)

Equations (11.3) and (11.4) hold for any shift-register configuration as long as the scrambler

and descrambler have identical registers.

The scrambling action does, of course, depend on the shift-register configuration. Table 11.1 shows the scrambling produced by our illustrative scrambler when the initial state of the register is all zeroes. Note that the string of nine zeroes in $a_n$ has been eliminated in $a^*_n$. Nonetheless, there may be some specific message sequence that will result in a long string of like bits in $a^*_n$. Of more serious concern is error propagation at the descrambler, since one erroneous bit in $a^*_n$ will cause several output bit errors. Error propagation stops when the descrambler register is full of correct bits.

Table 11.1 Illustrative scrambler input, output and register cell contents

Input sequence $a_n$:            1 0 1 1 0 0 0 0 0 0 0 0 0 1
Register contents $a^*_{n-1}$:   0 1 0 1 0 1 1 1 1 0 0 0 1 0
Register contents $a^*_{n-2}$:   0 0 1 0 1 0 1 1 1 1 0 0 0 1
Register contents $a^*_{n-3}$:   0 0 0 1 0 1 0 1 1 1 1 0 0 0
Register contents $a^*_{n-4}$:   0 0 0 0 1 0 1 0 1 1 1 1 0 0
Register output $a'_n$:          0 0 0 1 1 1 1 1 0 0 0 1 0 0
Output sequence $a^*_n$:         1 0 1 0 1 1 1 1 0 0 0 1 0 1
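The behaviour summarized in Table 11.1 is easy to verify in software. The following Python sketch (our own illustration; the function and variable names are not from the text) implements equations (11.3) for the scrambler of Figure 11.2 and the mirror-image descrambler, reproducing the table and confirming equation (11.4):

def scramble(bits):
    reg = [0, 0, 0, 0]                    # 4-stage register, initial state all zeroes
    out = []
    for a in bits:
        a_reg = reg[2] ^ reg[3]           # register output a'_n = a*_{n-3} XOR a*_{n-4}
        a_star = a ^ a_reg                # scrambled bit a*_n = a_n XOR a'_n, eq. (11.3)
        reg = [a_star] + reg[:3]          # a*_n is fed back into the register
        out.append(a_star)
    return out

def descramble(bits):
    reg = [0, 0, 0, 0]                    # identical register as in the scrambler
    out = []
    for a_star in bits:
        a_reg = reg[2] ^ reg[3]
        out.append(a_star ^ a_reg)        # recovered bit a_n = a*_n XOR a'_n, eq. (11.4)
        reg = [a_star] + reg[:3]          # the register is fed with the received a*_n
    return out

msg = [1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]   # input sequence of Table 11.1
scr = scramble(msg)
print(scr)                                # [1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1]
print(descramble(scr) == msg)             # True

Note that the string of nine zeroes has disappeared from the scrambled output, exactly as in Table 11.1.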

Next, we consider shift-register sequence generation. When a shift register has a nonzero initial state and the output is fed back to the input, the unit acts as a periodic sequence generator. Take Figure 11.3, for example, where a 3-stage register with initial state 111 produces the 7-bit sequence 1110010, which repeats periodically thereafter. In general, the longest possible sequence period from a register with K stages is

$N = 2^K - 1$ (11.5)

and the corresponding output is called a maximal-length or pseudo-noise (PN) sequence. Figure 11.3 is, in fact, a PN sequence generator with $K = 3$ and $N = 7$.

[Figure: a 3-stage shift register with initial state 111 and feedback from the register into its input, producing the periodic output sequence 1110010...; graphics not reproduced]

Fig. 11.3. Shift register sequence generator

The name pseudo noise comes from the correlation properties of PN sequences: it is easy to show mathematically that their autocorrelation functions resemble the autocorrelation function of white noise.
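A PN generator is equally compact in software. The sketch below is our own; the tap positions are one maximal-length choice, selected so as to reproduce the sequence of Figure 11.3:

def pn_sequence(n=7):
    b1, b2, b3 = 1, 1, 1                  # initial state 111
    out = []
    for _ in range(n):
        out.append(b3)                    # the output is taken from the last stage
        b1, b2, b3 = b2 ^ b3, b1, b2      # feedback from stages 2 and 3 into stage 1
    return out

print(pn_sequence())                      # [1, 1, 1, 0, 0, 1, 0], i.e. 1110010
print(2**3 - 1)                           # N = 7, the maximal period of eq. (11.5)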

11.2 Frame Synchronization

A digital receiver needs to know when a signal is present. Otherwise, the input noise alone may

produce random output bits that could be mistaken for a message. Therefore, identifying the start of

a message is one aspect of frame synchronization. Another aspect is identifying subdivisions or

frames within the message. To facilitate frame synchronization, binary transmission usually includes

special N-bit sync words as shown in Figure 11.4. The initial prefix consists of several repetitions of

the sync word, which marks the beginning of transmission and allows time for bit-sync acquisition.

The prefix is followed by a different code word labeling the start of the message itself. Frames are

labeled by sync words inserted periodically in the bit stream.

[Figure: a bit stream in time consisting of a prefix of repeated N-bit sync words, a message-start code word, and the message bits; graphics not reproduced]

Fig. 11.4. Sync words for frame synchronization

The elementary frame synchronizer in Figure 11.5 is designed to detect a sync word $s_1 s_2 \ldots s_K$ whenever it appears in the regenerated sequence $a_n$. Output bits in the polar format $\tilde{a}_n = \pm 1$ are loaded into a K-stage polar shift register having polar tap gains given by

$\alpha_i = 2 s_{K+1-i} - 1$ . (11.6)

This expression simply states that the gains equal the sync word bits in polar form and reverse order, that is, $\alpha_1 = 2 s_K - 1$ while $\alpha_K = 2 s_1 - 1$. The tap gain outputs are summed algebraically to form

$v_n = \sum_{i=1}^{K} \alpha_i \tilde{a}_{n-i}$ . (11.7)

This voltage is compared with a threshold voltage V, and the frame sync indicator goes HIGH when $v_n > V$. If the register word is identical to the sync word, then $\tilde{a}_{n-i} = \alpha_i$, so $\alpha_i \tilde{a}_{n-i} = \alpha_i^2 = 1$ and $v_n = K$. If the register word differs from the sync word in just one bit, then $v_n = K - 2$. Setting the threshold voltage V slightly below $K - 2$ thus allows detection of error-free sync words and sync words with one bit error. Sync words with two or more errors go undetected, but that should be an unlikely event with any reasonable value of error probability. False frame indication occurs when K or K-1 successive message bits match the sync-word bits. The probability of this event is very low [38]

$P_{ff} = (1/2)^K + (1/2)^{K-1} = 3 \cdot 2^{-K}$ , (11.8)

assuming equally likely ones and zeroes in the bit stream.

Further examination of equations (11.6) and (11.7) reveals that the frame synchronizer calculates the cross-correlation between the bit stream passing through the register and the sync word, represented by the tap gains. The correlation properties of a PN sequence therefore make it an ideal choice for the sync word. In particular, suppose the prefix consists of several periods of a PN sequence. As the prefix passes through the frame-sync register, the values of $v_n$ will trace out the shape of the autocorrelation function of the sync word, with peaks $v_n = K$ occurring each time the initial bit $s_1$ reaches the end of the register. An added advantage is the ease of PN sequence generation at the transmitter, even with large values of K. For instance, getting $P_{ff} \le 10^{-4}$ in equation (11.8) requires $K > 14.8$, which can be realized with a 4-stage PN generator ($N = 2^4 - 1 = 15$).
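The frame synchronizer of equations (11.6) and (11.7) amounts to a sliding correlation, as the following Python sketch shows (illustrative only; the names are ours). The sync word is the K = 7 PN sequence from Figure 11.3, and the threshold is set to detect sync words with at most one bit error:

def correlate_sync(bits, sync):
    # v_n of eq. (11.7): the polar register contents multiplied by the polar,
    # reversed tap gains of eq. (11.6), i.e. a sliding correlation with the sync word
    K = len(sync)
    polar = lambda b: 2 * b - 1           # map {0,1} to {-1,+1}
    v = []
    for p in range(K - 1, len(bits)):
        window = bits[p - K + 1 : p + 1]  # the last K regenerated bits
        v.append(sum(polar(w) * polar(s) for w, s in zip(window, sync)))
    return v

sync = [1, 1, 1, 0, 0, 1, 0]              # one period of the K = 7 PN sequence
stream = [0, 1, 0, 1, 1] + sync + [1, 0, 0, 1]
v = correlate_sync(stream, sync)
K = len(sync)
hits = [i + K - 1 for i, x in enumerate(v) if x >= K - 2]
print(hits)                               # [11]: the sync word ends at bit index 11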


12. CODING

12.1 Introduction

If the data at the output of a digital communication system contain errors too frequent for normal reception of the information, the errors can be reduced by the use of either of two main techniques:

Automatic repeat request (ARQ);

Forward error correction (FEC).

In an ARQ system, when a receiver circuit detects parity errors in a block of data, it requests

that the data block be retransmitted. In an FEC system, the transmitted data are encoded so that the

receiver can detect, as well as correct, errors. These procedures are also called as channel coding

because they are used to correct errors caused by channel noise. The choice between using the ARQ

or the FEC technique depends on the particular application. ARQ is often used in computer

communication systems because it is relatively inexpensive to implement and there is usually a

duplex (two-way) channel so that the receiving end can transmit back an acknowledgment (ACK) for

correctly received data or a request for retransmission (Negative Acknowledgment – NAC) when the

data are received in error. FEC techniques are used to correct errors on simplex (one-way) channels,

where returning of an ACK/NAC indicator is not executable. FEC is preferred on systems with large

transmission delays, because the ARQ technique reduces the effective data rate significantly; the

transmitter has long idle periods while waiting for the ACK/NAC indicator, which is retarded by the

long transmission delay. Digital television uses only FEC technique.

Communication systems with FEC comprise an encoding block at the transmitter side and a decoding block at the receiver side. Coding involves adding extra (redundant) bits to the

data stream so that the decoder can reduce or correct errors at the output of the receiver. However,

these extra bits have the disadvantage of increasing the data rate (bits/s) and, consequently, increasing

the bandwidth of the encoded signal. Codes may be classified into two broad categories:

Block codes. A block code is a mapping of k input binary symbols into n output binary symbols. Consequently, the block coder is a memoryless device. Because $n > k$, the code can be selected to provide redundancy, such as parity bits, which are used by the decoder to provide some error detection and error correction. The codes are denoted by (n, k), where the code rate R is defined by $R = k/n$. Practical values of R range from 1/4 to 7/8, and k ranges from 3 to several hundred [35].

Convolutional codes. A convolutional code is produced by a coder that has memory. The convolutional coder accepts k binary symbols at its input and produces n binary symbols at its output, where the n output symbols are affected by $v + k$ input symbols. Memory is incorporated because $v > 0$. The code rate is defined by $R = k/n$. Typical values for k and n range from 1 to 8, and the values for v range from 2 to 60. The range of R is between 1/4 and 7/8 [35]. A small value for the code rate R indicates a high degree of redundancy, which should provide more effective error control at the expense of increasing the bandwidth of the encoded signal.

12.2 Block Codes

Before discussing block codes, several definitions are needed. These are the Hamming weight and the Hamming distance, which are used to characterize the error detection and correction capability of a code.

The Hamming weight of a code word is the number of binary 1 bits. For example, the code

words 110101 and 111001 have a Hamming weight of 4.


The Hamming distance between two code words, denoted by d, is the number of positions by which they differ. For example, the code words 110101 and 111001 have a distance of $d = 2$.

A received code word can be checked for errors. Some of the errors can be detected and corrected if $d \ge s + t + 1$, where s is the number of errors that can be detected and t is the number of errors that can be corrected ($s \ge t$). Thus, a pattern of t or fewer errors can be both detected and corrected if $d \ge 2t + 1$.
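Both quantities are straightforward to compute; the following short Python sketch (ours) reproduces the two examples above:

def hamming_weight(word):
    return sum(int(b) for b in word)       # number of binary 1 bits

def hamming_distance(w1, w2):
    return sum(b1 != b2 for b1, b2 in zip(w1, w2))   # positions in which the words differ

print(hamming_weight("110101"), hamming_weight("111001"))   # 4 4
print(hamming_distance("110101", "111001"))                 # 2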

A general code word can be expressed in the form

$i_1 i_2 i_3 \ldots i_k \; p_1 p_2 p_3 \ldots p_r$ , (12.1)

where k is the number of information bits, r is the number of parity check bits, and n is the total word length in the (n, k) block code, where $n = k + r$. This arrangement of the information bits at the beginning of the code word followed by the parity bits is most common. Such a block code is said to be systematic. Other arrangements with the parity bits interleaved between the information bits are possible and are usually considered to be equivalent codes. Hamming has given a procedure for designing block codes that have single error correction capability [86]. A Hamming code is a block code having a Hamming distance of 3. Because $d \ge 2t + 1$, $t = 1$, and a single error can be detected and corrected. However, only certain (n, k) codes are allowable. These allowable Hamming codes are [86]

$(n, k) = (2^m - 1,\; 2^m - 1 - m)$ , (12.2)

where m is an integer and $m \ge 3$. Thus, some of the allowable codes are (7,4), (15,11), (31,26), (63,57), and (127,120). The code rate R approaches 1 as m becomes large.

Thinking about block coding is simplest when comparing it to two people having a

conversation. When talking in a noisy room or shouting across a long distance, there is more room

for errors in what the receiving person hears. If the sentence is long, the person can correct more

errors by taking the entire sentence in context, but short sentences have a higher error rate because it

is harder to decipher what the person is saying.

In addition to Hamming codes, there are many other types of block codes. One popular class

consists of the cyclic codes [83]. Cyclic codes are block codes, such that another code word can be

obtained by taking any one code word, shifting the bits to the right, and placing the dropped-off bits

on the left. These types of codes have the advantage of being easily encoded from the message source

by the use of inexpensive linear shift registers with feedback. This structure also allows these codes

to be easily decoded. Examples of cyclic and related codes are Bose-Chaudhuri-Hocquenghem (BCH), Reed-Solomon (RS), Hamming, maximal-length, Reed-Muller, and Golay codes [87].

12.3 Burst Error Correction

Sometimes the bit errors are closely clustered; the errors are then said to occur in bursts. That is, in one region of a received bit stream a large percentage of bits are in error, while in another region there are few and perhaps no errors, and the average bit error rate is small. The block codes can correct a limited number of bit errors in each code word and are helpless against error bursts. Examples of the sources of such bursts are:

In digital video disc playback, bursts may occur for mechanical reasons, for example when some small sections of the disc are simply defective;

Pulse jamming, such as results in radio transmission due to lightning, causes bursts of errors

as long as it lasts.

There are communication channels in which the level of the received signal power

fluctuates with time. In such fading channels bursts of errors are likely to occur when the

received power is low.

12.3.1 Block Interleaving

A primary technique which is effective in overcoming error bursts is block interleaving. The principle of block interleaving is illustrated in Figure 12.1 [37, 38].

[Figure: a shift register organized as a $k \times l$ array of information bits $a_{11} \ldots a_{kl}$ (l bits per row, k information bits per column, kl information bits in total), below which r parity bits per column, $c_{11} \ldots c_{rl}$ (rl parity bits in total), are appended; graphics not reproduced]

Fig. 12.1. Illustration of block interleaving and adding of parity bits

Before the data stream is applied to the channel, the data goes through a process of interleaving and error correction coding. At the receiving end the data is decoded, i.e., the data bits are evaluated in a manner to take advantage of the error correcting and detecting features which result from the coding, and the process of interleaving is undone. As represented in Figure 12.1, a group of kl data bits is loaded into a shift register which is organized into k rows with l bits per row. The data stream is entered into the storage element at $a_{11}$. At each shift each bit moves one position to the right, while the bit in the rightmost storage element moves to the leftmost stage of the next row. Thus, for example, as indicated by the arrow, the content of $a_{11}$ moves to $a_{21}$. When kl data bits have been entered, the register is full, the first bit being in $a_{kl}$ and the last bit in $a_{11}$. At this point the data stream is diverted to a second similar shift register and a process of coding is applied to the data held stored in the first register. In this coding process, the information bits in a column (e.g., $a_{11}, a_{21}, \ldots, a_{k1}$) are viewed as the bits of an uncoded word to which parity bits are to be added. Thus the code word ($a_{11} a_{21} \ldots a_{k1} c_{11} c_{21} \ldots c_{r1}$) is formed, thereby generating a code word with k information bits and r parity bits. Observe that the information bits in this code word were l bits apart in the original bit stream. When the coding is completed, the entire content of the $k \times l$ information register as well as the $r \times l$ parity bits are transmitted over the channel. Generally the bit-by-bit serial transmission is carried out row by row, i.e., in the order

$(c_{rl} \ldots c_{r1} \ldots c_{1l} \ldots c_{11} \; a_{kl} \ldots a_{k1} \ldots a_{2l} \ldots a_{21} \; a_{1l} \ldots a_{12} a_{11})$ . (12.3)

Note that the data is transmitted in exactly the same order as it entered the register; however, now parity bits are also transmitted. The received data is again stored in the same order as in the transmitter and error correction decoding is performed. The parity bits are then discarded and the data bits shifted out of the register. To see how interleaving affects bursts of errors, consider that the code incorporated into a code word (a column in Figure 12.1) is adequate to correct a single error. Next suppose that in the transmitted data stream a burst of noise occurs lasting for l consecutive coded bits. Then, because of the organization displayed in Figure 12.1, it is clear that only one error will appear in each column and this single error will be corrected. If there are $l + 1$ consecutive errors then one column will have two errors and correction will not be assured. In general, if the code is able to correct t errors then the process of interleaving will permit the correction of a burst of B bits with

$B = t \, l$ . (12.4)
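The burst-spreading property behind equation (12.4) can be checked with a few lines of Python. This is a sketch of the array organization of Figure 12.1 only (our own illustration, not a full codec), assuming row-wise serial transmission: since consecutively transmitted bits belong to different columns, a burst of l consecutive channel errors hits every column code word at most once.

k, l = 4, 6                                # 4 information bits per column, 6 bits per row
burst_start, burst_len = 8, l              # a burst of l consecutive transmitted bits
burst = set(range(burst_start, burst_start + burst_len))

for c in range(l):                         # bit (r, c) occupies position r*l + c in the stream
    hits = sum(1 for r in range(k) if r * l + c in burst)
    print("column", c, "suffers", hits, "error(s)")   # never more than 1 when burst_len <= l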

12.3.2 Convolutional Interleaving

An alternative interleaving scheme, convolutional interleaving, is shown in Figure 12.2

[37, 38].

[Figure: at the transmitter, a commutator switch distributes the input bit stream d(k) over l lines containing 0, s, 2s, 3s, ..., (l-1)s storage elements; the lines feed the channel (with modulator and demodulator); at the receiver a synchronized switch collects the bits from l lines containing (l-1)s, (l-2)s, (l-3)s, ..., 0 storage elements, restoring the output sequence; graphics not reproduced]

Fig. 12.2. Convolutional interleaving - deinterleaving scheme

The four switches operate in step and move from line to line at the bit rate of the input bit stream $d(k)$. Thus, each switch makes contact with line 1 at the same time, and then moves to line 2 together, etc., returning to line 1 after line l. The cascades of storage elements in the lines are shift registers. Starting with line 1 on the transmitter side, which has no storage elements, the number of elements increases by s as we progress from line to line. The last line l has $(l-1)s$ storage elements. The total number of storage elements in each line (transmitter plus receiver side) is in every case the same: in each line there are a total of $(l-1)s$ storage elements. To describe the timing of operations in the interleaving system, let us consider a single line i on the transmitter side. Suppose that during the particular bit interval of bit $d(k)$ there is switch contact at the input and output sides of line i. At the end of the bit interval, a clock signal causes the shift register of line i (and line i only) to enter into its leftmost storage element the bit on the input side of the line and to start moving the contents of each of its storage elements one position to the right. That process having started, a synchronous clock advances the switches to the next line i+1. When the shift register response is completed, there will be a new bit at the output end of line i. However, because of the propagation delay through the storage elements, the switch at the output end of the line will have already lost contact with line i before the new bit has appeared at the line output. In summary, during the interval of input $d(k)$, during which the switches were connected to line i, there is a one-bit shift of the shift register on line i, which accepts bit $d(k)$ into the register. However, the fact that such a shift took place is not noticed at the output switch until the next time the switch makes contact with line i. We observe also that while the clock that drives the switches has a clock rate $f_b$, which is the bit rate, the clocks that drive the shift registers have a rate $f_b / l$. The shift registers are not driven in unison, but rather in sequence, each register being activated as the switches are about to break contact with its line. Now let us consider that, initially, all the shift registers in the transmitter and the receiver are short-circuited. If bit $d(k)$ occurs when all the switches are on line 1, then the corresponding received bit $\hat{d}(k)$ will appear immediately. The next input bit $d(k+1)$ will be the next received bit, except that it will be transmitted over line 2, and so on. In short, the received sequence $\hat{d}(k)$ will be the same as the transmitted sequence. With the shift registers in place in the transmitter and receiver, each of the l lines will have the same delay $(l-1)s$ and therefore the output sequence will still be identical to the input sequence. Of course, however, $\hat{d}(k)$ will be delayed with respect to $d(k)$ by the amount $(l-1)s$.

The sequence in which the bits are transmitted over the channel, however, is different: this sequence is interleaved. Suppose that two successive bits in the input bit stream are $d(k)$ and $d(k+1)$. Then it can be verified that, if in the interleaved stream we continue to refer to the first bit as $d(k)$, the bit which was originally $d(k+1)$ will instead appear as $d(k+1+ls)$. Thus if $l = 5$ and $s = 3$, there will be $ls = 15$ bits interposed between two bits that were initially adjacent to one another.
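The following Python sketch models the scheme of Figure 12.2 (our own minimal model; the class and function names are invented for illustration). Line i of the l lines carries i·s storage cells at the transmitter and (l-1-i)·s at the receiver, so every line has the same total storage (l-1)·s; the end-to-end delay of l(l-1)s bit periods corresponds to the (l-1)s register shifts per line mentioned above, since each register is clocked only once per l input bits.

class DelayLine:
    def __init__(self, length):
        self.cells = [0] * length
    def push(self, bit):
        if not self.cells:                 # a line with no storage passes bits straight through
            return bit
        self.cells.insert(0, bit)          # shift in on the left ...
        return self.cells.pop()            # ... and out on the right

def run(stream, l, s):
    tx = [DelayLine(i * s) for i in range(l)]
    rx = [DelayLine((l - 1 - i) * s) for i in range(l)]
    out = []
    for n, bit in enumerate(stream):
        line = n % l                       # all switches move together, line by line
        out.append(rx[line].push(tx[line].push(bit)))
    return out

l, s = 5, 3
data = list(range(1, 41))                  # distinct symbols make the ordering visible
delay = l * (l - 1) * s                    # each storage cell is clocked once per l bits
out = run(data + [0] * delay, l, s)
print(out[delay : delay + len(data)] == data)   # True: only a fixed delay remains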

In comparison with block interleaving, convolutional interleaving has the following advantages:

For the same interleaving distance, less memory is required;

The interleaving structure can be changed easily and conveniently by changing the number of lines l and/or the incremental number of elements per line s.

12.3.3 Reed–Solomon (RS) Code

The block codes we have briefly described above are organized on the basis of individual bits. Thus typically a code word has k individual information bits, r individual parity bits and a total of $n = k + r$ individual bits in a code word. The Reed-Solomon (RS) block code is organized on the basis of groups of bits. Such groups of bits are referred to as symbols. Thus suppose we store sequences of m individual bits which appear serially in a bit stream and thereafter operate with the m-bit sequence rather than the individual bits. Then we shall be dealing with m-bit symbols. Since we deal only with symbols, we must consider that if an error occurs even in a single bit of a symbol, the entire symbol is in error. The RS code has the following characteristics: it has k information symbols (rather than bits), r parity symbols and a total code word length of $n = k + r$ symbols. It has the further characteristic that the number of symbols in the code word is arranged to be

$n = 2^m - 1$ . (12.5)

The RS code is able to correct errors in t symbols, where

$t = r/2$ . (12.6)

As an example, assume that $m = 8$; then $n = 2^8 - 1 = 255$ symbols in a code word. Suppose further that we require $t = 16$; then $r = 2t = 32$ and $k = n - r = 255 - 32 = 223$ information symbols per code word. The code rate is

$R_c = k/n = 223/255 \approx 7/8$ . (12.7)

The total number of bits in the code word is $255 \times 8 = 2040$ bits.

Since the RS code of our example can correct sixteen symbols, it can correct a burst of $16 \times 8 = 128$ consecutive bit errors. If we use the RS code with the interleaving described above, then using equation (12.4) the number of correctable symbols is $l \cdot t$ and the number of correctable bits is

$B = m \, l \, t$ (12.8)

with $m = 8$, $t = 16$ and $l = 10$, $B = 1280$ bits.
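The arithmetic of this example is summarized in the short sketch below (plain calculations, following equations (12.5)-(12.8)):

m = 8
n = 2**m - 1                               # 255 symbols per code word, eq. (12.5)
t = 16                                     # required symbol-correcting capability
r = 2 * t                                  # 32 parity symbols, eq. (12.6)
k = n - r                                  # 223 information symbols
print(n, k, round(k / n, 4))               # 255 223 0.8745 (about 7/8), eq. (12.7)
print(n * m)                               # 2040 bits per code word
print(t * m)                               # 128: correctable burst in bits, no interleaving
l = 10
print(m * l * t)                           # 1280: correctable burst with depth-10 interleaving, eq. (12.8)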

The organization of an RS block code with $m = 8$, $k = 223$, $r = 32$, utilizing interleaving to a depth of $l = 4$, is shown in Figure 12.3 [37]. There are 223×8×4 = 7136 information bits and 32×8×4 = 1024 parity bits.

It is interesting to note that while the (255,223) RS code can correct 128 consecutive bit errors, it must then have an error-free region of 255-16 = 239 symbols (1912 bits). Further, if the errors are random and there is at most one error per symbol, then the RS code can correct only sixteen bit errors in 2040 bits. Clearly the RS code is not an efficient code for correcting random errors.

[Figure: an array of k = 223 information symbols per column arranged l = 4 symbols per row ($a_{11} \ldots a_{223,4}$), followed by r = 32 parity symbols per column ($c_{11} \ldots c_{32,4}$); each entry represents an 8-bit symbol; graphics not reproduced]

Fig. 12.3. The organization of an RS code with m=8, k=223 and r=32

12.4 Convolutional Coding

Convolutional codes have a structure that effectively extends over the entire transmitted bit

stream, rather than being limited to code word blocks. The convolutional structure is especially well

suited to space and satellite communication systems that require simple encoders and achieve high

performance by sophisticated decoding methods.

Convolutional codes are used extensively in numerous applications in order to achieve reliable

data transfer, including digital video, radio, mobile and satellite communication. These codes are

often implemented in concatenation with a hard-decision code, particularly Reed-Solomon.

Convolutional codes, encoding and decoding techniques are more widely or barely described

in many published sources. For our discussion we have used [35–38, 85].

A convolutional code is generated by combining the outputs of an M-stage shift register through

the employment of V XOR logic summers. Such a coder is illustrated in Figure 12.4 for the case M=3

and V=2. Here M1 through M3 are 1-bit storage (memory) devices.

Convolutional codes are commonly specified by three parameters (V, L, M), where

V is the number of output bits;

L is the number of input bits;

M is the number of memory registers.

The quantity L/V, called the code rate, is a measure of the efficiency of the code. Commonly the L and V parameters range from 1 to 8, M from 2 to 10, and the code rate from 1/8 to 7/8, except for deep space applications, where code rates as low as 1/100 or even lower have been employed.

Often the manufacturers of convolutional code chips specify the code by parameters (V, L, K).

The quantity K is called the constraint length of the code and is defined by

K=L(M-1). (12.9)

The constraint length K represents the number of bits in the encoder memory that affect the

generation of the V output bits.

12.4.1 Code parameters and the structure of encoder

The convolutional encoder structure is easy to draw from the code parameters. First draw M boxes representing the M memory registers. Then draw V modulo-2 adders to represent the V output bits. Now connect the memory registers to the adders using the generator polynomial, as shown in Figure 12.4. (An elementary explanation of generator polynomials follows below; comprehensive studies of the representation of code words as polynomials and of the generator polynomials of codes are presented in many books, for example in [83, 84].)

[Figure: input bits $s_i$ entering a 3-stage shift register (stages M1, M2, M3 holding $s_1$, $s_0$, $s_{-1}$); two mod-2 adders form the output bits $v_1$ and $v_2$, which are multiplexed into the encoded output stream $c_o$; graphics not reproduced]

Fig. 12.4. Convolutional encoder (2, 1, 3) with V=2, L=1, M=3

This encoder generates a rate 1/2 code. Each input bit is coded into 2 output bits. The constraint length of the code is 2 (K=2). The 2 output bits are produced by the 2 modulo-2 adders by adding up certain bits in the memory registers. The selection of which bits are to be added to produce an output bit is called the generator polynomial G for that output bit. For example, the first output bit has a generator polynomial of (1,1,1). Output bit 2 has a generator polynomial of (1,0,1). The output bits are just the modulo-2 sums of these bits:

$v_1 = s_1 \oplus s_0 \oplus s_{-1}$ ,

$v_2 = s_1 \oplus s_{-1}$ .

The polynomials give the code its unique error protection quality. One (2,1,3) code can have

completely different properties from another one depending on the polynomials chosen.
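For concreteness, here is a minimal Python sketch of the (2,1,3) encoder of Figure 12.4 (our own code, not from the text), using the generator polynomials G1 = (1,1,1) and G2 = (1,0,1) and appending the two flush bits discussed later:

def conv_encode(bits):
    s0 = s_1 = 0                           # the two memory stages, initially zero
    out = []
    for s1 in bits + [0, 0]:               # the trailing zeroes flush the encoder
        out += [s1 ^ s0 ^ s_1,             # v1, generator polynomial (1,1,1)
                s1 ^ s_1]                  # v2, generator polynomial (1,0,1)
        s_1, s0 = s0, s1                   # shift the register contents
    return out

print(conv_encode([1, 0, 1, 1, 0]))
# [1,1, 1,0, 0,0, 0,1, 0,1, 1,1, 0,0] -> 11 10 00 01 01 11 00

This is exactly the coded sequence derived for the input 10110 in the sections that follow.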

There are many choices of polynomials for any code of order M. Not all of them result in output sequences that have good error protection properties. The complete list of these polynomials is presented in [84]. Good polynomials are usually found from this list by computer simulation. A list of good polynomials for rate 1/2 codes is given below in Table 12.1 [85].

Table 12.1 Generator polynomials for good rate 1/2 codes [85]

Constraint length | G1        | G2
3                 | 110       | 111
4                 | 1101      | 1110
5                 | 11010     | 11101
6                 | 110101    | 111011
7                 | 110101    | 110101
8                 | 110111    | 1110011
9                 | 110111    | 111001101
10                | 110111001 | 1110011001

12.4.2 Encoder states

What an encoder outputs depends on the state of its shift-register stages. Encoder states are just sequences of bits: sophisticated encoders have long constraint lengths and simple ones short, the constraint length indicating the number of states the encoder can be in.

The (2,1,3) encoder in Figure 12.4 has a constraint length of 2. The shaded stages hold these state-determining bits. The unshaded stage holds the incoming bit. This means that 2 bits, i.e. 4 different combinations of these bits, can be present in these memory stages. These 4 different combinations determine what output we will get for v1 and v2, the coded sequence. The number of combinations of bits in the shaded stages is called the number of states of the code and is defined by

Number of states $= 2^K$ , (12.10)

where K is the constraint length of the code. The output bits depend on the condition of the register stages, which changes at each time tick.

The (2,1,3) code shown above has 2 output bits for every 1 input bit. It is a rate 1/2 code. Its constraint length is 2. The total number of states is equal to 4. The four states of this (2,1,3) code are: 00, 01, 10, 11.

12.4.3 Punctured codes

For the special case of L=1, the codes of rates 1/2, 1/3, 1/4, 1/5, 1/7 are sometimes called mother codes. We can combine these single-bit-input codes to produce punctured codes, which give us code rates other than 1/V. By using two rate 1/2 codes together as shown in Figure 12.5, and then simply not transmitting one of the output bits, we can convert this rate 1/2 implementation into a rate 2/3 code: 2 bits come in and 3 go out. This concept is called puncturing. On the receive side, dummy bits that do not affect the decoding metric are inserted in the appropriate places before decoding.

[Figure: two identical (2,1,3) encoders, each a 3-stage shift register producing output bits $v_1$ and $v_2$; one output bit of the combined group is deleted (punctured) before transmission; graphics not reproduced]

Fig. 12.5. Punctured (3, 2, 3) code encoder composed from two (2, 1, 3) encoders

This technique allows us to produce codes of many different rates using the same simple hardware. Although we can also directly construct a code of rate 2/3, as we shall see later, the advantage of a punctured code is that the rates can be changed dynamically (through software) depending on the channel conditions, such as rain, etc. A fixed implementation, although easier, does not allow this flexibility.
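The mechanics of puncturing can be sketched in a few lines of Python (illustrative only; the deletion pattern below, dropping every fourth coded bit, is our own choice and not the pattern of any particular standard). Every 2 input bits give 4 mother-code bits, of which 3 are transmitted, hence rate 2/3:

def puncture(coded):
    return [b for i, b in enumerate(coded) if i % 4 != 3]   # delete every 4th coded bit

def depuncture(received):
    out = []
    for i, b in enumerate(received):
        out.append(b)
        if i % 3 == 2:
            out.append(None)               # dummy bit, ignored by the decoding metric
    return out

coded = conv_encode([1, 0, 1, 1, 0])       # rate-1/2 stream from the encoder sketch above
tx = puncture(coded)
print(len(coded), len(tx))                 # 14 11: three of fourteen bits were not sent
print(len(depuncture(tx)))                 # 14: dummies restore the rate-1/2 frame for decoding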

12.4.4 Structure of an encoder for L>1

Alternatively, it is possible to create codes where L is more than 1 bit. The procedure for drawing the structure of a (V, L, M) encoder where L is greater than 1 is as follows. First draw L sets of M boxes. Then draw V adders. Now connect the V adders to the register stages using the coefficients of the generator polynomials. This gives a structure like the one in Figure 12.6 for the (4,3,3) encoder. This (4,3,3) convolutional encoder has 9 memory registers, 3 input bits and 4 output bits. The constraint length is 3×2 = 6, so the code has 64 states. The shaded registers contain the old bits representing the current state; their number equals the constraint length.

[Figure: three sets of shift-register stages holding $s_1 s_2 s_3$, $s_0 s_{-1} s_{-2}$ and $s_{-3} s_{-4} s_{-5}$, connected to four mod-2 adders forming the output bits $v_1, v_2, v_3, v_4$; graphics not reproduced]

Fig. 12.6. Block diagram of an encoder for L=3

12.4.5 Coding of an incoming sequence

Let's encode a two-bit sequence 10 with the (2,1,3) encoder and see how the process works. First we will pass a single bit 1 through this encoder, as shown in Figure 12.7.

[Figure: the register contents as the single 1 bit shifts through the encoder: 001, then 010, then 100; graphics not reproduced]

Fig. 12.7. Illustration of encoding of a two-bit sequence; (a) encoder state at time t=0, (b) encoder state at time t=1, (c) encoder state at time t=2

At time t=0, we see that the initial state of the encoder is all zeroes, i.e. 00 (the bits in the rightmost K register stage positions). The input bit 1 causes the two bits 11 to be output.

At t=1, the input bit 1 moves forward one stage. The input stage is now empty and is filled with a flush bit of 0. The encoder is now in state 10. The output bits are now 10 by the same math.

At t=2, the input bit 1 moves forward again. Now the encoder state is 01 and another flush bit is moved into the input register stage. The output bits are now 11 again.

At t=3, the input bit 1 has passed completely through the encoder, and the encoder has been flushed to the all-zero state, ready for the next sequence.

Note that a single bit has produced a 6-bit output, although nominally the code rate is 1/2. This shows that for small sequences the overhead is much higher than the nominal rate, which only applies to long sequences.

If we did the same thing with a 0 bit, we would get a 6 bit all zero sequence. What we just

produced is called the impulse response of this encoder. The 1 bit has an impulse response of 11 10

11. The 0-bit similarly has an impulse response of 00 00 00.

Convolving the input sequence with the code polynomials produces these output sequences, which is why these codes are called convolutional codes. From the principle of linear superposition, we can now produce a coded sequence from the above two impulse responses as follows.

Let's say that we have an input sequence of 10110 and we want to know what the coded sequence would be. We can calculate the output by just adding the shifted versions of the individual impulse responses. This is presented in Table 12.2.

Table 12.2 Producing a coded sequence

Input bit | Impulse response (shifted)
1         | 11 10 11
0         |    00 00 00
1         |       11 10 11
1         |          11 10 11
0         |             00 00 00
---------------------------------------------
Modulo-2 sum for input 1 0 1 1 0: 11 10 00 01 01 11 00

So, when the input sequence is 10110, the coded sequence is 11 10 00 01 01 11 00.
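The superposition of Table 12.2 takes only a few lines of Python (an illustration of the convolution, using the impulse response 11 10 11 found above):

h = [1, 1, 1, 0, 1, 1]                     # impulse response of a single 1 bit: 11 10 11
msg = [1, 0, 1, 1, 0]
out = [0] * (2 * (len(msg) + 2))           # room for the flushed encoder output
for i, bit in enumerate(msg):
    if bit:
        for j, hj in enumerate(h):
            out[2 * i + j] ^= hj           # add the response, shifted by one bit pair, mod 2
print(out)                                 # 11 10 00 01 01 11 00 again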

12.4.6 Encoder design

The previous section showed what happens inside an encoder. The encoding hardware itself is much simpler, because the encoder need not perform these computations explicitly: a convolutional encoder can use a look-up table to do the encoding. The look-up table consists of four items:

Input bit;

The state of the encoder, which is one of the 4 possible states for the example (2,1,3) code;

The output bits. For the code (2,1,3), since 2 bits are output, the choices are 00, 01, 10, 11;

The output state which will be the input state for the next bit.

Analyzing the operation of the (2,1,3) encoder, it is possible to create the following look-up table 12.3.

Table 12.3 Look-up table for the encoder (2,1,3)

Input bit s1 | Input state s0 s-1 | Output bits v1 v2 | Output state s0 s-1
0            | 0 0                | 0 0               | 0 0
1            | 0 0                | 1 1               | 1 0
0            | 0 1                | 1 1               | 0 0
1            | 0 1                | 0 0               | 1 0
0            | 1 0                | 1 0               | 0 1
1            | 1 0                | 0 1               | 1 1
0            | 1 1                | 0 1               | 0 1
1            | 1 1                | 1 0               | 1 1
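Such a look-up table need not be written by hand; it can be generated once from the encoder equations, as in this short sketch:

for s0, s_1 in [(0, 0), (0, 1), (1, 0), (1, 1)]:      # input states in the order of Table 12.3
    for s1 in (0, 1):                                  # the incoming bit
        v1, v2 = s1 ^ s0 ^ s_1, s1 ^ s_1               # outputs from the generator polynomials
        print(s1, (s0, s_1), "->", (v1, v2), (s1, s0)) # output bits and output state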

Three different but related graphical representations have been devised for the study of

convolutional encoding. These are:

State diagram;

Tree diagram;

Trellis diagram.

12.4.7 State diagram

A state diagram for the (2,1,3) code is shown in Figure 12.8. Each circle represents a state. At

any one time, the encoder resides in one of these states. The lines to and from it show state transitions

that are possible as bits arrive. Only two events can happen at each time, arrival of a 1 bit or arrival

of a 0 bit. Each of these two events allows the encoder to jump into a different state. The state diagram does not have time as a dimension and hence tends not to be intuitive.

[Figure: four circles representing states 00, 01, 10 and 11, with arrows for each state transition labeled with the corresponding output bits; graphics not reproduced]

Fig. 12.8. A state diagram for the (2,1,3) code

Comparing the above state diagram to the encoder look-up table, one can see that the state diagram contains the same information as the table, but in graphic form. The solid lines indicate the arrival of a 0 and the dashed lines the arrival of a 1. The output bits for each case are shown on the line, and the arrow indicates the state transition.

Some encoder states allow outputs of 11 and 00 and some allow 01 and 10. No state allows all four options.

Let's encode the sequence 10110 using the state diagram:

We start at state 00. The arrival of a 1 bit outputs 11 and puts us in state 10;

The arrival of the next 0 bit outputs 10 and puts us in state 01;

The arrival of the next 1 bit outputs 00 and puts us in state 10;

The arrival of the next 1 bit outputs 01 and puts us in state 11;

Inputting the last bit 0 takes us to state 01 and outputs 01. So now we have 11 10 00 01 01. But this is not the end: we have to push the last informative bit 0 out of the last register stage and take the encoder back to the all-zero state;

Pushing the last bit 0 towards the last register stage, we jump from state 01 to state 00, outputting 11;

Finally we return once more to state 00, outputting 00.

The final answer is 11 10 00 01 01 11 00.


12.4.8 Tree diagram

Figure 12.9 shows the tree diagram for the (2,1,3) code. The tree diagram attempts to show the passage of time as we go deeper into the tree branches. It is somewhat better than a state diagram, but still not the preferred approach for representing convolutional codes. Here, instead of jumping from one state to another, we go down the branches of the tree depending on whether a 1 or a 0 is received. If a 0 is received, we go up; if a 1 is received, we go down.

The first branch in Figure 12.9 indicates the arrival of a 0 or a 1 bit. The starting state is assumed to be 00. The bits above the horizontal lines are the output bits, and the bits in parentheses below the horizontal lines are the output states.

Let's code the sequence 10110 as before. At branch 1, we go down. The output is 11 and we are now in state 10. Now we get a 0 bit, so we go up. The output bits are 10 and the state is now 01. The next incoming bit is 1. We go down and get an output of 00; the output state is now 10. The next incoming bit is 1, so we go down again and get output bits 01. From this point, in response to a 0 bit input, we get an output of 01 and an output state of 01. What if the sequence were longer? We have run out of tree branches, so the tree diagram now repeats. In fact we need to flush the encoder, so our sequence is actually 10110 00, with the last 2 bits being the flush bits. We now go up for two branches. Now we have the output for the complete sequence, and it is 11 10 00 01 01 11 00. This is the same answer as the one we got from the state diagram.

[Figure: the full binary tree for the (2,1,3) code; each branch is labeled with the two output bits and, in parentheses, the resulting encoder state; graphics not reproduced]

Fig. 12.9. Tree diagram of (2,1,3) code


12.4.9 Trellis diagram

Trellis diagrams are messy but generally preferred over both the tree and the state diagrams, because they represent the linear time sequencing of events. The x-axis is discrete time and all possible states are shown on the y-axis. We move horizontally through the trellis with the passage of time. Each transition means that new bits have arrived.

The trellis diagram is drawn by lining up all the possible states ($2^K$) on the vertical axis, where K is the constraint length. Then we connect each state to the next state by the allowable code words for that state. There are only two choices possible at each state, determined by the arrival of either a 0 or a 1 bit. When the arriving bit is 1, the line drawn is dashed, and when the arriving bit is 0, the line is solid. The bits on the lines are the output bits. Arrows going upwards represent a 0 bit and arrows going downwards a 1 bit. The trellis diagram is unique to each code, just as the state and tree diagrams are. We can draw the trellis for as many periods as we want; each period repeats the possible transitions.

We always begin at state 00. Starting from here, the trellis expands and in K bits becomes fully populated, such that all transitions are possible. The transitions then repeat from this point on. The trellis diagram for the (2,1,3) code is presented in Figure 12.10.

[Figure: trellis with states 00, 01, 10, 11 on the vertical axis and discrete time on the horizontal axis; branches are labeled with output bits, and the pattern of transitions repeats after the start-up period; the encoding path for the sequence 10110 00 is marked in red (informative bits) and blue (flush bits); graphics not reproduced]

Fig. 12.10. Trellis diagram for (2,1,3) code

Encoding begins at the start point, representing the encoder's state 00 and the beginning moment of discrete time. Coding is easy: we simply go up for a 0 bit and down for a 1 bit. The path taken by the bits of the example sequence (10110 00) is shown by the red and blue lines; the red lines indicate the path corresponding to the informative bits 10110, and the blue ones the flush bits 00. We see that the trellis diagram gives exactly the same output sequence, 11 10 00 01 01 11 00, as the other methods above, namely the impulse response, the state diagram and the tree diagram.

12.4.10 The basic decoding idea

There are several different approaches to the decoding of convolutional codes. These are grouped into two basic categories:

Sequential decoding - the Fano algorithm;

Maximum likelihood decoding - the Viterbi algorithm.

Both of these methods represent two different approaches to the same basic decoding idea.

Assume that 3 bits were sent via a rate 1/2 code. We receive 6 informative bits plus the tail bits produced by the two flush bits. These six informative bits may or may not have errors. We know from the encoding process that these bits map uniquely: a 3-bit sequence has a unique 6-bit output. But due to errors, we can receive any of the possible combinations of the 6 bits. This is presented in Table 12.4.

The permutation of 3 input bits results in eight possible input sequences. Each of these has a unique mapping to a six-bit output sequence by the code. These form the set of permissible sequences, and the decoder's task is to determine which one was sent.


Table 12.4 Bit agreement as a metric for decision between the received sequence and the 8 possible valid code sequences

Input bits | Valid code sequence | Received sequence | Bit agreement
000        | 000000              | 111010            | 2
001        | 000011              | 111010            | 1
010        | 001110              | 111010            | 3
011        | 001101              | 111010            | 3
100        | 111011              | 111010            | 5
101        | 111000              | 111010            | 5
110        | 110101              | 111010            | 2
111        | 110110              | 111010            | 4

Let's say we received 111010. It is not one of the 8 possible sequences above. How do we decode it? We can do two things:

We can compare this received sequence to all permissible sequences and pick the one with the smallest Hamming distance (or bit disagreement);

We can do a correlation and pick the sequence with the best correlation.

The first procedure is basically what is called hard decision decoding, and the second soft decision decoding. The bit agreements show that we still get an ambiguous answer and do not know what was sent. As the number of bits increases, the number of calculations required for decoding increases so significantly that it is no longer practical to do decoding this way. We need to find a more efficient method that does not examine all options and has a way of resolving ambiguities such as the one here, where we have two possible answers (shown in bold in Table 12.4).

If a message of length n bits is received, then the possible number of code words is $2^n$. How can we decode the sequence without checking each and every one of these $2^n$ code words? This is the basic decoding idea.

12.4.11 Sequential decoding

Sequential decoding was one of the first methods proposed for decoding a convolutionally coded bit stream.

Sequential decoding allows both forward and backward movement through the trellis. The decoder keeps track of its decisions and, each time it makes an ambiguous decision, counts it. If the count increases faster than some threshold value, the decoder gives up that path and retraces it back to the last fork where the tally was below the threshold.

Let's do an example. The code is (2,1,3), for which we drew the encoder diagrams in the previous section. Assume that the bit sequence 10110 00 was sent. (Remember the last 2 bits are the result of the flush bits; their output is called tail bits.)

If no errors occurred, we would receive 11 10 00 01 01 11 00. But let's say we received instead 01 10 00 01 01 11 00. One error has occurred: the first bit has been received as 0 instead of 1.

The decoding process is illustrated in Figure 12.11.

[Figure: trellis with states 00, 01, 10, 11 over discrete time, showing the decoder's trial paths through numbered decision points; branches conflicting with the received bits 01 10 00 01 ... are marked; graphics not reproduced]

Fig. 12.11. Sequential decoding path search

1. Decision point 1. The decoder looks at the first two bits, 01. Right away it sees that an error has occurred, because the starting two bits can only be 00 or 11. But which of the two bits was received in error, the first or the second? The decoder randomly selects 00 as the starting choice and decodes the input bit as a 0. It puts a count of 1 into its error counter. It is now at point 2.

2. Decision point 2. The decoder now looks at the next pair of bits, which is 10. However, the code word choices are 11 and 00. From here, it randomly makes the decision that a 1 was sent, which corresponds to the code word 11. This is counted as an error, and the error count is increased to 2. Since the error count is less than the threshold value of 3 (which we have set based on channel statistics), the decoder proceeds ahead. This puts it at point 3.

3. At decision point 3, the received bits are 00, but the code word choices, leading to points 4 or 5, are 10 and 01. The decoder recognizes another error, and the error tally increases to 3. That tells the decoder to turn back.

4. The decoder goes back to point 2, where the error tally was 1 (less than 3), and takes the other choice to point 6. It again encounters an error condition: the received bits are 10 but the code word of this branch is 00. The tally has again increased to 2.

5. At decision point 6, the received bits are 00, and the code word choices are 00 and 11. From here, the decoder makes the decision that a 0 was sent, which corresponds exactly with one of the code words. This puts it at point 7.

6. At this point the received bits are 01, but the proposed code words are 00 and 11. Either selection leads to the third error. The decoder has to turn back all the way to the beginning point 1.

7. Now all choices encountered (red lines) match the code word choices perfectly, and the decoder successfully decodes the message as 1011000.
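The back-tracking search just described can be condensed into a small recursive Python sketch (our own simplification, not the full Fano algorithm). TRANS encodes the transitions of Table 12.3 as (output bits, next state):

TRANS = {
    ('00', 0): ('00', '00'), ('00', 1): ('11', '10'),
    ('01', 0): ('11', '00'), ('01', 1): ('00', '10'),
    ('10', 0): ('10', '01'), ('10', 1): ('01', '11'),
    ('11', 0): ('01', '01'), ('11', 1): ('10', '11'),
}

def search(received, state='00', errors=0, threshold=3):
    if errors >= threshold:
        return None                        # discard this path and back-track
    if not received:
        return []                          # a complete path below the threshold
    for bit in (0, 1):
        out, nxt = TRANS[(state, bit)]
        mism = sum(a != b for a, b in zip(out, received[0]))
        tail = search(received[1:], nxt, errors + mism, threshold)
        if tail is not None:
            return [bit] + tail
    return None

rx = ['01', '10', '00', '01', '01', '11', '00']   # first bit received in error
print(search(rx))                                  # [1, 0, 1, 1, 0, 0, 0] = 10110 + flush bits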

There exist some small modifications of the above-described decoding procedure. We will examine two of them. The first one concerns the principle of decoding an ordinary bit.

Consider the first message bit. This first message bit has an effect only on the first M·V bits in the code word. Thus with M=3 and V=2, as for the (2,1,3) code, the first bit has an effect only on the first 3 groups of 2 bits. Hence, to deduce this first bit there is no point in examining the code word beyond its first 6 bits. On the other hand, we would not be taking full advantage of the redundancy of the code if we undertook to decide about the first message bit on the basis of anything less than an examination of the first 6 bits of the code. Generally speaking, to decode an ordinary bit it is recommended to examine the M·V corresponding bits in the code word or, equivalently, to pass M points (nodes) in the trellis diagram, provided the total number of errors does not exceed the error threshold value.

Suppose that in the transmission of any V bits, errors have found their way into more bits than the error threshold value permits. Then, at the node from which this occurs, the decoder will make a mistake. In such a case, the entire continuation of the path taken by the decoder must be in error. Consider, then, that the decoder keeps a record, as it progresses, of the total number of discrepancies between the received code bits and the corresponding bits encountered along its path. Then, after having taken a wrong turn at some node, the likelihood is very great that this total number of discrepancies will grow much more rapidly than would be the case if the decoder were following the correct path. The decoder may be programmed to respond to such a situation by retracing its path to the node at which an apparent error has been made and then taking the alternate way out of that node. In this way the decoder will eventually find a path through M nodes. When such a path is found, the decoder decides about the first message bit on the basis of the direction of divergence of this path from the first node. Similarly, the second message bit is determined on the basis of a path searched out by the decoder, again M nodes long, but starting at the second node and on the basis of the received code bit sequence with the first V bits discarded.

The second modification concerns the total number of bit differences between the decoder path and the corresponding received bit sequence. We consider now how the decoder may judge that it has made an error and taken a wrong turn. Let the probability be p that a received code bit is in error, and let the decoder already have traced out a path through l nodes. We assume l<M, so that the decoder has not yet made a decision about the message bit in question. Since every node is associated with V bits, then, on the average over a long path of many nodes, we would expect the total number of bit differences between the decoder path and the corresponding received bit sequence to be d(l)=plV even when the correct path is being followed. In Figure 12.12 we have plotted d(l) in a coordinate system in which the abscissa is l, the number of nodes traversed, and the ordinate is the total bit error that has accumulated.

[Figure: total accumulated bit errors versus number of nodes l traversed; the straight line d(l)=plV for the correct path, a sharply diverging curve for an incorrect path, and the discard level above both; graphics not reproduced]

Fig. 12.12. Setting the threshold in sequential decoding

We would expect that, if the decoder were following the correct path, the total bit errors accumulated would oscillate about d(l). On the other hand, shortly after a wrong decoder decision has been made, we expect the accumulated bit error to diverge sharply, as indicated. A discard level has also been indicated in Figure 12.12. When the plot of accumulated errors crosses the discard level, the decoder judges that a decoding error has been made. The decoder then returns to the nearest overpassed node, chooses the alternative way and starts moving forward again until, possibly, it is again reversed because the discard level is crossed. In this way, after some trial and error, an entire M-node section of the trellis is navigated, and at this point a decision is made about the message bit associated with this M-node section of the trellis. In Figure 12.12 the discard level line does not start at the origin, in order to allow for the possibility that the initial bits of the received code sequence may be accompanied by a burst of noise. The great advantage of sequential decoding of a convolutional code is that such decoding generally makes it unnecessary to explore every one of the $2^M$ paths in the code trellis section. Thus, suppose it should happen that the decoder takes the correct turn at each node. Then, in this case, the decoder will be able to make a decision about the message bit in question on the basis of a single path. Let us assume that at some node the decoder makes a mistake and must return to this node to take an alternative way. Then even the information that an error has been made is useful, because thereafter the decoder may exclude from its search all paths which diverge in the original direction from this node. The end result is that sequential decoding may generally be accomplished with very much less computation than the direct decoding discussed earlier.


The memory requirements of sequential decoding are manageable and so this method is used

with long constraint length codes where the signal-to-noise ratio is low.

12.4.12 Maximum likelihood and Viterbi decoding

Viterbi decoding is the best-known implementation of maximum likelihood decoding. Here we narrow the options systematically at each time tick. The principle used to reduce the choices is this:

The errors occur infrequently; the probability of error is small;

The probability of two errors in a row is much smaller than that of a single error, that is, the errors are distributed randomly.

The Viterbi decoder examines an entire received sequence of a given length. The decoder computes a metric for each path and makes a decision based on this metric. All paths are followed until two paths converge on one node. Then the path with the higher metric is kept and the one with the lower metric is discarded. The paths selected are called the survivors.

For an N-bit sequence, the total number of possible received sequences is $2^N$. Of these, only $2^{KL}$ are valid. The Viterbi algorithm applies the maximum likelihood principle to limit the comparison to the $2^{KL}$ surviving paths instead of checking all paths.

The most common metric used is the Hamming distance metric. This is just the dot product

between the received code word and the allowable code word. Each branch has a Hamming metric

depending on what was received and the valid code words at that state. This is illustrated in table 12.5.

Other metrics are also used and we will talk about these later.

Table 12.5 Each branch has a Hamming metric depending on what was received and the valid code words at that state

Received bits | Valid code word 1 | Valid code word 2 | Hamming metric 1 | Hamming metric 2
00            | 00                | 11                | 2                | 0
01            | 10                | 01                | 0                | 2
10            | 00                | 11                | 1                | 1

These metrics are cumulative, so the path with the largest total metric is the final winner. But all of this does not make much sense until you see the algorithm in operation.

Let's decode the received sequence 01 10 00 01 01 11 00 using Viterbi decoding.

1. At t=0 (step 1), we have received bits 01 (see Figure 12.13). The decoder always starts at state 00. From this point it has two paths available, but neither matches the incoming bits. The decoder computes the branch metric for both of these and will continue simultaneously along both of these branches, in contrast to sequential decoding, where a choice is made at every decision point. The metric for both branches is equal to 1, which means that one of the two bits matched the incoming bits.

[Figure: trellis at t=0; from state 00 the branches 00 and 11 are tried against the received bits 01, each with path metric 1; graphics not reproduced]

Fig. 12.13. Viterbi decoding; Step 1

2. At t=1 (step 2), the decoder fans out from these two possible states to four states (see Figure 12.14). The branch metrics for these branches are computed by looking at the agreement between the code words and the incoming bits, which are 10. The new metrics are shown on the right of the trellis.

[Figure: trellis at t=1; four branches evaluated against the received bits 10, giving path metrics 2, 2, 3, 1; graphics not reproduced]

Fig. 12.14. Viterbi decoding; Step 2

At t=2 (step 3), the four states have fanned out to eight to show all possible paths (see Figure 12.15). The path metrics are calculated for the received bits 00 and added to the previous metrics from t=1. The trellis is now fully populated; each node has at least two paths coming into it, and the paths begin to converge on the nodes. Two metrics are given for each of the paths coming into a node. According to the maximum likelihood principle, at each node we discard the path with the lower metric, because it is least likely; the discarded paths are shown in blue. This discarding of paths at each node helps to reduce the number of paths that have to be examined and gives the Viterbi method its strength. The metrics are shown in Figure 12.15 on the right of the trellis.

[Figure: trellis at t=2, fully populated; two paths converge on every node with metric pairs (4,3), (2,5), (3,2), (3,2), and the lower-metric path of each pair is discarded; graphics not reproduced]

Fig. 12.15. Viterbi decoding; Step 3

At t=3 (step 4), the received bits are 01 (see Figure 12.16). The paths progress forward, and now at each node we have two paths converging again. Again the metrics are computed for all paths. We discard all paths with smaller metrics, shown in blue, but keep both if they are equal. The metrics for all paths are given on the right.

Fig. 12.16. Viterbi decoding; Step 4

At t=4 (step 5), the received bits are 01 again (see Figure 12.17). Again the metrics are computed for all paths; we discard the paths with the smaller metrics, but keep both if they are equal.

Fig. 12.17. Viterbi decoding; Step 5

At t=5 (step 6), the received bits are 11 (see Figure 12.18). After discarding the paths as shown,

we again go forward and compute new metrics.

Fig. 12.18. Viterbi decoding; Step 6

At t=6 (step 7), the received bits are 00 (see Figure 12.19). Again the metrics are computed for all paths and the paths with the smaller metrics are discarded.

Fig. 12.19. Viterbi decoding; Step 7

At t=7 (step 8) the trellis is complete (see Figure 12.20). After discarding the paths with the smaller metrics, we are left with the winning path. The path traced by the states 00, 10, 01, 10, 11, 01, 00, 00, corresponding to the information bits 1011000, is the decoded sequence.

Fig. 12.20. Viterbi decoding; Step 8

As a result of the reduction of the paths entering each node from two to one, only four paths survive at each step. However, when the information bit stream is long, the amount of information that must be stored in memory to record the four possible bit streams, and the error associated with each, becomes enormous. For example, if data is sent at a rate of 10 Mbit/s for one second, 10 Mbit of data are transmitted, and the memory must store more than 40 Mbit of data.

To reduce the data handling capability required of the decoder, the Viterbi algorithm offers a sub-optimum procedure which truncates the trellis, typically after P = 5K nodes, where K is the constraint length. Thus if K = 7, P = 35 bits. When truncating the trellis we look at the trellis after P bits are received. We then choose the path (remember there are four paths) with the fewest errors and delete the remaining three paths. In this manner the memory is reduced to somewhat more than P bits, as after each decision P/2 data information bits are output from the system.

The reason the Viterbi truncation technique is sub-optimum is that we choose the path with the minimum number of errors from among the paths terminating at the different states; such a comparison is not optimum, since each node yields a distinct path and the paths are therefore not directly comparable. An optimum system would make a decision only after receiving the entire message.

Fortunately, the complexity of the Viterbi algorithm is linear in the code word length T. At each time instant we have to add, compare and select in every state, so the complexity is also linear in the number of states at each time, which is 4 in our case. In general the number of states is 2^M, where M is the number of delay elements in the encoder. Therefore Viterbi decoding is in practice only feasible if M is not much higher than, say, 10, i.e. if the number of states is not much more than 2^10 = 1024. Note that the code performance improves with increasing values of M.
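A compact sketch of the procedure walked through above. It assumes the rate-1/2, K=3 convolutional code whose trellis the figures show (generator polynomials 7 and 5 in octal – an assumption, but one that reproduces both the branch code words and the decoded result of the example). Minimizing the Hamming distance, as done here, is equivalent to maximizing the agreement metric used in the text.

```python
# Minimal Viterbi decoder sketch for the example above, assuming the
# rate-1/2, K=3 convolutional code with generators 7 and 5 (octal).

def branch_output(state, u):
    """Encoder output (2 bits) for input bit u in the given state.
    state holds the two most recent input bits, newest bit first."""
    b1, b2 = state >> 1, state & 1
    return ((u ^ b1 ^ b2) << 1) | (u ^ b2)       # g1 = 111, g2 = 101

def hamming2(a, b):
    """Hamming distance between two 2-bit words."""
    x = a ^ b
    return (x & 1) + ((x >> 1) & 1)

def viterbi_decode(received):
    """received: list of 2-bit integers. Returns the decoded bits."""
    INF = float("inf")
    metric = [0, INF, INF, INF]       # the decoder always starts at state 00
    paths = [[], [], [], []]          # surviving input-bit history per state
    for r in received:
        new_metric = [INF] * 4
        new_paths = [None] * 4
        for s in range(4):            # extend every surviving path
            if metric[s] == INF:
                continue
            for u in (0, 1):
                ns = (u << 1) | (s >> 1)                 # next state
                m = metric[s] + hamming2(branch_output(s, u), r)
                if m < new_metric[ns]:   # keep the better path per node
                    new_metric[ns] = m   # (ties broken arbitrarily here)
                    new_paths[ns] = paths[s] + [u]
        metric, paths = new_metric, new_paths
    best = min(range(4), key=lambda s: metric[s])        # overall winner
    return paths[best]

rx = [0b01, 0b10, 0b00, 0b01, 0b01, 0b11, 0b00]
print("".join(str(b) for b in viterbi_decode(rx)))       # -> 1011000
```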


13. MODULATION of DIGITAL SIGNALS

13.1 Phasor Representation of Signals

There are two key ideas behind the phasor representation of a signal:

a real, time-varying signal may be represented by a complex, time-varying signal; and

a complex, time-varying signal may be represented as the product of a complex number

that is independent of time and a complex signal that is dependent on time.

Let's be concrete. The signal

s(t) = A·cos(ωt + φ) (13.1)

illustrated in Figure 13.1, is a cosinusoidal signal with amplitude A, frequency ω, and phase φ. The amplitude A characterizes the peak-to-peak swing of 2A, the angular frequency ω characterizes the period T = 2π/ω between negative-to-positive zero crossings (or positive peaks, or negative peaks), and the phase φ characterizes the time τ = −φ/ω at which the signal reaches its first peak. The phase angle simply translates the sinusoid along the time axis, as shown in Figure 13.1. A positive phase angle shifts the signal left in time, while a negative phase angle shifts the signal right. With τ so defined, the signal s(t) may also be written as

s(t) = A·cos ω(t − τ). (13.2)

Fig. 13.1. A cosinusoidal signal; (a) advanced cosinusoidal signal, (b) cosinusoidal signal with zero phase angle, (c) delayed cosinusoidal signal

A positive phase angle (φ > 0) shifts the signal curve left along the time axis (see Figure 13.1, a), which implies a signal advance. A negative phase angle (φ < 0) shifts the signal curve right along the time axis (see Figure 13.1, c), which implies a signal delay. Correspondingly, when τ is positive (phase negative), τ is a "time delay" giving the time (greater than zero) at which the first peak is reached (see Figure 13.1, c). When τ is negative (phase positive), τ is a "time advance" giving the time (less than zero) at which the last peak occurred (see Figure 13.1, a).

For convenience, the terms leading and lagging are introduced when referring to the sign of the phase angle. A cosinusoidal signal s1(t) is said to lead another cosinusoid s2(t) of the same frequency if the phase difference between the two is such that s1(t) is shifted left in time relative to s2(t). Likewise, s1(t) is said to lag another cosinusoid s2(t) of the same frequency if the phase difference between the two is such that s1(t) is shifted right in time relative to s2(t).

With the substitution

ω = 2π/T (13.3)

we obtain a third way of writing s(t):

s(t) = A·cos[(2π/T)(t − τ)]. (13.4)


In this form the signal is easy to plot. Simply draw a cosinusoidal wave with amplitude A and period T; then place the origin (t = 0) so that the signal reaches its peak at τ. The inverse of the period T is called the "temporal frequency" of the cosinusoidal signal and is given the symbol f; f gives the number of cycles of s(t) per second. All of the above equations can be rewritten in terms of the frequency f:

s(t) = A·cos(2πf·t + φ) = A·cos 2πf(t − τ), (13.5)

where f = 1/T and ω = 2πf.

In summary, the parameters that determine a cosinusoidal signal have the following units:

A, arbitrary (e.g., volts, amperes or meters/sec, depending upon the application);
ω, in radians/sec (rad/sec);
T, in seconds (sec);
φ, in radians (rad) or in degrees (°); 2π rad is equal to 360°;
τ, in seconds (sec);
f, in sec⁻¹ or in hertz (Hz).

Thus, the conversion between radians and degrees can be expressed as:

Number of degrees = (180/π) × Number of radians.

The signal s(t) = A·cos(ωt + φ) can be represented as the real part of a complex number:

s(t) = Re[A·exp j(ωt + φ)] = Re[A·e^(jφ)·e^(jωt)].

We call A·e^(jφ)·e^(jωt) the complex representation of s(t) and write

s(t) ↔ A·e^(jφ)·e^(jωt),

meaning that the signal s(t) may be reconstructed by taking the real part of A·e^(jφ)·e^(jωt). In this representation, we call A·e^(jφ) the phasor or complex amplitude representation of s(t) and write

s(t) ↔ A·e^(jφ),

meaning that the signal s(t) may be reconstructed from A·e^(jφ) by multiplying by e^(jωt) and taking the real part. In communication theory, we call A·e^(jφ) the baseband representation of the signal s(t).

13.1.1 Geometric interpretation

At t = 0, the complex representation A·e^(jφ)·e^(jωt) produces the phasor A·e^(jφ). This phasor is illustrated in Figure 13.2; in the figure, φ is approximately π/4. If we let t increase to time t1, then the complex representation produces the phasor A·e^(jφ)·e^(jωt1). The multiplier e^(jωt1) just rotates the phasor A·e^(jφ) through the angle ωt1 (see Figure 13.2). Therefore, as we run t from 0 onwards indefinitely, we rotate the phasor A·e^(jφ) indefinitely, tracing out the circular trajectory of Figure 13.2. When t = 2π/ω, then e^(jωt) = e^(j2π) = 1. Therefore, every 2π/ω seconds the phasor revisits any given position on the circle of radius A. For this reason A·e^(jφ)·e^(jωt) is sometimes called a rotating phasor, whose rotation rate is the angular frequency ω:

d(ωt + φ)/dt = ω. (13.6)

This rotation rate is also the frequency of the cosinusoidal signal s(t) = A·cos(ωt + φ).


Fig. 13.2. Rotating phasor

In summary, A·e^(jφ)·e^(jωt) is the complex representation, or the rotating phasor representation, of the signal s(t) = A·cos(ωt + φ). In this representation, the multiplier e^(jωt) rotates the phasor A·e^(jφ) through the angles ωt at the rate ω. The real part of the complex representation is the desired signal A·cos(ωt + φ), while the imaginary part is the sinusoidal signal A·sin(ωt + φ).

The projection of the rotating phasor on the real axis traces out the desired signal A·cos(ωt + φ), as shown in Figure 13.3, while the projection of the rotating phasor on the imaginary axis traces out the sinusoidal signal A·sin(ωt + φ), also shown in Figure 13.3. In the figure, the angle φ is about π/4.

The projection of the rotating phasor on the real axis is called the in-phase (I) component of the rotating phasor, while the projection on the imaginary axis is called the phase-quadrature (simply quadrature – Q) component. In general, phase quadrature means 90 degrees out of phase, i.e., a relative phase shift of π/2.

Fig. 13.3. Illustration of rotating phasor projections on the real and imaginary axes

So, instead of describing the cosinusoidal (or sinusoidal) signal as a time function, we can use its phasor representation, which is more compact.
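As a quick numerical check of this representation, the following minimal sketch (with A, ω and φ chosen arbitrarily for illustration) confirms that the real part of A·e^(jφ)·e^(jωt) reproduces A·cos(ωt + φ), and that the imaginary part gives the quadrature component A·sin(ωt + φ):

```python
# Minimal numerical check of the rotating-phasor representation.
# A, omega and phi are arbitrary example values, not taken from the text.
import numpy as np

A, omega, phi = 2.0, 2 * np.pi * 50.0, np.pi / 4   # amplitude, rad/s, rad
t = np.linspace(0.0, 0.1, 1000)                    # time axis, seconds

phasor = A * np.exp(1j * phi)              # complex amplitude A*e^(j*phi)
rotating = phasor * np.exp(1j * omega * t)          # rotating phasor

# The I component (real part) equals A*cos(wt + phi) and
# the Q component (imaginary part) equals A*sin(wt + phi).
assert np.allclose(rotating.real, A * np.cos(omega * t + phi))
assert np.allclose(rotating.imag, A * np.sin(omega * t + phi))
print("Real and imaginary parts match the cosine and sine forms.")
```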


13.2 Modulation

In electronics and telecommunications, modulation is the process of varying one or more parameters (amplitude A, phase φ or frequency ω) of a periodic cosinusoidal waveform, called the carrier signal (a high frequency signal), with a modulating signal that typically contains information to be transmitted.

In telecommunications, modulation is the process of conveying a message signal, for example

a digital bit stream or an analog audio signal, inside another signal that can be physically transmitted.

Modulation of a sine waveform transforms a baseband message signal into a passband signal.

A modulator is a device that performs modulation. A demodulator is a device that performs

demodulation, the inverse of modulation. A modem (from modulator–demodulator) can perform both

operations.

The aim of digital modulation is to transfer a digital bit stream over an analog band pass

channel, for example over the public switched telephone network (where a band pass filter limits the

frequency range to 300–3400 Hz), or over a limited radio frequency band.

The aim of analog modulation is to transfer an analog baseband (or low pass) signal, for

example an audio signal or TV signal, over an analog band pass channel at a different frequency, for

example over a limited radio frequency band or a cable TV network channel.

The most fundamental digital modulation techniques are based on keying [35–38]:

PSK (phase-shift keying): a finite number of phases is used;

FSK (frequency-shift keying): a finite number of frequencies is used;

ASK (amplitude-shift keying): a finite number of amplitudes is used;

QAM (quadrature amplitude modulation): a finite number of at least two phases and at

least two amplitudes are used.

Illustrations of the simplest binary modulation signals – PSK, FSK and ASK – are presented in Figure 13.4.

Fig. 13.4. Illustration examples of the simplest digitally modulated signals; (a) binary data to be transmitted, (b) unipolar modulating signal, (c) polar modulating signal, (d) BASK (OOK) signal, (e) BPSK signal, (f) BFSK signal

ASK uses only two amplitudes: 0 for representing logical zeroes and A for representing logical ones; it is therefore called On-Off Keying (OOK) or Binary ASK. PSK uses two phases (0° for representing logical zeroes and 180° for representing logical ones) and is called Binary PSK (BPSK). Binary FSK uses two possible frequencies, f1 and f2, for representing logical zeroes and logical ones respectively.

13.3 Multilevel Modulation and Constellation Diagrams

For increasing bandwidth efficiency (the ratio of the transmitted data rate to the bandwidth used), more sophisticated M-ary (multilevel) modulation schemes are employed [35–38]. These are M-ASK, M-PSK, M-FSK and M-QAM. They use a number M of amplitudes, phases or frequencies greater than two – conventionally four, eight, sixteen, etc., generally M = 2^B, where B is the number of bits used for encoding each amplitude, phase or frequency.

In all of the above methods, each of these phases, frequencies or amplitudes is assigned a unique pattern of binary bits. Usually, each phase, frequency or amplitude encodes an equal number of bits. This number of bits comprises the symbol that is represented by the particular phase, frequency or amplitude.

If the alphabet consists of M = 2^B alternative symbols, each symbol represents a message consisting of B bits. If the symbol rate (also known as the baud rate) is f_symb symbols/second (baud), the data rate is B·f_symb bit/second. For example, with an alphabet consisting of 16 alternative symbols, each symbol represents 4 bits, and the data rate is four times the baud rate.

QAM is a more sophisticated modulation. In QAM, an in-phase signal (I, one example being a cosine waveform) and a quadrature phase signal (Q, an example being a sine wave) are amplitude modulated with a finite number of amplitudes and then summed. It can be seen as a two-channel system, each channel using ASK. The resulting signal is equivalent to a combination of PSK and ASK, and the component PSK and ASK parts can use a finite number of phases and amplitudes. Let's construct an example of a QAM signal using ASK with two amplitudes (1 and 2) and PSK with four phases (0, π/2, π, 3π/2); this is the technique most often used. Combining the possible PSK and ASK variants, we get 8 possible waves that we can send. These 8 waves can correspond to 8 binary combinations, or code words, consisting of three bits. The first step is to generate a table showing which waves correspond to which binary combination. This can basically be done at random, although we present the most commonly used case in Table 13.1.

Table 13.1 Correspondence between signal amplitudes and phases, and bit values

Bit values    000   001   010   011   100   101   110   111
Amplitudes    1     2     1     2     1     2     1     2
Phases        0     0     π/2   π/2   π     π     3π/2  3π/2

Now, let's encode the bit stream 001010100011101000011110 using this table. First of all, we must break it up into 3-bit triads called symbols: 001-010-100-011-101-000-011-110. Now all we have to do is to evaluate what the resulting signal should look like. The resulting waveform is depicted in Figure 13.5.

Fig. 13.5. Illustration example of an 8-QAM modulated signal
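A small sketch of this mapping, assuming the bit assignment of Table 13.1 and representing each transmitted 8-QAM symbol by its amplitude and phase (the helper names are our own illustrative choices):

```python
# Map a bit stream to 8-QAM symbols according to Table 13.1.
# Each 3-bit symbol selects an amplitude (1 or 2) and a phase.
import math

AMPLITUDES = {0: 1, 1: 2}                 # the last bit selects the amplitude
PHASES = {0b00: 0.0, 0b01: math.pi / 2,   # the first two bits select the phase
          0b10: math.pi, 0b11: 3 * math.pi / 2}

def map_8qam(bits):
    """bits: string of '0'/'1' characters, length a multiple of 3.
    Returns one (amplitude, phase) pair per 3-bit symbol."""
    symbols = []
    for i in range(0, len(bits), 3):
        b = int(bits[i:i + 3], 2)
        symbols.append((AMPLITUDES[b & 1], PHASES[b >> 1]))
    return symbols

# The first three symbols of the bit stream encoded in the text:
for sym, (amp, ph) in zip(["001", "010", "100"], map_8qam("001010100")):
    print(sym, "-> amplitude", amp, ", phase", round(ph, 3), "rad")
```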


In the case of PSK, ASK or QAM, where the carrier frequency of the modulated signal is constant, the modulation alphabet is often conveniently represented on a constellation diagram, showing the amplitude of the I signal on the x-axis and the amplitude of the Q signal on the y-axis, with a point for each symbol. These points are called constellation points. In other words, a constellation diagram is a representation of a signal modulated by a digital modulation scheme such as quadrature amplitude modulation or phase-shift keying. It displays the signal as a two-dimensional scatter diagram in the complex plane at the symbol sampling instants. In a more abstract sense, it represents the possible symbols that may be selected by a given modulation scheme as points in the complex plane. Measured constellation diagrams can be used to recognize the type of interference and distortion in a signal.

Representation of modulated signals in the form of a constellation diagram can also be applied to simple harmonic signals such as cosinusoidal or sinusoidal waveforms. In this case the constellation diagram consists of one constellation point, coinciding with the end-point of the phasor A·e^(jφ). Examples of constellation diagrams and the corresponding simple signals, including a cosinusoidal signal, are presented in Figure 13.6.

Fig. 13.6. Examples of simple signals and their constellation diagrams

However, the low spectral efficiency of these modulations makes them inappropriate for the transmission of high bit rates over channels whose bandwidth must be kept as small as possible. In order to increase the spectral efficiency of the modulation process, different kinds of more sophisticated modulations, including quadrature modulations, are used. The more complicated modulations 4-ASK and 4-PSK (Quadrature PSK – QPSK) and their constellations are presented in Figure 13.7.


Fig. 13.7. Examples of constellation diagrams and corresponding 4-ASK and 4-PSK signals

Even more complicated modulations are different kinds of quadrature amplitude modulations

(QAM) and high level phase shift keying (PSK) modulations. Some possible constellation diagrams

for 8-QAM (two versions), 8-PSK and 16-QAM (two versions) signals are presented in Figure 13.8.

Fig. 13.8. Constellation diagrams for 8-APSK (circular 8-QAM), 16-APSK (circular 16-QAM), Gray encoded 8-PSK and square 16-QAM

The first picture in Figure 13.8 represents a circular 8-QAM constellation. It is known to be the optimal 8-QAM constellation in the sense of requiring the least mean power for a given minimum Euclidean distance (the shortest straight-line distance between two points). The second picture is a circular 8-QAM constellation for the same signal as shown in Figure 13.5. The 8-PSK constellation diagram is for a Gray encoded signal: the codes of neighboring points of this constellation differ in only one bit. This gives better resistance to noise compared with other encodings of the 8-PSK constellation points. An explanation of this phenomenon is presented in Figure 13.9.

Fig. 13.9. Illustration of the influence of constellation point coding on signal resistance to noise


Figure 13.9 shows two 8-PSK constellations with different encodings of the points, the phasors of the original signal, the noise and the signal+noise, and some boundaries of the decision-making areas. It is evident that additive noise acting in a communication channel changes the received signal phasor, shifting its end point to a new position. Since the receiver's decision-making block decides on the information carried by the received signal from the position of its phasor, shifting the phasor's end point may cause decision errors. For example (see Figure 13.9), the noise shifts the received signal+noise phasor from the area AOB with the code 000 to the area BOC, which has the code 100 in the case of Gray encoding and the code 111 in the case of non-Gray encoding. In the Gray encoding case only one bit will be in error; in the non-Gray encoding case three bits will be in error. So, on average, Gray encoding results in better performance of the modulation scheme in a noise-corrupted environment.
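The following sketch illustrates this numerically for 8-PSK. It compares Gray labelling (the standard binary-reflected Gray code, chosen here for illustration) with natural binary labelling, counting how many bits change when noise pushes the decision into an adjacent sector:

```python
# Compare the bit errors caused by a one-sector decision slip in 8-PSK
# under Gray labelling and under natural binary labelling.

def gray(n):
    """Binary-reflected Gray code of n."""
    return n ^ (n >> 1)

def bit_errors(a, b):
    """Number of differing bits between two labels."""
    return bin(a ^ b).count("1")

M = 8
for name, label in (("Gray", gray), ("natural", lambda n: n)):
    # Noise most often moves the decision into a neighbouring sector,
    # so inspect the labels of adjacent constellation points (cyclically).
    errs = [bit_errors(label(k), label((k + 1) % M)) for k in range(M)]
    print(f"{name:8s}: bit errors per slip = {errs}, "
          f"average = {sum(errs) / M:.2f}")
# Gray labelling always costs exactly 1 bit per slip;
# natural binary labelling averages 1.75 bits per slip.
```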

It is in the nature of QAM that most orders of constellations can be constructed in many different ways. There are two main types: circular and square QAM. They differ not only in form but also in their properties. Note that a square QAM constellation can be constructed only when the number of bits composing each symbol is even, i.e., 2, 4, 6, etc. If circular QAM constellations are composed of n_R concentric rings, each with uniformly spaced PSK points, they are called APSK (Amplitude or Asymmetric Phase-Shift Keying) constellations. They can be considered a superclass of QAM. The advantage over conventional QAM, for example 16-QAM, is the lower number of possible amplitude levels, resulting in fewer problems with non-linear amplifiers. Two diagrams of circular QAM (APSK) constellations are shown in Figure 13.8: 8-APSK and 16-APSK. The 8-APSK constellation is for the same signal as shown in Figure 13.5. APSK has been proposed in the framework of the DVB-S2 standard [90]. The square 16-QAM constellation is also presented (the last picture in Figure 13.8). This constellation is recommended for use in Digital Video Broadcasting Cable and Terrestrial systems (DVB-C and DVB-T) [88, 89].

13.4 Quadrature Modulators and Demodulators

All the PSK and QAM signals discussed above may be formed, i.e., modulated, using quadrature modulators, and may be demodulated using quadrature demodulators [35–38]. Figure 13.10 schematically represents the block diagram of a quadrature modulator and demodulator. Input symbols coded on n bits are converted into two sequences, I (in-phase) and Q (quadrature), each coded on n/2 bits, corresponding to 2^(n/2) states for each of the two sequences. After digital-to-analog conversion (DAC) the I and Q sequences become analog I and Q signals. The I signal modulates one local oscillator output, and the Q signal modulates another output which is in quadrature (phase shifted by π/2) with the first one.

Fig. 13.10. Schematic block diagram of quadrature modulator and demodulator
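A minimal numeric sketch of the quadrature modulation/demodulation path of Figure 13.10, under simplifying assumptions of our own (ideal oscillators, no channel, and the low-pass filtering replaced by averaging over an integer number of carrier cycles):

```python
# Quadrature modulator/demodulator round trip for one symbol.
# s(t) = I*cos(wc*t) - Q*sin(wc*t); the receiver mixes with cos and -sin.
import numpy as np

fc = 1000.0                          # carrier frequency, Hz (example value)
fs = 100_000.0                       # sampling rate, Hz
t = np.arange(1000) / fs             # exactly 10 whole carrier cycles

I_tx, Q_tx = 2.0, -1.0               # one transmitted constellation point
s = I_tx * np.cos(2 * np.pi * fc * t) - Q_tx * np.sin(2 * np.pi * fc * t)

# Demodulation: mix with the two quadrature oscillators, then "low-pass"
# by averaging; the factor 2 compensates the 1/2 from cos^2 and sin^2.
I_rx = 2 * np.mean(s * np.cos(2 * np.pi * fc * t))
Q_rx = 2 * np.mean(s * -np.sin(2 * np.pi * fc * t))
print(round(float(I_rx), 6), round(float(Q_rx), 6))   # -> 2.0 -1.0
```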


13.5 Inter-Symbol Interference

Digital signals are streams of rectangular pulses representing 0's and 1's. Many bits can be combined to form symbols in order to increase the spectral efficiency of the modulation. However, the frequency spectrum of such digital signals is theoretically infinite, which would require an infinite bandwidth for their transmission – an impossibility. Bandwidth limiting of a signal by means of conventional filtering results in a theoretically infinite stretching of its temporal response, which leads to overlapping between successive bits or symbols. This effect is called inter-symbol interference (ISI) [35–38]. It can cause incorrect interpretation of the received waveform representing the received symbols (bits). An illustration of the effect of ISI on received pulses is presented in Figure 13.11.

Fig. 13.11. Illustration of ISI on received pulses; Ts is the symbol period (in the particular case, the bit period)

To avoid this problem, special filtering satisfying the first Nyquist criterion should be applied [35–38]. The essence of the first Nyquist criterion is that the temporal response of a Nyquist filter must present zeroes at times which are multiples of the symbol (bit) period Ts. There are many Nyquist filters that meet this requirement. The most commonly used is the raised-cosine (RC) filter. In order to optimize the bandwidth occupation and the signal-to-noise ratio at the same time, the filtering is shared equally between the transmitter and the receiver, each of which contains a half-Nyquist filter called a root-raised-cosine (RRC) filter, sometimes known as a square-root-raised-cosine (SRRC) filter. This filter is characterized by the so-called roll-off factor α, which defines the steepness of its frequency response slope. Figure 13.12, a shows the frequency response curve of the RC filter for three values of the roll-off factor, and Figure 13.12, b shows the corresponding temporal response.

Fig. 13.12. Characteristics of the raised-cosine filter for three values of the roll-off factor (α = 0; 0,5; 1); (a) – frequency response, (b) – temporal response


The temporal response shows the presence of zeroes at instants that are multiples of the symbol period. Figure 13.13 shows the temporal responses caused by a series of identical pulses, representing a series of identical symbols (1's), whose repetition period is equal to the symbol period Ts.

Fig. 13.13. Temporal responses of the Nyquist filter (RC filter) caused by a series of pulses

As follows from Figure 13.13, all responses fully overlap and take zero values at the instants that are multiples of the symbol period, except one, which takes its maximum value. In order to reduce the ISI to a minimum, the received signal has to be sampled at these instants. In this case the value of the sampled signal is equal to the maximum value of the response caused by one symbol only; the influence of the other symbols is thus avoided.

For a signal with symbol period Ts, the bandwidth F occupied after Nyquist filtering with a roll-off factor α is expressed by the formula [35–38]

F = (1 + α)/(2Ts). (13.7)
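A short sketch checking the zero-ISI property numerically, using the textbook raised-cosine impulse response (the symbol period and roll-off value are arbitrary example choices), together with the occupied bandwidth from (13.7):

```python
# Verify numerically that the raised-cosine impulse response is zero at
# nonzero multiples of the symbol period, and evaluate formula (13.7).
import numpy as np

Ts, alpha = 1e-6, 0.35        # example symbol period (1 us) and roll-off

def h_rc(t):
    """Raised-cosine impulse response (no singular points for these t)."""
    return (np.sinc(t / Ts) * np.cos(np.pi * alpha * t / Ts)
            / (1 - (2 * alpha * t / Ts) ** 2))

k = np.arange(1, 6)           # the instants Ts, 2Ts, ..., 5Ts
print(np.round(h_rc(k * Ts), 12))   # -> zeros: no ISI at these instants
print(h_rc(np.array([0.0])))        # -> 1: the wanted symbol's own peak

F = (1 + alpha) / (2 * Ts)    # occupied bandwidth from (13.7)
print(f"Occupied bandwidth: {F / 1e6:.3f} MHz")   # -> 0.675 MHz
```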

13.6 Coded Orthogonal Frequency Division Multiplexing

13.6.1 Introduction

Coded Orthogonal Frequency Division Multiplexing (COFDM) is a form of modulation which is particularly well-suited to the needs of terrestrial broadcasting channels [82, 91, 92, 93]. COFDM can cope with high levels of multipath propagation, with a wide spread of delays between the received signals. This leads to the concept of single-frequency networks, in which many transmitters send the same signal on the same frequency, generating "artificial multipath". COFDM also copes well with co-channel narrowband interference, as may be caused by the carriers of existing analogue services. COFDM has therefore been chosen for two broadcasting standards – DAB (Digital Audio Broadcasting) and DVB-T (Digital Video Broadcasting – Terrestrial), both of which have been optimized for their respective applications and have options to suit particular needs. The special performance of COFDM in respect of multipath and interference is only achieved by a careful choice of parameters and with attention to detail in the way in which the forward error-correction coding is applied.

The work on this system was initiated by CCETT (Centre commun d'études de télévision et

télécommunications – Centre for the Study of Television Broadcasting and Telecommunication) in

France and developed into a major new broadcasting standard by a collaborative project, Eureka 147.

COFDM involves modulating the data onto a large number of carriers using the FDM

(Frequency Division Multiplexing) technique. The key features which make it work, in a manner that

is so well suited to terrestrial channels, include:

orthogonality (the “O” of COFDM);

the addition of a guard interval;


the use of error coding (the “C” of COFDM), interleaving and channel-state information

(CSI).

13.6.2 Effects of multipath propagation

The main problem with reception of radio signals is fading caused by multipath propagation

[91, 92, 93]. Delayed signals are the result of reflections from trees, hills or mountains, or objects

such as people, vehicles or buildings and so on. The main characteristic of frequency selective fading

is that some frequencies are enhanced whereas others are attenuated. This is clearly illustrated in

Figure 13.14.

Fig. 13.14. Typical frequency response of a time-varying channel (example)

When the receiver and all the objects influencing the reflections remain stationary, then the

effective frequency response of the channel from the transmitter to the receiver will be basically fixed.

If the wanted signal is relatively narrowband and falls into part of the frequency band with significant

attenuation then there will be flat fading and reception will not be satisfactory. If there is some

movement either of the receiver or of any of the surroundings, then the relative lengths and

attenuations of the various reception paths will change with time.

A narrowband signal will vary in quality as the peaks and hollows of the frequency response

move around in frequency. There will also be a noticeable variation in phase response, which will

affect all systems using phase as a modulation parameter.

Now consider a wideband signal. Some parts of the signal may be enhanced in level, whereas

others may be attenuated, sometimes to the point of extinction. In general, frequency components

close together will suffer variations in signal strength which are well correlated. Others which are

further apart will be less well correlated. The correlation bandwidth is often used as a measure of this

phenomenon. Some researchers also use the term “coherence bandwidth”. It is evaluated from the

empirical equation [91, 92]

F_coh ≈ 1/(2πD), (13.8)

where D is the average delay time of the echoes. Results of statistical measurements show coherence bandwidths of about 0,25–1 MHz in the VHF band, subject to the correlation level chosen in the range 0,5–0,9.

For a narrowband signal, distortion is usually minimized if the bandwidth is less than the

coherence bandwidth of the channel. However, the signal will be subject to severe attenuation on

some occasions. A signal which occupies a wider bandwidth, greater than the coherence bandwidth,

will be subject to more distortion, but will suffer less variation in total received power.

If we look at the temporal response of the channel, we see a number of echoes present. There

are many different types of echo environment which are typical of different geographical areas. In

cities, echoes come from reflections from buildings with a large range of delays. In the countryside, the echoes usually have a smaller range of delay, especially if there are no nearby hills.

COFDM is a wideband modulation scheme which is specifically designed to cope with the problems of multipath reception. It achieves this by transmitting a large number of narrowband digital signals over a wide bandwidth.

13.6.3 Multiple carriers


In COFDM, the data are divided between a large number of closely-spaced carriers [82, 91, 92]. This explains the use of the words "frequency division multiplex" in the name COFDM. Only a small amount of the data is carried on each carrier, and this significantly reduces the influence of intersymbol interference.

For a given overall data rate, increasing the number of carriers reduces the data rate that each individual carrier must carry, and hence (for a given modulation system) lengthens the symbol period. This means that the intersymbol interference affects a smaller percentage of each symbol as the number of carriers, and hence the symbol period, increases.

Suppose we transmit a carrier with a particular phase and amplitude that depend on the transmitted symbol and the constellation used. Each symbol carries a number of bits of information. Imagine that this signal is received via two paths, with a delay between them, and let us examine the reception of the nth symbol as an example. The receiver will try to demodulate the data that were placed in this symbol by examining the directly received information and the delayed information relating to the nth symbol.

When the delay of the signal received via the second path is more than one symbol period, this signal acts purely as interference between two or more symbols, since it only carries information belonging to a previous symbol or symbols. This effect is called intersymbol interference (ISI). ISI distorts the signal and this distortion can degrade it. The ISI situation is illustrated in Figure 13.15, a. To receive undistorted information, only a very small time interval of the delayed signal may overlap the wanted signal. When the delay is less than one symbol period, part of the signal received via the second path still acts purely as interference, since it carries information belonging only to the previous symbol. The rest of it carries the information of the wanted symbol (see Figure 13.15, b). The signal representing this information may add to the signal of the main path, enhancing or attenuating it depending on the phase of the delayed signal. This is not intersymbol interference, but a form of linear distortion. Thus, it would be a great advantage if the delay were no greater than a small part of the symbol period. For this, we have to reduce the symbol rate, which is equivalent to extending the symbol period.

Fig. 13.15. Illustration of intersymbol interference formation; (a) – situation with large delay, (b) – situation with small delay, (c) – situation with small delay and prolonged symbol period

If one carrier cannot ensure the required data rate at such a long symbol period, a large number of carriers can be used. For this, the high-rate data have to be divided into many low-rate parallel streams, each of which is assigned to an individual carrier. This is a form of FDM, explaining the name COFDM. Decreasing the data rate per carrier naturally implies a prolongation of the symbol period. The ISI situation with a symbol period naturally prolonged due to the introduction of a large number of carriers is illustrated in Figure 13.15, c.

In principle, many modulation schemes could be used to modulate the data at a low bit rate onto each carrier. In Digital Audio Broadcasting (DAB), for example, differential quadrature phase shift keying (DQPSK) is used. Digital Video Broadcasting Terrestrial (DVB-T) uses various kinds of QAM and QPSK.

Even when the delay is less than one symbol period, if the signal processing period includes the overlapping symbols, as one can see in Figure 13.15, c, some ISI from the previous symbol remains and corrupts the received symbol. This could be eliminated if the period of each transmitted symbol were artificially made longer than the period over which the receiver processes (integrates) the signal. Such an artificial prolongation of the symbol period is realized by adding a guard interval. The usefulness of adding a guard interval can be much better explained after a discussion of the carriers' distribution along the frequency axis.

13.6.4 Orthogonality

In a normal FDM system, the many carriers are spaced apart in such a way that the signals can be received using conventional filters and demodulators. In such receivers, guard bands have to be introduced between the modulation sidebands of the different carriers, and this introduction of guard bands decreases the spectrum efficiency.

It is possible, however, to space the carriers in a COFDM signal so that the sidebands of the individual carriers overlap and the signals can still be received without adjacent carrier interference. In order to do this the carriers must be mathematically orthogonal [91, 92].

The receiver acts as a bank of demodulators, translating each carrier down to DC, the resulting signal then being integrated over a symbol period to recover the raw data. If all the other carriers translate down to frequencies which, in the time domain, have a whole number of cycles in the symbol integration period Tu, then the integration process results in zero contribution from all these other carriers. Thus the carriers are linearly independent (i.e. orthogonal) if the carrier spacing is a multiple of 1/Tu.

Mathematically, suppose we have a set of carriers {ψk(t)}, where ψk(t) is the complex envelope of the kth carrier in baseband representation. It can be written as

ψk(t) = e^(j2πk·fu·t), fu = 1/Tu. (13.9)

The carriers will be orthogonal if [35–38]

∫_Tu ψk(t)·ψl*(t) dt = 0 for k ≠ l,
∫_Tu ψk(t)·ψl*(t) dt = Tu for k = l, (13.10)

where the sign * indicates the complex conjugate.

A detailed analysis of the above orthogonality condition discloses one very interesting and important feature of orthogonal carriers: the signal integration period Tu always contains a whole number of periods of every carrier. Moreover, the numbers of periods of neighboring carriers differ by exactly one, and those of non-neighboring carriers by some whole number. Therefore a radio pulse containing a whole number of periods is formed during each signal integration period Tu. The spectrum of such a pulse has the sin(x)/x shape. The spectra of the many radio pulses composing the COFDM signal spectrum are illustrated in Figure 13.16.

Fig. 13.16. Illustration of the COFDM signal spectrum; N carriers spaced by fu = 1/Tu, total bandwidth F = fu·(N − 1)

It is clearly seen that the spectra of all carriers fully overlap. It is impossible to select and separate them using common filtering methods, so the classical demodulation methods used in common FDM systems are also unacceptable. Correlation processing is needed.

The essence of correlation processing is seen from the carriers' orthogonality condition (13.10). These equations actually represent the common procedure of demodulating a carrier by multiplying it by another carrier of the same frequency and then integrating the result; exactly these mathematical operations compose the essence of the correlation method. The result of the multiplication is the carrier being demodulated, translated down to zero frequency, i.e., DC. The integration then mathematically results in a constant value, which depends on the duration of the integration period, and physically results in some voltage level. Multiplication by a carrier of a different frequency results in translation down to the differential frequency of the two multiplied carriers. The differential frequencies of the carriers are integer multiples of the frequency fu. Therefore, after multiplication, the components at the differential frequencies have an integer number of cycles during the integration period Tu, and thus integrate to zero. In other words there is no interference contribution.

Hence, without any filtering, it is possible to demodulate all the carriers separately, taking advantage of the particular choice of carrier spacing. Furthermore, no guard bands are needed. The carriers are so closely spaced that in total they occupy the same spectrum as would a single carrier modulated with all the data.
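A small numerical check of the orthogonality condition (13.10); the carrier indices and the number of samples are arbitrary example choices, and the integral is approximated by a sum over one integration period:

```python
# Numerically verify the orthogonality condition (13.10): the product of
# two carriers integrates to 0 over Tu unless k == l.
import numpy as np

Tu = 1.0                      # integration period (normalized)
fu = 1 / Tu                   # carrier spacing
Ns = 4096                     # samples per integration period
t = np.arange(Ns) / Ns * Tu

def carrier(k):
    """Complex envelope of the k-th carrier, eq. (13.9)."""
    return np.exp(2j * np.pi * k * fu * t)

for k, l in [(3, 3), (3, 4), (3, 7)]:
    integral = np.sum(carrier(k) * np.conj(carrier(l))) * (Tu / Ns)
    print(f"k={k}, l={l}: integral =", np.round(integral, 6))
# -> Tu (= 1) for k == l and 0 for k != l, exactly as (13.10) requires.
```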

13.6.5 Guard interval

If the integration period covers two symbols (see Figure 13.15), both same-carrier interference and inter-carrier interference (ICI) come into play, causing linear distortions and ISI respectively. ISI is formed because the components at the differential frequencies from other carriers no longer integrate to zero, as they change in phase and/or amplitude during the integration period; the orthogonality condition is violated. These changes occur because the delayed symbol starts later than the main symbol, so at least a partial symbol transition takes place within the integration period. To avoid this, the artificial addition of a guard interval has been proposed [91, 92]. The guard interval ensures that during the integration period all the information comes from the same symbol and appears constant during it.

The addition of a guard interval is illustrated in Figure 13.17. The segment added at the beginning of the symbol, forming the guard interval, is identical to the segment at the end of the symbol. As long as the delay is less than the guard interval, all the signal components within the integration period come from the same symbol and the orthogonality criterion is satisfied. ICI and ISI will only occur when the delay exceeds the guard interval.

Fig. 13.17. Illustration of the formation of a guard interval

The guard interval length must be larger than the delay time, but at the same time it must be as short as possible, because it uselessly reduces the data rate. In the case of very long delays the guard interval has to be chosen large. This is possible by increasing the signal integration period Tu, implying a large number of carriers – from hundreds to thousands.

13.6.6 Generation and demodulation of COFDM signal

As we have seen, a large number of filters is avoided thanks to the orthogonality of the carriers. However, the questions remain:

How to generate thousands of carriers?
How to implement the demodulation of the carriers while avoiding thousands of multipliers and integrators?

In practice, we work with the digitized signal sampled according to the Nyquist theorem. The process of integration in digital form then becomes one of summation, and the whole demodulation process takes on a form which is identical to the Discrete Fourier Transform (DFT). In turn, the process of generation of COFDM signals from the input data takes on a form which is identical to the Inverse Discrete Fourier Transform (IDFT). Fortunately, computationally efficient Fast Fourier Transform (FFT) implementations of the DFT and Inverse Fast Fourier Transform (IFFT) implementations of the IDFT exist, and a variety of integrated circuits is available on the market. So it is relatively easy to build COFDM equipment wherein the IFFT is used in the transmitter to generate the COFDM signal and the FFT is used in the receiver to demodulate it. Common versions of the FFT and IFFT operate on a group of 2^N samples. The FFT operates on the samples taken in the integration period and delivers the same number of frequency coefficients, which correspond to the data demodulated from the many carriers. The IFFT operates on the data samples and delivers the same number of COFDM signal samples.

At the transmitter, the signal is defined in the frequency domain. It is a sampled digital signal,

and it is defined such that the discrete Fourier spectrum exists only at discrete frequencies. Each

COFDM carrier corresponds to one element of this discrete Fourier spectrum. The amplitudes and


phases of the carriers depend on the data to be transmitted and are defined for each transmitted

symbol. All the carriers have their data transitions synchronized, and can be processed together,

symbol by symbol. A schematic diagram of the transmitter signal processing is shown in

Figure 13.18, a and of receiver signal processing in the Figure 13.18, b.

Fig. 13.18. Signal processing diagrams; (a) – at the transmitter, (b) – at the receiver
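A compact sketch of this IFFT/FFT processing chain, with toy dimensions of our own choosing: 8 carriers carrying QPSK data, a guard interval of 2 samples implemented as a cyclic prefix, and a single-echo channel shorter than the guard interval:

```python
# Toy OFDM transmitter/receiver: IFFT at the sender, FFT at the receiver,
# with a cyclic-prefix guard interval absorbing a short channel echo.
import numpy as np

N, G = 8, 2                               # carriers, guard-interval samples
rng = np.random.default_rng(0)

# QPSK data: one complex constellation point per carrier (frequency domain).
data = (rng.choice([-1, 1], N) + 1j * rng.choice([-1, 1], N)) / np.sqrt(2)

tx = np.fft.ifft(data)                    # one OFDM symbol in the time domain
tx_gi = np.concatenate([tx[-G:], tx])     # guard interval = copy of the tail

# Two-path channel: direct path plus an echo delayed by 1 sample (< G).
rx = tx_gi + 0.5 * np.roll(tx_gi, 1)

rx_sym = rx[G:]                           # discard the guard interval
demod = np.fft.fft(rx_sym)                # back to the frequency domain

# Within the integration window the echo only scales/rotates each carrier;
# one complex tap per carrier (from the pilots, in a real system) undoes it.
H = np.fft.fft(np.array([1.0, 0.5] + [0.0] * (N - 2)))
print(np.allclose(demod / H, data))       # -> True
```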

13.6.7 Coding

Forward error-correction coding (FEC) is used in almost any practical digital communication system. FEC ensures the delivery of an acceptable bit-error ratio (BER) at a reasonably low signal-to-noise ratio (SNR). At a high SNR, FEC is not necessary – and this is also true for uncoded OFDM, but only when the channel is relatively flat. Uncoded OFDM does not perform very well in a selective channel [91, 92].

Very simple examples illustrate this point. If there is an echo as strong as the main path signal, delayed such that every mth carrier is completely extinguished, then the symbol error rate (SER) will be equal to 1/m – even at infinite SNR. (Here, a symbol denotes the group of bits carried by one carrier within one OFDM symbol.) An echo delay of Tu/4, which corresponds to the maximum guard interval used in COFDM systems, would cause the SER to be 1/4. Similarly, if there is one carrier, amongst N carriers in all, which is badly affected by interference, then the SER will be of the order of 1/N, even at infinite SNR.

This enables us to formulate two conclusions:

uncoded OFDM is not satisfactory for practical use in the case of extremely selective channels;
the extinction of one carrier due to interference is a lesser problem than the presence of a very strong echo.

The solution is to use convolutional coding in conjunction with Viterbi decoding, properly integrated with the OFDM system. The necessity of implementing coding thus justifies the use of the term "coded" in the COFDM title.


13.6.8 Inserting pilot cells

Each communication channel influences the transmitted signal by unevenly attenuating and differently phase-shifting its different frequency components. This causes a corresponding shift of the transmitted signal constellation points from their original positions. Therefore, before determining which constellation point was transmitted, and hence what bits were transmitted, the receiver must somehow determine the response of the channel for each carrier. To do this in DVB-T [82], some pilot information is transmitted (so-called scattered pilots), so that in some symbols on some carriers known information (a known sequence of bits) is transmitted (see Figure 13.19). Such sequences are called learning or training sequences and preambles. Comparing the received signal carrying this information with the known original version, the receiver is able to determine the channel frequency response and to eliminate its influence, i.e. to equalize all the constellations which carry data. For this an equalizing filter is used, which may need to be an adaptive filter adjusting itself to minimize the distortions caused by the channel frequency response.

Fig. 13.19. Illustration of the insertion of pilot cells as used in DVB-T

13.6.9 Interleaving

If the relative delay of the echo is very short, then the notches in the channel's frequency response will be broad, affecting many adjacent carriers. At the receiver this would cause the Viterbi decoder to be fed with clusters of unreliable bits, which would cause a serious loss of performance. To avoid this, the coded data are interleaved before being assigned to the OFDM carriers at the modulator, and a corresponding de-interleaver is used at the receiver before decoding. In this way, the cluster of errors occurring when adjacent carriers break down simultaneously is broken up, enabling the Viterbi decoder to perform better. This interleaving process is called frequency interleaving. However, it works only if the channel characteristics vary slowly with time; the DVB-T system channels satisfy this condition.

13.6.10 Choice of COFDM parameters

13.6.10.1 Bandwidth

The bandwidth must be much wider than the coherence bandwidth. Only in this case can we expect that not all carriers will be affected by frequency selective fading. The biggest problem here are short relative delays of the echoes; they always exist and it is impossible to avoid them.

Moreover, the required bit rate directly influences the bandwidth. For example, in the case of 64-QAM modulation the spectrum efficiency is 6 bit/s/Hz. However, considering the necessity to insert the guard interval, to add additional FEC bits, and to transmit service information, the effective spectrum efficiency decreases to approximately 3–4 bit/s/Hz. Meanwhile, the MPEG-4 coder transport stream amounts to 2,5–3,5 Mbit/s. It follows that approximately 0,75–1,2 MHz of bandwidth is needed for the transmission of one TV program. For example, the chosen total bandwidth of one TV channel in Lithuania is 7,607 MHz. This means that it is possible to transmit up to 10 programs over one physical channel in Lithuania.


13.6.10.2 Number of carriers

More carriers ensure a greater resolution of the frequency diversity offered by the system. There is, however, a relationship between the symbol period and the carrier spacing: the carrier spacing is fu = 1/Tu. As was discussed above, the multipath environment must change slowly from symbol to symbol. Thus there is a limit to the symbol period and hence to the number of carriers. For static reception this is not a major problem. For mobile reception, however, the motion of a receiver located in a car leads to changes in the multipath environment. Over a symbol period, a car moving at a velocity v m/s will travel v·Tu·f/c wavelengths, where f is the carrier's frequency and c is the speed of light. This is equal to fD·Tu wavelengths, where fD is the maximum Doppler shift. The motion of the car introduces phase distortion of the carrier. For this phase distortion to be negligible, the product fD·Tu must be small. It was found that a suitable value of this product is

fD·Tu = v·Tu·f/c ≤ 0,02. (13.11)

In the common case the maximum velocity of a moving receiver (car) is estimated to be 160 km/h. Therefore, as follows from the last equation, it may be necessary to use a different number of carriers in different parts of the VHF (30–300 MHz) and UHF (300–3000 MHz) frequency bands.
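As a quick worked example of constraint (13.11) (the carrier frequencies are chosen for illustration only):

```python
# Maximum usable symbol period Tu from the Doppler constraint (13.11):
# fD * Tu = v * Tu * f / c <= 0.02, hence Tu <= 0.02 * c / (v * f).
c = 3.0e8                    # speed of light, m/s
v = 160 / 3.6                # 160 km/h expressed in m/s

for f in (200e6, 600e6):     # example VHF and UHF carrier frequencies
    Tu_max = 0.02 * c / (v * f)
    print(f"f = {f / 1e6:.0f} MHz: Tu <= {Tu_max * 1e6:.0f} us")
# Lower carrier frequencies tolerate longer symbol periods and
# therefore allow a larger number of carriers.
```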

13.6.10.3 Guard interval

If there were only single transmitters, the significant echoes would all be relatively short, approximately 10 µs for the majority of locations. In the case of several transmitters, transmissions from distant areas can reach quite high signal levels on occasions of irregular propagation, so there is a strong possibility that interference from remote transmitters may cause a problem. This can be avoided if there is a sufficiently strong signal from the local transmitter; for this, the distance between transmitters and their power have to be correctly chosen. It has been shown that the cheapest solution to this problem is to space relatively high power transmitters by about 50 km. The maximum guard interval is therefore set to about 250 µs, which corresponds to a maximum difference of about 80 km in transmission distance. The symbol period needs to be suitably greater than the guard interval, because the power transmitted in the guard interval is not useful. To minimize this power loss, it is desirable to keep the guard interval to as low a percentage of the symbol period as possible. In practice a maximum guard interval of the order of 25% of the symbol period has been chosen. This leads to a symbol period of 1 ms and hence, as follows from the equation fu = 1/Tu, to a carrier spacing of about 1 kHz. Practically this allows avoiding the influence of echoes delayed by up to about 200 µs.

13.6.10.4 Main COFDM parameters of Lithuanian DVB-T system

The total bandwidth F = 7,607 MHz;
The number of carriers K = 6817 (short notation – 8K);
The symbol period Tu = 896 µs;
The carrier spacing fu = 1116 Hz;
Allowable guard interval durations Δ = 224, 112, 56, 28 µs; allowable relative durations of the guard interval Δ/Tu = 1/4, 1/8, 1/16, 1/32;
Allowable total symbol durations TS = Tu + Δ = 1120, 1008, 952, 924 µs.


14. CHARACTERISTICS of DVB STANDARDS

14.1 DVB-T system

Functional block diagram of the DVB-T system is presented in Figure 14.1. The system is

defined as the functional block of equipment performing the adaptation of the baseband TV signals

from the output of the MPEG-2 transport multiplexer, to the terrestrial channel characteristics. The

following processes must be applied to the data stream:

transport stream adaptation and randomization for energy dispersal;

outer coding (i.e. Reed-Solomon coding);

outer interleaving (i.e. convolutional interleaving);

inner coding (i.e. punctured convolutional coding);

inner interleaving;

mapping and modulation;

Orthogonal Frequency Division Multiplexing (OFDM) transmission.

Fig. 14.1. Functional block diagram of the DVB-T system

The DVB-T standard has been designed to be compatible with all existing TV systems in the world, with channel widths of 6, 7 or 8 MHz. It is mainly used with channel widths of 7 MHz in Europe (VHF band) and Australia, and 8 MHz in Europe (UHF band). It has also been designed in order

to be able to coexist with analog TV transmissions. Therefore the means for good protection against

adjacent channel interference (ACI) and Co-Channel Interference (CCI) (interference from within the

channel itself) have been provided. It is also a requirement that the system allows the maximum

spectrum efficiency when used within the VHF and UHF bands. This requirement can be achieved

by utilizing Single Frequency Network (SFN) operation.

To achieve these requirements an OFDM system with concatenated error correcting coding is

being specified. To maximize commonality with the DVB-S specification [89] and DVB-C

specification [88] the outer coding and outer interleaving are common, and the inner coding is

common with the DVB-S specification. To allow optimal compromise between network topology

and frequency efficiency, a flexible guard interval is specified. This enables the system to operate


efficiently under different network configurations, such as large area SFN and single transmitter, while keeping maximum frequency efficiency.

Two modes of operation, a "2K mode" (1705 carriers in total, among them 1512 active) and an "8K mode" (6817 carriers in total, among them 6048 active), are defined for 7 and 8 MHz channels for DVB-T and DVB-H transmissions. The "2K mode" is suitable for single transmitter operation and for small SFN networks with limited transmitter distances. The "8K mode" can be used both for single transmitter operation and for small and large SFN networks.

The system allows different levels of QAM modulation and different inner code rates to be used

to trade bit rate versus ruggedness. The system also allows two-level hierarchical channel coding and modulation, including uniform and multi-resolution constellations. For this the splitter separates the

incoming transport stream into two independent MPEG transport streams referred to as the high-

priority and the low-priority stream. In this case the functional block diagram of the system must be

expanded to include the modules shown dashed in Figure 14.1. Two independent MPEG transport

streams are mapped onto the signal constellation by the Mapper and the Modulator which therefore

has a corresponding number of inputs.

14.1.1 Splitter

The splitter separates the incoming transport stream into two independent MPEG transport

streams referred to as the high-priority (HP) and the low-priority (LP) stream. Two streams are

mapped onto the signal constellation by the Mapper and Modulator which therefore has a

corresponding number of inputs. The only additional requirement placed on the receiver is the ability

for the demodulator/de-mapper to produce one stream selected from those mapped at the sending end.

It may be used to transmit, for example, a standard-definition (SDTV) signal and a high-definition (HDTV) signal on the same carrier. Generally, the SDTV signal is more robust than the HDTV one. At the receiver, depending on the quality of the received signal, the receiver may be able to decode the HDTV stream or, if the signal strength is insufficient, switch to the SDTV one. In this way, all receivers that are close to the transmission site can lock onto the HDTV signal, whereas all the other ones, even the farthest, may still be able to receive and decode an SDTV signal.

14.1.2 Transport stream adaptation and randomization

The MPEG-2 transport stream is organized in fixed-length packets as shown in Figure 14.2. The total

packet length is 188 bytes. This includes 1 sync-word byte (i.e. 01000111) and 187 data bytes. The

processing order at the transmitting side must always start from the Most Significant Byte (MSB)

(i.e. "0") of the sync-word byte.

[Figure: packet layout – SYNC byte (1 byte) followed by MPEG-2 transport stream data (187 bytes).]

Fig. 14.2. MPEG-2 transport stream packet

In order to ensure adequate binary transitions, the data of the system input must be randomized

using the scrambler/descrambler whose schematic diagram is depicted in Figure 14.3.

189

[Figure: 15-stage shift register with feedback taps at stages 14 and 15 (generator polynomial 1 + X^14 + X^15), initialized with the sequence 100101010000000; the PRBS output, whose first bits are 00000011..., is gated by an enable signal through an AND gate and XORed with the data (MSB first) for randomization/de-randomization.]

Fig. 14.3. Scrambler/descrambler schematic diagram

Loading of the initial sequence "100101010000000" into the Pseudo Random Binary Sequence

(PRBS) – Pseudo Noise (PN) registers, as indicated in Figure 14.3, must be initiated at the start of

every eight transport packets. To provide an initialization signal for the descrambler, the MPEG-2

sync byte of the first transport packet in a group of eight packets is bit-wise inverted from 01000111

(SYNC) to 10111000 (inverted SYNC; SYNC1 in Figure 14.4). This process is referred to as "transport stream adaptation" (see

Figure 14.4). The first bit at the output of the PRBS generator will be applied to the first bit (i.e.,

MSB) of the first byte following the inverted MPEG-2 sync byte (i.e., 10111000). To aid other

synchronization functions, during the MPEG-2 sync bytes of the subsequent 7 transport packets, the

PRBS generation will continue, but its output will be disabled, leaving these bytes unrandomized.

Thus, the period of the PRBS sequence will be 1503 bytes.
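As an illustration, the randomizer is easy to model in a few lines of Python. The following fragment is our own minimal sketch (function name and structure are not part of the standard); it implements the shift register of Figure 14.3 and reproduces its first output byte 00000011:

    def dvb_prbs(num_bits):
        # Energy-dispersal PRBS, polynomial 1 + X^14 + X^15,
        # register initialized with 100101010000000 (see Figure 14.3).
        reg = [1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
        out = []
        for _ in range(num_bits):
            fb = reg[13] ^ reg[14]   # XOR of stages 14 and 15
            out.append(fb)           # PRBS output bit
            reg = [fb] + reg[:-1]    # shift by one stage
        return out

    print(dvb_prbs(8))  # [0, 0, 0, 0, 0, 0, 1, 1] - the byte 00000011

Randomization is then the bit-wise XOR of this sequence with the data bytes, with the output disabled during the sync bytes as described above.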

[Figure: sequence of eight packets – SYNC1 + 187 randomized bytes, SYNC2 + 187 randomized bytes, ..., SYNC8 + 187 randomized bytes, then SYNC1 again; the pseudo-random binary sequence period spans 1503 bytes.]

Fig. 14.4. Randomized transport stream packets: sync bytes and randomized data bytes; SYNC1 – non-randomized complemented sync byte; SYNCn – non-randomized sync byte

14.1.3 Outer coding

In order to achieve the appropriate level of error protection required for transmission, FEC based on Reed-Solomon coding shall be used. The system FEC is designed to improve the BER from 10⁻⁴ to 10⁻¹⁰ or 10⁻¹¹, ensuring quasi error free operation.

The outer coding and interleaving shall be performed on the input packet structure (see

Figure 14.4). Reed-Solomon RS (204,188,T=8) shortened code, derived from the original systematic

RS (255,239,T=8) code, shall be applied to each randomized transport packet (188 bytes) of Figure

14.4 to generate an error protected packet (see Figure 14.5). Reed-Solomon coding shall also be

applied to the packet sync byte, either non-inverted or inverted.

Reed-Solomon coding RS(204,188,T=8) adds 16 parity bytes to each 188-byte transport packet, which thus becomes 204 bytes long. It can correct up to 8 erroneous bytes in a received word of 204 bytes. The overhead is low (16/188 ≈ 8 %).

[Figure: error-protected packet layout – SYNC1 or SYNCn byte, 187 bytes of MPEG-2 transport stream data, 16 parity bytes (RS(204,188,8)); 204 bytes in total.]

Fig. 14.5. Reed-Solomon RS(204,188,8) error protected packet

The shortened Reed-Solomon code may be implemented by adding 51 bytes, all set to zero,

before the information bytes at the input of an RS (255,239,T=8) encoder. After the RS coding

procedure these null bytes shall be discarded, leading to an RS code word of N = 204 bytes.


14.1.4 Outer interleaving

A noisy channel may corrupt more than 8 bytes per packet, exceeding the correction capability of the Reed-Solomon code. The outer interleaver mitigates this: it can break down lengthy bursts of errors reaching the outer Reed-Solomon decoder in the receiver. Thus, the outer interleaver increases the efficiency of Reed-Solomon coding by spreading bursts of errors over a longer time.

Following the conceptual scheme of Figure 14.1, convolutional byte-wise interleaving with

depth l=12 is applied to the error protected packets (see Figure 14.5).

The interleaver (see Figure 14.6 and Figure 12.2) may be composed of l=12 branches (lines),

cyclically connected to the input byte-stream by the input switch. Each branch i shall be a First-In,

First-Out (FIFO) shift register, with depth i×s cells where s=17=N/l, N=204. The cells of the FIFO

shall contain 1 byte, and the input and output switches shall be synchronized. For synchronization

purposes, the SYNC and inverted SYNC bytes shall always be routed in branch "0" of the

interleaver (corresponding to a null delay).

The deinterleaver (see Figure 14.6 and Figure 12.2) is similar in principle to the interleaver, but the branch indices are reversed (i.e. i=0 corresponds to the largest delay). The deinterleaver synchronization can be carried out by routing the first recognized sync byte (SYNC or inverted SYNC) in the

"0" branch.

[Figure: outer interleaver of 12 branches (lines 0–11); branch i is a FIFO shift register of 17×i one-byte cells (0, 17, 17×2, ..., 17×11 storage elements); the deinterleaver mirrors this with 17×(11−i) cells in branch i; a channel with modulator and demodulator sits between interleaver and deinterleaver.]

Fig. 14.6. Conceptual diagram of the outer interleaver and deinterleaver; interleaving depth l=12; the sync byte always passes through the null-delay branch

The interleaved data bytes shall be composed of error protected packets and shall be delimited

by inverted or non-inverted MPEG-2 sync bytes (preserving the periodicity of 204 bytes). The

interleaved data structure is shown in Figure 14.7.

[Figure: interleaved stream – repeating pattern of a SYNC1 or SYNCn byte followed by 203 data bytes.]

Fig. 14.7. Data structure after outer interleaving

The sequence in which the bytes are transmitted over the channel after interleaving differs from the sequence at the input. It can be verified that if l = 12 and s = 17, there will be l×s = 204 bytes interposed between two bytes that were initially adjacent to one another (see Figure 14.8).

[Figure: example sequences of byte numbers at the interleaver input, in the channel, and at the deinterleaver output for the 12 lines, illustrating that bytes adjacent at the input are separated by 204 positions in the channel.]

Fig. 14.8. The sequences of bytes at the different points of interleaver / deinterleaver
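The interleaver of Figure 14.6 can be modelled directly. The following Python fragment is our own illustrative sketch (names are ours, not from the standard); it builds the branch FIFOs and checks that the interleaver–deinterleaver cascade restores byte order after the fixed overall delay of l×(l−1)×s = 2244 byte positions:

    from collections import deque

    def make_branches(l=12, s=17, deinter=False):
        # Branch i of the interleaver delays by i*s cells;
        # the deinterleaver mirrors this with (l-1-i)*s cells.
        return [deque([0] * ((l - 1 - i if deinter else i) * s))
                for i in range(l)]

    def run(branches, data, l=12):
        # Route bytes cyclically through the branch FIFOs.
        out = []
        for k, byte in enumerate(data):
            fifo = branches[k % l]
            if not fifo:                 # null-delay branch
                out.append(byte)
            else:
                fifo.append(byte)
                out.append(fifo.popleft())
        return out

    data = list(range(6120))             # 30 packets of 204 bytes
    restored = run(make_branches(deinter=True), run(make_branches(), data))
    delay = 12 * 11 * 17                 # 2244 byte positions
    assert restored[delay:] == data[:len(data) - delay]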

14.1.5 Inner coding

For inner coding a range of punctured convolutional codes, based on a mother convolutional code of rate 1/2 with 64 states, is used. The convolutional code is an efficient complement to RS coding and interleaving, as it corrects other kinds of errors: inner coding corrects random errors and is used in the satellite and terrestrial systems only (DVB-S/T). In addition to the mother code of rate 1/2 the system must allow punctured rates of 2/3, 3/4, 5/6 and 7/8. This allows selection of the most appropriate level of error correction for a given service or data rate in either non-hierarchical or hierarchical transmission mode. If two-level hierarchical transmission is used, each of the two parallel channel encoders can have its own code rate.

However, with code rate 1/2 the useful data capacity is reduced by 50%! Alternative code rates allow a compromise between protection and data capacity: in the worst channel conditions code rate 1/2 should be used, in the best conditions code rate 7/8.

The punctured convolutional code encoder, used in the system, is shown in Figure 14.9.

[Figure: convolutional encoder – the data input feeds a chain of six 1-bit delay elements; two XOR networks form the outputs X (generator 171 octal) and Y (generator 133 octal) from the input and selected taps; a code-rate-controlled multiplexer serializes and punctures X, Y.]

Fig. 14.9. Punctured convolutional code encoder for inner coding

The operation of the encoder is explained in table 14.1. In this table X and Y refer to the two

outputs of the convolutional encoder.


Table 14.1 Puncturing pattern and transmitted sequence after parallel-to-serial conversion for the possible code rates

Code rate R | Unpunctured data from convolutional encoder | Punctured data (transmitted sequence)
1/2 | X: X1; Y: Y1 | X1 Y1
2/3 | X: X1 X2; Y: Y1 Y2 | X1 Y1 Y2
3/4 | X: X1 X2 X3; Y: Y1 Y2 Y3 | X1 Y1 Y2 X3
5/6 | X: X1 X2 X3 X4 X5; Y: Y1 Y2 Y3 Y4 Y5 | X1 Y1 Y2 X3 Y4 X5
7/8 | X: X1 X2 X3 X4 X5 X6 X7; Y: Y1 Y2 Y3 Y4 Y5 Y6 Y7 | X1 Y1 Y2 Y3 Y4 X5 Y6 X7
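The puncturing of Table 14.1 amounts to deleting bits of the X and Y streams according to a periodic pattern and serializing the survivors. A small sketch (patterns transcribed from the table; the naming is ours):

    PUNCTURE = {               # '1' = transmit, '0' = delete; (X row, Y row)
        (1, 2): ("1", "1"),
        (2, 3): ("10", "11"),
        (3, 4): ("101", "110"),
        (5, 6): ("10101", "11010"),
        (7, 8): ("1000101", "1111010"),
    }

    def puncture(x_bits, y_bits, rate):
        px, py = PUNCTURE[rate]
        period, out = len(px), []
        for k in range(len(x_bits)):
            if px[k % period] == "1":
                out.append(x_bits[k])
            if py[k % period] == "1":
                out.append(y_bits[k])
        return out

    # Rate 3/4: X1 X2 X3 / Y1 Y2 Y3 -> X1 Y1 Y2 X3, as in Table 14.1.
    print(puncture(["X1", "X2", "X3"], ["Y1", "Y2", "Y3"], (3, 4)))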

14.1.6 Inner interleaving

This section specifies the native inner interleaving processes to be used for 2K and 8K

transmission modes. In order to adapt the bit stream to the OFDM modulation and in order to further increase the robustness of the system (to reduce the influence of burst errors and to cope with the effect of frequency-selective channels), after channel coding the data follow a complex process of inner interleaving, i.e., the data sequence is rearranged again. The inner interleaver consists of a demultiplexer, bit interleavers and a symbol interleaver, joined as shown in Figure 14.10. Both the bit-wise

interleaving and the symbol interleaving processes are block-based. The inner interleaver is specified

in such a way that it provides optimum performance at a given complexity and memory size.

[Figure: the inner coder for the HP stream and the inner coder for the LP stream each feed a DEMUX; the demultiplexed sub-streams pass through the bit interleavers and a common symbol interleaver to the mapper.]

Fig. 14.10. Inner coding and interleaving

The schemes and operation of the bit interleavers differ depending on the system operation mode (hierarchical or non-hierarchical, and 2K, 4K or 8K) and the modulation used (QPSK, 16-QAM, or 64-QAM). All these differences are described in detail in [82]. As an example, Figure 14.11 presents two schemes of inner interleavers for non-hierarchical and hierarchical 16-QAM transmission modes. The schemes of other inner interleavers can be found in [82].

[Figure: (a) a single input stream x_0, x_1, x_2, ... is demultiplexed into four sub-streams b_{0,n}, ..., b_{3,n}, passed through bit interleavers I0–I3 to give a_{0,n}, ..., a_{3,n}, grouped by the symbol interleaver into words y_0, y_1, y_2, ... and mapped, with the even-indexed bits conveying I and the odd-indexed bits conveying Q; (b) the same structure for hierarchical transmission, with the HP stream demultiplexed into sub-streams 0–1 and the LP stream into sub-streams 2–3.]

Fig. 14.11. Mapping of input bits onto output modulation symbols for 16-QAM system; (a) – for non-hierarchical transmission modes, (b) – for hierarchical transmission modes

14.1.7 Bit-wise interleaving

The first block is the demultiplexer. The demultiplexing is defined as a mapping of the input bits x_n onto the output bits b_{m,n}. The mapping rules for 16-QAM modulation are presented in table 14.2.

Table 14.2 Demultiplexer's mapping rules for 16-QAM modulation

16-QAM non-hierarchical transmission: x_0 maps to b_{0,0}; x_1 maps to b_{2,0}; x_2 maps to b_{1,0}; x_3 maps to b_{3,0}.

16-QAM hierarchical transmission: x_0 of the HP stream maps to b_{0,0}; x_1 of the HP stream maps to b_{1,0}; x_0 of the LP stream maps to b_{2,0}; x_1 of the LP stream maps to b_{3,0}.

The mapping rules for other modulations and transmission modes are presented in [82].

In the common case, the input to the inner interleaver, which consists of up to two bit streams

(HP stream and LP stream), is demultiplexed into v sub-streams, where v=2 for QPSK, v=4 for

16QAM, and v=6 for 64QAM. In non-hierarchical mode, the single input stream is demultiplexed

into v sub-streams. In hierarchical mode the high priority stream is demultiplexed into two sub-

streams and the low priority stream is demultiplexed into v-2 sub-streams. This applies in both

uniform and non-uniform QAM modes.

Each sub-stream from the demultiplexer is processed by a separate bit interleaver. There are

therefore up to six interleavers depending on v, labelled I0 to I5. I0 and I1 are used for QPSK, I0 to

I3 for 16-QAM and I0 to I5 for 64-QAM.

Bit interleaving is performed only on the useful data. The block size is the same for each

interleaver, but the interleaving sequence is different in each case. The bit interleaving block size is

126 bits. The block interleaving process is therefore repeated exactly 12 times per OFDM symbol of

useful data in the 2K mode and 48 times per symbol in the 8K mode.

For each bit interleaver, the input bit vector is defined by:

B(m) = (b_{m,0}, b_{m,1}, b_{m,2}, ..., b_{m,125}),

where m ranges from 0 to v−1. The elements of the interleaved output vector

A(m) = (a_{m,0}, a_{m,1}, a_{m,2}, ..., a_{m,125})

are defined by

a_{m,n} = b_{m,H_m(n)}, n = 0, 1, 2, ..., 125,

where H_m(n) is a permutation function, which is different for each interleaver (see table 14.3).


Table 14.3 Permutation functions of bit-wise interleavers

Interleaver | Permutation function
I0 | H_0(n) = n
I1 | H_1(n) = (n + 63) mod 126
I2 | H_2(n) = (n + 105) mod 126
I3 | H_3(n) = (n + 42) mod 126
I4 | H_4(n) = (n + 21) mod 126
I5 | H_5(n) = (n + 84) mod 126
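Since each permutation is a cyclic offset within the 126-bit block, the bit interleavers reduce to one line each. A minimal sketch (the naming is ours):

    OFFSETS = [0, 63, 105, 42, 21, 84]          # I0 .. I5 (Table 14.3)

    def bit_interleave(block, m):
        # a[n] = b[H_m(n)] with H_m(n) = (n + offset_m) mod 126
        assert len(block) == 126
        return [block[(n + OFFSETS[m]) % 126] for n in range(126)]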

The outputs of the v bit interleavers are grouped to form the digital data symbols, such that each symbol of v bits consists of exactly one bit from each of the v interleavers. Hence, the output from the bit-wise interleaver is a v-bit word C(n) that has the output of I0 as its most significant bit, i.e.

C(n) = (a_{0,n}, a_{1,n}, a_{2,n}, ..., a_{v−1,n}).

14.1.8 Symbol interleaving

The purpose of the symbol interleaver is to map v bit words onto the 1512 (2K mode) or 6048

(8K mode) active carriers per OFDM symbol. The symbol interleaver acts on blocks of 1512 (2K

mode) or 6048 (8K mode) data symbols.

Thus in the 2K mode, 12 groups of 126 data words (12×126 = 1512 words) from the bit interleaver are read sequentially into an auxiliary vector Y′ = (y′_0, y′_1, y′_2, ..., y′_1511). Similarly in the 8K mode, a vector Y′ = (y′_0, y′_1, y′_2, ..., y′_6047) is assembled from 48 groups of 126 data words (48×126 = 6048 words). The elements of the interleaved vector at the symbol interleaver output

Y = (y_0, y_1, y_2, ..., y_{Nmax−1})

are defined by:

y_{H(q)} = y′_q for even symbols, q = 0, 1, ..., Nmax−1;

y_q = y′_{H(q)} for odd symbols, q = 0, 1, ..., Nmax−1,

where Nmax = 1512 in the 2K mode and Nmax = 6048 in the 8K mode. H(q) is a permutation function; its rather complex definition can be found in [82].
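Note that the even-symbol rule (write permuted, read linearly) and the odd-symbol rule (write linearly, read permuted) are mutual inverses. The following sketch illustrates the rule with an arbitrary stand-in permutation, since the real H(q) is defined in [82]:

    def symbol_interleave(words, H, even_symbol):
        # words: 1512 or 6048 v-bit words; H: permutation as a list.
        out = [None] * len(words)
        for q in range(len(words)):
            if even_symbol:
                out[H[q]] = words[q]    # y[H(q)] = y'[q]
            else:
                out[q] = words[H[q]]    # y[q]   = y'[H(q)]
        return out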

The values of Y are used to map the data into the signal constellation, as described in the next

paragraph.

14.1.9 Signal constellations and mapping

The DVB-T system uses Orthogonal Frequency Division Multiplex (OFDM) transmission. All

data carriers in one OFDM frame are modulated using either QPSK, 16-QAM, 64-QAM, non-uniform

16-QAM or non-uniform 64-QAM constellations. The 16-QAM constellations, and the details of the

Gray mapping applied to them, are illustrated in Figure 14.12. The other constellations used can be found

in [82]. The exact proportions of the constellations depend on a parameter α, which can take the three

values 1, 2 or 4, thereby giving rise to the three diagrams in Figures 14.12, a to 14.12, c. The parameter

α is the minimum distance separating two constellation points carrying different HP-bit values

divided by the minimum distance separating any two constellation points. Non-hierarchical

transmission uses the same uniform constellation as the case with α=1 (see Figure 14.12, a).


[Figure: three 16-QAM constellation diagrams with Gray-mapped 4-bit labels; bit ordering y_{0,q}, y_{1,q}, y_{2,q}, y_{3,q}; bits y_{0,q}, y_{2,q} convey I (Re{z}) and bits y_{1,q}, y_{3,q} convey Q (Im{z}); (a) uniform constellation with levels ±1, ±3; (b) non-uniform constellation with α=2 and levels ±2, ±4; (c) non-uniform constellation with α=4 and levels ±4, ±6.]

Fig. 14.12. 16-QAM constellations for DVB-T system; (a) – uniform with α=1 for hierarchical and non-hierarchical transmission, (b) – non-uniform with α=2 for hierarchical transmission, (c) – non-uniform with α=4 for hierarchical transmission

In the case of non-hierarchical transmission the data stream at the output of the inner interleaver

consists of v bit words. These are mapped onto a complex number z, according to Figure 14.12, a.

In the case of hierarchical transmission, the data streams are formatted as shown in

Figure 14.11, b, and then the mappings as shown in Figures 14.12, a, b or c are applied, as appropriate.

For hierarchical 16-QAM:

The high priority bits are the y_{0,q} and y_{1,q} bits of the inner interleaver output words. The low priority bits are the y_{2,q} and y_{3,q} bits. The mappings of Figures 14.12, a, b or c are applied, as appropriate.

For example, the top left constellation point, corresponding to 1000, represents y_{0,q} = 1, y_{1,q} = y_{2,q} = y_{3,q} = 0. If this constellation is decoded as if it were QPSK, the high priority bits y_{0,q}, y_{1,q} will be deduced. To decode the low priority bits, the full constellation shall be examined and the appropriate bits y_{2,q}, y_{3,q} extracted.


14.1.10 OFDM frame structure

The transmitted signal is organized in frames. Each frame has a duration TF and consists of 68 OFDM symbols. Four frames constitute one super-frame. Each symbol is constituted by a set of K = 6817 carriers in the 8K mode and K = 1705 carriers in the 2K mode and is transmitted with a duration TS. The symbol is composed of two parts: a useful part with duration TU and a guard interval with a duration Δ. The guard interval consists of a cyclic continuation of the useful part and is inserted before it. Four values of guard intervals may be used according to table 14.4.

The symbols in an OFDM frame are numbered from 0 to 67. All symbols contain data and

reference information. Since the OFDM signal comprises many separately-modulated carriers, each

symbol can in turn be considered to be divided into cells, each corresponding to the modulation

carried on one carrier during one symbol. In addition to the transmitted data an OFDM frame contains

reference signals:

Scattered pilots;

Continual pilots;

Transmission Parameter Signaling (TPS) pilots.

An illustration of the DVB-T frame structure of 68 OFDM symbols per frame is given in Figure

14.13.

[Figure: time–frequency grid of OFDM symbols s0–s67 versus carriers c0 to c1704 (2K mode) or c6816 (8K mode), with cells marked as scattered pilots, continual pilots, TPS pilots and data.]

Fig. 14.13. Transmission frame for the DVB-T signal

The scattered and continual pilot cells within the OFDM frame are modulated with reference

information whose transmitted value is known to the receiver and can be used for frame

synchronization, frequency synchronization, time synchronization, channel estimation, transmission

mode identification, and can also be used to follow the phase noise. These pilots are transmitted at an amplitude 1/0,75 (i.e. 4/3) times that of the other carriers in order to be particularly robust against transmission errors. TPS pilots inform the receiver about the actual operating parameters. The TPS carriers are modulated by means of differential binary phase shift keying (DBPSK). Thus, one bit per carrier can be transmitted. Consequently, one OFDM frame contains a

TPS block of 68 bits, namely, 1 initialization bit, 16 synchronization bits, 37 information bits, and 14

redundancy bits for error protection [82]. In order to make the TPS data more robust against

frequency-selective channel distortions it is transmitted totally redundantly on 17 carrier frequencies

for the 2K mode and on 68 carrier frequencies for the 8K mode.

Each continual pilot coincides with a scattered pilot every fourth symbol; the number of useful

data carriers is constant from symbol to symbol: 1512 useful carriers in 2K mode and 6048 useful

carriers in 8K mode. The value of the scattered or continual pilot information is derived from a Pseudo

197

Random Binary Sequence (PRBS) which is a series of values, one for each of the transmitted carriers.

Reference information, taken from the reference sequence, is transmitted in scattered pilot cells in

every symbol.

In addition to the scattered pilots, 177 continual pilots in the 8K mode and 45 in the 2K mode,

are inserted, where "continual" means that they occur on all symbols. All continual pilots are

modulated according to the reference sequence (PRBS).

The polynomial for the Pseudo Random Binary Sequence (PRBS) generator is X^11 + X^2 + 1. The PRBS is initialized so that the first output bit from the PRBS coincides with the first active carrier. A new value is generated by the PRBS on every used carrier (whether or not it is a pilot).

The TPS carriers are used for the purpose of signaling parameters related to the transmission

scheme, i.e. to channel coding and modulation. The TPS is transmitted in parallel on 17 TPS carriers

for the 2K mode and on 68 carriers for the 8K mode. Every TPS carrier in the same symbol conveys

the same differentially encoded information bit.

The TPS carriers convey information on:

Modulation including the parameter α value of the QAM constellation pattern;

Hierarchy information;

Guard interval;

Inner code rates;

Transmission mode (2K or 8K);

Frame number in a super-frame;

Cell identification.

The TPS is defined over 68 consecutive OFDM symbols, referred to as one OFDM frame. Four

consecutive frames correspond to one OFDM super-frame. The reference sequence corresponding to the TPS carriers of the first symbol of each OFDM frame is used to initialize the TPS modulation on each TPS carrier. Each OFDM symbol conveys one TPS bit. Each TPS block (corresponding to

one OFDM frame) contains 68 bits, defined as follows:

1 initialization bit;

16 synchronization bits;

37 information bits;

14 redundancy bits for TPS error protection.

Of the 37 information bits, 31 are used. The remaining 6 bits shall be set to zero.

More details about TPS transmission format can be found in [82].

14.1.11 Main parameters of DVB-T system

Finally we present table 14.4 summarizing the main parameters of DVB-T for European

channels of 7 and 8 MHz [76, 82].

Table 14.4 Main parameters of the DVB-T terrestrial system

Parameter | 8K/8 MHz | 8K/7 MHz | 2K/8 MHz | 2K/7 MHz
Total number of carriers | 6817 | 6817 | 1705 | 1705
Useful carriers (data) | 6048 | 6048 | 1512 | 1512
Scattered pilot carriers | 524 | 524 | 131 | 131
Continual pilot carriers | 177 | 177 | 45 | 45
Signaling (TPS) carriers | 68 | 68 | 17 | 17
Useful symbol duration (TU) | 896 µs | 1024 µs | 224 µs | 256 µs
Carrier spacing (1/TU) | 1116,07 Hz | 976,65 Hz | 4464,28 Hz | 3906,25 Hz
Distance between extreme carriers | 7,61 MHz | 6,66 MHz | 7,61 MHz | 6,66 MHz
Relative guard interval (Δ/TU) | 1/4, 1/8, 1/16 or 1/32 (all modes)
Individual carrier modulation | QPSK, 16-QAM, 64-QAM (all modes)
Hierarchical modes | α = 1, 2 or 4 (all modes)


14.2 DVB-C system

14.2.1 Functional block diagram of the system

The DVB-C (Cable) system (see Figure 14.14) is defined as the functional block of equipment

performing the adaptation of the baseband TV signals to the cable channel characteristics [88].

[Figure: transmit chain (cable channel adapter) – baseband interface to MPEG-2 TS sources, contribution links, remultiplexers, etc.; Sync1 inverter and randomizer; RS(204,188,8) coder; convolutional interleaver (I=12); byte to m-tuple converter; differential encoder; mapper and pulse shaping (I, Q); QAM modulator; DAC and front-end to the RF cable channel; with clock and sync generator. Receive chain (cable integrated receiver decoder) – front-end and ADC from the RF cable channel; QAM demodulator (I, Q); matched filter and equalizer; differential decoder; symbol to byte mapper; convolutional deinterleaver (I=12); RS(204,188,8) decoder; Sync1 inverter and derandomizer; baseband physical interface; with carrier, clock and sync recovery.]

Fig. 14.14. Conceptual block diagram of elements at the cable head-end and receiver

In the cable head-end, the following TV baseband signal sources can be considered:

Satellite signals;

Contribution links;

Local program sources.

The processes are described in the following text in the order shown in Figure 14.14.

14.2.2 System blocks and their functions

Baseband interfacing

This unit shall adapt the data structure to the format of the signal source. The framing structure

shall be in accordance with MPEG-2 transport stream including sync bytes.

Sync 1 inversion and scrambling (randomization)

This unit shall invert the Sync 1 byte according to the MPEG-2 framing structure, and randomize the data stream for spectrum shaping purposes. It is the same as in the DVB-T system.

Reed-Solomon (RS) coder

This unit shall apply a shortened Reed-Solomon (RS) code to each randomized transport packet

to generate an error-protected packet. This code shall also be applied to the Sync byte itself. The unit

is identical to the outer coder used in the DVB-T system.

Convolutional interleaver

This unit shall perform a depth l=12 convolutional interleaving of the error-protected packets.

The periodicity of the sync bytes shall remain unchanged. This block is the same as the outer

interleaver in the DVB-T system.

Byte to m-tuple conversion

This unit must perform a conversion of the bytes generated by the interleaver into QAM

symbols. In each case, the MSB of symbol X shall be taken from the MSB of byte V.

Correspondingly, the next significant bit of the symbol shall be taken from the next significant bit of

the byte. For the case of 2^B-QAM modulation, the process shall map k bytes into n symbols, such that 8k = n×B.


The process is illustrated for the case of 64-QAM (where B=6, k=3 and n=4) in Figure 14.15.

[Figure: three bytes V, V+1, V+2 from the interleaver output (bits b7...b0 each, MSB first) are regrouped into four 6-bit symbols X, X+1, X+2, X+3 (bits b5...b0 each) for the differential encoder.]

Fig. 14.15. Byte to 6-bit symbol conversion for 64-QAM modulation
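The regrouping is pure bit bookkeeping. A minimal sketch for arbitrary B (the naming is ours, assuming an input of k whole bytes with 8k divisible by B):

    def bytes_to_mtuples(data, B):
        # Regroup k bytes into n B-bit symbols, MSB first (8k = n*B).
        bits = []
        for byte in data:
            bits.extend((byte >> i) & 1 for i in range(7, -1, -1))
        assert len(bits) % B == 0, "need 8k divisible by B"
        return [int("".join(map(str, bits[i:i + B])), 2)
                for i in range(0, len(bits), B)]

    print(bytes_to_mtuples([0xAB, 0xCD, 0xEF], 6))  # 3 bytes -> 4 symbols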

Differential encoding

In order to get a π/2 rotation-invariant QAM constellation, this unit shall apply a differential

encoding of the two Most Significant Bits (MSBs) of each symbol. The MSBs Ik and Qk of the

consecutive symbols A and B are differentially coded at the transmitter end to enable decoding

independently of the quadrant's absolute position. This is necessary because the phase information is

lost due to carrier suppression during modulation.

Figure 14.16 gives an example of implementation of byte to symbol conversion and the

differential encoding.

[Figure: bytes from the convolutional interleaver enter the byte to m-tuple converter; the two MSBs (Ak = MSB, Bk = b_q) pass through the differential encoder to give Ik, Qk, while the remaining q bits (b_{q−1}...b_0) go directly to the mapper (I, Q); q = 2 for 16-QAM, q = 3 for 32-QAM, q = 4 for 64-QAM.]

Fig. 14.16. Example implementation of the byte to m-tuple conversion and the differential encoding of the two MSBs

The differential encoding of the two MSBs shall be given by the following Boolean expressions:

Ik = (Ak ⊕ Bk)‾ · (Ak ⊕ Ik-1) + (Ak ⊕ Bk) · (Ak ⊕ Qk-1),

Qk = (Ak ⊕ Bk)‾ · (Bk ⊕ Qk-1) + (Ak ⊕ Bk) · (Bk ⊕ Ik-1),

where "⊕" denotes the XOR function, "+" denotes the logical OR function, "·" denotes the logical AND function and the overbar "‾" denotes inversion.

The MSBs Ik and Qk are buffered during one symbol clock after differential coding. The original

position of the quadrant is obtained from the comparison of Ik and Ik-1 and Qk and Qk-1. This is

illustrated in table 14.5.

Table 14.5 Truth table for differential coding

Inputs (Ak Bk) | Outputs (Ik Qk) | Rotation
0 0 | Ik-1, Qk-1 | 0
0 1 | Qk-1, Ik-1‾ | −90°
1 0 | Qk-1‾, Ik-1 | +90°
1 1 | Ik-1‾, Qk-1‾ | 180°
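A direct transcription of the Boolean expressions above (our own sketch; the names are ours) reproduces the quadrant rotations of Table 14.5:

    def diff_encode(pairs):
        # pairs: (Ak, Bk) MSB pairs; returns (Ik, Qk) per the expressions.
        i_prev = q_prev = 0
        out = []
        for a, b in pairs:
            s = a ^ b
            i = ((1 - s) & (a ^ i_prev)) | (s & (a ^ q_prev))
            q = ((1 - s) & (b ^ q_prev)) | (s & (b ^ i_prev))
            out.append((i, q))
            i_prev, q_prev = i, q
        return out

    # (1,0) rotates the quadrant by +90 degrees, (1,1) by 180 degrees:
    print(diff_encode([(0, 0), (1, 0), (1, 1)]))   # [(0, 0), (1, 0), (0, 1)]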

Modulation

The modulation of the DVB-C system shall be Quadrature Amplitude Modulation (QAM) with 16, 32, 64, 128 or 256 points in the constellation diagram. The mapping is not identical to the corresponding mapping of DVB-T. The constellation diagram for DVB-C 16-QAM modulation is given in Figure 14.17.

[Figure: 16-QAM constellation with Gray-mapped 4-bit labels at levels ±1, ±3; the two MSBs IkQk select the quadrant (00, 01, 11, 10) and the remaining bits select the point within the quadrant.]

Fig. 14.17. The DVB-C constellation diagram for 16-QAM

Other DVB-C constellation diagrams can be found in [88]. These constellation diagrams

represent the signal transmitted in the cable system. As shown in Figure 14.17, the constellation points

in quadrant 1 shall be converted to quadrants 2, 3 and 4 by changing the two MSBs (i.e. Ik and Qk) and

by rotating the q LSBs according to the following rule given in Table 14.6.

Table 14.6 Conversion of constellation points of quadrant 1

to other quadrants of the constellation diagram

Quadrant | MSBs | LSBs rotation
1 | 00 | 0
2 | 10 | +π/2
3 | 11 | +π
4 | 01 | +3π/2

Baseband pulse shaping

This unit performs mapping from differentially encoded m-tuples to I and Q signals and a

square-root raised cosine filtering of the I and Q signals prior to QAM modulation. The roll-off factor

is chosen to be equal to 0,15. The QAM signal is filtered with a square-root raised-cosine shaped filter, in order to remove inter-symbol interference at the receiving side.

DAC and front-end

The digital signal is transformed into an analog signal, with a digital-to-analog converter

(DAC), and then modulated to radio frequency by the RF front-end.

The receiving set-top box

The receiving set-top box adopts techniques which are dual to those used in the transmission:

Front-end and ADC: the analog RF signal is converted to baseband and transformed into a digital signal, using an analog-to-digital converter (ADC);

QAM demodulation;

Matched filtering and equalization;

Differential decoding;

Symbol to byte mapping;

Convolutional deinterleaving;

Reed-Solomon decoding;

Derandomization;

MPEG-2 demultiplexing and source decoding (programmable transport stream).

Cable bit rates

With a roll-off factor of 0,15, the theoretical maximum symbol rate in an 8 MHz channel is

about 6,96 MBaud.


Table 14.7 gives examples of the wide range of possible cable bit rates and occupied bandwidths

considering 16-QAM, 32-QAM, 64-QAM, 128-QAM and 256-QAM constellations.

Table 14.7 Available bit rates (Mbit/s) for DVB-C system [88]

Modulation | 2 MHz | 4 MHz | 6 MHz | 8 MHz | 10 MHz
16-QAM | 6,41 | 12,82 | 19,23 | 25,64 | 32,05
32-QAM | 8,01 | 16,03 | 24,04 | 32,05 | 40,07
64-QAM | 9,62 | 19,23 | 28,85 | 38,47 | 48,08
128-QAM | 11,22 | 22,44 | 33,66 | 44,88 | 56,10
256-QAM | 12,82 | 25,64 | 38,47 | 51,29 | 64,11
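These figures follow from the useful bit rate being the symbol rate × bits per symbol × 188/204 (the Reed-Solomon overhead), with the symbol rate limited to bandwidth/(1 + roll-off) and roll-off 0,15. A quick check of the table values (our own sketch):

    for bw in (2, 4, 6, 8, 10):                      # channel bandwidth, MHz
        symbol_rate = bw / 1.15                      # MBaud, roll-off 0.15
        for m, name in ((4, "16-QAM"), (5, "32-QAM"), (6, "64-QAM"),
                        (7, "128-QAM"), (8, "256-QAM")):
            rate = symbol_rate * m * 188 / 204       # Mbit/s after RS overhead
            print(f"{name}, {bw} MHz: {rate:.2f} Mbit/s")

For example, 64-QAM in an 8 MHz channel gives 6,96 MBaud × 6 × 188/204 ≈ 38,47 Mbit/s, as in the table.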

The commonly used values in cable networks are 64-QAM or 256-QAM constellations and

6 MHz or 8 MHz channel bandwidths.

14.3 DVB-S system

14.3.1 Functional block diagram of the system

Direct To Home (DTH) services via satellite are particularly affected by power limitations; therefore, ruggedness of the system against noise and interference shall be the main design objective, rather than spectrum efficiency. To achieve a very high power efficiency without

concatenation of convolutional and RS codes. The convolutional code is able to be configured

flexibly, allowing the optimization of the system performance for a given satellite transponder

bandwidth.

The system is defined as the functional block of equipment performing the adaptation of the

baseband TV signals, from the output of the MPEG-2 transport multiplexer to the satellite channel

characteristics. The transmission part of the system (see Figure 14.18) consists of blocks

implementing the following functions [89]:

MPEG-2 transport stream adaptation and randomization for energy dispersal;

Reed-Solomon coding (outer coding);

Convolutional interleaving (outer interleaving);

Punctured convolutional coding (inner coding);

Baseband shaping for modulation;

Modulation.

The blocks of reception part (see Figure 14.18) implement inverse functions.


[Figure: transmit chain (satellite channel adapter) – MPEG video, audio and data encoders feed the elementary stream and transport stream multiplexers; the transport stream, in 188-byte segments, passes through bit stream adaptation and energy dispersal, outer coder RS(204,188,8), outer interleaver, inner coder and mapper, baseband shaping, QPSK modulator, DAC and front-end to the RF satellite channel; with clock and sync generator. Receive chain (satellite receiver decoder) – front-end and ADC from the RF satellite channel, QPSK demodulator (I, Q), matched filter, inner decoder, sync decoder, convolutional deinterleaver (I=12), RS(204,188,8) decoder, Sync1 inverter and derandomizer, baseband physical interface; with carrier, clock and sync recovery.]

Fig. 14.18. Functional block diagram of the DVB-S system

14.3.2 System blocks and their functions

The blocks performing bit stream adaptation and randomization for energy dispersal, outer

Reed-Solomon coding, outer convolutional interleaving and inner punctured convolutional

coding are identical to those used in the DVB-T system.

Baseband shaping and modulation

The system employs conventional Gray-coded QPSK modulation with absolute mapping (no differential coding). Bit mapping in the signal space as given in Figure 14.19 is used. Prior to

modulation, the I and Q signals are square-root raised cosine filtered. The roll-off factor α is equal to

0,35.

[Figure: QPSK constellation in the I–Q plane with one point per quadrant, labelled I=0, Q=0; I=1, Q=0; I=1, Q=1; I=0, Q=1.]

Fig. 14.19. QPSK constellation used in DVB-S system

Due to the similarity of the transmitter and receiver block diagrams (see Figure 14.18), only the

latter will be described below.

Front-End and ADC and QPSK demodulator: these units perform the conversion of the

analog RF signal to baseband and transformation into a digital signal, using an analog-to-digital

converter (ADC), also the quadrature coherent demodulation, providing I and Q information to the

inner decoder.

Matched filter: this unit performs the complementary pulse shaping filtering of square-root

raised cosine type according to the roll-off. The use of a Finite Impulse Response (FIR) digital filter

could provide equalization of the channel linear distortions in the receiver.

Carrier/clock recovery unit: this device recovers the demodulator synchronization.


Inner decoder: this unit performs first level error protection decoding. It operates at an input BER in the order of 10⁻¹ to 10⁻² (depending on the adopted code rate), and produces an output BER of about 2×10⁻⁴ or lower. This output BER corresponds to quasi-error-free service after outer code correction. This unit is in a position to try each of the code rates and puncturing configurations until lock is acquired. Furthermore, it is in a position to resolve the π/2 demodulation phase ambiguity.

Sync byte decoder: by decoding the MPEG-2 sync bytes, this decoder provides

synchronization information for the de-interleaving. It is also in a position to recover the π ambiguity of the QPSK demodulator (not detectable by the Viterbi decoder).

Convolutional de-interleaver: this device allows the error bursts at the output of the inner

decoder to be randomized on a byte basis in order to improve the burst error correction capability of

the outer decoder.

Outer decoder: this unit provides second level error protection. It is in a position to provide quasi-error-free output (i.e. BER of about 10⁻¹⁰ to 10⁻¹¹) in the presence of input error bursts at a BER of about 7×10⁻⁴ or better with infinite byte interleaving. In the case of interleaving depth I=12, BER = 2×10⁻⁴ is assumed for quasi-error-free operation.

Energy dispersal removal: this unit recovers the user data by removing the randomizing

pattern used for energy dispersal purposes and changes the inverted sync byte to its normal MPEG-2

sync byte value.

Baseband physical interface: this unit adapts the data structure to the format and protocol

required by the external interface.


REFERENCES

1. Y. Wu, S. Hirakawa, U. H. Reimers, J. Whitaker. "Overview of digital television development worldwide". Proceedings of the IEEE, Vol. 94, No. 1, January 2006.

2. Broadcasting pioneers: The many innovators behind television history. [Online]. Available:

http://inventors.about.com/library/in-ventors/bltelevision.htm.

3. Television history – The first 75 years. [Online]. Available: http://www.tvhistory.tv/.

4. A six-megacycle compatible high-definition color television system. RCA Rev., Vol. 10, pp. 504–522, Dec. 1949.

5. Report and order of Federal Communications Commission. Washington, DC, FCC Doc. 53-

1663, Dec. 17, 1953.

6. D. H. Pritchard and J. J. Gibson. Television transmission standards. In Standard Handbook

of Broadcast Engineering, J. C. Whitaker, Ed. New York: McGraw-Hill, pp. 3.9–3.33, 2005.

7. Y. Ninomiya. The Japanese scene. IEEE Spectrum, Vol. 32, No. 4, pp. 54–57, Apr. 1995.

8. B. Fox. The digital dawn in Europe [HDTV]. IEEE Spectrum, Vol. 32, No. 4, pp. 50–53,

Apr. 1995.

9. R. J. G. Ellis. The PALplus Story. Manchester, U.K.: Architects’ Publishing Partnership

Ltd., 1997.

10. M. A. Isnardi, T. Smith, B. J. Roeder. Decoding issues in the ACTV system. IEEE Trans.

Consum. Electron., Vol. 34, No. 1, pp. 111–120, Feb. 1988.

11. U. Reimers. DVB – The Family of International Standards for Digital Video Broadcasting.

Springer, Berlin, 2004.

12. M. D. Fairchild. Color Appearance Models. Sec. Ed. John Wiley & Sons Ltd, 2005.

13. D. Briggs. The dimensions of color. [Online]. Available:

http://www.huevaluechroma.com/index.php.

14. Color Models. [Online]. Available:

http://cs.brown.edu/courses/cs092/VA10/HTML/ColorModels.html.

15. CIE 1931 color space. [Online]. Available:

http://en.wikipedia.org/wiki/CIE_1931_color_space.

16. Color Models. [Online]. Available: http://www.sketchpad.net/basics4.htm.

17. Color Models. [Online]. Available:

https://software.intel.com/sites/products/documentation/hpc/ipp/ippi/ippi_ch6/ch6_color_

models.html.

18. ITU-R Recommendation BT.709: Basic Parameter Values for the HDTV Standard for the

Studio and International Programme Exchange [formerly CCIR Rec.709]. ITU, Geneva,

Switzerland, 2005.

19. K. Jack. Video Demystified: a Handbook for the Digital Engineer. LLH Technology

Publishing, 3rd Edition, 2001.

20. ITU-R Recommendation BT.601: Studio encoding parameters of digital television for

standard 4:3 and wide-screen 16:9 aspect ratios, [formerly CCIR Rec.601]. ITU, Geneva,

Switzerland, 2011.

21. H. S. Malvar, G. J. Sullivan. Transform, Scaling & Color Space Impact of Professional

Extensions, ISO/IEC JTC/SC29/WG11 and ITU-T SG16 Q6 Document JVT-H031, Geneva,

May 2003.

22. Color Theory. [Online]. Available: http://www.colormepretty.co/color-theory/.

23. James Clerk Maxwell (1831 - 1879). [Online]. Available:

http://faculty.wcas.northwestern.edu/~infocom/Ideas/maxwell.html.

24. D. Zawischa. Introduction to color science. [Online]. Available: https://www.itp.uni-

hannover.de/~zawischa/ITP/introcol.html.

25. Human Perception of Sound. [Online]. Available: http://zone.ni.com/reference/en-

XX/help/372416B-01/svtconcepts/human_perception_sound/.


26. Hearing and Perception. [Online]. Available:

http://artsites.ucsc.edu/ems/music/tech_background/te-03/teces_03.html.

27. Engineering Acoustics/The Human Ear and Sound Perception. [Online]. Available:

http://en.wikibooks.org/wiki/Engineering_Acoustics/The_Human_Ear_and_Sound_Percep

tion.

28. E. Zwicker. Psychoakustik. Springer, Berlin, 1982, ISBN 3-540-11401-7.

29. Steven W. Smith. The Scientist and Engineer's Guide to Digital Signal Processing.

California Technical Publishing, 1997.

30. The mel frequency scale and coefficients. [Online]. Available:

http://kom.aau.dk/group/04gr742/pdf/MFCC_worksheet.pdf.

31. L. R. Rabiner, R. W. Schafer. Digital Processing of Speech Signals. Prentice Hall, London,

1978.

32. K. Fellbaum, J. Richter. Human Speech Production Using Interactive Modules and the

Internet - a Tutorial for the Virtual University. [Online]. Available:

https://www2.spsc.tugraz.at/add_material/courses/scl/vocoder/.

33. P. Vary, U. Heute, W. Hess. Digitale Sprachsignalverarbeitung. Teubner, Stuttgart, 1998.

34. A. E. Rosenberg. Effect of Glottal Pulse Shape on the Quality of Natural Vowels.

J. Acoust. Soc. Am., Vol. 49, No. 2, pp. 583-590, Feb. 1971.

35. L. W. Couch, II. Digital and Analog Communication Systems. Pearson Education, London,

2007.

36. I. A. Glover, P. M. Grant. Digital Communications. Prentice Hall, London, 1998.

37. H. Taub, D. L. Schilling. Principles of Communication Systems. McGraw-Hill, London,

1986.

38. A. B. Carlson. Communication Systems. McGraw-Hill, London, 1987.

39. Natural Binary Codes and Gray Codes. [Online]. Available:

http://www.gaussianwaves.com/2012/10/natural-binary-codes-and-gray-codes/.

40. ITU-R Recommendation BT.656: Interfaces for digital component video signals in 525-line

and 625-line television systems operating at the 4:2:2 level of Recommendation ITU-R

BT.601 (Part A) [formerly CCIR Rec.656]. ITU, Geneva, Switzerland, 2007.

41. J. Arnold, M. Frater, M. Pickering. Digital television: Technology and Standards.

John Wiley&Sons, New Jersey, 2007.

42. T. Wiegand, X. Zhang, and B. Girod. Long-Term Motion-Compensated Prediction. IEEE

Trans. on Circuits and Systems for Video Technology, Feb. 1999.

43. T. Wiegand, N. Färber, K. Stuhlmüller, and B. Girod. Error-Resilient Video Transmission

using Long-Term Memory Motion-compensated Prediction. IEEE Journal on Selected Areas

in Communications, June 2000.

44. K. Rao, P. Yip. Discrete Cosine Transform: Algorithms, Advantages, Applications. Boston,

Academic Press, 1990.

45. J. D. Markel, A. H. Gray. Linear Prediction of Speech. Springer Verlag, Berlin, 1976.

46. B. Gold, N. Morgan. Speech and Audio Signal Processing: Processing and Perception of

Speech and Music. Wiley&Sons, Chichester, 2000.

47. M. D. Paez, T. H. Glisson. Minimum Mean Squared Error Quantization in Speech. IEEE

Trans. Comm., Vol. Com-20, pp.225-230, April, 1972.

48. Adaptive Quantization. [Online]. Available:

http://www.ece.ucsb.edu/Faculty/Rabiner/ece259/digital%20speech%20processing%20cou

rse/lectures_new/Lecture%2016_winter_2012_6tp.pdf.

49. P. Noll. Adaptive Quantizing in Speech Coding Systems. Proc. 1974 Zurich Seminar on

Digital Communications. Zurich, March, 1974.

50. N. S. Jayant. Adaptive Quantization with a One Word Memory. Bell System Tech. J., pp.

1119-1144, September, 1973.


51. R. A. McDonald. Signal to Noise and Idle Channel Performance of DPCM System – Particular Applications to Voice Signals. Bell System Tech. J., Vol. 45, No. 7, pp. 1123-1151, September, 1966.

52. R. M. Gray. Vector quantization. IEEE ASSP Mag., Vol. 1, pp. 4-29, April, 1984.

53. H. B. Kekre, V. Kulkarni. Performance Comparison of Automatic Speaker Recognition

using Vector Quantization by LBG KFCG and KMCG. International Journal of Computer

Science and Security, Vol. 4, Issue 6, pp. 571-579.

54. Y. Linde, A. Buzo and R. M. Gray. An algorithm for vector quantizer design. IEEE Trans. Commun., Vol. 28, pp. 84-95, Jan. 1980.

55. A. Gersho, R. M. Gray. Vector quantization and data compression. Kluwer, Boston, 1992.

56. C. Q. Chen, S. N. Koh, I. Y. Soon. Fast codebook search algorithm for unconstrained vector

quantization. IEE Proc. Vis. Image Signal Process, Vol. 145, No. 2, April 1998.

57. K. Brandenburg, G. Stoll. ISO-MPEG-1 Audio: a generic standard for coding of high-quality

digital audio. Journal of the Audio Engineering Society, Vol. 42(10), pp.780-792, October

1994.

58. ISO/IEC JTC1/SC29/WG11. Information technology - Coding of moving pictures and

associated audio for digital storage media up to about 1,5 Mb/s. IS 11172 (Part 3, Audio),

1992.

59. ISO/IEC JTC1/SC29/WG11. Information technology - Generic coding of moving pictures

and associated audio information. IS 13818 (Part 3, Audio), 1994.

60. ISO/IEC JTC1/SC29/WG11. Information Technology - Generic coding of moving pictures

and associated audio information. IS 13818 (Part 7, Advanced audio coding), 1997.

61. ISO/IEC JTC1/SC29/WG11. Information Technology - Coding of audiovisual objects.

ISO/IEC.D 4496 (Part 3, Audio), 1999.

62. K. Brandenburg, M. Bosi. Overview of MPEG audio: Current and future standards for low-

bit-rate audio coding. J. Audio Eng. Soc., Vol. 45(1/2), pp. 4-19, 1997.

63. C. M. Liu, W. W. Chang. Handbook of Multimedia Communication. Academic Press, New York, 2000.

64. D. Pan. A tutorial on MPEG/Audio compression. IEEE Multimedia Magazine, Vol. 2(2), pp. 60-74, 1995.

65. T. Görne. Tontechnik. Fachbuchverlag, Leipzig, München, 2006.

66. The story of MP3. Fraunhofer Institute for Integrated Circuits IIS. [Online]. Available:

http://www.mp3-history.com/en/the_story_of_mp3.html.

67. M. Bosi, K. Brandenburg, Sch. Quackenbush and others. ISO/IEC MPEG-2 Advanced

Audio Coding. Proc. of the 101st AES-Convention, 1996.

68. T. Saramaki. A generalized class of cosine-modulated filter banks. Proceedings of First

International Workshop on Transforms and Filter Banks, Tampere, Finland, pp. 336-365,

1998.

69. ITU-T Recommendation G.722: 7 kHz audio-coding within 64 kbit/s. ITU, Geneva,

Switzerland, 2012.

70. K. Brandenburg. MP3 and AAC explained. Proceedings of the AES 17th International

Conference on High Quality Audio Coding, 1999.

71. ISO/IEC 13818-1. Information technology - Generic coding of moving pictures and

associated audio information: Systems. 2007.

72. ISO/IEC 13818-2. Information technology - Generic Coding of Moving Pictures and

Associated Audio Information: Video. 2007.

73. Handbook of Image and Video Processing. Edited by A. Bovik. Elsevier Inc., 2005.

74. The Telecommunications Handbook. Edited by K. Terplan, P. A. Morreale. CRC Press,

2000.


75. Multimedia Communication Networks. Chapter 7. MPEG-2 System layer. [Online].

Available: https://www.uic.edu/classes/ece/ece434/chapter_file/chapter7.htm.

76. H. Benoit. Digital Television: Satellite, Cable, Terrestrial, IPTV, Mobile TV in the DVB Framework. Elsevier Inc., 2008.

77. T. Ho. Digital Video Broadcasting Conditional Access Architecture. A Report Prepared for

CS265-Section 2 [Online]. Available:

http://www.cs.sjsu.edu/~stamp/CS265/projects/papers/dvbca.pdf.

78. M. Carter. Pay TV Piracy: Lessons Learned. Digital Rights Management Workshop, 2000.

[Online]. Available: http://www.eurubits.de/drm/drm_2000/slides/carter.pdf.

79. ETSI. Digital Video Broadcasting (DVB); Support for Use of Scrambling and Conditional

Access (CA) within Digital Broadcasting Systems. ETR 289, 1996.

80. M. G. Kuhn. The New European Digital Video Broadcast (DVB) Standard. [Online].

Available: http://www.cl.cam.ac.uk/~mgk25/dvb.txt.

81. EN 301 192. Specifications for Data Broadcasting. European Telecommunications

Standards Institute (ETSI), 2004.

82. ETSI EN 300 744. V1.5.1. Digital Video Broadcasting (DVB); Framing structure, channel

coding and modulation for digital terrestrial television. European Telecommunications

Standards Institute (ETSI), 2004.

83. S. Gravano. Introduction to Error Control Codes. Oxford University Press, New York, 2001.

84. W. W. Peterson, E. J. Weldon. Error Correcting Codes. Cambridge, MA, The MIT Press,

1972.

85. Coding and decoding with Convolutional Codes. Ch. Langton, Editor. [Online]. Available:

www.complextoreal.com.

86. R. W. Hamming. Error Detecting and Error Correcting Codes. Bell System Techn. J., Vol.

29, pp. 147-160, April 1950.

87. V. K. Bhargava. Forward Error Correction Schemes for Digital Communications. IEEE

Communications Magazine, Vol. 21, pp. 11-19, January 1983.

88. ETSI EN 300 429 V1.2.1. Digital Video Broadcasting (DVB); Framing structure, channel

coding and modulation for cable systems. European Telecommunications Standards Institute

(ETSI), 1998.

89. ETSI EN 300 421 V1.1.2. Digital Video Broadcasting (DVB); Framing structure, channel coding and modulation for 11/12 GHz satellite services. European Telecommunications Standards Institute (ETSI), 1997.

90. ETSI EN 302 307 V1.2.1. Digital Video Broadcasting (DVB); Second generation framing structure, channel coding and modulation systems for Broadcasting, Interactive Services, News Gathering and other broadband satellite applications (DVB-S2). European Telecommunications Standards Institute (ETSI), 2009.

91. P. Shelswell. The COFDM modulation system: the heart of digital audio broadcasting.

Electronics & Communication Engineering Journal, pp. 127-136, June 1995.

92. J. H. Stott. The how and why of COFDM. EBU Technical Review, pp. 1-14, Winter 1998.

93. S. L. Linfoot, R. S. Sherratt. A Study of COFDM in a Terrestrial Multipath Environment.

Eurocomm 2000. Information Systems for Enhanced Public Safety and Security.

IEEE/AFCEA, pp. 388-391, 2000.