Speech & Audio Coding - LiU

Speech & Audio Coding TSBK01 Image Coding and Data Compression Lecture 11, 2003 Jörgen Ahlberg

Page 1: Speech & Audio Coding - LiU

Speech & Audio Coding

TSBK01 Image Coding and Data Compression

Lecture 11, 2003

Jörgen Ahlberg

Page 2: Speech & Audio Coding - LiU

Outline

• Part I - Speech

– Speech

– History of speech synthesis & coding

– Speech coding methods

• Part II – Audio

– Psychoacoustic models

– MPEG-4 Audio

Page 3: Speech & Audio Coding - LiU

Speech Production

• The human vocal apparatus consists of:

– lungs

– trachea (wind pipe)

– larynx

• contains 2 folds of skin called vocal cords which blow apart and flap together as air is forced through

– oral tract

– nasal tract

Page 4: Speech & Audio Coding - LiU

The Speech Signal

Page 5: Speech & Audio Coding - LiU

The Speech Signal

Page 6: Speech & Audio Coding - LiU

The Speech Signal

Elements of the speech signal:

• spectral resonance (formants, moving)

• periodic excitation (voicing, pitched); pitch contour

• noise excitation (fricatives, unvoiced, no pitch)

• transients (stop-release bursts)

• amplitude modulation (nasals, approximants)

• timing

Page 7: Speech & Audio Coding - LiU

The Speech Signal

Vowels:

• Characterised by formants; generally voiced.

• Tongue and lips: effect of rounding.

• Examples of vowels: a, e, i, o, u, a, ah, oh.

• Vibration of vocal cords: male [values illegible] Hz, female up to [value illegible] Hz.

• Vowels have on average a much longer duration than consonants.

• Most of the acoustic energy of a speech signal is carried by vowels.

[Figure: F1-F2 chart of formant positions]

Page 8: Speech & Audio Coding - LiU

History of Speech Coding

• [year illegible]: Channel vocoder, the first analysis-by-synthesis system, developed by Homer Dudley of ATT Labs (VODER).

• 1926: PCM, first conceived by Paul M. Rainey and independently by Alex Reeves (ATT Paris) in [year illegible]. Deployed in the US PSTN in 1962.

[Figure: VODER – the architecture]

Page 10: Speech & Audio Coding - LiU

OVE formant synthesis (Gunnar Fant, KTH, [year illegible])

Page 11: Speech & Audio Coding - LiU

History of Speech Coding

• [year illegible]: Channel vocoder, the first analysis-by-synthesis system (Homer Dudley of ATT Labs), VODER.

• 1926: PCM, first conceived by Paul M. Rainey and independently by Alex Reeves (ATT Paris) in [year illegible]. Deployed in the US PSTN in 1962.

• [year illegible]: μ-law encoding proposed (standardised for the telephone network in 1972 as G.711).

• [year illegible]: Delta modulation proposed, differential PCM invented.

• [year illegible]: ADPCM developed.

• [year illegible]: CELP vocoder proposed (the majority of coding standards for speech signals today use a variation on CELP).

Page 12: Speech & Audio Coding - LiU

Source-filter Model of Speech Production

• The signal from a source is filtered by a time-varying filter with resonant properties similar to those of the vocal tract.

• The gain controls Av and AN determine the intensity of voiced and unvoiced excitation.

• The higher formants are attenuated by [value illegible] dB/octave (due to the nature of our speech organs).

• This is an over-simplified model of speech production. However, it is very often adequate for understanding the basic principles.

Page 13: Speech & Audio Coding - LiU

Speech Coding Strategies

1. PCM

• Invented 1926, deployed 1962.

• The speech signal is sampled at 8 kHz.

• Uniform quantization requires >10 bits/sample.

• Non-uniform quantization (G.711, 1972)

• Quantizing the companded signal y to 8 bits/sample gives 8 kHz × 8 bits = 64 kbit/s.
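The companding step behind non-uniform quantization can be sketched as follows. This is a minimal illustration that assumes the continuous μ-law curve with μ = 255; G.711 itself specifies a piecewise-linear approximation with its own bit layout, which is not reproduced here. All function names are hypothetical.

```python
import math

MU = 255.0  # assumed mu-law parameter (continuous curve, not the G.711 tables)

def mu_law_compress(x, mu=MU):
    """Map x in [-1, 1] to y in [-1, 1]; small amplitudes get finer resolution."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)

def mu_law_expand(y, mu=MU):
    """Inverse of mu_law_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)

def quantize_uniform(y, bits=8):
    """Uniform quantizer on [-1, 1]; returns an integer code in 0 .. 2**bits - 1."""
    levels = 2 ** bits
    code = int((y + 1.0) / 2.0 * levels)
    return min(max(code, 0), levels - 1)

def dequantize_uniform(code, bits=8):
    """Return the midpoint of the quantization cell for the given code."""
    levels = 2 ** bits
    return (code + 0.5) / levels * 2.0 - 1.0

def encode(x, bits=8):
    """Compress, then quantize uniformly: a non-uniform quantizer overall."""
    return quantize_uniform(mu_law_compress(x), bits)

def decode(code, bits=8):
    """Dequantize, then expand back to the linear domain."""
    return mu_law_expand(dequantize_uniform(code, bits))

SAMPLE_RATE_HZ = 8000
BITS_PER_SAMPLE = 8
BIT_RATE = SAMPLE_RATE_HZ * BITS_PER_SAMPLE  # 64000 bit/s
```

A quiet sample such as `x = 0.01` is reconstructed by `decode(encode(0.01))` with an error far below the step of an 8-bit uniform quantizer (2/256), which is why 8 bits suffice here where uniform quantization needs more than 10.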

Page 14: Speech & Audio Coding - LiU

Speech Coding Strategies

2. Adaptive DPCM

• Example: G.726 (1974)

• Adaptive predictor based on six previous differences.

• Gain-adaptive quantizer with 15 levels (4 bits/sample) → 32 kbit/s.
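The idea of adaptive DPCM can be sketched with a toy coder. This is not G.726: it assumes a trivial first-order predictor (the previous reconstructed sample) instead of the adaptive predictor, and a simple multiplicative step-size rule. The 15 quantizer levels are the codes -7 to 7; all names and constants are hypothetical.

```python
import math

def _update(pred, step, code, qmax):
    """Shared encoder/decoder state update so both sides stay in sync."""
    recon = pred + code * step           # reconstructed sample
    if abs(code) >= qmax - 1:            # quantizer close to overload: grow step
        step *= 1.5
    elif abs(code) <= 1:                 # quantizer barely used: shrink step
        step *= 0.9
    step = min(max(step, 1e-4), 0.5)     # keep the step size within sane bounds
    return recon, step

def adpcm_encode(samples, bits=4, step0=0.02):
    """Quantize the prediction difference with a gain-adaptive quantizer."""
    qmax = 2 ** (bits - 1) - 1           # 7 for 4 bits -> 15 levels (-7..7)
    pred, step, codes = 0.0, step0, []
    for x in samples:
        code = max(-qmax, min(qmax, round((x - pred) / step)))
        codes.append(code)
        pred, step = _update(pred, step, code, qmax)
    return codes

def adpcm_decode(codes, bits=4, step0=0.02):
    """Rebuild the signal from the codes using the same state updates."""
    qmax = 2 ** (bits - 1) - 1
    pred, step, out = 0.0, step0, []
    for code in codes:
        pred, step = _update(pred, step, code, qmax)
        out.append(pred)
    return out

# 200 Hz test tone sampled at 8 kHz
tone = [0.5 * math.sin(2 * math.pi * 200 * n / 8000) for n in range(400)]
codes = adpcm_encode(tone)
decoded = adpcm_decode(codes)
BIT_RATE = 4 * 8000  # 4 bits/sample at 8 kHz = 32000 bit/s
```

Because the step size is derived only from the transmitted codes, the decoder can track it without any side information.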

Page 15: Speech & Audio Coding - LiU

Speech Coding Strategies

3. Model-based Speech Coding

• Advanced speech coders are based on models of how speech is produced:

Excitation source → Vocal tract

Page 16: Speech & Audio Coding - LiU

An Excitation Source

Noise generator

Pulse generator

Pitch

Page 17: Speech & Audio Coding - LiU

Vocal Tract Filter 1: A Fixed Filter Bank

[Figure: bank of parallel band-pass filters (BP) with gains g1, g2, …, gn]

Page 18: Speech & Audio Coding - LiU

Vocal Tract Filter 2: A Controllable Filter

Page 19: Speech & Audio Coding - LiU

Linear Predictive Coding (LPC)

• The controllable filter is modelled as

$y_n = \sum_{i=1}^{M} a_i y_{n-i} + G\varepsilon_n$

where $\varepsilon_n$ is the input signal, $y_n$ is the output and $M$ is the filter order.

• We need to estimate the vocal tract parameters (ai and G) and the excitation parameters (pitch, v/uv).

• Typically the source signal is divided into short segments and the parameters are estimated for each segment.

• Example: The speech signal is sampled at 8 kHz and divided into segments of 180 samples (22.5 ms/segment).
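The filter equation above can be run directly as a synthesis loop. The sketch below assumes zero initial filter state and uses a hypothetical second-order coefficient set chosen only because it is stable; the pulse train and noise generators stand in for voiced and unvoiced excitation.

```python
import random

def lpc_synthesize(a, gain, excitation):
    """y[n] = sum_{i=1..M} a[i-1] * y[n-i] + gain * excitation[n], zero initial state."""
    y = []
    for n, e in enumerate(excitation):
        acc = gain * e
        for i in range(1, len(a) + 1):
            if n - i >= 0:
                acc += a[i - 1] * y[n - i]
        y.append(acc)
    return y

def pulse_train(length, period):
    """Voiced excitation: one unit pulse every `period` samples."""
    return [1.0 if n % period == 0 else 0.0 for n in range(length)]

def white_noise(length, seed=0):
    """Unvoiced excitation: Gaussian white noise (seeded for repeatability)."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(length)]

FRAME = 180
SAMPLE_RATE_HZ = 8000
frame_ms = 1000.0 * FRAME / SAMPLE_RATE_HZ   # 22.5 ms per segment

a = [1.3, -0.8]  # hypothetical coefficients: one resonance, poles inside the unit circle
voiced = lpc_synthesize(a, 1.0, pulse_train(FRAME, 80))
unvoiced = lpc_synthesize(a, 0.1, white_noise(FRAME))
```

Per segment, only `a`, the gain, the pitch period and the voiced/unvoiced decision would need to be sent, not the samples themselves.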

Page 20: Speech & Audio Coding - LiU

Typical Scheme of an LPC Coder

[Figure: block diagram with a noise generator, a pulse generator (input: pitch) and a vocal tract filter; control parameters: v/uv, gain, filter coeffs]

Page 21: Speech & Audio Coding - LiU

Estimating the Parameters

• v/uv estimation

– Based on energy and frequency spectrum.

• Pitch-period estimation

– Look for periodicity, either via the autocorrelation function (acf) or some other measure, for example a function of the lag p that takes a minimum value when p equals the pitch period.

– Typical pitch-periods: 20 - 160 samples.
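One common measure with a minimum at the pitch period is the average magnitude difference function (AMDF). Whether the slide used exactly this measure is an assumption; the sketch below only illustrates the search over the 20 to 160 sample range, and the function names are hypothetical.

```python
import math

def amdf(frame, lag):
    """Average magnitude difference between the frame and itself shifted by `lag`."""
    n = len(frame) - lag
    return sum(abs(frame[i] - frame[i + lag]) for i in range(n)) / n

def estimate_pitch_period(frame, min_lag=20, max_lag=160):
    """Return the lag in [min_lag, max_lag] that minimizes the AMDF.

    On ties (for example at multiples of the true period) the smallest lag wins.
    """
    max_lag = min(max_lag, len(frame) - 1)
    return min(range(min_lag, max_lag + 1), key=lambda p: amdf(frame, p))

# Synthetic voiced frame with an exact period of 50 samples (160 Hz at 8 kHz)
one_period = [math.sin(2 * math.pi * k / 50) + 0.5 * math.sin(4 * math.pi * k / 50)
              for k in range(50)]
frame = [one_period[n % 50] for n in range(400)]
period = estimate_pitch_period(frame)
```

For real speech the minimum is never exactly zero, and a practical estimator would also have to guard against picking a multiple of the true period.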

Page 22: Speech & Audio Coding - LiU

Estimating the Parameters

• Vocal tract filter estimation

– Find the filter coefficients that minimize the error

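$\varepsilon^2 = \sum_n \left( y_n - \sum_{i=1}^{M} a_i y_{n-i} \right)^2$, i.e. the energy of the part of $y_n$ that is not predicted from past samples (the excitation term $G\varepsilon_n$ in the model).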

– Compare to the computation of optimal predictors (Lecture 7).

Page 23: Speech & Audio Coding - LiU

Estimating the Parameters

• Assuming a stationary signal, the optimal coefficients solve the normal equations $\mathbf{R}\mathbf{a} = \mathbf{p}$, where R and p contain acf values.

• This is called the autocorrelation method.
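Since R is a Toeplitz matrix of acf values, the system can be solved without a general matrix inverse. The sketch below uses the Levinson-Durbin recursion, a standard choice that the slides do not name, and assumes R[i][j] = r(|i-j|) and p[i] = r(i+1). The reflection coefficients it produces are related to the PARCOR representation, up to sign convention.

```python
def autocorrelation(frame, max_lag):
    """Biased (unnormalized) acf estimates r(0) .. r(max_lag) for one frame."""
    n = len(frame)
    return [sum(frame[i] * frame[i + k] for i in range(n - k))
            for k in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve sum_j a_j r(|i-j|) = r(i), i = 1..order.

    Returns (a, reflection_coeffs, residual_energy), where a[i] multiplies y[n-1-i].
    """
    a = [0.0] * order
    reflection = []
    err = r[0]
    for m in range(order):
        # Correlation not yet explained by the order-m predictor
        acc = r[m + 1] - sum(a[i] * r[m - i] for i in range(m))
        k = acc / err
        new_a = a[:]
        new_a[m] = k
        for i in range(m):
            new_a[i] = a[i] - k * a[m - 1 - i]
        a = new_a
        reflection.append(k)
        err *= (1.0 - k * k)   # residual energy shrinks with each order
    return a, reflection, err

# Hand-checkable example: r = [1, 0.5, 0.1] gives a = [0.6, -0.2]
coeffs, refl, residual = levinson_durbin([1.0, 0.5, 0.1], 2)
```

In a coder, `autocorrelation(frame, M)` would supply `r` for each 180-sample segment, usually after windowing the segment.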

Page 24: Speech & Audio Coding - LiU

Estimating the Parameters

• Alternatively, in case of a non-stationary signal:

where

• This is called the autocovariance method.
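For reference, a common textbook form of this method (usually called the covariance method) is given below. It is offered as an assumption about what the slide showed; the symbols $c_{ij}$, $n_0$ and $N$ are introduced here for illustration only.

```latex
\sum_{i=1}^{M} a_i \, c_{ij} = c_{0j}, \qquad j = 1, \dots, M,
\qquad \text{where} \qquad
c_{ij} = \sum_{n=n_0}^{n_0+N-1} y_{n-i} \, y_{n-j}
```

Unlike the acf values, $c_{ij}$ depends on $i$ and $j$ separately rather than only on $|i-j|$, so the matrix is symmetric but not Toeplitz.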

Page 25: Speech & Audio Coding - LiU

Example

• Coding of parameters using LPC10 (1984):

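Parameter            Bits
v/uv                 1 bit
Pitch                6 bits
Voiced filter        46 bits
Unvoiced filter      46 bits
Synchronization      1 bit
Sum                  54 bits → 2.4 kbit/s

(Each frame carries either the voiced or the unvoiced filter, so the sum is 1 + 6 + 46 + 1 = 54 bits.)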
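The 2.4 kbit/s figure follows from the frame size. The short calculation below assumes the 180-sample frames at 8 kHz from the earlier LPC example; the dictionary keys are just labels.

```python
FRAME_SAMPLES = 180      # assumed frame length, from the LPC example
SAMPLE_RATE_HZ = 8000

# One filter description per frame (voiced or unvoiced, never both)
BITS = {"v/uv": 1, "pitch": 6, "filter": 46, "sync": 1}

bits_per_frame = sum(BITS.values())                              # 54
bit_rate = bits_per_frame * SAMPLE_RATE_HZ / FRAME_SAMPLES       # 2400.0 bit/s
compression_vs_pcm = 64000 / bit_rate                            # about 26.7
```

So the parametric coder needs roughly 1/27 of the 64 kbit/s used by 8-bit PCM.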

Page 26: Speech & Audio Coding - LiU

The Vocal Tract Filter

• Different representations:

– LPC parameters

– PARCOR (Partial Correlation Coefficients)

– LSF (Line Spectrum Frequencies)

Page 27: Speech & Audio Coding - LiU

Code Excited Linear Prediction Coding (CELP)

Encoding:

• LPC analysis → V(z)

• Define a perceptual weighting filter. This permits more noise at formant frequencies, where it will be masked by the speech.

• Synthesise speech using each codebook entry in turn as the input to V(z).

• Calculate the optimum gain to minimise the perceptually weighted error energy in the speech frame.

• Select the codebook entry that gives the lowest error.

• Transmit LPC parameters and codebook index.

Decoding:

• Receive LPC parameters and codebook index.

• Re-synthesise speech using V(z) and the codebook entry.

Performance:

• Three operating points, each given as bit rate (kbit/s), MOS, delay (ms) and complexity (MIPS); the numeric values are illegible in the source.
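The analysis-by-synthesis search at the heart of CELP can be sketched as follows. This simplified version omits the perceptual weighting filter and any filter memory carried over from the previous frame, and it uses a small random codebook with hypothetical coefficients; it only illustrates the "try every entry, compute the best gain, keep the lowest error" loop.

```python
import random

def synthesize(a, excitation):
    """Run an excitation through V(z): y[n] = sum_i a[i-1] * y[n-i] + e[n]."""
    y = []
    for n, e in enumerate(excitation):
        acc = e
        for i, ai in enumerate(a, start=1):
            if n - i >= 0:
                acc += ai * y[n - i]
        y.append(acc)
    return y

def dot(u, v):
    return sum(p * q for p, q in zip(u, v))

def search_codebook(target, a, codebook):
    """Return (index, gain, error) of the entry whose scaled synthesis best matches target."""
    best = None
    for index, entry in enumerate(codebook):
        s = synthesize(a, entry)
        energy = dot(s, s)
        if energy == 0.0:
            continue
        gain = dot(target, s) / energy                      # optimum gain for this entry
        error = dot(target, target) - gain * dot(target, s)  # residual error energy
        if best is None or error < best[2]:
            best = (index, gain, error)
    return best

def decode(a, codebook, index, gain):
    """Decoder side: look up the entry, filter it with V(z) and scale it."""
    return [gain * v for v in synthesize(a, codebook[index])]

rng = random.Random(1)
codebook = [[rng.gauss(0.0, 1.0) for _ in range(40)] for _ in range(16)]
a = [1.3, -0.8]                                             # hypothetical V(z) coefficients
target = [2.5 * v for v in synthesize(a, codebook[5])]      # frame built from entry 5
index, gain, error = search_codebook(target, a, codebook)
reconstructed = decode(a, codebook, index, gain)
```

Only `index`, `gain` and the LPC parameters cross the channel; the decoder holds the same codebook.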

Page 28: Speech & Audio Coding - LiU

Examples

• G.728

– V(z) is chosen as a large FIR filter (M ~ 50).

– The gain and FIR parameters are estimated recursively from previously received samples.

– The code book contains 127 sequences.

• GSM

– The code book contains regular pulse trains with variable frequency and amplitudes.

• MELP

– Mixed excitation linear prediction

– The code book is combined with a noise generator.

Page 29: Speech & Audio Coding - LiU

Other Variations

• SELP – Self Excited Linear Prediction

• MPLP – Multi-Pulse Excited Linear Prediction

• MBE – Multi-Band Excitation Coding

Page 30: Speech & Audio Coding - LiU

Quality Levels

Bitrate           Bandwidth        Quality level
<4 kbit/s         -                Synthetic quality
4 – 16 kbit/s     -                Communication quality
16 – 64 kbit/s    300 – 3400 Hz    Network (toll) quality
>64 kbit/s        10 kHz           Broadcast quality

Page 31: Speech & Audio Coding - LiU

Subjective Assessment

• MOS (Mean Opinion Score): the result of averaging opinion scores from a set of untrained subjects (group size range illegible in the source).

• They rate the quality from 1 to 5 (1 bad, 2 poor, 3 fair, 4 good, 5 excellent).

• A MOS of 4 or higher defines good or toll quality (network quality): the reconstructed signal is generally indistinguishable from the original.

• A MOS between [values illegible] defines communication quality – telephone communications.

• A MOS between [values illegible] implies synthetic quality.

• In digital communications speech quality is classified into four general categories, namely: broadcast, network or toll, communications, and synthetic.

• Broadcast wideband speech – high quality "commentary" speech – generally achieved at rates above 64 kbit/s.

Page 32: Speech & Audio Coding - LiU

Subjective Assessment

• DRT (Diagnostic Rhyme Test): listeners should recognise one of the two possible words in a set of rhyming pairs (e.g. meat/heat).

• DAM (Diagnostic Acceptability Measure): trained listeners judge various factors, e.g. muffledness, buzziness, intelligibility.

[Figure: quality versus data rate (8 kHz sampling rate)]