To appear in ICSPAT 2000 Proceedings, October 16-19, 2000, Dallas, TX.
Abstract
This paper describes a CODEC implementation on an ultra low-power miniature application specific signal processor (ASSP) designed for mobile audio signal processing applications. The CODEC records speech to and plays back from a serial flash memory at data rates of 16 and 8 kb/s, with a bandwidth of 4 kHz. This CODEC consumes only 1 mW in a package small enough for use in a range of demanding portable applications. Results, improvements and applications are also discussed.
1. Introduction
Speech coding is ubiquitous. Increasing demands for portability and fidelity, coupled with the desire for reduced storage and bandwidth utilization, have increased the demand for and deployment of speech CODECs.
However, many current devices using speech CODEC technologies consume enough power to put inconvenient limits on battery lifetime. As well, CODECs are sometimes too large for mobile applications.
The CODEC presented here is an application of the programmable SmartCODEC platform. In this implementation, the memory chip consumes most (98%) of the power, and the entire package is extremely small.
2. System Overview
The CODEC was designed by interfacing the ultra low-power SmartCODEC hardware platform with a 32 Mbit serial flash. The SmartCODEC platform consists of an efficient, block-floating point, oversampled Weighted OverLap-Add (WOLA) filterbank, a software-programmable dual-Harvard 16-bit DSP core, two high fidelity 14-bit A/D converters, a 14-bit D/A converter and a flexible set of peripherals [1]. The system hardware architecture (Figure 1) was designed to enable memory upgrades. Removable memory cards or more power-efficient memory could be substituted for the serial flash memory.
The CODEC communicates with the flash memory over an integrated SPI port that can transfer data at rates up to 80 kb/s. For this application, the port is configured to block transfer frame packets every 14 ms.
Figure 1. System Block Diagram (SmartCODEC, containing the WOLA filterbank and programmable DSP core, with A/D, D/A and control connections to the audio device and an SSI or SPI interface to the serial flash memory)
The SmartCODEC also has 2 UARTs. The CODEC uses one UART for control signals (play / stop / record / time). The other is used for an integrated, on-chip debugging port.
In its most compact incarnation (Figure 2), the SmartCODEC platform measures 6.5 x 3.5 x 2.5 mm [2]. The flash memory, minimal interfacing hardware and a battery (when no other power source is present) occupy additional space.
An Ultra Low-power Miniature Speech CODEC at 8 kb/s and 16 kb/s
Robert Brennan, David Coode, Dustin Griesdorf, Todd Schneider
Dspfactory Ltd., 611 Kumpf Drive, Unit 200, Waterloo, Ontario, N2V 1K8, Canada
Figure 2. Relative size of SmartCODEC platform
A non-volatile 32-Mbit serial flash memory is currently employed to store the speech. This chip stores up to 56 minutes of speech at 8 kb/s and measures 20.2 x 7.5 x 1.2 mm [3].
The power consumption of the complete system is 55 mW for recording and 41 mW for playback. The SmartCODEC platform itself consumes less than 1 mA at 1 volt, only 2% of the total system power. The remainder of the power is consumed by the flash memory.
3. Coding Algorithm
The CODEC design follows traditional Sub-Band Coding (SBC) methods [4]. Speech is sampled at 16 kHz and analyzed into 16 complex frequency bands via a 32-point FFT and a 256-point prototype low pass filter. The filterbank is 2 times oversampled. This results in 32 words (8 kHz) of complex data, 2 words for each 500 Hz band.
The incoming bit rate of 16 kHz monaural speech is 256 kb/s. After analysis the oversampling increases this rate to 512 kb/s. The rate is subsequently reduced to critical sampling by careful decimation of the complex frequency data. The analysis data at this point is in the form of a DCT [5]. Discarding the frequencies above 4 kHz further reduces the data rate to 128 kb/s.
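The rate bookkeeping in this paragraph can be checked with a few lines of Python. This is only an arithmetic sketch: the 16-bit word size is inferred from the 16-bit DSP core and the quoted 256 kb/s input rate, and the variable names are illustrative.

```python
# Data-rate bookkeeping for the analysis stage (figures taken from the text).
SAMPLE_RATE_HZ = 16_000   # input sampling rate
BITS_PER_WORD = 16        # assumed word size (16-bit DSP core)
OVERSAMPLING = 2          # filterbank oversampling factor

input_rate = SAMPLE_RATE_HZ * BITS_PER_WORD        # raw monaural speech
oversampled_rate = input_rate * OVERSAMPLING       # after the WOLA analysis
critical_rate = oversampled_rate // OVERSAMPLING   # after decimation
band_limited_rate = critical_rate // 2             # keep 0-4 kHz of 0-8 kHz

for name, rate in [
    ("input", input_rate),
    ("oversampled", oversampled_rate),
    ("critically sampled", critical_rate),
    ("band-limited to 4 kHz", band_limited_rate),
]:
    print(f"{name}: {rate // 1000} kb/s")
```

The last figure is the rate that the quantizer then has to reduce to 16 or 8 kb/s.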
The remaining 8 words of data represent 1 ms of speech data. These 1 ms blocks are buffered into groups of 14. Each group comprises a 14 ms frame of speech data. Each frame includes its own bit allocation over all 8 bands and its own noise floor.
Figure 3 shows the design of the software data structures and processing modules for encoding speech. Each storage element is double-buffered, allowing a set of data to remain static during a processing frame. The other half of a buffer actively accumulates data in real-time.
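As a rough illustration of how tight the frame budget is, the sketch below computes the bits available per 14 ms frame at each rate. It assumes "kb/s" means 1000 bits per second and ignores how the side information is sized; the function name is illustrative.

```python
FRAME_MS = 14            # frame duration from the text
BLOCKS_PER_FRAME = 14    # one block per millisecond
BANDS = 8                # 500 Hz bands kept below 4 kHz

def frame_budget_bits(bit_rate_bps: int, frame_ms: int = FRAME_MS) -> int:
    """Total bits available for one frame at a given bit rate."""
    return bit_rate_bps * frame_ms // 1000

levels_per_frame = BLOCKS_PER_FRAME * BANDS   # 112 levels to encode

for rate in (8_000, 16_000):
    bits = frame_budget_bits(rate)
    print(f"{rate // 1000} kb/s: {bits} bits per frame, "
          f"{bits / levels_per_frame:.1f} bits per level before side information")
```

Under these assumptions the noise floor and bit allocation must fit in the same budget, so the average number of bits per level is lower than the printed figure.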
Figure 3. Speech Encoder (Receive Buffer of 224 words, Spectral Images SI One and SI Two, Noise Floor & Bit Allocation, Quantizer, Frame Packing, Transmit Buffer and Flash Memory; data moves in 14 ms frames)
As the Receive Buffer accumulates 14 blocks of data, a record of the maximum level in each band is kept as the Spectral Image (SI). The Spectral Image reveals the relative magnitude of all 8 frequency components. A parallel algorithm calculates a flat noise floor and a near-optimal bit allocation across 8 bins based on this Spectral Image. This data configures the quantizer to encode the spectral levels from the Receive Buffer so that distortion is minimized. The bits represent the level as a multiplier of the noise floor.
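A minimal sketch of two of the steps described above: tracking the Spectral Image and expressing a level as a multiplier of the noise floor. The paper does not give the noise floor or bit allocation calculation, so both are taken as inputs here. The function names and the two's complement code range are assumptions made for illustration.

```python
from typing import List

BANDS = 8

def spectral_image(frame: List[List[float]]) -> List[float]:
    """Maximum absolute level seen in each band over the blocks of a frame."""
    return [max(abs(block[b]) for block in frame) for b in range(BANDS)]

def quantize(level: float, noise_floor: float, bits: int) -> int:
    """Express a level as a signed integer multiple of the noise floor."""
    if bits == 0:
        return 0                                  # band receives no bits
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    code = round(level / noise_floor)
    return max(lo, min(hi, code))                 # clip to the allocated range

def dequantize(code: int, noise_floor: float) -> float:
    """Decoder side: multiply the code by the noise floor."""
    return code * noise_floor

# 14 blocks of 8 band levels with alternating sign.
frame = [[(-1) ** k * (k + b) for b in range(BANDS)] for k in range(14)]
print(spectral_image(frame))
print(dequantize(quantize(10.2, 2.0, 4), 2.0))
```

With a noise floor of 2.0 and 4 bits, a level of 10.2 is coded as 5 and decoded as 10.0, so the error stays below one noise floor step unless the level clips.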
Frames of compressed data are packed with the noise floor first, followed by the bit allocation and finally the quantizer output over 14 blocks. These frames are then sent to the flash memory via the SmartCODEC’s onboard SPI port for storage in the serial flash memory.
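The packing order described here can be sketched as a simple bit packer. The field order follows the text, but the widths are assumptions: an 8-bit noise floor code, a hypothetical 3-bit allocation field per band, and two's complement quantizer codes.

```python
from typing import List

def pack_frame(noise_floor_code: int, bit_alloc: List[int],
               codes: List[List[int]], alloc_field_bits: int = 3) -> bytes:
    """Pack noise floor, then bit allocation, then quantizer codes, MSB first."""
    acc = 0      # accumulated bit string held as an integer
    nbits = 0    # number of bits accumulated so far

    def put(value: int, width: int) -> None:
        nonlocal acc, nbits
        acc = (acc << width) | (value & ((1 << width) - 1))
        nbits += width

    put(noise_floor_code, 8)
    for width in bit_alloc:
        put(width, alloc_field_bits)
    for block in codes:                       # 14 blocks per frame
        for code, width in zip(block, bit_alloc):
            put(code, width)                  # zero-width bands emit nothing

    pad = (-nbits) % 8                        # zero-pad to a byte boundary
    return (acc << pad).to_bytes((nbits + pad) // 8, "big")

bit_alloc = [3, 3, 2, 2, 1, 1, 1, 0]
codes = [[0] * 8 for _ in range(14)]
codes[0][0] = -1
packed = pack_frame(0x66, bit_alloc, codes)
print(len(packed), packed[:5].hex())
```

In this example the frame occupies 214 bits before padding, which fits the 224 bits available per 14 ms frame at 16 kb/s.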
The noise floor is packed by storing a 4-bit exponent and a 4-bit mantissa, with two virtual bits (Figure 4). The “virtual bits” are two known bits which are not transmitted because they are constant. The sign of the noise floor (one virtual bit) is always positive and the most significant bit in the mantissa (other virtual bit) will always be 1. This noise floor compression was found to be extremely accurate, since the noise floor has to rise up over 30 dB before even 1 LSB quantization error occurs. Accuracy is important since the quantized noise floor is the basis for decoding each level.
Figure 4. Noise Floor Encoding (Virtual Bits) (example word 00000000 1 0110 10, labelled: sign bit, leading bits, known bit, factor, lost precision, virtual bits)
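One way to realise the encoding described above is a small floating-point format. The sketch below is an interpretation, not the authors' code: it assumes the exponent records the position of the leading 1 in a positive 16-bit value and the mantissa holds the four bits that follow it. The function names are illustrative.

```python
def encode_noise_floor(value: int) -> int:
    """Pack a positive 16-bit noise floor into a 4-bit exponent and 4-bit mantissa."""
    if not 1 <= value <= 0x7FFF:
        raise ValueError("noise floor must be a positive 16-bit value")
    exponent = value.bit_length() - 1                  # position of the leading 1
    if exponent >= 4:
        mantissa = (value >> (exponent - 4)) & 0xF     # four bits below the leading 1
    else:
        mantissa = (value << (4 - exponent)) & 0xF     # small values fit exactly
    return (exponent << 4) | mantissa

def decode_noise_floor(code: int) -> int:
    """Rebuild the value, restoring the untransmitted leading 1."""
    exponent, mantissa = code >> 4, code & 0xF
    significand = 0x10 | mantissa                      # virtual leading bit
    if exponent >= 4:
        return significand << (exponent - 4)
    return significand >> (4 - exponent)

value = 0b1011010                                      # leading 1, mantissa 0110, lost bits 10
code = encode_noise_floor(value)
print(f"{value} -> code 0x{code:02X} -> {decode_noise_floor(code)}")
```

Under this interpretation every value up to 32, roughly 30 dB above one LSB, is reproduced exactly, which is consistent with the accuracy remark in the text.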
In contrast to the encoding process, the decoding operation (Figure 5) is simple. Every 14 ms in playback mode, a compressed frame is fetched from the serial flash. The decoder first unpacks the noise floor, then the bit allocation over all 8 bands. From the bit allocation information, the decoder can properly parse the remaining bits in the frame (the encoded speech data levels from the quantizer). Multiplying the encoded levels by the noise floor recalculates the original levels in each band, which are sent to the SmartCODEC synthesis buffer every millisecond.
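The parsing order described above can be sketched as follows. The field widths are assumptions chosen for illustration (8-bit noise floor code, a hypothetical 3-bit allocation field per band, two's complement level codes), and the function names are not from the paper.

```python
from typing import List, Tuple

BANDS = 8
BLOCKS_PER_FRAME = 14

def unpack_frame(frame: bytes, alloc_field_bits: int = 3
                 ) -> Tuple[int, List[int], List[List[int]]]:
    """Read the noise floor code, then the bit allocation, then the level codes."""
    stream = int.from_bytes(frame, "big")
    remaining = len(frame) * 8                 # unread bits, consumed MSB first

    def take(width: int, signed: bool = False) -> int:
        nonlocal remaining
        if width == 0:
            return 0                           # band received no bits
        if width > remaining:
            raise ValueError("frame too short")
        remaining -= width
        value = (stream >> remaining) & ((1 << width) - 1)
        if signed and value >= 1 << (width - 1):
            value -= 1 << width                # undo two's complement
        return value

    noise_floor_code = take(8)
    bit_alloc = [take(alloc_field_bits) for _ in range(BANDS)]
    codes = [[take(w, signed=True) for w in bit_alloc]
             for _ in range(BLOCKS_PER_FRAME)]
    return noise_floor_code, bit_alloc, codes

def reconstruct(codes: List[List[int]], noise_floor: float) -> List[List[float]]:
    """Multiply each code by the noise floor to recover the band levels."""
    return [[c * noise_floor for c in block] for block in codes]
```

Each of the 14 inner lists returned by `reconstruct` corresponds to one 1 ms block handed to the synthesis stage.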
Figure 5. Decoder (Flash Memory, Decode, to Synthesis)
The system polls for control commands (play / record / stop) after each frame is processed. There is an internal state machine, whose states consist of sleeping, skipping to a certain time index in the flash, playing (decoding), recording (encoding) or encoding and decoding simultaneously. When encoding and decoding in parallel, the speech is still recorded to flash, but the decoded output is also present at the audio output.
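The state machine can be pictured with a short sketch. The states come from the text, but the command strings (in particular "time" and "record+play") and the transition rule are hypothetical, since the paper does not specify them.

```python
from enum import Enum, auto
from typing import Optional, Tuple

class State(Enum):
    SLEEPING = auto()
    SKIPPING = auto()          # seeking to a time index in the flash
    PLAYING = auto()           # decoding
    RECORDING = auto()         # encoding
    RECORD_AND_PLAY = auto()   # encoding and decoding in parallel

# Hypothetical mapping from a polled command to the state it selects.
COMMAND_TARGETS = {
    "stop": State.SLEEPING,
    "time": State.SKIPPING,
    "play": State.PLAYING,
    "record": State.RECORDING,
    "record+play": State.RECORD_AND_PLAY,
}

def next_state(state: State, command: Optional[str]) -> State:
    """Polled once per frame; no command (or an unknown one) keeps the state."""
    if command is None:
        return state
    return COMMAND_TARGETS.get(command, state)

def frame_actions(state: State) -> Tuple[bool, bool]:
    """Return (encode, decode) flags for the work done in one frame."""
    encode = state in (State.RECORDING, State.RECORD_AND_PLAY)
    decode = state in (State.PLAYING, State.RECORD_AND_PLAY)
    return encode, decode
```

Polling only at frame boundaries means a command takes effect within one 14 ms frame.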
Implementation has proven that the SmartCODEC platform has more than enough processing power to run encoding, decoding and write to the serial flash simultaneously.
4. Results
At 16 kb/s the CODEC was informally judged to be ‘good’ by both experts and those unfamiliar with speech coding. At 8 kb/s it was judged to be ‘suitable for low-fidelity communications’.
Due to the lossy nature of the compression algorithm, some distortions are expected. Coded speech has an associated “gurgling” quality, especially at vowel onset. Informal listening tests have compared this CODEC’s performance at 16 kb/s to other coders with a speech quality MOS of 3.4.
Segmental SNR measurements were done on the CODEC’s floating-point simulation at rates of 8, 16 and 32 kb/s (Table 1). Segmental SNR gives a good relative measure of coder performance where all distortions come from the same fundamental algorithm. The 32 kb/s simulation produced “transparent” results. The results of the fixed-point implementation on the SmartCODEC platform matched the results of the floating-point simulation.
Table 1. Segmental SNR measurements for the CODEC at various bit rates
Bit Rate    Seg. SNR
8 kb/s      15.28
16 kb/s     22.81
32 kb/s     33.24
Since the algorithm used is the same in all cases (other than the available bits), these MSE-based measurements show how the CODEC distorts less at higher data rates.
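For reference, segmental SNR is commonly computed as the mean of per-segment SNR values in dB. The sketch below follows that common definition. The segment length (224 samples, which is 14 ms at 16 kHz) and the handling of silent segments are assumptions, since the paper does not state how its measurement was configured.

```python
import math
from typing import Sequence

def segmental_snr_db(reference: Sequence[float], decoded: Sequence[float],
                     segment_len: int = 224) -> float:
    """Average the SNR in dB over consecutive fixed-length segments."""
    snrs = []
    for start in range(0, len(reference) - segment_len + 1, segment_len):
        ref = reference[start:start + segment_len]
        dec = decoded[start:start + segment_len]
        signal = sum(x * x for x in ref)
        noise = sum((x - y) ** 2 for x, y in zip(ref, dec))
        if signal == 0 or noise == 0:
            continue                      # skip silent or error-free segments
        snrs.append(10 * math.log10(signal / noise))
    if not snrs:
        raise ValueError("no usable segments")
    return sum(snrs) / len(snrs)

reference = [1.0] * 448
decoded = [0.9] * 448
print(round(segmental_snr_db(reference, decoded), 2))
```

Because each segment contributes equally, quiet passages influence the score as much as loud ones, which is why the measure is useful for comparing bit rates of the same algorithm.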
A coherence measurement between the original signal and the CODEC output (Figure 6) shows the relative distortions in each band. Coherence shows the energy in the output signal that is linearly related to the input. The “Ratio” (Figure 7) was calculated by Ratio = (1 - 1/R)^2 in each band, where R = (signal energy)/(quantization noise floor). The unusual shape of this speech signal’s overall spectrum (with an uncommon dip at 2 kHz) reveals the high correlation between energy in a band and distortion introduced in that band. The distortion for this speech signal at 2 kHz accounts for a majority of audible distortion. Furthermore, human hearing is extremely sensitive at 2 kHz, making any distortions all the more audible in this specific test case.
Figure 6. Coherence by 500 Hz Band at 16 kb/s
Figure 7. Ratio of Speech Signal Energy to Noise Floor in Each Band at 16 kb/s
Analysis shows that only 35% of available processing cycles are used while simultaneously encoding and decoding at the default core clock frequency of 1.28 MHz. The SmartCODEC is exceptionally efficient because the analysis and synthesis steps are performed (on the WOLA filterbank) in parallel with the DSP core, as it runs the coding algorithm.
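The 35% figure can be translated into a per-frame cycle budget with simple arithmetic. The sketch only restates the numbers in the text; the variable names are illustrative.

```python
CORE_CLOCK_HZ = 1_280_000   # default core clock from the text
FRAME_MS = 14               # frame duration
LOAD_PERCENT = 35           # reported load while encoding and decoding

cycles_per_frame = CORE_CLOCK_HZ * FRAME_MS // 1000
cycles_used = cycles_per_frame * LOAD_PERCENT // 100
cycles_per_block = CORE_CLOCK_HZ // 1000          # one block per millisecond

print(f"cycles available per 14 ms frame: {cycles_per_frame}")
print(f"cycles used by encode + decode:   {cycles_used}")
print(f"cycles available per 1 ms block:  {cycles_per_block}")
```

Roughly two thirds of each frame's cycles therefore remain free for other tasks at the default clock.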
The psychoacoustic model added to this CODEC produced results that did not meet with expectations. The model used the output of the WOLA filterbank directly, instead of using a higher resolution FFT. Our results showed that 500 Hz bands did not give sufficient frequency resolution to generate meaningful masking thresholds. Thus, there is no psychoacoustic model included in the current version.
5. Improvements
Several improvements have been made to the basic coding algorithm.
A smoothing technique was applied to the noise floor to limit how much it changed between frames. This feature is useful for preventing noise floor fluttering. The SNR is decreased with smoothing, but the perceived quality was judged to be better by trained listeners.
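The paper does not describe the smoothing rule, so the following is only one plausible form: clamp the frame-to-frame change of the noise floor to a maximum step in dB. The function name and the `max_step_db` parameter are hypothetical.

```python
def smooth_noise_floor(previous: float, target: float,
                       max_step_db: float = 3.0) -> float:
    """Limit how far the noise floor may move from the previous frame's value."""
    max_ratio = 10 ** (max_step_db / 20)       # dB step as an amplitude ratio
    upper = previous * max_ratio
    lower = previous / max_ratio
    return min(upper, max(lower, target))

# A sudden jump in the computed noise floor is spread over several frames.
floor = 100.0
for _ in range(3):
    floor = smooth_noise_floor(floor, 400.0, max_step_db=6.0)
    print(round(floor, 1))
```

A limiter like this trades some SNR for stability, because the smoothed noise floor is no longer the best fit for each individual frame.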
Experimental attempts were made to statically equalize the noise floor to match the energy spectrum of the speaker. The results of this shaping produced negligible improvements in overall distortion, and required a priori knowledge of the input spectrum.
Research suggests that 1-4 kHz is the “sweet spot” for human hearing, and that quantization noise should be shifted out of this frequency range [6]. A static shaping of the noise floor to weight bands 3-8 has the potential to improve perceived quality.
Compressing frames whose energy falls under a noise threshold with silence is possible. A reserved escape word could be transmitted at the beginning of a frame to indicate silence, and the next frame would start at the subsequent word. This introduces a variable bit rate which is guaranteed to be less than or equal to the fixed rate.
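The escape-word idea can be sketched as follows. The escape value, the 14-word frame size (224 bits at 16 kb/s stored as 16-bit words) and the function names are hypothetical, and the sketch assumes a normally coded frame never begins with the escape word.

```python
from typing import List, Optional

SILENCE_ESCAPE = 0xFFFF   # hypothetical reserved word marking a silent frame
WORDS_PER_FRAME = 14      # assumed size of a normally coded frame

def encode_stream(frames: List[List[int]], energies: List[float],
                  threshold: float) -> List[int]:
    """Replace frames whose energy is below the threshold with one escape word."""
    words: List[int] = []
    for frame, energy in zip(frames, energies):
        if energy < threshold:
            words.append(SILENCE_ESCAPE)
        else:
            words.extend(frame)
    return words

def decode_stream(words: List[int]) -> List[Optional[List[int]]]:
    """Return one entry per frame; None marks a frame to be played as silence."""
    frames: List[Optional[List[int]]] = []
    pos = 0
    while pos < len(words):
        if words[pos] == SILENCE_ESCAPE:
            frames.append(None)
            pos += 1                      # next frame starts at the next word
        else:
            frames.append(words[pos:pos + WORDS_PER_FRAME])
            pos += WORDS_PER_FRAME
    return frames
```

A silent frame costs one word instead of fourteen, so the stream can never be longer than the fixed-rate version.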
Simulations of the CODEC at 32 kb/s showed that there were enough bits available to expand the bandwidth to 8 kHz and obtain higher fidelity recordings.
Another CODEC is currently under development that will incorporate a higher bit rate and a psychoacoustic model very similar to that of MPEG [7]. This CODEC will share the low power and miniature size of the SmartCODEC platform, but will be able to encode music and speech with high fidelity.
6. Applications
The CODEC can be used in a number of ultra low-power portable applications. For example, it may be used as a speech-recording accessory for a personal digital assistant (PDA). Given its small size and ultra-low power consumption, it is also suitable for use in 3G cellular telephones or other wireless applications such as two-way pagers that incorporate voice attachments. It can also be useful in short distance, voice-band wireless communications.
The CODEC presented here illustrates the potential of the SmartCODEC for use in portable devices including wireless applications. The low bit rate is ideal for wireless transmission as power is saved both in the encoding and transmission. High fidelity reproduction of recorded speech is possible using the platform. The small size and ultra low-power consumption make this an ideal solution for adding voice recording/playback capabilities to existing devices. New devices can incorporate the SmartCODEC hardware and software directly for further power savings. The inclusion of external memory ensures ample and upgradeable storage for a variety of applications.
In wireless devices, voice attachments and instant voice messages can be saved and played back through the memory chip. The advent of smaller, more efficient memories will complement the current system and expand its capabilities.
7. Conclusions
This paper presents an ultra low-power miniature CODEC implemented on the SmartCODEC platform. The CODEC portion of the system (not including flash memory) measures only 6.5 x 3.5 x 2.5 mm and consumes less than 1 mW. This CODEC is suitable for a number of personal audio and wireless applications.
The CODEC is pleasant to use at 16 kb/s (an estimated speech quality MOS of 3.4 from informal tests) and remains useful at 8 kb/s, yielding nearly one hour of recorded speech on a single 32 Mbit flash memory. This application uses less than half the computational capacity of the SmartCODEC platform.
References
[1] Schneider, T., Brennan, R.L., “An Ultra Low-Power Programmable DSP System for Hearing Aids and Other Audio Applications”, Proc. ICSPAT 1999, Orlando, FL, November 1999.
[2] Brennan, R.L., Schneider, T., “A Flexible Filterbank Structure for Extreme Signal Manipulations in Digital Hearing Aids”, Proc. ISCAS-98, Monterey, CA, June 1998.
[3] ATMEL Corporation, 32-Megabit 2.7V Serial DataFlash AT45D321, Data Sheet AT45D321.
[4] Jayant, N.S., Noll, P., Digital Coding of Waveforms, Prentice-Hall Inc., 1984.
[5] Fliege, N.J., “Modified DFT Polyphase SBC Filter Banks with Almost Perfect Reconstruction”, Proc. ICASSP-94, Adelaide, Australia, pp. III-149 to III-152.
[6] Dipert, B., “Digital Audio Breaks the Sound Barrier”, EDN Magazine, July 20, 2000, pp. 71-90.
[7] ISO/IEC JTC1/SC29/WG11 MPEG, “Information Technology - Generic coding of moving pictures and associated audio information - Part 3: Audio”, IS 13818-3, 1994 (“MPEG-2”).