The Implementation of Chinese and English Bilingual Speech ...paper implemented speaker-independent...
Transcript of The Implementation of Chinese and English Bilingual Speech ...paper implemented speaker-independent...
Abstract—This paper presents a high performance
embedded non-specific, medium vocabulary Chinese-English
bilingual speech recognition system using the continuous
density hidden Markov model and a two-pass search strategy
based on a 16-bit fixed-point digital signal processing (DSP).
This system selecting MFCC parameters as recognition feature.
Improve the system real-time through a dedicated hardware
circuit design. We extract specialized hardware co-processing
circuit characterized by structural features through
abstracting algorithm critical path that speedy computation
concerns much in speech recognition, so as to greatly enhance
the overall performance by little chip costs. The experimental
result suggested that the identification rate is 97.6% when
entries sum is 600. The characteristic storage space was
reduced 31%, and the real-time rate of two stage identification
is 0.7.
Index Terms—Speech recognition, system on chip, dsp,
model selection.
I. INTRODUCTION
With the development of semi-conductor industry and
speech recognition and coding technology, the demand for
embedded speech recognition product in the field of
intelligent electronics, toys and automotive electronics has
been increasing. However, within the limit in hardware
resources and algorithm complexity, traditional embedded
speech recognition product is only applied to the low end
market, being far away from meeting the demand of speech
recognition product in the field of electronic market. This
paper implemented speaker-independent chinese-english
bilingual speech recognition system with high performance
through improving algorithm and the deep hardware and
software collaborative design.
Because of the limit in arithmetic speed and storage space,
the paper employed a two stage identification strategy based
on CHMM [1]-[3] and chose proper eigenvector and model
to reduce the computing complexity and storage space in
order to improve the efficiency of the recognition and
real-time capabilities. Consequently, the high performance,
speaker-independent, medium vocabulary chinese-english
bilingual speech recognition system was developed in the
embedded system [4]-[7] with the 16 bits fixed-point DSP as
Manuscript received April 1, 2013; revised July 18, 2013. This work was
supported in Project 61273268 supported by NSFC and Project 61005019
supported by NSFC.
Shunli Ding is with Tsinghua National Laboratory for Information
Science and Technology, Beijing, China (e-mail: [email protected]).
Hong Cao is with HELIOS-ADSP Technology Co.Ltd, Beijing, China
(e-mail: [email protected]).
Jia Liu is with EE Department of Tsinghua University, Beijing, China
(e-maill:[email protected]).
the core. This system designed dedicated co-processor circuit
on chip based on the characteristics of speech recognition
algorithm, greatly improving the system performance
II. HARDWARE PLATFORM
The hardware platform of the speech recognition
system-on-chip is designed by the laboratory and chip design
companies, and its structure is shown in Fig. 1.
64KB SRAMPower
management
16bit DSPCommunicati
on interface
ADC DAC
Fig. 1. Speech recognition system-on-chip structure.
This chip with Harvard bus, four pipelined architecture, it is
composed of a 16 bit DSP,64KB RAM, 256 KB ROM.
Required for speech recognition in the audio amplification
circuit and the ADC, DAC, the automatic gain control (AGC)
circuit and other modules are integrated on the chip. The chips
specifically designed for speech recognition applications,
designed to effectively reduce hardware overhead, lower
application costs through hardware and software co-design.
As the speech recognition algorithm is in high demand for
embedded system chip resources, in the dissertation, some
logic abstraction has been applied to the critical path which
requires more computing performance, and the specific
hardware co-processor circuit with the structural
characteristics has been extracted. Software algorithm has
been treated as hardware, the overall system operation
efficiency has been largely improved at a smaller chip cost
increase. As a result, the comprehensive price quality will be
leading in the domestic industry. The introduction for specific
co-processor circuit function module was demonstrated in Fig.
2.
Array unit: to complete the function of sequence
accumulation, was able to improve FFT and MFCC features
effectively as well as increasing the speed of GMM
calculating.
Shunli Ding, Hong Cao, and Jia Liu
The Implementation of Chinese and English Bilingual
Speech Recognition System-on-Chip
International Journal of e-Education, e-Business, e-Management and e-Learning, Vol. 3, No. 6, December 2013
432DOI: 10.7763/IJEEEE.2013.V3.273
Array unit快速资料搬移器
32 bit divider
Seeker for
maximumSeeker for minimum
Rapid mover of date
xDSP
Viterbi arithmetic
unit
Fig. 2. Speech recognition co-processing circuit function structure.
Rapid mover of data: to transfer the data from
OTP/SRAM to SRAM rapidly, so as to improve the
efficiency of the model and netlist.
Seeker for extreme data: to seek and identify the location
of maximum and minimum value, thus sorting capabilities
be realized by coordinating with the software circulation and
data exchange.
32-bit divider: to complete 32-bit signed/unsigned
number division. It would be time-consuming if the chip
division was operated by software, and the calculating
efficiency would be increased effectively by using hardware
divider.
Viterbi arithmetic unit: to do viterbi search arithmetic. In
the whole speech recognition algorithm, the time consumed
by Viterbi arithmetic accounts for 30% to 70% and the ratio
becomes more significant with the increase of the terms
number. Consequently, the comprehensive systematic
performance was significantly improved by substituting
hardware Viterbi arithmetic for software.
III. IMPLEMENTATION OF SOC
The basic framework of speech recognition system is
demonstrated in Fig. 3. Recognition process can be roughly
divided into feature extraction, model training and identify
network decoding. For isolate word speech recognition[8],
[9], the selection of acoustic model has nothing to do with
the hardware platform, so the model training can be
completed on PC computer. But the feature extraction and
network decoding must be implemented on hardware
platform.
Speech recognition system mainly has three kinds of
characteristic parameters: LPCC, MFCC and PLPC
[10]-[13]. We chose MFCC in this paper according to the
demand of recognition performance and limit of hardware
resource. Storage space and recognition time are increased
along with the increase of the characteristic dimension,
therefore needing to choose the proper characteristic
parameters for the sake of good real-time performance in the
embedded recognition system, rather than using 39
dimensions feather composed by MFCC [14], [15],
first-order difference, second-order difference, normalized
transient energy and differential power. We identified the
contribution of feature components to recognition
performance, and chose the 27 dimensions feather
components with biggest contribution, effectively reducing
the SRAM area resources consumption by storage
characteristic parameters.
Speech
inputADC
Feature
extraction
Viterbi
decodingDAC Result output
Fig. 3. The basic structure of the speech recognition system.
We must select the right acoustic model because of limit in
operating capacity and store space of unspecific speech
recognition system on chip. If the simpler monophon model
[16]-[19] is selected, although it will take up less space, it will
affect the operating accuracy. On the contrary, if the more
accurate triphone model is selected, although it can satisfy the
high recognition rate, it needs the large store space and
operating capacity which embedded system can’t bear.
So this paper introduced a two phase recognition algorithm
in order to assure the high recognition rate and low consume
of hardware. As shown in Fig. 4, the monophon model was
selected in the first stage to get the multiple candidate entries
by identifying quickly, and the triphone model was selected in
the second phase to get the final result based on the multiple
candidate entries. The two phases were relatively independent.
In the second phase, algorithm can reuse the memory; in the
first phase, the multiple candidate entries which is got by rapid
identification were few, and the calculation of viterbi
decoding is greatly reduced in the second phase. This design
of two phase identification frame took full advantage of store
space and improved real time rate of system.
First stage
(Monophone model)
Second stage
(Triphone model)
candidate entries
Feature
extraction
Fig. 4. Two-stage search structure.
IV. SOFTWARE DESIGN
The software architecture has adopted a hierarchical design
so as to optimize the portability of the embedded system,
which is demonstrated in Fig. 5.
1) The driver layer, which exists in the form of interrupt
service in more cases, such as A / D, D / A etc, contacts
directly with the hardware more. The design of driver
layer has improved the safety and reliability of the system.
2) The service layer: basic math functions required
repeatedly in the algorithm are encapsulated in the
Universal module layer, so that the speed of the whole
system operation can be improved effectively through the
optimization of the basic functions.
3) The function layer, including the feature extraction, post
processing characteristics, GMM score calculation, the
Viterbi decoding etc, is the core algorithm layer of the
speech recognition, determines the recognition rate and
timeliness of the system.
International Journal of e-Education, e-Business, e-Management and e-Learning, Vol. 3, No. 6, December 2013
433
4) The application layer is the highest level of
system-on-chip, controls the system overall framework
and calls different modules. Application layer does not
care about the specific method of the functional modules,
only to call the appropriate function to achieve specific
functions due to the particular need.
The layered design software has improved the
maintainability and scalability of the system-on-a-chip
effectively, large-scale changes to the system subject are not
needed when the function or module adjusts, thus reduces
the difficulty of the secondary development.
Application layer
Hardware
Driver layer
Service layer
Function layer
Fig. 5. Software division.
V. PERFORMANCE
As mentioned before, the study found the amount of the
GMM score calculation and Viterbi decoding operations in a
high proportion of the speech recognition algorithm, in this
article, dedicated hardware co-processing circuit was
designed to enhance the system performance. Examination
has been made to test the improvement of computing array
and Viterbi hardware circuit after the completion of
System-on-chip system. The experimental results are as
follows:
Under the minimum SRAM application conditions, with
30 terms, the real-time rate has reached to 0.87 when using
computing array, if not, the rate would have increased to
2.23. It means the use of computing array improved the
GMM score calculation speed significantly.
In order to test the effect of the Viterbi hardware circuit
module on the overall system performance, the article has
done the test based on 100 words, 600 words and 2000
words respectively, as the results shown in Table I.
TABLE I: PERFORMANCE COMPARISON OF HARDWARE VITERBI AND
SOFTWARE VITERBI
100 Words 600 Words 2000 Words
Hardware Viterbi 0.07 0.41 1.42
Software Viterbi 0.35 2.1 7.1
From the above table, the real-time rate will have five
times increase or so by using Viterbi hardware circuit
module compared to software operations only. The
improvement of embedded system overall performance is
very impressive.
VI. CONCLUSION
This paper introduced a Embedded chinese-english
bilingual speech recognition system based on 16bit
fixed-point DSP. This system used a two phase recognition
algorithm. The 208 state monophon model was selected in the
first phase to identify quickly and get eight candidate entries;
the 358 state triphone model was selected in the second phase
to get the final result. Identification capacity has been
improved, memory space and computational complexity have
been reduced through hardware specialized module design,
select of feature vector and model state share. The
experimental result suggested that the identification rate is
97.6% when entries sum is 600. The characteristic storage
space was reduced 31%, and the real time rate of two phase
identification.
ACKNOWLEDGMENT
Thanks to Zhang Weijiang and MAX who gave help and
guidance during the process of system implementation.
REFERENCES
International Journal of e-Education, e-Business, e-Management and e-Learning, Vol. 3, No. 6, December 2013
[1] F. Wessel et al., “Confidence measures for large vocabulary continuous
speech recognition,” in Proc. IEEE Transactions on speech and audio
processing, Salt Lake City, 2001, pp. 288-298.
[2] M. Makoto et al, “Single-chip speaker independent voice recognition
system,” ICASSP, in Proc. IEEE International Conference on Acoustics,
Speech and Signal Processing, pp. 377-380, 1986.
[3] J. Park and H. Ko, “Effective acoustic model clustering via decision-tree
with supervised learning,” Speech Communication, vol. 46, no. 1, Press,
pp. 1-13, May 2005.
[4] A. Zgank, Z. Kacic, and B. Horvat, “Data driven generation of broad
classes for decision tree construction in acoustic modeling,” in Proc.
Euro-speech, pp. 2505-2508, 2003.
[5] M. Yuan, T. Lee, P. C. Ching, and Y. Zhu, “Speech recognition on DSP:
issues on computational efficiency and performance analysis,”
Microprocessors and Microsystems, vol. 30, no. 3, pp. 155-164, May
2006.
[6] Z. Xuan, W. Rui, and Y. N. Chen, “Acoustic model comparison for an
embedded phonme-based mandarin name dialing system,” in Proc.
International Symposium on Chinese Spoken Language Processing,
Taipei, 2002, pp. 83-86.
[7] X. Zhu et al., “Multi-pass decoding algorithm based on a speech
recognition chip,” Chinese Acta Electronica Sinica, vol. 32, no. 1, pp.
150-153, 2004.
[8] H. J. Yang, J. Yao, and J. Liu, “A novel speech recognition
system-on-chip,” International Conference on Audio, Language and
Image Processing 2008 (ICALIP 2008), China, 2008, pp. 166-174.
[9] Z. Z. Yang et al., “An embedded system for speech recognition and
compression,” ISCIT2005, 2005.
[10] J. G. Ackenhusen, S. S. Ali, B. David, L. F. Rosa, and T. Reed,
“Single-board general-purpose speech recognition system,” AT&T
Technical Journal, vol. 65, no. 5, pp. 48-59, Sep.-Oct., 1986.
[11] C. B. Ngoc, J. J. Monbaron, and J. G. Michel, “An integrated voice
recognition system,” IEEE Journal of Solid-State Circuits, SC-18, pp.
75-81, Feb. 1983.
[12] M. W. Koo, C. H. Lee, and B. Hwang, “Speech recognition and
utterance verification based on a generalized confidence score,” in Proc.
IEEE Transactions on Speech and Audio Processing, Salt Lake City,
2001, pp. 821-832.
[13] E. Lleida and R. C. Rose, “Likelihood ratio decoding and confidence
measures for continuous speech recognition,” in Proc. Fourth
International Conference on Spoken Language Processing. Atlanta, GA.
1996, pp. 478-481.
[14] Rivlin and C. Michael, “Phone-dependent confidence measure for
utterance rejection. ICASSP,” in Proc. IEEE International Conference
on Acoustics, Speech and Signal Processing,” Atlanta, GA. 1996, pp.
515-517.
[15] D. Chandra, U. Pazhayaveetil, and P.D. Franzon, “Architecture for low
power large vocabulary speech recognition,” in Proc. International SOC
Conference, 2006 IEEE.
[16] A. Zgank, Z. Kacic, and B. Horvat, “Data driven generation of broad
classes for decision tree construction in acoustic modeling,” in Proc.
Eurospeech 2003, pp. 2505-2508.
[17] V. Valtcho, “Discriminative methods in HMM-based speech
recognition,” Ph.D thesis, St. John’s College, University of
Cambridge, 1995.
[18] S. Y. Suk, H. Y. Jung, and H. Y. Chung, “Automatic generation of
context-independent variable parameter models using successive state
and mixture splitting,” in Proc. Eurospeech’03, Geneva, 2003, pp.
2501-2504.
[19] E. M. Mumolo and F. Pazienti, “Large Vocabulary Isolated words
recognizer on a cellular array processor,” in Proc. ICASSP’89, 1989,
pp. 785-788.
Shunli Ding was born in Jining Shandong province in
1984.9. Now he is a postgraduate student in Institute of
Microelectronics of Tsinghua University, Beijing, China;
his resaerch aspect is speech signal processing and speech
recognition
Hong Cao was born in Gansu in 1984. He got Master's degree in Institute of
Microelectronics of Tsinghua University, Beijing, China. Now he is working
in HELIOS-ADSP Technology Co.Ltd his resaerch aspect is speech signal
processing and speech recognition.
Jia Liu was born in 1954, in Beijing. He is a professor and Doctor guider in
EE Department of Tsinghua University, Research Aspect: Speech
Recognition/ Speech coding, speech recognition chip design, Multimedia
signal communication system.
International Journal of e-Education, e-Business, e-Management and e-Learning, Vol. 3, No. 6, December 2013
435