Transcript of SSLIS

  • 1. By: Khalid El-Darymli (G0327887). Speech to Sign Language Interpreter System (SSLIS). Supervisor: Dr. Othman O. Khalifa. International Islamic University Malaysia, Kulliyyah of Engineering, ECE Dept.
  • 2. OUTLINE
    • Problem statement,
    • Research goal and objectives,
    • Main parts of the SSLIS,
    • ASR:
      • Sphinx 3.5,
      • General Structure: AM, Dictionary, LM,
      • and Decoding: the Viterbi beam search.
      • The SR engine specs.
    • Sign Language:
      • ASL, ASL alphabets & Signed English.
    • Structure & flow of SSLIS,
    • Parameter tuning & accuracy measurements,
    • SSLIS capabilities,
    • Conclusions, Shortcomings & Further work.
  • 3. Problem Statement
    • There is no free software, nor even a reasonably priced one, to convert speech into sign language in live mode.
    • Only one commercial product converts uttered speech in live mode to video sign language.
    • This software is called iCommunicator, and to purchase it a deaf person has to pay USD 6,499!
    IS IT FAIR?
  • 4. RESEARCH GOAL AND OBJECTIVES
    • Design and manipulation of a Speech to Sign Language Interpreter System.
    • The SW is open source and freely available, which in turn will benefit the deaf community.
    • To bridge the gap between deaf and non-deaf people in two senses: firstly, by using this SW for educational purposes for deaf people, and secondly, by facilitating communication between deaf and non-deaf people.
    • To increase independence and self-confidence of the deaf person.
    • To increase opportunities for advancement and success in education, employment, personal relationships, and public access venues.
    • To improve quality of life.
  • 5. Main Parts of the SSLIS
    [Diagram: Continuous Input Speech → Speech-Recognition Engine → Recognized Text → Sign Language Database → ASL Translation]
  • 6. Automatic Speech Recognition (ASR):
    • SR systems are classified along three dimensions: isolated vs. continuous speech, speaker-dependent vs. speaker-independent, and small vs. large vocabulary.
    • The task expected of our software calls for a large-vocabulary, speaker-independent, continuous speech recognizer.
    [Diagram: Input Voice → SR Engine → Recognized Text]
  • 7. Sphinx 3.5
    • It was originally started at CMU and has since been released as open-source SW.
    • It is still in development but already includes trainers, recognizers, AMs, LMs, and some limited documentation.
    • It works best on continuous speech and large vocabularies.
    • It does not provide an interface that would make integrating all of the components easier.
    • In other words, it is a collection of tools and resources that enables developers/researchers to build successful speech recognizers.
  • 8. The Structure of the SR Engine (LVCSR)
    [Diagram: Input audio passes through signal processing to yield feature vectors X = {x_1, x_2, ..., x_T}. The decoder combines three knowledge sources produced during training: the acoustic model P(A_1, ..., A_T | P_1, ..., P_k), the dictionary P(P_1, P_2, ..., P_k | W), and the language model P(W_n | W_1, ..., W_{n-1}). Hypothesis evaluation ranks the best hypotheses H = {W_1, W_2, ..., W_k} by the score P(X | W) * P(W) and outputs W_BEST.]
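    In equation form, the decoding criterion implied by the diagram is the standard one: the decoder searches for the word sequence that maximizes the product of the acoustic and language model scores,

        \[
        W_{\mathrm{BEST}} = \arg\max_{W} P(W \mid X) = \arg\max_{W} P(X \mid W)\, P(W),
        \]

    and in Sphinx 3.5 this search is carried out as a Viterbi beam search over the lexical tree.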
  • 9. SR ENGINE SPECS
    • FE SPECS:
    Parameter                        Default Value
    Sampling Rate                    16000.0 Hz
    Frame Rate                       100 frames/sec
    Window Length                    0.025625 sec
    Filterbank Type                  Mel filterbank
    Number of Cepstra                13
    Number of Mel Filters            40
    DFT Size                         512
    Lower Filter Frequency (f_l)     133.33334 Hz
    Higher Filter Frequency (f_h)    6855.4976 Hz
    Pre-Emphasis                     0.97
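    These front-end settings can be reproduced outside Sphinx; the sketch below (not part of SSLIS) uses the python_speech_features library, with the hypothetical input file utterance.wav assumed to be a 16 kHz mono recording:

        import scipy.io.wavfile as wav
        from python_speech_features import mfcc

        rate, signal = wav.read("utterance.wav")  # hypothetical 16 kHz mono file
        features = mfcc(
            signal,
            samplerate=16000,    # Sampling Rate
            winlen=0.025625,     # Window Length (sec)
            winstep=0.01,        # 100 frames/sec => 10 ms frame step
            numcep=13,           # Number of Cepstra
            nfilt=40,            # Number of Mel Filters
            nfft=512,            # DFT Size
            lowfreq=133.33334,   # Lower Filter Frequency (Hz)
            highfreq=6855.4976,  # Higher Filter Frequency (Hz)
            preemph=0.97,        # Pre-Emphasis coefficient
        )
        print(features.shape)    # (number of frames, 13)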
  • 10. KNOWLEDGE BASE
    Acoustic Model
    • It was trained using the MFC vectors derived from 140 hours of 1996 and 1997 Hub4 training data.
    • Each vector is 39-dimensional: 13 cepstra plus their first and second time derivatives.
    • The acoustic model consists of 3-state within-word and cross-word triphone HMMs with no skips permitted between states.
    • It is continuous-density and comprises 6000 senones with 8 Gaussians per state.
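    "No skips permitted between states" means each state may only loop on itself or advance to the next state; for a 3-state left-to-right HMM the transition matrix therefore has the form (a standard illustration, not taken from the Sphinx sources):

        \[
        A = \begin{pmatrix} a_{11} & a_{12} & 0 \\ 0 & a_{22} & a_{23} \\ 0 & 0 & a_{33} \end{pmatrix},
        \qquad a_{ij} = P(s_{t+1} = j \mid s_t = i)
        \]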
  • 11. DICTIONARY
    • We are using the CMU dictionary (v. 0.6).
    • It is a machine-readable pronunciation dictionary for North American English that contains over 125,000 words and their transcriptions.
    • It has mappings from words to their pronunciations in the given phoneme set, which comprises 39 phonemes.
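    For illustration, each dictionary line maps a word to its phoneme string, with digits marking vowel stress; two representative entries in the CMUdict format:

        HELLO  HH AH0 L OW1
        SIGN   S AY1 N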
  • 12. LM
    • It was taken from CMU open source resources.
    • It is a trigram model, which has been built for tasks similar to broadcast news.
    • The vocabulary covers 64,000 words.
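    A trigram model approximates the probability of a word sequence by conditioning each word on only its two predecessors:

        \[
        P(w_1, \ldots, w_N) \approx \prod_{n=1}^{N} P(w_n \mid w_{n-2}, w_{n-1})
        \]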
  • 13. SIGN LANGUAGE
    • Sign language is a communication system using gestures that are interpreted visually.
    • As a whole, sign languages share the same modality, the sign, but they differ from country to country.
  • 14. AMERICAN SIGN LANGUAGE ( ASL )
    • ASL is the dominant sign language in the US, anglophone Canada and parts of Mexico.
    • Currently, approximately 450,000 deaf people in the United States use ASL as their primary language.
    • ASL signs follow a certain order, just as words do in spoken English. However, in ASL one sign can express a meaning that would require several words in speech.
    • The grammar of ASL uses spatial locations, motion, and context to indicate syntax.
  • 15. ASL ALPHABETS
    • It is a manual alphabet representing all the letters of the English alphabet, using only the hands.
    • Making words using a manual alphabet is called fingerspelling .
    • Manual alphabets are a part of sign languages.
    • For ASL, the one-handed manual alphabet is used.
    • Fingerspelling is used to complement the vocabulary of ASL when spelling individual letters of a word is the preferred or only option, such as with proper names or the titles of works.
    [Chart: the one-handed manual alphabet for the letters Aa through Zz]
  • 16. SIGNED ENGLISH ( SE )
    • SE is a reasonable manual parallel to English.
    • The idea behind SE and other signing systems parallel to English is that deaf people will learn English better if they are exposed, visually through signs, to the grammatical features of English.
    • SE uses two kinds of gestures: sign words and sign markers .
    • Each sign word stands for a separate entry in a Standard English dictionary.
    • The sign words are signed in the same order as words appear in an English sentence. Sign words are presented in singular, non-past form.
    • Sign markers are added to these basic signs to show, for example, that you are talking about more than one thing or that some thing has happened in the past.
    • When no sign word represents the intended word, the manual alphabet can be used to fingerspell it.
    • Most signs in SE are taken from American Sign Language, but they are used in the same order as English words and with the same meanings.
  • 17. ASL vs. SE (an Example)
    English sentence: "It is alright if you have a lot."
    [Images: the ASL translation shown beside the SE translation, the latter glossed word for word as IT IS ALL RIGHT IF YOU HAVE A LOT]
  • 18. DEMONSTRATION OF THE ASL IN OUR SW
    [Flowchart: The recognized word (the SR engine's output) is checked against a database of 2,600 prerecorded ASL video clips. If the word is non-basic, its basic word is first extracted. If the basic word is within the ASL database vocabulary, the equivalent ASL video clip of the word is played, with a suitable marker appended for a non-basic input word. If none of the database contents matches the basic word, the original input word is fingerspelled with the American Manual Alphabet.]
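    A minimal sketch of that lookup logic, assuming a toy database and hypothetical names throughout (the actual SSLIS implementation differs):

        # Toy sign database mapping basic words and markers to clip files.
        SIGN_DB = {"cat": "cat.avi", "s-marker": "plural.avi"}

        def extract_basic_word(word):
            """Toy root extraction: strip a plural 's' and report the marker."""
            if word.endswith("s") and word[:-1] in SIGN_DB:
                return word[:-1], "s-marker"
            return word, None

        def translate_word(word):
            basic, marker = extract_basic_word(word.lower())
            if basic in SIGN_DB:
                clips = [SIGN_DB[basic]]            # prerecorded ASL clip
                if marker:
                    clips.append(SIGN_DB[marker])   # appended sign marker
                return clips                        # play these in order
            # No match: fingerspell the original word, letter by letter
            return [f"{letter}.avi" for letter in word.lower()]

        print(translate_word("cats"))  # ['cat.avi', 'plural.avi']
        print(translate_word("dog"))   # ['d.avi', 'o.avi', 'g.avi']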
  • 19. STRUCTURE AND FLOW OF SSLIS
    • Flowchart of the main program:
    [Flowchart: On program start, the software initializes the signs database, the lists of irregular plurals and irregular past-participle verbs, the graphical user interface, and the decoder class. Until Exit is clicked, the user selects a program to execute: Decode, Live Pretend, or Live Decode. The selected program is run; if it is Live Decode, a button to stop it is enabled, and clicking that button stops Live Decode. The program output is shown and the main loop waits for the next event from the class. When Exit is clicked, any running Live Decode is stopped and the program ends. (Continued on the next slide.)]
  • 20.
    • Flowchart of the class procedure:
    [Flowchart: The class runs in the background and waits for INFO events. When a Live Decode starting event is received, a message box prompts the user to press ENTER to start recording live speech. When a word lattice entry, ending, or total is received, the class calls the program function AddWordLattice, adds the entry to the appropriate table, and displays the speech-to-text output. A word hypothesis entry triggers AddWordHypothesis, a total-hypotheses entry triggers AddTotalHypothesis (each adding its entry to the appropriate table), and a total-frames entry is displayed in the appropriate position.]
  • 21. PARAMETER TUNING & ACCURACY MEASUREMENTS
    TUNING THE PRUNING BEHAVIOUR:
    • -beam : Determines which HMMs remain active at any given point (frame) during recognition. (Based on the best state score within each HMM.)
    • -pbeam : Determines which active HMM can transition to its successor in the lexical tree at any point. (Based on the exit state score of the source HMM.)
    • -wbeam : Determines which words are recognized at any frame during decoding. (Based on the exit state scores of leaf HMMs in the lexical trees.)
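    For illustration only (the beam values here are hypothetical, and the acoustic model, dictionary, LM, and audio arguments are elided), the three beams are passed to the Sphinx-3 decoder on its command line; smaller values widen the beam, trading decoding speed for accuracy:

        sphinx3_decode -beam 1e-55 -pbeam 1e-50 -wbeam 1e-35 ...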
  • 23. TUNING LM-RELATED PARAMETERS:
    • -lw : The language weight.
    • -wip : The word insertion penalty.
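    In the usual log-domain formulation (a standard description, not a quote from the Sphinx documentation), both parameters enter the score of an N-word hypothesis W as

        \[
        \log P(X \mid W) + lw \cdot \log P(W) + N \cdot \log(wip),
        \]

    so a larger -lw weights the LM more heavily against the acoustic evidence, while -wip < 1 penalizes each inserted word.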
  • 24. SSLIS CAPABILITIES
    • Real time speech to text to video sign language.
    • Text to video sign language.
    • Automatic WER calculation (see the formula after this list).
    • Text to computer-generated voice with synchronized lips.
    • Speed control of ASL movies in play.
    • Minimize to Auto allows text dragged and dropped from any text editor to be signed.
    • Demonstration of SE manual as parallel to English.
    • Demonstration of decoding process of speech.
    • The Live Decode program allows real-time speech recognition, while Live Pretend and Decode allow speech recognition in batch mode.
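    The WER figure reported by the software follows the standard definition: with S substitutions, D deletions, and I insertions measured against a reference transcript of N words,

        \[
        \mathrm{WER} = \frac{S + D + I}{N} \times 100\%
        \]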
  • 25. Conclusions
    • The research aim of offering freely available and open source SSLIS is fulfilled.
    • Sphinx 3.5 was employed as the SR engine.
    • SE manual was followed for translation.
  • 26. Shortcomings & Further Work
    • Degradation in the speech recognition accuracy.
    • Using a poor-quality microphone severely degrades the recognition accuracy of our system.
    • Virtual memory constraints.
  • 29. Thank You
    • Your questions are welcome.