
Promotion Dossier

Veton Z. Këpuska

Associate Professor

Electrical and Computer Engineering

Florida Institute of Technology

Contact Information

Olin Engineering Room 353

Phone: 321-674-7183

E-mail: [email protected]


Table of Contents

I. Summary

II. Brief History of the Candidate

III. Teaching and Related Activities

IV. Research and Scholarly Activities

V. Service Activities

VI. Appendixes


I. Summary

I consider my professional engagement at FIT to be a collection of teaching, research, and service activities.

In teaching I have introduced a significant number of new courses, both graduate and undergraduate. The list of courses that I have introduced is provided in the teaching section of this dossier. Through my public U drive I have made available an extensive collection of resources to aid my students during their studies of each subject. That material is available online (http://my.fit.edu/~vkepuska/ or http://my.fit.edu/~vkepuska/web/).

While I was employed in private industry as a Research Scientist, I had to sign a non-disclosure agreement (NDA) which prohibited me from publishing my research. As a result of my research, the company I worked for filed three patent applications, which were awarded between 2003 and 2010.

My research has been published in 24 journal papers, 30 conference publications, and 1 book chapter.

As part of my service activities, I have traveled to China 5 times under collaborative agreements between FIT and Chinese educational institutions (Dian Ji and Hubei Universities).


II. Brief History of the Candidate

I joined the Department of Electrical and Computer Engineering (ECE) at Florida Institute of Technology in January 2003 as an Associate Professor. My MS and Ph.D. degrees are from Clemson University (1986, 1990). During those studies I served as a lecturer and as a graduate research and teaching assistant. Following my Ph.D. I worked as a Post-Doctoral Researcher in the Institute of Geodesy and Photogrammetry of ETH Zürich. Earlier I served as a Lecturer at the University of Prishtina. During the years 1993-2003 I worked as a Research Scientist for a number of private organizations in the Boston area.


III. Teaching and Related Activities

My teaching methodology and practice were recognized in 2009 with the Kerry Bruce Clark Teacher Award. This award reflects the achievements presented below.

Since I joined FIT in 2003, I have proposed and introduced a set of new courses covering speech processing, speech recognition, natural language understanding, and Android embedded application development. I developed both graduate (ECE 5525, 5526, 5527, 5590) and undergraduate (ECE 3551, ECE 3552, and ECE 3553) courses, including teaching materials and lab exercises.

In my teaching I focus on the exit knowledge of the students. To this end, I have developed an interactive web tool that I use extensively to guide my students and myself through the final-exam stage of each class. With it I am able to set up the date, time, and location at which the exam will be conducted. An example of the exam setup page is provided via the link in Appendix A.1. I teach 3 courses per semester, covering graduate-level areas that range from Speech Processing, Speech Recognition, Natural Language Processing, Speaker Identification, Digital Signal Processing, Adaptive Filtering, Pattern Recognition, and Neural Networks, as well as undergraduate-level courses that include Programming Languages (C/C++, Fortran, Pascal, Perl, awk, Prolog, Lisp), Computer Architecture (e.g., Microcomputer Systems 1 and 2), and Web Design Tools (e.g., Multifarious Systems 1 and 2, that is, HTML, Perl, PHP, SQL, MySQL, etc.). In all graduate-level as well as undergraduate-level classes I rely on MATLAB and MATLAB-based programming tools (e.g., the Speech Analysis and Special Effects Laboratory: SASE_Lab(), Appendix A.2).

In all my courses I require students to demonstrate mastery of the subject area by developing a practical application that combines theory with an appropriate implementation. Students must verbally defend their work through: (1) a presentation, (2) detailed documentation (Microsoft Word, README file, source code, etc.) describing their work, and (3) an implementation and demonstration of the project.

Each project is assessed against detailed project requirements that are provided as a reference file on my publicly accessible U drive. This document resides under each corresponding class link as the "Final Project Requirements document", as exemplified by the following link:

http://my.fit.edu/~vkepuska/ece3551/Final%20Project%20Requirements.doc.

The project examination is conducted by giving each student at least a 30-minute time slot to present his or her work. Large classes require 2-3 days of presentations (8:00 am-12:00 pm and 1:00 pm-6:00 pm each day) to complete the examination. Registering for a time slot is done with a web-based application (see, for example,


http://my.fit.edu/~vkepuska/web/courses.php#ece3551-projects

Involving Students in Research

To help undergraduate students develop research skills, I include them in all my research projects, providing them with opportunities to collaborate with graduate students. As a result of this teaching methodology, several of my undergraduate students enrolled in the Speech Recognition graduate program at FIT under my supervision and became distinguished researchers in their own right: Tudor Klein, Sean Powers, Brandon Schmitt, Xerxes Beharry, Ronald Ramdhan, Raymond Sastraputera, Chris Hellriegel, Pattarapong Rojansthien, etc. In particular, (i) my focus on undergraduate research has produced positive outcomes, (ii) I have established collaborative partnerships with industry, and (iii) I have established an educational collaboration with Dr. Anagnostopoulos and other faculty at FIT and elsewhere (FGCU, UCF).

For over a decade, I have focused on expanding opportunities for my undergraduate students to participate in research that traditionally was reserved for graduate students. One example is my research project funded by the US Department of Energy (DOE): I took 6 FIT students to Washington, DC for two weeks during the summer of 2011. Before we traveled to Washington, DC for the final stage of the project, we conducted research at FIT in the area of energy. The resulting paper, "Energy Savings from using Mobile Smart Technologies," prepared under the supervision of DOE fellow Paul Karaffa, was published in 2013 (Këpuska, V. et al. (2013). Energy Savings from using Mobile Smart Technologies, Journal of Renewable and Sustainable Energy, doi: 10.1063/1.4811096, 2013).

I advise 25-35 undergraduate students each academic year, and work directly with at least 2 junior and senior design teams. A list of their achievements is included below:

● Best Junior Design 2007 - Visual Audio (Brandon Schmitt).
● Greatest Commercial Potential - "Smart Room," Senior Design 2008 (Matt Hopkins, David Herndon, Patrick Marinelli).
● I also serve as advisor to the Chi Phi student society here at Florida Tech. The Chi Phi Fraternity was awarded the Dr. Thomas Gehring Award for Chapter Excellence in 2007.
● My students have also won Third Place in the IEEE Student Hardware Competition, Best Junior Design for Visual Audio, a Best Paper Nomination, and First Place in the First Competition for Assistive Devices (2005).
● Through an NSF-funded research project, my former student Jacob Zurasky and I developed SASE_Lab - the Speech Analysis and Special Effects Laboratory (details included in Appendix A.2). This is a comprehensive tool that supports my range of interests, from simple recording of utterances to waveform and spectrographic analysis, as well as modifying the speech by adding various special effects, as described in detail in the Appendix.


Teaching

I have taught courses at both the undergraduate and graduate levels. Most of my courses are required by the ECE curriculum.

● Undergraduate:
1. Hardware Software Design – ECE 2551
2. Hardware Software Integration – ECE 2552
3. Signal and Systems – ECE 3222
4. Digital State Machines – ECE 3541
5. Microcomputer Systems 1 – ECE 3551
6. Microcomputer Systems 2 – ECE 3552
7. Multifarious Systems 1 – ECE 3553
8. Multifarious Systems 2 – ECE 4553
9. Computer Architecture – ECE 4551
10. Computer Communications – ECE 4561
11. Electric and Electronic Circuits – ECE 4991

● Graduate:
12. Speech Processing – ECE 5525
13. Speech Recognition – ECE 5526
14. Search and Decoding in Speech Recognition – ECE 5527
15. Acoustics of American English Speech – ECE 5528
16. Embedded Android Programming – ECE 5570
17. Computer Networks 2 – ECE 5535
18. Digital System Design 1 – ECE 5571
19. Digital System Design 2 – ECE 5572

Student Enrollment

The enrollment in my courses has grown steadily, as indicated in the table below:

Students Enrolled since 2006:

Year     2006  2007  2008  2009  2010  2011  2012  2013  2014  2015  2016
Spring     23    29    48    51    26    27    25    37    78    65    70
Summer      -     -    31    31    29    35    28    26    74    43    34
Fall       54    61    61    61    51    65    71    81   114   109    82
Total      77    90   140   143   106   127   124   144   266   217   186

Teaching Material

All my teaching material is publicly available through my U drive (http://my.fit.edu/~vkepuska/).


In order to help students achieve their academic goals, I have also taught a number of different Special Topics courses (not listed).

In addition to the courses that I introduced when I first joined FIT, I developed an additional graduate-level course, ECE 5570 Embedded Android Programming. This course has grown in demand; in summer 2016 it attracted 15 graduate students.

Student Advising

I have been an advisor to many Ph.D., M.S., and undergraduate students; a few notable graduate students include Tudor Klein, Xerxes Beharry (Microsoft), Ronald Ramadhan (Apple), Sean Powers, Jacob Zurasky (NXT-ID Inc.), Brandon Schmitt, Chih-Ti Shih, Raymond Sastraputera, etc.

Typically I have had over 30 student advisees per semester. In addition to advising roles for both undergraduate and graduate students, I am the faculty advisor to the IEEE-HKN engineering honor society Zeta Epsilon Chapter here at Florida Tech.

Student Advising (2016):
Undergraduate            15
MS Degree Students       11
PhD Degree Students      11
Total                    37

A list of advisees is given below:

Ph.D. Students – Primary Dissertation Advisor
● Abdulaziz, Azhar S., PhD CPE
● Al-Khuwaiter, Tahsin A., PhD CPE
● Alfathe, Mahmood F., PhD CPE
● Alshamsi, Humaid S., PhD CPE
● Bohouta, Gamal M., PhD CPE
● Elharati, Hussien A., PhD EE
● Mohamed Eljhani, PhD CPE, "Front-end of Wake-Up-Word Speech Recognition System Design on FPGA," Spring 2015
● Al Ghamdy Amin O., 2010
● Tamas Kasza, PhD CPE, "Communications Protocol for DF-based Wireless Indoor Localization Networks," Spring 2006

M.S. Students – Major Thesis Advisor
● Eljagmani Hamda, MS CPE
● Hasanin Ahmad, MS CPE
● Maryam Najafi, MS CPE
● Anita Devi, MS CPE, "Google Speech Recognition using Embedded System," Spring 2016


● Marwa Alzayadi, "Utilizing Sphinx-4 on Speech Recognition," Spring 2016
● Ashwini Srivastava, "Comparison of Microphone Array Beamforming Algorithms using Designed Hardware and Microsoft KINECT," Fall 2015
● Tianyi Bi, MS CPE, "Using CMU Sphinx in Standard Chinese Automatic Speech Recognition System Design," Fall 2015
● Zhenmin Ye, MS EE, "Design of a Step-Up Transformer for a P300 Welding Power Supply," Summer 2015
● Safa M. Al-Taie, MS ECE, "Improving the Accuracy of Fingerprinting System Using Multibiometric Approach," Spring 2015
● Jacob Coughlin, MS EE, "Optimizing Wake-Up-Word Application for Embedded Deployment," Spring 2015
● Boopathy Prakasam, MS EE, "Microphone Array for Speech Processing and Recognition," Spring 2015
● Wenyang Zhang, MS CPE, "Comparing the Effect of Smoothing and N-gram Order: Finding the Best Way to Combine the Smoothing and Order of N-gram," Spring 2015
● Ibrahim Al-Badri, MS CPE, "Speech Corpus Generation from YouTube," Fall 2014
● Wilson Burgos, MS CPE, "Gammatone and MFCC Features in Speaker Recognition," Fall 2014
● Jacob Zurasky, MS CPE, "Digital Signal Processing Applications of the TMS320C6747," Fall 2012
● Mohammed Almeer, MS CPE
● Patarapong Rojanshien, MS CPE, "Speech Corpus Generation from DVDs of Movies and TV Series," Fall 2010
● Xerxes Beharry, MS CPE, "Phoning Home: Bridging the Gap Between Conservation and Convenience," Fall 2010
● Arthur Kunkle, MS CPE, "Sequence Scoring Experiments Using the TIMIT Corpus and the HTK Recognition Framework," Spring 2010
● Raymond Sastraputera, MS CPE, "Prosodic Features for Wake-Up-Word Speech Recognition," 2009
● Chih-Ti Shih, MS CPE, "Use of Pitch and Energy of the Speech Signal for Discrimination of 'Alerting' from 'Referential' Context," 2009
● Za Hniang Za, MS CPE, "A Study of Approaches for Microphone Array Signal Processing," 2008
● Tien-Hsiang Lo, MS CPE, "Analysis of Weighted-Sum of Line Spectrum Pair Method for Spectral Estimation," Spring 2005
● Elias Victor, MS CPE, "1553 Avionics Bus Hardware Integration Into Expendable Launch Vehicle (ELV) Simulation Model," Spring 2005

Ph.D. Students – Advisory Committee Member
● Al Rozz Younis Anas Younis, 2017
● Shakre Elmane, "Dynamics in Recommendations of Updates For Free Open-Source Software," 2016
● Jim Bryan, "A Robust Speaker Identification Algorithm Based On Atomic Decomposition and Sparse Redundant Dictionary Learning," 2016
● Scott Meredith, PhD ECE, "Lightning Generated Electric and Magnetic Fields: a Methodology into the Development of Three Models and Their Utilization in Determining the Currents Induced within a Three-wired Tether," 2011
● Mohamed ???, ECE PhD, 2010


● Igor Lugach, PhD MEA, "Effect of Accurate and Simplified Interactions Modeling in a Null-Flux Electromagnetic Levitation System on Performance of Multi-DOF Sliding Mode Controller," 2006

● Michel Ouendeno, PhD CPE, "Image Fusion for Improved Perception," 2007

M.S. Students – Advisory Committee Member
● Subhadra Subramanian, EE, 2016
● Taher Parekh, "Real Time Implementation of an Inertial Measurement Unit based on Fiber-Bragg Sensor Arrays using Legendre Polynomials," Mechanical Engineering, 2016
● Hai Hoang Truong Tran, "A Comparative Study of Relational Database Server Performance on Windows vs. Linux," Computer Science and Cyber Security, 2015
● Chih-Ti Shih, MS ENM, 2009
● Patric Durland, MS SYS, "Project Arcade," 2009
● Osaam Saraireh, MS ES, 2007
● Rachan Varigiya, MS CS, "Keyword Spotting Using Normalization of Posterior Probability Confidence Measures," 2004


IV. Research and Related Activities

Wake-Up-Word Speech Recognition

I consider my main scientific contribution to be the solution that I have invented in the area of my specialty, speech processing and recognition: the so-called Wake-Up-Word Speech Recognition (see the following blog: http://lovemyecho.com/2015/11/11/echo-custom-wake-words-why-isnt-this-a-thing-yet/).

I have solved a difficult problem for which there are no other appropriate solutions. The problem can be summarized as discriminating between a target word/phrase used in a wake-up or alerting context (e.g., "Alexa, I need to talk to you.") and that same word/phrase used in a referential context (e.g., "Today I talked to Alexa on the phone."). I have developed a revolutionary speech recognition technology (called Wake-Up-Word, or WUW for short) that solves this problem. This solution can change how we interact with computers. It can also be applied to the general speech recognition problem, improving accuracy by 1,500%-30,000% and providing solutions that approach human performance. Although speech recognition technology has been around for over 25 years, it must be emphasized that the level of accuracy required by this application has not yet been achieved by any commercial or research institution, and thus my WUW technology has the potential to revolutionize the way we communicate with computers.

The solution is protected by my two US patents ("Scoring and re-scoring dynamic time warping of speech," US #7,085,717, and "Dynamic time warping using frequency distributed distance measures," US #6,983,246). Those patents are referenced by the following patents:

7,437,291 Using partial information to improve dialog in automatic speech recognition systems

7,143,034 Dynamic time warping device for detecting a reference pattern having a smallest matching cost value with respect to a test pattern, and speech recognition apparatus using the same

7,085,717 Scoring and re-scoring dynamic time warping of speech

5,455,889 Labeling speech using context-dependent acoustic prototypes


Funding

I strongly advocate and practice a teaching approach that bridges the gap between theory and practice. This approach has allowed me to establish a number of strategic partnerships with industry: ThinkEngine Networks (http://www.thinkengine.com), iCVn (http://www.iCVn.com), QTSI (http://www.qtsi.com), PCB (http://www.pcb.com), Knight's Armament Company (http://www.knightarmco.com), BMW, Microsoft, etc.

These industry partnerships have enabled me to support my research. The total amount of direct support from industry to date exceeds $250,000.

For example, ThinkEngine Networks (http://www.thinkengine.com) supported my research in speech recognition during my first 2.5 years at FIT by providing supplemental income of $35,000 annually. For more information please consult my CV and/or contact the former senior Vice Presidents of Research, Geoffrey Parkhurst ([email protected]) and Paul Gagne ([email protected]) (detailed contact information is also listed in the references-from-industry section).

In addition, through my industry collaborators, I have brought in equipment worth well over $100,000. In the collaboration with iCVn of Baltimore, my team has designed new hardware that will augment the "Shinwoo" money counting device with a scanner capable of reading serial numbers from any currency (e.g., US Dollars, UK Pounds, Swiss Francs, European Union Euros, etc.), configured by software that we developed. We have also re-designed and re-implemented their core application software, which now connects to the Shinwoo device and communicates with it. For further information please contact Theodore Paraskevakos, President and CEO, at [email protected] (detailed contact information is also listed in the references-from-industry section).

During the first half of 2006 I completed a very important pilot study for QTSI of Cocoa Beach, applying my Wake-Up-Word Speech Recognition technology to seismic signals.

PCB, an internationally recognized corporation, has donated over $5,000.00 worth of specialized industrial-quality equipment consisting of high-precision microphones and data acquisition hardware for microphone array research. In addition, PCB is partnering with me on an NSF MRI research proposal to establish the largest microphone array laboratory in the Southeast. Contact information: Ray Rautenstrauch, [email protected].

Additional notable collaborations are listed below:

● BMW – Actively working with Dr. Joachim G. Taiber, BMW Group, Head of IT Research Office, Information Technology Research Center (ITRC), Greenville South Carolina. (2005-2008)

● Tudor Klein and Microsoft – Working with Microsoft through my former graduate student, Tudor Klein, to establish an appropriate relationship to contribute to the development of their new generation of speech recognition software. Contact information: [email protected]. (2004-2006)

● NIST, Speech Group – I organized and hosted at FIT a 3-day workshop involving the National Institute of Standards and Technology (2009). I enrolled an FIT team (Arthur Kunkle and Dileep Koneru) in a NIST-sponsored evaluation effort, the Rich Transcription Evaluation Project, with the goal of further advancing the speech recognition area. The goal of our participation is not only to use existing technologies but also to incorporate my inventions into existing systems (e.g., HTK and Sphinx), demonstrating their superiority over conventional methods. For details of this workshop follow this link: http://itl.nist.gov/iad/mig/tests/rt/2009/index.html

● PowerPoint Commander is a voice-activated PowerPoint presentation application that I developed utilizing my WUW speech recognition system. The application is cross-platform (Windows as well as Apple OS). It incorporates my Wake-Up-Word (WUW) Speech Recognizer and enables users to control their PowerPoint presentation program using only their voice. The uniqueness of the application is the ability of the technology to distinguish presentation speech from the speaker's commands. For example:
1. Presentation speech: "Processing of the speech signal requires an application of short-time analysis."
2. Command: "Computer! Go to Next Slide."

I have submitted, as co-PI, a $10-million multi-institutional, collaborative Center of Excellence proposal to the Florida Board of Governors. The proposal ended up among the top 10 proposals in scientific merit as judged by the prestigious Oak Ridge Associated Universities reviewers (ORAU: Top 10 in Scientific Review Ranking). My contribution, as judged by my peers, was critical to the scientific merit ranking of the proposal (see Dr. Rodriguez's recommendation).

Two (2) NSF projects were awarded (totaling $260,000) and 1 is pending ($1,198,728). I have also submitted 3 proposals, to Rockwell Collins, the State of Florida, and the Lindberg Ann Foundation, as PI/Co-PI, totaling $2,156,012. In addition, I have submitted a number of SBIR proposals, one of which was an invited proposal by the National Institute of Justice (NIJ) of the US Government. I have generated, or been involved in generating, on average more than 3.3 proposals per year.

The two major projects (totaling $260,000) are: (1) the EMD-MLR project, sponsored by NSF and supported through grant DUE-CCLI-0341601 of $99,996; this project was supported under the Course, Curriculum, and Laboratories Improvement (CCLI) program and the Educational Materials Development (EMD) track. (2) The AMALTHEA project (AMALTHEA REU Site), also sponsored by NSF under the Research Experiences for Undergraduates (REU) program, grant IIS-REU-0647018, for $160,701 over a period of 3 years (2007-2010).

In 2012 I was awarded an NSF I-Corps grant, through which I established Zëri Corporation (www.zeriinc.com). This in turn provided an excellent opportunity for my graduate students to develop scientific and entrepreneurial skills.

Patents

In 2016-2017 we are witnessing a proliferation of speech-oriented services and devices that deploy technology of which I am one of the inventors; please refer to the list of patents below.

• Dynamic Time Warping (DTW) Using Frequency Distributed Distance Measures, US6983246 B2, January 3, 2006.

• Frequency Distribution of Minimum Vector Distance for Dynamic Time Warping, WO 2003100768 A1, April 1, 2006.

• System and Methods for Facilitating Collaboration of a Group, US 20110134204 A1 June 2009

• Exploiting Differences in Correlations for Modeled and Un-Modeled Sequences by Transforming Trained Model Topology in Sequence Recognition, Provisional Patent Application, August 2009

V. Service and Related Activities

My service activities at Florida Tech include:

Conferences and Related Activities

In 2004 and 2005, Dr. Marius Silaghi and I proposed and introduced a Special Speech Processing and Recognition track at FLAIRS. This task entailed hosting the web site with information about the new track, collecting the papers, recruiting reviewers, organizing, conducting, and managing the peer review process, and notifying the conference chair of the final list of accepted papers. I have served as chair in 3 conferences (FLAIRS 2004 & 2005 as well as ASEE 2006). My paper was nominated for the best paper award (one of 9 nominated papers): Këpuska V., Rogers N., Patel M. (2006). A MATLAB Tool for Speech Analysis, Processing, and Recognition: SAR-LAB, ASEE, Chicago, 2006 (Best Paper Award Nomination).

In addition, I have served as a reviewer for the Elsevier journals Speech Communication, Neural Networks, and Nonlinear Analysis Theory and Methods; for IEEE Transactions on Neural Networks (2007); the ACM Symposium on Applied Computing – NLSP (2008); FLAIRS (2005 and 2006); and the First Conference on Information Technology and Engineering Sciences, Kosova (2007). In the past, prior to joining FIT, I often served as a reviewer for the International Journal of Speech Technology, IEEE ICASSP, and ICSLP.

Also, I have served as a reviewer for the academic advancement of Dr. Dimitrios Charalampidis, ECE Department, University of New Orleans.

Reviewing:
● "The New University Researchers Start-up Program" of the Fonds de recherche du Québec, by Guy Tremblay, Ph.D., Program Manager, Fonds de recherche du Québec – Nature et technologies, 140 Grande Allée East, Quebec City, QC, Canada G1R 5M8, Phone: 418 643-3439, Fax: 418 643-1451, Email: [email protected]
● NSF in 2016. Meeting ID: P161746, Meeting Name: NRT Engineering III Panel, Meeting Dates: 04/11/2016-04/13/2016. Panel Leader: Richard A. Tankersley, Ph.D., Program Director, NRT, IGERT, GK12 and GROW
● NSF Division of Graduate Education, NRT panel ID P170643, Tara L. Smith, PhD, Program Director

● Since 2012 I have been serving as Editor in Chief at "Scientific & Academic Publishing"; http://www.sapub.org/journal/editorialdetails.aspx?JournalID=1045&PersonID=10112.
● Springer Journal's Editorial Office, International Journal of Speech Technology
● Advances in Research, Neural Networks, etc.
● Expert Reviewer for security clearance for FIT students. Most recently: NC04 - Cardinal, Patrick (207070) - Expert: Këpuska, Veton (KEPVE1601)

I have served as a professional reviewer for the Wiley and Sons book "Speech and Audio Signal Processing: Processing and Perception of Speech and Music," by Gold and Morgan.

I served as Faculty Senator for ECE for two years, as a member of the EE Curriculum Committee for 5 years, and on FIT's hiring committee 4 times. In 2009 I organized the prestigious NIST Workshop: http://itl.nist.gov/iad/mig/tests/rt/2009/index.html.

uCollaborator Coalition

Through the efforts of Dr. Rodriguez (FGCU) and myself, we have created the uCollaborator Coalition of Excellence (uCE, pronounced "you see"), gathering researchers from Florida Gulf Coast University, Florida Institute of Technology, and the University of Central Florida (Institute for Simulation and Training), along with industrial partners (further details are provided at the following link: www.ucollaborator.com). Currently, uCE is engaging in research, development, and commercialization of a new breed of ubiquitous collaboration technologies. Recently, the coalition has submitted several grant proposals for developing an open platform to connect the physical and virtual worlds both synchronously (same time) and asynchronously (anytime, anywhere). I took the lead on one of them (NSF-IIS 905145: RI:HCC:Medium:RUI: uC: Ubiquitous Collaboration Platform for Multimodal Team Interaction Support).


Committees

Additional service activities include representing the ECE department on FIT's Graduate Curriculum Committee (2003-2005). I served as a member of the Department Head Search Committee in 2007 and again in 2008-2009, and I serve on the ECE department's ABET and Curriculum Committee. I directly advise my Department Head on all matters regarding the Computer Engineering curriculum.

Outreach

My activities have contributed to the university's outreach efforts through AMALTHEA and EMD-MLR, both NSF-funded projects. I have represented FIT at Harris Engineering Week several times (2004-2008) and at LASER DAY (2005-2007), and have given personal presentations at BCC, Melbourne's L.B. Johnson Middle School, and Cocoa Beach High School. Finally, I have served as a judge in the science fair of the Central Brevard Catholic Homeschool Group (2007, 2008, 2009) (e-mail: [email protected]).

CBD Board of Directors

From 2000 to 2010, I served as a member of the Board of Directors of the non-governmental human-rights organization "Center for Balkan Development". For more information please see http://www.friendsofbosnia.org/who_board.html.


VI. Appendixes

A.1 Final Exam Web-Based Tool

In the included example, I use my name ("Veton Këpuska" and "Veton Këpuska 1") to register for the ECE 3551 final exam at two different times: at 9:00 am, which is marked in red (accepted), and at 1:30 pm, which is marked in blue (rejected), as indicated below.


This tool allows me to print the table of final examination times of all registered users:

This tool is password protected and I never had any problems utilizing it.


A.2 Speech Analysis and Special Effects Laboratory: SASE_Lab() Tool

An empty initial screenshot of my SASE_Lab() tool.

The next screenshot displays the process of loading a wave file from the TIMIT (Texas Instruments and MIT) corpus. This is done interactively by picking the file from the directory where the corpus is located.

After selecting a region of the loaded waveform by clicking on it, the remaining windows are populated by performing the various analyses described in detail below.


One additional feature, added after the final report was published, is presented next. It depicts a sentence with its individual TIMIT phonetic transcription, including phone boundaries, the individual words, and the type of each phoneme.

The next figure depicts the options provided by the View menu, which offers two choices: "Process Speech" and "Spectrogram". The default is to display "Process Speech", as depicted in the figure below.


If I pick the Spectrogram display, the following window appears, depicting the spectrographic analysis performed on the original waveform:


Notice the task bar below the original figure. This task bar is designed to allow the user of the tool to focus not on the overall utterance but on 1-second-long sections, as demonstrated below:

The next feature of this tool allows users to apply various special effects from a list (Echo, Reverberation, Flanger, Chorus, Vibrato, Tremolo, and Voice Changer), as depicted in the next figure, where "Reverberation" was picked and applied to the original signal.


Introduction

Speech processing is the application of signal processing techniques to a speech signal for a variety of applications. A common goal between applications is to represent characteristics of the original signal as efficiently as possible. Example applications include speech recognition, speaker recognition, speech coding, speech synthesis, and speech enhancement. Efficiency is a key factor in all of these applications. These applications are covered in more detail in the motivation section of this report.

In speech recognition, the goal is to provide a sound wave containing speech to a system, and have the system recognize parts of words or whole words. To complete this goal, speech processing must first take place to prepare the speech signal for recognition. This is done by breaking down the speech signal into frames of approximately 10-30 milliseconds, and generating a feature vector that accurately and efficiently characterizes the speech signal for that frame of time. Reducing the frame from a fairly large set of speech signal samples to a much smaller set of data, the feature vector, allows for quicker computation during the recognition stage of the system.

The figure above shows a high-level diagram of a typical speech recognition system. The input to the system is a speech signal, which is passed to the front end of the system. The front end is responsible for the speech processing step and extracts features of the incoming speech signal. The back end of the system uses the features provided by the front end and, based on statistical models, provides recognized speech. This paper will focus on the front end and the related speech processing required to extract feature vectors from a speech signal.


Motivation

Speech processing is an important technology that is used widely by many people on a day-to-day basis. Almost all smartphones come equipped with speech recognition capabilities to enable hands-free use of certain phone functionality. A recent advancement of this mobile technology is Siri on the Apple iPhone 4S. This application takes speech recognition a step further and adds machine understanding of the user's requests. Users can ask Siri to send messages, make schedules, place phone calls, etc. Siri responds to these requests in a human-like manner, making the interaction seem almost like talking to a personal assistant. At the base of this technology, speech processing is required to extract information from the speech signal to allow for recognition and, furthermore, understanding.

Speech recognition is also commonly used in interactive voice response (IVR) systems. These systems are used to handle large call volumes in areas such as banking and credit card services. IVR systems allow interaction between the caller and the company’s computer systems directly by voice. This allows for a large reduction in operating costs, as a human phone operator is not necessary to handle simple requests by a customer. Another benefit of an IVR system is to segment calls to a large company based on the caller’s needs and route them to appropriate departments.

Other applications of speech processing and recognition focus on a hands-free interface to computers. These types of applications include voice transcription or dictation systems, which can be found commercially in use for direct speech-to-text transcription of documents. Other hands-free interfaces allow for safer interaction between humans and machines, such as the OnStar system used in Chevy, Buick, GMC, and Cadillac vehicles. This system allows the user to use their voice to control navigation instructions, vehicle diagnostics, and phone conversations. Ford vehicles use a similar system called Sync, which relies on speech recognition for a hands-free interface to calling, navigation, in-vehicle entertainment, and climate control. These systems' use of a hands-free interface to computing allows for safer interaction when the user's attention needs to be focused on the task at hand: driving.

Another growing area of technology utilizing speech processing is the video game market. Microsoft released the Kinect for the Xbox 360, which is an add-on accessory to the video game console to allow for gesture/voice control of the system. While the primary focus of the device was gesture control, it uses speech processing technology to allow control of the console by the user’s voice.


Discrete-Time Signals

A good understanding of discrete-time signals is required prior to discussing the mathematical operations of speech processing. Computers are discrete systems with finite resources such as memory. Sound is stored as discrete-time signals in digital systems. The discrete-time signal is a sequence of numbers that represent the amplitude of the original sound before being converted to a digital signal.

Sound travels through the air as a continuously varying pressure wave. A microphone converts an acoustic pressure signal into an electrical signal. This analog electrical signal is a continuous-time signal that needs to be discretized into a sequence of samples representing the analog waveform. This is accomplished through the process of analog-to-digital conversion. The signal is sampled at a rate called the sampling frequency (Fs) or sampling rate. This number determines how many samples per second are used during the conversion process. The samples are evenly spaced in time, and each represents the amplitude of the signal at that particular time.

The above figure shows the difference between a continuous-time signal and a discrete-time signal. On the left, one period of a 200 Hz sine wave is shown. The period of this signal is the reciprocal of the frequency, in this case five milliseconds. On the right, the signal is shown in its discrete-time representation. The signal is a sequence of samples, each sample representing the amplitude of the signal at a discrete time. The sampling frequency for this example was 8 kHz, meaning 8000 samples per second. As a result, one period of the 200 Hz sine wave spans 40 samples.
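As a quick illustration of these numbers, the following minimal sketch (assuming NumPy is available; the 200 Hz tone, 8 kHz sampling rate, and 40-sample period are taken from the example above) generates the discrete-time version of the sine wave:

```python
import numpy as np

# One period of a 200 Hz sine wave sampled at 8 kHz.
f0 = 200.0        # tone frequency in Hz
fs = 8000.0       # sampling frequency in Hz (8000 samples per second)

samples_per_period = int(fs / f0)     # 8000 / 200 = 40 samples per period
n = np.arange(samples_per_period)     # discrete sample indices 0..39
t = n / fs                            # sample times, evenly spaced 125 microseconds apart
x = np.sin(2 * np.pi * f0 * t)        # discrete-time signal: amplitude at each sample time

print(samples_per_period)             # 40
```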


The sampling frequency is directly related to the accuracy of representation of the original signal. By decreasing the sampling rate to 2 kHz, or 2000 samples per second, the discrete-time signal loses accuracy. This can be seen on the right side of the following figure.

The exact opposite is also true: by increasing the sampling frequency, the signal can be represented more accurately as a discrete-time signal. The following figure uses a sampling frequency of 44.1 kHz. It can be seen that the signal on the right more accurately describes the continuous-time signal at a sampling rate of 44.1 kHz as opposed to 2 kHz or 8 kHz.


Common sampling rates used currently for digital media are as follows:

8 kHz – Standard land-line telephone service
16 kHz – Wideband land-line telephone service
44.1 kHz – CD Audio Tracks
96 kHz – DVD Audio Tracks

Speech Processing

The main goal of speech processing is to reduce the amount of data used to characterize the speech signal while maintaining an accurate representation of the original data. This process typically produces a feature vector of 13 numbers. The feature vector is commonly referred to as Mel-Frequency Cepstral Coefficients (MFCCs). The process of feature extraction can be broken down into several stages of mathematical operations that operate on a discrete-time signal input. The following is a high-level diagram of the feature extraction stages.


(Feature extraction pipeline: Speech Signal → Framing → Pre-Emphasis → Windowing → Fourier Transform → Mel Filter → Log → Discrete Cosine Transform → Feature Vector)


Framing

The speech signal can be of any length, but for analysis the signal must be divided into segments. Each segment, or frame, will be analyzed and a feature vector will be produced. Speech signals are typically stationary over a period of 10-30 milliseconds. Given a sampling frequency of 8 kHz, the corresponding frame sizes are 80 to 256 samples. The samples contained in the frame are passed through all stages of the front end to produce a vector containing 13 values that characterize the speech signal during that frame.

Upon complete processing of a particular frame, the next frame should not begin where the previous one ended. To more accurately process the signal, the next frame should overlap the previous frame by some amount.

The above figure shows 768 samples of a speech signal and also the overlapping nature of the speech frames. The blue signal is the speech signal, and it can be noted that this is a semi-stationary section of speech. The periodicity of the signal is clearly shown. The other curves show how this speech would be divided into five frames. Each colored curve is an analysis window that segments the speech signal into frames. Each frame is 256 samples in length, and each frame overlaps the previous by 50%, or 128 samples in this case. This ensures accurate processing of the speech signal. A front end system can be described by its frame rate, or the number of frames per second that the speech signal is divided into. The frame rate of the front end also translates into the number of feature vectors produced per second, due to the fact that one frame produces one feature vector.
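A minimal sketch of this framing step, assuming NumPy; the 256-sample frame length and 128-sample hop (50% overlap) follow the example above, and the function name frame_signal is illustrative only:

```python
import numpy as np

def frame_signal(x, frame_len=256, hop=128):
    """Split a 1-D signal into overlapping frames.

    frame_len=256 and hop=128 give the 50% overlap described above.
    Returns an array of shape (num_frames, frame_len); only full frames are kept."""
    num_frames = 1 + (len(x) - frame_len) // hop
    frames = np.empty((num_frames, frame_len))
    for i in range(num_frames):
        start = i * hop
        frames[i] = x[start:start + frame_len]
    return frames

# 768 samples, as in the figure above, yield five overlapping 256-sample frames.
x = np.random.randn(768)
print(frame_signal(x).shape)   # (5, 256)
```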


Pre-emphasis

The next stage of the front end is to apply a pre-emphasis filter to the speech frame that has been segmented in the previous step. A pre-emphasis filter in relation to speech processing is typically a high-pass, 1st order, finite impulse response (FIR) filter. A filter modifies a signal that is passed through it.

A filter has a characteristic called frequency response. This describes how the filter modifies the signal passed through it. The filter used here is a high-pass filter, meaning that it will pass the frequencies above the cut-off frequency while attenuating, or reducing, the parts of the signal below the cut-off frequency. The frequency response of a common 1st-order pre-emphasis filter is shown below.

The above graph was generated using the value of -0.97 for the filter coefficient and a sampling rate of 8 kHz. The frequency response of this filter shows that lower frequencies are attenuated, or reduced in magnitude. The opposite is also true: higher frequencies are not attenuated as much as frequencies in the lower parts of the spectrum.

The reason for applying the pre-emphasis filter is tied to the characteristics of the human vocal tract. There is a roll-off in spectral energy towards higher frequencies in human speech production. To compensate for this factor, lower frequencies are reduced. This prevents the spectrum from being overpowered by the higher energy present in the lower part of the spectrum.

The above figure shows the original speech signal in blue and the pre-emphasized speech signal in red. While maintaining the same overall periodicity and general waveform shape, the high frequency components are accentuated. The quicker changing parts of the signal, higher frequencies, are compensated so that the lower frequency energy does not overpower the spectral results.

The operation of applying a filter to a signal is represented mathematically through the convolution operation, denoted by the ‘*’ operator.

$y(t) = x(t) \ast h(t)$

The above equation convolves the input signal $x(t)$ with the filter $h(t)$ to produce the output $y(t)$. For a continuous-time signal, the convolution operation is defined by the following integral:


$y(t) = \int_{-\infty}^{t} h(\tau)\, x(t - \tau)\, d\tau$

In a discrete-time system the convolution operation changes from integration to summation.

$y[n] = \sum_{i=0}^{N} \beta_i\, x[n - i]$

where $N$ is the filter order, $\beta_i$ are the filter coefficients, $x[n]$ is the input signal, and $y[n]$ is the output signal.

In the case of our pre-emphasis filter, the order is one. This means there will be two coefficients, $\beta_0$ and $\beta_1$. The first coefficient of an FIR filter, $\beta_0$, is always one. The coefficient $\beta_1$ used in the above example frequency response was the value -0.97. Expanding the above summation based on these coefficient values yields the following result.

$y[n] = \beta_0\, x[n] + \beta_1\, x[n-1]$

$y[n] = x[n] - 0.97\, x[n-1]$

The input to this function is the sequence of samples in the speech frame, x[n]. The output of this filter is the pre-emphasized (high-pass filtered) frame of speech, y[n].
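As a small sketch of this difference equation (assuming NumPy; the coefficient value -0.97 matches the example above, and the function name is illustrative only):

```python
import numpy as np

def pre_emphasis(frame, beta1=-0.97):
    """Apply y[n] = x[n] + beta1 * x[n-1], a 1st-order FIR high-pass filter.

    The first sample has no x[n-1], so it is passed through unchanged."""
    y = np.copy(frame)
    y[1:] = frame[1:] + beta1 * frame[:-1]   # y[n] = x[n] - 0.97 x[n-1] for n >= 1
    return y

emphasized = pre_emphasis(np.array([1.0, 1.0, 1.0, 1.0]))
print(emphasized)   # [1.   0.03 0.03 0.03]
```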


Windowing

After a frame of speech has been pre-emphasized, a window function must be applied to the speech frame. While many different types of windowing functions exist, a Hamming window is typically used for speech processing. The figure below shows a Hamming window of length 256.

The Hamming window function of length N is:

$w(n) = 0.54 - 0.46 \cos\!\left(\frac{2\pi n}{N-1}\right)$

To apply this window function to the speech signal, each speech sample is multiplied by the corresponding value of the window function to generate the windowed speech frame. At the center of the Hamming window the amplitude is 1.0, and it decays to the value 0.08 at either the beginning or end of the window. This allows the center of the speech frame to remain relatively unmodified by the window function, while samples are attenuated more the further they are from the center of the speech frame. Observe the following figure. In the top half, a speech signal is shown in blue, with a Hamming window function in red. The bottom half of the figure shows the result when the Hamming window is applied to the speech signal. At the center of the frame the speech signal retains nearly its original values, but the signal approaches zero at the edges.


This process of windowing is very important to speech processing in the next stage, the Fourier Transform. If a windowing function is not applied to a speech frame, there can be large discontinuities at the edges of the frame. These discontinuities will cause problems with the Fourier Transform and will induce errors in the frequency spectrum of the framed audio signal. While it may seem like information is being lost at the edges of the speech frame due to the reduction in amplitude, the overlapping nature of sequential speech frames ensures all parts of the signal are analyzed.
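A minimal sketch of generating and applying the Hamming window defined above, assuming NumPy (NumPy's built-in np.hamming uses the same definition):

```python
import numpy as np

def hamming(N):
    """Hamming window w(n) = 0.54 - 0.46*cos(2*pi*n/(N-1)), as defined above."""
    n = np.arange(N)
    return 0.54 - 0.46 * np.cos(2 * np.pi * n / (N - 1))

frame = np.random.randn(256)        # stand-in for a pre-emphasized 256-sample frame
w = hamming(len(frame))
windowed = frame * w                # element-wise multiplication tapers the frame edges

print(w[0], w.max(), w[-1])         # ~0.08 at the edges, ~1.0 near the center
```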


Fourier Transform

The Fourier Transform is an algorithm used to transform a time-domain signal into the frequency domain. While the time domain gives information about how the signal's amplitude changes over time, the frequency domain shows the signal's energy content at different frequencies. See the following graph for an example frequency spectrum of a time-domain signal. The x-axis is frequency and the y-axis is the magnitude of the signal. It can be observed that this particular frequency spectrum shows a concentration of energy below 1 kHz, and another peak of energy between 2.5 kHz and 3.5 kHz.

The human ear interprets sound based on its frequency content. Speech signals contain different frequency content based on the sound that is being produced. Speech processing systems analyze the frequency content of a signal to recognize speech. Every frame of speech passed through the speech processing system has the Fourier Transform applied to it to allow analysis in the frequency domain.

The above graph shows peaks of frequency magnitude in three areas. Most speech sounds are characterized by three frequencies called formant frequencies. The formants for a particular sound are resonant frequencies of the vocal tract during that sound and contain the majority of signal energy. Analysis of formant locations in terms of frequency is the basis for recognizing particular sounds in speech.


The Fourier Transform $\hat{f}(\xi)$ of a continuous-time signal $f(x)$ is defined as:

$\hat{f}(\xi) = \int_{-\infty}^{\infty} f(x) \cdot e^{-2\pi i x \xi}\, dx$, for every real number $\xi$

The Inverse Fourier Transform of $\hat{f}(\xi)$ reproduces the original signal $f(x)$:

$f(x) = \int_{-\infty}^{\infty} \hat{f}(\xi) \cdot e^{2\pi i x \xi}\, d\xi$, for every real number $x$

where $\hat{f}(\xi)$ is the continuous frequency spectrum and $f(x)$ is the continuous-time input signal.

Speech processing typically deals with discrete-time signals, and the corresponding discrete Fourier Transforms are given below:

$X[k] = \sum_{n=0}^{N-1} x[n] \cdot e^{-i 2\pi \frac{k}{N} n}$

$x[n] = \frac{1}{N} \sum_{k=0}^{N-1} X[k] \cdot e^{i 2\pi \frac{k}{N} n}$

where $x[n]$ is the discrete-time signal, $X[k]$ is the discrete frequency spectrum, $N$ is the Fourier Transform size, $n$ is the sample number, and $k$ is the frequency bin number.

The vector $X[k]$ contains the output values of the Fourier Transform algorithm. These values are the frequency-domain representation of the input time-domain signal, $x[n]$. For each index $k$, from 0 to $N-1$, the value of the vector is the magnitude of the signal energy at frequency bin $k$. When analyzing magnitude, the Fourier Transform returns results that are symmetric across the mid-point of the FFT size. For example, if the FFT size is 1024, the first 512 results will be symmetric to the last 512 values in terms of magnitude. See the following graph of a 1024-point Fourier Transform.

Due to this symmetry, only the first half of the Fourier Transform is used when analyzing the magnitude of the frequency content of a signal. The relation from $k$ to actual frequency depends on the sampling rate ($F_s$) of the system. The first frequency bin, $k = 0$, represents 0 Hz, or the overall energy of the signal. The last frequency bin, $k = N/2$ (512 in the above graph), represents the maximum frequency that can be detected based on the sampling rate of the system. The Nyquist-Shannon Sampling Theorem states that the maximum frequency that can be detected in a discrete-time system is half of the sampling rate. Given a sampling rate of 8 kHz, the maximum frequency, or Nyquist frequency, would be 4 kHz. This value corresponds to the 512th index of the above graph.
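A short sketch illustrating this symmetry and the use of only the first N/2 + 1 bins, assuming NumPy; N = 1024 and Fs = 8 kHz follow the example above:

```python
import numpy as np

fs = 8000          # sampling rate (Hz); Nyquist frequency is fs/2 = 4000 Hz
N = 1024           # FFT size

x = np.random.randn(N)        # stand-in for a windowed frame (zero-padded to N if shorter)
X = np.fft.fft(x, N)          # complex spectrum with N frequency bins
mag = np.abs(X)               # magnitude spectrum

# For a real-valued input, the magnitude is symmetric about the midpoint:
# |X[k]| == |X[N-k]| for k = 1..N/2-1, so only bins 0..N/2 need to be kept.
print(np.allclose(mag[1:N // 2], mag[-1:N // 2:-1]))   # True
half = mag[:N // 2 + 1]       # bins 0 (0 Hz) through N/2 (Nyquist, 4 kHz)
```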

Each frequency bin $k$ represents a range of frequencies rather than a single value. The range covered by a set of frequencies is called the bandwidth ($BW$). The bandwidth is defined by the initial and final frequencies of the given band. For example, if a frequency bin starts at 200 Hz and ends at 300 Hz, the bandwidth of the bin is 100 Hz. The Fourier Transform returns frequency bins that are equally spaced from 0 Hz to the Nyquist frequency, with each bin having the same bandwidth. To compute the bandwidth of each bin, the overall bandwidth of the signal must be divided by the number of frequency bins. For example, given $F_s = 8$ kHz and $N = 1024$:


$BW_{bin} = \left(\frac{F_s}{2}\right) \Big/ \left(\frac{N}{2}\right) = \left(\frac{8\ \text{kHz}}{2}\right) \Big/ \left(\frac{1024}{2}\right) = 7.8125\ \text{Hz bandwidth per bin}$

This is the bandwidth ($BW_{bin}$) for each frequency bin of the Fourier Transform results. To translate from frequency bin $k$ to actual frequency, the bin number ($k$) is multiplied by the bin bandwidth ($BW_{bin}$). For example, if $BW_{bin} = 7.8125$ Hz and $k = 256$:

$f_{\max,bin} = BW_{bin} \cdot k = 7.8125\ \text{Hz} \cdot 256 = 2000\ \text{Hz}$

$f_{\min,bin} = f_{\max,bin} - BW_{bin} = 2000\ \text{Hz} - 7.8125\ \text{Hz} = 1992.1875\ \text{Hz}$

These calculations show that frequency bin 256 covers frequencies from about 1.992 kHz to 2 kHz. Note that the bin bandwidth is inversely proportional to the Fourier Transform size. As the Fourier Transform size increases, the bin bandwidth decreases, thus allowing a finer resolution in terms of frequency. A finer resolution in frequency produces results that more accurately represent the original frequency content of the signal.
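The same bookkeeping can be written in a few lines of Python; the values Fs = 8 kHz, N = 1024, and k = 256 are taken directly from the worked example above:

```python
fs = 8000.0     # sampling rate in Hz
N = 1024        # FFT size

bw_bin = (fs / 2) / (N / 2)     # (4000 Hz) / 512 = 7.8125 Hz per bin

k = 256
f_max = bw_bin * k              # 7.8125 * 256 = 2000.0 Hz
f_min = f_max - bw_bin          # 2000.0 - 7.8125 = 1992.1875 Hz

print(bw_bin, f_min, f_max)     # 7.8125 1992.1875 2000.0
```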

An optimization of the Fourier Transform is the Fast Fourier Transform (FFT), a much more efficient way to compute the Fourier Transform of a given signal. There are many different FFT algorithms, such as the Cooley-Tukey algorithm. Many rely on a divide-and-conquer approach, in which the overall Fourier Transform is computed by breaking the computation down into smaller Fourier Transforms. A direct implementation of the Fourier Transform is of order N², while the Fast Fourier Transform achieves a much lower order of N·log(N).

Another optimization is pre-computing the twiddle factors. The complex exponential in the Fourier Transform definition is known as the twiddle factor. Its value is independent of the input signal x[n] and is always the same for a given n, k, and N. Since these values never change for a particular n and k, a look-up table indexed by n and k can be computed ahead of time, containing every possible value of the complex exponential. Rather than computing the exponential every time, the algorithm looks the value up in the table, which greatly improves the efficiency of the Fourier Transform algorithm.
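The look-up-table idea can be illustrated with a naive O(N²) DFT that builds the twiddle-factor table once and reuses it (an illustrative sketch only; real FFT libraries combine this with the divide-and-conquer structure described above):

```python
import numpy as np

def dft_with_twiddle_table(x):
    """Naive O(N^2) DFT using a precomputed table of twiddle factors."""
    N = len(x)
    n = np.arange(N)
    k = n.reshape(-1, 1)
    # W[k, n] = e^{-i 2*pi*k*n/N}, computed once for every (k, n) pair
    W = np.exp(-1j * 2 * np.pi * k * n / N)
    return W @ x

x = np.random.randn(256)
print(np.allclose(dft_with_twiddle_table(x), np.fft.fft(x)))   # True
```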



Mel-Filtering

The next stage of speech processing is converting the output of the Fourier Transform to the mel scale rather than linear frequency. The mel scale was first introduced in 1937 by Stevens, Volkman, and Newman. It is based on the fact that human hearing responds to changes in frequency logarithmically rather than linearly. The frequency of 1 kHz is used as a reference point, with 1000 Hz equal to 1000 mels. The equation relating frequency to mels is as follows:

m = 2595 \cdot \log_{10}\!\left(1 + \frac{f}{700}\right)



The above graph shows the transformation from linear frequency to the mel scale. As linear frequency increases, a greater change in linear frequency is required to produce the same increment on the mel scale.
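The mapping and its inverse are straightforward to implement (a small sketch; the constants are those of the equation above, and the function names are chosen here only for illustration):

```python
import numpy as np

def hz_to_mel(f):
    """Frequency in Hz to mels (Stevens, Volkman, and Newman scale)."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Mels back to frequency in Hz (inverse of hz_to_mel)."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

print(hz_to_mel(1000.0))   # approximately 1000 mels, the 1 kHz reference point
```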

To apply this transformation to the frequency spectrum produced by the Fourier Transform stage, a series of triangular filters must be created. Each filter is applied to the linear frequency spectrum to generate the mel-scale frequency spectrum. The number of mel filters depends on the application, but typically 20-40 channels are used. The graph below shows 25 mel filters to be applied to the frequency spectrum obtained in the previous section. It can be observed that each successive filter has a larger bandwidth, covering a wider range of frequencies, while its magnitude decreases; this is due to normalizing each filter's magnitude by the bandwidth it covers.
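A common textbook construction of such a filter bank is sketched below; it places the triangle centres at equal mel-scale intervals and normalizes each filter by the bandwidth it covers. This is an illustrative implementation using the hypothetical hz_to_mel/mel_to_hz helpers above, not necessarily the exact SASE Lab code; the 25 filters, 1024-point FFT, and 8 kHz sampling rate are the example values used in this section.

```python
import numpy as np

def mel_filterbank(num_filters=25, nfft=1024, fs=8000):
    """Triangular filters equally spaced on the mel scale.

    Returns an array of shape (num_filters, nfft // 2 + 1); each row is one
    triangular filter defined over the linear-frequency bins.
    """
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), num_filters + 2)
    hz_points = mel_to_hz(mel_points)
    bins = np.floor((nfft + 1) * hz_points / fs).astype(int)

    fbank = np.zeros((num_filters, nfft // 2 + 1))
    for m in range(1, num_filters + 1):
        left, centre, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, centre):
            fbank[m - 1, k] = (k - left) / max(centre - left, 1)
        for k in range(centre, right):
            fbank[m - 1, k] = (right - k) / max(right - centre, 1)

    # Normalize each filter by the bandwidth it covers, so the wider
    # (higher-frequency) filters have a smaller peak, as described above
    enorm = 2.0 / (hz_points[2:num_filters + 2] - hz_points[:num_filters])
    return fbank * enorm[:, np.newaxis]
```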

Applying the mel-filtering process to the frequency spectrum results in a vector whose length equals the number of mel filters applied. Each mel filter function is multiplied, value by value, with the frequency spectrum and the products are summed. This sum of products produces a single value corresponding to the magnitude of signal energy at that particular mel frequency. The process is repeated for each mel filter.

Every filter channel has a magnitude of zero outside its triangle, so any frequency content falling outside the triangle contributes nothing to that mel filter channel. Frequencies nearest to the center of the mel filter have the most impact on the output value, with linearly decreasing significance toward either side of the triangle. See the figure below to observe the input linear frequency spectrum and the resulting mel-scale frequency spectrum.

The blue graph shows the frequency spectrum obtained from the previous section, the Fourier Transform. The red graph below it shows the output after converting to the mel-frequency scale. The mel-frequency spectrum has the same overall shape as the linear-scale frequency spectrum, but the higher-frequency information has been compressed together.
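Applying the filters then reduces to one dot product per filter. Continuing the sketch above (the random frame is only a stand-in for a real windowed speech frame, and mel_filterbank is the hypothetical helper from the previous sketch):

```python
import numpy as np

fs, nfft = 8000, 1024
frame = np.random.randn(nfft)                 # stand-in for a windowed speech frame
spectrum = np.abs(np.fft.rfft(frame, nfft))   # linear-frequency magnitude spectrum
fbank = mel_filterbank(25, nfft, fs)          # hypothetical helper sketched above
mel_spectrum = fbank @ spectrum               # one summed product per mel filter
print(mel_spectrum.shape)                     # (25,)
```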



Mel-Frequency Cepstral Coefficients

The final two steps of speech processing produce results that are called mel-frequency cepstral coefficients or MFCCs. These coefficients form the feature vector that is used to represent the frame of speech being analyzed or processed. As mentioned before, the feature vector needs to accurately characterize the input. The two mathematical processes that need to be applied after the previous steps are taking the logarithm and applying the discrete cosine transform.

The above figure shows the feature vector obtained from the mel-frequency spectrum of the previous section. The first step is to take the base-10 logarithm of each value in the mel-frequency spectrum. This is a very useful operation, as it allows the separation of signals combined through convolution. For example, if a speech signal is convolved with an unwanted signal (denoted here as a noise signal):

y(t) = x(t) * n(t)

where y(t) is the combined speech and noise signal, x(t) is the original speech, and n(t) is the noise signal.

By taking the Fourier Transform of both sides of the equation, the convolution operation becomes a multiplication. This is due to the convolution property of the Fourier Transform.

y(t) = x(t) * n(t)

Y(\omega) = X(\omega) \cdot N(\omega)

Then, by applying the logarithm property of multiplication, the original signal and the noise signal are mathematically added together instead of multiplied. This allows the subtraction of an undesired signal that has been convolved with a desired signal.

Y(\omega) = X(\omega) \cdot N(\omega)

\log_{10}(Y(\omega)) = \log_{10}(X(\omega)) + \log_{10}(N(\omega))
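This identity is easy to verify numerically (a NumPy sketch in which random sequences stand in for the speech and noise signals):

```python
import numpy as np

x = np.random.randn(64)     # stand-in for the speech signal
h = np.random.randn(16)     # stand-in for the convolved noise signal
y = np.convolve(x, h)       # time-domain convolution, length 64 + 16 - 1

L = len(y)
X, H, Y = np.fft.fft(x, L), np.fft.fft(h, L), np.fft.fft(y, L)

# Convolution in time becomes multiplication in frequency ...
print(np.allclose(Y, X * H))                                    # True
# ... and the logarithm turns that product into a sum
print(np.allclose(np.log10(np.abs(Y)),
                  np.log10(np.abs(X)) + np.log10(np.abs(H))))   # True
```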

From the log-domain equation above, if the noise signal is known, it can be subtracted from the combined signal. After the logarithm has been taken of each value of the mel-frequency spectrum, the final stage of speech processing is to apply the discrete cosine transform.

\mathcal{F}\text{-Transform:}\quad X[k] = \sum_{n=0}^{N-1} x[n] \cdot e^{-i 2\pi k n / N}

e^{-i 2\pi k n / N} = \cos(-2\pi n k / N) + i \sin(-2\pi n k / N)

Then drop the imaginary component and the kernel becomes:

\cos(-2\pi n k / N)

The resulting discrete cosine transform equation is:

X[k] = \sum_{n=0}^{N-1} x[n] \cdot \cos(-2\pi n k / N)

This operation results in a vector of values that have been transformed from the mel-frequency domain to the cepstral domain, which led to the name Mel-Frequency Cepstral Coefficients (MFCCs). Most applications use only the first 13 values to form the feature vector, truncating the remaining results. The length of the feature vector depends on the application, but 13 values are sufficient for most speech recognition tasks. The discrete cosine transform used here is nearly identical to the Fourier Transform, except that it drops the imaginary part of the kernel.
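The final stage can be sketched as follows, using the simplified cosine kernel defined above (production systems typically use a DCT-II, which differs in detail but serves the same purpose; the 25-value mel spectrum here is only a random placeholder):

```python
import numpy as np

def mel_to_mfcc(mel_spectrum, num_ceps=13):
    """Log of the mel spectrum followed by the cosine transform above."""
    log_mel = np.log10(np.maximum(mel_spectrum, 1e-10))   # guard against log(0)
    N = len(log_mel)
    n = np.arange(N)
    k = n.reshape(-1, 1)
    C = np.cos(2 * np.pi * k * n / N)      # cos(-x) = cos(x)
    return (C @ log_mel)[:num_ceps]        # keep only the first 13 values

mel_spectrum = np.abs(np.random.randn(25)) + 1e-3   # placeholder mel spectrum
feature_vector = mel_to_mfcc(mel_spectrum)          # 13 MFCC values
```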

Experiments and Results

For this project, a speech processing tool was created with MATLAB to allow analysis of all the steps involved. The tool has been named SASE Lab, for Speech Analysis and Sound Effects Lab. The result is a graphical user interface that allows the user to either record a signal or open an existing waveform for analysis. The signal can then be played back to hear what was said. The interface has six plots that show the speech signal at the various stages of speech processing. Starting at the top-left, a plot shows the entire waveform of the original signal. After opening or recording a signal, the user can click on a part of the displayed signal to analyze that particular frame of speech. Once a frame has been selected, the other five plots show information related to that frame. The top-right plot shows only the selected frame of speech, rather than the whole signal, with the windowing function overlaid in red. On the middle row, the left plot shows the frame of speech after the windowing function has been applied, and the right plot shows the frequency spectrum of the speech frame. On the bottom row, the left plot shows the result of converting the linear-scale frequency spectrum to the mel scale, and the bottom-right plot shows the final output of the speech processing: a vector of 13 features representing the input for the particular frame being analyzed.



The above picture is a screenshot of the SASE Lab tool analyzing a particular frame of speech from a signal. The part of the original signal being analyzed is highlighted with a vertical red bar in the top-left plot.

In addition to these six plots, there are three buttons above them. These three buttons control starting a recording from a microphone, stopping a recording, and playing a recording. Along the menu bar, the user has standard options such as File, View, and Effects. The File menu allows a user to open or save a recording. The View menu allows the user to switch between the view shown above or the spectrogram view. The Effects menu allows the user to apply several different audio effects to the speech signal.

The other main view of SASE Lab shows the spectrogram of the signal, which displays how the frequency content of the signal changes over time. This is an important piece of information when dealing with speech and speech recognition: the human ear differentiates sounds based on frequency content, so analyzing how the frequency content of a speech signal evolves is essential.



The screenshot above shows the spectrogram view of SASE Lab. The top half of the display shows the waveform of the original signal; the bottom section shows the spectrogram of the signal, with time on the x-axis and frequency on the y-axis. The screenshot captures the speaker saying “a, e, i, o, u”, and each utterance shows a distinct pattern of frequency content.
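Conceptually, a spectrogram is just the framing-and-FFT stage repeated across the whole signal, one column per frame. A rough NumPy sketch follows (the frame length, hop size, and FFT size are assumed example values, not the SASE Lab defaults):

```python
import numpy as np

def spectrogram(x, frame_len=256, hop=128, nfft=1024):
    """Magnitude spectrogram: one windowed frame -> one FFT column."""
    window = np.hamming(frame_len)
    cols = []
    for start in range(0, len(x) - frame_len + 1, hop):
        frame = x[start:start + frame_len] * window
        cols.append(np.abs(np.fft.rfft(frame, nfft)))
    return np.array(cols).T    # rows: frequency bins, columns: time frames
```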

The spectrogram view of SASE Lab also allows the user to select a 3-second window to analyze separately when a long speech signal is present. When viewing a longer speech signal, the spectrogram becomes crowded because a large amount of data is compressed into the same display space. See the following screenshot for an example.



The speech signal shown is approximately 18 seconds long and hard to analyze when showing the spectrogram of the whole signal at once. To alleviate this issue, a slider bar was implemented to allow the user to select a 3 second window of the entire speech signal. The window is shown in the top graph by the highlighted red section of the signal waveform. In a new window, the 3 second section of speech waveform is plotted, along with the corresponding section of spectrogram. See the screenshot below.



This figure shows only the section of speech highlighted in the previous figure. Spectral characteristics are much easier to observe and interpret compared to viewing the entire speech signal spectrogram at the same time. The user can change the position on the slider bar from the main view, and the secondary view will update its content to show the new 3 second selection. The user may also play only the 3 second selection by pressing the Play Section button on the right side of the main view. The Play All button will play the entire signal.

Analysis of the feature vectors produced for different vowel sounds yields the expected results. The feature vector should accurately characterize the original sound of the frame of speech, and for different speech sounds it must show distinct characteristics to allow analysis and recognition of the original speech. The following five figures show how SASE Lab analyzes the vowel sounds “a, e, i, o, u”. These sounds are produced by voiced speech, in which the vocal tract is driven by periodic pulses of air from the lungs. The periodic nature of this driving signal is characterized by the speaker's pitch. For example, males tend to have a lower pitch than females and thus a greater amount of time between pulses of air. The speaker's pitch can be observed on the frequency spectrum in SASE Lab.



The above figure shows the speech waveform recorded for a speaker saying the sounds “a, e, i, o, u” (/ey/, /iy/, /ay/, /ow/, /uw/). This can be observed in the first plot, which shows the entire speech signal: there are five distinct regions of significant amplitude, separated by low-amplitude data corresponding to slight pauses between the sounds. In the first graph, a vertical red line is placed on the first region of sound, the “a” or /ey/ sound; the placement of this line controls which frame of the signal is analyzed. The next plot shows a periodic signal with approximately three complete periods, which is the frame selected for analysis. The next three plots show the signal as it passes through each stage of speech processing. The final plot, on the bottom right, shows the mel-frequency cepstral coefficients (MFCCs) for the frame being analyzed. These 13 values are called the feature vector. Note that the first value of the feature vector is omitted from the plot: it contains the energy of the signal and is typically of much greater magnitude than the other 12 values, so showing it would force a re-scaling of the graph and the detail of the other 12 features would be lost.



Feature vectors for “a” /ey/, “e” /iy/, “i” /ay/, and “o” /ow/ (four figures).

The four figures above show how the feature vector differs for each sound produced. The difficulty lies in the fact that, even for the same speaker, every production of a particular sound varies slightly; this is one factor that makes speech recognition a complicated task. Current solutions require training a model for each sound. The training process entails collecting many feature vectors for a sound and creating a statistical model of the distribution of the features for that sound. An unknown feature vector can then be scored against each sound's distribution to compute the probability that it belongs to that sound.
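As a highly simplified illustration of that idea, the sketch below fits a single Gaussian to the feature vectors of each sound and scores an unknown vector against each model. Real recognizers use Gaussian mixtures, hidden Markov models, or neural networks, and the training data here is random placeholder data, not measurements from SASE Lab:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Placeholder training data: many 13-value feature vectors per sound
train = {"ey": np.random.randn(200, 13),
         "iy": np.random.randn(200, 13) + 0.5}

# "Training": one Gaussian per sound, fit to that sound's feature vectors
models = {sound: multivariate_normal(vecs.mean(axis=0), np.cov(vecs, rowvar=False))
          for sound, vecs in train.items()}

# "Recognition": score an unknown feature vector against every model
unknown = np.random.randn(13)
scores = {sound: model.logpdf(unknown) for sound, model in models.items()}
print(max(scores, key=scores.get))   # most likely sound under this toy model
```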



Summary

Speech processing entails many different aspects of mathematics and signal processing. The main goal of the process is a reduction in the amount of data while maintaining an accurate representation of the speech signal's characteristics. For every frame of speech, typically 10-30 milliseconds, a feature vector must be computed that captures these characteristics. An average frame is approximately 256 samples of audio data, while the feature vector is typically only 13 values. This reduction in the amount of data allows for more efficient processing: for example, if the feature vector is passed along to a speech recognition process, analysis of the feature vector is computationally more efficient than analysis of the original frame of speech.

The outline of a speech processing system contains several stages. The first stage is to separate the speech signal into frames of 10-30 millisecond duration; speech is approximately stationary over this period, which allows efficient analysis of a quasi-stationary signal. Each frame of data is then passed through the remaining stages of the process to produce the end result, a feature vector. The next step is to apply a pre-emphasis filter to compensate for the lower energy in the higher frequencies of human speech production. After this filter, the Fourier Transform of the signal is taken to compute the frequency spectrum of the speech frame, which indicates what frequency content composes the speech signal. The frequency spectrum is then converted to a logarithmic scale through the process of mel-filtering, modeling the way humans perceive frequency. After this step, the base-10 logarithm of the resulting values is taken and, finally, the discrete cosine transform is applied. The resulting vector is truncated to 13 values and forms the feature vector. This feature vector characterizes the type of speech sounds present in the original frame, but with the advantage of using far less data, and can then be passed along to a speech recognition system for further analysis.
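Putting the stages together, one frame of speech maps to one feature vector roughly as follows. This is an illustrative pipeline reusing the hypothetical mel_filterbank and mel_to_mfcc sketches from earlier sections; the 0.97 pre-emphasis coefficient and the Hamming window are typical choices, not values stated in this report:

```python
import numpy as np

def frame_to_features(frame, fs=8000, nfft=1024, num_filters=25, num_ceps=13):
    """One 10-30 ms frame of speech in, one 13-value feature vector out."""
    # Pre-emphasis: boost the weaker high-frequency content
    emphasized = np.append(frame[0], frame[1:] - 0.97 * frame[:-1])
    windowed = emphasized * np.hamming(len(emphasized))                # windowing
    spectrum = np.abs(np.fft.rfft(windowed, nfft))                     # Fourier Transform
    mel_spectrum = mel_filterbank(num_filters, nfft, fs) @ spectrum    # mel-filtering
    return mel_to_mfcc(mel_spectrum, num_ceps)                         # log + cosine transform

features = frame_to_features(np.random.randn(256))   # e.g. a 256-sample frame
```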

The MATLAB tool created for this project, SASE Lab, performs all stages of speech processing to produce a feature vector for each frame of speech signal data. SASE Lab also shows graphs of the data after each stage of speech processing; this breakdown allows the user to visualize how the data is manipulated through each step of the process. In addition to speech processing, the tool incorporates several digital signal processing techniques for adding audio effects to a speech signal, including echo, reverberation, flange, chorus, vibrato, tremolo, and modulation. The application of these effects can then be analyzed for how they alter the feature vectors produced, or the signal at any stage of speech processing.



References

Këpuska, Veton. Discrete-Time Signal Processing Framework. http://my.fit.edu/~vkepuska/ece5525/Ch2-Discrete-Time%20Signal%20Processing%20Framework2.ppt

Këpuska, Veton. Acoustics of Speech Production. http://my.fit.edu/~vkepuska/ece5525/Ch4-Acoustics_of_Speech_Production.pptx

Këpuska, Veton. Speech Signal Representations. http://my.fit.edu/~vkepuska/ece5526/Ch3-Speech_Signal_Representations.pptx

Këpuska, Veton. Automatic Speech Recognition. http://my.fit.edu/~vkepuska/ece5526/Ch5-Automatic%20Speech%20Recognition.pptx

Oppenheim, Alan V., and Ronald W. Schafer. Discrete-Time Signal Processing. 3rd ed. Upper Saddle River: Pearson, 2010.

Phillips, Charles L., and John M. Parr. Signals, Systems, and Transforms. 4th ed. Upper Saddle River, NJ: Pearson/Prentice Hall, 2008.

Quatieri, T. F. Discrete-Time Speech Signal Processing: Principles and Practice. Upper Saddle River, NJ: Prentice Hall, 2002.
