Artificial Intelligence in Voice Recognition Systems


    1. INTRODUCTION

The speech recognition process is performed by a software component known as the speech recognition engine. The primary function of the speech recognition engine is to process spoken input and translate it into text that an application understands. The application can then do one of two things: it can interpret the result of the recognition as a command, in which case it is a Command and Control Application, or it can handle the recognized text simply as text, in which case it is considered a Dictation Application.

The user speaks to the computer through a microphone; the system identifies the words and sends them to an NLP device for further processing. Once recognized, the words can be used in a variety of applications like display, robotics, commands to computers, and dictation. No special commands or computer language are required, and there is no need to write programs in a special language to create software.

Voice XML takes speech recognition even further. Instead of talking to your computer, you're essentially talking to a web site, and you're doing this over the phone. OK, you say, but what exactly is speech recognition? Simply put, it is the process of converting spoken input to text; it is thus sometimes referred to as speech-to-text. Speech recognition allows you to provide input to an application with your voice. Just as clicking with your mouse, typing on your keyboard, or pressing a key on the phone keypad provides input to an application, speech recognition allows you to provide input by talking. In the desktop world, you need a microphone to be able to do this. In the Voice XML world, all you need is a telephone.
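To make the command-versus-dictation distinction concrete, here is a minimal desktop sketch in Python. It assumes the open-source SpeechRecognition package and its Google Web Speech backend, neither of which the text prescribes; the command list is invented for illustration.

```python
# Minimal speech-to-text sketch (assumes the third-party package:
# pip install SpeechRecognition pyaudio).
import speech_recognition as sr

recognizer = sr.Recognizer()
with sr.Microphone() as source:          # desktop world: a microphone
    recognizer.adjust_for_ambient_noise(source)
    audio = recognizer.listen(source)    # one utterance, ended by silence

try:
    text = recognizer.recognize_google(audio)      # spoken input -> text
    if text.lower() in ("open", "close", "save"):  # invented command set
        print("Command and Control:", text)        # interpret as a command
    else:
        print("Dictation:", text)                  # handle simply as text
except sr.UnknownValueError:
    print("Speech was not understood.")
```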

When you dial the telephone number of a big company, you are likely to hear the sonorous voice of a cultured lady who responds to your call with great courtesy, saying "Welcome to company X. Please give me the extension number you want." You pronounce the extension number, your name, and the name of the person you want to contact. If the called person accepts the call, the connection is made quickly. This is artificial intelligence at work: an automatic call-handling system is used without employing any telephone operator.

AI is the study of how to make computers perform tasks that, at present, are done better by humans. AI is an interdisciplinary field where computer science intersects with philosophy, psychology, engineering and other fields. Humans make decisions based upon experience and intention. The essence of AI is integrating computers that mimic this learning process; this is known as Artificial Intelligence Integration.

    1.1 Problems

The general problem of simulating (or creating) intelligence has been broken down into a number of specific sub-problems. These consist of particular traits or capabilities that researchers would like an intelligent system to display. The traits described below have received the most attention.

    1.1.1 Deduction, reasoning, problem solving


Early AI researchers developed algorithms that imitated the step-by-step reasoning that humans were often assumed to use when they solve puzzles, play board games or make logical deductions. By the late 1980s and 1990s, AI research had also developed highly successful methods for dealing with uncertain or incomplete information, employing concepts from probability and economics.

For difficult problems, most of these algorithms can require enormous computational resources; most experience a "combinatorial explosion", in which the amount of memory or computer time required becomes astronomical when the problem goes beyond a certain size. The search for more efficient problem-solving algorithms is a high priority for AI research.

Human beings solve most of their problems using fast, intuitive judgments rather than the conscious, step-by-step deduction that early AI research was able to model. AI has made some progress at imitating this kind of "sub-symbolic" problem solving: embodied agent approaches emphasize the importance of sensorimotor skills to higher reasoning, while neural net research attempts to simulate the structures inside human and animal brains that give rise to this skill.

    1.1.2 Knowledge representation

Knowledge representation and knowledge engineering are central to AI research. Many of the problems machines are expected to solve will require extensive knowledge about the world. Among the things that AI needs to represent are: objects, properties, categories and relations between objects; situations, events, states and time; causes and effects; knowledge about knowledge (what we know about what other people know); and many other, less well researched domains. A complete representation of "what exists" is an ontology (borrowing a word from traditional philosophy), of which the most general are called upper ontologies.

    1.1.2.1 Among the most difficult problems in knowledge representation are:

    Default reasoning and the qualification problem:

Many of the things people know take the form of "working assumptions." For example, if a bird comes up in conversation, people typically picture an animal that is fist-sized, sings, and flies. None of these things are true about all birds. John McCarthy identified this problem in 1969 as the qualification problem: for any commonsense rule that AI researchers care to represent, there tend to be a huge number of exceptions. Almost nothing is simply true or false in the way that abstract logic requires. AI research has explored a number of solutions to this problem.

The breadth of commonsense knowledge:

The number of atomic facts that the average person knows is astronomical. Research projects that attempt to build a complete knowledge base of commonsense knowledge (e.g., Cyc) require enormous amounts of laborious ontological engineering; they must be built, by hand, one complicated concept at a time. A major goal is to have the computer understand enough concepts to be able to learn by reading from sources like the Internet, and thus be able to add to its own ontology.

The sub-symbolic form of some commonsense knowledge:

Much of what people know is not represented as "facts" or "statements" that they could actually say out loud. For example, a chess master will avoid a particular chess position because it "feels too exposed", or an art critic can take one look at a statue and instantly realize that it is a fake. These are intuitions or tendencies that are represented in the brain non-consciously and sub-symbolically. Knowledge like this informs, supports and provides a context for symbolic, conscious knowledge. As with the related problem of sub-symbolic reasoning, it is hoped that situated AI or computational intelligence will provide ways to represent this kind of knowledge.

    1.1.3 Planning

Intelligent agents must be able to set goals and achieve them. They need a way to visualize the future (they must have a representation of the state of the world and be able to make predictions about how their actions will change it) and be able to make choices that maximize the utility (or "value") of the available choices.

In classical planning problems, the agent can assume that it is the only thing acting on the world and can be certain what the consequences of its actions may be. However, if this is not true, it must periodically check whether the world matches its predictions and change its plan as this becomes necessary, requiring the agent to reason under uncertainty. Multi-agent planning uses the cooperation and competition of many agents to achieve a given goal. Emergent behavior such as this is used by evolutionary algorithms and swarm intelligence.

    1.1.4 Learning

Machine learning has been central to AI research from the beginning. Unsupervised learning is the ability to find patterns in a stream of input. Supervised learning includes both classification and numerical regression. Classification is used to determine what category something belongs in, after seeing a number of examples of things from several categories. Regression takes a set of numerical input/output examples and attempts to discover a continuous function that would generate the outputs from the inputs. In reinforcement learning the agent is rewarded for good responses and punished for bad ones. These can be analyzed in terms of decision theory, using concepts like utility. The mathematical analysis of machine learning algorithms and their performance is a branch of theoretical computer science known as computational learning theory.

    1.1.5 Natural language processing


Figure 1.1: ASIMO uses sensors and intelligent algorithms to avoid obstacles and navigate stairs.

Natural language processing gives machines the ability to read and understand the languages that humans speak. Many researchers hope that a sufficiently powerful natural language processing system would be able to acquire knowledge on its own, by reading the existing text available over the Internet. Some straightforward applications of natural language processing include information retrieval (or text mining) and machine translation.

    1.1.6 Motion and manipulation

Figure 1.2: The care-providing robot FRIEND.

The care-providing robot FRIEND uses sensors such as cameras, together with intelligent algorithms, to control its manipulator in order to support disabled and elderly people in their daily life activities. The field of robotics is closely related to AI. Intelligence is required for robots to be able to handle such tasks as object manipulation and navigation, with the sub-problems of localization (knowing where you are), mapping (learning what is around you) and motion planning (figuring out how to get there).

    1.1.7 Perception

Machine perception is the ability to use input from sensors (such as cameras, microphones, sonar and others that are more exotic) to deduce aspects of the world. Computer vision is the ability to analyze visual input. A few selected sub-problems are speech recognition, facial recognition and object recognition.

1.1.8 Social intelligence


Figure 1.3: Kismet, a robot with rudimentary social skills.

Emotion and social skills play two roles for an intelligent agent. First, it must be able to predict the actions of others, by understanding their motives and emotional states. (This involves elements of game theory and decision theory, as well as the ability to model human emotions and the perceptual skills to detect emotions.) Second, for good human-computer interaction, an intelligent machine needs to display emotions. At the very least it must appear polite and sensitive to the humans it interacts with. At best, it should have normal emotions itself.

    1.1.9 Creativity

Figure 1.4: TOPIO, a robot that can play table tennis, developed by TOSY.

A sub-field of AI addresses creativity both theoretically (from a philosophical and psychological perspective) and practically (via specific implementations of systems that generate outputs that can be considered creative). Related areas of computational research are Artificial Intuition and Artificial Imagination.

    1.2 Tools

In the course of 50 years of research, AI has developed a large number of tools to solve the most difficult problems in the field of computer science. A few of the most general of these methods are discussed below.

    1.2.1 Search and optimization


Many problems in AI can be solved in theory by intelligently searching through many possible solutions; reasoning can be reduced to performing a search. For example, logical proof can be viewed as searching for a path that leads from premises to conclusions, where each step is the application of an inference rule. Planning algorithms search through trees of goals and subgoals, attempting to find a path to a target goal, a process called means-ends analysis. Robotics algorithms for moving limbs and grasping objects use local searches in configuration space. Many learning algorithms use search algorithms based on optimization.

Simple exhaustive searches are rarely sufficient for most real-world problems: the search space (the number of places to search) quickly grows to astronomical numbers. The result is a search that is too slow or never completes. The solution, for many problems, is to use "heuristics" or "rules of thumb" that eliminate choices that are unlikely to lead to the goal (called "pruning the search tree"). Heuristics supply the program with a "best guess" for what path the solution lies on.

A very different kind of search came to prominence in the 1990s, based on the mathematical theory of optimization. For many problems, it is possible to begin the search with some form of a guess and then refine the guess incrementally until no more refinements can be made. These algorithms can be visualized as blind hill climbing: we begin the search at a random point on the landscape, and then, by jumps or steps, we keep moving our guess uphill until we reach the top. Other optimization algorithms are simulated annealing, beam search and random optimization.
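The blind hill-climbing idea can be sketched in a few lines of Python; the one-dimensional objective function and step size below are illustrative assumptions, not part of any particular AI system.

```python
import random

def hill_climb(objective, start, step=0.1, max_iters=1000):
    """Refine a guess until no neighboring guess is better."""
    current = start
    for _ in range(max_iters):
        neighbors = [current - step, current + step]
        best = max(neighbors, key=objective)
        if objective(best) <= objective(current):
            return current          # reached a (local) top of the hill
        current = best
    return current

# Illustrative landscape with a single peak at x = 2.
f = lambda x: -(x - 2.0) ** 2
print(hill_climb(f, start=random.uniform(-10, 10)))
```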

Evolutionary computation uses a form of optimization search. For example, such methods may begin with a population of organisms (the guesses) and then allow them to mutate and recombine, selecting only the fittest to survive each generation (refining the guesses). Forms of evolutionary computation include swarm intelligence algorithms (such as ant colony or particle swarm optimization) and evolutionary algorithms (such as genetic algorithms and genetic programming).
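A minimal genetic-algorithm sketch of the mutate-recombine-select loop described above, in Python; the bit-string encoding and the "one-max" fitness function are illustrative assumptions.

```python
import random

def genetic_search(fitness, n_bits=20, pop_size=30, generations=100):
    """Tiny genetic algorithm: selection, crossover, mutation."""
    pop = [[random.randint(0, 1) for _ in range(n_bits)]
           for _ in range(pop_size)]
    for _ in range(generations):
        # Select the fittest half to survive (refining the guesses).
        pop.sort(key=fitness, reverse=True)
        survivors = pop[:pop_size // 2]
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = random.sample(survivors, 2)
            cut = random.randrange(1, n_bits)     # recombine (crossover)
            child = a[:cut] + b[cut:]
            i = random.randrange(n_bits)          # mutate one bit
            child[i] ^= 1
            children.append(child)
        pop = survivors + children
    return max(pop, key=fitness)

# Illustrative fitness: number of 1-bits (the "one-max" problem).
print(genetic_search(fitness=sum))
```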

    1.2.2 Logic

Logic was introduced into AI research by John McCarthy in his 1958 Advice Taker proposal. Logic is used for knowledge representation and problem solving, but it can be applied to other problems as well. For example, the satplan algorithm uses logic for planning, and inductive logic programming is a method for learning.

Several different forms of logic are used in AI research. Propositional or sentential logic is the logic of statements which can be true or false. First-order logic also allows the use of quantifiers and predicates, and can express facts about objects, their properties, and their relations with each other. Fuzzy logic is a version of first-order logic which allows the truth of a statement to be represented as a value between 0 and 1, rather than simply True (1) or False (0). Fuzzy systems can be used for uncertain reasoning and have been widely used in modern industrial and consumer product control systems. Subjective logic models uncertainty in a different and more explicit manner than fuzzy logic: a given binomial opinion satisfies belief + disbelief + uncertainty = 1 within a Beta distribution. By this method, ignorance can be distinguished from probabilistic statements that an agent makes with high confidence. Default logics, non-monotonic logics and circumscription are forms of logic designed to help with default reasoning and the qualification problem.
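The contrast can be made concrete in a few lines of Python. The min/max connectives below are one common (assumed) choice of fuzzy operators, and the opinion values are invented to show the belief + disbelief + uncertainty = 1 constraint.

```python
# Fuzzy truth values lie in [0, 1] rather than {True, False}.
def f_and(a, b): return min(a, b)   # a common fuzzy conjunction
def f_or(a, b):  return max(a, b)   # a common fuzzy disjunction
def f_not(a):    return 1.0 - a

warm, dark = 0.7, 0.4               # illustrative degrees of truth
print(f_and(warm, f_not(dark)))     # "warm and not dark" -> 0.6

# A subjective-logic binomial opinion, by contrast, represents
# uncertainty explicitly: belief + disbelief + uncertainty = 1.
belief, disbelief, uncertainty = 0.5, 0.2, 0.3   # invented opinion
assert abs(belief + disbelief + uncertainty - 1.0) < 1e-9
```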

Several extensions of logic have been designed to handle specific domains of knowledge, such as: description logics; situation calculus, event calculus and fluent calculus (for representing events and time); causal calculus; belief calculus; and modal logics.

    1.2.3 Probabilistic methods for uncertain reasoning

Many problems in AI (in reasoning, planning, learning, perception and robotics) require the agent to operate with incomplete or uncertain information. Starting in the late 1980s and early 1990s, Judea Pearl and others championed the use of methods drawn from probability theory and economics to devise a number of powerful tools to solve these problems.

Bayesian networks are a very general tool that can be used for a large number of problems: reasoning (using the Bayesian inference algorithm), learning (using the expectation-maximization algorithm), planning (using decision networks) and perception (using dynamic Bayesian networks). Probabilistic algorithms can also be used for filtering, prediction, smoothing and finding explanations for streams of data, helping perception systems to analyze processes that occur over time (e.g., hidden Markov models or Kalman filters).
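As an illustration of probabilistic filtering over a stream of data, here is a sketch of the forward algorithm for a hidden Markov model in Python; the two-state model and all probabilities are invented for the example.

```python
import numpy as np

def hmm_filter(trans, emit, observations, prior):
    """Forward algorithm: P(hidden state | observations so far)."""
    belief = prior
    for obs in observations:
        belief = emit[:, obs] * (trans.T @ belief)  # predict, then weight
        belief /= belief.sum()                      # normalize
    return belief

# Illustrative 2-state model (say, "silence" vs "speech") with
# made-up transition and emission probabilities.
trans = np.array([[0.9, 0.1],     # trans[i, j] = P(next state j | state i)
                  [0.2, 0.8]])
emit = np.array([[0.8, 0.2],      # emit[i, o] = P(observation o | state i)
                 [0.3, 0.7]])
prior = np.array([0.5, 0.5])
print(hmm_filter(trans, emit, observations=[1, 1, 0], prior=prior))
```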

A key concept from the science of economics is "utility": a measure of how valuable something is to an intelligent agent. Precise mathematical tools have been developed to analyze how an agent can make choices and plan, using decision theory, decision analysis, and information value theory. These tools include models such as Markov decision processes and dynamic decision networks.

    1.2.4 Classifiers and statistical learning methods

The simplest AI applications can be divided into two types: classifiers ("if shiny then diamond") and controllers ("if shiny then pick up"). Controllers do, however, also classify conditions before inferring actions, and therefore classification forms a central part of many AI systems. Classifiers are functions that use pattern matching to determine a closest match. They can be tuned according to examples, making them very attractive for use in AI. These examples are known as observations or patterns. In supervised learning, each pattern belongs to a certain predefined class. A class can be seen as a decision that has to be made. All the observations combined with their class labels are known as a data set. When a new observation is received, that observation is classified based on previous experience.

A classifier can be trained in various ways; there are many statistical and machine learning approaches. The most widely used classifiers are the neural network, kernel methods such as the support vector machine, the k-nearest neighbor algorithm, the Gaussian mixture model, the naive Bayes classifier, and the decision tree. The performance of these classifiers has been compared over a wide range of tasks. Classifier performance depends greatly on the characteristics of the data to be classified; there is no single classifier that works best on all given problems (this is also referred to as the "no free lunch" theorem). Determining a suitable classifier for a given problem is still more an art than a science.
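As one concrete example from the list above, a k-nearest neighbor classifier can be written directly: it stores the observations with their class labels (the data set) and classifies a new observation by majority vote among the closest stored patterns. The tiny 2-D data set below is invented for illustration.

```python
import numpy as np

def knn_classify(train_x, train_y, query, k=3):
    """Classify a new observation by its k closest stored patterns."""
    dists = np.linalg.norm(train_x - query, axis=1)   # distance to each pattern
    nearest = train_y[np.argsort(dists)[:k]]          # labels of k nearest
    labels, counts = np.unique(nearest, return_counts=True)
    return labels[np.argmax(counts)]                  # majority vote

# Illustrative data set: 2-D observations with class labels.
train_x = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]])
train_y = np.array(["quiet", "quiet", "loud", "loud"])
print(knn_classify(train_x, train_y, query=np.array([0.8, 0.9])))
```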

    1.2.5 Neural networks

Figure 1.5: A neural network is an interconnected group of nodes, akin to the vast network of neurons in the human brain.

The study of artificial neural networks began in the decade before the field of AI research was founded, in the work of Walter Pitts and Warren McCullough. Other important early researchers were Frank Rosenblatt, who invented the perceptron, and Paul Werbos, who developed the backpropagation algorithm.

The main categories of networks are acyclic or feed-forward neural networks (where the signal passes in only one direction) and recurrent neural networks (which allow feedback). Among the most popular feed-forward networks are perceptrons, multi-layer perceptrons and radial basis networks. Among recurrent networks, the most famous is the Hopfield net, a form of attractor network, which was first described by John Hopfield in 1982. Neural networks can be applied to the problem of intelligent control (for robotics) or learning, using such techniques as Hebbian learning and competitive learning.
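Rosenblatt's perceptron, the simplest of the feed-forward networks named above, can be sketched in a few lines of Python; the learning rate, epoch count, and the logical-OR training data are illustrative assumptions.

```python
import numpy as np

def train_perceptron(x, y, epochs=20, lr=0.1):
    """Rosenblatt's perceptron learning rule on labeled examples."""
    w = np.zeros(x.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(x, y):             # labels yi are in {-1, +1}
            if yi * (w @ xi + b) <= 0:       # example is misclassified
                w += lr * yi * xi            # nudge weights toward it
                b += lr * yi
    return w, b

# Illustrative linearly separable data: logical OR.
x = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([-1, 1, 1, 1])
w, b = train_perceptron(x, y)
print([1 if w @ xi + b > 0 else -1 for xi in x])   # [-1, 1, 1, 1]
```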

Jeff Hawkins argues that research in neural networks has stalled because it has failed to model the essential properties of the neocortex, and has suggested a model (Hierarchical Temporal Memory) that is based on neurological research.

    1.2.6 Control theory

Control theory, the grandchild of cybernetics, has many important applications, especially in robotics.

    1.2.7 Languages

AI researchers have developed several specialized languages for AI research, including Lisp and Prolog.

    1.3 Applications


Artificial intelligence has successfully been used in a wide range of fields including medical diagnosis, stock trading, robot control, law, scientific discovery, video games, toys, and Web search engines. Frequently, when a technique reaches mainstream use, it is no longer considered artificial intelligence; this is sometimes described as the AI effect. It may also become integrated into artificial life.

1.3.1 Competitions and prizes

There are a number of competitions and prizes to promote research in artificial intelligence. The main areas promoted are general machine intelligence, conversational behavior, data mining, driverless cars, robot soccer and games.

    1.3.2 Platforms

A platform (or "computing platform") is defined by Wikipedia as "some sort of hardware architecture or software framework (including application frameworks), that allows software to run." As Rodney Brooks pointed out many years ago, it is not just the artificial intelligence software that defines the AI features of the platform, but rather the actual platform itself that affects the AI that results; i.e., we need to be working out AI problems on real-world platforms rather than in isolation.

A wide variety of platforms has allowed different aspects of AI to develop, ranging from expert systems (albeit PC-based, but still entire real-world systems) to various robot platforms such as the widely available Roomba with its open interface.

    2. THE TECHNOLOGY


A human identity recognition system based on voice analysis could have numerous applications. ASR (Automatic Speaker Recognition) is one such system: a system that can recognize a person based on his/her voice. This is achieved by implementing complex signal processing algorithms that run on a digital computer or a processor. The application is analogous to fingerprint recognition or other biometric recognition systems that are based on certain characteristics of a person.

There are several occasions when we want to identify a person from a given group of people even when the person is not present for physical examination. For example, when a person converses on a telephone, all we have is the person's voice for analysis. It then makes sense to develop a recognition system based on voice.

Speaker recognition has typically been classified as either a verification or an identification task. Speaker verification is usually the simpler of the two, since it involves the comparison of the input signal with a single given stored reference pattern. The verification task therefore only requires the system to verify that the speaker is the person he/she claims to be. Speaker identification is more complex because the test speaker must be compared against a number of reference speakers to determine whether a match can be made. Not only must the input signal be examined to see if it came from a known speaker, but the individual speaker must also be identified.

The identification of speakers remains a difficult task for a number of reasons. First, the acquisition of a unique speech signal can suffer as a result of variation in the voice inputs from a speaker and environmental factors. Both the volume and pace of speech can vary from one test to another. Also, unless initially constrained, an extensive vocabulary or unstructured grammar can affect results. Background noise must also be kept to a minimum so that a changing environment will not divert the speaker's attention or affect the final voicing of a word or sentence. As a result, many restrictions and clarifications have been placed on speaker and speech recognition systems.

One such restriction involves using a closed set for speaker recognition. A closed set implies that only speakers within the original stored set will be asked to be identified. An open set would allow the extra possibility of a test speaker not coming from the initially trained set of speakers, thereby requiring the system to recognize the speaker as not belonging to the original set. An open-set system may also have the task of learning a new speaker and placing him or her within the original set for future reference.

Another common restriction involves using a text-dependent speaker recognition system. This type of system requires the speaker to utter a unique word or phrase to be compared against the original set of like phrases. Text-independent recognition, which in most cases is more complex and difficult to perform, identifies the speaker regardless of the text or phrase spoken.

Once an utterance, or signal, has been recorded, it is usually necessary to process it to get the voiced signal into a form that makes classification and recognition possible. Various methods have included the use of power spectrum values, spectrum coefficients, linear predictive coding, and a nearest-neighbor distance algorithm. Tests have also shown that although spectrum coefficients and linear predictive coding have given better results for conventional template and statistical classification methods, power spectrum values have performed better when using neural networks during the final recognition stages.
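A sketch of the first of those feature types: framewise power spectrum values computed with NumPy. The frame length, hop size, and Hamming window are common (assumed) choices, not prescribed by the text.

```python
import numpy as np

def power_spectrum_frames(signal, frame_len=256, hop=128):
    """Split a signal into overlapping frames and compute a power
    spectrum per frame, one simple choice of feature vector."""
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len] * np.hamming(frame_len)
        spectrum = np.abs(np.fft.rfft(frame)) ** 2   # power spectrum
        frames.append(spectrum)
    return np.array(frames)

# Illustrative "utterance": one second of a 440 Hz tone at 8 kHz.
t = np.arange(8000) / 8000.0
features = power_spectrum_frames(np.sin(2 * np.pi * 440 * t))
print(features.shape)   # (number of frames, frame_len // 2 + 1)
```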

Various methods have also been used to perform the classification and recognition of the processed speech signal. Statistical methods utilizing Hidden Markov Models, linear vector quantizers, or classical techniques such as template matching have produced encouraging, yet limited, success. Recent deployments using neural networks, while producing varied success rates, have offered more options regarding the types of inputs sent to the networks, as well as provided the ability to learn speakers in both an off-line and an on-line manner. Although back-propagation networks have traditionally been used, more sophisticated networks, such as the ART 2 network, have also been implemented.

ASR can be broadly classified into four types:

1. Text-independent identification
2. Text-independent verification
3. Text-dependent identification
4. Text-dependent verification

Speaker identification is a procedure by which a speaker is identified from a group of n people. It should be noted that a totally new speaker not belonging to the group could wrongly be identified as someone from within the group. Speaker verification is a procedure by which a speaker's claimed identity is verified as being correct or not.

A fundamental requirement for any ASR system is gathering reference samples and finding certain features of the voice that are characteristic of a person. These feature vectors are then stored. When a new test sample is made available, the references are either searched to find the closest match (in the case of identification), or a threshold on a distance measure is checked (in the case of verification).

The next aspect to be considered is text dependency. In a text-independent situation, the reference utterance and the test utterance are not the same. This type of recognition system finds its applications in criminology. In a text-dependent situation, the reference utterance and the test utterance are the same, which gives us a higher degree of accuracy. This type of recognition system has applications where security is a matter of concern, such as access to a building, a lab, a computer, etc.

The dominant technology used in ASR is called the Hidden Markov Model, or HMM. This technology recognizes speech by estimating the likelihood of each phoneme at contiguous, small regions (frames) of the speech signal. Each word in a vocabulary list is specified in terms of its component phonemes. A search procedure is used to determine the sequence of phonemes with the highest likelihood. This search is constrained to only look for phoneme sequences that correspond to words in the vocabulary list, and the phoneme sequence with the highest total likelihood is identified with the word that was spoken. In standard HMMs, the likelihoods are computed using a Gaussian Mixture Model; in the HMM/ANN framework, these values are computed using an artificial neural network (ANN).
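The search procedure is typically the Viterbi algorithm. The sketch below finds the most likely state (phoneme) sequence for a sequence of quantized frames; the two-phoneme model and all probabilities are invented for illustration, and a real recognizer would use Gaussian mixture or ANN likelihoods rather than a small discrete table.

```python
import numpy as np

def viterbi(log_trans, log_emit, observations, log_prior):
    """Most likely state sequence for a sequence of observed frames."""
    score = log_prior + log_emit[:, observations[0]]
    back = []
    for obs in observations[1:]:
        step = score[:, None] + log_trans       # score of every i -> j move
        back.append(step.argmax(axis=0))        # best predecessor per state
        score = step.max(axis=0) + log_emit[:, obs]
    path = [int(score.argmax())]                # best final state
    for pointers in reversed(back):             # trace back the best path
        path.append(int(pointers[path[-1]]))
    return path[::-1]

# Two illustrative "phonemes" and three quantized frames.
log_trans = np.log([[0.7, 0.3], [0.4, 0.6]])    # log P(next | current)
log_emit = np.log([[0.9, 0.1], [0.2, 0.8]])     # log P(frame | phoneme)
print(viterbi(log_trans, log_emit, [0, 1, 1], np.log([0.5, 0.5])))
```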


    3. SPEECH RECOGNITION

As described earlier, the user speaks to the computer through a microphone; the system identifies the words and sends them to an NLP device for further processing. Once recognized, the words can be used in a variety of applications like display, robotics, commands to computers, and dictation.


The word recognizer is a speech recognition system that identifies individual words. Early pioneering systems could recognize only individual letters and numbers. Today, the majority of speech recognition systems are word recognizers and have more than 95% recognition accuracy. Such systems are capable of recognizing a small vocabulary of single words or simple phrases. One must speak the input information in clearly definable single words, with a pause between words, in order to enter data into the computer. Continuous speech recognizers are far more difficult to build than word recognizers. You speak complete sentences to the computer; the input is recognized and then processed by NLP. Such recognizers employ sophisticated, complex techniques to deal with continuous speech, because when one speaks continuously, most of the words slur together and it is difficult for the system to know where one word ends and the next begins. Unlike word recognizers, this type of system does not recognize the spoken information instantly.

    What is a speech recognition system?

A speech recognition system is a type of software that allows the user to have their spoken words converted into written text in a computer application such as a word processor or spreadsheet. The computer can also be controlled by the use of spoken commands.

Speech recognition software can be installed on a personal computer of appropriate specification. The user speaks into a microphone (a headphone microphone is usually supplied with the product). The software generally requires an initial training and enrolment process in order to teach the software to recognize the voice of the user. A voice profile is then produced that is unique to that individual. This procedure also helps the user to learn how to speak to a computer.

    3.1 Speech recognition process

After the training process, the user's spoken words will produce text; the accuracy of this will improve with further dictation and conscientious use of the correction procedure. With a well-trained system, around 95% of the words spoken could be correctly interpreted. The system can be trained to identify certain words and phrases and to examine the user's standard documents in order to develop an accurate voice file for the individual.

However, there are many other factors that need to be considered in order to achieve a high recognition rate. There is no doubt that the software works and can liberate many learners, but the process can be far more time-consuming than first-time users may appreciate, and the results can often be poor. This can be very demotivating, and many users give up at this stage. Quality support from someone who is able to show the user the most effective ways of using the software is essential.

When using speech recognition software, the user's expectations and the advertising on the box may well be far higher than what will realistically be achieved. "You talk and it types" can be achieved by some people only after a great deal of perseverance and hard work.

    3.2 Terms and Concepts


Following are a few of the basic terms and concepts that are fundamental to speech recognition. It is important to have a good understanding of these concepts when developing Voice XML applications.

    3.2.1 Utterances

When the user says something, this is known as an utterance. An utterance is any stream of speech between two periods of silence. Utterances are sent to the speech engine to be processed. Silence, in speech recognition, is almost as important as what is spoken, because silence delineates the start and end of an utterance. Here is how it works: the speech recognition engine is "listening" for speech input. When the engine detects audio input (in other words, a lack of silence), the beginning of an utterance is signaled. Similarly, when the engine detects a certain amount of silence following the audio, the end of the utterance occurs. If the user doesn't say anything, the engine returns what is known as a silence timeout: an indication that there was no speech detected within the expected time frame, and the application takes an appropriate action, such as reprompting the user for input. An utterance can be a single word, or it can contain multiple words (a phrase or a sentence).
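A sketch of this endpointing logic, using a simple per-frame energy threshold over raw samples; the frame length, threshold, and required amount of trailing silence are illustrative assumptions, and production engines use more robust detectors.

```python
import numpy as np

def find_utterance(samples, rate, frame_ms=20, threshold=0.02,
                   silence_ms=300):
    """Locate one utterance as speech bounded by silence, using a
    per-frame RMS energy threshold. `samples` is a 1-D NumPy array."""
    frame = int(rate * frame_ms / 1000)
    silent_needed = silence_ms // frame_ms
    start, end, quiet = None, None, 0
    for i in range(0, len(samples) - frame, frame):
        energy = np.sqrt(np.mean(samples[i:i + frame] ** 2))
        if energy > threshold:
            if start is None:
                start = i            # audio detected: utterance begins
            quiet = 0
        elif start is not None:
            quiet += 1
            if quiet >= silent_needed:   # enough silence: utterance ends
                end = i
                break
    return start, end

# Illustrative clip: silence, then a tone, then silence (8 kHz).
rate = 8000
t = np.arange(3000) / rate
clip = np.concatenate([np.zeros(2000),
                       0.5 * np.sin(2 * np.pi * 440 * t),
                       np.zeros(3000)])
print(find_utterance(clip, rate))   # approximate (start, end) in samples
```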

    3.2.2 Pronunciations

The speech recognition engine uses all sorts of data, statistical models, and algorithms to convert spoken input into text. One piece of information that the speech recognition engine uses to process a word is its pronunciation, which represents what the speech engine thinks a word should sound like. Words can have multiple pronunciations associated with them. For example, the word "the" has at least two pronunciations in U.S. English: "thee" and "thuh". As a Voice XML application developer, you may want to provide multiple pronunciations for certain words and phrases to allow for variations in the ways your callers may speak them.
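One simple (assumed) way to picture this is a lexicon mapping each word to a list of pronunciations; the informal spellings below are illustrative, not a real engine's phoneme set.

```python
# A pronunciation lexicon sketch: each word maps to one or more
# phonetic renderings (informal spellings, for illustration only).
lexicon = {
    "the": ["thee", "thuh"],
    "either": ["ee-ther", "eye-ther"],
}

def pronunciations(word):
    """Return every known way the word may sound, or [] if unknown."""
    return lexicon.get(word.lower(), [])

print(pronunciations("the"))   # ['thee', 'thuh']
```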

    3.2.3 Grammars

As a Voice XML application developer, you must specify the words and phrases that users can say to your application. These words and phrases are defined to the speech recognition engine and are used in the recognition process. You can specify the valid words and phrases in a number of different ways, but in Voice XML, you do this by specifying a grammar. A grammar uses a particular syntax, or set of rules, to define the words and phrases that can be recognized by the engine. A grammar can be as simple as a list of words, or it can be flexible enough to allow such variability in what can be said that it approaches natural language capability.

    3.2.4 Accuracy

The performance of a speech recognition system is measurable. Perhaps the most widely used measurement is accuracy. It is typically a quantitative measurement and can be calculated in several ways. Arguably the most important measurement of accuracy is whether the desired end result occurred. This measurement is useful in validating application design.

Another measurement of recognition accuracy is whether the engine recognized the utterance exactly as spoken. This measure of recognition accuracy is expressed as a percentage and represents the number of utterances recognized correctly out of the total number of utterances spoken. It is a useful measurement when validating grammar design.
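This percentage is straightforward to compute from a test run; the utterances below are invented for illustration.

```python
def recognition_accuracy(spoken, recognized):
    """Percentage of utterances recognized exactly as spoken."""
    correct = sum(1 for s, r in zip(spoken, recognized) if s == r)
    return 100.0 * correct / len(spoken)

# Illustrative test run: 3 of 4 utterances recognized correctly.
spoken = ["main menu", "operator", "check balance", "goodbye"]
heard  = ["main menu", "operator", "check balance", "good buy"]
print(recognition_accuracy(spoken, heard))   # 75.0
```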

Recognition accuracy is an important measure for all speech recognition applications. It is tied to grammar design and to the acoustic environment of the user. You need to measure the recognition accuracy for your application, and may want to adjust your application and its grammars based on the results obtained when you test your application with typical users.

4. SPEAKER INDEPENDENCE

Speech quality varies from person to person. It is therefore difficult to build an electronic system that recognizes everyone's voice. By limiting the system to the voice of a single person, the system becomes not only simpler but also more reliable. The computer must be trained to the voice of that particular individual. Such a system is called a speaker-dependent system.

A speaker-independent system can be used by anybody, and can recognize any voice, even though the characteristics vary widely from one speaker to another. Most of these systems are costly and complex, and they have very limited vocabularies. It is important to consider the environment in which the speech recognition system has to work. The grammar used by the speaker and accepted by the system, the noise level, the noise type, the position of the microphone, and the speed and manner of the user's speech are some factors that may affect the quality of the speech recognition.

4.1 Speaker Dependence vs. Speaker Independence

Speaker dependence describes the degree to which a speech recognition system requires knowledge of a speaker's individual voice characteristics to successfully process speech. The speech recognition engine can learn how you speak words and phrases; it can be trained to your voice.

Speech recognition systems that require a user to train the system to his/her voice are known as speaker-dependent systems. If you are familiar with desktop dictation systems, most are speaker-dependent. Because they operate on very large vocabularies, dictation systems perform much better when the speaker has spent the time to train the system to his/her voice.

Speech recognition systems that do not require a user to train the system are known as speaker-independent systems. Speech recognition in the Voice XML world must be speaker-independent. Think of how many users (hundreds, maybe thousands) may be calling into your web site; you cannot require that each caller train the system to his or her voice. The speech recognition system in a voice-enabled web application must successfully process the speech of many different callers without having to understand the individual voice characteristics of each caller.

4.1.1 Advantages of a speaker-independent system

The advantage of a speaker-independent system is obvious: anyone can use the system without first training it. However, its drawbacks are not so obvious. One limitation is the work that goes into creating the vocabulary templates. To create reliable speaker-independent templates, someone must collect and process numerous speech samples. This is a time-consuming task, and creating these templates is not a one-time effort. Speaker-independent templates are language-dependent, and the templates are sensitive not only to two dissimilar languages but also to the differences between British and American English. Therefore, as part of your design activity, you would need to create a set of templates for each language or major dialect that your customers use. Speaker-independent systems also have a relatively fixed vocabulary because of the difficulty of creating a new template in the field at the user's site.

    4.1.2 The advantage of a speaker-dependent system

A speaker-dependent system requires the user to train the ASR system by providing examples of his or her own speech. Training can be a tedious process, but the system has the advantage of using templates that refer only to the specific user and not to some vague average voice. The result is language independence: you can say 'ja', 'si', or 'ya' during training, as long as you are consistent. The drawback is that the speaker-dependent system


    must do more than simply match incoming speech to the templates. It must also include resources to create

    those templates.

4.1.3 Which is better?

For a given amount of processing power, a speaker-dependent system tends to provide more accurate recognition than a speaker-independent system. A speaker-independent system is not necessarily inferior, however: the difference in performance stems from the speaker-independent template having to encompass wide variations in speech.

    4.2 System Configuration

Figures 4.1 and 4.2 show the identification system and the verification system configuration, respectively. The first part of the system consists of the data acquisition hardware that acquires the speech, performs some signal conditioning, digitizes it, and gives it to the computer/processor.

    The second part consists of core signal processing and system identification techniques to extract

    speaker specific features. These features are stored and are used at a later time for the actual

    identification/verification test. At this stage, the system is ready for identification or verification.

    Now, when the test sample is uttered by one of the members of the group, the speech is digitized and

    the features are extracted. For identification, distances between this vector and all the reference vectors are

    measured and the closest vector is picked up as the correct one. This vector would correspond to a person

whom the system claims as having been identified. For verification, the person claims his/her identity. The distance between the corresponding reference vector and the test vector is then computed. If the measured distance is less than a set threshold, the verification system accepts the speaker; if not, it rejects the speaker.
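In code, the identification and verification decisions described above reduce to a nearest-vector search and a threshold test. The sketch below assumes Euclidean distance and hypothetical two-dimensional feature vectors; practical systems use richer features and distance measures.

    import math

    def distance(a, b):
        # Euclidean distance between two feature vectors.
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    def identify(test_vec, references):
        # Identification: pick the enrolled speaker whose reference vector is closest.
        return min(references, key=lambda name: distance(test_vec, references[name]))

    def verify(test_vec, claimed_ref, threshold):
        # Verification: accept only if the distance to the claimed speaker's
        # reference vector is below the set threshold.
        return distance(test_vec, claimed_ref) < threshold

    references = {"alice": [0.2, 0.9], "bob": [0.8, 0.1]}  # hypothetical enrolment data
    print(identify([0.25, 0.85], references))              # alice
    print(verify([0.25, 0.85], references["alice"], 0.3))  # True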

    Figure 4.1: Speaker Identification

[Block diagram: voice sample → ADC with signal conditioning → algorithm to select features → measurement of distance from stored reference vectors → decision making → identity of person]

[Block diagram: voice sample (person claims identity) → ADC with signal conditioning → algorithm to select features → measurement of distance from the claimed speaker's reference vector → threshold comparison → person verified or not]


    Figure 4.2: Speaker Verification


    5. WORKING OF THE SYSTEM

The voice input to the microphone produces an analogue speech signal. An analogue-to-digital converter (ADC) converts this speech signal into binary words that are compatible with a digital computer. The converted binary version is then stored in the system and compared with previously stored binary representations of words and phrases. The current input speech is compared, one at a time, with the previously stored speech patterns after searching by the computer. When a match occurs, recognition is achieved. The spoken word in binary form is written on a video screen or passed along to a natural language understanding processor for additional analysis. Since most recognition systems are speaker-dependent, it is necessary to train a system to recognize the dialect of each new user. During training, the computer displays a word and the user reads it aloud. The computer digitizes the user's voice and stores it. The speaker has to read aloud about 1,000 words. Based on these samples, the computer can predict how the user utters some words that are likely to be pronounced differently by different users.


The block diagram of a speaker-dependent word recognizer is shown in Fig. 5.1. The user speaks before the microphone, which converts the sound into an electrical signal. The electrical analogue signal from the microphone is fed to an amplifier provided with automatic gain control (AGC) to produce an amplified output signal in a specific optimum voltage range, even when the input signal varies from feeble to loud.

The analogue signal, representing a spoken word, contains many individual frequencies of various amplitudes and different phases, which when blended together take the shape of a complex waveform. A set of filters is used to break this complex signal into its component parts. Band-pass filters (BPF) pass frequencies only in a certain frequency range, rejecting all other frequencies. Generally, about sixteen filters are used; a simple system may contain a minimum of three filters. The more filters used, the higher the probability of accurate recognition. Presently, switched-capacitor digital filters are used because these can be custom-built in integrated circuit form. These are smaller and cheaper than active filters using operational amplifiers. The filter output is then fed to the ADC to translate the analogue signal into digital words. The ADC samples the filter outputs many times a second. Each sample represents a different amplitude of the signal, and each value is converted to a binary number proportional to the amplitude of the sample. A CPU controls the input circuits that are fed by the ADCs. A large RAM stores all the digital values in a buffer area. This digital information, representing the spoken word, is now accessed by the CPU to process it further.

5.1 Speaker-dependent word recognizer

Figure 5.1: Speaker-dependent word recognizer [block diagram: BPF bank → ADC → input circuits → RAM buffer holding the digitized speech]
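To make the filter-bank stage concrete, the sketch below approximates a bank of sixteen band-pass filters by summing spectral energy in sixteen equal bands between 200 Hz and 7 kHz. It uses NumPy on a synthetic frame; a real recognizer would use analogue or switched-capacitor filters as described above.

    import numpy as np

    def band_energies(frame, sample_rate, n_bands=16, lo=200.0, hi=7000.0):
        # Split the frame's spectrum into n_bands equal bands between lo and hi
        # and sum the energy in each, mimicking a band-pass filter bank.
        spectrum = np.abs(np.fft.rfft(frame)) ** 2
        freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
        edges = np.linspace(lo, hi, n_bands + 1)
        return [spectrum[(freqs >= edges[i]) & (freqs < edges[i + 1])].sum()
                for i in range(n_bands)]

    # Hypothetical 20 ms frame of a 1 kHz tone sampled at 16 kHz.
    t = np.arange(320) / 16000.0
    print(band_energies(np.sin(2 * np.pi * 1000.0 * t), 16000))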


Normal speech has a frequency range of 200 Hz to 7 kHz. Recognizing a telephone call is more difficult as it has a bandwidth limitation of 300 Hz to 3.3 kHz. As explained earlier, the spoken words are processed by the filters and ADCs. The binary representation of each of these words becomes a template, or standard, against which future words are compared. These templates are stored in the memory. Once the storing process is completed, the system can go into its active mode and is capable of identifying the spoken words. As each word is spoken, it is converted into its binary equivalent and stored in RAM. The computer then starts searching and compares the binary input pattern with the templates. It is to be noted that even if the same speaker talks the same text, there are always slight variations in the amplitude or loudness of the signal, pitch, frequency difference, time gap, etc. Due to this reason, there is never a perfect match between the template and the binary input word. The pattern matching process therefore uses statistical techniques and is designed to look for the best fit.

The values of the binary input words are subtracted from the corresponding values in the templates. If both values are the same, the difference is zero and there is a perfect match. If not, the subtraction produces some difference or error. The smaller the error, the better the match. When the best match occurs, the word is identified and displayed on the screen or used in some other manner.
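This subtract-and-compare step can be sketched in a few lines of Python; the sample values and vocabulary below are purely hypothetical.

    def match_error(input_word, template):
        # Sum of absolute differences; zero means a perfect match,
        # and smaller errors mean better matches.
        return sum(abs(x - t) for x, t in zip(input_word, template))

    def best_match(input_word, templates):
        # Return the vocabulary word whose template yields the smallest error.
        return min(templates, key=lambda word: match_error(input_word, templates[word]))

    templates = {"stop": [3, 7, 2], "go": [9, 1, 4]}  # hypothetical stored patterns
    print(best_match([3, 6, 2], templates))           # stop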

The search process takes a considerable amount of time, as the CPU has to make many comparisons before recognition occurs. This necessitates the use of very high-speed processors. A large RAM is also required because, even though a spoken word may last only a few hundred milliseconds, it is translated into many thousands of digital words. It is important to note that words and templates must be aligned correctly in time before computing the similarity score. This process, termed dynamic time warping, recognizes that different speakers pronounce the same word at different speeds, as well as elongate different parts of the same word. This is especially important for speaker-independent recognizers.
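Dynamic time warping itself is a classical dynamic-programming alignment. The sketch below compares two one-dimensional feature sequences and is only meant to illustrate how one utterance is stretched against another before scoring; real recognizers warp multidimensional feature vectors.

    def dtw_distance(a, b):
        # Cost of the best time-aligned match between sequences a and b.
        INF = float("inf")
        n, m = len(a), len(b)
        cost = [[INF] * (m + 1) for _ in range(n + 1)]
        cost[0][0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                d = abs(a[i - 1] - b[j - 1])
                cost[i][j] = d + min(cost[i - 1][j],      # stretch a
                                     cost[i][j - 1],      # stretch b
                                     cost[i - 1][j - 1])  # step together
        return cost[n][m]

    # The same word spoken at two different speeds still matches closely.
    print(dtw_distance([1, 2, 3, 4], [1, 1, 2, 2, 3, 3, 4, 4]))  # 0.0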

    Now that we've discussed some of the basic terms and concepts involved in speech recognition, let's

    put them together and take a look at how the speech recognition process works. As you can probably imagine,

    the speech recognition engine has a rather complex task to handle, that of taking raw audio input and

    translating it to recognized text that an application understands. The major components discussed are:

    Audio input

    Grammar(s)

    Acoustic Model

    Recognized text

    The first thing we want to take a look at is the audio input coming into the recognition engine.

    It is important to understand that this audio stream is rarely pristine. It contains not only the speech data (what

    was said) but also background noise. This noise can interfere with the recognition process, and the speech


    engine must handle (and possibly even adapt to) the environment within which the audio is spoken. As we've

    discussed, it is the job of the speech recognition engine to convert spoken input into text. To do this, it

    employs all sorts of data, statistics, and software algorithms. Its first job is to process the incoming audio

    signal and convert it into a format best suited for further analysis.

Once the speech data is in the proper format, the engine searches for the best match. It does this by taking into consideration the words and phrases it knows about (the active grammars), along with its knowledge of the environment in which it is operating (for Voice XML, this is the telephony environment). The knowledge of the environment is provided in the form of an acoustic model. Once it identifies the most likely match for what was said, it returns what it recognized as a text string. Most speech engines try very hard to find a match, and are usually very "forgiving." But it is important to note that the engine is always returning its best guess for what was said.

    5.1.1 Acceptance and Rejection

    When the recognition engine processes an utterance, it returns a result. The result can be either of

    two states: acceptance or rejection. An accepted utterance is one in which the engine returns recognized text.

    Whatever the caller says, the speech recognition engine tries very hard to match the utterance to a word or

    phrase in the active grammar. Sometimes the match may be poor because the caller said something that the

    application was not expecting, or the caller spoke indistinctly. In these cases, the speech engine returns the

    closest match, which might be incorrect. Some engines also return a confidence score along with the text to

    indicate the likelihood that the returned text is correct. Not all utterances that are processed by the speech

    engine are accepted. Acceptance or rejection is flagged by the engine with each processed utterance.
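An application built on such an engine typically applies its own acceptance threshold to the returned confidence score. A minimal sketch of that decision follows; the function name and threshold value are arbitrary, not part of any engine's API.

    def handle_result(text, confidence, threshold=0.5):
        # Accept the engine's best guess only when its confidence is high enough;
        # otherwise treat the utterance as rejected and re-prompt the caller.
        if confidence >= threshold:
            return ("accepted", text)
        return ("rejected", None)

    print(handle_result("extension 142", 0.82))  # ('accepted', 'extension 142')
    print(handle_result("extension 142", 0.31))  # ('rejected', None)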

    5.2 What software is available?

    There are a number of publishers of speech recognition software. New and improved versions are

    regularly produced, and older versions are often sold at greatly reduced prices. Invariably, the newest versions

    require the most modern computers of well above average specification. Using the software on a computer

    with a lower specification means that it will run very slowly and may well be impossible to use. There are two

    main types of speech recognition software: discrete speech and continuous speech.

Discrete speech software is an older technology that requires the user to speak one word at a time. Dragon Dictate Classic Version 3 is one example of discrete speech software; it has fewer features, is simple to train and use, and will work on older computers. Continuous speech software allows the user to dictate normally. In fact, it works best when it hears complete sentences, as it interprets with more accuracy when it recognizes the context. The delivery can be varied by using short phrases and single words, following the natural pattern of speech.

    5.3 What technical issues need to be considered when purchasing this system?


    The latest versions of speech recognition software (September 2001) require a Pentium 3

    processor and 256 MB of memory. Currently, Dragon Naturally Speaking Version 4 and IBM Via Voice

    Millennium edition have been used in school settings. Very good results can be obtained with these on fast,

    high-memory machines. When purchasing a machine, it is worth mentioning to the supplier that it will be

    required for running speech recognition software.

    Whether choosing a desktop or portable computer, it will also require a good quality duplex

    (input and output) sound card. Poor sound quality will reduce the recognition accuracy. The microphones

    supplied with the software may be perfectly adequate, but better results can often be obtained by using a

    noise-cancelling microphone. In addition, mobile voice recorders allow a number of users to produce dictation

    that can be downloaded to the main speech recognition system, but be aware of some of the complexities of

    their use.

    5.4 How does the technology differ from other technologies?

Speech recognition systems produce written text from the user's dictation, without using, or

    with only minimal use of, a traditional keyboard and mouse. This is an obvious benefit to many people who,

    for any number of reasons, do not find it easy to use a keyboard, or whose spelling and literacy skills would

    benefit from seeing accurate text.

    The limitations to this type of software are that:

    It needs to be completely tailored to the user and trained by the user.

    It is often set up on one machine, and so can create difficulties for a user who works from many

    locations, for example from school and home.

    It depends on the user having the desire to produce text and be able to invest the time, training and

    perseverance necessary to achieve it.

    It is most successful for those competent in the art of dictation.

A speech recognition system is a powerful application in that the software's recognition of the user's voice pattern and vocabulary improves with use. A useful tip is to ensure that voice files can be backed

    up regularly.

    5.5 What factors need to be considered when using speech recognition technology?

    The Becta SEN Speech Recognition Project describes the key factors to success as The Three T's -

    Time, Technology and Training:

    Time

    Take time to choose the most appropriate software and hardware and match it to the user. One option

    for new users is to start with discrete speech software. The skills learned whilst using it can be transferred to

    more sophisticated speech recognition software. If the new user is unable to make effective use of discrete


speech recognition software, then it is unlikely they will succeed with continuous speech software. Familiarizing themselves with the product and taking frequent breaks between periods of talking are also helpful.

    Training

    With speech recognition systems, both the software and the user require training. Patience and

    practice are required. The user needs to take things slowly, practicing putting their thoughts into words before

    attempting to use the system.

    Technology

    The best results are generally achieved using a high-specification machine. Sound cards and

    microphones are a key feature for success, as is access to technical support and advice.

    6. THE LIMITS OF SPEECH RECOGNITION

    To improve speech recognition applications, designers must understand acoustic memory and

    prosody. Continued research and development should be able to improve certain speech input, output, and

dialogue applications. Speech recognition and generation are sometimes helpful in environments that are hands-busy, eyes-busy, mobility-required, or hostile, and show promise for telephone-based services.

    Dictation input is increasingly accurate, but adoption outside the disabled-user community has been slow

    compared to visual interfaces. Obvious physical problems include fatigue from speaking continuously and the

    disruption in an office filled with people speaking.

    By understanding the cognitive processes surrounding human acoustic memory and

    processing, interface designers may be able to integrate speech more effectively and guide users more

    successfully. By appreciating the differences between human-human interaction and human-computer

    interaction, designers may then be able to choose appropriate applications for human use of speech with

    computers. The key distinction may be the rich emotional content conveyed by prosody, or the pacing,

    intonation, and amplitude in spoken language. The emotive aspects of prosody are potent for human

    interaction but may be disruptive for human-computer interaction. The syntactic aspects of prosody, such as

rising tone for questions, are important for a system's recognition and generation of sentences.

Now consider human acoustic memory and processing. Short-term and working memory are sometimes called acoustic or verbal memory. The part of the human brain that transiently holds chunks of information and


    solves problems also supports speaking and listening. Therefore, working on tough problems is best done in

    quiet environments without speaking or listening to someone. However, because physical activity is handled

    in another part of the brain, problem solving is compatible with routine physical activities like walking and

    driving. In short, humans speak and walk easily but find it more difficult to speak and think at the same time.

    Similarly when operating a computer, most humans type (or move a mouse) and think but find it more

    difficult to speak and think at the same time. Hand-eye coordination is accomplished in different brain

    structures, so typing or mouse movement can be performed in parallel with problem solving.

Product evaluators of an IBM dictation software package also noticed this phenomenon. They wrote that thought for

    many people is very closely linked to language. In keyboarding, users can continue to hone their words while

    their fingers output an earlier version. In dictation, users may experience more interference between

    outputting their initial thought and elaborating on it. Developers of commercial speech recognition software

    packages recognize this problem and often advise dictation of full paragraphs or documents, followed by a

    review or proofreading phase to correct errors. Since speaking consumes precious cognitive resources, it is

    difficult to solve problems at the same time. Proficient keyboard users can have higher levels of parallelism in

    problem solving while performing data entry. This may explain why after 30 years of ambitious attempts to

    provide military pilots with speech recognition in cockpits, aircraft designers persist in using hand-input

devices and visual displays. Complex functionality is built into the pilot's joystick, which has up to 17 functions, including pitch-roll-yaw controls, plus a rich set of buttons and triggers. Similarly, automobile

    controls may have turn signals, wiper settings, and washer buttons all built onto a single stick, and typical

    video camera controls may have dozens of settings that are adjustable through knobs and switches. Rich

    designs for hand input can inform users and free their minds for status monitoring and problem solving.

    The interfering effects of acoustic processing are a limiting factor for designers of speech

recognition, but the role of emotive prosody raises further concerns. The human voice has evolved

    remarkably well to support human-human interaction. We admire and are inspired by passionate speeches. We

are moved by grief-choked eulogies and touched by a child's calls as we leave for work. A military

    commander may bark commands at troops, but there is as much motivational force in the tone as there is

    information in the words. Loudly barking commands at a computer is not likely to force it to shorten its


response time or retract a dialogue box. Promoters of affective computing, or recognizing, responding to, and making emotional displays, may recommend such strategies, though this approach seems misguided.

First, many users might want shorter response times without having to work themselves into a mood of impatience. Second, the logic of computing requires a user response to a dialogue box independent of the user's mood. And third, the uncertainty of machine recognition could undermine the positive effects of user control and interface predictability.

    7. APPLICATIONS

One of the main benefits of a speech recognition system is that it lets the user do other work simultaneously. The user can concentrate on observation and manual operations, and still control the machinery by voice input commands. Consider a material-handling plant where a number of conveyors are employed to transport various grades of materials to different destinations. Nowadays, only one operator is employed to run the plant. He has to keep watch on various meters, gauges, indication lights, analyzers, overload devices, etc. from the central control panel. If something goes wrong, he has to run to physically push the stop button. How convenient it would be if a conveyor, or a number of conveyors, could be stopped automatically by simply saying 'stop'.

Another major application of speech processing is in military operations. Voice control of weapons is an example. With reliable speech recognition equipment, pilots can give commands and information to the computers by simply speaking into their microphones; they don't have to use their hands for this purpose. Another good example is a radiologist scanning hundreds of X-rays, ultrasonograms, and CT scans and simultaneously dictating conclusions to a speech recognition system connected to a word processor. The radiologist can focus his attention on the images rather than on writing the text. Voice recognition could also be used on computers for making airline and hotel reservations. A user simply needs to state his needs: to make a reservation, cancel a reservation, or make inquiries about a schedule.


    7.1 Health care

    In the health care domain, even in the wake of improving speech recognition technologies,

    medical transcriptionists (MTs) have not yet become obsolete. Many experts in the field anticipate that with

    increased use of speech recognition technology, the services provided may be redistributed rather than

replaced. Speech recognition is also used to enable deaf people to understand the spoken word via speech-to-text conversion, which is very helpful.

Speech recognition can be implemented in the front end or the back end of the medical documentation process.

    Front-End SR is where the provider dictates into a speech-recognition engine, the recognized words are

    displayed right after they are spoken, and the dictator is responsible for editing and signing off on the

    document. It never goes through an MT/editor.

Back-End SR, or Deferred SR, is where the provider dictates into a digital dictation system, the voice is routed through a speech-recognition machine, and the recognized draft document is routed, along with the original voice file, to the MT/editor for editing and final sign-off.