An ESOTerIC Approach to Some Problems in Automatic Speech Recognition

DAVID R. HILL University of Calgary, Alberta, Canada

(Received 3 September 1968)

A General Description of ESOTerIC I

An outline of the overall problem of automatic speech recognition (the ASR problem) has been given in previous papers (Hill, 1967a, 1968). The problem divides into three main areas:

(1) Problems of definition; (2) Problems of implementation; (3) Problems of application.

The present paper describes a machine, “ESOTerIC”, which embodies, in an elementary form, sections corresponding to those of the “basic acoustic recognizer” proposed in those papers. Some experimental results obtained with this machine are discussed, and the approach adopted is generalized. Finally, the possibility of immediate applications for this simple ASR system is discussed.

ESOTerIC I was first described and demonstrated at the Second Machine Intelligence Workshop, Edinburgh, in 1966. The name “ESOTerIC” (Experimental Speech Operated Terminal for Input to Computers) refers to only one application, though historically the first. As illustrated in Fig. 1, speech enters the machine by means of a microphone. In the first section of the machine, called the analyser, the acoustic content of the input signal is determined, as far as the machine’s (limited) analysis will allow. The output of the analyser consists of a number of lines, carrying binary signals, which continuously indicate the state of the input in terms of the presence/absence of selected acoustic properties to which the analyser is sensitive. These outputs are termed “primitive acoustic features” or PAFs, and are the primitives of the subsequent analysis. They comprise: (1) presence of significant high-frequency energy alone (hiss); (2) presence of significant low-frequency energy alone (humph); (3) presence of hiss and humph together; (4) absence of any significant energy (silence, or gap). Any information about the order of occurrence of these PAFs is, at this stage, preserved implicitly in the order of their output from the analyser.

Int. J. Man-Machine Studies (1969) 1, 101-121

FIG. 1. Main operations required in ESOTerIC I.
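The four PAFs can be pictured with a minimal sketch. It assumes, purely for illustration, that the analyser supplies one low-band and one high-band energy value per time frame and that a single fixed threshold decides what counts as "significant"; the names `PAF`, `classify_frame` and `paf_stream` and the threshold value are hypothetical and do not describe the actual ESOTerIC I circuitry.

```python
from enum import Enum


class PAF(Enum):
    HISS = "hiss"        # significant high-frequency energy alone
    HUMPH = "humph"      # significant low-frequency energy alone
    BOTH = "hiss+humph"  # hiss and humph together
    GAP = "gap"          # no significant energy (silence)


def classify_frame(low_energy, high_energy, threshold=1.0):
    """Map one frame's band energies to a PAF (threshold is an assumed value)."""
    low = low_energy >= threshold
    high = high_energy >= threshold
    if low and high:
        return PAF.BOTH
    if high:
        return PAF.HISS
    if low:
        return PAF.HUMPH
    return PAF.GAP


def paf_stream(frames, threshold=1.0):
    """Classify a sequence of (low_energy, high_energy) frames in order."""
    return [classify_frame(lo, hi, threshold) for lo, hi in frames]
```

The order of the PAFs is carried only by their position in the returned list, which mirrors the statement that order is still implicit at this stage.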

The purpose of the second section, the sequence detector, is to make explicit that order information which is necessary for recognition, so that variations in the sequence relevant to recognition are taken into account, but variations which are irrelevant have a minimal disturbing effect. In ESOTerIC I, logic and storage are incorporated to note which PAF occurred first, which last, which of the four occurred, and whether or not certain transitions (of one PAF to another) occurred during the input word. These comprise the structural descriptors of the input. With only a few basic descriptors, or primitives, and a limited depth of sequence analysis, the choice for these structural descriptors, called “compound acoustic features” or CAFs, is somewhat restricted. Nevertheless the set used could be improved upon, using a simple generalization. Suppose two PAFs are “A” and “B”. It will be noticed that the CAF “A was immediately followed by B” may be detected independently of absolute time, and thus may be used for recognition without knowing where the word began, where it ended, or how the time scale varied from beginning to end of the word. It will also be noticed that if, for example, “A” were silence, and “B” were the first PAF to occur, then the CAF “A was immediately followed by B” would necessarily be generated. These facts, amongst others, form the basis of the more general approach developed below. Time normalization has been a major problem in ASR, and it is suggested that this type of approach is essential, and is a key to the problem of definition.
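The time-independence of a CAF such as “A was immediately followed by B” can be illustrated by a short sketch. It assumes that the PAFs arrive as one label per time frame; the function names and the dictionary layout are hypothetical, chosen only to mirror the four kinds of descriptor listed above (first, last, which occurred, which transitions occurred).

```python
from itertools import groupby


def collapse(pafs):
    """Drop durations: keep one entry per run of identical PAF labels."""
    return [label for label, _ in groupby(pafs)]


def compound_features(pafs):
    """Structural descriptors that do not depend on the time scale."""
    seq = collapse(pafs)
    return {
        "first": seq[0] if seq else None,
        "last": seq[-1] if seq else None,
        "present": set(seq),
        # (a, b) means "a was immediately followed by b"
        "transitions": set(zip(seq, seq[1:])),
    }
```

A slowly spoken and a quickly spoken version of the same PAF sequence give identical descriptors, which is the property the text relies on.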

The outputs of the sequence detector thus consist of pulses marking each instant at which each of these CAFs is detected. These pulses are used to set the states of binary storage elements, although other forms of store might be more suitable. The bit pattern thus formed represents the content and the order information explicitly. Each bit has a significance with respect to the input which does not depend on its relationship to the other bits in the pattern. For example, there is no time-ordering of the bits. In ESOTerIC I the decision is taken on the simple basis of a bit-for-bit matching with patterns stored as plugs in a three-layer matrix board. These allow “presence”, “absence”, or “don’t care” conditions to be specified, the latter condition obtaining when a plug corresponding to the feature for the word in question is omitted.
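The plug-board decision can be sketched as a three-valued template match. In this illustration each word's template is a dictionary whose entries demand presence (`True`) or absence (`False`) of a feature, and an omitted feature plays the part of the omitted plug (“don’t care”). The feature names, the vocabulary, and the rule of rejecting the input unless exactly one template matches are all assumptions made for the example, not details taken from the machine.

```python
def matches(template, observed):
    """True if every plugged feature agrees with the observed bit pattern.

    template: feature name -> True (must be set) or False (must be clear);
              features left out of the template are "don't care".
    observed: feature name -> bool; missing features are treated as clear.
    """
    return all(observed.get(name, False) == required
               for name, required in template.items())


def recognise(templates, observed):
    """Return the single matching word, or None (assumed rejection rule)."""
    hits = [word for word, template in templates.items()
            if matches(template, observed)]
    return hits[0] if len(hits) == 1 else None
```

Because each bit is tested on its own, the match does not depend on any ordering of the bits, in keeping with the description above.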

Figure 2 shows the state-transition diagram of ESOTerIC I. The transitions are governed by signals from the controller section, which determines the beginning and end of the input signal (on a mean power threshold basis, with timing), and presumes that such a signal represents a spoken word. There is provision for manual or automatic resetting of the machine, when it is in the output state, and means are provided to suppress the generation of new CAFs during this state (the machine is said to be “frozen”).

FIG. 2. ESOTerIC I basic state transitions.

ESOTerIC I has been investigated for several reasons. (a) It has led to an understanding of the need to process order information explicitly (a key to the problem of definition); (b) it has led to a general solution to some problems of implementation; (c) it is a complete machine which can be used as an ASR input to other machines, and which is cheap enough (say $500), and simple enough, to be accessible to many people; (d) it has demonstrated that even a simple device has immediate applications potential (the applications problem); (e) it has allowed work to start on the important human factors problems of using such devices, without the need for the artificial contrivance of “someone” pretending to be a machine.

Experimental Use of ESOTerIC I

Four sets of experiments will be mentioned in this section: early trials; an experiment in recognizing unknown speakers; an experiment in training unknown speakers; an experiment in man-machine communication.

The early trials were informal and aimed at gaining experience in the use of the device. As exemplified in Table 1, vocabularies of nine or ten words were chosen; one included a “propaganda item” (Table 1(b)). In the later trials more useful vocabularies were used, as shown in Tables 2 and 3. The special words chosen for the digits are analogous to the special characters used for optical character recognition (OCR), except that they are not machine produced. The hardware was limited to a maximum of ten different outputs, but the device could also be used on-line to a Digital Equipment Corporation (DEC) PDP8 computer (Hill, 1967b) to allow any size of vocabulary; in this case the computer simulated the plug-board.


EARLY TRIALS

Two main points of interest arose from these. First, many naïve speakers who were unaware of the basis for the machine’s operation could achieve good (even perfect) scores at their first encounter with the machine. Such results were obtained both in the laboratory and also in an informal trial held during a local trade exhibition. There seemed to be two groups of speakers. Some speakers obtained very low scores, less than 50%, but a larger group averaged 80% after being told how to say the words. Secondly, some speakers who performed poorly at first could learn to obtain perfect scores. A learning curve for such a subject is shown as Fig. 3.

FIG. 3. Typical learning curve, vocabulary: 9 words (Table 1 (a)).

AN EXPERIMENT IN RECOGNIZING UNKNOWN SPEAKERS

It was intended, at one stage, to install an ESOTerIC device in the Department of Machine Intelligence and Perception, Edinburgh University. In preparation for this, a tape-recording of 12 members of the department, each speaking a 16-word vocabulary, was obtained. This vocabulary was chosen to satisfy their estimated requirements for an experimental ASR input to a multi-access system. The recognition rate for the 12 speakers was 78%; of the 22% not correctly recognized, 12% were misrecognized, and 10% were rejected. A wide variation in the threshold settings for the machine barely affected the results. Finally, it should be added that there was a good deal of extraneous noise on the tape, which was recorded in an open office.

AN EXPERIMENT IN TRAINING NAÏVE SPEAKERS

This was a pilot experiment intended to examine the ability to learn to use the device given various instructions. It was felt that simple instructions would be necessary if they were to be of practical value. Subjects were assigned to one of two experimental groups: (i) given only minimum information about the production of desired sounds; (ii) given, in addition to (i), minimum machine-generated information concerning the occurrence of desired sounds. An indicator let subjects know when to speak, and whether the machine’s response was right or wrong. The second experimental group were provided with a further indicator showing whether the machine had detected hiss, humph, or gap. The appearance of the indicators is shown in Fig. 4, and the general experimental situation in Fig. 5. The PDP8 not only controlled the experiment, but also allowed on-line data-logging and analysis. Sixteen men and sixteen women, chosen arbitrarily from the secretarial and maintenance staff of the laboratory, were assigned to a control group and the two experimental groups. All subjects were allowed a similar amount of practice between an initial and a final test run. All subjects also heard a tape-recording of the vocabulary, spoken using a preferred pronunciation, before the first test.

FIG. 4. Displays for teaching unknown speakers. The left-hand box was seen by all subjects; the right-hand box was seen by only experimental group 2, during practice.

FIG. 5. On-line experiment using ESOTerIC.


Although some improvement in the recognition rate was observed for both experimental group 1 and group 2 (+7·2% and +3·4% respectively), it was not statistically significant at the 5% level. The control group showed a slight decrease in performance. With the comparatively large variation in performance from individual to individual, larger groups would have been desirable. It is probable that, for many subjects, merely hearing the vocabulary spoken in preferred pronunciation gave all the information needed to achieve a reasonable initial performance. No attempt was made to optimize the training procedure, nor was the experiment concerned with the achievement of any particular level of performance. Both these aspects are of considerable interest, and it is clear that further experiments are called for.

AN EXPERIMENT IN MAN-MACHINE COMMUNICATION

In this experiment an ESOTerIC device was connected to a DEC PDP7 computer in the University Mathematical Laboratory at Cambridge. The device was used to communicate commands to the scope editor program. This program allows editing of text, which is read in from tape and displayed on a type 340 graphic display console. In normal operation of the scope editor, the commands would be entered using either special code combinations on the keyboard of the associated teletype, or by touching command symbols on the display with a light-pen. The commands include functions such as Punch, Read, and Insert, which tell the editor what to do with the material either already in the text buffer, or to be entered from the teletype keyboard. Two points are selected for attention. First, the use of speech input for command functions allowed them to be separated completely from the data input to the scope editor program, without the user having to leave the teletype and take up the light-pen. Secondly, despite high ambient noise conditions (the microphone was a mere 20 inches from the teletype) no problem of noise rejection was encountered. This was achieved, in part, by the use of a close-talking, boom-mounted microphone.

Discussion of Experiments

The early trials of this prototype of what is, essentially, a very inexpensive device, show promise of reasonable recognition rates for naïve, unknown speakers, and demonstrate the possibility of unskilled operators being able to learn, even after initially poor performance, to operate the device nearly perfectly without extensive instruction. In this connection it must be emphasized that, with feedback to the speaker of what the machine “heard”, some errors in recognition need be no detriment to the overall system. Trials with unknown speakers, using a larger and more useful vocabulary, confirmed the first observation. Trials in which non-technical staff were taught to use the device suggested that poor performers could be trained to do better, even with a short, simple training procedure. The results also suggested that, for many people, the most important single aid to suitable pronunciation of the vocabulary was to hear a tape-recording of the preferred pronunciation. This might not be so effective if the machine were using less “consciously available” features of the acoustic signal. Finally, ESOTerIC I has been found useful in a real man-machine system. Applications of simple ASR systems are discussed again in a later section of this paper.

A Generalization of ESOTerIC I

In earlier papers (e.g. Hill, 1967a) the author has suggested that a maximum likelihood technique is appropriate to decision taking in ASR; but, as Nagy (1968) has pointed out, it is only tractable under rather restrictive assumptions: either where the probability distribution is in terms of independent binary features, or where the probability distribution is Gaussian, with equal covariance matrices.

Speech lends itself, at least in classical phonetic analysis, to treatment in terms of binary features, and it is reasonable to attempt to satisfy the first assumption in the ASR decision-taker. Difficulties arise, however. First, speech is a time series: features appear and disappear; features are overlapped and succeeded by other features; and the information is contained not only in the features that occur, but also in the order in which they occur. Secondly, the classical treatment requires that speech be segmented into functionally (rather than acoustically) defined entities, called phonemes (which are thus specific to a particular language, or even a group of speakers). Furthermore, the physical correlates of the binary distinctions, which linguists use to describe these functional categories, may be elusive and even misleading. This is not surprising since the classification is not physically based, and any physical similarity within the categories arises solely from the limitations of the users’ ability to interpret the members of the categories. Thus segmentation, although it may seem a preliminary to recognition, itself depends on recognition.

The supposed need to segment into phonemes, or other phoneme-like units, has led to what has been called the “segmentation problem” in ASR. Another problem in ASR, that of “time-normalization”, is closely associated because, if one could truly “time-normalize” different utterances, they could be segmented without the use of acoustic cues, by fixed time-slicing. The concept of “time-normalization” also arises out of attempts to match one time series of events (for example an acoustic input) with others (reference utterances). However, if one could identify the segments, and name them, then such a comparison could be made without reference to the time scale. The two “problems” are inextricably bound up and are really problems of how to handle the information in the speech stream. These basic problems are too often ignored in favour of others more easily defined.

Finally, there is a problem of a different kind. The acoustic features of speech, such as have been proposed for use in ASR, have interdependencies which contravene the independence assumption of the solution we wish to adopt for decision taking. The decision method is sufficiently attractive on the grounds of simplicity (in both conceptual and hardware terms) and efficiency, that it is worth taking some trouble to reduce the speech signal to a form suitable for such a decision. In other words, we need to reduce the speech signal to a set of independent binary observations.
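Why independence makes the maximum likelihood decision simple can be seen from a minimal sketch. Under the independence assumption the likelihood of a bit pattern for a given word is just a product of per-bit terms, so each word needs only one probability per feature. The word models and probability values below are invented for illustration, and the probabilities are assumed to lie strictly between 0 and 1.

```python
import math


def log_likelihood(bits, probs):
    """log P(bits | word) for independent binary features.

    probs[i] is the assumed probability that bit i is set for this word.
    """
    return sum(math.log(p if bit else 1.0 - p) for bit, p in zip(bits, probs))


def classify(bits, models):
    """Choose the word whose model makes the observed bits most likely."""
    return max(models, key=lambda word: log_likelihood(bits, models[word]))
```

If the features were not independent, the per-word model would need joint probabilities over combinations of bits, and the storage and computation would grow accordingly.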

The Reduction of Analogue Signals to Sets of Independent Binary Features

Proposals and schemes involving, at some stage, the analysis of speech into sets of analogue or binary time-varying signals, have been put forward by many workers (Fry & Denes, 1958; Sakai & Doshita, 1963; Hughes, 1965; Tillman et al., 1965; Hill, 1966; Martin et al., 1966; Keller, 1968; Bobrow & Klatt, 1968). Let us call such signals “primitive acoustic characteristics” (PACs), or, where they are two-valued, “primitive acoustic features” (PAFs). Clearly PACs can be reduced to PAFs simply by inserting a threshold device. It is important to notice that such threshold devices for ASR should show hysteresis, that is, in this context, the property of sticking to a decision. Such a decision represents the formation of a minimum “null hypothesis” consistent with the incoming evidence and may consist of “the feature is occurring”, or “the feature is not occurring”. That is, the hypothesis is not abandoned until it is inconsistent with more recent evidence, rather than merely inadequately supported. Without the ability to make and stick to such minimum hypotheses, the machine’s ability to structure its input (an essential preliminary to making good decisions) is seriously handicapped. In physical terms the effect on signals is as illustrated in Fig. 6. At some level of evidence it is necessary to say a feature is present, and then stick to this decision until the evidence very definitely shows that the feature is absent. It will be noted that both amplitude and time are involved in the hysteresis. This is necessary, at a practical level, firstly in order to make a reasonable representation of the input, and secondly in order to produce an output signal suitable for subsequent processing. The figure indicates various intermediate possibilities, which fall short of the requirement by themselves, and finally, the effect of both amplitude and time hysteresis in recovering the underlying signal.

FIG. 6. Illustration of time and level hysteresis. τ is the time for which the signal must be continuously in a state for that state to occur as output (time hysteresis).

(Note: time hysteresis eliminated the signal drop-outs resulting from the top trigger level. There is a delay of τ on the recovered signal.)
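The two kinds of hysteresis can be sketched for a sampled signal as follows. The sketch assumes discrete samples, an upper and a lower trigger level for the amplitude hysteresis, and a hold time `tau` measured in samples for the time hysteresis; the function names and parameter values are illustrative and are not taken from the hardware.

```python
def level_hysteresis(samples, upper, lower):
    """Amplitude hysteresis: switch on at `upper`, switch off only at `lower`."""
    state = False
    out = []
    for x in samples:
        if not state and x >= upper:
            state = True
        elif state and x <= lower:
            state = False
        out.append(state)
    return out


def time_hysteresis(states, tau):
    """Time hysteresis: a state reaches the output only after it has
    persisted for `tau` consecutive samples, so short drop-outs are ignored."""
    current = False          # output assumed to start in the "absent" state
    run_state, run_len = None, 0
    out = []
    for s in states:
        if s == run_state:
            run_len += 1
        else:
            run_state, run_len = s, 1
        if run_len >= tau:
            current = run_state
        out.append(current)
    return out
```

Chaining the two, `time_hysteresis(level_hysteresis(samples, upper, lower), tau)`, gives a signal that sticks to its decision in both amplitude and time, at the cost of a delay of about `tau` samples.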

Consider an analyser producing PAFs, whose output consists of signals indicating when certain important features of the speech signal are present and when they are absent. The content is specified by which output lines are active, but the order is still implicit in their order of activation. The exact order is difficult to specify because the signals overlap. Before any detection of sequential characteristics is carried out, therefore, it is necessary to carry out more processing, namely to change an extended PAF into two events (“primitive acoustic events” or PAEs) which are standard pulses marking the time when a decision is taken that the feature is present, and the time when it is decided that the feature is absent. The difficulty with this process is that, although it sorts the input into meaningfully ordered signals, information about the absolute duration of the PAFs is implied merely by the order of events.


The trouble arises because a distinction dependent on absolute duration, such as that between a stop release (say for /t/) and a fricative (say /s/), depends on content and not order. Thus event detection must also take account of absolute duration, and in that way completes the extraction of content. An event detector will have one input (a PAF) and n + 1 outputs, one marking the beginning of the PAF, and the others marking its end in each of n duration categories. Evidence suggests that n = 2 is usual for English. In any case, if the duration of a PAF is ambiguous (if it ends just on the boundary, or within half a standard pulse width) then, to avoid losing information, and perhaps for other reasons as well, the occurrence of both the possible events should be indicated. Thus, in continuous speech, a machine might need to consider a silence of ambiguous duration as either a Stop Gap or an End-of-Phrase gap. The duration judgement could be made adaptive using a strategy similar to that of the MAUDE device (Selfridge & Neisser, 1960).
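An event detector with n = 2 duration categories can be sketched as below. The sketch assumes a sampled binary PAF, a single duration boundary separating “short” from “long”, and a standard pulse width, all measured in samples; the event names and the numerical values are hypothetical. A PAF whose duration falls within half a pulse width of the boundary produces both end events, as the text suggests.

```python
def detect_events(paf, boundary, pulse_width=2):
    """Turn one binary PAF into PAEs: a list of (time, event_name) pairs.

    Outputs: "begin", plus "end_short" and/or "end_long" (n = 2 categories).
    """
    events = []
    start = None
    # A trailing False acts as a sentinel so a PAF still active at the
    # end of the input is closed off.
    for t, active in enumerate(list(paf) + [False]):
        if active and start is None:
            start = t
            events.append((t, "begin"))
        elif not active and start is not None:
            duration = t - start
            ambiguous = abs(duration - boundary) <= pulse_width / 2
            if ambiguous or duration < boundary:
                events.append((t, "end_short"))
            if ambiguous or duration >= boundary:
                events.append((t, "end_long"))
            start = None
    return events
```

An adaptive version would adjust `boundary` from experience rather than fixing it in advance; that refinement is not attempted here.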

At this stage the original input has been reduced to a set of primitives (PAEs). The order resolution is governed by the width of the standard pulse representing each such event. If two events overlap, precedence between the pair concerned cannot be assigned. Neglecting such overlaps, the information may be processed further to extract significant aspects of the sequence of events in terms of structural descriptors called “compounded acoustic events” or CAEs, using a grammar-based method which represents a generalization of the concepts which led to ESOTerIC I.

The grammar is that of a descriptive language for percepts, in this case words. The language must describe pertinent aspects of the ordering of the primitives. If the ordering of primitives need not necessarily be one-dimensional, nor refer to a time ordering, it is possible to trace a parallel between this approach to ASR, and that adopted in recent work on the machine classification of visual percepts, for example characters and geometrical patterns (Guzman, 1967; Evans, 1968). Just as the work of Hubel & Wiesel (1962) on cats and of Lettvin et al. (1959) on frogs has suggested a suitable level of primitives for the figure description languages which have been developed, so the work of Whitfield & Evans (1965), also on cats, suggests a suitable level of primitives for use in auditory pattern description languages. No doubt extensions of the basic auditory primitives, similar to those of the basic visual primitives, will be made.

In the auditory situation, only a one-dimensional relationship is being dealt with, so that the necessary pattern description language may be simpler, which may compensate for the greater obscurity of the required primitives. As Evans (1968) has pointed out, if the specification of sub-patterns, and their relations, in terms of which the pattern is to be analysed, is separated from the mechanism which does the analysis, this specification is more readily changed, making the overall system more flexible. Such a mechanism meets the requirements of Minsky’s “articular” system (Minsky, 1961). Not only may structures of arbitrary complexity be handled by a system of arbitrary simplicity, but such a mechanism provides a powerful means for automatically generating, testing and, hence, selecting descriptors for the patterns. Such a language is a precise formalism for describing pattern structure.

Evans (1968) suggests four components of rules for pattern definition: the name of the pattern; a list of dummy variables representing constituent sub-patterns; a list of properties to be satisfied by the sub-patterns (which may involve any sub-pattern any number of times), and a list specifying information to be listed under the pattern type being defined, when an example is found, for use at still higher levels.

The grammar required for ASR is very much simpler. Sub-patterns, which we may call “compounded acoustic events” as above, are defined in terms of sub-patterns and/or primitives only, which is the same as saying CAEs are defined in terms of CAEs and/or PAEs only (both CAEs and PAEs will be called “events” where this is not confusing). There is only one relationship function, that of precedence, and thus sub-patterns or CAEs may be defined recursively, without specifying property functions. The grammar is, however, context sensitive, and it is necessary to specify, at each level of the recursion, sublists of other objects which must not bear a prohibited relationship to the object defined at that level. In less abstract terms, this amounts to a statement that one can specify that certain other events must not intervene between the two events whose precedence function is being evaluated.

A major advantage of recursive definition is that the structure of sub-patterns, or CAEs, is specified by their name. Thus the CAE (((A(BC))F)B) would be decomposed by the analyser to a head ((A(BC))F) and a tail B. The first sub-list of prohibited objects would tell the analyser which events were not allowed to intervene between the head and tail at this level. The tail is a primitive, and therefore is available as a set of PAE pulses marking the times at which the event B occurred. If the head were also available, then the analyser could establish the times at which B occurred immediately after the head event, discounting all intervening events except those prohibited. If the head were not available (from a previous determination) then it would be treated as a new event to be recognized and the process repeated. Eventually a level of recursion would be reached at which only primitives or previously recognized events were at the tops of the head and tail stacks that had been built up, and the procedure could unwind, generating sets of event pulses corresponding to the times of the various events on the head and tail lists until the pulses for the event originally specified were generated.
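The recursive head/tail evaluation described above can be sketched in a few lines. This is not the PL/I analyser mentioned in the paper; it is a simplified illustration under stated assumptions: each PAE is given as a list of pulse times, a CAE is written as a tuple `(head, tail, breakers)` in which `breakers` is the sub-list of prohibited (sequence-breaking) events for that level, and a CAE is taken to occur at the time of a tail pulse when some earlier head pulse exists with no breaker strictly between them.

```python
def times(event, pae, cache=None):
    """Return the times at which `event` occurs.

    event: a primitive name (str), or a tuple (head, tail, breakers) where
           head and tail are events and breakers is a tuple of events.
    pae:   dict mapping primitive name -> list of pulse times.
    """
    if cache is None:
        cache = {}
    if isinstance(event, str):            # primitive: pulses already available
        return sorted(pae.get(event, []))
    if event in cache:                    # previously recognized event
        return cache[event]
    head, tail, breakers = event
    head_t = times(head, pae, cache)      # recurse until primitives are reached
    tail_t = times(tail, pae, cache)
    break_t = [t for b in breakers for t in times(b, pae, cache)]
    out = []
    for t in tail_t:
        earlier = [h for h in head_t if h < t]
        if not earlier:
            continue
        h = earlier[-1]                   # most recent head before this tail
        if not any(h < b < t for b in break_t):
            out.append(t)
    cache[event] = out
    return out
```

Checking only the most recent head is sufficient, because any breaker lying between it and the tail also lies between every earlier head and the tail. Only the order of pulses matters, so stretching the time scale leaves the number of detected events unchanged.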

Such a grammar-controlled analyser has been implemented in PL/I on the IBM 360/50 at the University of Calgary, for use in ASR studies. The specification of these time markers is analogous to the identification of picture points associated with a pattern or sub-pattern for recognizing pictures. There seems to be no particular virtue, at present, in naming any sub-patterns in speech recognition, nor does it seem desirable, as apparently it does in picture recognition, to have a final production at the output level of the recognizer. It seems safer, and more economical, to determine the presence or absence of an output pattern class on the basis of the conditional probabilities associated with the presence and absence of the various sub-patterns; this may also be relevant to picture pattern research.

The output of the sequence detector, as this stage has been called (Hill, 1967a), may be in several forms; e.g. (i) a bit pattern, each bit corresponding to a particular CAE or PAE, and set to “1” if the event in question was detected; (ii) a bit pattern representing a set of “barometer” type counts (count = number of bits set), each count representing the number of times a given event occurred. Figure 7 illustrates a set of primitive events, the compounds derived assuming no other event is allowed to intervene (thus all events may be said to be “sequence breakers”), and the bit pattern derived according to output method (i).

FIG. 7. Illustration of PAEs, CAEs and bit pattern. Note: for this illustration all events have been assumed sequence breakers at every level.
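The two output forms can be sketched as follows. The sketch assumes the detected events are held as a dictionary from event name to a list of occurrence times; the function names, the fixed field width, and the choice to saturate a count that exceeds the width are assumptions made for the illustration.

```python
def presence_bits(event_times, order):
    """Output form (i): one bit per event, set if the event was detected."""
    return [1 if event_times.get(name) else 0 for name in order]


def barometer(count, width):
    """Output form (ii): a count coded as the number of bits set.

    Counts larger than `width` saturate (an assumed convention).
    """
    n = min(count, width)
    return [1] * n + [0] * (width - n)
```

In both forms every bit keeps a meaning of its own, independent of the other bits, which is what the bit-for-bit and probabilistic decision stages require.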

Thus the original set of time-varying analogue signals may, in the manner suggested, and as illustrated with reference to a computer-based grammar, be translated into a set of non-ordered, binary features, using output form (i). There remains the problem of independence. With such a flexible means of generating structural descriptors (CAEs) and such a large number from which to choose, it is tempting to suppose that a suitable selection could be made. It is not clear how to set about choosing suitable CAEs, even on the assumption that they can be shown to exist, and this is a matter for further investigation. Events are independent, in the sense required, if—even when the cause of the observation is known—the observed occurrence of one event tells us nothing further about the occurrence of the other (Minsky, 1961).

Decision

Enough has been said about the decision process to be able to leave the subject, merely emphasizing, once again, that such a process can be implemented simply in hardware.

ESOTerIC II

ESOTerIC II is, to the extent that it follows the lines laid down in this section, a generalized version of ESOTerIC I. Simulation is largely complete, and hardware elements, using integrated circuits, have been designed to carry out the required functions; the controller is similar to that used for ESOTerIC I. The heart of the sequence detector is an elementary sequence element (ESE). This has two primary inputs (for the events whose precedence is to be computed), and a third input into which prohibited events (sequence breakers) are combined in a logical OR function. One such element corresponds to a single level of recursion in the analysis procedure. The equivalent function in the simulation is carried out by a single recursive sub-routine. An overall block diagram of the complete hardware machine and a more detailed schematic diagram appear as Figs 8 and 9.
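The behaviour of a single ESE can be pictured as a one-bit state machine. The following sketch is a guess at the logic rather than a description of the integrated-circuit design: it assumes clocked operation, that the output is computed from the state held before the current tick, that a breaker pulse then clears the element, and that a pulse on the first input then arms it. How simultaneous pulses are resolved is an arbitrary choice here, since the paper notes that precedence cannot be assigned to overlapping events.

```python
class ElementarySequenceElement:
    """Fires when `second` arrives after `first` with no breaker in between."""

    def __init__(self):
        self.armed = False   # True once `first` has been seen and not broken

    def step(self, first, second, breaker):
        """Process one clock tick of input pulses; return the output pulse."""
        fired = self.armed and bool(second)
        if breaker:          # OR of all prohibited events for this level
            self.armed = False
        if first:
            self.armed = True
        return fired


def run(element, ticks):
    """Feed a list of (first, second, breaker) tuples through the element."""
    return [element.step(*tick) for tick in ticks]
```

Cascading such elements, with the output of one feeding a primary input of the next, corresponds to successive levels of the recursion.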

Finally, a word in relation to the manner of description implicit in the machine. Sutherland (1968) has suggested a model for visual perception in animals which uses a picture grammar. He suggests that perception is the selection of an appropriate rule for structuring the input, and that perceptual learning consists of learning new rules. It seems that ESOTerIC II may, in some sense, provide an analogous model for auditory pattern recognition. There is a difference, however, in that Sutherland requires a special rule for each identifiable percept class (for example a “wedding group”), whereas a scheme completely analogous to ESOTerIC II would require only probability distributions for the sub-elements of such a percept. Such a model would possess most of the advantages of the original model, together with additional economy and flexibility. Thus, in perceiving a car, a rule specifying relations between all the parts to which we could pay attention would not necessarily be needed, in order to perceive the car. Instead only a probability distribution for the occurrence of relations between various sub-sets of those parts would be required. Evidence could then be collected until the input could either be confidently rejected as an example of a car, or until it could be confidently accepted as an example of a car (or, perhaps, another object, in which case a car would not be perceived). Needless to say, even if this approach is reasonable, it is likely to be an oversimplification.

FIG. 8. Main operations required in ESOTerIC II.

FIG. 9. Schematic of ESOTerIC II.
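The idea of collecting evidence until the input can be confidently accepted or rejected can be sketched with a running log-likelihood ratio, in the style of a sequential probability ratio test. The paper does not name that technique; it is substituted here as one plausible reading. The sub-element names, the probabilities, and the two confidence thresholds are invented for the illustration, and independence of the sub-elements is assumed.

```python
import math


def sequential_decision(observations, p_if_class, p_if_not,
                        accept=math.log(19), reject=-math.log(19)):
    """Accumulate evidence over (name, present) observations.

    p_if_class[name]: assumed P(sub-element present | percept is of the class)
    p_if_not[name]:   assumed P(sub-element present | percept is not)
    Returns "accept", "reject" or "undecided".
    """
    llr = 0.0
    for name, present in observations:
        p1, p0 = p_if_class[name], p_if_not[name]
        llr += math.log((p1 if present else 1.0 - p1) /
                        (p0 if present else 1.0 - p0))
        if llr >= accept:
            return "accept"
        if llr <= reject:
            return "reject"
    return "undecided"
```

No rule relating all the parts is needed; a decision may be reached after only a few sub-elements have been examined, and the remainder need never be attended to.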

Applications

There are several applications of the simple speech recognition equipment that can now be constructed. These remarks are based on a contribution to the discussion called, “Tomorrows Research in Speech”, at the IEEE/AFCRL Joint Conference on Speech Communication and Processing (1967).

Optical pattern recognition has proceeded in stages. First a simple photocell-driven counter was used which could determine the number of distinct objects passing in front of the device. Next, patterns of holes in paper-tape or cards were identified using an array of photocells backed by more complicated hardware. Then, specially printed and specially shaped characters were recognized, using even more sophisticated transducers and hardware. The progression has continued through normally printed special shapes; a single standard fount; many founts; and, more recently, hand-printed characters. Despite this slow progression in optical pattern recognition many people have expected to see, in ASR, the equivalent of handwriting recognition as an initial step. If a more realistic view is taken, the same signs of sure, steady progress can be seen in ASR as have already been observed in OCR.

Two speech-operated telephone answering machines are on sale in the United Kingdom; the user may access his machine by a call from an ordinary telephone and, by using code-groups of digits, replay, and then erase, the recordings made by callers. One device, the Telstor, manufactured by Shipton Automation, counts the number of deliberately isolated words. Thus “3” would be “one, two, three”. The other device, manufactured by Robophone, measures the duration of silence while the caller counts down under his breath from the desired number. Thus “3” would be “three (two, one)”. Notice that even this represents a higher level than the first stage in OCR, for the “holes” are not machine generated. It is certain that the recognition of machine-generated speech would present only technological problems.
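The word-counting idea can be illustrated with a short sketch that counts bursts of energy separated by silence. It is not a description of the Telstor circuit, whose details are not given here; the frame-energy input, the threshold, and the minimum gap length are all assumptions chosen for illustration.

```python
def count_isolated_words(energies, threshold=0.1, min_gap=3):
    """Count bursts of energy separated by at least `min_gap` quiet frames.

    energies  -- sequence of per-frame energy values (hypothetical units)
    threshold -- energy at or above which a frame counts as speech
    min_gap   -- quiet frames required before a new word is counted
    """
    words = 0
    quiet_run = min_gap  # treat the start of the call as preceded by silence
    for energy in energies:
        if energy >= threshold:
            if quiet_run >= min_gap:
                words += 1  # a new burst after a sufficiently long gap
            quiet_run = 0
        else:
            quiet_run += 1
    return words


if __name__ == "__main__":
    # Three bursts, as in "one, two, three"; the brief dip inside the
    # second burst is too short to split it into two words.
    frames = [0, 0.5, 0.6, 0, 0, 0, 0.7, 0, 0.4, 0, 0, 0, 0.9]
    print(count_isolated_words(frames))
```

The requirement that words be deliberately isolated corresponds to the `min_gap` parameter: short dips inside a word are ignored, so only clear pauses separate one count from the next.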


At Standard Telecommunication Laboratories, Harlow, U.K., a device called “VOTEM” corresponds to an OCR system for reading code groups, being essentially a morse-code operated typewriter. ESOTerIC I is a recognizer for special characters, specially printed, and ESOTerIC II indicates the way in which this could be extended to operate with a wider range of “founts”. Bobrow & Klatt’s (1968) work on speech recognition, proceeding on somewhat similar lines, is probably the most advanced, and may be considered equivalent to multi-fount OCR. The single fount case is paralleled by single speaker recognition, but, though many workers agree that it presents no difficulty, it has received little attention. It is of considerable practical interest for applications such as recognition in the Air Traffic Control environment, where there are few enough speakers to treat them as individuals and store specific data about them. They are specially trained, used to wearing special equipment, highly intelligent, and in need of some device to relieve the increasing load they carry. ATC is of interest as a potential area for the early application of ASR, since the controllers are already required to pass clear messages to the pilots. The substance of these messages is to be stored in a computer, which could generate a display providing immediate feedback of the machine’s recognition (an important consideration in ASR); also, since the language used by the controllers is simple and the vocabulary limited, error correction using context would, perhaps, also be possible.

An important class of applications, even for a first-generation device like ESOTerIC I, is exemplified by the use of the device to control a programmable visual (or audio-visual) display (Gedye & Gaines, 1967). A system of this type, comprising an ESOTerIC device connected to an ESL tm 1024 machine, was demonstrated by the author at the National Physical Laboratory, Teddington, U.K., in February 1967.

The tm 1024, which is a commercial version of a machine developed by Newman & Scantlebury (1967) at the National Physical Laboratory, is a programmable visual (or audio-visual) display which has been used for a number of purposes including teaching, information retrieval, and fault finding. It also forms the visual display of the ESL (Bristol) ts 512, a simple data acquisition system producing computer-compatible output, which has so far been used primarily in a medical context for research on automated history-taking and investigation of learning ability in various patient groups; e.g. mentally subnormal children, adults with head injuries, and geriatric patients (Gedye, 1967, 1968; Gedye & Wedgwood, 1966).

In its present form the tm 1024 presents frames of visual (or audio-visual) information, to which the user makes a binary response by pressing a left, or a right, response button. The next frame to be presented depends on the current frame and the response made. A 10-bit word, read from the current frame according to the response, specifies the inter-frame jump (up to a maximum of ± 512 frames), and enables the programmer to implement complex branching program structures.
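The branching mechanism can be sketched as follows. The text does not say how the 10-bit word encodes the jump, so this sketch assumes a two's-complement offset (giving -512 to +511); the function names and the table of jump words are likewise hypothetical.

```python
def decode_jump(word):
    """Interpret a 10-bit word as a signed frame offset.

    Assumption: two's-complement encoding, so 0..511 are forward jumps
    and 512..1023 stand for -512..-1. The real encoding is not stated.
    """
    if not 0 <= word < 1024:
        raise ValueError("jump word must fit in 10 bits")
    return word - 1024 if word >= 512 else word


def next_frame(current, response, jump_words):
    """Return the index of the next frame to present.

    current    -- index of the frame now displayed
    response   -- "left" or "right" (the binary response)
    jump_words -- maps frame index to (left_word, right_word)
    """
    if response not in ("left", "right"):
        raise ValueError("response must be 'left' or 'right'")
    left_word, right_word = jump_words[current]
    word = left_word if response == "left" else right_word
    return current + decode_jump(word)


# A hypothetical three-frame program.
PROGRAM = {0: (1, 2), 1: (1023, 1), 2: (1022, 0)}

if __name__ == "__main__":
    frame = 0
    for answer in ["left", "right", "left"]:
        frame = next_frame(frame, answer, PROGRAM)
        print(answer, "->", frame)
```

Because each frame carries its own pair of jump words, arbitrary branching structures can be built without any central program: the "program" is distributed over the frames themselves.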

An ASR input to a system of this type is of interest for three main reasons. (1) It allows a wide variety of different inputs to be specified for what is basically a single two-valued input system. (2) It avoids the need to operate the device manually, and uses the natural medium of speech for communication. (3) It uses a speech recognizer in an established teaching situation.

The words discriminated by the ASR device may be split into two groups: positive words such as “Yes, Nice, Top, Sooner”, and negative words such as “No, Nasty, Bottom, Later”. For this purpose it is not even necessary that the machine should have a large intrinsic vocabulary, although this may be required for ASR control of other machine functions. Programming is made easier because it is no longer necessary to phrase questions artificially in order to elicit an arbitrarily identified left or right button press. At the same time, the communication situation is improved by the use of a naturally compatible display/control relationship.
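The grouping of words into two response classes amounts to a simple lookup, sketched below. The word lists come from the examples above; the function name, the choice of which button stands for the positive group, and the handling of unrecognized words are assumptions made for illustration.

```python
POSITIVE_WORDS = {"yes", "nice", "top", "sooner"}
NEGATIVE_WORDS = {"no", "nasty", "bottom", "later"}


def word_to_response(word):
    """Map a recognized word onto the two-valued response.

    Assumption: positive words stand in for the right button and
    negative words for the left; the assignment is arbitrary.
    Returns None for a word in neither group, so the caller can
    re-prompt rather than guess.
    """
    key = word.strip().lower()
    if key in POSITIVE_WORDS:
        return "right"
    if key in NEGATIVE_WORDS:
        return "left"
    return None


if __name__ == "__main__":
    for spoken in ["Yes", "Later", "Maybe"]:
        print(spoken, "->", word_to_response(spoken))
```

Any number of word pairs can be added to the two sets without changing the display device, which still sees only a two-valued input.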

We have already noted that speech is a natural form of communication; in addition speech operation has other special advantages. For example, in teaching manual skills, the student can be freed of the necessity to push buttons whilst attempting some complex manual task, and for some users, particularly the types of patient referred to above, manual operation may be sufficiently difficult to prevent the use of an interactive system altogether unless a speech input is available.

A further important application of ASR devices is in speech education. An ASR device such as ESOTerIC communicates in sufficiently simple terms (hisses and humphs) to make programming certain aspects of speech therapy feasible. This could be particularly useful in implementing the approach to the treatment of dysphasia pioneered by Filby & Edwards (1963). There are further possibilities, along the same lines, of using such a system to teach illiterate people of any tongue some elementary notions of written language.

None of these applications requires a better performance from the ASR device than can be achieved at present for a very low cost. The existence of teaching and allied situations with feedback, and a suitably restricted communication situation, coupled with real needs, has the ingredients of a viable application area. With the need for re-education of unskilled and semiskilled manual workers, the understaffing of our hospitals, and the worldwide problems of education, which include a teacher shortage, the suggested applications for the simple devices we can build now assume special significance.

Conclusions

Considerable progress is being made in the three areas of definition, implementation, and application in ASR. It is now recognized that the speech signal contains two main types of information at the acoustic level, content information and order information, and that the ASR problem must be defined in such a way as to take account of this.

The manner of implementation must surely now follow the lines of a pattern grammar operating on suitable primitives, though there is still considerable divergence of opinion on the question of decision methods. It is important, for implementation, to bear in mind hardware techniques such as large scale integration (LSI), which make methods employing highly repetitive arrays very attractive.

Lastly, it is clear that not only do applications exist for the simple devices which can be built now, but also useful devices are already in service, and some such applications, far from being trivial, are far reaching in their importance and social implications.

Acknowledgements

The many people who have encouraged and helped the author during his work on ASR are too numerous for all to be mentioned. However, he would like to thank Prof. J. H. Andreae, who nursed and criticized early ideas; Prof. D. Michie, for his confidence in the hardware at an early stage; C. J. Cheney and D. Wright for their work on the tm 1024 and the PDP7 at Cambridge, and Dr. J. L. Gedye for encouraging this work.

The author also wishes to thank the National Research Council of Canada, Standard Telecommunication Laboratories Ltd., and the Ministry of Technology for financial support given at various stages of the research.

References

AFCRL/IEEE (1967). Tomorrow’s research on speech. Panel and floor discussion, AFCRL/IEEE joint conference on speech communication and processing, MIT.

BOBROW, D. G. & KLATT, D. H. (1968). Studies in word recognition by computer. 75th Meeting of the Acoustical Society of America, Ottawa.

EVANS, T. G. (1968). A grammar-controlled pattern analyser. Preprints Proc. IFIP 1968 Congress, Edinburgh.

FILBY, Y. & EDWARDS, A. E. (1963). An application of automated-teaching methods to test and teach form discrimination to aphasics. J. Programmed Instruction, 2, 25-33.

FRY, D. B. & DENES, P. (1958). The solution of some fundamental problems in mechanical speech recognition. Language and Speech 1 (1).

GEDYE, J. L. (1967). A teaching machine program for use as a test of learning ability. In Aspects of Educational Technology, Ed. D. Unwin and J. Leedham, pp. 369-389. London: Methuen.

GEDYE, J. L. (1968). Automated instructional techniques in the rehabilitation of patients with head injury. Proc. R. Soc. Med. 61, 858-860.

GEDYE, J. L. & GAINES, B. R. (1967). Medical applications of programmable audiovisual displays. Dig. VII Int. Conf. Med. Biol. Engng, Stockholm, p. 226.

GEDYE, J. L. & WEDGEWOOD, J. (1967). Experience in the use of a teaching machine for the assessment of senile mental changes. Proc. VII Int. Congr. Gerontol. Vienna, pp. 205-207.

GUZMAN, A. (1967). Scene analysis using the concept of model. Computer Corp. of America Contract Rep. AFCRL-67-0133, January, 1967.

HILL, D. R. (1966). STAR—A machine to recognise spoken words. Proceedings of the IFIP Congr., 1965. New York: Spartan/Macmillan.

HILL, D. R. (1967a). Automatic speech recognition: a problem for machine intelligence. In Machine Intelligence 1, Edinburgh: Oliver & Boyd.

HILL, D. R. (1967b). Some applications of a small computer (PDP8) to automatic speech recognition research. 129th Meeting British Association for the Advancement of Science, Leeds.

HILL, D. R. (1968). Automatic speech recognition. In Encyclopaedia of Information, Linguistics and Control. Oxford: Pergamon Press.

HUBEL, D. H. & WIESEL, T. N. (1962). Receptive fields, binocular interaction and functional architecture in the cat’s visual cortex. J. Physiol. 160, 106-154.

HUGHES, G. W. (1965). Speech analysis, ASTIA report number AD 624 555.

VON KELLER, T. G. (1968). Automatic recognition of spoken words. Presented at 75th Meeting Acoustical Society of America, Ottawa.

LETTVIN, J. Y., MATURANA, H. R., MCCULLOCH, W. S. & PITTS, W. H. (1959). What the frog’s eye tells the frog’s brain. Proc. Inst. Radio Engrs, 47 (11), 1940-59.

MARTIN, T. B., ZADELL, H. J., NELSON, A. L., & COX, R. B. (1966). Recognition of continuous speech by feature abstraction. RCA Tech. Rep. TR-66-189.

MINSKY, M. (1961). Steps towards artificial intelligence. Proc. Inst. Radio Engrs, 49, 8-30.

NAGY, G. (1968). Classification algorithms in pattern recognition. IEEE Trans. Audio and Electro-acoustics, AU-16 (2), June 1968.

NEWMAN, E. A. & SCANTLEBURY, R. (1967). Teaching machines as intelligence amplifiers. National Physical Laboratory Report Auto 31, London.

SAKAI, T. & DOSHITA, S. (1963). The automatic speech recognition system for conversational sound. IEEE Trans. Electronic Computers, EC-12 (6), December, 1963.

SELFRIDGE, O. G. & NEISSER, U. (1960). Pattern recognition by machine. Scient. Am. 203 (2), August, 1960.

SUTHERLAND, N. S. (1968). Pattern recognition in animals. Proc. of Int. Conf. Univ. Manitoba. Pattern Recognition: The Retina and the Machine, May, 1968.


TILLMAN, H. G., HEICKE, G., SCHNELLE, H. & UNGEHEUER, G. (1965). Dawid I.—Ein Beitrag zur automatischen “Spracherkennung.” Proc. 5th Int. Congr. Acoustics, Liege.

WHITFIELD, I. G. & EVANS, E. F. (1965). Behaviour of neurones in the unanaesthetised auditory cortex of the cat. Report of the Neurocommunications Research Unit, Univ. Birmingham, for March 1964 to February 1965, on contract DA-91-591-EUC2803 of the U.S. Army.
