Very high-level language design: A viewpoint


Computer Languages, Vol. 1, pp. 3-16. Pergamon Press, 1975. Printed in Northern Ireland

VERY HIGH-LEVEL LANGUAGE DESIGN: A VIEWPOINT

ALLEN TUCKER Academic Computation Center, Georgetown University, Washington, DC, U.S.A.

(Received 5 November 1973)

Abstract--Recent developments in very high-level language design indicate that these languages hold great promise for improving the level of man-machine communication, and hence improving computer and programmer utilization. (Essentially, a very high-level language is one which allows the programmer to specify what to do, rather than how to do it.) This paper surveys these developments, outlines the goals to which an "ideal" very high-level language should aspire, and then presents the design of a very high-level language that would meet these goals. This design is presented in the interest of laying bare some basic design and implementation questions that are inherent to such an achievement. The paper then discusses these questions, indicating both old and new research problems which they suggest.

High-level languages   Very high-level languages   Command languages   Data base management systems   Job control languages   Extendible languages   Parsing   Automatic algorithm selection   Ambiguity   Context-sensitivity

1. INTRODUCTION AND MOTIVATION

THE HISTORY of computer languages is entering probably its fourth era. The first era was that of machine languages: cryptic coding conventions for directing the machine through each of a tediously long sequence of elementary actions. The second era was that of symbolic machine (assembly) languages, which made the algorithm's coding task simpler, but no less machine-dependent. The third era was that of high-level languages, which added a significant level of economy in writing, machine independence, and readability to programs. The fourth era appears to be that of "very high-level languages," or command languages, which enable the writer to express what is to be done, in a language more familiar to his own discipline, rather than how it is to be done.

Examples of very high-level languages currently in use are found in statistical packages, data base management systems, query systems, job control languages, and other special application packages. Their widespread use by computer personnel as well as nonprogrammers in other fields indicates their wide acceptance as a mode of getting useful work done quickly. Even with the existence of such packaged systems having simple access, an immense dilemma persists: computer capacity is a very underutilized resource, yet most professionals who have great need for computer service either fail to get it or do not get it efficiently.

A primary cause for this dilemma is the multiplicity of languages and conventions for getting useful work done, coupled with the fact that as a user's requirement changes, he often has to shift from one language (convention) to another. This is particularly difficult for the nonprogrammer, since the command languages differ among each other more widely in syntax and functional capabilities than do programming languages. Further, the programmer usually has an innate sense of excitement for the problem-solving challenge at hand, and is thus more apt either to overcome inter-language differences or to "bend" a language to fit the problem at hand, in order to achieve a solution.


As a means of solving this problem, we feel that the achievement of a universal very high-level language is the most appropriate goal. It is the purpose of this paper to explore the possibilities for such an achievement, identifying the major difficulties involved.

The next Section surveys those attributes of current very high-level languages, systems, and programming languages which seem to bear on this question. In Section 3 we describe a system that would support the achievement of a universal very high-level language. The purpose of that section is not so much to propose a concrete system design as it is to provide a framework in which the main functions of such a system may be laid bare, thus expediting discussion of the specific research problems at hand. That discussion occurs in Section 4.

The specific objectives of a universal very high-level language are as follows:

(1) The language should be easy to learn by the nonprogrammer.

(2) The language should be functionally extensible, so that the nonprogrammer can tailor it to suit his specific needs.

(3) The language should allow writing efficiency, so that e.g. words and phrases are easily abbreviated, and thus a minimum amount of "writing overhead" would be required in the specification of a computing task.

(4) The language should permit algorithmic specifications via some interface with conventional programming languages.

(5) The language should be natural to use, in the sense that the syntax be English-like and the meaning of a word be determined by looking it up in a conventionally-organized dictionary.

(6) The language (indeed, the whole system) should be transferable among any of a number of medium to large scale computing systems, both among species and within a species.

(7) The language should be implemented with reasonable ease, and should run reasonably efficiently for any particular implementation.

(8) The language (system) should be usable in either batch or interactive mode.

(9) The language's functional capabilities should be modularly organized so that any one installation can easily select a subset of functions tailored to its particular area of applications.

(10) The language should provide access to the full range of a system's functional capabilities.

These objectives are incredibly ambitious; their attainment may not even be foreseeable. Yet, the achievement of truly widespread accessibility to computing power demands an ambitious effort.

2. SURVEY

Recent developments which bear most heavily on this issue are described in this section. First, we discuss current I/O hardware. Second, we discuss the functional capabilities embodied by current very high-level languages.

2.1. I/O hardware

It will be necessary that any system which meets the goals we suggest interface with the user as painlessly as possible. Specifically, the devices through which he communicates with the system should permit a representation of information which assimilates, and probably extends, that of an ordinary typewriter. It is well-known that this is the case for many graphics and typewriter-like terminals currently in use. It is not the case for most card-punching devices (e.g. the IBM 029) and line printers, which usually permit either a 48- or 60-character set that does not include even the lower-case letters of the alphabet.

Extension of these capabilities seems to be required in three ways. First, the character set should be larger and extensible. Second, the one-dimensional (linear) nature of these devices should be extended to two dimensions, in order to permit natural subscript and superscript representations. Third, optical character recognition devices should be available to permit direct entry of typewritten documents with these stylistic enhancements.

Line printers are currently available which permit the use of a larger alphabet. This option is not widely used, however, due to its additional cost and slower printing speed. To allow a truly extendible alphabet, however, involves significant technological problems whose solution is apparently not feasible, or at least not market-justified at the present time. Similarly, the design of a line printer which prints subscripts, superscripts, and different type fonts has not been achieved.

On the brighter side, Klerer and May [5] report the implementation of a language with two-dimensional typewriter I/O. At least two manufacturers market optical character recognition devices which directly read pages of typewritten information. They do not, however, permit subscripts, superscripts, or alphabet extendibility. Again, the question of cost justification is paramount. Even the models which are currently available are more expensive and slower than other input devices.

2.2. Current very high-level languages

Current languages which may be described as command, rather than algorithmic, are many. The most well-known among these arise out of the following application areas: statistics, data base management and query (including report generation), and job control. These areas differ widely in functional requirements, and their languages vary widely in style and "naturalness". Let us examine a few of these languages.

2.2.1. Statistical packages. Among the many statistical packages in use, a few have the property of being reasonably easy to learn and use. Two of these are SPSS [6] and EASYSTAT [9]. A more complete survey of available statistical packages appears in [8].

Such packages have similar functional capabilities; their differences lie mainly in their appearance to the potential user. That is, what does the user have to learn, what does he have to say, what "system limitations" are placed on his processing requirements, and how does he decipher the results (and diagnostics)?

The user confronts EASYSTAT via a self-paced, modular textbook. It tells him the kinds of functions that can be performed and how he is to describe them. By following the directions he writes a description of the computation he wants, using short English phrases, like "MULTIPLE LINEAR REGRESSION" and "25 VARIABLES". The SPSS text is similarly self-contained and natural to use.

Although statistical packages' languages have been brought to a high level, and have been demonstrated to yield a high success-per-run rate for the nonprogramming user (70 per cent in the case of EASYSTAT), the class of processes which they represent is not as general as that of a general-purpose very high-level language. Specifically, the user has no possibility for variety in the configuration of data sets which he presents for a run. He has one input data set and one output data set, and that's that. Some systems may allow the user the option of creating an additional output data set to be used as input to a subsequent run, but this is an exceptional case. Any generalized very high-level language will have to allow for a variety of data set configurations to occur.

2.2.2. Data base management and query systems. There are many systems which fall into this category. Most are intended to provide an alternative to using a programming language for creating data bases, performing file maintenance activities, and answering inquiries. Yet, there appears to be little common agreement on user language or functional capability.

A recent study by the CODASYL Systems Committee [2] documents the salient features of several widely-used data base management systems. Notable among these for its similarity with the goals expressed here is UL/1 [7]. It was designed to serve a wide variety of nonprogramming users and has a built-in programming capability as well. Unfortunately, it appears not to have enjoyed the widespread exposure required to demonstrate its versatility. Further, it seems to have several implementation-dependent design limitations.

These systems generally exhibit many features which are desirable in a more general-purpose system: simplicity and conciseness of user language, a variety of common, easily-invoked file manipulation functions, automatic report generation, natural conventions for naming and later referencing fields within a record, and so forth. Such features will probably be embedded within future very high-level languages.

These systems also have some severe limitations, most of which exist because their removal is not justified on a cost/benefit basis. However, they are worthy of mention for the purposes of this paper. First, each package is machine-dependent, and is often operating system-dependent as well. Second, each package permits only a fixed number of different basic functions to be performed; namely, file creation (definition), file maintenance (merge), report generation (extraction, tabulation, print), file reorganization (sort, compress). Third, such a package is often incompatible with other software in an installation. For instance, its files are accessible only through use of the package itself. Fourth, such a package is not extendible, either for adding new processing functions or for adding new file organization/access schemes.

2.2.3. Job control languages and program libraries. The most well-known among the job control languages is OS/360's JCL [4]. These languages are very high-level, since they allow the user to specify what to do rather than how to do it. Unlike the packages discussed above, the job control languages allow the user full access to the installation's program libraries. The latter usually contain a full complement of data set manipulation functions as well as programs which are of importance to the special application needs of the installation (e.g. statistical, data processing, etc.). A job control language is extendible in the sense that new programs can be added to the library and thereafter may be accessed via the job control language.

To the nonprogramming user, the job control language is usually not well understood in its expressive power. His tendency is to find a set of control cards, or else some strictly-defined control card encoding conventions, which will suit his needs. He does not typically attempt to master the language or its full range of capabilities. The reasons for this are perhaps twofold. First, effective use of a job control language usually requires a good understanding of data management and hardware I/O facilities. Specifications written in the language are system-dependent, rather than application-dependent. For instance, the specification of a particular file organization for the given available hardware configuration is a common task to be expressed in control language. Second, the language itself has very strict and artificial writing rules; thus a specification in the language appears unintelligible to the uninitiated user.


3. SYSTEM DESIGN

The design of a very high-level language which satisfies the objectives outlined in the introduction deserves extensive study and development. We outline here what might be considered as a "first cut" at such a design. It must be emphasized that we are not presenting a fait accompli; rather, we present this initial design in order that the critical research problems involved may be laid bare.

A graphical description of this design appears in Fig. 1 below. In this section, we will attempt to breathe some life into that description by identifying the functional nature of its elements. The elements in Fig. 1 are labeled identically with the subsections of this section to permit easy cross reference. Examples are given to further motivate the discussion.

[Fig. 1 depicts the system's elements and the flow of information among them: the user (A), equipped with a self-teaching text, dictionary, and optional programming language manual, submits source text (base language statements and optional programs); the dictionaries (B), data description library (C), identification file (D), and function libraries (E) supply the program generator/compiler (F) with the information needed to produce a generated program; the executor (G) runs that program, yielding data sets, printed output and messages, and updated statistics and descriptions.]

Fig. 1. Graphical system description.

3.1. The user (A)

The user's computer background may range from that of a highly-skilled systems programmer to none at all. The only necessary common characteristic among users is that each has a need for a particular, well-defined function to be executed by the computer.

The single element of the system which is shared in common among all users is what we will call the "base language". It would be as ubiquitous as OS/360's JCL. Unlike JCL, however, the base language should be easy to learn and natural to use. Thus, it should be self-teaching, modular, functionally extensible, and reasonably abbreviable. Further discussion of base language design characteristics appears in the next section. Here, we will only give a few examples of the kinds of statements one would typically make in the base language.

Example 1. PRINT IN ALPHABETICAL ORDER BY LAST NAME THE NAME AND ADDRESS OF EACH PERSON IN MY FILE WHOSE HOME STATE IS NEW YORK.

Example 2. FOR ALL PERSONS IN THE CURRENT PAYROLL FILE WHOSE BASE PAY HAS NOT BEEN RAISED IN THE LAST SIX MONTHS, RAISE IT BY 5%.

Example 3. PERFORM A LINEAR REGRESSION: DMF VS PERMANENT CARIES (PC), USING THE CARACAS CLINICAL EXAMINATION DATA (REGION 1 ONLY). EXCLUDE ALL OBSERVATIONS WITH PC = 9. COMPUTE 50, 75, 90, AND 95% CONFIDENCE INTERVALS ON DMF FOR EACH OF THE FOLLOWING PC VALUES: 0, 1, 2, 3, 4, 5. PLOT THE REGRESSION LINE AND THE OBSERVATIONS.

As indicated by these examples, the base language is English-like and yet allows reasonable economy of expression. This economy is achieved in various ways, notably the ability to abbreviate (e.g. PC in Example 3), the ability to use pronouns (e.g. IT in Example 2), and the system's ability to assign meanings to data names (e.g. NAME and ADDRESS in Example 1) automatically. In addition, the system may deduce by default the particular (sequence of) algorithm(s) required to perform the indicated computations. For instance, there is an implied sort indicated in Example 1, provided that the file is not already in alphabetical order by last name. These ideas will be further discussed in subsequent sections.

3.2. The dictionaries (B)

To aid him in describing computations to be performed by the system, the user has access to one or more printed "dictionaries." A dictionary is used in generally the same way that an ordinary English dictionary is used. Its words are ordered alphabetically. Appearing with each word is a definition of its meaning(s), which may appear as a description of an algorithm. That description may be written in the base language itself.

Together, the dictionaries define all the computational capabilities of the system. Due to the wide variety of particular computational needs among users, a number of dictionaries should be available. First and foremost among these is the "base dictionary," which defines the elements of the base language available to all users. Functionally, this language is envisioned as an extension and unification of the basic data processing functions which are currently available in the form of the OS/360 utilities, a standard sort/merge package, a reasonably comprehensive data base management system, and a report program generator.

Additional dictionaries would be available for each of a number of wide application areas. These would include mathematical and statistical procedures, text processing procedures, data-processing procedures, graphics procedures, data structure manipulation and systems programming procedures, and so forth. Typically a user in a particular area of application should require only one additional dictionary, together with the basic dictionary, to perform his tasks.

All dictionaries except the base dictionary should be extendible; the meaning(s) of a word may be modified, new words may be defined and added to a dictionary, and words may be deleted from a dictionary. The kinds of computational actions the words in these dictionaries connote are similar to those of the various special application (subroutine) packages currently in use.

In addition, each user or group of users served by an installation may create and maintain his (their) own dictionary of terms used only by him (them). The kinds of computational actions connoted by the words in such a dictionary are similar to those of an installation's application program library.

All dictionaries are maintained, and updates are generated, by the system. The program generator/compiler uses an encoded form of the dictionaries as the semantic basis for interpreting user requests in the base language.
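As a concrete illustration of what an encoded dictionary entry might contain, the sketch below assumes (hypothetically) that each entry carries a word class, one or more prose meanings, and an optional expansion written in the base language; neither the field names nor the sample entry are part of the design described above.

```python
# A minimal, hypothetical encoding of a dictionary entry as the program
# generator/compiler might consult it. Field names are illustrative only.
from dataclasses import dataclass, field

@dataclass
class Definition:
    meaning: str                     # prose definition shown in the printed dictionary
    expansion: str | None = None     # optional definition written in the base language

@dataclass
class DictionaryEntry:
    word: str
    word_class: str                  # e.g. "verb", "noun", "adverb"
    definitions: list[Definition] = field(default_factory=list)

# A tiny user dictionary: "RAISE" is a verb whose meaning is itself spelled
# out in the base language, as the text suggests is possible.
user_dictionary = {
    "RAISE": DictionaryEntry(
        word="RAISE",
        word_class="verb",
        definitions=[Definition(
            meaning="increase a numeric field by a given percentage",
            expansion="ADD TO THE FIELD THE FIELD TIMES THE GIVEN PERCENTAGE",
        )],
    ),
}

def lookup(word: str) -> DictionaryEntry | None:
    """Resolve a word the way an encoded dictionary might be consulted."""
    return user_dictionary.get(word.upper())

if __name__ == "__main__":
    entry = lookup("raise")
    print(entry.word_class, "-", entry.definitions[0].meaning)
```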

3.3. The data description library (C)

As the dictionary is the system's element for storing algorithmic information, the data description library (DDL) is the system's element for defining data-descriptive information. This library is based on the conventional notion of "system catalog," but extends that notion in the following ways.

(1) All data sets accessed by the system are automatically catalogued in the DDL.

(2) The information stored for an individual data set in the DDL is extensive. In addition to information about the medium (e.g. tape, disc), particular volume label, and so forth, on which the data set resides, there is the following additional descriptive information:

(a) usage statistics
(b) functional impact
(c) data set size (number of records), organization, primary sequence, blocksize, record-size
(d) record layout; field names, data types, and validity-checking information.

Here, the "usage statistics" will be used by the system as input to standard accounting procedures. In addition, these statistics allow the system to make such system-management decisions as determining whether the activity level of a data set justifies its present location on a permanently-mounted volume. The "functional impact" information identifies those functions which access the data set, for input and for output. The other information listed above is fairly self-descriptive. We note here that the field names defined for a data set provide the primary vehicle by which the system identifies the meaning of a name given by the user (e.g. NAME, ADDRESS, PERSON in Example 1). This technique is commonly used in current data base management systems.

3.4. The identification file (D)

This file is the basis for identifying all active users of the system, either individually or in groups (e.g. by department). It is linked with the dictionaries and data description library in order to identify the data sets and special dictionaries associated with each user (group). In a sense, it is an extension of the basic accounting file maintained in conventional systems. It is an extension in the following ways:

(1) It serves as an authority on file accessing, thus providing protection to files from unauthorized access.


(2) It allows the single-file user to unambiguously reference his file via phrases like "THE FILE" or "MY FILE."

(3) It links a user with his own "private" dictionary, thus allowing him to assign special meanings to words he uses to denote special actions.
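A minimal sketch of an identification-file record, together with the access check and "MY FILE" resolution just described, might look as follows; the record's fields and the sample user are hypothetical.

```python
# Hypothetical identification-file record and two uses of it: an authority
# check on file access, and unambiguous resolution of "MY FILE".
from dataclasses import dataclass, field

@dataclass
class UserRecord:
    user_id: str
    group: str
    account: str
    private_dictionary: str | None = None                # the user's own dictionary
    data_sets: list[str] = field(default_factory=list)   # data sets he owns
    may_read: set[str] = field(default_factory=set)      # other data sets he may access

def authorized(user: UserRecord, data_set: str) -> bool:
    """Serve as the authority on file accessing."""
    return data_set in user.data_sets or data_set in user.may_read

def resolve_my_file(user: UserRecord) -> str:
    """Let the single-file user say 'MY FILE' without ambiguity."""
    if len(user.data_sets) != 1:
        raise ValueError(f"'MY FILE' is ambiguous: user owns {len(user.data_sets)} data sets")
    return user.data_sets[0]

clerk = UserRecord("JONES", "PAYROLL DEPT", "ACCT42",
                   private_dictionary="JONES-DICT",
                   data_sets=["CURRENT PAYROLL FILE"])
print(authorized(clerk, "CURRENT PAYROLL FILE"), resolve_my_file(clerk))
```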

3.5. Function libraries (E)

These libraries contain the computational algorithms themselves. They are analogous with the conventional notions of "program library", "built-in functions" of a high-level language, "utility programs", access methods, and so forth. In addition, they contain the following information about the algorithms themselves.

(1) Timing information
(2) Storage requirements
(3) Other resource requirements
(4) Linkage information

The first three items here provide the system with a basis to predict execution time for a user request for a given function. The fourth item provides information concerning parameters required and other functions called by and calling the given function.
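The sketch below gives one hypothetical shape for a function-library entry holding these four kinds of information, with a deliberately simple linear timing model; neither the field names nor the cost model are taken from the design above.

```python
# Hypothetical function-library entry carrying the four kinds of information
# listed above. The per-record timing model is an assumption made only so
# that the prediction step is concrete.
from dataclasses import dataclass, field

@dataclass
class FunctionEntry:
    name: str
    time_per_record_ms: float          # (1) timing information
    storage_words: int                 # (2) storage requirement
    other_resources: list[str] = field(default_factory=list)  # (3) e.g. work data sets
    parameters: list[str] = field(default_factory=list)       # (4) linkage: parameters
    calls: list[str] = field(default_factory=list)            # (4) linkage: callees
    called_by: list[str] = field(default_factory=list)        # (4) linkage: callers

    def predicted_time_ms(self, record_count: int) -> float:
        """Basis for the system's execution-time prediction."""
        return self.time_per_record_ms * record_count

sort_entry = FunctionEntry("SORT", time_per_record_ms=0.8, storage_words=16000,
                           other_resources=["work data set"],
                           parameters=["input data set", "sort control field"],
                           calls=["READ", "WRITE"], called_by=["PRINT"])
print(round(sort_entry.predicted_time_ms(4200)), "ms predicted for 4200 records")
```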

3.6. Program generator/compiler (F)

The program generator/compiler (PGC) is one of the two main dynamic elements of the system. Its basic task is to read a user request, interpret it (using appropriate information from the dictionaries, the data description library, and the identification file), and generate an executable program from it. Concurrently, it should produce information on the generated program's resource requirements (e.g. execution time, storage requirements). From this information, the user has a reasonable prediction of the cost of running his job. He may also obtain a listing of the sequence of steps (programs) required to service his request, for purposes of documentation and subsequent program refinement.
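Schematically, this basic task might be sketched as follows, under strong simplifying assumptions: the request is taken to be already parsed into (function, data set) pairs, the "generated program" is merely an ordered list of steps, and the cost estimate is a sum of per-function predictions. The toy tables stand in for the encoded dictionary, DDL, and identification file.

```python
# A schematic of the PGC's basic task: interpret a pre-parsed request using
# dictionary, DDL, and identification-file information, then emit a program
# plus a resource prediction. Every table and field name here is assumed.

def generate_program(request, dictionary, ddl, ident_file, user):
    """Interpret a request and return (program steps, predicted cost in ms)."""
    steps, cost_ms = [], 0.0
    for verb, data_set in request:                 # each command: (function, argument)
        if data_set not in ident_file[user]:       # identification-file check
            raise PermissionError(f"{user} may not access {data_set}")
        entry = dictionary[verb]                   # dictionary supplies the function
        records = ddl[data_set]["record_count"]    # DDL supplies the data set's size
        steps.append(f"{verb} {data_set}")
        cost_ms += entry["ms_per_record"] * records
    return steps, cost_ms

dictionary = {"SORT": {"ms_per_record": 0.8}, "PRINT": {"ms_per_record": 0.3}}
ddl = {"MY FILE": {"record_count": 4200}}
ident_file = {"JONES": {"MY FILE"}}

program, cost = generate_program([("SORT", "MY FILE"), ("PRINT", "MY FILE")],
                                 dictionary, ddl, ident_file, "JONES")
print(program, f"predicted {cost:.0f} ms")
```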

As noted previously, the user may wish to express certain parts of his request by defining an algorithm in, say, FORTRAN. When he does, the PGC invokes an appropriate compiler to generate a function "on the spot". This will then be inserted appropriately in the generated program by the PGC. The whole question of inter-language linkage is an important one, and we will return to it in a later section.

Depending on the nature of the user's request, the PGC may additionally update one or more dictionaries, the function library, or the data description library. A dictionary and function library update would occur when a request contains the definition of a new word. Similarly, a DDL update would occur when a request specifies (implicitly or explicitly) the creation of a new data set.

3.7. The executor (G)

The executor plays a role akin to that of the conventional operating system: receiving jobs, scheduling them for execution, allocating resources to them, generating statistics, monitoring jobs, etc. As a result of executing a particular user request, the executor updates appropriate statistics in the data description library and the identification file.

Periodically, and when system activity is low, the executor should initiate the performance of various system-management functions, such as the following.

(1) Reporting accounting information.

(2) Generating hard copy of dictionary, DDL, and identification file updates.

(3) Dumping (restoring) low-activity (high-activity) data sets from (to) on-line direct access volumes to (from) off-line volumes.

(4) Reorganizing data sets to allow more efficient access and device utilization.

The system should not permit the casual user to determine whether or not his data set will reside on a permanently-mounted volume. Such a decision is best kept in the hands of the system, since it knows both the physical capacities and the total data requirements of all users. On the other hand, certain dictionaries, the DDL, the identification file, and certain function libraries might be kept on-line, due to their higher incidence of usage.

4. SPECIFIC RESEARCH PROBLEMS

Concurrent with the realization of such a system is the solution of certain research problems. Some of these are already known, while others are unique to this kind of system. It is the purpose of this section to identify these problems, the extent to which they relate to current systems, and avenues of approach which seem appropriate to their solution. As this writer sees them, the problems are identified as follows.

A. Base language design
B. Base language analysis/diagnostics
C. Automatic algorithm selection
D. Function generalization
E. Resource and timing estimation
F. Inter-language linkage
G. Default selection
H. System transferability and machine independence.

4.1. Base language design

As indicated in the examples of the previous section, the use of an English subset as the base language does not necessarily prohibit economy of expression. The use of English phrases to issue requests for computation is already widespread among many application packages [9], and has seen limited use in programming languages [1].

In the system proposed here, we view the base language as a replacement and extension to job control languages. Unlike these control languages, however, the base language has certain distinctive aspects which deserve special mention.

The meaning of a word in the base language is resolved via a dictionary, the DDL, and the identification file. As indicated in the foregoing examples, words fall into the following classes.

(1) Verb; indicating a function or functions to be performed. The particular meaning of a given verb is generally resolvable via the appropriate dictionary and the context in which the verb appears.

(2) Noun; naming an object which may be a file, a record, or a datum. Nouns are analogous with "identifiers" or "variable names" in conventional languages. The meaning (i.e. value) of a given noun is identifiable by consulting a particular DDL entry.

(3) Pronoun; indicating a reference to an object (noun) previously identified among the user's statements. Permitting the (unambiguous) use of pronouns in the base language appears to be essential to the achievement of a natural level of language for the user.


(4) Adjectives; to modify, or further qualify, the meaning of nouns.

(5) Adverbs; to modify or further qualify the meaning of verbs.

(6) Prepositions; to form prepositional phrases which qualify the intended usage of a given noun or verb.

(7) "Syntactic sugar"; articles and other words which have no semantic impact on the computation being described, but raise the text's level of readability.

The usual English punctuation, paragraphing, and sectioning of text should also be employed by the base language. The user should be required to entitle each of his processing requests, in order that the system be able to identify him.
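As an illustration of how these word classes might be resolved against a dictionary, the DDL, and fixed word lists, consider the following sketch; its toy tables and its single-word treatment of nouns are assumptions made only to keep the example short.

```python
# Illustrative token classification for the seven word classes above, using
# a dictionary table for verbs/adjectives/prepositions, DDL field names for
# nouns, and fixed lists of pronouns and "syntactic sugar". All tables are toys.

DICTIONARY = {"PRINT": "verb", "RAISE": "verb", "ALPHABETICAL": "adjective",
              "BY": "preposition", "IN": "preposition", "OF": "preposition"}
DDL_FIELDS = {"NAME", "ADDRESS", "LAST NAME", "BASE PAY", "HOME STATE"}   # nouns
PRONOUNS = {"IT", "THEM", "THEY"}
SUGAR = {"THE", "A", "AN", "EACH", "AND", "PLEASE"}

def classify(token: str) -> str:
    word = token.upper()
    if word in PRONOUNS:
        return "pronoun"
    if word in SUGAR:
        return "syntactic sugar"
    if word in DDL_FIELDS:
        return "noun"
    return DICTIONARY.get(word, "unknown")

sentence = "PRINT IN ALPHABETICAL ORDER BY LAST NAME THE NAME AND ADDRESS"
# Multi-word nouns such as LAST NAME would need a phrase-grouping pass; this
# sketch classifies single words only, so some tokens come back "unknown".
print([(w, classify(w)) for w in sentence.split()])
```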

4.2. Base language analysis/diagnostics

Syntactic analysis of the base language is more complex than that of, say, PL/I. In linguistic terms, the language is not even context-free, due among other things to its requirement for pronoun resolution. Developments in the area of implementing efficient context-sensitive syntax analyzers are limited [3, 11].

The grammar for the base language should allow syntactic ambiguity as well as context sensitivity. When, indeed, two or more parses exist for a given sentence, selection of the correct parse should always be possible via dictionary, DDL, or identification file consultation.

Since it is basically a command language, the base language should permit relatively efficient parsing in spite of the considerations mentioned above. Once the phrases are all sorted out, each command will take the following form.

f(a1, a2, ..., an)

Here, f denotes a function (verb) to be applied to the arguments (nouns) a1, a2, ..., an.

A more difficult problem will be that of deducing both (1) the sequence in which such commands are to take place, and (2) those actions which are implied by the user's request, but not explicitly stated. One rule, of course, is that commands should be executed in the order in which they are written. However, certain statements such as

PRINT, IN ALPHABETICAL ORDER BY LAST NAME, THE NAME AND ADDRESS ...

imply actions (a sort in this example) which should be carried out before the explicit action (e.g. PRINT) may begin.
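The following sketch illustrates how such an implied sort might be inserted ahead of an explicit PRINT, assuming the request has already been reduced to function-argument form and the DDL records each data set's current primary sequence; both assumptions are for illustration only.

```python
# Expanding explicit commands with the actions they imply: an implied SORT is
# planned ahead of a PRINT whenever the requested order differs from the data
# set's stored primary sequence. Structures and field names are assumptions.

ddl = {"MY FILE": {"primary_sequence": "HOME STATE"}}   # not in last-name order

def plan(commands, ddl):
    """Return the full sequence of steps, implied actions included, in order."""
    steps = []
    for verb, data_set, options in commands:
        order_by = options.get("order_by")
        # Implied sort: requested order differs from the current sequence.
        if verb == "PRINT" and order_by and ddl[data_set]["primary_sequence"] != order_by:
            steps.append(("SORT", data_set, {"key": order_by}))
        steps.append((verb, data_set, options))
    return steps

request = [("PRINT", "MY FILE", {"order_by": "LAST NAME",
                                 "fields": ["NAME", "ADDRESS"]})]
for step in plan(request, ddl):
    print(step)
```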

Diagnostic messages to the user should be closely cross-referenced with his self-teaching text, in order that he may expediently take corrective action. We identify two general classes of diagnostic messages.

(1) Indicating errors which prohibit program generation.

(2) Indicating an excessive, or potentially infinite, resource requirement for the program generated by the user's request.

The first class is similar to the conventional compiler's diagnostics. The second class differs from conventional messages in that the messages found here are anticipatory rather than after-the-fact. Concurrently, a burden is placed on the PGC to project anticipated storage, I/O device, and time requirements in order to check that they are reasonable (feasible) with respect to the installation's hardware resources and overall current work load, as well as the user's allowed access to the computer (measured in total hours, dollars, runs per week, etc.).

4.3. Automatic algorithm selection

When a program in a conventional high-level language invokes a "generic" function, the arguments passed tend to "specialize" that function to a particular kind of data. In effect, the original function represents a family of functions, any one of which may be activated by a particular invocation. Similarly, when the user of an application package specifies that a particular action take place, he is selecting a function from among the many functions which can be performed by the package.

The concept of Automatic Algorithm Selection (AAS) is meant to denote the act of selecting, from a well-defined collection of available functions in a given class, the one most appropriate to the user's request. Such a selection is made on the basis of the nature of the data to which the function is to be applied, the particular resources required by the function, and the hardware resources available in the system.

For example, one common family of functions is that of sorting. The data to be sorted is always a linear list, but may be presented in any of a variety of ways; e.g. as a sequential data set on tape with a "sort control" field or as an array of integers in memory. A number of sorting algorithms now exist and each is optimal for a different configuration of system resources and data arrangement. It is the responsibility of AAS to select, or specialize, the sort in such a way that performance and resource utilization are optimized with respect to these resource/data constraints.
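A toy selection rule for the sorting family might look as follows; the thresholds, algorithm names, and data-description fields are invented for the sketch and are not part of the design above.

```python
# A toy automatic-algorithm-selection rule for the sorting family: choose a
# specialization from the data's description and the memory available.

def select_sort(data_description: dict, free_memory_words: int) -> str:
    location = data_description["location"]          # "memory" or "tape"
    size = data_description["record_count"]
    if location == "memory":
        # Small in-core lists: a simple exchange sort is adequate.
        return "in-core exchange sort" if size < 1_000 else "in-core merge sort"
    # External data: sort in memory if it fits, otherwise merge from tape.
    fits = size * data_description["record_size"] <= free_memory_words
    return "read-in and in-core sort" if fits else "balanced tape merge sort"

print(select_sort({"location": "memory", "record_count": 200}, 64_000))
print(select_sort({"location": "tape", "record_count": 100_000,
                   "record_size": 80}, 64_000))
```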

Very little published research exists which directly addresses this issue. However, related work in the areas of program optimization and development of efficient application packages should be useful for generalizing the notion of generic function in such a way that efficient AAS can be achieved. In addition, work done in the areas of extendible languages, such as ALGOL 68 [10], should also provide guidelines in this effort.

4.4. Function generalization

Automatic Algorithm Selection requires that functions implemented in the system be generalized to a degree. It appears obvious that the more generalized a function becomes, the less efficiently any of its specializations will perform vis-a-vis an equivalent function written only for that specialization.

On the other hand, it is mandatory that the system be able to select an optimal algorithm for a computational task in such a way that the particular algorithm selected is transparent to the user. All he wants is that the task be accomplished, and he usually does not care how.

In view of such conflicting requirements, it is important that guidelines for the generality and relationship among functions be established. Implicit in these guidelines should be the following facts.

(1) Since storage is currently becoming a more expendable resource than time, the latter should be a more important criterion than the former in selecting optimal function representation (linkage conventions, etc.). For example, a recent look by the author at the PL/I built-in square root function shows that almost 50 per cent of the executable instructions are solely for parameter and subroutine linkage. If that code were embedded in-line, an increase in performance (speed- and space-wise) would be achieved at a cost of increasing the compiler's (or the source program's) complexity.


(2) Hierarchical dependency among functions should be as loose as possible, due to the reasons above as well as the dynamic nature of the function library.

(3) The kinds of arguments that can be passed to functions must be extended over conventional programming languages' limitations, so that every function will be generic to the outermost extent conveyed by its purpose.

4.5. Resource and timing estimation

The program generator/compiler must be able to automatically estimate resource requirements and execution times for any program it generates, in order that (1) the executor can sensibly schedule the program for execution, and (2) the user can be informed of the computational burden he is placing on the system (or on his budget).

The value of such information is immeasurable. The ease of obtaining such information, however, ranges from very straightforward (in the case where the requested function is in a dictionary with its timing parameters, and the appropriate size and access speed parameters for the requested data set(s) are in the DDL) to very difficult. The latter case occurs in the following situations.

(1) The user defines his own algorithm (function) in a programming language, and has not provided any timing information.

(2) The user creates a new data set whose size, organization, and physical medium are unpredictable or unspecified.

For these situations, a number of conventions can be established to force resource and timing data to be generated. The simplest, from the system's point of view, would be to require that the user provide such information when he creates a new function or data set. At the other extreme, the system may switch that function to a kind of "test mode", in which the system interpretively executes the function for the sole purpose of developing timing parameters for it. Once this is done, the system returns to "production mode" with this newly-acquired information.
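The straightforward case can be sketched as a simple calculation over the dictionary's timing parameters and the DDL's size and access-speed parameters; the linear cost model below, and the use of a missing parameter to trigger "test mode", are assumptions of the sketch rather than part of the design.

```python
# Estimating run time in the straightforward case: the dictionary supplies the
# function's timing parameters and the DDL supplies the data set's size and
# access speed. A missing timing parameter signals the "test mode" fallback.

def estimate_ms(function_entry: dict, ddl_entry: dict) -> float | None:
    """Return a predicted time, or None if the function must first be timed
    interpretively in 'test mode' because no timing information exists."""
    if "ms_per_record" not in function_entry:
        return None                        # force test-mode timing later
    records = ddl_entry["record_count"]
    compute = function_entry["ms_per_record"] * records
    io = records * ddl_entry["record_size"] / ddl_entry["bytes_per_ms"]
    return compute + io

sort_fn = {"ms_per_record": 0.8}
user_fn = {}                                # user-written, no timing data yet
payroll = {"record_count": 4200, "record_size": 80, "bytes_per_ms": 500}

print(estimate_ms(sort_fn, payroll))        # a concrete prediction
print(estimate_ms(user_fn, payroll))        # None: run it in test mode first
```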

4.6. Inter-language linkage

A capability for the user to "drop into" a conventional programming language from the base language, in order to express an algorithm, is essential here. This poses, however, a wide variety of design problems whose feasible solution may not be immediate. Such problems include the following.

(1) Automatic language recognition; the system must be able to recognize, ideally by syntax alone, that a shift from one language to another in the user's text has occurred.

(2) Once recognized, the particular language being used must be uniquely determined.

(3) The conventional problem of data (identifier) linkage among high-level languages exists here.

(4) Control must be passed to a conventional compiler to translate that text into object code, which must subsequently be correctly imbedded in the program being generated by the PGC.

Considering the current proliferation of programming languages, the system probably cannot practically provide more than a few standard high-level languages. Further, if the system is implemented on a particular machine species (e.g. IBM 360/370), that machine's assembly language should also be available for use.


4.7. Default selection

To support economy of expression and ease of learning in the base language, an extensive default philosophy must be employed by the system. Defaults for particular kinds of actions (e.g. standard system output medium, data set disposition at end of job, etc.) are widely agreed upon, and are easily standardized. For other kinds of actions, default selection is not so widely-known, and thus ought to be studied more extensively. Examples of such actions are the following.

(1) Determination of "standard output format" for printed results.

(2) Determination of physical medium and organization for non-card/print data sets.
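One hypothetical realization of this default philosophy is a table of system-wide defaults consulted whenever a request omits a specification, as sketched below; the particular keys and values are examples only, not defaults prescribed here.

```python
# A table of system-wide defaults, consulted whenever the user's request
# leaves a specification unstated. Keys and values are illustrative only.

SYSTEM_DEFAULTS = {
    "output medium": "printer",
    "data set disposition at end of job": "keep",
    "standard output format": "one labelled column per requested field",
    "physical medium for new data sets": "chosen by the executor",
}

def with_defaults(request_options: dict) -> dict:
    """Fill in anything the user's request left unspecified."""
    resolved = dict(SYSTEM_DEFAULTS)
    resolved.update(request_options)     # explicit choices override defaults
    return resolved

# A request that only names an output medium inherits the other defaults.
print(with_defaults({"output medium": "terminal"}))
```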

4.8. System transferability and machine independence

An ideal objective of such very high-level systems would be the achievement of machine independence, so that the system may exist (be transferred) among a number of actual computers. To meet this objective, the following preliminary tasks must be performed.

(1) Define a common intermediate language which would serve as the target language for translating base language statements and all high-level language programs. Both the PGC and the executor would be written in this intermediate language. In addition, this intermediate language should permit dictionary, DDL, identification file, function library, and data set definition and linkage.

(2) Create translators, one for each real machine configuration on which the system is implemented, which will generate machine code from the intermediate language. Alternatively, some machines might be microprogrammed to interpretively execute the intermediate language itself.
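In miniature, these two tasks might fit together as sketched below: base-language requests compile to a common intermediate form, and each real machine supplies either a translator or a microprogrammed interpreter for that form. The two-field operation records and the function names are purely hypothetical.

```python
# A miniature illustration of tasks (1) and (2): a common intermediate form,
# one per-machine translator, and an interpreter standing in for
# microprogrammed execution. Everything here is a stand-in, not a design.

intermediate_program = [
    ("CALL", "SORT MY FILE ON LAST NAME"),
    ("CALL", "PRINT NAME, ADDRESS OF MY FILE"),
]

def translate_for_machine_a(program):
    """Stand-in for a per-machine translator producing 'machine code'."""
    return [f"A-OP {op}: {arg}" for op, arg in program]

def interpret_directly(program):
    """Stand-in for microprogrammed interpretation of the intermediate form."""
    for op, arg in program:
        print(f"executing {op} {arg}")

print(translate_for_machine_a(intermediate_program))
interpret_directly(intermediate_program)
```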

There would be an unprecedented advantage gained by the achievement of machine independence in such a system. Namely, the user could freely transfer his work (programs, libraries, data sets) from one installation to another with a minimum of effort. One language, the base language, would be universally understood.

We should not, however, be too naive about the problems inherent here. Widespread agreement on the nature of the intermediate language will be difficult to achieve. Even if a common intermediate language could be achieved, how much degradation in performance, on a typical real machine, would its usage force? Would this degradation be tolerable in an application environment, in view of the system's advantages?

One factor would seem to offset the extent of such performance degradation. The system (specifically the PGC) decides what algorithm is used to service a particular user request. Thus, a request for a sort on machine A may result in an entirely different algorithm than that resulting from the same request on machine B. That is, the PGC takes into account the physical hardware available when generating an algorithm.

Stated in another way, the system may continually improve (e.g. by upgrading and extending the function library) its algorithmic repertoire, and thus the efficiency of a user job, in a way that is transparent to the user and the base language. This situation allows the system an immense potential for optimization, especially as new algorithms are discovered and new processing power (e.g. inter-instruction parallelism) becomes more widely available. Essentially, this is made possible by the base language's emphasis on encouraging the user to specify what is to be done rather than how it is to be done.


5. CONCLUSION

It has been the purpose of this report to examine a line of approach to very high-level language design which appears both feasible and highly attractive. Once designed and implemented, such a system should have a significant impact on the widespread problem of getting useful work out of an extremely capable, yet now grossly underutilized tool.

REFERENCES

1. M. P. Barnett, SNAP: A programming language for humanists, Computers and the Humanities 4, (4) 225-240 (1970).
2. Feature Analysis of Generalized Data Base Management Systems, CODASYL Systems Committee, ACM (1971).
3. P. Gilbert, On the syntax of algorithmic languages, J. Ass. comput. Mach. 13, 90-107 (1966).
4. Job Control Language Reference, SRL #GC28-6704, IBM (1970).
5. M. Klerer and J. May, A user-oriented programming language, AFIPS Proc. FJCC (1967).
6. N. Nie, D. Bent, C. Hull, Statistical Package for the Social Sciences, McGraw-Hill, New York (1970).
7. T. W. Olle, UL/1: A nonprocedural language for retrieving information from data bases, Proc. IFIP Congress (1968).
8. W. R. Schucany, P. Minton, B. Shannon, A survey of statistical packages, Computing Surveys 4, (2) 65-79 (1972).
9. A. B. Tucker, EASYSTAT: An easy-to-use statistics package, AFIPS Proc. NCC, 615-619 (1973).
10. A. van Wijngaarden (Ed.), Report on the Algorithmic Language ALGOL 68, Numerische Mathematik 14, 79-218 (1969).
11. W. A. Woods, Context-sensitive parsing, Communs Ass. comput. Mach. 13, 437-445 (1970).