
Transcript of "The Future of High Performance Computers in Science and Engineering"

3. The fastest computer in a particular, named price class with both the highest scalar and vector processing performance.

The three classes of supercomputers are also characterized by price: the Cray or "true" supercomputer at $1 million to $20 million; the mini-supercomputer at $100,000 to $1 million; and the graphics supercomputer at $100,000.


Livermore Loops or Linpack benchmark execution rates measured in floating point operations per second.

2. Ability to solve numerically intense problems. Such a computer is typically characterized using the

[11]:

1. The fastest machine with the most memory available.

supercomputer development follows the Cray formula: the fastest clock; a simple pipelined scalar instruction set; a very complex, pipelined vector processing instruction set; and multiprocessors. Software is evolving to automatically vectorize and parallelize so that the increase in the number of processors is often transparent to users.
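To make the vectorization point concrete, the sketch below (an illustration of mine, not code from the article) shows the kind of independent-iteration loop such compilers target; because no iteration depends on another, the same source can be issued to a pipelined vector unit or split across processors without change.

#include <stdio.h>

#define N 1000

/* DAXPY-style kernel: y[i] = a*x[i] + y[i].  Every iteration is
 * independent, which is what lets a compiler vectorize it for a
 * pipelined vector unit or parallelize it across processors. */
static void daxpy(int n, double a, const double *x, double *y)
{
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

int main(void)
{
    double x[N], y[N];
    for (int i = 0; i < N; i++) {
        x[i] = 1.0;
        y[i] = 2.0;
    }
    daxpy(N, 3.0, x, y);
    printf("y[0] = %g\n", y[0]);   /* prints 5 */
    return 0;
}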

The supercomputer has been defined three ways by supercomputer architects and engineers

Figure 2 shows various approaches to building high performance computers and the estimates of performance versus time.

MAINLINE SUPERCOMPUTERS FOR SCIENCE AND ENGINEERING
The main line of scientific and engineering

computers can achieve an increase in performance or performance/price of a factor of 10 or so.

• Applications-specific computers solve only one problem such as speech or image processing. By binding the applications in hardware and software at design and construction time, an increase by a factor of 100 to 10,000 in performance or performance/price is possible.

Figure 1 shows the alternative forms of the computer structures being considered in the preceding section.

(i.e., supercomputers). However, without the scientific base, creative talent, understanding and demanding users, and infrastructure to design complex VLSI chips, these innovative machines would not be possible. A variety of established and newly formed companies have organized to exploit the new technology. Table I lists the number of companies building high performance computers for control and artificial intelligence applications, and traditional supercomputers for scientific and engineering applications.

The impressive gains in performance and the large number of new companies demonstrate the effectiveness of the university-based research establishment and its ability to transfer technology to both established companies and start-up firms.

Three kinds of computers, distinguished by the way they are used, are emerging:

General purpose computers are multiprogrammed and can handle a variety of applications at the same time. General purpose computers include supers, mainframes, various minis, and workstations.

Run-time defined, applications-specific computers

are used on only a few problems, on a one-at-a-time basis. These mono-programmed computers are useful for a more limited class of problems where a high degree of data parallelism exists. This class of

DARPA's Strategic Computing Initiative (SCI). Other progress is a result of the evolution of understanding the technology of multiple vector-processing computers

The Future of High Performance Computers in Science and Engineering

A vast array of new, highly parallel machines are opening up new opportunities for new applications and new ways of computing.

Gordon Bell

Spurred by a number of innovations from both the industrial and academic research establishments made possible by VLSI and parallelism, we can expect the next generation of scientific and engineering computing to be the most diverse and exciting one yet. Some of the research accomplishments have been stimulated by

Guest Editors: John D. McGregor and Arthur M.

In this article, a supercomputer is defined as a computer that:

• Is used for numerically intense computations.
• Has the highest scalar and vector processing capacity with the appropriate primary and secondary memory resources.
• Falls into one of the supercomputer price classes

In 1989 personal supers costing between $15,000 and $50,000 will be introduced. Such machines are based on Intel's 80860 microprocessor, for example, which operates at 40 MHz and delivers roughly 10 megaflops on the Linpack benchmark.

TABLE I. Companies Building High Performance Computers and Supercomputers.

Kind of computer                  On market   Developing   Dead   Recent*
Supercomputers                    2           2?           3      5
Mainframes (with vector proc.)    5           ?            1      2
Mini-supers                       5           2?           4      10
Graphic supers                    2           ?            0      2
Total supers                      14          4            8      19
Superminis
RISC-based computers
Array processors
Massive data parallel
Multiprocessors
Multicomputers
Total (including supers)

* Started after 1983.

supercomputer time to the scientific community, it is unlikely that

FIGURE 3. Past, Present, and Future Projection of the Clock Speed, Number of Processors, and Performance of Seymour Cray-designed CDC and Cray Research Computers


supercomputers. However, since the central approach is so strongly subsidized by providing "free"

academe, and industry, it is worth looking at the range of opportunities for different styles of computing, from the highly centralized or regional supercomputer to a fully distributed computing environment using personal supercomputers. Just as distributed computing using LAN-interconnected workstations and personal computers has become an alternative to mainframes, fully distributed personal supercomputing is likely to play a similar role with respect to traditional

8-processor Cray YMP operating at 1.68 gigaflops (versus a peak of 2.7 gigaflops, or 5.5 times the uniprocessor rate) to solve a large statics problem using a finite element model.
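As a quick check on those figures (my arithmetic, not the article's), 1.68 gigaflops against a 2.7-gigaflop eight-processor peak is about 62 percent of peak, and a speedup of 5.5 on eight processors is a parallel efficiency of roughly 69 percent:

#include <stdio.h>

int main(void)
{
    double achieved = 1.68;  /* gigaflops reported for the 8-processor YMP run */
    double peak     = 2.7;   /* 8-processor peak, gigaflops */
    double speedup  = 5.5;   /* reported multiple of the uniprocessor rate */
    int    nproc    = 8;

    printf("fraction of peak:    %.0f%%\n", 100.0 * achieved / peak);
    printf("parallel efficiency: %.0f%%\n", 100.0 * speedup / nproc);
    return 0;
}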

DISTRIBUTED SUPERCOMPUTING
Given the interest in supercomputing in government,

Peyton [5], who used an

30-year period beginning with the CDC 6600.

Getting peak power on a single program will require the user to do some reprogramming to exploit the multiple processors. Each year one Gordon Bell Prize acknowledges the fastest program execution on a traditional supercomputer. In 1988 the prize went to the Long Range Global Weather Modelling Program at NCAR, which was parallelized manually and operates at over 400 megaflops (versus a peak of 840 megaflops) on the Cray XMP/416. Given the relatively small memory and the growth in application program size, researchers are beginning to use the XMP in parallel mode to take full advantage of its processors. In 1989, the award went to Vu, Simon, Ashcraft, Grimes, Lewis, and

2), to be followed in 1992 by the Cray 4. The Cray 4 operates at a clock rate of 1 ns and delivers 128 gigaflops, using 64 processors. Figure 3 shows the evolution of clock speed, number of processors, and aggregate computing power for Cray computers over a

22 gigaflops using four processors, each operating at 5.5 gigaflops, and a 2.9 ns clock.

In 1990, Cray Research plans to introduce the Cray 3 (16 processors at twice the speed of the Cray

[6], and is likely to remain so until the Japanese start building multiprocessors. In April 1989 NEC introduced the SX-3, which it believes will be the world's fastest computer. Scheduled for delivery in 1990, it is rated at

supercomputer

NEC's SX-2, Fujitsu's VP series, and Hitachi's S-820/80. These uniprocessors provide very fast single stream execution, but the Cray XMP still has more aggregate throughput. Through its parallelizing compiler, the Cray YMP is the world's fastest

supercomputers aimed at having the fastest uniprocessors as demonstrated by


(i.e., super, mini-super, graphics super, personal super).

• Can supply the computational resources (i.e., processing, primary and secondary memory, graphics, and networking) to a single user or set of users in one day that could be obtained using the largest shared supercomputer.

The Japanese computer industry's first

FIGURE 2. Current and Projected Operation Rates in either Instructions Per Second or Floating Point Operations Per Second for Microprocessors, Multiprocessors (mP), Multicomputers (mC), Supercomputers, and the Connection Machine (CM)

supercomputers for smaller applications with extensive 3-D graphics and access to central supercomputers via fast networks for running large community programs and accessing large databases.

Obviously, all of the benchmarks run longer on the slower machine. The stretch factor is the increased


Performance and Cost-Effectiveness
There appears to be no economy of scale across the supercomputer classes. Lower cost, higher production quantity components (processors, memory, peripherals, etc.) result in an inherent dis-economy of scale in performance/price for smaller-scale applications. This is demonstrated in a comparison between the Cray YMP and Ardent's Titan graphics supercomputer, both of which were introduced in 1988. On the other hand, having a large central facility is critical for certain large programs and databases.

Table II gives the purchase price, performance, and performance/price for several benchmarks run both sequentially and in parallel for the two approaches. Observe that for the purchase price of the YMP, one could have 166 graphics supercomputers in a highly distributed fashion for project, departmental, and even personal use. The comparison ignores operating costs, which in the case of a central computer are quite visible. In the distributed approach operations costs (e.g., file backup) are buried in the organization. Similarly the cost of support, purchasing, and maintaining software, and maintaining files may vary using the two approaches. Well-funded, large centers provide a complete and balanced set of facilities including large primary memories of gigawords as well as being able to handle tens or even hundreds of gigabyte files, and archival files for collaborative science. However, few centers operate at this level. The "right" environment for a user depends on need. Ideally, most users would have their own distributed, cost-effective personal

TABLE II. Comparison of Central Cray YMP and Distributed Titan Graphics Supercomputer

Factor                                         Cray YMP832   Titan 24
Price ($M)                                     20            0.12
Processors                                     8             2
Mwords of memory                               32            4
Dhrystones (integer-oriented)
  KDhrystones/s single processor               25            23
  Dhrystones/s/$ multiprogrammed               .005          0.383
Whetstones (scalar floating point)
  MWhetstones/s single processor               35            6.5
  Whetstones/s/$ multiprogrammed               14            108
Linpack (100 x 100)
  Mflops/s single processor                    79            6.5 (12)
  Flops/s/$ multiprogrammed                    31.6          108
  Mflops/s parallel                            195           9.4 (21)
  Flops/s/$ parallel                           9.8           78
Peak performance (1000 x 1000 Linpack, theoretical peak)
  Mflops/s Linpack                             2144          24 (89)
  Mflops/s/$ Linpack                           107           200
  Mflops/s peak op rate                        2667          32 (83)
  Mflops/s/$ peak op rate                      133           267
Millions of pixels rendered/s                  N/A           50

Note: The time stretch factor for the Titan relative to the Cray YMP is given in parentheses.

supercomputer power than all of the U.S. automotive industry. At least one U.S. research laboratory believes that a supercomputer is essential for recruiting.

supercomputing centers.

State and federal governments are very receptive to large centers. Several states are building supercomputer centers to serve their university systems. Similarly, many corporations believe a supercomputer is required for competitiveness. For example, Toyota has more

[2]. It resulted in the establishment of the NSF's Advanced Scientific Computing (ASC) program and its five National Computing Centers. The centers are now completely institutionalized into the NSF infrastructure at a budget level of approximately $60 million in FY1989. The centers and the 5 to 10 percent of NSF researchers who use them have made a very strong case for government-funded

Bardon and Curtis report had begun

supercomputers was increased by factors of 1.5, 6, and 3, respectively. New VAX computers with adequate power and performance/price were not introduced to compete with the Cray XMP. Only in 1984, after the XMP was already established, did mini-supercomputers with comparable performance/price emerge. By 1984 a strong national initiative based on the NSF

7xx-series computers were running scientific and engineering applications, and before the Cray XMP was introduced, only one Cray 1 was available at a university. The run-time for the VAX could be as much as a factor of 100 times that of the Cray, for a stretch factor of 100. The acquisition costs for processing, measured in millions of floating point operations per second per dollar, were about the same. The VAX allowed comparatively large programs to be run easily using its large virtual memory.

With the emergence of the Cray XMP in 1983, the performance, capacity, and cost-effectiveness of

1, the first modern supercomputer, appeared, university users and researchers were using a highly distributed approach using the VAX 780. Several thousand VAX

[7]. When the Cray


distributed supercomputing will evolve quite as rapidly.

Supercomputing Policy
With the entry of the Japanese into the supercomputer market, supercomputing has become an issue of national pride and a symbol of technology leadership. While Japan now builds the fastest uniprocessors, the multiprocessor approach provides more throughput, and with parallelization, more peak power. Table I indicates that a large number of new companies have started up to build high performance computers since 1983. Both Steve Chen, formerly of Cray Research, and Burton Smith, formerly of Denelcor, have started such companies. A recent report by the Office of Science and Technology Policy urges the adoption of a government initiative for high performance computing, including a National Research Network

$50,000 may


20-mips processors attached to a shared bus in a multi configuration that sells for under

[3]. Thousands of multis are now in operation, and will probably become the mainline for traditional time-shared computers and smaller workstations for the next decade. However, given the speed and simplicity of POPs, users may ask: Why bother with so much performance? By 1990 workstations with four to ten

multicomputers. Also, the micros can be used in a redundant fashion to increase reliability and build what is fundamentally a hardware fault-free computer.

Multiprocessors
The next generation, high-performance micros are designed for building multiple microprocessor computers, or "multis"

20 over a six-year period (1.65 per year).

USING LOW-COST, FAST RISC MICROS
The very fast CMOS and soon-to-come ECL microprocessors will compete with every other computer and be a good component for both multiprocessors and

POPs (plain old processors) using the RISC approach is given in Table III. Unlike the historical leading edge clock evolution, POP clock speed has evolved a factor of 5 in four years (1.5 per year) because the processor is on a single chip. Shifting to ECL can give an aggregate speed-up of a factor of

SPARC architecture. By adding an attached vector processor, users can see a very bright picture for scientific and engineering computation in workstations and simple computers.

The projected evolution of the leading edge one-chip

Prisma is building a uniprocessor, GaAs-based computer using the Sun

and multiprocessors.

14 percent per year. By the end of 1989, the performance of the RISC, one-chip microprocessor should surpass and remain ahead of any available minicomputer or mainframe for nearly every significant benchmark and computational workload. By using ECL gate arrays, it is relatively easy to build processors that operate at 200 MHz (5 ns clock) by 1990. One such company, Key Computer (recently purchased by Amdahl Corporation), is doing just this to build very fast

[4]. Future supercomputers may have embedded rendering hardware to provide both cost-effective and truly interactive graphics supercomputing.

FUELING HIGH PERFORMANCE COMPUTERS
The future of the conventional uniprocessor appears to be a simple RISC-based architecture enabling high speed execution and low cost, one-chip (i.e., microprocessor) implementation. Such a processor is able to track semiconductor process technology and is inherently faster than a multiple chip approach. Current microprocessor performance is comparable to that of the IBM 3090 mainframe, which has historically evolved at

T1 (1.5 megabits per second) are inadequate for applications requiring extensive computation and visualization such as molecular modelling, computational fluid dynamics, finite element modeling, and volume data visualization

parallelization to increase speed on a super in order to provide significantly more power than a user could get in a single day on a personal super.

Finally, using the graphics supercomputer, visualization is implicit in the system since each computer has a significant amount of power to render and display data. Modern supercomputing requires additional resources such as graphics supercomputers or high performance workstations just to handle the display of computed

data. Current networks operating at

20 for the distributed, dedicated approach mean that even large users can get about the same amount of work done as with a centralized approach. Very large projects using a Cray center get only a few processor hours per day, or about 1,000 hours per year, while large users get an hour a day, and average users an hour a week.

By using the peak speeds that are obtained only by running each of the processors at peak speed and in parallel, the difference in speed between the Cray YMP and the Titan is finally apparent. While the times stretch to almost 90 (i.e., to do an hour of computing on the YMP requires almost 90 hours on the Titan), the cost-effectiveness of the Titan still remains, but only by a factor of two. Thus, we see the importance of

megaflops/second, or the speed of the Linpack 100 x 100 prior to Cray's recent compiler improvements. Note that for a single processor, it takes 12 times longer to get the same amount of work done using the distributed approach. However, the distributed approach is almost three times more cost effective, or in principle, users spending the same amount could get three times as much computing done.
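The arithmetic behind these comparisons can be reproduced from Table II (a sketch of mine; it assumes the "multiprogrammed" rows mean all processors running independent jobs):

#include <stdio.h>

int main(void)
{
    /* Purchase prices ($M), processor counts, and single-processor
     * Linpack 100x100 rates (Mflops) as listed in Table II. */
    double price_ymp   = 20.0;  int np_ymp   = 8; double mflops_ymp   = 79.0;
    double price_titan = 0.12;  int np_titan = 2; double mflops_titan = 6.5;

    /* Stretch factor: how much longer one Titan processor takes. */
    printf("stretch factor: %.0f\n", mflops_ymp / mflops_titan);

    /* Multiprogrammed throughput per dollar (Mflops/$M = flops/s/$). */
    printf("YMP   flops/s/$: %.1f\n", np_ymp   * mflops_ymp   / price_ymp);
    printf("Titan flops/s/$: %.1f\n", np_titan * mflops_titan / price_titan);
    return 0;
}

The three printed values reproduce the stretch factor of 12 and the 31.6 versus 108 flops/second/dollar figures discussed above.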

By automatically parallelizing Linpack even for the small case, the Cray YMP runs about 2.5 times faster using the eight processors and has once again become the world's fastest computer. Since the small Linpack benchmark is too small to run efficiently in parallel, the cost-effectiveness of the approach decreases over a factor of three. Since the Titan has only two relatively slower processors, the effect of parallelization is not as great on cost-effectiveness. Stretch times of around 10 to

XMPs run at various large computer centers has in the past been equal to about 25

[8] where for scalar and vector loops, the Cray was faster by about a factor of 5 and 10, respectively. The Whetstone benchmark is indicative of such use. For simple integer-oriented benchmarks like those encountered in editing, compiling, and running operating systems, the YMP is ill-suited since it is about the speed of the Titan, indicating that while the YMP is still faster for utility programs, it is not cost-effective by over an order of magnitude.

At first glance, the small Linpack case seems irrelevant to supercomputing. However, the average speed at which the Cray


time that a program runs using the Titan as compared to run-time on the Cray YMP. Also associated with each benchmark is the cost-effectiveness or performance/price (e.g., megaflops/second/dollar) of the YMP versus the Titan. The range of results is comparable with an analysis of the Titan and the Cray XMP by Misaki and Lue at the Institute for Supercomputing Research

applica-


20 megabits/second, which are used to pass messages to other transputers, especially when they are interconnected in a grid. Several companies are building general purpose multicomputers by connecting a large number of transputers together. The transputer is proving especially useful in systems to build

hypercube is based on Inmos' Transputer computer node. A transputer is a processor with a small on-chip memory and four full-duplex interconnection ports operating at

multicomputer message passing system, but the peak performance (see Table IV) and price performance appears to be worth the effort, as a lab can have its own Cray for a particular problem.

The European multicomputer counterpart to the

Hypercube-connected computers now exist with 32-1,024 computers and are being manufactured by about a half dozen companies. Several hundred are currently in use. Programs have to be rewritten to use the

Caltech developed a large multicomputer structure known as the hypercube. By using commodity microprocessors, each with private memory to form a computer, and interconnecting them in a hypercube, grid, or switching network, each computer can pass messages to all other computers. Today's multicomputers provide substantially higher performance, using faster interconnection networks for message passing than their first generation ancestors, making them more widely applicable. For a particular application, a factor of 10 in performance/price over mainline supercomputers has been observed.
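The hypercube interconnection has a simple addressing property that message-passing programs rely on: in a d-dimensional cube of 2^d computers, the neighbors of a node are exactly the nodes whose binary addresses differ in one bit, and a message is routed by correcting one differing bit per hop. A small sketch (my illustration, not code for any of the machines named here):

#include <stdio.h>

/* Print the neighbors of `node` in a d-dimensional hypercube:
 * flipping each of the d address bits gives one neighbor per dimension. */
static void print_neighbors(unsigned node, int d)
{
    printf("node %u neighbors:", node);
    for (int bit = 0; bit < d; bit++)
        printf(" %u", node ^ (1u << bit));
    printf("\n");
}

int main(void)
{
    int d = 5;   /* 2^5 = 32 computers, the small end of the range above */
    for (unsigned node = 0; node < 4; node++)
        print_neighbors(node, d);
    return 0;
}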

Multicomputers are not generally applicable to all problems and are usually mono-programmed since they are used on only one program at a time.

1980s, Seitz and his colleagues at

1/20th its cost. The transition is predicated on the fact that the performance/price discontinuity will cause users to consider switching new and even existing programs to the new micro-based environment and away from mainframes and minis, which are then used as a "code museum" to run existing programs.

Hypercube-Connected Multicomputers
In the early

$500,000 have entered the market. Such a structure provides 2 to 4 times the power of IBM's largest mainframe at

100 Mbyte packet.


likely to occur in the traditional mainframe and minicomputer industries. Using this multiprocessor approach, several computers that execute programs at over 100 million instructions per second and cost less than

TABLE IV. Multicomputer Generations [1]

                         First      Second     Third
                         1983-87    1988-92    1993-97
Number of nodes*         64         256        1,024
Message time (us)        2K         5          0.5
Node:
  MIPS                   1          10         100
  Mflops scalar          0.1        2          40
  Mflops vector          10         40         200
  Memory (Mbytes)        0.5        4          32

* Typical system. Maximum system is roughly four times larger.

Cray-style multiprocessor designs for general purpose scientific and engineering computation. Other approaches to large multiprocessors do not have vector facilities, and hence may not be viable or performance/price competitive for highly vectorized applications, since automatic compilation or parallel constructs to use a large number of scalar processors don't appear to offer the speed of the vector approach for these applications. Furthermore, it is difficult to build very large multiprocessors as cheaply. Hence, multicomputers have been demonstrated and have gained utility for user-explicit, parallel processing of single problems at supercomputer speeds.

The most ambitious multiprocessor being introduced in 1990 for both scientific and transaction processing is by Kendall Square Research. Each 64-bit processor operates at 40 megaflops and has an associated primary memory of 32 megabytes and an 80 megabyte/second I/O channel. The combined system with 480 processors operates at 19.2 gigaflops with 16 gigabytes of memory.

Given the ease of building very high performance multiprocessors based on POPs, a major transition is

multiprogrammed and time-shared fashion. It has also been the object of training and research in parallel processing by the computer science community because it provides the most general purpose tool in that it can provide an environment for a number of computational models.
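A minimal sketch of the shared-memory, multi-tasking style of use described here, written with POSIX threads purely as an illustration (the library postdates most of the machines discussed): several tasks work on disjoint slices of a shared array and the partial results are combined at the end.

#include <pthread.h>
#include <stdio.h>

#define N       1000000
#define NTASKS  4

static double data[N];
static double partial[NTASKS];

/* Each task sums one contiguous slice of the shared array. */
static void *task(void *arg)
{
    long id = (long)arg;
    long lo = id * (N / NTASKS), hi = (id + 1) * (N / NTASKS);
    double s = 0.0;
    for (long i = lo; i < hi; i++)
        s += data[i];
    partial[id] = s;
    return NULL;
}

int main(void)
{
    pthread_t t[NTASKS];
    for (long i = 0; i < N; i++)
        data[i] = 1.0;

    for (long id = 0; id < NTASKS; id++)
        pthread_create(&t[id], NULL, task, (void *)id);

    double sum = 0.0;
    for (long id = 0; id < NTASKS; id++) {
        pthread_join(t[id], NULL);
        sum += partial[id];
    }
    printf("sum = %.0f\n", sum);   /* prints 1000000 */
    return 0;
}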

The Alliant and Convex mini-supercomputers, and Ardent and Stellar graphics supercomputers use

50-mips microprocessor will place much pressure on the viability of the multiprocessor.

The utility of the multiprocessor as a general purpose device is proven because it can be used in a

[6].

exist. With only a small incremental design effort, all computers can be implemented as multis; however, the

TABLE III. Past, Present, and Projected Clock and Performance for Leading Edge, One-Chip Microprocessors

Year          Clock (MHz)   PkMips(a)   Mflops(b)   Mflops with vector unit
1986          8             5           1           -
1987          16            10          2           16
1988          25            16          5           25
1989          40            25
(ECL shift)
1990          80            50          10          100
1992          160           100         20          200

(a) Mips for millions of VAX-equivalent instructions per second; VAX 11/780 = 1 million.
(b) Mflops for millions of floating point operations per second using the Linpack 100 x 100 benchmark.

Fortran in a transparent fashion. It is based on Alliant's FX-8. The prototype will likely operate in 1989. Future work is aimed at more and faster processors.


POPs, it is highly unlikely that any computer specialized for a particular language will be able to keep up.

University of Illinois Cedar Multiprocessor Project
Cedar is aimed at a multiprocessor with up to 32 processors in 4 clusters of 8 for executing

SPARC chips. Another group at Berkeley has produced a first generation Prolog machine that has outperformed the fastest Japanese special fifth generation machines. The next generation Prolog computer is a multiprocessor/multicomputer to exploit both fine grain and message passing for parallel processing. Given the rapid increase in speed of

Fortran decks" automatically. Cydrome, now deceased, built a similar product using ECL technology that provided higher performance and better performance/price using a similar architectural approach.

EXPERIMENTAL RESEARCH MACHINES TO WATCH

Berkeley and Stanford Multiprocessors
Both of these RISC-based, multiprocessor architectures are beginning to come into operation. So far, both have influenced commercial ventures both in RISC and in multiprocessors. The Stanford project was the prototype for the MIPS Co. chip design. The Berkeley chip designs were the precursor to Sun's

giga floating point operations per second. Thus the CM2 is the supercomputer for a number of applications. While the Connection Machine operates on one problem at a time in a mono-programmed fashion, it should be able to be multiprogrammed.
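The data-parallel, single-thread-of-control style can be suggested in ordinary C (my sketch; on the real machine the elements live one per processing element and the masked statement executes on all of them simultaneously):

#include <stdio.h>

#define N 65536   /* conceptually, one element per processing element */

int main(void)
{
    static float a[N], b[N];
    static int   active[N];   /* context mask: which elements participate */

    for (int i = 0; i < N; i++) {
        a[i] = (float)i;
        b[i] = 2.0f;
        active[i] = (i % 2 == 0);   /* e.g., operate on even elements only */
    }

    /* One logical statement, applied to every element under the mask. */
    for (int i = 0; i < N; i++)
        if (active[i])
            a[i] = a[i] * b[i] + 1.0f;

    printf("a[0]=%g a[1]=%g a[2]=%g\n", a[0], a[1], a[2]);   /* 1 1 5 */
    return 0;
}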

Evolution of the Array Processor
Multiflow Corp. was started up based on Fisher's work at Yale on a compiler to schedule a number of parallel (7 to 28) execution units using an extra wide instruction word. In fact, the Multiflow can be looked at as either a SIMD computer with a small number of processing elements, or an extension of the traditional array processor, such as the Floating Point Systems computers. Multiflow's first product runs the Linpack benchmark at mini-super speeds and costs half as much. One feature of this approach is that a compiler can exploit a substantial amount of parallelism in existing "dusty

1/2 gigabyte of primary memory, and operates at speeds up to 10

Hillis, which resulted in the establishment of Thinking Machines Corporation. Several machines are installed in a variety of applications that require parallel computation, including text searching, image processing, circuit and logic simulation, computational fluid dynamics, and finite element method computation. The current CM2 model is a uniprocessor with 64K processing elements. It has up to

1991 time frame. Using the chip, a small board could compute at roughly 100 megaflops. It would not be unreasonable to expect this component to sell for $10,000, or provide the user with 10,000 flops/second/dollar. While the initial product was developed for a special purpose, the ability to use the WARP for a range of applications is improving with better understanding of the compilation process.

Text Searching Machines
Several companies have built specialized text and database machines. Hollaar at Utah has built and field

tested machines for very high speed, large database text searches. The result to date is that inquiries are processed several hundred times faster than on existing mainframes, and the improvement increases with the

size of the database since pipelined searching hardware is added with each disk.

The Connection Machine
The Connection Machine grew out of a research effort at MIT by Danny

24-megaflop rate. Such a chip would be an ideal component in a PC for vector processing in the

iWARP, that is capable of operating at a

10-cell WARP operates at an average of 50 percent of peak for the problems of interest (e.g., speech and signal processing). This provides 50 megaflops for $350,000 (142 flops/second/dollar) and is available from General Electric. Intel is building a single systolic processing chip,

Kung's work at Carnegie Mellon University on systolic arrays is beginning to pay off and arrays are being applied to a variety of signal and image processing tasks in military applications. The

Systolic Processors

DARPA's Strategic Computing Initiative (SCI) or other research in basic computer science.

2K nodes that operate at 80 megaflops each is equally important.

NEW RESEARCH MACHINES
The following machines have emerged from

multicomputers. The DARPA-funded Intel multicomputer with

20-gigaflop supercomputers available in 1990 will be important in establishing

8K nodes operating at 2.4 megaflops) with power in the same range as the

NCUBE’s

power to solve unique problems. Current systems only approach supercomputer power while requiring users to reprogram: thus, users trade off lower operational costs for a higher initial investment.

Multicomputers can become an established structure only if they provide power that is not otherwise available and the performance/price is at least an order of magnitude cheaper than traditional supercomputers. The next generation of multicomputers (e.g.,

healthy, growing, profitable, and hence, sustaining businesses or by providing users with computational

15 present generation multicomputers. Users have undoubtedly invested a comparable amount in programming. It is unclear how successful these machines have been as measured either by establishing

communications to robots.

Over a quarter of a billion dollars has been spent to build the


tion-specific systems for everything from

rather than antagonistic.


20 megabits/second to connect with the 1,000 to 2,000 academic, industrial, and government research organizations. By making a system that could be used for both computers and communications, the two disciplines and industries could begin to become synergistic

7]. These networks are badly needed to interconnect the plethora of local area and campus area networks. Today, many campuses have installed networks with an aggregate switching need of over 100 megabits per second, which implies an off-campus traffic need of

[4,

(multi-gigabit/second) switch and network development and research are not occurring rapidly enough for computer networking. The networking dilemma is well-defined

supercomputers are capable of generating data at video rates. In order for humans to interpret data, it appears that the best way is to use directly coupled, high performance consoles. Two companies, Ardent and Stellar, introduced graphics supercomputers based on this principle. Traditional workstation companies are increasing their computational abilities. While supercomputers and mini-supercomputers currently rely on LAN-connected workstations, it is more likely that both structures will evolve to have direct video coupling.

Room Area and Campus Area Networks (RAN/CAN)
A RAN is needed to replace the various proprietary products and ad hoc schemes for interconnecting arrays of computers within the computer room and within systems involving high speed computation. At least three companies are building links and switches using proprietary protocols to operate in the gigabit range. An alternative to the Hyperchannel LAN, which operates at a peak of 50 megabits/second, is needed. The next generation must be a public standard. A combined fiber distributed data interface (FDDI) and a nonblocking, public standards-based switch that operates at 100 megabits/second using fiber optics seems like a necessary first step that could evolve over the next several years.
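To put the bandwidth numbers in perspective (my arithmetic, with a 1-gigabyte file chosen only as a round figure at the small end of the file sizes mentioned earlier): moving it over a 1.5 megabit/second T1 link takes about an hour and a half, while a 100 megabit/second FDDI-class link brings it down to under a minute and a half.

#include <stdio.h>

/* Transfer time in seconds for a file of `gigabytes` over a link of
 * `megabits_per_s`. */
static double seconds(double gigabytes, double megabits_per_s)
{
    double bits = gigabytes * 1e9 * 8.0;
    return bits / (megabits_per_s * 1e6);
}

int main(void)
{
    double gb = 1.0;
    printf("T1   (1.5 Mb/s): %.0f seconds (about %.1f hours)\n",
           seconds(gb, 1.5), seconds(gb, 1.5) / 3600.0);
    printf("FDDI (100 Mb/s): %.0f seconds\n", seconds(gb, 100.0));
    return 0;
}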

Wide Area Networks
Intermediate speed (45 megabit) and fiber optic

[9]. Today's

[10]. Such a system would be useful both as part of the memory hierarchy for the teraflop computer and as a database where high performance is demanded. The Connection Machine's data vault and Teradata's database computer are examples of what is possible in restructuring mass storage.

Visualization
In order to effectively couple to high performance computers, the scientific user community has, under NSF sponsorship, recommended a significant research and development program in visualization

50-megaflop computers connected via a large, central switch to pass messages among the computers.

Special Algorithm Attached Processors
Several computers for special applications in chemistry, computational fluid dynamics, genome sequencing, and astronomy have been built or are in development. Each provides factors of 100 to 1000 over comparable hardware using a general purpose approach.

COMPUTER TECHNOLOGY, RESEARCH, AND TRAINING NEEDS

Faster Circuitry
University research in high speed circuitry, interconnection, and packaging is virtually nonexistent. Very high speed processors require much better interconnection and packaging density.

Mass Storage
No radical improvements in size or speed are in progress to keep up with the processing developments that come from VLSI. A project that would couple 1,000

one-gigabyte drives in a parallel fashion to provide essentially random access to a terabyte of memory and use a variety of specialized architectures is feasible. A project at Berkeley is exploring this approach

teraflops using 32K

TF1 has a goal of achieving 1.5

Chromodynamics calculations and is a SIMD computer providing 11 gigaflops.

GF11 and TF1
GF11 was designed specifically for Quantum

DARPA’s SCI.

IBM Research

RP3, all explore the size and utility of large multiprocessors and provide over 1,000 mips, in sizes of 1,000 processors at 1 mips, 128 at 16 mips, and 512 at 2 mips, respectively. None have vector processing, and hence may not be used for mainline scientific and engineering applications requiring large numbers of floating point operations. However, the machines and automatic parallelizing compilers could provide sufficiently large amounts of power to attack new problems.

University of North Carolina's Pixel Planes
This machine is a scalable, highly parallel, SIMD architecture that provides the highest performance for a variety of graphics processing tasks such as solids rendering under varying and complex lighting situations.

AT&T's Speech and Signal Processor
A large number of signal processing computer chips arranged in a tree-structured multicomputer configuration provides over 250 gigaflops on 32-bit numbers. The machine fits in a rather small rack, and the resulting number of flops/second/dollar is nearly one million. The machine came, in part, from Columbia University's tree-structured multicomputer work and is part of

Ultramax, and IBM's

SCI projects, BBN's Monarch, Encore's


Very Large Multiprocessors
Three

1990s, which is likely to be over $50 million.

APPLICATIONS CAN EXPLOIT THE OPPORTUNITY
While it is difficult to predict how the vast increase in processing power will affect science and engineering generally, its impact in the following specific areas is clear. Given the current level of training, however, it is unclear how rapidly applications will evolve to exploit the enormous opportunity available in significantly faster and more cost-effective computers.

Mechanical Engineering
Computers are being used in the design of mechanical structures ranging from automobiles to spacecraft, and cover a range of activities from drafting to analyzing designs (e.g., crash simulation for cars). Designers can also render high quality images and show the objects in motion with video. The vast increase in power should provide mechanical engineers with computing power to enable a significant improvement in mechanical design. Under this design paradigm, every facet of product design, including the factory to produce the product, is possible without prototyping. Within the next decade the practice of mechanical engineering could be transformed, provided companies seize the opportunity. For starters, better product quality could result from the new technology, but the big impact comes from drastically reduced product gestation periods and the ability to have smaller organizations, which also contributes to product elegance and quality. A similar change in chip and digital systems design has occurred in the last decade, whereby almost any chip or system can be designed and built in about two years.

Biochemistry, Chemistry, and Materials
In molecular modeling and computational chemistry,


multiprogrammed, shared memory computer. The cost of such machines will be comparable to the supercomputer of the


such machines. The lack of multiprogramming ability today may dictate using these machines on only one or a few very large jobs at the same time, and hence making the cost per job quite high, thus diminishing the performance/price advantage. Based on current applications, it is likely that either approach will be more than 10 percent as efficient as a general purpose,


32K node multicomputer with 1.5 teraflops.
2. A Connection Machine with more than one teraflop

and several million processing elements.
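The two paths are consistent with the third-generation multicomputer figures in Table IV; a quick check of the arithmetic (mine, not the article's): 4K nodes at a 200-megaflop vector rate give roughly 0.8 teraflops, and 32K nodes at 50 megaflops give roughly 1.6 teraflops.

#include <stdio.h>

int main(void)
{
    /* Peak rate = nodes x per-node megaflops, expressed in teraflops. */
    printf("4K  nodes x 200 Mflops = %.2f teraflops\n", 4096.0  * 200.0 / 1e6);
    printf("32K nodes x  50 Mflops = %.2f teraflops\n", 32768.0 *  50.0 / 1e6);
    return 0;
}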

Both types of machine require reprogramming according to either the message passing or massively parallel data with a single thread of control programming model. However, to exploit the evolving supercomputers with 16 to 64 processors, users will have to do extensive reprogramming that will use a combination of micro- and multi-tasking and message passing.

Today’s secondary memories are hardly adequate for


1995. They are:

1. A 4K node multicomputer with 800 gigaflops peak or a

dataflow computer may ultimately require different computational models. In the long term, the models in Table V may not be the best or even adequate to express parallelism. For now, however, we should build on what we know, learn from experience, and research alternatives.

THE TERAFLOP COMPUTER BY 1995
Two relatively simple and sure paths exist for building a system that could deliver on the order of 1 teraflop by

pipelined array of systolic processors), neural networks, specialized SIMD computers, and the

dataflow language may be the best for expressing parallelism in ordinary computers. Again, given the rate of increase in POPs, it is unlikely that a specialized architecture will be able to keep up with the main line.

Neural Computing Networks
Various efforts aimed at highly parallel architectures that attempt to simulate the behavior of human processing structures continue to show interesting results.

Changing the Programming Paradigm
It is necessary to change the programming paradigm to get the highest performance from new computer structures. However, the variety of programming models really isn't very large, given the variety of what would appear to be different computers. Table V summarizes the main line of computational models and the corresponding near-term computer structures that support the model.

Other computer structures such as the WARP (a

dataflow computer that might outperform the largest supercomputer in problem domains with a high degree of parallelism and where vector computation techniques do not apply. Independent of the computer, a

Dataflow as an Alternative Computational Model
Arvind's group at MIT has progressed to the point where it is building a


TABLE V. Summary of Main Line Computational Models

Computation model                                Supporting computer structures
Vector processing                                Supercomputers (one processor)
Message-passing (coarse-medium grain)            Workstation clusters on a LAN
                                                 Multicomputers (e.g., hypercubes)
                                                 Shared-memory multiprocessors
Micro-/multi-tasking (fine grain)                Shared-memory multiprocessors
Massive data-parallel, single thread of control  SIMD (e.g., Connection Machine)
Dataflow                                         Special dataflow multicomputers, multiprocessors?, and SIMDs?

The address limit of 32 bits on most of today's computers begins to be a severe constraint for every configuration but multiple computers. For example, a solids data set could easily have an array of 1,000 x 1,000 x 1,000 elements, requiring a 33-bit address.
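The 33-bit figure follows directly if the elements are 8-byte double-precision words and the machine is byte addressed (an assumption of mine for the arithmetic): 10^9 elements occupy 8 x 10^9 bytes, which is just under 2^33.

#include <stdio.h>

int main(void)
{
    unsigned long long elements = 1000ULL * 1000ULL * 1000ULL;
    unsigned long long bytes    = elements * 8ULL;   /* 8-byte words assumed */

    int bits = 0;
    while ((1ULL << bits) < bytes)
        bits++;

    printf("%llu bytes need %d address bits\n", bytes, bits);   /* 33 */
    return 0;
}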


Memory Address Limits

autotasking compilers are evolving to take advantage of this in a transparent fashion to a limited degree. The bad news is that not all applications can be converted automatically. Users are not being trained to use such machines in an explicit fashion. No concerted effort is in place covering the spectrum from training to research. This will require a redirection of resources.

A ROLE FOR THE COMPUTER AND COMPUTATIONAL SCIENCE COMMUNITIES
The computer systems research and engineering community is to be congratulated on the incredible computer performance gains of the last five years. Now is the time for the rest of the computer science community to become involved in using and understanding


multicomputers since the computers don't share the same address space. The recent improvement in the Cray compiler to automatically parallelize, as indicated in the Linpack benchmarks, to bring it nearer the state of the art is an important development that will begin to allow general use and exploit the more rapid increase in available power through parallelism than through faster clocks.

The good news is that a vast array of new, highly parallel machines are becoming available and that

supercomputers such as the Crays. The shared memory, micro- and multi-tasking models used by multiprocessors, including the Cray, don't work on the

32-bit data is an excellent example of the gains possible by applications-specific hardware and software. Again, agencies are unlikely to risk building such specialized computers, nor does the community have the right combination of computer scientists, scientists, and engineers working to tackle the problem of programming in any general way.

For the ultimate performance, SIMD machines such as the Connection Machine appear feasible provided the problem is rich in data parallelism and users are willing to reformulate and reprogram their applications. NSF's research and computing directorates support only traditional supercomputing. Only now are agencies with the greatest need (DOD, DOE, HHS, and NASA) beginning to examine the Connection Machine for computation. The need is clear if my own thesis is correct about achieving the highest performance.

If we want peak speed from any computer, programs will have to be rewritten to some degree to operate in parallel. One model, using message passing where large amounts of data parallelism occur, will work on nearly all high performance computers including the Connection Machine, multicomputers, and multiprocessor

1/4 tera floating point operations per second on

cost-effective computers unless the prices are at workstation levels that can be supported by research grants. By not using small, dedicated computers that are under the direct control of the researchers who use them, users are deprived of a tool that should supply at least an order of magnitude more power than they now achieve

through a shared super. One could imagine that this kind of infusion of processing power, directed at particular experiments, could change the nature of science and engineering at least as much as the current ASC program.

While highly specialized computers offer the best performance/price and operate at supercomputer speeds, they cost more than workstations. AT&T's speech processor that carries out

5,000+ researchers (representing only a few percent of the scientific community) with an average of 1 hour of supercomputing time per week. Smaller supers such as mini-supers or graphics supers could easily supply researchers with the equivalent of 10 to 40 hours per week. Most agencies, however, have no means to support smaller, more


the design of molecules is now being done interactively with large-scale computers.

Large-Scale Scientific Experiments Based on Simulation
A dramatic increase in computational power could make a range of system simulations involving many bodies (e.g., galaxies, electron interaction at the atomic level) feasible. Nobel laureate Ken Wilson characterizes computer simulation as the third paradigm of science, supplementing theory and experimentation. This paradigm shift will transform every facet of science, engineering, and mathematics.

Animation
Large-scale computers can compute realistic scenes, providing an alternative to traditional techniques for filmmaking.

Image Processing
Various disciplines including radiology rely on the interpretation of high resolution photographs and other signal sources. By using high performance computers, the use of digital images and image processing is finally feasible. The use of satellite image data is transforming everything from military intelligence to urban geography.

Personal Computing
Today's large computers will continue to be used to explore applications that will be feasible for the PC. For example, Ardent's graphics supercomputer, Titan, which currently sells for about $100,000, is an excellent model of what will be available in 2001 for less than $6,000 (assuming a continued price decline at the rate of 20 percent per year). Every home will have an almost unlimited laboratory to conduct experiments.
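The 2001 figure is just the quoted 20 percent annual price decline compounded over twelve years; a quick check of the arithmetic (mine) gives about $6,900, the same order as the article's estimate.

#include <stdio.h>
#include <math.h>

int main(void)
{
    double price = 100000.0;       /* Titan-class machine in 1989 */
    double decline = 0.20;         /* 20 percent per year */
    int years = 2001 - 1989;

    printf("projected 2001 price: $%.0f\n", price * pow(1.0 - decline, years));
    return 0;
}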

ROLE OF THE FEDERAL GOVERNMENT
Dis-economy of scale continues to exist and favor small machines. This suggests that personal supercomputing, like its counterpart in ordinary computing, will migrate to a distributed approach provided decisions are made on some rational or economic basis. Large regional computers are difficult to access effectively using today's limited networks, especially for interactive visualization. NSF's Advanced Scientific Computing (ASC) program successfully supplies


the plethora of computers that can be applied to the endless frontier of computational science.

The computer science community can continue to ignore applications in computational science. Alternatively, it can learn about the various forms of parallelism supported by the evolution of mainline supercomputers and the onslaught of new, highly parallel machines by being involved with the applications, by using and understanding the new structures, by writing texts, training students, and carrying out research. For example, no current texts about programming are based on the mainline development of the supercomputer. Understanding may enable work on automatic programming systems to analyze and rewrite programs for these computational models. As a minimum, understanding will greatly facilitate exploiting both the evolution and the radically new machines.

REFERENCES
1. Athas, W.C., and Seitz, C.L. Multicomputers: Message-passing concurrent computers. IEEE Computer 21, 8 (Aug. 1988), 9-19.
2. Bardon, M., and Curtis, K. A national computing environment for academic research. NSF Working Group on Computers for Research, National Science Foundation, July 1983.
3. Bell, C.G. Multis: A new class of multiprocessor computers. Science 228 (Apr. 26, 1985), 462-467.
4. Bell, C.G. A national research network. IEEE Spectrum 25, 2 (Feb. 1988), 54-57.
5. Brown, J., Dongarra, J., Karp, A., et al. 1988 Gordon Bell Prize. IEEE Software 6, 3 (May 1989), 78-85.
6. Dongarra, J.J. Performance of various computers using standard linear equations software in a Fortran environment. Argonne National Laboratory Technical Memorandum 23 (Feb. 23, 1989).
7. Office of Science and Technology Policy. A Research and Development Strategy for High Performance Computing. (Nov. 1987).
8. Misaki, E., and Lue, K.M. Preliminary CPU performance study: The Ardent Titan and the Cray X-MP-1 on the Livermore Loops. Institute for Supercomputing Research 2, 1 (Aug. 1988), 8-11.
9. McCormick, B.H., DeFanti, T.A., and Brown, M.D., Eds. Visualization in scientific computing. Computer Graphics 21, 6 (Nov. 1987).
10. Patterson, D., Katz, R., and Gibson, G. A case for redundant arrays of inexpensive disks (RAID). In Proceedings of the ACM SIGMOD International Conference on the Management of Data (June 1988).
11. Perry, T.S., and Zorpette, G. Supercomputer experts predict expansive growth. IEEE Spectrum 26, 2 (Feb. 1989), 26-33.
12. Smith, N.P., Ed. Supercomputing Review 1, 4 (Apr. 1989).
13. Smith, N.P., Ed. Register of the Vector Supercomputer Review 1, 1 (1988).

CR Categories and Subject Descriptors: B.0 [Hardware]: General; C.0 [Computer Systems Organization]: General; C.1 [Computer Systems Organization]: Processor Architectures; C.3 [Computer Systems Organization]: Special-purpose and Applications-based Systems; C.4 [Computer Systems Organization]: Performance of Systems; C.5.1 [Computer System Implementation]: Large and Medium (Mainframe) Computers - super (very large) computers; J.2 [Computer Applications]: Physical Sciences and Engineering; K.1 [Computing Milieux]: The Computer Industry; K.3.2 [Computers and Education]: Computer and Information Science Education; K.4.1 [Computers and Society]: Public Policy Issues

General Terms: Design, Human Factors

Additional Key Words and Phrases: Circuit technology evolution, parallel processing, supercomputers and other high performance parallel computers

ABOUT THE AUTHOR:

G. GORDON BELL is Vice President of Research and Development at the Ardent Computer Company. His research interests include the contributions to and the understanding of the design of interesting, useful, and cost-effective computer systems, and the organizations that build them. Author's Present Address: Ardent Computer Company, 800 West Maude, Sunnyvale, CA 94086.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.