Transcript of “Supercomputers, Enterprise Servers, and Some Other Big Ones,” a lecture for CPSC 5155 by Edward Bosworth, Ph.D., Computer Science Department, Columbus State University.

Page 1:

Supercomputers, Enterprise Servers, and Some Other Big Ones
Lecture for CPSC 5155

Edward Bosworth, Ph.D.
Computer Science Department
Columbus State University

Page 2:

Topics for This Lecture

1. A discussion of the C.mmp, a parallel computer developed in the early 1970s.

2. An examination of enterprise computers, such as the IBM S/360 line.

3. More on the Cray-1 and similar vector processors.

4. A discussion of the “big ones”: the IBM Blue Gene and the Cray Jaguar.

Page 3:

Scalar System Performance

• Early scalar (not vector) computers were evaluated in terms of performance in MIPS (Millions of Instructions Per Second).

• The VAX-11/780 was rated at 1 MIPS by its developer, the Digital Equipment Corporation.

• Some claim the term stands for Meaningless Indicator of Performance (for) Salesmen.

• Despite the joke, the measure does roughly reflect system performance.
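As a quick sketch of how the metric is defined (a minimal Python example; the numbers are illustrative, not measurements from any machine discussed here):

```python
def mips(instruction_count, execution_time_seconds):
    """MIPS = instructions executed / (execution time in seconds * 10**6)."""
    return instruction_count / (execution_time_seconds * 1e6)

# A program retiring 5 billion instructions in 5,000 seconds
# rates at 1 MIPS, the VAX-11/780's nominal figure.
print(mips(5_000_000_000, 5_000))  # 1.0
```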

Page 4:

The C.mmp

• This was a multiprocessor system developed at Carnegie-Mellon University in the early 1970s.

• It was thoroughly documented by Wulf and Harbison in what has been fairly called “one of the most thorough and balanced research–project retrospectives … ever seen”.

• Remarkably, this paper gives a thorough description of the project’s failures.

Page 5:

The C.mmp Itself

• The C.mmp is described as “a multiprocessor composed of 16 PDP–11’s, 16 independent memory banks, [and] a cross-point [crossbar] switch which permits any processor to access any memory”. It includes an independent bus, called the “IP bus”, used to communicate control signals.

• As of 1978, the system included the following 16 processors:
  — 5 PDP–11/20’s, each rated at 0.20 MIPS (that is, 200,000 instructions per second)
  — 11 PDP–11/40’s, each rated at 0.40 MIPS

• The system had 3 megabytes of shared memory (650 nsec core and 300 nsec semiconductor).

• The system was observed to compute at 6 MIPS.
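A back-of-the-envelope check on those ratings (a minimal Python sketch; the per-processor ratings are from the slide, and the sum is a peak aggregate, not a measured value):

```python
# Peak aggregate of the rated speeds of the 16 processors.
pdp_11_20s = 5 * 0.20   # MIPS
pdp_11_40s = 11 * 0.40  # MIPS
print(pdp_11_20s + pdp_11_40s)  # 5.4 MIPS, roughly consistent with
                                # the ~6 MIPS observed for the system
```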

Page 6:

The Design Goals of the C.mmp

• The goal of the project seems to have been the construction of a simple system using as many commercially available components as possible.

• The C.mmp was intended to be a research project not only in distributed processors, but also in distributed software. The native operating system designed for the C.mmp was called “Hydra”. It was designed as an OS kernel, providing only minimal services and encouraging experimentation in system software.

• As of 1978, the software developed on top of the Hydra kernel included file systems, directory systems, schedulers, and a number of language processors.

Page 7:

The C.mmp: Lessons Learned

• The use of two variants of the PDP–11 was considered a mistake, as it complicated the process of making the necessary processor and operating system modifications.

• The authors had used newer variants of the PDP–11 in order to gain speed, but concluded that “It would have been better to have had a single processor model, regardless of speed”.

Page 8:

The C.mmp: More Lessons Learned

• The critical component was expected to be the crossbar switch. Experience showed the switch to be “very reliable, and fast enough”. Early expectations that the “raw speed” of the switch would be important were not supported by experience.

• The authors concluded that “most applications are sped up by decomposing their algorithms to use the multiprocessor structure, not by executing on processors with short memory access times”.
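The point generalizes: for a fixed problem, the fraction of work that can be decomposed across processors, not raw memory speed, bounds the achievable speedup. A minimal sketch of Amdahl’s law (a standard model, not taken from the C.mmp report) makes this concrete:

```python
def amdahl_speedup(parallel_fraction, n_processors):
    """Amdahl's law: speedup = 1 / ((1 - p) + p / n)."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / n_processors)

# With 16 processors (as on the C.mmp), a 90%-parallel algorithm
# speeds up about 6.4x; a 50%-parallel one only about 1.9x.
for p in (0.5, 0.9):
    print(f"p = {p}: speedup = {amdahl_speedup(p, 16):.1f}x")
```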

Page 9:

Still More Lessons Learned

1. “Hardware (un)reliability was our largest day–to–day disappointment … The aggregate mean–time–between–failure (MTBF) of C.mmp/Hydra fluctuated between two to six hours.”

2. “About two–thirds of the failures were directly attributable to hardware problems. There is insufficient fault detection built into the hardware.”

3. “We found the PDP–11 UNIBUS to be especially noisy and error–prone.”

4. “The crosspoint [crossbar] switch is too trusting of other components; it can be hung by malfunctioning memories or processors.”

Page 10:

Another Set of Lessons

• My favorite lesson learned is summarized in the following two paragraphs from the report.

• “We made a serious error in not writing good diagnostics for the hardware. The software developers should have written such programs for the hardware.”

• “In our experience, diagnostics written by the hardware group often did not test components under the type of load generated by Hydra, resulting in much finger–pointing between groups.”

Page 11:

Enterprise Servers & Supercomputers

• There are two categories of large computers, based on intended use.

• Enterprise servers handle simple problems, usually commercial transactions, but at a very large transaction volume.

• Supercomputers handle complex problems, usually scientific simulations, and work on only a few problems at a time.

Page 12:

The IBM S/360 Evolves

• The IBM S/360, introduced in April 1964, was the first of a line of compatible enterprise servers.

• At left is a modern variant, possibly a z11.

• All computers in this line were specialized for commercial work.

Page 13:

The z196: A “Cloud in a Box”

• Part of the title quotes IBM sales material.

• According to IBM, the z196 is the high-end server and the “flagship of the IBM systems portfolio.”

• Some interesting features of the z196:
  1. It contains 96 processor cores running at 5.2 GHz, packaged on quad-core chips.
  2. It can execute 52 billion instructions per second.
  3. The main design goal is “Zero Down Time,” so the system has a lot of redundancy.

Page 14:

A Mainframe SMP: The IBM zSeries

• Configurations range from a uniprocessor with one main memory card to a high-end system with 48 processors and 8 memory cards.

• Dual-core processor chip
  — Each includes two identical central processors (CPs)
  — CISC superscalar microprocessor
  — Mostly hardwired, some vertical microcode
  — 256-kB L1 instruction cache and a 256-kB L1 data cache

• L2 cache, 32 MB
  — Clusters of five
  — Each cluster supports eight processors and access to the entire main memory space

• System control element (SCE)
  — Arbitrates system communication
  — Maintains cache coherence

• Main store control (MSC)
  — Interconnects the L2 caches and main memory

• Memory card
  — Each holds 32 GB; maximum of 8 cards, for a total of 256 GB
  — Interconnects to the MSC via synchronous memory interfaces (SMIs)

• Memory bus adapter (MBA)
  — Interface to I/O channels; I/O traffic goes directly to the L2 cache

Page 15:

IBM z990 Multiprocessor Structure

Page 16:

The Cray Series of Supercomputers

• Note the design. In 1976, the magazine Computerworld called the Cray–1 “the world’s most expensive love seat”.

Page 17:

FLOPS: MFLOPS to Petaflops

• The workload for a modern supercomputer focuses mostly on floating-point arithmetic.

• As a result, all supercomputers are rated in terms of FLOPS (Floating-Point Operations Per Second). The names scale by factors of 1,000: MFLOPS, GFLOPS, TFLOPS, and PFLOPS (petaflops).

• Today’s high-end machines rate in the range from 1 to 20 petaflops, with more on the way.
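A small Python sketch of the unit ladder (the Jaguar figure used in the example comes from a later slide):

```python
# Each named rate is 1,000x the previous one.
UNITS = {"FLOPS": 1e0, "MFLOPS": 1e6, "GFLOPS": 1e9,
         "TFLOPS": 1e12, "PFLOPS": 1e15}

def convert(value, from_unit, to_unit):
    """Convert a floating-point rate between named units."""
    return value * UNITS[from_unit] / UNITS[to_unit]

# The Jaguar's 1,639 TFLOPS is about 1.6 petaflops.
print(convert(1_639, "TFLOPS", "PFLOPS"))  # 1.639
```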

Page 18:

History of Seymour Cray and His Companies

• Seymour Cray is universally regarded as the “father of the supercomputer”. There are no other claimants to this title.

• Cray began work at Control Data Corporation soon after its founding in 1957 and remained there until 1972. At CDC, he designed the CDC 1604, CDC 6600, and CDC 7600.

• The CDC 6600 is considered the first RISC machine, though Cray did not use the term.

Page 19:

Cray’s Algorithm for Buying Cars

1. Go to the nearest auto dealership.
2. Look at the car closest to the entrance.
3. Offer to pay full price for that car.
4. Drive that car off the lot.
5. Return to designing fast computers.

Page 20:

More History

• Cray left Control Data Corporation in 1972 to found Cray Research, based in Chippewa Falls, Wisconsin. The Cray–1 was delivered in 1976. This led to a bidding war, with Los Alamos National Laboratory paying more than list price.

• In 1989, Cray left the company to found the Cray Computer Corporation. His goal was to spend more time on research.

• Seymour Cray died on October 5, 1996.

Page 21:

More Vector Computers

• The successful introduction of the Cray-1 ensured the cash flow for Cray’s company and allowed future work along two lines:
  1. Research and development on the Cray–2.
  2. Production of a line of computers derived from the Cray–1, called the X–MP, Y–MP, etc.

• The X–MP was introduced in 1982. It was a dual-processor computer with a 9.5 nanosecond (105 MHz) clock and 16 to 128 megawords of static RAM main memory.

• The Y–MP was introduced in 1988, with up to eight processors that used VLSI chips. It had a 32-bit address space, with up to 64 megawords of static RAM main memory.

Page 22:

The Cray-2

• While his assistant, Steve Chen, oversaw the production of the commercially successful X–MP and Y–MP series, Seymour Cray pursued his development of the Cray–2.

• The original intent was to build the VLSI chips from gallium arsenide (GaAs), which would allow much faster circuitry. However, the technology for manufacturing GaAs chips was not then mature enough for mass production; GaAs VLSI chips were still not available in 1985.

• The Cray–2 was a four-processor computer with 64 to 512 megawords (of 64-bit words) of 128-way interleaved DRAM memory. The computer was built very small in order to be very fast; as a result, the circuit boards were built as very compact stacked cards.

Page 23:

Cooling the Cray-2

• Due to the card density, it was not possible to use air cooling. The entire system was immersed in a tank of Fluorinert™, an inert liquid intended to be a blood substitute.

• When introduced in 1985, the Cray–2 was not significantly faster than the X–MP.

Page 24:

The Cray-2 System

Page 25:

Two Views of a Cray-2 System

Page 26:

The Cray-3 and Cray-4

• The Cray–3, a 16-processor system, was announced in 1993 but never delivered.

• The Cray–4, a smaller version of the Cray–3 with a 1 GHz clock, was ended when the Cray Computer Corporation went bankrupt in 1995.

• In 1993, Cray Research moved away from pure vector processors, producing its first massively parallel processing (MPP) system, the Cray T3D™.

• Cray Research merged with SGI (Silicon Graphics, Inc.) in February 1996. It was spun off as a separate business unit in August 1999. In March 2000, Cray Research was merged with the Tera Computer Company to form Cray, Inc.

• Cray, Inc. is going strong today (as of Summer 2012).

Page 27:

MPP and the AMD Opteron

• Beginning in the 1990s, there was a move to build supercomputers as MPP (Massively Parallel Processor) machines.

• These would have tens of thousands of stock commercial processors.

• The AMD Opteron quickly became the favorite chip, as it offered a few features not found in the Intel Pentium line.

Page 28:

The Opteron

• The AMD Opteron is a 64–bit processor that can operate in three modes.

1. In legacy mode, the Opteron runs standard Pentium binary programs unmodified.

2. In compatibility mode, the operating system runs in full 64-bit mode, but applications must run in 32-bit mode.

3. In 64-bit mode, all programs can issue 64-bit addresses; both 32-bit and 64-bit programs can run simultaneously in this mode.
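As a tiny illustration of the 32-bit/64-bit distinction these modes expose to software, the Python sketch below reports the word size of the running process (it is not Opteron-specific):

```python
import struct

# A 32-bit process (legacy or compatibility mode) uses 4-byte pointers;
# a process running in 64-bit mode uses 8-byte pointers.
pointer_bytes = struct.calcsize("P")
print(f"This interpreter is a {pointer_bytes * 8}-bit binary.")
```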

Page 29:

The Cray XT-5

• Introduced in 2007, this system is built from about 60,000 quad-core AMD Opteron™ processors.

Page 30:

The Jaguar at Oak Ridge, TN

Page 31:

Some Comments on the Jaguar

• On July 30, 2008, the NCCS took delivery of the first 16 of 200 cabinets of an XT5 upgrade to Jaguar, which ultimately took the system to 1,639 TF with 362 TB of high-speed memory and over 10,000 TB of disk space.

• The final cabinets were delivered on September 17, 2008. Twelve days later, on September 29, this incredibly large and complex system ran a full-system benchmark application that took two and one-half hours to complete.

Page 32:

Details on the Jaguar

                       XT5       XT4      Total
Cabinets               200        84        284
Quad-Core Opterons  37,376     7,832     45,208
Cores              149,504    31,328    180,832
Peak TFLOPS          1,375       263      1,639
Memory              300 TB     62 TB     362 TB
Disk             10,000 TB    750 TB  10,750 TB
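A quick consistency check of the table in Python (the one-unit slack in the TFLOPS row is presumably rounding in the source figures):

```python
# XT5 + XT4 should equal the Total column, and Cores = 4 x Opterons.
rows = {
    "Cabinets":           (200, 84, 284),
    "Quad-Core Opterons": (37_376, 7_832, 45_208),
    "Cores":              (149_504, 31_328, 180_832),
    "Peak TFLOPS":        (1_375, 263, 1_639),  # 1,375 + 263 = 1,638
}
for name, (xt5, xt4, total) in rows.items():
    ok = (xt5 + xt4 == total)
    print(f"{name}: {'ok' if ok else 'off by ' + str(total - xt5 - xt4)}")
print("Cores = 4 x Opterons:", 180_832 == 4 * 45_208)  # True
```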

Page 33:


Why Worry About Power? The Oak Ridge Jaguar

• Currently the highest-performance supercomputer:
  – 1.3 sustained petaflops (quadrillion floating-point operations per second)
  – 45,000 processors, each a quad-core AMD Opteron: 180,000 cores!
  – 362 terabytes of memory; 10 petabytes of disk space
  – Check top500.org for a list of the most powerful supercomputers

• Power consumption (without cooling)?
  – 7 megawatts!
  – About 0.75 million USD per month to power
  – There is a green500.org that rates computers based on flops/watt
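A rough check on that monthly figure (the electricity rate below is an assumed illustrative value, not from the slide):

```python
# Sanity check on the quoted ~$0.75M/month power bill.
power_kw = 7_000              # 7 MW draw, excluding cooling
hours_per_month = 730         # about 24 * 365 / 12
usd_per_kwh = 0.147           # assumed industrial rate (illustrative)

monthly_cost = power_kw * hours_per_month * usd_per_kwh
print(f"${monthly_cost:,.0f} per month")  # about $751,000
```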

Page 34:

The IBM Blue Gene/L

• The Blue Gene system was designed in 1999 as “a massively parallel supercomputer for solving computationally–intensive problems in, among other fields, the life sciences”.

• The Blue Gene/L was the first model built; it was shipped to Lawrence Livermore National Laboratory in June 2003. A quarter-scale model, with 16,384 processors, became operational in November 2004 and achieved a computational speed of 71 teraflops.

Page 35:

A Blue Gene Node

Page 36:

Blue Gene/L Packaging

• 2 nodes per compute card.
• 16 compute cards per node board.
• 16 node boards per 512-node midplane.
• Two midplanes in a 1024-node rack.
• 64 racks.
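Multiplying out the hierarchy in Python confirms the per-midplane and per-rack counts given above and yields the full-system node total:

```python
# Blue Gene/L packaging hierarchy, multiplied out.
nodes_per_card, cards_per_board, boards_per_midplane = 2, 16, 16
midplanes_per_rack, racks = 2, 64

nodes_per_midplane = nodes_per_card * cards_per_board * boards_per_midplane
print(nodes_per_midplane)                               # 512
print(nodes_per_midplane * midplanes_per_rack)          # 1,024 per rack
print(nodes_per_midplane * midplanes_per_rack * racks)  # 65,536 nodes total
```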

Page 37:

Introduction

Page 38:

Compute Card

Page 39:

Node Card

Page 40:

Notes on Installing the Blue Gene

• Because the Blue Gene/Q is so unusual, there are several atypical requirements for the data center that will house the unit.

1. Power. Each Blue Gene/Q rack has a maximum input power consumption of 106 kVA. Therefore, the required electrical infrastructure is much larger than that for typical equipment.

2. Cooling. Because of the large load, each rack is cooled by air and water. The building facility water must support the Blue Gene/Q cooling system. A raised floor is not a requirement for air cooling the Blue Gene/Q.

3. Size. The Blue Gene/Q racks are large and heavy. Appropriate actions must be taken before, during, and after installation to ensure the safety of the personnel and the Blue Gene/Q equipment.

Page 41:

IBM’s Comments

• Roadrunner is the first general-purpose computer system to reach the petaflop milestone. On June 10, 2008, IBM announced that this supercomputer had sustained a record-breaking petaflop, or 10^15 floating-point operations per second.

• Roadrunner was designed, manufactured, and tested at the IBM facility in Rochester, Minnesota. The actual initial petaflop run was done in Poughkeepsie, New York. Its final destination is the Los Alamos National Laboratory (LANL).

• Most notably, Roadrunner is the latest tool used by the National Nuclear Security Administration (NNSA) to ensure the safety and reliability of the US nuclear weapons stockpile.

Page 42:

Roadrunner Breaks the Pflop/s Barrier

• 1,026 Tflop/s on LINPACK reported on June 9, 2008

• 6,948 dual-core Opterons + 12,960 Cell BE processors

• 80 TByte of memory

• IBM built, installed at LANL