Report on Blue Gene

Page 1: Report on Blue Gene

8/2/2019 Report on Blue Gene 1/22

Abstract on BLUE GENE:

Blue Gene is a computer architecture project designed to produce

several next-generation supercomputers. They are designed to reach operating speeds

in the petaflops range and currently reach sustained speeds of over 360 teraflops.

There are four Blue Gene projects in development: Blue Gene/L, Blue

Gene/C, Blue Gene/P, and Blue Gene/Q. On June 26, 2007, IBM unveiled

Blue Gene/P, the second generation of the Blue Gene supercomputer.


Supercomputer:

A supercomputer is a computer that performs at or near the currently

highest operational rate for computers. A supercomputer is typically used for

scientific and engineering applications that must handle very large databases

or do a great amount of computation (or both).

At any given time, there are usually a few well-publicized

supercomputers that operate at extremely high speeds. The term is also

sometimes applied to far slower (but still impressively fast) computers. Most

supercomputers are really multiple computers that perform parallel

processing. In general, there are two parallel processing approaches:

symmetric multiprocessing (SMP) and massively parallel processing (MPP).

IBM's Roadrunner is the fastest supercomputer in the world, twice as fast

as Blue Gene and six times as fast as any of the other current

supercomputers. At the lower end of supercomputing, a new trend called

clustering takes more of a build-it-yourself approach to supercomputing.

The Beowulf Project offers guidance on how to put together a number of 

off-the-shelf personal computer processors, using Linux operating systems,

and interconnecting the processors with Fast Ethernet.

Applications must be written to manage the parallel processing.

Perhaps the best-known builder of supercomputers has been Cray

Research, which was for a time a part of Silicon Graphics. In September 2008, Cray and

Microsoft launched CX1, a $25,000 personal supercomputer aimed at markets

such as aerospace, automotive, academic, financial services and life

sciences. CX1 runs Windows HPC (High Performance Computing) Server.


In the United States, some supercomputer centers are interconnected

on an Internet backbone known as vBNS or NSFNet. This network is the

foundation for an evolving network infrastructure known as the National

Technology Grid. Internet2 is a university-led project that is part of this


The system also reflects breakthroughs in energy efficiency. With the

creation of Blue Gene, IBM dramatically shrank the physical size and

energy needs of a computing system whose processing speed would have

required a dedicated power plant capable of generating power for thousands

of homes.

The influence of the Blue Gene supercomputer's energy-efficient

design and computing model can be seen today across the Information

Technology industry. Today, 18 of the top 20 most energy efficient

supercomputers in the world are built on IBM high performance computing

technology, according to the latest Supercomputing 'Green500 List'

announced by

Blue Gene - History:

On September 29, 2004, IBM announced that a Blue Gene/L

prototype at IBM Rochester (Minnesota) had overtaken NEC's Earth

Simulator as the fastest computer in the world, with a speed of 36.01

TFLOPS on the Linpack benchmark, beating Earth Simulator's 35.86

TFLOPS. This was achieved with an 8-cabinet system, with each cabinet

holding 1,024 compute nodes. Upon doubling this configuration, the

machine reached a speed of 70.72 TFLOPS by November.

On March 24, 2005, the US Department of Energy announced that the

Blue Gene/L installation at LLNL broke its current world speed record,

reaching 135.5 TFLOPS. This feat was possible because of doubling the

number of cabinets to 32.

On the June 2005 Top500 list, Blue Gene/L installations across

several sites world-wide took 5 out of the 10 top positions, and 16 out of the

top 64.

On October 27, 2005, LLNL and IBM announced that Blue Gene/L

had once again broken its current world speed record, reaching 280.6

TFLOPS, upon reaching its final configuration of 65,536 "Compute Nodes"

(i.e., 2^16 nodes) and an additional 1024 "IO nodes" in 64 air-cooled

cabinets.


BlueGene/L is also the first supercomputer ever to run over 100

TFLOPS sustained on a real world application, namely a three-dimensional

molecular dynamics code (ddcMD), simulating solidification (nucleation and

growth processes) of molten metal under high pressure and temperature

conditions. This won the 2005 Gordon Bell Prize.

Blue Gene:

A Blue Gene/P supercomputer at Argonne National Laboratory

Blue Gene is a computer architecture project designed to produce

several supercomputers, designed to reach operating speeds in the PFLOPS

(petaFLOPS) range, and currently reaching sustained speeds of nearly 500

TFLOPS (teraFLOPS). It is a cooperative project among IBM (particularly

IBM Rochester and the Thomas J. Watson Research Center), the Lawrence

Livermore National Laboratory, the United States Department of Energy

(which is partially funding the project), and academia. There are four Blue

Gene projects in development: Blue Gene/L, Blue Gene/C, Blue Gene/P, and

Blue Gene/Q.

On September 18, 2009, it was announced that the project would receive the

National Medal of Technology and Innovation. U.S. President Barack Obama

bestowed the award on October 7, 2009.

Overall Organization

The basic building block of Blue Gene/L is a custom system-on-a-chip that

integrates processors, memory and communications logic in the same piece

of silicon. The BG/L chip contains two standard 32-bit embedded PowerPC

440 cores, each with private 32KB L1 instruction and 32KB L1 data caches. The L2

caches act as prefetch buffers for the L3 cache.

Each core drives a custom 128-bit double FPU that can perform four double

precision floating-point operations per cycle. This custom FPU consists of 

two conventional FPUs joined together, each having a 64-bit register file

with 32 registers. One of the conventional FPUs (the primary side) is

compatible with the standard PowerPC floating-point instruction set. In most

scenarios, only one of the 440 cores is dedicated to run user applications

while the second processor drives the networks. At a target speed of 700

MHz the peak performance of a node is 2.8 GFlop/s. When both cores and

FPUs in a chip are used, the peak performance per node is 5.6 GFlop/s. The L1

caches of the two cores are not coherent with one another. To overcome this

limitation, BG/L provides a variety of synchronization devices in the chip:

lockbox, shared SRAM, L3 scratchpad and the blind device. The lockbox unit

contains a limited number of memory locations for fast atomic test-and-set

operations and barriers. 16 KB of SRAM in the chip can be

used to exchange data between the cores and regions of the EDRAM L3

cache can be reserved as an addressable scratchpad. The blind device

permits explicit cache management.
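
The peak figures quoted in this section follow from the 700 MHz clock and the four double-precision operations per cycle that each double FPU can perform. The short sketch below only reproduces that arithmetic; the function and constant names are illustrative and are not part of any Blue Gene software.

```python
# Peak floating-point rate of a BG/L node, derived from figures in the text.
CLOCK_HZ = 700e6       # target clock speed: 700 MHz
FLOPS_PER_CYCLE = 4    # one double FPU: four double-precision ops per cycle


def node_peak_gflops(cores_computing: int) -> float:
    """Peak GFLOPS when `cores_computing` cores (1 or 2) run user code."""
    if cores_computing not in (1, 2):
        raise ValueError("a BG/L node has exactly two cores")
    return cores_computing * CLOCK_HZ * FLOPS_PER_CYCLE / 1e9


# One core computes while the other drives the networks.
print(node_peak_gflops(1))
# Both cores and both double FPUs compute.
print(node_peak_gflops(2))
```

The two calls correspond to the two cases described above: one core dedicated to the application, or both cores used for computation.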

The low power characteristics of Blue Gene/L permit very dense

packaging, as described in [1]. Two nodes share a compute card that also

contains SDRAM-DDR memory. Each node supports a maximum of 2 GB

external memory but in the current configuration each node directly

addresses 256MB at 5.5 GB/s bandwidth with a 75-cycle latency. Sixteen

compute cards can be plugged into a node board. A cabinet with two

midplanes contains 32 node boards for a total of 2048 CPUs and a peak

performance of 2.9/5.7 TFlops.

The complete system has 64 cabinets and 16 TB of memory. In addition to

the 64K-compute nodes, BG/L contains a number of I/O nodes (1024 in the

current design). Compute nodes and I/O nodes are physically identical

although I/O nodes are likely to contain more memory.
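
The packaging numbers in this section can be cross-checked with a few multiplications. The sketch below uses only counts quoted in the text; the variable names are made up for illustration, and memory is converted with 1 TB = 1024 x 1024 MB, which is an assumption about how the 16 TB figure was rounded.

```python
# Packaging arithmetic for BG/L, using only counts quoted in the text.
NODES_PER_COMPUTE_CARD = 2
COMPUTE_CARDS_PER_NODE_BOARD = 16
NODE_BOARDS_PER_CABINET = 32
CABINETS = 64
CPUS_PER_NODE = 2
MEMORY_PER_NODE_MB = 256        # memory addressed per node in the current configuration
NODE_PEAK_GFLOPS = (2.8, 5.6)   # one core computing / both cores computing

nodes_per_cabinet = (NODES_PER_COMPUTE_CARD
                     * COMPUTE_CARDS_PER_NODE_BOARD
                     * NODE_BOARDS_PER_CABINET)
cpus_per_cabinet = nodes_per_cabinet * CPUS_PER_NODE
cabinet_peak_tflops = tuple(nodes_per_cabinet * g / 1000 for g in NODE_PEAK_GFLOPS)
compute_nodes = nodes_per_cabinet * CABINETS
total_memory_tb = compute_nodes * MEMORY_PER_NODE_MB / (1024 * 1024)

print("nodes per cabinet:", nodes_per_cabinet)
print("CPUs per cabinet:", cpus_per_cabinet)
print("cabinet peak (TFlops):", cabinet_peak_tflops)
print("compute nodes in the full system:", compute_nodes)
print("total memory (TB):", total_memory_tb)
```

The per-cabinet peak values round to the 2.9/5.7 TFlops quoted above, and the node count matches the 64K compute nodes of the complete system.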

Networks and communication hardware

The BG/L ASIC supports five different networks:

  Torus.

  Tree (collective).

  Ethernet.

  JTAG (control and diagnostics).

  Global interrupts.

Introduction to Blue Gene/L:

Blue Gene/L

The first computer in the Blue Gene series, Blue Gene/L, developed

through a partnership with Lawrence Livermore National Laboratory

(LLNL), originally had a theoretical peak performance of 360 TFLOPS, and

scored over 280 TFLOPS sustained on the Linpack benchmark. After an

upgrade in 2007 the performance increased to 478 TFLOPS sustained and

596 TFLOPS peak.

The term Blue Gene/L sometimes refers to the computer installed at

LLNL; and sometimes refers to the architecture of that computer. As of 

November 2006, there are 27 computers on the Top500 list using the Blue

Gene/L architecture. All these computers are listed as having an architecture

of eServer Blue Gene Solution.

The block scheme of the Blue Gene/L ASIC including dual PowerPC

440 cores.

In December 1999, IBM announced a $100 million research initiative

for a five-year effort to build a massively parallel computer, to be applied to

the study of biomolecular phenomena such as protein folding. The project

has two main goals: to advance our understanding of the mechanisms behind

protein folding via large-scale simulation, and to explore novel ideas in

massively parallel machine architecture and software. This project should

enable biomolecular simulations that are orders of magnitude larger than

current technology permits. Major areas of investigation include: how to use

this novel platform to effectively meet its scientific goals, how to make such

massively parallel machines more usable, and how to achieve performance

targets at a reasonable cost, through novel machine architectures. The design

is built largely around the previous QCDSP and QCDOC supercomputers.

Major Features:

The Blue Gene/L supercomputer is unique in the following aspects:

  Trading the speed of processors for lower power consumption.

  Dual processors per node with two working modes: co-processor mode

(1 user process/node: computation and communication work is

shared by the two processors) and virtual node mode (2 user processes/node).

  System-on-a-chip design.

  A large number of nodes (scalable in increments of 1024 up to at

least 65,536)

  Three-dimensional torus interconnect with auxiliary networks for

global communications, I/O, and management.

  Lightweight OS per node for minimum system overhead

(computational noise).

Architecture of Blue Gene:

One Blue Gene/L node Board

Each Compute or I/O node is a single ASIC with associated DRAM

memory chips. The ASIC integrates two 700 MHz PowerPC 440

embedded processors, each with a double-pipeline-double-precision

Floating Point Unit (FPU), a cache sub-system with built-in DRAM

controller and the logic to support multiple communication sub-systems.

The dual FPUs give each Blue Gene/L node a theoretical peak 

performance of 5.6 GFLOPS (gigaFLOPS). Node CPUs are not cache

coherent with one another.

Compute nodes are packaged two per compute card, with 16 compute

cards plus up to 2 I/O nodes per node board. There are 32 node boards

per cabinet/rack. By integration of all essential sub-systems on a single

chip, each Compute or I/O node dissipates low power (about 17 watts,

including DRAMs). This allows very aggressive packaging of up to 1024

compute nodes plus additional I/O nodes in the standard 19" cabinet,

within reasonable limits of electrical power supply and air cooling. The

performance metrics in terms of FLOPS per watt, FLOPS per m² of

floorspace and FLOPS per unit cost allow scaling up to very high

performance.


Each Blue Gene/L node is attached to three parallel communications

networks: a 3D toroidal network for peer-to-peer communication

between compute nodes, a collective network for collective

communication, and a global interrupt network for fast barriers. The I/O

nodes, which run the Linux operating system, provide communication

with the world via an Ethernet network. The I/O nodes also handle the

filesystem operations on behalf of the compute nodes. Finally, a separate

and private Ethernet network provides access to any node for

configuration, booting and diagnostics.
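
To make the 3D toroidal network more concrete, the sketch below lists the nearest neighbours of a node when every dimension wraps around. It is a generic illustration of a torus topology, not Blue Gene code: the function name, the coordinate convention and the 8 x 8 x 8 example size are all assumptions made for the example.

```python
from typing import List, Tuple

Coord = Tuple[int, int, int]


def torus_neighbors(node: Coord, dims: Coord) -> List[Coord]:
    """Return the six nearest neighbours of `node` in a 3D torus of size `dims`."""
    neighbors = []
    for axis in range(3):
        for step in (-1, 1):
            coord = list(node)
            # The modulo makes each dimension wrap around, closing the ring.
            coord[axis] = (coord[axis] + step) % dims[axis]
            neighbors.append(tuple(coord))
    return neighbors


# Hypothetical 8 x 8 x 8 torus (512 nodes), chosen only for illustration.
for neighbor in torus_neighbors((0, 0, 7), (8, 8, 8)):
    print(neighbor)
```

Because of the wrap-around, a node on the "edge" of the grid still has six direct neighbours, which is what distinguishes a torus from a plain 3D mesh.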

Blue Gene/L compute nodes use a minimal operating system

supporting a single user program. Only a subset of POSIX calls are

supported, and only one process may be run at a time. Programmers need

to implement green threads in order to simulate local concurrency.

Application development is usually performed in C, C++, or Fortran

using MPI for communication. However, some scripting languages such

as Ruby have been ported to the compute nodes.

To allow multiple programs to run concurrently, a Blue Gene/L

system can be partitioned into electronically isolated sets of nodes. The

number of nodes in a partition must be a positive integer power of 2, and

must be at least 2^5 = 32. The maximum partition is all nodes

in the computer. To run a program on Blue Gene/L, a partition of the

computer must first be reserved. The program is then run on all the nodes

within the partition, and no other program may access nodes within the

partition while it is in use. Upon completion, the partition nodes are

released for future programs to use.
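
The partition-size rules stated above (a power of 2, at least 32 nodes, at most the whole machine) can be expressed as a small check. This is only a sketch of the rule as described in this report; the function name and the `system_nodes` parameter are hypothetical and do not correspond to a real Blue Gene control-system interface.

```python
MIN_PARTITION_NODES = 2 ** 5  # 32 nodes, the minimum stated in the text


def is_valid_partition_size(nodes: int, system_nodes: int) -> bool:
    """Check a requested partition size against the rules described above."""
    # A positive integer is a power of two when exactly one bit is set.
    is_power_of_two = nodes > 0 and (nodes & (nodes - 1)) == 0
    return is_power_of_two and MIN_PARTITION_NODES <= nodes <= system_nodes


for size in (16, 32, 48, 1024, 65536, 131072):
    print(size, is_valid_partition_size(size, system_nodes=65536))
```

In this example 16 is rejected for being too small, 48 for not being a power of 2, and 131072 for exceeding the 65,536 nodes of the machine.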

With so many nodes, component failures are inevitable. The system is

able to electrically isolate faulty hardware to allow the machine to

continue to run.

Plan 9 support:

A team composed of members from Bell Labs, IBM Research, Sandia

National Laboratory, and Vita Nuova has completed a port of Plan 9 to

Blue Gene/L. Plan 9 kernels are running on both the compute nodes and

the I/O nodes. The Ethernet, Torus, Collective Network, Barrier

Network, and Management networks are all supported.

Introduction to Blue Gene/C or Cyclops64:

Blue Gene/C

Blue Gene/C (now renamed to Cyclops64) is a sister-project to Blue

Gene/L. It is a massively parallel, supercomputer-on-a-chip cellular

architecture. It was slated for release in early 2007 but has been delayed.

Introduction to Blue Gene/P:

Blue Gene/P node Card

A schematic overview of a Blue Gene/P supercomputer

On June 26, 2007, IBM unveiled Blue Gene/P, the second generation of the Blue Gene

supercomputer. Designed to run continuously at 1 PFLOPS (petaFLOPS), it

can be configured to reach speeds in excess of 3 PFLOPS. Furthermore, it is

at least seven times more energy efficient than any other supercomputer,

accomplished by using many small, low-power chips connected through five

specialized networks. Four 850 MHz PowerPC 450 processors are integrated

on each Blue Gene/P chip. The 1-PFLOPS Blue Gene/P configuration is a

294,912-processor, 72-rack system harnessed to a high-speed, optical

network. Blue Gene/P can be scaled to an 884,736-processor, 216-rack 

cluster to achieve 3-PFLOPS performance. A standard Blue Gene/P

configuration will house 4,096 processors per rack.
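
The processor counts quoted for the two Blue Gene/P configurations are consistent with 4,096 processors per rack, as the sketch below shows. The peak-performance estimate additionally assumes four floating-point operations per cycle per processor, a figure that this report gives only for Blue Gene/L; treat it as an assumption. All names in the code are illustrative.

```python
from typing import Tuple

CORES_PER_CHIP = 4
CORES_PER_RACK = 4096
CLOCK_HZ = 850e6
# Assumption: four flops per cycle per core, as stated for BG/L's double FPU.
FLOPS_PER_CYCLE_PER_CORE = 4


def configuration(racks: int) -> Tuple[int, int, float]:
    """Return (processors, chips, estimated peak PFLOPS) for `racks` racks."""
    cores = racks * CORES_PER_RACK
    chips = cores // CORES_PER_CHIP
    peak_pflops = cores * CLOCK_HZ * FLOPS_PER_CYCLE_PER_CORE / 1e15
    return cores, chips, peak_pflops


for racks in (72, 216):
    cores, chips, peak = configuration(racks)
    print(f"{racks} racks: {cores} processors on {chips} chips, about {peak:.2f} PFLOPS peak")
```

Under that assumption, 72 racks give roughly 1 PFLOPS and 216 racks roughly 3 PFLOPS, in line with the figures quoted above.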

On November 12, 2007, the first system, JUGENE, with 65,536

processors, was running at the Jülich Research Centre in Germany with a

performance of 167 TFLOPS. It is the fastest supercomputer in Europe and

the sixth fastest in the world. The first laboratory in the United States to

receive the Blue Gene/P was Argonne National Laboratory. The first racks

of the Blue Gene/P shipped in fall 2007. The first installment was a

111-teraflops system. In February 2009 it was announced that

JUGENE will be upgraded to reach petaflops performance in June 2009,

making it the first petascale supercomputer in Europe. Installation of the new

configuration started on April 6, and the system will go into production at the end

of June 2009. The new configuration will include 294,912 processor cores,

144 terabytes of memory, and 6 petabytes of storage in 72 racks. The new

configuration will incorporate a new water cooling system that will reduce

the cooling cost substantially.

Web-scale platform:

The IBM Kittyhawk project team has ported Linux to the compute

nodes and demonstrated generic Web 2.0 workloads running at scale on a

Blue Gene/P. Their paper published in the ACM Operating Systems Review

describes a kernel driver that tunnels Ethernet over the tree network, which

results in all-to-all TCP/IP connectivity.[21] Running standard Linux

software like MySQL, their performance results on SpecJBB rank among the

highest on record.

Introduction to Blue Gene/Q:

Blue Gene/Q

The last known supercomputer design in the Blue Gene series, Blue

Gene/Q, is aimed at reaching 20 petaflops in the 2011 time frame. It will

continue to expand and enhance the Blue Gene/L and /P architectures with

higher frequency at much improved performance per watt. Blue Gene/Q will

have a similar number of nodes but many more cores per node.[22] Exactly

how many cores per chip the BG/Q will have is currently somewhat unclear,

but 8 or even 16 is possible, with 1 GB of memory per core.

The archetypal Blue Gene/Q system called Sequoia will be installed at

Lawrence Livermore National Laboratory in 2011 as a part of the Advanced

Simulation and Computing Program running nuclear simulations and

advanced scientific research. It will consist of 98,304 compute nodes

comprising 1.6 million processor cores and 1.6 PB memory in 96 racks

covering an area of about 3000 square feet, drawing 6 megawatts of power.
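
The Sequoia figures above can be used to sanity-check the "8 or even 16" cores per chip and the 1 GB of memory per core mentioned earlier. The sketch below simply divides the quoted totals; because those totals are rounded, the ratios are approximate, and the variable names are illustrative only.

```python
# Ratios implied by the rounded Sequoia figures quoted in the text.
NODES = 98304
RACKS = 96
CORES_TOTAL_QUOTED = 1.6e6          # "1.6 million processor cores" (rounded)
MEMORY_TOTAL_QUOTED_BYTES = 1.6e15  # "1.6 PB memory" (rounded, decimal prefix assumed)

nodes_per_rack = NODES // RACKS
cores_per_node = CORES_TOTAL_QUOTED / NODES
memory_per_core_gb = MEMORY_TOTAL_QUOTED_BYTES / CORES_TOTAL_QUOTED / 1e9

print("nodes per rack:", nodes_per_rack)
print("cores per node (approx.):", round(cores_per_node))
print("memory per core in GB (approx.):", memory_per_core_gb)
```

The quoted totals therefore point to about 16 cores per node and about 1 GB of memory per core.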

IBM details Blue Gene supercomputer:

IBM is shedding light on a program to create the world's fastest

supercomputer, illuminating a dual-pronged strategy, an unusual new

processor design and a leaning toward the Linux operating system.

"Blue Gene" is an ambitious project to expand the horizons of 

supercomputing, with the ultimate goal of creating a system that can perform

one quadrillion calculations per second, or one petaflop. IBM expects a

machine it calls Blue Gene/P to be the first to achieve the computational

milestone. Today's fastest machine, NEC's Earth Simulator, is comparatively

slow--about one-thirtieth of a petaflop--but fast enough to worry the United

States government that the country is losing its computing lead to Japan.

IBM has begun building the chips that will be used in the first Blue

Gene, a machine dubbed Blue Gene/L that will run Linux and have more

than 65,000 computing nodes, said Bill Pulleyblank, director of IBM's Deep

Computing Institute and the executive overseeing the project. Each node has

a small chip with an unusually large number of functions crammed onto the

single slice of silicon: two processors, four accompanying mathematical

engines, 4MB of memory and communication systems for five separate

networks.


Joining Blue Gene/L is a second major experimental system called

"Cyclops," which in comparison will have many more processors etched

onto each slice of silicon--perhaps as many as 64, Pulleyblank said.

In addition, IBM probably will use the Linux operating system on all

the members of the Blue Gene family, not just Blue Gene/L. "My belief is

that's definitely where we're going to go," Pulleyblank said.

Blue Gene's original mission was to tackle the computationally

onerous task of using the laws of physics to predict how chains of biochemical building blocks described by DNA fold into proteins--massive

molecules such as hemoglobin. IBM has expanded its mission, though, to

other subjects including global climate simulation and financial risk 


"We're looking at broad suite of applications," Pulleyblank said, a

move that will help IBM reach one of the goals of the Blue Gene project: to

produce technology that customers ultimately will pay for.

IBM already has spent more than the original $100 million budgeted

for the project and won't meet its 2004 goal for the ultimate machine, but the

company has made progress bringing its ideas to fruition.

IBM is building the processors for the first member of the Blue Gene

family, Blue Gene/L, and expects to use them this year in a machine that

will be a microcosm of the eventual full-fledged Blue Gene/L due by the end

of 2004, Pulleyblank said. IBM also has begun designing the processors for

Cyclops, which IBM internally calls Blue Gene/C.

The performance results of Blue Gene/L and Cyclops will determine

the design IBM chooses for the eventual petaflop machine, Blue Gene/P,

Pulleyblank said.

There are differences from what IBM originally envisioned. For one

thing, the processors will be based on IBM's PowerPC 440GX processor

instead of being designed from scratch. It's cooled by air instead of water. It

has a different network. And there's less memory, though still a whopping 16

terabytes total.

Blue Gene/L will be large, but significantly smaller than current IBM

supercomputers such as ASCI White, a nuclear weapons simulation machine

at Lawrence Livermore National Laboratory, which will also be the home of 

Blue Gene/L. ASCI White takes up the area of two basketball courts, or

9,400 square feet, while Blue Gene/L should fit into half a tennis court, or

about 1,400 square feet.

IBM's Blue Gene research has an academic flavor, but the company's

ultimate goal is profit. IBM is second only to Hewlett-Packard in the $4.7

billion market for high-performance technical computing machines. From

2001 to 2002, IBM's sales grew 28 percent from $1.04 billion to $1.33

billion, while HP's shrank 25 percent from $2.1 billion to $1.58 billion,

according to research firm IDC.

Like an automaker sponsoring a winning race car, building cutting-edge computers can bring bragging rights that can help attract top engineers

and convince customers that a company has sound long-term plans.

The design of Blue Gene/L

Blue Gene/L is an exercise in powers of two, starting with each of the

65,536 compute nodes. Each of the dual processors on the compute node has

two "floating point units," engines for performing mathematical calculations.

Each node's chip is 121 square millimeters and built on a

manufacturing process with 130-nanometer features, Pulleyblank said. That

compares with 267 square millimeters for IBM's current flagship processor,

the Power4+ used in its top-end Unix servers. The small size for Blue Gene's

chips is crucial to ensure the chips don't emit too much waste heat, which

would prevent engineers from packing them densely enough.

Two nodes are mounted onto a module; 16 modules fit into a chassis;

and 32 chassis are mounted into a rack. A total of 64 racks will be installed

at the Livermore lab by the end of 2004, with the first 512-node half-rack

prototype to be built this fall at IBM's Thomas J. Watson Research Center.

"We're going to have first hardware this year. We are actually fabricating

chips for this machine," Pulleyblank said.

All nodes are created equal, but 1,024 of them will have a more

important task than the rest, Pulleyblank said. These so-called input-output,

or I/O, nodes, will run an instance of Linux and assign calculations to a

stable of 64 processor nodes.

These underling nodes won't run Linux, but instead a custom

operating system stripped to its bare essentials, he said. When they have to

perform a task they're not equipped to handle, they can pass the job up the

pecking order to one of the I/O nodes.
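
With 65,536 compute nodes and 1,024 I/O nodes, each I/O node looks after 64 compute nodes, as stated above. The sketch below shows one way such a grouping could be expressed. It assumes that compute nodes are numbered consecutively from 0 and grouped in contiguous blocks; that numbering scheme and the function name are hypothetical and are not taken from the Blue Gene/L design.

```python
COMPUTE_NODES = 65536
IO_NODES = 1024
COMPUTE_NODES_PER_IO_NODE = COMPUTE_NODES // IO_NODES  # 64, as quoted in the text


def io_node_for(compute_node: int) -> int:
    """Index of the I/O node serving `compute_node` under a contiguous grouping."""
    if not 0 <= compute_node < COMPUTE_NODES:
        raise ValueError("compute node index out of range")
    return compute_node // COMPUTE_NODES_PER_IO_NODE


print(COMPUTE_NODES_PER_IO_NODE)
print(io_node_for(0), io_node_for(63), io_node_for(64), io_node_for(65535))
```

A compute node that cannot perform a task itself would, in this picture, hand the request to the I/O node returned by the lookup.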

Application developments 

To carry out the scientific research into the mechanisms behind protein

folding announced in December 1999, development of a molecular

simulation application kernel targeted for massively parallel architectures is

underway. This application development effort serves multiple

purposes:


(1) It is the application platform for the Blue Gene Science programs.

(2) It serves as a prototyping platform for research into application

frameworks suitable for cellular architectures.

(3) It provides an application perspective in close contact with the

hardware and systems software development teams.

One of the motivations for the use of massive computational power in the

study of protein folding and dynamics is to obtain a microscopic view of the

thermodynamics and kinetics of the folding process. Being able to simulate

longer and longer time-scales is the key challenge. Thus the focus for

application scalability is on improving the speed of execution for a fixed size

system by utilizing additional CPUs. Efficient domain decomposition and

utilization of the high performance interconnect networks on BG/L (both

torus and tree) are the keys to maximizing application scalability. To provide

an environment to allow exploration of algorithmic alternatives, the

applications group has focused on understanding the logical limits to

concurrency within the application, structuring the application architecture

to support the finest-grained concurrency possible, and logically separating

parallel communications from straight-line serial computation.
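
Domain decomposition, mentioned above as a key to scalability, can be illustrated with a uniform spatial split of a periodic simulation box over a 3D grid of processes. The sketch below is a generic textbook-style example and is not the scheme used by the Blue Gene application; the function name, the row-major rank numbering and the box and grid sizes are assumptions chosen for illustration.

```python
from typing import Sequence


def owner_rank(position: Sequence[float],
               box: Sequence[float],
               grid: Sequence[int]) -> int:
    """Rank that owns `position` under a uniform 3D spatial decomposition."""
    cell = []
    for x, length, parts in zip(position, box, grid):
        # Wrap the coordinate into the periodic box, then find its cell.
        index = int((x % length) / length * parts)
        cell.append(min(index, parts - 1))  # guard against rounding at the edge
    i, j, k = cell
    # Row-major numbering of the process grid (an illustrative convention).
    return (i * grid[1] + j) * grid[2] + k


box = (10.0, 10.0, 10.0)   # hypothetical box edge lengths
grid = (2, 2, 2)           # hypothetical 2 x 2 x 2 process grid
for atom in [(1.0, 1.0, 1.0), (6.0, 1.0, 9.0), (-1.0, 5.0, 0.0)]:
    print(atom, "->", owner_rank(atom, box, grid))
```

For a fixed-size system, adding processes shrinks each process's sub-volume, which is why communication between neighbouring sub-volumes becomes the limiting factor and why the interconnect matters.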

With this separation and the identification of key communications patterns

used widely in molecular simulation, it is possible for domain experts in

molecular simulation to modify detailed behavior of the application without

having to deal with the complexity of the parallel communications

environment as well. Key computational kernels derived from the molecular

simulation application have been used to characterize and drive

improvements in the floating-point code generation of the compiler being

developed for the BG/L platform. As additional tools and actual hardware

become available, the effects of cache hierarchy and communications

architecture can be explored in detail for the application.


System Software for the I/O Nodes

The Linux kernel that executes in the I/O nodes is based on a standard

distribution for PowerPC 440GP processors. Although Blue Gene/L uses

standard PPC 440 cores, the overall chip and card design required changes in

the booting sequence, interrupt management, memory layout, FPU support,

and device drivers of the standard Linux kernel. There is no BIOS in the

Blue Gene/L nodes, thus the configuration of a node after power-on and the

initial program load (IPL) is initiated by the service nodes through the

control network. We modified the interrupt and exception handling code to

support Blue Gene/L’s custom Interrupt Controller (BIC).

The implementation of the kernel MMU remaps the tree and torus FIFOs to

user space. We support the new EMAC4 Gigabit Ethernet controller. We

also updated the kernel to save and restore the double FPU registers in each

context switch. The nodes in the Blue Gene/L machine are diskless, thus the

initial root file system is provided by a ramdisk linked against the Linux

kernel. The ramdisk contains shells, simple utilities, shared libraries, and

network clients such as ftp and nfs. Because of the non-coherent L1 caches,

the current version of Linux runs on one of the 440 cores, while the second

CPU is captured at boot time in an infinite loop. We are investigating two

main strategies to effectively use the second CPU in the I/O nodes: SMP

mode and virtual mode. We have successfully compiled an SMP version of

the kernel, after implementing all the required interprocessor

communications mechanisms, because the BG/L’s BIC is not [2] 

compliant. In this mode, the TLB entries for the L1 cache are disabled in

kernel mode and processes have affinity to one CPU.

Forking a process on a different CPU requires additional parameters to the

system call. The performance and effectiveness of this solution are still an

open issue. A second, more promising mode of operation runs Linux in one

of the CPUs, while the second CPU is the core of a virtual network card. In

this scenario, the tree and torus FIFOs are not visible to the Linux kernel.

Transfers between the two CPUs appear as virtual DMA transfers. We are

also investigating support for large pages. The standard PPC 440 embedded

processors handle all TLB misses in software. Although the average number

of instructions required to handle these misses has significantly decreased, it

has been shown that larger pages improve performance.

System Software for the Compute Nodes 

The “Blue Gene/L Run Time Supervisor” (BLRTS) is a custom kernel that

runs on the compute nodes of a Blue Gene/L machine. BLRTS provides a

simple, flat, fixed-size 256MB address space, with no paging, accomplishing

a role similar to [2]. The kernel and application program share the same

address space, with the kernel residing in protected memory at address 0 and

the application program image loaded above, followed by its heap and stack.

The kernel protects itself by appropriately programming the PowerPC

MMU. Physical resources (torus, tree, mutexes, barriers, scratchpad) are

partitioned between application and kernel. In the current implementation,

the entire torus network is mapped into user space to obtain better

communication efficiency, while one of the two tree channels is made

available to the kernel and user applications.

BLRTS presents a familiar POSIX interface: we have ported the GNU Glibc

runtime library and provided support for basic file I/O operations through

system calls. Multi-processing services (such as fork and exec) are

meaningless in a single-process kernel and have not been implemented.

Program launch, termination, and file I/O are accomplished via messages

passed between the compute node and its I/O node over the tree network,

using a point-to-point packet addressing mode.

This functionality is provided by a daemon called CIOD (Console I/O

Daemon) running in the I/O nodes. CIOD provides job control and I/O

management on behalf of all the compute nodes in the processing set. Under

normal operation, all messaging between CIOD and BLRTS is synchronous:

all file I/O operations are blocking on the application side. We used the

CIOD in two scenarios:

1. Driven by a console shell (called CIOMAN), used mostly for simulation

and testing purposes. The user is provided with a restricted set of commands:

run, kill, ps, set and unset environment variables. The shell distributes the

commands to all the CIODs running in the simulation, which in turn take the

appropriate actions for their compute nodes.

2. Driven by a job scheduler (such as LoadLeveler) through a special

interface that implements the same protocol as the one defined for CIOMAN

and CIOD.

We are investigating a range of compute modes for our custom kernel. In

heater mode, one CPU executes both user and network code, while the other

CPU remains idle. This mode will be the mode of operation of the initial

prototypes, but it is unlikely to be used afterwards.

In co-processor mode, the application runs in a single, non-preemptable

thread of execution on the main processor (CPU 0). The coprocessor (CPU 1)

is used as a torus device off-load engine that runs as part of a user-level

application library, communicating with the main processor through a non-

cached region of shared memory. In symmetric mode, both CPUs run

applications and users are responsible for explicitly handling cache

coherence. In virtual node mode we provide support for two independent

processes in a node. The system then looks like a machine with 128K nodes.
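
The "128K nodes" figure follows from running two processes on each of the 65,536 compute nodes. The sketch below records the number of user processes per node for the two modes where this report states it explicitly (co-processor and virtual node); the dictionary and function names are illustrative only.

```python
COMPUTE_NODES = 65536

# User processes per node, for the two modes whose count the text states.
PROCESSES_PER_NODE = {
    "co-processor": 1,   # one application thread; CPU 1 off-loads torus traffic
    "virtual node": 2,   # two independent processes per node
}


def apparent_nodes(mode: str) -> int:
    """Number of nodes the machine appears to have in the given mode."""
    return COMPUTE_NODES * PROCESSES_PER_NODE[mode]


for mode in PROCESSES_PER_NODE:
    print(mode, apparent_nodes(mode))
```

In virtual node mode the result is 131,072, that is, 128 x 1024.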


The world increasingly expects processes and tasks to be completed in

fractions of a second, and for many of them the only option is a supercomputer. Blue

Gene's speed and expandability have enabled business and science to

address a wide range of complex problems and make more informed

decisions -- not just in the life sciences, but also in astronomy, climate,

simulations, modeling and many other areas. Blue Gene systems have

helped map the human genome, investigated medical therapies, safeguarded

nuclear arsenals, simulated radioactive decay, replicated brain power, flown

airplanes, pinpointed tumors, predicted climate trends, and identified fossil

fuels  –  all without the time and money that would have been required to

physically complete these tasks.


1.  Harris, Mark (September 18, 2009). "Obama honours IBM

supercomputer". Techradar. 

2.  Blue Gene/L Configuration 

