
    A SEMINAR REPORT

    ON

BLUE GENE/L

Submitted in partial fulfillment of the requirements for the award of the degree of

    BACHELOR OF TECHNOLOGY

    IN

    COMPUTER SCIENCE AND ENGINEERING

    BY

M.S. RAMA KRISHNA (06M11A05A3)

Department of Computer Science and Engineering

BANDARI SRINIVAS INSTITUTE OF TECHNOLOGY, CHEVELLA

    Affiliated to JNTU, HYDERABAD, Approved by AICTE

2009-2010


    BANDARI SRINIVAS INSTITUTE OF TECHNOLOGY

Chevella, R.R. Dist., A.P.

(Approved by AICTE, New Delhi; Affiliated to JNTU, Hyderabad)

    Date: 30-03-2010

    CERTIFICATE

This is to certify that this is a bona fide record of the dissertation work entitled BLUE GENE/L done by M.S. RAMA KRISHNA, bearing roll no. 06M11A05A3, which has been submitted in partial fulfillment of the requirements for the award of the degree of Bachelor of Technology in Computer Science and Engineering from Jawaharlal Nehru Technological University, Hyderabad, for the academic year 2009-2010. This work has not been submitted to any other university for the award of any degree or diploma.

(Mr. VIJAY KUMAR)                              (Mr. CH. RAJ KISHORE)


    ACKNOWLEDGEMENT

I wholeheartedly thank BANDARI SRINIVAS INSTITUTE OF TECHNOLOGY for giving me an opportunity to present this seminar report on a technical paper in the college.

I express my deep sense of gratitude to Mr. CH. RAJA KISHORE, Head of the Department of Computer Science and Engineering at BANDARI SRINIVAS INSTITUTE OF TECHNOLOGY, for his valuable guidance, inspiration and encouragement in presenting my seminar report.

I am also thankful to Mr. VIJAY KUMAR, my internal guide, for his valuable academic guidance and support.

I wholeheartedly thank all the staff members of the Department of Computer Science and Engineering for their support and encouragement in doing my seminar work.

Lastly, I thank all those who have helped me directly and indirectly in doing this seminar work successfully.

    NAME: M.S.RAMAKRISHNA

    ROLL NO: 06M11A05A3


    ABSTRACT

Blue Gene is a massively parallel computer being developed at the IBM Thomas J. Watson Research Center. Blue Gene represents a hundred-fold improvement in performance over the fastest supercomputers of today. It will achieve 1 PetaFLOP/sec through unprecedented levels of parallelism, in excess of 4,000,000 threads of execution. The Blue Gene project has two important goals: to advance our understanding of biologically important processes, and to advance knowledge of cellular architectures (massively parallel systems built of single-chip cells that integrate processors, memory and communication) and of the software needed to exploit them effectively. This massively parallel system of 65,536 nodes is based on a new architecture that exploits system-on-a-chip technology to deliver a target peak processing power of 360 teraFLOPS (trillion floating-point operations per second). The machine is scheduled to be operational in the 2004-2005 time frame, at performance and power consumption/performance targets unobtainable with conventional architectures.


In November 2001 IBM announced a partnership with Lawrence Livermore National Laboratory to build the Blue Gene/L (BG/L) supercomputer, a 65,536-node machine designed around embedded PowerPC processors. Through the use of system-on-a-chip integration coupled with a highly scalable cellular architecture, Blue Gene/L will deliver 180 or 360 teraFLOPS of peak computing power, depending on the utilization mode. Blue Gene/L represents a new level of scalability for parallel systems. Whereas existing large-scale systems range in size from hundreds to a few thousand compute nodes, Blue Gene/L makes a jump of almost two orders of magnitude. It is reasonably clear that such machines, in the near future at least, will require a departure from the architectures of current parallel supercomputers, which use a few thousand commodity microprocessors. With current technology, it would take around a million microprocessors to achieve petaFLOPS performance. Clearly, power requirements and cost considerations alone preclude this option. Using such a cellular design, petaFLOPS performance will be reached within the


next two to three years, especially since IBM has announced the Blue Gene project aimed at building such a machine.


Index

Contents

Chapter 1  Introduction
Chapter 2  Detailed Report
Chapter 3  Applications
Chapter 4  Advantages
Chapter 5  Disadvantages
Chapter 6  Conclusions
References


1. Introduction

Blue Gene is a computer architecture project designed to produce several supercomputers, designed to reach operating speeds in the PFLOPS (petaFLOPS) range, and currently reaching sustained speeds of nearly 500 TFLOPS (teraFLOPS). It is a cooperative project among IBM (particularly IBM Rochester and the Thomas J. Watson Research Center), the Lawrence Livermore National Laboratory, the United States Department of Energy (which is partially funding the project), and academia. There are four Blue Gene projects in development: Blue Gene/L, Blue Gene/C, Blue Gene/P, and Blue Gene/Q. The project was awarded the National Medal of Technology and Innovation by U.S. President Barack Obama on September 18, 2009; the president bestowed the award on October 7, 2009.[1]

2. DESCRIPTION

The first computer in the Blue Gene series, Blue Gene/L, developed through a partnership with Lawrence Livermore National Laboratory (LLNL), originally had a theoretical peak performance of 360 TFLOPS, and scored over 280 TFLOPS sustained on the Linpack benchmark. After an upgrade in 2007 the performance increased to 478 TFLOPS sustained and 596 TFLOPS peak. The term Blue Gene/L sometimes refers to the computer installed at LLNL and sometimes to the architecture of that computer. As of November 2006, there were 27 computers on the Top500 list using the Blue Gene/L architecture. All these computers are


listed as having an architecture of eServer Blue Gene Solution.

In December 1999, IBM announced a $100 million research initiative for a five-year effort to build a massively parallel computer, to be applied to the study of biomolecular phenomena such as protein folding. The project has two main goals: to advance our understanding of the mechanisms behind protein folding via large-scale simulation, and to explore novel ideas in massively parallel machine architecture and software. This project should enable biomolecular simulations that are orders of magnitude larger than current technology permits. Major areas of investigation include: how to use this novel platform to effectively meet its scientific goals, how to make such massively parallel machines more usable, and how to achieve performance targets at a reasonable cost through novel machine architectures. The design is built largely around the previous QCDSP and QCDOC supercomputers.

In November 2001, Lawrence Livermore National Laboratory joined IBM as a research partner for Blue Gene. On September 29, 2004, IBM announced that a Blue Gene/L prototype at IBM Rochester (Minnesota) had overtaken NEC's Earth Simulator as the fastest computer in the world, with a speed of 36.01 TFLOPS on the Linpack benchmark, beating Earth Simulator's 35.86 TFLOPS. This was achieved with an 8-cabinet system, with each cabinet holding 1,024 compute nodes. Upon doubling this configuration to 16 cabinets, the machine reached a speed of 70.72 TFLOPS by November 2004, taking first place on the Top500 list.


On March 24, 2005, the US Department of Energy announced that the Blue Gene/L installation at LLNL broke its speed record, reaching 135.5 TFLOPS. This feat was possible because of doubling the number of cabinets to 32. On the Top500 list,[2] Blue Gene/L installations across several sites worldwide took 3 of the top 10 positions, and 13 of the top 64. Three racks of Blue Gene/L are housed at the San Diego Supercomputer Center and are available for academic research.

On October 27, 2005, LLNL and IBM announced that Blue Gene/L had once again broken its speed record, reaching 280.6 TFLOPS on Linpack, upon reaching its final configuration of 65,536 compute nodes (i.e., 2^16 nodes) and an additional 1,024 I/O nodes in 64 air-cooled cabinets. The LLNL Blue Gene/L uses Lustre to access multiple filesystems in the 600 TB - 1 PB range.[3]

Blue Gene/L is also the first supercomputer ever to run over 100 TFLOPS sustained on a real-world application, namely a three-dimensional molecular dynamics code (ddcMD), simulating solidification (nucleation and growth processes) of molten metal under high pressure and temperature conditions. This achievement won the 2005 Gordon Bell Prize. On June 22, 2006, NNSA and IBM announced that Blue Gene/L had achieved 207.3 TFLOPS on a quantum chemical application (Qbox).[4] On November 14, 2006, at Supercomputing 2006,[5] Blue Gene/L was awarded the winning prize in all HPC Challenge classes of awards.[6] On April 27, 2007, a team from the IBM Almaden Research Center and the University of Nevada ran an artificial neural network almost half as complex as the brain of a


mouse for the equivalent of a second (the network was run at 1/10 of normal speed for 10 seconds).[7]

In November 2007, the LLNL Blue Gene/L remained in the number one spot as the world's fastest supercomputer. It had been upgraded since the previous measurement, and was then almost three times as fast as the second fastest, a Blue Gene/P system. On June 18, 2008, the new Top500 list marked the first time a Blue Gene system was not the leader in the Top500 since it had assumed that position, being topped by IBM's Cell-based Roadrunner system, which was the only system to surpass the mythical petaFLOPS mark. Top500 also announced that the Cray XT5 Jaguar housed at OLCF was the fastest supercomputer in the world for open science.

Major features

The Blue Gene/L supercomputer is unique in the following aspects:
1. Trading the speed of processors for lower power consumption.
2. Dual processors per node with two working modes: coprocessor (1 user process/node; computation and communication work is shared by the two processors) and virtual node (2 user processes/node).
3. System-on-a-chip design.
4. A large number of nodes (scalable in increments of 1,024 up to at least 65,536).
5. Three-dimensional torus interconnect with auxiliary networks for global communications, I/O, and management.
6. Lightweight OS per node for minimum system overhead (computational noise).

Architecture


[Figure: One Blue Gene/L node board]
[Figure: A schematic overview of a Blue Gene/L supercomputer]

Each Compute or I/O node is a single ASIC with associated DRAM memory chips. The ASIC integrates two 700 MHz PowerPC 440 embedded processors, each with a double-pipeline double-precision Floating Point Unit (FPU), a cache sub-system with built-in DRAM controller, and the logic to support multiple communication sub-systems. The dual FPUs give each Blue Gene/L node a theoretical peak performance of 5.6 GFLOPS (gigaFLOPS). Node CPUs are not cache coherent with one another.

Compute nodes are packaged two per compute card, with 16 compute cards plus up to 2 I/O nodes per node board. There are 32 node boards per cabinet/rack.[9] By integration of all essential sub-systems on a single chip, each Compute or I/O node dissipates low power (about 17 watts, including DRAMs). This allows very aggressive packaging of up to 1,024 compute nodes plus additional I/O nodes in the standard 19" cabinet, within reasonable limits of electrical power supply and air cooling. The performance metrics in terms of FLOPS per watt, FLOPS per m² of floor space, and FLOPS per unit cost allow scaling up to very high performance.

Each Blue Gene/L node is attached to three parallel communications networks: a 3D toroidal network for peer-to-peer communication between compute nodes, a collective network for collective communication, and a global interrupt network for fast barriers. The I/O nodes, which run the Linux operating system, provide communication with the world via an Ethernet network. The I/O nodes also handle the filesystem operations on


behalf of the compute nodes. Finally, a separate and private Ethernet network provides access to any node for configuration, booting and diagnostics.

Blue Gene/L compute nodes use a minimal operating system supporting a single user program. Only a subset of POSIX calls are supported, and only one process may be run at a time. Programmers need to implement green threads in order to simulate local concurrency. Application development is usually performed in C, C++, or Fortran using MPI for communication. However, some scripting languages such as Ruby have been ported to the compute nodes.[10]

To allow multiple programs to run concurrently, a Blue Gene/L system can be partitioned into electronically isolated sets of nodes. The number of nodes in a partition must be a positive integer power of 2, and must contain at least 2^5 = 32 nodes. The maximum partition is all nodes in the computer. To run a program on Blue Gene/L, a partition of the computer must first be reserved. The program is then run on all the nodes within the partition, and no other program may access nodes within the partition while it is in use. Upon completion, the partition nodes are released for future programs to use.
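As a small worked illustration of the partition-size rule just described (a hedged sketch, not part of the Blue Gene/L system software; the function name and the machine-size argument are made up for the example), the following C fragment checks whether a requested node count is a valid partition size:

    /* Partition-size rule: a power of two, at least 2^5 = 32 nodes,
     * and no larger than the whole machine. */
    #include <stdio.h>

    static int is_valid_partition(unsigned int nodes, unsigned int machine_size)
    {
        if (nodes < 32 || nodes > machine_size)
            return 0;
        return (nodes & (nodes - 1)) == 0;   /* power-of-two test */
    }

    int main(void)
    {
        printf("%d\n", is_valid_partition(512, 65536));   /* 1: valid       */
        printf("%d\n", is_valid_partition(48,  65536));   /* 0: not a power of two */
        return 0;
    }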


With so many nodes, component failures are inevitable. The system is able to electrically isolate faulty hardware to allow the machine to continue to run.

    OPERATING SYSTEMS

Front-end nodes are commodity PCs running Linux. I/O nodes run a customized Linux kernel. Compute nodes use an extremely lightweight custom kernel. The service node is a single multiprocessor machine running a custom OS.

COMPUTE NODE KERNEL

Single user, dual-threaded. Flat address space, no paging. Physical resources are memory-mapped. Provides standard POSIX functionality (mostly). Two execution modes:
1. Virtual node mode
2. Coprocessor mode

SERVICE NODE OS

The Core Management and Control System (CMCS) is BG/L's global operating system. It comprises MMCS (Midplane Monitoring and Control System), CIOMAN (Control and I/O Manager), and a DB2 relational database.


Programming modes

This chapter provides information about the way in which the Message Passing Interface (MPI) is implemented and used on Blue Gene/L. There are two main modes in which you can use Blue Gene/L:
1. Communication Coprocessor Mode
2. Virtual Node Mode

1. Communication Coprocessor Mode

In the default mode of operation of Blue Gene/L, named Communication Coprocessor Mode, each physical compute node executes a single compute process. The Blue Gene/L system software treats the two processors in a compute node asymmetrically. One of the processors (CPU 0) behaves as a main processor, running the main thread of the compute process. The other processor (CPU 1) behaves as an offload engine (coprocessor) that only executes specific operations. The coprocessor is used primarily for offloading communication functions. It can also be used for running application-level coroutines.

2. Virtual Node Mode

The Compute Node Kernel in the compute nodes also supports a Virtual Node Mode of operation for the machine. In that mode, the kernel runs two separate processes in each compute node. Node resources (primarily the memory and the torus network) are shared by both processes. In Virtual Node Mode, an application can use both processors in a node simply by doubling its number of MPI tasks, without explicitly handling cache coherence issues.


The now-distinct MPI tasks running on the two CPUs of a compute node have to communicate with each other. This problem was solved by implementing a virtual torus device, serviced by a virtual packet layer, in the scratchpad memory.

In Virtual Node Mode, the two cores of a compute node act as different processes. Each has its own rank in the message layer. The message layer supports Virtual Node Mode by providing a correct torus-to-rank mapping and first in, first out (FIFO) pinning in this mode. The hardware FIFOs are shared equally between the processes. Torus coordinates are expressed by quadruplets instead of triplets. In Virtual Node Mode, communication between the two processors in a compute node cannot be done over the network hardware. Instead, it is done via a region of memory, called the scratchpad, to which both processors have access. Virtual FIFOs make portions of the scratchpad look like a send FIFO to one of the processors and a receive FIFO to the other. Access to the virtual FIFOs is mediated with help from the hardware lockboxes. From an application perspective, virtual nodes behave like physical nodes, but with less memory. Each virtual node executes one compute process. Processes in different virtual nodes, even those allocated in the same compute node, communicate only through messages. Processes running in Virtual Node Mode cannot invoke coroutines. The Blue Gene/L MPI implementation supports Virtual Node Mode operations by sharing the communications resources of a physical compute node between the two compute processes that execute on that physical node. The low-level communications library of Blue Gene/L, that is, the message layer, virtualizes these


communications resources into logical units that each process can use independently.

Deciding which mode to use

Whether you choose to use Communication Coprocessor Mode or Virtual Node Mode depends largely on the type of application you plan to execute. I/O-intensive tasks that require a relatively large amount of data interchange between compute nodes benefit more from Communication Coprocessor Mode. Applications that are primarily CPU-bound, and do not have large working memory requirements (the application only gets half of the node memory), run more quickly in Virtual Node Mode.
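To make the two modes concrete, here is a minimal MPI program in C of the kind these modes apply to. It is a generic sketch rather than code from the Blue Gene/L documentation; the same source runs in either mode, since in Communication Coprocessor Mode one MPI task runs per compute node while in Virtual Node Mode the task count is simply doubled at job launch.

    /* Minimal MPI sketch: one task per node in Communication Coprocessor
     * Mode, two tasks per node in Virtual Node Mode. */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank, size;
        double local, global;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        local = (double)rank;                  /* per-task partial result */
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

        if (rank == 0)
            printf("tasks=%d sum=%f\n", size, global);

        MPI_Finalize();
        return 0;
    }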

System calls supported by the Compute Node Kernel

This chapter discusses the system calls (syscalls) that are supported by the Compute Node Kernel. It is important for you to understand which functions can be called, and


perhaps more importantly, which ones cannot be called, by your application running on Blue Gene/L.

Introduction to the Compute Node Kernel

The role of the kernel on the Compute Node is to create a Linux-like environment for the execution of a user process. It is not a full Linux kernel implementation, but rather implements a subset of POSIX functionality. The Compute Node Kernel is a single-process operating system. It is designed to provide the services that are needed by applications which are expected to run on Blue Gene/L, but not for all applications. The Compute Node Kernel is not intended to run system administration functions from the compute node. To achieve the best reliability, a small and simple kernel is a design goal; this also enables a simpler checkpoint function. The compute node application never runs as the root user. In fact, it runs as the same user (uid) and group (gid) under which the job was submitted.

    System calls

The Compute Node Kernel system calls are subdivided into the following categories:
1. File I/O
2. Directory operations
3. Time
4. Process information
5. Signals
6. Miscellaneous
7. Sockets
8. Compute Node Kernel


    Additional Compute Node Kernel application support

This section provides details about additional support provided to application developers by the Compute Node Kernel.

3.3.1 Allocating memory regions with specific L1 cache attributes

Each PowerPC 440 core on a compute node has a 32 KB L1 data cache that is 64-way set associative and uses a 32-byte cache line. A load or store operation to a virtual address is translated by the hardware to a real address using the Translation Lookaside Buffer (TLB). The real address is used to select the set. If the address is available in the data cache, it is returned to the processor without needing to access lower levels of the memory subsystem. If the address is not in the data cache (a cache miss), a cache line is evicted from the data cache and is replaced with the cache line containing the address. The way to evict from the data cache is selected using a round-robin algorithm.

The L1 data cache can be divided into two regions: a normal region and a transient region. The number of ways to use for each region can be configured by the application. The Blue Gene/L memory subsystem supports the following L1 data cache attributes:

Cache-inhibited or cached. Memory with the cache-inhibited attribute causes all load and store operations to access the data from lower levels of the memory subsystem. Memory with the cached attribute might use the data cache for load and store operations. The default attribute for application memory is cached.

Store without allocate (SWOA) or store with allocate (SWA). Memory with the SWOA attribute bypasses the L1 data cache on a cache miss for a store


operation, and the data is stored directly to lower levels of the memory subsystem. Memory with the SWA attribute allocates a line in the L1 data cache when there is a cache miss for a store operation on the memory. The default attribute for application memory is SWA.

Write-through or write-back. Memory with the write-through attribute is written through to the lower levels of the memory subsystem for store operations. If the memory also exists in the L1 data cache, it is written to the data cache and the cache line is marked as clean. Memory with the write-back attribute is written to the L1 data cache, and the cache line is marked as dirty.

Transient or normal. Memory with the transient attribute uses the transient region of the L1 data cache. Memory with the normal attribute uses the normal region of the L1 data cache. By default, the L1 data cache is configured without a transient region and all application memory uses the normal region.

Checkpoint and restart support

Why use checkpoint and restart? Given the scale of the Blue Gene/L system, faults are expected to be the norm rather than the exception. This is unfortunately inevitable, given the vast number of individual hardware processors and other components involved in running the system. Checkpoint and restart are among the primary techniques for fault recovery. A special user-level checkpoint library has been developed for Blue Gene/L applications. Using this library, application programs can take a checkpoint of their program state at appropriate stages and can be restarted later from their last successful checkpoint.


Why should you be interested in this support? Numerous scenarios indicate that its use is warranted; here are two examples:

1. Your application is a long-running one. You do not want it to fail a long time into a run, losing all the calculations made up until the failure. Checkpoint and restart allow you to restart the application at the last checkpoint position, losing a much smaller slice of processing time.
2. You are given access to a Blue Gene/L system for relatively small increments of time, and you know that your application run will take longer than your allotted amount of processing time. Checkpoint and restart allow you to execute your application to completion in distinct chunks, rather than in one continuous period of time.

These are just two of many reasons to use checkpoint and restart support in your Blue Gene/L applications.

7.2.1 Input/output considerations

All the external I/O calls made from a program are shipped to the corresponding I/O Node using a function-shipping procedure implemented in the Compute Node Kernel. The checkpoint library intercepts calls to the five main file I/O functions: open, close, read, write, and lseek. The function name open is a weak alias that maps to the _libc_open function. The checkpoint library intercepts this call and provides its own implementation of open that internally uses the _libc_open function. The library maintains a file state table that stores the file name, current file position, and mode of all the files that are currently open. The table also maintains a translation that translates the file descriptors used by the Compute Node Kernel to another


set of file descriptors to be used by the application. While taking a checkpoint, the file state table is also stored in the checkpoint file. Upon a restart, these tables are read; the corresponding files are opened in the required mode, and the file pointers are positioned at the desired locations as given in the checkpoint file. The current design assumes that the programs either always read the files or write the files sequentially. A read followed by an overlapping write, or a write followed by an overlapping read, is not supported.
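The interception idea can be sketched in C as follows. This is a hedged illustration, not the actual checkpoint library: the double-underscore symbol __libc_open, the table layout, and the fixed table size are assumptions made for the example.

    /* Sketch of intercepting open() while recording state for restart. */
    #include <fcntl.h>
    #include <stdarg.h>
    #include <string.h>
    #include <sys/types.h>

    extern int __libc_open(const char *path, int flags, ...);

    struct file_state {
        char  name[256];   /* file name                                */
        int   kernel_fd;   /* descriptor returned by the kernel        */
        int   app_fd;      /* descriptor handed to the application     */
        int   mode;        /* open flags                               */
        off_t position;    /* current offset, saved at checkpoint time */
    };

    static struct file_state table[64];   /* the file state table */
    static int nfiles;

    /* The library's replacement for open(): forward to __libc_open and
       record enough state to reopen and reposition the file on restart. */
    int open(const char *path, int flags, ...)
    {
        mode_t mode = 0;
        if (flags & O_CREAT) {
            va_list ap;
            va_start(ap, flags);
            mode = (mode_t)va_arg(ap, int);
            va_end(ap);
        }

        int fd = __libc_open(path, flags, mode);
        if (fd >= 0 && nfiles < 64) {
            struct file_state *fs = &table[nfiles++];
            strncpy(fs->name, path, sizeof fs->name - 1);
            fs->name[sizeof fs->name - 1] = '\0';
            fs->kernel_fd = fd;
            fs->app_fd    = fd;   /* descriptor translation: identity in this sketch */
            fs->mode      = flags;
            fs->position  = 0;
        }
        return fd;
    }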

7.2.2 Signal considerations

Applications can register handlers for signals using the signal() function call. The checkpoint library intercepts calls to signal() and installs its own signal handler instead. It also updates a signal-state table that stores the address of the signal handler function (sighandler) registered for each signal (signum). When a signal is raised, the checkpoint signal handler calls the appropriate application handler given in the signal-state table. While taking checkpoints, the signal-state table is also stored in the checkpoint file in its signal-state section. At the time of restart, the signal-state table is read, and the checkpoint signal handler is installed for all the signals listed in the signal-state table. The checkpoint handler calls the required application handlers when needed.

Signals during checkpoint: The application can potentially receive signals while the checkpoint is in progress. If the application signal handlers are called while a checkpoint is in progress, they can change the state of the memory being checkpointed. This might make the checkpoint inconsistent. Therefore, signals arriving while a checkpoint is in progress need to be


handled carefully. For certain signals, such as SIGKILL and SIGSTOP, the action is fixed and the application terminates without much choice. Signals without any registered handler are simply ignored. For signals with installed handlers, there are two choices:
1. Deliver the signal immediately
2. Postpone the signal delivery until the checkpoint is complete

All signals are classified into one of these two categories. If the signal must be delivered immediately, the memory state of the application might change, making the current checkpoint file inconsistent. Therefore, the current checkpoint must be aborted. The checkpoint routine periodically checks whether a signal has been delivered since the current checkpoint began. If a signal has been delivered, it aborts the current checkpoint and returns to the application.
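A hedged sketch of the signal() interception and signal-state table described above follows; the table size, the wrapper name, and the internal use of sigaction() are illustrative assumptions, not the actual library code.

    /* Sketch of intercepting signal() and recording handlers for restart. */
    #include <signal.h>
    #include <string.h>

    #define MAX_SIG 64                        /* enough slots for this sketch */

    typedef void (*handler_fn)(int);

    static handler_fn app_handler[MAX_SIG];   /* signal-state table */
    static volatile sig_atomic_t checkpoint_in_progress;
    static volatile sig_atomic_t signal_arrived;

    /* Wrapper installed for every signal the application registers. */
    static void checkpoint_handler(int signum)
    {
        if (checkpoint_in_progress)
            signal_arrived = 1;               /* lets the checkpoint routine abort */
        if (signum >= 0 && signum < MAX_SIG && app_handler[signum])
            app_handler[signum](signum);      /* forward to the application handler */
    }

    /* The library's replacement for signal(): record the application handler
       in the signal-state table and install the wrapper via sigaction(). */
    handler_fn signal(int signum, handler_fn handler)
    {
        if (signum < 0 || signum >= MAX_SIG)
            return SIG_ERR;

        handler_fn old = app_handler[signum];
        app_handler[signum] = handler;

        struct sigaction sa;
        memset(&sa, 0, sizeof sa);
        sa.sa_handler = checkpoint_handler;
        sigemptyset(&sa.sa_mask);
        if (sigaction(signum, &sa, NULL) != 0)
            return SIG_ERR;
        return old;
    }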

Checkpoint API

The checkpoint interface consists of the following items:
1. A set of library functions that are used by the application developer to checkpoint-enable the application
2. A set of conventions used to name and store the checkpoint files
3. A set of environment variables used to communicate with the application

The following sections describe each of these components in detail.

Restart: A transparent restart mechanism is provided through the use of the BGLCheckpointInit() function and the BGL_CHKPT_RESTART_SEQNO environment variable. Upon startup, an application is


expected to make a call to BGLCheckpointInit(). The BGLCheckpointInit() function initializes the checkpoint library data structures. Moreover, the BGLCheckpointInit() function checks for the environment variable BGL_CHKPT_RESTART_SEQNO. If the variable is not set, a job launch is assumed and the function returns normally. If the environment variable is set to zero, the individual processes restart from their individual latest consistent global checkpoint. If the variable is set to a positive integer, the application is started from the specified checkpoint sequence number.

Checkpoint and restart functionality

It is often desirable to enable or disable the checkpoint functionality at the time of job launch. Application developers are not required to provide two versions of their programs: one with checkpoint enabled and another with checkpoint disabled. Environment variables are used to transparently enable and disable the checkpoint and restart functionality. The checkpoint library calls check for the environment variable BGL_CHKPT_ENABLED. The checkpoint functionality is invoked only if this environment variable is set to a value of 1.
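Put together, an application might use the checkpoint library roughly as follows. This is a hedged sketch: BGLCheckpointInit() and the two environment variables are named above, but BGLCheckpoint(), the declared signatures, and the checkpoint interval are assumptions made for illustration, and the program would have to be linked against the checkpoint library.

    /* Sketch of checkpoint-enabling an application. */
    #include <stdio.h>

    void BGLCheckpointInit(void);   /* provided by the checkpoint library (signature assumed) */
    void BGLCheckpoint(void);       /* assumed name for "take a checkpoint now"               */

    int main(void)
    {
        /* Reads BGL_CHKPT_RESTART_SEQNO: unset = fresh start, 0 = restart from
           the latest consistent checkpoint, N > 0 = restart from checkpoint N. */
        BGLCheckpointInit();

        for (int step = 0; step < 1000; step++) {
            /* ... one unit of computation ... */

            /* Checkpoints are only taken when BGL_CHKPT_ENABLED=1 was set
               at job launch; otherwise the checkpoint calls do nothing. */
            if (step % 100 == 0)
                BGLCheckpoint();
        }
        return 0;
    }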


High Throughput Computing on Blue Gene/L

The original focus of the Blue Gene project was to create a high-performance computer with a small footprint that consumed a relatively small amount of power. The model of running parallel applications (typically using the Message Passing Interface (MPI)) on Blue Gene/L is known as High Performance Computing (HPC). Recent research has shown that Blue Gene/L also provides a good platform for High Throughput Computing (HTC).

3. Applications: MPI or HTC

As previously stated, the Blue Gene architecture targeted MPI applications for optimal execution. These applications are characterized as Single Instruction Multiple Data (SIMD), with synchronized communication and execution. The tasks in an MPI program cooperate to solve a single problem. Because of this close cooperation, a failure in a single node, software or hardware, requires the termination of all nodes. HTC applications have different characteristics. The code that executes on each node is


independent of work that is being done on another node; communication between nodes is not required. At any specific time, each node is solving its own problem. As a result, a failure on a single node, software or hardware, does not necessitate the termination of all nodes. Initially, in order to run these applications on a Blue Gene supercomputer, a port to the MPI programming model was required. This was done successfully for several applications, but was not an optimal solution in some cases. Some MPI applications may benefit by being ported to the HTC model. In particular, some embarrassingly parallel MPI applications may be good candidates for HTC mode because they do not require communication between the nodes. Also, the failure of one node does not invalidate the work being done on other nodes. A key advantage of the MPI model is a reduction of extraneous bookkeeping by the application. An MPI program is coded to handle data distribution, minimize I/O by having all I/O done by one node (typically rank 0), and distribute the work to the other MPI ranks. When running in HTC mode, the application data needs to be manually split up and distributed to the nodes. This porting effort may be justified to achieve better application reliability and throughput than could be achieved with an MPI model.

HTC mode

In HTC mode, the compute nodes in a partition run independent programs that do not communicate with each other. A launcher program (running on a compute node) requests work from a dispatcher that is running on a remote system connected to the functional network. Based on information provided by the dispatcher, the launcher


program spawns a worker program that performs some task. When running in HTC mode, there are two basic operational differences from the default HPC mode. First, after the worker program completes, the launcher program is reloaded so it can handle additional work requests. Second, if a compute node encounters a soft error, such as a parity error, the entire partition is not terminated. Rather, the control system attempts to reset the single compute node while other compute nodes continue to operate. The control system polls hardware at a regular interval looking for compute nodes in a reset state. If a failed node is discovered and the node failure is not due to a network hardware error, a software reboot is attempted to recover the compute node.

On a remote system that is connected to the functional network, which could be a Service Node or a Front End Node, there is a dispatcher that manages a queue of jobs to be run. There is a client/server relationship between the launcher program and the dispatcher program. After the launcher program is started on a compute node, it connects back to the dispatcher and indicates that it is ready to receive work requests. When the dispatcher has a job for the launcher, it responds to the work request by sending the launcher program the job-related information, such as the name of the worker program executable, arguments, and environment variables. The launcher program then spawns off the worker program. When the worker program completes, the launcher is reloaded and repeats the process of requesting work from the dispatcher.

Launcher program


In the default HPC mode, when a program ends on the compute node, the Compute Node Kernel sends a message to the I/O node that reports how the node ended. The message indicates whether the program ended normally or by a signal, and the exit value or signal number, respectively. The I/O node then forwards the message to the control system. When the control system has received messages for all of the compute nodes, it ends the job. In HTC mode, the Compute Node Kernel handles a program ending differently depending on the program that ended. The Compute Node Kernel records the path of the program that is first submitted with the job, which is the launcher program. When a program other than the launcher program (that is, the worker program) ends, the Compute Node Kernel records the exit status of the worker program, and then reloads and restarts the launcher program. If the worker program ended by a signal, the Compute Node Kernel generates an event to record the signal number that ended the program. If the worker program ended normally, no information is logged. The launcher program can retrieve the exit status of the worker program using a Compute Node Kernel system call.


    Launcher run sequence

Since no message is sent to the control system indicating that a program ended, the job continues running. The effect is to have a continually running program on the compute nodes. To reduce the load on the file system, the launcher program is cached in memory on the I/O node. When the Compute Node Kernel requests to reload the launcher program, it does not need to be read from the file system but can be sent directly to the compute node from memory. Since the launcher is typically a small executable, it does not require much additional memory to cache it. When the launcher program ends, the Compute Node Kernel reports that a program ended to the control system, as it does in HPC mode. This allows the launcher program to cleanly end on the compute node and the control system to end the job.

    Template for an asynchronous task dispatch subsystem

This section gives a high-level overview of one possible method to implement an asynchronous task dispatch subsystem. The example consists of a client, a dispatcher program, and the launcher program.


[Figure: Asynchronous task dispatch subsystem model]

    Clients

The client implements a task submission thread that publishes task submission messages onto a work queue. It also implements a task verification thread that listens for task completion messages. When the dispatch system informs the client that the task has terminated, and optionally supplies an exit status, the client is then responsible for resolving task completion status and for taking actions, including relaunching tasks. Keep in mind that the client is ultimately responsible for ensuring successful completion of the job.

Message queue

The design is based on publishing and subscribing to message queues. Clients publish task submission messages onto a single work queue. Dispatcher programs subscribe to the work queue and process task submission messages. Clients subscribe to task completion messages. Messages consist of text data comprising the work to be


performed, a job identifier, a task identifier, and a message type. Job identifiers are generated by the client process and are required to be globally unique. Task identifiers are unique within the job session. The message type field for task submission messages is used by the dispatcher to distinguish work of high priority from work of normal priority. Responsibility for reliable message delivery belongs to the message queueing system.
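As a hedged sketch of the fields just listed, a task submission message might be represented in C as below; the struct name, field sizes, and plain-text values are illustrative assumptions, not a defined wire format.

    /* Sketch of the fields carried by a task submission message. */
    #include <stdio.h>

    struct task_message {
        char job_id[64];    /* generated by the client; globally unique        */
        char task_id[32];   /* unique within the job session                   */
        int  msg_type;      /* lets the dispatcher separate high-priority work
                               from normal-priority work                       */
        char command[512];  /* the work to perform: worker executable,
                               arguments, and environment                      */
    };

    int main(void)
    {
        struct task_message msg = {
            .job_id   = "client42-job-0001",
            .task_id  = "task-17",
            .msg_type = 0,                                  /* 0 = normal priority (assumed) */
            .command  = "/bgl/apps/worker --input chunk17.dat",
        };
        printf("%s %s %d %s\n", msg.job_id, msg.task_id, msg.msg_type, msg.command);
        return 0;
    }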

Dispatcher program

The dispatcher program first pulls a task submission message off the work queue. Then it waits on a socket for a launcher connection and reads the launcher ID from the socket. It writes the task into the socket, and the association between task and launcher is stored in a table. The table stores the last task dispatched to each launcher program. A subsequent connection from that launcher is an indication that the last task has completed, and the task completion message can be published back to the client. Figure 12-3 shows the entire cycle of a job submitted in HTC mode.


[Figure: HTC job cycle]

The intention of this design is to optimize the launcher program. The dispatcher program spends little time between connect and dispatch, so latency volatility is mainly due to the waiting time for dispatcher program connections. After rebooting, the launcher program connects to the dispatcher program and passes the completion information back to the dispatcher program. To assist task status resolution, the Compute Node Kernel stores the exit status of the last running process in a buffer. After the launcher program restarts, the contents of this buffer can be written to the dispatcher and stored in the task completion message.

    Launcher program

    The launcher program is intentionally kept simple.


Arguments to the launcher program describe a socket connection to the dispatcher. When the launcher program starts, it connects to this socket, writes its identity into the socket, and waits for a task message. Upon receipt of the task message, the launcher parses the message and calls the execve system call to execute the task. When the task exits (for any reason), the Compute Node Kernel restarts the launcher program again. The launcher program is not a container for the application. Therefore, regardless of what happens to the application, the launcher program will not fail to restart.
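The flow just described might look roughly like the following C sketch. It is a hedged illustration, not the actual launcher: the command-line interface, the newline-terminated identity string, and the whitespace-separated task message format are assumptions made for the example.

    /* Sketch of a launcher: connect, announce identity, receive one task,
     * then execve() into the worker. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>
    #include <sys/socket.h>

    int main(int argc, char *argv[], char *envp[])
    {
        if (argc < 3) {
            fprintf(stderr, "usage: launcher <dispatcher-ip> <port>\n");
            return 1;
        }

        /* Connect to the dispatcher socket named on the command line. */
        int sock = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in addr;
        memset(&addr, 0, sizeof addr);
        addr.sin_family = AF_INET;
        addr.sin_port   = htons((unsigned short)atoi(argv[2]));
        inet_pton(AF_INET, argv[1], &addr.sin_addr);
        if (sock < 0 || connect(sock, (struct sockaddr *)&addr, sizeof addr) != 0)
            return 1;

        /* Announce this launcher's identity, then wait for one task message. */
        char id[64];
        snprintf(id, sizeof id, "launcher-%ld\n", (long)getpid());
        write(sock, id, strlen(id));

        char task[1024];
        ssize_t n = read(sock, task, sizeof task - 1);
        close(sock);
        if (n <= 0)
            return 1;
        task[n] = '\0';

        /* Parse the message into an argv for the worker (format assumed to be
           a whitespace-separated executable name and arguments). */
        char *worker_argv[64];
        int i = 0;
        for (char *tok = strtok(task, " \t\r\n"); tok && i < 63; tok = strtok(NULL, " \t\r\n"))
            worker_argv[i++] = tok;
        worker_argv[i] = NULL;
        if (i == 0)
            return 1;

        /* Replace this process with the worker. When the worker exits, the
           Compute Node Kernel reloads and restarts the launcher. */
        execve(worker_argv[0], worker_argv, envp);
        return 1;   /* reached only if execve() failed */
    }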

I/O considerations

Running in HTC mode changes the I/O patterns. When a program reads and writes to the file system, it is typically done with small buffer sizes. While the administrator can configure the buffer size, the most common sizes are 256 KB or 512 KB. When running in HTC mode, loading the worker program requires reading the complete executable into memory and sending it to a compute node. An executable is at least several megabytes and can be many megabytes. To achieve the fastest I/O throughput, a low compute node to I/O node ratio is best.

4. ADVANTAGES

1. Scalable.
2. Less space (half of a tennis court).
3. Avoids the heat problems most supercomputers face.
4. Speed.

5. DISADVANTAGES

    Important safety notices

Here are some important general comments about the Blue Gene/L system regarding safety.


1. This equipment must be installed by trained service personnel in a restricted access location as defined by the NEC (U.S. National Electric Code) and IEC 60950, The Standard for Safety of Information Technology Equipment. (C033)
2. The doors and covers to the product are to be closed at all times except for service by trained service personnel. All covers must be replaced and doors locked at the conclusion of the service operation. (C013)
3. Servicing of this product or unit is to be performed by trained service personnel only. (C032)
4. This product is equipped with a 4-wire (three-phase and ground) power cable. Use this power cable with a properly grounded electrical outlet to avoid electrical shock. (C019)
5. To prevent a possible electric shock from touching two surfaces with different protective ground (earth), use one hand, when possible, to connect or disconnect signal cables.
6. An electrical outlet that is not correctly wired could place hazardous voltage on the metal parts of the system or the devices that attach to the system. It is the responsibility of the customer to ensure that the outlet is correctly wired and grounded to prevent an electrical shock. (D004)

6. CONCLUSIONS

1. BG/L shows that a cellular architecture for supercomputers is feasible.
2. Higher performance with a much smaller size and power requirement.
3. In theory, there are no limits to the scalability of a Blue Gene system.

7. REFERENCES

1. IBM Redbooks publications


2. General Parallel File System (GPFS) for Clusters: Concepts, Planning, and Installation, GA22-7968
3. IBM General Information Manual, Installation Manual - Physical Planning, GC22-7072
4. LoadLeveler for AIX 5L and Linux V3.2: Using and Administering, SA22-7881
5. PPC440x5 CPU Core User's Manual, http://www-306.ibm.com/chips/techlib/techlib.nsf/products/PowerPC_440_Embedded_Core