
    A SEMINAR REPORT

    ON

BLUE GENE/L

Submitted in partial fulfillment of the requirements for the award of the degree of

    BACHELOR OF TECHNOLOGY

    IN

    COMPUTER SCIENCE AND ENGINEERING

    BY

M.S. RAMA KRISHNA (06M11A05A3)

Department of Computer Science and Engineering

BANDARI SRINIVAS INSTITUTE OF TECHNOLOGY, CHEVELLA

    Affiliated to JNTU, HYDERABAD, Approved by AICTE

2009-2010


    BANDARI SRINIVAS INSTITUTE OF TECHNOLOGY

Chevella, R.R. Dist., A.P.

(Approved by AICTE, New Delhi; Affiliated to JNTU, Hyderabad)

    Date: 30-03-2010

    CERTIFICATE

This is to certify that this is a bona fide record of the dissertation work entitled BLUE GENE/L done by M.S. RAMA KRISHNA, bearing roll no. 06M11A05A3, which has been submitted in partial fulfillment of the requirements for the award of the degree of Bachelor of Technology in Computer Science and Engineering from Jawaharlal Nehru Technological University, Hyderabad, for the academic year 2009-2010. This work has not been submitted to any other university for the award of any degree or diploma.

(Mr. VIJAY KUMAR)                              (Mr. CH. RAJ KISHORE)


    ACKNOWLEDGEMENT

I wholeheartedly thank BANDARI SRINIVAS INSTITUTE OF TECHNOLOGY for giving me an opportunity to present this seminar report on a technical paper in the college.

I express my deep sense of gratitude to Mr. CH. RAJA KISHORE, Head of the Department of Computer Science and Engineering at BANDARI SRINIVAS INSTITUTE OF TECHNOLOGY, for his valuable guidance, inspiration and encouragement in presenting my seminar report.

I am also thankful to Mr. VIJAY KUMAR, my internal guide, for his valuable academic guidance and support.

I wholeheartedly thank all the staff members of the Department of Computer Science and Engineering for their support and encouragement in doing my seminar work.

Lastly, I thank all those who have helped me directly and indirectly in doing this seminar work successfully.

    NAME: M.S.RAMAKRISHNA

    ROLL NO: 06M11A05A3


    ABSTRACT

Blue Gene is a massively parallel computer being developed at the IBM Thomas J. Watson Research Center. Blue Gene represents a hundred-fold improvement in performance over the fastest supercomputers of today. It will achieve 1 PetaFLOP/sec through unprecedented levels of parallelism, in excess of 4,000,000 threads of execution. The Blue Gene project has two important goals: to advance our understanding of biologically important processes, and to advance knowledge of cellular architectures (massively parallel systems built of single-chip cells that integrate processors, memory and communication) and of the software needed to exploit them effectively. This massively parallel system of 65,536 nodes is based on a new architecture that exploits system-on-a-chip technology to deliver a target peak processing power of 360 teraFLOPS (trillion floating-point operations per second). The machine is scheduled to be operational in the 2004-2005 time frame, at performance and power consumption/performance targets unobtainable with conventional architectures.


In November 2001 IBM announced a partnership with Lawrence Livermore National Laboratory to build the Blue Gene/L (BG/L) supercomputer, a 65,536-node machine designed around embedded PowerPC processors. Through the use of system-on-a-chip integration coupled with a highly scalable cellular architecture, Blue Gene/L will deliver 180 or 360 teraFLOPS of peak computing power, depending on the utilization mode. Blue Gene/L represents a new level of scalability for parallel systems. Whereas existing large-scale systems range in size from hundreds to a few thousand compute nodes, Blue Gene/L makes a jump of almost two orders of magnitude. It is reasonably clear that such machines, in the near future at least, will require a departure from the architectures of current parallel supercomputers, which use a few thousand commodity microprocessors. With current technology, it would take around a million microprocessors to achieve petaFLOPS performance. Clearly, power requirements and cost considerations alone preclude this option. Using such a cellular design, petaFLOPS performance will be reached within the


next two to three years, especially since IBM has announced the Blue Gene project aimed at building such a machine.


Index

Contents

Chapter 1  Introduction
Chapter 2  Detailed Report
Chapter 3  Applications
Chapter 4  Advantages
Chapter 5  Disadvantages
Chapter 6  Conclusions
References


1. Introduction

Blue Gene is a computer architecture project designed to produce several supercomputers, designed to reach operating speeds in the PFLOPS (petaFLOPS) range, and currently reaching sustained speeds of nearly 500 TFLOPS (teraFLOPS). It is a cooperative project among IBM (particularly IBM Rochester and the Thomas J. Watson Research Center), the Lawrence Livermore National Laboratory, the United States Department of Energy (which is partially funding the project), and academia. There are four Blue Gene projects in development: Blue Gene/L, Blue Gene/C, Blue Gene/P, and Blue Gene/Q. The project was awarded the National Medal of Technology and Innovation by U.S. President Barack Obama on September 18, 2009; the president bestowed the award on October 7, 2009.[1]

2. DESCRIPTION

The first computer in the Blue Gene series, Blue Gene/L, developed through a partnership with Lawrence Livermore National Laboratory (LLNL), originally had a theoretical peak performance of 360 TFLOPS, and scored over 280 TFLOPS sustained on the Linpack benchmark. After an upgrade in 2007 the performance increased to 478 TFLOPS sustained and 596 TFLOPS peak. The term Blue Gene/L sometimes refers to the computer installed at LLNL and sometimes to the architecture of that computer. As of November 2006, there were 27 computers on the Top500 list using the Blue Gene/L architecture. All these computers are


listed as having an architecture of eServer Blue Gene Solution.

In December 1999, IBM announced a $100 million research initiative for a five-year effort to build a massively parallel computer, to be applied to the study of biomolecular phenomena such as protein folding. The project has two main goals: to advance our understanding of the mechanisms behind protein folding via large-scale simulation, and to explore novel ideas in massively parallel machine architecture and software. This project should enable biomolecular simulations that are orders of magnitude larger than current technology permits. Major areas of investigation include: how to use this novel platform to effectively meet its scientific goals, how to make such massively parallel machines more usable, and how to achieve performance targets at a reasonable cost through novel machine architectures. The design is built largely around the previous QCDSP and QCDOC supercomputers.

In November 2001, Lawrence Livermore National Laboratory joined IBM as a research partner for Blue Gene. On September 29, 2004, IBM announced that a Blue Gene/L prototype at IBM Rochester (Minnesota) had overtaken NEC's Earth Simulator as the fastest computer in the world, with a speed of 36.01 TFLOPS on the Linpack benchmark, beating Earth Simulator's 35.86 TFLOPS. This was achieved with an 8-cabinet system, with each cabinet holding 1,024 compute nodes. Upon doubling this configuration to 16 cabinets, the machine reached a speed of 70.72 TFLOPS by November 2004, taking first place on the Top500 list.


On March 24, 2005, the US Department of Energy announced that the Blue Gene/L installation at LLNL broke its speed record, reaching 135.5 TFLOPS. This feat was possible because of doubling the number of cabinets to 32. On the Top500 list,[2] Blue Gene/L installations across several sites worldwide took 3 of the top 10 positions, and 13 of the top 64. Three racks of Blue Gene/L are housed at the San Diego Supercomputer Center and are available for academic research.

On October 27, 2005, LLNL and IBM announced that Blue Gene/L had once again broken its speed record, reaching 280.6 TFLOPS on Linpack, upon reaching its final configuration of 65,536 compute nodes (i.e., 2^16 nodes) and an additional 1,024 I/O nodes in 64 air-cooled cabinets. The LLNL Blue Gene/L uses Lustre to access multiple filesystems in the 600 TB - 1 PB range.[3]

Blue Gene/L is also the first supercomputer ever to run over 100 TFLOPS sustained on a real-world application, namely a three-dimensional molecular dynamics code (ddcMD), simulating solidification (nucleation and growth processes) of molten metal under high pressure and temperature conditions. This achievement won the 2005 Gordon Bell Prize. On June 22, 2006, NNSA and IBM announced that Blue Gene/L had achieved 207.3 TFLOPS on a quantum chemical application (Qbox).[4] On November 14, 2006, at Supercomputing 2006,[5] Blue Gene/L was awarded the winning prize in all HPC Challenge classes of awards.[6] On April 27, 2007, a team from the IBM Almaden Research Center and the University of Nevada ran an artificial neural network almost half as complex as the brain of a


mouse for the equivalent of a second (the network was run at 1/10 of normal speed for 10 seconds).[7]

In November 2007, the LLNL Blue Gene/L remained in the number one spot as the world's fastest supercomputer. It had been upgraded since the previous measurement, and was then almost three times as fast as the second fastest, a Blue Gene/P system. On June 18, 2008, the new Top500 list marked the first time a Blue Gene system was not the leader in the Top500 since it had assumed that position, being topped by IBM's Cell-based Roadrunner system, which was the only system to surpass the mythical petaFLOPS mark. Top500 also announced that the Cray XT5 Jaguar housed at OLCF was the fastest supercomputer in the world for open science.

Major features

The Blue Gene/L supercomputer is unique in the following aspects:
1. Trading the speed of processors for lower power consumption.
2. Dual processors per node with two working modes: coprocessor (1 user process/node; computation and communication work is shared by the two processors) and virtual node (2 user processes/node).
3. System-on-a-chip design.
4. A large number of nodes (scalable in increments of 1,024 up to at least 65,536).
5. Three-dimensional torus interconnect with auxiliary networks for global communications, I/O, and management.
6. Lightweight OS per node for minimum system overhead (computational noise).

Architecture


[Figure: One Blue Gene/L node board]
[Figure: A schematic overview of a Blue Gene/L supercomputer]

Each Compute or I/O node is a single ASIC with associated DRAM memory chips. The ASIC integrates two 700 MHz PowerPC 440 embedded processors, each with a double-pipeline double-precision Floating Point Unit (FPU), a cache sub-system with built-in DRAM controller, and the logic to support multiple communication sub-systems. The dual FPUs give each Blue Gene/L node a theoretical peak performance of 5.6 GFLOPS (gigaFLOPS). Node CPUs are not cache coherent with one another.

Compute nodes are packaged two per compute card, with 16 compute cards plus up to 2 I/O nodes per node board. There are 32 node boards per cabinet/rack.[9] By integration of all essential sub-systems on a single chip, each Compute or I/O node dissipates low power (about 17 watts, including DRAMs). This allows very aggressive packaging of up to 1,024 compute nodes plus additional I/O nodes in the standard 19" cabinet, within reasonable limits of electrical power supply and air cooling. The performance metrics in terms of FLOPS per watt, FLOPS per m² of floor space, and FLOPS per unit cost allow scaling up to very high performance.

Each Blue Gene/L node is attached to three parallel communications networks: a 3D toroidal network for peer-to-peer communication between compute nodes, a collective network for collective communication, and a global interrupt network for fast barriers. The I/O nodes, which run the Linux operating system, provide communication with the world via an Ethernet network. The I/O nodes also handle the filesystem operations on


behalf of the compute nodes. Finally, a separate and private Ethernet network provides access to any node for configuration, booting and diagnostics.

Blue Gene/L compute nodes use a minimal operating system supporting a single user program. Only a subset of POSIX calls are supported, and only one process may be run at a time. Programmers need to implement green threads in order to simulate local concurrency. Application development is usually performed in C, C++, or Fortran using MPI for communication. However, some scripting languages such as Ruby have been ported to the compute nodes.[10]

To allow multiple programs to run concurrently, a Blue Gene/L system can be partitioned into electronically isolated sets of nodes. The number of nodes in a partition must be a positive integer power of 2, and must contain at least 2^5 = 32 nodes. The maximum partition is all nodes in the computer. To run a program on Blue Gene/L, a partition of the computer must first be reserved. The program is then run on all the nodes within the partition, and no other program may access nodes within the partition while it is in use. Upon completion, the partition nodes are released for future programs to use.
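As a small worked illustration of the partition-size rule just described (a hedged sketch, not part of the Blue Gene/L system software; the function name and the machine-size argument are made up for the example), the following C fragment checks whether a requested node count is a valid partition size:

    /* Partition-size rule: a power of two, at least 2^5 = 32 nodes,
     * and no larger than the whole machine. */
    #include <stdio.h>

    static int is_valid_partition(unsigned int nodes, unsigned int machine_size)
    {
        if (nodes < 32 || nodes > machine_size)
            return 0;
        return (nodes & (nodes - 1)) == 0;   /* power-of-two test */
    }

    int main(void)
    {
        printf("%d\n", is_valid_partition(512, 65536));   /* 1: valid       */
        printf("%d\n", is_valid_partition(48,  65536));   /* 0: not a power of two */
        return 0;
    }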


With so many nodes, component failures are inevitable. The system is able to electrically isolate faulty hardware to allow the machine to continue to run.

    OPERATING SYSTEMS

Front-end nodes are commodity PCs running Linux. I/O nodes run a customized Linux kernel. Compute nodes use an extremely lightweight custom kernel. The service node is a single multiprocessor machine running a custom OS.

COMPUTE NODE KERNEL

Single user, dual-threaded. Flat address space, no paging. Physical resources are memory-mapped. Provides standard POSIX functionality (mostly). Two execution modes:
1. Virtual node mode
2. Coprocessor mode

SERVICE NODE OS

The Core Management and Control System (CMCS) is BG/L's global operating system. It comprises MMCS (Midplane Monitoring and Control System), CIOMAN (Control and I/O Manager), and a DB2 relational database.


Programming modes

This chapter provides information about the way in which the Message Passing Interface (MPI) is implemented and used on Blue Gene/L. There are two main modes in which you can use Blue Gene/L:
1. Communication Coprocessor Mode
2. Virtual Node Mode

1. Communication Coprocessor Mode

In the default mode of operation of Blue Gene/L, named Communication Coprocessor Mode, each physical compute node executes a single compute process. The Blue Gene/L system software treats the two processors in a compute node asymmetrically. One of the processors (CPU 0) behaves as a main processor, running the main thread of the compute process. The other processor (CPU 1) behaves as an offload engine (coprocessor) that only executes specific operations. The coprocessor is used primarily for offloading communication functions. It can also be used for running application-level coroutines.

2. Virtual Node Mode

The Compute Node Kernel in the compute nodes also supports a Virtual Node Mode of operation for the machine. In that mode, the kernel runs two separate processes in each compute node. Node resources (primarily the memory and the torus network) are shared by both processes. In Virtual Node Mode, an application can use both processors in a node simply by doubling its number of MPI tasks, without explicitly handling cache coherence issues.


The now-distinct MPI tasks running on the two CPUs of a compute node have to communicate with each other. This problem was solved by implementing a virtual torus device, serviced by a virtual packet layer, in the scratchpad memory.

In Virtual Node Mode, the two cores of a compute node act as different processes. Each has its own rank in the message layer. The message layer supports Virtual Node Mode by providing a correct torus-to-rank mapping and first in, first out (FIFO) pinning in this mode. The hardware FIFOs are shared equally between the processes. Torus coordinates are expressed by quadruplets instead of triplets. In Virtual Node Mode, communication between the two processors in a compute node cannot be done over the network hardware. Instead, it is done via a region of memory, called the scratchpad, to which both processors have access. Virtual FIFOs make portions of the scratchpad look like a send FIFO to one of the processors and a receive FIFO to the other. Access to the virtual FIFOs is mediated with help from the hardware lockboxes. From an application perspective, virtual nodes behave like physical nodes, but with less memory. Each virtual node executes one compute process. Processes in different virtual nodes, even those allocated in the same compute node, communicate only through messages. Processes running in Virtual Node Mode cannot invoke coroutines. The Blue Gene/L MPI implementation supports Virtual Node Mode operations by sharing the communications resources of a physical compute node between the two compute processes that execute on that physical node. The low-level communications library of Blue Gene/L, that is, the message layer, virtualizes these


communications resources into logical units that each process can use independently.

Deciding which mode to use

Whether you choose to use Communication Coprocessor Mode or Virtual Node Mode depends largely on the type of application you plan to execute. I/O-intensive tasks that require a relatively large amount of data interchange between compute nodes benefit more from Communication Coprocessor Mode. Applications that are primarily CPU-bound, and do not have large working memory requirements (the application only gets half of the node memory), run more quickly in Virtual Node Mode.
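To make the two modes concrete, here is a minimal MPI program in C of the kind these modes apply to. It is a generic sketch rather than code from the Blue Gene/L documentation; the same source runs in either mode, since in Communication Coprocessor Mode one MPI task runs per compute node while in Virtual Node Mode the task count is simply doubled at job launch.

    /* Minimal MPI sketch: one task per node in Communication Coprocessor
     * Mode, two tasks per node in Virtual Node Mode. */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank, size;
        double local, global;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        local = (double)rank;                  /* per-task partial result */
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

        if (rank == 0)
            printf("tasks=%d sum=%f\n", size, global);

        MPI_Finalize();
        return 0;
    }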

System calls supported by the Compute Node Kernel

This chapter discusses the system calls (syscalls) that are supported by the Compute Node Kernel. It is important for you to understand which functions can be called, and


perhaps more importantly, which ones cannot be called, by your application running on Blue Gene/L.

Introduction to the Compute Node Kernel

The role of the kernel on the Compute Node is to create a Linux-like environment for the execution of a user process. It is not a full Linux kernel implementation, but rather implements a subset of POSIX functionality. The Compute Node Kernel is a single-process operating system. It is designed to provide the services that are needed by applications which are expected to run on Blue Gene/L, but not for all applications. The Compute Node Kernel is not intended to run system administration functions from the compute node. To achieve the best reliability, a small and simple kernel is a design goal; this also enables a simpler checkpoint function. The compute node application never runs as the root user. In fact, it runs as the same user (uid) and group (gid) under which the job was submitted.

    System calls

The Compute Node Kernel system calls are subdivided into the following categories:
1. File I/O
2. Directory operations
3. Time
4. Process information
5. Signals
6. Miscellaneous
7. Sockets
8. Compute Node Kernel


    Additional Compute Node Kernel application support

This section provides details about additional support provided to application developers by the Compute Node Kernel.

3.3.1 Allocating memory regions with specific L1 cache attributes

Each PowerPC 440 core on a compute node has a 32 KB L1 data cache that is 64-way set associative and uses a 32-byte cache line. A load or store operation to a virtual address is translated by the hardware to a real address using the Translation Lookaside Buffer (TLB). The real address is used to select the set. If the address is available in the data cache, it is returned to the processor without needing to access lower levels of the memory subsystem. If the address is not in the data cache (a cache miss), a cache line is evicted from the data cache and is replaced with the cache line containing the address. The way to evict from the data cache is selected using a round-robin algorithm.

The L1 data cache can be divided into two regions: a normal region and a transient region. The number of ways to use for each region can be configured by the application. The Blue Gene/L memory subsystem supports the following L1 data cache attributes:

Cache-inhibited or cached. Memory with the cache-inhibited attribute causes all load and store operations to access the data from lower levels of the memory subsystem. Memory with the cached attribute might use the data cache for load and store operations. The default attribute for application memory is cached.

Store without allocate (SWOA) or store with allocate (SWA). Memory with the SWOA attribute bypasses the L1 data cache on a cache miss for a store


operation, and the data is stored directly to lower levels of the memory subsystem. Memory with the SWA attribute allocates a line in the L1 data cache when there is a cache miss for a store operation on the memory. The default attribute for application memory is SWA.

Write-through or write-back. Memory with the write-through attribute is written through to the lower levels of the memory subsystem for store operations. If the memory also exists in the L1 data cache, it is written to the data cache and the cache line is marked as clean. Memory with the write-back attribute is written to the L1 data cache, and the cache line is marked as dirty.

Transient or normal. Memory with the transient attribute uses the transient region of the L1 data cache. Memory with the normal attribute uses the normal region of the L1 data cache. By default, the L1 data cache is configured without a transient region and all application memory uses the normal region.

Checkpoint and restart support

Why use checkpoint and restart? Given the scale of the Blue Gene/L system, faults are expected to be the norm rather than the exception. This is unfortunately inevitable, given the vast number of individual hardware processors and other components involved in running the system. Checkpoint and restart are among the primary techniques for fault recovery. A special user-level checkpoint library has been developed for Blue Gene/L applications. Using this library, application programs can take a checkpoint of their program state at appropriate stages and can be restarted later from their last successful checkpoint.


Why should you be interested in this support? Numerous scenarios indicate that its use is warranted; here are two examples:

1. Your application is a long-running one. You do not want it to fail a long time into a run, losing all the calculations made up until the failure. Checkpoint and restart allow you to restart the application at the last checkpoint position, losing a much smaller slice of processing time.
2. You are given access to a Blue Gene/L system for relatively small increments of time, and you know that your application run will take longer than your allotted amount of processing time. Checkpoint and restart allow you to execute your application to completion in distinct chunks, rather than in one continuous period of time.

These are just two of many reasons to use checkpoint and restart support in your Blue Gene/L applications.

7.2.1 Input/output considerations

All the external I/O calls made from a program are shipped to the corresponding I/O Node using a function-shipping procedure implemented in the Compute Node Kernel. The checkpoint library intercepts calls to the five main file I/O functions: open, close, read, write, and lseek. The function name open is a weak alias that maps to the _libc_open function. The checkpoint library intercepts this call and provides its own implementation of open that internally uses the _libc_open function. The library maintains a file state table that stores the file name, current file position, and mode of all the files that are currently open. The table also maintains a translation that translates the file descriptors used by the Compute Node Kernel to another


set of file descriptors to be used by the application. While taking a checkpoint, the file state table is also stored in the checkpoint file. Upon a restart, these tables are read; the corresponding files are opened in the required mode, and the file pointers are positioned at the desired locations as given in the checkpoint file. The current design assumes that the programs either always read the files or write the files sequentially. A read followed by an overlapping write, or a write followed by an overlapping read, is not supported.
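The interception idea can be sketched in C as follows. This is a hedged illustration, not the actual checkpoint library: the double-underscore symbol __libc_open, the table layout, and the fixed table size are assumptions made for the example.

    /* Sketch of intercepting open() while recording state for restart. */
    #include <fcntl.h>
    #include <stdarg.h>
    #include <string.h>
    #include <sys/types.h>

    extern int __libc_open(const char *path, int flags, ...);

    struct file_state {
        char  name[256];   /* file name                                */
        int   kernel_fd;   /* descriptor returned by the kernel        */
        int   app_fd;      /* descriptor handed to the application     */
        int   mode;        /* open flags                               */
        off_t position;    /* current offset, saved at checkpoint time */
    };

    static struct file_state table[64];   /* the file state table */
    static int nfiles;

    /* The library's replacement for open(): forward to __libc_open and
       record enough state to reopen and reposition the file on restart. */
    int open(const char *path, int flags, ...)
    {
        mode_t mode = 0;
        if (flags & O_CREAT) {
            va_list ap;
            va_start(ap, flags);
            mode = (mode_t)va_arg(ap, int);
            va_end(ap);
        }

        int fd = __libc_open(path, flags, mode);
        if (fd >= 0 && nfiles < 64) {
            struct file_state *fs = &table[nfiles++];
            strncpy(fs->name, path, sizeof fs->name - 1);
            fs->name[sizeof fs->name - 1] = '\0';
            fs->kernel_fd = fd;
            fs->app_fd    = fd;   /* descriptor translation: identity in this sketch */
            fs->mode      = flags;
            fs->position  = 0;
        }
        return fd;
    }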

7.2.2 Signal considerations

Applications can register handlers for signals using the signal() function call. The checkpoint library intercepts calls to signal() and installs its own signal handler instead. It also updates a signal-state table that stores the address of the signal handler function (sighandler) registered for each signal (signum). When a signal is raised, the checkpoint signal handler calls the appropriate application handler given in the signal-state table. While taking checkpoints, the signal-state table is also stored in the checkpoint file in its signal-state section. At the time of restart, the signal-state table is read, and the checkpoint signal handler is installed for all the signals listed in the signal-state table. The checkpoint handler calls the required application handlers when needed.

Signals during checkpoint: The application can potentially receive signals while the checkpoint is in progress. If the application signal handlers are called while a checkpoint is in progress, they can change the state of the memory being checkpointed. This might make the checkpoint inconsistent. Therefore, signals arriving while a checkpoint is in progress need to be


handled carefully. For certain signals, such as SIGKILL and SIGSTOP, the action is fixed and the application terminates without much choice. Signals without any registered handler are simply ignored. For signals with installed handlers, there are two choices:
1. Deliver the signal immediately
2. Postpone the signal delivery until the checkpoint is complete

All signals are classified into one of these two categories. If the signal must be delivered immediately, the memory state of the application might change, making the current checkpoint file inconsistent. Therefore, the current checkpoint must be aborted. The checkpoint routine periodically checks whether a signal has been delivered since the current checkpoint began. If a signal has been delivered, it aborts the current checkpoint and returns to the application.
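A hedged sketch of the signal() interception and signal-state table described above follows; the table size, the wrapper name, and the internal use of sigaction() are illustrative assumptions, not the actual library code.

    /* Sketch of intercepting signal() and recording handlers for restart. */
    #include <signal.h>
    #include <string.h>

    #define MAX_SIG 64                        /* enough slots for this sketch */

    typedef void (*handler_fn)(int);

    static handler_fn app_handler[MAX_SIG];   /* signal-state table */
    static volatile sig_atomic_t checkpoint_in_progress;
    static volatile sig_atomic_t signal_arrived;

    /* Wrapper installed for every signal the application registers. */
    static void checkpoint_handler(int signum)
    {
        if (checkpoint_in_progress)
            signal_arrived = 1;               /* lets the checkpoint routine abort */
        if (signum >= 0 && signum < MAX_SIG && app_handler[signum])
            app_handler[signum](signum);      /* forward to the application handler */
    }

    /* The library's replacement for signal(): record the application handler
       in the signal-state table and install the wrapper via sigaction(). */
    handler_fn signal(int signum, handler_fn handler)
    {
        if (signum < 0 || signum >= MAX_SIG)
            return SIG_ERR;

        handler_fn old = app_handler[signum];
        app_handler[signum] = handler;

        struct sigaction sa;
        memset(&sa, 0, sizeof sa);
        sa.sa_handler = checkpoint_handler;
        sigemptyset(&sa.sa_mask);
        if (sigaction(signum, &sa, NULL) != 0)
            return SIG_ERR;
        return old;
    }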

Checkpoint API

The checkpoint interface consists of the following items:
1. A set of library functions that are used by the application developer to checkpoint-enable the application
2. A set of conventions used to name and store the checkpoint files
3. A set of environment variables used to communicate with the application

The following sections describe each of these components in detail.

Restart: A transparent restart mechanism is provided through the use of the BGLCheckpointInit() function and the BGL_CHKPT_RESTART_SEQNO environment variable. Upon startup, an application is


expected to make a call to BGLCheckpointInit(). The BGLCheckpointInit() function initializes the checkpoint library data structures. Moreover, the BGLCheckpointInit() function checks for the environment variable BGL_CHKPT_RESTART_SEQNO. If the variable is not set, a job launch is assumed and the function returns normally. If the environment variable is set to zero, the individual processes restart from their individual latest consistent global checkpoint. If the variable is set to a positive integer, the application is started from the specified checkpoint sequence number.

Checkpoint and restart functionality

It is often desirable to enable or disable the checkpoint functionality at the time of job launch. Application developers are not required to provide two versions of their programs: one with checkpoint enabled and another with checkpoint disabled. Environment variables are used to transparently enable and disable the checkpoint and restart functionality. The checkpoint library calls check for the environment variable BGL_CHKPT_ENABLED. The checkpoint functionality is invoked only if this environment variable is set to a value of 1.
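Put together, an application might use the checkpoint library roughly as follows. This is a hedged sketch: BGLCheckpointInit() and the two environment variables are named above, but BGLCheckpoint(), the declared signatures, and the checkpoint interval are assumptions made for illustration, and the program would have to be linked against the checkpoint library.

    /* Sketch of checkpoint-enabling an application. */
    #include <stdio.h>

    void BGLCheckpointInit(void);   /* provided by the checkpoint library (signature assumed) */
    void BGLCheckpoint(void);       /* assumed name for "take a checkpoint now"               */

    int main(void)
    {
        /* Reads BGL_CHKPT_RESTART_SEQNO: unset = fresh start, 0 = restart from
           the latest consistent checkpoint, N > 0 = restart from checkpoint N. */
        BGLCheckpointInit();

        for (int step = 0; step < 1000; step++) {
            /* ... one unit of computation ... */

            /* Checkpoints are only taken when BGL_CHKPT_ENABLED=1 was set
               at job launch; otherwise the checkpoint calls do nothing. */
            if (step % 100 == 0)
                BGLCheckpoint();
        }
        return 0;
    }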


High Throughput Computing on Blue Gene/L

The original focus of the Blue Gene project was to create a high-performance computer with a small footprint that consumed a relatively small amount of power. The model of running parallel applications (typically using the Message Passing Interface (MPI)) on Blue Gene/L is known as High Performance Computing (HPC). Recent research has shown that Blue Gene/L also provides a good platform for High Throughput Computing (HTC).

3. Applications: MPI or HTC

As previously stated, the Blue Gene architecture targeted MPI applications for optimal execution. These applications are characterized as Single Instruction Multiple Data (SIMD), with synchronized communication and execution. The tasks in an MPI program cooperate to solve a single problem. Because of this close cooperation, a failure in a single node, software or hardware, requires the termination of all nodes. HTC applications have different characteristics. The code that executes on each node is


independent of work that is being done on another node; communication between nodes is not required. At any specific time, each node is solving its own problem. As a result, a failure on a single node, software or hardware, does not necessitate the termination of all nodes. Initially, in order to run these applications on a Blue Gene supercomputer, a port to the MPI programming model was required. This was done successfully for several applications, but was not an optimal solution in some cases. Some MPI applications may benefit by being ported to the HTC model. In particular, some embarrassingly parallel MPI applications may be good candidates for HTC mode because they do not require communication between the nodes. Also, the failure of one node does not invalidate the work being done on other nodes. A key advantage of the MPI model is a reduction of extraneous bookkeeping by the application. An MPI program is coded to handle data distribution, minimize I/O by having all I/O done by one node (typically rank 0), and distribute the work to the other MPI ranks. When running in HTC mode, the application data needs to be manually split up and distributed to the nodes. This porting effort may be justified to achieve better application reliability and throughput than could be achieved with an MPI model.

HTC mode

In HTC mode, the compute nodes in a partition run independent programs that do not communicate with each other. A launcher program (running on a compute node) requests work from a dispatcher that is running on a remote system connected to the functional network. Based on information provided by the dispatcher, the launcher


program spawns a worker program that performs some task. When running in HTC mode, there are two basic operational differences from the default HPC mode. First, after the worker program completes, the launcher program is reloaded so it can handle additional work requests. Second, if a compute node encounters a soft error, such as a parity error, the entire partition is not terminated. Rather, the control system attempts to reset the single compute node while other compute nodes continue to operate. The control system polls hardware at a regular interval looking for compute nodes in a reset state. If a failed node is discovered and the node failure is not due to a network hardware error, a software reboot is attempted to recover the compute node.

On a remote system that is connected to the functional network, which could be a Service Node or a Front End Node, there is a dispatcher that manages a queue of jobs to be run. There is a client/server relationship between the launcher program and the dispatcher program. After the launcher program is started on a compute node, it connects back to the dispatcher and indicates that it is ready to receive work requests. When the dispatcher has a job for the launcher, it responds to the work request by sending the launcher program the job-related information, such as the name of the worker program executable, arguments, and environment variables. The launcher program then spawns off the worker program. When the worker program completes, the launcher is reloaded and repeats the process of requesting work from the dispatcher.

Launcher program


In the default HPC mode, when a program ends on the compute node, the Compute Node Kernel sends a message to the I/O node that reports how the node ended. The message indicates whether the program ended normally or by a signal, and the exit value or signal number, respectively. The I/O node then forwards the message to the control system. When the control system has received messages for all of the compute nodes, it ends the job. In HTC mode, the Compute Node Kernel handles a program ending differently depending on the program that ended. The Compute Node Kernel records the path of the program that is first submitted with the job, which is the launcher program. When a program other than the launcher program (that is, the worker program) ends, the Compute Node Kernel records the exit status of the worker program, and then reloads and restarts the launcher program. If the worker program ended by a signal, the Compute Node Kernel generates an event to record the signal number that ended the program. If the worker program ended normally, no information is logged. The launcher program can retrieve the exit status of the worker program using a Compute Node Kernel system call.


    Launcher run sequence

Since no message is sent to the control system indicating that a program ended, the job continues running. The effect is to have a continually running program on the compute nodes. To reduce the load on the file system, the launcher program is cached in memory on the I/O node. When the Compute Node Kernel requests to reload the launcher program, it does not need to be read from the file system but can be sent directly to the compute node from memory. Since the launcher is typically a small executable, it does not require much additional memory to cache it. When the launcher program ends, the Compute Node Kernel reports that a program ended to the control system, as it does in HPC mode. This allows the launcher program to cleanly end on the compute node and the control system to end the job.

    Template for an asynchronous task dispatch subsystem

This section gives a high-level overview of one possible method to implement an asynchronous task dispatch subsystem. The example consists of a client, a dispatcher program, and the launcher program.


[Figure: Asynchronous task dispatch subsystem model]

    Clients

The client implements a task submission thread that publishes task submission messages onto a work queue. It also implements a task verification thread that listens for task completion messages. When the dispatch system informs the client that the task has terminated, and optionally supplies an exit status, the client is then responsible for resolving task completion status and for taking actions, including relaunching tasks. Keep in mind that the client is ultimately responsible for ensuring successful completion of the job.

Message queue

The design is based on publishing and subscribing to message queues. Clients publish task submission messages onto a single work queue. Dispatcher programs subscribe to the work queue and process task submission messages. Clients subscribe to task completion messages. Messages consist of text data comprising the work to be


performed, a job identifier, a task identifier, and a message type. Job identifiers are generated by the client process and are required to be globally unique. Task identifiers are unique within the job session. The message type field for task submission messages is used by the dispatcher to distinguish work of high priority from work of normal priority. Responsibility for reliable message delivery belongs to the message queueing system.
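As a hedged sketch of the fields just listed, a task submission message might be represented in C as below; the struct name, field sizes, and plain-text values are illustrative assumptions, not a defined wire format.

    /* Sketch of the fields carried by a task submission message. */
    #include <stdio.h>

    struct task_message {
        char job_id[64];    /* generated by the client; globally unique        */
        char task_id[32];   /* unique within the job session                   */
        int  msg_type;      /* lets the dispatcher separate high-priority work
                               from normal-priority work                       */
        char command[512];  /* the work to perform: worker executable,
                               arguments, and environment                      */
    };

    int main(void)
    {
        struct task_message msg = {
            .job_id   = "client42-job-0001",
            .task_id  = "task-17",
            .msg_type = 0,                                  /* 0 = normal priority (assumed) */
            .command  = "/bgl/apps/worker --input chunk17.dat",
        };
        printf("%s %s %d %s\n", msg.job_id, msg.task_id, msg.msg_type, msg.command);
        return 0;
    }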

Dispatcher program

The dispatcher program first pulls a task submission message off the work queue. Then it waits on a socket for a launcher connection and reads the launcher ID from the socket. It writes the task into the socket, and the association between task and launcher is stored in a table. The table stores the last task dispatched to each launcher program. A subsequent connection from that launcher is an indication that the last task has completed, and the task completion message can be published back to the client. Figure 12-3 shows the entire cycle of a job submitted in HTC mode.


[Figure: HTC job cycle]

The intention of this design is to optimize the launcher program. The dispatcher program spends little time between connect and dispatch, so latency volatility is mainly due to the waiting time for dispatcher program connections. After rebooting, the launcher program connects to the dispatcher program and passes the completion information back to the dispatcher program. To assist task status resolution, the Compute Node Kernel stores the exit status of the last running process in a buffer. After the launcher program restarts, the contents of this buffer can be written to the dispatcher and stored in the task completion message.

    Launcher program

    The launcher program is intentionally kept simple.


Arguments to the launcher program describe a socket connection to the dispatcher. When the launcher program starts, it connects to this socket, writes its identity into the socket, and waits for a task message. Upon receipt of the task message, the launcher parses the message and calls the execve system call to execute the task. When the task exits (for any reason), the Compute Node Kernel restarts the launcher program again. The launcher program is not a container for the application. Therefore, regardless of what happens to the application, the launcher program will not fail to restart.
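The flow just described might look roughly like the following C sketch. It is a hedged illustration, not the actual launcher: the command-line interface, the newline-terminated identity string, and the whitespace-separated task message format are assumptions made for the example.

    /* Sketch of a launcher: connect, announce identity, receive one task,
     * then execve() into the worker. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>
    #include <sys/socket.h>

    int main(int argc, char *argv[], char *envp[])
    {
        if (argc < 3) {
            fprintf(stderr, "usage: launcher <dispatcher-ip> <port>\n");
            return 1;
        }

        /* Connect to the dispatcher socket named on the command line. */
        int sock = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in addr;
        memset(&addr, 0, sizeof addr);
        addr.sin_family = AF_INET;
        addr.sin_port   = htons((unsigned short)atoi(argv[2]));
        inet_pton(AF_INET, argv[1], &addr.sin_addr);
        if (sock < 0 || connect(sock, (struct sockaddr *)&addr, sizeof addr) != 0)
            return 1;

        /* Announce this launcher's identity, then wait for one task message. */
        char id[64];
        snprintf(id, sizeof id, "launcher-%ld\n", (long)getpid());
        write(sock, id, strlen(id));

        char task[1024];
        ssize_t n = read(sock, task, sizeof task - 1);
        close(sock);
        if (n <= 0)
            return 1;
        task[n] = '\0';

        /* Parse the message into an argv for the worker (format assumed to be
           a whitespace-separated executable name and arguments). */
        char *worker_argv[64];
        int i = 0;
        for (char *tok = strtok(task, " \t\r\n"); tok && i < 63; tok = strtok(NULL, " \t\r\n"))
            worker_argv[i++] = tok;
        worker_argv[i] = NULL;
        if (i == 0)
            return 1;

        /* Replace this process with the worker. When the worker exits, the
           Compute Node Kernel reloads and restarts the launcher. */
        execve(worker_argv[0], worker_argv, envp);
        return 1;   /* reached only if execve() failed */
    }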

I/O considerations

Running in HTC mode changes the I/O patterns. When a program reads and writes to the file system, it is typically done with small buffer sizes. While the administrator can configure the buffer size, the most common sizes are 256 KB or 512 KB. When running in HTC mode, loading the worker program requires reading the complete executable into memory and sending it to a compute node. An executable is at least several megabytes and can be many megabytes. To achieve the fastest I/O throughput, a low compute node to I/O node ratio is best.

4. ADVANTAGES

1. Scalable.
2. Less space (half of a tennis court).
3. Avoids the heat problems most supercomputers face.
4. Speed.

5. DISADVANTAGES

    Important safety notices

Here are some important general comments about the Blue Gene/L system regarding safety.


1. This equipment must be installed by trained service personnel in a restricted access location as defined by the NEC (U.S. National Electric Code) and IEC 60950, The Standard for Safety of Information Technology Equipment. (C033)
2. The doors and covers to the product are to be closed at all times except for service by trained service personnel. All covers must be replaced and doors locked at the conclusion of the service operation. (C013)
3. Servicing of this product or unit is to be performed by trained service personnel only. (C032)
4. This product is equipped with a 4-wire (three-phase and ground) power cable. Use this power cable with a properly grounded electrical outlet to avoid electrical shock. (C019)
5. To prevent a possible electric shock from touching two surfaces with different protective ground (earth), use one hand, when possible, to connect or disconnect signal cables.
6. An electrical outlet that is not correctly wired could place hazardous voltage on the metal parts of the system or the devices that attach to the system. It is the responsibility of the customer to ensure that the outlet is correctly wired and grounded to prevent an electrical shock. (D004)

6. CONCLUSIONS

1. BG/L shows that a cellular architecture for supercomputers is feasible.
2. Higher performance with a much smaller size and power requirement.
3. In theory, there are no limits to the scalability of a Blue Gene system.

7. REFERENCES

1. IBM Redbooks publications


2. General Parallel File System (GPFS) for Clusters: Concepts, Planning, and Installation, GA22-7968
3. IBM General Information Manual, Installation Manual - Physical Planning, GC22-7072
4. LoadLeveler for AIX 5L and Linux V3.2: Using and Administering, SA22-7881
5. PPC440x5 CPU Core User's Manual, http://www-306.ibm.com/chips/techlib/techlib.nsf/products/PowerPC_440_Embedded_Core