“Clusters in Molecular Sciences Applications”, 2nd Annual iHPC Cluster Workshop, Ottawa, Jan 11, 2002.
Clusters in Molecular Sciences Applications
Serguei Patchkovskii@#, Rochus Schmid@, Tom Ziegler@,
Siu Pang Chan#, Andrew McCormack#, Roger Rousseau#, Ian Skanes#
@ Department of Chemistry, University of Calgary, 2500 University Dr. NW, Calgary, Alberta, T2N 1N4 Canada
# Theory and Computation Group, SIMS, NRC, 100 Sussex Dr., Ottawa, Ontario, K1A 0R6

Overview
• Beowulf-style clusters entered mainstream
• Are clusters a lasting, efficient investment?
• Odysseus: an internal cluster at the SIMS theory group
• Clusters in molecular science applications: software availability and performance
• Three war stories, and a cautionary message
• Summary and conclusions

Shared, Academic Clusters in Canada

Location              CPUs           URL or other info
Carleton U.           8xPII-400      www.scs.carleton.ca/~gis/
UBC                   256xPIII-1000  www.gdcfd.ubc.ca/Monster
U of Calgary          179xAlpha      www.maci-cluster.ucalgary.ca
U of Western Ontario  144xAlpha      GreatWhite.sharcnet.ca
U of Western Ontario  48xAlpha       DeepPurple.sharcnet.ca
McMaster U            106xAlpha      Idra.physics.mcmaster.ca
U of Guelph           120xAlpha      Hammerhead.uoguelph.ca
U of Windsor          8xAlpha
Wilfrid Laurier U     8xAlpha

Canadian top-500 facilities
[Figure: Canadian top-500 facilities; surviving label: Cluster]

Internal, “workhorse” clusters

Location                             CPUs            URL or other
U of Alberta                         98xPIII-450     www.phys.ualberta.ca/THOR
U of Calgary                         94x21164-500    www.cobalt.chem.ucalgary.ca
U of Calgary                         120xPIII-1000   www.ucalgary.ca/~tieleman/elk.html
U of Calgary                         32xPIII
Memorial U                           32xPII-300      weland.esd.mun.ca
MDS Proteomics                       400xPIII-1000   www.mdsproteomics.com
ICPET, NRC                           80xPIII-800
DRAO, NRC                            16xPII-450
SIMS, NRC                            32xPIII-933
Samuel Lunenfeld Research Institute  224xPIII-450    Bioinfo.mshri.on.ca/yac/
Sherbrooke U                         64xPII-400
U of Saskatchewan                    12xAthlon-800   Sasquatch.usask.ca
Simon Fraser U                       16xPIII-500     www.sfu.ca/acs/cluster/
U of Victoria                        39xPIII-450     Pingu.phys.uvic.ca/muse/ (?)
McMaster U                           32xPIII-700     www.cim.mcgill.ca/~cvr/beowulf/
CERCA, Montreal                      16xAthlon-1200  www.cerca.umontreal.ca/~fourmano/
U of Western Ontario                 various         www.baldric.uwo.ca

Clusters are everywhere
Lemma 1: A computationally-intensive research group in Canada can be in one of three states:
a) It owns a cluster, or
b) It is building a cluster, or
c) It plans to build a cluster RSN
Clusters have become a mainstream research tool – useful, but not automatically worthy of a separate mention

Cobalt: Hardware
[Diagram labels: Node 1 … Node 93; World; Switch 93x100BaseTx; 100BaseTx (half-duplex); 2x100BaseTx; 128Mb memory; 18Gbytes RAID-1 (4 spindles)]
Computers on benches all linked together

Cobalt: Nodes and Network

Nodes: Digital/Compaq Personal Workstation 500au
  CPU         Alpha 21164A, 500 MHz
  Cache       96Kb on-chip (L1 and L2)
  Peak flops  10^9 Flop/second
  SpecInt 95  15.7 (estimate)
  SpecFP 95   19.5 (estimate)

Network: 4 x 3COM SuperStack II 3300
  Peak aggregate b/w        500.0 MB/s
  Peak internode b/w (TCP)  11.2 MB/s
  NFS read/write            3.4/4.1 MB/s
  Round-trip (TCP)          360 μs
  Round-trip (UDP)          354 μs

Cobalt: Software
OS, communications, and cluster management:
  Base OS: Tru64, using DMS, NIS, and NFS
  Compilers: Digital/Compaq C, C++, Fortran
  Communications: PVM, MPICH
  Batch queuing: DQS
Application software:
  ADF: Amsterdam Density Functional (PVM)
  PAW: Projector-Augmented Wave (MPI)

Cobalt: Return on the Investment

Investment: Dollars
  Total cost             390,800
  … including:
  Initial purchase       346,000
  Operating (’98-’01):
    power (6¢/kWh)        15,800
    admin (20% PDF)       24,000
    spare parts            5,000

Payback: Research Articles
  Total publications     92
  … including:
  Organometallics        21
  J. Am. Chem. Soc.      12
  J. Phys. Chem.         11
  J. Chem. Phys.         10
  Inorg. Chem.            6

ROI: 1 publication / $4,250
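
The cost breakdown and the ROI figure on this slide can be re-derived in a few lines. The sketch below uses only the numbers shown on the slide; the variable names are illustrative. The exact quotient comes out near $4,248, which the slide rounds to $4,250.

```python
# Figures copied from the slide (dollars); variable names are illustrative.
initial_purchase = 346_000
operating = {"power": 15_800, "admin": 24_000, "spare parts": 5_000}
publications = 92

# Total cost is the initial purchase plus all operating items for '98-'01.
total_cost = initial_purchase + sum(operating.values())
cost_per_publication = total_cost / publications

print(f"Total cost: ${total_cost:,}")
print(f"Cost per publication: ${cost_per_publication:,.0f}")
```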

Odysseus: Low-tech solution for high-tech problems (1)

Odysseus: Low-tech solution for high-tech problems (2)
Nodes (16+1):
  ABIT VP6 motherboard
  2xPIII-933, 133MHz FSB
  4x256Mbytes RAM
  3COM 3C905C
  36Gb 7200rpm IDE
… plus, on the front end:
  Intel PRO/1000
  Adaptec AHA-2940UW
  60Gb 7200rpm IDE

Odysseus: Low-tech solution for high-tech problems (3)
Network: SCI + 100Mbit
  Dolphin D339 (2D SCI)
    H ring
    V ring
  HP Procurve 2524 + 1Gig

Odysseus: Low-tech solution for high-tech problems (4)
Backup unit:
  VXAtape (www.ecrix.com)
    35Gbytes/cartridge (physical)
  TreeFrog autoloader (www.spectralogic.com)
    16 cartridge capacity
UPS Unit:
  Powerware 5119
    2880VA

Odysseus: Low-tech solution for high-tech problems (5)
Four little wheels

Odysseus at a glance
  Processors:  32 (+2)
  Memory:      16Gbytes
  Disk:        636Gbytes
  Peak flops:  29.9GFlops/sec
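
The "at a glance" totals appear to follow from the per-node parts list on the earlier slide. The sketch below reproduces them under assumptions that the slides do not state explicitly: only the 16 compute nodes are counted for CPUs, memory, and peak flops; each PIII retires one floating-point operation per clock cycle; and the front end carries the 60 Gbyte disk in place of a 36 Gbyte one.

```python
# Per-node figures from the "Nodes (16+1)" slide. The counting rules below
# are assumptions made for this sketch, not statements from the talk.
compute_nodes = 16
cpus_per_node = 2
clock_ghz = 0.933            # PIII-933
ram_mb_per_node = 4 * 256    # 4x256Mbytes
node_disk_gb = 36
front_end_disk_gb = 60
flops_per_cycle = 1          # assumed: one flop per cycle per CPU

cpus = compute_nodes * cpus_per_node
memory_gb = compute_nodes * ram_mb_per_node / 1024
disk_gb = compute_nodes * node_disk_gb + front_end_disk_gb
peak_gflops = cpus * clock_ghz * flops_per_cycle

print(f"Processors: {cpus}")
print(f"Memory:     {memory_gb:.0f} Gbytes")
print(f"Disk:       {disk_gb} Gbytes")
print(f"Peak flops: {peak_gflops:.1f} GFlops/sec")
```

Under these assumptions the four totals match the slide, which suggests the "(+2)" refers to the two front-end CPUs.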

Odysseus: cost overview

Expense                                        dollars
Nodes                                           40,640
SCI network (cards & cables)                    26,771
Backup unit (tape+robot)                         5,860
Spare parts in stock                             5,024
Ethernet (switch, cables, and head node link)    4,190
Compiler (PGI)                                   3,780
UPS                                              2,265
Backup tapes (16+1)                              1,911
Total:                                          90,441
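
As a quick consistency check, the line items can be summed and expressed as shares of the total. This is a minimal sketch using the figures from the table; the cost-per-CPU figure assumes the 32 compute CPUs are the denominator, which the slide does not state.

```python
# Line items from the cost table (dollars).
costs = {
    "Nodes": 40_640,
    "SCI network (cards & cables)": 26_771,
    "Backup unit (tape+robot)": 5_860,
    "Spare parts in stock": 5_024,
    "Ethernet (switch, cables, and head node link)": 4_190,
    "Compiler (PGI)": 3_780,
    "UPS": 2_265,
    "Backup tapes (16+1)": 1_911,
}

total = sum(costs.values())
sci_share = costs["SCI network (cards & cables)"] / total
cost_per_cpu = total / 32  # assumption: 32 compute CPUs

# Print each item with its share of the total, largest first.
for item, dollars in sorted(costs.items(), key=lambda kv: -kv[1]):
    print(f"{item:<46} {dollars:>7,} {100 * dollars / total:5.1f}%")
print(f"{'Total':<46} {total:>7,}")
print(f"Cost per compute CPU: ${cost_per_cpu:,.0f}")
```

The items sum to the stated total, and the SCI interconnect accounts for a little under 30% of it.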

Clusters in molecular science – software availability

• Gaussian
• Turbomole
• GAMESS
• NWChem
• GROMOS

• ADF
• PAW
• CPMD
• AMBER
• VASP
• PWSCF
• ABINIT

Software: ADF
ADF – Amsterdam Density Functional (www.scm.com)
Example: Cr(N)Porph
  Full geometry optimization
  38 atoms
  580 basis functions
  C4v symmetry
  45Mbytes of memory
  Serial time: 683 minutes
[Plot: speedup vs. number of Cobalt nodes; curves: ideal, observed]
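
The scaling plots in this part of the talk compare observed speedup with the ideal line. For reference, speedup on N nodes is the serial time divided by the N-node time, and parallel efficiency is speedup divided by N. The sketch below uses the 683-minute serial time from the slide; the parallel timing is a made-up placeholder, since the measured points survive only in the plot.

```python
def speedup(serial_time: float, parallel_time: float) -> float:
    """Speedup S(N) = T(1) / T(N)."""
    return serial_time / parallel_time


def efficiency(serial_time: float, parallel_time: float, nodes: int) -> float:
    """Parallel efficiency E(N) = S(N) / N; 1.0 corresponds to the ideal line."""
    return speedup(serial_time, parallel_time) / nodes


serial_minutes = 683.0  # from the slide

# Ideal case: the run time drops exactly as 1/N.
ideal_eff = efficiency(serial_minutes, serial_minutes / 8, 8)

# HYPOTHETICAL timing, for illustration only (not a measurement from the talk).
hypothetical_minutes_on_8 = 110.0
print(f"ideal efficiency on 8 nodes: {ideal_eff:.2f}")
print(f"hypothetical speedup on 8 nodes: "
      f"{speedup(serial_minutes, hypothetical_minutes_on_8):.2f}")
```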

Software: PAW
PAW – “Projector-Augmented Wave” (www.pt.tu-clausthal.de/~ptpb/PAW/pawmain.html)
Example: SN2 reaction
  CH3I + [Rh(CO)2I2]-
  11Å unit cell
  Serial time per step: 83 seconds
  Memory: 231Mbytes
[Plot: speedup vs. Cobalt nodes; curves: ideal, observed]

Software: CPMD
CPMD – Car-Parrinello Molecular Dynamics (www.mpi-stuttgart.mpg.de/parinello/)
Example: H in Si64
  65 atoms, periodic
  40Ryd cut-off
  Geometry opt (2 steps) + free MD (70 steps)
[Plot; surviving label: odysseus]

Software: AMBER
AMBER – “Assisted Model Building with Energy Refinement” (www.amber.ucsf.edu/amber/)
Example:
  22-residue polypeptide + 4K+ + 2500 H2O
  1ns MD
[Plot: Time (hour) vs. Ncpu]

Software: VASP
VASP – Vienna Ab-initio Simulation Package (cms.mpi.univie.ac.at/vasp/)
Example: Li198
  1000GPa
  300 eV cutoff
  9 K-points
  10 WF optimization steps + stress tensor
[Plot; surviving label: odysseus]

Software: PWSCF
PWSCF and PHONON – Plane wave pseudopotential codes, optimized for phonon spectra calculations (www.pwscf.org/)
Example: MgB2 solid
  Geometry opt.
  40 Ryd cut-off
  60 K-points
[Plot; surviving label: odysseus]

Software: ABINIT
ABINIT (www.mapr.ucl.ac.be/ABINIT/)
Example:
  SiO2 (stishovite)
  70Ryd cut-off
  6 K-points
  12 SCF iterations

War Story #1
Odysseus hardware maintenance log, Oct 19, 2001:
  Overnight, node 6 had a kernel OOPS … it responds to network pings and keyboard, but no new processes can be started …
Reason:      Heat sink on CPU#1 became loose, resulting in overheating under heavy load.
Resolution:  Reinstall the heat sink
Detected by: Elevated temperature readings for the CPU#1 (lm_sensors)
Downtime:    20 minutes (the affected node)

War Story #2
Odysseus hardware maintenance log, Nov 12, 2001:
  A large, 16-CPU VASP job fails with “LAPACK: Routine ZPOTRF failed”, or random total energy
Reason:      DIMM in bank #0 on node 17 developed a single-bit failure at the address 0xfd9f0c
Resolution:  Replace memory module in bank #0
Detected by: Rerunning failing job with different sets of nodes, followed by the memory diagnostic on the affected node (memtest32)
Downtime:    1 day (the whole cluster) + 2 days (the affected node)
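
The "rerun on different sets of nodes" step amounts to a search for the faulty node. One way to organize it is bisection, sketched below. This is not the procedure recorded in the log; it assumes exactly one bad node, a failure that reproduces reliably, and a test job that can run on any subset of nodes. `job_passes` is a hypothetical callback standing in for "submit the job on these nodes and check the result".

```python
from typing import Callable, Sequence


def find_faulty_node(nodes: Sequence[int],
                     job_passes: Callable[[Sequence[int]], bool]) -> int:
    """Bisect a node list, assuming exactly one node makes the job fail."""
    candidates = list(nodes)
    while len(candidates) > 1:
        half = candidates[: len(candidates) // 2]
        if job_passes(half):
            # First half is clean, so the fault is in the remainder.
            candidates = candidates[len(half):]
        else:
            candidates = half
    return candidates[0]


# Simulated cluster: node 17 is bad (as in the log entry above).
BAD_NODE = 17


def simulated_job(node_set: Sequence[int]) -> bool:
    return BAD_NODE not in node_set


print(find_faulty_node(range(1, 18), simulated_job))
```

With 17 nodes this needs at most five or six test runs. An intermittent fault would require repeating each run several times before trusting a "pass".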

War Story #3
Odysseus hardware maintenance log, Dec 10, 2001:
  Apparently random application failures are observed
Reason:      Multiple single-bit memory failures, on the nodes (bank #): 6 (#2), 7 (#2,#3), 8 (#0), 10 (#0), 11 (#0)
Resolution:  Replace memory modules
Detected by: Cluster-wide memory diagnostic (memtest32)
Downtime:    3 days (the whole cluster)

Cautionary Note
• Using inexpensive, consumer-grade hardware potentially exposes you to low-quality components
• Never use components which have no built-in hardware monitoring and error detection capability
• Always configure your clusters to report corrected errors and out-of-range hardware sensor readings
• Act on the early warnings
• Otherwise, you run the risk of producing garbage science, and never knowing it

Hardware Monitoring with Linux

Category     Parameter                                              Package
Motherboard  Temperature; Power supply voltage; Fan status          lm_sensors#
Hard drives  Corrected error counts; Impending failure indicators   ide-smart$, S.M.A.R.T. Suite%
Memory       Corrected error counts                                 ecc.o^
Network      Hardware-dependent

# http://www2.lm-sensors.nu/~lm78/
$ http://www.linux-ide.org/smart.html
% http://csl.cse.ucsc.edu/smart.shtml
^ http://www.anime.net/~goemon/linux-ecc/ (2.2 kernels only)
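
The reporting advice can be illustrated with a small checker that flags out-of-range sensor readings and growing corrected-error counts. This is a sketch only: it does not call any of the packages listed above, and the sensor names, limits, and readings are hypothetical. In practice the values would come from whatever those tools report on each node.

```python
def out_of_range(readings: dict, limits: dict) -> list:
    """Return one alert string per reading outside its (low, high) limits."""
    alerts = []
    for name, value in readings.items():
        low, high = limits[name]
        if not (low <= value <= high):
            alerts.append(f"{name}={value} outside [{low}, {high}]")
    return alerts


def new_corrected_errors(previous: dict, current: dict) -> dict:
    """Return counters that grew since the last poll, with their increase."""
    return {
        key: count - previous.get(key, 0)
        for key, count in current.items()
        if count > previous.get(key, 0)
    }


# Hypothetical limits and readings for one node (illustrative values only).
limits = {"cpu1_temp_c": (10.0, 70.0), "fan1_rpm": (2000.0, 8000.0), "vcore_v": (1.60, 1.80)}
readings = {"cpu1_temp_c": 82.5, "fan1_rpm": 4500.0, "vcore_v": 1.70}

alerts = out_of_range(readings, limits)
ecc_growth = new_corrected_errors({"node7_bank2": 0},
                                  {"node7_bank2": 3, "node8_bank0": 0})
for line in alerts:
    print("ALERT:", line)
print("corrected-error growth:", ecc_growth)
```

Tracking the increase in corrected errors, instead of the raw count, highlights modules that are degrading now, which is the early warning the cautionary note asks operators to act on.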

Summary and Conclusions
• Clusters are no longer a techno-geek’s toy, and will remain the primary workhorse of many research groups, at least for a while
• Clusters give an impressive return on the investment, and may remain useful longer than expected
• Many (most?) useful research codes in molecular sciences are readily available on clusters
• Configuring and operating PC clusters can be tricky. Consider a reputable system integrator with Beowulf hardware and software experience