High End Computing at SDSC


Page 1: High End Computing at SDSC

High End Computing at SDSC

CSM Cluster Management

Eva Hocks

San Diego Supercomputer Center

2007

Page 2: High End Computing at SDSC

Managing the HPC systems: DataStar

System Software:
  AIX 5.2 ML3
  CSM 1.3.3.1
  RSCT 2.3.3.3

System Management with CSM:
  Node setup
  Node groups
    Per frame
    Per function (NPACI, TG, POE, login, batch)

Page 3: High End Computing at SDSC

CSM setup nodes

Configure Nodes

lshwinfo -p hmc -c dshmc07.sdsc.edu > /tmp/fr8_9
vi /tmp/fr8_9 : replace noname with cec_name

  no_hostname::hmc::dshmc07.sdsc.edu::fr9-cg13::001::7039::651::02151FF
  ds100::hmc::dshmc07.sdsc.edu::fr8-cg1::001::7039::651::021519

definenode -f /tmp/fr8_9 InstallOSName=AIX
systemid -p hmc hscroot
getadapters -n ds100 -z /tmp/ds100_adapters
  write to CSM database, include Federation_switch adapters
csm2nimnodes -n 'ds100' type='standalone' network_name='sdsc_net' platform='chrp' netboot_kernel='mp'
netboot -n ds100
updatenode -n ds100
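
The hand edit of the lshwinfo output can also be scripted. The following is a minimal sketch in Python, not part of the original slides. It assumes each record has eight fields separated by "::", and the field names are illustrative labels only. The lookup table from CEC name to hostname is hypothetical, since the slides do not give a hostname for fr9-cg13.

```python
# Fill in missing hostnames in lshwinfo output before running definenode.
# Field names are illustrative labels for the eight "::"-separated columns.
FIELDS = ["hostname", "power_method", "hw_control_point", "cec_name",
          "lpar_id", "hw_type", "hw_model", "serial"]


def fill_hostnames(lines, cec_to_host):
    """Replace the no_hostname placeholder using a CEC-name lookup table."""
    fixed = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        values = line.split("::")
        if len(values) != len(FIELDS):
            raise ValueError(f"unexpected field count in: {line!r}")
        record = dict(zip(FIELDS, values))
        if record["hostname"] == "no_hostname":
            # Leave the placeholder alone when the CEC is not in the table.
            record["hostname"] = cec_to_host.get(record["cec_name"],
                                                 "no_hostname")
        fixed.append("::".join(record[name] for name in FIELDS))
    return fixed


sample = [
    "no_hostname::hmc::dshmc07.sdsc.edu::fr9-cg13::001::7039::651::02151FF",
    "ds100::hmc::dshmc07.sdsc.edu::fr8-cg1::001::7039::651::021519",
]
# Hypothetical mapping: "ds-example" is a made-up hostname for illustration.
result = fill_hostnames(sample, {"fr9-cg13": "ds-example"})
```

Records that already carry a hostname pass through unchanged, so the script can be rerun on a partly edited file.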

Page 4: High End Computing at SDSC

CSM_ADAPTERS_STANZA_FILE

ds100:
    MAC_address=00096B34E093
    adapter_duplex=full
    adapter_speed=100
    cable_type=N/A
    install_server=192.168.236.31
    interface_name=en0
    location=U1.32-P1-H1/E1
    machine_type=install
    netaddr=
    network_type=en
    subnet_mask=

ds100:
    machine_type=secondary
    interface_name=sn1
    network_type=sn
    netaddr=
    subnet_mask=
    location=U1.5-P1-H1/Q2

ds100:
    machine_type=secondary
    interface_name=sn0
    network_type=sn
    netaddr=
    subnet_mask=
    location=U1.5-P1-H1/Q1
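
A stanza file like this one is easy to check before it is loaded into the CSM database. The sketch below is an illustration in Python, not a CSM tool. It assumes the layout shown on the slide: a node name followed by a colon opens a stanza, and each following attribute=value line belongs to it. The function name is hypothetical.

```python
def parse_adapter_stanzas(text):
    """Return a list of (node, attributes) pairs, one pair per stanza."""
    stanzas = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blanks, comments, and the file-type header line.
        if not line or line.startswith("#") or line == "CSM_ADAPTERS_STANZA_FILE":
            continue
        if line.endswith(":") and "=" not in line:
            current = {}
            stanzas.append((line[:-1], current))  # "ds100:" -> "ds100"
        elif "=" in line and current is not None:
            key, _, value = line.partition("=")
            current[key.strip()] = value.strip()
        else:
            raise ValueError(f"line outside any stanza: {line!r}")
    return stanzas


sample = """CSM_ADAPTERS_STANZA_FILE
ds100:
    MAC_address=00096B34E093
    interface_name=en0
    machine_type=install
    network_type=en
    netaddr=
ds100:
    machine_type=secondary
    interface_name=sn1
    network_type=sn
    netaddr=
    location=U1.5-P1-H1/Q2
"""
stanzas = parse_adapter_stanzas(sample)
# One node may own several stanzas, so list the interfaces per node.
interfaces = [(node, attrs["interface_name"]) for node, attrs in stanzas]
```

Returning a list of pairs rather than a dictionary keyed by node keeps all adapters of a node, since ds100 appears once per interface.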

Page 5: High End Computing at SDSC

Managing the HPC systems: DataStar

System Management with CSM:

Management through command line
  rpower
    Power on/off, query node status
    Install node: netboot -n ds100
  dsh
    Install updates on nodes (installp, rpm, emgr)
    Monitor processes on nodes

Page 6: High End Computing at SDSC

Managing the HPC systems: DataStar (continued)

System Configuration
  cfmupdatenode
    Synchronize system configuration changes with nodes and system admins
    Run pre/post scripts to capture security risks and send notification

System monitoring:
  Distributed monitoring responses (GUI configured)
  Event-driven email notification for on-call personnel
  GUI monitoring for operations personnel

Page 7: High End Computing at SDSC

CSM monitoring

Page 8: High End Computing at SDSC

CSM monitoring

Page 9: High End Computing at SDSC
Page 10: High End Computing at SDSC

CSM Event Monitoring

GUI Event Monitoring

Critical Conditions:
  AnyNodeTmpFull
  AnyNodeVarSpace
  AnyNodeSwitchResponds
  LoadLeverProcess
  hostResponds (see setting up ERRM Conditions)

Warning Conditions:
  Processor State

Page 11: High End Computing at SDSC

CSM Event Monitoring GUI

Page 12: High End Computing at SDSC

CSM Event Monitoring: setting up ERRM Conditions

hostResponds ERRM condition (redbook SG24-6953 page 193)

mkcondition -r IBM.ManagedNode \
    -e "Status!=1" -E "Status==1" \
    -d "Node hostResponds down" \
    -D "Node hostResponds up" \
    -m l hostResponds

mkresponse -n LogStatustoFIFO \
    -s /usr/local/bin/LogStatusData \
    -E "STATUS_FILE=/var/adm/spmondata" LogStatusData

mkcondresp "hostResponds" "LogStatusData"
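
The condition above pairs an event expression (Status!=1) with a rearm expression (Status==1). The Python sketch below is a simplified model of how such a pair behaves over a stream of samples. It is not RSCT code, and the status sequence is made up for illustration. It assumes that once the event fires, only the rearm expression is watched until it becomes true.

```python
def evaluate_condition(samples, event_expr, rearm_expr):
    """Yield (index, kind) for every event or rearm event in the samples."""
    armed = True  # True: watching the event expression; False: the rearm one
    for index, status in enumerate(samples):
        if armed and event_expr(status):
            armed = False
            yield (index, "event")
        elif not armed and rearm_expr(status):
            armed = True
            yield (index, "rearm")


# Made-up Status samples: 1 means the node responds, anything else does not.
statuses = [1, 0, 0, 1, 1, 0]
events = list(evaluate_condition(
    statuses,
    event_expr=lambda s: s != 1,   # mirrors -e "Status!=1"
    rearm_expr=lambda s: s == 1,   # mirrors -E "Status==1"
))
# events == [(1, "event"), (3, "rearm"), (5, "event")]
```

The second down sample at index 2 produces no new event, which is why the rearm expression keeps a node that stays down from generating repeated notifications.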

Page 13: High End Computing at SDSC
Page 14: High End Computing at SDSC
Page 15: High End Computing at SDSC

Event notification

Warning Event email:

=====================================
Monday 07/26/04 19:12:34
Condition Name: LoadLProcess
Severity: Warning
Event Type: Event
Expression: Processes.CurPidCount <= 0
Resource Name: ProgramName == 'LoadL_startd' && Filter == 'ruser== root '
Resource Class: IBM.Program
Data Type: CT_SD_PTR
Data Value: [0,1,{},{282654}]
Node Name: ds243
Node NameList: {ds243}
Resource Type: 0
=====================================

Rearm email:

=====================================
Monday 07/26/04 19:13:32
Condition Name: LoadLProcess
Severity: Warning
Event Type: Rearm event
Expression: Processes.CurPidCount > 0
Resource Name: ProgramName == 'LoadL_startd' && Filter == 'ruser== root '
Resource Class: IBM.Program
Data Type: CT_SD_PTR
Data Value: [1,0,{270492},{270492}]
Node Name: ds243
Node NameList: {ds243}
Resource Type: 0
=====================================
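
Notification mails in this layout can be turned into structured records, for example to count events per node. The following Python sketch is an illustration only. It assumes the field labels are exactly those visible in the sample mails, and it splits on those labels so that it works whether or not each field sits on its own line. The function name and the Timestamp key are hypothetical.

```python
import re

# Field labels as they appear in the sample notification mails.
FIELD_LABELS = [
    "Condition Name", "Severity", "Event Type", "Expression",
    "Resource Name", "Resource Class", "Data Type", "Data Value",
    "Node Name", "Node NameList", "Resource Type",
]
_LABEL_RE = re.compile(
    "(" + "|".join(re.escape(label) for label in FIELD_LABELS) + r"):\s*"
)


def parse_event_mail(body):
    """Split a notification body into a dict of label -> value."""
    text = " ".join(body.split())  # collapse line breaks and extra spaces
    parts = _LABEL_RE.split(text)
    # parts = [text before first label, label, value, label, value, ...]
    fields = {"Timestamp": parts[0].strip()}
    for label, value in zip(parts[1::2], parts[2::2]):
        fields[label] = value.strip()
    return fields


sample = """Monday 07/26/04 19:12:34 Condition Name: LoadLProcess
Severity: Warning Event Type: Event
Expression: Processes.CurPidCount <= 0 Resource Name: ProgramName ==
'LoadL_startd' && Filter == 'ruser== root ' Resource Class: IBM.Program
Data Type: CT_SD_PTR Data Value: [0,1,{},{282654}] Node Name: ds243
Node NameList: {ds243} Resource Type: 0"""
mail = parse_event_mail(sample)
```

Splitting on the known labels rather than on line breaks keeps the parser independent of how the mail client wraps long lines.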

Page 16: High End Computing at SDSC

CSM Information

CSM Guide for the PSSP Systems Administrator, SG24-6953
  Useful scripts for ERRM conditions
  Command cross reference

IBM CSM for AIX 5L Administration Guide, SA22-7918
  CSM error messages

Web Sites
  http://www-124.ibm.com/developerworks/oss/mailman/listinfo/csm