High End Computing at SDSC
CSM Cluster Management
Eva Hocks
San Diego Supercomputer Center
2007
Managing the HPC systems: DataStar
System Software:
  AIX 5.2 ML3
  CSM 1.3.3.1
  RSCT 2.3.3.3
System Management with CSM:
  Node setup
  Node Groups
    Per frame
    Per function (NPACI, TG, POE, login, batch)
CSM setup nodes
Configure Nodes
lshwinfo -p hmc -c dshmc07.sdsc.edu > /tmp/fr8_9
vi /tmp/fr8_9 : replace noname with cec_name
no_hostname::hmc::dshmc07.sdsc.edu::fr9-cg13::001::7039::651::02151FF
ds100::hmc::dshmc07.sdsc.edu::fr8-cg1::001::7039::651::021519
definenode -f /tmp/fr8_9 InstallOSName=AIX
systemid -p hmc hscroot
getadapters -n ds100 -z /tmp/ds100_adapters
  write to CSM database, include Federation_switch adapters
csm2nimnodes -n 'ds100' type='standalone' \
  network_name='sdsc_net' platform='chrp' netboot_kernel='mp'
netboot -n ds100
updatenode -n ds100
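The manual vi step on the hardware file could also be scripted. The following is a minimal Python sketch, assuming the file keeps the "::"-separated layout shown in the two sample records, with the hostname in the first field and the HMC-side name in the fourth. The function fill_hostnames, the name_map lookup, and the node name ds-example are hypothetical and only illustrate the idea.

```python
# Sketch: fill in missing hostnames in an lshwinfo-style hardware file.
# The "::" field positions and the name_map lookup are assumptions.

def fill_hostnames(lines, name_map):
    """Replace the 'no_hostname' placeholder using a map keyed by the fourth field."""
    fixed = []
    for line in lines:
        fields = line.strip().split("::")
        # fields[0] is the hostname; fields[3] is the name shown on the HMC
        if fields[0] == "no_hostname" and fields[3] in name_map:
            fields[0] = name_map[fields[3]]
        fixed.append("::".join(fields))
    return fixed


records = [
    "no_hostname::hmc::dshmc07.sdsc.edu::fr9-cg13::001::7039::651::02151FF",
    "ds100::hmc::dshmc07.sdsc.edu::fr8-cg1::001::7039::651::021519",
]
# Hypothetical mapping; the slides do not give the node name for fr9-cg13.
name_map = {"fr9-cg13": "ds-example"}

for record in fill_hostnames(records, name_map):
    print(record)
```

Records that already carry a hostname pass through unchanged, so the script can be rerun on a partly edited file.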
CSM_ADAPTERS_STANZA_FILE
ds100:
MAC_address=00096B34E093
adapter_duplex=full
adapter_speed=100
cable_type=N/A
install_server=192.168.236.31
interface_name=en0
location=U1.32-P1-H1/E1
machine_type=install
netaddr=
network_type=en
subnet_mask=
ds100:
machine_type=secondary
interface_name=sn1
network_type=sn
netaddr=
subnet_mask=
location=U1.5-P1-H1/Q2
ds100:
machine_type=secondary
interface_name=sn0
network_type=sn
netaddr=
subnet_mask=
location=U1.5-P1-H1/Q1
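Before the adapter data is written to the CSM database, it can help to check the stanza file for empty fields such as netaddr and subnet_mask. The sketch below reads stanza-style text into Python dictionaries. It assumes each stanza starts with a "node:" line followed by key=value lines, as in the listing above; the handling of "#" comment lines is an assumption, and parse_stanzas is a hypothetical helper, not part of CSM.

```python
# Sketch: read an adapter stanza file into (node, attributes) pairs.

def parse_stanzas(text):
    """Return a list of (node, attributes) pairs from stanza-style text."""
    stanzas = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and '#' comment lines (comment syntax is assumed)
        if not line or line.startswith("#"):
            continue
        if line.endswith(":") and "=" not in line:
            # A new stanza header such as "ds100:"
            current = {}
            stanzas.append((line[:-1], current))
        elif "=" in line and current is not None:
            key, _, value = line.partition("=")
            current[key.strip()] = value.strip()
    return stanzas


sample = """\
ds100:
machine_type=install
interface_name=en0
network_type=en
netaddr=
ds100:
machine_type=secondary
interface_name=sn1
network_type=sn
netaddr=
location=U1.5-P1-H1/Q2
"""

stanzas = parse_stanzas(sample)
for node, attrs in stanzas:
    empty = [key for key, value in attrs.items() if value == ""]
    print(node, attrs["interface_name"], "empty fields:", empty)
```

A node can appear in several stanzas, one per adapter, which is why the result is a list of pairs and not a dictionary keyed by node name.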
Managing the HPC systems: DataStar
System Management with CSM:
Management through Command line
  rpower
    Power on/off, query node status
  Install node: netboot -n ds100
  dsh
    Install updates on nodes (installp, rpm, emgr)
    Monitor processes on nodes
Managing the HPC systems: DataStar continued…
System Configuration
  cfmupdatenode
    Synchronize system configuration modifications with nodes and system admins
    Run pre/post scripts to capture security risks and send notification
System monitoring:
  Distributed Monitoring responses (GUI configured)
  Event driven email notification for on-call personnel
  GUI monitoring for operations personnel
CSM monitoring
CSM Event Monitoring
GUI Event Monitoring
Critical Conditions:
  AnyNodeTmpFull
  AnyNodeVarSpace
  AnyNodeSwitchResponds
  LoadLeverProcess
  hostResponds (see setting up ERRM Condition)
Warning Conditions:
  Processor State
CSM Event Monitoring GUI
CSM Event Monitoring: setting up ERRM Conditions
hostResponds ERRM condition (redbook SG24-6953 page 193)
mkcondition -r IBM.ManagedNode \
  -e "Status!=1" -E "Status==1" \
  -d "Node hostResponds down" \
  -D "Node hostResponds up" \
  -m l hostResponds
mkresponse -n LogStatustoFIFO \
  -s /usr/local/bin/LogStatusData \
  -E "STATUS_FILE=/var/adm/spmondata" LogStatusData
mkcondresp "hostResponds" "LogStatusData"
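The pairing of an event expression (-e) with a rearm expression (-E) can be pictured as a two-state monitor: once the event expression fires, only the rearm expression is watched until it becomes true, and then monitoring switches back. The Python sketch below simulates that behavior on a list of sampled Status values. It is an illustration of the idea under that assumption, not RSCT code, and track_events is a hypothetical name.

```python
# Sketch: simulate event/rearm switching for a condition such as hostResponds.

def track_events(samples, event, rearm):
    """Yield (index, kind) whenever an event or a rearm event would fire."""
    armed = True  # True: watching the event expression; False: watching rearm
    for i, status in enumerate(samples):
        if armed and event(status):
            armed = False
            yield i, "event"
        elif not armed and rearm(status):
            armed = True
            yield i, "rearm"


# Sampled Status values; 1 means the node responds, matching the condition above.
samples = [1, 1, 0, 0, 1, 1, 0]
fired = list(track_events(samples, lambda s: s != 1, lambda s: s == 1))
print(fired)
```

In this run the second consecutive 0 produces no additional event, which is the reason for defining a rearm expression: a node that stays down raises one notification instead of one per sample.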
Event notification
Warning Event email
=====================================
Monday 07/26/04 19:12:34
Condition Name: LoadLProcess
Severity: Warning
Event Type: Event
Expression: Processes.CurPidCount <= 0
Resource Name: ProgramName == 'LoadL_startd' && Filter == 'ruser== root '
Resource Class: IBM.Program
Data Type: CT_SD_PTR
Data Value: [0,1,{},{282654}]
Node Name: ds243
Node NameList: {ds243}
Resource Type: 0
=====================================
Rearm email:
=====================================
Monday 07/26/04 19:13:32
Condition Name: LoadLProcess
Severity: Warning
Event Type: Rearm event
Expression: Processes.CurPidCount > 0
Resource Name: ProgramName == 'LoadL_startd' && Filter == 'ruser== root '
Resource Class: IBM.Program
Data Type: CT_SD_PTR
Data Value: [1,0,{270492},{270492}]
Node Name: ds243
Node NameList: {ds243}
Resource Type: 0
=====================================
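On-call tooling may want to tell event mails from rearm mails automatically. The sketch below parses a notification body into a dictionary, assuming the mail carries a timestamp on its first line and one "Label: value" field per line after that. That layout is an assumption about the mail format, and parse_notification and is_rearm are hypothetical helpers.

```python
# Sketch: parse an ERRM notification mail body into fields.

def parse_notification(text):
    """Parse 'Label: value' lines into a dict; the first line is the timestamp."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    fields = {"Timestamp": lines[0]}
    for line in lines[1:]:
        # Split on the first ": " only, so values may contain further colons
        label, sep, value = line.partition(": ")
        if sep:
            fields[label] = value
    return fields


def is_rearm(fields):
    """True when the mail reports a rearm event."""
    return fields.get("Event Type") == "Rearm event"


sample = """\
Monday 07/26/04 19:13:32
Condition Name: LoadLProcess
Severity: Warning
Event Type: Rearm event
Expression: Processes.CurPidCount > 0
Node Name: ds243
"""

fields = parse_notification(sample)
print(fields["Condition Name"], fields["Node Name"], is_rearm(fields))
```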
CSM Information
CSM Guide for the PSSP Systems Administrator, SG24-6953
  Useful scripts for ERRM conditions
  Command cross reference
IBM CSM for AIX 5L Administration Guide, SA22-7918
  CSM error messages
Web Sites
  http://www-124.ibm.com/developerworks/oss/mailman/listinfo/csm