High End Computing at SDSC
CSM Cluster Management
Eva Hocks
San Diego Supercomputer Center
2007
Managing the HPC systems: DataStar
System Software:
  AIX 5.2 ML3
  CSM 1.3.3.1
  RSCT 2.3.3.3
System Management with CSM:
  Node setup
  Node Groups
    Per frame
    Per function (NPACI, TG, POE, login, batch)
CSM setup nodes
Configure Nodes
lshwinfo -p hmc -c dshmc07.sdsc.edu > /tmp/fr8_9
vi /tmp/fr8_9 : replace no_hostname with cec_name
no_hostname::hmc::dshmc07.sdsc.edu::fr9-cg13::001::7039::651::02151FF
ds100::hmc::dshmc07.sdsc.edu::fr8-cg1::001::7039::651::021519
definenode -f /tmp/fr8_9 InstallOSName=AIX
systemid -p hmc hscroot
getadapters -n ds100 -z /tmp/ds100_adapters
(write to CSM database, include Federation_switch adapters)
csm2nimnodes -n 'ds100' type='standalone' \
network_name='sdsc_net' platform='chrp' netboot_kernel='mp'
netboot -n ds100
updatenode -n ds100
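The manual vi step on the hardware inventory file can be scripted when many entries still carry the no_hostname placeholder. The following is a minimal sketch, not the procedure used at SDSC. It assumes a hypothetical two-column map file that pairs the fourth field of each entry (for example fr9-cg13) with the name to put in the first field. The function name fill_hostnames, the map file format, and the node name ds101 are made up for illustration.

```shell
# fill_hostnames MAPFILE < hwinfo > hwinfo.new
# MAPFILE (hypothetical format): "<fourth field of the entry> <node name>" per line.
# Entries whose first field is not no_hostname, or that have no mapping,
# are passed through unchanged.
fill_hostnames() {
    awk -F'::' -v OFS='::' -v mapfile="$1" '
        BEGIN {
            # Load the map file into host[<fourth field>] = <node name>.
            while ((getline line < mapfile) > 0) {
                split(line, m, " ")
                host[m[1]] = m[2]
            }
            close(mapfile)
        }
        # Replace the placeholder only when a mapping exists.
        $1 == "no_hostname" && ($4 in host) { $1 = host[$4] }
        { print }
    '
}
```

A typical use would be fill_hostnames /tmp/fr8_9.map < /tmp/fr8_9 > /tmp/fr8_9.new, followed by a review of the result before it is handed to definenode.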
CSM_ADAPTERS_STANZA_FILE
ds100:
MAC_address=00096B34E093
adapter_duplex=full
adapter_speed=100
cable_type=N/A
install_server=192.168.236.31
interface_name=en0
location=U1.32-P1-H1/E1
machine_type=install
netaddr=
network_type=en
subnet_mask=
ds100:
machine_type=secondary
interface_name=sn1
network_type=sn
netaddr=
subnet_mask=
location=U1.5-P1-H1/Q2
ds100:
machine_type=secondary
interface_name=sn0
network_type=sn
netaddr=
subnet_mask=
location=U1.5-P1-H1/Q1
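A quick way to review an adapter stanza file before it is written to the CSM database is to condense each stanza to one line. The sketch below assumes the layout shown above: a "node:" header line followed by attribute=value lines, with attributes in any order. The function name list_adapters is hypothetical and is not part of CSM.

```shell
# list_adapters [FILE...]
# Prints "node interface_name machine_type location" for every stanza.
# Missing or empty attributes are shown as "-".
list_adapters() {
    awk -F= '
        function val(k) { return ((k in a) && a[k] != "") ? a[k] : "-" }
        function emit() {
            if (node != "")
                printf "%s %s %s %s\n", node, val("interface_name"), val("machine_type"), val("location")
            split("", a)            # clear the attributes of the finished stanza
        }
        /^[ \t]*$/ { next }         # skip blank lines
        /^[^=]*:[ \t]*$/ {          # stanza header such as "ds100:"
            emit()
            node = $1
            gsub(/^[ \t]+|:[ \t]*$/, "", node)
            next
        }
        NF >= 2 {                   # attribute=value line
            key = $1
            gsub(/^[ \t]+|[ \t]+$/, "", key)
            a[key] = substr($0, index($0, "=") + 1)
        }
        END { emit() }
    ' "$@"
}
```

Run against the file produced by getadapters, for example list_adapters /tmp/ds100_adapters, this should give one line per adapter, which makes a missing sn0 or sn1 entry easy to spot.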
Managing the HPC systems: DataStar
System Management with CSM:
Management through Command line
rpower
  Power on/off, query node status
  Install node: netboot -n ds100
dsh
  Install updates on nodes (installp, rpm, emgr)
  Monitor processes on nodes
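The power status query lends itself to simple filtering. The sketch below assumes that the query prints one "hostname status" pair per line. That output format is an assumption and should be checked against the installed CSM level. The function name powered_off_nodes and the node name ds101 in the usage note are hypothetical.

```shell
# powered_off_nodes: read "hostname status" lines on stdin and print the
# hostnames whose last field is "off".
powered_off_nodes() {
    awk 'NF >= 2 && $NF == "off" { print $1 }'
}

# Possible use on the management server (not executed here):
#   rpower -a query | powered_off_nodes
#   rpower -n ds101 on
```

Keeping the filter separate from the rpower call means it can be tried on saved output first, before any node is actually powered on.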
Managing the HPC systems: DataStar continued…
System Configuration
cfmupdatenode
  Synchronize system configuration modifications with nodes and system admins
  Run pre/post scripts to capture security risks and send notification
System monitoring:
  Distributed Monitoring responses (GUI configured)
  Event driven email notification for on-call personnel
  GUI monitoring for operations personnel
CSM monitoring
CSM Event Monitoring
GUI Event Monitoring
Critical Conditions:
  AnyNodeTmpFull
  AnyNodeVarSpace
  AnyNodeSwitchResponds
  LoadLeverProcess
  hostResponds (see setting up ERRM Conditions)
Warning Conditions:
  Processor State
CSM Event Monitoring GUI
CSM Event Monitoring
Setting up ERRM Conditions
hostResponds ERRM condition (redbook SG24-6953 page 193)
mkcondition -r IBM.ManagedNode \
-e "Status!=1" -E "Status==1" \
-d "Node hostResponds down" \
-D "Node hostResponds up" \
-m l hostResponds
mkresponse -n LogStatustoFIFO \
-s /usr/local/bin/LogStatusData \
-E "STATUS_FILE=/var/adm/spmondata" LogStatusData
mkcondresp "hostResponds" "LogStatusData"
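The slides do not show the contents of /usr/local/bin/LogStatusData. The sketch below is a hypothetical stand-in, not the SDSC script. It assumes the response script only needs to append one line per event to the file named by STATUS_FILE, and it relies on the ERRM_NODE_NAME, ERRM_COND_NAME, ERRM_TYPE and ERRM_VALUE environment variables that ERRM sets for response scripts. The pipe-separated record layout and the function name log_status_data are made up.

```shell
# Hypothetical stand-in for a response script such as LogStatusData.
# ERRM passes event details in ERRM_* environment variables; STATUS_FILE
# comes from the -E option of mkresponse (default below is an assumption).
log_status_data() {
    status_file=${STATUS_FILE:-/var/adm/spmondata}
    # One record per event: timestamp|node|condition|event type|value
    printf '%s|%s|%s|%s|%s\n' \
        "$(date '+%Y-%m-%d %H:%M:%S')" \
        "${ERRM_NODE_NAME:-unknown}" \
        "${ERRM_COND_NAME:-unknown}" \
        "${ERRM_TYPE:-unknown}" \
        "${ERRM_VALUE:-}" >> "$status_file"
}
```

In a real response script the function body would be the whole script. Writing to a FIFO, as the action name LogStatustoFIFO suggests, would additionally need a reader on the other end so that the write does not block.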
Event notification
Warning Event email:
=====================================
Monday 07/26/04 19:12:34
Condition Name: LoadLProcess
Severity: Warning
Event Type: Event
Expression: Processes.CurPidCount <= 0
Resource Name: ProgramName == 'LoadL_startd' && Filter == 'ruser== root '
Resource Class: IBM.Program
Data Type: CT_SD_PTR
Data Value: [0,1,{},{282654}]
Node Name: ds243
Node NameList: {ds243}
Resource Type: 0
=====================================
Rearm email:
=====================================
Monday 07/26/04 19:13:32
Condition Name: LoadLProcess
Severity: Warning
Event Type: Rearm event
Expression: Processes.CurPidCount > 0
Resource Name: ProgramName == 'LoadL_startd' && Filter == 'ruser== root '
Resource Class: IBM.Program
Data Type: CT_SD_PTR
Data Value: [1,0,{270492},{270492}]
Node Name: ds243
Node NameList: {ds243}
Resource Type: 0
=====================================
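On-call tooling often needs just one or two fields from such a notification, for example the node name and whether the message is an event or a rearm. The sketch below assumes the mail body carries one "Field: value" pair per line. The function name event_field is hypothetical.

```shell
# event_field NAME: read a notification on stdin and print the value of the
# first line that starts with "NAME:". Matching on the trailing colon keeps
# "Node Name" from also matching "Node NameList".
event_field() {
    awk -v name="$1" '
        !found && index($0, name ":") == 1 {
            value = substr($0, length(name) + 2)
            sub(/^[ \t]+/, "", value)   # drop the blank after the colon
            print value
            found = 1
        }
    '
}
```

For example, event_field 'Node Name' < mail.txt would print ds243 for either of the two messages above, and event_field 'Event Type' distinguishes the event from its rearm.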
CSM Information
CSM Guide for the PSSP Systems Administrator, SG24-6953
  Useful scripts for ERRM conditions
  Command cross reference
IBM CSM for AIX 5L Administration Guide, SA22-7918
  CSM error messages
Web Sites
  http://www-124.ibm.com/developerworks/oss/mailman/listinfo/csm