South African Grid Training
COMPUTING ELEMENT
Albert van Eck
UFS - ICTS
18 November 2009
Slides by: GIUSEPPE PLATANIA
18 Nov 2009, Cape Town, South African Grid Training
OUTLINE
• OVERVIEW
• INSTALLATION & CONFIGURATION
• TESTING
• FIREWALL SETUP
• TROUBLESHOOTING
OVERVIEW
• The Computing Element (CE) is the central service of a site.
• Its main functions are:
  – managing jobs (job submission, job control)
  – reporting job status updates to the WMS
  – publishing all site information (site location, queues, CPU status, and so on) via LDAP (site BDII service)
• It can run on several kinds of batch systems:
  – Torque + MAUI
  – LSF
  – SGE
  – Condor
TORQUE + MAUI
• The Torque server is composed of:
  – pbs_server, which provides the basic batch services such as receiving/creating a batch job.
• The Torque client is composed of:
  – pbs_mom, which places the job into execution. It is also responsible for returning the job's output to the user.
• The MAUI system is composed of:
  – job_scheduler, which contains the site's policy to decide which job must be executed and when.
Site BDII**
– By default it is installed on the CE
– It collects all site GRISes* (for example SE, RB, LFC, etc.)
– The name of the service is bdii
– Log file: /opt/bdii/var/bdii.log
*GRIS = Grid Resource Information Service
**BDII = Berkeley Database Information Index
Computing Element installation & configuration using YAIM
WHAT KIND OF CE?
There are several kinds of metapackages to install:
• ig_CE – LCG ComputingElement without batch system packages.
• ig_CE_LSF – LCG ComputingElement with LSF.
  IMPORTANT: this metapackage is provided for consistency; it does not install LSF, but it applies some fixes via ig_configure_node.
• ig_CE_torque – LCG ComputingElement with Torque+MAUI.
HOW TO GET A HOST CERTIFICATE
• Host certificate for the CE.
  – Please request it from your RA (Registration Authority)
• For this tutorial:
    HOST=$(hostname -f)
    mkdir /etc/grid-security
    cp /root/$HOST/${HOST}-cert.pem /etc/grid-security/hostcert.pem
    cp /root/$HOST/${HOST}-key.pem /etc/grid-security/hostkey.pem
• Install the host certificate and key (hostcert.pem and hostkey.pem) in /etc/grid-security and set their permissions:
    mkdir /etc/grid-security
    cd /etc/grid-security
    chmod 644 hostcert.pem
    chmod 400 hostkey.pem
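The copy and chmod steps above can be collected into one small function. This is only a sketch: install_host_cert is a hypothetical helper, not part of gLite, and it assumes the certificate and key are named <host>-cert.pem and <host>-key.pem as in the tutorial layout.

```shell
#!/bin/sh
# Hypothetical helper (not part of gLite): copy a host certificate/key pair
# into place and apply the permissions given on the slide.
# $1 = directory holding <host>-cert.pem and <host>-key.pem
# $2 = fully qualified host name
# $3 = destination directory (defaults to /etc/grid-security)
install_host_cert() {
    src=$1
    host=$2
    dest=${3:-/etc/grid-security}

    mkdir -p "$dest" || return 1
    cp -f "$src/${host}-cert.pem" "$dest/hostcert.pem" || return 1
    cp -f "$src/${host}-key.pem" "$dest/hostkey.pem" || return 1
    chmod 644 "$dest/hostcert.pem"   # certificate: world-readable
    chmod 400 "$dest/hostkey.pem"    # private key: readable by owner only
}
```

On the tutorial machines it could be called as install_host_cert "/root/$(hostname -f)" "$(hostname -f)".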
Repository settings
• Define the list of repositories:
    REPOS="ca dag glite-lcg_ce ig jpackage gilda"
• Download and save the repo files:
    for name in $REPOS; do
      wget http://grid018.ct.infn.it/mrepo/repos/$name.repo -O /etc/yum.repos.d/$name.repo
    done
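The download loop above gives no indication when one of the files fails to download. A possible variant that reports failures is sketched below; repo_url and fetch_repos are hypothetical helper names, while the base URL and repository names are the ones from the slide.

```shell
#!/bin/sh
# Base URL and repository names are taken from the slide.
REPO_BASE="http://grid018.ct.infn.it/mrepo/repos"
REPOS="ca dag glite-lcg_ce ig jpackage gilda"

# Hypothetical helper: print the download URL for one repository name.
repo_url() {
    echo "$REPO_BASE/$1.repo"
}

# Hypothetical helper: fetch every repo file into the directory given as $1
# (default /etc/yum.repos.d) and report the names that failed.
fetch_repos() {
    dest=${1:-/etc/yum.repos.d}
    failed=""
    for name in $REPOS; do
        if ! wget -q "$(repo_url "$name")" -O "$dest/$name.repo"; then
            failed="$failed $name"
        fi
    done
    if [ -n "$failed" ]; then
        echo "failed to download:$failed" >&2
        return 1
    fi
}
```

Note that wget -O may leave an empty file behind when a download fails, so any repository reported as failed should be fetched again or its .repo file removed before running yum.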
INSTALLATION
• yum remove jdk
• yum install xml-commons-resolver12
• yum install jdk java-1.6.0-sun-compat
• yum install maui-3.2.6p19_20.snap.1182974819-5.slc4 \
    maui-server-3.2.6p19_20.snap.1182974819-5.slc4
• yum install ig_CE_torque
• yum install lcg-CA
Gilda rpms:
• yum install gilda_utils
If it's also the site BDII collector:
• yum install ig_BDII
Customize ig-site-info.def
• Copy the ig-site-info.def template file provided by ig_yaim into the gilda directory and customize it:
    cp /opt/glite/yaim/examples/siteinfo/ig-site-info.def /opt/glite/yaim/etc/gilda/<your_site-info.def>
• Open /opt/glite/yaim/etc/gilda/<your_site-info.def> in a text editor and set the following values according to your grid environment:
    CE_HOST=<write the CE hostname you are installing>
    TORQUE_SERVER=$CE_HOST
JOB_MANAGER=lcgpbs
BATCH_BIN_DIR=/usr/bin
BATCH_VERSION=torque-2.1.9-4
CE_BATCH_SYS=pbs
CE_CPU_MODEL=Opteron
CE_CPU_VENDOR=AMD
CE_CPU_SPEED=3000
CE_OS="ScientificSL"
CE_OS_RELEASE=4.8
CE_OS_VERSION="SL"
CE_MINPHYSMEM=2048
CE_MINVIRTMEM=4096
CE_SMPSIZE=2
CE_SI00=1000
CE_SF00=1200
CE_OUTBOUNDIP=TRUE
CE_INBOUNDIP=TRUE
GROUPS_CONF=/opt/glite/yaim/etc/gilda/ig-groups.conf
USERS_CONF=/opt/glite/yaim/etc/gilda/ig-users.conf
JAVA_LOCATION="/usr/java/latest"
SITE_EMAIL="grid-prod@<your_domain>"
SITE_NAME=GILDA-54..58 #Your Number (eg. GILDA-60)
SITE_LOC="Cape Town, SOUTH AFRICA"
SITE_LAT=37.5
SITE_LONG=15.152
SITE_WEB="https://gilda.ct.infn.it"
SITE_SUPPORT_SITE="grid-prod@<your_domain>"
REMOVE the following, if it exists:
SITE_TIER="xxxxxxxx"
QUEUES="short long infinite gilda"
SHORT_GROUP_ENABLE=$VOS
LONG_GROUP_ENABLE=$VOS
INFINITE_GROUP_ENABLE=$VOS
If you configure a queue for a single VO:
QUEUES="short long infinite gilda"
SHORT_GROUP_ENABLE=$VOS
LONG_GROUP_ENABLE=$VOS
INFINITE_GROUP_ENABLE=$VOS
GILDA_GROUP_ENABLE="gilda"
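In the settings above, each queue named in QUEUES has a matching <QUEUE>_GROUP_ENABLE variable whose name is the queue name in upper case. A minimal sketch that lists the expected variable names for a queue list follows; group_enable_vars is a hypothetical helper, and the plain upper-casing rule is an assumption read off the slide (queue names containing dots or dashes may be translated differently by YAIM).

```shell
#!/bin/sh
# Hypothetical helper: for each queue in the space-separated list given as
# $1, print the name of the matching <QUEUE>_GROUP_ENABLE variable.
# Assumption: the variable name is simply the upper-cased queue name.
group_enable_vars() {
    for q in $1; do
        echo "$(echo "$q" | tr '[:lower:]' '[:upper:]')_GROUP_ENABLE"
    done
}
```

Calling group_enable_vars "$QUEUES" after sourcing the site-info.def gives a checklist of the variables that should be defined for the configured queues.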
DPM_HOST="aliserv6.ct.infn.it"
SE_LIST="$DPM_HOST"
VOS="gilda <others>" #If you have more than one: "gilda my_other_vo"
ALL_VOMS="gilda"
WMS_HOST="egee-wms-01.cnaf.infn.it"
SE_MOUNT_INFO_LIST="none"
CE_OTHERDESCR="Cores=8,Benchmark=$CE_SI00-HEP-SPEC06"
CE_RUNTIMEENV="LCG-2 LCG-2_1_0 LCG-2_1_1 LCG-2_2_0 GLITE-3_0_0 GLITE-3_1_0 R-GMA"
CE_CAPABILITY="CPUScalingReferenceSI00=$CE_SI00"
BATCH_SERVER=$CE_HOST
BDII_HOST=gilda-bdii.ct.infn.it
SITE_BDII_HOST=$CE_HOST
BDII_REGIONS="CE SE"
BDII_CE_URL="ldap://$CE_HOST:2170/mds-vo-name=resource,o=grid"
BDII_SE_URL="ldap://$DPM_HOST:2170/mds-vo-name=resource,o=grid"
VO_GILDA_SW_DIR=$VO_SW_DIR/gilda
VO_GILDA_DEFAULT_SE=$CLASSIC_HOST
VO_GILDA_STORAGE_DIR=$CLASSIC_STORAGE_DIR/gilda
VO_GILDA_QUEUES="gilda"
VO_GILDA_VOMS_SERVERS="vomss://voms.ct.infn.it:8443/voms/gilda?/gilda"
VO_GILDA_VOMSES="'gilda voms.ct.infn.it 15001 /C=IT/O=INFN/OU=Host/L=Catania/CN=voms.ct.infn.it gilda'"
VO_GILDA_VOMS_CA_DN="'/C=IT/O=INFN/CN=INFN CA' '/C=IT/O=INFN/CN=INFN CA'"
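Because the site-info.def is a plain list of shell variable assignments, it can be sourced to check that nothing essential was left empty before configuring the node. The following is a sketch only: missing_siteinfo_vars is a hypothetical helper, and which variables count as required is left to the caller.

```shell
#!/bin/sh
# Hypothetical helper: source a site-info.def in a subshell and print the
# names of the listed variables that are unset or empty.
# $1 = path to the site-info.def, remaining arguments = variable names.
# Exit status: 0 if all are set, 1 if any is missing, 2 if sourcing failed.
missing_siteinfo_vars() {
    file=$1
    shift
    (
        . "$file" || exit 2
        status=0
        for var in "$@"; do
            eval "value=\${$var:-}"
            if [ -z "$value" ]; then
                echo "$var"
                status=1
            fi
        done
        exit $status
    )
}
```

For example, missing_siteinfo_vars /opt/glite/yaim/etc/gilda/<your_site-info.def> CE_HOST SITE_NAME WN_LIST prints whichever of those three names still has no value. Placeholders such as <your_domain> are not detected, since they are non-empty strings.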
WN_LIST=/opt/glite/yaim/etc/gilda/wn-list.conf
The file specified in WN_LIST has to define all your WNs' full hostnames.
WARNING: It's important to configure the WN file (/opt/glite/yaim/etc/gilda/wn-list.conf) before you run the yaim configure command
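Since the WN list must contain full hostnames, a quick sanity check before running the configure step can catch short names. This sketch assumes the file holds one hostname per line and treats any non-empty line without a dot as suspicious; check_wn_list is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: print every non-empty line of a WN list file that
# does not look like a fully qualified host name (contains no dot).
# Assumption: the file lists one host name per line.
# Exit status: 0 if all lines look fine, 1 if any is suspicious,
# 2 if the file cannot be read.
check_wn_list() {
    [ -r "$1" ] || { echo "cannot read $1" >&2; return 2; }
    bad=$(grep -v '^[[:space:]]*$' "$1" | grep -v '\.' || true)
    if [ -n "$bad" ]; then
        echo "$bad"
        return 1
    fi
}
```

Running check_wn_list /opt/glite/yaim/etc/gilda/wn-list.conf prints nothing when every entry contains a domain part.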
• Copy users and groups example files to /opt/glite/yaim/etc/gilda/
cp /opt/glite/yaim/examples/ig-groups.conf /opt/glite/yaim/etc/gilda/
cp /opt/glite/yaim/examples/ig-users.conf /opt/glite/yaim/etc/gilda/
• Append gilda users and groups definitions to /opt/glite/yaim/etc/gilda/ig-users.conf and ig-groups.conf
cat /opt/glite/yaim/etc/gilda/gilda_ig-users.conf >> /opt/glite/yaim/etc/gilda/ig-users.conf
cat /opt/glite/yaim/etc/gilda/gilda_ig-groups.conf >> /opt/glite/yaim/etc/gilda/ig-groups.conf
CE Torque Configuration
• Now we can configure the node:
    /opt/glite/yaim/bin/ig_yaim -c \
      -s /opt/glite/yaim/etc/gilda/<your_site-info.def> \
      -n ig_CE_torque \
      -n BDII_site
* Note that there are two different node type (-n) parameters.
Computing Element Testing
Testing
• Check that the local GRIS and the site BDII are running on the CE and are publishing the right information (CPU, site name and so on):
    ldapsearch -x -h your_ce_hostname -p 2170 -b mds-vo-name=resource,o=grid
    ldapsearch -x -h your_ce_hostname -p 2170 -b mds-vo-name=your_site_name,o=grid
The second ldapsearch will return nothing. See next slide.
    ldapsearch -x -h your_ce_hostname -p 2170 -b mds-vo-name=your_site_name,o=grid
The ldapsearch won't return anything.
Solution:
Edit the following file:
    /opt/glite/yaim/etc/gilda/services/glite-bdii_site
Comment out the following entries, or set the correct values for them, and rerun ig_yaim:
    ...
    BDII_REGIONS=...
    BDII_host-id-1_URL=...
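To tell at a glance whether a query returned anything, the number of LDIF entries in the ldapsearch output can be counted. This is a sketch with hypothetical helper names; it uses the same -h and -p options as the slides (newer OpenLDAP clients may expect -H ldap://host:2170 instead) plus -LLL, which prints plain LDIF without comments.

```shell
#!/bin/sh
# Hypothetical helper: count the entries in LDIF text read from stdin by
# counting the lines that start with "dn:".
count_ldif_entries() {
    grep -c '^dn:' || true
}

# Hypothetical helper: query a GRIS/BDII on port 2170 and print the number
# of entries found under the given base DN.
# $1 = host name, $2 = base DN (e.g. mds-vo-name=resource,o=grid)
bdii_entry_count() {
    ldapsearch -x -LLL -h "$1" -p 2170 -b "$2" | count_ldif_entries
}
```

A result of 0 from bdii_entry_count your_ce_hostname mds-vo-name=your_site_name,o=grid corresponds to the empty answer described on the slide.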
• Become a gilda user:
    # su - gilda001
• Create a file (test.sh) and add the following:
    #!/bin/sh
    sleep 20   # (it's useful to see the job status)
    hostname
• Save it and make the file executable:
    chmod 700 test.sh
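The test script can also be created without an editor by using a here-document. The sketch below writes the same three lines and sets the same permissions; make_test_job is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: write the test job from the slide into the directory
# given as $1 (default: current directory) and make it executable.
make_test_job() {
    dir=${1:-.}
    cat > "$dir/test.sh" <<'EOF'
#!/bin/sh
sleep 20   # keeps the job running long enough to see it in qstat
hostname
EOF
    chmod 700 "$dir/test.sh"
}
```

The quoted 'EOF' delimiter keeps the shell from expanding anything inside the script body while it is being written.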
[gilda001@ce gilda001]$ qsub -q short test.sh
[gilda001@ce gilda001]$ qstat -a
ce.localdomain:
                                                            Req'd  Req'd   Elap
Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
3.wn.localdo    gilda001 short    test.sh      5839  --  --     -- 00:15 R   --
[gilda001@ce gilda001]$ qstat -a
[gilda001@ce gilda001]$
• The job execution has finished and we have to list the output files:
[gilda001@ce gilda001]$ ls
test.sh.e3  test.sh.o3
• And show the results:
[gilda001@ce gilda001]$ cat test.sh.e3    (error file)
[gilda001@ce gilda001]$
[gilda001@ce gilda001]$ cat test.sh.o3    (output file)
wn.localdomain
Log onto the UI:
Hostname -> glite-tutor.ct.infn.it
Username -> capetown01..06
Password -> GridCAP01..06
Grid passphrase -> CAPETOWN
[plt@glite-tutor plt]$ voms-proxy-init --voms gilda
[plt@glite-tutor plt]$ globus-job-run <your-ce-full-hostname>:2119/jobmanager-lcgpbs -q short /bin/hostname
wn.localdomain
[plt@glite-tutor plt]$ glite-wms-job-submit -a -r your-ce-hostname:2119/jobmanager-lcgpbs-gilda hostname.jdl
Selected Virtual Organisation name (from proxy certificate extension): gilda
Connecting to host glite-rb.ct.infn.it, port 7772
Logging to host glite-rb.ct.infn.it, port 9002
********************************************************************************
                               JOB SUBMIT OUTCOME
 The job has been successfully submitted to the Network Server.
 Use edg-job-status command to check job current status. Your job identifier (edg_jobId) is:
 - https://glite-rb.ct.infn.it:9000/Vo-4Ih1s-iDbBPr3rs69GQ
********************************************************************************
[plt@glite-tutor plt]$ glite-wms-job-status https://glite-rb.ct.infn.it:9000/Vo-4Ih1s-iDbBPr3rs69GQ
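The slides do not show the contents of hostname.jdl. A minimal JDL that runs /bin/hostname and brings back its output could look like the one written by the sketch below; the file contents are an assumption based on common gLite JDL usage, and write_hostname_jdl is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: write a minimal JDL file that runs /bin/hostname and
# retrieves its standard output and standard error.
# $1 = output path (default: hostname.jdl in the current directory)
# Assumption: these four attributes are enough for this simple test job.
write_hostname_jdl() {
    cat > "${1:-hostname.jdl}" <<'EOF'
Executable    = "/bin/hostname";
StdOutput     = "std.out";
StdError      = "std.err";
OutputSandbox = {"std.out", "std.err"};
EOF
}
```

With such a file in place, the glite-wms-job-submit command shown above can be run from the same directory.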
FIREWALL SETUP
/etc/sysconfig/iptables
*filter
:INPUT ACCEPT [0:0]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
:RH-Firewall-1-INPUT - [0:0]
-A INPUT -j RH-Firewall-1-INPUT
-A FORWARD -j RH-Firewall-1-INPUT
-A RH-Firewall-1-INPUT -i lo -j ACCEPT
-A RH-Firewall-1-INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
-A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport 22 -j ACCEPT
-A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport 2135 -j ACCEPT
-A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport 2119 -j ACCEPT
-A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport 2170 -j ACCEPT
-A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport 2811 -j ACCEPT
-A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport maui -j ACCEPT
-A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport pbs_mom -j ACCEPT
-A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport pbs_resmon -j ACCEPT
-A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport pbs -j ACCEPT
-A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport 3878:3879 -j ACCEPT
-A RH-Firewall-1-INPUT -m state --state NEW -m udp -p udp --dport 3879 -j ACCEPT
-A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport 3882 -j ACCEPT
-A RH-Firewall-1-INPUT -m state --state NEW -m udp -p udp --dport 1020:1023 -j ACCEPT
-A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport 20000:25000 -j ACCEPT
-A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport 32768:65535 -j ACCEPT
-A RH-Firewall-1-INPUT -m state --state NEW -m udp -p udp --dport 32768:65535 -j ACCEPT
-A RH-Firewall-1-INPUT -p tcp -m tcp --syn -j REJECT
-A RH-Firewall-1-INPUT -j REJECT --reject-with icmp-host-prohibited
COMMIT
IPTABLES STARTUP
/sbin/chkconfig iptables on
/etc/init.d/iptables start
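After editing the rules file it can be useful to confirm that the ports the CE depends on (for example 2119 for the gatekeeper, 2170 for the BDII and 2811 for GridFTP) each have an ACCEPT rule. The sketch below does a plain text search of the file; missing_dports is a hypothetical helper, and it only matches a port or service name written exactly as given, so a port covered only by a range such as 20000:25000 is not recognised.

```shell
#!/bin/sh
# Hypothetical helper: print each port (or service name) from the argument
# list that has no "--dport <port> -j ACCEPT" rule in an iptables file.
# $1 = rules file, remaining arguments = ports or service names.
# Exit status: 0 if every port has a rule, 1 otherwise.
missing_dports() {
    rules=$1
    shift
    status=0
    for port in "$@"; do
        if ! grep -q -e "--dport $port -j ACCEPT" "$rules"; then
            echo "$port"
            status=1
        fi
    done
    return $status
}
```

For example, missing_dports /etc/sysconfig/iptables 22 2119 2170 2811 prints nothing when all four rules are present.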
Troubleshooting
[plt@ui plt]$ globus-job-run your_ce_hostname:2119/jobmanager-lcgpbs -q short /bin/hostname
GRAM Job submission failed because the connection to the server failed (check host and port) (error code 12)
Solution: check that the globus-gatekeeper daemon is up and running on the CE.
[plt@ui plt]$ globus-job-run <ce_hostname>:2119/jobmanager-lcgpbs -q short /bin/hostname
GRAM Job submission failed because authentication failed:
GSS Major Status: Authentication Failed
GSS Minor Status Error Chain:
init.c:499: globus_gss_assist_init_sec_context_async: Error during context initialization
init_sec_context.c:171: gss_init_sec_context: SSLv3 handshake problems
globus_i_gsi_gss_utils.c:888: globus_i_gsi_gss_handshake: Unable to verify remote side's credentials
globus_i_gsi_gss_utils.c:847: globus_i_gsi_gss_handshake: Unable to verify remote side's credentials: Couldn't verify the remote certificate
OpenSSL Error: s3_pkt.c:1046: in library: SSL routines, function SSL3_READ_BYTES: sslv3 alert bad certificate (error code 7)
Solution: the GILDA CA rpm is probably not installed on the CE.
[plt@ui plt]$ edg-gridftp-ls gsiftp://<ce_hostname>/
error the server sent an error response: 530 530 LCMAPS credential mapping NOT successful
error the server sent an error response: 530 530 LCMAPS credential mapping NOT successful
Solution: Check the VO mapping on the CE:
    /opt/edg/etc/lcmaps/gridmapfile
    /opt/edg/etc/lcmaps/groupmapfile
The CE is publishing incorrect information such as:
    GlueCEStateFreeCPUs: 0
    GlueCEStateRunningJobs: 0
    GlueCEStateStatus: Production
    GlueCEStateTotalJobs: 0
    GlueCEStateWaitingJobs: 4444
Run the script:
    /opt/glite/etc/gip/plugin/glite-info-dynamic-scheduler-wrapper
and check whether it reports any errors. Often it doesn't work because the batch system is down or in a locked state. If that is the case, restart the Torque server service:
    /etc/init.d/pbs_server restart
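The symptom above can be detected automatically by searching the published data for the suspicious value. The sketch below assumes that 4444 waiting jobs is a static default that the dynamic scheduler plugin normally overwrites; that reading of the value is an assumption, and has_placeholder_waiting_jobs is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: read LDIF text on stdin and succeed (exit status 0)
# if any CE entry reports exactly 4444 waiting jobs.
# Assumption: 4444 is a static default left in place when the dynamic
# scheduler plugin has not updated the value.
has_placeholder_waiting_jobs() {
    grep -q '^GlueCEStateWaitingJobs: 4444$'
}
```

It can be fed from the query used in the testing section, for example: ldapsearch -x -LLL -h your_ce_hostname -p 2170 -b mds-vo-name=resource,o=grid | has_placeholder_waiting_jobs && echo "dynamic scheduler data looks stale".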
• If a query to the site BDII doesn't show the information about a site, look at the BDII log file:
    /opt/bdii/var/bdii.log
• For example:
    GILDA: ldap_bind: Can't contact LDAP server
Check that:
  – the BDII is up and running (ps aux | grep bdii)
  – the resource URL is in the list file /opt/glite/etc/gip/site-urls.conf
  – the firewall is set up correctly (see FIREWALL SETUP)