Southgrid Technical Meeting Pete Gronbech: February 2005 Birmingham.

Southgrid Technical Meeting

Pete Gronbech: February 2005

Birmingham

Present

• Pete Gronbech
• Chris Brew
• Santanu Das
• Yves Coppens
• Lawrie Lowe

Southgrid Member Institutions

• Oxford
• RAL PPD
• Cambridge
• Birmingham
• Bristol
• Warwick

GridPP PMB Minutes 161, 21st February 2005

1) On the basis of site feedback it is now less likely that any UK site will use Quattor; even the Tier-1A is unlikely to adopt it.
- Only Dublin is showing an interest.

2) Sites have not understood the importance (or urgency) of installing a monitoring tool such as Ganglia. This is being addressed and will be raised at the next Tier-2 Board (wording in the MoUs may limit what we can request).
- This is being restricted by the MoUs. Sites will have to address this through the Tier-2 Board.

Monitoring

• http://www.gridpp.ac.uk/ganglia/
• http://map.gridpp.ac.uk/
• http://lcg-testzone-reports.web.cern.ch/lcg-testzone-reports/cgi-bin/lastreport.cgi
• Configure view UKI

• Discussion of Ganglia installation, which has three parts: gmond on each node, one gmetad to collect the data, and the web interface, which uses rrdtool and php4 running on a web server
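As a rough illustration of how the three parts fit together: gmond and gmetad both publish their current state as XML over TCP (commonly ports 8649 and 8651), and the web interface is one consumer of that data. The sketch below parses a made-up sample in that XML shape and pulls out one metric per host. The host names, values and the helper name metric_by_host are assumptions for illustration and were not shown at the meeting.

```python
import xml.etree.ElementTree as ET

# Made-up sample in the general shape of Ganglia's XML output; the host
# names and values are illustrative only.
SAMPLE_XML = """<GANGLIA_XML VERSION="2.5.7" SOURCE="gmetad">
<GRID NAME="example-grid">
<CLUSTER NAME="example-cluster">
<HOST NAME="node01.example.org" IP="192.0.2.1">
<METRIC NAME="load_one" VAL="0.25" TYPE="float"/>
<METRIC NAME="cpu_num" VAL="2" TYPE="uint16"/>
</HOST>
<HOST NAME="node02.example.org" IP="192.0.2.2">
<METRIC NAME="load_one" VAL="1.50" TYPE="float"/>
</HOST>
</CLUSTER>
</GRID>
</GANGLIA_XML>"""


def metric_by_host(xml_text, metric_name):
    """Return {host name: metric value as a string} for one metric."""
    root = ET.fromstring(xml_text)
    values = {}
    for host in root.iter("HOST"):
        for metric in host.iter("METRIC"):
            if metric.get("NAME") == metric_name:
                values[host.get("NAME")] = metric.get("VAL")
    return values


loads = metric_by_host(SAMPLE_XML, "load_one")
print(loads)
```

In a live setup the same function could be fed the text read from the gmetad XML port instead of the sample string.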

Weekly EGEE reporting 1

Date: 18-02-05
1) A. Status of SouthGrid:
Total: 5 sites
Status at the 14.02: 3 ok / 1 with alerts + 1 maintenance
Status at the 18.02: 4 ok / 1 with alerts
3 sites running LCG 2.3.0, 1 site on LCG 2.2.0 but migrating now (Bham), 1 site not yet joined LCG

Site        OS of main cluster  LCG version  Test/legacy cluster OS  LCG version
Oxford      RH7.3               2_3_0        SL3                     None
Birmingham  RH7.3               2_2_0        SL3                     2_3_0
Cambridge   SL3                 2_3_3
RAL PPD     SL3                 2_3_0        RH7.3                   2_3_0
Bristol     SL3                 2_3_0 *

* Worker nodes not yet commissioned so not actually on grid

Weekly EGEE reporting 2

[Comment: This information should be drawn from the site functional tests. If there is an alert then the detail should go below in B. and/or C.]
B. Site failure reports:
Task 1. Birmingham storage element full; ATLAS required to clear files.
[Comment: To explain the cause of the failures noted in A.]
C. Site maintenance reports:
1. Maintenance at Birmingham, although it may not be in the GOC DB. A parallel SL3 / LCG 2.3.0 cluster has been created and the RH7.3 / LCG 2.2.0 cluster is being phased out today and Monday.
[Comment: To explain the reasons a site is down for maintenance]
D. Tier tasks
Open tasks:
Tasks created in last week:
Tasks closed in last week:
[Comment: The number of open, closed and newly created tasks as given in Savannah]
E. Planning & updates for next week:
1. Birmingham to complete migration to SL3 and LCG 2.3.0.
2. Bristol working with Chris Brew to track down possible UDP blocks on traffic to PBS.
3. Oxford to start SL migration next week with parallel cluster.
4. Cambridge will complete port provided some experiment issues are cleared up. Working on Condor to replace Torque.
5. RAL testing 6.4 TB disk server to go behind the SE, and trying to persuade experiments to get off the RH7.3 service.
[Comment: What is happening in your area in the next week?]
F. Issues & problems:
1. RGMA tomcat4 still refuses to start at Cambridge.
[Comment: Problems or issues reported to you]

Weekly EGEE reporting 3

2) Coordinator report:
======================
A. Activity in the last week:
=============================
1. Helped install SL on other physics cluster, using and developing the PXE/kickstart technique to be used on grid nodes.
Oxford weekly grid meeting
Oxford monthly technical physics committee meeting
Arrange Southern Tier 2 technical meeting to be at Birmingham on 24th February
Holiday!!
B. Plans for next week:
=======================
1. Prepare Oxford SL migration
C. Current issues/problems:
===========================
1. Cambridge teething problems following 2.3.0 upgrade
Birmingham to complete upgrade.
Oxford to continue preparation for SL upgrade.

Status at RAL PPD

• Always on the leading edge of software deployment (Benefit of RAL Tier 1)

• SL3 cluster on 2.3.0, worker nodes increasing.

• Legacy service LCG 2.3.0 on RH7.3 (Winding down)

• CPUs: 24 × 2.4 GHz, 30 × 2.8 GHz
– 100% dedicated to LCG

• 0.5 TB storage
– 100% dedicated to LCG

Status at Cambridge

• LCG 2.3.0 on SL303 has just been installed.

• CPUs: 32 × 2.8 GHz, increasing to 40 soon
– 100% dedicated to LCG

• 3 TB storage
– 100% dedicated to LCG

Status at Bristol

• Status
– LCG involvement limited (“black dot”) for previous six months due to lack of manpower
– New resources, posts now on the horizon!

• Existing resources
– 80-CPU BaBar farm to be switched to LCG
– ~2 TB storage resources to be LCG-accessible
– LCG head nodes installed by SouthGrid support team with 2.3.0

• New resources
– Funding now confirmed for large University investment in hardware
– Includes CPU, high quality and scratch disk resources

• Humans
– New system manager post (RG) being filled
– New SouthGrid support / development post (GridPP / HP) being filled
– HP keen to expand industrial collaboration – suggestions?

Status at Birmingham

• SL3 has just been installed on the GridPP front-end nodes; yaim was used to install LCG-2_3_0

• CPUs: 22 × 2.0 GHz Xeon (+48 soon)
– 100% LCG

• 2 TB storage awaiting “Front End Machines”
– 100% LCG

• Southgrid’s “Hardware Support Post”: Yves Coppens appointed.

Status at Oxford

• Currently LCG 2.3.0 on RH7.3
• Parallel SL3 install, will use yaim to install 2.3.0 asap
• CPUs: 80 × 2.8 GHz
– 100% LCG
• 1.5 TB storage – upgrade to 3 TB planned
– 100% LCG

dhcpd.conf

# /etc/dhcpd.conf.ngexample - a DHCP daemon configuration file example
# for dhcpd 2.0

# distribute an IP address only if the node is known
deny unknown-clients;
# the server will not reply to the unknown clients; in this way
# it is possible to have a second DHCP server
not authoritative;
option domain-name "physics.ox.ac.uk";

# These 3 lines are needed for the installation via PXE
option dhcp-class-identifier "PXEClient";
option vendor-encapsulated-options 01:04:00:00:00:00:ff;
filename "pxelinux.0";

subnet 163.1.5.0 netmask 255.255.255.0 {
    option routers 163.1.5.254;
    option domain-name-servers 163.1.2.1;

    host t2slwn01 {
        hardware ethernet 00:30:48:72:F3:61;
        fixed-address 163.1.5.236;
        next-server 163.1.5.240;
    }
}

/tftpboot/pxelinux.cfg

[root@t2lcfg pxelinux.cfg]# ls -la /tftpboot/pxelinux.cfg
lrwxrwxrwx 1 root root 31 Feb 15 12:47 A30105EC -> hosts/t2slwn01.physics.ox.ac.uk
lrwxrwxrwx 1 root root 11 Dec 8 17:09 A30105ECold -> sl-kick.cfg
lrwxrwxrwx 1 root root 31 Feb 15 12:47 A30105ED -> hosts/t2slwn02.physics.ox.ac.uk
lrwxrwxrwx 1 root root 11 Dec 10 14:41 A30105EDold -> sl-kick.cfg
lrwxrwxrwx 1 root root 31 Feb 15 13:43 A30105EE -> hosts/t2slwn03.physics.ox.ac.uk
-rwxr-xr-x 1 root root 414 Feb 15 12:45 ack.cgi
-rw-r--r-- 1 apache apache 631 Jul 21 2004 boot-hd.cfg
-rwxr-xr-x 1 root root 1140 Feb 15 12:47 create-hash-links.pl
lrwxrwxrwx 1 apache apache 11 May 7 2004 default -> boot-hd.cfg
drwxr-xr-x 2 apache apache 4096 Feb 21 15:59 hosts
-rw-r--r-- 1 apache apache 194 Oct 24 2003 lcfg-install-62.cfg
-rw-r--r-- 1 apache apache 238 May 17 2004 lcfg-install-73-2.4.20.cfg
-rw-r--r-- 1 apache apache 218 May 13 2004 lcfg-install-73.cfg
-rw-r--r-- 1 apache apache 209 Oct 24 2003 lcfg-install-nointeract-62.cfg
-rw-r--r-- 1 apache apache 253 May 17 2004 lcfg-install-nointeract-73-2.4.20.cfg
-rw-r--r-- 1 apache apache 233 May 7 2004 lcfg-install-nointeract-73.cfg
-rw-r--r-- 1 root root 277 May 13 2004 lcfg-install-nointeract-bigkernel-73.cfg
-rw-r--r-- 1 root root 279 May 13 2004 lcfg-install-nointeract-custom-73.cfg
-rwxr-xr-x 1 root root 182 Feb 15 12:45 Makefile
drwxr-xr-x 2 root root 4096 Feb 15 12:52 oldlinks
-rw-r--r-- 1 root root 758 Dec 9 17:00 sl-kick.cfg
-rwxr-xr-x 1 root root 1063 Feb 15 12:45 swing
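The link names in this listing (A30105EC and so on) are the clients' IPv4 addresses written as eight upper-case hexadecimal digits, which is the per-host file name pxelinux looks for. A minimal sketch of that conversion follows; the function name pxelinux_hex_name is made up for illustration.

```python
def pxelinux_hex_name(ip):
    """Convert a dotted-quad IPv4 address to the 8-digit upper-case hex name."""
    octets = [int(part) for part in ip.split(".")]
    if len(octets) != 4 or any(not 0 <= octet <= 255 for octet in octets):
        raise ValueError(f"not a dotted-quad IPv4 address: {ip!r}")
    return "".join(f"{octet:02X}" for octet in octets)


# 163.1.5.236 is the fixed address given to t2slwn01 in dhcpd.conf.
print(pxelinux_hex_name("163.1.5.236"))
```

For 163.1.5.236 this gives A30105EC, matching the first link in the listing.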

boot-hd.cfg

[root@t2lcfg pxelinux.cfg]# cat boot-hd.cfg
# This is the default pxelinux cfg file
# It by default drops onto the harddisk but otherwise
# various rescue and diagnostic utilities can be used.
default local
prompt 1
# timeout after 6 seconds. (1/10s of seconds)
timeout 60

# Pop up a small menu, this should be changed to correspond to
# the options below.
display messages/boot-hd.msg

label local
  localboot 0

label memtest+
  kernel memdisk
  append initrd=diagnostics/memtestp-1.15.img

label cpuburn
  kernel memdisk
  append initrd=diagnostics/cpuburn-1.00.img

label nuke
  kernel memdisk
  append initrd=diagnostics/book-and-nuke.img
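In the directory listing, default is a symlink to boot-hd.cfg. That matters because pxelinux falls back to it: after any UUID- or MAC-based names (omitted here), it tries the hex form of the client IP, then drops one trailing hex digit at a time, and finally tries default. The sketch below lists that order; the function name is hypothetical.

```python
def pxelinux_search_order(ip):
    """List the IP-based config file names pxelinux tries, most specific first."""
    hex_name = "".join(f"{int(part):02X}" for part in ip.split("."))
    # Full 8-digit name first, then progressively shorter prefixes.
    names = [hex_name[:length] for length in range(len(hex_name), 0, -1)]
    names.append("default")
    return names


for name in pxelinux_search_order("163.1.5.236"):
    print(name)
```

A node with no hex-named link of its own therefore ends up on default, and so boots from its local disk.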

sl-kick.cfg

[root@t2lcfg pxelinux.cfg]# cat sl-kick.cfg
# This is the default pxelinux cfg file
# It by default drops onto the harddisk but otherwise
# various rescue and diagnostic utilities can be used.
default kickstart
prompt 1
# timeout after 6 seconds. (1/10s of seconds)
timeout 60

# Pop up a small menu, this should be changed to correspond to
# the options below.
#display messages/boot-hd.msg

label kickstart
  kernel SL/vmlinuz
  append initrd=SL/initrd.img keymap=uk devfs=nomount ramdisk_size=16384 ksdevice=link ks=nfs:163.1.5.240:/opt/local/linux/SL303/ks/

hosts subdir link script

[root@t2lcfg pxelinux.cfg]# cat Makefile
# Make file to update all the hash to hostname links.
# This should be run after the dhcpd file is updated
# or the DNS is changed.

all:
	/usr/local/sbin/create-hash-links.pl

[root@t2lcfg pxelinux.cfg]# cat /usr/local/sbin/create-hash-links.pl
#!/usr/bin/perl -w

use strict ;
use Socket ;

my $dhcpd = "/etc/dhcpd.conf" ;
my $tftp = "/tftpboot/pxelinux.cfg" ;

my @ips ;

open (DHCP,"<$dhcpd") or die "Could not open $dhcpd: $!\n" ;

print "Collecting a list of ip address from $dhcpd\n" ;
while ( <DHCP> ) {
    # Only match fixed-address entries that are not preceded by a '#'.
    if ( /^[^#]*fixed-address\s+(\S+)\s*;/ ) {
        my $fixed = $1 ;
        # Check if it is a host name and if so we must convert it to
        # ip address.
        if ( $fixed =~ m/^.*ox\.ac\.uk$/ ) {
            print "Converting hostname $fixed to ip address: " ;
            $fixed = inet_ntoa(inet_aton($fixed) ) or die "fixed= $fixed\n" ;
            print "$fixed\n" ;
        }
        push(@ips,$fixed) ;
    }
}

# Now set up the symlinks IF they are not already there.
foreach my $ip ( @ips ) {
    my $hexip = sprintf("%02X%02X%02X%02X",split('\.',$ip)) ;
    my $hostname = gethostbyaddr(inet_aton($ip), AF_INET) or die "No reverse look up for $ip\n" ;
    # Create a symlink from the hostname to default config.
    symlink('../boot-hd.cfg',$tftp.'/hosts/'.$hostname) unless ( -l $tftp.'/hosts/'.$hostname ) ;
    symlink('hosts/'.$hostname,$tftp.'/'.$hexip) ;
}

close (DHCP) ;
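The first half of the script, collecting every fixed-address value from dhcpd.conf while skipping commented-out entries, can be sketched in Python as follows. This is not the script used at Oxford. The sample text is partly hypothetical (only the t2slwn01 entry comes from the dhcpd.conf slide), and treating everything after a '#' as a comment is a simplification.

```python
import re

# Matches "fixed-address <value>;" and captures the value.
FIXED_ADDRESS = re.compile(r"\bfixed-address\s+([^\s;]+)\s*;")


def fixed_addresses(conf_text):
    """Return all fixed-address values found outside '#' comments."""
    found = []
    for line in conf_text.splitlines():
        active = line.split("#", 1)[0]  # drop anything after a '#'
        found.extend(FIXED_ADDRESS.findall(active))
    return found


# Only the first host entry is taken from the slides; the rest is made up.
SAMPLE = """\
host t2slwn01 { hardware ethernet 00:30:48:72:F3:61; fixed-address 163.1.5.236; next-server 163.1.5.240; }
# host spare { fixed-address 163.1.5.250; }
host t2slwn02 {
    fixed-address t2slwn02.physics.ox.ac.uk;
}
"""

addresses = fixed_addresses(SAMPLE)
print(addresses)
```

Entries that come back as host names would still need a DNS lookup before the hex link name can be computed, as the Perl script does with inet_aton.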

kickstart directory

[root@t2lcfg pxelinux.cfg]# cd /opt/local/linux/SL303/
[root@t2lcfg SL303]# ls -la
total 20
drwxr-xr-x 5 root root 4096 Dec 9 17:02 .
drwxr-xr-x 5 root root 4096 Nov 26 16:14 ..
drwxr-xr-x 3 root root 4096 Oct 1 00:14 images
drwxr-xr-x 2 root root 4096 Feb 15 15:02 ks
drwxr-xr-x 5 root root 4096 Oct 1 00:14 SL
[root@t2lcfg SL303]# cd ks
[root@t2lcfg ks]# ls -la
total 16
drwxr-xr-x 2 root root 4096 Feb 15 15:02 .
drwxr-xr-x 5 root root 4096 Dec 9 17:02 ..
lrwxrwxrwx 1 root root 15 Dec 9 12:10 163.1.5.236-kickstart -> anaconda-ks.cfg
lrwxrwxrwx 1 root root 15 Dec 10 14:40 163.1.5.237-kickstart -> anaconda-ks.cfg
lrwxrwxrwx 1 root root 15 Dec 10 14:40 163.1.5.238-kickstart -> anaconda-ks.cfg
lrwxrwxrwx 1 root root 14 Feb 14 16:40 163.1.5.93-kickstart -> SL-Clar-ks.cfg
-rw-r--r-- 1 root root 1551 Feb 15 15:01 anaconda-ks.cfg
-rw-r--r-- 1 root root 1567 Feb 15 14:10 SL-Clar-ks.cfg
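The ks= argument in sl-kick.cfg ends in a slash, and the symlinks in the ks directory are named after client IP addresses. This suggests the installer appends "<client IP>-kickstart" to a directory-style path. The sketch below shows that resolution as an assumption drawn from the listing rather than a statement of the installer's exact rules; the function name is hypothetical.

```python
def kickstart_location(ks_arg, client_ip):
    """Split an 'nfs:server:/path' ks argument into (server, file path)."""
    scheme, server, path = ks_arg.split(":", 2)
    if scheme != "nfs":
        raise ValueError(f"only nfs: arguments are handled here, got {scheme!r}")
    if path.endswith("/"):
        # Directory given: assume the per-client file name is appended.
        path += f"{client_ip}-kickstart"
    return server, path


server, path = kickstart_location(
    "nfs:163.1.5.240:/opt/local/linux/SL303/ks/", "163.1.5.236"
)
print(server, path)
```

With this layout, several nodes can share one kickstart file simply by pointing their IP-named symlinks at it, as the three worker nodes do with anaconda-ks.cfg.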

kickstart file 1

[root@t2lcfg ks]# cat anaconda-ks.cfg
# Kickstart file automatically generated by anaconda.
#network --device eth1 --bootproto dhcp
reboot

install
lang en_US.UTF-8
langsupport --default en_US.UTF-8 en_US.UTF-8
keyboard uk
mouse genericwheelps/2 --device psaux
xconfig --card "ATI Mach64" --videoram 8192 --hsync 31.5-67 --vsync 50-75 --resolution 1280x1024 --depth 24 --startxonboot --defaultdesktop gnome
#network --device eth0 --bootproto static --ip 163.1.5.236 --netmask 255.255.255.0 --gateway 163.1.5.254 --nameserver 163.1.2.1 --hostname t2slwn01
network --bootproto dhcp
nfs --server 163.1.5.240 --dir /opt/local/linux/SL303/
rootpw --iscrypted encryptedpasswdhere
firewall --disabled
authconfig --enableshadow --enablemd5
timezone Europe/London
bootloader --location=mbr

Kickstart file 2

# The following is the partition information you requested
# Note that any partitions you deleted are not expressed
# here so unless you clear all partitions first, this is
# not guaranteed to work
zerombr yes
clearpart --all
part / --fstype "ext3" --size=6000
part /usr --fstype "ext3" --size=6000
part swap --size=2000
part /home --fstype "ext3" --size=100 --grow

%packages
@ office
@ engineering-and-scientific
@ editors
@ xemacs
@ base-x
@ graphics
@ misc-sl
@ text-internet
@ kde-desktop
@ gnome-desktop
@ dialup
@ yum
@ openafs-client
@ authoring-and-publishing
@ printing
@ sound-and-video
@ graphical-internet
kernel
kernel-module-openafs-2.4.21-20.ELsmp
kernel-smp
pine
grub
gv

%post

# Change link on server to boot from hard disk
wget -q t2lcfg.physics.ox.ac.uk/cgi-bin/ack.cgi
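As a quick check on the partition scheme above: the fixed partitions add up to 14000 MB, and /home (minimum 100 MB, with --grow) takes whatever is left. The sketch below does that arithmetic for a hypothetical 40000 MB disk. The disk size is an assumption, and partition alignment and filesystem overhead are ignored.

```python
# Sizes in MB, taken from the "part" lines of the kickstart file.
FIXED_MB = {"/": 6000, "/usr": 6000, "swap": 2000}
HOME_MIN_MB = 100


def home_size_mb(disk_mb):
    """Approximate size of the growing /home partition on a disk of disk_mb."""
    remaining = disk_mb - sum(FIXED_MB.values())
    if remaining < HOME_MIN_MB:
        raise ValueError(f"disk of {disk_mb} MB is too small for this layout")
    return remaining


# Hypothetical disk size, for illustration only.
print(home_size_mb(40000))
```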

ack.cgi

[root@t2lcfg ks]# more /var/www/cgi-bin/ack.cgi
#!/usr/bin/perl

use Socket ;

# Configuration file to boot from HD
$boothd = "../boot-hd.cfg";

print "Content-type: text/plain\n\n";

$point_address = $ENV{'REMOTE_ADDR'};
$hostname = gethostbyaddr(inet_aton($point_address), AF_INET) ;

system ("cd /tftpboot/pxelinux.cfg/hosts ; ln -fs $boothd /tftpboot/pxelinux.cfg/hosts/$hostname");

print "$hostname is now configured to boot from $boothd\n";
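The key step in ack.cgi is repointing hosts/<hostname> at boot-hd.cfg, so that the node's next PXE boot goes to its local disk instead of reinstalling. A possible Python sketch of that step follows. It is not the script used at Oxford: it avoids the shell, checks the host name before using it in a path, and swaps the link with a rename. The function name, the host-name check and the ../sl-kick.cfg starting link in the demo are assumptions.

```python
import os
import re
import tempfile

# Conservative host-name check: letters, digits, dots and hyphens only.
HOSTNAME_RE = re.compile(r"[A-Za-z0-9]([A-Za-z0-9.-]*[A-Za-z0-9])?")


def point_host_at(hosts_dir, hostname, target="../boot-hd.cfg"):
    """Make hosts_dir/hostname a symlink to target, replacing any old link."""
    if not HOSTNAME_RE.fullmatch(hostname):
        raise ValueError(f"refusing suspicious host name: {hostname!r}")
    link = os.path.join(hosts_dir, hostname)
    tmp = link + ".tmp"
    if os.path.lexists(tmp):
        os.remove(tmp)
    os.symlink(target, tmp)
    os.replace(tmp, link)  # rename over the old link in a single step
    return link


# Demo in a temporary directory standing in for /tftpboot/pxelinux.cfg/hosts.
with tempfile.TemporaryDirectory() as hosts_dir:
    name = "t2slwn01.physics.ox.ac.uk"
    os.symlink("../sl-kick.cfg", os.path.join(hosts_dir, name))
    point_host_at(hosts_dir, name)
    final_target = os.readlink(os.path.join(hosts_dir, name))

print(final_target)
```

Renaming a temporary link over the old one means a node booting at that moment sees either the old or the new configuration, never a missing file.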

Afternoon hands on session

• With Chris Brew's help, fixed RGMA problems at Cambridge and Birmingham

• Fixed APEL at Birmingham