How to get started London Tier2 O. van der Aa. 16/04/2007 Running the LT2 UK HEP Grid: GridPP, One...

How to get started: London Tier2

O. van der Aa

16/04/2007 Running the LT2

UK HEP Grid: GridPP, one T1, four T2

• ScotGrid: Durham, Edinburgh, Glasgow
• NorthGrid: Daresbury, Lancaster, Liverpool, Manchester, Sheffield
• SouthGrid: Birmingham, Bristol, Cambridge, Oxford, RAL PPD
• London Tier2: Brunel, Imperial, QMUL, RHUL, UCL

Imperial College

Spread across two sites:
• Physics Department
  – 465 KSI2K (dual-core Intel Woodcrest) running SGE 6
  – 60 TB running dCache
• Computing Department
  – 177 KSI2K (Opterons) running SGE 6
  – Storage: uses the Physics Department's
• All running RHEL4 and RHEL3 and using the LCG tarball
• Local physicists: CMS, LHCb, DZero

• 324 KSI2K across two clusters. Two CEs running PBS/Maui
• 6.5 TB of storage running DPM
• Complex networking situation: the Grid is in a demilitarized zone limited to 200 Mb/s
• Local physicists are mainly from CMS

Biggest cluster in London. Mixture of Athlons, Xeons and Opterons.

• Total of 1200 KSI2K running separate PBS/Maui
• Cluster shared by Astronomy, HEP and Materials Science
• Storage: 18 TB running poolfs and DPM
• Expect to use worker node local disk with Lustre: 400 TB
• Local community is ATLAS oriented

• Separate pbs/maui from ce

• 160 KSI2K

• 8TB running DPM

• ATLAS/ILC community

• Running slc3

• Will soon buy 265 KSI2K and 136 TB, expected around April

Situation similar to Imperial:
• Physics department
  – 24 KSI2K and ~1 TB
• Computing department
  – Shared cluster with 50 KSI2K
  – 1.5 TB running DPM
• Running CentOS 3, SGE

Resource Summary

[Chart: London KSI2K per quarter, 1Q05 to 4Q06, by site (UCL, RHUL, QMUL, Imperial, Brunel); vertical axis 0 to 3000]

CPU: 2.5 MSI2K

Storage: 94 TB

[Pie chart: share by site (BRUNEL, IC, QMUL, RHUL, UCL); slice labels 64%, 1%, 9%, 19%, 7%]

How are the resources used ?

Currently around 70%

[Chart: London Grid delivered MSI2K*hours per month, Sep-05 to Feb-07, by VO (alice, atlas, cms, lhcb, cdf, babar, dzero, biomed, other grid); vertical axis 0 to 1.4]

[Chart: CPU usage in London Grid per month, Sep-05 to Feb-07; vertical axis 0% to 80%]

How to contact us

• Our mailing list: lt2-technical@imperial.ac.uk
  – The coordinator: o.van-der-aa@ic.ac.uk
  – The T2 manager: d.colling@ic.ac.uk

• Via GGUS: http://www.ggus.org
  – Specify UKI-LT2 and the university in the subject field
  – Use it for any specific problem once you are set up

• Our wiki: http://wiki.gridpp.ac.uk/wiki/London_Tier2
  – Describes the infrastructure
  – Gives links to monitoring pages

How to start ?

• “The Three Steps” …

1. Register for a certificate (as explained in the NGS talk).
   • https://ca.grid-support.ac.uk/

2. With your certificate, register with the ltwo virtual organisation.
   • https://voms.gridpp.ac.uk:8443/voms/ltwo/

3. Get access to a user interface.
   • Ask via the lt2-technical mailing list. Each university in the LT2 has a user interface.

Summary, main Grid Components

• The User Interface (UI) is where the user sits to submit jobs.
• The Virtual Organisation Membership Service (VOMS) is involved in authenticating and authorizing users.
• The Information System (IS) publishes each site's information (CE queue names, SE contact points, number of waiting jobs, number of running jobs, etc.).
• The Workload Management System (WMS) takes the user's job, finds a compatible site and submits the job to that site's CE.
• The Computing Element (CE) is the entry point for jobs into the computing cluster.
• The Storage Element (SE) is the equivalent of the CE, but for data.

The Main Grid Components

[Diagram of the main Grid components; only the labels "wms" and "voms" survive]

Information System

• Tree structure showing all available resources in the Grid.
  – Implemented as an LDAP server
  – Top-level view at lcg-bdii.gridpp.ac.uk, port 2170
  – Worth a look with the JXplorer LDAP browser: http://www.jxplorer.org/
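The same tree can also be queried from the command line instead of a graphical browser. The following is a minimal sketch, assuming the standard OpenLDAP `ldapsearch` client is installed and that the BDII publishes under the usual GLUE base DN `o=grid`; the helper names `glue_ce_filter` and `list_ces` are made up for this example.

```shell
#!/bin/sh
# Host and port are the top-level BDII quoted on the slide.
BDII_HOST="lcg-bdii.gridpp.ac.uk"
BDII_PORT=2170

# Build an LDAP filter selecting the CE entries that accept a given VO.
# GlueCE and GlueCEAccessControlBaseRule come from the GLUE 1.x schema.
glue_ce_filter() {
    printf '(&(objectClass=GlueCE)(GlueCEAccessControlBaseRule=VO:%s))' "$1"
}

# Print the unique ID of every CE that supports the VO given as $1.
# Assumptions: anonymous simple bind (-x) and base DN o=grid.
list_ces() {
    ldapsearch -x -LLL -H "ldap://${BDII_HOST}:${BDII_PORT}" -b "o=grid" \
        "$(glue_ce_filter "$1")" GlueCEUniqueID
}

# Example (needs network access to the BDII, so it is not run here):
#   list_ces ltwo
```

Keeping the filter in its own function makes it easy to inspect the query before sending it to the server.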

Submitting your first job

• Get a login on a user interface
  – In this case gfe03.hep.ph.ic.ac.uk

• Initialize your proxy
  – voms-proxy-init --voms ltwo

• Prepare your JDL (Job Description Language) file
  – The name of the executable
  – The files you want to transfer before the job starts
  – Your constraints, for example:
    • How much CPU time you need
    • Which subset of resources you want to use

The files

• Hello.jdl

    Executable = "/bin/sh";
    Arguments = "Hello.sh";
    StdOutput = "std.out";
    StdError = "std.err";
    InputSandbox = {"Hello.sh"};
    OutputSandbox = {"std.out", "std.err"};

• Hello.sh

    #!/bin/sh
    echo 'Hello LT2 Workshop'
    whoami
    hostname

Submitting

• Finding matching resources
  – edg-job-list-match Hello.jdl

***************************************************************************
                     COMPUTING ELEMENT IDs LIST
The following CE(s) matching your job requirements have been found:

*CEId*
ce00.hep.ph.ic.ac.uk:2119/jobmanager-sge-30min
ce00.hep.ph.ic.ac.uk:2119/jobmanager-sge-72hr
ce1.pp.rhul.ac.uk:2119/jobmanager-pbs-ltwogrid
dgc-grid-35.brunel.ac.uk:2119/jobmanager-lcgpbs-short
gw39.hep.ph.ic.ac.uk:2119/jobmanager-lcgpbs-ltwo
mars-ce2.mars.lesc.doc.ic.ac.uk:2119/jobmanager-sge-10min
mars-ce2.mars.lesc.doc.ic.ac.uk:2119/jobmanager-sge-12hr
mars-ce2.mars.lesc.doc.ic.ac.uk:2119/jobmanager-sge-1hr
mars-ce2.mars.lesc.doc.ic.ac.uk:2119/jobmanager-sge-24hr
mars-ce2.mars.lesc.doc.ic.ac.uk:2119/jobmanager-sge-30min
mars-ce2.mars.lesc.doc.ic.ac.uk:2119/jobmanager-sge-3hr
mars-ce2.mars.lesc.doc.ic.ac.uk:2119/jobmanager-sge-6hr
mars-ce2.mars.lesc.doc.ic.ac.uk:2119/jobmanager-sge-72hr
gw-2.ccc.ucl.ac.uk:2119/jobmanager-sge-default
***************************************************************************

Submitting

• The actual submission
  – edg-job-submit Hello.jdl

*********************************************************************************************
                               JOB SUBMIT OUTCOME
The job has been successfully submitted to the Network Server.
Use edg-job-status command to check job current status. Your job identifier (edg_jobId) is:

- https://gfe01.hep.ph.ic.ac.uk:9000/izz75vlTThizfJVP-7VGdQ

*********************************************************************************************

This is your job identifier. You need to keep track of it.
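Since the identifier is needed by every later command, it helps to capture it at submission time. The sketch below pulls the https URL out of the submission output with `grep`; the helper name `extract_job_id` and the `jobids.txt` file are hypothetical, and the sample text is copied from the output shown on this slide.

```shell
#!/bin/sh
# Read edg-job-submit output on stdin and print the first job identifier,
# i.e. the first https URL. The character class deliberately excludes '*'
# so the decorative border never ends up in the identifier.
extract_job_id() {
    grep -o 'https://[A-Za-z0-9._:/_-]*' | head -n 1
}

# Sample text taken from the submission output on the slide.
sample='The job has been successfully submitted to the Network Server.
Your job identifier (edg_jobId) is:
 - https://gfe01.hep.ph.ic.ac.uk:9000/izz75vlTThizfJVP-7VGdQ'

job_id=$(printf '%s\n' "$sample" | extract_job_id)
echo "$job_id"

# Typical use on a UI (not run here):
#   edg-job-submit Hello.jdl | extract_job_id >> jobids.txt
```

Appending each identifier to a file gives a simple record of submitted jobs that can be fed back to the status and output commands one line at a time.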

Checking the state of your job

• edg-job-status [your job id]

Getting the result

• edg-job-get-output [jobid]
  – Will store your OutputSandbox files in /tmp/
    • std.out
    • std.err
  – Content of std.out:

    ---------
    Hello LT2 Workshop
    lt2-ltwo007
    mars092.mars.lesc.doc.ic.ac.uk
    ---------

JDL: more complex requirements

• Specify a CE in a domain
  – Requirements = RegExp(".*mars.lesc.doc.ic.ac.uk.*$", other.GlueCEUniqueID);

• Require some CPU time (in minutes)
  – Requirements = RegExp(".*mars.lesc.doc.ic.ac.uk.*$", other.GlueCEUniqueID) && (other.GlueCEPolicyMaxCPUTime > 600);

• Require some CPU*KSI2K time
  – Requirements = (other.GlueCEPolicyMaxCPUTime > 30 * 500 / other.GlueHostBenchmarkSI00);

More on how to master JDL at http://tinyurl.com/28oje9
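The last requirement scales a CPU time request by the speed of the host: a job that needs 30 minutes on a 500 SpecInt2000 reference machine needs 30 * 500 / SI00 minutes on a host rated at SI00. A small worked sketch of that arithmetic follows; the function name `scaled_cpu_minutes` and the host ratings 1500, 400 and 250 are illustrative values, not figures from the talk.

```shell
#!/bin/sh
# Minutes of CPU time needed on a host rated at $3 SpecInt2000 for a job
# that needs $1 minutes on a reference host rated at $2 SpecInt2000.
# Integer arithmetic, rounded up so the need is never underestimated.
scaled_cpu_minutes() {
    ref_minutes=$1
    ref_si00=$2
    host_si00=$3
    echo $(( (ref_minutes * ref_si00 + host_si00 - 1) / host_si00 ))
}

scaled_cpu_minutes 30 500 1500   # faster host: 10 minutes
scaled_cpu_minutes 30 500 400    # 37.5 rounded up: 38 minutes
scaled_cpu_minutes 30 500 250    # slower host: 60 minutes
```

The JDL expression performs the same scaling per site at match time, so a queue is selected only if its MaxCPUTime exceeds the scaled value for that site's hosts.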

Data Management

• In the previous example
  – All files are transferred via the sandbox
  – The sandbox is limited to 100 MB

• Clearly something additional is required to transfer bigger datasets

Data Management tools: lcg-utils, gfal

Catalogue services

• A “file” is identified by a GUID
• Several aliases (LFNs) can be attached to the GUID
• One “file” can be located at several places (PFNs)

Uploading a file to a storage element (SE)

• Finding the list of SEs
  – lcg-infosites --vo dteam se
  – If you don't specify an SE, the one closest to the cluster will be used

• Uploading
  – lcg-cr --vo dteam -d gfe02.hep.ph.ic.ac.uk file:myfile.dta
  – Returns: guid:ec362b1a-6f88-4860-a72b-68d4ad55eb59

GUID ?

• GUIDs are not easy for humans to remember

• You can give an alias (LFN) to a GUID.
  – lcg-aa --vo dteam guid:ec362b1a-6f88-4860-a72b-68d4ad55eb59 lfn:/grid/home/lt2wk.dta

• You can give an alias when registering the file
  – lcg-cr --vo dteam -d gfe02.hep.ph.ic.ac.uk file:myfile.dta -l lfn:/grid/dteam/lt2wk.dta

More on moving files

• Copying files back to your UI
  – lcg-cp --vo dteam lfn:/grid/dteam/lt2wk.dta file:`pwd`/myfile.dta

• Replicating files somewhere else
  – lcg-rep -d se1.pp.rhul.ac.uk --vo dteam lfn:/grid/dteam/lt2wk.dta

Listing files

• Listing replicas:
  – lcg-lr --vo [yourvo] lfn:<name>

• Listing the GUID:
  – lcg-lg --vo [yourvo] lfn:<name>

• Example
  – lcg-lr --vo dteam lfn:/grid/dteam/lt2wk.dta
    • srm://gfe02.hep.ph.ic.ac.uk/pnfs/hep.ph.ic.ac.uk/data/dteam/generated/2007-04-16/filec6b6fba2-c854-4ee6-a0db-68bd6cd6e0dd
    • srm://se1.pp.rhul.ac.uk/dpm/pp.rhul.ac.uk/home/dteam/generated/2007-04-16/file5642a5ea-b63f-411a-b56c-84a75137d716
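Each replica URL starts with the host name of the storage element that holds the copy, which is often all you need to know. As a rough sketch, the host can be cut out of the `lcg-lr` output with `sed`; the helper name `surl_host` is made up here, and the two sample lines are the replicas from the example above.

```shell
#!/bin/sh
# Print the storage element host name of each srm:// URL read from stdin.
# Lines that are not srm:// URLs are ignored.
surl_host() {
    sed -n 's#^srm://\([^/:]*\).*#\1#p'
}

replicas='srm://gfe02.hep.ph.ic.ac.uk/pnfs/hep.ph.ic.ac.uk/data/dteam/generated/2007-04-16/filec6b6fba2-c854-4ee6-a0db-68bd6cd6e0dd
srm://se1.pp.rhul.ac.uk/dpm/pp.rhul.ac.uk/home/dteam/generated/2007-04-16/file5642a5ea-b63f-411a-b56c-84a75137d716'

hosts=$(printf '%s\n' "$replicas" | surl_host)
echo "$hosts"

# Typical use on a UI (not run here):
#   lcg-lr --vo dteam lfn:/grid/dteam/lt2wk.dta | surl_host
```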

Sending your job where your files are

• In your JDL
  – InputData = {"lfn:/grid/dteam/lt2wk.dta"};
  – DataAccessProtocol = {"file", "srm", "gridftp"};

• Then you have to use the lcg- commands to copy the files

• Alternatively you can link against the gfal library and stream the data (man gfal).
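Putting the pieces together, a data-aware job could look roughly like the sketch below: the JDL names the input file so the job is steered to a site holding a replica, and the job script fetches that file with `lcg-cp` once it starts. The file names `Data.jdl` and `Data.sh` are hypothetical; the attributes and the `lcg-cp` invocation are the ones shown in this talk. The script only writes the two files into a scratch directory and submits nothing.

```shell
#!/bin/sh
# Write a data-aware variant of the Hello job into a scratch directory.
workdir=$(mktemp -d)

# Job script: copy the file to the worker node by LFN, then list it.
# The quoted 'EOF' keeps the backticks literal until the job runs.
cat > "$workdir/Data.sh" <<'EOF'
#!/bin/sh
lcg-cp --vo dteam lfn:/grid/dteam/lt2wk.dta file:`pwd`/lt2wk.dta
ls -l lt2wk.dta
EOF

# JDL: same shape as Hello.jdl plus the two data attributes.
cat > "$workdir/Data.jdl" <<'EOF'
Executable = "/bin/sh";
Arguments = "Data.sh";
StdOutput = "std.out";
StdError = "std.err";
InputSandbox = {"Data.sh"};
OutputSandbox = {"std.out", "std.err"};
InputData = {"lfn:/grid/dteam/lt2wk.dta"};
DataAccessProtocol = {"file", "srm", "gridftp"};
EOF

echo "Wrote $workdir/Data.jdl and $workdir/Data.sh"
```

From that directory the job would then be matched and submitted with the same edg-job-list-match and edg-job-submit commands used for Hello.jdl.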

Conclusions

• In London you have
  – Around 2500 CPUs
  – 94 TB
  – All available through the ltwo VO

• To find out more on how to use it
  – http://www.gridpp.ac.uk/deployment/users/
  – Get registered with the ltwo VO.

• See the GANGA talk for higher-level tools that submit jobs without your having to write JDL.

Thanks to all of the Team

M. Aggarwal, D. Colling, A. Chamberlin, S. George, K. Georgiou, M. Green, W. Hay, P. Hobson, P. Kyberd, A. Martin, G. Mazza, D. Rand, G. Rybkine, G. Sciacca, K. Septhon, B. Waugh

LT2

BACKUP

Listing the SE. Removing files

• lcg-infosites --vo ltwo se

• Don't forget to remove your files
  – lcg-del

RLS remember file location

VOMS: Virtual Organization Membership Service.

• Provides information on the user's relationship with her Virtual Organization: her groups, roles and capabilities.
• Provides the list of users for a given VO

GridLoad: tool to monitor the sites

• Updates every 5 minutes
• Uses the RTM data and stores it in RRD files
• Shows the number of jobs in any state
• VO view: stacks the jobs by VO
• CE view: stacks the jobs by CE

https://gfe03.hep.ph.ic.ac.uk:4175/cgi-bin/load

Still a prototype. Will add:
• View by GOC and ROC
• Error checking
• Usage (running CPU / total CPU)
• Improved look and feel

Could interface with Nagios for raising alarms (high abort rate)

GridLoad What can it be used for ?

[Plot: "#Aborted Jobs", annotated "Home dir full" and "Problem solved"]

• Can be used as a single measure of the health of the system
• We can then use Nagios to find out more
• Avoids the "too many alarms" syndrome!
• You can query the CGI to get graphs for your site

Extracting the private and public keys.

• You have to create a .globus directory and extract the keys into it.
  – Extract your public key (certificate):
    • openssl pkcs12 -in cert.p12 -clcerts -nokeys -out usercert.pem
    • chmod 644 usercert.pem
  – Extract your private key:
    • openssl pkcs12 -in cert.p12 -nocerts -out userkey.pem
    • Protect it: chmod 400 userkey.pem
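Wrong file permissions are a common reason for the proxy step to fail, so a quick check after extracting the keys can save time. The sketch below assumes the usual Globus convention that the certificate is mode 644 and the private key is readable by its owner only (mode 400), and it relies on GNU `stat` as found on Linux. The function names `has_mode` and `check_globus_dir` are hypothetical.

```shell
#!/bin/sh
# Succeed only if file $1 exists and has exactly the octal mode $2.
# Assumption: GNU coreutils stat; BSD and macOS stat use different flags.
has_mode() {
    [ -f "$1" ] && [ "$(stat -c '%a' "$1")" = "$2" ]
}

# Check the two credential files inside the directory given as $1.
# Prints one line per problem and returns non-zero if any was found.
check_globus_dir() {
    dir=$1
    status=0
    has_mode "$dir/usercert.pem" 644 || { echo "usercert.pem: want mode 644"; status=1; }
    has_mode "$dir/userkey.pem" 400 || { echo "userkey.pem: want mode 400"; status=1; }
    return $status
}

# Typical use:
#   check_globus_dir "$HOME/.globus" && echo "credentials look fine"
```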

Initialize your Proxy

• The proxy is a temporary key pair signed by your private key. It lets you delegate your credentials to another machine where your job will run.

• To create a proxy (which will be a file in the /tmp directory) you need to
  – Run voms-proxy-init --voms ltwo
  – Type the pass phrase that decrypts your private key

• You should see this:

Your identity: /C=UK/O=eScience/OU=Imperial/L=Physics/CN=olivier van der aa (vo1)
Enter GRID pass phrase:
Creating temporary proxy ............................................... Done
Contacting gm01.hep.ph.ic.ac.uk:15002 [/C=UK/O=eScience/OU=Imperial/L=Physics/CN=host/gm01.hep.ph.ic.ac.uk/emailAddress=o.van-der-aa@imperial.ac.uk] "ltwo" Done
Creating proxy ............................................ Done
Your proxy is valid until Tue Dec 6 23:45:14 2005

Preparing for submitting jobs

• A simple job program is available in /tmp/Lecture.tar.gz
• Copy it to your home directory and untar it.
• To submit a job you need to create a file that contains your requirements: the so-called JDL (Job Description Language) file.
• We will submit jobs as members of the London Tier 2 VO (LTWO), so we need to specify that they run on sites that support it.
• For the moment the only site that supports it is the Imperial College HEP site.

Submit the job

• edg-job-submit --config-vo gridpp_wl_vo_ltwo.conf --config gridpp_wl_cmd_var.conf hello.jdl
• Or: runjob.sh hello.jdl
• The configuration files (gridpp_...) tell the command to use the Imperial Resource Broker, since it is the only one that knows about the ltwo VO.

*********************************************************************************************
                               JOB SUBMIT OUTCOME
The job has been successfully submitted to the Network Server.
Use edg-job-status command to check job current status. Your job identifier (edg_jobId) is:

- https://gfe01.hep.ph.ic.ac.uk:9000/kvexAiToJyvcBvxdMBTdoA

*********************************************************************************************

This is your job ID

Check the status of your job

• edg-job-status [Your job ID] will get the status of your job

Managing large files

• To transfer large files you should not use the input and output sandboxes. They are limited to 9 MB.
• File replication should be used.
• The LTWO VO does not have a catalog to register the files, so I will describe what can be done.

Globus-url-copy

• You can copy a file to our SE using the globus-url-copy command
• globus-url-copy file:////myfile gsiftp://gw38.hep.ph.ic.ac.uk/stage2/lcg2-data/ltwo/myfilename
• But this does not use the catalog, so you have to know where your file really is.

Hello.jdl and finding matching resources

• In the Lecture directory
  – See file Hello.jdl

    Executable = "/bin/hostname";
    #Arguments = "none";
    StdOutput = "std.out";
    StdError = "std.err";
    OutputSandbox = {"std.out", "std.err"};

  – Executable: name of the executable
  – OutputSandbox: files you want to retrieve

• Check which resources match your requirements
  – edg-job-list-match --config-vo gridpp_wl_vo_ltwo.conf --config gridpp_wl_cmd_var.conf hello.jdl

Exercise

• Find out what the GridCR program does
• Submit 5 jobs. The output of the GridCR program should be stored on the classic SE
• Using your job's standard output, retrieve the files that have been generated.

Check the validity of your proxy

• voms-proxy-info will tell you for how long your delegation remains valid.

subject : /C=UK/O=eScience/OU=Imperial/L=Physics/CN=olivier van der aa (vo1)/CN=proxy
issuer : /C=UK/O=eScience/OU=Imperial/L=Physics/CN=olivier van der aa (vo1)
identity : /C=UK/O=eScience/OU=Imperial/L=Physics/CN=olivier van der aa (vo1)
type : proxy
strength : 512 bits
path : /tmp/x509up_u37227
timeleft : 11:58:43
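Before submitting a long job it is worth checking that the proxy will outlive it. A minimal sketch, assuming the output format shown above with a `timeleft : HH:MM:SS` line; the function name `proxy_seconds_left` is made up for this example, and the sample is taken from the listing on this slide.

```shell
#!/bin/sh
# Read voms-proxy-info style output on stdin and print the remaining
# proxy lifetime in seconds, taken from the "timeleft : HH:MM:SS" line.
proxy_seconds_left() {
    awk '$1 == "timeleft" { split($NF, t, ":"); print t[1] * 3600 + t[2] * 60 + t[3] }'
}

# Sample lines copied from the output on the slide.
sample='type : proxy
strength : 512 bits
path : /tmp/x509up_u37227
timeleft : 11:58:43'

left=$(printf '%s\n' "$sample" | proxy_seconds_left)
echo "$left"   # prints 43123

# Typical use on a UI (not run here): warn when under one hour remains.
#   [ "$(voms-proxy-info | proxy_seconds_left)" -ge 3600 ] || echo "renew your proxy"
```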

Finding which CEs support the ltwo VO

• To get a list of CEs that support the ltwo VO, use the lcg-infosites command
  – lcg-infosites --vo ltwo ce

    gw39.hep.ph.ic.ac.uk:2119/jobmanager-lcgpbs-ltwo

    This is the CE of the HEP group.

  – If you run lcg-infosites --vo dteam ce you will get a list of the CEs in LCG.

lcg-cr, lcg-rep

• To register a file in a catalog and copy it to your beloved SE

lcg-cr --vo [yourvo] file://`pwd`/<name> \
    -l lfn:<name> -d yourse

If you do not give an SE, the local one will be used.

• To replicate the same file on a different SE
  – lcg-rep --vo [yourvo]