Monitoring and troubleshooting a glideinWMS-based HTCondor pool

CERN, Dec 2012

glideinWMS for users

Monitoring and troubleshooting a glideinWMS-based HTCondor pool

by Igor Sfiligoi (UCSD)


Scope of this talk

This talk describes what information is available when troubleshooting a glideinWMS-based HTCondor pool, and what tools you can use to mine it.

The reader is expected to already have a basic understanding of HTCondor and glideinWMS.


HTCondor Architecture

● As a reminder

[Figure: HTCondor architecture diagram. Labels: Central manager (Negotiator), Submit node (Schedd), Execute node, Condor, Grid, G.F., VO FE]


Typical user questions addressed in this talk

● Where is/was my job running?
● Why are my jobs not starting?
● Why do my jobs take forever to finish?


Where is/was my job running?


Job progress monitoring

● HTCondor provides two basic means to monitor job progress
  ● Querying the system for current status
    – Using the cmdline condor_q/condor_history
  ● Parsing the job event log
    – Either plain text or XML formatted
    – Starting with 7.9.1, condor_history can be used to extract the last known state


Job status

● Each job has a status associated with it
  ● An integer attribute called JobStatus
    – But with well-known semantics associated with each value

● Jobs start in the Idle state
  ● Become Running if everything works fine
  ● Completed when they terminate

● If anything goes wrong, a job will go into Hold
  ● If removed before completion, it will be Removed
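As a minimal sketch of the status values listed above, the mapping below uses the standard HTCondor JobStatus codes for the five states on this slide. The helper name describe_status is hypothetical and only there for illustration.

```python
# Standard HTCondor JobStatus codes for the five states discussed above.
JOB_STATUS_NAMES = {
    1: "Idle",
    2: "Running",
    3: "Removed",
    4: "Completed",
    5: "Held",
}


def describe_status(job_status):
    """Return the conventional name of an integer JobStatus (hypothetical helper)."""
    return JOB_STATUS_NAMES.get(job_status, "Unknown (%d)" % job_status)


if __name__ == "__main__":
    # Print the states in the order a healthy job would see them, then the others.
    for code in (1, 2, 4, 5, 3):
        print(code, describe_status(code))
```

Such a table is handy when post-processing the raw integers printed by condor_q or condor_history.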


Monitoring the Job Status

● Idle/Running/Held jobs can be polled with condor_q
  ● Will query the Schedd daemon

● Once they terminate, or are removed, they leave the Schedd queue
  ● Are put into a file on disk
  ● Can use condor_history to retrieve the last ClassAd

● The job event log has all the state transitions (of course)

One exception: If a job was running when it was removed, but the execute node does not confirm the job was killed remotely, the job will be kept in the Schedd.


So, where is the job running?

● Easy to get the machine name and/or IP
  ● Standard HTCondor attributes RemoteHost & StartdIpAddr

● But may not necessarily make sense
  ● Do you recognize all network domains?
  ● And they could be on a private network!
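To illustrate the private-network caveat, here is a small sketch that extracts the IPv4 address from a StartdIpAddr value and checks whether it is a private address. It assumes the plain "<ip:port>" form seen in the event log excerpts of this talk; the function names and the 10.x sample address are hypothetical.

```python
import ipaddress
import re

# Matches the leading "<a.b.c.d:port" part of an IPv4 address string.
SINFUL = re.compile(r"^<(\d+\.\d+\.\d+\.\d+):(\d+)")


def startd_address(startd_ip_addr):
    """Split a value like '<193.48.85.94:38749>' into (ip, port)."""
    match = SINFUL.match(startd_ip_addr)
    if match is None:
        raise ValueError("unexpected address format: %r" % startd_ip_addr)
    return ipaddress.ip_address(match.group(1)), int(match.group(2))


def is_private(startd_ip_addr):
    """True if the execute node address lies in a private network range."""
    ip, _port = startd_address(startd_ip_addr)
    return ip.is_private


if __name__ == "__main__":
    for addr in ("<193.48.85.94:38749>", "<10.1.2.3:9618>"):
        print(addr, "private" if is_private(addr) else "public")
```

A private address tells you nothing about the site, which is why the glidein attributes on the next slides are more useful.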


Getting glidein attributes

● Glideins have many more attributes that describe them
  ● e.g. symbolic site name GLIDEIN_CMSSite

● However, by default, you do not get this info in the job ClassAd

● But easy to add
  ● <my attr> = $$(<glidein attr>:Unknown)
    – Will get the info in MATCH_EXP_<my attr>


Standard attributes

● Standard glideinWMS attributes
  ● JOB_GLIDEIN_Entry_Name = "$$(GLIDEIN_Entry_Name:Unknown)"
  ● JOB_GLIDEIN_Name = "$$(GLIDEIN_Name:Unknown)"
  ● JOB_GLIDEIN_Factory = "$$(GLIDEIN_Factory:Unknown)"
  ● JOB_GLIDEIN_Schedd = "$$(GLIDEIN_Schedd:Unknown)"
  ● JOB_GLIDEIN_ClusterId = "$$(GLIDEIN_ClusterId:...)"
  ● JOB_GLIDEIN_ProcId = "$$(GLIDEIN_ProcId:Unknown)"
  ● JOB_GLIDEIN_Site = "$$(GLIDEIN_Site:Unknown)"

● Standard CMS glideinWMS attribute
  ● JOB_CMSSite = "$$(GLIDEIN_CMSSite:Unknown)"

Useful for in-depth debugging

Configured by the HTCondor admin, no need for the user to do anything
SUBMIT_EXPRS = JOB_GLIDEIN_Entry_Name, JOB_CMSSite, ...


Getting them in the event log

● You (or the admins) can also propagate the attributes into the event log
  job_ad_information_attrs = JOB_GLIDEIN_Entry_Name, JOB_CMSSite, …

● As a result you get “Job Ad” events

...
001 (20327.002.000) 12/03 00:46:33 Job executing on host: <193.48.85.94:38749>
...
028 (20327.002.000) 12/03 00:46:33 Job ad information event triggered.
TriggerEventTypeNumber = 1
Cluster = 20327
EventTypeNumber = 28
ExecuteHost = "<193.48.85.94:38749>"
JOB_CMSSite = "T2_FR_IPHC"
EventTime = "2012-12-03T00:46:33"
TriggerEventTypeName = "ULOG_EXECUTE"
Proc = 2
Subproc = 0
CurrentTime = time()
MyType = "ExecuteEvent"
...
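The “Job Ad” events are easy to mine with a few lines of code. The sketch below parses the excerpt shown above and pulls out the propagated attributes. It assumes the plain text log format, with events separated by lines containing only "...". The function name job_ad_events is hypothetical.

```python
import re

# The event log excerpt from this slide, one attribute per line.
SAMPLE_LOG = '''...
001 (20327.002.000) 12/03 00:46:33 Job executing on host: <193.48.85.94:38749>
...
028 (20327.002.000) 12/03 00:46:33 Job ad information event triggered.
TriggerEventTypeNumber = 1
Cluster = 20327
EventTypeNumber = 28
ExecuteHost = "<193.48.85.94:38749>"
JOB_CMSSite = "T2_FR_IPHC"
EventTime = "2012-12-03T00:46:33"
TriggerEventTypeName = "ULOG_EXECUTE"
Proc = 2
Subproc = 0
CurrentTime = time()
MyType = "ExecuteEvent"
...
'''

# Event header: 3-digit event number, then (cluster.proc.subproc).
HEADER = re.compile(r"^(\d{3}) \((\d+)\.(\d+)\.(\d+)\) ")


def job_ad_events(text):
    """Yield (job_id, attrs) for every 'Job ad information' (028) event."""
    # Events are separated by lines consisting of "..." only.
    for block in re.split(r"^\.\.\.[ \t]*$", text, flags=re.MULTILINE):
        lines = [line for line in block.strip().splitlines() if line.strip()]
        if not lines:
            continue
        match = HEADER.match(lines[0])
        if match is None or match.group(1) != "028":
            continue
        job_id = "%d.%d" % (int(match.group(2)), int(match.group(3)))
        attrs = {}
        for line in lines[1:]:
            name, sep, value = line.partition(" = ")
            if sep:
                # Keep values as strings, minus the surrounding quotes.
                attrs[name.strip()] = value.strip().strip('"')
        yield job_id, attrs


if __name__ == "__main__":
    for job_id, attrs in job_ad_events(SAMPLE_LOG):
        print(job_id, attrs.get("JOB_CMSSite"), attrs.get("ExecuteHost"))
```

Values are deliberately kept as strings; a real tool would likely convert integers and timestamps.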


Why is my job not starting?


Troubleshooting process

● First question
  ● Do my jobs match any (logical) resource?

● Once you are sure of that
  ● Are there jobs from higher priority users?
  ● Are desired sites just too busy?
  ● Are there problems at desired site(s)?

● If nothing gives a satisfying answer
  ● It may be a glideinWMS misconfiguration, seek help from VO FE admins


How do I know if my jobs match?

● Good question!
  ● Unfortunately, the answer is not trivial

● The FE matching policy is not “public”
  ● And, of course, no tools to probe for it

● You will have to rely on the FE admins to “explain” the policy
  ● Hopefully in a human readable format
  ● Hopefully without conversion errors!


An example FE policy

● See the CMS FE talk for an actual high level view

● The actual FE policy is a python expression

(glidein["attrs"]["GLIDEIN_CMSSite"] in job["DESIRED_Sites"].split(",")) and
((glidein["attrs"].get("GLIDEIN_Is_HTPC")=="True") == (job.get("DESIRES_HTPC")==1))

● And then there is the matching HTCondor one

(stringListMember(GLIDEIN_CMSSite,DESIRED_Sites,",")=?=True) &&
((GLIDEIN_Is_HTPC=?=True)=?=(DESIRES_HTPC=?=True))

A simple example – could be much more complex
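To see how the Python policy expression on this slide behaves, it can be wrapped in a function and tried on a few sample dictionaries. This is only a sketch: the dictionary layout is taken from the expression itself, while the function name fe_match and all sample jobs and glideins are hypothetical.

```python
def fe_match(job, glidein):
    """The example FE policy expression from this slide, wrapped in a function."""
    return (glidein["attrs"]["GLIDEIN_CMSSite"] in job["DESIRED_Sites"].split(",")) and (
        (glidein["attrs"].get("GLIDEIN_Is_HTPC") == "True") == (job.get("DESIRES_HTPC") == 1)
    )


# Hypothetical jobs and glidein entries, for illustration only.
plain_job = {"DESIRED_Sites": "T2_FR_IPHC,T2_US_UCSD"}
htpc_job = {"DESIRED_Sites": "T2_FR_IPHC", "DESIRES_HTPC": 1}
plain_glidein = {"attrs": {"GLIDEIN_CMSSite": "T2_FR_IPHC"}}
htpc_glidein = {"attrs": {"GLIDEIN_CMSSite": "T2_FR_IPHC", "GLIDEIN_Is_HTPC": "True"}}
other_site = {"attrs": {"GLIDEIN_CMSSite": "T2_DE_DESY"}}

if __name__ == "__main__":
    print(fe_match(plain_job, plain_glidein))  # site listed, neither side is HTPC
    print(fe_match(plain_job, htpc_glidein))   # site listed, but only the glidein is HTPC
    print(fe_match(htpc_job, htpc_glidein))    # site listed, both sides are HTPC
    print(fe_match(plain_job, other_site))     # site not in DESIRED_Sites
```

Note that the HTPC test requires both sides to agree, so a non-HTPC job never matches an HTPC glidein, even at a desired site.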


A word about HTCondor matching

● Once glideins start, you can probe their policy
  condor_status -format '%s\n' START

● But no tools to help you understand the matchmaking (M.M.)
  ● The closest is condor_q -analyze
    – But only looks at job requirements
    – So, not really helping when all/most of the policy is in the glideins

$ condor_status -format '%s\n' START
( ( true ) && ( true ) && ( true ) && ( ( stringListMember(GLIDEIN_CMSSite,DESIRED_Sites,",") =?= true ) && ( ( GLIDEIN_Is_HTPC =?= true ) =?= ( DESIRES_HTPC =?= true ) ) ) ) && ( ( ( GLIDEIN_ToRetire =?= undefined ) || ( CurrentTime < GLIDEIN_ToRetire ) ) )
( ( true ) && ( true ) && ( true ) && ( ( stringListMember(GLIDEIN_CMSSite,DESIRED_Sites,",") =?= true ) && ( ( GLIDEIN_Is_HTPC =?= true ) =?= ( DESIRES_HTPC =?= true ) ) ) ) && ( ( ( GLIDEIN_ToRetire =?= undefined ) || ( CurrentTime < GLIDEIN_ToRetire ) ) )
...


User priorities

● So, jobs should be matching, but are not starting
  ● And there are plenty of matching glideins in the system

● Likely there are other higher-priority jobs in the system
  ● Possibly from a different user
    condor_userprio
  ● Possibly on a different schedd
    condor_status -submitters

● No tools to give you the easy answer
  ● If you need the answer, you will have to investigate

Warning: Slow!


Unclaimed glideins

● If you see plenty of Unclaimed glideins, but no matching jobs from other users
  ● You have either reached the schedd limit MAX_JOBS_RUNNING
  ● Or something bad is going on!

● You can only ask your FE admin for help
  ● But first double check that your jobs should indeed be matching, at least on paper


Supported Sites

● What should you do if there are no (new) glideins coming from an expected site?

● First off, see if the site is even supported by the glideinWMS instance!

● Each Entry has a ClassAd
  condor_status -any -const 'MyType=="glideresource"'

● Look for the attributes your FE is matching on
  e.g. GLIDEIN_CMSSite

Site not there? Notify your FE admin!


Is the FE even asking for them?

● You are sure that your jobs should be matching?
  ● But what if you are wrong?

● Check it out
  … -format '%i\n' GlideFactoryMonitorRequestedIdle

But remember it is not just your jobs.


Maybe the site is just busy?

● Glideins have to compete with other Grid jobs at most sites
  ● Maybe the site is just busy?

● Check if glideinWMS has put any glideins in the Grid queues
  … -format '%i\n' GlideFactoryMonitorStatusPending

If you find zeros, notify your FE admin!


Site problems?

● The glideins will validate the worker node before talking to the C.M.
  ● If the test fails, the glidein will “waste” 20 mins on the node to prevent other jobs from failing on it again

● You can check if there are “Running” glideins in glideinWMS, even though you see none (or few) in the C.M.
  … -format '%i\n' GlideFactoryMonitorStatusRunning

If you find a discrepancy, notify your FE admin!
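The checks from the last three slides can be combined into one rough decision procedure. The sketch below is an illustration under stated assumptions, not a glideinWMS tool: the function name triage_site, the order of the checks, and the strict comparison used to detect a discrepancy are all assumptions. The inputs are the three GlideFactoryMonitor values for one entry, plus the number of glideins from that site that you see in the C.M.

```python
def triage_site(requested_idle, pending, running, seen_in_pool):
    """Suggest a next step from the factory monitoring counters of one entry.

    requested_idle, pending, running: values of GlideFactoryMonitorRequestedIdle,
    GlideFactoryMonitorStatusPending and GlideFactoryMonitorStatusRunning.
    seen_in_pool: glideins from that site visible in the central manager.
    """
    if requested_idle == 0:
        # The FE sees no matching idle jobs (yours or anybody else's).
        return "FE is not requesting glideins: re-check that your jobs match"
    if pending == 0 and running == 0:
        # Requests exist, but nothing reached the Grid queues.
        return "nothing submitted to the Grid queues: notify your FE admin"
    if running > seen_in_pool:
        # Crude check: glideins still validating the node also count here.
        return "glideins running but missing from the pool: notify your FE admin"
    if pending > 0:
        return "glideins are waiting in the Grid queue: the site may just be busy"
    return "no obvious problem at this entry"


if __name__ == "__main__":
    print(triage_site(0, 0, 0, 0))
    print(triage_site(10, 0, 0, 0))
    print(triage_site(10, 5, 20, 2))
    print(triage_site(10, 5, 0, 0))
```

Remember that the counters cover all users of the FE, so a non-zero request does not prove that your own jobs are matching.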


Still no clue?

● If all your detective work fails
  ● Notify your VO FE admin

● They have access to information you don't


Why do my jobs take forever to finish?


My jobs are running, but...

● Great, your jobs are happily running
  ● But you are getting no results back!
  ● i.e. the jobs are not finishing in the expected time

● Two main likely reasons
  ● They are being restarted
  ● You miscalculated the needed time


Jobs re-starting

● HTCondor tries to be user friendly
  ● If a job gets preempted, for almost any reason, it will try to re-start it with the hope it will finish on the next try
  ● And will not ever give up! (by default)

● You can easily check how many times it started
  condor_q -format '%i\n' NumJobStarts

● You may want to cap the number with periodic_hold/remove

http://research.cs.wisc.edu/htcondor/manual/v7.8/condor_submit.html#condor-submit-periodic-remove

http://research.cs.wisc.edu/htcondor/manual/v7.8/3_3Configuration.html#param:SystemPeriodicRemove
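A small sketch of how the NumJobStarts check could be automated: the function below scans condor_q output that has one "cluster.proc starts" pair per line and reports jobs started more often than a threshold. The sample output, the threshold of 3 and the function name frequently_restarted are made up for illustration.

```python
# Hypothetical output of a query such as
#   condor_q -format '%d.' ClusterId -format '%d ' ProcId -format '%i\n' NumJobStarts
# (job IDs and counts below are made up for illustration).
SAMPLE_OUTPUT = """20327.0 1
20327.1 7
20327.2 0
"""


def frequently_restarted(output, max_starts=3):
    """Return (job_id, starts) pairs for jobs started more than max_starts times."""
    flagged = []
    for line in output.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or unexpected lines
        job_id, starts = fields[0], int(fields[1])
        if starts > max_starts:
            flagged.append((job_id, starts))
    return flagged


if __name__ == "__main__":
    for job_id, starts in frequently_restarted(SAMPLE_OUTPUT):
        print("job %s was started %d times" % (job_id, starts))
```

The same threshold could then go into a periodic_hold or periodic_remove expression based on NumJobStarts; see the manual pages linked above for the exact semantics.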


Why is it restarting?

● OK, I now know it is restarting... but why?
  ● Most likely, the glidein was killed

● Was it due to your job “misbehaving”?

● Most Grid sites have limits on resource use
  ● Including CPU, memory and disk
  ● If you exceed them, the glidein (and you) will be killed

● Glideins should be configured to detect and hold/remove your job if you “misbehave”
  ● Thus you would not be re-started
  ● If you see many restarts, notify your FE admin

Likely there is a policy rule missing


What is my job doing?

● What if it is not restarting... just running forever (or until hitting the time limit)

● HTCondor allows for peeking at a running job
  ● A cmdline tool called condor_ssh_to_job
  ● Unfortunately, needs implicit permission from site
    – And about half of the sites don't allow it


The End


Pointers

● glideinWMS Home Page
  http://tinyurl.com/glideinWMS

● HTCondor Home Page
  http://research.cs.wisc.edu/htcondor/

● HTCondor support
  [email protected]@cs.wisc.edu

● glideinWMS support
  [email protected]





Acknowledgments

● The creation of this document was sponsored by grants from the US NSF and US DOE, and by the University of California system
