Monitoring and troubleshooting a glideinWMS-based HTCondor pool
CERN, Dec 2012 glideinWMS monitoring 1
glideinWMS for users
Monitoring and troubleshooting
a glideinWMS-based HTCondor pool
by Igor Sfiligoi (UCSD)
Scope of this talk
This talk describes what information is available when troubleshooting a glideinWMS-based HTCondor pool, and what tools you can use to mine it.
The reader is expected to already have a basic understanding of HTCondor and glideinWMS.
HTCondor Architecture
● As a reminder
[Figure: HTCondor architecture diagram. Labels: Central manager (Negotiator), Submit node (Schedd), Execute node, Condor, Grid, G.F., VO FE]
Typical user questions addressed in this talk
● Where is/was my job running?
● Why are my jobs not starting?
● Why do my jobs take forever to finish?
Where is/was my job running?
Job progress monitoring
● HTCondor provides two basic means to monitor job progress
  ● Querying the system for current status
    – Using the cmdline condor_q/condor_history
  ● Parsing the job event log
    – Either plain text or XML formatted
    – Starting with 7.9.1, condor_history can be used to extract the last known state
Job status
● Each job has a status associated with it
  ● An integer attribute called JobStatus
    – But it has well-known semantics associated with each value
● Jobs start in the Idle state
  ● Become Running if everything works fine
  ● Completed when they terminate
● If anything goes wrong, a job will go into Hold
  ● If removed before completion, it will be Removed
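The slide does not list the integer values themselves. The sketch below uses the standard HTCondor JobStatus codes to turn the raw attribute into a readable name; the function names are illustrative, not part of any HTCondor API.

```python
from collections import Counter

# Standard HTCondor JobStatus codes (not listed on the slide).
JOB_STATUS_NAMES = {
    1: "Idle",
    2: "Running",
    3: "Removed",
    4: "Completed",
    5: "Held",
    6: "Transferring Output",
    7: "Suspended",
}


def describe_job_status(code):
    """Return the well-known name for a JobStatus integer."""
    return JOB_STATUS_NAMES.get(code, "Unknown (%d)" % code)


def summarize(codes):
    """Count jobs per state, given an iterable of JobStatus integers."""
    return Counter(describe_job_status(c) for c in codes)
```

The codes could come, for example, from `condor_q -format '%i\n' JobStatus`, one integer per job.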
Monitoring the Job Status
● Idle/Running/Held jobs can be polled with condor_q
  ● Will query the Schedd daemon
● Once they terminate, or are removed, they leave the Schedd queue
  ● Are put into a file on disk
  ● Can use condor_history to retrieve the last ClassAd
● The job event log has all the state transitions (of course)
One exception: If a job was running when it was removed, but the execute node does not confirm the job was killed remotely, the job will be kept in the Schedd.
So, where is the job running?
● Easy to get the machine name and/or IP
  ● Standard HTCondor attributes RemoteHost & StartdIpAddr
● But they may not necessarily make sense
  ● Do you recognize all network domains?
  ● And they could be on a private network!
Getting glidein attributes
● Glideins have many more attributes that describe them
  ● e.g. symbolic site name GLIDEIN_CMSSite
● However, by default, you do not get this info in the Job ClassAd
● But easy to add
  ● <my attr> = $$(<glidein attr>:Unknown)
    – Will get the info in MATCH_EXP_<my attr>
Standard attributes
● Standard glideinWMS attributes
  ● JOB_GLIDEIN_Entry_Name = "$$(GLIDEIN_Entry_Name:Unknown)"
  ● JOB_GLIDEIN_Name = "$$(GLIDEIN_Name:Unknown)"
  ● JOB_GLIDEIN_Factory = "$$(GLIDEIN_Factory:Unknown)"
  ● JOB_GLIDEIN_Schedd = "$$(GLIDEIN_Schedd:Unknown)"
  ● JOB_GLIDEIN_ClusterId = "$$(GLIDEIN_ClusterId:...)"
  ● JOB_GLIDEIN_ProcId = "$$(GLIDEIN_ProcId:Unknown)"
  ● JOB_GLIDEIN_Site = "$$(GLIDEIN_Site:Unknown)"
● Standard CMS glideinWMS attribute
  ● JOB_CMSSite = "$$(GLIDEIN_CMSSite:Unknown)"
Useful for in-depth debugging
Configured by the HTCondor admin, no need for the user to do anything
SUBMIT_EXPRS = JOB_GLIDEIN_Entry_Name, JOB_CMSSite, ...
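As a rough illustration of the pattern above, the sketch below generates the attribute definitions and the SUBMIT_EXPRS line from a mapping of job attribute to glidein attribute. The helper names and the three-entry mapping are assumptions made for the example; a real configuration would list whatever attributes the pool admin chose.

```python
# Illustrative mapping: job attribute -> glidein attribute copied at match time.
JOB_ATTRS = {
    "JOB_GLIDEIN_Entry_Name": "GLIDEIN_Entry_Name",
    "JOB_GLIDEIN_Site": "GLIDEIN_Site",
    "JOB_CMSSite": "GLIDEIN_CMSSite",
}


def config_lines(job_attrs, default="Unknown"):
    """Build '<my attr> = "$$(<glidein attr>:default)"' lines plus SUBMIT_EXPRS."""
    lines = [
        '%s = "$$(%s:%s)"' % (job_attr, glidein_attr, default)
        for job_attr, glidein_attr in job_attrs.items()
    ]
    lines.append("SUBMIT_EXPRS = " + ", ".join(job_attrs))
    return lines


def expanded_name(job_attr):
    """Name under which the expanded value shows up in the job ClassAd."""
    return "MATCH_EXP_" + job_attr


if __name__ == "__main__":
    print("\n".join(config_lines(JOB_ATTRS)))
```

After a match, the site name for a job would then be read from `MATCH_EXP_JOB_CMSSite`, for example with `condor_q -format '%s\n' MATCH_EXP_JOB_CMSSite`.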
Getting them in the event log
● You (or the admins) can also propagate the attributes into the event log
  job_ad_information_attrs = JOB_GLIDEIN_Entry_Name, JOB_CMSSite, …
● As a result you get “Job Ad” events
...
001 (20327.002.000) 12/03 00:46:33 Job executing on host: <193.48.85.94:38749>
...
028 (20327.002.000) 12/03 00:46:33 Job ad information event triggered.
TriggerEventTypeNumber = 1
Cluster = 20327
EventTypeNumber = 28
ExecuteHost = "<193.48.85.94:38749>"
JOB_CMSSite = "T2_FR_IPHC"
EventTime = "2012-12-03T00:46:33"
TriggerEventTypeName = "ULOG_EXECUTE"
Proc = 2
Subproc = 0
CurrentTime = time()
MyType = "ExecuteEvent"
...
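A minimal sketch of how such events could be mined from a plain-text event log. It assumes events are terminated by a line containing only "...", as in the excerpt above, and that "Job ad" events start with event number 028. The parser and its function names are written for this example and are not an HTCondor library.

```python
SAMPLE_LOG = """...
001 (20327.002.000) 12/03 00:46:33 Job executing on host: <193.48.85.94:38749>
...
028 (20327.002.000) 12/03 00:46:33 Job ad information event triggered.
TriggerEventTypeNumber = 1
Cluster = 20327
EventTypeNumber = 28
ExecuteHost = "<193.48.85.94:38749>"
JOB_CMSSite = "T2_FR_IPHC"
TriggerEventTypeName = "ULOG_EXECUTE"
Proc = 2
...
"""


def parse_events(log_text):
    """Split an event log into events; each event is a list of its lines."""
    events, current = [], []
    for line in log_text.splitlines():
        line = line.strip()
        if line == "...":          # event terminator
            if current:
                events.append(current)
            current = []
        elif line:
            current.append(line)
    if current:
        events.append(current)
    return events


def job_ad_attrs(event_lines):
    """Return the attributes of a 028 event as a dict, or None otherwise."""
    if not event_lines[0].startswith("028 "):
        return None
    attrs = {}
    for line in event_lines[1:]:
        name, sep, value = line.partition(" = ")
        if sep:
            attrs[name] = value.strip('"')
    return attrs


def sites_seen(log_text, attr="JOB_CMSSite"):
    """List the site recorded in each Job ad event that carries the attribute."""
    ads = (job_ad_attrs(e) for e in parse_events(log_text))
    return [ad[attr] for ad in ads if ad and attr in ad]
```

Calling `sites_seen(SAMPLE_LOG)` on the excerpt yields the single site recorded for job 20327.2.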
Why is my job not starting?
Troubleshooting process
● First question
  ● Do my jobs match any (logical) resource?
● Once you are sure of that
  ● Are there jobs from higher priority users?
  ● Are desired sites just too busy?
  ● Are there problems at desired site(s)?
● If nothing gives a satisfying answer
  ● It may be a glideinWMS misconfiguration, seek help from VO FE admins
How do I know if my jobs match?
● Good question!
  ● Unfortunately, the answer is not trivial
● The FE matching policy is not “public”
  ● And, of course, there are no tools to probe for it
● You will have to rely on the FE admins to “explain” the policy
  ● Hopefully in a human readable format
  ● Hopefully without conversion errors!
An example FE policy
● See the CMS FE talk for an actual high level view
● The actual FE policy is a python expression
  (glidein["attrs"]["GLIDEIN_CMSSite"] in job["DESIRED_Sites"].split(",")) and ((glidein["attrs"].get("GLIDEIN_Is_HTPC")=="True") == (job.get("DESIRES_HTPC")==1))
● And then there is the matching HTCondor one
  (stringListMember(GLIDEIN_CMSSite,DESIRED_Sites,",")=?=True) && ((GLIDEIN_Is_HTPC=?=True)=?=(DESIRES_HTPC=?=True))
A simple example – could be much more complex
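To see what the Python form of the policy does, it can be wrapped in a function and tried on a few hand-made dictionaries. The expression is the one from the slide; the job and glidein values below are made up for illustration.

```python
def fe_match(job, glidein):
    """The example FE policy: site must be desired, and HTPC-ness must agree."""
    return (glidein["attrs"]["GLIDEIN_CMSSite"] in job["DESIRED_Sites"].split(",")) and (
        (glidein["attrs"].get("GLIDEIN_Is_HTPC") == "True")
        == (job.get("DESIRES_HTPC") == 1)
    )


# Hypothetical job: wants two sites, does not ask for HTPC.
job = {"DESIRED_Sites": "T2_FR_IPHC,T2_US_UCSD"}

# Hypothetical glideins.
glideins = [
    {"attrs": {"GLIDEIN_CMSSite": "T2_FR_IPHC"}},                             # desired site, not HTPC
    {"attrs": {"GLIDEIN_CMSSite": "T2_FR_IPHC", "GLIDEIN_Is_HTPC": "True"}},  # desired site, HTPC
    {"attrs": {"GLIDEIN_CMSSite": "T1_US_FNAL"}},                             # site not desired
]

matches = [fe_match(job, g) for g in glideins]
```

Only the first glidein matches: the second is at a desired site but is an HTPC glidein while the job does not ask for one, and the third is at a site the job did not list.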
A word about HTCondor matching
● Once glideins start, you can probe their policy
  condor_status -format '%s\n' START
● But there are no tools to help you understand the matchmaking (M.M.)
  ● The closest is condor_q -analyze
    – But it only looks at Job requirements
    – So, not really helping when all/most of the policy is in the glideins
$ condor_status -format '%s\n' START
( ( true ) && ( true ) && ( true ) && ( ( stringListMember(GLIDEIN_CMSSite,DESIRED_Sites,",") =?= true ) && ( ( GLIDEIN_Is_HTPC =?= true ) =?= ( DESIRES_HTPC =?= true ) ) ) ) && ( ( ( GLIDEIN_ToRetire =?= undefined ) || ( CurrentTime < GLIDEIN_ToRetire ) ) )
( ( true ) && ( true ) && ( true ) && ( ( stringListMember(GLIDEIN_CMSSite,DESIRED_Sites,",") =?= true ) && ( ( GLIDEIN_Is_HTPC =?= true ) =?= ( DESIRES_HTPC =?= true ) ) ) ) && ( ( ( GLIDEIN_ToRetire =?= undefined ) || ( CurrentTime < GLIDEIN_ToRetire ) ) )
...
User priorities
● So, jobs should be matching, but are not starting
  ● And there are plenty of matching glideins in the system
● Likely there are other higher-priority jobs in the system
  ● Possibly from a different user
    condor_userprio
  ● Possibly on a different schedd
    condor_status -submitters
● No tools to give you the easy answer
  ● If you need the answer, you will have to investigate
Warning: Slow!
Unclaimed glideins
● If you see plenty of Unclaimed glideins, but no matching jobs from other users
  ● You have either reached the schedd limit MAX_JOBS_RUNNING
  ● Or something bad is going on!
● You can only ask your FE admin for help
  ● But first double check that your jobs should indeed be matching, at least on paper
Supported Sites
● What should you do if there are no (new) glideins coming from an expected site?
● First off, see if the site is even supported by the glideinWMS instance!
● Each Entry has a ClassAd
  condor_status -any -const 'MyType=="glideresource"'
● Look for the attributes your FE is matching on
  e.g. GLIDEIN_CMSSite
Site not there? Notify your FE admin!
Is the FE even asking for them?
● You are sure that your jobs should be matching?
  ● But what if you are wrong?
● Check it out
  … -format '%i\n' GlideFactoryMonitorRequestedIdle
But remember it is not just your jobs.
Maybe the site is just busy?
● Glideins have to compete with other Grid jobs at most sites
  ● Maybe the site is just busy?
● Check if glideinWMS has put any glideins in the Grid queues
  … -format '%i\n' GlideFactoryMonitorStatusPending
If you find zeros, notify your FE admin!
Site problems?
● The glideins will validate the worker node before talking to the C.M.
  ● If the test fails, the glidein will “waste” 20 mins on the node to prevent other jobs from failing on it again
● You can check if there are “Running” glideins in glideinWMS, even though you see none (or few) in the C.M.
  … -format '%i\n' GlideFactoryMonitorStatusRunning
If you find a discrepancy, notify your FE admin!
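The checks from the last three slides can be strung together into a small decision helper. This is only a sketch of the reasoning, not a glideinWMS tool: the function name, the return codes and the `tolerance` parameter are assumptions, and the inputs are the three GlideFactoryMonitor values for one Entry plus your own count of that site's glideins visible in the C.M.

```python
def diagnose_entry(requested_idle, pending, running, seen_in_pool, tolerance=0):
    """Suggest why few or no glideins from an Entry show up in the pool.

    requested_idle: GlideFactoryMonitorRequestedIdle
    pending:        GlideFactoryMonitorStatusPending
    running:        GlideFactoryMonitorStatusRunning
    seen_in_pool:   glideins from this site that you count in the C.M.
    tolerance:      hypothetical slack for glideins still starting up
    """
    if running - seen_in_pool > tolerance:
        # Glideins run at the site but never join the pool:
        # likely failing worker node validation. Notify the FE admin.
        return "site_problem"
    if requested_idle == 0:
        # The FE is not asking for glideins here; re-check that jobs match.
        return "not_requested"
    if pending == 0:
        # Requested, but nothing sits in the Grid queues. Notify the FE admin.
        return "nothing_pending"
    # Glideins are queued at the site and waiting their turn.
    return "site_busy"
```

Since the monitoring values cover all users of the Frontend, a non-zero request does not prove that your own jobs are the ones being counted.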
Still no clue?
● If all your detective work fails
  ● Notify your VO FE admin
● They have access to information you don't
Why do my jobs take forever to finish?
My jobs are running, but...
● Great, your jobs are happily running
  ● But you are getting no results back!
  ● i.e. the jobs are not finishing in the expected time
● Two main likely reasons
  ● They are being restarted
  ● You miscalculated the needed time
Jobs re-starting
● HTCondor tries to be user friendly
  ● If a job gets preempted, for almost any reason, it will try to re-start it with the hope it will finish on the next try
  ● And will not ever give up! (by default)
● You can easily check how many times it started
  condor_q -format '%i\n' NumJobStarts
● You may want to cap the number with periodic_hold/remove
  http://research.cs.wisc.edu/htcondor/manual/v7.8/condor_submit.html#condor-submit-periodic-remove
  http://research.cs.wisc.edu/htcondor/manual/v7.8/3_3Configuration.html#param:SystemPeriodicRemove
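A small sketch of both steps: counting jobs that have started too often, from the output of the condor_q command above, and writing a matching cap for the submit file. The helper names are invented here, and the particular hold expression (hold a job that is Idle again after a given number of starts) is one possible policy, not a recommendation from the talk.

```python
def restart_counts(condor_q_output):
    """Parse the output of: condor_q -format '%i\\n' NumJobStarts"""
    return [int(token) for token in condor_q_output.split()]


def jobs_over_limit(counts, max_starts):
    """Number of jobs that have started more than max_starts times."""
    return sum(1 for c in counts if c > max_starts)


def hold_expression(max_starts):
    """One possible submit-file cap: hold a job that is Idle (JobStatus 1)
    again after having started max_starts times."""
    return "periodic_hold = (NumJobStarts >= %d) && (JobStatus == 1)" % max_starts


if __name__ == "__main__":
    sample = "0\n3\n12\n"   # made-up output for three jobs
    print(jobs_over_limit(restart_counts(sample), 3), "job(s) restarted more than 3 times")
    print(hold_expression(3))
```

Using periodic_remove instead would discard the job rather than park it for inspection.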
Why is it restarting?
● OK, I now know it is restarting... but why?
  ● Most likely, the glidein was killed
● Was it due to your job “misbehaving”?
● Most Grid sites have limits on resource use
  ● Including CPU, memory and disk
  ● If you exceed them, the glidein (and you) will be killed
● Glideins should be configured to detect and hold/remove your job if you “misbehave”
  ● Thus you would not be re-started
  ● If you see many restarts, notify your FE admin
Likely there is a policy rule missing
What is my job doing?
● What if it is not restarting... just running forever (or until hitting the time limit)
● HTCondor allows for peeking at a running job
  ● A cmdline tool called condor_ssh_to_job
  ● Unfortunately, needs implicit permission from site
    – And about half of the sites don't allow it
The End
Pointers
● glideinWMS Home Page
  http://tinyurl.com/glideinWMS
● HTCondor Home Page
  http://research.cs.wisc.edu/htcondor/
● HTCondor [email protected]@cs.wisc.edu
● glideinWMS [email protected]
Acknowledgments
● The creation of this document was sponsored by grants from the US NSF and US DOE, and by the University of California system