Jaime Frey Computer Sciences Department University of Wisconsin-Madison jfrey@cs.wisc.edu OGF 19...
Jaime Frey
Computer Sciences Department
University of Wisconsin-Madison
jfrey@cs.wisc.edu
http://www.cs.wisc.edu/condor
OGF 19 Condor Software Forum
Routing Jobs to the Grid
What’s a Job Router?
Specialized scheduler operating on schedd’s jobs.
Job Router, a.k.a. Schedd On The Side
[Diagram: Schedd with its job queue (Job 1, Job 2, Job 3, Job 4, Job 5, …) and, beside it, the Job Router with Job 4*]
Adapted Quill Technology
› Using Quill library to mirror job queue in memory
o Efficient - just “tails” the log
o Independent - mirror without clogging schedd command queue
› Modifying the job queue is another matter - must interact with schedd
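A minimal sketch of the "tail the log" idea: keep an offset into an append-only log and replay only the new records into an in-memory mirror. The record format (`NEW`, `SET`, `DEL`) and the `QueueMirror` class are hypothetical, chosen for illustration; this is not the real schedd log format or the Quill library API.

```python
from typing import Dict, List


class QueueMirror:
    """In-memory copy of a job queue, rebuilt by replaying log records.

    Assumed (hypothetical) record format, one record per line:
        NEW <job_id>
        SET <job_id> <attr> <value>
        DEL <job_id>
    """

    def __init__(self) -> None:
        self.jobs: Dict[str, Dict[str, str]] = {}
        self.offset = 0  # number of log records already applied

    def apply(self, record: str) -> None:
        parts = record.split(maxsplit=3)
        if not parts:
            return  # ignore blank lines
        op, job_id = parts[0], parts[1]
        if op == "NEW":
            self.jobs.setdefault(job_id, {})
        elif op == "SET":
            self.jobs.setdefault(job_id, {})[parts[2]] = parts[3]
        elif op == "DEL":
            self.jobs.pop(job_id, None)
        else:
            raise ValueError(f"unknown record: {record!r}")

    def tail(self, log: List[str]) -> int:
        """Apply only the records appended since the last call."""
        new_records = log[self.offset:]
        for record in new_records:
            self.apply(record)
        self.offset = len(log)
        return len(new_records)
```

The mirror only reads the log, so it never sends a command to the schedd. Changing a job still has to go through the schedd.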
Usage Case
Routing: Vanilla -> Grid
Condor Farm Story
[Diagram: Application -> condor_submit -> Schedd job queue -> Startd resources, with many job boxes labeled RandomSeed]
• Now that this is working, how can I use my collaborator’s resources too?
Option #1: Merge Farms
› Combine machines with collaborator into one Condor resource pool.
o Everything works just like it did before.
o Excellent option for small to medium clusters.
o Requires bidirectional connectivity to all startds, or equivalent via GCB.
o Requires some administrative coordination (e.g. upgrades, negotiator policy, security, etc.)
Option #1b: submit to multiple pools
› condor_submit -remote …
› Works
› Ok for small scale
› Have to manually partition jobs
Option #2: Flocking Together
[Diagram: Schedd with RandomSeed jobs running on both Local Startds and Remote Startds]
• full featured (std universe etc.)
• automatic matchmaking
• easy to configure
• requires bidirectional connectivity
• both sites must run Condor
Option #3: Grid Universe
[Diagram: Schedd with RandomSeed jobs on its own Startds (vanilla) and RandomSeed jobs sent through Gatekeeper X to site X]
• easier to live with private networks
• may use non-Condor resources
• restricted Condor feature set (e.g. no std universe over grid)
• must pre-allocate jobs between vanilla and grid universe
Option #4: Routing Jobs
[Diagram: Schedd with RandomSeed jobs on Local Startds (vanilla); Schedd On The Side sending RandomSeed jobs through Gatekeepers X, Y, and Z to site X, site Y, and site Z]
• dynamic allocation of jobs between vanilla and grid universes.
• not every job is appropriate for transformation into a grid job.
Example Routing Table
[ GridResource = "gt2 gatekeeper.site1/jobmanager-pbs";
  MaxJobs = 500;
  MaxIdle = 50;
  set_GlobusRSL = "(…)" ]
[ GridResource = "condor schedd.site2 collector.site2";
  MaxJobs = 700;
  MaxIdle = 100;
  Requirements = other.ImageSize < 500 ]
…
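A rough sketch of how a router might use such a table, assuming MaxJobs caps the jobs routed to a site at one time, MaxIdle caps the routed jobs still idle there, and Requirements filters which jobs a route accepts. The `Route` class and `pick_route` function are hypothetical names for illustration, not the Job Router's actual implementation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Job = Dict[str, int]  # simplified job ad: attribute name -> numeric value


def accept_any(job: Job) -> bool:
    """Default Requirements: every job matches."""
    return True


@dataclass
class Route:
    grid_resource: str
    max_jobs: int   # assumed: cap on jobs routed to this site at one time
    max_idle: int   # assumed: cap on routed jobs that are still idle
    requirements: Callable[[Job], bool] = accept_any
    routed: int = 0
    idle: int = 0

    def has_room(self) -> bool:
        return self.routed < self.max_jobs and self.idle < self.max_idle


def pick_route(job: Job, routes: List[Route]) -> Optional[Route]:
    """Return the first route with room whose Requirements accept the job."""
    for route in routes:
        if route.has_room() and route.requirements(job):
            route.routed += 1
            route.idle += 1  # a freshly routed job starts out idle
            return route
    return None  # job stays in the local queue as a vanilla job
```

A job that matches no route is simply left alone, which is what makes the allocation between vanilla and grid dynamic.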
What About I/O?
› Jobs must be sandboxable (i.e. specifying input/output via transfer-files mechanism).
› Routing of standard universe is not supported.
› Must have enough storage space at site for input/output files!
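The three conditions above can be sketched as a single check. This assumes the job ad carries `JobUniverse` (with 5 meaning vanilla), `ShouldTransferFiles`, and `DiskUsage` in KB; the function name and the `site_free_kb` parameter are hypothetical, and the real router may apply different tests.

```python
VANILLA_UNIVERSE = 5  # assumed numeric code for the vanilla universe


def is_routable(job_ad: dict, site_free_kb: int) -> bool:
    """Return True if the job looks safe to route to a remote site."""
    # Standard universe (and anything else non-vanilla) is not routed.
    if job_ad.get("JobUniverse") != VANILLA_UNIVERSE:
        return False
    # Sandboxable: input/output must go through the file-transfer mechanism.
    if str(job_ad.get("ShouldTransferFiles", "NO")).upper() != "YES":
        return False
    # The site needs enough storage for the job's input/output files.
    return int(job_ad.get("DiskUsage", 0)) <= site_free_kb
```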
What Types of Grids?
› Routing table may contain any combination of grid types supported by Condor’s grid universe.
› Example: Condor-C
[Diagram: Schedd and Schedd On The Side submitting RandomSeed jobs to Schedd X at site X]
• for two Condor sites, schedd-to-schedd submission requires no additional software
• however, still not as trivial to use as flocking
Source Routing
› Routing the old-fashioned way:
universe = Grid
GridResource = condor site1 …
remote_universe = Grid
remote_GridResource = condor site2 …
remote_remote_universe = Grid
remote_remote_GridResource = pbs
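One way to read the nested prefixes: each hop consumes its own universe and GridResource, and every `remote_` attribute loses one prefix to become the description seen at the next hop. The sketch below models that reading. The `next_hop` function and the `executable` value are hypothetical, and the elided "…" parts are kept as written.

```python
PREFIX = "remote_"
ROUTING_KEYS = {"universe", "GridResource"}  # consumed at each hop


def next_hop(job: dict) -> dict:
    """Return the job description as the next schedd in the chain sees it."""
    forwarded = {}
    # Ordinary attributes carry over; this hop's routing keys are dropped.
    for key, value in job.items():
        if not key.startswith(PREFIX) and key not in ROUTING_KEYS:
            forwarded[key] = value
    # Each remote_ attribute loses exactly one prefix.
    for key, value in job.items():
        if key.startswith(PREFIX):
            forwarded[key[len(PREFIX):]] = value
    return forwarded


job = {
    "executable": "sim",  # hypothetical
    "universe": "Grid",
    "GridResource": "condor site1 …",
    "remote_universe": "Grid",
    "remote_GridResource": "condor site2 …",
    "remote_remote_universe": "Grid",
    "remote_remote_GridResource": "pbs",
}
at_site1 = next_hop(job)
at_site2 = next_hop(at_site1)
```

The whole path is fixed by the submitter up front, which is the contrast with a routing table that picks the destination later.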
Routing At the Site
[Diagram: at site X, Gatekeeper X and its Schedd, with a Schedd On The Side routing on to Schedd X2 and Schedd X3]
• navigate internal firewalls
• provide custom routes for special users
• improve scalability
• However, keep in mind I/O requirements etc.
Multicast in Future?
› Currently: route one job to one site
› Multicast: route one job to many sites
› Thin out all but first to germinate
› … or all but first to yield fruit.
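Since multicast is only proposed here, the following is a speculative sketch of the two thinning policies. It reads "germinate" as the first copy to start running and "yield fruit" as the first copy to complete. The `thin` function, the event tuples, and the state names are all hypothetical.

```python
from typing import List, Optional, Tuple

Event = Tuple[str, str]  # (site, new state of the copy at that site)


def thin(sites: List[str], events: List[Event],
         keep_on: str) -> Tuple[Optional[str], List[str]]:
    """Pick the first copy to reach state keep_on; the rest get removed.

    keep_on="running"   -> thin out all but first to germinate
    keep_on="completed" -> thin out all but first to yield fruit
    Returns (winning site, sites whose copies should be removed).
    """
    for site, state in events:
        if state == keep_on and site in sites:
            return site, [s for s in sites if s != site]
    return None, []  # no copy has reached that state yet; keep them all


sites = ["X", "Y", "Z"]
events = [("Y", "running"), ("X", "running"),
          ("X", "completed"), ("Y", "completed")]
first_to_start = thin(sites, events, keep_on="running")
first_to_finish = thin(sites, events, keep_on="completed")
```

Thinning on start wastes fewer cycles; thinning on completion guards against a copy that starts first but finishes last.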
Future Glidein Factory
[Diagram: at home, the Schedd and Schedd On The Side send glidein jobs through Gatekeeper X; the resulting Startds at site X run the RandomSeed jobs]
• true late binding of jobs to resources
• may run on top of non-Condor sites
• supports full feature-set of Condor (e.g. standard universe)
• requires GCB for private networks
Glideing in the Factory
[Diagram: Schedd and Schedd On The Side connect schedd-to-schedd to a glidein factory at site X, which submits schedd-to-gatekeeper; RandomSeed jobs run at the site]
• hierarchical strategy for scalability and reliability
• better match for private networks
• may require some additional horsepower from gatekeeper machine, perhaps a dedicated element for “edge services”.
Pluggable Router
› Beyond simple ClassAd transforms
› Plugins would fire when job matches entry in routing table
› Don’t yet understand semantics
› There is work to do!
Thanks
Interested? Let us know.
We are currently using job routing for specific users at UW.
Future development will focus on more use-cases.
Jaime Frey