WP1 WMS rel. 2.0 Some issues Massimo Sgaravatto INFN Padova.
WP1 WMS rel. 2.0: Some issues
Massimo Sgaravatto, INFN Padova
Outline
- Some issues to discuss (and let’s try to decide)
  - LB server choice
  - New CondorG
  - Proxy renewal
  - RLS integration
  - WP2 Optor integration
  - Output data upload and registration
  - LB issues
  - Gangmatching
  - Security of files on the WM node
  - Disk quota management in WM node
  - VOMS integration
  - Job exit code
  - ISB/OSB transfer errors
  - Accounting integration
  - User vs host proxies
  - … ?
LB server choice
- Allow multiple LB servers for a single WM, for increased reliability and performance
- Approach
  - UI responsible for choosing the LB server (e.g. via round robin) ?
  - List of available LB servers in the UI conf file, until this VO-specific info is published in a “VO repository” (R-GMA/IS/VOMS) ?
  - Move the list of available NSs to this VO repository as well, when available
  - Not too clear yet what this VO repository could be (discussions within ATF)
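The round-robin choice mentioned above could look roughly like the following. This is only a sketch under stated assumptions: the server names and the `make_lb_chooser` helper are hypothetical, and the list is hard-coded where a real UI would read it from its conf file.

```python
from itertools import cycle


def make_lb_chooser(lb_servers):
    """Return a function that hands out LB servers in round-robin order."""
    if not lb_servers:
        raise ValueError("no LB servers configured")
    ring = cycle(list(lb_servers))
    return lambda: next(ring)


# Hypothetical list, standing in for what the UI conf file would provide.
LB_SERVERS = ["lb1.example.org", "lb2.example.org", "lb3.example.org"]
choose_lb = make_lb_chooser(LB_SERVERS)
```

Note that the rotation only holds within one running process. If every job submission starts a fresh UI process, the position would have to be stored somewhere, or a random choice used instead.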
New CondorG
- New CondorG negotiated with the Condor people (more details by Francesco P.)
- Released by end of March, included in VDT, and to be used in rel. 2.0
- Two proxies
  - X509UserProxy
    - One per job
  - X509ManagementProxy
    - One per user’s DN, or one “serving” n jobs for that user’s DN
    - A CondorG <gridmanager, gahp-servers> pair for a given X509ManagementProxy
- Details of the whole machinery to be discussed
  - Where is the mapping from user’s DN to X509ManagementProxy kept and managed ?
  - Proxy renewal ?
  - …
Proxy renewal
- Necessary to have a “persistent” proxy renewal daemon (i.e. if it is restarted it shouldn’t lose control of the “managed” jobs, as happens now)
- Necessary to discuss and decide on various issues
  - Renewal of X509UserProxy
    - Done only if requested by the user (if MyProxyServer specified in the JDL ?) ?
    - No MyProxyServer in WM conf file anymore ?
  - And what about renewal of X509ManagementProxy ?
    - If a new proxy “arrives” from the UI and extends the validity of the existing one, does the new one replace the old one ?
    - Not enough: what if at least one job of that user asked for proxy renewal ?
      - Necessary to renew the X509ManagementProxy as well
  - Who does registration ? NS ? Who does un-registration ?
  - …
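One way to read the “persistent” requirement is that job registrations are written to disk and reloaded on restart. The sketch below illustrates that idea, together with the "replace only if validity is extended" rule. All names (`RenewalRegistry`, `should_replace`, the JSON file layout) are assumptions for illustration and not the actual daemon design.

```python
import json
import os
import tempfile


class RenewalRegistry:
    """Job -> proxy registrations that survive a daemon restart."""

    def __init__(self, path):
        self.path = path
        self.jobs = {}
        if os.path.exists(path):
            with open(path) as fh:
                self.jobs = json.load(fh)

    def _save(self):
        # Write to a temporary file and rename it, so that a crash
        # never leaves a half-written registry behind.
        directory = os.path.dirname(os.path.abspath(self.path))
        fd, tmp = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w") as fh:
            json.dump(self.jobs, fh)
        os.replace(tmp, self.path)

    def register(self, job_id, proxy_file, myproxy_server):
        self.jobs[job_id] = {"proxy": proxy_file, "myproxy": myproxy_server}
        self._save()

    def unregister(self, job_id):
        self.jobs.pop(job_id, None)
        self._save()


def should_replace(current_expiry, new_expiry):
    """A newly arrived proxy replaces the stored one only if it expires later."""
    return new_expiry > current_expiry
```

Who calls `register` and `unregister` (NS or another component) is exactly the open question raised on the slide; the sketch does not settle it.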
RLS integration
- At J+27 the RB/MM will have to query the WP2 RLS instead of the WP2 RC to get the SFNs given an LFN (or LCN, or a GUID)
  - Ongoing negotiation of this WP1-WP2 interface
- New JDL attribute (VirtualOrganization) to make it possible to refer to the “official” VO’s RLS (needed by WP2 services)
  - Not needed anymore once VOMS is integrated, since it will then be possible to get the VO from the user’s proxy
  - Optional JDL attribute to make it possible to specify a “non-official” RLS ?
- edgReplicaManager::listReplicas to get the SFNs
- New BrokerInfo content (under negotiation)
Integration with WP2 Optor
- Completely different approach from querying the RLS to get the PFNs (mutually exclusive) …
  - RB calls getAccessCost for all the suitable CEs (the ones where the user is authorized to submit jobs and which match the JDL “Requirements” expression) and for all the specified input data (LFNs, LCNs, GUIDs)
  - A “cost” is returned for each CE
  - The RB chooses the CE, taking into account this cost and also the other Ranks (to be decided how)
  - In some cases the WM also has to trigger the replication of files to the closeSE
- Not too difficult, but very high impact on the scheduling/planning performed by RB/MM
- Integration WMS-Optor
  - Planned after J+27
  - However, according to WP2, this will be ready and tested well before J+27
  - To discuss details of integration
    - How ? A binary flag in the WM conf file to enable/disable Optor ?
    - When ?
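How the access cost and the other Ranks should be combined is still open. As one possible illustration only, the sketch below picks the CE with the lowest cost and uses the rank to break ties. Here `get_access_cost` and `rank` are hypothetical callables passed in by the caller; they stand in for the WP2 call and the JDL Rank evaluation and do not reproduce their real signatures.

```python
def choose_ce(candidates, input_data, get_access_cost, rank):
    """Pick the CE with the lowest access cost; break ties by highest rank.

    candidates      -- CEs already filtered by authorization and Requirements
    input_data      -- the LFNs/LCNs/GUIDs specified by the job
    get_access_cost -- stand-in for the Optor cost query (assumed signature)
    rank            -- stand-in for the evaluation of the JDL Rank expression
    """
    if not candidates:
        return None
    return min(
        candidates,
        key=lambda ce: (get_access_cost(ce, input_data), -rank(ce)),
    )
```

A weighted sum of cost and rank would be an equally plausible policy; the lexicographic ordering is used here only because it needs no tuning parameter.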
Output data upload and registration
- Problem discussed and solution agreed in the ATF
- Approach (details by Fabrizio P.):
  - OutputData JDL attribute (optional) to specify output file names, output LFNs and output SEs
  - JobWrapper at the end has to call the WP2 function copyAndRegister
- Issues
  - Some details about copyAndRegister to be sorted out
  - Release date of this not decided yet
LB
- What happens exactly at J+27 wrt:
  - “Advanced query to LB” ?
  - “LB – RGMA integration” ?
- How ? Interfaces (e.g. for advanced queries) ? Issues ?
- Ales ??
Gangmatching
- Problem: take into account both CE and SE information in the matchmaking
  - For example, to require a job to run on a CE close to an SE with “enough space”
- Salvo has been working on this for a while, also after some negotiations with the Condor team (A. Roy)
- Salvo’s talk for details (e.g. JDL) and discussions
- When can this be released ? J+27 ?
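The “CE close to an SE with enough space” constraint can be illustrated in a few lines. This is a plain sketch of the constraint itself, not of the Condor gangmatching mechanism or of the JDL syntax; the record layout (`name`, `close_ses`, `free_gb`) is invented for the example.

```python
def match_ces(ces, ses, min_free_gb):
    """Return the names of CEs that have at least one close SE with enough free space."""
    free = {se["name"]: se["free_gb"] for se in ses}
    return [
        ce["name"]
        for ce in ces
        if any(free.get(se_name, 0) >= min_free_gb for se_name in ce["close_ses"])
    ]
```

The point of gangmatching is that the CE and the SE are matched together in one step, rather than first choosing a CE and then checking its SEs afterwards.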
Security of files on the WM node
- Approach
  - WP1 services (NS, …) running as edguser.edguser in the WM node
  - Different users’ subjects mapped to different local users in the grid-mapfile: user1.user, user2.user, …
  - Patched gridftp server (by Massimo M.) running on the NS node, so that the InputSandbox files are transferred to the NS node with edguser as group and rwxrwx--- as mask
  - So a user cannot access files belonging to another user anymore
- Issues
  - When ? J+27 ?
  - How ? Gridftp server RPM released by WP1 ?
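For readers less used to permission masks, the short sketch below spells out what rwxrwx--- means: the owning user and the group (edguser) have full access, everybody else has none. The helper names are made up for this example; only the standard `stat` module is used.

```python
import stat

SANDBOX_MODE = 0o770  # rwxrwx---: owner and group only


def accessible_to_others(mode):
    """True if users outside the owner and the group have any permission."""
    return bool(mode & stat.S_IRWXO)


def mode_string(mode):
    """Render the permission bits the way ls does, for a regular file."""
    return stat.filemode(stat.S_IFREG | mode)
```

With each DN mapped to its own local user, a second user is neither the owner nor in the edguser group, so the sandbox files are out of reach.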
Disk quota management on the WM node
- Having different users’ DNs mapped to different local users in the grid-mapfile of the WM node allows disk quotas to be set for the various users
- NS to be modified (for J+27) so that it rejects a job if there is not enough disk quota available to store the input sandbox files
- Issues ? Marco ??
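The admission check the NS would need can be sketched as follows. It assumes the sizes of the input sandbox files and the user's quota figures are already known to the caller; how the NS would obtain them is not specified here, and the function name is hypothetical.

```python
def accept_job(sandbox_sizes, quota_bytes, used_bytes):
    """Accept the job only if the whole input sandbox fits in the remaining quota.

    sandbox_sizes -- sizes in bytes of the input sandbox files
    quota_bytes   -- disk quota of the local user the DN is mapped to
    used_bytes    -- space that user already occupies on the WM node
    """
    return sum(sandbox_sizes) <= quota_bytes - used_bytes
```

The check is done on the total size before any transfer starts, so a job is rejected up front rather than failing halfway through the sandbox upload.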
VOMS integration
- E.g.: voms-proxy-init –vo CMS
  - VO info in the generated proxy
- Impact on WP1 software
  - Retrieve VO from user’s proxy
    - So it is no longer necessary to provide it in the JDL for querying the RLS
  - Check for authorization no longer done with a matchmaking considering the User Cert Subject, but according to VO
  - Proxy used by the various services (NS, LB, etc.) generated by VOMS ?
- Issues
  - VOMS deployed at J+37, but not too clear which integration will take place, and when
  - Not clear yet which VOMS APIs are available
Job exit code
- For release 2.0 we agreed to return the job exit code to the user with dg-job-status
- What if exit code <> 0 ?
  - Done-ok in any case ?
  - Done-failed (and therefore resubmission) ?
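The two options on the table differ only in how a non-zero exit code is mapped to the final status. The sketch below makes the choice an explicit switch; the function name and the `nonzero_means_failure` flag are hypothetical and only serve to put the two policies side by side.

```python
def final_status(exit_code, nonzero_means_failure):
    """Map a job exit code to a final status under one of the two policies."""
    if exit_code != 0 and nonzero_means_failure:
        return "Done-failed"  # this outcome would lead to resubmission
    return "Done-ok"  # the exit code is still reported to the user
```

Under the first policy a job whose application legitimately returns a non-zero code is never resubmitted; under the second it is, which may or may not be what the user wants.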
ISB/OSB transfer errors
- In release 1.x a job is considered failed (and therefore resubmission is attempted) if the JobWrapper detects errors when transferring an ISB/OSB file between the RB node and the WN
  - But the failure could simply be due to a user’s error when writing the ISB/OSB expressions in the JDL …
  - And what if the job crashed because of “internal” problems and therefore some OSB files were not produced ?
    - Is it ok to mark the job as failed and re-attempt the submission, or is it better to consider the job as done-ok ?
- Approach in release 2.0
  - JobAdapter should check and issue globus-url-copy only for ISB/OSB files which exist (simple for OSB, a bit more complex for ISB), and/or globus-url-copy errors ignored ?
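For the OSB side, the "copy only what exists" idea could be sketched as below. The function builds the globus-url-copy command lines without running them, and reports the files the job did not produce. The function name, the destination URL layout and the decision to return the missing files rather than fail are assumptions for illustration.

```python
import os


def osb_transfer_commands(osb_files, workdir, dest_url):
    """Build globus-url-copy commands only for OSB files the job actually produced.

    Returns (commands, missing): the command lines to run, and the names of
    the declared OSB files that were not found in the working directory.
    """
    commands, missing = [], []
    base = os.path.abspath(workdir)
    for name in osb_files:
        local = os.path.join(base, name)
        if os.path.isfile(local):
            source = "file://" + local
            target = dest_url.rstrip("/") + "/" + name
            commands.append(["globus-url-copy", source, target])
        else:
            missing.append(name)
    return commands, missing
```

The ISB case is harder because the files live on the RB node rather than in the local working directory, which is presumably why the slide calls it a bit more complex.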
Accounting integration
- What exactly happens at J+27 (“Accounting infrastructure”) ?
- And later, after release 2.0 (“Full integration of cost estimation/accounting into scheduling policies”) ?
- Dependencies and interfaces with other components and other WPs at J+27 and later ?
Host vs user proxies
- Can we rely on users’ proxies instead of host proxies for authentication when possible, as recommended ?
  - E.g. in LB logging
  - Other cases ?