
Summary of TEG outcomes

First cut of a prioritisation/categorisation

Ian Bird, CERN
WLCG Workshop, New York

May 20th 2012

Ian.Bird@cern.ch

Comments

• This is far from being a comprehensive or complete summary

• Not discussed here:
  – Directions/decisions that are already taken

• Extracted here are essentially:
  – Action items
  – Items in need of further work/discussion
  – Unanswered questions
  – A few provocative comments

• I have sometimes made strong conclusions from tentative statements …


General needs

• Overall strategy
  – Robustness and simplicity of use: move towards “Computing as a Service”, particularly at smaller sites with limited effort
    • Implies that trivial set-up and configuration of services is essential
    • Environments need to be self-describing (or the job must be able to determine its environment); no complex info publishing or requirements

• Better monitoring:
  – Network monitoring, including traffic flows etc. Need to correlate with how DM is done.
  – Mechanism to do analysis on monitoring data
  – Better coordinate dashboards, availability tests, etc.
  ➔ Set up a WLCG monitoring group to coordinate and oversee this

Data and Storage

• Distinguish between tape archives and disk pools
  – Data on tape is moved explicitly to a disk pool, not invisibly migrated

• TBD: Distinguish between Tier 2s that really provide data storage and those that are merely caches
  – The latter could have a simple storage service, esp. if http as a protocol is usable (e.g. squid)
  – Determine what lower level of service is required at such Tier 2 caches

Data and storage – 2

• Data federation with xrootd is a clear direction, for some part of the data
  – Later using http?

• Essential to have robustness of storage services at a site
  – Argument for smaller sites to act as “cache” rather than “storage”

• Use of remote I/O
  – Several use scenarios, but needs monitoring data to ensure efficiency
  – Hopefully most of this being integrated into xrootd

Data and storage – 3

• And SRM??
  – Keep as interface to archives and managed storage
    • But useful functionality has been delineated
  – Not there for federated storage with xrootd
  – FTS-3 can talk directly to gridftp anyway
  – No specific need to replace SRM as an interface
    • But may be an interest in cloud storage interfaces at some point (technology watch)
    • Allow/encourage (??) sites to offer other interfaces

Data and Storage

• Conclusions:
  – Don’t question use of gridftp for now
  – Need all systems to support xrootd fully
    • Anything actually missing here?
  – Eventual use of http is potentially interesting
    • Continue work on plugins and testing at low (?) or high (?) priority (but limited effort?)
  – FTS-3 is high priority
    • Follow up on requirements; use for tape-disk movement; use of replicas if source file is missing
  – Storage accounting
    • EMI StAR, but need an implementation
  – I/O benchmarking, requirements, monitoring
    • To improve I/O performance and clarify statement of needs to vendors

Open Questions: Access Patterns

• Difference between staging data to and from the WN, versus:
  – I/O over the LAN to local storage
  – I/O over the WAN to remote storage

• Connected questions:
  – What fraction of each file is read?
    • How sparse are sparse reads?
  – How well is this fraction known with respect to the type of file and the processing stage?
  – Impact of new vector read (TTreeCache)
    • How many round-trips per GB of data used?
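The two quantities asked about above (fraction of a file read, round-trips per GB used) could be derived from a per-file read trace. The following is only a minimal sketch: the trace format of `(offset, length)` pairs and all function names are assumptions made for illustration, not an existing monitoring interface.

```python
from typing import Iterable, List, Tuple

Read = Tuple[int, int]  # (offset, length) in bytes; hypothetical trace format


def merge_ranges(reads: Iterable[Read]) -> List[Tuple[int, int]]:
    """Merge overlapping or adjacent reads into disjoint [start, end) intervals."""
    merged: List[Tuple[int, int]] = []
    for start, length in sorted(reads):
        if length <= 0:
            continue  # ignore empty or malformed entries
        end = start + length
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous interval: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def bytes_used(reads: Iterable[Read]) -> int:
    """Number of distinct bytes touched at least once."""
    return sum(end - start for start, end in merge_ranges(reads))


def fraction_read(reads: Iterable[Read], file_size: int) -> float:
    """Fraction of the file's bytes that were read at least once."""
    return bytes_used(reads) / file_size


def round_trips_per_gb(n_requests: int, reads: Iterable[Read]) -> float:
    """Requests issued per GB (10^9 bytes) of distinct data actually used."""
    used = bytes_used(reads)
    if used == 0:
        raise ValueError("no data was read")
    return n_requests / (used / 1e9)
```

With vector reads, several `(offset, length)` entries would share one request, so `n_requests` is passed separately from the list of reads.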

Open Questions: Federation

• Repair only mode
  – Can we verify the TEG expected data volume?
  – Repair by catalogue-SE comparisons
    • What is the difference to re-populate by FTS?

• Caching
  – Caching files or what has been read?
  – Caching and access control?
    • Caching for world readable (reduced AA) only?
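The catalogue-SE comparison mentioned under repair-only mode amounts to a set difference between what the catalogue expects and what the storage element holds. A minimal sketch, assuming the catalogue is available as a mapping from file name to size and the SE contents as a plain list of names (both input formats and all names are hypothetical):

```python
from typing import Dict, Iterable, Mapping, Set, Union


def compare_catalogue_with_se(
    catalogue: Mapping[str, int],  # file name -> size in bytes (assumed format)
    se_dump: Iterable[str],        # file names found on the storage element
) -> Dict[str, Union[Set[str], int]]:
    """Report files to repair, unknown files, and the data volume to repair."""
    on_se = set(se_dump)
    expected = set(catalogue)
    missing = expected - on_se  # catalogued but absent: candidates for repair
    dark = on_se - expected     # present on the SE but unknown to the catalogue
    repair_bytes = sum(catalogue[name] for name in missing)
    return {"missing": missing, "dark": dark, "repair_bytes": repair_bytes}
```

The `repair_bytes` total is the kind of figure that could be compared against the expected repair volume.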

Open Questions: WN

• Staging to WN
  – For read access: local disk I/O
    • Most efficient alternative, excellent clients
  – For writing: how to stage out data without losing data due to running out of queue time?
  – Discussion needs input from data access monitoring to understand role of sparse reads
  – Measurements needed to directly compare access strategies

Open Questions: World Readable Data with relaxed AA

• Expected benefits: fewer round-trips, reduced computational overhead, much improved latency for access to many small files, simplicity for many operations (caching, etc.)

• How to manage transition?
  – To be efficient, has to work without moving the data

• How will clients be aware and suppress AA costs?
• Restricted to subset of access protocols?
• What fraction of the data and processing qualify?
  – Results from data access studies needed as input

Data security

• Can we agree a model that distinguishes between:
  – Read-only data (that can be cached)
    • Need to specify how caches are populated
  – Written data that needs to be stored
  – This model would allow simple AA for r-o (lower overhead)

• Can we agree to distinguish between sites
  – That store and manage data
    • These need real data management systems
  – That cache data for analysis or processing
    • These might need only off-the-shelf storage (or squids) accessible via xrootd
    • Would benefit then from use of http as transport
    • Also would need to define how such a site (or jobs on a site) moves output files to real storage

Workload management

• Glexec:
  – Deploy fully in setuid mode. Define timescale now and follow up.

• No further need for WMS: decommission end 2012?

• Pilots:
  – Report is too conservative?
  – Support streamed submission:
    • Requires modified CE; need to test at scale by 2013 (CE changes have taken years to reach production)
  – Common pilot framework?
    • Based on glideinWMS?
  – So why do we still need a complex CE?
    • No answer? Is there a simplification to be made?
    • The above is “anti-CaaS”?

WLM – 2

• Whole node and multi-core
  – Complex solution proposed, including new JDL and new CE interfaces, in order to allow experiments to make arbitrary requests.
  – Why?
    • This goes against “CaaS”?
  ➔ Simplification: job wakes up, determines what is available, runs.
  ➔ Why not?
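The “job wakes up, determines what is available, runs” simplification can be illustrated with a job that inspects its own slot at start-up. This is a sketch under stated assumptions: it uses only the Python standard library, the function names and the memory-per-worker rule are illustrative, and total physical memory is only a rough stand-in for whatever limit the batch system actually enforces on the slot.

```python
import os
from typing import Optional


def available_cores() -> int:
    """Cores this process may use, honouring the affinity mask where supported."""
    if hasattr(os, "sched_getaffinity"):  # Linux-only call
        return len(os.sched_getaffinity(0))
    return os.cpu_count() or 1


def physical_memory_bytes() -> Optional[int]:
    """Total physical memory, or None if the platform does not expose it."""
    try:
        return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    except (ValueError, OSError, AttributeError):
        return None


def plan_workers(mem_per_worker_bytes: int) -> int:
    """Pick a worker count that fits both the visible cores and the memory."""
    cores = available_cores()
    memory = physical_memory_bytes()
    if memory is None:
        return cores  # no memory information: fall back to the core count
    return max(1, min(cores, memory // mem_per_worker_bytes))


if __name__ == "__main__":
    # Assumed requirement of 2 GiB per worker process, for illustration only.
    print("workers to start:", plan_workers(2 * 1024**3))
```

The point of the sketch is that nothing here requires the job to have declared its needs through a JDL or CE interface beforehand.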


WLM – 3

• CPU pinning + I/O bound vs CPU bound jobs
  – Why? Is it really practical to think of optimisation at this level?
  – Adding complexity for undefined benefit?
  – Why expose it at the grid layer?
  ➔ HEPiX
  ➔ SFT concurrency project to address CPU efficiency in general

WLM – 4

• Virtual CE: better support for “any” LRMS
  – Clear essential need

• Virtualisation use cases
  – Essentially a site decision
  – Consider performance issues

• Cloud use cases
  – Unresolved issues (AAA, etc.)
  – More work is required here

HEPiX and/or WLCG WG

Information system

• Really distinguish between:
  – “Stable” information needed for service discovery
  – “Changing” information for monitoring etc.
    • No use case at all for info related to job brokering
  – Need a clear proposal for how to proceed
  ➔ Set up a small, rapid WG to
    a) Make a clear statement of the status (some work has been done here)
    b) Define the plan and clarify specific goals.

Databases

• Ensure support for COOL/CORAL+server:
  – Core support will continue in IT; ideally supplemented by some experiment effort
  – POOL no longer supported by IT

• Frontier/Squid as full WLCG service:
  – Should be done now; partly already
  – Needs to be added to GOCDB, monitoring etc.
  – Who is responsible?

• Hadoop: (and NoSQL tools)
  – Not specifically a DB issue; broader use cases
  – CERN will (does) have a small instance; part of monitoring strategy
  ➔ Important to have a forum to share experiences etc.
  ➔ GDB

Operations & Tools

• WLCG service coordination team:
  – Should be set up/strengthened
  – Should include effort from the entire collaboration
  – Clarify roles of other meetings

• Strong desire for “Computing as a Service” at smaller sites

• Service commissioning/staged rollout
  – Needs to be formalised by WLCG as part of service coordination

Operations & tools – 2

• Middleware
  – Before investing too much, see how much actual middleware still has a long-term future
  – Simplify service management (goal of CaaS)
    • Several different recommendations involved
  – Simplify software maintenance
  ➔ This requires continuing work

• Need to write a statement on software management policy for the future
  – Lifecycle model post EMI, and new OSG model
    • Proposals very convergent!

Security – Risk Analysis

• Highlighted the need for fine-grained traceability
  – Essential to contain, investigate incidents, prevent re-occurrence

• Aggravating factor for every risk:
  – Publicity and press impact arising from security incidents

• 11 separate risks identified and scored

Security – areas needing work

• Fulfil traceability requirements on all services
  – Sufficient logging for middleware services
  – Improve logging of WNs and UIs
  – Too many sites simply opt out of incident response: “no data, no investigation -> no work done!”
  – Prepare for future computing model (e.g. private clouds)

• Enable appropriate security controls (AuthZ)
  – Need to incorporate identity federations
  – Enable convenient central banning

• People issues:
  – Must improve our security patching and practices at the sites
  – Collaborate with external communities for incident response and policies
  – Building trust has proven extremely fruitful; needs to continue

Discussion/work group topics

TEG | WG / Liaison | Purpose
WLM | HEPiX | Liaison(s) with HEPiX (and others) on CPU pinning and “cloud” computing
WLM | “CE” | At least one WG to define CE extensions (and/or alternatives) in more detail: scoping work, defining timescales, testing and deployment plans
Several | IS | IS WG to (re-)define requirements, their implementation and deployment
DSM | Topical storage groups | e.g. R/O placement layer; SRM alternates; liaison with ROOT I/O WG; separation of R/O & R/W data incl. R/O caches; federation as “repair mechanism”
OPS | m/w services & configuration | WGs to review m/w services and m/w configuration tools / mechanisms (not clear how useful now)
OPS | Coordination | Not a WG per se, but still a body that will continue and will monitor / coordinate other efforts
OPS | Service Commissioning | A “virtual team” created (and disbanded) as required, with targeted expertise, to validate, commission and trouble-shoot
DB | “user group” | To share experiences
All | Monitoring | Coordinate all monitoring activities, including missing functions (e.g. network traffic), + monitoring analysis
DSM | Data access security | Define/agree data access/placement security model
All | HEPiX? | Technology watch: storage interfaces, protocols, etc.

Some questions for the workshop

• What should be done to approach “Computing as a Service” for sites?
• Can we agree a strategy for a CE that does not add complexity but allows pilot factories, etc.?
• Can we agree a simplified subset of SRM?
• Can we separate archives and disk storage?
• Can we distinguish between sites that store and sites that cache data only?
• Can we agree a straightforward data security model?
• How far can we converge “middleware” across grid infrastructures?
• What are disruptive changes that must be done in LS1? (any?)

Need to do in LS1:

• Testing new concepts at scale:
  – FTS-3 scale testing
  – On large sites, separation between archives and placement layer
  – Federation: run production with some fraction of data not local
    • Needs good monitoring
  – Test reduced data access AuthZ requirements
  – Testing use of multicore/whole node environments?

Hello, Good-bye (to be completed…)

Hello:
• CVMFS
• Frontier/Squid
• …

Good-bye:
• POOL
• LFC …
• WMS

Effort?

• Re-iterate the need for more collaborative activities …
