Data Preservation in HEP - Indico › event › 731584 › attachments › 1654419 › ...Slide 18...

Data Preservation in HEP:

The Next 3 (2?) Years

CERN IT-MM, June 2018

These slides and associated material:

https://indico.cern.ch/event/731584/

[email protected]

International Collaboration for Data Preservation and Long Term Analysis in High Energy Physics

Slide 3

Overview

1. DPHEP “2020 vision” – brief reminder

2. Status of CERN Certification and Outlook

3. “EIROforum” TWG on Long-Term Data Preservation

4. PV2020@CERN [ Preservation & Value adding ]

5. European Strategy update – ESPP2020

Not covered: ARCHIVER, ESCAPE etc. (although Certification relevant here too)

Slide 4

DPHEP 2020 Vision – Reminder

• DPHEP Blueprint published in May 2012 – to some extent a “cry for help”

Urgent action is needed for LTDP in HEP

The preservation of the full capacity to do analysis is recommended such that new scientific output is made possible using the archived data

• Current ESPP (May 2013):

• …data preservation and distributed data-intensive computing should be maintained and further developed.

“Open Data” was not part of Blueprint. CMS policy: May 2012@CHEP

Slide 5

What does DPHEP do?

DPHEP is a Collaboration with signatures from the main HEP laboratories and some funding agencies worldwide.

• It has established a "2020 vision", whereby:

+ All archived data – e.g. that described in the DPHEP Blueprint, including LHC data – should be easily findable and fully usable by the designated communities, with clear (Open) access policies and possibilities to annotate further ( = P + V );

+ Best practices, tools and services should be well run-in, fully documented and sustainable; built in common with other disciplines, based on standards;

+ There should be a DPHEP portal, through which data and tools are accessed;

+ Clear targets & metrics to measure the above should be agreed between Funding Agencies, Service Providers and the Experiments.

Vision presented to the ICFA meeting in Feb 2013, which issued a “statement”

Slide 6

Path to “Vision” – some MS

• “Full Costs of Curation” workshop – inspired by 4C – January 2014

• Bit preservation “cost model”, good understanding of costs (P+M) vs “value” – adopted HEP-wide (and beyond), e.g. Data Rescue talk at PV2018

• First “Collaboration Workshop” (after signatures of CA) – June 2015 – DPHEP Status Report

• Led to LEP data on EOS – 3 copies just at CERN!

• Common reporting format for all HEP expts (DMP+SWOT)

• ISO 16363 training – June 2015 – CERN + T1s

iPRES 2016 “CERN Services for LTDP” paper

• Also CERNLIB documentation update, GPHIGS license etc.

2018: ICFA report; Request to host BaBar data; OPERA ingest; ESRIN visit; PV2020 agreement; ISO 16363 self-assessment; HSF-CWP

Meets and (greatly) exceeds ESPP; good progress towards “vision”

“Standards” include:

• FAIR DMPs; TDRs

+ OAIS & related

Slide 7

Built on 3 "pillars":

1. The data itself (“bits” – state of the art bit preservation, e.g. in a “Trustworthy Digital Repository” (TDR));

2. Documentation (services like Zenodo, B2SHARE);

together with the necessary

3. Software + environment (CernVM / CVMFS)

• Services for all 3 areas exist and are mature but change on fully independent timescales

We need flexible (not static) bridges between them

LTDP in HEP (iPRES 2016 paper)

See https://cds.cern.ch/record/2195937/files/iPRES2016-CERN_July3.pdf
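
The first pillar, bit preservation, rests on periodically re-reading archived files and comparing them against checksums recorded at ingest. The sketch below illustrates that idea only; it is not taken from the slides or from any CERN service. The choice of SHA-256, the manifest layout and the file name `run001.dat` are all assumptions made for illustration.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large files need little memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict:
    """Record a checksum per file under root (relative path -> hex digest)."""
    return {
        str(p.relative_to(root)): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify(root: Path, manifest: dict) -> list:
    """Return the relative paths that are missing or whose content changed."""
    damaged = []
    for rel, expected in manifest.items():
        target = root / rel
        if not target.is_file() or sha256_of(target) != expected:
            damaged.append(rel)
    return damaged


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        # Hypothetical file name and content, for illustration only.
        (root / "run001.dat").write_bytes(b"event data")
        manifest = build_manifest(root)
        print("damaged before corruption:", verify(root, manifest))
        (root / "run001.dat").write_bytes(b"event dat4")  # simulate bit rot
        print("damaged after corruption:", verify(root, manifest))
```

A real archive would keep the manifest separately from the data (and in more than one copy), since a manifest stored next to the files it describes can be lost together with them.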

Slide 8

Certification – a “sine qua non” of LTDP

• In May 2017 I wrote a note to the IT-MM on a strategy for certification as a TDR (attached)

• From then until end 2017, worked on draft responses to the 109 metrics in ISO 16363

• These have now been submitted to “stage 1” offsite audit and a contract signed with PTAB

• Feedback is expected shortly – work on OAIS update has delayed this but iteration is to be expected, if not major revisions

More details on the ISO audit process can be found in the PV2018 paper (attached)

“Sterling work” according to WLCG GDB chair

Slide 9

ISO 16363 (is right for CERN)

• Was developed and is maintained by the same people as OAIS (ISO 14721) – the “space community”

Much closer to us than e.g. humanities

• CoreTrustSeal, which came from DSA+WDS, follows the same breakdown: but it is not as thorough

• Satisfying ISO 16363 should “automatically” mean satisfying CoreTrustSeal (e.g. BSc vs Oxbridge)

• European Framework for Audit and Certification of Digital Repositories presents the main methodologies as a “hierarchy”, along with an MoU

• Others pursuing ISO 16363 include the EU Publications Office, the US Library of Congress & some “secret” ones

Open Archival Information System – an archive (systems + people)

Slide 10

Who does it benefit?

• Funding agencies, who can better judge if the money they are providing will be used according to their requirements

– e.g. FAIR DMPs which call for preservation & re-use

• Data users to be able to determine the “trustworthiness” of the data (user surveys)

• Producers (e.g. LHC experiments) to understand how and what a repository does to preserve their data

The data of most CERN experiments already lost!

• By number of experiments, not by volume

• CERN Greybook: 776 completed experiments, ~20 active

• “Preserved”: LEP(3/4), LHC(4)

O(10) vs O(1000)

Slide 11

Certification Areas

3. Organisational Infrastructure

• IMHO we are quite strong here, but some of the descriptions may well not be clear to people outside HEP (such as the auditors)

4. Digital Object Management

• Here we are (very) weak WRT standard. We don’t in general have AIPs etc but we have proven that we can “ingest” data (e.g. OPERA, maybe BaBar) TOGETHER WITH the experiment

IMHO (cf EOSC Pilot) a generic TDR could NOT

5. Infrastructure and Security Risk Management

• Probably OK although some elements, e.g. Business Continuity, still work-in-progress

e-group DPHEP-CERN-Certification

Slide 12

Example Feedback (APARSEN)

3. Organizational Infrastructure

• Currently, <SITE> does not formally document all changes to its operations, procedures, software and hardware.

4. Digital Object Management

• The process for converting SIPs to AIPs and the corollary mapping history between them was unclear.

5. Infrastructure and Security Risk Management

• <SITE> has no technology watch.

• <SITE> has no risk register.
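
The feedback on SIP-to-AIP conversion above is easier to address if every conversion step is logged, so that the mapping history can be replayed for an auditor. The following is a minimal sketch of such a log, not a description of any existing repository's implementation. The class names `MappingEvent` and `IngestLog`, the action labels and the identifiers in the usage example are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class MappingEvent:
    """One recorded step linking a submitted package to an archived one."""
    sip_id: str      # Submission Information Package identifier
    aip_id: str      # Archival Information Package identifier
    action: str      # free-text label, e.g. "ingest" or "repackage"
    timestamp: str   # ISO 8601, UTC


@dataclass
class IngestLog:
    """Append-only history of SIP -> AIP conversions."""
    events: list = field(default_factory=list)

    def record(self, sip_id: str, aip_id: str, action: str) -> MappingEvent:
        stamp = datetime.now(timezone.utc).isoformat()
        event = MappingEvent(sip_id, aip_id, action, stamp)
        self.events.append(event)
        return event

    def aips_for(self, sip_id: str) -> list:
        """All AIPs that were ever derived from the given SIP."""
        return sorted({e.aip_id for e in self.events if e.sip_id == sip_id})

    def history_of(self, aip_id: str) -> list:
        """Every recorded step that produced or touched the given AIP."""
        return [e for e in self.events if e.aip_id == aip_id]


if __name__ == "__main__":
    log = IngestLog()
    # Hypothetical identifiers, for illustration only.
    log.record("sip-0001", "aip-0001", "ingest")
    log.record("sip-0001", "aip-0002", "repackage")
    print(log.aips_for("sip-0001"))
    print([e.action for e in log.history_of("aip-0002")])
```

Keeping the log append-only is a deliberate choice in this sketch: an auditor asking how a given AIP came to exist needs the full sequence of steps, not only the latest state.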

Slide 13

What Happens Next?

• It is hard to make a concrete plan without the first written feedback

Still target an on-site audit in 2019 with Certification by 2020 – earlier if possible

• This will need the presence of a number of experts

– In IT, most likely from DI, ST, CDA & WLCG

• “Surveillance audits” would typically follow in 2021 & 2022

– Ideally, I should be involved in 1, if not 2, of these

Even 1 may no longer be possible if there are further delays (for whatever reason)!

• In parallel, the “motivation” for certification can be expected to increase (cf Science Europe w/s, FAIR action plan)

Slide 15

EIROforum WG on LTDP

• As mentioned above, the “space community” defined and maintains the LTDP standards

– Very active mailing list: hope to update OAIS in 2020!

• Initiated the “PV” conference series, all but one (@DCC) have been at “space institutes”

• Triggered by a visit from ESRIN (they now want to learn from CERN / HEP!), technical & topical meetings will be held with “EIROforum” institutes and similar, e.g. DLR, [ ARCHIVER procurers etc. ]

Complementary to PV but much more hands-on

• Topics could include archive i/f, tape strategies, portals, s/w preservation, certification etc.

First meeting at CERN(?) after the summer

Yet another “success story” IMHO…

Slide 16

PV2020@CERN

Just to be clear, I think that this is a great opportunity!

• Typically a 2.5 day meeting, probably late April / early May, plenaries + 2 parallel tracks + posters

• Can have some co-located events: some good suggestions from closing talk in RAL

– E.g. show-casing open data from different disciplines to school kids etc

• (We are not leaders in this area!)

• Target 150 – 200 attendees from all continents and many scientific disciplines

And another! We are definitely “on the map” WRT LTDP

Slide 17

Goals (as presented at RAL)

• Attract more scientific communities

• Broaden information exchange, sharing of experiences, tools and even services

• Keep in step with (or ahead of) funding agencies / policy makers in their push for LTDP & OD

• (Discussion at end) Suggestions came here

Slide 18

PV2020 Organisation

• Would prefer to avoid need for a sponsor (and sales talks)

• PV2018 registration was GBP 160 without conference dinner

• Should be able to cover coffee / lunch breaks, welcome “apero”, plus conference bag for this (or less…)

• Session chairs come from programme committee (usual suspects plus some new ones)

• Abstract submission / reviews done using EasyChair which works quite well (assume Indico for agenda & badges)

• Proceedings (4 pp per talk) published before meeting

• On-site visits? Would take organisation & guides but likely to be very popular

Quite a few invited talks were poor, 1 or 2 excellent!

Panel session went well – something to repeat?

• Local organisers at RAL seemed to be very stressed. Why?

Would be good to have DG / Directorate level support

Slide 19

ESPP 2020

• Clear from HSF CWP that “bit preservation” (with acceptably low error rate) considered “solved”

• Services around the key “pillars” of LTDP in HEP exist, are mature and well supported:

– Bit preservation; CVMFS/CernVM for s/w + environment; Invenio-based solutions for documentation [EOSC service]

• Focus now on “new” areas. Those discussed in Naples include:

– Re-use & reproducibility (always a goal but unclear how)

– Handling changes in access protocols

New and changing requirements from FAs will need to be considered

Input in 2012 was not well coordinated and sometimes contradictory

Slide 20

Beyond 2020

• The elaboration of a post-2020 DPHEP Vision, its implementation and that of new directives in the 2020 ESPP need to be addressed by someone else (non-CERN needs?)

• Ideally, they would start well before this, getting increasingly involved in ISO 16363 (re-)certification, PV2020 preparation, EIROforum LTDP WG, H2020 projects(?) and any other activities deemed important

OAIS (ISO 14721) updated in 2020, ISO 16363 and other updates will follow

Slide 21

Questions for Run3 management

1. Does CERN wish to continue with DPHEP Project Management? (See letter from former DRC)

• In principle would need to be approved by DPHEP Collaboration Board and ICFA in 2020 or before

2. Does CERN / HEP wish to continue to collaborate with other disciplines / policy makers / funding agencies? (At the same level as now? More?)

• We benefit – they benefit – we all benefit, e.g. costs & benefits, technical solutions, “knowledge is more than documentation” etc

3. Does CERN wish to maintain Certification?

4. Should this activity – if retained – be in IT?

5. Did you get the DPHEP CA from the former DRC?

Slide 22

Comments…

• By 2020, we should be able to…

Implement the DPHEP 2020 vision

Whilst taking account of the evolving landscape during this period (FAIR+DMPs+TDRs+EOSC etc)

Obtain ISO 16363 certification for CERN as a TDR for all LTDP activities

Bring the leading scientific LTDP conference to CERN

Run a WG with other major scientific organisations on LTDP – including those that “wrote the book”

• Change the way people think of LTDP?

Slide 23

ARCHIVER

• It is unlikely that this can succeed (in attracting suppliers) without a good understanding of OAIS & TDRs

• This includes agreement on SIPs & DIPs (the conversion to AIPs is up to the supplier)

• We are also likely to insist that suppliers are certified to some agreed standard

• Collaboration with ESRIN will be important to help specify interface

Slide 24

Summary

• 2020 is a key date for many aspects of LTDP

• And the horizon for some non-CERN projects

• There is a lot to do – even without additional H2020 projects or new ideas from FAs

• Some preparation for 2020+ (Run3 and beyond) is now required

On-going certification should help ensure LTDP remains a reality at CERN for decades(?) to come (LHC, HL-LHC, HE-LHC)

Slide 27

F.A.I.R. Data Management

• Increasing emphasis on FAIR DMPs, including preservation, sharing, reproducibility etc.

FAIR now includes also s/w but not yet build systems, verification procedures & environment

• IMHO not yet fully understood (some claim otherwise) – we see (ir)regular changes on how we find data and what protocol(s) we use to access it

• This can be a problem over periods < 1 decade

Only solution we know of: find the effort to migrate (problem for legacy projects / data)

FAIR = Findable, Accessible, Inter-operable, Re-usable

Slide 28

Expert Group on FAIR

Sandra Collins, National Library of Ireland

Françoise Genova, Observatoire Astronomique de Strasbourg

Natalie Harrower, Digital Repository of Ireland

Simon Hodson, CODATA, Chair of the Group

Sarah Jones, Digital Curation Centre, Rapporteur

Leif Laaksonen, CSC-IT Center for Science

Daniel Mietchen, Data Science Institute, Univ. of Virginia

Ruta Petrauskaité, Vytautas Magnus University

Peter Wittenburg, Max Planck Computing & Data Facility

Slide 29

Collaboration

1. Through technology:

• Large Tape Users’ Group

• Invenio Zenodo B2SHARE (INSPIREHEP)

• CVMFS / CernVM

2. Through projects:

• e*, E* and H*

3. Through services:

• Obi-wan Zenodo

• CVMFS repository for “lost” experiments

• Possible hosting of 2PB of BaBar@SLAC data

• 70 TB of OPERA data (CERN “recognised” expt)

4. Through workshops & conferences:

• e.g. EIROForum technical WG on LTDP

• More on Thursday… (?)

BABAR needs Help!

• BABAR data actively being analyzed and high impact papers published (see slide 2). Expect this to continue at least through 2021.

• SLAC management plans to stop hosting BABAR computing in February 2020, at which time the tapes with data will be ejected.

• DOE support ended in 2017, now running on international common funds (OCF).

• Looking for possibility of support and long term data preservation at

– CERN,

– GridKa (BABAR site for analysis and XRootD federated dataset main redirector),

– University of Victoria (BABAR site for analysis, documentation, and tools support).

• BABAR lightweight VMs come with the latest software release and xrootd client included, running under the most common virtual machine players. Just add the data via the GridKa main XRootD redirector.

BABAR in Numbers

• 2PB of data on T10k-D tapes

– raw, processed, Monte Carlo

– Unique dataset at the Y(3S) resonance (no plan (yet?) to run at the Y(3S) @ Belle II)

• Full environment enclosed in VMs (SL5, SL6)

• ~1TB of documentation, repositories, and dataset information (DBs, cvs, wiki, html)

– Internal documents archived on INSPIRE

• 574 papers, ~10 papers/year past 3 years

• 231 members (semi-frozen author list)

– Including PhD students in Canada, Germany, Israel, Italy, Russia, US

– Associated theorists mine data to test new ideas

• ~20 analyses on track, ~10 more in the pipeline

– Continue to have new analyses every year including joint BABAR-Belle analyses

• Students analyze BABAR data while working on Belle II and other experiments in construction/commissioning phase

Slide 31

ISO 16363 certification of CERN

• ISO 16363 follows OAIS breakdown:

3. Organisational Infrastructure;

4. Digital Object Management;

5. Infrastructure and Security Risk Management.

• Many of the elements in 3) and 5) covered by existing (and documented) CERN practices

– Some “weak” areas – being addressed – include disaster preparedness / recovery (together with EIROForum)

On-going “stage 1” external audit to highlight those areas requiring attention

– May just be a question of documentation, e.g. CERN is not going to change its financial practices (MTP etc) as a result of ISO 16363!

Slide 32

Who does it benefit?

• Funding agencies, who can better judge if the money they are providing will be used according to their requirements

– e.g. FAIR DMPs which call for preservation & re-use

• Data users to be able to determine the “trustworthiness” of the data (user surveys)

• Producers (e.g. LHC experiments) to understand how and what a repository does to preserve their data

The data of most CERN experiments already lost!

• By number of experiments, not by volume

• CERN Greybook: 776 completed experiments, ~20 active

• “Preserved”: LEP(4), LHC(4)

O(10) vs O(1000)

Slide 33

HEP Community White Paper

• Focuses on the challenges of the next decade or so (LHC Run3, HL-LHC Run4)

Massive increase in data rates and computational needs – way beyond technology predictions

• “bit preservation with an acceptably low error rate can now be considered a solved problem”

• Main areas of work now:

– Analysis capture (incl. workflows) and reproducibility

– “Open Data” at multi-PB scale and beyond

– Trying to do this in collaboration with others (e.g. RDA)

Does “Open Data” mean zero or low latency?

• People assume so – enormous implications!

Slide 34

Services are (just) services

• No matter how fantastic our { TDRs, PID services, Digital Library, Software repository } etc. are, they are there to support the users

Who have to do the really hard work!

E.g. write the software, documentation, acquire and analyse the data, write the scientific papers

• However, getting the degree of public recognition as at the Higgs discovery day was a target e-KPI!

Computing was thanked in the same way as the LHC & experiments

Slide 35

What is the future?

• Some hope that it may be possible to separate long-term preservation of data at the bit level from domain-specific aspects

• The former could benefit from economies of scale and specialised knowledge in running multi-PB / EB archives

The latter will continue to need expert knowledge to revalidate on a regular basis

• Drive to reduce overhead through "domain protocols" for DMPs

Bottom line: be collaborative to drive down costs

Slide 36

Input to next ESPP

• Certification as a “Very Trustworthy Digital Repository” – exabytes & decades & changes

• Open Data – clarification(s); resources

• Reproducibility & Re-use

Resilience to and handling of change(s)

Slide 37

29 years of LEP – what does it tell us?

► Major migrations are unavoidable but hard to foresee!

► Data is not just “bits”, but also documentation, software + environment + “knowledge”

► “Collective knowledge” particularly hard to capture

► Documentation “refreshed” after 20 years (1995) – now in Digital Library in PDF & PDF/A formats (was Postscript)

► Today’s “Big Data” may become tomorrow’s “peanuts”

► 100TB per LEP experiment: immensely challenging at the time; now “trivial” for both CPU and storage

► With time, hardware costs tend to zero

► O(CHF 1000) per experiment per year for archive storage

► Personnel costs tend to O(1FTE) >> CHF 1000!

► Perhaps as little now as 0.1 – 0.2 FTE per LEP experiment to keep data + s/w alive – (new analyses “cost extra”)

See DPHEP Workshop on “Full Costs of Curation”, January 2014:

https://indico.cern.ch/event/276820/
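
To make the hardware-versus-personnel comparison above concrete, the sketch below works through the arithmetic. The storage figure (O(CHF 1000) per experiment per year) and the 0.1 – 0.2 FTE range come from the slide. The cost of one FTE does not: `ASSUMED_FTE_COST_CHF` is a placeholder chosen purely for illustration and should be replaced with a real figure.

```python
# From the slide: O(CHF 1000) per experiment per year for archive storage.
STORAGE_CHF_PER_YEAR = 1_000

# From the slide: perhaps 0.1 - 0.2 FTE per LEP experiment.
FTE_FRACTIONS = (0.1, 0.2)

# ASSUMPTION, not from the slide: yearly cost of one FTE in CHF.
ASSUMED_FTE_COST_CHF = 150_000


def annual_cost(fte_fraction: float,
                fte_cost: float = ASSUMED_FTE_COST_CHF,
                storage: float = STORAGE_CHF_PER_YEAR) -> dict:
    """Split the yearly cost of keeping one experiment's data + s/w alive."""
    personnel = fte_fraction * fte_cost
    total = personnel + storage
    return {
        "personnel": personnel,
        "storage": storage,
        "total": total,
        "personnel_share": personnel / total,
    }


if __name__ == "__main__":
    for fraction in FTE_FRACTIONS:
        cost = annual_cost(fraction)
        print(f"{fraction} FTE: personnel CHF {cost['personnel']:.0f}, "
              f"storage CHF {cost['storage']:.0f}, "
              f"personnel share {cost['personnel_share']:.0%}")
```

Under this assumed FTE cost, personnel dominates the total even at the low end of the range, which is the point the slide makes with "O(1FTE) >> CHF 1000". The conclusion holds for any FTE cost well above CHF 10,000 per year.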

Slide 38

Conclusions

• We are well on the way to implementing our 2020 vision using “standard” services

– VTDR & PIDs, Digital Libraries & DOIs, s/w preservation

• Services that are – or should be – offered in the EOSC*

But they are not “holistic” – “mind the gap(s)”

• And they will change over time – whatever people (especially in IT) pretend!

• Beware of "grey-backed gorillas"

==> Constant effort is needed – like with a bike

Slide 40

What Makes HEP Different?

• We throw away most of our data before it is even recorded – “triggers”

• Our detectors are relatively stable over long periods of time (years) – not “doubling every 6 or 18 months”

• We make “measurements” – not “observations”

• Our projects typically last for decades – we need to keep data usable during at least this length of time (but not necessarily “forever”)

• We have shared “data behind publications” for more than 30 years… (HEPData)

A. Valassi – Objectivity Migration, 13th January 2014

ODBMS migration – overview (300TB)

• A triple migration!

– Data format and software conversion from Objectivity/DB to Oracle

– Physical media migration from StorageTek 9940A to 9940B tapes

• Took ~1 year to prepare; ~1 year to execute

– Could never have been achieved without extensive system, database and application support!

• Two experiments – many software packages and data sets

– COMPASS raw event data (300 TB): data taking continued after the migration, using the new Oracle software

– HARP raw event data (30 TB), event collections and conditions data: data taking stopped in 2002, no need to port event writing infrastructure

• In both cases, the migration was during the “lifetime” of the experiment

– System integration tests validating read-back from the new storage
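
Read-back validation after a format migration cannot compare raw bytes, because the storage format itself has changed; it has to compare content. The sketch below shows one way to do that under stated assumptions, and is not the procedure used for the Objectivity-to-Oracle migration. It assumes records can be represented as JSON-serialisable dictionaries; the function names and the event keys in the usage example are hypothetical.

```python
import hashlib
import json


def canonical_digest(record: dict) -> str:
    """Digest of a record's content, independent of key order or storage format."""
    payload = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def validate_read_back(source: dict, read_target) -> dict:
    """Compare every source record with what the new storage returns.

    source      -- mapping of record key -> record as read from the old system
    read_target -- callable returning the migrated record for a key, or None
    """
    report = {"ok": [], "missing": [], "mismatch": []}
    for key, record in source.items():
        migrated = read_target(key)
        if migrated is None:
            report["missing"].append(key)
        elif canonical_digest(migrated) != canonical_digest(record):
            report["mismatch"].append(key)
        else:
            report["ok"].append(key)
    return report


if __name__ == "__main__":
    # Hypothetical records standing in for the old and new stores.
    old_store = {
        "evt1": {"run": 1, "hits": [1, 2]},
        "evt2": {"run": 2, "hits": [9]},
        "evt3": {"run": 3, "hits": []},
    }
    new_store = {
        "evt1": {"hits": [1, 2], "run": 1},  # same content, different key order
        "evt2": {"run": 2, "hits": [8]},     # corrupted during conversion
    }
    print(validate_read_back(old_store, new_store.get))
```

Passing a reader callable rather than a second dictionary keeps the check independent of how the new storage is accessed, so the same function can sit in front of a database query or a file reader.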

BABAR Highlights and Press Releases

November 2017

Dataset:

Y(4S): 433/fb

Y(3S): 30/fb

Y(2S): 14/fb

Off resonance: 10%

Y(1S) accessed via Y(2S,3S) → Y(1S) π+π–

June 2017