Gray, N., Carozzi, T.D., and Woan, G. (2012) Managing Research Data: Gravitational Waves. Project Report. University of Glasgow, Glasgow, UK.
Copyright © 2012 The Authors
http://eprints.gla.ac.uk/76815/
Deposited on: 15 March 2013
Enlighten – Research publications by members of the University of Glasgow
http://eprints.gla.ac.uk
ManagingResearchDatainBigScience
NormanGray, TobiaCarozziandGrahamWoanSUPA SchoolofPhysicsandAstronomy, UniversityofGlasgow
2011July14, v1.1
Version v1.1
URL http://purl.org/nxg/projects/mrd-gw/report
Distribution Public
This reportwasprepared as part of theRDMP strandof the JISC programmeManagingResearchData.
Abstract
TheprojectwhichledtothisreportwasfundedbyJISC in2010–2011aspartofits‘ManagingResearchData’programme, toexaminethewayinwhichBigSciencedataismanaged, andproduceanyrecommendationswhichmaybeappropriate.
Bigsciencedataisdifferent: itcomesinlargevolumes, anditissharedandexploitedinwayswhichmaydifferfromotherdisciplines. Thisprojecthasex-ploredthesedifferencesusingasacase-studyGravitationalWavedatageneratedbytheLSC,andhasproducedrecommendationsintendedtobeusefulvariouslytoJISC,thefundingcouncil(STFC) andtheLSC community.
InSect. 1 wedefinewhatwemeanby‘bigscience’, describetheoveralldataculturethere, layingstressonhowitnecessarilyorcontingentlydiffersfromotherdisciplines.
InSect. 2 wediscussthebenefitsofaformaldata-preservationstrategy, andthecasesforopendataandforwell-preserveddatathatfollowfromthat. Thisleadstoourrecommendationsthat, inessence, fundersshouldadoptratherlight-touchprescriptionsregardingdatapreservationplanning: normaldatamanage-mentpractice, intheareasunderstudy, correspondstonotablygoodpracticeinmostotherareas, so that theonlychangewesuggest is tomake thisplan-ningmoreformal, whichmakesitmoreeasilyauditable, andmoreamenabletoconstructivecriticism.
InSect. 3 webrieflydiscusstheLIGO datamanagementplan, andpullto-getherwhateverinformationisavailableontheestimationofdigitalpreservationcosts.
Thereportisinformed, throughout, bytheOAIS referencemodelforanopenarchive. Someofthereport'sfindingsandconclusionsweresummarisedin [1].
Seethedocumenthistoryonpage 38.
Copyright2011–2012, UniversityofGlasgow
School of Physics& Astronomy
Thisreportwaspreparedfor, andfundedby, theRDMP strandoftheJISC pro-gramme ManagingResearchData.
Release: 7b021525c2d7, 2012-07-1709:55+0100
Copyright2011–2012, UniversityofGlasgow. ThisworkislicensedundertheCreativeCommonsAttribution-ShareAlike2.5UK:ScotlandLicence. Toviewacopyofthislicence, visit http://creativecommons.org/licenses/by-sa/2.5/scotland/.
ManagingResearchDatainBigScience
Contents
0 Introduction0.1 ProjectBackground p4
0.2 Howtoreadthisdocument p5
0.3 Workingwithcommunities –pragmatics p5
1 DatamanagementinBigScience1.1 LIGO inperspective: LIGO,bigscience, andastronomy p6
1.2 Datavolumes p6
1.3 Datamanagementstylesinthephysicalsciences p7
1.4 Astronomydata p8
1.4.1 StrasbourgDataCentre(CDS) asadisciplinaryrepository p11
1.4.2 Collaborationsinastronomy p12
1.5 HighEnergyPhysicsdata p13
1.6 Gravitationalwavephysics p14
1.6.1 Gravitationalwaveconsortia p14
1.6.2 GW data p15
1.6.3 Gravitationalwavedatareleases p16
1.6.4 Summary: big-sciencepreservationchallenges p16
1.7 A contrast: socialsciencedata p17
1.8 Babyloniandatamanagement(lesscontrastthanyou'dthink) p18
1.9 Bibliographicrepositories p19
1.10 VirtualObservatories p19
1.11 Dataproductsandproprietaryperiods: reifyingdatamanagementandrelease p20
2 Theresponsibilitiesfordatapreservation2.1 Visualisingbenefits p21
2.2 Thecaseforopendata p22
2.3 Thecasefordatapreservation p24
2.4 Shouldrawdatabepreserved? p25
2.5 OAIS:suitabilityandmotivation p25
2.6 Whatshouldbig-sciencefundersrequire, orprovide? p26
3 Thepracticalitiesofdatapreservation3.1 Modellingthearchive p27
3.1.1 TheOAIS model p27
3.1.2 TheDCC CurationLifecyclemodel p28
3.2 Softwarepreservation p29
3.3 Datamanagementplanning p30
3.3.1 DMP inspace p30
3.3.2 CurrentandfutureDMP intheLSC p30
3.4 Datapreservationcosts p31
3.5 TheGW communityandtheAIDA toolkit p33
4 Conclusionsandrecommendations
A Casestudy
B AIDA assessment
Aboutthisdocument
References
Glossaryandindex
3
Gray, CarozziandWoan
0 Introduction
Astronomyisasoldashumanculture. Earlyagriculturalcivilisationsrequiredreliablepredictionsof thepositionsandmotionsof theSunandMoon, inor-dertopredict inturnseasons, tides, andriverrisings. Evenintheabsenceofanextensivescientificmodel, thesepredictionsreliedoncarefulobservations,preservedintheformofalmanacsorephemerides. Documentssuchastheseassociateastronomywithnotonlythefirstdataarchivesbut, sincetheseartifactsstillexist, alsotheoldestdataarchivesintheworld. Long-termdigitalpreserva-tioninastronomyisnothingnew. Wecannotresistsayingmoreaboutthis, inSect. 1.8.
Astronomicalarchivingdoeshoweverevolve, andinthelastfewdecadesbothastronomyandparticlephysicshavehadtobecomeleadersinlarge-scaledatamanagement.
Althoughastronomicalimages(nowallborndigital)havealwaysbeensub-stantialinsize, theyhavegenerallybeenreasonablymanageable. Newerastro-nomicaltechniques –andwearethinkingof21stcenturyradioastronomyandgravitationalastronomy –arecapableofgeneratingtrulychallengingquantitiesofdata; andparticlephysicshasbeengenerating, andaddressing, intimidatingdataproblems fordecades. Theseproblemscoverboth themanagementandpreservationoflargedatavolumes, astechnicalproblems, andthepreservationofthedata'sinformationcontent, onsubstantiallyvaryingtimescales.
0.1 ProjectBackground
TheManagingResearchData/GravitationalWavesproject (MRD-GW) iscon-cernedwiththedatamanagementarrangementsofthe LIGO ScientificCollabo-ration(LSC),andofthebroader GravitationalWave(GW) community. ItisoneofthesixprojectsintheRDMP strandoftheJISC ManagingResearchData(MRD)programme [2].
TheGW communitywasselectedbythe ScienceandTechnologyFacilitiesCouncil(STFC),atJISC'sinvitation, asarepresentativeexampleofbig-sciencedatamanagementpractice –asweelaboratebelow, ithasfeaturesofboththetraditionalastronomyandHEP communities, withoutbeingidentifiablewithei-therofthem. Whilemanyofthespecifics, below, relatetothiscommunity, webelievemuchofthediscussionisrelevanttotheotherdisciplines. Here, wearefocusingonthebig-scienceprojectswhichreceivestrategicsupportfrom STFC,ratherthanthesmaller-scaleprojectsfundedbyspecificresearchgrants, sinceitistheselarge-scaleprojectsthataredistinctiveaboutSTFC-fundedresearch.Weassumethattheoutputsofthesmallerprojectswillbemanagedthroughdis-ciplinaryrepositories, inamannerwhichmorecloselyresemblesthatofotherresearchcouncils.
TheMRD-GW projectexiststoinformthreesetsofstakeholders:
• Although the Joint Information SystemsCommittee (JISC) and the DigitalCurationCentre(DCC) haveextensiveexperiencewithdigitallibrariesanddigitalcurationingeneral, thereareproblemsspecificto‘bigscience’datawhichJISC wouldliketobetterunderstand.
• TheResearchCouncilshaverecentlystartedtorequirebidderstoincludea‘datamanagementplan’withinprojectproposals. HoweverthereisnoconsensusonwhatsuchaplanshouldlooklikeforsciencefundedbytheSTFC.TheUS NationalScienceFoundation(NSF) hasrecentlyplacedbind-ingrequirementsonprojectstoproducedatamanagementplans [3].
• TheLSC communityhasconsiderableinternalsoftwareandadministrationexperience, andhassolvedalargenumberofdatamanagementproblems
4
ManagingResearchDatainBigScience
focused on large-scale data storage and transport. However there is anawarenessthat(partlybecausetherehavebeennoimmediateimperativestodoso) therewasuntil recentlynopublishedplanfora long-termdataarchive.
Theexistenceof these threegroups is reflected in theoverall structureof thedocument.
Thisproject'scontextalsoincludesthebroad VirtualObservatory(VO)move-ment, whichaimstodevelopstandardsandareasofconsensuswhichhelpsci-entistshavereadyaccesstoastronomicaldataacrosssub-disciplinesandwave-lengths. Allthestakeholdergroupshaveinterestsinthesuccessofthe VO move-ment.
Theprojectaims tobring together twosetsofpractice, namely the long-termdigitalpreservationperspectivesrepresentedbytheOAIS referencemodelintheabstractandthe DCC inparticular, andtheveryconsiderableexperienceofpracticallarge-scaledatamanagement, embeddedwithintheLSC community.
0.2 Howtoreadthisdocument
Thisdocumentisorganisedintothreemainsections, broadlycorrespondingtothethreeaudiencesweareaddressing.
Sect. 1 isaboutdatamanagementin bigscience. Itisaddressedtothe JISCandtothedatapreservationcommunityingeneral, andisintendedtoilluminatetheways inwhich scientists in theseareashavedistinctivedatamanagementrequirements, andadistinctivedataculture, whichcontrastsinformativelywithotherdisciplines.
Sect. 2 isprimarilyaddressedto STFC andothersimilarfundersofthistypeofscience. Itisconcernedwiththeresponsibilitieswhichareimposedonfundersbythewidersociety, andwhicharepassedontothefundedthroughrequirementsonthegovernanceofprojectsandtheavailabilityofdata. Therecommendationshereareconcernedwithhowbesttoexpresstheseresponsibilities.
Finally, Sect. 3 isprimarilyaddressedtothe LSC,asaproxyforsimilarbig-scienceprojects. Theexplicitrecommendationshereareintendedtobeofasmuch interest toprojects, asactions theymaywish to take, as to funders, asbehaviouritmaybeprudentorproductivetorequire.
0.3 Workingwithcommunities –pragmatics
Thisreportistheresultofafruitfulcollaborationwiththe GW community. Itmaybeusefultonotesomeofthefeaturesoftheproject, andthecommunity, whichcontributedtothis.
• Theprojectteam, aspartofGlasgowUniversity, hascurrentinvolvementinthecommunity, andtheprojectdirector(Woan)isaseniorfigurethere.
• The LIGO communityisalreadyawareofthegeneralneedfordatamanage-ment, andthespecificneedforpreservation(see [5]).
• Theprojectpersonnelhaverelevantscientificbackground, andaretosomeextentinthepositionofbeinginformatics-for-astronomyspecialists(iewe're‘insiders’).
• Thecommunityislargeand(viastudiessuchas [6])hassomeexperienceofbeing‘studied’.
• Theexisting LVC workshopseriesmeantthatwecouldcontactrelevantpeo-pleeasilyinacontextwherenewcomerswereexpected, andwedidn'thavetoaddourowndatamanagementworkshop.
ForOAIS,see [4]andSect. 3.1.1; inthisreportthe‘DCC’istheJISCDigitalCurationCentre, nottheLIGO DocumentControlCenter.
5
Gray, CarozziandWoan
http://askanexpert.web.cern.ch/AskAnExpert/en/Accelerators/
LHCgeneral-en.html
Inthecontextoflarger-scaleprojects, a‘sciencerun’isaperiod
whentheequipmentisrunmore-or-lesscontinuously, gatheringscientificallyusefuldata. Betweenscienceruns, theexperimentwill
eitherbedownformaintenance, oronaplanned‘engineeringrun’; datafromengineeringrunsisgenerallystored, butisnotexpectedtobe
usefultoscientists.
ATLAS,oneofthetwolargerLHCdetectors, stores 3 PB yr−1 byitself;see http://atlas.ch/pdf/atlas_
factsheet_4.pdf forsomeentertainingnumbers.
1 DatamanagementinBigScience
1.1 LIGO inperspective: LIGO,bigscience, andastronomy
Whatis‘bigscience’?Big scienceprojects tend to sharemany featureswhichdistinguish them
fromthewaythatexperimentalsciencehasworkedinthepast. Suchprojectsshare(non-independent)featuressuchas:
bigdiscoveries Theseprojectsareexpectedtobeamongstthemostimportantonesoftheirgeneration. Althoughthereisveryhighconfidencethattheirheadlinesciencegoals(forexampletheHiggsandGW searches)willbesuc-cessful, theyarealsoexpectedtoproducelonglistsofunexpectedresults,andabroadrangeofengineeringspinoffs.
bigmoney Thesearedecades-longprojects, supportedbycountry-scalefundersandbillion-Eurobudgets(thetotalbudgetforthe LHC isaroundthreebillionEuros, not including thedetectors, nor thepersonnelandhardwarecostsdirectly supportedbycountry funders, whichcostbetweenoneand twotimesthatsum).
bigauthorlists Theprojectsinvolvecollaborationsofhundredsofpeople(theLSC authorlistrunsataround600people(cf Sect. 1.6.1), andtheLHC'sATLAS detectorauthorlistisaround3000).
bigdata Enhanced- andAdvanced-LIGO (for example)will produceof order1 PB yr−1, comparabletothe ATLAS detector's 10 PB yr−1; theeventualSKAdatavolumeswilldwarfthese.
bigadmin MOUs, councils, workshopseries.
bigcareers Individualsmaymake the journey fromPhD tochairona singleproject.
Thereisadiscussionofthefeaturesof‘bigscience’, and LIGO'sprogresstowardsthatstyleofworking, in [7], withanextendedhistoryofthesub-disciplinein [6].
Becauseofthelargecostsinvolvedandbecausethereisusuallylittleimme-diatecommercialvalueinthisresearch(thoughofcoursetherearesubstantiallong-termeconomicpayoffsfortheinvestingcountries), theselargeprojectsarefundedatthenationalorinternationallevel, sothattaxpayersaretheultimatestakeholders. Evenputtingasidethescientificandscholarlyneedforadequatedatapreservation, thesenationalinvestmentsmakeitnecessaryforfundersbothtodemonstratethatprojectsarebeingefficientlyexploitedtoproducemacro-economicvalue, andtomakethe dataproductsavailableforpublicuse. WediscussopendatainSect. 2.2
1.2 Datavolumes
Themostimmediateproblemwithdatacurationandsharinginthesescientificareas –thoughintheendnotthemostsignificantone –isthedatavolumesin-volved. ThecurrentvolumeofLIGO dataisoftheorderofhundredsofterabytes,andthedataratesisexpectedtogrow, overthecourseoftheproject, fromitscurrent 100 TB yr−1 toaround 1 PB yr−1 (seeTable 1 onthefacingpage, whichshowsthevariationindatasizeforsciencerunsthreetosix).
LIGO isjustoneofseveralotherexistingorplannedbigphysicsprojects, in-cludingthe LHC,the SquareKilometreArray(SKA),andvarious EuropeanSpaceAgency(ESA)/NASA spacemissions. Incomparisonwiththeseprojects, LIGO'sdatahandlingrequirementsarerelativelymodest. TheLHC willhavedatavol-umesoftensof PB yr−1 Furtherinthefuture, theSKA (whichisduetobecom-
6
ManagingResearchDatainBigScience
S3 S4 S5 S6L0 57 32 816 261L1 8.24 4.04 119 76L2 1.55L3 0.97 0.86 9.70 3
(duration/day) 70 29 695 482
Table 1: LIGO datasetsizeestimatesinTB,andrunlengthsindays, forsciencerunsthreetosix(‘Sn’), andvariousdatatypes(sizedatatakenfrom [8]; therewereatotalofsixsciencerunsinLIGO;L0istherun'srawdataset, andL1toL3areprogressivelyreduced).
missionedaround2020)haspredictedrequirementsupto 1 Tbit s−1 locallyand100Gbit s−1 intercontinentally; thisinvolvestransporting, thoughnotnecessar-ilystoring, around 1 TBmin−1 or 0.5 EB yr−1 [9]. Thisis0.05%ofthepredicted1ZB yr−1 total worldwideIP trafficfor2015 [10].
Large-scale physical science experiments have long produced significantdatavolumes, but in recentyearsdatasetsappear tobe increasing involumeandincomplexityatanoverwhelmingrate, andthismaypresentaqualitativelydifferentdatamanagementproblem. Thisissometimesdescribedinratherapoc-alypticterms –asa‘datadeluge’orthelike –andsomeofthechallengesandopportunitiesaredescribedin [11].
1.3 Datamanagementstylesinthephysicalsciences
Itseemsusefultodiscuss, here, someofthedistinctivefeaturesofdatacollec-tionandmanagementintheexperimentalphysicalsciences, sincethesehaveanimpactonboththeexpectationsfor, andtheproblemswith, thedata.
Big-scienceresearchprojectshaveanumberofrelevantcommonfeatures:
Largedatasets Suchprojects'datasetsare‘large’intheobjectivesensethattheprojectsaretypicallysogreedyfordatastorage, thattheirholdingsareneartheedgeofwhatitistechnicallyfeasibletostoreandtransport.
Innovativedatamanagement Aspartoftheresponsetotheirneedforlargedatavolumes, big-scienceprojects are often extremely innovative in their so-lutions todatamanagementproblems, to theextent that theyarewillingtoworkwithexperimentalfilesystemtypes, oradaptandextendoperatingsystemsoftwareornetworktransportprotocols(see http://lcg.web.cern.ch/togetanimpressionofthescopeofdevelopmenteffortshere).
Specialisedsoftware Becausetheinstrumentsandtheirdatasetsaresocompli-cated, theseprojectstypicallygeneratelargecustomdataanalysissoftwaresuites. Thesemayrequirespecialisedandunwrittenknowledgetouse, andthereforeappeartorepresentasignificantsoftwarepreservationchallenge.
Beyond the substantial softwareengineeringchallengesdescribedabove,thephysical sciences tend to have few ‘IT’ problems, since the communitiescontainplentyofpeoplewithsufficienttechnologicalnoustoaddressessentiallyallday-to-daycomputing-relatedproblems, andthesecommunitiesarethereforegenerallyreasonablywell-organisedwithregardtobackups, storage, andbasicfilesharing(seealsothediscussionoftechnologicalreadinessinSect. 3.5). Atthesametime, however, thecommunitiesareratherconservativefromthepointofviewofacomputerscientist, andsometimesratherinformalfromthepointofviewofasoftwareengineer. Thatis, theattitudetocustomcomputingsolutionsisverysimilartotheattitudetocustomlabhardware: itmayneedtobecreative
1 PB is 1000 TB; 1 EB is 1000 PB;1 ZB is 1000 EB; notethattheunit Breferstobytes, notbits
7
Gray, CarozziandWoan
OneoftheDCC researchers,commentingonthisreport, quoteda
GridPP surveyrespondentcommentingthattheLHC
computingtask: “Providingresilientservicesthatmaintainaccesstodata
fortheexperimentusers24/7 –servicesarecomplex, bleedingedge,
andareconstantlybeingupdated.Controllingthatprocess, whilstalsomaintainingserviceup-timeisvery
challenging”.
andexperimental, butnever for its own sake; itmustbe stable, but is neverfrozen; itisaccuratelymade, butrarelypolished. Theanalogywithlabhardwareandsoftwareholdstotheextentthat, intheLHC community, datamanagementgroupsareregardedasdetectorsubsystemgroups; thatis, theyhavethesamegeneralstatusasthemagnetoracceleratorengineers, andexpectedtoproduceagileandinnovativecomputingservicesverydifferent fromthemoreroutine,lab-wide, provisionofCERN IT services.
Theresultofthisisthatlabsoftwarerepresentsfunctionalsolutionstoim-mediate-termproblems, generallywithflexibilityenoughtorespondtomedium-termproblems, butwithoutmuchattentionbeinggiventotheimponderablesofthe long term, after theexperimenthascompleted. It isprecisely these LongTermpreservationquestions, in theOAIS senseofmore thanone technologygeneration, thataretheconcernofthisreport.
There is plentyof prior art in this area. See reference [12] for a reviewofdatamanagementpracticeinavarietyofscholarlyareas, whichadditionallycoversseveralproposedlife-cyclemodels, andanalysistechniques. ThereisasimilaroverviewinthePARSE.Insightcase-studiesreport [13], whichexaminesdatamanagementpractice inHEP, earthobservation, and social scienceandhumanities. Thesecase-studieswereconductedviainterviews, andparticipationinongoingeffortswithinthecommunities. Thesameprojectproducedagapanalysisandroadmap, whichmakevaluablereading.
This isagoodplacetostress that ‘bigscience’generallyhandles itsdatawell, andcanevenberegardedasexemplary(compareSect. 3.5). Thereareafewfeatureswhichnaturallyencouragegooddatamanagementpracticeinthelarge-scalephysicalsciences.
• Theseareoftenrelativelywell-resourcedprojects, withplentyofcomputingexperienceandlotsofengineeringmanagement. Thereislotsofobviousin-frastructureinthedevelopmentofalargecollaborativeexperiment, whichgivesdatamanagementanobviousbudgetaryhome, whereitisnotcom-petingwithfundingwhichdirectlysupportsresearchers.
• AstronomyandHEP projectshavealwaysproduced‘large’datavolumes:thismakes adhoc datamanagementmanifestlyunattractive, andencour-agesexplicitdatamanagementplanninganddiscipline.
• Thescaleoftheseexperimentsmeansthattheytendtobesharedfacilitiesprovidingdocumentedservicestotheirusers, sothatdocumentedinterfacesandSLAsarenatural.
• Theseprojectsrarelyifeverproducecommerciallysensitivedata, sothattheconfidentiality concerns arewell circumscribed, concerningprofessionalpriorityratherthanIPR orotherfinancialworries.
Althoughthesefeaturesaretoagreaterorlesserextentspecifictothistypeofscience, theyhavegivenrisetothenotionsof dataproducts andexplicit propri-etaryperiods, whichwebelievewouldbeusefulinotherareas, andwhichwediscussinSect. 1.11.
Althoughitis GW datawhichisournominalfocusinthisreport, itiscon-venienttofirstdescribegeneralastronomydata, thendistinguishthatfrom HighEnergyPhysics(HEP) data, whichhasasomewhatdifferentdataculture, andthendescribehowtheGW community, whichisinmanywaysintermediatebetweenthetwo, handlesitsdata.
1.4 Astronomydata
Astronomy(excludingGW astronomyfor themoment)hasprobablythemoststraightforwarddatamanagementpracticesinthephysicalsciences. Whenan
8
ManagingResearchDatainBigScience
opticaltelescopetakesanimage(oraspectrum, whichforourpresentpurposesistechnicallyequivalenttoanimage), eitheraspartofasystematicsurveyofthewholeskyorasapointedobservationrequestedbyanastronomer, theimageis typicallymoved fromthe telescope'sdetectorstraight into itsarchive, fromwhereitcanbelaterretrievedbytheastronomer, accompaniedbyautomaticormanually-addedmetadata.
Non-opticalastronomers(coveringtherestofthespectrum, fromradiotogammarays), andmostsatellitemissions, haveasomewhatmorecomplicatedroutefromobservationtoimage, andabroadersetofdataproducts, buthaveessentially thesamemodel, andthesamedisciplineandexpectationsaroundarchives. Fromthepointofviewofdatamanagementtherefore, wecanelidethedifferencesbetweenthevariousbranchesofastronomy. Gravitationalwaveandneutrinoastronomy, incontrast, arenotstudyingtheelectromagneticspectrum,andpartlyasaconsequencetheirstudymorecloselyresemblesparticlephysics(seeSects. 1.5 and 1.6 below).
Most large telescopes, satellites and instruments operate partly or exclu-sivelyaccordingtoamodelinwhichastronomersareawarded‘telescopetime’,rangingfromafewhourstoafewnights, astheresultsofcompetitivebidscloselyanalogoustograntbids. Theresultingdatagenerallyhasaproprietaryperiod,extendingforperhaps12, 18or24monthsafterthedataistaken, duringwhichonlytheobserverwhorequesteditcanretrieveit, butafterwhichitautomati-callybecomesretrievablebyanyone(‘embargo’wouldbeabetterterm, thoughunconventional). Similarly, instrumentsbuiltbyconsortiagenerallyhavepro-prietaryperiodsduringwhichthedataisonlyavailabletoconsortiummembers.Theproprietaryperiodsarepartlyforthebenefitoftheconsortiumindividuals –itistheirrewardfortheinitiativeandpossiblydecadaleffortofbuildingtheinstru-ment –buttheyarealsoapragmaticreflectionofthelengthoftimeitmaytaketocalibrateandvalidateacquireddata, readyfordepositinanopenarchive. Asaresult, thelengthsandtermsofproprietaryperiodsarethesubjectofnegotiationsbetweentheinstrumentbuildersandtheirultimatefunders, thoughthenegotia-tionsarealwaysaboutthelengthofthedelaybeforeageneraldatarelease, andneverquestionthenecessityforthereleaseitself.
NASA missionsnowtypicallyhave12-monthproprietaryperiods, butthishasvariedhistorically, andforexamplethe1990COBE mission, whichincludedsignificanttechnologicalnovelty, andwhoseperformancewasthereforeratherunpredictable, hada36-monthproprietaryperiod.
Notallinstrumentshaveformalreleaseplans, andtheproprietaryperiodsthatexistmaybeadjusted informally. Caltech isoneof the fewprivate insti-tutionswhichisrichenoughtoown, orhaveasignificantsharein, world-classtelescopes(PalomarandKeck). Ithasnodeclaredpolicyondatamanagementordatasharing, beyondabroadtacitexpectationthatdatawillbepublishedasap-propriatefornormalscientificpractice. Asasecondexample, duringthe‘sciencedemonstrationphase’ofthecommissioningoftheHerschel telescope(thatis, thelastcommissioningphase, verifyingthatthesciencegoalswereachievable), theinstrumentteaminvited observerstonominatepartoftheirscheduledobserva-tionstobeperformedearly, duringthisstill-experimentalcommissioningphase.Whentheobservationsprovedsuccessful(astheygenerallywere), theobserv-ingteamsweregiventhechoiceeitherofmakingthedataimmediatelypublic,intimefortheopeningoftheHerschelarchiveandajournalspecialissue, andhavingtheobservationtimere-creditedtothem; orelseretainingthe12monthproprietaryperiod withoutthere-credit.
Imagedataisthearchetypalastronomydata, andisgenerallystoredasfiles,butanotherimportantcategoryistheastronomical‘catalogue’, ofobjectposi-tions, spectraandotherproperties, usuallystoredinrelationaldatabases. Astro-nomicalarchivesrangefromquitesmallones(atoneextreme, asmallspecialisedinstrumentmayhaveits‘archive’consistingofafileserverlookedafterbyagrad-
An‘instrument’inthiscontextisthelight-sensitivedetectorattachedtothetelescopeorsatelliteoptics(thecamera, ineffect). Itisreplaceableorswappable, andregardedasaseparatepieceofengineeringfromthetelescope. Thedayswhenobserverswouldtraveltothetelescopecarryingtheirowninstrumentarenowlargelypast.
See‘HerschelObservers'Manual’§1.1.4, http://herschel.esac.esa.int/Docs/Herschel/html/ch01.html;thankstoHaleyGomezforbringingthistoourattention.
9
Gray, CarozziandWoan
http://surveys.roe.ac.uk/ssa/
uatestudent)toverylargeprofessionallymanagedarchiveswhichareboththeprimarysourcesofsomedatasets, andmirrorsofothers.
Astronomydata ispotentiallyvery long-lived. Althoughastronomersarenaturallydrawntothenewestinstrumentswiththegreatestsensitivity, itisnotunusualtodrawonrelativelyoldarchivedata. Inmostcases, thiswillbestillbeborn-digitaldata, butdigitisedversionsof century-oldastronomicalplatesareusedinpreciseastrometry, andtoidentifytheprecursorsofsupernovaeandotherone-offevents(seeforexampletheEdinburghSSA,furtherdiscussed, withbackground, at [14]; andamorediscursiveaccountofplatescanning, includ-ingdiscussionofsomeofthearchivalchallenges, in [15, 16]). Evenbabylonianandancientchineseastronomicaldata hasbeenusedforcontemporaryscience,helpingmeasuretherateatwhichtheearth'sspinrate, andthusthelengthofday, ischanging [17]; similarly, 3 000-year-oldegyptiandatahasbeenusedtomeasurethechangeintheorbitalbehaviourofthethreestarsintheAlgolsys-tem [18]. Thecosmoschangesslowlyonourtimescales, sothatthegreatmajor-ityofastronomicalobservationsarerepeatable; theexceptionsarethosecaseswherelongtime-basesarenecessary(preciseastrometry)orwheretheobjectofstudyisaone-off, andthereforeunrepeatable, eventsuchasarecentorhistoricalsupernova.
Astronomydataisalsointelligibleinthelongterm: althoughuntranscribedbabyloniantabletscanonlybereadbyspecialists, contemporaryastronomerscanbasicallyunderstandthedatapublishedinKepler's1627 RudolphineTables,andwithsomeassistancecanunderstandthecontentofthe11th-to12th-centuryToledanTables [19]. Althoughbiologistsmightbeabletomakesimilarclaimswithrespectto, forexample, Linnæus'sobservations, itishardtofindequallylong-liveddatainthephysicalsciences, orborn-analoguephysicsdatawherethereisasimilarcontemporarypressurefordigitisation.
Thereisessentiallynofile-formatproblemin(electromagnetic)astronomy,since the Flexible ImageTransport System (FITS) format isuniversal [20, 21].Thoughnotperfect, thisisarelativelysimpleandwell-definedformat, combiningbinaryortabledatawithkeyword-valuemetadata.
Astronomicaldataalsohasawell-developednotionof dataproducts. Thesearedatasetswhichcontain, not rawdata, butdatawhichhasbeenprocessedtoa greateror lesser extent. Wecandistinguish at least three levelsof data inthiscontext; most large instrumentswillhavemore thanone levelofderivedproducts.
Rawdata Thisisthelowest-leveldata, consistingofthedirectoutputofade-tectororotherinstrument, ortherawsatellitetelemetry. Thisdataismademeaningfulonlybyprocessingwithsoftwarewhichistosomeextentspe-cifictotheequipmentinquestion. Thoughitwillbepreservedasamatterofcourse, itisrarelypublished, norusedby, norusefulto, otherresearchers,exceptinunusualcircumstances. Inthecaseofaparticularlysubtleeffect –orlesscommonly, adebateoveratheoreticalanalysisorcalibration –are-searchermightreturntothe rawdata, butthiswillgenerallybedonewiththecollaborationoftheinstrumentscientists, andmaybeotherwiseinfeasi-ble, totheextentthatanyresultsobtainedwithoutsuchinsiderknowledgemightnotbebelievablebythebroadercommunity.
Dataproducts Afteritisgathered, (raw)datamustbeprocessed(‘reduced’)toturn it intoscientificallymeaningfulnumbers (interpretingengineeringortelemetrydatastreams, andcalibration)andtoremovevariousinstrumentalandobservationalartefacts. Dataproductsareusuallymadeavailable instandardformats(inastronomy, generallyFITS files), whereasrawdata, ifitismadeavailableatall, maywellbeinaninstrument-specificform.
Publications Sittingabovethedataproductsisaclassofhigh-leveloutputs, in-
10
ManagingResearchDatainBigScience
cludingscientificpapers, andotherpeer-reviewedoutputssuchaspublishedcatalogues. Journalarticlesarecuratedatpublishersitesandthe Astrophysi-calDataService(ADS),andarticlepreprintsatthe arXiv(cfSect. 1.9). Mod-estvolumesofdatacanbepublishedasdigitalappendicestojournalarti-clesin, forexample, AstronomyandAstrophysicsSupplements; thesearecuratedatthejournalandat VizieR.
Itisthedataproducts whicharetheoutputswhicharesufficientlyfreefromobservationalartefactstobethestartingpointforscientificanalysis(high-levelproductsaresometimesreferredto, informally, as‘sciencedata’), andwhichrep-resenttheclassofdatawhichisnaturallyarchived, mostcarefullydocumented,andwhichwilleventuallybemadepublic. Theremaybemultiplelevelsofdataproducts, withlower-levelproductscarryingmoreinformation, butusingwhichrequiresmoredetailedknowledgeofthesubtletiesoftheinstrumentanditspro-cessing pipeline. ToamuchgreaterextentthanistrueforHEP data, forexample,thehighestlevelastronomydataproductsarebothusefulandgenerallyintelli-gible –everyoneis, afterall, lookingatthesamesky –butresearcherswilloftenuseintermediate-levelproducts, iftheycaninvestthetimetolearnaboutthem,orhavecollaboratorswhohaveexperiencewiththem. Thoseresearcherswhoaremoreintimatelyinvolvedwithaninstrumentwillbecomfortableusinglower-leveldataproducts, becausetheywillhavetheknowledgewhichenablesthemtorun, orexperimentallyre-run, thepipelinesinascientificallymeaningfulway.Thatsaid, inOAIS terms, astronomicaldatacanbecharacterisedashavingabroad DesignatedCommunityandwellunderstood RepresentationInformation.
Publicationsareintheprovinceoflibrariesandsimilarrepositories, andarenotconsideredfurtherinthisreport.
Opticalastronomy(thatis, withobservationsmadeusingvisiblelight)hasthemoststraightforwarddata, sothatthedistinctionbetween rawdataanddataproductsisslighttothepointofbeingratherartificial: astronomersreusingopti-caldatawouldexpecttorecalibratetherawornearlyrawdata, andwouldnotanticipatehavingdifficultydoingso.
Weconcludewithsomeexamples: A typicaltelescopearchiveistheUKIRTarchiveat http://archive.ast.cam.ac.uk/ukirt_arch/; thereareseveralimageandspectrumarchivesattheRoyalObservatoryEdinburgh'sWideFieldArchiveUnit;andthereisalargecollectionof cataloguesavailableat StrasbourgDataCentre(CDS) (seeSect. 1.4.1 below).
The ESA Hipparcosastrometrymission flewbetween1989and1993, andproduced ahigh-precision catalogueof 100 000 stars [22]. The catalogue isavailableonlineasqueriabledatabasesatESA and CDS,asCDs, andasPDFswhichmatchthecatalogue's17-volumeprintedversion. Theprintedversionisaninterestingcase: asdiscussedinthecatalogue(vol. 1, §2.11.3), theprintedpagesaredesignedwithaper-pagechecksum, tohelpwithre-scanningthecat-alogue frompaper, in thepresumed-likely futurecase that thedigital versionbecomesunreadableandonlycopiesofthepaperbooksurvive. TheTychocat-alogue, fromthesamemission, comprisesaround20timesthenumberofstars,atlowerprecision, andisonlyavailableonline.
ThereissomediscussionofpreservationcostsinSect. 3.4.
1.4.1 StrasbourgDataCentre(CDS) asadisciplinaryrepository
CDS isalargedisciplinaryrepositoryforastronomy [23]. Itstoresabroadrangeofcatalogues, ofvarioussizes, inits VizieR service(see[24]and http://vizier.u-strasbg.fr/) and provides a large librarian-curated collection of data from,measurementsof, andreferencesto, individualastronomicalobjects. Itcoop-eratescloselywith ADS.
CDSwascreated, andissupported, bythefrenchagencyinchargeofground-basedastronomy –firstCNRS/INAG thenCNRS/INSU –asajointventurewith
http://www.roe.ac.uk/ifa/wfau/
http://www.esa.int/science/hipparcos
http://www.rssd.esa.int/index.php?project=HIPPARCOS
11
Gray, CarozziandWoan
WearemostgratefultoFrançoiseGenova, ofCDS,forthisdiscussion
ofCDS'shistoryandsupport.
http://www.nottingham.ac.uk/astronomy/UDS/,
http://www.h-atlas.org/,http://hermes.sussex.ac.uk/,http://www.gama-survey.org/
StrasbourgUniversity. Themainsupport is throughpermanentpositions fromtheCNRS/INSU andtheUniversity(researchers, computerengineers, andspe-cialisedlibrarians), withadditionalcontractssupportedbyfundingfromvarioussources.
CDS isadministrativelylocatedwithinaresearchstructure, StrasbourgOb-servatory, providinganactiveresearchenvironmentforCDS astronomers. Thepreservationaspectshaveneverbeenseparatedfromtheprovisionofservicesandthemaintenanceoflocalexpertiseondatamanagementandpreservation.
Thiscanbeseenasanexampleofaverysuccessfuldisciplinaryrepository.Thereappeartobeseveralkeyfeaturesofthissuccess.
• CDS hasestablished, andactivelymaintains, internationalleadershipinthecurationofastronomicaldata, byvirtueofcollaboratingwidelyandinvest-ingeffortinprojects(suchasthe InternationalVirtualObservatoryAlliance(IVOA)) whichsupportandpromotedatasharing.
• Asaresultoftheintimaterelationshipbetweentherepository, theobserva-toryandtheuniversity(totheextentthattheboundariesbetweenthethreecanseemrathervaguetooutsiders), CDS personnelhavepracticalknowl-edgeofhowtheirdataisused, andwhatresearchersneed.
• ThecorefundingforCDS comesfromthefrenchstate, butitisconceivedasaninternationallyvisibleproject.
1.4.2 Collaborationsinastronomy
Themostvisiblecollaborationsinastronomyarelargeterrestrialandsatellite-bornetelescopesandotherinstruments. Attheriskofoversimplifying, thesearegenerallynotthemany-personcollaborationsusualin GW or HEP physics, butareinsteadfacilitiescreatedbyspaceagenciesorconsortiaofnationalfunders.Althoughtheyarehighlyinnovativeleading-edgefacilities, theyarenotseenasexperiments in thesamewayas LIGO or the LHC are (massive) itemsofspe-cialisedhardwarebuilttoansweradelimitedsetofscientificquestions. Theyareinstead observatories: thedatamanagementinthesefacilitiesispartoftheirgeneraloperatinginfrastructure, andtheresearchandresearchdatatheyproduceis‘owned’(atleastinanacademicratherthanalegalsense)bythescientist usersofthefacilities, ratherthanthefacilityitself.
Astronomydoeshoweverhaveavarietyofdata-analysiscollaborations. Thesearesemi-formalcollaborationsconcerned, mostly, withmulti-wavelengthstud-iesofmultiplearchives, and include forexampleUKIDSS-UDS, theHerschelAtlascollaboration, HerMES andGAMA.Thesehavebetween20and60collab-oratingmembersscatteredoverperhapsadozeninstitutionsbut, crucially, no‘corporate’existence, andlittleornodirectfunding. Instead, theyarefundedindirectlyviaindividualfellowshipsorrollinggrants: participationinthecollab-orationmightbeastrongfeatureofagrantapplication, butitisnotthecollabora-tionassuchthatreceivesthedirectsupport. Theydohavegovernancestructures,butthesetendnottobeparticularlyformal, becausetheyremainsmallenoughthatthereislittleperceivedneed. Thesecollaborationsexisttoderivehigh-valuederiveddataproductsfromthelower-leveldataproductsofthearchivestheyareanalysing(forexampleHerschelisan ESA observatorymission: thismeansthatindividualscanbidforobservations, butthatESA doesnothaveitaspartofitsremittoprovidemorethanminimallyreducedsciencedata).
Thecollaborationsdistributetheirresultsinpapers, andassociateddatasets;theytypicallybuildarchivestosupportanddistributetheirwork, butthere'snoexpectation(beyondtheusualcooperativeacademicnorms)thattheywillhelpothersworkonthedata, orreleaseit. Itishardtoseehowtherecouldbesuchanexpectation, muchlessanobligation, sincetheyreceivelittledirectfunding, and
12
ManagingResearchDatainBigScience
theirindirectfundingcomesfromamultinationalsetofentitieswithpotentiallyverydifferent DataManagementandPreservation(DMP) policies.
1.5 HighEnergyPhysicsdata
Astronomyisessentiallyanobservationalscience: telescopes, theiroptics, andthedetectorswhichhangoff them, areconstructed tocreateapath fromna-ture todatawhich isasnearlyaspossibleunmediated. Thismeans that it isbothreasonablyobviouswhatthingsaretobearchived, andthatthenatureandprocessingofobservationalartefactsarewellandcommonlyunderstood. Thismeansthatastronomy, unusualinthephysicalsciencesforneedingtopreservedatalong-term, isinthehappypositionofhavingitsdatareadilypreservable.
HEP dataisdifferent. HEP isaparticipativescience, whereobjectsranginginsizefromelectronsallthewayuptonucleiaredisassembled, anddataaboutthemessyresultsofthisdisassemblyisexaminedtoretrieveinformationabouttheinteriorstructureoftheoriginal. Thisreconstructionfromcollisiondatade-pendsonashiftingengineeringunderstandingofriversofdata, outofinstrumentswhichareone-offworksofart, designedandassembledbyathousand-strongcommunity, close-packedintoadetectorthesizeofasmallcathedral, attachedtoamachinewithitsownpostcode.
TheresultofthisisthatHEP dataanalysisisrathertricky, withmanystepsbetweendataandscience, eachofwhichdependsonsoftwarewhichencodesadetailedunderstandingofthedata'sprovenance. Inconsequence, althoughHEP dataistypicallydistributedwithmultiplelevelsofreduction, almostnoneoftheselevels(withtheexceptionofformalpublications)arestraightforwardlysuitableforlong-termpreservation. Thisisbecauseinterpretationofthisdataisheavilydependentonsoftware, theuseofwhichrequiresdetailedexperimentalknowledgewhichitmaybeinfeasibletopreserve. InOAIS terms, thedesignatedcommunityistinybecausethe RepresentationInformationishugelycomplex.
Inadditiontothis, HEP datahasaconsiderablyshortershelf-lifethanas-tronomydata, asdiscussedabove. Incontrast, oldHEP dataistypicallymaderedundantbynewdata, obtainedfrommorepowerfulaccelerators. Alsoincon-trasttoastronomydata, HEP dataisnotexpectedtobegenerallyintelligibleforverylong: two-orthree-decadeolddatamightpotentiallybeusefulorintelli-gible, butmuchbeyondthatwouldcountasarchaeology. Attheriskofbeingwhimsical, wecancomparetheroughlymillenniallifespanofastronomicaldatawiththeroughlythree-decadelifespanofHEP data, andconcludethatthelattergoes‘off’about30timesfasterthantheformer. Althoughfacilitiesmakeveryconsiderableeffortstomanagedatasafelywhileanexperimentisrunning, thereislittlerealpressuretopreserveHEP dataintothelongterm.
Ofcourse, thingsarenotquiteasstraightforwardasthatinfact. (i) The LHCgainsinteractionenergyattheexpenseofamessiercollision, sotherearepoten-tiallysomefeaturesthatwillbedetectableinonedataset(forexamplethe HERAp-edata)whichwouldnotbefindableintheLHC.Whileinteractionenergyisthemostprominentmetricofanaccelerator'sperformance, itisnottheonlyone,sothatlargeracceleratorswillnotrendersmalleronesobsoleteasinevitablyaswemayhavesuggestedabove. Similarlytothis, (ii) datareductionerrorsmaybedominatedbytheoreticaluncertaintiesratherthanexperimentalones, andthesewillonlybe improved, and thedata re-reduced, after theexperiment isover.Finally(iii) therearenoacceleratorsbiggerthantheLHC currentlyscheduled,sothatthisdatasetmayremainthehighest-energyoneforarelativelylongtime.Thearchaeologyisillustratedin [25]andtheproblemfurtherexploredin [26],whichalsodiscussestheHEP community'sdevelopingplansfordatapreserva-tion. Qualificationsnotwithstanding, theoveralltimescalesinHEP areshorterthaninastronomy, andthesolutionsdescribedin [26]areconcernedwithpro-longingacontinuouslow-levelrelationshipwithadatasetratherthanbeingable
13
Gray, CarozziandWoan
http://www.pegasus.lse.ac.uk/research.htm andinparticular[28]
TheMOU whichcreatedtheLVC isat [31], butMOUsarenotroutinely
madepublic.
ThedefinitionofLSC membershipisincludedin [32]andthe
constructionoftheauthorlistin [33].
Theterm‘LVC’isnotaninitialism. Itcolloquiallyreferstothe
data-sharingagreement [31]andjointmeetingsbetweentheLSC and
theVirgoCollaboration. Thoughthereare‘LSC/Virgocollaborationgroups’, thereisnoformalbig-C
Collaboration.
toreturntoadatasetcold.Unlikeastronomy, HEP has for the last fewdecadesbeenorganised into
largerandlargercollaborations, andthesecollaborationshavedevelopedintri-cate, andsociallyfascinating, culturesformanagingthis. Thetwolargerinstru-mentsattheLHC, ATLAS and CompactMuonSolenoid(CMS),eachhaveauthorlistsoforder3 000people, sothatthevariousCERN collaborationsaccountforaround10 000research-activeindividuals. ThereisextensivediscussionofthehistoryandstructureoftheLHC collaborationsin [27]andintheoutputsofthePEGASUS project, butmanyofthecollaborations'relevantorganisationalfeaturesareechoedinthe GW community: thisisdiscussedinSect. 1.6.1 andwedonotdiscussthemhere.
1.6 Gravitationalwavephysics
Thegravitationalwavecommunityhasastronomicalgoals, butinthescaleoftheLIGO project, andintheamountofnoveltechnologyinvolved, aswellasinthefactthatmanyofthepersonnelinvolvedcameoriginallyfromaHEP background,theproject'sculturemorecloselyresemblesthatofaHEP experimentthanofanastronomicaltelescope. WediscusssomespecificfeaturesofLIGO datain [29];herewediscusswhere GW data, andthediscipline'sorganisationstructure, fitsonthespectrumbetweenastronomicalandHEP data.
1.6.1 Gravitationalwaveconsortia
TherearethreeprincipalsourcesofrecentGW dataavailabletoUK researchers:LIGO,GEO600andVirgo. Thereareotherdetectorswhichareeithersmallerefforts(intermsofconsortiumsizes), whichhavestoppedtakingdata(TAMA-300), orwhicharestillattheplanningstage. See [30]foranoverviewofcurrentdetectors, andofdetectorphysics.
LIGO LabisacollaborationbetweenCaltechandMIT,whichdesignsandrunsthreeinterferometersinHanford, WA,andLivingston, LA,intheUS. GEOisaGerman/Britishcollaboration, whichrunsthe GEO600interferometer. ThethreeLIGO interferometerswereshutdowninOctober2010torefitfor AdvancedLIGO (aLIGO);theGEO600interferometerisstillcurrentlyrunning. The LSC istheresultofanetworkof MemorandaofUnderstandingbetweenLIGO Lab(ormorelooselytheLSC) andmultipleotherinstitutionsofvarioussize. Thesere-lationshipsinvolvehardware, resources, anddataaccessofvarioustypes. Mosttypically, theresourcesinquestionarepersonnel, andaninstitutionsuchasauniversityphysicsdepartment, whichwishesaccesstoLIGO data, willcontributeinreturnfractionsofstafffrompermanentstaff, throughpost-docs, toPhD stu-dents, forabroadspectrumofactivitiesincludingdataanalysis, instrumentfabri-cationandshift-workinthedetectorcontrolroom. Howeverinsomecases, theMOUsareconcernedwithdataswaps, andsetuplimiteddatareleaseswithotherscientists: forexample, thereareafewMOUsbetweentheLSC,Virgoandotherobservatories, whichdescribewhatdataistobeshared, inwhatvolumes, andtheoutlineauthorshiparrangementsforanysubsequentpapers. GEO'sMOUdescribesaparticularlycloserelationshipwithLIGO Lab, butmostoftheMOUsarebroadly similar toeachother, and theprocessof creatingone isbynowstreamlined. In total (asof June2010), theLSC consistsofa littleover1300‘members’; ofthese, 615spendmorethan50%oftheirtimededicatedtotheprojectandsohaveaplaceontheLSC authorlist.
Theterm‘LIGO’hasanumberofnotquiteequivalentmeanings: sometimesitreferstoLIGO Lab, sometimestoLIGO LabplustheLSC,andthephrase‘LIGOdetectors’isgenerallyunderstoodtorefertotheLIGO LabandGEO detectors.
TheItalian/French Virgoconsortiumhasitsowndetectorandanalysis pipeline,andhasadata-sharingagreementwiththeLSC,representedbythe LVC.Aswith
14
ManagingResearchDatainBigScience
LSC
otherinstitutions
Virgo
LVC
UK DEUSA (NSF) FR IT other EU
$ £ € €
LIGO Lab GEO
Figure 1: TherelationshipsbetweenvariousGW consortia.
LIGO,theVirgodetectorwillshutdownbetween2011androughly2015. Virgohas246members(withaslightlydifferentdefinitionfromtheLSC),andGEO600around100.
ThereisanattempttosummarisetheserelationshipsinFig. 1.Theseexperimentshaveacommonpurpose: theyexisttodetectsignatures
ofgravitationalwaves, whichareconfidentlypredictedbytheGeneralTheoryofRelativity, buttheactualobservationofwhichwouldbeamajorscientificevent(thereexistsanLSC dataprocessingflowchartwhichincludesthenotentirelyseriousbranch“CallStockholm!”).
Gravitationalwavesaresufficientlyweak, however, thattheexistingequip-mentwillnotbecomesensitiveenoughtohaveagoodchanceofdetectingthemuntilafteritsrefit, whichbeganinlate2010(whentheprojectenteredthephaseknownas aLIGO),andwhichisscheduledtobecompletedwhenthenewde-tectorsarecommissionedin2015.
1.6.2 GW data
Althoughtheconsortiahave(asexpected)announcednodetectionsofar, theynonethelessproducealargevolumeofauxiliarydata, representingbackgroundandcalibrationsignalsofvarioustypes, andthis, togetherwiththecoredata,means that theLSC collectivelyproducesdataata rateofapproximatelyonePB yr−1.
WecanreadilyidentifythelevelsofdatawhichwerediscussedinSect. 1.4:
Rawdata Thelowest-levelGW dataconsistsofthesignalsfromthecoredetec-tors. Thisdataismademeaningfulonlybyprocessingwithsoftwarewhichiscompletelyspecifictothedetectorsinquestion. Thisisstoredin‘frameformat’, which isaverysimple format intelligible toall theprimarydataanalysissoftwareinthecommunity, andwhichismultiplyreplicatedacrossNorthAmerica, EuropeandAustralia. Althoughthediskformatiscommon,thesemanticcontentoftherawdataisspecifictodetectorsandsoftware, sothatpreservingitlong-termwouldrepresentasignificantcurationchallenge.
Dataproducts Therawdataisprocessedintocalibrated‘straindata’, whichisthedatachannel inwhichaGW signalwill eventuallybe found (this ispossibly, butnotnecessarily, alsoheldinframeformat). Thisistheclassofdataproducts whichwilleventuallybemadepublic. Unusually, itturnsoutthatGW rawdataisinasemi-standardformat, andthedataproductsarespecifictotheanalysis pipelinewhichproducedthem.
Publications Sittingabovethedataproductsisaclassofhigh-leveldataprod-ucts, scientificpapers, andotherpeer-reviewedoutputs. TheGW projectshaveannouncednodetectionsofgravitationalwaves, buthavenonetheless
15
Gray, CarozziandWoan
http://www.ligo.org/about/join.php
producedabroadrangeofastrophysicallysignificantnegativeresults [30,§6.2].
AswiththegeneralastronomydataproductsdiscussedinSect. 1.4, thedis-tinctionbetweenthe‘rawdata’ andthe‘dataproducts’ isthatthelatterdatasets,alongsidetheirsupportingdocumentation, willbeavailableforuseandreusebyscientistswhodonothaveanintimateconnectionwith, andknowledgeof, theinstrument.
Boththe‘dataproduct’and‘publication’groupsarebroadclassesofobjects.Thepracticalboundarybetweenthemisclear, however: whatwearecalling‘publications’areentitiessuchasjournalarticlesorderivedcatalogueswhoselong-termcurationisnottheresponsibilityoftheLSC dataarchive, thoughtheymaybeheldinsomeseparateLSC paperarchive, whichisassuchoutofscopeforthisproject.
1.6.3 Gravitationalwavedatareleases
BecausetheLSC hasnotannouncedthedetectionofanysignalsofar, andbe-causethedatawillremainproprietarytotheconsortiumuntilwellaftersuchanannouncement, therearenodistributeddataproductssofar, andsotheissuessurroundingformatsanddocumentationhavenotyetbeenaddressed. Howeveritistheeventualpublicdataproductswhicharethehighest-valueoutputsfromtheexperiment, andwhicharetheproductswhichitwillbemostimportanttoarchiveindefinitely.
Atpresent, LIGO dataisavailableonlytomembersofthe LSC.This isanopencollaboration, andresearchgroupswhichjointheLSC haveaccesstoallof the LIGO data. Inreturn, theycontributepersonnel to theproject (includ-ingforexamplepeopletodoshift-workmanningthedetectors), andacceptthecollaboration'spublicationpolicies, which require thatallpublicationsbasedonLIGO dataarereviewedbytheentirecollaboration, andcarrythecomplete800-personauthorlist. Atpresent, andinthefuture, datawhichisreferredtobyanLSC publicationismadepubliclyavailable. SeeSect. 3.3.2 forfurtherdetailsonLIGO's DMP plan.
1.6.4 Summary: big-sciencepreservationchallenges
Inthethreesectionsabove, wehavetriedtodescribebothdifferencesandcom-monalitiesbetweenthreelarge-scalescientificdisciplines. Possiblythebiggestdifferencebetweenthethreeareasisthathigh-levelastronomicaldataproductsaremuchmoregenerallyintelligiblethaneventhehighest-levelHEP products. Ineachcase, however, wehavealadderofreasonablywell-defineddataproducts,witheachrunggeneratedfromtheloweronesbysophisticateddatareductionpipelines.
Thesituationisnotasrosy, fromthepointofviewoflong-termpreservation,asthisaccountmaysuggest. Becausethepipelineshavedevelopedorganicallyoveranumberofyears, undertheinfluenceofexperiencewithearlierversionsandincreasedunderstandingoftheinstrument, theknowledgetheyrepresentissometimesencodedwithintheminalessstructuredwaythanwouldbedesirable.Sometimes, metadataisencodedinfilenames, orinconfigurationfiles, orwikis,orevenprivateemails. Ofcourse, onecouldsimplyarguethatthisinformationshouldbedocumentedbetter, butitwouldbehardtoarguethatthecostsofthisworkwouldbejustifiable, toserviceafuturetheoreticalneedthatfewbelievewouldevenbecomeanactualone. Inconsequence, althoughtheresultingdataproductwillberegardedasperfectlyreliable, itmaybeinfeasibletoredotheanalysisotherthanbypreservingandrerunningthepipelinesoftware(evenifitwerefeasible, itwouldbeprohibitivelyexpensive, andrarelyseenasvaluable;
16
ManagingResearchDatainBigScience
seealsoSect. 2.4). Forthisreason, softwarepreservation hassomeroleintheoveralldatapreservationstrategy. Howeverit isnotcleartouswhatthisroleshouldbe, andthethornyissueofsoftwarepreservation isaddressedatgreaterlengthinSect. 3.2.
1.7 A contrast: socialsciencedata
It is possibly instructive to contrast thedatamanagement practices discussedhere, withtheverydifferentproblemsfacedbydatamanagersinthesocialsci-ences. In [34], theauthorssurveyanumberofsocialscienceprojects, withaparticularfocusontwolarge(forthesocialsciences)programmesfundedbytheEconomicandSocialResearchCouncil(ESRC) (theUK socialscienceresearchcouncil)withsubstantialresponsibilitiesfordatapreservationandsharing.
For theESRC projects, theartefactsbeingstoredaresimplethings, at thelevelof ContentInformation: theyareconventionalWorddocumentsandaudiofiles, ratherthantheheavilystructuredandstillsomewhatexperimentalbigsci-encedataobjects. TheESRC archivecontentswillremainbroadlyintelligibletofutureresearchers, withoutmucharchive-specificefforttodefine RepresentationInformationora DesignatedCommunity. Incontrasttothissimplicity, however,theESRC archiveshavetocopewithabroadrangeofassociatedcontextualis-ingmetadata, whichisdifferentfordifferentprojects, andinconsistentlyorin-completelyspecifiedbytheoriginatingresearchers, perhapsasanafterthought.Thismakesarchiveingestacomplicatedproblem, incontrasttothebigsciencecases, wherearchiveingest fundamentallyinvolveslittlemorethancopyingaself-containedsetofartefactsfromworkingstoragetosomepreservationstore.Inparticular, theESRC projectshaveacomplicatedsetofanxietiesaboutcopy-right, IPR,confidentiality, anonymizationandconsent; while LIGO caresintri-catelyaboutdataaccessandsecurity, itdoessointheratherformalcontextofprofessionalethicsratherthanfamilysecrets.
Thisillustratestwofurthernotabledifferencesbetweenphysicalsciencedataandthatofsocialscienceorbroaderarchivalresources.
Firstly, theresponsibilityforESRC datainpracticelieswithmorejuniorre-searchers, helpedbypart-fundedarchivists [34, §§5.2.1&5.4]. Forbigscienceprojects, itisfundersandseniorcollaborationmemberswhodrivethepreserva-tionefforts.
Secondly, essentiallyallphysicsdataisborndigitalandcomplete, mean-ingthatalloftheinformationtobearchivedispresentatthetimeofdeposit. Ofcourse, thisisnotcompletefromthepointofviewofreproducibility(thatrequiresjournalarticlesandpersonalknowledge)anddoesnotdiscountthesubsequentadditionof subjectivemetadataasfindingaids, but it iscompletely specifiedfromthepointofviewofconventional futureanalysis. Thedistinctionis thatexperimentaldataisacompleteandobjectiveaccountofeverythingthatwasbelievedtoberelevantinrecordingaphysicaleventwhichhappenedataspe-cifictime. Onecandisagreewiththeexperimenters'beliefsaboutcompleteness(thisshadesintoquestionsofreproducibilityandtacitknowledge), complainthatsomedetailsmightberecordedinnotebooksratherthandigitalrecords(moretrueoflab-scalethanfacility-scaleexperiments), orinextremecasesargueaboutthenatureofobjectivity, butanaturalscienceexperimenthasamuchclearerboundary, inspace, timeanddocumentaryextent, andsoamorenaturalexpec-tationofdocumentarycompleteness, thanwillbeusualforanexperimentinthesocialorhumansciences. Thisisdifferentfromthetraditionalarchiveproblem,wheretheproblemsofinterpretationaremorevisibleandacknowledged, andtheproblemofincompletenessmoreevident.
ThesummaryisnotthattheESRC orthebigsciencearchiveshaveaneasierjoboverall, butthatthecomplicationsexpressthemselvesindifferentpartsofthemappingfromOAIS abstractionstolocalfact. Bigsciencearchivesmustpreserve
Thisworkwaspartofthe‘DataManagementPlanningforESRCResearchData-RichInvestments’project(DMP-ESRC)(http://www.data-archive.ac.uk/create-manage/projects/JISC-DMP),fundedbyJISC,likethepresentproject, aspartoftheManagingResearchDataprogramme.
Foravividandilluminatingdiscussionofthecomplicationsandphysicalityofreproducingexperiments, see [35]andreferencestherein(bycoincidence, thisdescribesobservationsamongstgravitationalwaveexperimentersinGlasgow); thatdiscussionisreprisedinalargercontextin[6, ch.35]. Thequestionoftacitknowledgeisdiscussedatlengthin [36]. Foradiscussionofdifferenttypesofreuse,see [37, §3].
17
Gray, CarozziandWoan
Figure 2: Calculatedephemeris for theperiod104 BCE March23to101 BCE
April18, writtenonSeleucidyear209, monthIX,day18(103 BCE December20?). ComparisonwithaJPL ephemerisshowsthatthetextconjunctiontimesremainwithinacoupleofhoursofthecorrectvalues, withanoffsetattributabletoanerrorintheinitialvalue. Fordetaileddiscussion, see[40]. BritishMuseumitemSp-II.52, ©TrusteesoftheBritishMuseum.
See[38]forbackgroundandfurtherreferences, and[39, ch.4]forverydetaileddiscussionofthephysical
tablets. Theprecisedateoftheobservationsisofconsiderable
scholarlyinterest, sinceanagreeddatewouldprovideanabsolutefix
fortheotherwiserelativechronologyoftheLateBronzeAgeNearEast.
largecomplicatedobjectsforahard-to-describe DesignatedCommunity, butbe-causetheyareessentiallyalwaysproject-specificarchives, theirimplementationdoesnothavetobegeneric, andmanyoftheingestionissuescanbebakedintotheoriginalarchivedesign.
1.8 Babyloniandatamanagement(lesscontrastthanyou'dthink)
Contemporaryastronomybegan, inthewest, inMesopotamiainthefifthandfourthcenturies BCE. Althoughearlierdatasetsexist – the Venus tablet ofAm-misaduqa isaclusterof7th C BCE copiesof17thC or16thC datarecordingtherisetimesofVenusovera21yearperiod –theseearlieromentextsseemtohavebeenpreservedforlargelyculturalreasons.
Distinctfromthese, thereisalargesetof4–500othertexts, rangingfrom4thC BCE to75 CEwithasmatteringgoingbackasfarasmid-8thC BCE,andspan-ningthedevelopmentofBabyloniantheoreticalastronomyduringthe4thC BCE.Theseareamixtureofobservations, calculatedephemerides (suchasFig. 2),andtelegraphicallyobscuretechnicaldocumentation. Theobservationtexts –‘astronomicaldiaries’, formingthemajorityofthetexts –describeinsequencecelestialandmeterologicalobservations, dailycommodityprices, river levels,andtopicalevents. TheobservationsoftheSun, Moonandplanetswereofgoodenoughquality, andpreservedoveralongenoughtime, thatwhenbabylonianmathematicalmodelswerefittedtothemtheyproducedvaluesforthesynodicandanomalisticmonthsand(implicitly)theorbitalperiodsoftheplanets, whichareveryrespectablyclosetotheircurrently-determinedvalues(outbyafactorof3× 10−7, inthecaseofthesynodicmonth). Thesewereusedtopredictthefirstandlastappearancesofplanets, andthetimesoflunar(butnotsolar)eclipses.
The information in these texts issometimesavailableonmultiple tablets,althoughitisnotclearwhethertheseduplicateswerebackups, mirrors, ormediarefreshes. Manytabletshaveacquisitionmetadata, addedininkbythearchives,millenniaapart, inBabylonandBloomsbury.
Itisclearthatthetabletsthathavesurvivedrepresentonlyasmallfractionofthetotal. butboththedata, andthemathematicaltechnologythatreducedthedataandgeneratedtheephemerides, wereavailableandfullyintelligibleto
18
ManagingResearchDatainBigScience
Hipparchus (c 150 BCE) and, eitherviahimordirectly, toPtolemy (c 150 CE).TheBabylonDataCentrewasstillactiveinthefirstcentury CE,thoughfundingcutsmeantnewacquisitionswerebythenminimal, anditwasoperatinginthecollapsingruinsofthedesertcity.
The Content Information in the texts is sufficientlywell preserved that ifthetextscanbedatedatall(insomecasesthroughcontemporaryingestmeta-data), theycangenerallybedatedtotheveryday; thetechnical RepresentationInformation, incontrast, issoterseas tomakesenseonlyafter theprocedurebeingdocumentedisreconstructedfromtheContent. Thecuneiformpresentsachallenge, butoncethishasbeentransliterated, thedatasetsarefundamen-tally intelligible tocurrent astronomers. Thepreservation strategy is adaringone: byeffectivelyfoundingwesternastronomy, andarrangingthatthedatawaspreservedjustlongenoughthatitcouldbetakenoverbythe(hellenic)succes-sorcivilisation, thebabyloniansensuredthattheircoordinatesystem(basedonthezodiac)andnumbersystem(withanglesindegrees, subdividedintobase-60fractions)wouldstillbeinusebyastronomers25centurieslater.
1.9 Bibliographicrepositories
Thoughitisnotstrictlydata, itseemsusefultomakeparentheticalmentionofthebigsciencecommunities'literaturerepositories, sincetheyseemtoillustratethewayinwhichthecommunitieshavelearnedtoactcollectively.
Thepreprintarchiveat arXiv.org startedin1991asanelectronicversionofthelong-establishedpracticeofdistributingpreprintsofacceptedjournalar-ticlesaroundthehigh-energyphysicscommunity, bypost. Itcurrentlyreceivesaround6000submissionspermonth, predominantly inHEP,astronomy, con-densedmatterphysicsandmathematics; itprobablyreceivescopiesofnearly100%oftheHEP community'soutput. Authorsmosttypicallysubmitpapersatthepointwhentheyhavebeenacceptedbythejournal, butsomesubmitearlierversions, andafewarenotfurtherpublishedatall. Althoughthejournalsarestillprovidingan imprimatur, manypapersarenowprincipallyreadaspreprints, andmanyjournalspermitcitationsbyarXivreference. ArXivissupportedbyrequest-ingcontributionsfromitsheaviestinstitutionalusers, onaslidingscalerisingto$4 000/year. JISC Collectionsisoneofthese‘tier 1’supporters, onbehalfofUKcollegesanduniversities.
TheNASA ADS attheSmithsonianAstrophysicalObservatorypreservesbib-liographicinformationfortheastronomyliterature, holdsreferencestoorcopiesofjournalarticlefulltexts, andcuratesdigitisedcopiesofolderarticlessometimesunavailablefrompublishers. ItalsocurateslinksbetweenthesepublicationsandthearXiv, andbetweenpublicationsanddata. See [41]forcontext, andsomediscussionofthearXivnumbersmentionedabove.
ThepublicationparadigmrepresentedbyarXiv (andsimilarsmaller-scaleefforts) isunderpinnedby thepeer reviewprocessesof journals. Howeverasjournalsubscriptioncostsrise, journalsareprogressivelycancelled, inaprocesswhichmayultimately damage the reviewingprocess onwhich theparadigmdepends. TheSCOAP3 consortiumaimstobreakoutof thiscyclebydirectlysupportingasmallnumberofHEP journals, throughalevyonthefundingagen-cieswhichsupportthefield, inproportiontotheshareofHEP publishingtheysupport. Inreturnforthisthejournalswillremovebothsubscriptionchargesandpagechargesforthesejournals.
1.10 VirtualObservatories
A VirtualObservatory isanastronomicaldata-sharingsystem, composedofanetworkofarchivesanddata-accessprotocols. Thegoalisthatthedataappearstobeintegratedandideallyappearstobelocal.
Thiscanbeclassedasa‘high-risk’datapreservationstrategy, andisnotincludedamongstthisreport'sRecommendationstoSTFC.
http://arxiv.org/Stats/hcamonthly.html
http://arxiv.org/help/support/whitepaper
http://scoap3.org/about.html
19
Gray, CarozziandWoan
http://www.astrogrid.org,http://www.usvao.org/, and
http://www.euro-vo.org; plushttp://www.ivoa.net.
Seefurthercommentaryinhttp://lwsde.gsfc.nasa.gov/VxO_
Report_Decadal_Survey_5_2011.pdf
http://www.helio-vo.eu andhttp://lwsde.gsfc.nasa.gov/
http://www.roe.ac.uk/ifa/wfau/
Theearliest VOswereAstrogrid in theUK, theUS-VO in theUS (whichbecameNVO andthenVAO),andtheAstrophysicalVirtualObservatoryinEu-rope(whichbecameEuro-VO).They, alongwithagrowingcollectionofsmallernationalorregionalVOs, formedthe IVOA in2002. TheIVOA existstobrokerportablenetworkprotocolsforsharingdata, onthepartofcooperatingarchives,andaccessingit, onthepartofclientapplications. TheIVOA focusesprimarilyon‘traditional’astronomy, andsohaspoorcoverageofsolarphysicsandmorebroadlygeophysics(andcertainlyprovidesnoaccesstoGW data).
Fromthishasgrownthemoregeneralnotionof the ‘VxO’, whichis“[a]servicethatensuresthatallresourcesfromsub-field x areknown, discoverable,andeasilyaccessible. Itlookstotheuserlikeauniformdataprovider, butitisvirtual.”ExamplesincludetheVirtualSolar-TerrestrialObservatory [42], HELIO,andNASA'sHeliophysicsDataEnvironment.
1.11 Dataproductsandproprietaryperiods: reifyingdataman-agementandrelease
A commonfeatureofthevariousdatastylesaboveisthenotionofthe dataprod-uct, anditseemsusefultorecapandstressthesalientfeaturesofthishere.
Dataproducts: A dataproductisadesignedanddocumentedoutputofaninstrument, intendedtobebotharchivableandimmediatelyusefultootherresearchers, byvirtueofhavingobservationalartefactsremovedasmuchaspossible.
Dependingonthedisciplineandtheengineeringcomplexityoftheinstrument,dataproductsmaybeanythingfromtherawdatatoahighlyprocessedderivativeof the rawdata; the idealdataproductcontainsall the scientifically relevantinformationwithnoneoftheexperimentalartefacts.
Researchersarenotrestrictedtousingonlydataproducts, butitwillonlyrarelybenecessaryforthemtoresorttoreanalysingrawdata(seethediscussiononp.10).
Dataproductscorrespondcloselytothe‘InformationPackages’oftheOAISmodel(seeSect. 3.1.1). Inourexperience, theretendstobelittlepracticaldiffer-encebetween SubmissionInformationPackages(SIPs)and ArchivalInformationPackages(AIPs), andwheretherearedistinct DisseminationInformationPack-ages(DIPs), theytendtobeavailableinadditiontotheavailableSIPsandAIPs.AnexceptiontothisisarchivessuchastheWide-FieldAstronomyUnitatEdin-burgh, whichspecialisesinastronomicalsurveyscience, anddevelopsenhancedarchives(whichistosay, value-addedAIPs)aspartofitsparticipationincollab-orativeastronomyprojects.
WhenthevariousPackagesdiffer, theytendtoberegardedassuccessivelyhigher-level, asopposedtoalternative, dataproducts.
Thenotionofdataproductshasanumberofconcreteadvantages.
• Mostimmediately, theexistenceofastableanddocumentedoutputmakesiteasierforresearcherstouseandrepurposeexperimentalandobservationalresults.
• Becausetheproductsaresocentraltoaninstrument'soutput, they, andthepipelinesthatproducethem, aredesignedandcostedatearlystagesofaninstrument'sproduction.
• Researcherscanproduceandsharesoftwarewhichprocesseswell-definedproducts, possiblyfrommorethanoneinstrument.
• Because theyaresoexplicit, they formwell-definedstartandendpointsofdiscussionsaboutinteroperabilitybetweeninstruments. Indeed, the VO
20
ManagingResearchDatainBigScience
programmecouldbecharacterisedasanextendedefforttonegotiatenewcommonproductswhicharchives and softwaredevelopers agreecanbesuccessfullygenerated(byarchives)fromexistingAIPs.
There isofcourseacostassociatedwith thedesignanddevelopmentofdataproducts, butwebelievethatthiswillinmostcasesbemuchsmallerthanthecostsassociatedwiththeretrospectivedocumentationanddistributionof adhocdatasets.
Anothernotionthatiswell-knowninthephysicalsciences, butwhichasfarasweareawareisrareoutside, isthatofexplicit proprietaryperiods fordata.
Proprietaryperiod: A ‘proprietaryperiod’isaperiodafterdataisacquired, andthereforearchived, byasharedinstrument, duringwhichitisprivatetotheobserverorobserverswhorequestedit, andafterwhichthedata(usuallyautomatically)becomespublic.
The term ‘embargoperiod’wouldpossiblybemoregenerally intelligible, but‘proprietary’isconventional. Thenotionisdiscussedelsewhereinthisdocument(seeforexampleSect. 1.4), butwestressitherebecauseitusefullyconcretizesanumberofotherwisevaguequestionsaboutdatarelease.
Insteadofratherbroadquestionsofthehow, when, whyandwhetherofdatamanagementandrelease, weinsteadhavequestionssuchas‘whatarethedataproducts?’, ‘whomaretheydocumentedfor, andhowexpensively?’, ‘howlongistheproprietaryperiod?’or ‘whatis thequidproquofor thisperiod?’Thesequestionsdon'tmagicallybecomeeasytoanswer, buttheybecomealoteasiertoask, andinviteconcreteanswersandnegotiationratherthan adhoc argument.
There is nothing in thenotions of data products andproprietary periodswhichisobviouslyspecifictothephysicalsciences. Thenotionshavebecomewell-establishedinthisareaprobablybecauseithaslongexperience, ofneces-sity, ofusinglargesharedinstrumentswhichareoperatedtoagreaterorlesserextentasservices. This is lessoftenthecaseindisciplineswithmorebench-scaleexperimentalnorms, butevensomeareasofbiologyarenowmoreoftenusingsharedfacilities, andinotherdisciplines, dataproductsandproprietaryperiodswouldbecomemorenatural, themorethatpreservation-awarestorageisused [43].
Wecommend thenotionsofdataproducts andproprietaryperiods, andthedata culture they engender, to thebroader research community. Indeed,werecommendthat datamanagersshouldconsideradoptingthelanguageofdataproductsandexplicitproprietaryperiodswhendesigninganddocument-ingtheirholdings.
2 Theresponsibilitiesfordatapreservation
2.1 Visualisingbenefits
Whydofunderswishtopreservedata? Becausetheyperceive benefits tothatpreservation.
Buildingonthistruism, itseemsusefultoexplicitlyarticulatethesebene-fits. The JISC-fundedproject KeepingResearchDataSafe (KRDS) (see http://www.beagrie.com/krds.php and [44]) described a collectionof studies and toolssupportingdatapreservation. AmongsttheKRDS innovationswasatypologyofbenefits, describingthreedimensions: directtoindirect, near-tolong-termandpublictoprivate. InaslightextensiontotheworkinKRDS,wecantaketheno-tionof‘dimensions’perfectlyliterally, assignanyparticularbenefittoapositionalongeachofthethreeaxes, andplottheresultinathree-dimensionalspace;seeFig. 3 onthenextpage.
ComparethecommentsaboutHerscheldatainSect. 1.4.
Recommendation1
21
Gray, CarozziandWoan
(unlessit's otherpeople's opendata,ofcourse)
http://www.scitech.ac.uk/rgh/rghDisplay2.aspx?m=s&s=64
Metadata
Open data
Sysadmins
DR software
Researchers
Collaborations Institutions
Funders
Figure 3: Visualizingbenefits
Inthisfigureweidentifyfourbenefitswhichmightbeassociatedwithabig-scienceproject –namelytheexistenceofdata-reductionsoft-ware, goodmetadata, theprovisionofopendataandtheexistenceofsystemadministrators –andwesketchtheapproximatevolumestheymightoccupyalongthethreeaxes(inblue). Onthesamediagram, wecanindicate(inred)theapproximateareasofinterestoffoursamplestakeholders.
Intheexamplehere, ‘sysadminsupport’canbeseenasanindirectbenefit to researchers, typicallyprivate toan institution, butcreatingvalueinthenear-andlongterm; itisthereforespreadalongthe‘near-longterm’axis, butatoneextremeoftheothertwodimensions. Wecanputonthesamediagramtheapproximateareasofinterestofvar-iousresearchstakeholders. Forsimplicity, wearehereconceivingofindividualresearchersasselfishandshort-termist, thoughthesamere-searcherswillhavelong-terminterestwhentheyhaveacollaborationorinstitutionalhaton, andindirectpublicinterestsinthelong-termhealthof theirdisciplinewhentheyareservingonafundingcouncilgrantspanel; belowwewilltaketheterm‘funders’toreferbothtotheofficialsoffundingbodies, actingasproxiesforthewiderinterestsofsociety,andtomembersoftheresearchcommunitydischargingserviceroles.
Weshouldnottakethisdiagramtooliterally –itisnotclearthattheaxesare independent, and theextentandeven thegrosspositionsof thevariousinterestsandbenefitsaredebatable. Thediagramisnonethelessthought-provoking. Forexample, itvisuallypredictsthatmuchoftheresearchcommunityisnotparticularlyinterestedin‘opendata’andonlyincompletelyinterestedin‘goodmetadata’(in-collaborationresearcherscarewhenadatasetwasacquired,becausetheyneedthatinformationtoperformtheiranalyses, buttheyhavelittleinterestindisseminationandlicensingmetadata, forexample, becausethatisthelong-termconcernoffundersandtheirproxies). Wecanthereforenaturallyconceiveofthefunderstakingtheroleoftheconscienceofadiscipline, worryingaboutlong-termimponderablessothatindividualresearchersdon'thaveto. Itfollowsfromthis, thattheopendatacasemadetofunders, forexample, willbeaninstitutionallyself-interestedone, butthatthecasemadetoresearchersmustbequalitativelydifferent, andbeeitherpragmatic(‘youmustcarebecauseyourfunderscare’)orhigh-minded(‘yoursocio-culturaldutyis…’). Neitheroftheseisapoorargument, norindeedacynicalone, butweareacknowledgingherethat, toabusyanddistractedresearcher, theself-interestargumentinisolationmayhavelittlepurchase.
2.2 Thecaseforopendata
Internationally, there isapushtowardssuch datasharingin themoregeneralcontextofscholarlyresearch(seeforexample [45]or [46]). Themostexplicitstatementhereisinthe NSF'sGC-1document [47], whichinsection 41statesthat“[NSF] expectsinvestigatorstosharewithotherresearchers, atnomorethanincrementalcostandwithinareasonabletime, thedata, samples, physicalcol-lectionsandothersupportingmaterialscreatedorgatheredinthecourseofthework. It also encourages grantees to share software and inventionsor other-wiseacttomaketheinnovationstheyembodywidelyusefulandusable.”Thisisreiteratedinalmostthesamewordsintheir2010datasharingpolicy [3]. Theyadditionallyrequireabriefstatement, attachedtoproposals, ofhowtheproposalwouldconformtoNSF'sdata-sharingpolicy.
STFC,incommonwiththeotherUK researchcouncils, requiresthat“thefulltextofanyarticlesresultingfromthegrantthatarepublishedinjournalsorconferenceproceedings[…]mustbedeposited, attheearliestopportunity, inanappropriatee-printrepository”; ithasnotyetmadeanycorrespondingstatement
22
ManagingResearchDatainBigScience
ondatareleases.Theyear2009sawsomeexcitement(relatingtotheincidentinevitablyla-
belled‘climategate’, andtosomeotherdata-releasedisputes)relatedtotheman-agementandreleaseofclimatedata. Thisillustratedthepoliticalandsocialsig-nificanceofsomesciencedatasets; thecontrastbetweenwhatscientistsknow,andthepublicbelieves, tobenormalscientificpractice; andsomeoftheissuesinvolvedinthegeneration, ownership, useandpublicationofdata. Thecasesduringthatyearillustrateanumberofcomplicationsinvolvedindatareleases.
1. Data isoftenpassedfromresearchersorgroupsdirectly toothers, acrossborders, withnogeneralpermissiontodistributeitfurther.
2. Datacollectionmaybeonerous, andtheresultofsignificantprofessionalandpersonalinvestments.
3. Rawdata isgenerallyuselesswithoutthemoreorlesssignificantprocessingwhichcleansitofartefactsandmakesitusefulforfurtheranalysis.
4. Howevernotalldisciplineshavetheclearnotionofpublisheddataproductswhichis foundinastronomyandwhichis implicit intheOAIS notionofarchivaldeposit.
5. Scienceisacomplicatedsocialprocess.
Inscience, wepreservedatasothatwecanmakeitavailablelater. Thisisonthegroundsthatscientificdatashouldgenerallybeuniversallyavailable, partlybecause it isusuallypubliclypaid for, butalsobecause thepublicdisplayofcorroboratingevidencehasbeenpartofscienceeversincethemodernnotionofsciencebegantoemergeinthe17thcentury(CE) –witnesstheRoyalSociety'smotto, ‘nulliusinverba’, whichtheSocietyglossesas‘takenobody'swordforit’. Ofcourse, thepracticeisnotquiteassimpleastheprinciple, andahostofissues, rangingacrossthetechnical, political, socialandpersonal, complicatethesocial, evidentialandmoralargumentsforgeneraldatarelease.
Thearguments against generaldatareleasesarepracticalones: datareleasesarenot free, andmayhavesignificantfinancialandeffortcosts (cfSect. 3.4).Manyof thesecostscome from (preparation for)datapreservation, since it isformallyarchiveddataproductsthatarethemostnaturallyreleasableobjects:releasingraworlow-leveldatamay becheap, butmayalsohavelittlevalue, sincerawunderdocumenteddatasetsarelikelytobeuseless; ormorepessimisticallytheymayhaveanegativevalue, iftheyendupfosteringmisunderstandingswhicharetime-consumingtocounter(thispointobviouslyhasparticularrelevancetopoliticisedareassuchasclimatescience). Inconsequenceofthis, the‘opendataquestion’overlapswiththequestionofdatapreservation –ifthevariouscostsandsensitivitiesofdatapreservationaresatisfactorilyhandled, thenasignificantsubsetofthepracticalproblemswithopendatareleasewillpromptlydisappear.Wediscussthedatapreservationquestionbelow, inSect. 2.3.
Itseemsworthnoting, inpassing, thatthephysicalsciencesbroadlyperformbetterherethanotherdisciplines, bothinthetechnicalmaturityoftheexistingarchivesandinthecommunity'swillingnesstoallocatethetimeandmoneytoseethisdoneeffectively.
Whatallthisindicatesisthatthereisaneedforanexplicitframeworkfordiscussingthepragmaticsofopendata(cfpoint 4 above). Wecangofurtherandsuggest(itisalmostaRecommendation)thattheOAIS model'snotionofanAIP,anditsreflectioninthenotionofa dataproduct shouldbecentraltothisdiscussion.
http://www.guardian.co.uk/environment/2010/apr/20/climate-sceptic-wins-data-victory
UEA'sClimateResearchUnitisapartnerintheACRID project, alsofundedbytheJISC MRDprogramme: http://www.cru.uea.ac.uk/cru/projects/acrid/
Thelastpointissimultaneouslyobviousanddeeplyintricate.Unpackingitwoulddistractushere,butthereisfurtherdiscussion, inaveryappositehistoricalcontext,in [6], elaboratedin [36].
23
Gray, CarozziandWoan
2.3 Thecasefordatapreservation
ThecasefordatapreservationinastronomywasimplicitlymadeinSect. 1.4:asanobservationalscience, muchastronomydataisrepeatable, butthereareimportantcaseswherewhatisbeingobservedisaslowsecularchange, orsomeunpredictable(usuallyultimatelyexplosive)event; sometimesdatacanbeop-portunistically reanalysed to extract informationdistinct from the informationtheobservationwasdesignedfor. Astronomicaldata ispotentiallyuseful andusablealmostindefinitely. Thusthereisareasonableexpectationthatthedatacanbeandwillbeexploitedbyunknownastronomers, farintothefuture.
HEP dataissomewhatdifferent(asnotedinSect. 1.5). Asanexperimentalscience, itisgenerallyverymuchincontrolofwhatitobserves, andisabletodesignexperimentsofconsiderableingenuity, inordertomakemeasurementsofexquisitediscriminatorypower. A consequenceofthisisfirstlythatHEP ex-perimentshaveamuchstrongertendencytobecomeobsoletewitheachtechno-logicalgeneration, andsecondlythatthecomplicationoftheapparatusmakesithardtocommunicateintothefuturealevelofunderstandingsufficienttomakeplausibleuseofthedata. Experimentalapparatuswillgenerallybeunderstoodbetterandbetterastimegoeson(thisisalsotrueofsatellite-bornedetectorsinastronomy), so thatdatagatheredearly inanexperimentwillbeperiodicallyreanalysedwith increasedaccuracy. However thisunderstanding isgenerallynotpreservedformally, butispragmaticallycommunicatedthroughwikis, work-shops, wordofmouth, configurationandcalibrationfiles, andinternalandex-ternalreports. Evenifallofthetangiblerecordsweremagicallypreservedwithcompletefidelity, andsupposingthatthemoreformalrecordsdocontainalltheinformationrequiredtoanalysetherawdata, anarchivewouldstillbemissingtheword-of-mouthinformationwhichanewpostgradstudent(forexample)hastoacquirebeforetheycanunderstandthemorecompletedocumentation. Wecanthinkof thisasa ‘bootstrapproblem’. InOAIS terms, the RepresentationNetworkforHEP dataisparticularlyintricate, andwhilethe RepresentationIn-formationnearesttothe DataObjectmaybecomplete, itmaybeinfeasibletogathertheRepresentationInformationnecessarytoletanaiveresearchermakesenseofit. The DesignatedCommunityforHEP datamaythereforebenullinthelongterm.
Thissoundspessimistic, but [26]describesanumberofscenariosinwhichHEP datacanandshouldbereanalysedsomedecadesafteranexperimenthasfinished, anddescribesongoingworkonthedevelopmentofconsensusmodelsforpreservingdataforlongenoughtoenablesuchpost-experimentexploitation.Thisprovidesastrongcaseforastyleofpreservationsomewhatdifferentfromtheastronomicalone. Whatthesemodelshaveincommonisacommitmentofstafftoactivelyconserveandcontinuouslyexploitthedata. Thispost-experimentstaffcanthereforebeconceivedasaformofwalkingRepresentationInformationsothat, whiletheyarestillinvolved, thedatamighthaveaDesignatedCommunitywhichcorrespondstothoseindividualsinapositiontoundertakeanextendedapprenticeshipinthedataanalysis(thismodelisfurtherdiscussedonp.32).
GW datais, asusual, somewherebetweenthesetwoextremes. Asastron-omy, theGW dataconsistsofunrepeatablemeasurementswhichwillpotentiallybeofvaluetoastronomerswellintothefuture; asaHEP-styleexperimentitmakesthosemeasurementsusingtwoorthreegenerationsofhighlysophisticatedappa-ratus, eachgenerationofwhichwillimproveonthesensitivityofitspredecessorsbyordersofmagnitude. Anadditionalfeature, however, isthatno-onehaseverconvincinglydetectedagravitationalwave, though therehavebeen repeatedclaimsofdetectioninthepast, sothatthefirstclaimsbyLIGO or aLIGO willbescrutinizedparticularlyclosely.
Finally, andasnotedinSect. 2.2, ifdataiswellarchived, thenmostofthepragmaticobjectionstoopeningthatdatadonotapply. Thus, totheextentthat
24
ManagingResearchDatainBigScience
generaldatareleaseisagoodinitself, itisafurtherargumentinfavourofawellsupportedarchive.
2.4 Shouldrawdatabepreserved?
Inthedata-preservationworld, thereisoftenanautomaticexpectationthat‘ev-erythingshouldbepreserved’, sothatanexperimentcanberedone, resultsre-analysed, orananalysisrepeated, later. Isthisactuallytrue? Orifitisatleastdesirable, howmucheffortshouldbeexpendedtomakeittrue? Thisquestionisimplicitin, forexample, thediscussionofsoftwarepreservation inSect. 3.2.
Whenaphysicalexperimentissetupandworking, itisusualtoavoidtin-keringwithitasmuchaspossible, toavoidanyunexpectedlysignificantchange.That is, evenwithasmall-scale lab-benchexperiment, it isaccepted thatnoteverythingcanbeeffectivelydocumented, and that anexperimentmightnotbeimmediatelyreplicablepurelyfrompublishedinformation(cf[6, ch.35]andSect. 1.7). Thisexpectation(orrather, lackofexpectation)isalsotrueoflarger-scaleexperiments, whichmightbefinancially, professionallyor, at thelargestscales, politicallyinfeasibletoreplicate. Perhapsthisattitudeshouldextendtootheraspectsoftheexperimentalprocess.
Inmanycases, thepipelineforreducingrawdataseemstofallintothiscat-egory: itencodeshard-to-documentinformation, butisitselfhardtodocument,hardtouse, andunlikelyevertobereusedinfact. Ifthissoftwareisnotpreserved,thentherawdataiseffectivelyunreadable, whichmeansthereisnocaseforpre-servingit. Thereisthereforeacasethatatleastsomedetailsoftheexperimentalenvironment –digitalaswellasphysical –arenotreasonablypreservable, andthatasaresultlittleeffortshouldbeexpendedonpreservingthem.
Itisdataproducts thatmakerawdatalessnecessary. Itisfeasibletodoc-umentthescientificmeaningofdataproducts, andthecommunityexpectsthataprojectwillprovidethisdocumentationaspartofthepublicationoftheprod-ucts(indeed, itthedocumentationthatmakesthese products ratherthanjustacasualdatasnapshot). Thedataproductsallowresearcherstodigbeneaththeconclusionsofaparticulararticle(orindeedthecontentsofahigher-leveldataproduct), andtocriticiseandbuildonwhattheyfindthere. Higher-levelprod-uctsaretheresultofhigher-levelscientificjudgements, anditisnormalforthesetoberegeneratedbyresearchersotherthantheoriginators, eitherusingtheirownsoftwareortheoriginators'pipelines. Theselater-stagepipelinesaremorefor-mallysupportedbyprojects, whichinvolvesmakingthemreasonablyportable,sothattheyarebotheasiertopreserveaswellasbeingmorevaluableobjectsofpreservation.
Weshouldstressthatwearenotadvocatingdeliberatelydeletingrawdata,anditsassociatedpipelines –it might beuseful, andit might beusable –butsimplynotingthatoneshouldnotoverstateitsvalue.
2.5 OAIS:suitabilityandmotivation
InSect. 3.1.1, weprovideanoverviewoftheOAIS model, anddescribehowitrelatestoastronomicaldata.
TheOAIS standardisformallyaproductofthe ConsultativeCommitteeforSpaceDataSystems (CCSDS),andwith this in its lineage it isquitenaturallymatchedtothedatamanagementproblemsofthephysicalsciences. EssentiallyalltheexplicitandimplicitassumptionsoftheOAIS standardaretrueintheareawearestudying: thedataproducer(asatelliteoradetector)isusuallyobvious,thevarious InformationPackages (ordataproducts)wellunderstood, and theDesignatedCommunityeasilyidentified.
Themotivationforadigitalpreservationstandard, asdiscussedintheOAISstandarditself [4, §2], isthatdigitalpreservationrepresentsadoubleproblem:
Itisbecauseverylarge-scaleexperimentsareimpossibletoreplicate, andevenhardforanexternalreviewerofanarticletocriticisemeaningfully, thatlargecollaborationssubmittheirpublicationstoextremelyscrupulousinternalreview. Seehttp://stuver.blogspot.com/2011/03/big-dog-in-envelope.html forapost-mortemaccountofsuchareview.
http://www.ccsds.org
ThereisnocontradictionherewiththeremarksinSect. 1.7 aboutthedifficultyofdescribingtheDesignatedCommunityofsciencearchiveusers. ItiseasytonameascienceDesignatedCommunity, butitmaybehardtodescribeaheadoftimewhatthosecommunitymemberscanbeexpectedtoknow.A socialsciencearchivemayhaveanunpredictablybroadrangeofultimateusers, butusingthearchivewillneedlittlespecialistknowledge;incontrastaparticlephysicsdatasetwillprobablybeofinterestonlytoparticlephysicists, butthenormaleducationofsuchaphysicistthreedecadeshence, andthusthecontentandextentofthespecialisedRepresentationInformationthatCommunitywillneed, mightbeveryhardtoguessat.
25
Gray, CarozziandWoan
Recommendation2
Recommendation3
(i) digitalinformationisintrinsicallyhardertopreservethantraditionalinforma-tion, whichiscapableofsittingonashelfinawell-understoodandintelligibleformat, andmoulderingatawell-understoodandgracefulrate; and(ii) moreandmoreorganisationsareproducingdigitalinformation and areimplicitlyexpectedtoarchivetheirownmaterial. Thismeansthatthesenon-specialistarchiveshaveacomplicatedtasktoperform, whichispotentiallyatoddswiththedailyurgen-ciesoftheirmainbusiness.
This appears tomeaninturn(andinJISC contextsitisoftentakentomeaninpractice)thattheseorganisationsneedasmuchdetailedandprescriptivehelpaspossible, ideallydevolvingtheirarchiveresponsibilitiestoacentraldiscipline-orfunder-specificarchive, totheextentpossiblewhilerespectingthelow-levelcomplicationsandfrictionalludedtoinSect. 1.7. Thisisnotthemodelwhichisappropriateforbig-sciencedatasets.
2.6 Whatshouldbig-sciencefundersrequire, orprovide?
Wehavedescribedseveralcommonfeaturesofbig-sciencedatamanagementinSect. 1.3. andwehaveoutlinedsomeparticularcontrastswithothercommuni-tiesinSect. 1.7. AsnotedinSect. 0.1, ourfocushereisonSTFC'sstrategicallyfundedprojects, ratherthanthesmallerprojectsfundedbyindividualresearchgrants.
Big-sciencedatasetsaregenerallyintimatelycoupledtosolutionstoleading-edgetechnicalchallenges, andcannotusefullyberegardedasincrementalchangestoprevioussolutions. This, coupledwith thegeneralavailabilityofextensivetechnicalexpertisewithinsuchcommunities, meansthatanygenericsolutionisveryunlikelytobeappropriate, andthatitisbothreasonableandfeasibletorequirecustomarchivingsolutionsforsuchprojects. Thereisno recipe fordatapreservationonthisscale, andallthatcanbehopedforisastructuredapproachtoacustomsolution. Havingsaid this, noteven themost innovativescienceexperimentsaresocompletely suigeneris thattheywarrantadatapreservationapproachwhichisreimaginedfromscratch. ItisthereforewastefultoignoretheconsiderableintellectualinvestmentsintheOAIS model, thegrowingpenumbraofcommentariesonanddevelopmentsofit, andtheminorindustryofvalidationandauditingeffortsrelatedtoit.
Wearethereforeledtotheconclusionthatthemosteffectiveoverallstrategyforeffectivedatamanagementinthelarge-scaleexperimentalphysicalsciencesisthat fundersshouldsimplyrequirethataprojectdevelopahigh-levelDMPplanasasuitableprofileoftheOAIS specification [4]. Thisprofileshouldbedetailedenoughtorequirenegotiationwiththefunderandwiththeexperiment'scommunity, butcanleavemanyoftheimplementationdetailstothegooden-gineeringjudgementoftheproject'smanagement. WebelievetheLIGO DMPplan [5]canbetakentobeexemplaryinthisregard.
Big-scienceprojectshavethetechnicalskills, themanagementstructures,andthebudgetstotakeonsuchatask, andtodeliveracustomarchivewhichcanbe shown tomeet identifiedgoals. We recommend that funders shouldsupportprojectsincreatingper-projectOAIS profileswhichareappropriatetotheprojectandmeetfunders'strategicprioritiesandresponsibilities.
ThediscussioninSect. 3.5 suggeststhatoneresultofthedevelopmentofanOAIS-based DMP planisthattheresultingplanisexplicitenoughtogenerateusefuldeliverables, andtobenefitfromthegrowinginterestinOAIS ‘validation’.
Wesuggestthefollowingspecificfunderactions.
• ActivelyengagewithprojectstohelpthemdevelopanOAIS profile. Thiswillincludeoverviewliterature, includingtheOAIS specification, tutorialreportssuchas [48], andcommentarysuchas [49], orperhapsspecialised
26
ManagingResearchDatainBigScience
Producer Consumer
Submission Information Packages
Dissemination Information Packages
Archive Information Packages
OAISresult sets
queries
orders
data products
data products
the public scientistsʻthe archiveʼ
observers
Figure 4: Thehighest-levelstructureofanOAIS archive, annotatedwiththecor-respondinglabelsfromconventionalastronomicalpractice(redrawnfrom [4,Fig. 2-4]). Thedisseminationdataproductswilltypicallybethesameasthesubmittedones, butarchivescansometimescreatevalue-addedonesoftheirown.
workshopsifnecessary. Thesearehigh-levelintroductions, ratherthanpro-cedure-basedtick-lists.
• DeveloporsupportexpertiseincriticisingandvalidatingsuchOAIS profiles.Forexample, theCASPAR consortium(seeforexample [50])hasdevelopedstrategies fordetailedvalidationofprojects' claimsabout long-termdatamigration. Similarwork –forexamplevalidatingaproject'sassumptionsabout its DesignatedCommunity –would reassure thewidercommunitythatthearchivedesignislikelytoachieveitsgoalsforthefuture.
Thefirstoftheseisreasonablystraightforward, consistingoflittlemorethangatheringresources. Thesecondisalonger-termprojectwhichmayrequiresomeexpertisetobebuiltupandsupportedatafunder-supportedfacility(suchasRAL,intheUK),orthroughliaisonwiththe DCC.
A corollaryofthismoreactiveengagementisthatfundersmustfinanciallysupportthepreservationworktheyrequire. SeeSect. 3.4.
3 Thepracticalitiesofdatapreservation
3.1 Modellingthearchive
3.1.1 TheOAIS model
WeintroduceherethemainconceptsoftheOAIS model. Fulldetailsarein [4]withausefulintroductoryguidein [48]andsomediscussionintheLSC contextin[5]; theOAIS motivationisfurtherdiscussedinSect. 2.5.
Theterm OAIS standsforan OpenArchivalInformationSystem. Theword‘open’isnotintendedtoimplythatthearchiveddataisfreelyavailable(thoughitmaybe), butinsteadthattheprocessofdefininganddevelopingthesystemisanopenone. TheprincipalconcernofanOAIS istopreservetheusabilityofdigitalartefactsforapragmaticallydefinedlongterm. AnOAIS isnotonlyconcernedwithstoringthelowest-level bits ofadigitalobject(thoughthispartof itsconcern, andisnotatrivialproblem), butwithstoringenough informa-tion about theobject, anddefininganadequately specifiedanddocumentedprocess formigratingthosebitsfromsystemtosystemovertime, thattheinfor-mationorknowledgethosebitsrepresentcanberetrievedfromthematsome
27
Gray, CarozziandWoan
WethankDorotheaSalo, oftheUniversityofWisconsinlibrary, for
emphasizingtoustheusefulapplicabilityoftheDCC modeltothecaseofbigsciencedata, andAngusWhyte, forelaboratingthecontrastsbetweentheDCC and
OAIS models.
indeterminatefuturetime. TheOAIS modelcanthereforebeseenasaddressinganadministrativeandmanagerialproblem, ratherthananexclusivelytechnicalone.
TheOAIS specification'sprincipaloutputistheOAIS referencemodel, whichis an explicit (but still rather abstract) set of concepts and interdependencieswhichisbelievedtoexhibitthepropertiesthatthestandardassertsareimpor-tant(Fig. 4 ontheprecedingpage). TheOAIS modelcanbecriticisedforbeingsohigh-levelthat“almostanysystemcapableofstoringandretrievingdatacanmakeaplausiblecasethatitsatisfiestheOAIS conformancerequirements” [49],andthereexistbotheffortstodefinemoredetailedrequirements [49], andeffortstodevisemore stringentandmoreauditableassessmentsofanOAIS'sactualabilitytobeappropriatelyresponsivetotechnologychange [50].
AnOAIS archiveisconceivedasanentitywhichpreservesobjects(digitalorphysical)inthe LongTerm, wherethe‘LongTerm’isdefinedasbeinglongenoughtobesubjecttotechnologicalchange. Thearchiveacceptsobjectsalongwithenough RepresentationInformationtodescribehowthedigitalinformationintheobjectshouldbeinterpretedsoastoextracttheinformationwithinit(forexample, the FITS specification isRepresentation Information for a FITS file).That Informationmayneed furthercontext – forexample, tosay thatafile isanASCII filerequiresonetodefinewhatASCII means –andthecollectionofsuchexplanationsturnsintoa RepresentationNetwork. Thisinformationisallsubmittedtothearchiveintheformofa SIP agreedinsomemoreorlessformalcontractbetweenthearchiveanditsdataproducers.
Oncetheinformationisinthearchive, thelong-termresponsibilityforitspreservationis transferred fromtheprovidertothearchive, whichmustthereforehaveanexplicitplanforhowitintendstodischargethis.
ThearchivedistributesitswarestoConsumersinoneormore DesignatedCommunities, by transforming them, if necessary, into the DIP which corre-sponds to a ‘dataproduct’. Themembers of theDesignatedCommunity arethoseusers, in the future, whomthearchive isdesigned tosupport. Thisde-signrequiresincluding, inthe AIP,RepresentationInformationatalevelwhichallowstheDesignatedCommunity tointerpret thedataproducts withouteverhavingmetoneofthedataProducers, whoareassumedtohavedied, retired, orforgottentheiremailaddresses.
TheOAIS modeloriginatedwithinthespacesciencecommunity, soitcanbemappedtothephysicalsciencedataoftheGW communitywithoutmuchviolence.
3.1.2 TheDCC CurationLifecyclemodel
TheOAIS modelisonthefaceofitalinearone, andsuggeststhatdataiscreated,theningested, thenpreserved, andthenaccessed, inaprocesswhichhasaclearbeginningandend. Thisiscompatiblewiththeobservationthatonepointofarchivingdataistoreuseorrepurposeit, creatingnewarchivabledataproductsinturn, butthislonger-termcycleremainsonlyimplicitinthemodel. TheOAISmodel is thereforeveryusefullyexplicit about thoseaspectsof archivalworkconcernedwithlong-termpreservation, butitsconceptualrepertoireissuchthatadiscussionframedbyitrunstheriskofunderemphasizingtherangeofrolesadatarepositoryhas, orevenofmarginalisingit.
Incontrast, the DCC hasproducedalifecyclemodel [51](Fig. 5 onthenextpage)whichstressesthatdatacreation, management, andreusearepartofacycleinwhichpreservationplanning, forexample, cannaturallyhappenbeforedatacreationaswellasafterit; andinwhichdatacanbeappraised, reappraised, andpossiblydisposedofifitbecomesobsolete. Itthereforemakesexplicitboththeshort-andlong-termcyclesintheflowofactiveresearchdata, anditemphasizestheactiveinvolvementofdatacuratorsinmaintainingthatcycle.
28
ManagingResearchDatainBigScience
ACCE
SS, U
SE&
REUS
ETRANSFORM CREATE OR RECEIVE
APPRAISE&
SELECT
STORE
PRESERVATION ACTION
INGEST
CONCEPTUALISE
DISPOSECURATE
PRESERVE
andand
MIGRATEREA
PPRA
ISE
COMMUNITY WATCH & PARTICIPATION
PRESE
RVATION PLANNING
DESCRIPTION
REPRESENTATION INFORMATIO
N
Data(Digital Objects
orDatabases)
Figure 5: TheDCC lifecyclemodel, from[51]
Cyclesofuseandre-usearenottheonlylinksbetweendatasets. Asdis-cussedin [52], onedigitalobjectcanalsoprovidecontextforanother, inava-rietyofways. Tosomeextent this remark rediscovers thenotionof theOAISRepresentationNetwork, andthisinturnpromptsustostressthatalthoughwehavecontrastedOAIS andDCC here, theyarenotincompetition: OAIS iscon-cernedwiththecreationandmanagementofaworkingarchivewithgatekeepersandfirmgoals; theDCC modelisconcernedwiththelocationofthearchiveinthewiderintellectualcontext.
TheDCCmodelisimmediatelycompatiblewiththeobservation, inSect. 3.4below, thatHEP andGW archiveseffectivelyavoidsomepreservationcostsbyseeinglong-termpreservationasonlypartoftheroleofadatarepository. Accept-ingdata, makingitavailableasworkingstorage, transformingitintoimmediatelyusefulforms, orappraising(possiblyregenerable)datasetswhosestoragecostsoutweightheirusefulness, allgivethearchiveafamiliaritywiththedata, andtheresearchersafamiliaritywiththearchive, whichmeansthatthedecisiontoselectcertaindataforlong-termpreservationispotentiallymoreeasilyreached,moreeasilydefendedandmoreeasilyfunded, thanifthearchiveisconceivedasacost-centrebucketboltedonthesideoftheproject. ThisappearstobeborneoutbytheLIGO experience, inwhichthenew DMP planwasdevelopedandsuccessfullyarguedforbythesamepersonnelwhowerelongresponsibleforthedesignandmanagementofthedatamanagementsystemonwhicheveryone'sdailyworkdepends.
3.2 Softwarepreservation
Asdiscussedin, forexample, Sect. 1.6.2, thereisoftenasubstantialamountofimportantinformationencodedinwayswhichareonlyeffectivelydocumentedinsoftware, orsoftwareconfigurationinformation. Thereisthereforeanobviouscaseforpreservingthissoftware(thoughnotethecaveatsofSect. 2.4).
Preservationofasoftwarepipelinerequirespreservingthe pipelinesoftwareitself, apossiblylargecollectionoflibrariesthesoftwaredependson, theoper-
29
Gray, CarozziandWoan
http://www.software.ac.uk/
TheUK Starlinkprojectprovidedastronomicalsoftware. Itranfrom
1980to2005, whenitwasrescuedfromoblivionbybeingtakenupby
theUK JointAstronomyCentreHawai‘i. Thecurrentdistribution
includesstill-workingcodefromthe80s. TheNetlibandBLAS librarieshavecomponentswhichdatefrom
the70s.
http://nssdc.gsfc.nasa.gov
http://pds.nasa.gov/http://nssdc.gsfc.nasa.gov/nssdc/
data_retention.htmlhttp://pds.nasa.gov/tools/index.
shtmlWearegratefultoPaulButterworth
ofNASA forhelpfuladvicehere.
http://www.rssd.esa.int/index.php?project=PSA&page=about
atingsystem(OS) itallrunson, andtheconfigurationandstart-upinstructionsforsettingthewholethinginmotion. TheOS mayrequireparticularhardware(CPUsorGPUs), thesoftwaremaybequalifiedforaverysmall rangeofOSsandlibraryversions, anditmaybehardtogatheralloftheconfigurationinfor-mationrequired(thereissomediscussionofhowoneapproachesthisprobleminforexample [26]). Itisnotcertainthatitisnecessary, however: ifthedataproductsarewell-enoughdescribed, thenre-runningtheanalysispipelinemaybeunnecessary, oratleasthaveasufficientlysmallpayofftobenotworththeconsiderableinvestmentrequiredforthesoftwarepreservation. Wefeelthat, ofthetwooptions –preservethesoftware, ordocumentthedataproducts –thelatterwillgenerallybebothcheaperandmorereliableasawayofcarryingtheexperiment'sinformationcontentintothefuture, andthatthistradeoffismoreinfavourofdatapreservationasweconsiderlonger-termpreservation.
This lastpoint, about thechanging tradeoff, emphasizes that the twoop-tionsarenotexclusive: onecanpreservedata and preservesoftware, andtheJISC-fundedSoftwareSustainabilityInstituteprovidesagrowingsetofresourceswhichprovideguidancehere. Howeverthesolutionspresentedgenerallyfocusonactivecuration, inthesenseofpreservingsoftwarethroughcontinuinguseandmaintenance. Thiscanbesuccessful, andistheapproachimplicitin [26],butitseemsbrittleinthefaceofsignificantfundinggaps, andwouldnotdealwellwiththecasewhereasoftwarereleaseisdeliberatelyunused, forexamplebecauseithasbeensuperseded.
3.3 Datamanagementplanning
3.3.1 DMP inspace
Asonemightexpect, bothNASA andESA haveformalised DMP plans.NASA's NationalSpaceScienceDataCenter(NSSDC) hasledNASA'sdata
planningsincethemid-80s. ItwasinitiallytheNSSDCwhichnegotiatedaProjectDMP planwithmissions, butsincethe1990sthishasbecometheresponsibilityoftheNASA PlanetaryDataSystem(PDS).TheNSSDC'sdataretentionpolicydescribeswhatcategoriesofdataproductshouldberetainedindefinitely, andthePDS providesresourcestomissionplannersontheprocessesandtoolsforpreparingdataforpreservation.
ESA'sPlanetaryScienceArchive“providesexpertconsultancytoallofthedataproducersthroughoutthearchivingprocess. Assoonasaninstrumentisselected, PSA beginworkingwith the instrument teamtodefineasetofdataproductsanddatasetstructuresthatwillbesuitableforingestionintothelong-termarchive.”TheESA archiveisbydesigncompatiblewiththePDS.
3.3.2 CurrentandfutureDMP intheLSC
Thecurrent LIGO DMP plan [5], discussesDM planningwithanemphasisonthepreparationsfortheeventualpublicdatarelease.
TheLIGO DMP planproposesatwo-phasedatareleasescheme, tocomeintoplaywhen aLIGO iscommissioned; thiswaspreparedattherequestoftheNSF,developedduring2010–11, andwillbereviewedyearly.
TheplandocumentsthewayinwhichtheconsortiumwillmakeLIGO dataopentothebroaderresearchcommunity, ratherthan(asatpresent)onlythosewhoaremembersof theLSC.Thisdocumentdescribes theplans for thedatareleaseanditsproprietaryperiods, andoutlinesthedesign, function, scopeandestimatedcosts oftheeventualLIGO archive, asaninstanceofanOAIS model.This isahigh-levelplan, withmuchof thedetailed implementationplanningdelegatedtopartnerinstitutionsinthemediumterm.
30
ManagingResearchDatainBigScience
Inthefirstphase, dataisreleasedmuchasitisatpresent: validateddatawillbereleasedwhenitisassociatedwithdetections, orwhenitisrelatedtopapersannouncing non-detections(forexample, associatedwithanotherastronomicaleventwhichmightbeexpectedorhopedtoproducedetectableGWs). Inthesecondphase –afterdetectionshavebecomeroutine, andtheLIGO equipmentisactingasanobservatoryratherthanaphysicsexperiment –thedatawillberoutinelyreleasedinfull: “theentirebodyofgravitationalwavedata, correctedforinstrumentalidiosyncrasiesandenvironmentalperturbations, willbereleasedtothebroaderresearchcommunity. Inaddition, LIGO willbegintoreleasenear-real-timealertstointerestedobservatoriesassoonasLIGO may havedetectedasignal.”ThissecondphasewillbeginafterLIGO hasprobedagivenvolumeofspace-time(see[5, ref7]), or after3.5yearshaveelapsedsincetheformalLIGOcommissioning, whicheverisearlier. Alternatively, LIGOmayelecttostartphasetwosooner, ifthedetectionrateishigherthanexpected.
Inphasetwo, thedatawillhavea24-monthproprietaryperiod.TheDMP plandescribesthree(OAIS) DesignatedCommunities. Quoting
from [5, §1.5], thecommunitiesareasfollows.
• LSC scientists: whoareassumedtounderstand, orberesponsiblefor, allthecomplexdetailsoftheLIGO datastream.
• Externalscientists: whoareexpectedtounderstandgeneralconcepts, suchasspace-timecoordinates, Fouriertransformsandtime-frequencyplots, andhaveknowledgeofprogrammingandscientificdataanalysis. Manyofthesewillbeastronomers, butalsoinclude, forexample, thoseinterestedinLIGO'senvironmentalmonitoringdata.
• Generalpublic: thearchivetargetedtothegeneralpublic, willrequiremin-imalscienceknowledgeandlittlemorecomputationalexpertisethanhowtouseawebbrowser. WewillalsorecommendorbuildtoolstoreadLIGOdatafilesintootherapplications.
TheLIGO DMP planis, webelieve, agoodexampleofaplanforaprojectofLIGO'ssize: itisspecificwherenecessary, itwasnegotiatedwiththeproject'sfunder(NSF) sothatitachievedtheirgoals, anditwentthroughenoughiterationswiththebroaderLIGO community(theagreedversionin[5]isversion 14)thatitsauthorscouldbeconfidentithadtheirapproval, andthatthecommunitywascomfortablewithwhattheDMP planwasproposing. ThedocumenthasastrongfocusontheLIGO datareleasecriteria, sincethiswasthemostimmediatecon-cernofboththefunderandtheproject, butitsystematicallylaysoutahigh-levelframeworkforfuturedatapreservation, guidedbytheOAIS functionalmodel.
3.4 Datapreservationcosts
Thereisagooddealofdetailedinformation, andsomemodelling, ofthecostsofdigitalpreservation. TheKRDS2study[44, §§6&7]includesdetailedcostingsfromanumberofrunningdigitalpreservationprojects, insomecasesdowntothelevelofcostingsspreadsheets. TheLIFE3 projecthasalsodevelopedpredictivecostingstools [53], andthePLANETS project(http://www.planets-project.eu/)hasgeneratedabroadrangeofmaterialsonpreservationplanning, includingcostingstudies.
Although there is a broad rangeof preservationprojects surveyed in theKRDS report, therearenumerouscommonfeatures. Staffcostsdominatehard-warecosts, andscaleonlyveryweaklywitharchivesize. Thestudyalsonotesthatacquisitionandingestcostsareasubstantial fraction(70–80%)ofoverallstaff costs, but also scaleveryweaklywitharchive size. Theseare relativelysmallarchives, generallybelowafew TB insize, whereingestisasignificant
31
Gray, CarozziandWoan
WearemostgratefultoFernandoComerón, ofESO,forsharingthese
figures.
http://pds.nasa.gov/tools/cost-analysis-tool.shtml
componentoftheworkload. Inthisreportweareinterestedinarchivesthreeorfourordersofmagnitudelargerthanthiswhere(asdiscussedbelow)ingestmaybecheaper, butinbroadterms, itappearsstilltobetruethatstaffcostsdominatehardwarecostsatlargerscales, andscaleonlyweaklywitharchivesize.
Parenthetically, noticethattheabovediscussionpromptsthequestion‘whatisthesizeofanarchive?’Thenumberofbytesitconsumesisanobviousandreadilyavailablemeasure, butmaynotbeparticularlymeaningfulinthiscontext.Thenumberof items (suchas interview transcripts, imagesordatabase rows)maybeabettermeasure, andstillobjectivelyidentifiable, archivebyarchive. Ifthereweresomemeasureofabstractinformationcontent, wespeculatethatthisiswhatwouldscalemoststraightforwardlywiththeeffortrequiredforqualitycontrolandmetadatacuration, andhencewithstaffeffort. Wehesitatetoaskwhatsuchameasuremightbe, incasetheansweris‘citationanalysis’.
Thelackofscalingwithsize, evenwhenanarchiveprogressivelygrowsinsize, seems tosuggest that it isanarchive's initial size (in thesenseof small,mediumorlarge, forthetime)thatlargelygovernsthecosts.
Weweregivenaccesstoconfidentialfiguresforthedevelopmentandop-erationsofamid-to-largesizeastronomyarchive (oforder 10 TB ofrelationaldataand 100 TB offlatfiledata), developedbyanexperiencedarchivesite. Thearchivesoftwareandsystemdevelopmentcost25–30staff-yearsofeffort: thebulkofthiswasforthecoredatabasesystem, butbetweenaquarterandathirdwasforsoftwaretosupportingestandthegenerationofdataproducts. Theor-ganisationbudgetsaround3 FTEsforoperationofthisarchive, whichincludesingest, qualitycontrolandhelpdesksupport(thisisanestimatedfractionofanoperationsteamcoveringseveralarchivesatthesamesite, sotheremaybesomeeconomiesofscale). Aboutaquarteroftheannualoperatingbudgetisspentonhardware.
The EuropeanSouthernObservatory(ESO) dataarchivemanagesdatafrommultipleESO facilities; itsharesspacewiththestill-developingALMA archive,butthefiguresbelowdonotincludeALMA.Thearchiveisbasedonspinningdisksbackedbyatapelibrary(forfurtherdetails, see [54]). Itcurrentlyholds190 TB, increasingataround 7 TBmonth−1. Thehardwarecostsaveragearound330 k€ yr−1, whichincludeshardwarereplacementanddatamigration, andwhichhasremainedflat forsomeyears, despite theslowlyincreasingdatavolumes.Runningcostsamountto 55 k€ yr−1 (somesmallersystemsaccountforpartofthis), andlicences, networksandotherconsumablesaccountforabout 30 k€ yr−1.Manpowercostscometo4 FTEsofESO staffplusaround 270 k€ yr−1 ofout-sourced staff. Neitherhardwarenor softwarecostsappear to scalewithdatavolume, withsomecostelementsevendroppingasthearchivemovestocom-pletelyon-linedatadistribution.
Thereissomediscussionofthe CDS fundingmodelinSect. 1.4.1.The NASA PDS hasdevelopedaparameterizedmodelforhelpingproposers
estimate the costs involved in preparing data for archiving in the PDS;mostrelevantly for the abovediscussion it includes a scalingwithdata volumeof1 + 1.5 log10(volume/MB) (thatis, amultiplierwhichincreasesby1.5foreachorderofmagnitudeincreaseindatavolume).
AsnotedinSect. 1.5, theHEP communityisnowconstructingmoredetailedplansfordatapreservation, andtheassociatedcosts. Reference [26]estimatesthataformallong-termarchive(alevel-3or-4archive, inthetermsofthatpaper)wouldcost2–3FTEsfor2–3yearsaftertheendoftheexperiment, followedby0.5–1.0FTE/year/experimentspentonthearchive'spreservation. Theycomparethistothe100sofFTEsspentonfortherunningoftheexperiment, andonthisbasisclaimanarchivalstaff investmentof1%of thepeakstaff investment, toobtaina5–10%increaseinoutput(thelatterfigureisbasedontheirestimatethataround5–10%ofthepapersresultingfromanexperimentappearintheyearsimmediatelyaftertheexperimentfinishes; sincethislatterfigureisderivedonthe
32
ManagingResearchDatainBigScience
currentmodel, whichachievesthiswithoutanyformalpreservationmechanisms,thisestimateofthereturnoninvestmentinarchivesmaybeoptimistic).
Itisworthnotingthatinastronomical, HEP andGW contexts, archiveingestisgenerallytightlyintegratedwiththesystemforday-to-daydatamanagement, inthesensethatdatagoesdirectlytothearchiveonacquisitionandisretrievedfromthatarchivebyresearchers, aspartofnormaloperations. Ontheothersideofthearchive, projectswillgenerateanddisseminatedataproducts –whichlookverymuchlikeOAIS DIPs –aspartoftheirinteractionwithexternalcollaborators,withoutregardingtheseasspecificallyarchivalobjects. ThusthesubmissionsintothearchivemayconsistofbothrawdataandthingswhichlookverymuchlikeDIPs, andtheobjectsdisseminatedwillincludeeitherorbothveryrawandhighlyprocesseddata. The long-term planningrepresentedintheLIGO DMPplan [5], forexample, is therefore lessconcernedwith settingupanarchive,thanwiththeadjustmentsandformalizationsrequiredtomakeanexistingdata-managementsystemrobustforthearchivallongterm, andmoreaccessibletoawiderconstituency. Whatthismeans, inturn, isthatsomefractionoftheOAISingestanddisseminationcosts(associatedwithqualitycontrolandmetadata, forexample)willbecoveredbynormaloperations, withtheresultthatthe marginalcostsoftheadditionalactivity, namelylong-termarchivalingestanddissemina-tion, areprobablybothratherlowandtypicallybornebyinfrastructurebudgetsratherthanrequiringextraeffortfromresearchers. Thisiscorroboratedbyourin-formantsabove, whogenerallyregardarchivecostsascomingunderadifferentheadingfrom‘dataprocessingcosts’. ThepointhereisnotthattheOAIS modeldoesnotfitwell –itfitsverywellindeed –northatingestanddisseminationdonothavecosts, butthatiftheassociatedactivitiescanbecontrivedtooverlapwithnormaloperations, thenthecostsdirectlyassociatedwiththearchivemaybesignificantlydecreased. Thisistheintuitionbehindtherecentdevelopmentsin‘archive-ready’or‘preservation-awarestorage’(cf[43]andSect. 3.1.2), andconfirmsthatitisaviableandeffectiveapproach.
Asafinalpoint, wenotethatbig-scienceprojectsareinevitablyalsolarge-scaleengineeringprojects, sothat theconsortiaandtheir fundersarebroadlyfamiliarwiththeprocedures, uncertaintiesandmanagementofcostestimates,sothatthecostingandmanagementofdatapreservationcanbenaturallybuiltintotherelationshipbetweenfundersandfunded, ifthefunderssorequireit.
Asisshownbythevaguenessofsomeoftheremarksabove(despitesome-timesveryspecificnumbers), thereseemslittleinthewayofaconsensusmodelforthecostingofthelong-termpreservationoflarge-scaledata. TherewillsurelybedetailedcostingsforthemanagementofPB-scaledataforcommercialorgani-sations, butthesearenotlikelytobeusefulforourpurposes, sincetheyaremoreconcernedwithimmediatebusinesscontinuitythanmulti-decadearchives, areservingdifferenttechnicalcommunities, andarelikelytobeextremelyconfiden-tial.
Wethereforerecommendthat STFC shoulddevelopacostingsmodelforthepublicationandpreservationofdata, whichismatchedtothedatachallengesofthebig-sciencecommunity. Weexpectthatthiscanbuildonthedomain-agnosticworkalreadydoneinthisareabyJISC,andonthedetailedworkdoneoncloselyrelatedproblemsby NASA'scost-estimationcommunity [56].
3.5 TheGW communityandtheAIDA toolkit
TheAIDA Self-AssessmentToolkit [57] isa(JISC funded)setofqualitativebench-marks fordiscussingathowdevelopedan institution'sarchive is. It leadsanarchivemanagerthroughasetofafewdozenelements, invitingthemtogradetheirarchivefrom1(poor)to5(internationalexemplar). Thegoalisnottopro-duceapass/fail score, but instead tohelparchivemanagersunderstand theircurrentandfuturerequirements, andto“enableaninstitutiontodecidewhether
ThisisconsistentwiththeERIMproject'sconclusionsthat“ideallyinformationmanagementinterventionsshouldresultinazeronetresourceincrease” [55, p.8]. Inthiscasethereisnoextraresourcerequiredfromtheresearchers,thoughtheremightbeaneedforextraresourceunderaninfrastructureheading.
Recommendation4
TheAIDA documentlinksthesefivestages, ratheralarmingly, toafive-stepprogrammedevelopedatCornell, whichstartswithacknowledgingthatyouhaveaproblem, andgoes, viainstitutionalisation, to“embracing[…]dependencies”, notingthat“youcan'tdoitalone”. Clearly,data-managementplanningishabit-forming.
33
Gray, CarozziandWoan
specificactionsneedtobetakeninregardtoparticularassets, orwhenandhowitisdesirabletoimproveonitscurrentcapabilities”. TheAIDA authorsacknowl-edgethattheassessmentissimplisticandsubjective, butstressthat“AIDA aimstoallowyoutoevaluateyourinstitutionagainstarecognisedcapabilityscale, andthensuggestsappropriateactionsbasedonthatevaluation”. TheAIDA goalistomodeltheprogressofanarchivefromtheacknowledgementthatanarchiveisdesirable, throughtotheexemplaryexternalisationofthearchiveasaresource.
InAppx. B,welistourestimatesofthescoresforLSC datamanagement. WehopetheseassessmentsareofspecificusetotheGW community, butbelievethatthediscussioningeneralmaybeofusetoother, similarlystructured, bigsciencecommunities.
ThescoresforthecurrentLSC clusterinthemiddle, aroundthree(whichcorrespondsto‘consolidate’intheCornellmodel). Thisisanimpressivescoreforaprojectwhichis, fromonepointofview, doingonlywhatisregardedasnormalforawell-runlarge-scalephysicsexperiment. Thehigherscoresaregenerallyassociatedwiththeformalityandauditabilityofthelong-termplans, ratherthananyqualitativelydifferentpractice, andwebelievethatthesescoreswillnaturallydriftupwardsasaresultofthedevelopmentofanexplicitDMP plan, structuredusingtheOAIS conceptset, incollaborationwithasuitablycriticalfunder.
Thetoolkitisbrokenintoorganisational, technologyandresources(gener-allyfunding)‘legs’.
The ‘organisational leg’ is concernedwith the high-level support for thearchive. Totheextentthatitismeaningful, theaverageforthesescoresisabovethree (which isgood). The lowerscoresaregenerallyassociatedwith the in-formalityofthecurrentarchive(comparedtoaservice-orientedcommercialor-ganisation)rather thananymoreconcrete inadequacy: thedata isbackedupandreasonablyfindable, thoughthisreflectsculturalnormswithinthephysicalsciencesratherthansomethingaparticulararchivingplancantakecreditfor.
The‘technologyleg’isconcernedwiththehardwareandpersonnelsupportfordatamanagement. Aswiththeorganisationalleg, theGW communityscoreshighlyherewithoutreallytrying, simplybecausethecommunityhaslongexperi-enceofmanaging andsharing largevolumesofdata. Thelowerscoresareagainassociatedwiththecurrentinformalityofoperations(fromthepointofviewofanarchiveasopposedtoaworkingdata-managementinfrastructure), andthesewillnaturallyrisewhentheLSC's DMP planisimplementedandreviewed.
Thescoresinthe‘resourcesleg’aretheleastwell-justified. TheLSC gener-allyscoreswell, inthesensethatwecanbeconfidentthattherewillberesourcestosupportanarchiveeffort –it'sseenasahigh-importanceactivity –eventhoughtherearefewresourcescurrentlyexplicitlyearmarkedforthis. Thissectionmaythereforebeusefulforsuggestingwhatbudgetlinesshouldeventuallyexist.
4 Conclusionsandrecommendations
Inthisreport, wehavedescribedsomeofthewaysinwhich‘bigscience’man-agesitsdata, aspartofabroaderdataculturewhichischaracterisedbylargecol-laborations, andwhichhasdecadesofexperienceinagreeinghow, andwhen,andwhennot, tosharedata.
Wecansaywithsomeconfidencethatthebigsciencedataculturemanagesitsdatawell(andthisseemstobecorroboratedbytheAIDA assessmentdiscussedinSect. 3.5), butwearenotsuggestingthatotherdisciplinescouldorshouldsimplycopythisculture, sincetherearevariousreasons(cf, Sect. 1.3)whythiscultureisparticularlynaturalinsomeareas.
Therearehoweversomepracticeswhichwedobelievearestraightforwardlyportabletootherdisciplines. AswediscussinSect. 1.11, thenotionsof dataproducts and proprietaryperiods verynaturallyconcretizeotherwisediffusear-
34
ManagingResearchDatainBigScience
gumentsaboutdatamanagementandsharing, transformingthemfrom‘whether’and‘why’to‘which’and‘howlong’. Aswell, webelievethatembeddingdatamanagementintheday-to-daypracticeofresearcherslowerscostsinboththeshortterm(researcherscaneasilyre-findtheirowndata, andinterpretothers')andthelongterm(sincepreservationbecomesatechnicalproblemofconserv-inganin-userepository). WediscussthecostingofdatamanagementatslightlygreaterlengthinSect. 3.4.
Werepeatourexplicitrecommendationsbelow.
1. Datamanagersshouldconsideradoptingthelanguageofdataproductsandexplicitproprietaryperiodswhendesigninganddocumenting theirhold-ings (Sect. 1.11, p21).
2. Fundersshouldsimplyrequirethataprojectdevelopahigh-levelDMP planasasuitableprofileoftheOAIS specification [4] (Sect. 2.6, p26).
3. Fundersshouldsupportprojectsincreatingper-projectOAIS profileswhichareappropriatetotheprojectandmeetfunders'strategicprioritiesandre-sponsibilities (Sect. 2.6, p26).
4. STFC shoulddevelopacostingsmodelforthepublicationandpreservationofdata, whichismatchedtothedatachallengesofthebig-sciencecommu-nity (Sect. 3.4, p33).
Acknowledgements
ThisprojectwasfundedbyJISC,aspartofthe‘ManagingResearchData’pro-gramme. Wearemostgratefultothenumerouspeoplewhohavecommentedonvariousdraftsof this report, orprovideduswith informationor resources.Inparticular, wethankStuartAnderson(LIGO),PaulButterworth(NASA),HarryCollins(Cardiff), FernandoComerón(ESO),JoyDavidson(DCC),FrançoiseGen-ova(CDS),MagdalenaGetler(DCC),SimonHodson(JISC),SarahJones(DCC),DorotheaSalo(Wisconsin), AngusWhyte(DCC),andRoyWilliams(LIGO).
A Casestudy
WehaveproducedadetaileddiscussionofthestructureoftheLIGO workingdatamanagementsystem, asaseparatedocument [29]. Thisdocumentiscur-rentlyavailableonlywithinLIGO:thoseobservationswhichhavenotbeenin-corporated into thispresent reportareprobably toodetailed tobeofgeneralinterest. Wehope, however, thatthecase-studywillbeofsomeuseinternallytototheLSC.
B AIDA assessment
TheAIDA self-assessmenttoolkit [57] isa JISC-fundedsetofqualitativebench-marksforassessinghowdevelopedaninstitution'sarchiveis. SeeSect. 3.5 fordiscussion.
Thelabelsinthetablebelowaresometimesalittlecryptic; refertothefulltoolkitforusefulelaborations.
Theanswersbelowgenerallyrefertothe early2011 stateoftheLSC archivearrangements, on thegrounds thatconcreteanswers toavariantquestionarepreferabletospeculativeanswerstoafutureone. Theseareprobablyareason-ableindicationofthelikelystatusofaforthcomingformalarchive, butinafewcase, asnoted, wecangivenomeaningfulanswer.
Inthescoresbelow, level 1is‘poor’, andlevel 5is‘internationalexamplar’.
35
Gray, CarozziandWoan
Organisationalleg
1: institution-widemissionstatements(5) TheLIGO projecthaspreparedafor-malDMP,atfunderrequest
2: institutionalpoliciesforassetmanagement(3) LIGO haspreparedaformalDMP,andisaddressingpoliticalandculturalreservations, awaitingfundingandimplementation
3: reviewmechanismsatInstitutionallevel(4) Aswell as theDMP, there al-readyexistwell-understoodcollaboration-widereviewprocedures, andthesewillbeusedtoreviewtheplanonanannualbasis
4: institutionalcapabilityforsharingassets(3) Currentstorageis, ofnecessity,distributed; thecollaborationmanagesthisinformallybuteffectively, how-everthisisgenerallyworkingstorage, andnotregardedasarchivalstorage
5: institutionallevelofcontingencyplanning(3) Thereisnoformalcentralisedassetmanagement. Continuityisregardedasatechnicalmatterwhichcanreasonablybelefttotheprofessionalgoodpracticeofthesitesmanagingthedistributedstorage. Asbefore, thisiscurrentlyregardedasworkingratherthanarchivalstorage.
6: institutionalcapabilityforaudit(3) Extensivelogsexist, butarenotcentralisednorinanystandardformat; files, oncecreated, arenotexpectedtobemod-ified, thoughthereisnowaytoverifythatthisistrueinfact
7: institutionalmonitoringmechanisms(4.5) Alldataandprocessesareopentotheentirecollaboration, andmostprocessesarewidelydiscussed; thecollaborationisitsownuser-base. Thereare(bydesign)noexternalusersofthedata, noryetanyexternalreviewofthemechanisms.
8: extentofinstitutionalconformancetometadatamanagement(2to4) Meta-dataisdevisedinasomewhat adhoc waybyindividualinstrumentsorsoft-wareelements(stage2), butthisisalsoaddedandmanagedthoroughly, andinaccordancewithwhatisregardedasexperimentalgoodpractice(stage4)
9: extentofinstitutionalcontracts(3) Notapplicable tocurrentworkingstor-age
10: institutionalunderstandingofIPR (5) FormalMoUsbetweenpartnersregard-ingaccesstodata, andclearguidancefromfundersregardingtheeventualreleaseofthedata
11: institutionaldisasterplanning(2) Aswithassetcontinuity, thisiscurrentlyregardedasatechnicalmatterforstoragemanagers
Technologicalleg
1: institutionalinfrastructure(5) Thecollaborationhasconsiderabletechnicalresource, and interoperateswell. Planning is informalbuteffective. Thesophisticateduser-baseiscomfortablewiththisinformality, butthiscouldinprinciplebecomealiabilitywhentheresourcemanagementmovesfromadevelopmenttoaservicemodel.
2: appropriatenessofinstitutionaltechnologies(4) Thereisplentyofappropri-atetechnology, thoughtheplanforthearchivalmanagementofassetsisnotyetdetailed
36
ManagingResearchDatainBigScience
3: integrityofinstitutionalbackupandstorage(3) Importantdataisbackedup(possiblybymirroring), aspartofnormaloperations
4: institutionalprocesses(2) Uncertain: what there iswillbedoneaspartofnormaloperations
5: institutionalunderstandingofobsolescence(3.5) Highgeneralawareness, andoccasionaldiscussion, butatpresentlittleformalplanning
6: institutionalcapability(4) Changestoprocessesarewidelyandfranklydis-cussed, anddocumentedasinternalpublications; changeismanagedeffec-tively, butrelativelyinformally
7: institutionalcapabilityforsecurity(3) Thereisahighlevelofawarenessoftheneedtokeepthedataproprietary, butgiventhescientificcontext, therearenolikelyattackscenariosassuch; theproblemwill largelyevaporateoncethedataisfinallyreleasedpublicly
8: institutionalsecuritymechanisms(3.5) Noformalthreatanalyses, butthese-curityisprobablyappropriatetothelevelofthreat; day-to-dayattacks(ienotspecificallytargetedatthisdata)aretheresponsibilityofdistributedstorageandcomputingmanagers
9: institutionaldisasterplanandcapacityforbusinessrecovery(3) Notapplica-bletothecurrentexperimentalphase
10: institutionalcapacitytocreatemetadata(4) Almostallmetadataisaddedautomatically(compareorganisational.08)
11: effectivenessofanInstitution-widerepository(2) LIGO haspreparedafor-malDMP
Resourcesleg
1: institutionalbusinessplanning(2) LIGO ispreparingaformalDMP
2: institutionalcapacityforreview(4) DMP tobereviewedannually; projectasawholehascloserelationshipswithfundersandstakeholders
3: institutionalcapabilityforresourceallocation(4) Resourceplanning isco-ordinatedataseniorlevel
4: institutionalcapabilityforriskmanagement(2) Generalawarenessatpresent,butthisshouldbecomeclearerinfutureDMP iterations
5: institutionalbusinesstransparency(4.5) Dependingontheprecisemeaningintended, thiscouldbe4or5. Thereissubstantialauditingfromcollabora-tionfunders
6: institutionalcapacityforsustainablefunding(3.5) Good relationships withfundersmeanthatfundingisprobablypredictableonfive-toten-yeartime-scales, butunpredictableinthelongerterm. Howeverthemainfunder(NSF)hasexpressedastrategiccommitmenttolong-termdatapreservation.
7: institutionalstaffmanagement(3) Notapplicabletothecurrentexperimen-talphase
8: institutionalmanagementofstaffnumbers(3) Notapplicabletothecurrentexperimentalphase
9: institutionalcommitmenttostaffdevelopment(3) Notapplicabletothecur-rentexperimentalphase
37
Gray, CarozziandWoan
Aboutthisdocument
LIGO-P1000188-v10, 2011June29 First public version, available at https://dcc.ligo.org/cgi-bin/DocDB/ShowDocument?docid=p1000188
v1.1, 2012July14 Minorrevisions, someaddedmaterial, andtyposandminorerrorscorrected. Thereareacoupleofadditionalsections, butnochangestosectionorfigurenumbers. Thepaginationwillhavechangedinplaces.
38
ManagingResearchDatainBigScience
References
[1] NormanGrayandGrahamWoan. Digitalpreservationandastronomy:Lessonsforfundersandthefunded. InI. N.Evans, A. Accomazzi, D. J.Mink, andA. H.Rots, editors, ProceedingsofADASS XX,volume442ofASP ConferenceSeries, pages13–16.ASP,2011. Availablefrom:http://www.aspbooks.org/publications/442/013.pdf, arXiv:1103.2318.
[2] SimonHodson. Managingresearchdataprogramme[online]. Availablefrom: http://www.jisc.ac.uk/whatwedo/programmes/mrd [cited14May2010].
[3] Disseminationandsharingofresearchresults[online]. March2011.Availablefrom: http://www.nsf.gov/bfa/dias/policy/dmp.jsp [cited10May2011].
[4] Referencemodelforanopenarchivalinformationsystem(OAIS) –CCSDS650.0-B-1. CCSDS Recommendation, 2002. IdenticaltoISO 14721:2003.Availablefrom: http://public.ccsds.org/publications/archive/650x0b1.pdf.
[5] StuartAndersonandRoyWilliams. LIGO datamanagementplan. LIGOTechnicalReport, 2011. Availablefrom:https://dcc.ligo.org/public/0009/M1000066/014/LIGO-M1000066-v14.pdf.
[6] Harry M Collins. Gravity'sshadow: thesearchforgravitationalwaves.UniversityofChicagoPress, 2004.
[7] Harry M Collins. LIGO becomesbigscience. HistoricalStudiesinthePhysicalandBiologicalSciences, 33:261–297, 2003.doi:10.1525/hsps.2003.33.2.261.
[8] BrianMoe. LIGO datagrid: Datasetsizeestimates[online]. Availablefrom:https://www.lsc-group.phys.uwm.edu/lscdatagrid/resources/data/sizes.html[cited20May, 2010].
[9] A. R.Taylor. Thesquarekilometrearray. ProceedingsoftheInternationalAstronomicalUnion, 3(SymposiumS248):164–169, 2007.doi:10.1017/S1743921308018954.
[10] Cisco. Enteringthezettabyteera. WhitePaper, July2011. Availablefrom:http://www.cisco.com/en/US/solutions/collateral/ns341/ns525/ns537/ns705/ns827/VNI_Hyperconnectivity_WP.pdf.
[11] HighLevelExpertGrouponScientificData. Ridingthewave. Finalreport,EuropeanCommission, October2010. Availablefrom:http://cordis.europa.eu/fp7/ict/e-infrastructure/docs/hlg-sdi-report.pdf.
[12] AlexBall. Reviewofthestateoftheartofthedigitalcurationofresearchdata. Projectreporterim1rep091103ab12, UniversityofBath, 2010.Availablefrom: http://opus.bath.ac.uk/19022/.
[13] PARSE.InsightProject. Deliverabled3.3: Casestudiesreport. Projectdeliverable, 2010. Availablefrom: http://www.parse-insight.eu/downloads/PARSE-Insight_D3-3_CaseStudiesReport.pdf.
[14] N. Hambly, H. MacGillivray, M. Read, S. Tritton, E. Thomson, B. Kelly,D. Morgan, R. Smith, S. Driver, J. Williamson, Q. Parker, M. Hawkins,P. Williams, andA. Lawrence. TheSuperCOSMOS skysurvey–i.introductionanddescription. MonthlyNoticesoftheRoyalAstronomicalSociety, 326(4):1279–1294, 2001. arXiv:astro-ph/0108286,doi:10.1111/j.1365-2966.2001.04660.x.
[15] DerekJones. ThescientificvalueoftheCarteduCiel. Astronomy&Geophysics, 41(5):16–21, 2000. doi:10.1046/j.1468-4004.2000.41516.x.
[16] YudhijitBhattacharjee. Starsindustyfilingcabinets. Science,324(5926):460–461, April2009. doi:10.1126/science.324_460.
39
Gray, CarozziandWoan
[17] F. R.StephensonandL. V.Morrison. Long-termfluctuationsintheearth'srotation: 700BC toAD 1990. PhilosophicalTransactionsoftheRoyalSocietyofLondon.SeriesA:PhysicalandEngineeringSciences,351(1695):165–202, 1995. Availablefrom:http://rsta.royalsocietypublishing.org/content/351/1695/165.abstract,doi:10.1098/rsta.1995.0028.
[18] L. Jetsu, S. Porceddu, J. Lyytinen, P. Kajatkari, J. Lehtinen, T. Markkanen,andJ. Toivari-Viitala. DidtheancientegyptiansrecordtheperiodoftheeclipsingbinaryAlgol–theRagingone? Preprint, April2012. ToappearinAstron.Astrophys. arXiv:1204.6206.
[19] G. J.Toomer. A surveyoftheToledanTables. Osiris, 15:pp.5–174, 1968.Availablefrom: http://www.jstor.org/stable/301687.
[20] FITS WorkingGroup. Definitionoftheflexibleimagetransportsystem(FITS). Technicalreport, Commission5oftheInternationalAstronomicalUnion, 2008. Availablefrom:http://fits.gsfc.nasa.gov/fits_standard.html.
[21] William D Pence, L. Chiappetti, Clive G Page, R. A.Shaw, andE. Stobie.DefinitionoftheFlexibleImageTransportSystem(FITS),version3.0.AstronomyandAstrophysics, 524:A42+, December2010.doi:10.1051/0004-6361/201015362.
[22] EuropeanSpaceAgency. TheHipparcosandTychocatalogues. Online,1997. ESA SP-1200. Availablefrom:http://www.rssd.esa.int/index.php?project=HIPPARCOS.
[23] F. Genova, D. Egret, O. Bienaymé, F. Bonnarel, P. Dubois, P. Fernique,G. Jasniewicz, S. Lesteven, R. Monier, F. Ochsenbein, andM. Wenger.TheCDS informationhub. Astron.Astrophys.Suppl.Ser., 143(1):1–7,2000. doi:10.1051/aas:2000333.
[24] F. Ochsenbein, P. Bauer, andJ. Marcout. TheVizieR databaseofastronomicalcatalogues. Astron.Astrophys.Suppl.Ser., 143(1):23–32,2000. doi:10.1051/aas:2000169.
[25] AndrewCurry. Rescueofolddataofferslessonforparticlephysicists.Science, 331(6018):694–695, February2011.doi:10.1126/science.331.6018.694.
[26] David M South. Datapreservationinhighenergyphysics. In Proceedingsofthe18thInternationalConferenceonComputinginHighEnergyandNuclearPhysics(CHEP 2010), January2011. arXiv:1101.3186.
[27] MaxBoisot, MarkusNordberg, SaïdYami, andBertrandNicquevert,editors. CollisionsandCollaboration. OxfordScholarshipOnline, 2011.doi:10.1093/acprof:oso/9780199567928.001.0001.
[28] AvgoustaKyriakidou-ZacharoudiouandWillVenters. Distributedlarge-scalesystemsdevelopment: Exploringthecollaborativedevelopmentoftheparticlephysicsgrid. In 5thInternationalconferenceone-SocialScience, Cologne, 2009. Availablefrom:http://ssrn.com/abstract=2109945.
[29] Tobia D Carozzi, NormanGray, andGrahamWoan. LIGO data: acase-studyinbigsciencedatabases. TechnicalcasestudyT1000368, LSC,2010.
[30] MatthewPitkin, StuartReid, SheilaRowan, andJimHough. Gravitationalwavedetectionbyinterferometry(groundandspace). LivingReviewsinRelativity, 14(5), 2011. Availablefrom:http://relativity.livingreviews.org/Articles/lrr-2011-5/.
[31] LSC andVirgo. MemorandumofunderstandingbetweenVIRGO andLIGO. LSC MemorandumM060038, LSC andVirgoconsortia, 2009.
40
ManagingResearchDatainBigScience
Version1, of2009March11. Availablefrom:https://dcc.ligo.org/cgi-bin/DocDB/ShowDocument?docid=1156.
[32] LSC. BylawsoftheLIGO scientificcollaboration. LSC MemorandumM050172, LSC,2010. Version5, of2010January15. Availablefrom:https://dcc.ligo.org/cgi-bin/DocDB/ShowDocument?docid=8432.
[33] LSC. LIGO ScientificCollaborationpublicationandpresentationpolicy.LSC MemorandumT010168, LSC,2007. Version3, of2007August8.Availablefrom:https://dcc.ligo.org/cgi-bin/DocDB/ShowDocument?docid=26956.
[34] VeerleVan denEynden, LibbyBishop, LaurenceHorton, andLouiseCorti.Datamanagementpracticesinthesocialsciences. Technicalreport, UKDataArchive, July2010. Availablefrom: http://www.data-archive.ac.uk/media/203597/datamanagement_socialsciences.pdf.
[35] Harry M Collins. Tacitknowledge, trustandtheQ ofsapphire. SocialStudiesofScience, 31(1):71–85, 2001. doi:10.1177/030631201031001004.
[36] Harry M CollinsandRobertEvans. RethinkingExpertise. ChicagoUniversityPress, 2007.
[37] SeanBechhofer, JohnAinsworth, JitenBhagat, IainBuchan, PhilipCouch,DonCruickshank, David DeRoure, MarkDelderfield, IanDunlop,MatthewGamble, CaroleGoble, DaniusMichaelides, PaoloMissier,StuartOwen, DavidNewman, andShoaibSufi. Whylinkeddataisnotenoughforscientists. In IEEE InternationalConferenceoneScience, pages300–307, LosAlamitos, CA,USA,2010.IEEE ComputerSociety.doi:10.1109/eScience.2010.21.
[38] AsgerAaboe. Babylonianmathematics, astrologyandastronomy. In TheCambridgeAncientHistory, volume3(part2), pages276–292.CambridgeUniversityPress, 1991. doi:10.1017/CHOL9780521227179.
[39] RussellHobson. TheExactTransmissionofTextsintheFirstMillenniumBCE:AnExaminationoftheCuneiformEvidencefromMesopotamiaandtheTorahScrollsfromtheWesternShoreoftheDeadSea. PhD thesis,UniversityofSydney, 2009. Availablefrom: http://ses.library.usyd.edu.au/bitstream/2123/5404/1/r-hobson-2009-thesis.pdf.
[40] AsgerAaboe. Scientificastronomyinantiquity. Phil.Trans.R.Soc.Lond.,A276:21–42, 1974. doi:10.1098/rsta.1974.0007.
[41] AlbertoAccomazzi. Theroleofrepositoriesandjournalsintheastronomyresearchlifecycle. SlidesfromtalkatAstroinformatics2010, Pasadena,June2010. Availablefrom:http://www.astroinformatics2010.org/pdfs/Accomazzi.pdf.
[42] PeterFox, Deborah L.McGuinness, LucaCinquini, PatrickWest, JoseGarcia, James L.Benedict, andDonMiddleton. Ontology-supportedscientificdataframeworks: TheVirtualSolar-TerrestrialObservatoryexperience. Computers&Geosciences, 35(4):724–738, 2009.doi:10.1016/j.cageo.2007.12.019.
[43] MichaelFactor, DalitNaor, SimonaRabinovici-Cohen, LeeatRamati, PetraReshef, andJulianSatran. Theneedforpreservationawarestorage. ACMSIGOPS OperatingSystemsReview, 41(1):19–23, 2007. Availablefrom:http://www.research.ibm.com/haifa/projects/storage/datastores/papers/preservation_data_store_osr07_dec_30.pdf, doi:10.1145/1228291.1228298.
[44] NeilBeagrie, BrianLavoie, andMatthewWoollard. Keepingresearchdatasafe2. JISC ProjectReport, April2010. Availablefrom:http://www.jisc.ac.uk/media/documents/publications/reports/2010/keepingresearchdatasafe2.pdf.
41
Gray, CarozziandWoan
[45] PeterArzberger, PeterSchroeder, AnneBeaulieu, GeofBowker, KathleenCasey, LeifLaaksonen, DavidMoorman, PaulUhlir, andPaulWouters.Scienceandgovernment: Aninternationalframeworktopromoteaccesstodata. Science, 303(5665):1777–1778, 2004. Availablefrom:http://www.sciencemag.org/cgi/reprint/303/5665/1777.pdf,doi:10.1126/science.1095958.
[46] RaivoRuusalepp. Infrastructureplanninganddatacuration: Acomparativestudyofinternationalapproachestoenablingthesharingofresearchdata. Technicalreport, DigitalCurationCentre, November2008.Availablefrom:http://www.dcc.ac.uk/docs/publications/reports/Data_Sharing_Report.pdf.
[47] NationalScienceFoundation. Grantgeneralconditions(GC-1). TechnicalReportgc1010, NationalScienceFoundation, 2010. Availablefrom:http://www.nsf.gov/publications/pub_summ.jsp?ods_key=gc1010.
[48] Brian F Lavoie. Theopenarchivalinformationsystemreferencemodel:Introductoryguide. DPC TechnologyWatchSeriesReport04-01, OCLC,January2004. Availablefrom:http://www.dpconline.org/docs/lavoie_OAIS.pdf.
[49] DavidS H Rosenthal, ThomasRobertson, TomLipkis, VickyReich, andSethMorabito. Requirementsfordigitalpreservationsystems: Abottom-upapproach. D-LibMagazine, 11(11), 2005. Availablefrom:http://www.dlib.org/dlib/november05/rosenthal/11rosenthal.html.
[50] CASPAR Consortium. CASPAR D4104: Validation/evaluationreport. FP6projectdeliverable, October2009. Availablefrom:http://www.casparpreserves.eu/Members/cclrc/Deliverables/caspar-validation-evaluation-report/at_download/file.
[51] DigitalCurationCentre. DCC curationlifecyclemodel[online]. 2010.Availablefrom: http://www.dcc.ac.uk/resources/curation-lifecycle-model[cited22June2011].
[52] Christopher E Lee. Takingcontextseriously: A frameworkforcontextualinformationindigitalcollections. TechnicalReportSILS TechnicalReport2007-04, UniversityofNorthCarolina, October2007. Availablefrom:http://sils.unc.edu/sites/default/files/general/research/TR_2007_04.pdf.
[53] BrianHole, Li Lin, PatrickMcCann, andPaulWheatley. LIFE3: Apredictivecostingtoolfordigitalcollections. In iPres, September2010.SubmittedtoiPres. Availablefrom:http://www.life.ac.uk/3/docs/Ipres2010_life3_submitted.pdf.
[54] PaulEglitisandDieterSuchar. Historicallessons, inter-disciplinarycomparison, andtheirapplicationtothefutureevolutionoftheESOarchivefacilityandarchiveservices. In EnsuringLong-TermPreservationandAddingValuetoScientificandTechnicalData.EuropeanSpaceAgency, December2009. Availablefrom: http://www.sciops.esa.int/SYS/CONFERENCE/include/pv2009/papers/34_Eglitis_ESO_ARCHIVE.pdf.
[55] MansurDarlington, AlexBall, TomHoward, SteveCulley, andChrisMcMahon. RAID associativetoolrequirementsspecification(version1.0).TechnicalReportERIM Projectdocumenterim6rep101111mjd10,UniversityofBath, 2011. Toappear. Availablefrom:http://opus.bath.ac.uk/22811.
[56] NASA CostAnalysisDivision. Costestimatinghandbook(2008edition).Online, 2008. Seealsohttp://cost.jsc.nasa.gov/. Availablefrom:http://www.nasa.gov/offices/pae/organization/cost_analysis_division.html.
[57] UniversityofLondonComputerCentre. TheAIDA self-assessmenttoolkitmarkII,February2009. Availablefrom:http://aida.jiscinvolve.org/toolkit.
42
ManagingResearchDatainBigScience
Glossary
Termsmarked‘OAIS’arecopiedfromtheOAIS specification[4, §1.7.2].
ADS AstrophysicalDataService: abibliographicarchiveforastronomy, basedattheHarvard-SmithsonianCenterforAstrophysics; ADS preservesfull-textcopiesofjournalarticles, bothincollaborationwithpublishers, andthroughadigitizationprocess, andmaintainsawidely-usedbibliographicID system(http://ads.harvard.edu). 10, 11, 19
AIP Archival InformationPackage: An InformationPackage, consistingof theContentInformationandtheassociatedPreservationDescriptionInforma-tion, whichispreservedwithinanOAIS (OAIS).20, 23, 28
aLIGO AdvancedLIGO:Thesuccessorproject toLIGO,duetostart in2015.14, 15, 24, 30
arXiv A electronicpreprintservice, see http://arxiv.org. ThearXivstartedintheearly90s, basedonFTP andemail. Itinitiallyservicedparticlephysicsandastronomy, buthasexpandedtocoverotherareasofphysics, mathematics,andsomeareasofcomputingscience. 10
ATLAS OneofthefourdetectorsattheLHC,andoneofthetwolargeones. 6,14
bigscience A classofscienceprojectscharacterisedbybeinginternational, highlycollaborativeandexpensivelyfunded(seeSect. 1.1 formorediscussion). 4,5, 34
catalogue Intheastronomicalcontext, acatalogueisatableofpositionsandotherinformationforstarsorotherotherastronomicalobjects. 9--11
CCSDS ConsultativeCommitteeforSpaceDataSystems: authorsof theOAISreferencemodel, see http://www.ccsds.org. 25
CDS StrasbourgDataCentre: (see http://cdsweb.u-strasbg.fr/ andSect. 1.4.1).11, 32
CMS CompactMuonSolenoid: OneofthefourdetectorsattheLHC,andoneofthetwolargeones. 14
ContentInformation Thesetofinformationthatistheoriginaltargetofpreser-vationbytheOAIS (OAIS).17, 19
DataObject EitheraPhysicalObjectoraDigitalObject(OAIS) (thatis, the`DataObject'isthesequenceofbits, orthephysicalobjectwhichis thedata inthemostprimitivesense). 24
dataproducts Formaldataoutputsfromanobservatory, instrumentorprocess(seeSect. 1.4). 10
datasharing Theformalisedpracticeofmakingsciencedatapubliclyavailable.22
DCC DigitalCurationCentre: http://www.dcc.ac.uk (nottobeconfusedwiththeLSC DocumentControlCenter). 4, 5, 27, 28
DesignatedCommunity AnidentifiedgroupofpotentialConsumerswhoshouldbeabletounderstandaparticularsetofinformation(OAIS).11, 17, 24, 25,27, 28, 31
43
Gray, CarozziandWoan
DIP DisseminationInformationPackage: TheInformationPackage, derivedfromoneormoreAIPs, receivedbytheConsumerinresponsetoarequesttotheOAIS (OAIS).20, 28, 33
DMP DataManagementandPreservation. 12, 16, 26, 29--31, 34
ESA EuropeanSpaceAgency: http://www.esa.int. 6, 11, 12, 30
ESO EuropeanSouthernObservatory: A pan-europeanagencyrunningasetofsouthern-hemispheretelescopes http://www.eso.org. 32
ESRC EconomicandSocialResearchCouncil: theprincipalsocialsciencefun-derintheUK,see http://www.esrc.ac.uk. 17
FITS FlexibleImageTransportSystem: thestandardfileformatinastronomy, seehttp://fits.gsfc.nasa.gov. 10
GEO A German-Britishconsortium, responsiblefortheGEO600interferometer,fundedjointlybySTFC andtheGermangovernment. 14
GEO600 TheGEO observatorylocatednearHannoverinGermany. 14
GW GravitationalWave. 4, 5, 8, 12, 14
HEP HighEnergyPhysics. 8, 12, 13
HERA A particleacceleratorattheGermanDESY facility. 13
InformationPackage TheContentInformationandassociatedPreservationDe-scriptionInformationwhichisneededtoaidinthepreservationoftheCon-tentInformation. TheInformationPackagehasassociatedPackagingInfor-mationusedtodelimitandidentifytheContentInformationandPreservationDescriptionInformation(OAIS).20, 25
IVOA InternationalVirtualObservatoryAlliance: theconsortiumwhichdefinesVO standards. 12, 19
JISC JointInformationSystemsCommittee: TheorganisationresponsibleforthemaintenanceandeffectiveexploitationoftheacademiccomputingnetworkintheUK,andthefundersofthispresentreport. 4, 5, 33, 35
KRDS KeepingResearchDataSafe: JISC projectdevelopinganddocumentingdatapreservationtoolsandstudies; see http://www.beagrie.com/krds.php and[44]. 21
LHC TheLargeHadronCollideratCERN:theacceleratoristhehostfortwolargegeneralpurposedetectors(ATLAS andCMS) andtwosmallerones(ALICEandLHCb). 6, 12, 13
LIGO LaserInterferometerGravitational-waveObservatory: thehardware, com-prisingLIGO LabandGEO (see http://ligo.org andSect. 1.6.1). 5, 6, 12,14, 16, 17, 30
LIGO Lab TheCaltech/MIT consortium, fundedbyNSF todesignandruntheLIGO interferometersintheUS,see http://www.ligo.caltech.edu. 14
LongTerm A periodoftimelongenoughfortheretobeconcernabouttheim-pactsofchangingtechnologies, includingsupportfornewmediaanddataformats, andofachangingusercommunity, ontheinformationbeingheldinarepository. Thisperiodextendsintotheindefinitefuture(OAIS).8, 28
44
ManagingResearchDatainBigScience
LSC LIGO ScientificCollaboration: ThenetworkofresearchgroupscontributingefforttotheLIGO experimentanddataanalysis, see http://ligo.org. 4, 5,14, 16
LVC A data-sharingagreementbetweentheLSC andtheVirgoCollaboration(seeSect. 1.6.1). 5, 14
MOU MemorandumofUnderstanding: therelationshipsbetweenthevariousparticipatingentitiesandtheLSC isarticulatedthroughaseriesofannuallyreviewedMOUs. 14
MRD ManagingResearchData: A fundingprogrammewithintheJISC e-Researchtheme, see http://www.jisc.ac.uk/whatwedo/programmes/mrd. 4, 23
NASA NationalAeronauticsandSpaceAdministration: TheUS spaceagencyhttp://www.nasa.gov. 6, 9, 30, 32, 33
NSF NationalScienceFoundation: theprincipal(non-defence)sciencefunderintheUSA.4, 22, 30
NSSDC NationalSpaceScienceDataCenter: thepermanentarchiveforNASAspacesciencemissiondata http://nssdc.gsfc.nasa.gov. 30
PDS PlanetaryDataSystem: TheNASA dataarchiveandstandardset http://pds.nasa.gov/. 30, 32
pipeline A software system (or sometimesa software-hardwarehybrid)whichtransforms rawdata intomoreormore levelsofdataproduct. Thedatareductionpipelines, whichmustbeabletokeepupwiththerateatwhichdataisacquired, andwhichareassembledfromamixtureofstandardandcustomsoftwarecomponents, generallyabsorbasignificantfractionofthetotaldevelopmentbudgetofanewinstrument. 11, 14, 15, 29
rawdata Thedataextracteddirectlyfromaninstrumentorobservation; sinceitisuncalibratedanduncorrected, itisgenerallyoflittleusetothosenotintimatelyfamiliarwiththeinstrument(seeSect. 1.4). 10, 11
RepresentationInformation TheinformationthatmapsaDataObjectintomoremeaningfulconcepts(OAIS).11, 13, 17, 19, 24, 28
RepresentationNetwork The set of Representation Information that fully de-scribesthemeaningofaDataObject. RepresentationInformationindigitalformsneedsadditionalRepresentationInformationsoitsdigitalformscanbeunderstoodovertheLongTerm(OAIS).24, 28
SIP SubmissionInformationPackage: AnInformationPackagethatisdeliveredbytheProducertotheOAIS foruseintheconstructionofoneormoreAIPs(OAIS).20, 28
SKA SquareKilometreArray: alowfrequencyradiotelescopewithalarge(onesquarekilometre)collectingarea. 6
STFC ScienceandTechnologyFacilitiesCouncil: theprimaryUK funderoffacility-scalescience, see http://www.stfc.ac.uk. 4, 5, 22
straindata ThefundamentalGW signal. 15
Virgo Italian-Frenchgravitational-wavedetector http://www.virgo.infn.it/. 14
VizieR A repositoryofastronomicalcataloguedataatCDS (seeSect. 1.4.1). 10,11
VO VirtualObservatory: asetofdatasharingargreementsandprotocols. SeeSect. 1.10 (nottobeconfusedwithgridVirtualOrganisations). 5, 19, 20
45
Gray, CarozziandWoan
IndexSeealsotheGlossary, whichaddition-allyservesastheindexofacronyms
AIDA,33, 35astronomydata, 8–11
Babylon, 10, 18benefits, 21, 22bigscience, 6–17
Caltech, 9climatedata, 23costs, 23, 30–33
datagravitationalwave, 14–16ingest, 33volume, 6, 7
dataproducts, 6, 10, 11, 15, 16, 20,21, 23, 25
DCC lifecycle, 28, 29
GAMA,12
HEP data, 8, 13, 14HerMES,12Herschel, 9, 12Hipparchus, 19Hipparcos, 11
LIGOAdvanced, see: glossary: aLIGODMP,26, 30, 31, 33
OAIS,5, 8, 11, 13, 23, 26–28opendata, 6, 22, 23
privatefacilities, 9proprietarydata, 9, 16, 21, 31Ptolemy, 19
rawdata, 10, 15, 16preservation, 25utility, 23
socialsciences, 17softwarepreservation, 17, 25, 29, 30
UKIDSS,12
virtualobservatory, 19, 20
WFAU,Edinburgh, 20
46
Top Related