Post on 05-Jul-2015
a centre of expertise in data curation and preservation
Funded by:
This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 2.5 UK: Scotland License, excluding content property of others. To view a copy of this license, (a) visit http://creativecommons.org/licenses/by-nc-sa/2.5/scotland/ ; or (b) send a letter to Creative Commons, 543 Howard Street, 5th Floor, San Francisco, California, 94105, USA.
“Tomorrow, and tomorrow, and tomorrow”:
the players on the curation stage
Chris Rusbridge
Presentation at OCLC
OCLC October 2006
"To-morrow, and to-morrow, and to-morrow,
Creeps in this petty pace from day to day,
To the last syllable of recorded time;
And all our yesterdays have lighted fools
The way to dusty death.
Out, out, brief candle!
Life's but a walking shadow; a poor player,
That struts and frets his hour upon the stage,
And then is heard no more: it is a tale
Told by an idiot, full of sound and fury,
Signifying nothing."
Shakespeare: Macbeth
•Dunsinane Hill
•Photo by Fabrice
Contents
• Curation and the Digital Curation Centre
• Science and Data Citations
• The “poor players” of data curation
• Sustainability of curated data
• Macbeth again…
Curation
• Data increasingly important as evidence
• Experimental verifiability (the basis of science)
• Unrepeatable observations & experiments (particularly environmental in broadest sense)
• Legal, compliance & transactions
• Cultural resources
• “Preservation” view vs “Publishing” view
Lynch remarks
• Closing the Curation Conference
• 3 views of digital curation
• Finite process, handover to preservation
• Whole life process, evolving object(s)
• Collection as a living thing
Digital curation?
Digital preservation
Static
For later use
Digital curation?
Digital preservation | Digital curation
Static | Dynamic
Long-term
For later use | In use now (and the future)
Digital curation
Digital curation & preservation
Static | Dynamic
Long-term
For later use | In use now (and the future)
“maintaining and adding value to a trusted body of digital information for current and future use”
Mission
“The over-riding purpose of the DCC is to support and promote continuing improvement in the quality of data curation, and of associated digital preservation”
Organisation to Engage & Collaborate
Industry
research collaborators
standards bodies
testbeds & tools
communities of practice: users
community support & outreach
research
development co-ordination
service definition & delivery
management & admin support
Associates Network
curation organisations eg DPC
Organisation to Engage & Collaborate: Leads
Industry
research collaborators
standards bodies
testbeds & tools
communities of practice: users
Bath
Edinburgh
CCLRC
Glasgow
Edinburgh
Associates Network
curation organisations eg DPC
Associated work
• DCC LOCKSS Technical Support Service (Lots of Copies Keep Stuff Safe)
• DCC SCARP Project
• Disciplinary approaches to sharing, curation, re-use and preservation
• EU projects associated
• CASPAR
• Digital Preservation Europe
• PLANETS
Phase 2
• Externally-moderated, reflective self-evaluation completed
• Phase 2 proposal (2007/10) to JISC
• Accepted: focus on science data, reduced scale
• EPSRC-funded Research continues until 2007/8
2nd International Digital Curation Conference
• Research & invited presentations
• Glasgow, 21/22 November, 2006
• Please register at: http://www.dcc.ac.uk/events/dcc-2006/
Data resource stages
• Curated data is created…
• Observations? Fixed!
• Or Acquired…
• Data brought/bought from outside
• Ingest
• Development
• Derived, refined, combined, processed data
• Potentially many stages
SDSS (Visual)
TWOMASS (Infrared)
Slide from Rajendra Bose
Slide from Rajendra Bose
New discovery…
• National Virtual Observatory
• Johns Hopkins press release: “Scientists working to create the NVO, an online portal for astronomical research unifying dozens of large astronomical databases, confirmed discovery of [a] new brown dwarf recently. The star emerged from a computerized search of information on millions of astronomical objects in two separate astronomical databases. Thanks to an NVO prototype, that search, formerly an endeavor requiring weeks or months of human attention, took approximately two minutes.”
Context
• Data meaningless without context
• Linkage
• Metadata of many kinds
• Workflow!
• Provenance
• Computational lineage
• Authenticity
[Diagram of a data processing workflow; recoverable labels: Csat, E0, SST and PAR (8-day composite and subscene); Pbopt calc, Ctot calc, Zeu calc, PPeu calc; HRPT; NASA; University research group 1; University research group 2; research group 3; local decision-making body]
Slide from Rajendra Bose
Access and re-use
• Ethics and rights control access
• Weak in expressing this long-term
• Collaboration tools
• Annotation, discussion, review
• Re-use leading to change and development
• “Publication”
• Not just in “print”
• Underlying data should be “published”, too
• Citation…
CLADDIER citation investigation
“My last example was an MST data set held at the BADC, and I was suggesting something like this (for a citation):
<Citation>
  <Author> Natural Environment Research Council </Author>
  <Title> Mesosphere-Stratosphere-Troposphere Radar at Aberystwyth </Title>
  <Medium> Internet </Medium>
  <Publisher> British Atmospheric Data Centre (BADC) </Publisher>
  <PublicationDate status="ongoing"> 1990</PublicationDate>
  <Identifier> badc.nerc.ac.uk/data/mst/v3/upd15032006</Identifier>
  <Feature>
    <FeatureType>http://featuretype.registry/verticalProfile</FeatureType>
    <LocalID>200409031205</LocalID>
  </Feature>
  <AccessDate> Sep 21 2006 </AccessDate>
  <AvailableAt><url>http://badc.nerc.ac.uk/data/mst/v3/</url></AvailableAt>
</Citation>
(Made up tags!)”
• Bryan Lawrence Weblog
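To make the structure of that suggested citation concrete, here is a minimal Python sketch that reads the quoted XML with the standard library and pulls out the fields a reader would need. The tags are the "made up" ones from the weblog quote, not a real schema; the function name `parse_citation` and the dictionary keys are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET

# The citation quoted on the slide (tags are illustrative, not a real schema).
CITATION_XML = """<Citation>
  <Author> Natural Environment Research Council </Author>
  <Title> Mesosphere-Stratosphere-Troposphere Radar at Aberystwyth </Title>
  <Medium> Internet </Medium>
  <Publisher> British Atmospheric Data Centre (BADC) </Publisher>
  <PublicationDate status="ongoing"> 1990</PublicationDate>
  <Identifier> badc.nerc.ac.uk/data/mst/v3/upd15032006</Identifier>
  <Feature>
    <FeatureType>http://featuretype.registry/verticalProfile</FeatureType>
    <LocalID>200409031205</LocalID>
  </Feature>
  <AccessDate> Sep 21 2006 </AccessDate>
  <AvailableAt><url>http://badc.nerc.ac.uk/data/mst/v3/</url></AvailableAt>
</Citation>"""


def parse_citation(xml_text):
    """Return the citation fields as a plain dict (hypothetical key names)."""
    root = ET.fromstring(xml_text)

    def text(path):
        # Missing or empty elements come back as None instead of raising.
        node = root.find(path)
        return node.text.strip() if node is not None and node.text else None

    published = root.find("PublicationDate")
    return {
        "author": text("Author"),
        "title": text("Title"),
        "publisher": text("Publisher"),
        "first_published": text("PublicationDate"),
        # status="ongoing" marks a dataset that is still being added to.
        "ongoing": published is not None and published.get("status") == "ongoing",
        "identifier": text("Identifier"),
        "feature_type": text("Feature/FeatureType"),
        "local_id": text("Feature/LocalID"),
        "accessed": text("AccessDate"),
        "url": text("AvailableAt/url"),
    }


citation = parse_citation(CITATION_XML)
print(citation["title"], "via", citation["url"])
```

The point of the sketch is that the dataset-level identifier, the feature-level LocalID and the access date are separate fields, so a citation can name both the collection and the specific item within it.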
CLADDIER 2: “Version of record”
• Role of Publisher: add value
• provision of catalogue metadata
• some commitment to maintenance of the resource at the AvailableAt url
• some commitment to the resource being conformant to the description of the Feature
• some commitment to the maintenance of the mapping between the identifier [LocalID] and the resource.
• Bryan Lawrence Weblog
CLADDIER 3: persistence
• Wayback Machine
• Only snapshots (eg only 2004 version of Bryan’s home page!)
• WebCite
• allows the creator of content to submit URLs for [archiving], thus ensuring when one writes an academic document, the material will be archived, and the citation will be persistent
• But no real help for data…
• “… only allow [data citation] when we believe in the persistence of the organisation making the data available…”
• Bryan Lawrence Weblog
Citation
OWL Web Ontology Language Reference
W3C Proposed Recommendation 15 December 2003
This version: http://www.w3.org/TR/2003/PR-owl-ref-20031215/
Latest version: http://www.w3.org/TR/owl-ref/
Previous version: http://www.w3.org/TR/2003/CR-owl-ref-2003081
• Needs a stable resource to cite…
• (FRBR works & expressions?)
Citation…
• The date alone (as in common web citation approaches) is not enough!
• Cited object likely to have changed…
• Citation should link to the cited object as it was!
[6] The CIA World Factbook. www.cia.gov/cia/publications/factbook/. Retrieved on 8 Jan 2006.
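One way to see why a retrieval date alone is not enough: the date says when the object was read, but gives no means of checking whether what is at the URL now is what was cited. The sketch below is an illustration only, not a method proposed in the talk. It records a SHA-256 fingerprint of the retrieved content alongside the date; the function names and the sample content bytes are invented for the example.

```python
import hashlib
from datetime import date


def fingerprint(content: bytes) -> str:
    """SHA-256 digest of the cited object's bytes, as a hex string."""
    return hashlib.sha256(content).hexdigest()


def cite(title: str, url: str, content: bytes, retrieved: date) -> dict:
    """Build a citation record that pins the content seen on the retrieval date."""
    return {
        "title": title,
        "url": url,
        "retrieved": retrieved.isoformat(),
        "sha256": fingerprint(content),
    }


def still_matches(citation: dict, current_content: bytes) -> bool:
    """True only if the object now at the URL is byte-identical to what was cited."""
    return citation["sha256"] == fingerprint(current_content)


# Made-up page content standing in for what the citing author actually saw.
seen_in_2006 = b"example page content as retrieved"
record = cite(
    "The CIA World Factbook",
    "www.cia.gov/cia/publications/factbook/",
    seen_in_2006,
    date(2006, 1, 8),
)

print(still_matches(record, seen_in_2006))
print(still_matches(record, b"example page content after an update"))
```

A fingerprint only detects that the cited object has changed; it cannot bring the earlier state back. Linking to the object "as it was" still needs an archive that keeps past states.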
Citation needs…
• An efficient way to reference and access “archived” past states of a changing dataset (work in progress, Buneman et al)
• Not important for original observations
• Don’t mess with those data
• Less important for incremental datasets
• Later stuff should not invalidate earlier
• Very important for revisable datasets
• Eg Genomics… datasets that result from the combined work of curators, or contain opinions or facts likely to change
• Eg Mapping… OS maps represent a huge database that changes on a daily basis
XMLArch: System Architecture
[Diagram; recoverable labels: Relational Database; Data Extractor; XML Snapshot at time t; Pre-processor; Version Merger; XML Archiver; XML Archive at time t - 1; XML Archive at time t]
• Carwyn Edwards
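The diagram's labels suggest a pipeline in which each new snapshot is merged into the archive from the previous time step. As a rough illustration of that general idea, and not the XMLArch implementation itself, the Python sketch below merges successive snapshots into one structure that stores each value once, together with the range of versions in which it held. The class `VersionedArchive`, its methods and the flat key/value snapshot model are all assumptions made for this example; real XML archiving works on keyed tree nodes.

```python
from dataclasses import dataclass, field


@dataclass
class VersionedArchive:
    """Toy merged archive.

    For each key, `runs` holds a list of (first_version, last_version, value)
    tuples, so an unchanged value is stored once however many versions it spans.
    """
    latest: int = 0
    runs: dict = field(default_factory=dict)

    def merge(self, snapshot: dict) -> int:
        """Merge a complete snapshot as the next version; return its number."""
        self.latest += 1
        version = self.latest
        for key, value in snapshot.items():
            history = self.runs.setdefault(key, [])
            if history and history[-1][1] == version - 1 and history[-1][2] == value:
                # Same value as in the previous version: extend the existing run.
                first, _, _ = history[-1]
                history[-1] = (first, version, value)
            else:
                # New key, changed value, or key reappearing after a gap.
                history.append((version, version, value))
        # Keys absent from this snapshot are simply not extended,
        # so they read as deleted from this version onwards.
        return version

    def snapshot_at(self, version: int) -> dict:
        """Reconstruct the dataset exactly as it stood at the given version."""
        state = {}
        for key, history in self.runs.items():
            for first, last, value in history:
                if first <= version <= last:
                    state[key] = value
                    break
        return state


archive = VersionedArchive()
archive.merge({"gene42": "function unknown", "gene7": "kinase"})
archive.merge({"gene42": "transporter", "gene7": "kinase"})
archive.merge({"gene7": "kinase"})

for v in (1, 2, 3):
    print(v, archive.snapshot_at(v))
```

With a structure like this, a citation can carry a version number and the cited state can be rebuilt on demand, which is the kind of access to "archived" past states that revisable datasets need.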
Who are the curation players?
Curation: Individual
• “Small science” 2-3 times more data than “Big science”, but much more at risk
• PhD student? RA? PI? Administrator? IT support?
• Data potentially on local hard drives, or at best shared network drives
• May be inadequately protected
• Liable for policy-led deletion on resignation
• Individual “knows” too much
• Documentation/metadata unlikely to be adequate
• Tomorrow: gone!
Department: eCrystals
• Specialist department archive (& national service)
• Workflow recording of lab parameters (R4L)
• Public & private elements
• Trying to build eCrystals federation (eBank 3)
• But… ReciprocalNet? French COD efforts? Fragmented discipline!
• Tomorrow: likely to continue
Institution: Cambridge Chemistry
• 175,000 small molecule structures in CML
• Alongside Archaeology, Manuscripts, Learning Materials, etc
• No library curation skills; dependent on research group enthusiast
• Collection isolated from other Chemistry
• Tomorrow: assured…
Community: CDL
• Shared effort from group of institutions
• Comparison OhioLink?
• Document tradition, not data
• Passive role re collections
• Rely on departmental & domain expertise
• Tomorrow: assured…
Community: SDSC?
• Data specialists
• Multiple disciplines
• Distinct from domains; curation dependent on external expertise
• Research ethos
• Tomorrow: dependent on grant/contract income & research priorities
Community: LOCKSS?
• Self-selected group of collectors: closest to genuine open activity (despite Alliance)?
• Traditionally libraries collecting eJournals
• Model respects IPR
• No domain expertise; rely on origins
• Data limitations…
• Tomorrow: potentially very persistent (low cost, high reliability, attack resistance, distributed)
Discipline: Archaeology
• Staffed by archaeologist curators
• Understand special legal issues
• Strong relationship with community & peers
• Internationally still fragmented?
• Tomorrow: dependent on research council grants + deposit funding
Discipline: Astronomy
• Part of major international effort
• Expensive shared facilities, global reach
• Well integrated into community
• Enable new science
• Tomorrow: assured by community (another large facility)
Discipline: Atmosphere
• Strong believer in need for domain scientists as curators
• Significant participant in “community proxy” agenda-setting activities
• Internationally fragmented resources
• Tomorrow: mostly dependent on grant funding (but strong commitment)
Discipline: Pharmacology
• International Scientific Union
• Attempting to build credit for data contributions
• DB ownership rotates
• Tomorrow: extremely limited funding
Discipline: Social Sciences
• Mature!
• Staffed by Social Science curators
• Alert to opportunities
• Able to appraise material offered
• Strong relationship to discipline
• Tomorrow: assured through broad mix of funding streams
Publisher: Crystallography
• Publisher and Scientific Union
• Created key domain crystallographic standard (CIF)
• Strong motivator for deposit of structure data
• Consistent quality checks
• DOIs used for structure data
• Tomorrow: publishing business model
•Slide from IUCr
National bodies: British Library
• Serious and robust approach
• Legal deposit powers & responsibilities as driver
• Oriented primarily towards “cultural heritage” (broadly interpreted)
• Little data, no science domain experience
• Tomorrow: strong future commitment
National bodies: TNA/NDAD
• Specialist archive for government datasets
• Understand government regulations, dynamics & requirements
• Subject generalists; disconnected from associated science
• Technology specialists (understand databases)
• Tomorrow: likely to pass eventually to The National Archives
National bodies: NOAA (etc)
• Government body making serious data available
• Domain scientists curate data
• Operates in current political context (!)
• Tomorrow: reasonably assured but some un-funded mandates?
3rd parties: OCLC?
• Should this be community?
• Demand driven
• No domain science expertise: rely on origins
• Tomorrow: business case
3rd parties: Portico
• Specific area: eJournals
• Depends on publisher agreements
• No data or domain science expertise
• Tomorrow: commitment from Mellon + publishers + subscriptions, good funding mix
3rd Parties: Iron Mountain
• Records management IS a curation problem
• Organisations like this very likely to branch out
• No domain science expertise
• Tomorrow: business case, viability, stock market…
Institutions & the network
• Institutions have some fundamental sustainability
• Disciplines live in the network; sustainability is an issue
• Can we get the best of both?
Intersections…
[Matrix diagram: Discipline 1, Discipline 2, Discipline 3, etc against Institution 1, Institution 2, Institution 3, etc, with X marking where a discipline and an institution intersect]
Who are the curation players again?
Project StORe findings
• Discipline commonality from survey (Miller, UKDA, 2006):
• 2-way links between data & publication useful
• Barriers to actual deposit of data/outputs
• Sharing data important, likely between colleagues
• Perceived inconsistency across repositories
• Most common searching: Google type
• Researchers favour self-reliance rather than library support
• Recognise need for common minimum metadata
• Aim for pilot linking middleware demonstrator
• “Creating small scale ‘silos’ of information with institutional repositories is not … a compelling information management strategy in the ‘Google age’” (Heery & Anderson for JISC, 2005)
Sustainability: tomorrow is the emerging worry
• Sustainability work package in DCC (new grant!)
• JISC/NDIIPP meeting addressed it
• AHRC report draft soon
• Research Information Network report draft
• JISC study on sustainable IT systems for HE
• Recent ARL/NSF workshop, NSF strategy
Sustainability of what?
• Repository as an organisation
• Repository as a service
• Repository as a system
• Repositories as a network (federation?)
• Collections and objects supported by repositories
• Commit to collection: contract the manager!
Social factors
• Commitment essential… much more than anything else (cf persistent identifiers)
• Funder requirements express social determination
• Policy & grant application forms, selection criteria
• Monitoring essential
• Legal, ethical, IPR impacts all significant
• Public good questions
• Academic credit (citations?)
• Free-loaders (embargos?)
• Disciplines are different!
• Workforce skills: researcher, data librarian/scientist
Sustainability a function of...
• Commitment
• Goals
• Value and cost
• Business model
• Time
• Environment
• Domain knowledge and information
• Dimensions (how much stuff)
• Technical approaches
• Usage
So, tomorrow…
• Digital data repositories already sustained > 30 years
• How?
• Vision, leadership, commitment
• Libraries, archives, museums sustained 100s of years
• How?
• Aggregate value proposition
• Perception now under threat!
• Collectively we need to identify the next steps toward digital data sustainability, for tomorrow, and tomorrow, and tomorrow!
Macbeth again…
"To-morrow, and to-morrow, and to-morrow,
Creeps in this petty pace from day to day,
To the last syllable of recorded time;
…it is a tale
Told by an idiot, full of sound and fury,
Signifying nothing."
Mission (impossible?)
• To that last syllable of recorded time
• Keep our tales forever full of significance!
Thank you