PPT Slides

Transcript of PPT Slides

Page 1: PPT Slides

a centre of expertise in data curation and preservation

Funded by:

This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 2.5 UK: Scotland License, excluding content property of others. To view a copy of this license, (a) visit http://creativecommons.org/licenses/by-nc-sa/2.5/scotland/; or (b) send a letter to Creative Commons, 543 Howard Street, 5th Floor, San Francisco, California, 94105, USA.

“Tomorrow, and tomorrow, and tomorrow”: the players on the curation stage

Chris Rusbridge

Presentation at OCLC

Page 2: PPT Slides

OCLC October 2006

"To-morrow, and to-morrow, and to-morrow,
Creeps in this petty pace from day to day,
To the last syllable of recorded time;
And all our yesterdays have lighted fools
The way to dusty death.
Out, out, brief candle!
Life's but a walking shadow; a poor player,
That struts and frets his hour upon the stage,
And then is heard no more: it is a tale
Told by an idiot, full of sound and fury,
Signifying nothing."

Shakespeare: Macbeth

Page 3: PPT Slides

•Dunsinane Hill

•Photo by Fabrice

Page 4: PPT Slides

Page 5: PPT Slides

Page 6: PPT Slides

Contents
• Curation and the Digital Curation Centre
• Science and Data Citations
• The “poor players” of data curation
• Sustainability of curated data
• Macbeth again…

Page 7: PPT Slides

Curation
• Data increasingly important as evidence
  • Experimental verifiability (the basis of science)
  • Unrepeatable observations & experiments (particularly environmental in broadest sense)
  • Legal, compliance & transactions
  • Cultural resources
• “Preservation” view vs “Publishing” view

Page 8: PPT Slides

Lynch remarks
• Closing the Curation Conference
• 3 views of digital curation
  • Finite process, handover to preservation
  • Whole life process, evolving object(s)
  • Collection as a living thing

Page 9: PPT Slides

Digital curation?

Digital preservation

Static

For later use

Page 10: PPT Slides

Digital curation?

Digital preservation: Static; for later use

Digital curation: Dynamic; in use now (and the future)

Long-term

Page 11: PPT Slides

Digital curation

Digital curation & preservation

Static

Dynamic

Long-term

For later use

In use now (and the future)

“maintaining and adding value to a trusted body of digital information for current and future use”

Page 12: PPT Slides

Mission

“The over-riding purpose of the DCC is to support and promote continuing improvement in the quality of data curation, and of associated digital preservation”

Page 13: PPT Slides

Organisation to Engage & Collaborate

Diagram labels:
• Industry
• research collaborators
• standards bodies
• testbeds & tools
• communities of practice: users
• community support & outreach
• research
• development co-ordination
• service definition & delivery
• management & admin support
• Associates Network
• curation organisations, e.g. DPC

Page 14: PPT Slides

Organisation to Engage & Collaborate: Leads

Diagram labels:
• Industry
• research collaborators
• standards bodies
• testbeds & tools
• communities of practice: users
• Bath
• Edinburgh
• CCLRC
• Glasgow
• Edinburgh
• Associates Network
• curation organisations, e.g. DPC

Page 15: PPT Slides

Associated work
• DCC LOCKSS Technical Support Service (Lots of Copies Keep Stuff Safe)
• DCC SCARP Project
  • Disciplinary approaches to sharing, curation, re-use and preservation
• EU projects associated
  • CASPAR
  • Digital Preservation Europe
  • PLANETS

Page 16: PPT Slides

Phase 2
• Externally-moderated, reflective self-evaluation completed
• Phase 2 proposal (2007/10) to JISC
  • Accepted: focus on science data, reduced scale
• EPSRC-funded Research continues until 2007/8

Page 17: PPT Slides

2nd International Digital Curation Conference
• Research & invited presentations
• Glasgow, 21/22 November, 2006
• Please register at: http://www.dcc.ac.uk/events/dcc-2006/

Page 18: PPT Slides

Page 19: PPT Slides

Data resource stages
• Curated data is created…
  • Observations? Fixed!
• Or Acquired…
  • Data brought/bought from outside
  • Ingest
• Development
  • Derived, refined, combined, processed data
  • Potentially many stages

Page 20: PPT Slides

SDSS (Visual)

2MASS (Infrared)

Slide from Rajendra Bose

Page 21: PPT Slides

Slide from Rajendra Bose

Page 22: PPT Slides

New discovery…
• National Virtual Observatory

• Johns Hopkins press release: “Scientists working to create the NVO, an online portal for astronomical research unifying dozens of large astronomical databases, confirmed discovery of [a] new brown dwarf recently. The star emerged from a computerized search of information on millions of astronomical objects in two separate astronomical databases. Thanks to an NVO prototype, that search, formerly an endeavor requiring weeks or months of human attention, took approximately two minutes.”
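
The kind of search the press release describes, matching objects across two separate catalogues, can be illustrated with a positional cross-match. This is only a minimal sketch and not the NVO's method: the catalogue rows, the `cross_match` helper and the 2-arcsecond tolerance are all assumptions made for illustration.

```python
import math

def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees between two sky positions given in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    # Haversine formula, which stays accurate for very small separations.
    a = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(math.sqrt(a)))

def cross_match(optical, infrared, tolerance_arcsec=2.0):
    """Pair sources from two catalogues that lie within the tolerance (brute force)."""
    tolerance_deg = tolerance_arcsec / 3600.0
    matches = []
    for o in optical:
        for i in infrared:
            sep = angular_separation_deg(o["ra"], o["dec"], i["ra"], i["dec"])
            if sep <= tolerance_deg:
                matches.append((o["id"], i["id"]))
    return matches

# Hypothetical catalogue rows (identifiers and positions are invented).
optical = [{"id": "opt-1", "ra": 150.0000, "dec": 2.0000},
           {"id": "opt-2", "ra": 150.5000, "dec": 2.5000}]
infrared = [{"id": "ir-9", "ra": 150.0002, "dec": 2.0001},
            {"id": "ir-7", "ra": 151.0000, "dec": 3.0000}]

matches = cross_match(optical, infrared)
```

A real service would index the sky rather than compare every pair, but the point stands: the search only works if both catalogues are curated well enough to be queried together.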

Page 23: PPT Slides

Context
• Data meaningless without context
  • Linkage
  • Metadata of many kinds
  • Workflow!
• Provenance
• Computational lineage
• Authenticity
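
Computational lineage can be pictured as a record, kept with each derived product, of the process and inputs that produced it. The sketch below is one possible shape for such a record, not a scheme taken from the talk; the `DataProduct` class, the `lineage` helper and the product names are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class DataProduct:
    name: str
    process: str = "observation"                 # how this product was produced
    inputs: list = field(default_factory=list)   # DataProduct objects it was derived from

def lineage(product):
    """Return the names of all upstream products, nearest ancestors first, without duplicates."""
    seen, order, queue = set(), [], list(product.inputs)
    while queue:
        current = queue.pop(0)
        if current.name not in seen:
            seen.add(current.name)
            order.append(current.name)
            queue.extend(current.inputs)
    return order

# Hypothetical three-stage derivation.
raw = DataProduct("raw_scene")
calibrated = DataProduct("calibrated_scene", "calibrate", [raw])
composite = DataProduct("8day_composite", "average", [calibrated])

ancestors = lineage(composite)
```

Without a record of this kind, a user of the final product cannot tell which observations and processing steps it depends on.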

Page 24: PPT Slides

[Workflow diagram labels. Data: Csat 8-day composite and subscene; Csat; E0; SST 8-day composite and subscene; PAR subscene; HRPT. Calculations: Pbopt calc; Ctot calc; Zeu calc; PPeu calc. Organisations: NASA; University research group1; University research group2; research group3; local decision-making body]

Slide from Rajendra Bose

Page 25: PPT Slides

Access and re-use
• Ethics and rights control access
  • Weak in expressing this long-term
• Collaboration tools
  • Annotation, discussion, review
  • Re-use leading to change and development
• “Publication”
  • Not just in “print”
  • Underlying data should be “published”, too
• Citation…

Page 26: PPT Slides

CLADDIER citation investigation

“My last example was an MST data set held at the BADC, and I was suggesting something like this (for a citation):

<Citation>
  <Author> Natural Environment Research Council </Author>
  <Title> Mesosphere-Stratosphere-Troposphere Radar at Aberystwyth </Title>
  <Medium> Internet </Medium>
  <Publisher> British Atmospheric Data Centre (BADC) </Publisher>
  <PublicationDate status="ongoing"> 1990</PublicationDate>
  <Identifier> badc.nerc.ac.uk/data/mst/v3/upd15032006</Identifier>
  <Feature>
    <FeatureType>http://featuretype.registry/verticalProfile</FeatureType>
    <LocalID>200409031205</LocalID>
  </Feature>
  <AccessDate> Sep 21 2006 </AccessDate>
  <AvailableAt><url>http://badc.nerc.ac.uk/data/mst/v3/</url></AvailableAt>
</Citation>

(Made up tags!)”

•Bryan Lawrence Weblog
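
A record like the one quoted can be assembled programmatically. The sketch below uses Python's standard `xml.etree.ElementTree` and a subset of the tags from the quotation, which are themselves described as made up; the `build_citation` helper is hypothetical and this is not a CLADDIER tool.

```python
import xml.etree.ElementTree as ET

def build_citation(author, title, publisher, identifier, access_date, url):
    """Build a <Citation> element using a subset of the (made-up) tags quoted above."""
    root = ET.Element("Citation")
    fields = [("Author", author), ("Title", title), ("Publisher", publisher),
              ("Identifier", identifier), ("AccessDate", access_date)]
    for tag, text in fields:
        ET.SubElement(root, tag).text = text
    available = ET.SubElement(root, "AvailableAt")
    ET.SubElement(available, "url").text = url
    return root

citation = build_citation(
    author="Natural Environment Research Council",
    title="Mesosphere-Stratosphere-Troposphere Radar at Aberystwyth",
    publisher="British Atmospheric Data Centre (BADC)",
    identifier="badc.nerc.ac.uk/data/mst/v3/upd15032006",
    access_date="Sep 21 2006",
    url="http://badc.nerc.ac.uk/data/mst/v3/",
)
xml_text = ET.tostring(citation, encoding="unicode")
```

The structured form matters because each element (identifier, access date, location) can then be checked or resolved by software rather than read from free text.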

Page 27: PPT Slides

CLADDIER 2: “Version of record”
• Role of Publisher: add value
  • provision of catalogue metadata
  • some commitment to maintenance of the resource at the AvailableAt url
  • some commitment to the resource being conformant to the description of the Feature
  • some commitment to the maintenance of the mapping between the identifier [LocalID] and the resource.

•Bryan Lawrence Weblog

Page 28: PPT Slides

CLADDIER 3: persistence
• Wayback Machine
  • Only snapshots (eg only 2004 version of Bryan’s home page!)
• WebCite
  • allows the creator of content to submit URLs for [archiving], thus ensuring when one writes an academic document, the material will be archived, and the citation will be persistent
  • But no real help for data…
• “… only allow [data citation] when we believe in the persistence of the organisation making the data available…”

•Bryan Lawrence Weblog

Page 29: PPT Slides

Page 30: PPT Slides

Citation

OWL Web Ontology Language Reference

W3C Proposed Recommendation 15 December 2003

This version: http://www.w3.org/TR/2003/PR-owl-ref-20031215/
Latest version: http://www.w3.org/TR/owl-ref/
Previous version: http://www.w3.org/TR/2003/CR-owl-ref-2003081

• Needs a stable resource to cite…

• (FRBR works & expressions?)

Page 31: PPT Slides

Citation…
• The date alone (as in common web citation approaches) is not enough!
  • Cited object likely to have changed…
  • Citation should link to the cited object as it was!

[6] The CIA World Factbook. www.cia.gov/cia/publications/factbook/. Retrieved on 8 Jan 2006.
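
One way to make the weakness of a date-only citation concrete is to store a fingerprint of the content alongside the retrieval date, so a later reader can at least tell whether the resource has changed. This is an illustrative assumption rather than a scheme proposed in the talk; the `cite` and `still_matches` helpers and the sample content are hypothetical.

```python
import hashlib

def cite(url, content, retrieved):
    """Record what was cited: location, retrieval date, and a SHA-256 fingerprint of the bytes seen."""
    return {"url": url,
            "retrieved": retrieved,
            "sha256": hashlib.sha256(content).hexdigest()}

def still_matches(citation, current_content):
    """True if the resource as it is now is byte-identical to what was cited."""
    return hashlib.sha256(current_content).hexdigest() == citation["sha256"]

# Invented page content standing in for the cited resource.
page_in_january = b"Population: 1,000,000"
page_later = b"Population: 1,050,000"

citation = cite("www.cia.gov/cia/publications/factbook/", page_in_january, "8 Jan 2006")
unchanged = still_matches(citation, page_in_january)
changed = not still_matches(citation, page_later)
```

A fingerprint only detects change. Retrieving the object as it was still requires that someone archived that state.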

Page 32: PPT Slides

Citation needs…
• An efficient way to reference and access “archived” past states of a changing dataset (work in progress, Buneman et al)
• Not important for original observations
  • Don’t mess with those data
• Less important for incremental datasets
  • Later stuff should not invalidate earlier
• Very important for revisable datasets
  • Eg Genomics… datasets that result from the combined work of curators, or contain opinions or facts likely to change
  • Eg Mapping… OS maps represent a huge database that changes on a daily basis

Page 33: PPT Slides

XMLArch: System Architecture

[Architecture diagram labels: Relational Database; Data Extractor; XML Snapshot at time t; XML Archiver; Pre-processor; Version Merger; XML Archive at time t - 1; XML Archive at time t]

•Carwyn Edwards
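
The labels on this slide suggest that each new snapshot is merged into the previous archive to give the archive at time t. The sketch below illustrates that general idea with flat key/value records instead of XML, storing each value once together with the range of versions in which it held. It is a simplified assumption, not the XMLArch implementation; the `VersionedArchive` class and the sample records are hypothetical.

```python
class VersionedArchive:
    """Keeps every past state of a dataset by merging successive full snapshots."""

    def __init__(self):
        self.version = 0
        # key -> list of [value, first_version, last_version]
        self.history = {}

    def merge(self, snapshot):
        """Merge a full snapshot as the next version and return its version number."""
        self.version += 1
        v = self.version
        for key, value in snapshot.items():
            entries = self.history.setdefault(key, [])
            if entries and entries[-1][0] == value and entries[-1][2] == v - 1:
                entries[-1][2] = v              # unchanged since last version: extend the range
            else:
                entries.append([value, v, v])   # new or changed: start a new range
        return v

    def snapshot_at(self, v):
        """Reconstruct the dataset exactly as it was at version v."""
        return {key: value
                for key, entries in self.history.items()
                for value, first, last in entries
                if first <= v <= last}

# Invented annotation records that are revised over three versions.
archive = VersionedArchive()
archive.merge({"gene1": "kinase", "gene2": "unknown"})
archive.merge({"gene1": "kinase", "gene2": "transporter", "gene3": "ligase"})
archive.merge({"gene1": "kinase", "gene3": "ligase"})
```

A citation could then carry a version number, and `snapshot_at` would return the cited state even after later revisions.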

Page 34: PPT Slides

Who are the curation players?

Page 35: PPT Slides

Curation: Individual
• “Small science” 2-3 times more data than “Big science”, but much more at risk
• PhD student? RA? PI? Administrator? IT support?
• Data potentially on local hard drives, or at best shared network drives
  • May be inadequately protected
  • Liable for policy-led deletion on resignation
• Individual “knows” too much
  • Documentation/metadata unlikely to be adequate
• Tomorrow: gone!

Page 36: PPT Slides

Department: eCrystals
• Specialist department archive (& national service)
• Workflow recording of lab parameters (R4L)
• Public & private elements
• Trying to build eCrystals federation (eBank 3)
• But… ReciprocalNet? French COD efforts? Fragmented discipline!
• Tomorrow: likely to continue

Page 37: PPT Slides

Institution: Cambridge Chemistry
• 175,000 small molecule structures in CML
• Alongside Archaeology, Manuscripts, Learning Materials, etc
• No library curation skills; dependent on research group enthusiast
• Collection isolated from other Chemistry
• Tomorrow: assured…

Page 38: PPT Slides

Community: CDL
• Shared effort from group of institutions
• Comparison OhioLink?
• Document tradition, not data
• Passive role re collections
• Rely on departmental & domain expertise
• Tomorrow: assured…

Page 39: PPT Slides

Community: SDSC?
• Data specialists
• Multiple disciplines
• Distinct from domains; curation dependent on external expertise
• Research ethos
• Tomorrow: dependent on grant/contract income & research priorities

Page 40: PPT Slides

Community: LOCKSS?
• Self-selected group of collectors: closest to genuine open activity (despite Alliance)?
• Traditionally libraries collecting eJournals
• Model respects IPR
• No domain expertise; rely on origins
• Data limitations…
• Tomorrow: potentially very persistent (low cost, high reliability, attack resistance, distributed)
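
The "lots of copies" idea behind that persistence can be shown with a toy audit in which several holders compare fingerprints of their copies and the minority is repaired from the majority. This is a greatly simplified sketch, not the actual LOCKSS polling protocol; the `audit_and_repair` helper and the peer names are hypothetical.

```python
import hashlib
from collections import Counter

def audit_and_repair(copies):
    """copies maps peer name -> bytes. Repairs damaged copies in place; returns the repaired peers."""
    digests = {peer: hashlib.sha256(data).hexdigest() for peer, data in copies.items()}
    winner, votes = Counter(digests.values()).most_common(1)[0]
    if votes <= len(copies) / 2:
        raise ValueError("no majority: cannot decide which copy is good")
    good = next(data for peer, data in copies.items() if digests[peer] == winner)
    repaired = [peer for peer in copies if digests[peer] != winner]
    for peer in repaired:
        copies[peer] = good   # overwrite the dissenting copy with the agreed content
    return repaired

# Three invented libraries, one holding a damaged copy.
copies = {"library-a": b"article text",
          "library-b": b"article text",
          "library-c": b"artic1e text"}
repaired = audit_and_repair(copies)
```

The approach needs no domain knowledge of the content, which matches the slide's point that such a community relies on the originators for expertise.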

Page 41: PPT Slides

Discipline: Archaeology
• Staffed by archaeologist curators
• Understand special legal issues
• Strong relationship with community & peers
• Internationally still fragmented?
• Tomorrow: dependent on research council grants + deposit funding

Page 42: PPT Slides

Discipline: Astronomy
• Part of major international effort
• Expensive shared facilities, global reach
• Well integrated into community
• Enable new science
• Tomorrow: assured by community (another large facility)

Page 43: PPT Slides

Discipline: Atmosphere
• Strong believer in need for domain scientists as curators
• Significant participant in “community proxy” agenda-setting activities
• Internationally fragmented resources
• Tomorrow: mostly dependent on grant funding (but strong commitment)

Page 44: PPT Slides

Discipline: Pharmacology
• International Scientific Union
• Attempting to build credit for data contributions
• DB ownership rotates
• Tomorrow: extremely limited funding

Page 45: PPT Slides

Discipline: Social Sciences
• Mature!
• Staffed by Social Science curators
• Alert to opportunities
• Able to appraise material offered
• Strong relationship to discipline
• Tomorrow: assured through broad mix of funding streams

Page 46: PPT Slides

Publisher: Crystallography

• Publisher and Scientific Union

• Created key domain crystallographic standard (CIF)

• Strong motivator for deposit of structure data

• Consistent quality checks

• DOIs used for structure data

• Tomorrow: publishing business model

•Slide from IUCr

Page 47: PPT Slides

National bodies: British Library
• Serious and robust approach
• Legal deposit powers & responsibilities as driver
• Oriented primarily towards “cultural heritage” (broadly interpreted)
• Little data, no science domain experience
• Tomorrow: strong future commitment

Page 48: PPT Slides

National bodies: TNA/NDAD
• Specialist archive for government datasets
• Understand government regulations, dynamics & requirements
• Subject generalists; disconnected from associated science
• Technology specialists (understand databases)
• Tomorrow: likely to pass eventually to The National Archives

Page 49: PPT Slides

National bodies: NOAA (etc)
• Government body making serious data available
• Domain scientists curate data
• Operates in current political context (!)
• Tomorrow: reasonably assured but some un-funded mandates?

Page 50: PPT Slides

3rd parties: OCLC?
• Should this be community?
• Demand driven
• No domain science expertise: rely on origins
• Tomorrow: business case

Page 51: PPT Slides

3rd parties: Portico
• Specific area: eJournals
• Depends on publisher agreements
• No data or domain science expertise
• Tomorrow: commitment from Mellon + publishers + subscriptions, good funding mix

Page 52: PPT Slides

3rd Parties: Iron Mountain
• Records management IS a curation problem
• Organisations like this very likely to branch out
• No domain science expertise
• Tomorrow: business case, viability, stock market…

Page 53: PPT Slides

Institutions & the network
• Institutions have some fundamental sustainability
• Disciplines live in the network; sustainability is an issue
• Can we get the best of both?

Page 54: PPT Slides

Intersections…

[Matrix: Institution 1, Institution 2, Institution 3, etc against Discipline 1, Discipline 2, Discipline 3, etc, with two X marks per discipline]

Page 55: PPT Slides

Who are the curation players again?

Page 56: PPT Slides

Project StORe findings
• Discipline commonality from survey (Miller, UKDA, 2006):
  • 2-way links between data & publication useful
  • Barriers to actual deposit of data/outputs
  • Sharing data important, likely between colleagues
  • Perceived inconsistency across repositories
  • Most common searching: Google type
  • Researchers favour self-reliance rather than library support
  • Recognise need for common minimum metadata
• Aim for pilot linking middleware demonstrator
• “Creating small scale ‘silos’ of information with institutional repositories is not … a compelling information management strategy in the ‘Google age’” (Heery & Anderson for JISC, 2005)

Page 57: PPT Slides

Sustainability: tomorrow is the emerging worry

• Sustainability work package in DCC (new grant!)

• JISC/NDIIPP meeting addressed it

• AHRC report draft soon

• Research Information Network report draft

• JISC study on sustainable IT systems for HE

• Recent ARL/NSF workshop, NSF strategy

Page 58: PPT Slides

Sustainability of what?
• Repository as an organisation
• Repository as a service
• Repository as a system
• Repositories as a network (federation?)
• Collections and objects supported by repositories
• Commit to collection: contract the manager!

Page 59: PPT Slides

Social factors
• Commitment essential… much more than anything else (cf persistent identifiers)
• Funder requirements express social determination
  • Policy & grant application forms, selection criteria
  • Monitoring essential
• Legal, ethical, IPR impacts all significant
• Public good questions
  • Academic credit (citations?)
  • Free-loaders (embargos?)
  • Disciplines are different!
• Workforce skills: researcher, data librarian/scientist

Page 60: PPT Slides

Sustainability a function of...
• Commitment
• Goals
• Value and cost
• Business model
• Time
• Environment
• Domain knowledge and information
• Dimensions (how much stuff)
• Technical approaches
• Usage

Page 61: PPT Slides

So, tomorrow…
• Digital data repositories already sustained > 30 years
  • How?
  • Vision, leadership, commitment
• Libraries, archives, museums sustained 100s of years
  • How?
  • Aggregate value proposition
  • Perception now under threat!
• Collectively we need to identify the next steps toward digital data sustainability, for tomorrow, and tomorrow, and tomorrow!

Page 62: PPT Slides

Macbeth again…

"To-morrow, and to-morrow, and to-morrow,
Creeps in this petty pace from day to day,
To the last syllable of recorded time;
…it is a tale
Told by an idiot, full of sound and fury,
Signifying nothing."

Page 63: PPT Slides

Mission (impossible?)
• To that last syllable of recorded time
• Keep our tales forever full of significance!

Thank you