Curating data for integrated science


Description: DCC presentation to NERC Data Managers 2009

Transcript of Curating data for integrated science

Page 1: Curating data for integrated science

a centre of expertise in data curation and preservation

This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 2.5 UK: Scotland License. To view a copy of this license, (a) visit http://creativecommons.org/licenses/by-nc-sa/2.5/scotland/; or, (b) send a letter to Creative Commons, 543 Howard Street, 5th Floor, San Francisco, California, 94105, USA.

Curating data for integrated science

Chris Rusbridge

NERC Data Management Workshop February 2009

Page 2: Curating data for integrated science

Contents
• Curation
• Integrated science
• Poetry & Philosophy of D H Rumsfeld
• Designated Community & Knowledge Base
• Curation and integration
• Data and Texts

Page 3: Curating data for integrated science

Curation
• Wikipedia
  • Curator: a content specialist responsible for an institution's collections and, together with a publications specialist, their associated collections catalogs.
  • Digital Curation: the curation, preservation, maintenance, collection and archiving of digital assets
  • Sheer curation: an approach to digital curation where curation activities are quietly integrated into the normal work flow of those creating and managing data and other digital assets.
• DCC: Digital curation is maintaining and adding value to a trusted body of digital information for current and future use.

Page 4: Curating data for integrated science

Integrated Science?
• Mostly educational: easy-to-swallow science
• Some strange things
• One nice essay
• Lots of environmental science

Page 5: Curating data for integrated science


Page 6: Curating data for integrated science

University of Integrated Science, California
• Degree Programs:
  • Vertical reality
  • Tachyon Holistic Wellness
  • Tantra (including Sexual Alchemy for Singles 101)
  • Vegan and Live Food Nutrition Masters Program
• …and that’s it!

Page 7: Curating data for integrated science

Edward O Wilson (1998)
• “Science: organized systematic enterprise that gathers knowledge about the world and condenses the knowledge into testable laws and principles. Defining traits are
  • 1st, confirmation of discoveries & support of hypotheses through repetition by independent investigators, preferably with different tests & analyses;
  • 2nd, mensuration, the quantitative description of the phenomena on universally accepted scales;
  • 3rd, economy, by which the largest amount of information is abstracted into a simple and precise form, which can be unpacked to re-create detail;
  • 4th, heuristics, the opening of avenues to new discovery and interpretation.
  • And 5th, and finally, is consilience, the interlocking of causal explanations across disciplines.”
• Consilience: “the concurrence of multiple inductions drawn from different data sets”

Wilson, E. O. (1998, 27 March). Integrated Science and The Coming Century of The Environment. Science Magazine, 279, 2048-2049.

Page 8: Curating data for integrated science

Wilson concluding
• “Arguably the foremost of global problems grounded in the idiosyncrasies of human nature is overpopulation and the destruction of the environment. The crisis is not long-term but here and now; it is upon us. Like it or not, we are entering the century of the environment, when science and polities will give the highest priority to settling humanity down before we wreck the planet.”

Page 9: Curating data for integrated science

NCAR: January 2009
• The Integrated Science Program will promote scientific frontiers that are dependent on an integrated approach, across NCAR laboratories and across disciplines. ISP will focus on thematic areas where the mission and expertise at NCAR, and in the university atmospheric and related sciences community, can be advanced by contributions from the social and environmental sciences beyond those that typically occur within single programs or departments. These areas include, but are not limited to, Earth system-society interactions, building societal resilience to weather and climate hazards, hydrologic sciences, and biogeochemistry.

Page 10: Curating data for integrated science

Fisheries & Oceans Canada
• Integrated Science Data Management (ISDM): Providing Access to Ocean Data
• “ISDM's mandate is to manage and archive ocean data collected by DFO, or acquired through national and international programmes conducted in ocean areas adjacent to Canada, and to disseminate data, data products, and services to the marine community in accordance with the policies of the Department.”

Page 11: Curating data for integrated science

Integrated Science
• We need a definition that works better; something like:
  “The application of multiple scientific disciplines to one or more core scientific challenges”
• Examples of integrated sciences?
  • Archaeology
  • Environmental sciences

Page 12: Curating data for integrated science

Integrated Science implications
• Scientists will be using unfamiliar data, therefore
• Data curators and managers must make their data available for unfamiliar users!
• And now for something unfamiliar?

Page 13: Curating data for integrated science


Poetry & Philosophy of D H Rumsfeld

Hart Seely, April 2, 2003, SLATE http://www.slate.com/id/2081042/

Page 14: Curating data for integrated science

A Confession
‘Once in a while,
I'm standing here, doing something.
And I think,
"What in the world am I doing here?"
It's a big surprise.’
- May 16, 2001, interview with the New York Times

Page 15: Curating data for integrated science

Clarity
‘I think what you'll find,
I think what you'll find is,
Whatever it is we do substantively,
There will be near-perfect clarity
As to what it is.

‘And it will be known,
And it will be known to the Congress,
And it will be known to you,
Probably before we decide it,
But it will be known.’
- Feb. 28, 2003, Department of Defense briefing

Page 16: Curating data for integrated science

The Unknown
‘As we know,
There are known knowns.
There are things we know we know.
We also know
There are known unknowns.
That is to say
We know there are some things
We do not know.
But there are also unknown unknowns,
The ones we don't know
We don't know.’
- Feb. 12, 2002, Department of Defense news briefing

Page 17: Curating data for integrated science

The 4th Rumsfeld?
• 3 epistemological classes (???)
  • Known knowns
  • Known unknowns
  • Unknown unknowns
• 4th class?
  • Unknown knowns?
  • Critical issue for integrated sciences

Page 18: Curating data for integrated science

Some OAIS Concepts?
• Knowledge Base: allows a consumer to understand something
• Designated Community: the set of consumers for whom the archive curates something
• Representation Information: helps you interpret a data object yielding an information object
• The amount and nature of RepInfo required is dependent on the Knowledge Base of the Designated Community
  • If you curate for project colleagues in the short term, little if any RepInfo required
  • If you curate for those unfamiliar with the data, more RepInfo is needed
• (All broadly interpreted!)

CCSDS (2002). Reference Model for an Open Archival Information System (OAIS). Retrieved from http://public.ccsds.org/publications/archive/650x0b1.pdf.

Page 19: Curating data for integrated science

Time
• KB is f1(DC, t)
• DC is f2(t)
• RepInfo needed is f3(f1(DC, t), f2(t))
• (but none of these concepts can be precisely defined!)
• If DC is small and t is short (months to a year or so), then both may be ignored, and RepInfo may be assumed to be part of the KB
• If DC is extensive (eg cross-discipline) and t is long (5 years to 25 plus), then RepInfo must be articulated
• If t is very long, most bets are off (post-hoc reconstruction likely to be needed)
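The rules of thumb on this slide can be read as a small decision function. The sketch below is one possible interpretation, not a method from the talk. The names `CurationContext` and `repinfo_guidance` are hypothetical. The 1-year and 5-year thresholds are loose readings of "months to a year or so" and "5 years to 25 plus". The 50-year cut-off for "very long" is purely an assumption, since the slide gives no number.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CurationContext:
    """Hypothetical summary of who the data is curated for, and for how long."""
    extensive_community: bool  # True if the DC is extensive (e.g. cross-discipline)
    horizon_years: float       # t: how long the data must remain usable


def repinfo_guidance(ctx: CurationContext,
                     short_years: float = 1.0,
                     long_years: float = 5.0,
                     very_long_years: float = 50.0) -> str:
    """Map (DC, t) to the slide's rule of thumb. All thresholds are assumptions."""
    if ctx.horizon_years >= very_long_years:
        return "most bets are off: expect post-hoc reconstruction"
    if ctx.extensive_community and ctx.horizon_years >= long_years:
        return "RepInfo must be articulated"
    if not ctx.extensive_community and ctx.horizon_years <= short_years:
        return "RepInfo can be assumed part of the KB"
    # The slide gives no rule for the cases in between.
    return "no rule given: use judgement"


print(repinfo_guidance(CurationContext(extensive_community=False, horizon_years=0.5)))
print(repinfo_guidance(CurationContext(extensive_community=True, horizon_years=10)))
```

The explicit fall-through branch is deliberate: the slide itself notes that none of these concepts can be precisely defined, so the sketch declines to guess for combinations the slide does not cover.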

Page 20: Curating data for integrated science

What might RepInfo include
• Structure information: file format definitions, etc
• Semantic information: data dictionaries, code books etc
• Robust methods (working code?)
• Not to mention many kinds of metadata, provenance, documentation of hidden assumptions, etc
• Cross-domain schemas one approach to articulating RepInfo?
• (Never perfect, of course)
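As an illustration of the kinds of RepInfo listed above, here is a minimal sketch of a checklist record kept alongside a dataset. The class name `RepInfoRecord`, its four field names, and the `gaps` method are hypothetical groupings chosen for this example. They are not part of OAIS or of any DCC tool.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RepInfoRecord:
    """Hypothetical checklist of RepInfo held alongside one dataset."""
    structure: List[str] = field(default_factory=list)  # file format definitions
    semantics: List[str] = field(default_factory=list)  # data dictionaries, code books
    methods: List[str] = field(default_factory=list)    # robust methods, working code
    context: List[str] = field(default_factory=list)    # metadata, provenance, assumptions

    def gaps(self) -> List[str]:
        """Return the names of the RepInfo kinds that are still empty."""
        return [kind for kind, items in vars(self).items() if not items]


record = RepInfoRecord(
    structure=["file format definition"],
    semantics=["data dictionary", "code book"],
)
print(record.gaps())  # ['methods', 'context']
```

A check like this only shows that something has been recorded under each heading. It says nothing about whether that RepInfo is sufficient for a given Designated Community.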

Page 21: Curating data for integrated science

What about Rumsfeld 4?
• Biggest concern with unfamiliar user is clashing concepts, eg different baselines, units, geographies, granularity
• Especially where terms are ambiguous or differently interpreted
• The KBs of two DCs conflict, potentially silently
• Happens all the time, of course
• The unspoken: tacit knowledge, unknown knowns!
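One way to stop two Knowledge Bases conflicting silently is to make each dataset declare its assumptions and to refuse to combine datasets whose declarations differ. The sketch below shows the idea for units and baselines only. `Series`, `ConceptClash`, `combine`, and all the sample values are hypothetical, and real data would need far richer declarations (geographies, granularity, and so on).

```python
from dataclasses import dataclass
from typing import Tuple


class ConceptClash(ValueError):
    """Raised when two datasets disagree on a declared concept."""


@dataclass(frozen=True)
class Series:
    """Hypothetical dataset wrapper that states its assumptions explicitly."""
    name: str
    unit: str       # e.g. "degC" or "K"
    baseline: str   # e.g. the reference period used for anomalies
    values: Tuple[float, ...]


def combine(a: Series, b: Series) -> Tuple[float, ...]:
    """Concatenate two series only if their declared concepts agree."""
    clashes = [f for f in ("unit", "baseline") if getattr(a, f) != getattr(b, f)]
    if clashes:
        raise ConceptClash(f"{a.name} and {b.name} disagree on: {', '.join(clashes)}")
    return a.values + b.values


series_a = Series("series_a", "degC", "1961-1990", (0.1, 0.3))
series_b = Series("series_b", "degC", "1981-2010", (0.2, 0.4))
try:
    combine(series_a, series_b)
except ConceptClash as err:
    print(err)  # series_a and series_b disagree on: baseline
```

This only catches clashes in concepts that someone thought to declare. Tacit knowledge that was never written down stays invisible to a check of this kind.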

Page 22: Curating data for integrated science

Timing
• Curation starts before creation
  • Before project proposal!
• Data acquisition should not happen at the end
  • Continuous acquisition much better?
• Enforcement… or credit for data?

Page 23: Curating data for integrated science

Other curation issues of concern
• Sustainability (work on your survival)
• Succession (what happens to your data if you don’t)
• Data audit (know what you’ve got)
• Data risk assessment (assess your chances of loss)
• Repository external audit???
• Provenance & computational lineage
• Archiving database changes
• Community proxy roles: help your communities develop data standards & data practices
• DCC has tools & support for some of these…

Page 24: Curating data for integrated science

… and what is the role of RDF?

Page 25: Curating data for integrated science

RDF
• Anchors data to (well?) defined ontology or schema
  • Reduces 4th Rumsfeld risk?
• Allows processing by increasing class of tools
• More suited to comparatively isolated “facts” or claims than substantial data arrays?
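To make "anchors data to a defined ontology or schema" concrete, the sketch below writes a single observation as RDF statements in N-Triples syntax, using only the Python standard library. Everything under `http://example.org/` is a hypothetical vocabulary invented for this illustration, and the value 12.5 is made up. The only real identifier used is the XML Schema `decimal` datatype.

```python
XSD_DECIMAL = "http://www.w3.org/2001/XMLSchema#decimal"
EX = "http://example.org/"  # hypothetical namespace, for illustration only


def iri(value: str) -> str:
    """Format an IRI term in N-Triples syntax."""
    return f"<{value}>"


def typed_literal(value: str, datatype: str) -> str:
    """Format a typed literal (value assumed to contain no quotes or backslashes)."""
    return f'"{value}"^^<{datatype}>'


def triple(subject: str, predicate: str, obj: str) -> str:
    """Format one subject-predicate-object statement."""
    return f"{subject} {predicate} {obj} ."


observation = iri(EX + "obs/1")
lines = [
    # The measured value, with an explicit datatype.
    triple(observation, iri(EX + "vocab/seaSurfaceTemperature"),
           typed_literal("12.5", XSD_DECIMAL)),
    # The unit is itself an identifier rather than a free-text label.
    triple(observation, iri(EX + "vocab/unit"), iri(EX + "unit/degreeCelsius")),
]
print("\n".join(lines))
```

Because the property and the unit are identifiers rather than free text, a reader from another discipline can look up what they mean. That may be how RDF reduces the risk of silent clashes. The slide's caveat still applies: one triple per fact becomes unwieldy for large data arrays.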

Page 26: Curating data for integrated science

… and Research Outputs?
• Need more semantically aware texts to support cross-community understanding
• Coded up (cf microformats, RDFa)
  • People
  • Citations & references
  • Science features (eg chemicals, reactions)
  • Graphs, spectra, tables linking to supplementary data
• PDF is pretty bad at this
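As a rough illustration of what "coded up" text might look like, the sketch below generates an HTML fragment whose visible words also carry RDFa `about` and `property` attributes. The helper names `rdfa_span` and `rdfa_paragraph` and the `example.org` identifier are hypothetical. The property IRIs come from the Dublin Core terms namespace, written as full IRIs in the RDFa 1.1 style.

```python
from html import escape

DCTERMS = "http://purl.org/dc/terms/"


def rdfa_span(prop: str, text: str) -> str:
    """Wrap visible text in a span carrying an RDFa `property` attribute."""
    return f'<span property="{escape(DCTERMS + prop)}">{escape(text)}</span>'


def rdfa_paragraph(about: str, *parts: str) -> str:
    """Attach the given parts to one subject via the RDFa `about` attribute."""
    return f'<p about="{escape(about)}">{" ".join(parts)}</p>'


markup = rdfa_paragraph(
    "http://example.org/talks/curating-data",  # hypothetical identifier
    rdfa_span("title", "Curating data for integrated science"),
    "by",
    rdfa_span("creator", "Chris Rusbridge"),
)
print(markup)
```

A human reader sees an ordinary sentence, while a parser can recover which string is the title and which is the creator. That kind of machine-readable layer is hard to carry in a PDF.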

Page 27: Curating data for integrated science


Thanks… and now for the experts!