Principles for proper data management and reuse--An RDA view

Principles for proper Data Management and Re-Use – an RDA view. Peter Wittenburg, Max Planck Society


Principles for proper Data Management and Re-Use – an RDA view

Peter Wittenburg

Max Planck Society

2 Clarification

Does RDA have one view? Yes and no.

RDA is basically a bottom-up organization driven by the many “creative” minds who want to change data practices.

RDA now has about 2000 members, so do we have 2000 opinions?

We have had an intensive discussion process since 2012 (ICRI Conference, Copenhagen), and we can see a number of trends and principles that all or most seem to agree with.

Still, RDA is a very young initiative and needs much attention and grease.

3 Why is this all relevant?

Naoyuki Tsunematsu (JST):

• Data exchange (and thus the need for proper data management) is difficult to convey in Japanese science

• parallel trends observed for Japanese science

• not so often included in collaborations anymore

• not so often represented in the top papers

• enormous decrease in international ranking

• serious worries about counterproductive encapsulation

• this concern seems to be relevant for all of us

4 Trends I – Volume, Complexity

from simple structures ...

... towards complex relationships

5 Trends II – Anonymity

direct exchange between known colleagues

Domain of Repositories

6 Trends III – Re-Usage

Domain of trusted Repositories

• Data will be re-used in different contexts

• Data needs to be findable, accessible, combinable and interpretable for others

7 Data Practices I – Survey

~120 interviews/interactions

2 workshops with leading scientists (EU, US)

too much is done manually or via ad hoc scripts

too much is in legacy formats (no PID & MD)

there are lighthouse projects etc. but ...

DM and DP are not efficient and too expensive (one biologist spends 75% of his time as data manager)

federating data incl. logical information is much too expensive

hardly any use of automated workflows, and a lack of reproducibility

9 Data Practices III – Metadata

[Figure: bar chart of metadata standards in use. Categories: DIF, DwC, DC, EML, FGDC, OpenGIS, ISO, My Lab, none. Bar values as extracted: 12, 21, 26, 95, 95, 96, 97, 266, 676.]

Slide by Bill Michener, DataONE

10 Data Practices II – Data Entropy

lack of proper documentation, schemas, semantics, relations, etc.

directory structures, spreadsheets etc. are ad hoc creations and knowledge fades away

etc.

11 Changes needed – EUDAT and others

[Figure labels: Community Center, Common Data Center]

many excellent projects are working on changes: ESFRI projects, DataNet projects, e-Infrastructures, national projects

RDA needs to build on experiences and expertise

12 RDA widely agreed I – time to change

management of data objects is largely independent of type and discipline

still, every project defines its own strategies, leading to a huge stack of software that will not be maintainable

13 RDA widely agreed II – time to change

[Figure labels, layered architecture: Value Added Services; Data Sources; Persistent Identifiers; Persistent Reference; Analysis; Citation; Apps; Custom Clients; Plug-Ins; Resolution System; Typing; PID; Local Storage; Cloud; Computed; Data Sets; RDBMS; Files; Digital Objects]

[Figure labels, digital object: PID record (attributes): points to instances, describes properties; bit sequence (instance); metadata (attributes): describes properties & context; PID record and metadata point to each other]

14 RDA Results I: common data model

• PIDs at the beginning of the trust chain

• have a worldwide, independent and robust PID system (DONA Handles – DOIs are Handles)!

• metadata are essential in an anonymous data world

taken from RDA WG Data Foundation & Terminology

15 RDA Results II: Data Type Registry

result: a registry for data types

you get an unknown file, pull it onto the DTR and its content is visualized

extended MIME type concept

no free lunch: someone needs to register and define the type

code available at the beginning of 2015

PIT demo already working with DTR

[Figure labels: Federated Set of Type Registries; Data Set (bit sequences 1010011010101…); Domain of Services: Visualization, Data Processing, Dissemination, Interpretation; Terms, Rights, Agree; Human or Machine Consumers; steps numbered 1 to 4]

• NIST is already working with communities on far-reaching ideas
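The registry idea on this slide (register a type once, then let any consumer look up an unknown file and get it visualized) can be illustrated with a small in-memory sketch. This is not the DTR code or its API: the class names, the use of leading "magic" bytes for recognition, and the example type identifier are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass(frozen=True)
class DataType:
    """A registered type definition (hypothetical minimal form)."""
    type_id: str      # identifier of the type; a real registry would use a PID
    description: str  # definition supplied by whoever registers the type
    magic: bytes      # leading bytes used here to recognise the type (assumption)


class TypeRegistry:
    """In-memory stand-in for a data type registry."""

    def __init__(self) -> None:
        self._types: Dict[str, DataType] = {}
        self._viewers: Dict[str, Callable[[bytes], str]] = {}

    def register(self, data_type: DataType, viewer: Callable[[bytes], str]) -> None:
        # "No free lunch": nothing is recognised until someone registers it.
        self._types[data_type.type_id] = data_type
        self._viewers[data_type.type_id] = viewer

    def identify(self, content: bytes) -> Optional[DataType]:
        # Return the first registered type whose magic bytes match.
        for data_type in self._types.values():
            if content.startswith(data_type.magic):
                return data_type
        return None

    def visualize(self, content: bytes) -> str:
        data_type = self.identify(content)
        if data_type is None:
            return "unknown type: nobody has registered it yet"
        return self._viewers[data_type.type_id](content)


registry = TypeRegistry()
registry.register(
    DataType("example/csv-table", "comma-separated table with a '#csv' marker", b"#csv\n"),
    # The viewer shows the column names found on the line after the marker.
    viewer=lambda raw: " | ".join(raw.decode().splitlines()[1].split(",")),
)

known = registry.visualize(b"#csv\nid,value\n1,42\n")
unknown = registry.visualize(b"\x00\x01binary")
```

Here `known` holds the rendered column header and `unknown` holds the fallback message, which mirrors the slide's point that a type is only usable after someone has defined and registered it.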

16 RDA Results III: PID Information Types

result: a generic API and a set of basic attributes

a PID record is like a passport (number, photo, expiry date, etc.)

if all PID service providers agree on one API and talk the same language (registered terms), software development will become easy

test installation in operation together with DTR

LOC: location, path

CKSM: checksum

CKSM_T: checksum type

RoR: owning repository

MD: path to MD
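A minimal sketch of what a PID record with these basic attributes could look like behind one shared API. Only the five attribute names and their short glosses come from the slide; the `PidRecord` and `PidService` classes, their methods, and the example PID and values are hypothetical and do not describe the actual PIT API.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

# Attribute names and glosses as listed on the slide.
BASIC_ATTRIBUTES = {
    "LOC": "location, path",
    "CKSM": "checksum",
    "CKSM_T": "checksum type",
    "RoR": "owning repository",
    "MD": "path to MD",
}


@dataclass
class PidRecord:
    """The 'passport' of a digital object: a PID plus typed attributes."""
    pid: str
    attributes: Dict[str, str] = field(default_factory=dict)

    def set(self, name: str, value: str) -> None:
        # Only registered terms are accepted, so all clients share one vocabulary.
        if name not in BASIC_ATTRIBUTES:
            raise KeyError(f"unregistered attribute: {name}")
        self.attributes[name] = value

    def get(self, name: str) -> Optional[str]:
        return self.attributes.get(name)


class PidService:
    """Hypothetical provider-independent interface, backed by a dict."""

    def __init__(self) -> None:
        self._records: Dict[str, PidRecord] = {}

    def create(self, record: PidRecord) -> None:
        self._records[record.pid] = record

    def resolve(self, pid: str) -> PidRecord:
        return self._records[pid]


service = PidService()
record = PidRecord("example-prefix/0001")  # made-up PID for illustration
record.set("LOC", "https://repository.example.org/objects/0001")
record.set("CKSM", "9f86d081884c7d65")     # made-up checksum value
record.set("CKSM_T", "sha256")
record.set("RoR", "Example Repository")
record.set("MD", "https://repository.example.org/metadata/0001")
service.create(record)

location = service.resolve("example-prefix/0001").get("LOC")
```

Because every provider would expose the same attribute names, client code such as the last line would not need to change when the PID is served by a different provider.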

17 RDA Results IV: Practical Policies

due to unforeseen circumstances, more time is needed, until P5

Practical Policies = executable workflow statements

result at P5: a set of best-practice PPs for a number of typical DM/DP tasks (integrity check, replication, etc.)

currently a large collection of PPs, which is being evaluated

you could add your policies

[Figure labels: Policy Inventory; Repository; replication policy X, replication policy Y, integrity policy A, integrity policy B, integrity policy C, md extraction policy l, md extraction policy k, etc.; selection, implementation, execution; data manager]
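To make "executable workflow statement" concrete, here is one possible integrity-check policy written as a plain function and selected by name from a small inventory. It is only a sketch: the inventory dictionary, the policy name and the `run_policy` helper are assumptions, and the policies collected by the RDA group may be expressed in a different form entirely.

```python
import hashlib
import os
import tempfile
from typing import Callable, Dict


def integrity_policy(path: str, expected: str, algorithm: str = "sha256") -> bool:
    """Recompute the checksum of a file and compare it with the stored one."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # Read in chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected


# Hypothetical policy inventory: the data manager selects a policy by name.
POLICY_INVENTORY: Dict[str, Callable[..., bool]] = {
    "integrity-check": integrity_policy,
}


def run_policy(name: str, **kwargs) -> bool:
    return POLICY_INVENTORY[name](**kwargs)


# Demonstration on a temporary file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"1010011010101")
    demo_path = tmp.name

stored_checksum = hashlib.sha256(b"1010011010101").hexdigest()
intact = run_policy("integrity-check", path=demo_path, expected=stored_checksum)

# Simulate corruption by appending one byte, then run the same policy again.
with open(demo_path, "ab") as handle:
    handle.write(b"0")
corrupted = run_policy("integrity-check", path=demo_path, expected=stored_checksum)

os.remove(demo_path)
```

The first run reports the file as intact and the second detects the change. A replication or metadata-extraction policy could be added to the same inventory under its own name.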

18 RDA ongoing: Data Fabric

need to place the many RDA WGs & IGs on a common landscape, since in the end everything needs to fit together -> Data Fabric

19 Changes take long ...

[Timeline: 1973 TCP/IP Specification; 1977 TCP/IP Stress-test; 1990 and 1993: available worldwide, WWW-Mosaic, adoption]

many different suggestions & protocols

at first no advantage for TCP/IP

at the beginning discussion about different email systems

at the beginning no interest from researchers or from industry (toy of some freaks)

required some top-down decisions to enforce unification

20 years!

20 RDA is about global bridge building

RDA is about building the social and technical bridges that enable global open sharing of data.

Researchers, scientists, data practitioners from around the world are invited to work together to achieve the vision

Funders: NSF, EC, AU Gov, Japan, Brazil, DE?, UK?, ZA?, FI?, etc.

21

Thanks for your attention.

http://www.rd-alliance.org

http://europe.rd-alliance.org

22 Trends IV: research is changing

see Science 2.0 Initiative of EC

the number of researchers is increasing enormously

there is pressure in the direction of Grand Challenges and of topics relevant for societies

research is increasingly often data intensive

border-crossing research is a fact (countries, disciplines)

faster cycles (hypothesis – analysis – publications – reviews)

23 RDA is about global bridge building

[Figure labels: bottom-up process, top-down process, uptake to come]

24 EUDAT Services

EUDAT Box: dropbox-like service, easy sharing, local synching

Semantic Anno: checking, referencing and annotating

Dynamic Data: immediate handling

Generic Workflow: automating data processing

B2DROP B2NOTE