Library and data lecture for inf21306

Post on 16-Jan-2017

Guest lecture: Library and data?

www.slideshare.net/hugobesemer (use on WURNET Chrome, Firefox)

20160920, Hugo Besemer

Two different things

● An example of data modelling challenges for the library, or if you wish: data is dirty ...

● Data management planning at Wageningen University

Data is dirty

The problem

I am in the tenure track, and the university wants me to publish in “Q1” journals

My research is funded by NWO/EU/... and they want me to publish in “Open access” journals

Journals catalogue: issn, title, topics

Open_access: issn, open access status (boolean)

Quartiles: issn, quartile

SELECT j.title, j.issn
FROM Journals j
INNER JOIN Open_access o ON o.issn = j.issn
INNER JOIN Quartiles q ON q.issn = j.issn
WHERE j.topics = 'mine' AND o.status = 'yes' AND q.quartile = 'Q1'

Let’s look in Nottingham for online status

But we can also go to Lund

Confusion from Amsterdam

Things change all the time

So we have learned....

ISSN (primary key) is ambiguous
● so you need to harmonize data

Open access status is ambiguous
● Gold, Green or Hybrid
● Discussion: which one do we take?

There are several sources for online status
● Discussion: which one do we take?
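One small part of harmonizing the data is bringing ISSNs from different sources into a single form before joining on them. The sketch below assumes that the differences are in notation (spacing, hyphens, an "ISSN" prefix, lower-case x); the slides do not say which kind of ambiguity was meant, and the function name is hypothetical. It also verifies the standard ISSN check digit, so that mistyped numbers are dropped rather than matched.

```python
import re


def normalize_issn(raw):
    """Return the ISSN as 'NNNN-NNNC', or None if it is malformed or fails the check digit."""
    # Drop an optional "ISSN" label and everything that is not a digit or X.
    text = raw.upper().replace("ISSN", "")
    compact = re.sub(r"[^0-9X]", "", text)
    if not re.fullmatch(r"\d{7}[\dX]", compact):
        return None
    # Check digit: the first seven digits are weighted 8 down to 2, modulo 11.
    total = sum(int(digit) * weight for digit, weight in zip(compact[:7], range(8, 1, -1)))
    remainder = (11 - total % 11) % 11
    expected = "X" if remainder == 10 else str(remainder)
    if compact[7] != expected:
        return None
    return f"{compact[:4]}-{compact[4:]}"


for candidate in ["issn 00280836", "2434-561x", "0028-0835"]:
    print(candidate, "->", normalize_issn(candidate))
```

This does not address a journal having separate print and electronic ISSNs; linking those would need an extra mapping table.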

[E-R diagram: Journals catalogue (issn, title, topics) and Quartiles (issn, quartile), linked by issn to several open access sources: Romeo Sherpa (colours), DOAJ (Romeo gold; issn, APC) and Hybrid publisher (issn, APC)]

Now for the quartiles

Q1, Q2, Q3, Q4

How do we compare numbers?

Scientist Z. Math has a publication from 2003 with 17 citations

Scientist M. Biology has a publication from 2009 with 24 citations

[Figures: Baselines for Mathematics; Baselines for Molecular Biology. Each plots the cumulative no. of citations (0 to 400) against years after publication (0 to 12), with lines for the baseline, top 10% and top 1%]
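One way to make the two publications comparable is to divide each citation count by the baseline for its own field and age. The sketch below uses the citation counts from the slide, but the baseline values are hypothetical placeholders, not numbers read from the plots, and the function name is made up for illustration.

```python
def relative_citation_score(citations, field_baseline):
    """Citations divided by the field baseline for a publication of the same age."""
    if field_baseline <= 0:
        raise ValueError("field_baseline must be positive")
    return citations / field_baseline


# Hypothetical baselines (NOT taken from the slides): expected cumulative
# citations for an average paper of that age in that field.
MATH_BASELINE = 5.0
MOLBIO_BASELINE = 30.0

math_score = relative_citation_score(17, MATH_BASELINE)      # Z. Math, 2003 paper
molbio_score = relative_citation_score(24, MOLBIO_BASELINE)  # M. Biology, 2009 paper

print(f"Z. Math:    {math_score:.2f} times the field baseline")
print(f"M. Biology: {molbio_score:.2f} times the field baseline")
```

With these placeholder baselines the mathematics paper scores above its field average and the molecular biology paper below, even though the latter has more citations in absolute terms.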

What does that mean for our E-R diagram?

Quartile distribution depends on topic
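A minimal sketch of what "depends on topic" means in practice: journals are ranked within their own topic, so the quartile belongs to the combination of journal and topic rather than to the journal alone. The scores, identifiers and function name below are hypothetical.

```python
from collections import defaultdict


def quartiles_by_topic(journals):
    """journals: iterable of (issn, topic, score). Returns {(issn, topic): 'Q1'..'Q4'}."""
    by_topic = defaultdict(list)
    for issn, topic, score in journals:
        by_topic[topic].append((score, issn))

    result = {}
    for topic, entries in by_topic.items():
        entries.sort(reverse=True)  # highest score first
        count = len(entries)
        for rank, (_score, issn) in enumerate(entries):
            # The top quarter of the ranking is Q1, the next quarter Q2, and so on.
            result[(issn, topic)] = f"Q{rank * 4 // count + 1}"
    return result


# Made-up scores: 2.0 is the top score in mathematics but the lowest in molecular biology.
journals = [
    ("M1", "mathematics", 2.0), ("M2", "mathematics", 1.5),
    ("M3", "mathematics", 1.0), ("M4", "mathematics", 0.5),
    ("B1", "molecular biology", 8.0), ("B2", "molecular biology", 6.0),
    ("B3", "molecular biology", 4.0), ("B4", "molecular biology", 2.0),
]
quartiles = quartiles_by_topic(journals)
print(quartiles[("M1", "mathematics")], quartiles[("B4", "molecular biology")])
```

Keying the result on (issn, topic) mirrors the change to the E-R diagram: a quartile can no longer be looked up by issn alone.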

[E-R diagram: as before, with Journals catalogue (issn, title, topics), Quartiles (issn, quartile), Romeo Sherpa (colours), DOAJ (Romeo gold; issn, APC) and Hybrid publisher (issn, APC), plus an additional topics element]

Data management planning at Wageningen University

Wageningen UR policy – What’s in place

● Data management plan for PhD projects
● Data management plans for research groups
● Data management planning course
● Options for data publishing
● Code Repository
● “Support hub”

Wageningen UR data policy – What needs to be resolved

● Registration and accessibility of data for ongoing research
● Storage (security, “getting rid of external hard drives”)
● Research notes
● Legal issues?

Day-to-day issues (from a workshop for PE&RC)

● We are human
● Synchronizing between different platforms
● Relationships between files
● What is a logical file / folder structure?
● Collaborating on files

Some terminology: retention

Retention: the obligation to produce, upon request, the data underlying publications for a certain period

● For verification purposes or as a basis for further work
● Often required by scientific organizations or publishers
● The “Netherlands Code of Conduct for Academic Practice” requires 10 years
● The rule is seldom enforced

More terminology: ‘long term storage’

‘Long term storage’ is the term used in the DMP format

‘Long term’ means:

● With sufficient documentation at project, file and parameter / variable level

● In a format that is usable in the future (so preferably “flat files”)
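As an illustration of the two points above, the sketch below writes a small table as a flat CSV file and puts the variable-level documentation in a second flat file (a codebook). The variable names, units and values are invented for the example, and the helper name is hypothetical; the DMP format itself does not prescribe this layout.

```python
import csv
import io


def to_flat_file(rows, fieldnames):
    """Serialize a list of dicts as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Invented measurements, stored as plain text rather than in a binary format.
measurements = [
    {"plot_id": "A1", "yield_kg_ha": 5200},
    {"plot_id": "A2", "yield_kg_ha": 4870},
]

# Variable-level documentation: one row per column of the data file.
codebook = [
    {"variable": "plot_id", "unit": "", "description": "Identifier of the field plot"},
    {"variable": "yield_kg_ha", "unit": "kg/ha", "description": "Grain yield at harvest"},
]

data_csv = to_flat_file(measurements, ["plot_id", "yield_kg_ha"])
codebook_csv = to_flat_file(codebook, ["variable", "unit", "description"])

print(data_csv)
print(codebook_csv)
```

Project-level and file-level documentation would typically go in a separate plain-text readme stored next to these files.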

More terminology: ‘publishing data’

We prefer “Data Publishing” as it implies making the data persistently accessible

● That’s only possible in a service with a long-term mission
● It should come with a persistent identifier, independent of its current or future location

Persistent identifiers

http://hdl.handle.net/1902.1/UOVMCPSWOL

http://dx.doi.org/10.1594/PANGAEA.701380

● Scheme / resolver: http://hdl.handle.net/, http://dx.doi.org/
● Prefix (identifying institution): 1902.1, 10.1594
● Suffix (identifying this dataset): UOVMCPSWOL, PANGAEA.701380

To get a persistent identifier for your dataset you need to store it with a service, and the resolver will redirect users there
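The three parts of a persistent identifier can be separated mechanically. The sketch below splits the two example identifiers from the slide into resolver, prefix and suffix; the function name is hypothetical, and it assumes the identifier is written as a resolver URL followed by prefix/suffix.

```python
from urllib.parse import urlsplit


def split_persistent_identifier(url):
    """Split a resolver URL into (resolver, prefix, suffix)."""
    parts = urlsplit(url)
    resolver = f"{parts.scheme}://{parts.netloc}/"
    # The prefix ends at the first slash; the suffix may itself contain slashes.
    prefix, _, suffix = parts.path.lstrip("/").partition("/")
    if not prefix or not suffix:
        raise ValueError(f"not a prefix/suffix identifier: {url!r}")
    return resolver, prefix, suffix


for pid in [
    "http://hdl.handle.net/1902.1/UOVMCPSWOL",
    "http://dx.doi.org/10.1594/PANGAEA.701380",
]:
    print(split_persistent_identifier(pid))
```

Only the resolver part points to a web address; the prefix and suffix stay the same when the dataset moves, which is what makes the identifier persistent.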

An example

An example (continued)

An example (continued 2)

Publish all data?

Services (1)

Disciplinary services with a specific data model
● EBI, NCBI (bioinformatics), for example SRA
● Pangaea (spatial)
● GBIF (Biodiversity)

Generic (multidisciplinary) services

Services (2)

DANS
● URL: http://www.dans.knaw.nl/en/
● Single file size: unknown
● Total disk space: n.a.
● Paid: € 2.85 per GB (WUR covers first 500 GB)
● Private/public: Public (part of Royal Dutch Academy for Sciences – KNAW)
● Special relationships: Wageningen UR Library acts as front office

3TU Datacentrum
● URL: http://datacentrum.3tu.nl/en/home/
● Single file size: -
● Total disk space: n.a.
● Paid: € 3.50 per GB (WUR covers first 500 GB)
● Private/public: Public, owned by Dutch Technical Universities
● Special relationships: Wageningen UR Library acts as front office

Dryad
● URL: http://datadryad.org/
● Single file size: 2GB
● Total disk space: Extra charge for larger sets
● Paid: $120 (> 20 GB extra charge)
● Private/public: Not-for-profit company governed by members
● Special relationships: Reduced fee or free for certain journals, see http://datadryad.org/pages/journalLookup

Figshare
● URL: https://figshare.com/
● Single file size: 5GB
● Total disk space: 20 GB
● Paid: N
● Private/public: Private, Macmillan inc.
● Special relationships: Embedded in PLOS article submission workflow

Zenodo
● URL: http://www.zenodo.org/
● Single file size: 2GB
● Total disk space: “Please be aware that we cannot offer infinite space for free, so donations from heavy users towards sustainability are encouraged”
● Paid: N
● Private/public: Public, CERN
● Special relationships: EU (output of the Openaire plus project and used for data in the EU data management pilot)
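For the two services with a per-GB fee, the cost to a researcher can be estimated as follows. This is a rough sketch under assumptions the slide does not spell out: that the fee is a one-time charge per GB, and that only the volume above the 500 GB covered by WUR is billed. The function name is hypothetical.

```python
def deposit_cost_eur(size_gb, rate_per_gb, covered_gb=500):
    """Estimated cost in euros: only the volume above the covered amount is billed."""
    billable_gb = max(0, size_gb - covered_gb)
    return round(billable_gb * rate_per_gb, 2)


DANS_RATE = 2.85  # euro per GB, from the comparison above
TU3_RATE = 3.50   # euro per GB, from the comparison above

for size in (400, 600):
    print(size, "GB:", deposit_cost_eur(size, DANS_RATE), "EUR at DANS,",
          deposit_cost_eur(size, TU3_RATE), "EUR at 3TU Datacentrum")
```

Under these assumptions a dataset below 500 GB costs the researcher nothing at either service.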

That’s all
