Principles for proper data management and reuse--An RDA view
Uploaded by: research-data-alliance
2
Does RDA have one view? Yes and no.
RDA is basically a bottom-up organization driven by the many
“creative” minds who want to change data practices.
RDA now has about 2000 members, so do we have 2000 opinions?
We have had an intensive discussion process since 2012 (ICRI
Conference, Copenhagen), and we can see a number of trends and
principles that all or most members seem to agree with.
Still, RDA is a very young initiative and needs
much attention and grease.
Clarification
3
Why is this all relevant?
Naoyuki Tsunematsu (JST):
• Data exchange (and thus the need for proper data
management) is difficult to convey in Japanese science
• parallel trends observed for Japanese Science
• not so often included in collaborations anymore
• not so often represented in the top papers
• enormous decrease in international ranking
• serious worries about counterproductive encapsulation
• this concern seems to be relevant for all of us
6
Trends III – Re-Usage
[Figure label: Domain of trusted Repositories]
• Data will be re-used in different contexts
• Data needs to be findable, accessible, combinable and
interpretable for others
7
Data Practices I – Survey
~120 interviews/interactions
2 workshops with leading scientists (EU, US)
too much is done manually or via ad hoc scripts
too much is in legacy formats (no PID & MD)
there are lighthouse projects etc., but ...
DM and DP are not efficient and too expensive
(a biologist spends 75% of his time as data manager)
federating data incl. logical information is much too expensive
hardly any use of automated workflows, and a lack of
reproducibility
9
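[Figure: bar chart of metadata standards in use. Categories: DIF, DwC, DC, EML, FGDC, OpenGIS, ISO, My Lab, none. Bar values (pairing with categories not recoverable): 12, 21, 26, 95, 95, 96, 97, 266, 676]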
Data Practices III - Metadata
slide by Bill Michener, DataONE
10
lack of proper documentation, schemas, semantics, relations, etc.
directory structures, spreadsheets, etc. are ad hoc creations, and knowledge fades away, etc.
Data Practices II – Data Entropy
11
[Figure labels: Community Center, Common Data Center]
Changes needed – EUDAT and others
many excellent projects are working on changes: ESFRI projects, DataNet projects, e-Infrastructures, national projects
RDA needs to build on experiences and expertise
12
RDA widely agreed I – time to change
management of data objects is largely independent of data type and
discipline
still, every project defines its own strategies, leading to a huge stack of
software that will not be maintainable
13
RDA widely agreed II – time to change
[Figure: layered architecture. Labels: Value Added Services, Data Sources, Persistent Identifiers, Persistent Reference, Analysis, Citation, Apps, Custom Clients, Plug-Ins, Resolution System, Typing, PID, Local Storage, Cloud, Computed, Data Sets, RDBMS, Files, Digital Objects]
[Figure: digital object model. Components: PID record (attributes), bit sequence (instance), metadata (attributes). Relation labels: "points to instances", "describes properties", "describes properties & context", "point to each other"]
14
RDA Results I: common data model
• PIDs at the beginning of the trust chain
• have a worldwide, independent and robust PID system
(DONA Handles – DOIs are Handles)!
• metadata are essential in an anonymous data world
taken from RDA WG Data Foundation & Terminology
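The common data model above can be sketched in a few lines of Python. This is only an illustration, not the WG's specification: the class names, the field names, the `pid:` strings and the in-memory `Registry` are assumptions made for the example. It shows a PID record that points to the instances of a bit sequence, a metadata record that describes properties and context, and the two pointing to each other.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class PidRecord:
    """Attributes stored with a persistent identifier (hypothetical field names)."""
    pid: str
    locations: list[str]   # points to the instances (copies) of the bit sequence
    metadata_pid: str      # pointer from the PID record to the metadata


@dataclass
class Metadata:
    """Describes properties and context of the data (hypothetical field names)."""
    pid: str
    data_pid: str          # pointer back to the described digital object
    properties: dict[str, str] = field(default_factory=dict)


class Registry:
    """In-memory stand-in for a PID resolution system (assumption, not a real service)."""

    def __init__(self) -> None:
        self._records: dict[str, PidRecord | Metadata] = {}

    def register(self, record: PidRecord | Metadata) -> None:
        self._records[record.pid] = record

    def resolve(self, pid: str) -> PidRecord | Metadata:
        return self._records[pid]


registry = Registry()
registry.register(PidRecord("pid:data-1", ["file:///archive/a.bin"], "pid:md-1"))
registry.register(Metadata("pid:md-1", "pid:data-1", {"creator": "example lab"}))

# Start from the PID (beginning of the trust chain) and follow it to the metadata.
data_record = registry.resolve("pid:data-1")
md_record = registry.resolve(data_record.metadata_pid)
```

Starting from the PID alone, a consumer reaches both the stored instances and the describing metadata, which is why the slide places PIDs at the beginning of the trust chain.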
15
result: a registry for data types
you get an unknown file, pull it onto the DTR,
and its content is visualized
extended MIME type concept
no free lunch: someone needs to
register and define the type
code available at the beginning of 2015
PIT demo already working with the DTR
RDA Results II: Data Type Registry
[Figure: data type registry workflow, steps numbered 1 to 4. Labels: Federated Set of Type Registries; Domain of Services; Visualization, Data Processing, Dissemination; Data Set (bit sequence 1010011010101….); Terms:…, Rights, Agree; Visualization, Processing, Interpretation; Human or Machine Consumers]
• NIST is already working with
communities on far-reaching ideas
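The Data Type Registry idea above (an extended MIME type concept, where someone first registers a type and consumers then look it up to visualize unknown content) can be illustrated with a minimal sketch. It is not the DTR code or its API: the registry is a plain in-memory dictionary, and the type identifiers such as `type:example/csv` and all function names are hypothetical.

```python
from __future__ import annotations

import json
from typing import Callable

# Hypothetical in-memory registry: type identifier -> (description, handler).
TypeHandler = Callable[[bytes], str]
_registry: dict[str, tuple[str, TypeHandler]] = {}


def register_type(type_pid: str, description: str, handler: TypeHandler) -> None:
    """No free lunch: someone has to register and define the type first."""
    _registry[type_pid] = (description, handler)


def visualize(type_pid: str, payload: bytes) -> str:
    """Look up the registered type and hand the bit sequence to its handler."""
    if type_pid not in _registry:
        raise KeyError(f"type {type_pid!r} is not registered")
    _description, handler = _registry[type_pid]
    return handler(payload)


def show_csv(payload: bytes) -> str:
    """Render comma-separated rows as pipe-separated text."""
    rows = [line.split(",") for line in payload.decode("utf-8").splitlines()]
    return "\n".join(" | ".join(cells) for cells in rows)


def show_json(payload: bytes) -> str:
    """Pretty-print a JSON document."""
    return json.dumps(json.loads(payload), indent=2, sort_keys=True)


register_type("type:example/csv", "comma-separated table", show_csv)
register_type("type:example/json", "JSON document", show_json)

# A consumer that only knows the type identifier can still render the content.
rendered = visualize("type:example/csv", b"site,temp\nA,12.5\n")
```

An unregistered identifier raises `KeyError`, which mirrors the point that a type is only usable once somebody has defined it.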
16
result: a generic API and a set of basic attributes
a PID record is like a passport (number, photo, expiry date, etc.)
if all PID service providers agree on one API and talk the same language
(registered terms), software development will become easy
test installation in operation together with the DTR
RDA Results III: PID Information Types
LOC      location, path
CKSM     checksum
CKSM_T   checksum type
RoR      owning repository
MD       path to MD
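A minimal sketch of how the basic attributes listed above could be used for an integrity check. The attribute names (LOC, CKSM, CKSM_T, RoR, MD) come from the slide; everything else is an assumption for illustration: the record is a plain dictionary rather than a real PID service response, the helper names are hypothetical, and the example paths and repository name are invented.

```python
from __future__ import annotations

import hashlib


def make_pid_record(data: bytes, location: str, repository: str, md_path: str) -> dict[str, str]:
    """Build a record with the basic attributes; SHA-256 is an arbitrary choice here."""
    return {
        "LOC": location,                          # location, path
        "CKSM": hashlib.sha256(data).hexdigest(),  # checksum
        "CKSM_T": "sha256",                       # checksum type
        "RoR": repository,                        # owning repository
        "MD": md_path,                            # path to metadata
    }


def verify(record: dict[str, str], data: bytes) -> bool:
    """Recompute the checksum with the algorithm named in CKSM_T and compare."""
    digest = hashlib.new(record["CKSM_T"], data).hexdigest()
    return digest == record["CKSM"]


record = make_pid_record(
    b"1010011010101",
    location="file:///archive/dataset.bin",   # invented example path
    repository="example-repository",          # invented repository name
    md_path="file:///archive/dataset.md.xml",  # invented example path
)
intact = verify(record, b"1010011010101")
corrupted = verify(record, b"1010011010100")
```

Because the checksum type travels with the record, a client does not need to know in advance which algorithm a given repository used.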
17
due to unforeseen circumstances, the work needs until P5
Practical Policies = executable workflow statements
result at P5: a set of best-practice PPs for a number of typical DM/DP
tasks (integrity check, replication, etc.)
a large collection of PPs is currently being evaluated
you could add your policies
RDA Results IV: Practical Policies
[Figure: policy inventory listing replication policy X, replication policy Y, integrity policy A, integrity policy B, integrity policy C, md extraction policy l, md extraction policy k, etc. Other labels: Policy Inventory, Repository, selection, implementation, execution, data manager]
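The notion of practical policies as executable workflow statements, selected from an inventory and then executed, might look roughly like the sketch below. The policy names are borrowed from the figure labels, but their implementations, the decorator, and the `run` helper are assumptions for illustration and not the WG's actual policies; replicas are simulated as sub-directories of a temporary folder.

```python
from __future__ import annotations

import hashlib
import shutil
import tempfile
from pathlib import Path
from typing import Callable

Policy = Callable[[Path, list[Path]], bool]
POLICY_INVENTORY: dict[str, Policy] = {}  # hypothetical inventory: name -> policy


def policy(name: str) -> Callable[[Policy], Policy]:
    """Register a function in the inventory under the given name."""
    def decorator(func: Policy) -> Policy:
        POLICY_INVENTORY[name] = func
        return func
    return decorator


@policy("replication policy X")
def replicate(source: Path, targets: list[Path]) -> bool:
    """Copy the source file into every target directory."""
    for target in targets:
        target.mkdir(parents=True, exist_ok=True)
        shutil.copy2(source, target / source.name)
    return True


@policy("integrity policy A")
def check_replicas(source: Path, targets: list[Path]) -> bool:
    """Confirm that every replica has the same SHA-256 digest as the source."""
    expected = hashlib.sha256(source.read_bytes()).hexdigest()
    return all(
        hashlib.sha256((target / source.name).read_bytes()).hexdigest() == expected
        for target in targets
    )


def run(selected: list[str], source: Path, targets: list[Path]) -> dict[str, bool]:
    """Selection and execution: run the chosen policies in order, report each result."""
    return {name: POLICY_INVENTORY[name](source, targets) for name in selected}


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    source = root / "dataset.bin"
    source.write_bytes(b"1010011010101")
    report = run(
        ["replication policy X", "integrity policy A"],
        source,
        [root / "site1", root / "site2"],
    )
```

The data manager only chooses names from the inventory; adding a new policy means registering one more function.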
18
need to place the many RDA WGs & IGs on a common landscape, since
in the end everything needs to fit together -> Data Fabric
RDA ongoing: Data Fabric
19
Changes take long ...
[Timeline: 1973 TCP/IP Specification; 1977 TCP/IP Stress-test; 1990 and 1993: worldwide adoption, WWW-Mosaic available]
many different suggestions & protocols
at first no advantage for TCP/IP
at the beginning, discussion about different email systems
at the beginning, no interest from researchers or from industry
(toy of some freaks)
required some top-down decisions to enforce unification
20 years!
20
RDA is about global bridge building
RDA is about building the social and technical bridges that
enable global open sharing of data.
Researchers, scientists, data practitioners from around the
world are invited to work together to achieve the vision.
Funders: NSF, EC, AU Gov, Japan, Brazil, DE?, UK?, ZA?, FI?, etc.
21
Thanks for your attention.
http://www.rd-alliance.org
http://europe.rd-alliance.org
22
see the Science 2.0 initiative of the EC
the number of researchers is increasing enormously
there is pressure in the direction of Grand Challenges
and topics relevant to societies
research is increasingly data-intensive
border-crossing research is a fact (countries, disciplines)
faster cycles (hypothesis – analysis – publications – reviews)
Trends IV: research is changing