Scientific data management (v2)
Scientific Data Management
A tutorial at ICADL 2011
October 24, 2011
Jian Qin
School of Information Studies, Syracuse University
http://eslib.ischool.syr.edu/
The morning ahead
12/18/11 15:51 Overview of E-Science 2
An environmental scan
• E-Science, cyberinfrastructure, and data
• What do all these have to do with me?
Case study: Gravitational wave research data management
Group work: Role play in developing data management initiatives
Overview of E-Science
An environmental scan
• E-Science, cyberinfrastructure, and data
• What do all these have to do with me?
Characteristics of e-science
Data sets, data collections, and data repositories
Why does it matter to libraries?
E-Science
“In the future, e-Science will refer to the large scale science that will increasingly be carried out through distributed global collaborations enabled by the Internet.”
National e-Science Center. (2008). Defining e-Science. http://www.nesc.ac.uk/nesc/define.html
E-‐Infrastructure for the research lifecycle
http://epubs.cclrc.ac.uk/bitstream/3857/science_lifecycle_STFC_poster1.PDF
Shift in Science Paradigms
• Thousand years ago: Science was empirical, describing natural phenomena
• A few hundred years ago: Theoretical branch, using models and generalizations
• A few decades ago: A computational approach, simulating complex phenomena
• Today: Data exploration (eScience), unifying theory, experiment, and simulation
– Data captured by instruments or generated by simulator
– Processed by software
– Information/knowledge stored in computer
– Scientist analyzes database/files using data management and statistics
Gray, J. & Szalay, A. (2007). eScience – A transformed scientific method. http://research.microsoft.com/en-us/um/people/gray/talks/NRC-CSTB_eScience.ppt
X-Info
• The evolution of X-Info and Comp-X for each discipline X
• How to codify and represent our knowledge
The Generic Problems
• Data ingest
• Managing a petabyte
• Common schema
• How to organize it
• How to reorganize it
• How to share with others
• Query and Vis tools
• Building and executing models
• Integrating data and literature
• Documenting experiments
• Curation and long-term preservation
[Diagram: Experiments & Instruments, Simulations, Literature, and Other Archives contribute facts; questions go in and answers come out.]
Gray, J. & Szalay, A. (2007). eScience – A transformed scientific method. http://research.microsoft.com/en-us/um/people/gray/talks/NRC-CSTB_eScience.ppt
Useful resources
• What is eScience?
• eScience Initiatives
• Science Research and Data
• Science Data Management
• Literature Reviews
• Data Policy Issues
• eScience Research Centers
• http://eslib.ischool.syr.edu/index.php?option=com_content&view=section&id=9&Itemid=83
http://research.microsoft.com/en-us/collaboration/fourthparadigm/
A FEW IMPORTANT CONCEPTS
Data
Any and all complex data entities from observations, experiments, simulations, models, and higher order assemblies, along with the associated documentation needed to describe and interpret the data.
An artist’s conception (above) depicts fundamental NEON observatory instrumentation and systems as well as potential spatial organization of the environmental measurements made by these instruments and systems. http://www.nsf.gov/pubs/2007/nsf0728/nsf0728_4.pdf
Scientific data formats
• Common data format
• Image formats
• Matrix formats
• Microarray file formats
• Communication protocols
Scientific datasets
• The scientific data set, or SDS, is a group of data structures used to store and describe multidimensional arrays of scientific data.
• The boundaries of datasets vary from discipline to discipline
NCSA HDF Development Group. (1998). HDF 4.1r2 User's Guide. http://www.hdfgroup.org/training/HDFtraining/UsersGuide/SDS_SD.fm1.html#48894
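As a rough illustration of the SDS idea above, the following Python sketch bundles a multidimensional array with the information needed to describe it. It is a plain-Python stand-in, not the HDF library API; the class name `ScientificDataSet`, its fields, and the sample values are all hypothetical.

```python
from dataclasses import dataclass, field
from math import prod


@dataclass
class ScientificDataSet:
    """Hypothetical stand-in for an SDS: an array plus its description."""
    name: str                      # label for the array
    dtype: str                     # declared element type, e.g. "float32"
    shape: tuple                   # size of each dimension
    values: list                   # elements stored flat, in row-major order
    attributes: dict = field(default_factory=dict)  # free-form metadata

    def __post_init__(self):
        # The flat list must hold exactly one value per array cell.
        if len(self.values) != prod(self.shape):
            raise ValueError("number of values does not match shape")

    def get(self, *index):
        """Return the element at a multidimensional index."""
        if len(index) != len(self.shape):
            raise IndexError("wrong number of indices")
        offset = 0
        for i, size in zip(index, self.shape):
            if not 0 <= i < size:
                raise IndexError("index out of range")
            offset = offset * size + i   # row-major offset
        return self.values[offset]


# Made-up example: a 2 x 3 grid of temperature readings.
sst = ScientificDataSet(
    name="sea_surface_temp",
    dtype="float32",
    shape=(2, 3),
    values=[14.1, 14.3, 14.8, 15.0, 15.2, 15.9],
    attributes={"units": "degC"},
)
print(sst.get(1, 2), sst.attributes["units"])
```

Keeping the descriptive attributes in the same object as the array is the point of the sketch: the numbers are not interpretable without the name, type, dimensions, and units that travel with them.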
Scientific workflows
• Steps in data collection and analysis process
• Different types of scientific workflows:
– Data-intensive
– Compute-intensive
– Analysis-intensive
– Visualization-intensive
Ludäscher, B., Altintas, I., Berkley, C., Higgins, D., Jaeger, E., Jones, E., Lee, E.A., Tao, J., & Zhao, Y. (2006). Scientific workflow management and the Kepler system. Concurrency and Computation: Practice and Experience, 18(10): 1039-1065.
Example: Ecological dataset
• Floristic diversity data
– Related links
– Data attributes
– Download link
Example: Biodiversity dataset
• Actions for Porcupine Marine Natural History Society - Marine flora and fauna records from the North-east Atlantic
– Metadata record output in different standard formats
– URL for dataset download
Example: The Significant Earthquake Database
• The Significant Earthquake Database
– A database containing data about significant earthquake events and the damage caused
– An interface for extracting a subset of data
– A link to download the whole dataset
– Documentation
Social Science Data
Research data collections
[Diagram: a spectrum of research data collections, labeled data output, size, metadata, management, and standards]
• Size: from smaller, team-based to larger, discipline-based
• Metadata and standards: from none or random to multiple, comprehensive
• Management: from a heroic individual inside the team to organized, institutionalized
Research collections
• Limited processing or long-term management
• Not conformed to any data standards
• Varying sizes and formats of data files
• Low level of processing, lack of plan for data products
• Low awareness of metadata standards and data management issues
Resource collections
• Authored by a community of investigators, within a domain of science or engineering
• Developed with community-level standards
• Lifetime is between mid- and long-term
• Example: Hubbard Brook Ecosystem Study (http://www.hubbardbrook.org)
– One of the regional sites in the Long Term Ecological Research Network (LTER)
– Community of the ecological domain
– Community of investigators from around the country on ecosystem study
– Ecological Metadata Language (EML), a community-level standard
– Cataloged, searchable dataset collections
Reference collection
• Example: Global Biodiversity Information Facility
– Created by large segments of science community
– Conform to robust, well-established and comprehensive standards, e.g.
  • ABCD (Access to Biological Collection Data)
  • Darwin Core
  • DiGIR (Distributed Generic Information Retrieval)
  • Dublin Core Metadata standard
  • GGF (Global Grid Forum)
  • Invasive Alien Species Profile
  • LSID (Life Sciences Identifier)
  • OGC (Open Geospatial Consortium)
http://www.gbif.org/informatics/discoverymetadata/a-metadata-infrastructure/
http://www.tdwg.org/standards/
Datasets, data collections, and data repositories
• Data collections are built for larger segments of science and engineering
• Datasets
– typically centered around an event or a study
– contain a single file or multiple files in various formats
– coupled with documentation about the background of data collection and processing
Data repository
System for storing, managing, preserving, and providing access to datasets
• A repository may contain one or more data collections
• A data collection may contain one or more datasets
• A dataset may contain one or more data files
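The containment rules above (repository, collection, dataset, file) can be sketched as nested Python data classes. This is only an illustration of the hierarchy, not the data model of any particular repository system; every class, field, and sample name below is hypothetical.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class DataFile:
    name: str
    file_format: str          # e.g. "csv", "hdf"


@dataclass
class Dataset:
    title: str
    documentation: str        # background on collection and processing
    files: List[DataFile] = field(default_factory=list)


@dataclass
class DataCollection:
    name: str
    datasets: List[Dataset] = field(default_factory=list)


@dataclass
class Repository:
    name: str
    collections: List[DataCollection] = field(default_factory=list)

    def iter_files(self) -> Iterator[Tuple[str, str, str]]:
        """Walk the hierarchy, yielding (collection, dataset, file) names."""
        for collection in self.collections:
            for dataset in collection.datasets:
                for data_file in dataset.files:
                    yield collection.name, dataset.title, data_file.name


# Made-up example content.
survey = Dataset(
    title="Stream chemistry survey",
    documentation="Sampling protocol and processing notes",
    files=[DataFile("samples.csv", "csv"), DataFile("sites.csv", "csv")],
)
repo = Repository(
    name="Campus data repository",
    collections=[DataCollection("Ecology", datasets=[survey])],
)
for path in repo.iter_files():
    print(" / ".join(path))
```

Each level holds a list of the level below it, which mirrors the "one or more" wording on the slide.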
An emerging trend in academic libraries
Initiatives in research libraries
• Pressure points:
– Lack of resources
– Difficulty acquiring the appropriate staff and expertise to provide eScience and data management or curation services
– Lack of a unifying direction on campus
Data support and services in institutions: 45%
Libraries involved in supporting eScience: 73%
Source: Soehner, C., Steeves, C. & Ward, J. (2010). E-Science and data support services: A study of ARL member institutions. http://www.arl.org/bm~doc/escience_report2010.pdf
Data management challenges
• No one-size-fits-all solution
• Requires an in-depth understanding of scientific workflows and research lifecycle
• Involves not only technical design and planning but also organizational collaboration and institutionalization of data policy
Data preservation challenges
• Data formats
– Vary in data types, e.g. vector and raster data types
– Format conversions, e.g. from an old version to a newer one
• Data relations
– e.g. there are data models, annotations, classification schemes, and symbolization files for a digital map
• Semantic issues
– Naming datasets and attributes
Data access challenges
• Reliability
• Authenticity
• Leverage technology to make data access easier and more effective
– Cross-database search
– Integration applications
Supporting digital research data
• Lifecycle of research data
– Create: data creation/capture/gathering from laboratory experiments, field work, surveys, devices, media, simulation output…
– Edit: organize, annotate, clean, filter…
– Use/reuse: analyze, mine, model, derive additional data, visualize, input to instruments/computers
– Publish: disseminate data via portals and associate datasets with research publications
– Preserve/destroy: store/preserve, store/replicate/preserve, store/ignore, destroy…
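The lifecycle stages listed above can be written down as a small Python enumeration. The sketch assumes the stages follow one another strictly in the order shown on the slide, which is a simplification (real data often loops back from use to editing); the names `Stage`, `advance`, and `stages_through` are hypothetical.

```python
from enum import Enum
from typing import List, Optional


class Stage(Enum):
    CREATE = "create"
    EDIT = "edit"
    USE_REUSE = "use/reuse"
    PUBLISH = "publish"
    PRESERVE_OR_DESTROY = "preserve/destroy"


# Assumption: a linear order, taken from the order of the list above.
ORDER = list(Stage)


def advance(current: Stage) -> Optional[Stage]:
    """Return the stage that follows `current`, or None after the last one."""
    position = ORDER.index(current)
    if position + 1 < len(ORDER):
        return ORDER[position + 1]
    return None


def stages_through(current: Stage) -> List[str]:
    """List every stage a dataset has passed through, up to `current`."""
    return [stage.value for stage in ORDER[: ORDER.index(current) + 1]]


print(advance(Stage.CREATE))
print(stages_through(Stage.USE_REUSE))
```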
Supporting data management
The data deluge
• Numerical, image, video
• Models, simulations, bit streams
• XML, CSV, DB, HTML
Researchers need:
• Specialized search engines to discover the data they need
• Powerful data mining tools to use and analyze the data
Research data management
• Institution: financial and policy support
• Community: user requirements
• Science domain: data content idiosyncrasies
• Institutional repository, community repository, national repository, international repository: evolving and interconnecting
• eScience librarian
Implications for the scholarly communication process
• Publishing: data publishing; new scholarly publishing models (open access, institutional and community repositories, self-publishing, library publishing, ...)
• Curation: maintaining, preserving and adding value to digital research data throughout its lifecycle.
• Archiving: the long-term storage, retrieval, and use of scientific data and methods.
12/18/11 15:50 Promoting scholarly communication: How to kick things off? 34
Evolution of terminology
Case study 1: Developing an institutional policy for data preservation and sharing
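Current state
[Diagram: Researchers; data and files; college and department servers; campus servers; campus institutional repository; disciplinary repositories; journal and conference paper publication]
• What file formats?
• How are they organized?
• How are they used?
• Can they be shared with people outside the project team?
• If so, under what conditions and rules?
• How are files and data preserved?
• What laws and regulations must be followed?
Is there a disciplinary repository? Has anything been deposited? Is the campus repository linked to disciplinary repositories?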
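Current state
• No unified rules or regulations
• No awareness of file and data management
• No policies on data use and sharing
Goals
• Establish a unified policy for data acquisition, use, management, and sharing
• Build an institutional data repository (campus cyberinfrastructure-enabled support)
• Publicize widely and persuade researchers with facts
Actions!
• Survey existing institutional data policies
• Obtain support from university leadership and relevant departments
• Proof of Concept Project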
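President
• VP for Research: Research office, Library, IT services
• VP for Academic Affairs: iSchool, College⋯
Survey existing institutional data policies, write a report, and offer recommendations to the VP for Research
Collaborate with the relevant university units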
DATA MANAGEMENT PRACTICES IN ACADEMIC LIBRARIES
http://researchdata.wisc.edu/
https://confluence.cornell.edu/display/rdmsgweb/Home
http://libraries.mit.edu/guides/subjects/data-management/
Summary
• Managing research data is motivated by:
– Government funding agency’s policy
– Needs for data sharing, cross validation of data and research, credit, and large-scale interdisciplinary discovery
• Organizational changes:
– New organizational units within the university library or at the university level
– Virtual group
– Collaboration among key units: Libraries, IT services, research administration office
Summary
• Types of services
– Training faculty and students for data literacy
– Data curation services (data repositories, digital libraries, archiving data)
– Consulting services
– Data management plan
– Developing data policies