Dr Darrell Williamson, eResearch Director

eResearch at CSIRO within the National Collaborative Research Infrastructure Strategy IBS PDIW Workshop: Canberra 23 April 2010 Dr Darrell Williamson, eResearch Director


Transcript of Dr Darrell Williamson, eResearch Director

Page 1: Dr Darrell Williamson, eResearch Director

eResearch at CSIRO within the National Collaborative Research Infrastructure Strategy

IBS PDIW Workshop: Canberra, 23 April 2010

Dr Darrell Williamson, eResearch Director

Page 2: Dr Darrell Williamson, eResearch Director

eResearch (AU) = eScience (EU) = Cyberinfrastructure (US)

Page 3: Dr Darrell Williamson, eResearch Director

Overview of Presentation

1. eResearch Challenge - in IBS

2. CSIRO & NCRIS Capabilities

3. eResearch Challenge - in Geophysical Sciences
• AuScope - An eResearch capability in geosciences developed for a CSIRO Flagship & an NCRIS Capability

4. Data Storage

Page 4: Dr Darrell Williamson, eResearch Director

eResearch Challenge: Integrated Biological Systems Research

Enable advances in biology-based scientific research by enabling researchers to:

• Preserve, with shared access, long time series of scientific data spanning many biological systems science disciplines.
• Access, annotate & analyse large-scale, distributed datasets that conform to world-standard data formats & international discipline-based standard metadata schemas.
• Ingest, manage, annotate, analyse, share & publish their own data.
• Develop & integrate modelling, simulation & visualisation tools on high-end computing facilities.
• Use complex scientific workflows that automate research tasks.
• Remotely manage & operate facilities, instruments & sensor networks.

Page 5: Dr Darrell Williamson, eResearch Director

NCRIS: 12 Capabilities

Capabilities:
• Evolving Biomolecular Platforms & Informatics
• Integrated Biological Systems
• Characterisation
• Fabrication
• Biotechnology Products
• Networked Biosecurity Framework
• Optical & Radio Astronomy
• Integrated Marine Observing System
• Structure & Evolution of the Australian Continent
• Terrestrial Ecosystem Research Network
• Population Health Research Network
• Platforms for Collaboration

Page 6: Dr Darrell Williamson, eResearch Director

CSIRO: Visualisation of Large Data Sets

100 Mpixel image

24 Mpixel image

Around 4 Mpixels resolution per 30” screen (2560 x 1600)
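The figures on this slide imply how many screens a tiled display would need. The following is a minimal arithmetic sketch, assuming the images are tiled across identical 30” panels with no overlap or bezel compensation (an assumption; the slide does not describe the display layout). The function name `panels_needed` is illustrative.

```python
import math

# Resolution of one 30-inch panel, as given on the slide.
PANEL_WIDTH_PX = 2560
PANEL_HEIGHT_PX = 1600


def panels_needed(image_megapixels: float) -> int:
    """Return how many panels are needed to show every pixel of an image.

    Assumes simple tiling with no overlap (an illustrative assumption).
    """
    panel_pixels = PANEL_WIDTH_PX * PANEL_HEIGHT_PX  # 4,096,000 pixels
    return math.ceil(image_megapixels * 1_000_000 / panel_pixels)


if __name__ == "__main__":
    for megapixels in (24, 100):
        print(f"{megapixels} Mpixel image: {panels_needed(megapixels)} panels")
```

Under these assumptions a 24 Mpixel image needs 6 panels and a 100 Mpixel image needs 25, since each panel holds about 4.1 Mpixels.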

Page 7: Dr Darrell Williamson, eResearch Director

NCRIS: 11 + 1 = 12 Capabilities

Capabilities:
• Evolving Biomolecular Platforms & Informatics
• Integrated Biological Systems
• Characterisation
• Fabrication
• Biotechnology Products
• Networked Biosecurity Framework
• Optical & Radio Astronomy
• Integrated Marine Observing System
• Structure & Evolution of the Australian Continent
• Terrestrial Ecosystem Research Network
• Population Health Research Network

• Platforms for Collaboration

Page 8: Dr Darrell Williamson, eResearch Director

NCRIS Capability: Platforms for Collaboration

• Australian Academic & Research Network (AARNet)
• Australian Access Federation (AAF)
• Australian National Data Service (ANDS)
• Australian Research Collaboration Service (ARCS)
• National eResearch Architecture Taskforce (NeAT)
• National Computational Infrastructure (NCI)

Page 9: Dr Darrell Williamson, eResearch Director

ANDS: CSIRO Meta-Data Integration Module

• Collects & integrates meta-data from online sources.
• Data collection supports various formats & source types:
  • HTTP / FTP / SOAP / REST / JDBC / Filesystem
  • XML-RDF / OAI-PMH / CSV / HTML / Proprietary formats
• Collected data is mapped to a domain-specific ontology.
• Original and mapped data is stored in a Fedora repository.
• Data can be queried through:
  • Full text queries.
  • Structured XML queries.
  • Semantic queries (SPARQL).
• Data can be retrieved through:
  • SOAP API
  • REST API
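The collect-then-map step described above can be sketched in a few lines. This is not the module's actual code: it is a minimal illustration that parses an embedded OAI-PMH `ListRecords` response carrying Dublin Core metadata and maps it to ontology terms. The XML namespaces are the published OAI-PMH 2.0 and Dublin Core ones; the sample record, the `bio:` term names, and the function `map_records` are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespaces defined by the OAI-PMH 2.0 and Dublin Core specifications.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hypothetical mapping from Dublin Core elements to domain ontology terms.
DC_TO_ONTOLOGY = {
    "title": "bio:datasetName",
    "creator": "bio:custodian",
    "subject": "bio:taxonOrTopic",
}

# Invented sample response, standing in for a harvested online source.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example occurrence records</dc:title>
          <dc:creator>Example custodian</dc:creator>
          <dc:subject>Eucalyptus</dc:subject>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def map_records(xml_text):
    """Return one dict per record, keyed by the mapped ontology terms."""
    root = ET.fromstring(xml_text)
    mapped = []
    for record in root.findall(".//oai:record", NS):
        identifier = record.findtext("oai:header/oai:identifier", namespaces=NS)
        entry = {"source_id": identifier}
        dc_root = record.find("oai:metadata/oai_dc:dc", NS)
        if dc_root is not None:
            for element in dc_root:
                # Tags look like "{namespace}title"; keep the local name only.
                local_name = element.tag.split("}", 1)[-1]
                term = DC_TO_ONTOLOGY.get(local_name)
                if term and element.text:
                    entry.setdefault(term, []).append(element.text.strip())
        mapped.append(entry)
    return mapped


if __name__ == "__main__":
    for item in map_records(SAMPLE_RESPONSE):
        print(item)
```

In a fuller pipeline the original XML and the mapped dictionary would both be stored, matching the slide's note that original and mapped data are kept side by side.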

Page 10: Dr Darrell Williamson, eResearch Director

ANDS: Application in Atlas of Living Australia Biodiversity Information Explorer

[Diagram labels: ALA Biodiversity Explorer; Fedora; SOLR / Lucene; Triple Store; Data Integration Module]

Page 11: Dr Darrell Williamson, eResearch Director

ANDS WRON Data Management

To develop technologies to capture public domain science research data, such as from the CSIRO Water Resources Observation Network (WRON), and populate the Australian Research Data Commons (ARDC) to enable the data resources to be accessed and re-used easily and efficiently.

Overview

CSIRO has key water research data holdings of national significance which are not well publicised within the science communities. CSIRO, in collaboration with the Australian National Data Service (ANDS), has made a commitment to populate the Australian Research Data Commons (ARDC). In this way, CSIRO will be able to ensure that the public data it produces can be made available and become more readily accessible for greater scientific and general community collaborations.

Data Collections

The sample data collections for this project will be from the Sustainable Yields Project from the Water for a Healthy Country Flagship. This is a collaboration between the Land and Water and Marine and Atmospheric Research Divisions of the CSIRO.

The Sustainable Yields Project comprises four significant data collections and a further collection populated from remote sensors. They are:

• Catchment Yield
• Groundwater Modelling
• Water Accounting and Environment
• River Modelling

Who is involved?

CSIRO Land and Water & CSIRO Marine and Atmospheric Research are prime Data Custodians and Data Re-use candidates. They will produce and consume research data, provide context and metadata to the data set or collections, and maintain the instruments and hardware to produce raw data.

CSIRO Information Management & Technology (IM&T) are responsible for the technical development, testing, maintenance and support for ANDS WRON Data Management; delivering the Data Management Service; and for the design, technical development, testing, maintenance and support for the CSIRO metadata repository. They will liaise with ANDS, ARCS, and NCRIS and will publish the data into ARDC.

Research communities (internal and external to the CSIRO) will consume data produced by WRON for their research and will also collaborate with WRON.

What is the approach?

The ANDS WRON Data Management project has a staged approach consisting of the following stages:

Stage 1 - Initiation: Scoping, project plan and approval

Stage 2 - Discovery: Identification of key data sets, standard data formats, metadata schemas, and high level requirements for translation tools and software

Stage 3 - Implementation: Development of software and translation tools, incorporation of new metadata schemas and data formats into the CSIRO metadata repository, enabling the interface between the ARDC and the CSIRO metadata repository, testing all components, and approval for release into production

Stage 4 - Access: Population of the CSIRO metadata repository with datasets from the Sustainable Yields project and the subsequent harvest and population of the ARDC and the WRON.

Stage 5 - Re-use: Expand data sets populated to ARDC and WRON and re-use software to cater for a wider range of WRON data sets.

Stage 6 - Closure: Transfer of technologies and knowledge and sign-off by the project sponsor.

What are the expected project outcomes?

The ANDS WRON Data Management project outcomes are:

• Water research data of national significance will be discoverable and accessible to the broader research community for re-use
• Continued access to and preservation of the data and ability to curate the data into the future
• Technologies associated with the capture and translation of the data will be available for transfer to support other data capture
• Provision of the data will support the ANDS initiative of Public Sector Data Access Infrastructure

Project outputs (tangible deliverables)

The ANDS WRON Data Management project will produce the following high-level outputs.

1. Data cleansing support
2. Licence tracking middleware
3. Converted and pre-processed archive data
4. OAI-PMH Capability
5. Establishment of Access environment
6. ANDS Persistent Identifier (PID) support
7. Storage architecture implementation coordination
8. Current archive copied to new architecture
9. NetCDF metadata harvesting support
10. GeoNetworks feed
11. Embargo support
12. Data collection and structure analysis
13. Current Environment and Technology Analysis
14. Integration with other ANDS services
15. System and User Documentation
16. Transition to Business as Usual

What technology and resources will be used?

The resources available within CSIRO that will be used with this project are:

• Existing Fedora Repository to store WRON metadata for harvesting
• Existing networks and data storage infrastructure
• Data Management and Technical resources experienced in CSIRO Data Management and in the CSIRO Fedora repository
• CSIRO resources experienced in collecting and working with WRON Data.

V0.2 18/11/09
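Two of the listed outputs, licence tracking and embargo support, amount to a filter applied before records are exposed for harvesting. The poster does not say how the project implements them, so the following is only a sketch under stated assumptions: the `DatasetRecord` fields, the rule that a record needs a non-empty licence, and the function names are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class DatasetRecord:
    """Hypothetical metadata record; field names are illustrative only."""
    title: str
    licence: str
    embargo_until: Optional[date] = None  # None means no embargo applies


def is_publishable(record: DatasetRecord, today: date) -> bool:
    """A record may be exposed once it has a licence and any embargo has lapsed."""
    has_licence = bool(record.licence.strip())
    embargo_lapsed = record.embargo_until is None or today >= record.embargo_until
    return has_licence and embargo_lapsed


def harvestable(records: List[DatasetRecord], today: date) -> List[DatasetRecord]:
    """Select the records that a harvester would be allowed to see today."""
    return [record for record in records if is_publishable(record, today)]


if __name__ == "__main__":
    sample = [
        DatasetRecord("Catchment Yield", "example licence"),
        DatasetRecord("River Modelling", "example licence", date(2011, 1, 1)),
        DatasetRecord("Groundwater Modelling", ""),
    ]
    for record in harvestable(sample, date(2010, 4, 23)):
        print(record.title)
```

Keeping the check in one function means the same rule can be applied wherever metadata leaves the repository, whichever harvesting route is used.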

Page 12: Dr Darrell Williamson, eResearch Director

Proposed WRON Infrastructure solution

[Diagram components and annotations:]

• External water data sources: raw data and instrument metadata.
• Ingest middleware: allows for the collection of water-related data and assignment of appropriate metadata.
• Current WRON Data Store: current data is mostly in .CSV or MODIS satellite data format.
• WRON Data Server.
• Reference data partition (shown twice in the diagram).
• Translation middleware: takes the raw output from the current WRON data store and creates self-describing archive files in an appropriate format for later reuse, e.g. NetCDF format.
• WRON iRODS Node: iRODS Server / Rules engine and iRODS Metadata Catalogue.
• Metadata harvesting middleware: reads the various data file headers and creates ANZLIC-compliant metadata records.
• CSIRO Fedora Repository: metadata maintenance is carried out by both researchers and data management specialists on this platform. The platform also provides CSIRO-wide visibility of all datasets held by the organisation. The RIF-CS collection metadata files are created on this system.
• Water Community GeoNetworks node: researchers who are interested in water-related datasets use the GeoNetworks tools to discover datasets of interest. The metadata sourced from the Fedora/iRODS system describes appropriate access methods.
• ANDS Metadata store: the wider Australian research community searches collection-level metadata looking for datasets of interest.
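The translation and metadata-harvesting middleware in this diagram can be illustrated with a small sketch. It does not write real NetCDF or ANZLIC records: to stay within the Python standard library it uses a plain dictionary (serialisable as JSON) as a stand-in for a self-describing archive file, and the harvested record is a simplified summary. The sample CSV, the column name `flow_ml_per_day`, and both function names are hypothetical.

```python
import csv
import io
import json

# Invented sample of raw CSV output, standing in for the current data store.
SAMPLE_CSV = "date,flow_ml_per_day\n2009-11-01,12.5\n2009-11-02,11.9\n"


def csv_to_self_describing(csv_text, variable_units, source_name):
    """Bundle CSV columns with descriptive attributes in one document.

    Mirrors the NetCDF idea of global attributes plus per-variable
    attributes, but is only a stand-in for a real NetCDF file.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    columns = reader.fieldnames or []
    variables = {}
    for name in columns:
        variables[name] = {
            "units": variable_units.get(name, "unknown"),
            "values": [row[name] for row in rows],
        }
    return {
        "global_attributes": {"source": source_name, "record_count": len(rows)},
        "variables": variables,
    }


def harvest_header(archive):
    """Build a minimal discovery record from the archive's own header."""
    attributes = archive["global_attributes"]
    return {
        "source": attributes["source"],
        "record_count": attributes["record_count"],
        "variables": sorted(archive["variables"]),
    }


if __name__ == "__main__":
    archive = csv_to_self_describing(
        SAMPLE_CSV, {"flow_ml_per_day": "ML/day"}, "example gauge"
    )
    print(json.dumps(harvest_header(archive), indent=2))
```

The point of the self-describing step is visible in `harvest_header`: the discovery record is built from the archive alone, without consulting the original data store.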

Page 13: Dr Darrell Williamson, eResearch Director

CSIRO: Capabilities & Path-to-Impact - ‘The Matrix’

10 x National Research Flagships - responsible for Path-to-Impact

5 x Capability Groups - responsible for domain knowledge & Capabilities
• Agribusiness
• Environment
• Manufacturing, Materials & Minerals
• Energy
• Information & Communications

Page 14: Dr Darrell Williamson, eResearch Director

CSIRO: National Flagships - Path-to-Impact

• Climate Adaptation
• Light Metals
• Sustainable Agriculture - IBS
• Energy Transformed
• Minerals Down Under
• Water for a Healthy Country
• Food Futures - IBS
• Preventative Health - IBS
• Wealth from Oceans
• Future Manufacturing

Page 15: Dr Darrell Williamson, eResearch Director

eResearch Challenge: Geophysical Sciences Research (Solution via AuScope)

Enable advances in geophysics-based scientific research by enabling researchers to:

• Preserve, with shared access, long time series of scientific data spanning many geophysical science disciplines.
• Access, annotate & analyse large-scale, distributed datasets that conform to world-standard data formats & international discipline-based standard metadata schemas.
• Ingest, manage, annotate, analyse, share & publish their own data.
• Develop & integrate modelling, simulation & visualisation tools on high-end computing facilities.
• Use complex scientific workflows that automate research tasks.
• Remotely manage & operate facilities, instruments & sensor networks.

Page 16: Dr Darrell Williamson, eResearch Director

CSIRO: National Flagships - Path-to-Impact

• Climate Adaptation
• Light Metals
• Sustainable Agriculture - IBS
• Energy Transformed
• Minerals Down Under - AuScope
• Water for a Healthy Country
• Food Futures - IBS
• Preventative Health - IBS
• Wealth from Oceans
• Future Manufacturing

Page 17: Dr Darrell Williamson, eResearch Director

NCRIS: 12 Capabilities

Capabilities:
• Evolving Biomolecular Platforms & Informatics
• Integrated Biological Systems
• Characterisation
• Fabrication
• Biotechnology Products
• Networked Biosecurity Framework
• Optical & Radio Astronomy
• Integrated Marine Observing System
• Structure & Evolution of the Australian Continent - AuScope
• Terrestrial Ecosystem Research Network
• Population Health Research Network
• Platforms for Collaboration

Page 18: Dr Darrell Williamson, eResearch Director


Scientific Workflow: Research developments in the Geosciences

[Diagram labels: Geological information; Theory; 3D numerical model; Run simulation; Post processing; Storage]

Activities:
- geological data integration
- scientific theory development
- technology development
- continuous improvement
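A scientific workflow of this kind can be thought of as an ordered chain of stages, each consuming the previous stage's output. The sketch below is illustrative only: the stage order (model, simulation, post processing, storage) is an assumption read off the diagram labels, and every function name and placeholder string is hypothetical.

```python
from typing import Callable, Dict, List, Tuple

State = Dict[str, str]
Stage = Callable[[State], State]


def build_model(state: State) -> State:
    # Placeholder: a real stage would construct a 3D numerical model.
    return {**state, "model": "model(" + state["geological_information"] + ")"}


def run_simulation(state: State) -> State:
    # Placeholder: a real stage would submit the model to a compute facility.
    return {**state, "raw_output": "simulated(" + state["model"] + ")"}


def post_process(state: State) -> State:
    # Placeholder: a real stage would derive products from the raw output.
    return {**state, "result": "processed(" + state["raw_output"] + ")"}


def make_store_stage(archive: List[str]) -> Stage:
    """Return a stage that appends the final result to the given archive list."""
    def store(state: State) -> State:
        archive.append(state["result"])
        return state
    return store


def run_workflow(stages: List[Stage], initial: State) -> Tuple[State, List[str]]:
    """Run the stages in order and report which ones completed."""
    state = dict(initial)
    completed = []
    for stage in stages:
        state = stage(state)
        completed.append(stage.__name__)
    return state, completed


if __name__ == "__main__":
    archive: List[str] = []
    stages = [build_model, run_simulation, post_process, make_store_stage(archive)]
    final, done = run_workflow(stages, {"geological_information": "survey"})
    print(done)
    print(archive)
```

Because each stage is an ordinary function, automating a research task means little more than adding or reordering entries in the `stages` list.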

Page 19: Dr Darrell Williamson, eResearch Director

AuScope: Infrastructure System

Acquisition Groups
1. Geophysical Data: MT, seismic
2. GPS data
3. Geochem, Geochron data
4. Hyperspectral data

Integration

Access

Analysis & Synthesis

[Instrument diagram labels: Spectrometer; Telescope; Robotic x/y table; Linescan camera; Control computer; Cooler; Profilometer]

Page 20: Dr Darrell Williamson, eResearch Director

AuScope: Standardised Information Models

• Not a storage problem…
• Exchange
• Semantics and structure
• GeoSciML, OGC
• Tool support
• Creation and validation

Geography Markup Language
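The "creation and validation" tool support mentioned above can be illustrated with a very small check on a Geography Markup Language fragment. This is a sketch, not AuScope tooling: it only confirms that an element is a `gml:Point` with a parsable `gml:pos`, and it does no schema validation. The GML namespace URI is the published one; the sample coordinates, the `srsName` value, and the function `parse_point` are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

GML_URI = "http://www.opengis.net/gml"
GML_NS = {"gml": GML_URI}

# Invented sample point; coordinates and srsName are for illustration only.
SAMPLE_GML = """
<gml:Point xmlns:gml="http://www.opengis.net/gml" srsName="EPSG:4283">
  <gml:pos>-35.28 149.13</gml:pos>
</gml:Point>
"""


def parse_point(gml_text):
    """Return (srsName, coordinates) for a gml:Point, or raise ValueError."""
    point = ET.fromstring(gml_text.strip())
    if point.tag != "{" + GML_URI + "}Point":
        raise ValueError("expected a gml:Point element")
    pos = point.findtext("gml:pos", namespaces=GML_NS)
    if pos is None:
        raise ValueError("gml:Point has no gml:pos child")
    coordinates = tuple(float(value) for value in pos.split())
    if len(coordinates) not in (2, 3):
        raise ValueError("gml:pos must hold two or three numbers")
    return point.get("srsName"), coordinates


if __name__ == "__main__":
    print(parse_point(SAMPLE_GML))
```

Checks like this catch structural problems at exchange time, which is where the slide places the difficulty, rather than in storage.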

Page 21: Dr Darrell Williamson, eResearch Director

AuScope: Geoscience Network - Data types

[Axis labels: Structured; Unstructured]

Point
• GPS
• Mineral Occurrence
• Geochron

Curve (1D)
• Well log
• Geophys Profile
• Flight line

Surface (2D)
• Geological Map
• Cross section
• Swath

Solid (3D)
• 3D Geological Model
• Lidar cloud

Large Volume Binary Files
• Hyperspectral data
• Geophysical data
• Satellite data
• BLOBs

Page 22: Dr Darrell Williamson, eResearch Director

AuScope: Based on a Spatial Information Services Stack

[Diagram layers: Discovery Layer; Exchange Layer; Resources]

[Diagram components:]
• Discovery Portal
• AuScope service catalog
• Service Registry
• Community Agreed Service Interfaces and Information Models
• Standard Vocabularies
• Vocabulary Service
• URN Resolver Service
• Government Department Data
• Geological Survey
• Analysis Workflow
• Web Feature Service (WFS) With Application Schemas
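A Web Feature Service is queried with standard key-value parameters, which is what lets a discovery portal talk to many providers in the same way. The sketch below builds a WFS 1.1.0 `GetFeature` request URL. The parameter names (`service`, `version`, `request`, `typeName`, `maxFeatures`) come from the OGC WFS 1.1.0 key-value encoding; the endpoint URL is a placeholder, and the choice of WFS version and of `gsml:MappedFeature` as the feature type are assumptions, since the slide does not specify them.

```python
from urllib.parse import parse_qs, urlencode, urlparse


def build_getfeature_url(endpoint, type_name, max_features=10):
    """Build a WFS 1.1.0 GetFeature URL using key-value parameters."""
    params = {
        "service": "WFS",
        "version": "1.1.0",
        "request": "GetFeature",
        "typeName": type_name,
        "maxFeatures": str(max_features),
    }
    return endpoint + "?" + urlencode(params)


if __name__ == "__main__":
    # Placeholder endpoint; a real portal would look this up in a registry.
    url = build_getfeature_url("https://example.org/wfs", "gsml:MappedFeature", 5)
    print(url)
    print(parse_qs(urlparse(url).query))
```

Only the endpoint changes between providers, which is the practical benefit of community-agreed service interfaces.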

Page 23: Dr Darrell Williamson, eResearch Director


AuScope: Earth Science Information Network

Page 24: Dr Darrell Williamson, eResearch Director


Precision Agriculture: Spatial Information Services Stack – a NeAT Project

1. Spatial prioritisation of catchment incentives

2. Regional scale climate analyses

Page 25: Dr Darrell Williamson, eResearch Director

CSIRO: Data Storage - Consolidation

• Promising technologies are:
  • WAN Optimisation
  • Desktop Virtualisation (e.g. Citrix)

• Some key questions:
  • How will desktop virtualisation work with 3D modelling and visualisation?
  • How will data be accessed?
  • How will complex workflows be managed in a virtual desktop?
  • How will data be made available to a virtual desktop as well as remote users?

[Diagram labels: CSIRO Office; Data Centre; User’s Workstation; Virtual Desktop Server; File Server; WAN Optimization (shown twice). Annotation: Processing is moved from the workstation back to a central server.]

[Timeline: Jan-10 to Jan-12, with ticks at Apr-10, Jul-10, Oct-10, Jan-11, Apr-11, Jul-11, Oct-11. Legend: Scoped projects; Not yet scoped]

Discovery Activities:
- Scope future projects
- Establish test lab
- Establish baselines

1. Limestone Avenue to CDC Move
2a. WAN & UPS Remediation
2b. Major Facilities and Small Site WAN Remediation
3a. Evaluation of key technologies with Science Apps
3b. Implementation of new technologies, e.g. Citrix, WAN Acceleration
4. Consolidation of Adelaide Data Centres
5. Consolidation of Science Applications from all Sites (GIS, Vector NTI etc.)
6. Consolidation of Regional File Servers
7. Consolidation of Queensland Data Centres
8. Consolidation of Victorian Data Centres

Page 26: Dr Darrell Williamson, eResearch Director

NCRIS: Data Storage - Consolidation

The $50m dilemma!

Model 1 - New Peak National Capability

A ‘new’ national facility is created to be the Australian peak research data service.

• The envisaged service could be based (in a minimal configuration) around two physical sites supporting a single fully replicated data service open to all researchers.
• Because data is held remotely, some form of operating cost contribution would be required from data contributors, subscribers or sector participants.
• The cost and funding model would need sector agreement and some risk would be present for the operators. However, the cost for the volume of data envisaged would be significantly less than any possible in-house solution because of the substantive EIF funds and because of the large economy of scale factors.

Model 2 - Regional Strength

Regionally focussed services would be developed on the basis of existing regional associations. A particular advantage could be gained from building on associations in which state governments have an interest, as this may assist state government agreement to co-locate copies of state generated research related data.

Model 3 - Industry Partnerships

The sector would work with commercial suppliers to build all or part of the required infrastructure, whilst retaining the provision of an appropriate interface layer within the sector. This approach could contribute to either the new peak national capability model (Model 1) or the regional store model (Model 2).

Page 27: Dr Darrell Williamson, eResearch Director

NCRIS: Data Storage - Consolidation

The $50m dilemma!

Essential Requirements

• The establishment of a sector based governance, management and implementation mechanism appropriate to the growing importance of research data retention across the sector that is capable of addressing longer term issues, beyond the life of this funding.

• The establishment of a process that has sector support, to identify data sets and collections that will be inputs to future research activities, and to focus and apportion resource allocation using the National Research Priorities and national research infrastructure priorities (as identified in the Strategic Roadmap for Australian Research Infrastructure).

Issues to consider in the development of criteria are:

• What data will be re-used by the research community?

• What data sets make up the inputs to research?

• Where are the relevant data sets sourced from and how?

• For what period and with what access rights should data be retained?

• What happens at the end of the retention period?

Page 28: Dr Darrell Williamson, eResearch Director

Questions?

Dr Darrell Williamson, eResearch Director

Email: [email protected]