e-Science and Grid The VL-e approach L.O. (Bob) Hertzberger
Computer Architecture and Parallel Systems Group Department of
Computer Science Universiteit van Amsterdam [email protected]
Slide 2
Content Developments in Grid Developments in e-Science
Objectives Virtual Lab for e-Science Research philosophy
Conclusions
Slide 3
Background ICTpush developments Processing power doubles every
18 month Memory size doubles every 12 month Network speed doubles
every 9 month Something has to be done to harness this development
Virtualization of ICT resources Internet Web Grid
Slide 4
Internal versus external bandwidth Mbit/s Computer busses
networks
Slide 5
Web& Grid & Web/Grid Services Web services is a
paradigm/way of using/accessing information Web resources are
mostly human-centered (understood by humans, computers can read but
cant understand) Grid is about accessing & sharing computing
resources by virtualization Data & Information repositories
Experimental facilities OGSA: Service oriented Grid standard based
on Web services
Slide 6
Web & Grid & Semantic Web/Grid Semantic Web resources
should also be understandable by computers This way, many complex
tasks formulated and assigned by humans can be automated and
executed by agents working on the Semantic Web But, both Grid and
Web Services only focus on single services. Semantic Web/Grid
should be able to describe a single service as well as the
relationship between services e.g. aggregate service using a number
of services so making knowledge explicit
Slide 7
Levels of Grid abstraction Computational Grid Data Grid
Information Web/Grid Knowledge Web/Grid
Slide 8
Background information experimental sciences There is a
tendency to look ever deeper in: Matter e.g. Physics Universe e.g.
Astronomy Life e.g. Life sciences Therefore experiments become
increasingly more complex Instrumental consequences are increase in
detector: Resolution & sensitivity Automation &
robotization Results for instance in life science in: So called
high throughput methods Omics experimentation genome ===>
genomics
Slide 9
New technologies in Life Sciences research University of
Amsterdam cell GenomicsTranscriptomicsProteomicsMetabolomics RNA
protein metabolites DNA Methodology/ Technology
Slide 10
Paradigm shift in Life science Past experiments where
hypothesis driven Evaluate hypothesis Complement existing knowledge
Present experiments are data driven Discover knowledge from large
amounts of data Apply statistical techniques
Slide 11
Background information experimental sciences Experiments become
increasingly more complex Driven by detector developments
Resolution increases Automation & robotization increases
Results in an increase in amount and complexity of data
Slide 12
The Application data crisis Scientific experiments start to
generate lots of data medical imaging (fMRI): ~ 1 GByte per
measurement (day) Bio-informatics queries:500 GByte per database
Satellite world imagery: ~ 5 TByte/year Current particle physics: 1
PByte per year LHC physics (2007): 10-30 PByte per year Data is
often very distributed
Slide 13
Background information experimental sciences Experiments become
increasingly more complex Driven by detector developments
Resolution increases Automation & robotization increases
Results in an increase in amount and complexity of data Something
has to be done to harness this development Virtualization of
experimental resources: e-Science
Slide 14
The what of e-Science e-Science is the application domain
Science of Grid & Web More than only coping with data explosion
A multi-disciplinary activity combining human expertise &
knowledge between: A particular domain scientist ICT scientist
e-Science demands a different approach to experimentation because
computer is integrated part of experiment Consequence is a radical
change in design for experimentation e-Science should apply and
integrate Web/Grid methods where and whenever possible
Slide 15
Grid and Web Services Convergence Grid Definition of Web
Service Resource Framework(WSRF) makes explicit distinction between
service and stateful entities acting upon service i.e. the
resources Means that Grid and Web communities can move forward on a
common base!!! WSRF Started far apart in apps & tech OGSI GT2
GT1 HTTP WSDL, WS-* WSDL 2, WSDM Have been converging Ref: Foster
Web
Slide 16
e-Science Objectives It should enhance the scientific process
by: Stimulating collaboration by sharing data & information
Result is re-use of data & information
Slide 17
The data sharing potential for Cognition Collaborative
scientific research Information sharing Metadata modeling Allows
for experiment validation Independent confirmation of results
Statistical methodologies Access to large collections of data and
metadata Training Train the next generation using peer reviewed
publications and the associated data
Slide 18
Acquisition Alignment Reconstruction Segmentation
Interpretation Electron tomography data pipeline
Slide 19
Cell Centred Database NCIMR (San Diego) Maryann Martone and
Mark Ellisman
Slide 20
e-Science Objectives It should enhance the scientific process
by: Stimulating collaboration by sharing data & information
Improve re-use of data & information Combing data and
information from different modalities Sensor data & information
fusion
Slide 21
Slide 22
Slide 23
Slide 24
Slide 25
e-Science Objectives It should enhance the scientific process
by: Stimulating collaboration by sharing data & information
Improve re-use of data & information Combing data and
information from different modalities Sensor data & information
fusion Realize the combination of real life & (model based)
simulation experiments
Slide 26
An example of the combination of real life & (model based)
simulation experiments patient specific vascular geometry (from CT)
Segmentation blood flow simulation (Latice Bolzmann) Pre-operative
planning (interaction) Suitable for parallelization through
functional decomposition Simulated Fem-Fem bypass Patients vascular
geometry (CTA) Grid resources Simulated Vascular Reconstruction in
a Virtual Operating Theatre
Slide 27
e-Science Objectives It should enhance the scientific process
by: Stimulating collaboration by sharing data & information
Improve re-use of data & information Combing data and
information from different modalities Sensor data & information
fusion Realize the combination of real life & (model based)
simulation experiments Modeling of dynamic systems
Slide 28
Dynamic bird behaviour MODELS Bird distributions Ensembles
Calibration and Data assimilation Predictions and on-line warnings
RADAR Bird behaviour in relation to weather and landscape
Slide 29
e-Science Objectives It should result in : Computer aided
support for rapid prototyping of ideas Stimulate the creativity
process It should realize that by creating & applying: New ICT
methodologies and a computing infrastructure stimulating this From
this ICT point of view it should support the following application
steps: Design Development & realization Execution Analysis
& interpretation We try to realize e-Science and their
applications via the Virtual Lab for e-Science (VL-e) project
Slide 30
Virtual Lab for e-Science research Philosophy Multidisciplinary
research & the development of related ICT infrastructure
Generic application support Application cases are drivers for
computer & computational science and engineering research
Slide 31
Grid Services Harness multi-domain distributed resources
Management of comm. & computing VL-e Application Oriented
Services Food Informatics Dutch Telescience Medical Diagnosis &
Imaging VL-e project Bio- Informatics Data Insive Science/ LOFAR
Bio- Diversity Data intensive sciences
Slide 32
Two sides of Bioinformatics as an e-Science The scientific
responsibility to develop the underlying computational concepts and
models to convert complex biological data into useful biological
and chemical knowledge Technological responsibility to manage and
integrate huge amounts of heterogeneous data sources from high
throughput experimentation
Slide 33
Role of bioinformatics cell Data generation/validation Data
integration/fusion Data usage/user interfacing
GenomicsTranscriptomicsProteomicsMetabolomics Integrative/System
Biology RNA protein metabolites DNA methodology bioinformatics
Slide 34
Virtual Lab for e-Science research Philosophy Multidisciplinary
research and development of related ICT infrastructure Generic
application support Application cases are drivers for computer
& computational science and engineering research Problem
solving partly generic and partly specific Re-use of components via
generic solutions whenever possible
Slide 35
Grid/ Web Services Harness multi-domain distributed resources
Management of comm. & computing Management of comm. &
computing Management of comm. & computing Potential Generic
part Potential Generic part Potential Generic part Application
Specific Part Application Specific Part Application Specific Part
Virtual Laboratory Application Oriented Services Application
pull
Slide 36
Generic e-Science aspects Virtual Reality Visualization &
user interfaces Modeling & Simulation Interactive Problem
Solving Data & information management Data modeling dynamic
work flow management Content (knowledge) management Semantic
aspects Meta data modeling Ontologies Wrapper technology Design for
Experimentation
Slide 37
Virtual Lab for e-Science research Philosophy Multidisciplinary
research and development of related ICT infrastructure Generic
application support Application cases are drivers for computer
& computational science and engineering research Problem
solving partly generic and partly specific Re-use of components via
generic solutions whenever possible Rationalization of experimental
process Reproducible & comparable
Slide 38
Issues for a reproducible scientific experiment interpretation
Rationalization of the experiment and processes via protocols
processing processed data conversion, filtering, analyses,
simulation, experiment parameters/settings, algorithms,
intermediate results, Parameter settings, Calibrations, Protocols
software packages, algorithms raw data acquisition
sensors,amplifiers imaging devices,, presentation visualization,
animation interactive exploration, Metadata Much of this is lost
when an experiment is completed.
Slide 39
Scientific experiments & e-Science Step1: designing an
experiment Step2: performing the experiment Step3: analyzing the
experiment results success For complex experiments: contain complex
processes require interdisciplinary expertise need large scale
resource Grid & high level support
Slide 40
Experiment Topology Graphical representation of self-contained
data processing modules attached to each other in a workflow
Process-Flow Templates(PFT) Derived from ontologies Graphical
representation of data elements and processing steps in an
experimental procedure Information to support context-sensitive
assistance (semantics) Study Descriptions of experimental steps
represented as an instance of a PFT with references to experiment
topologies Components in a VL-e experiment
Slide 41
Step1: designing an experiment Step2: performing the experiment
Step3: analyzing the experiment results success Taverna, Kepler,
and Triana: model processes for computing tasks. In VL-e: model
both computing tasks and human activity based processes, and model
them from the perspective of an entire lifecycle. It tries to
support Collaboration in different stages Information sharing Reuse
of experiment PFT in VLAMGPFT instances in VLAMG Process Flow
Template e-Science environment Scientific Workflow Management
Systems Process Flow Template
Slide 42
Definition of experiment protocols Workflow definitions
Recreate complex experiments into process flows Workflow execution
Maintain control over the experiment Data processing execution
Topologies of data processing modules Interpretation Visualization
of processed results to help intuition Ontology definitions can
help in obtaining a well-structured definition of experiment, data
and metadata.
Slide 43
Virtual Lab for e-Science research Philosophy Multidisciplinary
research and development of related ICT infrastructure Generic
application support Application cases are drivers for computer
& computational science and engineering research Problem
solving partly generic and partly specific Re-use of components via
generic solutions whenever possible Rationalization of experimental
process Reproducible & comparable Two research experimentation
environments Proof of concept for application experimentation Rapid
prototyping for computer & computational science
experimentation
Slide 44
The VL-e infrastructure Grid Middleware Surfnet Application
specific service Application Potential Generic service &
Virtual Lab. services Grid & Network Services Virtual
Laboratory VL-e Proof of Concept Environment Telescience Medical
Application Bio Informatics Applications VL-e Experimental
Environment Virtual Lab. rapid prototyping (interactive simulation)
Additional Grid Services (OGSA services) Network Service (lambda
networking) VL-e Certification Environment Test & Cert.
Compatibility Test & Cert. Grid Middleware Test & Cert.
VL-software
Slide 45
Infrastructure for Applications Applications are a driving
force of the PoC Experience shows applications value stability
Foster two-way interaction to make this happen
Slide 46
Taverna, Kepler, Triana and VLAM. Step1: designing an
experiment Step2: performing the experiment Step3: analyzing the
experiment results success PFT e-Science environment Scientific
Workflow Management Systems SWMS Research activity to be developed
in the Rapid Prototyping environment Stable developments to be used
in VL-e in the Proof of Concept environment Research activity to be
developed in the Rapid Prototyping environment PFT
Slide 47
VL-e PoC environment Latest certified stable software
environment of core grid and VL-e services Core infrastructure
built around clusters and storage at SARA and NIKHEF (production
quality) Controlled extension to other platforms and distributions
On the user end: install needed servers: user interface systems,
storage elements for data disclosure, grid-secured DB access Focus
on stability and scalability
Slide 48
Hosted services for VL-e Key services and resources are offered
centrally for all applications in VL-e Mass data and number
crunching on the large resources at SARA Storage for data
replication & distribution Persistent strategic storage on tape
Resource brokers, resource discovery, user group management
Slide 49
Why such a complex scheme? software is part of the
infrastructure stability of core software needed to develop the new
scientific applications enable distributed systems management (who
runs what version when?) the grid is one big error amplifier
computers make mistakes like humans, only much, much faster
Slide 50
What did we learn It is not enough to just transport current
applications to Grid or e-Science infrastructures To fully exploit
the potential of e-Science infrastructures one has to learn what is
possible Therefore the full lifecycle of an experiment has to be
taken into account Workflow management is a first step We add
semantic information via Process Flow Templates Application
innovation such as for instance bio- banking should be the aim
Grids should be transparent for the end-user
Slide 51
Conclusions e-Science is a lot more more than trying to cope
with data explosion alone Implementation of e-Science systems
requires further rationalization and standardization of
experimentation process e-Science success demands the realization
of an environment allowing application driven experimentation &
rapid dissemination of feed back of these new methods We try to do
that via development of Proof of Concept based on Grid
Slide 52
Electron tomography data pipeline development TOM (Matlab)
acquisition and data storage with the SRB SRB Storage With the CCDB
database for retrieval of experimental data sets Grid-computing 3D
reconstruction refinements Currently for EMAN and later also for
TOM (Matlab) SARA - Amsterdam 1 Gbit/s