ELIXIR-EXCELERATE is funded by the European Commission within the Research Infrastructures programme of Horizon 2020, grant agreement number 676559.
The ELIXIR Compute Platform: An environment for Analysing Life-Science Data
Steven Newhouse, Head of Technical Services, EMBL-EBI
Genomics vs High Energy Physics
• Both are excellent examples of Big Data but Genomics data is:
• More complex and variable, used in more demanding ways
• Growth is accelerating faster than physics data
• Greater uncertainty on short timescales => less time to respond
• Less community-wide investment in s/w and infrastructure
• Sequencing and imaging machines provide 1000’s of data sources
• Research data deposited into repositories before publishing
• Health data retained inside organisational firewalls
• Tony Wildish, Genomics vs. Physics, HEPIX 2016,
https://indico.cern.ch/event/531810/sessions/208405/#20161019
EMBL: European Molecular Biology Laboratory
Over 1600 people and more than 80 nationalities
• Hamburg: Structural biology
• Heidelberg: Life sciences
• Rome: Epigenetics and neurobiology
• Cambridge (EMBL-EBI): Bioinformatics
• Grenoble: Structural biology
• Barcelona: Tissue biology and disease modelling
Data Resources at EMBL-EBI
• Literature & ontologies: Experimental Factor Ontology, Gene Ontology, BioStudies, Europe PMC
• Chemical biology: ChEBI, ChEMBL, SureChEMBL
• Molecular structures: Protein Data Bank in Europe, Electron Microscopy Data Bank
• Gene, protein & metabolite expression: Expression Atlas, MetaboLights, PRIDE, RNAcentral
• Protein sequences, families & motifs: InterPro, Pfam, UniProt
• Genes, genomes & variation: Ensembl, Ensembl Genomes, GWAS Catalog, Metagenomics portal
• Systems: BioModels, BioSamples, Enzyme Portal, IntAct, Reactome
• Molecular Archives: European Nucleotide Archive, European Variation Archive, European Genome-phenome Archive, ArrayExpress
• Cross domain resources
Ever Increasing Demands
Storage growth at EMBL-EBI still 40-50% a year.
Increasingly ‘interesting’ data being generated and held in national or local repositories.
• Integration challenges
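To put the 40-50% annual growth figure in perspective, the short Python sketch below computes the implied doubling time and projects capacity forward. The 200 PB starting point is the storage capacity quoted in this talk; the function names and the five-year horizon are illustrative assumptions, not part of the original slides.

```python
import math


def doubling_time_years(annual_growth: float) -> float:
    """Years needed to double at a constant compound annual growth rate."""
    return math.log(2) / math.log(1 + annual_growth)


def projected_capacity(start_pb: float, annual_growth: float, years: int) -> float:
    """Capacity after `years` of compound growth, in the same unit as start_pb."""
    return start_pb * (1 + annual_growth) ** years


if __name__ == "__main__":
    # Assumed scenario: 200 PB today, growth held constant for five years.
    for rate in (0.40, 0.50):
        print(
            f"{rate:.0%} growth: doubles every {doubling_time_years(rate):.2f} years; "
            f"200 PB becomes {projected_capacity(200, rate, 5):.0f} PB after 5 years"
        )
```

At these rates, capacity doubles roughly every 1.7 to 2.1 years, provided the growth rate stays constant.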
Big data, big demand
• ~27 million requests to EMBL-EBI websites every day
• 200 petabytes of storage capacity in our data centres
• EMBL-EBI delivered 152 million jobs to its users in 2016
• Scientists at over 3.2 million unique IP addresses use EMBL-EBI websites
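The daily and yearly totals above are easier to compare as average rates. The sketch below converts them to events per second; it assumes the load is spread evenly over time, which real traffic is not, so these are averages rather than peak figures. The helper name `per_second` is illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60
DAYS_IN_2016 = 366  # 2016 was a leap year


def per_second(count: float, days: float) -> float:
    """Average events per second for `count` events spread evenly over `days` days."""
    return count / (days * SECONDS_PER_DAY)


if __name__ == "__main__":
    print(f"Web requests: {per_second(27e6, 1):.0f} per second on average")
    print(f"Compute jobs: {per_second(152e6, DAYS_IN_2016):.1f} per second on average")
```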
Data Centre Infrastructure
• Campus (Hinxton): 90 racks
• Leased Data Centre (Hemel Hempstead): 90 racks
• Leased Data Centre (Slough): 10 racks
• JANET – UK Academic Network
• Raw Storage: Object Store – 101PB; NAS – 70PB; HPC Storage – 22PB; Tape – 22PB
• Analysis Capacity: HTC – 22,000 job slots; HPC – 7,000 job slots; Cloud – 6,000 vCPUs; Virtual infrastructure – 1,500 cores
• Network link speeds: 20 Gb/s, 10 Gb/s, 1 Gb/s
ELIXIR – Research Infrastructure for Life Science
• Compute: Access, Exchange & Compute on sensitive data
• Data: Sustain core data resources
• Tools: Services & connectors to drive access and exploitation
• Standards: Integration and interoperability of data and services
• Training: Professional skills for managing and exploiting data
ELIXIR Compute Platform: Integration with communities
The transfer of large volume, electronic confidential, human data
https://www.elixir-europe.org/events/elixir-webinar-transfer-large-volume-data
ELIXIR Compute Platform: Integrating Existing Services
https://www.elixir-europe.org/platforms/compute
The ELIXIR Nodes and their collaboration with European e-Infrastructures form the technical and resource foundation of the ELIXIR Compute Platform.
A geographically distributed Authentication & Authorisation Infrastructure (AAI) in operation.
Integrated Cloud & Compute and Storage & File Transfer Services that are provided by the individual ELIXIR Nodes and which will be discoverable through ELIXIR.
Moving data between sites is one key capability of the ELIXIR Compute Platform.
Raising the level of abstraction through platforms that promote distributed workflow execution
ELIXIR Cloud & Compute
ELIXIR Cloud capacities surveyed at the DK, DE, EBI, FI, FR and SUI nodes (confirmed capacity, counting only these nodes):
• > 60,000 compute cores
• > 24,000 TB of storage
• > 3,000 compute users
Resource allocation decisions are made by the nodes
ELIXIR Data Transfer and Storage
• PID and Metadata Registry
• Minimal metadata for tracking and downloading data available
• Example implementation integrating GridFTP and Handles capturing minimal metadata; automatic Handle resolving
• Next step: Integration with RDA collections API and specification
• File Transfer
• Deployed FTS3 integrated with ELIXIR AAI, supporting multiple protocols (GridFTP, https, S3, …)
• Command line and web UI
• Performance tests comparing GridFTP, Aspera, http and other protocols are still ongoing (ELIXIR-ES)
• Reference Data Set Distribution Service
• RDSDS planned, designed and developed at EMBL-EBI with support from EUDAT2020
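Automatic Handle resolving can be illustrated with the public Handle System proxy, which exposes a JSON REST interface under hdl.handle.net/api/handles/. The sketch below only builds the lookup URL and extracts the URL-typed values from a response. The handle `21.T12345/example-dataset`, the `CHECKSUM` record type and the sample response are hypothetical, and nothing here reflects how the ELIXIR example implementation is actually wired to GridFTP.

```python
import json
from urllib.parse import quote

HANDLE_PROXY = "https://hdl.handle.net/api/handles/"  # public Handle System proxy


def resolver_url(handle: str) -> str:
    """Build the REST lookup URL for a handle of the form 'prefix/suffix'."""
    return HANDLE_PROXY + quote(handle, safe="/")


def locations(response_text: str) -> list[str]:
    """Return the values of all URL-typed records in a proxy JSON response."""
    record = json.loads(response_text)
    return [
        value["data"]["value"]
        for value in record.get("values", [])
        if value.get("type") == "URL"
    ]


# Hypothetical response, shaped like the proxy's JSON output.
SAMPLE = json.dumps({
    "responseCode": 1,
    "handle": "21.T12345/example-dataset",
    "values": [
        {"index": 1, "type": "URL",
         "data": {"format": "string", "value": "gsiftp://node.example.org/data/set1"}},
        {"index": 2, "type": "CHECKSUM",
         "data": {"format": "string", "value": "abc123"}},
    ],
})

if __name__ == "__main__":
    print(resolver_url("21.T12345/example-dataset"))
    print(locations(SAMPLE))
```

In a live setting the response text would come from an HTTP GET on `resolver_url(handle)`, and the returned location could then be handed to a transfer tool.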
Interaction with e-Infrastructures
Communities
ELIXIR Compute Platform
EOSC (EGI, EUDAT & Indigo)
Commercial Providers
ELIXIR Nodes
At a GLIF meeting a long time ago, in a galaxy far, far away…
• EGI.eu still here!
• Coordinating EOSC-Hub
• EOSC: European Open Science Cloud
• Federating cloud resources and services
• Enabling open science around open data
• https://www.glif.is/meetings/2010/plenary/newhouse-egi.pdf
Future Compute Platform: ELIXIR-GA4GH Analysis Environment
• Integrate user federation ELIXIR AAI into local compute and data deployments
• Rationalise an ELIXIR-wide Data Distribution Network – starting with Reference datasets
• Drive ELIXIR Compute Platform support for hybrid (public/private & cloud/HPC) deployments – e.g. OpenStack, SGE, etc.
• Develop Task Distribution Network using Task orchestration engines – e.g. Kubernetes
• Support national or regional workflow choreography engines – e.g. CWL, Nextflow, Galaxy, etc.
Infrastructure Requirements
• Data Sources:
• EMBL-EBI has a lot of data! Science DMZ to improve access.
• Cloud Resources:
• From within ELIXIR Nodes, national providers, others all federated through EOSC
• Commercial cloud providers: HelixNebula (T-Systems & RHEA), AWS, GCP, MSA, …
• Data Sinks:
• Strategic placement of reference data sets & tactical placement of analysis data
Underlying Network Infrastructure with dynamic dedicated virtual links?
Hybrid Cloud Future
• Cost model of public clouds:
• Good for transitory activities, e.g. 1000 cores for 2 months
• Bad for long-term activities, e.g. 17PB for 5 years growing 0.5PB/month in + out
• How can we present our on-site storage externally?
• Replicate and sync to the cloud: Existing file based access model
• On-demand caching: Existing file based access model with smart layer
• Direct network access over http: Read/Write whole object
• How much bandwidth is needed to support these models?
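A rough way to approach the bandwidth question is to turn the volumes on this slide into sustained link rates. The Python sketch below does that arithmetic under stated assumptions: decimal petabytes, a 30-day month, a fully utilised link with no protocol overhead, and a reading of "0.5PB/month" as linear growth of the stored volume. The 10 Gb/s figure is one of the link speeds quoted for the data centres; the function names are illustrative.

```python
BITS_PER_BYTE = 8
PB = 10**15  # decimal petabyte, in bytes
SECONDS_PER_DAY = 86_400


def sustained_gbps(petabytes: float, days: float) -> float:
    """Average rate in Gb/s needed to move `petabytes` within `days`."""
    bits = petabytes * PB * BITS_PER_BYTE
    return bits / (days * SECONDS_PER_DAY) / 1e9


def transfer_days(petabytes: float, link_gbps: float, efficiency: float = 1.0) -> float:
    """Days needed to move `petabytes` over a link of `link_gbps` Gb/s.

    `efficiency` is the assumed usable fraction of the link (1.0 = ideal).
    """
    bits = petabytes * PB * BITS_PER_BYTE
    return bits / (link_gbps * 1e9 * efficiency) / SECONDS_PER_DAY


def stored_volume_pb(start_pb: float, growth_pb_per_month: float, months: int) -> float:
    """Volume held after linear growth of `growth_pb_per_month` for `months` months."""
    return start_pb + growth_pb_per_month * months


if __name__ == "__main__":
    print(f"0.5 PB per month needs {sustained_gbps(0.5, 30):.2f} Gb/s sustained")
    print(f"17 PB over 10 Gb/s takes {transfer_days(17, 10):.0f} days")
    print(f"Volume after 5 years: {stored_volume_pb(17, 0.5, 60):.0f} PB")
```

Under these assumptions, 0.5 PB a month corresponds to about 1.5 Gb/s sustained, and a one-off copy of 17 PB over a fully used 10 Gb/s link takes roughly 157 days. Real transfers would be slower once protocol overhead and competing traffic are included, which the `efficiency` parameter can approximate.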
Scaling out from EMBL-EBI’s Data Centres
JANET/GEANT
Public Clouds
http (Object Store)
NFS (Scale out storage)
Web Sites
Web Services
ELIXIR Clouds
Individual Users
EBI Hybrid Cloud
Summary
• Network remains key factor
• Even more so for big data in clouds
• Business & service models remain complex
• Both local ISP connectivity and public cloud providers
• Can network paths on demand help (or hinder) here?
• Danger in just adding another complexity layer!
• But could provide USP for performance & cost with public clouds
www.elixir-europe.org
@ELIXIREurope /company/elixir-europe
Thank you – [email protected]: ELIXIR Compute Platform &
EMBL-EBI Technical Services [email protected]