Presentation Title April 4, 2002 CAMERA- Metagenomics meets the Cyberinfrastructure David T....
Presentation Title April 4, 2002
CAMERA - Metagenomics meets the Cyberinfrastructure
David T. Kingsbury
Gordon and Betty Moore Foundation
BERAC - October 16, 2006
The CAMERA Partnership
Community Cyberinfrastructure for Advanced Marine Microbial Ecology Research and Analysis
Genomic Data Is Growing Rapidly, But Metagenomics Will Vastly Increase The Scale…
GenBank (www.ncbi.nlm.nih.gov/Genbank): 100 Billion Bases!
Protein Data Bank (www.rcsb.org/pdb/holdings.html): 35,000 Structures
Total Data < 1TB
The Sargasso Sea Experiment: The Power of Environmental Metagenomics
Yielded a Total of Over 1 billion Base Pairs of Non-Redundant Sequence
Displayed the Gene Content, Diversity, & Relative Abundance of the Organisms
Sequences from at Least 1800 Genomic Species, including 148 Previously Unknown
Identified over 1.2 Million Unknown Genes
MODIS-Aqua satellite image of ocean chlorophyll in the Sargasso Sea grid about the BATS site from 22 February 2003
J. Craig Venter, et al., Science 2 April 2004: Vol. 304, pp. 66-74
Marine Genome Sequencing Project: Measuring the Genetic Diversity of Ocean Microbes
Moore Foundation Funded the Venter Institute to Provide the Full Genome Sequence of 155 Marine Microbes
www.moore.org/microgenome/trees_main.asp
Moore Microbial Genome Sequencing Project: Cyanobacteria Being Sequenced by Venter Institute
GOS Sequences are Largely Bacterial
Source: Shibu Yooseph, et al. (PLOS Biology in press 2006)
~3 Million Previously Known Sequences
~5.6 Million GOS Sequences
Metagenomics Will Couple to Earth Observations Which Add Several TBs/Day
[Chart: Cumulative Tera Bytes (y-axis, 0 to 8,000) by Calendar Year (x-axis, 2001 to 2014), by instrument]
Series: Other EOS, HIRDLS, MLS, TES, OMI, AMSR-E, AIRS-is, GMAO, MOPITT, ASTER, MISR, V0 Holdings, MODIS-T, MODIS-A
Other EOS = ACRIMSAT, Meteor 3M, Midori II, ICESat, SORCE
File name: archive holdings_122204.xls; tab: all instr bar
Terra EOM Dec 2005; Aqua EOM May 2008; Aura EOM Jul 2010
NOTE: Data remains in the archive pending transition to LTA
Source: Glenn Iona, EOSDIS Element Evolution Technical Working Group January 6-7, 2005
Driven by User Needs
CAMERA serves as one representation of a specific research community’s need for a system to
– Collect and reference increasing metadata relevant to environmental metagenome datasets
– Exploit the power of querying on metadata across multiple geospatial locations
– Have access to a diverse and customizable set of easy-to-use tools to analyze their data in the context of collected metagenomic and whole genomic datasets
– Have the ability to update and propagate improvements to annotations
– Have a pre-publication, pre-submission collaborative workspace
– Serve a diverse informatics-literate community
Services Provided
Data and Application Services
Tools and Workflows
Computational Data, Visualization and Collaborative Environment
Outreach and Training in Environmental Genomics
Data and Application Services
Primary Data
– Sargasso Sea and Sorcerer II expedition data
– JGI marine & terrestrial environmental datasets
– Moore Microbial Genomes
– JGI and other relevant whole genomes
– Research community submitted datasets
– Submitted 454-based metagenomic datasets
– Publicly available NR protein and DNA sequence datasets
Derived Data
– Annotations of datasets
– Assemblies
– Alignments
– Pre-computed clusters
Sample Metadata from GOS
Site Metadata
– Location (lat/long, water depth)
– Site characterization (finite list of types plus “other”)
– Site description (free text)
– Country
Sampling Metadata
– Sample collection date/time
– Sampling depth
– Conditions at time of sampling (e.g., stormy, surface temperature)
– Sample physical/chemical measurements (T (°C), S (ppt), chl a (mg m⁻³), etc.)
– “author”
Experimental Parameters
– Filter size
– Insert size
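The metadata categories above could be held in a small record type, which also makes the "querying on metadata across multiple geospatial locations" idea concrete. This is only a minimal sketch: the class names, field names, and units (metres, micrometres, kilobases) and the `in_bounding_box` helper are illustrative assumptions, not the CAMERA schema.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class SiteMetadata:
    """Site-level fields (names and units are assumed for illustration)."""
    latitude: float        # decimal degrees, north positive
    longitude: float       # decimal degrees, east positive
    water_depth_m: float   # water depth at the site
    site_type: str         # one of a finite list of types, or "other"
    description: str       # free text
    country: str


@dataclass
class SampleRecord:
    """Sampling and experimental fields attached to one site."""
    site: SiteMetadata
    collected_at: datetime
    sampling_depth_m: float
    conditions: str                               # e.g. "stormy"
    temperature_c: Optional[float] = None         # T (°C)
    salinity_ppt: Optional[float] = None          # S (ppt)
    chlorophyll_a_mg_m3: Optional[float] = None   # chl a (mg m^-3)
    author: str = ""
    filter_size_um: Optional[float] = None        # unit is an assumption
    insert_size_kb: Optional[float] = None        # unit is an assumption


def in_bounding_box(samples: List[SampleRecord], south: float, north: float,
                    west: float, east: float) -> List[SampleRecord]:
    """Return samples whose site lies inside a lat/long box.

    Simplification: the box must not cross the antimeridian.
    """
    return [s for s in samples
            if south <= s.site.latitude <= north
            and west <= s.site.longitude <= east]
```

A caller would build `SampleRecord` objects from whatever submission format is in use and then filter them by region, for example `in_bounding_box(samples, 20, 40, -80, -50)` for a box in the western North Atlantic. Optional fields default to `None` because not every sample is expected to carry every measurement.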
Tools and Workflows
Initial Set
– BLAST Server
– Clustering
– HMM/Profile
– Neighborhood analysis
– Multiple sequence alignments
– Assembly
Proposed New Tools
– Multiple Auto Annotation pipelines
– Fast Sequence lookup
– Customized Assembly
– Phylogenetic Analysis
– Clustering Tools
CAMERA Outreach Modes
Scientific Advisory Board
– Early Adopters
– OptIPortal End Points
Targeted Workshops
– User Forums
– User Software Testing
– Viz Tool Brainstorming
Presentations at Scientific Meetings
– e.g. Demonstration Booth at JCVI Genomes, Medicine, and the Environment Conference, October 2006
Partnerships With Metagenomics Projects
– e.g. DoE’s Joint Genome Institute (JGI)
Training and User Services Team
Guiding Philosophy for Development
Sprint Q4 2006
– Propagate JCVI toolkit and data ASAP: Mechanism for publication of Sorcerer II data; Enabler for community
– Defined deliverables, project management approach
Marathon Q4 2006 onward
– Additional Datasets
– Additional tools
– Community drives prioritization for ongoing releases: Advisory Board, Community Outreach
Keys to success: Tight integration of science, bioinformatics, software, and IT, Matched to Community Needs
The Future Home of the Moore Foundation Funded Marine Microbial Ecology Metagenomics Complex
First Implementation of the CAMERA Complex
Photo Courtesy Joe Keefe, Calit2
Major Buildout of Calit2 Server Room Underway
http://calit2-1101-1.ucsd.edu/
Moore CAMERA Production Environment
Creation of Initial Production Environment – September 2006
Hardware
– Compute Nodes: ~200 4-CPU Nodes = ~800 Processing Cores
– Storage Servers: 10 systems = ¼ Petabyte raw storage
– Database Servers: Larger 20-40TB; Smaller 5-10TB
– Network Management: Force10 E1200 Router w/12 10GigE Interfaces to Each System Ports
User Access to Compute Cycles
– Bulk of free cycles available to external users
– Proposal mechanism
Source: Greg Hidley, Calit2; Phil Papadopoulos, SDSC, Calit2
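The hardware totals on this slide follow from simple multiplication, which the sketch below spells out. The function name is hypothetical, and the conversion of ¼ petabyte to 250 TB assumes decimal units (1 PB = 1000 TB); the slide's own figures are approximate ("~200", "~800").

```python
def production_environment_totals(nodes: int = 200,
                                  cpus_per_node: int = 4,
                                  storage_systems: int = 10,
                                  raw_storage_tb: float = 250.0) -> dict:
    """Derive aggregate figures from the per-unit numbers quoted on the slide.

    raw_storage_tb = 250 assumes 1/4 PB in decimal units (1 PB = 1000 TB).
    """
    return {
        # ~200 nodes x 4 CPUs each = ~800 processing cores
        "processing_cores": nodes * cpus_per_node,
        # 1/4 PB raw spread over 10 storage systems
        "raw_tb_per_storage_system": raw_storage_tb / storage_systems,
    }


totals = production_environment_totals()
print(totals["processing_cores"])           # 800
print(totals["raw_tb_per_storage_system"])  # 25.0
```

Under these assumptions each storage server carries roughly 25 TB of raw disk, which is in the same range as the "Larger 20-40TB" database servers listed alongside it.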
www.glif.is
Created in Reykjavik, Iceland 2003
Countries are Aggressively Creating Gigabit Services: Interactive Access to CAMERA and LOOKING Systems
Visualization courtesy of Bob Patterson, NCSA.
Calit2’s Direct Access Core Architecture Will Create Next Generation Metagenomics Server
[Diagram labels: Traditional User; Request; Response; Web Portal + Web Services; Flat File Server Farm; Data-Base Farm; 10 GigE Fabric; Dedicated Compute Farm (1000 CPUs); TeraGrid: Cyberinfrastructure Backplane (scheduled activities, e.g. all by all comparison) (10000s of CPUs); Web (other service); Local Environment; Local Cluster; Direct Access Lambda Cnxns]
Data sets shown:
– Sargasso Sea Data
– Sorcerer II Expedition (GOS)
– JGI Community Sequencing Project
– Moore Marine Microbial Project
– NASA Goddard Satellite Data
– Community Microbial Metagenomics Data
Source: Phil Papadopoulos, SDSC, Calit2
OptIPuter Scalable Adaptive Graphics Environment (SAGE) Allows Integration of HD Streams
OptIPortal – Termination Device for the OptIPuter Global Backplane
OptIPortal – Termination Device for the OptIPuter Global Backplane
20 Dual CPU Nodes, 20 24” Monitors, ~$50,000
1/4 Teraflop, 5 Terabyte Storage, 45 Megapixels: Nice PC!
Scalable Adaptive Graphics Environment (SAGE): Jason Leigh, EVL-UIC
Source: Phil Papadopoulos SDSC, Calit2
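To put the OptIPortal totals in per-unit terms, the quoted figures can be divided across the 20 nodes and 20 monitors. This is a back-of-the-envelope sketch only: the function name is hypothetical, the even split is an assumption, and decimal units are assumed (1 Teraflop = 1000 Gigaflops, 1 Terabyte = 1000 Gigabytes).

```python
def optiportal_per_unit(nodes: int = 20,
                        monitors: int = 20,
                        cost_usd: float = 50_000.0,
                        teraflops: float = 0.25,
                        storage_tb: float = 5.0,
                        megapixels: float = 45.0) -> dict:
    """Split the slide's aggregate OptIPortal figures evenly across units.

    An even split is an assumption; the slide gives only totals.
    """
    return {
        "usd_per_node": cost_usd / nodes,
        "gigaflops_per_node": teraflops * 1000 / nodes,
        "storage_gb_per_node": storage_tb * 1000 / nodes,
        "megapixels_per_monitor": megapixels / monitors,
    }


per_unit = optiportal_per_unit()
print(per_unit["usd_per_node"])            # 2500.0
print(per_unit["gigaflops_per_node"])      # 12.5
print(per_unit["storage_gb_per_node"])     # 250.0
print(per_unit["megapixels_per_monitor"])  # 2.25
```

About 2.25 megapixels per 24” monitor is close to a 1920 x 1200 panel (2.3 megapixels), so the 45 megapixel total looks like a rounded figure for a wall of such displays.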
UIC/UCSD 10GE CAVEWave on the National LambdaRail: Emerging OptIPortal Sites
CAVEWave Connects Chicago to Seattle to San Diego… and Washington D.C. as of 4/1/06 and JCVI as of 5/15/06
[Map labels: OptIPortals at SunLight, CICESE, UW, JCVI, MIT, SIO UCSD, SDSU, UIC EVL, UCI; two links marked "NEW!"]
First Remote Interactive High Definition Video Exploration of Deep Sea Vents
Source: John Delaney & Deborah Kelley, UWash
Canadian-U.S. Collaboration
High Definition Still Frame of Hydrothermal Vent Ecology 2.3 Km Deep
White Filamentous Bacteria on 'Pill Bug' Outer Carapace
Scale bar: 1 cm
Source: John Delaney and Research Channel, U Washington