Presentation Title April 4, 2002 CAMERA- Metagenomics meets the Cyberinfrastructure David T. Kingsbury Gordon and Betty Moore Foundation BERAC - October 16, 2006

CAMERA- Metagenomics meets the Cyberinfrastructure

David T. Kingsbury

Gordon and Betty Moore Foundation

BERAC - October 16, 2006

The CAMERA Partnership

Community Cyberinfrastructure for Advanced Marine Microbial Ecology Research and Analysis

Genomic Data Is Growing Rapidly, But Metagenomics Will Vastly Increase The Scale…

GenBank (www.ncbi.nlm.nih.gov/Genbank): 100 Billion Bases!

Protein Data Bank (www.rcsb.org/pdb/holdings.html): 35,000 Structures

Total Data < 1TB

The Sargasso Sea Experiment: The Power of Environmental Metagenomics

Yielded a Total of Over 1 billion Base Pairs of Non-Redundant Sequence

Displayed the Gene Content, Diversity, & Relative Abundance of the Organisms

Sequences from at Least 1800 Genomic Species, including 148 Previously Unknown

Identified over 1.2 Million Unknown Genes

MODIS-Aqua satellite image of ocean chlorophyll in the Sargasso Sea grid about the BATS site from 22 February 2003

J. Craig Venter, et al., Science, 2 April 2004: Vol. 304, pp. 66-74

Marine Genome Sequencing Project: Measuring the Genetic Diversity of Ocean Microbes

Moore Foundation Funded the Venter Institute to Provide the Full Genome Sequence of 155 Marine Microbes

www.moore.org/microgenome/trees_main.asp

Moore Microbial Genome Sequencing Project: Cyanobacteria Being Sequenced by Venter Institute

GOS Sequences are Largely Bacterial

Source: Shibu Yooseph, et al. (PLOS Biology in press 2006)

~3 Million Previously Known Sequences

~5.6 Million GOS Sequences

Metagenomics Will Couple to Earth Observations Which Add Several TBs/Day

[Bar chart: cumulative archive holdings in terabytes (y-axis, 0 to 8,000) by calendar year (x-axis, 2001 to 2014), broken down by instrument]

Legend: Other EOS, HIRDLS, MLS, TES, OMI, AMSR-E, AIRS-is, GMAO, MOPITT, ASTER, MISR, V0 Holdings, MODIS-T, MODIS-A

Other EOS = ACRIMSAT, Meteor 3M, Midori II, ICESat, SORCE

File name: archive holdings_122204.xls; tab: all instr bar

Terra EOM: Dec 2005

Aqua EOM: May 2008

Aura EOM: Jul 2010

NOTE: Data remains in the archive pending transition to LTA

Source: Glenn Iona, EOSDIS Element Evolution Technical Working Group January 6-7, 2005

Driven by User Needs

CAMERA serves as one representation of a specific research community’s need for a system to:

– Collect and reference increasing metadata relevant to environmental metagenome datasets

– Exploit the power of querying on metadata across multiple geospatial locations

– Have access to a diverse and customizable set of easy-to-use tools to analyze their data in the context of collected metagenomic and whole genomic datasets

– Have the ability to update and propagate improvements to annotations

– Have a pre-publication, pre-submission collaborative workspace

– Serve a diverse informatics-literate community

Services Provided

Data and Application Services

Tools and Workflows

Computational Data, Visualization and Collaborative environment

Outreach and Training in Environmental Genomics

Data and Application Services

Primary Data

– Sargasso Sea and Sorcerer II expedition data

– JGI marine & terrestrial environmental datasets

– Moore Microbial Genomes

– JGI and other relevant whole genomes

– Research community submitted datasets

– Submitted 454-based metagenomic datasets

– Publicly available NR protein and DNA sequence datasets

Derived Data

– Annotations of datasets

– Assemblies

– Alignments

– Pre-computed clusters

Sample Metadata from GOS

Site Metadata

– Location (lat/long, water depth)

– Site characterization (finite list of types plus “other”)

– Site description (free text)

– Country

Sampling Metadata

– Sample collection date/time

– Sampling depth

– Conditions at time of sampling (e.g., stormy, surface temperature)

– Sample physical/chemical measurements (T (°C), S (ppt), chl a (mg m⁻³), etc.)

– “author”

Experimental Parameters

– Filter size

– Insert size
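
As a rough illustration of how the site, sampling, and experimental metadata listed above might be held together and queried by location, here is a minimal Python sketch. Every name in it (`SampleRecord`, `in_bounding_box`, the field names) is hypothetical, the units chosen for filter and insert size are assumptions, and the sample values are invented for illustration; none of this reflects CAMERA's actual schema.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class SampleRecord:
    """Hypothetical record mirroring the metadata categories on the slide."""
    # Site metadata
    latitude: float             # decimal degrees, north positive
    longitude: float            # decimal degrees, east positive
    water_depth_m: float
    site_type: str              # one of a finite list of types, or "other"
    site_description: str       # free text
    country: str
    # Sampling metadata
    collected_at: datetime
    sampling_depth_m: float
    conditions: str             # e.g. "stormy"
    temperature_c: Optional[float] = None
    salinity_ppt: Optional[float] = None
    chlorophyll_a_mg_m3: Optional[float] = None
    author: str = ""
    # Experimental parameters (units are assumptions)
    filter_size_um: Optional[float] = None
    insert_size_kb: Optional[float] = None


def in_bounding_box(samples: List[SampleRecord], south: float, north: float,
                    west: float, east: float) -> List[SampleRecord]:
    """Return samples inside a simple lat/long box (no antimeridian wrap)."""
    return [s for s in samples
            if south <= s.latitude <= north and west <= s.longitude <= east]


# Illustrative values only, not real GOS measurements.
samples = [
    SampleRecord(31.5, -64.0, 4200.0, "open ocean", "example site A", "Bermuda",
                 datetime(2000, 1, 1, 10, 0), 5.0, "calm", temperature_c=20.0),
    SampleRecord(9.0, -79.5, 25.0, "coastal", "example site B", "Panama",
                 datetime(2000, 1, 2, 9, 30), 2.0, "stormy"),
]

hits = in_bounding_box(samples, south=25.0, north=35.0, west=-70.0, east=-60.0)
print([s.site_description for s in hits])
```

A flat record with optional measurement fields keeps the sketch short; a real system would more likely use a database with spatial indexing.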

Tools and Workflows

Initial set

– BLAST Server

– Clustering

– HMM/Profile

– Neighborhood analysis

– Multiple sequence alignments

– Assembly

Proposed New Tools

– Multiple Auto Annotation pipelines

– Fast Sequence lookup

– Customized Assembly

– Phylogenetic Analysis

– Clustering Tools

CAMERA Outreach Modes

Scientific Advisory Board

– Early Adopters

– OptIPortal End Points

Targeted Workshops

– User Forums

– User Software Testing

– Viz Tool Brainstorming

Presentations at Scientific Meetings

– e.g., Demonstration Booth at JCVI Genomes, Medicine, and the Environment Conference, October 2006

Partnerships With Metagenomics Projects

– e.g., DoE’s Joint Genome Institute (JGI)

Training and User Services Team

Guiding Philosophy for Development

Sprint Q4 2006

– Propagate JCVI toolkit and data ASAP

– Mechanism for publication of Sorcerer II data

– Enabler for community

– Defined deliverables, project management approach

Marathon Q4 2006 onward

– Additional Datasets

– Additional tools

– Community drives prioritization for ongoing releases

– Advisory Board, Community Outreach

Keys to success: Tight integration of science, bioinformatics, software, and IT, Matched to Community Needs

The Future Home of the Moore Foundation Funded Marine Microbial Ecology Metagenomics Complex

First Implementation of the CAMERA Complex

Photo Courtesy Joe Keefe, Calit2

Major Buildout of Calit2 Server Room Underway

http://calit2-1101-1.ucsd.edu/

Moore CAMERA Production Environment

Creation of Initial Production Environment – September 2006

– Hardware:

– Compute Nodes: ~200 4 CPU Nodes = ~800 Processing Cores

– Storage Servers: 10 systems = ¼ Petabyte raw storage

– Database Servers: Larger 20-40TB; Smaller 5-10TB

– Network Management: Force10 E1200 Router w/12 10GigE Interfaces to Each System Ports

User Access to Compute Cycles

– Bulk of free cycles available to external users

– Proposal mechanism

Source: Greg Hidley, Calit2; Phil Papadopoulos, SDSC, Calit2
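
The hardware figures above imply some simple arithmetic, sketched below in Python. The slide's numbers are approximate, and two assumptions are added here that the slide does not state: that 1 petabyte is taken as 1,000 terabytes, and that raw storage is split evenly across the ten storage systems.

```python
# Approximate figures quoted on the slide.
compute_nodes = 200        # "~200 4 CPU Nodes"
cpus_per_node = 4
storage_systems = 10       # "10 systems = 1/4 Petabyte raw storage"

# Assumption: 1 PB = 1,000 TB, so 1/4 PB = 250 TB.
raw_storage_tb = 250

# Total processing cores, matching the slide's "~800 Processing Cores".
total_cores = compute_nodes * cpus_per_node

# Assumption: storage is divided evenly across the storage systems.
tb_per_storage_system = raw_storage_tb / storage_systems

print(total_cores)            # 800
print(tb_per_storage_system)  # 25.0
```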

www.glif.is

Created in Reykjavik, Iceland 2003

Countries are Aggressively Creating Gigabit Services: Interactive Access to CAMERA and LOOKING Systems

Visualization courtesy of Bob Patterson, NCSA.

Scale

Calit2’s Direct Access Core Architecture Will Create Next Generation Metagenomics Server

Source: Phil Papadopoulos, SDSC, Calit2

[Architecture diagram; labels as recovered from the slide]

Flat File Server Farm

Data-Base Farm

10 GigE Fabric

Web Portal

+ Web Services

Traditional User (Request, Response)

Dedicated Compute Farm (1000 CPUs)

TeraGrid: Cyberinfrastructure Backplane (scheduled activities, e.g. all by all comparison) (10000s of CPUs)

Web (other service)

Local Environment, Local Cluster

Direct Access Lambda Cnxns

Data sources: Sargasso Sea Data; Sorcerer II Expedition (GOS); JGI Community Sequencing Project; Moore Marine Microbial Project; NASA Goddard Satellite Data; Community Microbial Metagenomics Data

OptIPuter Scalable Adaptive Graphics Environment (SAGE) Allows Integration of HD Streams

OptIPortal – Termination Device for the OptIPuter Global Backplane

OptIPortal – Termination Device for the OptIPuter Global Backplane

20 Dual CPU Nodes, 20 24” Monitors, ~$50,000

1/4 Teraflop, 5 Terabyte Storage, 45 Megapixels: Nice PC!

Scalable Adaptive Graphics Environment (SAGE): Jason Leigh, EVL-UIC

Source: Phil Papadopoulos SDSC, Calit2

UIC/UCSD 10GE CAVEWave on the National LambdaRail: Emerging OptIPortal Sites

CAVEWave Connects Chicago to Seattle to San Diego…and Washington D.C. as of 4/1/06 and JCVI as of 5/15/06

[Map of OptIPortals; two entries flagged NEW!]

Sites labeled: SunLight, CICESE, UW, JCVI, MIT, SIO UCSD, SDSU, UIC EVL, UCI

First Remote Interactive High Definition Video Exploration of Deep Sea Vents

Source: John Delaney & Deborah Kelley, UWash

Canadian-U.S. Collaboration

High Definition Still Frame of Hydrothermal Vent Ecology 2.3 Km Deep

White Filamentous Bacteria on 'Pill Bug' Outer Carapace

1 cm.

Source: John Delaney and Research Channel, U Washington

A Near Future Metagenomics Fiber Optic-Enabled Data Generator

Source: John Delaney, UWash