The CAMERA Project Metagenomics 2006 Oct 3-5, 2006 Paul Gilna, Calit2, UCSD.

Page 1:

The CAMERA Project

Metagenomics 2006

Oct 3-5, 2006

Paul Gilna, Calit2, UCSD

Page 2:

The CAMERA Partnership

Community Cyberinfrastructure for

Advanced Marine Microbial Ecology

Research and Analysis

Page 3:

Genomic Data Is Growing Rapidly, But Metagenomics Will Vastly Increase The Scale…

GenBank (www.ncbi.nlm.nih.gov/Genbank): 100 Billion Bases!

Protein Data Bank (www.rcsb.org/pdb/holdings.html): 35,000 Structures

Total Data < 1TB

Page 4:

The Sargasso Sea Experiment: The Power of Environmental Metagenomics

• Yielded a Total of Over 1 billion Base Pairs of Non-Redundant Sequence

• Displayed the Gene Content, Diversity, & Relative Abundance of the Organisms

• Sequences from at Least 1800 Genomic Species, including 148 Previously Unknown

• Identified over 1.2 Million Unknown Genes

MODIS-Aqua satellite image of ocean chlorophyll in the Sargasso Sea grid about the BATS site from 22 February 2003

J. Craig Venter, et al., Science 2 April 2004: Vol. 304, pp. 66-74

Page 5:

Full Genome Sequencing is Exploding: Most Sequenced Genomes are Bacterial

Completed Genomes: Total 422

Ongoing Genomes: Total 1665

Categories: Archaeal, Bacterial, Eukaryal

55 Metagenomes

www.genomesonline.org

First Genome 1995 6 Genomes/ Year 2000

Moore 155 In Here

Page 6:

Moore Microbial Genome Sequencing Project: Selected Microbes Throughout the World’s Oceans

www.moore.org/microgenome/worldmap.asp

Microbes Nominated by Leading Ocean Microbial Biologists

Page 7:

Moore Microbial Genome Sequencing Project: Cyanobacteria Being Sequenced by Venter Institute

Page 8:

Marine Genome Sequencing Project: Measuring the Genetic Diversity of Ocean Microbes

Page 9:

Genomic Data Is Growing Rapidly, But Metagenomics Will Vastly Increase The Scale…

GenBank (www.ncbi.nlm.nih.gov/Genbank): 100 Billion Bases!

Protein Data Bank (www.rcsb.org/pdb/holdings.html): 35,000 Structures

Total Data < 1TB

Page 10:

Metagenomics Will Couple to Earth Observations Which Add Several TBs/Day

[Figure: cumulative archive holdings by instrument. Y-axis: Cumulative TeraBytes, 0 to 8,000. X-axis: Calendar Year, 2001 to 2014.]

Series: Other EOS, HIRDLS, MLS, TES, OMI, AMSR-E, AIRS-is, GMAO, MOPITT, ASTER, MISR, V0 Holdings, MODIS-T, MODIS-A

Other EOS = ACRIMSAT, Meteor 3M, Midori II, ICESat, SORCE

Terra EOM Dec 2005; Aqua EOM May 2008; Aura EOM Jul 2010

File name: archive holdings_122204.xls, tab: all instr bar

NOTE: Data remains in the archive pending transition to LTA

Source: Glenn Iona, EOSDIS Element Evolution Technical Working Group January 6-7, 2005

Page 11:

Driven by User Needs

• CAMERA serves as one representation of a specific research community’s need for a system to:

– Collect and reference increasing metadata relevant to environmental metagenome datasets

– Exploit the power of querying on metadata across multiple geospatial locations

– Have access to a diverse and customizable set of easy-to-use tools to analyze their data

– Have the ability to add, update and propagate improvements to annotations

– Have a pre-publication, pre-submission collaborative workspace

– Serve diverse levels of informatics literacy

Page 12:

Services Provided

• Data and Application Services

• Tools and Workflows

• Computational Data, Visualization and Collaborative environment

• Outreach and Training in Environmental Genomics

Page 13:

Data and Application Services

• Primary Data

– Sargasso Sea and Sorcerer II expedition data

– JGI marine & terrestrial environmental datasets

– Moore Microbial Genomes

– JGI and other relevant whole genomes

– Research community submitted datasets

– Submitted 454-based metagenomic datasets

– Publicly available NR protein and DNA sequence datasets

• Derived Data

– Annotations of datasets

– Assemblies

– Alignments

– Pre-computed clusters

Page 14:

Sample Metadata from GOS

• Site Metadata

– Location (lat/long, water depth)

– Site characterization (finite list of types plus “other”)

– Site description (free text)

– Country

• Sampling Metadata

– Sample collection date/time

– Sampling depth

– Conditions at time of sampling (e.g., stormy, surface temperature)

– Sample physical/chemical measurements (T (°C), S (ppt), chl a (mg m⁻³), etc.)

– “author”

• Experimental Parameters

– Filter size

– Insert size
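The metadata categories on this slide could be captured in a simple record type, which in turn makes geospatial queries straightforward. The following is a minimal sketch only, not CAMERA's actual schema: the class name `SampleMetadata`, every field name, the units in the comments (metres, micrometres, kilobases), and the helper `in_bounding_box` are assumptions introduced for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional


@dataclass
class SampleMetadata:
    """Hypothetical record mirroring the slide's three metadata groups."""
    # Site metadata
    latitude: float            # decimal degrees, north positive (assumed convention)
    longitude: float           # decimal degrees, east positive (assumed convention)
    water_depth_m: float       # unit is an assumption
    site_type: str             # one of a finite list of types, or "other"
    site_description: str      # free text
    country: str
    # Sampling metadata
    collected_at: datetime
    sampling_depth_m: float    # unit is an assumption
    conditions: str            # e.g. "stormy"
    measurements: dict = field(default_factory=dict)  # e.g. temperature, salinity, chl a
    author: Optional[str] = None
    # Experimental parameters
    filter_size_um: Optional[float] = None   # unit is an assumption
    insert_size_kb: Optional[float] = None   # unit is an assumption


def in_bounding_box(samples, south, north, west, east):
    """Return the samples whose site lies inside a lat/long box.

    Simplification: the box must not cross the antimeridian.
    """
    return [
        s for s in samples
        if south <= s.latitude <= north and west <= s.longitude <= east
    ]
```

Calling `in_bounding_box(samples, south, north, west, east)` on a list of such records returns only those collected inside the given box, which is one simple way to query on metadata across multiple geospatial locations.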

Page 15:

Tools and Workflows

• Initial set

– BLAST Server

– Clustering

– HMM/Profile

– Neighborhood analysis

– Multiple sequence alignments

– Assembly

• Proposed New Tools

– Multiple Auto Annotation pipelines

– Fast Sequence lookup

– Customized Assembly

– Phylogenetic Analysis

– Clustering Tools

Page 16:

Guiding Philosophy for Development

• Sprint Q4 2006

– Propagate JCVI toolkit and data ASAP

– Mechanism for publication of Sorcerer II data

– Enabler for community

– Defined deliverables, project management approach

• Marathon Q4 2006 onward

– Additional Datasets

– Additional tools

– Community drives prioritization for ongoing releases

– Advisory Board, Community Outreach

• Keys to success: Tight integration of science, bioinformatics, software, and IT, Matched to Community Needs

Page 17:

The Future Home of the Moore Foundation Funded Marine Microbial Ecology Metagenomics Complex

First Implementation of the CAMERA Complex

Photo Courtesy Joe Keefe, Calit2

Major Buildout of Calit2 Server Room Underway

http://calit2-1101-1.ucsd.edu/

Page 18:

Moore CAMERA Production Environment

• Creation of Initial Production Environment – September 2006

– Hardware

  – Compute Nodes: ~200 4-CPU Nodes = ~800 Processing Cores

  – Storage Servers: 10 systems = ¼ Petabyte raw storage

  – Database Servers: Larger 20-40TB; Smaller 5-10TB

  – Network Management: Force10 E1200 Router w/12 10GigE Interfaces to Each System Ports

• User Access to Compute Cycles

– Bulk of free cycles available to external users

– Proposal mechanism in process

Source: Greg Hidley, Calit2; Phil Papadopoulos, SDSC, Calit2
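The capacity figures on this slide can be checked with simple arithmetic. This is a rough sketch under stated assumptions: ¼ petabyte is taken as 250 TB (decimal units), storage is assumed to be spread evenly over the 10 systems, and the slide's own "~" figures are treated as exact.

```python
# Figures from the slide; the slide marks the compute numbers as approximate.
compute_nodes = 200
cpus_per_node = 4
storage_systems = 10
raw_storage_tb = 250  # assumption: 1/4 petabyte taken as 250 TB

# Total processing cores, counting one core per CPU as the slide does.
total_cores = compute_nodes * cpus_per_node

# Raw storage per system, assuming an even split (not stated on the slide).
tb_per_storage_system = raw_storage_tb / storage_systems

print(total_cores)            # 800
print(tb_per_storage_system)  # 25.0
```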

Page 19:

www.glif.is

Created in Reykjavik, Iceland 2003

Countries are Aggressively Creating Gigabit Services: Interactive Access to CAMERA and LOOKING Systems

Visualization courtesy of Bob Patterson, NCSA.

Page 20:

CAMERA Outreach Modes

• Scientific Advisory Board

– Early Adopters

– OptIPortal End Points

• Targeted Workshops

– User Forums

– User Software Testing

– Viz Tool Brainstorming

• Presentations at Scientific Meetings

– Talks, posters, eventually demonstration booths

• Partnerships With Metagenomics Projects

– E.g. DoE’s Joint Genome Institute (JGI)

• Training and User Services Team

Page 21:

A Near Future Metagenomics Fiber Optic-Enabled Data Generator

Source: John Delaney, UWash