Rob Edwards San Diego State University Fellowship for Interpretation of Genomes

Challenges for metagenomic data analysis and lessons from viral metagenomes [What would you do if sequencing were free?] Rob Edwards San Diego State University Fellowship for Interpretation of Genomes The Burnham Institute for Medical Research


Page 1: Rob Edwards San Diego State University Fellowship for Interpretation of Genomes

Challenges for metagenomic data analysis and lessons from viral metagenomes

[What would you do if sequencing were free?]

Rob Edwards

San Diego State University

Fellowship for Interpretation of Genomes

The Burnham Institute for Medical Research

Page 2: Rob Edwards San Diego State University Fellowship for Interpretation of Genomes

Outline

• How and why we sequence environments

• Viral metagenomics
– Marine stories
– Human stories

• Pyrosequencing
– Mine story

• Is there a Future?

Page 3: Rob Edwards San Diego State University Fellowship for Interpretation of Genomes

Why Metagenomics?

• What is there?

• How many are there?

• What are they doing?

Page 4: Rob Edwards San Diego State University Fellowship for Interpretation of Genomes

How do you sequence the environment?

• Extract DNA

Page 5: Rob Edwards San Diego State University Fellowship for Interpretation of Genomes
Page 6: Rob Edwards San Diego State University Fellowship for Interpretation of Genomes
Page 7: Rob Edwards San Diego State University Fellowship for Interpretation of Genomes

CsCl step gradient (layer densities: 1.1, 1.35, 1.5 and 1.7 g ml⁻¹)

Page 8: Rob Edwards San Diego State University Fellowship for Interpretation of Genomes

CsCl step gradient

Page 9: Rob Edwards San Diego State University Fellowship for Interpretation of Genomes

How do you sequence the environment?

• Extract DNA

• Create library

Page 10: Rob Edwards San Diego State University Fellowship for Interpretation of Genomes

Linker-Amplified Shotgun Libraries (LASLs)

• Hydroshear
• Blunt-ending
• Addition of linkers
• Amplification of fragments

This method produces high coverage libraries of over 1 million clones from as little as 1 ng DNA

Soil Extraction Kit

David Mead -

Breitbart (2002) PNAS

Page 11: Rob Edwards San Diego State University Fellowship for Interpretation of Genomes

How do you sequence the environment?

• Extract DNA

• Create library

• Sequence fragments

Page 12: Rob Edwards San Diego State University Fellowship for Interpretation of Genomes

Outline

• How and why we sequence environments

• Viral metagenomics
– Marine stories
– Human stories

• Pyrosequencing
– Mine story

• Is there a Future?

Page 13: Rob Edwards San Diego State University Fellowship for Interpretation of Genomes

Why Phages?

• Phages are viruses that infect bacteria
– 10:1 ratio of phages:bacteria
– 10³¹ phages on the planet

• Specific interactions (probably)
– one virus : one host

• Small genome size
– Higher coverage

• Horizontal gene transfer
– 10²⁵-10²⁸ bp DNA per year in the oceans

Page 14: Rob Edwards San Diego State University Fellowship for Interpretation of Genomes
Page 15: Rob Edwards San Diego State University Fellowship for Interpretation of Genomes

Uncultured Viruses

• 200 liters water or 5-500 g fresh fecal matter
• Concentrate and purify viruses
• Epifluorescent microscopy
• Extract nucleic acids
• DNA/RNA LASL
• Sequence

Page 16: Rob Edwards San Diego State University Fellowship for Interpretation of Genomes

Bioinformatics

• BLAST against NR
– blastx, tblastn, tblastx

• BLAST against boutique databases
– Complete phage genomes, ACLAME, other libraries, 16S

• Parsing to present data in a useful format

Page 17: Rob Edwards San Diego State University Fellowship for Interpretation of Genomes

BLAST and Parsing

• http://phage.sdsu.edu/blast

• Submit BLAST to local and remote databases
– Local (as fast as possible)
– NCBI (one search every 3 seconds)

• Many concurrent searches
– One search versus 1,000 searches

• Parse data into tables
– Access to taxonomy etc
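The parsing step can be sketched as follows. This is a minimal illustration, not the code behind the phage.sdsu.edu service: it assumes the searches were saved in the standard 12-column tab-separated BLAST output, and the function names `best_hits` and `known_fraction` are hypothetical. It keeps the best hit per read below the E<0.001 cutoff used on the later slides and reports the fraction of reads with any such hit.

```python
import csv
import io

# Column names of the standard 12-column BLAST tabular output.
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def best_hits(handle, max_evalue=1e-3):
    """Return {query id: row} with the lowest-E-value hit below the cutoff."""
    best = {}
    for fields in csv.reader(handle, delimiter="\t"):
        if not fields or fields[0].startswith("#"):
            continue  # skip blank lines and comment lines
        row = dict(zip(COLUMNS, fields))
        row["evalue"] = float(row["evalue"])
        if row["evalue"] >= max_evalue:
            continue  # not significant at the chosen cutoff
        current = best.get(row["qseqid"])
        if current is None or row["evalue"] < current["evalue"]:
            best[row["qseqid"]] = row
    return best


def known_fraction(all_queries, hits):
    """Fraction of submitted reads that have at least one significant hit."""
    return sum(q in hits for q in all_queries) / len(all_queries)


# Hypothetical three-line result file for two reads.
example = io.StringIO(
    "read1\tphageA\t85.0\t100\t15\t0\t1\t100\t1\t100\t1e-20\t150\n"
    "read1\tphageB\t70.0\t100\t30\t0\t1\t100\t1\t100\t1e-5\t60\n"
    "read2\tphageC\t40.0\t50\t30\t0\t1\t50\t1\t50\t0.5\t20\n"
)
hits = best_hits(example)
print(hits["read1"]["sseqid"], known_fraction(["read1", "read2"], hits))
```

Reads with no significant hit are the "unknown" fraction; joining `sseqid` to a taxonomy table is left out because the talk does not describe how that lookup is done.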

Page 18: Rob Edwards San Diego State University Fellowship for Interpretation of Genomes

Most Viral Genes are Unknown

Known 22%

Unknown 78%

TBLAST (E<0.001) 3,093 sequences

Breitbart (2002) PNAS; Rohwer (2003) Cell

Page 19: Rob Edwards San Diego State University Fellowship for Interpretation of Genomes

60 billion base pairs

60 million sequences

GenBank has more than doubled since 2001 …

Page 20: Rob Edwards San Diego State University Fellowship for Interpretation of Genomes

GenBank has more than doubled since 2001 …

but the fraction of unknowns remains constant

Edwards (2005) Nature Rev. Microbiol.

Page 21: Rob Edwards San Diego State University Fellowship for Interpretation of Genomes

All of the new genes in the databases are coming from environmental sequences

Page 22: Rob Edwards San Diego State University Fellowship for Interpretation of Genomes

Outline

• How and why we sequence environments

• Viral metagenomics
– Marine stories
– Human stories

• Pyrosequencing
– Mine story

• Is there a Future?

Page 23: Rob Edwards San Diego State University Fellowship for Interpretation of Genomes

Human-associated viruses

• More bacteria than somatic cells by at least an order of magnitude

• More phages than bacteria by an order of magnitude

• Sample the bacteria in the intestine by sampling their phage

Page 24: Rob Edwards San Diego State University Fellowship for Interpretation of Genomes

Most Viral DNA Sequences in Adult Human Feces are Unknown Phages

Known 40%

Unknown 60%

Phages 94%

Eukaryotic Viruses 6%

TBLAST (E<0.001) 532 sequences

Breitbart (2003) J. Bacteriol.

Page 25: Rob Edwards San Diego State University Fellowship for Interpretation of Genomes

Adults Versus Babies

• No bacteria or viruses in 1st fecal sample
• Abundant bacterial and viral communities by 1 week of age

>10⁸ VLP/g feces

Page 26: Rob Edwards San Diego State University Fellowship for Interpretation of Genomes

Baby Feces Viruses

• Most sequences are unknown (≈70%)

• Similarities to phages from Lactococcus, Lactobacillus, Listeria, Streptococcus, and other Gram positive hosts

• From microarray studies, sequences are stable in the baby over a 3 month period

• Same types of phage as present in adult feces
– one identical sequence to an unrelated adult!

Page 27: Rob Edwards San Diego State University Fellowship for Interpretation of Genomes

DNA viruses in feces are phages.

Feces ≠ intestines.

RNA viruses?

Page 28: Rob Edwards San Diego State University Fellowship for Interpretation of Genomes

Most Human RNA Viruses are Known

Known 92%

Unknown 8%

TBLAST (E<0.001) ≈36,000 sequences

Pepper Mild Mottle Virus 65%

Other Plant Viruses 9%

Other 26%

Zhang (2006) PLoS Biology

Page 29: Rob Edwards San Diego State University Fellowship for Interpretation of Genomes

Pepper Mild Mottle Virus (PMMV)

• ssRNA virus; ≈6 kb genome
• Related to Tobacco Mosaic Virus
• Infects members of Capsicum family
• Widely distributed – spread through seeds
• Fruits are small, malformed, mottled
• Rod-shaped virions

TOBACCO MOSAIC VIRUS http://www.rothamsted.bbsrc.ac.uk/ppi/links/pplinks/virusems/

Viral particles in fecal sample

Page 30: Rob Edwards San Diego State University Fellowship for Interpretation of Genomes

PMMV is common in Human Feces

Fecal samples -> Extract total RNA -> RT-PCR for PMMV

[Gel lanes: S1 S2 S3 S4 S5 S6 S7 S8 S9 PMMV]

San Diego: 78% of people are positive

Singapore: 67% of people are positive

10-50 fold increase in feces compared to food

10⁶-10⁹ PMMV copies per gram dry weight of feces

Page 31: Rob Edwards San Diego State University Fellowship for Interpretation of Genomes

Which Foods Contain PMMV?

Foods tested: Indian curry, Pork noodle, Red chili, Chicken rice, Chinese food, Hong Kong chili sauce, Hong Kong green chili, Vegetarian chili

• Chili powder
• Chili sauces

NOT FOUND IN FRESH PEPPERS

Page 32: Rob Edwards San Diego State University Fellowship for Interpretation of Genomes

Koch’s Postulates

Thesunmachine.net

http://www.sweatnspice.com

Page 33: Rob Edwards San Diego State University Fellowship for Interpretation of Genomes

Human microbial metagenome is more important than human genome

Page 34: Rob Edwards San Diego State University Fellowship for Interpretation of Genomes

Outline

• How and why we sequence environments

• Viral metagenomics
– Marine stories
– Human stories

• Pyrosequencing
– Mine story

• Is there a Future?

Page 35: Rob Edwards San Diego State University Fellowship for Interpretation of Genomes

How do you sequence the environment?

• Extract DNA

• Create library

• Sequence fragments

Everything so far from 40,000 sequences

This is so 2004

Page 36: Rob Edwards San Diego State University Fellowship for Interpretation of Genomes

How do you sequence the environment?

• Extract DNA

• Pyrosequence

Page 37: Rob Edwards San Diego State University Fellowship for Interpretation of Genomes

454 Pyrosequencing

• DNA extraction from environment (SDSU)
• Whole genome amplification (SDSU)
• Emulsion-based PCR (454 Inc.)
• Luciferase-based sequencing (454 Inc.)

Margulies (2005) Nature

Page 38: Rob Edwards San Diego State University Fellowship for Interpretation of Genomes

454 Sequence Data (Only from Rohwer Lab)

• 21 libraries
– 10 microbial, 11 phage

• 597,340,328 bp total
– 20% of the human genome
– 50% of all complete and partial microbial genomes

• 5,769,035 sequences
– Average 274,716 per library

• Average read length 103.5 bp
– Av. read length has not increased in 7 months
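The derived figures on this slide follow from the three totals. A quick check, assuming a human genome of roughly 3 Gbp (that size is not stated in the talk):

```python
total_bp = 597_340_328
total_reads = 5_769_035
libraries = 21
HUMAN_GENOME_BP = 3.0e9  # assumption: approximate haploid human genome size

reads_per_library = total_reads / libraries      # about 274,716
mean_read_length = total_bp / total_reads        # about 103.5 bp
fraction_of_human = total_bp / HUMAN_GENOME_BP   # about 0.20

print(round(reads_per_library), round(mean_read_length, 1),
      round(100 * fraction_of_human))
```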

Page 39: Rob Edwards San Diego State University Fellowship for Interpretation of Genomes

Growth of sequence data

6 million reads

600 million bp

Page 40: Rob Edwards San Diego State University Fellowship for Interpretation of Genomes

Cost of sequencing

• One reaction = $10,000
• One reaction = 250,000 reads
• 250 reads = $10
• 1 read = 4¢
• 1 read = 100 bp
• 1 bp = 0.04¢ ($400 per 1x 1,000,000 bp)

• Sanger sequencing ca. $1/rxn, 0.2¢/bp
– real cost ca. $5/rxn, 1¢/bp

454 sequencing does not require cloning, arraying, etc.
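The per-base costs can be reproduced with a few lines of arithmetic. The helper name `cents_per_bp` is illustrative, and the 500 bp Sanger read length is not on the slide: it is the value implied by $1 per reaction at 0.2¢ per bp.

```python
def cents_per_bp(cost_per_reaction_usd, reads_per_reaction, bp_per_read):
    """Cost in US cents per sequenced base pair."""
    return 100.0 * cost_per_reaction_usd / (reads_per_reaction * bp_per_read)


pyro = cents_per_bp(10_000, 250_000, 100)   # 454: about 0.04 cents per bp
sanger_quoted = cents_per_bp(1, 1, 500)     # about 0.2 cents per bp (500 bp implied)
sanger_real = cents_per_bp(5, 1, 500)       # about 1 cent per bp
usd_per_megabase = pyro / 100.0 * 1_000_000  # about $400 per 1x 1,000,000 bp

print(pyro, sanger_quoted, sanger_real, usd_per_megabase)
```

On these numbers the quoted Sanger price is about 5 times the 454 price per base, and the "real" Sanger cost about 25 times.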

Page 41: Rob Edwards San Diego State University Fellowship for Interpretation of Genomes

Bioinformatics

• 597,340,328 bp total

• 5,769,035 sequences

• 7 months

• Existing tools are not sufficient

Page 42: Rob Edwards San Diego State University Fellowship for Interpretation of Genomes

Current Pipeline

• Dereplicate

• BLAST against
– 16S
– Complete phage
– nr (SEED)
– subsystems

http://phage.sdsu.edu/~rob/Pyrosequencing/
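The talk does not say how reads are dereplicated. A minimal sketch, assuming that dereplication means collapsing reads with exactly identical sequences (the function name `dereplicate` is hypothetical):

```python
def dereplicate(records):
    """Collapse exact duplicate sequences.

    records: iterable of (read_id, sequence) pairs.
    Returns a list of (representative_id, sequence, copies), in first-seen order.
    """
    groups = {}
    for read_id, seq in records:
        # Assumption: case-insensitive exact match defines a duplicate.
        groups.setdefault(seq.upper(), []).append(read_id)
    return [(ids[0], seq, len(ids)) for seq, ids in groups.items()]


reads = [("r1", "ACGT"), ("r2", "acgt"), ("r3", "TTGA")]
print(dereplicate(reads))
```

Keeping the copy count lets later steps weight each unique sequence by how often it was observed, so BLAST runs once per unique read.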

Page 43: Rob Edwards San Diego State University Fellowship for Interpretation of Genomes

Sequencing is cheap and easy.

Bioinformatics is neither.

Page 44: Rob Edwards San Diego State University Fellowship for Interpretation of Genomes

Outline

• How and why we sequence environments

• Viral metagenomics
– Marine stories
– Human stories

• Pyrosequencing
– Mine story

• Is there a Future?

Page 45: Rob Edwards San Diego State University Fellowship for Interpretation of Genomes

The Soudan Mine, Minnesota

Red Stuff: Oxidized

Black Stuff: Reduced

Page 46: Rob Edwards San Diego State University Fellowship for Interpretation of Genomes

Red and Black Samples Are Different

Cloned and 454 sequenced 16S are indistinguishable

[Figure panels: Black stuff; Red; Cloned Red]

Page 47: Rob Edwards San Diego State University Fellowship for Interpretation of Genomes

Annotation of metagenomes by subsystems

A subsystem is a group of genes that work together

– Metabolism
– Pathway
– Cellular structures
– Anything an annotator thinks is interesting

Page 48: Rob Edwards San Diego State University Fellowship for Interpretation of Genomes

There are different amounts of metabolism in each environment

Page 49: Rob Edwards San Diego State University Fellowship for Interpretation of Genomes

There are different amounts of substrates in each environment

Black Stuff

Red Stuff

Page 50: Rob Edwards San Diego State University Fellowship for Interpretation of Genomes

But are the differences significant?

• Sample 10,000 proteins from site 1
• Count frequency of each subsystem
• Repeat 20,000 times

• Repeat for sample 2

• Combine both samples
• Sample 10,000 proteins 20,000 times
• Build 95% CI

• Compare medians from sites 1 and 2 with 95% CI

Rodriguez-Brito (2006). In Review
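The resampling steps above can be sketched as follows. This is an illustration under stated assumptions, not the published method: sampling is done with replacement, the 95% CI is taken as the 2.5th to 97.5th percentile of the pooled resamples, and a subsystem is flagged when either site's median falls outside that interval. All function names are hypothetical.

```python
import random
from collections import Counter
from statistics import median


def resample_counts(proteins, sample_size, repeats, rng):
    """Return {subsystem: [count in each resample]} for one list of labels."""
    subsystems = sorted(set(proteins))
    counts = {s: [] for s in subsystems}
    for _ in range(repeats):
        tally = Counter(rng.choices(proteins, k=sample_size))  # with replacement
        for s in subsystems:
            counts[s].append(tally[s])
    return counts


def percentile_interval(values, level=0.95):
    """Empirical interval covering the central `level` share of the values."""
    ordered = sorted(values)
    tail = (1.0 - level) / 2.0
    last = len(ordered) - 1
    return ordered[int(tail * last)], ordered[int((1.0 - tail) * last)]


def compare_sites(site1, site2, sample_size=10_000, repeats=20_000, seed=0):
    """site1, site2: lists with one subsystem label per protein."""
    rng = random.Random(seed)
    med1 = {s: median(v) for s, v in
            resample_counts(site1, sample_size, repeats, rng).items()}
    med2 = {s: median(v) for s, v in
            resample_counts(site2, sample_size, repeats, rng).items()}
    pooled = resample_counts(site1 + site2, sample_size, repeats, rng)
    result = {}
    for s, values in pooled.items():
        low, high = percentile_interval(values)
        m1, m2 = med1.get(s, 0), med2.get(s, 0)
        outside = not (low <= m1 <= high) or not (low <= m2 <= high)
        result[s] = {"median1": m1, "median2": m2,
                     "ci": (low, high), "significant": outside}
    return result
```

The defaults match the slide (10,000 proteins, 20,000 repeats) but are slow in pure Python; for a quick check, something like `compare_sites(red, black, sample_size=200, repeats=200)` on two label lists is enough to see the mechanics.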

Page 51: Rob Edwards San Diego State University Fellowship for Interpretation of Genomes

Examples of significantly different subsystems

Red Stuff

• Arg, Trp, His
• Ubiquinone
• FA oxidation
• Chemotaxis, Flagella
• Methylglyoxal metabolism

Black Stuff

• Ile, Leu, Val
• Siderophores
• Glycerolipids
• NiFe hydrogenase
• Phenylpropionate degradation

Page 52: Rob Edwards San Diego State University Fellowship for Interpretation of Genomes

Subsystem differences & metabolism

Iron acquisition: Black Stuff

• Siderophore enterobactin biosynthesis
• Ferric enterobactin transport
• ABC transporter ferrichrome
• ABC transporter heme

Black stuff: ferrous iron (Fe²⁺, ferroan [(Mg,Fe)₆(Si,Al)₄O₁₀(OH)₈])

Red stuff: ferric iron (goethite [FeO(OH)])

Page 53: Rob Edwards San Diego State University Fellowship for Interpretation of Genomes

Nitrification differentiates the samples

Edwards (2006) In review

Page 54: Rob Edwards San Diego State University Fellowship for Interpretation of Genomes

Not all biochemistry happens in a single organism

Anaerobic methane oxidation (Boetius et al., Nature, 2000)

CH₄ + SO₄²⁻ -> HCO₃⁻ + HS⁻ + H₂O

Archaea: CH₄ + H₂O -> HCO₃⁻ + OH + H₂ -> CO₂ + H₂O

Bacteria: SO₄²⁻ + H₂O -> HS⁻ + OH + 2O₂

Page 55: Rob Edwards San Diego State University Fellowship for Interpretation of Genomes

The challenge is explaining the differences between samples

Red Sample

• Arg, Trp, His
• Ubiquinone
• FA oxidation
• Chemotaxis, Flagella
• Methylglyoxal metabolism

Black Sample

• Ile, Leu, Val
• Siderophores
• Glycerolipids
• NiFe hydrogenase
• Phenylpropionate degradation

Page 56: Rob Edwards San Diego State University Fellowship for Interpretation of Genomes

We are moving away from one organism, one reaction and towards studying the biochemistry of whole environments

Bacteria don’t live alone

Page 57: Rob Edwards San Diego State University Fellowship for Interpretation of Genomes

Summary

From 454 sequence:
– Identify microbial composition
– Identify metabolic function
– Identify statistically significant differences in metabolism
– Who, what, why of microbial ecology

Page 58: Rob Edwards San Diego State University Fellowship for Interpretation of Genomes

Sampling Sites

Marine
– Near-shore water (~100 samples)
– Off-shore water (~50 samples)
– Near- and off-shore sediments

Metazoan associated
– Corals
– Fish
– Human blood
– Human stool

Terrestrial/Soil
– Amazon rainforest
– Konza prairie
– Joshua Tree desert
– Singapore Air

Freshwater
– Aquifer
– Glacial lake

Extreme
– Hot springs (84 °C; 78 °C)
– Soda lake (pH 13)
– Solar saltern (>35% salt)

Page 59: Rob Edwards San Diego State University Fellowship for Interpretation of Genomes

SDSU: Forest Rohwer, Mya Breitbart, Beltran Rodriguez-Brito

Rohwer Lab: Linda Wegley, Florent Angly, Matt Haynes

Also at SDSU: Anca Segall, Willow Segall, Stanley Maloy

Math Guys @ SDSU: Peter Salamon, Joe Mahaffy, James Nulton, Ben Felts, David Bangor, Steve Rayhawk, Jennifer Mueller

MIT: Ed DeLong

FIG: Veronika Vonstein, Ross Overbeek, Annotators

Genome Institute of Singapore: Zhang Tao, Charlie Lee, Chia Lin Wei, Yijun Ruan

NSF
- Biotic Surveys and Inventories
- Biological Oceanography
- Biocomplexity

Page 60: Rob Edwards San Diego State University Fellowship for Interpretation of Genomes
Page 61: Rob Edwards San Diego State University Fellowship for Interpretation of Genomes

Viral Community Structure

• Contigs assembled from fragments with >= 98% identity over 20 bp are a resampling of a single phage genome

• The contig spectrum is the number of contigs that have one sequence, the number that have two sequences, and so on

• Use both analytical and Monte-Carlo simulations to predict community structure from contig spectrum

The Math Guys (2006) In preparation
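As a small illustration of the definition above, a contig spectrum can be tabulated from the number of sequences in each contig. The function name `contig_spectrum` and the input format (one integer per contig, singletons counted as size 1) are assumptions for this sketch.

```python
from collections import Counter


def contig_spectrum(contig_sizes, max_size=None):
    """Return [n1, n2, ...] where nk is the number of contigs with k sequences."""
    counts = Counter(contig_sizes)
    top = max_size if max_size is not None else max(counts, default=0)
    return [counts.get(k, 0) for k in range(1, top + 1)]


# Six contigs: three singletons, two pairs, one triple.
print(contig_spectrum([1, 1, 1, 2, 2, 3]))
```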

Page 62: Rob Edwards San Diego State University Fellowship for Interpretation of Genomes

[Figure: rank-abundance curve; x-axis: Species Rank (0-50); y-axis: Abundance of the species (%) (0.00-2.00)]

• Determine the actual contig spectrum of the sample
• Predict a contig spectrum using a species abundance model
• Compute the error between the actual and predicted
• Adjust the parameters in the species abundance model to minimize errors
• Continue this procedure until we obtain the smallest error
• Find the smallest error, a global minimum

[Figure: Error versus Model parameters]
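The fitting loop on this slide can be sketched with a toy Monte Carlo model. Everything specific here is an assumption made for illustration and is not taken from the talk: a power-law rank-abundance model, equal genome length (50,000 bp) and read length (650 bp) for all genotypes, error-free reads that join a contig when they overlap by at least 20 bp, and an exhaustive grid search standing in for the parameter adjustment step.

```python
import random
from collections import Counter


def simulate_contig_sizes(n_reads, n_genotypes, exponent, rng,
                          genome_len=50_000, read_len=650, min_overlap=20):
    """Monte Carlo contig spectrum as a Counter {contig size: number of contigs}."""
    # Assumed model: abundance of the genotype at rank r is proportional to r**-exponent.
    weights = [rank ** -exponent for rank in range(1, n_genotypes + 1)]
    read_starts = {}
    for genotype in rng.choices(range(n_genotypes), weights=weights, k=n_reads):
        start = rng.randrange(genome_len - read_len + 1)
        read_starts.setdefault(genotype, []).append(start)
    sizes = Counter()
    for starts in read_starts.values():
        starts.sort()
        size = 1
        for prev, cur in zip(starts, starts[1:]):
            if cur - prev <= read_len - min_overlap:
                size += 1          # enough overlap with the previous read
            else:
                sizes[size] += 1   # gap: close the current contig
                size = 1
        sizes[size] += 1
    return sizes


def spectrum_error(observed, predicted):
    """Sum of squared differences between two contig spectra."""
    keys = set(observed) | set(predicted)
    return sum((observed.get(k, 0) - predicted.get(k, 0)) ** 2 for k in keys)


def fit_grid(observed, n_reads, richness_grid, exponent_grid, seed=0):
    """Try every parameter pair and return (error, n_genotypes, exponent)."""
    best = None
    for n_genotypes in richness_grid:
        for exponent in exponent_grid:
            predicted = simulate_contig_sizes(n_reads, n_genotypes, exponent,
                                              random.Random(seed))
            error = spectrum_error(observed, predicted)
            if best is None or error < best[0]:
                best = (error, n_genotypes, exponent)
    return best
```

For example, `fit_grid(observed, 500, [10, 50, 200], [0.5, 1.0, 1.5])` scans nine candidate communities against an observed spectrum of 500 reads. A grid search cannot get stuck in a local minimum, at the price of only finding the best point on the grid.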

Page 63: Rob Edwards San Diego State University Fellowship for Interpretation of Genomes

Viral Communities are Extremely Diverse

Lots of rare viral genotypes

[Figure panels: Fecal; Seawater; Marine Sediments]

Page 64: Rob Edwards San Diego State University Fellowship for Interpretation of Genomes

[Figure: Shannon-Wiener Index (axis 0-10) for Cropland Earthworms, Fossil Corals, Forest Amphibians, River Bacteria, Sediment Viruses, Seawater Viruses, Seawater Viruses, Fecal Viruses, Bacteria on Corals, Agriculture Soil Bacteria, Soil Nematodes, Rainforest Spiders, Amazon Fish, Rainforest Birds, Forest Mammals, Temperate Forest Beetles]
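For reference, the Shannon-Wiener index of a community with relative abundances p_i is H' = -Σ p_i ln p_i. A minimal sketch; the natural logarithm is an assumption, since the slide does not state which base was used:

```python
import math


def shannon_index(abundances):
    """Shannon-Wiener index H' = -sum(p * ln p) over non-zero abundances."""
    total = sum(abundances)
    return -sum((a / total) * math.log(a / total) for a in abundances if a > 0)


# Four equally abundant genotypes give H' = ln 4.
print(shannon_index([25, 25, 25, 25]))
```

With natural logs, a perfectly even community of N genotypes has H' = ln N, so an index near 9 corresponds to the evenness of roughly 8,000 equally abundant genotypes.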