Folker Meyer: Metagenomic Data Annotation
Folker Meyer
Argonne National Laboratory and University of Chicago
June 14th, 1st EMP meeting, Shenzhen, China
Metagenome Annotation
Metagenomics needs the magic wand
• == “shotgun genomics applied directly to various environments” (“shotgun metagenomics”)
• != sequencing of BAC clones with environmental DNA (“functional metagenomics”)
• != sequencing single genes (16S rDNA) (“gene surveys”)
Who are they? What are they doing?
Portals help with computational analysis
• MG-RAST, IMG/M and CAMERA for metagenomes
  – Provide complete project support including metadata input
  – Systems allow upload of sequence runs and provide QC, feature identification, feature annotation, views and comparison
  – Systems provide lots of public samples to compare to
    • MG-RAST: 4,000+ public samples (June 2011)
  – Google will reveal URLs
• QIIME for amplicon studies
  – Provides support for amplicon analysis
  – Large number of public amplicon samples
  – Advanced visualization capabilities with rich metadata
  – Integration with other tools including MG-RAST
2010 state of metagenomics
• 8492 metagenomes from > 500 groups
• Over 20 GB per week (rapid growth)
• Many centers produce data
• This was a few weeks ago
2011: many small scale projects
V3 03/2011
• ~25,000 data sets, hundreds of groups
• ~4000 public, with metadata, 45 GBp
• >> 1 terabase (10^12 base pairs)
Even data upload is hard! Jumploader
Thanks to Rob Knight’s team for pointing us there
Part of an emerging digital biology
• Users (dots) sharing pre-publication metagenomes (edges)
Source: MG-RAST, 800+ shared metagenomes
Computing costs dominate
Source: Rob Knight, UColorado
“Living on the log scale” (Guy Cochrane, EBI, UK)
From: Wilkening et al., IEEE Cluster09, 2009
[Figure legend: computing, sequencing]
Challenges during shotgun metagenome analysis
1. Quality Control
2. Finding features
3. Characterizing features
4. Presentation
Quality control for de-novo sequencing
• Question is simple: How trustworthy is my data?
  – “rare biosphere debate” de-noising for amplicon runs
  – No such tool for shotgun data
• Existing QC approaches rely on:
  – Using reference sequences
  – Using vendor specific scores
    • Includes e.g. phred scores
• None of those are suitable for what we are doing
• EMP needs novel quality control to ensure comparisons work
Approaches utilizing artifacts of sequencing and library prep processes show promising results
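The phred scores named in the bullets above encode a per-base error probability as Q = -10 log10(P). The sketch below shows that conversion only; it assumes Sanger-style FASTQ encoding (ASCII offset 33), and the function names are illustrative rather than taken from any tool in this talk. It does not address the talk's point that vendor scores alone are not a sufficient QC measure.

```python
def phred_to_error_prob(q):
    """Convert a phred score Q to an error probability P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def decode_quality_string(qual, offset=33):
    """Decode a FASTQ quality string into phred scores.

    Assumption: Phred+33 encoding; older Illumina files used offset 64.
    """
    return [ord(ch) - offset for ch in qual]


def expected_errors(qual, offset=33):
    """Sum of per-base error probabilities (expected number of wrong bases)."""
    return sum(phred_to_error_prob(q) for q in decode_quality_string(qual, offset))


if __name__ == "__main__":
    qual = "IIII5555++"  # hypothetical quality string for a 10 bp read
    print(decode_quality_string(qual))
    print(round(expected_errors(qual), 4))
```

Summing the per-base probabilities gives a single number per read that can be compared across reads of the same length.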
Tell me if my data set is of type A or B
A) • Lots of errors, ~10% at 70 bp
Real data sets from MG-RAST
B) • Errors only at tail
K. Keegan, in preparation
% duplicates also varies
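One simple way to put a number on the share of duplicate reads is to count reads whose leading bases have already been seen. This is a minimal sketch under that assumption; exact-prefix matching and the `prefix_len` default are illustrative choices, not the method from the work cited on this slide.

```python
from collections import Counter


def percent_duplicates(reads, prefix_len=50):
    """Percentage of reads whose first `prefix_len` bases were already seen.

    Assumption: two reads are duplicates when their prefixes match exactly.
    """
    if not reads:
        return 0.0
    counts = Counter(read[:prefix_len].upper() for read in reads)
    # Every occurrence beyond the first of each prefix counts as a duplicate.
    duplicates = sum(n - 1 for n in counts.values())
    return 100.0 * duplicates / len(reads)


if __name__ == "__main__":
    reads = ["ACGTACGT", "ACGTACGT", "ACGTACGT", "TTTTCCCC"]  # toy data
    print(percent_duplicates(reads, prefix_len=8))
```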
Finding features
• Protein coding features
  – Statistics based approaches:
    • Using e.g. codon usage trained on existing genomes
    • MGA, Metagene, FragGeneScan, Prodigal, MetageneMarkHMM
    • Limitation: novel proteins are harder, as are islands and transferred genes
  – Similarity based approaches
    • Blastx search against
    • Limitation: runtime, and novel proteins will never be found
• Running more specialized tools, e.g. RFAM, is often not feasible for large scale data sets
EMP will enable systematic search for novel proteins (think of CRISPRs from AMD)
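To illustrate the statistics based idea above, the sketch below trains a codon-usage table on known coding sequences and uses it to pick the most likely reading frame of a read. It is a deliberately simplified stand-in: the gene finders listed on this slide use richer models, only the forward strand is scored here, and the function names and the `floor` value are assumptions.

```python
import math
from collections import Counter


def split_codons(seq, frame=0):
    """Split a sequence into complete codons starting at `frame`."""
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]


def train_codon_usage(coding_seqs):
    """Relative codon frequencies from in-frame coding sequences."""
    counts = Counter()
    for seq in coding_seqs:
        counts.update(c for c in split_codons(seq) if set(c) <= set("ACGT"))
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()} if total else {}


def frame_score(read, usage, frame, floor=1e-4):
    """Mean log-probability of the codons in one frame; unseen codons get `floor`."""
    codons = split_codons(read, frame)
    if not codons:
        return float("-inf")
    return sum(math.log(usage.get(c, floor)) for c in codons) / len(codons)


def best_frame(read, usage):
    """Forward-strand frame (0, 1 or 2) with the highest codon-usage score."""
    return max(range(3), key=lambda f: frame_score(read, usage, f))


if __name__ == "__main__":
    usage = train_codon_usage(["ATGGCTGCTGCTTAA"])  # toy training set
    print(best_frame("AGCTGCTGCTGCT", usage))
```

The limitation named on the slide shows up directly: a read from a gene with unusual codon usage scores poorly in every frame.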
Performance Analysis on simulated data sets w/ errors
W. Trimble, in preparation
Characterizing features
• Describe sequences by comparison to existing databases
  – GenBank, GO, KEGG, COGs, SEED, STRING, ..
  – Use sequence similarity to define:
  – Function: function string(s), EC number, GO number, …
  – Taxonomic origin
• Algorithms (not exhaustive)
  – BLAST (default, sensitive, too expensive)
  – BLAT (well tested, not parallel, a bit less sensitive)
  – Suffix array based (fast, limited mis-matches)
  – HMM based (HMMer 3.0 is as fast as BLAST)
  – We haven’t tested RAPsearch2
• Similarity search cost is high, repeat searches are required
  – Think of Nikos’ MEP (next talk)
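The suffix array option in the list above can be sketched as follows: sort all suffix start positions of the reference once, then locate a query by binary search. This toy version handles exact matches only and uses a naive construction; it is meant to show why lookups are fast and why mismatches are the hard part, not to represent any specific tool.

```python
def build_suffix_array(text):
    """Naive construction: sort suffix start positions lexicographically."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    """Sorted start positions of exact matches of `pattern` in `text`."""
    m = len(pattern)
    # Lower bound: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    # Upper bound: first suffix whose m-character prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])


if __name__ == "__main__":
    reference = "ACGTACGTTACG"  # toy reference sequence
    sa = build_suffix_array(reference)
    print(find_occurrences(reference, sa, "ACG"))
```

Each query costs two binary searches over the index regardless of how many reads have been searched before, which is where the speed comes from.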
Presentation layer
MG-RAST v3 workflow (simplified)
• Upload: SFF, fastq and fasta data
• Metadata
• QC / normalization: find emPCR and BridgePCR artifacts
• Feature prediction (FGS): find coding regions/peptides using FragGeneScan (Ye, NAR 2010)
• Similarities (Parallel Blat)
• Abundance profiles
• Metabolic reconstruction
• Metabolic model
• Community reconstruction
• Many databases integrated: GSC’s M5nr
The future
Source: Rob Knight, UColorado
Driving force
• “Living on the log scale” (Guy Cochrane, EBI, UK)
• “Data bonanza” (Dawn Field, Oxford UK)
• “Metadata are essential for turning data into knowledge” (Rob Knight, U Colorado, USA)
600 GBp / run
60 GBp / run
Future 1: World is not clonal, study strain/species variation
• “Pangenome view” allows definition of strains
new strain?
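A pangenome view can be sketched with plain set operations: the core genome is what every strain shares, and strain-specific genes are what no other strain has. This is a minimal illustration assuming each strain is already described by a set of gene family identifiers; the names below are hypothetical.

```python
def pangenome_summary(strain_genes):
    """Summarize a mapping of strain name -> set of gene family identifiers."""
    gene_sets = list(strain_genes.values())
    pan = set().union(*gene_sets)
    core = set.intersection(*gene_sets) if gene_sets else set()
    unique = {
        strain: genes - set().union(
            *(g for s, g in strain_genes.items() if s != strain)
        )
        for strain, genes in strain_genes.items()
    }
    return {"pan": pan, "core": core, "accessory": pan - core, "unique": unique}


if __name__ == "__main__":
    strains = {  # toy data with hypothetical gene family names
        "strainA": {"g1", "g2", "g3"},
        "strainB": {"g1", "g2", "g4"},
        "strainC": {"g1", "g5"},
    }
    summary = pangenome_summary(strains)
    print(sorted(summary["core"]), summary["unique"])
```

A sample carrying many genes outside the known pangenome is one possible signal for the "new strain?" question.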
Future 2: Expand metadata (1): MIMS/MIMARKS
• Genomic Standards Consortium (GSC) provides
  • Extensible metadata standards
  • Environmental packages allow domain specific extension
• Groups starting to build environmental packages
• MG-RAST v3 supports GSC metadata standards
Use metadata
Select data sets to compare to based on:
- Biome, location, sampling procedure, …
Capture metadata
Extensive metadata questionnaire supporting offline editors // input
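Selecting data sets by metadata, as described above, amounts to filtering sample records on key/value pairs. The sketch below assumes each sample is a dictionary with a `metadata` sub-dictionary; the record layout and the field names (`biome`, `location`) are hypothetical and are not the official MIMS/MIMARKS field names.

```python
def select_samples(samples, **criteria):
    """Return samples whose metadata match every key=value criterion."""
    return [
        s for s in samples
        if all(s.get("metadata", {}).get(k) == v for k, v in criteria.items())
    ]


if __name__ == "__main__":
    samples = [  # toy records with made-up identifiers
        {"id": "s1", "metadata": {"biome": "soil", "location": "Illinois"}},
        {"id": "s2", "metadata": {"biome": "marine", "location": "Pacific"}},
        {"id": "s3", "metadata": {"biome": "soil", "location": "Colorado"}},
        {"id": "s4"},  # sample uploaded without metadata
    ]
    print([s["id"] for s in select_samples(samples, biome="soil")])
```

Samples without metadata can never be selected this way, which is the practical argument for capturing metadata at upload time.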
Expand metadata support (2): Capture metadata early
Imagine adding metadata to the plot below:
Very hard after the fact!
capture metadata early
Aanensen et al., PLoS ONE, 2009
Many current challenges and pitfalls
• Assembly (state of the art: hard)
  – Several groups are working actively on metagenome assemblers
  – Quotes from Mihai Pop (UMaryland)
    • “metagenomes can’t be assembled” and “all assemblers are equal”
• Rare k-mer filtering (state of the art: DO NOT)
  – C. Titus Brown (MSU): “Friends don’t let friends filter rare k-mers”
• Binning (state of the art: use k-mers)
  – Traditional binning does not work for short reads (Alice C. McHardy)
  – K-mer based binning can produce organism sized bins (Titus’ work)
• Sequence quality
  – Quality really matters and vendors lie all the time
• Metadata
  – challenge for the next few years is to add metadata
• Cloud computing
  – does not change the cost structure
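The k-mer based binning mentioned in the list above rests on turning each sequence into a k-mer frequency vector and grouping sequences with similar vectors. The sketch below shows only that first step plus a distance measure; it is a generic illustration with k=4 as an assumed default, not the method of any group named on this slide, and it ignores reverse complements.

```python
import math
from collections import Counter
from itertools import product


def kmer_profile(seq, k=4):
    """Relative frequency of every A/C/G/T k-mer, in a fixed lexicographic order."""
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # k-mers containing ambiguous bases such as N are left out of the total.
    total = sum(counts[km] for km in kmers)
    return [counts[km] / total if total else 0.0 for km in kmers]


def euclidean(p, q):
    """Euclidean distance between two equal-length profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


if __name__ == "__main__":
    a = kmer_profile("ACGTACGTACGTACGT")
    b = kmer_profile("ACGTACGAACGTACGT")
    c = kmer_profile("GGGGCCCCGGGGCCCC")
    print(euclidean(a, b) < euclidean(a, c))
```

Short reads yield very sparse profiles, which is consistent with the slide's remark that traditional binning struggles on them.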
Metagenome transport format (MTF)
• Input sequences (“from the machine”)
  ▫ FASTA, FASTQ, SFF (maybe Archive BAM)
• Transformed sequences (“after QC”)
  ▫ FASTA
• Feature coordinates (“after genefinding”)
  ▫ GFF3/GTF
• Similarities (“the BIG computation”)
  ▫ Blast/BLAT/.. results
• Metadata (context, “the important stuff”)
  ▫ GSC compliant MIMS format
• Workflow description (“provenance”)
  ▫ What did we do? (not in shell script!)
  ▫ What version of code // databases did we use
  ▫ Who computed where
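The components listed above could be tied together by a small machine-readable manifest. The sketch below is one possible shape, written as Python dataclasses serialized to JSON; every class name, field name, file name and version string is hypothetical, since the slide names the parts of MTF but does not define a schema.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Provenance:
    steps: list              # what was done, in order
    software_versions: dict  # tool name -> version
    database_versions: dict  # database name -> version
    computed_by: str         # who ran the computation
    computed_at: str         # where it ran


@dataclass
class MTFBundle:
    input_sequences: list        # FASTA / FASTQ / SFF files from the machine
    transformed_sequences: list  # FASTA files after QC
    feature_coordinates: list    # GFF3/GTF files after gene finding
    similarities: list           # BLAST/BLAT result files
    metadata: str                # GSC compliant metadata file
    provenance: Provenance

    def to_json(self):
        """Serialize the manifest, including nested provenance, to JSON."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)


if __name__ == "__main__":
    bundle = MTFBundle(
        input_sequences=["run1.fastq"],
        transformed_sequences=["run1.qc.fasta"],
        feature_coordinates=["run1.genes.gff3"],
        similarities=["run1.blat.tab"],
        metadata="run1.mims.txt",
        provenance=Provenance(
            steps=["qc", "gene_calling", "similarity_search"],
            software_versions={"example_tool": "0.0"},
            database_versions={"example_db": "0.0"},
            computed_by="example user",
            computed_at="example cluster",
        ),
    )
    print(bundle.to_json())
```

Keeping provenance as structured data rather than a shell script makes it possible to decide later whether stored similarity results can be reused.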
Acknowledgements
MG-RAST team
• Daniela Bartels
• Narayan Desai
• Mark d’Souza
• Elizabeth M. Glass
• Travis Harrison
• Kevin Keegan
• Tobias Paczian
• William Trimble
• Andreas Wilke
• Jared Wilkening
Metadata:
• Dawn Field, Oxford
• Renzo Kottmann, MPI Bremen
• and all of GSC
M5/QC collaboration with
• Nikos Kyrpides, JGI
• Kostas Konstantinidis, JGI
M5 standards
• Sarah Hunter, EBI
CLOVR
• Sam Angiuoli, Owen White (HMP DACC)
QIIME
• Rob Knight (Colorado)
INSDC submission/archiving
• Guy Cochrane/EBI
Thank you for your attention