Untangling Metagenomes with ggKbase
Uploaded by andrea-singh
Introduction & Background
• Online software system for the organization and analysis of metagenomic data
• Conceived and started about 6 years ago
• Arose out of the need to manage and organize the vast variety and massive volume of data associated with metagenomics
Specific Challenges
• Establishing (lab internal?) standards for the organization of metagenomic data
• Optimization (speed of data import and access, etc.)
• Adapting to a rapidly changing scientific landscape
• Developing user-friendly tools for data presentation & visualization
Facts & Figures
• 30,000+ lines of code
• 100 projects
• 3000+ organisms
• 1.6 million contigs
• 12 million features / genes
• 17 million annotations
• 22,503,876,226 bp ~= 22.5 Gbp of sequence
• Database size: ~50 GB
• … and growing
Software Stack
• Ruby on Rails
• Server: Apache with Passenger
• Database: MySQL (moving to Postgres)
• Full text search: Sphinx
• Background jobs: Redis with Sidekiq
• Data visualization: d3.js (Data-Driven Documents)
• ~73 gems
What is Metagenomics?
[Figure: cultured vs. not cultured organisms; labels: Methanocaldococcus jannaschii DSM 2661, "The world of metagenomics"]
• Who's there?
• Who is doing what?
What ggkbase can do (… and what it can't)
• Starts with a set of assembled contigs
• Gene prediction, annotation & taxonomic classification
• Binning
• Data storage and organization
• Metabolic analysis
• Post publication / NCBI submission
Outside of ggkbase: Sequencing & Assembly
[Figure: Sequencing → raw reads → Assembly → ?]
Assembly considerations
• Consider length cutoff (500, 1000, 2000 bp)
• Make sure your contigs are reasonably named
• Every contig needs a “coverage” value (relative abundance within sample)
– coverage = (read count * read length) / sequence size
– Determine read count by re-mapping the raw reads to the assembled contigs (e.g. using bowtie)
– Read count value needs to be encoded in the header
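The coverage formula above can be sketched in a few lines of Python. This is only an illustration: the function name and the example numbers are hypothetical, and the slides do not specify how the read count is encoded in the contig header, so that step is left out.

```python
def contig_coverage(read_count: int, read_length: int, contig_length: int) -> float:
    """Coverage = (read count * read length) / sequence size.

    read_count    -- reads that re-mapped to this contig (e.g. from bowtie)
    read_length   -- length of one read in bp
    contig_length -- length of the assembled contig in bp
    """
    if contig_length <= 0:
        raise ValueError("contig_length must be positive")
    return (read_count * read_length) / contig_length


# Hypothetical example: 1000 reads of 150 bp mapped to a 6000 bp contig.
example_coverage = contig_coverage(1000, 150, 6000)
print(example_coverage)
```

With these made-up numbers the contig is covered 25 times on average.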
Good names:
• gwd2_scaffold_1121
• Rifle_16ft_4_min_1402
• cnbg_combo_scaffold_933
Bad names:
• scaffold_9876
• I1.NODE_10_length_6414_cov_385.631287
ggkbase pre-processing
• Gene prediction (prodigal)
• Similarity searches (usearch)
– KEGG (manually curated metabolic pathways)
– UniRef100 (also manually curated, phylogenetic details)
– Forward + reverse search => reciprocal best hit signals especially good annotation
• Motif searches (iprscan, on request only because of long runtime)
• RNA searches (tRNA, rRNA, 16S, …)
[Figure: gene prediction (ORFs) → annotation against KEGG, UniRef and the candidate phyla db]
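A reciprocal best hit (RBH) can be derived from the forward and reverse search results roughly as follows. This is a sketch, not the ggkbase pipeline: it assumes the `.b6` files use the standard 12-column BLAST tabular layout (query in column 1, target in column 2, bit score in column 12) and that "best" means highest bit score. The function names are hypothetical.

```python
def best_hits(b6_lines):
    """Map each query to its highest-scoring target (by bit score)."""
    best = {}
    for line in b6_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        query, target, bitscore = fields[0], fields[1], float(fields[11])
        if query not in best or bitscore > best[query][1]:
            best[query] = (target, bitscore)
    return {query: target for query, (target, _) in best.items()}


def reciprocal_best_hits(forward_lines, reverse_lines):
    """Return (gene, reference) pairs that are each other's best hit.

    forward_lines -- genes searched against the reference database
    reverse_lines -- reference sequences searched against the genes
    """
    forward = best_hits(forward_lines)
    reverse = best_hits(reverse_lines)
    return {(gene, ref) for gene, ref in forward.items() if reverse.get(ref) == gene}
```

A gene whose best reference hit points back to a different gene is still annotated by the forward search alone; only the pairs returned here carry the stronger RBH signal.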
Data Import
FILE TYPES:
raw contigs (*.fa)
Prodigal gene predictions (*.genes)
Gene DNA sequences (*.genes.fna)
Gene protein sequences (*.genes.faa)
KEGG annotations (*.faa-vs-kegg.b6)
UniRef annotations (*.faa-vs-uni.b6)
RBH results against KEGG
RBH results against UniRef
tRNA, 16S
[Figure: import into a new project XYZ. Contigs (xyz_scaffold_1, xyz_scaffold_2, xyz_scaffold_3) and their genes (xyz_scaffold_1_1 to xyz_scaffold_1_4, each with Taxonomy, KEGG and UniRef fields) are placed in xyz_UNK, the starter “unknown” bin]
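The gene names in the import example follow the Prodigal convention of appending a running number to the contig name (xyz_scaffold_1_3 is the third gene on xyz_scaffold_1). A minimal sketch of recovering the contig from a gene name, assuming that convention holds for every gene (the helper names are hypothetical):

```python
from collections import defaultdict


def contig_of(gene_name: str) -> str:
    """Strip the trailing _<number> that marks the gene's position on its contig."""
    contig, sep, index = gene_name.rpartition("_")
    if not sep or not contig or not index.isdigit():
        raise ValueError(f"not a <contig>_<number> gene name: {gene_name!r}")
    return contig


def group_genes_by_contig(gene_names):
    """Return {contig name: [gene names in input order]}."""
    groups = defaultdict(list)
    for gene in gene_names:
        groups[contig_of(gene)].append(gene)
    return dict(groups)


genes = ["xyz_scaffold_1_1", "xyz_scaffold_1_2", "xyz_scaffold_1_3", "xyz_scaffold_1_4"]
print(group_genes_by_contig(genes))
```

This is one reason reasonably named contigs matter: the parent contig of every gene can be read straight off its name.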
ggkbase DB Schema
Binning: Who’s there?
• What characteristics of the contig sequence can be used for binning?
– Sequence composition (GC% content, tri-, tetra-): ✔ (GC%)
– Coverage: ✔
– Abundance patterns (across time and/or space): ✔ (other binning, e.g. abawaca)
– Phylogeny: ✔ (UniRef100-based taxonomy)
– Ideally a combination of the above: ✔ (Binning tools!)
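Two of the sequence-composition signals named above, GC% and tetranucleotide frequencies, are simple to compute. The sketch below is illustrative only; it ignores ambiguous bases and does not merge reverse-complement k-mers, which dedicated binning tools often do.

```python
from collections import Counter


def gc_percent(seq: str) -> float:
    """GC content as a percentage of unambiguous (A/C/G/T) bases."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / acgt


def kmer_frequencies(seq: str, k: int = 4) -> dict:
    """Relative frequency of every k-mer made only of A/C/G/T (k=4: tetranucleotides)."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()} if total else {}
```

Contigs from the same genome tend to have similar values for both measures, which is what makes them usable as binning dimensions.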
UniRef & Phylogeny
We use UniRef hits of individual predicted genes to extrapolate to the overall contig phylogeny:
[Figure: predicted ORFs, each marked by UniRef hit (yes/no)]
>UniRef100_UPI0002D3AEE6 hypothetical protein n=1 Tax=Zavarzinella formosa RepID=UPI0002D3AEE6
• species: Zavarzinella formosa
• genus: Zavarzinella
• order: Planctomycetales
• class: Planctomycetia
• phylum: Planctomycetes
• domain: Bacteria
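The taxon name used for this lookup sits in the Tax= field of the UniRef header. A sketch of pulling it out with a regular expression, assuming the header layout shown on the slide (an optional TaxID= field, as found in current UniRef releases, is tolerated). Resolving the name to a full lineage needs a taxonomy database and is not shown.

```python
import re

UNIREF_HEADER = re.compile(
    r"^>?(?P<cluster_id>\S+)\s+(?P<description>.*?)\s+n=(?P<n>\d+)"
    r"\s+Tax=(?P<tax>.*?)(?:\s+TaxID=(?P<taxid>\d+))?\s+RepID=(?P<rep_id>\S+)\s*$"
)


def parse_uniref_header(header: str) -> dict:
    """Split a UniRef FASTA header into cluster id, description, size and taxon."""
    match = UNIREF_HEADER.match(header.strip())
    if match is None:
        raise ValueError(f"unrecognized UniRef header: {header!r}")
    fields = match.groupdict()
    fields["n"] = int(fields["n"])  # number of sequences in the cluster
    return fields


header = (">UniRef100_UPI0002D3AEE6 hypothetical protein n=1 "
          "Tax=Zavarzinella formosa RepID=UPI0002D3AEE6")
print(parse_uniref_header(header)["tax"])
```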
UniRef & Phylogeny
[Figure: predicted ORFs, each marked by UniRef hit (yes/no)]
>UniRef100_C5ER77 Uncharacterized protein n=3 Tax=Clostridiales RepID=C5ER77_9FIRM
• species: unknown
• genus: unknown
• order: Clostridiales
• class: Clostridia
• phylum: Firmicutes
• domain: Bacteria
• ORFs with a UniRef hit don’t always return results all the way down to the species level
• UniRef100 is better than UniRef90 because of smaller cluster sizes
UniRef & Phylogeny
[Figure: predicted ORFs, each marked by UniRef hit (yes/no)]
>ACD13_124_3 30S ribosomal protein S1 n=1 Tax=ACD13 ……. (hit to one of our candidate phyla)
• species: unknown
• genus: unknown
• order: unknown
• class: unknown
• phylum: OP11
• domain: Bacteria
We added a “candidate phyla” database to UniRef100 with 93 newly discovered phyla, e.g. OD1s, OP11s, Melainabacteria, WS6, etc.
Contig-level winner
Gene | Domain | Phylum | Class | Order | Genus | Species
xyz_scaffold_1_1 | Bacteria | Proteobacteria | Betaproteobacteria | Burkholderiales | Janthinobacterium | Janthinobacterium sp. HH01
xyz_scaffold_1_2 | Bacteria | Proteobacteria | Betaproteobacteria | Burkholderiales | Burkholderiaceae | unknown
xyz_scaffold_1_3 | Bacteria | Proteobacteria | Alphaproteobacteria | Neisseriales | Chromobacterium | Chromobacterium violaceum
xyz_scaffold_1_4 | Bacteria | Proteobacteria | Betaproteobacteria | Rhodocyclales | Azospira | Azospira oryzae
xyz_scaffold_1_5 | Bacteria | Proteobacteria | Betaproteobacteria | Rhodospirillales | unknown | unknown
xyz_scaffold_1_6 | unknown | unknown | unknown | unknown | unknown | unknown
Specificity
Definition: the contig-level phylogeny is reported at the most specific taxonomic level at which a single taxon holds a majority of >= 60% of the contig's gene-level assignments
Winner found:
• species: unknown
• genus: unknown
• order: unknown
• class: Betaproteobacteria
• phylum: Proteobacteria
• domain: Bacteria
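The winner rule can be sketched as a walk from domain down to species that stops at the first level where no taxon reaches the threshold. This is an illustration, not the ggkbase implementation. In particular it assumes that genes without an assignment ("unknown") count in the denominator, which the slides do not state, and it uses the gene-level rows from the contig-level winner table as input.

```python
from collections import Counter

LEVELS = ["domain", "phylum", "class", "order", "genus", "species"]


def contig_winner(gene_taxonomies, threshold=0.6):
    """Return {level: taxon}, with 'unknown' below the most specific winning level."""
    winner = {level: "unknown" for level in LEVELS}
    total = len(gene_taxonomies)
    if total == 0:
        return winner
    for level in LEVELS:
        counts = Counter(t[level] for t in gene_taxonomies if t[level] != "unknown")
        if not counts:
            break
        taxon, n = counts.most_common(1)[0]
        if n / total < threshold:  # assumption: unknown genes stay in the denominator
            break
        winner[level] = taxon
    return winner


rows = [
    ("Bacteria", "Proteobacteria", "Betaproteobacteria", "Burkholderiales", "Janthinobacterium", "Janthinobacterium sp. HH01"),
    ("Bacteria", "Proteobacteria", "Betaproteobacteria", "Burkholderiales", "Burkholderiaceae", "unknown"),
    ("Bacteria", "Proteobacteria", "Alphaproteobacteria", "Neisseriales", "Chromobacterium", "Chromobacterium violaceum"),
    ("Bacteria", "Proteobacteria", "Betaproteobacteria", "Rhodocyclales", "Azospira", "Azospira oryzae"),
    ("Bacteria", "Proteobacteria", "Betaproteobacteria", "Rhodospirillales", "unknown", "unknown"),
    ("unknown",) * 6,
]
genes = [dict(zip(LEVELS, row)) for row in rows]
print(contig_winner(genes))
```

For these six genes, Betaproteobacteria holds 4 of 6 class-level assignments (about 67%), while the best order, Burkholderiales, holds only 2 of 6, so the walk stops at class, in line with the "Winner found" slide.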
Binning confirmation: Metrics for genome completeness
Single Copy Genes (SCG)
• Based on similarity searches against a reference database
• Itai’s scg.pl script
• 51 indicates full set
• Most of the ribosomal proteins
• Histidyl tRNA synthetase
• Phenylalanyl tRNA synthetase alpha
• Preprotein translocase subunit SecY
• Valyl tRNA synthetase
• gyrA
• leucyl tRNA synthetase
• recA
• aspartyl tRNA synthetase
• arginyl tRNA synthetase
• alanyl tRNA synthetase
Ribosomal Proteins (RP)
• Based on keyword-based annotation searches
• 55 indicates “full set”
• L1-L35
• S1-S24
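An inventory like this can be turned into a rough completeness figure by counting how many distinct marker genes a bin contains and which ones appear more than once. The sketch below is not Itai's scg.pl script; the function name and the interpretation of duplicates as a hint of a mixed bin are assumptions, and the default of 51 is the full SCG set size given above.

```python
from collections import Counter


def marker_summary(marker_hits, expected_total=51):
    """Summarize marker genes found in one bin.

    marker_hits    -- marker gene names, one entry per gene found in the bin
    expected_total -- size of the full set (51 for SCGs, 55 for RPs per the slides)
    """
    counts = Counter(marker_hits)
    distinct = len(counts)
    return {
        "distinct": distinct,
        "completeness": distinct / expected_total,
        # Markers seen more than once may indicate more than one genome in the bin.
        "duplicated": sorted(name for name, n in counts.items() if n > 1),
    }


print(marker_summary(["recA", "gyrA", "recA", "SecY"]))
```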
name | GC% | length | RPs | SCGs | Cov. | species | genus | order | class | phylum | domain
xyz_scaffold_760 | 64.42 | 19380 | rp S2 | rp S2 | 370.68 | Thiobacillus denitrificans | Thiobacillus | Hydrogenophilales | Betaproteobacteria | Proteobacteria | Bacteria
xyz_scaffold_772 | 62.05 | 22706 | rp L9, rp S18, rp S6 | rp L9, rp S18, rp S6 | 318.43 | Hydrogenophilales | Betaproteobacteria | Proteobacteria | Bacteria | Proteobacteria | Bacteria
xyz_scaffold_854 | 62.22 | 22767 | | | 301.99 | Thiobacillus denitrificans | Thiobacillus | Hydrogenophilales | Betaproteobacteria | Proteobacteria | Bacteria
xyz_scaffold_988 | 59.5 | 14284 | | | 299.03 | Thiobacillus denitrificans | Thiobacillus | Hydrogenophilales | Betaproteobacteria | Proteobacteria | Bacteria
xyz_scaffold_1009 | 63.81 | 17471 | | | 311.86 | unknown | unknown | unknown | Betaproteobacteria | Proteobacteria | Bacteria
xyz_scaffold_1082 | 63.76 | 12832 | | | 379.09 | Thiobacillus denitrificans | Thiobacillus | Hydrogenophilales | Betaproteobacteria | Proteobacteria | Bacteria
xyz_scaffold_1092 | 63.01 | 12754 | | | 336.33 | unknown | unknown | unknown | unknown | Proteobacteria | Bacteria
xyz_scaffold_1107 | 60.2 | 13014 | | | 294.1 | Thiobacillus denitrificans | Thiobacillus | Hydrogenophilales | Betaproteobacteria | Proteobacteria | Bacteria
xyz_scaffold_1136 | 62.63 | 12051 | | | 341.09 | Thiobacillus denitrificans | Thiobacillus | Hydrogenophilales | Betaproteobacteria | Proteobacteria | Bacteria
xyz_scaffold_1150 | 61.53 | 13432 | | | 325.86 | Thiobacillus denitrificans | Thiobacillus | Hydrogenophilales | Betaproteobacteria | Proteobacteria | Bacteria
Contig data → Binning tools
Taxonomic levels: Domain, Phylum, Class, Order, Genus, Species
The binning tool page at a glance
Phylogeny wheel
GC Content
Coverage
Ribosomal protein inventory
Single Copy Gene inventory
Aggregates & rebinning form
Interactive binning tool overview
• individual data attributes (dimensions) are represented as separate graphs
• filtering by one attribute, e.g. GC% content, creates a subset of the data matching the selected range of GC%
• the filter updates all other graphs simultaneously and in real time to reflect the selection
• in the background the data is quickly re-grouped and re-aggregated to make this possible
• filters are incremental: one filter can be applied on top of another filter
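The filter-then-re-aggregate behaviour described above can be illustrated with a small Python sketch. It is a conceptual model only: the binning page itself runs in the browser, and the field names and contig values below are hypothetical.

```python
from collections import Counter


def apply_filters(contigs, filters):
    """Keep contigs whose value lies inside every active (low, high) range.

    filters -- {attribute name: (low, high)}, ranges are inclusive
    """
    return [
        contig for contig in contigs
        if all(low <= contig[attr] <= high for attr, (low, high) in filters.items())
    ]


def aggregate(contigs, attribute):
    """Re-group the current selection by one attribute (one graph on the page)."""
    return Counter(contig[attribute] for contig in contigs)


contigs = [
    {"name": "c1", "gc": 64.4, "coverage": 370.7, "phylum": "Proteobacteria"},
    {"name": "c2", "gc": 62.0, "coverage": 318.4, "phylum": "Proteobacteria"},
    {"name": "c3", "gc": 35.1, "coverage": 40.2, "phylum": "Firmicutes"},
    {"name": "c4", "gc": 61.5, "coverage": 38.9, "phylum": "unknown"},
]

filters = {"gc": (60.0, 66.0)}           # first filter: GC% range
after_gc = apply_filters(contigs, filters)
filters["coverage"] = (290.0, 380.0)     # second filter applied on top of the first
after_both = apply_filters(contigs, filters)
print(aggregate(after_gc, "phylum"), aggregate(after_both, "phylum"))
```

Each added filter narrows the same selection, and every graph is redrawn from a fresh aggregation of whatever remains.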
Common binning flow: start with phylogeny
1. Initial state (78,504 contigs): 90% Bacteria, 7% Archaea (click here)
2. All Bacteria (62,364 contigs): 40% Firmicutes, 15% Proteobacteria
3. All Firmicutes (11,606 contigs): 76% Clostridia, 23% Bacilli
4. Genus: Clostridium (587 contigs): 50% Clostridium perfringens, 48.9% Clostridium thermocellum
Let’s look at the other graphs
Binning success!
There are clearly 2 genomes: Clostridium perfringens and Clostridium thermocellum
Separation of 2 closely related strains
2 Desulfo strains; RP and SCG confirm that two genomes are present
Limitations
[Figure: human gut sample vs. Rifle background sediment]
Lots of unknown!
Abundance pattern-based binning
Single abawaca bin
99% Bacteroidetes
[Graphs: Cov, GC]
Good RP & SCG inventories
Iterative Binning
• Binning tools can be used in the context of bins (not only the UNK bin)
• Individual contigs can be re-binned as needed
• Contigs can be moved back to the original UNK bin
• ggkbase supports import of already “binned” organisms
• you can download “found” bins, curate them outside of ggkbase and re-import them as a “finalized” bin
• rough bins and curated bins can be linked
Data Export
• Project:
– Contigs
– DNA
– Proteins
• Organisms / bins:
– Contigs
– DNA
– Proteins
– Tab delimited
– Genbank
Export file status:
• Files outdated
• Files are being regenerated
• Files available for download
Beyond Binning
Binning projects
• Contain a (single) metagenome (UNK) bin
• All bins (curated and found) are sourced from the same metagenome
• Cannot contain bins from other sources
Analysis Projects
• Can contain bins from many different sources, e.g. binning projects or outside reference organisms
• Don’t contain metagenomes (unbinned repositories of contigs)
Metagenome Analysis: What are they doing?
• Primary tool: “lists” based on keyword searches against annotations
• Lists are collections of features / genes that match a set of search terms
• Search terms often try to capture entire metabolic pathways, e.g. Glycolysis
• Lists are dynamic (change on data updates, e.g. new set of annotations, user-created notes)
• List contents can be visualized in the “Genome Summary” page
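One way to picture a dynamic list is as a stored set of search terms that is re-evaluated against the current annotations each time it is viewed. The sketch below is a conceptual model in Python, not the ggkbase code (which uses Sphinx full text search); the class name, matching by case-insensitive substring, and the example annotations are all assumptions.

```python
class KeywordList:
    """A named set of search terms; membership is computed on demand."""

    def __init__(self, name, terms):
        self.name = name
        self.terms = [term.lower() for term in terms]

    def members(self, annotations):
        """Return features with at least one annotation matching any term.

        annotations -- {feature name: [annotation or note text, ...]}
        """
        return sorted(
            feature for feature, texts in annotations.items()
            if any(term in text.lower() for term in self.terms for text in texts)
        )


glycolysis = KeywordList("Glycolysis", ["hexokinase", "pyruvate kinase"])
annotations = {
    "xyz_scaffold_1_1": ["Hexokinase n=1 Tax=..."],
    "xyz_scaffold_1_2": ["30S ribosomal protein S1"],
}
before = glycolysis.members(annotations)

# A user-created note on an existing feature changes the list contents.
annotations["xyz_scaffold_1_2"].append("user note: likely pyruvate kinase")
after = glycolysis.members(annotations)
print(before, after)
```

Because nothing but the terms is stored, a new round of annotations or a user note changes the list without anyone editing it.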
Search and list creation
Keyword suggestions
Lists are scoped to projects
Search results
Detour: Name Search
• Search for features, contigs and/or organisms by name or alias.
• We don’t rename contigs & features any longer, i.e. original names are kept if possible.
• No renaming on binning events.
• Main search box is for annotations, notes and systematic names (DB x-refs) with the purpose of list building.
• You won’t find features by name using that search box
Genome Summary for functional prediction
[Figure: Genome Summary heatmap with bins and lists as its two axes; a block of ribosomal proteins is highlighted]
• Looking down at columns can reveal which organisms partake in a metabolic pathway
• Heatmap: light vs. dark colors indicate number of genes
• Genome Summaries can be saved, downloaded (as .svg) and used in publications
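The numbers behind such a heatmap amount to a count of genes per bin and list. A minimal sketch, assuming each gene belongs to exactly one bin and each list is a collection of gene names (the function name and example data are hypothetical):

```python
from collections import defaultdict


def genome_summary(gene_to_bin, list_members):
    """Count, for every bin and list, how many of the bin's genes are in the list.

    gene_to_bin  -- {gene name: bin name}
    list_members -- {list name: iterable of gene names}
    Returns {bin name: {list name: gene count}} with zero counts filled in.
    """
    summary = defaultdict(lambda: {name: 0 for name in list_members})
    for bin_name in set(gene_to_bin.values()):
        summary[bin_name]  # make sure bins without any hits still get a row
    for list_name, genes in list_members.items():
        for gene in genes:
            if gene in gene_to_bin:
                summary[gene_to_bin[gene]][list_name] += 1
    return dict(summary)


gene_to_bin = {"g1": "bin_A", "g2": "bin_A", "g3": "bin_B", "g4": "bin_B"}
list_members = {"Glycolysis": ["g1", "g2", "g3"], "Ribosomal proteins": ["g4"]}
print(genome_summary(gene_to_bin, list_members))
```

Each count maps to one heatmap cell, with a darker or lighter color standing for more or fewer genes.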
Future Development
1. Universal lists
2. Automated binning
3. API (read-only version very close to completion)
4. Data integration (incorporating and displaying other *omic data)
5. Your ideas and suggestions here ….