Untangling Metagenomes with ggKbase


Introduction & Background

• Online software system for the organization and analysis of metagenomic data

• Conceived and started about 6 years ago
• Arose out of the need to manage and organize the vast variety and massive volume of data associated with metagenomics

Specific Challenges

• Establishing (lab internal?) standards for the organization of metagenomic data
• Optimization (speed of data import and access, etc.)
• Adapting to a rapidly changing scientific landscape
• Developing user-friendly tools for data presentation & visualization

Facts & Figures

• 30,000+ lines of code
• 100 projects
• 3000+ organisms
• 1.6 Million contigs
• 12 Million features / genes
• 17 Million annotations
• 22,503,876,226 bp ~= 22.5 Gbp of sequence
• Database size: ~50 GB
• … and growing

Software Stack

• Ruby on Rails

• Server: Apache with Passenger

• Database: MySQL (moving to Postgres)

• Full text search: Sphinx

• Background jobs: Redis with Sidekiq

• Data visualization: d3.js (Data-Driven-Documents)

• ~73 gems

What is Metagenomics?

Cultured

Not cultured

Who’s there?

Methanocaldococcus jannaschii DSM 2661

Who is doing what?

The world of metagenomics

What can ggkbase do? (… and what it can’t)

• Starts with a set of assembled contigs
• Gene prediction, annotation & taxonomic classification
• Binning
• Data storage and organization
• Metabolic analysis
• Post publication / NCBI submission

Outside of ggkbase: Sequencing & Assembly

Sequencing

raw reads

Assembly

?

Assembly considerations

• Consider length cutoff (500, 1000, 2000)
• Make sure your contigs are reasonably named
• Every contig needs a “coverage” value (relative abundance within sample)
– (read count * read length) / sequence size
– Determine read count by re-mapping the raw reads to the assembled contigs (e.g. using bowtie)
– Read count value needs to be encoded in the header
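The coverage formula above can be sketched in a few lines of Python. This is only an illustration: the slides do not say how the read count is encoded in the header, so the `read_count_<N>` token, the regular expression and both function names are assumptions made for this example.

```python
import re

# Hypothetical header convention: ">contig_name read_count_<N>".
# The slides only say the read count must be in the header, not how.
HEADER_RE = re.compile(r"^>?(?P<name>\S+).*?\bread_count_(?P<count>\d+)")


def parse_header(header: str) -> tuple[str, int]:
    """Return (contig name, read count) from a FASTA header line."""
    match = HEADER_RE.match(header.strip())
    if match is None:
        raise ValueError(f"no read count found in header: {header!r}")
    return match.group("name"), int(match.group("count"))


def coverage(read_count: int, read_length: int, contig_length: int) -> float:
    """Coverage = (read count * read length) / sequence size."""
    if contig_length <= 0:
        raise ValueError("contig length must be positive")
    return read_count * read_length / contig_length


name, reads = parse_header(">gwd2_scaffold_1121 read_count_4000")
print(name, coverage(reads, read_length=150, contig_length=20_000))
```

With 4,000 reads of 150 bp mapped to a 20,000 bp contig, the coverage comes out as 30.0.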

Good names:
gwd2_scaffold_1121
Rifle_16ft_4_min_1402
cnbg_combo_scaffold_933

Bad names:
scaffold_9876
I1.NODE_10_length_6414_cov_385.631287

ggkbase pre-processing

• Gene prediction (prodigal)
• Similarity searches (usearch)
– KEGG (manually curated metabolic pathways)
– UniRef100 (also manually curated, phylogenetic details)
– Forward + reverse search => reciprocal best hit signals especially good annotation
• Motif searches (iprscan, on request only because of long runtime)
• RNA searches (tRNA, rRNA, 16S, …)

gene prediction (ORFs)

KEGG

UniRef

annotation

candidate phyla db

Data Import

FILE TYPES:

raw contigs (*.fa)

Prodigal gene predictions (*.genes)

Gene DNA sequences (*.genes.fna)

Gene protein sequences (*.genes.faa)

KEGG annotations (*.faa-vs-kegg.b6)

UniRef annotations (*.faa-vs-uni.b6)

RBH results against KEGG

RBH results against UniRef

tRNA, 16S

xyz_scaffold_1 xyz_scaffold_2 xyz_scaffold_3

xyz_scaffold_1_1

xyz_scaffold_1_2

xyz_scaffold_1_3

xyz_scaffold_1_4

XYZ (new project)

xyz_UNK (starter “unknown” bin)

Taxonomy: __KEGG: __UniRef: __

ggkbase DB Schema

Binning: Who’s there?

• What characteristics of the contig sequence can be used for binning?
– Sequence composition (GC% content, tri-, tetra-) ✔ (GC%)
– Coverage ✔
– Abundance patterns (across time and/or space) ✔ (other binning, e.g. abawaca)
– Phylogeny ✔ (UniRef100-based taxonomy)
– Ideally a combination of the above ✔ (Binning tools!)
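As an illustration of the sequence composition signals listed above, here is a minimal Python sketch that computes GC% and k-mer (e.g. tetranucleotide) frequencies for a contig. The function names are made up for this example, and the sketch deliberately skips refinements such as merging reverse-complement k-mers; it is not how ggkbase computes these values.

```python
from collections import Counter

BASES = set("ACGT")


def gc_percent(seq: str) -> float:
    """GC content in percent, ignoring ambiguous bases such as N."""
    counts = Counter(seq.upper())
    acgt = sum(counts[b] for b in BASES)
    if acgt == 0:
        return 0.0
    return 100.0 * (counts["G"] + counts["C"]) / acgt


def kmer_frequencies(seq: str, k: int = 4) -> dict[str, float]:
    """Relative frequency of every k-mer made up of A, C, G and T only."""
    seq = seq.upper()
    windows = (seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = Counter(w for w in windows if set(w) <= BASES)
    total = sum(kmers.values())
    return {kmer: n / total for kmer, n in kmers.items()} if total else {}


print(gc_percent("ATGCGGCC"))
print(kmer_frequencies("ACGTACGT", k=4))
```

Contigs from the same genome tend to have similar values for both measures, which is why they are useful axes for binning.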

UniRef & Phylogeny

We use UniRef hits of individual predicted genes to extrapolate to the overall contig phylogeny:

predicted ORFs

UniRef hit (yes/no)

>UniRef100_UPI0002D3AEE6 hypothetical protein n=1 Tax=Zavarzinella formosa RepID=UPI0002D3AEE6

species: Zavarzinella formosa
genus: Zavarzinella
order: Planctomycetales
class: Planctomycetia
phylum: Planctomycetes
domain: Bacteria

UniRef & Phylogeny

predicted ORFs

UniRef hit (yes/no)

>UniRef100_C5ER77 Uncharacterized protein n=3 Tax=Clostridiales RepID=C5ER77_9FIRM

species: unknown
genus: unknown
order: Clostridiales
class: Clostridia
phylum: Firmicutes
domain: Bacteria

• ORFs with a UniRef hit don’t always return results all the way down to the species level
• UniRef100 is better than UniRef90 because of smaller cluster sizes

UniRef & Phylogeny

predicted ORFs

UniRef hit (yes/no)

>ACD13_124_3 30S ribosomal protein S1 n=1 Tax=ACD13 ……. (hit to one of our candidate phyla)

species: unknown
genus: unknown
order: unknown
class: unknown
phylum: OP11
domain: Bacteria

We added a “candidate phyla” database to UniRef100 with 93 newly discovered phyla, e.g. OD1s, OP11s, Melainabacteria, WS6, etc.

Contig-level winner

Gene | Domain | Phylum | Class | Order | Genus | Species
xyz_scaffold_1_1 | Bacteria | Proteobacteria | Betaproteobacteria | Burkholderiales | Janthinobacterium | Janthinobacterium sp. HH01
xyz_scaffold_1_2 | Bacteria | Proteobacteria | Betaproteobacteria | Burkholderiales | Burkholderiaceae | unknown
xyz_scaffold_1_3 | Bacteria | Proteobacteria | Alphaproteobacteria | Neisseriales | Chromobacterium | Chromobacterium violaceum
… | … | … | … | Rhodocyclales | Azospira | Azospira oryzae
… | … | … | … | Rhodospirillales | unknown | unknown
xyz_scaffold_1_6 | unknown | unknown | unknown | unknown | unknown | unknown

Specificity

Definition: The phylogeny at the most specific taxonomic level at which there is a majority of >= 60%

Winner found

species: unknown
genus: unknown
order: unknown
class: Betaproteobacteria
phylum: Proteobacteria
domain: Bacteria
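The specificity rule above can be sketched as a walk from the most specific rank to the least specific one, stopping at the first rank where a single taxon reaches the 60% majority. This is a minimal Python sketch, not ggkbase's actual code. It assumes that the percentage is taken over all genes on the contig (including genes without a hit) and that "unknown" can never win; the slides do not settle either point. The gene data is made up for the example.

```python
from collections import Counter

# Most specific rank first; these are the six ranks shown on the slides.
RANKS = ["species", "genus", "order", "class", "phylum", "domain"]


def contig_winner(gene_taxonomies, threshold=0.6):
    """Return (rank, taxon) at the most specific rank with a majority.

    gene_taxonomies holds one dict per gene, mapping rank -> taxon name.
    Assumption: the denominator is the number of genes on the contig.
    """
    total = len(gene_taxonomies)
    if total == 0:
        return None
    for rank in RANKS:
        counts = Counter(t.get(rank, "unknown") for t in gene_taxonomies)
        counts.pop("unknown", None)  # "unknown" is never a winner
        if not counts:
            continue
        taxon, votes = counts.most_common(1)[0]
        if votes / total >= threshold:
            return rank, taxon
    return None


def lineage(cls, order, genus="unknown", species="unknown"):
    """Helper for the made-up example data below."""
    return {"domain": "Bacteria", "phylum": "Proteobacteria",
            "class": cls, "order": order, "genus": genus, "species": species}


genes = [
    lineage("Betaproteobacteria", "Burkholderiales", "Janthinobacterium"),
    lineage("Betaproteobacteria", "Burkholderiales"),
    lineage("Betaproteobacteria", "Neisseriales", "Chromobacterium"),
    lineage("Betaproteobacteria", "Rhodocyclales", "Azospira"),
    {},  # gene without a UniRef hit
]
print(contig_winner(genes))
```

Here no order reaches 60% (Burkholderiales has 2 of 5 genes), but the class Betaproteobacteria has 4 of 5, so the contig is called at class level.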

Binning confirmation: Metrics for genome completeness

Single Copy Genes (SCG)

• Based on similarity searches against a reference database
• Itai’s scg.pl script
• 51 indicates full set
• Most of the ribosomal proteins
• Histidyl tRNA synthetase
• Phenylalanyl tRNA synthetase alpha
• Preprotein translocase subunit SecY
• Valyl tRNA synthetase
• gyrA
• leucyl tRNA synthetase
• recA
• aspartyl tRNA synthetase
• arginyl tRNA synthetase
• alanyl tRNA synthetase

Ribosomal Proteins (RP)

• Based on keyword-based annotation searches
• 55 indicates “full set”
• L1-L35
• S1-S24
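A keyword-based ribosomal protein inventory of the kind described above could look roughly like this in Python. The annotation wording ("ribosomal protein L9"), the regular expression and the function name are assumptions for illustration; only the full-set size of 55 comes from the slide.

```python
import re
from collections import Counter

# Assumed annotation wording, e.g. "30S ribosomal protein S1".
RP_PATTERN = re.compile(r"\bribosomal protein ([LS]\d+)\b", re.IGNORECASE)
FULL_RP_SET = 55  # "full set" size as stated on the slide


def rp_inventory(annotations):
    """Count how often each ribosomal protein name occurs in a bin."""
    counts = Counter()
    for text in annotations:
        counts.update(m.group(1).upper() for m in RP_PATTERN.finditer(text))
    return counts


annotations = [
    "30S ribosomal protein S1",
    "50S ribosomal protein L9",
    "ribosomal protein s18",
    "50S Ribosomal Protein L9",
    "recA protein",
]
inventory = rp_inventory(annotations)
duplicated = sorted(rp for rp, n in inventory.items() if n > 1)
print(f"{len(inventory)} of {FULL_RP_SET} RPs found, duplicated: {duplicated}")
```

The number of distinct hits gives a rough completeness estimate, while proteins that show up more than once hint that a bin may hold more than one genome.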

?

name | GC% | length | RPs | SCGs | Cov. | species | genus | order | class | phylum | domain
xyz_scaffold_760 | 64.42 | 19380 | rp S2 | rp S2 | 370.68 | Thiobacillus denitrificans | Thiobacillus | Hydrogenophilales | Betaproteobacteria | Proteobacteria | Bacteria
xyz_scaffold_772 | 62.05 | 22706 | rp L9, rp S18, rp S6 | rp L9, rp S18, rp S6 | 318.43 | … | … | Hydrogenophilales | Betaproteobacteria | Proteobacteria | Bacteria
xyz_scaffold_854 | 62.22 | 22767 | - | - | 301.99 | Thiobacillus denitrificans | Thiobacillus | Hydrogenophilales | Betaproteobacteria | Proteobacteria | Bacteria
xyz_scaffold_988 | 59.5 | 14284 | - | - | 299.03 | Thiobacillus denitrificans | Thiobacillus | Hydrogenophilales | Betaproteobacteria | Proteobacteria | Bacteria
xyz_scaffold_1009 | 63.81 | 17471 | - | - | 311.86 | unknown | unknown | unknown | Betaproteobacteria | Proteobacteria | Bacteria
xyz_scaffold_1082 | 63.76 | 12832 | - | - | 379.09 | Thiobacillus denitrificans | Thiobacillus | Hydrogenophilales | Betaproteobacteria | Proteobacteria | Bacteria
xyz_scaffold_1092 | 63.01 | 12754 | - | - | 336.33 | unknown | unknown | unknown | unknown | Proteobacteria | Bacteria




Contig data Binning tools

Domain, Phylum, Class, Order, Genus, Species

The binning tool page at a glance

Phylogeny wheel

GC Content

Coverage

Ribosomal protein inventory

Single Copy Gene inventory

Aggregates & rebinning form

Interactive binning tool overview

• individual data attributes (dimensions) are represented as separate graphs

• filtering by one attribute, e.g. GC% content, creates a subset of the data matching the selected range of GC%

• the filter updates all other graphs simultaneously and in real time to reflect the selection

• in the background the data is quickly re-grouped and re-aggregated to make this possible

• filters are incremental: one filter can be applied on top of another filter
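The filtering behaviour described above (filter on one attribute, re-aggregate everything else, stack filters on top of each other) can be sketched in Python as follows. The real tool runs in the browser with d3.js; this sketch only illustrates the idea, and the `Contig` fields, function names and sample values are made up for the example.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Contig:
    name: str
    gc: float
    coverage: float
    phylum: str


def apply_filters(contigs, filters):
    """Apply each filter on top of the previous one (incremental)."""
    selected = list(contigs)
    for keep in filters:
        selected = [c for c in selected if keep(c)]
    return selected


def phylum_counts(contigs):
    """Re-aggregate the current selection for the phylogeny graph."""
    return Counter(c.phylum for c in contigs)


contigs = [
    Contig("xyz_scaffold_1", 64.4, 370.7, "Proteobacteria"),
    Contig("xyz_scaffold_2", 62.1, 318.4, "Proteobacteria"),
    Contig("xyz_scaffold_3", 41.0, 35.2, "Firmicutes"),
    Contig("xyz_scaffold_4", 63.0, 12.5, "Firmicutes"),
]

gc_range = lambda c: 60.0 <= c.gc <= 66.0
high_coverage = lambda c: c.coverage >= 300.0

after_gc = apply_filters(contigs, [gc_range])
after_both = apply_filters(contigs, [gc_range, high_coverage])
print(phylum_counts(after_gc))
print(phylum_counts(after_both))
```

Selecting a GC% range leaves three contigs from two phyla; adding the coverage filter on top narrows the selection to the two Proteobacteria contigs, and every aggregate is recomputed from that subset.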

Common binning flow: start with phylogeny

1. Initial state (78,504 contigs)
90% Bacteria
7% Archaea
Click here

2. All Bacteria (62,364 contigs)
40% Firmicutes
15% Proteobacteria

3. All Firmicutes (11,606 contigs)
76% Clostridia
23% Bacilli

4. Genus: Clostridium (587 contigs)
50% Clostridium perfringens
48.9% Clostridium thermocellum

Let’s look at the other graphs

Binning success!

There are clearly 2 genomes

Clostridium perfringens

Clostridium thermocellum

Separation of 2 closely related strains

2 Desulfo strains; RP and SCG confirm that two genomes are present

Limitations

Human gut sample

Rifle background sediment

Lots of unknown!

Abundance pattern-based binning

Single abawaca bin

99% Bacteroidetes

Cov

GC

Good RP & SCG inventories

Iterative Binning

• Binning tools can be used in the context of bins (not only the UNK bin)

• Individual contigs can be re-binned as needed

• Contigs can be moved back to the original UNK bin

• ggkbase supports import of already “binned” organisms

• you can download “found” bins, curate them outside of ggkbase and re-import them as a “finalized” bin

• rough bins and curated bins can be linked

Data Export

• Project:
– Contigs
– DNA
– Proteins
• Organisms / bins
– Contigs
– DNA
– Proteins
– Tab delimited
– GenBank

Files outdated

Files are being regenerated

Files available for download

Beyond Binning

Binning projects

• Contain a (single) metagenome (UNK) bin
• All bins (curated and found) are sourced from the same metagenome
• Cannot contain bins from other sources

Analysis projects

• Can contain bins from many different sources, e.g. binning projects or outside reference organisms
• Don’t contain metagenomes (unbinned repositories of contigs)

Metagenome Analysis: What are they doing?

• Primary tool: “lists” based on keyword searches against annotations

• Lists are collections of features / genes that match a set of search terms

• Search terms often try to capture entire metabolic pathways, e.g. Glycolysis

• Lists are dynamic (change on data updates, e.g. new set of annotations, user-created notes)

• List contents can be visualized in the “Genome Summary” page
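The idea of a dynamic, keyword-based list can be sketched in Python as a query that is re-evaluated against the current annotations rather than a stored set of genes. This is an illustration only: ggkbase uses Sphinx full-text search, whereas this sketch uses plain substring matching, and the function name and sample data are made up.

```python
def build_list(features, search_terms):
    """Return the names of all features with an annotation or note
    that contains at least one search term (case-insensitive)."""
    terms = [t.lower() for t in search_terms]
    return sorted(
        name
        for name, texts in features.items()
        if any(term in text.lower() for text in texts for term in terms)
    )


# Feature name -> annotations and user notes (made-up data).
features = {
    "xyz_scaffold_1_1": ["pyruvate kinase"],
    "xyz_scaffold_1_2": ["30S ribosomal protein S1"],
    "xyz_scaffold_1_3": ["hypothetical protein"],
}

# A few glycolysis enzymes as search terms.
glycolysis_terms = ["hexokinase", "enolase", "pyruvate kinase"]

before = build_list(features, glycolysis_terms)

# A user adds a note; the list changes the next time it is evaluated.
features["xyz_scaffold_1_3"].append("enolase (user note)")
after = build_list(features, glycolysis_terms)
print(before, after)
```

Because the list is defined by its search terms, the second evaluation picks up the newly annotated gene without anyone editing the list itself.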

Search and list creation

Keyword suggestions

Lists are scoped to projects

Search results

Detour: Name Search

• Search for features, contigs and/or organisms by name or alias.

• We don’t rename contigs & features any longer, i.e. original names are kept if possible.

• No renaming on binning events.

• Main search box is for annotations, notes and systematic names (DB x-refs) with the purpose of list building.

• You won’t find features by name using that search box

Genome Summary for functional prediction

bins

lists

block of ribosomal proteins

Looking down at columns can reveal which organisms partake in a metabolic pathway

Heatmap: light vs. dark colors indicate number of genes

Genome Summaries can be saved, downloaded (as .svg) and used in publications

Future Development

1. Universal lists
2. Automated binning
3. API (read-only version very close to completion)
4. Data integration (incorporating and displaying other *omic data)
5. Your ideas and suggestions here ….
