Improving pan-genome annotation using whole genome multiple alignment

Raunak Shrestha

27th October 2011

Source: Angiuoli SV, Hotopp JC, Salzberg SL, Tettelin H. Improving pan-genome annotation using whole genome multiple alignment. BMC Bioinformatics. 2011 Jun 30;12:272.

Background

• Describing the genetic diversity of an organism on the basis of a single reference genome is difficult

• Pan-genomes: greater intra-specific genetic variation, even among closely related strains

• To aid gene prediction and annotation, genome sequences of several closely related strains are required

http://en.wikipedia.org/wiki/File:Pan-genome-graphics.png

Background

Schnoes et al., 2009

The change in misannotation over time in the NR database for the 37 families investigated.

Mugsy-Annotator (http://mugsy.sf.net)

• Steps:

1. Aligning multiple whole genomes

2. Mapping orthologs among the genomes

3. Identifying annotation anomalies

• Objectives:

1. Identifying orthologs

2. Evaluating the quality of annotated gene structures in prokaryotic genomes

Determining Orthologs

• Identifies orthologs on the basis of whole genome alignment (WGA), sequence position, and sequence length

• Expects one segment per organism in the whole genome alignment

• For segmental duplications:

• Reports separate ortholog groups for each copy only if the whole genome alignment identifies orthologous copies in other genomes

• Otherwise, does not recognize the duplication and places the copies in a single ortholog group
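A minimal sketch of position-based ortholog grouping, written for illustration only and not taken from Mugsy-Annotator itself. It assumes a single ungapped alignment block with one segment per genome, half-open coordinates, and forward-strand genes; the function names and the `min_overlap` threshold are hypothetical.

```python
def to_columns(gene, block):
    """Project a gene onto alignment columns of an (assumed ungapped) block."""
    seg_start, seg_end = block[gene["genome"]]
    start = max(gene["start"], seg_start)
    end = min(gene["end"], seg_end)
    if start >= end:
        return None  # gene lies outside the aligned segment
    return (start - seg_start, end - seg_start)


def overlap_fraction(a, b):
    """Shared columns divided by the length of the longer interval."""
    shared = min(a[1], b[1]) - max(a[0], b[0])
    if shared <= 0:
        return 0.0
    return shared / max(a[1] - a[0], b[1] - b[0])


def group_orthologs(genes, block, min_overlap=0.8):
    """Group genes from different genomes that cover the same columns.

    min_overlap is an illustrative threshold, not a value from the paper.
    """
    groups = []  # each entry: {"cols": (start, end), "members": {genome: id}}
    for gene in genes:
        if gene["genome"] not in block:
            continue
        cols = to_columns(gene, block)
        if cols is None:
            continue
        for group in groups:
            same_region = overlap_fraction(cols, group["cols"]) >= min_overlap
            if gene["genome"] not in group["members"] and same_region:
                group["members"][gene["genome"]] = gene["id"]
                break
        else:
            # no existing group matched: start a new one
            groups.append({"cols": cols, "members": {gene["genome"]: gene["id"]}})
    return [group["members"] for group in groups]


block = {"A": (100, 1100), "B": (5000, 6000)}
genes = [
    {"genome": "A", "id": "a1", "start": 200, "end": 500},
    {"genome": "B", "id": "b1", "start": 5100, "end": 5400},
    {"genome": "A", "id": "a2", "start": 600, "end": 900},
]
ortholog_groups = group_orthologs(genes, block)
```

Because each group accepts at most one gene per genome, a duplicated copy with no aligned counterpart in another genome ends up in a group of its own in this sketch.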

Identification of annotation inconsistencies

• Evaluates start codons, stop codons, and translation initiation sites (TIS)
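As an illustration of what such a check could look like, the sketch below compares annotated start and stop positions within one ortholog group after they have been projected onto alignment columns. The data layout and the function name `classify_group` are assumptions made for this example, not the tool's actual interface.

```python
from collections import Counter


def classify_group(members):
    """Summarise start/stop agreement within one ortholog group.

    members maps a gene id to (start_col, stop_col) in alignment columns.
    """
    starts = Counter(start for start, _ in members.values())
    stops = Counter(stop for _, stop in members.values())
    majority_start = starts.most_common(1)[0][0]
    return {
        "consistent_start": len(starts) == 1,
        "consistent_stop": len(stops) == 1,
        "majority_start": majority_start,
        # genes whose annotated start differs from the most common one
        "outliers": sorted(
            gene for gene, (start, _) in members.items()
            if start != majority_start
        ),
    }


report = classify_group({"a1": (100, 400), "b1": (100, 400), "c1": (130, 400)})
```

Here the three genes share a stop column but one of them starts 30 columns later, so the group would be flagged as having an inconsistent start.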

Data set

• Neisseria meningitidis (Nmen) dataset of 20 genomes

• Nmen verA contained 13 genomes

• Nmen verB contained 7 genomes

• The annotation pipeline differs between Nmen verA and Nmen verB

• A dataset of genomes from 9 other bacterial species from the RefSeq database

Comparison of the groups of orthologs for 20 Nmen genomes

• Among the genes reported exclusively by one method:

• Intra-genome BLASTP matches predict many of the genes to be paralogs (40% for Mugsy-Annotator and 60% for OrthoMCL)

• Some have functional names that indicate transposases

• Some are hypothetical proteins

• The paper claims that OrthoMCL clusters paralogs and orthologs in a single group

Run Time Performance

• Nmen dataset of 20 genomes

• Mugsy pipeline: ~4 h on a single CPU

• ~2 h for WGA with Mugsy

• ~2 h for comparing annotations with Mugsy-Annotator

• OrthoMCL consumed ~32 CPU hours

• The WGA method is computationally efficient and has a significant runtime advantage over BLAST-based OrthoMCL (~4 vs. ~32 CPU hours, roughly 8× on this dataset)

Consistency of annotated gene structures in several species pan-genomes as reported by Mugsy-Annotator

• In case of inconsistency in TIS, Mugsy-Annotator suggests alternative gene structures that improve annotation consistency

• Strategy: look for a conserved TIS in close proximity to the previously annotated TIS
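The strategy can be sketched roughly as follows. This is a simplified illustration, not the published algorithm: it assumes forward-strand genes, aligned rows of equal length, the three common bacterial start codons, and a hypothetical `window` parameter, and it does not check that a candidate stays in frame with the annotated stop codon.

```python
START_CODONS = {"ATG", "GTG", "TTG"}


def conserved_start_columns(rows, lo, hi):
    """Columns in [lo, hi) where every aligned row begins a start codon."""
    columns = []
    for col in range(max(lo, 0), hi):
        codons = [seq[col:col + 3].upper() for seq in rows.values()]
        if all(codon in START_CODONS for codon in codons):
            columns.append(col)
    return columns


def suggest_tis(rows, annotated, window=30):
    """Suggest one shared TIS column near the annotated starts, or None.

    rows maps genome -> aligned sequence; annotated maps genome -> the
    alignment column of its annotated start. window is illustrative.
    """
    lo = min(annotated.values()) - window
    hi = max(annotated.values()) + window + 1
    candidates = conserved_start_columns(rows, lo, hi)
    if not candidates:
        return None

    def cost(col):
        # prefer changing the fewest annotations, then the smallest shift
        changed = sum(1 for c in annotated.values() if c != col)
        shift = sum(abs(c - col) for c in annotated.values())
        return (changed, shift)

    return min(candidates, key=cost)


rows = {
    "A": "CCCATGAAAGTGCCC",
    "B": "CCCATGAAAGTGCCC",
    "C": "CCCATGAAACTGCCC",
}
annotated = {"A": 3, "B": 9, "C": 3}
suggested = suggest_tis(rows, annotated)
```

In this toy alignment, genome B is annotated at a GTG that is not conserved in genome C, while the ATG at column 3 is present in all three rows, so column 3 is the suggested start.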

Conclusion

• Aids in identifying and comparing gene content across a pan-genome

• Aids annotation and re-annotation of genes within a pan-genome rather than in a single genome

• The study demonstrates significant variation in annotation, due primarily to differences between the bioinformatics approaches used rather than to true biological variation

• Mugsy-Annotator: an efficient, accurate method for finding orthologs within a pan-genome

• Mugsy (WGA approach) is computationally efficient compared to BLAST-based approaches for finding orthologs

Critique

• Mugsy-Annotator requires previously predicted annotation information and is therefore not an independent annotation tool

• Mugsy-Annotator still has difficulty identifying segmental duplications and paralogs

• It would have been better if the authors had measured the performance of Mugsy-Annotator on pan-genome datasets spanning larger evolutionary distances

QUESTIONS?
