Bioinformatic approaches to 454 data analysis for HIV

10th European Workshop Meeting on HIV & Hepatitis

Marc Noguera i Julian

Barcelona, 29/03/2012

Presented at the 10th EU Meeting on HIV & Hepatitis, 28 - 30 March 2012, Barcelona

Summary

1 – Technology Introduction

2 – Specs & Performance

3 – Applications

4 – Limits and sources of error

5 – Bioinformatic Tools


Technology Introduction (I)

Sanger Based, globally used for resistance testing:

- Chromatogram signal averages over the whole population of DNA sequences
- As a result, only viral variants present in 20% or more of the population can be detected
- Characterization of minority variants requires many (>100) cloning and sequencing steps

Allele-Specific PCR:

- Shows very high sensitivity
- Limited to a pre-defined (small) set of mutations of interest


Technology Introduction (II)

[Figure: Amplicon 1 to Amplicon 5 over Protease (PI) and RT (NRTI+NNRTI)]

Single Clone

Emulsion PCR

Picotiter Plate Loading

Single template sequencing - Flowgram


Specs & Performance

                          FLX+       FLX        GS/Junior
Median Read Length (bp)   800        450        450
Read Number               750,000    750,000    80,000
Samples/Run               >40        25-40      5-8
Run Time (Lab)            7 days     7 days     4 days
Analysis                  ?          ?          ?

Two Different Platforms for large/small scale projects.


Applications - HIV

High Sensitivity Genotyping: 1,000 – 20,000 sequences / sample

Population characterization

Low Frequency Resistant Mutants

Low-Level X4 Tropic Viruses

Resistance & Tropism Dynamics

Population Reconstruction


Limits and Sources of error

Sampling

- A viral load of approx. 30,000 is needed to reliably detect a 0.1% variant
- Primer design may introduce amplification biases

PCR Error

- Depends on the error rates of the RT-PCR polymerases
- May introduce false mutations, virtually indistinguishable from true ones

Recombination

- Depends on the polymerases used, PCR definition and extension times
- May produce chimeric sequences
- Broad range of recombination rates

Sequencing

- Carry Forward – Incomplete Extension (CaFIE) Error
- Homopolymer Error

Paredes et al. J. Virol. Methods, 2007, 146, 136.
Zagordi et al. Nucleic Acids Res., 2010, 38, 7400.
Margulies et al. Nature, 2005, 437, 376.
Lahr and Katz. BioTechniques, 2009, 47, 857.
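The sampling limit can be illustrated with a simple binomial model. This is only a sketch: it assumes every template entering RT-PCR is drawn independently from the viral population. How many templates a given viral load actually yields depends on plasma volume and extraction/RT efficiency, which are not stated here, so `n_templates` is a hypothetical input.

```python
import math

def prob_at_least(k, n_templates, freq):
    """P(at least k of n_templates carry a variant present at frequency freq),
    assuming independent (binomial) sampling of templates."""
    # Sum the probabilities of sampling fewer than k copies, then take the complement.
    p_fewer = 0.0
    for i in range(k):
        p_fewer += (math.comb(n_templates, i)
                    * freq ** i * (1 - freq) ** (n_templates - i))
    return 1.0 - p_fewer

if __name__ == "__main__":
    # Hypothetical template counts; 0.1% is the variant frequency quoted above.
    for n in (500, 1000, 3000, 10000):
        print(n, round(prob_at_least(1, n, 0.001), 3))
```

Asking for more than one sampled copy (for example `k=3`) is a stricter criterion and requires correspondingly more input templates.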


Error Patterns

Gilles et al. BMC Genomics. 2011. doi:10.1186/1471-2164-12-245

[Figure panels: Mismatches, Insertions, Deletions]

- Error is not random; it shows patterns
- InDel errors are the most common type of error
- For read lengths > 400 bp, the majority of reads contain errors
- Error patterns are critical in amplicon sequencing designs
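A back-of-the-envelope check of why long reads are rarely error-free. This sketch assumes independent errors at a uniform per-base rate, which is a simplification because the errors follow patterns. The rates used below are hypothetical values, not figures taken from the cited study.

```python
def fraction_reads_with_error(read_length, per_base_error):
    """Expected fraction of reads carrying at least one error,
    assuming independent errors at a uniform per-base rate."""
    return 1.0 - (1.0 - per_base_error) ** read_length

if __name__ == "__main__":
    # Hypothetical per-base error rates for a 400 bp read.
    for rate in (0.001, 0.005, 0.01):
        print(rate, round(fraction_reads_with_error(400, rate), 3))
```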


Applications

[Flowchart]

Raw Reads

Bioinformatician Around? NO / YES

Error Correction: AmpliconNoise (1), Shorah (3), V-Phaser/RC454 (2)

Haplotype Reconstruction: ViSPA, QuRe

Sequence Aligners: Mosaik, BWA, MAFFT, MUSCLE

In-House Code Pipeline

Variant Detection, Resistance Interpretation, Tropism Prediction, Diversity Analysis: AVA®, Segminator, DataMonkey, Geno2Pheno[454], DeepChek®, PyroDyn/Mut®

1 Quince, C., BMC Bioinformatics, 2011, 12, 38
2 Macalalad, AR., PLoS Comp Biol, in press
3 Zagordi, O., BMC Bioinformatics, 2011, 12, 119


Resistance – UI – AVA (Roche – 454)

Amplicon Variant Analyzer (AVA)

Alignment & Flowgram Browsing

Variant Explorer

Features:
• Multiple samples
• Search for pre-defined mutation set
• New variant discovery
• Export tools to standard formats and tables


Resistance – UI – Segminator

Segminator II

[Screenshot tracks: Reference, Coverage, Non-Consensus, Entropy, Insertions, Deletions]

Sequencing Browser

Quick Trees For Selected Regions

Positional Browser

Archer et al. http://www.bioinf.manchester.ac.uk/segminator/

Features:
• Little pre-processing
• Filters reads by quality
• Own reference
• Diversity estimation
• Phylogenetic tools
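Per-position tracks such as coverage, non-consensus fraction and entropy can be derived from aligned reads roughly as follows. This is a minimal sketch and not Segminator's own code. It assumes the reads are already aligned to equal length, with '-' marking a gap that does not count towards coverage.

```python
import math
from collections import Counter

def column_profile(aligned_reads):
    """Return one (coverage, non_consensus_fraction, entropy_bits) tuple
    per alignment column. Gaps ('-') are ignored."""
    profile = []
    for column in zip(*aligned_reads):
        counts = Counter(base for base in column if base != "-")
        coverage = sum(counts.values())
        if coverage == 0:
            profile.append((0, 0.0, 0.0))
            continue
        consensus_count = counts.most_common(1)[0][1]
        # Shannon entropy of the base frequencies in this column.
        entropy = sum(-(c / coverage) * math.log2(c / coverage)
                      for c in counts.values())
        profile.append((coverage, 1 - consensus_count / coverage, entropy))
    return profile

if __name__ == "__main__":
    reads = ["ACGT", "ACGA", "AC-A", "ACGA"]
    for position, row in enumerate(column_profile(reads), start=1):
        print(position, row)
```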


HIV Tropism – UI – Geno2Pheno[454]

• Web interface to the tropism predictor for 454 sequences
• Needs some pre-processing on a local computer
• Filters applied to sequences
• Tropism analyzed for each sequence
• Several samples at a time
• Full report for each sample, web-shareable

[Plot axes: FPR, Cons-Dist]

http://g2p-454.bioinf.mpi-inf.mpg.de/index.php

Daumer et al. BMC Med. Inform. Decis. Mak. 2011, 11, 30
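Per-read tropism calls can be summarised into a sample-level result along these lines. This is a sketch only: it assumes each read already has a false-positive rate (FPR) from the predictor and counts a read as X4 when its FPR is at or below a cut-off. The function names and the cut-off values in the example are hypothetical, not recommendations.

```python
def x4_fraction(read_fprs, fpr_cutoff):
    """Fraction of reads called X4, where a read with FPR <= fpr_cutoff is X4."""
    if not read_fprs:
        raise ValueError("no reads to classify")
    x4_reads = sum(1 for fpr in read_fprs if fpr <= fpr_cutoff)
    return x4_reads / len(read_fprs)

def sample_is_x4(read_fprs, fpr_cutoff, min_x4_fraction):
    """Call the sample X4 if enough of its reads are X4."""
    return x4_fraction(read_fprs, fpr_cutoff) >= min_x4_fraction

if __name__ == "__main__":
    fprs = [1.0, 2.5, 40.0, 75.0]  # hypothetical per-read FPR values (%)
    print(x4_fraction(fprs, fpr_cutoff=3.5))
    print(sample_is_x4(fprs, fpr_cutoff=3.5, min_x4_fraction=0.02))
```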


Bioinformatics - YES

• Generally requires programming an in-house pipeline.
• Detection limit established using internal controls.
• Sensitivity limit generally established around 0.5-1.0%.
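One way an in-house pipeline might decide whether a low-frequency variant stands out from background error is a binomial tail test. This sketch assumes the error rate was estimated from an internal control and that errors are independent across reads. The numbers in the example are hypothetical.

```python
import math

def log_binom_pmf(k, n, p):
    """Log of the binomial probability mass function, for 0 < p < 1."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log1p(-p))

def p_value_from_error(variant_reads, coverage, error_rate):
    """P(X >= variant_reads) if every variant read were an error."""
    tail = sum(math.exp(log_binom_pmf(i, coverage, error_rate))
               for i in range(variant_reads, coverage + 1))
    return min(1.0, tail)

if __name__ == "__main__":
    # Hypothetical: 5,000 reads, control-derived error rate of 0.2%.
    for count in (10, 25, 50):  # 0.2%, 0.5% and 1.0% of reads
        print(count, p_value_from_error(count, 5000, 0.002))
```

A variant would then be reported only when this probability falls below a chosen significance level, ideally corrected for the number of positions tested.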

[Flowchart]

Raw Reads

Filtering / Correction: Quality Scores, Ambiguous Bases, Insertions/Deletions, Length

Quality Improved

Reference Alignment

Variant Calling

Variant Filtering x N

Multiple Alignment

Phylogeny Inference

Population Reconstruction

Interpretation
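The read filtering step might look like the sketch below, which drops reads that are too short, contain ambiguous bases, or have a low mean quality. The `Read` class, the function names and all thresholds are hypothetical choices for illustration. Real pipelines tune them per amplicon and per run.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Read:
    name: str
    sequence: str
    qualities: List[int]  # one Phred score per base

def passes_filters(read, min_length=200, min_mean_quality=25, max_ambiguous=0):
    """Apply length, ambiguous-base and mean-quality filters to one read."""
    if len(read.sequence) < min_length:
        return False
    if read.sequence.upper().count("N") > max_ambiguous:
        return False
    if not read.qualities:
        return False
    mean_quality = sum(read.qualities) / len(read.qualities)
    return mean_quality >= min_mean_quality

def filter_reads(reads, **thresholds):
    """Keep only the reads that pass every filter."""
    return [read for read in reads if passes_filters(read, **thresholds)]

if __name__ == "__main__":
    reads = [
        Read("r1", "ACGT" * 60, [30] * 240),
        Read("r2", "ACGT", [30] * 4),         # too short
        Read("r3", "ACGN" * 60, [30] * 240),  # ambiguous bases
        Read("r4", "ACGT" * 60, [10] * 240),  # low quality
    ]
    print([read.name for read in filter_reads(reads)])
```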


Example 1: Resistance Dynamics

Figure 3. HIV-1 variant dynamics before, during and after treatment. The most common variants at each time point are illustrated as circles (if recurring) or as cubes (if not recurring). The genetic distance of the variants in nucleotide changes/site (from the most frequent variant at the first time point) is plotted over time. The frequency of the variants is proportional to the area of the circles and cubes.


Example 2: V3 Population Reconstruction

Figure 3. Minimum spanning tree (MST) of V3 sequences from subject DS1. V3 sequences generated by deep sequencing were used to construct MSTs. Identical nucleotide sequences are grouped in one node, and the circle size is proportional to the abundance of that particular V3 sequence. The length of the connecting branches corresponds to the number of nucleotide differences between the two connected nodes. Timepoints are color-coded, using bright colors for PBMC samples and corresponding soft colors for serum samples.
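A minimum spanning tree of this kind can be built roughly as follows. This is a sketch, not the method used for the figure. It assumes aligned, equal-length sequences, collapses identical reads into one node, uses the number of nucleotide differences as edge length, and grows the tree with a naive version of Prim's algorithm.

```python
from collections import Counter

def hamming(a, b):
    """Number of differing positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))

def minimum_spanning_tree(reads):
    """Return (nodes, edges): nodes maps each unique sequence to its abundance,
    edges is a list of (seq_a, seq_b, differences) forming the tree."""
    nodes = Counter(reads)
    unique = list(nodes)
    if not unique:
        return nodes, []
    in_tree = {unique[0]}
    edges = []
    while len(in_tree) < len(unique):
        # Shortest edge joining a sequence in the tree to one outside it.
        dist, a, b = min((hamming(a, b), a, b)
                         for a in in_tree
                         for b in unique if b not in in_tree)
        edges.append((a, b, dist))
        in_tree.add(b)
    return nodes, edges

if __name__ == "__main__":
    reads = ["AAAA"] * 3 + ["AAAT"] * 2 + ["AATT"]
    nodes, edges = minimum_spanning_tree(reads)
    print(dict(nodes))
    print(edges)
```

The abundance in `nodes` would drive circle size when drawing, and the third element of each edge the branch length.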


454 Applications - Summary

Table columns: Error Corr. | Reference Mapping | Phylogeny | Variant Calling | Haplotype Reconstruction | Resistance Analysis | UI | Ref. | Free Software

Tools (with reference number):
AVA (1)
AmpliconNoise (2)
Shorah (3)
Segminator (4)
DataMonkey (5)
QuRe (6)
ViSPA (7)
V-Phaser (8)
PyroDyn
DeepChek
Geno2Pheno[454] (9)

1 Margulies et al. Nature, 2005, 437, 376.
2 Quince, C., BMC Bioinformatics, 2011, 12, 38.
3 Zagordi, O., BMC Bioinformatics, 2011, 12, 119.
4 http://www.bioinf.manchester.ac.uk/segminator/
5 Delport et al. Bioinformatics, 2010, PMID: 20671151.
6 Prosperi et al. Bioinformatics, 2012, 28, 132.
7 Astrovskaya et al. BMC Bioinformatics, 2011, 12(Suppl 6), S1.
8 Macalalad, AR., PLoS Comp Biol, in press.
9 Daumer et al. BMC Med. Inform. Decis. Mak. 2011, 11, 30.


Acknowledgements

Funding Molecular Epidemiology Group

Rocio Bellido Maria Casadellà

Elisabeth Gómez Roger Paredes

Susana Pérez Christian Pou

Cristina Rodriguez Teresa Sequeros

Put a bioinformatician in your life

