Soergel oa week-2014-lightning

Open Workflows: A Vision for Collaborative Science

David Soergel [email protected] October 23, 2014

Viral Population Estimation Using Pyrosequencing

Nicholas Eriksson1*, Lior Pachter2, Yumi Mitsuya3, Soo-Yon Rhee3, Chunlin Wang3, Baback Gharizadeh4, Mostafa Ronaghi4, Robert W. Shafer3, Niko Beerenwinkel5*

1 Department of Statistics, University of Chicago, Chicago, Illinois, United States of America, 2 Department of Mathematics, University of California, Berkeley, California, United States of America, 3 Division of Infectious Diseases, Stanford University Medical Center, Stanford, California, United States of America, 4 Genome Technology Center, Stanford University, Palo Alto, California, United States of America, 5 Department of Biosystems Science and Engineering, ETH Zurich, Basel, Switzerland

Abstract

The diversity of virus populations within single infected hosts presents a major difficulty for the natural immune response as well as for vaccine design and antiviral drug therapy. Recently developed pyrophosphate-based sequencing technologies (pyrosequencing) can be used for quantifying this diversity by ultra-deep sequencing of virus samples. We present computational methods for the analysis of such sequence data and apply these techniques to pyrosequencing data obtained from HIV populations within patients harboring drug-resistant virus strains. Our main result is the estimation of the population structure of the sample from the pyrosequencing reads. This inference is based on a statistical approach to error correction, followed by a combinatorial algorithm for constructing a minimal set of haplotypes that explain the data. Using this set of explaining haplotypes, we apply a statistical model to infer the frequencies of the haplotypes in the population via an expectation–maximization (EM) algorithm. We demonstrate that pyrosequencing reads allow for effective population reconstruction by extensive simulations and by comparison to 165 sequences obtained directly from clonal sequencing of four independent, diverse HIV populations. Thus, pyrosequencing can be used for cost-effective estimation of the structure of virus populations, promising new insights into viral evolutionary dynamics and disease control strategies.

Citation: Eriksson N, Pachter L, Mitsuya Y, Rhee S-Y, Wang C, et al. (2008) Viral Population Estimation Using Pyrosequencing. PLoS Comput Biol 4(5): e1000074. doi:10.1371/journal.pcbi.1000074

Editor: Glenn Tesler, University of California San Diego, United States of America

Received July 2, 2007; Accepted March 27, 2008; Published May 9, 2008

Copyright: © 2008 Eriksson et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Funding: N. Eriksson and L. Pachter were partially supported by the NSF (grants DMS-0603448 and CCF-0347992, respectively). N. Beerenwinkel was funded by a grant from the Bill and Melinda Gates Foundation through the Grand Challenges in Global Health Initiative. The NSF has played no role in any part of this work.

Competing Interests: The authors have declared that no competing interests exist.

* E-mail: [email protected] (NE); [email protected] (NB)

Introduction

Pyrosequencing is a novel experimental technique for determining the sequence of DNA bases in a genome [1,2]. The method is faster, less laborious, and cheaper than existing technologies, but pyrosequencing reads are also significantly shorter and more error-prone (about 100–250 base pairs and 5–10 errors/kb) than those obtained from Sanger sequencing (about 1000 base pairs and 0.01 errors/kb) [3–5].

In this paper we address computational issues that arise in applying this technology to the sequencing of an RNA virus sample. Within-host RNA virus populations consist of different haplotypes (or strains) that are evolutionarily related. The population can exhibit a high degree of genetic diversity and is often referred to as a quasispecies, a concept that originally described a mutation-selection balance [6,7]. Viral genetic diversity is a key factor in disease progression [8,9], vaccine design [10,11], and antiretroviral drug therapy [12,13]. Ultra-deep sequencing of mixed virus samples is a promising approach to quantifying this diversity and to resolving the viral population structure [14–16].

Pyrosequencing of a virus population produces many reads, each of which originates from exactly one, but unknown, haplotype in the population. Thus, the central problem is to reconstruct from the read data the set of possible haplotypes that is consistent with the observed reads and to infer the structure of the population, i.e., the relative frequency of each haplotype.

Here we present a computational four-step procedure for making inference about the virus population based on a set of pyrosequencing reads (Figure 1). First, the reads are aligned to a reference genome. Second, sequencing errors are corrected locally in windows along the multiple alignment using clustering techniques. Next, we assemble haplotypes that are consistent with the observed reads. We formulate this problem as a search for a set of covering paths in a directed acyclic graph and show how the search problem can be solved very efficiently. Finally, we introduce a statistical model that mimics the sequencing process and we employ the maximum likelihood (ML) principle for estimating the frequency of each haplotype in the population.
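The final step, estimating haplotype frequencies by maximum likelihood, can be illustrated with a minimal EM sketch. This is not the paper's statistical model: it assumes a simplified mixture in which each read is either consistent or inconsistent with each candidate haplotype, and it ignores read positions and residual sequencing error. The function name `em_haplotype_frequencies` and the `compat` matrix layout are hypothetical, chosen only for illustration.

```python
from typing import List, Sequence


def em_haplotype_frequencies(
    compat: Sequence[Sequence[int]],
    n_iter: int = 200,
    tol: float = 1e-10,
) -> List[float]:
    """Estimate haplotype frequencies with EM under a simplified mixture model.

    compat[r][h] is 1 if read r is consistent with candidate haplotype h,
    else 0 (an assumed input format, not taken from the paper).
    """
    n_haps = len(compat[0])
    freqs = [1.0 / n_haps] * n_haps  # start from a uniform guess

    for _ in range(n_iter):
        counts = [0.0] * n_haps
        # E-step: split each read among its compatible haplotypes,
        # in proportion to the current frequency estimates.
        for row in compat:
            weights = [f * c for f, c in zip(freqs, row)]
            total = sum(weights)
            if total == 0.0:
                continue  # read matches no haplotype with nonzero frequency
            for h, w in enumerate(weights):
                counts[h] += w / total

        norm = sum(counts)
        if norm == 0.0:
            raise ValueError("no read is consistent with any haplotype")

        # M-step: new frequencies are the normalized expected read counts.
        new_freqs = [c / norm for c in counts]
        delta = max(abs(a - b) for a, b in zip(freqs, new_freqs))
        freqs = new_freqs
        if delta < tol:
            break

    return freqs


# Toy input: 6 reads match only haplotype 0, 2 match only haplotype 1,
# and 2 are ambiguous (match both).
toy_reads = [[1, 0]] * 6 + [[0, 1]] * 2 + [[1, 1]] * 2
toy_freqs = em_haplotype_frequencies(toy_reads)
```

On the toy input the ambiguous reads carry no information about the split, so the estimate converges toward the ratio set by the unambiguous reads (0.75 and 0.25).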

The alignment step of the proposed procedure is straightforward for the data analyzed here and has been discussed elsewhere [5]. Due to the presence of a reference genome, only pair-wise alignment is necessary between each read and the reference genome. We will therefore focus on the core methods of error correction, haplotype reconstruction, and haplotype frequency estimation. Two independent approaches are pursued for validating the proposed method. First, we present extensive simulation results of all the steps in the method. Second, we validate the procedure by reconstructing four independent HIV populations from pyrosequencing reads and comparing these populations to the results of clonal Sanger sequencing from the same samples.

These datasets consist of approximately 5000 to 8000 reads of average length 105 bp sequenced from a 1 kb region of the pol gene from clinical samples of HIV-1 populations. Pyrosequencing

PLoS Computational Biology | www.ploscompbiol.org 1 May 2008 | Volume 4 | Issue 5 | e1000074

[Slide figure: paper sections (Abstract, Introduction) shown with raw data tables and parameters]

Module Repository: Format Converters

.sdf -> .gsk, .ffj -> .oij, .asd -> .nnv, .qqa -> .qqb, .dfg -> .fgh, .ert -> .yey


F1000Research

RESEARCH ARTICLE

Novel somatic single nucleotide variants within the RNA binding protein hnRNP A1 in multiple sclerosis patients [v2; ref status: indexed, http://f1000r.es/4dh]

Sangmin Lee1,2,4, Michael Levin1-4

1 Research Service, Veterans Affairs Medical Center, Memphis, TN, USA, 2 Department of Neurology, University of Tennessee Health Science Center, Memphis, TN, USA, 3 Department of Anatomy/Neurobiology, University of Tennessee Health Science Center, Memphis, TN, USA, 4 Neuroscience Institute, University of Tennessee Health Science Center, Memphis, TN, USA

Abstract

Some somatic single nucleotide variants (SNVs) are thought to be pathogenic, leading to neurological disease. We hypothesized that heterogeneous nuclear ribonuclear protein A1 (hnRNP A1), an autoantigen associated with multiple sclerosis (MS) would contain SNVs. MS patients develop antibodies to hnRNP A1(293-304), an epitope within the M9 domain (AA 268-305) of hnRNP A1. M9 is hnRNP A1’s nucleocytoplasmic transport domain, which binds transportin-1 (TPNO-1) and allows for hnRNP A1’s transport into and out of the nucleus. Genomic DNA sequencing of M9 revealed nine novel SNVs that resulted in an amino acid substitution in MS patients that were not present in controls. SNVs occurred within the TPNO-1 binding domain (hnRNP A1(268-289)) and the MS IgG epitope (hnRNP A1(293-304)), within M9. In contrast to the nuclear localization of wild type (WT) hnRNP A1, mutant hnRNP A1 mis-localized to the cytoplasm, co-localized with stress granules and caused cellular apoptosis. Whilst WT hnRNP A1 bound TPNO-1, mutant hnRNP A1 showed reduced TPNO-1 binding. These data suggest SNVs in hnRNP A1 might contribute to pathogenesis of MS.

First published: 20 Jun 2014, 3:132 (doi: 10.12688/f1000research.4436.1)

Latest published: 18 Sep 2014, 3:132 (doi: 10.12688/f1000research.4436.2)

Open Peer Review

Invited Referees: Hans Lassmann, Medical University of Vienna, Austria; Michael K. Racke, The Ohio State University, USA; Elliot M. Frohman, University of Texas Southwestern Medical Center at Dallas, USA

version 2 published 18 Sep 2014

version 1 published 20 Jun 2014

F1000Research 2014, 3:132 Last updated: 03 OCT 2014


q = 3.4 works better!

openreview.net

worldmake.org

• Pre-publication open peer review (CS conferences, so far)
• Paper on open peer reviewing models

davidsoergel.com
