Evolution and Vaccination of Influenza Virus

Ham Ching Lam, Srinand Sreevatsan, Daniel Boley
University of Minnesota

supported in part by NSF 1319749, NIH N266200700007C, and others

7/28/2015

1 / 31

Presentation Overview

◮ Introduction
  ◮ Influenza virus and influenza vaccines

◮ High Throughput Genetic Sequence Analysis
  ◮ Influenza evolution visualization

◮ Multi-class separateness analysis
  ◮ From visualization to quantification

◮ Results

◮ Related works

◮ Conclusions

2 / 31

Influenza Virus and Influenza Vaccine

Influenza virus

◮ About 40% of the global population is infected in a single year [WHO 2014].

◮ Type A: rich diversity; classified by the antigenicity of the HA and NA molecules (H1N1, H3N2, H5N1, ... H18).

◮ Evolution mechanisms:
  ◮ Antigenic drift: gradual point mutations that can lead to antigenic changes.
  ◮ Antigenic shift/reassortment: rearrangement of viral gene segments in cells infected with 2 or more viruses.

Influenza vaccine

◮ Stimulates host antibodies against hemagglutinin (HA) and neuraminidase (NA).

◮ Yearly evaluation of seasonal human influenza vaccine components.

◮ Each vaccine component is selected to target a specific strain of an influenza virus.

◮ Vaccine components:
  ◮ From late 1930s: A/H1N1 (10 updates)
  ◮ From 1968: A/H3N2 (23 updates)
  ◮ From late 1970s: Type B (16 updates)

3 / 31

Does vaccine drive the evolution of influenza?

◮ Seasonal human A/H3N2 1971 - 2009

◮ Repeated emergence of new antigenic drift variants

◮ New vaccine is introduced frequently.

[Figure: dN/dS ratio over time for the HA1 domain of A/H3N2. Legend: dN/dS ratio > 1; dN/dS ratio 0.8−1; repeated vaccine; new vaccine introduction.]

H.C. Lam et al., 30th American Society for Virology Meeting, 2011.

4 / 31

High Throughput Genetic Sequence Analysis

Overall concepts: visualize influenza evolution

◮ A departure from traditional phylogenetic approach.

◮ No assumptions made about ancestors or ancestry relationships.

◮ No bias.

◮ Dimension reduction of non-numerical, fixed-length genetic sequence data.

◮ Exposes hidden data structure/patterns.

Advantages:

◮ Enables hypothesis generation because it can:
  ◮ Examine data comprehensively and give an efficient initial 'look'.
  ◮ Achieve high coverage in dense and sparse data.
    ◮ Dense: most matrix elements are non-zero.
    ◮ Sparse: the majority of matrix elements are zero.

5 / 31

Visualization Method:

1. Genetic sequence data conversion.
  ◮ Binary encoding of nucleotides:
      A → 1 0 0 0
      C → 0 1 0 0
      G → 0 0 1 0
      T → 0 0 0 1
  ◮ Each coded nucleotide is equidistant from every other.
  ◮ Form data matrix X: rows (HA seqs), columns (residues).

2. Apply Singular Value Decomposition (SVD) to X to reduce dimensions and eliminate noise.

3. The principal components are the columns of V.

4. Select the leading 2 or 3 components for visualization.
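The four steps above can be sketched in Python with NumPy. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names `one_hot_encode` and `pca_project` and the toy sequences are hypothetical, and the column-centering step is borrowed from the PCA backup slide rather than from this one.

```python
import numpy as np

# Binary code for each nucleotide, as listed on the slide.
CODE = {"A": (1, 0, 0, 0), "C": (0, 1, 0, 0), "G": (0, 0, 1, 0), "T": (0, 0, 0, 1)}

def one_hot_encode(seqs):
    """Turn equal-length nucleotide strings into a 0/1 matrix (one row per sequence)."""
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must be aligned to the same length")
    return np.array([[bit for nt in s for bit in CODE[nt]] for s in seqs], dtype=float)

def pca_project(X, k=2):
    """Center X by column, then project onto the k leading right singular vectors."""
    Xc = X - X.mean(axis=0)                      # assumption: centering as on the PCA slide
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                         # rows of Vt are the columns of V

# Hypothetical toy sequences standing in for aligned HA1 sequences.
seqs = ["ACGTAC", "ACGTAA", "TCGTAA", "TGGTCA"]
X = one_hot_encode(seqs)    # 4 sequences x (6 sites * 4 bits)
Z = pca_project(X, k=2)     # 2D coordinates for plotting
print(X.shape, Z.shape)
```

Each row of `Z` is one sequence's position in the reduced space; a scatter plot of the two columns would give the kind of view shown on the result slides.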

6 / 31

Hemagglutinin Sequence data

◮ Multiple flu sequence databases:
  ◮ NCBI Influenza Virus Resource
  ◮ Influenza Research Database
  ◮ EpiFlu

◮ Multiple sequence alignment; removed the few sequences with:
  ◮ gaps "-"
  ◮ "wildcard characters": W, N, S
  ◮ 'partially completed' status

◮ Fixed-length HA1 seqs: 987 nucleotides.
  ◮ "Vaccine controlled": 3 human, 1 avian samples
  ◮ "Non-vaccine controlled": 1 human, 1 avian samples

Samples                  Years       Seqs
Human A/H1N1             1918-2013   2140
Human A/H3N2             1968-2009   175(235)
Human Type B (Vic/Yam)   1970-2013   818
Human H5N1               1997-2012   127(128)
Avian H5 (Mexico)        1994-2002   32
Avian H5 (China)         1997-2002   32

7 / 31

Visualization Result: Human sample (A/H3N2)
Vaccine controlled

[Figure: A/H3N2 1968−2009 isolates plotted on the 1st and 2nd PCs, legend years 1970 to 2005, with vaccine strains marked 1 to 14.]

◮ Viruses evolve in a restricted direction.

◮ Chronological pattern appears to be nonrandom.

◮ Viruses cluster by isolation year.

◮ Viruses have undergone antigenic drift away from older/vaccine strains.

[Figure: 3D view of the A/H3N2 data with axes PC1, PC2, and Hamming distance.]

Vaccines:
1: Aichi/1968
2: Port Chalmers/1/1973
3: Philippines/2/1982
4: Shanghai/11/1987
5: Beijing/353/1989
6: Shangdong/9/1993
7: Johannesburg/33/1994
8: Sydney/5/1997
9: Moscow/10/1999
10: Fujian/411/2002
11: California/7/2004
12: Wisconsin/67/2005
13: Brisbane/10/2007
14: Perth/16/2009

Lam et al., 2012

8 / 31

Visualization Result: Human sample (Type B)
Vaccine controlled

[Figure: Type B isolates plotted on PC1 and PC2, legend years 1970 to 2012, Yamagata and Victoria lineages marked, with vaccine strains labeled by year (1972 to 2012, including 1988−Yamagata and 1987−Victoria).]

◮ Diverged from a single lineage prior to 1980 into the antigenically distinct Victoria (blue) and Yamagata (red) lineages.

◮ Chronological patterns appear to be nonrandom.

◮ Viruses cluster by isolation year.

◮ Viruses have undergone antigenic drift away from older/vaccine strains.

[Figure: 3D view of the Type B data with axes PC1, PC2, and Hamming distance; vaccine strains labeled by year.]

Vaccines:
B/Hong Kong/05/1972
B/Singapore/222/79
B/USSR/100/83
B/Ann Arbor/1/86
B/Beijing/1/1987
B/Yamagata/16/88
B/Panama/45/90
B/Beijing/184/93
B/Sichuan/379/99
B/Hong Kong/330/2001
B/Shanghai/361/2002
B/Malaysia/2506/2004
B/Florida/4/2006
B/Brisbane/60/2008
B/Wisconsin/01/2010
B/Massachusetts/02/2012

9 / 31

Visualization Result: Human sample (A/H1N1)

Vaccine controlled

◮ Split between pre-2009 and post-2009 viruses.
◮ Pandemic A/H1N1pdm09 overtook classical A/H1N1 (black) after the split.

◮ Isolates from 2010 onward appear to have diverged.
◮ Viruses have undergone antigenic drift away from older/vaccine strains.

10 / 31

Visualization Result: Human sample

Non-vaccine controlled Human H5N1

[Figure: Human H5N1 isolates plotted on PC1, PC2, and PC3, grouped by isolation period: 1997, 2001−2004, 2005−2007, 2008−2012.]

◮ Appears to diverge into 3 lineages.

◮ Clusters contain viruses that span longer time periods.

11 / 31

Visualization Result: Avian samples

[Figure: vaccinated H5 sample plotted on the 1st, 2nd, and 3rd PCs, legend years 1994 to 2002.]

* Vaccine controlled: Mexico avian H5
  * Directional evolutionary trend.
  * Diverged and established into separate lineages after the early 2000s [Lee2004].
  * Clusters contain viruses grouped by isolation year.
  * Undergone antigenic drift away from the vaccine strain (A/Ck/Mexico/CPA0232/1994).

* Non-vaccine controlled: China avian H5
  * No obvious evolutionary trend.
  * Clusters contain viruses that span longer time periods.
  * Early and late isolates overlap.
  * Appears to be more scattered.

12 / 31

From visualization to quantification

Visualization provides ’pictorial evidence’:

◮ In 2D and 3D PCA space:
  ◮ Vaccine controlled:
    * Distinct, restricted, nonrandom directional trend.
    * Viruses with the same isolation year appear to cluster together.
  ◮ Non-vaccine controlled:
    * Less obvious directional evolution.
    * Larger spread of clusters.
    * Clusters contain viruses that span longer time periods.

Quantification: gives some form of statistical confidence

◮ Quantify the visualization by measuring "distance" in terms of 'standard deviations'.

13 / 31

Multi-class separateness analysis

◮ Vaccine components are evaluated/updated every year.
◮ Viruses tend to cluster by isolation year.

Determine the cohesiveness of the viruses in each year:
* Distances between points and their respective centers in reduced space.

1. Visualization results as inputs:
  ◮ Two-dimensional PCA coordinates.
  ◮ Class label: isolation year of each virus.
    HA seq header: AF201875 A/Minnesota/1/1993

2. Compute the class separateness value λo for each sample.

3. Determine the significance of λo using a class-label randomization simulation.

14 / 31

Class separateness measure: compute λ

* $C$: number of classes
* $N_i$: number of data points in class $i = 1, 2, \dots, C$
* $u_i$: mean vector of class $i = 1, 2, \dots, C$
* $N_T$: total number of points
* Sought: $\mathrm{tr}(W)$ and $\mathrm{tr}(B)$, sums of squared distances between points and their respective centers.

◮ $W$: within-cluster scatter matrix
  ◮ $W = \sum_{i=1}^{C} \sum_{j=1}^{N_i} (x_j - u_i)(x_j - u_i)^T$
  ◮ $u_i$: mean of class $i$.

◮ $B$: between-cluster scatter matrix
  ◮ $B = \frac{1}{N_T} \sum_{i=1}^{C} N_i (u_i - M)(u_i - M)^T$
  ◮ $M = \frac{1}{N_T} \sum_{i=1}^{C} N_i u_i$: "global mean of dataset"

◮ $\lambda = \frac{\mathrm{tr}(B)}{\mathrm{tr}(W)}$
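As an illustration of the separateness measure, here is a minimal Python sketch that computes λ directly from the traces, without forming the scatter matrices. The function name `separateness` and the toy data are hypothetical. One assumption to note: both traces are left unnormalized here, so that tr(B) + tr(W) equals the total scatter; applying the 1/N_T factor shown for B would only rescale λ by a constant.

```python
import numpy as np

def separateness(points, labels):
    """Return lambda = tr(B) / tr(W) for points (n x d) grouped by class labels."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    global_mean = points.mean(axis=0)            # M, the global mean of the dataset
    tr_w = 0.0
    tr_b = 0.0
    for c in np.unique(labels):
        members = points[labels == c]
        center = members.mean(axis=0)            # u_i, the mean of class i
        tr_w += ((members - center) ** 2).sum()  # squared distances to the class center
        tr_b += len(members) * ((center - global_mean) ** 2).sum()
    return tr_b / tr_w

# Hypothetical toy data: two tight classes labeled by isolation year.
points = [[0, 0], [0, 2], [10, 0], [10, 2]]
labels = [1970, 1970, 1980, 1980]
print(separateness(points, labels))  # 25.0
```

In the toy data each point lies at squared distance 1 from its class center (tr(W) = 4) and each center lies at squared distance 25 from the global mean (tr(B) = 100), so λ = 25.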

15 / 31

Class labels randomization

Significance of λo
Using a "distance measure" as a surrogate for the probability of observing the observed λo by chance.

Algorithm:

Let $\lambda_o = \mathrm{tr}(B_o)/\mathrm{tr}(W_o)$ be the observed separateness value.
Repeat j = 1, ..., K2:
    Repeat i = 1, ..., K1:
        Generate a randomization of the class labels.
        Compute the within-cluster scatter W.
        Compute the ratio $\lambda_i = \mathrm{tr}(B)/\mathrm{tr}(W) = (\mathrm{tr}(T) - \mathrm{tr}(W))/\mathrm{tr}(W)$, where T is the total scatter matrix.
    Compute the mean $\mu$ and std $\sigma$ of all $\lambda_i$, i = 1, ..., K1.
    Compute the distance $d_j = (\mu - \lambda_o)/\sigma$.
Compute the mean $\bar{d}$ and std $\hat{d}$ of all $d_j$, j = 1, ..., K2.

Report the distance of λo from the mean in the form $\bar{d} \pm \hat{d}$.
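The randomization loop can be sketched as follows. This is a hedged illustration, not the authors' code: the function names, the default values of `k1` and `k2`, the seed, and the toy data are all assumptions. The sketch also reports the distance as (λo − µ)/σ, i.e. with the sign flipped relative to the formula above, so that a well-separated sample gives a positive number.

```python
import numpy as np

def separateness(points, labels):
    """lambda = tr(B) / tr(W), with both traces unnormalized (an assumption)."""
    global_mean = points.mean(axis=0)
    tr_w = tr_b = 0.0
    for c in np.unique(labels):
        members = points[labels == c]
        center = members.mean(axis=0)
        tr_w += ((members - center) ** 2).sum()
        tr_b += len(members) * ((center - global_mean) ** 2).sum()
    return tr_b / tr_w

def randomization_distance(points, labels, k1=200, k2=5, seed=0):
    """Return (lambda_o, mean distance, std of distance) over k2 batches of k1 shuffles."""
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    lam_o = separateness(points, labels)
    distances = []
    for _ in range(k2):
        # Shuffling the labels keeps the class sizes but breaks the point/label pairing.
        lams = np.array([separateness(points, rng.permutation(labels)) for _ in range(k1)])
        distances.append((lam_o - lams.mean()) / lams.std())
    distances = np.array(distances)
    return lam_o, distances.mean(), distances.std()

# Hypothetical toy data: two well-separated clusters in 2D PCA space.
rng = np.random.default_rng(1)
points = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(8, 1, (20, 2))])
labels = np.repeat([1990, 2000], 20)
lam_o, d_mean, d_std = randomization_distance(points, labels)
print(lam_o, d_mean, d_std)
```

Reporting the distance in standard deviations rather than as a p-value avoids the problem of a tail area too small to represent in floating point.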

16 / 31

Simulation results: Human samples

** Vaccine controlled [A/H3N2, Type B (Yamagata), A/H1N1] **

[Figure: two rows of histograms (frequency vs. lambda values) for Human A/H3N2, Type B − Yamagata Lineage, and Human A/H1N1.]

◮ Histogram: distribution of $\lambda_i$, i = 1, ..., K1
◮ 100 bins
◮ λo: observed class separateness value (far right).
◮ The area under the tail of the distributions beyond the observed separateness values was below the rounding error of $10^{-16}$, which made computing a p-value impossible.

17 / 31

Simulation results: Human samples

Vaccine controlled: Type B (Victoria)

[Figure: two histograms (frequency vs. lambda values) for Type B − Victoria Lineage.]

** Non-Vaccine controlled Human H5N1 **

[Figure: histogram (frequency vs. lambda values) for Human H5N1.]

◮ Histogram: distribution of $\lambda_i$, i = 1, ..., K1

◮ 100 bins

◮ λo: observed class separateness value (far right).

18 / 31

Simulation results: Avian samples

** Non-vaccine controlled **

[Figure: histogram (frequency vs. lambda values) for Avian H5.]

** Vaccine controlled **

[Figure: histogram (frequency vs. lambda values) for Avian H5 (Vaccinated).]

◮ Histogram: distribution of $\lambda_i$, i = 1, ..., K1

◮ 100 bins

◮ λo: observed class separateness value.

19 / 31

Class Separateness Results

Table: Human samples

Sample          λo      Distance          Group
A/H3N2          30.5    978.3 ± 0.031     vaccine controlled
B (Victoria)    26.3    1310 ± 0.02       vaccine controlled
B (Yamagata)    25.3    1327.8 ± 0.019    vaccine controlled
A/H1N1          24.7    617.2 ± 0.04      vaccine controlled
H5N1            1.01    34.8 ± 0.029      non-vaccine controlled

Table: Avian samples

Sample             λo       Distance        Group
Avian H5 (Mexico)  1.7      12.23 ± 0.11    vaccine controlled
H5 (China)         0.268    3.16 ± 0.06     non-vaccine controlled

20 / 31

Related Works

Influenza cartography [Smith 2003]:
◮ Requires hemagglutination inhibition (HI) assay data.
◮ Builds a pairwise distance matrix K from HI data.
◮ Runs multidimensional scaling on K.
◮ Plots the leading 2 'eigenvectors'.
◮ Can be used to evaluate vaccine strains.

Others
◮ Wet lab approach: vaccinated vs. unvaccinated mice
  * Hensley et al. [Science, 2009]
  ** Influenza virus mutated in a vaccine-protected environment.

◮ Binary conversion of genetic sequences.
  * Chaitanya Muralidhara, Orly Alter [PLOS ONE, 2011]
    – Studied the evolutionary pathways with a 6-bit encoding scheme.
  * Sagara et al. [Nucleic Acids Res, 1998]
    – Detected sequence identity in short gene segments.

21 / 31

Conclusions

Based on publicly available genetic sequence data alone, we were able to:

◮ Show that the high-throughput approach was able to
  ◮ expose hidden distinctive patterns in high-dimensional data.
  ◮ distinguish vaccinated from non-vaccinated populations.
  ◮ reveal evolutionary paths of different lineages.

◮ Facilitate a quantitative interpretation of the visualization results.
  ◮ Vaccine controlled influenza viruses showed higher 'cohesiveness' in each year than non-vaccine controlled influenza viruses.
  ◮ The analysis indicated that the vaccine cannot be completely ruled out as a driver of evolution.

22 / 31

Thank you

and

Questions ?

23 / 31

Hamming distance and PCA distance

◮ Every single change in the genetic sequence alphabet corresponds to changes to 2 bits in the binary encoding.

◮ Let $\|s - t\|_H$ denote the pairwise Hamming distance between two strains $s, t$ (number of differences in genetic sequences).

◮ Let $\|s - t\|_{\mathrm{bin}\,1}$ and $\|s - t\|_{\mathrm{bin}\,2}$ denote the distance between the binary encodings of the two sequences (1-norm and 2-norm, respectively).

◮ Let $\|s - t\|_{\mathrm{proj}}$ denote the 2-norm distance in lower-dimensional space after projection onto the leading principal components.

◮ $\|s - t\|_{\mathrm{proj}}^2 \le \|s - t\|_{\mathrm{bin}\,2}^2 = \|s - t\|_{\mathrm{bin}\,1} = 2\|s - t\|_H$.
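The chain of relations above can be checked numerically on a small example. The sketch below is illustrative only: the sequences are made up, and the principal axes come from a hypothetical four-sequence dataset. Because centering subtracts the same mean from both sequences, the projected difference is simply the raw difference projected onto the leading axes.

```python
import numpy as np

CODE = {"A": (1, 0, 0, 0), "C": (0, 1, 0, 0), "G": (0, 0, 1, 0), "T": (0, 0, 0, 1)}

def encode(seq):
    """Binary (one-hot) encoding of a nucleotide string."""
    return np.array([bit for nt in seq for bit in CODE[nt]], dtype=float)

# Hypothetical sequences; s and t differ at two sites.
s, t = "ACGTACGT", "ACGAACTT"
hamming = sum(a != b for a, b in zip(s, t))

diff = encode(s) - encode(t)
norm1 = np.abs(diff).sum()       # ||s - t||_bin1
norm2_sq = (diff ** 2).sum()     # ||s - t||_bin2 squared

# Leading principal axes from a small hypothetical dataset containing s and t.
X = np.array([encode(q) for q in [s, t, "TCGTACGA", "ACCTACGT"]])
_, _, Vt = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)
proj_sq = ((Vt[:2] @ diff) ** 2).sum()   # squared distance after projecting onto 2 PCs

print(hamming, norm1, norm2_sq, proj_sq <= norm2_sq + 1e-9)
```

Each substitution flips one bit off and one bit on, which is why the 1-norm and the squared 2-norm both equal twice the Hamming distance; the projection onto orthonormal axes can only shrink the distance.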

24 / 31

Hamming distance and PCA distance

[Figure: A/H3N2 distance values per strain, comparing PCA 2D distance with sequence distance in full space.]

◮ Pairwise distance from the oldest strain (A/Hong Kong/68 1968) to every other strain in the A/H3N2 dataset, in both the reduced two-dimensional PCA space and the full sequence space.

◮ The high agreement between the PCA 2D distance and the pairwise distance computed in full sequence space is indicated by a Pearson correlation coefficient of 0.9792.

25 / 31

Principal Component Analysis (PCA)

◮ Given a data matrix $X$ of size $m$ (strains) by $n$ (residues), we want to reveal the most important properties in $X$ as combinations of the original properties.

◮ Variance is the indicator of importance, so we need to form the covariance matrix.
  ◮ Center $X$ by subtracting the column means: replace $X$ with $\hat{X} = X - \frac{1}{m} e e^T X$, where $e$ is a column vector of all ones.
  ◮ Obtain the covariance matrix $C$ from $\hat{X}$ by $C = \frac{1}{m-1} \hat{X}^T \hat{X}$.
  ◮ The eigenvalue decomposition $C = S \Lambda S^T$ gives the principal components.
  ◮ Final transformed data: $Z = \hat{X} S$.

◮ Compute PCA using the SVD $\hat{X} = U \Sigma V^T$.
  ◮ Seek a set of orthonormal axes that decorrelate $\hat{X}$ by finding the eigenvectors of $\hat{X}^T \hat{X} = V \Sigma^2 V^T$.
  ◮ The orthonormal axes (principal coordinates) are the new basis for the data.
  ◮ Projecting the centered data onto the new basis gives the "PCA view" of the data, with mean zero and variance maximized.
  ◮ Final transformed data: $Z = \hat{X} V$.

26 / 31
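The two routes described on the PCA slide (eigendecomposition of the covariance matrix and SVD of the centered data) should agree up to the sign of each axis. A minimal NumPy check, using hypothetical random data in place of encoded sequences:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 5))        # hypothetical data: m = 6 strains, n = 5 columns

m = X.shape[0]
e = np.ones((m, 1))
X_hat = X - (e @ e.T @ X) / m      # centering: X - (1/m) e e^T X

# Route 1: eigendecomposition of the covariance matrix C = S Lambda S^T.
C = X_hat.T @ X_hat / (m - 1)
eigvals, S = np.linalg.eigh(C)     # eigh returns eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]
eigvals, S = eigvals[order], S[:, order]
Z_eig = X_hat @ S[:, :2]

# Route 2: SVD of the centered data, X_hat = U Sigma V^T.
_, sigma, Vt = np.linalg.svd(X_hat, full_matrices=False)
Z_svd = X_hat @ Vt[:2].T

print(np.allclose(eigvals, sigma ** 2 / (m - 1)))
print(np.allclose(np.abs(Z_eig), np.abs(Z_svd), atol=1e-6))
```

The eigenvalues of C equal the squared singular values divided by m − 1, and the projected coordinates match column by column apart from a possible sign flip, which is why the comparison uses absolute values.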

Markov Model

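[Figure: Markov chain state diagram with states 0, 1, 2, ..., N−1, N and transition labels $x_1, \dots, x_n$; $y_0 = 0, y_1, \dots, y_n$; $z_0 = 1, z_1, \dots, z_{n-1}$.]

◮ Poisson: $p_t(Y) = \frac{(Y\lambda)^t}{t!} e^{-Y\lambda}$
  * $t$: number of mutations, $Y$: years, $\lambda$: rate of mutation

◮ Markov chain: $q_t(k) = \sum_{i=0}^{k} v_{ti}$.

◮ Combined: $P_\kappa(Y) = \sum_{t=0}^{\infty} p_t(Y) \cdot q_t(\kappa)$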
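To make the combined formula concrete, here is a hedged Python sketch. The slides do not give the transition probabilities of the chain, so `transition_matrix` below is a hypothetical stand-in (each mutation hits a uniformly chosen site and replaces it with one of the 3 other nucleotides), not the authors' model. The sketch reads $v_{ti}$ as the probability of being at Hamming distance i from the source after t mutations, which is also an assumption, and it truncates the infinite sum at `t_max`.

```python
import numpy as np
from math import exp, lgamma, log

def poisson_pmf(t, mean):
    """p_t(Y) with mean = Y * lambda, computed in log space to avoid overflow."""
    if mean == 0:
        return 1.0 if t == 0 else 0.0
    return exp(t * log(mean) - mean - lgamma(t + 1))

def transition_matrix(n):
    """Hypothetical chain on states k = 0..n (Hamming distance from the source)."""
    P = np.zeros((n + 1, n + 1))
    for k in range(n + 1):
        if k < n:
            P[k, k + 1] = (n - k) / n      # mutation hits a still-unchanged site
        if k > 0:
            P[k, k - 1] = k / (3 * n)      # changed site reverts to the source nucleotide
        P[k, k] = 2 * k / (3 * n)          # changed site moves to another non-source nucleotide
    return P

def prob_within(kappa, years, rate, n, t_max=None):
    """P_kappa(Y): probability of Hamming distance <= kappa after `years` years.

    `rate` is the expected number of mutations per year for the whole sequence.
    """
    mean = years * rate
    if t_max is None:
        t_max = int(mean + 10 * np.sqrt(mean) + 20)   # truncation point for the sum
    P = transition_matrix(n)
    v = np.zeros(n + 1)
    v[0] = 1.0                                        # start at distance 0 (the source)
    total = 0.0
    for t in range(t_max + 1):
        total += poisson_pmf(t, mean) * v[: kappa + 1].sum()   # p_t(Y) * q_t(kappa)
        v = v @ P                                              # one more mutation
    return total

print(prob_within(kappa=2, years=10, rate=0.5, n=30))
```

The printed value depends entirely on the assumed chain and parameters and is not comparable to the P-values reported on the results slide.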

27 / 31

Markov Model: Results

Strain                          H    Y    EG     P-value
A/South Carolina/1/1918         0    0    0      source seq
A/swine/St-Hyacinthe/148/1990   20   72   47.3   6.349e-06

Table: H1N1 subtype long time gap strains (rate: $2 \times 10^{-3}$ per site per year). H = Hamming distance, Y = years, EG = expected number of mutations.

The model predicts that the probability of finding a highly similar virus after several decades is extremely small. The existence of recent viruses that are very similar to older viruses suggests that there may be some reservoir that preserves viruses over long periods.

28 / 31

dN/dS ratio calculation

◮ Mirror the actual flu season scenario.
◮ 23 flu seasons.
◮ Pairwise dN/dS ratio computation of vaccine strain (V) against circulating strain (C).
◮ Assumes C are the immune escape mutants caused by the previous flu season.

29 / 31

North American swine H3N2 Influenza clusters

Site   A     B     C     D     E
142    Gly   Glu   Asn   Arg   Lys
144    Val   Asp   Val   Ile   Gly

◮ Data: HA protein sequences.

◮ Convert the data to a binary matrix B.

◮ Compute the Shannon entropy for each site on HA.

◮ Method: construct a diagonal weight matrix W based on the computed entropy values.

◮ Select the sites with the highest entropy values to form matrix W̄.

◮ Apply W̄ to B to yield M.

◮ Apply SVD to the matrix M.

◮ Plot the top 2 leading components.
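The entropy-weighting steps above might look like the following in Python. This is a sketch under several assumptions that the slide does not settle: W̄ is built by zeroing the weights of all but the `top_sites` highest-entropy sites, each site's weight is applied to all 20 of its one-hot columns, and M is centered before the SVD. The function names and toy sequences are hypothetical.

```python
import numpy as np
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"   # the 20 standard amino acids

def site_entropy(seqs):
    """Shannon entropy (in bits) of the residue distribution at each aligned site."""
    n = len(seqs)
    ent = []
    for column in zip(*seqs):
        p = np.array(list(Counter(column).values()), dtype=float) / n
        ent.append(float(-(p * np.log2(p)).sum()))
    return np.array(ent)

def encode_protein(seqs):
    """One-hot encode aligned protein sequences into the binary matrix B."""
    idx = {aa: i for i, aa in enumerate(ALPHABET)}
    B = np.zeros((len(seqs), len(seqs[0]) * len(ALPHABET)))
    for r, s in enumerate(seqs):
        for site, aa in enumerate(s):
            B[r, site * len(ALPHABET) + idx[aa]] = 1.0
    return B

def entropy_weighted_projection(seqs, top_sites=2, k=2):
    """Weight B by the entropy of the top sites, then project onto k leading components."""
    ent = site_entropy(seqs)
    keep = np.argsort(ent)[::-1][:top_sites]                   # highest-entropy sites
    weights = np.zeros(len(ent))
    weights[keep] = ent[keep]
    W_bar = np.diag(np.repeat(weights, len(ALPHABET)))         # diagonal weight matrix
    M = encode_protein(seqs) @ W_bar
    Mc = M - M.mean(axis=0)                                    # assumption: center before SVD
    _, _, Vt = np.linalg.svd(Mc, full_matrices=False)
    return Mc @ Vt[:k].T

# Hypothetical three-site toy alignment.
seqs = ["GVA", "EDA", "NVA", "RIA"]
print(site_entropy(seqs))
Z = entropy_weighted_projection(seqs)
print(Z.shape)
```

In the toy alignment the first site has four distinct residues (2 bits), the second has a 2:1:1 split (1.5 bits), and the third is invariant (0 bits), so only the first two sites contribute to the projection.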

30 / 31

Influenza reassortant detection

[Figure: test virus and cH1N1 (ref) segments plotted on PC 1 and PC 2, with points labeled by segment number.]

◮ Test virus: A/Swine/North Carolina/35922/98 (H3N2)

◮ Reference virus: classical swine H1N1 virus

◮ Genetic sequence data conversion for all segments of the test and reference viruses.

◮ PCA projection: computed PCs using the reference virus; projected the test virus onto the 'pre-computed' PCs.

◮ Result: the test virus's segments 2, 4, and 6 did not pair with the respective reference virus segments.
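Projecting a test virus onto pre-computed PCs amounts to reusing the reference data's column means and principal axes. A minimal sketch, assuming the segments have already been binary encoded; the function names `fit_pcs` and `project` and the random stand-in data are hypothetical:

```python
import numpy as np

def fit_pcs(X_ref, k=2):
    """Return the column means and the k leading principal axes of the reference data."""
    mean = X_ref.mean(axis=0)
    _, _, Vt = np.linalg.svd(X_ref - mean, full_matrices=False)
    return mean, Vt[:k]

def project(X_new, mean, pcs):
    """Project new rows onto axes computed from the reference data only."""
    return (np.atleast_2d(X_new) - mean) @ pcs.T

rng = np.random.default_rng(0)
X_ref = rng.integers(0, 2, size=(10, 40)).astype(float)   # stand-in for encoded reference segments
x_test = rng.integers(0, 2, size=40).astype(float)        # stand-in for one encoded test segment

mean, pcs = fit_pcs(X_ref)
Z_ref = project(X_ref, mean, pcs)     # reference points in PC space
z_test = project(x_test, mean, pcs)   # test point in the same space
print(Z_ref.shape, z_test.shape)
```

Because the test virus does not influence the axes, a test segment that lands far from its reference counterpart in this space is a candidate for having a different origin.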

31 / 31
