Evolution and Vaccination of Influenza Virusboley/publications/papers/IJCAI-BAI15-slides.pdf0.05...

Evolution and Vaccination of Influenza Virus Ham Ching Lam Srinand Sreevatsan Daniel Boley University of Minnesota supported in part by NSF 1319749, NIH N266200700007C, and others 7/28/2015 1 / 31


  • Evolution and Vaccination of Influenza Virus

    Ham Ching Lam, Srinand Sreevatsan, Daniel Boley
    University of Minnesota

    supported in part by NSF 1319749, NIH N266200700007C, and others

    7/28/2015

  • Presentation Overview

    ◮ Introduction
        ◮ Influenza virus and influenza vaccines

    ◮ High Throughput Genetic Sequence Analysis
        ◮ Influenza evolution visualization

    ◮ Multi-class separateness analysis
        ◮ From visualization to quantification

    ◮ Results

    ◮ Related works

    ◮ Conclusions

  • Influenza Virus and Influenza Vaccine

    Influenza virus

    ◮ About 40% of the global population is infected in a single year [WHO 2014].

    ◮ Type A: rich diversity, classified by antigenicity of HA and NA molecules (H1N1, H3N2, H5N1, ..H18).

    ◮ Evolution mechanisms:
        ◮ Antigenic drift: gradual point mutations that can lead to antigenic changes.
        ◮ Antigenic shift/Reassortment: rearrangement of viral gene segments in cells infected with 2 or more viruses.

    Influenza vaccine

    ◮ Stimulates host antibodies against hemagglutinin (HA) and neuraminidase (NA).

    ◮ Yearly evaluation of seasonal human influenza vaccine components.

    ◮ Each vaccine component is selected to target a specific strain of an influenza virus.

    ◮ Vaccine components:
        ◮ From late 1930s, A/H1N1 (10 updates)
        ◮ From 1968, A/H3N2 (23 updates)
        ◮ From late 1970s, Type B (16 updates)

  • Does vaccine drive the evolution of influenza?

    ◮ Seasonal human A/H3N2 1971 - 2009

    ◮ Repeated emergence of new antigenic drift variants

    ◮ New vaccine is introduced frequently.

    [Figure: A/H3N2, HA1 domain, plotted against Time. Legend: dN/dS ratio > 1; dN/dS ratio 0.8-1; Repeated Vaccine; New Vacc. Introduction.]

    H.C. Lam et al., 30th American Society for Virology Meeting 2011.

  • High Throughput Genetic Sequence Analysis

    Overall concepts: visualize influenza evolution

    ◮ A departure from the traditional phylogenetic approach.

    ◮ No assumption made about ancestors or ancestry relationships.

    ◮ No bias.

    ◮ Dimension reduction of non-numerical fixed length genetic sequence data.

    ◮ Expose hidden data structure/patterns.

    Advantages:

    ◮ Enables generation of hypotheses because it is able to:
        ◮ Examine data comprehensively and give an efficient initial 'look'.
        ◮ Achieve high coverage in dense and sparse data.
            ◮ Dense: matrix elements are non-zero.
            ◮ Sparse: majority of matrix elements are zero.

  • Visualization Method:

    1. Genetic Sequence Data Conversion.
        ◮ Binary encoding of nucleotides:
            ◮ A → 1 0 0 0
            ◮ C → 0 1 0 0
            ◮ G → 0 0 1 0
            ◮ T → 0 0 0 1
        ◮ Each coded nucleotide is equidistant from every other.
        ◮ Form data matrix X: rows (HA seqs), columns (residues).

    2. Apply Singular Value Decomposition (SVD) to X to reduce dimensions and eliminate noise.

    3. Principal Components are the columns of V.

    4. Select leading 2 or 3 components for visualization.
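
    The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' code: the function names `encode` and `pca_coordinates` and the toy sequences are hypothetical, and the data matrix is centered by column before the SVD.

```python
import numpy as np

# One-hot code for each nucleotide, as listed in step 1.
CODE = {"A": (1, 0, 0, 0), "C": (0, 1, 0, 0), "G": (0, 0, 1, 0), "T": (0, 0, 0, 1)}

def encode(seqs):
    """Turn equal-length A/C/G/T strings into a binary matrix, one row per sequence."""
    return np.array([[bit for base in s for bit in CODE[base]] for s in seqs], dtype=float)

def pca_coordinates(X, k=2):
    """Project the rows of X onto the k leading principal components."""
    Xc = X - X.mean(axis=0)                        # center each column (assumption)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    V = Vt.T                                       # principal components are the columns of V
    return Xc @ V[:, :k]

seqs = ["ACGTAC", "ACGTAA", "ACGTTA", "TCGTTA"]     # toy data, not real HA sequences
X = encode(seqs)                                    # 4 sequences x (6 sites * 4 bits)
Z = pca_coordinates(X, k=2)                         # 2-D coordinates for plotting
```

    Each row of `Z` is one sequence in the reduced space; plotting the two columns against each other gives the kind of scatter plot shown on the result slides.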


  • Hemagglutinin Sequence data

    ◮ Multiple flu sequence databases:
        ◮ NCBI Influenza Virus Resource
        ◮ Influenza Research Database
        ◮ EpiFlu

    ◮ Multiple sequence alignment; removed the few sequences with:
        ◮ gaps "-"
        ◮ "wildcard characters": W, N, S
        ◮ 'partially completed'

    ◮ Fixed length HA1 seqs: 987 nucleotides.

    ◮ "Vaccine controlled": 3 human, 1 avian samples

    ◮ "Non-vaccine controlled": 1 human, 1 avian samples

    Samples                  Years       Seqs
    Human A/H1N1             1918-2013   2140
    Human A/H3N2             1968-2009   175(235)
    Human Type B (Vic/Yam)   1970-2013   818
    Human H5N1               1997-2012   127(128)
    Avian H5 (Mexico)        1994-2002   32
    Avian H5 (China)         1997-2002   32
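
    The filtering rule on this slide (drop sequences with gaps, wildcard characters, or incomplete length) could look like the sketch below. The names `keep_sequence` and `filter_alignment` and the (header, sequence) record format are assumptions for illustration; rejecting every character other than A, C, G, T is also an assumption that is slightly stricter than the wildcards named on the slide.

```python
def keep_sequence(seq, length=987):
    """True if seq has the expected length and contains only A, C, G, T."""
    return len(seq) == length and set(seq.upper()) <= set("ACGT")

def filter_alignment(records, length=987):
    """records: iterable of (header, sequence) pairs taken from an alignment."""
    return [(h, s) for h, s in records if keep_sequence(s, length)]

# Toy records with length 4 instead of 987, to keep the example readable.
records = [("ok", "ACGT"), ("gap", "AC-T"), ("wildcard", "ACNT"), ("partial", "ACG")]
kept = filter_alignment(records, length=4)
```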


  • Visualization Result: Human sample (A/H3N2)

    Vaccine controlled

    [Figure: A/H3N2 1968-2009 projected onto the 1st PC (x-axis) and 2nd PC (y-axis). Legend: Year 1970, Year 1980, Year 1985, Year 1990, Year 2002, Year 2005. Numbered markers 1 to 14 refer to the vaccine list.]

    ◮ Viruses evolving in a restricted direction.

    ◮ Chronological pattern appears to be nonrandom.

    ◮ Viruses clustered by isolation years.

    ◮ Undergone antigenic drift away from older/vaccine strains.

    [Figure: 3D view with axes PC1, PC2, and Hamming distance.]

    Vaccines
        1: Aichi/1968
        2: Port Chalmers/1/1973
        3: Philippines/2/1982
        4: Shanghai/11/1987
        5: Beijing/353/1989
        6: Shangdong/9/1993
        7: Johannesburg/33/1994
        8: Sydney/5/1997
        9: Moscow/10/1999
        10: Fujian/411/2002
        11: California/7/2004
        12: Wisconsin/67/2005
        13: Brisbane/10/2007
        14: Perth/16/2009

    Lam et al., 2012

  • Visualization Result: Human sample (Type B)

    Vaccine controlled

    [Figure: Type B projected onto PC1 (x-axis) and PC2 (y-axis). Legend: Year 1970, Year 1980, Year 1985, Year 1990, Year 2002, Year 2012, Yamagata Lineage, Victoria Lineage. Labeled points: 1972, 1979, 1983, 1986, 1987-Victoria, 1988-Yamagata, 1990, 1993, 1999, 2001, 2002, 2004, 2006, 2008, 2010, 2012.]

    ◮ Diverged from a single lineage prior to 1980 into antigenically distinct Victoria (blue) and Yamagata (red) lineages.

    ◮ Chronological patterns appear to be nonrandom.

    ◮ Viruses clustered by isolation years.

    ◮ Undergone antigenic drift away from older/vaccine strains.

    [Figure: 3D view with axes PC1, PC2, and Hamming distance, with the same year labels as the 2D plot.]

    Vaccine
        B/Hong Kong/05/1972
        B/Singapore/222/79
        B/USSR/100/83
        B/Ann Arbor/1/86
        B/Beijing/1/1987
        B/Yamagata/16/88
        B/Panama/45/90
        B/Beijing/184/93
        B/Sichuan/379/99
        B/Hong Kong/330/2001
        B/Shanghai/361/2002
        B/Malaysia/2506/2004
        B/Florida/4/2006
        B/Brisbane/60/2008
        B/Wisconsin/01/2010
        B/Massachusetts/02/2012

  • Visualization Result: Human sample (A/H1N1)

    Vaccine controlled

    ◮ Split between pre-2009 and post-2009 viruses.

    ◮ Pandemic A/H1N1pdm09 overtook classical A/H1N1 (black) after the split.

    ◮ Isolates from 2010 onward appear to have diverged.

    ◮ Undergone antigenic drift away from older/vaccine strains.

  • Visualization Result: Human sample

    Non-vaccine controlled Human H5N1

    [Figure: Human H5N1 in 3D with axes PC1, PC2, PC3. Legend: 1997, 2001-2004, 2005-2007, 2008-2012.]

    ◮ Appears to diverge into 3 lineages.

    ◮ Clusters contain viruses spanning a longer time period.

  • Visualization Result: Avian samples

    [Figure: Vaccinated H5 sample in 3D with axes 1st PC, 2nd PC, 3rd PC. Legend: Year 1994, Year 1996, Year 1998, Year 2000, Year 2001, Year 2002.]

    * Vaccine controlled: Mexico avian H5
        * Directional evolutionary trend.
        * Diverged and established into separate lineages after early 2000s [Lee2004].
        * Clusters contain viruses by isolation year.
        * Undergone antigenic drift away from vaccine strain (A/Ck/Mexico/CPA0232/1994).

    * Non-vaccine controlled: China avian H5
        * No obvious evolutionary trend.
        * Clusters contain viruses spanning a longer time period.
        * Early and late isolates overlap.
        * Appears to be more scattered.

  • From visualization to quantification

    Visualization provides ’pictorial evidence’:

    ◮ In 2D and 3D PCA space:
        ◮ Vaccine controlled:
            * Distinct restricted nonrandom directional trend.
            * Viruses with the same isolation year appear to cluster together.
        ◮ Non-vaccine controlled:
            * Less obvious directional evolution.
            * Larger spread of clusters.
            * Clusters contain viruses spanning a longer time period.

    Quantification: gives some form of statistical confidence

    ◮ Quantify visualization by measuring "distance" in terms of 'standard deviations'.

  • Multi-class separateness analysis

    ◮ Vaccine components are evaluated/updated every year.

    ◮ Viruses tend to cluster by the same isolation year.

    Determine the cohesiveness of the viruses in each year
    ** Distances between points and their respective centers in reduced space.

    1. Visualization results as inputs:
        ◮ Two dimensional PCA coordinates.
        ◮ Class label: isolation year of each virus.
          HA seq header: AF201875 A/Minnesota/1/1993

    2. Compute class separateness value $\lambda_o$ for each sample.

    3. Determine the significance of $\lambda_o$ using class labels randomization simulation.

  • Class separateness measure: compute λ

    * $C$: number of classes
    * $N_i$: number of data points in class $i = 1, 2, \ldots, C$
    * $u_i$: mean vector of class $i = 1, 2, \ldots, C$
    * $N_T$: total number of points
    * Sought: trace(W) and trace(B), the sums of squared distances between points and their respective centers.

    ◮ $W$: within cluster scatter matrix
        ◮ $W = \sum_{i=1}^{C} \sum_{j=1}^{N_i} (x_j - u_i)(x_j - u_i)^T$
        ◮ $u_i$: mean of class $i$.

    ◮ $B$: between cluster scatter matrix
        ◮ $B = \frac{1}{N_T} \sum_{i=1}^{C} N_i (u_i - M)(u_i - M)^T$
        ◮ $M = \frac{1}{N_T} \sum_{i=1}^{C} N_i u_i$, the "global mean of dataset"

    ◮ $\lambda = \frac{\mathrm{tr}(B)}{\mathrm{tr}(W)}$
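
    As a worked illustration of these definitions, the sketch below computes tr(W), tr(B), and their ratio directly from points and class labels. The function name `separateness` and the four-point toy dataset are assumptions made for the example; the $1/N_T$ factor on $B$ follows the formula as written on this slide.

```python
import numpy as np

def separateness(X, labels):
    """Return (lambda, tr(B), tr(W)) for points X (n x d) and one class label per point."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    n_total = len(X)
    M = X.mean(axis=0)                           # global mean of the dataset
    tr_W = 0.0
    tr_B = 0.0
    for c in np.unique(labels):
        Xc = X[labels == c]
        u = Xc.mean(axis=0)                      # class mean u_i
        tr_W += ((Xc - u) ** 2).sum()            # trace of sum_j (x_j - u_i)(x_j - u_i)^T
        tr_B += len(Xc) * ((u - M) ** 2).sum()   # trace of N_i (u_i - M)(u_i - M)^T
    tr_B /= n_total                              # 1/N_T factor, as written on the slide
    return tr_B / tr_W, tr_B, tr_W

# Two tight classes centered at (0, 1) and (10, 1).
X = [[0, 0], [0, 2], [10, 0], [10, 2]]
labels = [0, 0, 1, 1]
lam, tr_B, tr_W = separateness(X, labels)        # tr_W = 4, tr_B = 25, lam = 6.25
```

    Tight, well-separated classes give a small tr(W) and a large tr(B), hence a large ratio.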


  • Class labels randomization

    Significance of $\lambda_o$
    Use a "distance measure" as a surrogate for the probability of observing the observed $\lambda_o$ by chance.

    Algorithm:

    Let $\lambda_o = \mathrm{tr}(B_o)/\mathrm{tr}(W_o)$ be the observed separateness value.
    Repeat $j = 1, \ldots, K_2$:
        Repeat $i = 1, \ldots, K_1$:
            Generate a randomization of the class labels.
            Compute the within-cluster scatter $W$.
            Compute the ratio $\lambda_i = \mathrm{tr}(B)/\mathrm{tr}(W) = (\mathrm{tr}(T) - \mathrm{tr}(W))/\mathrm{tr}(W)$.
        Compute the mean $\mu$ and std $\sigma$ of all $\lambda_i$, $i = 1, \ldots, K_1$.
        Compute the distance $d_j = (\mu - \lambda_o)/\sigma$.
    Compute the mean $\bar{d}$ and std $\hat{d}$ of all $d_j$, $j = 1, \ldots, K_2$.
    Report the distance of $\lambda_o$ from the mean in the form $\bar{d} \pm \hat{d}$.
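
    A minimal Python sketch of this randomization loop follows. Several choices are assumptions rather than details from the slides: the function names, the small default values of `K1` and `K2`, the synthetic test data, and taking the absolute value of $\mu - \lambda_o$ so that the reported distance is positive. The sketch uses tr(B) = tr(T) - tr(W), which holds for the unnormalized between-cluster scatter; a constant factor such as $1/N_T$ rescales every $\lambda$ equally and leaves $d_j$ unchanged.

```python
import numpy as np

def trace_within(X, labels):
    """tr(W): sum of squared distances of points to their class means."""
    return sum(((X[labels == c] - X[labels == c].mean(axis=0)) ** 2).sum()
               for c in np.unique(labels))

def lam(X, labels, tr_T):
    tr_W = trace_within(X, labels)
    return (tr_T - tr_W) / tr_W                # tr(B)/tr(W) with tr(B) = tr(T) - tr(W)

def randomization_distance(X, labels, K1=200, K2=5, seed=0):
    """Return (mean, std) of the K2 distances between the observed and shuffled lambdas."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    tr_T = ((X - X.mean(axis=0)) ** 2).sum()   # total scatter, unchanged by relabeling
    lam_o = lam(X, labels, tr_T)
    d = []
    for _ in range(K2):
        lams = np.array([lam(X, rng.permutation(labels), tr_T) for _ in range(K1)])
        d.append(abs(lams.mean() - lam_o) / lams.std())   # abs() is an assumption
    return float(np.mean(d)), float(np.std(d))

# Synthetic, well-separated classes (not influenza data).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(10, 1, (20, 2))])
labels = np.repeat([0, 1], 20)
d_mean, d_std = randomization_distance(X, labels)
```

    Because the total scatter does not depend on the labels, only tr(W) has to be recomputed for each shuffle.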


  • Simulation results: Human samples

    ** Vaccine controlled [A/H3N2, Type B (Yamagata), A/H1N1] **

    [Figure: histograms of Lambda values (x-axis) against Frequency (y-axis) for Human A/H3N2, Type B - Yamagata Lineage, and Human A/H1N1; two panels per sample.]

    ◮ Histogram: distribution of $\lambda_i$, $i = 1, \ldots, K_1$

    ◮ 100 bins

    ◮ $\lambda_o$: observed class separateness value (far right).

    ◮ The area under the tail of the distributions beyond the observed separateness values was below the rounding error of $10^{-16}$, which made computing a p-value impossible.

  • Simulation results: Human samples

    Vaccine controlled: Type B (Victoria)

    [Figure: two histograms of Lambda values (x-axis) against Frequency (y-axis) for Type B - Victoria Lineage.]

    ** Non-Vaccine controlled Human H5N1 **

    [Figure: histogram of Lambda values (x-axis) against Frequency (y-axis) for Human H5N1.]

    ◮ Histogram: distribution of $\lambda_i$, $i = 1, \ldots, K_1$

    ◮ 100 bins

    ◮ $\lambda_o$: observed class separateness value (far right).

  • Simulation results: Avian samples

    ** Non-vaccine controlled **

    [Figure: histogram of Lambda values (x-axis) against Frequency (y-axis) for Avian H5.]

    ** Vaccine controlled **

    [Figure: histogram of Lambda values (x-axis) against Frequency (y-axis) for Avian H5 (Vaccinated).]

    ◮ Histogram: distribution of $\lambda_i$, $i = 1, \ldots, K_1$

    ◮ 100 bins

    ◮ $\lambda_o$: observed class separateness value.

  • Class Separateness Results

    Table: Human samples

    Sample             $\lambda_o$   Distance          Group
    A/H3N2             30.5          978.3 ± 0.031     vaccine controlled
    B (Victoria)       26.3          1310 ± 0.02       vaccine controlled
    B (Yamagata)       25.3          1327.8 ± 0.019    vaccine controlled
    A/H1N1             24.7          617.2 ± 0.04      vaccine controlled
    H5N1               1.01          34.8 ± 0.029      non-vaccine controlled

    Table: Avian samples

    Sample             $\lambda_o$   Distance          Group
    Avian H5 (Mexico)  1.7           12.23 ± 0.11      vaccine controlled
    H5 (China)         0.268         3.16 ± 0.06       non-vaccine controlled

  • Related Works

    Influenza cartography [Smith 2003]:

    ◮ Requires hemagglutination inhibition (HI) assay data.

    ◮ Build pairwise distance matrix K from HI data.

    ◮ Run multidimensional scaling on K.

    ◮ Plot the leading 2 'eigenvectors'.

    ◮ Can be used to evaluate vaccine strains.

    Others

    ◮ Wet lab approach: vaccinated vs. unvaccinated mice
        * Hensley et al. [Science, 2009]: influenza virus mutated in a vaccine protected environment.

    ◮ Binary conversion of genetic sequences
        * Chaitanya Muralidhara, Orly Alter [PLOS ONE, 2011]: studied the evolutionary pathways with a 6-bit encoding scheme.
        * Sagara et al. [Nucleic Acids Res, 1998]: detect sequence identity in short gene segments.

  • Conclusions

    Based on publicly available genetic sequence data alone, we were able to:

    ◮ Show that the high throughput approach was able to
        ◮ expose hidden distinctive patterns in high dimensional data,
        ◮ distinguish vaccinated from non-vaccinated populations,
        ◮ reveal evolutionary paths of different lineages.

    ◮ Facilitate a quantitative interpretation of the visualization results.

    ◮ Vaccine controlled influenza viruses showed higher 'cohesiveness' in each year than non-vaccine controlled influenza viruses.

    ◮ The analysis indicated that vaccination cannot be completely ruled out as a driver of evolution.

  • Thank you

    and

    Questions ?


  • Hamming distance and PCA distance

    ◮ Every single change in the genetic sequence alphabet corresponds to changes to 2 bits in the binary encoding.

    ◮ Let $\|s - t\|_H$ denote the pairwise Hamming distance between two strains $s, t$ (number of differences in genetic sequences).

    ◮ Let $\|s - t\|_{\mathrm{bin}\,1}$ and $\|s - t\|_{\mathrm{bin}\,2}$ denote the distance between the binary encodings of the two sequences (1-norm and 2-norm, respectively).

    ◮ Let $\|s - t\|_{\mathrm{proj}}$ denote the 2-norm distance in lower dimensional space after projection onto the leading principal components.

    ◮ $\|s - t\|_{\mathrm{proj}}^2 \le \|s - t\|_{\mathrm{bin}\,2}^2 = \|s - t\|_{\mathrm{bin}\,1} = 2\|s - t\|_H$.
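
    The chain of relations above can be checked numerically on a toy pair of sequences. In this sketch the sequences are made up, and `Q` is an arbitrary orthonormal basis standing in for the leading principal components (an assumption; any orthonormal projection obeys the same bound).

```python
import numpy as np

CODE = {"A": (1, 0, 0, 0), "C": (0, 1, 0, 0), "G": (0, 0, 1, 0), "T": (0, 0, 0, 1)}

def encode(seq):
    """Binary encoding of one nucleotide string, 4 bits per site."""
    return np.array([bit for base in seq for bit in CODE[base]], dtype=float)

s, t = "ACGTACGT", "ACGAACTT"                  # toy sequences differing at 2 sites
hamming = sum(a != b for a, b in zip(s, t))    # number of differing sites
diff = encode(s) - encode(t)
norm1 = np.abs(diff).sum()                     # 1-norm of the binary difference
norm2_sq = (diff ** 2).sum()                   # squared 2-norm of the binary difference

# Projection onto orthonormal directions cannot lengthen a vector.
Q, _ = np.linalg.qr(np.random.default_rng(0).normal(size=(diff.size, 2)))
proj_sq = ((Q.T @ diff) ** 2).sum()
```

    Each differing site contributes one +1 and one -1 entry to `diff`, which is why both `norm1` and `norm2_sq` equal twice the Hamming distance.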


  • Hamming distance and PCA distance

    [Figure: A/H3N2, Distance values (y-axis) against Strains (x-axis). Legend: PCA 2D distance; Sequence Distance in Full Space.]

    ◮ Pairwise distance of the oldest strain (A/Hong Kong/68 1968) to every other strain in the A/H3N2 dataset, in both the reduced two-dimensional PCA space and the full sequence space.

    ◮ The high agreement of the PCA 2D distance with the pairwise distance computed in full sequence space is indicated by the Pearson correlation coefficient of 0.9792.

  • Principal Component Analysis (PCA)

    ◮ Given a data matrix $X$ of size $m$ (strains) by $n$ (residues), we want to reveal the most important properties in $X$ as combinations of the original properties.

    ◮ The variance is the indicator of importance, so we need to form the covariance matrix.

    ◮ Center $X$ by subtracting the column means: replace $X$ with $\hat{X} = X - \frac{1}{m} e e^T X$, where $e$ is a column vector of all ones.

    ◮ Obtain the covariance matrix $C$ from $\hat{X}$ by $C = \frac{1}{m-1} \hat{X}^T \hat{X}$.

    ◮ Eigenvalue decomposition $C = S \Lambda S^T$ gives the principal components.

    ◮ Final transformed data: $Z = \hat{X} S$.

    ◮ Compute PCA using the SVD matrix decomposition $\hat{X} = U \Sigma V^T$.
        ◮ Seek a set of orthonormal axes that decorrelate $\hat{X}$ by finding the eigenvectors of $\hat{X}^T \hat{X} = V \Sigma^2 V^T$.
        ◮ The orthonormal axes (principal coordinates) are the new basis for the data.
        ◮ Projecting the centered data onto the new basis gives the "PCA view" of the data, with mean zero and variance maximized.
        ◮ Final transformed data: $Z = \hat{X} V$.
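
    The two routes on this slide (eigen-decomposition of the covariance matrix and SVD of the centered data) should give the same projection up to the sign of each axis. The sketch below checks this on random toy data; the data and variable names are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))              # toy data: m = 6 rows, n = 4 columns

m = X.shape[0]
Xc = X - X.mean(axis=0)                  # same as X - (1/m) e e^T X

# Route 1: eigen-decomposition of the covariance matrix, C = S Lambda S^T.
C = Xc.T @ Xc / (m - 1)
evals, S = np.linalg.eigh(C)             # eigh returns eigenvalues in ascending order
order = np.argsort(evals)[::-1]
evals, S = evals[order], S[:, order]

# Route 2: SVD of the centered data, Xc = U Sigma V^T.
U, sigma, Vt = np.linalg.svd(Xc, full_matrices=False)

Z_eig = Xc @ S[:, :2]                    # Z = Xc S, leading 2 components
Z_svd = Xc @ Vt.T[:, :2]                 # Z = Xc V, leading 2 components
```

    The eigenvalues of $C$ equal the squared singular values divided by $m - 1$, and the columns of `Z_eig` and `Z_svd` agree up to sign.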

  • Markov Model

    [Figure: Markov chain state diagram with State 0, State 1, State 2, ..., State N-1, State N; transitions labeled x1 ... xn, y0=0, y1 ... yn, and z0=1, z1 ... zn-1.]

    ◮ Poisson: $p_t(Y) = \frac{(Y\lambda)^t}{t!} e^{-Y\lambda}$
        * $t$: number of mutations, $Y$: years, $\lambda$: rate of mutation

    ◮ Markov chain: $q_t(k) = \sum_{i=0}^{k} v^t_i$.

    ◮ Combined: $P_\kappa(Y) = \sum_{t \ge 0} p_t(Y) \cdot q_t(\kappa)$
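
    The Poisson term and the combined sum can be sketched as follows. The transition probabilities of the Markov chain are not given on this slide, so the sketch takes $q_t(\kappa)$ as a caller-supplied function `q`; that function, the truncation point `t_max`, and the example numbers are all assumptions for illustration.

```python
import math

def poisson_pmf(t, years, rate):
    """Probability of exactly t mutations in `years` years at `rate` mutations per year."""
    mean = years * rate
    # Computed in log space to avoid overflow in mean**t and t! for large t.
    return math.exp(-mean + t * math.log(mean) - math.lgamma(t + 1))

def combined(years, rate, q, t_max=500):
    """Truncated sum over t of p_t(Y) * q(t); q stands in for q_t(kappa)."""
    return sum(poisson_pmf(t, years, rate) * q(t) for t in range(t_max + 1))

p0 = poisson_pmf(0, years=10, rate=0.5)                  # no mutation in 10 years
total = combined(10, 0.5, q=lambda t: 1.0)               # q = 1 recovers the Poisson total
at_most_3 = combined(10, 0.5, q=lambda t: float(t <= 3)) # toy q: indicator of t <= 3
```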


  • Markov Model: Results

    Strain                          H    Y    EG     P-value
    A/South Carolina/1/1918         0    0    0      source seq
    A/swine/St-Hyacinthe/148/1990   20   72   47.3   6.349e-06

    Table: H1N1 subtype long time gap strains (Rate: $2 \times 10^{-3}$ per site per year). H = Hamming distance, Y = Year, EG = Expected number of mutations.

    The model predicts that the probability of finding a highly similar virus after several decades is extremely small. The existence of recent viruses that are very similar to older viruses suggests that there may be some reservoir which preserves viruses over long periods.

  • dN/dS ratio calculation

    ◮ Mirror actual flu season scenario.

    ◮ 23 flu seasons.

    ◮ Pairwise dN/dS ratio computation of vaccine strain (V) against circulating strain (C).

    ◮ Assumes C are the immune escape mutants caused by the previous flu season.

  • North American swine H3N2 Influenza clusters

    Site   A     B     C     D     E
    142    Gly   Glu   Asn   Arg   Lys
    144    Val   Asp   Val   Ile   Gly

    ◮ Data: HA protein sequences

    ◮ Convert data to binary matrix $B$

    ◮ Compute Shannon entropy for each site on HA

    ◮ Method: Construct diagonal weight matrix $W$ based on computed entropy values.

    ◮ Select sites with highest entropy values to form matrix $\bar{W}$.

    ◮ Apply $\bar{W}$ to $B$ to yield $M$

    ◮ Apply SVD to $M$ matrix

    ◮ Plot top 2 leading components.
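
    One possible reading of this pipeline is sketched below. The slide does not say how the weights are derived from the entropies or how many sites are kept, so using the entropy itself as the weight, the `top_sites` parameter, applying the SVD to $M$ without centering, and the function names are all assumptions. The toy sequences reuse the residues at sites 142 and 144 from the table (one-letter codes) plus one invented conserved site.

```python
import numpy as np

def site_entropy(seqs):
    """Shannon entropy (bits) of the residue distribution at each aligned site."""
    ent = []
    for column in zip(*seqs):
        _, counts = np.unique(column, return_counts=True)
        p = counts / counts.sum()
        ent.append(float(-(p * np.log2(p)).sum()))
    return np.array(ent)

def entropy_weighted_projection(seqs, alphabet, top_sites=2, k=2):
    ent = site_entropy(seqs)
    keep = np.argsort(ent)[::-1][:top_sites]        # sites with highest entropy
    index = {a: i for i, a in enumerate(alphabet)}
    n_sym = len(alphabet)
    B = np.zeros((len(seqs), len(keep) * n_sym))    # binary matrix, kept sites only
    for r, s in enumerate(seqs):
        for c, site in enumerate(keep):
            B[r, c * n_sym + index[s[site]]] = 1.0
    weights = np.repeat(ent[keep], n_sym)           # diagonal of W-bar (assumed = entropy)
    M = B * weights                                 # same as B @ diag(weights)
    U, S, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * S)[:, :k], keep

# Clusters A-E: (Gly, Val), (Glu, Asp), (Asn, Val), (Arg, Ile), (Lys, Gly) + conserved Ala.
seqs = ["GVA", "EDA", "NVA", "RIA", "KGA"]
coords, keep = entropy_weighted_projection(seqs, "ACDEFGHIKLMNPQRSTVWY")
ent = site_entropy(seqs)
```

    The conserved third site has zero entropy and is dropped, so only the two variable sites shape the projection.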


  • Influenza reassortant detection

    [Figure: projection onto PC 1 (x-axis) and PC 2 (y-axis) with segments labeled 1 to 8. Legend: cH1N1 (ref); Test Virus.]

    ◮ Test virus: A/Swine/North Carolina/35922/98 (H3N2)

    ◮ Reference virus: Classical swine H1N1 virus

    ◮ Genetic sequence data conversion for all segments for test and ref. viruses.

    ◮ PCA projection: Computed PCs using reference virus. Projected test virus onto 'pre-computed' PCs.

    ◮ Result: Test virus's segments 2, 4, and 6 did not pair with the respective reference virus segments.
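
    The idea of computing PCs from the reference and then projecting the test virus onto them can be sketched as below. The data are synthetic, the names `fit_pcs` and `project` are hypothetical, and representing every segment as an equal-length binary row is an assumption; the slides do not say how segments of different lengths were handled.

```python
import numpy as np

def fit_pcs(X_ref, k=2):
    """Column mean and k leading principal components of the reference data."""
    mean = X_ref.mean(axis=0)
    _, _, Vt = np.linalg.svd(X_ref - mean, full_matrices=False)
    return mean, Vt.T[:, :k]

def project(X, mean, V):
    """Project rows of X onto pre-computed components, using the reference mean."""
    return (X - mean) @ V

rng = np.random.default_rng(0)
X_ref = rng.integers(0, 2, size=(8, 40)).astype(float)   # synthetic stand-in, 8 segments
X_test = X_ref.copy()
X_test[[1, 3, 5]] = rng.integers(0, 2, size=(3, 40))      # segments 2, 4, 6 replaced

mean, V = fit_pcs(X_ref)
# Distance between each test segment and its reference counterpart in PC space.
gap = np.linalg.norm(project(X_test, mean, V) - project(X_ref, mean, V), axis=1)
```

    Segments that match the reference land on top of their counterparts (`gap` of zero), while replaced segments are expected to land elsewhere.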
