Evolution and Vaccination of Influenza Virus (slides: IJCAI-BAI15-slides.pdf)
-
Evolution and Vaccination of Influenza Virus
Ham Ching Lam, Srinand Sreevatsan, Daniel Boley
University of Minnesota
supported in part by NSF 1319749, NIH N266200700007C, and others
7/28/2015
1 / 31
-
Presentation Overview
◮ Introduction
  ◮ Influenza virus and influenza vaccines
◮ High Throughput Genetic Sequence Analysis
  ◮ Influenza evolution visualization
◮ Multi-class separateness analysis
  ◮ From visualization to quantification
◮ Results
◮ Related works
◮ Conclusions
2 / 31
-
Influenza Virus and Influenza Vaccine
Influenza virus
◮ About 40% of the global population is infected in a single year [WHO 2014]
◮ Type A: rich diversity, classified by antigenicity of HA and NA molecules (H1N1, H3N2, H5N1, ..., H18).
◮ Evolution mechanisms:
  ◮ Antigenic drift: gradual point mutations that can lead to antigenic changes.
  ◮ Antigenic shift/reassortment: rearrangement of viral gene segments in cells infected with 2 or more viruses.
Influenza vaccine
◮ Stimulates host antibody against hemagglutinin (HA) and neuraminidase (NA).
◮ Yearly evaluation of seasonal human influenza vaccine components.
◮ Each vaccine component is selected to target a specific strain of an influenza virus.
◮ Vaccine components:
  ◮ From late 1930s, A/H1N1 (10 updates)
  ◮ From 1968, A/H3N2 (23 updates)
  ◮ From late 1970s, Type B (16 updates)
3 / 31
-
Does vaccine drive the evolution of influenza?
◮ Seasonal human A/H3N2 1971 - 2009
◮ Repeated emergence of new antigenic drift variants
◮ New vaccine is introduced frequently.
[Figure: A/H3N2, HA1 domain, plotted against time. Legend: dN/dS ratio > 1; dN/dS ratio 0.8−1; repeated vaccine; new vaccine introduction.]
H.C. Lam et al., 30th American Society for Virology Meeting, 2011.
4 / 31
-
High Throughput Genetic Sequence Analysis
Overall concepts: visualize influenza evolution
◮ A departure from traditional phylogenetic approach.
◮ No assumption made about ancestors or ancestry relationships.
◮ No bias.
◮ Dimension reduction of non-numerical, fixed-length genetic sequence data.
◮ Expose hidden data structure/patterns.
Advantages:
◮ Enables hypothesis generation because it is able to:
  ◮ Examine data comprehensively and give an efficient initial 'look'.
  ◮ Achieve high coverage in dense and sparse data
    ◮ Dense: most matrix elements are non-zero.
    ◮ Sparse: the majority of matrix elements are zero.
5 / 31
-
Visualization Method:
1. Genetic Sequence Data Conversion.
  ◮ Binary encoding of nucleotides:
    ◮ A → 1 0 0 0
    ◮ C → 0 1 0 0
    ◮ G → 0 0 1 0
    ◮ T → 0 0 0 1
  ◮ Each coded nucleotide is equidistant from every other.
  ◮ Form data matrix X: rows (HA seqs), columns (residues).
2. Apply Singular Value Decomposition (SVD) to X to reduce dimensions and eliminate noise.
3. Principal Components are the columns of V.
4. Select leading 2 or 3 components for visualization.
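The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the helper names `encode` and `pca_coords` and the toy sequences are hypothetical, and centering the matrix before the SVD is an assumption borrowed from the PCA background slide later in the deck.

```python
import numpy as np

# One-hot code per nucleotide, as listed in step 1
CODE = {"A": (1, 0, 0, 0), "C": (0, 1, 0, 0), "G": (0, 0, 1, 0), "T": (0, 0, 0, 1)}

def encode(seqs):
    """Stack equal-length sequences into a binary matrix, one row per sequence."""
    return np.array([[bit for nt in s for bit in CODE[nt]] for s in seqs], dtype=float)

def pca_coords(X, k=2):
    """Project the centered rows of X onto the k leading right singular vectors."""
    Xc = X - X.mean(axis=0)                      # assumption: center columns first
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                         # columns of V are the components

seqs = ["ACGT", "ACGA", "TCGA", "TTGA"]          # toy sequences, not real HA1 data
X = encode(seqs)                                 # shape: (4 sequences, 4 sites * 4 bits)
Z = pca_coords(X, k=2)                           # 2D coordinates for plotting
```

Each row of `Z` is one sequence in the 2D view; with real data the rows would be the 987-nucleotide HA1 sequences.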
6 / 31
-
Hemagglutinin Sequence data
◮ Multiple flu sequence databases:
  ◮ NCBI Influenza Virus Resource
  ◮ Influenza Research Database
  ◮ EpiFlu
◮ Multiple sequence alignment; removed the few sequences with:
  ◮ gaps "-"
  ◮ "wildcard characters": W, N, S
  ◮ 'partially completed'
◮ Fixed-length HA1 seqs: 987 nucleotides.
  ◮ "Vacc controlled": 3 human, 1 avian samples
  ◮ "Non-Vacc controlled": 1 human, 1 avian samples
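The filtering rule above (drop sequences with gaps, wildcard characters, or incomplete length) can be expressed as a simple predicate. This is a sketch under assumptions: the function name `keep_sequence` is hypothetical, and treating any character outside A, C, G, T as a wildcard is a simplification of whatever the actual pipeline did.

```python
VALID = set("ACGT")

def keep_sequence(seq, length=987):
    """Keep a sequence only if it is complete and contains only A, C, G, T."""
    return len(seq) == length and set(seq.upper()) <= VALID

# Toy check with a short length; real HA1 sequences would use length=987
candidates = ["ACGT", "AC-T", "ACNT", "ACG"]
kept = [s for s in candidates if keep_sequence(s, length=4)]
```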
Samples                   Years       Seqs
Human A/H1N1              1918-2013   2140
Human A/H3N2              1968-2009   175(235)
Human Type B (Vic/Yam)    1970-2013   818
Human H5N1                1997-2012   127(128)
Avian H5 (Mexico)         1994-2002   32
Avian H5 (China)          1997-2002   32
7 / 31
-
Visualization Result: Human sample (A/H3N2)
Vaccine controlled
[Figure: A/H3N2 1968−2009 in PCA space, 1st PC vs. 2nd PC; legend years: 1970, 1980, 1985, 1990, 2002, 2005; vaccine strains labeled 1 to 14.]
◮ Viruses evolving in a restricted direction.
◮ Chronological pattern appears to be nonrandom.
◮ Viruses clustered by isolation years.
◮ Undergone antigenic drift away from older/vaccine strains.
[Figure: 3D view with PC1 and PC2 on the horizontal axes and Hamming distance on the vertical axis.]
Vaccines
1: Aichi/1968
2: Port Chalmers/1/1973
3: Philippines/2/1982
4: Shanghai/11/1987
5: Beijing/353/1989
6: Shangdong/9/1993
7: Johannesburg/33/1994
8: Sydney/5/1997
9: Moscow/10/1999
10: Fujian/411/2002
11: California/7/2004
12: Wisconsin/67/2005
13: Brisbane/10/2007
14: Perth/16/2009
Lam et al., 2012
8 / 31
-
Visualization Result: Human sample (Type B)
Vaccine controlled
[Figure: Type B in PCA space, PC1 vs. PC2; Yamagata and Victoria lineages; vaccine strains labeled by year (1972, 1979, 1983, 1986, 1988−Yamagata, 1987−Victoria, 1990, 1993, 1999, 2001, 2002, 2004, 2006, 2008, 2010, 2012); legend years: 1970, 1980, 1985, 1990, 2002, 2012.]
◮ Diverged from a single lineage prior to 1980 into antigenically distinct Victoria (blue) and Yamagata (red) lineages.
◮ Chronological patterns appear to be nonrandom.
◮ Viruses clustered by isolation years.
◮ Undergone antigenic drift away from older/vaccine strains.
[Figure: 3D view with PC1 and PC2 on the horizontal axes and Hamming distance on the vertical axis; same vaccine-year labels.]
Vaccines
B/Hong Kong/05/1972
B/Singapore/222/79
B/USSR/100/83
B/Ann Arbor/1/86
B/Beijing/1/1987
B/Yamagata/16/88
B/Panama/45/90
B/Beijing/184/93
B/Sichuan/379/99
B/Hong Kong/330/2001
B/Shanghai/361/2002
B/Malaysia/2506/2004
B/Florida/4/2006
B/Brisbane/60/2008
B/Wisconsin/01/2010
B/Massachusetts/02/2012
9 / 31
-
Visualization Result: Human sample (A/H1N1)
Vaccine controlled
◮ Split between pre-2009 and post-2009 viruses.
◮ Pandemic A/H1N1pdm09 overtook classical A/H1N1 (black) after the split.
◮ Isolates from 2010 onward appeared to have diverged.
◮ Undergone antigenic drift away from older/vaccine strains.
10 / 31
-
Visualization Result: Human sample
Non-vaccine controlled Human H5N1
[Figure: Human H5N1 in 3D PCA space (PC1, PC2, PC3); legend: 1997, 2001−2004, 2005−2007, 2008−2012.]
◮ Appears to diverge into 3 lineages.
◮ Clusters contain viruses spanning a longer time period.
11 / 31
-
Visualization Result: Avian samples
[Figure: Vaccinated H5 sample in 3D PCA space (1st, 2nd, 3rd PC); legend years: 1994, 1996, 1998, 2000, 2001, 2002.]
* Vaccine controlled: Mexico avian H5
  * Directional evolutionary trend.
  * Diverged and established into separate lineages after early 2000s [Lee2004].
  * Clusters contain viruses by isolation year.
  * Undergone antigenic drift away from vaccine strain (A/Ck/Mexico/CPA0232/1994).
* Non-vaccine controlled: China avian H5
  * No obvious evolutionary trend.
  * Clusters contain viruses spanning a longer time period.
  * Early and late isolates overlap.
  * Appears to be more scattered.
12 / 31
-
From visualization to quantification
Visualization provides ’pictorial evidence’:
◮ In 2D and 3D PCA space:
  ◮ Vaccine controlled:
    * Distinct, restricted, nonrandom directional trend.
    * Viruses with the same isolation year appear to cluster together.
  ◮ Non-vaccine controlled:
    * Less obvious directional evolution.
    * Larger spread of clusters.
    * Clusters contain viruses spanning a longer time period.
Quantification: gives some form of statistical confidence
◮ Quantify visualization by measuring "distance" in terms of 'standard deviations'
13 / 31
-
Multi-class separateness analysis
◮ Vaccine components are evaluated/updated every year.
◮ Viruses tend to cluster by isolation year.
Determine the cohesiveness of the viruses in each year
** Distances between points and their respective centers in reduced space.
1. Visualization results as inputs:
  ◮ Two-dimensional PCA coordinates.
  ◮ Class label: isolation year of each virus.
    HA seq header: AF201875 A/Minnesota/1/1993
2. Compute class separateness value λo for each sample.
3. Determine the significance of λo using a class-label randomization simulation.
14 / 31
-
Class separateness measure: compute λ
* C: number of classes
* N_i: number of data points in class i = 1, 2, ..., C
* u_i: mean vector of class i = 1, 2, ..., C
* N_T: total number of points
* Sought the trace(W) and trace(B): sum of squared distances between points and their respective centers.
◮ W: within-cluster scatter matrix
  ◮ $W = \sum_{i=1}^{C} \sum_{j=1}^{N_i} (x_j - u_i)(x_j - u_i)^T$
  ◮ u_i: mean of class i.
◮ B: between-cluster scatter matrix
  ◮ $B = \frac{1}{N_T} \sum_{i=1}^{C} N_i (u_i - M)(u_i - M)^T$
  ◮ $M = \frac{1}{N_T} \sum_{i=1}^{C} N_i u_i$ ("global mean of dataset")
◮ $\lambda = \frac{\operatorname{tr}(B)}{\operatorname{tr}(W)}$
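A small sketch of the λ computation may help make the definitions above concrete. It follows the formulas as written on this slide, including the 1/N_T factor on B; the function name `separateness` and the toy coordinates are hypothetical, and only the traces are computed since λ needs nothing else.

```python
import numpy as np

def separateness(X, labels):
    """Return lambda = tr(B) / tr(W) for points X (n x d) and their class labels."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    M = X.mean(axis=0)                 # global mean; equals the N_i-weighted mean of class means
    tr_W = 0.0
    tr_B = 0.0
    for c in np.unique(labels):
        pts = X[labels == c]
        u = pts.mean(axis=0)           # class mean u_i
        tr_W += ((pts - u) ** 2).sum()             # squared distances to the class center
        tr_B += len(pts) * ((u - M) ** 2).sum()    # N_i * squared distance of center to M
    tr_B /= len(X)                     # 1/N_T factor, as written on the slide
    return tr_B / tr_W

# Toy 2D "PCA coordinates" with isolation years as class labels
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
labels = np.array([1993, 1993, 1994, 1994])
lam = separateness(X, labels)          # tr(B) = 25, tr(W) = 4 for this toy input
```

Tight, well-separated yearly clusters give a large λ; overlapping clusters push it toward zero.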
15 / 31
-
Class labels randomization
Significance of λo
Using a "distance measure" as a surrogate for the probability of observing the observed λo by chance.
Algorithm:
Let $\lambda_o = \frac{\operatorname{tr}(B_o)}{\operatorname{tr}(W_o)}$ be the observed separateness value.
Repeat j = 1, ..., K2:
  Repeat i = 1, ..., K1:
    Generate a randomization of the class labels.
    Compute the within-cluster scatter W.
    Compute the ratio $\lambda_i = \frac{\operatorname{tr}(B)}{\operatorname{tr}(W)} = \frac{\operatorname{tr}(T) - \operatorname{tr}(W)}{\operatorname{tr}(W)}$, where T is the total scatter matrix.
  Compute the mean µ and std σ of all λ_i, i = 1, ..., K1.
  Compute the distance $d_j = \frac{\mu - \lambda_o}{\sigma}$.
Compute the mean $\bar{d}$ and std $\hat{d}$ of all d_j, j = 1, ..., K2.
Report the distance of λo from the mean in the form $\bar{d} \pm \hat{d}$.
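The randomization loop above might be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the function names, the small K1 and K2 defaults, and the synthetic two-cluster data are hypothetical, and the distance is reported as a magnitude (the slide writes µ − λo, while the results table lists positive values). The sketch uses the unnormalized identity tr(B) = tr(T) − tr(W); a constant factor on B rescales λo, µ, and σ alike and leaves the distance unchanged.

```python
import numpy as np

def trace_within(X, labels):
    """Sum of squared distances from each point to its class mean."""
    return sum(((X[labels == c] - X[labels == c].mean(axis=0)) ** 2).sum()
               for c in np.unique(labels))

def separateness(X, labels):
    """lambda = (tr(T) - tr(W)) / tr(W), with T the total scatter."""
    tr_T = ((X - X.mean(axis=0)) ** 2).sum()
    tr_W = trace_within(X, labels)
    return (tr_T - tr_W) / tr_W

def randomization_distance(X, labels, K1=200, K2=5, seed=0):
    """Distance of the observed lambda from shuffled-label lambdas, in std units."""
    rng = np.random.default_rng(seed)
    lam_o = separateness(X, labels)
    d = []
    for _ in range(K2):
        lams = np.array([separateness(X, rng.permutation(labels)) for _ in range(K1)])
        d.append(abs(lams.mean() - lam_o) / lams.std())   # assumption: report magnitude
    d = np.array(d)
    return lam_o, d.mean(), d.std()

# Synthetic stand-in for 2D PCA coordinates: two well-separated "years"
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(5, 1, (20, 2))])
labels = np.repeat([1993, 1994], 20)
lam_o, d_mean, d_std = randomization_distance(X, labels)
```

The returned pair `d_mean`, `d_std` corresponds to the reported form of the distance, mean plus or minus std.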
16 / 31
-
Simulation results: Human samples
** Vaccine controlled [A/H3N2, Type B (Yamagata), A/H1N1] **
[Figure: histograms of lambda values (x-axis: Lambda values, y-axis: Frequency) for Human A/H3N2, Type B − Yamagata Lineage, and Human A/H1N1, each shown at two x-axis ranges (roughly 0 to 0.4, and 0 to 35).]
◮ Histogram: distribution of λ_i, i = 1, ..., K1
◮ 100 bins
◮ λo: observed class separateness value (far right).
◮ The area under the tail of the distributions beyond the observed separateness values was below the rounding error of 10^−16, which made the computation of a p-value impossible.
17 / 31
-
Simulation results: Human samples
Vaccine controlled: Type B (Victoria)
[Figure: histograms of lambda values (x-axis: Lambda values, y-axis: Frequency) for Type B − Victoria Lineage, shown at two x-axis ranges.]
** Non-Vaccine controlled Human H5N1 **
[Figure: histogram of lambda values (x-axis: Lambda values, y-axis: Frequency) for Human H5N1.]
◮ Histogram: distribution of λ_i, i = 1, ..., K1
◮ 100 bins
◮ λo: observed class separateness value (far right).
18 / 31
-
Simulation results: Avian samples
** Non-vaccine controlled **
[Figure: histogram of lambda values (x-axis: Lambda values, y-axis: Frequency) for Avian H5.]
** Vaccine controlled **
[Figure: histogram of lambda values (x-axis: Lambda values, y-axis: Frequency) for Avian H5 (Vaccinated).]
◮ Histogram: distribution of λ_i, i = 1, ..., K1
◮ 100 bins
◮ λo: observed class separateness value.
19 / 31
-
Class Separateness Results
Table: Human samples
Sample              λo      Distance          Group
A/H3N2              30.5    978.3 ± .031      Vaccine controlled
B (Victoria)        26.3    1310 ± .02        Vaccine controlled
B (Yamagata)        25.3    1327.8 ± .019     Vaccine controlled
A/H1N1              24.7    617.2 ± .04       Vaccine controlled
H5N1                1.01    34.8 ± .029       Non-vaccine controlled
Table: Avian samples
Sample              λo      Distance          Group
Avian H5 (Mexico)   1.7     12.23 ± .11       Vaccine controlled
H5 (China)          0.268   3.16 ± .06        Non-vaccine controlled
20 / 31
-
Related Works
Influenza cartography [Smith 2003]:
◮ Requires hemagglutination inhibition (HI) assay data.
◮ Build pairwise distance matrix K from HI data.
◮ Run multidimensional scaling on K.
◮ Plot the leading 2 'eigenvectors'.
◮ Can be used to evaluate vaccine strains.
Others
◮ Wet lab approach: vaccinated vs. unvaccinated mice
  * Hensley et al. [Science, 2009]
  ** Influenza virus mutated in a vaccine-protected environment.
◮ Binary conversion of genetic sequences.
  * Chaitanya Muralidhara, Orly Alter [PLOS ONE, 2011]
    – Studied the evolutionary pathways with a 6-bit encoding scheme.
  * Sagara et al. [Nucleic Acids Res, 1998]
    – Detect sequence identity in short gene segments.
21 / 31
-
Conclusions
Based on publicly available genetic sequence data alone, we were able to:
◮ Show that the high throughput approach was able to
  ◮ expose hidden distinctive patterns in high dimensional data.
  ◮ distinguish vaccinated from non-vaccinated populations.
  ◮ reveal evolutionary paths of different lineages.
◮ Facilitate a quantitative interpretation of the visualization results
  ◮ Vaccine controlled influenza viruses showed higher 'cohesiveness' in each year than non-vaccine controlled influenza viruses.
  ◮ Analysis indicated that vaccine as an evolution driver cannot be completely ruled out.
22 / 31
-
Thank you
and
Questions ?
23 / 31
-
Hamming distance and PCA distance
◮ Every single change in the genetic sequence alphabet corresponds to changes to 2 bits in the binary encoding.
◮ Let $\|s - t\|_H$ denote the pairwise Hamming distance between two strains s, t (number of differences in genetic sequences).
◮ Let $\|s - t\|_{\mathrm{bin}\,1}$, $\|s - t\|_{\mathrm{bin}\,2}$ denote the distance between the binary encodings of the two sequences (1-norm and 2-norm, respectively).
◮ Let $\|s - t\|_{\mathrm{proj}}$ denote the 2-norm distance in the lower dimensional space after projection onto the leading principal components.
◮ $\|s - t\|_{\mathrm{proj}}^2 \le \|s - t\|_{\mathrm{bin}\,2}^2 = \|s - t\|_{\mathrm{bin}\,1} = 2\|s - t\|_H$.
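The chain of (in)equalities above can be checked numerically on a toy pair of sequences. The sequences and the random orthonormal directions are made up for illustration; any orthonormal projection (not just the leading principal components) satisfies the inequality, which is all this sketch verifies.

```python
import numpy as np

CODE = {"A": (1, 0, 0, 0), "C": (0, 1, 0, 0), "G": (0, 0, 1, 0), "T": (0, 0, 0, 1)}

def encode(seq):
    """Concatenate the 4-bit codes of every nucleotide in seq."""
    return np.array([bit for nt in seq for bit in CODE[nt]], dtype=float)

s, t = "ACGTAC", "ACCTAA"                       # toy sequences differing at 2 sites
hamming = sum(a != b for a, b in zip(s, t))     # number of differing sites
diff = encode(s) - encode(t)
bin1 = np.abs(diff).sum()                       # 1-norm of the binary difference
bin2_sq = (diff ** 2).sum()                     # squared 2-norm of the binary difference

# Orthonormal 2D basis standing in for the leading principal components
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(diff.size, 2)))
proj_sq = ((diff @ Q) ** 2).sum()               # squared distance after projection
```

Each differing site flips exactly two bits, so `bin1` and `bin2_sq` both equal `2 * hamming`, and `proj_sq` cannot exceed them.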
24 / 31
-
Hamming distance and PCA distance
[Figure: A/H3N2; x-axis: Strains, y-axis: Distance values; series: PCA 2D distance and sequence distance in full space.]
◮ Pairwise distance from the oldest strain (A/Hong Kong/68 1968) to every other strain in the A/H3N2 dataset, in both the reduced 2-dimensional PCA space and the full sequence space.
◮ The high agreement of the PCA 2D distance with the pairwise distance computed in full sequence space is indicated by the Pearson correlation coefficient of 0.9792.
25 / 31
-
Principal Component Analysis (PCA)
◮ Given a data matrix X of size m (strains) by n (residues), we want to reveal the most important properties in X as combinations of the original properties.
◮ The variance is the indicator of importance, so we need to form the covariance matrix.
◮ Center X by subtracting the column means: replace X with $\hat{X} = X - \frac{1}{m} e e^T X$, where e is a column vector of all ones.
◮ Obtain the covariance matrix C from $\hat{X}$ by $C = \frac{1}{m-1} \hat{X}^T \hat{X}$.
◮ Eigenvalue decomposition $C = S \Lambda S^T$ gives the principal components.
◮ Final transformed data: $Z = \hat{X} S$.
◮ Compute PCA using the SVD matrix decomposition $\hat{X} = U \Sigma V^T$.
  ◮ Seek a set of orthonormal axes that decorrelate $\hat{X}$ by finding the eigenvectors of $\hat{X}^T \hat{X} = V \Sigma^2 V^T$.
  ◮ Orthonormal axes (principal coordinates) are the new basis for the data.
  ◮ Projecting the centered data onto the new basis gives the "PCA view" of the data, with mean zero and variance maximized.
  ◮ Final transformed data: $Z = \hat{X} V$.
26 / 31
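The two routes on this slide (eigendecomposition of the covariance matrix and SVD of the centered data) give the same transformed data up to the sign of each axis. The sketch below checks this on a random matrix; the data are a toy stand-in for the strain-by-residue matrix, and the variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 5))            # toy stand-in: m = 8 strains, n = 5 columns
m = X.shape[0]
e = np.ones((m, 1))
Xc = X - e @ e.T @ X / m               # centering: X - (1/m) e e^T X

# Route 1: eigendecomposition of the covariance matrix C = S Lambda S^T
C = Xc.T @ Xc / (m - 1)
evals, S = np.linalg.eigh(C)           # eigh returns eigenvalues in ascending order
order = np.argsort(evals)[::-1]
evals, S = evals[order], S[:, order]

# Route 2: SVD of the centered data, Xc = U Sigma V^T
U, sig, Vt = np.linalg.svd(Xc, full_matrices=False)

Z_eig = Xc @ S[:, :2]                  # Z = Xc S (leading 2 components)
Z_svd = Xc @ Vt[:2].T                  # Z = Xc V (leading 2 components)
```

The eigenvalues of C equal the squared singular values divided by m − 1, which is why the SVD route avoids forming C explicitly.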
-
Markov Model
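[Figure: Markov chain state diagram with states 0, 1, 2, ..., N-1, N and transition labels x_1, x_2, ..., x_n; y_0 = 0, y_1, y_2, ..., y_{n-1}, y_n; z_0 = 1, z_1, ..., z_{n-1}.]
◮ Poisson: $p_t(Y) = \frac{(Y\lambda)^t}{t!} e^{-Y\lambda}$
  * t: number of mutations, Y: years, λ: rate of mutation
◮ Markov chain: $q_t(k) = \sum_{i=0}^{k} v_i^{t}$.
◮ Combined: $P_\kappa(Y) = \sum_{t=0}^{\infty} p_t(Y) \cdot q_t(\kappa)$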
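The combined formula is an infinite sum, which in practice would be truncated. The sketch below shows only that mechanics: the Poisson term is computed in log space, and the Markov term q_t(κ) is passed in as a callable because the slide does not fully specify the chain. The function names, the `t_max` cutoff, the rate value, and the trivial `q` used in the sanity check are all hypothetical.

```python
import math

def poisson_pmf(t, years, rate):
    """p_t(Y) = (Y*rate)^t / t! * exp(-Y*rate), evaluated in log space for stability."""
    mean = years * rate
    return math.exp(t * math.log(mean) - mean - math.lgamma(t + 1))

def combined(kappa, years, rate, q, t_max=500):
    """Truncated version of P_kappa(Y) = sum over t of p_t(Y) * q(t, kappa)."""
    return sum(poisson_pmf(t, years, rate) * q(t, kappa) for t in range(t_max + 1))

# Sanity check with a placeholder q that is always 1: the sum is then just
# the total Poisson mass, which should be very close to 1.
total = combined(kappa=20, years=72, rate=0.5, q=lambda t, kappa: 1.0)
```

Here `rate` is a per-sequence rate per year (an assumption for the sketch); a real `q` would return the Markov-chain probability of ending within κ differences after t mutations.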
27 / 31
-
Markov Model: Results
Strain                           H    Y    EG     P-value
A/South Carolina/1/1918          0    0    0      source seq
A/swine/St-Hyacinthe/148/1990    20   72   47.3   6.349e-06
Table: H1N1 subtype long time gap strains (rate: 2×10^−3 per site per year). H = Hamming distance, Y = years, EG = expected number of mutations.
The model predicts that the probability of finding a highly similar virus after several decades is extremely small. The existence of recent viruses that are very similar to older viruses suggests that some reservoir may exist which preserves viruses over long periods.
28 / 31
-
dN/dS ratio calculation
◮ Mirror actual flu season scenario.
◮ 23 flu seasons.
◮ Pairwise dN/dS ratio computation of vaccine strain (V) against circulating strain (C).
◮ Assumes C are the immune escape mutants caused by the previous flu season.
29 / 31
-
North American swine H3N2 Influenza clusters
Site   A     B     C     D     E
142    Gly   Glu   Asn   Arg   Lys
144    Val   Asp   Val   Ile   Gly
◮ Data: HA protein sequences
◮ Convert data to binary matrix B
◮ Compute Shannon entropy for each site on HA
◮ Method: Construct diagonal weight matrix W based on computed entropy values.
◮ Select sites with highest entropy values to form matrix W̄.
◮ Apply W̄ to B to yield M
◮ Apply SVD to M matrix
◮ Plot top 2 leading components.
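The entropy-weighting steps above might look like the following. This is a hedged sketch: the toy three-residue "proteins", the one-hot layout over the observed alphabet, the choice to use the entropy value itself as the weight (zero for unselected sites), and centering before the SVD are all assumptions, since the slide does not spell out these details.

```python
import numpy as np
from collections import Counter

def site_entropy(seqs):
    """Shannon entropy (bits) of the residue distribution at each aligned site."""
    n = len(seqs)
    ent = []
    for column in zip(*seqs):
        p = np.array(list(Counter(column).values()), dtype=float) / n
        ent.append(-(p * np.log2(p)).sum())
    return np.array(ent)

def one_hot(seqs, alphabet):
    """Binary matrix B: one block of len(alphabet) columns per site."""
    idx = {a: i for i, a in enumerate(alphabet)}
    B = np.zeros((len(seqs), len(seqs[0]) * len(alphabet)))
    for r, s in enumerate(seqs):
        for site, a in enumerate(s):
            B[r, site * len(alphabet) + idx[a]] = 1.0
    return B

seqs = ["GVA", "EDA", "NVA", "RIA"]          # toy residues, not real HA sequences
alphabet = sorted(set("".join(seqs)))
H = site_entropy(seqs)                       # per-site entropy
B = one_hot(seqs, alphabet)

k = 2                                        # keep the k highest-entropy sites
keep = np.argsort(H)[::-1][:k]
w_site = np.zeros_like(H)
w_site[keep] = H[keep]                       # assumption: weight = entropy, else 0
w = np.repeat(w_site, len(alphabet))         # diagonal of the weight matrix
M = B * w                                    # same as B @ np.diag(w)

Mc = M - M.mean(axis=0)
_, _, Vt = np.linalg.svd(Mc, full_matrices=False)
coords = Mc @ Vt[:2].T                       # top 2 components for plotting
```

Conserved sites (entropy 0) contribute nothing to `M`, so the plot is driven by the variable positions.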
30 / 31
-
Influenza reassortant detection
[Figure: PCA projection, PC 1 vs. PC 2, of gene segments 1 to 8 for cH1N1 (ref) and the test virus.]
◮ Test virus: A/Swine/North Carolina/35922/98 (H3N2)
◮ Reference virus: Classical swine H1N1 virus
◮ Genetic sequence data conversion for all segments for test and ref. viruses.
◮ PCA projection: Computed PCs using reference virus. Projected test virus onto 'pre-computed' PCs.
◮ Result: Test virus's segments 2, 4, and 6 did not pair with the respective reference virus segments.
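The "pre-computed PCs" idea above can be sketched as: fit the components on the reference segments only, then project both reference and test segments with the same mean and basis and compare positions. Everything concrete here is hypothetical: the random binary rows stand in for encoded segments (assumed to be brought to equal length somehow, which the slide does not describe), and the function names are illustrative.

```python
import numpy as np

def fit_pcs(ref, k=2):
    """Compute the mean and k leading principal directions from reference rows only."""
    mean = ref.mean(axis=0)
    _, _, Vt = np.linalg.svd(ref - mean, full_matrices=False)
    return mean, Vt[:k].T

def project(X, mean, V):
    """Project rows of X using a mean and basis computed elsewhere."""
    return (X - mean) @ V

rng = np.random.default_rng(0)
ref = rng.integers(0, 2, size=(8, 40)).astype(float)   # toy: 8 encoded reference segments
test = ref.copy()
test[[1, 3, 5]] = rng.integers(0, 2, size=(3, 40))     # pretend segments 2, 4, 6 differ

mean, V = fit_pcs(ref)
gap = np.linalg.norm(project(test, mean, V) - project(ref, mean, V), axis=1)
```

Segments whose `gap` is near zero pair with the reference; a large gap flags a segment that may come from a different lineage.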
31 / 31
Appendix