27/06/2005 ISMB 2005
GenXHC: A Probabilistic Generative Model for Cross-hybridization Compensation in High-density Genome-wide Microarray Data
Jim Huang(1)
Joint work with Quaid Morris(1),(2), Tim Hughes(2) and Brendan Frey(1),(2)
(1) Probabilistic and Statistical Inference Group, University of Toronto
(2) Banting & Best Department of Medical Research, University of Toronto
Genome-wide profiling using high-density microarrays
• The move towards high-density arrays for genome-wide profiling presents challenges…
[Figure: expression data arranged as probes by conditions, with the probes tiled along the coding regions of the genome]
Cross-hybridization in high-density microarrays
• As we move to higher-density arrays, cross-hybridization noise becomes significant and unavoidable
[Figure: oligonucleotide probes and an mRNA transcript, contrasting hybridization with cross-hybridization]
Cross-hybridization in high-density microarrays (cont’d)
• Large cross-hybridization noise component in high-density data!
Cross-hybridization compensation
• State-of-the-art methods for cross-hybridization compensation designed for Affymetrix GeneChips
• Affymetrix MAS 5.0
• Robust Multi-array Analysis (RMA/GC-RMA)(1),(2)
(1) Wu, Z. and Irizarry, R.A. (2004) Stochastic models inspired by hybridization theory for short oligonucleotide arrays. Proc. Ninth International Conference on Research in Computational Molecular Biology (RECOMB), March 2004, pp. 98-106.
(2) Irizarry, R.A. et al. (2003) Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics, 4, pp. 249-264.
Bilinear model for cross-hybridization
• Each probe is assigned a set of cross-hybridizing transcript expression profiles
• Each transcript has a hybridization weight λ that determines its contribution
$$x_i = \sum_j \lambda_{ij} z_j$$
[Figure: the hybridization and cross-hybridization diagram annotated with the model variables X (probe measurements), Λ (hybridization weights) and Z (transcript expression)]
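As a minimal sketch of the bilinear model, the noise-free intensity of one probe can be computed as a weighted sum over the transcripts that bind it. The function name, the dictionary layout and the numbers below are illustrative assumptions, not values from the data.

```python
def probe_intensity(weights, expression):
    """Noise-free intensity of one probe: sum over transcripts j of lambda_ij * z_j.

    weights    -- dict mapping transcript name -> hybridization weight (lambda_ij)
    expression -- dict mapping transcript name -> expression level (z_j)
    """
    return sum(w * expression[name] for name, w in weights.items())


# Hypothetical probe: binds its target strongly and a second transcript weakly.
weights = {"target": 1.0, "off_target": 0.25}
expression = {"target": 2.0, "off_target": 8.0, "unrelated": 5.0}
intensity = probe_intensity(weights, expression)  # 1.0 * 2.0 + 0.25 * 8.0
```

Transcripts without a weight entry, such as "unrelated" here, contribute nothing to the probe.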
The probabilistic generative model for cross-hybridization
• Model the data probabilistically as
X = ΛZ + V
where
X = [x1 x2 … xT] is N x T,
Z = [z1 z2 … zT] is M x T,
Λ is the N x M hybridization matrix,
V is additive noise
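The matrix form X = ΛZ + V can be simulated directly. The sketch below assumes independent Gaussian noise with a single standard deviation `noise_sd` for every entry, and uses toy sizes (N = 3 probes, M = 2 transcripts, T = 4 conditions); both are assumptions made for illustration.

```python
import random


def matmul(A, B):
    """Plain-Python matrix product of A (n x m) and B (m x t)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def simulate(Lam, Z, noise_sd, rng):
    """Draw X = Lam Z + V with V ~ N(0, noise_sd^2) independently per entry."""
    clean = matmul(Lam, Z)
    return [[x + rng.gauss(0.0, noise_sd) for x in row] for row in clean]


# Toy hybridization matrix (N x M) and transcript profiles (M x T).
Lam = [[1.0, 0.0],
       [0.5, 1.0],
       [0.0, 2.0]]
Z = [[1.0, 2.0, 0.0, 3.0],
     [4.0, 0.0, 1.0, 1.0]]
rng = random.Random(0)
X = simulate(Lam, Z, noise_sd=0.1, rng=rng)        # noisy N x T data
X_clean = simulate(Lam, Z, noise_sd=0.0, rng=rng)  # reduces to the product Lam Z
```

The second probe (row 2 of `Lam`) mixes both transcripts, which is the cross-hybridization the model is meant to undo.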
Sparsity of the Λ matrix
• Force many of the weights λij to 0
• Denote by S the set of weights which are non-zero: the prior becomes
$$p(\Lambda|S) = \prod_{(i,j)\in S} p(\lambda_{ij})$$
where each p(λij) is an exponential density and λij = 0 for (i,j) ∉ S
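To make the sparsity constraint concrete, the sketch below draws a hybridization matrix whose entries are exponentially distributed on S and exactly zero elsewhere. The prior mean `scale` and the example set S are assumed for illustration; the slide does not give their values.

```python
import random


def sample_sparse_weights(n_probes, n_transcripts, S, scale, rng):
    """Sample Lambda with exponential entries on S and exact zeros elsewhere.

    S     -- set of (i, j) index pairs allowed to be non-zero
    scale -- mean of the exponential prior (an assumed hyperparameter)
    """
    Lam = [[0.0] * n_transcripts for _ in range(n_probes)]
    for (i, j) in S:
        # random.expovariate takes a rate, so the mean is 1 / rate.
        Lam[i][j] = rng.expovariate(1.0 / scale)
    return Lam


# Hypothetical sparsity structure for 3 probes and 2 transcripts.
S = {(0, 0), (1, 0), (1, 1), (2, 1)}
Lam = sample_sparse_weights(3, 2, S, scale=1.0, rng=random.Random(1))
```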
The probabilistic generative model for cross-hybridization (cont’d)
• The probabilistic model p(X,Z,Λ|S) for cross-hybridization is therefore
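$$p(X,Z,\Lambda|S) = p(X|Z,\Lambda)\, p(\Lambda|S)\, p(Z)$$
where
$$p(X|Z,\Lambda) = \prod_t p(x_t|z_t,\Lambda) = \prod_t \mathcal{N}(x_t;\, \Lambda z_t, \Psi)$$
$$p(\Lambda|S) = \prod_{(i,j)\in S} p(\lambda_{ij}), \qquad p(Z) = \prod_{j,t} p(z_{jt})$$
and p(λij), p(zjt) are exponential densities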
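The factorization of the joint model can be evaluated numerically as a sum of log terms. This is a sketch under stated assumptions: the noise covariance Ψ is taken to be diagonal with per-probe variances `psi2`, the exponential priors are parameterized by their means `lam_scale` and `z_scale`, and the zero constraint outside S is treated as a hard constraint. None of these names come from the slides.

```python
import math


def log_joint(X, Z, Lam, S, psi2, lam_scale, z_scale):
    """log p(X, Z, Lam | S) for Gaussian noise and exponential priors.

    psi2      -- per-probe noise variances (diagonal of Psi, an assumption)
    lam_scale -- mean of the exponential prior on non-zero weights (assumed)
    z_scale   -- mean of the exponential prior on expression levels (assumed)
    """
    n, m, T = len(Lam), len(Z), len(Z[0])
    # Weights outside S must be zero; weights and expression must be non-negative.
    for i in range(n):
        for j in range(m):
            if ((i, j) not in S and Lam[i][j] != 0.0) or Lam[i][j] < 0.0:
                return -math.inf
    if any(z < 0.0 for row in Z for z in row):
        return -math.inf
    # Exponential log-priors on the weights in S and on every expression level.
    lp = sum(-math.log(lam_scale) - Lam[i][j] / lam_scale for (i, j) in S)
    lp += sum(-math.log(z_scale) - z / z_scale for row in Z for z in row)
    # Gaussian log-likelihood of each measurement around the bilinear prediction.
    for i in range(n):
        for t in range(T):
            mean = sum(Lam[i][j] * Z[j][t] for j in range(m))
            lp += -0.5 * math.log(2 * math.pi * psi2[i])
            lp -= (X[i][t] - mean) ** 2 / (2 * psi2[i])
    return lp


# One probe, one transcript, one condition, perfectly reconstructed.
value = log_joint([[1.0]], [[1.0]], [[1.0]], {(0, 0)}, [1.0], 1.0, 1.0)
```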
Variational inference
• To perform inference, minimize the KL-divergence
$$D(q\|p) = \int_{\Lambda}\int_{Z} q(Z,\Lambda)\, \log\frac{q(Z,\Lambda)}{p(X,Z,\Lambda|S)}\; dZ\, d\Lambda$$
with respect to a distribution q for the given probabilistic model p
• The optimum is the posterior distribution q(Z,Λ) = p(Z,Λ|X,S)
• Difficult to compute exactly!
• Use a surrogate
$$q(Z,\Lambda) = \prod_{(i,j)\in S} q(\lambda_{ij}) \prod_{j,t} q(z_{jt})$$
which approximates the true posterior
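A small discrete example may help show why the optimum is the posterior. For a fixed observation x, the divergence between q and the joint p(x, z) equals −log p(x) plus the KL-divergence between q and the posterior, so it is smallest when q is the posterior. The three-state joint below is a made-up toy, not part of the GenXHC model.

```python
import math


def kl_to_joint(q, joint):
    """D(q || p) = sum_z q(z) log(q(z) / p(x, z)) for a fixed observation x."""
    return sum(qz * math.log(qz / pz) for qz, pz in zip(q, joint) if qz > 0.0)


# Hypothetical joint p(x, z) over three latent states z, for one fixed x.
joint = [0.10, 0.30, 0.20]
evidence = sum(joint)                      # p(x)
posterior = [p / evidence for p in joint]  # p(z | x)

d_posterior = kl_to_joint(posterior, joint)            # equals -log p(x)
d_uniform = kl_to_joint([1 / 3, 1 / 3, 1 / 3], joint)  # any other q is larger
```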
Variational EM for approximate inference and parameter estimation
• Use exponential distributions parameterized by variational parameters for q
$$q(\lambda_{ij}) = \mathrm{Exp}(\lambda_{ij};\, \alpha_{ij}), \qquad q(z_{jt}) = \mathrm{Exp}(z_{jt};\, \beta_{jt})$$
• Minimize KL-divergence via variational EM(2),(3) to get the estimate βjt of the transcript expression profiles:
Variational E-step
$$\hat{\beta}_{jt} = \arg\min_{\beta_{jt}} D(q\|p), \qquad \hat{\alpha}_{ij} = \arg\min_{\alpha_{ij}} D(q\|p)$$
Variational M-step
$$\hat{\psi}_i^2 = \arg\min_{\psi_i^2} D(q\|p)$$
with ψi² the noise variance of probe i
(2) Neal, R.M. and Hinton, G.E. (1998) A view of the EM algorithm that justifies incremental, sparse, and other variants. Learning in Graphical Models, Kluwer Academic Publishers, pp. 355-368.
(3) Jaakkola, T. and Jordan, M.I. (2000) Bayesian parameter estimation via variational methods. Statistics and Computing, 10:1, January 2000, pp. 25-37.
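With exponential factors, the KL-divergence can be written in closed form. The sketch below assumes each q is parameterized by its mean (`alpha` for the weights, `beta` for the expression levels), that the priors are exponentials with means `lam_scale` and `z_scale`, and that the noise is Gaussian with per-probe variance `psi2`. These parameterizations and names are assumptions chosen for illustration and may differ from the ones used in the talk.

```python
import math


def free_energy(X, S, alpha, beta, psi2, lam_scale, z_scale):
    """D(q || p) for q(lambda_ij) = Exp(mean alpha_ij), q(z_jt) = Exp(mean beta_jt).

    alpha -- dict (i, j) -> mean of q(lambda_ij), defined for (i, j) in S
    beta  -- list of lists, beta[j][t] = mean of q(z_jt)
    """
    n, T = len(X), len(X[0])
    F = 0.0
    # E_q[log q - log prior] for an exponential q with mean m and prior mean s:
    #   -(1 + log m) + log s + m / s
    for ij in S:
        F += -(1.0 + math.log(alpha[ij])) + math.log(lam_scale) + alpha[ij] / lam_scale
    for row in beta:
        for b in row:
            F += -(1.0 + math.log(b)) + math.log(z_scale) + b / z_scale
    # Expected negative Gaussian log-likelihood under q.
    for i in range(n):
        cols = [j for (p, j) in S if p == i]
        for t in range(T):
            terms = [alpha[(i, j)] * beta[j][t] for j in cols]
            # E[(lambda z)^2] = 4 a^2 b^2 for independent exponentials with means a, b,
            # so the variance of each product adds 3 a^2 b^2 to the squared residual.
            expected_sq = (X[i][t] - sum(terms)) ** 2 + 3.0 * sum(v * v for v in terms)
            F += 0.5 * math.log(2 * math.pi * psi2[i]) + expected_sq / (2 * psi2[i])
    return F


# One probe, one transcript, one condition with all parameters set to 1.
F0 = free_energy([[1.0]], {(0, 0)}, {(0, 0): 1.0}, [[1.0]], [1.0], 1.0, 1.0)
```

Lower values of this quantity correspond to a surrogate q that is closer to the true posterior.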
Variational Expectation-Maximization algorithm
Variational E-step
[Equations: closed-form updates for the variational parameters of q(λij) and q(zjt), each computed from the data xit, the noise variances and the remaining variational parameters over the pairs in S]
Variational M-step
[Equation: update for the per-probe noise variance, averaged over the T conditions]
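One way such an alternating scheme can be realized is sketched below. It is derived here under explicit assumptions (exponential q factors parameterized by their means `alpha` and `beta`, exponential priors with means `lam_scale` and `z_scale`, Gaussian noise with per-probe variance `psi2`) and is not a transcription of the update equations shown in the talk. Under these assumptions each variational parameter is the positive root of a quadratic, and the noise variance is the average expected squared residual.

```python
import math


def positive_root(A, B):
    """Positive solution of A*v**2 + B*v - 1 = 0 (falls back to 1/B when A == 0)."""
    if A == 0.0:
        return 1.0 / B
    return (-B + math.sqrt(B * B + 4.0 * A)) / (2.0 * A)


def residual_without(X, S_rows, alpha, beta, i, j, t):
    """x_it minus the expected contribution of every transcript except j."""
    return X[i][t] - sum(alpha[(i, k)] * beta[k][t] for k in S_rows[i] if k != j)


def variational_em_sweep(X, S, alpha, beta, psi2, lam_scale, z_scale):
    """One in-place sweep: update q(z), then q(lambda), then the noise variances."""
    n, T, m = len(X), len(X[0]), len(beta)
    S_rows = {i: [j for (p, j) in S if p == i] for i in range(n)}
    S_cols = {j: [i for (i, c) in S if c == j] for j in range(m)}
    # E-step, part 1: means beta_jt of q(z_jt).
    for j in range(m):
        for t in range(T):
            A = 4.0 * sum(alpha[(i, j)] ** 2 / psi2[i] for i in S_cols[j])
            B = 1.0 / z_scale - sum(
                alpha[(i, j)] * residual_without(X, S_rows, alpha, beta, i, j, t) / psi2[i]
                for i in S_cols[j])
            beta[j][t] = positive_root(A, B)
    # E-step, part 2: means alpha_ij of q(lambda_ij).
    for (i, j) in S:
        A = 4.0 * sum(beta[j][t] ** 2 for t in range(T)) / psi2[i]
        B = 1.0 / lam_scale - sum(
            beta[j][t] * residual_without(X, S_rows, alpha, beta, i, j, t)
            for t in range(T)) / psi2[i]
        alpha[(i, j)] = positive_root(A, B)
    # M-step: expected squared residual averaged over the T conditions.
    for i in range(n):
        total = 0.0
        for t in range(T):
            terms = [alpha[(i, j)] * beta[j][t] for j in S_rows[i]]
            total += (X[i][t] - sum(terms)) ** 2 + 3.0 * sum(v * v for v in terms)
        psi2[i] = total / T


# Toy data (assumed): 2 probes, 2 transcripts, 2 conditions.
X = [[2.0, 1.0], [3.0, 0.5]]
S = {(0, 0), (1, 0), (1, 1)}
alpha = {ij: 1.0 for ij in S}
beta = [[1.0, 1.0], [1.0, 1.0]]
psi2 = [1.0, 1.0]
for _ in range(20):
    variational_em_sweep(X, S, alpha, beta, psi2, lam_scale=1.0, z_scale=1.0)
```

After the sweeps, `beta[j][t]` plays the role of the estimated expression of transcript j in condition t. Each update exactly minimizes the objective in one block of parameters, so the KL-divergence cannot increase from one sweep to the next.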
Results
• Agilent exon-tiling microarray data with 26,486 60-mer probes across 12 tissue pools
• Matched each probe to full-length RefSeq cDNAs via BLAST search to determine the sparsity structure S
• Resulting data set contains 9,904 probes matched to 2,905 mouse transcripts
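The step from sequence matches to the sparsity structure S can be sketched as follows. The identifiers and the list of matches are hypothetical; in practice the pairs would come from parsed BLAST output, whose format and filtering thresholds are not specified here.

```python
def build_sparsity(hits, probe_ids, transcript_ids):
    """Turn (probe, transcript) match pairs into a set S of (i, j) index pairs."""
    probe_index = {p: i for i, p in enumerate(probe_ids)}
    transcript_index = {t: j for j, t in enumerate(transcript_ids)}
    return {(probe_index[p], transcript_index[t]) for p, t in hits
            if p in probe_index and t in transcript_index}


# Hypothetical identifiers and matches.
probe_ids = ["probe_a", "probe_b", "probe_c"]
transcript_ids = ["tx_1", "tx_2"]
hits = [("probe_a", "tx_1"), ("probe_b", "tx_1"), ("probe_b", "tx_2"),
        ("probe_c", "tx_2"), ("probe_c", "tx_9")]  # tx_9 is not in the transcript list
S = build_sparsity(hits, probe_ids, transcript_ids)
```

Here "probe_b" matches two transcripts, so its row of Λ would carry two non-zero weights.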
Results (cont’d)
Significance testing of inferred expression profiles
• Randomly permute the rows of the S matrix and perform inference
• Mean SNR significantly lower for permuted data compared to unpermuted data
$$\mathrm{SNR} = 10 \log_{10} \frac{\sum_t \hat{x}_t^2}{\sum_t (x_t - \hat{x}_t)^2}$$
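The permutation test can be sketched in two small pieces: an SNR function in decibels, computed as the ratio of reconstructed signal energy to residual energy, and a helper that shuffles which probe owns which row of S. The function names and toy values are assumptions for illustration; the inference step that would run between permuting S and computing the SNR is omitted.

```python
import math
import random


def snr_db(x, x_hat):
    """10 * log10( sum_t x_hat_t^2 / sum_t (x_t - x_hat_t)^2 )."""
    signal = sum(v * v for v in x_hat)
    noise = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    return 10.0 * math.log10(signal / noise)


def permute_rows(S, n_probes, rng):
    """Reassign each probe's set of matched transcripts to a randomly chosen probe."""
    order = list(range(n_probes))
    rng.shuffle(order)
    return {(order[i], j) for (i, j) in S}


x = [1.0, 2.0]        # measured profile of one probe (toy values)
x_hat = [1.0, 1.0]    # reconstructed profile (toy values)
value = snr_db(x, x_hat)  # 10 * log10(2 / 1)

S = {(0, 0), (1, 0), (1, 1), (2, 1)}
S_perm = permute_rows(S, 3, random.Random(0))
```

Permuting rows keeps the number of matches per transcript unchanged while breaking the link between each probe and its true binding partners.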
Gene Ontology-Biological Process (GO-BP) enrichment using denoised data
• Perform agglomerative hierarchical clustering and compute a hypergeometric p-value for each cluster to evaluate statistical significance of the clustering
• The majority of clusters have increased significance in the denoised data compared to clustering using the noisy data
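The hypergeometric p-value for a single cluster and a single GO-BP category can be computed as an upper tail probability. This is a generic sketch with made-up counts. The clustering step and any correction for multiple testing are left out.

```python
from math import comb


def hypergeom_pvalue(n_genes, n_annotated, cluster_size, overlap):
    """P(at least `overlap` annotated genes in a cluster drawn without replacement)."""
    total = comb(n_genes, cluster_size)
    upper = min(n_annotated, cluster_size)
    tail = sum(comb(n_annotated, k) * comb(n_genes - n_annotated, cluster_size - k)
               for k in range(overlap, upper + 1))
    return tail / total


# Hypothetical counts: 10 genes, 5 carry the category, cluster of 5 with all 5 annotated.
p = hypergeom_pvalue(10, 5, 5, 5)  # 1 / C(10, 5)
```

A smaller p-value means the cluster contains more genes from the category than random grouping would be expected to produce.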
Comparison to Robust Multi-array Analysis
• Unlike RMA, GenXHC models the explicit sparse structure of the set of probe-transcript interactions
• This increases statistical power when doing functional prediction
Summary
• Cross-hybridization compensation using prior knowledge about the transcript population doubles the number of probes on the array
• Problem of inferring latent transcript profiles is one of variational inference
• Functional annotation using denoised data yields functional categories which have higher statistical significance compared to noisy expression data
• Taking into account the set of probe-transcript binding interactions generally yields greater statistical power versus ignoring them