27/06/2005 ISMB 2005

GenXHC: A Probabilistic Generative Model for Cross-hybridization Compensation in High-density Genome-wide Microarray Data

Joint work with Quaid Morris(1),(2), Tim Hughes(2) and Brendan Frey(1),(2)

(1) Probabilistic and Statistical Inference Group, University of Toronto

(2) Banting & Best Department of Medical Research, University of Toronto

Jim Huang(1)

Genome-wide profiling using high-density microarrays

• The move towards high-density arrays for genome-wide profiling presents challenges…

[Figure labels: Probes, Conditions, Expression, Coding regions, Genome]

Cross-hybridization in high-density microarrays

• As we move to higher-density arrays, cross-hybridization noise becomes significant and unavoidable

[Figure: oligonucleotide probes and an mRNA transcript; panels: Hybridization, Cross-hybridization]

Cross-hybridization in high-density microarrays (cont’d)

• Large cross-hybridization noise component in high-density data!

Cross-hybridization compensation

• State-of-the-art methods for cross-hybridization compensation are designed for Affymetrix GeneChips:

• Affymetrix MAS 5.0

• Robust Multi-array Analysis (RMA/GC-RMA)(1),(2)

(1) Wu, Z. and Irizarry, R.A. (2004) Stochastic models inspired by hybridization theory for short oligonucleotide arrays. Proc. Ninth International Conference on Research in Computational Molecular Biology (RECOMB), March 2004, pp. 98-106.

(2) Irizarry, R.A. et al. (2003) Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics, 4, pp. 249-264.

Bilinear model for cross-hybridization

[Figure: hybridization and cross-hybridization of oligonucleotide probes to mRNA transcripts, annotated with X, Λ and Z]

• Each probe is assigned a set of cross-hybridizing transcript expression profiles

• Each transcript has a hybridization weight λ that determines its contribution

$$x_i = \sum_j \lambda_{ij} z_j$$

The probabilistic generative model for cross-hybridization

• Model the data probabilistically as

X = ΛZ + V

where

X = [x1 x2 … xT] is N x T,

Z = [z1 z2 … zT] is M x T,

Λ is the N x M hybridization matrix,

V is additive noise
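A minimal sketch of the generative model X = ΛZ + V on toy sizes may help fix the shapes. The Gaussian form of V, the function name `simulate`, and all numeric values are assumptions made for illustration; the slide only states that V is additive noise.

```python
import random

def simulate(lam, z, noise_sd, rng):
    """Return X = lam * Z + V, with V drawn i.i.d. Gaussian (an assumption)."""
    n, m, t = len(lam), len(z), len(z[0])
    x = []
    for i in range(n):
        row = []
        for c in range(t):
            # Bilinear part: weighted sum of the transcript profiles for probe i.
            signal = sum(lam[i][j] * z[j][c] for j in range(m))
            row.append(signal + rng.gauss(0.0, noise_sd))
        x.append(row)
    return x

rng = random.Random(0)
# N = 3 probes, M = 2 transcripts, T = 4 conditions (toy values).
lam = [[1.0, 0.0],
       [0.8, 0.3],   # this probe also picks up signal from the second transcript
       [0.0, 1.0]]
z = [[2.0, 0.5, 0.0, 1.0],
     [0.0, 1.5, 3.0, 0.2]]
x = simulate(lam, z, noise_sd=0.0, rng=rng)  # noise switched off to expose Lambda * Z
```

With `noise_sd=0.0` the middle row of `x` is the mixture 0.8·z₁ + 0.3·z₂, which is the cross-hybridization contribution the model is meant to undo.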

Sparsity of the Λ matrix

• Force many of the weights λij to 0

• Denote by S the set of weights which are non-zero: the prior becomes

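$$p(\Lambda \mid S) = \prod_{(i,j) \in S} \mathrm{Exp}(\lambda_{ij};\, 1) \prod_{(i,j) \notin S} \delta(\lambda_{ij})$$

where $\mathrm{Exp}(x;\, 1) = e^{-x}$ for $x \ge 0$ is the unit exponential density and $\delta$ is a point mass at 0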
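The sparsity constraint can be illustrated by drawing a weight matrix whose entries are zero outside S. This is a sketch only: it assumes a unit-rate exponential prior on the non-zero weights, and `sample_lambda` and the toy support set are hypothetical names and values.

```python
import random

def sample_lambda(n, m, support, rng):
    """Draw an n x m weight matrix: exponential draws on S, exact zeros elsewhere."""
    lam = [[0.0] * m for _ in range(n)]
    for i, j in support:
        # Assumed prior: exponential with rate 1 on each non-zero weight.
        lam[i][j] = rng.expovariate(1.0)
    return lam

rng = random.Random(1)
# S as a set of (probe index, transcript index) pairs; toy values.
support = {(0, 0), (1, 0), (1, 1), (2, 1)}
lam = sample_lambda(3, 2, support, rng)
```

Storing S as a set of index pairs keeps the cost proportional to the number of probe-transcript matches rather than to N × M.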

Variational inference

• To perform inference, minimize the KL-divergence

$$D(q \,\|\, p) = \int_{\Lambda} \int_{Z} q(Z, \Lambda) \log \frac{q(Z, \Lambda)}{p(X, Z, \Lambda \mid S)} \, dZ \, d\Lambda$$

with respect to a distribution q for the given probabilistic model p

• The optimum is the posterior distribution q(Z,Λ) = p(Z,Λ|X,S)

• Difficult to compute exactly!

• Use a surrogate

$$q(Z, \Lambda) = \prod_{j,t} q(z_{jt}) \prod_{(i,j) \in S} q(\lambda_{ij})$$

which approximates the true posterior
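One way to see why the posterior is the optimum is the standard decomposition below. It assumes D(q||p) is the integral of q(Z,Λ) log[q(Z,Λ) / p(X,Z,Λ|S)] and uses only the product rule p(X,Z,Λ|S) = p(Z,Λ|X,S) p(X|S).

```latex
\begin{aligned}
D(q \,\|\, p)
  &= \int_{\Lambda}\!\int_{Z} q(Z,\Lambda)\,
     \log \frac{q(Z,\Lambda)}{p(Z,\Lambda \mid X,S)\; p(X \mid S)} \, dZ\, d\Lambda \\
  &= \underbrace{\int_{\Lambda}\!\int_{Z} q(Z,\Lambda)\,
     \log \frac{q(Z,\Lambda)}{p(Z,\Lambda \mid X,S)} \, dZ\, d\Lambda}_{\ge 0}
     \;-\; \log p(X \mid S)
\end{aligned}
```

The second term does not depend on q, and the first is a KL-divergence that vanishes exactly when q equals the posterior. Restricting q to a factorized family trades that exact optimum for tractable updates.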

Variational EM for approximate inference and parameter estimation

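• Use exponential distributions parameterized by variational parameters for q:

$$q(\lambda_{ij}) = \mathrm{Exp}(\alpha_{ij}), \qquad q(z_{jt}) = \mathrm{Exp}(\beta_{jt})$$

• Minimize KL-divergence via variational EM(2),(3) to get the estimate βjt of the transcript expression profiles:

Variational E-step

$$\hat{\beta}_{jt} = \arg\min_{\beta_{jt}} D(q \,\|\, p), \qquad \hat{\alpha}_{ij} = \arg\min_{\alpha_{ij}} D(q \,\|\, p)$$

Variational M-step

$$\hat{\sigma}_i^2 = \arg\min_{\sigma_i^2} D(q \,\|\, p)$$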

(2) Neal, R. M. and Hinton, G. E. (1998) A view of the EM algorithm that justifies incremental, sparse, and other variants. Learning in Graphical Models, Kluwer Academic Publishers, pp. 355-368.

(3) Jaakkola, T. and Jordan, M.I. (2000) Bayesian parameter estimation via variational methods. Statistics and Computing, 10:1, January 2000, pp. 25-37.

Variational Expectation-Maximization algorithm

Variational E-step

[Equations: closed-form updates for βjt and for the variational parameters of q(λij), with sums restricted to the pairs (i, j) in S]

Variational M-step

[Equation: update for the parameter estimated in the M-step, averaged over the T conditions]
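The alternating structure of the algorithm (update the transcript profiles with the weights fixed, then the weights with the profiles fixed, always keeping weights outside S at zero) can be sketched with a stand-in technique. The code below is not the variational EM update from the slides: it uses masked multiplicative updates from non-negative matrix factorization, which minimize squared error rather than the KL-divergence. All names (`masked_nmf`, `sq_error`) and the toy data are hypothetical.

```python
import random

def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def sq_error(x, lam, z):
    """Squared reconstruction error between X and lam * Z."""
    recon = matmul(lam, z)
    return sum((a - b) ** 2 for ra, rb in zip(x, recon) for a, b in zip(ra, rb))

def masked_nmf(x, support, m, iters, rng, eps=1e-9):
    """Alternate multiplicative updates for Z and lam; lam stays 0 outside support."""
    n, t = len(x), len(x[0])
    lam = [[rng.uniform(0.5, 1.5) if (i, j) in support else 0.0 for j in range(m)]
           for i in range(n)]
    z = [[rng.uniform(0.5, 1.5) for _ in range(t)] for _ in range(m)]
    history = [sq_error(x, lam, z)]
    for _ in range(iters):
        # Update the transcript profiles Z with the weights held fixed.
        lam_t = transpose(lam)
        num, den = matmul(lam_t, x), matmul(matmul(lam_t, lam), z)
        z = [[z[j][c] * num[j][c] / (den[j][c] + eps) for c in range(t)]
             for j in range(m)]
        # Update the weights with Z held fixed; zero entries remain zero.
        z_t = transpose(z)
        num, den = matmul(x, z_t), matmul(lam, matmul(z, z_t))
        lam = [[lam[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
               for i in range(n)]
        history.append(sq_error(x, lam, z))
    return lam, z, history

rng = random.Random(0)
x = [[2.0, 0.5, 0.0, 1.0],
     [1.6, 0.85, 0.9, 0.86],   # mixture of the two underlying profiles
     [0.0, 1.5, 3.0, 0.2]]
support = {(0, 0), (1, 0), (1, 1), (2, 1)}
lam, z, history = masked_nmf(x, support, m=2, iters=200, rng=rng)
```

Because the weight update is multiplicative, an entry initialized to zero never leaves zero, so the sparsity pattern S is enforced without an explicit projection step.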

Results

• Agilent exon-tiling microarray data with 26,486 60-mer probes across 12 tissue pools

• Matched each probe to full-length RefSeq cDNAs via BLAST search to determine the sparsity structure S

• Resulting data set contains 9,904 probes matched to 2,905 mouse transcripts
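Turning probe-to-transcript matches into the sparsity structure S might look like the sketch below. The hit tuples, the identifiers, and the E-value cutoff are all made up for illustration; the slides do not state the matching criteria that were used.

```python
def build_support(hits, max_evalue=1e-5):
    """Map (probe_id, transcript_id, evalue) hits to index pairs (i, j) in S."""
    # Keep only matches at or below the (assumed) E-value threshold.
    kept = [(p, t) for p, t, e in hits if e <= max_evalue]
    probe_index = {p: i for i, p in enumerate(sorted({p for p, _ in kept}))}
    transcript_index = {t: j for j, t in enumerate(sorted({t for _, t in kept}))}
    support = {(probe_index[p], transcript_index[t]) for p, t in kept}
    return support, probe_index, transcript_index

# Hypothetical hits: probe_A matches two transcripts, probe_C has no good match.
hits = [
    ("probe_A", "NM_0001", 1e-30),
    ("probe_A", "NM_0002", 1e-8),
    ("probe_B", "NM_0002", 1e-25),
    ("probe_C", "NM_0003", 0.5),
]
support, probe_index, transcript_index = build_support(hits)
```

Probes with no retained match drop out of the index entirely, which mirrors how the data set shrinks from all probes on the array to those matched to a transcript.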

Results (cont’d)

Significance testing of inferred expression profiles

• Randomly permute the rows of the S matrix and perform inference

• Mean SNR significantly lower for permuted data compared to unpermuted data

$$\mathrm{SNR} = 10 \log_{10} \frac{\sum_t \hat{x}_t^2}{\sum_t (x_t - \hat{x}_t)^2}$$
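The permutation test can be sketched as follows, reading the SNR as 10 log10 of reconstructed signal power over residual power for one probe profile. The function names and toy numbers are hypothetical, and the inference step that would sit between the permutation and the SNR computation is omitted.

```python
import math
import random

def snr_db(x, x_hat):
    """SNR in dB for one probe: reconstructed power over residual power."""
    signal = sum(v * v for v in x_hat)
    residual = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    return 10.0 * math.log10(signal / residual)

def permute_rows(s_rows, rng):
    """Return a copy of S with its rows (one per probe) in random order."""
    shuffled = list(s_rows)
    rng.shuffle(shuffled)
    return shuffled

# Toy profile over three conditions and its reconstruction.
x = [1.0, 2.0, 3.0]
x_hat = [1.0, 2.0, 2.0]
value = snr_db(x, x_hat)

# Each row lists the transcript indices matched to one probe (toy values).
s_rows = [[0], [0, 1], [1], [2]]
permuted = permute_rows(s_rows, random.Random(0))
```

Permuting rows keeps the number of matches per row unchanged while breaking the link between each probe and its true transcripts, which is what makes the permuted SNR a usable null reference.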

Gene Ontology-Biological Process (GO-BP) enrichment using denoised data

• Perform agglomerative hierarchical clustering and compute a hypergeometric p-value for each cluster to evaluate statistical significance of the clustering

• The majority of clusters have increased significance in denoised data compared with clustering using noisy data
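The per-cluster hypergeometric p-value can be computed directly from binomial coefficients. This is a sketch of the standard upper-tail test with hypothetical argument names; the slides do not say how the clusters or annotation sets were chosen.

```python
from math import comb

def hypergeom_pvalue(population, annotated, cluster_size, hits):
    """P(at least `hits` annotated genes in a cluster drawn without replacement)."""
    upper = min(annotated, cluster_size)
    total = comb(population, cluster_size)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, cluster_size - k)
        for k in range(hits, upper + 1)
    )
    return tail / total

# Toy numbers: 10 genes, 4 carry the GO-BP category, cluster of 3 contains 2 of them.
p_value = hypergeom_pvalue(population=10, annotated=4, cluster_size=3, hits=2)
```

A smaller p-value means the cluster contains more genes from the category than chance would explain, so comparing p-values before and after denoising measures the change in enrichment.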

Comparison to Robust Multi-array Analysis

• Unlike RMA, GenXHC models the explicit sparse structure of the set of probe-transcript interactions

• This increases statistical power when doing functional prediction

Summary

• Cross-hybridization compensation using prior knowledge about the transcript population doubles the number of probes on the array

• Problem of inferring latent transcript profiles is one of variational inference

• Functional annotation using denoised data yields functional categories which have higher statistical significance compared to noisy expression data

• Taking into account the set of probe-transcript binding interactions generally yields greater statistical power versus ignoring them
