Probabilistic Sparse Matrix Factorization
Delbert Dueck, Quaid Morris, Brendan Frey (Probabilistic & Statistical Inference Group)
Tim Hughes (Banting and Best Department of Medical Research)
Objective
Patterns in gene expression array data can be used to help understand gene regulation and to predict the function of as-yet-uncharacterized genes.
Objective: To develop a method of probabilistic sparse matrix factorization (PSMF) and apply it to gene expression data to learn the hidden structure underlying the data.
Biological Background
Genes encode basic information about an organism
They tend to be highly expressed in tissues related to their functional role
Mouse gene expression data is from Zhang, Morris, et al. (2004)
Gene expression is influenced by the presence of transcription factors (TFs)
Co-expressed genes are likely activated by the same TFs
The activity of each gene can be explained by the activities of a small number of transcription factors
Gene Expression Array Dataset
Entire data set: $\mathbf{X}$, a G×T matrix (G=22709 genes, T=55 tissues)
Scalar expression values ($x_{gt}$)
Expression vector for gene XM_133866.1: $\mathbf{x}_g$ (g=10056), a row vector of length T=55
[Figure: heat map of the entire data set (G=22709 genes by T=55 tissues), with a 100-gene excerpt and the single-gene expression vector. Labeled tissue columns: bladder (t=3), colon (t=9), hindbrain (t=22), large intestine (t=25), lymph node (t=28), midbrain (t=31), pancreas (t=34), small intestine (t=41), spleen (t=44), stomach (t=45). Scale: 0 2 4 6 8 >10]
Sparse Matrix Factorization
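Gene expression data model: each gene's expression profile ($\mathbf{x}_g$) is …
a linear combination (weighted by $y_{gc}$, $c \in \mathbf{s}_g$) …
of a small number ($r_g \le N$) …
of $C$ possible transcription factor profiles ($\mathbf{z}_c$, $c \in \mathbf{s}_g$)
\[ \mathbf{x}_g \approx \sum_{n=1}^{r_g} y_{g s_{gn}} \, \mathbf{z}_{s_{gn}} \]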
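The per-gene model on this slide can be sketched in a few lines of NumPy. This is only an illustration: the function name `reconstruct_gene`, the 0-based indexing, and all numbers are assumptions made here, not values from the talk.

```python
import numpy as np

def reconstruct_gene(y_g, s_g, Z):
    """Approximate x_g as the sum over selected factors of y_g[c] * Z[c].

    y_g : length-C weight vector for one gene (only entries listed in s_g are used)
    s_g : list of r_g factor indices (0-based here; the slides count from 1)
    Z   : C x T array whose rows are the factor profiles z_c
    """
    x_hat = np.zeros(Z.shape[1])
    for c in s_g:
        x_hat += y_g[c] * Z[c]
    return x_hat

# Toy numbers invented for illustration: C=4 factors, T=3 tissues.
Z = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 0.0],
              [3.0, 3.0, 3.0],
              [0.5, 0.5, 0.0]])
y_g = np.array([2.0, 0.0, 0.0, 4.0])
s_g = [0, 3]                      # r_g = 2 factors selected for this gene
x_hat = reconstruct_gene(y_g, s_g, Z)
print(x_hat)
```

Only the two profiles named in `s_g` contribute, which is the sense in which the combination is sparse.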
Sparse Matrix Factorization
Matrix format (entire dataset):
\[ \mathbf{X} \approx \mathbf{Y} \cdot \mathbf{Z} \]
$\mathbf{X}$ is the G×T matrix of expression values $x_{gt}$. $\mathbf{Y}$ is a G×C matrix of weights $y_{gc}$ in which most entries are 0. $\mathbf{Z}$ is the C×T matrix whose rows are the factor profiles $\mathbf{z}_c$.
The sparsity pattern of $\mathbf{Y}$ is recorded by $\mathbf{S}$, whose row $g$ lists the indices of the non-zero entries in row $g$ of $\mathbf{Y}$, and by $\mathbf{r}$, whose entry $r_g$ counts them.
[Figure: example of X ≈ Y·Z with zeros filling most of Y, shown beside the corresponding S and r.]
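As a rough sketch of the matrix format, the snippet below scatters per-gene weights into a sparse Y using an index matrix S and a count vector r, then forms Y·Z. The helper `build_sparse_Y`, the 0-based indices, and the toy sizes are assumptions made for illustration.

```python
import numpy as np

def build_sparse_Y(weights, S, r, C):
    """Scatter per-gene weights into a G x C matrix with at most N non-zeros per row.

    weights : G x N array; weights[g, n] is the weight of the n-th chosen factor
    S       : G x N integer array of factor indices (0-based); slot n is used only if n < r[g]
    r       : length-G array with the number of factors used by each gene
    """
    G, N = S.shape
    Y = np.zeros((G, C))
    for g in range(G):
        for n in range(r[g]):
            Y[g, S[g, n]] = weights[g, n]
    return Y

# Toy sizes invented for illustration: G=3 genes, C=4 factors, T=2 tissues, N=2.
S = np.array([[0, 2], [1, 0], [3, 1]])
r = np.array([2, 1, 2])
weights = np.array([[1.0, 2.0], [3.0, 9.9], [0.5, 4.0]])  # 9.9 sits in an unused slot
Z = np.arange(8, dtype=float).reshape(4, 2)

Y = build_sparse_Y(weights, S, r, C=4)
X_hat = Y @ Z                     # row g equals the sparse sum for gene g
print(Y)
print(X_hat)
```

Storing S and r instead of the full Y keeps the representation at N indices per gene rather than C columns.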
Probabilistic Sparse Matrix Factorization
To express as a distribution, assume …
varying levels of Gaussian noise in the data:
\[ P(\mathbf{x}_g \mid \mathbf{y}_g, \mathbf{Z}, \mathbf{s}_g, r_g, \psi_g) = \mathcal{N}\Big(\mathbf{x}_g;\ \sum_{n=1}^{r_g} y_{g s_{gn}} \mathbf{z}_{s_{gn}},\ \psi_g^2 \mathbf{I}\Big) \]
nothing about transcription factor weights:
\[ P(\mathbf{y}_g) \propto 1 \]
normally-distributed transcription factor profiles:
\[ P(\mathbf{z}_c) = \mathcal{N}(\mathbf{z}_c;\ \mathbf{0},\ \mathbf{I}) \]
uniformly-distributed factor assignments:
\[ P(\mathbf{s}_g \mid r_g) = \prod_{n=1}^{r_g} \prod_{c=1}^{C} \Big(\tfrac{1}{C}\Big)^{\delta(s_{gn} - c)} \]
multinomially-distributed factor counts:
\[ P(r_g = n) = \nu_n \]
Multiply together to get joint distribution:
\[ P(\mathbf{X}, \mathbf{Y}, \mathbf{Z}, \mathbf{S}, \mathbf{r} \mid \Psi) = P(\mathbf{X} \mid \mathbf{Y}, \mathbf{Z}, \mathbf{S}, \mathbf{r}, \Psi) \cdot P(\mathbf{Y}) \cdot P(\mathbf{Z}) \cdot P(\mathbf{S} \mid \mathbf{r}) \cdot P(\mathbf{r}) \]
\[ \propto \Big[\prod_{g=1}^{G} \mathcal{N}\Big(\mathbf{x}_g;\ \sum_{n=1}^{r_g} y_{g s_{gn}} \mathbf{z}_{s_{gn}},\ \psi_g^2 \mathbf{I}\Big)\Big] \Big[\prod_{c=1}^{C} \mathcal{N}(\mathbf{z}_c;\ \mathbf{0},\ \mathbf{I})\Big] \Big[\prod_{g=1}^{G} \prod_{c=1}^{C} \prod_{n=1}^{N} \Big(\tfrac{1}{C}\Big)^{\delta(s_{gn} - c)}\Big] \Big[\prod_{g=1}^{G} \prod_{n=1}^{N} \nu_n^{\delta(r_g - n)}\Big] \]
(In the third bracket the product over n runs to N rather than $r_g$; unused slots with $n > r_g$ presumably match no factor index c, so they contribute a factor of 1 and the two forms agree.)
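To make the joint distribution concrete, here is a minimal sketch that evaluates its logarithm (the "complete log likelihood") term by term, dropping the constant from the flat prior on Y. The function name, argument layout, and toy inputs are assumptions for illustration, and the assignment term counts only the $r_g$ used slots of each gene.

```python
import numpy as np

def complete_log_likelihood(X, Y, Z, r, psi, nu):
    """log P(X, Y, Z, S, r | Psi), up to the constant from the flat prior on Y.

    X   : G x T data;  Y : G x C sparse weights (non-zeros in the columns listed in S)
    Z   : C x T factor profiles;  r : length-G integer factor counts in 1..N
    psi : length-G noise standard deviations;  nu : length-N prior, nu[n-1] = P(r_g = n)
    """
    G, T = X.shape
    C = Z.shape[0]
    resid = X - Y @ Z
    # Gaussian likelihood with per-gene isotropic variance psi_g^2
    log_lik = np.sum(-0.5 * T * np.log(2 * np.pi * psi**2)
                     - 0.5 * np.sum(resid**2, axis=1) / psi**2)
    # Standard normal prior on every factor profile z_c
    log_pz = -0.5 * C * T * np.log(2 * np.pi) - 0.5 * np.sum(Z**2)
    # Uniform choice among C factors for each of the r_g used slots
    log_ps = -np.sum(r) * np.log(C)
    # Multinomial prior on the factor counts
    log_pr = np.sum(np.log(nu[r - 1]))
    return log_lik + log_pz + log_ps + log_pr

# Toy check with invented numbers: perfect fit, unit noise, all-zero profiles.
Z = np.zeros((2, 2))
Y = np.array([[1.0, 0.0], [0.5, 2.0]])
X = Y @ Z
r = np.array([1, 2])
psi = np.ones(2)
nu = np.array([0.5, 0.5])
value = complete_log_likelihood(X, Y, Z, r, psi, nu)
print(value)
```

A quantity of this kind is what an inference procedure would track across iterations to monitor progress.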
Factorized Variational Inference
Exact inference is intractable with P(∙)
Approximate it by a simpler distribution, Q(∙), and perform inference on that
\[ P(\mathbf{Y}, \mathbf{Z}, \mathbf{S}, \mathbf{r} \mid \mathbf{X}, \Psi) \approx \prod_{g=1}^{G} \prod_{c=1}^{C} Q(y_{gc}) \cdot \prod_{c=1}^{C} \prod_{t=1}^{T} Q(z_{ct}) \cdot \prod_{g=1}^{G} \prod_{n=1}^{N} Q(s_{gn}) \cdot \prod_{g=1}^{G} Q(r_g) \]
Here P(∙) is the joint distribution $P(\mathbf{X}, \mathbf{Y}, \mathbf{Z}, \mathbf{S}, \mathbf{r} \mid \Psi)$ from the previous slide.
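The slide does not say how Q(∙) is fitted. As a hedged sketch, the usual criterion for a factorized approximation of this kind is to minimize the Kullback-Leibler divergence from Q to the posterior, which is equivalent to minimizing a free energy (the symbol $\mathcal{F}$ is introduced here and is not taken from the slides):

```latex
\begin{align*}
\mathcal{F}(Q)
  &= \sum_{\mathbf{S}, \mathbf{r}} \int_{\mathbf{Y}, \mathbf{Z}}
     Q(\mathbf{Y}, \mathbf{Z}, \mathbf{S}, \mathbf{r})\,
     \log \frac{Q(\mathbf{Y}, \mathbf{Z}, \mathbf{S}, \mathbf{r})}
               {P(\mathbf{X}, \mathbf{Y}, \mathbf{Z}, \mathbf{S}, \mathbf{r} \mid \Psi)} \\
  &= \mathrm{KL}\big( Q(\mathbf{Y}, \mathbf{Z}, \mathbf{S}, \mathbf{r}) \,\big\|\,
       P(\mathbf{Y}, \mathbf{Z}, \mathbf{S}, \mathbf{r} \mid \mathbf{X}, \Psi) \big)
     - \log P(\mathbf{X} \mid \Psi)
\end{align*}
```

The last term does not depend on Q, so lowering $\mathcal{F}$ lowers the divergence between Q and the true posterior.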
Visualization
[Figure: heat maps of the data X (G genes by T tissues) beside its reconstruction X = Y Z, with Y (G genes by C factors) and Z (C factors by T tissues). Scale: 0 2 4 6 8 >10. Genes are sorted by primary transcription factor ($s_{g1}$).]
PROBABILISTIC SPARSE MATRIX FACTORIZATION
C=50 possible factors
N=3 factors per gene (max), $P(r_g)$=[.55 .27 .18]
Results – p-value histograms
Genes can be partitioned into “primary categories” (i.e. same $s_{g1}$ value), “secondary classes”, etc.
Compare classes with annotated gene ontology (GO-BP) categories for statistical significance
[Figure: five histograms of log10(p-value) (x-axis from -20 to 0; y-axis frequency from 0 to 0.4), one panel each for random clustering, hierarchical agglomerative clustering, PSMF (primary), PSMF (secondary), and PSMF (tertiary).]
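The slides do not name the significance test used against the GO-BP categories. A common choice for this kind of class-versus-annotation comparison is the upper tail of the hypergeometric distribution, sketched below under that assumption; the function name and the toy counts are invented for illustration.

```python
from math import comb, log10

def enrichment_p_value(population, annotated, cluster_size, overlap):
    """P(at least `overlap` annotated genes in a random cluster of `cluster_size`).

    population   : total number of genes
    annotated    : genes carrying the annotation of interest
    cluster_size : genes in the class being tested
    overlap      : annotated genes observed inside that class
    """
    total = comb(population, cluster_size)
    upper = min(annotated, cluster_size)
    tail = sum(comb(annotated, k) * comb(population - annotated, cluster_size - k)
               for k in range(overlap, upper + 1))
    return tail / total

# Toy counts invented for illustration: 20 genes, 5 annotated,
# a class of 4 genes of which 3 are annotated.
p = enrichment_p_value(population=20, annotated=5, cluster_size=4, overlap=3)
print(p, log10(p))
```

Taking log10 of such p-values gives the quantity histogrammed on this slide; smaller (more negative) values indicate stronger enrichment.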
Results – mean log10 p-values
[Figure: two plots against C, the number of clusters or factors (x-axis, 10 to 100). First plot, "Mean log10 p-values": mean log10(p-value) on the y-axis, from -25 to 0. Second plot, "Fraction of factors with significance": y-axis from 0% to 100%. Curves in both plots: PSMF N={1,2,3} primary (i.e. $s_{g1}$ clustering), hierarchical agglomerative clustering, PSMF N={2,3} secondary, PSMF N=3 tertiary, random clustering.]
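The two summary statistics plotted here can be computed from a list of per-factor p-values as in the sketch below. The significance threshold `alpha=0.05` is an assumption (the slides do not state the cutoff), and the p-values are invented.

```python
from math import log10

def summarize(p_values, alpha=0.05):
    """Return (mean log10 p-value, fraction of p-values below alpha)."""
    logs = [log10(p) for p in p_values]
    mean_log = sum(logs) / len(logs)
    fraction_significant = sum(p < alpha for p in p_values) / len(p_values)
    return mean_log, fraction_significant

# Invented p-values, one per factor, for illustration only.
p_values = [1e-10, 1e-2, 1.0, 1e-4]
mean_log, fraction_significant = summarize(p_values)
print(mean_log, fraction_significant)
```

Repeating this for each value of C and each clustering method would give one point per curve in plots of this kind.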
Results – count of significant p-values
[Figure: the same two plots against C (x-axis, 10 to 100): "Mean log10 p-values" (y-axis from -25 to 0) and "Fraction of factors with significance" (y-axis from 0% to 100%), with curves for PSMF N={1,2,3} primary (i.e. $s_{g1}$ clustering), hierarchical agglomerative clustering, PSMF N={2,3} secondary, PSMF N=3 tertiary, and random clustering.]
Future Directions – different Q(·)
[Figure: complete log likelihood (y-axis, ×10^4) against iteration (x-axis, 0 to 50), with a curve labeled "iterated conditional modes".]
**NOTE: The complete log likelihoods are not necessarily monotonically increasing due to the non-negativity constraint, implemented via a zero-thresholding heuristic.
\[ \prod_{g=1}^{G} \prod_{c=1}^{C} Q(y_{gc}) \cdot \prod_{c=1}^{C} Q(\mathbf{z}_c) \cdot \prod_{g=1}^{G} \prod_{n=1}^{N} Q(s_{gn}) \cdot \prod_{g=1}^{G} Q(r_g) \]
\[ \prod_{g=1}^{G} \prod_{c=1}^{C} Q(y_{gc}) \cdot \prod_{c=1}^{C} Q(\mathbf{z}_c) \cdot \prod_{g=1}^{G} Q(\mathbf{s}_g, r_g) \]
\[ \prod_{c=1}^{C} Q(\mathbf{z}_c) \cdot \prod_{g=1}^{G} Q(\mathbf{y}_g, \mathbf{s}_g, r_g) \]
Iterated conditional modes (point estimates)
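To illustrate what point-estimate updates in the style of iterated conditional modes look like, here is a heavily simplified sketch. It is not the authors' algorithm: it assumes one factor per gene (N=1) and unit noise ($\psi_g = 1$), and the function name `icm_n1` and the toy data are invented. It does include the zero-thresholding of weights mentioned in the note.

```python
import numpy as np

def icm_n1(X, C, iters=20, seed=0):
    """ICM-style point estimates for the special case N=1, psi_g=1."""
    rng = np.random.default_rng(seed)
    G, T = X.shape
    Z = rng.standard_normal((C, T))
    s = np.zeros(G, dtype=int)
    y = np.zeros(G)
    for _ in range(iters):
        XZ = X @ Z.T                                  # (G, C) inner products x_g . z_c
        norms = np.sum(Z**2, axis=1) + 1e-12          # (C,) squared norms of z_c
        # Mode of y for each (gene, factor) pair under the flat prior: projection
        proj = np.maximum(XZ / norms, 0.0)            # zero-thresholding heuristic
        # Squared residual ||x_g - y z_c||^2 for every candidate factor
        resid = (np.sum(X**2, axis=1, keepdims=True)
                 - 2 * proj * XZ + proj**2 * norms)
        s = np.argmin(resid, axis=1)                  # best single factor per gene
        y = proj[np.arange(G), s]
        # Mode of each z_c under its N(0, I) prior, given the genes assigned to it
        for c in range(C):
            members = s == c
            if members.any():
                Z[c] = (y[members] @ X[members]) / (1.0 + np.sum(y[members]**2))
    return y, s, Z

# Invented non-negative toy data: 30 genes, 5 tissues, 3 candidate factors.
X = np.abs(np.random.default_rng(1).standard_normal((30, 5)))
y, s, Z = icm_n1(X, C=3)
recon = y[:, None] * Z[s]
print(np.sum((X - recon)**2))
```

Each step sets one group of variables to its most probable value given the rest, which is what distinguishes point estimates from the full distributions Q(∙) kept by the variational approach.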
Summary
Introduced probabilistic sparse matrix factorization (PSMF), in which each row of the data matrix is a linear combination of a “small” number of hidden factors selected from a larger set.
Described a variational inference algorithm for fitting the PSMF model.
Evaluated the model on a gene function prediction task.