Heinrich Presentation - Michael W. Berry


Automated Gene Classification using Nonnegative Matrix Factorization on Biomedical Literature

Kevin Heinrich

PhD Dissertation Defense
Department of Computer Science, UTK

March 19, 2007

Kevin Heinrich - Using NMF for Gene Classification


What Is The Problem?

Understanding functional gene relationships requires expert knowledge.

Gene sequence analysis does not necessarily imply function.

Gene structure analysis is difficult.

Issue of scale.

Biologists know a small subset of genes. Thousands of genes exist.

Time & Money.


Defining Functional Gene Relationships

Direct Relationships.

Known gene relationships (e.g., A-B). Based on term co-occurrence.¹

Indirect Relationships.

Unknown gene relationships (e.g., A-C). Based on semantic structure.

¹ Jenssen et al., Nature Genetics, 28:21, 2001.

Semantic Gene Organizer

Gene information is compiled in human-curated databases.

Medical Literature Analysis and Retrieval System Online (MEDLINE)
EntrezGene (LocusLink)
Medical Subject Headings (MeSH)
Gene Ontology (GO)

Gene documents are formed by taking titles and abstracts from MEDLINE citations cross-referenced in the Mouse, Rat, and Human EntrezGene entries for that gene.

Examines literature (phenotype) instead of genotype.

Can be used as a guide for future gene exploration.


Vector Space Model

Gene documents are parsed into tokens.

Tokens are assigned a weight, wij, the weight of the ith token in the jth document.

An m × n term-by-document matrix, A = [wij], is created.

Genes are m-dimensional vectors. Tokens are n-dimensional vectors.


Term-by-Document Matrix

      d1   d2   d3   ...  dn
t1    w11  w12  w13  ...  w1n
t2    w21  w22  w23  ...  w2n
t3    w31  w32  w33  ...  w3n
t4    w41  w42  w43  ...  w4n
...                       ...
tm    wm1  wm2  wm3  ...  wmn

Typically, a term-document matrix is sparse and unstructured.
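The construction above can be sketched in a few lines of Python; the toy "gene documents" below are hypothetical stand-ins for the concatenated MEDLINE titles and abstracts used in the actual system.

```python
import numpy as np

# Hypothetical toy gene documents (in SGO these would be MEDLINE
# titles/abstracts compiled per gene).
docs = [
    "tumor suppressor gene tumor",
    "memory disorder gene",
    "tumor gene expression",
]

# Parse documents into tokens and build the term vocabulary.
terms = sorted({t for d in docs for t in d.split()})
index = {t: i for i, t in enumerate(terms)}

# A is the m x n term-by-document matrix of raw frequencies f_ij.
A = np.zeros((len(terms), len(docs)))
for j, d in enumerate(docs):
    for t in d.split():
        A[index[t], j] += 1
```

Each column of A is one gene document; each row records how often a token appears across documents.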


Weighting Schemes

Term weights are the product of a local component, a global component, and a document normalization factor:

wij = lij gi dj

The log-entropy weighting scheme is used, where

lij = log2(1 + fij)

gi = 1 + (Σj pij log2 pij) / log2 n,  pij = fij / Σj fij
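A minimal NumPy sketch of the log-entropy weights above. The document normalization factor dj is taken as 1 here (an assumption for simplicity); the second toy row is spread evenly across documents, so its entropy weight vanishes.

```python
import numpy as np

def log_entropy(F):
    """Log-entropy weights for a term-by-document frequency matrix F (m x n):
    w_ij = l_ij * g_i, with l_ij = log2(1 + f_ij) and
    g_i = 1 + sum_j(p_ij log2 p_ij) / log2 n,  p_ij = f_ij / sum_j f_ij.
    (Document normalization d_j is omitted, i.e. taken as 1.)"""
    F = np.asarray(F, dtype=float)
    m, n = F.shape
    local = np.log2(1.0 + F)
    row_sums = F.sum(axis=1, keepdims=True)
    P = np.divide(F, row_sums, out=np.zeros_like(F), where=row_sums > 0)
    # 0 * log2(0) is taken as 0 by convention.
    logP = np.log2(P, out=np.zeros_like(P), where=P > 0)
    g = 1.0 + (P * logP).sum(axis=1) / np.log2(n)
    return local * g[:, None]

F = np.array([[2.0, 0.0, 1.0],   # term concentrated in two documents
              [1.0, 1.0, 1.0]])  # term spread evenly: entropy weight -> 0
W = log_entropy(F)
```

A term that appears uniformly everywhere carries no discriminating information, which is exactly what the zero global weight encodes.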


Latent Semantic Indexing (LSI)

LSI performs a truncated singular value decomposition (SVD) on A, producing three factor matrices:

A = UΣVᵀ

U is the m × r matrix of eigenvectors of AAᵀ

Vᵀ is the r × n matrix of eigenvectors of AᵀA

Σ is the r × r diagonal matrix of the r nonnegative singular values of A

r is the rank of A
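With NumPy the truncated decomposition is a few lines; the toy matrix A below is hypothetical (a real A would hold log-entropy term weights).

```python
import numpy as np

# Rank-k truncated SVD of a toy term-by-document matrix.
rng = np.random.default_rng(0)
A = rng.random((8, 5))
k = 2

U, s, Vt = np.linalg.svd(A, full_matrices=False)
Uk, Sk, Vtk = U[:, :k], np.diag(s[:k]), Vt[:k, :]
Ak = Uk @ Sk @ Vtk  # best rank-k approximation in the Frobenius norm
```

The Frobenius error of Ak equals the norm of the discarded singular values, the Eckart-Young property cited on the next slide.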


SVD Properties

A rank-k approximation is generated by truncating each factor matrix to its first k columns (and Σ to its leading k values), i.e., Ak = UkΣkVkᵀ

Ak is the closest of all rank-k approximations, i.e., ‖A − Ak‖F ≤ ‖A − B‖F for any rank-k matrix B


SVD Querying

Document-to-Document Similarity

Akᵀ Ak = (VkΣk)(VkΣk)ᵀ

Term-to-Term Similarity

Ak Akᵀ = (UkΣk)(UkΣk)ᵀ

Document-to-Term Similarity

Ak = UkΣkVkᵀ


Advantages of LSI

A is sparse; the factor matrices are dense. This improves recall for concept-based matching.

Scaled document vectors can be computed once and stored for quick retrieval.

Components of factor matrices represent concepts.

Decreasing the number of dimensions compares documents in a broader sense and achieves better compression.

Similar word-usage patterns get mapped to the same geometric space.

Genes are compared at a concept level rather than a simple term co-occurrence level, resulting in vocabulary-independent comparisons.


Presentation of Results

Problem:

Biologists are familiar with interpreting trees.

LSI produces ranked lists of related terms/documents.

Solution:

Generate pairwise distance data, i.e., 1 − cos θij

Apply a distance-based tree-building algorithm:

Fitch: O(n⁴)
NJ (neighbor joining): O(n³)
FastME: O(n²)
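The pairwise distance step can be sketched directly; the column vectors below are hypothetical reduced-space gene vectors.

```python
import numpy as np

def cosine_distance_matrix(D):
    """Pairwise distances 1 - cos(theta_ij) between the columns of D,
    where each column is a (reduced-space) gene document vector."""
    norms = np.linalg.norm(D, axis=0)
    cos = (D.T @ D) / np.outer(norms, norms)
    dist = 1.0 - cos
    np.fill_diagonal(dist, 0.0)  # force exact zeros on the diagonal
    return dist

# Three toy gene vectors as columns.
D = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])
dist = cosine_distance_matrix(D)
```

The resulting symmetric matrix is what a distance-based tree builder such as NJ or FastME consumes.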


Defining Functional Gene Relationships on Test Data


“Problems” with LSI

Initial term weights are nonnegative; the SVD introduces negative components.

Dimensions of the factored space do not have an immediate interpretation.

Want the advantages of a factored/reduced-dimension space, but also want to interpret the dimensions for clustering/labeling trees.

Issue of scale: small collections are understood better than huge collections.


Defining Functional Gene Relationships

Direct Relationships.

Known gene relationships (e.g., A-B). Based on term co-occurrence.²

Indirect Relationships.

Unknown gene relationships (e.g., A-C). Based on semantic structure.

Label Relationships (e.g. x & y).

² Jenssen et al., Nature Genetics, 28:21, 2001.

[Figure: tree over genes A, B, and C with internal-node labels x and y.]

NMF Problem Definition

Given nonnegative V , find W and H such that

V ≈ WH,  W, H ≥ 0

W has size m × k.

H has size k × n.

W and H are not unique, i.e., (WD)(D⁻¹H) = WH for any invertible nonnegative D.
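The non-uniqueness is easy to check numerically; the positive diagonal D below is the simplest invertible nonnegative choice, and the factors are hypothetical random matrices.

```python
import numpy as np

# Rescaling by a positive diagonal D leaves the product WH unchanged
# while (W D) and (D^{-1} H) both remain nonnegative.
rng = np.random.default_rng(1)
W = rng.random((4, 2))
H = rng.random((2, 3))
D = np.diag([2.0, 0.5])

W2, H2 = W @ D, np.linalg.inv(D) @ H
```

So any NMF solution is only determined up to such scalings (and column permutations), which is one reason convergence behavior differs between runs.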



NMF Interpretation

V ≈WH

Columns of W are k "feature" or "basis" vectors; they represent semantic concepts.

Columns of H are linear combinations of the feature vectors that approximate the corresponding columns of V.

Choice of k determines the accuracy and quality of the basis vectors.

Ultimately produces a "parts-based" representation of the original space.


[Figure: block diagram of V ≈ WH, with W of size m × k and H of size k × n.]

[Figure: the same block diagram with one column of H highlighted (example coefficients 3, 0.18, 2.29, 0.7).]

[Figure: the same diagram with the dominant terms of the selected feature vector shown: cerebrovascular, disturbance, microcephaly, spectroscopy, neuromuscular.]

Euclidean Distance (Cost Function)

E(W, H) = ‖V − WH‖²F = Σi,j (Vij − (WH)ij)²

Minimize E(W, H) subject to W, H ≥ 0.

E(W, H) ≥ 0.

E(W, H) = 0 if and only if V = WH.

‖V − WH‖ is convex in W or H separately, but not in both simultaneously.

No guarantee of finding the global minimum.


Initialization Methods

Since NMF is an iterative algorithm, W and H must be initialized.

Random positive entries.

Structured initialization typically speeds convergence.

Run k-means on V. Choose a representative vector from each cluster to form W and H.

Most methods do not provide a static starting point.


Non-Negative Double SVD

NNDSVD is one way to provide a static starting point.³

Observe Ak = Σⱼ₌₁ᵏ σj uj vjᵀ, i.e., a sum of rank-1 matrices.

For each j:

Compute C = uj vjᵀ.
Set all negative elements of C to 0.
Compute the maximum singular triplet of C, i.e., [u, s, v].
Set the jth column of W to u and the jth row of H to σj s v.

Resulting W and H are influenced by SVD.
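A rough NumPy sketch of the loop above, not the authors' implementation; the sign flip is an assumption to undo the arbitrary sign NumPy may attach to the leading singular pair of the nonnegative matrix C.

```python
import numpy as np

def nndsvd(A, k):
    """Sketch of NNDSVD initialization: build each column of W and row
    of H from the nonnegative part of the rank-1 terms of A's SVD."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    m, n = A.shape
    W, H = np.zeros((m, k)), np.zeros((k, n))
    for j in range(k):
        C = np.outer(U[:, j], Vt[j, :])
        C[C < 0] = 0.0                    # zero out negative elements
        u2, s2, v2t = np.linalg.svd(C)    # maximum singular triplet of C
        u, sv, v = u2[:, 0], s2[0], v2t[0, :]
        if u.sum() < 0:                   # undo an arbitrary sign flip
            u, v = -u, -v
        W[:, j] = u
        H[j, :] = s[j] * sv * v
    return W, H

A = np.random.default_rng(2).random((6, 4))
W, H = nndsvd(A, 2)
```

Because the same A always yields the same SVD, this gives the static (reproducible) starting point the slide describes.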

³ Boutsidis & Gallopoulos, Tech Report, 2005.

NNDSVD Variations

Zero elements remain “locked” during MM update.

NNDSVDz keeps zero elements.

NNDSVDe assigns ε = 10⁻⁹ to zero elements.

NNDSVDa assigns average value of A to zero elements.


Update Rules

Update rules should

decrease the approximation error.

maintain nonnegativity constraints.

maintain other constraints imposed by the application (smoothness/sparsity).


Multiplicative Method (MM)

Hcj ← Hcj (WᵀV)cj / ((WᵀWH)cj + ε)

Wic ← Wic (VHᵀ)ic / ((WHHᵀ)ic + ε)

ε ensures numerical stability.

Lee and Seung proved the MM updates are non-increasing under the Euclidean cost function.

Most implementations update H and W "simultaneously."
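The MM iteration above fits in a few lines of NumPy; this is a minimal sketch with random positive initialization, and the toy matrix V is hypothetical.

```python
import numpy as np

def nmf_mm(V, k, iters=200, eps=1e-9, seed=0):
    """Lee-Seung multiplicative updates for V ~ WH under the Euclidean
    cost; eps guards the denominators for numerical stability."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, k))
    H = rng.random((k, n))
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

V = np.random.default_rng(1).random((10, 6))
W, H = nmf_mm(V, k=3)
err = np.linalg.norm(V - W @ H, "fro")
```

Since every update multiplies by a nonnegative ratio, nonnegativity of W and H is preserved automatically.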


Other Objective Functions

‖V − WH‖²F + α J₁(W) + β J₂(H)

α and β are parameters that control the strength of the additional constraints.


Smoothing Update Rules

For example, set J₂(H) = ‖H‖²F to enforce smoothness on H and try to force uniqueness on W.⁴

Hcj ← Hcj ((WᵀV)cj − β Hcj) / ((WᵀWH)cj + ε)

Wic ← Wic ((VHᵀ)ic − α Wic) / ((WHHᵀ)ic + ε)

⁴ Piper et al., AMOS, 2004.

Sparsity

Hoyer defined sparsity as

sparseness(x) = (√n − (Σi |xi|) / √(Σi xi²)) / (√n − 1)

Zero if and only if all components have the same magnitude.

One if and only if x contains exactly one nonzero component.
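Hoyer's measure is short enough to state directly in code; the sample vectors are hypothetical.

```python
import numpy as np

def sparseness(x):
    """Hoyer's sparseness measure in [0, 1]: 0 when all components have
    equal magnitude, 1 when exactly one component is nonzero."""
    x = np.asarray(x, dtype=float)
    n = x.size
    l1 = np.abs(x).sum()
    l2 = np.sqrt((x ** 2).sum())
    return (np.sqrt(n) - l1 / l2) / (np.sqrt(n) - 1)
```

Intermediate values interpolate between those two extremes, which is what makes the measure usable as an explicit constraint.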


Visual Interpretation

Sparseness constraints are explicitly set and built into the update algorithm.

Can be applied to W , H, or both.

Dominant features are (hopefully) preserved.


[Figure: example vectors at sparseness levels 0.9, 0.7, 0.3, and 0.1.]

Sparsity Update Rules with MM

Pauca et al. implemented sparsity within MM as

Hcj ← Hcj ((WᵀV)cj − β (c₁ Hcj + c₂ Ecj)) / ((WᵀWH)cj + ε)

c₁ = (ω² − ω‖H‖₁) / (2‖H‖₂)

c₂ = ‖H‖₁ − ω‖H‖₂

ω = √(kn) − (√(kn) − 1) · sparseness(H)


Benefits of NMF

Automated cluster labeling (& clustering)

Synonym generation (features)

Possible automated ontology creation

Labeling can be applied to any hierarchy


Comparison of SVD vs. NMF

                                SVD   NMF
Solution Accuracy                A     B
Uniqueness                       A     C
Convergence                      A     C-
Querying                         A     C+
Interpretability of Parameters   A     C
Interpretability of Elements     D     A
Sparseness                       D-    B+
Storage                          B-    A


Labeling Algorithm

Given a hierarchical tree and a weighted list of terms associated with each gene, assign labels to each internal node.

Mark each gene as labeled.

For each pair of labeled sibling nodes:

Add all terms to the parent's list.
Keep the top t terms.
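The bottom-up pass can be sketched as follows, using the term weights from the labeling example figure; the nested-dict tree encoding is an assumption for illustration.

```python
from collections import Counter

def label_tree(node, t=7):
    """Merge the term lists of sibling children into the parent,
    summing weights, and keep the top t terms per internal node."""
    if "children" not in node:          # leaf gene: already labeled
        return node["terms"]
    merged = Counter()
    for child in node["children"]:
        merged.update(label_tree(child, t))
    node["terms"] = dict(merged.most_common(t))
    return node["terms"]

# Term weights taken from the slide's labeling figure.
gene1 = {"terms": {"tumor": 0.95, "cancer": 0.8, "inherited": 0.6,
                   "genetic": 0.5, "disorder": 0.35}}
gene2 = {"terms": {"Alzheimer": 0.9, "memory": 0.8, "disorder": 0.6,
                   "genetic": 0.5, "inherited": 0.45}}
parent = {"children": [gene1, gene2]}
labels = label_tree(parent)
```

Merging reproduces the parent list from the figure, e.g. inherited 0.6 + 0.45 = 1.05.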


[Figure: labeling example. Gene1: tumor 0.95, cancer 0.8, inherited 0.6, genetic 0.5, disorder 0.35. Gene2: Alzheimer 0.9, memory 0.8, disorder 0.6, genetic 0.5, inherited 0.45. Merged parent list: inherited 1.05, genetic 1, disorder 0.95, tumor 0.95, Alzheimer 0.9, memory 0.8, cancer 0.8.]

Calculating Initial Term Weights

Three different methods to calculate initial term weights:

Assign the global weight associated with each term to each document.

Calculate document-to-term similarity (LSI).

For NMF, for each document j:

Determine the top feature i (by examining H).
Assign the dominant terms from feature vector i to document j, scaled by the coefficient from H.
(Can be extended/thresholded to assign more features.)


MeSH Labeling

Many studies validate via inspection.

For automated validation, need to generate a "correct" tree labeling.

Take advantage of expert opinions, i.e., indexers from MeSH:

Create a MeSH meta-document for each gene, i.e., a list of MeSH headings.
Label the hierarchy using the global weights of the meta-collection.


Recall

From traditional IR, recall is the ratio of relevant returned documents to all relevant documents.

Extending this to trees, recall can be computed at each node.

Averaging across each tree depth level produces average recall.

Averaging the average recalls across all levels produces mean average recall (MAR).
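A small sketch of this averaging scheme; the per-node recall values below are hypothetical.

```python
def mean_average_recall(recall_by_level):
    """Average per-node recall within each tree depth level, then
    average the level averages to get mean average recall (MAR)."""
    level_averages = [sum(v) / len(v) for v in recall_by_level.values() if v]
    return sum(level_averages) / len(level_averages)

# Hypothetical per-node recalls at three tree depth levels.
recalls = {1: [1.0], 2: [0.8, 0.6], 3: [0.5, 0.5, 0.2]}
mar = mean_average_recall(recalls)
```

Averaging per level first keeps deep, node-heavy levels from dominating the overall score.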


Feature Vector Replacement

Unfortunately, the MeSH vocabulary is too restrictive, so nearly all runs produced near 0% recall.

Map the NMF vocabulary to MeSH terminology.

For each document i:

Determine j, the highest coefficient from the ith column of H.
Choose the top r MeSH headings from the corresponding MeSH meta-document.
Split each MeSH heading into tokens.
Add each token to the jth column of W′, where the weight is the global MeSH heading weight × the coefficient from H.

Result: a feature vector comprised solely of MeSH terms (a MeSH feature vector).


[Figure: average recall vs. node level (levels 2-10), plotted against the best possible recall.]

Data Sets

Five collections were available:

50TG, test set of 50 genes

115IFN, set of 115 interferon genes

3 cerebellar datasets, 40 genes of unknown relationship


Constraints

Each set was run under the given constraints for k = 2, 4, 6, 8, 10, 15, 20, 25, 30:

no constraints

smoothing W with α = 0.1, 0.01, 0.001

smoothing H with β = 0.1, 0.01, 0.001

sparsifying W with α = 0.1, 0.01, 0.001 and sparseness = 0.1,0.25, 0.5, 0.75, 0.9

sparsifying H with β = 0.1, 0.01, 0.001 and sparseness = 0.1,0.25, 0.5, 0.75, 0.9


50TG Recall

[Figure: 50TG average recall vs. node level (2-10). Best Overall 72%, Best First 83%, Best Second 76%, Best Third 83%.]

115IFN Recall

[Figure: 115IFN average recall vs. node level (2-18). Best Overall 41%, Best First 51%, Best Second 46%, Best Third 56%.]

Math1 Recall

[Figure: Math1 average recall vs. node level (2-14). Best Overall 69%, Best First 84%, Best Second 77%, Best Third 83%.]

Mea Recall

[Figure: Mea average recall vs. node level (2-10). Best Overall 64%, Best First 72%, Best Second 66%, Best Third 91%.]

Sey Recall

[Figure: Sey average recall vs. node level (2-10). Best Overall 56%, Best First 71%, Best Second 71%, Best Third 78%.]

Observations

Recall was more a function of initialization and the choice of k than of the constraint.

Often, the NMF run with the smallest approximation error did not produce the optimal labeling.

Overall, sparsity constraints did not perform well.

Smoothing W achieved the best MAR in general.

Small parameter values and larger values of k performed better.


Future Work

Lots of NMF algorithms exist, each with different strengths/weaknesses.

More complex labeling scheme.

Different weighting schemes.

Investigate hierarchical graphs.

Visualization?

Gold Standard?


SGO & CGx

This tool as implemented is called the Semantic Gene Organizer (SGO).

Much of the technology used in SGO is the basis for Computable Genomix, LLC.


SGO & CGx Screenshots

[Screenshots of the SGO and CGx interfaces.]

Acknowledgments

SGO: shad.cs.utk.edu/sgo
CGx: computablegenomix.com

Committee:

Michael W. Berry, chair

Ramin Homayouni (Memphis)

Jens Gregor

Mike Thomason

Paul Pauca, WFU
