Page 1

Center for Genes, Environment, and Health

Machine Learning, CPBS7711

Oct 8, 2015

Sonia Leach, PhD, Assistant Professor

Center for Genes, Environment, and Health, National Jewish Health

[email protected]

Page 2

Someone once said “Artificial Intelligence = Search”

so Machine Learning = ? Induction of new knowledge from experience and the ability to improve?

Machine Learning is a natural outgrowth of the intersection of Computer Science and Statistics. We might say the defining question of Computer Science is “How can we build machines that solve problems, and which problems are inherently tractable/intractable?” The question that largely defines Statistics is “What can be inferred from data plus a set of modeling assumptions, with what reliability?” The defining question for Machine Learning builds on both, but it is a distinct question. Whereas Computer Science has focused primarily on how to manually program computers, Machine Learning focuses on the question of how to get computers to program themselves (from experience plus some initial structure). Whereas Statistics has focused primarily on what conclusions can be inferred from data, Machine Learning incorporates additional questions about what computational architectures and algorithms can be used to most effectively capture, store, index, retrieve and merge these data, how multiple learning subtasks can be orchestrated in a larger system, and questions of computational tractability. We say that a machine learns with respect to a particular task T, performance metric P, and type of experience E, if the system reliably improves its performance P at task T, following experience E.

- Tom Mitchell http://www.cs.cmu.edu/~tom/pubs/MachineLearning.pdf


Also an interesting discussion of the differences among AI, ML, Data Mining, and Stats: http://stats.stackexchange.com/questions/5026/what-is-the-difference-between-data-mining-statistics-machine-learning-and-ai

Page 3

Machine Learning
• From Wikipedia:
  – 7.1 Decision tree learning
  – 7.2 Association rule learning
  – 7.3 Artificial neural networks
  – 7.4 Inductive logic programming
  – 7.5 Support vector machines
  – 7.6 Clustering
  – 7.7 Bayesian networks
  – 7.8 Reinforcement learning
  – 7.9 Representation learning
  – 7.10 Similarity and metric learning
  – 7.11 Sparse dictionary learning

• From Alpaydin, Intro to Mach Learn:
  – Supervised Learning
  – Bayesian Decision Theory
  – Parametric Methods
  – Multivariate Methods
  – Dimensionality Reduction
  – Clustering
  – Nonparametric Methods
  – Decision Trees
  – Linear Discrimination
  – Multilayer Perceptrons
  – Local Models
  – Kernel Machines
  – Bayesian Estimation
  – Hidden Markov Models
  – Graphical Models
  – Combining Multiple Learners
  – Reinforcement Learning


http://www.realtechsupport.org/UB/MRIII/papers/MachineLearning/Alppaydin_MachineLearning_2010.pdf

Page 4

Machine Learning (what I will cover)
• Unsupervised
  – Dimensionality Reduction
    • PCA
  – Clustering
    • k-Means, SOM, Hierarchical
  – Association Set Mining
  – Probabilistic Graphical Models
    • HMMs, Bayes Nets
• Supervised
  – k-Nearest Neighbor
  – Neural Nets
  – Decision Trees/Random Forests
  – SVMs
  – Naïve Bayes
• Issues
  – Regression/Classification
  – Feature selection/reduction
  – Missing data
  – Boosting/bagging/jackknife
  – Cross validation, generalization
  – Model selection

Connections to other lectures: Miller (HMM), Pollock (HMM), Leach (HMM), Lozupone (PCA, Feature Importance Scores, Clustering), Kechris (Regression), [Hunter (Knowledge-Based Analysis), Cohen (BioNLP), Phang (Expr Analysis) ….]

R: http://cran.r-project.org/web/views/MachineLearning.html

Page 5

Machine Learning
• Supervised Learning
  – training set = both inputs and correct answers
  – Example: classification into predefined classes for which examples of labeled data are known
  – Similar to optimizing an error function that measures the difference between the true answers and the answers given by the learner
• Unsupervised Learning
  – training set = just input data
  – Example: grouping data into categories based on similarities among them
  – Relies on statistical properties of the data when trying to extract models of the data
  – Does not use an error concept but a model-quality concept, which should be maximized

http://slideplayer.com/slide/4040706/

Page 6

Unsupervised Learning


Page 7

Dimensionality Reduction: Principal Components Analysis (PCA)
• Motivation: Instead of considering all variables, use a small number of linear combinations of those variables with minimum information lost

http://blog.peltarion.com/2006/06/20/the-talented-drhebb-part-2-pca/

[Figure: 2D data. If we could only choose 1 of the original variables to represent the data, choose the y-axis, which explains more variance in the data; the amount of variance explained by principal component P1 > the amount explained by Y.]

Page 8

Principal Components Analysis (PCA)

• Let X=(x1,x2,…,xn) be a random vector (mean vector μ, covariance matrix Σ)
  – Example X=(height, weight, GPA, credit score)

• Each element of vector has a distribution over the population (i.e. xi is a random variable)

• A dataset is a set of samples from the joint distribution of X

• μ = (66, 179, 3.0, 687) and Σ is the 4×4 covariance matrix:

      | σ²ht       σht,wt     σht,gpa    σht,fica  |
  Σ = | σwt,ht     σ²wt       σwt,gpa    σwt,fica  |
      | σgpa,ht    σgpa,wt    σ²gpa      σgpa,fica |
      | σfica,ht   σfica,wt   σfica,gpa  σ²fica    |

Sample   Ht  Wt   GPA   FICA
Bob      73  185  3.3   610
Anna     62  105  3.7   730
Therese  69  137  2.89  717
Jacob    76  210  4.0   780

Note: Ht & Wt usually strongly correlated

Page 9

Principal Components Analysis (PCA)

• If X=(x1,x2,…,xn) is a random vector (mean vector μ, covariance matrix Σ), then the principal component transformation is
  X → Y = Γᵀ(X−μ) s.t. Γ is orthogonal, ΓᵀΣΓ = Λ is diagonal, and λ1 ≥ λ2 ≥ … ≥ λp ≥ 0.
– Linear orthogonal transform of original data to a new coordinate system
– each component is a linear combination of the original variables
  • coefficients of variables in the linear combo = Loadings
  • data transformed to the new coords = Scores
– components ordered by percentage of variance explained along the new axis
– number of components = minimum dimension of the input data matrix
– set of orthogonal vectors not unique, not scale-invariant (covariance vs correlation); computed by eigenvalue decomposition (as above; R princomp) or singular value decomposition (SVD; R prcomp)

Adapted from S-plus Guide to Statistics

Page 10

Principal Components Analysis (PCA)
• If X is a random vector (mean μ, covariance matrix Σ), then the principal component transformation is X → Y = Γᵀ(X−μ) s.t. Γ is orthogonal, ΓᵀΣΓ = Λ is diagonal, and λ1 ≥ λ2 ≥ … ≥ λp ≥ 0.

Adapted from S-plus Guide to Statistics

X:
     diffgeom complex algebra reals stats
 1       36      58      43     36    37
 2       62      54      50     46    52
 3       31      42      41     40    29
 4       76      78      69     66    81
 5       46      56      52     56    40
 6       12      42      38     38    28
 7       39      46      51     54    41
 8       30      51      54     52    32
 9       22      32      43     28    22
 10       9      40      47     30    24

What if we could only choose two dimensions?

Page 11

Principal Components Analysis (PCA)
• If X is a random vector (mean μ, covariance matrix Σ), then the principal component transformation is X → Y = Γᵀ(X−μ) s.t. Γ is orthogonal, ΓᵀΣΓ = Λ is diagonal, and λ1 ≥ λ2 ≥ … ≥ λp ≥ 0.

Adapted from S-plus Guide to Statistics

X: (same table as Page 10)

Y (scores):
        Comp.1     Comp.2    Comp.3    Comp.4     Comp.5
 [1,]  -2.292745   5.827588  8.966977 -7.1630488 -2.2195936
 [2,]  25.846460  13.457048 -3.257987  0.5344066  0.4777994
 [3,] -14.856875   4.337867 -4.057297 -2.5308172  1.4998247
 [4,]  70.434116  -3.286077  6.423473  3.9571310  0.8815369
 [5,]  13.768664  -4.392701 -6.058773 -4.7551497 -2.2951908
 [6,] -28.899236  -4.611347  4.338621 -2.2710490  6.7118075
 [7,]   5.216449  -4.536616 -7.625423  2.2093319  3.2618335
 [8,]  -3.432334 -11.115805 -3.553422 -0.9908949 -4.1604420
 [9,] -31.579207   8.354892 -2.497369  5.6986938 -1.9742069
[10,] -34.205292  -4.034848  7.321199  5.3113963 -2.1833687

Component Importance:
                       Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
Standard deviation     30.142  7.179  5.786  4.098  3.084
Proportion of Variance  0.890  0.050  0.032  0.016  0.009
Cumulative Proportion   0.890  0.941  0.974  0.990  1.000

Γ (loadings; blank entries are near-zero loadings suppressed by the print method):
           Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
diffgeom    0.638  0.599 -0.407 -0.112 -0.237
complex     0.372 -0.230  0.593 -0.595 -0.320
algebra     0.240 -0.371  0.645 -0.624
reals       0.333 -0.671 -0.557 -0.234  0.271
statistics  0.535  0.414  0.404  0.615

EXAMPLE IN R
X = read.table('pca.input', sep=" ", header=TRUE)
pc = princomp(X)
mu = pc$center
Gamma = pc$loadings
Y = pc$scores
XminusMu = sweep(X, MARGIN=2, mu, FUN="-")
propOfVar = pc$sdev^2 / sum(pc$sdev^2)
eigenVals = pc$sdev^2   # ≈ λi

Page 12

Principal Components Analysis (PCA)

Adapted from S-plus Guide to Statistics

(X, Y scores, component importance, and loadings as on Pages 10-11)

## Verify Y = (X-mu)*Gamma
unique(Y - as.matrix(XminusMu) %*% Gamma)
## Verify X repr. by Comp. i == Y[,i]
par(mfrow=c(2,1), pty="s")
biplot(pc)
plot(Y[,1], Y[,2], col="white")
text(Y[,1], Y[,2], 1:10)

Biplot arrows for original variables:
  Length = proportion of variance explained in the 2 plotted components
  Direction = relative loadings in the 2 components
  ex) diffgeom largest (++,++); algebra smallest (+,−)

Page 13

Principal Components Analysis (PCA)
• If X is a random vector (mean μ, covariance matrix Σ), then the principal component transformation is X → Y = Γᵀ(X−μ) s.t. Γ is orthogonal, ΓᵀΣΓ = Λ is diagonal, and λ1 ≥ λ2 ≥ … ≥ λp ≥ 0.

Adapted from S-plus Guide to Statistics

X: (same table as Page 10) What if we could only choose two dimensions?

Page 14

Clustering
• Partitioning
  – Must specify number of clusters
  – K-Means, Self-Organizing Maps (SOM/Kohonen Net)
• Hierarchical Clustering
  – Do not need to specify number of clusters
  – Need to specify distance metric and linkage method
• Other approaches
  – Fuzzy clustering (probabilistic membership)
  – Spectral clustering (using eigenvalue decomposition)

Page 15

Clustering

http://apandre.wordpress.com/visible-data/cluster-analysis/

Page 16

http://stackoverflow.com/questions/4722290/generating-synthetic-datasets

R package: mlbench: Machine Learning Benchmark Problems

Page 17

k-Means
• Initialize: Select the initial k Centroids
– REPEAT
  • Form k clusters by assigning all points to the 'closest' Centroid
  • Recompute the Centroid for each cluster
– UNTIL the Centroids don't change or all changes are below a predefined threshold
• Initial Centroids are random vectors, randomly selected among the data vectors, the first k vectors, etc., or computed from a random 1st assignment
• 'closest' typically defined by Euclidean distance (Voronoi diagram):
  distE(x,y) = distE(y,x) = √( Σi=1..n (xi − yi)² )
• Prone to local maxima, so typically do N random restarts and take the best (min sum of distE² to centroids); a sketch follows below
• In practice, favors separated spherical clusters

Images from wikipedia
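A minimal k-means sketch in base R on assumed toy 2D data; nstart supplies the random restarts described above:

set.seed(1)
x <- rbind(matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 2, sd = 0.3), ncol = 2))
km <- kmeans(x, centers = 2, nstart = 25)  # keep the best of 25 random starts
km$tot.withinss                            # sum of squared distances to centroids
plot(x, col = km$cluster); points(km$centers, pch = 8, cex = 2)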

Page 18

k-Means

http://en.wikipedia.org/wiki/K-means_clustering

Iteration 0 Iteration 1 Iteration 2

Iteration 3 Iteration 4 Iteration 5

Images from wikipedia

Page 19

Self-Organizing Maps (SOM)
• Similar to k-Means: goal is to assign data to the map node (e.g. Centroid in k-Means) with the 'closest' weight vector to the data-space vector (minimize distE(x,w))
• Difference: map nodes are constrained by neighborhood relationships, whereas k-Means Centroids move freely
• Must input an initial topology; the map 'stretches' to cover nD data in 2D, with similar data assigned to neighboring map nodes

Image from wikipedia

Page 20

Self-Organizing Maps (SOM)
• 1. Initialization – Choose random values for the initial weight vectors wj.
• 2. Sampling – Draw a sample training input vector x from the input space.
• 3. Matching – Find the winning neuron I(x) with weight vector closest to the input vector (i.e., min distE).
• 4. Updating – Apply the weight update equation Δwji = η(t) Tj,I(x)(t) (xi − wji), where η(t) = learning rate @ time t* and Tj,I(x)(t) = neighborhood @ time t.
• 5. Continuation – Keep returning to step 2 until the feature map stops changing. (A sketch follows below.)

http://www.sciencedirect.com/science/article/pii/S0014579399005244
* Informal intro to simulated annealing, gradient descent…
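A sketch with the kohonen package (assumed installed), fitting a small hexagonal map to the numeric iris features:

library(kohonen)
set.seed(1)
x <- scale(iris[, 1:4])                        # input vectors in data space
m <- som(x, grid = somgrid(xdim = 5, ydim = 5, topo = "hexagonal"),
         rlen = 200)                           # rlen = training iterations
plot(m, type = "mapping", col = as.integer(iris$Species), pch = 19)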

Page 21

Self-Organizing Maps (SOM)


http://www.sciencedirect.com/science/article/pii/S0014579399005244

Page 22

Self-Organizing Maps (SOM)
• Initial grids
  – Wrt size: 1-dimensional, 2-dimensional, 3-dimensional
  – Wrt structure: Rectangular, Hexagonal, Arbitrary planar

http://www.cis.hut.fi/somtoolbox/documentation/grids.gif
http://slideplayer.com/slide/4040706/

Page 23

Self-Organizing Maps (SOM)

Example: Clustering Gene Expression Profiles

http://physiolgenomics.physiology.org/content/physiolgenomics/10/2/103/F2.large.jpg

Page 24

Hierarchical Clustering
• Divisive – (top down) start with all points in 1 cluster, successively sub-divide 'farthest' points until full tree
• Agglomerative – (bottom up) start with each point in its own cluster (singleton), merge 'closest' pair of clusters at each step until root
  – Requires metric to define 'closest' – distance no longer between points, but between clusters
  – Linkage strategy for which merge is often based on pairwise point comparisons
• Dendrogram shows order of splits (a sketch follows below)
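A minimal agglomerative sketch in base R on assumed toy data:

set.seed(1)
x <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
           matrix(rnorm(20, mean = 3), ncol = 2))
hc <- hclust(dist(x), method = "average")  # Euclidean distance, average linkage
plot(hc)                                   # dendrogram shows the merge order
cutree(hc, k = 2)                          # cut the tree into 2 clusters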

Page 25

Hierarchical Clustering

http://images.slideplayer.com/11/3289326/slides/slide_7.jpg

Page 26

Distance Metrics
• Euclidean – distance in Euclidean space
• Pearson Correlation – linear relationships
• Spearman Correlation – monotonic relationships
• Mutual Information – non-linear relationships
• Polyserial Correlation – correlation continuous vs ordinal (polychoric if ordinal vs ordinal)
• Hamming Distance, Jaccard, Dice (binary variables)

distE(x,y) = √( Σi=1..n (xi − yi)² )

distP(x,y) = 1 − [ Σi=1..n (xi − x̄)(yi − ȳ) ] / [ √(Σi=1..n (xi − x̄)²) · √(Σi=1..n (yi − ȳ)²) ]

distS(x,y) = same form as distP, computed on ranks rz = rank(z):
  1 − [ Σi (rxi − r̄x)(ryi − r̄y) ] / [ √(Σi (rxi − r̄x)²) · √(Σi (ryi − r̄y)²) ]

distMI(x,y) = H(x,y) − MI(x,y), where
  H(x) = −Σx px log px,  H(x,y) = −Σx,y px,y log px,y, and
  MI(x,y) = H(x) + H(y) − H(x,y)

distJ = (M01 + M10) / (M01 + M10 + M11)   Jaccard; good when 0 gives no info

distH = M01 + M10                         Hamming

distD = 1 − 2|X∩Y| / (|X| + |Y|)          Dice; like Jaccard but 2*Matches
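A quick base-R contrast of three of these metrics on an assumed monotonic but non-linear pair:

x <- 1:5
y <- 2^x                              # monotonic in x, but not linear
dist(rbind(x, y))                     # Euclidean distance
1 - cor(x, y, method = "pearson")     # Pearson-based distance: small (strong linear trend)
1 - cor(x, y, method = "spearman")    # Spearman-based distance: 0 (perfectly monotonic)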

Page 27

Distance Metrics
• Euclidean vs Pearson (linear) vs Spearman (monotonic)

              A    B    C    D     E     F
A  Pearson    1    1   -1   0.8   0     0
A  Spearman   1    1   -1   1     0     0
A  EucDist    0    8    9   6    17    19
B  EucDist    8    0    1   6    22    23
C  Pearson   -1   -1    1  -0.7   0     0
E  Pearson    0    0    0   0.3   1     0.85
E  Spearman   0    0    0   0     1     0.91

Numbers are Pearson correlation (unless labeled EucDist or Spearman)
Note Pearson invariant to slope; Pearson = 0 if non-linear

Page 28

Linkage Methods
• Single Linkage: argminS,T min s∈S,t∈T dist(s,t)
• Complete Linkage: argminS,T max s∈S,t∈T dist(s,t)
• Average Linkage (a.k.a. group average): argminS,T average s∈S,t∈T dist(s,t)
• Centroid Linkage (people err after the Eisen et al. 1998 Treeview paper and think = Average Linkage!)
  – min dist(centroid(S), centroid(T))
• Ward's Linkage (optimizes same criterion as kMeans)
• UPGMA (Unweighted Pair Group Method with Arithmetic Mean) from Lozupone lecture – assumes constant rate of evolution; average linkage, Euclidean distance

(A comparison sketch follows below.)
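A sketch showing how the linkage choice changes the dendrogram on the same assumed data:

set.seed(1)
x <- matrix(rnorm(40), ncol = 2)
d <- dist(x)                               # Euclidean distances
par(mfrow = c(1, 3))
for (m in c("single", "complete", "ward.D2"))
  plot(hclust(d, method = m), main = m, xlab = "", sub = "")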

Page 29

http://stackoverflow.com/questions/4722290/generating-synthetic-datasets

R package: mlbench: Machine Learning Benchmark Problems

Page 30

Loadings:
          Comp.1 Comp.2
Murder     -0.53   0.41
Assault    -0.58   0.18
UrbanPop   -0.27  -0.87
Rape       -0.54  -0.16

            Murder Assault UrbanPop  Rape
Alabama       13.2     236       58  21.2
Alaska        10.0     263       48  44.5
Arizona        8.1     294       80  31.0
Arkansas       8.8     190       50  19.5
California     9.0     276       91  40.6
Colorado       7.9     204       78  38.7

Page 31

(Same loadings and data as Page 30.)

Page 32


Page 33

Choosing the Number of Clusters
• Rule of thumb: k = √(n/2)
• Elbow or knee method (bend in plot of metric)
• K-means likes spherical clusters, so minimize within-cluster variation W(K) (SSE: sum of distances of all points to their cluster mean), or maximize between-cluster variation B(K) (distance between clusters), or both: CH(K) = [B(K)/(K−1)] / [W(K)/(n−K)]*
• Gap Statistic**
  – Calculate SSE; randomize the dataset and calculate SSErand, n times; gap = log(mean SSErand / SSE)
• Hierarchical – plot the distance chosen at each merge (okay for single, complete)

(A sketch follows below.)

See also http://cran.r-project.org/web/packages/clusterCrit/vignettes/clusterCrit.pdf for a long list of indices, the NbClust R package: http://cedric.cnam.fr/fichiers/art_2579.pdf and http://www.stat.cmu.edu/~ryantibs/datamining/lectures/06-clus3.pdf

[Plots: W(K), B(K), CH(K), and Gap(K) vs number of clusters K]

* Calinski & Harabasz 1974
** Tibshirani, Walther, Hastie 2001
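An elbow-plot and gap-statistic sketch with kmeans and cluster::clusGap (cluster package assumed installed):

library(cluster)
set.seed(1)
x <- rbind(matrix(rnorm(100, 0), ncol = 2), matrix(rnorm(100, 4), ncol = 2))
wss <- sapply(1:8, function(k) kmeans(x, k, nstart = 20)$tot.withinss)
plot(1:8, wss, type = "b", xlab = "K", ylab = "W(K)")  # look for the elbow
gs <- clusGap(x, FUNcluster = kmeans, K.max = 8, B = 50, nstart = 20)
plot(gs)                                               # pick K maximizing the gap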

Page 34

Association Set Mining
• Also known as Market Basket Analysis
  {milk, eggs} ⇒ {butter}
• Support of itemset X
  supp(X) = # transactions with itemset X
• Confidence of rule
  conf(X ⇒ Y) = supp(X & Y) / supp(X)
• Lift of rule (performance over assuming independence)
  lift(X ⇒ Y) = supp(X & Y) / (supp(X) × supp(Y))
• Want rules with max supp, conf, lift (see the sketch below)
• Other measures found at: http://michael.hahsler.net/research/association_rules/measures.html
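A sketch with the arules package (assumed installed) on its bundled Groceries transactions:

library(arules)
data(Groceries)
rules <- apriori(Groceries, parameter = list(supp = 0.01, conf = 0.5))
inspect(head(sort(rules, by = "lift"), 3))  # top 3 rules by lift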

Page 35

Association Set Mining
• Tables of data are converted to transactions by creating binary variables for all categories of all variables (must discretize continuous variables; missing data okay)

ID     Gender  Age  Height (inches)  Race       Diagnosis
CC245  Male      6  25               Caucasian  Depression
CC346  Male     75  60               African    COPD
CC978           30  54               Asian      Obesity
CC125  Female   15  54               African

{ {gender_M=Y, age_child=Y, height_20-29=Y, race_WH=Y, diag_depr=Y},
  {gender_M=Y, age_senior=Y, height_60-69=Y, race_BL=Y, diag_copd=Y},
  {age_adult=Y, height_50-59=Y, race_AS=Y, diag_obes=Y},
  {gender_F=Y, age_adol=Y, height_50-59=Y, race_BL=Y} }

Page 36

Association Set Mining
Example in R: arules pkg, apriori algorithm

    lhs                                   rhs            support confidence lift
1   {Class=2nd, Age=Child}             => {Survived=Yes} 0.011   1.000      3.097
2   {Class=2nd, Sex=Female, Age=Child} => {Survived=Yes} 0.006   1.000      3.096
3   {Class=1st, Sex=Female}            => {Survived=Yes} 0.064   0.972      3.010
4   {Class=1st, Sex=Female, Age=Adult} => {Survived=Yes} 0.064   0.972      3.010
…
12  {Sex=Female, Survived=Yes}         => {Age=Adult}    0.143   0.918      0.966
27  {Class=2nd}                        => {Age=Adult}    0.118   0.915      0.963

Note that rule 2 is subsumed by rule 1, which has better lift (and support) – can remove redundant rules

Page 37

Probabilistic Graphical Models

[Figure: taxonomy of dynamic models over Time, by Observability, Utility, and both:
  Markov Process (MP): states Xt−1 → Xt
  Hidden Markov Model (HMM): hidden states Xt−1 → Xt with observations Ot−1, Ot
  Markov Decision Process (MDP): states Xt−1 → Xt with actions At−1, At and utilities Ut−1, Ut
  Partially Observable Markov Decision Process (POMDP): MDP plus observations Ot−1, Ot]

Page 38

Hidden Markov Model
• Finite set of N states X
• Finite set of M observations O
• Parameter set λ = (A, B, π)
  – Initial state distribution πi = Pr(X1 = i)
  – Transition probability aij = Pr(Xt = j | Xt−1 = i)
  – Emission probability bik = Pr(Ot = k | Xt = i)
• Given observation sequence O = O1,O2,…,On, how do we compute Pr(O | λ)?

Example: N = 3, M = 2, π = (0.25, 0.55, 0.2)
[Figure: 3-state transition diagram over states st1-st3, with the 3×3 transition matrix A and the 3×2 emission matrix B over obs1, obs2]

Page 39

• Probability of O is the sum over all state sequences:
  Pr(O|λ) = ∑all X Pr(O|X, λ) Pr(X|λ)
          = ∑all X πx1 bx1,o1 ax1,x2 bx2,o2 … axT−1,xT bxT,oT
• At each t there are N states to reach, so N^T possible state sequences and ~2T multiplications per sequence, i.e. O(2T·N^T) operations
• So 3 states, length-10 sequence = 1,180,980 operations, and length 20 ≈ 1e11!
• Efficient dynamic programming algorithm: Forward algorithm (Baum & Welch), O(N²T)

(Example and parameter definitions as on Page 38: πi = Pr(X1 = i), aij = Pr(Xt = j | Xt−1 = i), bik = Pr(Ot = k | Xt = i))

Page 40

Applications in Bioinformatics
• DNA – motif matching, gene matching, multiple sequence alignment
• Amino Acids – domain matching, fold recognition
• Microarrays/Whole Genome Sequencing – assign copy number
• ChIP-chip/seq – distinct chromatin states

Page 41

Bayesian Networks
• Given a set of random variables, the joint probability distribution can be represented by:
  – Structure: Directed Acyclic Graph (DAG)
    • variables are nodes; absence of arcs captures conditional independencies
  – Parameters: Local Conditional Probability Distributions (CPDs)
    • conditional probability of variable given values of parents in graph
• Joint probability factors into product of local CPDs:
  Pr(X1, X2, …, Xn) = ∏i=1 to n Pr(Xi | Parents(Xi))

Page 42

Bayesian Networks
• Generally can think of directed arcs as 'causal' (be careful!)
  – If the sprinkler is on OR it is raining, then the grass will be wet: Pr(W|S,R)
• If we observe wet grass, we can determine whether it is because of the sprinkler or the rain
  – Pr(R|W) and Pr(S|W)
  – Bayes rule: Pr(X|Y) = Pr(Y|X)Pr(X)/Pr(Y)
• Note S and R compete to explain W: this model says sprinkler usage is (conditionally) independent of rain, but if we know the grass is wet and it is raining, then it is less likely that the sprinkler being on is the explanation for W
  – Pr(S|W,R) < Pr(S|W): "explaining away" (see the sketch below)
• Note only need 9 parameters instead of 2^4 = 16

http://www.cs.ubc.ca/~murphyk/Bayes/bnintro.html
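A quick numeric check of "explaining away" by brute-force enumeration of the joint; the CPT values follow the water-sprinkler example in the cited Murphy tutorial and are assumptions here:

g <- expand.grid(C = c(FALSE, TRUE), S = c(FALSE, TRUE),
                 R = c(FALSE, TRUE), W = c(FALSE, TRUE))
pS <- ifelse(g$C, 0.1, 0.5)                               # Pr(S=T | C), assumed CPT
pR <- ifelse(g$C, 0.8, 0.2)                               # Pr(R=T | C), assumed CPT
pW <- ifelse(g$S & g$R, 0.99, ifelse(g$S | g$R, 0.9, 0))  # Pr(W=T | S,R), assumed CPT
g$p <- 0.5 * ifelse(g$S, pS, 1 - pS) * ifelse(g$R, pR, 1 - pR) *
  ifelse(g$W, pW, 1 - pW)                                 # joint = product of CPDs
sum(g$p[g$S & g$W]) / sum(g$p[g$W])                       # Pr(S|W)   ~ 0.43
sum(g$p[g$S & g$W & g$R]) / sum(g$p[g$W & g$R])           # Pr(S|W,R) ~ 0.19: explained away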

Page 43

Applications in Bioinformatics
• Gene regulatory networks (Friedman et al., 2000, PMID: 11108481)
• Predicting clinical outcomes using expression data (Gevaert et al., 2006, PMID: 16873470)
• Determining regulators with PRMs (Segal et al., 2002, RECOMB)
• Gene function prediction (Troyanskaya et al., 2003, PMID: 12826619)
• Hanalyzer – edge scores (Leach et al., 2009, PMID: 19325874)

Page 44

Supervised Learning


Page 45

Supervised Learning
• Given examples (x,y) of input features x and output variable y, learn a function f(x)=y
  – Regression (continuous response) vs Classification (discrete response)
  – Dimensionality Reduction (Feature selection vs extraction)
  – Cross validation (Leave-One-Out vs N-Fold)
  – Generalization (Training set error vs Test set error)
  – Missing data and Imputation
  – Model Selection (AIC, BIC)
  – Boosting/bagging/jackknife
  – Curse of dimensionality

Page 46

Supervised Learning
• Boosting (weak learners on different subsets)
  – Train H1 on a random data split; sample among H1's predictions so the next data set used to train H2 is half wrong, half right under H1. Train H3 where both H1 and H2 are wrong. Return majority vote of H1, H2, H3 (Adaboost weights examples, weighted vote)
• Bagging (bootstrap aggregating)
  – Train multiple models on random with-replacement (bootstrap) splits of the input data, average predictions
• Jackknife (vs bootstrap) – disjoint subsets of data
• Model Selection: balance goodness of fit (likelihood L) with complexity of model (number of parameters k) for n samples
  – Bayesian information criterion (BIC): minimize k ln(n) − 2 ln(L)
  – Akaike information criterion (AIC): minimize 2k − 2 ln(L) (less strong penalty, better theory than BIC)
• Curse of dimensionality – the greater the dimension D, the sparser data samples are in covering the space, so need more & more data to learn properly

(An AIC/BIC sketch follows below.)
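A base-R sketch comparing a small and a larger linear model by AIC/BIC (mtcars ships with R):

fit1 <- lm(mpg ~ wt, data = mtcars)                     # few parameters
fit2 <- lm(mpg ~ wt + hp + disp + drat, data = mtcars)  # more parameters
AIC(fit1, fit2)  # 2k − 2 ln(L)
BIC(fit1, fit2)  # k ln(n) − 2 ln(L): penalizes extra parameters more once n > 7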

Page 48

k-Nearest Neighbors
• Store a database of (x,y) pairs; classify a new example by majority vote of its k nearest neighbors (regression if assigning the (weighted) mean y in the neighborhood); a sketch follows below
• No training needed, non-parametric, sensitive to local structure in data, frequent class tends to dominate
• Curse of dimensionality: with many variables, any query is equidistant to all points – reduce features by PCA
• Allows complicated boundaries between classes

If k=3, (green, red); if k=5, (green, blue)
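A sketch with class::knn (the class package ships with R):

library(class)
set.seed(1)
tr <- sample(nrow(iris), 100)
pred <- knn(train = iris[tr, 1:4], test = iris[-tr, 1:4],
            cl = iris$Species[tr], k = 3)  # majority vote of 3 nearest neighbors
table(pred, iris$Species[-tr])             # confusion matrix on held-out data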

Page 49

Neural Network: Linear Perceptron
• Learning (Backpropagation):
  • Initialize wt, choose learning rate η
  • 1) Calculate prediction y*j,t = f[wt · xj]
  • 2) Update weights wt+1 = wt + η (yj − y*j,t) xj
  • Repeat 1 & 2 until (yj − y*j,t) < threshold
– Can be generalized to multi-class
– Optimal only if data linearly separable (a sketch follows below)

[Figure: step activation function vs linear output]
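A minimal base-R training loop for the update rule above, on assumed linearly separable toy data:

set.seed(1)
n <- 100
X <- cbind(1, matrix(rnorm(2 * n), n, 2))        # leading 1 = bias term
y <- ifelse(X[, 2] + X[, 3] > 0, 1, -1)          # separable labels
w <- rep(0, 3); eta <- 0.1
for (epoch in 1:25)
  for (j in 1:n) {
    yhat <- ifelse(sum(w * X[j, ]) >= 0, 1, -1)  # step activation f
    w <- w + eta * (y[j] - yhat) * X[j, ]        # update only on mistakes
  }
mean(ifelse(X %*% w >= 0, 1, -1) == y)           # training accuracy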

Page 50

Neural Network: Multi-Layer Perceptron

• Smooth activation function (sigmoid, tanh) instead
• Can also have multiple hidden layers
• Can learn when data not linearly separable
• Learn like before, but backpropagate errors from the output layer

Input layer → Hidden layer → Output layer

Page 51

Decision Tree
• Node is attribute tested, branch is outcome, leaf is (majority) class (prob)
• Discrete: X = xi?  Real-valued: X < value?
• Greedy algorithm chooses the best attribute to split upon:
  – pi = fraction of items labeled i in set
  – Gini impurity: IG(p) = ∑i≠j pi pj
    (prob an item labeled i is chosen × prob i is mistakenly assigned class j)
  – Information gain (entropy): IE(p) = −∑i pi log2 pi
  – Real-valued response: SSE
• EASY TO INTERPRET!!! Can overfit, needs a large tree for XOR, biased in favor of attributes with more levels => ensembles (an rpart sketch follows below)

[Figure: example clinical decision tree with Y/N tests on BIOPSY+, Rx SIDE EFFECT, BREATH>90%, and BREATH<30%; leaves list Died/Alive counts such as Died: 3 / Alive: 27, Died: 15 / Alive: 15, Died: 20 / Alive: 57, Died: 30 / Alive: 7, and Died: 80 / Alive: 1]
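A sketch with rpart (ships with R), growing a small classification tree:

library(rpart)
fit <- rpart(Species ~ ., data = iris, method = "class")
print(fit)                          # splits, n, loss, predicted class per node
plot(fit); text(fit, use.n = TRUE)  # use.n shows class counts at each leaf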

Page 52

Random Forest
• Classifier consisting of an ensemble of decision trees {h(x, Θk)} where Θk is some i.i.d. random vector and each tree casts a vote for the class of x (Breiman 2001)
  1. Bagging – Θk is a random selection of N samples (with replacement) to grow the tree
  2. Dietterich 98: Θk is a random split among the n best splits
  3. Ho 98: Θk is a random subset of features to grow the tree (√k)
  4. Adaboost-like: Θk is random weights on the examples
  – 4 better than {2,3} better than 1 on generalization error
• Out-of-bag estimates: internal estimates of generalization error, classifier strength, and correlation between trees

Page 53

Random Forest
• Most popular implementation of {h(x, Θk)}: bagging (random subset of samples w/ repl.) + random subset of features
  – If the set of features is small, trees are more correlated, so can make new features as random linear combinations of the orig. features
• Out-of-bag classifier for a specific {x,y} = aggregate over trees that didn't use {x,y} as training data (removes need for setting aside test data)
• Out-of-bag estimate is the error rate of the out-of-bag classifier on the training set (can also estimate OOB strength and correlation)
• Can estimate variable importance from OOB estimates
  – For the m-th variable, permute its values and compare the misclassification rate of OOB classifiers on the 'noised-up' data with OOB on the real data; a large increase implies the m-th variable is important (see the sketch below)
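A sketch with the randomForest package (assumed installed):

library(randomForest)
set.seed(1)
rf <- randomForest(Species ~ ., data = iris, ntree = 500, importance = TRUE)
print(rf)        # includes the out-of-bag (OOB) error estimate
importance(rf)   # permutation-based variable importance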

Page 54

Support Vector Machine (SVM)
• Support vectors are the points that lie closest to the decision surface; they maximize the 'margin' of the hyperplane separating the examples (the solution changes if SVs are removed)
• Kernel function – maps not-linearly-separable data to a transformed space where the transformed data is linearly separable
• Advantages: non-probabilistic, optimization rather than greedy search, not affected by local minima, theoretical guarantees of performance, escapes the curse of dimensionality

Page 55

Support Vector Machine (SVM)
• Distance between H and H1 is 1/||w||, so to maximize the margin, minimize ||w|| (where ||w|| = √(∑i wi²))
  s.t. no points between H1 & H2:
    xi·w + b ≥ +1 when yi = +1
    xi·w + b ≤ −1 when yi = −1
  i.e., yi(xi·w + b) ≥ 1
• Quadratic program (constrained optimization, solved by (dual of) Lagrange multipliers):
  Max L = ∑i αi − ½ ∑i,j αi αj yi yj xi·xj  s.t.  w = ∑i αi yi xi and ∑i αi yi = 0
• If not linearly separable, use a transformation to a space where the data is linearly separable, via kernels, i.e. Φ(xi) not xi
• If use L1-norm (not L2 above), weights = variable importance

http://www.cs.ucf.edu/courses/cap6412/fall2009/papers/Berwick2003.pdf
http://books.nips.cc/papers/files/nips16/NIPS2003_AA07.pdf

Page 56

Support Vector Machine (SVM)

Not separable by a linear function, but can be by a quadratic one

Common kernels:
• Polynomial (p = 1 linear): K(x, x′) = (x·x′ + 1)^p
• Radial basis function (Gaussian): K(x, x′) = exp( −||x − x′||² / (2σ²) )
• ~sigmoid (like Neural Net): K(x, x′) = tanh(κ x·x′ − δ)
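A sketch with e1071::svm (assumed installed) on the mlbench spirals, which no linear function separates:

library(e1071)
library(mlbench)
set.seed(1)
d <- mlbench.spirals(300, cycles = 1.5, sd = 0.05)
df <- data.frame(x = d$x, class = d$classes)
fit <- svm(class ~ ., data = df, kernel = "radial", gamma = 1, cost = 10)
table(predict(fit, df), df$class)  # training confusion matrix
# plot(fit, df)                    # decision-boundary plot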

Page 57

Other Useful Kernel Functions
• Use of kernels allows complex data types to be used in SVMs w/o having to translate into real-valued, fixed-length vectors: K: D × D → R
• String kernel: compare two sequences
• Graph kernel: compare two nodes in a graph, or two graphs
• Image kernels: compare two images
• and so on … (any symmetric, positive semi-definite matrix is a kernel)

Page 58


Page 59

Naïve Bayes
• Recall Bayes rule:
  Pr(X|Y) = Pr(Y|X)Pr(X) / Pr(Y)
• Classifier:
  Pr(C|F1,…,Fn) = Pr(C) Pr(F1,…,Fn|C) / Pr(F1,…,Fn)
  – Note denominator does not depend on C (effectively a constant Z)
  – "Naïve" assumption because we assume Fi, Fj independent given C
  – Simplifies calculation:
    Pr(C|F1,…,Fn) = 1/Z × Pr(C) ∏i Pr(Fi|C)
• Learn parameters Pr(C) & each Pr(Fi|C) by maximum likelihood (multinomial, Gaussian, …)
  – Can learn each Pr(Fi|C) independently; escapes the curse of dimensionality, dataset need not scale with # Fi (see the sketch below)

[Figure: class node C with arcs to features F1, F2, F3, …, Fn]
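A sketch with e1071::naiveBayes (assumed installed); continuous features get Gaussian Pr(Fi|C):

library(e1071)
nb <- naiveBayes(Species ~ ., data = iris)
table(predict(nb, iris), iris$Species)  # resubstitution confusion matrix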

Page 60


Page 61

Examples in R
• Making 2D datasets
  – Install libraries: mlbench
• Clustering (Hierarchical, K-Means, SOM)
  – Install libraries: kohonen
• Classification (kNN, NN, DT, SVM, NB)
  – Install libraries: class (if R > 3.0, o/w knn), neuralnet, rpart, e1071

Page 62

http://stackoverflow.com/questions/4722290/generating-synthetic-datasets

R package: mlbench: Machine Learning Benchmark Problems

Page 63

Additional References

• Logit regression example: http://www.ats.ucla.edu/stat/r/dae/logit.htm
• PCA: http://www.cs.princeton.edu/picasso/mats/PCA-Tutorial-Intuition_jp.pdf
• Statistical Pattern Recognition Toolbox for Demos: http://cmp.felk.cvut.cz/cmp/software/stprtool/examples.html
• KMeans: https://onlinecourses.science.psu.edu/stat857/node/125
• SOMs:
  – http://www.cs.bham.ac.uk/~jxb/NN/l16.pdf
  – http://www.loria.fr/~rougier/coding/article/article.html
  – http://www.sciencedirect.com/science/article/pii/S0014579399005244
• Distance metrics:
  – http://www.statmethods.net/stats/correlations.html
  – http://people.revoledu.com/kardi/tutorial/Similarity/index.html – nice discussion of differences
  – http://www.datavis.ca/papers/corrgram.pdf – make visual panel (like heatmap) of correlation between variables
• Choosing number of clusters:
  – Nice one: http://cran.r-project.org/web/packages/clusterCrit/vignettes/clusterCrit.pdf
  – http://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set
  – http://psycnet.apa.org/journals/met/16/3/285/
  – http://blog.echen.me/2011/03/19/counting-clusters/
  – http://stackoverflow.com/questions/15376075/cluster-analysis-in-r-determine-the-optimal-number-of-clusters
  – http://www.stat.cmu.edu/~ryantibs/datamining/lectures/06-clus3.pdf
• Neural Networks:
  – Good one: http://www.cogsys.wiai.uni-bamberg.de/teaching/ss05/ml/slides/cogsysII-4.pdf
  – https://sites.google.com/a/iupr.com/dia-course/lectures/lecture08-classification-with-neural-networks
  – Nice for MLP: http://users.ics.aalto.fi/ahonkela/dippa/node41.html
• Boosting vs Bagging: http://people.cs.pitt.edu/~milos/courses/cs2750-Spring04/lectures/class23.pdf
• Random Forests: https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
• SVMs: Idiots' guide to SVMs: http://www.cs.ucf.edu/courses/cap6412/fall2009/papers/Berwick2003.pdf
• Kernel Methods: http://www.kernel-methods.net/tutorials/KMtalk.pdf

Page 64

The End


Page 65

Not used



Page 70

The Forward Algorithm: Probability of a Sequence is the Sum of All Paths that Can Produce It

Two-state model: CpG emits G .3, C .3, A .2, T .2; Non-CpG emits G .1, C .1, A .4, T .4. Transitions: CpG→CpG 0.8, CpG→Non-CpG 0.2, Non-CpG→Non-CpG 0.9, Non-CpG→CpG 0.1.

Forward trellis for the sequence G C G A A:

Obs   CpG                                  Non-CpG
G     .3                                   .1
C     .3*(.3*.8 + .1*.1) = .075            .1*(.3*.2 + .1*.9) = .015
G     .3*(.075*.8 + .015*.1) = .0185       .1*(.075*.2 + .015*.9) = .0029
A     .2*(.0185*.8 + .0029*.1) = .003      .4*(.0185*.2 + .0029*.9) = .0025
A     .2*(.003*.8 + .0025*.1) = .0005      .4*(.003*.2 + .0025*.9) = .0011

David Pollock's Lecture

Page 71

Parameter estimation by the Baum-Welch Forward-Backward Algorithm

Forward variable αt(i) = Pr(O1..t, Xt = i | λ)
Backward variable βt(i) = Pr(Ot+1..N | Xt = i, λ)

DEFINITIVE tutorial: Rabiner 1989: http://compbio.ucdenver.edu/hunter/cpbs7711/2010_09_09_Rabiner1989_HMMTutorial_Leach.pdf and erratum: http://compbio.ucdenver.edu/hunter/cpbs7711/2010_09_09_Rabiner1989_HMMTutorial_Erratum_Leach.pdf

Page 72

Forward Algorithm
• Dynamic programming method to compute the forward variable: αt(i) = Pr(O1..t, Xt = i | λ)
• Base Condition: for 1 ≤ i ≤ N
  α1(i) = πi bi,O1
• Recurrence: for 1 ≤ j ≤ N and 1 ≤ t ≤ T−1
  αt+1(j) = [ ∑i=1 to N αt(i) aij ] bj,Ot+1
• Then probability of sequence: Pr(O | λ) = ∑i=1 to N αT(i)

* Backward algorithm for βt(i) is analogous
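A base-R sketch of the recurrence above, checked against the CpG example on Page 70 (that trellis seeds with emissions only, so pi0 = c(1, 1) reproduces its numbers):

forward <- function(obs, A, B, pi0) {
  N <- nrow(A); Tn <- length(obs)
  alpha <- matrix(0, Tn, N)
  alpha[1, ] <- pi0 * B[, obs[1]]                           # base condition
  for (t in 1:(Tn - 1))
    alpha[t + 1, ] <- (alpha[t, ] %*% A) * B[, obs[t + 1]]  # O(N^2) per step
  sum(alpha[Tn, ])                                          # Pr(O | lambda)
}
A <- matrix(c(0.8, 0.2,    # CpG     -> CpG, Non-CpG
              0.1, 0.9),   # Non-CpG -> CpG, Non-CpG
            2, 2, byrow = TRUE)
B <- matrix(c(0.3, 0.3, 0.2, 0.2,   # CpG emissions:     G, C, A, T
              0.1, 0.1, 0.4, 0.4),  # Non-CpG emissions: G, C, A, T
            2, 4, byrow = TRUE)
forward(c(1, 2, 1, 3, 3), A, B, pi0 = c(1, 1))  # G C G A A: ~.0005 + ~.0011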

