Center for Genes, Environment, and Health
Machine Learning CPBS7711
Oct 8, 2015
Sonia Leach, PhD, Assistant Professor
Center for Genes, Environment, and Health, National Jewish Health
Someone once said “Artificial Intelligence = Search”
so Machine Learning = ? Induction of new knowledge from experience and the ability to improve?
Machine Learning is a natural outgrowth of the intersection of Computer Science and Statistics. We might say the defining question of Computer Science is “How can we build machines that solve problems, and which problems are inherently tractable/intractable?” The question that largely defines Statistics is “What can be inferred from data plus a set of modeling assumptions, with what reliability?” The defining question for Machine Learning builds on both, but it is a distinct question. Whereas Computer Science has focused primarily on how to manually program computers, Machine Learning focuses on the question of how to get computers to program themselves (from experience plus some initial structure). Whereas Statistics has focused primarily on what conclusions can be inferred from data, Machine Learning incorporates additional questions about what computational architectures and algorithms can be used to most effectively capture, store, index, retrieve and merge these data, how multiple learning subtasks can be orchestrated in a larger system, and questions of computational tractability. We say that a machine learns with respect to a particular task T, performance metric P, and type of experience E, if the system reliably improves its performance P at task T, following experience E.
- Tom Mitchell http://www.cs.cmu.edu/~tom/pubs/MachineLearning.pdf
Also interesting discussion of differences among AI, ML, Data Mining, Stats :http://stats.stackexchange.com/questions/5026/what-is-the-difference-between-data-mining-statistics-machine-learning-and-ai
Machine Learning
• From Wikipedia:
  – 7.1 Decision tree learning
  – 7.2 Association rule learning
  – 7.3 Artificial neural networks
  – 7.4 Inductive logic programming
  – 7.5 Support vector machines
  – 7.6 Clustering
  – 7.7 Bayesian networks
  – 7.8 Reinforcement learning
  – 7.9 Representation learning
  – 7.10 Similarity and metric learning
  – 7.11 Sparse Dictionary Learning
• From Alpaydin, Introduction to Machine Learning:
  – Supervised Learning
  – Bayesian Decision Theory
  – Parametric Methods
  – Multivariate Methods
  – Dimensionality Reduction
  – Clustering
  – Nonparametric Methods
  – Decision Trees
  – Linear Discrimination
  – Multilayer Perceptrons
  – Local Models
  – Kernel Machines
  – Bayesian Estimation
  – Hidden Markov Models
  – Graphical Models
  – Combining Multiple Learners
  – Reinforcement Learning
http://www.realtechsupport.org/UB/MRIII/papers/MachineLearning/Alppaydin_MachineLearning_2010.pdf
Machine Learning (what I will cover)
• Unsupervised
  – Dimensionality Reduction: PCA
  – Clustering: k-Means, SOM, Hierarchical
  – Association Set Mining
  – Probabilistic Graphical Models: HMMs, Bayes Nets
• Supervised
  – k-Nearest Neighbor
  – Neural Nets
  – Decision Trees/Random Forests
  – SVMs
  – Naïve Bayes
• Issues
  – Regression/Classification
  – Feature selection/reduction
  – Missing data
  – Boosting/bagging/jackknife
  – Cross validation, generalization
  – Model selection
Connections to other lectures: Miller (HMM), Pollock (HMM), Leach (HMM), Lozupone (PCA, Feature Importance Scores, Clustering), Kechris (Regression), [Hunter (Knowledge-Based Analysis), Cohen (BioNLP), Phang (Expr Analysis) ….]
R: http://cran.r-project.org/web/views/MachineLearning.html
Machine Learning
• Supervised Learning
  – training set = both inputs and correct answers
  – Example: classification into predefined classes for which examples of labeled data are known
  – Learning amounts to optimizing an error function that measures the difference between the true answers and the answers given by the learner
• Unsupervised Learning
  – training set = just input data
  – Example: grouping data into categories based on similarities among them
  – Relies on statistical properties of the data to extract models of the data
  – Does not use an error concept but a model-quality concept, which should be maximized
http://slideplayer.com/slide/4040706/
Unsupervised Learning
Dimensionality Reduction: Principal Components Analysis (PCA)
• Motivation: Instead of considering all variables, use a small number of linear combinations of those variables with minimal information loss
http://blog.peltarion.com/2006/06/20/the-talented-drhebb-part-2-pca/
2D data: What if we could only choose 1 of the variables to represent the data? Choose the y-axis: it explains more variance in the data.
[Figure: amount of variance explained by a single variable vs. by the first principal component P1; variance explained by P1 > variance explained by Y]
Principal Components Analysis (PCA)
• Let X=(x1,x2,…,xn) be a random vector (mean vector μ, covariance matrix Σ)
  – Example: X = (height, weight, GPA, credit score)
• Each element of the vector has a distribution over the population (i.e., xi is a random variable)
• A dataset is a set of samples from the joint distribution of X
• μ = (66, 179, 3.0, 687), with Σ as below
Sample    Ht   Wt   GPA   FICA
Bob       73   185  3.3   610
Anna      62   105  3.7   730
Therese   69   137  2.89  717
Jacob     76   210  4.0   780

Σ = | σ²_ht       σ_ht,wt      σ_ht,gpa      σ_ht,fica   |
    | σ_wt,ht     σ²_wt        σ_wt,gpa      σ_wt,fica   |
    | σ_gpa,ht    σ_gpa,wt     σ²_gpa        σ_gpa,fica  |
    | σ_fica,ht   σ_fica,wt    σ_fica,gpa    σ²_fica     |
Note: Ht & Wt usually strongly correlated
Principal Components Analysis (PCA)
• If X=(x1,x2,…,xn) is a random vector (mean vector μ, covariance matrix Σ), then the principal component transformation is
  X → Y = (X−μ)Γ  such that Γ is orthogonal, ΓᵀΣΓ = Λ is diagonal, and λ1 ≥ λ2 ≥ … ≥ λp ≥ 0.
  – Linear orthogonal transform of the original data to a new coordinate system
  – Each component is a linear combination of the original variables
    • coefficients of the variables in the linear combination = Loadings
    • data transformed to the new coordinates = Scores
  – Components ordered by percentage of variance explained along the new axis
  – Number of components = minimum dimension of the input data matrix
  – Set of orthogonal vectors not unique, not scale-invariant (covariance vs correlation); computed by eigenvalue decomposition (as above; R princomp) or singular value decomposition (SVD; R prcomp)
Adapted from S-plus Guide to Statistics
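A minimal sketch of this transform in R (the data set and variable names are illustrative, not from the slides): the eigendecomposition of the sample covariance matrix gives Γ (the loadings) and the eigenvalues λi directly; princomp() wraps essentially the same computation (up to sign conventions and the divisor used for the covariance).

# PCA "by hand" via eigendecomposition of the covariance matrix
# (X: numeric matrix, rows = samples, columns = variables)
X      <- as.matrix(USArrests)        # built-in R data set, used only for illustration
mu     <- colMeans(X)                 # mean vector
Sigma  <- cov(X)                      # covariance matrix
eig    <- eigen(Sigma)
Gamma  <- eig$vectors                 # loadings (orthogonal matrix)
lambda <- eig$values                  # lambda_1 >= ... >= lambda_p >= 0
Y      <- sweep(X, 2, mu) %*% Gamma   # scores: Y = (X - mu) Gamma
lambda / sum(lambda)                  # proportion of variance per component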
Principal Components Analysis (PCA)
• If X is a random vector (mean μ, covariance matrix Σ), then the principal component transformation is X → Y = (X−μ)Γ such that Γ is orthogonal, ΓᵀΣΓ = Λ is diagonal, and λ1 ≥ λ2 ≥ … ≥ λp ≥ 0.
Adapted from S-plus Guide to Statistics
X (data):
     diffgeom  complex  algebra  reals  stats
1          36       58       43     36     37
2          62       54       50     46     52
3          31       42       41     40     29
4          76       78       69     66     81
5          46       56       52     56     40
6          12       42       38     38     28
7          39       46       51     54     41
8          30       51       54     52     32
9          22       32       43     28     22
10          9       40       47     30     24
What if we could only choose two dimensions?
Principal Components Analysis (PCA)
• If X is a random vector (mean μ, covariance matrix Σ), then the principal component transformation is X → Y = (X−μ)Γ such that Γ is orthogonal, ΓᵀΣΓ = Λ is diagonal, and λ1 ≥ λ2 ≥ … ≥ λp ≥ 0.
Adapted from S-plus Guide to Statistics
Y (scores):
          Comp.1     Comp.2     Comp.3      Comp.4      Comp.5
 [1,]  -2.292745   5.827588   8.966977  -7.1630488  -2.2195936
 [2,]  25.846460  13.457048  -3.257987   0.5344066   0.4777994
 [3,] -14.856875   4.337867  -4.057297  -2.5308172   1.4998247
 [4,]  70.434116  -3.286077   6.423473   3.9571310   0.8815369
 [5,]  13.768664  -4.392701  -6.058773  -4.7551497  -2.2951908
 [6,] -28.899236  -4.611347   4.338621  -2.2710490   6.7118075
 [7,]   5.216449  -4.536616  -7.625423   2.2093319   3.2618335
 [8,]  -3.432334 -11.115805  -3.553422  -0.9908949  -4.1604420
 [9,] -31.579207   8.354892  -2.497369   5.6986938  -1.9742069
[10,] -34.205292  -4.034848   7.321199   5.3113963  -2.1833687
X (data):
     diffgeom  complex  algebra  reals  stats
1          36       58       43     36     37
2          62       54       50     46     52
3          31       42       41     40     29
4          76       78       69     66     81
5          46       56       52     56     40
6          12       42       38     38     28
7          39       46       51     54     41
8          30       51       54     52     32
9          22       32       43     28     22
10          9       40       47     30     24
Component Importance:
                        Comp.1  Comp.2  Comp.3  Comp.4  Comp.5
Standard deviation      30.142   7.179   5.786   4.098   3.084
Proportion of Variance   0.890   0.050   0.032   0.016   0.009
Cumulative Proportion    0.890   0.941   0.974   0.990   1.000

Γ (loadings):
            Comp.1  Comp.2  Comp.3  Comp.4  Comp.5
diffgeom     0.638   0.599  -0.407  -0.112  -0.237
complex      0.372  -0.230   0.593  -0.595  -0.320
algebra      0.240  -0.371           0.645  -0.624
reals        0.333  -0.671  -0.557  -0.234   0.271
statistics   0.535           0.414   0.404   0.615
EXAMPLE IN R
X = read.table('pca.input', sep=" ", header=TRUE)
pc = princomp(X)
mu = pc$center
Gamma = pc$loadings
Y = pc$scores
XminusMu = sweep(X, MARGIN=2, mu, FUN="-")
propOfVar = pc$sdev^2 / sum(pc$sdev^2)
eigenVals = pc$sdev^2
Principal Components Analysis (PCA)
Adapted from S-plus Guide to Statistics
Component Importance:
                        Comp.1  Comp.2  Comp.3  Comp.4  Comp.5
Standard deviation      30.142   7.179   5.786   4.098   3.084
Proportion of Variance   0.890   0.050   0.032   0.016   0.009
Cumulative Proportion    0.890   0.941   0.974   0.990   1.000

Γ (loadings):
            Comp.1  Comp.2  Comp.3  Comp.4  Comp.5
diffgeom     0.638   0.599  -0.407  -0.112  -0.237
complex      0.372  -0.230   0.593  -0.595  -0.320
algebra      0.240  -0.371           0.645  -0.624
reals        0.333  -0.671  -0.557  -0.234   0.271
statistics   0.535           0.414   0.404   0.615
## Verify Y = (X-mu)*Gamma
unique(Y - as.matrix(XminusMu) %*% Gamma)
## Verify X repr. by Comp. i == Y[,i]
par(mfrow=c(2,1), pty="s")
biplot(pc)
plot(Y[,1], Y[,2], col="white")
text(Y[,1], Y[,2], 1:10)
Y (scores):
          Comp.1     Comp.2     Comp.3      Comp.4      Comp.5
 [1,]  -2.292745   5.827588   8.966977  -7.1630488  -2.2195936
 [2,]  25.846460  13.457048  -3.257987   0.5344066   0.4777994
 [3,] -14.856875   4.337867  -4.057297  -2.5308172   1.4998247
 [4,]  70.434116  -3.286077   6.423473   3.9571310   0.8815369
 [5,]  13.768664  -4.392701  -6.058773  -4.7551497  -2.2951908
 [6,] -28.899236  -4.611347   4.338621  -2.2710490   6.7118075
 [7,]   5.216449  -4.536616  -7.625423   2.2093319   3.2618335
 [8,]  -3.432334 -11.115805  -3.553422  -0.9908949  -4.1604420
 [9,] -31.579207   8.354892  -2.497369   5.6986938  -1.9742069
[10,] -34.205292  -4.034848   7.321199   5.3113963  -2.1833687
X = read.table('pca.input', sep=" ", header=TRUE)
pc = princomp(X)
mu = pc$center
Gamma = pc$loadings
Y = pc$scores
XminusMu = sweep(X, MARGIN=2, mu, FUN="-")
propOfVar = pc$sdev^2 / sum(pc$sdev^2)
eigenVals = pc$sdev^2
Arrows for the original variables: Length = proportion of variance explained in the 2 components; Direction = relative loadings in the 2 components
e.g., diffgeom largest (++, ++); algebra smallest (+, −)
X (data):
     diffgeom  complex  algebra  reals  stats
1          36       58       43     36     37
2          62       54       50     46     52
3          31       42       41     40     29
4          76       78       69     66     81
5          46       56       52     56     40
6          12       42       38     38     28
7          39       46       51     54     41
8          30       51       54     52     32
9          22       32       43     28     22
10          9       40       47     30     24
Principal Components Analysis (PCA)
• If X is a random vector (mean μ, covariance matrix Σ), then the principal component transformation is X → Y = (X−μ)Γ such that Γ is orthogonal, ΓᵀΣΓ = Λ is diagonal, and λ1 ≥ λ2 ≥ … ≥ λp ≥ 0.
Adapted from S-plus Guide to Statistics
X (data):
     diffgeom  complex  algebra  reals  stats
1          36       58       43     36     37
2          62       54       50     46     52
3          31       42       41     40     29
4          76       78       69     66     81
5          46       56       52     56     40
6          12       42       38     38     28
7          39       46       51     54     41
8          30       51       54     52     32
9          22       32       43     28     22
10          9       40       47     30     24
What if we could only choose two dimensions?
Clustering
• Partitioning
  – Must specify the number of clusters
  – k-Means, Self-Organizing Maps (SOM/Kohonen Net)
• Hierarchical Clustering
  – Do not need to specify the number of clusters
  – Need to specify a distance metric and a linkage method
• Other approaches
  – Fuzzy clustering (probabilistic membership)
  – Spectral clustering (using eigenvalue decomposition)
Clustering
http://apandre.wordpress.com/visible-data/cluster-analysis/
http://stackoverflow.com/questions/4722290/generating-synthetic-datasets
R package: mlbench: Machine Learning Benchmark Problems
k-Means
• Initialize: select the initial k centroids
• REPEAT
  – Form k clusters by assigning all points to the 'closest' centroid
  – Recompute the centroid for each cluster
• UNTIL the centroids don't change or all changes are below a predefined threshold
• Initial centroids can be random vectors, randomly selected data vectors, the first k vectors, etc., or computed from a random first assignment
• 'Closest' is typically defined by Euclidean distance (Voronoi diagram)
• Prone to local optima, so typically do N random restarts and take the best (minimum sum of squared Euclidean distances to centroids)
• In practice, favors well-separated spherical clusters
dist_E(x, y) = sqrt( Σ_{i=1..n} (x_i − y_i)² )
Images from wikipedia
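A minimal k-means sketch in base R, with the random restarts described above (the synthetic two-blob data and the nstart value are arbitrary choices):

# k-means with multiple random restarts; keeps the solution with the
# smallest total within-cluster sum of squared Euclidean distances
set.seed(1)
X <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 4), ncol = 2))   # two synthetic blobs
fit <- kmeans(X, centers = 2, nstart = 25)
fit$centers        # the k centroids
fit$tot.withinss   # objective minimized across the 25 restarts
plot(X, col = fit$cluster, pch = 19)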
k-Means
http://en.wikipedia.org/wiki/K-means_clustering
Iteration 0 Iteration 1 Iteration 2
Iteration 3 Iteration 4 Iteration 5
Images from wikipedia
Self-Organizing Maps (SOM)
• Similar to k-Means: the goal is to assign data to the map node (analogous to the centroid in k-Means) whose weight vector is 'closest' to the data-space vector (minimize dist_E(x, w))
• Difference: map nodes are constrained by neighborhood relationships, whereas k-Means centroids move freely
• Must input an initial topology; the map 'stretches' to cover nD data in 2D, and similar data are assigned to neighboring map nodes
Image from wikipedia
Self-Organizing Maps (SOM)
• 1. Initialization – choose random values for the initial weight vectors wj
• 2. Sampling – draw a sample training input vector x from the input space
• 3. Matching – find the winning neuron I(x) with weight vector closest to the input vector (i.e., min dist_E)
• 4. Updating – apply the weight update equation Δwji = η(t) · T_{j,I(x)}(t) · (xi − wji), where η(t) = learning rate at time t* and T_{j,I(x)}(t) = neighborhood function at time t
• 5. Continuation – keep returning to step 2 until the feature map stops changing
http://www.sciencedirect.com/science/article/pii/S0014579399005244
* Informal intro to simulated annealing, gradient descent…
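A minimal SOM sketch using the kohonen package listed on the 'Examples in R' slide near the end of the deck (the grid size, topology, and data set here are arbitrary illustrations):

# Self-organizing map on a numeric matrix; each sample is assigned to the
# map node whose weight vector is closest, subject to the 2D grid topology
library(kohonen)
set.seed(1)
X <- scale(as.matrix(iris[, 1:4]))       # illustrative data
som_fit <- som(X, grid = somgrid(xdim = 5, ydim = 5, topo = "hexagonal"))
head(som_fit$unit.classif)               # winning map node for each sample
plot(som_fit, type = "mapping")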
Self-Organizing Maps (SOM)
http://www.sciencedirect.com/science/article/pii/S0014579399005244
Self-Organizing Maps (SOM)
• Initial grids
  – Wrt size: 1-dimensional, 2-dimensional, 3-dimensional
  – Wrt structure: Rectangular, Hexagonal, Arbitrary planar
http://www.cis.hut.fi/somtoolbox/documentation/grids.gif
http://slideplayer.com/slide/4040706/
Self-Organizing Maps (SOM)
Example: Clustering Gene Expression Profiles
http://physiolgenomics.physiology.org/content/physiolgenomics/10/2/103/F2.large.jpg
Hierarchical Clustering
• Divisive – (top down) start with all points in 1 cluster, successively sub-divide the 'farthest' points until the full tree is built
• Agglomerative – (bottom up) start with each point in its own cluster (singleton), merge the 'closest' pair of clusters at each step until the root
  – Requires a metric to define 'closest' – the distance is no longer between points, but between clusters
  – The linkage strategy for choosing which merge to make is often based on pairwise point comparisons
• Dendrogram shows the order of splits
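A minimal agglomerative sketch in base R; the distance metric and linkage method are exactly the two choices called out above (the data set and cut level are illustrative):

# Agglomerative hierarchical clustering: distance matrix -> hclust -> dendrogram
X  <- as.matrix(USArrests)
d  <- dist(scale(X), method = "euclidean")   # distance metric
hc <- hclust(d, method = "average")          # linkage method
plot(hc)                                     # dendrogram shows the order of merges
cutree(hc, k = 4)                            # cut the tree into 4 clusters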
Hierarchical Clustering
http://images.slideplayer.com/11/3289326/slides/slide_7.jpg
Distance Metrics
• Euclidean – distance in Euclidean space
• Pearson Correlation – linear relationships
• Spearman Correlation – monotonic relationships
• Mutual Information – non-linear relationships
• Polyserial Correlation – correlation of continuous vs ordinal (polychoric if ordinal vs ordinal)
• Hamming Distance, Jaccard, Dice (binary variables)
dist_E(x, y) = sqrt( Σ_{i=1..n} (x_i − y_i)² )

dist_P(x, y) = 1 − [ Σ_{i=1..n} (x_i − x̄)(y_i − ȳ) ] / [ sqrt(Σ_{i=1..n} (x_i − x̄)²) · sqrt(Σ_{i=1..n} (y_i − ȳ)²) ]

dist_S(x, y) = 1 − [ Σ_{i=1..n} (r_{x_i} − r̄_x)(r_{y_i} − r̄_y) ] / [ sqrt(Σ_{i=1..n} (r_{x_i} − r̄_x)²) · sqrt(Σ_{i=1..n} (r_{y_i} − r̄_y)²) ],  where r_z = rank(z)

dist_MI(x, y) = H(x, y) − MI(x, y),  where H(x) = −Σ_x p_x log p_x,  H(x, y) = −Σ_{x,y} p_{x,y} log p_{x,y},  and MI(x, y) = H(x) + H(y) − H(x, y)

dist_J = (M01 + M10) / (M01 + M10 + M11)   (Jaccard: ignores 0–0 matches, good when a 0 gives no info)

dist_H = M01 + M10   (Hamming)

dist_D = 1 − 2|X ∩ Y| / (|X| + |Y|)   (Dice: like Jaccard but matches counted twice)
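A few of these distances as they are typically computed in base R (the 1 − correlation convention matches the Pearson and Spearman distances above; the data are synthetic):

set.seed(1)
x <- rnorm(20)
y <- x^3 + rnorm(20, sd = 0.1)          # monotonic but non-linear relationship
sqrt(sum((x - y)^2))                    # Euclidean distance
1 - cor(x, y, method = "pearson")       # Pearson distance (linear association)
1 - cor(x, y, method = "spearman")      # Spearman distance (monotonic association)

# Binary vectors: Jaccard-style distance via dist(..., method = "binary")
a <- c(1, 0, 1, 1, 0, 0)
b <- c(1, 1, 1, 0, 0, 0)
dist(rbind(a, b), method = "binary")    # (M01 + M10) / (M01 + M10 + M11)
sum(a != b)                             # Hamming distance (count of mismatches)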
Distance Metrics
• Euclidean vs Pearson (linear) vs Spearman (monotonic)
[Figure: six example profiles A–F; the table gives each measure against profiles A–F]
     A     B     C     D     E     F
A    1     1    -1    0.8    0     0     Pearson
     1     1    -1    1      0     0     Spearman
     0     8     9    6     17    19     EucDist
B    8     0     1    6     22    23     EucDist
C   -1    -1     1   -0.7    0     0     Pearson
E    0     0     0    0.3    1    0.85   Pearson
     0     0     0    0      1    0.91   Spearman
Numbers are Pearson correlation where marked; note Pearson is invariant to slope, and Pearson = 0 if the relationship is non-linear.
Linkage Methods
• Single Linkage: argmin_{S,T} min_{s∈S, t∈T} dist(s,t)
• Complete Linkage: argmin_{S,T} max_{s∈S, t∈T} dist(s,t)
• Average Linkage (a.k.a. group average): argmin_{S,T} average_{s∈S, t∈T} dist(s,t)
• Centroid Linkage: min dist(centroid(S), centroid(T)) (often mistaken for average linkage after the Eisen et al. 1998 TreeView paper!)
• Ward's Linkage (optimizes the same criterion as k-Means)
• UPGMA (Unweighted Pair Group Method with Arithmetic Mean) from the Lozupone lecture – assumes a constant rate of evolution; average linkage, Euclidean distance
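A quick sketch comparing linkage methods on one distance matrix with base R's hclust (the method names are hclust's; "average" corresponds to the group-average/UPGMA linkage mentioned above):

d <- dist(scale(as.matrix(USArrests)))
par(mfrow = c(1, 3))
for (m in c("single", "complete", "average")) {
  plot(hclust(d, method = m), main = paste(m, "linkage"), labels = FALSE)
}
# hclust also supports "centroid" and "ward.D2" (Ward's criterion, the same
# within-cluster variance criterion that k-means optimizes)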
http://stackoverflow.com/questions/4722290/generating-synthetic-datasets
R package: mlbench: Machine Learning Benchmark Problems
Loadings:
            Comp.1  Comp.2
Murder       -0.53    0.41
Assault      -0.58    0.18
UrbanPop     -0.27   -0.87
Rape         -0.54   -0.16

            Murder  Assault  UrbanPop  Rape
Alabama       13.2      236        58  21.2
Alaska        10.0      263        48  44.5
Arizona        8.1      294        80  31.0
Arkansas       8.8      190        50  19.5
California     9.0      276        91  40.6
Colorado       7.9      204        78  38.7
Choosing the Number of Clusters
• Rule of thumb: k ≈ √(n/2)
• Elbow or knee method (bend in a plot of the metric vs. K)
• k-Means likes spherical clusters, so minimize the within-cluster variation W(K) (SSE, the sum of distances of all points to their cluster mean), or maximize the between-cluster variation B(K) (distance between clusters), or both: CH(K) = [B(K)/(K−1)] / [W(K)/(n−K)]
• Gap Statistic – calculate SSE; randomize the dataset and calculate SSE_rand, n times; gap = log(mean SSE_rand / SSE)
• Hierarchical – plot the distance chosen at each merge (okay for single, complete linkage)
See also http://cran.r-project.org/web/packages/clusterCrit/vignettes/clusterCrit.pdf for long list of indices, NbClust R package: http://cedric.cnam.fr/fichiers/art_2579.pdf and http://www.stat.cmu.edu/~ryantibs/datamining/lectures/06-clus3.pdf
[Plots of W(K), B(K), CH(K), and Gap(K) versus K]
*Calinski & Harabasz 1974
*Tibshirani, Walther, Hastie 2001
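A minimal elbow-method sketch for k-means, plotting W(K) against K (the data set, range of K, and nstart are illustrative); the gap statistic is available pre-packaged, e.g. in the cluster package:

# Elbow / knee heuristic: plot total within-cluster SSE, W(K), against K
set.seed(1)
X <- scale(as.matrix(USArrests))
K <- 1:10
W <- sapply(K, function(k) kmeans(X, centers = k, nstart = 20)$tot.withinss)
plot(K, W, type = "b", xlab = "K", ylab = "W(K) = within-cluster SSE")
# Gap statistic (Tibshirani et al. 2001), e.g. via the cluster package:
# cluster::clusGap(X, FUNcluster = kmeans, K.max = 10, B = 50)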
Association Set Mining
• Also known as Market Basket Analysis
  {milk, eggs} → {butter}
• Support of itemset X: supp(X) = # transactions with itemset X
• Confidence of rule: conf(X → Y) = supp(X & Y) / supp(X)
• Lift of rule (performance over assuming independence): lift(X → Y) = supp(X & Y) / (supp(X) · supp(Y))
• Want rules with max supp, conf, lift
• Other measures found at: http://michael.hahsler.net/research/association_rules/measures.html
Association Set Mining
• Tables of data are converted to transactions by creating binary variables for all categories of all variables (continuous variables must be discretized; missing data is okay)
ID     Gender  Age  Height (inches)  Race       Diagnosis
CC245  Male      6               25  Caucasian  Depression
CC346  Male     75               60  African    COPD
CC978           30               54  Asian      Obesity
CC125  Female   15               54  African

{ {gender_M=Y, age_child=Y, height_20-29=Y, race_WH=Y, diag_depr=Y},
  {gender_M=Y, age_senior=Y, height_60-69=Y, race_BL=Y, diag_copd=Y},
  {age_adult=Y, height_50-59=Y, race_AS=Y, diag_obes=Y},
  {gender_F=Y, age_adol=Y, height_50-59=Y, race_BL=Y} }
Association Set Mining
Example in R: arules pkg, apriori algorithm

    lhs                                     rhs             support confidence  lift
1   {Class=2nd, Age=Child}               => {Survived=Yes}    0.011      1.000  3.097
2   {Class=2nd, Sex=Female, Age=Child}   => {Survived=Yes}    0.006      1.000  3.096
3   {Class=1st, Sex=Female}              => {Survived=Yes}    0.064      0.972  3.010
4   {Class=1st, Sex=Female, Age=Adult}   => {Survived=Yes}    0.064      0.972  3.010
…
12  {Sex=Female, Survived=Yes}           => {Age=Adult}       0.143      0.918  0.966
27  {Class=2nd}                          => {Age=Adult}       0.118      0.915  0.963

Note that rule 2 is subsumed by rule 1, which has better lift (and support) – such redundant rules can be removed.
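A sketch of how rules like those above are typically mined with the arules package; the exact call used for the slide is not shown, so the thresholds here are illustrative:

library(arules)
# Expand R's built-in 4-way Titanic contingency table to one row per passenger,
# then coerce the factor columns to transactions (one binary item per level)
df    <- as.data.frame(Titanic)
df    <- df[rep(seq_len(nrow(df)), df$Freq), c("Class", "Sex", "Age", "Survived")]
trans <- as(df, "transactions")
rules <- apriori(trans, parameter = list(supp = 0.005, conf = 0.9, minlen = 2))
rules <- rules[!is.redundant(rules)]            # drop rules subsumed by better ones
inspect(head(sort(rules, by = "lift"), 5))      # top rules by lift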
Probabilistic Graphical Models
[Taxonomy figure, organized by Time, Observability, and Utility:
  Markov Process (MP): states X_{t−1} → X_t
  Hidden Markov Model (HMM): hidden states X_{t−1} → X_t, each emitting an observation O_{t−1}, O_t
  Markov Decision Process (MDP): states X, actions A, utilities U
  Partially Observable Markov Decision Process (POMDP): states X, actions A, observations O, utilities U]
Hidden Markov Model
• Finite set of N states X
• Finite set of M observations O
• Parameter set λ = (A, B, π)
  – Initial state distribution πi = Pr(X1 = i)
  – Transition probability aij = Pr(Xt = j | Xt−1 = i)
  – Emission probability bik = Pr(Ot = k | Xt = i)
• Given an observation sequence O = O1, O2, …, OT, how do we compute Pr(O|λ)?
Hidden Markov Model (HMM)
[Graphical model: hidden states X_{t−1} → X_t, each emitting an observation O_{t−1}, O_t]
Example: N = 3, M = 2, π = (0.25, 0.55, 0.2), with a 3×3 transition matrix A and a 3×2 emission matrix B
[State diagram over states 1, 2, 3, and a trellis over states st1, st2, st3 for observations obs1, obs2]
• Probability of O is the sum over all state sequences:
  Pr(O|λ) = Σ_{all X} Pr(O|X, λ) Pr(X|λ) = Σ_{all X} π_{x1} b_{x1,o1} a_{x1,x2} b_{x2,o2} … a_{x_{T−1},x_T} b_{x_T,o_T}
• At each t there are N states to reach, so there are N^T possible state sequences and ~2T multiplications per sequence, i.e., O(2T·N^T) operations
• So 3 states and a length-10 sequence ≈ 1,180,980 operations, and length 20 ≈ 1e11!
• Efficient dynamic programming algorithm: Forward algorithm (Baum & Welch), O(N²T)
[Same example: N = 3, M = 2, π = (0.25, 0.55, 0.2), with the transition matrix A and emission matrix B from the previous slide]
πi = Pr(X1 = i), aij = Pr(Xt = j | Xt−1 = i), bik = Pr(Ot = k | Xt = i)
Applications in Bioinformatics
• DNA – motif matching, gene matching, multiple sequence alignment
• Amino Acids – domain matching, fold recognition
• Microarrays/Whole Genome Sequencing – assign copy number
• ChIP-chip/seq – distinct chromatin states
Bayesian Networks
• Given a set of random variables, the joint probability distribution can be represented by:
  – Structure: Directed Acyclic Graph (DAG)
    • variables are nodes; the absence of arcs captures conditional independencies
  – Parameters: Local Conditional Probability Distributions (CPDs)
    • conditional probability of a variable given the values of its parents in the graph
• The joint probability factors into the product of local CPDs:
  Pr(X1, X2, …, Xn) = ∏_{i=1..n} Pr(Xi | Parents(Xi))
Bayesian Networks
• Generally can think of directed arcs as 'causal' (be careful!)
  – If the sprinkler is on OR it is raining, then the grass will be wet: Pr(W|S,R)
• If we observe wet grass, we can determine whether it is because of the sprinkler or the rain
  – Pr(R|W) and Pr(S|W)
  – Bayes rule: Pr(X|Y) = Pr(Y|X)Pr(X)/Pr(Y)
• Note S and R compete to explain W: this model says sprinkler usage is (conditionally) independent of rain, but if we know the grass is wet and it is raining, then it is less likely that the sprinkler being on is the explanation for W
  – Pr(S|W,R) < Pr(S|W): "explaining away"
• Note only 9 parameters are needed instead of 2⁴ = 16
http://www.cs.ubc.ca/~murphyk/Bayes/bnintro.html
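A small sketch of 'explaining away' by brute-force enumeration of the joint distribution, using the four-node Cloudy/Sprinkler/Rain/WetGrass network from the Murphy tutorial linked above (the CPT numbers are that tutorial's illustrative values, not something given on these slides):

# Joint = product of local CPDs: Pr(C,S,R,W) = Pr(C) Pr(S|C) Pr(R|C) Pr(W|S,R)
p_C          <- 0.5
p_S_given_C  <- c("FALSE" = 0.5, "TRUE" = 0.1)           # Pr(S=T | C)
p_R_given_C  <- c("FALSE" = 0.2, "TRUE" = 0.8)           # Pr(R=T | C)
p_W_given_SR <- matrix(c(0.0, 0.9, 0.9, 0.99), 2, 2,     # Pr(W=T | S, R)
                       dimnames = list(S = c("FALSE", "TRUE"), R = c("FALSE", "TRUE")))

joint <- expand.grid(C = c(FALSE, TRUE), S = c(FALSE, TRUE),
                     R = c(FALSE, TRUE), W = c(FALSE, TRUE))
joint$p <- with(joint,
  ifelse(C, p_C, 1 - p_C) *
  ifelse(S, p_S_given_C[as.character(C)], 1 - p_S_given_C[as.character(C)]) *
  ifelse(R, p_R_given_C[as.character(C)], 1 - p_R_given_C[as.character(C)]) *
  ifelse(W, p_W_given_SR[cbind(as.character(S), as.character(R))],
            1 - p_W_given_SR[cbind(as.character(S), as.character(R))]))

pr <- function(cond) sum(joint$p[cond])                  # probability of an event
pr(joint$S & joint$W) / pr(joint$W)                      # Pr(S | W)
pr(joint$S & joint$W & joint$R) / pr(joint$W & joint$R)  # Pr(S | W, R): smaller ("explaining away")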
Applications in Bioinformatics
PMID: 16873470
Gene regulatory networks (Friedman et al, 2000, PMID: 11108481)
Predicting clinical outcomes using expression data (Gevaert et al, 2006, PMID: 16873470)
Determining Regulators with PRMS (Segal et al, 2002, RECOMB)
Gene Function Prediction(Troyanskaya et al, 2003, PMID: 12826619 )
Hanalyzer – edge scores(Leach et al, 2009, PMID: 19325874)
Supervised Learning
Supervised Learning
• Given examples (x,y) of input features x and output variable y, learn a function f(x)=y
  – Regression (continuous response) vs Classification (discrete response)
  – Dimensionality Reduction (feature selection vs extraction)
  – Cross validation (Leave-One-Out vs N-Fold)
  – Generalization (training set error vs test set error)
  – Missing data and Imputation
  – Model Selection (AIC, BIC)
  – Boosting/bagging/jackknife
  – Curse of dimensionality
Supervised Learning
• Boosting (weak learners on different subsets)
  – Train H1 on a random data split; sample among H1's predictions so the next data set, used to train H2, is half wrong and half right under H1. Train H3 on cases where both H1 and H2 are wrong. Return the majority vote of H1, H2, H3 (AdaBoost weights examples and uses a weighted vote)
• Bagging (bootstrap aggregation)
  – Train multiple models on random with-replacement (bootstrap) splits of the input data, average the predictions
• Jackknife (vs bootstrap) – disjoint subsets of the data
• Model Selection: balance goodness of fit (likelihood L) with complexity of the model (number of parameters k) for n samples
  – Bayesian information criterion (BIC): minimize k·ln(n) − 2·ln(L)
  – Akaike information criterion (AIC): minimize 2k − 2·ln(L) (penalizes complexity less strongly; better theory than BIC)
• Curse of dimensionality – the greater D, the sparser the data samples are in covering the space, so more and more data are needed to learn properly
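A bare-bones bagging sketch with rpart trees as the base learner (the data set and the number of bootstrap replicates are arbitrary):

# Bagging: fit one tree per bootstrap resample, then take a majority vote
library(rpart)
set.seed(1)
B      <- 25
models <- lapply(seq_len(B), function(b) {
  idx <- sample(nrow(iris), replace = TRUE)          # bootstrap split
  rpart(Species ~ ., data = iris[idx, ], method = "class")
})
votes  <- sapply(models, function(m) as.character(predict(m, iris, type = "class")))
bagged <- apply(votes, 1, function(v) names(which.max(table(v))))  # majority vote
mean(bagged == iris$Species)                         # training accuracy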
Decision Boundaries
https://sites.google.com/a/iupr.com/dia-course/lectures/lecture08-classification-with-neural-networks
k-Nearest Neighbors
• Store a database of (x,y) pairs; classify a new example by the majority vote of its k nearest neighbors (regression if you assign the (weighted) mean y in the neighborhood)
• No training needed, non-parametric, sensitive to local structure in the data; the most frequent class tends to dominate
• Curse of dimensionality: with many variables, any query is nearly equidistant to all points – reduce features by PCA
• Allows complicated boundaries between classes
[Figure: if k=3, the query is assigned (green, red); if k=5, (green, blue)]
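A minimal k-NN sketch with the class package listed on the 'Examples in R' slide at the end of the deck (the train/test split and k are arbitrary):

library(class)
set.seed(1)
idx   <- sample(nrow(iris), 100)                  # arbitrary train/test split
train <- iris[idx, 1:4]
test  <- iris[-idx, 1:4]
cl    <- iris$Species[idx]
pred  <- knn(train, test, cl, k = 5)              # majority vote of 5 neighbors
table(pred, iris$Species[-idx])                   # confusion matrix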
Neural Network: Linear Perceptron
• Learning (backpropagation):
  • Initialize w_t, choose learning rate η
  • 1) Calculate prediction y*_{j,t} = f[w_t · x_j]
  • 2) Update weights w_{t+1} = w_t + η(y_j − y*_{j,t})x_j
  • Repeat 1 & 2 until (y_j − y*_{j,t}) < threshold
  – Can be generalized to multi-class
  – Optimal only if the data are linearly separable
[Figure: step activation function]
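The update rule above is small enough to code directly; a sketch for a two-class problem with labels y ∈ {−1, +1}, a step activation, and synthetic linearly separable data (the learning rate eta is my choice):

# Linear perceptron: y* = sign(w . x), w <- w + eta * (y - y*) * x
set.seed(1)
n <- 100
X <- cbind(1, matrix(rnorm(2 * n), ncol = 2))       # column of 1s = bias term
y <- ifelse(X[, 2] + X[, 3] > 0, 1, -1)             # linearly separable labels
w <- rep(0, 3); eta <- 0.1
for (epoch in 1:50) {
  for (j in 1:n) {
    ystar <- ifelse(sum(w * X[j, ]) >= 0, 1, -1)    # step activation
    w <- w + eta * (y[j] - ystar) * X[j, ]          # update only when wrong
  }
}
mean(ifelse(X %*% w >= 0, 1, -1) == y)              # training accuracy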
Neural Network: Multi-Layer Perceptron
• Smooth activation function instead (sigmoid, tanh)
• Can also have multiple hidden layers
• Can learn when the data are not linearly separable
• Learn as before, but backpropagate from the output layer
[Figure: input layer, hidden layer, output layer]
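A minimal multi-layer perceptron sketch with the neuralnet package listed on the 'Examples in R' slide at the end of the deck (one hidden layer; the two-class subset of iris is only an illustration):

library(neuralnet)
set.seed(1)
d   <- iris[iris$Species != "setosa", ]               # two-class subset
d$y <- as.numeric(d$Species == "virginica")
nn  <- neuralnet(y ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
                 data = d, hidden = 3, linear.output = FALSE)
pred <- as.numeric(compute(nn, d[, 1:4])$net.result > 0.5)
mean(pred == d$y)                                      # training accuracy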
Decision Tree
• A node is the attribute tested, a branch is an outcome, a leaf is the (majority) class (probability)
• Discrete: X = xi?  Real-valued: X < value?
• Greedy algorithm chooses the best attribute to split upon:
  – p_i = fraction of items labeled i in the set
  – Gini impurity: I_G(p) = Σ_{i≠j} p_i p_j
    (probability an item labeled i is chosen × probability i is mistakenly assigned class j)
  – Information gain: I_E(p) = −Σ_i p_i log2 p_i
  – Real-valued response: SSE
• EASY TO INTERPRET!!! Can overfit, needs a large tree for XOR, biased in favor of attributes with more levels => ensembles
[Example tree: internal nodes test BIOPSY+, Rx SIDE EFFECT, BREATH>90%, BREATH<30%, each with Y/N branches; leaves give outcome counts such as Died: 3 / Alive: 27, Died: 15 / Alive: 15, Died: 20 / Alive: 57, Died: 30 / Alive: 7, Died: 80 / Alive: 1]
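A minimal decision tree sketch with the rpart package listed on the 'Examples in R' slide at the end of the deck; the printed tree gives the attribute tested at each node and the class counts at the leaves, as in the figure above (the data set is illustrative):

library(rpart)
fit <- rpart(Species ~ ., data = iris, method = "class")   # Gini splits by default
print(fit)               # text form: split, n, loss, predicted class, probabilities
plot(fit); text(fit)     # quick plot of the tree
predict(fit, iris[1:3, ], type = "class")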
Random Forest
• A classifier consisting of an ensemble of decision trees {h(x, Θk)} where Θk is some i.i.d. random vector, and each tree casts a vote for the class of x (Breiman 2001)
  1. Bagging – Θk is a random selection of N samples (with replacement) used to grow the tree
  2. Dietterich 98: Θk is a random split among the n best splits
  3. Ho 98: Θk is a random subset of features used to grow the tree (√k)
  4. AdaBoost-like: Θk is random weights on the examples
  – 4 better than {2,3} better than 1 on generalization error
• Out-of-bag estimates: internal estimates of generalization error, classifier strength, and correlation between trees
Random Forest
• Most popular implementation of {h(x, Θk)}: bagging (random subset of samples with replacement) + random subset of features
  – If the set of features is small, the trees are more correlated, so new features can be made as random linear combinations of the original features
• Out-of-bag classifier for a specific {x,y} = aggregate over the trees that did not use {x,y} as training data (removes the need for setting aside test data)
• The out-of-bag estimate is the error rate of the out-of-bag classifier on the training set (can also estimate OOB strength and correlation)
• Can estimate variable importance from OOB estimates
  – For the m-th variable, permute its values and compare the misclassification rate of the OOB classifiers on the 'noised-up' data with OOB on the real data; a large increase implies the m-th variable is important
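A minimal sketch with the randomForest package (the standard R implementation of Breiman's method; the package itself is not named on the slides). The out-of-bag error and permutation importance described above come out of the same fit:

library(randomForest)
set.seed(1)
rf <- randomForest(Species ~ ., data = iris, ntree = 500, importance = TRUE)
rf                 # prints the OOB estimate of the error rate
importance(rf)     # permutation-based variable importance (MeanDecreaseAccuracy)
varImpPlot(rf)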
Support Vector Machine (SVM)
• Support vectors are the points that lie closest to the decision surface; they define the maximum 'margin' around the hyperplane separating the examples (the solution changes if SVs are removed)
• Kernel function – maps not-linearly-separable data to a transformed space where the transformed data are linearly separable
• Advantages: non-probabilistic, optimization rather than greedy search, not affected by local minima, theoretical guarantees of performance, escapes the curse of dimensionality
Support Vector Machine (SVM)
• The distance between H and H1 is 1/||w||, so to maximize the margin, minimize ||w|| (where ||w|| = sqrt(Σ_i w_i²)), subject to no points between H1 & H2:
  x_i·w + b ≥ +1 when y_i = +1
  x_i·w + b ≤ −1 when y_i = −1
  i.e., y_i(x_i·w + b) ≥ 1
• Quadratic program (constrained optimization, solved via (the dual of) a Lagrangian multiplier formulation):
  Max L = Σ_i α_i − ½ Σ_{i,j} α_i α_j y_i y_j x_i·x_j   s.t.   w = Σ_i α_i y_i x_i and Σ_i α_i y_i = 0
• If not linearly separable, use a transformation to a space where the data are linearly separable, via kernels, i.e., Φ(x_i) instead of x_i
• If the L1-norm is used (not the L2 above), the weights give variable importance
[Figure: separating hyperplane H with margin hyperplanes H1 and H2; support vectors marked with +]
http://www.cs.ucf.edu/courses/cap6412/fall2009/papers/Berwick2003.pdf
http://books.nips.cc/papers/files/nips16/NIPS2003_AA07.pdf
Support Vector Machine (SVM)
• Not separated by a linear function, but can be by a quadratic one
• Kernels:
  – Polynomial (p=1 linear): K(x, x′) = (x·x′ + 1)^p
  – Radial basis function (Gaussian): K(x, x′) = exp(−‖x − x′‖² / (2σ²))
  – ~sigmoid (like a Neural Net): K(x, x′) = tanh(κ x·x′ − δ)
Other Useful Kernel Functions
• The use of kernels allows complex data types to be used in SVMs without having to translate them into real-valued, fixed-length vectors: K: D × D → R
• String kernel: compare two sequences
• Graph kernel: compare two nodes in a graph, or two graphs
• Image kernels: compare two images
• and so on … (any symmetric, positive semi-definite matrix is a kernel)
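A minimal SVM sketch with the e1071 package listed on the 'Examples in R' slide at the end of the deck, fitting a linear and an RBF (Gaussian) kernel; cost and gamma are left at their defaults:

library(e1071)
set.seed(1)
d <- iris[iris$Species != "setosa", ]          # two-class subset
d$Species <- droplevels(d$Species)
svm_lin <- svm(Species ~ ., data = d, kernel = "linear")
svm_rbf <- svm(Species ~ ., data = d, kernel = "radial")   # Gaussian/RBF kernel
table(predict(svm_rbf, d), d$Species)          # confusion matrix on training data
svm_rbf$tot.nSV                                # number of support vectors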
Naïve Bayes
• Recall Bayes rule: Pr(X|Y) = Pr(Y|X)Pr(X) / Pr(Y)
• Classifier: Pr(C|F1,…,Fn) = Pr(C) Pr(F1,…,Fn|C) / Pr(F1,…,Fn)
  – Note the denominator does not depend on C (effectively a constant Z)
  – "Naïve" assumption because we assume Fi, Fj independent (given C)
  – Simplifies the calculation: Pr(C|F1,…,Fn) = (1/Z) · Pr(C) ∏_i Pr(Fi|C)
• Learn the parameters Pr(C) and each Pr(Fi|C) by maximum likelihood (multinomial, Gaussian, …)
  – Can learn each Pr(Fi|C) independently; escapes the curse of dimensionality, does not need the dataset to scale with the number of Fi
[Graphical model: class C with children F1, F2, F3, …, Fn]
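A minimal naïve Bayes sketch with the e1071 package listed on the 'Examples in R' slide below (Gaussian Pr(Fi|C) is used for continuous features):

library(e1071)
nb <- naiveBayes(Species ~ ., data = iris)     # learns Pr(C) and each Pr(Fi|C)
nb$apriori                                     # class counts used to form the prior Pr(C)
nb$tables$Petal.Length                         # per-class mean and sd for one feature
predict(nb, iris[1:3, ], type = "class")
predict(nb, iris[1:3, ], type = "raw")         # posterior Pr(C | F1..Fn)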
Examples in R
• Making 2D datasets
  – Install libraries: mlbench
• Clustering (Hierarchical, K-Means, SOM)
  – Install libraries: kohonen
• Classification (kNN, NN, DT, SVM, NB)
  – Install libraries: class (if R>3.0, o/w knn), neuralnet, rpart, e1071
http://stackoverflow.com/questions/4722290/generating-synthetic-datasets
R package: mlbench: Machine Learning Benchmark Problems
Additional References
• Logit regression example: http://www.ats.ucla.edu/stat/r/dae/logit.htm
• PCA: http://www.cs.princeton.edu/picasso/mats/PCA-Tutorial-Intuition_jp.pdf
• Statistical Pattern Recognition Toolbox for Demos: http://cmp.felk.cvut.cz/cmp/software/stprtool/examples.html
• KMeans: https://onlinecourses.science.psu.edu/stat857/node/125
• SOMs:
  – http://www.cs.bham.ac.uk/~jxb/NN/l16.pdf
  – http://www.loria.fr/~rougier/coding/article/article.html
  – http://www.sciencedirect.com/science/article/pii/S0014579399005244
• Distance metrics:
  – http://www.statmethods.net/stats/correlations.html
  – http://people.revoledu.com/kardi/tutorial/Similarity/index.html – nice discussion of differences
  – http://www.datavis.ca/papers/corrgram.pdf – make a visual panel (like a heatmap) of correlations between variables
• Choosing number of clusters:
  – Nice one: http://cran.r-project.org/web/packages/clusterCrit/vignettes/clusterCrit.pdf
  – http://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set
  – http://psycnet.apa.org/journals/met/16/3/285/
  – http://blog.echen.me/2011/03/19/counting-clusters/
  – http://stackoverflow.com/questions/15376075/cluster-analysis-in-r-determine-the-optimal-number-of-clusters
  – http://www.stat.cmu.edu/~ryantibs/datamining/lectures/06-clus3.pdf
• Neural Networks:
  – Good one: http://www.cogsys.wiai.uni-bamberg.de/teaching/ss05/ml/slides/cogsysII-4.pdf
  – https://sites.google.com/a/iupr.com/dia-course/lectures/lecture08-classification-with-neural-networks
  – Nice for MLP: http://users.ics.aalto.fi/ahonkela/dippa/node41.html
• Boosting vs Bagging: http://people.cs.pitt.edu/~milos/courses/cs2750-Spring04/lectures/class23.pdf
• Random Forests: https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
• SVMs: Idiots' guide to SVMs: http://www.cs.ucf.edu/courses/cap6412/fall2009/papers/Berwick2003.pdf
• Kernel Methods: http://www.kernel-methods.net/tutorials/KMtalk.pdf
The End
Not used
Hidden Markov Model
• Finite set of N states X
• Finite set of M observations O
• Parameter set λ = (A, B, π)
  – Initial state distribution πi = Pr(X1 = i)
  – Transition probability aij = Pr(Xt = j | Xt−1 = i)
  – Emission probability bik = Pr(Ot = k | Xt = i)
• Given an observation sequence O = O1, O2, …, OT, how do we compute Pr(O|λ)?
Hidden Markov Model (HMM)
[Graphical model: hidden states X_{t−1} → X_t, each emitting an observation O_{t−1}, O_t]
Example: N = 3, M = 2, π = (0.25, 0.55, 0.2), with a 3×3 transition matrix A and a 3×2 emission matrix B
[State diagram over states 1, 2, 3, and a trellis over states st1, st2, st3 for observations obs1, obs2]
• Probability of O is the sum over all state sequences:
  Pr(O|λ) = Σ_{all X} Pr(O|X, λ) Pr(X|λ) = Σ_{all X} π_{x1} b_{x1,o1} a_{x1,x2} b_{x2,o2} … a_{x_{T−1},x_T} b_{x_T,o_T}
• What is the computational complexity of this sum?
[Same 3-state example: N = 3, M = 2, π = (0.25, 0.55, 0.2), with matrices A and B as above]
πi = Pr(X1 = i), aij = Pr(Xt = j | Xt−1 = i), bik = Pr(Ot = k | Xt = i)
• Probability of O is the sum over all state sequences:
  Pr(O|λ) = Σ_{all X} Pr(O|X, λ) Pr(X|λ) = Σ_{all X} π_{x1} b_{x1,o1} a_{x1,x2} b_{x2,o2} … a_{x_{T−1},x_T} b_{x_T,o_T}
• At each t there are N states to reach, so there are N^T possible state sequences and ~2T multiplications per sequence, i.e., O(2T·N^T) operations
• So 3 states and a length-10 sequence ≈ 1,180,980 operations, and length 20 ≈ 1e11!
[Same 3-state example: N = 3, M = 2, π = (0.25, 0.55, 0.2), with matrices A and B as above]
πi = Pr(X1 = i), aij = Pr(Xt = j | Xt−1 = i), bik = Pr(Ot = k | Xt = i)
• Probability of O is the sum over all state sequences:
  Pr(O|λ) = Σ_{all X} Pr(O|X, λ) Pr(X|λ) = Σ_{all X} π_{x1} b_{x1,o1} a_{x1,x2} b_{x2,o2} … a_{x_{T−1},x_T} b_{x_T,o_T}
• Efficient dynamic programming algorithm to do this: Forward algorithm (Baum and Welch, O(N²T))
[Same 3-state example: N = 3, M = 2, π = (0.25, 0.55, 0.2), with matrices A and B as above]
πi = Pr(X1 = i), aij = Pr(Xt = j | Xt−1 = i), bik = Pr(Ot = k | Xt = i)
The Forward Algorithm: Probability of a Sequence is the Sum of All Paths that Can Produce It
[Two-state CpG-island example:
  Emission probabilities – CpG state: G .3, C .3, A .2, T .2; Non-CpG state: G .1, C .1, A .4, T .4
  Transition probabilities – CpG→CpG 0.8, CpG→Non-CpG 0.2, Non-CpG→CpG 0.1, Non-CpG→Non-CpG 0.9
  Forward variables for the sequence G C G A A:
    G:  CpG = .3                                  Non-CpG = .1
    C:  CpG = .3*(.3*.8 + .1*.1) = .075           Non-CpG = .1*(.3*.2 + .1*.9) = .015
    G:  CpG = .3*(.075*.8 + .015*.1) = .0185      Non-CpG = .1*(.075*.2 + .015*.9) = .0029
    A:  CpG = .2*(.0185*.8 + .0029*.1) = .003     Non-CpG = .4*(.0185*.2 + .0029*.9) = .0025
    A:  CpG = .2*(.003*.8 + .0025*.1) = .0005     Non-CpG = .4*(.003*.2 + .0025*.9) = .0011 ]
David Pollock's Lecture
Parameter estimation by the Baum-Welch (Forward-Backward) Algorithm
Forward variable: αt(i) = Pr(O1..t, Xt = i | λ)
Backward variable: βt(i) = Pr(Ot+1..N | Xt = i, λ)
DEFINITIVE tutorial: Rabiner 1989: http://compbio.ucdenver.edu/hunter/cpbs7711/2010_09_09_Rabiner1989_HMMTutorial_Leach.pdfand erratum: http://compbio.ucdenver.edu/hunter/cpbs7711/2010_09_09_Rabiner1989_HMMTutorial_Erratum_Leach.pdf
Forward Algorithm
• Dynamic programming method to compute the forward variable: αt(i) = Pr(O1..t, Xt = i | λ)
• Base condition: for 1 ≤ i ≤ N,
  α1(i) = πi bi(O1)
• Recurrence: for 1 ≤ j ≤ N and 1 ≤ t ≤ T−1,
  αt+1(j) = [ Σ_{i=1..N} αt(i) aij ] bj(Ot+1)
• Then the probability of the sequence is Pr(O | λ) = Σ_{i=1..N} αT(i)
* The Backward algorithm for βt(i) is analogous
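A direct R translation of this recurrence, run on the two-state CpG example from David Pollock's slide above; note that the slide's trellis starts from α1(i) = bi(O1) alone, so π is set to (1, 1) here purely to reproduce those numbers:

# Forward algorithm: alpha[t, i] = Pr(O_1..t, X_t = i | lambda), computed by DP
forward <- function(pi, A, B, obs) {
  Tt    <- length(obs); N <- nrow(A)
  alpha <- matrix(0, Tt, N)
  alpha[1, ] <- pi * B[, obs[1]]                             # base condition
  for (t in 1:(Tt - 1))
    alpha[t + 1, ] <- (alpha[t, ] %*% A) * B[, obs[t + 1]]   # recurrence
  list(alpha = alpha, prob = sum(alpha[Tt, ]))               # Pr(O | lambda)
}

# CpG-island example (states: CpG, Non-CpG)
A <- matrix(c(0.8, 0.2,
              0.1, 0.9), 2, 2, byrow = TRUE,
            dimnames = list(c("CpG", "Non"), c("CpG", "Non")))
B <- matrix(c(0.3, 0.3, 0.2, 0.2,
              0.1, 0.1, 0.4, 0.4), 2, 4, byrow = TRUE,
            dimnames = list(c("CpG", "Non"), c("G", "C", "A", "T")))
forward(pi = c(1, 1), A = A, B = B, obs = c("G", "C", "G", "A", "A"))$alpha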
[Backup copy of the Euclidean vs Pearson vs Spearman comparison table]
     A     B     C     D     E     F
A    1     1    -1    0.8    0     0     Pearson
     1     1    -1    1      0     0     Spearman
     0     8     9    6     17    19     EucDist
B    8     0     1    6     22    23     EucDist
C   -1    -1     1   -0.7    0     0     Pearson
E    0     0     0    0.3    1    0.85   Pearson
     0     0     0    0      1    0.91   Spearman