Center for Genes, Environment, and Health
Machine Learning CPBS7711
Oct 8, 2015
Sonia Leach, PhD, Assistant Professor
Center for Genes, Environment, and Health, National Jewish Health
Someone once said “Artificial Intelligence = Search”
so Machine Learning = ? Induction of new knowledge from experience and the ability to improve?
Machine Learning is a natural outgrowth of the intersection of Computer Science and Statistics. We might say the defining question of Computer Science is “How can we build machines that solve problems, and which problems are inherently tractable/intractable?” The question that largely defines Statistics is “What can be inferred from data plus a set of modeling assumptions, with what reliability?” The defining question for Machine Learning builds on both, but it is a distinct question. Whereas Computer Science has focused primarily on how to manually program computers, Machine Learning focuses on the question of how to get computers to program themselves (from experience plus some initial structure). Whereas Statistics has focused primarily on what conclusions can be inferred from data, Machine Learning incorporates additional questions about what computational architectures and algorithms can be used to most effectively capture, store, index, retrieve and merge these data, how multiple learning subtasks can be orchestrated in a larger system, and questions of computational tractability. We say that a machine learns with respect to a particular task T, performance metric P, and type of experience E, if the system reliably improves its performance P at task T, following experience E.
- Tom Mitchell http://www.cs.cmu.edu/~tom/pubs/MachineLearning.pdf
Also interesting discussion of differences among AI, ML, Data Mining, Stats :http://stats.stackexchange.com/questions/5026/what-is-the-difference-between-data-mining-statistics-machine-learning-and-ai
Machine Learning
• From Wikipedia:
  – 7.1 Decision tree learning
  – 7.2 Association rule learning
  – 7.3 Artificial neural networks
  – 7.4 Inductive logic programming
  – 7.5 Support vector machines
  – 7.6 Clustering
  – 7.7 Bayesian networks
  – 7.8 Reinforcement learning
  – 7.9 Representation learning
  – 7.10 Similarity and metric learning
  – 7.11 Sparse Dictionary Learning
• From Alpaydin, Introduction to Machine Learning:
  – Supervised Learning
  – Bayesian Decision Theory
  – Parametric Methods
  – Multivariate Methods
  – Dimensionality Reduction
  – Clustering
  – Nonparametric Methods
  – Decision Trees
  – Linear Discrimination
  – Multilayer Perceptrons
  – Local Models
  – Kernel Machines
  – Bayesian Estimation
  – Hidden Markov Models
  – Graphical Models
  – Combining Multiple Learners
  – Reinforcement Learning
http://www.realtechsupport.org/UB/MRIII/papers/MachineLearning/Alppaydin_MachineLearning_2010.pdf
Machine Learning (what I will cover)
• Unsupervised
  – Dimensionality Reduction: PCA
  – Clustering: k-Means, SOM, Hierarchical
  – Association Set Mining
  – Probabilistic Graphical Models: HMMs, Bayes Nets
• Supervised
  – k-Nearest Neighbor
  – Neural Nets
  – Decision Trees/Random Forests
  – SVMs
  – Naïve Bayes
• Issues
  – Regression/Classification
  – Feature selection/reduction
  – Missing data
  – Boosting/bagging/jackknife
  – Cross validation, generalization
  – Model selection
Connections to other lectures: Miller (HMM), Pollock (HMM), Leach (HMM), Lozupone (PCA, Feature Importance Scores, Clustering), Kechris (Regression), [Hunter (Knowledge-Based Analysis), Cohen (BioNLP), Phang (Expr Analysis) ….]
R: http://cran.r-project.org/web/views/MachineLearning.html
Machine Learning
• Supervised Learning
  – training set = both inputs and correct answers
  – Example: classification into predefined classes for which examples of labeled data are known
  – Learning amounts to optimizing an error function that measures the difference between the true answers and the answers given by the learner
• Unsupervised Learning
  – training set = just input data
  – Example: grouping data into categories based on similarities among them
  – Relies on statistical properties of the data to extract models of the data
  – Does not use an error concept but a model-quality concept, which should be maximized
http://slideplayer.com/slide/4040706/
Unsupervised Learning
Dimensionality Reduction: Principal Components Analysis (PCA)
• Motivation: Instead of considering all variables, use a small number of linear combinations of those variables with minimal information loss
http://blog.peltarion.com/2006/06/20/the-talented-drhebb-part-2-pca/
2D data: What if we could only choose 1 of the variables to represent the data? Choose the y-axis: it explains more variance in the data.
[Figure: amount of variance explained by a single variable vs. by the first principal component P1; variance explained by P1 > variance explained by Y]
Principal Components Analysis (PCA)
• Let X=(x1,x2,…,xn) be a random vector (mean vector μ, covariance matrix Σ)
  – Example: X = (height, weight, GPA, credit score)
• Each element of the vector has a distribution over the population (i.e., xi is a random variable)
• A dataset is a set of samples from the joint distribution of X
• μ = (66, 179, 3.0, 687), with Σ as below
Sample    Ht   Wt   GPA   FICA
Bob       73   185  3.3   610
Anna      62   105  3.7   730
Therese   69   137  2.89  717
Jacob     76   210  4.0   780

Σ = | σ²_ht       σ_ht,wt      σ_ht,gpa      σ_ht,fica   |
    | σ_wt,ht     σ²_wt        σ_wt,gpa      σ_wt,fica   |
    | σ_gpa,ht    σ_gpa,wt     σ²_gpa        σ_gpa,fica  |
    | σ_fica,ht   σ_fica,wt    σ_fica,gpa    σ²_fica     |
Note: Ht & Wt usually strongly correlated
Principal Components Analysis (PCA)
• If X=(x1,x2,…,xn) is a random vector (mean vector μ, covariance matrix Σ), then the principal component transformation is
  X → Y = (X−μ)Γ  such that Γ is orthogonal, ΓᵀΣΓ = Λ is diagonal, and λ1 ≥ λ2 ≥ … ≥ λp ≥ 0.
  – Linear orthogonal transform of the original data to a new coordinate system
  – Each component is a linear combination of the original variables
    • coefficients of the variables in the linear combination = Loadings
    • data transformed to the new coordinates = Scores
  – Components ordered by percentage of variance explained along the new axis
  – Number of components = minimum dimension of the input data matrix
  – Set of orthogonal vectors not unique, not scale-invariant (covariance vs correlation); computed by eigenvalue decomposition (as above; R princomp) or singular value decomposition (SVD; R prcomp)
Adapted from S-plus Guide to Statistics
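A minimal sketch of this transform in R (the data set and variable names are illustrative, not from the slides): the eigendecomposition of the sample covariance matrix gives Γ (the loadings) and the eigenvalues λi directly; princomp() wraps essentially the same computation (up to sign conventions and the divisor used for the covariance).

# PCA "by hand" via eigendecomposition of the covariance matrix
# (X: numeric matrix, rows = samples, columns = variables)
X      <- as.matrix(USArrests)        # built-in R data set, used only for illustration
mu     <- colMeans(X)                 # mean vector
Sigma  <- cov(X)                      # covariance matrix
eig    <- eigen(Sigma)
Gamma  <- eig$vectors                 # loadings (orthogonal matrix)
lambda <- eig$values                  # lambda_1 >= ... >= lambda_p >= 0
Y      <- sweep(X, 2, mu) %*% Gamma   # scores: Y = (X - mu) Gamma
lambda / sum(lambda)                  # proportion of variance per component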
Principal Components Analysis (PCA)
• If X is a random vector (mean μ, covariance matrix Σ), then the principal component transformation is X → Y = (X−μ)Γ such that Γ is orthogonal, ΓᵀΣΓ = Λ is diagonal, and λ1 ≥ λ2 ≥ … ≥ λp ≥ 0.
Adapted from S-plus Guide to Statistics
X (data):
     diffgeom  complex  algebra  reals  stats
1          36       58       43     36     37
2          62       54       50     46     52
3          31       42       41     40     29
4          76       78       69     66     81
5          46       56       52     56     40
6          12       42       38     38     28
7          39       46       51     54     41
8          30       51       54     52     32
9          22       32       43     28     22
10          9       40       47     30     24
What if we could only choose two dimensions?
Principal Components Analysis (PCA)
• If X is a random vector (mean μ, covariance matrix Σ), then the principal component transformation is X → Y = (X−μ)Γ such that Γ is orthogonal, ΓᵀΣΓ = Λ is diagonal, and λ1 ≥ λ2 ≥ … ≥ λp ≥ 0.
Adapted from S-plus Guide to Statistics
Y (scores):
          Comp.1     Comp.2     Comp.3      Comp.4      Comp.5
 [1,]  -2.292745   5.827588   8.966977  -7.1630488  -2.2195936
 [2,]  25.846460  13.457048  -3.257987   0.5344066   0.4777994
 [3,] -14.856875   4.337867  -4.057297  -2.5308172   1.4998247
 [4,]  70.434116  -3.286077   6.423473   3.9571310   0.8815369
 [5,]  13.768664  -4.392701  -6.058773  -4.7551497  -2.2951908
 [6,] -28.899236  -4.611347   4.338621  -2.2710490   6.7118075
 [7,]   5.216449  -4.536616  -7.625423   2.2093319   3.2618335
 [8,]  -3.432334 -11.115805  -3.553422  -0.9908949  -4.1604420
 [9,] -31.579207   8.354892  -2.497369   5.6986938  -1.9742069
[10,] -34.205292  -4.034848   7.321199   5.3113963  -2.1833687
X (data):
     diffgeom  complex  algebra  reals  stats
1          36       58       43     36     37
2          62       54       50     46     52
3          31       42       41     40     29
4          76       78       69     66     81
5          46       56       52     56     40
6          12       42       38     38     28
7          39       46       51     54     41
8          30       51       54     52     32
9          22       32       43     28     22
10          9       40       47     30     24
Component Importance:
                        Comp.1  Comp.2  Comp.3  Comp.4  Comp.5
Standard deviation      30.142   7.179   5.786   4.098   3.084
Proportion of Variance   0.890   0.050   0.032   0.016   0.009
Cumulative Proportion    0.890   0.941   0.974   0.990   1.000

Γ (loadings):
            Comp.1  Comp.2  Comp.3  Comp.4  Comp.5
diffgeom     0.638   0.599  -0.407  -0.112  -0.237
complex      0.372  -0.230   0.593  -0.595  -0.320
algebra      0.240  -0.371           0.645  -0.624
reals        0.333  -0.671  -0.557  -0.234   0.271
statistics   0.535           0.414   0.404   0.615
EXAMPLE IN R
X = read.table('pca.input', sep=" ", header=TRUE)
pc = princomp(X)
mu = pc$center
Gamma = pc$loadings
Y = pc$scores
XminusMu = sweep(X, MARGIN=2, mu, FUN="-")
propOfVar = pc$sdev^2 / sum(pc$sdev^2)
eigenVals = pc$sdev^2
Principal Components Analysis (PCA)
Adapted from S-plus Guide to Statistics
Component Importance:
                        Comp.1  Comp.2  Comp.3  Comp.4  Comp.5
Standard deviation      30.142   7.179   5.786   4.098   3.084
Proportion of Variance   0.890   0.050   0.032   0.016   0.009
Cumulative Proportion    0.890   0.941   0.974   0.990   1.000

Γ (loadings):
            Comp.1  Comp.2  Comp.3  Comp.4  Comp.5
diffgeom     0.638   0.599  -0.407  -0.112  -0.237
complex      0.372  -0.230   0.593  -0.595  -0.320
algebra      0.240  -0.371           0.645  -0.624
reals        0.333  -0.671  -0.557  -0.234   0.271
statistics   0.535           0.414   0.404   0.615
## Verify Y = (X-mu)*Gamma
unique(Y - as.matrix(XminusMu) %*% Gamma)
## Verify X repr. by Comp. i == Y[,i]
par(mfrow=c(2,1), pty="s")
biplot(pc)
plot(Y[,1], Y[,2], col="white")
text(Y[,1], Y[,2], 1:10)
Y (scores):
          Comp.1     Comp.2     Comp.3      Comp.4      Comp.5
 [1,]  -2.292745   5.827588   8.966977  -7.1630488  -2.2195936
 [2,]  25.846460  13.457048  -3.257987   0.5344066   0.4777994
 [3,] -14.856875   4.337867  -4.057297  -2.5308172   1.4998247
 [4,]  70.434116  -3.286077   6.423473   3.9571310   0.8815369
 [5,]  13.768664  -4.392701  -6.058773  -4.7551497  -2.2951908
 [6,] -28.899236  -4.611347   4.338621  -2.2710490   6.7118075
 [7,]   5.216449  -4.536616  -7.625423   2.2093319   3.2618335
 [8,]  -3.432334 -11.115805  -3.553422  -0.9908949  -4.1604420
 [9,] -31.579207   8.354892  -2.497369   5.6986938  -1.9742069
[10,] -34.205292  -4.034848   7.321199   5.3113963  -2.1833687
X = read.table('pca.input', sep=" ", header=TRUE)
pc = princomp(X)
mu = pc$center
Gamma = pc$loadings
Y = pc$scores
XminusMu = sweep(X, MARGIN=2, mu, FUN="-")
propOfVar = pc$sdev^2 / sum(pc$sdev^2)
eigenVals = pc$sdev^2
Arrows for the original variables: Length = proportion of variance explained in the 2 components; Direction = relative loadings in the 2 components
e.g., diffgeom largest (++, ++); algebra smallest (+, −)
X (data):
     diffgeom  complex  algebra  reals  stats
1          36       58       43     36     37
2          62       54       50     46     52
3          31       42       41     40     29
4          76       78       69     66     81
5          46       56       52     56     40
6          12       42       38     38     28
7          39       46       51     54     41
8          30       51       54     52     32
9          22       32       43     28     22
10          9       40       47     30     24
Principal Components Analysis (PCA)
• If X is a random vector (mean μ, covariance matrix Σ), then the principal component transformation is X → Y = (X−μ)Γ such that Γ is orthogonal, ΓᵀΣΓ = Λ is diagonal, and λ1 ≥ λ2 ≥ … ≥ λp ≥ 0.
Adapted from S-plus Guide to Statistics
X (data):
     diffgeom  complex  algebra  reals  stats
1          36       58       43     36     37
2          62       54       50     46     52
3          31       42       41     40     29
4          76       78       69     66     81
5          46       56       52     56     40
6          12       42       38     38     28
7          39       46       51     54     41
8          30       51       54     52     32
9          22       32       43     28     22
10          9       40       47     30     24
What if we could only choose two dimensions?
Clustering
• Partitioning
  – Must specify the number of clusters
  – k-Means, Self-Organizing Maps (SOM/Kohonen Net)
• Hierarchical Clustering
  – Do not need to specify the number of clusters
  – Need to specify a distance metric and a linkage method
• Other approaches
  – Fuzzy clustering (probabilistic membership)
  – Spectral clustering (using eigenvalue decomposition)
Clustering
http://apandre.wordpress.com/visible-data/cluster-analysis/
http://stackoverflow.com/questions/4722290/generating-synthetic-datasets
R package: mlbench: Machine Learning Benchmark Problems
k-Means
• Initialize: select the initial k centroids
• REPEAT
  – Form k clusters by assigning all points to the 'closest' centroid
  – Recompute the centroid for each cluster
• UNTIL the centroids don't change or all changes are below a predefined threshold
• Initial centroids can be random vectors, randomly selected data vectors, the first k vectors, etc., or computed from a random first assignment
• 'Closest' is typically defined by Euclidean distance (Voronoi diagram)
• Prone to local optima, so typically do N random restarts and take the best (minimum sum of squared Euclidean distances to centroids)
• In practice, favors well-separated spherical clusters
dist_E(x, y) = sqrt( Σ_{i=1..n} (x_i − y_i)² )
Images from wikipedia
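A minimal k-means sketch in base R, with the random restarts described above (the synthetic two-blob data and the nstart value are arbitrary choices):

# k-means with multiple random restarts; keeps the solution with the
# smallest total within-cluster sum of squared Euclidean distances
set.seed(1)
X <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 4), ncol = 2))   # two synthetic blobs
fit <- kmeans(X, centers = 2, nstart = 25)
fit$centers        # the k centroids
fit$tot.withinss   # objective minimized across the 25 restarts
plot(X, col = fit$cluster, pch = 19)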
k-Means
http://en.wikipedia.org/wiki/K-means_clustering
Iteration 0 Iteration 1 Iteration 2
Iteration 3 Iteration 4 Iteration 5
Images from wikipedia
Self-Organizing Maps (SOM)
• Similar to k-Means: the goal is to assign data to the map node (analogous to the centroid in k-Means) whose weight vector is 'closest' to the data-space vector (minimize dist_E(x, w))
• Difference: map nodes are constrained by neighborhood relationships, whereas k-Means centroids move freely
• Must input an initial topology; the map 'stretches' to cover nD data in 2D, and similar data are assigned to neighboring map nodes
Image from wikipedia
Self-Organizing Maps (SOM)
• 1. Initialization – choose random values for the initial weight vectors wj
• 2. Sampling – draw a sample training input vector x from the input space
• 3. Matching – find the winning neuron I(x) with weight vector closest to the input vector (i.e., min dist_E)
• 4. Updating – apply the weight update equation Δwji = η(t) · T_{j,I(x)}(t) · (xi − wji), where η(t) = learning rate at time t* and T_{j,I(x)}(t) = neighborhood function at time t
• 5. Continuation – keep returning to step 2 until the feature map stops changing
http://www.sciencedirect.com/science/article/pii/S0014579399005244
* Informal intro to simulated annealing, gradient descent…
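A minimal SOM sketch using the kohonen package listed on the 'Examples in R' slide near the end of the deck (the grid size, topology, and data set here are arbitrary illustrations):

# Self-organizing map on a numeric matrix; each sample is assigned to the
# map node whose weight vector is closest, subject to the 2D grid topology
library(kohonen)
set.seed(1)
X <- scale(as.matrix(iris[, 1:4]))       # illustrative data
som_fit <- som(X, grid = somgrid(xdim = 5, ydim = 5, topo = "hexagonal"))
head(som_fit$unit.classif)               # winning map node for each sample
plot(som_fit, type = "mapping")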
Self-Organizing Maps (SOM)
http://www.sciencedirect.com/science/article/pii/S0014579399005244
Self-Organizing Maps (SOM)
• Initial grids
  – Wrt size: 1-dimensional, 2-dimensional, 3-dimensional
  – Wrt structure: Rectangular, Hexagonal, Arbitrary planar
http://www.cis.hut.fi/somtoolbox/documentation/grids.gif
http://slideplayer.com/slide/4040706/
Self-Organizing Maps (SOM)
Example: Clustering Gene Expression Profiles
http://physiolgenomics.physiology.org/content/physiolgenomics/10/2/103/F2.large.jpg
Hierarchical Clustering
• Divisive – (top down) start with all points in 1 cluster, successively sub-divide the 'farthest' points until the full tree is built
• Agglomerative – (bottom up) start with each point in its own cluster (singleton), merge the 'closest' pair of clusters at each step until the root
  – Requires a metric to define 'closest' – the distance is no longer between points, but between clusters
  – The linkage strategy for choosing which merge to make is often based on pairwise point comparisons
• Dendrogram shows the order of splits
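A minimal agglomerative sketch in base R; the distance metric and linkage method are exactly the two choices called out above (the data set and cut level are illustrative):

# Agglomerative hierarchical clustering: distance matrix -> hclust -> dendrogram
X  <- as.matrix(USArrests)
d  <- dist(scale(X), method = "euclidean")   # distance metric
hc <- hclust(d, method = "average")          # linkage method
plot(hc)                                     # dendrogram shows the order of merges
cutree(hc, k = 4)                            # cut the tree into 4 clusters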
Hierarchical Clustering
http://images.slideplayer.com/11/3289326/slides/slide_7.jpg
Distance Metrics
• Euclidean – distance in Euclidean space
• Pearson Correlation – linear relationships
• Spearman Correlation – monotonic relationships
• Mutual Information – non-linear relationships
• Polyserial Correlation – correlation of continuous vs ordinal (polychoric if ordinal vs ordinal)
• Hamming Distance, Jaccard, Dice (binary variables)
dist_E(x, y) = sqrt( Σ_{i=1..n} (x_i − y_i)² )

dist_P(x, y) = 1 − [ Σ_{i=1..n} (x_i − x̄)(y_i − ȳ) ] / [ sqrt(Σ_{i=1..n} (x_i − x̄)²) · sqrt(Σ_{i=1..n} (y_i − ȳ)²) ]

dist_S(x, y) = 1 − [ Σ_{i=1..n} (r_{x_i} − r̄_x)(r_{y_i} − r̄_y) ] / [ sqrt(Σ_{i=1..n} (r_{x_i} − r̄_x)²) · sqrt(Σ_{i=1..n} (r_{y_i} − r̄_y)²) ],  where r_z = rank(z)

dist_MI(x, y) = H(x, y) − MI(x, y),  where H(x) = −Σ_x p_x log p_x,  H(x, y) = −Σ_{x,y} p_{x,y} log p_{x,y},  and MI(x, y) = H(x) + H(y) − H(x, y)

dist_J = (M01 + M10) / (M01 + M10 + M11)   (Jaccard: ignores 0–0 matches, good when a 0 gives no info)

dist_H = M01 + M10   (Hamming)

dist_D = 1 − 2|X ∩ Y| / (|X| + |Y|)   (Dice: like Jaccard but matches counted twice)
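A few of these distances as they are typically computed in base R (the 1 − correlation convention matches the Pearson and Spearman distances above; the data are synthetic):

set.seed(1)
x <- rnorm(20)
y <- x^3 + rnorm(20, sd = 0.1)          # monotonic but non-linear relationship
sqrt(sum((x - y)^2))                    # Euclidean distance
1 - cor(x, y, method = "pearson")       # Pearson distance (linear association)
1 - cor(x, y, method = "spearman")      # Spearman distance (monotonic association)

# Binary vectors: Jaccard-style distance via dist(..., method = "binary")
a <- c(1, 0, 1, 1, 0, 0)
b <- c(1, 1, 1, 0, 0, 0)
dist(rbind(a, b), method = "binary")    # (M01 + M10) / (M01 + M10 + M11)
sum(a != b)                             # Hamming distance (count of mismatches)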
Distance Metrics
• Euclidean vs Pearson (linear) vs Spearman (monotonic)
[Figure: six example profiles A–F; the table gives each measure against profiles A–F]
     A     B     C     D     E     F
A    1     1    -1    0.8    0     0     Pearson
     1     1    -1    1      0     0     Spearman
     0     8     9    6     17    19     EucDist
B    8     0     1    6     22    23     EucDist
C   -1    -1     1   -0.7    0     0     Pearson
E    0     0     0    0.3    1    0.85   Pearson
     0     0     0    0      1    0.91   Spearman
Numbers are Pearson correlation where marked; note Pearson is invariant to slope, and Pearson = 0 if the relationship is non-linear.
Linkage Methods
• Single Linkage: argmin_{S,T} min_{s∈S, t∈T} dist(s,t)
• Complete Linkage: argmin_{S,T} max_{s∈S, t∈T} dist(s,t)
• Average Linkage (a.k.a. group average): argmin_{S,T} average_{s∈S, t∈T} dist(s,t)
• Centroid Linkage: min dist(centroid(S), centroid(T)) (often mistaken for average linkage after the Eisen et al. 1998 TreeView paper!)
• Ward's Linkage (optimizes the same criterion as k-Means)
• UPGMA (Unweighted Pair Group Method with Arithmetic Mean) from the Lozupone lecture – assumes a constant rate of evolution; average linkage, Euclidean distance
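A quick sketch comparing linkage methods on one distance matrix with base R's hclust (the method names are hclust's; "average" corresponds to the group-average/UPGMA linkage mentioned above):

d <- dist(scale(as.matrix(USArrests)))
par(mfrow = c(1, 3))
for (m in c("single", "complete", "average")) {
  plot(hclust(d, method = m), main = paste(m, "linkage"), labels = FALSE)
}
# hclust also supports "centroid" and "ward.D2" (Ward's criterion, the same
# within-cluster variance criterion that k-means optimizes)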
http://stackoverflow.com/questions/4722290/generating-synthetic-datasets
R package: mlbench: Machine Learning Benchmark Problems
Loadings:
            Comp.1  Comp.2
Murder       -0.53    0.41
Assault      -0.58    0.18
UrbanPop     -0.27   -0.87
Rape         -0.54   -0.16

            Murder  Assault  UrbanPop  Rape
Alabama       13.2      236        58  21.2
Alaska        10.0      263        48  44.5
Arizona        8.1      294        80  31.0
Arkansas       8.8      190        50  19.5
California     9.0      276        91  40.6
Colorado       7.9      204        78  38.7
Choosing the Number of Clusters
• Rule of thumb: k ≈ √(n/2)
• Elbow or knee method (bend in a plot of the metric vs. K)
• k-Means likes spherical clusters, so minimize the within-cluster variation W(K) (SSE, the sum of distances of all points to their cluster mean), or maximize the between-cluster variation B(K) (distance between clusters), or both: CH(K) = [B(K)/(K−1)] / [W(K)/(n−K)]
• Gap Statistic – calculate SSE; randomize the dataset and calculate SSE_rand, n times; gap = log(mean SSE_rand / SSE)
• Hierarchical – plot the distance chosen at each merge (okay for single, complete linkage)
See also http://cran.r-project.org/web/packages/clusterCrit/vignettes/clusterCrit.pdf for long list of indices, NbClust R package: http://cedric.cnam.fr/fichiers/art_2579.pdf and http://www.stat.cmu.edu/~ryantibs/datamining/lectures/06-clus3.pdf
[Plots of W(K), B(K), CH(K), and Gap(K) versus K]
*Calinski & Harabasz 1974
*Tibshirani, Walther, Hastie 2001
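A minimal elbow-method sketch for k-means, plotting W(K) against K (the data set, range of K, and nstart are illustrative); the gap statistic is available pre-packaged, e.g. in the cluster package:

# Elbow / knee heuristic: plot total within-cluster SSE, W(K), against K
set.seed(1)
X <- scale(as.matrix(USArrests))
K <- 1:10
W <- sapply(K, function(k) kmeans(X, centers = k, nstart = 20)$tot.withinss)
plot(K, W, type = "b", xlab = "K", ylab = "W(K) = within-cluster SSE")
# Gap statistic (Tibshirani et al. 2001), e.g. via the cluster package:
# cluster::clusGap(X, FUNcluster = kmeans, K.max = 10, B = 50)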
Association Set Mining
• Also known as Market Basket Analysis
  {milk, eggs} → {butter}
• Support of itemset X: supp(X) = # transactions with itemset X
• Confidence of rule: conf(X → Y) = supp(X & Y) / supp(X)
• Lift of rule (performance over assuming independence): lift(X → Y) = supp(X & Y) / (supp(X) · supp(Y))
• Want rules with max supp, conf, lift
• Other measures found at: http://michael.hahsler.net/research/association_rules/measures.html
Association Set Mining
• Tables of data are converted to transactions by creating binary variables for all categories of all variables (continuous variables must be discretized; missing data is okay)
ID     Gender  Age  Height (inches)  Race       Diagnosis
CC245  Male      6               25  Caucasian  Depression
CC346  Male     75               60  African    COPD
CC978           30               54  Asian      Obesity
CC125  Female   15               54  African

{ {gender_M=Y, age_child=Y, height_20-29=Y, race_WH=Y, diag_depr=Y},
  {gender_M=Y, age_senior=Y, height_60-69=Y, race_BL=Y, diag_copd=Y},
  {age_adult=Y, height_50-59=Y, race_AS=Y, diag_obes=Y},
  {gender_F=Y, age_adol=Y, height_50-59=Y, race_BL=Y} }
Association Set Mining
Example in R: arules pkg, apriori algorithm

    lhs                                     rhs             support confidence  lift
1   {Class=2nd, Age=Child}               => {Survived=Yes}    0.011      1.000  3.097
2   {Class=2nd, Sex=Female, Age=Child}   => {Survived=Yes}    0.006      1.000  3.096
3   {Class=1st, Sex=Female}              => {Survived=Yes}    0.064      0.972  3.010
4   {Class=1st, Sex=Female, Age=Adult}   => {Survived=Yes}    0.064      0.972  3.010
…
12  {Sex=Female, Survived=Yes}           => {Age=Adult}       0.143      0.918  0.966
27  {Class=2nd}                          => {Age=Adult}       0.118      0.915  0.963

Note that rule 2 is subsumed by rule 1, which has better lift (and support) – such redundant rules can be removed.
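A sketch of how rules like those above are typically mined with the arules package; the exact call used for the slide is not shown, so the thresholds here are illustrative:

library(arules)
# Expand R's built-in 4-way Titanic contingency table to one row per passenger,
# then coerce the factor columns to transactions (one binary item per level)
df    <- as.data.frame(Titanic)
df    <- df[rep(seq_len(nrow(df)), df$Freq), c("Class", "Sex", "Age", "Survived")]
trans <- as(df, "transactions")
rules <- apriori(trans, parameter = list(supp = 0.005, conf = 0.9, minlen = 2))
rules <- rules[!is.redundant(rules)]            # drop rules subsumed by better ones
inspect(head(sort(rules, by = "lift"), 5))      # top rules by lift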
Probabilistic Graphical Models
[Taxonomy figure, organized by Time, Observability, and Utility:
  Markov Process (MP): states X_{t−1} → X_t
  Hidden Markov Model (HMM): hidden states X_{t−1} → X_t, each emitting an observation O_{t−1}, O_t
  Markov Decision Process (MDP): states X, actions A, utilities U
  Partially Observable Markov Decision Process (POMDP): states X, actions A, observations O, utilities U]
Hidden Markov Model
• Finite set of N states X
• Finite set of M observations O
• Parameter set λ = (A, B, π)
  – Initial state distribution πi = Pr(X1 = i)
  – Transition probability aij = Pr(Xt = j | Xt−1 = i)
  – Emission probability bik = Pr(Ot = k | Xt = i)
• Given an observation sequence O = O1, O2, …, OT, how do we compute Pr(O|λ)?
Hidden Markov Model (HMM)
[Graphical model: hidden states X_{t−1} → X_t, each emitting an observation O_{t−1}, O_t]
Example: N = 3, M = 2, π = (0.25, 0.55, 0.2), with a 3×3 transition matrix A and a 3×2 emission matrix B
[State diagram over states 1, 2, 3, and a trellis over states st1, st2, st3 for observations obs1, obs2]
• Probability of O is the sum over all state sequences:
  Pr(O|λ) = Σ_{all X} Pr(O|X, λ) Pr(X|λ) = Σ_{all X} π_{x1} b_{x1,o1} a_{x1,x2} b_{x2,o2} … a_{x_{T−1},x_T} b_{x_T,o_T}
• At each t there are N states to reach, so there are N^T possible state sequences and ~2T multiplications per sequence, i.e., O(2T·N^T) operations
• So 3 states and a length-10 sequence ≈ 1,180,980 operations, and length 20 ≈ 1e11!
• Efficient dynamic programming algorithm: Forward algorithm (Baum & Welch), O(N²T)
[Same example: N = 3, M = 2, π = (0.25, 0.55, 0.2), with the transition matrix A and emission matrix B from the previous slide]
πi = Pr(X1 = i), aij = Pr(Xt = j | Xt−1 = i), bik = Pr(Ot = k | Xt = i)
Applications in Bioinformatics
• DNA – motif matching, gene matching, multiple sequence alignment
• Amino Acids – domain matching, fold recognition
• Microarrays/Whole Genome Sequencing – assign copy number
• ChIP-chip/seq – distinct chromatin states
Bayesian Networks
• Given a set of random variables, the joint probability distribution can be represented by:
  – Structure: Directed Acyclic Graph (DAG)
    • variables are nodes; the absence of arcs captures conditional independencies
  – Parameters: Local Conditional Probability Distributions (CPDs)
    • conditional probability of a variable given the values of its parents in the graph
• The joint probability factors into the product of local CPDs:
  Pr(X1, X2, …, Xn) = ∏_{i=1..n} Pr(Xi | Parents(Xi))
Bayesian Networks
• Generally can think of directed arcs as 'causal' (be careful!)
  – If the sprinkler is on OR it is raining, then the grass will be wet: Pr(W|S,R)
• If we observe wet grass, we can determine whether it is because of the sprinkler or the rain
  – Pr(R|W) and Pr(S|W)
  – Bayes rule: Pr(X|Y) = Pr(Y|X)Pr(X)/Pr(Y)
• Note S and R compete to explain W: this model says sprinkler usage is (conditionally) independent of rain, but if we know the grass is wet and it is raining, then it is less likely that the sprinkler being on is the explanation for W
  – Pr(S|W,R) < Pr(S|W): "explaining away"
• Note only 9 parameters are needed instead of 2⁴ = 16
http://www.cs.ubc.ca/~murphyk/Bayes/bnintro.html
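A small sketch of 'explaining away' by brute-force enumeration of the joint distribution, using the four-node Cloudy/Sprinkler/Rain/WetGrass network from the Murphy tutorial linked above (the CPT numbers are that tutorial's illustrative values, not something given on these slides):

# Joint = product of local CPDs: Pr(C,S,R,W) = Pr(C) Pr(S|C) Pr(R|C) Pr(W|S,R)
p_C          <- 0.5
p_S_given_C  <- c("FALSE" = 0.5, "TRUE" = 0.1)           # Pr(S=T | C)
p_R_given_C  <- c("FALSE" = 0.2, "TRUE" = 0.8)           # Pr(R=T | C)
p_W_given_SR <- matrix(c(0.0, 0.9, 0.9, 0.99), 2, 2,     # Pr(W=T | S, R)
                       dimnames = list(S = c("FALSE", "TRUE"), R = c("FALSE", "TRUE")))

joint <- expand.grid(C = c(FALSE, TRUE), S = c(FALSE, TRUE),
                     R = c(FALSE, TRUE), W = c(FALSE, TRUE))
joint$p <- with(joint,
  ifelse(C, p_C, 1 - p_C) *
  ifelse(S, p_S_given_C[as.character(C)], 1 - p_S_given_C[as.character(C)]) *
  ifelse(R, p_R_given_C[as.character(C)], 1 - p_R_given_C[as.character(C)]) *
  ifelse(W, p_W_given_SR[cbind(as.character(S), as.character(R))],
            1 - p_W_given_SR[cbind(as.character(S), as.character(R))]))

pr <- function(cond) sum(joint$p[cond])                  # probability of an event
pr(joint$S & joint$W) / pr(joint$W)                      # Pr(S | W)
pr(joint$S & joint$W & joint$R) / pr(joint$W & joint$R)  # Pr(S | W, R): smaller ("explaining away")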
Applications in Bioinformatics
PMID: 16873470
Gene regulatory networks (Friedman et al, 2000, PMID: 11108481)
Predicting clinical outcomes using expression data (Gevaert et al, 2006, PMID: 16873470)
Determining Regulators with PRMS (Segal et al, 2002, RECOMB)
Gene Function Prediction(Troyanskaya et al, 2003, PMID: 12826619 )
Hanalyzer – edge scores(Leach et al, 2009, PMID: 19325874)
Supervised Learning
Supervised Learning
• Given examples (x,y) of input features x and output variable y, learn a function f(x)=y
  – Regression (continuous response) vs Classification (discrete response)
  – Dimensionality Reduction (feature selection vs extraction)
  – Cross validation (Leave-One-Out vs N-Fold)
  – Generalization (training set error vs test set error)
  – Missing data and Imputation
  – Model Selection (AIC, BIC)
  – Boosting/bagging/jackknife
  – Curse of dimensionality
Supervised Learning
• Boosting (weak learners on different subsets)
  – Train H1 on a random data split; sample among H1's predictions so the next data set, used to train H2, is half wrong and half right under H1. Train H3 on cases where both H1 and H2 are wrong. Return the majority vote of H1, H2, H3 (AdaBoost weights examples and uses a weighted vote)
• Bagging (bootstrap aggregation)
  – Train multiple models on random with-replacement (bootstrap) splits of the input data, average the predictions
• Jackknife (vs bootstrap) – disjoint subsets of the data
• Model Selection: balance goodness of fit (likelihood L) with complexity of the model (number of parameters k) for n samples
  – Bayesian information criterion (BIC): minimize k·ln(n) − 2·ln(L)
  – Akaike information criterion (AIC): minimize 2k − 2·ln(L) (penalizes complexity less strongly; better theory than BIC)
• Curse of dimensionality – the greater D, the sparser the data samples are in covering the space, so more and more data are needed to learn properly
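A bare-bones bagging sketch with rpart trees as the base learner (the data set and the number of bootstrap replicates are arbitrary):

# Bagging: fit one tree per bootstrap resample, then take a majority vote
library(rpart)
set.seed(1)
B      <- 25
models <- lapply(seq_len(B), function(b) {
  idx <- sample(nrow(iris), replace = TRUE)          # bootstrap split
  rpart(Species ~ ., data = iris[idx, ], method = "class")
})
votes  <- sapply(models, function(m) as.character(predict(m, iris, type = "class")))
bagged <- apply(votes, 1, function(v) names(which.max(table(v))))  # majority vote
mean(bagged == iris$Species)                         # training accuracy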
Decision Boundaries
https://sites.google.com/a/iupr.com/dia-course/lectures/lecture08-classification-with-neural-networks
k-Nearest Neighbors
• Store a database of (x,y) pairs; classify a new example by the majority vote of its k nearest neighbors (regression if you assign the (weighted) mean y in the neighborhood)
• No training needed, non-parametric, sensitive to local structure in the data; the most frequent class tends to dominate
• Curse of dimensionality: with many variables, any query is nearly equidistant to all points – reduce features by PCA
• Allows complicated boundaries between classes
[Figure: if k=3, the query is assigned (green, red); if k=5, (green, blue)]
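A minimal k-NN sketch with the class package listed on the 'Examples in R' slide at the end of the deck (the train/test split and k are arbitrary):

library(class)
set.seed(1)
idx   <- sample(nrow(iris), 100)                  # arbitrary train/test split
train <- iris[idx, 1:4]
test  <- iris[-idx, 1:4]
cl    <- iris$Species[idx]
pred  <- knn(train, test, cl, k = 5)              # majority vote of 5 neighbors
table(pred, iris$Species[-idx])                   # confusion matrix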
Neural Network: Linear Perceptron
• Learning (backpropagation):
  • Initialize w_t, choose learning rate η
  • 1) Calculate prediction y*_{j,t} = f[w_t · x_j]
  • 2) Update weights w_{t+1} = w_t + η(y_j − y*_{j,t})x_j
  • Repeat 1 & 2 until (y_j − y*_{j,t}) < threshold
  – Can be generalized to multi-class
  – Optimal only if the data are linearly separable
[Figure: step activation function]
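The update rule above is small enough to code directly; a sketch for a two-class problem with labels y ∈ {−1, +1}, a step activation, and synthetic linearly separable data (the learning rate eta is my choice):

# Linear perceptron: y* = sign(w . x), w <- w + eta * (y - y*) * x
set.seed(1)
n <- 100
X <- cbind(1, matrix(rnorm(2 * n), ncol = 2))       # column of 1s = bias term
y <- ifelse(X[, 2] + X[, 3] > 0, 1, -1)             # linearly separable labels
w <- rep(0, 3); eta <- 0.1
for (epoch in 1:50) {
  for (j in 1:n) {
    ystar <- ifelse(sum(w * X[j, ]) >= 0, 1, -1)    # step activation
    w <- w + eta * (y[j] - ystar) * X[j, ]          # update only when wrong
  }
}
mean(ifelse(X %*% w >= 0, 1, -1) == y)              # training accuracy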
Neural Network: Multi-Layer Perceptron
• Smooth activation function instead (sigmoid, tanh)
• Can also have multiple hidden layers
• Can learn when the data are not linearly separable
• Learn as before, but backpropagate from the output layer
[Figure: input layer, hidden layer, output layer]
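A minimal multi-layer perceptron sketch with the neuralnet package listed on the 'Examples in R' slide at the end of the deck (one hidden layer; the two-class subset of iris is only an illustration):

library(neuralnet)
set.seed(1)
d   <- iris[iris$Species != "setosa", ]               # two-class subset
d$y <- as.numeric(d$Species == "virginica")
nn  <- neuralnet(y ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
                 data = d, hidden = 3, linear.output = FALSE)
pred <- as.numeric(compute(nn, d[, 1:4])$net.result > 0.5)
mean(pred == d$y)                                      # training accuracy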
Decision Tree
• A node is the attribute tested, a branch is an outcome, a leaf is the (majority) class (probability)
• Discrete: X = xi?  Real-valued: X < value?
• Greedy algorithm chooses the best attribute to split upon:
  – p_i = fraction of items labeled i in the set
  – Gini impurity: I_G(p) = Σ_{i≠j} p_i p_j
    (probability an item labeled i is chosen × probability i is mistakenly assigned class j)
  – Information gain: I_E(p) = −Σ_i p_i log2 p_i
  – Real-valued response: SSE
• EASY TO INTERPRET!!! Can overfit, needs a large tree for XOR, biased in favor of attributes with more levels => ensembles
[Example tree: internal nodes test BIOPSY+, Rx SIDE EFFECT, BREATH>90%, BREATH<30%, each with Y/N branches; leaves give outcome counts such as Died: 3 / Alive: 27, Died: 15 / Alive: 15, Died: 20 / Alive: 57, Died: 30 / Alive: 7, Died: 80 / Alive: 1]
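A minimal decision tree sketch with the rpart package listed on the 'Examples in R' slide at the end of the deck; the printed tree gives the attribute tested at each node and the class counts at the leaves, as in the figure above (the data set is illustrative):

library(rpart)
fit <- rpart(Species ~ ., data = iris, method = "class")   # Gini splits by default
print(fit)               # text form: split, n, loss, predicted class, probabilities
plot(fit); text(fit)     # quick plot of the tree
predict(fit, iris[1:3, ], type = "class")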
Random Forest
• A classifier consisting of an ensemble of decision trees {h(x, Θk)} where Θk is some i.i.d. random vector, and each tree casts a vote for the class of x (Breiman 2001)
  1. Bagging – Θk is a random selection of N samples (with replacement) used to grow the tree
  2. Dietterich 98: Θk is a random split among the n best splits
  3. Ho 98: Θk is a random subset of features used to grow the tree (√k)
  4. AdaBoost-like: Θk is random weights on the examples
  – 4 better than {2,3} better than 1 on generalization error
• Out-of-bag estimates: internal estimates of generalization error, classifier strength, and correlation between trees
Random Forest
• Most popular implementation of {h(x, Θk)}: bagging (random subset of samples with replacement) + random subset of features
  – If the set of features is small, the trees are more correlated, so new features can be made as random linear combinations of the original features
• Out-of-bag classifier for a specific {x,y} = aggregate over the trees that did not use {x,y} as training data (removes the need for setting aside test data)
• The out-of-bag estimate is the error rate of the out-of-bag classifier on the training set (can also estimate OOB strength and correlation)
• Can estimate variable importance from OOB estimates
  – For the m-th variable, permute its values and compare the misclassification rate of the OOB classifiers on the 'noised-up' data with OOB on the real data; a large increase implies the m-th variable is important
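A minimal sketch with the randomForest package (the standard R implementation of Breiman's method; the package itself is not named on the slides). The out-of-bag error and permutation importance described above come out of the same fit:

library(randomForest)
set.seed(1)
rf <- randomForest(Species ~ ., data = iris, ntree = 500, importance = TRUE)
rf                 # prints the OOB estimate of the error rate
importance(rf)     # permutation-based variable importance (MeanDecreaseAccuracy)
varImpPlot(rf)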
Support Vector Machine (SVM)
• Support vectors are the points that lie closest to the decision surface; they define the maximum 'margin' around the hyperplane separating the examples (the solution changes if SVs are removed)
• Kernel function – maps not-linearly-separable data to a transformed space where the transformed data are linearly separable
• Advantages: non-probabilistic, optimization rather than greedy search, not affected by local minima, theoretical guarantees of performance, escapes the curse of dimensionality
Support Vector Machine (SVM)
• The distance between H and H1 is 1/||w||, so to maximize the margin, minimize ||w|| (where ||w|| = sqrt(Σ_i w_i²)), subject to no points between H1 & H2:
  x_i·w + b ≥ +1 when y_i = +1
  x_i·w + b ≤ −1 when y_i = −1
  i.e., y_i(x_i·w + b) ≥ 1
• Quadratic program (constrained optimization, solved via (the dual of) a Lagrangian multiplier formulation):
  Max L = Σ_i α_i − ½ Σ_{i,j} α_i α_j y_i y_j x_i·x_j   s.t.   w = Σ_i α_i y_i x_i and Σ_i α_i y_i = 0
• If not linearly separable, use a transformation to a space where the data are linearly separable, via kernels, i.e., Φ(x_i) instead of x_i
• If the L1-norm is used (not the L2 above), the weights give variable importance
[Figure: separating hyperplane H with margin hyperplanes H1 and H2; support vectors marked with +]
http://www.cs.ucf.edu/courses/cap6412/fall2009/papers/Berwick2003.pdf
http://books.nips.cc/papers/files/nips16/NIPS2003_AA07.pdf
Support Vector Machine (SVM)
• Not separated by a linear function, but can be by a quadratic one
• Kernels:
  – Polynomial (p=1 linear): K(x, x′) = (x·x′ + 1)^p
  – Radial basis function (Gaussian): K(x, x′) = exp(−‖x − x′‖² / (2σ²))
  – ~sigmoid (like a Neural Net): K(x, x′) = tanh(κ x·x′ − δ)
Other Useful Kernel Functions
• The use of kernels allows complex data types to be used in SVMs without having to translate them into real-valued, fixed-length vectors: K: D × D → R
• String kernel: compare two sequences
• Graph kernel: compare two nodes in a graph, or two graphs
• Image kernels: compare two images
• and so on … (any symmetric, positive semi-definite matrix is a kernel)
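A minimal SVM sketch with the e1071 package listed on the 'Examples in R' slide at the end of the deck, fitting a linear and an RBF (Gaussian) kernel; cost and gamma are left at their defaults:

library(e1071)
set.seed(1)
d <- iris[iris$Species != "setosa", ]          # two-class subset
d$Species <- droplevels(d$Species)
svm_lin <- svm(Species ~ ., data = d, kernel = "linear")
svm_rbf <- svm(Species ~ ., data = d, kernel = "radial")   # Gaussian/RBF kernel
table(predict(svm_rbf, d), d$Species)          # confusion matrix on training data
svm_rbf$tot.nSV                                # number of support vectors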
Naïve Bayes
• Recall Bayes rule: Pr(X|Y) = Pr(Y|X)Pr(X) / Pr(Y)
• Classifier: Pr(C|F1,…,Fn) = Pr(C) Pr(F1,…,Fn|C) / Pr(F1,…,Fn)
  – Note the denominator does not depend on C (effectively a constant Z)
  – "Naïve" assumption because we assume Fi, Fj independent (given C)
  – Simplifies the calculation: Pr(C|F1,…,Fn) = (1/Z) · Pr(C) ∏_i Pr(Fi|C)
• Learn the parameters Pr(C) and each Pr(Fi|C) by maximum likelihood (multinomial, Gaussian, …)
  – Can learn each Pr(Fi|C) independently; escapes the curse of dimensionality, does not need the dataset to scale with the number of Fi
[Graphical model: class C with children F1, F2, F3, …, Fn]
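A minimal naïve Bayes sketch with the e1071 package listed on the 'Examples in R' slide below (Gaussian Pr(Fi|C) is used for continuous features):

library(e1071)
nb <- naiveBayes(Species ~ ., data = iris)     # learns Pr(C) and each Pr(Fi|C)
nb$apriori                                     # class counts used to form the prior Pr(C)
nb$tables$Petal.Length                         # per-class mean and sd for one feature
predict(nb, iris[1:3, ], type = "class")
predict(nb, iris[1:3, ], type = "raw")         # posterior Pr(C | F1..Fn)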
Examples in R
• Making 2D datasets
  – Install libraries: mlbench
• Clustering (Hierarchical, K-Means, SOM)
  – Install libraries: kohonen
• Classification (kNN, NN, DT, SVM, NB)
  – Install libraries: class (if R>3.0, o/w knn), neuralnet, rpart, e1071
http://stackoverflow.com/questions/4722290/generating-synthetic-datasets
R package: mlbench: Machine Learning Benchmark Problems
Additional References
• Logit regression example: http://www.ats.ucla.edu/stat/r/dae/logit.htm
• PCA: http://www.cs.princeton.edu/picasso/mats/PCA-Tutorial-Intuition_jp.pdf
• Statistical Pattern Recognition Toolbox for Demos: http://cmp.felk.cvut.cz/cmp/software/stprtool/examples.html
• KMeans: https://onlinecourses.science.psu.edu/stat857/node/125
• SOMs:
  – http://www.cs.bham.ac.uk/~jxb/NN/l16.pdf
  – http://www.loria.fr/~rougier/coding/article/article.html
  – http://www.sciencedirect.com/science/article/pii/S0014579399005244
• Distance metrics:
  – http://www.statmethods.net/stats/correlations.html
  – http://people.revoledu.com/kardi/tutorial/Similarity/index.html – nice discussion of differences
  – http://www.datavis.ca/papers/corrgram.pdf – make a visual panel (like a heatmap) of correlations between variables
• Choosing number of clusters:
  – Nice one: http://cran.r-project.org/web/packages/clusterCrit/vignettes/clusterCrit.pdf
  – http://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set
  – http://psycnet.apa.org/journals/met/16/3/285/
  – http://blog.echen.me/2011/03/19/counting-clusters/
  – http://stackoverflow.com/questions/15376075/cluster-analysis-in-r-determine-the-optimal-number-of-clusters
  – http://www.stat.cmu.edu/~ryantibs/datamining/lectures/06-clus3.pdf
• Neural Networks:
  – Good one: http://www.cogsys.wiai.uni-bamberg.de/teaching/ss05/ml/slides/cogsysII-4.pdf
  – https://sites.google.com/a/iupr.com/dia-course/lectures/lecture08-classification-with-neural-networks
  – Nice for MLP: http://users.ics.aalto.fi/ahonkela/dippa/node41.html
• Boosting vs Bagging: http://people.cs.pitt.edu/~milos/courses/cs2750-Spring04/lectures/class23.pdf
• Random Forests: https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
• SVMs: Idiots' guide to SVMs: http://www.cs.ucf.edu/courses/cap6412/fall2009/papers/Berwick2003.pdf
• Kernel Methods: http://www.kernel-methods.net/tutorials/KMtalk.pdf
The End
Not used
Hidden Markov Model
• Finite set of N states X
• Finite set of M observations O
• Parameter set λ = (A, B, π)
  – Initial state distribution πi = Pr(X1 = i)
  – Transition probability aij = Pr(Xt = j | Xt−1 = i)
  – Emission probability bik = Pr(Ot = k | Xt = i)
• Given an observation sequence O = O1, O2, …, OT, how do we compute Pr(O|λ)?
Hidden Markov Model (HMM)
[Graphical model: hidden states X_{t−1} → X_t, each emitting an observation O_{t−1}, O_t]
Example: N = 3, M = 2, π = (0.25, 0.55, 0.2), with a 3×3 transition matrix A and a 3×2 emission matrix B
[State diagram over states 1, 2, 3, and a trellis over states st1, st2, st3 for observations obs1, obs2]
• Probability of O is the sum over all state sequences:
  Pr(O|λ) = Σ_{all X} Pr(O|X, λ) Pr(X|λ) = Σ_{all X} π_{x1} b_{x1,o1} a_{x1,x2} b_{x2,o2} … a_{x_{T−1},x_T} b_{x_T,o_T}
• What is the computational complexity of this sum?
[Same 3-state example: N = 3, M = 2, π = (0.25, 0.55, 0.2), with matrices A and B as above]
πi = Pr(X1 = i), aij = Pr(Xt = j | Xt−1 = i), bik = Pr(Ot = k | Xt = i)
• Probability of O is the sum over all state sequences:
  Pr(O|λ) = Σ_{all X} Pr(O|X, λ) Pr(X|λ) = Σ_{all X} π_{x1} b_{x1,o1} a_{x1,x2} b_{x2,o2} … a_{x_{T−1},x_T} b_{x_T,o_T}
• At each t there are N states to reach, so there are N^T possible state sequences and ~2T multiplications per sequence, i.e., O(2T·N^T) operations
• So 3 states and a length-10 sequence ≈ 1,180,980 operations, and length 20 ≈ 1e11!
[Same 3-state example: N = 3, M = 2, π = (0.25, 0.55, 0.2), with matrices A and B as above]
πi = Pr(X1 = i), aij = Pr(Xt = j | Xt−1 = i), bik = Pr(Ot = k | Xt = i)
• Probability of O is the sum over all state sequences:
  Pr(O|λ) = Σ_{all X} Pr(O|X, λ) Pr(X|λ) = Σ_{all X} π_{x1} b_{x1,o1} a_{x1,x2} b_{x2,o2} … a_{x_{T−1},x_T} b_{x_T,o_T}
• Efficient dynamic programming algorithm to do this: Forward algorithm (Baum and Welch, O(N²T))
[Same 3-state example: N = 3, M = 2, π = (0.25, 0.55, 0.2), with matrices A and B as above]
πi = Pr(X1 = i), aij = Pr(Xt = j | Xt−1 = i), bik = Pr(Ot = k | Xt = i)
The Forward Algorithm: Probability of a Sequence is the Sum of All Paths that Can Produce It
[Two-state CpG-island example:
  Emission probabilities – CpG state: G .3, C .3, A .2, T .2; Non-CpG state: G .1, C .1, A .4, T .4
  Transition probabilities – CpG→CpG 0.8, CpG→Non-CpG 0.2, Non-CpG→CpG 0.1, Non-CpG→Non-CpG 0.9
  Forward variables for the sequence G C G A A:
    G:  CpG = .3                                  Non-CpG = .1
    C:  CpG = .3*(.3*.8 + .1*.1) = .075           Non-CpG = .1*(.3*.2 + .1*.9) = .015
    G:  CpG = .3*(.075*.8 + .015*.1) = .0185      Non-CpG = .1*(.075*.2 + .015*.9) = .0029
    A:  CpG = .2*(.0185*.8 + .0029*.1) = .003     Non-CpG = .4*(.0185*.2 + .0029*.9) = .0025
    A:  CpG = .2*(.003*.8 + .0025*.1) = .0005     Non-CpG = .4*(.003*.2 + .0025*.9) = .0011 ]
David Pollock's Lecture
Parameter estimation by the Baum-Welch (Forward-Backward) Algorithm
Forward variable: αt(i) = Pr(O1..t, Xt = i | λ)
Backward variable: βt(i) = Pr(Ot+1..N | Xt = i, λ)
DEFINITIVE tutorial: Rabiner 1989: http://compbio.ucdenver.edu/hunter/cpbs7711/2010_09_09_Rabiner1989_HMMTutorial_Leach.pdfand erratum: http://compbio.ucdenver.edu/hunter/cpbs7711/2010_09_09_Rabiner1989_HMMTutorial_Erratum_Leach.pdf
Forward Algorithm
• Dynamic programming method to compute the forward variable: αt(i) = Pr(O1..t, Xt = i | λ)
• Base condition: for 1 ≤ i ≤ N,
  α1(i) = πi bi(O1)
• Recurrence: for 1 ≤ j ≤ N and 1 ≤ t ≤ T−1,
  αt+1(j) = [ Σ_{i=1..N} αt(i) aij ] bj(Ot+1)
• Then the probability of the sequence is Pr(O | λ) = Σ_{i=1..N} αT(i)
* The Backward algorithm for βt(i) is analogous
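A direct R translation of this recurrence, run on the two-state CpG example from David Pollock's slide above; note that the slide's trellis starts from α1(i) = bi(O1) alone, so π is set to (1, 1) here purely to reproduce those numbers:

# Forward algorithm: alpha[t, i] = Pr(O_1..t, X_t = i | lambda), computed by DP
forward <- function(pi, A, B, obs) {
  Tt    <- length(obs); N <- nrow(A)
  alpha <- matrix(0, Tt, N)
  alpha[1, ] <- pi * B[, obs[1]]                             # base condition
  for (t in 1:(Tt - 1))
    alpha[t + 1, ] <- (alpha[t, ] %*% A) * B[, obs[t + 1]]   # recurrence
  list(alpha = alpha, prob = sum(alpha[Tt, ]))               # Pr(O | lambda)
}

# CpG-island example (states: CpG, Non-CpG)
A <- matrix(c(0.8, 0.2,
              0.1, 0.9), 2, 2, byrow = TRUE,
            dimnames = list(c("CpG", "Non"), c("CpG", "Non")))
B <- matrix(c(0.3, 0.3, 0.2, 0.2,
              0.1, 0.1, 0.4, 0.4), 2, 4, byrow = TRUE,
            dimnames = list(c("CpG", "Non"), c("G", "C", "A", "T")))
forward(pi = c(1, 1), A = A, B = B, obs = c("G", "C", "G", "A", "A"))$alpha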
[Backup copy of the Euclidean vs Pearson vs Spearman comparison table]
     A     B     C     D     E     F
A    1     1    -1    0.8    0     0     Pearson
     1     1    -1    1      0     0     Spearman
     0     8     9    6     17    19     EucDist
B    8     0     1    6     22    23     EucDist
C   -1    -1     1   -0.7    0     0     Pearson
E    0     0     0    0.3    1    0.85   Pearson
     0     0     0    0      1    0.91   Spearman