Machine Learning and Statistical Analysis
Jong Youl Choi, Computer Science Department ([email protected])
Social Bookmarking
Socialized
Tags
Bookmarks
Principles of Machine Learning: Bayes’ theorem and maximum likelihood
Machine Learning Algorithms: clustering analysis, dimension reduction, classification
Parallel Computing: general parallel computing architecture, parallel algorithms
Definition: Algorithms or techniques that enable a computer (machine) to “learn” from data. Related to many areas such as data mining, statistics, and information theory.
Algorithm types: unsupervised learning, supervised learning, reinforcement learning
Topics
Models
▪ Artificial Neural Network (ANN)
▪ Support Vector Machine (SVM)
Optimization
▪ Expectation-Maximization (EM)
▪ Deterministic Annealing (DA)
Posterior probability of θi, given X (Bayes’ theorem): P(θi|X) = P(X|θi) P(θi) / P(X)
θi ∈ Θ : Parameter
X : Observations
P(θi) : Prior (or marginal) probability
P(X|θi) : likelihood
Maximum Likelihood (ML)
Used to find the most plausible θi ∈ Θ, given X
Computing the maximum likelihood (ML) or log-likelihood is an optimization problem
Problem: Estimate hidden parameters (θ = {µ, σ}) from the given data drawn from k Gaussian distributions
Gaussian distribution
Maximum Likelihood
With Gaussian (P = N),
Solve by either brute force or a numerical method
(Mitchell, 1997)
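For the simplest case of a single 1-D Gaussian (k = 1), the ML estimate has a closed form, so no search is needed. The sketch below is an illustration under that assumption, not the slide's own derivation; the function names `gaussian_mle` and `log_likelihood` are made up for this example.

```python
import math

def gaussian_mle(xs):
    """Closed-form ML estimates (mu, sigma) for one 1-D Gaussian."""
    n = len(xs)
    mu = sum(xs) / n
    # The ML variance divides by n (not n - 1).
    var = sum((x - mu) ** 2 for x in xs) / n
    return mu, math.sqrt(var)

def log_likelihood(xs, mu, sigma):
    """Sum of ln N(x | mu, sigma) over the observations."""
    return sum(
        -0.5 * math.log(2 * math.pi * sigma ** 2)
        - (x - mu) ** 2 / (2 * sigma ** 2)
        for x in xs
    )

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mu_hat, sigma_hat = gaussian_mle(data)
print(mu_hat, sigma_hat, log_likelihood(data, mu_hat, sigma_hat))
```

Any other (µ, σ) pair gives a lower log-likelihood on the same data, which is what "most plausible" means here.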
Problems in ML estimation
Observation X is often not complete
Latent (hidden) variable Z exists
Hard to explore whole parameter space
Expectation-Maximization algorithm
Objective: To find the ML estimate, averaging over the latent distribution P(Z | X, θ)
Steps
0. Init – Choose a random θold
1. E-step – Compute the expectation under P(Z | X, θold)
2. M-step – Find θnew that maximizes the likelihood
3. Go to step 1 after updating θold ← θnew
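The steps above can be sketched for a 1-D mixture of Gaussians, where the latent variable Z is the component that generated each point. This is a minimal sketch under assumptions: the name `em_gmm_1d`, the fixed iteration count, the caller-supplied starting parameters (instead of a random θold), and the small floor on σ are all choices made for this example.

```python
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def em_gmm_1d(xs, mus, sigmas, weights, n_iter=50):
    """EM for a 1-D Gaussian mixture; returns updated (mus, sigmas, weights)."""
    mus, sigmas, weights = list(mus), list(sigmas), list(weights)
    k = len(mus)
    for _ in range(n_iter):
        # E-step: responsibilities P(z = j | x, theta_old) for every point.
        resp = []
        for x in xs:
            p = [weights[j] * normal_pdf(x, mus[j], sigmas[j]) for j in range(k)]
            total = sum(p)
            resp.append([pj / total for pj in p])
        # M-step: re-estimate each component from its weighted points.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            mus[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - mus[j]) ** 2 for r, x in zip(resp, xs)) / nj
            sigmas[j] = max(math.sqrt(var), 1e-3)  # floor keeps sigma from collapsing to 0
            weights[j] = nj / len(xs)
    return mus, sigmas, weights

xs = [-0.5, 0.0, 0.2, 0.4, 9.6, 10.0, 10.1, 10.5]
print(em_gmm_1d(xs, mus=[1.0, 8.0], sigmas=[2.0, 2.0], weights=[0.5, 0.5]))
```

Like any EM run, the result depends on the starting parameters, which is the local-optimum issue the later slides address.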
Definition: Grouping unlabeled data into clusters in order to infer hidden structure or information
Dissimilarity measurement
Distance: Euclidean (L2), Manhattan (L1), …
Angle: Inner product, …
Non-metric: Rank, intensity, …
Types of clustering
Hierarchical
▪ Agglomerative or divisive
Partitioning
▪ K-means, VQ, MDS, …
(Matlab help page)
Find K partitions that minimize the total intra-cluster variance
Iterative method
Initialization: Randomized yi
Assignment of x (yi fixed)
Update of yi (x fixed)
Problem? Can get trapped in local minima
(MacKay, 2003)
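The assignment/update iteration above is the usual K-means loop, sketched below. As an assumption for reproducibility, the initial centers are passed in by the caller rather than randomized; the names `nearest` and `kmeans` are illustrative.

```python
def nearest(p, centers):
    """Index of the center closest to point p (squared Euclidean distance)."""
    d2 = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
    return d2.index(min(d2))

def kmeans(points, centers, n_iter=100):
    centers = [tuple(c) for c in centers]
    for _ in range(n_iter):
        # Assignment step: centers y_i are fixed.
        labels = [nearest(p, centers) for p in points]
        # Update step: assignments are fixed; each center moves to its cluster mean.
        new_centers = []
        for j, c in enumerate(centers):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                new_centers.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centers.append(c)  # keep a center that attracted no points
        if new_centers == centers:
            break  # converged: nothing moved
        centers = new_centers
    return centers, [nearest(p, centers) for p in points]

pts = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
print(kmeans(pts, [(0.0, 0.0), (10.0, 10.0)]))
```

A poor choice of starting centers can leave the loop in a local minimum, since each step only ever lowers the intra-cluster variance.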
Deterministically avoid local minima
No stochastic process (random walk)
Tracing the global solution by changing the level of randomness
Statistical mechanics
Gibbs distribution
Helmholtz free energy F = D – TS
▪ Average energy D = <∑ Ex>
▪ Entropy S = –∑ P(Ex) ln P(Ex)
▪ F = – T ln Z
In DA, we minimize F
(Maxima and Minima, Wikipedia)
Analogy to the physical annealing process
Control energy (randomness) by temperature (high → low)
Starting with high temperature (T = ∞)
▪ Soft (or fuzzy) association probability
▪ Smooth cost function with one global minimum
Lowering the temperature (T → 0)
▪ Hard association
▪ Revealing full complexity; clusters emerge
Minimization of F, using E(x, yj) = ||x – yj||², proceeds iteratively
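The slide's update equations did not survive extraction. The sketch below assumes the commonly used form, in which the association probability follows a Gibbs distribution, p(yj | x) ∝ exp(–||x – yj||² / T), and each center becomes the probability-weighted mean of the data. The names `soft_assign` and `update_centers` are illustrative, and a full DA run would repeat the update while lowering T.

```python
import math

def soft_assign(x, centers, T):
    """Association probabilities p(y_j | x) at temperature T (assumed Gibbs form)."""
    d2 = [sum((a - b) ** 2 for a, b in zip(x, c)) for c in centers]
    m = min(d2)  # subtracting the minimum avoids underflow in every term
    w = [math.exp(-(d - m) / T) for d in d2]
    s = sum(w)
    return [wi / s for wi in w]

def update_centers(points, centers, T):
    """One center update at fixed temperature T."""
    probs = [soft_assign(x, centers, T) for x in points]
    dim = len(points[0])
    new_centers = []
    for j, c in enumerate(centers):
        mass = sum(p[j] for p in probs)
        if mass == 0.0:
            new_centers.append(list(c))  # no point associates with this center
            continue
        new_centers.append(
            [sum(p[j] * x[i] for p, x in zip(probs, points)) / mass for i in range(dim)]
        )
    return new_centers

points = [(0.0,), (1.0,), (10.0,), (11.0,)]
centers = [(0.0,), (10.0,)]
print(update_centers(points, centers, T=1e9))   # hot: both centers near the global mean
print(update_centers(points, centers, T=0.01))  # cold: hard, K-means-like assignment
```

At high temperature every point belongs almost equally to every center, which is the smooth single-minimum regime; at low temperature the update reduces to a hard assignment.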
Definition: Process to transform high-dimensional data into low-dimensional data for improving accuracy, aiding understanding, or removing noise.
Curse of dimensionality: Complexity grows exponentially in volume as extra dimensions are added
Types
Feature selection: Choose representatives (e.g., filter, …)
Feature extraction: Map to lower dimensions (e.g., PCA, MDS, …)
(Koppen, 2000)
Finding a map of the principal components (PCs) of the data into an orthogonal space, such that
y = W x where W ∈ R^(d×h) (h ≪ d)
PCs – Variables with the largest variances
Orthogonality
Linearity – Optimal least mean-square error
Limitations?
Strict linearity
Specific distribution
Large variance assumption
[Figure: data plotted on axes x1 and x2, with principal component directions PC 1 and PC 2]
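As a rough illustration of "the direction with the largest variance", the first principal component can be found as the leading eigenvector of the covariance matrix. The sketch below uses power iteration to stay within the standard library; that choice, the all-ones start vector, and the function names are assumptions of this example, and a numerical library's eigen-solver would normally be used instead.

```python
import math

def first_principal_component(data, n_iter=200):
    """Return (mean, w) where w is the unit direction of largest variance."""
    n, d = len(data), len(data[0])
    mean = [sum(row[i] for row in data) / n for i in range(d)]
    centered = [[row[i] - mean[i] for i in range(d)] for row in data]
    cov = [[sum(r[i] * r[j] for r in centered) / n for j in range(d)] for i in range(d)]
    w = [1.0] * d  # assumed start vector; must not be orthogonal to the first PC
    for _ in range(n_iter):
        w = [sum(cov[i][j] * w[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(v * v for v in w))
        w = [v / norm for v in w]
    return mean, w

def project(x, mean, w):
    """1-D coordinate of x along the first principal component."""
    return sum((xi - mi) * wi for xi, mi, wi in zip(x, mean, w))

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
mean, w = first_principal_component(data)
print(w, [project(x, mean, w) for x in data])
```

Because these sample points lie exactly on a line, one component captures all of the variance; real data would leave a residual that the remaining components describe.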
Like PCA, reduction of dimension by y = R x, where R is a random matrix with i.i.d. columns and R ∈ R^(d×p) (p ≪ d)
Johnson-Lindenstrauss lemma: When projecting to a randomly selected subspace, the distances are approximately preserved
Generating R
Hard to obtain orthogonalized R
Gaussian R
Simple approach: choose rij ∈ {+√3, 0, –√3} with probability 1/6, 4/6, 1/6 respectively
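The simple approach can be tried directly. In the sketch below, R is stored as d rows of p entries and the projection is computed as Rᵀx scaled by 1/√p; the transpose and the scaling are assumptions added here so that the dimensions match and lengths are preserved on average. Function names and the seed are illustrative.

```python
import math
import random

def sparse_random_matrix(d, p, seed=0):
    """d x p matrix with entries +sqrt(3), 0, -sqrt(3) drawn with prob. 1/6, 4/6, 1/6."""
    rng = random.Random(seed)
    s = math.sqrt(3)
    return [rng.choices([s, 0.0, -s], weights=[1, 4, 1], k=p) for _ in range(d)]

def project(x, R):
    """Map a d-dimensional x to p dimensions: y = R^T x / sqrt(p) (assumed scaling)."""
    d, p = len(R), len(R[0])
    return [sum(R[i][j] * x[i] for i in range(d)) / math.sqrt(p) for j in range(p)]

def dist(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

rng = random.Random(1)
x = [rng.gauss(0, 1) for _ in range(300)]
z = [rng.gauss(0, 1) for _ in range(300)]
R = sparse_random_matrix(300, 200)
print(dist(x, z), dist(project(x, R), project(z, R)))
```

The two printed distances should be close but not equal, which is the "approximately preserved" part of the lemma.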
Dimension reduction preserving the distance proximities observed in the original data set
Loss functions: inner product, distance, squared distance
Classical MDS: minimizing STRAIN, given ∆
From ∆, find the inner product matrix B (double centering)
From B, recover the coordinates X′ (i.e., B = X′X′ᵀ)
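The two classical MDS steps can be sketched for the smallest possible case, a one-dimensional embedding. The example assumes ∆ is given as a matrix of squared distances, and it recovers coordinates from the leading eigenpair of B by power iteration; both the helper names and the restriction to one dimension are choices made for this illustration.

```python
import math

def double_center(D2):
    """B = -1/2 * J * D2 * J for a symmetric matrix of squared distances."""
    n = len(D2)
    row = [sum(r) / n for r in D2]
    total = sum(row) / n
    return [[-0.5 * (D2[i][j] - row[i] - row[j] + total) for j in range(n)]
            for i in range(n)]

def top_eigenpair(B, n_iter=500):
    """Leading eigenvalue and unit eigenvector of B by power iteration."""
    n = len(B)
    v = [float(i + 1) for i in range(n)]  # arbitrary start vector (assumption)
    lam = 0.0
    for _ in range(n_iter):
        w = [sum(B[i][j] * v[j] for j in range(n)) for i in range(n)]
        lam = math.sqrt(sum(x * x for x in w))
        v = [x / lam for x in w]
    return lam, v

def classical_mds_1d(D2):
    """1-D coordinates X' with B approximately equal to X' X'^T."""
    lam, v = top_eigenpair(double_center(D2))
    return [math.sqrt(lam) * vi for vi in v]

pts = [0.0, 3.0, 4.0, 9.0]
D2 = [[(a - b) ** 2 for b in pts] for a in pts]
print(classical_mds_1d(D2))
```

The recovered coordinates are centered at zero and determined only up to sign, but their pairwise distances match the input.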
SMACOF: minimizing STRESS
Majorization – for complex f(x), find an auxiliary simple g(x, y) such that f(x) ≤ g(x, y) for all x and f(y) = g(y, y)
Majorization for STRESS
Minimize tr(Xᵀ B(Y) Y), known as Guttman transform
(Cox, 2001)
Competitive and unsupervised learning process for clustering and visualization
Result : similar data getting closer in the model space
Input Model
Learning
Choose the model vector mj most similar to xi
Update the winner and its neighbors by mk = mk + α(t) β(t) (xi – mk)
α(t): learning rate
β(t): neighborhood size
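One learning step of the update rule above can be sketched as follows. The slide only names β(t) as the neighborhood size; the Gaussian falloff over a 1-D grid of model indices used here, and the names `som_step`, `alpha`, and `radius`, are assumptions of this example.

```python
import math

def som_step(models, x, alpha, radius):
    """Apply m_k = m_k + alpha * beta * (x - m_k) to every model vector."""
    # Winner: the model vector closest to the input x.
    d2 = [sum((a - b) ** 2 for a, b in zip(m, x)) for m in models]
    winner = d2.index(min(d2))
    updated = []
    for k, m in enumerate(models):
        # Assumed neighborhood weight: 1 at the winner, decaying with grid distance.
        beta = math.exp(-((k - winner) ** 2) / (2 * radius ** 2))
        updated.append([mk + alpha * beta * (xi - mk) for mk, xi in zip(m, x)])
    return winner, updated

models = [[0.0], [1.0], [2.0]]
winner, models_next = som_step(models, [2.5], alpha=0.5, radius=1.0)
print(winner, models_next)
```

In a full run, both `alpha` and `radius` would shrink over time so that the map first orders itself globally and then fine-tunes locally.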
Definition: A procedure that divides data into a given set of categories, based on a training set, in a supervised way
Generalization vs. specification
Hard to achieve both
Avoid overfitting (overtraining)
▪ Early stopping
▪ Holdout validation
▪ K-fold cross-validation
▪ Leave-one-out cross-validation
[Figure: training error and validation error curves, with underfitting and overfitting regions] (Overfitting, Wikipedia)
Perceptron: A computational unit with a binary threshold
Abilities
Linearly separable decision surface
Represent Boolean functions (AND, OR, NOT)
Network (multilayer) of perceptrons
Various network architectures and capabilities
Weighted Sum Activation Function
(Jain, 1996)
Learning weights – random initialization and updating
Error-correction training rules
Difference between training data and output: E(t, o)
Gradient descent (batch learning)
▪ Update with the gradient of the total error E = ∑ Ei
Stochastic approach (on-line learning)
▪ Update with the gradient for each result
Various error functions
Adding a weight regularization term (∑ wi²) to avoid overfitting
Adding momentum (∆wi(n–1)) to expedite convergence
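An on-line error-correction rule for a single threshold unit can be sketched on the Boolean AND function. This is the classic perceptron rule w ← w + η (t – o) x, used here as a stand-in for the training rules above; weights start at zero rather than at random values so the run is reproducible, and the names and learning rate are assumptions of this example.

```python
def predict(w, b, x):
    """Binary threshold unit: 1 if the weighted sum exceeds 0, else 0."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

def train_perceptron(samples, eta=1.0, epochs=100):
    w, b = [0.0, 0.0], 0.0  # zeros instead of random init, for reproducibility
    for _ in range(epochs):
        errors = 0
        for x, t in samples:  # on-line learning: update after each sample
            o = predict(w, b, x)
            if o != t:
                errors += 1
                w = [wi + eta * (t - o) * xi for wi, xi in zip(w, x)]
                b += eta * (t - o)
        if errors == 0:
            break  # every training sample is classified correctly
    return w, b

AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b = train_perceptron(AND)
print(w, b, [predict(w, b, x) for x, _ in AND])
```

The loop terminates here because AND is linearly separable; a function such as XOR would need a multilayer network.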
Q: How to draw the optimal linear separating hyperplane?
A: Maximize the margin
Margin maximization
The distance between H+1 and H-1: 2 / ||w||
Thus, ||w|| should be minimized
Margin
Constrained optimization problem
Given training set {xi, yi} (yi ∈ {+1, -1}):
Minimize:
Lagrangian equation with saddle points
Minimized w.r.t. the primal variables w and b
Maximized w.r.t. the dual variables αi (all αi ≥ 0)
xi with αi > 0 (not αi = 0) is called a support vector (SV)
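The equations for this slide were lost in extraction. For reference, the standard textbook hard-margin formulation that matches the description above is given below; it is assumed, not confirmed, that the slide used this same form and notation.

```latex
% Hard-margin primal problem
\min_{w,\,b} \; \frac{1}{2}\lVert w \rVert^2
\quad \text{subject to} \quad y_i \,(w \cdot x_i + b) \ge 1 \;\; \forall i

% Lagrangian with multipliers \alpha_i \ge 0
L(w, b, \alpha) = \frac{1}{2}\lVert w \rVert^2
  - \sum_i \alpha_i \left[ y_i \,(w \cdot x_i + b) - 1 \right]

% Stationarity w.r.t. the primal variables w and b
\frac{\partial L}{\partial w} = 0 \;\Rightarrow\; w = \sum_i \alpha_i y_i x_i,
\qquad
\frac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_i \alpha_i y_i = 0

% Dual problem, maximized w.r.t. \alpha
\max_{\alpha} \; \sum_i \alpha_i
  - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j \,(x_i \cdot x_j)
\quad \text{subject to} \quad \alpha_i \ge 0, \;\; \sum_i \alpha_i y_i = 0
```

Since w is a weighted sum over the training points, only those with αi > 0 contribute, which is why they are the support vectors.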
Soft margin (non-separable case)
Slack variables ξi < C
Optimization with additional constraint
Non-linear SVM
Map non-linear input to feature space
Kernel function k(x, y) = ⟨Φ(x), Φ(y)⟩
Kernel classifier with support vectors si
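A small check makes the kernel identity concrete. The example assumes a degree-2 polynomial kernel on 2-D inputs, for which the feature map Φ can be written out explicitly; the classifier form sign(∑ ci k(si, x) + b), with ci standing in for αi yi, is the usual one and is assumed here because the slide's equation was lost. All names and the sample support vector are hypothetical.

```python
import math

def poly_kernel(x, y):
    """k(x, y) = (x . y)^2 for 2-D inputs."""
    return (x[0] * y[0] + x[1] * y[1]) ** 2

def phi(x):
    """Explicit feature map with <phi(x), phi(y)> = (x . y)^2."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)

def inner(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def kernel_classify(x, support_vectors, coeffs, b, kernel):
    """Sign of sum_i c_i * k(s_i, x) + b; c_i plays the role of alpha_i * y_i."""
    score = sum(c * kernel(s, x) for s, c in zip(support_vectors, coeffs)) + b
    return 1 if score >= 0 else -1

x, y = (1.0, 2.0), (3.0, -1.0)
print(poly_kernel(x, y), inner(phi(x), phi(y)))
print(kernel_classify((2.0, 0.0), [(1.0, 0.0)], [1.0], -0.5, poly_kernel))
```

The classifier never evaluates Φ itself, so the feature space can be very high-dimensional without extra cost.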
Input Space Feature Space
Memory Architecture
Decomposition strategy
Task – e.g., Word, IE, …
Data – scientific problems
Pipelining – task + data
Shared memory: Symmetric Multiprocessor (SMP); OpenMP, POSIX, pthread, MPI; easy to manage but expensive
Distributed memory: commodity, off-the-shelf processors; MPI; cost-effective but hard to maintain
(Barney, 2007)
Shrinking
Recall: Only support vectors (αi > 0) are used in SVM optimization
Predict whether each data point is an SV or a non-SV
Remove non-SVs from the problem space
Parallel SVM
Partition the problem
Merge data hierarchically
Each unit finds support vectors
Loop until convergence
(Graf, 2005)