Machine Learning Comparative Analysis - Part 1



Page 1: Machine Learning Comparative Analysis - Part 1

Masters in Computer Science — Machine Learning Concepts

Goal : very briefly touch upon some of the important terminologies and fundamental concepts for selecting a machine learning algorithm.

** concept : a function or mapping from objects to membership, i.e. a mapping between objects in the world and membership in a set.
** instance : a vector of attribute-value pairs; the input space of a concept (e.g. the pixels of a picture, the credit score of an individual).
** target concept : the actual answer that is being searched for in the space of multiple candidate concepts.
** hypothesis : a candidate concept that helps predict the target concept (the actual answer).
*** apply candidate concepts to a testing set (which should include lots of examples)
*** apply inductive learning to choose a hypothesis from the given set of examples

We need to ask some relevant questions to choose a hypothesis!

What's the Inductive Bias of the classification function?
>> Inductive bias is what lets us find a general rule from examples.
>> Generalization is the whole point of machine learning.

What's Occam's Razor?
>> Prefer the simplest hypothesis that fits the data.

What's the Restriction Bias?
>> Consider only those hypotheses that can be represented by the chosen algorithm.

Supervised classification => Function Approximation : predicting the outcome when we know the different classifications.
example: predicting the type of flower (setosa, versicolor, or virginica) based on sepal width/length

Unsupervised classification => Category Clustering : predicting the outcome when we don't know what the different classifications are.
example: splitting all the sepal width/length data into different groups (clustering similar data together)

Reinforcement Learning => Learning from delayed reward.

Eager & Lazy Learners :

Eager Learners : decision trees, regression, neural networks, SVMs, Bayes nets.
These find a function that best fits the training data, i.e. they spend time up front to learn from the data; when new inputs are received, the input features are simply fed into the learned function. Here we consider inputs on a global scale and avoid local sensitivities.

Lazy Learners : lazy learners do not compute a function to fit the training data before new data is received, so significant time is saved up front. New instances are compared to the training data to make a classification / regression decision. This considers local-scale estimation. (A small illustrative sketch contrasting the two follows.)
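To make the eager vs. lazy distinction concrete, here is a minimal illustrative sketch (not part of the original notes) on the iris example mentioned above; scikit-learn, the chosen classifiers, and the train/test split are assumptions made purely for illustration.

```python
# Hedged sketch: contrast an eager learner (decision tree) with a lazy learner
# (k-NN) on the iris data set. Assumes scikit-learn is installed.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Eager learner: spends time up front fitting a global function (the tree).
tree = DecisionTreeClassifier().fit(X_train, y_train)

# Lazy learner: fit() mostly just stores the data; work happens at query time.
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

print("tree accuracy:", tree.score(X_test, y_test))
print("knn  accuracy:", knn.score(X_test, y_test))
```

Note how fit() is where the eager learner does its work, while the lazy learner defers almost everything to prediction time.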


Page 2: Machine Learning Comparative Analysis - Part 1

Each algorithm below is compared along the same dimensions: ML Algo | Preference Bias | Learning Function | Performance | Enhancements | Usage.

Algo Bayesian (Eager Learner) - Classification

Preference Bias : prior domain knowledge
~ Pr(h) : prior probability for each candidate hypothesis h
~ Pr(D) : probability distribution over the observed data, for each h
Occam's Razor? - select the h with minimum description length
** there is at least one maximally probable hypothesis: argmax P(h|D) -> argmax P(D|h) (for a uniform prior)

Learning Function :
Posterior probability: P(h|D) = P(D|h) . P(h) / P(D)
Key assumption: every hypothesis h_i is equally probable a priori => P(h_i) = 1 / |H|

* Noise-free data, uniformly distributed hypotheses in the version space VS(H,D):
  P(h) = 1 / |H| , P(D|h) = { 1 if d_i = h(x_i) for all i , 0 otherwise } => P(h|D) = 1 / |VS(H,D)|

* Noisy data (d_i = f(x_i) + noise):
  h_ML = argmax_h P(D|h) = argmax_h prod_i P(d_i|h) = argmax_h sum_i ln P(d_i|h),
  which for Gaussian noise amounts to minimizing sum_i (d_i - h(x_i))^2

* Bayes optimal classification: v_MAP = argmax_v sum_h P(v|h) . P(h|D)

Cons :
* significant computational cost to find the Bayes optimal hypothesis
* sometimes a huge number of hypotheses needs to be surveyed.

* NB handles missing data very well: it just excludes the attribute with missing data when computing posterior probability (i.e. probability of class given data point)

Pros : no need to be aware of the given hypothesis space in advance — for a smaller training set, NB is a good bet.

* Use Bayesian learning to represent conditional independence of variables.

* Assumes real-valued attributes are normally distributed. As a result, NB can only have linear, elliptic, or parabolic decision boundaries.

* Example: misclassification, pruning, fitting errors.

* Naive Bayes spam example: the class node "spam" has child attribute nodes Lottery, Bank, College.
  P(spam | lottery, not bank, not college) is proportional to P(spam) . P(lottery|spam) . P(not bank|spam) . P(not college|spam)
  (in general, v_NB = argmax_v P(v) . prod_i P(a_i | v))
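To make the spam example concrete, here is a small sketch (not from the notes) using scikit-learn's BernoulliNB over three hypothetical binary features lottery / bank / college, with made-up toy labels:

```python
# Hedged sketch: naive Bayes over three hypothetical binary features
# (lottery, bank, college), using scikit-learn's BernoulliNB.
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Toy training data: columns = [lottery, bank, college], label 1 = spam.
X = np.array([[1, 0, 0],
              [1, 1, 0],
              [0, 1, 1],
              [0, 0, 1],
              [1, 0, 1],
              [0, 1, 0]])
y = np.array([1, 1, 0, 0, 1, 0])

nb = BernoulliNB().fit(X, y)

# P(spam | lottery, not bank, not college)
print(nb.predict_proba([[1, 0, 0]]))   # [[P(not spam), P(spam)]]
```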

Page 3: Machine Learning Comparative Analysis - Part 1

Algo Decision Tree : (Eager Learner)
ID3, C4.5 : approximate discrete-valued functions; the learned tree is a disjunction of conjunctions of constraints on attribute values.
Description - Classification : for discrete input data; for continuous input data (consider a range condition as the split test, e.g. "> 20%").

Preference Bias - Occam's Razor : prefer the shorter tree.
Other biases :
: information gain tends to prefer attributes with many possible values
: prefer trees that place high-information-gain attributes close to the root (the attribute with the best answers, NOT the best splits)

Learning Function :
Info Gain(S, A) = Entropy(S) - sum_v (|S_v| / |S|) * Entropy(S_v)   ** the weighted sum of the entropies of the partitions
Entropy(S) = - sum_i p_i * log2(p_i)
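A minimal sketch of these two formulas; the helper names entropy and information_gain, the base-2 logarithm, and the toy data are assumptions for illustration:

```python
# Hedged sketch: entropy and information gain as defined above.
from collections import Counter
from math import log2

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(examples, labels, attribute_index):
    """Gain(S, A) = Entropy(S) - sum_v (|S_v|/|S|) * Entropy(S_v)."""
    total = len(labels)
    partitions = {}
    for x, y in zip(examples, labels):
        partitions.setdefault(x[attribute_index], []).append(y)
    weighted = sum((len(part) / total) * entropy(part) for part in partitions.values())
    return entropy(labels) - weighted

# Tiny example: attribute 0 perfectly predicts the label, attribute 1 does not.
X = [[1, 0], [1, 1], [0, 0], [0, 1]]
y = ['yes', 'yes', 'no', 'no']
print(information_gain(X, y, 0))  # 1.0
print(information_gain(X, y, 1))  # 0.0
```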

Performance - Usual problem for decision trees : for N boolean attributes there are 2^N possible combinations of rows, and 2^(2^N) possible output functions.

** so instead of iterating over all rows, first work on only the attributes with the highest information gain
** handles noise, handles missing values

=============
Scope of improvement :
Decision trees, however, often achieve lower generalization accuracy compared to other learning methods, such as support vector machines and neural networks. One common way to improve their accuracy is boosting.

Enhancement
pros : computes the best attribute in one move
cons :
* does not look ahead or behind (this problem is addressed by hill-climbing approaches)
* tends to overfit, as it looks into many different combinations of features
* logistic regression avoids overfitting more elegantly

** Overfitting solutions for decision trees :
>> stop growing the tree before it grows too large
>> prune after a certain threshold
* consider interdependency between attributes, P(Y=y | X=x)
* consider GainRatio and SplitInfo

Usage - restaurant selection decision based on cost, menu, appetite, weather, and other features.

Decision Tree : Regression : for continuous output data.

Lazy distance-based learning function : for each training sample s_l, compute SD_l = dist(s, s_l) = sqrt( sum of squared feature differences ); weight each sample by W_j = d_max - d_j.

Advantages of decision trees include:
● computational scalability
● handling of messy data: missing values, various feature types
● ability to deal with irrelevant features: the algorithm selects "relevant" features first, and generally ignores irrelevant features
● if the decision tree is short, it is easy for a human to interpret: decision trees do not produce a black-box model

Page 4: Machine Learning Comparative Analysis - Part 1

Algo Linear Regression : (Eager Learner)
Model a linear relationship between a dependent variable (y) and independent variables (x1, x2, ...).
Regression, as a term, stems from the observation that individual instances of any observed attribute tend to regress towards the mean.
Description - Classification : scalar input, continuous output; vector input, continuous output.
** Vector input -> combinations of multiple features into a single feature.

Preference Bias : regress to the mean.
Gradient :
* for one variable, the derivative is the slope of the tangent line
* for several variables, the gradient is the direction of the fastest increase of the function

Learning Function :
ŷ = θ·x ; minimize the Sum of Squared Errors : J(θ) = ½ Σ (ŷ − y)²
Gradient descent update: θ1 = θ0 − α·∇J(θ0)   (θ1 -> next position, θ0 -> current position)
α is the learning rate, so the function takes a small step in the direction opposite to ∇J (the direction of fastest increase).
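A minimal sketch of this update rule for simple linear regression; here theta0/theta1 denote intercept and slope, and the per-example averaging of the gradient is a common variant chosen for stability, not necessarily the notes' exact form:

```python
# Hedged sketch: batch gradient descent for simple linear regression,
# minimizing J(theta) = 1/2 * sum((theta0 + theta1*x - y)^2).
import numpy as np

def gradient_descent(x, y, alpha=0.05, iterations=10000):
    theta0, theta1 = 0.0, 0.0
    n = len(x)
    for _ in range(iterations):
        y_hat = theta0 + theta1 * x
        error = y_hat - y
        # Step opposite to the gradient of J (the direction of fastest increase).
        theta0 -= alpha * error.sum() / n
        theta1 -= alpha * (error * x).sum() / n
    return theta0, theta1

x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0                   # true intercept 1, slope 2
print(gradient_descent(x, y))       # approximately (1.0, 2.0)
```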

Performance
Cons : the cost function should be differentiable.
Caution : the learning rate must be neither very small nor very large.

Enhancement : Polynomial Regression.

Usage : housing price prediction.

Page 5: Machine Learning Comparative Analysis - Part 1

Algo Multi-Layer Perceptron (Eager Learner) - Description : Classification

Preference Bias : initial weights should be chosen to be small, random values:
— random, to avoid getting stuck in the same local minima and to provide variability
— small, for low complexity (larger weights equate to larger complexity)

Learning Function : the perceptron is a linear function that defines a hyperplane in n dimensions, perpendicular to the weight vector (w1, ..., wn). The perceptron classifies things on one side of the hyperplane as positive and things on the other side as negative.

Perceptron rule : guarantees convergence in a finite number of steps, but only if the data is linearly separable.
Δwi = η (y − ŷ) xi

Gradient descent rule : calculus-based; more robust to data sets that are not linearly separable, but converges only to a local minimum / optimum.
Δwi = η (y − a) xi   (a is the un-thresholded activation)
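A minimal sketch of the perceptron rule Δwi = η(y − ŷ)xi on a toy linearly separable problem (logical AND); the 0/1 labels, explicit bias term, and helper name are assumptions:

```python
# Hedged sketch: the perceptron training rule on logical AND (labels 0/1).
import numpy as np

def train_perceptron(X, y, eta=0.1, epochs=20):
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            y_hat = 1.0 if np.dot(w, x_i) + b > 0 else 0.0
            w += eta * (y_i - y_hat) * x_i   # delta_w = eta * (y - y_hat) * x
            b += eta * (y_i - y_hat)
    return w, b

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 0, 0, 1], dtype=float)
w, b = train_perceptron(X, y)
print(w, b)
print([1 if np.dot(w, x) + b > 0 else 0 for x in X])  # [0, 0, 0, 1]
```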

Performance
Neural networks have low restriction bias, because they can model many different functions; therefore they carry the danger of overfitting.

Neural networks consist of:
- Perceptrons: half-spaces
- Sigmoids (instead of step functions): much more complex behavior
- Hidden layers (groups of sigmoid units)

So they allow modeling many types of functions / behaviors, such as:
- Boolean: a network of threshold-like units
- Continuous: through a hidden layer (e.g. using sigmoids instead of step functions)
- Arbitrary (non-continuous): multiple hidden layers

Enhancement
Adding hidden layers helps map continuous functions (a change in the input changes the output very smoothly).
Adjust (multiply) the weights only if we are not getting better errors.

Usage
One obvious advantage of artificial neural networks is the ability to produce any number of outputs (multi-class), while a support vector machine produces only one. The most direct way to create an n-ary classifier with support vector machines is to create n support vector machines and train each of them one by one; an n-ary classifier based on neural networks, on the other hand, can be trained in one go.
===========
A multi-layer perceptron is able to find relations between features. For example, this is necessary in computer vision, when a raw image is provided to the learning algorithm and sophisticated features must be calculated from it. Essentially, the intermediate levels can calculate new, previously unknown features.

Page 6: Machine Learning Comparative Analysis - Part 1

Algo K Nearest Neighbors - Classification (Lazy Learner) : remembers the mapping, fast lookup.

Preference Bias : Why consider KNN over others?
* near points are similar to one another (locality)
* smoothly changing behavior from one neighborhood to another neighborhood
* we are free to choose the best distance function for the problem

Learning Function : choose the best distance function.

Manhattan (ℓ1): d = |x2 − x1| + |y2 − y1|
Euclidean (ℓ2): d = sqrt( (x2 − x1)^2 + (y2 − y1)^2 )
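A minimal sketch of k-NN classification with a pluggable distance function, matching the Manhattan and Euclidean definitions above (helper names and toy data are assumptions):

```python
# Hedged sketch: k-nearest-neighbour classification with a pluggable distance.
import math
from collections import Counter

def manhattan(a, b):
    return sum(abs(ai - bi) for ai, bi in zip(a, b))

def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_predict(train_X, train_y, query, k=3, dist=euclidean):
    # Lazy learner: all the work happens here, at query time.
    neighbours = sorted(zip(train_X, train_y), key=lambda p: dist(p[0], query))[:k]
    return Counter(label for _, label in neighbours).most_common(1)[0][0]

train_X = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
train_y = ['a', 'a', 'a', 'b', 'b', 'b']
print(knn_predict(train_X, train_y, (2, 2)))                   # 'a'
print(knn_predict(train_X, train_y, (9, 9), dist=manhattan))   # 'b'
```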

Performance :
Problem : curse of dimensionality : as the number of features grows, the amount of data required for accurate generalization grows exponentially, O(2^d).

Reducing the weights of less relevant features will help curb the effect of dimensionality.

When k is small, models have low bias but high variance, fitting on a strongly local level; a larger k creates smoother models with higher bias but lower variance.

Cons :

* KNN doesn't know which attributes are more important

* Doesn't handle missing data gracefully


Enhancements : generalization - NO, overfitting - YES.

Usage : makes no assumption about the data distribution (a great advantage over NB); it is highly non-parametric.

Page 7: Machine Learning Comparative Analysis - Part 1

Algo K Nearest Neighbors - Regression : LWR (locally weighted regression).

Learning Function : combines traditional regression with instance-based learning's sensitivity to training items that have high similarity to the test point.

Performance :
-- reduce the pull of far-away points through kernels
-- the squared deviations are weighted by a kernel function that decreases with distance, so that for a new test instance, a regression function is found for that specific point which emphasizes fitting nearby points and ignores the pull of far-away points (a small sketch follows).
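A minimal sketch of locally weighted regression for a single query point, assuming a Gaussian kernel with bandwidth tau (the kernel choice and names are illustrative assumptions, not from the notes):

```python
# Hedged sketch: locally weighted linear regression for one query point,
# with a Gaussian kernel whose weight decays with distance from the query.
import numpy as np

def lwr_predict(X, y, x_query, tau=1.0):
    Xb = np.column_stack([np.ones(len(X)), X])      # design matrix with bias column
    xq = np.array([1.0, x_query])
    # Kernel weights: nearby points pull hard, far-away points barely at all.
    w = np.exp(-((X - x_query) ** 2) / (2 * tau ** 2))
    W = np.diag(w)
    # Weighted least squares: theta = (X^T W X)^-1 X^T W y
    theta = np.linalg.solve(Xb.T @ W @ Xb, Xb.T @ W @ y)
    return xq @ theta

X = np.linspace(0, 10, 50)
y = np.sin(X)
print(lwr_predict(X, y, 5.0, tau=0.5))   # close to sin(5.0)
```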

Page 8: Machine Learning Comparative Analysis - Part 1

Preference Bias :
- An individual rule (the result of learning over a subset of the data) does not provide the answer by itself, but when the rules are combined, the complex rule works well.
- Choose those examples / subsets where the combined rule offers better performance on testing subsets of the data than fitting a 4th-order polynomial.

Learning Function : Pr_D, the distribution over training examples; ** boost up the distribution ....

      h1   h2   h3
x1    +1   -1   +1
x2    -1   -1   +1
x3    +1   -1   +1

** at each time-step, find a hypothesis h_t with small error on the current distribution (a Weak Classifier), constantly creating new distributions ... (Boosting)
** Final Hypothesis : the sgn (sign) function of the weighted sum of all of the rules.

Performance :
Why does Boosting do so well?
>> if there are some samples which do not give a good result, boosting can re-weight the samples so that some of the 'past under-performers' become more important
>> use gradient boosting to handle noisy data with decision trees: https://en.wikipedia.org/wiki/Gradient_boosting
>> boosting does overfit if the weak learner is, e.g., a neural network with many layers of nodes

Choosing subsets : instead of selecting subsets randomly, we can pick subsets containing the hardest examples, i.e. those examples that don't perform well given the current rule.
Combine : instead of a mean, consider a weighted mean.

Enhancements (AdaBoost) :
● Computationally efficient.
● No difficult parameters to set.
● Versatile: a wide range of base learners can be used with AdaBoost.
Caveats :
● The algorithm seems susceptible to uniform noise.
● The weak learner should not be too complex, to avoid overfitting.
● There needs to be enough data so that the weak-learning requirement is satisfied: the base learner should perform consistently better than random guessing, with generalization error < 0.5 for binary classification problems.

Usage - spam filtering from simple rules:
body: contains the word "manly" → YES
from: your spouse → NO
body: short length → YES
body: only contains URLs → YES
body: just an image → YES
body: contains words belonging to a blacklist (misspellings) → YES

All of these rules are useful; however, no specific one can determine spam (or not) on its own. We need to find a way to combine them.

Another usage: find which Wiki pages can be recommended for an extended period of time (the feature set is a combination of binary, text, and numeric features).

Ref :
http://statweb.stanford.edu/~tibs/ElemStatLearn/
http://media.nips.cc/Conferences/2007/Tutorials/Slides/schapire-NIPS-07-tutorial.pdf

************ If you have a dense feature set, go with boosting.

Algo Ensemble Learning (Boosting)

* Solves classification problems.
************* Boosting is a meta-learning technique, i.e. something you put on top of a set of learners to form an ensemble.

Page 9: Machine Learning Comparative Analysis - Part 1

Notes on Ensemble Learning (Boosting)

Important difference of ensemble learners from other types of learners :
-- a NN already knows the network structure and tries to learn the weights
-- a DTree gradually builds the rules
But an ensemble learner finds the best combination of rules.

AdaBoost:
1. Initialize the importance weights w_i = 1/N for all training examples i.
2. For m = 1 to M:
   a) Fit a classifier G_m(x) to the training data using the weights w_i.
   b) Compute the error: err_m = Σ_i w_i · I(y_i ≠ G_m(x_i)) / Σ_i w_i
   c) Compute α_m = log((1 − err_m) / err_m)
   d) Update the weights: w_i ← w_i · exp[ α_m · I(y_i ≠ G_m(x_i)) ] for i = 1, 2, ..., N
3. Return G(x) = sign[ Σ_m α_m G_m(x) ].

We can see that for err_m < 0.5, the α_m parameter is positive.
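A compact sketch of steps 1-3 above, using scikit-learn decision stumps as the weak learners G_m; the stump choice, the clipping of err_m, and the toy data are assumptions:

```python
# Hedged sketch: discrete AdaBoost following steps 1-3 above,
# with depth-1 decision trees ("stumps") as the weak learners.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, M=20):
    """y must be in {-1, +1}. Returns a list of (alpha_m, G_m) pairs."""
    N = len(y)
    w = np.full(N, 1.0 / N)                       # step 1
    ensemble = []
    for _ in range(M):                            # step 2
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)          # 2a
        miss = (stump.predict(X) != y)
        err = np.dot(w, miss) / w.sum()           # 2b
        err = np.clip(err, 1e-10, 1 - 1e-10)      # avoid division by zero
        alpha = np.log((1 - err) / err)           # 2c
        w *= np.exp(alpha * miss)                 # 2d
        ensemble.append((alpha, stump))
    return ensemble

def predict(ensemble, X):
    scores = sum(alpha * g.predict(X) for alpha, g in ensemble)
    return np.sign(scores)                        # step 3

# Toy usage: stumps boosted to approximate a diagonal decision boundary.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)
model = adaboost(X, y)
print((predict(model, X) == y).mean())
```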

Preference Bias (SVM) :
Support : the goal with the support vector machine is to maximize the margin, m, subject to the constraint that we classify everything correctly. Together, this can be defined mathematically as:
max(m) : y_i (w^T x_i + b) ≥ 1  ∀ i

Learning Function
Finding the line of least commitment in a linearly separable set of data is the basis behind support vector machines:
>> a line that leaves as much space as possible from the boundaries.

y = w^T x + b

where: y is the classification label, y ∈ {−1, +1}, with y > 0 meaning "in class" and y < 0 meaning "out of class"; w^T and b are the parameters of the plane.

Performance :
>> similar to KNN, but here, instead of being completely lazy, we spend upfront effort solving a complicated quadratic program to find the required points (the support vectors).
>> for classification tasks involving more than two groups, a common strategy is to use multiple binary classifiers to decide on a single best class for new instances.

Enhancements :
y = w^T φ(x) + b : use a kernel when the feature vector φ(x) is of higher dimension.
Many machine learning algorithms can be written to use only dot products, and then we can replace the dot products with kernels.

Usage
Mostly binary classification (linear and non-linear).

1) If you have a sparse feature set, go with a linear SVM (or another linear model).
2) If you don't care about speed and memory, try a kernel SVM.
************* To eliminate expensive parameter tuning and better handle a high-dimensional input space, we can use a kernelized SVM for text classification (tens of thousands of support vectors, each having hundreds of thousands of features).
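A quick sketch of that rule of thumb (not from the notes), assuming scikit-learn and a synthetic, non-linearly-separable data set:

```python
# Hedged sketch: linear SVM vs. RBF-kernel SVM on data that is separable,
# but not linearly (concentric circles).
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC, SVC

X, y = make_circles(n_samples=400, noise=0.1, factor=0.4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

linear = LinearSVC().fit(X_tr, y_tr)        # fast; suited to sparse/linear problems
rbf = SVC(kernel="rbf").fit(X_tr, y_tr)     # heavier; captures the nonlinear boundary

print("linear SVM accuracy:", linear.score(X_te, y_te))   # near chance here
print("RBF-kernel accuracy:", rbf.score(X_te, y_te))      # much higher
```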

Algo SVM - Classification
The classifier output is greater than or equal to +1 for the positive examples and less than or equal to −1 for the negative examples; the margin follows from the difference between the two boundary vectors projected onto the direction of w.

Page 10: Machine Learning Comparative Analysis - Part 1


Notes on Support Vector Machines - SVM


Here, instead of Polynomial Regression, we consider a Polynomial Kernel; the kernel represents domain knowledge

=> projecting into some higher-dimensional space. For data that is separable, but not linearly, we can use a kernel function to capture a nonlinear dividing curve. The kernel function should capture some aspect of similarity in our data.

Ref : https://www.quora.com/What-are-Kernels-in-Machine-Learning-and-SVM

Simple Example of Kernel : x = (x1, x2, x3); y = (y1, y2, y3). Then for the function f(x) = (x1x1, x1x2, x1x3, x2x1, x2x2, x2x3, x3x1, x3x2, x3x3), the kernel is K(x, y ) = (<x, y>)^2.

Let's plug in some numbers to make this more intuitive:

suppose x = (1, 2, 3); y = (4, 5, 6). Then:

f(x) = (1, 2, 3, 2, 4, 6, 3, 6, 9)

f(y) = (16, 20, 24, 20, 25, 30, 24, 30, 36)

<f(x), f(y)> = 16 + 40 + 72 + 40 + 100+ 180 + 72 + 180 + 324 = 1024

That is a lot of algebra, since f is a mapping from a 3-dimensional to a 9-dimensional space.

Now let us use the kernel instead:

K(x, y) = (4 + 10 + 18)^2 = 32^2 = 1024. Same result, but this calculation is so much easier.
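The same check can be done numerically; a tiny sketch assuming NumPy:

```python
# Hedged sketch: verify that K(x, y) = <x, y>^2 equals the dot product of the
# explicit 9-dimensional mappings f(x) and f(y) from the example above.
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])

f = lambda v: np.outer(v, v).ravel()   # (v1*v1, v1*v2, ..., v3*v3)

print(np.dot(f(x), f(y)))              # 1024.0 -- explicit 9-D feature space
print(np.dot(x, y) ** 2)               # 1024.0 -- kernel trick, no mapping needed
```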



Page 11: Machine Learning Comparative Analysis - Part 1

Types of Errors
In-sample error => the error that results from applying the prediction algorithm to the training data set.
Out-of-sample error => the error that results from applying the prediction algorithm to a new test data set.
In-sample error < out-of-sample error => the model is overfitting, i.e. the model is too optimized for the initial data set.

Regression Errors :
Bias-Variance Estimates
It is very important to calculate 'bias errors' and 'variance errors' while comparing various algorithms.

Error due to Bias => when a prediction model is built multiple times, the bias error is the difference between the expected prediction value and the correct value. Bias describes how far the range of predictions deviates from the real values. Example of low bias ==> the mean of all the sampled predictions tends to converge towards the mean of the real values.

Error due to Variance => how much the predictions for a given point vary between different realizations of the model. Example of high variance ==> the sampled predictions tend to be dispersed away from each other.

Reference : http://scott.fortmann-roe.com/docs/BiasVariance.html

So it is often better to give up a little accuracy for more robustness when predicting on new data.

Classification Errors :
Positive = identified, Negative = rejected
True positive = correctly identified (predicted true when true)
False positive = incorrectly identified (predicted true when false)
True negative = correctly rejected (predicted false when false)
False negative = incorrectly rejected (predicted false when true)

example: medical testing
True positive = sick people correctly diagnosed as sick
False positive = healthy people incorrectly identified as sick
True negative = healthy people correctly identified as healthy
False negative = sick people incorrectly identified as healthy

Page 12: Machine Learning Comparative Analysis - Part 1

Cohen's kappa :
κ = (accuracy − P(e)) / (1 − P(e))
P(e) = ((TP+FP)/total) × ((TP+FN)/total) + ((TN+FN)/total) × ((FP+TN)/total)

Receiver Operating Characteristic (ROC) curves :
x-axis = 1 − specificity (or, the probability of a false positive)
y-axis = sensitivity (or, the probability of a true positive)
area under the curve quantifies whether the prediction model is viable, i.e. a higher area →→ a better predictor

area = 0.5 →→ effectively random guessing (the diagonal line in the ROC curve)
area = 1.0 →→ perfect classifier
area = 0.8 →→ considered good for a prediction algorithm
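A small sketch of the kappa formula above; the confusion-matrix counts are hypothetical:

```python
# Hedged sketch: Cohen's kappa from a binary confusion matrix, using the
# P(e) definition given above (illustrative counts only).
def cohens_kappa(tp, fp, tn, fn):
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    # Expected agreement by chance, P(e), as defined above.
    p_e = ((tp + fp) / total) * ((tp + fn) / total) + \
          ((tn + fn) / total) * ((fp + tn) / total)
    return (accuracy - p_e) / (1 - p_e)

# Medical-testing example with hypothetical counts.
print(cohens_kappa(tp=40, fp=10, tn=45, fn=5))   # 0.7
```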

References :
http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-867-machine-learning-fall-2006/lecture-notes/
http://www.stat.cmu.edu/~cshalizi/350/
http://www.quora.com/Machine-Learning/What-are-some-good-resources-for-learning-about-machine-learning-Why
https://www.udacity.com/course/machine-learning--ud262
https://www.coursera.org/learn/machine-learning
http://sux13.github.io/DataScienceSpCourseNotes/8_PREDMACHLEARN/Practical_Machine_Learning_Course_Notes.html