Machine Learning: Foundations
Transcript of Machine Learning: Foundations
![Page 1: Machine Learning : Foundations](https://reader033.fdocuments.in/reader033/viewer/2022061516/56815e78550346895dccfa2e/html5/thumbnails/1.jpg)
Machine Learning: Foundations
Yishay Mansour, Tel-Aviv University
![Page 2: Machine Learning : Foundations](https://reader033.fdocuments.in/reader033/viewer/2022061516/56815e78550346895dccfa2e/html5/thumbnails/2.jpg)
Typical Applications
• Classification/clustering problems:
– Data mining
– Information retrieval
– Web search
• Self-customization:
– news
– mail
![Page 3: Machine Learning : Foundations](https://reader033.fdocuments.in/reader033/viewer/2022061516/56815e78550346895dccfa2e/html5/thumbnails/3.jpg)
Typical Applications
• Control:
– Robots
– Dialog systems
• Complicated software:
– driving a car
– playing a game (backgammon)
![Page 4: Machine Learning : Foundations](https://reader033.fdocuments.in/reader033/viewer/2022061516/56815e78550346895dccfa2e/html5/thumbnails/4.jpg)
Why Now?
• Technology ready:
– Algorithms and theory.
• Information abundant:
– Flood of data (online).
• Computational power:
– Sophisticated techniques.
• Industry and consumer needs.
![Page 5: Machine Learning : Foundations](https://reader033.fdocuments.in/reader033/viewer/2022061516/56815e78550346895dccfa2e/html5/thumbnails/5.jpg)
Example 1: Credit Risk Analysis
• Typical customer: a bank.
• Database:
– Current clients' data, including a basic profile (income, house ownership, delinquent accounts, etc.).
– Basic classification.
• Goal: predict/decide whether to grant credit.
![Page 6: Machine Learning : Foundations](https://reader033.fdocuments.in/reader033/viewer/2022061516/56815e78550346895dccfa2e/html5/thumbnails/6.jpg)
Example 1: Credit Risk Analysis
• Rules learned from the data:
IF Other-Delinquent-Accounts > 2 AND Number-Delinquent-Billing-Cycles > 1 THEN DENY CREDIT
IF Other-Delinquent-Accounts = 0 AND Income > $30k THEN GRANT CREDIT
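The two learned rules read directly as a classifier. A minimal sketch, assuming plain numeric fields; the slide does not say what happens when neither rule fires, so the fallthrough value here is invented:

```python
def credit_decision(other_delinquent_accounts, delinquent_billing_cycles, income):
    """Apply the two rules learned from the data (sketch)."""
    if other_delinquent_accounts > 2 and delinquent_billing_cycles > 1:
        return "deny credit"
    if other_delinquent_accounts == 0 and income > 30_000:
        return "grant credit"
    return "no rule fired"   # default not specified on the slide

print(credit_decision(3, 2, 25_000))   # deny credit
print(credit_decision(0, 0, 45_000))   # grant credit
```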
![Page 7: Machine Learning : Foundations](https://reader033.fdocuments.in/reader033/viewer/2022061516/56815e78550346895dccfa2e/html5/thumbnails/7.jpg)
Example 2: Clustering news
• Data: Reuters news / Web data.
• Goal: basic category classification:
– business, sports, politics, etc.
– classify into subcategories (unspecified).
• Methodology:
– consider “typical words” for each category.
– classify using a “distance” measure.
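A minimal sketch of the “typical words” methodology, using word overlap as an inverse “distance” measure; the categories and word lists are invented for illustration:

```python
# Invented "typical word" lists per category.
TYPICAL_WORDS = {
    "business": {"market", "shares", "profit", "bank"},
    "sports":   {"game", "team", "score", "match"},
    "politics": {"election", "vote", "parliament", "minister"},
}

def classify(text):
    """Pick the category whose typical words overlap the document most."""
    words = set(text.lower().split())
    return max(TYPICAL_WORDS, key=lambda c: len(words & TYPICAL_WORDS[c]))

print(classify("The team won the game with a record score"))  # sports
```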
![Page 8: Machine Learning : Foundations](https://reader033.fdocuments.in/reader033/viewer/2022061516/56815e78550346895dccfa2e/html5/thumbnails/8.jpg)
Example 3: Robot control
• Goal: Control a robot in an unknown environment.
• Needs both:
– to explore (new places and actions),
– to use acquired knowledge to gain benefits.
• The learning task “controls” what it observes!
![Page 9: Machine Learning : Foundations](https://reader033.fdocuments.in/reader033/viewer/2022061516/56815e78550346895dccfa2e/html5/thumbnails/9.jpg)
![Page 10: Machine Learning : Foundations](https://reader033.fdocuments.in/reader033/viewer/2022061516/56815e78550346895dccfa2e/html5/thumbnails/10.jpg)
A Glimpse into the Future
• Today's status:
– First-generation algorithms: neural nets, decision trees, etc.
– Well-formed databases.
• Future:
– Many more problems: networking, control, software.
– The main advantage is flexibility!
![Page 11: Machine Learning : Foundations](https://reader033.fdocuments.in/reader033/viewer/2022061516/56815e78550346895dccfa2e/html5/thumbnails/11.jpg)
Relevant Disciplines
• Artificial intelligence
• Statistics
• Computational learning theory
• Control theory
• Information theory
• Philosophy
• Psychology and neurobiology
![Page 12: Machine Learning : Foundations](https://reader033.fdocuments.in/reader033/viewer/2022061516/56815e78550346895dccfa2e/html5/thumbnails/12.jpg)
Types of Models
• Supervised learning
– Given access to classified data.
• Unsupervised learning
– Given access to data, but no classification.
• Control learning
– Selects actions and observes consequences.
– Maximizes long-term cumulative return.
![Page 13: Machine Learning : Foundations](https://reader033.fdocuments.in/reader033/viewer/2022061516/56815e78550346895dccfa2e/html5/thumbnails/13.jpg)
Learning: Complete Information
• Probability distribution D1 over one class (“smiley”) and D2 over the other.
• The two classes are equally likely.
• Compute the probability of “smiley” given a point (x, y).
• Use Bayes' formula; let p be this probability.
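Spelled out, assuming D1 is the density of the “smiley” class and D2 of the other, with equal class priors:

```latex
p \;=\; \Pr[\text{smiley} \mid (x,y)]
\;=\; \frac{\tfrac12\,D_1(x,y)}{\tfrac12\,D_1(x,y) + \tfrac12\,D_2(x,y)}
\;=\; \frac{D_1(x,y)}{D_1(x,y) + D_2(x,y)}
```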
![Page 14: Machine Learning : Foundations](https://reader033.fdocuments.in/reader033/viewer/2022061516/56815e78550346895dccfa2e/html5/thumbnails/14.jpg)
Predictions and Loss Model
• Boolean error:
– Predict a Boolean value.
– Each error costs 1 (no error, no loss).
– Compare the probability p to 1/2.
– Predict deterministically with the higher value.
– This is the optimal prediction (for this loss).
• Cannot recover the probabilities!
![Page 15: Machine Learning : Foundations](https://reader033.fdocuments.in/reader033/viewer/2022061516/56815e78550346895dccfa2e/html5/thumbnails/15.jpg)
Predictions and Loss Model
• Quadratic loss:
– Predict a “real number” q for outcome 1.
– Loss (q − p)² for outcome 1.
– Loss ([1 − q] − [1 − p])² for outcome 0.
– Expected loss: (p − q)².
– Minimized for q = p (optimal prediction).
• Recovers the probabilities.
• Needs to know p to compute the loss!
![Page 16: Machine Learning : Foundations](https://reader033.fdocuments.in/reader033/viewer/2022061516/56815e78550346895dccfa2e/html5/thumbnails/16.jpg)
Predictions and Loss Model
• Logarithmic loss:
– Predict a “real number” q for outcome 1.
– Loss log(1/q) for outcome 1.
– Loss log(1/(1 − q)) for outcome 0.
– Expected loss: −p log q − (1 − p) log(1 − q).
– Minimized for q = p (optimal prediction).
• Recovers the probabilities.
• The loss does not depend on p!
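A small numerical check of the three loss models, assuming a fixed true probability p and sweeping the prediction q: the quadratic and logarithmic expected losses are minimized at q = p, while the Boolean loss depends only on which side of 1/2 the prediction falls:

```python
import math

p = 0.7                                  # true probability of outcome 1 (assumed)
qs = [i / 100 for i in range(1, 100)]    # candidate predictions q

def boolean_loss(q):
    """Predict 1 iff q > 1/2; each error costs 1."""
    return (1 - p) if q > 0.5 else p

def quadratic_loss(q):
    """Expected quadratic loss from the slide: (p - q)^2 (needs p!)."""
    return (p - q) ** 2

def log_loss(q):
    """Expected logarithmic loss: -p log q - (1 - p) log(1 - q)."""
    return -p * math.log(q) - (1 - p) * math.log(1 - q)

print(min(qs, key=quadratic_loss))               # 0.7 -> recovers p
print(min(qs, key=log_loss))                     # 0.7 -> recovers p
print(boolean_loss(0.55) == boolean_loss(0.99))  # True: p not recoverable
```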
![Page 17: Machine Learning : Foundations](https://reader033.fdocuments.in/reader033/viewer/2022061516/56815e78550346895dccfa2e/html5/thumbnails/17.jpg)
The basic PAC Model
Unknown target function f(x)
Distribution D over domain X
Goal: find h(x) such that h(x) ≈ f(x).
Given H, find h ∈ H that minimizes Pr_D[h(x) ≠ f(x)].
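For reference, the standard (ε, δ) form of the guarantee this model asks for: with probability at least 1 − δ over the random sample, the learned h has true error at most ε:

```latex
\Pr_{S \sim D^m}\!\Big[\; \Pr_{x \sim D}\big[h(x) \neq f(x)\big] \;\le\; \varepsilon \;\Big] \;\ge\; 1 - \delta
```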
![Page 18: Machine Learning : Foundations](https://reader033.fdocuments.in/reader033/viewer/2022061516/56815e78550346895dccfa2e/html5/thumbnails/18.jpg)
Basic PAC Notions
S – a sample of m examples drawn i.i.d. from D.
Each example is a pair (x, f(x)).
True error: ε(h) = Pr_D[h(x) ≠ f(x)].
Observed error: ε′(h) = (1/m) · |{x ∈ S : h(x) ≠ f(x)}|.
Basic question: how close is ε(h) to ε′(h)?
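For a single fixed h, a Chernoff–Hoeffding bound (one of the inequalities listed near the end of the deck) makes the answer quantitative: the observed error concentrates around the true error exponentially fast in m:

```latex
\Pr\big[\,|\varepsilon(h) - \varepsilon'(h)| > \gamma\,\big] \;\le\; 2\,e^{-2\gamma^{2} m}
```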
![Page 19: Machine Learning : Foundations](https://reader033.fdocuments.in/reader033/viewer/2022061516/56815e78550346895dccfa2e/html5/thumbnails/19.jpg)
Bayesian Theory
Prior distribution over H
Given a sample S compute a posterior distribution:
Maximum Likelihood (ML): maximize Pr[S|h].
Maximum A Posteriori (MAP): maximize Pr[h|S].
Bayesian predictor: Σ_h h(x) · Pr[h|S].

Pr[h|S] = Pr[S|h] · Pr[h] / Pr[S]
![Page 20: Machine Learning : Foundations](https://reader033.fdocuments.in/reader033/viewer/2022061516/56815e78550346895dccfa2e/html5/thumbnails/20.jpg)
Nearest Neighbor Methods
Classify using nearby examples.
Assume a “structured space” and a “metric”.
[Figure: labeled “+” and “−” examples in the plane; a query point “?” is classified by its nearest neighbors.]
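A minimal 1-nearest-neighbor sketch under these assumptions (Euclidean metric on the plane; the sample points are invented for illustration):

```python
import math

# Labeled sample: (point, label) pairs in the plane (invented data).
sample = [((1, 1), "+"), ((1, 2), "+"), ((2, 1), "+"),
          ((5, 5), "-"), ((6, 5), "-"), ((5, 6), "-")]

def nearest_neighbor(query):
    """Classify `query` by the label of its closest sample point."""
    point, label = min(sample, key=lambda pl: math.dist(pl[0], query))
    return label

print(nearest_neighbor((2, 2)))  # "+" -- closest examples are positive
print(nearest_neighbor((5, 4)))  # "-"
```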
![Page 21: Machine Learning : Foundations](https://reader033.fdocuments.in/reader033/viewer/2022061516/56815e78550346895dccfa2e/html5/thumbnails/21.jpg)
Computational Methods
How to find an h ∈ H with low observed error?
Heuristic algorithms for specific classes.
In most cases, the computational task is provably hard.
![Page 22: Machine Learning : Foundations](https://reader033.fdocuments.in/reader033/viewer/2022061516/56815e78550346895dccfa2e/html5/thumbnails/22.jpg)
Separating Hyperplane
Perceptron: sign(Σᵢ xᵢwᵢ). Find the weights w₁, …, wₙ.
Limited representation.
[Figure: inputs x₁, …, xₙ feed through weights w₁, …, wₙ into a sign gate.]
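A sketch of the classical perceptron rule for finding w₁, …, wₙ: on each mistake on example (x, y), add y·x to w. The training data below is invented and linearly separable:

```python
def sign(z):
    return 1 if z >= 0 else -1

def perceptron(examples, epochs=20):
    """Standard perceptron rule: on a mistake on (x, y), set w <- w + y*x."""
    w = [0.0] * len(examples[0][0])
    for _ in range(epochs):
        for x, y in examples:
            if sign(sum(wi * xi for wi, xi in zip(w, x))) != y:
                w = [wi + y * xi for wi, xi in zip(w, x)]
    return w

# Invented data: label = sign(x1 - x2), with a constant bias input of 1.
data = [((1, 3, 1), -1), ((4, 1, 1), 1), ((2, 5, 1), -1), ((5, 2, 1), 1)]
w = perceptron(data)
print(all(sign(sum(wi * xi for wi, xi in zip(w, x))) == y for x, y in data))  # True
```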
![Page 23: Machine Learning : Foundations](https://reader033.fdocuments.in/reader033/viewer/2022061516/56815e78550346895dccfa2e/html5/thumbnails/23.jpg)
Neural Networks
Sigmoidal gates: a = Σᵢ xᵢwᵢ and output = 1/(1 + e^(−a)).
Back Propagation
[Figure: a network over inputs x₁, …, xₙ.]
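A sketch of a single sigmoidal gate and one gradient step on the squared error; full back propagation applies the same chain rule layer by layer. Data and learning rate are invented:

```python
import math

def sigmoid_gate(x, w):
    """a = sum_i x_i * w_i, output = 1 / (1 + e^{-a})."""
    a = sum(xi * wi for xi, wi in zip(x, w))
    return 1.0 / (1.0 + math.exp(-a))

def gradient_step(x, w, target, lr=0.5):
    """Gradient descent on (out - target)^2; the sigmoid's derivative
    with respect to a is out * (1 - out), so the chain rule gives delta."""
    out = sigmoid_gate(x, w)
    delta = (out - target) * out * (1 - out)
    return [wi - lr * delta * xi for wi, xi in zip(w, x)]

w = [0.1, -0.2]
for _ in range(100):
    w = gradient_step((1.0, 2.0), w, target=1.0)
print(sigmoid_gate((1.0, 2.0), w))  # approaches 1.0
```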
![Page 24: Machine Learning : Foundations](https://reader033.fdocuments.in/reader033/viewer/2022061516/56815e78550346895dccfa2e/html5/thumbnails/24.jpg)
Decision Trees
[Figure: a decision tree. The root tests x₁ > 5; one branch tests x₆ > 2, with leaves +1 and −1; the other branch is a leaf +1.]
![Page 25: Machine Learning : Foundations](https://reader033.fdocuments.in/reader033/viewer/2022061516/56815e78550346895dccfa2e/html5/thumbnails/25.jpg)
Decision Trees
Limited Representation
Efficient Algorithms.
Aim: Find a small decision tree with low observed error.
![Page 26: Machine Learning : Foundations](https://reader033.fdocuments.in/reader033/viewer/2022061516/56815e78550346895dccfa2e/html5/thumbnails/26.jpg)
Decision Trees
PHASE I: construct the tree greedily, using a local index function, e.g., the Gini index G(x) = x(1 − x) or the entropy H(x), ...
PHASE II: prune the decision tree while maintaining low observed error.
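A sketch of the Phase I scoring step, assuming binary labels ±1 and tests of the form x[i] > t; the data and candidate tests below are invented. The split with the lowest weighted Gini index is chosen greedily:

```python
def gini(labels):
    """G(x) = x(1 - x), with x the fraction of +1 labels."""
    if not labels:
        return 0.0
    x = sum(1 for y in labels if y == +1) / len(labels)
    return x * (1 - x)

def split_score(examples, feature, threshold):
    """Weighted Gini index of the split 'x[feature] > threshold'."""
    left  = [y for x, y in examples if x[feature] >  threshold]
    right = [y for x, y in examples if x[feature] <= threshold]
    n = len(examples)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# Invented data: label is +1 iff x0 > 5, so the test (0, 5) scores best.
data = [((2, 7), -1), ((6, 5), +1), ((8, 3), +1), ((4, 9), -1)]
print(min([(0, 5), (1, 4)], key=lambda ft: split_score(data, *ft)))  # (0, 5)
```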
Good experimental results
![Page 27: Machine Learning : Foundations](https://reader033.fdocuments.in/reader033/viewer/2022061516/56815e78550346895dccfa2e/html5/thumbnails/27.jpg)
Complexity versus Generalization
Hypothesis complexity versus observed error.
More complex hypotheses have lower observed error, but might have higher true error.
![Page 28: Machine Learning : Foundations](https://reader033.fdocuments.in/reader033/viewer/2022061516/56815e78550346895dccfa2e/html5/thumbnails/28.jpg)
Basic Criteria for Model Selection
Minimum Description Length:
ε′(h) + |code length of h|
Structural Risk Minimization:
ε′(h) + √(log |H| / m)
![Page 29: Machine Learning : Foundations](https://reader033.fdocuments.in/reader033/viewer/2022061516/56815e78550346895dccfa2e/html5/thumbnails/29.jpg)
Genetic Programming
A search method. Example: decision trees.
• Local mutation operations: change a node in a tree.
• Cross-over operations: replace a subtree by another tree.
• Keeps the “best” candidates: trees with low observed error.
![Page 30: Machine Learning : Foundations](https://reader033.fdocuments.in/reader033/viewer/2022061516/56815e78550346895dccfa2e/html5/thumbnails/30.jpg)
General PAC Methodology
Minimize the observed error.
Search for a small-size classifier.
Hand-tailored search methods for specific classes.
![Page 31: Machine Learning : Foundations](https://reader033.fdocuments.in/reader033/viewer/2022061516/56815e78550346895dccfa2e/html5/thumbnails/31.jpg)
Weak Learning
Small class of predicates H
Weak learning: assume that for any distribution D, there is some predicate h ∈ H that predicts better than 1/2 + γ.
Weak Learning ⇒ Strong Learning
![Page 32: Machine Learning : Foundations](https://reader033.fdocuments.in/reader033/viewer/2022061516/56815e78550346895dccfa2e/html5/thumbnails/32.jpg)
Boosting Algorithms
Functions: Weighted majority of the predicates.
Methodology: Change the distribution to target “hard” examples.
Weight of an example is exponential in the number of incorrect classifications.
Extremely good experimental results and efficient algorithms.
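A sketch of that weight update in the style of AdaBoost (the slide names no specific algorithm): each misclassified example's weight is multiplied by e^α, so its weight grows exponentially in the number of mistakes made on it. The α value below is illustrative:

```python
import math

def update_weights(weights, mistakes, alpha):
    """Multiply the weight of each misclassified example by e^alpha,
    then renormalize to keep a distribution over the examples.

    weights  -- current distribution over the m examples
    mistakes -- mistakes[i] is True if the current predicate errs on example i
    alpha    -- positive step size (in AdaBoost, 0.5 * ln((1 - err) / err))
    """
    w = [wi * (math.exp(alpha) if m else 1.0) for wi, m in zip(weights, mistakes)]
    total = sum(w)
    return [wi / total for wi in w]

w = [0.25, 0.25, 0.25, 0.25]
w = update_weights(w, [True, False, False, False], alpha=math.log(3) / 2)
print([round(x, 3) for x in w])  # the "hard" first example gains weight
```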
![Page 33: Machine Learning : Foundations](https://reader033.fdocuments.in/reader033/viewer/2022061516/56815e78550346895dccfa2e/html5/thumbnails/33.jpg)
Support Vector Machine
[Figure: the data is mapped from n dimensions to m dimensions, m ≫ n.]
![Page 34: Machine Learning : Foundations](https://reader033.fdocuments.in/reader033/viewer/2022061516/56815e78550346895dccfa2e/html5/thumbnails/34.jpg)
Support Vector Machine
Use a hyperplane in the LARGE space.
Choose a hyperplane with a large MARGIN.
[Figure: “+” and “−” examples separated by a large-margin hyperplane.]
Project the data to a high-dimensional space.
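A sketch of the projection idea using the standard quadratic feature map (the data points are invented): inner and outer points that no line separates in the plane become separable by a hyperplane in the 3-dimensional feature space:

```python
import math

def phi(x1, x2):
    """Quadratic feature map: 2 dimensions -> 3 dimensions."""
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

# Inner points (+) vs outer points (-): not linearly separable in the plane.
points = [((0.5, 0.0), "+"), ((0.0, 0.5), "+"), ((2.0, 0.0), "-"), ((0.0, 2.0), "-")]

# In feature space z1 + z3 = x1^2 + x2^2, so the hyperplane z1 + z3 = 1 separates.
for (x1, x2), label in points:
    z = phi(x1, x2)
    print(label, "+" if z[0] + z[2] < 1 else "-")  # predictions match the labels
```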
![Page 35: Machine Learning : Foundations](https://reader033.fdocuments.in/reader033/viewer/2022061516/56815e78550346895dccfa2e/html5/thumbnails/35.jpg)
Other Models
Membership Queries
[Figure: membership query: the learner submits x and receives f(x).]
![Page 36: Machine Learning : Foundations](https://reader033.fdocuments.in/reader033/viewer/2022061516/56815e78550346895dccfa2e/html5/thumbnails/36.jpg)
Fourier Transform
f(x) = Σ_z f̂(z)·χ_z(x), where χ_z(x) = (−1)^⟨x,z⟩
Many simple classes are well approximated using their large coefficients.
Efficient algorithms for finding large coefficients.
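A sketch of estimating a single coefficient by sampling, using f̂(z) = E_x[f(x)·χ_z(x)] over uniform x ∈ {0,1}ⁿ; the target f below is invented. (The efficient algorithms the slide refers to locate large coefficients more cleverly than enumerating all z.)

```python
import random

def chi(z, x):
    """Basis function chi_z(x) = (-1)^{<x,z>} over the Boolean cube."""
    return -1 if sum(zi & xi for zi, xi in zip(z, x)) % 2 else 1

def estimate_coefficient(f, z, n, samples=20000):
    """Estimate f_hat(z) = E_x[f(x) * chi_z(x)] by uniform sampling."""
    total = 0
    for _ in range(samples):
        x = tuple(random.randint(0, 1) for _ in range(n))
        total += f(x) * chi(z, x)
    return total / samples

f = lambda x: chi((1, 1, 0), x)          # invented target: a single parity
print(round(estimate_coefficient(f, (1, 1, 0), n=3), 2))  # ~1.0
print(round(estimate_coefficient(f, (1, 0, 0), n=3), 2))  # ~0.0
```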
![Page 37: Machine Learning : Foundations](https://reader033.fdocuments.in/reader033/viewer/2022061516/56815e78550346895dccfa2e/html5/thumbnails/37.jpg)
Reinforcement Learning
Control Problems.
Changing the parameters changes the behavior.
Search for optimal policies.
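As one concrete instance (the slide names none), a sketch of the tabular Q-learning update, a standard way to search for an optimal policy by adjusting parameters, the Q-values, from observed consequences. The two-state example is invented:

```python
def q_update(Q, s, a, reward, s_next, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step:
    Q(s,a) <- Q(s,a) + alpha * (reward + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q[s_next].values())
    Q[s][a] += alpha * (reward + gamma * best_next - Q[s][a])

# Invented 2-state example: taking action "go" in state 0 yields reward 1.
Q = {0: {"go": 0.0, "stay": 0.0}, 1: {"go": 0.0, "stay": 0.0}}
for _ in range(200):
    q_update(Q, s=0, a="go", reward=1.0, s_next=1)
print(round(Q[0]["go"], 2))  # approaches 1.0 (no further reward from state 1)
```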
![Page 38: Machine Learning : Foundations](https://reader033.fdocuments.in/reader033/viewer/2022061516/56815e78550346895dccfa2e/html5/thumbnails/38.jpg)
Clustering: Unsupervised learning
![Page 39: Machine Learning : Foundations](https://reader033.fdocuments.in/reader033/viewer/2022061516/56815e78550346895dccfa2e/html5/thumbnails/39.jpg)
Unsupervised learning: Clustering
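The slides show clustering only pictorially; as a hedged concrete example, a minimal k-means sketch, one standard clustering method, with invented one-dimensional data:

```python
import statistics

def kmeans(points, centers, rounds=10):
    """Alternate: assign each point to its nearest center, then move each
    center to the mean of its assigned points."""
    for _ in range(rounds):
        clusters = {c: [] for c in centers}
        for p in points:
            nearest = min(centers, key=lambda c: abs(p - c))
            clusters[nearest].append(p)
        centers = [statistics.mean(ps) if ps else c for c, ps in clusters.items()]
    return centers

data = [1.0, 1.2, 0.8, 9.0, 9.3, 8.7]            # two obvious groups (invented)
print(sorted(kmeans(data, centers=[0.0, 5.0])))  # ~[1.0, 9.0]
```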
![Page 40: Machine Learning : Foundations](https://reader033.fdocuments.in/reader033/viewer/2022061516/56815e78550346895dccfa2e/html5/thumbnails/40.jpg)
Basic Concepts in Probability
• For a single hypothesis h:
– given its observed error,
– bound its true error.
• Markov's inequality
• Chebyshev's inequality
• Chernoff's inequality
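For reference, the three inequalities (for a nonnegative X, a general X with finite variance, and an average of m i.i.d. indicator variables with mean p, respectively):

```latex
\text{Markov:}\quad \Pr[X \ge a] \;\le\; \frac{\mathbb{E}[X]}{a} \qquad (X \ge 0,\; a > 0)

\text{Chebyshev:}\quad \Pr\big[\,|X - \mathbb{E}[X]| \ge a\,\big] \;\le\; \frac{\mathrm{Var}(X)}{a^{2}}

\text{Chernoff--Hoeffding:}\quad
\Pr\Big[\,\Big|\tfrac{1}{m}\textstyle\sum_{i=1}^{m} X_i - p\Big| > \gamma\,\Big] \;\le\; 2e^{-2\gamma^{2} m}
```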
![Page 41: Machine Learning : Foundations](https://reader033.fdocuments.in/reader033/viewer/2022061516/56815e78550346895dccfa2e/html5/thumbnails/41.jpg)
Basic Concepts in Probability
• Switching from h1 to h2:
– given the observed errors,
– predict whether h2 is better.
• Compare the total error rates, or
• compare only on cases where h1(x) ≠ h2(x):
– a more refined comparison.