Statistical Learning Theory
Instructor: Pranjal Awasthi
Announcements
• HW 3 out today. Due March 24.
• Final project details out on the webpage
– http://rci.rutgers.edu/~pa336/project.html
– Proposal due – March 31
– Final report due – May 2
– In class presentations – Apr 28, May 2.
Next few lectures
• Theoretical foundations of ML
– Formally define learning.
– What can be learned from data?
– What types of guarantees can we hope to achieve?
• Big Questions
– How to generate rules that do well on observed data?
– What confidence do we have that they will do well in the future?
Basic Learning Task
• 𝑦 = 𝑓∗(𝑥1, 𝑥2, … , 𝑥𝑛)
– 𝑓∗ ∈ 𝐻
– 𝑋 = (𝑥1, 𝑥2, … , 𝑥𝑛) ∈ ℜ𝑛
– 𝑦 ∈ {0, 1} (classification) or 𝑦 ∈ ℜ (regression)
• Training data 𝑆𝑚 = (𝑋1, 𝑦1), (𝑋2, 𝑦2), … , (𝑋𝑚, 𝑦𝑚)
– 𝑋𝑖 ∼ 𝐷
• Prediction rule: 𝑓: ℜ𝑛 → ℜ
• Error: 𝑒𝑟𝑟(𝑓)
– Classification: 𝑒𝑟𝑟(𝑓) = Pr_𝐷[𝑓(𝑋) ≠ 𝑓∗(𝑋)]
– Regression: 𝑒𝑟𝑟(𝑓) = 𝐸_𝐷[(𝑓(𝑋) − 𝑓∗(𝑋))²]
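On a finite sample, these two error definitions can be estimated by plain averaging; a minimal Python sketch (the function names here are illustrative, not from the slides):

```python
# Empirical estimates of the two error definitions on a finite sample.
# Function names are illustrative choices, not notation from the slides.

def err_classification(f, f_star, xs):
    """Fraction of points where f disagrees with f*: estimates Pr_D[f(X) != f*(X)]."""
    return sum(f(x) != f_star(x) for x in xs) / len(xs)

def err_regression(f, f_star, xs):
    """Mean squared difference: estimates E_D[(f(X) - f*(X))^2]."""
    return sum((f(x) - f_star(x)) ** 2 for x in xs) / len(xs)

# Example: a slightly shifted threshold disagrees on 1 of 4 points.
f_star = lambda x: 1 if x > 0 else 0
f = lambda x: 1 if x > 1 else 0
e = err_classification(f, f_star, [-2, -1, 1, 2])  # 0.25
r = err_regression(f, f_star, [-2, -1, 1, 2])      # 0.25
```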
Probably Approximately Correct (PAC) model (Valiant '84)
• An algorithm A PAC learns a class H if
– For any given 𝜖 > 0, 𝛿 > 0, and any learning problem in 𝐻
• A takes as input 𝑆𝑚 and produces 𝑓 of error at most 𝜖 with probability at least 1 − 𝛿
– Learning problem in 𝐻
• Choose a distribution 𝐷 over ℜ𝑛 and a target 𝑓∗ ∈ 𝐻
– 𝑚 should be polynomial in 𝑛, 1/𝜖, and 1/𝛿
– Ideally, runtime should also be polynomial in 𝑚.
– 𝑓 should be computable in polynomial time
• Not very realistic
– Assumes 𝑦 = 𝑓∗(𝑥1, 𝑥2, … 𝑥𝑛)
– First step towards more useful extensions
– This lecture will stick with prediction tasks where 𝑦 ∈ {0, 1}
• How to PAC learn H?
– Natural idea: fit the training data
– Empirical Risk Minimization (ERM): find 𝑓 ∈ 𝐻 such that 𝑒𝑟𝑟_{𝑆𝑚}(𝑓) = 0
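A minimal sketch of ERM for a finite class, using 1-d threshold functions as an illustrative choice of H (the slides do not fix a particular class here):

```python
# Empirical Risk Minimization over a finite class H, sketched for 1-d
# threshold functions h_t(x) = 1 if x >= t else 0 (an illustrative class).

def erm(H, sample):
    """Return the h in H with the fewest mistakes on the sample
    (zero mistakes in the realizable PAC setting)."""
    def emp_err(h):
        return sum(h(x) != y for x, y in sample)
    return min(H, key=emp_err)

# Finite class: thresholds at the integer points 0..10.
H = [lambda x, t=t: 1 if x >= t else 0 for t in range(11)]
# Realizable sample: consistent with any threshold in (3, 6].
sample = [(1, 0), (2, 0), (3, 0), (6, 1), (7, 1), (9, 1)]
f_hat = erm(H, sample)  # fits the training data exactly
```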
ERM as a universal algorithm
• Let A be the empirical risk minimizer
Theorem: For any finite H, ERM PAC learns H provided
𝑚 ≥ (1/𝜖) log(|𝐻|/𝛿)
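Plugging in numbers makes the bound concrete. A small sketch, assuming the natural log (the slides just write "log"), with illustrative values |H| = 1000, 𝜖 = 0.1, 𝛿 = 0.05:

```python
import math

# Evaluating the finite-class sample bound m >= (1/eps) * log(|H|/delta).
# The natural log is assumed here; the choice of base only shifts constants.

def sample_bound(H_size, eps, delta):
    return math.ceil((1 / eps) * math.log(H_size / delta))

# Illustrative values: |H| = 1000, eps = 0.1, delta = 0.05.
m = sample_bound(H_size=1000, eps=0.1, delta=0.05)  # about 100 samples
```

Note the logarithmic dependence on |H|: squaring the size of the class only roughly doubles the required sample size.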
What if H is infinite or uncountable?
1-d example
Should be able to replace |H| with the number of "different" functions on 𝑆𝑚.
ERM as a universal algorithm
Theorem: For any H, ERM PAC learns H provided
𝑚 ≥ (4/𝜖) log(2𝐶[2𝑚]/𝛿)
 
𝐶[2𝑚] = maximum # ways to label any set of 2𝑚 points using functions in 𝐻
= maximum # distinct functions induced on a set of 2𝑚 points by 𝐻.
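For small classes the growth function can be computed by brute force. A sketch for 1-d threshold functions (an illustrative class): on any 𝑚 distinct points a threshold can only induce the labelings 11…1, 01…1, …, 00…0, so C[𝑚] = 𝑚 + 1, far below the 2^𝑚 possible labelings.

```python
# Brute-force growth function C[m] for 1-d threshold functions
# h_t(x) = 1 if x >= t else 0 -- an illustrative class, not fixed by the slides.

def growth_thresholds(points):
    """Number of distinct labelings of `points` induced by all thresholds."""
    # Only thresholds at the points themselves (plus one past the max)
    # can produce new labelings; anything in between duplicates one of these.
    thresholds = sorted(points) + [max(points) + 1]
    labelings = {tuple(1 if x >= t else 0 for x in points) for t in thresholds}
    return len(labelings)

c = growth_thresholds([0.5, 1.7, 3.2, 4.0])  # m = 4 points -> 5 labelings
```

Because C[2𝑚] grows only linearly here while the bound needs log(C[2𝑚]), ERM learns thresholds from a modest sample even though the class is uncountable.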
1-d example
2-d example
ERM as a universal algorithm
How to bound 𝐶[2𝑚] for a general class?