Inductive Reasoning and (one of) the Foundations of Machine Learning


“beware of mathematicians, and all those who make empty prophecies”

— St. Augustine

Deductive reasoning

All men are mortal. Socrates is a man. Therefore, Socrates is mortal.

Idea: Thinking is deductive reasoning!

[Image: page 1 of the original Dartmouth proposal. Photo courtesy Dartmouth College.]

Trenchard More, John McCarthy, Marvin Minsky, Oliver Selfridge, Ray Solomonoff

50 years later

“To understand the real world, we must have a different set of primitives from the relatively simple line trackers suitable and sufficient for the blocks world”

— Patrick Winston (1975), Director of MIT’s AI Lab from 1972 to 1997

A bump in the road

The AI winter: http://en.wikipedia.org/wiki/AI_winter

Reductio ad absurdum

“Intelligence is 10 million rules” — Doug Lenat

The story so far…
• Boy meets girl
• Boy spends 100s of millions of dollars wooing girl with deductive reasoning
• Girl says: “drop dead”; boy becomes very sad

Next: Boy ponders the errors of his ways

“this book is composed […] upon one very simple theme […] that we can learn from our mistakes”

Karl Popper, Conjectures and Refutations

We’re going to look at 4 learning algorithms.

Sequential prediction

Scenario: At time t, Forecaster predicts 0 or 1. Nature then reveals the truth.

Forecaster has access to N experts. One of them is always correct.

Goal: Predict as accurately as possible.

Algorithm #1

Set t = 1.
While t > 0:
  Step 1. Predict by majority vote.
  Step 2. Remove experts that are wrong.
  Step 3. t ← t+1

Question: How long to find the correct expert? BAD!!!

Question: How many errors?

When the algorithm makes a mistake, it removes ≥ half of the remaining experts, so it makes ≤ log N errors.
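To make the halving argument concrete, here is a minimal Python sketch of Algorithm #1; the function name and the placeholder inputs (lists of expert predictions and revealed truths) are hypothetical, assumed only for illustration.

def halving(expert_predictions, truths):
    """Algorithm #1 (Halving): predict by majority vote of the surviving
    experts, then remove every expert that was wrong this round.

    expert_predictions: list of T rounds, each a list of N 0/1 predictions.
    truths: list of T 0/1 outcomes revealed by Nature.
    Returns the number of mistakes the Forecaster makes."""
    alive = set(range(len(expert_predictions[0])))  # experts not yet eliminated
    mistakes = 0
    for preds, truth in zip(expert_predictions, truths):
        votes_for_1 = sum(preds[i] for i in alive)
        forecast = 1 if 2 * votes_for_1 >= len(alive) else 0  # majority vote
        mistakes += int(forecast != truth)
        alive = {i for i in alive if preds[i] == truth}       # drop wrong experts
    return mistakes

Whenever the forecast is wrong, the experts who voted with the majority are wrong too, so at least half of the surviving experts are removed; since the always-correct expert is never removed, at most log N (base 2) mistakes are possible, matching the bound on the slide.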

Deep thought #1: Track errors, not runtime

What’s going on? Didn’t we just use deductive reasoning!?

Yes… but no!

Algorithm: makes educated guesses about Nature (inductive)
Analysis: proves a theorem about the number of errors (deductive)

The algorithm learns, but it does not deduce!

Adversarial prediction

Scenario: At time t, Forecaster predicts 0 or 1. Nature then reveals the truth.

Forecaster has access to N experts. One of them is always correct. Nature is adversarial.

Goal: Predict as accurately as possible.

Seriously?!?!

Regret

Let m* be the best expert in hindsight.
regret := errors(Forecaster) - errors(m*)

Goal: Predict as accurately as possible. Minimize regret.

Algorithm #2

Pick β in (0,1). Assign weight 1 to each expert. Set t = 1.
While t ≤ T:
  Step 1. Predict by weighted majority vote.
  Step 2. Multiply the weights of incorrect experts by β.
  Step 3. t ← t+1

Question: What is the regret? [ choose β carefully ]

regret \le \sqrt{\frac{T \cdot \log N}{2}}
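A minimal Python sketch of Algorithm #2 in its deterministic weighted-majority form; the inputs are the same hypothetical placeholders as before. The \sqrt{T \log N / 2} regret bound on the slide is for the randomized exponential-weights variant (EWA, listed at the end of the deck); this deterministic vote is the simplest version of the same idea.

def weighted_majority(expert_predictions, truths, beta=0.5):
    """Algorithm #2: predict by weighted majority vote, then multiply the
    weight of every incorrect expert by beta, for some beta in (0, 1)."""
    weights = [1.0] * len(expert_predictions[0])  # assign weight 1 to each expert
    mistakes = 0
    for preds, truth in zip(expert_predictions, truths):
        vote_1 = sum(w for w, p in zip(weights, preds) if p == 1)
        vote_0 = sum(w for w, p in zip(weights, preds) if p == 0)
        forecast = 1 if vote_1 >= vote_0 else 0
        mistakes += int(forecast != truth)
        weights = [w * beta if p != truth else w   # shrink incorrect experts
                   for w, p in zip(weights, preds)]
    return mistakes

Unlike Algorithm #1, no expert is ever eliminated outright, so a single bad round does not permanently discard an expert; β controls how quickly bad experts lose influence.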

Deep thought #2: Model yourself, not Nature

Online Convex Opt.

Scenario: Convex set K; convex loss L(a,b) [ in both arguments, separately ]

At time t, Forecaster picks a_t in K. Nature responds with b_t in K [ Nature is adversarial ]. Forecaster’s loss is L(a_t, b_t).

Goal: Minimize regret.

Follow the Leader

Idea: Predict with the a_t that would have worked best on { b_1, …, b_{t-1} }

Pick a_1 at random. Set t = 1.
While t ≤ T:
  Step 1. a_t := \arg\min_{a \in K} \sum_{i=1}^{t-1} L(a, b_i)
  Step 2. t ← t+1

BAD! Problem: Nature pulls Forecaster back-and-forth. No memory!
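The “no memory” failure is easy to see in code. Below is a minimal sketch of Follow the Leader on a hypothetical one-dimensional instance (K = [-1, 1], linear loss L(a, b) = a·b), chosen here only to exhibit the back-and-forth behaviour; none of these choices come from the slides.

def follow_the_leader(nature_responses):
    """Follow the Leader on K = [-1, 1] with loss L(a, b) = a * b:
    play the a that minimises the cumulative loss on b_1, ..., b_{t-1}.
    For a linear loss that minimiser is always an endpoint of K."""
    past_sum = 0.0      # sum of b_1, ..., b_{t-1}
    total_loss = 0.0
    a = 0.0             # a_1 picked arbitrarily
    for b in nature_responses:
        total_loss += a * b
        past_sum += b
        a = -1.0 if past_sum > 0 else 1.0  # argmin over [-1, 1] of a * past_sum
    return total_loss

# Nature pulls the Forecaster back and forth: FTL flips between -1 and +1
# and pays about 1 per round, while the best fixed action pays about 0.
bs = [0.5] + [(-1.0) ** t for t in range(1, 20)]
print(follow_the_leader(bs))   # roughly 19, i.e. regret linear in T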

Algorithm #3

Pick a_1 at random. Set t = 1.
While t ≤ T:
  Step 1 (regularize). a_t := \arg\min_{a \in K} \left[ \sum_{i=1}^{t-1} L(a, b_i) + \frac{\lambda}{2} \|a\|_2^2 \right]
  Step 2. t ← t+1

Gradient-descent form:

Pick a_1 at random. Set t = 1.
While t ≤ T:
  Step 1 (gradient descent). a_t ← a_{t-1} - β \cdot \frac{\partial}{\partial a} L(a_{t-1}, b_{t-1})
  Step 2. t ← t+1

Intuition: β controls memory.

What is the regret? [ choose β carefully ]

regret \le \mathrm{diam}(K) \cdot \mathrm{Lipschitz}(L) \cdot \sqrt{T}
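A minimal sketch of the gradient-descent form of Algorithm #3 on the same hypothetical instance as the Follow the Leader sketch; the projection step onto K and the particular choice of β are standard but are assumptions made here, not taken from the slides.

import math

def online_gradient_descent(nature_responses, beta):
    """Algorithm #3 (gradient-descent form) on K = [-1, 1] with L(a, b) = a * b:
    a_t <- a_{t-1} - beta * dL(a_{t-1}, b_{t-1})/da, projected back onto K."""
    a = 0.0                          # a_1 picked arbitrarily
    total_loss = 0.0
    for b in nature_responses:
        total_loss += a * b
        a = a - beta * b             # gradient of a*b with respect to a is b
        a = max(-1.0, min(1.0, a))   # project onto K = [-1, 1]
    return total_loss

T = 20
bs = [0.5] + [(-1.0) ** t for t in range(1, T)]
# beta of order diam(K) / (Lipschitz(L) * sqrt(T)) gives the sqrt(T) regret bound.
print(online_gradient_descent(bs, beta=2.0 / math.sqrt(T)))

On the alternating sequence that broke Follow the Leader, the small step size keeps a_t near 0, so the per-round loss stays small instead of being close to 1 every round.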

Deep thought #3

“Those who cannot remember [their] past are condemned to repeat it”

— George Santayana

Minimax theorem

\inf_{a \in K} \sup_{b \in K} L(a, b) = \sup_{b \in K} \inf_{a \in K} L(a, b)

Left side: Forecaster picks a, Nature responds with b.
Right side: Nature picks b, Forecaster responds with a.

Going first hurts Forecaster, so

\inf_{a \in K} \sup_{b \in K} L(a, b) \ge \sup_{b \in K} \inf_{a \in K} L(a, b)

Minimax theorem

Proof idea (for the other direction,
\inf_{a \in K} \sup_{b \in K} L(a, b) \le \sup_{b \in K} \inf_{a \in K} L(a, b)):

Let m* be the best move in hindsight. regret := loss(Forecaster) - loss(m*)

No-regret algorithm → Forecaster can asymptotically match hindsight
→ Order of players doesn’t matter asymptotically
→ Convert the series of moves into an average via online-to-batch:
\bar{a} = \frac{1}{T} \sum_{t=1}^{T} a_t
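A sketch of the chain of inequalities behind the online-to-batch step, written out in LaTeX. Assumptions added here (not stated on the slides): L(·, b) is convex, L(a, ·) is concave, R_T denotes the Forecaster’s regret after T rounds, b_t is Nature’s best response to a_t, and bars denote averages over t = 1, …, T.

\begin{align*}
\inf_{a \in K}\sup_{b \in K} L(a,b)
  &\le \sup_{b \in K} L(\bar a, b)
      && \bar a := \tfrac{1}{T}\textstyle\sum_{t=1}^{T} a_t \\
  &\le \sup_{b \in K} \tfrac{1}{T}\textstyle\sum_{t=1}^{T} L(a_t, b)
      && \text{convexity of } L(\cdot, b) \\
  &\le \tfrac{1}{T}\textstyle\sum_{t=1}^{T} L(a_t, b_t)
      && b_t := \arg\max_{b \in K} L(a_t, b) \\
  &\le \inf_{a \in K}\tfrac{1}{T}\textstyle\sum_{t=1}^{T} L(a, b_t) + \tfrac{R_T}{T}
      && \text{no-regret guarantee} \\
  &\le \inf_{a \in K} L(a, \bar b) + \tfrac{R_T}{T}
      && \text{concavity of } L(a, \cdot),\ \bar b := \tfrac{1}{T}\textstyle\sum_{t=1}^{T} b_t \\
  &\le \sup_{b \in K}\inf_{a \in K} L(a,b) + \tfrac{R_T}{T}.
\end{align*}

Since R_T / T → 0 for a no-regret algorithm, letting T → ∞ gives the ≤ direction; combined with the ≥ direction from the previous slide, the two sides are equal.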

Boosting

Scenario: Algorithm W is better than guessing on any data distribution: loss ≤ 0.5 - ε

Goal: Combine to perform well.

The Boosting Game

Value of game: V(w, d) = # mistakes w makes on d

\sup_{d} \inf_{w} V(w, d) \le \frac{1}{2} - \epsilon

\inf_{w} \sup_{d} V(w, d) \le \frac{1}{2} - \epsilon    MINIMAX!

∃ distribution w* on learners that averages correctly on any data!

Meta-Algorithm #4

Play Algorithm #2 against Algorithm W [ #2 maximizes W’s mistakes ]

\inf_{w} \sup_{d} V(w, d) \le \frac{1}{2} - \epsilon

• Freund and Schapire 1995
• Best learning algorithm in the late 1990s and early 2000s
• Authors won the Gödel Prize
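A minimal sketch in the spirit of Meta-Algorithm #4 (the deck names AdaBoost at the end): multiplicative weights are kept on the training examples, and the weak learner W is forced to do better than guessing on whatever distribution it is handed. The decision-stump weak learner and the toy data below are assumptions made here for illustration only.

import math

def stump(xs, ys, weights):
    """Hypothetical weak learner W: best 1-D threshold classifier
    sign(s * (x - thr)) under the current example weights."""
    best = None
    for thr in sorted(set(xs)):
        for s in (+1.0, -1.0):
            err = sum(w for x, y, w in zip(xs, ys, weights)
                      if (s if x > thr else -s) != y)
            if best is None or err < best[0]:
                best = (err, thr, s)
    return best                      # (weighted error, threshold, sign)

def adaboost(xs, ys, rounds=20):
    """Boosting: reweight the examples (Algorithm #2 on the data) so the
    weak learner is forced to work on what it previously got wrong."""
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []                    # list of (alpha, threshold, sign)
    for _ in range(rounds):
        err, thr, s = stump(xs, ys, weights)
        err = max(min(err, 1 - 1e-12), 1e-12)
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, thr, s))
        # multiply up the weights of the examples the weak learner missed
        weights = [w * math.exp(-alpha * y * (s if x > thr else -s))
                   for x, y, w in zip(xs, ys, weights)]
        z = sum(weights)
        weights = [w / z for w in weights]
    def predict(x):
        score = sum(a * (s if x > thr else -s) for a, thr, s in ensemble)
        return 1.0 if score >= 0 else -1.0
    return predict

# Toy usage: labels in {-1, +1} that no single stump can fit exactly.
xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [-1, -1, 1, 1, -1, 1, 1, 1]
clf = adaboost(xs, ys)
print([clf(x) for x in xs])

Each round, the weights of the examples the weak learner just got wrong are multiplied up, so the next distribution handed to W concentrates on its past mistakes; that is the sense in which “#2 maximizes W’s mistakes”.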

Deep thought #4

Your teachers are not your friends.

The story so far…
• Boy met girl
• Boy spent 100s of millions of dollars wooing girl with deductive reasoning
• Girl showed no interest; boy became very sad
• Boy learnt to learn from mistakes

Next: Boy invites girl for coffee. Girl accepts!

Online Convex Opt. (deep learning)

Apply Algorithm #3 to nonconvex optimization (a toy sketch follows the list below). The theorems no longer apply (the problem is not convex), so tons of engineering is layered on top of #3. Amazing performance. New mathematics needs to be invented!

In the last 2 years, deep learning has:
• Achieved better-than-human performance at object recognition (ImageNet).
• Outperformed humans at recognising street signs (Google Street View).
• Achieved superhuman performance on Atari games (DeepMind).
• Delivered real-time translation from English voice to Chinese text and voice.
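A minimal sketch of the “apply Algorithm #3 to nonconvex optimization” point: plain stochastic gradient descent on a toy nonconvex one-dimensional objective. The objective, the noise model, and the step size are all hypothetical choices made here; real deep-learning systems add the “tons of engineering” the slide mentions (backpropagation through many layers, momentum, learning-rate schedules, and so on).

import math
import random

def f(a):                       # toy nonconvex objective with several local minima
    return math.sin(3.0 * a) + 0.1 * a * a

def grad_f(a):                  # exact derivative of f
    return 3.0 * math.cos(3.0 * a) + 0.2 * a

random.seed(0)
a = random.uniform(-3.0, 3.0)   # random initialisation, as in Algorithm #3
beta = 0.05                     # step size; no convergence theorem applies here
for t in range(500):
    noisy_grad = grad_f(a) + random.gauss(0.0, 0.1)   # "stochastic" gradient
    a -= beta * noisy_grad
print(a, f(a))                  # ends up near *some* local minimum of f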

Thank you!

#1. Halving
#2. Multiplicative Weights / Exponential Weights Algorithm (EWA)
#3. Online Gradient Descent (OGD) / Stochastic Gradient Descent (SGD) / Mirror Descent / Backpropagation
#4. AdaBoost

Details? Lecture notes on my webpage: https://dl.dropboxusercontent.com/u/5874168/math482.pdf

Vladimir Vapnik
Alexey Chervonenkis, 1938-2014

“[A] theory of induction is superfluous. It has no function in a logic of science. The best we can say of a hypothesis is that up to now it has been able to show its worth, and that it has been more successful than other hypotheses although, in principle, it can never be justified, verified, or even shown to be probable. This appraisal of the hypothesis relies solely upon deductive consequences (predictions) which may be drawn from the hypothesis: There is no need to even mention induction.”

“the learning process may be regarded as a search for a form of behaviour which will satisfy the teacher (or some other criterion)”