Full Bayesian inference (Learning)
The data age
Learning paradigms
◦ Learning as inference
◦ Bayesian learning, full Bayesian inference, Bayesian model averaging
◦ Model identification, maximum likelihood learning
Probably Approximately Correct learning
February 20, 2018, A.I.
The "data" age: can we automate analysis/learning?
Universal (statistical) predictor?
Universal learning architectures?
Self-improving super-intelligence?
Phases of AI: expert, supervised, autonomous
Serendipity-driven ("experimental") science
◦ 1000 years ago
◦ description of natural phenomena
Hypothesis-driven ("analytical") science
◦ last few hundred years
◦ falsification, Newton's laws, Maxwell's equations
Computation-driven ("computational") science
◦ last few decades
◦ simulation of complex phenomena
Data-driven ("hypothesis-free") research
◦ must move from data to information to knowledge
Jim Gray: Evolution of science: 4 paradigms
Tansley, Stewart, and Kristin M. Tolle, eds. The Fourth Paradigm: Data-Intensive Scientific Discovery. Ed. Tony Hey. Vol. 1. Redmond, WA: Microsoft Research, 2009.
Moore's law: storage and computation (grids, GPUs, quantum chips, ...)
Semantic web technologies: linked open data
Artificial intelligence methods: learning, reasoning, decision making
Factors: computational hardware, semantic technologies, artificial intelligence.
1965, Gordon Moore, co-founder of Intel:
"The number of transistors that can be placed inexpensively on an integrated circuit doubles approximately every two years" ... "for at least ten years"
•10 µm – 1971
•6 µm – 1974
•3 µm – 1977
•1.5 µm – 1982
•1 µm – 1985
•800 nm – 1989
•600 nm – 1994
•350 nm – 1995
•250 nm – 1997
•180 nm – 1999
•130 nm – 2001
•90 nm – 2004
•65 nm – 2006
•45 nm – 2008
•32 nm – 2010
•22 nm – 2012
•14 nm – 2014
•10 nm – 2017
•7 nm – ~2019
•5 nm – ~2021
2012: single-atom transistor (~0.1 nm = 1 Å)
Linking Open Data cloud diagram 2017, by Andrejs Abele, John P. McCrae, Paul Buitelaar, Anja Jentzsch and Richard
Cyganiak. http://lod-cloud.net/
"Let an ultraintelligent machine be defined as a machine that can far surpass all the intellectual activities of any man however clever. Since the design of machines is one of these intellectual activities, an ultraintelligent machine could design even better machines; there would then unquestionably be an 'intelligence explosion,' and the intelligence of man would be left far behind. Thus the first ultraintelligent machine is the last invention that man need ever make, provided that the machine is docile enough to tell us how to keep it under control." (I. J. Good, 1965)
Artificial General Intelligence (AGI)
Artificial Narrow Intelligence: N-AI-1, N-AI-2, ..., N-AI-k
Narrow AI (present), general AI, super AI / strong AI
1950, Turing: "Computing Machinery and Intelligence" (learning machines)
The possibility of learning is an empirical observation.
"The most incomprehensible thing about the world is that it is at all comprehensible." (Albert Einstein)
"No theory of knowledge should attempt to explain why we are successful in our attempt to explain things." (K. R. Popper: Objective Knowledge, 1972)
Epicurus' (342? B.C. – 270 B.C.) principle of multiple explanations states that one should keep all hypotheses that are consistent with the data.
The principle of Occam's razor (1285 – 1349, sometimes spelt Ockham) states that when inferring causes, entities should not be multiplied beyond necessity. This is widely understood to mean: among all hypotheses consistent with the observations, choose the simplest. In terms of a prior distribution over hypotheses, this is the same as giving simpler hypotheses higher a priori probability and more complex ones lower probability.
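The last point can be made concrete. A minimal sketch (the function name and the 2^(-L) weighting are illustrative choices, not from the slides) of a prior that gives shorter, i.e. simpler, hypotheses more mass:

```python
def occam_prior(description_lengths):
    """Assign prior weight 2**(-L) to a hypothesis of description length L,
    then normalize: simpler hypotheses get higher a priori probability."""
    weights = [2.0 ** -l for l in description_lengths]
    z = sum(weights)
    return [w / z for w in weights]
```

With description lengths 1, 2, 3 this yields priors 4/7, 2/7, 1/7, so each extra bit of complexity halves the prior mass.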
![Page 15: Full Bayesian inference (Learning)](https://reader030.fdocuments.in/reader030/viewer/2022041114/62504b5a206b7419577ab3c9/html5/thumbnails/15.jpg)
What is the probability that the sun will rise tomorrow?
◦ P[{sun rises tomorrow} | {it has risen k times previously}] = (k+1)/(k+2)
◦ (k: Laplace inferred the number of days by assuming the universe was created about 6000 years ago, based on a young-earth creationist reading of the Bible.)
◦ https://en.wikipedia.org/wiki/Sunrise_problem
Rule of succession
◦ https://en.wikipedia.org/wiki/Rule_of_succession
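The rule of succession is easy to state in code. A minimal sketch (the function name is illustrative): after k successes in n trials, under a uniform Beta(1, 1) prior, the posterior predictive probability of the next success is (k+1)/(n+2); with n = k (the sun has risen on every observed day) this reduces to the (k+1)/(k+2) above.

```python
def rule_of_succession(k, n):
    """Laplace's rule: posterior predictive probability of success
    after k successes in n trials, under a uniform Beta(1, 1) prior."""
    return (k + 1) / (n + 2)
```

For example, with no data at all the prediction is the uninformative 1/2, and it approaches 1 as the run of successes grows.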
Russell & Norvig: Artificial Intelligence, ch. 20
[Figure: sequential likelihood of a given data sequence under hypotheses h1–h5 (vertical axis: probability 0–1; horizontal axis: number of observations, 1–12).]
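The figure's setup can be sketched in code, assuming the classic Russell & Norvig candy-bag example (five hypotheses about the fraction of lime candies; the priors and likelihoods below are illustrative of that setup):

```python
# Sequential Bayesian learning over five hypotheses h1..h5.
priors = [0.1, 0.2, 0.4, 0.2, 0.1]       # P(h1)..P(h5)
lime_prob = [0.0, 0.25, 0.5, 0.75, 1.0]  # P(lime | hi)

def update(posterior, observation):
    """One Bayes step: multiply P(hi | data) by P(observation | hi), renormalize."""
    like = [p if observation == "lime" else 1.0 - p for p in lime_prob]
    unnorm = [l * q for l, q in zip(like, posterior)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

def predict_lime(posterior):
    """Bayesian model averaging: P(lime | data) = sum_i P(lime | hi) P(hi | data)."""
    return sum(p * q for p, q in zip(lime_prob, posterior))

post = priors
for _ in range(10):  # observe ten limes in a row
    post = update(post, "lime")
```

After a run of lime observations the posterior mass concentrates on h5 (all-lime), and the averaged prediction follows it, without ever committing to a single identified model.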
Simplest form: learn a function from examples
f is the target function
An example is a pair (x, f(x))
Problem: find a hypothesis h such that h ≈ f, given a training set of examples
(This is a highly simplified model of real learning:
◦ ignores prior knowledge
◦ assumes examples are given)
Construct/adjust h to agree with f on the training set
(h is consistent if it agrees with f on all examples)
E.g., curve fitting.
Ockham’s razor: prefer the simplest hypothesis consistent with data
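As a small illustration of preferring the simple hypothesis (the data points and function name are made up for this sketch), a closed-form least-squares line fit: among curves that (nearly) agree with the data, the straight line is the simplest candidate.

```python
# Curve fitting with the simplest hypothesis class: y = a*x + b.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.1, 1.9, 4.1, 5.9]   # roughly y = 2x, with small noise

def fit_line(xs, ys):
    """Closed-form least squares for a straight line y = a*x + b."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return a, my - a * mx

a, b = fit_line(xs, ys)
```

A degree-3 polynomial would interpolate these four points exactly, but the nearly consistent line is the hypothesis Ockham's razor prefers.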
How do we know that h ≈ f?
1. Use theorems of computational/statistical learning theory
2. Try h on a new test set of examples
(use the same distribution over the example space as for the training set)
Learning curve = % correct on the test set as a function of training set size
Example from concept learning
X: i.i.d. samples
n: sample size
H: hypothesis space
Hbad: the "bad" hypotheses
Probably Approximately Correct (PAC) learning
A single estimate of the expected error for a given hypothesis is convergent, but can we estimate the errors of all hypotheses uniformly well?
Assume that the true hypothesis f is an element of the hypothesis space H.
Define the error of a hypothesis h as its misclassification rate:
error(h) = p(h(x) ≠ f(x))
Hypothesis h is approximately correct if error(h) < ε (ε is the "accuracy").
For h ∈ Hbad: error(h) > ε
H can be partitioned into Hε (hypotheses with error < ε) and Hbad.
By definition, for any h ∈ Hbad the probability of error on a single example is larger than ε, thus the probability of no error is less than (1 − ε).
Thus for n samples, for a given hb ∈ Hbad:
p(Dn: hb(x) = f(x) on all samples in Dn) ≤ (1 − ε)^n
For the event that some hb ∈ Hbad is consistent with all the data, the union bound gives
p(Dn: ∃hb ∈ Hbad, hb(x) = f(x)) ≤ |Hbad| (1 − ε)^n ≤ |H| (1 − ε)^n
To have at most δ probability of failure (i.e., approximate correctness with probability at least 1 − δ), require:
|H| (1 − ε)^n ≤ δ
Expressing the sample size as a function of the accuracy ε and the confidence δ (using (1 − ε) ≤ e^(−ε)) gives a bound on the sample complexity:
n ≥ (1/ε)(ln|H| + ln(1/δ))
How many distinct concepts/decision trees are there with n Boolean attributes?
= number of Boolean functions
= number of distinct truth tables with 2^n rows = 2^(2^n)
E.g., with 6 Boolean attributes, there are 2^64 = 18,446,744,073,709,551,616 trees
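The count is quick to verify in code (the function name is illustrative):

```python
def num_boolean_functions(n):
    """Each of the 2**n truth-table rows gets an independent 0/1 label,
    so there are 2**(2**n) distinct Boolean functions of n attributes."""
    return 2 ** (2 ** n)
```

Plugging |H| = 2^(2^n) into the sample-complexity bound shows that ln|H| = 2^n ln 2, i.e., learning an arbitrary Boolean concept needs exponentially many examples in n.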
In practice, the target typically is not inside the hypothesis space, so the total real error decomposes into "bias + variance":
"bias": expected error / modeling error
"variance": estimation / empirical selection error
For a given sample size the error decomposes as:
[Figure: total error vs. model complexity; the modeling error decreases and the statistical (model selection) error increases with complexity, so the total error is U-shaped.]
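A minimal simulation sketch of the decomposition (the target, noise level, and model class below are illustrative assumptions, not from the slides): fit a deliberately too-simple constant model to noisy samples of f(x) = 2x; the "bias" term is the modeling error of the best constant in the class, and the "variance" term is the spread of the fitted constant across repeated training sets.

```python
import random
import statistics

random.seed(0)

def true_f(x):
    return 2 * x  # the target, outside the constant-model class

def fit_constant(n):
    """Fit the constant model to n noisy samples: just the sample mean."""
    xs = [random.random() for _ in range(n)]          # x ~ Uniform[0, 1]
    ys = [true_f(x) + random.gauss(0, 0.1) for x in xs]
    return statistics.mean(ys)

# Modeling error ("bias"): even the best constant c = 1.0 is wrong:
# integral of (2x - 1)^2 over [0, 1] = 1/3.
modeling_error = 1 / 3

# Statistical error ("variance"): spread of the fitted constant
# across repeated training sets of size 20.
fits = [fit_constant(20) for _ in range(500)]
statistical_error = statistics.pvariance(fits)
```

Here the bias stays fixed however much data arrives, while the variance shrinks with the sample size; a richer model class would trade these the other way, which is exactly the U-shaped total error in the figure.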
Normative predictive probabilistic inference
◦ performs Bayesian model averaging
◦ implements learning through model posteriors
◦ avoids model identification(!)
Model identification is hard:
◦ Probably Approximately Correct learning
◦ bias-variance dilemma