Full Bayesian inference (Learning)
The data age
Learning paradigms
◦ Learning as inference
◦ Bayesian learning, full Bayesian inference, Bayesian model averaging
◦ Model identification, maximum likelihood learning
Probably Approximately Correct learning
February 20, 2018, A.I.
The "data" age: can we automate analysis/learning?
Universal (statistical) predictor?
Universal learning architectures?
Self-improving super-intelligence?
Phases of AI: expert, supervised, autonomous
Serendipity-driven ("experimental") science
◦ 1000 years ago
◦ description of natural phenomena
Hypothesis-driven ("analytical") science
◦ last few hundred years
◦ falsification, Newton's laws, Maxwell's equations
Computation-driven ("computational") science
◦ last few decades
◦ simulation of complex phenomena
Data-driven ("hypothesis-free") research
◦ must move from data to information to knowledge
Jim Gray: Evolution of science: 4 paradigms
Tansley, Stewart, and Kristin M. Tolle, eds. The Fourth Paradigm: Data-Intensive Scientific Discovery. Ed. Tony Hey. Vol. 1. Redmond, WA: Microsoft Research, 2009.
Moore's law: storage and computation (grids, GPUs, quantum chips, ...)
Semantic web technologies: linked open data
Artificial intelligence methods: learning, reasoning, decision making
Factors: computational hardware, semantic technologies, artificial intelligence.
1965, Gordon Moore, co-founder of Intel:
"The number of transistors that can be placed inexpensively on an integrated circuit doubles approximately every two years" ... "for at least ten years"
•10 µm – 1971
•6 µm – 1974
•3 µm – 1977
•1.5 µm – 1982
•1 µm – 1985
•800 nm – 1989
•600 nm – 1994
•350 nm – 1995
•250 nm – 1997
•180 nm – 1999
•130 nm – 2001
•90 nm – 2004
•65 nm – 2006
•45 nm – 2008
•32 nm – 2010
•22 nm – 2012
•14 nm – 2014
•10 nm – 2017
•7 nm – ~2019
•5 nm – ~2021
2012: single-atom transistor (~0.1 nm = 1 Å)
Linking Open Data cloud diagram 2017, by Andrejs Abele, John P. McCrae, Paul Buitelaar, Anja Jentzsch and Richard
Cyganiak. http://lod-cloud.net/
"Let an ultraintelligent machine be defined as a machine that can far surpass all the intellectual activities of any man however clever. Since the design of machines is one of these intellectual activities, an ultraintelligent machine could design even better machines; there would then unquestionably be an 'intelligence explosion,' and the intelligence of man would be left far behind. Thus the first ultraintelligent machine is the last invention that man need ever make, provided that the machine is docile enough to tell us how to keep it under control." (I. J. Good, 1965)
Artificial General Intelligence (AGI)
Artificial Narrow Intelligence: N-AI-1, N-AI-2, ..., N-AI-k
Narrow AI (present), general AI, super AI / strong AI
1950, Turing: "Computing Machinery and Intelligence" (learning machines)
The possibility of learning is an empirical observation.
"The most incomprehensible thing about the world is that it is at all comprehensible." (Albert Einstein)
"No theory of knowledge should attempt to explain why we are successful in our attempt to explain things." (K. R. Popper: Objective Knowledge, 1972)
Epicurus' (342? B.C. – 270 B.C.) principle of multiple explanations states that one should keep all hypotheses that are consistent with the data.
The principle of Occam's razor (1285 – 1349, sometimes spelt Ockham) states that when inferring causes, entities should not be multiplied beyond necessity. This is widely understood to mean: among all hypotheses consistent with the observations, choose the simplest. In terms of a prior distribution over hypotheses, this is the same as giving simpler hypotheses higher a priori probability and more complex ones lower probability.
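The last point can be made concrete. A minimal sketch (the function name and the 2^(-L) weighting are illustrative choices, not from the slides) of a prior that gives shorter, i.e. simpler, hypotheses more mass:

```python
def occam_prior(description_lengths):
    """Assign prior weight 2**(-L) to a hypothesis of description length L,
    then normalize: simpler hypotheses get higher a priori probability."""
    weights = [2.0 ** -l for l in description_lengths]
    z = sum(weights)
    return [w / z for w in weights]
```

With description lengths 1, 2, 3 this yields priors 4/7, 2/7, 1/7, so each extra bit of complexity halves the prior mass.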
![Page 15: Full Bayesian inference (Learning)](https://reader030.fdocuments.in/reader030/viewer/2022041114/62504b5a206b7419577ab3c9/html5/thumbnails/15.jpg)
What is the probability that the sun will rise tomorrow?
◦ P[{sun rises tomorrow} | {it has risen k times previously}] = (k+1)/(k+2)
◦ (k: Laplace inferred the number of days by assuming the universe was created about 6000 years ago, based on a young-earth creationist reading of the Bible.)
◦ https://en.wikipedia.org/wiki/Sunrise_problem
Rule of succession
◦ https://en.wikipedia.org/wiki/Rule_of_succession
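The rule of succession is easy to state in code. A minimal sketch (the function name is illustrative): after k successes in n trials, under a uniform Beta(1, 1) prior, the posterior predictive probability of the next success is (k+1)/(n+2); with n = k (the sun has risen on every observed day) this reduces to the (k+1)/(k+2) above.

```python
def rule_of_succession(k, n):
    """Laplace's rule: posterior predictive probability of success
    after k successes in n trials, under a uniform Beta(1, 1) prior."""
    return (k + 1) / (n + 2)
```

For example, with no data at all the prediction is the uninformative 1/2, and it approaches 1 as the run of successes grows.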
Russell & Norvig: Artificial Intelligence, ch. 20
[Figure: sequential likelihood of a given data sequence under hypotheses h1–h5 (vertical axis: probability 0–1; horizontal axis: number of observations, 1–12).]
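The figure's setup can be sketched in code, assuming the classic Russell & Norvig candy-bag example (five hypotheses about the fraction of lime candies; the priors and likelihoods below are illustrative of that setup):

```python
# Sequential Bayesian learning over five hypotheses h1..h5.
priors = [0.1, 0.2, 0.4, 0.2, 0.1]       # P(h1)..P(h5)
lime_prob = [0.0, 0.25, 0.5, 0.75, 1.0]  # P(lime | hi)

def update(posterior, observation):
    """One Bayes step: multiply P(hi | data) by P(observation | hi), renormalize."""
    like = [p if observation == "lime" else 1.0 - p for p in lime_prob]
    unnorm = [l * q for l, q in zip(like, posterior)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

def predict_lime(posterior):
    """Bayesian model averaging: P(lime | data) = sum_i P(lime | hi) P(hi | data)."""
    return sum(p * q for p, q in zip(lime_prob, posterior))

post = priors
for _ in range(10):  # observe ten limes in a row
    post = update(post, "lime")
```

After a run of lime observations the posterior mass concentrates on h5 (all-lime), and the averaged prediction follows it, without ever committing to a single identified model.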
Simplest form: learn a function from examples
f is the target function
An example is a pair (x, f(x))
Problem: find a hypothesis h such that h ≈ f, given a training set of examples
(This is a highly simplified model of real learning:
◦ ignores prior knowledge
◦ assumes examples are given)
Construct/adjust h to agree with f on the training set
(h is consistent if it agrees with f on all examples)
E.g., curve fitting.
Ockham’s razor: prefer the simplest hypothesis consistent with data
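As a small illustration of preferring the simple hypothesis (the data points and function name are made up for this sketch), a closed-form least-squares line fit: among curves that (nearly) agree with the data, the straight line is the simplest candidate.

```python
# Curve fitting with the simplest hypothesis class: y = a*x + b.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.1, 1.9, 4.1, 5.9]   # roughly y = 2x, with small noise

def fit_line(xs, ys):
    """Closed-form least squares for a straight line y = a*x + b."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return a, my - a * mx

a, b = fit_line(xs, ys)
```

A degree-3 polynomial would interpolate these four points exactly, but the nearly consistent line is the hypothesis Ockham's razor prefers.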
How do we know that h ≈ f?
1. Use theorems of computational/statistical learning theory
2. Try h on a new test set of examples
(use the same distribution over the example space as for the training set)
Learning curve = % correct on the test set as a function of training set size
Example from concept learning
X: i.i.d. samples
n: sample size
H: hypothesis space
Hbad: the "bad" hypotheses
Probably Approximately Correct (PAC) learning
A single estimate of the expected error for a given hypothesis is convergent, but can we estimate the errors of all hypotheses uniformly well?
Assume that the true hypothesis f is an element of the hypothesis space H.
Define the error of a hypothesis h as its misclassification rate:
error(h) = p(h(x) ≠ f(x))
Hypothesis h is approximately correct if error(h) < ε (ε is the "accuracy").
For h ∈ Hbad: error(h) > ε
H can be partitioned into Hε (hypotheses with error < ε) and Hbad.
By definition, for any h ∈ Hbad the probability of error on a single example is larger than ε, thus the probability of no error is less than (1 − ε).
Thus for n samples, for a given hb ∈ Hbad:
p(Dn: hb(x) = f(x) on all samples in Dn) ≤ (1 − ε)^n
For the event that some hb ∈ Hbad is consistent with all the data, the union bound gives
p(Dn: ∃hb ∈ Hbad, hb(x) = f(x)) ≤ |Hbad| (1 − ε)^n ≤ |H| (1 − ε)^n
To have at most δ probability of failure (i.e., approximate correctness with probability at least 1 − δ), require:
|H| (1 − ε)^n ≤ δ
Expressing the sample size as a function of the accuracy ε and the confidence δ (using (1 − ε) ≤ e^(−ε)) gives a bound on the sample complexity:
n ≥ (1/ε)(ln|H| + ln(1/δ))
How many distinct concepts/decision trees are there with n Boolean attributes?
= number of Boolean functions
= number of distinct truth tables with 2^n rows = 2^(2^n)
E.g., with 6 Boolean attributes, there are 2^64 = 18,446,744,073,709,551,616 trees
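The count is quick to verify in code (the function name is illustrative):

```python
def num_boolean_functions(n):
    """Each of the 2**n truth-table rows gets an independent 0/1 label,
    so there are 2**(2**n) distinct Boolean functions of n attributes."""
    return 2 ** (2 ** n)
```

Plugging |H| = 2^(2^n) into the sample-complexity bound shows that ln|H| = 2^n ln 2, i.e., learning an arbitrary Boolean concept needs exponentially many examples in n.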
In practice, the target typically is not inside the hypothesis space, so the total real error decomposes into "bias + variance":
"bias": expected error / modeling error
"variance": estimation / empirical selection error
For a given sample size the error decomposes as:
[Figure: total error vs. model complexity; the modeling error decreases and the statistical (model selection) error increases with complexity, so the total error is U-shaped.]
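A minimal simulation sketch of the decomposition (the target, noise level, and model class below are illustrative assumptions, not from the slides): fit a deliberately too-simple constant model to noisy samples of f(x) = 2x; the "bias" term is the modeling error of the best constant in the class, and the "variance" term is the spread of the fitted constant across repeated training sets.

```python
import random
import statistics

random.seed(0)

def true_f(x):
    return 2 * x  # the target, outside the constant-model class

def fit_constant(n):
    """Fit the constant model to n noisy samples: just the sample mean."""
    xs = [random.random() for _ in range(n)]          # x ~ Uniform[0, 1]
    ys = [true_f(x) + random.gauss(0, 0.1) for x in xs]
    return statistics.mean(ys)

# Modeling error ("bias"): even the best constant c = 1.0 is wrong:
# integral of (2x - 1)^2 over [0, 1] = 1/3.
modeling_error = 1 / 3

# Statistical error ("variance"): spread of the fitted constant
# across repeated training sets of size 20.
fits = [fit_constant(20) for _ in range(500)]
statistical_error = statistics.pvariance(fits)
```

Here the bias stays fixed however much data arrives, while the variance shrinks with the sample size; a richer model class would trade these the other way, which is exactly the U-shaped total error in the figure.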
Normative predictive probabilistic inference
◦ performs Bayesian model averaging
◦ implements learning through model posteriors
◦ avoids model identification(!)
Model identification is hard:
◦ Probably Approximately Correct learning
◦ bias-variance dilemma