Building an Automatic Statistician
Zoubin Ghahramani
Department of Engineering, University of Cambridge
zoubin@eng.cam.ac.uk
http://learning.eng.cam.ac.uk/zoubin/
ALT-Discovery Science Conference, October 2014
James Robert Lloyd (Cambridge), David Duvenaud (Cambridge → Harvard), Roger Grosse (MIT → Toronto), Josh Tenenbaum (MIT)
THERE IS A GROWING NEED FOR DATA ANALYSIS
• We live in an era of abundant data
• The McKinsey Global Institute claims:
  “The United States alone faces a shortage of 140,000 to 190,000 people with analytical expertise and 1.5 million managers and analysts with the skills to understand and make decisions based on the analysis of big data.”
• Diverse fields increasingly rely on expert statisticians, machine learning researchers and data scientists, e.g.
  – Computational sciences (e.g. biology, astronomy, . . . )
  – Online advertising
  – Quantitative finance
  – . . .
GOALS OF THE AUTOMATIC STATISTICIAN PROJECT
• Provide a set of tools for understanding data that require minimal expert input
• Uncover challenging research problems in e.g.
  – Automated inference
  – Model construction and comparison
  – Data visualisation and interpretation
• Advance the field of machine learning in general
Background
• Probabilistic modelling
• Model selection and marginal likelihoods
• Bayesian nonparametrics
• Gaussian processes
Probabilistic Modelling
• A model describes data that one could observe from a system
• If we use the mathematics of probability theory to express all forms of uncertainty and noise associated with our model...
• ...then inverse probability (i.e. Bayes rule) allows us to infer unknown quantities, adapt our models, make predictions and learn from data.
Bayesian Machine Learning
Everything follows from two simple rules:

Sum rule: P(x) = Σ_y P(x, y)
Product rule: P(x, y) = P(x) P(y|x)

Learning (Bayes rule):

P(θ|D, m) = P(D|θ, m) P(θ|m) / P(D|m)

P(D|θ, m): likelihood of parameters θ in model m
P(θ|m): prior probability of θ
P(θ|D, m): posterior of θ given data D

Prediction:

P(x|D, m) = ∫ P(x|θ, D, m) P(θ|D, m) dθ

Model Comparison:

P(m|D) = P(D|m) P(m) / P(D),  where  P(D|m) = ∫ P(D|θ, m) P(θ|m) dθ
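To make the model-comparison step concrete, here is a minimal Python sketch (the function name and the uniform-prior default are illustrative, not from the talk) that converts per-model log evidences log P(D|m) into posterior model probabilities P(m|D):

```python
import numpy as np

def model_posterior(log_evidences, log_priors=None):
    """P(m|D) is proportional to P(D|m) P(m); computed stably in log space."""
    log_evidences = np.asarray(log_evidences, dtype=float)
    if log_priors is None:               # assume a uniform prior over models
        log_priors = np.zeros_like(log_evidences)
    log_post = log_evidences + log_priors
    log_post -= log_post.max()           # subtract the max to avoid overflow
    post = np.exp(log_post)
    return post / post.sum()             # normalising constant plays the role of P(D)

print(model_posterior([-105.2, -98.7, -99.1]))  # three hypothetical models
```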
Model Comparison
[Figure: polynomial models of degree M = 0 through M = 7 fitted to the same data set.]
Bayesian Occam’s Razor and Model Comparison
Compare model classes, e.g. m and m′, using posterior probabilities given D:

p(m|D) = p(D|m) p(m) / p(D),  where  p(D|m) = ∫ p(D|θ, m) p(θ|m) dθ
Interpretations of the Marginal Likelihood (“model evidence”):
• The probability that randomly selected parameters from the prior would generate D.
• Probability of the data under the model, averaging over all possible parameter values.
• log₂(1 / p(D|m)) is the number of bits of surprise at observing data D under model m.

Model classes that are too simple are unlikely to generate the data set.
Model classes that are too complex can generate many possible data sets, so again, they are unlikely to generate that particular data set at random.
[Figure: the evidence P(D|m) plotted over all possible data sets of size n, for a model that is too simple, one that is too complex, and one that is “just right”.]
Bayesian Model Comparison: Occam’s Razor at Work
[Figure: the polynomial fits for M = 0 through M = 7, together with the model evidence P(Y|M) for each M.]
For example, for quadratic polynomials (m = 2): y = a₀ + a₁x + a₂x² + ε, where ε ∼ N(0, σ²) and parameters θ = (a₀, a₁, a₂, σ).
demo: polybayes
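In the spirit of the polybayes demo, a hedged numpy/scipy sketch (the data and hyperparameters are illustrative, not taken from the original demo): for polynomials with a Gaussian prior on the coefficients and known noise variance, the evidence p(D|m) is a Gaussian density that can be evaluated in closed form.

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_evidence(x, y, degree, noise_var=0.04, prior_var=10.0):
    """log p(D|m) for a polynomial of the given degree, assuming a
    N(0, prior_var I) prior on coefficients and known noise variance.
    Marginally, y ~ N(0, prior_var * Phi Phi^T + noise_var * I)."""
    Phi = np.vander(x, degree + 1)       # design matrix: columns x^degree, ..., x, 1
    cov = prior_var * Phi @ Phi.T + noise_var * np.eye(len(x))
    return multivariate_normal.logpdf(y, mean=np.zeros(len(x)), cov=cov)

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 30)
y = 1.0 + 2.0 * x - 3.0 * x**2 + rng.normal(0, 0.2, x.size)  # quadratic data
for m in range(8):                       # the evidence should favour M near 2
    print(f"M = {m}: log p(D|m) = {log_evidence(x, y, m):.1f}")
```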
Parametric vs Nonparametric Models
• Parametric models assume some finite set of parameters θ. Given the parameters, future predictions, x, are independent of the observed data, D:

  P(x|θ, D) = P(x|θ)

  therefore θ captures everything there is to know about the data.
• So the complexity of the model is bounded even if the amount of data is unbounded. This makes them not very flexible.
• Non-parametric models assume that the data distribution cannot be defined in terms of such a finite set of parameters. But they can often be defined by assuming an infinite-dimensional θ. Usually we think of θ as a function.
• The amount of information that θ can capture about the data D can grow as the amount of data grows. This makes them more flexible.
Bayesian nonparametrics
A simple framework for modelling complex data.
Nonparametric models can be viewed as having infinitely many parameters
Examples of non-parametric models:
Parametric                    | Non-parametric                 | Application
polynomial regression         | Gaussian processes             | function approx.
logistic regression           | Gaussian process classifiers   | classification
mixture models, k-means       | Dirichlet process mixtures     | clustering
hidden Markov models          | infinite HMMs                  | time series
factor analysis / pPCA / PMF  | infinite latent factor models  | feature discovery
Nonlinear regression and Gaussian processes
Consider the problem of nonlinear regression: you want to learn a function f with error bars from data D = {X, y}.

[Figure: noisy data points in the (x, y) plane with a smooth fitted function and error bars.]

A Gaussian process defines a distribution over functions p(f) which can be used for Bayesian regression:

p(f|D) = p(f) p(D|f) / p(D)

Let f = (f(x₁), f(x₂), . . . , f(xₙ)) be an n-dimensional vector of function values evaluated at n points xᵢ ∈ X. Note, f is a random variable.

Definition: p(f) is a Gaussian process if for any finite subset {x₁, . . . , xₙ} ⊂ X, the marginal distribution over that subset p(f) is multivariate Gaussian.
A picture
[Diagram: a cube relating model families along three axes (Bayesian, Kernel, Classification), with Linear Regression, Logistic Regression, Kernel Regression, Kernel Classification, Bayesian Linear Regression, Bayesian Logistic Regression, GP Regression, and GP Classification at its corners.]
Gaussian process covariance functions (kernels)
p(f) is a Gaussian process if for any finite subset {x₁, . . . , xₙ} ⊂ X, the marginal distribution over that finite subset p(f) has a multivariate Gaussian distribution.

Gaussian processes (GPs) are parameterized by a mean function, μ(x), and a covariance function, or kernel, K(x, x′).

p(f(x), f(x′)) = N(μ, Σ)

where

μ = [μ(x), μ(x′)]ᵀ   Σ = [[K(x, x), K(x, x′)], [K(x′, x), K(x′, x′)]]

and similarly for p(f(x₁), . . . , f(xₙ)), where now μ is an n × 1 vector and Σ is an n × n matrix.
Gaussian process covariance functions
Gaussian processes (GPs) are parameterized by a mean function, μ(x), and a covariance function, K(x, x′), where μ(x) = E(f(x)) and K(x, x′) = Cov(f(x), f(x′)).

An example covariance function:

K(x, x′) = v₀ exp{ −(|x − x′| / r)^α } + v₁ + v₂ δᵢⱼ

with parameters (v₀, v₁, v₂, r, α).

These kernel parameters are interpretable and can be learned from data:
  v₀: signal variance
  v₁: variance of bias
  v₂: noise variance
  r: lengthscale
  α: roughness

Once the mean and covariance functions are defined, everything else about GPs follows from the basic rules of probability applied to multivariate Gaussians.
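A minimal numpy sketch of drawing samples from a zero-mean GP with this example covariance function (the parameter values are illustrative):

```python
import numpy as np

def K(x1, x2, v0=1.0, v1=0.1, v2=0.01, r=10.0, alpha=2.0):
    """v0 * exp(-(|x - x'| / r)^alpha) + v1 + v2 * delta_ij."""
    d = np.abs(x1[:, None] - x2[None, :])
    k = v0 * np.exp(-((d / r) ** alpha)) + v1
    return k + v2 * (d == 0)             # delta term: only where inputs coincide

x = np.linspace(0, 100, 200)
cov = K(x, x) + 1e-9 * np.eye(len(x))    # tiny jitter keeps the Cholesky stable
L = np.linalg.cholesky(cov)
f = L @ np.random.default_rng(1).standard_normal((len(x), 3))  # three draws of f(x)
```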
Samples from GPs with different K(x, x′)

[Figure: twelve panels of sample functions f(x) drawn from GPs with different covariance functions K(x, x′).]
gpdemogen
Prediction using GPs with different K(x, x′)
A sample from the prior for each covariance function:
[Figure: one sample from the prior for each of three covariance functions.]
Corresponding predictions, mean with two standard deviations:
[Figure: the corresponding posterior predictions for each covariance function.]
gpdemo
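A corresponding sketch of GP prediction, assuming a squared-exponential kernel with illustrative hyperparameters: the posterior mean and pointwise variance at test inputs follow from conditioning the joint multivariate Gaussian.

```python
import numpy as np

def se(a, b, ell=1.0, sf2=1.0):
    """Squared-exponential kernel."""
    return sf2 * np.exp(-0.5 * ((a[:, None] - b[None, :]) / ell) ** 2)

def gp_predict(x_tr, y_tr, x_te, noise_var=0.1):
    """Zero-mean GP posterior mean and pointwise variance at x_te."""
    Kxx = se(x_tr, x_tr) + noise_var * np.eye(len(x_tr))
    Kxs = se(x_tr, x_te)
    mean = Kxs.T @ np.linalg.solve(Kxx, y_tr)
    var = np.diag(se(x_te, x_te)) - np.sum(Kxs * np.linalg.solve(Kxx, Kxs), axis=0)
    return mean, np.maximum(var, 0.0)    # clip tiny negative values from round-off

x = np.array([1.0, 3.0, 5.0, 7.0, 9.0])
xs = np.linspace(0, 15, 100)
mu, var = gp_predict(x, np.sin(x), xs)
band = (mu - 2 * np.sqrt(var), mu + 2 * np.sqrt(var))  # mean with two std. devs.
```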
The Automatic Statistician
WHAT WOULD AN AUTOMATIC STATISTICIAN DO?
[Flowchart: Data → Search (over a language of models, scored by an evaluation procedure) → Model → Prediction, Translation, Checking → Report]
INGREDIENTS OF AN AUTOMATIC STATISTICIAN
• An open-ended language of models
  – Expressive enough to capture real-world phenomena. . .
  – . . . and the techniques used by human statisticians
• A search procedure
  – To efficiently explore the language of models
• A principled method of evaluating models
  – Trading off complexity and fit to data
• A procedure to automatically explain the models
  – Making the assumptions of the models explicit. . .
  – . . . in a way that is intelligible to non-experts
PREVIEW: AN ENTIRELY AUTOMATIC ANALYSIS
[Figure: raw data (1950–1962) and the full model posterior with extrapolations.]
Four additive components have been identified in the data:
• A linearly increasing function.
• An approximately periodic function with a period of 1.0 years and with linearly increasing amplitude.
• A smooth function.
• Uncorrelated noise with linearly increasing standard deviation.
DEFINING A LANGUAGE OF MODELS
DEFINING A LANGUAGE OF REGRESSION MODELS
• Regression consists of learning a function f : X → Y, from inputs to outputs, given example input/output pairs
• Language should include simple parametric forms. . .
  – e.g. linear functions, polynomials, exponential functions
• . . . as well as functions specified by high-level properties
  – e.g. smoothness, periodicity
• Inference should be tractable for all models in the language
WE CAN BUILD REGRESSION MODELS WITH GAUSSIAN PROCESSES
• GPs are distributions over functions such that any finite subset of function evaluations, (f(x₁), f(x₂), . . . , f(x_N)), have a joint Gaussian distribution
• A GP is completely specified by
  – Mean function, μ(x) = E(f(x))
  – Covariance / kernel function, k(x, x′) = Cov(f(x), f(x′))
  – Denoted f ∼ GP(μ, k)
[Figure: GP posterior mean and posterior uncertainty, shown first without data and then conditioned on an increasing number of data points.]
A LANGUAGE OF GAUSSIAN PROCESS KERNELS
• It is common practice to use a zero mean function since the mean can be marginalised out
• Suppose f(x) | a ∼ GP(a × μ(x), k(x, x′)) where a ∼ N(0, 1)
• Then equivalently, f(x) ∼ GP(0, μ(x)μ(x′) + k(x, x′))
• We therefore define a language of GP regression models by specifying a language of kernels
THE ATOMS OF OUR LANGUAGE
Five base kernels
Kernel               Encodes
Squared exp. (SE)    smooth functions
Periodic (PER)       periodic functions
Linear (LIN)         linear functions
Constant (C)         constant functions
White noise (WN)     Gaussian noise
THE COMPOSITION RULES OF OUR LANGUAGE
• Two main operations: addition, multiplication

  LIN × LIN: quadratic functions
  SE × PER: locally periodic
  LIN + PER: periodic plus linear trend
  SE + PER: periodic plus smooth trend
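These composition rules are straightforward to mirror in code. A sketch (assuming numpy; hyperparameter values are illustrative) that builds composite kernels as closures:

```python
import numpy as np

def SE(a, b, ell=1.0):
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / ell) ** 2)

def PER(a, b, period=1.0, ell=1.0):
    d = np.abs(a[:, None] - b[None, :])
    return np.exp(-2.0 * np.sin(np.pi * d / period) ** 2 / ell ** 2)

def LIN(a, b):
    return a[:, None] * b[None, :]

def add(k1, k2):
    return lambda a, b: k1(a, b) + k2(a, b)   # a sum of kernels is a kernel

def mul(k1, k2):
    return lambda a, b: k1(a, b) * k2(a, b)   # a product of kernels is a kernel

quadratic      = mul(LIN, LIN)   # LIN × LIN
locally_per    = mul(SE, PER)    # SE × PER
per_plus_trend = add(LIN, PER)   # LIN + PER
```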
MODELING CHANGEPOINTS
Assume f₁(x) ∼ GP(0, k₁) and f₂(x) ∼ GP(0, k₂). Define:

f(x) = (1 − σ(x)) f₁(x) + σ(x) f₂(x)

where σ is a sigmoid function between 0 and 1.

Then f ∼ GP(0, k), where

k(x, x′) = (1 − σ(x)) k₁(x, x′) (1 − σ(x′)) + σ(x) k₂(x, x′) σ(x′)
We define the changepoint operator k = CP(k1, k2).
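Continuing the sketch above, the changepoint operator can be written directly from this formula (the sigmoid's location and steepness are illustrative parameters):

```python
def CP(k1, k2, location=0.0, steepness=1.0):
    """k(x, x') = (1 - s(x)) k1(x, x') (1 - s(x')) + s(x) k2(x, x') s(x')."""
    def k(a, b):
        sa = 1.0 / (1.0 + np.exp(-steepness * (a - location)))  # sigmoid s at a
        sb = 1.0 / (1.0 + np.exp(-steepness * (b - location)))  # sigmoid s at b
        return np.outer(1 - sa, 1 - sb) * k1(a, b) + np.outer(sa, sb) * k2(a, b)
    return k
```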
AN EXPRESSIVE LANGUAGE OF MODELS
Regression model             Kernel
GP smoothing                 SE + WN
Linear regression            C + LIN + WN
Multiple kernel learning     Σ SE + WN
Trend, cyclical, irregular   Σ SE + Σ PER + WN
Fourier decomposition        C + Σ cos + WN
Sparse spectrum GPs          Σ cos + WN
Spectral mixture             Σ SE × cos + WN
Changepoints                 e.g. CP(SE, SE) + WN
Heteroscedasticity           e.g. SE + LIN × WN

Note: cos is a special case of our version of PER
DISCOVERING A GOOD MODEL VIA SEARCH
• The language is defined as the arbitrary composition of five base kernels (WN, C, LIN, SE, PER) via three operators (+, ×, CP).
• The space spanned by this language is open-ended and can have a high branching factor, requiring a judicious search
• We propose a greedy search for its simplicity and similarity to human model-building
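A skeletal version of such a greedy search over kernel expressions; the score argument is a hypothetical placeholder for fitting a candidate model and computing the evaluation criterion described in the next section:

```python
BASE = ["WN", "C", "LIN", "SE", "PER"]

def expand(kernel):
    """All one-step expansions of a kernel expression via +, x, and CP."""
    for b in BASE:
        yield f"({kernel} + {b})"
        yield f"({kernel} * {b})"
        yield f"CP({kernel}, {b})"

def greedy_search(score, max_depth=5):
    """score: maps a kernel expression to a model score (higher is better)."""
    best = max(BASE, key=score)
    for _ in range(max_depth):
        challenger = max(expand(best), key=score)
        if score(challenger) <= score(best):
            break                          # no improvement: stop early
        best = challenger
    return best
```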
EXAMPLE: MAUNA LOA KEELING CURVE
Current model: RQ

[Figure: posterior fit and extrapolation, 2000–2010.]

Search tree (greedy expansion):
Start → SE, RQ, LIN, PER
RQ → SE + RQ, . . . , PER + RQ, . . . , PER × RQ
PER + RQ → SE + PER + RQ, . . . , SE × (PER + RQ)
EXAMPLE: MAUNA LOA KEELING CURVE
Current model: PER + RQ

[Figure: posterior fit and extrapolation.]

(Search tree as above.)
EXAMPLE: MAUNA LOA KEELING CURVE
Current model: SE × (PER + RQ)

[Figure: posterior fit and extrapolation.]

(Search tree as above.)
EXAMPLE: MAUNA LOA KEELING CURVE
Current model: SE + SE × (PER + RQ)

[Figure: posterior fit and extrapolation.]

(Search tree as above.)
MODEL EVALUATION
• After proposing a new model, its kernel parameters are optimised by conjugate gradients
• We evaluate each optimised model, M, using the model evidence (marginal likelihood), which can be computed analytically for GPs
• We penalise the marginal likelihood for the optimised kernel parameters using the Bayesian Information Criterion (BIC):

  −0.5 × BIC(M) = log p(D | M) − (p / 2) log n

  where p is the number of kernel parameters, D represents the data, and n is the number of data points.
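A sketch of this evaluation step (assuming numpy): the GP log marginal likelihood is analytic given the kernel matrix, and the BIC penalty follows the formula above.

```python
import numpy as np

def gp_log_marginal_likelihood(K, y, noise_var=1e-2):
    """log N(y; 0, K + noise_var I), the analytic GP evidence."""
    n = len(y)
    L = np.linalg.cholesky(K + noise_var * np.eye(n))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (-0.5 * y @ alpha
            - np.log(np.diag(L)).sum()     # -0.5 * log det, via the Cholesky factor
            - 0.5 * n * np.log(2 * np.pi))

def neg_half_bic(log_ml, num_params, num_data):
    """-0.5 * BIC(M) = log p(D|M) - (p / 2) log n; higher is better."""
    return log_ml - 0.5 * num_params * np.log(num_data)
```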
AUTOMATIC TRANSLATION OF MODELS
• Search can produce arbitrarily complicated models from the open-ended language, but two main properties allow description to be automated
  – Kernels can be decomposed into a sum of products
    · A sum of kernels corresponds to a sum of functions
    · Therefore, we can describe each product of kernels separately
  – Each kernel in a product modifies a model in a consistent way
    · Each kernel roughly corresponds to an adjective
SUM OF PRODUCTS NORMAL FORM
Suppose the search finds the following kernel
SE × (WN × LIN + CP(C, PER))

The changepoint can be converted into a sum of products

SE × (WN × LIN + C × σ + PER × σ̄)

Multiplication can be distributed over addition

SE × WN × LIN + SE × C × σ + SE × PER × σ̄

Simplification rules are applied

WN × LIN + SE × σ + SE × PER × σ̄
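The distribution step can be illustrated with a toy symbolic sketch (assuming sympy; kernels are treated as plain commuting symbols, so the final simplification rules such as SE × SE → SE are not captured here):

```python
import sympy as sp

SE, WN, LIN, C, PER = sp.symbols("SE WN LIN C PER")
s, s_bar = sp.symbols("sigma sigma_bar")

expr = SE * (WN * LIN + C * s + PER * s_bar)   # changepoint already expanded
print(sp.expand(expr))
# -> C*SE*sigma + LIN*SE*WN + PER*SE*sigma_bar (a sum of products)
```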
SUMS OF KERNELS ARE SUMS OF FUNCTIONS
If f₁ ∼ GP(0, k₁) and independently f₂ ∼ GP(0, k₂), then

f₁ + f₂ ∼ GP(0, k₁ + k₂)
e.g.

[Figure: two examples in which a full model posterior with extrapolations decomposes as the sum of the posteriors of its additive components.]
We can therefore describe each component separately
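This is what makes per-component descriptions computable. A sketch (assuming numpy and kernel callables like those sketched earlier): the posterior mean of one additive component f_j conditions on the data through the kernel of the full sum.

```python
import numpy as np

def component_posterior_mean(k_j, k_full, x_tr, y_tr, x_te, noise_var=1e-2):
    """E[f_j(x*) | D] = K_j(x*, X) (K_full(X, X) + noise I)^{-1} y
    for one additive component f_j of f = f_1 + ... + f_J."""
    Kxx = k_full(x_tr, x_tr) + noise_var * np.eye(len(x_tr))
    return k_j(x_te, x_tr) @ np.linalg.solve(Kxx, y_tr)
```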
PRODUCTS OF KERNELS
PER: read as “periodic function”

On their own, each kernel is described by a standard noun phrase.
PRODUCTS OF KERNELS - SE
SE × PER: read as “approximately” × “periodic function”

Multiplication by SE removes long range correlations from a model since SE(x, x′) decreases monotonically to 0 as |x − x′| increases.
PRODUCTS OF KERNELS - LIN
SE × PER × LIN: read as “approximately” × “periodic function” × “with linearly growing amplitude”

Multiplication by LIN is equivalent to multiplying the function being modeled by a linear function. If f(x) ∼ GP(0, k), then x f(x) ∼ GP(0, k × LIN). This causes the standard deviation of the model to vary linearly without affecting the correlation.
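A quick numerical check of this identity (assuming numpy, with an illustrative SE kernel): scaling draws of f by x reproduces the covariance k × LIN, since Cov(x f(x), x′ f(x′)) = x x′ k(x, x′).

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.1, 5.0, 40)
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2)   # an SE kernel
L = np.linalg.cholesky(K + 1e-9 * np.eye(len(x)))
f = L @ rng.standard_normal((len(x), 50000))        # draws f ~ GP(0, K)
emp = np.cov(x[:, None] * f)                        # empirical covariance of x f(x)
print(np.abs(emp - K * np.outer(x, x)).max())       # small sampling error only
```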
PRODUCTS OF KERNELS - CHANGEPOINTS
SE × PER × LIN × σ: read as “approximately” × “periodic function” × “with linearly growing amplitude” × “until 1700”

Multiplication by σ is equivalent to multiplying the function being modeled by a sigmoid.
AUTOMATICALLY GENERATED REPORTS
EXAMPLE: AIRLINE PASSENGER VOLUME
[Figure: raw airline passenger data (1950–1962) and the full model posterior with extrapolations.]
Four additive components have been identified in the data:
• A linearly increasing function.
• An approximately periodic function with a period of 1.0 years and with linearly increasing amplitude.
• A smooth function.
• Uncorrelated noise with linearly increasing standard deviation.
EXAMPLE: AIRLINE PASSENGER VOLUME
This component is linearly increasing.
[Figure: posterior of component 1 and the sum of components up to component 1.]
EXAMPLE: AIRLINE PASSENGER VOLUME
This component is approximately periodic with a period of 1.0 years and varying amplitude. Across periods the shape of this function varies very smoothly. The amplitude of the function increases linearly. The shape of this function within each period has a typical lengthscale of 6.0 weeks.
[Figure: posterior of component 2 and the sum of components up to component 2.]
EXAMPLE: AIRLINE PASSENGER VOLUME
This component is a smooth function with a typical lengthscale of 8.1 months.
[Figure: posterior of component 3 and the sum of components up to component 3.]
EXAMPLE: AIRLINE PASSENGER VOLUME
This component models uncorrelated noise. The standard deviation of the noise increases linearly.
[Figure: posterior of component 4 and the sum of components up to component 4.]
EXAMPLE: SOLAR IRRADIANCE
This component is constant.
[Figure: posterior of component 1 and the sum of components up to component 1.]
EXAMPLE: SOLAR IRRADIANCE
This component is constant. This component applies from 1643 until 1716.
[Figure: posterior of component 2 and the sum of components up to component 2.]
EXAMPLE: SOLAR IRRADIANCE
This component is a smooth function with a typical lengthscale of 23.1 years. This component applies until 1643 and from 1716 onwards.
[Figure: posterior of component 3 and the sum of components up to component 3.]
EXAMPLE: SOLAR IRRADIANCE
This component is approximately periodic with a period of 10.8 years. Across periods the shape of this function varies smoothly with a typical lengthscale of 36.9 years. The shape of this function within each period is very smooth and resembles a sinusoid. This component applies until 1643 and from 1716 onwards.
[Figure: posterior of component 4 and the sum of components up to component 4.]
OTHER EXAMPLES
See http://www.automaticstatistician.com
GOOD PREDICTIVE PERFORMANCE AS WELL
Standardised RMSE over 13 data sets:

[Figure: standardised RMSE for ABCD accuracy, ABCD interpretability, spectral kernels, trend/cyclical/irregular, Bayesian MKL, Eureqa, changepoints, squared exponential, and linear regression.]
• Tweaks can be made to the algorithm to improve accuracy or interpretability of the models produced. . .
• . . . but both methods are highly competitive at extrapolation (shown above) and interpolation
MODEL CHECKING AND CRITICISM
• Good statistical modelling should include model criticism:
  – Does the data match the assumptions of the model?
  – For example, if the model assumed Gaussian noise, does a Q-Q plot reveal non-Gaussian residuals?
• Our automatic statistician does posterior predictive checks, dependence tests and residual tests
• We have also been developing more systematic nonparametric approaches to model criticism using kernel two-sample testing with MMD
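A minimal sketch of the kernel two-sample statistic behind this style of model criticism (assuming an RBF kernel and one-dimensional samples; this is the biased MMD² estimate only, not the full test with its null distribution):

```python
import numpy as np

def mmd2(x, y, gamma=1.0):
    """Biased estimate of MMD^2 between samples x and y with an RBF kernel.
    A value near zero suggests the two samples have similar distributions."""
    def k(a, b):
        return np.exp(-gamma * (a[:, None] - b[None, :]) ** 2)
    return k(x, x).mean() + k(y, y).mean() - 2.0 * k(x, y).mean()

# e.g. compare model residuals against draws from the assumed noise model
rng = np.random.default_rng(0)
print(mmd2(rng.normal(size=200), rng.normal(size=200)))    # small
print(mmd2(rng.normal(size=200), rng.standard_t(2, 200)))  # larger
```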
Lloyd, J. R., and Ghahramani, Z. (2014) Statistical Model Criticism using Kernel Two Sample Tests. http://mlg.eng.cam.ac.uk/Lloyd/papers/kernel-model-checking.pdf
CHALLENGES
• Interpretability / accuracy
• Increasing the expressivity of the language
  – e.g. monotonicity, positive functions, symmetries
• Computational complexity of searching through a huge space of models
• Extending the automatic reports to multidimensional datasets
  – Search and descriptions naturally extend to multiple dimensions, but automatically generating relevant visual summaries is harder
CURRENT AND FUTURE DIRECTIONS
• Automatic statistician for:
  – One-dimensional time series
  – Linear regression (classical)
  – Multivariate nonlinear regression (cf. Duvenaud, Lloyd et al., ICML 2013)
  – Multivariate classification (w/ Nikola Mrksic)
  – Completing and interpreting tables and databases (w/ Kee Chong Tan)
• Probabilistic programming
  – Probabilistic models are expressed in a general (Turing-complete) programming language
  – A universal inference engine can then be used to infer unobserved variables given observed data
  – This can be used to implement search over the model space in an automatic statistician
SUMMARY
• We have presented the beginnings of an automatic statistician
• Our system
  – Defines an open-ended language of models
  – Searches greedily through this space
  – Produces detailed reports describing patterns in data
  – Performs automatic model criticism
• Extrapolation and interpolation performance is highly competitive
• We believe this line of research has the potential to make powerful statistical model-building techniques accessible to non-experts
REFERENCES
Website: http://www.automaticstatistician.com
Duvenaud, D., Lloyd, J. R., Grosse, R., Tenenbaum, J. B. and Ghahramani, Z. (2013) Structure Discovery in Nonparametric Regression through Compositional Kernel Search. ICML 2013.

Lloyd, J. R., Duvenaud, D., Grosse, R., Tenenbaum, J. B. and Ghahramani, Z. (2014) Automatic Construction and Natural-language Description of Nonparametric Regression Models. AAAI 2014. http://arxiv.org/pdf/1402.4304v2.pdf

Lloyd, J. R., and Ghahramani, Z. (2014) Statistical Model Criticism using Kernel Two Sample Tests. http://mlg.eng.cam.ac.uk/Lloyd/papers/kernel-model-checking.pdf

Ghahramani, Z. (2013) Bayesian nonparametrics and the probabilistic approach to modelling. Philosophical Trans. Royal Society A 371: 20110553.
Looking for postdocs!