Bayesian Methods for Variable Selection with Applications to High-Dimensional Data
Part 5: Functional Data & Wavelets
Marina Vannucci
Rice University, USA
ABS13-Italy, 06/17-21/2013
Marina Vannucci (Rice University, USA) Bayesian Variable Selection (Part 5) ABS13-Italy 06/17-21/2013 1 / 56
Part 5: Functional Data & Wavelets
Introduction to nonparametric regression
Brief intro to wavelets
Wavelet shrinkage
Functional data (multiple curves)
Curve regression models
Curve classification
Applications to near infrared spectral data from chemometrics
Nonparametric Regression Models
Functional form of the regression function is not a simple function of a few parameters
Technically: probability models with infinitely many parameters or models on functional spaces
Typical setting: Given data (Xi, Yi), i = 1, ..., n, we wish to estimate
Y = f(X) + ε
X can be a covariate, a set of covariates, time, spatial location,...
Basic idea: Avoid simple (parametric) forms
Approaches needed to penalize for overfitting since there are now many parameters
Function estimation (avoid linearity): splines, wavelets (basis functions)
Basis functions
A basis is a set of simple functions that can be added together in a weighted fashion to form more complicated functions
Suppose X is univariate
Simplest basis: polynomial basis
f(X) = β0 + β1X + ... + βpX^p
Set of power functions to reconstruct f based on data: x, x², x³, ...
With a sufficiently large number of basis functions, we have essentially a nonparametric function estimator
Matrix notation: f = Σ_{j=0}^p Bj(x)βj, where Bj(x) = x^j and the βj are the coefficients (weights)
Polynomial basis - easy to fit but too simplistic for local features/variations
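As a small illustration (not from the slides), the polynomial basis fit is just ordinary least squares on the basis matrix with columns Bj(x) = x^j:

```python
import numpy as np

def poly_basis(x, p):
    """Basis matrix with columns B_j(x) = x**j, j = 0..p."""
    return np.vander(x, p + 1, increasing=True)

def poly_fit(x, y, p):
    """Least-squares weights beta for f(x) = sum_j beta_j x**j."""
    B = poly_basis(x, p)
    beta, *_ = np.linalg.lstsq(B, y, rcond=None)
    return beta

# Noiseless quadratic data are recovered exactly (up to round-off)
x = np.linspace(-1.0, 1.0, 50)
y = 2.0 - x + 3.0 * x**2
beta = poly_fit(x, y, p=2)
# beta ≈ [2, -1, 3]
```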
Splines
Splines are one way to model f flexibly by writing f(X) = B(X)β, where the B(·) are basis functions.
Basis functions: many choices are available - truncated power basis, B-splines, thin plate splines, penalized splines, ...
Truncated power basis of order p:
f(X) = β0 + β1X + ... + βpX^p + Σ_{k=1}^K βk+p (X − κk)^p_+
with the β's the regression coefficients, κ the knots and K the number of knots.
Linear/polynomial regression for K = 0.
Construction of splines involves specifying knots: both number and location.
Splines capture non-linear relationships between variables.
Kernel-based methods - Ruppert, Wand and Carroll (2003)
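A sketch of the truncated power basis above; the knot locations are chosen here only for illustration:

```python
import numpy as np

def truncated_power_basis(x, p, knots):
    """Columns: 1, x, ..., x**p, then (x - kappa_k)_+**p for each knot."""
    cols = [x**j for j in range(p + 1)]
    cols += [np.where(x > k, (x - k)**p, 0.0) for k in knots]
    return np.column_stack(cols)

x = np.linspace(0.0, 1.0, 101)
B = truncated_power_basis(x, p=3, knots=[0.25, 0.5, 0.75])
# K = 3 knots and order p = 3 give p + 1 + K = 7 basis functions
```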
Wavelets: Major Milestones
The basic idea behind wavelets is to represent a generic function with simpler functions (building blocks) at different scales and locations.
1807: Fourier orthogonal decomposition of periodic signals
1946: Gabor windowed Fourier transform (STFT)
1984: A. Grossmann and J. Morlet introduce the continuous wavelet transform for the analysis of seismic signals.
1985: Y. Meyer defines discrete orthonormal wavelet bases.
1989: S. Mallat links wavelets to the theory of “multiresolution analysis” (MRA), a framework that allows one to construct orthonormal bases. A discrete wavelet transform is defined as a simple recursive algorithm to compute the wavelet decomposition of a signal from its approximation at a given scale.
1989: I. Daubechies constructs wavelets with compact support and a varying degree of smoothness.
1992: D. Donoho and I. Johnstone use wavelets to remove noise from data.
Wavelets as “Small Waves”
The basic idea behind wavelets is to represent a generic function with simpler functions (building blocks) at different scales and locations
[Figure: the Mexican hat wavelet ψ(x), together with √2 ψ(2x) (a = 1/2, b = 0), (1/√2) ψ(x/2) (a = 2, b = 0), and ψ(x + 4) (a = 1, b = 4).]
Mother wavelet ψ as an oscillatory function with zero mean, ∫ ψ(x) dx = 0
Wavelets as ψ_{j,k}(x) = 2^{j/2} ψ(2^j x − k), with j, k scale and translation parameters.
Wavelets vs Fourier Representations
Given f and a basis f1, . . . , fn → series expansion
f(x) = Σ_i ai fi(x),    ai = <f, fi> = ∫ f(x) fi(x) dx
Fourier transforms measure the frequency content of a signal.
Wavelet transforms provide a time-scale analysis.
Examples of Wavelet Families
Haar wavelets. The simplest family of wavelets, already known before the formulation of the wavelet theory (Haar, 1909).
ψ(x) = 1 for 0 ≤ x < 1/2; −1 for 1/2 ≤ x < 1; and 0 otherwise.
Haar wavelets are constructed from ψ via dilations and translations.
Given a generic function f, Haar wavelets approximate f with piecewise constant functions (not continuous)
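The Haar definition above translates directly into code; ψ_{j,k}(x) = 2^{j/2} ψ(2^j x − k) gives the dilated/translated family (a minimal sketch):

```python
def haar_psi(x):
    """Haar mother wavelet: 1 on [0, 1/2), -1 on [1/2, 1), 0 otherwise."""
    if 0.0 <= x < 0.5:
        return 1.0
    if 0.5 <= x < 1.0:
        return -1.0
    return 0.0

def haar_jk(x, j, k):
    """psi_{j,k}(x) = 2**(j/2) * psi(2**j * x - k)."""
    return 2.0 ** (j / 2.0) * haar_psi(2.0 ** j * x - k)

# psi_{1,0} lives on [0, 1/2) with height sqrt(2)
```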
Daubechies Wavelets
[Figure: scaling function φ(x) and wavelet ψ(x) for Daubechies wavelets of order 2, 4, 7 and 10.]
Compact support (good time localization)
Vanishing moments: ∫ x^l ψ(x) dx = 0, l = 0, 1, . . . , N, ensure decay as <f, ψ_{j,k}> ≤ C 2^{−jN} (good for compression)
Various degrees of smoothness. For large N, φ ∈ C^{µN}, µ ≈ 0.2.
Properties of Wavelets
Wavelet series: f(x) = Σ_{j,k} <f, ψ_{j,k}> ψ_{j,k}(x)
Wavelet coefficients: Wf(j, k) = <f, ψ_{j,k}> = ∫ f(x) ψ_{j,k}(x) dx, j, k ∈ Z,
describing features of f at different locations and scales.
Properties:
Small waves with zero mean
Time-frequency localization
Good at describing non-stationarity and discontinuities
Multi-scale decomposition of functions (MRA) - sparsity, shrinkage
Recursive relationships among coefficients → DWT
Wavelets in Practice (DWT)
Let's consider a vector Y of observations of f at n equispaced points
yi = f(ti), i = 1, . . . , n, with n = 2^m
Discrete Wavelet Transforms operate via recursive filters applied to Y:

data ≡ c^(m) −H→ c^(m−1) −H→ · · · −H→ c^(m−r)
            ↘G           ↘G           ↘G
          d^(m−1)        · · ·       d^(m−r)

H: c_{m−1,k} = Σ_l h_{l−2k} c_{m,l},    G: d_{m−1,k} = Σ_l g_{l−2k} c_{m,l}, ...
Then, in practice
Y −DWT→ d(Y) = (c^(m−r), d^(m−r), d^(m−r+1), . . . , d^(m−1))
discrete approx of <f, ψ_{j,k}> at scales m − 1, . . . , m − r
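For the Haar wavelet the filters are h = (1/√2, 1/√2) and g = (1/√2, −1/√2), and the recursion above becomes a few lines of code (a sketch, one level per pass):

```python
from math import sqrt

def haar_dwt(y, levels):
    """Recursive filtering: at each level, c -> (coarser c, detail d)."""
    c, details = list(y), []
    for _ in range(levels):
        c_next = [(c[2*i] + c[2*i + 1]) / sqrt(2) for i in range(len(c) // 2)]
        d = [(c[2*i] - c[2*i + 1]) / sqrt(2) for i in range(len(c) // 2)]
        details.insert(0, d)   # coarsest details first, finest last
        c = c_next
    return c, details          # (c^(m-r), [d^(m-r), ..., d^(m-1)])

# n = 8 = 2**3 observations, r = 3 levels
c, d = haar_dwt([1.0, 1.0, 1.0, 1.0, 2.0, 2.0, 2.0, 2.0], levels=3)
# the single remaining approximation coefficient equals sqrt(n) * mean(y)
```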
Example
[Figure: a noisy signal f + ε and its DWT coefficients plotted against coefficient index.]
data ≡ Y −DWT→ d(Y) = (c^(3), d^(3), . . . , d^(7), d^(8), d^(9))
In matrix notation, Y −→ d(Y) = WY, with W determined by ψ and such that W′W = I
Summary on Wavelets
Choose the mother wavelet ψ and obtain the wavelet family {ψ_{j,k}}_{j,k∈Z}
In theory, given f, calculate the wavelet coefficients <f, ψ_{j,k}> = ∫ f(x) ψ_{j,k}(x) dx and construct the wavelet series
f(x) = Σ_{j,k} <f, ψ_{j,k}> ψ_{j,k}(x)
In practice, given data Y = (y1, . . . , yn), use the DWT to calculate approximated wavelet coefficients
The DWT uses fast recursive filtering. Use matrix notation for convenience: Z = WY
Covariance matrix of wavelet coefficients
yi = f(ti), i = 1, . . . , n, with n = 2^m
DWT: Y −DWT→ d(Y) = WY, with W determined by ψ, W′W = I
Variances and covariances of the DWT coefficients:
Σ_{d(Y)} = W Σ_Y W′
for given Σ_Y(i, j) = [γ(|i − j|)], with γ(τ) the autocovariance function of the process generating the data
VANNUCCI & CORRADI (JRSS, Series B, 1999).
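The identity Σ_{d(Y)} = W Σ_Y W′ can be checked numerically with the one-level Haar transform matrix on n = 4 points (a sketch; the autocovariance γ(τ) = ρ^τ is an illustrative choice, not from the slides):

```python
import numpy as np

n, rho = 4, 0.5
s = 1 / np.sqrt(2)
# One-level Haar transform: two approximation rows, two detail rows
W = np.array([[s,  s, 0,  0],
              [0,  0, s,  s],
              [s, -s, 0,  0],
              [0,  0, s, -s]])
assert np.allclose(W.T @ W, np.eye(n))       # orthogonality: W'W = I

# Stationary autocovariance gamma(tau) = rho**tau (illustrative)
Sigma_Y = rho ** np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
Sigma_d = W @ Sigma_Y @ W.T                  # covariance of the DWT coefficients
```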
Fields of Application
Applied Mathematics: Solutions of partial/ordinary differential equations; representation of basic operators ...
Engineering: Signal and image processing (compression, smoothing) ...
Statistics: Smoothing of noisy data; nonparametric density and regression estimation; representation of stochastic processes, time series, functional data ...
Physics, Biology, Genetics: Turbulence, DNA sequence analysis, magnetic resonance imaging ...
Music, Human Vision, Computer Graphics ...
Wavelet Shrinkage
yi = f(ti) + εi, i = 1, . . . , n = 2^m (equispaced design), εi ∼ N(0, σ²)
Noise removal technique proposed by Donoho and Johnstone (1992-1998)
Compute empirical wavelet coefficients → apply shrinkage/thresholding → reconstruct f
Implemented through the fast DWT
Optimal for general function classes (Hölder and Sobolev) as well as spaces of irregular functions such as those of “bounded variation”.
Bayesian wavelet shrinkage estimators have better MSE properties for finite samples.
3-step procedure
yi = f(ti) + εi, εi i.i.d. N(0, σ²)
Step 1: Y −DWT→ d(Y) = d(f) + ε′, ε′ ∼ N(0, σ²I)
Step 2: Estimate d(f)
Remove ε′ by thresholding/shrinking d(Y)
Use mixture priors on d(f)
Step 3: Inverse transform to estimate f
Hard and soft thresholding rules
[Figure: a noisy signal and its DWT coefficients displayed by level (levels 1–5), before and after thresholding.]
d̂_{j,k} = d_{j,k} I(|d_{j,k}| > λ) (hard thresholding)
d̂_{j,k} = (d_{j,k} − sign(d_{j,k}) λ) I(|d_{j,k}| > λ) (soft thresholding)
with global or level-dependent thresholds.
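The two rules translate directly into code (a sketch; λ = 1 is an illustrative threshold):

```python
def hard_threshold(d, lam):
    """Keep coefficients with |d| > lambda, zero out the rest."""
    return [x if abs(x) > lam else 0.0 for x in d]

def soft_threshold(d, lam):
    """Zero small coefficients and shrink survivors towards zero by lambda."""
    return [(x - lam if x > 0 else x + lam) if abs(x) > lam else 0.0 for x in d]

coeffs = [5.0, -0.3, 1.2, -2.0, 0.1]
# hard, lam=1: [5.0, 0.0, 1.2, -2.0, 0.0]; soft shrinks the survivors by 1
```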
Bayesian Wavelet Shrinkage
At step 2 write the model as (d = WY, θ = Wf)
d | θ, σ² ∼ N(θ, σ²I),
and impose priors on the unknown parameters θ (and possibly σ²).
Use spike-and-slab priors
θ_{j,k} | s_{j,k} ∼ s_{j,k} N(0, ·) + (1 − s_{j,k}) I_0,    s_{j,k} ∼ Ber(p_j)
(Clyde et al. 1998; Clyde & George 2000)
Shrinkage is a by-product of these procedures, as Bayes rules naturally shrink posterior estimates towards prior quantities. Also, shrinkage is adaptive when level-specific variances are specified. Can also specify dependent priors (Vannucci & Corradi, 1999).
Estimate θ via an appropriate Bayes rule (posterior mean, median) and estimate f by W^{−1}θ (step 3).
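For a single coefficient with a normal slab, the posterior mean under the spike-and-slab prior has a closed form and acts as a smooth shrinkage rule. A minimal sketch (the values of σ², the slab variance τ² and the prior inclusion probability p are illustrative, not from the slides):

```python
from math import exp, pi, sqrt

def norm_pdf(x, var):
    return exp(-x * x / (2 * var)) / sqrt(2 * pi * var)

def spike_slab_posterior_mean(d, sigma2, tau2, p):
    """Posterior mean of theta for d | theta ~ N(theta, sigma2),
    theta ~ p N(0, tau2) + (1 - p) delta_0."""
    m1 = p * norm_pdf(d, sigma2 + tau2)        # marginal under the slab
    m0 = (1 - p) * norm_pdf(d, sigma2)         # marginal under the spike
    post_incl = m1 / (m1 + m0)                 # P(theta != 0 | d)
    return post_incl * tau2 / (sigma2 + tau2) * d

# Large coefficients are kept almost intact, small ones shrunk hard to 0
big = spike_slab_posterior_mean(8.0, sigma2=1.0, tau2=25.0, p=0.5)
small = spike_slab_posterior_mean(0.3, sigma2=1.0, tau2=25.0, p=0.5)
```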
Useful References
Daubechies, I. (1992). Ten Lectures on Wavelets. SIAM.
Mallat, S.G. (1989). Multiresolution approximations and wavelet orthonormal bases of L²(R). Transactions of the American Mathematical Society, 315(1), 69-87.
Nason, G. (2008). Wavelet Methods in Statistics with R. Springer Verlag: New York. (with R package)
Antoniadis, A., Bigot, J. and Sapatinas, T. (2001). Wavelet estimators in nonparametric regression: A comparative simulation study. Journal of Statistical Software, 6(6), 1-83. (Matlab software)
Wavelab (Matlab) and WaveThresh (R)
Functional Data (multiple curves)
Data observed over an ordered domain: curves, images, 3-d objects
First step: characterize the unknown function via basis functions (projection to a low dimension). Multiple choices: splines, wavelets, functional principal components (empirical basis).
Second step: fit models to the “reduced” space. Functional analogues to standard models: functional regression, functional mixed models, generalized functional regression, etc.
Bayesian framework allows unified modeling perspective.
Wavelets for Functional Data
Overall objective: Methods for prediction or classification or clustering basedon functional data (multiple curves).
Regression: a continuous response is observed.
Classification: class membership observed in a training set.
Clustering: group membership needs to be uncovered.
Data are curves.
Motivating Examples from NIR Calibration
Experiment: 40 biscuit doughs made with variations in the quantities of fat, flour, sugar and water in the recipe. Fat 15-21%, Sugar 10-23%, Flour 44-54%, Water 11-17%.
Data: Composition (by weight) of the 4 constituents and spectral data at 700 wavelengths (from 1100nm to 2498nm in steps of 2nm) for each dough piece.
Aim: Use the spectral data to predict the composition. Wavelets as a dimension reduction tool that preserves local features.
[Figure: NIR spectra of the biscuit doughs, training and validation sets.]
Another Example
NIR spectra used in the analysis of food, drink and pharmaceutical products, measured at hundreds of wavelengths.
3 varieties of wheat, 94 observations. NIR spectra with 100 wavelengths from 850 to 1048nm in steps of 2nm.
Aim: Use the spectral data to predict the wheat variety.
[Figure: NIR absorbance spectra (850–1048nm) for wheat varieties 1, 2 and 3.]
Our Approach
Use wavelets as a dimension reduction tool that preserves local features. Apply the DWT to the curves. Wavelet coefficients summarise curve features in an efficient way; they are localized (unlike Fourier coefficients) and can capture noise in the data.
Develop Bayesian methods in the wavelet domain for simultaneous feature selection and prediction or classification. We use mixture priors and Bayes methods to look for wavelet components that describe well the variation in the responses.
Wavelet coefficient selection is done by assigning a probability to every possible subset and then searching for subsets with high probability. Small coefficients can be important.
Curve Regression
The basic setup is a multivariate linear regression model, with n observations on a q-variate response and p explanatory variables
Y = 1_n α′ + XB + E
Y (n × q) - responses, X (n × p) - predictors
Concern is with functional predictor data, i.e. the situation where each row of X is a vector of observations of a curve x(t) at p equally spaced points.
Interest is in situations where p is large → variable selection and/or dimension reduction methods are needed
Wavelet Transformation
A wavelet transform is applied to each row of X:
Y = 1_n α′ + X W′W B + E,    W′W = I
[Figure: three reflectance spectra and their corresponding DWT coefficients.]
Y = 1_n α′ + D B̃ + E
with B̃ = WB and D = XW′ a matrix of wavelet coefficients.
Prior Model to Select Wavelet Coefficients
Mixture priors to represent negligible coefficients. A latent binary vector γ identifies the different models:
γ = (γ1, . . . , γp), γj = 0, 1; γj = 1 ↔ Dj in the model
Coefficients are drawn from a mixture distribution
B̃ − B̃_0 ∼ N(H_γ, Σ),    B̃_{:,j} ∼ (1 − γj) I_0 + γj N(0, h_{jj} Σ)
π(γ) as single Bernoulli's, Beta-Binomial or related priors
Prob(γj = 1) = wj, wj = w, j = 1, ..., p
Prior Model
If H is the var-cov matrix of the prior on B, then
H̃ = W H W′
We compute the transformed covariance priors on B̃ as W H W′ using results of Vannucci and Corradi, JRSSB (1999).
Priors on α and Σ are unchanged.
(B → B̃, H → H̃)
Metropolis Search
MCMC used to “search” the posterior g(γ) ∝ π(γ | Y, D) looking for good models.
At each step the algorithm generates γ^new from γ^old by one of two possible moves:
Add or Delete a component, by choosing at random one component in γ^old and changing its value.
Swap two components, by choosing independently at random a 0 and a 1 in γ^old and changing both of them.
The new candidate γ^new is accepted with probability
min{ g(γ^new) / g(γ^old), 1 }
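The add/delete/swap proposal is a few lines of code; here g(γ) is left abstract, and the toy log-score below stands in for log π(γ | Y, D) purely for illustration:

```python
import math
import random

def propose(gamma):
    """Add/Delete a component, or Swap a randomly chosen 0 and a 1."""
    new = list(gamma)
    ones = [j for j, g in enumerate(new) if g == 1]
    zeros = [j for j, g in enumerate(new) if g == 0]
    if ones and zeros and random.random() < 0.5:      # Swap move
        new[random.choice(ones)] = 0
        new[random.choice(zeros)] = 1
    else:                                             # Add or Delete move
        j = random.randrange(len(new))
        new[j] = 1 - new[j]
    return new

def metropolis_step(gamma, log_g):
    """Accept gamma_new with probability min{g(new)/g(old), 1}."""
    new = propose(gamma)
    log_ratio = log_g(new) - log_g(gamma)
    if log_ratio >= 0 or random.random() < math.exp(log_ratio):
        return new
    return gamma

# Toy target favouring models with exactly 3 included variables
random.seed(1)
log_g = lambda g: -abs(sum(g) - 3)
gamma = [0] * 10
for _ in range(500):
    gamma = metropolis_step(gamma, log_g)
```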
Posterior Inference
The stochastic search results in a list of visited models (γ^(0), γ^(1), . . .) and their corresponding relative posterior probabilities
p(γ^(0) | D, Y), p(γ^(1) | D, Y), . . .
Select variables:
- in the “best” models, i.e. the γ's with highest p(γ | D, Y), or
- with the largest marginal posterior probabilities
Prediction of future Y^f:
- via Bayesian model averaging (BMA), as a posterior-weighted average of model predictions
- via single-model predictions, as LS/Bayes predictions on single best models, or LS/Bayes predictions with “threshold” models (e.g., the “median” model) obtained from the estimated marginal probabilities of inclusion.
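BMA weights each visited model's prediction by its renormalized posterior probability; a sketch with hypothetical per-model numbers:

```python
def bma_predict(model_probs, model_preds):
    """Posterior-weighted average of single-model predictions.

    model_probs: relative posterior probabilities of the visited gammas
    model_preds: the corresponding model-specific predictions of Y_f
    """
    total = sum(model_probs)
    weights = [p / total for p in model_probs]   # renormalize over visited models
    return sum(w * yhat for w, yhat in zip(weights, model_preds))

# Three visited models (illustrative numbers only)
pred = bma_predict([0.5, 0.3, 0.2], [1.0, 2.0, 4.0])
# 0.5*1 + 0.3*2 + 0.2*4 = 1.9
```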
Prior Specification
Priors on α and Σ vague and largely uninformative
α′ − α′_0 ∼ N(h, Σ), h → ∞,    Σ ∼ IW(δ, Q)
Choices for H_γ:
H_γ = C [(D′D)^{−1}]_γ
H_γ = C [diag(D′D)^{−1}]_γ
H = C I
H = AR(σ, ρ), with EB estimates of the hyperparameters,
L(·; D, Y) = |K|^{−q/2} |Q|^{−n/2} |I_n + K^{−1} Y Q^{−1} Y′|^{−(δ+q+n−1)/2},    K = I + D H D′
Choice of w:
Diagonal Elements of H̃ = W H W′
[Figure: prior variances of the regression coefficients (diagonal elements of H̃) plotted against wavelet coefficient index.]
NIR Data on Biscuit Doughs
4 parallel MCMC chains with 100,000 iterations and about 40,000 successful moves per chain.
H(i, j) = σ² ρ^{|i−j|}, σ² = 254, ρ = .32
Modal model: 10 coeffs (0%, 0%, 0%, 37%, 6%, 6%, 1%, 2%, from coarsest to finest level)
MSE = (.059, .466, .351, .047)
BMA: 500 models and 219 coeffs
MSE = (.063, .449, .348, .050)
PLS: MSE = (.151, .583, .375, .105)
PCR: MSE = (.160, .614, .388, .106)
Stepwise MLR: MSE = (.044, 1.188, .722, .221)
Diagnostic Plots
[Figure: (a) number of ones in γ and (b) log relative posterior probabilities, plotted against iteration number.]
Marginal Plots
[Figure: marginal posterior probabilities of the wavelet coefficients, panels (i)–(iv).]
IDWT to Columns of LS Estimate of B
[Figure: coefficient curves over wavelength, obtained by applying the inverse DWT to the columns of the LS estimate of B̃, for Fat, Sugar, Flour and Water.]
Probit Models for Curve Classification
Z → n × 1 categorical response vector (J categories).
X → n × p predictor matrix. We have functional predictor data.
Probit models use data augmentation and latent data to write
Y_i = α′ + X′_i B + ε_i,    ε_i ∼ N(0, Σ), i = 1, . . . , n
Relationship between the realization z_i and the unobserved Y_i:
z_i = 0 if y_{i,j} < 0 for every j;  z_i = j if y_{i,j} = max_{1≤k≤J−1} y_{i,k}
... now transform to wavelets, use selection priors on the wavelet coefficients and MCMC for inference in a probit model ...
Results for Wheat Data
3 classes, 94 samples, NIR transmission spectra at p = 100 wavelengths from 850 to 1048nm.

Variety      1    2    3   Total
Training    32   22   17    71
Validation  10    7    6    23

Fearn et al. (2001): a Bayesian decision approach that balances the cost of variables against the loss from misclassification.
Mis-classification Errors
Prediction           Var. Sel. / # PCs   Error
“Best” model         5                   3/23
Marginal model       9                   3/23
BD (Fearn et al.)    6                   5/23
BD (Fearn et al.)    12                  3/23
LDA with PCA         14-18 PCs           4/23
QDA with PCA         8-10 PCs            7/23
[Figure: marginal posterior probabilities of the wavelet coefficients for the wheat data, panels (a)–(d).]
[Figure: regression coefficient curves β1 and β2 plotted over wavelength, panels (a)–(d).]
Plots (a) and (c) are obtained using 4 wavelet coefficients, and plots (b) and (d) using 9 coefficients.
Code from my Website
bvsmewav: Bayesian Variable Selection applied to wavelet coefficients
Metropolis search
non-diagonal selection prior
Bernoulli priors or Beta-Binomial prior
Predictions by LS and BMA
http://stat.rice.edu/~marina
Discriminant Analysis for Classification
DA classifies observations into groups. Data from group g are modeled as
X_g − 1_{n_g} µ′_g ∼ N(I, Σ_g),
with g = 1, . . . , G, and where the vector µ_g and the matrix Σ_g are the mean and the covariance matrix of the g-th group, respectively.
Group assignments: y = (y_1, . . . , y_n)′, where y_i = g if the i-th observation comes from group g
Conjugate prior model:
µ_g ∼ N(m_g, h_g Σ_g)
Σ_g ∼ IW(δ_g, Ω_g),
where Ω_g is a scale matrix and δ_g a shape parameter.
Transformation to Wavelet Domain
We apply the DWT as D_g = X_g W′, with W′W = I. The model becomes
D_g − 1_{n_g} µ̃′_g ∼ N(I, Σ̃_g),
where Σ̃_g = W Σ_g W′ and µ̃_g = W µ_g. The prior model on Σ_g transforms accordingly.
The predictive distribution (Student-t) is used to classify new samples:
π_g(y^f | D) = p_g(d^f) π_g / Σ_{i=1}^G p_i(d^f) π_i,
where p_g(d^f) indicates the predictive distribution, and the prior probability that an observation comes from group g is π_g = n_g/n. A new observation is then assigned to the group with the highest posterior probability.
Variable selection
Want to select discriminating wavelet coefficients.
Introduce a latent p-vector γ with binary entries:
γ_j = 1 if coefficient j defines a mixture distribution, γ_j = 0 otherwise.
The likelihood function is given by
L(D, y; ·) = ∏_{i=1}^n p(D_i(γ^c) | D_i(γ)) ∏_{g=1}^G w_g^{n_g} ∏_{i=1}^{n_g} p_g(D_i(γ)),
Conjugate priors for selected and non-selected coefficients.
Markov Random Tree Priors on γ
P(γ_j | γ_i, i ∈ N_j) = exp(γ_j F(γ_j)) / [1 + exp(F(γ_j))]    (1)
where F(γ_j) = µ + η Σ_{i∈N_j} (2γ_i − 1) and N_j is the set of children of coefficient j.
Here µ controls sparsity. Higher η's induce more neighbors to take on the same values.
Model Fitting - MCMC
Can marginalize over the model parameters.
Update γ by a Metropolis algorithm.
Classify new samples based on the selected coefficients.
An Example
NIR spectra used in the analysis of food, drink and pharmaceutical products, measured at hundreds of wavelengths.
5 varieties of meat (Beef, Chicken, Lamb, Pork and Turkey), 231 observations. NIR spectra with 1024 reflectances over the range 400-2498nm at intervals of 2nm.
Aim: Use the spectral data to predict the meat variety.
[Figure: NIR absorbance spectra (400–2498nm) for Beef, Lamb, Pork, Turkey and Chicken.]
Results
                     Predicted
Truth     Beef     Lamb   Pork    Turkey   Chicken
Beef      100.0    0.0    0.0     0.0      0.0
Lamb      5.9 (1)  94.1   0.0     0.0      0.0
Pork      0.0      0.0    100.0   0.0      0.0
Turkey    0.0      0.0    0.0     92.6     7.4 (2)
Chicken   0.0      0.0    0.0     3.7 (1)  96.3

Table: Validation set: classification results for the five different species of meat using a threshold of 0.4 for the posterior probability of inclusion (14 coeffs). The number of misclassified units is reported in parentheses (3.4% total mis-classification rate).
LDA and QDA on selected (11-14) PCA components achieve mis-classification rates of 5.3% and 6.1%.
[Figure: posterior probabilities of inclusion of the wavelet coefficients for the meat data.]
Main References
Brown, P.J., Fearn, T. and Vannucci, M. (2001). Bayesian wavelet regression on curves with application to a spectroscopic calibration problem. Journal of the American Statistical Association, 96, 398-408.
Vannucci, M., Sha, N. and Brown, P.J. (2005). NIR and mass spectra classification: Bayesian methods for wavelet-based feature selection. Chemometrics and Intelligent Laboratory Systems, 77, 139-148.
Stingo, F.C., Vannucci, M. and Downey, G. (2011). Bayesian wavelet-based curve classification via discriminant analysis with Markov random tree priors. Statistica Sinica, 22, 465-488.