Bayesian Methods for Variable Selection with Applications to High-Dimensional Data
Part 5: Functional Data & Wavelets
Marina Vannucci
Rice University, USA
ABS13-Italy, 06/17-21/2013
Marina Vannucci (Rice University, USA) Bayesian Variable Selection (Part 5) ABS13-Italy 06/17-21/2013 1 / 56
Part 5: Functional Data & Wavelets
Introduction to nonparametric regression
Brief intro to wavelets
Wavelet shrinkage
Functional data (multiple curves)
Curve regression models
Curve classification
Applications to near infrared spectral data from chemometrics
Nonparametric Regression Models
Functional form of the regression function is not a simple function of a few parameters
Technically: probability models with infinitely many parameters or models on functional spaces
Typical setting: Given data (Xi, Yi), i = 1, ..., n, we wish to estimate
Y = f(X) + ε
X can be a covariate, a set of covariates, time, spatial location,...
Basic idea: Avoid simple (parametric) forms
Approaches needed to penalize for overfitting since there are now many parameters
Function estimation (avoid linearity): splines, wavelets (basis functions)
Basis functions
A basis is a set of simple functions that can be added together in a weighted fashion to form more complicated functions
Suppose X is univariate
Simplest basis: polynomial basis
f(X) = β0 + β1X + ... + βpX^p
Set of power functions to reconstruct f based on data: x, x², x³, ...
With a sufficiently large number of basis functions, we have essentially a nonparametric function estimator
Matrix notation: f = Σ_{j=0}^p Bj(x)βj, where Bj(x) = x^j and the βj are the coefficients (weights)
Polynomial basis - easy to fit but too simplistic for local features/variations
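As a small illustration (not from the slides), the polynomial basis fit is just ordinary least squares on the basis matrix with columns Bj(x) = x^j:

```python
import numpy as np

def poly_basis(x, p):
    """Basis matrix with columns B_j(x) = x**j, j = 0..p."""
    return np.vander(x, p + 1, increasing=True)

def poly_fit(x, y, p):
    """Least-squares weights beta for f(x) = sum_j beta_j x**j."""
    B = poly_basis(x, p)
    beta, *_ = np.linalg.lstsq(B, y, rcond=None)
    return beta

# Noiseless quadratic data are recovered exactly (up to round-off)
x = np.linspace(-1.0, 1.0, 50)
y = 2.0 - x + 3.0 * x**2
beta = poly_fit(x, y, p=2)
# beta ≈ [2, -1, 3]
```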
Splines
Splines are one way to model f flexibly by writing f(X) = B(X)β, where the B(·) are basis functions.
Basis functions: many choices are available - truncated power basis, B-splines, thin plate splines, penalized splines, ...
Truncated power basis of order p:
f(X) = β0 + β1X + ... + βpX^p + Σ_{k=1}^K βk+p (X − κk)^p_+
with the β's the regression coefficients, κ the knots and K the number of knots.
Linear/polynomial regression for K = 0.
Construction of splines involves specifying knots: both number and location.
Splines capture non-linear relationships between variables.
Kernel-based methods - Ruppert, Wand and Carroll (2003)
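A sketch of the truncated power basis above; the knot locations are chosen here only for illustration:

```python
import numpy as np

def truncated_power_basis(x, p, knots):
    """Columns: 1, x, ..., x**p, then (x - kappa_k)_+**p for each knot."""
    cols = [x**j for j in range(p + 1)]
    cols += [np.where(x > k, (x - k)**p, 0.0) for k in knots]
    return np.column_stack(cols)

x = np.linspace(0.0, 1.0, 101)
B = truncated_power_basis(x, p=3, knots=[0.25, 0.5, 0.75])
# K = 3 knots and order p = 3 give p + 1 + K = 7 basis functions
```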
Wavelets: Major Milestones
The basic idea behind wavelets is to represent a generic function with simpler functions (building blocks) at different scales and locations.
1807: Fourier orthogonal decomposition of periodic signals
1946: Gabor windowed Fourier transform (STFT)
1984: A. Grossmann and J. Morlet introduce the continuous wavelet transform for the analysis of seismic signals.
1985: Y. Meyer defines discrete orthonormal wavelet bases.
1989: S. Mallat links wavelets to the theory of “multiresolution analysis” (MRA), a framework that allows one to construct orthonormal bases. A discrete wavelet transform is defined as a simple recursive algorithm to compute the wavelet decomposition of a signal from its approximation at a given scale.
1989: I. Daubechies constructs wavelets with compact support and a varying degree of smoothness.
1992: D. Donoho and I. Johnstone use wavelets to remove noise from data.
Wavelets as “Small Waves”
The basic idea behind wavelets is to represent a generic function with simpler functions (building blocks) at different scales and locations
[Figure: the Mexican hat wavelet ψ(x), together with √2 ψ(2x) (a = 1/2, b = 0), (1/√2) ψ(x/2) (a = 2, b = 0), and ψ(x + 4) (a = 1, b = 4).]
Mother wavelet ψ as an oscillatory function with zero mean, ∫ ψ(x) dx = 0
Wavelets as ψ_{j,k}(x) = 2^{j/2} ψ(2^j x − k), with j, k scale and translation parameters.
Wavelets vs Fourier Representations
Given f and a basis f1, . . . , fn → series expansion
f(x) = Σ_i ai fi(x),    ai = <f, fi> = ∫ f(x) fi(x) dx
Fourier transforms measure the frequency content of a signal.
Wavelet transforms provide a time-scale analysis.
Examples of Wavelet Families
Haar wavelets. The simplest family of wavelets, already known before the formulation of the wavelet theory (Haar, 1909).
ψ(x) = 1 for 0 ≤ x < 1/2; −1 for 1/2 ≤ x < 1; and 0 otherwise.
Haar wavelets are constructed from ψ via dilations and translations.
Given a generic function f, Haar wavelets approximate f with piecewise constant functions (not continuous)
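The Haar definition above translates directly into code; ψ_{j,k}(x) = 2^{j/2} ψ(2^j x − k) gives the dilated/translated family (a minimal sketch):

```python
def haar_psi(x):
    """Haar mother wavelet: 1 on [0, 1/2), -1 on [1/2, 1), 0 otherwise."""
    if 0.0 <= x < 0.5:
        return 1.0
    if 0.5 <= x < 1.0:
        return -1.0
    return 0.0

def haar_jk(x, j, k):
    """psi_{j,k}(x) = 2**(j/2) * psi(2**j * x - k)."""
    return 2.0 ** (j / 2.0) * haar_psi(2.0 ** j * x - k)

# psi_{1,0} lives on [0, 1/2) with height sqrt(2)
```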
Daubechies Wavelets
[Figure: scaling function φ(x) and wavelet ψ(x) for Daubechies wavelets of order 2, 4, 7 and 10.]
Compact support (good time localization)
Vanishing moments: ∫ x^l ψ(x) dx = 0, l = 0, 1, . . . , N, ensure decay as <f, ψ_{j,k}> ≤ C 2^{−jN} (good for compression)
Various degrees of smoothness. For large N, φ ∈ C^{µN}, µ ≈ 0.2.
Properties of Wavelets
Wavelet series: f(x) = Σ_{j,k} <f, ψ_{j,k}> ψ_{j,k}(x)
Wavelet coefficients: Wf(j, k) = <f, ψ_{j,k}> = ∫ f(x) ψ_{j,k}(x) dx, j, k ∈ Z,
describing features of f at different locations and scales.
Properties:
Small waves with zero mean
Time-frequency localization
Good at describing non-stationarity and discontinuities
Multi-scale decomposition of functions (MRA) - sparsity, shrinkage
Recursive relationships among coefficients → DWT
Wavelets in Practice (DWT)
Let's consider a vector Y of observations of f at n equispaced points
yi = f(ti), i = 1, . . . , n, with n = 2^m
Discrete Wavelet Transforms operate via recursive filters applied to Y:

data ≡ c^(m) −H→ c^(m−1) −H→ · · · −H→ c^(m−r)
            ↘G           ↘G           ↘G
          d^(m−1)        · · ·       d^(m−r)

H: c_{m−1,k} = Σ_l h_{l−2k} c_{m,l},    G: d_{m−1,k} = Σ_l g_{l−2k} c_{m,l}, ...
Then, in practice
Y −DWT→ d(Y) = (c^(m−r), d^(m−r), d^(m−r+1), . . . , d^(m−1))
discrete approx of <f, ψ_{j,k}> at scales m − 1, . . . , m − r
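For the Haar wavelet the filters are h = (1/√2, 1/√2) and g = (1/√2, −1/√2), and the recursion above becomes a few lines of code (a sketch, one level per pass):

```python
from math import sqrt

def haar_dwt(y, levels):
    """Recursive filtering: at each level, c -> (coarser c, detail d)."""
    c, details = list(y), []
    for _ in range(levels):
        c_next = [(c[2*i] + c[2*i + 1]) / sqrt(2) for i in range(len(c) // 2)]
        d = [(c[2*i] - c[2*i + 1]) / sqrt(2) for i in range(len(c) // 2)]
        details.insert(0, d)   # coarsest details first, finest last
        c = c_next
    return c, details          # (c^(m-r), [d^(m-r), ..., d^(m-1)])

# n = 8 = 2**3 observations, r = 3 levels
c, d = haar_dwt([1.0, 1.0, 1.0, 1.0, 2.0, 2.0, 2.0, 2.0], levels=3)
# the single remaining approximation coefficient equals sqrt(n) * mean(y)
```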
Example
[Figure: a noisy signal f + ε and its DWT coefficients plotted against coefficient index.]
data ≡ Y −DWT→ d(Y) = (c^(3), d^(3), . . . , d^(7), d^(8), d^(9))
In matrix notation, Y −→ d(Y) = WY, with W determined by ψ and such that W′W = I
Summary on Wavelets
Choose the mother wavelet ψ and obtain the wavelet family {ψ_{j,k}}_{j,k∈Z}
In theory, given f, calculate the wavelet coefficients <f, ψ_{j,k}> = ∫ f(x) ψ_{j,k}(x) dx and construct the wavelet series
f(x) = Σ_{j,k} <f, ψ_{j,k}> ψ_{j,k}(x)
In practice, given data Y = (y1, . . . , yn), use the DWT to calculate approximated wavelet coefficients
The DWT uses fast recursive filtering. Use matrix notation for convenience: Z = WY
Covariance matrix of wavelet coefficients
yi = f(ti), i = 1, . . . , n, with n = 2^m
DWT: Y −DWT→ d(Y) = WY, with W determined by ψ, W′W = I
Variances and covariances of the DWT coefficients:
Σ_{d(Y)} = W Σ_Y W′
for given Σ_Y(i, j) = [γ(|i − j|)], with γ(τ) the autocovariance function of the process generating the data
VANNUCCI & CORRADI (JRSS, Series B, 1999).
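The identity Σ_{d(Y)} = W Σ_Y W′ can be checked numerically with the one-level Haar transform matrix on n = 4 points (a sketch; the autocovariance γ(τ) = ρ^τ is an illustrative choice, not from the slides):

```python
import numpy as np

n, rho = 4, 0.5
s = 1 / np.sqrt(2)
# One-level Haar transform: two approximation rows, two detail rows
W = np.array([[s,  s, 0,  0],
              [0,  0, s,  s],
              [s, -s, 0,  0],
              [0,  0, s, -s]])
assert np.allclose(W.T @ W, np.eye(n))       # orthogonality: W'W = I

# Stationary autocovariance gamma(tau) = rho**tau (illustrative)
Sigma_Y = rho ** np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
Sigma_d = W @ Sigma_Y @ W.T                  # covariance of the DWT coefficients
```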
Fields of Application
Applied Mathematics: Solutions of partial/ordinary differential equations; representation of basic operators ...
Engineering: Signal and image processing (compression, smoothing) ...
Statistics: Smoothing of noisy data; nonparametric density and regression estimation; representation of stochastic processes, time series, functional data ...
Physics, Biology, Genetics: Turbulence, DNA sequence analysis, magnetic resonance imaging ...
Music, Human Vision, Computer Graphics ...
Wavelet Shrinkage
yi = f(ti) + εi, i = 1, . . . , n = 2^m (equispaced design), εi ∼ N(0, σ²)
Noise removal technique proposed by Donoho and Johnstone (1992-1998)
Compute empirical wavelet coefficients → apply shrinkage/thresholding → reconstruct f
Implemented through the fast DWT
Optimal for general function classes (Hölder and Sobolev) as well as spaces of irregular functions such as those of “bounded variation”.
Bayesian wavelet shrinkage estimators have better MSE properties for finite samples.
3-step procedure
yi = f(ti) + εi, εi i.i.d. N(0, σ²)
Step 1: Y −DWT→ d(Y) = d(f) + ε′, ε′ ∼ N(0, σ²I)
Step 2: Estimate d(f)
Remove ε′ by thresholding/shrinking d(Y)
Use mixture priors on d(f)
Step 3: Inverse transform to estimate f
Hard and soft thresholding rules
[Figure: a noisy signal and its DWT coefficients displayed by level (levels 1–5), before and after thresholding.]
d̂_{j,k} = d_{j,k} I(|d_{j,k}| > λ) (hard thresholding)
d̂_{j,k} = (d_{j,k} − sign(d_{j,k}) λ) I(|d_{j,k}| > λ) (soft thresholding)
with global or level-dependent thresholds.
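The two rules translate directly into code (a sketch; λ = 1 is an illustrative threshold):

```python
def hard_threshold(d, lam):
    """Keep coefficients with |d| > lambda, zero out the rest."""
    return [x if abs(x) > lam else 0.0 for x in d]

def soft_threshold(d, lam):
    """Zero small coefficients and shrink survivors towards zero by lambda."""
    return [(x - lam if x > 0 else x + lam) if abs(x) > lam else 0.0 for x in d]

coeffs = [5.0, -0.3, 1.2, -2.0, 0.1]
# hard, lam=1: [5.0, 0.0, 1.2, -2.0, 0.0]; soft shrinks the survivors by 1
```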
Bayesian Wavelet Shrinkage
At step 2 write the model as (d = WY, θ = Wf)
d | θ, σ² ∼ N(θ, σ²I),
and impose priors on the unknown parameters θ (and possibly σ²).
Use spike-and-slab priors
θ_{j,k} | s_{j,k} ∼ s_{j,k} N(0, ·) + (1 − s_{j,k}) I_0,    s_{j,k} ∼ Ber(p_j)
(Clyde et al. 1998; Clyde & George 2000)
Shrinkage is a by-product of these procedures, as Bayes rules naturally shrink posterior estimates towards prior quantities. Also, shrinkage is adaptive when level-specific variances are specified. Can also specify dependent priors (Vannucci & Corradi, 1999).
Estimate θ via an appropriate Bayes rule (posterior mean, median) and estimate f by W^{−1}θ (step 3).
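For a single coefficient with a normal slab, the posterior mean under the spike-and-slab prior has a closed form and acts as a smooth shrinkage rule. A minimal sketch (the values of σ², the slab variance τ² and the prior inclusion probability p are illustrative, not from the slides):

```python
from math import exp, pi, sqrt

def norm_pdf(x, var):
    return exp(-x * x / (2 * var)) / sqrt(2 * pi * var)

def spike_slab_posterior_mean(d, sigma2, tau2, p):
    """Posterior mean of theta for d | theta ~ N(theta, sigma2),
    theta ~ p N(0, tau2) + (1 - p) delta_0."""
    m1 = p * norm_pdf(d, sigma2 + tau2)        # marginal under the slab
    m0 = (1 - p) * norm_pdf(d, sigma2)         # marginal under the spike
    post_incl = m1 / (m1 + m0)                 # P(theta != 0 | d)
    return post_incl * tau2 / (sigma2 + tau2) * d

# Large coefficients are kept almost intact, small ones shrunk hard to 0
big = spike_slab_posterior_mean(8.0, sigma2=1.0, tau2=25.0, p=0.5)
small = spike_slab_posterior_mean(0.3, sigma2=1.0, tau2=25.0, p=0.5)
```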
Useful References
Daubechies, I. (1992). Ten Lectures on Wavelets. SIAM.
Mallat, S.G. (1989). Multiresolution approximations and wavelet orthonormal bases of L²(R). Transactions of the American Mathematical Society, 315(1), 69-87.
Nason, G. (2008). Wavelet Methods in Statistics with R. Springer Verlag: New York. (with R package)
Antoniadis, A., Bigot, J. and Sapatinas, T. (2001). Wavelet estimators in nonparametric regression: A comparative simulation study. Journal of Statistical Software, 6(6), 1-83. (Matlab software)
Wavelab (Matlab) and WaveThresh (R)
Functional Data (multiple curves)
Data observed over an ordered domain: curves, images, 3-d objects
First step: characterize the unknown function via basis functions (projection to a low dimension). Multiple choices: splines, wavelets, functional principal components (empirical basis).
Second step: fit models to the “reduced” space. Functional analogues to standard models: functional regression, functional mixed models, generalized functional regression, etc.
Bayesian framework allows unified modeling perspective.
Wavelets for Functional Data
Overall objective: Methods for prediction or classification or clustering basedon functional data (multiple curves).
Regression: a continuous response is observed.
Classification: class membership observed in a training set.
Clustering: group membership needs to be uncovered.
Data are curves.
Motivating Examples from NIR Calibration
Experiment: 40 biscuit doughs made with variations in the quantities of fat, flour, sugar and water in the recipe. Fat 15-21%, Sugar 10-23%, Flour 44-54%, Water 11-17%.
Data: Composition (by weight) of the 4 constituents and spectral data at 700 wavelengths (from 1100nm to 2498nm in steps of 2nm) for each dough piece.
Aim: Use the spectral data to predict the composition. Wavelets as a dimension reduction tool that preserves local features.
[Figure: NIR spectra of the biscuit doughs, training and validation sets.]
Another Example
NIR spectra used in the analysis of food, drink and pharmaceutical products, measured at hundreds of wavelengths.
3 varieties of wheat, 94 observations. NIR spectra with 100 wavelengths from 850 to 1048nm in steps of 2nm.
Aim: Use the spectral data to predict the wheat variety.
[Figure: NIR absorbance spectra (850–1048nm) for wheat varieties 1, 2 and 3.]
Our Approach
Use wavelets as a dimension reduction tool that preserves local features. Apply the DWT to the curves. Wavelet coefficients summarise curve features in an efficient way; they are localized (unlike Fourier coefficients) and can capture noise in the data.
Develop Bayesian methods in the wavelet domain for simultaneous feature selection and prediction or classification. We use mixture priors and Bayes methods to look for wavelet components that describe well the variation in the responses.
Wavelet coefficient selection is done by assigning a probability to every possible subset and then searching for subsets with high probability. Small coefficients can be important.
Curve Regression
The basic setup is a multivariate linear regression model, with n observations on a q-variate response and p explanatory variables
Y = 1_n α′ + XB + E
Y (n × q) - responses, X (n × p) - predictors
Concern is with functional predictor data, i.e. the situation where each row of X is a vector of observations of a curve x(t) at p equally spaced points.
Interest is in situations where p is large → variable selection and/or dimension reduction methods are needed
Wavelet Transformation
A wavelet transform is applied to each row of X:
Y = 1_n α′ + X W′W B + E,    W′W = I
[Figure: three reflectance spectra and their corresponding DWT coefficients.]
Y = 1_n α′ + D B̃ + E
with B̃ = WB and D = XW′ a matrix of wavelet coefficients.
Prior Model to Select Wavelet Coefficients
Mixture priors to represent negligible coefficients. A latent binary vector γ identifies the different models:
γ = (γ1, . . . , γp), γj = 0, 1; γj = 1 ↔ Dj in the model
Coefficients are drawn from a mixture distribution
B̃ − B̃_0 ∼ N(H_γ, Σ),    B̃_{:,j} ∼ (1 − γj) I_0 + γj N(0, h_{jj} Σ)
π(γ) as single Bernoulli's, Beta-Binomial or related priors
Prob(γj = 1) = wj, wj = w, j = 1, ..., p
Prior Model
If H is the var-cov matrix of the prior on B, then
H̃ = W H W′
We compute the transformed covariance priors on B̃ as W H W′ using results of Vannucci and Corradi, JRSSB (1999).
Priors on α and Σ are unchanged.
(B → B̃, H → H̃)
Metropolis Search
MCMC used to “search” the posterior g(γ) ∝ π(γ | Y, D) looking for good models.
At each step the algorithm generates γ^new from γ^old by one of two possible moves:
Add or Delete a component, by choosing at random one component in γ^old and changing its value.
Swap two components, by choosing independently at random a 0 and a 1 in γ^old and changing both of them.
The new candidate γ^new is accepted with probability
min{ g(γ^new) / g(γ^old), 1 }
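The add/delete/swap proposal is a few lines of code; here g(γ) is left abstract, and the toy log-score below stands in for log π(γ | Y, D) purely for illustration:

```python
import math
import random

def propose(gamma):
    """Add/Delete a component, or Swap a randomly chosen 0 and a 1."""
    new = list(gamma)
    ones = [j for j, g in enumerate(new) if g == 1]
    zeros = [j for j, g in enumerate(new) if g == 0]
    if ones and zeros and random.random() < 0.5:      # Swap move
        new[random.choice(ones)] = 0
        new[random.choice(zeros)] = 1
    else:                                             # Add or Delete move
        j = random.randrange(len(new))
        new[j] = 1 - new[j]
    return new

def metropolis_step(gamma, log_g):
    """Accept gamma_new with probability min{g(new)/g(old), 1}."""
    new = propose(gamma)
    log_ratio = log_g(new) - log_g(gamma)
    if log_ratio >= 0 or random.random() < math.exp(log_ratio):
        return new
    return gamma

# Toy target favouring models with exactly 3 included variables
random.seed(1)
log_g = lambda g: -abs(sum(g) - 3)
gamma = [0] * 10
for _ in range(500):
    gamma = metropolis_step(gamma, log_g)
```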
Posterior Inference
The stochastic search results in a list of visited models (γ^(0), γ^(1), . . .) and their corresponding relative posterior probabilities
p(γ^(0) | D, Y), p(γ^(1) | D, Y), . . .
Select variables:
- in the “best” models, i.e. the γ's with highest p(γ | D, Y), or
- with the largest marginal posterior probabilities
Prediction of future Y^f:
- via Bayesian model averaging (BMA), as a posterior-weighted average of model predictions
- via single-model predictions, as LS/Bayes predictions on single best models, or LS/Bayes predictions with “threshold” models (e.g., the “median” model) obtained from the estimated marginal probabilities of inclusion.
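BMA weights each visited model's prediction by its renormalized posterior probability; a sketch with hypothetical per-model numbers:

```python
def bma_predict(model_probs, model_preds):
    """Posterior-weighted average of single-model predictions.

    model_probs: relative posterior probabilities of the visited gammas
    model_preds: the corresponding model-specific predictions of Y_f
    """
    total = sum(model_probs)
    weights = [p / total for p in model_probs]   # renormalize over visited models
    return sum(w * yhat for w, yhat in zip(weights, model_preds))

# Three visited models (illustrative numbers only)
pred = bma_predict([0.5, 0.3, 0.2], [1.0, 2.0, 4.0])
# 0.5*1 + 0.3*2 + 0.2*4 = 1.9
```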
Prior Specification
Priors on α and Σ vague and largely uninformative
α′ − α′_0 ∼ N(h, Σ), h → ∞,    Σ ∼ IW(δ, Q)
Choices for H_γ:
H_γ = C [(D′D)^{−1}]_γ
H_γ = C [diag(D′D)^{−1}]_γ
H = C I
H = AR(σ, ρ), with EB estimates of the hyperparameters,
L(·; D, Y) = |K|^{−q/2} |Q|^{−n/2} |I_n + K^{−1} Y Q^{−1} Y′|^{−(δ+q+n−1)/2},    K = I + D H D′
Choice of w:
Diagonal Elements of H̃ = W H W′
[Figure: prior variances of the regression coefficients (diagonal elements of H̃) plotted against wavelet coefficient index.]
NIR Data on Biscuit Doughs
4 parallel MCMC chains with 100,000 iterations and about 40,000 successful moves per chain.
H(i, j) = σ² ρ^{|i−j|}, σ² = 254, ρ = .32
Modal model: 10 coeffs (0%, 0%, 0%, 37%, 6%, 6%, 1%, 2%, from coarsest to finest level)
MSE = (.059, .466, .351, .047)
BMA: 500 models and 219 coeffs
MSE = (.063, .449, .348, .050)
PLS: MSE = (.151, .583, .375, .105)
PCR: MSE = (.160, .614, .388, .106)
Stepwise MLR: MSE = (.044, 1.188, .722, .221)
Diagnostic Plots
[Figure: (a) number of ones in γ and (b) log relative posterior probabilities, plotted against iteration number.]
Marginal Plots
[Figure: marginal posterior probabilities of the wavelet coefficients, panels (i)–(iv).]
IDWT to Columns of LS Estimate of B
[Figure: coefficient curves over wavelength, obtained by applying the inverse DWT to the columns of the LS estimate of B̃, for Fat, Sugar, Flour and Water.]
Probit Models for Curve Classification
Z → n × 1 categorical response vector (J categories).
X → n × p predictor matrix. We have functional predictor data.
Probit models use data augmentation and latent data to write
Y_i = α′ + X′_i B + ε_i,    ε_i ∼ N(0, Σ), i = 1, . . . , n
Relationship between the realization z_i and the unobserved Y_i:
z_i = 0 if y_{i,j} < 0 for every j;  z_i = j if y_{i,j} = max_{1≤k≤J−1} y_{i,k}
... now transform to wavelets, use selection priors on the wavelet coefficients and MCMC for inference in a probit model ...
Results for Wheat Data
3 classes, 94 samples, NIR transmission spectra at p = 100 wavelengths from 850 to 1048nm.

Variety      1    2    3   Total
Training    32   22   17    71
Validation  10    7    6    23

Fearn et al. (2001): a Bayesian decision approach that balances the cost of variables against the loss from misclassification.
Mis-classification Errors
Prediction           Var. Sel. / # PCs   Error
“Best” model         5                   3/23
Marginal model       9                   3/23
BD (Fearn et al.)    6                   5/23
BD (Fearn et al.)    12                  3/23
LDA with PCA         14-18 PCs           4/23
QDA with PCA         8-10 PCs            7/23
[Figure: marginal posterior probabilities of the wavelet coefficients for the wheat data, panels (a)–(d).]
[Figure: regression coefficient curves β1 and β2 plotted over wavelength, panels (a)–(d).]
Plots (a) and (c) are obtained using 4 wavelet coefficients, and plots (b) and (d) using 9 coefficients.
Code from my Website
bvsmewav: Bayesian Variable Selection applied to wavelet coefficients
Metropolis search
non-diagonal selection prior
Bernoulli priors or Beta-Binomial prior
Predictions by LS and BMA
http://stat.rice.edu/~marina
Discriminant Analysis for Classification
DA classifies observations into groups. Data from group g are modeled as
X_g − 1_{n_g} µ′_g ∼ N(I, Σ_g),
with g = 1, . . . , G, and where the vector µ_g and the matrix Σ_g are the mean and the covariance matrix of the g-th group, respectively.
Group assignments: y = (y_1, . . . , y_n)′, where y_i = g if the i-th observation comes from group g
Conjugate prior model:
µ_g ∼ N(m_g, h_g Σ_g)
Σ_g ∼ IW(δ_g, Ω_g),
where Ω_g is a scale matrix and δ_g a shape parameter.
Transformation to Wavelet Domain
We apply the DWT as D_g = X_g W′, with W′W = I. The model becomes
D_g − 1_{n_g} µ̃′_g ∼ N(I, Σ̃_g),
where Σ̃_g = W Σ_g W′ and µ̃_g = W µ_g. The prior model on Σ_g transforms accordingly.
The predictive distribution (Student-t) is used to classify new samples:
π_g(y^f | D) = p_g(d^f) π_g / Σ_{i=1}^G p_i(d^f) π_i,
where p_g(d^f) indicates the predictive distribution, and the prior probability that an observation comes from group g is π_g = n_g/n. A new observation is then assigned to the group with the highest posterior probability.
Variable selection
Want to select discriminating wavelet coefficients.
Introduce a latent p-vector γ with binary entries:
γ_j = 1 if coefficient j defines a mixture distribution, γ_j = 0 otherwise.
The likelihood function is given by
L(D, y; ·) = ∏_{i=1}^n p(D_i(γ^c) | D_i(γ)) ∏_{g=1}^G w_g^{n_g} ∏_{i=1}^{n_g} p_g(D_i(γ)),
Conjugate priors for selected and non-selected coefficients.
Markov Random Tree Priors on γ
P(γ_j | γ_i, i ∈ N_j) = exp(γ_j F(γ_j)) / [1 + exp(F(γ_j))]    (1)
where F(γ_j) = µ + η Σ_{i∈N_j} (2γ_i − 1) and N_j is the set of children of coefficient j.
Here µ controls sparsity. Higher η's induce more neighbors to take on the same values.
Model Fitting - MCMC
Can marginalize over the model parameters.
Update γ by a Metropolis algorithm.
Classify new samples based on the selected coefficients.
An Example
NIR spectra used in the analysis of food, drink and pharmaceutical products, measured at hundreds of wavelengths.
5 varieties of meat (Beef, Chicken, Lamb, Pork and Turkey), 231 observations. NIR spectra with 1024 reflectances over the range 400-2498nm at intervals of 2nm.
Aim: Use the spectral data to predict the meat variety.
[Figure: NIR absorbance spectra (400–2498nm) for Beef, Lamb, Pork, Turkey and Chicken.]
Results
                     Predicted
Truth     Beef     Lamb   Pork    Turkey   Chicken
Beef      100.0    0.0    0.0     0.0      0.0
Lamb      5.9 (1)  94.1   0.0     0.0      0.0
Pork      0.0      0.0    100.0   0.0      0.0
Turkey    0.0      0.0    0.0     92.6     7.4 (2)
Chicken   0.0      0.0    0.0     3.7 (1)  96.3

Table: Validation set: classification results for the five different species of meat using a threshold of 0.4 for the posterior probability of inclusion (14 coeffs). The number of misclassified units is reported in parentheses (3.4% total mis-classification rate).
LDA and QDA on selected (11-14) PCA components achieve mis-classification rates of 5.3% and 6.1%.
[Figure: posterior probabilities of inclusion of the wavelet coefficients for the meat data.]
Main References
Brown, P.J., Fearn, T. and Vannucci, M. (2001). Bayesian wavelet regression on curves with application to a spectroscopic calibration problem. Journal of the American Statistical Association, 96, 398-408.
Vannucci, M., Sha, N. and Brown, P.J. (2005). NIR and mass spectra classification: Bayesian methods for wavelet-based feature selection. Chemometrics and Intelligent Laboratory Systems, 77, 139-148.
Stingo, F.C., Vannucci, M. and Downey, G. (2011). Bayesian wavelet-based curve classification via discriminant analysis with Markov random tree priors. Statistica Sinica, 22, 465-488.