Bayesian Methods for Variable Selection with Applications ...€¦ · 1985: Y. Meyer defines...

56
Bayesian Methods for Variable Selection with Applications to High-Dimensional Data Part 5: Functional Data & Wavelets Marina Vannucci Rice University, USA ABS13-Italy 06/17-21/2013 Marina Vannucci (Rice University, USA) Bayesian Variable Selection (Part 5) ABS13-Italy 06/17-21/2013 1 / 56

Transcript of Bayesian Methods for Variable Selection with Applications ...€¦ · 1985: Y. Meyer defines...

Page 1: Bayesian Methods for Variable Selection with Applications ...€¦ · 1985: Y. Meyer defines discrete orthonormal wavelet bases. 1989: S. Mallat links wavelets to the theory of “multiresolution

Bayesian Methods for Variable Selection with ApplicationstoHigh-Dimensional Data

Part 5: Functional Data & Wavelets

Marina Vannucci

Rice University, USA

ABS13-Italy06/17-21/2013

Marina Vannucci (Rice University, USA) Bayesian Variable Selection (Part 5) ABS13-Italy 06/17-21/2013 1 / 56

Page 2: Bayesian Methods for Variable Selection with Applications ...€¦ · 1985: Y. Meyer defines discrete orthonormal wavelet bases. 1989: S. Mallat links wavelets to the theory of “multiresolution

Part 5: Functional Data & Wavelets

Introduction to nonparametric regression

Brief intro to wavelets

Wavelet shrinkageFunctional data (multiple curves)

Curve regression modelsCurve classificationApplications to near infrared spectral data from chemometrics

Marina Vannucci (Rice University, USA) Bayesian Variable Selection (Part 5) ABS13-Italy 06/17-21/2013 2 / 56

Page 3: Bayesian Methods for Variable Selection with Applications ...€¦ · 1985: Y. Meyer defines discrete orthonormal wavelet bases. 1989: S. Mallat links wavelets to the theory of “multiresolution

Nonparametric Regression Models

Functional form of the regression function is not a simple function of afew parameters

Technically: probability models with infinitely many parameters ormodels on functional spaces

Typical setting: Given data(Xi,Yi), i = 1, ..., n we wish to estimate

Y = f (X) + ǫ

X can be a covariate, a set of covariates, time, spatial location,...

Basic idea: Avoid simple (parametric) forms

Approaches needed to penalize for overfitting since there are now manyparameters

Function estimation (avoid linearity): splines, wavelets(basis functions)

Marina Vannucci (Rice University, USA) Bayesian Variable Selection (Part 5) ABS13-Italy 06/17-21/2013 3 / 56

Page 4: Bayesian Methods for Variable Selection with Applications ...€¦ · 1985: Y. Meyer defines discrete orthonormal wavelet bases. 1989: S. Mallat links wavelets to the theory of “multiresolution

Basis functions

A basis is a set of simple functions that can be added togetherin aweighted fashion for form more complicated functions

SupposeX is univariate

Simplest basis: polynomial basis

f (X) = β0 + β1X + ...+ βpXp

Set of power functions to reconstructf based on data:x; x2; x3...

With sufficiently large number of basis functions, we have essentially anonparametric function estimator

Matrix notation:f =∑p

j=0 Bj(x)βj, whereBj(X) = xj and coefficients(weights)βj

Polynomial basis - easy to fit but too simplistic for localfeatures/variations

Marina Vannucci (Rice University, USA) Bayesian Variable Selection (Part 5) ABS13-Italy 06/17-21/2013 4 / 56

Page 5: Bayesian Methods for Variable Selection with Applications ...€¦ · 1985: Y. Meyer defines discrete orthonormal wavelet bases. 1989: S. Mallat links wavelets to the theory of “multiresolution

SplinesSplines are one-way to modelf flexibly by writing f (X) = B(X)β whereB(:) are basis functions.Basis functions: there a lot choices available - truncated power basis,B-splines, thin plate splines, penalized splines ...Truncated power basis of orderp

f (X) = β0 + β1X + ...+ βpXp +

K∑

k=1

βk+p(X − κk)p+

with β’s the regression coefficients,κ the knots andK the number ofknots.Linear/polynomial regression forK = 0Construction of splines involves specifying knots: both number andlocation.Capture non-linear relationships between variables.Kernel based methods - Ruppert, Wand and Carroll (2003)

Marina Vannucci (Rice University, USA) Bayesian Variable Selection (Part 5) ABS13-Italy 06/17-21/2013 5 / 56

Page 6: Bayesian Methods for Variable Selection with Applications ...€¦ · 1985: Y. Meyer defines discrete orthonormal wavelet bases. 1989: S. Mallat links wavelets to the theory of “multiresolution

Wavelets: Major Milestones

The basic idea behind wavelets is to represent a generic function with simplerfunctions (building blocks) at different scales and locations.

1807: Fourier orthogonal decomposition of periodic signals

1946: Gabor windowed Fourier transform (STFT)

1984: A. Grossmann and J. Morlet introduce thecontinuouswavelet transform for theanalysis of seismic signals.

1985: Y. Meyer definesdiscreteorthonormal wavelet bases.

1989: S. Mallat links wavelets to the theory of “multiresolution analysis” (MRA) aframework that allows to construct orthonormal bases. Adiscrete wavelet transformisdefined as a simple recursive algorithm to compute the wavelet decomposition of a signalfrom its approximation at a given scale.

1989: I. Daubechies constructs wavelets with compact support anda varying degree ofsmoothness.

1992: D. Donoho and I. Johnstone use wavelets to remove noise from data.

Marina Vannucci (Rice University, USA) Bayesian Variable Selection (Part 5) ABS13-Italy 06/17-21/2013 6 / 56

Page 7: Bayesian Methods for Variable Selection with Applications ...€¦ · 1985: Y. Meyer defines discrete orthonormal wavelet bases. 1989: S. Mallat links wavelets to the theory of “multiresolution

Wavelets as “Small Waves”The basic idea behind wavelets is to represent a generic function with simplerfunctions (building blocks) at different scales and locations

−10 −5 0 5 10−0.4

−0.2

0

0.2

0.4

0.6

0.8

1

x

ψ(x)

Mexican hat wavelet

−10 −5 0 5 10−1

−0.5

0

0.5

1

1.5

x

√2ψ(

2x);

a=1/

2, b

=0

−10 −5 0 5 10−0.4

−0.2

0

0.2

0.4

0.6

0.8

1

x

(1/√

2)ψ(

x/2)

; a=2

, b=0

−10 −5 0 5 10−0.4

−0.2

0

0.2

0.4

0.6

0.8

1

x

ψ(x+

4); a

=1, b

=4

Mother waveletψ as oscillatory function with zero mean,∫

ψ(x)dx = 0Wavelets asψj,k(x) = 2j/2ψ(2jx− k) with j, k scale and translation parameters.

Marina Vannucci (Rice University, USA) Bayesian Variable Selection (Part 5) ABS13-Italy 06/17-21/2013 7 / 56

Page 8: Bayesian Methods for Variable Selection with Applications ...€¦ · 1985: Y. Meyer defines discrete orthonormal wavelet bases. 1989: S. Mallat links wavelets to the theory of “multiresolution

Wavelets vs Fourier RepresentationsGivenf and a basisf1, . . . , fn → series expansion

f (x) =∑

i

aifi(x)dx

ai =< f , fi >=∫

f (x)fi(x)dx

Fourier transforms measure the frequency content of a signal.Wavelet transforms provide atime-scale analysis.Marina Vannucci (Rice University, USA) Bayesian Variable Selection (Part 5) ABS13-Italy 06/17-21/2013 8 / 56

Page 9: Bayesian Methods for Variable Selection with Applications ...€¦ · 1985: Y. Meyer defines discrete orthonormal wavelet bases. 1989: S. Mallat links wavelets to the theory of “multiresolution

Examples of Wavelet FamiliesHaar wavelets. The simplest family of wavelets, already known before theformulation of the wavelet theory (Haar, 1909).ψ(x) = 1 for 0≤ x < 1/2; −1 for 1/2 ≤ x < 1 and 0 otherwise.Haar wavelets constructed fromψ via dilations and translations

Given a generic functionf , Haar wavelets approximatef with piecewiseconstant functions (not continuous)

Marina Vannucci (Rice University, USA) Bayesian Variable Selection (Part 5) ABS13-Italy 06/17-21/2013 9 / 56

Page 10: Bayesian Methods for Variable Selection with Applications ...€¦ · 1985: Y. Meyer defines discrete orthonormal wavelet bases. 1989: S. Mallat links wavelets to the theory of “multiresolution

Daubechies Wavelets

0 1 2 3−1.5

−1

−0.5

0

0.5

1

1.5

2

x

ψ(x

) an

d φ(

x)

Daubechies 2

0 2 4 6−1

−0.5

0

0.5

1

1.5Daubechies 4

x

0 5 10−1.5

−1

−0.5

0

0.5

1

1.5Daubechies 7

x

ψ(x

) an

d φ(

x)

0 5 10 15−1.5

−1

−0.5

0

0.5

1

1.5Daubechies 10

x

Compact support (good time localization)Vanishing moments

xlψ(x)dx = 0, l = 0,1, . . . ,N ensure decay as< f , ψj,k >≤ C2−jN , (good for compression)Various degrees of smoothness. For largeN, φ ∈ CµN , µ ≈ 0.2.

Marina Vannucci (Rice University, USA) Bayesian Variable Selection (Part 5) ABS13-Italy 06/17-21/2013 10 / 56

Page 11: Bayesian Methods for Variable Selection with Applications ...€¦ · 1985: Y. Meyer defines discrete orthonormal wavelet bases. 1989: S. Mallat links wavelets to the theory of “multiresolution

Properties of Wavelets

Wavelet series:f (x) =∑

j,k < f , ψj,k > ψj,k(x)

Wf (j, k) = 〈f , ψj,k〉 =

f (x)ψj,k(x)dxj,k∈Z

describing features off at different locations and scales.

Properties:

Small waves with zero mean

Time-frequency localization

Good at describing non-stationarity and discontinuities

Multi-scale decomposition of functions (MRA) - sparsity, shrinkage

Recursive relationships among coefficients→ DWT

Marina Vannucci (Rice University, USA) Bayesian Variable Selection (Part 5) ABS13-Italy 06/17-21/2013 11 / 56

Page 12: Bayesian Methods for Variable Selection with Applications ...€¦ · 1985: Y. Meyer defines discrete orthonormal wavelet bases. 1989: S. Mallat links wavelets to the theory of “multiresolution

Wavelets in Practice (DWT)Let’s consider a vectorY of observations off at n equispaced points

yi = f (ti), i = 1, . . . , n, with n = 2m

Discrete Wavelet Transforms operate via recursive filters applied toY

data≡ c(m) H−→ c(m−1) H

−→ · · ·H−→ c(m−r)

G

ցG

ցG

ցd(m−1) · · · d(m−r)

H : cm−1,k =∑

l

hl−2kcm,l, G : dm−1,k =∑

l

gl−2kcm,l, ...

Then, in practice

Y DWT−→ d(Y) = (c(m−r),d(m−r),d(m−r−1), . . . ,d(m−1))

discrete approx of< f , ψj,k > at scalesm − 1, . . . ,m − rMarina Vannucci (Rice University, USA) Bayesian Variable Selection (Part 5) ABS13-Italy 06/17-21/2013 12 / 56

Page 13: Bayesian Methods for Variable Selection with Applications ...€¦ · 1985: Y. Meyer defines discrete orthonormal wavelet bases. 1989: S. Mallat links wavelets to the theory of “multiresolution

Example

0 100 200 300 400 500 600 700 800 900 1000−4

−2

0

2

4

6

8

10

f+ε

0 100 200 300 400 500 600 700 800 900 1000−15

−10

−5

0

5

10

15

Coefficient index

DWT

data≡ Y DWT−→ d(Y) = (c(3),d(3), . . . ,d(7),d(8),d(9))

In matrix notation,Y −→ d(Y) = WY, with W determined byψ and suchthatW ′W = IMarina Vannucci (Rice University, USA) Bayesian Variable Selection (Part 5) ABS13-Italy 06/17-21/2013 13 / 56

Page 14: Bayesian Methods for Variable Selection with Applications ...€¦ · 1985: Y. Meyer defines discrete orthonormal wavelet bases. 1989: S. Mallat links wavelets to the theory of “multiresolution

Summary on Wavelets

Choose the mother waveletψ and obtain the wavelet familyψj,kj,k∈Z

In theory, givenf , calculate thewavelet coefficients< f , ψj,k >=

f (x)ψj,k(x)dx and construct the wavelet series

f (x) =∑

j,k

< f , ψj,k > ψj,k(x)

In practice, given dataY = (y1, . . . , yn), use the DWT to calculateapproximated wavelet coefficients

DWT uses fast recursive filtering. Use matrix notation for convenienceZ = WY

Marina Vannucci (Rice University, USA) Bayesian Variable Selection (Part 5) ABS13-Italy 06/17-21/2013 14 / 56

Page 15: Bayesian Methods for Variable Selection with Applications ...€¦ · 1985: Y. Meyer defines discrete orthonormal wavelet bases. 1989: S. Mallat links wavelets to the theory of “multiresolution

Covariance matrix of wavelet coefficients

yi = f (ti), i = 1, . . . , n, with n = 2m

DWT: Y DWT−→ d(Y) = WY, W determined byψ, W ′W = I

Variances and covariances of DWT coefficients:

ΣΣΣd(Y) = WΣΣΣYW′

for givenΣY(i, j) = [γ(|i − j|)]γ(τ) autocovariance function of the process generating the data

VANNUCCI & CORRADI (JRSS, Series B, 1999).

Marina Vannucci (Rice University, USA) Bayesian Variable Selection (Part 5) ABS13-Italy 06/17-21/2013 15 / 56

Page 16: Bayesian Methods for Variable Selection with Applications ...€¦ · 1985: Y. Meyer defines discrete orthonormal wavelet bases. 1989: S. Mallat links wavelets to the theory of “multiresolution

Fields of Application

Applied Mathematics: Partial/ordinary differential equations solutions;representation of basic operators ...

Engineering: Signal and image processing (compression, smoothing) ...

Statistics: Smoothing of noisy data; nonparametric density andregression estimation; stochastic processes representation, time series,functional data ...

Physics, Biology, Genetics: Turbolence, DNA sequence analysis,magnetic resonance imaging ...

Music, Human Vision, Computer Graphics ...

Marina Vannucci (Rice University, USA) Bayesian Variable Selection (Part 5) ABS13-Italy 06/17-21/2013 16 / 56

Page 17: Bayesian Methods for Variable Selection with Applications ...€¦ · 1985: Y. Meyer defines discrete orthonormal wavelet bases. 1989: S. Mallat links wavelets to the theory of “multiresolution

Wavelet Shrinkage

yi = f (ti) + ǫi, i = 1, . . . , n = 2m (equispaced design), ǫi ∼ N(0, σ2)

Noise removal technique proposed by Donoho and Johnstone(1992-1998)

Compute empirical wavelet coefficients→ apply shrinkage/thresholding→ reconstructf

Implemented through fast DWT

Optimal for general function classes (Holder and Sobolev) as well asspaces ofirregular functions such as those of “bounded variations”.

Bayesian wavelet shrinkage estimators has better MSE properties forfinite samples.

Marina Vannucci (Rice University, USA) Bayesian Variable Selection (Part 5) ABS13-Italy 06/17-21/2013 17 / 56

Page 18: Bayesian Methods for Variable Selection with Applications ...€¦ · 1985: Y. Meyer defines discrete orthonormal wavelet bases. 1989: S. Mallat links wavelets to the theory of “multiresolution

3-step procedure

yi = f (ti) + ǫi, ǫi i.i.d. N(0, σ2)

Step 1: YDWT−→ d(Y) = d(f ) + ǫ′, ǫ′ ∼ N(0, σ2I)

Step 2: Estimated(f )

Removeǫ′ by thresholding/shrinkingd(Y)

Use mixture priors ond(f )

Step 3: Inverse transform to estimatef

Marina Vannucci (Rice University, USA) Bayesian Variable Selection (Part 5) ABS13-Italy 06/17-21/2013 18 / 56

Page 19: Bayesian Methods for Variable Selection with Applications ...€¦ · 1985: Y. Meyer defines discrete orthonormal wavelet bases. 1989: S. Mallat links wavelets to the theory of “multiresolution

Hard and soft thresholding rules

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1−15

−10

−5

0

5

10

15

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1−15

−10

−5

0

5

10

15

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

5

4

3

2

1

DWT (hsr=0)

Leve

l

Time0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

5

4

3

2

1

DWT (hsr=0)

Leve

lTime

dj,k = dj,kI(|dj,k|>λ) (hard thresholding).

dj,k = (dj,k − sign(dj,k)λ)I(|dj,k |>λ) (soft thresholding).

with global or level-dependent thresholds.Marina Vannucci (Rice University, USA) Bayesian Variable Selection (Part 5) ABS13-Italy 06/17-21/2013 19 / 56

Page 20: Bayesian Methods for Variable Selection with Applications ...€¦ · 1985: Y. Meyer defines discrete orthonormal wavelet bases. 1989: S. Mallat links wavelets to the theory of “multiresolution

Bayesian Wavelet ShrinkageAt step 2 write the model as(d = WY, θθθ = Wf)

d|θθθ, σ2 ∼ N (θθθ, σ2I),

and impose priors on the unknown parametersθθθ (and possiblyσ2).

Use spike-and-slab priors

θj,k|sj,k ∼ sj,kN(0, ·) + (1− sj,k)I0, sj,k ∼ Ber(pj)

(Clyde et al. 1998, Clyde & George 2000)

Shrinkage is a by-product of these procedures, as Bayes rules naturallyshrink posterior estimates towards prior quantities. Also, shrinkage isadaptive when level-specific variances are specified. Can also specifydependent priors (Vannucci & Corradi, 1999).

Estimateθθθ via appropriate Bayes rule (posterior mean, median) andestimatef byW−1θθθ (step 3).

Marina Vannucci (Rice University, USA) Bayesian Variable Selection (Part 5) ABS13-Italy 06/17-21/2013 20 / 56

Page 21: Bayesian Methods for Variable Selection with Applications ...€¦ · 1985: Y. Meyer defines discrete orthonormal wavelet bases. 1989: S. Mallat links wavelets to the theory of “multiresolution

Useful References

Daubechies, I. (1992).Ten lectures on wavelets. SIAM.

Mallat, S.G. (1989). Multiresolution approximations and waveletorthonormal bases ofL2(R). Transactions of the American MathematicalSociety, 315(1), 69-87.

Nason, G. (2008), Wavelet Methods in Statistcs with R, Springer Verlag:New York. (with R package)

Antoniadis, A., Bigot, J. and Sapatinas, T. (2001). Waveletestimators innonparametric regression: A comparative simulation study. Journal ofStatistical Software, 6(6), 1-83. Matlab software.

Wavelab (matlab) and WaveThresh (R)

Marina Vannucci (Rice University, USA) Bayesian Variable Selection (Part 5) ABS13-Italy 06/17-21/2013 21 / 56

Page 22: Bayesian Methods for Variable Selection with Applications ...€¦ · 1985: Y. Meyer defines discrete orthonormal wavelet bases. 1989: S. Mallat links wavelets to the theory of “multiresolution

Functional Data (multiple curves)

Data observed over an ordered domain: curves, images, 3-d objects

First step: characterize unknown function via basis functions (projectionto low-dimension). Multiple choices: Splines, wavelets, functionalprincipal component (empirical basis).

Second step: fit models to “reduced” space. Functional analogues tostandard models: functional regression, functional mixedmodels;generalized functional regression etc...

Bayesian framework allows unified modeling perspective.

Marina Vannucci (Rice University, USA) Bayesian Variable Selection (Part 5) ABS13-Italy 06/17-21/2013 22 / 56

Page 23: Bayesian Methods for Variable Selection with Applications ...€¦ · 1985: Y. Meyer defines discrete orthonormal wavelet bases. 1989: S. Mallat links wavelets to the theory of “multiresolution

Wavelets for Functional Data

Overall objective: Methods for prediction or classification or clustering basedon functional data (multiple curves).

Regression:a continuous response is observed.

Classification: class membership observed in a training set.

Clustering: group membership needs to be uncovered.

Data are curves.

Marina Vannucci (Rice University, USA) Bayesian Variable Selection (Part 5) ABS13-Italy 06/17-21/2013 23 / 56

Page 24: Bayesian Methods for Variable Selection with Applications ...€¦ · 1985: Y. Meyer defines discrete orthonormal wavelet bases. 1989: S. Mallat links wavelets to the theory of “multiresolution

Motivating Examples from NIR Calibration

Experiment: 40 biscuit doughs made with variations in quantities of fat,flour, sugar and water in recipe.Fat 15-21%, Sugar 10-23%, Flour 44-54%, Water 11-17%

Data: Composition (by weight) of the 4 constituents and spectral data at700 wavelengths (from 1100nm to 2498nm in steps of 2nm) for eachdough piece.

Aim: Use the spectral data to predict the composition.Wavelets as a dimension reduction tool that preserves localfeatures.

Marina Vannucci (Rice University, USA) Bayesian Variable Selection (Part 5) ABS13-Italy 06/17-21/2013 24 / 56

Page 25: Bayesian Methods for Variable Selection with Applications ...€¦ · 1985: Y. Meyer defines discrete orthonormal wavelet bases. 1989: S. Mallat links wavelets to the theory of “multiresolution

1000 1500 2000 25000

0.5

1

1.5

2Biscuit − training

1000 1500 2000 25000

0.5

1

1.5

2

2.5

3Biscuit − validation

Marina Vannucci (Rice University, USA) Bayesian Variable Selection (Part 5) ABS13-Italy 06/17-21/2013 25 / 56

Page 26: Bayesian Methods for Variable Selection with Applications ...€¦ · 1985: Y. Meyer defines discrete orthonormal wavelet bases. 1989: S. Mallat links wavelets to the theory of “multiresolution

Another Example

NIR spectra used in analysis of food and drink and pharmaceuticalproducts, measured at hundreds of wavelengths.

3 varieties of wheat, 94 observations. NIR spectra with 100 wavelengthsfrom 850 to 1048nm in steps of 2nm.

Aim: Use the spectral data to predict the wheat variety.

Marina Vannucci (Rice University, USA) Bayesian Variable Selection (Part 5) ABS13-Italy 06/17-21/2013 26 / 56

Page 27: Bayesian Methods for Variable Selection with Applications ...€¦ · 1985: Y. Meyer defines discrete orthonormal wavelet bases. 1989: S. Mallat links wavelets to the theory of “multiresolution

850 900 950 1000 10502.4

2.6

2.8

3

3.2

3.4

Wavelength(nm)

Abs

orba

nce

Variety 1

850 900 950 1000 10502.4

2.6

2.8

3

3.2

3.4

Wavelength(nm)

Abs

orba

nce

Variety 2

850 900 950 1000 10502.4

2.6

2.8

3

3.2

3.4

Wavelength(nm)

Abs

orba

nce

Variety 3

Marina Vannucci (Rice University, USA) Bayesian Variable Selection (Part 5) ABS13-Italy 06/17-21/2013 27 / 56

Page 28: Bayesian Methods for Variable Selection with Applications ...€¦ · 1985: Y. Meyer defines discrete orthonormal wavelet bases. 1989: S. Mallat links wavelets to the theory of “multiresolution

Our Approach

Use wavelets as dimension reduction tool that preserves local features.Apply DWT to curves. Wavelet coefficients summarise curve features inan efficient way, they are localized (unlike Fourier coefficients) and cancapture noise in the data.

Develop Bayesian methods in wavelet domain for simultaneous featureselection and prediction or classification. We use mixture priors andBayes methods to look for wavelet component that well describe thevariation in the responses.

Wavelet coeffs selection is done by assigning a probabilityto everypossible subset and then searching for subsets with high probability.Small coefficients can be important.

Marina Vannucci (Rice University, USA) Bayesian Variable Selection (Part 5) ABS13-Italy 06/17-21/2013 28 / 56

Page 29: Bayesian Methods for Variable Selection with Applications ...€¦ · 1985: Y. Meyer defines discrete orthonormal wavelet bases. 1989: S. Mallat links wavelets to the theory of “multiresolution

Curve Regression

The basic setup is a multivariate linear regression model, with n observationson aq-variate response andp explanatory variables

Y = 1nααα′ + XB + E

Y(n × q)− responses,X(n × p)− predictors

Concern is with functional predictor data, ie the situationwhere each row ofX is a vector of observations of a curvex(t) at p equally spaced points.

Interest is in situations wherep is large→ variable selection and/or dimensionreduction methods are needed

Marina Vannucci (Rice University, USA) Bayesian Variable Selection (Part 5) ABS13-Italy 06/17-21/2013 29 / 56

Page 30: Bayesian Methods for Variable Selection with Applications ...€¦ · 1985: Y. Meyer defines discrete orthonormal wavelet bases. 1989: S. Mallat links wavelets to the theory of “multiresolution

Wavelet TransformationA wavelet transform is applied to each row ofX

Y = 1nααα′ + XW ′WB + E, W ′W = I

1400 1600 1800 2000 2200 2400−0.14

−0.12

−0.1

−0.08

−0.06

Ref

lect

ance

1400 1600 1800 2000 2200 24000

0.05

0.1

Ref

lect

ance

1400 1600 1800 2000 2200 24000

0.02

0.04

0.06

0.08

Ref

lect

ance

Wavelength

0 50 100 150 200 250−1.5

−1

−0.5

0

0.5

DW

T

0 50 100 150 200 250−0.5

0

0.5

1

DW

T

0 50 100 150 200 250−0.2

0

0.2

0.4

DW

T

Wavelet coefficient

Y = 1nααα′ + DB + E

with B = WB andD = XW ′ a matrix of wavelet coefficients.Marina Vannucci (Rice University, USA) Bayesian Variable Selection (Part 5) ABS13-Italy 06/17-21/2013 30 / 56

Page 31: Bayesian Methods for Variable Selection with Applications ...€¦ · 1985: Y. Meyer defines discrete orthonormal wavelet bases. 1989: S. Mallat links wavelets to the theory of “multiresolution

Prior Model to Select Wavelet Coefficients

Mixture priors to represent negligible coefficients.A latent binary vectorγγγ identifies different models

γγγ = (γ1, . . . , γp), γj = 0,1, γj = 1 ↔ Dj in

Coefficients are drawn from a mixture distribution

B − B0 ∼ N (Hγγγ,ΣΣΣ)

B:,j ∼ (1− γj)I0 + γjN(0, hjjΣΣΣ)

π(γγγ) as single Bernoulli’s or Beta-Binomial or related predictors

Prob(γj = 1) = wj, wj = w, j = 1, ..., p

Marina Vannucci (Rice University, USA) Bayesian Variable Selection (Part 5) ABS13-Italy 06/17-21/2013 31 / 56

Page 32: Bayesian Methods for Variable Selection with Applications ...€¦ · 1985: Y. Meyer defines discrete orthonormal wavelet bases. 1989: S. Mallat links wavelets to the theory of “multiresolution

Prior Model

If H is the var-cov matrix of the prior onB then

H = [WHW ′]

We compute transformed covariance priors onB asWHW ′ using results ofVannucci and Corradi, JRSSB (1999).

Priors onααα andΣΣΣ are unchanged.

(B → B, H → H)

Marina Vannucci (Rice University, USA) Bayesian Variable Selection (Part 5) ABS13-Italy 06/17-21/2013 32 / 56

Page 33: Bayesian Methods for Variable Selection with Applications ...€¦ · 1985: Y. Meyer defines discrete orthonormal wavelet bases. 1989: S. Mallat links wavelets to the theory of “multiresolution

Metropolis Search

MCMC used to ”search” the posteriorg(γγγ) ∼ π(γγγ|Y,D) looking for goodmodels.

At each step the algorithm generatesγγγnew from γγγold by one of twopossible moves:Add or Delete a component by choosing at random one component inγγγold and changing its value.Swap two components by choosing independently at random a 0 and a 1in γγγold and changing both of them.

The new candidateγγγnew is accepted with probability

ming(γγγnew)

g(γγγold),1

Marina Vannucci (Rice University, USA) Bayesian Variable Selection (Part 5) ABS13-Italy 06/17-21/2013 33 / 56

Page 34: Bayesian Methods for Variable Selection with Applications ...€¦ · 1985: Y. Meyer defines discrete orthonormal wavelet bases. 1989: S. Mallat links wavelets to the theory of “multiresolution

Posterior Inference

The stochastic search results in a list of visited models (γγγ(0), γγγ(1), . . .) andtheir corresponding relative posterior probabilities

p(γγγ(0)|D,Y), p(γγγ(1)|D,Y) . . .

Select variables:- in the“best” models, i.e. theγγγ’s with highestp(γγγ|D,Y) or- with largest marginal posterior probabilities

Prediction on futureY f :- via Bayesian model averaging (BMA) as posterior weighted average of

model predictions- via single model predictions as LS/Bayes predictions on single best

models or LS/Bayes predictions with “threshold” models (eg, “median”model) obtained from estimated marginal probabilities of inclusion.

Marina Vannucci (Rice University, USA) Bayesian Variable Selection (Part 5) ABS13-Italy 06/17-21/2013 34 / 56

Page 35: Bayesian Methods for Variable Selection with Applications ...€¦ · 1985: Y. Meyer defines discrete orthonormal wavelet bases. 1989: S. Mallat links wavelets to the theory of “multiresolution

Prior Specification

Priors onααα andΣΣΣ vague and largely uninformative

ααα′ − ααα′0 ∼ N (h,ΣΣΣ), h → ∞, ΣΣΣ ∼ IW(δ,Q)

Choices forHγγγ :

Hγγγ = C ∗ [(D′D)−1]γγγ

Hγγγ = C ∗ [diag(D′D)−1]γγγ

H = CI

H = AR(σ, ρ) with EB estimates of hyperparameters,

L(·;D,Y) = |K |−q/2|Q|−n/2|In + K−1YQ−1Y′|−(δ+q+n−1)/2

K = I + DHD′

Choice ofw:

Marina Vannucci (Rice University, USA) Bayesian Variable Selection (Part 5) ABS13-Italy 06/17-21/2013 35 / 56

Page 36: Bayesian Methods for Variable Selection with Applications ...€¦ · 1985: Y. Meyer defines discrete orthonormal wavelet bases. 1989: S. Mallat links wavelets to the theory of “multiresolution

Diagonal Elements ofH = WHW ′

0 50 100 150 200 250150

200

250

300

350

400

450

500

Wavelet coefficient

Prio

r va

rianc

e of

reg

ress

ion

coef

ficie

nt

Marina Vannucci (Rice University, USA) Bayesian Variable Selection (Part 5) ABS13-Italy 06/17-21/2013 36 / 56

Page 37: Bayesian Methods for Variable Selection with Applications ...€¦ · 1985: Y. Meyer defines discrete orthonormal wavelet bases. 1989: S. Mallat links wavelets to the theory of “multiresolution

NIR Data on Biscuit Doughs

4 parallel MCMC with 100,000 iterations and about 40,000 successful movesper chain.

H(i, j) = σ2ρ|i−j|, σ2 = 254, ρ = .32

Modal model: 10 coeffs (0%, 0%, 0%, 37%, 6%, 6%, 1%, 2%, from coarsestto finest level)

MSE = (.059, .466, .351, .047)

BMA: 500 models and 219 coeffs

MSE = (.063, .449, .348, .050)

PLS- MSE=(.151, .583, .375, .105)PCR- MSE=(.160, .614, .388, .106)Stepwise MLR: MSE=(.044, 1.188, .722, .221)

Marina Vannucci (Rice University, USA) Bayesian Variable Selection (Part 5) ABS13-Italy 06/17-21/2013 37 / 56

Page 38: Bayesian Methods for Variable Selection with Applications ...€¦ · 1985: Y. Meyer defines discrete orthonormal wavelet bases. 1989: S. Mallat links wavelets to the theory of “multiresolution

Diagnostic Plots

0 1 2 3 4 5 6 7 8 9 10

x 104

0

20

40

60

80

100

120

(a)

Iteration number

Num

ber

of o

nes

0 1 2 3 4 5 6 7 8 9 10

x 104

−400

−300

−200

−100

0(b)

Iteration number

Log

prob

s

Marina Vannucci (Rice University, USA) Bayesian Variable Selection (Part 5) ABS13-Italy 06/17-21/2013 38 / 56

Page 39: Bayesian Methods for Variable Selection with Applications ...€¦ · 1985: Y. Meyer defines discrete orthonormal wavelet bases. 1989: S. Mallat links wavelets to the theory of “multiresolution

Marginal Plots

0 50 100 150 200 2500

0.2

0.4

0.6

0.8

1(i)

Pro

babi

lity

0 50 100 150 200 2500

0.2

0.4

0.6

0.8

1(ii)

0 50 100 150 200 2500

0.2

0.4

0.6

0.8

1(iii)

Pro

babi

lity

Wavelet coefficient0 50 100 150 200 250

0

0.2

0.4

0.6

0.8

1

Wavelet coefficient

(iv)

Marina Vannucci (Rice University, USA) Bayesian Variable Selection (Part 5) ABS13-Italy 06/17-21/2013 39 / 56

Page 40: Bayesian Methods for Variable Selection with Applications ...€¦ · 1985: Y. Meyer defines discrete orthonormal wavelet bases. 1989: S. Mallat links wavelets to the theory of “multiresolution

IDWT to Columns of LS Estimate of B

1400 1500 1600 1700 1800−200

−100

0

100

200C

oeffi

cien

t

Fat

1400 1500 1600 1700 1800−400

−200

0

200

400Sugar

1400 1500 1600 1700 1800−400

−200

0

200

400Flour

Wavelength

Coe

ffici

ent

1400 1500 1600 1700 1800−200

−100

0

100

200

Wavelength

Water

Marina Vannucci (Rice University, USA) Bayesian Variable Selection (Part 5) ABS13-Italy 06/17-21/2013 40 / 56

Page 41: Bayesian Methods for Variable Selection with Applications ...€¦ · 1985: Y. Meyer defines discrete orthonormal wavelet bases. 1989: S. Mallat links wavelets to the theory of “multiresolution

Probit Models for Curve ClassificationZ → n × 1 categorical response vector (J categories).X → n × p predictor matrix. We have functional predictor data.Probit models use data augmentation andlatent data to write

Yi = ααα′ + X′iB + ǫǫǫi, ǫǫǫi ∼ N(0,ΣΣΣ), i = 1, . . . , n

Relationship between the realizationzi and the unobservedYi

zi =

0 if yi,j < 0 for everyjj if yi,j = max

1≤k≤J−1yi,k

... now transform to wavelets, use selection prior on wavelet coefficients andMCMC for inference in a probit model ...

Marina Vannucci (Rice University, USA) Bayesian Variable Selection (Part 5) ABS13-Italy 06/17-21/2013 41 / 56

Page 42: Bayesian Methods for Variable Selection with Applications ...€¦ · 1985: Y. Meyer defines discrete orthonormal wavelet bases. 1989: S. Mallat links wavelets to the theory of “multiresolution

Results for Wheat Data

3 classes, 94 samples, NIR transmission spectra at p=100 wavelengthsfrom 850 to 1048nm.

Variety 1 2 3 TotalTraining 32 22 17 71

Validation 10 7 6 23

Fearnet al.(2001)Bayesian decision approach that balances costs for variables against theloss of misclassification.

Marina Vannucci (Rice University, USA) Bayesian Variable Selection (Part 5) ABS13-Italy 06/17-21/2013 42 / 56

Page 43: Bayesian Methods for Variable Selection with Applications ...€¦ · 1985: Y. Meyer defines discrete orthonormal wavelet bases. 1989: S. Mallat links wavelets to the theory of “multiresolution

Mis-classification Errors

Prediction Var. Sel./# PC. Error“Best” Model 5 3/23

Marginal model 9 3/23BD(Fearnet al.) 6 5/23BD(Fearnet al.) 12 3/23LDA with PCA 14–18 PCs 4/23QDA with PCA 8–10 PCs 7/23

Marina Vannucci (Rice University, USA) Bayesian Variable Selection (Part 5) ABS13-Italy 06/17-21/2013 43 / 56

Page 44: Bayesian Methods for Variable Selection with Applications ...€¦ · 1985: Y. Meyer defines discrete orthonormal wavelet bases. 1989: S. Mallat links wavelets to the theory of “multiresolution

0 20 40 60 80 1000

0.2

0.4

0.6

0.8

1

Mar

gina

l pro

babi

lity

(a)

0 20 40 60 80 1000

0.2

0.4

0.6

0.8

1(b)

0 20 40 60 80 1000

0.2

0.4

0.6

0.8

1

Mar

gina

l pro

babi

lity

Wavelet coefficient

(c)

0 20 40 60 80 1000

0.2

0.4

0.6

0.8

1

Wavelet coefficient

(d)

Marina Vannucci (Rice University, USA) Bayesian Variable Selection (Part 5) ABS13-Italy 06/17-21/2013 44 / 56

Page 45: Bayesian Methods for Variable Selection with Applications ...€¦ · 1985: Y. Meyer defines discrete orthonormal wavelet bases. 1989: S. Mallat links wavelets to the theory of “multiresolution

850 900 950 1000 1050−1000

−500

0

500

1000(a)

β 1

850 900 950 1000 1050−1000

−500

0

500

1000(c)

wavelength

β 2

850 900 950 1000 1050−2000

−1000

0

1000

2000

β 1

(b)

850 900 950 1000 1050−2000

−1000

0

1000

2000

wavelength

β 2

(d)

Plots(a) and(c) are obtained using 4 wavelet coefficients and plots(b) and(d) using 9 coefficients.

Marina Vannucci (Rice University, USA) Bayesian Variable Selection (Part 5) ABS13-Italy 06/17-21/2013 45 / 56

Page 46: Bayesian Methods for Variable Selection with Applications ...€¦ · 1985: Y. Meyer defines discrete orthonormal wavelet bases. 1989: S. Mallat links wavelets to the theory of “multiresolution

Code from my Website

bvsmewav: Bayesian Variable Selection applied to wavelet coefficients

Metropolis search

non-diagonal selection prior

Bernoulli priors or Beta-Binomial prior

Predictions by LS and BMA

http://stat.rice.edu/∼marina

Marina Vannucci (Rice University, USA) Bayesian Variable Selection (Part 5) ABS13-Italy 06/17-21/2013 46 / 56

Page 47: Bayesian Methods for Variable Selection with Applications ...€¦ · 1985: Y. Meyer defines discrete orthonormal wavelet bases. 1989: S. Mallat links wavelets to the theory of “multiresolution

Discriminant Analysis for Classification

DA classifies observations into groups. Data from groupg modeled as

Xg − 1ngµ′

g ∼ N (I ,Σg),

with g = 1, . . . ,G and where the vectorµg and the matrixΣg are themean and the covariance matrix of the g-th group, respectively.

Group assignments:y = (y1, . . . , yn)′, whereyi = g if the ith observation

comes from groupg

Conjugate prior model

µg ∼ N(mg, hgΣg)

Σg ∼ IW(δg,Ωg),

whereΩg is a scale matrix andδg a shape parameter.

Marina Vannucci (Rice University, USA) Bayesian Variable Selection (Part 5) ABS13-Italy 06/17-21/2013 47 / 56

Page 48: Bayesian Methods for Variable Selection with Applications ...€¦ · 1985: Y. Meyer defines discrete orthonormal wavelet bases. 1989: S. Mallat links wavelets to the theory of “multiresolution

Transformation to Wavelet Domain

We apply DWT asDg = XgW′, with (W ′W = I ). Model becomes

Dg − 1ngµ′

g ∼ N (I , Σg),

whereΣg = WΣgW′, whileµg is unchanged. The prior model onΣg

transforms accordingly.

The predictive distribution (t-student) is used to classify new samples

πg(yf |D) =

pg(df )πg∑G

i=1 pi(df )πi,

wherepg(df ) indicates the predictive distribution and where the priorprobability that one observation comes from groupg asπg = ng/n. Anew observations is then assigned to the group with the highest posteriorprobability.

Marina Vannucci (Rice University, USA) Bayesian Variable Selection (Part 5) ABS13-Italy 06/17-21/2013 48 / 56

Page 49: Bayesian Methods for Variable Selection with Applications ...€¦ · 1985: Y. Meyer defines discrete orthonormal wavelet bases. 1989: S. Mallat links wavelets to the theory of “multiresolution

Variable selection

Want to select discriminating wavelet coefficients.

Introduce latentp-vectorγ with binary entries

γj = 1 if coeff j defines a mixture distributionγj = 0 otherwise.

The likelihood function is given by

L(D, y; ·) =n∏

i=1

p(Di(γc)|Di(γ))G∏

g=1

ng∏

i=1

wngg pg(Di(γ)),

Conjugate priors for selected and non-selected coefficients.

Marina Vannucci (Rice University, USA) Bayesian Variable Selection (Part 5) ABS13-Italy 06/17-21/2013 49 / 56

Page 50: Bayesian Methods for Variable Selection with Applications ...€¦ · 1985: Y. Meyer defines discrete orthonormal wavelet bases. 1989: S. Mallat links wavelets to the theory of “multiresolution

Markov Random Tree Priors on γ

P(γj|γi, i ∈ Nj) =exp(γjF(γj))

1+ exp(F(γj))(1)

whereF(γj) = µ+ η∑

i∈Nj(2γi − 1) andNj is the set of children of coeffj.

Hereµ controls sparsity. Higherη’s induce more neighbors to take on samevalues.Marina Vannucci (Rice University, USA) Bayesian Variable Selection (Part 5) ABS13-Italy 06/17-21/2013 50 / 56

Page 51: Bayesian Methods for Variable Selection with Applications ...€¦ · 1985: Y. Meyer defines discrete orthonormal wavelet bases. 1989: S. Mallat links wavelets to the theory of “multiresolution

Model Fitting - MCMC

Can marginalize over the model parameters

Updateγ by Metropolis algorithm.

Classify new samples based on selected coeffs

Marina Vannucci (Rice University, USA) Bayesian Variable Selection (Part 5) ABS13-Italy 06/17-21/2013 51 / 56

Page 52: Bayesian Methods for Variable Selection with Applications ...€¦ · 1985: Y. Meyer defines discrete orthonormal wavelet bases. 1989: S. Mallat links wavelets to the theory of “multiresolution

An Example

NIR spectra used in analysis of food and drink and pharmaceuticalproducts, measured at hundreds of wavelengths.

5 varieties of meat (Beef, Chicken, Lamb, Pork and Turkey), 231observations. NIR spectra with 1024 reflectances over the range400-2498nm at intervals of 2nm.

Aim: Use the spectral data to predict the meat variety.

Marina Vannucci (Rice University, USA) Bayesian Variable Selection (Part 5) ABS13-Italy 06/17-21/2013 52 / 56

Page 53: Bayesian Methods for Variable Selection with Applications ...€¦ · 1985: Y. Meyer defines discrete orthonormal wavelet bases. 1989: S. Mallat links wavelets to the theory of “multiresolution

500 1000 1500 2000 2500

0.8

1

1.2

1.4

1.6

Beef

Wavelength (nm)

Abs

orba

nce

500 1000 1500 2000 2500

0.8

1

1.2

1.4

1.6

Lamb

Wavelength (nm)

Abs

orba

nce

500 1000 1500 2000 2500

0.8

1

1.2

1.4

1.6

Pork

Wavelength (nm)

Abs

orba

nce

500 1000 1500 2000 2500

0.8

1

1.2

1.4

1.6

Turkey

Wavelength (nm)

Abs

orba

nce

500 1000 1500 2000 2500

0.8

1

1.2

1.4

1.6

Chicken

Wavelength (nm)

Abs

orba

nce

Marina Vannucci (Rice University, USA) Bayesian Variable Selection (Part 5) ABS13-Italy 06/17-21/2013 53 / 56

Page 54: Bayesian Methods for Variable Selection with Applications ...€¦ · 1985: Y. Meyer defines discrete orthonormal wavelet bases. 1989: S. Mallat links wavelets to the theory of “multiresolution

Results

PredictedTruth Beef Lamb Pork Turkey ChickenBeef 100.0 0.0 0.0 0.0 0.0Lamb 5.9 (1) 94.1 0.0 0.0 0.0Pork 0.0 0.0 100.0 0.0 0.0Turkey 0.0 0.0 0.0 92.6 7.4 (2)Chicken 0.0 0.0 0.0 3.7 (1) 96.3

Table :Validation set: classification results for the five different species of meat usinga threshold of 0.4 for the posterior of inclusion (14 coeffs). The number ofmisclassified units is reported in parentheses (3.4% total mis-clas rate).

LDA and QDA on selected (11-14) PCA components achieve amis-classification rate of 5.3% and 6.1%.

Marina Vannucci (Rice University, USA) Bayesian Variable Selection (Part 5) ABS13-Italy 06/17-21/2013 54 / 56

Page 55: Bayesian Methods for Variable Selection with Applications ...€¦ · 1985: Y. Meyer defines discrete orthonormal wavelet bases. 1989: S. Mallat links wavelets to the theory of “multiresolution

0 200 400 600 800 1000

0.0

0.2

0.4

0.6

0.8

1.0

Wavelet coefficients

Pos

t. P

rob.

Marina Vannucci (Rice University, USA) Bayesian Variable Selection (Part 5) ABS13-Italy 06/17-21/2013 55 / 56

Page 56: Bayesian Methods for Variable Selection with Applications ...€¦ · 1985: Y. Meyer defines discrete orthonormal wavelet bases. 1989: S. Mallat links wavelets to the theory of “multiresolution

Main References

Brown, P.J., Fearn, T. and Vannucci, M. (2001). Bayesian waveletregression on curves with application to a spectroscopic calibrationproblem.Journal of the American Statistical Association, 96, 398-408.

Vannucci, M., Sha, N. and Brown, P.J. (2005). NIR and mass spectraclassification: Bayesian methods for wavelet-based feature selection.Chemometrics and Intelligent Laboratory Systems, 77, 139–148.

Stingo, F.C., Vannucci, M. and Downey, G. (2011). BayesianWavelet-based Curve Classification via Discriminant Analysis withMarkov Random Tree Priors.Statistica Sinica, 22, 465–488.

Marina Vannucci (Rice University, USA) Bayesian Variable Selection (Part 5) ABS13-Italy 06/17-21/2013 56 / 56