Transcript of "Regression for Proportion Data" by Julian Center, presented July 10, 2007.
Regression for Proportion Data

Julian Center
Creative Research Corp.
Andover, MA, USA

MaxEnt2007
Overview

- Introduction: What is proportion data? What do we mean by regression? Examples. Why should you care?
- Coordinate transformation to facilitate regression
- Measurement models: multinomial; Laplace approximation to the multinomial; log-normal
- Regression models: kernel regression (the Nadaraya-Watson model); Gaussian process regression, with log-normal measurements and with multinomial measurements (Expectation Propagation)
- Conclusion
What is Proportion Data?

- Proportion data = compositional data = categorical data.
- Proportion data: a $(d+1)$-dimensional vector $r$ of relative proportions of items assigned to one of $d+1$ categories, similar to a discrete probability distribution.
- In mathematical terms, $r$ is confined to the $d$-simplex:

$r \in S_d = \left\{ r \in \mathbb{R}_+^{d+1} : 1_{(d+1)}^T r = 1 \right\}$

Here $1_{(d+1)}$ is the $(d+1)$-dimensional vector of all ones, i.e. $\left[1_{(d+1)}\right]_i = 1$ for all $i$.
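To make the definition concrete, here is a minimal numpy sketch of a simplex-membership check (the function name `in_simplex` and the tolerance are illustrative, not from the talk):

```python
import numpy as np

def in_simplex(r, tol=1e-9):
    # r lies in the d-simplex S_d iff all entries are nonnegative
    # and the entries sum to one (up to numerical tolerance).
    r = np.asarray(r, dtype=float)
    return bool(np.all(r >= -tol) and abs(r.sum() - 1.0) <= tol)
```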
What is Regression?

- Regression = smoothing + calibration + interpolation.
- Relates data gathered under one set of conditions to data gathered under similar, but different, conditions.
- Accounts for measurement "noise".
- Determines $p(r \mid x)$.
Examples

- Geostatistics: composition of rock samples at different locations.
- Medicine: response to different levels of treatment.
- Political science: opinion polls across different demographic groups.
- Climate research: infer climate history from fossil pollen samples; calibrate the model using present-day samples from known climates. Typically, examine 400 pollen grains and sort them into 14 categories.
Why Should You Care?

- Either you have proportion data to analyze,
- or you want to do pattern classification,
- or you want to apply a similar approach to your problem:
  - transforming constrained variables so that a Laplace approximation makes sense,
  - two different regression techniques,
  - Expectation Propagation for improving model fit.
Coordinate Transformation

- Well-known regression methods can't deal with the pesky constraints of the simplex.
- We need a one-to-one mapping between the $d$-simplex and $d$-dimensional real vectors.
- Then we can model probability distributions on real vectors and relate them to distributions on the simplex.
Coordinate Transformation

We can establish a one-to-one mapping between $S_d$ and $\mathbb{R}^d$ by

$\mathrm{sm} : \mathbb{R}^d \to S_d, \quad \mathrm{sm}(f) = \left[\, 1_{(d+1)}^T \exp\!\left(T^T f\right) \right]^{-1} \exp\!\left(T^T f\right)$  (the symmetric softmax activation function)

$\mathrm{clr} : S_d \to \mathbb{R}^d, \quad \mathrm{clr}(y) = T \ln(y)$  (the centered log-ratio link function)

where $T$ is a $d \times (d+1)$ matrix that satisfies

$T T^T = I_d, \qquad T 1_{(d+1)} = 0, \qquad T^T T + \tfrac{1}{d+1}\, 1_{(d+1)} 1_{(d+1)}^T = I_{(d+1)}$

The rows of $T$ span the orthogonal complement of $1_{(d+1)}$. We can always find $T$ by the Gram-Schmidt process.
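The transformation pair can be sketched in numpy. This is an illustrative implementation, not code from the talk; `make_T` builds $T$ from the SVD of the centering matrix, which yields an orthonormal basis of the required complement just as the Gram-Schmidt process would:

```python
import numpy as np

def make_T(d):
    # Rows of T form an orthonormal basis of the orthogonal complement
    # of the all-ones vector in R^(d+1); the SVD of the centering matrix
    # plays the role of the Gram-Schmidt process here.
    J = np.eye(d + 1) - np.full((d + 1, d + 1), 1.0 / (d + 1))
    return np.linalg.svd(J)[2][:d]    # d x (d+1): T T^T = I, T 1 = 0

def sm(f, T):
    # Symmetric softmax: R^d -> interior of the d-simplex.
    e = np.exp(T.T @ f)
    return e / e.sum()

def clr(y, T):
    # Centered log-ratio link: d-simplex -> R^d; inverse of sm on the
    # interior because T annihilates the all-ones direction.
    return T @ np.log(y)
```

Because $T 1_{(d+1)} = 0$, the softmax's insensitivity along the all-ones direction is factored out, and `clr(sm(f), T)` recovers `f` exactly.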
Coordinate Transformation

[Figure: the simplex in $(y_1, y_2)$ coordinates and its image under $\ln$ in the $(\ln y_1, \ln y_2)$ plane; $f$ parameterizes the image of the simplex, and the softmax is insensitive to displacement along the all-ones direction.]
Measurement Models

- Multinomial
- Log-Normal
Measurement Model - Multinomial -

Assume that the proportion vector $r_i$ comes from $S_i$ independent samples from the discrete probability distribution represented by the vector $y_i$:

$p(r_i \mid y_i) = M_{S_i}(r_i \mid y_i), \qquad M_S(r \mid y) \triangleq \frac{S!}{\prod_k \left([S r]_k\right)!} \prod_k \left([y]_k\right)^{[S r]_k}$

To get the likelihood function for $f = \mathrm{clr}(y)$, we take into account the Jacobian of the transformation, $\prod_k [y]_k$. The log-likelihood function corresponding to $f$ is

$\ell_i(f) = (S_i + d + 1)\, \tilde{r}_i^T \ln(y) + \mathrm{const}, \qquad \tilde{r}_i = \frac{S_i r_i + 1_{(d+1)}}{S_i + d + 1}$
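As a sketch of the two formulas above (the names `r_tilde` and `log_lik` are mine), the smoothed proportion vector and the log-likelihood in $y$ can be computed directly; note that the Jacobian term is exactly what turns the counts $S r$ into $S r + 1$:

```python
import numpy as np

def r_tilde(r, S):
    # Pseudo-count-smoothed proportions (S r + 1_{(d+1)}) / (S + d + 1);
    # note len(r) = d + 1.
    r = np.asarray(r, dtype=float)
    return (S * r + 1.0) / (S + r.size)

def log_lik(y, r, S):
    # Multinomial log-likelihood in y, up to an f-independent constant,
    # including the Jacobian term prod_k y_k of the clr transformation.
    y = np.asarray(y, dtype=float)
    return (S + y.size) * r_tilde(r, S) @ np.log(y)
```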
Multinomial Measurement Model

[Figure: binomial likelihood functions of $f$ for $S = 400$ and observed proportions $r_1 = 0$, 0.0025, 0.005, 0.01, 0.02, 0.05, 0.07, 0.1, 0.2, 0.3, 0.5.]
Measurement Model - Laplace Approximation -

- Some regression methods assume a Gaussian measurement model.
- Therefore, we are tempted to approximate each multinomial measurement with a Gaussian measurement.
- Let's try a Laplace approximation to each measurement:
  - Find the peak of the log-likelihood function.
  - Pick a Gaussian centered at the peak, with an inverse covariance matrix that matches the negative second derivative of the log-likelihood function at the peak.
  - Pick an amplitude factor to match the height of the peak.
Measurement Model - Laplace Approximation -

The value of $f$ that maximizes the log-likelihood is

$m_i = T \ln(\tilde{r}_i)$

The Laplace approximation to a single measurement is

$L_i(f) = a_i\, N(f \mid m_i, V_i) = a_i\, |2\pi V_i|^{-1/2} \exp\!\left[ -\tfrac{1}{2} (f - m_i)^T V_i^{-1} (f - m_i) \right]$

where

$a_i = |2\pi V_i|^{1/2}\, \frac{S_i!}{\prod_k \left([S_i r_i]_k\right)!}\, \exp[\ell_i(m_i)]$

$V_i^{-1} = (S_i + d + 1)\, T \left[ \mathrm{Diag}(\tilde{r}_i) - \tilde{r}_i \tilde{r}_i^T \right] T^T$
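A hedged numpy sketch of the peak and covariance formulas (assuming a valid $T$ matrix is supplied; `laplace_params` is an illustrative name, not from the talk):

```python
import numpy as np

def laplace_params(r, S, T):
    # Peak m_i and covariance V_i of the Laplace approximation, with
    # r_tilde = (S r + 1) / (S + d + 1) and
    # V^{-1} = (S + d + 1) T [Diag(r_tilde) - r_tilde r_tilde^T] T^T.
    r = np.asarray(r, dtype=float)
    rt = (S * r + 1.0) / (S + r.size)
    m = T @ np.log(rt)
    Vinv = (S + r.size) * T @ (np.diag(rt) - np.outer(rt, rt)) @ T.T
    return m, np.linalg.inv(Vinv)
```

Because every entry of `rt` is strictly positive, the restricted matrix is positive definite, so the inverse always exists.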
Laplace Approximation to Multinomial

[Figures: the Laplace approximation to $p(f)$ overlaid on the exact multinomial likelihood for $r_1$ = 0/400, 1/400, 2/400, 4/400, 80/400, and 120/400.]
Measurement Model - Log-Normal -

- General log-normal model form: $L_i(f) = a_i\, N(f \mid m_i, V_i)$
- Can match the Laplace approximation to the multinomial.
- Can do much more, e.g. model over-dispersion or under-dispersion.
- Basis for the regression methods that follow.
Regression Models

- A way of relating data taken under different conditions.
- Intuition: similar conditions should produce similar data.
- The best method to use depends on the problem.
- Two methods considered here: the Nadaraya-Watson model and the Gaussian process model.
Nadaraya-Watson Model

Based on applying Parzen density estimation to the joint distribution of $f$ and $x$.

General form:

$p(f, x) = \sum_{j=1}^{n} w_j\, p(f, x \mid j)$

Simplified model:

$p(f, x \mid j) = N\!\left(f \mid \hat{f}_j, B_j\right) N\!\left(x \mid x_j, D_j\right)$
Nadaraya-Watson Model

This model implies that

$p(x) = \sum_{j=1}^{n} w_j\, p(x \mid j), \qquad p(x \mid j) = N\!\left(x \mid x_j, D_j\right)$

$p(f \mid x) = \frac{p(f, x)}{p(x)} = \sum_{j=1}^{n} c_j(x)\, N\!\left(f \mid \hat{f}_j, B_j\right), \qquad c_j(x) = \frac{w_j\, p(x \mid j)}{p(x)}$
Nadaraya-Watson Model

To determine the distribution for a new measurement, we compute

$p(r \mid x) = \int p(r \mid f)\, p(f \mid x)\, df = \sum_{j=1}^{n} c_j(x) \int p(r \mid f)\, N\!\left(f \mid \hat{f}_j, B_j\right) df$

If we use the Laplace approximation to the multinomial, we can solve the integrals analytically to get

$p(r \mid x) = a \sum_{j=1}^{n} c_j(x)\, N\!\left(m \mid \hat{f}_j, B_j + V\right)$

where $a$, $m$, and $V$ are computed from $r$ as described above. Otherwise, we can use stochastic integration to compute the integrals.
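The mixture form of this predictive density can be sketched as follows (all names are illustrative; `gauss` is a plain multivariate normal density, and the amplitude factor $a$ is omitted since it does not affect relative comparisons):

```python
import numpy as np

def gauss(x, mu, Sig):
    # Multivariate normal density N(x | mu, Sig).
    d = len(mu)
    diff = x - mu
    return np.exp(-0.5 * diff @ np.linalg.solve(Sig, diff)) / \
        np.sqrt((2 * np.pi) ** d * np.linalg.det(Sig))

def nw_predictive(m, V, x, w, fhat, B, xs, D):
    # p(r | x) propto sum_j c_j(x) N(m | fhat_j, B_j + V), where
    # c_j(x) = w_j N(x | x_j, D_j) / sum_l w_l N(x | x_l, D_l),
    # and (m, V) come from the Laplace approximation of r.
    cx = np.array([w[j] * gauss(x, xs[j], D[j]) for j in range(len(w))])
    cx /= cx.sum()
    return sum(cx[j] * gauss(m, fhat[j], B[j] + V) for j in range(len(w)))
```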
Nadaraya-Watson Model

- Problem: we must compare a new point to every training point.
- Solution:
  - Choose a sparse set of "knots", and center density components only on the knots.
  - Adjust weights and covariances by "diagnostic training".
  - Mixture-model training tools apply.
Gaussian Process Model

- A probability distribution on functions.
- Specified by a mean function $m(x)$ and covariance kernel $k(x_1, x_2)$.
- For any finite collection of points, the corresponding function values are jointly Gaussian.
Applying Gaussian Process Regression to Proportion Data

- Prior: model each component of $f(x)$ as a zero-mean Gaussian process with covariance kernel $k(x_1, x_2)$, and assume that the components of $f$ are independent of each other.
- Posterior: use the Laplace approximations to the measurements and apply Kalman filter methods.
- Use Expectation Propagation to improve the fit.
Sparse Gaussian Process Model

Choose a subset of $K$ training points to act as knots. Rearrange the latent function values at the knots into one large vector $g$, grouped by component:

$[g]_{(k-1)K + i} \triangleq [f(x_i)]_k, \qquad i \in \{1, 2, \dots, K\}, \quad k \in \{1, 2, \dots, d\}$

i.e. $g$ stacks the rows of the $d \times K$ array

$\begin{bmatrix} [f(x_1)]_1 & [f(x_2)]_1 & \cdots & [f(x_K)]_1 \\ [f(x_1)]_2 & [f(x_2)]_2 & \cdots & [f(x_K)]_2 \\ \vdots & \vdots & \ddots & \vdots \\ [f(x_1)]_d & [f(x_2)]_d & \cdots & [f(x_K)]_d \end{bmatrix}$
Sparse Gaussian Process Model

Under our assumptions, the prior is $p(g) = N(g \mid 0, G)$, where

$G \triangleq I_d \otimes C = \begin{bmatrix} C & 0 & \cdots & 0 \\ 0 & C & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & C \end{bmatrix}, \qquad [C]_{ij} \triangleq k\!\left(x_i, x_j\right), \quad i, j \in \{1, 2, \dots, K\}$
Sparse Gaussian Process Model

$p(f(x) \mid g) = N\!\left[ f(x) \mid H(x)\, g,\; v(x)\, I_d \right]$

where

$H(x) \triangleq I_d \otimes \left[ k(x)^T C^{-1} \right]$

$v(x) \triangleq k(x, x) - k(x)^T C^{-1} k(x)$

$[k(x)]_i \triangleq k(x, x_i), \qquad i \in \{1, 2, \dots, K\}$

We can express this by the equation

$f(x) = H(x)\, g + u(x)$

where $u(x) \sim N[0, v(x)\, I_d]$ and $u(x)$ is independent of $g$.
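A sketch of $H(x)$ and $v(x)$ for a given kernel and knot set, assuming the component-major stacking of $g$ described above (the function name and argument layout are illustrative):

```python
import numpy as np

def sparse_gp_projection(x, knots, kern, d):
    # H(x) = I_d kron (k(x)^T C^{-1}) and
    # v(x) = k(x, x) - k(x)^T C^{-1} k(x)
    # for knot locations `knots` and covariance kernel `kern`.
    C = np.array([[kern(xi, xj) for xj in knots] for xi in knots])
    kx = np.array([kern(x, xi) for xi in knots])
    w = np.linalg.solve(C, kx)             # k(x)^T C^{-1}
    H = np.kron(np.eye(d), w[None, :])     # d x (K d)
    v = kern(x, x) - kx @ w
    return H, v
```

At a knot, `w` reduces to a unit vector, so $v(x) = 0$ and $H(x)$ simply selects the corresponding elements of $g$, matching the remark on the next slide.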
Sparse Gaussian Process Model

In particular, the values of the latent function at the training points can be expressed as

$f_i = H_i\, g + u_i$

where $H_i = H(x_i)$ and $u_i = u(x_i)$. To simplify computations, we assume that $u_i$ is independent of $u_j$ for $i \neq j$. Note that if $x_i$ is one of the knots, i.e. $i \leq K$, then $u_i = 0$ and $H_i$ is a $d \times Kd$ sparse matrix that simply selects the appropriate elements of $g$.
GP - Log-Normal Model -

Using the log-normal measurement model,

$p(r_i \mid g) = \int a_i\, N(f_i \mid m_i, V_i)\, N(f_i \mid H_i g, v_i I_d)\, df_i = a_i\, N(m_i \mid H_i g, R_i)$

where $R_i = V_i + v_i I_d$. Thus everything is Gaussian, and therefore $p(g \mid \mathcal{T}) = N(g \mid \hat{g}, P)$, where $\mathcal{T}$ denotes the training data.
GP - Log-Normal Model -

We can determine $\hat{g}$ and $P$ by the Kalman filter algorithm:

(1) Start with

$\hat{g} \leftarrow 0, \qquad P \leftarrow G$

(2) For $i = 1$ to $n$, iterate

$K_i \leftarrow P H_i^T \left( H_i P H_i^T + R_i \right)^{-1}$

$\hat{g} \leftarrow \hat{g} + K_i \left( m_i - H_i \hat{g} \right)$

$P \leftarrow P - K_i H_i P$

If we believe that the log-normal measurement model is correct, then we are finished after one pass through all the training data.
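The filter pass above can be sketched directly (an illustrative implementation; for large problems one would use Cholesky solves and a symmetric Joseph-form update rather than explicit inverses):

```python
import numpy as np

def gp_posterior(G, H_list, R_list, m_list):
    # One Kalman-filter pass: start from the prior N(0, G) and absorb
    # each Gaussianized measurement N(m_i | H_i g, R_i) in turn.
    ghat = np.zeros(G.shape[0])
    P = G.copy()
    for H, R, m in zip(H_list, R_list, m_list):
        S = H @ P @ H.T + R                  # innovation covariance
        K = P @ H.T @ np.linalg.inv(S)       # Kalman gain
        ghat = ghat + K @ (m - H @ ghat)
        P = P - K @ H @ P
    return ghat, P
```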
GP - Log-Normal Model -

We can compute the evidence by

$p(\mathcal{T}) = \left[ \prod_{i=1}^{n} a_i\, N(0 \mid m_i, R_i) \right] N(0 \mid 0, G)\, \left[ N(0 \mid \hat{g}, P) \right]^{-1}$

We can determine the probability distribution of seeing a new measurement $r$ at $x$ by

$p(r \mid x, \mathcal{T}) = a\, N\!\left[ m \mid H(x)\, \hat{g},\; V + v(x) I_d + H(x)\, P\, H(x)^T \right]$

where $a$, $m$, and $V$ are computed from $r$ as before.
GP - Multinomial Model -

If we believe that the measurement model is really multinomial, we can get a more accurate approximation using the Expectation Propagation (EP) algorithm. As before, we approximate the joint distribution $p(r_1, r_2, \dots, r_n, g)$ by the form

$q(g) = \prod_i a_i\, N\!\left(H_i g \mid m_i, R_i\right)\, N(g \mid 0, G)$

Now our aim is to adjust the $a_i$'s, $m_i$'s, and $R_i$'s to minimize the Kullback-Leibler divergence

$D(p \,\|\, q) = -\int \ln\!\left( \frac{q(g)}{p(g)} \right) p(g)\, dg$
Expectation Propagation Method

To minimize $D(p \,\|\, q)$, we iteratively choose a measurement $i$ and minimize $D(p^* \,\|\, q^*)$, where

$p^*(g) = \frac{p(r_i \mid g)}{a_i\, N(H_i g \mid m_i, R_i)}\, q(g)$

$q^*(g) = \frac{a_i^*\, N(H_i g \mid m_i^*, R_i^*)}{a_i\, N(H_i g \mid m_i, R_i)}\, q(g)$

We can accomplish this by choosing $a_i^*$, $m_i^*$, and $R_i^*$ so that the moments of $q^*(g)$ match those of $p^*(g)$.
Expectation Propagation Method

To approximate the moments, we compute

$a_i^* \approx \frac{1}{N} \sum_{s=1}^{N} \frac{p\!\left(r_i \mid h^{(s)}\right)}{N\!\left(h^{(s)} \mid m_i, R_i\right)}$

$\hat{h} \approx \frac{1}{a_i^* N} \sum_{s=1}^{N} h^{(s)}\, \frac{p\!\left(r_i \mid h^{(s)}\right)}{N\!\left(h^{(s)} \mid m_i, R_i\right)}$

$W \approx \frac{1}{a_i^* N} \sum_{s=1}^{N} h^{(s)} h^{(s)T}\, \frac{p\!\left(r_i \mid h^{(s)}\right)}{N\!\left(h^{(s)} \mid m_i, R_i\right)} - \hat{h}\hat{h}^T$

where the samples are drawn from the current marginal,

$h^{(s)} \sim N\!\left(H_i \hat{g},\; H_i P H_i^T\right)$
Expectation Propagation Method

To get $q^*$ to have the same moments as $p^*$, we choose

$R_i^{*-1} = R_i^{-1} + W^{-1} - \left( H_i P H_i^T \right)^{-1}$

$m_i^* = R_i^* \left[ R_i^{-1} m_i + W^{-1} \hat{h} - \left( H_i P H_i^T \right)^{-1} H_i \hat{g} \right]$
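A sketch of this moment-matching step (illustrative names; `HPH` stands for $H_i P H_i^T$ and `Hg` for $H_i \hat{g}$, both of which the caller is assumed to have already formed):

```python
import numpy as np

def ep_site_update(R, m, W, hhat, HPH, Hg):
    # Moment-matching update for one EP site:
    #   R*^{-1} = R^{-1} + W^{-1} - (H P H^T)^{-1}
    #   m*      = R* [ R^{-1} m + W^{-1} hhat - (H P H^T)^{-1} H ghat ]
    Rinv = np.linalg.inv(R)
    Winv = np.linalg.inv(W)
    Minv = np.linalg.inv(HPH)
    Rstar = np.linalg.inv(Rinv + Winv - Minv)
    mstar = Rstar @ (Rinv @ m + Winv @ hhat - Minv @ Hg)
    return mstar, Rstar
```

As a sanity check, when the matched moments equal the cavity moments (W = $H_i P H_i^T$, $\hat{h} = H_i \hat{g}$), the site is unchanged.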
Expectation Propagation Method

If $x_i$ is one of the knots,

$p\!\left(r_i \mid h^{(s)}\right) = M_{S_i}\!\left(r_i \mid h^{(s)}\right)$

Otherwise, we approximate it by

$p\!\left(r_i \mid h^{(s)}\right) = \int M_{S_i}\!\left(r_i \mid h^{(s)} + u\right) N(u \mid 0, v_i I_d)\, du \approx \frac{1}{N_u} \sum_{t=1}^{N_u} M_{S_i}\!\left(r_i \mid h^{(s)} + u^{(t)}\right), \qquad u^{(t)} \sim N(0, v_i I_d)$
Expectation Propagation Method

Now we can update the smoother parameters. If $R_i^{*-1} = R_i^{-1}$, then the error covariance $P$ does not change, and we update the estimate of $g$ by

$\hat{g} \leftarrow \hat{g} + P H_i^T R_i^{-1} \left( m_i^* - m_i \right)$

Otherwise, we use

$R_\Delta \leftarrow \left( R_i^{*-1} - R_i^{-1} \right)^{-1}$

$K \leftarrow P H_i^T \left( H_i P H_i^T + R_\Delta \right)^{-1}$

$P \leftarrow P - K H_i P$

$\hat{g} \leftarrow \hat{g} + K \left[ R_\Delta \left( R_i^{*-1} m_i^* - R_i^{-1} m_i \right) - H_i \hat{g} \right]$
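Both branches of the update can be sketched together (illustrative names; `Rstar`/`mstar` are the matched site parameters, and $R_\Delta$ may legitimately be negative definite when EP weakens a site):

```python
import numpy as np

def ep_global_update(ghat, P, H, R, Rstar, m, mstar):
    # Fold the revised site (mstar, Rstar) back into the global N(ghat, P).
    Rinv = np.linalg.inv(R)
    Rsinv = np.linalg.inv(Rstar)
    if np.allclose(Rsinv, Rinv):
        # Covariance unchanged; shift the mean only.
        return ghat + P @ H.T @ Rinv @ (mstar - m), P
    Rd = np.linalg.inv(Rsinv - Rinv)                 # R_Delta
    K = P @ H.T @ np.linalg.inv(H @ P @ H.T + Rd)
    P_new = P - K @ H @ P
    ghat_new = ghat + K @ (Rd @ (Rsinv @ mstar - Rinv @ m) - H @ ghat)
    return ghat_new, P_new
```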
Expectation Propagation Method

Finally, we replace the parameters for measurement $i$,

$a_i \leftarrow a_i^*, \qquad m_i \leftarrow m_i^*, \qquad R_i \leftarrow R_i^*$

and go on to the next iteration.
Choosing the Regression Model

If you have two samplings taken under the same conditions, do you want to treat them as coming from a bimodal distribution (the NW model) or combine them into one big sampling (the GP model)?
Conclusion

- A coordinate transformation makes it possible to analyze proportion data with known regression methods.
- The multinomial distribution can be well approximated by a Gaussian on the transformed variable.
- The choice of regression model depends on the effect that you want: a multimodal versus a unimodal fit.