Transcript of "Bridging the two cultures: Latent variable statistical modeling with boosted regression trees"
Thomas G. Dietterich and Rebecca Hutchinson
11/15/2012 ESA 2012 1
Oregon State University
Corvallis, Oregon, USA
A Species Distribution Modeling Problem: eBird data
12 bird species
3 synthetic species
3124 observations from New York State, May-July 2006-2008
23 covariates
Two Cultures
Probabilistic Graphical Models: Occupancy Models (MacKenzie et al., 2002)
Flexible Nonparametric Models: Boosted Regression Trees (Friedman, 2001; Elith et al., 2006; Elith, Leathwick & Hastie, 2008)
Occupancy-Detection Model (MacKenzie et al., 2006)
[Plate diagram: sites i = 1, …, M; visits t = 1, …, T. Occupancy features X_i (e.g. elevation, vegetation) determine the occupancy probability o_i; detection features W_it (e.g. time of day, effort) determine the detection probability d_it.]
True (latent) presence/absence: Z_i ~ Bern(o_i), where o_i is a function of X_i
Observed presence/absence: Y_it | Z_i ~ Bern(Z_i d_it), where d_it is a function of W_it
Parameterizing the model
z_i ~ P(z_i | x_i): Species Distribution Model
  P(Z_i = 1 | X_i) = o_i = F(X_i)   "occupancy probability"
y_it ~ P(y_it | z_i, w_it): Observation model
  P(Y_it = 1 | Z_i, W_it) = Z_i d_it
  d_it = G(W_it)   "detection probability"
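The generative model above is easy to simulate. The sketch below draws latent occupancy Z_i ~ Bern(o_i) and observations Y_it ~ Bern(Z_i d_it); the covariates and the particular forms of F and G here are illustrative assumptions, not the talk's fitted functions.

```python
import numpy as np

rng = np.random.default_rng(0)
M, T = 1000, 3  # sites, visits per site

# Hypothetical covariates: one occupancy feature, one detection feature.
x = rng.normal(size=M)          # e.g. standardized elevation
w = rng.normal(size=(M, T))     # e.g. standardized time of day

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

o = sigmoid(1.5 * x)            # occupancy probability o_i = F(X_i)
d = sigmoid(0.5 - w)            # detection probability d_it = G(W_it)

z = rng.binomial(1, o)                # latent true presence Z_i ~ Bern(o_i)
y = rng.binomial(1, z[:, None] * d)   # observed Y_it ~ Bern(Z_i * d_it)

# Some sites are occupied yet never detected on any visit:
frac_missed = ((z == 1) & (y.sum(axis=1) == 0)).mean()
```

Note that detections can only occur at occupied sites, so `y` is identically zero wherever `z == 0`; the latent variable is what separates "absent" from "present but missed."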
Standard Approach: Log-linear (logistic regression) models

log [F(X_i) / (1 − F(X_i))] = β_0 + β_1 X_i1 + … + β_J X_iJ
log [G(W_it) / (1 − G(W_it))] = α_0 + α_1 W_it1 + … + α_K W_itK

Fit via maximum likelihood.
Can apply hypothesis tests to assess the importance of covariates:
H_0: β_1 = 0
H_a: β_1 > 0
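Maximum-likelihood fitting works with the marginal likelihood of each site's detection history, which sums out the latent Z_i: P(y_i) = o_i ∏_t d_it^y_it (1 − d_it)^(1−y_it) + (1 − o_i) I[all y_it = 0]. A minimal sketch of that computation (function name and interface are my own, not from the talk):

```python
import numpy as np

def site_log_likelihood(o_i, d_i, y_i):
    """Marginal log-likelihood of one site's detection history,
    summing out the latent occupancy state Z_i."""
    d_i = np.asarray(d_i, dtype=float)
    y_i = np.asarray(y_i, dtype=float)
    # P(history | occupied): product of per-visit Bernoulli terms
    p_hist_occ = np.prod(d_i**y_i * (1.0 - d_i)**(1.0 - y_i))
    # P(history | empty): 1 if never detected, else 0
    p_hist_emp = 1.0 if y_i.sum() == 0 else 0.0
    return np.log(o_i * p_hist_occ + (1.0 - o_i) * p_hist_emp)
```

For example, an all-zero history over three visits with o = 0.5 and d = 0.8 has marginal probability 0.5 · 0.2³ + 0.5 = 0.504; a model that fits via maximum likelihood sums these site-level terms over all sites.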
Results on Synthetic Species with Nonlinear Interactions
Predictions exhibit high variance because the model cannot fit the nonlinearities well.
A Flexible Predictive Model
Predict the observation y_it from the combination of occupancy covariates x_i and detection covariates w_it.
Boosted Regression Trees:

log [P(y_it = 1 | x_i, w_it) / P(y_it = 0 | x_i, w_it)] = β_1 tree_1(x_i, w_it) + … + β_L tree_L(x_i, w_it)

Fitted via functional gradient descent.
Model complexity is tuned to the complexity of the data:
Number of trees
Depth of each tree
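As a sketch of this baseline, scikit-learn's `GradientBoostingClassifier` can stand in for the boosted-tree learner: each visit becomes one row whose features concatenate the site covariates x_i with the visit covariates w_it. The simulated data and hyperparameters here are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(1)
M, T = 500, 3
x = rng.normal(size=(M, 2))        # occupancy covariates x_i
w = rng.normal(size=(M, T, 2))     # detection covariates w_it
z = rng.binomial(1, 1.0 / (1.0 + np.exp(-x[:, 0])))
d = 1.0 / (1.0 + np.exp(-w[..., 0]))
y = rng.binomial(1, z[:, None] * d)

# One row per visit: site covariates repeated T times, joined with
# the matching visit covariates.
X = np.concatenate([np.repeat(x, T, axis=0), w.reshape(M * T, 2)], axis=1)
Y = y.ravel()

brt = GradientBoostingClassifier(n_estimators=200, max_depth=3,
                                 learning_rate=0.05).fit(X, Y)
```

The key limitation the slides point out follows directly from this setup: the model predicts P(y_it = 1 | x_i, w_it), the probability of an *observation*, with no latent variable separating occupancy from detection.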
Results
Systematically biased because it does not capture the latent occupancy: it underestimates occupancy at occupied sites in order to fit the detection failures.
Much lower variance than the Occupancy-Detection model, because it can handle the nonlinearities.
Two Cultures: Summary

Probabilistic Graphical Models
Advantages: supports latent variables; supports hypothesis tests on meaningful parameters.
Disadvantages: model must be carefully designed (interactions? nonlinearities?); data must be transformed to match modeling assumptions (linearity, Gaussianity); model has fixed complexity, so it either under-fits or over-fits.

Flexible Nonparametric Models
Advantages: model complexity adapts to data complexity; easy to use "off-the-shelf".
Disadvantages: cannot support latent variables; cannot provide parametric hypothesis tests.
The Dream
Probabilistic Graphical Models + Flexible Nonparametric Models → Flexible Nonparametric Probabilistic Models
A Simple Idea: Parameterize F and G as boosted trees

log [F(X) / (1 − F(X))] = f_0(X) + η_1 f_1(X) + … + η_L f_L(X)
log [G(W) / (1 − G(W))] = g_0(W) + η_1 g_1(W) + … + η_L g_L(W)

Perform functional gradient descent in F and G.
Interpreting Non-Parametric Models: Partial Dependence Plots
Simulate manipulating one variable (e.g., distance of survey) and visualize the predicted response.
[Figure: partial dependence plot of the predicted response vs. distance of survey]
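The procedure described above — clamp one variable to each value on a grid, average the model's predictions over the data, and plot the result — can be sketched as follows. The data-generating function and the model here are synthetic stand-ins, not the talk's species models.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(2)
X = rng.uniform(-2, 2, size=(400, 3))
y = np.sin(2 * X[:, 0]) + X[:, 1]**2 + rng.normal(scale=0.1, size=400)
model = GradientBoostingRegressor(n_estimators=300, max_depth=2).fit(X, y)

def partial_dependence(model, X, j, grid):
    """Average prediction when feature j is clamped to each grid value."""
    out = []
    for v in grid:
        Xv = X.copy()
        Xv[:, j] = v           # simulate manipulating one variable
        out.append(model.predict(Xv).mean())
    return np.array(out)

grid = np.linspace(-2, 2, 25)
pd0 = partial_dependence(model, X, 0, grid)
# pd0 should trace sin(2v) plus a constant offset
```

Averaging over the other covariates is what lets a flexible, otherwise opaque model be read one variable at a time, which is how the OD-BRT detection plots on the next slides are produced.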
Synthetic Species 3
OD-BRT correctly captures the bimodal detection probability.
[Figure: partial dependence of detection probability vs. time of day]
Summary: We can have our cake (latent variables, interpretable submodels) and eat it too (flexible, easy-to-use modeling tools).
Probabilistic Graphical Models + Flexible Nonparametric Models → Flexible Nonparametric Probabilistic Models
• Easier to use
• More accurate

Concluding Remarks
With limited data, the most accurate predictive model is much simpler than the "true model."
Predictive accuracy on a single data set is not a sufficient criterion for a scientific model.
[Plot: accuracy vs. model complexity, marking the most accurate predictive model and the predictive accuracy of the "true model"]
Acknowledgements
Liping Liu: Boosted Regression Trees in OD models
Steve Kelling and colleagues at the Cornell Lab of Ornithology
National Science Foundation Grants 0083292, 0307592, 0832804, and 0905885
Regression Trees
Interactions are captured by the if-then-else structure of the tree.
Nonlinearities are approximated by piecewise-constant functions.
A tree can be flattened into a linear model. Example: split on x_1 ≥ 3, then on x_2 ≥ 0 in each branch, with leaf values f_1 = −5, 3, 8, 1:

f_1 = −5 · I(x_1 ≥ 3, x_2 ≥ 0) + 3 · I(x_1 ≥ 3, x_2 < 0) + 8 · I(x_1 < 3, x_2 ≥ 0) + 1 · I(x_1 < 3, x_2 < 0)
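The flattening identity above is easy to verify directly: the if-then-else tree and the sum of indicator terms compute the same function.

```python
def tree(x1, x2):
    """The depth-2 regression tree from the slide."""
    if x1 >= 3:
        return -5 if x2 >= 0 else 3
    else:
        return 8 if x2 >= 0 else 1

def flattened(x1, x2):
    """The same tree written as a sum of indicator (leaf) terms."""
    return (-5 * (x1 >= 3) * (x2 >= 0)
            + 3 * (x1 >= 3) * (x2 < 0)
            + 8 * (x1 < 3) * (x2 >= 0)
            + 1 * (x1 < 3) * (x2 < 0))

# One test point per leaf region:
for x1, x2 in [(4, 1), (4, -1), (0, 1), (0, -1)]:
    assert tree(x1, x2) == flattened(x1, x2)
```

Each indicator product is an interaction term, which is why a sum of such trees behaves like a log-linear model with complex, data-chosen interactions.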
Functional Gradient Descent: Boosted Regression Trees
Friedman (2000), Mason et al. (NIPS 1999), Breiman (1996)
Fit a logistic regression model as a weighted sum of regression trees:

log [P(y = 1 | x) / P(y = 0 | x)] = tree_0(x) + η_1 tree_1(x) + … + η_L tree_L(x)

When "flattened," this gives a log-linear model with complex interaction terms.
L2-Tree Boosting Algorithm
Let F_0(x) = f_0(x) = 0 be the zero function.
For ℓ = 1, …, L do:
  Construct a training set S_ℓ = {(x_i, ỹ_i)}_{i=1}^N, where ỹ_i is computed as
    ỹ_i = ∂LL(F)/∂F evaluated at F = F_{ℓ−1}(x_i)
  ("how we wish F would change at x_i")
  Let f_ℓ = regression tree fit to S_ℓ
  F_ℓ ← F_{ℓ−1} + η_ℓ f_ℓ
The step sizes η_ℓ are the weights computed in boosting.
This provides a general recipe for learning a conditional probability distribution for a Bernoulli or multinomial random variable.
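A minimal sketch of this recipe for a Bernoulli response: for the Bernoulli log-likelihood LL = yF − log(1 + e^F), the working response is ∂LL/∂F = y − σ(F). Here scikit-learn's `DecisionTreeRegressor` is the base learner, and a fixed step size is used in place of the step-size computation the slide refers to — both are simplifying assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def boost_bernoulli(X, y, L=100, eta=0.1, depth=2):
    """Functional gradient descent on the Bernoulli log-likelihood."""
    F = np.zeros(len(y))
    trees = []
    for _ in range(L):
        residual = y - sigmoid(F)   # dLL/dF: "how we wish F would change"
        t = DecisionTreeRegressor(max_depth=depth).fit(X, residual)
        trees.append(t)
        F = F + eta * t.predict(X)  # fixed step size (simplification)
    return trees, F

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 2))
p_true = sigmoid(2 * X[:, 0] - X[:, 1]**2)
y = rng.binomial(1, p_true)
trees, F = boost_bernoulli(X, y)
```

After boosting, σ(F) estimates P(y = 1 | x), and the average training log-likelihood should exceed that of the zero function F = 0 (which predicts probability 0.5 everywhere).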
Alternating Functional Gradient Descent (Hutchinson, Liu, Dietterich, AAAI 2011)
Loss function L(F, G, y)
F_0 = G_0 = f_0 = g_0 = 0
For ℓ = 1, …, L:
  For each site i, compute ẑ_i = ∂L(F_{ℓ−1}(x_i), G_{ℓ−1}, y_i) / ∂F_{ℓ−1}(x_i)
  Fit regression tree f_ℓ to {(x_i, ẑ_i)}_{i=1}^M
  Let F_ℓ = F_{ℓ−1} + η_ℓ f_ℓ
  For each visit t to site i, compute ŷ_it = ∂L(F_ℓ(x_i), G_{ℓ−1}(w_it), y_it) / ∂G_{ℓ−1}(w_it)
  Fit regression tree g_ℓ to {(w_it, ŷ_it)}_{i=1,t=1}^{M,T_i}
  Let G_ℓ = G_{ℓ−1} + η_ℓ g_ℓ
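The alternating loop can be sketched as follows, taking the loss to be the (negative) marginal log-likelihood of the occupancy-detection model with o_i = σ(F(x_i)) and d_it = σ(G(w_it)). The gradient formulas, fixed step size, tree depth, and simulated covariates below are my own worked-out assumptions for illustration, not the authors' implementation.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def od_brt(x, w, y, L=50, eta=0.1, depth=2):
    """Alternating functional gradient ascent on the marginal
    log-likelihood of the occupancy-detection model (a sketch)."""
    M, T = y.shape
    F = np.zeros(M)        # site-level log-odds of occupancy
    G = np.zeros((M, T))   # visit-level log-odds of detection
    never = (y.sum(axis=1) == 0).astype(float)  # I[all y_it = 0]
    W2 = w.reshape(M * T, -1)
    for _ in range(L):
        o, d = sigmoid(F), sigmoid(G)
        p = np.prod(d**y * (1 - d)**(1 - y), axis=1)  # P(history | Z=1)
        lik = o * p + (1 - o) * never                 # marginal likelihood
        # Occupancy step: working response dLL_i / dF_i
        zhat = (p - never) * o * (1 - o) / lik
        f = DecisionTreeRegressor(max_depth=depth).fit(x, zhat)
        F = F + eta * f.predict(x)
        # Detection step (with updated F): dLL_i / dG_it
        o = sigmoid(F)
        lik = o * p + (1 - o) * never
        yhat = (o * p / lik)[:, None] * (y - d)
        g = DecisionTreeRegressor(max_depth=depth).fit(W2, yhat.ravel())
        G = G + eta * g.predict(W2).reshape(M, T)
    return F, G

# Simulated data (assumed covariate distributions and true functions)
rng = np.random.default_rng(0)
M, T = 300, 3
x = rng.normal(size=(M, 1))
w = rng.normal(size=(M, T, 1))
z = rng.binomial(1, sigmoid(2 * x[:, 0]))
y = rng.binomial(1, z[:, None] * sigmoid(1 - w[..., 0]))
F, G = od_brt(x, w, y)
```

Each step fits a regression tree to the functional gradient and takes a small step, exactly as in ordinary tree boosting, but alternating between the occupancy function F and the detection function G.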
Multiple Visit Data

Detection History
Site                       True occupancy  Visit 1            Visit 2           Visit 3
                           (latent)        (rainy day, 12pm)  (clear day, 6am)  (clear day, 9am)
A (forest, elev=400m)      1               0                  1                 1
B (forest, elev=500m)      1               0                  1                 0
C (forest, elev=300m)      1               0                  0                 0
D (grassland, elev=200m)   0               0                  0                 0
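Encoding the table makes the problem with naive analysis concrete: treating "detected at least once" as occupancy misclassifies site C, which is occupied but was never detected.

```python
import numpy as np

# Detection histories from the table (rows: sites A-D).
y = np.array([[0, 1, 1],   # A: occupied, missed on the rainy noon visit
              [0, 1, 0],   # B: occupied, detected once
              [0, 0, 0],   # C: occupied but never detected
              [0, 0, 0]])  # D: genuinely unoccupied
z_true = np.array([1, 1, 1, 0])

naive = y.max(axis=1)   # "observed at least once" as an occupancy proxy
# naive disagrees with z_true at site C
```

Sites C and D have identical detection histories, so only a model with a latent occupancy variable (informed by the site covariates) can distinguish them.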
Synthetic Species 2
F and G nonlinear:

log [o_i / (1 − o_i)] = −2 x_{i1}² + 3 x_{i2}² − 2 x_{i3}
log [d_it / (1 − d_it)] = exp(−0.5 w_{it4}) + sin(1.25 w_{it1} + 5)
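A sketch of generating data from this synthetic species, using the two nonlinear functions above; the covariate distributions and sample sizes are assumptions, since the slide specifies only F and G.

```python
import numpy as np

rng = np.random.default_rng(0)
M, T = 1000, 3
x = rng.uniform(-1, 1, size=(M, 3))      # x_i1 .. x_i3 (assumed range)
w = rng.uniform(-1, 1, size=(M, T, 4))   # w_it1 .. w_it4 (assumed range)

# Occupancy and detection log-odds from the slide's equations
logit_o = -2 * x[:, 0]**2 + 3 * x[:, 1]**2 - 2 * x[:, 2]
logit_d = np.exp(-0.5 * w[..., 3]) + np.sin(1.25 * w[..., 0] + 5)

o = 1.0 / (1.0 + np.exp(-logit_o))
d = 1.0 / (1.0 + np.exp(-logit_d))
z = rng.binomial(1, o)                   # latent occupancy
y = rng.binomial(1, z[:, None] * d)      # observed detections
```

Because both log-odds functions are nonlinear (quadratics, an exponential, and a sinusoid), a log-linear model of the covariates is misspecified for this species, which is the point of the comparison.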
Open Problems
Sometimes the OD model finds trivial solutions:
  Detection probability = 0 at many sites, which allows the occupancy model complete freedom at those sites
  Occupancy probability constant (0.2)
The log-likelihood for latent variable models suffers from local minima:
  Proper initialization?
  Proper regularization?
  Posterior regularization?
How much data do we need to fit this model?
Can we detect when the model has failed?