An Analysis of Approaches to Presence-Only Data

William Fithian and Trevor Hastie
Department of Statistics, Stanford University

July 30, 2012

Species Distribution Modeling

Question: where may a given species be found?

Motivations:

• Plan wildlife management actions

• Monitor endangered or invasive species

• Scientific understanding

• etc.

What geographic features predict greater abundance?


Presence-Absence / Count Data

Scientists visit a patch of land

Record whether any specimens were encountered / how many

Relatively high-quality data

Expensive; difficult for rare or elusive species


Presence-Only Data

Motorist spies a koala

Calls the museum excitedly

Museum records the location

Lower-quality data

But more of it exists

An increasingly popular object of study with the advent of geographic information systems


Real Data (Koala Sightings in New South Wales)

[Figure: map of recorded koala sightings across New South Wales.]

Taken from Margules and Austin (1994)

Overview

A proliferation of methods for studying presence-only data

Recent papers have pointed out close connections:

• Warton and Shepherd (2010)

• Aarts et al. (2011)

Goals here:

1. Interpret

2. Explore implications

3. Extend results


Outline

1. Inhomogeneous Poisson Process Model / Maxent

2. Logistic Regression

3. Pooling Different Kinds of Data

Notation

n_1 presence observations, n_0 background observations

Geographic coordinates z_i ∈ D ⊆ R², i = 1, …, n_0 + n_1

Features x_i = x(z_i), measured via GIS

y_i = 1 for presence, 0 for background


Outline

1. Inhomogeneous Poisson Process Model / Maxent

2. Logistic Regression

3. Pooling Different Kinds of Data

Inhomogeneous Poisson Process

Intensity function λ(z) : D → [0, ∞), with

\Lambda(A) = \int_A \lambda(z) \, dz

Assume Λ(D) < ∞, and let p_λ(z) = λ(z)/Λ(D).

Definition 1: choose a Poisson number of points, then sample them i.i.d.:

n_1 \sim \mathrm{Poisson}(\Lambda(D)), \qquad z_i \mid y_i = 1 \overset{\mathrm{i.i.d.}}{\sim} p_\lambda

Definition 2: continuous limit of the discrete Poisson model:

N(A) = \#\{i : z_i \in A,\ y_i = 1\} \sim \mathrm{Poisson}(\Lambda(A)), \qquad A \cap B = \emptyset \ \Rightarrow\ N(A) \perp\!\!\!\perp N(B)

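To make Definition 1 concrete, here is a minimal simulation sketch in R (not from the talk): it draws a log-linear IPP on the unit square by thinning a dominating homogeneous process. The feature function x and the coefficients are illustrative choices.

## Simulate an IPP with lambda(z) = exp(alpha + beta * x(z)) on D = [0,1]^2
## by thinning a homogeneous Poisson process. x, alpha, beta are toy choices.
set.seed(1)
x      <- function(z) z[, 1] - 0.5            # one toy feature of location
alpha  <- 5; beta <- 2
lambda <- function(z) exp(alpha + beta * x(z))

lambda.max <- exp(alpha + beta * 0.5)         # bound on lambda over D
M    <- rpois(1, lambda.max)                  # dominating count (|D| = 1)
z    <- cbind(runif(M), runif(M))             # uniform candidate locations
keep <- runif(M) < lambda(z) / lambda.max     # thin w.p. lambda / lambda.max
presences <- z[keep, , drop = FALSE]          # the IPP realization

The number of retained points is Poisson with mean Λ(D), matching Definition 1.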

Presence-Only Data as IPP

Warton & Shepherd (2010) propose a log-linear IPP for presence-only data:

\lambda(z) = e^{\alpha + \beta' x(z)}, \qquad p_\lambda(z) = \frac{e^{\beta' x(z)}}{\int_D e^{\beta' x(u)} \, du}

β determines p_λ; α determines Λ(D)


Identifiability and Observer Bias

The occurrence process is of scientific interest

Presence-only data reflect the rate of sightings

Observation process is a thinned occurrence process:

\lambda_{\mathrm{obs}}(z) = \lambda_{\mathrm{occ}}(z)\, s(z) = e^{\tilde\alpha + \tilde\beta' x(z)} \, e^{\gamma + \delta' x(z)}

Options:

1. Assume s is constant (optimistic)

2. Assume s and λ_occ depend on different features

Either way, α̃ is unidentifiable (α = γ + α̃)

Maximum Likelihood for IPP

Log-likelihood:

\ell(\alpha, \beta) = \sum_{y_i = 1} (\alpha + \beta' x_i) - \int_D e^{\alpha + \beta' x(z)} \, dz

Score equation for α:

n_1 = \int_D e^{\alpha + \beta' x(z)} \, dz = \Lambda(D)

Implication: α̂ is not of scientific interest unless n_1 is


Maximum Likelihood for IPP

Plug in α̂(β) (partially maximize ℓ):

\ell^*(\beta) = \sum_{y_i = 1} \beta' x_i - n_1 \log\left( \int_D e^{\beta' x(z)} \, dz \right) = \sum_{y_i = 1} \log p_\lambda(z_i)

Score equations for β:

\frac{1}{n_1} \sum_{y_i = 1} x_i = \frac{\int_D e^{\beta' x(z)}\, x(z) \, dz}{\int_D e^{\beta' x(z)} \, dz} = E_{p_\lambda} x(z)

Interpretation:

1. Choose β̂ to match the means of the features x(z)

2. Choose α̂ so that Λ(D) = n_1

In short: 1. Estimate the density. 2. Multiply by n_1.


Numerical Approximation of IPP Likelihood

In practice we can't evaluate the integral analytically

Replace it by a numerical approximation over the n_0 background points:

\ell(\alpha, \beta) \approx \sum_{y_i = 1} (\alpha + \beta' x_i) - \frac{|D|}{n_0} \sum_{y_i = 0} e^{\alpha + \beta' x_i}

Same interpretation of the score equations

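Continuing the toy simulation above, a sketch of this numerical approximation: evaluate the features at uniform background points, code the approximate log-likelihood, and maximize it with optim. The object names (x1, x0, areaD) are illustrative, reusing x and presences from the earlier sketch.

## Approximate IPP log-likelihood using n0 uniform background points.
## x1, x0: feature matrices at presence / background points; areaD = |D|.
ipp.loglik <- function(par, x1, x0, areaD) {
  alpha <- par[1]; beta <- par[-1]
  sum(alpha + x1 %*% beta) -
    (areaD / nrow(x0)) * sum(exp(alpha + x0 %*% beta))
}

x1  <- matrix(x(presences))              # features at the simulated presences
x0  <- matrix(runif(10000) - 0.5)        # features at uniform background points
fit <- optim(c(0, 0), ipp.loglik, x1 = x1, x0 = x0, areaD = 1,
             method = "BFGS", control = list(fnscale = -1))
fit$par                                  # (alpha.hat, beta.hat)

At the maximum, the score equation for α forces the fitted Λ̂(D) to equal n_1, matching the interpretation above.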

Maxent / Conditional IPP

Phillips et al. (2004, 2006, 2008)

Nonparametric density for the presence samples: z_i | y_i = 1 ~ i.i.d. p(z)

Maximize the entropy

H(p) = -\int p(z) \log p(z) \, dz

subject to

\frac{1}{n_1} \sum_{y_i = 1} x(z_i) = E_p\, x(z)

The authors show the solution has the parametric form

p(z) = \frac{e^{\beta' x(z)}}{\int e^{\beta' x(u)} \, du}

Aarts et al. (2011): same slopes β̂ as the IPP


Equivalence Under Penalization

The Maxent software uses a large basis expansion and an ℓ₁ penalty on β

If IPP and Maxent use

• the same data (including background)

• the same basis expansion

• the same penalty on β

• α unpenalized in the IPP

then β̂_IPP = β̂_Maxent

Can replace β'x(z) with f_θ(z)

Same p̂(z); the IPP also computes λ̂(z) = n_1 p̂(z)

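Given this equivalence, a Maxent-style fit can be sketched as ℓ₁-penalized, heavily background-weighted logistic regression, e.g. with the glmnet package. This continues the toy data above; the quadratic basis expansion and the background weight of 1000 are arbitrary illustrative choices, and this is a stand-in, not the Maxent software itself.

## l1-penalized weighted logistic regression as a stand-in for Maxent
## under the equivalence above (toy data from the earlier sketches).
library(glmnet)

xx <- rbind(x1, x0)                      # presence features, then background
X  <- cbind(xx, xx^2)                    # a small illustrative basis expansion
y  <- rep(1:0, c(nrow(x1), nrow(x0)))    # y = 1 presence, y = 0 background
w  <- ifelse(y == 1, 1, 1000)            # large weight W on background points

fit.l1 <- glmnet(X, y, family = "binomial", weights = w, alpha = 1)
coef(fit.l1, s = 0.01)                   # coefficients at one (arbitrary) penalty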

Outline

1. Inhomogeneous Poisson Process Model / Maxent

2. Logistic Regression

3. Pooling Different Kinds of Data

“Naive” Logistic Regression

Treat x_i as fixed:

y_i \mid x_i \sim \mathrm{Bernoulli}\!\left( \frac{e^{\eta + \beta' x_i}}{1 + e^{\eta + \beta' x_i}} \right)

Flexible modeling framework: GAM, MARS, boosting, LASSO, etc.

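On the toy data from the previous sketches, the naive fit is a one-line glm (a sketch; any of the flexible methods above could be substituted for glm):

## "Naive" logistic regression of presence (y = 1) vs background (y = 0),
## reusing y and the feature column xx from the previous sketch.
dat      <- data.frame(y = y, x = as.numeric(xx))
naive.lr <- glm(y ~ x, family = binomial, data = dat)
coef(naive.lr)                           # (eta.hat, beta.hat)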

Case-Control Sampling

Back to the IPP model. Condition on z_i:

P(y = 1 \mid z) = \frac{P(y = 1)\, P(z \mid y = 1)}{P(y = 0)\, P(z \mid y = 0) + P(y = 1)\, P(z \mid y = 1)}
                = \frac{n_1 e^{\alpha + \beta' x(z)} / \Lambda(D)}{n_0 + n_1 e^{\alpha + \beta' x(z)} / \Lambda(D)}
                = \frac{e^{\eta + \beta' x(z)}}{1 + e^{\eta + \beta' x(z)}}

A “case-control” sampling design

Logistic regression likelihood = conditional IPP likelihood


Logistic Regression vs IPP

Both estimate the same β, but give different β̂

Warton & Shepherd (2010) show β̂_LR → β̂_IPP as n_0 → ∞ with n_1 fixed

Misspecified case: no longer true if n_0, n_1 → ∞ together (the limit depends on lim n_1/n_0)


Logistic Regression vs IPP

[Figure: logistic regression estimates β̂ plotted against n_0 (100 to 10⁶), with a fixed presence sample of n_1 = 1000 and true λ quadratic in x.]

Weighted Logistic Regression

We don't really need n_0 → ∞

Instead, weight the sample to reflect the undersampling of background points:

w_i = \begin{cases} W & y_i = 0 \\ 1 & y_i = 1 \end{cases}

As W → ∞, β̂_WLR → β̂_IPP

Weighted logistic regression = numerical IPP = numerical Maxent

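A sketch of the weighted fit on the same toy data (W = 1000 is an arbitrary large weight; in practice one would increase W until β̂ stabilizes):

## Weighted logistic regression: weight W on background points, 1 on presences.
## As W grows, the slope approaches the numerical IPP / Maxent estimate.
W   <- 1000
wlr <- glm(y ~ x, family = binomial, data = dat,
           weights = ifelse(y == 1, 1, W))
coef(wlr)["x"]                           # compare with fit$par[2] above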

Weighted vs Unweighted Logistic Regression

Weighted LR converges faster to the large-n_0 limit.

[Figure: weighted and unweighted β̂ plotted against n_0 (100 to 10⁶); the weighted estimates reach the limit much sooner.]

Outline

1. Inhomogeneous Poisson Process Model / Maxent

2. Logistic Regression

3. Pooling Different Kinds of Data

Presence-Absence and Count Data

Implied likelihood for count / presence-absence data:

N \mid x \sim \mathrm{Poisson}\!\left( A\, e^{\tilde\alpha - \varepsilon + \tilde\beta' x} \right)

Can pool data from multiple studies

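This implied likelihood is just a Poisson regression with the patch area as an offset. A self-contained sketch on synthetic survey data (all names and numbers are illustrative; the intercept absorbs α̃ − ε):

## Count data as a Poisson GLM with log-area offset (synthetic example).
set.seed(2)
surveys   <- data.frame(x = runif(200) - 0.5,        # feature at each patch
                        A = runif(200, 0.5, 2))      # patch areas
surveys$N <- rpois(200, surveys$A * exp(1 + 2 * surveys$x))  # toy counts

count.fit <- glm(N ~ x, family = poisson, offset = log(A), data = surveys)
coef(count.fit)                          # intercept absorbs alpha.tilde - epsilon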

Example: Correcting for Bias

Assume multiple species with the same bias:

\lambda_{\mathrm{occ},j}(z) = e^{\tilde\alpha_j + \tilde\beta_j' x(z)}, \qquad \lambda_{\mathrm{obs},j}(z) = e^{\tilde\alpha_j + \gamma_j + (\tilde\beta_j + \delta)' x(z)}

The model is identifiable given

1. Presence-only data for all species (to estimate β_j)

2. Presence-absence / count data for at least one species (to estimate δ)

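As a rough sketch of how pooling identifies the bias term, here is a joint negative log-likelihood for one species observed both ways: the presence-only part sees slopes β + δ, the count part sees β, so fitting them together separates the two. The setup and names are illustrative (reusing the toy objects from the earlier sketches), not the authors' implementation.

## Pooled likelihood: presence-only (slopes beta + delta) plus survey
## counts (slopes beta), sharing beta so that delta is identified.
areaD <- 1                               # |D| for the toy unit-square domain
pooled.nll <- function(par) {
  a.obs <- par[1]; a.occ <- par[2]; beta <- par[3]; delta <- par[4]
  b.po  <- beta + delta
  ll.po <- sum(a.obs + x1 * b.po) -      # approximate IPP log-likelihood
    (areaD / nrow(x0)) * sum(exp(a.obs + x0 * b.po))
  mu    <- surveys$A * exp(a.occ + beta * surveys$x)
  ll.pa <- sum(dpois(surveys$N, mu, log = TRUE))
  -(ll.po + ll.pa)                       # negative joint log-likelihood
}
pooled.fit <- optim(rep(0, 4), pooled.nll, method = "BFGS")
pooled.fit$par                           # (a.obs, a.occ, beta.hat, delta.hat)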

Conclusions

IPP, Maxent, and logistic regression are all motivated by the same underlying model

All estimate the same β (α is uninteresting)

β̂ for IPP and Maxent can be fit by weighted logistic regression / GAM / boosted trees / MARS / group LASSO / ...

## gbm from the gbm package; weights = 1000^(1 - y) puts weight 1000 on
## background points (y = 0) and weight 1 on presences (y = 1)
boosted.ipp <- gbm(y ~ ., distribution = "bernoulli",
                   data = banksia, weights = 1000^(1 - y))

Can combine presence-only, presence-absence, and other data


Thanks