Trend filtering in exponential families...trend with 95% conÞ dence in 23 years (13% more time)....

44
Trend ltering in exponential families Daniel J. McDonald Indiana University, Bloomington pages.iu.edu/dajmcdon May

Transcript of Trend filtering in exponential families...trend with 95% conÞ dence in 23 years (13% more time)....

Page 1: Trend filtering in exponential families...trend with 95% conÞ dence in 23 years (13% more time). These equations give a simple but powerful way to understand the value of observing

Trend �ltering in exponential families

Daniel J. McDonaldIndiana University, Bloomingtonpages.iu.edu/~dajmcdon

29 May 2019

1

Page 2: Trend filtering in exponential families...trend with 95% conÞ dence in 23 years (13% more time). These equations give a simple but powerful way to understand the value of observing

These are my cats

2

Page 3: Trend filtering in exponential families...trend with 95% conÞ dence in 23 years (13% more time). These equations give a simple but powerful way to understand the value of observing

Number of vomits/day (30 day rolling average)

0.2

0.4

0.6

Dec17 Mar18 Jun18 Sep18 Dec18 Mar19 Jun19

Prednisolone dosage 1x/day 2x/day 2x/week every other day none

3

Page 4: Trend filtering in exponential families...trend with 95% conÞ dence in 23 years (13% more time). These equations give a simple but powerful way to understand the value of observing

Poisson model

yi is the number of vomits on day i

Poisson distributed with time-varying parameter φi

L(φ | y) =∏n

i=1φyii exp(−φi)

yi !

Goal: estimate φ from data, φ should be “smooth”.

Set θi = logφi

minθ−y>θ + exp(θ) + λ ‖Dθ‖1

D matrix enecodes smoothness

4

Page 5: Trend filtering in exponential families...trend with 95% conÞ dence in 23 years (13% more time). These equations give a simple but powerful way to understand the value of observing

Trend �ltering

0.2

0.4

0.6

Oct18 Jan19 Apr19

Prednisolone dosage 1x/day 2x/day 2x/week every other day none

5

Page 6: Trend filtering in exponential families...trend with 95% conÞ dence in 23 years (13% more time). These equations give a simple but powerful way to understand the value of observing

Second-order Poisson trend �ltering

D =

1 −2 1 0

. . .

0 1 −2 1

∈ Ò(n−2)×n

Looks visually like a smoothing spline, but more locally adaptive

Works well on functions of “bounded variation”:∫ ba |θ

′′(x)|dx < ∞

6

Page 7: Trend filtering in exponential families...trend with 95% conÞ dence in 23 years (13% more time). These equations give a simple but powerful way to understand the value of observing

Derivative properties

estimated theta 1st derivative 2nd derivative

Oct Jan Apr Oct Jan Apr Oct Jan Apr

−0.002

−0.001

0.000

0.001

−0.025

0.000

0.025

0.050

0.1

0.2

0.3

0.4

valu

e

7

Page 8: Trend filtering in exponential families...trend with 95% conÞ dence in 23 years (13% more time). These equations give a simple but powerful way to understand the value of observing

What’s this talk about

Trend �ltering is not new.

But, aside from small specializations, the theory/methods are for additive(sub)-Gaussian noise only.

1. Motivated by spatio-temporal variance estimation from weathersatellites.

2. We generalize to exponential families.3. Provide some algorithms that work on big data.4. Provide justi�able way of selecting the tuning parameter.

8

Page 9: Trend filtering in exponential families...trend with 95% conÞ dence in 23 years (13% more time). These equations give a simple but powerful way to understand the value of observing

Estimating the trend in cloud-toptemperature volatility

Page 10: Trend filtering in exponential families...trend with 95% conÞ dence in 23 years (13% more time). These equations give a simple but powerful way to understand the value of observing

Climate change

The scienti�c consensus is that

1. World-wide climate is changing.2. This change is mostly driven by human behavior.

Global warming −→ climate change: the distribution of temperature (andprecipitation) is changing

Increasing mean temperature understates the costs:

1. More frequent extremes have severe e�ects2. Local discrepancies lead to more storms3. Temporal dependencies mean persistence

9

Page 11: Trend filtering in exponential families...trend with 95% conÞ dence in 23 years (13% more time). These equations give a simple but powerful way to understand the value of observing

Using weather satellites

Drivers of climate variation:1. Ocean currents2. Jet stream3. Annular modes4. Cloudiness

CLARREO satellite: monitor cloud top temperature as it relates to climate.

• Has yet to launch, no sooner than 2022• Defunded in most recent federal budget

Source: NCAR CCSM3 Diagnostic Plots.

10

Page 12: Trend filtering in exponential families...trend with 95% conÞ dence in 23 years (13% more time). These equations give a simple but powerful way to understand the value of observing

CLARREO vs MetOp/Modis

required for decadal climate change observations. Fifth, CLARREO has demonstrated that scene and viewing geometry-dependent polarization distribu-tion models (PDMs) (Nadal and Breon 1999; Maignan et al. 2009) allow CLARREO to determine the scan-angle-dependent polarization sensitivity of imagers such as VIIRS, AVHRR, or geostationary imagers, as well as to enable those instruments to remove this scene-dependent polarization dependence (Lukashin et al. 2012). Sixth, the CLARREO 90° inclined polar orbit (see Table 1) slowly drifts through all 24 hours of local solar time over 6 months. This orbit allows reference intercalibration orbit crossings with satel-lites at all latitudes, which is important for verifying accuracy across all climate regimes, as well as for verifying if instruments have orbit-dependent calibra-tion changes, especially from the different hot/cold parts of the orbit in or out of direct solar illumination. By contrast, sun-synchronous satellites only cross orbits at polar latitudes, which is another limita-tion of current GSICS methods. Simulations show

that CLARREO reference intercalibration sampling is sufficient to determine i nst r u ment ga i ns a nd offsets on a monthly time scale, while polarization sensitivity, nonlinearity, and orbit position depen-dence can be achieved on annual time scales.

In Fig. 6, CLARREO crosses under the Suomi National Polar-Orbiting Partnership (NPP) or Joint Polar Satellite System-1 (JPSS-1) orbit. CLARREO matches elevat ion and azimuth directions across the cross-track scans of CERES, VIIRS, and CrIS by setting the azimuth angle of the CLARREO instru-ment to match the NPP scan plane and then slowly rotates the CLARREO RS spectrometer (mounted on a gimbal) to match view-ing zenith angles across the entire scan during the orbit crossing. The azimuth angle for this match varies from orbit crossing to orbit

crossing but is essentially constant for any single orbit crossing (Roithmayr and Speth 2012).

The time available for the matching scan is directly proportional to the orbit altitude separa-tion of the two spacecraft. Spacecraft at the same altitude have only a few seconds to obtain the entire scan swath, while several minutes are available for an orbit separation of 100 km or more (Roithmayr and Speth 2012). For this reason, the CLARREO design orbit altitude is ~600 km—sufficiently high to minimize fuel use for orbit control, sufficiently low to minimize launch vehicle requirement for mass to orbit, and well below the typical polar orbiter altitudes of ~825 km [NPP, JPSS, and the Meteorological Operational Satellite (METOP)] to increase the matched scan angle intercalibration time. Thus, the orbit selection and gimbal azimuth/elevation-pointing capability will allow CLARREO to increase reference intercalibration sampling by more than a factor of 100 compared to current GSICS capabilities, whereas typical SNOs restrict

FIG. 6. As the CLARREO orbit (red; 609-km altitude, 90° inclination) crosses that of a satellite such as NPP or MetOp (green) (827-km altitude, 1330 LT sun-synchronous orbit with 98.7° orbit inclination) with an operational sensor, the CLARREO infrared and reflected solar spectrometers gather data matched in time, space, and angle of view to provide reference intercalibration SI-traceable spectra for operational sensors that cannot achieve climate change accuracy directly. As a metrology transfer standard in orbit, CLARREO is an anchor for the climate observing system.

1532 OCTOBER 2013|

uncertainty dominates the accuracy of global average trends. Uncertainty in climate sensitivity is driven primarily by uncertainty in cloud feedback, which in turn is driven primarily by low cloud changes varying Earth’s albedo (Solomon et al. 2007; Bony et al. 2006; Soden et al. 2008). We can derive a simple metric of cloud feedback for reflected solar by considering the trend in global mean shortwave cloud radiative forc-ing (SW CRF) (Soden et al. 2008; Loeb et al. 2007). Global mean SW CRF is simply the difference be-tween all-sky and clear-sky reflected flux.

As for temperature trends (Fig. 3a), the perfect observing system again shows the need for long cli-mate records for accurate trends in SW CRF (Fig. 3b).

What about time to detect trends? Using Leroy et al. (2008b) we can defi ne an analogous uncertainty factor Ut—the ratio of the time to detect a trend using a real observing system to the time to detect a trend using a per-fect observing system. Such a ratio can be defi ned for any climate variable or statistical confi dence bound desired. Again extending the results from Leroy et al. (2008b),

(2)

The only difference between Eqs. (1) and (2) is that the square root on the right side of the equation becomes a cube root. Since Ua and Ut are always greater than 1, and are usually near 1, Eqs. (1) and (2) show that

(3)

Another way of interpreting Eq. (3) is that the degradation of trend accuracy for time to detect trends is only two-thirds of the degradation for accuracy in trends. For exam-ple, the CLARREO requirement that Ua < 1.2 equivalently requires that Ut < 1.13. How do we interpret the meaning of Ut = 1.13? If a perfect observing system could detect a temperature trend with 95% confi dence in 20 years, then the CLARREO observing system could detect the same trend with 95% confi dence in 23 years (13% more time).

These equations give a simple but powerful way to understand the value of observing system accuracy for both climate trend accuracy (e.g., tests of climate predictions) and time to detect trends (e.g., public policy decisions). They also provide a way to compare consistent metrics across a wide range of climate variables, as well as a wide range of sources of uncertainty in climate observations. We strongly encour-age use of this approach to more rigorously understand and optimize climate observation requirements across the wide range of essential climate variables (ECVs) (GCOS 2011). This is especially important given the limited resources avail-able for global climate observations (Trenberth et al. 2013).

FIG. 3. The relationship between absolute calibration accuracy and the accuracy of global average decadal cli-mate change trends. Trend accuracy shown for a perfect observing system (black), varying levels of instrument ab-solute accuracy (solid color lines) for possible CLARREO requirements, and current instruments in orbit (dashed lines). Shown are (a) the relationship between infrared spectra accuracy and temperature trends and (b) the relationship between reflected solar spectra and changes in broadband CRF and cloud feedback. The figures show the dramatic effect of instrument accuracy on both cli-mate trend accuracy (vertical axis) as well as the time to detect trends (horizontal axis). The green vertical line for reflected solar shows the range of CMIP3 climate model simulations (Soden and Vecchi 2011). Larger values of decadal change in SW CRF indicate larger values of cloud feedback (Soden et al. 2008).

1527OCTOBER 2013AMERICAN METEOROLOGICAL SOCIETY |

• Weather satellites aren’t made for this.• More information in higher moments than in average?

Source: Wielicki, et al. (2013).

11

Page 13: Trend filtering in exponential families...trend with 95% conÞ dence in 23 years (13% more time). These equations give a simple but powerful way to understand the value of observing

Satellite data

Once collaborators do lots of processing. . .

• 52,000 time series• daily records over ∼ 40 years• “trends” are local, nonlinear, not sinusoidal

−20

−10

0

10

20

30

1994 1996 1998 2000 2002

tem

pe

ratu

re (

°C)

Bloomington Manaus San Diego

1 June 2000

12

Page 14: Trend filtering in exponential families...trend with 95% conÞ dence in 23 years (13% more time). These equations give a simple but powerful way to understand the value of observing

Trends in variance

• Let xijt be the observed temperature at time t and location (i, j).

• Suppose xijt ∼ Normal(0,σ2ijt

)• Estimate σ2, but it should be “smooth” relative to space and time.

• Use a matrix D+ penalty to encode this smoothness.

13

Page 15: Trend filtering in exponential families...trend with 95% conÞ dence in 23 years (13% more time). These equations give a simple but powerful way to understand the value of observing

Exponential families

Page 16: Trend filtering in exponential families...trend with 95% conÞ dence in 23 years (13% more time). These equations give a simple but powerful way to understand the value of observing

Natural exponential family

Fix a (σ-�nite) measure µ on the Borel subsets of Ò

Let Θ ={θ :

∫dµ(y)exp(yθ) < ∞

}.

Let A(θ) = log∫dµ(y)exp(yθ).

Then pθ(y) = exp(yθ − A(θ)) is an exponential family w.r.t. µ.

When convenient, write pθ(y) = h(y) exp(yθ − A(θ)), for base measure h

14

Page 17: Trend filtering in exponential families...trend with 95% conÞ dence in 23 years (13% more time). These equations give a simple but powerful way to understand the value of observing

Standard examples

beta gamma Gaussian

binomial geometric Poisson

15

Page 18: Trend filtering in exponential families...trend with 95% conÞ dence in 23 years (13% more time). These equations give a simple but powerful way to understand the value of observing

Some important properties we will use

Let θ0 ∈ Θ be in the interior.

1. Θ is convex, A is strictly convex on Θ2. All derivatives of A exist at θ0.3. Å [y] = A′(θ0), Ö[y] = A′′(θ0).4. The cumulant generating function is A(θ0).5. KL(θ1 ‖ θ2) = A(θ2) − A(θ1) + (θ1 − θ2)A′(θ1)6. y is sub-exponential: Ð(y > ε) < c exp(−αε), some α , c.

16

Page 19: Trend filtering in exponential families...trend with 95% conÞ dence in 23 years (13% more time). These equations give a simple but powerful way to understand the value of observing

Algorithms

Page 20: Trend filtering in exponential families...trend with 95% conÞ dence in 23 years (13% more time). These equations give a simple but powerful way to understand the value of observing

Optimization problem

1√2πσ2

exp(−x2

2σ2

)=⇒

12√πy

exp (yθ − A(θ)) A(θ) := −12log(−θ)

minθ

∑ijt

A(θijt) − yijtθijt + λ ‖Dθ‖1

Standard optimizer: Primal Dual Interior Point method.

−20

0

20

1994 1996 1998 2000 2002

Tem

pe

ratu

re (

°C)

Bloomington, IN Estimated SD

see Tibshirani (2014) or K-K-Boyd-G (2009)

17

Page 21: Trend filtering in exponential families...trend with 95% conÞ dence in 23 years (13% more time). These equations give a simple but powerful way to understand the value of observing

Generic PDIP

1. Start with a guess θ(1)

2. Solve a linear system [Ms = v]3. Calculate a step size4. Iterate 2 & 3 until convergence

The matrix M = M(θ(k)

), dense, and roughly 109 × 109.

This isn’t going to work.

18

Page 22: Trend filtering in exponential families...trend with 95% conÞ dence in 23 years (13% more time). These equations give a simple but powerful way to understand the value of observing

Alternating direction method of multipliers

One way to solve optimization problems like this is to restate the problem

Original Equivalent

minx

f (x) + g(x)minx,z

f (x) + g(z)

s.t. x − z = 0

Then, iterate the following with ρ > 0

x← argminx

f (x) +ρ

2‖x − z + u‖22

z← argminz

g(z) +ρ

2‖x − z + u‖22

u← u + x − z

19

Page 23: Trend filtering in exponential families...trend with 95% conÞ dence in 23 years (13% more time). These equations give a simple but powerful way to understand the value of observing

Why would you do this?

• It decouples f and g: this can be easier• If f and g are nice, the updates can be parallelized• The algorithm converges under very general conditions• There are often many ways to decouple a problem

minθ−`(θ) + λ ‖Dθ‖1

• The individual minimizations don’t have to be solved in closed form

Example:

θ ← argminθ−`(θ) +

ρ

2‖Dθ − α + u‖22

α ← Sλ/ρ(Dθ + u)

u← u + Dθ − α

[Sa(b)]k = sgn(bk)(|bk | − a)+

20

Page 24: Trend filtering in exponential families...trend with 95% conÞ dence in 23 years (13% more time). These equations give a simple but powerful way to understand the value of observing

Real MODIS track

21

Page 25: Trend filtering in exponential families...trend with 95% conÞ dence in 23 years (13% more time). These equations give a simple but powerful way to understand the value of observing

What our data look like

22

Page 26: Trend filtering in exponential families...trend with 95% conÞ dence in 23 years (13% more time). These equations give a simple but powerful way to understand the value of observing

Concensus version

minxg=θ [g

∑g∈G

−`(xg) + λ Dg·xg 1

xg ← argminxg−`(xg) + λ

Dg·xg 1+ u>(xg − θ) +

ρ

2 xg − θ 22

θ ← avg(xg + ug/ρ)

ug ← ug + ρ(xg − θ)

Requires very few iterations, but iterations are O(n3). Can parallelize overblocks.

23

Page 27: Trend filtering in exponential families...trend with 95% conÞ dence in 23 years (13% more time). These equations give a simple but powerful way to understand the value of observing

Grid world

minθ=a=b=c

∑ijt

−`(θijt) + λ∑it

‖Dai·t‖1

+ λ∑jt

Db·jt 1 + λ∑ij

Dcij· 1θijt ← solution of A′(θijt) = k(1)ijt θijt + k

(2)ijt

(a, b, c) ← TF1d((a, b, c) + (u, v,w))

(u, v,w) ← (u, v,w) + θ − (a, b, c)

k(1), k(2) ← simple linear functions of

a, b, c, u, v,w

Requires many iterations, but iterations are O(n). Can parallelize overlines.

TF1d(z) := argminx ‖x − z‖22 + λ/ρ ‖Dx‖1 24

Page 28: Trend filtering in exponential families...trend with 95% conÞ dence in 23 years (13% more time). These equations give a simple but powerful way to understand the value of observing

Hints and caveats

• Linearized ADMM if large memorycomputer

• Can come up with intermediate options• O�-the-shelf stu� doesn’t work• Smaller problems don’t need these details• Must repeat for many tuning parameters

25

Page 29: Trend filtering in exponential families...trend with 95% conÞ dence in 23 years (13% more time). These equations give a simple but powerful way to understand the value of observing

Tuning parameter selection

Page 30: Trend filtering in exponential families...trend with 95% conÞ dence in 23 years (13% more time). These equations give a simple but powerful way to understand the value of observing

Risk

• Suppose you have observed data Y, a predictor fλ(Y)

• You want to know MSE(λ) = Å[‖Y − fλ(Y)‖22

]• Examining Error(λ) = ‖Y − fλ(Y)‖22 is biased if you used Y to get f

• AIC, BIC, GCV compensate with Error(λ) + pen(λ)

• Cross Validation uses held-out sets

26

Page 31: Trend filtering in exponential families...trend with 95% conÞ dence in 23 years (13% more time). These equations give a simple but powerful way to understand the value of observing

Unbiased estimation

• If data are i.i.d., noise is additive, mean zero

MSE(λ) = Å[‖Y − fλ(Y)‖22

]− nσ2 + 2tr Cov(Y, fλ(Y))

• If fλ = WY, then tr Cov(Y, fλ(Y)) = tr W =: σ2df (λ)

• If Y ∼ Normal(µ,σ2In) and f weakly di�erentiable with ess. boundedpartials, then

tr Cov(Y, fλ(Y)) = σ2Å[tr∂ fλ(y)∂yi

���Y

]• Ingredients for SURE:

1. Expression for risk I want, w/o dependence on parameters2. Expression for tr ∂ fλ (y)∂yi

27

Page 32: Trend filtering in exponential families...trend with 95% conÞ dence in 23 years (13% more time). These equations give a simple but powerful way to understand the value of observing

SURE for continuous exp fam

1. pθ(y) = h(y) exp(θ>y − A(θ))2. f weakly di�erentiable with ess. bounded partials3. h is weakly di�erentiable

Å[θ>f (Y)

]= −Å

[(+h(Y)h(Y)

)>f (Y) + tr

∂ fλ(y)∂yi

���Y

]• Conincides with previous for Gaussian case• Can be used to get unbiased estimator of Å

[ θ − θ̂ 22

]

28

Page 33: Trend filtering in exponential families...trend with 95% conÞ dence in 23 years (13% more time). These equations give a simple but powerful way to understand the value of observing

Estimating KL

Result (Deledalle ’17)

K̂L(θ̂ ‖ θ0) =⟨θ̂ +

+h(Y)h(Y)

, A′(θ̂)⟩+ (A′′(θ̂))>

(∂θ̂i∂yi

���Y

)− 1>A(θ̂),

with Å[K̂L(θ̂ ‖ θ0)

]= KL(θ̂ | |θ0) − A(θ0).

Here:

K̂L(θ̂ ‖ θ0) =∑ijt

©­«Yijt

4θ̂ijt+

1

2θ̂2ijt

∂θ̂ijt

∂yijt

���Y+log(−θ̂ijt)

2ª®¬ − n

2

29

Page 34: Trend filtering in exponential families...trend with 95% conÞ dence in 23 years (13% more time). These equations give a simple but powerful way to understand the value of observing

The Divergence

Let G be the rows of D with Dθ̂ = 0.

De�ne ΠG = GG†, the projection onto null(G).

For Trend-Filtering with Gaussian loss (Tibshirani and Taylor ’14):

df(θ̂) =∑i

∂θ̂i∂yi

= tr(ΠG) = nullity(G).

2nd derivative

Oct Jan Apr

−0.002

−0.001

0.000

0.001

valu

e

• Return of the cats• All you have to do is count thenumber of pieces.

30

Page 35: Trend filtering in exponential families...trend with 95% conÞ dence in 23 years (13% more time). These equations give a simple but powerful way to understand the value of observing

Harder case

• For trend �ltering with exponential family loss (Jacobian):

∂θ̂i∂yi

= diag(ΠG

(ΠGdiag

(A′′(θ̂)

)ΠG

)†ΠG

)• Our case: A′′(θ) =

12θ2

• Final form: K̂L(θ̂ ‖ θ0) =∑ijt

©­«Yijt

4θ̂ijt+

1

4θ̂4ijt+log(−θ̂ijt)

2ª®¬ − n

2

31

Page 36: Trend filtering in exponential families...trend with 95% conÞ dence in 23 years (13% more time). These equations give a simple but powerful way to understand the value of observing

Theory

Page 37: Trend filtering in exponential families...trend with 95% conÞ dence in 23 years (13% more time). These equations give a simple but powerful way to understand the value of observing

Brie�y

• Want a bound on KL(θ̂ ‖ θ0)

• Can use properties of exponential families to get “Basic inequality”

KL(θ̂ ‖ θ0) ≤ (Y − A′(θ0))>(θ0 − θ̂) + λ ‖Dθ0‖ − λ Dθ̂

• First term is empirical process, second term controlled by λ

• Y − A′(θ0) is mean zero, sub-exponential

• Play some games

32

Page 38: Trend filtering in exponential families...trend with 95% conÞ dence in 23 years (13% more time). These equations give a simple but powerful way to understand the value of observing

Empirical results

Page 39: Trend filtering in exponential families...trend with 95% conÞ dence in 23 years (13% more time). These equations give a simple but powerful way to understand the value of observing

Bloomington, IN

33

Page 40: Trend filtering in exponential families...trend with 95% conÞ dence in 23 years (13% more time). These equations give a simple but powerful way to understand the value of observing

Average variance in base period

34

Page 41: Trend filtering in exponential families...trend with 95% conÞ dence in 23 years (13% more time). These equations give a simple but powerful way to understand the value of observing

Change in average variance from 1961–2011

35

Page 42: Trend filtering in exponential families...trend with 95% conÞ dence in 23 years (13% more time). These equations give a simple but powerful way to understand the value of observing

Conclusion

Page 43: Trend filtering in exponential families...trend with 95% conÞ dence in 23 years (13% more time). These equations give a simple but powerful way to understand the value of observing

Wrapping up

• Generalized TF to exponential families• Tailored algorithms for some big data• Tuning parameters in this setting are challenging• Lot’s of missing details about the actual data• Do we care about θ? A′(θ)?

36

Page 44: Trend filtering in exponential families...trend with 95% conÞ dence in 23 years (13% more time). These equations give a simple but powerful way to understand the value of observing

Collaborators and funding

37