
Transcript of An Bian, Kfir Y. Levy,...

  • Non-monotone Continuous DR-submodular Maximization: Structure and Algorithms

    An Bian, Kfir Y. Levy, Andreas Krause and Joachim M. Buhmann

    DR-submodular (Diminishing Returns) Maximization & Its Applications

    👉 Softmax extension for determinantal point processes (DPPs) [Gillenwater et al ‘12]

    👉 Mean-field inference for log-submodular models [Djolonga et al ‘14]

    👉 DR-submodular quadratic programming

    👉 (Generalized submodularity over conic lattices) e.g., logistic regression with a non-convex separable regularizer [Antoniadis et al ‘11]

    👉 Etc. (see the paper for more)

    Based on the Local-Global Relation, one can use any solver that finds an approximately stationary point as the subroutine, e.g., the non-convex Frank-Wolfe solver in [Lacoste-Julien ‘16]

    TWO-PHASE ALGORITHM
    Input: stopping tolerances 𝜖₁, 𝜖₂; #iterations 𝐾₁, 𝐾₂
    𝒙 ← Non-convex Frank-Wolfe(𝑓, 𝒫, 𝐾₁, 𝜖₁) // Phase I on 𝒫
    𝒬 ← 𝒫 ∩ {𝒚 | 𝒚 ≤ 𝒖̄ − 𝒙}
    𝒛 ← Non-convex Frank-Wolfe(𝑓, 𝒬, 𝐾₂, 𝜖₂) // Phase II on 𝒬
    Output: argmax{𝑓(𝒙), 𝑓(𝒛)}
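The two-phase scheme above can be sketched generically around any approximate-stationary-point subroutine. In this illustrative sketch, a hypothetical `pga_solver` (plain projected gradient ascent on a box) stands in for the non-convex Frank-Wolfe solver, and the instance values are made up for demonstration:

```python
import numpy as np

def two_phase(f, solver, u_bar):
    # Phase I: approximately stationary point on P (here the box [0, u_bar]).
    x = solver(u_bar)
    # Phase II: restrict to Q = P ∩ {y : y <= u_bar - x} and solve again.
    z = solver(u_bar - x)
    # Keep the better of the two points.
    return x if f(x) >= f(z) else z

# Toy DR-submodular quadratic on the unit box (illustrative values).
H = np.array([[-1.0, -0.2], [-0.2, -1.0]])
h = np.array([0.8, 0.6])
f = lambda x: 0.5 * x @ H @ x + h @ x

def pga_solver(upper, steps=200, lr=0.1):
    # Stand-in subroutine: projected gradient ascent on the box [0, upper].
    x = np.zeros_like(upper)
    for _ in range(steps):
        x = np.clip(x + lr * (H @ x + h), 0.0, upper)
    return x

best = two_phase(f, pga_solver, np.ones(2))
```

Any solver with a non-stationarity guarantee can be dropped in for `pga_solver`; the guarantee below is stated for the non-convex Frank-Wolfe subroutine.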

    Underlying Properties of DR-submodular Maximization

    👉 Concavity Along Non-negative Directions:

    Experimental Results (see the paper for more)

    DR-submodular (DR property) [Bian et al ‘17]: ∀𝒂 ≤ 𝒃 ∈ 𝒳, ∀𝑖, ∀𝑘 ∈ ℝ₊, it holds

    𝑓(𝑘𝒆ᵢ + 𝒂) − 𝑓(𝒂) ≥ 𝑓(𝑘𝒆ᵢ + 𝒃) − 𝑓(𝒃).

    - If 𝑓 is differentiable, 𝛻𝑓(⋅) is an antitone mapping (∀𝒂 ≤ 𝒃, it holds 𝛻𝑓(𝒂) ≥ 𝛻𝑓(𝒃))

    - If 𝑓 is twice differentiable, ∇²ᵢⱼ𝑓(𝒙) ≤ 0 for all 𝑖, 𝑗 and all 𝒙 (all Hessian entries non-positive)
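These three characterizations are easy to check numerically; a minimal sketch on a hypothetical quadratic instance (the values of `H` and `h` are illustrative, not from the poster):

```python
import numpy as np

# Hypothetical quadratic objective f(x) = 0.5 x^T H x + h^T x.
# For a quadratic, the Hessian is H itself, so the second-order DR
# characterization reduces to: every entry of H is <= 0.
H = np.array([[-2.0, -0.5],
              [-0.5, -1.0]])
h = np.array([1.0, 1.5])

def f(x):
    return 0.5 * x @ H @ x + h @ x

def grad_f(x):
    return H @ x + h

# Entrywise non-positive Hessian => continuous DR-submodular.
assert np.all(H <= 0)

# Antitone gradient: a <= b (coordinate-wise) implies grad f(a) >= grad f(b).
a = np.array([0.1, 0.2])
b = np.array([0.4, 0.3])
assert np.all(grad_f(a) >= grad_f(b))

# Diminishing returns along a coordinate: adding k*e_i gains less at b than at a.
k, i = 0.5, 0
e_i = np.eye(2)[i]
assert f(a + k * e_i) - f(a) >= f(b + k * e_i) - f(b)
```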

    Problem: max_{𝒙∈𝒫} 𝑓(𝒙), where 𝑓: 𝒳 → ℝ is continuous DR-submodular and 𝒳 is a hypercube; wlog, let 𝒳 = [𝟎, 𝒖̄]. 𝒫 ⊆ 𝒳 is convex and down-closed: 𝒙 ∈ 𝒫 & 𝟎 ≤ 𝒚 ≤ 𝒙 implies 𝒚 ∈ 𝒫.

    Applications

    References

    Feldman, Naor, and Schwartz. A unified continuous greedy algorithm for submodular maximization. FOCS 2011.

    Gillenwater, Kulesza, and Taskar. Near-optimal MAP inference for determinantal point processes. NIPS 2012.

    Bach. Submodular functions: from discrete to continuous domains. arXiv:1511.00394, 2015.

    Lacoste-Julien. Convergence rate of Frank-Wolfe for non-convex objectives. arXiv:1607.00345, 2016.

    Bian, Mirzasoleiman, Buhmann, and Krause. Guaranteed non-convex optimization: Submodular maximization over continuous domains. AISTATS 2017.

    Quadratic Lower Bound. With an 𝐿-Lipschitz gradient, for all 𝒙 and 𝒗 ∈ ±ℝ₊ⁿ, it holds 𝑓(𝒙 + 𝒗) ≥ 𝑓(𝒙) + ⟨𝛻𝑓(𝒙), 𝒗⟩ − (𝐿/2)‖𝒗‖².

    Strongly DR-submodular & Quadratic Upper Bound. 𝑓 is 𝜇-strongly DR-submodular if for all 𝒙 and 𝒗 ∈ ±ℝ₊ⁿ, it holds 𝑓(𝒙 + 𝒗) ≤ 𝑓(𝒙) + ⟨𝛻𝑓(𝒙), 𝒗⟩ − (𝜇/2)‖𝒗‖².
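Both bounds can be spot-checked on a quadratic. In this sketch, `L = ||H||₂` and `mu = min_i |H_ii|` are illustrative valid constants for this hypothetical instance (entrywise non-positive `H` makes the cross terms only help along sign-aligned directions), not quantities taken from the poster:

```python
import numpy as np

# Hypothetical DR-submodular quadratic f(x) = 0.5 x^T H x + h^T x.
H = np.array([[-1.5, -0.4], [-0.4, -1.0]])
h = np.array([0.7, 0.9])
L = np.linalg.norm(H, 2)        # valid gradient-Lipschitz constant
mu = min(abs(np.diag(H)))       # valid strong-DR constant for entrywise H <= 0

f = lambda x: 0.5 * x @ H @ x + h @ x
grad_f = lambda x: H @ x + h

rng = np.random.default_rng(3)
for _ in range(100):
    x = rng.uniform(0, 1, 2)
    # Sample v entirely non-negative or entirely non-positive (v in ±R_+^n).
    v = rng.uniform(0, 1, 2) * rng.choice([-1.0, 1.0])
    lin = f(x) + grad_f(x) @ v
    assert f(x + v) >= lin - 0.5 * L * (v @ v) - 1e-9   # quadratic lower bound
    assert f(x + v) <= lin - 0.5 * mu * (v @ v) + 1e-9  # quadratic upper bound
```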

    Two Guaranteed Algorithms

    Guarantee of TWO-PHASE ALGORITHM.

    max{𝑓(𝒙), 𝑓(𝒛)} ≥ (1/4)[𝑓(𝒙∗) − min{𝑂(1/√𝐾₁), 𝜖₁} − min{𝑂(1/√𝐾₂), 𝜖₂}] + (𝜇/8)(‖𝒙 − 𝒙∗‖² + ‖𝒛 − 𝒛∗‖²), where 𝒛∗ ≔ 𝒙 ∨ 𝒙∗ − 𝒙; the min terms bound the non-stationarity reached in Phases I and II via the non-convex Frank-Wolfe rate of [Lacoste-Julien ‘16].

    NON-MONOTONE FRANK-WOLFE VARIANT
    Input: step size 𝛾 ∈ (0,1]
    𝒙⁽⁰⁾ ← 𝟎, 𝑘 ← 0, 𝑡⁽⁰⁾ ← 0 // 𝑡: cumulative step size
    While 𝑡⁽ᵏ⁾ < 1 do:
      𝒗⁽ᵏ⁾ ← argmax_{𝒗∈𝒫, 𝒗 ≤ 𝒖̄ − 𝒙⁽ᵏ⁾} ⟨𝒗, 𝛻𝑓(𝒙⁽ᵏ⁾)⟩ // shrunken LMO
      𝛾ₖ ← min{𝛾, 1 − 𝑡⁽ᵏ⁾}
      𝒙⁽ᵏ⁺¹⁾ ← 𝒙⁽ᵏ⁾ + 𝛾ₖ𝒗⁽ᵏ⁾, 𝑡⁽ᵏ⁺¹⁾ ← 𝑡⁽ᵏ⁾ + 𝛾ₖ, 𝑘 ← 𝑘 + 1
    Output: 𝒙⁽ᴷ⁾
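A minimal sketch of the variant, assuming the simplest down-closed 𝒫 (a box), where the shrunken LMO has a closed-form coordinate-wise solution; on a general polytope the LMO would be a linear program. The quadratic instance is illustrative:

```python
import numpy as np

def non_monotone_fw(grad_f, u_bar, gamma=0.05):
    """Sketch of the non-monotone Frank-Wolfe variant on the box P = [0, u_bar].
    The 'shrunken' LMO maximizes <v, grad f(x)> subject to v in P and the
    extra cap v <= u_bar - x; on a box this is solved coordinate-wise by
    setting v_i to its cap wherever the partial derivative is positive."""
    x = np.zeros_like(u_bar)
    t = 0.0                            # cumulative step size
    while t < 1.0:
        g = grad_f(x)
        cap = u_bar - x                # shrunken feasible upper bound
        v = np.where(g > 0, cap, 0.0)  # LMO solution on the shrunken box
        step = min(gamma, 1.0 - t)
        x = x + step * v
        t += step
    return x

# Toy DR-submodular quadratic on [0, 1]^2 (illustrative values).
H = np.array([[-1.0, -0.3], [-0.3, -1.0]])
h = np.array([1.0, 0.5])
f = lambda x: 0.5 * x @ H @ x + h @ x
x_fw = non_monotone_fw(lambda x: H @ x + h, np.ones(2))
```

Because each step moves by at most the remaining cap `u_bar - x` scaled by the remaining budget, the iterate never leaves the box even though the objective is non-monotone.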

    Guarantee of NON-MONOTONE FRANK-WOLFE VARIANT.

    𝑓(𝒙⁽ᴷ⁾) ≥ 𝑒⁻¹𝑓(𝒙∗) − 𝑂(1/𝐾²)𝑓(𝒙∗) − 𝐿𝐷²/(2𝐾)

    Baselines:
    - QUADPROGIP: global solver for non-convex quadratic programming (possibly taking exponential time)
    - Projected Gradient Ascent (PROJGRAD) with diminishing step sizes (1/(𝑘+1))

    DR-submodular Quadratic Programming. Synthetic problem instances 𝑓(𝒙) = ½𝒙ᵀ𝐇𝒙 + 𝒉ᵀ𝒙 + 𝑐, 𝒫 = {𝒙 ∈ ℝ₊ⁿ | 𝐀𝒙 ≤ 𝒃, 𝒙 ≤ 𝒖̄}, with 𝐀 ∈ ℝ₊^{𝑚×𝑛}, 𝒃 ∈ ℝ₊^𝑚; 𝒫 has 𝑚 linear constraints.

    Instances randomly generated in two manners: 1) uniform distribution (see figures below); 2) exponential distribution
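A hypothetical generator in the spirit of the uniform setting (the exact distributions, scalings, and the choice of 𝒃 here are illustrative assumptions; the poster's instances may differ):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_dr_quadratic(n, m):
    """Hypothetical generator for a synthetic DR-submodular QP instance:
    H gets i.i.d. non-positive uniform entries (so f is DR-submodular);
    A, b are non-negative, making the polytope down-closed within [0, u_bar]."""
    H = -rng.uniform(0.0, 1.0, size=(n, n))
    H = 0.5 * (H + H.T)                  # symmetrize; entries stay <= 0
    h = rng.uniform(0.0, 1.0, size=n)
    A = rng.uniform(0.0, 1.0, size=(m, n))
    u_bar = np.ones(n)
    b = A @ u_bar / 2.0                  # illustrative: keeps the polytope non-trivial
    f = lambda x: 0.5 * x @ H @ x + h @ x
    return f, H, h, A, b, u_bar

f, H, h, A, b, u_bar = random_dr_quadratic(n=8, m=4)
assert np.all(H <= 0)                    # DR-submodularity of the quadratic
```

Non-negative `A` together with `x >= 0` is what makes {𝒙 | 𝐀𝒙 ≤ 𝒃, 𝟎 ≤ 𝒙 ≤ 𝒖̄} down-closed.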

    Maximizing Softmax Extensions for MAP Inference of DPPs. 𝑓(𝒙) = log det(diag(𝒙)(𝐋 − 𝐈) + 𝐈), 𝒙 ∈ [0,1]ⁿ. 𝐋: kernel/similarity matrix. 𝒫 is a matching polytope for matched summarization.
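The softmax objective is directly computable; a sketch with a random PSD kernel (illustrative), including the sanity check that at an integral 𝒙 it recovers log det of the corresponding principal submatrix of 𝐋 (the unnormalized DPP log-probability):

```python
import numpy as np

def softmax_extension(x, L):
    """Softmax extension f(x) = log det( diag(x) (L - I) + I ) for x in [0,1]^n."""
    n = L.shape[0]
    M = np.diag(x) @ (L - np.eye(n)) + np.eye(n)
    return np.linalg.slogdet(M)[1]

rng = np.random.default_rng(1)
B = rng.normal(size=(4, 4))
L = B @ B.T                     # random PSD kernel/similarity matrix

# At an integral x with support S, rows outside S collapse to identity rows,
# so det(M) equals det of the principal submatrix L[S, S].
x = np.array([1.0, 0.0, 1.0, 0.0])
sub = L[np.ix_([0, 2], [0, 2])]
assert np.isclose(softmax_extension(x, L), np.linalg.slogdet(sub)[1])
```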

    Synthetic problem instances:
    - Softmax objectives: generate 𝐋 with 𝑛 random eigenvalues
    - Polytope constraints generated similarly to those for quadratic programming

    Real-world results on matched summarization: select a set of document pairs out of a corpus of documents, such that the two documents within a pair are similar, and the overall set of pairs is as diverse as possible. In a setting similar to [Gillenwater et al ‘12], we experimented on the 2012 US Republican debates data.

    [Figures: function value vs. match quality controller (0.2–1), and function value vs. iteration (0–100), for matched summarization.]

    [Diagram: DR-submodular functions in relation to the submodular, concave, and convex function classes.]

    👉 Approximately Stationary Points & Global Optimum:

    (Local-Global Relation). Let 𝒙 ∈ 𝒫 with non-stationarity 𝑔_𝒫(𝒙). Define 𝒬 ≔ 𝒫 ∩ {𝒚 | 𝒚 ≤ 𝒖̄ − 𝒙}. Let 𝒛 ∈ 𝒬 with non-stationarity 𝑔_𝒬(𝒛). Then

    max{𝑓(𝒙), 𝑓(𝒛)} ≥ (1/4)[𝑓(𝒙∗) − 𝑔_𝒫(𝒙) − 𝑔_𝒬(𝒛)] + (𝜇/8)(‖𝒙 − 𝒙∗‖² + ‖𝒛 − 𝒛∗‖²),

    where 𝒛∗ ≔ 𝒙 ∨ 𝒙∗ − 𝒙.

    - Proof using the essential DR property on carefully constructed auxiliary points

    - Good empirical performance for the Two-Phase algorithm: if 𝒙 is far from 𝒙∗, the ‖𝒙 − 𝒙∗‖² term augments the bound; if 𝒙 is close to 𝒙∗, then by the smoothness of 𝑓, 𝑓(𝒙) should be near optimal.

    DR-submodularity captures a subclass of non-convex/non-concave functions that enables exact minimization and approximate maximization in poly. time.

    👉 Investigate geometric properties that underlie such objectives; e.g., we prove a strong relation between stationary points and the global optimum.

    👉 Devise two guaranteed algorithms: i) a “two-phase” algorithm with a ¼ approximation guarantee; ii) a non-monotone Frank-Wolfe variant with a 1/𝑒 approximation guarantee.

    👉 Extend to a much broader class of submodular functions on “conic” lattices.

    Abstract

    [Figures: approximation ratio and function value vs. dimensionality (8–16) for DR-submodular quadratic programming, in panels 𝑚 = 0.5𝑛, 𝑚 = 𝑛, 𝑚 = 1.5𝑛.]

    The shrunken LMO is the key difference from the monotone Frank-Wolfe variant [Bian et al ‘17]

    Lemma. For any 𝒙, 𝒚, ⟨𝒚 − 𝒙, 𝛻𝑓(𝒙)⟩ ≥ 𝑓(𝒙 ∨ 𝒚) + 𝑓(𝒙 ∧ 𝒚) − 2𝑓(𝒙) + (𝜇/2)‖𝒙 − 𝒚‖².

    If 𝛻𝑓(𝒙) = 𝟎, then 2𝑓(𝒙) ≥ 𝑓(𝒙 ∨ 𝒚) + 𝑓(𝒙 ∧ 𝒚) + (𝜇/2)‖𝒙 − 𝒚‖² ⇒ implicit relation between 𝒙 & 𝒚. (Finding an exact stationary point is difficult 😟)
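The lemma can be spot-checked numerically on a strongly DR-submodular quadratic; the instance values and the choice 𝜇 = minᵢ|Hᵢᵢ| are illustrative assumptions (valid here because the off-diagonal terms only help along the sign-aligned directions the proof uses):

```python
import numpy as np

# Spot-check of the lemma for a mu-strongly DR-submodular quadratic.
# Hypothetical instance: for f(x) = 0.5 x^T H x + h^T x with entrywise
# H <= 0, one valid strong-DR constant is mu = min_i |H_ii|.
H = np.array([[-2.0, -0.5], [-0.5, -1.0]])
h = np.array([1.0, 1.5])
mu = min(abs(np.diag(H)))

f = lambda x: 0.5 * x @ H @ x + h @ x
grad_f = lambda x: H @ x + h

rng = np.random.default_rng(2)
for _ in range(100):
    x, y = rng.uniform(0, 1, 2), rng.uniform(0, 1, 2)
    lhs = (y - x) @ grad_f(x)
    rhs = (f(np.maximum(x, y)) + f(np.minimum(x, y)) - 2 * f(x)
           + 0.5 * mu * np.sum((x - y) ** 2))
    assert lhs >= rhs - 1e-9
```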

    Non-stationarity Measure [Lacoste-Julien ‘16]. For any 𝒬 ⊆ 𝒳, the non-stationarity of 𝒙 ∈ 𝒬 is 𝑔_𝒬(𝒙) ≔ max_{𝒗∈𝒬} ⟨𝒗 − 𝒙, 𝛻𝑓(𝒙)⟩.
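On a box, the inner maximization is solved coordinate-wise; a minimal sketch (a box feasible set is assumed for simplicity — a general 𝒬 would require an LP):

```python
import numpy as np

def non_stationarity_box(x, grad, lower, upper):
    """g_Q(x) = max_{v in Q} <v - x, grad f(x)> for the box Q = [lower, upper]:
    the maximizing v puts each coordinate at its upper bound where the
    gradient is positive and at its lower bound where it is negative."""
    v = np.where(grad > 0, upper, lower)
    return (v - x) @ grad

# At a point where the gradient vanishes, the non-stationarity is zero.
g = non_stationarity_box(np.array([0.5, 0.5]), np.zeros(2),
                         np.zeros(2), np.ones(2))
assert g == 0.0
```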

    ∨: coordinate-wise max. ∧: coordinate-wise min.

    𝐷: diameter of 𝒫. 𝐿: Lipschitz constant of 𝛻𝑓 (smoothness).

    [Figure: softmax (red) & multilinear (blue) extensions, with concave cross-sections. Figure from [Gillenwater et al ‘12].]