Convergence analysis for gradient descent optimization methods in the training of ReLU neural networks


Transcript of Convergence analysis for gradient descent optimization ...

Page 1: Convergence analysis for gradient descent optimization ...

Convergence analysis for gradient descent optimization methods in the training of ReLU neural networks

Arnulf Jentzen (CUHK Shenzhen, China & University of Münster, Germany)
Funded by the German Research Foundation through the Cluster of Excellence Mathematics Münster

Joint works with

Patrick Cheridito (ETH Zurich, Switzerland)
Benjamin Fehrman (University of Oxford, UK)
Benjamin Gess (University of Bielefeld, Germany)
Martin Hutzenthaler (University of Duisburg-Essen, Germany)
Benno Kuckuck (University of Münster, Germany)
Thomas Kruse (University of Giessen, Germany)
Tuan Anh Nguyen (University of Duisburg-Essen, Germany)
Joshua Lee Padgett (University of Arkansas, USA)
Florian Rossmannek (ETH Zurich, Switzerland)
Adrian Riekert (University of Münster, Germany)

Wednesday, September 27th, 2021

Pages 2–30: Convergence analysis for gradient descent optimization ...

Theorem (Hutzenthaler, J, Kruse, Nguyen 2019 SN PDE; Kuckuck, J, Padgett 2021)

Let $T, p, \kappa > 0$, let $f \colon \mathbb{R} \to \mathbb{R}$ be Lipschitz, for every $d \in \mathbb{N}$ let $g_d \in C^1(\mathbb{R}^d, \mathbb{R})$ and let $u_d \colon [0,T] \times \mathbb{R}^d \to \mathbb{R}$ be an at most polynomially growing solution of

\[ \tfrac{\partial u_d}{\partial t} = \Delta_x u_d + f(u_d) \quad \text{with} \quad u_d(0,\cdot) = g_d, \]

assume $|g_d(x)| + \|(\nabla g_d)(x)\| \leq \kappa d^{\kappa}(1 + \|x\|^{\kappa})$, let $A_l \colon \mathbb{R}^l \to \mathbb{R}^l$, $l \in \mathbb{N}$, satisfy $A_l(x_1, \ldots, x_l) = (\max\{x_1, 0\}, \ldots, \max\{x_l, 0\})$, let

\[ \mathbf{N} = \textstyle\bigcup_{L \in \mathbb{N}} \bigcup_{l_0, \ldots, l_L \in \mathbb{N}} \bigl( \times_{n=1}^{L} (\mathbb{R}^{l_n \times l_{n-1}} \times \mathbb{R}^{l_n}) \bigr), \]

let $\mathcal{R} \colon \mathbf{N} \to \bigcup_{a,b=1}^{\infty} C(\mathbb{R}^a, \mathbb{R}^b)$ satisfy for all $L \in \mathbb{N}$, $l_0, \ldots, l_L \in \mathbb{N}$, $\Phi = ((W_1, B_1), \ldots, (W_L, B_L)) \in \times_{n=1}^{L} (\mathbb{R}^{l_n \times l_{n-1}} \times \mathbb{R}^{l_n})$, $x_0 \in \mathbb{R}^{l_0}, \ldots, x_L \in \mathbb{R}^{l_L}$ with $\forall\, n \in \{1, \ldots, L\} \colon x_n = A_{l_n}(W_n x_{n-1} + B_n)$ that

\[ (\mathcal{R}\Phi)(x_0) = W_L x_{L-1} + B_L, \]

let $\mathcal{P} \colon \mathbf{N} \to \mathbb{N}$ be the number of parameters, and let $(\mathbf{G}_{d,\varepsilon})_{d \in \mathbb{N},\, \varepsilon \in (0,1]} \subseteq \mathbf{N}$ satisfy $\mathcal{P}(\mathbf{G}_{d,\varepsilon}) \leq \kappa d^{\kappa} \varepsilon^{-\kappa}$ and $|g_d(x) - (\mathcal{R}\mathbf{G}_{d,\varepsilon})(x)| \leq \varepsilon \kappa d^{\kappa}(1 + \|x\|^{\kappa})$. Then there exist $(\mathbf{U}_{d,\varepsilon})_{d \in \mathbb{N},\, \varepsilon \in (0,1]} \subseteq \mathbf{N}$ and $c > 0$ such that for all $d \in \mathbb{N}$, $\varepsilon \in (0,1]$:

\[ \Bigl[ \int_{[0,T] \times [0,1]^d} |u_d(y) - (\mathcal{R}\mathbf{U}_{d,\varepsilon})(y)|^p \, dy \Bigr]^{1/p} \leq \varepsilon \qquad \text{and} \qquad \mathcal{P}(\mathbf{U}_{d,\varepsilon}) \leq c\, d^c \varepsilon^{-c}. \]

Linear PDEs: Grohs, Hornung, J, von Wurstemberger 2018 Mem. AMS; Berner, Grohs, J 2018 SIAM J. Math. Data Sci.; Elbrächter, Grohs, J, Schwab 2018; ...
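
To make the objects in this statement concrete, here is a minimal NumPy sketch of the realization map $\mathcal{R}$ and the parameter count $\mathcal{P}$ for a ReLU network $\Phi = ((W_1,B_1),\ldots,(W_L,B_L))$; the function names and the toy architecture are illustrative assumptions, not part of the talk.

```python
import numpy as np

def realization(phi, x0):
    """(R Phi)(x0) for phi = [(W1, B1), ..., (WL, BL)]: ReLU after every affine
    map except the last one, i.e. x_n = A_{l_n}(W_n x_{n-1} + B_n) for n < L."""
    x = np.asarray(x0, dtype=float)
    for W, B in phi[:-1]:
        x = np.maximum(W @ x + B, 0.0)   # componentwise ReLU activation A_l
    W_L, B_L = phi[-1]
    return W_L @ x + B_L                 # affine output layer, no activation

def num_parameters(phi):
    """P(Phi): total number of real entries in all weight matrices and bias vectors."""
    return sum(W.size + B.size for W, B in phi)

# Toy architecture l0 = 3, l1 = 5, l2 = 1, so P = (5*3 + 5) + (1*5 + 1) = 26.
rng = np.random.default_rng(0)
phi = [(rng.standard_normal((5, 3)), rng.standard_normal(5)),
       (rng.standard_normal((1, 5)), rng.standard_normal(1))]
print(realization(phi, np.ones(3)), num_parameters(phi))
```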


Pages 31–51: Convergence analysis for gradient descent optimization ...

Let $d, H, P \in \mathbb{N}$, $a \in \mathbb{R}$, $b > a$, $u \in C([a,b]^d, \mathbb{R})$ satisfy $P = dH + 2H + 1$, let $\mu \colon \mathcal{B}([a,b]^d) \to [0,\infty)$ be a measure, and let $\mathcal{L}_H \colon \mathbb{R}^P \to \mathbb{R}$ satisfy for all $\theta \in \mathbb{R}^P$:

\[ \mathcal{L}_H(\theta) = \int_{[a,b]^d} \Bigl[ u(x) - \theta_P - \sum_{i=1}^{H} \theta_{H(d+1)+i} \max\Bigl\{ \theta_{Hd+i} + \sum_{j=1}^{d} \theta_{(i-1)d+j}\, x_j,\ 0 \Bigr\} \Bigr]^2 \mu(dx), \]

let $\mathcal{G}_H \colon \mathbb{R}^P \to \mathbb{R}^P$ be an appropriately generalized gradient of $\mathcal{L}_H$, and let $\Theta \in C([0,\infty), \mathbb{R}^P)$ satisfy for all $t \in [0,\infty)$ that $\Theta_t = \Theta_0 - \int_0^t \mathcal{G}_H(\Theta_s)\, ds$.

Theorem (Cheridito, J, Riekert, Rossmannek 2021; J, Riekert 2021)

Assume for all $x, y \in [a,b]^d$ that $u(x) = u(y)$. Then there exist no non-global local minima and no saddle points of $\mathcal{L}_H$, and $\lim_{t \to \infty} \mathcal{L}_H(\Theta_t) = 0$.

Theorem (Cheridito, J, Rossmannek 2021; J, Riekert 2021)

Assume $\mu = \lambda_{[a,b]}$, let $\alpha, \beta \in \mathbb{R}$ satisfy $u(x) = \alpha x + \beta$, and assume $\liminf_{t \to \infty} \|\Theta_t\| < \infty$ and $\mathcal{L}_H(\Theta_0) < \frac{\alpha^2 (b-a)^3}{12\,(2\lfloor H/2 \rfloor + 1)^4}$. Then there exist infinitely many non-global local minima and infinitely many saddle points of $\mathcal{L}_H$, and $\lim_{t \to \infty} \mathcal{L}_H(\Theta_t) = 0$.

Theorem (Eberle, J, Riekert, Weiss 2021)

Assume $\mu \ll \lambda_{[a,b]^d}$ with a piecewise polynomial density, assume that $u$ is piecewise polynomial, and assume $\liminf_{t \to \infty} \|\Theta_t\| < \infty$. Then there exist $\vartheta \in (\mathcal{G}_H)^{-1}(0)$ and $c, \beta > 0$ such that $\|\Theta_t - \vartheta\| \leq c(1+t)^{-\beta}$ and $0 \leq \mathcal{L}_H(\Theta_t) - \mathcal{L}_H(\vartheta) \leq c(1+t)^{-1}$.
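
For concreteness, here is a minimal NumPy sketch of this shallow-network setting, using the slide's parameter indexing for $\theta \in \mathbb{R}^P$. It assumes $\mu$ is the uniform distribution on $[a,b]^d$, approximates $\mathcal{L}_H$ by Monte Carlo, and uses a forward-difference quotient plus an explicit Euler step as a crude stand-in for the gradient flow driven by the generalized gradient $\mathcal{G}_H$; all function names, sample sizes, and step sizes are illustrative assumptions, not part of the talk.

```python
import numpy as np

def shallow_relu(theta, x, d, H):
    """Shallow ReLU network with the slide's indexing of theta in R^P, P = d*H + 2*H + 1:
    input weights theta_{(i-1)d+j}, inner biases theta_{Hd+i},
    output weights theta_{H(d+1)+i}, outer bias theta_P."""
    w = theta[:d * H].reshape(H, d)          # row i holds the input weights of neuron i
    b = theta[d * H:d * H + H]               # inner biases
    v = theta[d * H + H:d * H + 2 * H]       # output weights
    c = theta[-1]                            # outer bias theta_P
    return v @ np.maximum(w @ x + b, 0.0) + c

def loss_mc(theta, u, d, H, a, b, n=1024):
    """Monte Carlo approximation of L_H(theta) with mu = uniform distribution on [a,b]^d."""
    xs = np.random.default_rng(0).uniform(a, b, size=(n, d))   # fixed samples, deterministic
    res = np.array([u(x) - shallow_relu(theta, x, d, H) for x in xs])
    return float(np.mean(res ** 2))

def euler_step(theta, u, d, H, a, b, dt=1e-2, h=1e-5):
    """One explicit Euler step for Theta_t = Theta_0 - int_0^t G_H(Theta_s) ds, with a
    forward-difference quotient standing in for the generalized gradient G_H."""
    grad = np.zeros_like(theta)
    base = loss_mc(theta, u, d, H, a, b)
    for k in range(theta.size):
        bump = np.zeros_like(theta)
        bump[k] = h
        grad[k] = (loss_mc(theta + bump, u, d, H, a, b) - base) / h
    return theta - dt * grad

d, H = 1, 4
P = d * H + 2 * H + 1
theta = np.random.default_rng(1).standard_normal(P)
u = lambda x: 2.0 * x[0] + 1.0               # affine target, as in the second theorem
for _ in range(10):
    theta = euler_step(theta, u, d, H, 0.0, 1.0)
print(loss_mc(theta, u, d, H, 0.0, 1.0))
```

At kinks of the ReLU the difference quotient need not agree with the generalized gradient used in the talk; the sketch is only meant to make the parameterization and the flow explicit.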

Page 32: Convergence analysis for gradient descent optimization ...

Let d,H,P ∈ N, a ∈ R, b > a, u ∈ C([a,b]d ,R) satisfy P = dH + 2H + 1, letµ : B([a,b]d )→ [0,∞) be measure, let LH : RP → R satisfy ∀ θ ∈ RP :

LH(θ)=∫

[a,b]d

[u(x)−θP−

∑Hi=1θH(d+1)+i maxθHd+i +

∑dj=1 θ(i−1)d+jxj , 0

]2µ(dx),

let GH : RP → RP be an appropriately generalized gradient of LH , and letΘ ∈ C([0,∞),RP) satisfy for all t ∈ [0,∞) that Θt = Θ0 −

∫ t0 GH(Θs) ds.

Theorem (Cheridito, J, Riekert, Rossmannek 2021; J, Riekert 2021)

Assume for all x, y ∈ [a,b]d that u(x) = u(y). Then there exist no non-global localminima and no saddle points of LH and limt→∞ LH(Θt) = 0.

Theorem (Cheridito, J, Rossmannek 2021; J, Riekert 2021)

Assume µ = λ[a,b], let α, β ∈ R satisfy u(x) = αx + β, and assume

lim inft→∞‖Θt‖ <∞ and LH(Θ0) < α2(b−a)3

12(2bH/2c+1)4 . Then there exist∞-many

non-global local minima and∞-many saddle points of LH and limt→∞ LH(Θt) = 0.

Theorem (Eberle, J, Riekert, Weiss 2021)

Assume µ λ[a,b]d with a piecewise polynomial density, assume that u is piecewisepolynomial, and assume lim inft→∞‖Θt‖ <∞. Then there exist ϑ ∈ (GH)−1(0),c, β > 0 such that ‖Θt −ϑ‖ ≤ c(1 + t)−β and 0 ≤ LH(Θt)−L(ϑ) ≤ c(1 + t)−1.

Page 33: Convergence analysis for gradient descent optimization ...

Let d,H,P ∈ N, a ∈ R, b > a, u ∈ C([a,b]d ,R) satisfy P = dH + 2H + 1, letµ : B([a,b]d )→ [0,∞) be measure, let LH : RP → R satisfy ∀ θ ∈ RP :

LH(θ)=∫

[a,b]d

[u(x)−θP−

∑Hi=1θH(d+1)+i maxθHd+i +

∑dj=1 θ(i−1)d+jxj , 0

]2µ(dx),

let GH : RP → RP be an appropriately generalized gradient of LH , and letΘ ∈ C([0,∞),RP) satisfy for all t ∈ [0,∞) that Θt = Θ0 −

∫ t0 GH(Θs) ds.

Theorem (Cheridito, J, Riekert, Rossmannek 2021; J, Riekert 2021)

Assume for all x, y ∈ [a,b]d that u(x) = u(y). Then there exist no non-global localminima and no saddle points of LH and limt→∞ LH(Θt) = 0.

Theorem (Cheridito, J, Rossmannek 2021; J, Riekert 2021)

Assume µ = λ[a,b], let α, β ∈ R satisfy u(x) = αx + β, and assume

lim inft→∞‖Θt‖ <∞ and LH(Θ0) < α2(b−a)3

12(2bH/2c+1)4 . Then there exist∞-many

non-global local minima and∞-many saddle points of LH and limt→∞ LH(Θt) = 0.

Theorem (Eberle, J, Riekert, Weiss 2021)

Assume µ λ[a,b]d with a piecewise polynomial density, assume that u is piecewisepolynomial, and assume lim inft→∞‖Θt‖ <∞. Then there exist ϑ ∈ (GH)−1(0),c, β > 0 such that ‖Θt −ϑ‖ ≤ c(1 + t)−β and 0 ≤ LH(Θt)−L(ϑ) ≤ c(1 + t)−1.

Page 34: Convergence analysis for gradient descent optimization ...

Let d,H,P ∈ N, a ∈ R, b > a, u ∈ C([a,b]d ,R) satisfy P = dH + 2H + 1, letµ : B([a,b]d )→ [0,∞) be measure, let LH : RP → R satisfy ∀ θ ∈ RP :

LH(θ)=∫

[a,b]d

[u(x)−θP−

∑Hi=1θH(d+1)+i maxθHd+i +

∑dj=1 θ(i−1)d+jxj , 0

]2µ(dx),

let GH : RP → RP be an appropriately generalized gradient of LH , and letΘ ∈ C([0,∞),RP) satisfy for all t ∈ [0,∞) that Θt = Θ0 −

∫ t0 GH(Θs) ds.

Theorem (Cheridito, J, Riekert, Rossmannek 2021; J, Riekert 2021)

Assume for all x, y ∈ [a,b]d that u(x) = u(y). Then there exist no non-global localminima and no saddle points of LH and limt→∞ LH(Θt) = 0.

Theorem (Cheridito, J, Rossmannek 2021; J, Riekert 2021)

Assume µ = λ[a,b], let α, β ∈ R satisfy u(x) = αx + β, and assume

lim inft→∞‖Θt‖ <∞ and LH(Θ0) < α2(b−a)3

12(2bH/2c+1)4 . Then there exist∞-many

non-global local minima and∞-many saddle points of LH and limt→∞ LH(Θt) = 0.

Theorem (Eberle, J, Riekert, Weiss 2021)

Assume µ λ[a,b]d with a piecewise polynomial density, assume that u is piecewisepolynomial, and assume lim inft→∞‖Θt‖ <∞. Then there exist ϑ ∈ (GH)−1(0),c, β > 0 such that ‖Θt −ϑ‖ ≤ c(1 + t)−β and 0 ≤ LH(Θt)−L(ϑ) ≤ c(1 + t)−1.

Page 35: Convergence analysis for gradient descent optimization ...

Let d,H,P ∈ N, a ∈ R, b > a, u ∈ C([a,b]d ,R) satisfy P = dH + 2H + 1, letµ : B([a,b]d )→ [0,∞) be measure, let LH : RP → R satisfy ∀ θ ∈ RP :

LH(θ)=∫

[a,b]d

[u(x)−θP−

∑Hi=1θH(d+1)+i maxθHd+i +

∑dj=1 θ(i−1)d+jxj , 0

]2µ(dx),

let GH : RP → RP be an appropriately generalized gradient of LH , and letΘ ∈ C([0,∞),RP) satisfy for all t ∈ [0,∞) that Θt = Θ0 −

∫ t0 GH(Θs) ds.

Theorem (Cheridito, J, Riekert, Rossmannek 2021; J, Riekert 2021)

Assume for all x, y ∈ [a,b]d that u(x) = u(y). Then there exist no non-global localminima and no saddle points of LH and limt→∞ LH(Θt) = 0.

Theorem (Cheridito, J, Rossmannek 2021; J, Riekert 2021)

Assume µ = λ[a,b], let α, β ∈ R satisfy u(x) = αx + β, and assume

lim inft→∞‖Θt‖ <∞ and LH(Θ0) < α2(b−a)3

12(2bH/2c+1)4 . Then there exist∞-many

non-global local minima and∞-many saddle points of LH and limt→∞ LH(Θt) = 0.

Theorem (Eberle, J, Riekert, Weiss 2021)

Assume µ λ[a,b]d with a piecewise polynomial density, assume that u is piecewisepolynomial, and assume lim inft→∞‖Θt‖ <∞. Then there exist ϑ ∈ (GH)−1(0),c, β > 0 such that ‖Θt −ϑ‖ ≤ c(1 + t)−β and 0 ≤ LH(Θt)−L(ϑ) ≤ c(1 + t)−1.

Page 36: Convergence analysis for gradient descent optimization ...

Let d,H,P ∈ N, a ∈ R, b > a, u ∈ C([a,b]d ,R) satisfy P = dH + 2H + 1, letµ : B([a,b]d )→ [0,∞) be measure, let LH : RP → R satisfy ∀ θ ∈ RP :

LH(θ)=∫

[a,b]d

[u(x)−θP−

∑Hi=1θH(d+1)+i maxθHd+i +

∑dj=1 θ(i−1)d+jxj , 0

]2µ(dx),

let GH : RP → RP be an appropriately generalized gradient of LH , and letΘ ∈ C([0,∞),RP) satisfy for all t ∈ [0,∞) that Θt = Θ0 −

∫ t0 GH(Θs) ds.

Theorem (Cheridito, J, Riekert, Rossmannek 2021; J, Riekert 2021)

Assume for all x, y ∈ [a,b]d that u(x) = u(y). Then there exist no non-global localminima and no saddle points of LH and limt→∞ LH(Θt) = 0.

Theorem (Cheridito, J, Rossmannek 2021; J, Riekert 2021)

Assume µ = λ[a,b], let α, β ∈ R satisfy u(x) = αx + β, and assume

lim inft→∞‖Θt‖ <∞ and LH(Θ0) < α2(b−a)3

12(2bH/2c+1)4 . Then there exist∞-many

non-global local minima and∞-many saddle points of LH and limt→∞ LH(Θt) = 0.

Theorem (Eberle, J, Riekert, Weiss 2021)

Assume µ λ[a,b]d with a piecewise polynomial density, assume that u is piecewisepolynomial, and assume lim inft→∞‖Θt‖ <∞. Then there exist ϑ ∈ (GH)−1(0),c, β > 0 such that ‖Θt −ϑ‖ ≤ c(1 + t)−β and 0 ≤ LH(Θt)−L(ϑ) ≤ c(1 + t)−1.

Page 37: Convergence analysis for gradient descent optimization ...

Let d,H,P ∈ N, a ∈ R, b > a, u ∈ C([a,b]d ,R) satisfy P = dH + 2H + 1, letµ : B([a,b]d )→ [0,∞) be measure, let LH : RP → R satisfy ∀ θ ∈ RP :

LH(θ)=∫

[a,b]d

[u(x)−θP−

∑Hi=1θH(d+1)+i maxθHd+i +

∑dj=1 θ(i−1)d+jxj , 0

]2µ(dx),

let GH : RP → RP be an appropriately generalized gradient of LH , and letΘ ∈ C([0,∞),RP) satisfy for all t ∈ [0,∞) that Θt = Θ0 −

∫ t0 GH(Θs) ds.

Theorem (Cheridito, J, Riekert, Rossmannek 2021; J, Riekert 2021)

Assume for all x, y ∈ [a,b]d that u(x) = u(y). Then there exist no non-global localminima and no saddle points of LH and limt→∞ LH(Θt) = 0.

Theorem (Cheridito, J, Rossmannek 2021; J, Riekert 2021)

Assume µ = λ[a,b], let α, β ∈ R satisfy u(x) = αx + β, and assume

lim inft→∞‖Θt‖ <∞ and LH(Θ0) < α2(b−a)3

12(2bH/2c+1)4 . Then there exist∞-many

non-global local minima and∞-many saddle points of LH and limt→∞ LH(Θt) = 0.

Theorem (Eberle, J, Riekert, Weiss 2021)

Assume µ λ[a,b]d with a piecewise polynomial density, assume that u is piecewisepolynomial, and assume lim inft→∞‖Θt‖ <∞. Then there exist ϑ ∈ (GH)−1(0),c, β > 0 such that ‖Θt −ϑ‖ ≤ c(1 + t)−β and 0 ≤ LH(Θt)−L(ϑ) ≤ c(1 + t)−1.

Page 38: Convergence analysis for gradient descent optimization ...

Let d,H,P ∈ N, a ∈ R, b > a, u ∈ C([a,b]d ,R) satisfy P = dH + 2H + 1, letµ : B([a,b]d )→ [0,∞) be measure, let LH : RP → R satisfy ∀ θ ∈ RP :

LH(θ)=∫

[a,b]d

[u(x)−θP−

∑Hi=1θH(d+1)+i maxθHd+i +

∑dj=1 θ(i−1)d+jxj , 0

]2µ(dx),

let GH : RP → RP be an appropriately generalized gradient of LH , and letΘ ∈ C([0,∞),RP) satisfy for all t ∈ [0,∞) that Θt = Θ0 −

∫ t0 GH(Θs) ds.

Theorem (Cheridito, J, Riekert, Rossmannek 2021; J, Riekert 2021)

Assume for all x, y ∈ [a,b]d that u(x) = u(y). Then there exist no non-global localminima and no saddle points of LH and limt→∞ LH(Θt) = 0.

Theorem (Cheridito, J, Rossmannek 2021; J, Riekert 2021)

Assume µ = λ[a,b], let α, β ∈ R satisfy u(x) = αx + β, and assume

lim inft→∞‖Θt‖ <∞ and LH(Θ0) < α2(b−a)3

12(2bH/2c+1)4 . Then there exist∞-many

non-global local minima and∞-many saddle points of LH and limt→∞ LH(Θt) = 0.

Theorem (Eberle, J, Riekert, Weiss 2021)

Assume µ λ[a,b]d with a piecewise polynomial density, assume that u is piecewisepolynomial, and assume lim inft→∞‖Θt‖ <∞. Then there exist ϑ ∈ (GH)−1(0),c, β > 0 such that ‖Θt −ϑ‖ ≤ c(1 + t)−β and 0 ≤ LH(Θt)−L(ϑ) ≤ c(1 + t)−1.

Page 39: Convergence analysis for gradient descent optimization ...

Let d,H,P ∈ N, a ∈ R, b > a, u ∈ C([a,b]d ,R) satisfy P = dH + 2H + 1, letµ : B([a,b]d )→ [0,∞) be measure, let LH : RP → R satisfy ∀ θ ∈ RP :

LH(θ)=∫

[a,b]d

[u(x)−θP−

∑Hi=1θH(d+1)+i maxθHd+i +

∑dj=1 θ(i−1)d+jxj , 0

]2µ(dx),

let GH : RP → RP be an appropriately generalized gradient of LH , and letΘ ∈ C([0,∞),RP) satisfy for all t ∈ [0,∞) that Θt = Θ0 −

∫ t0 GH(Θs) ds.

Theorem (Cheridito, J, Riekert, Rossmannek 2021; J, Riekert 2021)

Assume for all x, y ∈ [a,b]d that u(x) = u(y). Then there exist no non-global localminima and no saddle points of LH and limt→∞ LH(Θt) = 0.

Theorem (Cheridito, J, Rossmannek 2021; J, Riekert 2021)

Assume µ = λ[a,b], let α, β ∈ R satisfy u(x) = αx + β, and assume

lim inft→∞‖Θt‖ <∞ and LH(Θ0) < α2(b−a)3

12(2bH/2c+1)4 . Then there exist∞-many

non-global local minima and∞-many saddle points of LH and limt→∞ LH(Θt) = 0.

Theorem (Eberle, J, Riekert, Weiss 2021)

Assume µ λ[a,b]d with a piecewise polynomial density, assume that u is piecewisepolynomial, and assume lim inft→∞‖Θt‖ <∞. Then there exist ϑ ∈ (GH)−1(0),c, β > 0 such that ‖Θt −ϑ‖ ≤ c(1 + t)−β and 0 ≤ LH(Θt)−L(ϑ) ≤ c(1 + t)−1.

Page 40: Convergence analysis for gradient descent optimization ...

Let d,H,P ∈ N, a ∈ R, b > a, u ∈ C([a,b]d ,R) satisfy P = dH + 2H + 1, letµ : B([a,b]d )→ [0,∞) be measure, let LH : RP → R satisfy ∀ θ ∈ RP :

LH(θ)=∫

[a,b]d

[u(x)−θP−

∑Hi=1θH(d+1)+i maxθHd+i +

∑dj=1 θ(i−1)d+jxj , 0

]2µ(dx),

let GH : RP → RP be an appropriately generalized gradient of LH , and letΘ ∈ C([0,∞),RP) satisfy for all t ∈ [0,∞) that Θt = Θ0 −

∫ t0 GH(Θs) ds.

Theorem (Cheridito, J, Riekert, Rossmannek 2021; J, Riekert 2021)

Assume for all x, y ∈ [a,b]d that u(x) = u(y). Then there exist no non-global localminima and no saddle points of LH and limt→∞ LH(Θt) = 0.

Theorem (Cheridito, J, Rossmannek 2021; J, Riekert 2021)

Assume µ = λ[a,b], let α, β ∈ R satisfy u(x) = αx + β, and assume

lim inft→∞‖Θt‖ <∞ and LH(Θ0) < α2(b−a)3

12(2bH/2c+1)4 . Then there exist∞-many

non-global local minima and∞-many saddle points of LH and limt→∞ LH(Θt) = 0.

Theorem (Eberle, J, Riekert, Weiss 2021)

Assume µ λ[a,b]d with a piecewise polynomial density, assume that u is piecewisepolynomial, and assume lim inft→∞‖Θt‖ <∞. Then there exist ϑ ∈ (GH)−1(0),c, β > 0 such that ‖Θt −ϑ‖ ≤ c(1 + t)−β and 0 ≤ LH(Θt)−L(ϑ) ≤ c(1 + t)−1.

Page 41: Convergence analysis for gradient descent optimization ...

Let d,H,P ∈ N, a ∈ R, b > a, u ∈ C([a,b]d ,R) satisfy P = dH + 2H + 1, letµ : B([a,b]d )→ [0,∞) be measure, let LH : RP → R satisfy ∀ θ ∈ RP :

LH(θ)=∫

[a,b]d

[u(x)−θP−

∑Hi=1θH(d+1)+i maxθHd+i +

∑dj=1 θ(i−1)d+jxj , 0

]2µ(dx),

let GH : RP → RP be an appropriately generalized gradient of LH , and letΘ ∈ C([0,∞),RP) satisfy for all t ∈ [0,∞) that Θt = Θ0 −

∫ t0 GH(Θs) ds.

Theorem (Cheridito, J, Riekert, Rossmannek 2021; J, Riekert 2021)

Assume for all x, y ∈ [a,b]d that u(x) = u(y). Then there exist no non-global localminima and no saddle points of LH and limt→∞ LH(Θt) = 0.

Theorem (Cheridito, J, Rossmannek 2021; J, Riekert 2021)

Assume µ = λ[a,b], let α, β ∈ R satisfy u(x) = αx + β, and assume

lim inft→∞‖Θt‖ <∞ and LH(Θ0) < α2(b−a)3

12(2bH/2c+1)4 . Then there exist∞-many

non-global local minima and∞-many saddle points of LH and limt→∞ LH(Θt) = 0.

Theorem (Eberle, J, Riekert, Weiss 2021)

Assume µ λ[a,b]d with a piecewise polynomial density, assume that u is piecewisepolynomial, and assume lim inft→∞‖Θt‖ <∞. Then there exist ϑ ∈ (GH)−1(0),c, β > 0 such that ‖Θt −ϑ‖ ≤ c(1 + t)−β and 0 ≤ LH(Θt)−L(ϑ) ≤ c(1 + t)−1.

Page 42: Convergence analysis for gradient descent optimization ...

Let d,H,P ∈ N, a ∈ R, b > a, u ∈ C([a,b]d ,R) satisfy P = dH + 2H + 1, letµ : B([a,b]d )→ [0,∞) be measure, let LH : RP → R satisfy ∀ θ ∈ RP :

LH(θ)=∫

[a,b]d

[u(x)−θP−

∑Hi=1θH(d+1)+i maxθHd+i +

∑dj=1 θ(i−1)d+jxj , 0

]2µ(dx),

let GH : RP → RP be an appropriately generalized gradient of LH , and letΘ ∈ C([0,∞),RP) satisfy for all t ∈ [0,∞) that Θt = Θ0 −

∫ t0 GH(Θs) ds.

Theorem (Cheridito, J, Riekert, Rossmannek 2021; J, Riekert 2021)

Assume for all x, y ∈ [a,b]d that u(x) = u(y). Then there exist no non-global localminima and no saddle points of LH and limt→∞ LH(Θt) = 0.

Theorem (Cheridito, J, Rossmannek 2021; J, Riekert 2021)

Assume µ = λ[a,b], let α, β ∈ R satisfy u(x) = αx + β, and assume

lim inft→∞‖Θt‖ <∞ and LH(Θ0) < α2(b−a)3

12(2bH/2c+1)4 . Then there exist∞-many

non-global local minima and∞-many saddle points of LH and limt→∞ LH(Θt) = 0.

Theorem (Eberle, J, Riekert, Weiss 2021)

Assume µ λ[a,b]d with a piecewise polynomial density, assume that u is piecewisepolynomial, and assume lim inft→∞‖Θt‖ <∞. Then there exist ϑ ∈ (GH)−1(0),c, β > 0 such that ‖Θt −ϑ‖ ≤ c(1 + t)−β and 0 ≤ LH(Θt)−L(ϑ) ≤ c(1 + t)−1.

Page 43: Convergence analysis for gradient descent optimization ...

Let d,H,P ∈ N, a ∈ R, b > a, u ∈ C([a,b]d ,R) satisfy P = dH + 2H + 1, letµ : B([a,b]d )→ [0,∞) be measure, let LH : RP → R satisfy ∀ θ ∈ RP :

LH(θ)=∫

[a,b]d

[u(x)−θP−

∑Hi=1θH(d+1)+i maxθHd+i +

∑dj=1 θ(i−1)d+jxj , 0

]2µ(dx),

let GH : RP → RP be an appropriately generalized gradient of LH , and letΘ ∈ C([0,∞),RP) satisfy for all t ∈ [0,∞) that Θt = Θ0 −

∫ t0 GH(Θs) ds.

Theorem (Cheridito, J, Riekert, Rossmannek 2021; J, Riekert 2021)

Assume for all x, y ∈ [a,b]d that u(x) = u(y). Then there exist no non-global localminima and no saddle points of LH and limt→∞ LH(Θt) = 0.

Theorem (Cheridito, J, Rossmannek 2021; J, Riekert 2021)

Assume µ = λ[a,b], let α, β ∈ R satisfy u(x) = αx + β, and assume

lim inft→∞‖Θt‖ <∞ and LH(Θ0) < α2(b−a)3

12(2bH/2c+1)4 . Then there exist∞-many

non-global local minima and∞-many saddle points of LH and limt→∞ LH(Θt) = 0.

Theorem (Eberle, J, Riekert, Weiss 2021)

Assume µ λ[a,b]d with a piecewise polynomial density, assume that u is piecewisepolynomial, and assume lim inft→∞‖Θt‖ <∞. Then there exist ϑ ∈ (GH)−1(0),c, β > 0 such that ‖Θt −ϑ‖ ≤ c(1 + t)−β and 0 ≤ LH(Θt)−L(ϑ) ≤ c(1 + t)−1.

Page 44: Convergence analysis for gradient descent optimization ...

Let d,H,P ∈ N, a ∈ R, b > a, u ∈ C([a,b]d ,R) satisfy P = dH + 2H + 1, letµ : B([a,b]d )→ [0,∞) be measure, let LH : RP → R satisfy ∀ θ ∈ RP :

LH(θ)=∫

[a,b]d

[u(x)−θP−

∑Hi=1θH(d+1)+i maxθHd+i +

∑dj=1 θ(i−1)d+jxj , 0

]2µ(dx),

let GH : RP → RP be an appropriately generalized gradient of LH , and letΘ ∈ C([0,∞),RP) satisfy for all t ∈ [0,∞) that Θt = Θ0 −

∫ t0 GH(Θs) ds.

Theorem (Cheridito, J, Riekert, Rossmannek 2021; J, Riekert 2021)

Assume for all x, y ∈ [a,b]d that u(x) = u(y). Then there exist no non-global localminima and no saddle points of LH and limt→∞ LH(Θt) = 0.

Theorem (Cheridito, J, Rossmannek 2021; J, Riekert 2021)

Assume µ = λ[a,b], let α, β ∈ R satisfy u(x) = αx + β, and assume

lim inft→∞‖Θt‖ <∞ and LH(Θ0) < α2(b−a)3

12(2bH/2c+1)4 . Then there exist∞-many

non-global local minima and∞-many saddle points of LH and limt→∞ LH(Θt) = 0.

Theorem (Eberle, J, Riekert, Weiss 2021)

Assume µ λ[a,b]d with a piecewise polynomial density, assume that u is piecewisepolynomial, and assume lim inft→∞‖Θt‖ <∞. Then there exist ϑ ∈ (GH)−1(0),c, β > 0 such that ‖Θt −ϑ‖ ≤ c(1 + t)−β and 0 ≤ LH(Θt)−L(ϑ) ≤ c(1 + t)−1.

Page 45: Convergence analysis for gradient descent optimization ...

Let d,H,P ∈ N, a ∈ R, b > a, u ∈ C([a,b]d ,R) satisfy P = dH + 2H + 1, letµ : B([a,b]d )→ [0,∞) be measure, let LH : RP → R satisfy ∀ θ ∈ RP :

LH(θ)=∫

[a,b]d

[u(x)−θP−

∑Hi=1θH(d+1)+i maxθHd+i +

∑dj=1 θ(i−1)d+jxj , 0

]2µ(dx),

let GH : RP → RP be an appropriately generalized gradient of LH , and letΘ ∈ C([0,∞),RP) satisfy for all t ∈ [0,∞) that Θt = Θ0 −

∫ t0 GH(Θs) ds.

Theorem (Cheridito, J, Riekert, Rossmannek 2021; J, Riekert 2021)

Assume for all x, y ∈ [a,b]d that u(x) = u(y). Then there exist no non-global localminima and no saddle points of LH and limt→∞ LH(Θt) = 0.

Theorem (Cheridito, J, Rossmannek 2021; J, Riekert 2021)

Assume µ = λ[a,b], let α, β ∈ R satisfy u(x) = αx + β, and assume

lim inft→∞‖Θt‖ <∞ and LH(Θ0) < α2(b−a)3

12(2bH/2c+1)4 . Then there exist∞-many

non-global local minima and∞-many saddle points of LH and limt→∞ LH(Θt) = 0.

Theorem (Eberle, J, Riekert, Weiss 2021)

Assume µ λ[a,b]d with a piecewise polynomial density, assume that u is piecewisepolynomial, and assume lim inft→∞‖Θt‖ <∞. Then there exist ϑ ∈ (GH)−1(0),c, β > 0 such that ‖Θt −ϑ‖ ≤ c(1 + t)−β and 0 ≤ LH(Θt)−L(ϑ) ≤ c(1 + t)−1.

Page 46: Convergence analysis for gradient descent optimization ...

Let d,H,P ∈ N, a ∈ R, b > a, u ∈ C([a,b]d ,R) satisfy P = dH + 2H + 1, letµ : B([a,b]d )→ [0,∞) be measure, let LH : RP → R satisfy ∀ θ ∈ RP :

LH(θ)=∫

[a,b]d

[u(x)−θP−

∑Hi=1θH(d+1)+i maxθHd+i +

∑dj=1 θ(i−1)d+jxj , 0

]2µ(dx),

let GH : RP → RP be an appropriately generalized gradient of LH , and letΘ ∈ C([0,∞),RP) satisfy for all t ∈ [0,∞) that Θt = Θ0 −

∫ t0 GH(Θs) ds.

Theorem (Cheridito, J, Riekert, Rossmannek 2021; J, Riekert 2021)

Assume for all x, y ∈ [a,b]d that u(x) = u(y). Then there exist no non-global localminima and no saddle points of LH and limt→∞ LH(Θt) = 0.

Theorem (Cheridito, J, Rossmannek 2021; J, Riekert 2021)

Assume µ = λ[a,b], let α, β ∈ R satisfy u(x) = αx + β, and assume

lim inft→∞‖Θt‖ <∞ and LH(Θ0) < α2(b−a)3

12(2bH/2c+1)4 . Then there exist∞-many

non-global local minima and∞-many saddle points of LH and limt→∞ LH(Θt) = 0.

Theorem (Eberle, J, Riekert, Weiss 2021)

Assume µ λ[a,b]d with a piecewise polynomial density, assume that u is piecewisepolynomial, and assume lim inft→∞‖Θt‖ <∞. Then there exist ϑ ∈ (GH)−1(0),c, β > 0 such that ‖Θt −ϑ‖ ≤ c(1 + t)−β and 0 ≤ LH(Θt)−L(ϑ) ≤ c(1 + t)−1.

Page 64: Convergence analysis for gradient descent optimization ...

Theorem (J, Riekert 2021)

Assume µ ≪ λ_{[a,b]^d} and lim inf_{t→∞} ‖Θ_t‖ < ∞. Then there exists ϑ ∈ (G_H)^{−1}(0) such that lim_{t→∞} L_H(Θ_t) = L_H(ϑ).

Theorem (J, Riekert 2021)

Assume µ ∼ λ_{[a,b]} with a Lipschitz continuous density, assume that u is piecewise affine linear, let (Ω, F, P) be a probability space, for every H, k ∈ N, γ ∈ R let Θ^{H,k,γ}_n: Ω → R^{3H+1}, n ∈ N_0, and k^{H,k,γ}_n: Ω → N, n ∈ N_0, be random variables satisfying for all n ∈ N_0 that

Θ^{H,k,γ}_{n+1} = Θ^{H,k,γ}_n − γ G_H(Θ^{H,k,γ}_n)   and   k^{H,k,γ}_n ∈ argmin_{ℓ ∈ {1,...,k}} L_H(Θ^{H,ℓ,γ}_n),

and assume for all H ∈ N, γ ∈ R that Θ^{H,k,γ}_0, k ∈ N, are i.i.d. standard normal. Then

∃ g > 0: ∀ γ ∈ (0, g]: lim inf_{H→∞} lim inf_{K→∞} P( lim sup_{n→∞} L_H( Θ^{H, k^{H,K,γ}_n, γ}_n ) = 0 ) = 1.

Proof based on Fehrman, Gess, J 2020 JMLR.
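
The random-initialization scheme in the second theorem can be mimicked numerically as follows. This sketch assumes the helpers risk_MC and grad_MC from the previous code block, takes d = 1 (so P = 3H + 1 as in the theorem), an illustrative piecewise affine linear target, uniform sampling in place of µ, and specific values for γ, K, and the number of steps. It illustrates the objects Θ^{H,k,γ}_n and the selector k^{H,K,γ}_n for a single random draw; it does not reproduce the probabilistic statement itself.

import numpy as np

# Assumes risk_MC and grad_MC from the previous sketch; d = 1, so P = 3H + 1.
d, H, a, b = 1, 16, 0.0, 1.0
P = 3 * H + 1
gamma, K, steps = 0.01, 8, 3000          # step size gamma and K random initializations
rng = np.random.default_rng(1)
u = lambda xs: np.abs(xs[:, 0] - 0.5)    # illustrative piecewise affine linear target
xs = rng.uniform(a, b, size=(2000, d))

# Theta^{H,k,gamma}_0, k = 1,...,K: i.i.d. standard normal initializations.
thetas = [rng.standard_normal(P) for _ in range(K)]
for n in range(steps):
    thetas = [th - gamma * grad_MC(th, u, xs, d, H) for th in thetas]

risks = [risk_MC(th, u, xs, d, H) for th in thetas]
k_n = int(np.argmin(risks))              # k^{H,K,gamma}_n: index of the currently best run
print("best run:", k_n, "with approximate risk", risks[k_n])
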

Page 84: Convergence analysis for gradient descent optimization ...

Appendix

Page 85: Convergence analysis for gradient descent optimization ...

Let d, H, P ∈ N, a ∈ R, b ∈ (a,∞), f ∈ C([a,b]^d, R) satisfy P = dH + 2H + 1, let R_r ∈ C(R,R), r ∈ N ∪ {∞}, satisfy for all x ∈ R that {R_r : r ∈ N} ⊆ C¹(R,R), R_∞(x) = max{x, 0}, sup_{r∈N} sup_{y∈[−|x|,|x|]} |(R_r)′(y)| < ∞, and

lim sup_{r→∞} ( |R_r(x) − R_∞(x)| + |(R_r)′(x) − 1_{(0,∞)}(x)| ) = 0,

let µ: B([a,b]^d) → [0,∞] be a measure, let L_r: R^P → R, r ∈ N ∪ {∞}, satisfy for all r ∈ N ∪ {∞}, θ = (θ_1, ..., θ_P) ∈ R^P that

L_r(θ) = ∫_{[a,b]^d} ( f(x_1, ..., x_d) − θ_P − Σ_{i=1}^H θ_{H(d+1)+i} [ R_r( θ_{Hd+i} + Σ_{j=1}^d θ_{(i−1)d+j} x_j ) ] )² µ(d(x_1, ..., x_d)),

let G: R^P → R^P satisfy for all θ ∈ {ϑ ∈ R^P : ((∇L_r)(ϑ))_{r∈N} is convergent} that G(θ) = lim_{r→∞} (∇L_r)(θ), and let Θ ∈ C([0,∞), R^P) satisfy

∀ t ∈ [0,∞): Θ_t = Θ_0 − ∫_0^t G(Θ_s) ds.

Lemma

There exists an open U ⊆ R^P such that ∫_{R^P\U} 1 dx = 0, (L_∞)|_U ∈ C¹(U, R), and ∇((L_∞)|_U) = G|_U.
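
One concrete family R_r satisfying the conditions above is a clipped-quadratic smoothing of the ReLU; this particular choice is an assumption of the sketch below and not necessarily the one used in the cited works. The snippet checks numerically that R_r converges to max{x, 0} and that (R_r)′ converges to 1_(0,∞) pointwise, which is the mechanism behind the definition of the limiting gradient G.

import numpy as np

def R(r, x):
    """A C^1 smoothing of ReLU satisfying the stated conditions (one possible choice):
    0 on (-inf, 0], r*x^2/2 on [0, 1/r], x - 1/(2r) on [1/r, inf)."""
    x = np.asarray(x, dtype=float)
    return np.where(x <= 0, 0.0,
           np.where(x <= 1.0 / r, r * x ** 2 / 2.0, x - 1.0 / (2.0 * r)))

def dR(r, x):
    """Derivative of R(r, .): 0, then r*x, then 1; converges pointwise to 1_(0,inf)."""
    x = np.asarray(x, dtype=float)
    return np.where(x <= 0, 0.0, np.where(x <= 1.0 / r, r * x, 1.0))

relu = lambda x: np.maximum(np.asarray(x, dtype=float), 0.0)
xs = np.array([-1.0, -0.1, 0.0, 0.05, 0.5, 2.0])
for r in [1, 10, 100, 1000]:
    err = np.max(np.abs(R(r, xs) - relu(xs)) + np.abs(dR(r, xs) - (xs > 0)))
    print(f"r = {r:5d}: max pointwise error of (R_r, R_r') on the sample = {err:.4f}")
# The error at each fixed x tends to 0 as r -> infinity, matching the assumptions on R_r;
# gradients of the smoothed risks L_r then converge (where they converge) to G.
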

Page 101: Convergence analysis for gradient descent optimization ...

Theorem (Cheridito J Riekert Rossmannek 2021, J Riekert 2021)

Assume for all x, y ∈ [a,b]^d that f(x) = f(y). Then

1. there exist no non-global local minima of L_∞,

2. there exist no saddle points of L_∞,

3. it holds for all θ ∈ G^{−1}(0) that θ is a global minimum of L_∞, and

4. it holds that lim sup_{t→∞} L_∞(Θ_t) = 0.

Theorem (Cheridito J Rossmannek 2021, J Riekert 2021)

Assume µ = λ_{[a,b]}, let α, β ∈ R satisfy for all x ∈ [a,b] that f(x) = αx + β, and assume sup_{t∈[0,∞)} (H − 1)‖Θ_t‖ < ∞ and L_∞(Θ_0) < α²(b−a)³ / [12(2⌊H/2⌋+1)⁴]. Then

1. there exist infinitely many non-global local minima of L_∞,

2. there exist infinitely many saddle points of L_∞, and

3. it holds that lim sup_{t→∞} L_∞(Θ_t) = 0.

Theorem (J Riekert 2021)

Assume sup_{t∈[0,∞)} ‖Θ_t‖ < ∞ and µ ≪ λ_{[a,b]^d}. Then there exists ϑ ∈ G^{−1}(0) such that lim sup_{t→∞} L_∞(Θ_t) = L_∞(ϑ).
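
For orientation, the smallness condition on the initial risk in the second theorem is fully explicit. The short snippet below evaluates the bound α²(b−a)³ / [12(2⌊H/2⌋+1)⁴] for a few widths H; the specific values of α, a, b are illustrative and not taken from the slides.

import math

def risk_threshold(alpha: float, a: float, b: float, H: int) -> float:
    """The bound alpha^2 (b - a)^3 / (12 (2*floor(H/2) + 1)^4) from the second theorem:
    trajectories starting with risk below this value (and suitably bounded) still reach
    risk 0, even though infinitely many non-global local minima exist."""
    return alpha ** 2 * (b - a) ** 3 / (12.0 * (2 * math.floor(H / 2) + 1) ** 4)

# Illustrative values: target f(x) = 2x + 1 on [0, 1], i.e. alpha = 2, a = 0, b = 1.
for H in [1, 2, 4, 8, 16]:
    print(f"H = {H:2d}: threshold = {risk_threshold(2.0, 0.0, 1.0, H):.3e}")
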

Page 113: Convergence analysis for gradient descent optimization ...

Theorem (Cheridito J Riekert Rossmannek 2021, J Riekert 2021)

Assume for all x, y ∈ [a,b]d that f(x) = f(y). Then

1 there exist no non-global local minima of L∞,

2 there exist no saddle points of L∞,

3 it holds for all θ ∈ G−1(0) that θ is a global minima of L∞, and

4 it holds that lim supt→∞ L∞(Θt) = 0.

Theorem (Cheridito J Rossmannek 2021, J Riekert 2021)

Assume µ = λ[a,b], let α, β ∈ R, satisfy for all x ∈ [a,b] that f(x) = αx + β, and

assume supt∈[0,∞)(H − 1)‖Θt‖ <∞, and L∞(Θ0) < α2(b−a)3

12(2bH/2c+1)4 . Then

1 there exist infinitely many non-global local minima of L∞,

2 there exist infinitely many saddle points of L∞, and

3 it holds that lim supt→∞ L∞(Θt) = 0.

Theorem (J Riekert 2021)

Assume supt∈[0,∞)‖Θt‖ <∞ and µ λ[a,b]d . Then there exists ϑ ∈ G−1(0)such that lim supt→∞ L∞(Θt) = L∞(ϑ).

Page 114: Convergence analysis for gradient descent optimization ...

Theorem (Cheridito J Riekert Rossmannek 2021, J Riekert 2021)

Assume for all x, y ∈ [a,b]d that f(x) = f(y). Then

1 there exist no non-global local minima of L∞,

2 there exist no saddle points of L∞,

3 it holds for all θ ∈ G−1(0) that θ is a global minima of L∞, and

4 it holds that lim supt→∞ L∞(Θt) = 0.

Theorem (Cheridito J Rossmannek 2021, J Riekert 2021)

Assume µ = λ[a,b], let α, β ∈ R, satisfy for all x ∈ [a,b] that f(x) = αx + β, and

assume supt∈[0,∞)(H − 1)‖Θt‖ <∞, and L∞(Θ0) < α2(b−a)3

12(2bH/2c+1)4 . Then

1 there exist infinitely many non-global local minima of L∞,

2 there exist infinitely many saddle points of L∞, and

3 it holds that lim supt→∞ L∞(Θt) = 0.

Theorem (J Riekert 2021)

Assume supt∈[0,∞)‖Θt‖ <∞ and µ λ[a,b]d . Then there exists ϑ ∈ G−1(0)such that lim supt→∞ L∞(Θt) = L∞(ϑ).

Page 115: Convergence analysis for gradient descent optimization ...

Theorem (Cheridito J Riekert Rossmannek 2021, J Riekert 2021)

Assume for all x, y ∈ [a,b]d that f(x) = f(y). Then

1 there exist no non-global local minima of L∞,

2 there exist no saddle points of L∞,

3 it holds for all θ ∈ G−1(0) that θ is a global minima of L∞, and

4 it holds that lim supt→∞ L∞(Θt) = 0.

Theorem (Cheridito J Rossmannek 2021, J Riekert 2021)

Assume µ = λ[a,b], let α, β ∈ R, satisfy for all x ∈ [a,b] that f(x) = αx + β, and

assume supt∈[0,∞)(H − 1)‖Θt‖ <∞, and L∞(Θ0) < α2(b−a)3

12(2bH/2c+1)4 . Then

1 there exist infinitely many non-global local minima of L∞,

2 there exist infinitely many saddle points of L∞, and

3 it holds that lim supt→∞ L∞(Θt) = 0.

Theorem (J Riekert 2021)

Assume supt∈[0,∞)‖Θt‖ <∞ and µ λ[a,b]d . Then there exists ϑ ∈ G−1(0)such that lim supt→∞ L∞(Θt) = L∞(ϑ).

Page 116: Convergence analysis for gradient descent optimization ...

Hamilton-Jacobi-Bellman equations

Consider

    ∂u/∂t = ∆_x u − ‖∇_x u‖²

with u(0, x) = √‖x‖ for t ∈ [0, 1], x ∈ R^d.

d      Mean      Std. dev.   Ref. value  rel. L1-error  Std. dev. rel. error  avg. runtime
10     2.07017   0.00634850  2.04629     0.01167        0.00310245            58.200
50     3.15098   0.00275839  3.13788     0.00417        0.00087906            58.359
100    3.75329   0.00136920  3.74471     0.00229        0.00036564            58.329
200    4.46734   0.00079688  4.46172     0.00126        0.00017860            58.159
300    4.94586   0.00087736  4.94105     0.00097        0.00017756            58.819
500    5.62126   0.00045092  5.61735     0.00070        0.00008027            57.670
1000   6.68594   0.00040334  6.68335     0.00039        0.00006035            66.546
5000   9.97266   0.00047098  9.99835     0.00257        0.00004711            393.894
10000  11.87860  0.00022705  11.89099    0.00104        0.00001909            1687.680

Approximations for u(1, 0); Time steps: 24; SGD steps: 500 (d < 10^4), 600 (d = 10^4); Learning rates: 1/10, 1/100, 1/1000. Beck, Becker, Cheridito, J, Neufeld 2019 SISC (to appear)
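For this particular equation, reference values of u(1, 0) can be cross-checked independently of the deep-learning approximation: the Cole-Hopf transform v = exp(−u) turns the PDE into the heat equation ∂v/∂t = ∆_x v, so u(t, x) = −ln E[exp(−√‖x + √2 W_t‖)] for a standard Brownian motion W. The sketch below is a plain Monte Carlo estimate of this expression (an illustration, not necessarily the procedure behind the table; sample size and seed are arbitrary choices).

    import numpy as np

    def hjb_reference(d, t=1.0, samples=10**6, seed=0):
        # Monte Carlo estimate of u(t, 0) = -ln E[ exp( -sqrt(||sqrt(2) W_t||) ) ],
        # obtained via the Cole-Hopf transform of the HJB equation above.
        rng = np.random.default_rng(seed)
        w = rng.normal(size=(samples, d)) * np.sqrt(t)            # W_t ~ N(0, t I_d)
        vals = np.exp(-np.sqrt(np.linalg.norm(np.sqrt(2.0) * w, axis=1)))
        return -np.log(np.mean(vals))

    print(hjb_reference(d=10))   # close to the table's reference value 2.04629 for d = 10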

Page 133: Convergence analysis for gradient descent optimization ...

Full history recursive Multilevel-Picard method: Let T > 0, L, p ≥ 0, Θ = ∪_{n=1}^∞ Z^n, ∀ d ∈ N let g_d ∈ C(R^d, R) satisfy ∀ x ∈ R^d: |g_d(x)| ≤ L(1 + ‖x‖^p), let f : R → R be Lipschitz, let (Ω, F, P) be a probability space, let W^{d,θ} : [0, T] × Ω → R^d, d ∈ N, θ ∈ Θ, be i.i.d. Brownian motions, let S^θ : [0, T] × Ω → R, θ ∈ Θ, be i.i.d. continuous processes satisfying ∀ t ∈ [0, T], θ ∈ Θ that S^θ_t is U_{[t,T]}-distributed, assume that (S^θ)_{θ∈Θ} and (W^{d,θ})_{θ∈Θ, d∈N} are independent, let U^{d,θ}_{n,M} : [0, T] × R^d × Ω → R, d, n, M ∈ Z, θ ∈ Θ, satisfy ∀ d, M ∈ N, n ∈ N_0, θ ∈ Θ, t ∈ [0, T], x ∈ R^d:

    U^{d,θ}_{n,M}(t, x) = (1_N(n) / M^n) Σ_{m=1}^{M^n} g_d( x + W^{d,(θ,0,−m)}_{T−t} )

        + Σ_{l=0}^{n−1} ((T−t) / M^{n−l}) Σ_{m=1}^{M^{n−l}} [ f( U^{d,(θ,l,m)}_{l,M}( S^{(θ,l,m)}_t, x + W^{d,(θ,l,m)}_{S^{(θ,l,m)}_t − t} ) )
                                                              − 1_N(l) f( U^{d,(θ,l,m)}_{l−1,M}( S^{(θ,l,m)}_t, x + W^{d,(θ,l,m)}_{S^{(θ,l,m)}_t − t} ) ) ]

and ∀ d, n ∈ N let Cost_{d,n} ∈ N be the computational cost of U^{d,0}_{n,n}(0, 0).
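Read as an algorithm, the recursion above can be implemented directly. The Python sketch below mirrors the displayed formula for the terminal value problem stated on Page 151 (∂u_d/∂t + ½∆_x u_d + f(u_d) = 0 with u_d(T, ·) = g_d); unlike the formal definition it draws fresh independent randomness at every recursive call rather than reusing the (θ, l, m)-indexed families, and all names are illustrative.

    import numpy as np

    def mlp(n, M, t, x, T, f, g, rng):
        # Monte Carlo MLP estimate U_{n,M}(t, x) of the displayed recursion.
        d = x.shape[0]
        if n == 0:
            return 0.0                                   # 1_N(0) = 0 and the l-sum is empty
        # terminal-condition term: M^{-n} * sum_m g(x + W_{T-t})
        u = 0.0
        for _ in range(M ** n):
            u += g(x + rng.normal(size=d) * np.sqrt(T - t))
        u /= M ** n
        # multilevel correction terms over levels l = 0, ..., n-1
        for l in range(n):
            for _ in range(M ** (n - l)):
                s = rng.uniform(t, T)                    # S_t ~ U_[t,T]
                y = x + rng.normal(size=d) * np.sqrt(s - t)
                v = f(mlp(l, M, s, y, T, f, g, rng))
                if l > 0:                                # the 1_N(l) f(U_{l-1,M}(...)) term
                    v -= f(mlp(l - 1, M, s, y, T, f, g, rng))
                u += (T - t) / (M ** (n - l)) * v
        return u

With n = M, the call mlp(n, n, 0.0, np.zeros(d), T, f, g, rng) plays the role of U^{d,0}_{n,n}(0, 0), whose computational cost is the Cost_{d,n} referenced above.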

Page 151: Convergence analysis for gradient descent optimization ...

Then

(i) ∀ d ∈ N: there exists an at most polynomially growing solution u_d : [0, T] × R^d → R of

    ∂u_d/∂t + ½ ∆_x u_d + f(u_d) = 0   with   u_d(T, ·) = g_d

and

(ii) ∀ δ > 0: there exist n : N × (0,∞) → N, (d, ε) ↦ n_{d,ε}, and C > 0 such that ∀ d ∈ N, ε > 0:

    ( E[ |u_d(0, 0) − U^{d,0}_{n_{d,ε},n_{d,ε}}(0, 0)|² ] )^{1/2} ≤ ε   and   Cost_{d,n_{d,ε}} ≤ C d^{1+p(1+δ)} ε^{−(2+δ)}.

Extensions: Algorithms/Simulations/Proofs: Fully nonlinear PDEs (Beck, E, J 2018 JNS), Optimal stopping (Becker, Cheridito, J 2018 JMLR), Uniform errors (Beck, Becker, Grohs, Jaafari, J 2018), Semilinear PDEs/CVA (Hutzenthaler, J, von Wurstemberger 2019 EJP; Hutzenthaler, J, Kruse, Nguyen 2020), Non-Lipschitz nonlinearities (Beck, Hornung, Hutzenthaler, Jentzen, Kruse 2019 J. Numer. Math.), Gradient-dependent nonlinearities (Hutzenthaler, J, Kruse 2019), Elliptic PDEs (Beck, Gonon, J 2020), Discrete problems (Beck, J, Kruse 2020), . . .
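As a quick usage illustration of the estimator U^{d,0}_{n,n}(0, 0) from item (ii), the mlp sketch given after Page 133 can be called with small, purely hypothetical choices of d, T, f, and g (any Lipschitz f and at most polynomially growing g fit the stated assumptions); n = M = 3 keeps the recursion cheap.

    # Reuses numpy and the mlp sketch from after Page 133; all concrete choices are illustrative.
    rng = np.random.default_rng(0)
    d, T = 10, 1.0
    f = np.sin                                    # a Lipschitz nonlinearity
    g = lambda y: float(np.log(1.0 + y @ y))      # an at most polynomially growing terminal condition
    estimate = mlp(n=3, M=3, t=0.0, x=np.zeros(d), T=T, f=f, g=g, rng=rng)
    print(estimate)                               # Monte Carlo estimate of u_d(0, 0)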
